Slack-like Chat System
Role: Software Engineer
Problem Statement
Design a scalable enterprise chat system like Slack that supports:
- Messaging: Send messages to group channels (1:N) and direct messages (1:1)
- Channel Management: Create group chats, add/remove users
- Notifications: Real-time delivery for online users, push notifications for offline users
- Rich Media: Support images, files, and other attachments in messages
- Message Deletion: Allow users to delete their messages
- Multi-tenancy: Different companies share infrastructure with complete data isolation
Scale considerations:
- Millions of concurrent users
- Thousands of workspaces (companies)
- Channels with 10,000+ members
- Sub-second message delivery
Comprehensive Solution Resources
For detailed architectural solutions and implementation approaches, these two articles provide excellent technical depth:
- Slack Architecture Deep Dive - Comprehensive breakdown of Slack's architecture, including WebSocket management, message routing, and scalability patterns
- How to Design Slack - Step-by-step guide covering core components, data models, and system design trade-offs
These articles thoroughly explain:
- System architecture and service design
- WebSocket management and real-time delivery
- Database schema and sharding strategies
- Message ordering and ID generation
- Multi-tenancy implementation
- Fault tolerance mechanisms
- Notification systems
- Rich media handling
The content below focuses on real interview experiences from candidates, showing what actually gets discussed and emphasized during the interview.
Real Interview Experiences
Experience 1: Large Channel Fan-Out and Notifications
"Focused on large channel fan-out challenge; how notifications (notification on unread messages, notification on message push when users are offline) work; how db can scale up."
What the interviewer emphasized:
- Large channel fan-out problem: How do you efficiently deliver a message to 10,000+ members in a channel?
- Notification system design: Different strategies for online vs. offline users
- Database scalability: Sharding strategies, query patterns, read vs. write optimization
Key insight: This is the most critical technical challenge in chat systems. Interviewers expect you to discuss dispatcher patterns, pub/sub mechanisms (Redis Pub/Sub, Kafka), and the trade-offs between broadcasting to all servers vs. targeted delivery. Be prepared to explain how you maintain a mapping of channels → gateway servers and how to handle memory/network constraints.
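The channel-to-gateway mapping mentioned above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the `Dispatcher` class, its method names, and the `gw-N` identifiers are all hypothetical, and a real system would keep this state in a shared store rather than one process.

```python
from collections import defaultdict


class Dispatcher:
    """Tracks which gateway servers hold subscribers for each channel."""

    def __init__(self):
        # channel_id -> gateway_id -> set of user_ids connected to that gateway
        self._subs = defaultdict(lambda: defaultdict(set))

    def subscribe(self, channel_id, gateway_id, user_id):
        self._subs[channel_id][gateway_id].add(user_id)

    def unsubscribe(self, channel_id, gateway_id, user_id):
        gateways = self._subs.get(channel_id)
        if gateways is None or gateway_id not in gateways:
            return
        gateways[gateway_id].discard(user_id)
        if not gateways[gateway_id]:
            del gateways[gateway_id]      # no local subscribers left on this gateway
        if not gateways:
            del self._subs[channel_id]    # channel has no connected members at all

    def route(self, channel_id):
        """Gateways that should each receive one copy of a channel message."""
        return sorted(self._subs.get(channel_id, {}))


# 10,000 members spread over 4 gateways: the message crosses the network
# 4 times, and each gateway fans out locally over its own sockets.
dispatcher = Dispatcher()
for i in range(10_000):
    dispatcher.subscribe("C-general", f"gw-{i % 4}", f"user-{i}")
targets = dispatcher.route("C-general")
```

The trade-off to be ready to discuss: targeted delivery saves bandwidth compared with broadcasting to every gateway, but the mapping itself costs memory and must be kept consistent as users connect and disconnect.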
Experience 2: Service Architecture and Message Ordering
"Designed a Channel Service (separately managing user and channel information) and Message Service (managing messages). The interviewer asked detailed questions about message ID generation, loose ordering of message delivery, and what happens if messages appear at exactly the same moment. Also asked about Channel Service database table design - I discussed designing one-to-many relationships storing user_to_channels and channel_to_users with read-prioritized structure. I mentioned using webhooks to deliver to Slack clients for the delivery part but didn't have time to finish explaining. Overall, I felt completely underprepared for this question."
What the interviewer emphasized:
- Service separation: Why separate Channel Service (user/channel metadata) from Message Service (message persistence)?
- Message ID generation: Requirements (globally unique, time-sortable, distributed)
- Message ordering: What happens when two messages arrive at exactly the same millisecond? How do you handle loose ordering?
- Database schema design: Denormalized tables for read optimization
  - user_to_channels table: Fast lookup for "what channels is user X in?"
  - channel_to_users table: Fast lookup for "who is in channel Y?"
- Message delivery mechanism: WebSocket delivery, webhook patterns
Key insight: The candidate felt unprepared because the interviewer dove deep into technical details. Be ready to discuss:
- Timestamp-based ID generation (timestamp + sequence + server_id)
- Why you need single-leader replication for ordering guarantees
- Client-side tie-breaking using lexicographic message ID ordering
- Trade-offs of eventual consistency vs. strong consistency
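To make the `timestamp + sequence + server_id` layout concrete, here is a rough sketch of a generator that packs the three fields into one integer. The bit widths (12 sequence bits, 10 server bits), the class name, and the `clock_ms` parameter are assumptions chosen for illustration, loosely in the spirit of Snowflake-style IDs. The source does not specify a layout.

```python
import threading
import time

SEQ_BITS = 12       # assumed: up to 4096 IDs per millisecond per generator
SERVER_BITS = 10    # assumed: up to 1024 generator instances
MAX_SEQ = (1 << SEQ_BITS) - 1


class MessageIdGenerator:
    def __init__(self, server_id, clock_ms=lambda: int(time.time() * 1000)):
        if not 0 <= server_id < (1 << SERVER_BITS):
            raise ValueError("server_id out of range")
        self._server_id = server_id
        self._clock_ms = clock_ms
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = max(self._clock_ms(), self._last_ms)  # never move backwards
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    now += 1    # sequence exhausted: borrow the next millisecond
            else:
                self._seq = 0
            self._last_ms = now
            return ((now << (SEQ_BITS + SERVER_BITS))
                    | (self._seq << SERVER_BITS)
                    | self._server_id)


def unpack(message_id):
    """Split an ID back into (timestamp_ms, sequence, server_id)."""
    server_id = message_id & ((1 << SERVER_BITS) - 1)
    seq = (message_id >> SERVER_BITS) & MAX_SEQ
    return message_id >> (SEQ_BITS + SERVER_BITS), seq, server_id


# Two messages in the same millisecond still get distinct, ordered IDs.
gen = MessageIdGenerator(server_id=7, clock_ms=lambda: 1_700_000_000_000)
first, second = gen.next_id(), gen.next_id()
```

Because the timestamp occupies the high bits, sorting by ID sorts by time. Messages created in the same millisecond are ordered by sequence, then by server ID, which gives every client the same deterministic tie-break.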
Experience 3: Multi-Tenant Data Isolation
"The system design question was about an enterprise chat system. I prepared using engineering blogs and reference articles. The interviewer asked how to separate data for different companies' users. The answer should involve multi-tenancy with authentication, encryption, sharding, etc. During the deep dive, I waited for the interviewer to ask about fault tolerance and scale issues instead of proactively leading the discussion myself. Overall, this round went reasonably well."
What the interviewer emphasized:
- Multi-tenancy architecture: How do you separate data for different companies/workspaces?
- Critical security requirements:
  - Authentication: Workspace-scoped tokens, verify user belongs to workspace
  - Encryption: At-rest (tenant-specific keys) and in-transit (TLS)
  - Sharding: Logical partitioning (workspace_id in all tables) vs. physical partitioning (separate databases)
- Proactive discussion: The candidate noted they should have led the conversation on fault tolerance and scalability instead of waiting for the interviewer to ask
Key insight: Multi-tenancy is what differentiates enterprise chat from consumer messaging apps. You MUST discuss:
- Adding workspace_id or team_id to every database table
- Database row-level security policies
- Logical vs. physical isolation trade-offs
- Resource isolation (rate limiting per workspace, preventing one tenant from monopolizing resources)
- API gateway validating workspace context before any operation
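Two of the points above, validating workspace context at the gateway and rate limiting per workspace, can be sketched together. This is an illustrative sketch under assumptions: the `token_claims` dict shape, the `authorize` function, and the exception names are hypothetical, and the limiter is a plain in-process token bucket rather than a shared one.

```python
import time


class Forbidden(Exception):
    pass


class RateLimited(Exception):
    pass


class WorkspaceRateLimiter:
    """Token bucket per workspace so one tenant cannot starve the others."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets = {}  # workspace_id -> (tokens, last_refill_time)

    def allow(self, workspace_id):
        now = self._clock()
        tokens, last = self._buckets.get(workspace_id, (self._burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens < 1:
            self._buckets[workspace_id] = (tokens, now)
            return False
        self._buckets[workspace_id] = (tokens - 1, now)
        return True


def authorize(token_claims, requested_workspace_id, limiter):
    """Gateway check that runs before any operation reaches a backend service."""
    if token_claims.get("workspace_id") != requested_workspace_id:
        raise Forbidden("token is not scoped to this workspace")
    if not limiter.allow(requested_workspace_id):
        raise RateLimited("workspace exceeded its request budget")
    return True
```

Checking the workspace on every request at the edge means a bug in one downstream query is less likely to leak another tenant's data, though it complements rather than replaces row-level filtering in the database.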
The candidate's reflection is valuable: proactively lead the discussion on architecture concerns rather than waiting to be prompted.
Experience 4: WebSocket, Redis Pub/Sub, and Fault Tolerance
"The core discussion focused on WebSocket and Redis Pub/Sub. I felt the interviewer was particularly concerned about fault tolerance. They also asked about what makes Slack different from other messaging apps, such as the data model and data relationships. My answer was to add team_id or enterprise_id in the database to implement logical or physical partitioning of data for different enterprises."
What the interviewer emphasized:
- Real-time infrastructure: WebSocket connection management, subscription lifecycle
- Message distribution: Redis Pub/Sub for broadcasting messages across gateway servers
- Fault tolerance (major focus area):
  - What happens when gateway servers crash?
  - How do clients reconnect and recover state?
  - How do you handle database failures?
  - Message queue failure scenarios (at-least-once delivery, client deduplication)
- Enterprise-specific features: How is Slack's data model different from consumer messaging apps?
  - Answer: Multi-tenancy with team_id or enterprise_id in database schema
  - Logical vs. physical data partitioning strategies
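At-least-once delivery means a client can receive the same message twice, for example after a retry or a reconnect. One possible shape for the client deduplication mentioned above is a bounded set of recently seen message IDs. The `Deduplicator` name and the capacity value are assumptions made for this sketch.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent message IDs and rejects repeats."""

    def __init__(self, capacity=10_000):
        self._seen = OrderedDict()   # message_id -> None, oldest first
        self._capacity = capacity

    def accept(self, message_id):
        """Return True the first time an ID is seen, False for a redelivery."""
        if message_id in self._seen:
            self._seen.move_to_end(message_id)
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)   # evict the oldest remembered ID
        return True


dedup = Deduplicator(capacity=3)
delivered = [m for m in ["m1", "m2", "m1", "m3", "m2"] if dedup.accept(m)]
```

The window is bounded so memory stays constant. The assumption is that redeliveries arrive soon after the original, so an ID evicted from the window is unlikely to be seen again.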
Key insight: The interviewer cared deeply about fault tolerance. Be prepared to discuss:
- Client reconnection strategies (exponential backoff, resubscribe to channels)
- Bootstrap service providing snapshots of missed state
- Database replication (factor 3+, active-active topology)
- Thundering herd problem (entire company reconnects after outage → rate limiting, connection jitter)
- Why WebSocket servers should be stateless for easy horizontal scaling
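The reconnection and thundering-herd bullets above both come down to how a client spaces out its retries. A minimal sketch of exponential backoff with full jitter follows. The function names, the default `base` and `cap` values, and the `connect` callable are all hypothetical.

```python
import random
import time


def reconnect_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def reconnect(connect, max_attempts=8, sleep=time.sleep, rng=random.random):
    """Call connect() until it succeeds, sleeping a jittered delay between tries."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            sleep(reconnect_delay(attempt, rng=rng))
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

The randomization is what spreads a whole company's clients across the retry window after an outage. Without it, every client that disconnected at the same moment would retry at the same moments too.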
Common Patterns Across All Interviews
Analyzing these real experiences reveals what interviewers consistently emphasize:
1. Database Schema and Scalability (appeared in all 4 experiences)
- Experience 1: Database scalability, sharding strategies
- Experience 2: Database table design, denormalized tables
- Experience 3: Sharding for multi-tenancy
- Experience 4: Data model and data relationships
What this means: Every candidate was asked about database design. You must be ready to:
- Design schema with workspace_id in every table
- Explain read-optimized denormalization (user_to_channels and channel_to_users)
- Discuss sharding strategies (channel_id as shard key)
- Show understanding of query patterns and indexes
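The schema points above can be illustrated with the two membership tables from Experience 2. The sketch below uses SQLite only because it runs without setup. The column names, key choices, and helper functions are assumptions, not the schema any candidate actually presented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_to_channels (
    workspace_id TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    channel_id   TEXT NOT NULL,
    PRIMARY KEY (workspace_id, user_id, channel_id)
);
CREATE TABLE channel_to_users (
    workspace_id TEXT NOT NULL,
    channel_id   TEXT NOT NULL,
    user_id      TEXT NOT NULL,
    PRIMARY KEY (workspace_id, channel_id, user_id)
);
""")


def add_member(conn, workspace_id, channel_id, user_id):
    # Both denormalized copies are written in one transaction so they agree.
    with conn:
        conn.execute("INSERT OR IGNORE INTO user_to_channels VALUES (?, ?, ?)",
                     (workspace_id, user_id, channel_id))
        conn.execute("INSERT OR IGNORE INTO channel_to_users VALUES (?, ?, ?)",
                     (workspace_id, channel_id, user_id))


def channels_for_user(conn, workspace_id, user_id):
    rows = conn.execute(
        "SELECT channel_id FROM user_to_channels "
        "WHERE workspace_id = ? AND user_id = ? ORDER BY channel_id",
        (workspace_id, user_id))
    return [r[0] for r in rows]


def users_in_channel(conn, workspace_id, channel_id):
    rows = conn.execute(
        "SELECT user_id FROM channel_to_users "
        "WHERE workspace_id = ? AND channel_id = ? ORDER BY user_id",
        (workspace_id, channel_id))
    return [r[0] for r in rows]
```

Each lookup is a prefix scan on its table's primary key, which is the point of storing the relationship twice: writes cost double, but both read paths avoid a secondary index or a join. Leading every key with `workspace_id` also keeps each tenant's rows together.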
2. Large Channel Fan-Out and Message Distribution (appeared in 2/4 experiences)
- Experience 1: Explicit focus on fan-out challenge
- Experience 4: Redis Pub/Sub for broadcasting across gateway servers
What this means: This is THE core technical challenge. Understand:
- Dispatcher patterns (channel → gateway server mapping)
- Pub/sub mechanisms (Redis, Kafka)
- Trade-offs: memory usage, network bandwidth, latency
- Server-side fan-out vs client-side approaches
3. Multi-Tenancy and Data Isolation (appeared in 2/4 experiences)
- Experience 3: Main focus on separating company data
- Experience 4: Data model differences for enterprise vs consumer apps
What this means: This differentiates enterprise from consumer chat. Discuss:
- Workspace-scoped authentication and authorization
- Logical vs physical data partitioning
- Resource isolation and rate limiting per tenant
- Security implications and compliance
4. Fault Tolerance (appeared in 2/4 experiences)
- Experience 3: Candidate should have proactively discussed
- Experience 4: Major focus area for the interviewer
What this means: Interviewers specifically probe failure scenarios. Proactively discuss:
- Client reconnection and state recovery
- Database replication and failover
- Thundering herd problem
- Graceful degradation
5. Real-Time Infrastructure (appeared in 2/4 experiences)
- Experience 2: Message delivery mechanism, webhooks
- Experience 4: WebSocket management, Redis Pub/Sub
What this means: Show deep understanding beyond "use WebSockets":
- Connection lifecycle and subscription management
- When to use WebSocket vs HTTP
- State management in gateway servers
6. Message Ordering and ID Generation (appeared in 1/4 experiences, but deep dive)
- Experience 2: Detailed questions on ID generation and simultaneous messages
What this means: When asked, expect deep technical discussion:
- Concrete ID scheme (timestamp-based)
- Handling simultaneous messages
- Ordering guarantees and consistency trade-offs
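With loose ordering, messages can reach a client out of order, so the client places each one by its ID instead of appending it on arrival. A small sketch, assuming IDs are time-sortable integers. The `Timeline` class is hypothetical.

```python
import bisect


class Timeline:
    """Keeps a channel's messages sorted by ID regardless of arrival order."""

    def __init__(self):
        self._ids = []      # sorted message IDs
        self._bodies = {}   # message_id -> text

    def receive(self, message_id, body):
        if message_id in self._bodies:
            return          # already placed; ignore the repeat
        bisect.insort(self._ids, message_id)
        self._bodies[message_id] = body

    def render(self):
        return [self._bodies[i] for i in self._ids]


timeline = Timeline()
for message_id, body in [(103, "third"), (101, "first"), (102, "second")]:
    timeline.receive(message_id, body)
ordered = timeline.render()
```

Since every client sorts by the same IDs, all clients converge on the same order even when two messages were sent in the same millisecond, which is the eventual-consistency answer to the "exactly the same moment" question.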
Critical Interview Strategy Lesson
Candidates who struggled were those who waited for the interviewer to probe.
Successful approaches involved proactively leading the discussion on:
- Scale (how does this handle millions of users?)
- Fault tolerance (what happens when X fails?)
- Multi-tenancy (how do you isolate company data?)
Don't wait to be asked—bring these up yourself.