
Slack-like Chat System

Role: Software Engineer


Problem Statement

Design a scalable enterprise chat system like Slack that supports:

  • Messaging: Send messages to group channels (1:N) and direct messages (1:1)

  • Channel Management: Create group chats, add/remove users

  • Notifications: Real-time delivery for online users, push notifications for offline users

  • Rich Media: Support images, files, and other attachments in messages

  • Message Deletion: Allow users to delete their messages

  • Multi-tenancy: Different companies share infrastructure with complete data isolation

Scale considerations:

  • Millions of concurrent users

  • Thousands of workspaces (companies)

  • Channels with 10,000+ members

  • Sub-second message delivery

Comprehensive Solution Resources

For detailed architectural solutions and implementation approaches, these two articles provide excellent technical depth:

  • Slack Architecture Deep Dive - Comprehensive breakdown of Slack's architecture, including WebSocket management, message routing, and scalability patterns

  • How to Design Slack - Step-by-step guide covering core components, data models, and system design trade-offs

These articles thoroughly explain:

  • System architecture and service design

  • WebSocket management and real-time delivery

  • Database schema and sharding strategies

  • Message ordering and ID generation

  • Multi-tenancy implementation

  • Fault tolerance mechanisms

  • Notification systems

  • Rich media handling

The content below focuses on real interview experiences from candidates who were asked this question, showing what actually gets discussed and emphasized during the interview.

Real Interview Experiences

Experience 1: Large Channel Fan-Out and Notifications

"Focused on large channel fan-out challenge; how notifications (notification on unread messages, notification on message push when users are offline) work; how db can scale up."

What the interviewer emphasized:

  • Large channel fan-out problem: How do you efficiently deliver a message to 10,000+ members in a channel?

  • Notification system design: Different strategies for online vs. offline users

  • Database scalability: Sharding strategies, query patterns, read vs. write optimization

Key insight: This is the most critical technical challenge in chat systems. Interviewers expect you to discuss dispatcher patterns, pub/sub mechanisms (Redis Pub/Sub, Kafka), and the trade-offs between broadcasting to all servers vs. targeted delivery. Be prepared to explain how you maintain a mapping of channels → gateway servers and how to handle memory/network constraints.
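The channel → gateway server mapping mentioned above can be sketched as a small in-memory registry. This is a minimal illustration, not a description of any production system: the `Dispatcher` class, its method names, and the `gw-N` identifiers are all hypothetical, and a real deployment would keep this mapping in a shared store rather than in one process.

```python
from collections import defaultdict


class Dispatcher:
    """Tracks which gateway servers hold connections for each channel (hypothetical sketch)."""

    def __init__(self):
        # channel_id -> {gateway_id -> number of locally connected subscribers}
        self._routes = defaultdict(dict)

    def subscribe(self, channel_id, gateway_id):
        counts = self._routes[channel_id]
        counts[gateway_id] = counts.get(gateway_id, 0) + 1

    def unsubscribe(self, channel_id, gateway_id):
        counts = self._routes.get(channel_id, {})
        if gateway_id not in counts:
            return
        counts[gateway_id] -= 1
        if counts[gateway_id] == 0:
            del counts[gateway_id]
        if not counts:
            # Drop empty channels so memory tracks active channels only.
            self._routes.pop(channel_id, None)

    def targets(self, channel_id):
        """Gateways that need one copy of a message posted to this channel."""
        return set(self._routes.get(channel_id, {}))


dispatcher = Dispatcher()
for member in range(10_000):
    dispatcher.subscribe("big-channel", f"gw-{member % 3}")
fanout = dispatcher.targets("big-channel")
```

With 10,000 members spread over three gateways, the dispatcher sends three copies instead of 10,000; each gateway then writes to its own local sockets. The cost is the memory for the per-channel counts, which is the trade-off against simply broadcasting every message to every gateway.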

Experience 2: Service Architecture and Message Ordering

"Designed a Channel Service (separately managing user and channel information) and Message Service (managing messages). The interviewer asked detailed questions about message ID generation, loose ordering of message delivery, and what happens if messages appear at exactly the same moment. Also asked about Channel Service database table design - I discussed designing one-to-many relationships storing user_to_channels and channel_to_users with read-prioritized structure. I mentioned using webhooks to deliver to Slack clients for the delivery part but didn't have time to finish explaining. Overall, I felt completely underprepared for this question."

What the interviewer emphasized:

  • Service separation: Why separate Channel Service (user/channel metadata) from Message Service (message persistence)?

  • Message ID generation: Requirements (globally unique, time-sortable, distributed)

  • Message ordering: What happens when two messages arrive at exactly the same millisecond? How do you handle loose ordering?

  • Database schema design: Denormalized tables for read optimization

      • user_to_channels table: Fast lookup for "what channels is user X in?"

      • channel_to_users table: Fast lookup for "who is in channel Y?"

  • Message delivery mechanism: WebSocket delivery, webhook patterns
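The two read-optimized lookups above (user_to_channels and channel_to_users) can be illustrated with a small in-memory stand-in for the two tables. This is a sketch under assumptions: the `MembershipIndex` class and its method names are hypothetical, and the `workspace_id` key prefix is added here for tenant scoping. In a real database the two writes would need a transaction or a repair job to stay consistent.

```python
from collections import defaultdict


class MembershipIndex:
    """In-memory stand-in for the two denormalized membership tables (hypothetical)."""

    def __init__(self):
        # (workspace_id, user_id) -> set of channel_ids
        self.user_to_channels = defaultdict(set)
        # (workspace_id, channel_id) -> set of user_ids
        self.channel_to_users = defaultdict(set)

    def add(self, workspace_id, channel_id, user_id):
        # Every membership change writes both sides so each read is a single key lookup.
        self.user_to_channels[(workspace_id, user_id)].add(channel_id)
        self.channel_to_users[(workspace_id, channel_id)].add(user_id)

    def remove(self, workspace_id, channel_id, user_id):
        self.user_to_channels[(workspace_id, user_id)].discard(channel_id)
        self.channel_to_users[(workspace_id, channel_id)].discard(user_id)

    def channels_of(self, workspace_id, user_id):
        return set(self.user_to_channels.get((workspace_id, user_id), ()))

    def members_of(self, workspace_id, channel_id):
        return set(self.channel_to_users.get((workspace_id, channel_id), ()))


index = MembershipIndex()
index.add("acme", "general", "alice")
index.add("acme", "general", "bob")
index.add("acme", "random", "alice")
```

The design doubles the write cost of a join or leave in exchange for answering both "what channels is user X in?" and "who is in channel Y?" without a scan, which fits a workload where membership is read far more often than it changes.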

Key insight: The candidate felt unprepared because the interviewer dove deep into technical details. Be ready to discuss:

  • Timestamp-based ID generation (timestamp + sequence + server_id)

  • Why you need single-leader replication for ordering guarantees

  • Client-side tie-breaking using lexicographic message ID ordering

  • Trade-offs of eventual consistency vs. strong consistency
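One way to make the "timestamp + sequence + server_id" idea concrete is a 64-bit ID packed from those three fields. The sketch below is illustrative only: the bit widths (10 bits for the server, 12 for the sequence), the custom epoch, the field order, and the `IdGenerator` name are all assumptions, not details from the interviews.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
SERVER_BITS = 10              # assumed: up to 1024 ID-generating servers
SEQ_BITS = 12                 # assumed: up to 4096 IDs per server per millisecond
MAX_SEQ = (1 << SEQ_BITS) - 1


class IdGenerator:
    """Packs (timestamp_ms, server_id, sequence) into one sortable integer."""

    def __init__(self, server_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= server_id < (1 << SERVER_BITS):
            raise ValueError("server_id out of range")
        self._server_id = server_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._seq += 1
                if self._seq > MAX_SEQ:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now += 1
                    self._seq = 0
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (SERVER_BITS + SEQ_BITS))
                | (self._server_id << SEQ_BITS)
                | self._seq
            )
```

Two messages created in the same millisecond on one server differ in the sequence bits; on different servers they differ in the server bits, so IDs stay unique without coordination. Ordering across servers is only as good as clock synchronization, which is why this gives loose rather than strict ordering.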

Experience 3: Multi-Tenant Data Isolation

"The system design question was about an enterprise chat system. I prepared using engineering blogs and reference articles. The interviewer asked how to separate data for different companies' users. The answer should involve multi-tenancy with authentication, encryption, sharding, etc. During the deep dive, I waited for the interviewer to ask about fault tolerance and scale issues instead of proactively leading the discussion myself. Overall, this round went reasonably well."

What the interviewer emphasized:

  • Multi-tenancy architecture: How do you separate data for different companies/workspaces?

  • Critical security requirements:

      • Authentication: Workspace-scoped tokens, verify user belongs to workspace

      • Encryption: At-rest (tenant-specific keys) and in-transit (TLS)

      • Sharding: Logical partitioning (workspace_id in all tables) vs. physical partitioning (separate databases)

  • Proactive discussion: The candidate noted they should have led the conversation on fault tolerance and scalability instead of waiting for the interviewer to ask

Key insight: Multi-tenancy is what differentiates enterprise chat from consumer messaging apps. You MUST discuss:

  • Adding workspace_id or team_id to every database table

  • Database row-level security policies

  • Logical vs. physical isolation trade-offs

  • Resource isolation (rate limiting per workspace, preventing one tenant from monopolizing resources)

  • API gateway validating workspace context before any operation
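The combination of a workspace-scoped token check and a workspace_id filter on every query can be sketched as follows. Everything here is hypothetical and simplified: `Token`, `MessageStore`, and `TenantAccessError` are illustrative names, the store is an in-memory list, and real token verification (signatures, expiry) is omitted.

```python
from dataclasses import dataclass


class TenantAccessError(Exception):
    """Raised when a request targets a workspace the token does not belong to."""


@dataclass(frozen=True)
class Token:
    user_id: str
    workspace_id: str  # assumed to be set by the auth service at login


class MessageStore:
    def __init__(self):
        self._rows = []  # each row carries workspace_id, mirroring the schema rule above

    def insert(self, token, channel_id, text):
        # The workspace comes from the verified token, never from client input.
        self._rows.append({
            "workspace_id": token.workspace_id,
            "channel_id": channel_id,
            "author": token.user_id,
            "text": text,
        })

    def list_channel(self, token, workspace_id, channel_id):
        # Gateway-style check: reject before touching any data.
        if token.workspace_id != workspace_id:
            raise TenantAccessError("token is not valid for this workspace")
        # Defense in depth: the query itself is also filtered by workspace_id.
        return [
            r["text"] for r in self._rows
            if r["workspace_id"] == workspace_id and r["channel_id"] == channel_id
        ]


store = MessageStore()
acme = Token("u1", "acme")
globex = Token("u2", "globex")
store.insert(acme, "general", "hello")
store.insert(globex, "general", "secret")
```

Both tenants have a channel called "general", yet each sees only its own rows. The check at the entry point and the filter in the query are deliberately redundant, so a bug in one layer does not leak data across tenants.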

The candidate's reflection is valuable: proactively lead the discussion on architecture concerns rather than waiting to be prompted.

Experience 4: WebSocket, Redis Pub/Sub, and Fault Tolerance

"The core discussion focused on WebSocket and Redis Pub/Sub. I felt the interviewer was particularly concerned about fault tolerance. They also asked about what makes Slack different from other messaging apps, such as the data model and data relationships. My answer was to add team_id or enterprise_id in the database to implement logical or physical partitioning of data for different enterprises."

What the interviewer emphasized:

  • Real-time infrastructure: WebSocket connection management, subscription lifecycle

  • Message distribution: Redis Pub/Sub for broadcasting messages across gateway servers

  • Fault tolerance (major focus area):

      • What happens when gateway servers crash?

      • How do clients reconnect and recover state?

      • How do you handle database failures?

      • Message queue failure scenarios (at-least-once delivery, client deduplication)

  • Enterprise-specific features: How is Slack's data model different from consumer messaging apps?

      • Answer: Multi-tenancy with team_id or enterprise_id in database schema

      • Logical vs. physical data partitioning strategies
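The at-least-once delivery and client deduplication item in the fault-tolerance list above implies that clients must tolerate redelivered messages. A minimal sketch, assuming every message carries a unique ID; the `DedupingInbox` class and the bounded-window size are hypothetical choices.

```python
from collections import OrderedDict


class DedupingInbox:
    """Client-side filter that drops messages it has already seen (hypothetical sketch)."""

    def __init__(self, max_tracked=10_000):
        self._seen = OrderedDict()  # message_id -> True, oldest first
        self._max = max_tracked
        self.messages = []

    def receive(self, message_id, payload):
        """Return True if the message was new, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._max:
            # Forget the oldest ID so memory stays bounded on long sessions.
            self._seen.popitem(last=False)
        self.messages.append(payload)
        return True


inbox = DedupingInbox(max_tracked=2)
first = inbox.receive(101, "hi")
retry = inbox.receive(101, "hi")   # redelivery after a queue retry
second = inbox.receive(102, "there")
```

The bounded window means a duplicate that arrives after its ID has been evicted would slip through, so the window should comfortably exceed the longest plausible redelivery delay.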

Key insight: The interviewer cared deeply about fault tolerance. Be prepared to discuss:

  • Client reconnection strategies (exponential backoff, resubscribe to channels)

  • Bootstrap service providing snapshots of missed state

  • Database replication (factor 3+, active-active topology)

  • Thundering herd problem (entire company reconnects after outage → rate limiting, connection jitter)

  • Why WebSocket servers should be stateless for easy horizontal scaling
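The reconnection and thundering-herd points above can be combined in one small function: exponential backoff with random jitter, so that clients disconnected at the same instant do not all retry at the same instant. This sketch uses the "full jitter" variant; the function name, default base, and cap are assumptions chosen for illustration.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Return one randomized delay (seconds) per reconnect attempt.

    Attempt n waits a uniform random time in [0, min(cap, base * 2**n)).
    """
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng() * ceiling)
    return delays


# Upper bounds of each attempt's window, obtained by pinning the random factor to 1.0.
ceilings = backoff_delays(attempts=8, rng=lambda: 1.0)
```

Without jitter, every client in a workspace that dropped at the same moment would reconnect in synchronized waves at 1 s, 2 s, 4 s, and so on. Spreading each retry across its whole window flattens those waves; server-side rate limiting is still needed as a backstop.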

Common Patterns Across All Interviews

Analyzing these real experiences reveals what interviewers consistently emphasize:

1. Database Schema and Scalability (appeared in all 4 experiences)

  • Experience 1: Database scalability, sharding strategies

  • Experience 2: Database table design, denormalized tables

  • Experience 3: Sharding for multi-tenancy

  • Experience 4: Data model and data relationships

What this means: Every candidate was asked about database design. You must be ready to:

  • Design schema with workspace_id in every table

  • Explain read-optimized denormalization (maintaining both user_to_channels and channel_to_users)

  • Discuss sharding strategies (channel_id as shard key)

  • Show understanding of query patterns and indexes
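Using channel_id as the shard key, as suggested above, comes down to a deterministic function from channel to shard. A minimal sketch, assuming string channel IDs and a fixed shard count; `shard_for` is a hypothetical helper, and hash-modulo is only one of several possible placement schemes.

```python
import hashlib


def shard_for(channel_id: str, num_shards: int) -> int:
    """Map a channel to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # A stable hash is used instead of Python's built-in hash(),
    # which is salted per process and would route inconsistently.
    digest = hashlib.sha256(channel_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


placements = {cid: shard_for(cid, 16) for cid in ("C001", "C002", "C003")}
```

All messages of a channel land on one shard, so loading a channel's history is a single-shard query. The trade-offs worth raising are hot shards for very large channels and the fact that plain modulo remaps most keys when the shard count changes, which is the usual argument for consistent hashing or a lookup table.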

2. Large Channel Fan-Out and Message Distribution (appeared in 2/4 experiences)

  • Experience 1: Explicit focus on fan-out challenge

  • Experience 4: Redis Pub/Sub for broadcasting across gateway servers

What this means: This is THE core technical challenge. Understand:

  • Dispatcher patterns (channel → gateway server mapping)

  • Pub/sub mechanisms (Redis, Kafka)

  • Trade-offs: memory usage, network bandwidth, latency

  • Server-side fan-out vs client-side approaches

3. Multi-Tenancy and Data Isolation (appeared in 2/4 experiences)

  • Experience 3: Main focus on separating company data

  • Experience 4: Data model differences for enterprise vs consumer apps

What this means: This differentiates enterprise from consumer chat. Discuss:

  • Workspace-scoped authentication and authorization

  • Logical vs physical data partitioning

  • Resource isolation and rate limiting per tenant

  • Security implications and compliance
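Per-tenant rate limiting, listed above as part of resource isolation, is commonly sketched as one token bucket per workspace. The class name, parameters, and injected clock below are illustrative assumptions; a multi-server deployment would need the bucket state in a shared store.

```python
import time


class WorkspaceRateLimiter:
    """One token bucket per workspace (hypothetical single-process sketch)."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self._rate = rate_per_sec
        self._burst = burst
        self._clock = clock
        self._buckets = {}  # workspace_id -> (tokens, last_refill_time)

    def allow(self, workspace_id):
        now = self._clock()
        tokens, last = self._buckets.get(workspace_id, (self._burst, now))
        # Refill in proportion to elapsed time, never beyond the burst size.
        tokens = min(self._burst, tokens + (now - last) * self._rate)
        if tokens >= 1:
            self._buckets[workspace_id] = (tokens - 1, now)
            return True
        self._buckets[workspace_id] = (tokens, now)
        return False


fake_time = [0.0]
limiter = WorkspaceRateLimiter(rate_per_sec=1, burst=2, clock=lambda: fake_time[0])
results = [limiter.allow("acme") for _ in range(3)]
other_tenant = limiter.allow("globex")
```

Because each workspace has its own bucket, a tenant that exhausts its budget is throttled while other tenants are unaffected, which is the isolation property the interviewers were probing for.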

4. Fault Tolerance (appeared in 2/4 experiences)

  • Experience 3: Candidate noted they should have raised it proactively

  • Experience 4: Major focus area for the interviewer

What this means: Interviewers specifically probe failure scenarios. Proactively discuss:

  • Client reconnection and state recovery

  • Database replication and failover

  • Thundering herd problem

  • Graceful degradation

5. Real-Time Infrastructure (appeared in 2/4 experiences)

  • Experience 2: Message delivery mechanism, webhooks

  • Experience 4: WebSocket management, Redis Pub/Sub

What this means: Show deep understanding beyond "use WebSockets":

  • Connection lifecycle and subscription management

  • When to use WebSocket vs HTTP

  • State management in gateway servers

6. Message Ordering and ID Generation (appeared in 1/4 experiences, but deep dive)

  • Experience 2: Detailed questions on ID generation and simultaneous messages

What this means: When asked, expect deep technical discussion:

  • Concrete ID scheme (timestamp-based)

  • Handling simultaneous messages

  • Ordering guarantees and consistency trade-offs
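On the client, handling simultaneous or late-arriving messages can be reduced to keeping the timeline sorted by message ID, so every client resolves ties the same way. A minimal sketch, assuming time-sortable integer IDs; the `Timeline` class is hypothetical.

```python
import bisect


class Timeline:
    """Keeps a channel's messages ordered by ID regardless of arrival order."""

    def __init__(self):
        self._ids = []    # sorted list of message IDs
        self._by_id = {}  # message_id -> text

    def insert(self, message_id, text):
        if message_id in self._by_id:
            return  # already displayed, e.g. a redelivered message
        bisect.insort(self._ids, message_id)
        self._by_id[message_id] = text

    def render(self):
        return [self._by_id[i] for i in self._ids]


timeline = Timeline()
timeline.insert(30, "third")
timeline.insert(10, "first")
timeline.insert(20, "second")  # arrives late but is placed by ID
```

Since the ID is the only sort key, two messages stamped in the same millisecond are ordered by their remaining ID bits. That order is arbitrary but identical on every client, which is usually what "loose ordering" needs to guarantee.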

Critical Interview Strategy Lesson

Candidates who struggled were those who waited for the interviewer to probe.

Successful approaches involved proactively leading the discussion on:

  • Scale (how does this handle millions of users?)

  • Fault tolerance (what happens when X fails?)

  • Multi-tenancy (how do you isolate company data?)

Don't wait to be asked; bring these up yourself.