
Push Notification System for New Posts

Role: Software Engineer


Design a notification system that sends push notifications to followers when an author publishes a new post. Followers may have multiple devices, can mute or disable notifications, and some authors can have millions of followers.

This walkthrough follows the Interview Framework and focuses on what you would actually present in a 45-60 minute interview.

Keep the scope tight. This question is about reliable notification fanout and delivery, not about designing the full feed-ranking or recommendation system.

Phase 1: Requirements (~5 minutes)

Functional Requirements

  • Users should be able to publish a new post that triggers notifications to eligible followers

  • Followers should be able to receive push notifications on their registered devices in near real time

  • Users should be able to configure notification preferences such as enable/disable, mute, and quiet hours

  • The system should avoid duplicate notifications and retry transient delivery failures safely

  • Users should be able to open the notification and deep-link to the new post, while the system tracks delivery/open status

Features like feed ranking, email digests, mention notifications, and recommendation logic are useful follow-ups, but they should stay below the line in the initial design.

Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Latency | p50 under 5s, p99 under 30s for normal authors | Notifications should feel real time |
| Availability | 99.9%+ | Users expect post alerts to work consistently |
| Durability | No lost post events | Missing notifications are hard to recover from |
| Correctness | No duplicate pushes for the same post/follower/channel | Duplicate alerts destroy trust quickly |
| Scalability | 30M DAU, 15M posts/day, celebrity fanout up to 20M followers | The hot-author case dominates the design |
| Cost efficiency | Minimize unnecessary provider sends | Push providers and fanout pipelines cost real money |

The main insight is that this is not a simple "send one message" system. It is a massive fanout pipeline with filtering, deduplication, rate limiting, and third-party delivery constraints.

Capacity Estimation

text
Assumptions:
- Daily active users: 30 million
- Posts per day: 15 million
- Average opted-in followers per post: 200
- Candidate notifications per day: 15M * 200 = 3B

Traffic:
- Average posts/sec: 15M / 86,400 ~= 174
- Average notifications/sec: 3B / 86,400 ~= 35K
- Peak notifications/sec (10x burst): 350K+
- Celebrity post: up to 20M recipients for a single post

Storage:
- Compact notification metadata per recipient: ~250 bytes
- Daily hot storage: 3B * 250 bytes ~= 750 GB/day
- 7-day hot retention ~= 5.25 TB before replication

Do not store the fully rendered push payload for every recipient. Store compact metadata such as template ID, actor ID, post ID, and state. Render the final provider payload at dispatch time.
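As a concrete sketch of that split, a compact per-recipient record plus render-at-dispatch might look like this (field names and the template are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

# Compact per-recipient record: only IDs and state, never the rendered payload.
@dataclass
class NotificationRecord:
    template_id: str   # e.g. "new_post_v1"
    actor_id: str      # author who triggered the notification
    post_id: str
    follower_id: str
    status: str = "pending"

# Render the provider payload only at dispatch time, from a shared template.
TEMPLATES = {
    "new_post_v1": "{author_name} just published a new post",
}

def render_payload(record: NotificationRecord, author_name: str) -> dict:
    body = TEMPLATES[record.template_id].format(author_name=author_name)
    return {
        "title": "New post",
        "body": body,
        "deep_link": f"app://posts/{record.post_id}",
    }
```

Storing ~250 bytes of metadata instead of a full rendered payload is what keeps the 3B-row/day hot store at roughly 750 GB/day.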

Phase 2: Data Model (~5 minutes)

Core Entities

The main entities are Post, FollowEdge (with per-author notification overrides), UserNotificationPreference (global settings such as quiet hours), Device (registered push tokens), Notification (the follower-level record), NotificationAttempt (the per-device delivery log), and FanoutJob (the shard-level unit of fanout work).

Notification Lifecycle

Provider acceptance is not the same as user-visible delivery. APNs or FCM may accept the message, but the device can still be offline, the token can be stale, or the user may never open the push.

Notification is the follower-level logical record. NotificationAttempt stores per-device outcomes. The parent Notification.status is aggregated from child attempts: dispatched means at least one device was accepted by a provider, opened means any device opened it, and exhausted means all target devices permanently failed or the retry budget was spent.

Quiet hours usually mean defer, not drop. Use scheduled_for to store the first eligible delivery time and place the notification onto a delayed queue or scheduled partition until that time arrives.
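A minimal sketch of that aggregation rule (the state names are illustrative):

```python
def aggregate_status(attempt_states: list) -> str:
    """Roll per-device NotificationAttempt states up into Notification.status.

    Illustrative attempt states: "pending", "accepted" (provider accepted),
    "opened", "failed_permanent" (e.g. invalid token).
    """
    if any(s == "opened" for s in attempt_states):
        return "opened"          # any device opened it
    if any(s == "accepted" for s in attempt_states):
        return "dispatched"      # at least one device accepted by a provider
    if attempt_states and all(s == "failed_permanent" for s in attempt_states):
        return "exhausted"       # every target device permanently failed
    return "pending"
```

Note that a notification with one accepted device and one dead token still counts as dispatched; exhausted requires every device to fail.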

Phase 3: API Design (~5 minutes)

Protocol Choice

  • REST for post creation, device registration, and preference management

  • Durable event stream / queue for internal async fanout

  • APNs / FCM / Web Push for actual device delivery

WebSocket is not the primary protocol here because the core requirement is offline-capable push delivery. If the product also needs a live in-app notification center, you can add WebSocket or SSE later as a secondary channel.

Client-Facing APIs

http
# Create a post
POST /api/posts
Content-Type: application/json

{
  "text": "We just launched a new feature",
  "visibility": "public",
  "notify_followers": true
}

Response:
{
  "post_id": "post_123",
  "created_at": "2026-03-12T22:10:00Z"
}

# Register or refresh a device token
POST /api/devices
{
  "provider": "apns",
  "platform": "ios",
  "push_token": "token_abc"
}

# Update global push preferences
PUT /api/notification-settings
{
  "push_enabled": true,
  "quiet_hours_start": 22,
  "quiet_hours_end": 7,
  "timezone": "America/Los_Angeles"
}

# Update post-notification preferences for an author
PUT /api/follows/{author_id}/notification-settings
{
  "notify_on_post": true,
  "muted": false
}

# Fetch notification history / in-app inbox
GET /api/notifications?cursor=notif_456&limit=50

# Client acknowledges that a notification was opened
POST /api/notifications/{notification_id}/ack
{
  "device_id": "dev_789",
  "event": "opened"
}

Internal Events

json
// Published after the post transaction commits
{
  "event_type": "post_created",
  "post_id": "post_123",
  "author_id": "user_42",
  "visibility": "public",
  "created_at": "2026-03-12T22:10:00Z"
}

// Fanout shard job
{
  "job_id": "job_555",
  "post_id": "post_123",
  "author_id": "user_42",
  "shard_id": 18,
  "cursor": "follower_9000000"
}

Use a transactional outbox between POST /api/posts and the post_created event. Otherwise you risk writing the post successfully but losing the notification trigger if the process crashes before publishing to Kafka or your queue.
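A minimal outbox sketch, using SQLite as a stand-in for the posts database (table and column names are illustrative):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id TEXT PRIMARY KEY, author_id TEXT, text TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def create_post(post_id: str, author_id: str, text: str) -> None:
    # Post row and outbox row commit atomically: either both exist or neither.
    with db:
        db.execute("INSERT INTO posts VALUES (?, ?, ?)",
                   (post_id, author_id, text))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("post_created",
             json.dumps({"post_id": post_id, "author_id": author_id})),
        )

create_post("post_123", "user_42", "We just launched a new feature")
```

A separate relay process then polls unpublished outbox rows, publishes them to the bus, and marks them published only after the broker acknowledges, so a crash at any point leaves the event recoverable rather than lost.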

Phase 4: High-Level Design

Architecture Overview

Write and Fanout Flow

  • User creates a post through POST /api/posts

  • Post Service stores the post and writes a post_created outbox row in the same transaction

  • Outbox relay publishes the event to a durable bus such as Kafka, Pulsar, or SQS + SNS

  • Notification Orchestrator validates that the post is eligible for notifications and creates fanout shard jobs

  • Fanout Workers scan follower shards for that author, apply per-author overrides plus global quiet-hour filters, and create idempotent notification records backed by a unique key in the Notification Store

  • If the follower is inside quiet hours, the worker sets scheduled_for and pushes the notification onto the delayed side of the dispatch queue; otherwise it is ready immediately

  • Dispatch Workers fetch active device tokens from the Device Registry, batch by provider, and send push requests through APNs, FCM, or Web Push

  • Mobile or web clients deep-link into the post and optionally acknowledge opened back to the Notification API

Key Components

| Component | Responsibility | Notes |
| --- | --- | --- |
| Post Service | Writes posts and outbox rows | Strong consistency at write time |
| Follower Graph | Stores followers by (author_id, shard_id) | Prevents celebrity authors from becoming one hot partition |
| Preference Store | Global push settings plus per-author overrides | Quiet hours are user-level; mute/follow overrides are edge-level |
| Device Registry | Stores active push tokens per user/device | Dispatch uses it for token lookup and invalid-token cleanup |
| Notification Orchestrator | Decides how to fan out a post | Splits normal vs celebrity handling |
| Notification Store | Persists lifecycle state and inbox data | Unique key on (post_id, follower_id, channel) is the dedupe source of truth |
| Dispatch Workers | Send to providers with retry logic | Batch by provider and platform |
| DLQ | Holds poison messages or repeated failures | Required for safe recovery |

| Data | Store | Reasoning |
| --- | --- | --- |
| Posts | PostgreSQL / MySQL | Transactional writes and product metadata |
| Follower graph | Cassandra / Bigtable / DynamoDB | Bucket followers by (author_id, shard_id) for scalable fanout scans |
| Device registry | PostgreSQL / DynamoDB | Active token lookup by user plus invalid-token updates |
| Notification state | Cassandra / DynamoDB | Very high write throughput with TTL |
| Provider rate limits + optional prefilter | Redis | Fast token buckets, counters, and hot retry suppression |
| Async transport | Kafka / Pulsar / SQS | Durable, replayable fanout pipeline |

Store per-author overrides such as notify_on_post and muted on the follow edge, but keep global quiet hours in a separate user preference record. That avoids rewriting every follow row when a user changes timezone or sleep schedule.

Redis can be used as an optional prefilter for hot retries, but the authoritative dedupe guarantee should come from the unique key in the Notification Store.
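A toy illustration of that authoritative dedupe, using SQLite's conflict handling in place of a distributed store's conditional write (DynamoDB condition expressions or Cassandra lightweight transactions play the same role):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The unique key (post_id, follower_id, channel) is the dedupe source of truth.
db.execute("""
    CREATE TABLE notifications (
        post_id TEXT, follower_id TEXT, channel TEXT, status TEXT,
        PRIMARY KEY (post_id, follower_id, channel)
    )
""")

def create_notification(post_id: str, follower_id: str, channel: str) -> bool:
    """Idempotent create: True if this call inserted the row, False if a
    retried fanout job already created it."""
    cur = db.execute(
        "INSERT OR IGNORE INTO notifications VALUES (?, ?, ?, 'pending')",
        (post_id, follower_id, channel),
    )
    return cur.rowcount == 1
```

Because creation is idempotent, queue redeliveries and worker retries collapse into a no-op instead of a duplicate push.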

Phase 5: Scaling and Trade-offs (~15-20 minutes)

Deep Dive 1: Celebrity Fanout

The hard case is an author with millions of followers. A naive single job that loads all followers and sends all pushes will fail due to memory pressure, retry storms, and provider throttling.

Use a sharded campaign model:

python
# Choose a fanout strategy by audience size (thresholds are illustrative)
if opted_in_follower_count < 100_000:
    enqueue_all_shards(post_id, priority="high")  # small audience: fan out at once
else:
    # celebrity audience: fixed-size shards, paced to respect provider quotas
    create_campaign(post_id, shard_size=50_000, paced_dispatch=True)

For celebrity posts:

  • Split followers into deterministic buckets by (author_id, shard_id)

  • Pace shard execution so provider quotas are respected

  • Prioritize recently active followers first if the product allows it

  • Keep shard jobs idempotent so retries do not create duplicate pushes

Do not scan "all followers of a celebrity" in one database request. That creates a hot partition and gives you no recovery point. Shard the fanout work and checkpoint progress.
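One way to sketch the sharded, checkpointed scan (the shard count, page size, and the scan/checkpoint/enqueue interfaces are assumptions, not a fixed API):

```python
import zlib

SHARD_COUNT = 64  # illustrative; sized to the follower store in practice

def follower_shard(follower_id: str) -> int:
    """Deterministic bucket so the same follower always lands in one shard."""
    return zlib.crc32(follower_id.encode()) % SHARD_COUNT

def run_shard_job(job: dict, scan, checkpoint, enqueue, page_size: int = 1000):
    """Drain one (author_id, shard_id) bucket in pages with a resumable cursor.

    Assumed interfaces:
      scan(author_id, shard_id, cursor, limit) -> (followers, next_cursor or None)
      checkpoint(job_id, cursor)  -- persists the recovery point
      enqueue(post_id, follower_id)  -- idempotent thanks to the dedupe key
    """
    cursor = job["cursor"]
    while True:
        followers, cursor = scan(job["author_id"], job["shard_id"],
                                 cursor, page_size)
        for follower_id in followers:
            enqueue(job["post_id"], follower_id)
        checkpoint(job["job_id"], cursor)  # recovery point after each page
        if cursor is None:
            break
```

A crashed worker restarts from the last checkpointed cursor instead of rescanning the whole bucket, and re-enqueued followers are absorbed by the notification store's dedupe key.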

Deep Dive 2: Delivery Semantics

You cannot guarantee true exactly-once push delivery because the queue, workers, and providers all operate with at-least-once behavior. The practical design is:

  • Effectively-once logical notification creation via a unique key in the Notification Store

  • At-least-once dispatch attempts with idempotent provider requests where possible

  • Best-effort user-visible delivery because device state is outside your control

This is why the Notification row is the canonical follower-level record and NotificationAttempt rows are the per-device delivery log.

Deep Dive 3: Preference Filtering and Timing

A follower may:

  • Disable post notifications entirely

  • Mute only a specific author

  • Enter quiet hours in their own timezone

  • Unfollow or block the author after the post is created

The safest rule is to filter as late as possible, during shard expansion or just before dispatch. That reduces stale sends caused by race conditions between post creation and preference changes.

For quiet hours specifically, late filtering usually means defer instead of suppress: create the notification row, compute the next valid send time from the user's timezone, set scheduled_for, and let the delayed queue release it later.
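A sketch of computing scheduled_for from the quiet-hours settings in the API above (hour granularity only; DST edge cases are ignored for brevity):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def next_send_time(now_utc: datetime, tz: str,
                   quiet_start: int, quiet_end: int) -> datetime:
    """First eligible send time: now, or the end of the user's quiet hours.

    quiet_start/quiet_end are local hours, e.g. 22 and 7 for a 10pm-7am
    window, matching the PUT /api/notification-settings payload.
    """
    local = now_utc.astimezone(ZoneInfo(tz))
    if quiet_start > quiet_end:  # window crosses midnight, e.g. 22 -> 7
        in_quiet = local.hour >= quiet_start or local.hour < quiet_end
    else:
        in_quiet = quiet_start <= local.hour < quiet_end
    if not in_quiet:
        return now_utc
    release = local.replace(hour=quiet_end, minute=0, second=0, microsecond=0)
    if quiet_start > quiet_end and local.hour >= quiet_start:
        release += timedelta(days=1)  # still before midnight: release tomorrow
    return release.astimezone(UTC)
```

The dispatch layer stores the returned value as scheduled_for and the delayed queue releases the notification once that instant passes.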

Trade-off:

  • Filter once at event creation: cheaper, but stale

  • Filter late during fanout/dispatch: more reads, but more correct

For user trust, late filtering is usually worth the extra cost.

Deep Dive 4: Provider Failures and Token Hygiene

Push providers fail in different ways:

  • Temporary errors: retry with exponential backoff and jitter

  • Permanent errors: mark device token invalid and stop sending

  • Slow provider region: shift traffic if multi-region routing is available

Important practices:

  • Batch sends by provider and platform

  • Cap retry age so a "new post" push is not delivered hours later

  • Send invalid-token events back to Device Service for cleanup
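The retry policy described above can be sketched as follows (the constants are illustrative):

```python
import random

MAX_RETRY_AGE_S = 3600   # cap: a "new post" push older than this is dropped
BASE_DELAY_S = 2
MAX_DELAY_S = 300

def next_retry_delay(attempt: int) -> float:
    """Exponential backoff with full jitter:
    uniform in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt)))

def should_retry(error_kind: str, age_seconds: float) -> bool:
    """Retry transient provider errors only while the push is still fresh."""
    if error_kind == "permanent":  # e.g. unregistered token: clean up, never retry
        return False
    return age_seconds < MAX_RETRY_AGE_S
```

Full jitter spreads retries out so a provider blip does not turn into a synchronized retry storm, and the age cap enforces the rule that a stale "new post" alert is worse than no alert.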

Availability and Multi-Region

For high availability:

  • Run stateless API, orchestration, and dispatch workers in multiple regions

  • Keep the event bus replicated or use region-local queues with mirrored failover

  • Store notifications in a replicated database with regional failover

  • Avoid cross-region synchronous calls in the hot path; only the initial post write needs strong consistency

The follower graph and notification store can usually tolerate eventual consistency across regions. Missing a few milliseconds of replica lag is far better than slowing down the entire post path.

Common Pitfalls

Confusing provider acceptance with delivery. APNs or FCM returning success only means they accepted the message, not that the user saw it.

No transactional outbox. Writing the post to the database and publishing the event separately creates a classic lost-notification failure mode.

Ignoring hot authors. A design that works for 1,000 followers often collapses for 10 million followers unless fanout is sharded and paced.

No dedupe key. Retries in the queue or worker layer will create duplicate pushes without an idempotent notification record.

Filtering too early. If you only evaluate mute, block, and quiet-hour settings at post creation time, users can still receive notifications they turned off seconds later.

Interview Checklist

Before wrapping up, verify you covered:

Requirements Phase

  • Core scope is new-post push notifications, not the full feed system

  • Functional requirements include preferences, retries, and dedupe

  • Non-functional requirements include latency, correctness, and celebrity spikes

  • Quick capacity estimate shows fanout scale

Data Model

  • Post, FollowEdge, UserNotificationPreference, Device, Notification, NotificationAttempt, FanoutJob

  • Unique dedupe key explained

  • Follower-level state separated from per-device attempt history

API Design

  • REST APIs for posts, devices, preferences, and acknowledgements

  • Internal post_created event defined

  • Transactional outbox justified

High-Level Design

  • Architecture diagram with fanout and dispatch pipeline

  • Follower graph, preference model, and notification store explained

  • APNs / FCM integration covered

Scaling and Trade-offs

  • Celebrity fanout strategy explained

  • At-least-once vs exactly-once trade-off discussed

  • Provider retry behavior and invalid token cleanup covered

  • Multi-region availability and late preference filtering mentioned

Summary

| Aspect | Recommendation | Rationale |
| --- | --- | --- |
| Post trigger | Transactional outbox | Prevent lost notification events |
| Fanout strategy | Sharded jobs by (author_id, shard_id) bucket | Handles celebrity-scale writes safely |
| Follower store | Wide-column KV by (author_id, shard_id) | Efficient follower scans without hot partitions |
| Notification record | Canonical row with unique dedupe key | Prevents duplicate pushes |
| Delivery log | Separate attempt table | Tracks retries and provider outcomes |
| Dispatch layer | Batched workers with rate limiting | Respects APNs / FCM quotas |
| Preference handling | Global user prefs + per-author overrides, filtered late | Better correctness without rewriting wide follow edges |
| Reliability model | At-least-once attempts, idempotent creation | Practical and robust |

The strongest answer here is not "use Kafka and push notifications." It is showing that you understand the real bottlenecks: fanout amplification, celebrity hot keys, provider throttling, idempotency, and user-preference correctness.