Push Notification System for New Posts
Role: Software Engineer
Design a notification system that sends push notifications to followers when an author publishes a new post. Followers may have multiple devices, can mute or disable notifications, and some authors can have millions of followers.
This walkthrough follows the Interview Framework and focuses on what you would actually present in a 45-60 minute interview.
Keep the scope tight. This question is about reliable notification fanout and delivery, not about designing the full feed-ranking or recommendation system.
Phase 1: Requirements (~5 minutes)
Functional Requirements
- Users should be able to publish a new post that triggers notifications to eligible followers
- Followers should be able to receive push notifications on their registered devices in near real time
- Users should be able to configure notification preferences such as enable/disable, mute, and quiet hours
- The system should avoid duplicate notifications and retry transient delivery failures safely
- Users should be able to open the notification and deep-link to the new post, while the system tracks delivery/open status
Features like feed ranking, email digests, mention notifications, and recommendation logic are useful follow-ups, but they should stay below the line in the initial design.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | p50 under 5s, p99 under 30s for normal authors | Notifications should feel real time |
| Availability | 99.9%+ | Users expect post alerts to work consistently |
| Durability | No lost post events | Missing notifications are hard to recover from |
| Correctness | No duplicate pushes for the same post/follower/channel | Duplicate alerts destroy trust quickly |
| Scalability | 30M DAU, 15M posts/day, celebrity fanout up to 20M followers | The hot-author case dominates the design |
| Cost efficiency | Minimize unnecessary provider sends | Push providers and fanout pipelines cost real money |
The main insight is that this is not a simple "send one message" system. It is a massive fanout pipeline with filtering, deduplication, rate limiting, and third-party delivery constraints.
Capacity Estimation
Assumptions:
- Daily active users: 30 million
- Posts per day: 15 million
- Average opted-in followers per post: 200
- Candidate notifications per day: 15M * 200 = 3B
Traffic:
- Average posts/sec: 15M / 86,400 ~= 174
- Average notifications/sec: 3B / 86,400 ~= 35K
- Peak notifications/sec (10x burst): 350K+
- Celebrity post: up to 20M recipients for a single post
Storage:
- Compact notification metadata per recipient: ~250 bytes
- Daily hot storage: 3B * 250 bytes ~= 750 GB/day
- 7-day hot retention ~= 5.25 TB before replication
Do not store the fully rendered push payload for every recipient. Store compact metadata such as template ID, actor ID, post ID, and state. Render the final provider payload at dispatch time.
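As a rough sketch of that split (field names here are illustrative assumptions, not a prescribed schema), the per-recipient record carries only references and state, and the payload is rendered when a dispatch worker picks it up:

```python
from dataclasses import dataclass

# Illustrative sketch: store only compact references per recipient and render
# the provider payload at dispatch time. Field names are assumptions.
@dataclass
class NotificationRecord:
    post_id: str        # reference to the post, not its content
    follower_id: str
    actor_id: str       # the author who triggered the notification
    template_id: str    # e.g. "new_post_v1"; the text is rendered at dispatch
    channel: str        # "push"
    status: str         # created | scheduled | dispatched | opened | exhausted
    scheduled_for: int | None = None  # epoch seconds if deferred by quiet hours

def render_payload(record: NotificationRecord, author_name: str, post_title: str) -> dict:
    """Render the provider payload only when a dispatch worker picks up the record."""
    return {
        "title": f"{author_name} published a new post",
        "body": post_title,
        "deep_link": f"app://posts/{record.post_id}",
    }
```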
Phase 2: Data Model (~5 minutes)
Core Entities
Notification Lifecycle
Provider acceptance is not the same as user-visible delivery. APNs or FCM may accept the message, but the device can still be offline, the token can be stale, or the user may never open the push.
Notification is the follower-level logical record. NotificationAttempt stores per-device outcomes. The parent Notification.status is aggregated from child attempts: dispatched means at least one device was accepted by a provider, opened means any device opened it, and exhausted means all target devices permanently failed or the retry budget was spent.
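A minimal sketch of that aggregation, assuming the outcome labels described above (the function and its inputs are illustrative):

```python
# Roll per-device attempt outcomes up into the follower-level Notification.status.
# Outcome and status names follow the description above; the function is a sketch.
def aggregate_status(attempt_outcomes: list[str], retry_budget_left: bool) -> str:
    if "opened" in attempt_outcomes:
        return "opened"
    if "provider_accepted" in attempt_outcomes:
        return "dispatched"
    if attempt_outcomes and all(o == "permanent_failure" for o in attempt_outcomes):
        return "exhausted"
    if not retry_budget_left:
        return "exhausted"
    return "pending"
```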
Quiet hours usually mean defer, not drop. Use scheduled_for to store the first eligible delivery time and place the notification onto a delayed queue or scheduled partition until that time arrives.
Phase 3: API Design (~5 minutes)
Protocol Choice
- REST for post creation, device registration, and preference management
- Durable event stream / queue for internal async fanout
- APNs / FCM / Web Push for actual device delivery
WebSocket is not the primary protocol here because the core requirement is offline-capable push delivery. If the product also needs a live in-app notification center, you can add WebSocket or SSE later as a secondary channel.
Client-Facing APIs
```
# Create a post
POST /api/posts
Content-Type: application/json
{
  "text": "We just launched a new feature",
  "visibility": "public",
  "notify_followers": true
}

Response:
{
  "post_id": "post_123",
  "created_at": "2026-03-12T22:10:00Z"
}

# Register or refresh a device token
POST /api/devices
{
  "provider": "apns",
  "platform": "ios",
  "push_token": "token_abc"
}

# Update global push preferences
PUT /api/notification-settings
{
  "push_enabled": true,
  "quiet_hours_start": 22,
  "quiet_hours_end": 7,
  "timezone": "America/Los_Angeles"
}

# Update post-notification preferences for an author
PUT /api/follows/{author_id}/notification-settings
{
  "notify_on_post": true,
  "muted": false
}

# Fetch notification history / in-app inbox
GET /api/notifications?cursor=notif_456&limit=50

# Client acknowledges that a notification was opened
POST /api/notifications/{notification_id}/ack
{
  "device_id": "dev_789",
  "event": "opened"
}
```
Internal Events
```
// Published after the post transaction commits
{
  "event_type": "post_created",
  "post_id": "post_123",
  "author_id": "user_42",
  "visibility": "public",
  "created_at": "2026-03-12T22:10:00Z"
}

// Fanout shard job
{
  "job_id": "job_555",
  "post_id": "post_123",
  "author_id": "user_42",
  "shard_id": 18,
  "cursor": "follower_9000000"
}
```
Use a transactional outbox between POST /api/posts and the post_created event. Otherwise you risk writing the post successfully but losing the notification trigger if the process crashes before publishing to Kafka or your queue.
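A minimal sketch of the outbox write, assuming a relational posts table and an outbox table polled by a relay process (table and column names are illustrative, not a prescribed schema):

```python
import json
import uuid

def create_post_with_outbox(conn, author_id: str, text: str, visibility: str) -> str:
    """Write the post and its post_created outbox row in one transaction.

    `conn` is assumed to be a DB-API connection (e.g. psycopg2). A separate
    relay process polls the outbox table and publishes rows to the event bus,
    marking them sent only after the broker acknowledges the publish.
    """
    post_id = f"post_{uuid.uuid4().hex[:12]}"
    event = {
        "event_type": "post_created",
        "post_id": post_id,
        "author_id": author_id,
        "visibility": visibility,
    }
    with conn:  # commits on success, rolls back if anything below raises
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO posts (id, author_id, text, visibility) VALUES (%s, %s, %s, %s)",
                (post_id, author_id, text, visibility),
            )
            cur.execute(
                "INSERT INTO outbox (id, topic, payload) VALUES (%s, %s, %s)",
                (str(uuid.uuid4()), "post_created", json.dumps(event)),
            )
    return post_id
```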
Phase 4: High-Level Design
Architecture Overview
Write and Fanout Flow
- User creates a post through POST /api/posts
- Post Service stores the post and writes a post_created outbox row in the same transaction
- Outbox relay publishes the event to a durable bus such as Kafka, Pulsar, or SQS + SNS
- Notification Orchestrator validates that the post is eligible for notifications and creates fanout shard jobs
- Fanout Workers scan follower shards for that author, apply per-author overrides plus global quiet-hour filters, and create idempotent notification records backed by a unique key in the Notification Store
- If the follower is inside quiet hours, the worker sets scheduled_for and pushes the notification onto the delayed side of the dispatch queue; otherwise it is ready immediately
- Dispatch Workers fetch active device tokens from the Device Registry, batch by provider, and send push requests through APNs, FCM, or Web Push
- Mobile or web clients deep-link into the post and optionally acknowledge opened back to the Notification API
Key Components
| Component | Responsibility | Notes |
|---|---|---|
| Post Service | Writes posts and outbox rows | Strong consistency at write time |
| Follower Graph | Stores followers by (author_id, shard_id) | Prevents celebrity authors from becoming one hot partition |
| Preference Store | Global push settings plus per-author overrides | Quiet hours are user-level; mute/follow overrides are edge-level |
| Device Registry | Stores active push tokens per user/device | Dispatch uses it for token lookup and invalid-token cleanup |
| Notification Orchestrator | Decides how to fan out a post | Splits normal vs celebrity handling |
| Notification Store | Persists lifecycle state and inbox data | Unique key on (post_id, follower_id, channel) is the dedupe source of truth |
| Dispatch Workers | Send to providers with retry logic | Batch by provider and platform |
| DLQ | Holds poison messages or repeated failures | Required for safe recovery |
Recommended Storage Choices
| Data | Store | Reasoning |
|---|---|---|
| Posts | PostgreSQL / MySQL | Transactional writes and product metadata |
| Follower graph | Cassandra / Bigtable / DynamoDB | Bucket followers by (author_id, shard_id) for scalable fanout scans |
| Device registry | PostgreSQL / DynamoDB | Active token lookup by user plus invalid-token updates |
| Notification state | Cassandra / DynamoDB | Very high write throughput with TTL |
| Provider rate limits + optional prefilter | Redis | Fast token buckets, counters, and hot retry suppression |
| Async transport | Kafka / Pulsar / SQS | Durable, replayable fanout pipeline |
Store per-author overrides such as notify_on_post and muted on the follow edge, but keep global quiet hours in a separate user preference record. That avoids rewriting every follow row when a user changes timezone or sleep schedule.
Redis can be used as an optional prefilter for hot retries, but the authoritative dedupe guarantee should come from the unique key in the Notification Store.
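As an illustration of the Redis role, here is a fixed-window rate check a dispatch worker could run before each provider batch (key names and the per-second limit are assumptions; a token bucket or a Lua script would work equally well):

```python
import redis

r = redis.Redis()  # connection details omitted

def acquire_send_slot(provider: str, limit_per_second: int, now_epoch: int) -> bool:
    """Fixed-window counter per provider and second; returns True if a send is allowed now."""
    key = f"rate:{provider}:{now_epoch}"
    count = r.incr(key)
    if count == 1:
        r.expire(key, 2)  # keep each one-second window key only briefly
    return count <= limit_per_second
```

A worker that gets False back should briefly delay or requeue the batch rather than hammering the provider; the authoritative dedupe still lives in the Notification Store.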
Phase 5: Scaling and Trade-offs (~15-20 minutes)
Deep Dive 1: Celebrity Fanout
The hard case is an author with millions of followers. A naive single job that loads all followers and sends all pushes will fail due to memory pressure, retry storms, and provider throttling.
Use a sharded campaign model:
```python
if opted_in_follower_count < 100_000:
    enqueue_all_shards(post_id, priority="high")
else:
    create_campaign(post_id, shard_size=50_000, paced_dispatch=True)
```
For celebrity posts:
- Split followers into deterministic buckets by (author_id, shard_id)
- Pace shard execution so provider quotas are respected
- Prioritize recently active followers first if the product allows it
- Keep shard jobs idempotent so retries do not create duplicate pushes
Do not scan "all followers of a celebrity" in one database request. That creates a hot partition and gives you no recovery point. Shard the fanout work and checkpoint progress.
Deep Dive 2: Delivery Semantics
You cannot guarantee true exactly-once push delivery because the queue, workers, and providers all operate with at-least-once behavior. The practical design is:
- Effectively-once logical notification creation via a unique key in the Notification Store
- At-least-once dispatch attempts with idempotent provider requests where possible
- Best-effort user-visible delivery because device state is outside your control
This is why the Notification row is the canonical follower-level record and NotificationAttempt rows are the per-device delivery log.
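One way to get effectively-once creation is a conditional write keyed on (post_id, follower_id, channel). The sketch below uses DynamoDB's attribute_not_exists condition as an example; Cassandra's INSERT ... IF NOT EXISTS gives the same property. Table name and key layout are assumptions:

```python
import boto3
from botocore.exceptions import ClientError

# Illustrative table and key layout, not a prescribed schema.
table = boto3.resource("dynamodb").Table("notifications")

def create_notification_once(post_id: str, follower_id: str, channel: str) -> bool:
    """Return True if this call created the row, False if a retry already did."""
    try:
        table.put_item(
            Item={
                "pk": f"{post_id}#{follower_id}#{channel}",
                "status": "created",
            },
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # the notification row already exists; do not send again
        raise
```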
Deep Dive 3: Preference Filtering and Timing
A follower may:
- Disable post notifications entirely
- Mute only a specific author
- Enter quiet hours in their own timezone
- Unfollow or block the author after the post is created
The safest rule is to filter as late as possible, during shard expansion or just before dispatch. That reduces stale sends caused by race conditions between post creation and preference changes.
For quiet hours specifically, late filtering usually means defer instead of suppress: create the notification row, compute the next valid send time from the user's timezone, set scheduled_for, and let the delayed queue release it later.
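A minimal sketch of that computation, assuming quiet hours are whole local hours as in the preferences API above (edge cases such as DST transitions are glossed over):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def next_send_time(now_utc: datetime, tz_name: str, quiet_start: int, quiet_end: int) -> datetime | None:
    """Return the UTC time when quiet hours end, or None if sending is allowed now."""
    local = now_utc.astimezone(ZoneInfo(tz_name))
    if quiet_start > quiet_end:            # window crosses midnight, e.g. 22 -> 7
        in_quiet = local.hour >= quiet_start or local.hour < quiet_end
    else:
        in_quiet = quiet_start <= local.hour < quiet_end
    if not in_quiet:
        return None
    release = local.replace(hour=quiet_end, minute=0, second=0, microsecond=0)
    if release <= local:
        release += timedelta(days=1)       # quiet window ends tomorrow morning
    return release.astimezone(ZoneInfo("UTC"))
```

The returned time becomes scheduled_for, and the delayed queue releases the notification once that time passes.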
Trade-off:
- Filter once at event creation: cheaper, but stale
- Filter late during fanout/dispatch: more reads, but more correct
For user trust, late filtering is usually worth the extra cost.
Deep Dive 4: Provider Failures and Token Hygiene
Push providers fail in different ways:
- Temporary errors: retry with exponential backoff and jitter
- Permanent errors: mark device token invalid and stop sending
- Slow provider region: shift traffic if multi-region routing is available
Important practices:
- Batch sends by provider and platform
- Cap retry age so a "new post" push is not delivered hours later, as in the sketch below
- Send invalid-token events back to Device Service for cleanup
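A sketch of the retry policy combining backoff with jitter and a retry-age cap (the constants and error labels are illustrative assumptions):

```python
import random

MAX_NOTIFICATION_AGE_SECONDS = 30 * 60   # illustrative cap: drop pushes older than 30 minutes

def next_retry_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def should_retry(provider_error: str, notification_age_seconds: float) -> bool:
    """Retry only transient errors, and only while the push is still timely."""
    if notification_age_seconds > MAX_NOTIFICATION_AGE_SECONDS:
        return False                     # a "new post" alert hours late is worse than none
    return provider_error in {"timeout", "throttled", "service_unavailable"}
```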
Availability and Multi-Region
For high availability:
- Run stateless API, orchestration, and dispatch workers in multiple regions
- Keep the event bus replicated or use region-local queues with mirrored failover
- Store notifications in a replicated database with regional failover
- Avoid cross-region synchronous calls in the hot path; only the initial post write needs strong consistency
The follower graph and notification store can usually tolerate eventual consistency across regions. Missing a few milliseconds of replica lag is far better than slowing down the entire post path.
Common Pitfalls
Confusing provider acceptance with delivery. APNs or FCM returning success only means they accepted the message, not that the user saw it.
No transactional outbox. Writing the post to the database and publishing the event separately creates a classic lost-notification failure mode.
Ignoring hot authors. A design that works for 1,000 followers often collapses for 10 million followers unless fanout is sharded and paced.
No dedupe key. Retries in the queue or worker layer will create duplicate pushes without an idempotent notification record.
Filtering too early. If you only evaluate mute, block, and quiet-hour settings at post creation time, users can still receive notifications they turned off seconds later.
Interview Checklist
Before wrapping up, verify you covered:
Requirements Phase
- Core scope is new-post push notifications, not the full feed system
- Functional requirements include preferences, retries, and dedupe
- Non-functional requirements include latency, correctness, and celebrity spikes
- Quick capacity estimate shows fanout scale
Data Model
- Post, FollowEdge, UserNotificationPreference, Device, Notification, NotificationAttempt, FanoutJob
- Unique dedupe key explained
- Follower-level state separated from per-device attempt history
API Design
- REST APIs for posts, devices, preferences, and acknowledgements
- Internal post_created event defined
- Transactional outbox justified
High-Level Design
- Architecture diagram with fanout and dispatch pipeline
- Follower graph, preference model, and notification store explained
- APNs / FCM integration covered
Scaling and Trade-offs
- Celebrity fanout strategy explained
- At-least-once vs exactly-once trade-off discussed
- Provider retry behavior and invalid token cleanup covered
- Multi-region availability and late preference filtering mentioned
Summary
| Aspect | Recommendation | Rationale |
|---|---|---|
| Post trigger | Transactional outbox | Prevent lost notification events |
| Fanout strategy | Sharded jobs by (author_id, shard_id) bucket | Handles celebrity-scale writes safely |
| Follower store | Wide-column KV by (author_id, shard_id) | Efficient follower scans without hot partitions |
| Notification record | Canonical row with unique dedupe key | Prevents duplicate pushes |
| Delivery log | Separate attempt table | Tracks retries and provider outcomes |
| Dispatch layer | Batched workers with rate limiting | Respects APNs / FCM quotas |
| Preference handling | Global user prefs + per-author overrides, filtered late | Better correctness without rewriting wide follow edges |
| Reliability model | At-least-once attempts, idempotent creation | Practical and robust |
The strongest answer here is not "use Kafka and push notifications." It is showing that you understand the real bottlenecks: fanout amplification, celebrity hot keys, provider throttling, idempotency, and user-preference correctness.