Design a Quota System
Role: Software Engineer
Design a quota management system that tracks and enforces resource usage limits across multiple backend services. Think Google's unified 100 GB storage quota shared between Drive, Photos, and Gmail—every upload must be checked to prevent users from exceeding their limit, even under high concurrency.
The core challenge isn't CRUD—it's strict enforcement under high concurrency without sacrificing latency or availability. A naïve implementation either allows overdraft or creates a performance bottleneck.
This walkthrough follows the Interview Framework. Use it as a guide, not a script.
Phase 1: Requirements
Functional Requirements
- Services should be able to consume quota on behalf of a user (e.g., Drive requests 5 GB for a file upload)
- Services should be able to release quota when resources are freed (e.g., user deletes a file)
- Services should be able to cancel in-flight reservations for failed/aborted operations
- Users should be able to query their current usage across all services
Assume a single global quota per user (not per-service sub-quotas) unless the interviewer adds this. Features like quota tiers by plan, admin overrides, and usage analytics are below the line.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Strict consistency | No overdraft at any time | If 2 GB remaining, two concurrent 2 GB requests must not both succeed |
| Low latency (P99) | < 50ms | Quota checks are on the critical path of every upload/write |
| High availability | 99.99% | Quota service downtime blocks all upstream services |
| High throughput | 10K–100K+ QPS | Aggregated write traffic across all services is large |
The "no overdraft" requirement is what makes this problem hard. It rules out eventually consistent approaches and forces you to think carefully about concurrency control. This is the defining constraint the interviewer will probe.
Capacity Estimation
| Metric | Value |
|---|---|
| Daily Active Users | 10 million |
| Avg quota operations/user/day | 20 (uploads, deletes, edits) |
| Throughput | 10M users × 20 ops = 200M ops/day ÷ 86,400 ≈ 2,300 QPS avg; assuming a ~10× burst factor, 23K+ QPS peak |
| Quota record size | ~100 bytes (user_id, used, limit, reserved) |
| Total storage | 10M users × 100B = 1 GB |
Quota data is tiny—one row per user. This is a compute-bound problem (high aggregate write QPS), not a storage-bound problem. The back-of-envelope tells you sharding is about concurrency distribution, not data size.
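The arithmetic above is worth sanity-checking in a few lines. A quick sketch (the 10× peak multiplier is an assumption, not a measured number):

```python
# Back-of-envelope capacity check for the quota system.
DAU = 10_000_000            # daily active users
OPS_PER_USER = 20           # avg quota operations per user per day
RECORD_BYTES = 100          # approx size of one UserQuota row
PEAK_FACTOR = 10            # assumed burst multiplier over the daily average

ops_per_day = DAU * OPS_PER_USER                 # 200M ops/day
avg_qps = ops_per_day / 86_400                   # ~2,300 QPS
peak_qps = avg_qps * PEAK_FACTOR                 # ~23K QPS
total_storage_gb = DAU * RECORD_BYTES / 1e9      # 1 GB

print(round(avg_qps), round(peak_qps), total_storage_gb)
```

The storage line is the takeaway: one small row per user totals about a gigabyte, so all scaling effort goes into write throughput, not data volume.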
Phase 2: Data Model
Core Entities
UserQuota
├── user_id (PK)
├── quota_limit // e.g., 100 GB (set by plan/tier)
├── used // confirmed usage across all services
├── reserved // temporarily held, pending confirmation
├── updated_at
Reservation
├── reservation_id (PK)
├── user_id (FK)
├── service_id // which upstream service (drive, photos, gmail)
├── idempotency_key // dedupe retries from same service request
├── amount // how much quota reserved
├── status // pending | confirmed | released | expired
├── created_at
├── expires_at // TTL for abandoned reservations
Key Insight: Reserve → Confirm/Cancel
The data model separates reserved from used because upstream operations (like file uploads) take time. A 5 GB file upload reserves quota immediately but only confirms it after the upload completes.
Available quota = quota_limit - used - reserved
A common mistake is tracking only used quota. Without reservations, two concurrent uploads can each see "10 GB available" and both proceed, causing overdraft. The reserved field is essential for strict enforcement.
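The availability formula can be captured in a minimal sketch (illustrative Python, not the service's actual code; names like `can_reserve` are invented here):

```python
from dataclasses import dataclass

GB = 1024**3

@dataclass
class UserQuota:
    quota_limit: int
    used: int      # confirmed usage
    reserved: int  # in-flight, pending confirmation

    @property
    def available(self) -> int:
        # Available must subtract BOTH confirmed usage and in-flight
        # reservations, or concurrent uploads can overdraft.
        return self.quota_limit - self.used - self.reserved

def can_reserve(q: UserQuota, amount: int) -> bool:
    return q.available >= amount

q = UserQuota(quota_limit=100 * GB, used=50 * GB, reserved=0)
assert can_reserve(q, 5 * GB)
q.reserved += 5 * GB               # first upload reserves immediately
assert q.available == 45 * GB
# A second concurrent 50 GB request now correctly fails instead of
# both requests seeing "50 GB available" and overdrafting:
assert not can_reserve(q, 50 * GB)
```

In the real system this check and the increment happen atomically in the database, not in application memory; the sketch only illustrates the accounting.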
Phase 3: API Design
Reserve Quota (Synchronous — Critical Path)
POST /v1/quota/reserve
Headers:
X-Service-Id: google-drive
Idempotency-Key: upload_abc123
Request:
{
"user_id": "user_456",
"amount_bytes": 5368709120 // 5 GB
}
Response (200):
{
"reservation_id": "res_789",
"available_after": 53687091200,
"expires_at": "2024-01-15T12:30:00Z"
}
Response (409 Conflict):
{
"error": "INSUFFICIENT_QUOTA",
"available": 2147483648, // only 2 GB available
"requested": 5368709120
}
Confirm Reservation
POST /v1/quota/confirm
{ "reservation_id": "res_789" }
Cancel Reservation (Failed/Aborted Upload)
POST /v1/quota/cancel
{ "reservation_id": "res_789" }
Release Used Quota (Delete Path)
POST /v1/quota/release
{
"user_id": "user_456",
"amount_bytes": 5368709120,
"service_id": "google-drive",
"reference_id": "file_xyz"
}
Get Usage
GET /v1/quota/usage?user_id=user_456
Response:
{
"user_id": "user_456",
"quota_limit": 107374182400, // 100 GB
"used": 53687091200, // 50 GB confirmed
"reserved": 5368709120, // 5 GB pending
"available": 48318382080 // ~45 GB available
}
The reserve endpoint must be synchronous because the caller needs an immediate yes/no before proceeding with the upload. Confirm/cancel should also be synchronous for correctness. Releasing already-used quota (delete path) can be async because delayed crediting is a UX issue, not an overdraft risk.
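From the caller's side, the endpoints compose into a reserve → work → confirm/cancel pattern. A hedged sketch of that caller logic (the `QuotaClient` here is a toy in-memory stand-in, not a published SDK):

```python
class InsufficientQuota(Exception):
    """Maps to the HTTP 409 INSUFFICIENT_QUOTA response."""

class QuotaClient:
    """Toy in-memory stand-in for the quota service (illustrative only)."""
    def __init__(self, limit: int):
        self.limit, self.used = limit, 0
        self.reserved = {}            # reservation_id -> amount
        self._next_id = 0

    def reserve(self, amount: int) -> str:
        if self.limit - self.used - sum(self.reserved.values()) < amount:
            raise InsufficientQuota()
        self._next_id += 1
        rid = f"res_{self._next_id}"
        self.reserved[rid] = amount
        return rid

    def confirm(self, rid: str) -> None:
        self.used += self.reserved.pop(rid)

    def cancel(self, rid: str) -> None:
        self.reserved.pop(rid, None)  # idempotent

def upload(client: QuotaClient, size: int, do_upload) -> bool:
    rid = client.reserve(size)        # synchronous yes/no before the upload
    try:
        do_upload()
    except Exception:
        client.cancel(rid)            # best-effort; TTL expiration is the backstop
        return False
    client.confirm(rid)               # convert reserved -> used
    return True
```

The key shape: the caller never touches `used` directly; it only moves quota through the reserve/confirm/cancel lifecycle, and a crash between reserve and confirm is recovered by the TTL.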
Phase 4: High-Level Design
Architecture Overview
Component Responsibilities
| Component | Responsibility |
|---|---|
| Upstream Services | Drive, Photos, Gmail call quota service before consuming resources |
| Load Balancer | Routes requests; can use consistent hashing on user_id for locality |
| Quota Servers | Stateless application servers handling reserve/release/query logic |
| PostgreSQL | Source of truth for quota data, sharded by user_id |
| Event Bus (Optional) | Async fanout for analytics, audit logs, and downstream consumers |
| Expiration Worker | Background job that reclaims expired reservations |
Reserve Flow (Critical Path)
This is the most important flow—it must be atomic to prevent overdraft:
BEGIN;
-- Idempotency: unique(service_id, idempotency_key)
SELECT reservation_id, status, amount
FROM reservations
WHERE service_id = :service_id
  AND idempotency_key = :idempotency_key
FOR UPDATE;
-- If a row exists, return it (safe retry); otherwise:

WITH reserved_row AS (
  UPDATE user_quota
  SET reserved = reserved + :amount,
      updated_at = NOW()
  WHERE user_id = :user_id
    AND (quota_limit - used - reserved) >= :amount
  RETURNING user_id
)
INSERT INTO reservations (
  reservation_id, user_id, service_id, idempotency_key, amount, status, created_at, expires_at
)
SELECT :reservation_id, user_id, :service_id, :idempotency_key, :amount, 'pending', NOW(), NOW() + INTERVAL '30 minutes'
FROM reserved_row;
COMMIT;
Walkthrough:
- Drive wants to upload a 5 GB file for user_456
- Drive calls POST /v1/quota/reserve with amount = 5 GB
- Quota Server starts a transaction, checks idempotency, then runs the conditional UPDATE + INSERT reservation
- If the INSERT affects one row, the reservation succeeds → return reservation_id
- If the conditional UPDATE finds no row, the INSERT gets no input row → return 409 (insufficient quota)
- Drive proceeds with the upload only if the reservation succeeded
- After the upload completes, Drive calls POST /v1/quota/confirm to convert reserved → used
- If Drive crashes or the upload fails, the reservation expires after its TTL
The conditional UPDATE is still the core concurrency guard. Wrapping it with reservation creation in the same transaction gives both guarantees: no overdraft and no "reserved bytes without a reservation row" inconsistencies.
Confirm Flow
BEGIN;
WITH target AS (
  SELECT reservation_id, user_id, amount, status
  FROM reservations
  WHERE reservation_id = :reservation_id
  FOR UPDATE
)
UPDATE user_quota q
SET used = q.used + t.amount,
    reserved = q.reserved - t.amount,
    updated_at = NOW()
FROM target t
WHERE q.user_id = t.user_id
  AND t.status = 'pending';

UPDATE reservations
SET status = 'confirmed'
WHERE reservation_id = :reservation_id
  AND status = 'pending';
COMMIT;
If the reservation is already confirmed, return success (idempotent retry). If it's released or expired, return 409.
Cancel Reservation Flow (Failed Upload)
BEGIN;
WITH target AS (
  UPDATE reservations
  SET status = 'released'
  WHERE reservation_id = :reservation_id
    AND status = 'pending'
  RETURNING user_id, amount
)
UPDATE user_quota q
SET reserved = q.reserved - t.amount,
    updated_at = NOW()
FROM target t
WHERE q.user_id = t.user_id;
COMMIT;
Release Used Quota Flow (Delete Path)
UPDATE user_quota
SET used = used - :amount,
    updated_at = NOW()
WHERE user_id = :user_id
  AND used >= :amount;
Reservation Expiration
If a service reserves quota but never confirms (crash, abandoned upload), the reservation must be reclaimed:
-- Worker loop: process in batches to avoid long transactions
WITH expired AS (
  UPDATE reservations
  SET status = 'expired'
  WHERE reservation_id IN (
    SELECT reservation_id
    FROM reservations
    WHERE status = 'pending'
      AND expires_at < NOW()
    ORDER BY expires_at
    LIMIT 1000
    FOR UPDATE SKIP LOCKED
  )
  RETURNING user_id, amount
)
UPDATE user_quota q
SET reserved = q.reserved - e.total,
    updated_at = NOW()
FROM (
  -- Aggregate per user: a user may have several expired reservations in
  -- one batch, and UPDATE ... FROM applies at most one matching row
  SELECT user_id, SUM(amount) AS total FROM expired GROUP BY user_id
) e
WHERE q.user_id = e.user_id;
Reservation TTL creates a trade-off: too short and legitimate slow uploads lose quota mid-upload; too long and abandoned reservations block available quota. Default to 30 minutes with configurable TTLs for large uploads.
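The loop driving this batch query is simple. A sketch, assuming a hypothetical `reclaim_batch` callable that runs the SQL above in one transaction and returns how many reservations it expired:

```python
def drain_expired(reclaim_batch, batch_size: int = 1000) -> int:
    """Run reclaim batches until the backlog is drained.

    reclaim_batch() is assumed to execute the batched expiration SQL in its
    own transaction and return the number of reservations expired. A batch
    smaller than batch_size signals there is nothing left to reclaim.
    """
    total = 0
    while True:
        n = reclaim_batch()
        total += n
        if n < batch_size:
            return total
```

Because the SQL uses FOR UPDATE SKIP LOCKED, several workers can run this loop concurrently without blocking on each other's batches; a scheduler would invoke `drain_expired` periodically rather than spinning.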
Phase 5: Scaling & Trade-offs
1. Handling High Concurrency (Aggregate Load)
At peak loads of tens of thousands of QPS, a single PostgreSQL instance becomes a bottleneck due to the aggregate volume of atomic UPDATE queries. The issue isn't typically per-user "hot rows" (a single user rarely generates enough QPS to cause lock contention), but the total database write throughput.
Solution: Shard by User ID
Since each user's quota is independent, shard user_quota by user_id:
With 100 shards, each shard sees roughly 1/100th of total QPS (assuming the hash spreads users evenly). Since each user's operations go to exactly one shard, the atomic UPDATE still works correctly.
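Shard routing can be as simple as a stable hash of the user ID. A sketch (a production system would more likely use a shard map or consistent hashing to allow resharding; this fixed-modulo version is the minimal form):

```python
import hashlib

NUM_SHARDS = 100

def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    # Stable hash: the same user always lands on the same shard, so the
    # conditional UPDATE on that user's quota row stays on one database.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

The property that matters is determinism: reserve, confirm, cancel, and release for one user all route to the same shard, so no cross-shard transaction is ever needed.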
2. Why NOT a Cache Layer?
A natural instinct is to add Redis in front of the database. But for quota enforcement, this creates serious problems:
| Approach | Issue |
|---|---|
| Cache-aside | Two servers read "10 GB available" from cache, both proceed → overdraft |
| Write-through | Cache and DB can become inconsistent during failures |
| Redis as primary | Even with persistence enabled, Redis can lose acknowledged writes on failover; too weak to be the source of truth for quota |
For strict quota enforcement, the database IS the coordination layer. Adding a cache splits the atomic check-and-deduct across two systems and breaks the guarantee. Only cache the read path (GET /usage) where stale data is acceptable.
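Caching only the read path might look like the following sketch (illustrative; the TTL and the `UsageReadCache` name are assumptions, not from the design above):

```python
import time

class UsageReadCache:
    """TTL cache for GET /usage only. Reserve/confirm/release never read it."""

    def __init__(self, fetch_from_db, ttl_s: float = 5.0):
        self._fetch = fetch_from_db        # callable: user_id -> usage dict
        self._ttl = ttl_s
        self._entries = {}                 # user_id -> (expires_at, usage)

    def get_usage(self, user_id: str):
        now = time.monotonic()
        hit = self._entries.get(user_id)
        if hit and hit[0] > now:
            return hit[1]                  # possibly seconds stale: fine for display
        usage = self._fetch(user_id)
        self._entries[user_id] = (now + self._ttl, usage)
        return usage
```

The enforcement path bypasses this entirely: the conditional UPDATE reads the row inside its own transaction, so a stale cache can never cause an overdraft, only a slightly out-of-date usage bar in the UI.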
3. Replication Strategy
The "no overdraft" requirement forces strong consistency:
- Synchronous replication for writes: Wait for a write quorum (for example, primary + one synchronous replica) before success. Preserves durability without requiring every replica.
- Any replica for reads: Usage queries are non-critical; reading from any replica gives lower latency.
Write path (reserve): Client → Primary → Write Quorum ACK → Response
Read path (usage): Client → Any Replica → Response
Be explicit about this trade-off: "We sacrifice some write latency for strong consistency because the cost of overdraft (giving users unpaid storage) is higher than slightly slower quota checks. The p99 for a conditional row update inside a short transaction with synchronous replication is typically 10-20ms—well within our 50ms budget."
4. Handling Reserved-but-Never-Used Quota
Three strategies work together:
TTL-based expiration (primary): Each reservation has expires_at. Background worker reclaims expired ones. Default: 30 min.
Heartbeat extension: For long-running operations, the upstream service periodically extends the reservation TTL. If heartbeats stop, the reservation expires normally.
Callback on failure: Upstream services should call cancel on failure (best-effort). TTL expiration is the safety net when callbacks fail.
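The heartbeat-extension strategy can be sketched as a small handler (hypothetical: the in-memory dict and the cap on total reservation lifetime are assumptions added here, not part of the design above):

```python
def extend_reservation(reservations: dict, rid: str,
                       extension_s: int, max_total_s: int, now: float) -> bool:
    """Heartbeat handler sketch: push expires_at forward on each heartbeat,
    capped relative to created_at so a buggy client that heartbeats forever
    cannot hold quota indefinitely."""
    r = reservations.get(rid)
    if r is None or r["status"] != "pending":
        return False                      # nothing to extend
    new_expiry = now + extension_s
    cap = r["created_at"] + max_total_s   # absolute lifetime ceiling
    r["expires_at"] = min(new_expiry, cap)
    return True
```

When heartbeats stop, `expires_at` stops moving and the normal expiration worker reclaims the reservation.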
5. Scaling Levels
| Scale | Architecture |
|---|---|
| < 10K QPS | Single PostgreSQL with connection pooling |
| 10K–100K QPS | Sharded PostgreSQL (10–50 shards) |
| 100K–1M QPS | Sharded PostgreSQL (50–200 shards) + read replicas |
| > 1M QPS | Per-service local quota budgets with periodic sync |
6. Per-Service Local Budgets (Advanced — Extreme Scale Only)
Central quota service allocates a "budget" to each upstream service. Each service checks locally—no network call needed. When budget runs low, request more from central.
Trade-off: Sacrifices strict per-byte enforcement for throughput. A user might briefly see different available quota from different services. Only use if strict consistency can be relaxed.
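The local-budget idea resembles a client-side token bucket refilled from the central service in chunks. A sketch under those assumptions (the `LocalQuotaBudget` name and refill interface are invented):

```python
class LocalQuotaBudget:
    """Per-service local budget: consume locally, refill from central in chunks.

    Trades strict per-byte enforcement for fewer network round-trips: most
    consumes are served from the local balance with no RPC at all.
    """

    def __init__(self, request_from_central, chunk: int):
        # request_from_central(n) is an assumed RPC that asks the central
        # quota service for n bytes and returns how many were granted (<= n).
        self._refill = request_from_central
        self._chunk = chunk
        self._local = 0

    def try_consume(self, amount: int) -> bool:
        if self._local < amount:
            self._local += self._refill(self._chunk)  # one RPC per chunk, not per op
        if self._local >= amount:
            self._local -= amount
            return True
        return False
```

Unused local budget is exactly the consistency gap the trade-off describes: bytes granted to one service look consumed to every other service until returned or spent.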
Common Pitfalls
Using read-then-write instead of atomic updates — Checking available quota in one query and deducting in another creates a TOCTOU race condition. Two concurrent requests can both pass the check and both deduct, causing overdraft.
Adding a cache for quota enforcement — Caching introduces stale reads that break strict consistency. The database is the coordination point. Only cache for non-critical read paths.
Forgetting reservation expiration — Without TTLs on reservations, crashed services permanently "leak" quota. Always implement a background expiration worker.
Sharding by service instead of user — A global user quota would be split across service-owned shards, forcing cross-shard coordination for every reserve check. Shard by user_id so each user's quota is authoritative in one place.
Interview Checklist
| Topic | Key Points to Cover |
|---|---|
| Strict enforcement | Conditional UPDATE inside a transaction prevents overdraft |
| Reserve-confirm-cancel | Explicit cancel path handles failed/aborted in-flight operations |
| Reservation TTL | Background worker reclaims abandoned reservations |
| No cache for writes | Cache introduces consistency gaps; DB is the coordination layer |
| Sharding | Shard by user_id to distribute aggregate write load |
| Replication | Synchronous writes for durability, async reads for performance |
| Back-of-envelope | Tiny data (1 GB), compute-bound problem, not storage-bound |
| Scaling levels | Single DB → Sharded → Local budgets depending on QPS |
Summary
| Aspect | Solution | Rationale |
|---|---|---|
| Core abstraction | UserQuota + Reservations | Separates confirmed usage from in-flight operations |
| Overdraft prevention | Conditional UPDATE + transactional insert | No TOCTOU race and no orphaned reservation state |
| In-flight operations | Reserve → Confirm / Cancel | TTL expiration reclaims abandoned quota |
| Concurrency | Shard by user_id | Distributes aggregate write load across DB shards |
| Consistency | Strong for writes, eventual for reads | No overdraft guarantee requires strong writes |
| Scaling | Stateless servers + sharded PostgreSQL | Handles 10K–1M QPS with horizontal scaling |
| Extreme scale | Per-service local budgets | Trades strict consistency for throughput |