
Batch Processing Service

Role: Software Engineer


Problem Statement

Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.

Given Function Signature

You are provided with a fixed backend function that you cannot modify:

python
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    # Fixed implementation - you cannot modify this
    pass

Core Challenge

How do you design a service that:

  • Accepts individual synchronous HTTP requests from users

  • Aggregates them into batches internally

  • Routes batches to available GPU workers

  • Maps responses back to the original requesters

  • Maintains low latency while maximizing throughput

Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements

Functional Requirements

Frame requirements as user capabilities:

  • Submit inference requests — Users should be able to send a single string input via HTTP and receive a processed string output

  • Synchronous response — Users should receive responses in the same HTTP connection (no polling or callbacks)

  • High concurrency — System should handle thousands of simultaneous users without degradation

Keep functional requirements minimal for this problem. The complexity lies in the internal batching mechanism, not user-facing features.

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Latency | P95 < 200ms | Real-time inference for interactive applications |
| Throughput | 1,000 RPS | Moderate scale for initial deployment |
| Availability | 99.9% | Standard SLA for production APIs |
| GPU Utilization | 70-80% | Balance efficiency with headroom for spikes |

The 100ms fixed GPU processing time is the irreducible minimum. All other latency (batching, routing, network) must fit within the remaining ~100ms budget.

Capacity Estimation

GPU count calculation:

text
Given:
- Target: 1,000 RPS
- GPU processing: 100ms per batch (10 batches/sec/GPU)
- Target batch size: 32 requests

Throughput per GPU = 10 batches/sec × 32 requests/batch = 320 RPS

Raw GPUs needed = 1,000 RPS / 320 RPS = 3.125 → 4 GPUs

With 70% utilization headroom = 4 / 0.7 ≈ 6 GPUs
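
The same arithmetic as a small helper, for sanity-checking different targets (a sketch only; the 320 RPS per GPU and 70% utilization figures are the assumptions above):

python
import math

def gpus_required(target_rps: int, batch_size: int = 32,
                  batch_latency_s: float = 0.1,
                  target_utilization: float = 0.7) -> int:
    """Back-of-the-envelope GPU count for a given request rate."""
    rps_per_gpu = batch_size / batch_latency_s      # 32 / 0.1s = 320 RPS per GPU
    raw = math.ceil(target_rps / rps_per_gpu)       # GPUs at 100% utilization
    return math.ceil(raw / target_utilization)      # leave headroom for spikes

# gpus_required(1_000) -> 6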

Latency breakdown:

text
Average batching delay: ~16ms (half of 32ms to fill batch at 1000 RPS)
Network overhead: ~10ms
GPU processing: 100ms
───────────────────────
Total average: ~126ms ✓ (under 200ms target)

Concurrent connections:

text
Connections = RPS × Average latency = 1,000 × 0.126s = 126 concurrent

Well within default OS limits (can support 10K+ with tuning)

Phase 2: Data Model

Core Entities

text
InferenceRequest
├── request_id: UUID (unique identifier)
├── input: string (user's input text)
├── timestamp: datetime (arrival time)
├── return_to: string (API instance identifier)
└── status: enum (pending, processing, completed, failed)

Batch
├── batch_id: UUID
├── requests: list[InferenceRequest] (1-100 items)
├── created_at: datetime
└── gpu_id: string (assigned GPU)

InferenceResponse
├── request_id: UUID (maps to original request)
├── output: string (processed result)
└── latency_ms: int (total processing time)
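
A minimal sketch of these entities as Python dataclasses (field names mirror the trees above; the defaults and status enum values are illustrative):

python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from uuid import uuid4

class RequestStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class InferenceRequest:
    input: str                       # user's input text
    return_to: str                   # API instance holding the open connection
    request_id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = field(default_factory=datetime.utcnow)
    status: RequestStatus = RequestStatus.PENDING

@dataclass
class Batch:
    requests: list[InferenceRequest]          # 1-100 items
    gpu_id: str | None = None                 # set when a GPU claims the batch
    batch_id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=datetime.utcnow)

@dataclass
class InferenceResponse:
    request_id: str      # maps back to the original request
    output: str          # processed result
    latency_ms: int      # total processing time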

Data Locality

| Data | Storage | Lifetime |
|---|---|---|
| Pending requests | In-memory (API server) | Until response received |
| Request queue | Redis List | Until batched |
| Batch queue | Redis List | Until GPU claims it |
| Response routing | Redis Pub/Sub | Ephemeral |

This system is largely stateless—no durable database needed. All state is transient and tied to in-flight requests.

Phase 3: API Design

Protocol Choice: REST

REST is appropriate because:

  • Simple request-response model

  • Standard HTTP semantics

  • Easy to integrate with any client

Endpoints

Submit Inference Request

http
POST /api/inference
Content-Type: application/json

Request:
{
  "input": "E equals "
}

Response (200 OK):
{
  "output": "E equals mc^2",
  "request_id": "req_abc123",
  "latency_ms": 142
}

Response (503 Service Unavailable):
{
  "error": "SERVICE_OVERLOADED",
  "message": "Too many pending requests",
  "retry_after_ms": 5000
}
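
For illustration, a client call in Python (a sketch using the requests library and a placeholder host):

python
import requests

resp = requests.post(
    "https://api.example.com/api/inference",
    json={"input": "E equals "},
    timeout=5,  # client-side deadline matching the server's 5s cutoff
)

if resp.status_code == 200:
    print(resp.json()["output"])                # e.g. "E equals mc^2"
elif resp.status_code == 503:
    backoff_ms = resp.json()["retry_after_ms"]  # wait before retrying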

Error Codes

| Code | Meaning | When |
|---|---|---|
| 200 | Success | Request processed |
| 400 | Bad Request | Invalid input format |
| 429 | Rate Limited | User exceeded quota |
| 503 | Overloaded | Queue full, reject fast |
| 504 | Timeout | Request exceeded deadline |

Phase 4: High-Level Design

Architecture

Request Flow

Step 1: Request Arrival

text
User → Load Balancer → API Server 2

API Server 2:
1. Generate unique request_id: "req_xyz"
2. Store connection: pending_requests["req_xyz"] = http_connection
3. Push to Redis queue: {request_id, input, return_to: "api-2"}
4. Wait for response (keep connection open)
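
A sketch of step 3, the enqueue call (assumes an asyncio Redis client named redis; the queue name and payload fields mirror the flow above, and the pending_requests bookkeeping appears in the API-layer code under Failure Handling):

python
import json

INSTANCE_ID = "api-2"   # identity of this API server instance

async def enqueue_request(request_id: str, input: str):
    """Push the request onto the shared Redis queue for the batching service."""
    await redis.lpush("request_queue", json.dumps({
        "request_id": request_id,
        "input": input,
        "return_to": INSTANCE_ID,   # tells GPU workers which response channel to use
    }))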

Step 2: Batch Formation

text
Batching Service:
1. BLPOP from Redis request queue (blocks until available)
2. Accumulate requests into current_batch
3. Trigger when: batch.size == 32 OR elapsed_time > 50ms
4. Push formed batch to Redis batch queue

Step 3: GPU Processing (Pull-Based)

text
GPU Worker (runs in a loop):
1. BLPOP from batch queue (blocks until batch available)
2. Execute batchstring(inputs) → 100ms
3. Publish results directly to Redis Pub/Sub

Why pull-based? GPUs claim batches atomically via BLPOP—no race conditions, no need to track GPU availability. When a GPU finishes, it simply pulls the next batch.

Step 4: Response Routing

text
GPU Worker (after processing):
For each (request_id, output) in results:
  Publish to Redis channel "responses:api-2":
    {request_id: "req_xyz", output: "..."}

API Server 2:
1. Subscribed to "responses:api-2"
2. Receive message for request_id "req_xyz"
3. Lookup: connection = pending_requests["req_xyz"]
4. Send HTTP response through connection
5. Delete from pending_requests

Batching Strategy

Timeout-based batching is critical. Pure size-based batching causes unacceptable latency during low traffic periods.

python
import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # set when the first request enters the batch

    def elapsed_ms(self):
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time and self.elapsed_ms() > self.timeout_ms:
            return True  # Timeout trigger
        return False
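
One way to drive this class end-to-end (a sketch assuming an asyncio Redis client named redis; requests are pulled off request_queue and flushed to batch_queue as JSON):

python
import json
import time

async def batching_loop(svc: BatchingService):
    while True:
        # Wait briefly for the next request so the timeout trigger can still fire
        item = await redis.blpop("request_queue", timeout=0.01)
        if item:
            _, payload = item
            if not svc.current_batch:
                svc.batch_start_time = time.monotonic()  # first request starts the clock
            svc.current_batch.append(json.loads(payload))

        if svc.current_batch and svc.should_send_batch():
            await redis.lpush("batch_queue", json.dumps(svc.current_batch))
            svc.current_batch = []
            svc.batch_start_time = None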

Trade-off: Timeout Selection

At 1000 RPS, requests arrive every 1ms. A batch of 32 fills in ~32ms, so the size trigger fires before the timeout. The timeout only matters during low traffic.

| Traffic | 20ms Timeout | 50ms Timeout | 100ms Timeout |
|---|---|---|---|
| 1000 RPS | Timeout triggers at 20ms (20 req) | Size triggers at 32ms (32 req) | Size triggers at 32ms (32 req) |
| 100 RPS | Timeout triggers (2 req) | Timeout triggers (5 req) | Timeout triggers (10 req) |
| 10 RPS | Timeout triggers (0-1 req) | Timeout triggers (0-1 req) | Timeout triggers (1 req) |

Rule of thumb: Set timeout to ~50% of your latency budget after GPU processing. With 100ms remaining (200ms target - 100ms GPU), a 50ms timeout leaves headroom for network overhead.

Response Mapping: The Critical Design Decision

This is where most candidates fail. The HTTP connection exists between User ↔ API Server. The Batching Service cannot directly send responses through that connection.

Why is this hard?

  • User connects to API Server 1

  • Request is batched by the Batching Service and processed by a GPU worker (different process)

  • GPU worker publishes the result to a routing channel

  • How does the result get back to API Server 1's HTTP connection?

Solution: Redis Pub/Sub for Response Routing

text
Each API instance subscribes to its own channel:
  API-1 subscribes to "responses:api-1"
  API-2 subscribes to "responses:api-2"

Request includes "return_to" field identifying origin API instance

GPU workers publish each response to the correct channel
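
On the API-server side, a background task subscribes to this instance's channel and resolves the waiting futures (a sketch assuming redis-py's asyncio client and the pending_requests dict introduced earlier):

python
import json

async def response_listener():
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"responses:{INSTANCE_ID}")   # e.g. "responses:api-2"
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue                          # skip subscribe confirmations
        payload = json.loads(message["data"])
        future = pending_requests.get(payload["request_id"])
        if future and not future.done():
            # Wakes the handler coroutine that is awaiting this result
            future.set_result(payload["output"])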

Alternative: Collocated Architecture

For simpler deployments, run the batching logic within each API server:

text
┌────────────────────────────────┐     ┌─────────────────┐
│ API Server Instance            │     │   Batch Queue   │
│ ┌───────────────────────────┐  │     │   (Redis)       │
│ │ HTTP Handler              │  │     └────────┬────────┘
│ │ - Keeps connection open   │  │              │
│ └───────────┬───────────────┘  │              v
│             │                  │     ┌─────────────────┐
│ ┌───────────▼───────────────┐  │     │  GPU Workers    │
│ │ Local Batcher             │──┼────>│  (shared pool)  │
│ │ - In-memory batch         │  │     └─────────────────┘
│ │ - Direct connection ref   │<─┼──── Response via Pub/Sub
│ └───────────────────────────┘  │
└────────────────────────────────┘

Why "Medium" GPU efficiency? With 3 API instances, each forms batches independently. At 1000 RPS split evenly (~333 RPS each), each instance fills a batch of 32 in ~96ms. Meanwhile, the centralized approach aggregates all 1000 RPS and fills batches in ~32ms—fewer partially-filled batches.

Trade-off comparison:

| Approach | Complexity | Latency | GPU Efficiency | Scale Limit |
|---|---|---|---|---|
| Collocated | Low | ~120ms | Medium (independent batching) | ~5K RPS |
| Separate Service | High | ~140ms | High (global batching) | 50K+ RPS |

When to use collocated: Start with collocated for simplicity. Migrate to separate batching service when you observe GPU under-utilization due to small batch sizes across instances.

Phase 5: Scaling & Trade-offs

Addressing Non-Functional Requirements

1. Latency (P95 < 200ms)

  • Timeout-based batching caps waiting time at 50ms

  • Co-locate components in same availability zone

  • Use connection pooling to Redis

  • Monitor and alert when P95 approaches 180ms

2. Throughput (1,000 RPS)

With 6 GPUs at 70% utilization:

  • Theoretical max: 6 × 320 = 1,920 RPS

  • Sustainable: 1,920 × 0.7 = 1,344 RPS ✓

3. Availability (99.9%)

| Component | Failure Impact | Mitigation |
|---|---|---|
| Load Balancer | Total outage | Deploy redundant pair |
| API Server | Partial degradation | 3+ instances, health checks |
| Redis | Queue/routing fails | Redis Cluster with replicas |
| Batching Service | Batching stops | Multiple instances pulling from same queue |
| GPU Worker | Reduced capacity | Auto-scaling, requeue failed batches |

Batching Service resilience: Multiple batching instances can safely pull from the same request queue (BLPOP is atomic). If one crashes, others continue. Partially-formed batches in the crashed instance are lost, but those requests timeout at the API layer and users retry.

Identifying Bottlenecks

Bottleneck 1: GPU Processing (100ms fixed)

This is the irreducible bottleneck. The only solution is horizontal scaling (more GPUs).

text
RPS needed → GPUs required (same formula as above: ceil(RPS / 320) with 70% utilization headroom)
500 RPS    → 3 GPUs
1,000 RPS  → 6 GPUs
5,000 RPS  → 23 GPUs
10,000 RPS → 46 GPUs

Bottleneck 2: Connection Limits (C10K Problem)

text
At 10,000 RPS with 150ms latency:
Concurrent connections = 10,000 × 0.15 = 1,500 per LB

Solutions:
- Increase OS file descriptor limits (ulimit -n 65536)
- Horizontal scale API instances
- Consider HTTP/2 multiplexing

Bottleneck 3: Redis Throughput

text
Operations per request:
- 1 LPUSH (API → request queue)
- 1 BLPOP (batching service dequeues)
- 1 LPUSH per batch (batching → batch queue, amortized: 1/32 per request)
- 1 BLPOP per batch (GPU claims batch, amortized: 1/32 per request)
- 1 PUBLISH (GPU → response routing)

At 10,000 RPS: ~30,000 Redis ops/sec (3 per request)

Redis easily handles 100K+ ops/sec on modest hardware ✓

Deep Dive: Failure Handling

GPU crash mid-batch — If a GPU fails while processing, 32 user requests fail simultaneously. This requires explicit handling.

GPU Worker with timeout and retry:

python
async def gpu_worker_loop():
    while True:
        # BLPOP blocks until a batch is available and returns (queue_name, payload)
        _, payload = await redis.blpop("batch_queue", timeout=0)
        batch = deserialize_batch(payload)  # hypothetical helper: JSON -> Batch
        try:
            result = await asyncio.wait_for(
                asyncio.to_thread(batchstring, batch.inputs),
                timeout=0.3  # 3x the expected 100ms per batch
            )
            await publish_results(batch, result)
        except asyncio.TimeoutError:
            # GPU hung - push the original payload back so another GPU can claim it
            await redis.lpush("batch_queue", payload)
            await report_unhealthy()
            break  # Exit and let the orchestrator restart this worker
        except Exception as e:
            # Processing failed - notify all waiters with the error
            await publish_errors(batch, str(e))

Timeout handling at API layer:

python
async def handle_request(input: str):
    request_id = generate_id()
    # Future is resolved by the Pub/Sub listener when the response arrives
    pending_requests[request_id] = asyncio.get_running_loop().create_future()

    await enqueue_request(request_id, input)

    try:
        result = await asyncio.wait_for(
            pending_requests[request_id],
            timeout=5.0  # 5 second deadline
        )
        return result
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timeout")
    finally:
        # Clean up whether the request succeeded or timed out
        pending_requests.pop(request_id, None)

Trade-off Discussion: Latency vs Throughput

Aggressive batching (larger batches, longer timeout):

  • Pro: Higher GPU utilization, lower cost per request

  • Con: Higher user-perceived latency

Conservative batching (smaller batches, shorter timeout):

  • Pro: Lower latency, better user experience

  • Con: More GPU overhead, higher cost

Adaptive approach:

python
def calculate_batch_params(queue_depth, gpu_utilization):
    if gpu_utilization < 0.5 and queue_depth < 10:
        # Under-utilized: prioritize latency
        return BatchParams(size=16, timeout_ms=30)
    elif queue_depth > 100 or gpu_utilization > 0.85:
        # Overloaded: prioritize throughput
        return BatchParams(size=64, timeout_ms=100)
    else:
        # Normal: balanced
        return BatchParams(size=32, timeout_ms=50)

Auto-Scaling Strategy

| Signal | Threshold | Action |
|---|---|---|
| Queue depth > 500 | 30 seconds | Add 3 GPUs |
| Queue depth > 100 | 60 seconds | Add 1 GPU |
| GPU utilization > 85% | 5 minutes | Add 1 GPU |
| GPU utilization < 40% | 15 minutes | Remove 1 GPU |

GPU cold start time (1-5 minutes) means aggressive scaling-up is necessary. Scale down conservatively to avoid thrashing.
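
As a sketch, the table above could be expressed as a periodic control decision (the thresholds and counts come straight from the table; how long each signal has been breached is assumed to be tracked elsewhere and passed in):

python
def scaling_decision(queue_depth: int, gpu_utilization: float,
                     breach_duration_s: float) -> int:
    """Return the number of GPUs to add (positive) or remove (negative)."""
    if queue_depth > 500 and breach_duration_s >= 30:
        return 3            # severe backlog: scale up aggressively
    if queue_depth > 100 and breach_duration_s >= 60:
        return 1
    if gpu_utilization > 0.85 and breach_duration_s >= 5 * 60:
        return 1
    if gpu_utilization < 0.40 and breach_duration_s >= 15 * 60:
        return -1           # scale down conservatively to avoid thrashing
    return 0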

Interview Checklist

Requirements Phase:

  • Clarified latency SLA (P95 target)

  • Confirmed throughput requirements (RPS)

  • Asked about user priority tiers

  • Verified GPU processing constraints

Data Model Phase:

  • Identified transient vs persistent data

  • Request tracking structure defined

  • Batch formation structure defined

API Phase:

  • Single endpoint with clear contract

  • Error codes for all failure modes

  • Timeout behavior specified

High-Level Design Phase:

  • Complete request flow drawn

  • Response routing mechanism explained

  • Batching strategy with timeout

  • GPU assignment logic

Scaling Phase:

  • GPU count calculation shown

  • Connection limits addressed

  • Failure handling for GPU crashes

  • Auto-scaling triggers defined

Summary

| Aspect | Decision | Rationale |
|---|---|---|
| Architecture | Separate batching service | Better GPU utilization at scale |
| Batching trigger | Size OR timeout (50ms) | Balances latency and throughput |
| Response routing | Redis Pub/Sub per API instance | Decouples API from batching |
| GPU assignment | Pull-based from batch queue | Natural backpressure, race-free |
| Failure handling | Requeue batch, timeout at API | Transparent recovery without data loss |
| Scaling signal | Queue depth + utilization | Proactive scaling before SLA breach |

This design handles 1,000 RPS with ~140ms average latency using 6 GPUs, meeting all functional and non-functional requirements while remaining operationally simple.