
Batch Processing Service

Role: Software Engineer


Problem Statement

Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.

Given Function Signature

You are provided with a fixed backend function that you cannot modify:

python
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    # Fixed implementation - you cannot modify this
    pass

Core Challenge

How do you design a service that:

  • Accepts individual synchronous HTTP requests from users

  • Aggregates them into batches internally

  • Routes batches to available GPU workers

  • Maps responses back to the original requesters

  • Maintains low latency while maximizing throughput

Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements

Functional Requirements

Frame requirements as user capabilities:

  • Submit inference requests — Users should be able to send a single string input via HTTP and receive a processed string output

  • Synchronous response — Users should receive responses in the same HTTP connection (no polling or callbacks)

  • High concurrency — System should handle thousands of simultaneous users without degradation

Keep functional requirements minimal for this problem. The complexity lies in the internal batching mechanism, not user-facing features.

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Latency | P95 < 200ms | Real-time inference for interactive applications |
| Throughput | 1,000 RPS | Moderate scale for initial deployment |
| Availability | 99.9% | Standard SLA for production APIs |
| GPU Utilization | 70-80% | Balance efficiency with headroom for spikes |

The 100ms fixed GPU processing time is the irreducible minimum. All other latency (batching, routing, network) must fit within the remaining ~100ms budget.

Capacity Estimation

GPU count calculation:

text
Given:
- Target: 1,000 RPS
- GPU processing: 100ms per batch (10 batches/sec/GPU)
- Target batch size: 32 requests

Throughput per GPU = 10 batches/sec × 32 requests/batch = 320 RPS

Raw GPUs needed = 1,000 RPS / 320 RPS = 3.125 → 4 GPUs

With 70% utilization headroom = 4 / 0.7 ≈ 6 GPUs
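
The same arithmetic as a small helper, for sanity-checking different targets (a sketch only; the 320 RPS per GPU and 70% utilization figures are the assumptions above):

python
import math

def gpus_required(target_rps: int, batch_size: int = 32,
                  batch_latency_s: float = 0.1,
                  target_utilization: float = 0.7) -> int:
    """Back-of-the-envelope GPU count for a given request rate."""
    rps_per_gpu = batch_size / batch_latency_s      # 32 / 0.1s = 320 RPS per GPU
    raw = math.ceil(target_rps / rps_per_gpu)       # GPUs at 100% utilization
    return math.ceil(raw / target_utilization)      # leave headroom for spikes

# gpus_required(1_000) -> 6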

Latency breakdown:

text
Average batching delay: ~16ms (half of 32ms to fill batch at 1000 RPS)
Network overhead: ~10ms
GPU processing: 100ms
───────────────────────
Total average: ~126ms ✓ (under 200ms target)

Concurrent connections:

text
Connections = RPS × Average latency = 1,000 × 0.126s = 126 concurrent

Well within default OS limits (can support 10K+ with tuning)

Phase 2: Data Model

Core Entities

text
InferenceRequest
├── request_id: UUID (unique identifier)
├── input: string (user's input text)
├── timestamp: datetime (arrival time)
├── return_to: string (API instance identifier)
└── status: enum (pending, processing, completed, failed)

Batch
├── batch_id: UUID
├── requests: list[InferenceRequest] (1-100 items)
├── created_at: datetime
└── gpu_id: string (assigned GPU)

InferenceResponse
├── request_id: UUID (maps to original request)
├── output: string (processed result)
└── latency_ms: int (total processing time)
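
A minimal sketch of these entities as Python dataclasses (field names mirror the trees above; the defaults and status enum values are illustrative):

python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from uuid import uuid4

class RequestStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class InferenceRequest:
    input: str                       # user's input text
    return_to: str                   # API instance holding the open connection
    request_id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = field(default_factory=datetime.utcnow)
    status: RequestStatus = RequestStatus.PENDING

@dataclass
class Batch:
    requests: list[InferenceRequest]          # 1-100 items
    gpu_id: str | None = None                 # set when a GPU claims the batch
    batch_id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=datetime.utcnow)

@dataclass
class InferenceResponse:
    request_id: str      # maps back to the original request
    output: str          # processed result
    latency_ms: int      # total processing time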

Data Locality

| Data | Storage | Lifetime |
|---|---|---|
| Pending requests | In-memory (API server) | Until response received |
| Request queue | Redis List | Until batched |
| Batch queue | Redis List | Until GPU claims it |
| Response routing | Redis Pub/Sub | Ephemeral |

This system is largely stateless—no durable database needed. All state is transient and tied to in-flight requests.

Phase 3: API Design

Protocol Choice: REST

REST is appropriate because:

  • Simple request-response model

  • Standard HTTP semantics

  • Easy to integrate with any client

Endpoints

Submit Inference Request

http
POST /api/inference
Content-Type: application/json

Request:
{
  "input": "E equals "
}

Response (200 OK):
{
  "output": "E equals mc^2",
  "request_id": "req_abc123",
  "latency_ms": 142
}

Response (503 Service Unavailable):
{
  "error": "SERVICE_OVERLOADED",
  "message": "Too many pending requests",
  "retry_after_ms": 5000
}
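
For illustration, a client call in Python (a sketch using the requests library and a placeholder host):

python
import requests

resp = requests.post(
    "https://api.example.com/api/inference",
    json={"input": "E equals "},
    timeout=5,  # client-side deadline matching the server's 5s cutoff
)

if resp.status_code == 200:
    print(resp.json()["output"])                # e.g. "E equals mc^2"
elif resp.status_code == 503:
    backoff_ms = resp.json()["retry_after_ms"]  # wait before retrying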

Error Codes

| Code | Meaning | When |
|---|---|---|
| 200 | Success | Request processed |
| 400 | Bad Request | Invalid input format |
| 429 | Rate Limited | User exceeded quota |
| 503 | Overloaded | Queue full, reject fast |
| 504 | Timeout | Request exceeded deadline |

Phase 4: High-Level Design

Architecture

Request Flow

Step 1: Request Arrival

text
User → Load Balancer → API Server 2

API Server 2:
1. Generate unique request_id: "req_xyz"
2. Store connection: pending_requests["req_xyz"] = http_connection
3. Push to Redis queue: {request_id, input, return_to: "api-2"}
4. Wait for response (keep connection open)
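
A sketch of step 3, the enqueue call (assumes an asyncio Redis client named redis; the queue name and payload fields mirror the flow above, and the pending_requests bookkeeping appears in the API-layer code under Failure Handling):

python
import json

INSTANCE_ID = "api-2"   # identity of this API server instance

async def enqueue_request(request_id: str, input: str):
    """Push the request onto the shared Redis queue for the batching service."""
    await redis.lpush("request_queue", json.dumps({
        "request_id": request_id,
        "input": input,
        "return_to": INSTANCE_ID,   # tells GPU workers which response channel to use
    }))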

Step 2: Batch Formation

text
Batching Service:
1. BLPOP from Redis request queue (blocks until available)
2. Accumulate requests into current_batch
3. Trigger when: batch.size == 32 OR elapsed_time > 50ms
4. Push formed batch to Redis batch queue

Step 3: GPU Processing (Pull-Based)

text
GPU Worker (runs in a loop):
1. BLPOP from batch queue (blocks until batch available)
2. Execute batchstring(inputs) → 100ms
3. Publish results directly to Redis Pub/Sub

Why pull-based? GPUs claim batches atomically via BLPOP—no race conditions, no need to track GPU availability. When a GPU finishes, it simply pulls the next batch.

Step 4: Response Routing

text
GPU Worker (after processing):
For each (request_id, output) in results:
  Publish to Redis channel "responses:api-2":
    {request_id: "req_xyz", output: "..."}

API Server 2:
1. Subscribed to "responses:api-2"
2. Receive message for request_id "req_xyz"
3. Lookup: connection = pending_requests["req_xyz"]
4. Send HTTP response through connection
5. Delete from pending_requests

Batching Strategy

Timeout-based batching is critical. Pure size-based batching causes unacceptable latency during low traffic periods.

python
import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # set when the first request enters the batch

    def elapsed_ms(self):
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True  # Size trigger
        if self.batch_start_time and self.elapsed_ms() > self.timeout_ms:
            return True  # Timeout trigger
        return False
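
One way to drive this class end-to-end (a sketch assuming an asyncio Redis client named redis; requests are pulled off request_queue and flushed to batch_queue as JSON):

python
import json
import time

async def batching_loop(svc: BatchingService):
    while True:
        # Wait briefly for the next request so the timeout trigger can still fire
        item = await redis.blpop("request_queue", timeout=0.01)
        if item:
            _, payload = item
            if not svc.current_batch:
                svc.batch_start_time = time.monotonic()  # first request starts the clock
            svc.current_batch.append(json.loads(payload))

        if svc.current_batch and svc.should_send_batch():
            await redis.lpush("batch_queue", json.dumps(svc.current_batch))
            svc.current_batch = []
            svc.batch_start_time = None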

Trade-off: Timeout Selection

At 1000 RPS, requests arrive every 1ms. A batch of 32 fills in ~32ms, so the size trigger fires before the timeout. The timeout only matters during low traffic.

| Traffic | 20ms Timeout | 50ms Timeout | 100ms Timeout |
|---|---|---|---|
| 1000 RPS | Timeout triggers at 20ms (20 req) | Size triggers at 32ms (32 req) | Size triggers at 32ms (32 req) |
| 100 RPS | Timeout triggers (2 req) | Timeout triggers (5 req) | Timeout triggers (10 req) |
| 10 RPS | Timeout triggers (0-1 req) | Timeout triggers (0-1 req) | Timeout triggers (1 req) |

Rule of thumb: Set timeout to ~50% of your latency budget after GPU processing. With 100ms remaining (200ms target - 100ms GPU), a 50ms timeout leaves headroom for network overhead.

Response Mapping: The Critical Design Decision

This is where most candidates fail. The HTTP connection exists between User ↔ API Server. The Batching Service cannot directly send responses through that connection.

Why is this hard?

  • User connects to API Server 1

  • Request is batched by the Batching Service and processed by a GPU worker (different process)

  • GPU worker publishes the result to a routing channel

  • How does the result get back to API Server 1's HTTP connection?

Solution: Redis Pub/Sub for Response Routing

text
Each API instance subscribes to its own channel:
  API-1 subscribes to "responses:api-1"
  API-2 subscribes to "responses:api-2"

Request includes "return_to" field identifying origin API instance

GPU workers publish each response to the correct channel
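
On the API-server side, a background task subscribes to this instance's channel and resolves the waiting futures (a sketch assuming redis-py's asyncio client and the pending_requests dict introduced earlier):

python
import json

async def response_listener():
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"responses:{INSTANCE_ID}")   # e.g. "responses:api-2"
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue                          # skip subscribe confirmations
        payload = json.loads(message["data"])
        future = pending_requests.get(payload["request_id"])
        if future and not future.done():
            # Wakes the handler coroutine that is awaiting this result
            future.set_result(payload["output"])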

Alternative: Collocated Architecture

For simpler deployments, run the batching logic within each API server:

text
┌────────────────────────────────┐     ┌─────────────────┐
│ API Server Instance            │     │   Batch Queue   │
│ ┌───────────────────────────┐  │     │   (Redis)       │
│ │ HTTP Handler              │  │     └────────┬────────┘
│ │ - Keeps connection open   │  │              │
│ └───────────┬───────────────┘  │              v
│             │                  │     ┌─────────────────┐
│ ┌───────────▼───────────────┐  │     │  GPU Workers    │
│ │ Local Batcher             │──┼────>│  (shared pool)  │
│ │ - In-memory batch         │  │     └─────────────────┘
│ │ - Direct connection ref   │<─┼──── Response via Pub/Sub
│ └───────────────────────────┘  │
└────────────────────────────────┘

Why "Medium" GPU efficiency? With 3 API instances, each forms batches independently. At 1000 RPS split evenly (~333 RPS each), each instance fills a batch of 32 in ~96ms. Meanwhile, the centralized approach aggregates all 1000 RPS and fills batches in ~32ms—fewer partially-filled batches.

Trade-off comparison:

| Approach | Complexity | Latency | GPU Efficiency | Scale Limit |
|---|---|---|---|---|
| Collocated | Low | ~120ms | Medium (independent batching) | ~5K RPS |
| Separate Service | High | ~140ms | High (global batching) | 50K+ RPS |

When to use collocated: Start with collocated for simplicity. Migrate to separate batching service when you observe GPU under-utilization due to small batch sizes across instances.

Phase 5: Scaling & Trade-offs

Addressing Non-Functional Requirements

1. Latency (P95 < 200ms)

  • Timeout-based batching caps waiting time at 50ms

  • Co-locate components in same availability zone

  • Use connection pooling to Redis

  • Monitor and alert when P95 approaches 180ms

2. Throughput (1,000 RPS)

With 6 GPUs at 70% utilization:

  • Theoretical max: 6 × 320 = 1,920 RPS

  • Sustainable: 1,920 × 0.7 = 1,344 RPS ✓

3. Availability (99.9%)

| Component | Failure Impact | Mitigation |
|---|---|---|
| Load Balancer | Total outage | Deploy redundant pair |
| API Server | Partial degradation | 3+ instances, health checks |
| Redis | Queue/routing fails | Redis Cluster with replicas |
| Batching Service | Batching stops | Multiple instances pulling from same queue |
| GPU Worker | Reduced capacity | Auto-scaling, requeue failed batches |

Batching Service resilience: Multiple batching instances can safely pull from the same request queue (BLPOP is atomic). If one crashes, others continue. Partially-formed batches in the crashed instance are lost, but those requests timeout at the API layer and users retry.

Identifying Bottlenecks

Bottleneck 1: GPU Processing (100ms fixed)

This is the irreducible bottleneck. The only solution is horizontal scaling (more GPUs).

text
RPS needed → GPUs required (same formula as above: ceil(RPS / 320) with 70% utilization headroom)
500 RPS    → 3 GPUs
1,000 RPS  → 6 GPUs
5,000 RPS  → 23 GPUs
10,000 RPS → 46 GPUs

Bottleneck 2: Connection Limits (C10K Problem)

text
At 10,000 RPS with 150ms latency:
Concurrent connections = 10,000 × 0.15 = 1,500 per LB

Solutions:
- Increase OS file descriptor limits (ulimit -n 65536)
- Horizontal scale API instances
- Consider HTTP/2 multiplexing

Bottleneck 3: Redis Throughput

text
Operations per request:
- 1 LPUSH (API → request queue)
- 1 BLPOP (batching service dequeues)
- 1 LPUSH per batch (batching → batch queue, amortized: 1/32 per request)
- 1 BLPOP per batch (GPU claims batch, amortized: 1/32 per request)
- 1 PUBLISH (GPU → response routing)

At 10,000 RPS: ~30,000 Redis ops/sec (3 per request)

Redis easily handles 100K+ ops/sec on modest hardware ✓

Deep Dive: Failure Handling

GPU crash mid-batch — If a GPU fails while processing, 32 user requests fail simultaneously. This requires explicit handling.

GPU Worker with timeout and retry:

python
async def gpu_worker_loop():
    while True:
        # BLPOP blocks until a batch is available and returns (queue_name, payload)
        _, payload = await redis.blpop("batch_queue", timeout=0)
        batch = deserialize_batch(payload)  # hypothetical helper: JSON -> Batch
        try:
            result = await asyncio.wait_for(
                asyncio.to_thread(batchstring, batch.inputs),
                timeout=0.3  # 3x the expected 100ms per batch
            )
            await publish_results(batch, result)
        except asyncio.TimeoutError:
            # GPU hung - push the original payload back so another GPU can claim it
            await redis.lpush("batch_queue", payload)
            await report_unhealthy()
            break  # Exit and let the orchestrator restart this worker
        except Exception as e:
            # Processing failed - notify all waiters with the error
            await publish_errors(batch, str(e))

Timeout handling at API layer:

python
async def handle_request(input: str):
    request_id = generate_id()
    # Future is resolved by the Pub/Sub listener when the response arrives
    pending_requests[request_id] = asyncio.get_running_loop().create_future()

    await enqueue_request(request_id, input)

    try:
        result = await asyncio.wait_for(
            pending_requests[request_id],
            timeout=5.0  # 5 second deadline
        )
        return result
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timeout")
    finally:
        # Clean up whether the request succeeded or timed out
        pending_requests.pop(request_id, None)

Trade-off Discussion: Latency vs Throughput

Aggressive batching (larger batches, longer timeout):

  • Pro: Higher GPU utilization, lower cost per request

  • Con: Higher user-perceived latency

Conservative batching (smaller batches, shorter timeout):

  • Pro: Lower latency, better user experience

  • Con: More GPU overhead, higher cost

Adaptive approach:

python
def calculate_batch_params(queue_depth, gpu_utilization):
    if gpu_utilization < 0.5 and queue_depth < 10:
        # Under-utilized: prioritize latency
        return BatchParams(size=16, timeout_ms=30)
    elif queue_depth > 100 or gpu_utilization > 0.85:
        # Overloaded: prioritize throughput
        return BatchParams(size=64, timeout_ms=100)
    else:
        # Normal: balanced
        return BatchParams(size=32, timeout_ms=50)

Auto-Scaling Strategy

| Signal | Threshold | Action |
|---|---|---|
| Queue depth > 500 | 30 seconds | Add 3 GPUs |
| Queue depth > 100 | 60 seconds | Add 1 GPU |
| GPU utilization > 85% | 5 minutes | Add 1 GPU |
| GPU utilization < 40% | 15 minutes | Remove 1 GPU |

GPU cold start time (1-5 minutes) means aggressive scaling-up is necessary. Scale down conservatively to avoid thrashing.
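
As a sketch, the table above could be expressed as a periodic control decision (the thresholds and counts come straight from the table; how long each signal has been breached is assumed to be tracked elsewhere and passed in):

python
def scaling_decision(queue_depth: int, gpu_utilization: float,
                     breach_duration_s: float) -> int:
    """Return the number of GPUs to add (positive) or remove (negative)."""
    if queue_depth > 500 and breach_duration_s >= 30:
        return 3            # severe backlog: scale up aggressively
    if queue_depth > 100 and breach_duration_s >= 60:
        return 1
    if gpu_utilization > 0.85 and breach_duration_s >= 5 * 60:
        return 1
    if gpu_utilization < 0.40 and breach_duration_s >= 15 * 60:
        return -1           # scale down conservatively to avoid thrashing
    return 0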

Interview Checklist

Requirements Phase:

  • Clarified latency SLA (P95 target)

  • Confirmed throughput requirements (RPS)

  • Asked about user priority tiers

  • Verified GPU processing constraints

Data Model Phase:

  • Identified transient vs persistent data

  • Request tracking structure defined

  • Batch formation structure defined

API Phase:

  • Single endpoint with clear contract

  • Error codes for all failure modes

  • Timeout behavior specified

High-Level Design Phase:

  • Complete request flow drawn

  • Response routing mechanism explained

  • Batching strategy with timeout

  • GPU assignment logic

Scaling Phase:

  • GPU count calculation shown

  • Connection limits addressed

  • Failure handling for GPU crashes

  • Auto-scaling triggers defined

Summary

| Aspect | Decision | Rationale |
|---|---|---|
| Architecture | Separate batching service | Better GPU utilization at scale |
| Batching trigger | Size OR timeout (50ms) | Balances latency and throughput |
| Response routing | Redis Pub/Sub per API instance | Decouples API from batching |
| GPU assignment | Pull-based from batch queue | Natural backpressure, race-free |
| Failure handling | Requeue batch, timeout at API | Transparent recovery without data loss |
| Scaling signal | Queue depth + utilization | Proactive scaling before SLA breach |

This design handles 1,000 RPS with ~140ms average latency using 6 GPUs, meeting all functional and non-functional requirements while remaining operationally simple.