Batch Processing Service
Role: Software Engineer
Problem Statement
Design an HTTP API that exposes a batch processing function for large language model inference. Individual users make single synchronous requests, but internally the system must batch these requests together for efficient GPU processing.
Given Function Signature
You are provided with a fixed backend function that you cannot modify:
def batchstring(inputs: list[str]) -> list[str]:
    """
    Processes a batch of string inputs and returns string outputs.

    Constraints:
    - Input size: 1-100 strings per batch
    - Output size: 1-100 strings (one per input)
    - Latency: ~100ms per batch (fixed, regardless of batch size within limits)
    - Concurrency: Each GPU instance can only process ONE batch at a time
    """
    # Fixed implementation - you cannot modify this
    pass

Core Challenge
How do you design a service that:
- Accepts individual synchronous HTTP requests from users
- Aggregates them into batches internally
- Routes batches to available GPU workers
- Maps responses back to the original requesters
- Maintains low latency while maximizing throughput
Related question: Inference API System Design. That question gives you an existing API and focuses on operational infrastructure — priority queues, rate limiting, and auto-scaling. This question gives you only a bare function and focuses on the core mechanics — how to collect individual HTTP requests into batches and route GPU responses back to the correct waiting connections.
Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.
Phase 1: Requirements
Functional Requirements
Frame requirements as user capabilities:
- Submit inference requests — Users should be able to send a single string input via HTTP and receive a processed string output
- Synchronous response — Users should receive responses in the same HTTP connection (no polling or callbacks)
- High concurrency — System should handle thousands of simultaneous users without degradation
Keep functional requirements minimal for this problem. The complexity lies in the internal batching mechanism, not user-facing features.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Latency | P95 < 200ms | Real-time inference for interactive applications |
| Throughput | 1,000 RPS | Moderate scale for initial deployment |
| Availability | 99.9% | Standard SLA for production APIs |
| GPU Utilization | 70-80% | Balance efficiency with headroom for spikes |
The 100ms fixed GPU processing time is the irreducible minimum. All other latency (batching, routing, network) must fit within the remaining ~100ms budget.
Capacity Estimation
GPU count calculation:
Given:
- Target: 1,000 RPS
- GPU processing: 100ms per batch (10 batches/sec/GPU)
- Target batch size: 32 requests
Throughput per GPU = 10 batches/sec × 32 requests/batch = 320 RPS
Raw GPUs needed = 1,000 RPS / 320 RPS = 3.125 → 4 GPUs
With 70% utilization headroom = 4 / 0.7 ≈ 6 GPUs

Latency breakdown:
Average batching delay: ~16ms (half of 32ms to fill batch at 1000 RPS)
Network overhead: ~10ms
GPU processing: 100ms
───────────────────────
Total average: ~126ms ✓ (under 200ms target)

Concurrent connections:
Connections = RPS × Average latency = 1,000 × 0.126s = 126 concurrent
Well within default OS limits (can support 10K+ with tuning)
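The same arithmetic as a small Python sketch (the 32-request batch size, ~10ms network overhead, and 70% utilization target are the assumptions stated above):

import math

TARGET_RPS = 1_000
BATCH_LATENCY_S = 0.100       # fixed GPU time per batch
BATCH_SIZE = 32               # target requests per batch
UTILIZATION_TARGET = 0.7      # leave headroom for spikes

rps_per_gpu = (1 / BATCH_LATENCY_S) * BATCH_SIZE            # 10 batches/s x 32 = 320
raw_gpus = math.ceil(TARGET_RPS / rps_per_gpu)              # 4
gpus_needed = math.ceil(raw_gpus / UTILIZATION_TARGET)      # 6

avg_batching_delay_s = (BATCH_SIZE / TARGET_RPS) / 2        # half the ~32ms fill time
avg_latency_ms = (avg_batching_delay_s + 0.010 + BATCH_LATENCY_S) * 1000
print(gpus_needed, round(avg_latency_ms))                   # 6 GPUs, ~126ms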
Phase 2: Data Model
Core Entities
InferenceRequest
├── request_id: UUID (unique identifier)
├── input: string (user's input text)
├── timestamp: datetime (arrival time)
├── return_to: string (API instance identifier)
└── status: enum (pending, processing, completed, failed)
Batch
├── batch_id: UUID
├── requests: list[InferenceRequest] (1-100 items)
├── created_at: datetime
└── gpu_id: string (assigned GPU)
InferenceResponse
├── request_id: UUID (maps to original request)
├── output: string (processed result)
└── latency_ms: int (total processing time)

Data Locality
| Data | Storage | Lifetime |
|---|---|---|
| Pending requests | In-memory (API server) | Until response received |
| Request queue | Redis List | Until batched |
| Batch queue | Redis List | Until GPU claims it |
| Response routing | Redis Pub/Sub | Ephemeral |
This system is largely stateless—no durable database needed. All state is transient and tied to in-flight requests.
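For concreteness, a minimal sketch of these transient entities as Python dataclasses (field names follow the model above; nothing here is persisted):

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from uuid import uuid4

class RequestStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class InferenceRequest:
    input: str
    return_to: str                       # API instance that owns the HTTP connection
    request_id: str = field(default_factory=lambda: str(uuid4()))
    timestamp: datetime = field(default_factory=datetime.utcnow)
    status: RequestStatus = RequestStatus.PENDING

@dataclass
class Batch:
    requests: list[InferenceRequest]     # 1-100 items
    gpu_id: str | None = None            # assigned when a GPU claims the batch
    batch_id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=datetime.utcnow)

@dataclass
class InferenceResponse:
    request_id: str
    output: str
    latency_ms: int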
Phase 3: API Design
Protocol Choice: REST
REST is appropriate because:
- Simple request-response model
- Standard HTTP semantics
- Easy to integrate with any client
Endpoints
Submit Inference Request
POST /api/inference
Content-Type: application/json
Request:
{
"input": "E equals "
}
Response (200 OK):
{
"output": "E equals mc^2",
"request_id": "req_abc123",
"latency_ms": 142
}
Response (503 Service Unavailable):
{
"error": "SERVICE_OVERLOADED",
"message": "Too many pending requests",
"retry_after_ms": 5000
}

Error Codes
| Code | Meaning | When |
|---|---|---|
| 200 | Success | Request processed |
| 400 | Bad Request | Invalid input format |
| 429 | Rate Limited | User exceeded quota |
| 503 | Overloaded | Queue full, reject fast |
| 504 | Timeout | Request exceeded deadline |
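A hedged client-side example of this contract (the hostname is a placeholder; any HTTP client works the same way):

import requests

resp = requests.post(
    "https://api.example.com/api/inference",   # placeholder host
    json={"input": "E equals "},
    timeout=5,
)
if resp.status_code == 200:
    body = resp.json()
    print(body["output"], body["latency_ms"])  # e.g. "E equals mc^2", 142
elif resp.status_code == 503:
    retry_after_ms = resp.json()["retry_after_ms"]  # back off before retrying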
Phase 4: High-Level Design
Architecture
Request Flow
Step 1: Request Arrival
User → Load Balancer → API Server 2
API Server 2:
1. Generate unique request_id: "req_xyz"
2. Store connection: pending_requests["req_xyz"] = http_connection
3. Push to Redis queue: {request_id, input, return_to: "api-2"}
4. Wait for response (keep connection open)

Step 2: Batch Formation
Batching Service:
1. BLPOP from Redis request queue (blocks until available)
2. Accumulate requests into current_batch
3. Trigger when: batch.size == 32 OR elapsed_time > 50ms
4. Push formed batch to Redis batch queue

Step 3: GPU Processing (Pull-Based)
GPU Worker (runs in a loop):
1. BLPOP from batch queue (blocks until batch available)
2. Execute batchstring(inputs) → 100ms
3. Publish results directly to Redis Pub/Sub

Why pull-based? GPUs claim batches atomically via BLPOP—no race conditions, no need to track GPU availability. When a GPU finishes, it simply pulls the next batch.
Step 4: Response Routing
GPU Worker (after processing):
For each (request_id, output) in results:
Publish to Redis channel "responses:api-2":
{request_id: "req_xyz", output: "..."}
API Server 2:
1. Subscribed to "responses:api-2"
2. Receive message for request_id "req_xyz"
3. Lookup: connection = pending_requests["req_xyz"]
4. Send HTTP response through connection
5. Delete from pending_requests
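A sketch of the API-server side of this step, using asyncio futures and a redis-py Pub/Sub listener (the channel name and the pending_requests map follow the flow above; the details are illustrative, not prescriptive):

import asyncio
import json
import redis.asyncio as aioredis

API_INSTANCE_ID = "api-2"                        # this server's identity
pending_requests: dict[str, asyncio.Future] = {}
redis = aioredis.Redis()

async def response_listener():
    # Each API instance subscribes only to its own response channel.
    pubsub = redis.pubsub()
    await pubsub.subscribe(f"responses:{API_INSTANCE_ID}")
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue
        payload = json.loads(message["data"])
        future = pending_requests.pop(payload["request_id"], None)
        if future is not None and not future.done():
            # Resolving the future lets the waiting HTTP handler send
            # the response on the still-open connection.
            future.set_result(payload["output"])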
Batching Strategy
Timeout-based batching is critical. Pure size-based batching causes unacceptable latency during low traffic periods.
import time

class BatchingService:
    def __init__(self, batch_size=32, timeout_ms=50):
        self.batch_size = batch_size
        self.timeout_ms = timeout_ms
        self.current_batch = []
        self.batch_start_time = None  # set when the first request of a batch arrives

    def elapsed_ms(self):
        return (time.monotonic() - self.batch_start_time) * 1000

    def should_send_batch(self):
        if len(self.current_batch) >= self.batch_size:
            return True   # Size trigger
        if self.batch_start_time and self.elapsed_ms() > self.timeout_ms:
            return True   # Timeout trigger
        return False
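A sketch of the loop that drives this class, assuming a synchronous redis-py client and JSON-encoded requests on the queue (queue names match the flow above; fractional BLPOP timeouts need Redis 6+):

import json
import time
import redis as redis_lib

r = redis_lib.Redis()

def batching_loop(svc: BatchingService):
    while True:
        # Block briefly for the next request; the short timeout lets the
        # time-based trigger fire even when no new requests arrive.
        item = r.blpop("request_queue", timeout=0.01)
        if item is not None:
            _, payload = item
            if not svc.current_batch:
                svc.batch_start_time = time.monotonic()
            svc.current_batch.append(json.loads(payload))
        if svc.current_batch and svc.should_send_batch():
            r.lpush("batch_queue", json.dumps({
                "inputs": [req["input"] for req in svc.current_batch],
                "requests": svc.current_batch,   # carries request_id and return_to
            }))
            svc.current_batch = []
            svc.batch_start_time = None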
Trade-off: Timeout Selection
At 1000 RPS, requests arrive every 1ms. A batch of 32 fills in ~32ms, so the size trigger fires before the timeout. The timeout only matters during low traffic.
| Traffic | 20ms Timeout | 50ms Timeout | 100ms Timeout |
|---|---|---|---|
| 1000 RPS | Timeout triggers at 20ms (20 req) | Size triggers at 32ms (32 req) | Size triggers at 32ms (32 req) |
| 100 RPS | Timeout triggers (2 req) | Timeout triggers (5 req) | Timeout triggers (10 req) |
| 10 RPS | Timeout triggers (0-1 req) | Timeout triggers (0-1 req) | Timeout triggers (1 req) |
Rule of thumb: Set timeout to ~50% of your latency budget after GPU processing. With 100ms remaining (200ms target - 100ms GPU), a 50ms timeout leaves headroom for network overhead.
Response Mapping: The Critical Design Decision
This is where most candidates fail. The HTTP connection exists between User ↔ API Server. The Batching Service cannot directly send responses through that connection.
Why is this hard?
- User connects to API Server 1
- Request is batched by the Batching Service and processed by a GPU worker (different process)
- GPU worker publishes the result to a routing channel
- How does the result get back to API Server 1's HTTP connection?
Solution: Redis Pub/Sub for Response Routing
Each API instance subscribes to its own channel:
API-1 subscribes to "responses:api-1"
API-2 subscribes to "responses:api-2"
Request includes "return_to" field identifying origin API instance
GPU workers publish each response to the correct channel
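A sketch of the publishing side on the GPU worker, using redis-py's async client (it assumes each batch entry carries the request_id and return_to fields from the request flow; this is the shape of the publish_results helper referenced in the failure-handling code later):

import json
import redis.asyncio as aioredis

redis = aioredis.Redis()

async def publish_results(batch: dict, outputs: list[str]):
    # One output per input, in order; route each result to the API
    # instance that still holds the original HTTP connection.
    for request, output in zip(batch["requests"], outputs):
        channel = f"responses:{request['return_to']}"   # e.g. "responses:api-2"
        await redis.publish(channel, json.dumps({
            "request_id": request["request_id"],
            "output": output,
        }))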
Alternative: Collocated Architecture
For simpler deployments, run the batching logic within each API server:
┌────────────────────────────────┐ ┌─────────────────┐
│ API Server Instance │ │ Batch Queue │
│ ┌───────────────────────────┐ │ │ (Redis) │
│ │ HTTP Handler │ │ └────────┬────────┘
│ │ - Keeps connection open │ │ │
│ └───────────┬───────────────┘ │ v
│ │ │ ┌─────────────────┐
│ ┌───────────▼───────────────┐ │ │ GPU Workers │
│ │ Local Batcher │──┼────>│ (shared pool) │
│ │ - In-memory batch │ │ └─────────────────┘
│ │ - Direct connection ref │<─┼──── Response via Pub/Sub
│ └───────────────────────────┘ │
└────────────────────────────────┘

Why "Medium" GPU efficiency? With 3 API instances, each forms batches independently. At 1000 RPS split evenly (~333 RPS each), each instance fills a batch of 32 in ~96ms. Meanwhile, the centralized approach aggregates all 1000 RPS and fills batches in ~32ms—fewer partially-filled batches.
Trade-off comparison:
| Approach | Complexity | Latency | GPU Efficiency | Scale Limit |
|---|---|---|---|---|
| Collocated | Low | ~120ms | Medium (independent batching) | ~5K RPS |
| Separate Service | High | ~140ms | High (global batching) | 50K+ RPS |
When to use collocated: Start with collocated for simplicity. Migrate to a separate batching service when you observe GPU under-utilization due to small batch sizes across instances.
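As a rough illustration of the collocated variant (names are illustrative): batching lives in-process, the HTTP handler awaits a future, and a background task drains an asyncio queue into batches that still go to the shared GPU batch queue in Redis.

import asyncio
import json

local_queue: asyncio.Queue = asyncio.Queue()
pending: dict[str, asyncio.Future] = {}

async def handle(request_id: str, text: str) -> str:
    pending[request_id] = asyncio.get_running_loop().create_future()
    await local_queue.put({"request_id": request_id, "input": text})
    return await pending[request_id]     # resolved when the GPU response arrives

async def local_batcher(redis, batch_size=32, timeout_s=0.05):
    while True:
        batch = [await local_queue.get()]            # wait for the first request
        deadline = asyncio.get_running_loop().time() + timeout_s
        while len(batch) < batch_size:
            remaining = deadline - asyncio.get_running_loop().time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(local_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await redis.lpush("batch_queue", json.dumps({
            "inputs": [r["input"] for r in batch],
            "requests": batch,
        }))

Responses still come back over the per-instance Pub/Sub channel and resolve the pending futures, exactly as in the separate-service design.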
Phase 5: Scaling & Trade-offs
Addressing Non-Functional Requirements
1. Latency (P95 < 200ms)
- Timeout-based batching caps waiting time at 50ms
- Co-locate components in same availability zone
- Use connection pooling to Redis
- Monitor and alert when P95 approaches 180ms
2. Throughput (1,000 RPS)
With 6 GPUs at 70% utilization:
- Theoretical max: 6 × 320 = 1,920 RPS
- Sustainable: 1,920 × 0.7 = 1,344 RPS ✓
3. Availability (99.9%)
| Component | Failure Impact | Mitigation |
|---|---|---|
| Load Balancer | Total outage | Deploy redundant pair |
| API Server | Partial degradation | 3+ instances, health checks |
| Redis | Queue/routing fails | Redis Cluster with replicas |
| Batching Service | Batching stops | Multiple instances pulling from same queue |
| GPU Worker | Reduced capacity | Auto-scaling, requeue failed batches |
Batching Service resilience: Multiple batching instances can safely pull from the same request queue (BLPOP is atomic). If one crashes, the others continue. Partially formed batches in the crashed instance are lost, but those requests time out at the API layer and users retry.
Identifying Bottlenecks
Bottleneck 1: GPU Processing (100ms fixed)
This is the irreducible bottleneck. The only solution is horizontal scaling (more GPUs).
RPS needed → GPUs required
500 RPS → 3 GPUs
1,000 RPS → 6 GPUs
5,000 RPS → 23 GPUs
10,000 RPS → 46 GPUs
(computed as above: 320 RPS per GPU, then divided by the 70% utilization target)

Bottleneck 2: Connection Limits (C10K Problem)
At 10,000 RPS with 150ms latency:
Concurrent connections = 10,000 × 0.15 = 1,500 per LB
Solutions:
- Increase OS file descriptor limits (ulimit -n 65536)
- Horizontally scale API instances
- Consider HTTP/2 multiplexing

Bottleneck 3: Redis Throughput
Operations per request:
- 1 LPUSH (API → request queue)
- 1 BLPOP (batching service dequeues)
- 1 LPUSH per batch (batching → batch queue, amortized: 1/32 per request)
- 1 BLPOP per batch (GPU claims batch, amortized: 1/32 per request)
- 1 PUBLISH (GPU → response routing)
At 10,000 RPS: ~30,000 Redis ops/sec (3 per request)
Redis easily handles 100K+ ops/sec on modest hardware ✓

Deep Dive: Failure Handling
GPU crash mid-batch — If a GPU fails while processing, up to 32 user requests fail simultaneously. This requires explicit handling.
GPU Worker with timeout and retry:

import asyncio
import json

# `redis` is an async Redis client; publish_results/publish_errors route
# outputs back to API instances via Pub/Sub, report_unhealthy flags this worker.
async def gpu_worker_loop():
    while True:
        # BLPOP blocks until a batch is available; returns (queue_name, payload)
        _, payload = await redis.blpop("batch_queue", timeout=0)
        batch = json.loads(payload)
        try:
            result = await asyncio.wait_for(
                asyncio.to_thread(batchstring, batch["inputs"]),
                timeout=0.3,  # 3x expected (300ms)
            )
            await publish_results(batch, result)
        except asyncio.TimeoutError:
            # GPU hung - requeue the batch for another GPU
            await redis.lpush("batch_queue", payload)
            await report_unhealthy()
            break  # Exit and let the orchestrator restart this worker
        except Exception as e:
            # Processing failed - notify all waiters with the error
            await publish_errors(batch, str(e))

Timeout handling at API layer:
# generate_id, enqueue_request, pending_requests, and HTTPException come from
# the surrounding API server code.
async def handle_request(input: str):
    request_id = generate_id()
    pending_requests[request_id] = asyncio.get_running_loop().create_future()
    await enqueue_request(request_id, input)
    try:
        result = await asyncio.wait_for(
            pending_requests[request_id],
            timeout=5.0,  # 5 second deadline
        )
        return result
    except asyncio.TimeoutError:
        raise HTTPException(504, "Request timeout")
    finally:
        # Clean up whether the request completed or timed out
        pending_requests.pop(request_id, None)

Trade-off Discussion: Latency vs Throughput
Aggressive batching (larger batches, longer timeout):
- Pro: Higher GPU utilization, lower cost per request
- Con: Higher user-perceived latency

Conservative batching (smaller batches, shorter timeout):
- Pro: Lower latency, better user experience
- Con: More GPU overhead, higher cost
Adaptive approach:

from dataclasses import dataclass

@dataclass
class BatchParams:
    size: int
    timeout_ms: int

def calculate_batch_params(queue_depth, gpu_utilization):
    if gpu_utilization < 0.5 and queue_depth < 10:
        # Under-utilized: prioritize latency
        return BatchParams(size=16, timeout_ms=30)
    elif queue_depth > 100 or gpu_utilization > 0.85:
        # Overloaded: prioritize throughput
        return BatchParams(size=64, timeout_ms=100)
    else:
        # Normal load: balanced
        return BatchParams(size=32, timeout_ms=50)

Auto-Scaling Strategy
| Condition | Sustained for | Action |
|---|---|---|
| Queue depth > 500 | 30 seconds | Add 3 GPUs |
| Queue depth > 100 | 60 seconds | Add 1 GPU |
| GPU utilization > 85% | 5 minutes | Add 1 GPU |
| GPU utilization < 40% | 15 minutes | Remove 1 GPU |
GPU cold start time (1-5 minutes) means aggressive scaling-up is necessary. Scale down conservatively to avoid thrashing.
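A hedged sketch of a control loop that evaluates these triggers (thresholds mirror the table; the metric sources and the scaling API itself are placeholders):

import time

class AutoScaler:
    """Evaluates the scaling table; call evaluate() periodically (e.g. every 10s)."""
    def __init__(self):
        self._since: dict[str, float] = {}   # condition name -> first time it held

    def _held_for(self, name: str, holds: bool) -> float:
        now = time.monotonic()
        if not holds:
            self._since.pop(name, None)
            return 0.0
        return now - self._since.setdefault(name, now)

    def evaluate(self, queue_depth: int, gpu_utilization: float) -> int:
        # Positive = add GPUs, negative = remove. Scale up aggressively,
        # down conservatively (GPU cold start is 1-5 minutes).
        if self._held_for("q500", queue_depth > 500) >= 30:
            return +3
        if self._held_for("q100", queue_depth > 100) >= 60:
            return +1
        if self._held_for("hot", gpu_utilization > 0.85) >= 5 * 60:
            return +1
        if self._held_for("cold", gpu_utilization < 0.40) >= 15 * 60:
            return -1
        return 0

The returned delta would be handed to whatever orchestrator manages the GPU worker pool.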
Interview Checklist
Requirements Phase:
- Clarified latency SLA (P95 target)
- Confirmed throughput requirements (RPS)
- Asked about user priority tiers
- Verified GPU processing constraints

Data Model Phase:
- Identified transient vs persistent data
- Request tracking structure defined
- Batch formation structure defined

API Phase:
- Single endpoint with clear contract
- Error codes for all failure modes
- Timeout behavior specified

High-Level Design Phase:
- Complete request flow drawn
- Response routing mechanism explained
- Batching strategy with timeout
- GPU assignment logic

Scaling Phase:
- GPU count calculation shown
- Connection limits addressed
- Failure handling for GPU crashes
- Auto-scaling triggers defined
Summary
| Aspect | Decision | Rationale |
|---|---|---|
| Architecture | Separate batching service | Better GPU utilization at scale |
| Batching trigger | Size OR timeout (50ms) | Balances latency and throughput |
| Response routing | Redis Pub/Sub per API instance | Decouples API from batching |
| GPU assignment | Pull-based from batch queue | Natural backpressure, race-free |
| Failure handling | Requeue batch, timeout at API | Transparent recovery without data loss |
| Scaling signal | Queue depth + utilization | Proactive scaling before SLA breach |
This design handles 1,000 RPS with ~140ms average latency using 6 GPUs, meeting all functional and non-functional requirements while remaining operationally simple.