
Book Price Aggregator and Purchase Platform

Role: Software Engineer


Problem Statement

Design a book marketplace intermediary service that helps customers purchase books at the best price from multiple third-party book sellers. The system acts as a price aggregator and purchase coordinator — you are the middleman, not a direct seller, and you hold no inventory.

How it works:

  • A customer submits a purchase request with a book ID, their maximum acceptable price, and payment information

  • Your service queries 50–200 downstream book seller APIs for price and availability

  • If the lowest price ≤ the customer's max price, complete the purchase and charge the customer

  • If the lowest price exceeds the customer's max price, return it as a counter-offer

  • If no sellers have the book in stock, return not_available

Scale & Constraints:

  • Books: 1–2 million unique books

  • Sellers: 50–200 third-party book seller APIs

  • Latency SLA: 10–20 seconds for the entire request flow

  • Operation type: Asynchronous — the focus is on not overloading downstream APIs, not on instant response

Potential Follow-up Questions

During the interview, the interviewer may ask you to address these challenges:

1. Batch Quote Optimization How do you efficiently query prices from 50–200 sellers without exceeding the latency SLA? Sequential calls at 100ms each would consume the entire 20-second budget.

2. Noisy Neighbor Prevention How do you prevent high-volume order creation from overwhelming downstream book seller APIs? Multiple concurrent customer requests all hitting the same sellers creates a thundering herd problem.

3. Inventory Race Conditions A book may be available when you query the price but sold out by the time you attempt to purchase. How do you handle this? (Hint: don't just try the cheapest — sort by price and try in order.)

4. Failure Tolerance Downstream seller APIs can fail, timeout, or return errors. How do you handle partial failures when some sellers respond and others don't?

5. Payment Processing When do you charge the customer's credit card? What happens if the downstream purchase fails after you've already charged?

6. Scaling to More Sellers How would your design change if you had 10,000 sellers instead of 200?

7. Price Fluctuations What if a seller's price changes between when you query and when you purchase?

Disclaimer: This is a sample solution to help you get started. To better prepare for the interview, you should think through the question yourself and try to come up with your own solution. System design questions are open-ended and have multiple valid approaches.

Phase 1: Requirements

Functional Requirements

  • Submit purchase request — Customers submit a book ID, max acceptable price, and payment information

  • Aggregate prices — Query multiple downstream seller APIs in parallel for price and availability

  • Execute purchase — If the lowest price meets the customer's budget, purchase from the cheapest seller and charge the customer

  • Return counter-offer — If the lowest price exceeds the budget, return it so the customer can decide

  • Handle purchase failures — If the cheapest seller is out of stock, try the next cheapest, and so on

  • Return not available — If no sellers have inventory, return not_available and release any authorization

Non-Functional Requirements

  • Latency: Complete the full flow within 10–20 seconds

  • Availability: 99.9%+ uptime — graceful degradation when individual sellers are down

  • Consistency: Strong consistency for payment operations (no double charges, no charges without successful purchases)

  • Reliability: Tolerate partial downstream failures — a few failing sellers should not block the entire request

Capacity Estimation

text
Traffic:
- 1M active users/month, 2 searches each, 20% conversion = 400K purchase requests/month
- ~13K requests/day = ~0.15 QPS average, ~1.5 QPS peak

Downstream seller API calls:
- Each request queries ~100 sellers on average
- 13K x 100 = 1.3M seller API calls/day = ~15 QPS average, ~150 QPS peak

Storage:
- Request metadata: 400K x 1 KB = 400 MB/month
- Seller quotes: 400K x 100 quotes x 500 bytes = 20 GB/month

Traffic is very low. The bottleneck is not your service's throughput — it's seller API rate limits and latency. The entire design should focus on efficient downstream API management and fan-out control, not horizontal scaling of your own services.

Phase 2: Data Model

Core Entities

sql
-- Track customer purchase requests (PostgreSQL)
PurchaseRequest {
  request_id:   UUID (PK)
  customer_id:  VARCHAR
  book_id:      VARCHAR
  max_price:    DECIMAL(10,2)
  status:       Enum (pending, purchased, price_not_met, not_available, failed)
  created_at:   Timestamp
  updated_at:   Timestamp
}

-- Track quotes from each seller for each request
SellerQuote {
  quote_id:      UUID (PK)
  request_id:    UUID (FK -> PurchaseRequest)
  seller_id:     VARCHAR
  price:         DECIMAL(10,2)
  available:     Boolean
  response_time: Integer (ms)
  error_message: Text
  quoted_at:     Timestamp
}

-- Track purchase attempts and outcomes
PurchaseAttempt {
  attempt_id:    UUID (PK)
  request_id:    UUID (FK -> PurchaseRequest)
  seller_id:     VARCHAR
  attempt_order: Integer  -- 1st cheapest, 2nd cheapest, etc.
  status:        Enum (success, out_of_stock, failed)
  error_message: Text
  attempted_at:  Timestamp
}

-- Seller health metrics (Redis for real-time state)
SellerHealth {
  seller_id:              VARCHAR (PK)
  avg_response_time_ms:   Integer
  failure_rate:           Float
  circuit_breaker_status: Enum (closed, open, half_open)
  rate_limit_per_minute:  Integer
  updated_at:             Timestamp
}

Operational State (Redis)

text
Rate limit counters:
  Key: rate_limit:{seller_id}:{minute_timestamp}
  Type: Integer (auto-increment, TTL 60s)

Circuit breaker state:
  Key: circuit:{seller_id}
  Type: Hash { status, failure_count, last_failure_at, opened_at }

Quote cache (short-lived):
  Key: quote:{book_id}:{seller_id}
  Type: JSON { price, available, quote_id, expires_at }
  TTL: 60 seconds

Coalesced quote list:
  Key: quotes:{book_id}
  Type: JSON [ { seller_id, price, available, quote_id, expires_at } ]
  TTL: 60 seconds

Use PostgreSQL for the primary data store — ACID properties are critical for payment operations. The traffic is low enough that a single instance with a read replica handles the load. Use Redis for operational state that needs fast access across distributed workers.

Phase 3: API Design

Customer-Facing API (REST)

Submit Purchase Request

text
POST /api/v1/purchase-requests
Authorization: Bearer {token}

Request:
{
  "book_id": "isbn-978-0134685991",
  "max_price": 29.99,
  "payment_token": "tok_abc123"
}

Response (202 Accepted):
{
  "request_id": "req-xyz-789",
  "status": "pending",
  "status_url": "/api/v1/purchase-requests/req-xyz-789"
}

Check Status

text
GET /api/v1/purchase-requests/{request_id}
Authorization: Bearer {token}

Response (purchased):
{
  "request_id": "req-xyz-789",
  "status": "purchased",
  "final_price": 24.99,
  "seller_id": "seller-42",
  "confirmation": "CONF-456"
}

Response (price_not_met):
{
  "request_id": "req-xyz-789",
  "status": "price_not_met",
  "lowest_price": 34.99,
  "message": "Best price (.99) exceeds your max (.99)",
  "next_action": "resubmit_with_higher_max_price"
}

Response (not_available):
{
  "request_id": "req-xyz-789",
  "status": "not_available",
  "message": "No sellers had this book in stock at the time of request"
}

Downstream Seller API (You Define This)

A key part of this interview is defining the downstream seller API contract yourself. The sellers also sell books directly to their own customers, so you must design around their constraints.

Get Quote

text
GET /api/v1/books/{book_id}/quote

Response:
{
  "book_id": "isbn-978-0134685991",
  "available": true,
  "price": 24.99,
  "quote_id": "q-abc",
  "expires_at": "2024-01-15T10:30:00Z"
}

Hold Inventory

text
POST /api/v1/books/hold
{
  "quote_id": "q-abc",
  "quantity": 1
}

Response:
{
  "hold_id": "hold-xyz",
  "price": 24.99,
  "expires_at": "2024-01-15T10:35:00Z"
}

Complete Purchase

text
POST /api/v1/orders
Idempotency-Key: {request_id}:{seller_id}

{
  "hold_id": "hold-xyz",
  "buyer_reference": "ref-789"
}

Response:
{
  "order_id": "order-456",
  "status": "confirmed",
  "price": 24.99
}

The customer pays you (via authorize/capture), and you settle with sellers separately — through invoicing, net settlement, or your own merchant account. Never pass the customer's payment token to a third-party seller. Some sellers may not support separate hold/purchase steps. For those, use an atomic purchase endpoint that combines both operations. Your Purchase Orchestrator should handle both seller types.
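
One way to handle the two seller types is an adapter behind a common interface — the class and method names below are illustrative assumptions, not part of the seller contract:

```python
class Seller:
    """Common interface the Purchase Orchestrator programs against."""
    def buy(self, quote_id, idempotency_key):
        raise NotImplementedError

class TwoPhaseSeller(Seller):
    # Adapter for sellers that expose separate hold and purchase endpoints
    def __init__(self, api):
        self.api = api

    def buy(self, quote_id, idempotency_key):
        hold = self.api.hold(quote_id)                           # POST /books/hold
        return self.api.order(hold["hold_id"], idempotency_key)  # POST /orders

class AtomicSeller(Seller):
    # Adapter for sellers whose single endpoint holds and buys in one call
    def __init__(self, api):
        self.api = api

    def buy(self, quote_id, idempotency_key):
        return self.api.buy_now(quote_id, idempotency_key)
```

The orchestrator then calls `seller.buy(...)` without caring which integration style the seller supports.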

Phase 4: High-Level Design

(Architecture diagram and request flow figures omitted.)

Data Flow Summary

  • Customer → Purchase Request Service: Validate request, authorize payment hold

  • Purchase Request Service → Queue: Enqueue event, return request_id immediately (async)

  • Price Aggregator → All Sellers (parallel): Fan out quote requests, skipping sellers with open circuit breakers or exhausted rate limits

  • Price Aggregator: Sort quotes by price, evaluate against max_price; if no available quotes, mark not_available and void authorization

  • Purchase Orchestrator → Sellers (sequential by price): Attempt hold → purchase, starting from cheapest

  • Purchase Orchestrator → Payment Service: Capture on success, void on failure

  • Notification Service → Customer: Email/push with outcome

Why async? The operation is asynchronous because the focus of this problem is on not overloading downstream seller APIs, not on instant response. The customer submits a request and polls for the result or receives a notification. This decoupling lets you manage backpressure against seller APIs.
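
A minimal sketch of this decoupling, with an in-process queue and a dict standing in for the real message queue and the PurchaseRequest table:

```python
import queue
import uuid

requests = queue.Queue()   # stand-in for the real message queue
statuses = {}              # stand-in for the PurchaseRequest table

def submit_purchase_request(book_id, max_price):
    # Validate and authorize payment (omitted), enqueue, return immediately
    request_id = str(uuid.uuid4())
    statuses[request_id] = "pending"
    requests.put({"request_id": request_id, "book_id": book_id,
                  "max_price": max_price})
    return {"request_id": request_id, "status": "pending",
            "status_url": f"/api/v1/purchase-requests/{request_id}"}

def worker_step(process):
    # Background workers drain the queue at a pace the seller APIs tolerate
    event = requests.get()
    statuses[event["request_id"]] = process(event)
```

The API handler only validates and enqueues; the worker pool controls how fast events hit the sellers, which is where the backpressure lives.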

Phase 5: Scaling & Trade-offs

Deep Dive: Parallel Seller Querying

Problem: Querying 200 sellers sequentially at 100ms each = 20 seconds — the entire latency budget, with no room for processing or purchase attempts.

Solution: Query all sellers concurrently using async I/O.

python
import asyncio

async def query_all_sellers(book_id, sellers):
    tasks = []
    for seller in sellers:
        # Skip unhealthy or rate-limited sellers
        if circuit_breaker.is_open(seller.id):
            continue
        if not rate_limiter.try_acquire(seller.id):
            continue
        tasks.append(asyncio.create_task(
            query_seller(seller, book_id, timeout=3.0)))
    if not tasks:
        return []

    # Wait for everything, but never longer than the global timeout
    done, pending = await asyncio.wait(tasks, timeout=10.0)
    for task in pending:
        task.cancel()  # don't wait for stragglers

    # Keep successful, in-stock quotes; drop failures and cancellations
    quotes = []
    for task in done:
        if task.cancelled() or task.exception() is not None:
            continue
        result = task.result()
        if isinstance(result, Quote) and result.available:
            quotes.append(result)
    return sorted(quotes, key=lambda q: q.price)

With 200 concurrent requests, wall-clock latency is bounded by the slowest responders and the global timeout, not the sum of per-seller latencies. Expect low single-digit seconds for typical responses, and worst-case latency close to your 10-second global timeout. This leaves room for processing and purchase attempts.

Use HTTP connection pooling with keep-alive to reduce TLS handshake overhead. Set a per-seller timeout of 3 seconds and a global timeout of 10 seconds. Don't wait for stragglers — use the quotes you have when the global timeout hits.

Deep Dive: Noisy Neighbor Prevention

Problem: Multiple concurrent customer requests all query the same 200 sellers simultaneously, potentially exceeding each seller's rate limits and getting your service blocked.

Solution: Distributed fixed window counter in Redis.

python
def try_acquire_rate_limit(seller_id):
    key = f"rate_limit:{seller_id}:{current_minute()}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, 60)  # TTL on first increment

    seller_limit = get_seller_limit(seller_id)  # e.g., 100 req/min
    return count <= seller_limit

When a seller's rate limit is exhausted, the worker skips that seller for the current request. The request still proceeds with the remaining sellers — it doesn't fail entirely.

The sellers' APIs also serve their own direct customers. If your service floods their APIs, they'll throttle or block you entirely, degrading the experience for all your customers. This is the "noisy neighbor" problem the interviewer is testing.

Deep Dive: Circuit Breaker Pattern

Problem: When a seller's API goes down, continuing to send requests wastes time waiting for timeouts and uses connection resources.

python
class CircuitBreaker:
    # States: CLOSED -> OPEN -> HALF_OPEN -> CLOSED
    FAILURE_THRESHOLD = 0.5   # 50% failure rate
    TIME_WINDOW = 60          # 60-second sliding window
    MIN_CALLS = 10            # Minimum calls before tripping
    COOLDOWN = 30             # Seconds before testing recovery

    def should_allow(self, seller_id):
        state = redis.hgetall(f"circuit:{seller_id}")
        if not state:
            return True      # No recorded history: treat as CLOSED
        if state['status'] == 'OPEN':
            # Redis hash values come back as strings -- cast before comparing
            if now() - float(state['opened_at']) > self.COOLDOWN:
                return True  # HALF_OPEN: allow one test request
            return False     # Still open, skip this seller
        return True          # CLOSED: allow all requests

Store circuit breaker state in Redis so all workers share the same view. When a circuit is OPEN, the Price Aggregator skips that seller entirely — no network call, no timeout wait.

Circuit breakers enable graceful degradation: losing one seller just means fewer quotes, not a failed request. The system continues functioning with the remaining healthy sellers.
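
The recording side of the breaker — counting failures and tripping the circuit — can be sketched as follows. An in-memory dict stands in for the Redis hash, and the sliding-window reset is omitted for brevity:

```python
import time

# seller_id -> breaker state; thresholds mirror the CircuitBreaker class
state = {}
FAILURE_THRESHOLD, MIN_CALLS = 0.5, 10

def record_result(seller_id, success, now=time.time):
    s = state.setdefault(seller_id,
                         {"status": "CLOSED", "failures": 0, "calls": 0,
                          "opened_at": None})
    s["calls"] += 1
    if not success:
        s["failures"] += 1
    failure_rate = s["failures"] / s["calls"]
    if (s["status"] == "CLOSED" and s["calls"] >= MIN_CALLS
            and failure_rate >= FAILURE_THRESHOLD):
        s["status"], s["opened_at"] = "OPEN", now()   # trip the breaker
    elif s["status"] != "CLOSED" and success:
        # A successful probe (HALF_OPEN) closes the circuit and resets counts
        state[seller_id] = {"status": "CLOSED", "failures": 0, "calls": 0,
                            "opened_at": None}
```

In production the counters would live in the same Redis hash that `should_allow` reads, updated atomically (e.g. via a Lua script) so all workers agree.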

Deep Dive: Inventory Race Conditions

Problem: You query a seller and get a quote of $24.99 with the book in stock, but by the time you try to purchase, the book is sold out. If you only tried the cheapest seller, the entire request fails.

Solution: Sort all quotes by price ascending and attempt purchases in order.

python
async def execute_purchase(sorted_quotes, customer, request_id):
    for quote in sorted_quotes:
        if quote.price > customer.max_price:
            break  # All remaining quotes are too expensive

        # Idempotency key ensures safe retries if the network call times out
        # — the seller deduplicates by this key and won't create a double purchase
        idempotency_key = f"{request_id}:{quote.seller_id}"
        result = await attempt_purchase(quote.seller_id, quote, idempotency_key)
        record_attempt(quote.seller_id, result)

        if result.status == 'SUCCESS':
            await capture_payment(customer, actual_price=quote.price)
            return PurchaseResult.PURCHASED

        # OUT_OF_STOCK or ERROR -> try next cheapest
        continue

    # All options exhausted
    await void_payment(customer)
    return PurchaseResult.NOT_AVAILABLE

This is a key design point that interviewers test explicitly. A naive approach of only trying the cheapest seller is fragile. The sorted retry approach maximizes the chance of a successful purchase while still getting the best price. Don't just find the cheapest — sort by price and try in order.

Deep Dive: Payment Processing

Problem: If you charge before confirming the seller purchase, a failed purchase means you need a refund (bad UX, takes days). If you charge after, the card might decline (seller already committed).

Solution: Authorize-then-capture (two-phase payment).

| Step      | When                            | Action                            | Effect                           |
|-----------|---------------------------------|-----------------------------------|----------------------------------|
| Authorize | Request submitted               | Hold max_price on customer's card | Funds reserved, not charged      |
| Capture   | Purchase succeeds               | Charge actual_price (≤ max_price) | Customer charged the real amount |
| Void      | Purchase fails or price not met | Release the hold                  | Customer never charged           |
text
Timeline (happy path):
  Request submitted  ->  Authorize $29.99 (max price)
  Best price found   ->  $24.99
  Seller purchase    ->  Success
  Capture            ->  Charge $24.99 (actual price <= authorized amount)

Timeline (failure):
  Request submitted  ->  Authorize $29.99
  All sellers fail   ->  Void authorization (hold released immediately)

Authorizations typically expire after about 7 days if not captured or voided (exact windows vary by payment processor; some card networks allow up to 31 days). This pattern ensures customers are never charged for failed purchases, and the capture always succeeds because funds are already held.

Counter-offer handling: On price_not_met, void the authorization immediately and return the lowest price as an informational counter-offer. If the customer accepts, they resubmit with a higher max_price (new authorization), which avoids holding funds while they decide. If you want a one-click accept, add a short-lived "accept offer" endpoint and re-authorize at acceptance time.

Crash recovery: If the worker crashes after capturing payment but before writing the "purchased" status to the database, the customer is charged but the system doesn't reflect it. To handle this, log the capture result to a transactional outbox table within the same database transaction as the status update. A reconciliation worker periodically compares captured payments against purchase records and resolves any mismatches.
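
A minimal outbox sketch, using SQLite to stand in for PostgreSQL — table and column names are illustrative. The point is that the status update and the outbox record commit in one transaction, so a crash can never leave one without the other:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE purchase_requests (request_id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE payment_outbox (request_id TEXT, event TEXT, amount REAL);
""")

def record_capture(request_id, amount):
    with db:  # single transaction: both writes commit or neither does
        db.execute("UPDATE purchase_requests SET status='purchased' "
                   "WHERE request_id=?", (request_id,))
        db.execute("INSERT INTO payment_outbox VALUES (?, 'captured', ?)",
                   (request_id, amount))
```

The reconciliation worker then scans `payment_outbox` against the payment provider's records and flags any capture with no matching `purchased` row (or vice versa).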

Deep Dive: Caching and Request Deduplication

Quote caching reduces seller API load:

text
Cache key: quote:{book_id}:{seller_id}
TTL: 60 seconds

If multiple customers request the same book within 60 seconds, reuse cached quotes instead of hitting seller APIs again. The quote_id returned by each seller is what locks in the quoted price, so the cache TTL should always be shorter than the seller's quote validity window (the expires_at field). A stale cache entry is harmless — the worst case is a slightly outdated price estimate — because the actual purchase uses the quote_id, not the cached price.
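
A read-through version of this cache might look like the following sketch — a plain dict stands in for Redis, and `fetch` is a hypothetical call to the seller's quote endpoint:

```python
import json
import time

cache = {}  # stand-in for Redis: key -> (serialized value, expiry epoch)

def get_quote_cached(book_id, seller_id, fetch, ttl=60):
    """Return a cached quote if fresh; otherwise call the seller API."""
    key = f"quote:{book_id}:{seller_id}"
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return json.loads(entry[0])          # cache hit: no seller call
    quote = fetch(book_id, seller_id)        # miss: hit the seller API
    cache[key] = (json.dumps(quote), time.time() + ttl)
    return quote
```

Keeping `ttl` below the seller's `expires_at` window preserves the invariant that any cached `quote_id` is still honorable at purchase time.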

Request coalescing for concurrent requests to the same book:

python
import asyncio

async def get_quotes_with_coalescing(book_id):
    lock_key = f"lock:quotes:{book_id}"

    # Try to acquire lock (atomic set-if-not-exists, auto-expires after 10s)
    acquired = redis.set(lock_key, "1", nx=True, ex=10)

    if acquired:
        # We're the leader — fetch from sellers
        try:
            quotes = await query_all_sellers(book_id)
            redis.set(f"quotes:{book_id}", serialize(quotes), ex=60)
            return quotes
        finally:
            redis.delete(lock_key)  # release even if the fetch fails
    else:
        # Another request is fetching — poll for cached result
        for _ in range(25):  # Up to 5 seconds (25 x 200ms)
            await asyncio.sleep(0.2)
            cached = redis.get(f"quotes:{book_id}")
            if cached:
                return deserialize(cached)
        # Leader timed out or failed — fall back to direct query
        return await query_all_sellers(book_id)

Request coalescing reduces seller API calls by up to 90% for popular books. Each customer's results are evaluated against their own max_price — only the seller queries are shared, not the purchase decision.

Deep Dive: Scaling to 10,000 Sellers

If the seller count grows from 200 to 10,000, querying all sellers within the latency SLA becomes impossible.

Solution: Smart seller selection using historical data scoring.

python
def score_seller(seller_id, book_id):
    return (
        book_match_score(seller_id, book_id)  * 0.40 +  # Carried this book before?
        category_match(seller_id, book_id)    * 0.25 +  # Specializes in this category?
        price_competitiveness(seller_id)       * 0.20 +  # Often in top 10 cheapest?
        availability_rate(seller_id)           * 0.10 +  # High stock availability?
        speed_score(seller_id)                 * 0.05    # Fast API response?
    )

# Query only top 50-100 sellers by score

Multi-phase query approach:

| Phase   | Time Budget | Action                                  | Typical Outcome             |
|---------|-------------|-----------------------------------------|-----------------------------|
| Phase 1 | 0–5s        | Query top 50 sellers (highest score)    | Most requests resolved here |
| Phase 2 | 5–10s       | Query next 100 sellers                  | Rare books, tight budgets   |
| Phase 3 | 10–15s      | Last-resort batch, or return best offer | Edge cases only             |
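
The phased fan-out can be sketched as follows — the batch sizes and early-exit rule are illustrative, and `query_batch` is assumed to enforce its own per-phase timeout:

```python
def multi_phase_quotes(book_id, ranked_sellers, query_batch, max_price):
    # ranked_sellers is already sorted by score_seller, best first
    phases = [ranked_sellers[:50],      # Phase 1: top-scored sellers
              ranked_sellers[50:150],   # Phase 2: next tier
              ranked_sellers[150:]]     # Phase 3: last resort
    quotes = []
    for batch in phases:
        if not batch:
            break
        quotes += query_batch(book_id, batch)
        # Stop fanning out once any quote fits the customer's budget
        if any(q["price"] <= max_price for q in quotes):
            break
    return sorted(quotes, key=lambda q: q["price"])
```

Most requests never reach Phase 2, so the average seller-API load stays close to the 50-seller case even with 10,000 sellers registered.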

Trade-offs Discussion

| Decision                        | Trade-off                                                               |
|---------------------------------|-------------------------------------------------------------------------|
| Async with message queue        | Simpler backpressure management vs. customer must poll for results      |
| Parallel seller queries         | Meets latency SLA vs. high concurrent connection count                  |
| 60s quote cache TTL             | Reduces seller API load vs. slightly stale prices                       |
| Per-seller circuit breakers     | Fail fast on broken sellers vs. possibly missing a recovered seller     |
| Authorize-then-capture          | Never charge without purchase vs. funds held during processing          |
| Request coalescing              | Fewer API calls vs. added complexity and slight delay for waiting requests |
| Historical seller scoring (10K) | Feasible at scale vs. may miss unknown cheap sellers                    |

Interview Checklist

  • Clarified the async nature of the operation (not instant response)

  • Defined both customer-facing and downstream seller APIs

  • Calculated capacity and identified the real bottleneck (seller API limits, not traffic)

  • Explained parallel querying strategy to meet 10-20s latency SLA

  • Designed rate limiting per seller (noisy neighbor prevention)

  • Covered circuit breaker pattern for failing sellers

  • Described price-sorted retry for inventory race conditions

  • Explained authorize-then-capture payment flow with crash recovery

  • Used idempotency keys for safe purchase retries

  • Discussed caching / request deduplication to reduce seller API load

  • Addressed scaling to 10K+ sellers (smart selection, multi-phase queries)

Summary

| Aspect                | Solution                                                         |
|-----------------------|------------------------------------------------------------------|
| Architecture          | Async with message queue — decouple acceptance from processing   |
| Seller querying       | Parallel with async I/O — bounded by global timeout vs. 20s sequential |
| Rate limiting         | Distributed fixed window counter (Redis) — per-seller quotas     |
| Failure handling      | Circuit breaker per seller — fail fast, auto-recovery            |
| Purchase strategy     | Price-sorted retry — handle inventory race conditions            |
| Payment               | Authorize → Capture/Void — never charge without successful purchase |
| Caching               | 60s TTL quote cache + request coalescing                         |
| Database              | PostgreSQL (ACID for payments) + Redis (operational state)       |
| Scaling (10K sellers) | Historical scoring + multi-phase query                           |