System Design · Software Engineer

Credit Approval Risk Engine

Role: Software Engineer


Design a system that evaluates credit card applications and decides whether to approve, deny, or send them to manual review. The system calls many external providers (credit bureau, KYC/AML, fraud, income/employment verification), so downstream integrations are slow and failure-prone. The core challenge is decoupling these calls with async orchestration while preserving correctness.

This walkthrough follows the Interview Framework. Use it as a guide, not a script.

This is intentionally a "general" interview question without one fixed answer. Interviewers usually look for good decomposition, idempotent workflows, failure handling, and practical trade-offs.

Phase 1: Requirements

Functional Requirements

  • Applicants should be able to submit a new credit card application and receive an application ID immediately

  • The system should collect risk signals from multiple external services asynchronously (bureau, fraud, KYC, income)

  • The system should evaluate eligibility rules and risk score to produce approved, denied, or manual_review

  • Applicants and support agents should be able to check real-time application status and final decision

  • The system should support retries, provider failover, and manual re-evaluation without duplicate external side effects

Do not block the user request on all downstream calls finishing. Return quickly with processing, then complete decisioning asynchronously.

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Correctness | No duplicate applications or duplicate external submissions | Financial and compliance-sensitive flow |
| Decision latency | p95 under 10s for auto-decision path; under 5 minutes for async completion path | User experience + business conversion |
| Availability | 99.95% for application submission/status APIs | Onboarding is revenue-critical |
| Auditability | Full trace of inputs, rules, scores, and decision version | Regulatory/compliance requirements |
| Scalability | 2M applications/day, bursty partner traffic | Must absorb partner campaigns and spikes |

Capacity Estimation

| Metric | Value |
|---|---|
| Applications/day | 2,000,000 |
| Average submit QPS | ~23 QPS |
| Peak submit QPS | 500+ QPS |
| External checks per application | 4-8 calls |
| Peak downstream call throughput | 2,000-4,000 calls/s |
| Application record size | ~3 KB hot row + ~10 KB signals/audit |
| Annual storage (raw + audit) | ~9-12 TB/year (compressed event log + DB) |

Call out that average QPS is small, but downstream fanout multiplies load. External dependencies, not API CPU, are the real bottleneck.
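The arithmetic behind the table is worth being able to reproduce on the spot. A quick sanity check, with all inputs taken from the figures above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

apps_per_day = 2_000_000
avg_qps = apps_per_day / SECONDS_PER_DAY  # ~23.1 — tiny for an API tier

# Fanout: each application triggers 4-8 external calls, so peak downstream
# throughput is the peak submit rate multiplied by the fanout range.
peak_submit_qps = 500
peak_downstream = (peak_submit_qps * 4, peak_submit_qps * 8)  # 2,000-4,000 calls/s

# Storage: ~3 KB hot row + ~10 KB signals/audit per application.
bytes_per_app = 13 * 1024
annual_tb = apps_per_day * 365 * bytes_per_app / 1024**4  # ~8.8 TB/year, pre-replication
```

The point of showing this math is the fanout line: a ~23 QPS API produces thousands of downstream calls per second at peak.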

Phase 2: Data Model

Core Entities

Application State Machine

Never let callbacks write approved/denied directly. Callbacks only update RiskSignal; a single decision service must own final state transitions.
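A minimal sketch of that guard, assuming the state names used in this walkthrough (submitted, processing, approved, denied, manual_review) — the transition table itself is illustrative:

```python
# Hypothetical transition table owned by the decision service; callbacks
# never call transition() — they only upsert RiskSignal rows.
ALLOWED = {
    "submitted": {"processing"},
    "processing": {"approved", "denied", "manual_review"},
    "manual_review": {"approved", "denied"},
    "approved": set(),   # terminal
    "denied": set(),     # terminal
}

def transition(current: str, target: str) -> str:
    """Reject any transition not in the table (single-writer enforcement)."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```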

Phase 3: API Design

Protocol Choice

  • REST for external application submission and status lookup

  • Event-driven async processing for internal workflow steps and provider callbacks

  • Workflow orchestration API (Step Functions/Temporal style) for long-running stateful execution

External APIs

```http
POST /api/credit-applications
Headers:
  Idempotency-Key: 01J3R2J7F1TQ0K6W6N8A4X9V7R
Request:
{
  "product_type": "cashback_card",
  "applicant": {
    "legal_name": "Jane Doe",
    "dob": "1994-04-08",
    "ssn_last4": "1234",
    "annual_income": 180000
  }
}
Response:
{
  "application_id": "app_123",
  "status": "processing",
  "next_poll_after_seconds": 3
}

GET /api/credit-applications/{application_id}
Response:
{
  "application_id": "app_123",
  "status": "manual_review",
  "latest_step": "fraud_check_timeout",
  "updated_at": "2026-02-10T19:40:01Z"
}

GET /api/credit-applications/{application_id}/decision
Response:
{
  "application_id": "app_456",
  "outcome": "approved",
  "risk_score": 0.18,
  "reason_codes": ["BUREAU_OK", "FRAUD_LOW", "INCOME_VERIFIED"]
}
```

Internal Event Contracts

```text
Topic: credit.application.submitted
Message: { application_id, applicant_id, idempotency_key, submitted_at }

Topic: credit.external.requested
Message: { application_id, provider, provider_request_id, attempt, timeout_ms }

Topic: credit.external.completed
Message: { application_id, provider, provider_request_id, status, signal_payload, received_at }

Topic: credit.decision.ready
Message: { application_id, required_signals_complete, missing_providers[] }

Topic: credit.decision.finalized
Message: { application_id, outcome, risk_score, reason_codes, ruleset_version }
```

Provider Callback API

```http
POST /internal/providers/{provider}/callbacks
Request:
{
  "provider_request_id": "bureau_req_789",
  "status": "success",
  "payload": { ...provider_specific_fields... },
  "signature": "hmac..."
}
```
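One way to check that signature field, sketched with Python's standard hmac module. It assumes an HMAC-SHA256 scheme over the raw request body with a shared per-provider secret; the secret and payload below are illustrative:

```python
import hashlib
import hmac
import json

def verify_callback(raw_body: bytes, signature_hex: str, provider_secret: bytes) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time,
    so timing differences don't leak signature bytes."""
    expected = hmac.new(provider_secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# Illustrative usage: the adapter gateway and provider share the secret,
# so a callback can be authenticated before any state is touched.
secret = b"per-provider-shared-secret"  # hypothetical; lives in a secrets manager
body = json.dumps({"provider_request_id": "bureau_req_789", "status": "success"}).encode()
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
```

Verify before parsing: an unauthenticated callback should be rejected without ever reaching the Signal Processor.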

Phase 4: High-Level Design

Architecture Overview

Component Responsibilities

Application API

  • Validates schema and auth

  • Performs idempotency check before creating an application

  • Writes initial record + outbox event in a single transaction

  • Returns immediately with processing

Outbox Relay

  • Polls unpublished outbox rows from the database

  • Publishes durable application.submitted events to Kafka

  • Marks events as published to guarantee at-least-once delivery across API/Kafka failures
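The relay loop can be sketched in a few lines. This in-memory version stands in for the real outbox table and Kafka producer; publish is a placeholder for the broker client:

```python
def relay_once(outbox_rows, publish):
    """Publish unpublished rows, then mark them published.

    Publishing BEFORE marking is what gives at-least-once delivery: a crash
    between the two steps causes a re-publish on the next poll, never a
    lost event. Consumers must therefore tolerate duplicates.
    """
    relayed = []
    for row in outbox_rows:
        if row["published"]:
            continue
        publish(row["topic"], row["payload"])  # may duplicate on crash/retry
        row["published"] = True                # mark only after the broker ack
        relayed.append(row["id"])
    return relayed
```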

Workflow Orchestrator (Step Functions/Temporal style)

  • Maintains per-application workflow state

  • Schedules steps: bureau check, KYC, fraud, income verification

  • Applies timeout/retry policies per provider

  • Supports compensation paths (e.g., mark manual review when mandatory provider fails)

Worker Pools

  • Pull work from workflow tasks, call adapters, and publish completion events

  • Horizontally scalable and isolated by provider type

  • Retries with exponential backoff and jitter
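A sketch of that backoff schedule, using the common "full jitter" variant; the base and cap values are illustrative:

```python
import random

def backoff_delays(base_ms: float = 200, cap_ms: float = 30_000, attempts: int = 5):
    """Yield one delay per attempt, uniform in [0, min(cap, base * 2^n)].

    Full jitter decorrelates retries across many workers, so a provider
    blip doesn't produce synchronized retry waves."""
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        yield random.uniform(0, ceiling)
```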

External Adapter Gateway

  • Normalizes provider-specific APIs

  • Enforces per-provider token bucket rate limits

  • Adds circuit breaker + fallback provider routing

  • Signs and verifies callbacks

Signal Processor

  • Consumes credit.external.completed and upserts RiskSignal

  • Detects readiness criteria and emits credit.decision.ready

  • Updates cached status snapshots for low-latency polling
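The readiness check reduces to a set comparison. Which providers are required is a policy decision, so the REQUIRED set and status names below are assumptions:

```python
# Assumed policy: bureau, KYC, and fraud are mandatory; income may be optional.
REQUIRED = {"bureau", "kyc", "fraud"}

def decision_ready(signals: dict) -> tuple:
    """signals maps provider -> status. Ready once every required provider has
    reached a terminal status (success or failed); missing providers are
    reported so the orchestrator can escalate or apply its timeout policy."""
    terminal = {p for p, s in signals.items() if s in ("success", "failed")}
    missing = sorted(REQUIRED - terminal)
    return (not missing, missing)
```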

Rule Engine

  • Stores deterministic eligibility and policy rules (hard cutoffs, compliance checks)

  • Versioned rulesets so decisions are reproducible in audits

  • Produces reason codes used in adverse action explanations

Decision Service

  • Waits for required signals or timeout policy

  • Computes final outcome using rule engine + optional model score

  • Ensures single-writer decision finalization (approved/denied/manual_review)
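Single-writer finalization is often enforced with a compare-and-set update rather than a lock: only a row still in processing can be finalized, so two racing finalizers cannot both win. A sketch using SQLite as a stand-in for the primary database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO applications VALUES ('app_123', 'processing')")

def finalize(conn, app_id: str, outcome: str) -> bool:
    """Compare-and-set: the WHERE clause makes the transition atomic, so a
    concurrent finalizer sees rowcount == 0 and backs off."""
    cur = conn.execute(
        "UPDATE applications SET status = ? WHERE id = ? AND status = 'processing'",
        (outcome, app_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

The losing writer gets False and must not overwrite — it reads the winning outcome instead.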

Status Cache

  • Caches latest application status for low-latency polling APIs

  • TTL and event-driven invalidation on state transitions

Data Flow: Async Decisioning

Idempotency Strategy

Three layers:

  • API idempotency key: UNIQUE(applicant_id, idempotency_key) prevents duplicate applications

  • Workflow execution dedupe: one workflow run per application_id; duplicate start events are ignored

  • Provider request idempotency: deterministic provider_request_id ({application_id}:{provider}:{attempt}) avoids duplicate chargeable calls

```sql
CREATE UNIQUE INDEX uq_applicant_idempotency
ON credit_applications (applicant_id, idempotency_key);

INSERT INTO credit_applications (
  id, applicant_id, idempotency_key, status, created_at
)
VALUES (
  '6f8b30ce-b3c2-4709-aab6-8927bde5f6ef',
  '1f4d66c2-0be0-4d3d-9f09-a0e7ca8eb6b9',
  '01J3R2J7F1TQ0K6W6N8A4X9V7R',
  'processing',
  NOW()
)
ON CONFLICT (applicant_id, idempotency_key) DO NOTHING
RETURNING id;
```
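One gotcha with PostgreSQL's ON CONFLICT ... DO NOTHING RETURNING id: on a conflict, no row is returned, so the caller must fall back to a SELECT to recover the existing application id. A sketch of that read-back pattern, using SQLite's INSERT OR IGNORE as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE credit_applications (
        id TEXT PRIMARY KEY,
        applicant_id TEXT,
        idempotency_key TEXT,
        status TEXT,
        UNIQUE (applicant_id, idempotency_key)
    )
""")

def submit(conn, app_id: str, applicant_id: str, idem_key: str) -> str:
    """Insert-if-absent, then read back: a retried request returns the
    original application id instead of creating a duplicate."""
    conn.execute(
        "INSERT OR IGNORE INTO credit_applications VALUES (?, ?, ?, 'processing')",
        (app_id, applicant_id, idem_key),
    )
    row = conn.execute(
        "SELECT id FROM credit_applications WHERE applicant_id = ? AND idempotency_key = ?",
        (applicant_id, idem_key),
    ).fetchone()
    return row[0]
```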

Rate Limiting Strategy

| Layer | Strategy | Purpose |
|---|---|---|
| Client/API | Sliding window or token bucket per user/partner | Protect public APIs |
| Workflow dispatch | Global concurrency cap per workflow type | Avoid internal overload |
| Provider adapter | Token bucket per provider credential | Respect vendor limits/SLA |
| Retry subsystem | Retry budget + circuit breaker | Prevent retry storms |

Call out "retry budget" explicitly. Interviewers care less about one retry policy and more about proving retries cannot amplify outages.
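A retry budget can be as simple as a ratio cap: retries are granted only while they stay below a fixed fraction of total request volume. The 10% ratio below is illustrative:

```python
class RetryBudget:
    """Cap retries at `ratio` of total requests. Under a provider outage,
    retry traffic is bounded instead of multiplying the load."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        """Grant a retry token only while the budget has headroom."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False  # budget exhausted: fail fast or route to manual_review
```

A production version would track these counters over a sliding window so the budget recovers after an incident; this sketch keeps lifetime totals for clarity.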

Phase 5: Scaling & Trade-offs

Meeting Non-Functional Requirements

Correctness and Auditability

| Strategy | Why It Works |
|---|---|
| Single decision writer service | Prevents conflicting final states |
| Versioned rulesets + reason codes | Supports explainability and compliance |
| Append-only event log | Full audit and replay capability |
| Idempotency at API/workflow/provider layers | Prevents duplicate side effects |

Latency and Throughput

| Strategy | Why It Works |
|---|---|
| Async orchestration + immediate processing response | Removes long-tail provider latency from user request path |
| Parallel external checks | Shrinks critical-path decision time |
| Status cache | Fast polling reads without hammering primary DB |
| Partitioned event topics by application_id | Scales consumers horizontally |

Bottlenecks and Mitigations

1. Slow or failing external providers

  • Use per-provider circuit breakers and fallback providers

  • Downgrade to manual_review when required checks exceed timeout budget

  • Keep provider adapters isolated so one vendor incident does not block all checks

2. Event backlog during traffic spikes

  • Autoscale workers on consumer lag

  • Separate high-priority topics (decision.ready) from bulk topics

  • Apply admission control and 429 for abusive partner traffic

3. Database write pressure (many status transitions)

  • Persist events to Kafka first, then batch upsert derived state

  • Use read replicas for status/history APIs

  • Archive old events to cold storage for long retention

Trade-off: Workflow Engine vs Custom Orchestrator

| Option | Pros | Cons |
|---|---|---|
| Custom orchestration in app code | Full flexibility, fewer platform dependencies | Hard to reason about retries/timeouts/compensation; costly to maintain |
| Managed workflow engine (recommended) | Built-in state, retries, visibility, durability | Vendor lock-in, extra operational model |

Trade-off: Kafka Event Bus vs Direct RPC Chain

| Option | Pros | Cons |
|---|---|---|
| Direct synchronous RPC chain | Simple to start, easy tracing | Tight coupling, poor resilience to downstream failures |
| Event bus + async workers (recommended) | Loose coupling, buffering, replay, independent scaling | Eventual consistency and operational complexity |

Failure Mode Deep Dive

| Failure | Risk | Recovery |
|---|---|---|
| API retries from client timeout | Duplicate applications | Idempotency key returns existing application |
| Worker crash after provider call | Lost callback handling | Workflow timeout + re-query provider status |
| Kafka consumer outage | Decision delay | Replay from offsets after recovery |
| Rule deployment bug | Incorrect approvals/denials | Ruleset versioning + fast rollback + re-decision pipeline |
| Redis cache outage | Slow status reads | Serve from DB fallback; repopulate cache asynchronously |

A common mistake is letting each external callback decide independently. That creates race conditions and conflicting outcomes. Keep decision finalization centralized.

Interview Checklist

Requirements Phase

  • Clarified that many downstream services are external and unreliable

  • Set latency goals for both immediate response and final decision

  • Included compliance/auditability as first-class requirements

Data Model Phase

  • Defined Application, RiskSignal, Decision, and WorkflowExecution

  • Showed clear state machine with manual_review path

  • Included append-only events for audit/replay

API Design Phase

  • Included idempotent submission API

  • Included status + decision retrieval APIs

  • Defined internal event contracts for workflow steps

High-Level Design Phase

  • Used workflow engine to orchestrate async checks

  • Decoupled with Kafka + worker pools

  • Added rule engine, cache, rate limiting, and adapter gateway

Scaling Phase

  • Covered retry storms, circuit breakers, and provider outages

  • Discussed workflow-vs-custom and async-vs-sync trade-offs

  • Explained single-writer decision finalization for correctness

Key Points to Emphasize

  • Asynchronous by design: external checks are slow/unreliable, so orchestration must be durable and non-blocking.

  • Idempotency everywhere: API submission, workflow execution, and provider calls all need dedupe.

  • Centralized decision ownership: one service finalizes outcomes to avoid race conditions.

  • Rate limiting + retry budgets: these prevent outages from cascading across downstream dependencies.

  • Auditability is mandatory: versioned rules + append-only events make decisions explainable and reversible.