Credit Approval Risk Engine
Role: Software Engineer
Design a system that evaluates credit card applications and decides whether to approve, deny, or send them to manual review. The system calls many external providers (credit bureau, KYC/AML, fraud, income/employment verification), so downstream integrations are slow and failure-prone. The core challenge is decoupling these calls with async orchestration while preserving correctness.
This walkthrough follows the Interview Framework. Use it as a guide, not a script.
This is intentionally a "general" interview question without one fixed answer. Interviewers usually look for good decomposition, idempotent workflows, failure handling, and practical trade-offs.
Phase 1: Requirements
Functional Requirements
- Applicants should be able to submit a new credit card application and receive an application ID immediately
- The system should collect risk signals from multiple external services asynchronously (bureau, fraud, KYC, income)
- The system should evaluate eligibility rules and risk score to produce `approved`, `denied`, or `manual_review`
- Applicants and support agents should be able to check real-time application status and final decision
- The system should support retries, provider failover, and manual re-evaluation without duplicate external side effects
Do not block the user request on all downstream calls finishing. Return quickly with processing, then complete decisioning asynchronously.
Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Correctness | No duplicate applications or duplicate external submissions | Financial and compliance-sensitive flow |
| Decision latency | p95 under 10s for auto-decision path; under 5 minutes for async completion path | User experience + business conversion |
| Availability | 99.95% for application submission/status APIs | Onboarding is revenue-critical |
| Auditability | Full trace of inputs, rules, scores, and decision version | Regulatory/compliance requirements |
| Scalability | 2M applications/day, bursty partner traffic | Must absorb partner campaigns and spikes |
Capacity Estimation
| Metric | Value |
|---|---|
| Applications/day | 2,000,000 |
| Average submit QPS | ~23 QPS |
| Peak submit QPS | 500+ QPS |
| External checks per application | 4-8 calls |
| Peak downstream call throughput | 2,000-4,000 calls/s |
| Application record size | ~3 KB hot row + ~10 KB signals/audit |
| Annual storage (raw + audit) | ~9-12 TB/year (compressed event log + DB) |
Call out that average QPS is small, but downstream fanout multiplies load. External dependencies, not API CPU, are the real bottleneck.
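The fanout multiplier is worth doing out loud. A quick back-of-envelope check of the capacity table (figures rounded):

```python
# Back-of-envelope check of the capacity estimates above.
APPS_PER_DAY = 2_000_000
avg_submit_qps = APPS_PER_DAY / 86_400        # seconds per day -> ~23 QPS average

PEAK_SUBMIT_QPS = 500                         # bursty partner campaigns
CHECKS_PER_APP = (4, 8)                       # external calls per application

# Downstream load is submit QPS multiplied by the external-check fanout.
peak_downstream = (PEAK_SUBMIT_QPS * CHECKS_PER_APP[0],
                   PEAK_SUBMIT_QPS * CHECKS_PER_APP[1])   # 2,000-4,000 calls/s
```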
Phase 2: Data Model
Core Entities
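One plausible shape for the core entities discussed throughout this design. Field names here are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Application:
    id: str
    applicant_id: str
    idempotency_key: str
    status: str                 # processing | approved | denied | manual_review
    created_at: datetime

@dataclass
class RiskSignal:
    application_id: str
    provider: str               # bureau | kyc | fraud | income
    provider_request_id: str
    status: str                 # pending | success | failed | timed_out
    payload: dict = field(default_factory=dict)

@dataclass
class Decision:
    application_id: str
    outcome: str                # approved | denied | manual_review
    risk_score: float
    reason_codes: list
    ruleset_version: str

@dataclass
class WorkflowExecution:
    application_id: str
    current_step: str
    attempt: int = 0
    deadline: Optional[datetime] = None
```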
Application State Machine
Never let callbacks write approved/denied directly. Callbacks only update RiskSignal; a single decision service must own final state transitions.
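That single-writer rule can be made mechanical with an explicit transition table. A minimal sketch (the state names match the flow described here; the exact transition set is an assumption):

```python
# Allowed state transitions. Terminal outcomes are reachable only from
# 'processing' or 'manual_review', and only the decision service calls
# transition() with a terminal target -- callback handlers never do.
TRANSITIONS = {
    "processing":    {"approved", "denied", "manual_review"},
    "manual_review": {"approved", "denied"},
    "approved":      set(),   # terminal
    "denied":        set(),   # terminal
}

def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is illegal."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```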
Phase 3: API Design
Protocol Choice
- REST for external application submission and status lookup
- Event-driven async processing for internal workflow steps and provider callbacks
- Workflow orchestration API (Step Functions/Temporal style) for long-running stateful execution
External APIs
POST /api/credit-applications
Headers:
Idempotency-Key: 01J3R2J7F1TQ0K6W6N8A4X9V7R
Request:
{
"product_type": "cashback_card",
"applicant": {
"legal_name": "Jane Doe",
"dob": "1994-04-08",
"ssn_last4": "1234",
"annual_income": 180000
}
}
Response:
{
"application_id": "app_123",
"status": "processing",
"next_poll_after_seconds": 3
}
GET /api/credit-applications/{application_id}
Response:
{
"application_id": "app_123",
"status": "manual_review",
"latest_step": "fraud_check_timeout",
"updated_at": "2026-02-10T19:40:01Z"
}
GET /api/credit-applications/{application_id}/decision
Response:
{
"application_id": "app_456",
"outcome": "approved",
"risk_score": 0.18,
"reason_codes": ["BUREAU_OK", "FRAUD_LOW", "INCOME_VERIFIED"]
}

Internal Event Contracts
Topic: credit.application.submitted
Message: { application_id, applicant_id, idempotency_key, submitted_at }
Topic: credit.external.requested
Message: { application_id, provider, provider_request_id, attempt, timeout_ms }
Topic: credit.external.completed
Message: { application_id, provider, provider_request_id, status, signal_payload, received_at }
Topic: credit.decision.ready
Message: { application_id, required_signals_complete, missing_providers[] }
Topic: credit.decision.finalized
Message: { application_id, outcome, risk_score, reason_codes, ruleset_version }

Provider Callback API
POST /internal/providers/{provider}/callbacks
Request:
{
"provider_request_id": "bureau_req_789",
"status": "success",
"payload": { ...provider_specific_fields... },
"signature": "hmac..."
}

Phase 4: High-Level Design
Architecture Overview
Component Responsibilities
Application API
- Validates schema and auth
- Performs idempotency check before creating an application
- Writes initial record + outbox event in a single transaction
- Returns immediately with `processing`
Outbox Relay
- Polls unpublished outbox rows from the database
- Publishes durable `credit.application.submitted` events to Kafka
- Marks events as published to guarantee at-least-once delivery across API/Kafka failures
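The relay's poll-publish-mark loop is worth sketching, because the ordering is what produces at-least-once semantics: if the process crashes after publishing but before marking, the row is simply republished on the next poll. This sketch uses in-memory stand-ins for the outbox table and Kafka producer:

```python
# Minimal outbox relay pass. 'db' stands in for the outbox table and
# 'producer' for a Kafka producer; in production these would be a
# SELECT ... FOR UPDATE SKIP LOCKED poll and a real publish call.
def relay_once(db: list, producer: list) -> int:
    published = 0
    for row in db:
        if not row["published"]:
            producer.append(("credit.application.submitted", row["event"]))
            row["published"] = True   # marked only AFTER a successful publish
            published += 1
    return published
```

Because a crash between publish and mark causes a duplicate publish, every downstream consumer must be idempotent, which is exactly what the idempotency strategy below provides.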
Workflow Orchestrator (Step Functions/Temporal style)
- Maintains per-application workflow state
- Schedules steps: bureau check, KYC, fraud, income verification
- Applies timeout/retry policies per provider
- Supports compensation paths (e.g., mark manual review when a mandatory provider fails)
Worker Pools
- Pull work from workflow tasks, call adapters, and publish completion events
- Horizontally scalable and isolated by provider type
- Retries with exponential backoff and jitter
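"Exponential backoff and jitter" is a one-liner worth writing out, since the full-jitter variant is what prevents synchronized retry waves. A sketch (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter backoff: a delay drawn uniformly from
    [0, min(cap, base * 2**attempt)], so retrying workers desynchronize
    instead of hammering a recovering provider in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```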
External Adapter Gateway
- Normalizes provider-specific APIs
- Enforces per-provider token bucket rate limits
- Adds circuit breaker + fallback provider routing
- Signs and verifies callbacks
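The per-provider token bucket mentioned above can be sketched in a few lines; one bucket per provider credential, sized to the vendor's contracted rate (the rate/capacity numbers are per-contract, not fixed here):

```python
import time

class TokenBucket:
    """Per-provider limiter: refills 'rate' tokens/sec up to a burst
    'capacity'; each outbound provider call consumes one token."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, never past capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Calls rejected here go back onto the workflow's retry schedule rather than being dropped, so vendor limits shape traffic instead of losing it.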
Signal Processor
- Consumes `credit.external.completed` and upserts `RiskSignal`
- Detects readiness criteria and emits `credit.decision.ready`
- Updates cached status snapshots for low-latency polling
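The readiness check is a set comparison over received signals; a terminal signal of any kind (success, failure, timeout) counts as "heard from", since the decision service handles degraded signals via its own timeout policy. A sketch matching the `credit.decision.ready` contract above (the required-provider set would really be per-product configuration):

```python
REQUIRED_PROVIDERS = {"bureau", "kyc", "fraud", "income"}  # assumed per-product config

def readiness_event(application_id: str, signals: dict) -> dict:
    """Given {provider: status}, build a credit.decision.ready payload.
    'failed' and 'timed_out' are terminal: we have heard from the provider,
    even if unfavorably, so decisioning need not keep waiting."""
    completed = {p for p, s in signals.items()
                 if s in ("success", "failed", "timed_out")}
    missing = sorted(REQUIRED_PROVIDERS - completed)
    return {
        "application_id": application_id,
        "required_signals_complete": not missing,
        "missing_providers": missing,
    }
```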
Rule Engine
- Stores deterministic eligibility and policy rules (hard cutoffs, compliance checks)
- Versioned rulesets so decisions are reproducible in audits
- Produces reason codes used in adverse action explanations
Decision Service
- Waits for required signals or timeout policy
- Computes final outcome using rule engine + optional model score
- Ensures single-writer decision finalization (`approved`/`denied`/`manual_review`)
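Single-writer finalization can be enforced at the database layer as a conditional update, so a replayed event or concurrent worker simply matches zero rows. A sketch using SQLite as a stand-in for the application database (table and column names are illustrative):

```python
import sqlite3

# In-memory stand-in for the application store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_applications (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO credit_applications VALUES ('app_123', 'processing')")
conn.commit()

def finalize(application_id: str, outcome: str) -> bool:
    """Compare-and-set: the outcome lands only if the application is still
    in a non-terminal state. Duplicate finalize attempts update 0 rows."""
    cur = conn.execute(
        "UPDATE credit_applications SET status = ? "
        "WHERE id = ? AND status IN ('processing', 'manual_review')",
        (outcome, application_id),
    )
    conn.commit()
    return cur.rowcount == 1
```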
Status Cache
- Caches latest application status for low-latency polling APIs
- TTL and event-driven invalidation on state transitions
Data Flow: Async Decisioning
Idempotency Strategy
Three layers:
1. API idempotency key: `UNIQUE(applicant_id, idempotency_key)` prevents duplicate applications
2. Workflow execution dedupe: one workflow run per `application_id`; duplicate start events are ignored
3. Provider request idempotency: deterministic `provider_request_id` (`{application_id}:{provider}:{attempt}`) avoids duplicate chargeable calls
CREATE UNIQUE INDEX uq_applicant_idempotency
ON credit_applications (applicant_id, idempotency_key);
INSERT INTO credit_applications (
id, applicant_id, idempotency_key, status, created_at
)
VALUES (
'6f8b30ce-b3c2-4709-aab6-8927bde5f6ef',
'1f4d66c2-0be0-4d3d-9f09-a0e7ca8eb6b9',
'01J3R2J7F1TQ0K6W6N8A4X9V7R',
'processing',
NOW()
)
ON CONFLICT (applicant_id, idempotency_key) DO NOTHING
RETURNING id;

Rate Limiting Strategy
| Layer | Strategy | Purpose |
|---|---|---|
| Client/API | Sliding window or token bucket per user/partner | Protect public APIs |
| Workflow dispatch | Global concurrency cap per workflow type | Avoid internal overload |
| Provider adapter | Token bucket per provider credential | Respect vendor limits/SLA |
| Retry subsystem | Retry budget + circuit breaker | Prevent retry storms |
Call out "retry budget" explicitly. Interviewers care less about one retry policy and more about proving retries cannot amplify outages.
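A retry budget is simple to state precisely: retries may add at most some fixed fraction of extra load on top of first-attempt traffic. A minimal sketch (the 10% ratio is an illustrative default, not a prescription):

```python
class RetryBudget:
    """Cap retries to a fraction of first-attempt requests, so that during
    a downstream outage retries add at most (ratio * 100)% extra load
    instead of multiplying it per retry attempt."""
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0   # first-attempt calls observed
        self.retries = 0    # retries granted so far

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False   # budget exhausted: fail fast, surface to workflow
```

When `can_retry()` returns False, the workflow should route to its compensation path (e.g., `manual_review`) rather than queue more retries, which is how the budget keeps an outage from cascading.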
Phase 5: Scaling & Trade-offs
Meeting Non-Functional Requirements
Correctness and Auditability
| Strategy | Why It Works |
|---|---|
| Single decision writer service | Prevents conflicting final states |
| Versioned rulesets + reason codes | Supports explainability and compliance |
| Append-only event log | Full audit and replay capability |
| Idempotency at API/workflow/provider layers | Prevents duplicate side effects |
Latency and Throughput
| Strategy | Why It Works |
|---|---|
| Async orchestration + immediate processing response | Removes long-tail provider latency from user request path |
| Parallel external checks | Shrinks critical path decision time |
| Status cache | Fast polling reads without hammering primary DB |
| Partitioned event topics by application_id | Scales consumers horizontally |
Bottlenecks and Mitigations
1. Slow or failing external providers
- Use per-provider circuit breakers and fallback providers
- Downgrade to `manual_review` when required checks exceed the timeout budget
- Keep provider adapters isolated so one vendor incident does not block all checks
2. Event backlog during traffic spikes
- Autoscale workers on consumer lag
- Separate high-priority topics (`decision.ready`) from bulk topics
- Apply admission control and return 429 for abusive partner traffic
3. Database write pressure (many status transitions)
- Persist events to Kafka first, then batch upsert derived state
- Use read replicas for status/history APIs
- Archive old events to cold storage for long retention
Trade-off: Workflow Engine vs Custom Orchestrator
| Option | Pros | Cons |
|---|---|---|
| Custom orchestration in app code | Full flexibility, fewer platform dependencies | Hard to reason about retries/timeouts/compensation; costly to maintain |
| Managed workflow engine (recommended) | Built-in state, retries, visibility, durability | Vendor lock-in, extra operational model |
Trade-off: Kafka Event Bus vs Direct RPC Chain
| Option | Pros | Cons |
|---|---|---|
| Direct synchronous RPC chain | Simple to start, easy tracing | Tight coupling, poor resilience to downstream failures |
| Event bus + async workers (recommended) | Loose coupling, buffering, replay, independent scaling | Eventual consistency and operational complexity |
Failure Mode Deep Dive
| Failure | Risk | Recovery |
|---|---|---|
| API retries from client timeout | Duplicate applications | Idempotency key returns existing application |
| Worker crash after provider call | Lost callback handling | Workflow timeout + re-query provider status |
| Kafka consumer outage | Decision delay | Replay from offsets after recovery |
| Rule deployment bug | Incorrect approvals/denials | Ruleset versioning + fast rollback + re-decision pipeline |
| Redis cache outage | Slow status reads | Serve from DB fallback; repopulate cache asynchronously |
A common mistake is letting each external callback decide independently. That creates race conditions and conflicting outcomes. Keep decision finalization centralized.
Interview Checklist
Requirements Phase
- Clarified that many downstream services are external and unreliable
- Set latency goals for both immediate response and final decision
- Included compliance/auditability as first-class requirements
Data Model Phase
- Defined `Application`, `RiskSignal`, `Decision`, and `WorkflowExecution`
- Showed clear state machine with `manual_review` path
- Included append-only events for audit/replay
API Design Phase
- Included idempotent submission API
- Included status + decision retrieval APIs
- Defined internal event contracts for workflow steps
High-Level Design Phase
- Used workflow engine to orchestrate async checks
- Decoupled with Kafka + worker pools
- Added rule engine, cache, rate limiting, and adapter gateway
Scaling Phase
- Covered retry storms, circuit breakers, and provider outages
- Discussed workflow-vs-custom and async-vs-sync trade-offs
- Explained single-writer decision finalization for correctness
Key Points to Emphasize
- Asynchronous by design: external checks are slow/unreliable, so orchestration must be durable and non-blocking.
- Idempotency everywhere: API submission, workflow execution, and provider calls all need dedupe.
- Centralized decision ownership: one service finalizes outcomes to avoid race conditions.
- Rate limiting + retry budgets: these prevent outages from cascading across downstream dependencies.
- Auditability is mandatory: versioned rules + append-only events make decisions explainable and reversible.