System Design · Software Engineer

Credit Approval Risk Engine

Role: Software Engineer


Design a system that evaluates credit card applications and decides whether to approve, deny, or send them to manual review. The system calls many external providers (credit bureau, KYC/AML, fraud, income/employment verification), so downstream integrations are slow and failure-prone. The core challenge is decoupling these calls with async orchestration while preserving correctness.

This walkthrough follows the Interview Framework. Use it as a guide, not a script.

This is intentionally a "general" interview question without one fixed answer. Interviewers usually look for good decomposition, idempotent workflows, failure handling, and practical trade-offs.

Phase 1: Requirements

Functional Requirements

  • Applicants should be able to submit a new credit card application and receive an application ID immediately

  • The system should collect risk signals from multiple external services asynchronously (bureau, fraud, KYC, income)

  • The system should evaluate eligibility rules and risk score to produce approved, denied, or manual_review

  • Applicants and support agents should be able to check real-time application status and final decision

  • The system should support retries, provider failover, and manual re-evaluation without duplicate external side effects

Do not block the user request on all downstream calls finishing. Return quickly with processing, then complete decisioning asynchronously.

Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Correctness | No duplicate applications or duplicate external submissions | Financial and compliance-sensitive flow |
| Decision latency | p95 under 10s for auto-decision path; under 5 minutes for async completion path | User experience + business conversion |
| Availability | 99.95% for application submission/status APIs | Onboarding is revenue-critical |
| Auditability | Full trace of inputs, rules, scores, and decision version | Regulatory/compliance requirements |
| Scalability | 2M applications/day, bursty partner traffic | Must absorb partner campaigns and spikes |

Capacity Estimation

| Metric | Value |
|---|---|
| Applications/day | 2,000,000 |
| Average submit QPS | ~23 QPS |
| Peak submit QPS | 500+ QPS |
| External checks per application | 4-8 calls |
| Peak downstream call throughput | 2,000-4,000 calls/s |
| Application record size | ~3 KB hot row + ~10 KB signals/audit |
| Annual storage (raw + audit) | ~9-12 TB/year (compressed event log + DB) |

Call out that average QPS is small, but downstream fanout multiplies load. External dependencies, not API CPU, are the real bottleneck.
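The arithmetic behind the table is worth being able to reproduce on the spot. A quick sanity check, with all inputs taken from the figures above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

apps_per_day = 2_000_000
avg_qps = apps_per_day / SECONDS_PER_DAY  # ~23.1 — tiny for an API tier

# Fanout: each application triggers 4-8 external calls, so peak downstream
# throughput is the peak submit rate multiplied by the fanout range.
peak_submit_qps = 500
peak_downstream = (peak_submit_qps * 4, peak_submit_qps * 8)  # 2,000-4,000 calls/s

# Storage: ~3 KB hot row + ~10 KB signals/audit per application.
bytes_per_app = 13 * 1024
annual_tb = apps_per_day * 365 * bytes_per_app / 1024**4  # ~8.8 TB/year, pre-replication
```

The point of showing this math is the fanout line: a ~23 QPS API produces thousands of downstream calls per second at peak.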

Phase 2: Data Model

Core Entities

Application State Machine

Never let callbacks write approved/denied directly. Callbacks only update RiskSignal; a single decision service must own final state transitions.
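A minimal sketch of that guard, assuming the state names used in this walkthrough (submitted, processing, approved, denied, manual_review) — the transition table itself is illustrative:

```python
# Hypothetical transition table owned by the decision service; callbacks
# never call transition() — they only upsert RiskSignal rows.
ALLOWED = {
    "submitted": {"processing"},
    "processing": {"approved", "denied", "manual_review"},
    "manual_review": {"approved", "denied"},
    "approved": set(),   # terminal
    "denied": set(),     # terminal
}

def transition(current: str, target: str) -> str:
    """Reject any transition not in the table (single-writer enforcement)."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```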

Phase 3: API Design

Protocol Choice

  • REST for external application submission and status lookup

  • Event-driven async processing for internal workflow steps and provider callbacks

  • Workflow orchestration API (Step Functions/Temporal style) for long-running stateful execution

External APIs

```http
POST /api/credit-applications
Headers:
  Idempotency-Key: 01J3R2J7F1TQ0K6W6N8A4X9V7R
Request:
{
  "product_type": "cashback_card",
  "applicant": {
    "legal_name": "Jane Doe",
    "dob": "1994-04-08",
    "ssn_last4": "1234",
    "annual_income": 180000
  }
}
Response:
{
  "application_id": "app_123",
  "status": "processing",
  "next_poll_after_seconds": 3
}

GET /api/credit-applications/{application_id}
Response:
{
  "application_id": "app_123",
  "status": "manual_review",
  "latest_step": "fraud_check_timeout",
  "updated_at": "2026-02-10T19:40:01Z"
}

GET /api/credit-applications/{application_id}/decision
Response:
{
  "application_id": "app_456",
  "outcome": "approved",
  "risk_score": 0.18,
  "reason_codes": ["BUREAU_OK", "FRAUD_LOW", "INCOME_VERIFIED"]
}
```

Internal Event Contracts

```text
Topic: credit.application.submitted
Message: { application_id, applicant_id, idempotency_key, submitted_at }

Topic: credit.external.requested
Message: { application_id, provider, provider_request_id, attempt, timeout_ms }

Topic: credit.external.completed
Message: { application_id, provider, provider_request_id, status, signal_payload, received_at }

Topic: credit.decision.ready
Message: { application_id, required_signals_complete, missing_providers[] }

Topic: credit.decision.finalized
Message: { application_id, outcome, risk_score, reason_codes, ruleset_version }
```

Provider Callback API

```http
POST /internal/providers/{provider}/callbacks
Request:
{
  "provider_request_id": "bureau_req_789",
  "status": "success",
  "payload": { ...provider_specific_fields... },
  "signature": "hmac..."
}
```
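One way to check that signature field, sketched with Python's standard hmac module. It assumes an HMAC-SHA256 scheme over the raw request body with a shared per-provider secret; the secret and payload below are illustrative:

```python
import hashlib
import hmac
import json

def verify_callback(raw_body: bytes, signature_hex: str, provider_secret: bytes) -> bool:
    """Recompute the HMAC over the raw body and compare in constant time,
    so timing differences don't leak signature bytes."""
    expected = hmac.new(provider_secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# Illustrative usage: the adapter gateway and provider share the secret,
# so a callback can be authenticated before any state is touched.
secret = b"per-provider-shared-secret"  # hypothetical; lives in a secrets manager
body = json.dumps({"provider_request_id": "bureau_req_789", "status": "success"}).encode()
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
```

Verify before parsing: an unauthenticated callback should be rejected without ever reaching the Signal Processor.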

Phase 4: High-Level Design

Architecture Overview

Component Responsibilities

Application API

  • Validates schema and auth

  • Performs idempotency check before creating an application

  • Writes initial record + outbox event in a single transaction

  • Returns immediately with processing

Outbox Relay

  • Polls unpublished outbox rows from the database

  • Publishes durable application.submitted events to Kafka

  • Marks events as published to guarantee at-least-once delivery across API/Kafka failures
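The relay loop can be sketched in a few lines. This in-memory version stands in for the real outbox table and Kafka producer; publish is a placeholder for the broker client:

```python
def relay_once(outbox_rows, publish):
    """Publish unpublished rows, then mark them published.

    Publishing BEFORE marking is what gives at-least-once delivery: a crash
    between the two steps causes a re-publish on the next poll, never a
    lost event. Consumers must therefore tolerate duplicates.
    """
    relayed = []
    for row in outbox_rows:
        if row["published"]:
            continue
        publish(row["topic"], row["payload"])  # may duplicate on crash/retry
        row["published"] = True                # mark only after the broker ack
        relayed.append(row["id"])
    return relayed
```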

Workflow Orchestrator (Step Functions/Temporal style)

  • Maintains per-application workflow state

  • Schedules steps: bureau check, KYC, fraud, income verification

  • Applies timeout/retry policies per provider

  • Supports compensation paths (e.g., mark manual review when mandatory provider fails)

Worker Pools

  • Pull work from workflow tasks, call adapters, and publish completion events

  • Horizontally scalable and isolated by provider type

  • Retries with exponential backoff and jitter
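A sketch of that backoff schedule, using the common "full jitter" variant; the base and cap values are illustrative:

```python
import random

def backoff_delays(base_ms: float = 200, cap_ms: float = 30_000, attempts: int = 5):
    """Yield one delay per attempt, uniform in [0, min(cap, base * 2^n)].

    Full jitter decorrelates retries across many workers, so a provider
    blip doesn't produce synchronized retry waves."""
    for attempt in range(attempts):
        ceiling = min(cap_ms, base_ms * (2 ** attempt))
        yield random.uniform(0, ceiling)
```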

External Adapter Gateway

  • Normalizes provider-specific APIs

  • Enforces per-provider token bucket rate limits

  • Adds circuit breaker + fallback provider routing

  • Signs and verifies callbacks

Signal Processor

  • Consumes credit.external.completed and upserts RiskSignal

  • Detects readiness criteria and emits credit.decision.ready

  • Updates cached status snapshots for low-latency polling
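The readiness check reduces to a set comparison. Which providers are required is a policy decision, so the REQUIRED set and status names below are assumptions:

```python
# Assumed policy: bureau, KYC, and fraud are mandatory; income may be optional.
REQUIRED = {"bureau", "kyc", "fraud"}

def decision_ready(signals: dict) -> tuple:
    """signals maps provider -> status. Ready once every required provider has
    reached a terminal status (success or failed); missing providers are
    reported so the orchestrator can escalate or apply its timeout policy."""
    terminal = {p for p, s in signals.items() if s in ("success", "failed")}
    missing = sorted(REQUIRED - terminal)
    return (not missing, missing)
```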

Rule Engine

  • Stores deterministic eligibility and policy rules (hard cutoffs, compliance checks)

  • Versioned rulesets so decisions are reproducible in audits

  • Produces reason codes used in adverse action explanations

Decision Service

  • Waits for required signals or timeout policy

  • Computes final outcome using rule engine + optional model score

  • Ensures single-writer decision finalization (approved/denied/manual_review)
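Single-writer finalization is often enforced with a compare-and-set update rather than a lock: only a row still in processing can be finalized, so two racing finalizers cannot both win. A sketch using SQLite as a stand-in for the primary database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applications (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO applications VALUES ('app_123', 'processing')")

def finalize(conn, app_id: str, outcome: str) -> bool:
    """Compare-and-set: the WHERE clause makes the transition atomic, so a
    concurrent finalizer sees rowcount == 0 and backs off."""
    cur = conn.execute(
        "UPDATE applications SET status = ? WHERE id = ? AND status = 'processing'",
        (outcome, app_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

The losing writer gets False and must not overwrite — it reads the winning outcome instead.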

Status Cache

  • Caches latest application status for low-latency polling APIs

  • TTL and event-driven invalidation on state transitions

Data Flow: Async Decisioning

Idempotency Strategy

Three layers:

  • API idempotency key: UNIQUE(applicant_id, idempotency_key) prevents duplicate applications

  • Workflow execution dedupe: one workflow run per application_id; duplicate start events are ignored

  • Provider request idempotency: deterministic provider_request_id ({application_id}:{provider}:{attempt}) avoids duplicate chargeable calls

```sql
CREATE UNIQUE INDEX uq_applicant_idempotency
ON credit_applications (applicant_id, idempotency_key);

INSERT INTO credit_applications (
  id, applicant_id, idempotency_key, status, created_at
)
VALUES (
  '6f8b30ce-b3c2-4709-aab6-8927bde5f6ef',
  '1f4d66c2-0be0-4d3d-9f09-a0e7ca8eb6b9',
  '01J3R2J7F1TQ0K6W6N8A4X9V7R',
  'processing',
  NOW()
)
ON CONFLICT (applicant_id, idempotency_key) DO NOTHING
RETURNING id;
```
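One gotcha with PostgreSQL's ON CONFLICT ... DO NOTHING RETURNING id: on a conflict, no row is returned, so the caller must fall back to a SELECT to recover the existing application id. A sketch of that read-back pattern, using SQLite's INSERT OR IGNORE as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE credit_applications (
        id TEXT PRIMARY KEY,
        applicant_id TEXT,
        idempotency_key TEXT,
        status TEXT,
        UNIQUE (applicant_id, idempotency_key)
    )
""")

def submit(conn, app_id: str, applicant_id: str, idem_key: str) -> str:
    """Insert-if-absent, then read back: a retried request returns the
    original application id instead of creating a duplicate."""
    conn.execute(
        "INSERT OR IGNORE INTO credit_applications VALUES (?, ?, ?, 'processing')",
        (app_id, applicant_id, idem_key),
    )
    row = conn.execute(
        "SELECT id FROM credit_applications WHERE applicant_id = ? AND idempotency_key = ?",
        (applicant_id, idem_key),
    ).fetchone()
    return row[0]
```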

Rate Limiting Strategy

| Layer | Strategy | Purpose |
|---|---|---|
| Client/API | Sliding window or token bucket per user/partner | Protect public APIs |
| Workflow dispatch | Global concurrency cap per workflow type | Avoid internal overload |
| Provider adapter | Token bucket per provider credential | Respect vendor limits/SLA |
| Retry subsystem | Retry budget + circuit breaker | Prevent retry storms |

Call out "retry budget" explicitly. Interviewers care less about one retry policy and more about proving retries cannot amplify outages.
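A retry budget can be as simple as a ratio cap: retries are granted only while they stay below a fixed fraction of total request volume. The 10% ratio below is illustrative:

```python
class RetryBudget:
    """Cap retries at `ratio` of total requests. Under a provider outage,
    retry traffic is bounded instead of multiplying the load."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        """Grant a retry token only while the budget has headroom."""
        if self.retries < self.ratio * self.requests:
            self.retries += 1
            return True
        return False  # budget exhausted: fail fast or route to manual_review
```

A production version would track these counters over a sliding window so the budget recovers after an incident; this sketch keeps lifetime totals for clarity.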

Phase 5: Scaling & Trade-offs

Meeting Non-Functional Requirements

Correctness and Auditability

| Strategy | Why It Works |
|---|---|
| Single decision writer service | Prevents conflicting final states |
| Versioned rulesets + reason codes | Supports explainability and compliance |
| Append-only event log | Full audit and replay capability |
| Idempotency at API/workflow/provider layers | Prevents duplicate side effects |

Latency and Throughput

| Strategy | Why It Works |
|---|---|
| Async orchestration + immediate processing response | Removes long-tail provider latency from user request path |
| Parallel external checks | Shrinks critical-path decision time |
| Status cache | Fast polling reads without hammering primary DB |
| Partitioned event topics by application_id | Scales consumers horizontally |

Bottlenecks and Mitigations

1. Slow or failing external providers

  • Use per-provider circuit breakers and fallback providers

  • Downgrade to manual_review when required checks exceed timeout budget

  • Keep provider adapters isolated so one vendor incident does not block all checks

2. Event backlog during traffic spikes

  • Autoscale workers on consumer lag

  • Separate high-priority topics (decision.ready) from bulk topics

  • Apply admission control and 429 for abusive partner traffic

3. Database write pressure (many status transitions)

  • Persist events to Kafka first, then batch upsert derived state

  • Use read replicas for status/history APIs

  • Archive old events to cold storage for long retention

Trade-off: Workflow Engine vs Custom Orchestrator

| Option | Pros | Cons |
|---|---|---|
| Custom orchestration in app code | Full flexibility, fewer platform dependencies | Hard to reason about retries/timeouts/compensation; costly to maintain |
| Managed workflow engine (recommended) | Built-in state, retries, visibility, durability | Vendor lock-in, extra operational model |

Trade-off: Kafka Event Bus vs Direct RPC Chain

| Option | Pros | Cons |
|---|---|---|
| Direct synchronous RPC chain | Simple to start, easy tracing | Tight coupling, poor resilience to downstream failures |
| Event bus + async workers (recommended) | Loose coupling, buffering, replay, independent scaling | Eventual consistency and operational complexity |

Failure Mode Deep Dive

| Failure | Risk | Recovery |
|---|---|---|
| API retries from client timeout | Duplicate applications | Idempotency key returns existing application |
| Worker crash after provider call | Lost callback handling | Workflow timeout + re-query provider status |
| Kafka consumer outage | Decision delay | Replay from offsets after recovery |
| Rule deployment bug | Incorrect approvals/denials | Ruleset versioning + fast rollback + re-decision pipeline |
| Redis cache outage | Slow status reads | Serve from DB fallback; repopulate cache asynchronously |

A common mistake is letting each external callback decide independently. That creates race conditions and conflicting outcomes. Keep decision finalization centralized.

Interview Checklist

Requirements Phase

  • Clarified that many downstream services are external and unreliable

  • Set latency goals for both immediate response and final decision

  • Included compliance/auditability as first-class requirements

Data Model Phase

  • Defined Application, RiskSignal, Decision, and WorkflowExecution

  • Showed clear state machine with manual_review path

  • Included append-only events for audit/replay

API Design Phase

  • Included idempotent submission API

  • Included status + decision retrieval APIs

  • Defined internal event contracts for workflow steps

High-Level Design Phase

  • Used workflow engine to orchestrate async checks

  • Decoupled with Kafka + worker pools

  • Added rule engine, cache, rate limiting, and adapter gateway

Scaling Phase

  • Covered retry storms, circuit breakers, and provider outages

  • Discussed workflow-vs-custom and async-vs-sync trade-offs

  • Explained single-writer decision finalization for correctness

Key Points to Emphasize

  • Asynchronous by design: external checks are slow/unreliable, so orchestration must be durable and non-blocking.

  • Idempotency everywhere: API submission, workflow execution, and provider calls all need dedupe.

  • Centralized decision ownership: one service finalizes outcomes to avoid race conditions.

  • Rate limiting + retry budgets: these prevent outages from cascading across downstream dependencies.

  • Auditability is mandatory: versioned rules + append-only events make decisions explainable and reversible.