Category: System Design

Metric Counter Library

Role: Software Engineer


Problem Statement

Design a library that enables services to collect and aggregate metrics. This is a foundational component used by many production systems to track events, measure performance, and power dashboards and alerting.

Your library should provide a simple interface for services to record and query metric values. The core functionality includes:

  • Increment counters: Record occurrences of events (e.g., API calls, errors, user actions)

  • Time-windowed aggregation: Query metrics over specific time windows (last minute, last hour, last day)

  • Multi-dimensional metrics: Support tagging metrics with dimensions (e.g., endpoint, region, status code)

Core API Design

```java
// Record a metric increment
void increment(String metricName);
void increment(String metricName, Map<String, String> tags);
void increment(String metricName, Map<String, String> tags, long value);

// Query aggregated metrics
long getCount(String metricName, TimeWindow window);
long getCount(String metricName, Map<String, String> tags, TimeWindow window);

// Example usage:
counter.increment("api.requests", Map.of("endpoint", "/payments", "status", "200"));
long lastMinute = counter.getCount("api.requests", TimeWindow.LAST_MINUTE);
```
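To ground the API, here is a minimal sketch of the increment path only. It assumes a hypothetical `MetricCounter` class backed by `ConcurrentHashMap` and `LongAdder` (these names and the tag-key encoding are illustrative, not part of the stated API), and it deliberately ignores time windows, which are a separate concern:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: lock-free counters keyed by metric name plus sorted tags.
// LongAdder keeps the hot increment path cheap under write contention.
public class MetricCounter {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    public void increment(String metricName) {
        increment(metricName, Map.of(), 1L);
    }

    public void increment(String metricName, Map<String, String> tags) {
        increment(metricName, tags, 1L);
    }

    public void increment(String metricName, Map<String, String> tags, long value) {
        counters.computeIfAbsent(key(metricName, tags), k -> new LongAdder()).add(value);
    }

    public long getCount(String metricName, Map<String, String> tags) {
        LongAdder adder = counters.get(key(metricName, tags));
        return adder == null ? 0L : adder.sum();
    }

    // Sort tags so {a=1, b=2} and {b=2, a=1} map to the same series.
    private static String key(String name, Map<String, String> tags) {
        return name + new TreeMap<>(tags);
    }
}
```

`LongAdder` over `AtomicLong` is a deliberate choice here: it trades a slightly more expensive read (`sum()`) for much lower contention on writes, which matches the write-heavy workload of a metrics library.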

Key Design Considerations

  • Performance: How do you handle high-throughput metric recording without impacting the host service's latency?

  • Memory efficiency: How do you bound memory usage while supporting multiple time windows?

  • Accuracy vs. resource tradeoffs: Is approximate counting acceptable? When would you use probabilistic data structures?

  • Flush and persistence: How and when do metrics get exported to a central aggregation system?
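One common way to address both the memory-bound and time-window considerations above is a fixed ring of per-second buckets: memory is proportional to the window size, not uptime, and stale slots are recycled lazily on write. A sketch, assuming a hypothetical `SlidingWindowCounter` (all names are my own, and the time source is passed in explicitly for testability):

```java
// Illustrative sketch: sliding-window count over a fixed ring of per-second
// buckets. A slot is reused once its second falls out of the window, so
// memory is bounded by windowSeconds regardless of how long the process runs.
public class SlidingWindowCounter {
    private final long[] buckets;       // count recorded in each slot
    private final long[] bucketEpochs;  // which absolute second each slot currently holds
    private final int windowSeconds;

    public SlidingWindowCounter(int windowSeconds) {
        this.windowSeconds = windowSeconds;
        this.buckets = new long[windowSeconds];
        this.bucketEpochs = new long[windowSeconds];
    }

    public synchronized void increment(long nowMillis, long value) {
        long second = nowMillis / 1000;
        int slot = (int) (second % windowSeconds);
        if (bucketEpochs[slot] != second) { // slot holds a stale second: recycle it
            bucketEpochs[slot] = second;
            buckets[slot] = 0;
        }
        buckets[slot] += value;
    }

    public synchronized long getCount(long nowMillis) {
        long second = nowMillis / 1000;
        long sum = 0;
        for (int i = 0; i < windowSeconds; i++) {
            if (second - bucketEpochs[i] < windowSeconds) { // bucket still inside the window
                sum += buckets[i];
            }
        }
        return sum;
    }
}
```

The window slides at one-second granularity, which is usually an acceptable approximation; finer buckets buy accuracy at the cost of memory, a tradeoff worth stating explicitly in the interview.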

Potential Follow-up Questions

During the interview, expect the interviewer to probe deeper on specific areas:

  • Time Window Implementation: How do you implement sliding windows vs. tumbling windows? What are the tradeoffs?

  • High Cardinality: How do you handle metrics with many unique tag combinations (e.g., per-user metrics)?

  • Thread Safety: How do you ensure correctness when multiple threads record metrics concurrently?

  • Batching and Flushing: How often does the library flush metrics to a backend? What happens during flush failures?

  • Clock Skew: How do you handle time-based windows when system clocks may drift?

  • Aggregation Strategy: Do you aggregate locally before sending, or send raw events? What are the tradeoffs?

  • Memory Bounds: How do you prevent unbounded memory growth from old time windows or high-cardinality tags?

  • Backend Integration: How does the library integrate with systems like Prometheus, StatsD, or custom aggregation services?
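For the high-cardinality and memory-bounds follow-ups, one simple defense is a hard cap on distinct series per metric, with overflow folded into a sentinel bucket. A sketch under assumed names (the `_overflow_` sentinel, the cap, and the pre-encoded series key are all illustrative choices, not part of the problem statement):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch: caps the number of distinct tag combinations tracked.
// Once the cap is reached, new combinations collapse into "_overflow_", so
// memory stays bounded while already-tracked series remain accurate.
public class CardinalityLimitedCounter {
    private static final String OVERFLOW = "_overflow_";
    private final int maxSeries;
    private final ConcurrentHashMap<String, LongAdder> series = new ConcurrentHashMap<>();

    public CardinalityLimitedCounter(int maxSeries) {
        this.maxSeries = maxSeries;
    }

    public void increment(String seriesKey, long value) {
        LongAdder adder = series.get(seriesKey);
        if (adder == null) {
            // Best-effort check: under concurrency the cap may overshoot slightly,
            // which is an acceptable tradeoff for keeping the hot path lock-free.
            if (series.size() >= maxSeries) {
                seriesKey = OVERFLOW;
            }
            adder = series.computeIfAbsent(seriesKey, k -> new LongAdder());
        }
        adder.add(value);
    }

    public long getCount(String seriesKey) {
        LongAdder adder = series.get(seriesKey);
        return adder == null ? 0L : adder.sum();
    }
}
```

Collapsing overflow into a sentinel preserves the metric's total while sacrificing per-tag resolution for the long tail, which is typically the right default; alternatives worth mentioning are rejecting new series outright or switching to probabilistic sketches.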

Since this is a well-documented system design problem, here are some excellent resources that cover the architecture and implementation details:

  • Distributed Logging & Metrics Framework - YouTube - Systems design interview walkthrough with Ex-Google SWE

  • Ad Click Aggregator System Design - System Design Handbook - Deep dive into time-window based counting and aggregation patterns

Interview Experience

This question tests practical knowledge of metrics systems that many engineers use daily. Candidates who have experience with monitoring libraries (Micrometer, StatsD clients, Prometheus client libraries) tend to perform well. Focus on explaining concrete implementation details for time-windowed counting and discuss real-world constraints like memory limits and flush reliability.