November 15, 2024 · 8 min read

Building Scalable Systems: Lessons from Processing 50M Events/Day

Distributed Systems · Architecture · Kafka · Performance

After spending 18 months building a data pipeline that processes over 50 million events per day at my previous role, I've collected a set of lessons that I wish someone had told me before I started. This isn't a "how to build X" tutorial — it's a reflection on the real decisions that shaped our architecture.

The Problem With Premature Optimization

When we started, our instinct was to reach for the most powerful tools immediately. Kafka, Kubernetes, distributed tracing — the whole nine yards. We were building for "scale" before we had any load to scale.

The mistake: We spent 3 months building infrastructure for 50M events/day when we had 50K events/day.

The right approach is to measure first, optimize second. Start simple. A single PostgreSQL instance can handle more than you think. Add complexity only when you have data that demands it.
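A quick back-of-the-envelope calculation makes the gap concrete. The sketch below converts the two daily volumes mentioned above into average per-second rates. The class and method names are illustrative, and the 10x peak factor is an assumption for the example, not a number measured on our system.

```java
final class ThroughputEstimate {
    private static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average events per second for a given daily volume. */
    static double averagePerSecond(long eventsPerDay) {
        return (double) eventsPerDay / SECONDS_PER_DAY;
    }

    /** Rough peak rate, assuming traffic bursts to peakFactor times the average. */
    static double peakPerSecond(long eventsPerDay, double peakFactor) {
        return averagePerSecond(eventsPerDay) * peakFactor;
    }

    public static void main(String[] args) {
        System.out.printf("50K/day: %.2f events/s on average%n", averagePerSecond(50_000L));
        System.out.printf("50M/day: %.0f events/s on average%n", averagePerSecond(50_000_000L));
        System.out.printf("50M/day at an assumed 10x peak: %.0f events/s%n",
                peakPerSecond(50_000_000L, 10.0));
    }
}
```

At 50K events/day the average is well under one event per second, and even 50M events/day averages under 600 per second. Numbers like these are worth comparing against a measured baseline before adding infrastructure.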

The Event Streaming Architecture

When we did need to scale, here's what our final architecture looked like:

Producer Services → Kafka (3 brokers) → Flink Jobs → TimescaleDB + Elasticsearch
                                        ↓
                                   Redis Cache

Each component was chosen for a specific reason:

Kafka for durability and replay capability — we needed to reprocess historical events during schema changes
Flink over Spark Streaming for true stateful stream processing with exactly-once semantics
TimescaleDB because time-series data has distinct access patterns, and TimescaleDB (a PostgreSQL extension) handles them well through hypertables, which partition data by time
Elasticsearch for full-text search across event metadata
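Of these properties, replay is the one that is easiest to underestimate, so here it is in miniature. The sketch below is an in-memory stand-in for an append-only log, not Kafka's client API: `ReplayableLog`, `append`, and `readFrom` are hypothetical names. The point it illustrates is that consumers track an offset, and reprocessing is just reading again from an earlier offset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ReplayableLog<E> {
    private final List<E> entries = new ArrayList<>();

    /** Appends an event and returns the offset it was stored at. */
    synchronized long append(E event) {
        entries.add(event);
        return entries.size() - 1;
    }

    /** Returns a snapshot of every event at or after the given offset, in order. */
    synchronized List<E> readFrom(long offset) {
        if (offset < 0 || offset > entries.size()) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        List<E> copy = new ArrayList<>(entries.subList((int) offset, entries.size()));
        return Collections.unmodifiableList(copy);
    }

    public static void main(String[] args) {
        ReplayableLog<String> log = new ReplayableLog<>();
        log.append("signup");
        log.append("click");
        long checkpoint = log.append("purchase");

        // Normal consumption continues from the last checkpoint...
        System.out.println(log.readFrom(checkpoint));
        // ...while a schema migration can rewind to offset 0 and reprocess everything.
        System.out.println(log.readFrom(0));
    }
}
```

Because reads never remove entries, a new consumer or a fixed one can start from any retained offset without coordinating with the producers.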

Key Lessons

1. Backpressure Is Your Best Friend

The single most important concept in stream processing is backpressure — the ability of a downstream consumer to signal that it's overwhelmed. Without it, your system will cascade-fail under load.

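Flink handles this natively: when a downstream operator falls behind, the slowdown propagates upstream through its network buffers until the sources slow down. Nothing has to be switched on. The settings we tuned alongside it were:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setBufferTimeout(100); // flush partially filled network buffers after at most 100 ms
env.setParallelism(8);     // default parallelism: 8 parallel subtasks per operator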
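To see backpressure in isolation, outside Flink, here is a minimal sketch using a bounded queue from the Java standard library. It is not how Flink implements it, and `BoundedStage` and its methods are illustrative names. When the buffer is full, the producer waits briefly and is then told the event was not accepted, instead of work piling up without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class BoundedStage<E> {
    private final BlockingQueue<E> buffer;

    BoundedStage(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Tries to hand an event downstream. Waits up to timeoutMillis while the
     * buffer is full; returns false if the consumer is still behind.
     */
    boolean submit(E event, long timeoutMillis) {
        try {
            return buffer.offer(event, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Takes the next event, or returns null if none arrives within the timeout. */
    E poll(long timeoutMillis) {
        try {
            return buffer.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    int pending() {
        return buffer.size();
    }

    public static void main(String[] args) {
        BoundedStage<String> stage = new BoundedStage<>(2);
        System.out.println(stage.submit("e1", 10)); // accepted
        System.out.println(stage.submit("e2", 10)); // accepted, buffer is now full
        System.out.println(stage.submit("e3", 10)); // not accepted: the producer feels the pressure
        stage.poll(10);                             // consumer catches up by one event
        System.out.println(stage.submit("e3", 10)); // accepted again
    }
}
```

The important design choice is the fixed capacity. An unbounded queue would accept everything and hide the problem until memory ran out.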

2. Schema Evolution Is Hard — Plan For It

We had 3 major schema changes in 18 months. The first one took 2 weeks to migrate. The third took 2 hours. Here's what changed:

We adopted Apache Avro with a schema registry
Every event carries a schema version ID
Consumers can decode any version using the registry

This sounds like overhead. It paid back 10x.
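The mechanics are simpler than they sound. The sketch below is a toy stand-in, not Avro and not a real schema registry client: `SchemaRegistrySketch`, `register`, and `decode` are hypothetical names, and payloads are plain comma-separated strings to keep the example self-contained. It shows the core idea: the version ID carried by each event selects the decoder, so old and new events can coexist in the same stream.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class SchemaRegistrySketch {
    private final Map<Integer, Function<String, Map<String, String>>> decoders = new HashMap<>();

    void register(int schemaVersion, Function<String, Map<String, String>> decoder) {
        decoders.put(schemaVersion, decoder);
    }

    /** Decodes a payload using the decoder registered for its schema version. */
    Map<String, String> decode(int schemaVersion, String payload) {
        Function<String, Map<String, String>> decoder = decoders.get(schemaVersion);
        if (decoder == null) {
            throw new IllegalArgumentException("unknown schema version: " + schemaVersion);
        }
        return decoder.apply(payload);
    }

    /** Two made-up versions of the same event; v2 adds a "region" field. */
    static SchemaRegistrySketch withExampleSchemas() {
        SchemaRegistrySketch registry = new SchemaRegistrySketch();
        registry.register(1, payload -> {            // v1 layout: userId,action
            String[] f = payload.split(",", -1);
            Map<String, String> event = new HashMap<>();
            event.put("userId", f[0]);
            event.put("action", f[1]);
            event.put("region", "unknown");          // default for the field added in v2
            return event;
        });
        registry.register(2, payload -> {            // v2 layout: userId,action,region
            String[] f = payload.split(",", -1);
            Map<String, String> event = new HashMap<>();
            event.put("userId", f[0]);
            event.put("action", f[1]);
            event.put("region", f[2]);
            return event;
        });
        return registry;
    }

    public static void main(String[] args) {
        SchemaRegistrySketch registry = withExampleSchemas();
        System.out.println(registry.decode(1, "u1,click"));
        System.out.println(registry.decode(2, "u1,click,eu"));
    }
}
```

Consumers written against the newest shape keep working on old events because the defaults live in one place, the decoder for that version, rather than being scattered through consumer code.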

3. Monitor Everything, Alert on Almost Nothing

We had 200+ metrics exposed via Prometheus. We alerted on 4:

Consumer lag > 10K events
End-to-end latency > 500ms (P99)
Error rate > 0.1%
Disk utilization > 80%

Alert fatigue is real. If your team ignores alerts, they're worse than no alerts.
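Four rules are few enough to write down in one place. The following sketch is plain Java rather than Prometheus alerting rule syntax. The `AlertRules` and `Snapshot` names and the alert labels are made up for illustration; only the thresholds come from the list above.

```java
import java.util.ArrayList;
import java.util.List;

final class AlertRules {
    static final long MAX_CONSUMER_LAG_EVENTS = 10_000L;
    static final double MAX_P99_LATENCY_MS = 500.0;
    static final double MAX_ERROR_RATE = 0.001;       // 0.1%
    static final double MAX_DISK_UTILIZATION = 0.80;  // 80%

    /** One point-in-time reading of the four signals (hypothetical shape). */
    static final class Snapshot {
        final long consumerLagEvents;
        final double p99LatencyMs;
        final double errorRate;        // fraction, 0.0 to 1.0
        final double diskUtilization;  // fraction, 0.0 to 1.0

        Snapshot(long consumerLagEvents, double p99LatencyMs,
                 double errorRate, double diskUtilization) {
            this.consumerLagEvents = consumerLagEvents;
            this.p99LatencyMs = p99LatencyMs;
            this.errorRate = errorRate;
            this.diskUtilization = diskUtilization;
        }
    }

    /** Returns the names of the rules whose threshold is strictly exceeded. */
    static List<String> firing(Snapshot s) {
        List<String> alerts = new ArrayList<>();
        if (s.consumerLagEvents > MAX_CONSUMER_LAG_EVENTS) alerts.add("consumer_lag");
        if (s.p99LatencyMs > MAX_P99_LATENCY_MS) alerts.add("p99_latency");
        if (s.errorRate > MAX_ERROR_RATE) alerts.add("error_rate");
        if (s.diskUtilization > MAX_DISK_UTILIZATION) alerts.add("disk_utilization");
        return alerts;
    }

    public static void main(String[] args) {
        System.out.println(firing(new Snapshot(1_200L, 87.0, 0.0002, 0.55)));
        System.out.println(firing(new Snapshot(25_000L, 640.0, 0.0002, 0.55)));
    }
}
```

Everything else stays a dashboard metric: available when debugging, but never paging anyone.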

The Latency Journey

Here's how our P99 latency evolved:

| Phase | Architecture | P99 Latency |
|-------|-------------|-------------|
| v1 | Monolith + PostgreSQL polling | 2,400ms |
| v2 | Kafka + custom consumer | 350ms |
| v3 | Kafka + Flink + Redis cache | 87ms |
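For readers less familiar with the metric: P99 is the latency that 99% of requests complete within. Here is a minimal sketch using the nearest-rank definition, which is one of several common ones. Monitoring systems often estimate percentiles from histograms instead, so this is not necessarily how the numbers above were computed, and the `Percentile` class is an illustrative name.

```java
import java.util.Arrays;

final class Percentile {
    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static double nearestRank(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]: " + p);
        }
        double[] sorted = samples.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 15, 11, 14, 950, 13, 16, 12, 14, 13};
        System.out.println("P50: " + nearestRank(latenciesMs, 50));
        System.out.println("P99: " + nearestRank(latenciesMs, 99));
    }
}
```

The example data shows why tail percentiles matter: one slow request barely moves the median but dominates the P99.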

The biggest single improvement was adding a Redis cache layer between Flink and the read path. 80% of reads were for data processed in the last 5 minutes — keeping that hot in Redis was a simple win.
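The caching idea can be sketched without Redis. The class below is an in-memory illustration, not what we ran in production and not a Redis client: `HotCache` and its methods are hypothetical, and the clock is injected so the expiry logic is easy to test. Entries are served only while they are younger than the time-to-live; anything older is treated as a miss and would fall through to the database.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

final class HotCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long writtenAtMillis;

        Entry(V value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    HotCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong()));
    }

    /** Returns the value if it was written less than ttlMillis ago, otherwise empty. */
    Optional<V> get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() - entry.writtenAtMillis >= ttlMillis) {
            entries.remove(key, entry); // lazily evict the stale entry
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        HotCache<String, String> cache = new HotCache<>(fiveMinutes, System::currentTimeMillis);
        cache.put("event:42", "{\"action\":\"click\"}");
        System.out.println(cache.get("event:42").isPresent()); // fresh entry: a hit
        System.out.println(cache.get("event:43").isPresent()); // never written: a miss
    }
}
```

With Redis itself, the same effect comes from setting an expiry on each key at write time, so there is no eviction code to maintain.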

What I'd Do Differently

Invest in observability earlier. We added distributed tracing in month 8. It should have been day 1.
Document architecture decisions as ADRs. Six months later, you won't remember why you made that trade-off.
Build for operational simplicity. The fanciest architecture means nothing if your on-call engineer can't debug it at 3am.

Building systems at this scale is genuinely fun when you approach it with the right mindset. The tools are getting better every year. But the fundamentals — understanding your data, measuring before optimizing, and keeping operations simple — those never change.

If you have questions or want to discuss architecture decisions, feel free to reach out.
