Early in 2023 our team set out to ingest metrics from thousands of edge devices. The prototype processed events in a single Node.js process and melted down once throughput exceeded a few thousand messages per second. Memory pressure, blocking CPU work, and lack of observability made the system unpredictable.
Understanding the event loop
The first exercise was mapping the life of a request through libuv's event loop. We classified work into three buckets:
- Purely asynchronous I/O such as network calls and disk writes.
- CPU bound transformations like protobuf decoding.
- Coordinated state mutations in Postgres.
By measuring time spent in each phase, we realized that only 30% of our CPU time was doing meaningful work; the rest was spent waiting on mutexes or userland queues. Flame graphs from 0x and clinic.js highlighted hotspots in JSON parsing and regex validation.
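Event loop lag in particular turned out to be cheap to measure continuously. Here is a minimal sketch using node:perf_hooks; the sampling resolution and warning threshold are illustrative, not the values we shipped.

```js
// Sample event loop delay with the built-in histogram from node:perf_hooks.
import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
histogram.enable();

setInterval(() => {
  // the histogram reports nanoseconds; convert to milliseconds for logging
  const p99Ms = histogram.percentile(99) / 1e6;
  if (p99Ms > 50) console.warn(`event loop p99 delay ${p99Ms.toFixed(1)} ms`);
  histogram.reset();
}, 10_000);
```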
Clustering and worker threads
We adopted a hybrid strategy that used the cluster module for process-level parallelism and worker_threads for CPU-intensive sections. The primary process owned the TCP sockets and balanced connections across workers using SO_REUSEPORT.
```js
import cluster from 'node:cluster';
import http from 'node:http';
import { availableParallelism } from 'node:os';

if (cluster.isPrimary) {
  // fork one worker per available CPU core
  const cpus = availableParallelism();
  for (let i = 0; i < cpus; i++) cluster.fork();
} else {
  // each worker exposes an HTTP server and consumes jobs from Redis
  http.createServer((req, res) => res.end('ok')).listen(3000);
}
```

This alone multiplied throughput by the number of cores. For background computations we spun up a pool of worker threads backed by Piscina, which let us offload CPU hotspots without blocking the event loop.
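Below is a sketch of how such a pool can be wired up with Piscina; the worker filename, thread count, and task shape are illustrative rather than our exact setup.

```js
// decode-worker.js (runs on a worker thread): export the task function as default.
// A real worker would do the protobuf decoding; this placeholder just reports sizes.
export default function decode(buffers) {
  return buffers.map((buf) => buf.length);
}
```

```js
// main thread: offload CPU-heavy decoding to a Piscina worker pool.
import Piscina from 'piscina';

const pool = new Piscina({
  filename: new URL('./decode-worker.js', import.meta.url).href,
  maxThreads: 4, // leave headroom for the event loop and other cluster workers
});

export function decodeBatch(buffers) {
  // run() dispatches the task to an idle worker thread and resolves with its return value
  return pool.run(buffers);
}
```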
Orchestrating work with Redis
Network bursts routinely exceeded our processing capacity. Rather than reject traffic, we introduced a Redis-backed queue built on BullMQ. Producers pushed tasks with deduplication keys so that idempotent operations collapsed into a single entry. Workers pulled jobs in batches to reduce round trips.
```js
// jobId doubles as a deduplication key: re-adding a job with the same id collapses into one entry
queue.add('ingest', payload, { jobId: payload.id });
```

Backpressure was implemented by tracking queue depth and dynamically adjusting the number of active workers. Once the queue exceeded a million items, the API started returning 429 responses, signalling clients to retry with jitter.
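The shedding path looked roughly like the sketch below; the threshold, polling interval, and bare http server are illustrative (the real API sat behind more middleware), and the Redis connection details are placeholders.

```js
import http from 'node:http';
import { Queue } from 'bullmq';

const queue = new Queue('ingest', { connection: { host: 'localhost', port: 6379 } });
const MAX_DEPTH = 1_000_000;
let shedding = false;

// Poll queue depth on an interval instead of hitting Redis on every request.
setInterval(async () => {
  const waiting = await queue.count(); // jobs waiting to be processed
  shedding = waiting > MAX_DEPTH;
}, 1_000);

http.createServer((req, res) => {
  if (shedding) {
    res.writeHead(429, { 'Retry-After': '5' }); // clients add their own jitter
    return res.end();
  }
  // ...parse the payload and enqueue it as shown above...
  res.writeHead(202);
  res.end();
}).listen(3000);
```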
Idempotency and ordering
With horizontal scale came race conditions. Each job handler was designed to be idempotent by recording processed offsets in Postgres. For operations requiring ordering, such as updating counters, we used pg_try_advisory_xact_lock to create lightweight transactional locks. It was faster than distributed locks yet safe enough for our workloads.
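As an illustration, a handler can take the lock inside its transaction with node-postgres; the counters table and the use of the counter id as the lock key are hypothetical.

```js
import pg from 'pg';

const pool = new pg.Pool(); // connection settings come from the standard PG* environment variables

export async function incrementCounter(counterId, delta) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // pg_try_advisory_xact_lock returns immediately; the lock is released at COMMIT/ROLLBACK
    const { rows } = await client.query(
      'SELECT pg_try_advisory_xact_lock($1) AS acquired',
      [counterId]
    );
    if (!rows[0].acquired) {
      await client.query('ROLLBACK');
      return false; // another worker holds the lock; requeue the job and retry later
    }
    await client.query('UPDATE counters SET value = value + $2 WHERE id = $1', [counterId, delta]);
    await client.query('COMMIT');
    return true;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```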
Observability as a first‑class feature
Metrics, logs, and traces were wired in before the second iteration went to production. Prometheus captured queue depth, event loop lag, and worker restarts. OpenTelemetry propagated trace context through Redis so that a dashboard click could be followed from HTTP ingress to database commit. Alerting thresholds were tuned using real traffic replays.
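A minimal sketch of the Prometheus side using prom-client, whose default collectors already expose event loop lag; the custom gauge name and port are illustrative.

```js
import http from 'node:http';
import client from 'prom-client';

client.collectDefaultMetrics(); // includes nodejs_eventloop_lag_seconds and friends

// Custom gauge, updated wherever the queue depth is polled (see the backpressure sketch above).
const queueDepth = new client.Gauge({
  name: 'ingest_queue_depth',
  help: 'Jobs waiting in the Redis-backed ingest queue',
});

http.createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(9100);
```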
Failure modes and graceful restarts
Chaos testing revealed subtle bugs: workers that died while holding a Postgres connection, and queues that grew unbounded during deploys. We added health checks, SIGTERM handlers with draining logic, and circuit breakers around external APIs. Multi‑region deployment came next with a leader election mechanism based on redlock to coordinate scheduled jobs.
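The draining logic boiled down to a SIGTERM handler along these lines; the 30-second deadline and the specific resources being closed are illustrative.

```js
import http from 'node:http';
import { Worker } from 'bullmq';

const server = http.createServer((req, res) => res.end('ok')).listen(3000);
const worker = new Worker('ingest', async (job) => { /* process the job */ }, {
  connection: { host: 'localhost', port: 6379 },
});

process.on('SIGTERM', async () => {
  // Hard deadline: if draining hangs, exit anyway so the orchestrator can reschedule us.
  setTimeout(() => process.exit(1), 30_000).unref();

  server.close();        // stop accepting new connections; in-flight requests finish
  await worker.close();  // stop pulling jobs and wait for active ones to complete
  process.exit(0);
});
```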
Results
After these iterations the pipeline sustained 8 million events per minute with P95 latency below 150 ms across two regions. More importantly, the architecture embraced failure: any worker could crash without data loss, and replaying history was a matter of pointing a consumer at the log and pressing play.
Scaling Node.js isn't about fighting the event loop; it's about understanding its constraints and designing systems that cooperate with it.