Inside a lean RTB system processing 350M daily requests with sub-100ms latency, built by a 3-person team on a $10k cloud budget.

2 Billion Requests, 100ms Deadlines, $10k a Month: Engineering a Lean Global RTB System

In the world of programmatic audio, 101ms means zero revenue. That’s not a performance problem. That’s an existence problem.

We run an audio advertising DSP that processes roughly 350 million requests a day-about 2 billion a week. Every single request has to be received, evaluated, and responded to in under 100 milliseconds. Miss that window, and the bid is dropped. At 101ms, we effectively don’t exist.

We do this with a three-person engineering team and a cloud bill that stays under $10,000 a month. The same team owns everything-the bidder infrastructure, application layer, databases, and operations.

The RTB Ecosystem

If you’re not deep into ad tech, here’s the short version. Programmatic advertising runs on Real-Time Bidding (RTB).

  • On the supply side are SSPs (Supply-Side Platforms). These are platforms like AdsWizz or Triton that aggregate ad inventory across thousands of apps and publishers-podcast players, streaming apps, and other audio surfaces.

  • On the demand side are DSPs (Demand-Side Platforms)-systems like ours that represent advertisers and decide, in real time, whether to bid on a given impression.

When a listener opens a podcast app and an ad slot becomes available, the app talks to its SSP. The SSP then sends bid requests to multiple DSPs. Our system receives that request, evaluates the user and active campaigns, computes a bid price, and sends back a response. All of this has to happen in under 100ms, end to end.
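
To make the flow concrete, here is a simplified sketch of what that exchange looks like, assuming an OpenRTB-style integration. The shapes are heavily trimmed, and the evaluation step is a placeholder, not our production code.

```typescript
// Simplified, OpenRTB-style shapes. Real requests carry far more fields.
interface BidRequest {
  id: string;                                  // auction ID assigned by the SSP
  imp: { id: string; bidfloor?: number }[];    // ad slots up for auction
  app: { bundle: string };                     // the podcast / streaming app
  device: { os: string; geo: { country: string } };
}

interface BidResponse {
  id: string;                                  // echoes the request ID
  seatbid: { bid: { impid: string; price: number }[] }[];
}

// Hypothetical evaluation step; the point is the contract:
// respond within the SSP's ~100ms window or the bid is dropped.
declare function evaluateCampaigns(req: BidRequest): Promise<number | null>;

async function handleBidRequest(req: BidRequest): Promise<BidResponse | null> {
  const price = await evaluateCampaigns(req);
  if (price === null) return null;             // "no bid"
  return {
    id: req.id,
    seatbid: [{ bid: [{ impid: req.imp[0].id, price }] }],
  };
}
```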

Once we accepted that constraint, it was obvious that standard web-app patterns wouldn’t work. We couldn’t afford network hops between microservices, and we definitely couldn’t afford the cloud bill that comes with a sprawling Kubernetes setup. To survive, we had to design for constraints first.

That constraint wasn’t theoretical. Early on, when we still had cloud credits, we didn’t worry much about efficiency. Once those credits ran out, every architectural decision suddenly had a real price tag. That’s when we had to step back and re-architect the system for latency, cost, and operational simplicity.

What follows is the architecture that lets a three-person team run a global RTB system at scale-without burning money or losing sleep.

1. Geography as the First Filter: Designing for Network Physics

The major audio SSPs we integrate with - Triton, AdsWizz, Magnite (formerly Rubicon), and MCTV - are hosted across the US, Europe, and Southeast Asia. Our application layer, which holds campaign logic for India-based advertisers, lives in India.

The raw network round trip between these regions alone is around 240 milliseconds, even before any application logic runs. That already puts us more than twice over the allowed budget.

The fix was straightforward but non-negotiable. We deployed our bidding clusters across four strategic AWS regions: US-West (Oregon), EU-Central (Frankfurt), and Asia-Pacific (Singapore and Hong Kong).

This ensures that when an SSP sends a bid request, our distance to metal is negligible-typically single-digit milliseconds-leaving most of the 100ms budget for actual compute.

2. The Monolith-Microservice Hybrid

We arrived at an architecture that sits somewhere between a traditional monolith and a microservices setup. Instead of spreading services across the network, we co-locate a small cluster of tightly coupled services on a single virtual machine, and then scale those VMs horizontally behind a global load balancer.

Inside each VM-treated as a single node-it looks like a miniature distributed system:

  • Nginx: A single ingress container that acts as the internal traffic cop.
  • Bidder: 3–5 replicas of the Node.js bidding service running on the same machine to fully utilize the CPU.
  • Redis: A local Redis instance running right alongside the code.

The payoff is data locality by default. The bidder doesn’t cross the network to fetch data; it reads from localhost. This also lets us push resource density much higher. By using larger VMs (for example, c6i.xlarge) and sharing a single in-memory dataset across multiple workers, we get a much better RAM-to-CPU ratio-minimizing waste while maximizing throughput.
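
In practice, data locality just means the bidder’s lookups go to 127.0.0.1. A minimal sketch, assuming ioredis and an illustrative key layout:

```typescript
import Redis from "ioredis";

// Every bidder replica talks to the Redis instance on the same VM,
// so campaign lookups never leave the machine.
const redis = new Redis({ host: "127.0.0.1", port: 6379 });

// Illustrative key layout, populated by the refresh worker described in section 4.
async function getActiveCampaignIds(country: string): Promise<string[]> {
  return redis.smembers(`campaigns:active:${country}`);
}
```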

3. The Logic: Failing Fast

To hit a 100ms deadline, it’s not enough to have fast code. The system must decide quickly when not to do work.

Most incoming bid requests will never result in a win. Letting those requests touch state, databases, or write paths is the fastest way to blow a latency budget. We structure the bidder as a decision tree ordered by "likelihood to fail," executing the cheapest checks first.

The checks run roughly in this order:

  • Geography check: A large fraction of bids fail here. This is a simple comparison and returns almost immediately.
  • OS and device checks: If geography passes, we validate platform and device constraints.
  • Publisher/App Targeting: Checking if the app matches our allowed inventory.
  • Stateful logic: Only if a request survives these filters do we touch heavier paths-user-level data, budget and pacing logic, or anything that requires coordination.

By failing fast, we reject 90% of traffic in microseconds, reserving CPU cycles for the bids that actually convert.
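
A minimal sketch of that ordering; the types and helper names are illustrative, not our production bidder:

```typescript
// Cheapest checks first: most requests are rejected before touching any state.
type Req = { country: string; os: string; appBundle: string };
type Campaign = {
  countries: Set<string>;
  osAllowList: Set<string>;
  appAllowList: Set<string>;
};

function filterCampaigns(req: Req, campaigns: Campaign[]): Campaign[] {
  // 1. Geography: a plain set lookup that rejects the bulk of traffic immediately.
  let survivors = campaigns.filter((c) => c.countries.has(req.country));
  if (survivors.length === 0) return [];

  // 2. OS / device constraints.
  survivors = survivors.filter((c) => c.osAllowList.has(req.os));
  if (survivors.length === 0) return [];

  // 3. Publisher / app targeting.
  return survivors.filter((c) => c.appAllowList.has(req.appBundle));
}

// Only requests that survive all three filters ever reach the stateful path:
// user profiles, budget, and pacing.
```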

4. The Data Architecture: Redis for Rules, Aerospike for Profiles

We split our data architecture into two tiers based on access patterns and the cost of "staleness."

The Rules: Local Redis

Each node holds "hot" metadata (active campaigns and budget caps) in its local Redis. A cron worker uses a pull model, hitting our central API every 60 seconds to download the latest "truth."
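
A sketch of that refresh worker, assuming ioredis, Node 18+ (global fetch), and an illustrative control-plane endpoint and payload:

```typescript
import Redis from "ioredis";

const redis = new Redis({ host: "127.0.0.1", port: 6379 });

// Illustrative endpoint; the real payload shape differs.
const CONTROL_PLANE_URL = "https://control-plane.example.internal/active-campaigns";

async function refreshLocalRules(): Promise<void> {
  const res = await fetch(CONTROL_PLANE_URL);
  const campaigns: { id: string; country: string; budgetRemaining: number }[] =
    await res.json();

  const pipeline = redis.pipeline();
  for (const c of campaigns) {
    pipeline.sadd(`campaigns:active:${c.country}`, c.id);
    pipeline.hset(`campaign:${c.id}`, "budgetRemaining", c.budgetRemaining);
  }
  // Freshness marker: the fail-closed check in section 8 watches this key.
  pipeline.set("rules:lastRefreshed", Date.now());
  await pipeline.exec();
}

// Pull model: every node asks for the latest truth on its own schedule.
setInterval(refreshLocalRules, 60_000);
```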

At our scale, this 60-second staleness creates a risk: The Overspend. If multiple nodes act on the same stale budget data simultaneously, they might spend more than the client allocated.

We handle this with a Braking Mechanism: we treat fleet size as an input to budget control. As the number of active bidding nodes increases, each node dynamically tapers the "slice" of budget it is allowed to use. Even if a node is working with stale data, its local limits are already throttled based on the current fleet size. Spend slows down smoothly instead of spiking, and the system converges on the next refresh cycle.
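
The core of the braking idea fits in a few lines. This is a simplified sketch; the safety margin and the way fleet size is obtained are illustrative assumptions.

```typescript
// Each node's local spend ceiling shrinks as the fleet grows, with a margin
// to cover the 60-second staleness window.
function localBudgetSlice(
  remainingBudget: number,  // last known campaign budget (possibly stale)
  activeNodes: number,      // current fleet size, e.g. reported by the autoscaler
  safetyMargin = 0.9        // never let the fleet spend 100% of a stale budget
): number {
  return (remainingBudget / Math.max(activeNodes, 1)) * safetyMargin;
}

// Example: with $500 left and 10 nodes, each node caps itself at $45 until the
// next refresh. If the fleet scales to 20 nodes, every slice halves to $22.50,
// so total exposure stays bounded instead of spiking.
```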

The Profiles: Aerospike (The SSD Advantage)

User-level data-profile lookups, frequency caps, and retargeting-requires sub-millisecond access to massive, mutable datasets. A traditional RDBMS cannot handle this load at our price point.

We use Aerospike for this layer because it is optimized to treat SSDs as primary storage, rather than keeping the entire working set in RAM. This gives us predictable latency while keeping hardware costs a fraction of a pure-RAM solution like ElastiCache.

To keep this layer affordable, we are ruthless about data hygiene:

  • The 10-Day Rule: We only retain data for users active in the last 10 days.
  • The 10M Cap: We actively prune the dataset to the top few million most active users.

If a user stops engaging, their data is evicted. This discipline ensures storage costs remain flat while the bidder focuses only on the users that matter.
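
In practice, the 10-day rule is just a TTL on every profile write; the 10-million cap is a separate pruning job, not shown here. A minimal sketch, assuming the official Aerospike Node.js client and illustrative namespace, set, and bin names:

```typescript
import Aerospike from "aerospike";

const TEN_DAYS_IN_SECONDS = 10 * 24 * 60 * 60;

// Connect once at startup; host and namespace are illustrative.
const clientPromise = Aerospike.connect({ hosts: "127.0.0.1:3000" });

async function touchUserProfile(userId: string, bins: Record<string, unknown>) {
  const client = await clientPromise;
  const key = new Aerospike.Key("profiles", "users", userId);

  // Every write refreshes the record's TTL, so a user who goes quiet for
  // 10 days simply expires out of the cluster; no cleanup job required.
  await client.put(key, bins, { ttl: TEN_DAYS_IN_SECONDS });
}
```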

5. The "Async" Secret: Decoupling the Burden

Our first version technically tried to be async. We would send the bid response immediately, and let the Node.js function continue running in the background to write the log to MongoDB.

The Trap: Traffic isn't flat; it bursts. During peaks, even though the bid response was gone, our Node.js containers were left "floating"-holding open connections and burning CPU cycles trying to finish database writes. This caused our auto-scaler to spin up more VMs just to handle the backlog of writes, which then crushed our database with connections.

The Fix: We moved from "Async Code" to "Async Architecture." We introduced Kafka as a shock absorber between the bidder and the write path. Now, the bidder pushes a message to Kafka and immediately frees up its resources. The "write" is handled by a separate consumer fleet at a controlled pace.
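
A stripped-down sketch of the hot-path side, assuming kafkajs and an illustrative topic name and broker list:

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "bidder", brokers: ["kafka-1:9092"] });
const producer = kafka.producer();
const producerReady = producer.connect();   // connect once at startup

// Hot path: hand the event to Kafka and return. No database connection is
// ever held open inside the request lifecycle.
async function logBidEvent(event: object): Promise<void> {
  await producerReady;
  await producer.send({
    topic: "bid-events",
    messages: [{ value: JSON.stringify(event) }],
  });
}

// A separate consumer fleet (kafka.consumer({ groupId: "bid-log-writers" }))
// drains the topic at its own pace and performs the MongoDB writes, so bursts
// queue up in Kafka instead of piling up inside the bidder.
```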

6. Benchmarking: Measure, Don’t Guess

At this scale, guessing instance types is expensive. So instead of relying on theory, we benchmarked.

We had hypotheses-whether the bidder was compute- or memory-bound, how many worker processes made sense per node, and how instance size affected latency. We tested across multiple AWS instance families and sizes, and varied how many bidding services ran on a single node.

We took the same approach with the data layer, trying different machine types to see what worked best for MongoDB under real load, prioritizing stable latency over peak throughput.

All of this was validated with load tests that pushed production-like QPS through the system and measured response times and tail latency. Some assumptions held. Many didn’t.
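
The load tests themselves don’t need to be fancy. Here is a sketch of the kind of run involved, using autocannon with placeholder URL, payload, and connection counts; any comparable load generator works.

```typescript
import autocannon from "autocannon";

async function benchmarkBidder(): Promise<void> {
  const result = await autocannon({
    url: "http://localhost:8080/bid",
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ /* production-like bid request */ }),
    connections: 200,   // rough stand-in for SSP fan-out
    duration: 60,       // seconds
  });

  // At a 100ms deadline, tail latency matters far more than the average.
  console.log("p99 latency (ms):", result.latency.p99);
  console.log("avg requests/sec:", result.requests.average);
}
```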

The takeaway was simple: at a 100ms budget, benchmarks beat theory every time.

7. FinOps: How We Keep the Bill at $10k

Startups don’t die when traffic spikes. They die when cloud credits run out. For us, cost control is not an afterthought-it’s part of the system design.

Some savings come from negotiation. We moved billing through a reseller for a flat volume discount. For bandwidth, we committed to roughly 1 PB of monthly transfer, which allowed us to negotiate CDN pricing down by more than 60% compared to on-demand rates. At our traffic levels, that single commitment moved the needle meaningfully.

We also separate concerns: stateless bidding servers run on heavily discounted spot capacity, while stateful databases use predictable, long-lived instances.

At our scale, storing hundreds of millions of records a day is a ticking time bomb. Detailed bid logs have short retention-on the order of days, not months. Data we want to keep is aggregated early-hourly instead of per event-and moved into cheaper, query-friendly stores. Anything required only for compliance or historical audits is pushed into cold storage, where costs drop by 80–90% compared to standard tiers.

The biggest saver, though, isn’t a discount. It’s utilization.

Most systems run CPUs at 20–30% “for safety.” Ours doesn’t have to. Because the architecture is fault-tolerant and designed to shed load gracefully, we’re comfortable running CPUs at 80% utilization. We don’t pay for idle cycles. If we’re paying for a core, that core is bidding.

That mindset-treating efficiency as a first-class constraint-is what keeps the bill flat even as traffic grows.

8. Operating with Three People: The "Tuition Fee" Philosophy

Running global scale with a three-person team only works if systems-not people-carry the load. We design operations to eliminate human intervention from the critical path and remove single points of failure wherever possible. We also avoid role silos - each engineer can navigate the full stack - so operational knowledge never depends on one person.

Infrastructure is disposable by default. If a node misbehaves, we don’t debug it in place-we terminate it and let the auto-scaler replace it. Recovery is faster than investigation, and every instance comes from a known, clean state.

Observability matters, but dashboards alone don’t help a small team. We rely on actionable alerts (via Datadog) that trigger only when human attention is required. If a regional cluster slows down or a consumer falls behind, we get paged. Everything else is automated.

Automation, however, isn’t magic. We still pay tuition fees from time to time.

In one early incident, a background job responsible for refreshing the budget state failed silently. The bidder kept operating on stale data and spent on a campaign that should have ended-a classic “zombie bidder” scenario. We refunded the client, fixed the gap, and hardened the system with heartbeat checks monitored by Datadog. Budget freshness is now explicit, and silence itself triggers an alert. If the bidder doesn’t receive a valid update, it fails closed.
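
The fail-closed check itself is simple. A sketch, with an illustrative freshness key (the one written by the refresh worker in section 4) and an illustrative threshold:

```typescript
// Tolerate a few missed 60-second refresh cycles, then stop bidding.
const MAX_STALENESS_MS = 3 * 60_000;

async function budgetIsFresh(
  redis: { get(key: string): Promise<string | null> },
): Promise<boolean> {
  const last = await redis.get("rules:lastRefreshed");
  if (last === null) return false;                    // never refreshed: fail closed
  return Date.now() - Number(last) < MAX_STALENESS_MS;
}

// In the bid path: no valid heartbeat, no bid.
// if (!(await budgetIsFresh(redis))) return null;    // respond with "no bid"
```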

In a small team, mistakes are inevitable. Repeating them is optional.

Closing Thoughts

None of the decisions in this system are exotic on their own. What makes it work is how they compound.

Network physics sets the boundary. Locality and fail-fast logic protect the 100ms path. Asynchronous boundaries absorb traffic bursts. Measurement replaces intuition. Cost discipline and high utilization keep the economics honest. Operations are designed so humans intervene only when the system can’t recover on its own.

This isn’t a story about ad tech or a specific stack. It’s a reminder that at scale, constraints are not obstacles-they’re design inputs. When you treat latency, cost, and headcount as first-class requirements, the architecture that emerges tends to be simpler, not more complex.

The goal wasn’t to build something impressive. It was to build something that survives.
