Insights

Aarki’s Bidding Brain Evolution, Part 2: Scaling with GPUs at Billion-Row Scale

August 28, 2025

About The Author

Ben Weber is the VP of Machine Learning at Aarki, where he leads the charge on building smarter models that power millions of real-time bidding decisions per second. He’s held senior roles across McAfee, Zynga, Twitch, and EA. From building bidders to rewriting prediction pipelines, his experience is hands-on, production-grade, and battle-tested in the real world.

In Part 1 of this series, we shared how Aarki evolved from classic machine learning to deep neural networks, teaching our DSP to think smarter with deep learning. But smarter models can only go so far without the infrastructure to support them.

This post covers the second half of that story: how we made the call to invest in GPUs, why public cloud didn’t scale for us, and what it took to build a system that trains on billions of records daily.

Training on CPUs Wasn’t Enough Anymore

Our machine learning platform at Aarki trains dozens of models daily, with some of our largest models training on over one billion records per day. To train these models efficiently, we use GPUs to hardware-accelerate our deep learning training workloads.

We’re leveraging the TensorFlow ecosystem and NVIDIA’s CUDA platform to process massive amounts of data and determine which ad impressions will deliver the best value for our advertisers. When we started our migration to deep learning models for our Demand-Side Platform (DSP), we had a decision to make: do we rent or buy GPUs?

In this post, we’ll highlight the options we explored, how we demonstrated the value of GPUs, and our path forward to support deep learning at scale.

Why Our Infra Didn’t Start with GPUs

Our Data Centers Were Already Optimized for Speed

Our DSP is deployed in four private data centers that are co-located with the trading locations of major programmatic ad exchanges. We invested in hardware to support processing over five million bid requests per second on machines that we own and operate.

While this approach is much less flexible than a public cloud such as AWS or GCP, where you can spin up new machines on demand, we avoid large network egress bills and our amortized cost per request is lower than it would be on one of these platforms.

Our Early Infra Focused on Core DSP Throughput, Not Training

We initially focused on hardware to support our DSP and did not make an upfront investment in GPUs. Not having GPUs in our data center was not an issue for our past generation of deep learning models, because the Rust library we used for training is CPU-optimized and does not support hardware acceleration.

DeepFM Worked Without Acceleration—Until It Didn’t

Our previous models used Deep Factorization Machines (DeepFM), which combine a deep stage that learns embeddings for all of the input features with a factorization machine stage that learns interactions between those features. These models worked well for retargeting campaigns, which focused on post-install conversion events, such as identifying which installs are most likely to convert to paying users.
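
For readers less familiar with the architecture, here is a minimal, illustrative sketch of the DeepFM idea in Keras: shared embeddings feed both a factorization machine interaction term and a deep tower. The feature names, vocabulary sizes, and layer dimensions are hypothetical placeholders, not our production model.

```python
import tensorflow as tf

# Hypothetical categorical features and vocabulary sizes (illustrative only).
FEATURES = {"app_id": 10_000, "country": 250, "device_model": 5_000}
EMBED_DIM = 16


class DeepFM(tf.keras.Model):
    """Minimal DeepFM: shared embeddings feed an FM term and a deep tower."""

    def __init__(self, features, embed_dim):
        super().__init__()
        self.embeddings = {n: tf.keras.layers.Embedding(v, embed_dim) for n, v in features.items()}
        self.linear = {n: tf.keras.layers.Embedding(v, 1) for n, v in features.items()}
        self.deep = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(1),
        ])

    def call(self, inputs):
        # inputs: dict of integer id tensors, each with shape (batch,)
        embs = [self.embeddings[n](inputs[n]) for n in self.embeddings]   # each (batch, dim)
        lin = tf.add_n([self.linear[n](inputs[n]) for n in self.linear])  # first-order term

        # FM second-order interactions: 0.5 * ((sum v)^2 - sum(v^2)).
        stacked = tf.stack(embs, axis=1)                                  # (batch, fields, dim)
        sum_sq = tf.square(tf.reduce_sum(stacked, axis=1))
        sq_sum = tf.reduce_sum(tf.square(stacked), axis=1)
        fm = 0.5 * tf.reduce_sum(sum_sq - sq_sum, axis=1, keepdims=True)

        deep = self.deep(tf.concat(embs, axis=1))                         # deep tower output
        return tf.nn.sigmoid(lin + fm + deep)


model = DeepFM(FEATURES, EMBED_DIM)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```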

However, as we started working on pre-install models, such as click prediction, we started hitting bottlenecks with our model training platform. We were able to train our models on hundreds of millions of records per day, but model training was too slow to scale to over one billion records per day.

The Shift to Deep Neural Networks (DNNs) Made GPUs Necessary

Our New Models Demand Hardware Acceleration

Our next generation of pricing models uses deep learning frameworks that support hardware acceleration for both model training and model serving. We are initially focused on using GPUs for model training, where the size of our training data sets is the primary bottleneck.

TensorFlow and PyTorch Power Our DNN Training Stack

Our pricing models in production use the TensorFlow framework, and we also use the PyTorch framework for pipelines that create embeddings that feed into those models. We train models using hardware acceleration and convert them into ONNX format, which can be served in Rust using CPUs on our existing DSP infrastructure.
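
As an illustration of that hand-off, the sketch below shows one common way to export a trained Keras model to ONNX using the tf2onnx package. The toy model, input signature, opset, and output path are assumptions for brevity, not our exact pipeline; on the serving side, an ONNX runtime with Rust bindings can then load the file and run inference on CPUs.

```python
import tensorflow as tf
import tf2onnx

# Assume a trained tf.keras.Model with a single dense input of 64 float
# features (a placeholder, not our real feature layout).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,), name="features"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Convert the model to ONNX and write it to disk; the resulting model.onnx
# can be loaded by a CPU-based ONNX runtime from another language.
spec = (tf.TensorSpec((None, 64), tf.float32, name="features"),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec,
                                           opset=13, output_path="model.onnx")
```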

Proving the Value of GPUs

Testing Training Outside Our Data Centers

Before investing in GPUs for our private data center, we wanted to make sure that our next generation of deep learning models would improve the performance of our DSP, and to confirm that the new models could be trained much more efficiently on GPUs than on CPUs.

To build confidence in training efficiency, we explored a few options for training on GPUs outside of our data center. We started with model training on sampled data sets on machines in our office, then tried spinning up GPU instances on AWS, and finally settled on Google Colab to confirm that GPUs would greatly improve our model training times.
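
A quick sanity check when running these experiments, on Colab or anywhere else, is to confirm that TensorFlow can actually see a CUDA-capable GPU before any training starts. This is a standard check rather than anything specific to our pipelines:

```python
import tensorflow as tf

# List the accelerators TensorFlow can see; an empty list means training
# will silently fall back to the CPU.
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```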

A/B Testing New vs Old Models

We trained challenger models on Google Colab and ran A/B tests between our current and next-generation deep learning models to build confidence in our migration to GPUs for model training. A key decision in choosing whether to buy GPUs was which card types would work best for our model training use cases.

L4 GPUs Delivered a 10x Speedup Over CPUs

We used Google Colab to gather model training statistics on NVIDIA T4, L4, and A100 GPUs and found that L4 cards worked well for our model architectures. Compared to training on CPUs, we were able to get a 10x speedup using NVIDIA L4 cards, reducing batch processing time from 200 milliseconds to 20 milliseconds for our largest models.
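
These comparisons come down to simple step-time measurements of the kind sketched below, where the same kind of model and batch are timed on CPU and GPU. The placeholder architecture, batch size, and warm-up counts are illustrative, not our production benchmark.

```python
import time
import numpy as np
import tensorflow as tf


def build_and_time(device, x, y, warmup=5, steps=50):
    """Build a small placeholder model on `device` and time its training steps."""
    with tf.device(device):
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy")
        for _ in range(warmup):              # warm-up: graph tracing, data transfer
            model.train_on_batch(x, y)
        start = time.perf_counter()
        for _ in range(steps):
            model.train_on_batch(x, y)
        return (time.perf_counter() - start) / steps


# Placeholder batch; substitute the real architecture and feature width.
x = np.random.rand(4096, 128).astype("float32")
y = np.random.randint(0, 2, size=(4096, 1)).astype("float32")

print("CPU seconds/step:", build_and_time("/CPU:0", x, y))
if tf.config.list_physical_devices("GPU"):
    print("GPU seconds/step:", build_and_time("/GPU:0", x, y))
```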

We found that L4 cards were a good starting point for our investment in GPUs. We will continue to evaluate which instance types provide the best performance for our models as we explore new architectures, increase model parameters, and add more features to our models.

Prototypes Proved It, But We Needed Production-Ready Scale

Our A/B testing on Google Colab demonstrated the value of moving to GPUs: we were able to train on much larger data sets and outperform our then-current generation of deep learning models. Our next step was to automate these pipelines to support daily model retraining on billions of records.

Google Colab is great for prototyping and experimentation, but it’s not intended for production workloads, and Google recommends using Vertex AI for these pipelines. We needed to decide whether our path forward would be a managed public cloud option, such as AWS SageMaker or GCP Vertex AI, or purchasing our own hardware to accelerate model training.

Renting GPUs Didn’t Work for Our Scale or Setup

Data Transfer and TensorFlow I/O Made Cloud Training Impractical

The biggest issue we faced with renting cards for model training was transferring data from our private data center to wherever the rented GPUs are located. We prepare the training datasets using Spark on a cluster of machines in our private data center and output the encoded datasets in TensorFlow’s TFRecord format.

Our largest datasets are hundreds of gigabytes in size, even with gzip compression, and we train dozens of models of this size on different training datasets daily.
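
On the training side, these compressed TFRecord files are the natural input to a tf.data pipeline. The sketch below shows a generic way to stream gzip-compressed TFRecords into training; the file path and feature spec are hypothetical and would mirror whatever the Spark encoding job emits.

```python
import tensorflow as tf

# Hypothetical feature spec; the real one mirrors the Spark encoding job.
FEATURE_SPEC = {
    "app_id": tf.io.FixedLenFeature([], tf.int64),
    "country": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.float32),
}


def parse(serialized):
    example = tf.io.parse_example(serialized, FEATURE_SPEC)
    label = example.pop("label")
    return example, label


files = tf.data.Dataset.list_files("/data/train/part-*.tfrecord.gz")
dataset = (
    tf.data.TFRecordDataset(files, compression_type="GZIP",
                            num_parallel_reads=tf.data.AUTOTUNE)
    .batch(4096)
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .prefetch(tf.data.AUTOTUNE)   # keeps the GPU fed while records stream from disk
)
# model.fit(dataset, epochs=1)
```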

An additional issue is that support for training TensorFlow models directly from files on cloud storage, such as AWS S3 and Google Cloud Storage, was removed from the core TensorFlow library and moved to the TensorFlow I/O library, which is no longer maintained. Moving all of this data around as part of daily model training would introduce orchestration challenges for our model pipelines, and we would have to pay cloud storage and transfer costs for terabytes of data every day.

We Brought GPUs In-House and Built the Pipeline Around Them

We already knew that NVIDIA L4 cards would provide a good starting point for training our models, given the prior testing we did with Google Colab. We worked with a vendor to price out options for adding servers to our private data center with multiple L4 cards per machine, and started with a small batch of GPU-powered machines.

Once we had the machines installed, we were able to keep everything in-house in our private data center, avoiding our biggest pain point of transferring data. This approach also enabled us to invest more in open-source tooling, including Prefect for orchestration, TensorBoard for profiling, and MLflow for MLOps across our model pipelines.
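
To give a flavor of how these pieces fit together, here is a heavily simplified Prefect flow that chains data preparation and GPU training and records metrics to MLflow. The task bodies, paths, and helper names (prepare_tfrecords, train_model) are hypothetical stand-ins for our actual jobs.

```python
import mlflow
from prefect import flow, task


@task(retries=2)
def prepare_tfrecords(date: str) -> str:
    # Stand-in for the Spark job that encodes training data to TFRecords.
    output_path = f"/data/train/{date}"
    # ... trigger the Spark workflow and wait for completion ...
    return output_path


@task
def train_model(tfrecord_path: str) -> float:
    # Stand-in for the GPU training job; logs results to MLflow.
    with mlflow.start_run(run_name=f"daily-train-{tfrecord_path}"):
        val_auc = 0.0
        # ... run TensorFlow training on the GPU machines ...
        mlflow.log_param("data_path", tfrecord_path)
        mlflow.log_metric("val_auc", val_auc)
    return val_auc


@flow(name="daily-model-retrain")
def daily_retrain(date: str):
    path = prepare_tfrecords(date)
    return train_model(path)


if __name__ == "__main__":
    daily_retrain("2025-08-28")
```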

We Now Train on Billions of Rows, In-House, Every Day

It took some engineering work to coordinate the hand-off between our Spark workflows for data preparation and GPU jobs for model training, but it was worth the effort, and we are now able to train our largest models on billions of examples daily.

The buy option isn’t always the right path, but for our private data center and predictable model workloads, it gave us the control, efficiency, and scale we needed.

After All, You Can’t Scale Intelligence Without Scaling The Infrastructure

Part 1 was about evolving the brain. Part 2 was about giving that brain the infrastructure it needs to grow. By investing in GPUs and bringing model training in-house, we enabled a leap in performance without a leap in cloud costs.

With deep learning now central to our bidding logic, and GPUs powering daily billion-row training, we’ve built a system designed to learn faster, optimize smarter, and scale without compromise.

Missed the beginning of the story? Read Part 1 here.

If you enjoyed this blog, shoot Ben a note at bweber@aarki.com. He would love to hear from you!
