Aarki’s Bidding Brain Evolution, Part 1: Deep Learning for Mobile Growth

August 22, 2025

About The Author

Ben Weber is the VP of Machine Learning at Aarki, where he leads the charge on building smarter models that power millions of real-time bidding decisions per second. He’s held senior roles across McAfee, Zynga, Twitch, and EA. From building bidders to rewriting prediction pipelines, his experience is hands-on, production-grade, and battle-tested in the real world.

Machine learning is at the core of Aarki's Demand-Side Platform (DSP), and we use deep learning models on every bid request that we process. Prototyping, deploying, and tuning deep learning models for ad tech is an ongoing journey, and this capability has enabled us to keep improving our platform and deliver better performance for our advertisers.

In this post we’ll discuss:

  • Our motivations for moving from classic to deep learning for pricing models
  • An overview of our first-generation deep learning models
  • The next generation of models we’re building
  • Our infrastructure and how we’re using GPUs to scale

Why We Don’t Use Cloud Platforms for ML

We Process Over 5 Million QPS In-House

Our DSP is deployed in four data centers around the globe, in trading locations co-located with programmatic ad exchanges including Unity, Google AdX, and Fyber.

Owning Hardware Saves on Cost per Request

We have invested in owning the hardware we operate to process bid requests from exchanges, handle data from Mobile Measurement Partners (MMPs), and support our Encore platform. This setup gives us a lower cost per request than a public cloud platform such as AWS.

But It Comes with Tradeoffs

We lack the ability to spin up new virtual machines on demand and cannot use managed services such as GCP's Vertex AI for serving deep learning models. This means we need to be more intentional about the infrastructure we build.

The Big Bang: Building Our Own Machine Learning Stack

It Starts With All Pipelines Running in Private Data Centers

The approach that we have taken for training models and serving deep learning in our DSP is to build our machine learning pipelines within our private data centers.

We Rely on a Full Open-Source Ecosystem

We use Spark, Hadoop, Aerospike, ClickHouse, Redash, Prefect, and Streamlit. PySpark is our primary tool for training models, giving our ML team Python as a common language, while Rust powers our DSP for its concurrency, safety, and performance.

Classic Machine Learning Models Weren’t Built for Scale

Logistic Regression: Fast but Requires Manual Engineering

Many Demand-Side Platforms start with approaches referred to as shallow or classic machine learning, which include methods such as logistic regression and decision trees. Logistic regression works well to start because the models are fast to evaluate and there are distributed approaches for training them with large feature counts.

Spark's MLlib is one example if you are working in the Spark ecosystem. Ensemble methods such as XGBoost and LightGBM, which support distributed training on large data sets, are also common choices for DSPs.
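
As a rough illustration, here is a minimal PySpark sketch of the kind of distributed logistic regression training MLlib supports; the table path, column names, and feature-hash size are hypothetical placeholders, not our production configuration.

```python
# A minimal sketch of distributed logistic regression with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("click-model").getOrCreate()
df = spark.read.parquet("/data/impressions")   # hypothetical training extract

# Hash raw categorical columns into a fixed-size sparse feature vector.
hasher = FeatureHasher(
    inputCols=["publisher_app", "advertised_app", "country", "device_model"],
    outputCol="features",
    numFeatures=1 << 22)
features = hasher.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="clicked", maxIter=20)
model = lr.fit(features)   # training is distributed across the Spark cluster
```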

Logistic Regression Can’t Handle Feature Interactions

When using logistic regression to predict outcomes, such as whether a user will click on an ad impression, you need to model interactions between your input features to get a model that performs well. For example, you may want to model the interaction between the mobile app that you are advertising and the publisher app that would be rendering the ad impression.

Because logistic regression does not support modeling these types of interactions directly, the typical approach is to use feature engineering methods that manually create combinations of your input features and to pass this expanded feature set to your logistic regression model.

This approach can work well to start, but as you add more features to your models, the number of feature interactions, and with it the parameter count, starts to explode.
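
To make this concrete, here is a small scikit-learn sketch of manual feature crossing; the column names and toy data are hypothetical, and the hashing step is one common way to keep the exploded feature space bounded.

```python
# A sketch of manually crossing features for a logistic regression model.
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "advertised_app": ["game_a", "game_a", "game_b"],
    "publisher_app":  ["news_1", "chat_2", "news_1"],
    "clicked":        [1, 0, 0],
})

# Manually create the interaction feature: advertised app x publisher app.
# Each cross multiplies the feature space: ~1M publisher apps crossed with
# ~1K advertised apps is up to ~1B possible combinations, which is why the
# crossed values are hashed to bound the model size.
df["app_x_publisher"] = df["advertised_app"] + "|" + df["publisher_app"]

hasher = FeatureHasher(n_features=2**20, input_type="string")
X = hasher.transform(
    df[["advertised_app", "publisher_app", "app_x_publisher"]].values.tolist())
model = LogisticRegression().fit(X, df["clicked"])
```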

Feature Explosion Creates Infrastructure Challenges

I’ve worked on click prediction models with over 100 million features, where the publisher app is combined with every other feature to try to maximize the performance of the model. The parameter count was so large that the MLlib implementation in Spark ran into issues with model training, and we had to explore other approaches for training our models in a cost-efficient manner.

The main issues with logistic regression are that you have to manually define the interactions between features, this limits how many feature crosses you can realistically include, and the resulting models have very large parameter counts that are expensive to train.

Gradient Boosting Models Scale Better, but Not Far Enough

At Aarki, we initially worked with gradient boosting algorithms, starting with LightGBM, for our pricing models that predict user conversions. A key benefit of LightGBM is that it learns interactions between features within the model itself, which logistic regression does not.

LightGBM worked reasonably well when we only had a few dozen features, but as we scaled up to hundreds of features in our models we ran into a few different issues. Like most classic machine learning methods, LightGBM had trouble working with high-cardinality categorical features, such as the publisher app feature which can have millions of different values.

LightGBM does have different methods for handling categorical features that are somewhat analogous to embeddings in deep learning models, but we couldn’t get these to train and serve efficiently for our use cases.
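
For reference, a minimal sketch of LightGBM's native categorical handling is shown below; the file path, column names, and parameter values are hypothetical rather than our production settings.

```python
# Declaring categorical columns so LightGBM handles them natively.
import lightgbm as lgb
import pandas as pd

df = pd.read_parquet("impressions.parquet")    # hypothetical training extract
categorical = ["publisher_app", "country", "device_model"]
for col in categorical:
    df[col] = df[col].astype("category")       # LightGBM expects encoded categoricals

train = lgb.Dataset(df.drop(columns=["converted"]), label=df["converted"],
                    categorical_feature=categorical)

params = {
    "objective": "binary",
    "learning_rate": 0.05,
    "num_leaves": 255,
    "max_cat_threshold": 64,   # limits split points considered per categorical feature
}
booster = lgb.train(params, train, num_boost_round=200)
```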

Moving Beyond Classic ML To Deep Learning

Both of the classic approaches we explored for building predictive models for our DSP share a core issue: they are unable to work well with large numbers of categorical features, which are a key component of our data sets.

Our solution to this problem was to begin using embeddings in our models, which provide a technique for representing categorical data as dense vectors, enabling dimensionality reduction and more efficient model training and inference.
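
A tiny Keras sketch of the idea, using made-up sizes: a publisher app ID drawn from roughly a million distinct values is mapped to a dense 16-dimensional vector learned during training, so the feature costs 1,000,000 × 16 embedding weights instead of the far larger counts implied by one-hot feature crosses.

```python
import tensorflow as tf

# Hypothetical sizes: ~1M distinct publisher apps, 16-dimensional embeddings.
publisher_app = tf.keras.Input(shape=(1,), dtype="int64", name="publisher_app")
embedded = tf.keras.layers.Embedding(input_dim=1_000_000, output_dim=16)(publisher_app)
dense = tf.keras.layers.Flatten()(embedded)
# `dense` is a learned 16-value representation per request that downstream layers
# combine with other features, replacing a 1,000,000-column one-hot encoding.
```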

First Generation of Deep Learning: DeepFM

We adopted DeepFM models, which combine two components that share a common set of learned embeddings for the input features: a factorization machine component that captures pairwise feature interactions and a deep component that learns higher-order interactions. This approach enables us to move away from manual feature engineering and gives the model the ability to generalize better to unseen feature interactions.
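
For readers unfamiliar with the architecture, below is a schematic DeepFM sketch in Keras. It is illustrative only: the field names, vocabulary sizes, and layer widths are invented, and our production DeepFM models are trained and served with a Rust library described in the next section.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical fields and vocabulary sizes.
FIELDS = {"publisher_app": 1_000_000, "advertised_app": 50_000, "country": 300}
EMBED_DIM = 16

inputs, linear_terms, embeddings = {}, [], []
for name, vocab in FIELDS.items():
    inp = layers.Input(shape=(1,), dtype="int64", name=name)
    inputs[name] = inp
    linear_terms.append(layers.Embedding(vocab, 1)(inp))          # first-order weights
    embeddings.append(layers.Embedding(vocab, EMBED_DIM)(inp))    # shared embeddings

emb = layers.Concatenate(axis=1)(embeddings)                      # (batch, fields, k)

# Factorization machine terms: first-order sum plus pairwise interactions,
# computed as 0.5 * ((sum of embeddings)^2 - sum of squared embeddings).
linear = layers.Flatten()(layers.Add()(linear_terms))
fm = layers.Lambda(lambda e: 0.5 * tf.reduce_sum(
    tf.square(tf.reduce_sum(e, axis=1)) - tf.reduce_sum(tf.square(e), axis=1),
    axis=1, keepdims=True))(emb)

# Deep component over the same embeddings learns higher-order interactions.
deep = layers.Flatten()(emb)
for units in (256, 128):
    deep = layers.Dense(units, activation="relu")(deep)
deep = layers.Dense(1)(deep)

output = layers.Activation("sigmoid")(layers.Add()([linear, fm, deep]))
model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```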

Rust Compatibility Was a Must

Since our DSP is written in Rust, we needed a solution with in-process inference. We use the Fwumious Wabbit library to train and serve DeepFM models entirely in Rust. This allows us to avoid the overhead of remote calls and serve predictions directly inside our bidding engine.

Improvements Over Classic ML

DeepFM allowed us to scale to hundreds of input features and make more accurate predictions while staying efficient. Embeddings helped us compress high-cardinality features like app IDs and user signals into dense vectors that could be learned and updated automatically, improving both model accuracy and training efficiency.

But DeepFM Has Its Own Limits

The factorization machine component required grouping features into namespaces, which meant some manual work in defining how different feature groups interact. Another limitation we faced was that the Fwumious Wabbit training library did not support GPUs for hardware acceleration, which constrained how much data we could use and how fast we could iterate.

While DeepFM was a meaningful step forward, we knew we’d need to evolve further to support larger datasets, deeper models, and faster retraining.

Ushering In the Next Generation of Deep Learning

What We Want from the Next Generation

To continue improving performance at scale, we needed a new class of models that could meet the demands of more complex use cases and larger datasets. Our first-generation deep learning models proved the value of embedding-based architectures, but they weren’t enough to unlock all the opportunities we saw ahead. Our goals for the next generation were clear:

  • Train on larger data sets
  • Retrain more frequently
  • Handle new prediction tasks (ranking, relevance)
  • Improve our pricing models

Moving to TensorFlow and PyTorch

We’ve invested in GPUs in our private data centers and are fully adopting TensorFlow and PyTorch to train modern deep neural networks.

Market Pricing Models Train on Billions of Rows

Our initial DNN models focus on predicting optimal bid prices for first-price auctions. We use Spark for large-scale data preparation and TFRecord files on our GPU machines to efficiently train TensorFlow models on over a billion records daily.
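
To give a feel for the pipeline, here is a sketch of a tf.data input stage that streams Spark-produced TFRecord files into a Keras pricing model; the file glob, feature names, label, and batch size are hypothetical placeholders.

```python
import tensorflow as tf

# Hypothetical schema for a market pricing training example.
FEATURES = {
    "publisher_app": tf.io.FixedLenFeature([], tf.int64),
    "advertised_app": tf.io.FixedLenFeature([], tf.int64),
    "winning_price": tf.io.FixedLenFeature([], tf.float32),
}

def parse(example_proto):
    parsed = tf.io.parse_single_example(example_proto, FEATURES)
    label = parsed.pop("winning_price")
    return parsed, label

dataset = (
    tf.data.TFRecordDataset(tf.io.gfile.glob("/data/tfrecords/*.tfrecord"))
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(100_000)
    .batch(8192)
    .prefetch(tf.data.AUTOTUNE)   # keep the GPU fed while the CPU parses records
)
# model.fit(dataset, epochs=1)    # streams records without loading them all in memory
```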

How We Serve Models in Production

We convert models to ONNX format and serve them using the ort library in Rust for in-process inference. Training happens on GPU-powered machines; serving remains CPU-based in our DSP.
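
Here is a hedged sketch of the Python side of that export path, assuming a Keras model: convert it to ONNX with tf2onnx and sanity-check the result with onnxruntime. The model path, input shape, and tensor names are placeholders; in production the resulting ONNX file is loaded by the `ort` crate inside our Rust bidder.

```python
import numpy as np
import tensorflow as tf
import tf2onnx
import onnxruntime

# Hypothetical trained pricing model with a single 64-wide float input.
model = tf.keras.models.load_model("pricing_model.keras")
spec = (tf.TensorSpec((None, 64), tf.float32, name="features"),)
tf2onnx.convert.from_keras(model, input_signature=spec,
                           output_path="pricing_model.onnx")

# Sanity check the exported graph on CPU, mirroring how the DSP serves it.
session = onnxruntime.InferenceSession("pricing_model.onnx",
                                       providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 64).astype(np.float32)
print(session.run(None, {input_name: sample}))   # predicted bid price
```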

Architectures We’re Exploring

For core conversion prediction models, we're testing DCNv2 and TabTransformer. We're also exploring two-tower architectures to better match ads to inventory.
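
As one concrete example of what DCNv2 adds, the sketch below implements its cross layer, x_{l+1} = x_0 * (W x_l + b) + x_l, in Keras; the input width and layer counts are arbitrary, and this is an illustration rather than our production architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

class CrossLayer(layers.Layer):
    """DCNv2-style cross layer: x_next = x0 * (W @ x + b) + x."""
    def build(self, input_shape):
        dim = int(input_shape[0][-1])
        self.w = self.add_weight(shape=(dim, dim), initializer="glorot_uniform", name="w")
        self.b = self.add_weight(shape=(dim,), initializer="zeros", name="b")

    def call(self, inputs):
        x0, x = inputs
        return x0 * (tf.matmul(x, self.w) + self.b) + x

# Hypothetical 128-wide vector of concatenated feature embeddings.
x0 = layers.Input(shape=(128,), name="embeddings")
x = x0
for _ in range(3):                       # stack a few explicit-interaction layers
    x = CrossLayer()([x0, x])
x = layers.Dense(128, activation="relu")(x)
output = layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs=x0, outputs=output)
```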

GPU Investment Unlocks Next-Level Model Performance

With our current infrastructure and GPUs in place, we’re scaling deep learning to deliver smarter bidding, faster iteration, and better ROI for advertisers.

Why This Journey Matters

Every evolution in our machine learning strategy—from logistic regression to DeepFM to modern DNNs—has been driven by one core principle: deliver better results for our advertisers at scale. This post covered the technical path we took to identify and solve the growing limitations of classic ML, and how we built deep learning into the fabric of our DSP.

But model architecture is just one part of the equation. In part 2 of this series, we go deeper into the infrastructure powering it all: why we invested in GPUs, how we validated that decision, and what it unlocks for model training at scale.

If you enjoyed this blog, shoot Ben a note at bweber@aarki.com. He would love to hear from you!