Creative Iteration Using Bayesian Multi-armed Bandits


By Igor Raush, Software Engineer

We continue the blog series that we believe will give you a clear understanding on how Aarki finds ways to fully understand, test, and optimize creative ads to ensure that targeted users see the best possible variant. In our previous article, we discussed broadly what strategies we use for creative iteration. In this article, we will present the experiments we made to solve the multi-armed bandit problem. 

Choice of prior

Recall that, in a Bayesian Bernoulli bandit formulation, we assume each arm draws rewards with probability

θ ∼ Beta(1 + P, 1 + N − P)

where N is the number of Bernoulli trials and P is the number of successes to date.

Here, we assume a uniform, non-informative Beta(1, 1) prior on θ. However, this may be a poor choice. In reality, we know the reward probabilities θ to be small; click rates are on the order of 1-10%, and install rates are on the order of 0.1-1%. Using a non-informative prior causes new creatives to be explored far too aggressively.

For this reason, we implemented a prior based on the average reward rate across all creative variants in our DSP. We also introduced a hyper-parameter M, which controls the influence of this prior. Intuitively, M can be interpreted as the number of samples fed into the bandit as prior knowledge. A high value of M produces an influential prior, resulting in conservative exploration; a low value of M results in more aggressive exploration. A value of M = 0 corresponds to a Beta(1, 1) prior.

Experimental setup

Our creative team has produced eight creative variants (referred to as A, B, ..., H) belonging to two creative groups (landscape and portrait). For a given impression, we must choose to show either variant A-D in a landscape publisher, or variant E-H in a portrait publisher (so, for example, variants A and E never compete against each other).

We launched a total of five experiments with an equal share of volume (20% each).

  1. Four bandit strategies, corresponding to M = 0, 10, 100, 1000, optimizing for conversion rate.
  2. One control campaign, allocating 2% of qualifying traffic to variants A, C, E, and G, and 98% to the remaining variants.


Examining performance by strategy, we see that bandit strategies deliver a higher install rate than the control, both in aggregate, and when comparing each strategy alone.


We can also examine the traffic allocation to each creative variant by strategy.


We see that variants G and H deliver the highest install rate overall, and that bandit strategies choose to serve these variants for the vast majority of traffic.


The question remains whether the lift in install rate coming from bandit strategies is statistically significant. Rather than using classical statistical tests, we chose to calculate the probability that the underlying install rate is higher for the bandit strategy than the manual allocation strategy.

This comes down to computing P(θB > θC) = 1 − P(θB < θC). We can again assume θB and θC to be Beta-distributed, and analytically compute

P(θB < θC) = ∫01dx0xdy PB(x) PC(y)

However, it is far easier to solve the problem by sampling; e.g., in Python

num_samples = 10000  # can increase to get more precise estimates
B_samples = np.random.beta(B_successes, B_trials - B_successes, num_samples)
C_samples = np.random.beta(C_successes, C_trials - C_successes, num_samples)
p_B_gt_C = np.mean(B_samples > C_samples)

Applying this technique to our experimental data, we see that the bandit strategies deliver a higher install rate than the control with 99.26% confidence, indicating a statistically significant result.

At Aarki, our data scientists are constantly experimenting to develop sophisticated machine learning algorithms that allow us a better understanding of our data and better predict user profiles. That’s how we reach and acquire the best users and deliver strong ROI to our clients.

Topics: Machine Learning