By Igor Raush, Software Engineer
This post continues our blog series on how Aarki understands, tests, and optimizes ad creatives to ensure that targeted users see the best possible variant. Read on for insights into contextual creative selection.
It is useful to frame creative A/B testing (and A/B testing of any kind) as a multi-armed bandit problem. In the "naïve" formulation, we aim to learn and explore the reward distributions of multiple creative variants V1, V2, ... VK, while optimizing the total reward – usually, an advertiser KPI.
The classic Beta-Bernoulli bandit formulation makes a rather strong assumption – that the reward probability of the ith variant, θi, is not context-dependent. In reality, and when it comes to creative optimization in particular, different users respond differently to different creative variants. Also, the reward distribution of a particular creative variant may change with time; for example, seasonal creatives may lose their relevance.
We will explore a simple extension to the Beta-Bernoulli bandit to incorporate a generic feature vector x into the reward distribution. This can be a combination of contextual, temporal, and behavioural features.
Recall that, in the classic Bayesian Beta-Bernoulli bandit formulation, we place a uniform Beta(1, 1) prior on θ, so that after observing P positive rewards in N trials of a variant,
θ ~ Beta(1 + P, 1 + N - P)
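As a quick refresher, the posterior above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the function names are hypothetical, and we read P as the number of positive rewards and N as the number of trials for a variant.

```python
import random


def beta_posterior_params(positives, trials):
    """Return (alpha, beta) for the posterior Beta(1 + P, 1 + N - P)."""
    if not 0 <= positives <= trials:
        raise ValueError("need 0 <= positives <= trials")
    return 1 + positives, 1 + trials - positives


def posterior_mean(positives, trials):
    """Mean of the Beta posterior, alpha / (alpha + beta)."""
    alpha, beta = beta_posterior_params(positives, trials)
    return alpha / (alpha + beta)


def sample_reward_probability(positives, trials, rng=random):
    """Draw one plausible value of theta from the posterior."""
    alpha, beta = beta_posterior_params(positives, trials)
    return rng.betavariate(alpha, beta)
```

With no data, the posterior reduces to the uniform Beta(1, 1) prior. As trials accumulate, draws from `sample_reward_probability` concentrate around the observed reward rate.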
Since we are working with a binary reward, it is natural to instead model each variant's reward distribution as a Bayesian logistic regression problem. Given a d-dimensional feature vector x,
w ~ N_d(0, λ^{-1} I_d)
θ = σ(x · w)
To briefly summarize the approach we outlined in past articles on posterior inference for Bayesian logistic regression: we use gradient descent methods to find m_MAP, the posterior mode of the regression coefficients, and then use the Laplace approximation to estimate the posterior precisions q, one per coefficient. This gives a Gaussian approximation to the coefficient posterior distribution,
w* ~ N_d(m_MAP, diag(q)^{-1})
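The fitting procedure summarized above can be sketched as follows. This is a simplified illustration under several assumptions of ours: plain batch gradient descent stands in for whichever optimizer is used in practice, the Laplace approximation keeps only the diagonal of the Hessian (one precision per coefficient), and the names `fit_map` and `laplace_precisions` and the default hyperparameters are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def fit_map(X, y, lam=1.0, lr=0.1, steps=2000):
    """Find the posterior mode by gradient descent.

    Minimizes the negative log posterior: the logistic loss plus the
    Gaussian prior term (lam / 2) * ||w||^2.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the prior term
        for xi, yi in zip(X, y):
            err = sigmoid(dot(xi, w)) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        # Scale the step by 1 / n so lr is insensitive to dataset size.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def laplace_precisions(X, w, lam=1.0):
    """Diagonal of the Hessian of the negative log posterior at w."""
    q = [lam] * len(w)
    for xi in X:
        p = sigmoid(dot(xi, w))
        for j in range(len(w)):
            q[j] += p * (1.0 - p) * xi[j] ** 2
    return q
```

Each precision starts at the prior value λ and grows as observations with a non-zero value for that feature arrive, so coefficients for rarely seen features stay uncertain.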
With this posterior distribution, we have several contextual analogues to the "non-contextual" Beta-Bernoulli bandit exploration strategies. We summarize them below.
- ε-greedy. We randomly choose creative variants to collect a training set (a "burn-in" period). Following the burn-in period, we randomly choose a variant for a fraction ε of traffic; for the remaining traffic, we calculate reward probabilities θ = σ(x · m_MAP), and choose the variant with the highest reward probability for context x.
- Upper-confidence bound (UCB). We choose coefficients w_UCB(90) at the 90th percentile of the posterior distribution w*. We then calculate the reward probabilities θ = σ(x · w_UCB(90)) and proceed as above.
- Thompson sampling. Akin to Gibbs sampling, we sample from the posterior distribution w*. We calculate reward probabilities θ using these samples, effectively producing samples from the posterior distribution over θ; we then proceed as above.
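To make the three strategies concrete, here is a minimal sketch of each selection rule. It assumes every variant has its own fitted posterior, represented as a pair (mode, precisions) of equal-length lists. The function names are hypothetical, the burn-in period is omitted, and the UCB rule takes the 90th percentile of each coefficient independently, which is one possible reading of the description above.

```python
import math
import random
from statistics import NormalDist


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(x, w):
    """Reward probability sigma(x . w) for context x."""
    return sigmoid(sum(xj * wj for xj, wj in zip(x, w)))


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def epsilon_greedy(x, posteriors, eps=0.1, rng=random):
    """Explore uniformly with probability eps, else exploit the mode."""
    if rng.random() < eps:
        return rng.randrange(len(posteriors))
    return argmax([score(x, mode) for mode, _ in posteriors])


def ucb(x, posteriors, quantile=0.9):
    """Score each variant with optimistic, per-coefficient quantiles."""
    z = NormalDist().inv_cdf(quantile)
    scores = []
    for mode, precisions in posteriors:
        w = [m + z / math.sqrt(q) for m, q in zip(mode, precisions)]
        scores.append(score(x, w))
    return argmax(scores)


def thompson(x, posteriors, rng=random):
    """Score each variant with one draw from its coefficient posterior."""
    scores = []
    for mode, precisions in posteriors:
        # Standard deviation is the inverse square root of the precision.
        w = [rng.gauss(m, 1.0 / math.sqrt(q)) for m, q in zip(mode, precisions)]
        scores.append(score(x, w))
    return argmax(scores)
```

Note how the rules differ for a variant with little data: its precisions are low, so `ucb` inflates its coefficients and `thompson` draws widely varying ones, while `epsilon_greedy` ignores the uncertainty and explores only through the random ε fraction.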
The amount of data required to fit an online Bayesian logistic regression model grows with the dimensionality of the feature vector. In the early stages of creative testing, the contextual bandit may therefore exhibit poor predictive performance (effectively, a cold-start problem). Some possible ways to mitigate this problem are:
- Use aggressive feature selection and dimensionality reduction techniques. This is somewhat counter-productive, as one of the goals of bandit models is to efficiently explore new inventory.
- Use a hybridized technique, transitioning from Beta-Bernoulli bandits to progressively higher-dimensional logistic regression bandits as the variant gathers more training data.
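One way to picture the hybridized technique is as a schedule that unlocks more features as a variant accumulates observations. The sketch below is purely illustrative: the thresholds, the helper names, and the assumption that features are ordered from most to least important are all ours. A feature count of zero stands for falling back to the non-contextual Beta-Bernoulli bandit.

```python
# Hypothetical schedule: (minimum observations, number of features to use).
EXAMPLE_SCHEDULE = [(1_000, 2), (10_000, 8)]


def active_feature_count(n_observations, schedule=EXAMPLE_SCHEDULE):
    """Return how many leading features the variant's model may use.

    Returns 0 below the first threshold, meaning the variant should
    still be served by the non-contextual Beta-Bernoulli bandit.
    """
    count = 0
    for min_observations, n_features in sorted(schedule):
        if n_observations >= min_observations:
            count = n_features
    return count


def truncate_context(x, n_observations, schedule=EXAMPLE_SCHEDULE):
    """Keep only the features the variant has enough data to support."""
    return x[:active_feature_count(n_observations, schedule)]
```

A new variant therefore starts with an empty context and graduates to richer logistic regression models only once the data can support them.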