Bayesian Approach to Transfer Learning: Predicting Rare Events

By Igor Raush, Software Engineer

An advertisement's click-through rate (CTR) is often used as an early indicator of its effectiveness; however, the ultimate goal of any campaign is to reach and acquire prospective customers. Unfortunately, CTR is often weakly, or even inversely, correlated with the quality of a user segment, as measured by retention rate or ROI. As a result, training models to optimize for clicks can waste impressions on low-quality users, achieving a high CTR but poor ROI.

While click data is abundant and it is easy to build models that accurately predict click probability, optimizing for ROI is far more difficult. Purchase and retention events are rare, making it notoriously hard to extract a good signal from such a sparse dataset.

Bayesian Logistic Regression

In a previous article, we discussed how a Bayesian approach to probability prediction provides a framework for incorporating prior knowledge about the system into the model. The approach is summarized in Bayes' formula:
$$ p(\theta \mid y) \propto p(y \mid \theta)\ p(\theta) $$
In English, "the posterior distribution of the model parameters given the observed data $ p(\theta \mid y) $ is proportional to the data likelihood given the model parameters $ p(y \mid \theta) $ multiplied by the prior distribution on model parameters $ p(\theta) $."

In the special case of logistic regression, given a set of coefficients $ \beta $ and a set of independent variables $ X $, the dependent variable $ y $ is assumed to be a Bernoulli random variable with distribution
$$ p(y \mid \beta, X) = \text{Bernoulli}(\sigma(\beta^T X)) $$
Each coefficient $ \beta_i $ can be interpreted as the effect of the independent variable $ X_i $ on the dependent variable $ y $.
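The Bernoulli model above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not Aarki's production model; the coefficients and feature matrix are made-up toy values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def click_probability(beta, X):
    """p(y = 1 | beta, X) = sigma(beta^T x) for each row x of X."""
    return sigmoid(X @ beta)

# Toy example: two features, three users (hypothetical numbers).
beta = np.array([0.5, -1.0])
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
p = click_probability(beta, X)  # one Bernoulli success probability per user
```

Each entry of `p` is the parameter of the Bernoulli distribution for the corresponding user's outcome $ y $.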

Previously, we had chosen a shrinkage prior $ p(\beta) = \mathcal N(\beta \mid 0, \lambda^{-1}) $, expressing a tendency of the model parameters towards zero, i.e., the prior belief that the independent variables $ X $ have no effect on the dependent variable $ y $. This is often referred to as regularization, and leads us to choose the simplest model which sufficiently explains the data.
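The equivalence between the shrinkage prior and L2 regularization can be made concrete by maximizing the log-posterior directly. The sketch below, on synthetic data, finds the MAP estimate with `scipy.optimize.minimize`; the prior term $ \frac{\lambda}{2}\lVert\beta\rVert^2 $ in the negative log-posterior is exactly an L2 penalty. The data-generating coefficients are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(beta, X, y, lam):
    """Negative log-posterior for logistic regression with a
    N(0, 1/lam) shrinkage prior on each coefficient."""
    z = X @ beta
    # Bernoulli log-likelihood, written stably via logaddexp
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * lam * beta @ beta  # the L2 penalty
    return -(log_lik + log_prior)

# Synthetic data from known coefficients (toy values).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.0, -2.0, 0.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)

# MAP estimate under a mild prior, and under a much stronger one.
beta_map = minimize(neg_log_posterior, np.zeros(3), args=(X, y, 1.0)).x
beta_shrunk = minimize(neg_log_posterior, np.zeros(3), args=(X, y, 100.0)).x
```

A larger $ \lambda $ pulls the estimate harder towards zero, which is the "simplest model" behavior described above.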

Transfer Learning

In reality, we are free to choose any prior which captures our knowledge about the system. In particular, the prior can come from the results of a stronger model -- one trained on a related dataset. If we are optimizing a particular campaign's ROI, we can attempt to capture the characteristics of high-quality users in some generic domain, and transfer that knowledge to "kick-start" a campaign-specific model.

This way, we capture the signal from a highly informative dataset $ \tilde X, \tilde y $ in the distribution $ \tilde{p}(\beta \mid \tilde{X}, \tilde{y}) $, and use it when inferring the posterior distribution from the less informative campaign-specific dataset
$$ p(\beta \mid X, y) \propto p(y \mid \beta, X)\ \tilde p(\beta \mid \tilde X, \tilde y) $$
Powerful user-level models can be built on non-attributed revenue and retention event streams. We can leverage performance profiles of existing loyal and high-ROI users to bootstrap the ad campaigns and find users with similar interests and performance. While non-attributed datasets typically do not contain all information about the context in which the user was acquired, they are extremely informative with regard to features of the high-quality user profile.
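One simple way to realize this transfer is to approximate the source posterior $ \tilde p(\beta \mid \tilde X, \tilde y) $ by a Gaussian centered at the coefficients learned on the generic dataset, and use it as the prior for the campaign model. The sketch below is one possible implementation under that assumption, not the article's exact method; `beta_tilde` and the small campaign dataset are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(beta, X, y, beta_prior, lam):
    """Logistic regression with a N(beta_prior, 1/lam) prior centered
    at coefficients transferred from the generic (source) model."""
    z = X @ beta
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    diff = beta - beta_prior
    log_prior = -0.5 * lam * diff @ diff  # pulls beta towards beta_prior
    return -(log_lik + log_prior)

rng = np.random.default_rng(1)
beta_tilde = np.array([1.5, -0.8])   # learned on abundant generic data (hypothetical)
X_small = rng.normal(size=(20, 2))   # sparse campaign-specific dataset
y_small = (rng.uniform(size=20) < 1 / (1 + np.exp(-X_small @ beta_tilde))).astype(float)

# A reasonably strong prior "kick-starts" the campaign model near the
# transferred coefficients, instead of shrinking it towards zero.
beta_map = minimize(neg_log_posterior, np.zeros(2),
                    args=(X_small, y_small, beta_tilde, 10.0)).x
```

With only 20 campaign-level observations, the prior dominates and the estimate stays close to the transferred coefficients; as campaign data accumulates, the likelihood term takes over.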

At Aarki, with the help of our data scientists, we develop advanced machine learning algorithms to reach and acquire prospective users and deliver strong ROI. Contact us to learn how machine learning can help you today!

Topics: Machine Learning