Leveraging External Datasets for Probability Prediction by Decomposing Probabilities

By Igor Raush, Software Engineer 

Advertisers are increasingly interested in optimizing their campaigns directly for return on investment (ROI) or return on ad spend (ROAS). In a real-time bidding setting, it becomes crucial to predict the expected revenue from a particular ad impression, which, in combination with the target KPI, determines the amount we are willing to bid.


To model post-impression and post-install user behavior, we can consider an entire event funnel, e.g.

$$ \text{impression } (\mathbf x) \rightarrow
\text{click} \rightarrow
\text{install} \rightarrow
\text{session} \rightarrow
\text{conversion (e.g. purchase)} $$

which, for brevity, we can generalize as

$$ \mathbf x \rightarrow e_1 \rightarrow e_2 \rightarrow \cdots \rightarrow e_N $$

Events further down the funnel are generally more valuable, rarer, and harder to predict. The ultimate goal is to learn the joint distribution $ p(e_1, \dots, e_N \mid \mathbf x) $ of all funnel events, thereby learning to predict the probability of any event (in particular, the conversion probability) for each impression / user combination at bid time.

However, learning this distribution is complicated by the sparsity of the dataset. Samples with labels for all $ e_i $ are extremely rare. Consider that a campaign serving 10 million impressions $ \mathbf x $ could deliver only 10 conversions (purchases) $ e_N $.

We can instead decompose the joint distribution as

$$ p(e_1, \dots, e_N \mid \mathbf x) = p_N(e_N \mid e_{N-1}, \mathbf x) \dots p_1(e_1 \mid \mathbf x) $$

and learn each partial likelihood $ p_i $ separately, from varying sources of data. This is an approximation: in general, maximizing each partial likelihood separately does not maximize the overall likelihood. In practice, however, this approach gives good estimates on many datasets.
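As a toy numeric illustration of the chain-rule decomposition (all per-stage rates below are made up), the joint funnel probability is simply the product of the per-stage conditional probabilities:

```python
# Toy illustration of the chain-rule decomposition:
# p(e_1, ..., e_N | x) = p_N(e_N | e_{N-1}, x) * ... * p_1(e_1 | x)
# The per-stage rates below are hypothetical.
from functools import reduce

def joint_probability(stage_probs):
    """Multiply per-stage conditional probabilities p_i(e_i | e_{i-1}, x)."""
    return reduce(lambda acc, p: acc * p, stage_probs, 1.0)

# impression -> click -> install -> session -> purchase
stage_probs = [0.02, 0.10, 0.50, 0.01]
print(joint_probability(stage_probs))  # on the order of 1e-05
```

This also makes the sparsity problem concrete: even at healthy per-stage rates, only a handful of impressions in a million reach the final event.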


For instance, if we consider an impression-level dataset with labeled installs, and an external first-party advertiser dataset containing all installs and conversions delivered through any channel, we can train two logistic regression models, $ \theta_1 $ and $ \theta_2 $, such that

$$ p(\text{install} = 1 \mid \mathbf x) = \sigma \left( \theta_1^T \mathbf x \right) $$
$$ p(\text{conversion} = 1 \mid \text{install} = 1, \mathbf x) = \sigma \left( \theta_2^T \mathbf x \right) $$

At this point, we can estimate the conversion probability via

$$ p(\text{conversion} = 1 \mid \mathbf x) = \sigma \left(\theta_1^T \mathbf x \right) \sigma \left( \theta_2^T \mathbf x \right) $$
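A minimal sketch of this two-model estimate using scikit-learn. The datasets, feature dimensions, and base rates here are synthetic placeholders; in practice $\theta_1$ would be trained on the impression-level dataset and $\theta_2$ on the external first-party dataset:

```python
# Sketch: combine an impression-level install model (theta1) with a
# conversion model trained on an external dataset (theta2).
# All data below is synthetic and for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Impression-level dataset: features x with install labels (sparse positives).
X_impressions = rng.normal(size=(1000, 5))
y_install = (rng.random(1000) < 0.05).astype(int)

# External first-party dataset: device-level features for installed users,
# with conversion labels (much larger than the bid-stream dataset in practice).
X_installs = rng.normal(size=(5000, 5))
y_conversion = (rng.random(5000) < 0.2).astype(int)

theta1 = LogisticRegression().fit(X_impressions, y_install)
theta2 = LogisticRegression().fit(X_installs, y_conversion)

def p_conversion(x):
    """p(conversion=1 | x) = sigma(theta1^T x) * sigma(theta2^T x)."""
    p_install = theta1.predict_proba(x)[:, 1]
    p_conv_given_install = theta2.predict_proba(x)[:, 1]
    return p_install * p_conv_given_install

x_new = rng.normal(size=(3, 5))
print(p_conversion(x_new))  # one conversion probability per impression
```

Note that the two models may use different feature sets; the sketch reuses the same five features only for brevity.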

Note that $\theta_2$ here has considerably less variance since it is learned from a much larger dataset. There are some practical considerations with this approach:

1. The model $\theta_2$ can only rely on device-level features, since auction-level features do not exist for installs delivered through other user acquisition channels.

2. The conversion distribution in the external dataset must be sufficiently close to what we expect to see in the bid stream. For instance, if we find that organic and non-organic users exhibit very different purchase behaviors, we can choose to drop organic installs from the external dataset. Within fixed cohorts, the average conversion rate from the external dataset must match that from the bid stream; otherwise, the model will predict mis-calibrated probabilities.
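The calibration check in point 2 can be sketched as a per-cohort comparison of average conversion rates. The cohort keys and tolerance below are illustrative, not part of any production system:

```python
# Sketch: flag cohorts whose conversion rate in the external dataset differs
# from the bid stream. Cohort labels, data, and the tolerance are hypothetical.
from collections import defaultdict

def cohort_rates(records):
    """records: iterable of (cohort, converted) pairs -> {cohort: rate}."""
    totals, positives = defaultdict(int), defaultdict(int)
    for cohort, converted in records:
        totals[cohort] += 1
        positives[cohort] += int(converted)
    return {c: positives[c] / totals[c] for c in totals}

def miscalibrated_cohorts(external, bid_stream, tol=0.05):
    """Cohorts whose external conversion rate differs from the bid-stream
    rate by more than `tol` (absolute difference)."""
    ext, bid = cohort_rates(external), cohort_rates(bid_stream)
    return {c for c in ext.keys() & bid.keys() if abs(ext[c] - bid[c]) > tol}

external = [("US", 1), ("US", 0), ("DE", 1), ("DE", 1)]
bid_stream = [("US", 1), ("US", 0), ("DE", 0), ("DE", 0)]
print(miscalibrated_cohorts(external, bid_stream))  # {'DE'}
```

Cohorts flagged this way are candidates for filtering (e.g. dropping organic installs) or re-weighting before training $\theta_2$.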

Machine learning algorithms are often black boxes, so the choice and tuning of the algorithm are key to success. Aarki’s data scientists and engineers are developing advanced machine learning algorithms to predict the expected revenue from a particular ad impression and make sure that your advertising budget is spent on the right users.

Topics: Machine Learning