By Igor Raush, Software Engineer

The focus of app marketing is shifting from driving installs and minimizing cost per install to acquiring users who will become paying customers, effectively maximizing the ROI. Unfortunately, high-LTV users who can be attributed to an ad campaign are extremely rare in comparison to the number of impressions served, and it is difficult to accurately capture the profile of a quality user.

A promising strategy is lookalike modeling: we consider a set of engaged users and spenders coming from a variety of sources, such as non-attributed revenue and retention events, and target an audience which is maximally similar to this set.

Quantifying Lookalike Audiences

First, it is necessary to define what we mean by "similar" when talking about users. For our purposes, we consider a user to be a collection of features extracted from his or her installed apps, browsing history, session duration, and other attributes of the user profile.

These features can be mapped onto a high-dimensional Euclidean space, at which point the similarity between two users $u_i$ and $u_j$ can be defined simply as the distance $|u_i - u_j|$ between the corresponding points in that space. We assume that two users are similar if the distance between them is small.
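As a minimal sketch of this mapping, the snippet below turns a user profile into a fixed-order feature vector and computes the Euclidean distance between two users. The feature names, the helper functions `to_vector` and `distance`, and the example profiles are all hypothetical and chosen only for illustration. A real feature set would be much larger and would likely need scaling so that no single feature dominates the distance.

```python
import math

# Hypothetical feature names; the real feature set is not specified in this post.
FEATURES = ["has_game_app", "has_shopping_app", "avg_session_minutes"]

def to_vector(profile, feature_names):
    """Map a profile (dict of feature name -> number) onto a fixed-order vector.

    Features missing from the profile default to 0.0.
    """
    return [float(profile.get(name, 0.0)) for name in feature_names]

def distance(u, v):
    """Euclidean distance |u - v| between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two made-up users.
alice = to_vector({"has_game_app": 1, "avg_session_minutes": 3.0}, FEATURES)
bob = to_vector(
    {"has_game_app": 1, "has_shopping_app": 1, "avg_session_minutes": 7.0},
    FEATURES,
)

d = distance(alice, bob)  # smaller values mean more similar users
```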

Directly Targeting Lookalike Users

Once we have designed a way to map users onto a metric space, we can directly target a set of users most similar to a seed user set $\{ u_0, ..., u_N \}$. First, we compute the centroid of the seed audience - resulting in an "average" high-LTV user.
$$\bar u = \frac{1}{N+1} \sum_{i=0}^{N} u_i$$
Then, we consider a bank of profiles $\{ u^\prime_0, ..., u^\prime_M \}$ collected over time from other campaigns, and pick all users from the bank for which
$$|u^\prime_i - \bar u| < T$$
for some threshold $T$. This gives us a target audience for a new ad campaign that is similar to the seed set.
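The two steps above (compute the seed centroid, then keep bank profiles within distance $T$ of it) can be sketched as follows. The function names and the tiny two-dimensional vectors are assumptions made for illustration, not production code.

```python
import math

def centroid(users):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(users)
    dim = len(users[0])
    return [sum(u[k] for u in users) / n for k in range(dim)]

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_lookalikes(bank, seed, threshold):
    """Return the bank profiles lying strictly within `threshold` of the seed centroid."""
    center = centroid(seed)
    return [u for u in bank if distance(u, center) < threshold]

# Made-up seed set and profile bank.
seed = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
bank = [[2.0, 1.5], [9.0, 9.0], [2.5, 0.5]]

audience = select_lookalikes(bank, seed, threshold=1.0)
```

In this toy data the centroid is `[2.0, 1.0]`, so the first and third bank profiles are selected and the distant second one is dropped.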

Of course, this approach suffers from the extraneous parameter $T$, the "radius" around the seed set from which we choose users to target. This parameter is difficult to choose a priori, since the distance between users is unitless. Moreover, it requires a careful consideration of the tradeoff between user quality and campaign reach; choosing $T$ too small will leave too few users to target, while choosing $T$ too large will target users who are too dissimilar to the seed set to be valuable.
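To make the tradeoff concrete, the sketch below counts how many bank users fall inside the radius for a few candidate values of $T$. The distances and candidate thresholds are invented for illustration, and `reach_by_threshold` is a hypothetical helper.

```python
def reach_by_threshold(distances, thresholds):
    """Map each candidate threshold T to the number of users with distance < T."""
    return {t: sum(1 for d in distances if d < t) for t in thresholds}

# Made-up distances from each bank user to the seed centroid.
distances = [0.2, 0.5, 0.9, 1.4, 2.0, 3.5]

reach = reach_by_threshold(distances, [0.1, 1.0, 4.0])
```

Here the smallest threshold selects no one, while the largest selects every user in the bank, including those farthest from the seed set.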

Predicting High-quality Users

A more robust approach is to train a classifier to distinguish between low- and high-quality users. Once users are represented as feature vectors, this can be done using any standard classification technique (e.g. logistic regression, decision trees). To build the labeled dataset, we can consider the seed set of positive samples $U^+ = \{ u_0, ..., u_N \}$ and randomly select a complementary set of negative samples $U^-$ from the bid stream. We can then train the classifier on the combined set $U^+ \cup U^-$.
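The dataset construction and training step might look like the sketch below, which uses a plain logistic regression fitted by batch gradient descent. Everything here is an assumption made for illustration: the synthetic two-dimensional users, the helper names `train_logistic` and `predict_proba`, and the learning rate and epoch count. In practice an established library implementation would likely be preferable.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(samples, labels, lr=0.1, epochs=500):
    """Fit logistic regression weights (w, b) by batch gradient descent."""
    dim = len(samples[0])
    n = len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wk - lr * g / n for wk, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(model, x):
    """Model score in (0, 1) that user vector x belongs to the positive class."""
    w, b = model
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-ins: seed users cluster near (2, 2), the bid stream near (0, 0).
positives = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(50)]
bid_stream = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(1000)]
negatives = rng.sample(bid_stream, 50)  # random negative samples U^-

samples = positives + negatives
labels = [1] * len(positives) + [0] * len(negatives)
model = train_logistic(samples, labels)
```

Note that negatives drawn at random from a real bid stream are unlabeled rather than known low-quality users, so the class balance in this dataset is an artifact of sampling.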

This classifier is difficult to use directly in a bidding strategy, since it will not provide a well-calibrated probability that a user is valuable (due to the way the dataset is built). Nevertheless, there is an easy way to make use of the signal captured by this model. We use a click- or install-optimizing bidding algorithm, and then filter the bids it generates by the user-quality classifier. For instance, if we want to acquire valuable users with high confidence, we only bid on an impression when the filtering model predicts that the user is high-quality with probability exceeding 50%.
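The filtering step can be sketched as a thin wrapper around an existing bidder. The request format, the `base_bid` and `quality_proba` callables, and the flat bid price below are hypothetical stand-ins for a real click- or install-optimizing bidder and a trained user-quality classifier.

```python
def filter_bids(bid_requests, base_bid, quality_proba, threshold=0.5):
    """Keep only the bids for users the quality model scores above `threshold`.

    base_bid(request) returns a bid price, or None to skip the impression.
    quality_proba(user) returns the classifier's score in [0, 1].
    """
    bids = []
    for request in bid_requests:
        bid = base_bid(request)
        if bid is None:
            continue  # the underlying bidder declined this impression
        if quality_proba(request["user"]) > threshold:
            bids.append((request["id"], bid))
    return bids

# Toy stand-ins for the bidder and the quality classifier.
def flat_bidder(request):
    return 1.25  # hypothetical flat bid price

def toy_quality(user):
    return 0.9 if sum(user) > 3.0 else 0.1

requests = [
    {"id": "a", "user": [2.0, 2.0]},
    {"id": "b", "user": [0.0, 0.0]},
    {"id": "c", "user": [1.8, 2.2]},
]

filtered = filter_bids(requests, flat_bidder, toy_quality, threshold=0.5)
```

Because the classifier only gates bids and never sets the price, its scores do not need to be well calibrated; only the ordering of users relative to the threshold matters.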

We evaluated this technique offline on historical campaign data to explore the effect of the filtering threshold on the resulting ROI. As shown in the figure, increasing the threshold leads to a dramatic increase in ROI, as well as a decrease in CPI.
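An offline evaluation of this kind could be sketched as follows. The record fields and all numbers are invented for illustration and are not the campaign data behind the figure. The sketch also assumes ROI is defined as (revenue - spend) / spend and CPI as spend / installs, since the post does not define them explicitly.

```python
def evaluate_threshold(records, threshold):
    """Replay historical impressions, keeping only those above the quality threshold."""
    kept = [r for r in records if r["quality_proba"] > threshold]
    spend = sum(r["cost"] for r in kept)
    installs = sum(r["installed"] for r in kept)
    revenue = sum(r["revenue"] for r in kept)
    return {
        "spend": spend,
        "installs": installs,
        # Assumed definitions; None when the denominator would be zero.
        "roi": (revenue - spend) / spend if spend else None,
        "cpi": spend / installs if installs else None,
    }

# Made-up historical records: one per impression that was bought.
records = [
    {"quality_proba": 0.9, "cost": 1.0, "installed": 1, "revenue": 4.0},
    {"quality_proba": 0.7, "cost": 1.0, "installed": 1, "revenue": 1.0},
    {"quality_proba": 0.3, "cost": 1.0, "installed": 0, "revenue": 0.0},
    {"quality_proba": 0.2, "cost": 1.0, "installed": 1, "revenue": 0.0},
]

unfiltered = evaluate_threshold(records, threshold=0.0)
filtered = evaluate_threshold(records, threshold=0.5)
```

Sweeping `threshold` over a grid and plotting the resulting ROI and CPI would produce curves of the kind described above, with the caveat that higher thresholds also reduce spend and therefore reach.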

Aarki’s data scientists and engineers are developing advanced machine learning algorithms to reach and acquire the best users and deliver strong ROI. Contact us at partnerships@aarki.com to learn how machine learning can help accelerate your marketing goals.
