Wouldn’t it be great if we could predict the performance of our next mobile app marketing campaign even before it starts?
More importantly, it would definitely be useful if we could identify the key aspects of a campaign that are most likely to drive its performance.
As the gap between cost per install (CPI) and average revenue per user (ARPU) continues to widen, app marketers are under increasing pressure to deliver better post-install campaign performance. A popular strategy to address this challenge is to spread the initial test budget across dozens of different media sources and then scale around the better-performing ones.
Machine learning can be used to answer these and similar questions. In this article, we compare a few alternate machine learning approaches and discuss how they can be used to predict campaign performance.
A few weeks back, we got a data dump from our analyst team containing 280 million transactional records representing a random sample of programmatic campaigns run at Aarki over the past year.
For the predictive analysis, we defined a dependent variable that represents the level of improvement achieved in CPI over the course of the campaign. Initial CPI was defined as the average CPI during the first 3 days of the campaign.
Improvement = 1 - (Minimum Daily CPI / Initial CPI)
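As a rough illustration, the improvement metric for a single campaign could be computed as below. This is a minimal sketch: the function name, the sample CPI values, and the choice to take the minimum over all days of the campaign (including the first three) are assumptions made for the example, not details confirmed by the study.

```python
from statistics import mean


def cpi_improvement(daily_cpi, initial_days=3):
    """Return 1 - (minimum daily CPI / initial CPI) for one campaign.

    daily_cpi: daily CPI values in chronological order.
    initial_days: number of opening days averaged into the initial CPI
                  (3 in the article).
    """
    if len(daily_cpi) < initial_days:
        raise ValueError("need at least `initial_days` days of data")
    initial_cpi = mean(daily_cpi[:initial_days])
    # Assumption: the minimum is taken over the whole campaign.
    return 1 - min(daily_cpi) / initial_cpi


# Hypothetical campaign whose CPI falls from about $8 to about $5.
example_cpi = [8.4, 7.9, 7.7, 7.1, 6.5, 6.0, 5.6, 5.2, 5.0, 5.1]
improvement = cpi_improvement(example_cpi)
print(f"Improvement: {improvement:.3f}")
```

Under this reading the minimum daily CPI can never exceed the initial average, so the metric always falls between 0 and 1, which is consistent with the bounded dependent variable described later.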
We evaluated 11 explanatory variables, including some artificially constructed dummy variables. These variables were initially fed into a doubly-censored tobit model, which served as the baseline against which the different machine learning models were tested. A significance test on the t-statistics led us to drop three variables at the 95% confidence level. The final list of 8 explanatory variables was used across all four models tested.
The overall data was aggregated by campaign and by day, and then partitioned into the training and test datasets using a 70-30 random split.
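A 70-30 random split can be sketched with nothing more than a seeded shuffle. The row layout, the seed, and the decision to split individual campaign-day rows (rather than whole campaigns) are assumptions for illustration only.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and cut it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical campaign-day rows: 4 campaigns x 5 days.
rows = [
    {"campaign": campaign, "day": day}
    for campaign in ("A", "B", "C", "D")
    for day in range(1, 6)
]
train, test = train_test_split(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed matters when several models are compared, since every model then sees exactly the same training and test rows.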
The modeling effort consisted of constructing a baseline model to serve as reference and three alternate machine learning models. All four models were trained on the same dataset and then used to predict the outcome for the test data. In some cases, the training was iterated until the best fit model was identified. Mean squared error (MSE) was used as the primary metric for evaluation of model performance.
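For reference, mean squared error is the average of the squared differences between actual and predicted values. The sketch below uses made-up improvement values purely to show the calculation.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical improvement values for four test campaigns.
actual = [0.10, 0.35, 0.50, 0.00]
predicted = [0.20, 0.30, 0.40, 0.10]
mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse:.6f}")
```

Because the dependent variable lies between 0 and 1, MSE values for this problem are small in absolute terms, so differences between models are easier to judge in relative terms.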
In machine learning, it is customary to use a multiple linear regression model as baseline. Since we have a limited dependent variable, i.e., it can only take values between 0 and 1, we chose to use a doubly-censored tobit model for the baseline. This model is quite similar to multiple linear regression except for the constraints on the dependent variable.
The tobit model was fitted using maximum likelihood estimation (MLE). The model has decent predictive power, with a mean squared error of 0.0321. A multiple linear regression model fitted on the same variables resulted in an R-squared value of 0.422 on the training dataset, further reinforcing the predictive ability of these variables.
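To make the maximum likelihood step more concrete, here is a minimal sketch of the log-likelihood for a tobit model censored at 0 and 1. It assumes the standard formulation with a normally distributed latent variable; the function names, toy data, and parameter values are hypothetical, and this is not the estimation code used in the study.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_logpdf(z):
    """Log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def tobit_loglik(beta, sigma, X, y, lower=0.0, upper=1.0):
    """Log-likelihood of a tobit model censored at `lower` and `upper`.

    Latent model: y* = x'beta + e, with e ~ Normal(0, sigma^2).
    Observed y equals y* clipped to the interval [lower, upper].
    """
    total = 0.0
    for x_row, y_i in zip(X, y):
        mu = sum(b * x for b, x in zip(beta, x_row))
        if y_i <= lower:    # censored from below: P(y* <= lower)
            total += math.log(norm_cdf((lower - mu) / sigma))
        elif y_i >= upper:  # censored from above: P(y* >= upper)
            total += math.log(norm_cdf((mu - upper) / sigma))
        else:               # uncensored: normal density of the residual
            total += norm_logpdf((y_i - mu) / sigma) - math.log(sigma)
    return total


# Toy data: an intercept column plus one explanatory variable.
X = [[1.0, 0.0], [1.0, 0.2], [1.0, 0.4], [1.0, 0.6], [1.0, 0.8], [1.0, 1.0]]
y = [0.0, 0.1, 0.35, 0.6, 0.85, 1.0]

loglik_good = tobit_loglik([-0.1, 1.2], 0.1, X, y)  # close to the data
loglik_poor = tobit_loglik([0.5, 0.0], 0.1, X, y)   # ignores the feature
print(loglik_good > loglik_poor)
```

Maximum likelihood estimation searches for the coefficients and sigma that make this quantity as large as possible, typically with a numerical optimizer.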
Examination of the z-values (figure 1) indicates that use of multivariate creative optimization (is.multivariate) and a higher number of creative variants tested (variants) are most likely to result in big improvements in campaign CPI. This observation validates Aarki’s core differentiation in the marketplace around the importance of ad creative in driving campaign performance. We also notice that if the initial CPI is greater than $7 (hi.cpi), we can expect greater overall improvement, but the absolute value of the CPI (cpi_base) is not as important.
The duration of the campaign (is.long, duration) has a somewhat positive impact, since the Aarki Encore algorithm has more time to optimize. However, since the algorithm converges fast, this factor is not very significant. Interestingly, budget (is.hibudget, cost) is seen to have a negative impact, likely because higher-budget campaigns are often optimized for volume rather than CPI.
Figure 1: Summary of Tobit Model
Machine Learning Models
The primary machine learning formulations tested include artificial neural network (ANN), random forest, and support vector machine (SVM). Several other models, such as multiple linear regression, latent variable logistic regression, and classification and regression trees, were also tested. Their results are not included in this article since they were inferior to the models presented here.
Looking at figure 2, we observe that all three machine learning models have better predictive power than the tobit model. Among the three, the artificial neural network performs the worst with an MSE of 0.0302, while the support vector machine has an MSE that is almost 25% lower at 0.0229. The random forest falls in the middle with an MSE of 0.0278. This indicates that choosing the right algorithm can play a major role in the ability of machine learning to predict campaign performance.
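The relative differences can be checked directly from the four MSE values reported in this article. The short sketch below only does that arithmetic; the dictionary keys and function name are made up for the example.

```python
# MSE values as reported in the article.
reported_mse = {
    "tobit": 0.0321,
    "ann": 0.0302,
    "random_forest": 0.0278,
    "svm": 0.0229,
}


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` error is lower than `baseline` error."""
    return (baseline - candidate) / baseline


# Reduction of each machine learning model relative to the tobit baseline.
vs_tobit = {
    name: relative_reduction(reported_mse["tobit"], mse)
    for name, mse in reported_mse.items()
    if name != "tobit"
}
# Reduction of the SVM relative to the neural network.
svm_vs_ann = relative_reduction(reported_mse["ann"], reported_mse["svm"])

for name, reduction in vs_tobit.items():
    print(f"{name}: {reduction:.1%} lower MSE than tobit")
print(f"svm: {svm_vs_ann:.1%} lower MSE than ann")
```

Measured against the tobit baseline rather than the neural network, the SVM's reduction comes out somewhat larger, at a little under 29%.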
We also notice that the machine learning models, SVM in particular, tend to have better accuracy for larger values of the dependent variable. This suggests the possibility of exploring multi-regime models and introducing interaction variables.
Figure 2: Comparison of Machine Learning Models
Artificial Neural Network
The neural network is essentially a black box and not much inference can be drawn from the output of the model. The network was trained using a multilayer perceptron (MLP). We found that a network with one hidden layer containing 4 nodes performed best for this dataset.
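For a sense of how small this network is, the sketch below runs a forward pass through a multilayer perceptron with 8 inputs (one per explanatory variable), one hidden layer of 4 nodes, and a single output. The sigmoid activations, the random weights, and the input values are assumptions for illustration; the article does not specify the activation functions or the trained weights.

```python
import math
import random


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, hidden_weights, hidden_bias, output_weights, output_bias):
    """Forward pass through a one-hidden-layer perceptron."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_bias)
    ]
    # Assumption: a sigmoid output keeps predictions between 0 and 1.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


N_INPUTS, N_HIDDEN = 8, 4
rng = random.Random(0)  # hypothetical, untrained weights
hidden_weights = [[rng.uniform(-1, 1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
hidden_bias = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
output_weights = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
output_bias = rng.uniform(-1, 1)

n_params = N_INPUTS * N_HIDDEN + N_HIDDEN + N_HIDDEN + 1
x = [rng.random() for _ in range(N_INPUTS)]  # one hypothetical campaign
prediction = mlp_forward(x, hidden_weights, hidden_bias, output_weights, output_bias)
print(n_params, "trainable parameters; prediction =", round(prediction, 3))
```

With this layout the network has only 41 trainable parameters, yet none of them maps to a single explanatory variable the way a regression coefficient does, which is what makes the fitted model hard to interpret.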
In general, ANNs perform worse for domains that are short tailed in the regularized solution space, i.e., where a few features can explain a large part of the space. Based on the tobit analysis, we know this is the case here, which could be one explanation for the relatively worse performance. A network this small may also capture interaction effects between explanatory variables only weakly, so the performance can likely be improved by introducing explicit interaction variables.
Random Forest
The training results of a random forest are also quite unintuitive, especially when compared to a single decision tree. From figure 3, we can see that the random forest explains 35.08% of the variance. Node purity is an indicator of the relative importance of the nodes in the random forest. The baseline CPI and number of variants used in creative optimization were identified as the most important decision nodes that drove many splits. The low purity of the dummy variables also indicates that the logical splits for the random forest are likely at values that are different from the ones used in defining these variables.
The performance of this random forest could probably be improved by redefining some of the dummy variables and adding interaction variables. However, for the purpose of this analysis, it was important for us to control the variable set.
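As an illustration of what redefining a dummy variable or adding an interaction variable could look like, the sketch below rebuilds hi.cpi from cpi_base with a configurable threshold and multiplies it with variants. The $7 threshold comes from this article; the alternative threshold, the interaction term, its name, and the sample campaign are hypothetical.

```python
def add_features(campaign, cpi_threshold=7.0):
    """Return a copy of `campaign` with a dummy and an interaction variable.

    campaign: dict with at least "cpi_base" (initial CPI in dollars) and
              "variants" (number of creative variants tested).
    """
    features = dict(campaign)
    # Dummy variable: 1 if the initial CPI is above the threshold.
    features["hi.cpi"] = 1 if campaign["cpi_base"] > cpi_threshold else 0
    # Hypothetical interaction: variants only counts for high-CPI campaigns.
    features["variants:hi.cpi"] = campaign["variants"] * features["hi.cpi"]
    return features


campaign = {"cpi_base": 6.2, "variants": 12}   # hypothetical campaign
default_cut = add_features(campaign)           # $7 threshold from the article
lower_cut = add_features(campaign, cpi_threshold=5.0)  # hypothetical redefinition
print(default_cut["hi.cpi"], lower_cut["hi.cpi"])
```

Moving the threshold changes which campaigns count as high CPI, which is the kind of redefinition that could bring the dummy variables closer to the split points the forest finds on its own.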
Figure 3: Summary of Random Forest Model
Figure 4 shows the impact of the number of decision trees in the forest on overall error reduction. Most of the error reduction occurs with fewer than 100 trees.
Figure 4: Error Reduction as a Function of Number of Trees in Random Forest
Support Vector Machine
The support vector machine was identified as the most effective machine learning model in this analysis. Since SVMs are easier to train, more intuitive, and computationally cheaper, they seem to be the logical choice for problems related to the prediction of campaign performance.
Figure 5 shows that 282 support vectors were created to build this model using the eps-regression algorithm. We conducted a tuning analysis (figure 6) to select the right values of epsilon and cost for the SVM. Choice of kernel parameters is a tricky issue for SVMs, but in this case we did not find the optimal solution to be particularly sensitive to changes in these parameters.
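To show how epsilon and cost enter the picture, here is a minimal sketch of the objective that epsilon-regression minimizes, written for a plain linear model. The kernel used in the study is not stated, so the linear form, the toy data, and the parameter values are all assumptions for illustration.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the excess error outside it."""
    return max(0.0, abs(actual - predicted) - epsilon)


def svr_objective(weights, bias, X, y, cost, epsilon):
    """Regularization term plus cost-weighted epsilon-insensitive loss."""
    penalty = 0.5 * sum(w * w for w in weights)
    loss = sum(
        epsilon_insensitive_loss(
            y_i, sum(w * x for w, x in zip(weights, row)) + bias, epsilon
        )
        for row, y_i in zip(X, y)
    )
    return penalty + cost * loss


# Toy data with a single explanatory variable.
X = [[0.0], [0.5], [1.0]]
y = [0.1, 0.5, 0.4]
weights, bias = [0.5], 0.1

low_cost = svr_objective(weights, bias, X, y, cost=1.0, epsilon=0.1)
high_cost = svr_objective(weights, bias, X, y, cost=10.0, epsilon=0.1)
print(round(low_cost, 4), round(high_cost, 4))
```

Errors smaller than epsilon are ignored entirely, while cost sets how heavily larger errors are penalized relative to model complexity. Observations that sit on or outside the epsilon tube are the ones that end up as support vectors.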
Figure 5: Summary of Support Vector Machine Model
Figure 6: Performance of Support Vector Machine as a Function of Cost and Epsilon
In this article, we have presented four different models for predicting the performance of mobile app marketing campaigns. The models are applied to a random sample derived from historical programmatic campaign data. The results are a good indication of the predictive ability of machine learning algorithms over more traditional parametric modeling approaches. They show that not all approaches produce the same results, so the choice and tuning of the algorithm are key to success. We also see that machine learning algorithms are often black boxes and so must be used in conjunction with other approaches to achieve better interpretation.
The next time someone talks about using machine learning for campaign optimization, make sure to ask them what algorithm they are using and their key assumptions. For more details on this study, please contact us at firstname.lastname@example.org.