
How to Measure Personalization Performance with A/B Tests

Personalization often goes hand in hand with A/B testing. Used together, these tools can provide a one-of-a-kind insight into how much of a difference some tweaks to your marketing efforts can make and whether there are ways to help you make the most of them.

In this article, we look at personalization as a tool to grow app revenue, and at how A/B testing can help measure its performance.

Personalization As a Tool To Grow App Revenue

The aim of personalization is to give users content, experiences, and features that are tailored to their individual needs.

The customized offer can be almost anything – from a personalized color scheme to content that is particularly relevant to a specific user.

ML models for personalization

One of the most popular approaches to personalization is the use of machine learning models that process big data to generate customized offers.

For the best results, the model should be consistent with your end goal and product type.

ML-powered personalization can boost key app metrics such as LTV, retention, and app revenue, or decrease churn.

Read more about ML-powered personalization in our blog posts.

How to Use A/B testing to Measure Personalization Performance

A/B testing is the simplest and most common way of measuring the efficacy of changes to a product.

A/B testing helps you to ensure that the effect of changes is not due to chance and will have lasting benefits. This is particularly relevant to apps with a well-established audience that is sensitive to change.

However, this tool has some limitations – A/B testing of offers can help you find the best fit for the tested audience as a whole, but not for a specific segment.

For example, suppose A/B testing splits the audience into three segments and identifies the optimal price of an offer as USD 1.99. Chances are that this price is:

  1. too low for the first segment, whose users are ready to pay more;
  2. about right for the second segment;
  3. too high for the third segment.

This approach results in part of the revenue being lost – users from the first segment are ready to pay more, while users from the third segment may find the price too high and never make a purchase or generate revenue for the app.
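A quick back-of-the-envelope sketch in Python of this revenue gap (the segment sizes and willingness-to-pay figures are made up purely for illustration):

```python
# Hypothetical willingness-to-pay for three equal-sized segments
# (illustrative numbers only, not from the article).
segments = {"high": 4.99, "mid": 1.99, "low": 0.99}
users_per_segment = 1000

# One-price-fits-all: only segments willing to pay at least the price convert.
single_price = 1.99
single_revenue = sum(
    single_price * users_per_segment
    for wtp in segments.values()
    if wtp >= single_price
)

# Per-segment pricing: each segment is charged what it is willing to pay.
personalized_revenue = sum(wtp * users_per_segment for wtp in segments.values())

print(round(single_revenue, 2))        # 3980.0 – only high and mid convert
print(round(personalized_revenue, 2))  # 7970.0 – all three segments convert
```

Even in this toy setup, a single price captures barely half the revenue that segment-specific prices would.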

A/B Testing Automation and Personalized Offers

Personalized offers can be an answer to the lost revenue problem. Services such as MyTracker Personalize can find the offer that will work best for a particular audience segment or individual users.

With the help of this service, you can provide users with personalized real-time recommendations, including customized offers and ranked lists of products and prices.

MyTracker Personalize ranks in-game items from the store according to user preferences and offers personalized discounts. This is where audience segmentation comes in.

How audience segmentation works

First, MyTracker Personalize splits the audience into segments and identifies the most appropriate offer for each of them in order to optimize the target metric.
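The exact segmentation logic is internal to MyTracker Personalize; as a minimal illustration, here is a hypothetical score-based split, assuming a model that outputs a payment-likelihood score per user:

```python
def segment_users(scores, threshold=0.5):
    """Split users into two segments by a predicted payment-likelihood score.

    `scores` maps user_id -> model score in [0, 1]; the 0.5 threshold is a
    hypothetical cut-off for illustration, not MyTracker's actual logic.
    """
    segments = {"likely_payer": [], "unlikely_payer": []}
    for user_id, score in scores.items():
        key = "likely_payer" if score >= threshold else "unlikely_payer"
        segments[key].append(user_id)
    return segments

print(segment_users({"u1": 0.9, "u2": 0.2, "u3": 0.6}))
# {'likely_payer': ['u1', 'u3'], 'unlikely_payer': ['u2']}
```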

The MyTracker Personalize segmentation is based on a set of project-specific criteria.

Algorithm for finding a segment-specific offer – the multi-armed bandit

To find offers, MyTracker Personalize uses models powered by the multi-armed bandit algorithm, which works in two steps:

  1. Analyze the users’ responses to the offers and collect feedback.
  2. Based on the feedback, select the offer that has been the most popular choice in the respective segment.

After the first stage of data collection, the multi-armed bandit identifies the offer that generates the highest benefit based on the selected target metric – for example, ARPU – and defines that offer as the best performer.

The algorithm starts to show the best-performing offer more often and the other offers less often. It keeps showing the other offers, though, because over time new in-app events (for example, new discounts) or changes in user behavior may turn the best-performing offer into the worst one, and vice versa.

It is important for the algorithm to know which offer converts into sales more often and which one performs worst. Based on this data, the system optimizes the way the offer is shown: the more purchases an offer generates, the more often it is delivered to users.
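The "show winners more, but keep exploring" behavior can be sketched with an epsilon-greedy bandit step – one common bandit strategy, not necessarily MyTracker's exact variant (the offer names and numbers below are illustrative):

```python
import random

def choose_offer(stats, epsilon=0.1):
    """One epsilon-greedy bandit step.

    `stats` maps offer -> [impressions, revenue]. With probability `epsilon`
    we explore a random offer; otherwise we exploit the offer with the best
    revenue per impression (ARPU) so far.
    """
    if random.random() < epsilon:
        return random.choice(list(stats))  # keep exploring other offers
    return max(stats, key=lambda o: stats[o][1] / max(stats[o][0], 1))

def record(stats, offer, revenue):
    """Update impression count and revenue after showing an offer."""
    stats[offer][0] += 1
    stats[offer][1] += revenue

stats = {"offer_1": [100, 50.0], "offer_2": [100, 120.0], "offer_3": [100, 30.0]}
# offer_2 has the best ARPU so far (1.20), so with epsilon=0.1 it is
# exploited roughly 90% of the time, while the others are still sampled.
record(stats, choose_offer(stats), revenue=1.99)
```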

For example, as can be seen from the figure below, on the second day, Offer 2 was shown more often because it had generated the highest ARPU and was prioritized. The other offers were also shown, though less often.

It can be said that the multi-armed bandit is an A/B test that adjusts group sizes to find the offer that performs best at a given moment and thus minimizes your losses from suboptimal groups.

Statistical significance in A/B testing

When performing A/B testing, it is important to make sure that the results are not due to chance before you begin to analyze them and draw conclusions. For this purpose, a statistical test is carried out to evaluate the statistical significance of the result.

Statistical significance is the level of confidence that the results are not due purely to chance.

To calculate statistical significance, the system uses three main parameters of the respective metric.

Other factors may also be taken into account, depending on the project.

The hypotheses that a change is present or absent are then tested – for example, whether the average values in groups A and B are the same or different.
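A simplified stand-in for such a test is a two-sample z-test on the difference of group means – the article does not specify which test the service actually runs, and the numbers below are illustrative, assuming large samples:

```python
import math

def z_test_means(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Two-sample z-test for H0: the group means are equal.

    Returns the z statistic and the two-sided p-value, computed from the
    standard normal CDF (valid as a large-sample approximation).
    """
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_b - mean_a) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p

# Hypothetical ARPU comparison: control vs personalized group.
z, p = z_test_means(mean_a=1.00, var_a=4.0, n_a=10_000,
                    mean_b=1.08, var_b=4.0, n_b=10_000)
print(p < 0.05)  # True – the difference is significant at the 95% level
```

With small samples, the p-value stays high even for a visible difference in means, which is exactly the "not enough data yet" situation described below.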

After the metric is normalized, we have a graph with a continuous line and two dashed lines. The left dashed line stands for the initial metric and the right one for the target test value. The continuous line indicates the statistical significance level.

The closer the continuous and left dashed lines are to each other, the more reliable the results are. The graph below shows a case where statistical significance has not been achieved.

This often happens when testing has just started and not enough time has passed to collect sufficient data. In this case, a box plot or histogram can be built from the current results. Here we can see that personalization already works better than standard offers, even though statistical significance has not yet been achieved due to insufficient data – we just have to wait a little longer for it to be reached.

Now that we have theory at our fingertips, let's look at how testing and personalized offers work in practice.

Case Study of A/B Testing of Personalized Offers in Hustle Castle

Hustle Castle is a medieval castle simulator with RPG elements.

Hypothesis: using the MyTracker Personalize recommendation engine can boost ARPU in a non-paying segment by 10–30% with personalized offers.

Key metric: ARPU.

Test duration: 3 months

Testing: all non-paying users were divided into two groups. The control group continued to receive offers based on the same logic as before, while the recommendation group received one of 100 offers agreed upon with the studio, once every two weeks.

MyTracker Personalize split the recommendation group into two segments using an ML model predicting the likelihood of payment. The model demonstrated the best audience segmentation compared to the other segmentation models – by country, churn probability, and other criteria.

The segments were further used to test several models for identifying the best offers. As a result, a multi-armed bandit powered by the Thompson sampling algorithm performed best for the first segment, while a multi-armed bandit based on a greedy algorithm was more effective for the second segment.
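Thompson sampling, the strategy that won for the first segment, can be sketched for binary conversions as follows (the offer names and counts are illustrative, and this is a generic textbook version, not MyTracker's implementation):

```python
import random

def thompson_pick(stats):
    """Thompson sampling over offers with binary conversion outcomes.

    `stats` maps offer -> (conversions, impressions). Each offer's conversion
    rate gets a Beta(conversions + 1, failures + 1) posterior; we draw one
    sample per offer and show the offer with the highest draw, so better
    offers win more often while weaker ones are still explored.
    """
    draws = {
        offer: random.betavariate(conv + 1, imp - conv + 1)
        for offer, (conv, imp) in stats.items()
    }
    return max(draws, key=draws.get)

stats = {"offer_a": (40, 1000), "offer_b": (80, 1000)}
# offer_b's posterior is centred around 8% vs offer_a's 4%, so it wins
# the vast majority of draws while offer_a is occasionally revisited.
print(thompson_pick(stats))
```

A greedy algorithm, by contrast, would always pick the argmax of the observed conversion rates, with no randomness in the exploitation step.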

The result: with personalized offers, Hustle Castle’s ARPU in the recommendation group grew by 23%.

Read more about this case study in our blog.


MyTracker Personalize is an offer personalization service. It can help you automate A/B testing and boost app revenue by 10–30%.

Connect MyTracker in three easy steps:

  1. Integrate the SDK.
  2. Select the segment and offers for recommendation.
  3. Launch personalized offers.
Request a demo and get a 3-month free trial. Your personal manager will assist you with connection and setup.

Tags: personalization A/B testing