Stats show that roughly 80% of companies regularly run A/B tests to compare different UX designs and try out different ad creatives, newsletters, etc.
It takes 4 weeks on average to run a single test, and the process takes even longer if the number of target actions is small.
Sadly, only one out of eight tests yields meaningful results that confirm the initial hypothesis, i.e. the other seven do not lead to a revenue increase.
So, is there a way to somehow speed up the process and test more hypotheses per unit of time?
There certainly is! In this article, we’ll explore a number of ways to tackle this challenge by leveraging the right product approach and a little bit of math.
Before using our tips to speed up A/B testing, make sure you’ve accurately calculated the test’s actual duration; otherwise, there’s a risk the tips won’t yield any tangible results.
Let’s refresh our memory on the minimum sample size for the testing results to have statistical significance:
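$$n_c = \frac{\left(t_{1-\alpha/2} + t_{1-\beta}\right)^2 \left(\sigma_c^2 + \frac{\sigma_t^2}{k}\right)}{\left(\mu_c - \mu_t\right)^2}, \qquad n_t = k \cdot n_c$$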
where:
μc — the mean value of the control sample metric,
μt — the mean value of the test sample metric,
nc — the number of observations in the control sample,
nt — the number of observations in the test sample,
σc — the standard deviation of the control sample metric,
σt — the standard deviation of the test sample metric,
k — the ratio of the test sample size to the control sample size, nt/nc (usually equal to 1),
t1−α/2, t1−β — quantiles of the standard normal distribution at the percentile given in the subscript (with the standard error levels α = 0.05 and β = 0.2, these are the 0.975 and 0.8 quantiles, roughly 1.96 and 0.84).
If the test group and the control group have the same sample size, the formula looks like this:
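$$n_c = n_t = \frac{\left(t_{1-\alpha/2} + t_{1-\beta}\right)^2 \left(\sigma_c^2 + \sigma_t^2\right)}{\left(\mu_c - \mu_t\right)^2}$$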
And this is exactly the formula we are going to use.
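To make this concrete, here is a minimal Python sketch of the calculation for equal group sizes (the function name and example numbers are ours, purely for illustration):

```python
from scipy.stats import norm

def min_sample_size(mu_c, mu_t, sigma_c, sigma_t, alpha=0.05, beta=0.2):
    """Minimum number of users per group for a two-sample test of means (k = 1)."""
    t_alpha = norm.ppf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    t_beta = norm.ppf(1 - beta)        # ~0.84 for beta = 0.2
    effect = mu_c - mu_t
    return (t_alpha + t_beta) ** 2 * (sigma_c ** 2 + sigma_t ** 2) / effect ** 2

# Example: detecting a $1 difference in average purchase with ~$5 standard deviations
print(min_sample_size(mu_c=10, mu_t=11, sigma_c=5, sigma_t=5))  # ~392 users per group
```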
An outlier is a piece of data that lies an abnormal distance away from other observations, like someone spending 100 US dollars in your app when the average user spends between 8 and 12 US dollars. Outliers in A/B testing data increase the metric dispersion and the testing duration, as well as skew the mean estimate. This may ultimately lead to an incorrect hypothesis being accepted as true.
Outliers are a common problem in any statistical experiment.
So, what can we do about it?
Let’s look at the following example:
| User | Metric (purchase amount) | Group |
|---|---|---|
| user1 | $100 | control |
| user2 | $10 | control |
| user3 | $8 | control |
| user4 | $12 | test |
| user5 | $11 | test |
| user6 | $13 | test |
The first user’s purchase amount differs significantly from other values. This is a classic example of an outlier.
Now let’s look at average and standard deviation:
μcontrol = 39.33 and σcontrol = 42.905
μtest = 12 and σtest = 0.816
One of the ways to deal with outliers is to weed out users with anomalously high or low metrics. There are two methods of setting the threshold value:
If there is enough data, it’s preferable to take the first option; otherwise, you may lose valuable information about the metric. When we filter users out of the control group, we often also lose test-group data points that aren’t actually outliers.
Filtered data can look like this:
| User | Metric (purchase amount) | Group |
|---|---|---|
| user2 | $10 | control |
| user3 | $8 | control |
| user4 | $12 | test |
| user5 | $11 | test |
| user6 | $13 | test |
μcontrol = 9 and σcontrol = 1
μtest = 12 and σtest = 0.816
As you can see, the dispersion has decreased significantly.
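Here is a minimal Python sketch of this kind of filtering, assuming the data sits in a pandas DataFrame shaped like the table above and using a simple percentile threshold:

```python
import pandas as pd

df = pd.DataFrame({
    "user":   ["user1", "user2", "user3", "user4", "user5", "user6"],
    "metric": [100, 10, 8, 12, 11, 13],
    "group":  ["control", "control", "control", "test", "test", "test"],
})

# Use a high percentile of the pooled metric as the outlier threshold
threshold = df["metric"].quantile(0.99)
filtered = df[df["metric"] <= threshold]

# Dispersion drops sharply once the $100 purchase is removed
# (pandas reports the sample standard deviation, so the exact figures
#  differ slightly from the population values quoted above)
print(filtered.groupby("group")["metric"].agg(["mean", "std"]))
```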
Linearization works with all ratio metrics, such as conversion rate (CR). The idea is to move into a new feature space: to a proportional but more sensitive metric.
Let’s take user-level conversion values:
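$$CR(u) = \frac{purchases(u)}{views(u)}$$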
And then the overall conversion value:
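$$CR = \frac{\sum_u purchases(u)}{\sum_u views(u)}$$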
Now we calculate the CR of the control group and end up with the following transformation:
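$$L(u) = purchases(u) - CR_{control} \cdot views(u)$$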
Taken individually, per-user metric values usually lie between 0 and 1. Linearization allows you to extend this range and get a larger effect size.
Note that this approach increases the dispersion, but the sensitivity improves nonetheless.
Let’s say we have a CR conversion metric (purchase-to-view rate) and the following data:
| User | Group | Views | Purchases | CR(u) | L(u) |
|---|---|---|---|---|---|
| user1 | control | 1000 | 100 | 0.1 | 49 |
| user2 | control | 4000 | 200 | 0.05 | -4 |
| user3 | control | 2000 | 60 | 0.03 | -42 |
| user4 | test | 1000 | 110 | 0.11 | 59 |
| user5 | test | 2000 | 120 | 0.06 | 18 |
| user6 | test | 4000 | 280 | 0.07 | 76 |
The overall conversion is calculated as the ratio of the sum of all users’ purchases to the sum of all users’ views.
In our example, it will look like this:
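$$CR_{control} = \frac{100 + 200 + 60}{1000 + 4000 + 2000} = \frac{360}{7000} \approx 0.051$$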
We can switch to an equivalent metric by defining a new linearized function for each user:
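$$L(u) = purchases(u) - 0.051 \cdot views(u)$$

For example, L(user1) = 100 − 0.051 · 1000 = 49 and L(user2) = 200 − 0.051 · 4000 = −4, which is exactly the L(u) column in the table above.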
Instead of CR(u), we will now review the L(u) metric. Before linearization:
After linearization:
As the dispersion increased concurrently with the effect, we need to look at their ratio:
Ultimately, we end up with a significant gain of 87.5 / 0.017 ≈ 5,147 times.
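A minimal Python sketch of this transformation, using the example data above:

```python
import pandas as pd

df = pd.DataFrame({
    "user":      ["user1", "user2", "user3", "user4", "user5", "user6"],
    "group":     ["control", "control", "control", "test", "test", "test"],
    "views":     [1000, 4000, 2000, 1000, 2000, 4000],
    "purchases": [100, 200, 60, 110, 120, 280],
})

# Overall conversion of the control group: sum of purchases / sum of views
control = df[df["group"] == "control"]
cr_control = control["purchases"].sum() / control["views"].sum()  # ~0.051

# Linearized per-user metric: actual purchases minus purchases expected at the control CR
df["L"] = df["purchases"] - cr_control * df["views"]

print(df.groupby("group")["L"].agg(["mean", "std"]))
```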
This method is based on breaking down all observations into independent strata (groups) and using a stratified sample mean instead of the regular sample mean:
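$$\bar{Y}_{strat} = \sum_{k} p_k \bar{Y}_k, \qquad p_k = \frac{n_k}{n}$$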
where k indexes the strata,
pk — the share of observations falling into stratum k,
Yk — the mean value of stratum k,
n — the total number of observations,
nk — the number of observations in stratum k.
The stratified mean is equal to the plain sample mean. The standard deviation of the stratified mean will be:
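$$\sigma\left(\bar{Y}_{strat}\right) = \sqrt{\frac{\sum_{k} p_k \sigma_k^2}{n}}$$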
where σk —is the standard deviation in stratum k.
Since we are interested in the dispersion of the metric itself rather than of its mean, we will only need the numerator (the stratified variance) going forward.
Homogeneous strata within a highly dispersed overall population can lead to a major reduction in testing time.
Let’s look at the following example:
| User | Stratum | Group | Metric (purchase amount) |
|---|---|---|---|
| user1 | high_payment | control | $10 |
| user2 | low_payment | control | $2 |
| user3 | high_payment | control | $9 |
| user4 | low_payment | control | $2 |
| user5 | high_payment | test | $10 |
| user6 | low_payment | test | $3 |
| user7 | high_payment | test | $12 |
| user8 | low_payment | test | $2 |
The formula is applied to each group individually. You should plan the stratification in advance so that you get balanced strata right away, in both the control and test groups. However, if you’ve already run the test on unbalanced samples, you can try post-stratification.
Before stratification:
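Using the population standard deviations of the raw purchase amounts: σcontrol ≈ 3.77 and σtest ≈ 4.32.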
After stratification:
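σ²strat, control = 0.5 · 0.25 + 0.5 · 0 = 0.125 and σ²strat, test = 0.5 · 1 + 0.5 · 0.25 = 0.625, i.e. σstrat, control ≈ 0.35 and σstrat, test ≈ 0.79.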
As a result, the dispersion decreased. This leads to a lower MDE (minimum detectable effect) and, in turn, to a higher metric sensitivity.
On paper, we accelerated the process by 5.732 / 0.864 = 6.6 times, but in a real-world environment the differences between strata may be far less drastic.
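A minimal Python sketch of the stratified variance calculation on the example data (the helper function is ours, purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user":    [f"user{i}" for i in range(1, 9)],
    "stratum": ["high_payment", "low_payment"] * 4,
    "group":   ["control"] * 4 + ["test"] * 4,
    "metric":  [10, 2, 9, 2, 10, 3, 12, 2],
})

def stratified_variance(g: pd.DataFrame) -> float:
    """Sum over strata of (stratum share) * (within-stratum population variance)."""
    shares = g["stratum"].value_counts(normalize=True)
    within_var = g.groupby("stratum")["metric"].var(ddof=0)
    return float((shares * within_var).sum())

for name, group_df in df.groupby("group"):
    # control -> 0.125, test -> 0.625 on the example data
    print(name, stratified_variance(group_df))
```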
The CUPED method (Controlled-experiment Using Pre-Experiment Data) relies on certain preliminary data about the metric.
Imagine that before the experiment we already knew the metric values X for certain users, and after the experiment those same users produced the values Y. We can then consider a new metric:
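$$\hat{Y}_{cuped} = Y - \theta \cdot X$$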
where θ is a coefficient,
Y — the current (in-experiment) metric value for the user,
X — the pre-experiment metric value for the user.
To achieve the minimum dispersion of this metric, we should use the following coefficient (the same for the control and test groups):
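$$\theta = \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$$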
This will reduce the metric dispersion if there’s a correlation between X and Y.
An important point: this method requires preliminary data about users. Obviously, such data isn’t always available, and without it the metric is no different from the regular sample mean.
Sometimes X is also called a covariate in this approach. It’s important that the covariate has the same mean in the test and control groups; otherwise, the experiment will yield distorted results.
For example:
θ = 1.3125 (calculated using the formula above based on the aggregate data)
| User | Group | New metric (Y) | Old metric (X) | CUPED |
|---|---|---|---|---|
| user1 | control | $3 | $2 | 0.375 |
| user2 | control | $2 | $1 | 0.6875 |
| user3 | control | $1 | $0 | 1 |
| user4 | test | $1.5 | $0 | 1.5 |
| user5 | test | $3 | $2 | 0.375 |
| user6 | test | $2.5 | $1 | 1.1875 |
After CUPED:
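Using the population standard deviations of the example data: σcuped, control ≈ 0.26 and σcuped, test ≈ 0.47, compared with σcontrol ≈ 0.82 and σtest ≈ 0.62 for the raw metric Y.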
The dispersion decreased and we gained a testing speed increase of 1.027 / 0.538 = 1.9 times.
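A minimal Python sketch of the CUPED adjustment (the helper function is ours; the exact value of θ depends on how the covariance and variance are estimated):

```python
import pandas as pd

def cuped_adjust(df: pd.DataFrame, y_col: str = "Y", x_col: str = "X") -> pd.Series:
    """Return the CUPED-adjusted metric Y - theta * X for each user.

    theta = cov(X, Y) / var(X), computed on the pooled data so that the same
    coefficient is applied to the control and test groups.
    """
    theta = df[y_col].cov(df[x_col]) / df[x_col].var()
    return df[y_col] - theta * df[x_col]

# Usage (assuming df has columns "Y", "X", and "group"):
# df["cuped"] = cuped_adjust(df)
# print(df.groupby("group")["cuped"].agg(["mean", "var"]))
```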
If you want to speed up A/B testing, stick to three principles: calculate the required test duration correctly, clean your data of outliers, and reduce metric dispersion (through linearization, stratification, or CUPED where pre-experiment data is available).
It is important to understand that every approach listed here works for metrics estimated as sample means. If you plan to run split tests on other kinds of metrics, you will need more specialized criteria and methods.
A/B tests always require constant monitoring and adjustments. Are you looking for a way to gather relevant data for your business as quickly as possible and without wasting time on manual experiment tweaks?
Try out MyTracker Personalize: its built-in automation system can be connected quickly to cover all your A/B testing needs. Its efficiency was confirmed by a three-month experiment using MyTracker Personalize models in Hustle Castle, a mobile castle simulator. The personalized offers helped increase ARPU in the test group by 23%.