A/B testing, also known as split testing, is a way of testing hypotheses to see whether a potential change would improve your product.
It helps you understand how revenue, retention, or other key metrics would be affected by the design, text, or pricing changes you are considering.
During an A/B test, users are divided into two random groups. One group is shown the previous, unaltered version of the product (A) and the other the new version (B). Once a sufficient amount of data has been collected, the two versions are compared to identify the best performer.
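As a minimal sketch of the grouping step (the user ID, experiment name, and 50/50 split below are illustrative assumptions), assignment is often done by hashing a user ID so that each user always sees the same version:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "buy-button-color") -> str:
    """Deterministically assign a user to group A or B.

    Hashing the user ID together with an experiment name keeps the assignment
    stable across sessions and independent of other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map the hash onto [0, 1]
    return "A" if bucket < 0.5 else "B"

print(assign_variant("user-42"))  # always returns the same group for this user
```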
It sounds easy, but only one in eight A/B tests produces a truly significant result.
The result may be affected by a variety of factors, ranging from insufficient data and bad audience grouping to unique features of different browsers and device types.
If you want to get accurate data, it is crucial to follow certain rules at all stages of testing, from setting the goal to analyzing the results.
A hypothesis outlines a possible solution to a problem.
For example, your online store has a conversion rate of 4%. After attending a color perception webinar, your marketing professional suggests changing the color of the Buy button from an aggressive red to a more peaceful green so the user doesn't feel as much pressure. They say it may boost conversion by 2–2.5 times.
In this case, a hypothesis can be formulated as follows: “Making the Buy button green instead of red will increase conversion to 10% from 4%.”
An A/B test will either confirm the hypothesis (the page with the green button reaches a 10% conversion rate) or refute it (conversion drops, stays flat, or changes by a mere 0.5–1%, possibly just by chance).
The size of the testing audience can be estimated using the following formula:

n = Z² × p × q / ∆²
where n is the sample size we seek to determine;
Z is a coefficient taken from the standard normal distribution (z) table, depending on the chosen confidence level. In most cases, a confidence level of 0.95 or 0.99 is used, which corresponds to Z = 1.96 or 2.58, respectively;
p is the proportion of users that have performed the required action (e.g. made a purchase from the existing landing page). If there is no historical data on this, p = 0.5 (50%);
q = 1 – p (the proportion of respondents without the required attribute);
∆ is the desired margin of error (this depends on the purpose of the test). For the purposes of business decision-making, the margin of error normally should not be higher than 4%, which requires a sample of 500–600 respondents. For key strategic decisions, the margin of error should be as low as possible.
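As a minimal sketch, the calculation can be done directly from the definitions above (95% confidence, no historical data, 4% margin of error):

```python
import math

def sample_size(z: float, p: float, margin_of_error: float) -> int:
    """n = Z^2 * p * q / delta^2, rounded up to a whole respondent."""
    q = 1.0 - p
    return math.ceil(z ** 2 * p * q / margin_of_error ** 2)

# Z = 1.96 (95% confidence), p = 0.5 (no historical data), delta = 0.04 (4%).
print(sample_size(1.96, 0.5, 0.04))  # 601, in line with the 500-600 respondents above
```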
There are a couple more rules to follow when it comes to the audience.
Different devices may display changes differently. For example, tiny details are likely to get missed on big screens.
You have to make sure device types are evenly distributed between the audience segments throughout the test; if they aren't, the statistics will be distorted and you will have to run the test again.
If you have many devices of various types, e.g. 40% are smartphones with a 750 x 1334 resolution, 40% are smartphones with a 1440 x 2960 resolution, and 20% are tablets with a 2048 x 2732 resolution, you need to divide them into groups and run the test on each group individually.
For A/B tests, devices are usually grouped into two categories: web and mobile.
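A minimal sketch of such per-device grouping (the user records and device_category labels below are made up for illustration):

```python
import random
from collections import defaultdict

# Hypothetical user records; the device category would normally come from your analytics SDK.
users = [
    {"user_id": "u1", "device_category": "smartphone_750x1334"},
    {"user_id": "u2", "device_category": "smartphone_1440x2960"},
    {"user_id": "u3", "device_category": "tablet_2048x2732"},
    {"user_id": "u4", "device_category": "smartphone_750x1334"},
]

# Split users into A and B separately inside each device category, so screen
# size is distributed identically between the two groups being compared.
experiments = defaultdict(lambda: {"A": [], "B": []})
for user in users:
    variant = random.choice(["A", "B"])
    experiments[user["device_category"]][variant].append(user["user_id"])

for category, groups in experiments.items():
    print(category, {variant: len(ids) for variant, ids in groups.items()})
```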
The metric used for A/B testing should be relevant to your hypothesis.
Let us take the one we mentioned above: “Making the Buy button green instead of red will increase conversion to 10% from 4%.” In this case, you'd want to use Conversion Rate, which is the ratio of users who pressed the button to all users who visited the page during the selected period.
Session length or changes in revenue are irrelevant here and belong in other tests; measuring them in this one might produce data that falsely confirms or refutes the initial hypothesis.
For successful split-run testing, remember one simple rule:
One goal, one element, one metric.
You also need to remember that the test groups may end up being different sizes, so you should only use metrics that don't depend on the number of users: ARPU instead of total Revenue, or Registration Rate instead of the absolute number of registrations.
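A small sketch of why this matters, using made-up per-user revenue for two groups of unequal size:

```python
# Hypothetical per-user revenue for two groups of unequal size.
group_a_revenue = [0, 0, 4.99, 0, 9.99, 0, 0, 4.99]  # 8 users
group_b_revenue = [0, 4.99, 0, 9.99, 0, 0]            # 6 users

# Absolute revenue is skewed by group size...
print(round(sum(group_a_revenue), 2), round(sum(group_b_revenue), 2))  # 19.97 vs 14.98

# ...while ARPU (revenue per user) stays comparable between the groups.
arpu_a = sum(group_a_revenue) / len(group_a_revenue)
arpu_b = sum(group_b_revenue) / len(group_b_revenue)
print(round(arpu_a, 2), round(arpu_b, 2))  # 2.5 vs 2.5
```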
The duration of an A/B test largely depends on the objective and size of the audience needed to deliver statistical significance.
Many A/B testing tools require a minimum period of two weeks, which is usually enough to collect data for minor changes. It also covers the way user behavior varies across days of the week, which can differ greatly between, say, Monday, Friday, and Sunday.
However, if a large sample size, major changes, or a high statistical significance indicator are involved, tests may run for one to three months.
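As a rough sketch (assuming the sample size from the earlier formula is needed in each group; the daily traffic figure is made up), the duration can be estimated like this:

```python
import math

def test_duration_days(required_per_group: int, daily_visitors: int,
                       groups: int = 2, min_days: int = 14) -> int:
    """Days needed to reach the required sample size, but never less than two weeks."""
    days_for_sample = math.ceil(required_per_group * groups / daily_visitors)
    return max(days_for_sample, min_days)

# ~600 users per group and 150 eligible visitors per day.
print(test_duration_days(601, 150))  # 14: the traffic suffices in 9 days, but the two-week minimum applies
```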
There are two simple rules for deciding on the timing of a split test:
Beyond that, it's important not to jump to conclusions prematurely. For example, when variation B has been in the lead for a few days and fully meets expectations, it's tempting to implement it sooner rather than later.
Resist that temptation and don't stop the test before the deadline: the values may still change at any time, so evaluating the findings prematurely renders the test pointless.
Split-testing results can only be deemed reliable at a certain level of statistical significance.
Statistical significance is the percentage confidence that the results aren’t due purely to chance. Frequently used significance levels are 90%, 95%, and 99%.
For instance, a 95% significance level in our button example means there is a 5% chance that the observed difference in clicks is due to chance rather than to the color change.
Assuming the hypothesis is confirmed and the conversion rate in the example really did increase from 4% to 10% with a statistical significance of 95%, the result can be considered reliable. Let's plot a graph to illustrate it.
However, if the conversion rate only increased from 4% to 5.5%, the change may fall within the margin of error and fail to reach the 95% significance threshold; such a result may well have occurred by chance.
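To check whether a given lift clears the threshold, a common approach is a two-proportion z-test; here is a minimal sketch using the standard library and made-up visitor counts (600 per group):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, visitors_a: int,
                          conv_b: int, visitors_b: int) -> tuple[float, float]:
    """Return the z statistic and two-sided p-value for a difference in conversion rate."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 4% vs 10% conversion: clearly significant (p well below 0.05).
print(two_proportion_z_test(24, 600, 60, 600))
# 4% vs 5.5% conversion: p is above 0.05, so the lift could easily be chance.
print(two_proportion_z_test(24, 600, 33, 600))
```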
A/B testing is a complex and multi-level process incorporating many parameters that need to be monitored:
Also, to collect valid data, exclude ad traffic: these users may have different motivations and behaviors, which would distort your results.
To ensure the accuracy of future test results, you should make sure the system works correctly before starting the test. There are two ways to do this:
The first is an A/A test, which checks that the system itself works correctly.
During this test, two identical versions are compared against each other, and both groups should show the same metric values.
If the metrics differ by a statistically significant margin, you should check the systems that group users and collect the results.
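One common way to check the grouping system is a sample ratio mismatch check: compare the observed group sizes against the expected 50/50 split with a chi-square test. A minimal sketch, assuming SciPy is installed and using made-up counts:

```python
from scipy.stats import chisquare

# Hypothetical group sizes after an A/A run; a 50/50 split is expected.
observed = [5040, 4960]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.01:
    print(f"Possible sample ratio mismatch (p = {p_value:.4f}): check the grouping system")
else:
    print(f"Group sizes are consistent with a 50/50 split (p = {p_value:.4f})")
```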
The second approach is a combination of both types and is used when ongoing monitoring is needed to confirm a test is working: the A/A test is run first, and once the system is confirmed to be working, the regular A/B test starts automatically.
The main disadvantage of this option is that you need more users to run the test and more time to gather statistics.
For successful A/B testing, you need to strictly follow at least the following seven rules:
If you don't stick to these rules, there is no point in A/B testing: you will have to rerun the test each time, which is a waste of time and money that could have been spent on developing the product.
The worst-case scenario is when a business decision, which involves a considerable budget and team effort going forward, is based on incorrect testing. This may lead to significant financial and reputational losses.
A/B tests require constant monitoring and adjustments. The good news is that split testing can be automated, so website and mobile app owners can grow their business without spending their time on continuous process monitoring.
Are you looking for a way to gather relevant data for your business as quickly as possible, without wasting time on manual experiment tweaks? See our recent case study with Hustle Castle, a mobile castle simulator, to learn how they increased ARPU by 23% within the tested group during a three-month experiment using MyTracker Personalize models.