7 Golden Rules of A/B Testing to Save You Time and Money

Author: Dmitry Krasnov, Product Manager, Predictive Analytics

December 07, 2021

A/B testing, also known as split testing, is a way to test hypotheses to see if potential changes may help improve your product.

It can be used to get the best insights into how revenue, retention, or other key metrics would be affected by any design, text, or pricing changes you are considering.

During an A/B test, users are divided into two random groups. One group is shown the previous, unaltered version of the product (A) and the other the new version (B). Once a sufficient amount of data has been collected, the two versions are compared to identify the best performer.

It sounds easy, but only one in eight A/B tests produces a truly significant result.

The result may be affected by a variety of factors, ranging from insufficient data and bad audience grouping to unique features of different browsers and device types.

If you want to get accurate data, it is crucial to follow certain rules at all stages of testing, from setting the goal to analyzing the results.

Rule 1. Form a hypothesis

A hypothesis outlines a possible solution to a problem.

For example, your online store has a conversion rate of 4%. After attending a color perception webinar, your marketing professional suggests changing the color of the Buy button from an aggressive red to a more peaceful green so the user doesn't feel as much pressure. They say it may boost conversion by 2–2.5 times.

In this case, a hypothesis can be formulated as follows: “Making the Buy button green instead of red will increase conversion to 10% from 4%.”

An A/B test will either confirm the hypothesis (the page with the green button will reach a conversion rate of 10%) or refute it (conversion will drop, remain flat, or change by a mere 0.5–1 percentage points, possibly by chance).

Rule 2. Quantify the audience

The size of the testing audience can be estimated using the following formula:

$$ n = \frac{Z^2 \cdot p \cdot q}{\Delta^2} $$

where n is the sample size we seek to determine;

Z is the z-score taken from a standard normal distribution table for the chosen confidence level. In most cases, a confidence level of 0.95 or 0.99 is used, which corresponds to Z = 1.96 or 2.58, respectively;

p is the proportion of users that have performed the required action (e.g. made a purchase from the existing landing page). If there is no historical data on this, p = 0.5 (50%);

q = 1 – p (the proportion of respondents without the required attribute);

∆ is the desired margin of error (this depends on the purpose of the test). For the purposes of business decision-making, the margin of error normally should not be higher than 4%, which requires a sample of 500–600 respondents. For key strategic decisions, the margin of error should be as low as possible.

Figure: The margin of error vs. sample size
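As a rough illustration, the calculation can be sketched in Python. This assumes the formula in this rule is the standard sample-size formula for a proportion, n = Z² · p · q / ∆²; the function name sample_size is made up for the example.

```python
import math


def sample_size(z: float, p: float, margin: float) -> int:
    """Estimate n = Z^2 * p * q / margin^2, rounded up to a whole user.

    Assumption: this is the standard sample-size formula for a proportion.
    """
    q = 1.0 - p  # share of users who did not perform the action
    return math.ceil(z ** 2 * p * q / margin ** 2)


# 95% confidence (Z = 1.96), no historical data (p = 0.5), 4% margin of error
print(sample_size(1.96, 0.5, 0.04))

# Same inputs at 99% confidence (Z = 2.58)
print(sample_size(2.58, 0.5, 0.04))
```

With these inputs the first call gives roughly 600 users, in line with the 500–600 range mentioned above. Raising the confidence level or tightening the margin of error increases the required sample.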

There are a couple more rules to follow when it comes to the audience.

First, you need to choose whether to involve your entire audience or just part of it. If it’s the former, you will reach the necessary number of people sooner. However, if the hypothesis proves wrong, the impact on the company's reputation or revenue might be worse.

Second, you need to decide which users to target: new or regular. In most cases, newcomers are preferable, because regulars may have become used to the UI and thus might not spot the changes, clicking the new buttons like they did the old ones.

Rule 3. Group the devices by type

Different devices may display changes differently. For example, tiny details are likely to get missed on big screens.

You have to make sure the devices are evenly distributed between the audience segments throughout the test. If you fail to distribute them correctly, you will need to run the test again due to distorted statistics.

If you have many devices of various types, e.g. 40% are smartphones with a 750 x 1334 resolution, 40% are smartphones with a 1440 x 2960 resolution, and 20% are tablets with a 2048 x 2732 resolution, you need to divide them into groups and run the test on each group individually.

For A/B tests, devices are usually grouped into two categories: web and mobile.
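One possible way to check that devices are evenly distributed is to compare each device type's share in the two groups. The sketch below is only an illustration: the device labels, the sample figures, and the helper names device_shares and max_share_gap are hypothetical, and both groups are assumed to be non-empty.

```python
from collections import Counter


def device_shares(devices):
    """Return the share of each device type in one test group."""
    counts = Counter(devices)
    total = sum(counts.values())
    return {device: count / total for device, count in counts.items()}


def max_share_gap(group_a, group_b):
    """Largest difference in device share between the two groups."""
    shares_a, shares_b = device_shares(group_a), device_shares(group_b)
    all_devices = set(shares_a) | set(shares_b)
    return max(abs(shares_a.get(d, 0.0) - shares_b.get(d, 0.0)) for d in all_devices)


# Hypothetical device mix per group (one label per user)
group_a = ["phone_small"] * 40 + ["phone_large"] * 40 + ["tablet"] * 20
group_b = ["phone_small"] * 55 + ["phone_large"] * 30 + ["tablet"] * 15

print(f"Largest gap in device share: {max_share_gap(group_a, group_b):.0%}")
```

What counts as an acceptable gap is a judgment call that depends on the test. A large gap suggests the grouping should be fixed before the results are trusted.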

Rule 4. Choose a metric

The metric used for A/B testing should be relevant to your hypothesis.

Let us take the one we mentioned above: “Making the Buy button green instead of red will increase conversion to 10% from 4%.” In this case, you'd want to use Conversion Rate, which is the ratio of users who pressed the button to all users who visited the page during the selected period.

Session length or changes in revenue are irrelevant here and should be used elsewhere. If you measure them in this test, you might get data that falsely confirms or refutes the initial hypothesis.

For successful split-run testing, remember one simple rule:

One goal, one element, one metric.

Figure: Testing multiple elements at once won’t tell you which one has made the difference.

You also need to remember that testing groups may be of different sizes, which means you should only use metrics that aren’t linked to the number of users: ARPU instead of Revenue, or Registration Rate instead of the absolute number of registrations.
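A small sketch shows why relative metrics matter when the groups differ in size. All figures here are invented for illustration, and the helper names conversion_rate and arpu are not from any particular tool.

```python
def conversion_rate(converted_users: int, visitors: int) -> float:
    """Share of visitors who performed the target action."""
    return converted_users / visitors if visitors else 0.0


def arpu(revenue: float, users: int) -> float:
    """Average revenue per user."""
    return revenue / users if users else 0.0


# Hypothetical figures: group B is smaller than group A
group_a = {"visitors": 1200, "buyers": 48, "revenue": 2400.0}
group_b = {"visitors": 800, "buyers": 40, "revenue": 2000.0}

for name, group in (("A", group_a), ("B", group_b)):
    rate = conversion_rate(group["buyers"], group["visitors"])
    per_user = arpu(group["revenue"], group["visitors"])
    print(f"{name}: conversion {rate:.1%}, ARPU {per_user:.2f}")
```

In this made-up data, group A has more buyers and more revenue in absolute terms, yet group B performs better per user. Absolute numbers would point to the wrong winner.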

Rule 5. Decide on the timing of the test

The duration of an A/B test largely depends on the objective and size of the audience needed to deliver statistical significance.

Many A/B test tools require a minimum of two weeks, because that is usually enough to collect data when testing minor changes. A two-week window also covers every day of the week twice, which matters because user behavior may vary greatly between, for example, Monday, Friday, and Sunday.

However, if a large sample size, major changes, or a high statistical significance indicator are involved, tests may run for one to three months.

There are two simple rules for deciding on the timing of a split test:

Do not stop the test until you have reached the minimum sample size that makes the results statistically significant;
Do not stop the test until at least one full business cycle has been completed. For example, if the average time from the first visit to the first purchase in an online store is three weeks, allocate three weeks to testing.
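The two rules above can be combined into a rough duration estimate. This is a sketch under stated assumptions: traffic is taken to be steady from day to day, and rounding the result up to whole business cycles is a choice made for this example, not something the rules require. The function name and the input figures are hypothetical.

```python
import math


def estimate_duration_days(required_users: int, daily_users: int, cycle_days: int = 7) -> int:
    """Days needed to reach the sample size, rounded up to whole business cycles.

    Assumptions: steady daily traffic; required_users covers all test groups.
    """
    days_for_sample = math.ceil(required_users / daily_users)
    cycles = max(1, math.ceil(days_for_sample / cycle_days))
    return cycles * cycle_days


# Hypothetical: 1,202 users needed in total, 150 new users per day
print(estimate_duration_days(1202, 150, cycle_days=7))   # weekly cycle
print(estimate_duration_days(1202, 150, cycle_days=21))  # three-week purchase cycle
```

With a weekly cycle the sample is reached in 9 days, so the estimate rounds up to 14. With a three-week purchase cycle the cycle length dominates and the estimate is 21 days.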

Beyond that, it’s important not to jump to conclusions prematurely. For example, in a situation where variation B has been in the lead for a few days and fully meets expectations, it's tempting to implement it sooner rather than later.

You shouldn't stop the test before the deadline. The resulting values may change at any time, so premature evaluation of findings renders the test pointless.

Rule 6. Take statistical significance into account

Split-testing results can only be deemed reliable at a certain level of statistical significance.

Statistical significance is the percentage confidence that the results aren’t due purely to chance. Frequently used significance levels are 90%, 95%, and 99%.

For instance, at 95% significance in our button example, there is at most a 5% probability of seeing a difference this large if the color change actually had no effect.

Assuming the hypothesis is confirmed and the conversion rate in the example really did increase from 4% to 10% with a statistical significance of 95%, the result can be considered reliable. Let's plot a graph to illustrate it.

Figure: A/B test calculator result. Source: https://abtestguide.com/calc/

However, if the conversion rate had increased only from 4% to 5.5%, the same sample might not reach 95% statistical significance, in which case the difference could easily have occurred by chance.
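To make the contrast concrete, here is a minimal sketch of a significance check. It uses a two-proportion z-test, which is one common choice and not necessarily the method used by the calculator above. The group size of 600 users and the function name are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the hypothesis that both groups share one conversion rate."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or only conversions) in both groups
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical: 600 users per group
big = two_proportion_p_value(24, 600, 60, 600)    # 4% -> 10%
small = two_proportion_p_value(24, 600, 33, 600)  # 4% -> 5.5%

alpha = 0.05  # corresponds to 95% significance
print(f"4% -> 10%:  p = {big:.5f}, significant: {big < alpha}")
print(f"4% -> 5.5%: p = {small:.3f}, significant: {small < alpha}")
```

With these assumed group sizes, the jump to 10% clears the 95% threshold while the rise to 5.5% does not. A larger sample could make the smaller difference significant as well.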

Rule 7. Make sure the system works correctly

A/B testing is a complex and multi-level process incorporating many parameters that need to be monitored:

Browser versions – the display of changes and user behavior in different browsers may vary and distort the test results.
The amount of traffic – if there is insufficient data, you are likely to run the risk of drawing an incorrect conclusion about the test result due to a high margin of error in the calculations. If your website or app has low traffic and few transactions, the test will have to run longer.

Also, to collect valid data, do not factor in ad traffic: these users may have different motivations and behaviors, so your results will be distorted.

Uneven distribution of traffic among groups – audiences should not overlap with each other during testing. To distribute users to groups, you can assign a segment ID and save it to the browser cache.
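As an alternative to storing the segment ID in the browser, the group can be derived from a stable user identifier, so the same user always lands in the same group. The sketch below assumes such an identifier exists; the function name, the experiment name, and the user ID format are all hypothetical.

```python
import hashlib


def assign_group(user_id: str, experiment: str, groups=("A", "B")) -> str:
    """Deterministically map a user to one group for a given experiment.

    Assumption: user_id is stable across sessions (e.g. an account ID).
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return groups[int(digest, 16) % len(groups)]


# The same user gets the same group on every call
print(assign_group("user-42", "green-buy-button"))

# Rough check that the split is close to even over many users
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}", "green-buy-button")] += 1
print(counts)
```

Including the experiment name in the hash means a user's group in one test does not determine their group in the next one.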

To ensure the accuracy of future test results, you should make sure the system works correctly before starting the test. There are two ways to do this:

A/A test

This is used to check that the system is working correctly.

During this test, two identical versions are compared against each other. The two groups should show approximately the same metric values.

If the metrics differ by a statistically significant margin, you should check the systems for grouping users and collecting results.
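A single A/A test can still show a significant difference by chance. The simulation below is a hedged sketch of this effect: it generates many A/A tests with an identical true conversion rate and counts how often a two-proportion z-test at the 95% level flags a difference. The test choice, the group size, the number of runs, and the seed are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(runs=400, users=500, rate=0.04, alpha=0.05, seed=1):
    """Share of simulated A/A tests that report a significant difference."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    flagged = 0
    for _ in range(runs):
        # Both groups see the same version, so the true rate is identical
        conv_a = sum(rng.random() < rate for _ in range(users))
        conv_b = sum(rng.random() < rate for _ in range(users))
        if p_value(conv_a, users, conv_b, users) < alpha:
            flagged += 1
    return flagged / runs


print(f"A/A tests flagged as significant: {aa_false_positive_rate():.1%}")
```

A share in the region of 5% is expected at this significance level even when everything works. A much higher share across repeated A/A runs is a stronger hint that grouping or data collection is broken.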

A/A/B test

This is used when ongoing monitoring is needed to determine whether a test is working.

This is a combination of both types: the A/A test is run first, and if the system is confirmed to be working, the regular A/B test is automatically started.

The main disadvantage of this option is that you need more users to run the test and more time to gather statistics.

Takeaways

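For successful A/B testing, you need to strictly follow at least the following seven rules:

Form a hypothesis
Quantify the audience
Group the devices by type
Choose a metric
Decide on the timing of the test
Take statistical significance into account
Make sure the system works correctly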

If you don't stick to these rules, there is no point in A/B testing: you will have to rerun the test each time, which is a waste of time and money that could have been spent on developing the product.

The worst-case scenario is when a business decision, which involves a considerable budget and team effort going forward, is based on incorrect testing. This may lead to significant financial and reputational losses.

Automate A/B testing

A/B tests require constant monitoring and adjustment. The good news is that split testing can be automated, so website and mobile app owners can grow their business without spending time on continuous process monitoring.

Are you looking for a way to gather relevant data for your business as quickly as possible, without wasting time on manual experiment tweaks? See our recent case study with Hustle Castle, a mobile castle simulator, and how they managed to increase ARPU by 23% within the tested group during a three-month experiment using MyTracker Personalize models.
