AB Test Calculator

Creators

Rijk de Wet

A self-described nerd, Rijk is passionate about making a positive difference in the lives of Omni’s users. He’s an avid programmer, musician, and board game player, and both his calculators and side-projects reflect his hobbies. He believes that any problem can be solved with the right set of equations and a few lines of code.

Reviewers

Hanna Pamuła, PhD

Hanna (Hania) Pamuła holds a Ph.D. in Bioacoustics / Mechanical Engineering, obtained at AGH University of Science and Technology. She has participated in research work in labs in France and the UK and presented papers at several international conferences. Hania has a penchant for photography and graphic design. When not in the office, she’s probably traveling, hiking, or out in the field, watching birds and recording their calls.

and Jack Bowater

Welcome to the AB test calculator. Made a new miracle drug and need to decide if it works? Want to see if your website's new feature gets you more clicks? Or just want to see who is the best driver out of men and women? An AB test can answer all these questions and more with the power of two-proportion statistical significance. Here, we'll show you what an AB test is and how to calculate statistical significance for two populations so that you can see if you've found a significant difference.

What is an AB test?

An AB test is a statistical significance test in which two populations are compared to see if they differ significantly on a single characteristic. It can answer questions such as:

Are women more likely to be vegetarian than men?
Do your customers like the green "Add to Cart" button more than the red one?
Does switching to a new brand of fertilizer lead to more watermelon seeds growing to maturity?

Let's consider an AB test in more statistical terms. An AB test (also known as a split test or a two-proportion statistical significance test) is a form of Z-test. Our Z-test's null hypothesis (the default assumption) is that the two proportions are the same. We are looking for enough evidence to reject the null hypothesis, that is, evidence that the two proportions really do differ.

What is statistical significance?

An outcome of an experiment is statistically significant when it would be highly unlikely to occur by chance alone if the default assumption (the null hypothesis) were true. Suppose you flipped a coin 100 times and got 60 heads where you expected 50 heads: is your coin unfairly biased towards heads, or did you just get lucky with your flips? Statistical significance helps you decide whether a result you've obtained through your experiment is really due to an underlying factor (your coin really is biased) or due to chance or error in your sampling (you just got lucky).
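To put a number on the coin example, here is a minimal Python sketch that adds up the exact binomial probabilities of getting 60 or more heads in 100 flips of a fair coin. The function name `prob_at_least` is our own choice for illustration and is not part of the calculator.

```python
from math import comb


def prob_at_least(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Probability of `heads` or more heads in `flips` independent flips."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Chance of at least 60 heads in 100 flips if the coin is fair.
one_sided = prob_at_least(60, 100)
print(f"P(at least 60 heads in 100 fair flips) = {one_sided:.4f}")
```

The result is a little under 3%, so a fair coin would rarely produce this many heads. Whether that counts as significant depends on the confidence level you choose and on whether you test in one direction or both.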

How to determine two-proportion statistical significance

Remember, we want to disprove the null hypothesis stating that our populations do not differ on our specified characteristic. We first have to choose a suitable confidence level, which is the probability of not rejecting the null hypothesis when it is true. The most commonly used confidence level is 95%, but you can use any percentage. Our calculator supports confidence levels of 90%, 95%, 98%, and 99%. We have to subtract our confidence level from 100% to get the alpha level, which is the probability of rejecting the null hypothesis when the null hypothesis is actually true (a false positive). We also have to go one step further and divide our alpha level by two since we're performing a two-tailed Z-test, which splits the rejection region evenly between the two tails of the distribution. Take a look at the image below:

[Figure: a bell curve showing the acceptance and rejection regions on a normal distribution.]

A split test works by determining the Z-score of the two-population problem and checking whether it falls into the acceptance region (the blue segment) or the rejection region (the red segments) of the normal distribution. If it falls into the acceptance region, we cannot reject the null hypothesis. If it falls into the rejection region, we can reject the null hypothesis. You can see in the figure above that the rejection region is much smaller than the acceptance region. This difference in size makes the null hypothesis hard to reject, but when you do reject it, you know that a difference this large would rarely arise by chance alone: if the null hypothesis were true, it would happen with a probability of at most the alpha level.

To perform a two-proportion Z-test, start with your two populations. Determine their sample sizes ($n_1$ and $n_2$) and the number of positive results in each group ($t_1$ and $t_2$). You can now calculate each sample's proportion, $p_1 = t_1/n_1$ and $p_2 = t_2/n_2$, as well as the overall (pooled) sample proportion,

$$p = \frac{t_1+t_2}{n_1+n_2}$$

Then you can calculate your Z-score as follows:

$$Z = \frac{p_1-p_2}{\sqrt{p\cdot(1-p)\cdot(1/n_1+1/n_2)}}$$
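As a worked example, suppose (hypothetical numbers, chosen only for illustration) that group 1 has $n_1 = 1000$ samples with $t_1 = 100$ positive results, and group 2 has $n_2 = 1000$ samples with $t_2 = 150$ positive results. Then:

$$p_1 = \frac{100}{1000} = 0.10, \quad p_2 = \frac{150}{1000} = 0.15, \quad p = \frac{100+150}{1000+1000} = 0.125$$

$$Z = \frac{0.10-0.15}{\sqrt{0.125\cdot(1-0.125)\cdot(1/1000+1/1000)}} = \frac{-0.05}{\sqrt{0.00021875}} \approx -3.38$$

The negative sign only tells you that the second group has the larger proportion. It is the magnitude of $Z$ that gets compared with the threshold.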

The choice of confidence level determines the alpha level's Z-score $Z_{\alpha/2}$ , which is what you're going to compare to your Z-score $Z$ . You can find the alpha level's Z-score for your chosen confidence level from this table:

Confidence level	Alpha level $\alpha$	Alpha level's Z-score $Z_{\alpha/2}$
90%	10%	1.645
95%	5%	1.96
98%	2%	2.326
99%	1%	2.576
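The values in this table can be reproduced with the inverse cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from Python's standard library. The helper name `critical_z` is our own and purely illustrative.

```python
from statistics import NormalDist


def critical_z(confidence: float) -> float:
    """Two-tailed critical Z-score for a confidence level such as 0.95."""
    alpha = 1.0 - confidence
    # Half of alpha sits in each tail, so look up the 1 - alpha/2 quantile.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  alpha = {1 - level:.0%}  Z = {critical_z(level):.3f}")
```

The same function also works for confidence levels that are not in the table, for example `critical_z(0.999)`.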

Now that you have both your Z-score and the alpha level's Z-score, you can take the last step and compare them. If your Z-score's magnitude is larger than your alpha level's Z-score (that is, if $\vert Z\vert > \vert Z_{\alpha/2} \vert$), then you can reject the null hypothesis: your results are statistically significant. If $\vert Z\vert < \vert Z_{\alpha/2} \vert$, you must admit defeat because you cannot reject the null hypothesis. This does not prove that the populations are identical, only that the difference you observed is too small to rule out chance.

To summarise the process:

Calculate your Z-score $Z$ with the populations' data.
Pick a confidence level and get $Z_{\alpha/2}$ from the handy table.
Compare $Z$ with $Z_{\alpha/2}$:
- If $\vert Z\vert < \vert Z_{\alpha/2} \vert$, you cannot reject the null hypothesis.
- If $\vert Z\vert > \vert Z_{\alpha/2} \vert$, reject the null hypothesis.
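The steps above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the calculator's actual implementation: the function name `ab_test`, its parameters, and the sample counts in the usage lines are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ab_test(n1: int, t1: int, n2: int, t2: int, confidence: float = 0.95):
    """Two-proportion Z-test; returns (Z, critical Z, significant?)."""
    p1, p2 = t1 / n1, t2 / n2
    p = (t1 + t2) / (n1 + n2)  # overall (pooled) sample proportion
    if p in (0.0, 1.0):
        raise ValueError("Z is undefined when all results are identical")
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    # Two-tailed test: half of alpha goes in each tail.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z, z_crit, abs(z) > z_crit


# Hypothetical data: 100 of 1000 positives in group 1, 150 of 1000 in group 2.
z, z_crit, significant = ab_test(1000, 100, 1000, 150)
print(f"Z = {z:.3f}, critical Z = {z_crit:.3f}, significant: {significant}")
```

Returning the Z-score and the threshold along with the verdict lets you see how close the result was to the boundary, rather than only a yes-or-no answer.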

What are an AB test's shortcomings?

As a two-proportion Z-test, an AB test has some conditions that must be met for the test to be effective.

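Your sample sizes should both be larger than thirty. Using fifty or more samples per group is ideal. The Z-test relies on the central limit theorem, which says that sample proportions are approximately normally distributed when the samples are large enough. With too few samples, that approximation may not hold, and the test's results become unreliable.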
Your sample sizes should be approximately the same. This might not always be possible, depending on the method of data collection.
Your data should be randomly selected from their respective populations. If the samples are not representative of the target populations, your results will be biased.

How to calculate statistical significance with the AB test calculator

The AB test calculator is easy to use.

Enter the two groups' information into their fields. You have to know the groups' sizes and the number or percentage of positive samples for each group.
Decide on a confidence level. The higher the confidence level, the stronger the evidence you need before a difference counts as significant, and the less likely you are to mistake a chance result for a real one.
Check out your results. Using these inputs, the AB test calculator will perform the split test and tell you whether the difference in your populations is statistically significant.

FAQs

What is a Z-score?

A Z-score describes a value's position relative to a group of values from the same distribution. The Z-score is measured in the number of standard deviations from the group's mean. The larger a Z-score's magnitude, the further the value is from the distribution's mean. A positive Z-score lies above the mean, and a negative one lies below it.

Why do Z-scores matter?

Suppose I tell you John is 6 feet tall. That might tell you John's height, but how tall is John compared to the average male of his age? To figure that out, you'd need to compare John's height to the heights of a sizeable group of other men. The Z-score is a standardized method of performing this comparison and describing the result.
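Here is a minimal Python sketch of that comparison. The reference group's mean and standard deviation are hypothetical values picked only for illustration, not real height statistics, and the helper name `z_score` is our own.

```python
from statistics import NormalDist

# Hypothetical reference group (NOT real data), in inches.
MEAN_HEIGHT_IN = 70.0
STD_DEV_IN = 3.0


def z_score(value: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / std_dev


john_height_in = 72.0  # 6 feet
z = z_score(john_height_in, MEAN_HEIGHT_IN, STD_DEV_IN)

# Share of the (assumed normal) group that is shorter than John.
share_shorter = NormalDist().cdf(z)
print(f"Z = {z:.2f}, taller than about {share_shorter:.0%} of the group")
```

Under these assumed numbers, John is about two-thirds of a standard deviation above the mean. With a different reference group, the same height would give a different Z-score.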

What is the normal distribution?

The normal distribution (also known as the bell curve) is a probability density function that happens to represent a lot of real-world physical quantities. Because of this widespread applicability, it's used in Z-tests to represent the distribution of possible values and to compare them to a new value when testing for statistical significance.
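For reference, the probability density function of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

A Z-test works with the standard normal distribution, which is the special case $\mu = 0$ and $\sigma = 1$.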