A/B Test Sample Sizing
Calculating Sample Size in the Real World
Summary: Proper A/B test sample sizing requires some strategic thought. It should come from the decision the team is trying to make, not from a generic rule of thumb. This article walks through the basic steps for calculating sample size in a practical way, including how to define the current baseline, choose the smallest improvement worth detecting, select the right test settings, and explain the final number to stakeholders.
A few weeks ago, I wrote about how to make A/B tests more useful from a UX research perspective. That article was mostly about an overarching A/B test strategy. I focused on things like writing a clear hypothesis, designing a meaningful variation, choosing better metrics, and not overstating what the result proves.
⚠️ Disclaimer: In my older article, Sample Sizing Cheat Sheet for Startups, I recommended much smaller A/B test samples for startup and MVP contexts. That article was about fast directional learning, while this article is about more defensible A/B testing in a live product.
This week’s article is about the next practical question teams usually ask:
What sample size do we need?
That question sounds simple, but it can get confusing quickly. Some people expect there to be one standard A/B test sample size. Others start with the number of users they think they can get and work backward from there. “We always A/B with 200.” Does that sound familiar? Still others use whatever default number their experimentation platform gives them without understanding what went into the calculation.
I do not think UX researchers need to become statisticians to have this conversation. But I do think we need to understand the basic inputs well enough to explain what the number means. This basic calculation needs contextual inputs that only you, as the UX researcher, and your stakeholders can provide because of your knowledge of your org. That includes the design improvements that would matter most to the business, the quality of the hypothesis, the strength of the variation, and the behavior being measured.
The goal of this article is to show the basic inputs that go into an A/B test sample-size calculation, explain what they mean in plain language, and walk through a real example.
The Basic Inputs You Need
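Before you calculate anything, you need to know what you are measuring and what kind of change would matter. To calculate sample size, you need a few inputs: the baseline rate, the Minimum Detectable Effect (MDE), the confidence level, the power, and the split between versions.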
These inputs matter because the calculation is only as useful as the numbers you put into it. If the inputs are unrealistic, the sample-size recommendation will be unrealistic too. This is where teams often get into trouble. They want the equation to poop out the perfect number, even though they have not made any decisions based on their own real-world context. Lazy, lazy, lazy. But in the real world, the sample size should come from the decision the team is trying to make.
Start With Current Rate
The two most important inputs are the baseline rate and the smallest improvement the team cares about detecting.
Baseline rate = What is happening today. For example, if 18 out of 100 users complete the action, the baseline is 18%. If 5 out of 100 users click the call to action, the baseline is 5%. You get the idea.
The baseline tells the calculation what normally happens before you change anything.
Sometimes you will have a clean baseline from usage analytics. That is ideal. Other times, you may have to make a reasonable estimate. That is not perfect, but it is better than not taking baseline into account at all. (Unfortunately, I see this misunderstood and dismissed most of the time in the real world.)
After that, you need to decide on the smallest improvement worth detecting. This is called the Minimum Detectable Effect, or MDE.
MDE = The smallest performance improvement the team would actually care about. For example, let’s say 18% of users complete the task today. The team probably would not want to redesign the workflow, run an A/B test, and change the product just to move that number from 18% to 19%. That improvement may be real, but it may not be big enough to justify the work.
So the team has to ask a practical question:
“How much better would the new version need to perform for this change to be worth making?”
Maybe 20% is not enough because that is only a small improvement over the current 18%. Maybe 25% feels close, but the team wants the test to be planned around a slightly larger, cleaner target.
In this example, the team decides that moving from 18% to about 26% would be a large enough improvement to justify the change.
3 main points here:
That 26% is strategic and comes from adding the improvement the team cares about to the current baseline:
18% current completion rate + 8 percentage point improvement = 26% target completion rate
In sample-size terms, that means the team is planning the test around an 8 percentage point improvement.
The MDE is not just a statistics decision. It is a product decision.
This is why I usually want stakeholders involved in the MDE conversation. The researcher can explain the tradeoffs, but the product team needs to be honest about what size improvement would actually change the decision.
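One practical wrinkle: some calculators ask for the improvement as a relative lift instead of percentage points. Here is a minimal sketch of the conversion, using the 18% baseline and 8 percentage point MDE from this example. The function name is mine, not something from a particular tool.

```python
def absolute_to_relative(baseline: float, mde_points: float) -> float:
    """Relative lift implied by an absolute (percentage point) MDE."""
    return mde_points / baseline

baseline = 0.18      # 18% current completion rate
mde_points = 0.08    # 8 percentage point improvement

target = baseline + mde_points
relative_lift = absolute_to_relative(baseline, mde_points)

print(f"Target rate: {target:.0%}")            # Target rate: 26%
print(f"Relative lift: {relative_lift:.1%}")   # Relative lift: 44.4%
```

So an 8 percentage point MDE on an 18% baseline is roughly a 44% relative lift. Mixing up the two is an easy way to end up with a very different sample size than you intended.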
And as a last resort, if all this MDE talk is still confusing, try asking an LLM to help you estimate a reasonable starting point. Do not ask it to “calculate the perfect MDE.” Ask it to help you think through what size improvement would be meaningful based on the product context, the type of change, the current baseline, and the decision the team is trying to make.
Here is a prompt template you could use:
I am planning an A/B test and need help estimating a practical Minimum Detectable Effect (MDE).
Here is the context:
Current baseline rate: {fill in your context here}
Describe what users are trying to do: {fill in your context here}
Describe what Version A is: {fill in your context here}
Describe what Version B changes: {fill in your context here}
Whether this is a small change, moderate change, or large workflow change: {fill in your context here}
How risky the change is to the business: {fill in your context here}
Whether the change is easy to reverse: {fill in your context here}
What size improvement would make the team feel the change is worth shipping: {fill in your context here}
Any other business, UX, or operational reasons this improvement would matter: {fill in your context here}
Based on this context, recommend a practical MDE in percentage points. Explain why that MDE seems reasonable, what a smaller MDE would imply for sample size, and what a larger MDE would imply for sample size. Keep the explanation simple enough for a non-statistician to understand.
I would still treat the LLM’s answer as a starting point and not the final decision. The team should review the recommendation and decide whether the improvement is actually meaningful in the real-world context of the product. This is one of the safer and more targeted ways Gen AI can be useful in our work, and the human in the loop is built into the approach. Try it for yourself and tell me how it goes!
Small Improvements Need Larger Samples
The smaller the improvement you want to detect, the more users you need.
A small performance improvement can get lost in normal variation (signal vs. noise). A larger improvement is easier to detect because the difference between Version A and Version B is bigger.
This is why teams should not start with, “Can we test this with 200 users?”
The better question is: What size difference are we trying to detect?
Here is the basic tradeoff: a smaller MDE requires a larger sample, and a larger MDE requires a smaller sample.
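To put rough numbers on that tradeoff, here is a small sketch that sizes the same test for three different MDEs. It assumes an 18% baseline, 90% confidence, 80% power, a 50/50 split, a two-sided test, and the pooled normal-approximation formula for two proportions. That formula is my assumption for illustration. Your calculator may use a slightly different variant, so treat the output as approximate.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_version(baseline, mde, confidence=0.90, power=0.80):
    """Users needed per version (two-sided test, 50/50 split).

    Pooled normal approximation for two proportions. This is an
    assumption of the sketch, not a claim about any specific calculator.
    """
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / mde ** 2)

# Same baseline and settings, three different MDEs (in percentage points).
sizes = {mde: sample_size_per_version(baseline=0.18, mde=mde)
         for mde in (0.02, 0.05, 0.08)}

for mde, n in sizes.items():
    print(f"MDE of {mde * 100:.0f} percentage points -> {n} users per version")
```

Under these assumptions, the 8 point row comes out to 331 users per version, and the 2 and 5 point rows come out considerably larger.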
If the team does not have enough traffic to detect a small improvement, they have a few options. They can run the test longer, accept a larger MDE, choose a different method, or admit that the A/B test is not practical.
That does not mean lowering rigor randomly. It means matching the test to the decision.
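If "run the test longer" is on the table, it helps to estimate how long that actually is. Here is a minimal sketch, assuming a 50/50 split and a steady stream of eligible users. The daily traffic figure is made up for illustration.

```python
from math import ceil

def days_to_reach_sample(users_per_version: int, daily_eligible_users: int) -> int:
    """Days needed to fill both versions under a 50/50 split."""
    total_needed = 2 * users_per_version
    return ceil(total_needed / daily_eligible_users)

# 331 per version is the figure from the worked example in this article.
# 40 eligible users per day is a hypothetical traffic number.
print(days_to_reach_sample(users_per_version=331, daily_eligible_users=40))  # 17
```

If the answer is measured in months instead of weeks, that is usually the signal to revisit the MDE or the method.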
For a tiny button copy change, expecting an 8 percentage point improvement may be unrealistic. For a larger workflow change, an 8 percentage point improvement may be a reasonable threshold.
The simple version is this: if the team wants to detect small behavioral differences in a design change, they need more data. If the team only cares about larger behavioral differences in a design change, the test can usually have a smaller sample size.
Test Settings
After the baseline and MDE, you need to choose a few test settings.
These are usually confidence level, power, and split.
Confidence level is about how much risk you are willing to accept that the result is just noise. A 95% confidence level is more cautious and usually requires more users. A 90% confidence level accepts a little more uncertainty and usually requires fewer users. I usually consider 90% confidence for lower-risk, reversible product decisions. I would use 95% for higher-risk decisions, especially if the decision is expensive, hard to reverse, or related to high-risk user interactions.
Power is the chance that the test will detect the improvement if the improvement is really there. I usually use 80% power because it is a common default.
Split is how users are divided between the two versions. Most of the time, I assume a 50/50 split. Half of users see Version A. Half of users see Version B.
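Under the hood, confidence and power each turn into a multiplier (a z-value) in the calculation. Here is a quick sketch of that conversion, assuming a two-sided test, which is the usual default. The helper names are mine.

```python
from statistics import NormalDist

def z_for_confidence(confidence: float) -> float:
    """Two-sided critical value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def z_for_power(power: float) -> float:
    """Standard normal quantile matching the desired power."""
    return NormalDist().inv_cdf(power)

print(round(z_for_confidence(0.90), 3))  # 1.645
print(round(z_for_confidence(0.95), 3))  # 1.96
print(round(z_for_power(0.80), 3))       # 0.842
```

The multiplier for 95% confidence is larger than the one for 90%, which is why the more cautious setting asks for more users.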
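Here’s how I think of it. My default planning assumptions are almost always 90% confidence for lower-risk, reversible decisions, 95% confidence for higher-risk decisions, 80% power, and a 50/50 split.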
Real-World Example
Scenario - A product team wants to A/B test a changed workflow. The current workflow starts from a dashboard card. The card includes a call to action that sends users into a multi-step flow. The team wants to know whether the revised experience gets more users to complete the intended action.
The team does not have a perfect baseline, but based on what we know, we estimate that about 18% of users currently complete the action.
Because this is a larger workflow change, not a tiny copy tweak, the team decides that an 8 percentage point improvement would be meaningful. In other words, the new version would need to move the completion rate from 18% to about 26%.
We decide to use 90% confidence because this is a lower-risk product decision and the team does not need 95% confidence to make the call.
We use 80% power because that is the standard default I usually use.
We use a 50/50 split because there is no strong reason to send more users to one version than the other.
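Here are the calculation inputs:
Baseline rate: 18%
Minimum Detectable Effect: 8 percentage points (target rate of 26%)
Confidence level: 90%
Power: 80%
Split: 50/50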
In the real world, I just use this calculator:
https://samplesizecalculator.net/ab-test-sample-size-calculator/
For most UX-related A/B tests, the outcome is usually a yes/no behavior. The user completed the workflow or did not. The user clicked the call to action or did not. The user submitted the form or did not.
Assuming yes/no behavior, and with the inputs shown above, the sample size is about 331 users per version. So that means I would plan for about 331 users in Version A and 331 users in Version B.
And remember: today, about 18% of users complete the task, and we want to know if the new version can raise that to about 26%. Those inputs are how we get to 331 users per version.
Base formula:
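One common form, assuming a two-sided test on a yes/no outcome with a 50/50 split, is the pooled normal-approximation formula for comparing two proportions. Treat it as a sketch. A given calculator may use a slightly different variant.

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})} \;+\; z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}{(p_2 - p_1)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

Here n is the number of users per version, p_1 is the baseline rate, p_2 is the target rate (so p_2 - p_1 is the MDE), z_{1-\alpha/2} is the multiplier for the confidence level, and z_{1-\beta} is the multiplier for power.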
For this example, the calculation looks like this:
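Here is a step-by-step sketch with the example inputs. It assumes the pooled normal-approximation formula for two proportions, a two-sided test, and a 50/50 split. I have not confirmed that the calculator linked above uses this exact formula, but it reproduces the same result for these inputs.

```python
from math import ceil, sqrt
from statistics import NormalDist

p1 = 0.18          # baseline completion rate
p2 = 0.26          # target rate (baseline + 8 percentage points)
confidence = 0.90
power = 0.80

z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.645
z_beta = NormalDist().inv_cdf(power)                      # about 0.842

p_bar = (p1 + p2) / 2                                     # pooled rate, 0.22
spread_if_no_difference = sqrt(2 * p_bar * (1 - p_bar))
spread_if_real_difference = sqrt(p1 * (1 - p1) + p2 * (1 - p2))

raw_n = ((z_alpha * spread_if_no_difference
          + z_beta * spread_if_real_difference) ** 2) / (p2 - p1) ** 2

users_per_version = ceil(raw_n)  # always round up
print(users_per_version)         # 331
```

The unrounded value lands between 330 and 331, so rounding up gives 331 users per version.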
You do not need to memorize that formula. The important thing to understand is that the calculation uses the current behavior, the amount of change you want to detect, and the amount of uncertainty you are willing to accept.
The sample-size calculation does not mean the test is automatically good. It just means that, given these assumptions, this is roughly how many users the team needs to compare the two versions.
I would also not treat that number as perfect. If the baseline estimate changes, the sample size can change. If the team wants to detect a smaller improvement, the sample size will go up. If the team wants 95% confidence instead of 90%, the sample size will also go up.
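A quick way to see this sensitivity is to rerun the calculation with one input changed at a time. This is a sketch under the same assumptions as the example (pooled normal approximation, two-sided test, 50/50 split). The 25% baseline and the 5 point MDE are hypothetical alternatives, not numbers from the scenario.

```python
from math import ceil, sqrt
from statistics import NormalDist

def users_per_version(baseline, mde, confidence=0.90, power=0.80):
    """Pooled normal approximation, two-sided test, 50/50 split (assumed)."""
    p1, p2 = baseline, baseline + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    spread = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(spread ** 2 / mde ** 2)

# Change one assumption at a time and compare against the example.
scenarios = {
    "article example": users_per_version(0.18, 0.08),
    "95% confidence": users_per_version(0.18, 0.08, confidence=0.95),
    "smaller MDE (5 points)": users_per_version(0.18, 0.05),
    "baseline estimate of 25%": users_per_version(0.25, 0.08),
}
for label, n in scenarios.items():
    print(f"{label}: {n} users per version")
```

Each changed scenario asks for more users than the original 331, which is a good reason to write the assumptions down next to the number.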
That is why I try to make the assumptions visible. The sample-size number is only useful when the team understands what it is based on.
Sample-Size Checklist
Use this checklist before you calculate the sample size.
Define the behavior you are measuring. Be specific. “Users completed the four-step workflow” is clearer than “engagement.”
Get the current baseline rate. Use analytics if you have them. If not, use the best estimate you have and say clearly that it is an estimate.
Decide the smallest improvement that would matter. This is the Minimum Detectable Effect, or MDE. Decide whether the team cares about a 2 percentage point improvement, a 5 percentage point improvement, an 8 percentage point improvement, or something else.
Choose the confidence level. I usually consider 90% confidence for lower-risk, reversible product decisions and 95% confidence for higher-risk decisions.
Use 80% power unless there is a strong reason not to. This is the standard I typically use.
Use a 50/50 split unless there is a reason to do something else. Half of users see Version A. Half of users see Version B.
Calculate the sample size per version. Do not only report the total sample size. Teams need to know how many users are needed in each version.
Round up. If the calculator says 330.31, do not plan for 330. Plan for 331 or more.
Decide the stopping point before the test starts. Do not wait until the results look good and then stop the test. Set the target sample size ahead of time.
Do not overstate what the result proves. The test can tell you whether one version performed differently from another version under the conditions of the test. It does not automatically explain why the difference happened. Read more about A/B Testing strategy here.
Conclusion
The big takeaway here is that A/B test sample sizing is mostly about making the team’s assumptions visible.
The hard part is not doing the calculation. The hard part is agreeing on what behavior you are measuring, what improvement would matter, and how much uncertainty the team is willing to accept.
That is why UX researchers should lead these conversations within our orgs. We must help others see that sample size is connected to study design by default. The number depends on the hypothesis, the variation, the outcome metric, and the decision the team is trying to make.
A correctly sized A/B test can still be interpreted badly. It can tell you whether one version performed differently from another, but it does not automatically explain why. So use the calculator. Get the number. Round up. Then make sure the test is designed clearly, sized realistically, and interpreted carefully. Thanks for reading!






