How do I calculate the right sample size for my Shopify A/B test?

Use the formula: Test duration (days) = (Sample size per variant × 2) / Daily traffic to tested page. For a 2.5% baseline CVR with 95% confidence and 80% power, detecting a 10% relative lift needs about 28,000 sessions per variant. Detecting a 5% lift needs about 110,000 sessions per variant, which is impractical for most stores.

When should I stop an A/B test?

Pre-commit to a stopping rule before launch. A reasonable rule: minimum 14 days running, minimum 200 conversions per variant, statistical significance at 95%, and effect direction stable for at least 7 days. Stopping at the first sign of significance leads to false-positive rates of 30-40% rather than the advertised 5%.

Can low-traffic Shopify stores run A/B tests?

Stores under 1,000 sessions/day struggle with traditional A/B testing for small effects. Options: test bigger changes (25-30% effects are detectable in roughly 1-3 weeks at 500-1,000 sessions/day), aggregate across products, use multi-armed bandits, or use genetic algorithm optimization that evolves layouts continuously without requiring a binary winner-takes-all decision.

Should I test for add-to-cart, checkout-initiated, or purchase events?

Test for purchase events when sample size allows. Add-to-cart lift does not always translate to purchase lift: a change can lift add-to-cart by 15% while leaving purchase flat or negative. Lower-funnel conversion events have larger sample size requirements but more reliable signal.

How Long Does a Shopify A/B Test Need to Run? (Direct Answer with the Math)

Quick answer: Most Shopify A/B tests need to run for 2 to 4 weeks to reach statistical significance, with the exact duration depending on your traffic volume and the size of the effect you're detecting. A store with 1,000 sessions/day looking for a 15% CVR lift needs about 3.5 weeks; a store with 5,000 sessions/day looking for a 10% lift needs about 11 days; a store with 200 sessions/day looking for a 5% lift would need about 3 years, so the test is not worth running. This guide shows you how to calculate the right duration for your specific store and what to do when traffic is too low for traditional testing. For stores at any traffic level, the guide to the best Shopify apps to increase conversion rate covers alternatives to manual testing.

The Formula That Determines Test Duration

A/B test duration is determined by:

Sample size needed: calculated from your current CVR, the minimum detectable effect (MDE) you want to find, and your statistical confidence level
Daily traffic on the page being tested: sessions per day for the relevant segment

The relationship is:

Test duration (days) = (Sample size per variant × 2) / (Daily traffic to tested page)

The factor of 2 accounts for splitting traffic between control and variant. In a 50/50 split, each variant must reach the full per-variant sample size, so the test needs twice that many sessions in total. Both variants collect data at the same time, so you divide that total by the page's full daily traffic.
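As a minimal sketch, the duration formula can be written as a small Python helper. The function name `duration_days` and its parameters are illustrative, not part of any Shopify tool, and the result is rounded up to whole days.

```python
import math

def duration_days(sample_size_per_variant: int, daily_sessions: int, variants: int = 2) -> int:
    """Days until every variant reaches its sample size under an even traffic split."""
    if daily_sessions <= 0:
        raise ValueError("daily_sessions must be positive")
    # Every variant needs the full per-variant sample, so the total scales with the variant count.
    total_sessions_needed = sample_size_per_variant * variants
    # Round up: a partial day still has to be run.
    return math.ceil(total_sessions_needed / daily_sessions)

# 28,000 sessions per variant at 1,000 sessions/day
print(duration_days(28_000, 1_000))  # 56
# 12,500 sessions per variant at 5,000 sessions/day
print(duration_days(12_500, 5_000))  # 5
```

The result is only the time needed to reach the sample size. Minimum-duration rules, such as running full weekly cycles, still apply on top of it.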

A Practical Sample Size Table

For a CVR around 2.5% (the Shopify median), with 95% confidence and 80% power, the sample sizes you need per variant are:

| Minimum detectable effect | Sample size per variant | Total sample (2 variants) |
|---|---|---|
| 5% relative lift | ~110,000 sessions | 220,000 |
| 10% relative lift | ~28,000 sessions | 56,000 |
| 15% relative lift | ~12,500 sessions | 25,000 |
| 20% relative lift | ~7,000 sessions | 14,000 |
| 25% relative lift | ~4,500 sessions | 9,000 |
| 30% relative lift | ~3,200 sessions | 6,400 |
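If you want to compute a sample size for your own baseline CVR rather than read it from a table, one common approach is the textbook formula for a two-sided two-proportion z-test. The sketch below assumes that formula; the function name and defaults are illustrative. With these defaults it returns larger figures than the table above (roughly 64,000 sessions per variant for a 10% lift at a 2.5% baseline). Calculators differ in their assumptions, so it is worth checking which ones your tool uses before committing to a duration.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_cvr: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sessions per variant for a two-sided two-proportion z-test."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_lift)   # CVR if the variant achieves the lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 95% confidence, two-sided
    z_beta = NormalDist().inv_cdf(power)           # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for lift in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    print(f"{lift:.0%} relative lift: {sample_size_per_variant(0.025, lift):,} sessions per variant")
```

Whatever calculator you use, the shape is the same: halving the detectable effect roughly quadruples the required sample.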

Now apply your daily traffic to find the test duration:

| Daily traffic | 5% MDE | 10% MDE | 15% MDE | 20% MDE | 30% MDE |
|---|---|---|---|---|---|
| 200 sessions | 1,100 days (impossible) | 280 days | 125 days | 70 days | 32 days |
| 500 sessions | 440 days | 112 days | 50 days | 28 days | 13 days |
| 1,000 sessions | 220 days | 56 days | 25 days | 14 days | 7 days |
| 2,500 sessions | 88 days | 22 days | 10 days | 6 days | 3 days |
| 5,000 sessions | 44 days | 11 days | 5 days | 3 days | 2 days |
| 10,000 sessions | 22 days | 6 days | 3 days | 2 days | 1 day |

The takeaway: most Shopify stores running between 1,000 and 5,000 sessions/day can detect 15-20% effects in under 4 weeks; at 500 sessions/day the same effects take 4-7 weeks. Detecting a 5% effect is impractical for stores below 5,000 sessions/day.

Why Some Tests Fail Despite "Significance"

A common mistake is calling a winner the moment your testing tool shows 95% significance. This goes wrong for several reasons, and each has a fix:

Peeking inflates false positive rate. If you check daily and stop the test the moment significance shows, your true false-positive rate is closer to 30-40%, not 5%. You will declare winners that are actually noise.
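The peeking effect can be illustrated with a small simulation of A/A tests, where both variants share the same true CVR and any "winner" is a false positive. This is a sketch under stated assumptions: daily conversions use a normal approximation, and the traffic, CVR, and test length are hypothetical defaults. It compares stopping at the first significant daily check against a single check on the final day.

```python
import random
from statistics import NormalDist

def simulate_false_positive_rates(days=28, sessions_per_variant_per_day=1_000,
                                  cvr=0.025, n_tests=2_000, alpha=0.05, seed=0):
    """Return (peeking_rate, fixed_rate) for simulated A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Normal approximation to one day's conversions for one variant.
    mean = sessions_per_variant_per_day * cvr
    sd = (sessions_per_variant_per_day * cvr * (1 - cvr)) ** 0.5
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0.0
        sessions = 0
        z = 0.0
        significant_on_some_day = False
        for _ in range(days):
            conv_a += max(0.0, rng.gauss(mean, sd))
            conv_b += max(0.0, rng.gauss(mean, sd))
            sessions += sessions_per_variant_per_day
            pooled = (conv_a + conv_b) / (2 * sessions)
            se = (2 * pooled * (1 - pooled) / sessions) ** 0.5
            z = (conv_b - conv_a) / sessions / se if se > 0 else 0.0
            if abs(z) > z_crit:
                significant_on_some_day = True  # a daily peeker would stop here
        peeking_hits += int(significant_on_some_day)
        fixed_hits += int(abs(z) > z_crit)  # only the final-day result counts
    return peeking_hits / n_tests, fixed_hits / n_tests

peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"Stop at first significant day: {peeking_rate:.1%} false positives")
print(f"Check once at the end:         {fixed_rate:.1%} false positives")
```

The single final check stays near the nominal 5%, while the daily-peeking policy lands several times higher. The exact figure depends on how many times you look.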

Pre-commit to a sample size and a duration. Decide before launching that you will run for at least N days OR collect at least M conversions per variant, and don't stop early no matter what intermediate numbers show.

Run for full weekly cycles. Visitor behavior varies by day of week. A test that runs Monday-Wednesday may show one variant winning because Wednesday-shoppers happened to behave differently. Always run for at least one full 7-day cycle, ideally two.

Beware of the "novelty effect." A new layout often outperforms an old one in the first 3-4 days because of novelty/curiosity. The lift fades. Run for at least 2 weeks to see the steady-state performance.

What Counts as a "Conversion" Affects Duration

A test on cart-page CVR has different sample size requirements than a test on landing-page-to-checkout. The lower in the funnel you measure, the lower the baseline rate of the event, and the more traffic you need to detect the same relative lift.

For Shopify A/B tests, the practical conversion events:

Add-to-cart (highest event volume; smallest sample sizes needed; fastest tests but weakest signal)
Checkout initiated (medium volume; useful for cart and checkout tests)
Purchase (lowest event volume; largest sample sizes; slow but most reliable signal)

Many stores test for add-to-cart lift and assume it translates to purchase lift. This is often wrong. A change that lifts add-to-cart by 15% can lift purchase by 2-5% or even decrease it (if the added cart items are abandoned). Test for purchase events when you can.
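To see why lower-funnel events need more traffic, the sketch below applies the textbook two-sided two-proportion formula to three baseline rates. The rates are hypothetical placeholders (8% add-to-cart, 4% checkout initiated, 2.5% purchase), not Shopify benchmarks; substitute your own.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sessions per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

# Hypothetical per-session event rates; replace with your store's numbers.
funnel_rates = {"add_to_cart": 0.08, "checkout_initiated": 0.04, "purchase": 0.025}

sizes = {event: sample_size_per_variant(rate, relative_lift=0.15)
         for event, rate in funnel_rates.items()}
for event, n in sizes.items():
    print(f"{event}: {n:,} sessions per variant for a 15% relative lift")
```

For the same relative lift, the rarer the event, the more sessions each variant needs, which is the trade-off between test speed and signal quality described above.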

What to Do When Your Traffic Is Too Low

If your store does under 1,000 sessions/day, traditional A/B testing can take months for any meaningful effect. Options:

Test bigger changes. A 25-30% effect is detectable in roughly 1-3 weeks at 500-1,000 sessions/day. Don't bother with button color tests; test layout overhauls.

Multi-armed bandits. Bandit algorithms allocate traffic toward winning variants while still gathering data. They reach decisions faster than fixed A/B tests but introduce bias. Reasonable for low-traffic stores willing to trade some statistical rigor for speed.
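One widely used bandit strategy is Thompson sampling, shown here as a minimal sketch rather than a production allocator. Each visitor is sent to the variant whose conversion rate, drawn from a Beta posterior, is highest at that moment. The true conversion rates passed in are hypothetical and exist only to drive the simulation.

```python
import random

def thompson_sampling(true_cvrs, visitors=30_000, seed=0):
    """Simulate Thompson sampling; return how many visitors each variant received."""
    rng = random.Random(seed)
    successes = [0] * len(true_cvrs)
    failures = [0] * len(true_cvrs)
    for _ in range(visitors):
        # Draw a plausible CVR for each variant from its Beta posterior (uniform prior).
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        arm = draws.index(max(draws))
        # Simulate whether this visitor converts on the chosen variant.
        if rng.random() < true_cvrs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

shown = thompson_sampling([0.025, 0.04])
print(f"Visitors per variant: {shown}")
```

Traffic drifts toward the stronger variant as evidence accumulates, so fewer visitors see the weaker one. The cost is that the weaker variant ends up with less data, which is the loss of rigor mentioned above.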

Genetic algorithm optimization. Tests entire populations of variants simultaneously, evolves the best performers, and continues testing as performance shifts. Works at lower traffic levels because each test isn't a binary winner-takes-all decision; instead, the algorithm gradually shifts toward better-performing combinations across many variables. This is what Eevy AI runs.

Aggregate across products. Instead of testing on a single product page, test the same change across 20-50 product pages and aggregate the results. Multiplies your effective sample size.

Sequential A/B/n testing. Run a single bigger test first (variant A vs B), then test the winner against a new C variant. Slower but easier to interpret than parallel multi-variant tests.

Common Test Duration Mistakes

Stopping at the first sign of significance: leads to false positives at 30-40% rate
Running for less than 7 days: misses weekly cycle effects
Testing small changes at low traffic: burns weeks for nothing
Comparing against last month's baseline: confounds with seasonality, algorithm changes, traffic source shifts
Treating one test winner as universally true: winners on mobile may lose on desktop, or vice versa
Not pre-defining the success metric: leads to "p-hacking" by trying multiple metrics until one shows significance

When to Stop a Test

Pre-commit to a stopping rule before you launch. A reasonable rule for most Shopify A/B tests:

Minimum 14 days running
Minimum 200 conversions per variant
Pre-defined sample size reached (from your sample size calculation)
Statistical significance at 95% confidence
Effect direction stable for at least 7 days (not flipping back and forth)

If all five are met, call the winner. If any one is missing, keep running.
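The five-part rule can be encoded as a simple checklist function. This is a sketch: the names (`VariantStats`, `can_call_winner`) are illustrative, significance is assessed with a pooled two-proportion z-test, and `stable_direction_days` is assumed to be tracked separately by you or your testing tool.

```python
from dataclasses import dataclass
from statistics import NormalDist

@dataclass
class VariantStats:
    sessions: int
    conversions: int

def p_value(control: VariantStats, variant: VariantStats) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (control.conversions + variant.conversions) / (control.sessions + variant.sessions)
    se = (pooled * (1 - pooled) * (1 / control.sessions + 1 / variant.sessions)) ** 0.5
    if se == 0:
        return 1.0
    z = (variant.conversions / variant.sessions - control.conversions / control.sessions) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def can_call_winner(control, variant, days_running, required_sample_size, stable_direction_days):
    """Return (decision, per-criterion results) for the five-part stopping rule."""
    checks = {
        "min_14_days": days_running >= 14,
        "min_200_conversions": min(control.conversions, variant.conversions) >= 200,
        "sample_size_reached": min(control.sessions, variant.sessions) >= required_sample_size,
        "significant_at_95": p_value(control, variant) < 0.05,
        "direction_stable_7_days": stable_direction_days >= 7,
    }
    return all(checks.values()), checks

control = VariantStats(sessions=28_000, conversions=700)
variant = VariantStats(sessions=28_000, conversions=840)
decision, checks = can_call_winner(control, variant, days_running=21,
                                   required_sample_size=28_000, stable_direction_days=10)
print(decision, checks)
```

Returning the individual checks alongside the decision makes it clear which criterion is still blocking the call.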

Practical Recommendations by Store Size

Under 500 sessions/day: Don't run traditional A/B tests. Use genetic algorithms or test very large layout changes only. Otherwise, focus on foundational improvements (page speed, social proof volume) that don't require testing infrastructure.

500-2,000 sessions/day: Test changes with expected 15%+ effects. Run for 3-6 weeks. Pre-commit to sample size and duration. Limit to 1-2 concurrent tests per page.

2,000-10,000 sessions/day: Standard A/B testing works well. 1-3 week tests for 10-20% effects. Can run multiple concurrent tests if they're on different pages.

10,000+ sessions/day: A/B test infrastructure pays off. Can detect 10% effects in about a week and 5% effects in about 3 weeks. Consider always-on multivariate or bandit infrastructure.

How Eevy AI Approaches Testing

Traditional A/B testing requires you to design hypotheses, allocate traffic, wait, and interpret. For most Shopify stores, this is a part-time job that isn't getting done.

Eevy AI's genetic algorithm runs continuous optimization without you defining hypotheses. It maintains a population of layout variations, evaluates them against your real traffic, and breeds the best performers into new variations. The algorithm handles the math automatically: sample sizes, statistical significance, multi-armed bandit allocation.

This means stores at lower traffic levels (where traditional A/B testing breaks down) can still benefit from continuous CRO. A store with 500-1,000 sessions/day that would need 3 months for one A/B test can run an evolving optimization in the background that improves layout performance week over week.

TL;DR

Most Shopify A/B tests need 2-4 weeks at typical store traffic
Low-traffic stores (< 1,000/day) often cannot test small changes. Focus on big layout changes or use genetic algorithm optimization
Pre-commit to sample size and duration before launching to avoid false positives from peeking
Run for at least one full 7-day cycle even if significance shows earlier
Test for purchase events when you can; add-to-cart lift doesn't always translate to revenue lift
The right test for your store depends on your traffic; use the table above to calibrate