# announcements
Hi everyone. Have a question that is not directly related to GrowthBook but rather to A/B testing in general, specifically sample size estimation (now that you have a frequentist framework in your product, it’s even more relevant). When estimating the required sample size for an experiment, one usually needs the base conversion rate. The problem we have is that our conversion rate fluctuates wildly, so even if the historical data shows a base conversion rate of 9%, it can drop or rise significantly during the experiment due to some external factors (like getting lower-quality traffic from our PPC ads). Are there any ways to take this into account so that we still have statistical power of at least 80% even if the “real” conversion rate drops during the experiment for both variants? My first guess would be to “underestimate” the base conversion rate, but I’m not sure I’m on the right track here. Say we looked at the data for the last 3 months and discovered that our conversion rate was 9%: would it be statistically and mathematically valid to specify the base conversion rate as 8% instead of 9%, so that we “safeguard” ourselves against a conversion drop during the experiment? Of course, it would require a larger sample size, but that’s a sacrifice we are willing to make.
Hi Yevhen. There are three key components to estimating if your experiment is going to have enough power: (a) sample size (normally determined by runtime in online experimentation), (b) the effect size of your experiment, and (c) the variance of your metric of interest. When you say the "conversion rate", do you mean just the overall conversion rate or do you mean the difference in conversion rates in your A/B test overall? In general, you're right that you could use historical data and assume a (a) smaller effect size and (b) larger variance to "safeguard" yourself against having an underpowered experiment.
If you're just saying that 9% is your overall base conversion rate, then lowering it to 8% (for a binomial metric) will actually decrease (c) the variance of your metric and increase your power. But it will also likely decrease (b) the effect size of your experiment, because in reality it is often hard to move very small numbers that happen infrequently, and this would decrease your power. So in this case, it isn't totally clear what net effect it will have. If you want to be more conservative, you'll want to explicitly either (a) just increase the sample size, (b) decrease your expected effect size, or (c) increase the expected variance of your metric.
> would it be statistically and mathematically valid
Estimating power is almost always a guessing game. Tweaking your historical data to be more or less conservative doesn't undermine any statistical validity; it just changes the risk you're taking on when running an experiment. In general, frequentist statistics will be valid so long as you run the experiment to conclusion. Of course, those statistics won't be very useful if they come from an underpowered experiment, where you will likely just get a null result. So to that end, I think being more conservative in the situation you're in makes the most sense.
I guess I was in a little bit of a hurry and didn’t provide all the necessary data. Let’s say the historical data shows that our base conversion rate is 9%, and our minimum effect of interest is an uplift of 10%. We calculate the required sample size and come up with the following numbers: to detect a 10% uplift with statistical power 0.80, significance level 0.05, and a base conversion rate of 9%, we would need 16083 users for each variation.

BUT we also want to protect ourselves from getting an underpowered experiment in case the actual conversion rate drops during the experiment runtime. So we assume that our conversion rate may drop to 8% next month due to some external factors, and we recalculate with the following input: to detect a 10% uplift with statistical power 0.80, significance level 0.05, and a base conversion rate of 8%, we would need 18296 users for each variation.

What we did here is leave the minimum detectable effect, significance level, and statistical power at the same levels, but assume that the base conversion rate could actually be worse than the one found in the historical data. This way, we tried to make sure that if our real conversion rate dropped to 8% during the experiment, we would still be able to detect a 10% uplift with an 80% chance. Of course, if the real conversion rate actually gets higher, we would get more statistical power to detect the 10% uplift, which is always welcome.
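The calculation above can be sketched with the standard normal-approximation formula for a two-proportion test. This is just an illustration: exact numbers vary a bit between calculators (pooled vs. unpooled variance, continuity corrections), so the output won't match 16083/18296 exactly, but the direction is the same: the 8% scenario always requires more users.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(base_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate sample size per variant for a two-sided two-proportion
    z-test, using the unpooled normal-approximation formula. Different
    calculators use slightly different variants of this formula."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)

print(sample_size_per_variant(0.09, 0.10))  # base rate 9%, ~16-17k per variant
print(sample_size_per_variant(0.08, 0.10))  # base rate 8%, noticeably more
```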
Yep! This sounds totally reasonable, and it is generally good practice to be a bit conservative with historical numbers and experimentation in general, if you can afford the extra runtime. In either case, if you run your experiment to completion you won't have issues with the statistical validity of your experiment results (but your true experiment power remains somewhat unknown even after the experiment). For what it's worth, the main reason the power calculation says you need more users is that when your base conversion rate drops to 8%, you're actually shrinking the absolute minimum effect of interest (while the relative minimum effect of interest stays fixed at 10%). With a base conversion rate of 9%, a relative effect of 10% is the same as an absolute increase of 0.9 percentage points, or an increase in the proportion converting of 0.009 (9% * 0.1 = 0.9%, i.e. 0.09 * 0.1 = 0.009). With a base conversion rate of 8%, a relative effect of 10% is the same as an absolute increase of 0.8 percentage points, or an increase in the proportion converting of 0.008 (8% * 0.1 = 0.8%, i.e. 0.08 * 0.1 = 0.008). Therefore, you need more users in the second case, since you're actually looking to detect a smaller absolute lift in the proportion of users converting.
Thanks! You’ve been a great help! I assume the same approach can be used when estimating the variance of the treatment when deciding on the required sample size? In fact, I heard somewhere that it’s OK to multiply the base variance by two when providing the input for the treatment population variance. Of course, we can never be sure what variance we will get in the treatment, but I guess multiplying the base one by two should give us some extra confidence that we will get decent statistical power in case of a variance increase in the treatment.
That's also going to be conservative, and it's just a rule of thumb, but it doesn't seem like a bad idea. I'd say the bigger the effect you're expecting, the more likely treatment variance is to increase (not as a rule, but probably on average). Also: this mostly applies to non-binomial metrics, because for binomial metrics you just need the proportion to determine the variance.
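For a non-binomial metric, the effect of that rule of thumb can be sketched with the standard normal-approximation sample size for a difference in means with (possibly unequal) group variances. The metric and all numbers below are made up for illustration; doubling the assumed treatment variance inflates the required sample by a factor of (1 + 2) / (1 + 1) = 1.5:

```python
from math import ceil
from statistics import NormalDist

def mean_diff_sample_size(var_control, var_treatment, delta,
                          alpha=0.05, power=0.80):
    """Per-variant sample size to detect an absolute difference in means
    `delta`, normal approximation, allowing unequal group variances."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (var_control + var_treatment) / delta ** 2)

# Hypothetical revenue-per-user metric: historical variance 25,
# looking for a $0.50 absolute lift.
n_equal = mean_diff_sample_size(25, 25, 0.5)    # treatment variance unchanged
n_doubled = mean_diff_sample_size(25, 50, 0.5)  # rule of thumb: double it
print(n_equal, n_doubled)  # the doubled-variance input needs ~1.5x the users
```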
Yep, nice to have another validation that I’m moving in the right direction. Those are two things that have been bothering me for a while. Have a great day!