# announcements
r
Hi Growthbook team, having the same issue as I described here https://growthbookusers.slack.com/archives/C01T6PKD9C3/p1687531717500479 , I now have 8 A/A experiments running with 8k+ impressions and 1.5k+ conversions, but the confidence level and the percent change still fluctuate daily. Is there anything I can do to verify our setup?
f
@helpful-application-7107 any thoughts?
h
1. The confidence level and percent change fluctuating seems normal if you're still collecting data, but they should fluctuate less as you collect more data (regardless of whether this is an A/B or A/A test).
2. Some A/A tests will show marginal significance like this as a matter of pure chance. If it keeps happening then there may be something going on with your setup, but it's hard to know. This screenshot is perfectly consistent with a functioning experiment.
The other user's suggestion in that last thread was reasonable. What are you seeing across all 8 experiments?
r
the fluctuation hasn't really changed in the past 2 weeks, they seem to reach stat-sig and then lose it. how much data would be necessary? it feels like 8k impressions and 1.5k conversions should be enough, right?
all 8 experiments are showing very different results, for example +14% and +18% in one experiment with a control and 2 variations, and then -4.1% and +4.2% in another two experiments, each with a control and 1 variation
h
So with A/A tests, you're hoping to see no differences across variations. In your email, it looks mostly like one test going back and forth over a significance threshold, which is more or less what I would expect from one A/A test that had an "unlucky" draw. The rest should show inconclusive results, which would indicate that the A/A test is working.
r
image.png,image.png,image.png
h
Ideally, for A/A tests we would have an explicit test that demonstrates that you've collected enough data to ascertain that the groups are no more than X% different from each other.
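For what it's worth, the standard way to formalize "no more than X% different" is an equivalence test (two one-sided tests, TOST). To be clear, this isn't something GrowthBook computes on the results page; it's just a rough Python sketch with made-up counts and a 1-percentage-point margin:
```python
# Rough sketch of a two one-sided tests (TOST) equivalence check on two
# conversion rates. Counts below are made up for illustration.
import numpy as np
from scipy.stats import norm

def tost_proportions(x_a, n_a, x_b, n_b, margin=0.01, alpha=0.05):
    """Test whether the absolute difference in conversion rates is
    within +/- `margin` (e.g. 1 percentage point)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # One one-sided test against each edge of the equivalence margin.
    z_lower = (diff + margin) / se   # H0: diff <= -margin
    z_upper = (diff - margin) / se   # H0: diff >= +margin
    p_lower = 1 - norm.cdf(z_lower)
    p_upper = norm.cdf(z_upper)
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha  # True => "equivalent within margin"

# Hypothetical A/A numbers: ~12% conversion in both arms.
print(tost_proportions(x_a=1550, n_a=13000, x_b=1520, n_b=13000))
```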
This all looks consistent with functioning A/A tests.
👍 1
r
I don't understand, are these 3 A/A tests not showing a big difference between the different variations?
I'd expect to see a 0.1% change
h
Ideally, everything would show 0% uplift with a 50% chance to beat control, but that doesn't happen in real life, or even in best-case simulations. There is random noise.

For example, imagine you're running 20 A/A tests. In a frequentist engine, if your significance threshold is 5% (0.05), then you would get, on average, 1 A/A test that shows statistical significance. In fact, you can show that with perfectly functioning A/A tests that are independent from one another, the probability that at least one of them is a false positive, and shows up as stat-sig, is 0.64, or 64%.

Seeing one test out of 8 A/A tests, some with multiple variations, that has a stat-sig outcome is not surprising to me. For the sake of our back-of-the-envelope math, you've run something like 12 hypothesis tests (8 A/A tests, but some have more than 2 variations). If you were a frequentist with a 0.05 threshold and these tests were independent, you'd get at least one false positive 46% of the time (see the quick sketch below).
👍 1
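If you want to check that back-of-the-envelope math yourself, it's just the standard independence calculation (nothing GrowthBook-specific):
```python
# Probability of at least one false positive across k independent
# hypothesis tests at significance level alpha (the null is true in
# all of them, as in A/A tests).
alpha = 0.05

def p_any_false_positive(k, alpha=alpha):
    return 1 - (1 - alpha) ** k

print(p_any_false_positive(20))  # ~0.64 for 20 tests
print(p_any_false_positive(12))  # ~0.46 for ~12 comparisons across your 8 experiments
```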
r
let me maybe ask a more basic question: isn't the idea of running A/A tests to not reach stat-sig and to get as close to 0% lift as possible? I do expect that maybe one of them would be a false positive, but all of the tests I've run have been false positives with different changes in uplift
h
The ideal with an A/A test is to see no stat-sig results, and lift as close to 0% as possible.
but all of the tests I've run have been false positives with different changes in uplift
Only one of the tests you've shown me is a "false positive" in that it was stat-sig. The non-zero uplift that isn't stat-sig is a feature of your sample size. If you collect more data, those uplift percentages should generally shrink towards zero, though not always.
r
the emails show 4 different experiments reaching stat-sig, each at a random uplift, and it has happened to all 8 experiments at one point or another
how much more data collection do you estimate would be necessary for them to shrink towards 0? so far they have just shifted in different directions and not towards 0 as I collect more data
h
the emails show 4 different experiments reaching stat-sig, each at a random uplift, and it has happened to all 8 experiments at one point or another
This is akin to the well-known peeking problem. If you're running 10 A/A tests and you check them every 6 hours, then the probability of seeing at least one false positive at some point is very, very high.
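To put a rough number on the peeking effect, here's a toy simulation. It uses a plain two-proportion z-test and made-up traffic numbers, not GrowthBook's actual engine, so treat it as a sketch of the mechanism rather than a statement about your setup:
```python
# Toy simulation: an A/A test at a 12% conversion rate, checked repeatedly
# as data comes in. Counts a "false positive" if ANY check crosses p < 0.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true, n_per_check, n_checks, sims = 0.12, 500, 28, 2000

false_positive_once = 0
for _ in range(sims):
    conv_a = conv_b = n_a = n_b = 0
    hit = False
    for _ in range(n_checks):                      # e.g. checking every 6 hours for a week
        conv_a += rng.binomial(n_per_check, p_true)
        conv_b += rng.binomial(n_per_check, p_true)
        n_a += n_per_check
        n_b += n_per_check
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        if 2 * (1 - norm.cdf(abs(z))) < 0.05:      # two-sided p-value
            hit = True
            break
    false_positive_once += hit

print(false_positive_once / sims)  # typically well above the nominal 5%
```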
how much more data collection do you estimate would be necessary for them to shrink towards 0? so far they have just shifted in different directions and not towards 0 as I collect more data
It's hard to say. You're seeing uplift of about 4% in your A/A tests. This is about a 0.5 percentage point absolute difference (e.g. your variations are at something like 12% and 12.5% conversion rates). In that sense, I'd say your effects are pretty close to 0. But let's talk about sample sizes. Imagine this 4% effect was real. In that case, in order to reliably get a stat-sig result showing that the effect is real, you would need ~75k users per variation! And if you wanted to reliably rule out effect sizes smaller than 4%, you'd need even more than 75k users per variation.
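For reference, here's roughly where that ~75k per variation comes from: a standard two-proportion sample-size formula, assuming a 12% baseline, 5% significance, and 80% power (the power settings are my assumption, not something pulled from your setup):
```python
# Approximate sample size per variation to detect a relative lift with a
# two-sided two-proportion z-test.
from scipy.stats import norm

def n_per_variation(baseline, rel_lift, alpha=0.05, power=0.8):
    p1 = baseline
    p2 = baseline * (1 + rel_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2

# ~73k per variation for a 4% relative lift on a 12% baseline,
# i.e. the same ballpark as the ~75k above.
print(round(n_per_variation(0.12, 0.04)))
```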
Here's a quick simulation I did (a sketch of the code is below). It shows the distribution of simulated uplift percents from working A/A tests when you have a 12% conversion rate. Each row of plots shows the distribution of uplift percents for a given sample size; your situation, with ~26k total users, is the top row. As you can see, getting a 4% uplift in this case is not super unusual. So what range would be small enough for you? If it's the bottom row, where almost all uplift from your A/A tests is between -4% and +4%, and most is between -2% and +2%, you'd need 100k users total.
I put a line at 4% just to show where some of your results are, and that they aren't that far in the tail of the distribution when you have 26k users!
sim_aa.png
💡 1
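And if you want to reproduce something like that chart yourself, here's the gist of the simulation. It's not the exact script I used for the plot, just a sketch at a 12% conversion rate that prints summary numbers instead of drawing histograms:
```python
# Sketch: distribution of observed uplift % in A/A tests at a 12% conversion
# rate, for a few total sample sizes (each row of the chart = one size).
import numpy as np

rng = np.random.default_rng(1)
p_true, sims = 0.12, 10_000

for total_users in (26_000, 50_000, 100_000):
    n = total_users // 2                        # users per variation
    conv_a = rng.binomial(n, p_true, sims)      # simulated conversions, control
    conv_b = rng.binomial(n, p_true, sims)      # simulated conversions, variation
    uplift_pct = (conv_b - conv_a) / conv_a * 100
    share_beyond_4 = np.mean(np.abs(uplift_pct) >= 4)
    print(f"{total_users:>7} users: "
          f"middle 95% of uplift = [{np.percentile(uplift_pct, 2.5):+.1f}%, "
          f"{np.percentile(uplift_pct, 97.5):+.1f}%], "
          f"|uplift| >= 4% in {share_beyond_4:.0%} of A/A tests")
```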
a
Great charts! They show how, even with 50k total users, a non-trivial share of A/A tests will show a +/-4% change or greater. @rich-psychiatrist-16291 the variance is due to your sample size; you wouldn't hit a 0.1% change unless you had a huge sample size, e.g. 100k-500k users.
Bruno, for Bayesian A/B tests, I highly recommend planning an estimated test run length first. You can do that by plugging your data into a calc like this one. With some example data similar to yours, it shows that if you expect/aim to hit a 5% change, you would need a total sample size of 40,000. When planning, you can edit the + uplift to learn what % uplift you need to hit over X weeks. This helps you estimate whether the uplift you can detect is achievable, and helps prevent stopping a Bayesian test too early 🙂
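The "over X weeks" part is simple arithmetic once a calculator gives you a required sample size: divide it by your weekly traffic. The numbers below are placeholders, not yours:
```python
# Placeholder run-length estimate: weeks needed to reach a required total
# sample size, given average weekly traffic into the experiment.
import math

required_total_sample = 40_000   # e.g. what a sample-size calculator reports
users_per_week = 6_500           # hypothetical weekly traffic

weeks_needed = math.ceil(required_total_sample / users_per_week)
print(weeks_needed)  # 7 weeks with these placeholder numbers
```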
r
alright, I had to meditate on it but I understand now. @helpful-application-7107 thanks for walking me through it and for the detailed simulation. I remember now that the last time I did an A/A test we didn't even use any tools, we just calculated the uplift deviation in Google Sheets, so this all makes sense to me and I'm glad there's nothing wrong with the implementation
@adventurous-computer-41819 thanks for the tip! we're just starting to add experimentation and tracking in new parts of our system so I don't have numbers yet, but I'll definitely take your advice and use a calculator for future experiments, thank you!