# announcements
r
Hi Growthbook team, having the same issue as I described here https://growthbookusers.slack.com/archives/C01T6PKD9C3/p1687531717500479 , I now have 8 A/A experiments running with 8k+ impressions and 1.5k+ conversions, but the confidence level and the percent change still fluctuate daily. Is there anything I can do to verify our setup?
f
@helpful-application-7107 any thoughts?
h
1. The confidence level and percent change fluctuating seems normal if you're still collecting data, but they should fluctuate less as you collect more data (regardless of whether this is an A/B or A/A test).
2. Some A/A tests will show marginal significance like this as a matter of pure chance. If it keeps happening then there may be something going on with your setup, but it's hard to know. This screenshot is perfectly consistent with a functioning experiment.
The other user's suggestion in that last thread was reasonable. What are you seeing across all 8 experiments?
r
the fluctuation hasn't really changed in the past 2 weeks, they seem to reach stat-sig and then lose it. how much data would be necessary? it feels like 8k impressions and 1.5k conversions should be enough, right?
all 8 experiments are showing very different results, for example +14% and +18% in one experiment with a control and 2 variations, and then -4.1% and +4.2% in another two experiments, each with a control and 1 variation
h
So with A/A tests, you're hoping to see no differences across variations. In your email, it looks mostly like one test going back and forth over a significance threshold, which is more or less what I would expect from one A/A test that had an "unlucky" draw. The rest should show inconclusive results, which would indicate that the A/A test is working.
r
image.png,image.png,image.png
h
Ideally, for A/A tests we would have an explicit test that demonstrates that you've collected enough data to ascertain that the groups are no more than X% different from each other.
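For what it's worth, the standard way to formalize "no more than X% different" is an equivalence test (two one-sided tests, TOST). To be clear, this isn't something GrowthBook computes on the results page; it's just a rough Python sketch with made-up counts and a 1-percentage-point margin:
```python
# Rough sketch of a two one-sided tests (TOST) equivalence check on two
# conversion rates. Counts below are made up for illustration.
import numpy as np
from scipy.stats import norm

def tost_proportions(x_a, n_a, x_b, n_b, margin=0.01, alpha=0.05):
    """Test whether the absolute difference in conversion rates is
    within +/- `margin` (e.g. 1 percentage point)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_b - p_a
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # One one-sided test against each edge of the equivalence margin.
    z_lower = (diff + margin) / se   # H0: diff <= -margin
    z_upper = (diff - margin) / se   # H0: diff >= +margin
    p_lower = 1 - norm.cdf(z_lower)
    p_upper = norm.cdf(z_upper)
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha  # True => "equivalent within margin"

# Hypothetical A/A numbers: ~12% conversion in both arms.
print(tost_proportions(x_a=1550, n_a=13000, x_b=1520, n_b=13000))
```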
This all looks consistent with functioning A/A tests.
👍 1
r
I don't understand, are these 3 A/A tests not showing a big difference between the different variations?
I'd expect to see a 0.1% change
h
Ideally, everything would show 0% uplift with a 50% chance to beat control, but that doesn't happen in real life, or even in best-case simulations. There is random noise.

For example, imagine you're running 20 A/A tests. In a frequentist engine, if your significance threshold is 5% (0.05), then you would get, on average, 1 A/A test that shows statistical significance. In fact, you can show that with perfectly functioning A/A tests that are independent from one another, the probability that at least one of them is a false positive, and shows up as stat-sig, is 0.64, or 64%.

Seeing one test out of 8 A/A tests, some with multiple variations, that has a stat-sig outcome is not surprising to me. For the sake of our back-of-the-envelope math, you've run something like 12 hypothesis tests (8 A/A tests, but some have more than 2 variations). If you were a frequentist with a 0.05 threshold and these tests were independent, you'd get at least one false positive 46% of the time (see the quick sketch below).
👍 1
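If you want to check that back-of-the-envelope math yourself, it's just the standard independence calculation (nothing GrowthBook-specific):
```python
# Probability of at least one false positive across k independent
# hypothesis tests at significance level alpha (the null is true in
# all of them, as in A/A tests).
alpha = 0.05

def p_any_false_positive(k, alpha=alpha):
    return 1 - (1 - alpha) ** k

print(p_any_false_positive(20))  # ~0.64 for 20 tests
print(p_any_false_positive(12))  # ~0.46 for ~12 comparisons across your 8 experiments
```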
r
let me maybe ask a more basic question: isn't the idea of running A/A tests to not reach stat-sig and to get as close to 0% lift as possible? I do expect that maybe one of them would be a false positive, but all of the tests I've run have been false positives with different changes in uplift
h
The ideal with an A/A test is to see no stat-sig results, and lift as close to 0% as possible.
but all of the tests I've run have been false positives with different changes in uplift
Only one of the tests you've shown me is a "false positive" in that it was stat-sig. The non-zero uplift that isn't stat-sig is a feature of your sample size. If you collect more data, those uplift percentages should generally shrink towards zero, though not always.
r
the emails show 4 different experiments reaching stat-sig, each at a random uplift, and it has happened to all 8 experiments at one point or another
how much more data collection do you estimate would be necessary for them to shrink towards 0? so far they have just shifted in different directions and not towards 0 as I collect more data
h
the emails show 4 different experiments reaching stat-sig, each at a random uplift, and it has happened to all 8 experiments at one point or another
This is akin to the well-known peeking problem. If you're running 10 A/A tests and you check them every 6 hours, then the probability of seeing at least one false positive at some point is very, very high.
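To put a rough number on the peeking effect, here's a toy simulation. It uses a plain two-proportion z-test and made-up traffic numbers, not GrowthBook's actual engine, so treat it as a sketch of the mechanism rather than a statement about your setup:
```python
# Toy simulation: an A/A test at a 12% conversion rate, checked repeatedly
# as data comes in. Counts a "false positive" if ANY check crosses p < 0.05.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p_true, n_per_check, n_checks, sims = 0.12, 500, 28, 2000

false_positive_once = 0
for _ in range(sims):
    conv_a = conv_b = n_a = n_b = 0
    hit = False
    for _ in range(n_checks):                      # e.g. checking every 6 hours for a week
        conv_a += rng.binomial(n_per_check, p_true)
        conv_b += rng.binomial(n_per_check, p_true)
        n_a += n_per_check
        n_b += n_per_check
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        if 2 * (1 - norm.cdf(abs(z))) < 0.05:      # two-sided p-value
            hit = True
            break
    false_positive_once += hit

print(false_positive_once / sims)  # typically well above the nominal 5%
```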
how much more data collection do you estimate would be necessary for them to shrink towards 0? so far they have just shifted in different directions and not towards 0 as I collect more data
It's hard to say. You're seeing uplift of about 4% in your A/A tests. This is about a 0.5 percentage point absolute difference (e.g. your variations are at something like 12% and 12.5% conversion rates). In that sense, I'd say your effects are pretty close to 0. But let's talk about sample sizes. Imagine this 4% effect was real. In that case, in order to reliably get a stat-sig result showing that the effect is real, you would need ~75k users per variation! And if you wanted to reliably rule out effect sizes smaller than 4%, you'd need even more than 75k users per variation.
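For reference, here's roughly where that ~75k per variation comes from: a standard two-proportion sample-size formula, assuming a 12% baseline, 5% significance, and 80% power (the power settings are my assumption, not something pulled from your setup):
```python
# Approximate sample size per variation to detect a relative lift with a
# two-sided two-proportion z-test.
from scipy.stats import norm

def n_per_variation(baseline, rel_lift, alpha=0.05, power=0.8):
    p1 = baseline
    p2 = baseline * (1 + rel_lift)
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * var / (p2 - p1) ** 2

# ~73k per variation for a 4% relative lift on a 12% baseline,
# i.e. the same ballpark as the ~75k above.
print(round(n_per_variation(0.12, 0.04)))
```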
Here's a quick simulation I did (a sketch of the code is below). It shows the distribution of simulated uplift percents from working A/A tests when you have a 12% conversion rate. Each row of plots shows the distribution of uplift percents for a given sample size; your situation, with ~26k total users, is the top row. As you can see, getting a 4% uplift in this case is not super unusual. So what range would be small enough for you? If it's the bottom row, where almost all uplift from your A/A tests is between -4% and +4%, and most is between -2% and +2%, you'd need 100k users total.
I put a line at 4% just to show where some of your results are, and that they aren't that far in the tail of the distribution when you have 26k users!
sim_aa.png
💡 1
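And if you want to reproduce something like that chart yourself, here's the gist of the simulation. It's not the exact script I used for the plot, just a sketch at a 12% conversion rate that prints summary numbers instead of drawing histograms:
```python
# Sketch: distribution of observed uplift % in A/A tests at a 12% conversion
# rate, for a few total sample sizes (each row of the chart = one size).
import numpy as np

rng = np.random.default_rng(1)
p_true, sims = 0.12, 10_000

for total_users in (26_000, 50_000, 100_000):
    n = total_users // 2                        # users per variation
    conv_a = rng.binomial(n, p_true, sims)      # simulated conversions, control
    conv_b = rng.binomial(n, p_true, sims)      # simulated conversions, variation
    uplift_pct = (conv_b - conv_a) / conv_a * 100
    share_beyond_4 = np.mean(np.abs(uplift_pct) >= 4)
    print(f"{total_users:>7} users: "
          f"middle 95% of uplift = [{np.percentile(uplift_pct, 2.5):+.1f}%, "
          f"{np.percentile(uplift_pct, 97.5):+.1f}%], "
          f"|uplift| >= 4% in {share_beyond_4:.0%} of A/A tests")
```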
a
Great charts! They show how, even with 50k total users, a non-trivial share of A/A tests will show a +/-4% change or greater. @rich-psychiatrist-16291 the variance is due to your sample size; you wouldn't hit a 0.1% change unless you had a huge sample size, e.g. 100k-500k users.
Bruno, for Bayesian A/B tests, I highly recommend planning an estimated test run length first. You can do that by plugging your data into a calc like this one. With some example data similar to yours, it shows that if you expect/aim to hit a 5% change, you would need a total sample size of 40,000. When planning, you can edit the + uplift to learn what % uplift you need to hit over X weeks. This helps you estimate whether the uplift you can detect is achievable, and helps prevent stopping a Bayesian test too early 🙂
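The "over X weeks" part is simple arithmetic once a calculator gives you a required sample size: divide it by your weekly traffic. The numbers below are placeholders, not yours:
```python
# Placeholder run-length estimate: weeks needed to reach a required total
# sample size, given average weekly traffic into the experiment.
import math

required_total_sample = 40_000   # e.g. what a sample-size calculator reports
users_per_week = 6_500           # hypothetical weekly traffic

weeks_needed = math.ceil(required_total_sample / users_per_week)
print(weeks_needed)  # 7 weeks with these placeholder numbers
```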
r
alright, I had to meditate on it but I understand now. @helpful-application-7107 thanks for walking me through it and for the detailed simulation. I remember now that the last time I did an A/A test we didn't even use any tools, we just calculated the uplift deviation in Google Sheets, so this all makes sense to me and I'm glad there's nothing wrong with the implementation
@adventurous-computer-41819 thanks for the tip! we're just starting to add experimentation and tracking in new parts of our system so I don't have numbers yet, but I'll definitely take your advice and use a calculator for future experiments, thank you!