We’re running an A/A experiment (random assignment to two variants, but users see the same product). There has typically been small variation in metrics in GrowthBook, but most of the violin plots are centered close to 0 now that ~50,000 users have been assigned. Our assignments should be random.
But one of our core binomial metrics is showing a 2.3% drop in the “experiment” group. GrowthBook computes a 3.68% chance that the variant will beat the control, when I’d expect ~50%, since the product is the same. This is a ratio metric, with the denominator set to one of the other states in our funnel. The denominator’s metric shows the expected 0% change. There’s a conversion window on the metric, but I tried copying the metric and 1) setting the denominator to all experiment users and 2) modifying the conversion window, and the percent change / probability distribution is roughly the same.
This experiment has been running for 4 months. I created a new experiment phase and looked at just the data from the last month, and I wasn’t able to reproduce this. I’m seeing a 0% change for the new phase.
I had a few questions, and I’m open to any advice people have on interpreting the results of A/A tests in Growthbook:
• What could help explain this outcome? Are these false positives expected, even after such a long experiment length?
• Could this suggest a flaw in our assignment logic or our metric definition in any way? Again, the difference in most other metrics is close to 0, and the second phase showed a 0% difference, so the first result (on 4x the users) is confusing.
• Does this suggest that I should be skeptical of any experiment result that moves the metric less than ~2.3%? Is the A/A test helping me understand the underlying natural variance for the metric in any way?
08/14/2022, 12:52 AM
It's likely just a false positive. One thing you can do is break down the results by different dimensions and see if one of them explains the change. For example, maybe one of your variations randomly happened to get more US visitors in it and they might convert at a much higher rate for that metric.
That would tell you that when looking at that metric, you should make sure to take that dimension into account before making conclusions for an experiment.
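If you want to sanity-check a breakdown outside the UI, here is a minimal Python sketch of the idea. The `breakdown` helper and the rows are made up purely for illustration; in practice the rows would come from your own assignment and conversion data.

```python
from collections import defaultdict


def breakdown(rows):
    """Summarize (variation, dimension_value, converted) rows per cell."""
    counts = defaultdict(lambda: [0, 0])  # key -> [users, conversions]
    for variation, dim_value, converted in rows:
        cell = counts[(variation, dim_value)]
        cell[0] += 1
        cell[1] += int(converted)
    return {
        key: {"users": users, "rate": conversions / users}
        for key, (users, conversions) in counts.items()
    }


# Made-up rows purely for illustration (not real experiment data).
rows = [
    ("control", "US", True), ("control", "US", False),
    ("control", "Other", False), ("control", "Other", False),
    ("variant", "US", True),
    ("variant", "Other", False), ("variant", "Other", False),
    ("variant", "Other", False),
]

# Print users and conversion rate for each variation/dimension cell.
for (variation, dim_value), stats in sorted(breakdown(rows).items()):
    print(variation, dim_value, stats["users"], round(stats["rate"], 2))
```

Comparing both the user counts and the rates per cell shows whether one variation simply received a different mix of a high-converting segment.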
08/14/2022, 1:07 AM
Thanks, that helps. Is there a fixed false positive rate in Growthbook? e.g. 5% of experiment-metric results will be false positives? I'm not sure how to reason about this so if you have other rules of thumb when interpreting results it would be great to hear!
Is there even a concept of a long-term Type I error rate across experiments with Bayesian methods? I had thought that as we collected more data and the risk of choosing the winning variant went to 0, we could have relative confidence that it was the winner. But is there still a fixed false positive rate even in these scenarios where the probability distribution has no overlap with 0 and it looks like a clear winner?
08/14/2022, 7:35 PM
Bayesian methods focus more on minimizing loss as opposed to worrying about the false positive rate directly. If you make the wrong decision, but the risk/loss is super low, it's not really that bad.
That being said, you are likely using some decision rules for Bayesian experiments like "if chance to beat control goes above 95%, call it a winner". In that case, you are in essence choosing to have a 5% false positive rate across experiments.
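As a rough way to see what that kind of decision rule implies, here is a minimal Python sketch that simulates many A/A tests and counts how often the variant would be called a winner. It uses a normal approximation for chance to beat control, which is an assumption for illustration and not GrowthBook's actual stats engine. The function names, sample sizes, and the 10% base rate are all made up.

```python
import math
import random


def chance_to_beat_control(conv_a, n_a, conv_b, n_b):
    """Approximate P(rate_B > rate_A) with a normal approximation.

    Assumption: this stands in for a Bayesian engine with flat priors;
    it is not GrowthBook's implementation.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        return 0.5  # no information to separate the arms
    z = (p_b - p_a) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def false_winner_rate(n_experiments=1000, users_per_arm=1000,
                      true_rate=0.10, threshold=0.95, seed=0):
    """Share of simulated A/A tests where the variant clears the threshold."""
    rng = random.Random(seed)
    winners = 0
    for _ in range(n_experiments):
        # Both arms draw from the same true conversion rate (A/A test).
        conv_a = sum(rng.random() < true_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(users_per_arm))
        ctbc = chance_to_beat_control(conv_a, users_per_arm,
                                      conv_b, users_per_arm)
        if ctbc > threshold:
            winners += 1
    return winners / n_experiments


print(false_winner_rate())
```

Under these assumptions the share of false winners should land near 1 - threshold. If you also act on results below 5% chance to beat control (calling the control the winner), the combined rate is roughly double.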
08/14/2022, 7:49 PM
So if GrowthBook shows a 99.99% chance of a variant beating control and the risk has stabilized at 0, is there truly only a 0.01% chance that the result is a false positive? If we can use 100% - chance to beat control as a heuristic for a “false positive threshold”, then that definitely helps us reason through it.
08/14/2022, 8:11 PM
Yes, I think that's the right way to think about it. The one thing GrowthBook does not take into account right now is a multiple testing correction. So if you have lots of metrics, or look at lots of different dimension breakdowns, your false positive rate will increase since you are in essence doing a lot of mini experiments. If you have 20 independent metrics and each one shows 95% chance to beat control, it's likely one of them is a false positive.
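To put numbers on that, the arithmetic for independent metrics can be sketched in a few lines of Python. The helper names are hypothetical, and the Bonferroni-style threshold is a generic correction shown only for comparison, not something GrowthBook applies.

```python
def family_wise_error_rate(n_metrics, per_metric_rate=0.05):
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - per_metric_rate) ** n_metrics


def expected_false_positives(n_metrics, per_metric_rate=0.05):
    """Average number of false positives across the metrics."""
    return n_metrics * per_metric_rate


def bonferroni_threshold(n_metrics, overall_rate=0.05):
    """Stricter per-metric threshold that keeps the overall rate near target.

    Assumption: a simple Bonferroni-style split of the overall rate.
    """
    return 1 - overall_rate / n_metrics


print(round(family_wise_error_rate(20), 4))    # at least one false positive
print(round(expected_false_positives(20), 2))  # average count
print(round(bonferroni_threshold(20), 4))      # stricter per-metric cutoff
```

With 20 independent metrics at a 5% per-metric rate, the chance of at least one false positive is about 64%, and on average you would expect one. Metrics in a real funnel are usually correlated, so treat these figures as a rough guide.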