# announcements
b
Hello, is there a way or a best practice for deciding how long a Bayesian test should run? We had several cases where the results completely turned around within a week, after the test had already been running for 4 weeks. I attached 2 screenshots of an example: first we had a 36% chance to beat control (metric: Suggested Jobs Page - Qualified Application) with a -3.7% decrease, and a week later, with only 500 more users per group (2,600 to 3,100), the chance to beat control went up to 82% with a 10% increase in the same metric. It was pure chance that we left the test running for another week and got completely different results, and I don't understand how to tell this beforehand, i.e. how to be sure that the results won't change anymore after a specific runtime.
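For reference, here's roughly how I understand the number is computed: "chance to beat control" is the posterior probability that the variant's rate exceeds control's. A minimal sketch, assuming a simple Beta-Binomial model and hypothetical conversion counts (I only have the screenshots, so the exact counts are made up to match the percentages):

```python
import numpy as np

rng = np.random.default_rng(42)

def chance_to_beat_control(conv_a, n_a, conv_b, n_b, draws=200_000):
    """Monte Carlo estimate of P(variant rate > control rate), Beta(1,1) priors."""
    control = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    variant = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return (variant > control).mean()

# hypothetical counts at ~2,600 users/group: variant slightly behind
print(chance_to_beat_control(260, 2600, 250, 2600))   # comes out around 0.3
# ~500 more users with a strong week and the picture flips
print(chance_to_beat_control(300, 3100, 320, 3100))   # comes out around 0.8
```

With effects this small relative to the posterior spread, a few hundred extra users can legitimately swing the number that much.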
w
I'm not deep into Bayesian stats, but here are a few things to consider. Keep in mind the novelty effect if the feature you're testing introduces a new experience for users. Second, applications/pages looks like a ratio metric, and with ratio metrics there is almost always an additional layer of complexity. I'd probably just try to re-calculate the same stats with a frequentist approach for comparison.
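Something like this for the quick cross-check, treating the metric as a plain proportion (a real ratio metric needs more care, e.g. the delta method); the counts are the same hypothetical ones as above:

```python
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns z and the two-sided p-value."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))

z, p = two_proportion_ztest(260, 2600, 250, 2600)  # hypothetical counts
print(f"z={z:.2f}, p={p:.2f}")  # p is nowhere near 0.05 at this sample size
```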
f
this is fairly normal, when your chance to beat control is not significant, you can't draw a meaningful conclusion. If your metric has a high amount of variability, you may never get to a significant result, and the credible interval will stay wide.
b
I understand, but how do I know if the chance to beat control is significant? Bayesian statistics don't seem to give me anything to judge this by. If we use many of these ratio metrics, would you suggest switching to frequentist statistics?
f
we use 95% or 5% as significance thresholds by default
b
so I should wait for the chance to beat control to reach 95% or 5% before I stop it?
f
yes, or until you’re happy with the risk scores
b
I see, thanks
l
@billowy-horse-43368 the other thing that can be helpful in understanding how much movement your metrics have is to regularly run an “A/A” test, i.e. you randomly assign users but put both groups into the control experience. I find this really helpful for two things: 1. confirming that experiment assignment systems are working, and 2. helping the team see how metrics naturally vary. You could even define a data source that randomly assigns users to fake variants via SQL, without running an experiment at all, if you just want to look at the stats component.
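Roughly like this, sketched in Python (the SQL version would be a deterministic hash expression over user_id in your warehouse's dialect; the helper name and salt here are made up):

```python
import hashlib

def fake_variant(user_id: str, salt: str = "aa-test") -> str:
    """Deterministic 50/50 split into fake variants for an A/A readout."""
    h = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(h, 16) % 2 == 0 else "B"

# sanity-check the balance of the split
ids = [f"user_{i}" for i in range(10_000)]
share_a = sum(fake_variant(u) == "A" for u in ids) / len(ids)
print(f"share in A: {share_a:.3f}")  # should hover near 0.500
```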
👍 1
w
Yeah, A/A testing is indeed a must, it helped me find a few hidden issues. A prominent example would be an empty string for user_id. Oops, turns out it still gets assigned a variation, and an A/A test helped me find that out.
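To make that concrete with the hash-split sketch above: every empty user_id hashes to the same value, so all the anonymous/empty users silently pile into one variant and skew both the split and the metrics. A hypothetical guard:

```python
import hashlib

def fake_variant_safe(user_id, salt: str = "aa-test"):
    # "" always hashes to the same bucket, so a flood of empty IDs would
    # silently skew the split; exclude such users instead of assigning them.
    if not user_id:
        return None
    h = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(h, 16) % 2 == 0 else "B"
```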
b
Cool that you mention A/A tests, we also considered those recently. Thanks a lot @wooden-country-60054 and @late-dentist-52023!
@fresh-football-47124 sorry for getting into this again, but we have now had some cases where the chance to beat control was at 95% after a short time with a low number of users per variation, i.e. around 150, and then dropped back to around 80%. As I understood it, this should usually not happen, right? Is it worth calculating a needed sample size with an additional tool, or would you suggest still relying on the chance to beat control?
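Here's a small simulation I put together to see how often a 95% reading happens by luck alone in an A/A setup (assumptions: 10% true rate in both arms, peeking every 50 users per arm, normal approximation to the Beta posterior; all numbers made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def prob_b_beats_a(ca, na, cb, nb):
    """Normal approximation to P(rate_B > rate_A) under Beta(1,1) priors."""
    ma, mb = (1 + ca) / (2 + na), (1 + cb) / (2 + nb)
    va, vb = ma * (1 - ma) / (3 + na), mb * (1 - mb) / (3 + nb)
    return norm.sf(0, loc=mb - ma, scale=(va + vb) ** 0.5)

sims, hits = 500, 0
for _ in range(sims):
    a = rng.random(1000) < 0.10   # both arms identical: a true A/A test
    b = rng.random(1000) < 0.10
    # peek every 50 users per arm, as if checking the dashboard daily
    if any(prob_b_beats_a(a[:n].sum(), n, b[:n].sum(), n) >= 0.95
           for n in range(50, 1001, 50)):
        hits += 1
print(f"{hits / sims:.0%} of A/A runs hit 95% at some peek")  # well above 5%
```

So a 95% reading at ~150 users can easily be one of those lucky peeks.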
f
for metrics with high variability, you might want to increase the min sample size so you’re not staring at random noise
you can also have policies around not making decisions on results within your min duration period
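if it helps, a standard frequentist power calculation can serve as a floor for that min sample size, even with a Bayesian readout. A sketch, assuming a two-proportion test and a baseline rate / minimum detectable effect you'd pick yourself:

```python
from scipy.stats import norm

def min_n_per_group(p_base, rel_mde, alpha=0.05, power=0.8):
    """Users needed per variant for a two-sided two-proportion test."""
    p_var = p_base * (1 + rel_mde)
    z_alpha, z_power = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return int((z_alpha + z_power) ** 2 * variance / (p_var - p_base) ** 2) + 1

# e.g. 10% baseline conversion, aiming to detect a 10% relative lift
print(min_n_per_group(0.10, 0.10))  # on the order of 14,700 users per group
```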
b
@fresh-football-47124 is there any way to automatically calculate variability within GrowthBook? I feel like this would be a very helpful feature for understanding when tests become conclusive
f
@helpful-application-7107 any thoughts on this? ^
h
Yeah, so a measure of variability would be the cornerstone of our power calculator / general sample size estimator, which is a top priority for Q2.
There's not currently a great way to measure this with respect to the experiment of interest, but we want to make it really easy to get this information, both in the experiment results and as part of a general-purpose tool.
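In the meantime, a sketch of how you could approximate it yourself, assuming you can pull per-user numerator/denominator pairs for the ratio metric from your warehouse (the delta method is a standard way to get the variance of a ratio; the data below is simulated):

```python
import numpy as np

def ratio_metric_stats(num, den):
    """Ratio metric value and its delta-method variance, from per-user
    numerator (e.g. qualified applications) and denominator (e.g. pageviews)."""
    num, den = np.asarray(num, float), np.asarray(den, float)
    n, r = len(num), num.sum() / den.sum()
    cov = np.cov(num, den)  # 2x2 sample covariance matrix
    var_r = (cov[0, 0] - 2 * r * cov[0, 1] + r ** 2 * cov[1, 1]) / (n * den.mean() ** 2)
    return r, var_r

# simulated per-user data in place of a real warehouse pull
rng = np.random.default_rng(1)
views = rng.poisson(5, size=3000) + 1
apps = rng.binomial(views, 0.10)
r, v = ratio_metric_stats(apps, views)
print(f"ratio={r:.4f}, std err={v ** 0.5:.4f}")
```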
b
@helpful-application-7107 thanks for the update, looking forward to it!