Hi everyone. I am trying to wrap my head around implementing guardrail metrics (especially “trust-related” ones).
The business request is “test conversion into registration, but discard the results and investigate if we see a significant change in the guardrail metric (either an uplift or a drop)”, the guardrail metric being page load speed.
I am not sure how to implement this correctly while keeping the false negative and false positive rates at their target levels (0.20, i.e. 80% power, and 0.05 respectively).
Could anyone point me to relevant resources? The only resource I have is Kohavi's book “Trustworthy Online Controlled Experiments”, which covers guardrail metrics but says nothing about controlling the false negative and false positive rates when implementing them in experiments.
03/21/2023, 11:49 AM
Whatever you can measure and associate with the variation, you can compare. Page load time is a number. The first difficult part is probably the technical side of collecting that information (page load time within the scope of the experiment). The second point is the calculation: if your splitting unit is the user and your analytical unit is the page, then with multiple pageviews per user you are dealing with a ratio metric (total_time/pageviews). You'll need to keep that in mind when creating the metric. Just my two cents (I'm not from the GB team).
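To make the ratio-metric point concrete, here is a minimal sketch of one common way to handle it: the delta method, which approximates the standard error of total_time/pageviews when users are the randomization unit. The function names and the input layout (one total load time and one pageview count per user) are assumptions for illustration, not something from the thread or from any particular tool.

```python
import math
from statistics import mean


def ratio_metric_with_se(total_time, pageviews):
    """Return (ratio, standard error) for sum(total_time) / sum(pageviews).

    total_time[i] and pageviews[i] are the per-user totals for user i.
    The standard error uses the delta-method approximation.
    """
    n = len(total_time)
    mx, my = mean(total_time), mean(pageviews)
    ratio = mx / my
    # Sample variances and covariance across users (n - 1 denominator).
    var_x = sum((x - mx) ** 2 for x in total_time) / (n - 1)
    var_y = sum((y - my) ** 2 for y in pageviews) / (n - 1)
    cov_xy = sum((x - mx) * (y - my)
                 for x, y in zip(total_time, pageviews)) / (n - 1)
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * my ** 2)
    # Guard against tiny negative values caused by floating-point rounding.
    return ratio, math.sqrt(max(var_ratio, 0.0))


def compare_ratio_metrics(control, treatment):
    """Two-sided z-test on the difference of two ratio metrics.

    control and treatment are (total_time, pageviews) tuples of per-user lists.
    Returns (difference, z statistic, p-value).
    """
    r_c, se_c = ratio_metric_with_se(*control)
    r_t, se_t = ratio_metric_with_se(*treatment)
    z = (r_t - r_c) / math.sqrt(se_c ** 2 + se_t ** 2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return r_t - r_c, z, p_value
```

The z-test relies on a normal approximation, so it is only reasonable with a large number of users per variation.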
03/21/2023, 2:37 PM
Hi Yury. Yes, all the potential issues you mentioned have been addressed. I am just not sure how to set up the test correctly so that I get the desired false negative and false positive rates. When monitoring a single metric without guardrails, the calculations are straightforward. When guardrails come into play, that’s where I get confused.
I checked this article that describes the guardrails framework at Airbnb, but there’s nothing about error rate control.
They do mention, though, that multiple guardrail metrics lead to a higher probability of a “false alarm”.
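A rough illustration of why the false alarm probability grows with the number of guardrails: if each guardrail is tested at level alpha and the tests are assumed independent (a simplifying assumption; real metrics are usually correlated), the chance of at least one false alarm is 1 - (1 - alpha)^k. The function names below are hypothetical, and Bonferroni is shown only as one simple, conservative way to compensate.

```python
def familywise_false_alarm_rate(alpha, n_guardrails):
    """Probability of at least one false alarm across independent guardrails."""
    return 1 - (1 - alpha) ** n_guardrails


def bonferroni_alpha(alpha, n_guardrails):
    """Per-guardrail significance level that keeps the overall rate <= alpha."""
    return alpha / n_guardrails


# One guardrail at 0.05 stays at 0.05; three guardrails give roughly 0.14.
for k in (1, 2, 3):
    print(k, round(familywise_false_alarm_rate(0.05, k), 4), bonferroni_alpha(0.05, k))
```

Testing each guardrail at a stricter level reduces power for a fixed sample size, so the sample size calculation would need to use the adjusted alpha.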
03/21/2023, 2:52 PM
A guardrail metric is just another metric, nothing special; you are just making sure it's not moving in an unwanted direction. You can apply the same Type I and Type II error calculation as for any other metric when doing power analysis. If you don't want to dive into all the complexity of ratio metrics, you can just compare the two data sets after calculating the average for each user. In some cases that's not the right thing to do, but in your case I think it would be alright, as you are probably interested in catching only a fairly big move of the guardrail metric. So it's just as simple (or complicated) as running a power analysis for two numeric datasets (variations) of page load times (average per user) using any statistical significance tool.
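As a sketch of that power analysis, the standard normal-approximation formula for comparing two means gives the users needed per variation as n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / diff)^2. The function name and parameters below are hypothetical; sd is the standard deviation of per-user average load time (which you would estimate from historical data) and min_detectable_diff is the smallest shift in load time you want to catch.

```python
from math import ceil
from statistics import NormalDist


def users_per_variation(sd, min_detectable_diff, alpha=0.05, power=0.80):
    """Users needed per variation for a two-sided, two-sample comparison of means.

    Uses the normal approximation and assumes equal group sizes and equal
    variance in both variations.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / min_detectable_diff) ** 2
    return ceil(n)


# Example with made-up numbers: sd of 1.0 s, detect a 0.5 s shift.
print(users_per_variation(sd=1.0, min_detectable_diff=0.5))  # 63
```

Because the test is two-sided, the same sample size covers both an uplift and a drop in load time. The guardrail and the conversion metric each get their own calculation, and the experiment needs the larger of the two sample sizes.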