Can you clarify the difference in how minimum samp...
# give-feedback
l
Can you clarify the difference in how minimum sample size is applied for duration metrics? For a binomial metric the minimum sample size is described as:
The number of conversions required in an experiment variation before showing results (default 150)
For a duration metric the minimum sample size is described as:
The total duration required in an experiment variation before showing results (default 02:30)
So binomial appears to apply as a count, but duration depends on the value. If we are testing small values (e.g. load times, things in milliseconds) and were to use `avg(time_in_ms/1000.0)` as our user value aggregation in an attempt to understand if an experiment has a positive or negative impact on load time, how does that relate to the total duration for the minimum sample size? Should we just set that setting to `0`? E.g. if our average user saw 0.1 seconds, do I need 1500 average users to have reached that total duration threshold or is there a different interpretation?
h
Minimum sample size mainly exists to keep experiment results from being displayed prematurely. If your metric values are very small (as it appears they will be for your average load time), then you probably do want to lower your minimum sample size. To your initial question, the minimum sample size prevents the front-end from showing results when BOTH comparison variations (e.g. control and treatment) have a sum of user-level values that is less than the min sample size. If the sum of the user-level aggregates for either variation is greater than the min sample size, then we display the results in the front-end.
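To make that rule concrete, here is a minimal sketch of the check as described above. It is not the actual implementation: the function name `results_visible` and the data layout are hypothetical, and treating a sum exactly equal to the threshold as "met" is an assumption.

```python
import math


def results_visible(variation_values, min_sample_size):
    """Return True if results would be displayed under the rule described above.

    variation_values: dict mapping a variation name to a list of user-level
    values (for a duration metric, seconds per user).
    min_sample_size: threshold in the same units as the metric values.

    Hypothetical helper for illustration only. Results are hidden only while
    EVERY variation's sum is below the threshold, so one variation reaching
    it is enough. Using >= at the boundary is an assumption.
    """
    return any(
        math.fsum(values) >= min_sample_size
        for values in variation_values.values()
    )


# 50 users per variation averaging 0.1s gives a sum of about 5s each,
# which is far below the default of 150s (02:30).
small = {"control": [0.1] * 50, "treatment": [0.1] * 50}
print(results_visible(small, 150))

# Setting the threshold to 0 disables the gate entirely.
print(results_visible(small, 0))
```

The point of the sketch is that the gate looks at a sum of values, not a count of users, so the same number of users can pass or fail depending on how large their values are.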
if our average user saw 0.1 seconds, do I need 1500 average users to have reached that total duration threshold
Yes, but this would be per variation, roughly.
You can set it to 0 and rely on the people interpreting results to have the patience to wait for enough users to be bucketed, or you can set it to some larger value. If you set the min sample size to, say, 10, and your average is 0.1s, then you should start to see results when either variation gets to around 100 users, depending on the actual values you get (as per your original interpretation).
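The arithmetic behind those numbers, as a rough sketch. The helper name `approx_users_needed` is made up for illustration, and it assumes the threshold is expressed in the same units as the user-level value (seconds here) and that every user contributes roughly the average.

```python
def approx_users_needed(min_sample_size, avg_user_value):
    """Estimate how many users one variation needs before its summed
    user-level values reach the minimum sample size.

    Hypothetical helper: assumes each user contributes about
    avg_user_value, so the sum grows linearly with the user count.
    """
    if avg_user_value <= 0:
        raise ValueError("avg_user_value must be positive")
    return min_sample_size / avg_user_value


# Default threshold of 02:30 (150 seconds) with users averaging 0.1s:
# roughly 1500 users in a variation.
print(round(approx_users_needed(150, 0.1)))

# Threshold lowered to 10 seconds with the same average:
# roughly 100 users in a variation.
print(round(approx_users_needed(10, 0.1)))
```

For a metric like video consumption, where a single user may contribute minutes or hours, the same formula gives a much smaller user count, so the threshold likely needs to be chosen per metric.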
l
👍 thanks for confirming, I was expecting “sample size” to refer to the number of users (or potential exposures in a multiple exposures context) included rather than the sum of the duration. E.g. if 150 users in each cohort were averaging 0.1s, I would have expected that to have met the threshold the same as 150 users averaging 1000 seconds. Does that also mean that at least 1 user in each variant with a very large value (real or error), e.g. 1 user with 3600 seconds of value, would also be sufficient for results to display?
h
The name is (and I'm guessing here) likely a holdover from when it was targeted at binomial metrics, but I agree it seems like it should refer to users, and we could improve that.
l
We have both scenarios: things like that “load time” concept, as well as video consumption, which can be hours' worth of time.
h
Does that also mean that at least 1 user in each variant with a very large value (real or error), e.g. 1 user with 3600 seconds of value, would also be sufficient for results to display?
That's correct. You could use hard capping (e.g. winsorization) to prevent this, and we're working on adding percentile capping.
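As a rough illustration of why capping helps, the sketch below clips each user-level value at a fixed maximum before summing. The function names and the cap of 60 seconds are hypothetical, and this is a plain fixed cap, not the product's own capping logic or the upcoming percentile capping.

```python
import math


def capped_sum(user_values, cap):
    """Sum user-level values after clipping each one at `cap`.

    Hypothetical helper showing a simple fixed cap: any value above
    `cap` is replaced by `cap` before it is added to the total.
    """
    return math.fsum(min(value, cap) for value in user_values)


def meets_threshold(user_values, min_sample_size, cap=None):
    """Check one variation's summed values against the minimum sample size,
    optionally applying the fixed cap first (illustrative only)."""
    total = math.fsum(user_values) if cap is None else capped_sum(user_values, cap)
    return total >= min_sample_size


# One user with 3600 seconds (real or a logging error) plus a few normal users.
values = [3600.0, 0.1, 0.1, 0.1]

# Without a cap, that single user clears the default 150s threshold alone.
print(meets_threshold(values, 150))

# With a hypothetical cap of 60s, the outlier contributes at most 60s,
# so the variation no longer clears the threshold.
print(meets_threshold(values, 150, cap=60))
```

A cap chosen this way limits how much any one user can contribute, both to the threshold check and to the metric itself, so it should be set well above the values you expect from legitimate users.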
l
👍 ok, that helps, we can use that to adjust the configuration for our duration metrics.
h
Great! Sorry for the lack of clarity but hopefully the levers are there to help you get the functionality you want. I'll note down the UX concerns and if there's some functionality that would be nice to have, definitely let us know.
l
Sure, one approach could be for the two metric types not to share the same global default when a metric is created (currently the default of “150” is simply translated into 2 minutes 30 seconds for duration metrics).
h
Agreed.