# ask-questions
s
Hello guys! I have a question about capping. There are several strategies: the first is to cap every metric based on the primary, meaning that for subsequent metrics we cap ONLY the users who were capped in the primary metric. The second (as in GrowthBook) is to cap each metric independently, so a completely different set of users is affected for each metric; not to mention ratio metrics, where the numerator and denominator are capped separately.

Detailed analysis of the strategies

Strategy 1: Single selection by the primary metric
Pros:
• Statistical coherence: all metrics are analyzed on the same selected sample of users, which ensures consistent interpretation of results.
• Multiple-testing control: the risk of false positives is reduced, since sample selection is independent of the secondary and guardrail metrics.
• Preservation of the correlation structure: because capping is applied to the same group of users for all metrics, the natural relationships between metrics are preserved.
• Focus on the main test goal: the primary metric defines the sample, which aligns with the experiment's main hypothesis.
Cons:
• Potential information loss: outliers in secondary or guardrail metrics may carry valuable information that is lost when selection is driven by the primary metric.
• Suboptimal sensitivity: for metrics with different variances, a single capping scheme may be suboptimal.

Strategy 2: Independent quantile computation
Pros:
• Optimization for each metric: each metric gets its own capping threshold, which can improve the sensitivity of individual tests.
• Retention of metric-specific information: outliers that matter for a specific metric remain in the analysis.
• Analytical flexibility: allows different outlier-handling methods depending on each metric's specifics.
Cons:
• Violation of statistical comparability: different samples for different metrics can lead to contradictory conclusions and complicate interpretation.
• Increased multiple-testing risk: independent selection for each metric raises the probability of false positives.
• Interpretation complexity: results are based on different user samples, which makes the overall effect of the intervention harder to understand.
• Loss of coherence: the primary metric may improve on one sample while a guardrail metric worsens on another.

Impact on interpretation and conclusions

With the first strategy:
• Integrity of results: all metrics are analyzed on one coherent sample, giving a unified interpretation of the effect.
• Lower risk of conflicting conclusions: since all metrics are measured on the same users, results are more consistent.
• Easier decision-making: a clear metric hierarchy simplifies interpreting the experiment's outcomes.

With the second strategy:
• Potential inconsistency: different samples may show opposite effects for different metrics.
• Cause-and-effect ambiguity: it is hard to tell whether an observed effect comes from the intervention or from differences between the samples.
• Need for additional adjustments: more advanced statistical methods are required for a proper analysis.

Statistical considerations

Error control: primary metrics need multiple-testing correction, whereas guardrail metrics do not, since the goal there is to demonstrate non-inferiority.
Test power: the first strategy gives more stable statistical power thanks to sample coherence, while the second can paradoxically lose power as data volume grows because the metrics follow different temporal patterns.

Why do you only use the second strategy? Which one is more appropriate? (I've put a toy sketch contrasting the two schemes below.)
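To make the comparison concrete, here is a minimal sketch of the two schemes on made-up data. The DataFrame, the column names `primary` and `secondary`, the 99th-percentile cap, and the exact clipping rule used for Strategy 1 are all assumptions for illustration, not how any particular platform implements it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy data: one row per user, heavy-tailed metrics.
df = pd.DataFrame({
    "primary":   rng.lognormal(3.0, 1.0, 10_000),
    "secondary": rng.lognormal(2.0, 1.5, 10_000),
})
P = 0.99  # capping quantile (an assumption)

# Strategy 1: the primary metric decides WHICH users are capped;
# the same user set is then capped on every other metric as well.
flagged = df["primary"] > df["primary"].quantile(P)
s1 = df.copy()
for col in ["primary", "secondary"]:
    s1.loc[flagged, col] = s1.loc[flagged, col].clip(upper=df[col].quantile(P))

# Strategy 2: each metric is winsorized independently at its own quantile,
# so a different set of users is affected for each metric.
s2 = df.copy()
for col in ["primary", "secondary"]:
    s2[col] = s2[col].clip(upper=df[col].quantile(P))

# The two schemes generally touch different users on the secondary metric.
print(flagged.sum(), (df["secondary"] > df["secondary"].quantile(P)).sum())
```

Under Strategy 1 a single `flagged` set drives every metric; under Strategy 2 each metric chooses its own affected users, which is the core of the question above.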
h
Hi Vladislav. We are winsorizing metrics, not throwing away users. Every metric is winsorized independently by our capping, but the set of users included in every analysis is the same. That may change your analysis here. Our approach ensures we always analyze the same group of users; we only curtail information for a metric, by winsorizing it, in cases where that introduction of bias (through capping) best reduces the variance for that particular metric.
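A tiny illustration of that distinction, on toy data and with an assumed 99th-percentile cap (not GrowthBook's actual implementation): winsorization clips values at a cap but keeps every user, so each metric can carry its own cap while the analyzed user set stays identical.

```python
import numpy as np

rng = np.random.default_rng(1)
revenue = rng.lognormal(2.0, 1.5, 5_000)
orders = rng.poisson(3.0, 5_000).astype(float)

def winsorize_upper(x, q=0.99):
    """Clip values above the q-th quantile; no user is dropped."""
    return np.minimum(x, np.quantile(x, q))

rev_w, ord_w = winsorize_upper(revenue), winsorize_upper(orders)

# Each metric gets its own cap, but both analyses run over the same users.
assert len(rev_w) == len(revenue) and len(ord_w) == len(orders)
```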
s
Yes, I understand, but the point is that when we analyze each metric with winsorization, we are in effect doing it on a different subgroup of the users who ended up in the test. Essentially, we draw conclusions not from the entire population, and not from one subgroup of users shared across all metrics, but from each metric's own unique subgroup, which does not let us accurately assess the effect of the feature on specific users: the metric becomes blurred. Given the significant outliers produced by highly active users, and the fact that these outliers can strongly affect a metric's volatility and the final results, drawing conclusions from the results of such winsorization does not seem quite correct.
h
Winsorization limits the impact of these outliers; it does not remove them from the sample. So you are affecting the results of different sub-groups across metrics, but you are not removing anyone from the analysis altogether. I think your other approach is reasonable, but much more complicated to implement. Introducing strong cross-metric dependencies runs against many of the goals of our platform: to be transparent about how it works, and to ensure that something like adding a new metric does not drastically change the results of other metrics. It is also worse for variance reduction, if that is your main goal. It is nice to know that you are winsorizing on the same users across metrics, but I don't think that solves the fundamental problem, which is that winsorization still introduces bias into all of your analyses. Since you are accepting that bias anyway, you may as well drive the greatest amount of variance reduction on that metric as possible.
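On the variance-reduction point, here is a rough simulation under one possible reading of the primary-driven scheme (a secondary metric is clipped only for users flagged by the primary metric's cap). The data, correlation structure, and 99th-percentile cap are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
# Two heavy-tailed metrics that are positively but imperfectly correlated.
latent = rng.normal(size=n)
a = np.exp(2.0 + 1.0 * latent + rng.normal(0.0, 0.8, n))  # "primary"
b = np.exp(1.5 + 0.6 * latent + rng.normal(0.0, 1.2, n))  # "secondary"

cap_a, cap_b = np.quantile(a, 0.99), np.quantile(b, 0.99)

# Per-metric winsorization: clip b for whoever exceeds b's own cap.
b_own = np.minimum(b, cap_b)

# Primary-driven scheme: clip b only for users whose a exceeds a's cap.
flagged = a > cap_a
b_primary = b.copy()
b_primary[flagged] = np.minimum(b_primary[flagged], cap_b)

# Many of b's extreme values belong to users that a never flags, so they
# survive the primary-driven cap and b's variance stays higher.
print(np.var(b), np.var(b_own), np.var(b_primary))
```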
s
Yes, you are absolutely right! However, according to methodologies adopted in big tech and validated in scientific articles and the subject literature, it is recommended to use one, at most two, main metrics in an experiment, and when winsorizing, to select the users for secondary and guardrail metrics based on the primary metric.

Here's an example. Say we run an experiment in a large marketplace:
1. Group A happens to include a millionaire who made a single purchase: a watch for one million dollars.
2. Group B includes a wholesale buyer who made many orders, resulting in a large total amount.

The primary metric is the average order value: the total sum of orders per user divided by the number of orders. The secondary metric is average revenue per buyer: total revenue divided by the number of buyers.

How would separate winsorization of these metrics affect the results? Would Simpson's paradox occur here? Let me try to answer these questions myself. When I model the scenario with the millionaire (Group A) and the wholesale buyer (Group B), critical issues show up.

Without winsorization:
• Primary metric (average order value): A = $7,067.87, B = $180.30 (A wins)
• Secondary metric (average revenue per buyer): A = $10,107.05, B = $364.20 (A wins)

With separate winsorization:
• Primary metric: A = $76.26, B = $57.55 (A still wins)
• Secondary metric: A = $109.05, B = $116.25 (now B wins!)

Simpson's paradox appeared, specifically in the secondary metric. This happened because of:
• different winsorization thresholds for each metric,
• asymmetric influence of the outliers on the two groups, and
• sample inconsistency between metrics.

The errors in the current approach are as follows:
1. Independent winsorization of metrics. Applying separate winsorization to each metric contradicts the scientific recommendations: it distorts the relationships between metrics, breaks the linkage between primary and secondary indicators, and can produce contradictory conclusions.
2. Violation of the unified-sample principle. Big-tech methodology (Microsoft, Netflix, Google) emphasizes using the same user sample for all metrics: secondary metrics must be analyzed on the same users as the primary metric.
3. Absence of stratification. The experiment above is a classic case where different user types (a millionaire vs. regular buyers) require stratification before the experiment.

Recommendations for adjusting the methodology:
1. Unified winsorization at the user level. Determine outliers using the primary metric on the combined sample (control + treatment), apply identical winsorization thresholds to both groups, and use the same user sample for all metrics.
2. Metric hierarchy. Following big-tech methodology: one primary metric for decision-making; secondary metrics analyzed on the same sample as the primary; guardrail metrics to control for negative effects.
3. Application of CUPED. Use CUPED (Controlled-experiment Using Pre-Experiment Data) to reduce variance: it reduces the influence of outliers without removing them, increases the statistical power of the experiment, and keeps the estimates unbiased (I've put a small sketch of the adjustment at the end of this message).
4. Stratified randomization. Predefine user segments (regular, VIP, wholesalers), ensure balanced allocation across groups, and apply block randomization within strata.

In conclusion, separate winsorization of metrics creates serious methodological problems. In this case, Simpson's paradox is a symptom of a deeper problem: the violation of sample consistency between metrics. The correct approach requires unified winsorization at the user level based on the primary metric, one sample for all metrics, modern variance-reduction methods (CUPED), and pre-stratification on known factors. This corresponds to the methodologies used in leading tech companies and is supported by the scientific literature on experimental design.
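For reference, a minimal sketch of the standard CUPED adjustment mentioned above, using a pre-experiment covariate for the same users; the function name, toy data, and covariate choice are illustrative assumptions, not any particular platform's implementation:

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).

    y: in-experiment metric values; x: pre-experiment covariate for the same
    users. The adjustment keeps the mean but removes variance explained by x.
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Toy usage: pre-period revenue predicts in-period revenue.
rng = np.random.default_rng(2)
pre = rng.lognormal(2.0, 1.0, 10_000)
post = 0.8 * pre + rng.normal(0.0, 2.0, 10_000)
adj = cuped_adjust(post, pre)
print(np.var(post), np.var(adj))  # adjusted variance should be noticeably smaller
```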