abundant-zebra-16833
10/17/2023, 6:39 PM
`gbstats`. Specifically, I wanted to implement clustered standard errors into the sample mean tests ([formula 11 from the Duflo, Glennerster, Kremer 2006 paper](https://www.nber.org/system/files/working_papers/t0333/t0333.pdf)). I saw some overall guidelines in `CONTRIBUTING.md`, and I think I've zeroed in on the file where I should be making my edits: I think it's in [gbstats/shared/models.py](https://github.com/growthbook/growthbook/blob/main/packages/stats/gbstats/shared/models.py#L37-L51). But I wanted to check in with someone (@helpful-application-7107) before I just got to coding. My main confusion is about where the data gets processed so that we can identify that the unit of analysis != unit of randomization.
helpful-application-7107
10/17/2023, 6:54 PM
`gbstats` itself may be less painful (you'll need entirely new `Statistic` objects that take additional summary stats as their constructor values), but then hooking it up to GrowthBook will require more fundamental changes.
abundant-zebra-16833
10/17/2023, 7:00 PM
`within_cluster_variance`, since clusters will vary in their variances. We settled on just choosing the average variance from a sample of clusters (since computing the variances for all clusters often takes forever).
• This is more of a theoretical question: how do we treat the intracluster correlation given that clusters can evolve, gaining and losing members over time and also overlapping in members? Current plan was... ignoring this.
helpful-application-7107
10/17/2023, 7:04 PM
`t` and then rolling up data into time `t+1` when group membership could have changed, you're getting onto shaky ground. I think you have to assume that their original cluster is the cluster you care about, although it can get worse and worse in the case of online A/B testing, where exposure mappings are not just 0/1 because of joining new clusters.
abundant-zebra-16833
10/17/2023, 7:07 PM
helpful-application-7107
10/17/2023, 7:08 PM
`estimatr`, a package I worked on back then: https://declaredesign.org/r/estimatr/articles/mathematical-notes.html#cluster-robust-variance-and-degrees-of-freedom
I can't vouch for my clarity of thinking on the issue at the time, but there's some more references in there.
abundant-zebra-16833
10/17/2023, 7:14 PM
helpful-application-7107
10/17/2023, 7:14 PM
> Specifically, I wanted to implement clustered standard errors

You've done this to yourself, Angela

> Sometimes our samples themselves are just really big, and so it was taking a while to calculate the ICC.

What was the process you were using?
abundant-zebra-16833
10/17/2023, 7:27 PM
`cluster_id, user_id, metric_value` and they wanted to calculate the overall variance (fine), the between-cluster variance (also fine), and then the within-cluster variance (slow). They were using R; I could probably dig up the notebook and DM some excerpts. But I think it may have been a loop through each `cluster_id` to calculate that group's var. AKA I don't think a covariance matrix was used.
helpful-application-7107
10/17/2023, 8:11 PM