# announcements
p
Totally love Growthbook, but I ran up a $3k BigQuery bill backtesting historical experiments and modeling our metrics over our actual event log. 🤦‍♂️ Yes, I should be pre-computing my enrollments and metrics in data pipelines. No, I didn’t have a strategy for it before. So I’m coming to you, hat in hand, wondering if folks here have any advice for metric modeling. Bonus points if anyone has figured out how to have experiment metric events also play nice with a “Metrics Layer” like dbt metrics.
f
Are you partitioning your tables by date in BigQuery? If so, you can add template variables to your data source and metric SQL, which should help a lot:
```sql
WHERE timestamp BETWEEN '{{ startDate }}' AND '{{ endDate }}'
```
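For example, a metric query against a date-partitioned events table might look something like this (the table and column names here are just placeholders for whatever you have):
```sql
-- Hypothetical date-partitioned events table.
-- Filtering on the partitioning column (timestamp) with the template
-- variables lets BigQuery prune partitions instead of scanning everything.
SELECT
  user_id,
  timestamp,
  revenue AS value
FROM analytics.events
WHERE event_name = 'purchase'
  AND timestamp BETWEEN '{{ startDate }}' AND '{{ endDate }}'
```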
We do plan to integrate with other metrics layers like dbt, but we don't have that yet.
p
Recently added those parameters to my metrics, though I wasn’t sure I could add them to my data sources too. That’ll cut the cost down a fair bit.
I’m more looking for guidance on producing aligned metric tables, so that there can be both a metrics-layer definition (used to calculate aggregates) and efficient incremental tables of the timestamped events that can be used to produce timestamped conversions for GrowthBook.
f
Pre-computing metrics and enrollment for experimentation is hard since there are multiple dimensions and variations to account for. Plus, you usually need to be able to run ad hoc analyses on the raw event data anyway, so you'll still need to maintain that.
Most companies I've seen that have tried that have given up and focused on optimizing raw event stream queries and adding caching layers instead. Incremental queries could also help.
p
I think that’s what would be ideal, both for implementation and for long-term ergonomics. Yet I’m having a hard time envisioning how that might be done. Our eventstream is so much larger than the relevant events that having to filter it on every query really adds to the cost. I can’t help but think it’d be somewhat more efficient not to pre-aggregate the metrics, but to pre-filter the events into partitioned, incremental tables:
• eventstream
  ◦ all_experiments_enrollment_events_and_dimensions
  ◦ metric0.conversion_event
  ◦ metric1.conversion_event
  ◦ metric2.conversion_event
That’d have the benefit of GrowthBook metrics just being a direct `select userid, [alt_ids], timestamp, value from metric0.conversion_event`. Would that not be advised? Admittedly, then you’ve got your metric calculation definition somewhere in the metrics layer (hitting the full eventstream), but you also need to replicate parts of that logic to produce just the conversion events (without calculating the aggregate). 🤔 Feels like a problem for the dbt Slack.
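Roughly, I’m picturing each of those pre-filtered tables as an incremental dbt model along these lines (the model, source, and column names are just placeholders):
```sql
-- e.g. models/metric0_conversion_event.sql (sketch, placeholder names)
{{ config(
    materialized='incremental',
    partition_by={'field': 'timestamp', 'data_type': 'timestamp'}
) }}

select
  user_id,
  anonymous_id,
  timestamp,
  event_value as value
from {{ source('analytics', 'eventstream') }}
where event_name = 'metric0_conversion'
{% if is_incremental() %}
  -- only scan events newer than what's already in the table
  and timestamp > (select max(timestamp) from {{ this }})
{% endif %}
```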
f
Oh, I see. So you have a single table with all events mixed together. Then yeah, I think making dedicated tables for enrollment and each metric does make sense. And dbt is definitely the best tool to handle that.
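For the enrollment table, the model just needs to end up with the columns the experiment assignment query expects (user_id, timestamp, experiment_id, variation_id). A rough sketch along the same lines, assuming a hypothetical 'experiment_viewed' event and placeholder table names:
```sql
-- e.g. models/experiment_enrollments.sql (sketch, placeholder names)
{{ config(materialized='incremental') }}

select
  user_id,
  timestamp,
  experiment_id,
  variation_id
from {{ source('analytics', 'eventstream') }}
where event_name = 'experiment_viewed'
{% if is_incremental() %}
  and timestamp > (select max(timestamp) from {{ this }})
{% endif %}
```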
p
Is there a better way to organize source events?
(Also, this is way too minute an issue to be talking to a customer about before 6am)
f
No worries. I'm usually on Central time. A single events table is fine for the raw event log, but you probably don't want to query it directly very often; use per-metric tables instead. Some event tracking tools like Segment automatically create separate tables for each event type, which is usually a decent first step. A true modelling layer like dbt gets you a lot more control though.