# announcements
f
Hi everyone, do we have any sort of GrowthBook integration with dbt (Data Build Tool)? Thanks
w
Hi. Unfortunately we don't have a direct integration with dbt at the moment, but it's something we're planning to do in the future.
p
We did a fair bit of this at the Philadelphia Inquirer. It wasn't anything native to GrowthBook; mostly just good event-based modeling practices. That said, there were a few things that were really nice, in particular how we aligned some of our dbt-style variable templating with GrowthBook's query parameters for efficient metric querying. It would've been great to version-control our metrics with dbt, which we sort of did manually.
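To give a flavor of that alignment (a reconstructed sketch, not our actual code; the table and column names are made up): GrowthBook lets you use template variables like `{{ startDate }}` and `{{ endDate }}` in metric SQL, which read almost identically to dbt's Jinja templating, so queries written against dbt-built models feel consistent across both tools.

```sql
-- Hypothetical GrowthBook metric SQL over a dbt-built model.
-- GrowthBook substitutes {{ startDate }} / {{ endDate }} per analysis,
-- much like dbt substitutes Jinja variables at compile time.
SELECT
  anonymous_id,        -- identifier column GrowthBook joins on
  ts AS timestamp,
  1 AS value           -- one row per conversion event
FROM analytics.agg_registration_event
WHERE ts BETWEEN '{{ startDate }}' AND '{{ endDate }}'
```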
f
@proud-engine-56337 Did you maintain any documentation on how you managed dbt-style variable templating with GrowthBook's parameters for efficient metric querying? We're evaluating possible options, as directly querying the warehouse has its own limitations. Some direction/guidance would really be helpful.
p
We never got around to fully using a dbt Metric as the GrowthBook metric and having GrowthBook query through the dbt metrics layer. And yes, I totally agree that the community would benefit from some cost-engineering guidance (cc @fresh-football-47124). I'm no longer there, but I'll recount things as best I can.

Initially, all of the GrowthBook queries hit our primary web-event fact table. That got comically expensive. Fast. (I accidentally blew through an extra ~USD 2k of BigQuery costs in 3 days.) Luckily, @stocky-energy-64916 did a bunch of cost engineering to get it down to pennies a day. The GrowthBook team has optimized both query performance and cost over the past year, but this should still hold.

Every query GrowthBook makes relies on your Experiment Enrollment query. Updating an experiment with 12 metrics runs 12 of these queries, so making the enrollment query efficient is key. We used dbt to create an incremental model containing just experiment enrollments, and made sure both our source fact table (`fct_hits`) and the new enrollment aggregate table (`agg_experiment_enrollment`) were timestamp-partitioned. Each build of `agg_experiment_enrollment` then only needed to scan the last day or so of the source fact table (down from ~2 TB to ~50 GB per build) and produced maybe 100 MB/day of new enrollment records. We also set the enrollment query to filter by date in its WHERE clause; with date partitioning, each enrollment query scanned only the days the experiment ran rather than the whole table.

For a while we were also building a new dbt model for each and every GrowthBook metric. It was cheaper, but bad form. So instead we built aggregates/reports around the more important conversion events (e.g., ad impressions, registrations, pageviews): focused versions of the fact table with just the rows for those events and the relevant columns. That could save upwards of 50 GB per query, simply by pushing the event-name filter into the upstream fact table instead of applying it at query time. The resulting lineage looked roughly like:
```
fct_hits
    - agg_experiment_enrollment
    - agg_pageview
    - agg_registration_event
```
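To make the shape of this concrete, here's a minimal sketch of what the incremental, timestamp-partitioned enrollment model might look like in dbt on BigQuery. This is reconstructed, not the actual Inquirer code: the column names (`anonymous_id`, `ts`, `experiment_id`, `variation_id`) and the `experiment_viewed` event name are assumptions.

```sql
-- models/agg_experiment_enrollment.sql (hypothetical sketch)
{{
  config(
    materialized='incremental',
    partition_by={'field': 'ts', 'data_type': 'timestamp', 'granularity': 'day'}
  )
}}

-- One row per unit/experiment/variation, taken from the raw exposure events.
SELECT
  anonymous_id,
  experiment_id,
  variation_id,
  MIN(ts) AS ts  -- first exposure timestamp
FROM {{ ref('fct_hits') }}
WHERE event_name = 'experiment_viewed'
{% if is_incremental() %}
  -- On incremental runs, scan only recent partitions of the source table
  -- (~50 GB) instead of its full history (~2 TB).
  AND ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
{% endif %}
GROUP BY 1, 2, 3
-- Deduplicating re-exposures across incremental runs (e.g., with a merge
-- strategy and a unique_key) is omitted to keep the sketch short.
```

The focused event aggregates (`agg_pageview`, `agg_registration_event`) follow the same pattern: the same incremental + partitioning config, filtered to a single event and trimmed to just the columns GrowthBook needs.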
This is all to say: knowing how to use dbt and your data warehouse effectively can make GrowthBook work well and overcome some of those direct-query limitations. (This is probably worth writing up properly as a blog post at some point rather than rushing it as I'm running out the door now 😉)
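For completeness, the GrowthBook side of it. I believe GrowthBook supports `{{ startDate }}` / `{{ endDate }}` template variables in the Experiment Assignment/Enrollment query, which is what makes the partition pruning work; again, the table and column names here are illustrative.

```sql
-- Hypothetical GrowthBook Experiment Assignment / Enrollment query.
SELECT
  anonymous_id,
  ts AS timestamp,
  experiment_id,
  variation_id
FROM analytics.agg_experiment_enrollment
-- With daily partitioning on ts, BigQuery scans only the experiment's
-- date range rather than the whole enrollment table.
WHERE ts BETWEEN '{{ startDate }}' AND '{{ endDate }}'
```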
❤️ 1
f
@helpful-application-7107 ^
h
This writeup is great, David. We already wanted to do a few things on GrowthBook's end to help with this issue:
• Reduce uses of the Experiment Assignment/Enrollment query per experiment analysis
• Leverage fact tables within GrowthBook to reduce scans of metric sources
• Eventually, consider ways to cache unit-level assignment tables so that future scans of the assignment source can look at less data

You've definitely raised some good ways we can prompt users, in-app, to consider approaches that will improve the efficiency of their queries through cost engineering.