# ask-questions
f
Hello GrowthBook team, first - cannot resist giving a shoutout for building such a fabulous product that makes several things easier. One challenge we are running into is the cost of the final query that fires in BigQuery. In the individual CTEs (WITH clauses) we were able to add extra where conditions to keep the scan limited. For example:
AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
However, when the full query runs, it does not seem to honor the partitions, and BigQuery either takes more than 6 hours to run (after which GrowthBook times out) or complains there is not enough quota available (which is a setting on our end). I was wondering if there is a way to add where clause conditions to the outer query (the final select statement) to restrict it to the timestamp range we want, so that BigQuery uses the partitions?
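(For context, a minimal sketch of the behavior in question, assuming the source table is partitioned on event_timestamp; table and column names are taken from the query shared below, and this is only illustrative, not the generated query:)
```sql
-- Illustrative sketch only; assumes ab_sessions is partitioned on event_timestamp.
-- BigQuery can prune partitions when the filter on the partition column sits
-- where the partitioned table is scanned:
SELECT
  session_id,
  event_timestamp
FROM
  `bqprojectname.tf_ab_test_analysis.ab_sessions`
WHERE
  event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00';

-- A filter added only to the outermost SELECT of a query built from aggregated
-- CTEs generally cannot reduce what the inner scans read, especially once the
-- partition column is no longer present at that stage.
```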
h
Hi! Can you share the full query that is timing out? Just so I know for sure that we're looking at the same issue.
I'm wondering whether it is the final select statement that is the problem, or whether it's our initial scan over the experiment assignment and/or metric source. I just want to see what you've done so that I can make sure I'm diagnosing the same query.
f
Hi Luke, thank you for getting back to me. Here's a query that throws the out-of-quota error:
```sql
-- Conversion By Session ID (binomial)
WITH
  __rawExperiment as (
    SELECT
      session_id AS user_id,
      event_timestamp AS timestamp,
      experiment_key AS experiment_id,
      branch_key AS variation_id,
      experiment_name,
      branch_name AS variation_name
    FROM
      `bqprojectname.tf_ab_test_analysis.ab_sessions` AS e
    WHERE
      experiment_key = 'xyz'
      AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
  ),
  __experiment as ( -- Viewed Experiment
    SELECT
      e.user_id as user_id,
      cast(e.variation_id as string) as variation,
      CAST(e.timestamp as DATETIME) as timestamp,
      CAST(e.timestamp as DATETIME) as conversion_start,
      DATETIME_ADD(CAST(e.timestamp as DATETIME), INTERVAL 72 HOUR) as conversion_end
    FROM
      __rawExperiment e
    WHERE
      e.experiment_id = 'xyz'
      AND CAST(e.timestamp as DATETIME) >= DATETIME("2022-11-09 00:00:00")
      AND CAST(e.timestamp as DATETIME) <= DATETIME("2022-12-05 23:59:00")
  ),
  __metric as ( -- Metric (Conversion By Session ID)
    SELECT
      user_id as user_id,
      1 as value,
      CAST(m.timestamp as DATETIME) as timestamp,
      CAST(m.timestamp as DATETIME) as conversion_start,
      CAST(m.timestamp as DATETIME) as conversion_end
    FROM
      (
        SELECT DISTINCT
          session_id AS user_id,
          event_timestamp AS timestamp
        FROM
          `bqprojectname.tf_ab_test_analysis.ab_checkout_with_sales` AS e
        WHERE
          experiment_key = 'xyz'
          AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-08 23:59:00'
      ) m
    WHERE
      CAST(m.timestamp as DATETIME) >= DATETIME("2022-11-09 00:00:00")
      AND CAST(m.timestamp as DATETIME) <= DATETIME("2022-12-08 23:59:00")
  ),
  __distinctUsers as (
    -- One row per user/dimension
    SELECT
      e.user_id,
      cast('All' as string) as dimension,
      (
        CASE
          WHEN count(distinct e.variation) > 1 THEN '__multiple__'
          ELSE max(e.variation)
        END
      ) as variation,
      MIN(e.conversion_start) as conversion_start,
      MIN(e.conversion_end) as conversion_end
    FROM
      __experiment e
    GROUP BY
      e.user_id
  ),
  __userMetric as (
    -- Add in the aggregate metric value for each user
    SELECT
      d.variation,
      d.dimension,
      d.user_id,
      1 as value
    FROM
      __distinctUsers d
      JOIN __metric m ON (m.user_id = d.user_id)
    WHERE
      m.timestamp >= d.conversion_start
      AND m.timestamp <= d.conversion_end
    GROUP BY
      variation,
      dimension,
      d.user_id
  ),
  __overallUsers as (
    -- Number of users in each variation
    SELECT
      variation,
      dimension,
      COUNT(*) as users
    FROM
      __distinctUsers
    GROUP BY
      variation,
      dimension
  ),
  __stats as (
    -- One row per variation/dimension with aggregations
    SELECT
      m.variation,
      m.dimension,
      COUNT(*) as count,
      AVG(m.value) as m_mean,
      VAR_SAMP(m.value) as m_var,
      sum(m.value) as m_sum
    FROM
      __userMetric m
    GROUP BY
      m.variation,
      m.dimension
  ),
  __overall as (
    SELECT
      u.variation,
      u.dimension,
      u.users as count,
      s.m_mean * s.count / u.users as mean,
      (s.count / u.users) * (1 - s.count / u.users) as variance,
      u.users as users
    FROM
      __overallUsers u
      LEFT JOIN __stats s ON (
        s.variation = u.variation
        AND s.dimension = u.dimension
      )
  )
SELECT
  variation,
  dimension,
  count,
  mean,
  (
    CASE
      WHEN variance > 0 THEN sqrt(variance)
      ELSE 0
    END
  ) as stddev,
  users
FROM
  __overall
```
h
One more question: you indicate that adding the where clause elsewhere solves this issue. Can you do that by modifying the query manually and running it elsewhere? If so, what does that query look like?
f
yeah, I can export this query to the BigQuery console and run it there, but then we will miss out on the UI and the other inferences in there for the wider audience.
h
Of course, I'm just wondering where in the above query you are suggesting that the WHERE clause would improve performance
f
At the point where the final SELECT FROM __overall is running, I was wondering if there is a way to limit it based on timestamp
h
Hmmm, __overall doesn't have any of the timestamp data, so I'm not sure what a filter at that stage would do.
f
We are partitioning on timestamps. Since the timestamp is not carried through to this level, I could not think of an easy way
h
Got it
f
yep, exactly. Thought I'd fire the question here to see if there are any other creative ways to handle this scenario
h
Yeah, I'm just seeing in your query that you filter on timestamp for both the metric and experiment assignment source, so I'm not sure why it wouldn't be honoring the partitions.
f
that has been the $2.5M question in my mind. I first raised a support ticket with Google, and so far they have only pointed at the entire query rather than a specific CTE
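(As a side note, one way to double-check which column each source table is actually partitioned on, since pruning only applies to filters on that column; the dataset name is taken from the query above, and if the tables turn out to be ingestion-time partitioned, the filter would need to target _PARTITIONTIME instead:)
```sql
-- Sketch: confirm the partitioning column of the two source tables.
SELECT
  table_name,
  column_name
FROM
  `bqprojectname.tf_ab_test_analysis.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name IN ('ab_sessions', 'ab_checkout_with_sales')
  AND is_partitioning_column = 'YES';
```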
h
Ok, I have to think about this a bit more. The big query only references the costly CTEs one time each (__metric and __overall), so I think it might be a slightly different issue.
f
sounds good, thank you. Anything before the end of the day would be great (jk :)). Over here, the rest of my leadership team is excited about GrowthBook. For now, I am creating snapshot tables that reduce the load and surfacing some of the A/B results.
h
My current concern is that building __userMetric is the most expensive part of this query due to the join, and that we aren't filtering down the metric source table enough.
f
oh okay, let me check
h
Especially if it works for the snapshot tables?
f
That's a good callout. At the __userMetric stage, the timestamp is still available, and yes, it would be great to have the ability to add an event_timestamp filter there.
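(For illustration, the kind of edit being discussed, sketched against the __userMetric CTE from the query above; the added constant bounds mirror the window already applied in __metric, and this is a hypothetical manual change, not GrowthBook's generated SQL:)
```sql
-- Hypothetical: __userMetric with explicit timestamp bounds added on top of
-- the per-user conversion window (fragment of the full query above).
__userMetric as (
  SELECT
    d.variation,
    d.dimension,
    d.user_id,
    1 as value
  FROM
    __distinctUsers d
    JOIN __metric m ON (m.user_id = d.user_id)
  WHERE
    m.timestamp >= d.conversion_start
    AND m.timestamp <= d.conversion_end
    -- added bounds (same range already used to build __metric):
    AND m.timestamp >= DATETIME("2022-11-09 00:00:00")
    AND m.timestamp <= DATETIME("2022-12-08 23:59:00")
  GROUP BY
    variation,
    dimension,
    d.user_id
)
```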
h
Yeah, but the sub-tables have already been filtered?
Can you try adding that where clause there and seeing what happens?
f
yep, sure I can. Please give me some time (30 mins-ish) to get back.
h
Of course.
f
yeah, the snapshot tables have been filtered by timestamp and also cut down to a minimal number of records (either for segments or one other thing, we ended up doing a min() to reduce the number of records)
h
Yeah, I think this kind of pre-processing might be necessary to allow on-demand estimation of results if the experiment runs long enough, but we should also explore other cost-savings like ensuring filters exist in the right places.
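(Roughly the kind of pre-processing table being described, as a hedged sketch; the snapshot table name is made up, and it assumes event_timestamp can serve as the partition column:)
```sql
-- Hypothetical snapshot: one row per session/branch for the experiment window,
-- partitioned by event date so downstream reads stay cheap.
CREATE TABLE IF NOT EXISTS `bqprojectname.tf_ab_test_analysis.ab_sessions_snapshot`
PARTITION BY DATE(event_timestamp)
AS
SELECT
  session_id,
  experiment_key,
  branch_key,
  MIN(event_timestamp) AS event_timestamp
FROM
  `bqprojectname.tf_ab_test_analysis.ab_sessions`
WHERE
  experiment_key = 'xyz'
  AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
GROUP BY
  session_id,
  experiment_key,
  branch_key;
```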
f
agreed. The challenge with pre-processing was that we have to create those snapshot tables every time a user wants to review the results, and it requires coordination (between a user who is using the UI and someone on the tech side).
h
Yeah, this is not the best user experience.
👍 1
Of course, hourly or daily snapshot tables could help solve this issue, but then you're creating persistent views of the data that you have to pay to store.
f
FWIW, I created materialized views, but because BigQuery requires the partition logic to match that of the source tables, we could not leverage materialized views much. Without the partitions, the cost (time) was just the same as the source tables.
Just tested it, BQ isn't convinced yet.
Added a similar condition even for the distinctUsers section, still the same result.
h
How long does it take for this error to get thrown?
f
fairly instantly... I think it gives up as part of its initial query analysis & estimation stage.
h
Ok, that's what I suspected.
👍 1
And when you remove that timestamp that you added, you get a similar error?
f
yeah, without both the conditions (that I just added), it's the same result.
h
Ok, unfortunately I suspect there isn't a quick fix for this. Would you mind opening an issue about the query complexity? https://github.com/growthbook/growthbook/issues
f
sure thing, will do. Thank you for the help thus far!