# ask-questions
f
Hello GrowthBook team, first - cannot resist giving a shoutout for building such a fabulous product that makes several things easier. One challenge we are running into is the cost of the final query that fires in BigQuery. In the individual CTEs (WITH clauses) we were able to add extra where conditions to keep the scan limited. For example:
AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
However, when the full query runs, it does not seem to honor the partitions, and BigQuery either takes more than 6 hours to run (after which GrowthBook times out) or complains there is not enough quota available (which is a setting on our end). I was wondering if there is a way to add where clause conditions to the outer query (the final select statement) to restrict it to the timestamp range we want, so that BigQuery uses the partitions?
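(For context, a minimal sketch of the behavior in question, assuming the source table is partitioned on event_timestamp; table and column names are taken from the query shared below, and this is only illustrative, not the generated query:)
```sql
-- Illustrative sketch only; assumes ab_sessions is partitioned on event_timestamp.
-- BigQuery can prune partitions when the filter on the partition column sits
-- where the partitioned table is scanned:
SELECT
  session_id,
  event_timestamp
FROM
  `bqprojectname.tf_ab_test_analysis.ab_sessions`
WHERE
  event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00';

-- A filter added only to the outermost SELECT of a query built from aggregated
-- CTEs generally cannot reduce what the inner scans read, especially once the
-- partition column is no longer present at that stage.
```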
h
Hi! Can you share the full query that is timing out? Just so I know for sure that we're looking at the same issue.
I'm wondering whether it is the final select statement that is the problem, or whether it's our initial scan over the experiment assignment and/or metric source. I just want to see what you've done so that I can make sure I'm diagnosing the same query.
f
Hi Luke, thank you for getting back to me. Here's a query that throws the out-of-quota error:
```sql
-- Conversion By Session ID (binomial)
WITH
  __rawExperiment as (
    SELECT
      session_id AS user_id,
      event_timestamp AS timestamp,
      experiment_key AS experiment_id,
      branch_key AS variation_id,
      experiment_name,
      branch_name AS variation_name
    FROM
      `bqprojectname.tf_ab_test_analysis.ab_sessions` AS e
    WHERE
      experiment_key = 'xyz'
      AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
  ),
  __experiment as ( -- Viewed Experiment
    SELECT
      e.user_id as user_id,
      cast(e.variation_id as string) as variation,
      CAST(e.timestamp as DATETIME) as timestamp,
      CAST(e.timestamp as DATETIME) as conversion_start,
      DATETIME_ADD(CAST(e.timestamp as DATETIME), INTERVAL 72 HOUR) as conversion_end
    FROM
      __rawExperiment e
    WHERE
      e.experiment_id = 'xyz'
      AND CAST(e.timestamp as DATETIME) >= DATETIME("2022-11-09 00:00:00")
      AND CAST(e.timestamp as DATETIME) <= DATETIME("2022-12-05 23:59:00")
  ),
  __metric as ( -- Metric (Conversion By Session ID)
    SELECT
      user_id as user_id,
      1 as value,
      CAST(m.timestamp as DATETIME) as timestamp,
      CAST(m.timestamp as DATETIME) as conversion_start,
      CAST(m.timestamp as DATETIME) as conversion_end
    FROM
      (
        SELECT DISTINCT
          session_id AS user_id,
          event_timestamp AS timestamp
        FROM
          `bqprojectname.tf_ab_test_analysis.ab_checkout_with_sales` AS e
        WHERE
          experiment_key = 'xyz'
          AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-08 23:59:00'
      ) m
    WHERE
      CAST(m.timestamp as DATETIME) >= DATETIME("2022-11-09 00:00:00")
      AND CAST(m.timestamp as DATETIME) <= DATETIME("2022-12-08 23:59:00")
  ),
  __distinctUsers as (
    -- One row per user/dimension
    SELECT
      e.user_id,
      cast('All' as string) as dimension,
      (
        CASE
          WHEN count(distinct e.variation) > 1 THEN '__multiple__'
          ELSE max(e.variation)
        END
      ) as variation,
      MIN(e.conversion_start) as conversion_start,
      MIN(e.conversion_end) as conversion_end
    FROM
      __experiment e
    GROUP BY
      e.user_id
  ),
  __userMetric as (
    -- Add in the aggregate metric value for each user
    SELECT
      d.variation,
      d.dimension,
      d.user_id,
      1 as value
    FROM
      __distinctUsers d
      JOIN __metric m ON (m.user_id = d.user_id)
    WHERE
      m.timestamp >= d.conversion_start
      AND m.timestamp <= d.conversion_end
    GROUP BY
      variation,
      dimension,
      d.user_id
  ),
  __overallUsers as (
    -- Number of users in each variation
    SELECT
      variation,
      dimension,
      COUNT(*) as users
    FROM
      __distinctUsers
    GROUP BY
      variation,
      dimension
  ),
  __stats as (
    -- One row per variation/dimension with aggregations
    SELECT
      m.variation,
      m.dimension,
      COUNT(*) as count,
      AVG(m.value) as m_mean,
      VAR_SAMP(m.value) as m_var,
      sum(m.value) as m_sum
    FROM
      __userMetric m
    GROUP BY
      m.variation,
      m.dimension
  ),
  __overall as (
    SELECT
      u.variation,
      u.dimension,
      u.users as count,
      s.m_mean * s.count / u.users as mean,
      (s.count / u.users) * (1 - s.count / u.users) as variance,
      u.users as users
    FROM
      __overallUsers u
      LEFT JOIN __stats s ON (
        s.variation = u.variation
        AND s.dimension = u.dimension
      )
  )
SELECT
  variation,
  dimension,
  count,
  mean,
  (
    CASE
      WHEN variance > 0 THEN sqrt(variance)
      ELSE 0
    END
  ) as stddev,
  users
FROM
  __overall
```
h
One more question: you indicate that adding the where clause elsewhere solves this issue. Can you do that by modifying the query manually and running it elsewhere? If so, what does that query look like?
f
yeah, I can export this query to the BigQuery console and run it there, but then we will miss out on the UI and the other inferences in there for the wider audience.
h
Of course, I'm just wondering where in the above query you are suggesting that the WHERE clause would improve performance
f
At the point where the final SELECT FROM __overall is running, I was wondering if there is a way to limit it based on timestamp
h
Hmmm, __overall doesn't have any of the timestamp data, so I'm not sure what a filter at that stage would do.
f
We are partitioning on timestamps. Since the timestamp is not carried through to this level, I could not think of an easy way
h
Got it
f
yep, exactly. Thought I'd fire the question here to see if there are any other creative ways to handle this scenario
h
Yeah, I'm just seeing in your query that you filter on timestamp for both the metric and experiment assignment source, so I'm not sure why it wouldn't be honoring the partitions.
f
that has been the $2.5M question in my mind. I first raised a support ticket with Google, and so far they have only pointed at the entire query rather than a specific CTE
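(As a side note, one way to double-check which column each source table is actually partitioned on, since pruning only applies to filters on that column; the dataset name is taken from the query above, and if the tables turn out to be ingestion-time partitioned, the filter would need to target _PARTITIONTIME instead:)
```sql
-- Sketch: confirm the partitioning column of the two source tables.
SELECT
  table_name,
  column_name
FROM
  `bqprojectname.tf_ab_test_analysis.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name IN ('ab_sessions', 'ab_checkout_with_sales')
  AND is_partitioning_column = 'YES';
```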
h
Ok, I have to think about this a bit more. The big query only references the costly CTEs one time each (__metric and __overall), so I think it might be a slightly different issue.
f
sounds good, thank you. Anything before the end of the day would be great (jk :)). Over here, the rest of my leadership team is excited about GrowthBook. For now, I am creating snapshot tables that reduce the load and surfacing some of the A/B results.
h
My current concern is that building __userMetric is the most expensive part of this query due to the join, and that we aren't filtering down the metric source table enough.
f
oh okay, let me check
h
Especially if it works for the snapshot tables?
f
That's a good callout. At the __userMetric stage, the timestamp is still available, and yes, it would be great to have the ability to add an event_timestamp filter there.
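(For illustration, the kind of edit being discussed, sketched against the __userMetric CTE from the query above; the added constant bounds mirror the window already applied in __metric, and this is a hypothetical manual change, not GrowthBook's generated SQL:)
```sql
-- Hypothetical: __userMetric with explicit timestamp bounds added on top of
-- the per-user conversion window (fragment of the full query above).
__userMetric as (
  SELECT
    d.variation,
    d.dimension,
    d.user_id,
    1 as value
  FROM
    __distinctUsers d
    JOIN __metric m ON (m.user_id = d.user_id)
  WHERE
    m.timestamp >= d.conversion_start
    AND m.timestamp <= d.conversion_end
    -- added bounds (same range already used to build __metric):
    AND m.timestamp >= DATETIME("2022-11-09 00:00:00")
    AND m.timestamp <= DATETIME("2022-12-08 23:59:00")
  GROUP BY
    variation,
    dimension,
    d.user_id
)
```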
h
Yeah, but the sub-tables have already been filtered?
Can you try adding that where clause there and seeing what happens?
f
yep, sure I can. Please give me some time (30 mins-ish) to get back.
h
Of course.
f
yeah, the snapshot tables have been filtered by timestamp and also cut down to a minimal number of records (either for segments or one other thing, we ended up doing a min() to reduce the number of records)
h
Yeah, I think this kind of pre-processing might be necessary to allow on-demand estimation of results if the experiment runs long enough, but we should also explore other cost-savings like ensuring filters exist in the right places.
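(Roughly the kind of pre-processing table being described, as a hedged sketch; the snapshot table name is made up, and it assumes event_timestamp can serve as the partition column:)
```sql
-- Hypothetical snapshot: one row per session/branch for the experiment window,
-- partitioned by event date so downstream reads stay cheap.
CREATE TABLE IF NOT EXISTS `bqprojectname.tf_ab_test_analysis.ab_sessions_snapshot`
PARTITION BY DATE(event_timestamp)
AS
SELECT
  session_id,
  experiment_key,
  branch_key,
  MIN(event_timestamp) AS event_timestamp
FROM
  `bqprojectname.tf_ab_test_analysis.ab_sessions`
WHERE
  experiment_key = 'xyz'
  AND event_timestamp BETWEEN '2022-11-09 00:00:00' AND '2022-12-05 23:59:00'
GROUP BY
  session_id,
  experiment_key,
  branch_key;
```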
f
agreed. The challenge with pre-processing was that we have to create those snapshot tables every time a user wants to review the results, and it requires coordination (between a user who is using the UI and someone on the tech side).
h
Yeah, this is not the best user experience.
👍 1
Of course, hourly or daily snapshot tables could help solve this issue, but then you're creating persistent views of the data that you have to pay to store.
f
FWIW, I created materialized views, but because BigQuery requires the partition logic to match that of the source tables, we could not leverage materialized views much. Without the partitions, the cost (time) was just the same as the source tables.
Just tested it, BQ isn't convinced yet.
Added a similar condition even for the distinctUsers section, still the same result.
h
How long does it take for this error to get thrown?
f
fairly instantly... I think it gives up as part of its initial query analysis & estimation stage.
h
Ok, that's what I suspected.
👍 1
And when you remove that timestamp that you added, you get a similar error?
f
yeah, without both the conditions (that I just added), it's the same result.
h
Ok, unfortunately I suspect there isn't a quick fix for this. Would you mind opening an issue about the query complexity? https://github.com/growthbook/growthbook/issues
f
sure thing, will do. Thank you for the help thus far!