# give-feedback
r
When using an activation metric, BigQuery takes a very long time to compute the results. For the actual 21 days I'm currently analyzing, I stopped after 45 minutes; usually a sign that a join is exploding. When leaving out the activation metric, all 7 queries (7 metrics in the test) take less than a minute.

I ran some manual timings, limited to 1 day and 1 metric; the query takes 4+ minutes. Diagnostics, checking the subquery outputs (`count(*) from ...`) step by step (sketch after the query below):

• `__rawExperiment` is ~4 million rows (1 day)
• `__experiment` comes in at 2 million rows
• `__metric` at 1.8 million
• `__activationMetric0` at 16K
• `__activatedUsers` at 1.2 million
• `__distinctUsers` takes about 46 seconds, results in 7K rows
• `__userMetric` takes 4 minutes something, results in 7K rows

There is the bottleneck. The query details report 11.2 million rows read at some stage. Not sure how or why this subquery is such a resource hog; maybe you can spot something.
```sql
__userMetric as (
  -- Add in the aggregate metric value for each user
  SELECT
    d.variation,
    d.dimension,
    LEAST(10, SUM(m.value)) as value  -- per-user sum, capped at 10
  FROM
    __distinctUsers d
    JOIN __metric m ON (m.user_pseudo_id = d.user_pseudo_id)
  WHERE
    m.conversion_start >= d.conversion_start
    AND m.conversion_start <= d.conversion_end
  GROUP BY
    variation,
    dimension,
    d.user_pseudo_id
)
```
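For reference, the per-CTE numbers above came from a pattern like this (a minimal, hypothetical sketch; the stand-in body gets replaced by pasting the generated subqueries into the BigQuery console):

```sql
-- Hypothetical sketch: paste the generated CTEs in place of the stand-in,
-- then point the final SELECT at one CTE at a time to see where row
-- counts (and runtime) blow up.
WITH __experiment AS (
  SELECT 'u1' AS user_pseudo_id  -- stand-in for the generated subquery
)
SELECT count(*) AS row_count
FROM __experiment;
```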
f
Are you sure it's that particular subquery? That one is identical whether or not you have an activation metric. The `__distinctUsers` subquery is the one that incorporates the activation metric.
r
Did another run. It indeed becomes slow earlier, with the `__distinctUsers` subquery (where I first mentioned 46 seconds, it now takes 4 minutes).
@future-teacher-7046 When I add an aggregation step:
```sql
__activatedUsers as (
  -- One row per (exposure row, activation event) pair whose activation
  -- falls inside the user's conversion window
  SELECT
    initial.user_pseudo_id,
    t0.conversion_start as conversion_start,
    t0.conversion_end as conversion_end
  FROM
    __experiment initial
    JOIN __activationMetric0 t0 ON (t0.user_pseudo_id = initial.user_pseudo_id)
  WHERE
    t0.conversion_start >= initial.conversion_start
    AND t0.conversion_start <= initial.conversion_end
),

__activatedAgg as (
  -- Collapse to one row per user, keeping the earliest conversion window
  select
    user_pseudo_id,
    min(conversion_start) as conversion_start,
    min(conversion_end) as conversion_end
  from __activatedUsers
  group by 1
)
```
And then, doing the join on that CTE, the query runs fast again: from 4+ minutes to sub-10 seconds.
```sql
__distinctUsers as (
  -- One row per user/dimension
  SELECT
    e.user_pseudo_id,
    cast('All' as string) as dimension,
    (
      CASE
      WHEN count(distinct e.variation) > 1 THEN '__multiple__'
      ELSE max(e.variation) END
    ) as variation,
    MIN(a.conversion_start) as conversion_start,
    MIN(a.conversion_end) as conversion_end
  FROM
    __experiment e
  JOIN __activatedAgg a using(user_pseudo_id)
  GROUP BY
    e.user_pseudo_id
)
```
OK, this works too: aggregate in `__activatedUsers` directly. This makes `initial.user_pseudo_id` distinct and prevents an explosive table later:
```sql
__activatedUsers as (
  -- Aggregating here makes this one row per user from the start
  SELECT
    initial.user_pseudo_id,
    min(t0.conversion_start) as conversion_start,
    min(t0.conversion_end) as conversion_end
  FROM
    __experiment initial
    JOIN __activationMetric0 t0 ON (t0.user_pseudo_id = initial.user_pseudo_id)
  WHERE
    t0.conversion_start >= initial.conversion_start
    AND t0.conversion_start <= initial.conversion_end
  group by 1
),
```
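To make the fan-out concrete, here's a self-contained toy query (hypothetical data, not the generated SQL): a single user with 4 exposure rows and 3 activation events joins to 4 × 3 = 12 rows, and without the `group by` that multiplication happens again at every later join.

```sql
-- Toy illustration of join fan-out: 4 exposure rows x 3 activation
-- events for the same user produce 12 joined rows, not 4.
WITH exposures AS (
  SELECT 'u1' AS user_pseudo_id, ts AS exposure_ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP '2023-01-01 00:00:00',
    TIMESTAMP '2023-01-01 03:00:00',
    INTERVAL 1 HOUR)) AS ts
),
activations AS (
  SELECT 'u1' AS user_pseudo_id, ts AS activation_ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP '2023-01-01 00:30:00',
    TIMESTAMP '2023-01-01 02:30:00',
    INTERVAL 1 HOUR)) AS ts
)
SELECT count(*) AS joined_rows  -- 12
FROM exposures e
JOIN activations a USING (user_pseudo_id);
```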
f
Thanks for looking into this. I think that `group by` in `__activatedUsers` could work at least some of the time. We recently added support for different Attribution Models, and some of them need access to the individual activation events.
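For example (a hypothetical sketch, not the actual generated SQL): a model that credits a metric conversion to every activation window it falls into needs one row per activation event, which is exactly what the un-aggregated `__activatedUsers` above provides. Collapsing to a single `min()` row per user would throw away every window after the first.

```sql
-- Hypothetical downstream use that requires per-event rows: credit a
-- conversion to EVERY activation window it falls into. This join can't
-- be expressed once __activatedUsers is pre-aggregated to one row per user.
SELECT
  a.user_pseudo_id,
  count(*) as credited_conversions
FROM
  __activatedUsers a  -- the original, un-aggregated version
  JOIN __metric m ON (m.user_pseudo_id = a.user_pseudo_id)
WHERE
  m.conversion_start >= a.conversion_start
  AND m.conversion_start <= a.conversion_end
GROUP BY 1
```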