Hi everyone, I'm using Growthbook with Clickhouse ...
# ask-questions
h
Hi everyone, I'm using Growthbook with Clickhouse for A/B testing analysis and my queries often time out or error out due to memory usage. Are there any tips/tricks to preventing this? I've tried creating Projections (read indexes) that sort by experiment_id and timestamp but it looks like all the JOINs that Growthbook performs causes the memory issues.
b
Hi Luke, could you send the SQL you’re using for the queries that are timing out?
h
Sure! Here's a query directly from Growthbook:
Copy code
-- Time To First Byte (duration)
    WITH
      
      
    
    __rawExperiment AS (
      SELECT
  host,
  anonymous_id,
  timestamp as timestamp,
  experiment_id as experiment_id,
  variant_id as variation_id
FROM
  events
WHERE
  event = 'experiment_viewed'
    ),
    __experimentExposures AS (
      -- Viewed Experiment
      SELECT
        e.anonymous_id as anonymous_id
        , toString(e.variation_id) as variation
        , e.timestamp as timestamp
        
      FROM
          __rawExperiment e
      WHERE
          e.experiment_id = '63f9e184aaaf29ffe3925b6855490506'
          AND e.timestamp >= toDateTime('2023-10-14 18:00:00', 'UTC')
          AND e.timestamp <= toDateTime('2023-11-28 16:33:51', 'UTC')
          
    )
    , __activationMetric as (-- Metric (RB Analytics - Is Likely Human)
      SELECT
        anonymous_id as anonymous_id,
        1 as value,
        m.timestamp as timestamp
      FROM
        (
              SELECT
  host,
  anonymous_id,
  timestamp
FROM events
WHERE bot_score > 29 
  and context_user_agent not like '%bot%'
  and context_user_agent not like '%Bot%'
            ) m
        
        WHERE m.timestamp >= toDateTime('2023-10-14 18:00:00', 'UTC') AND m.timestamp <= toDateTime('2023-12-28 16:33:51', 'UTC')
    )
        
    
    
    , __experimentUnits AS (
      -- One row per user
      SELECT
        e.anonymous_id AS anonymous_id
        , if(count(distinct e.variation) > 1, '__multiple__', max(e.variation)) AS variation
        , MIN(e.timestamp) AS first_exposure_timestamp
        
        
        
        , MIN(if(
      a.timestamp >= e.timestamp
      , a.timestamp, NULL)) AS first_activation_timestamp
            
      FROM
        __experimentExposures e
        
        
        LEFT JOIN __activationMetric a ON (a.anonymous_id = e.anonymous_id)
      
      GROUP BY
        e.anonymous_id
    ),
      __distinctUsers AS (
        SELECT
          anonymous_id,
          toString('All') AS dimension,
          variation,
          first_activation_timestamp AS timestamp,
          dateTrunc('day', first_exposure_timestamp) AS first_exposure_date
          
        FROM __experimentUnits
        WHERE first_activation_timestamp IS NOT NULL
      )
      , __metric as (-- Metric (Time To First Byte)
      SELECT
        anonymous_id as anonymous_id,
        m.value as value,
        m.timestamp as timestamp
      FROM
        (
              SELECT
  host,
  anonymous_id,
  timestamp,
  (value/1000) as value -- It's looking for seconds so we divide by 1000
FROM events
WHERE event = 'ttfb'
            ) m
        
        WHERE m.timestamp >= toDateTime('2023-10-14 18:00:00', 'UTC') AND m.timestamp <= toDateTime('2023-11-28 16:33:51', 'UTC')
    )
      , __denominator0 as (-- Metric (RB Analytics - Is TTFB session)
      SELECT
        anonymous_id as anonymous_id,
        1 as value,
        m.timestamp as timestamp
      FROM
        (
              SELECT
  host,
  anonymous_id,
  timestamp
FROM events
WHERE event = 'ttfb'
            ) m
        
        WHERE m.timestamp >= toDateTime('2023-10-14 18:00:00', 'UTC') AND m.timestamp <= toDateTime('2023-11-28 16:33:51', 'UTC')
    )
      , __denominatorUsers as (
      -- one row per user
      SELECT
        initial.anonymous_id AS anonymous_id,
        MIN(initial.dimension) AS dimension,
        MIN(initial.variation) AS variation,
        MIN(initial.first_exposure_date) AS first_exposure_date,
        
        MIN(t0.timestamp) AS timestamp
      FROM
        __distinctUsers initial
        JOIN __denominator0 t0 ON (
            t0.anonymous_id = initial.anonymous_id
          )
      WHERE
        
      t0.timestamp >= initial.timestamp
      
      GROUP BY
        initial.anonymous_id)
      
      , __userMetricJoin as (
        SELECT
          d.variation AS variation,
          d.dimension AS dimension,
          
          d.anonymous_id AS anonymous_id,
          if(
        
      m.timestamp >= d.timestamp
      
        
      , m.value, NULL) as value
        FROM
          __denominatorUsers d
        LEFT JOIN __metric m ON (
          m.anonymous_id = d.anonymous_id
        )
        
      )
      , __userMetricAgg as (
        -- Add in the aggregate metric value for each user
        SELECT
          variation,
          dimension,
          
          anonymous_id,
          AVG(value) as value
        FROM
          __userMetricJoin
        GROUP BY
          variation,
          dimension,
          
          anonymous_id
      )
      
        , __capValue AS (
            
      SELECT quantileDeterministic(0.75)(value, 1234) AS cap_value
      FROM __userMetricAgg
      WHERE value IS NOT NULL
    
        )
        
      
      
      -- One row per variation/dimension with aggregations
      SELECT
        m.variation AS variation,
        m.dimension AS dimension,
        COUNT(*) AS users,
        'mean' as statistic_type,
        'count' as main_metric_type,
        MAX(COALESCE(cap.cap_value, 0)) as main_cap_value,
        SUM(LEAST(
        toFloat64(COALESCE(m.value, 0)),
        cap.cap_value
      )) AS main_sum,
        SUM(POWER(LEAST(
        toFloat64(COALESCE(m.value, 0)),
        cap.cap_value
      ), 2)) AS main_sum_squares
        
        
      FROM
        __userMetricAgg m
      
      
      CROSS JOIN __capValue cap
      
      GROUP BY
        m.variation,
        m.dimension
b
Thanks! Could you tell me more about the metric(s) you're analyzing?
I'm not sure it's related to the timeouts, but could you confirm this part of the setup documentation was completed?
h
this query is for Time To First Byte - its a web performance metric based on how long it takes the server to respond (one of our gaurdrail metrics actually). But I have 7 different metrics that we are tracking per experiment.
Yes, that part of the documentation is complete. The connection to the database works, i can get metrics, and experiments fine. Its just that now that the database is larger the vast majority of the queries are failing.
b
Ok, gotcha. Thanks for that info. Do you know the size of the data that's stored in the database?
h
Currently at 232 million events
If I could modify the growthbook queries to filter the data by "host" it would drastically reduce the amount of data that is queried and likely fix the timeout/memory issues. Is there any way I can modify the growthbook queries like this?
b
You can definitely edit the Metric queries, yes. Select the Metric you want to edit the SQL for, and look for the "Query Settings" section in the right-side pane. If you have the right user permissions you'll be able to edit the Metric Query SQL.
h
I think that would require a new metric for every experiment if I'm not mistaken. We're running about 40 different experiments and it would be cumbersome to create all 8 metrics for each experiment. Is it possible to modify the queries at the experiment level so the metrics can remain ambiguous to the datasource?
h
Some options: • You could consider trying to do the bot filtering in the base exposure query since it looks like the information comes from the same source, rather than using an activation metric. Maybe create a new experiment assignment query that is (anonymous_id without bots). • The only experiment level information you can have at the metric level right now is the
{{ experimentId }}
, you could create some filter at the metric level that uses that template variable in some clever way to create the right filter. I'm not sure what the
host
filter you're imagining is though so I'm not sure. • I'd also just confirm that our query is properly hitting date partitions just to be safe. Some notes: • We have also built
Fact Tables
to help you define metrics in a way that will not have to scan that
events
table multiple times for an analysis like this. While you can set up metric definitions on the same base SQL right now, the work to build performance improvements on top of those fact tables is still ongoing. Ratio metrics are definitely inefficient right now, and we're hoping that the fact table refactor can do a lot to help trim down query costs in that regard.
h
Thank you @helpful-application-7107 for the tips! I added the bot filtering in the
Experiment Assignment Queries
and also as a filter for the fact table. Moving to the fact table definitely helped w/ performance. Only problem now is that the numbers display incorrectly because I cant control certain aspects of the metric. For example - all my events go to the
events
table and write their metrics into the
value
column. Most of the values are in milliseconds but the fact table only lets me set it as
Time (seconds)
. I got around this with normal metrics by multiplying the value by .001 but I don't have the ability to do that in Fact Tables. In addition to this - sometimes the values aren't milliseconds - it might be a measure such as how many pixels an element shifted. Since I can't break up the definition of the
value
column based on the metric it doesn't allow me to see the changes properly.
h
Hmm, yeah I see the problem here. It's a bit awkward given but maybe you can do something like the following (cc: @future-teacher-7046): Fact Table SQL
Copy code
SELECT
  host,
  anonymous_id,
  timestamp,
  (value/1000) as value_s,
  value as value
FROM events
Columns
value
<- just a normal number
value_s
<- format seconds Filters Then you can add a bunch of filters here for your different
where
clauses Metrics Then you can define your time metrics using the
value_s
column and your other metrics using the
value
column. There's also an option where you use
CASE WHEN
in your Fact Table SQL to define your metric columns but that seems worse.
👍 1
h
ah thats a good idea! I'll do that thanks. One more thing - when i try using Cap User Values on a metric it returns no results regardless of what percetile I use.
h
Ah! This is a known bug with clickhouse right now: https://github.com/growthbook/growthbook/issues/1881
👀 1
h
ah bummer! Glad it's been identified though. I'll keep an eye on this issue. Thanks again
h
I hope to resolve it very soon!
🙏 1
Hey, I landed a fix here to use the nondeterministic quantile method. Please give it a shot and let me know if it works for you.
🙌 1
h
wow great news! Thank you! I'll let you know if it's working for me
@helpful-application-7107 Looks like I'm still getting no data. Here's the query for reference:
Copy code
-- Revenue (mean)
WITH __rawExperiment AS (
SELECT
	host,
	anonymous_id,
	timestamp as timestamp,
	experiment_id as experiment_id,
	variant_id as variation_id
FROM
	events
WHERE
	event = 'experiment_viewed'
	and context_user_agent not ilike '%bot%'
    ),
    
    __experimentExposures AS (
-- Viewed Experiment
SELECT
	e.anonymous_id as anonymous_id
        ,
	toString(e.variation_id) as variation
        ,
	e.timestamp as timestamp
FROM
	__rawExperiment e
WHERE
	e.experiment_id = '1cbfa0516ae56757d5615ddcfa3e919e'
	AND e.timestamp >= toDateTime('2023-11-01 00:02:00',
	'UTC')
		AND e.timestamp <= toDateTime('2023-12-04 17:15:19',
		'UTC')
       ),
__experimentUnits AS (
-- One row per user
SELECT
	e.anonymous_id AS anonymous_id
        ,
	if(count(distinct e.variation) > 1,
	'__multiple__',
	max(e.variation)) AS variation
        ,
	MIN(e.timestamp) AS first_exposure_timestamp
FROM
	__experimentExposures e
GROUP BY
	e.anonymous_id
    ),
      __distinctUsers AS (
SELECT
	anonymous_id,
	toString('All') AS dimension,
	variation,
	first_exposure_timestamp AS timestamp,
	dateTrunc('day',
	first_exposure_timestamp) AS first_exposure_date
FROM
	__experimentUnits
        
      )
      ,
__metric as (
-- Metric (Revenue)
SELECT
	anonymous_id as anonymous_id,
	m.amount as value,
	m.timestamp as timestamp
FROM
	(
	SELECT
		anonymous_id,
		timestamp,
		host,
		context_user_agent,
		context_device,
		context_browser,
		bot_score,
		event,
		value,
		(value / 1000) value_seconds,
		amount,
		quantity,
		currency
	FROM
		events
            ) m
WHERE
	event = 'checkout_completed'
	AND context_user_agent not ilike '%bot%'
	AND m.timestamp >= toDateTime('2023-11-01 00:02:00',
	'UTC')
		AND m.timestamp <= toDateTime('2023-12-04 17:15:19',
		'UTC')
    ),
    
__userMetricJoin as (
SELECT
	d.variation AS variation,
	d.dimension AS dimension,
	d.anonymous_id AS anonymous_id,
	if(m.timestamp >= d.timestamp, m.value,	NULL) as value
FROM __distinctUsers d
LEFT JOIN __metric m ON	(m.anonymous_id = d.anonymous_id)  
  ),
  
__userMetricAgg as (
-- Add in the aggregate metric value for each user
SELECT
	variation,
	dimension,
	anonymous_id,
	SUM(COALESCE(value, 0)) as value
FROM
	__userMetricJoin
GROUP BY
	variation,
	dimension,
	anonymous_id
),

__capValue AS (
SELECT
	quantile(0.95)(value) AS cap_value
FROM
	__userMetricAgg
WHERE
	value IS NOT NULL
    
        )
-- One row per variation/dimension with aggregations
      SELECT
	m.variation AS variation,
	m.dimension AS dimension,
	COUNT(*) AS users,
	'mean' as statistic_type,
	'count' as main_metric_type,
	MAX(COALESCE(cap.cap_value, 0)) as main_cap_value,
	SUM(LEAST(
        toFloat64(COALESCE(m.value, 0)),
        cap.cap_value
      )) AS main_sum,
	SUM(POWER(LEAST(
        toFloat64(COALESCE(m.value, 0)),
        cap.cap_value
      ), 2)) AS main_sum_squares
FROM
	__userMetricAgg m
CROSS JOIN __capValue cap
GROUP BY
	m.variation,
	m.dimension
h
What is the result in the View Queries modal showing as the returned data?
h
image.png
h
Oh, is there no returned
main_cap_value
column?
seems like there should be based on your SQL
Other than that, it looks like things are working fine, it's just there's, for some reason, no values being returned by this part of the query:
Copy code
__metric as (
-- Metric (Revenue)
SELECT
	anonymous_id as anonymous_id,
	m.amount as value,
	m.timestamp as timestamp
FROM
	(
	SELECT
		anonymous_id,
		timestamp,
		host,
		context_user_agent,
		context_device,
		context_browser,
		bot_score,
		event,
		value,
		(value / 1000) value_seconds,
		amount,
		quantity,
		currency
	FROM
		events
            ) m
WHERE
	event = 'checkout_completed'
	AND context_user_agent not ilike '%bot%'
	AND m.timestamp >= toDateTime('2023-11-01 00:02:00',
	'UTC')
		AND m.timestamp <= toDateTime('2023-12-04 17:15:19',
		'UTC')
    ),
    
__userMetricJoin as (
SELECT
	d.variation AS variation,
	d.dimension AS dimension,
	d.anonymous_id AS anonymous_id,
	if(m.timestamp >= d.timestamp, m.value,	NULL) as value
FROM __distinctUsers d
LEFT JOIN __metric m ON	(m.anonymous_id = d.anonymous_id)  
  )
The culprits could be: • That where clause on the
evets
table is dropping all your data for some reason (seems unlikely, but possible). If you just run that CTE, do you see data in your data warehouse? • The join is not finding overlapping anonymous_id values, if for some reason your anonymous_id values aren't the same for users when you are initializing the growthbook SDK in your app and when you're tracking their conversions/other events (not uncommon problem if people aren't properly storing the anonymous ids using the trackingCallback and re-using them for event tracking) • There's some weirdness with the date windows not overlapping, but that seems unlikely given your queries so far and traffic
h
So if I re-run the query and select * from __capValue I get 0.0. If I rerun it and select * from __userMetricAgg I get all the values but a ton of 0 value rows. If I modify the __userMetricAgg to exclude 0's it returns a more appropriate value:
Copy code
__userMetricAgg as (
        SELECT
          variation,
          dimension,
          anonymous_id,
          SUM(COALESCE(value, 0)) as value
        FROM
          __userMetricJoin
      
        GROUP BY
          variation,
          dimension,
          anonymous_id
         HAVING value > 0  -- I added this line
      )
this leads me to think that I need to change my metric to filter out 0's
h
Looks like the 95th percentile is 0!
h
if I don't filter out non-checkout events that appears to be the case. Should the query be filtering out any non-checkout events though?
h
Well the 95th percentile is currently defined over the user aggregated values in your experiment. So it will include 0s from people in the experiment who have no values on the experiment.
h
i think its because
__distinctUsers
is getting all users in the experiment and its joining
__metric
to that which causes a bunch of 0/null values.
h
Yes.
h
Is this different than how the non fact table metrics calculate percentiles too? I don't recall this being an issue there
h
No, it should be the same, unless you entered a custom aggregation we default to doing
SUM(COALESCE(value, 0))
h
Hmm, I just compared the two (metric table vs regular metric) and get different results. It looks like the
SUM(COALESCE(value, 0))
is the only major difference. Its just
SUM(value)
on the regular metric. If I change the fact table query to use
SUM(value)
It works correctly
h
Is your regular metric set up with a custom aggregation that is
sum(value)
?
h
Ah - looks like it is custom. I wasn't aware you could leave the "User Value Aggregation" field empty.
h
Yeah, sorry that's a bit of an oversight in that UI. So that's the main reason this is going on. It's not clear that 95th percentile should exclude 0s, since statistically the percentile is defined over the user aggregate levels, so that's why our default is to include those. You could increase the percentile as a potential solution.
h
I see, only problem I'd have there is that I'd need to create a revenue metric for each experiment since each one of our customer's may have a different "conversion rate" that I'd have to base the percentile off of. Some customers have a 1% conversion rate while others may be upwards of 10%. Would it be possible to add a custom aggregation in the Fact Table so I can do a simple
SUM
there too?
h
Right. We aren't going to introduce custom aggregations for fact table metrics, but we could consider adding an option to apply the 95th percentile to either all user level values or just to non-zero user level values. Would you be willing to open an issue to that effect?
🙏 1
h
Yeah I can do that! Thanks for the help
I added an issue here. Hopefully it's descriptive enough: https://github.com/growthbook/growthbook/issues/1912
thankyou 1
h
Late to the follow-up here, but I did close this out and the feature is available now, for what it's worth.
🙌 1
h
Great, thanks Luke! I'll check it out and let you know if I run into any issues