
late-teacher-86348

08/11/2021, 11:13 PM
Super excited about getting Growth Book set up and likely using it for my organization. I'm testing a basic ClickHouse setup at the moment and getting this when attempting to "import past experiments":
501: Code: 48, e.displayText() = DB::Exception: JOIN ON inequalities are not supported. Unexpected 'users > threshold': While processing users > threshold (version 21.3.15.2.altinity+stable (altinity build))
It seems to be a ClickHouse thing – only equi-joins supported. What does this mean for using Growth Book with ClickHouse?

future-teacher-7046

08/11/2021, 11:21 PM
Hmm, we do use comparison operators in joins for a few of the queries, usually to restrict by date. It looks like from that error we could just move those conditions to the WHERE clause. That would likely make queries slower for data sources that do support those joins, so we may need to fork the logic when generating the query.
I don't have a ClickHouse warehouse set up to test readily. If you could share the full query, I could hand-craft a version that moves the conditions to the WHERE clause, and you could check whether it runs on your ClickHouse instance.
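For illustration, the transformation looks roughly like this (a sketch using the CTE names from the generated query, not the exact query itself):

```sql
-- Rejected by ClickHouse 21.3: inequality inside JOIN ON
SELECT d.experiment_id, SUM(d.users) AS users
FROM __experimentDates d
JOIN __userThresholds u ON (
  d.experiment_id = u.experiment_id
  AND d.users > u.threshold  -- non-equi condition in the join
)
GROUP BY d.experiment_id;

-- Equivalent rewrite: keep only the equality in JOIN ON,
-- and move the inequality to the WHERE clause
SELECT d.experiment_id, SUM(d.users) AS users
FROM __experimentDates d
JOIN __userThresholds u ON d.experiment_id = u.experiment_id
WHERE d.users > u.threshold
GROUP BY d.experiment_id;
```

For an inner join the two forms return the same rows; for outer joins the rewrite would change semantics, so it only works here because the query uses a plain JOIN.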

late-teacher-86348

08/12/2021, 11:55 AM
Hey Jeremy, sure – here's the query generated by "import past experiments": https://pastebin.com/kQmtcED3 And the resulting error response:
501: Code: 48, e.displayText() = DB::Exception: JOIN ON inequalities are not supported. Unexpected 'users > threshold': While processing users > threshold (version 21.3.15.2.altinity+stable (altinity build))
Thanks for the super-fast response, Growth Book looks promising. What database would you recommend if ClickHouse isn't as well tested? This limitation seems pretty significant for our purposes too.

future-teacher-7046

08/12/2021, 12:10 PM
-- Past Experiments
WITH __experiments as (
  SELECT
    user_id,
    anonymous_id,
    received_at as timestamp,
    experiment_id,
    variation_id,
    context_page_path as url,
    context_user_agent as user_agent
  FROM
    experiment_viewed
),
__experimentDates as (
  SELECT
    experiment_id,
    variation_id,
    date_trunc('day', timestamp) as date,
    count(distinct anonymous_id) as users
  FROM
    __experiments
  WHERE
    timestamp > toDateTime('2020-08-12 11:51:08')
  GROUP BY
    experiment_id,
    variation_id,
    date_trunc('day', timestamp)
),
__userThresholds as (
  SELECT
    experiment_id,
    variation_id,
    -- It's common for a small number of tracking events to continue coming in
    -- long after an experiment ends, so limit to days with enough traffic
    max(users) * 0.05 as threshold
  FROM
    __experimentDates
  WHERE
    -- Skip days where a variation got 5 or fewer visitors since it's probably not real traffic
    users > 5
  GROUP BY
    experiment_id,
    variation_id
),
__variations as (
  SELECT
    d.experiment_id,
    d.variation_id,
    MIN(d.date) as start_date,
    MAX(d.date) as end_date,
    SUM(d.users) as users
  FROM
    __experimentDates d
    JOIN __userThresholds u ON (
      d.experiment_id = u.experiment_id
      AND d.variation_id = u.variation_id
    )
  WHERE
    d.users > u.threshold
  GROUP BY
    d.experiment_id,
    d.variation_id
)
SELECT
  *
FROM
  __variations
WHERE
  -- Skip experiments with fewer than 200 users since they don't have enough data
  users > 200
  -- Skip experiments that are 5 days or shorter (most likely means it was stopped early)
  AND date_diff('day', start_date, end_date) > 5
  -- Skip experiments that start on the very first day since we're likely missing data
  AND date_diff(
    'day',
    toDateTime('2020-08-12 11:51:08'),
    start_date
  ) > 2
ORDER BY
  experiment_id ASC,
  variation_id ASC
Can you try running the above directly on ClickHouse and see if it works?
ClickHouse is one of our newest data sources, and there are usually a few edge cases to work out. Snowflake, Redshift, and Postgres are all well tested. We were just working with another user on BigQuery to fix some date edge cases. Luckily we only have about 5 unique SQL queries that we run, so it doesn't take that long to fix issues.

late-teacher-86348

08/12/2021, 1:44 PM
Here's what I get when running that query directly:
Received exception from server (version 21.3.15):
Code: 46. DB::Exception: Received from localhost:9000. DB::Exception: Unknown function date_diff. Maybe you meant: ['dateDiff']: While processing ((date_diff('day', toDateTime('2020-08-12 11:51:08'), min(date) AS start_date) > 2) AND (date_diff('day', start_date, max(date) AS end_date) > 5)) AND ((sum(users) AS users) > 200).
Great to hear it isn't likely to be a huge challenge given the 5 queries 🙂

future-teacher-7046

08/12/2021, 1:59 PM
That's strange. https://clickhouse.tech/docs/en/sql-reference/functions/date-time-functions/#date_diff says date_diff is an alias of dateDiff. What version of ClickHouse are you using? It's really hard to tell from their documentation when certain features were added.
Oh, never mind, I see 21.3.15 in the error message.
Can you try again with this one:
WITH __experiments as (
  SELECT
    user_id,
    anonymous_id,
    received_at as timestamp,
    experiment_id,
    variation_id,
    context_page_path as url,
    context_user_agent as user_agent
  FROM
    experiment_viewed
),
__experimentDates as (
  SELECT
    experiment_id,
    variation_id,
    dateTrunc('day', timestamp) as date,
    count(distinct anonymous_id) as users
  FROM
    __experiments
  WHERE
    timestamp > toDateTime('2020-08-12 11:51:08')
  GROUP BY
    experiment_id,
    variation_id,
    dateTrunc('day', timestamp)
),
__userThresholds as (
  SELECT
    experiment_id,
    variation_id,
    max(users) * 0.05 as threshold
  FROM
    __experimentDates
  WHERE
    users > 5
  GROUP BY
    experiment_id,
    variation_id
),
__variations as (
  SELECT
    d.experiment_id,
    d.variation_id,
    MIN(d.date) as start_date,
    MAX(d.date) as end_date,
    SUM(d.users) as users
  FROM
    __experimentDates d
    JOIN __userThresholds u ON (
      d.experiment_id = u.experiment_id
      AND d.variation_id = u.variation_id
    )
  WHERE
    d.users > u.threshold
  GROUP BY
    d.experiment_id,
    d.variation_id
)
SELECT
  *
FROM
  __variations
WHERE
  users > 200
  AND dateDiff('day', start_date, end_date) > 5
  AND dateDiff(
    'day',
    toDateTime('2020-08-12 11:51:08'),
    start_date
  ) > 2
ORDER BY
  experiment_id ASC,
  variation_id ASC

late-teacher-86348

08/12/2021, 3:00 PM
It doesn't throw any errors! Although it also doesn't return any results, but I'm guessing that's due to my poor dummy data. 🙂

future-teacher-7046

08/12/2021, 3:09 PM
OK, that's progress at least! I can make those changes in the repo today. Thanks for the help debugging so far.

late-teacher-86348

08/12/2021, 4:52 PM
Thanks for making changes so quickly!

future-teacher-7046

08/12/2021, 5:33 PM
I have a working PR - https://github.com/growthbook/growthbook/pull/40 Going to do some more testing to make sure I didn't inadvertently break any other data sources.

late-teacher-86348

08/12/2021, 6:15 PM
Sounds good. I'm ready to test it on growthbook.io when you give the word!
Getting a new error now 😅
rows.map is not a function
Query:
-- Past Experiments
WITH __experiments as (
  SELECT
    user_id,
    anonymous_id,
    received_at as timestamp,
    experiment_id,
    variation_id,
    context_page_path as url,
    context_user_agent as user_agent
  FROM
    experiment_viewed
),
__experimentDates as (
  SELECT
    experiment_id,
    variation_id,
    dateTrunc('day', timestamp) as date,
    count(distinct anonymous_id) as users
  FROM
    __experiments
  WHERE
    timestamp > toDateTime('2020-08-12 18:42:35')
  GROUP BY
    experiment_id,
    variation_id,
    dateTrunc('day', timestamp)
),
__userThresholds as (
  SELECT
    experiment_id,
    variation_id,
    -- It's common for a small number of tracking events to continue coming in
    -- long after an experiment ends, so limit to days with enough traffic
    max(users) * 0.05 as threshold
  FROM
    __experimentDates
  WHERE
    -- Skip days where a variation got 5 or fewer visitors since it's probably not real traffic
    users > 5
  GROUP BY
    experiment_id,
    variation_id
),
__variations as (
  SELECT
    d.experiment_id,
    d.variation_id,
    MIN(d.date) as start_date,
    MAX(d.date) as end_date,
    SUM(d.users) as users
  FROM
    __experimentDates d
    JOIN __userThresholds u ON (
      d.experiment_id = u.experiment_id
      AND d.variation_id = u.variation_id
    )
  WHERE
    d.users > u.threshold
  GROUP BY
    d.experiment_id,
    d.variation_id
)
SELECT
  *
FROM
  __variations
WHERE
  -- Skip experiments with fewer than 200 users since they don't have enough data
  users > 200
  -- Skip experiments that are 5 days or shorter (most likely means it was stopped early)
  AND dateDiff('day', start_date, end_date) > 5
  -- Skip experiments that start on the very first day since we're likely missing data
  AND dateDiff(
    'day',
    toDateTime('2020-08-12 18:42:35'),
    start_date
  ) > 2
ORDER BY
  experiment_id ASC,
  variation_id ASC

future-teacher-7046

08/12/2021, 8:06 PM
OK, that must be something with how the ClickHouse client is returning data. I'll need to set up a local ClickHouse instance to test that out.

late-teacher-86348

08/13/2021, 1:28 AM
I'll do what I can to help! Let me know if there's anything.
You know, I think it could be because of two things based on my searching:
1. React is trying to process a JSON response rather than an array (maybe because of how ClickHouse responds)
2. There are no results that have > 200 users and/or are spread out over more than 5 days, since I'm using dummy data that didn't take that into account
I think #2 is likely even if #1 is not the case. If so, a friendlier response might be good, if possible. I can work to ensure both conditions are met with the dummy data.
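To rule out #2, the dummy data needs enough volume to pass the import filters. Here's a sketch of a seeding query, assuming an experiment_viewed table with the columns referenced in the generated query (the column list and types are hypothetical and would need to match your actual table):

```sql
-- ~100 distinct viewers per day per variation across 7 days, so each
-- variation ends up with 700 users (> 200) over a 6-day span (> 5 days)
INSERT INTO experiment_viewed
  (user_id, anonymous_id, received_at, experiment_id, variation_id,
   context_page_path, context_user_agent)
SELECT
  concat('user_', toString(number)),
  concat('anon_', toString(number)),
  now() - toIntervalDay(number % 7),
  'dummy_experiment',
  toString(number % 2),
  '/',
  'dummy-agent'
FROM numbers(1400);
```

Each (day, variation) bucket gets 100 distinct anonymous_ids, which also clears the per-day "users > 5" filter and the 5% threshold.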

future-teacher-7046

08/13/2021, 2:35 PM
I was able to load some dummy data into ClickHouse and replicate the issue. The Node.js library is not parsing the query result correctly. Working on a fix now.
I think ClickHouse is returning tab-separated values when I'm expecting JSON.
Just pushed a fix. Will take about 10-15 minutes to deploy to Docker.
I was specifying "format: json", but the Node.js client library wasn't adding the proper x-clickhouse-format header to the request. So ClickHouse was returning tab-separated results while the library was expecting JSON, and it failed to parse them.
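As a side note for anyone hitting this before the fix deploys: ClickHouse also accepts the output format inline in the SQL, which sidesteps the header entirely. A sketch, run directly against ClickHouse rather than through GrowthBook:

```sql
-- A trailing FORMAT clause overrides the default TabSeparated output,
-- so the server returns JSON even without an x-clickhouse-format header
SELECT experiment_id, count(DISTINCT anonymous_id) AS users
FROM experiment_viewed
GROUP BY experiment_id
FORMAT JSON
```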

late-teacher-86348

08/14/2021, 1:16 AM
Awesome. Thank you!!