
prehistoric-beard-84272

09/17/2023, 8:50 AM
Hi everyone! Quick question specific to GA4. GA4 data is written into two sets of event tables: events_* and events_intraday_*. (events_intraday_* contains roughly the last two rolling days of data, while events_* contains all data before that; intraday tables are periodically converted into events_* tables.) So to get a real-time view with history, we need to read from both, which means adding a UNION ALL to every query. I was wondering whether there is a centralized way to redefine events as a UNION ALL of both these tables, so we don't have to add that UNION ALL to each query? Thanks for the help!
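For reference, one centralized option is a view that does the UNION ALL once, so downstream queries select from a single object. A minimal sketch, where the project/dataset names and the view name events_all are placeholders, assuming the daily and intraday shards share a schema; whether suffix filters applied through the view still prune shards is worth verifying on your own billing data:
Copy code
-- Hypothetical view wrapping the UNION ALL once.
-- Note: the events_* wildcard also matches events_intraday_* tables,
-- so the first branch keeps only the purely numeric (daily) suffixes.
CREATE OR REPLACE VIEW `defaultProjectName.defaultDataSet.events_all` AS
SELECT *, _TABLE_SUFFIX AS event_date
FROM `defaultProjectName.defaultDataSet.events_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^[0-9]+$')
UNION ALL
SELECT *, _TABLE_SUFFIX AS event_date
FROM `defaultProjectName.defaultDataSet.events_intraday_*`;

-- Downstream queries then filter on the exposed column, e.g.:
-- SELECT ... FROM `defaultProjectName.defaultDataSet.events_all`
-- WHERE event_date BETWEEN '20230901' AND '20230918'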

prehistoric-beard-84272

09/18/2023, 8:07 AM
Thank you, @fresh-football-47124!

white-fireman-22476

09/18/2023, 8:07 AM
It should go in the experiment assignment query. You should add a GROUP BY to filter out duplicates, because I believe there can be duplicates between the intraday table and the other tables.

prehistoric-beard-84272

09/18/2023, 8:08 AM
Interesting thought, @careful-machine-67313. I'll check that. Thank you!

white-fireman-22476

09/18/2023, 8:09 AM
Could you maybe check if the query:
Copy code
SELECT
user_pseudo_id as anonymous_id,
TIMESTAMP_MICROS(event_timestamp) as timestamp,
experiment_id_param.value.string_value AS experiment_id,
variation_id_param.value.int_value AS variation_id,
geo.country as country,
traffic_source.source as source,
traffic_source.medium as medium,
device.category as device,
device.web_info.browser as browser,
device.operating_system as os
FROM
`defaultProjectName`.`defaultDataSet`.`events_*`,
UNNEST(event_params) AS experiment_id_param,
UNNEST(event_params) AS variation_id_param
WHERE
_TABLE_SUFFIX BETWEEN '{{startYear}}{{startMonth}}{{startDay}}' AND '{{endYear}}{{endMonth}}{{endDay}}'
AND event_name = 'experiment_viewed'

AND experiment_id_param.key = 'experiment_id'
AND variation_id_param.key = 'variation_id'
AND user_pseudo_id is not null

UNION ALL

SELECT
user_pseudo_id as anonymous_id,
TIMESTAMP_MICROS(event_timestamp) as timestamp,
experiment_id_param.value.string_value AS experiment_id,
variation_id_param.value.int_value AS variation_id,
geo.country as country,
traffic_source.source as source,
traffic_source.medium as medium,
device.category as device,
device.web_info.browser as browser,
device.operating_system as os
FROM
`defaultProjectName`.`defaultDataSet`.`events_intraday_*`,
UNNEST(event_params) AS experiment_id_param,
UNNEST(event_params) AS variation_id_param
WHERE
event_name = 'experiment_viewed'

AND experiment_id_param.key = 'experiment_id'
AND variation_id_param.key = 'variation_id'
AND user_pseudo_id is not null

GROUP BY
user_pseudo_id, event_timestamp, experiment_id, variation_id, country, source, medium, device, browser, os
works for you for the anonymous id? If so, I'll make the change for everyone so future people adding the data source will get this by default.
Likewise:
Copy code
SELECT
user_id as user_id,
TIMESTAMP_MICROS(event_timestamp) as timestamp,
experiment_id_param.value.string_value AS experiment_id,
variation_id_param.value.int_value AS variation_id,
geo.country as country,
traffic_source.source as source,
traffic_source.medium as medium,
device.category as device,
device.web_info.browser as browser,
device.operating_system as os
FROM
`defaultProjectName`.`defaultDataSet`.`events_*`,
UNNEST(event_params) AS experiment_id_param,
UNNEST(event_params) AS variation_id_param
WHERE
_TABLE_SUFFIX BETWEEN '{{startYear}}{{startMonth}}{{startDay}}' AND '{{endYear}}{{endMonth}}{{endDay}}'
AND event_name = 'experiment_viewed'  
AND experiment_id_param.key = 'experiment_id'
AND variation_id_param.key = 'variation_id'
AND user_id is not null

UNION ALL

SELECT
user_id as user_id,
TIMESTAMP_MICROS(event_timestamp) as timestamp,
experiment_id_param.value.string_value AS experiment_id,
variation_id_param.value.int_value AS variation_id,
geo.country as country,
traffic_source.source as source,
traffic_source.medium as medium,
device.category as device,
device.web_info.browser as browser,
device.operating_system as os
FROM
`defaultProjectName`.`defaultDataSet`.`events_intraday_*`,
UNNEST(event_params) AS experiment_id_param,
UNNEST(event_params) AS variation_id_param
WHERE
event_name = 'experiment_viewed'  
AND experiment_id_param.key = 'experiment_id'
AND variation_id_param.key = 'variation_id'
AND user_id is not null

GROUP BY
user_id, event_timestamp, experiment_id, variation_id, country, source, medium, device, browser, os
should work for the user_id one.
In both cases you would need to change the defaultProjectName.defaultDataSet bit to be what you have now.
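One caveat worth flagging on the two drafts above: in BigQuery, a GROUP BY written after a UNION ALL binds only to the second SELECT, not to the union as a whole, so duplicates spanning the daily and intraday tables would survive it. A hedged sketch of the shape that dedupes across both branches (columns trimmed for brevity; table names are the same placeholders as above):
Copy code
-- Wrapping the UNION ALL in a subquery makes the GROUP BY apply to rows
-- from BOTH tables, removing any cross-table duplicates.
SELECT
  user_pseudo_id AS anonymous_id,
  TIMESTAMP_MICROS(event_timestamp) AS timestamp
FROM (
  SELECT user_pseudo_id, event_timestamp
  FROM `defaultProjectName`.`defaultDataSet`.`events_*`
  -- numeric suffix bounds also exclude the 'intraday_...' shards here
  WHERE _TABLE_SUFFIX BETWEEN '{{startYear}}{{startMonth}}{{startDay}}' AND '{{endYear}}{{endMonth}}{{endDay}}'
    AND event_name = 'experiment_viewed'
    AND user_pseudo_id IS NOT NULL
  UNION ALL
  SELECT user_pseudo_id, event_timestamp
  FROM `defaultProjectName`.`defaultDataSet`.`events_intraday_*`
  WHERE event_name = 'experiment_viewed'
    AND user_pseudo_id IS NOT NULL
)
GROUP BY anonymous_id, timestamp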

prehistoric-beard-84272

09/18/2023, 8:14 AM
Sure, I'll check

white-fireman-22476

09/18/2023, 8:14 AM
Thanks.

prehistoric-beard-84272

09/18/2023, 8:17 AM
The second query after the UNION ALL also needs a WHERE clause to filter by date, no? Intraday usually has 2 days in it (today and yesterday)

white-fireman-22476

09/18/2023, 8:18 AM
I think the GROUP BY at the end should get rid of duplicates which is needed anyway, no?
Or what were you thinking exactly?

prehistoric-beard-84272

09/18/2023, 8:21 AM
It's not about the duplicates, but that it will always include all intraday records even if the date filter is set outside of that range (e.g. for last week).

white-fireman-22476

09/18/2023, 8:22 AM
ah yes. That makes sense. We wouldn't want to include them always.
I guess we can do it on timestamp then... let me figure out what that should be exactly...

prehistoric-beard-84272

09/18/2023, 8:29 AM
_TABLE_SUFFIX is more efficient as far as I know, since the table is sharded by date (I think it only scans the matching date shards instead of the entire table)
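For context, a literal _TABLE_SUFFIX predicate on a date-sharded wildcard table limits which shards are scanned at all, which is what keeps the bytes billed down. A minimal illustration with hypothetical dates and the placeholder table names used above:
Copy code
-- Only the shards events_20230910 through events_20230916 are scanned;
-- intraday shards are excluded because 'intraday_...' sorts outside the range.
SELECT COUNT(*) AS events
FROM `defaultProjectName.defaultDataSet.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230910' AND '20230916'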

white-fireman-22476

09/18/2023, 8:30 AM
but that doesn't work on the intraday table, right?

prehistoric-beard-84272

09/18/2023, 8:30 AM
it does, the intraday set also has 2 date shards (usually)
image.png
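For anyone wanting to check which shards currently exist in their own dataset, a hedged sketch using INFORMATION_SCHEMA (the dataset name is a placeholder):
Copy code
-- List the intraday shards currently present in the dataset
SELECT table_name
FROM `defaultProjectName.defaultDataSet`.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE 'events_intraday_%'
ORDER BY table_name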

white-fireman-22476

09/18/2023, 8:32 AM
ah, I see. And the intraday tables are still in the same format as the events_* ones, but with just the last two days?

prehistoric-beard-84272

09/18/2023, 8:34 AM
yep. They are streamed in real time. The reason for the separation is that the intraday data doesn't contain all values yet. Every day the older intraday shard is converted into an events_* shard and populated with some additional info that Google has processed, and then a new empty intraday table is created.
Well, the new events_intraday_* table is created at the beginning of every day.

white-fireman-22476

09/18/2023, 8:35 AM
Ok, then maybe the best approach is like the one in the link that Graham sent. But I believe duplicates still need to be removed.

prehistoric-beard-84272

09/18/2023, 8:36 AM
That's why I said "usually 2". Sometimes there are 3 for a short period of time.

white-fireman-22476

09/18/2023, 8:44 AM
So for the user_id
Copy code
SELECT
  user_id as user_id,
  TIMESTAMP_MICROS(event_timestamp) as timestamp,
  experiment_id_param.value.string_value AS experiment_id,
  variation_id_param.value.int_value AS variation_id,
  geo.country as country,
  traffic_source.source as source,
  traffic_source.medium as medium,
  device.category as device,
  device.web_info.browser as browser,
  device.operating_system as os
FROM
  `defaultProjectName`.`defaultDataSet`.`events_*`,
  UNNEST(event_params) AS experiment_id_param,
  UNNEST(event_params) AS variation_id_param
WHERE
  REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+') BETWEEN '{{date startDateISO "yyyyMMdd"}}' AND '{{date endDateISO "yyyyMMdd"}}'
  AND event_name = 'experiment_viewed'  
  AND experiment_id_param.key = 'experiment_id'
  AND variation_id_param.key = 'variation_id'
  AND user_id is not null
GROUP BY
  user_id, event_timestamp, experiment_id, variation_id, country, source, medium, device, browser, os
should work if you replace the defaultProjectName.defaultDataSet bit, right?
wait...
ok updated it.
could you try that last one?
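The trick in the query above is that REGEXP_EXTRACT pulls just the digits out of the suffix, so the daily and intraday shards normalize to the same date string. A quick illustration with hypothetical suffix values (note that applying a function to _TABLE_SUFFIX may affect shard pruning, which is exactly the reviewer concern discussed further down):
Copy code
SELECT
  REGEXP_EXTRACT('20230918', r'[0-9]+') AS daily_suffix,             -- '20230918'
  REGEXP_EXTRACT('intraday_20230918', r'[0-9]+') AS intraday_suffix  -- '20230918'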

prehistoric-beard-84272

09/18/2023, 8:51 AM
Yes, that works. I tested it with page_views as I don't have GB experiments in the events table yet (I just started running a/a tests over the weekend).
Never mind, it also works for the experiments.

white-fireman-22476

09/18/2023, 8:55 AM
Thanks for checking. I'm not sure yet if the GROUP BY is really needed ... we might deduplicate it elsewhere.
Do you get the same results both with and without it?

prehistoric-beard-84272

09/18/2023, 8:56 AM
I checked for the experiments query: 91123 rows vs 91118 rows (grouped). I'm quite surprised about that, to be honest.

white-fireman-22476

09/18/2023, 9:01 AM
hrmm... I need to research more. I think GA4 might delete from the intraday table as it adds to the historical one... but I'm not sure how transactional that is. I've seen some things saying there could be duplicates and others saying there are not. So I'm not sure if the GROUP BY removed those 5 duplicate rows, or if new rows were added in the time between your two queries.

prehistoric-beard-84272

09/18/2023, 9:05 AM
Sorry - my blunder there: this is a fairly high-traffic site, and I had today's intraday data included in the last query, so those 5 extra events came in between running the two queries. If I only include the last events_* shard (9/16) and the first intraday shard (9/17), then both values are identical.
That said, there can be short periods where the new events_* shard has been created while the old intraday shard has not been deleted yet.

white-fireman-22476

09/18/2023, 9:08 AM
Yes, I'd imagine that might be the case, but I haven't found an authoritative source saying so yet. Have you seen it somewhere?

prehistoric-beard-84272

09/18/2023, 9:09 AM
Nope, not that I've noticed. I don't think that state lasts longer than a few minutes, tops.

white-fireman-22476

09/18/2023, 9:10 AM
Having the GROUP BY can't hurt the accuracy... though perhaps it makes the query a bit more expensive.

prehistoric-beard-84272

09/18/2023, 9:14 AM
Nope, just tested, and the bytes billed are exactly the same. The grouped query might run a bit longer though.

white-fireman-22476

09/18/2023, 9:18 AM
Hey, thanks so much. This is really helpful. I will change the default query to use the new regex and include the GROUP BY then. I imagine the difference in runtime will not be that large, and it's probably worth it to avoid a future user complaining because they ran a query at the wrong time and got duplicates.

prehistoric-beard-84272

09/18/2023, 9:19 AM
My pleasure, thank you too! If you need any input regarding GA4 and BigQuery feel free to hit me up any time.

white-fireman-22476

09/18/2023, 9:26 AM
Thanks! Can you check the runtime of the two queries, just to make sure they are indeed reasonable? That would be a big help.

prehistoric-beard-84272

09/18/2023, 9:28 AM
That would need some scientific testing, because I just did a few runs and the results were all over the map. I guess hardware availability (CPU usage) has a bigger impact than the GROUP BY.

white-fireman-22476

09/18/2023, 9:29 AM
By all over the map, do you mean orders of magnitude different?

prehistoric-beard-84272

09/18/2023, 9:31 AM
No, more like each query ranges between 4 and 8 seconds without an obvious tendency. Maybe we need to adjust GB to run a/b tests on BigQuery 😛

white-fireman-22476

09/18/2023, 9:54 AM
FWIW here is the PR to add this for future GB users: https://github.com/growthbook/growthbook/pull/1673. Thanks so much for helping me test and making GB better! Let me know if you would like me to mention you by name or GitHub handle or anything as part of the PR.

prehistoric-beard-84272

09/18/2023, 10:04 AM
No need. Thanks for creating this great tool and providing it as open source!
❤️ 1
@white-fireman-22476: Does this update also affect the queries for discovered metrics?

white-fireman-22476

09/20/2023, 9:02 AM
Thanks Sascha for pointing that out. Yes, I have updated the PR to include the metrics and the auto-generated metrics. However, a reviewer is concerned that the query as-is would not be using the partitions correctly. If you are willing to help out more, could you do an EXPLAIN ANALYZE in BQ for the following three queries and post the results? 1. The original query:
WHERE _TABLE_SUFFIX BETWEEN ...
2. How it is in the PR:
WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+') BETWEEN ...
3. Using SUBSTR:
WHERE (_TABLE_SUFFIX BETWEEN ... OR SUBSTR(_TABLE_SUFFIX, 9) BETWEEN ...)

prehistoric-beard-84272

09/20/2023, 9:06 AM
Sure will do. Will send an update here
❤️ 1

white-fireman-22476

09/20/2023, 9:32 AM
If you could also verify that 2 and 3 return the same thing by using an end date that is an hour in the past or so, to make sure I got the query correct, that would be great. (Eventually I'm going to need to set up a GA4 instance so I can test things out more easily myself. 🙂 )
ah, it looks like my 3rd query didn't get the latest day.
I think it might need to be SUBSTR(_TABLE_SUFFIX, 10)
Thanks for also testing it with the UNION ALL.
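For anyone reproducing this, a hedged, runnable version of the three filters with hypothetical dates and the placeholder table names used above. The key detail is that the events_* wildcard also matches the intraday shards (suffix 'intraday_YYYYMMDD'), which is why variants 2 and 3 pick up the streaming data and variant 1 does not:
Copy code
-- 1. Original: numeric suffixes only, intraday shards excluded
SELECT COUNT(*) FROM `defaultProjectName.defaultDataSet.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230901' AND '20230919';

-- 2. As in the PR: normalize both suffix shapes to their date digits
SELECT COUNT(*) FROM `defaultProjectName.defaultDataSet.events_*`
WHERE REGEXP_EXTRACT(_TABLE_SUFFIX, r'[0-9]+') BETWEEN '20230901' AND '20230919';

-- 3. SUBSTR variant: offset 10 skips the 9-character 'intraday_' prefix
SELECT COUNT(*) FROM `defaultProjectName.defaultDataSet.events_*`
WHERE (_TABLE_SUFFIX BETWEEN '20230901' AND '20230919'
       OR SUBSTR(_TABLE_SUFFIX, 10) BETWEEN '20230901' AND '20230919');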

prehistoric-beard-84272

09/20/2023, 10:24 AM
BigQuery does not have EXPLAIN ANALYZE in the web interface, but they offer additional menus with execution details. Run for 1, without UNION ALL:
Run 2
Run 3

white-fireman-22476

09/20/2023, 10:32 AM
Hrmm, the one with the UNION ALL seems to return the same number of rows. I guess that is because yesterday was fully loaded.

prehistoric-beard-84272

09/20/2023, 10:37 AM
Run 1 with union all
Not sure why the last union all was not correct; this is the correct one.
So it looks like all variants that include intraday return the same results.

white-fireman-22476

09/20/2023, 10:50 AM
Thanks! That's good. It does seem like the slot time consumed goes from 5s to 9s for either the regex or the substr, and 11s for the union all. The bytes billed are trivially higher, which makes sense as there is one more day of data. I'm not an expert on BQ costs, but I think slot time is the main driver, and it would suggest that adding the most recent day might almost double the query cost. 😞 For some of our customers with tons of data, the query costs can be substantial. So unless there is a fourth way, I'm not sure we will be able to make this the default for everyone. I'll try to talk with the BQ cost expert here and see if we can come up with anything.

prehistoric-beard-84272

09/20/2023, 10:55 AM
Billing is mostly by query size (bytes), not computation. While they charge for computation, it's usually negligible compared to the bytes billed. The reason the costs increased a lot is that my first day with experiments was Sep 15th, and for the last 2 days I ran experiments continuously. Actually, I just double-checked the data, and the costs are basically the same; the queries were about 5GB each. Each user has 1TB free per month, and even beyond that it's marginal. https://cloud.google.com/bigquery/pricing
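As a side note on measuring this, both numbers can be pulled after the fact from INFORMATION_SCHEMA. A hedged sketch; the region qualifier is an assumption and should match where your dataset lives:
Copy code
-- Bytes billed vs. slot time for your recent queries
SELECT job_id, total_bytes_billed, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY creation_time DESC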

white-fireman-22476

09/20/2023, 11:04 AM
From my reading of the pricing page, for people on on-demand pricing it is based on bytes, and adding the most recent day won't add much (or anything, if they are below the free limit). However, for people on capacity pricing it is billed by compute time, and the slot time consumed will matter.

prehistoric-beard-84272

09/20/2023, 11:06 AM
Hmmm, fair point. Though I'd expect that people on that plan already pay a lot to BQ anyway.

white-fireman-22476

09/20/2023, 11:07 AM
right ... and they wouldn't necessarily want to pay double. Anyway thanks so very very much for helping out here.
I'll let you know what we end up deciding.

prehistoric-beard-84272

09/20/2023, 11:09 AM
Sure thing. Yes please, I would be curious. I think real time might be worth the extra cost.

white-fireman-22476

09/20/2023, 12:22 PM
Hi Sascha. Could you please try one more thing that has a higher chance of using the partitions correctly:
WHERE (_TABLE_SUFFIX BETWEEN "20230901" AND "20230919" OR _TABLE_SUFFIX BETWEEN "intraday_20230901" AND "intraday_20230919")
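Spelled out as a runnable sketch with hypothetical dates and placeholder table names. The idea is that plain literal BETWEENs on both suffix families give the optimizer the best chance to prune shards, since no function is applied to _TABLE_SUFFIX:
Copy code
SELECT COUNT(*)
FROM `defaultProjectName.defaultDataSet.events_*`
WHERE (
  _TABLE_SUFFIX BETWEEN '20230901' AND '20230919'
  OR _TABLE_SUFFIX BETWEEN 'intraday_20230901' AND 'intraday_20230919'
)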

prehistoric-beard-84272

09/20/2023, 1:20 PM
Interesting idea, never used that before.

white-fireman-22476

09/20/2023, 2:12 PM
What was the slot time consumed on those? Your screenshot cut it off

prehistoric-beard-84272

09/20/2023, 2:13 PM
Sorry, here you go

white-fireman-22476

09/20/2023, 3:25 PM
Thanks
Hi Sascha. Here is something slightly unrelated. I saw at the end of https://cloud.google.com/bigquery/docs/querying-wildcard-tables#filter_selected_tables_using_table_suffix that they mention queries generally perform better when the prefix is longer. Could you compare query 1 with a modified query 1 that has a longer prefix, i.e.
FROM project.dataset.events_202309*
If it is a good bit faster/cheaper for bigger GA databases, then we might have to go with the UNION ALL to take advantage of it. Also, how old is your GA database? I want to calculate what percentage of the data older than 202309 we would be excluding.
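To make the comparison concrete, a hedged sketch of the two variants with placeholder names; note that with the longer prefix, _TABLE_SUFFIX becomes only the part matched by the asterisk:
Copy code
-- Shorter prefix: every table starting with events_ is a match candidate
SELECT COUNT(*) FROM `defaultProjectName.defaultDataSet.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20230801' AND '20230831';

-- Longer prefix: only August shards are candidates, and the suffix is
-- just the day-of-month digits
SELECT COUNT(*) FROM `defaultProjectName.defaultDataSet.events_202308*`
WHERE _TABLE_SUFFIX BETWEEN '01' AND '31';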

prehistoric-beard-84272

09/21/2023, 10:02 AM
If you don't need queries specific to experiment_viewed, I can use another property and query about 10 months of data.
events_202309* would query only data for September, not before nor after

white-fireman-22476

09/21/2023, 10:04 AM
Yeah it doesn't need to be specific. I think if it is also date partitioned it would give us a good idea of the performance benefit.

prehistoric-beard-84272

09/21/2023, 10:04 AM
Do you have specific scenarios in mind?

white-fireman-22476

09/21/2023, 10:05 AM
Yeah we have both the start date and the end date of the experiment, so we should be able to use the longest common prefix between the two dates.

prehistoric-beard-84272

09/21/2023, 10:05 AM
I see
Something like this?

white-fireman-22476

09/21/2023, 10:10 AM
yes
compared with
events_* WHERE _TABLE_SUFFIX BETWEEN "20230801" AND "20230831"

prehistoric-beard-84272

09/21/2023, 10:19 AM
Downside with this one is that it will throw an error for intraday tables if there aren't any intraday tables with that suffix (like for the August date range).
The aggregation seems to have a significantly bigger impact than the WHERE clause.
Also, the problem is that it varies. For instance, re-runs of the second query (shorter prefix, longer suffix): that's why I said this would require more scientific testing.
And for the longer prefix / shorter suffix:

white-fireman-22476

09/21/2023, 10:33 AM
Yeah, that does seem to have high variance. It looks like cost-wise it wouldn't matter for you or the majority of GA4 users, who are billed by data, since that is the same for both queries. For those who pay based on slot time... yeah, that would be too hard to tell without lots more testing. I did see the longer-prefix one take 12s vs 14s... not sure how variable that is.

prehistoric-beard-84272

09/21/2023, 10:35 AM
I'm not sure. I guess getting representative numbers would require testing via the API (how GB actually runs the queries) and seeing if you can get performance stats.

white-fireman-22476

09/21/2023, 10:36 AM
Downside with this one is that it will throw an error for intraday tables if there aren't any intraday tables with that suffix (like for the August date range).
Does that throw an error for the UNION ALL approach too? Theoretically, we can also have the default query skip the intraday part when the end_date is more than 2 days in the past.

prehistoric-beard-84272

09/21/2023, 10:38 AM
Yes, it also throws an error for the UNION ALL. A filter might work, but as soon as you run tests across months or over a new year there are a lot of conditions you need to check.

white-fireman-22476

09/21/2023, 10:39 AM
We can do the date comparison in JS.
👍 1
Anyway, my takeaway here is that there is probably no difference in cost for most people and maybe a slight improvement in query speed. Since the documentation recommends it as a best practice, it can't hurt to do it... except that the query becomes harder for people to read and grok, which I guess is a bit of a tradeoff.

prehistoric-beard-84272

09/21/2023, 10:50 AM
Yes, sounds good. You will notice if the tickets here go up 🙂

white-fireman-22476

09/21/2023, 11:13 AM
Ok, one last survey question for my n-of-1 study. My colleague pointed out that if we use intraday tables, then when people stop an experiment the results might keep changing for the remainder of that day. 1. How much does that matter to you? 2. If we only use the intraday tables when the experiment was started less than a day ago - so you would get immediate results on the day you start the experiment, but it would only update once a day afterwards - would that satisfy your needs?

prehistoric-beard-84272

09/21/2023, 11:21 AM
1.) I'm not sure if I understand correctly - doesn't this apply to all tables because of the 72h window? 2.) Especially for high-traffic sites this could be a problem. I would still want to be able to query current data, at least manually.

white-fireman-22476

09/21/2023, 11:30 AM
What 72h window? You mean DEFAULT_CONVERSION_WINDOW_HOURS? I believe we don't count people who convert after the end date of the experiment - if someone stopped it. Though perhaps this is not an issue if we are doing the filtering after we get the results back. Anyway, thanks for your response.
👍 1