How Do I Trigger A Scheduled Query When The GA4 Daily Export Happens?

The question we’re going to be looking at today is inspired by our course, Query GA4 Data in Google BigQuery.

How do I run a scheduled query in BigQuery as soon as the Google Analytics 4 daily export is complete for any given day?

If you’ve ever tried to build any sort of logic on top of the Google Analytics 4 daily export to Google BigQuery, you might have noticed one very annoying thing.

The daily export schedule is erratic at best. On one day, it might land in BigQuery before noon, on another day late in the afternoon, and on some other day it might arrive past midnight your time the following day!

Add to this the fact that it looks like there can be more than one load job from the service account associated with the export, and you have something of a mess on your hands.

Today’s question in Simmer Clips deals with this dilemma. I’ll help you solve it in a more deterministic way: instead of trying to figure out a time of day when it’s “safest” to run your queries, I’ll show you how you can build a simple pipeline with a sole purpose:

Run a scheduled query as soon as the daily export job is complete.

Video walkthrough

If you prefer this walkthrough in video format, you can check out the video below.

Don’t forget to subscribe to the Simmer YouTube channel for more content like this.

Overview of the problem

Before we get started, let’s take a look at the problem.

Run the following query in your project.

You need to change the region specifier `region-eu` to whatever matches the Google Cloud region where your GA4 / BigQuery dataset resides.

Also, change the dataset ID in the WHERE clause to match your Google Analytics 4 export dataset ID.

SELECT
  FORMAT_DATE('%F', (SELECT creation_time)) AS creation_date,
  MIN(FORMAT_DATE('%R', (SELECT creation_time))) AS creation_time_UTC,
  destination_table.table_id,
  total_bytes_processed,
  COUNT(*) AS load_jobs
FROM `region-eu`.INFORMATION_SCHEMA.JOBS
WHERE
  start_time > TIMESTAMP(DATE_SUB(current_date, INTERVAL 7 day))
  AND job_type = 'LOAD'
  AND destination_table.dataset_id = 'my_dataset'
GROUP BY 1, 3, 4
ORDER BY creation_date, creation_time_UTC

When you run this query, you should see something like this:

This table lists the load jobs from Google Analytics 4 over the past 7 days.

The main thing to focus on is the creation_time_UTC column. As you can see, during this 7-day stretch the load job times were actually quite consistent, with “only” four hours separating most of them. However, the load job for the table events_20221201 happened at 6:39 a.m. the following day. That’s messed up!

If I had a scheduled query that ran at 3 p.m. UTC every day, which based on the data above would be quite a good decision, it would have missed this latecomer and I’d have a gap in my target dataset.

In order to fix this, we have to:

  • Modify our scheduled query so that it only runs on demand (i.e. when manually triggered) and not with a schedule.
  • Build a trigger system that runs the scheduled query as soon as a GA4 load job happens.
  • Use the table_id from the load job to determine what is “today”.

This last point is very important. For the scheduled query to be correctly aligned in time, you can’t use the date of the export load job itself. If the daily job updated a table further in the past (a backfill), using the load job date would place the data in the wrong daily shard. For example, if a load job running on December 3rd backfills events_20221130, the scheduled query must process November 30th’s data, not December 3rd’s.

What we’ll build

Here’s a rough overview of the system that we’ll build.

  1. When the BigQuery load job completes, it generates a log entry into Google Cloud Logging (this happens automatically).
  2. In Cloud Logging, we build a Logs Router, which sends the load job log entry into a Pub/Sub topic.
  3. A Cloud Function is run when the Pub/Sub topic receives the log. The Cloud Function checks the destination table ID from the log entry, and then it runs the scheduled query in Google BigQuery, setting the target date to match that of the table ID.

There might be other, more elegant ways to do this. However, using this system gives us an opportunity to learn about some useful Google Cloud Platform components and services.

1. Create the scheduled query

The first thing you’ll need to do is build the scheduled query itself. The easiest way to do this is to build the query and then save it as a scheduled query using the BigQuery console user interface.

Here’s the query I’m going to use. It’s a simple flatten query, which takes some key dimensions from the GA4 export and flattens them into a user data table I can then use later for joins.

The focus of this article isn’t what I have in the query but rather how I orchestrate it. Take a look:

SELECT
  COALESCE(user_id, user_pseudo_id) AS user_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
  traffic_source.source AS acquisition_source,
  traffic_source.medium AS acquisition_medium,
  ARRAY_AGG((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'source') IGNORE NULLS)[OFFSET(0)] AS session_source,
  ARRAY_AGG((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'medium') IGNORE NULLS)[OFFSET(0)] AS session_medium,
  SUM(ecommerce.purchase_revenue) AS total_session_revenue
FROM `project.my_dataset.events_*`
WHERE
  _table_suffix = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY))
  AND COALESCE(user_id, user_pseudo_id) IS NOT null
GROUP BY 1, 2, 3, 4
ORDER BY session_id ASC

The purpose of this query is to generate a flattened table of the GA4 export. It generates a table of users and sessions, including the original acquisition campaign of the user as well as the session source/medium (by taking the first non-null source/medium combination of any given session and applying it to the entire session).

The key part of this query is in the WHERE clause. See the @run_date thing? That’s automatically populated based on the run time of the scheduled query. For example, if the query runs with a run time of May 25th, 2023, @run_date resolves to 2023-05-25 and the WHERE clause reads the events_20230524 table.

What you’ll learn later is that we can actually manipulate this run time to match the table that the GA4 data was loaded to. This is an absolutely critical part of this process because it allows us to safely run the scheduled query on backfilled data too.

This is a sample output of the query:

When you’re happy with the query, click the SCHEDULE button in the BigQuery console toolbar and then Create new scheduled query.

The three options you need to pay attention to in the scheduled query settings are:

  1. Set the Repeats option to On-demand. We want to run this query only when triggered manually (i.e. by the Cloud Function we’ll build later).
  2. Use a variable in the Destination Table ID.
  3. Set the query to Overwrite rather than Append.

Point (2) is important. If you want the table ID to have a similar YYYYMMDD suffix as the GA4 export, you can set the field to something like this:

events_{run_time-24h|"%Y%m%d"}

Here, the syntax in the curly braces basically takes the day before the scheduled query run time and generates a YYYYMMDD string from it. That way, if the run time of the scheduled query is May 25th, 2023, the destination table ID for the scheduled query becomes:

events_20230524

And this will match the destination table ID of the GA4 export, too, as you’ll see when we build the Cloud Function.

Set the query to Overwrite the destination table. This is because the source of the query is the entire GA4 daily table, so it doesn’t matter if you overwrite the destination table each time the query is run. As I mentioned earlier, the GA4 daily job can happen multiple times per day, and it’s possible for backfilling to happen, too.

Having the Overwrite option selected ensures the flattened table is always generated from the most recent state of the source table (the GA4 daily export table).

2. Create the Logs Router

We now have our scheduled query built, waiting to be triggered by the GA4 load job.

But that’s the end of the pipeline. We need to build everything else now.

The Logs Router looks for a log entry that is generated when a GA4 daily load job is complete. We use this as the trigger event for our entire pipeline setup.

This log entry is then shipped to a Pub/Sub topic, which starts the Cloud Function.

Now, head over to Logs Router and click the CREATE SINK button at the top to create a new router sink.

  1. Give the sink a name.
  2. Choose Google Cloud Pub/Sub topic as the destination.
  3. From the list of available Pub/Sub topics, click to create a new topic. Give the topic an ID, leave all the other settings as their defaults, and click CREATE TOPIC.
  4. In the Build inclusion filter, copy-paste the following code. Modify the dataset ID on line 3 to match the dataset ID of your GA4 export dataset.
protoPayload.methodName="jobservice.jobcompleted"
protoPayload.authenticationInfo.principalEmail="firebase-measurement@system.gserviceaccount.com"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.datasetId="my_dataset"
protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.tableId=~"^events_\d+"
  5. Click CREATE SINK once you’re done.

The inclusion filter looks for logs that have the method jobservice.jobcompleted and were specifically generated by the firebase-measurement@... service account. This is the account that performs all the Google Analytics 4 export jobs.

Here’s what the Logs Router should look like:

Just one thing remains! We need to build the Google Cloud Function that is subscribed to the Pub/Sub topic receiving these load job log entries.

3. Build the Google Cloud Function

Google Cloud Functions are serverless, well, functions that can be run on-demand in the cloud. Their purpose is to perform small, low-intensity tasks at scale.

In this case, we want the Cloud Function to subscribe to the Pub/Sub topic we created above.

We then want it to parse the log entry for the destination table details, so that we can calibrate the run time of the scheduled query to match the date of the table that the GA4 daily job created (or updated).
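Before building it, it helps to see what the function will actually receive. Here’s a heavily trimmed, illustrative sketch of the decoded log entry, showing only the fields this pipeline relies on (the field paths match the inclusion filter above and the code below); a real jobcompleted entry contains many more fields.

// Trimmed, illustrative sketch of the decoded jobcompleted log entry.
// Only the fields used by the inclusion filter and the Cloud Function are shown.
const sampleLogEntry = {
  protoPayload: {
    methodName: 'jobservice.jobcompleted',
    authenticationInfo: {
      principalEmail: 'firebase-measurement@system.gserviceaccount.com'
    },
    serviceData: {
      jobCompletedEvent: {
        job: {
          jobConfiguration: {
            load: {
              destinationTable: {
                datasetId: 'my_dataset',    // your GA4 export dataset ID
                tableId: 'events_20230524'  // the daily table the load job wrote to
              }
            }
          }
        }
      }
    }
  }
};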

So, head on over to Cloud Functions and click CREATE FUNCTION to get started.

  1. Keep the environment as 1st gen.
  2. Give the function a descriptive name.
  3. Choose a region for the function – this can be anything you like, but it would make sense to have the regions of your BigQuery datasets and related Google Cloud components be the same.
  4. Choose Cloud Pub/Sub as the trigger type.
  5. Select the Pub/Sub topic you created in the previous chapter as the trigger.
  6. Check the Retry on failure box (this retries the Cloud Function for up to 7 days in case it fails with an error).
  7. Click SAVE to save the trigger settings.

This is what it should look like:

  1. Ignore the Runtime, build, connections and security settings accordion and click NEXT to continue.

In the next screen, we need to add the function code itself. This is what gets executed when the function is run.

  1. Keep Node.js as the runtime (choose the latest non-Preview version).
  2. Click package.json to edit its contents.
  3. Add the following line to the “dependencies” property:

"@google-cloud/bigquery-data-transfer": "^3.1.3"

This is what package.json should look like (the version numbers might be different when you try this solution):
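As a rough sketch (assuming the inline editor’s default file plus the added dependency; the name and version values here are placeholders), package.json would contain something like this:

{
  "name": "ga4-export-trigger",
  "version": "1.0.0",
  "dependencies": {
    "@google-cloud/bigquery-data-transfer": "^3.1.3"
  }
}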

  1. Select index.js from the list of files to activate the code editor.
  2. Edit the Entry point field at the top of the editor to be runScheduledQuery.
  3. Copy-paste the following code into the editor.
const bigqueryDataTransfer = require('@google-cloud/bigquery-data-transfer');

exports.runScheduledQuery = async (event, context) => {
  // Update configuration options
  const projectId = '';
  const configId = '';
  const region = '';

  // Load the log data from the buffer
  const eventData = JSON.parse(Buffer.from(event.data, 'base64').toString());
  const destinationTableId = eventData.protoPayload.serviceData.jobCompletedEvent.job.jobConfiguration.load.destinationTable.tableId;

  // Grab the table date and turn it into the run time for the scheduled query
  const tableTime = destinationTableId.replace('events_', '');
  const year = tableTime.substring(0, 4),
        month = tableTime.substring(4, 6),
        day = tableTime.substring(6, 8);

  // Set the run time for the day after the table date so that the scheduled query works with "yesterday's" data
  const runTime = new Date(Date.UTC(year, month - 1, parseInt(day) + 1, 12));

  // Create a proto-buffer Timestamp object from this
  const requestedRunTime = bigqueryDataTransfer.protos.google.protobuf.Timestamp.fromObject({
    seconds: runTime / 1000,
    nanos: (runTime % 1000) * 1e6
  });

  const client = new bigqueryDataTransfer.v1.DataTransferServiceClient();
  const parent = client.projectLocationTransferConfigPath(projectId, region, configId);

  const request = {
    parent,
    requestedRunTime
  };

  const response = await client.startManualTransferRuns(request);
  return response;
};
  1. Change the projectId value to match the Google Cloud Platform project ID where you built your scheduled query.
  2. To get values for region and configId, browse to scheduled queries, open your scheduled query, and click the Configuration tab to view its details.
  3. The region value should be the Google Cloud region of the Destination dataset, so click through to that to check if you don’t remember what it was.
  4. configId is the GUID at the end of the Resource name of the scheduled query.

This is what my function looks like:

My destination dataset is multi-region EU, hence the 'eu' value for region. If your dataset is in a specific region, the value would be something like us-west1, for example.
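For illustration, the configuration block at the top of the function might look like this once filled in. These values are placeholders; use your own project ID, the region of your destination dataset, and the GUID from your scheduled query’s Resource name.

// Hypothetical configuration values – replace with your own.
const projectId = 'my-gcp-project';                      // project that hosts the scheduled query
const configId = '12345678-90ab-cdef-1234-567890abcdef'; // GUID from the scheduled query's Resource name
const region = 'eu';                                     // or e.g. 'us-west1' for a single-region dataset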

Once you’re happy, click DEPLOY at the bottom of the screen to deploy the function.

The purpose of the function is to take the log entry and parse it for the destination table ID (events_YYYYMMDD).

The date part is then extracted, and a new protocol buffer Timestamp object is created from that date, as this is required if you want to manually set the run time of the query.

The only special thing we do is move one day forward in time, because we want the run time to be one day after the destination table ID that was updated.

This is to mirror how the GA4 daily export works – it always updates yesterday’s table (or a table further in the past).
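To make the date handling concrete, here’s a small worked example using the same logic as the function above (the table ID is hypothetical):

// Suppose the GA4 load job wrote to the table events_20230524.
const destinationTableId = 'events_20230524';
const tableTime = destinationTableId.replace('events_', ''); // '20230524'
const year = tableTime.substring(0, 4);   // '2023'
const month = tableTime.substring(4, 6);  // '05'
const day = tableTime.substring(6, 8);    // '24'

// One day after the table date, at noon UTC. Date.UTC rolls over month
// boundaries automatically (e.g. the 31st + 1 becomes the 1st of the next month).
const runTime = new Date(Date.UTC(year, month - 1, parseInt(day) + 1, 12));
console.log(runTime.toISOString()); // 2023-05-25T12:00:00.000Z

// With this requested run time, @run_date resolves to 2023-05-25, so the
// scheduled query reads the events_20230524 source table and writes to the
// events_20230524 destination shard – matching the table the GA4 export updated.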

When you click DEPLOY, it will take a moment for the function to be ready for action. Once the spinner stops spinning and you see a green checkmark, you’re good to go.

4. Test the setup

I couldn’t figure out a way to manually test the Cloud Function, as it would require generating a log entry identical to those created by the GA4 export service account.

If you have an idea how to test this with a mock log entry, please let me know in the comments!

There are typically 2–3 daily load jobs from GA4, so you may need to wait a few hours (often until the afternoon or later) for a load job to be generated.

The best way to check if your scheduled queries are running is to go to the BigQuery user interface and check the PROJECT HISTORY tab at the bottom of the screen.

Each LOAD job where the owner is firebase-measurement@... should be followed shortly by a QUERY job with a Job ID that starts with scheduled_query.... If you see this, it means your setup is working, and you can verify it by visiting the dataset where your scheduled query writes its output.

If you don’t see this, there’s a problem somewhere along the line, and a good first step is to visit Logs Explorer and look for error logs related to one or more of the components in your setup.

Typical culprits are an incorrectly configured Logs Router or a mistake in the Cloud Function.

Summary

It might be a good idea to double-check your setup by watching the video as well.

As far as Google Cloud Platform pipelines go, this is not a complicated setup. You actually only had to configure three things:

  • The scheduled query that you want to run on-demand,
  • the Logs Router that listens for load job complete entries,
  • and the Cloud Function that triggers with the log entry and runs the scheduled query.

Yes, you created a Pub/Sub topic during the process, but the beauty of the Google Cloud Platform is that this was embedded in the flow of setting up the pipeline itself. You never had to configure the Pub/Sub topic or its subscriptions separately; it was all done without having to break your stride.

This is why the Google Cloud Platform is such a pleasure to work with. The way it guides the user through interconnected components is way ahead of its competition (in my opinion).

I hope you found this informative, and I hope you’ve been bitten by the Google Cloud Platform bug (if you weren’t a fan already). Let me know in the comments if you have any questions about the setup!

If you’ve built this pipeline some other way, I’d love to hear more about your approach, too.
