Google Search Console to BigQuery: The Complete Guide to GSC Bulk Export

You may have heard the hype about integrating Google Search Console and Google BigQuery using the new Bulk Export feature. If you’re like me, you recognize the power of analyzing SEO data with SQL or the BI tool of your choice (you’ve probably heard of Looker and Looker Studio).

Or you might be thinking to yourself, “So what? I have a spreadsheet program that works just fine.” 

If you want to understand what GSC+BQ means for SEO analytics, keep reading. If you just want to get into the details of using BigQuery with Google Search Console, skip ahead.

What is Google BigQuery, and what is it for?

You may have heard of Google BigQuery as it’s grown in popularity over the last 10+ years. If you work in SEO, you probably know that Google has, for several years, been promoting the native integration between Google Analytics and BigQuery, so you likely already understand that it’s a tool for analytics.

BigQuery is technically a cloud data warehouse, and you can think of a data warehouse as a database purpose-built for storing and efficiently analyzing LOTS of data. 

To put BigQuery’s scale in perspective, let’s consider my laptop, which has a hard drive that can store up to 500GB of data. BigQuery, on the other hand, can work with petabytes of data, and ONE petabyte (1,000,000GB) is 2,000 times the storage size on my laptop. And a petabyte is 50 MILLION times as much data as you can store in a Google Sheet.

Besides having massive storage capacity, a cloud data warehouse is mainly used to store data from many different sources, from marketing and advertising to human resources, to purchasing and anything else a business might want to analyze. 

A relevant SEO use case would be aggregating all your GSC, GA, and Shopify shop data to analyze what keywords you should target to increase sales.

Finally, cloud data warehouses are powerful because they are… in the cloud. Being in the cloud means that anybody with permission to connect to the data warehouse can access it from anywhere with an internet connection.

This isn’t the first time anybody has analyzed GSC data in BigQuery, but it’s much easier now. Before this integration, you would have to either build an ETL (extract, transform, load) pipeline or use a vendor like Fivetran or Airbyte. Needless to say, moving this much data from the Google Search Console API has its challenges. Trust me, I know.

Getting started with Google Search Console and BigQuery

Ok, you’re ready to jump aboard the data warehouse train. What’s next?

Setting up the Google Search Console BigQuery integration is remarkably simple. Just follow these five steps:

  1. Set up your Google accounts: You’ll need two accounts: a Google Search Console account (you must be a property owner) and a Google Cloud Platform (GCP) project. “Google Cloud projects form the basis for creating, enabling, and using all Google Cloud services (including BigQuery), enabling billing, adding and removing collaborators, and managing permissions for Google Cloud resources.” You can create a new project here.
    Create a Google Cloud Account
    *Note that you will also need to set up billing for your project. Did you think they would give this away for free?!
  2. Enable BigQuery: Now that you have your project up and running to manage access and billing, you can enable BigQuery. This link will take you directly to the BigQuery setup flow.

    You now have your own BigQuery instance!

    Don’t worry. You won’t pay anything initially, but if you keep loading in data and start to query it, your bill will add up gradually over time.
  3. Set up your BigQuery permissions: I’m taking this directly from Google’s documentation:
    Grant permission to Search Console to dump data to your project:
    • Navigate in the sidebar to IAM and Admin. The page should say Permissions for project <your_project>.
    • Click + GRANT ACCESS to open a side panel that says Add principals.
    • In New Principals, paste the following service account name: search-console-data-export@system.gserviceaccount.com
    • Grant it two roles: BigQuery Job User (bigquery.jobUser in the command-line interface) and BigQuery Data Editor (bigquery.dataEditor in the command-line interface).
    • Click Save.
  4. Set up Google Search Console (almost done!): Luckily, this is the easiest part. Just head over to the bulk export settings page in Google Search Console and enter the project ID that corresponds to the GCP project that you set up in Step 1. Also, enter the storage region that fits your data privacy and storage price demands. (This is only really significant if you’re setting this up for production.)

    You can find your project ID on your GCP project settings page. Make sure that you select the right GCP project in the project selector dropdown; the project ID is then displayed at the top of the page.
  5. Success! If I’ve done my job correctly and you’ve followed these steps, you should see a screen that looks like this.
    Bear in mind that your first export will happen up to 48 hours after your successful configuration in Search Console, and it will load data from the previous day (not historical data).

Estimating your pricing (and how much you get for free)

Guess what. This integration isn’t totally free. After all, why would Google be giving it away?

Google BigQuery pricing works similarly to Google One pricing (storage for Drive, Gmail, etc.). Google offers a hefty amount for free, then charges you for additional storage units beyond their free tier. The only difference is that BigQuery pricing also adds the dimension of what they call “analysis,” which is basically billing for the cost of processing data in analytical workloads (i.e., querying and other data operations).

Don’t worry, though. BigQuery’s free tier is pretty generous. You can store up to 10GB of data and query up to 1TB of data for free every month.

Now I’m not going to get into all the detail about the different designations for data and use (you can read about pricing here), but to put this in context, my site, which gets around 30k impressions per week, loads about 5MB of data into BigQuery per week. 

Don’t take my word for it, but at that rate, unless I grow my site or I query the data like crazy, I’ll stay under the free tier for the next five years. 
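If you want to sanity-check that claim for your own site, here is a rough back-of-the-envelope sketch in Python. It only looks at storage, reuses the 10GB free storage figure and my 5MB per week from above, and assumes (my simplification, not Google's billing logic) decimal units and a constant weekly load.

```python
FREE_STORAGE_GB = 10   # free storage tier quoted in this post
WEEKLY_LOAD_MB = 5     # roughly what my ~30k-impressions/week site loads


def weeks_until_storage_limit(free_gb: float, weekly_mb: float) -> float:
    """Weeks of exports that fit under the free storage tier (decimal units)."""
    return (free_gb * 1000) / weekly_mb


weeks = weeks_until_storage_limit(FREE_STORAGE_GB, WEEKLY_LOAD_MB)
years = weeks / 52
print(f"{weeks:.0f} weeks, or about {years:.1f} years")
```

Swap in your own weekly load to get a feel for your runway. Query costs are not part of this estimate, so treat it as a lower bound on how carefully you need to watch things, not as a bill forecast.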

💡 The volume of data is more a factor of keyword variety than of search volume. A site with a low search volume for lots of keywords will generate more data than a site with lots of search volume for a single keyword.

This is all to say, keep an eye on your storage and use. You can see how much data is stored in each table by clicking on the table in the resource viewer (on the left) and viewing the table’s “DETAILS” tab. Then you can compare that to costs with their pricing calculator.

* Note that you can stop the export at any time from the Google Search Console bulk export settings page.

There is one more thing to note about costs, specifically limiting costs. You can reduce the amount of data that is kept in BigQuery by specifying a data retention time. See this snippet from the Search Central blog.

Tables are retained forever, by default, as are partitions, subject to any global defaults set by your Google Cloud project or organization.

If you want to avoid accruing data indefinitely, we recommend putting an expiration on the partition after an acceptable period of time: a month, six months, twelve months, or whatever is reasonable for your needs and the amount of data that you accrue. Putting an expiration date on the entire table is probably not what you want, as it will delete all your data.
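As a minimal sketch of what setting a partition expiration could look like: BigQuery's ALTER TABLE ... SET OPTIONS statement accepts a partition_expiration_days option. The helper below only builds that statement for each export table so you can paste it into the BigQuery console. The project ID my-project-id and the 365-day window are placeholders I made up, and searchconsole is the default dataset name mentioned in this post, so check the output against your own project before running anything.

```python
def partition_expiration_ddl(project: str, dataset: str, table: str, days: int) -> str:
    """Build a BigQuery DDL statement that expires partitions after `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (
        f"ALTER TABLE `{project}.{dataset}.{table}`\n"
        f"SET OPTIONS (partition_expiration_days = {days});"
    )


# Hypothetical project ID and retention window; replace with your own values.
statements = [
    partition_expiration_ddl("my-project-id", "searchconsole", table, 365)
    for table in ("searchdata_site_impression", "searchdata_url_impression")
]
print("\n\n".join(statements))
```

This targets the partitions rather than the table itself, which matches the advice in the quoted snippet: old days of data age out while the table keeps receiving new exports.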

Bulk Export limitations

While we’re deep in these weeds, let’s talk about limitations. Two limitations immediately come to mind: 1) anonymized queries and 2) lack of control over how the data is organized.

Anonymized queries

Anonymized queries are Google search queries that Google deems to have the potential to identify the searcher due to the nature or uniqueness of the query. If you’ve used Google Search Console, you’re probably familiar with this limitation, so it’s not such a big deal.  

The nice thing about how the bulk export handles anonymized queries (compared to other ETL vendors who access data from the API) is that it writes rows that aggregate all the metrics for all the anonymized queries per site/URL per day instead of just omitting the rows.

These anonymized rows are helpful because you get complete sums of impressions and clicks when you aggregate the data—not just sums of impressions and clicks for data where the search query is known.

That said, while anonymized queries are a limitation, this integration handles them pretty well. The next limitation is more significant.

💡 One thing to be mindful of is the difference in anonymized query volume between the searchdata_url_impression table and the searchdata_site_impression table. As in the GSC interface, some queries for particular URLs in particular countries might be so infrequent that they could potentially identify the searcher. As a result, you’ll see a greater portion of anonymized queries in your searchdata_url_impression table than in your searchdata_site_impression table.

Schema control and configuration

UPDATE 3/29/2023: According to Google Webmaster Central, “Today we’re updating bulk data exports to allow multiple GSC properties to export to one Cloud project. To do so, you need to customize your dataset name when setting up your export to have a unique dataset name for each export.” Now you can specify the name of the dataset where the data will land (although it must be prefixed with "searchconsole_").

If you only operate or want to analyze one site, Bulk Export is plenty. But if you wanted to analyze the search data from two or more different Google Search Console accounts together, well, at this point, you’re out of luck. Google made the integration easy to set up and manage, but that all comes at the expense of configuration.

The issue is that you can only associate one GSC account with one GCP project. And the dataset name has to be “searchconsole.” (This is no longer true! See the update above.)

If you built your own integration or you used an off-the-shelf ETL tool to extract your GSC data, you could land data from multiple accounts into the same project, into the same schema, and the same tables. That way, you could compare how several owned sites rank against a keyword on a given day.

Before the update above, as far as I could tell, it was impossible to analyze data from two or more different GSC accounts together with the bulk export integration.

Get to know your Google Search Console data

The bulk export condenses all of Google Search Console’s performance reports into two tables: searchdata_url_impression and searchdata_site_impression. In addition to that, the integration also maintains a log table named ExportLog that “contains information about each successful export to one of the previous data tables.”

The difference between searchdata_url_impression and searchdata_site_impression is what we call “granularity.” Granularity refers to the “grain” or level of aggregation within a table or query. 

You could imagine a table with an “atomic” grain of Google search data, which would basically be a row for each individual search query and all the corresponding metadata. Since that’s completely infeasible due to the massive amount of data that the tables would store (and the obvious privacy issues), search performance tables are aggregated into a coarser grain.

Each table’s metrics are aggregated into one line for each combination of several variables. 

The granularity of the searchdata_site_impression table is determined by how impression data falls into the following dimensions:

  • Date
  • Property URL
  • Query
  • Anonymized query (True/False)
  • Country
  • Search Type
  • Device

In BigQuery, the searchdata_site_impression looks like this:

The searchdata_url_impression table is significantly more granular than the previous table. In addition to the dimensions listed above, the data in searchdata_url_impression is broken down further by URL and by a set of boolean search-appearance dimensions, all of which are listed in the Data Dictionaries section below.

The image below shows how “wide” this table is because of all the boolean values. I’ve highlighted the “true” values to illustrate how infrequently the values will differ for most sites. For some sites, these boolean values will be critical, but it all depends on your SEO strategy.

Lots of boolean columns that describe how the URL was displayed in a SERP

Data Dictionaries

One thing worth noting about the Google Search Console Export is that not all fields will be relevant to all Google Search Console properties. For example, cooking websites will have data related to recipe search features, while ecommerce websites will have data related to shopping features. For the most part, these columns will be filled with mostly false values, so don’t worry if that is the case.

Here are descriptions of each of the fields in the two analytical tables:

searchdata_site_impression

Here is a description for every field in the searchdata_site_impression table:

| Dimension or Metric | Field name | Type | Description |
|---|---|---|---|
| DIMENSION | data_date | DATE | The day on which the data in this row was generated (Pacific Time). |
| DIMENSION | site_url | STRING | URL of the property. For domain-level properties, this will be sc-domain:property-name. For URL-prefix properties, it will be the full URL of the property definition. Examples: sc-domain:developers.google.com, https://developers.google.com/webmaster-tools/ |
| DIMENSION | query | STRING | The user query. When is_anonymized_query is true, this will be a zero-length string. |
| DIMENSION | is_anonymized_query | BOOLEAN | Rare queries (called anonymized queries) are marked with this bool. The query field will be null when it’s true to protect the privacy of users making the query. |
| DIMENSION | country | STRING | Country from where the query was made, in ISO-3166-1-Alpha-3 format. |
| DIMENSION | search_type | STRING | One of the following string values: web: The default (“All”) tab in Google Search; image: The “Image” tab in Google Search; video: The “Video” tab in Google Search; news: The “News” tab in Google Search; discover: Discover results; googleNews: news.google.com and the Google News app on Android and iOS. |
| DIMENSION | device | STRING | The device from which the query was made. |
| METRIC | impressions | INTEGER | The number of impressions for this row. |
| METRIC | clicks | INTEGER | The number of clicks for this row. |
| METRIC | sum_top_position | INTEGER | The sum of the topmost position of the site in the search results for each impression in that table row, where zero is the top position in the results. To calculate the average position (which is 1-based), calculate SUM(sum_top_position)/SUM(impressions) + 1 |
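The average position formula in the sum_top_position description is easy to get subtly wrong, so here is a small worked sketch in Python. The two rows are made up for illustration, and the function is my own helper, not part of any Google library. It shows why you sum first and divide second, instead of averaging per-row averages.

```python
from typing import List, Optional, Tuple


def average_position(rows: List[Tuple[int, int]]) -> Optional[float]:
    """Return the 1-based average position for (impressions, sum_top_position) rows."""
    total_impressions = sum(imps for imps, _ in rows)
    total_position = sum(pos for _, pos in rows)
    if total_impressions == 0:
        return None  # no impressions means there is no position to report
    return total_position / total_impressions + 1


# Made-up rows: 10 impressions at the top spot (zero-based position 0),
# and 30 impressions at zero-based position 2 (30 * 2 = 60).
rows = [(10, 0), (30, 60)]

weighted = average_position(rows)                                  # SUM / SUM + 1
naive = sum(average_position([row]) for row in rows) / len(rows)   # mean of row averages
print(weighted, naive)  # 2.5 2.0
```

The naive number treats both rows as equally important, while the SUM/SUM version weights each row by its impressions, which is what the formula in the table does. The same logic applies to sum_position in the searchdata_url_impression table.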

searchdata_url_impression

Here is a description for every field in the searchdata_url_impression table:

| Dimension or Metric | Field name | Type | Description |
|---|---|---|---|
| DIMENSION | data_date | DATE | The day on which the data in this row was generated (Pacific Time). |
| DIMENSION | site_url | STRING | URL of the property. For domain-level properties, this will be sc-domain:property-name. For URL-prefix properties, it will be the full URL of the property definition. |
| DIMENSION | url | STRING | The fully-qualified URL where the user eventually lands when they click the search result or Discover story. |
| DIMENSION | query | STRING | The user query. When is_anonymized_query is true, this will be a zero-length string. |
| DIMENSION | is_anonymized_query | BOOLEAN | Rare queries (called anonymized queries) are marked with this bool. The query field will be null when it’s true to protect the privacy of users making the query. |
| DIMENSION | is_anonymized_discover | BOOLEAN | Whether the data row is under the Discover anonymization threshold. When under the threshold, some other fields (like URL and country) will be missing to protect user privacy. |
| DIMENSION | country | STRING | Country from where the query was made, in ISO-3166-1-Alpha-3 format. |
| DIMENSION | search_type | STRING | One of the following string values: web: The default (“All”) tab in Google Search; image: The “Image” tab in Google Search; video: The “Video” tab in Google Search; news: The “News” tab in Google Search; discover: Discover results; googleNews: news.google.com and the Google News app on Android and iOS. |
| DIMENSION | device | STRING | The device from which the query was made. |
| DIMENSION | is_amp_top_stories | BOOLEAN | Appearance in the Top Stories carousel https://support.google.com/news/publisher-center/answer/9607026?hl=en |
| DIMENSION | is_amp_blue_link | BOOLEAN | Appearance as an AMP page listed as a normal blue link |
| DIMENSION | is_job_listing | BOOLEAN | Appearance as a job posting result listed in a collection of job postings https://developers.google.com/search/docs/appearance/structured-data/job-posting |
| DIMENSION | is_job_details | BOOLEAN | Appearance as a job posting result that shows a summarized view of a job https://developers.google.com/search/docs/appearance/structured-data/job-posting |
| DIMENSION | is_tpf_qa | BOOLEAN | Appearance as a Q&A page rich result https://developers.google.com/search/docs/appearance/structured-data/qapage |
| DIMENSION | is_tpf_faq | BOOLEAN | Appearance as an FAQ page https://developers.google.com/search/docs/appearance/structured-data/faqpage |
| DIMENSION | is_tpf_howto | BOOLEAN | Appearance as a How-to rich result https://developers.google.com/search/docs/appearance/structured-data/how-to |
| DIMENSION | is_weblite | BOOLEAN | A deprecated search feature that allowed Google to serve faster, lighter pages to people searching on entry-level devices. https://support.google.com/websearch/answer/9836344 |
| DIMENSION | is_action | BOOLEAN | Appearance with an action that can be taken on the result https://developers.google.com/assistant/content/overview |
| DIMENSION | is_events_listing | BOOLEAN | Appearance as an event listed among other events https://developers.google.com/search/docs/appearance/structured-data/event |
| DIMENSION | is_events_details | BOOLEAN | Appearance as a detailed description of an event https://developers.google.com/search/docs/appearance/structured-data/event |
| DIMENSION | is_search_appearance_android_app | BOOLEAN | Appearance as an Android app result |
| DIMENSION | is_amp_story | BOOLEAN | Appearance as an article featured in the Top Stories carousel. These used to require AMP but do not any longer. https://developers.google.com/search/docs/appearance/enable-web-stories |
| DIMENSION | is_amp_image_result | BOOLEAN | An Image Search result, where the image is hosted in an AMP page https://developers.google.com/search/blog/2019/07/helping-publishers-and-users-get-more |
| DIMENSION | is_video | BOOLEAN | Appearance as a video feature that appears in either general search results (type Web) or Discover https://developers.google.com/search/blog/2019/10/search-console-video-results-reports |
| DIMENSION | is_organic_shopping | BOOLEAN | Appearance as an organic shopping result https://developers.google.com/search/blog/2022/11/shopping-tab-with-search-console |
| DIMENSION | is_review_snippet | BOOLEAN | Appearance with a review snippet rich result https://developers.google.com/search/docs/appearance/structured-data/review-snippet |
| DIMENSION | is_special_announcement | BOOLEAN | Appearance with a special announcements structured data element with information. For example, information about COVID-19. https://developers.google.com/search/blog/2020/05/special-announcements-search-console |
| DIMENSION | is_recipe_feature | BOOLEAN | Appearance as a recipe listing feature that lists information specific to that recipe. https://developers.google.com/search/docs/appearance/structured-data/recipe |
| DIMENSION | is_recipe_rich_snippet | BOOLEAN | A Recipe rich result that appeared outside of the recipe list with more detail https://developers.google.com/search/docs/appearance/structured-data/recipe |
| DIMENSION | is_subscribed_content | BOOLEAN | Appearance as subscribed content |
| DIMENSION | is_page_experience | BOOLEAN | Appearance while qualified for good page experience https://support.google.com/webmasters/answer/10218333?hl=en |
| DIMENSION | is_practice_problems | BOOLEAN | A practice problem search feature https://developers.google.com/search/docs/appearance/structured-data/practice-problems |
| DIMENSION | is_math_solvers | BOOLEAN | A math problem search feature https://developers.google.com/search/docs/appearance/structured-data/math-solvers |
| DIMENSION | is_translated_result | BOOLEAN | Appearance as a translated result https://developers.google.com/search/docs/appearance/translated-results |
| DIMENSION | is_edu_q_and_a | BOOLEAN | Appearance as a Q&A feature for flashcard pages https://developers.google.com/search/docs/appearance/structured-data/education-qa |
| METRIC | impressions | INTEGER | The number of impressions for this row. |
| METRIC | clicks | INTEGER | The number of clicks for this row. |
| METRIC | sum_position | INTEGER | A zero-based number indicating the topmost position of this URL in the search results for the query. (Zero is the top position in the results.) To calculate the average position (which is 1-based), calculate SUM(sum_position)/SUM(impressions) + 1. |

Tips and Tricks for SQL Analysis

Now that we’ve gotten through all the nitty gritty details, let’s take a look at what you can do with the Google Search Console data. I’m only including a few simple queries here because 1) many of the queries I described in my SEO analytics in SQL post will apply here, and 2) the possibilities are nearly endless, so, at best, I can only inspire you to try things. 

Let’s start with the basics and then get a little more interesting.

Using multiple filters at the same time

This is one thing that bugs me about Google Search Console. Unlike Google Analytics, where you can apply multiple filters on multiple dimensions, GSC only allows you to apply one filter per dimension. If you have a big site with lots of content and different hierarchies, this can be very limiting. 

Luckily, in BigQuery, you can apply as many filters as you like on a single dimension. Here’s an example of how I would aggregate impressions and clicks for all keywords that contain all three of the following tokens: “web,” “analytics,” and “consult” (as in “consulting” or “consultant”).

If you uncomment the three commented lines, you’ll get stats aggregated by search query.

SELECT
 -- s.query,
 SUM(s.impressions) AS imps,
 SUM(s.clicks) AS clicks
FROM `mvp-data-321618.searchconsole.searchdata_site_impression` s
WHERE s.data_date > CURRENT_DATE('America/Los_Angeles') - 28
 AND s.query LIKE '%web%'
 AND s.query LIKE '%analytics%'
 AND s.query LIKE '%consult%'
-- GROUP BY s.query
-- ORDER BY imps DESC  -- sort by impressions rather than by the query text

💡 Notice the WHERE clause. Not only do I filter on the query, but I also filter in data from the last four weeks (using the timezone that Google Search Console uses for reporting). This is a useful time range, but it also reduces the amount of data scanned and thus reduces the cost of running the query!

Keyword variety per URL

The Google Search Console UI makes it easy to aggregate sums and percentages, but it is impossible to count how many distinct queries are leading to impressions for a given URL. Here’s a query to do just that.

SELECT
 u.url,
 COUNT(DISTINCT u.query) distinct_queries,
 SUM(u.impressions) imps,
 SUM(u.clicks) clicks
FROM `mvp-data-321618.searchconsole.searchdata_url_impression` u
WHERE u.data_date > CURRENT_DATE('America/Los_Angeles') - 28
GROUP BY u.url
ORDER BY 2 DESC

💡 Notice that I’ve used the DISTINCT keyword to count each unique query only once. If I hadn’t done this, I would have counted each query whether it was a duplicate or not.
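To make the difference concrete, here is a tiny sketch you can run locally. It uses Python's built-in sqlite3 as a stand-in for BigQuery (so this is SQLite's dialect, not BigQuery's) and three made-up rows in a cut-down version of the table. The same query shows up on two different days, which is exactly the situation where COUNT and COUNT(DISTINCT ...) disagree.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searchdata_url_impression (data_date TEXT, url TEXT, query TEXT)"
)
conn.executemany(
    "INSERT INTO searchdata_url_impression VALUES (?, ?, ?)",
    [
        ("2023-03-01", "/post", "gsc bigquery"),
        ("2023-03-02", "/post", "gsc bigquery"),     # same query, next day
        ("2023-03-02", "/post", "gsc bulk export"),
    ],
)

row = conn.execute(
    """
    SELECT
      url,
      COUNT(query) AS query_rows,
      COUNT(DISTINCT query) AS distinct_queries
    FROM searchdata_url_impression
    GROUP BY url
    """
).fetchone()
print(row)  # ('/post', 3, 2)
```

Because the export writes one row per day (and per country, device, and so on), a plain COUNT mostly measures how many rows you have, not how many different queries people typed. One thing to keep in mind: if the anonymized rows carry an empty string as their query, as the data dictionary suggests, that bucket will likely count as one extra distinct value.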

It’s interesting to see that the URLs with the most distinct queries are technical posts that probably have a lot of specific but infrequent queries about the topic. If you’ve ever heard about zero search volume queries that bring traffic, you’re looking at it!

How much site traffic is coming from anonymized queries?

The question about query variety raises another question. How many queries are coming in so infrequently that they are aggregated into the anonymized query bucket? Here’s a query to provide some statistics and insights about that.

SELECT
 u.url,
 SUM(u.impressions) total_imps,
 SUM(CASE WHEN u.is_anonymized_query = TRUE THEN u.impressions END) anonymized_imps,
 SUM(u.clicks) total_clicks,
 SUM(CASE WHEN u.is_anonymized_query = TRUE THEN u.clicks END) anonymized_clicks,
 ROUND(SAFE_DIVIDE(SUM(CASE WHEN u.is_anonymized_query = TRUE THEN u.impressions END), SUM(u.impressions)),2) pct_anonymized_imps,
 ROUND(SAFE_DIVIDE(SUM(CASE WHEN u.is_anonymized_query = TRUE THEN u.clicks END), SUM(u.clicks)),2) pct_anonymized_clicks
FROM `mvp-data-321618.searchconsole.searchdata_url_impression` u
WHERE u.data_date > CURRENT_DATE('America/Los_Angeles') - 28
GROUP BY u.url
HAVING
 total_imps > 1000
 AND total_clicks > 0
ORDER BY pct_anonymized_imps desc

💡 Using a CASE statement within an aggregate function (like SUM) filters in rows of data that evaluate to true. This is how I sum anonymized queries vs. all queries.
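Here is a small local sketch of that CASE-inside-SUM pattern, again using Python's sqlite3 as a stand-in for BigQuery with made-up rows. It also shows one detail worth knowing: without an ELSE branch, a URL that has no anonymized rows at all gets NULL (None in Python) instead of 0, because the CASE returns NULL for every row and SUM has nothing to add up.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE searchdata_url_impression "
    "(url TEXT, is_anonymized_query INTEGER, impressions INTEGER)"
)
conn.executemany(
    "INSERT INTO searchdata_url_impression VALUES (?, ?, ?)",
    [
        ("/a", 1, 60),    # the anonymized bucket for /a
        ("/a", 0, 40),
        ("/b", 0, 100),   # /b has no anonymized rows at all
    ],
)

rows = conn.execute(
    """
    SELECT
      url,
      SUM(impressions) AS total_imps,
      SUM(CASE WHEN is_anonymized_query = 1 THEN impressions END) AS anon_no_else,
      SUM(CASE WHEN is_anonymized_query = 1 THEN impressions ELSE 0 END) AS anon_else_0
    FROM searchdata_url_impression
    GROUP BY url
    ORDER BY url
    """
).fetchall()
for row in rows:
    print(row)
# ('/a', 100, 60, 60)
# ('/b', 100, None, 0)
```

Whether you want NULL or 0 for pages without anonymized traffic is a judgment call. NULL makes "no anonymized rows" stand out, while ELSE 0 keeps the percentage columns numeric for every URL.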

Does it surprise you that over half of the impressions are anonymized for several of these URLs? It makes me wonder what the opportunity is here. One idea could be to break these pages out into more specific topics to see if Google would rank them differently. Maybe not my best idea, but if you ask enough questions, you’re bound to find lots of leverage!

SQL Queries to search queries to …

Now it’s your turn. You have everything you need to know about the Google Search Console + BigQuery integration to get up and running, so start analyzing your data! If you want help getting set up or running analyses, I’m happy to help. And if anything is unclear or missing, please leave a comment!
