When you reach the limits of Google Analytics custom reports and you still need more, Google Spreadsheets and the Google Analytics Add-On can take you past sampling, data consistency and dimension challenges.
This post is all about the Google Analytics + Google Spreadsheets workflow. It is an end-to-end example of how you can extract more value out of your Google Analytics data when you work with it in its raw(ish) form in Google Spreadsheets.
The end goal is to look at how blog post traffic grows or decays with time and ultimately Forecast Organic Traffic to Blog Posts (coming soon). But this post sets the foundation for that by showing how the data is extracted, cleaned and organized using a pivot table in order to get there.
There is also a bit of feature engineering to make the analysis possible. To do this we will extract date that the post was posted from the URL and effectively turn the “Date Posted ” and “Post Age” into custom dimensions of each page. But enough setup. Let’s start.
This posts assumes a few minor things but hopefully will be easy to follow otherwise:
- Google Spreadsheets Google Analytics Add-On already plugged in
- Basic familiarity with Regular Expressions aka. RegEx (helpful but not necessary)
- Basic familiarity with Pivot Tables in Google Spreadsheets
- Blog URLs with dates as subdirectories eg “/2015/08” (or collect post date as Custom Dimension)
Creating Your Custom Report
Once you have the Google Analytics Add-On up and running this is actually pretty simple. It just takes a bit of trial and error to ensure that you’ve queried exactly what you want. From the Report Configuration tab, most of the time there are only a few, but very important, fields that you will need to worry about: Date Ranges, Dimensions, Metrics, and Filters.
The purpose of this is to look at sessions by blog post URLs by month, over the entire history of the site
I chose the Last N Days because I know that is roughly the age of my site and a date range that is too broad is ok. I chose sessions because this relates to how many people will land on a given landing page in a given month of a given year. So that is all pretty straight forward.
Filters can get a bit tricky. A good practice for setting up filters is to first build them using the Google Analytics Query Explorer. That will enable you to rapidly test your queries before you are ready to enter them into your spreadsheet.
There are several filter operators that you can use but I kept this one fairly simple. It consists of three components (separated by semicolons):
||Only organic traffic ( == means equals)
||Only landing page URLs that contain 20 (@= means contains)
||Only landing page URLs that do not contain a query string because that’s mostly garbage. (!@ means does not contain)
I used ga:landingPagePath rather than page title because that can be less consistent than URLs. They are more likely to change and will sometimes show as your 404 page title. Blog post URL’s are a more consistent unique identifier for a post but it is important to note that sometimes people will change blog posts URLs. We will deal with that later.
Cleaning the Data for Consistency
Even with good data collection practices, cleaning the data is extremely important for accurate analysis. In my case, I had changed a couple blog post URLs over time and had to manually omit a few that my query’s filter did not catch. In this case, data cleansing becomes very important for two reasons: 1. Posts that are not combined by their unique and consistent URL will show as two separate posts in the pivot table which will skew summary statistics and 2. Posts that are not real URLs with a small number of sessions will really skew summary statistics. Consistency is key for Pivot Tables.
So in this case, for small exceptions, I just deleted the rows. For more common exceptions, I used the REGEXREPLACE function. This is a powerful tool for data cleansing and unique to Google Spreadsheets. It allows you to select a part of a string that matches a RegEx pattern and replace with with whatever you might want. In this case, I just searched what I wanted to remove and replaced it with an empty string. eg.
I used this to remove ”/blog” from the URLs that were used before I transitioned from Squarespace to WordPress and date numbers because some had been changed when the post was updated.
Extracting Blog Post Posting Dates
Extracting the date that the post was posted is actually pretty simple. Again, I used another RegEx function, REGEXEXTRACT to do it:
The RegEx pattern finds any four digits then a slash and any one or two digits. Then the extracted string is split into two cells by the slash. This yields the year in one column and the month in the next. I combined the month and year of the post date and the month and the year of the Google Analytics date into Date objects so that I could use the DATEDIFF function to calculate the age of the blog post as a dimension of the post. Maybe this is too much gory detail but hopefully it’s useful to somebody.
Finally, we end up with is something that looks like this. This allows for pivoting the data about each post by each month of age.
Step 4: Pivot Tables FTW!
Finally, all this work pays off. The result is one table of blog post URLs by Age and one table of URLs by summary statistics.
The blog post age table allows me to see if the posts show a natural growth or decay of traffic over time. The big purpose of the table is to create the chart below that basically shows that growth and decay don’t look that consistent. But this is an important finding which helps frame thinking about modeling blog traffic in 2016.
(Not shown here, the Google Spreadsheets SPARKLINE function is a great way to visualize Google Analytics Data.)
The summary statistic pivot table will be used for forecasting 2016 traffic. This just happens to be the topic of my next post in which the tables are used to answer 1. What is the likelihood of reaching 25,000 sessions in 2016 if things stay the same? and 2. What needs to change in order to reach 25,000 sessions in 2016?
So in summary, so far, this post has proven a few things:
- If you work with Google Analytics and don’t know RegEx, go learn! Its an incredibly useful tool.
- My blog posts do not demonstrate consistent growth or decay
- I might just be as big of a nerd as my girlfriend thinks I am.
Hope it might be useful in thinking about similar problems or maybe even creating new ones!