Setting up Airbyte ETL: Minimum Viable Data Stack Part II

In the first post in the Minimum Viable Data Stack series, we set up a process to start using SQL to analyze CSV data. We set up a Postgres database instance on a Mac personal computer, uploaded a CSV file, and wrote a query to analyze the data in the CSV file. 

That’s a good start! You could follow that pattern to do some interesting analysis on CSV files that wouldn’t fit in Excel or Google Sheets. But that pattern is slow and requires you to continually upload new data as new data becomes available. 

This post will demonstrate how to connect directly to a data source so that you can automatically load data as it becomes available.

This process is called ETL, short for Extract, Transform, Load. Put simply,  ETL just means, “connecting to a data source, structuring the data  in a way that it can be stored in database tables, and loading it into those tables.” There’s a lot more to it if you really want to get into it, but for our purposes, this is all you’ll need to know right now.

For this part of the tutorial, we are going to use an open-source ETL tool called Airbyte to connect to Hubspot and load some Contact data into the same Postgres database we set up before. Then we’ll run a couple of analytical queries to whet your appetite!

Setting up an ETL Tool (Airbyte)

I chose Airbyte for this demo because it is open source, which means it’s free to use as long as you have a computer or a server to run it on. Much of it is based on the open-source work that another ETL tool, Stitch, had been pushing before it was acquired by Talend. That project was called Singer.

The best thing about Airbyte for our Minimum Viable Data Stack is that they make running the open-source code so easy because it is packaged in yet another software framework called Docker. Yes, if you’re keeping score at home, that means we are using one open-source framework packaged in another open-source framework packaged in yet another open-source framework. Oh, the beauty of open-source!

To keep this tutorial manageable, I am going to completely “hand wave” the Docker setup. Luckily, it’s easy to do. Since this tutorial is for Mac, follow the Docker installation instructions for Mac.

🎵 Interlude music plays as you install Docker 🎵

Once you’ve installed Docker, you can run the Docker app, which will then allow you to run apps called “containers.” Think of a container as “all the code and dependencies an app needs, packaged so you can basically just click start” (instead of having to load all the individual dependencies one by one!)

Setting up Docker

We’re only going to download and run one app on Docker: Airbyte! 

Note: If you need help on the next few steps, Airbyte has a Slack community that is really helpful.

To download Airbyte, the instructions are simple. Just open up your terminal (you can find this by using Mac’s Spotlight search [cmd+space] and typing in “Terminal”). In the terminal, just paste in the following three commands:

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

The commands tell your computer to copy all the code from Airbyte’s Github repository into a folder called “airbyte,” then “cd” (aka “change directory”) into the “airbyte” directory, and finally tell Docker to run the Airbyte app container.

The beauty of this is that once you run this the first time from the command line, you can start Airbyte from the Docker UI by just clicking the “play” button.

Installing Airbyte via the command line

Airbyte will do a bit of setup and then your terminal will display the text shown above. At that point Airbyte is running on your computer and to use it, all you have to do is open your browser and go to http://localhost:8000

If you’re wondering how this works, Airbyte is running a webserver to provide a web interface to interact with the code that does all the heavy-ETL-lifting. If this were a common ETL tool like Stitch or Fivetran, the webserver and the ETL processes would run on an actual server instead of your personal computer.

If everything has gone according to plan you can go to  http://localhost:8000 and see the Airbyte app UI running and ready to start ETL-ing!

Setting up the Postgres connection in Airbyte

Setting up your first ETL job (Hubspot to Postgres)

I’ll admit, that last part got a little gruesome but don’t worry, it gets easier from here (as long as everything goes according to plan…)

From here we have to connect both our database and data sources to Airbyte so it has access to the source data and permission to write to the database.

I’ve chosen to load data from Hubspot because it is really easy to connect and because it shows the ups and downs of ETL… And of course, we’re still using Postgres.

Creating a Postgres Destination

All you have to do is paste in your database credentials from Postgres.app. Here are Airbyte’s instructions for connecting to Postgres

These are the same ODBC credentials we used to connect Postico in the last article. You can find them on the Postico home screen by clicking your database name.

Postgres ODBC connection details

In my case, these are the settings:

  • Name: I called it “Postgres.app” but it could be anything you want
  • Host: Use host.docker.internal (localhost doesn’t work with Docker. See instruction above)
  • Port: 5432 
  • Database Name: Mine is “trevorfox.” That’s the name of my default Postgres.app database
  • Schema: I left it as “public.” You might want to use schemas for organizational purposes. Schemas are “namespaces” and you can think of them as folders for your tables (there’s a short example after this list). 
  • User: Again, mine is “trevorfox” because that is my default from when I set up Postgres.app
  • Password: You can leave this blank unless you set up a password in Postgres.app 
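
If you do want to organize tables into schemas, it only takes a line of SQL. This is just a sketch with a made-up schema name (hubspot_raw); if you skip it, Airbyte will write to the “public” schema you left in the settings above:

create schema hubspot_raw;                    -- creates a new namespace
select count(*) from hubspot_raw.contacts;    -- once tables exist there, you reference them as schema.table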

From there you can test your connection and you should see a message that says, “All connection tests passed!”

Creating a Hubspot Source

You’ll first need to retrieve your API key. Once you’ve got it, you can create a new Hubspot Source in Airbyte.

I used these settings:

  • Name: Hubspot
  • Source type: Hubspot
  • API Key:  My Hubspot  API key
  • start_date: I used 2017-01-25T00:00:00Z, which is the machine-readable (ISO 8601) timestamp for 1/25/2017
Setting up the Hubspot connection in Airbyte

Here’s a cute picture to celebrate getting this far!

Getting started with Airbyte

Creating the ETL Connection

Since we’ve already created a Destination and a Source, all we have to do is to tell Airbyte we want to extract from the Source and load data to the Destination. 

Go back to the Destinations screen and open your Postgres.app destination, click “add source,” and choose your source. For me, this is the source I created called “Hubspot.”

Airbyte will then go and test both the Source and Destination. Once both tests succeed, you can set up your sync. 

Setting up an ETL job with Airbyte

There are a lot of settings! Luckily you can leave most of them as they are until you want to get more specific about how you store and organize your data. 

For now, set the Sync frequency to “manual,” and uncheck every Hubspot object besides Contacts.

In the future, you could choose to load more objects for more sophisticated analysis but starting with Contacts is good because it will be a lot faster to complete the first load and the analyses will still be pretty interesting.

 Click the “Set up connection” button at the bottom of the screen.

Setting up the first sync with Airbyte

You’ve created your first Connection! Click “Sync now” and start the ETL job!

As the sync runs, you’ll see lots of logs. If you look carefully, you’ll see some that read “…  Records read: 3000” etc. which will give you a sense of the progress of the sync.

Airbyte ETL logs

What’s happening in Postgres now?

Airbyte is creating temporary tables and loading all the data into those. It will then copy that data into its final-state tables. Those tables will be structured in a way that is a bit easier to analyze. This is some more of the T and L of the ETL process!

As the sync is running, you can go back to Postico and refresh the table list (cmd+R) to see new tables as they are generated. 

Let’s look at the data!

When the job completes, you’ll notice that Airbyte has created a lot of tables in the database. There is a “contacts” table, but there are a lot of others prefixed with “contacts_.”

Why so many tables? 

These are all residue from taking data from a JSON REST API and turning it into tables. JSON is a really flexible way to organize data. Tables are not. So in order to get all that nested JSON data to fit nicely into tables, you end up with lots of tables to represent all the nesting. The Contacts API resource alone generated 124 “contacts_” tables. See for yourself:

select count(tablename)
from pg_tables t
where t.tablename like 'contacts_%';

This query reads the Postgres system table called pg_tables, which, as you probably guessed, contains a row for each table along with some metadata. By counting the tables that match the prefix “contacts_,” you can see how many tables come from the Contacts resource. 

Why you care about Data Modeling and ELT

In order to structure this data in a way that is more suitable for analysis, you’ll have to join the tables together and select the columns you want to keep. That cleaning process, plus other business logic and filtering, is called data modeling.

Recently it has become more common to model your data with SQL once it’s in the database  (rather than right after it is extracted and before it’s loaded into the database). This gave rise to the term ELT to clarify that most of the transformations are happening after the data has landed in the database. There is a lot to discuss here. I might have to double back on this at some point…
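
To make that concrete, here is a minimal sketch of what an ELT-style model could look like: a view on top of the raw table that hides the JSON handling. The contacts table, its properties column, and createdat come from the sync above; the property names inside the JSON (like country) are assumptions that will depend on your Hubspot account:

create view contacts_model as
select c.properties->>'hs_object_id' as contact_id,
	c.properties->>'country' as country,    -- assumed property name
	c.createdat::date as created_date
from contacts c;

In practice you would also join in the relevant “contacts_” child tables here. Once a view like this exists, every later query can select from contacts_model instead of repeating the JSON extraction.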

Previewing SQL data tables

Luckily, we can ignore the majority of these tables. We are going to focus on the “contacts” table.

Analyzing the Data

Let’s first inspect the data and get a feel for what it looks like. It’s good to start with a sanity check to make sure that all your contacts are in there. This query will tell you how many of the contacts made it into the database. That should match what you see in the Hubspot app.

select count(*)
from contacts

One of the first questions you’ll probably have is how are my contacts growing over time? This is a good query to demonstrate a few things: the imperfect way that ETL tools write data into tables, the importance of data modeling, and ELT in action.

In the screenshot above, you’ll notice that the “contacts” table has a bunch of columns but one of them is full of lots of data. The “properties” column represents a nested object within the Hubspot Contacts API response. That object has all the interesting properties about a  Contact like when it was created, what country they are from, and other data a business might store about their Contacts in Hubspot. 

Airbyte, by default, dumps the whole object into Postgres as a JSON field. This means you have to get crafty in order to destructure the data into columns. Here’s how you would get a Contact’s id and the date it was created. (This would be the first step towards contact counts over time.)

select c.properties->>'hs_object_id' id, 
	c.createdat::date
from contacts c
limit 10;

Notice the expression “c.properties->>'hs_object_id'.” The “->>” operator is how you extract a field (as text) from a JSON-typed column.

To count new contacts by month, we can add a little aggregation to the query above.

select date_trunc('month', c.createdat::date) created_month,
	count(distinct c.properties->>'hs_object_id')  contact_count
from contacts c
group by created_month
order by created_month desc;

THIS IS IT! This is the beauty of analytics with a proper analytics stack. Tomorrow, the next day, and every day in the future, you can run the Hubspot sync and see up-to-date metrics in this report!

You’ll learn that the more queries you run, the more you’ll get tired of cleaning and formatting the data. And that, my friends, is why data modeling and ELT exist!

SQL analytics in Postico

I changed to dark mode since the last post :]

Looking Forward

At this point, the stack is pretty viable. We have a Postgres database (our data warehouse), an ETL process that will keep our data in sync with the data in the source systems (Airbyte), and the tools for analyzing this data (SQL and Postico). Now you can answer any question you might have about your Hubspot data at whatever frequency you like—and you’ll never have to touch a spreadsheet!

The foundation is set but there is still more inspiration ahead. The natural place to go from here is a deeper dive into analysis and visualization. 

In the next post, we’ll set up Metabase to visualize the Hubspot data and create a simple dashboard based on some SQL queries. From there, I imagine we’ll head towards reverse ETL and push the analysis back to Hubspot. =]

I hope this was interesting at the least and helpful if you’re brave enough to follow along. Let me know in the comments if you got hung up or if there are instructions I should add.

From Spreadsheets to SQL: Step One towards the Minimum Viable Data Stack

“A spreadsheet is just data in single-player mode”

Last week I made a mistake. I boldly claimed that “A spreadsheet is just data in single-player mode.” And while I stand by that claim, I didn’t expect to be called to account for it.

As it turns out, the post was pretty popular and I think I know why. To me it boils down to five factors. 

  1. The scale and application of data are still growing (duh)
  2. There aren’t enough people with the skills to work with data at scale
  3. There are plenty of resources to learn SQL but the path to using it in the “real world” isn’t very clear
  4. The tools have caught up and basically, anybody with spreadsheet skills can set up a data stack that works at scale
  5. Now is a great time to upskill, become more effective in almost any career, and take advantage of the demand

The hard part? You have to pull yourself away from spreadsheets for a while—go slow to go fast (and big).

You’ll thank yourself in the end. Being able to think about data at scale will change how you approach your work and being able to work at scale will increase your efficiency and impact. On top of that, it’s just more fun!

A Minimum Viable Data Stack

In the spirit of a true MVP, this first step is going to get you from spreadsheets to SQL with the least amount of overhead and a base level of utility.

In the next hour you will:

  • Stand up an analytical database
  • Write a SQL query to replace a pivot table
  • Have basic tooling and process for a repeatable analytical workflow 

In the next hour you will not (yet):

  • Run anything on a cloud server (or be able to share reports beyond your computer)
  • Set up any continually updating data syncs for “live” reporting

But don’t underestimate this. Once you start to work with data in this way, you’ll recognize that the process is a lot less error-prone and a lot more repeatable than spreadsheet work because the code is simpler and it’s easier to retrace your steps.

Starting up a Postgres database

Postgres isn’t the first thing people think of for massive-scale data warehousing, but it works pretty well for analytics. At this scale, it is definitely the easiest way to get started, and of course, it’s free. Ultimately, you’ll end up working with BigQuery, Snowflake, or, if this were 2016, Redshift. 

I apologize in advance but this tutorial will be for a Mac. It won’t really matter once everything moves to the cloud, but I don’t own a Windows machine…

The easiest way to get a Postgres server running on a Mac is Postgres.app. It wraps everything in a shiny Mac UI and the download experience is no different than something like Spotify.

Congrats! You have installed a Postgres server on your local machine and it’s up and running!

Here are some instructions for installing Postgres on Windows. And here’s a list of  Postgres clients for Windows that you’d use instead of Postico.

Now let’s see how quickly we get connected to the database.

Your Postgres server is up and running

Connecting to Postgres

There are plenty of good SQL editors for Postgres but since we are keeping this MVP, I’m going to recommend Postico. Again, it has a simple Mac UI and is designed for more of an analytical workflow than hardcore database management. 

  • Step 1: Head over to https://eggerapps.at/postico/ and download the app
  • Step 2: Move the app to the Applications folder and then open it by double-clicking on the icon
  • Step 3: Create a database connection by clicking on the “New Favorite” button. Leave all fields blank; the default values are suitable for connecting to Postgres.app. Optionally provide a nickname, eg. “Postgres.app”. Click “Connect”
  • Step 4: Go back to Postico and choose the SQL Query icon
  • Step 5: Test your connection by running a query.
Create a new “Favorite” connection in Postico

Run the query “select * from pg_tables;” to see a list of all the tables in your Postgres database. Since you haven’t loaded any tables, you’ll just see a list of Postgres system tables that start with the prefix, “pg_.” As you probably guessed, the “pg” stands for Postgres.

Running your first SQL query in Postico

You’ve done it! You’ve started up a Postgres database, connected to it, and run your first query!

Loading data into Postgres

Ok, the boring stuff is out of the way and it’s only been about 15 minutes! Now we can get to the good part. Next, let’s load some actual data into Postgres.

Loading tables in Postgres is a little bit different (aka more involved) than loading a CSV into Google Sheets or Excel. You have to tell the database exactly how each table should be structured and then what data should be added to each table. 

You might not yet know how to run CREATE TABLE commands but that’s ok. There are tools out there that will shortcut that process for us too. 

The imaginatively named convertcsv.com generates the SQL commands to create and populate a table based on a CSV file’s contents. There are lots of ways to load data into a database, but again, this is an MVP. 

For this tutorial, I’m using the Google Analytics Geo Targets CSV list found here. Why? Because the file is big enough that it would probably run pretty slowly in a spreadsheet tool.
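
To give you an idea of what you will paste into Postico in the steps below, the generated output looks roughly like this (depending on the options you pick, it includes a CREATE TABLE statement followed by a long list of INSERTs). The column names come from the Geo Targets CSV header; the data types, lengths, and the row shown here are just illustrative:

CREATE TABLE geotargets(
   criteria_id    INTEGER NOT NULL PRIMARY KEY,
   name           VARCHAR(80) NOT NULL,
   canonical_name VARCHAR(200) NOT NULL,
   parent_id      INTEGER,
   country_code   VARCHAR(2),
   target_type    VARCHAR(30),
   status         VARCHAR(10)
);
INSERT INTO geotargets (criteria_id, name, canonical_name, parent_id, country_code, target_type, status)
VALUES (1000010, 'Example City', 'Example City,Example State,United States', 1000000, 'US', 'City', 'Active');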

  • Step 1: Head over to https://www.convertcsv.com/csv-to-sql.htm
  • Step 2: Select the “Choose File” tab to upload a CSV file. 
  • Step 3: Change the table name in the Output Options section  where it says “Schema.Table or View Name:” to “geotargets”
  • Step 4: Scroll down to the Generate Output section and click the “CSV to SQL Insert” button to update the output, then copy the SQL commands
  • Step 5: Go back to Postico and click on the SQL Query icon
  • Step 6: Paste the SQL commands into the SQL window 
  • Step 7: Highlight the entirety of the SQL commands and click “Execute Statement”
Uploading a CSV file to convertcsv.com

You’ve loaded data into your database! Now you can run super fast analyses on this data.

Analyze your data!

You’ve already run some commands in the SQL window in the previous step. The good news is it’s always just as simple as that. Now analysis is basically just the repetition of writing a command into the SQL editor and viewing the results. 

Here’s a simple analytical query that would be the equivalent of creating a pivot table that counts the number of rows within each group. Paste this in to find the results.

select g.country_code, count(criteria_id) count_id
from geotargets g
group by g.country_code
order by count_id desc

You’ve done it! You’ve replaced a spreadsheet with SQL! 

But this is not the end. PRACTICE, PRACTICE, PRACTICE! Analysis in SQL needs to become as comfortable as sorting, filtering, and pivoting in a spreadsheet.

Looking forward

If you’re thinking to yourself, “this still feels like single-player mode…” you’re right. This is like the first level of a game where you play in single-player mode so you can learn how to play the game and avoid getting destroyed in a multiplayer scenario.

In fact, you probably wouldn’t do this type of analysis in a database unless you were going to pull in CSV files with millions of rows or if you were to pull in a bunch of other spreadsheets and join them all together. In those cases, you’d see significant performance improvements over an actual spreadsheet program.

The real utility of a database for analysis comes when you have data dynamically being imported from an ETL tool or custom code. On top of that, running a database (or data warehouse) on the cloud makes it possible for you and your team to access the data and reports in real-time instead of just doing analysis locally and sending it to someone else as a CSV or other type of report document. Hopefully, I don’t need to tell you why that is a bad process!

If I stay motivated… The next step will be to dynamically import data into the new Postgres database with an ETL tool called Airbyte which also runs on your computer. At that point, the scale and complexity of the analysis will really increase.  

After that, as long as I keep at it… the next step would be to set up a BigQuery instance on Google Cloud. At that point, you can combine a cloud-based business intelligence tool with Airbyte and BigQuery and start to get a taste of what a functioning modern data stack looks like. 

I hope this was a helpful start for you.  Let me know in the comments if you get hung up.

Intro to SQL User-Defined Functions: A Redshift UDF Tutorial

As a data analyst, your credibility is as valuable as your analytical skills. And to maintain your credibility, it’s important to be able to answer questions correctly and consistently. That’s why you must be careful to integrate reproducibility into your SQL analyses. This tutorial is going to show you how you can use Redshift User Defined Functions (UDFs) to do just that.

Reproducibility in SQL Analysis

I’ve learned that there are two broad factors to reproducibility. The first is the data—different data for the same analysis is going to produce different results. A good example would be a court case: if you ask two witnesses the same question, each one will probably tell you something similar but likely slightly different. 

The second factor is the analytical methods. If we use the court case example again, this would be like the prosecution and the defense asking a witness the same question in two different ways. The lawyers would do this with the intent to get two different answers.

This post is more concerned with the second factor of reproducibility, the analytical method. Whenever you have to write complex SQL queries to get an answer, your analytical method (the SQL query) becomes a big variable. SQL is iterative by nature! Think about it: just by adding and removing “WHEN” conditions, you’re liable to drastically change your results. 

As you iterate on a numerical calculation or classification in a CASE expression you are likely to change your query results. And what happens when you have to perform the same analysis weeks later? You better hope you use the same iteration of your SQL query the second time as the first! 

And that is exactly where User-Defined Functions become so valuable! 

User-Defined Functions (UDFs) are simply a way of saving one or more calculations or expressions with a name so that you can refer to it as a SQL function for further use.

What are User Defined Functions?

User-Defined Functions can be used just like any other function in SQL like SUBSTRING or ROUND except you get to define what the output of the function is, given the input.


They are a great way to simplify your SQL queries and make them more reproducible at the same time. You can basically take several lines of code that produce one value from your SELECT statement, give it a name, and keep it for future use. Using UDFs, you can ensure that, given the same data, your calculations will always produce the same result.

UDF Functions are Scalar Functions. What does scalar mean?

As you learn about UDFs, you’ll see references to the word “scalar.” Scalar just means that the function is defined with one or more parameters and returns a single result, just like the ROUND function takes one parameter (the number) and an optional second parameter (the number of decimal places for rounding) and returns the rounded number. The function is applied to every value in a column, but it only returns one value for each row in that column.
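
For example, run against a hypothetical orders table, a scalar function produces one output value for every input row:

select order_id,
  round(order_total, 2) as rounded_total    -- one rounded value per row
from orders;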

A Hello World! SQL UDF Example

If you are familiar with any kind of programming language, this should be pretty simple. The CREATE FUNCTION syntax only requires a function name, a return data type, and the SQL that produces the result. That’s it. 

A function called hello_world that returns ‘HELLO WORLD!’ every time would look like this:

create function hello_world ( )
  returns varchar
stable
as $$
  select 'HELLO WORLD!'
$$ language sql; 

In this case, the function takes no input, and the return data type is varchar because “HELLO WORLD!” is text output. You could use your function like this:

select hello_world() as my_first_function;

And you’d get an output that looks like this:

my_first_function
HELLO WORLD!

But that wouldn’t be very interesting. You’ll generally want to modify the input(s) of your functions. Let’s take apart a more interesting UDF example.

How to Write SQL UDF Functions

This example function, called url_category, takes a varchar as input (a URL) and returns a varchar output (the category of the URL). To do this, the function compares the input (referenced as $1 because it is the first parameter) to the conditions of a CASE expression.
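
Since the example itself isn’t shown as code here, here is a minimal sketch of what a url_category function like that could look like. The category rules are made up; yours would reflect your own site’s URL structure:

create function url_category (varchar)
  returns varchar
stable
as $$
  select case
    when $1 like '%/blog/%' then 'Blog'
    when $1 like '%/product%' then 'Product'
    when $1 like '%/checkout%' then 'Checkout'
    else 'Other'
  end
$$ language sql;

You would then call it like any other scalar function, e.g. select url_category(page_url) from page_views; (page_views and page_url are hypothetical names).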

You could also write this function with two parameters. For example, if you were using Google Analytics data, you could take in a hostname and a page_path to get more granular with your URL categorization.
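
Again, just as a sketch, a two-parameter version for Google Analytics data might look like this, with $1 as the hostname and $2 as the page path (the rules are placeholders):

create function url_category (varchar, varchar)
  returns varchar
stable
as $$
  select case
    when $1 like 'blog.%' then 'Blog'
    when $2 = '/' then 'Home'
    when $2 like '/products/%' then 'Product'
    else 'Other'
  end
$$ language sql;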

SQL UDF Functions with Multiple Arguments

This is Redshift’s example from their docs. It takes two parameters (both specified as float) and returns the greater of the two values.

create function f_sql_greater (float, float)
  returns float
stable
as $$
  select case when $1 > $2 then $1
    else $2
  end
$$ language sql;  

To refer to the different parameters in the function, you just use the dollar sign ($) and the order of the parameter in the function definition. As long as you follow that convention, you could go wild with your input parameters!

Redshift UDF Limitations

UDFs are basically restricted to anything that you can normally do inside a SELECT clause. The only exception would be subqueries—you cannot use subqueries in a UDF. This means you’re limited to constant or literal values, compound expressions, comparison conditions, CASE expressions, and any other scalar function. But that’s quite a lot! 

Common UDF Errors and their Causes

Once you start writing UDFs, you’ll find that it’s pretty easy going, but there are a few especially common “gotchas”: 

ERROR:  return type mismatch in function declared to return {data type}

DETAIL:  Actual return type is {data type}.

This just means that you’ve created a function where the output value has a different data type than you said it would. Check that the return data type that you specified is the same as the function is actually returning. This can be tricky if your function is using a CASE expression because a CASE could accidentally return two different data types.
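
Here is a contrived sketch that would trigger this error: the function is declared to return an int, but the CASE expression in the body produces a varchar:

create function order_size (float)
  returns int
stable
as $$
  select case when $1 >= 100 then 'large' else 'small' end::varchar   -- body returns varchar, not int
$$ language sql;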

ERROR:  The select expression can not have subqueries.

CONTEXT:  Create SQL function “try_this” body

This means you tried to write a SELECT statement in your function that includes a subquery. You can’t do that.

ERROR:  function function_name({data type}) does not exist

HINT:  No function matches the given name and argument types. You may need to add explicit type casts.

There is one especially odd thing about Redshift UDFs: you can have several functions with the same name as long as they take different arguments or argument types. This can get confusing. The error here means that you’ve called a function with the wrong type of argument. Check the input data type of your function and make sure it’s the same as your input data.
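
For example, these two hypothetical functions can exist side by side because their signatures differ. Calling add_tax with a varchar column would raise the error above, because no varchar version exists:

create function add_tax (float)
  returns float
stable
as $$
  select $1 * 1.1           -- assumes a default 10% tax
$$ language sql;

create function add_tax (float, float)
  returns float
stable
as $$
  select $1 * (1 + $2)      -- caller supplies the tax rate
$$ language sql;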

Scaling your SQL Analysis with Confidence!

User-Defined Functions make it really easy to repeat your analytical method across team members and across time. All you have to do is define a function once and let everyone know that they can use it. On top of that, if you want to change the logic of your function you only have to do it in one place and then that logic will be changed for each user in every workbench, notebook, or dashboard!

Take advantage of this clever tool. Your team will thank you, and you will thank you later!

Getting Started with SQL for Marketing (with Facebook Ads Example)

As a digital marketer, I use SQL every single day. And looking back on my career so far, it would be fair (though a bit reductive) to say that I could define my career by two distinct periods: before I learned SQL and after I learned SQL. The two periods are distinct for three main reasons: 

  1. After learning SQL I am faster at gaining insight from data 
  2. After learning SQL I am able to make decisions based on more data
  3. As a result, I’ve been making better marketing decisions—and I have seen the traffic, conversion rates, and ROI to prove it. (Thanks to SQL)

If you’re at a crossroads in your career and you find yourself asking, “what coding language should I learn,” here is my case for SQL.

What is SQL (for Digital Marketing)

When you see SQL you might think it means “Sales Qualified Lead” but more commonly, SQL stands for “Structured Query Language.” It is a programming language that allows you to retrieve (or update, alter or delete) data from relational databases. (Relational is just a fancy word for a database that stores data in tables.) 

It’s kind of like ordering from McDonald’s. SQL is a language – a specific set of instructions – that you use to specify the results you want, the way you want them, in the quantity you want. Basically, SQL allows you to have your data your way.

How is SQL Used in Business

SQL has two main uses: applications and analysis. Applications (apps) from Candy Crush to Instagram store content and data about users in databases and then use it to create an experience (like keeping track of how many comments you have on an Instagram post). On the other hand, you can use SQL for analysis in the same way you can sort, filter, and pivot data in Excel (except with a lot more data).

SQL is different from most programming languages like Javascript, Python, and PHP because it only has one use: working with data in relational databases. So you can’t use SQL to build a website or a chatbot, but you can use programming languages like Javascript, Python, and PHP to send SQL commands to databases and do something interesting with the results. WordPress is a good example of this. WordPress is written in PHP, and the PHP code sends SQL commands to a MySQL database and formats the data into blog articles and article lists.

What’s the difference between SQL and Excel?

Remember when you learned your first Excel formula? Pivot tables? VLOOKUP? You probably thought you could take on the world! SQL is like that times 100. SQL and Excel are similar because they both allow you to analyze, manipulate, calculate, and join data in tables. 

The biggest difference between Excel and SQL is that you can analyze exponentially more data exponentially faster with SQL but you can’t update the data in SQL quite as easily. Also, SQL commands define how you want your data table to look when the data is retrieved so you are working with entire tables rather than individual cells. The benefit of this is that you don’t have to worry about making mistakes when copying formulas (and the analysis errors that come with that.) On the whole, I’d say SQL is much better than Excel, most of the time.
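
To make the comparison concrete: the lookup you would do in a spreadsheet with a VLOOKUP is a single JOIN in SQL. The table and column names here are made up:

SELECT orders.order_id, orders.order_total, customers.country
FROM orders
JOIN customers ON customers.customer_id = orders.customer_id;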

SQL Example in Marketing

This example shows an ROI analysis using SQL code that you could use in a Facebook Ads dashboard. It calculates how many customers you’ve acquired per country since the beginning of 2020, along with the Facebook Ads spend in that country. 

SELECT facebook_ads.country, max(new_customers.customer_count) total_customers, sum(facebook_ads.spend) ad_spend
FROM (
	SELECT customers.ip_country, count(email) customer_count
	FROM customers 
	WHERE customers.createdate > '2020-01-01'
	GROUP BY customers.ip_country) new_customers
JOIN facebook_ads ON facebook_ads.country = new_customers.ip_country
WHERE facebook_ads.date > '2020-01-01'
GROUP BY facebook_ads.country
ORDER BY ad_spend desc;

The example does the following:

  1. Aggregates the customers table into a table of countries and counts of customers acquired since January 1st, 2020.
  2. Joins that table with another table that contains Facebook Ads data by day
  3. Filters in only Facebook Ad spend data since January 1st, 2020
  4. Aggregates this all into a single table that has three columns: country, count of new customers from that country, and the ad spend for that country.

The good news is, this is about as complex as SQL gets. Pretty much everything else in SQL is just a variation of this.

Is SQL worth Learning?

In a word, yes. There is a practical reason and a conceptual one. The conceptual one is that learning SQL, like learning data structures or other programming languages, will expand how you think about data. It will help you organize data for analysis more efficiently and help you structure your thinking about how to answer questions with data. So even without a database, SQL can help you work with data.

The practical reason for learning SQL is that it allows you to gain insight faster, from more data, and come to better conclusions. That is true if you are analyzing keywords for PPC or SEO, analyzing how leads flow through your sales funnel, analyzing how to improve your email open rates, or analyzing traffic ROI.

 Here are just a few good reasons.

  1. You’ll spend less time trying to export and import data into spreadsheets
  2. You’ll be able to replicate your analysis easily from week to week or month to month
  3. You’ll be able to analyze more than 10k rows of data at once
  4. You can use BI tools to build dashboards with your data (and always keep them fresh)
  5. You’ll be able to merge bigger datasets together faster than VLOOKUPs
  6. You won’t have to ask for help from IT people, DBAs or engineers to get data out of a database or data warehouse for your analysis

How long does it take to learn SQL?

With dedication, you can develop a strong foundation in SQL in five weeks. I recommend Duke’s SQL class on Coursera to go from zero to usable SQL skills in less than two months. With that class and a couple of books about PostgreSQL, I was on par with most analysts at Postmates (except I had the context about the data!). A few months later  I learned enough to record this SQL demo with SEM data.

There are even good Android/iPhone apps that will help you learn the syntax through repetition. The class I recommend below for Python (Data Manipulation at Scale: Systems and Algorithms) also touches on SQL, so it’s a double whammy. Python Anywhere also features hosted MySQL, so that’s a double-double whammy!

If you are looking for a short but substantive overview of SQL, this video from Free Code Camp is pretty good. I’m not suggesting you’re going to know how to write SQL in four hours, but at least you will get the gist.

All that being said, like many programming languages, learning SQL is a continual practice because, after learning the language, you can expand into managing a database rather than just analyzing data in a database. You can also pair your SQL skills with other programming skills to make all sorts of interesting applications! The good news is that, for the most part, it’s like riding a bike: once you learn it, you don’t really forget it—but it will take you a bit of time to re-learn a wheelie.

SEO with the Google Search Console API and Python

The thing I enjoy most about SEO is thinking at scale. Postmates is fun because sometimes it’s more appropriate to size opportunities on a logarithmic scale than a linear one.

But there is a challenge that comes along with that: opportunities scale logarithmically, but I don’t really scale… at all. That’s where scripting comes in.

SQL, Bash, Javascript, and Python regularly come in handy to identify opportunities and solve problems. This example demonstrates how scripting can be used in digital marketing to solve the challenges of having a lot of potentially useful data.

Visualize your Google Search Console data for free with Keyword Clarity. Import your keywords with one click and find patterns with interactive visualizations.

Scaling SEO with the Google Search Console API

Most, if not all, big ecommerce and marketplace sites are backed by databases. And the bigger these sites are, the more likely they are to have multiple stakeholders managing and altering data in the database. From website users to customer support to engineers, there are several ways that database records can change. As a result, the site’s content grows, changes, and sometimes disappears.

It’s very important to know when these changes occur and what effect the changes will have on search engine crawling, indexing and results. Log files can come in handy but the Google Search Console is a pretty reliable source of truth for what Google sees and acknowledges on your site.

Getting Started

This guide will help you start working with the Google Search Console API, specifically with the Crawl Errors report but the script could easily be modified to query Google Search performance data or interact with sitemaps in GSC.

Want to learn about how APIs work? See: What is an API?

To get started, clone the Github Repository: https://github.com/trevorfox/google-search-console-api and follow the “Getting Started” steps on the README page. If you are unfamiliar with Github, don’t worry. This is an easy project to get you started.

Make sure you have the following:

Now for the fun stuff!

Connecting to the API

This script uses a slightly different method to connect to the API. Instead of using the Client ID and Client Secret directly in the code, the Google API auth flow reads them from the client_secret.json file. This way, you don’t have to modify the webmaster.py file at all, as long as the client_secret.json file is in the /config folder.

import pickle

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

try:
    # Reuse credentials saved by a previous run
    credentials = pickle.load(open("config/credentials.pickle", "rb"))
except (OSError, IOError):
    # First run: authorize via the OAuth flow using client_secret.json
    flow = InstalledAppFlow.from_client_secrets_file('client_secret.json', scopes=OAUTH_SCOPE)
    credentials = flow.run_console()
    pickle.dump(credentials, open("config/credentials.pickle", "wb"))

webmasters_service = build('webmasters', 'v3', credentials=credentials)

For convenience, the script saves the credentials to the project folder as a pickle file. Storing the credentials this way means you only have to go through the Web authorization flow the first time you run the script. After that, the script will use the stored and “pickled” credentials.

Querying Google Search Console with Python

The auth flow builds the “webmasters_service” object which allows you to make authenticated API calls to the Google Search Console API. This is where Google documentation kinda sucks… I’m glad you came here.

The script’s webmasters_service object has several methods. Each one relates to one of the five ways you can query the API, and each has verb methods that indicate how you would like to interact with or query the API.

The script currently uses the “webmaster_service.urlcrawlerrorssamples().list()” method to find how many crawled URLs had a given type of error.

gsc_data = webmasters_service.urlcrawlerrorssamples().list(siteUrl=SITE_URL, category=ERROR_CATEGORY, platform='web').execute()

It can then optionally call “webmaster_service.urlcrawlerrorssamples().markAsFixed(…)” to note that the URL error has been acknowledged, removing it from the webmaster reports.

Google Search Console API Methods

There are five ways to interact with the Google Search Console API. Each is listed below as “webmaster_service” because that is the variable name of the object in the script.

webmasters_service.urlcrawlerrorssamples()

This allows you to get details for a single URL and list details for several URLs. You can also programmatically mark URLs as fixed with the markAsFixed method. Note that marking something as fixed only changes the data in Google Search Console. It does not tell Googlebot anything or change crawl behavior.

The resources are represented as follows. As you might imagine, this will help you find the source of broken links and get an understanding of how frequently your site is crawled.

{
  "pageUrl": "some/page-path",
  "urlDetails": {
    "linkedFromUrls": ["https://example.com/some/other-page"],
    "containingSitemaps": ["https://example.com/sitemap.xml"]
  },
  "last_crawled": "2018-03-13T02:19:02.000Z",
  "first_detected": "2018-03-09T11:15:15.000Z",
  "responseCode": 404
}

webmasters_service.urlcrawlerrorscounts()

This method returns the day-by-day data you would need to recreate the chart in the URL Errors report.

Crawl Errors

webmasters_service.searchanalytics()

This is probably what you are most excited about. This allows you to query your search console data with several filters and page through the response data to get way more data than you can get with a CSV export from Google Search Console. Come to think of it, I should have used this for the demo…

The response looks like this, with a “row” object for every record depending on how you queried your data. In this case, only “device” was used to query the data, so there would be three “rows,” each corresponding to one device.

{
  "rows": [
    {
      "keys": ["device"],
      "clicks": double,
      "impressions": double,
      "ctr": double,
      "position": double
    },
    ...
  ],
  "responseAggregationType": "auto"
}

webmasters_service.sites()

Get, list, add and delete sites from your Google Search Console account. This is perhaps really useful if you are a spammer creating hundreds or thousands of sites that you want to be able to monitor in Google Search Console.

webmasters_service.sitemaps()

Get, list, submit and delete sitemaps to Google Search Console. If you want to get into fine-grained detail about how your sitemaps are being indexed, this is the way to add all of your segmented sitemaps. The response will look like this:

{
  "path": "https://example.com/sitemap.xml",
  "lastSubmitted": "2018-03-04T12:51:01.049Z",
  "isPending": false,
  "isSitemapsIndex": true,
  "lastDownloaded": "2018-03-20T13:17:28.643Z",
  "warnings": "1",
  "errors": "0",
  "contents": [
    {
      "type": "web",
      "submitted": "62",
      "indexed": "59"
    }
  ]
}

Modifying the Python Script

You might want to change the Search Console query or do something else with the response data. The query is in webmasters.py, and you can change the code to iterate through any query. The check method in checker.py is used to “operate” on every response resource. It can do things that are a lot more interesting than printing response codes.

Query all the Things!

I hope this helps you move forward with your API usage, python scripting, and Search Engine Optimization… optimization. Any question? Leave a comment. And don’t forget to tell your friends!

 


Google Analytics to Google Spreadsheets is Data to Insights

When you reach the limits of Google Analytics custom reports and you still need more, Google Spreadsheets and the Google Analytics Add-On can take you past sampling, data consistency and dimension challenges.

This post is all about the Google Analytics + Google Spreadsheets workflow. It is an end-to-end example of how you can extract more value out of your Google Analytics data when you work with it in its raw(ish) form in Google Spreadsheets.

Blog Post Traffic Growth and Decay with Age

The end goal is to look at how blog post traffic grows or decays with time and ultimately Forecast Organic Traffic to Blog Posts (coming soon).  But this post sets the foundation for that by showing how the data is extracted, cleaned and organized using a pivot table in order to get there.

There is also a bit of feature engineering to make the analysis possible. To do this, we will extract the date that the post was published from the URL and effectively turn “Date Posted” and “Post Age” into custom dimensions of each page. But enough setup. Let’s start.

This post assumes a few minor things but hopefully will be easy to follow otherwise:

  • Google Spreadsheets Google Analytics Add-On already plugged in
  • Basic familiarity with Regular Expressions aka. RegEx (helpful but not necessary)
  • Basic familiarity with Pivot Tables in Google Spreadsheets
  • Blog URLs with dates as subdirectories eg “/2015/08” (or collect post date as Custom Dimension)

Creating Your Custom Report

Once you have the Google Analytics Add-On up and running, this is actually pretty simple. It just takes a bit of trial and error to ensure that you’ve queried exactly what you want. From the Report Configuration tab, there are only a few (but very important) fields that you will need to worry about most of the time: Date Ranges, Dimensions, Metrics, and Filters.

Google Analytics Add-On Report Configuration

The purpose of this report is to look at sessions by blog post URL by month, over the entire history of the site.

I chose Last N Days because I know that is roughly the age of my site, and a date range that is too broad is OK. I chose sessions because this relates to how many people will land on a given landing page in a given month of a given year. So that is all pretty straightforward.

Filters can get a bit tricky. A good practice for setting up filters is to first build them using the Google Analytics Query Explorer. That will enable you to rapidly test your queries before you are ready to enter them into your spreadsheet.

There are several filter operators that you can use but I kept this one fairly simple. It consists of three components (separated by semicolons):

ga:medium==organic; Only organic traffic ( == means equals)
ga:landingPagePath=@20; Only landing page URLs that contain 20 (=@ means contains)
ga:landingPagePath!@? Only landing page URLs that do not contain a query string because that’s mostly garbage. (!@ means does not contain)

I used ga:landingPagePath rather than page title because page titles can be less consistent than URLs: they are more likely to change and will sometimes show as your 404 page title. Blog post URLs are a more consistent unique identifier for a post, but it is important to note that sometimes people change blog post URLs. We will deal with that later.

Cleaning the Data for Consistency

Even with good data collection practices, cleaning the data is extremely important for accurate analysis. In my case, I had changed a couple of blog post URLs over time and had to manually omit a few that my query’s filter did not catch. Data cleansing becomes very important here for two reasons: 1. posts that are not combined under one unique, consistent URL will show as two separate posts in the pivot table, which will skew summary statistics, and 2. “posts” that are not real URLs and have a small number of sessions will really skew summary statistics. Consistency is key for pivot tables.

So in this case, for small exceptions, I just deleted the rows. For more common exceptions, I used the REGEXREPLACE function. This is a powerful tool for data cleansing and unique to Google Spreadsheets. It allows you to select the part of a string that matches a RegEx pattern and replace it with whatever you might want. In this case, I just searched for what I wanted to remove and replaced it with an empty string, e.g.

=REGEXREPLACE(A141,"(/blog)?([0-9/])*/","")

I used this to remove “/blog” from the URLs that were used before I transitioned from Squarespace to WordPress, and to remove date numbers because some had changed when a post was updated.

Extracting Blog Post Posting Dates

Extracting the date that the post was posted is actually pretty simple. Again, I used another RegEx function, REGEXEXTRACT to do it:

=SPLIT(REGEXEXTRACT(cell,"[0-9]{4}/[0-9]{1,2}"),"/")

The RegEx pattern finds any four digits, then a slash, then any one or two digits. The extracted string is then split into two cells by the slash. This yields the year in one column and the month in the next. I combined the month and year of the post date and the month and year of the Google Analytics date into Date values so that I could use the DATEDIF function to calculate the age of the blog post as a dimension of the post. Maybe this is too much gory detail but hopefully it’s useful to somebody.

Finally, we end up with something that looks like this. This allows for pivoting the data about each post by each month of age.

Google Analytics Add-On Report Breakdown

Step 4: Pivot Tables FTW!

Finally, all this work pays off. The result is one table of blog post URLs by Age and one table of URLs by summary statistics.

Google Analytics data pivot table

The blog post age table allows me to see if the posts show a natural growth or decay of traffic over time. The big purpose of the table is to create the chart below that basically shows that growth and decay don’t look that consistent. But this is an important finding which helps frame thinking about modeling blog traffic in 2016.

(Not shown here, the Google Spreadsheets SPARKLINE function is a great way to visualize Google Analytics Data.)

The summary statistic pivot table will be used for forecasting 2016 traffic. This just happens to be the topic of my next post in which the tables are used to answer 1. What is the likelihood of reaching 25,000 sessions in 2016 if things stay the same?  and 2. What needs to change in order to reach 25,000 sessions in 2016?

So in summary, so far, this post has proven a few things:

  1. If you work with Google Analytics and don’t know RegEx, go learn! It’s an incredibly useful tool.
  2. My blog posts do not demonstrate consistent growth or decay
  3. I might just be as big of a nerd as my girlfriend thinks I am.

Hope it might be useful in thinking about similar problems or maybe even creating new ones!

Is Slack Messenger Right for My Team? Analytics and Answers

Slack

From AOL Instant Messenger to WeChat stickers, digital communication has always fascinated me. From the beginning, there has always been so much we don’t understand about digital communication. It’s kind of like GMO; we just started using it without considering the implications.

We are continually learning how to use the digital medium to achieve our communication goals. And meanwhile, our digital communication tools are ever evolving to better suit our needs. A prime example of this is the team messaging app, Slack.

Slack

Slack has adapted well and I would argue that it has dominated its ecosystem. There are a few reasons why I believe that it’s earned its position:

  1. It’s easy.
  2. It’s flexible.
  3. It’s not too flexible.

As a tool, Slack is malleable enough to form-fit your communication theories and practices, and it does little to dictate them. This means that its utility and its effect are less a factor of the tool and more a factor of our ability to shape its use.

So when the question was posed, “How well does Slack fit our needs as a team?” I have to admit I wasn’t sure. Days later, in my head, I answered the question with two more questions:

How well have we adapted the tool to us?

How well have we adapted to the tool?

The questions felt somewhat intangible but I had to start somewhere and me being me, I asked the data. I’ll admit I haven’t gotten to the heart of the questions… yet. But I did start to scratch the surface. So let’s step back from the philosophy for a minute, walk through the story, and start answering some questions.

So yeah, we tried Slack… Six months ago

A recently formed, fast-moving, and quickly growing team, we believed that we could determine our own ways of working. In the beginning, we set some ground rules about channel creation and, believe it or not, meme use (hence the #wtf channel). And that was about it. We promised ourselves that we would review the tool and its use. Then we went for it.

A while later, as I mentioned, a manager pointed out that we had never reviewed our team’s use of Slack. It seemed fine, but the questions started to crop up in my head. Me being me, I had to ask the data.

This all happened about the time that I started to play with Pandas. I didn’t answer the questions but I did get frustrated. Then I read Python for Data Analysis, pulled the data out of the Slack API (which only provides data about channels) and went a bit crazy with an iPython notebook.

To answer my theoretical questions, here are the first questions I had, a few that I didn’t expect, and their answers.

How is Slack holding up over time?

Stacked Time Series

Don’t judge me. This was my first go with matplotlib.

This stacked time series shows the number of posts per channel (shown in arbitrary and unfortunately non-unique colors) per week. The top outline of the figure shows the total number of messages for each week. The strata represent different channels, and the height of each stratum represents the volume of messages during a given week.

It appears that there is a bit of a downward trend in the overall number of messages per week. A linear regression supports that: the regression line indicates a trend of about two fewer messages each week than the week before.

Linear Regression

If you ask why there appears to be a downward trend in total use over time, I think there are a few ways to look at it. First, the stacked time series shows that high-volume weeks are generally the result of one or two channels having big weeks, rather than heavy use across the board. This makes sense if you consider how we use channels.

We have channels for general topics and channels for projects. And projects being projects, they all have a given timeframe and endpoint. This would explain the “flare ups” in different channels from time to time. It would also explain why those same channels come to an end.

One way to capture the difference between short lived project channels and consistent topic channels is with a box plot. Box plots represent the distribution of total messages per week for each channel by showing the high and low week totals for a channel and describe the range (Interquartile Range) that weekly message totals commonly fall into.

Slack Analytics Channels Box Plot

Each box plot represents a Slack channel. The Y axis scales to the number of messages in that channel.

For a specific example, the channel on the far left (the first channel created, named #generalofficestuff) has had a relatively high maximum number of messages in a week, a minimum around 1 or 2 (maybe a vacation week), and 50% of all weeks in the last six months fall between about 7 and 28 messages, with an average of 10 messages per week.

On the other hand, channels on the right side of the chart, more recently created and generally project-specific channels, describe the “flare ups” that can be seen in the stacked time series chart above. If you wanted to look deeper, you could make a histogram of the distribution of week totals per channel. But that is a different question and, for my purposes, well enough described with the box plot. 

So… how is Slack holding up over time?!

The simple answer is, use is declining. Simple linear regression shows this. The more detailed answer is, it depends. As the stacked time series and box plots suggest, in our case, use over time is better understood as a factor of the occurrence of projects that lend themselves especially well to Slack channels. I know what you’re saying, “I could have told you that without looking at any charts!” But at least this way nobody is arguing.

Projects… What about People?

Another way to look at this question is not by the “what” but by the “who.” Projects, and their project channels, are basically composed of two components: a goal/topic and a group of people working toward that goal. So far we have only looked at the goal, but this leaves the question, “Are the people a bigger factor in the sustainability of a channel than the topic?”

I looked at this question in many ways, but in the end I think I found one visual that explains as much as a single chart can. This heat map shows the volume of messages per person in each channel. It offers a view into why some channels might see more action than others, and it also suggests how project/channel members, and the synergy between them, might affect a channel’s use.

Slack Analytics Hierarchical Clustering Heatmap

Volume of messages is represented by shade, with users (user_id) on the Y axis and channels on the X axis. Hierarchical clustering uses Euclidean distance to find similarities.
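For anyone following along at home, seaborn’s clustermap does the clustering and the heat map in one call. The sketch below assumes the same messages DataFrame as before, with a user_id column; it is an approximation of the chart above, not the exact code behind it.

import seaborn as sns
import matplotlib.pyplot as plt

# Pivot to a users-by-channels matrix of message counts.
user_channel = (
    messages
    .groupby(["user_id", "channel"])
    .size()
    .unstack("channel", fill_value=0)
)

# Hierarchical clustering on Euclidean distance (seaborn's default metric).
sns.clustermap(user_channel, metric="euclidean", cmap="Blues")
plt.show()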

What I think is most interesting in this visualization is that it shows the associations between people based on their amount of involvement (posts) in a channel. The visual indicates that use is perhaps as much a factor of the people as of the channel’s project or topic, or of time.

There are, of course, other factors. We cannot rule out the possibility of communication moving into direct messages or private groups. But again, that is another question and beyond the bounds of this investigation.

So what?

So we got a glimpse of the big picture and gained a pretty good understanding of the root cause of what motivated the question. This is my favorite part: we get to sit back, relax, and generate a few new hypotheses until we run into a new question that we can’t avoid.

What I think is coolest about the findings is that they suggest a few more hypotheses about which communication media our team’s communication occasionally moves to and which media Slack competes with. Now these investigations start to broach the fundamental questions that we started with!

There are a few things at play here, and the following are just guesses. It could be that email dominates some projects or project phases because we are interacting with outside partners (people) who, for whatever reason, cannot or will not use Slack. Sorry, Slack. It could also be that, given the real world we live in, communication is simply happening over chat apps like WeChat or WhatsApp.

In either case, we return to the idea of people adapting to tools that are adapting to people. The use of digital communication tools reflects the people who use them and each person’s use reflects the structure and offerings of the tool.

And what’s next?

Hopefully, if you read this, you have more questions about this reality, and I might (probably will) go on to try to answer a few more. I think there are a few interesting ways to look at how people are norming with Slack.

Maybe you are interested in how all this Pandas/matplotlib stuff works, because I am too. So I think it will be fun to post the IPython notebook and show how it all works.

Otherwise, it will be interesting to watch how this tool and this team continue to evolve.

Agile Strategy for Data Collection and Analytics

If you are like most people doing business online, it seems like there is always a long list of digital to-dos sitting somewhere between “that will happen in Q4” and “that should have happened by Q4 last year.” Aside from the constant stream of daily hiccups that arise from the asynchronous nature of our medium, anyone managing a website also faces broader development challenges: slow servers, uncooperative CMSs, or lame mobile experiences impacting online success.

This is not a failure you have to accept! Let me introduce you to a little thing that has been bouncing around the software and web development community that will make your online business operations feel less like swimming in peanut butter. It’s called Agile Development, and it’s sexy. It’s fast and sexy like a cheetah wearing high heels.

We can apply these principles of Agile Development to data collection, analytics, and optimization to provide two exceptional benefits: rapid access to data and insight, and safeguards against constantly changing web properties.

For data collection, analytics, and optimization:

  • An Agile approach provides action before traditional methods provide insight
  • An Agile approach safeguards against the constant variability of the web medium

“If you fail to plan, you are planning to fail!” — Ben Franklin

Learning from Feature Driven Development

The Agile Development concept covers an array of development methodologies and practices, but I would like to drill into one especially coherent and efficient Agile method called Feature-Driven Development.

Feature-Driven Development essentially works like this: an overall project is planned as a whole, then it is separated into discrete pieces, and each of these pieces is designed and developed as a component and added to the whole. This way, instead of having many semi-functional components, the project’s most valuable components are complete and fully functioning.

Phased Implementation (Not Iteration)

Because you might have already heard something about Agile Development, it is important at this point to dispel the notion that Agile development is defined by iterating on products. In a sense that is true, but mostly it is the complete opposite of the Agile approach. The only iterations that happen are the planning, implementation, and completion of a new feature. This is not the same as adding layers upon existing features (more on this with the Definition of Done). The difference here is planning, and the ability to see the project and business objectives as a whole.

Step 1: Develop an Overall Model

You must plan! Planning in an organization can be hard to motivate and difficult to initiate, but these planning steps will actually get you better, more actionable data sooner than you would otherwise have it.

Understand the system. This is digital; there are a lot of moving parts. It is very important to really know how your digital presence affects your physical business and your overall business strategy, and vice versa. Additionally, there are likely many components within your business that are (or could be) affected by the data that can be collected. This leads to my next suggestion.

Ask questions and seek multiple perspectives. This is the time to confront your assumptions about your business, your pain points, and your data needs. It is important to really know the processes and decisions that are taking place and how they are, are not, or could be affected by data. Communicating with those who interact with and make decisions on the data at any level will be extremely insightful.

Be strategic. Look at the big picture of the future, define your goal, and work backwards. Agility does not come by luck but by being aware of and prepared for all foreseeable possibilities. Consider how things will change and which parts of your digital presence are shifting. How will redesigns, platform changes, and code freezes affect your strategy? This is generally the best way to approach an analytics problem, so this step applies very well to analytics. Agile was created to solve the problems of being short-sighted and reactive.

Step 2: Define the Parts of Your Plan

This is where the fun starts. There are multiple ways an analytics strategy can be divided and further subdivided into parts. When considering how to divide the project, the goal should be to define parts at their most discrete, independent, or atomic level. This will be helpful in prioritizing the parts into steps. Ultimately, these parts can be regrouped based on similarity and on the development and implementation process.

By Web Property and Section

An organization’s web presence is often not limited to a single site or app. There may be different properties or sections of web properties with different intents. Inevitably, some of these properties or sections will have a bigger impact on your organization’s goals and thus would be prioritized differently.

By Data Scope (User, Page/Screen, Event)

Each web property has layers of data that can be gathered from it. Data about the user in the app or website’s database, information about the content of the page, and information about how the user interacts with the app or website can all be thought of discretely. These layers differ in the intelligence they provide and in the development work required to collect them.

By Data Use

Another way to divide up the data-collection needs is by end use. For instance, you may be an ecommerce store with different people or teams merchandising, planning and creating content, managing email, social, or paid media campaigns, and optimizing the application and user experience. The data needs for each initiative will often overlap, but sometimes they will be very different. These different data needs can be thought of as different parts of your strategy.

By Data Depth

Think of the 80/20 rule in terms of granularity. Some data is instantly useful. For instance, you may not be tracking clicks on your main calls-to-action or “Buy” buttons. These clicks are likely key micro-conversions, and having this interaction insight can literally change your strategy overnight. Another layer of depth would be knowing which product was added to the cart as part of that event. A further layer would be configuring Google Analytics’ Enhanced Ecommerce to understand how users interact with products from the product page to the checkout. Each of these examples provides a different depth of data but also requires a different amount of development time.
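To make the “layers of depth” concrete, here is what a single CTA click looks like as data once it reaches Google Analytics. This sketch sends one event server-side through the Universal Analytics Measurement Protocol; the tracking ID and client ID are placeholders, and in practice you would more likely fire the same event from the browser with a Google Tag Manager click trigger.

import requests

# One "Buy" click, expressed as a Universal Analytics event hit.
payload = {
    "v": 1,                 # Measurement Protocol version
    "tid": "UA-XXXXXXX-1",  # placeholder tracking ID
    "cid": "555",           # anonymous client ID
    "t": "event",           # hit type
    "ec": "ecommerce",      # event category
    "ea": "click",          # event action
    "el": "buy-button",     # event label
}
requests.post("https://www.google-analytics.com/collect", data=payload)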

Other features, like Google AdWords Dynamic Remarketing and Google Analytics Content Groupings, can be thought of similarly, as they need more information to associate with the user or page.

Step 3: Prioritize

This is the most important step, and it is where the unique value of the Agile approach really shines: it can drastically lower the cost and shorten the time to data-driven action. All the planning and foresight that took place before can now be leveraged to make the right decisions.

Consider Goals

Duh. The whole reason you are gathering data is to be data-driven. The parts of your plan that most directly affect your top-line goals should be at the top of the list. Think about every time you have said or heard, “If we only knew abc, we could achieve xyz.” Now, depending on the value of xyz, prioritize data collection accordingly.

Consider Time

This is what Agile is all about! With goal impact in mind, communicate with the relevant parties and your development team or partners to understand how long it will take to implement the code to start gathering data. Sometimes the value of the data will scale with the development time; other times it may be as simple as using a Google Tag Manager click listener on calls-to-action to send events to Google Analytics within a few minutes. Overall, it’s good to have some data to orient your decisions right away, so go for the quick wins first and work with that while the code for the real data gold is being implemented.

Consider Cost

Unfortunately, bottom lines still exist, and the development cost of implementing code to gather data will often have to be justified. Some data collection might be cost-prohibitive, but gathering the data that is easier to get, such as a standard Ecommerce implementation, can give you the rationale for more in-depth data down the road. Overall, get the valuable data that comes cheap and squeeze the life out of it until you need more depth.
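One low-tech way to weigh goals, time, and cost together is a simple value-versus-effort score. Everything below (the parts, the numbers, the scoring formula) is made up for illustration, not a prescribed model.

import pandas as pd

parts = pd.DataFrame({
    "part": ["CTA click events", "Enhanced Ecommerce", "Content Groupings"],
    "goal_impact": [8, 9, 4],   # 1-10: how directly it moves top-line goals
    "dev_days": [0.5, 10, 2],   # rough estimate from your development team
})

# Higher impact and lower effort float to the top of the queue.
parts["priority_score"] = parts["goal_impact"] / parts["dev_days"]
print(parts.sort_values("priority_score", ascending=False))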

Step 4: Implementation Cycle (Plan, Implement, QA)

Now, for the moment we’ve all been waiting for: let the collection begin! This is the step most people think of when they think of Agile development: sprinting to complete a feature and then releasing it. For Agile analytics, this works the same way. Now that there is a prioritized list of analytics “features,” or streams of data, each one should be planned, implemented, and tested in turn.

Plan

This is a more detailed plan than the overall model. This plan defines how the data will be collected. For example, this is when Google Analytics Event or Custom Dimension naming conventions would be defined and documented. Be explicit. This will really improve the efficiency of the process.
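Being explicit can be as simple as a small spec that developers copy from verbatim. The event names and fields below are hypothetical, just to show the level of detail worth writing down.

# A hypothetical naming-convention spec for Google Analytics events.
TRACKING_SPEC = {
    "cta_click": {
        "category": "engagement",
        "action": "cta-click",
        "label": "<button text, lowercase, hyphenated>",
    },
    "add_to_cart": {
        "category": "ecommerce",
        "action": "add-to-cart",
        "label": "<product SKU>",
        "value": "<unit price in cents>",
    },
}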

Implement

Buy your development partner beer and pizza and pass your documentation on to them. Keep them happy and maintain a good relationship; there will be more implementations in the future. Hopefully your documentation is clear, but be open and responsive to questions. This is all about speed and accuracy.

Quality Assurance

This should happen in your development environment so that when the code is implemented on the site, the data that is reported is clear and accurate. Be thorough, as this implementation should stay in place well into the future. If changes are to be made, treat them as their own discrete parts, just as in implementation.

These three steps can happen simultaneously across parts. For example, planning can happen on a future part while implementation and QA are happening on the present part.

Start Optimizing!

Agile is not simple, but it’s also not magic. Speeding up the time to data-driven action is made possible by the planning that happens up front. Being proactive is not only an Agile practice but also a general best practice in analytics; it is the planning that makes Agile efficiency possible. It may seem difficult, but putting in the effort to plan will put you in a position to act proactively, and with agility, well into the future. Happy optimizing!