Setting up Airbyte ETL: Minimum Viable Data Stack Part II

In the first post in the Minimum Viable Data Stack series, we set up a process to start using SQL to analyze CSV data. We set up a Postgres database instance on a Mac personal computer, uploaded a CSV file, and wrote a query to analyze the data in the CSV file. 

That’s a good start! You could follow that pattern to do some interesting analysis on CSV files that wouldn’t fit in Excel or Google Sheets. But that pattern is slow and requires you to continually upload new data as new data becomes available. 

This post will demonstrate how to connect directly to a data source so that you can automatically load data as it becomes available.

This process is called ETL, short for Extract, Transform, Load. Put simply, ETL just means "connecting to a data source, structuring the data in a way that it can be stored in database tables, and loading it into those tables." There's a lot more to it if you really want to get into it, but for our purposes, this is all you'll need to know right now.

For this part of the tutorial, we are going to use an open-source ETL tool called Airbyte to connect to Hubspot and load some Contact data into the same Postgres database we set up before. Then we’ll run a couple of analytical queries to whet your appetite!

Setting up an ETL Tool (Airbyte)

I chose Airbyte for this demo because it is open source, which means it's free to use as long as you have a computer or a server to run it on. Much of it is based on the open-source work that another ETL tool, Stitch, had been pushing before they were acquired by Talend. That project was called Singer.

The best thing about Airbyte for our Minimum Viable Data Stack is that they make running the open-source code so easy because it is packaged in yet another software framework called Docker. Yes, if you’re keeping score at home, that means we are using one open-source framework packaged in another open-source framework packaged in yet another open-source framework. Oh, the beauty of open-source!

To keep this tutorial manageable, I am going to completely “hand wave” the Docker setup. Luckily, it’s easy to do. Since this tutorial is for Mac, follow the Docker installation instructions for Mac.

🎵 Interlude music plays as you install Docker 🎵

Once you’ve installed Docker, you can run the Docker app which will then allow you to run apps called “containers.” Think of a container as “all the code and dependencies an app needs, packaged so you can basically just click start” (Instead of having to load all the individual dependencies one by one!)

Setting up Docker

We’re only going to download and run one app on Docker: Airbyte! 

Note: If you need help on the next few steps, Airbyte has a Slack community that is really helpful.

To download Airbyte, the instructions are simple. Just open up your terminal (you can find this by using Mac's Spotlight search [cmd+space] and typing in "Terminal"). In the terminal, paste in the following three commands:

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

The commands tell your computer to copy all the code from Airbyte's GitHub repository into a folder called "airbyte," then "cd" (change directory) into that "airbyte" folder, and finally tell Docker to run the Airbyte app container.

The beauty of this is that once you run this the first time from the command line, you can start Airbyte from the Docker UI by just clicking the “play” button.

Installing Airbyte via the command line

Airbyte will do a bit of setup and then your terminal will display the text shown above. At that point Airbyte is running on your computer and to use it, all you have to do is open your browser and go to http://localhost:8000

If you’re wondering how this works, Airbyte is running a webserver to provide a web interface to interact with the code that does all the heavy-ETL-lifting. If this were a common ETL tool like Stitch or Fivetran, the webserver and the ETL processes would run on an actual server instead of your personal computer.

If everything has gone according to plan you can go to  http://localhost:8000 and see the Airbyte app UI running and ready to start ETL-ing!

Setting up the Postgres connection in Airbyte

Setting up your first ETL job (Hubspot to Postgres)

I’ll admit, that last part got a little gruesome but don’t worry, it gets easier from here (as long as everything goes according to plan…)

From here we have to connect both our database and data sources to Airbyte so it has access to the source data and permission to write to the database.

I’ve chosen to load data from Hubspot because it is really easy to connect and because it shows the ups and downs of ETL… And of course, we’re still using Postgres.

Creating a Postgres Destination

All you have to do is paste in your database credentials from Postgres.app. Here are Airbyte’s instructions for connecting to Postgres

These are the same ODBC credentials we used to connect Postico in the last article. You can find them on the Postico home screen by clicking your database name. Note that because Airbyte runs inside Docker, you'll use host.docker.internal as the host instead of localhost (more on that below).

Postgres ODBC connection details

In my case, these are the settings:

  • Name: I called it “Postgres.app” but it could be anything you want
  • Host: Use host.docker.internal (localhost doesn't work with Docker. See the note above)
  • Port: 5432 
  • Database Name: Mine is “trevorfox.” That’s the name of my default Postgres.app database
  • Schema: I left it as “public.” You might want to use schemas for organizational purposes. Schemas are “namespaces” and you can think of them as folders for your tables. 
  • User: Again, mine is “trevorfox” because that is my default from when I set up Postgres.app
  • Password: You can leave this blank unless you set up a password in Postgres.app

From there you can test your connection and you should see a message that says, “All connection tests passed!”

Creating a Hubspot Source

You’ll first need to retrieve your API key. Once you’ve got it, you can create a new Hubspot Source in Airbyte.

I used these settings:

  • Name: Hubspot
  • Source type: Hubspot
  • API Key:  My Hubspot  API key
  • start_date: I used 2017-01-25T00:00:00Z, which is a machine-readable (ISO 8601) timestamp for January 25, 2017
Setting up the Hubspot connection in Airbyte

Here’s a cute picture to celebrate getting this far!

Getting started with Airbyte

Creating the ETL Connection

Since we’ve already created a Destination and a Source, all we have to do is to tell Airbyte we want to extract from the Source and load data to the Destination. 

Go back to the Destinations screen and open your Postgres.app destination, click "add source," and choose your source. For me, this is the source I created called "Hubspot."

Airbyte will then go and test both the Source and Destination. Once both tests succeed, you can set up your sync. 

Setting up an ETL job with Airbyte

There are a lot of settings! Luckily you can leave most of them as they are until you want to get more specific about how you store and organize your data. 

For now, set the Sync frequency to “manual,” and uncheck every Hubspot object besides Contacts.

In the future, you could choose to load more objects for more sophisticated analysis but starting with Contacts is good because it will be a lot faster to complete the first load and the analyses will still be pretty interesting.

 Click the “Set up connection” button at the bottom of the screen.

Setting up the first sync with Airbyte

You’ve created your first Connection! Click “Sync now” and start the ETL job!

As the sync runs, you’ll see lots of logs. If you look carefully, you’ll see some that read “…  Records read: 3000” etc. which will give you a sense of the progress of the sync.

Airbyte ETL logs

What’s happening in Postgres now?

Airbyte is creating temporary tables and loading all the data into those. It will then copy that data into its final-state tables. Those tables will be structured in a way that is a bit easier to analyze. This is some more of the T and L of the ETL process!

As the sync is running, you can go back to Postico and refresh the table list (cmd+R) to see new tables as they are generated. 
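
If you want to watch this from the SQL side too, you can list the tables Airbyte is working with as the sync runs. This is a minimal sketch that assumes Airbyte's intermediate tables contain "airbyte" somewhere in their names (the exact naming can vary by version):

select tablename
from pg_tables
where tablename like '%airbyte%'
order by tablename;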

Let’s look at the data!

When the job completes, you’ll notice that Airbyte has created a lot of tables in the database. There is a “contacts” table, but there are a lot of others prefixed with “contacts_.”

Why so many tables? 

These are all residue from taking data from a JSON REST API and turning it all into tables. JSON is a really flexible way to organize data. Tables are not. So in order to get all that nested JSON data to fit nicely into tables, you end up with lots of tables to represent all the nesting.  The Contacts API resource alone generated 124 “contacts_” tables. See for yourself:

select count(tablename)
from pg_tables t
where t.tablename like 'contacts_%';

This queries the Postgres system table called pg_tables which, as you probably guessed, contains a row of metadata for each table. By counting the tables that match the prefix "contacts_," you'll see how many tables came from the Contacts resource.

Why you care about Data Modeling and ELT

In order to structure this data in a way that is more suitable for analysis, you'll have to join the tables together and select the columns you want to keep. That cleaning process, plus other business logic and filtering, is called data modeling.

Recently it has become more common to model your data with SQL once it’s in the database  (rather than right after it is extracted and before it’s loaded into the database). This gave rise to the term ELT to clarify that most of the transformations are happening after the data has landed in the database. There is a lot to discuss here. I might have to double back on this at some point…

Previewing SQL data tables

Luckily, we can ignore the majority of these tables. We are going to focus on the “contacts” table.

Analyzing the Data

Let’s first inspect the data and get a feel for what it looks like. It’s good to start with a sanity check to make sure that all your contacts are in there. This query will tell you how many of the contacts made it into the database. That should match what you see in the Hubspot app.

select count(*)
from contacts

One of the first questions you’ll probably have is how are my contacts growing over time? This is a good query to demonstrate a few things: the imperfect way that ETL tools write data into tables, the importance of data modeling, and ELT in action.

In the screenshot above, you’ll notice that the “contacts” table has a bunch of columns but one of them is full of lots of data. The “properties” column represents a nested object within the Hubspot Contacts API response. That object has all the interesting properties about a  Contact like when it was created, what country they are from, and other data a business might store about their Contacts in Hubspot. 

Airbyte, by default, dumps the whole object into Postgres as a JSON field. This means you have to get crafty in order to destructure the data into columns. Here's how you would get a Contact's id and the date it was created (this would be the first step toward counting contacts over time):

select c.properties->>'hs_object_id' id, 
	c.createdat::date
from contacts c
limit 10;

Notice the field "c.properties->>'hs_object_id'". The "->>" operator is how you extract a field (as text) from a JSON-typed column.
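
To make the operator more concrete, here's a sketch that pulls a couple more properties out of the JSON column. The property names "country" and "lifecyclestage" are assumptions, so swap in whatever properties your Hubspot portal actually uses:

select c.properties->>'hs_object_id' id,
	c.properties->>'lifecyclestage' lifecycle_stage,  -- assumed property name
	c.properties->>'country' country                  -- assumed property name
from contacts c
limit 10;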

To count new contacts by month, we can add a little aggregation to the query above.

select date_trunc('month', c.createdat::date) created_month,
	count(distinct c.properties->>'hs_object_id')  contact_count
from contacts c
group by created_month
order by created_month desc;

THIS IS IT! This is the beauty of analytics with a proper analytics stack. Tomorrow, the next day, and every day in the future, you can run the Hubspot sync and see up-to-date metrics in this report!

You’ll learn that the more queries you run, the more you’ll get tired of cleaning and formatting the data. And that, my friends, is why data modeling and ELT!
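
To give you a taste of what that modeling step can look like, here's a minimal sketch that wraps the JSON cleanup into a view so the messy part only has to be written once (it assumes the same contacts table and the hypothetical "country" property from the earlier example):

create view contacts_modeled as
select c.properties->>'hs_object_id' id,
	c.createdat::date created_date,
	c.properties->>'country' country   -- assumed property name
from contacts c;

-- the monthly report then reads much more cleanly
select date_trunc('month', created_date) created_month,
	count(distinct id) contact_count
from contacts_modeled
group by created_month
order by created_month desc;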

SQL analytics in Postico

I changed to dark mode since the last post :]

Looking Forward

At this point, the stack is pretty viable. We have a Postgres database (our data warehouse), an ETL process that will keep our data in sync with the data in the source systems (Airbyte), and the tools for analyzing this data (SQL and Postico). Now you can answer any question you might have about your Hubspot data at whatever frequency you like—and you’ll never have to touch a spreadsheet!

The foundation is set but there is still more inspiration ahead. The natural place to go from here is a deeper dive into analysis and visualization. 

In the next post, we’ll set up Metabase to visualize the Hubspot data and create a simple dashboard based on some SQL queries. From there, I imagine we’ll head towards reverse ETL and push the analysis back to Hubspot. =]

I hope this was interesting at the least and helpful if you’re brave enough to follow along. Let me know in the comments if you got hung up or if there are instructions I should add.

From Spreadsheets to SQL: Step One towards the Minimum Viable Data Stack

“A spreadsheet is just data in single-player mode”

Last week I made a mistake. I boldly claimed that "A spreadsheet is just data in single-player mode." And while I stand by that claim, I didn't expect to be called to account for it.

As it turns out, the post was pretty popular and I think I know why. To me it boils down to five factors. 

  1. The scale and application of data are still growing (duh)
  2. There aren’t enough people with the skills to work with data at scale
  3. There are plenty of resources to learn SQL but the path to using it in the “real world” isn’t very clear
  4. The tools have caught up and basically, anybody with spreadsheet skills can set up a data stack that works at scale
  5. Now is a great time to upskill, become more effective in almost any career, and take advantage of the demand

The hard part? You have to pull yourself away from spreadsheets for a while—go slow to go fast (and big).

You’ll thank yourself in the end. Being able to think about data at scale will change how you approach your work and being able to work at scale will increase your efficiency and impact. On top of that, it’s just more fun!

A Minimum Viable Data Stack

In the spirit of a true MVP, this first step is going to get you from spreadsheets to SQL with the least amount of overhead and a base level of utility.

In the next hour you will:

  • Stand up an analytical database
  • Write a SQL query to replace a pivot table
  • Have basic tooling and process for a repeatable analytical workflow 

In the next hour you will not (yet):

  • Run anything on a cloud server (or be able to share reports beyond your computer)
  • Set up any continually updating data syncs for "live" reporting

But don't underestimate this. Once you start to work with data in this way, you'll recognize that the process is a lot less error-prone and more repeatable than spreadsheet work because the code is simpler and it's easier to retrace your steps.

Starting up a Postgres database

Postgres isn't the first thing that people think of for massive-scale data warehousing, but it works pretty well for analytics. Especially at this scale, it is definitely the easiest way to get started and, of course, it's free. Ultimately, you'll be working with BigQuery, Snowflake, or, if this were 2016, Redshift.

I apologize in advance but this tutorial will be for a Mac. It won’t really matter once everything moves to the cloud, but I don’t own a Windows machine…

The easiest way to get a Postgres server running on a Mac is Postgres.app. It wraps everything in a shiny Mac UI, and the download experience is no different than something like Spotify: download the app from postgresapp.com, move it to your Applications folder, open it, and click "Initialize" to start your server.

Congrats! You have installed a Postgres server on your local machine and it’s up and running!

Here are some instructions for installing Postgres on Windows. And here’s a list of  Postgres clients for Windows that you’d use instead of Postico.

Now let’s see how quickly we get connected to the database.

Your Postgres server is up and running

Connecting to Postgres

There are plenty of good SQL editors for Postgres but since we are keeping this MVP, I’m going to recommend Postico. Again, it has a simple Mac UI and is designed for more of an analytical workflow than hardcore database management. 

  • Step 1: Head over to https://eggerapps.at/postico/ and download the app
  • Step 2: Move the app to the Applications folder and then open it by double-clicking on the icon
  • Step 3: Create a database connection by clicking on the “New Favorite” button. Leave all fields blank; the default values are suitable for connecting to Postgres.app. Optionally provide a nickname, eg. “Postgres.app”. Click “Connect”
  • Step 4: Go back to Postico and choose the SQL Query icon
  • Step 5: Test your connection by running a query.
Create a new “Favorite” connection in Postico

Run the query “select * from pg_tables;” to see a list of all the tables in your Postgres database. Since you haven’t loaded any tables, you’ll just see a list of Postgres system tables that start with the prefix, “pg_.” As you probably guessed, the “pg” stands for Postgres.

Running your first SQL query in Postico

You’ve done it! You’ve started up a Postgres database, connected to it, and run your first query!

Loading data into Postgres

Ok, the boring stuff is out of the way and it’s only been about 15 minutes! Now we can get to the actual analysis. Next, let’s load some actual data into Postgres.

Loading tables in Postgres is a little bit different (aka more involved) than loading a CSV into Google Sheets or Excel. You have to tell the database exactly how each table should be structured and then what data should be added to each table. 

You might not yet know how to run CREATE TABLE commands but that’s ok. There are tools out there that will shortcut that process for us too. 

The imaginatively named convertcsv.com generates the SQL commands to populate tables based on a CSV file's contents. There are lots of ways to load data into a database but again, this is an MVP.

For this tutorial, I’m using the Google Analytics Geo Targets CSV list found here. Why? Because the file is big enough that it would probably run pretty slowly in a spreadsheet tool.
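
To set expectations, the generated commands will look roughly like this. It's a hedged sketch: the exact column names and data types depend on the CSV headers and on convertcsv.com's settings, and the sample row below is a placeholder rather than real data:

CREATE TABLE geotargets(
	criteria_id    INT NOT NULL,
	name           VARCHAR(100),
	canonical_name VARCHAR(200),
	parent_id      INT,
	country_code   VARCHAR(2),
	target_type    VARCHAR(30),
	status         VARCHAR(20)
);

INSERT INTO geotargets (criteria_id, name, canonical_name, parent_id, country_code, target_type, status)
VALUES (1000001, 'Example City', 'Example City,Example Region,United States', 1000000, 'US', 'City', 'Active');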

  • Step 1: Head over to https://www.convertcsv.com/csv-to-sql.htm
  • Step 2: Select the “Choose File” tab to upload a CSV file. 
  • Step 3: Change the table name in the Output Options section  where it says “Schema.Table or View Name:” to “geotargets”
  • Step 4: Scroll down to the Generate Output section and click the “CSV to SQL Insert” button to update the output, then copy the SQL commands
  • Step 5: Go back to Postico and click on the SQL Query icon
  • Step 6: Paste the SQL commands into the SQL window 
  • Step 7: Highlight the entirety of the SQL commands and click “Execute Statement”
Uploading a CSV file to convertcsv.com

You’ve loaded data into your database! Now you can run super fast analyses on this data.

Analyze your data!

You’ve already run some commands in the SQL window in the previous step. The good news is it’s always just as simple as that. Now analysis is basically just the repetition of writing a command into the SQL editor and viewing the results. 

Here’s a simple analytical query that would be the equivalent of creating a pivot table that counts the number of rows within each group. Paste this in to find the results.

select g.country_code, count(criteria_id) count_id
from geotargets g
group by g.country_code
order by count_id desc

You’ve done it! You’ve replaced a spreadsheet with SQL! 

But this is not the end. PRACTICE, PRACTICE, PRACTICE! Analysis in SQL needs to become as comfortable as sorting, filtering, and pivoting in a spreadsheet.

Looking forward

If you’re thinking to yourself, “this still feels like single-player mode…” you’re right. This is like the first level of a game where you play in single-player mode so you can learn how to play the game and avoid getting destroyed in a multiplayer scenario.

In fact, you probably wouldn’t do this type of analysis in a database unless you were going to pull in CSV files with millions of rows or if you were to pull in a bunch of other spreadsheets and join them all together. In those cases, you’d see significant performance improvements over an actual spreadsheet program.
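
To make that concrete, suppose you loaded a second CSV into a hypothetical country_population table with country_code and population columns. Joining it to the geotargets table is just one short query:

select g.country_code,
	count(g.criteria_id) count_id,
	max(p.population) population      -- one population value per country
from geotargets g
join country_population p on p.country_code = g.country_code
group by g.country_code
order by count_id desc;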

The real utility of a database for analysis comes when you have data dynamically being imported from an ETL tool or custom code. On top of that, running a database (or data warehouse) on the cloud makes it possible for you and your team to access the data and reports in real-time instead of just doing analysis locally and sending it to someone else as a CSV or other type of report document. Hopefully, I don’t need to tell you why that is a bad process!

If I stay motivated… The next step will be to dynamically import data into the new Postgres database with an ETL tool called Airbyte which also runs on your computer. At that point, the scale and complexity of the analysis will really increase.  

After that, as long as I keep at it… the next step would be to set up a BigQuery instance on Google Cloud. At that point, you can combine a cloud-based business intelligence tool with Airbyte and BigQuery and start to get a taste of what a functioning modern data stack looks like. 

I hope this was a helpful start for you.  Let me know in the comments if you get hung up.

Intro to SQL User-Defined Functions: A Redshift UDF Tutorial

As a data analyst, your credibility is as valuable as your analytical skills. And to maintain your credibility, it’s important to be able to answer questions correctly and consistently. That’s why you must be careful to integrate reproducibility into your SQL analyses. This tutorial is going to show you how you can use Redshift User Defined Functions (UDFs) to do just that.

Reproducibility in SQL Analysis

I’ve learned that there are two broad factors to reproducibility. The first is the data—different data for the same analysis is going to produce different results. A good example would be a court case: if you ask two witnesses the same question, each one will probably tell you something similar but likely slightly different. 

The second factor is the analytical methods. If we use the court case example again, this would be like the prosecution and the defense asking a witness the same question in two different ways. The lawyers would do this with the intent to get two different answers.

This post is more concerned with the second factor of reproducibility, the analytical method. Whenever you have to write complex SQL queries to get an answer, your analytical method (the SQL query) becomes a big variable. SQL is iterative by nature! Think about it: just by adding and removing "WHEN" conditions, you're liable to drastically change your results.

As you iterate on a numerical calculation or classification in a CASE expression you are likely to change your query results. And what happens when you have to perform the same analysis weeks later? You better hope you use the same iteration of your SQL query the second time as the first! 

And that is exactly where User-Defined Functions become so valuable! 

What are User Defined Functions?

User-Defined Functions can be used just like any other function in SQL like SUBSTRING or ROUND except you get to define what the output of the function is, given the input.

User-Defined Functions (UDFs) are simply a way of saving one or more calculations or expressions with a name so that you can refer to it as a SQL function for further use.

They are a great way to simplify your SQL queries and make them more reproducible at the same time. You can basically take several lines of code that produce one value from your SELECT statement, give it a name, and keep it for future use. Using UDFs, you can ensure that, given the same data, your calculations will always produce the same result.

UDF Functions are Scalar Functions. What does scalar mean?

As you learn about UDFs, you’ll see references to the word “scalar.” Scalar just means that the function is defined with one or more parameters and returns a single result. Just like the ROUND function has one parameter (the number) and an optional second parameter (the number of decimal places for rounding) and returns the rounded number. The function is applied to every value in a column, but it only returns one value for each row in that column.

A Hello World! SQL UDF Example

If you are familiar with any kind of programming language, this should be pretty simple. The CREATE FUNCTION syntax only requires a function name and a return data type. That’s it. 

A function called hello_world that returns ‘HELLO WORLD!’ every time would look like this:

create function hello_world ( )
  returns varchar
stable
as $$
  select 'HELLO WORLD!'
$$ language sql; 

In this case, the return data type is varchar because "HELLO WORLD!" is a text output (and the function takes no input at all). You could use your function like this:

select hello_world() as my_first_function;

And you’d get an output that looks like this:

my_first_function
HELLO WORLD!

But that wouldn’t be very interesting. You’ll generally want to modify the input(s) of your functions. Let’s take apart a more interesting UDF example.

How to Write SQL UDF Functions

This example function, called url_category, takes a varchar as an input (a URL) and returns a varchar output (the category of the URL). To do this, the function compares the input (shown as $1 because it is the first parameter) to the conditions of a case expression.
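
The function itself isn't shown here, but a minimal sketch of it might look like the following. The categories and URL patterns are placeholders, so adapt them to your own site:

create function url_category (varchar)
  returns varchar
stable
as $$
  select case
    when $1 like '%/blog/%' then 'Blog'
    when $1 like '%/product/%' then 'Product'
    else 'Other'
  end
$$ language sql;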

You could also write this function with two parameters. Here's an example if you were using Google Analytics data: you could take in two parameters, hostname and page_path, to get more granular with your URL categorization.
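
A sketch of that two-parameter version might look like this (again, the hostnames, paths, and categories are placeholders):

create function url_category (varchar, varchar)
  returns varchar
stable
as $$
  select case
    when $1 like 'blog.%' then 'Blog'
    when $2 like '/pricing%' then 'Pricing'
    when $2 like '/product%' then 'Product'
    else 'Other'
  end
$$ language sql;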

SQL UDF Functions with Multiple Arguments

This is Redshift's example from their docs. It takes two parameters (both specified as float) and returns the greater of the two values.

create function f_sql_greater (float, float)
  returns float
stable
as $$
  select case when $1 > $2 then $1
    else $2
  end
$$ language sql;  

To refer to the different parameters in the function, you just use the dollar sign ($) and the order of the parameter in the function definition. As long as you follow that convention, you could go wild with your input parameters!
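
Calling a UDF looks exactly like calling a built-in function. For example, you could call f_sql_greater with literal values, or against a hypothetical facebook_ads table that has spend and budget columns:

-- with literal values
select f_sql_greater(5.0, 9.0) greater_value;

-- against a (hypothetical) table of daily ad spend and budgets
select f_sql_greater(spend, budget) bigger_number
from facebook_ads
limit 10;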

Redshift UDF Limitations

UDFs are basically restricted to anything that you can normally do inside a SELECT clause. The only exception would be subqueries—you cannot use subqueries in a UDF. This means you’re limited to constant or literal values, compound expressions, comparison conditions, CASE expressions, and any other scalar function. But that’s quite a lot! 

Common UDF Errors and their Causes

Once you start writing UDFs, you'll find that it's pretty easy going, but there are a few especially common "gotchas":

ERROR:  return type mismatch in function declared to return {data type}

DETAIL:  Actual return type is {data type}.

This just means that you've created a function where the output value has a different data type than you said it would. Check that the return data type you specified matches what the function actually returns. This can be tricky if your function uses a CASE expression, because a CASE could accidentally return two different data types.
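
For example, the function below is declared to return a float, but its CASE expression returns text, so creating it will trigger this error (the function name and threshold are just for illustration):

create function spend_bucket (float)
  returns float      -- declared to return float...
stable
as $$
  select case
    when $1 >= 1000 then 'high'
    else 'low'       -- ...but the CASE actually returns varchar
  end
$$ language sql;

-- the fix: declare "returns varchar" instead (or return numbers from the CASE)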

ERROR:  The select expression can not have subqueries.

CONTEXT:  Create SQL function “try_this” body

This means you tried to write a SELECT statement in your function that includes a subquery. You can’t do that.

ERROR:  function function_name({data type}) does not exist

HINT:  No function matches the given name and argument types. You may need to add explicit type casts.

There is one especially odd thing about Redshift UDFs: you can have several functions with the same name as long as they take different arguments or argument types. This can get confusing. The error here means that you've called a function with the wrong type of argument. Check the input data type of your function and make sure it's the same as your input data.
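
Here's a quick illustration of that overloading behavior with a hypothetical greet function. Both definitions can coexist, and Redshift picks one based on the arguments you pass:

create function greet (varchar)
  returns varchar
stable
as $$
  select 'Hello, ' || $1 || '!'
$$ language sql;

create function greet (varchar, varchar)
  returns varchar
stable
as $$
  select 'Hello, ' || $1 || ' ' || $2 || '!'
$$ language sql;

select greet('Trevor');        -- matches the one-argument version
select greet('Trevor', 'Fox'); -- matches the two-argument version
-- select greet(42);           -- would raise "function greet(integer) does not exist"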

Scaling your SQL Analysis with Confidence!

User-Defined Functions make it really easy to repeat your analytical method across team members and across time. All you have to do is define a function once and let everyone know that they can use it. On top of that, if you want to change the logic of your function you only have to do it in one place and then that logic will be changed for each user in every workbench, notebook, or dashboard!

Take advantage of this clever tool. Your team will thank you, and you will thank you later!

Getting Started with SQL for Marketing (with Facebook Ads Example)

As a digital marketer, I use SQL every single day. And looking back on my career so far, it would be fair (though a bit reductive) to say that I could define my career by two distinct periods: before I learned SQL and after I learned SQL. The two periods are distinct for three main reasons: 

  1. After learning SQL, I am faster at gaining insight from data
  2. After learning SQL, I am able to make decisions based on more data
  3. As a result, I’ve been making better marketing decisions—and I have seen the traffic, conversion rates, and ROI to prove it. (Thanks to SQL)

If you’re at a crossroads in your career and you find yourself asking, “what coding language should I learn,” here is my case for SQL.

What is SQL (for Digital Marketing)

When you see SQL you might think it means “Sales Qualified Lead” but more commonly, SQL stands for “Structured Query Language.” It is a programming language that allows you to retrieve (or update, alter or delete) data from relational databases. (Relational is just a fancy word for a database that stores data in tables.) 

It’s kind of like ordering from McDonald’s. SQL is a language – a specific set of instructions – that you use to specify the results you want, the way you want them, in the quantity you want. Basically, SQL allows you to have your data your way.

How is SQL Used in Business

SQL has two main uses: applications and analysis. Applications (apps) from Candy Crush to Instagram store content and data about users in databases and then use it to create an experience (like keeping track of how many comments you have on an Instagram post). On the other hand, you can use SQL for analysis in the same way you can sort, filter, and pivot data in Excel (except with a lot more data).

SQL is different from most programming languages like Javascript, Python, and PHP because it only has one use: retrieving data from relational databases. So you can’t use SQL to build a website or a chatbot but you can use programming languages like Javascript, Python, and PHP to send SQL commands to databases and do something interesting with the results. WordPress is a good example of this. WordPress is written in PHP and the PHP code sends the SQL commands to a MySQL database and formats the data into blog articles and article lists.

What’s the difference between SQL and Excel?

Remember when you learned your first Excel formula? Pivot tables? VLOOKUP? You probably thought you could take on the world! SQL is like that times 100. SQL and Excel are similar because they both allow you to analyze, manipulate, make calculations with, and join data in tables.

The biggest difference between Excel and SQL is that you can analyze exponentially more data exponentially faster with SQL, but you can't update the data quite as easily. Also, SQL commands define how you want your data table to look when the data is retrieved, so you are working with entire tables rather than individual cells. The benefit of this is that you don't have to worry about making mistakes when copying formulas (and the analysis errors that come with that). On the whole, I'd say SQL is much better than Excel, most of the time.

SQL Example in Marketing

This example shows an ROI analysis using SQL code that you could use in a Facebook Ads dashboard. It calculates how many customers you've acquired per country since the beginning of 2020, along with the Facebook Ads spend in that country.

SELECT facebook_ads.country,
	max(new_customers.customer_count) total_customers, -- one count per country, repeated on each daily ad row
	sum(facebook_ads.spend) ad_spend
FROM (
	SELECT customers.ip_country, count(email) customer_count
	FROM customers
	WHERE customers.createdate > '2020-01-01'
	GROUP BY customers.ip_country) new_customers
JOIN facebook_ads ON facebook_ads.country = new_customers.ip_country
WHERE facebook_ads.date > '2020-01-01'
GROUP BY facebook_ads.country
ORDER BY ad_spend desc;

The example does the following:

  1. Aggregates a table of customers into a table of countries and counts of customers acquired since January 1st, 2020.
  2. Joins that table with another table that contains Facebook Ads data by day
  3. Filters in only Facebook Ad spend data since January 1st, 2020
  4. Aggregates this all into a single table that has three columns: country, count of new customers from that country, and the ad spend for that country.

The good news is, this is about as complex as SQL gets. Pretty much everything else in SQL is just a variation of this.

Is SQL worth Learning?

In a word, yes. There is a practical reason and a conceptual one. The conceptual one is that learning SQL, like learning data structures or other programming languages, will expand how you think about data. It will help you organize data for analysis more efficiently and help you structure your thinking about how to answer questions with data. So even without a database, SQL can help you work with data.

The practical reason for learning SQL is that it allows you to gain insight faster, from more data, and come to better conclusions. That is true if you are analyzing keywords for PPC or SEO, analyzing how leads flow through your sales funnel, analyzing how to improve your email open rates, or analyzing traffic ROI.

 Here are just a few good reasons.

  1. You’ll spend less time trying to export and import data into spreadsheets
  2. You’ll be able to replicate your analysis easily from week to week or month to month
  3. You’ll be able to analyze more than 10k rows of data at once
  4. You can use BI tools to build dashboards with your data (and always keep them fresh)
  5. You’ll be able to merge bigger datasets together faster than VLOOKUPs
  6. You won’t have to ask for help from IT people, DBAs or engineers to get data out of a database or data warehouse for your analysis

How long does it take to learn SQL?

With dedication, you can develop a strong foundation in SQL in five weeks. I recommend Duke’s SQL class on Coursera to go from zero to usable SQL skills in less than two months. With that class and a couple of books about PostgreSQL, I was on par with most analysts at Postmates (except I had the context about the data!). A few months later  I learned enough to record this SQL demo with SEM data.

There are even good Android/iPhone apps that will help you learn the syntax through repetition. The class I recommend below (Data Manipulation at Scale: Systems and Algorithms) for Python also touches on SQL, so it's a double whammy, and Python Anywhere also features hosted MySQL, so that's a double-double whammy!

If you are looking for a short but substantive overview of SQL, this video from Free Code Camp is pretty good. I’m not suggesting you’re going to know how to write SQL in four hours, but at least you will get the gist.

All that being said, like many programming languages, learning SQL is a continual practice because, after learning the language, you can expand into managing a database rather than just analyzing data in a database. You can also pair your SQL skills with other programming skills to make all sorts of interesting applications! The good news is that, for the most part, it's like riding a bike: once you learn it, you don't really forget it (though it will take you a bit of time to re-learn a wheelie).

Learn Programming and Databases for Digital Marketing | $10k Tech Skills 2/4

This is part two in the $10k Technical Skills for Digital Marketing series. Part one introduced the importance of learning client-side technologies and offered a plan to learn Javascript, HTML, and CSS for digital marketing. This post broadens the picture by introducing server-side programming and databases, which together compose web applications. Understanding how web applications work is a major benefit and should be essential knowledge for digital marketing. Enjoy!

Learning How Web Applications Work

From Google Bot to the Facebook Social Graph to this WordPress blog, the web as we know it is a massive system of interconnected applications. All these applications are simply programs and databases that run on servers. And while building these applications is a massive undertaking, learning the underlying processes and concepts is not. It takes nothing more than a bit of effort and time to learn enough about programming and databases to significantly set yourself and your resume apart from the average digital marketer.

While the benefits of learning how to write server-side code and interact with databases are not as immediately useful as many of the skills listed in Part 1, it is actually the process of learning this skill that presents the real value. The learning process will provide an intuition about how applications work and how processes can be scaled. This is key to digital marketing at scale.

If you can understand how search engine bots crawl websites, you can understand what makes a website crawl-friendly, and you begin to understand the technical aspects of SEO. If you understand how algorithms work, you can understand Edge Rank, how Facebook decides to distribute content, and how to broaden your reach. If you can understand how your CMS works, you can map your analytics platform to it and gain better insight, which you can then use to automate processes like email and offer personalized experiences. This new intuition about the web will continue to present opportunities.

You will also find many practical opportunities to employ your new programming and database querying skills for digital marketing tasks and processes. While these skills start to bleed into the realm of web development and data science/business intelligence, there are still many applications for server-side scripting languages, from automation to optimization, that can be very powerful for digital marketers.

Programming for the Web

When starting out on the road to learning server-side scripting, it is most realistic to start with PHP, Python or Ruby on Rails. All three are open-source, have strong communities and plenty of free learning resources. They all offer many similar advantages but each is powerful (and practical) in its own way.

programming languages for digital marketing

You see why I chose python…

PHP, for better or worse, has been the de facto server-side language of the Web for a long time. PHP is what powers WordPress, Magento, ModX, and many other content management systems (CMSs), and if you are in digital marketing for long, you will likely run into at least one CMS powered by PHP. Learning PHP will come in handy when you find yourself wanting to add schema markup for search engines or scripts for testing or analytics platforms like Optimizely or Google Tag Manager.

Depending on the site(s) and development resources (or lack thereof) that you are planning to work with, PHP may be a good choice. It is the easiest code to deploy, as all popular web servers support PHP.

Python is also used to build websites with frameworks like Django and Flask, but more often, sites that are built with Python are apps built for a specific, custom purpose. Unlike PHP and Ruby, which are designed for web development, Python is a general-purpose language, which makes it the go-to language for data science. (The resources featured here are mostly about how to learn Python, as that is the language I have focused on learning the most. It has been great!)

For the technical marketer, Python is useful for scaling big(ger) data-science-y processes like web scraping, querying APIs, interactive analysis, and reporting. Many processes that are carried out manually can be programmed using Python and run on a cron job or other triggers. One major benefit of Python is that it is so easy to learn thanks to the number of educational resources and its friendly syntax. If you find yourself venturing into the world of data science, you will be well prepared with Python, as a large and active data science community supports it.

Ruby on Rails, well, I really haven't played with it much, but I have heard it's very nice. The key, I hear, is that it is good for rapid web app development.

Node and JavaScript were much of the focus of Part 1: Learning Javascript.

Database Querying and Analysis

Digital marketing without data is not digital marketing, and the digital marketer who is not data-literate is just a marketer. I am not arguing that all digital marketers should become SQL ninjas, but learning this skill, like programming, is as much about gaining an intuition about how systems and applications work as it is about developing a practical skill.

Databases and analytics

For a real-world use case that employs this skill as both intuition and a practical skill, look no further than Google Analytics. The Google Analytics web interface is ‘simply’ an elegant way to query, sort, filter and visualize site usage/performance data that is collected in a database. Having a general understanding of how Google Analytics stores data and how different data points/hit types interrelate allows you to be much more precise in your analysis and confident that the data that you pull from Google Analytics is accurate.

SQL knowledge can also help you in times when you need to pull raw data out of Google Analytics for further analysis or to avoid sampling. With Google Spreadsheets' QUERY function, you can query spreadsheet data using SQL (Structured Query Language). For quick analysis and more complex inspection of data sets, writing SQL queries to explore and shape data to your needs can be much quicker and easier to debug than writing a successive set of spreadsheet functions.

When you are dealing with large amounts of Google Analytics data and sampling becomes a significant issue, Google's BigQuery can be hooked up to Google Analytics to provide SQL-like query functionality with greater speed and scale. Once you become comfortable with this GUI-less interface, the ability to query any database becomes much less daunting. You can then answer questions by directly querying databases, such as a website's MySQL database using phpMyAdmin.

“Every question can be distilled into a database query,” Adam Ware of SwellPath told me when I first started learning about databases. The phrase seemed very exciting and has since proven accurate. I have come to realize that databases simply hold all the raw information in a defined structure. By asking the right question in the right way, your digital marketing insights are limited only by your data.

Once you start to understand how databases operate, you will notice their appearance in apps across the web, from ecommerce stores to analytics platforms to blogs. Understanding how data is stored and how to extract the data that you want will also significantly improve your ability to use applications to their full potential, ideate optimizations for existing apps, and learn new applications. This intuition is a skill that helps turn data into knowledge, and as you know, knowing is half the battle.

How to Learn Web Application Programming

Start Here: Codecademy.com

This is a great place to start with any web programming language. It is the quickest, easiest and most fun way to get up to speed with a programming language that I have found. Best of all it is free. It offers courses in PHP, Python and Ruby and hosts very helpful Q&A forums for coders who are just starting out.

Get up to Speed: Intro to Programming with Python (Udacity)

Once you have gotten a feel for programming (and a few bumps and bruises to go along with it), the next place to go is to start to understand the real power that programming offers. Udacity's Intro to Programming in Python picks up where Codecademy.com leaves off and introduces capabilities rather than just syntax and style.

For the digital marketer, this course is especially useful because the course is taught through constructing a very rudimentary search engine crawler (or at least the general idea of one). This application opens a window of understanding how big applications work and will make you think differently about how search engines operate.

How the Web Works: Web Development (Udacity)

There is a lot more than just programming that differentiates marketers who can program from web developers. From hosting, to caching to cookies, this course does a good job introducing these concepts.

From my experience, it was a bit too difficult, as a follow-up to the Intro to Programming in Python course, to actually create and deploy a web app, but it does give you a substantial understanding of technical web terminology so you can communicate effectively with web developers. (This is a very valuable skill if you ask me.) From this course, you will have an understanding of which topics you need to take on in detail to accomplish what you need to do as a technical marketer.

How to Learn Data Analysis with Databases

Become Data-Driven: Intro to Data Science (U. Washington & Coursera)

In my opinion (and I am a bit of a biased data-geek), this is the best online course I have taken. Each lesson offered “aha!” moment after “aha!” moment while teaching really useful skills.

The course assumes only a bit of Python experience and offers a comprehensive introduction to everything from interacting with APIs with Python and querying databases from the command line to how to think and communicate with data. Taking this course will make any digital marketer more data-driven and will back them up with the skills to take action.

Database Deep Dive: Introduction to Databases (Stanford & Coursera)

Slightly more academic than Intro to Data Science, this course provides a very strong foundation for understanding data and databases. If you are a “why does this work” type of person, this course will be very interesting.

From a practical standpoint, the course offers very good lessons on the JSON and XML formats, which are everywhere in digital marketing and which you need to understand in order to work with APIs. The database portion of the course will take you at least as far as you will need to go for the digital marketing applications of databases.

Put it all Together: MongoDB University

If all these courses have been interesting to you and you have a good handle on programming, then this is the course for you! You will build a real web app from the ground up while learning MongoDB hotness. Another digital-marketing-specific benefit to this course is that the app that you build is a blog. Understanding how blog content is retrieved and presented will help you understand a lot about semantic SEO.

I hope you have at least one direction that you are excited about. Leave a comment if you have any questions or follow the rest of  the $10k Technical Skills for Digital Marketing series by signing up for email notifications when new posts are up. API’s, web scraping and “how to learn” are still to come!