Writing Custom Dimensions to Google Analytics from a Snowflake DB

Like many data geeks, I first had my curiosity sparked by Google Analytics. Last week, Census released our Google Analytics integration. I could write a long list of game-changing applications for creating custom dimensions in Google Analytics from data in a Snowflake data warehouse, but I’ve been swirling around one that I find particularly interesting. Here’s what I’m thinking.

Census, as you’d imagine, has a long sales cycle. It’s a freemium product in the middle of multiple stakeholders and touchpoints. It should come as no surprise that it’s easy for us to lose the connection between early marketing efforts and later sales outcomes. Lead scores based on demographics, activity, content consumption, or product usage have been really helpful for us to aggregate signals into a few metrics that show how similar new leads are to past customers.

The problem is that those metrics are only helpful for understanding the funnel after a trial is started. Growth is a process of expanding what works by experimenting and iterating. So to advance our experimentation towards the top of the funnel, we need valid signals early in the journey.

This is finally where Google Analytics comes in. GA is the platform we trust for web analytics and a quick way to understand attribution. But as you know, it was built for ecommerce—not for B2B SaaS. Firing a conversion based on an event gives us a really shallow view of success.

Send data to Google Analytics from Snowflake

With Clearbit Reveal and account-based scoring, we can put score-based thresholds on the traffic coming in (for example, high/mid/low score). With reverse ETL data integration for Google Analytics, we can map these thresholds to custom dimensions, measure the volume of high-quality traffic a given channel is bringing in, and evaluate channels and costs against traffic quality. Sure, the signal is imperfect (all models are), but it’s a lot stronger than the far-less-frequent conversion events. It feels a lot better to me than living and dying by button clicks.
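As a sketch of what that thresholding could look like in the warehouse (the leads table and fit_score column here are assumptions for illustration, not Census’s actual model), a simple CASE expression in Snowflake maps a numeric score to the high/mid/low tier you’d sync to a Google Analytics custom dimension:

```sql
-- Bucket each lead's score into a tier for the GA custom dimension.
-- Table and column names are hypothetical.
select
    email,
    case
        when fit_score >= 80 then 'high'
        when fit_score >= 50 then 'mid'
        else 'low'
    end as traffic_quality_tier
from leads;
```

The thresholds (80/50) are arbitrary here; in practice you’d pick cutoffs based on how past customers scored.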

I recently gave a presentation called Marketing as a Data Product: Operational Analytics for Growth on how to get consistent metrics across Google Analytics, your ads platforms, and Hubspot. It shows how I did this. Check it out!

Yes, you can use Census for free. You can even send custom dimensions from Google Sheets.

Setting up Airbyte ETL: Minimum Viable Data Stack Part II

In the first post in the Minimum Viable Data Stack series, we set up a process to start using SQL to analyze CSV data. We set up a Postgres database instance on a Mac personal computer, uploaded a CSV file, and wrote a query to analyze the data in the CSV file. 

That’s a good start! You could follow that pattern to do some interesting analysis on CSV files that wouldn’t fit in Excel or Google Sheets. But that pattern is slow and requires you to continually upload new data as new data becomes available. 

This post will demonstrate how to connect directly to a data source so that you can automatically load data as it becomes available.

This process is called ETL, short for Extract, Transform, Load. Put simply, ETL just means “connecting to a data source, structuring the data in a way that it can be stored in database tables, and loading it into those tables.” There’s a lot more to it if you really want to get into it, but for our purposes, this is all you’ll need to know right now.

For this part of the tutorial, we are going to use an open-source ETL tool called Airbyte to connect to Hubspot and load some Contact data into the same Postgres database we set up before. Then we’ll run a couple of analytical queries to whet your appetite!

Setting up an ETL Tool (Airbyte)

I chose Airbyte for this demo because it is open source, which means it’s free to use as long as you have a computer or a server to run it on. Much of it is based on Singer, the open-source project that another ETL tool, Stitch, had been pushing before they were acquired by Talend.

The best thing about Airbyte for our Minimum Viable Data Stack is that they make running the open-source code so easy because it is packaged in yet another software framework called Docker. Yes, if you’re keeping score at home, that means we are using one open-source framework packaged in another open-source framework packaged in yet another open-source framework. Oh, the beauty of open-source!

To keep this tutorial manageable, I am going to completely “hand wave” the Docker setup. Luckily, it’s easy to do. Since this tutorial is for Mac, follow the Docker installation instructions for Mac.

🎵 Interlude music plays as you install Docker 🎵

Once you’ve installed Docker, you can run the Docker app, which will then allow you to run apps called “containers.” Think of a container as “all the code and dependencies an app needs, packaged so you can basically just click start” (instead of having to load all the individual dependencies one by one!)

Setting up Docker

We’re only going to download and run one app on Docker: Airbyte! 

Note: If you need help on the next few steps, Airbyte has a Slack community that is really helpful.

To download Airbyte the instructions are simple. Just open up your terminal (you can find this by using Mac’s spotlight search [cmd+space] and typing in “Terminal”). In the terminal just paste in the following three commands:

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
docker-compose up

The commands tell your computer to copy all the code from Airbyte’s GitHub repository into a folder called “airbyte”, then “cd” (change directory) into that folder, and finally tell Docker to run the Airbyte app container.

The beauty of this is that once you run this the first time from the command line, you can start Airbyte from the Docker UI by just clicking the “play” button.

Installing Airbyte via the command line

Airbyte will do a bit of setup and then your terminal will display the text shown above. At that point Airbyte is running on your computer and to use it, all you have to do is open your browser and go to http://localhost:8000

If you’re wondering how this works, Airbyte is running a webserver to provide a web interface to interact with the code that does all the heavy-ETL-lifting. If this were a common ETL tool like Stitch or Fivetran, the webserver and the ETL processes would run on an actual server instead of your personal computer.

If everything has gone according to plan you can go to  http://localhost:8000 and see the Airbyte app UI running and ready to start ETL-ing!

Setting up the Postgres connection in Airbyte

Setting up your first ETL job (Hubspot to Postgres)

I’ll admit, that last part got a little gruesome, but don’t worry, it gets easier from here (as long as everything goes according to plan…)

From here we have to connect both our database and data sources to Airbyte so it has access to the source data and permission to write to the database.

I’ve chosen to load data from Hubspot because it is really easy to connect and because it shows the ups and downs of ETL… And of course, we’re still using Postgres.

Creating a Postgres Destination

All you have to do is paste in your database credentials from Postgres.app. Here are Airbyte’s instructions for connecting to Postgres.

These are the same ODBC credentials we used to connect Postico in the last article. You can find them on the Postico home screen by clicking your database name.

Postgres ODBC connection details

In my case, these are the settings:

  • Name: I called it “Postgres.app” but it could be anything you want
  • Host: Use host.docker.internal (localhost doesn’t work with Docker. See instruction above)
  • Port: 5432 
  • Database Name: Mine is “trevorfox.” That’s the name of my default Postgres.app database
  • Schema: I left it as “public.” You might want to use schemas for organizational purposes. Schemas are “namespaces” and you can think of them as folders for your tables. 
  • User: Again, mine is “trevorfox” because that is my default from when I set up Postgres.app
  • Password: You can leave this blank unless you set up a password in Postgres.app 

From there you can test your connection and you should see a message that says, “All connection tests passed!”

Creating a Hubspot Source

You’ll first need to retrieve your API key. Once you’ve got it, you can create a new Hubspot Source in Airbyte.

I used these settings:

  • Name: Hubspot
  • Source type: Hubspot
  • API Key:  My Hubspot  API key
  • start_date: I used 2017-01-01T00:00:00Z, which is the machine-readable timestamp for 1/1/2017

Setting up the Hubspot connection in Airbyte

Here’s a cute picture to celebrate getting this far!

Getting started with Airbyte

Creating the ETL Connection

Since we’ve already created a Destination and a Source, all we have to do is to tell Airbyte we want to extract from the Source and load data to the Destination. 

Go back to the Destinations screen, open your Postgres.app destination, click “add source,” and choose your source. For me, this is the source I created called “Hubspot.”

Airbyte will then go and test both the Source and Destination. Once both tests succeed, you can set up your sync. 

Setting up an ETL job with Airbyte

There are a lot of settings! Luckily you can leave most of them as they are until you want to get more specific about how you store and organize your data. 

For now, set the Sync frequency to “manual,” and uncheck every Hubspot object besides Contacts.

In the future, you could choose to load more objects for more sophisticated analysis but starting with Contacts is good because it will be a lot faster to complete the first load and the analyses will still be pretty interesting.

 Click the “Set up connection” button at the bottom of the screen.

Setting up the first sync with Airbyte

You’ve created your first Connection! Click “Sync now” and start the ETL job!

As the sync runs, you’ll see lots of logs. If you look carefully, you’ll see some that read “…  Records read: 3000” etc. which will give you a sense of the progress of the sync.

Airbyte ETL logs

What’s happening in Postgres now?

Airbyte is creating temporary tables and loading all the data into those. It will then copy that data into its final-state tables. Those tables will be structured in a way that is a bit easier to analyze. This is some more of the T and L of the ETL process!

As the sync is running, you can go back to Postico and refresh the table list (cmd+R) to see new tables as they are generated. 

Let’s look at the data!

When the job completes, you’ll notice that Airbyte has created a lot of tables in the database. There is a “contacts” table, but there are a lot of others prefixed with “contacts_.”

Why so many tables? 

These are all residue from taking data from a JSON REST API and turning it all into tables. JSON is a really flexible way to organize data. Tables are not. So in order to get all that nested JSON data to fit nicely into tables, you end up with lots of tables to represent all the nesting. The Contacts API resource alone generated 124 “contacts_” tables. See for yourself:

select count(tablename)
from pg_tables t
where t.tablename like 'contacts_%';

This query reads from the Postgres system table called pg_tables which, as you probably guessed, contains a row for each table along with some metadata. By counting the tables that match the prefix “contacts_,” you’ll see how many tables came from the Contacts resource. 
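If you want to see the table names themselves rather than just the count, the same system table can list them:

```sql
-- List the first few tables Airbyte generated from the Contacts resource
select tablename
from pg_tables t
where t.tablename like 'contacts_%'
order by tablename
limit 10;
```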

Why you care about Data Modeling and ELT

In order to structure this data in a way that is more suitable for analysis, you’ll have to join the tables together and select the columns you want to keep. That cleaning process, plus other business logic and filtering, is called data modeling.

Recently, it has become more common to model your data with SQL once it’s in the database (rather than right after it is extracted and before it’s loaded into the database). This gave rise to the term ELT to clarify that most of the transformations happen after the data has landed in the database. There is a lot to discuss here. I might have to double back on this at some point…
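As a tiny sketch of what that in-database modeling can look like, you could wrap the messy JSON destructuring in a view so later queries read from a clean, table-like object (this uses the same contacts columns that appear later in this post; your column names may differ):

```sql
-- A minimal "model": hide the JSON destructuring behind a view
create view contacts_clean as
select
    c.properties->>'hs_object_id' as contact_id,
    c.createdat::date             as created_date
from contacts c;
```

Once a view like this exists, analytical queries can select from contacts_clean instead of repeating the JSON extraction every time.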

Previewing SQL data tables

Luckily, we can ignore the majority of these tables. We are going to focus on the “contacts” table.

Analyzing the Data

Let’s first inspect the data and get a feel for what it looks like. It’s good to start with a sanity check to make sure that all your contacts are in there. This query will tell you how many of the contacts made it into the database. That should match what you see in the Hubspot app.

select count(*)
from contacts

One of the first questions you’ll probably have is: how are my contacts growing over time? This is a good query to demonstrate a few things: the imperfect way that ETL tools write data into tables, the importance of data modeling, and ELT in action.

In the screenshot above, you’ll notice that the “contacts” table has a bunch of columns, but one of them is full of lots of data. The “properties” column represents a nested object within the Hubspot Contacts API response. That object has all the interesting properties about a Contact, like when it was created, what country they are from, and other data a business might store about its Contacts in Hubspot. 

Airbyte, by default, dumps the whole object into Postgres as a JSON field. This means you have to get crafty in order to destructure the data into columns. Here’s how you would get a Contact’s id and the date it was created. (This would be the first step towards counting contacts over time.)

select c.properties->>'hs_object_id' id,
	c.createdat created_at
from contacts c
limit 10;

Notice the field c.properties->>'hs_object_id'. The “->>” operator is how you get a JSON object field out of a JSON-typed column.
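Since every property lives in that JSON column, the same operator pulls out whichever fields you need. For example (the country property name is an assumption; Hubspot property names vary by portal):

```sql
-- Extract several properties from the JSON column at once
select
    c.properties->>'hs_object_id' as id,
    c.properties->>'country'      as country,
    c.createdat                   as created_at
from contacts c
limit 10;
```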

To count new contacts by month, we can add a little aggregation to the query above.

select date_trunc('month', c.createdat::date) created_month,
	count(distinct c.properties->>'hs_object_id') contact_count
from contacts c
group by created_month
order by created_month desc;

THIS IS IT! This is the beauty of analytics with a proper analytics stack. Tomorrow, the next day, and every day in the future, you can run the Hubspot sync and see up-to-date metrics in this report!

You’ll learn that the more queries you run, the more you’ll get tired of cleaning and formatting the data. And that, my friends, is why we have data modeling and ELT!
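To see why, here’s the kind of query you’d quickly get tired of rewriting by hand: a running total of contacts built on top of the monthly counts above, using a CTE and a window function:

```sql
-- Cumulative contact growth, built on the monthly aggregation
with monthly as (
    select
        date_trunc('month', c.createdat::date)        as created_month,
        count(distinct c.properties->>'hs_object_id') as contact_count
    from contacts c
    group by created_month
)
select
    created_month,
    contact_count,
    sum(contact_count) over (order by created_month) as running_total
from monthly
order by created_month;
```

If the JSON destructuring lived in a modeled table instead, this query would shrink to a few readable lines. That’s the pitch for ELT.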

SQL analytics in Postico

I changed to dark mode since the last post :]

Looking Forward

At this point, the stack is pretty viable. We have a Postgres database (our data warehouse), an ETL process that will keep our data in sync with the data in the source systems (Airbyte), and the tools for analyzing this data (SQL and Postico). Now you can answer any question you might have about your Hubspot data at whatever frequency you like—and you’ll never have to touch a spreadsheet!

The foundation is set but there is still more inspiration ahead. The natural place to go from here is a deeper dive into analysis and visualization. 

In the next post, we’ll set up Metabase to visualize the Hubspot data and create a simple dashboard based on some SQL queries. From there, I imagine we’ll head towards reverse ETL and push the analysis back to Hubspot. =]

I hope this was interesting at the least and helpful if you’re brave enough to follow along. Let me know in the comments if you got hung up or if there are instructions I should add.

From Spreadsheets to SQL: Step One towards the Minimum Viable Data Stack

“A spreadsheet is just data in single-player mode”

Last week I made a mistake. I boldly claimed that “A spreadsheet is just data in single-player mode.” And while I stand by that claim, I didn’t expect to be called to account for it.

As it turns out, the post was pretty popular and I think I know why. To me it boils down to five factors. 

  1. The scale and application of data are still growing (duh)
  2. There aren’t enough people with the skills to work with data at scale
  3. There are plenty of resources to learn SQL, but the path to using it in the “real world” isn’t very clear
  4. The tools have caught up, and basically anybody with spreadsheet skills can set up a data stack that works at scale
  5. Now is a great time to upskill, take advantage of the demand, and become more effective in almost any career

The hard part? You have to pull yourself away from spreadsheets for a while—go slow to go fast (and big).

You’ll thank yourself in the end. Being able to think about data at scale will change how you approach your work and being able to work at scale will increase your efficiency and impact. On top of that, it’s just more fun!

A Minimum Viable Data Stack

In the spirit of a true MVP, this first step is going to get you from spreadsheets to SQL with the least amount of overhead and a base level of utility.

In the next hour you will:

  • Stand up an analytical database
  • Write a SQL query to replace a pivot table
  • Have basic tooling and process for a repeatable analytical workflow 

In the next hour you will not (yet):

  • Run anything on a cloud server (or be able to share reports beyond your computer)
  • Set up any continually updating data syncs for “live” reporting

But don’t underestimate this. Once you start to work with data in this way, you’ll recognize that the process is a lot less error-prone and a lot more repeatable than spreadsheet work because the code is simpler and it’s easier to retrace your steps.

Starting up a Postgres database

Postgres isn’t the first thing that people think of for massive-scale data warehousing, but it works pretty well for analytics, especially at this scale. It is definitely the easiest way to get started and, of course, it’s free. Ultimately, you’ll be working with BigQuery, Snowflake, or, if this were 2016, Redshift. 

I apologize in advance but this tutorial will be for a Mac. It won’t really matter once everything moves to the cloud, but I don’t own a Windows machine…

The easiest way to get a Postgres server running on a Mac is Postgres.app. It wraps everything in a shiny Mac UI and the download experience is no different than something like Spotify.

Congrats! You have installed a Postgres server on your local machine and it’s up and running!

Here are some instructions for installing Postgres on Windows. And here’s a list of  Postgres clients for Windows that you’d use instead of Postico.

Now let’s see how quickly we get connected to the database.

Your Postgres server is up and running

Connecting to Postgres

There are plenty of good SQL editors for Postgres but since we are keeping this MVP, I’m going to recommend Postico. Again, it has a simple Mac UI and is designed for more of an analytical workflow than hardcore database management. 

  • Step 1: Head over to https://eggerapps.at/postico/ and download the app
  • Step 2: Move the app to the Applications folder and then open it by double-clicking on the icon
  • Step 3: Create a database connection by clicking the “New Favorite” button. Leave all fields blank; the default values are suitable for connecting to Postgres.app. Optionally provide a nickname, e.g. “Postgres.app”. Click “Connect”
  • Step 4: Go back to Postico and choose the SQL Query icon
  • Step 5: Test your connection by running a query.
Create a new “Favorite” connection in Postico

Run the query “select * from pg_tables;” to see a list of all the tables in your Postgres database. Since you haven’t loaded any tables, you’ll just see a list of Postgres system tables that start with the prefix, “pg_.” As you probably guessed, the “pg” stands for Postgres.
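Once you start loading your own tables, you can filter the system tables out by schema, since everything you create lands in the “public” schema by default:

```sql
-- Show only the tables you created (system tables live in pg_catalog)
select schemaname, tablename
from pg_tables
where schemaname = 'public';
```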

Running your first SQL query in Postico

You’ve done it! You’ve started up a Postgres database, connected to it, and run your first query!

Loading data into Postgres

Ok, the boring stuff is out of the way and it’s only been about 15 minutes! Now we can get to the actual analysis. Next, let’s load some actual data into Postgres.

Loading tables in Postgres is a little bit different (aka more involved) than loading a CSV into Google Sheets or Excel. You have to tell the database exactly how each table should be structured and then what data should be added to each table. 

You might not yet know how to run CREATE TABLE commands but that’s ok. There are tools out there that will shortcut that process for us too. 

The imaginatively named convertcsv.com generates the SQL commands to create and populate tables based on a CSV file’s contents. There are lots of ways to get data into a database, but again, this is an MVP. 

For this tutorial, I’m using the Google Analytics Geo Targets CSV list found here. Why? Because the file is big enough that it would probably run pretty slowly in a spreadsheet tool.
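To demystify what convertcsv.com is about to generate for you: the output is just a CREATE TABLE statement followed by a long list of INSERTs, roughly like this (the column names, types, and sample row below are illustrative; yours will come from the CSV headers):

```sql
-- Sketch of the generated output: one table definition, then inserts
create table geotargets (
    criteria_id  integer,
    name         varchar(80),
    country_code varchar(2)
);

insert into geotargets (criteria_id, name, country_code)
values (1023191, 'New York', 'US');
```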

  • Step 1: Head over to https://www.convertcsv.com/csv-to-sql.htm
  • Step 2: Select the “Choose File” tab to upload a CSV file. 
  • Step 3: In the Output Options section, change the table name where it says “Schema.Table or View Name:” to “geotargets”
  • Step 4: Scroll down to the Generate Output section and click the “CSV to SQL Insert” button to update the output, then copy the SQL commands
  • Step 5: Go back to Postico and click on the SQL Query icon
  • Step 6: Paste the SQL commands into the SQL window 
  • Step 7: Highlight the entirety of the SQL commands and click “Execute Statement”
Uploading a CSV file to convertcsv.com

You’ve loaded data into your database! Now you can run super fast analyses on this data.

Analyze your data!

You’ve already run some commands in the SQL window in the previous step. The good news is it’s always just as simple as that. Now analysis is basically just the repetition of writing a command into the SQL editor and viewing the results. 

Here’s a simple analytical query that would be the equivalent of creating a pivot table that counts the number of rows within each group. Paste this in to find the results.

select g.country_code, count(criteria_id) count_id
from geotargets g
group by g.country_code
order by count_id desc

You’ve done it! You’ve replaced a spreadsheet with SQL! 

But this is not the end. PRACTICE, PRACTICE, PRACTICE! Analysis in SQL needs to become as comfortable as sorting, filtering, and pivoting in a spreadsheet.
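Here’s one variation to practice with: the same pivot-style count, but filtered the way you’d filter a pivot table, using a HAVING clause to keep only countries with more than 100 targets (the 100 cutoff is arbitrary):

```sql
-- Pivot-table equivalent with a filter on the aggregated groups
select g.country_code, count(g.criteria_id) as count_id
from geotargets g
group by g.country_code
having count(g.criteria_id) > 100
order by count_id desc;
```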

Looking forward

If you’re thinking to yourself, “this still feels like single-player mode…” you’re right. This is like the first level of a game where you play in single-player mode so you can learn how to play the game and avoid getting destroyed in a multiplayer scenario.

In fact, you probably wouldn’t do this type of analysis in a database unless you were going to pull in CSV files with millions of rows or if you were to pull in a bunch of other spreadsheets and join them all together. In those cases, you’d see significant performance improvements over an actual spreadsheet program.

The real utility of a database for analysis comes when you have data dynamically being imported from an ETL tool or custom code. On top of that, running a database (or data warehouse) on the cloud makes it possible for you and your team to access the data and reports in real-time instead of just doing analysis locally and sending it to someone else as a CSV or other type of report document. Hopefully, I don’t need to tell you why that is a bad process!

If I stay motivated… The next step will be to dynamically import data into the new Postgres database with an ETL tool called Airbyte which also runs on your computer. At that point, the scale and complexity of the analysis will really increase.  

After that, as long as I keep at it… the next step would be to set up a BigQuery instance on Google Cloud. At that point, you can combine a cloud-based business intelligence tool with Airbyte and BigQuery and start to get a taste of what a functioning modern data stack looks like. 

I hope this was a helpful start for you.  Let me know in the comments if you get hung up.

There’s something to learn from black hat SEO tactics for PDF files

Go search Google for “free tiktok followers” and you will see some brilliant, albeit dirty, #blackhat #SEO. Most of the top ten results belong to a spammer who has hacked into dozens of sites, created fake profiles, and uploaded .pdf files to public directories with spun content.

That scam is kind of surprising, but it’s why it works that is particularly interesting. It demonstrates that, in emerging spaces with low competition, competitors can win by virtue of just having one or two solid ranking factors, despite featuring absolute garbage content.

For example, the City of Neehaw, Wisconsin, runs a WordPress site that should be trustworthy. It’s a government site with lots of links pointing to it. In this corner of the internet, Google doesn’t care if the content screams spam; if the site is trustworthy, the content ranks.

Can you apply this tactic?

If you’re like me, you’d at least play with the idea…

When you get past your moral objections, you can apply this “insight” by hosting .pdf files in the static directories of public hosting sites. Let’s look at an example.

This might blow your mind. The subdomain that hosts the static assets that belong to the sites hosted on Squarespace has a Moz Domain Authority of 80! To put that in context, YouTube is 100, and 80 is about the score of Squarespace.com itself!

No surprise: with that monstrous domain authority and thousands upon thousands of hosted .pdf files, the site ranks for a ton of keywords.

And the traffic? About two-thirds of a billion visits per month.

Can a PDF outrank a page?

The short answer is yes. These PDF files rank #1 for dozens of searches and top 10 for millions. And that’s just from one site.

Granted, they are mostly very low competition keywords, so take that with a grain of salt. The highest-competition search term with a PDF file ranking #1 has a keyword competition of 48. Not super high, but volume doesn’t always equal value!

The thing to note here is how many of these searches are related to shady or illicit activities. Lots of them promise social media followers, different types of “hacks,” and copyright infringing activities.

And finally, I couldn’t find any instances where a Squarespace-hosted static PDF outranked a page on the Squarespace-hosted site. I can think of one big reason that Google would prefer a responsive page over a PDF: mobile devices don’t give a damn for PDFs.

Can a PDF file get a featured snippet?

Again, the short answer is yes. In the absence of competition, the quality and relevance of the content (and let’s not forget the domain authority) give the PDF file the edge.

The interesting thing here is that the snippets lack any structural formatting. Because a PDF is much like a big .txt file in Google’s eyes, there is no option to create a table or bulleted list. It’s just text or nothing.

Should you apply this tactic?

Look, I’m not the SEO police. I’m just a guy who loves competition and is fascinated by search.

Should you hack a site just to upload some spam? No.

Is it worth trying to host a PDF on a public hosting domain if it’s your only chance at ranking for a target keyword? Maybe…

I love thinking about #minimumviableSEO and sometimes you have to be creative to rank for terms that your site isn’t ready to rank for.

But keep two things in mind. You won’t have tracking in Google Search Console or Google Analytics because, as you might have guessed, you can’t upload Google Search Console’s HTML verification file to the static hosting domains. (Of course, I tried it!) This means the only tracking you’ll get is from any tracked links that point to your site from a PDF.

So consider your costs and benefits. If you’re not just some spammer and you’re really thinking about SEO strategy, you’ve probably got better things to do with your time.

I hope this was interesting at the least, and that it scratched your curiosity itch as it did mine!

Google Analytics 4 Pageview Custom Dimensions and Content Groupings

I decided it’s time I’d learn GA4 so I figured I’d implement it net new and figure it out along the way. A couple of things I rely on heavily on my sites are Content Groupings and Custom Dimensions. Luckily, GA4 handles these a lot better than UA—even if they are still a bit cumbersome.

I learned that there is only one Content Group parameter, called content_group (instead of 5). At first glance, it seems that event-scoped custom dimensions are capable of doing what content groups used to handle (thanks to the new event-first, rather than session-first, paradigm).

To set custom dimensions, you can just set arbitrary parameters like canonical_name or page_author.

Shown below is an example of the query parameters that are sent with the page_view hit. Notice the “ep.” (event parameter) prefix: I only passed canonical_name as a parameter, and the GA code adds the prefix when the hit is sent. This is also true for content groups.

The query string parameters correspond to the Measurement Protocol

  • en: page_view
  • ep.criteria_id: 2554
  • ep.canonical_name: New%20Zealand
  • ep.content_group4:
  • ep.content_group3: NZ
  • ep.content_group1: Country
  • ep.content_group2: Active
  • ep.debug_mode: false

Notice the debug_mode parameter, which I set so that I could take advantage of the debug view in the Google Analytics UI (it’s enabled when debug_mode is true). That thing is pretty neat and helpful for spot-checking custom parameters.

One thing that seemed silly was that I couldn’t just arbitrarily name my content groups and assign them in the UI, but hey, it’s still a lot better than UA.

Defining and Naming Your Custom Dimensions

After you send the parameter with the hit, you have to set up the custom dimension in the GA4 UI. You can set them up by going into the left panel and selecting “Custom Definitions.”

The nice thing is that, if you’ve already set a parameter, you can just select the parameter name from the list and all you have to do is provide a Dimension Name and Description.

Pageview Custom Dimensions in Reports

To see the custom dimensions in a real report, you can go into the left panel and select Engagement > Pages and screens. Then you can find the dimension by clicking the big blue plus sign (+) and selecting whichever you like.

If you’re curious about how it works and you want to see it for yourself, you can go to the site and open up your Network console and filter for “?collect” to see the GA hits. The site is here: https://analyticscodes.com/ Please visit and send some data into that GA4 account!

(Re) Defining Growth Experiments

I believe that, above all else, the role of Growth in an organization is to learn. Of course, the objective is to learn in the service of growth, but I believe the growth role is different from many traditional roles because it demands that you navigate through a lot of ambiguity to optimize for the fit and volume of transactions between a product and its customers. Learning is the best antidote to ambiguity.

I’ve been stuck on terminology lately. The more I’ve spoken with people or read on the topic, the more interpretations of the term “growth experiment” I’ve come across. And through churning on this topic over the last couple of weeks, I came up with a definition that I’m not only comfortable with but also happy with.

But before I get into that, let me share the origin of this conundrum.

Many interpretations of “Experiment”

I started in digital marketing, and that isn’t the best anchoring on this topic. That space has a very narrow and very traditional definition of the term “experiment.”

The digital marketing definition of an experiment draws heavily from the scientific method: you develop a hypothesis about a causal factor and a change you want to make, you split your subjects into two or more groups, you keep a control group, you expose your experimental group(s) to some conditions, and you do some statistical analysis to determine whether your test conditions had an effect, then move forward accordingly. 

You can run these types of experiments with digital ads, email campaigns, landing pages, and other forms of digital experiences. This is also the common conception for product designers and product managers. You can see evidence of this when you see URLs like /signup-b in a web app. 

The problem with this interpretation is that it requires experimenters to be able to collect enough data to come up with a statistically valid result. 
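To make “enough data” concrete, here’s roughly what that statistical analysis boils down to: a standard two-proportion z-test (a sketch of my own; the traffic numbers are invented):

```javascript
// Two-proportion z-test: is variant B's conversion rate different from A's?
function zScore(convA, visitorsA, convB, visitorsB) {
  const pA = convA / visitorsA;
  const pB = convB / visitorsB;
  // Pooled conversion rate under the null hypothesis (no real difference)
  const pPool = (convA + convB) / (visitorsA + visitorsB);
  const stdErr = Math.sqrt(
    pPool * (1 - pPool) * (1 / visitorsA + 1 / visitorsB)
  );
  return (pB - pA) / stdErr; // |z| > 1.96 ≈ significant at p < 0.05
}

// 10% vs 13% conversion on 1,000 visitors each clears the bar...
console.log(zScore(100, 1000, 130, 1000).toFixed(2)); // → 2.10
// ...but the exact same rates on 100 visitors each do not
console.log(zScore(10, 100, 13, 100).toFixed(2)); // → 0.66
```

Notice that a 30% relative lift is invisible at low traffic: that data volume requirement is exactly what boxes small teams out of this definition.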

The downsides of this belief are threefold: 

  1. The data volume requirement leads you to think you can’t run “proper experiments” if you don’t have enough data.
  2. You have to wait so long to reach statistical significance that the slow rate of learning renders the experimentation next to useless. 
  3. It confines your thinking: if an activity doesn’t conform to the rigorous divide-and-expose-to-conditions structure, it isn’t actually an experiment.

The danger of this interpretation is the fallacy that I myself have struggled with: thinking that you can’t learn if you can’t experiment.

The worst possible failure mode is to believe that the only organizations capable of experimentation are those capturing massive volumes of data, who can run a controlled test on button color with infinite precision. 

Can you imagine if only products with over 100k users per month could run experiments? Or if businesses at the scale of Google, Facebook, and Amazon were the only ones that could experiment on anything more interesting than UI changes? This barrier would drastically stifle innovation. 

A more functional definition

The crux of the definition is the weight given to structure versus intent. Where I got stuck was defining “experimentation” by the structure of the activity (groups and conditions) rather than the intent of the activity (learning). 

A more functional definition prioritizes intent over structure. If you ask yourself, “what is an experiment if the intent of the activity is to learn?” then the scope and structure of the activity change accordingly. 

This leads me to a more functional definition:

A growth experiment is an activity structured in such a way that it can provide evidence of an effect on growth and an understanding of why it had that effect.

In other words, it’s an activity that is designed to accelerate growth and has the feedback loop built in. In the face of ambiguity, a growth experiment drives a stake of certainty into the ground that suggests where you might experiment to grow into more unknown territory.

Then what becomes an experiment?

With the definition focused appropriately, the scope widens significantly. You can fit any activity into it as long as you can structure the activity in such a way that it is instructive about growth. 

Now the question is, “should everything be an experiment?” After thinking about it, I couldn’t come up with a reason to say “no.” Without getting into all the details, I couldn’t think of an activity so insignificant that it shouldn’t be structured in a way I could learn from. 

If I can’t justify the effort of structuring the activity so that I can learn from it, should I take on the activity in the first place? If I won’t be able to tell whether it was effective, how will I know it was worth doing?! 

If an activity isn’t justifiable, is it justified? 

Welcome to my own personal hell.

Inside the scope of an “experiment”

Run a campaign, build a feature, launch a new product, target a new market, raise prices, lower prices, start a business, kill a feature, replace a vendor, hire an employee. All these activities can be experiments but few of them fit within the scope of the A/B test definition I provided above. You’re never going to hire two employees and only intend to keep one. If you are, you should quit your job.

The common thread among all these activities is that you can develop a hypothesis about the intended effect and you can observe an initial state, observe the activity itself (especially the costs), and observe outcomes over time. You can develop an understanding of the causal relationship between these activities and your growth goals or your growth barriers.

Let’s break the structure down a bit more concretely. 

How to structure a growth experiment

I think there are five parts to an experiment, and you’ll notice none of them explicitly includes a wiki page, Google Analytics, or a Student’s t-test. The experiment structure should be appropriate to the investment in the activity.

Determine a relationship: Observe or assume a relationship between some action and your growth goals.

 “It looks like people who use the chat widget are more likely to become a customer”

Develop a hypothesis: Create a rationale for causation between affecting one side of the relationship and an expected outcome on the other. 

“If we pop up the chat when we get a signal that a user is lost, they will engage with the chat, we will reduce confusion, and ultimately increase the likelihood that the user becomes a customer.”

Create and observe feedback loop: Perhaps the most important part is being intentional about the type of feedback that you collect. Your feedback must be relevant to the activity and instructive. In the examples above, traffic doesn’t matter, neither does NPS. What matters is whether the people you intend to expose to chat actually use it, if they have a good experience with the chat, and it pushes them to become a customer. 

“I’m going to track the number of people that are exposed to the chat pop-up and if they engage with it. I’m going to check the transcripts of these conversations to see if they are helpful for the users.  I’m also going to check to see if these users become customers.”

Now, this actually might be a case for an A/B test but it likely is not, given the number of users who are likely to be exposed to the new experience. Bear in mind that this type of longitudinal study (A for a period, then B for a period) creates some risk around bias in interpretation but the qualitative feedback from the chats can be extremely instructive in their own right.

“Post Mortem” assessment: I think post mortems are great for the learning process and for getting people on the same page, but the name is a little misleading because not all experiments “die.” I look at this as simply taking the opportunity to review your feedback and ask yourself (and others), “did this action have the effect that I intended, and can I discern why?”

“Did any users engage with the chat when we intended? Did it have any effect on their experience or on their likelihood to become a customer?”

Evaluation: Evaluation is different from assessment because it layers analysis of the outcome on top of the activity itself. This is a good time to develop some theories about growth and ask yourself: was the activity worth it? Was it likely better than alternative activities? Should we double down or shut it down?

“We saw that the users who engaged with the chat were, by and large, looking for a different type of product. The additional load on our customer support team didn’t yield any favorable outcomes, so we cannot justify keeping this activity running. However, we did discover some interesting insights about complementary features and how we could improve our onboarding flow.”

I want to reiterate: the experiment structure should be appropriate to the investment in the activity. You never want the structure to be more cumbersome than the activity. So sometimes, the structure is just a little bit of observation and an open mindset to unexpected outcomes. 

You never want to get so bogged down in observation that you slow down your rate of learning. You can think of it as an optimization between creating the most possible learning opportunities while extracting as much learning from them as possible. You’ll probably end up 80/20 on both experiment velocity and feedback structure. 

Remember though, this is all about learning.

Read This Before You Run Your First SEO A/B Test

This is a quick post to provide context for the SEO A/B test-curious people out there. I was prompted by a thread in Measure Slack and figured long-form would make more sense. I didn’t want to make this another hideously long SEO-ified post but rather get to the point quickly.  Here’s the post and then I’ll dive into my thoughts about SEO A/B testing.

After writing this, I realized I didn’t actually address statistical significance but as you’ll see, if you’re running experiments that are dependent on a fine margin of statistical significance your time is probably better spent elsewhere. Read on to see what I mean.

How is it different than a regular A/B test?

SEO A/B tests differ from normal A/B tests (run with tools like Optimizely or Google Optimize) in two major ways: implementation and measurement.

Unlike A/B testing tools like Google Optimize that apply the test treatment to the page once it’s rendered in the browser (using JavaScript), test treatments in an SEO test should take place on the server. Why? Because Googlebot needs to recognize these changes in order to index the page appropriately. Yes, Googlebot does run some JavaScript on certain pages in certain conditions, but there is no guarantee that it will execute the JavaScript required to be exposed to a test treatment. This means that a test page needs to be crawled in its final state, and the only way to guarantee that is if the changes happen server-side. That’s where things get weird.

There are tools out there for running server-side A/B tests, but none are remotely as simple as Google Optimize—they all require server-side changes. Still, SEO A/B testing frameworks are not terribly complex to code. A typical framework takes the identifier of a page (for example, the product ID, or the integration slug in the case of a website like Zapier) and applies a variant-assignment algorithm to it. This could be as simple as checking whether a product ID ends in an odd or even number, or as complex as a string hashing function and a modulo operator that returns a 0 or a 1 to assign the A and B variants. In any case, this is, at the end of the day, a substantial product feature. See how Pinterest runs tests.
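Here’s a minimal sketch of that variant-assignment logic (my own illustration, not Pinterest’s code; `assignVariant` is a name I made up):

```javascript
// Deterministically assign a page to variant "A" or "B" based on its
// identifier (e.g. a product ID or integration slug). The same input
// always yields the same variant, so assignment stays stable across
// requests and deploys without storing any state.
function assignVariant(pageId) {
  // djb2-style string hash, kept as a 32-bit unsigned integer
  let hash = 5381;
  for (const ch of String(pageId)) {
    hash = (hash * 33 + ch.charCodeAt(0)) >>> 0;
  }
  // Modulo 2 for a 50/50 split
  return hash % 2 === 0 ? 'A' : 'B';
}

console.log(assignVariant('slack-integration')); // same letter every time
```

On the server, you’d call this while rendering the page and apply the A or B template accordingly.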

On the measurement side of things, either you’re using a proper server-side A/B testing tool with measurement capabilities, or you have to go out of your way to track the results in your own tool. If you go the “roll your own” route, the same A/B assignment logic that determines the page treatment needs to be passed along to your web analytics tool. A simple way to do this is to assign the variant in the dataLayer and use Google Tag Manager to assign a Content Grouping (A or B) to the page in Google Analytics. Content Groupings are a better choice than hit-level custom dimensions because they apply to landing pages by design. 
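Concretely, the hand-off to Google Tag Manager could look something like this on the rendered page (a sketch; `seoTestVariant` is a key I made up, and your GTM Data Layer Variable just has to match it):

```javascript
// In the page <head>, before the GTM container snippet.
// (globalThis is the window object in a browser; using it keeps this
// snippet runnable outside a browser too.)
globalThis.dataLayer = globalThis.dataLayer || [];

// The variant value is rendered server-side by the same assignment
// logic that chose the page treatment.
dataLayer.push({ seoTestVariant: 'B' });
```

In GTM, you’d read `seoTestVariant` with a Data Layer Variable and pass it into the Content Grouping field of your Google Analytics pageview tag.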

Here’s an example of an A/B test that had no effect. Can you tell when it starts? When it ends? Who knows—that’s how you know the test was not effective! 

Interesting note on this experiment: the treatment was Schema.org FAQ markup across ~8k pages. Google decided to recognize the schema on only 200 of them, making it impossible to detect an effect… and a waste of time to implement at scale if the change wouldn’t have a tangible effect.

Random assignment and Scale

If you’re thinking about running an SEO A/B test, random assignment and scale are two things you must consider from the outset. Just like browser-based A/B tests, you need to be able to trust that your groups are assigned randomly, and you need to see enough traffic to produce valid results. I addressed random assignment a bit above, and that’s not that hard to account for. It’s detecting a test’s effect that creates a challenge.

The added layer of complexity in an SEO test is crawling and indexing. Because SEO tests are meant to effect a lift in ranking or CTR, you have to be positive that Google actually indexes your changes for the changes to take effect. Some pages don’t get crawled frequently and some, when they do, take forever to get re-indexed. This means there will be a lag to see results–the duration of that lag depends on your site’s crawl rate and Google’s opinion about how frequently it wants to reindex your pages. 

This means that scale matters a lot. If you know you will have to live with some degree of imperfection in your test, you have to overcome that with scale. By scale, I mean lots and lots of pages. The more pages you have and the more traffic they get, the more easily you will be able to detect your two time-series plots diverging as the pages are crawled, re-indexed, and the changes take effect.

You’re probably asking, “how many pages do I need?” Well, I don’t have any science behind this, but I would say 1,000+ as a bare minimum. And those 1k pages had better have lots of traffic; if they don’t, it will be harder to attribute any changes to the test versus randomness. Traffic matters for another reason too: SEO A/B testing is a relatively high lift, so a cost/benefit analysis of the potential upside is imperative before getting started. 

All that said, I’d be pretty confident in a 50/50 split across 10k pages. If you don’t have 10k pages, let alone 1k, you’re probably better off developing your programmatic SEO to reach this scale. VS pages, Integrations pages, Category list pages, and Locations pages are all good ways to get that page count up (and building them will have a bigger effect than all the optimizations after the fact).

Tracking and Goals

I mentioned some thoughts on tracking page variants in Google Analytics in the first section. That part is a technical problem. The goals part is a business problem.

Generally speaking, an SEO A/B test should focus on traffic. Why? Because in most cases, an SEO test will have the biggest impact on traffic and less of an impact on, say, the persuasiveness or conversion rate of a set of pages. Sure, you could run title tag tests that drastically change the keyword targeting or click-through intent, but it’s usually safe to say that you will simply start getting more or less of the same traffic, and you can assume that traffic will convert to leads, revenue, etc. at the same rate. 

Another argument for traffic is that changes in organic traffic volume are affected by fewer variables than revenue. The further the goal is from the lever you’re testing, the more data you have to collect to be sure that the test is actually what is causing the effect.

High impact tests

Finally, if you’ve made it this far, you’re probably wondering about some test ideas that you have. Here is how I think about prioritizing SEO tests. 

First, think about treatments that are high-SEO-impact. For me, title tags and meta descriptions are at the top of the list because, even if you aren’t able to affect rankings, they can have significant impacts on click-through rates. Another upside is that you will be able to see the effects of your test on the search pages. An “intitle:<my title tag template string>” search in Google will give you a sense of how many of your pages have been indexed. This is something you can check daily to see how Google is picking up your changes. 

Second, consider Schema.org changes because those can also strongly impact SERP CTR. The downside, I’ve learned, is that FAQ schema changes have the potential to actually hurt CTR if the searcher can answer their question from the search page without clicking through. Most other types of structured data that are reflected in the SERP will have a positive effect. For example, try omitting the star rating schema if the product is rated less than 3/5.

H1 tags, copy blocks, and page organization are other options, but be careful because these will affect the page’s UX. Copy blocks are likely to have the biggest effect in search because they can broaden a page’s keyword exposure.

At some point you really have to ask yourself: is this test actually better as a browser-based test? Is the change obvious enough to just make it rather than test it? Is this whole thing worth it, or are there better things I could be spending resources on? (That last one is a good one!)

Ok, I hope that helped you get a bit more of a grasp on SEO A/B testing. It was a little bit of a barf post, but hey, I have an 8-month-old baby to watch after these days!

How to Highlight Text On a Page with “Scroll to Text Fragments”

Scroll to Text Fragments

If you read my blog, you probably know that I am a bit of a geek about URLs. The “Uniform Resource Locator” is a fundamental building block of the internet and is responsible for a lot of interesting features of the information superhighway.

URLs make it possible to navigate from page to page, to change the contents of a page with query parameters, to anchor to specific locations in a document, and even to specify the results of a REST API.

Now URLs offer one more cool feature: the ability to highlight text based on the contents of a URL.

This new feature is called “Scroll to Text Fragments.”

Want to see how it works? Here’s a demonstration:  Try it!

What are “Scroll to Text Fragments?”

Scroll to Text Fragments (also known as text fragments) are snippets of text that, when appended to a URL, indicate to the browser that the page’s corresponding text should be highlighted. They are syntactically similar to URL fragments, but instead of using the hash symbol (#) followed by an identifier, Scroll to Text Fragments use the sequence #:~:text=.

How do you Create Scroll to Text URLs?

Scroll to Text URLs are created with the following syntax:

https://example.com#:~:text=[prefix-,]textStart[,textEnd][,-suffix]
For those of you who are unfamiliar with this type of notation, I’ll break it down into its components.

Simple Form

The Scroll to Text Fragment requires two components, the fragment prefix: #:~:text=, and a snippet of text from the page. The snippet is denoted above as textStart.

So the simplest form of a Scroll to Text URL would be https://example.com#:~:text=This%20is%20my%20text

Abbreviating Fragments with a Snippet Start and End

You may want to highlight a longer passage of text without adding a ton of characters to the end of your URL. In that case, you can specify the textStart and the textEnd to encapsulate the phrase without including the whole thing.

To abbreviate the fragment, all you have to do is include a few words at the beginning of the chosen phrase and a few words at the end. Like this: https://example.com#:~:text=textStart,textEnd.
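If you’d rather not hand-encode snippets, a small helper can build these URLs (my own sketch; `textFragmentUrl` is not a built-in browser API):

```javascript
// Build a Scroll to Text URL from a start snippet and an optional end
// snippet (the abbreviated form). Snippets are percent-encoded so that
// spaces and punctuation survive inside the URL.
function textFragmentUrl(pageUrl, textStart, textEnd) {
  let fragment = '#:~:text=' + encodeURIComponent(textStart);
  if (textEnd) {
    fragment += ',' + encodeURIComponent(textEnd);
  }
  return pageUrl + fragment;
}

console.log(textFragmentUrl('https://example.com', 'This is my text'));
// → https://example.com#:~:text=This%20is%20my%20text
```

The comma separating textStart and textEnd is left unencoded on purpose; it’s part of the fragment syntax.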

Let’s use the lyrics from the Beatles’ Yesterday as an example.  I’ll use this base URL for all the examples: https://www.lyrics.com/lyric/758159/The+Beatles/Yesterday

To highlight the second verse, you could use this link:

Abbreviated scroll to text snippets

Or you could use this abbreviated version:


Disambiguating Text with Prefixes and Suffixes

Disambiguating means specifying a particular instance of a phrase when there are multiple instances of that phrase on the page. To disambiguate the text, you use the full form, like this: https://example.com#:~:text=[prefix-,]textStart[,textEnd][,-suffix]

Again, lyrics are a perfect example thanks to choruses and bridges. Here is a link to the abbreviated form of Yesterday’s bridge. Because the link doesn’t specify which passage it refers to, the browser defaults to the first one.

Select the first matching text

Here is an abbreviated link to the second passage. In order to specify that text, it includes a prefix- and a -suffix where both of those identifiers are the words that precede and follow the specific phrase.

Disambiguate text snippets with prefixes and suffixes

Notice that the fragment text identifier starts with yesterday-, (which is the last word of the verse that precedes the second bridge). It is also followed by ,-Yesterday%0ALove which is the text that follows the second bridge.

Selecting Multiple Text Snippets

This is where things get really wild. You can choose multiple text snippets by joining them together in the URL fragment with &text= (similar to query parameters).

Here is an example of choosing three distinct text snippets. Each snippet is the rhyming words of the song’s first verse. 

Select multiple text snippets

You’ll notice that I used a suffix to disambiguate the word “yesterday.” When specifying multiple phrases, disambiguation is especially important.

How do Scroll to Text URLs Work?

Scroll to Text Fragments do three things: 

  1. Scroll to the location on the page of the specified text
  2. Highlight the text specified in the URL
  3. Style the highlighted text according to the site’s ::target-text CSS styling

The third item is especially interesting. If your site’s CSS defines a ::target-text style, then you can control how the highlighted text is displayed.

To do this, you would add the following to your CSS style sheet:

::target-text {
  color: green;
  /* ... other fun styling */
}

Why do Scroll to Text URLs matter?

You might be asking yourself, is this more than just a silly browser parlor trick? It is a trick, but it’s also a lot more than that. Here are a few reasons why it’s interesting:

Scroll to Text and SEO

You may have noticed that Google’s featured snippets now use this technology if the user’s browser supports it. According to the committee that wrote the proposal, “Fewer than 1% of clients use the ‘Find in Page’ feature in Chrome on Android.” So this will prevent cases where searchers click a featured snippet, don’t know where the text is on the page, and have to go searching through the document. This feature is a huge UX upgrade!


What’s better than citing your work? Citing the exact passage that was referenced! Enough said.


We have yet to see all the interesting things that people will build on top of this feature. Medium’s highlight-and-share feature offers similar functionality and provides a really nice experience. I anticipate that developers will incorporate this functionality into their sharing widgets for a little flair. 


Yeah, I’m a bookmarklet geek too, and of course the first thing that came to mind was to make one. Luckily, there are a couple of bookmarklets out there already. One fancy one is by Supple, and super web geek Paul Kindlan made another. Both are pretty sweet.

Paul’s is beautiful in its simplicity. All you have to do is highlight some text, then click the bookmarklet and the js code resets the browser location with the appropriate scroll to text fragment appended to the URL.

  const selectedText = getSelection().toString();
  const newUrl = new URL(location);
  newUrl.hash = `:~:text=${encodeURIComponent(selectedText)}`;
  location.assign(newUrl); // update the address bar with the new fragment

To add Paul’s bookmarklet, just drag the link below into your bookmark bar.

Scroll to Text Link Creator

Exceptions and Things to Watch Out For

Since this technology is brand new, it isn’t yet supported by all browsers. The feature is currently supported only by Chrome and Edge, while Firefox and Safari do not yet support it. But hey, nice job, Edge!

Even in the browsers that do support the technology, there are some idiosyncrasies. Pages already use ordinary fragment identifiers for things like page anchors, hash routing in SPAs (single-page apps), and media fragments, and if you want to combine a Scroll to Text Fragment with one of those, you may run into trouble. 

Technically, it’s possible. Since it’s all fragments, you can append :~:text= to the existing fragment (without the leading #). But if you do that, you run the risk of breaking the pre-existing fragment’s functionality, so watch out.

There is also a risk on the server side. Scroll to Text Fragments cause some sites to return a 404 page, depending on how the server is configured; GitHub is an example of a site that does this. You can also configure your own server to opt out of the behavior by including the document header Document-Policy: force-load-at-top.

Highlight the World!

It’s a fun trick to know next time you need to send a link on Slack to the exact right place that you want to bring the recipient’s attention to.

Don’t forget to share and highlight responsibly!

How to Choose a Domain Name for SEO

I’ve been thinking about names a lot lately. My wife and I are running through the process of generating and narrowing down a list of names for our oncoming baby. We didn’t do ourselves any favors by not discovering the gender prior to birth. I like to think this is a good way to remove any bias from the experiment. ;p

What’s in a (domain) name?

Picking a domain name is similar to naming a kid. We humans like to label things, and we love to assign a lot of meaning to those labels (whether that meaning is valid or not). If you’ve ever read Freakonomics, you’ll remember the evidence of a correlation between names and things like career prospects. Without diving into the ethical discussion here, you can’t help but recognize the weighty and enduring effect of a name.

When it comes to SEO, naming affects two major levers: relevance and trust. From a ranking algorithm perspective, a domain name is certainly less important than things like content, geolocation, language, links, and domain age (though there are some benefits that I’ll discuss below!). On the other hand, from a search engine results page (SERP) perspective, a domain name can have a lot of influence on click-through rates. 

Consider relevance. When we lived in Hong Kong, we lived near a pet shop called “Bob’s Paradise.” And while we eventually realized that it was a pet store named after the lazy French bulldog that greeted you with a snort when you walked in, the name proved highly irrelevant in search results. Whenever anyone searches for “dog food,” they are likely to skip over that listing and select something that seems more relevant to their query—like, for example, “Pet Line.”

In the local pack on the SERP, it looks like they finally got wise to this and updated their Google My Business listing…

Relevant domain names on the SERP

Now let’s consider trust. We’ll continue with the Hong Kong-related anecdotes. Many Hongkongers are given a traditionally Cantonese name at birth, then given a chance to choose their English name when they are a kid. In theory, I love this idea. I probably would have named myself Firetruck Fox. But this childhood choice might have a tendency to initiate an uphill battle as one tries to establish authority in a workplace. GI Joe and Pussy are two names that come to mind.

Extending this anecdote, you’re probably unlikely to choose “pussydental.com” over gentledental.com in a search for a new dentist. I’m guessing… 

Relevance Signals and Exact Match Domains

There’s a lot of debate over whether exact match domains affect how a domain ranks for the term they match. I’m not convinced anybody has proven things one way or another, but there are some tangible factors to consider that have less to do with the domain name itself: backlink anchor text, page title templates, and topical relevance.

Branded anchor text = Keyword anchor text

Consider an experimental site that I created: googleappscripting.com. Yeah, it’s an exact match domain for the main topic of the site: Google Apps Scripting

As you can see, I linked to it with its branding (which also happens to be the exact search terms it’s targeting.) By virtue of a cleverly selected domain name, it becomes much easier to get highly targeted anchor text links. 

Exact match domain keywords targeting

Keywords in Page Title Templates

Also, consider the home page title, Google Apps Script Tutorials and Examples • Making Google Apps Script Accessible to Everybody and post tag templates: {{post_title}} | Google Apps Script Tutorials and Examples. (Ok, I’ll admit they’re too long but loosen up!)

This way, every page on your site will have important keywords in the page title template. This is a commonly used tactic among affiliate sites. I know this because I recently had to research baby monitors. I found several sites like bestbabymonitors.com, babymonitorlist.com, and babymonitorguide.com. All these sites will have important primary keywords and relevant modifiers as part of every page title.  Most of the articles on these sites are just narrowly keyword-targeted listicles about baby monitors, so their page titles end up looking something like this:

{{keyword targeted list}} | Baby Monitor List


10 Best Baby Monitors for Security | Baby Monitor List

Keywords Everywhere in title

It’s a cheap but effective strategy.

Exact Match Names and Topical Correlation is Not Ranking Causation

Finally, it bugs me how often plain old topical relevance is considered causal rather than correlative when it comes to why exact match domains affect SEO. When you name your domain after the topic you’re going to cover, there is just plain old topical relevance, and Google is going to rank your site for that topic. There’s an obvious correlation between your target topic, your content, and relevant search queries! 

It’s just like slapping an “Eggs” label on an egg carton. It’s not the label that signals that it’s a carton full of eggs; it’s the fact that it’s an egg carton. Everybody would understand what it is whether it was labeled or not. You cannot suggest that it’s because you labeled the egg carton with “Eggs” that people understand it as such. It’s a correlation. No way to prove causality here. 

There probably was a time that you could trick Google into thinking that searchers were looking for your brand name (think bestbabymonitors.com here) rather than a generic search term (best baby monitors). I’d argue, at this point, Google is sophisticated enough, thanks to the Knowledge Graph, to be able to differentiate between common search terms and brand names that were created to match the search terms.

How to signal trust in a domain name

On the SERP, you only have a fraction of a second to convey that your site is trustworthy. First impressions are everything in this situation, so you don’t want to leave anything to chance!

Obviously brand is a HUGE factor here. Consider the domains, Moz.com, or Yelp.com. Both of these brands have established a reputation by developing content and products that are trustworthy. (But if we’re just starting out with an SEO project, we’re not that lucky.)

One way to build trust is to choose a domain name that conveys that your site specializes in a topic or covers it comprehensively. That is the rationale behind two projects that I’m launching in tandem with this series: techdefs.com and staticinsights.com. These names convey what you’re going to get: definitions for technical terminology, and some insight about static (site generators). More on those to come. =] 

Given the choice between jaredsblog.com and staticinsights.com, I’d assert that most people are going to choose staticinsights.com for the query “pelican vs jekyll” because that site seems to focus on the topic specifically, whereas, well, who knows about Jared!? (We’ll find out if this is true soon!)

And how to lose it…

If you’re starting from scratch, there’s more to lose than to gain when picking a domain name. Here are a couple of things to avoid when choosing a domain name.

The first is probably obvious: avoid crazy top-level domains (TLDs). Don’t think too hard on this one: choose a .com domain name whenever you can. Other TLDs like .org or .net might be appropriate, but if you can’t find the right .com name, consider re-exploring the .com possibilities before settling on an unusual TLD. 

Some TLDs like .io and .co are gaining acceptance in certain spaces but it will be a long time before the majority of internet searchers trust .xyz and .guru domain names.

This is a little like the Freakonomics name example from the beginning of this post. Humans have become comfortable with .com, .org, and .net TLDs. There isn’t necessarily anything rational about people’s bias toward these TLDs, but you might as well lean into it and avoid the uphill battle. 

Another obliquely related concept is the TLS protocol, better known as the “s” in HTTPS. This isn’t part of the domain name per se, but in the eyes of the savvy internet user, that “s” is a real signal (in fact, a sign!) of safety. Launch your site with TLS (SSL) from the start.

We’ll dive into HTTPS a bunch more in the next post in this series covering how to give your Cloudfront distribution a proper domain name with Route 53 (and how to make it secure with HTTPS).

For now, I hope this gave you some things to consider when choosing your domain name. Take the time to do this right the first time. NOBODY likes domain migrations.

Does Cloudfront impact SEO? Let’s set it up for an S3 static site and test it!

This is the fifth post in my series about static site SEO with AWS. In the last post, we uploaded the static site to Amazon S3 so that it’s publicly accessible on the web. In this post, we’ll move beyond static site basics and start to discuss how Cloudfront CDN impacts load speeds and SEO. Yay! We’re finally going to start talking about SEO!

This post will focus on a HUGE factor for SEO: speed. We’ll first unpack a few acronyms and then we’ll talk about how to make your website fast with Cloudfront CDN. Luckily it’s all pretty straightforward, but as with all things SEO, the devil’s in the details. Let’s get started!

How does Cloudfront CDN work?

It’s all about speed. Cloudfront is a Content Delivery Network (CDN) service offered by AWS. If you are unfamiliar with what a CDN does or how it works, the concept is pretty simple even if the technology is pretty advanced.

Cloudfront Edge Locations
Image from: https://aws.amazon.com/cloudfront/features/

Simply put: CDNs distribute your content from your web server (or in our case, an S3 bucket located in Ohio) to multiple other servers around the world called “edge locations.” These edge locations cache (store a copy of) your content so that it’s more readily available in different areas of the world.

This way, when someone in Tokyo, Japan requests your website, the request doesn’t have to travel all the way to the S3 bucket in Ohio. Instead, the CDN intelligently routes it to an edge location in Tokyo. This shortens the distance and reduces latency, which means your website loads faster all over the world!
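To put rough numbers on that intuition, here’s a back-of-the-envelope sketch. The distances and the fiber signal speed are assumptions for illustration, not measurements, and the estimate ignores routing hops, TLS handshakes, and server time, so real-world numbers will be higher:

```python
# Back-of-the-envelope latency estimate: why edge locations matter.
# Assumption: signals travel through fiber at roughly 2/3 the speed of
# light in a vacuum. Distances below are approximate great-circle values.

SPEED_OF_LIGHT_KM_S = 300_000                     # km/s in a vacuum
FIBER_SPEED_KM_S = SPEED_OF_LIGHT_KM_S * 2 / 3    # ~200,000 km/s in fiber

def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip time in milliseconds over the given distance."""
    return 2 * distance_km / FIBER_SPEED_KM_S * 1000

tokyo_to_ohio_km = 10_500   # Tokyo -> S3 bucket in Ohio (assumed distance)
tokyo_to_edge_km = 25       # Tokyo -> a nearby edge location (assumed)

print(f"Tokyo -> Ohio:  {round_trip_ms(tokyo_to_ohio_km):.1f} ms minimum")
print(f"Tokyo -> edge:  {round_trip_ms(tokyo_to_edge_km):.2f} ms minimum")
```

Even in this best-case physics estimate, the cross-Pacific round trip costs on the order of 100 ms per request, while a nearby edge location is effectively free.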

How does Cloudfront CDN improve SEO?

Speed matters, but CDNs impact more than raw load times when it comes to SEO. Search engines want to provide their users with relevant web content and an overall great experience. People hate slow websites, so Google factors speed into its rankings to ensure a good experience.

There is also another, less obvious reason why search engines would favor faster websites: they have to crawl them. For Google, time is money when it comes to the cost of running the servers that crawl the web. A slow website literally costs them more to crawl than a fast one! That’s why CDNs and caching matter. (We’ll get to caching in a future post.)

Search engine bots and crawling

There is also a third SEO benefit that comes from using a CDN. This is a bit of an advanced use case, but if your site does a lot of JavaScript client-side rendering, you can employ a CDN to deliver server-side rendered (SSR) pages to search engine bots. This reduces the amount of time (and money) that search engines have to spend crawling your pages.

Server-side rendering also means that a search engine doesn’t have to (or be able to) render JavaScript just to parse your site’s content. That is a relatively expensive thing for a search engine to do. The benefit is that, since the search engine doesn’t have to spend so much effort to crawl and render your content, you will likely see a higher crawl rate, which means you’ll have fresher content in search engine indexes. That’s great for SEO, especially for really large and dynamic sites. To do that, you’d use a CDN that offers edge functions, like Cloudfront’s Lambda@Edge or Cloudflare Workers.
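As a rough illustration of the idea, here is a sketch of a Lambda@Edge-style viewer-request handler that routes search engine bots to pre-rendered pages. The event shape follows Cloudfront’s Lambda@Edge request events; the `/prerendered/` path and the bot list are hypothetical and would depend on your own setup:

```python
# Sketch: route search engine bots to pre-rendered (SSR) snapshots.
# Assumption: SSR snapshots live under a /prerendered/ path at the origin.

BOT_SIGNATURES = ("googlebot", "bingbot", "duckduckbot", "yandexbot")

def is_bot(user_agent: str) -> bool:
    """Crude User-Agent sniffing; real deployments use more robust lists."""
    ua = user_agent.lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def handler(event, context):
    # Lambda@Edge viewer-request events carry the request here:
    request = event["Records"][0]["cf"]["request"]
    headers = request.get("headers", {})
    ua_entries = headers.get("user-agent", [])
    user_agent = ua_entries[0]["value"] if ua_entries else ""
    if is_bot(user_agent):
        # Serve the pre-rendered snapshot instead of the JS-heavy page.
        request["uri"] = "/prerendered" + request["uri"]
    return request
```

Regular visitors still get the normal client-rendered page; only requests whose User-Agent matches a known crawler are rewritten.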

If you want to learn more about deploying Cloudfront for SEO check out this presentation.

But for our purposes, we are mostly concerned with the flat-out speed of content delivery. So let’s take a look at how a CDN improves speed.

Cloudfront CDN Example Speed Test 

In case you are like me and you don’t believe everything you read, here are side-by-side Pingdom website speed tests to observe the effects of using the Cloudfront CDN. All tests were run from Tokyo, Japan. The first test requests the new site’s homepage directly from the S3 bucket in Ohio; the later tests request the site after I’d deployed it on Cloudfront.

Test #1: From Japan to S3 bucket in Ohio

Pingdom test #1

Test #2: From Japan to nearest edge location when the Cloudfront Price Class was set to “Use Only U.S., Canada and Europe”

Pingdom test #2

Test #3: From Japan to nearest edge location when the Cloudfront Price Class was set to “Use All Edge Locations (Best Performance)” (so probably also Japan)

Pingdom test #3

I’m not sure why Pingdom didn’t render this last one…

In each of these tests, the most significant difference was in each request’s “Wait Time.” Pingdom’s wait time is a measure of Time to First Byte (TTFB), which is simply how long it takes for the first byte of the requested resource to reach the browser. That’s a pretty critical metric considering that resources like JavaScript and CSS depend on the initial HTML document loading first.

Load time waterfall chart
Waterfall chart from Test #1

Here are the TTFBs for the HTML document in each test:

  • Test #1 From Japan to S3 bucket in Ohio: 210 ms
  • Test #2: From Japan to the closest location in “U.S., Canada and Europe”: 114 ms
  • Test #3: From Japan to the closest global location (Japan): 6 ms!!
Cloudfront edge location speed test

As we can see, TTFB grows with the distance the request has to travel. CDNs FTW!
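If you’d rather measure from the command line than via Pingdom, curl’s `--write-out` option can report TTFB (`time_starttransfer`) directly. A sketch, with placeholder URLs you’d replace with your own endpoints:

```shell
# Measure TTFB and total load time with curl.
# Replace the placeholders with your S3 website endpoint and Cloudfront URL.
curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
  'http://<your-bucket-name>.s3-website.<region>.amazonaws.com/'
curl -o /dev/null -s -w 'TTFB: %{time_starttransfer}s  total: %{time_total}s\n' \
  'https://<your-distribution-id>.cloudfront.net/'
```

Note that curl measures from wherever you run it, so unlike Pingdom you can’t pick the test location; it’s still handy for quick before/after comparisons.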

Hopefully, this is enough to convince you that using a CDN is a great idea. Even if this test doesn’t directly prove an improvement in rankings, you can bet that your website’s audience will appreciate the decreased latency and improved load times.

Now let’s walk through setting up Cloudfront to deliver our static site hosted on S3.

Setting up Cloudfront CDN for a Static Site hosted on S3

Note: This configuration allows public read access on your website’s bucket. For more information, see Permissions Required for Website Access.

In the last post, we got a static site loaded to S3, so this post assumes you completed that. Otherwise, head back to the previous post and get your site loaded to S3.

S3 Static website hosting endpoint

NOTE #1: It is really important that you use the static website hosting endpoint shown above for the next steps. That’s the one that looks like <your-bucket-name>.s3-website.<your-bucket’s-region>.amazonaws.com. It will matter again in a later post.

NOTE #2: You should have already set up public read access for the bucket in the last post.

  • In the Cloudfront console, start creating a new distribution and enter the static website hosting endpoint as the Origin Domain Name.
  • Leave the Origin Path blank. Your website should be in the root directory of your bucket.

NOTE #3: Don’t worry about SSL (HTTPS) or Alternate Domain Names for now. We’ll come back to those in the next post.

  • For the Viewer Protocol Policy, select “Redirect HTTP to HTTPS” because it’s cool. We’ll get back to that later too.
  • Choose a Price Class that fits your needs. For low-traffic websites each option will cost pennies, so pick one based on the geographies for which you want to optimize load times.
  • Leave all the other settings at their defaults.
  • Choose Create Distribution.
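If you prefer the command line, the console steps above can be approximated with a single AWS CLI call. This is a sketch, assuming the AWS CLI is installed and configured with credentials; substitute your bucket’s static website hosting endpoint for the placeholder:

```shell
# Create a Cloudfront distribution pointing at the S3 website endpoint.
# Other settings (price class, viewer protocol policy) take their defaults
# and can be adjusted afterwards in the console or with update-distribution.
aws cloudfront create-distribution \
  --origin-domain-name <your-bucket-name>.s3-website.<region>.amazonaws.com \
  --default-root-object index.html
```

The command prints a JSON description of the new distribution, including the `DomainName` you’ll use for testing below.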

Now just sit back and wait! Cloudfront will propagate your content out to the edge locations you selected via the Price Class.

Your website will soon be available via a Cloudfront URL that looks something like https://d1ix3q7vxbh8zd.cloudfront.net/

Speed Testing your Cloudfront Distribution

Want to run the tests mentioned above? 

  1. Go over to https://tools.pingdom.com/ 
  2. Enter the URL for your static site at your public S3 endpoint (<your-bucket-name>.s3-website.<your-bucket’s-region>.amazonaws.com)
  3. Then try it with your new Cloudfront URL (https://<blah-blah-blah>.cloudfront.net/).
  4. Play around with Cloudfront Price Classes and Pingdom locations to see how the CDN’s edge locations impact TTFB and load times.

Moving forward

I hope you now have the tools to understand why CDNs impact SEO and how to set one up. If you have any questions, please leave them in the comment section below.

In the next post, we will finally move the website to its own custom domain name with HTTPS!