This is a quick post to provide context for the SEO A/B test-curious people out there. I was prompted by a thread in Measure Slack and figured long-form would make more sense. I didn’t want to make this another hideously long SEO-ified post but rather get to the point quickly. Here’s the post and then I’ll dive into my thoughts about SEO A/B testing.
After writing this, I realized I didn’t actually address statistical significance, but as you’ll see, if you’re running SEO experiments that depend on a fine margin of statistical significance, your time is probably better spent elsewhere. Read on to see what I mean.
How is it different from a regular A/B test?
SEO A/B tests differ from normal A/B tests (like Optimizely or Optimize) in two major ways: implementation and measurement.
There are tools out there for running server-side A/B tests but none are remotely as simple as Google Optimize—they all require server-side changes. That said, SEO A/B testing frameworks are not terribly complex to code. A typical testing framework takes the identifier of a page (for example, the product ID or the integrations slug in the case of a website like Zapier) and applies a variant-assignment algorithm to the page. This could be as simple as checking whether a product ID ends in an odd or even number and applying A and B variants that way, or as complex as a string hashing function plus a modulo operator that returns a 0 or a 1 to assign the A and B variants. In any case, this is, at the end of the day, a substantial product feature. See how Pinterest runs tests.
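To make that concrete, here’s a minimal sketch of a deterministic assigner (the function names and the rolling hash are my own illustration, not any specific framework’s code):

```javascript
// Illustrative sketch: derive a stable A/B bucket from a page identifier.
// The hash and function names here are hypothetical, not from a real framework.
function hashString(s) {
  let h = 0;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return h;
}

function assignVariant(pageId) {
  // Modulo the hash to split pages roughly 50/50 into A and B.
  return hashString(pageId) % 2 === 0 ? "A" : "B";
}
```

Because the bucket is derived from the page identifier itself, assignment is stable across requests and deploys, with no cookies or stored state needed.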
On the measurement side of things, either you’re using a proper server-side A/B testing tool with measurement capabilities or you have to go out of your way to track the results in your own tool. If you go the “roll your own” route, the same A/B assignment logic that determines the page treatment needs to be passed along to your web analytics tool. A simple way to do this is to set a variable in the dataLayer and use Google Tag Manager to assign a Content Grouping (A or B) to the page in Google Analytics. Content Groupings are a better choice than hit-level custom dimensions because Content Groupings apply to landing pages by design.
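For the “roll your own” route, the handoff might look something like this (the event and variable names are hypothetical; use whatever your GTM container expects):

```javascript
// Hypothetical sketch: push the assigned variant into the dataLayer before
// GTM loads, so a GTM Data Layer Variable can read `seoTestGroup` and set
// the Content Grouping in Google Analytics.
function pushSeoTestGroup(variant) {
  globalThis.dataLayer = globalThis.dataLayer || [];
  globalThis.dataLayer.push({ event: "seoTest", seoTestGroup: variant });
  return globalThis.dataLayer;
}

// e.g. in the page template: pushSeoTestGroup(assignVariant(pageId));
```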
Here’s an example of an A/B test that had no effect. Can you tell when it starts? When it ends? Who knows—that’s how you know the test was not effective!
Interesting note on this experiment: the treatment was Schema.org FAQ schema across ~8k pages. Google decided to recognize the schema on only 200 of them, making it impossible to detect an effect… and a waste of time to implement at scale if the change wasn’t going to have a tangible effect.
Random assignment and Scale
If you’re thinking about running an SEO A/B test, random assignment and scale are two things you must consider from the outset. Just like browser-based A/B tests, you need to be able to trust that your groups are assigned randomly and guarantee that you will see enough traffic to produce valid results. I addressed random assignment a bit above, and it’s not that hard to account for. It’s detecting a test’s effect that creates a challenge.
The added layer of complexity in an SEO test is crawling and indexing. Because SEO tests are meant to effect a lift in rankings or CTR, you have to be positive that Google actually indexes your changes before they can take effect. Some pages don’t get crawled frequently and some, when they do, take forever to get re-indexed. This means there will be a lag before you see results–the duration of that lag depends on your site’s crawl rate and Google’s opinion about how frequently it wants to reindex your pages.
This means that scale matters a lot. If you know you will have to live with some degree of imperfection in your test, you have to overcome that with scale. By scale, I mean lots and lots of pages. The more pages you have and the more traffic they get, the more clearly you will be able to see your two time-series plots diverge as the pages are crawled, re-indexed, and the changes take effect.
You’re probably asking, “how many pages do I need?” Well, I don’t have any science behind this but I would say 1,000+ as a bare minimum. And those 1k pages had better have lots of traffic, for two reasons: without it, it will be harder to attribute any changes to the test versus randomness, and SEO A/B testing is a relatively high lift, so a cost/benefit (potential) analysis is imperative before getting started.
All that said, I’d be pretty confident in a 50/50 split across 10k pages. If you don’t have 10k pages, or even 1k, you’re probably better off developing your programmatic SEO to reach that scale. VS pages, Integrations pages, Category list pages, and Locations pages are all good ways to get that page count up (and building them will have a bigger effect than all the optimizations after the fact).
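If you want a rough gut-check on whether your page set is big enough, here’s a crude back-of-envelope I’d reach for (entirely my own illustration, not science; it treats daily session counts as Poisson-ish, which badly understates real day-to-day traffic variance, so treat the answer as an optimistic lower bound):

```javascript
// Crude estimate of how many days of data you'd need before a lift in daily
// organic sessions between the A and B arms clears a ~95% z threshold.
// Assumes daily counts are Poisson-ish (variance ≈ mean), which is optimistic.
function daysToDetect(sessionsPerDayPerArm, relativeLift, z = 1.96) {
  const a = sessionsPerDayPerArm;
  const b = a * (1 + relativeLift);
  const diffPerDay = b - a;
  // Variance of the daily A-vs-B difference ≈ a + b under Poisson counts.
  const sdPerDay = Math.sqrt(a + b);
  // Detect when cumulative diff exceeds z * sd of the cumulative noise:
  // diff * d > z * sd * sqrt(d)  =>  d > (z * sd / diff)^2
  return Math.ceil(Math.pow((z * sdPerDay) / diffPerDay, 2));
}
```

The useful takeaway isn’t the exact number; it’s how fast the required duration blows up as per-page traffic shrinks, which is the whole “scale matters” point.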
Tracking and Goals
I mentioned some thoughts on tracking page variants in Google Analytics in the first section. That part is a technical problem. The goals part is a business problem.
Generally speaking, an SEO A/B test should focus on traffic. Why? Because in most cases, an SEO test will have the biggest impact on traffic and less of an impact on, say, the persuasiveness or conversion rate of a set of pages. Sure, you could run title tag tests that drastically change the keyword targeting or click-through intent, but it’s usually safe to say that you will simply start getting more (or less) of the same kind of traffic, and you can assume that traffic should convert to leads, revenue, etc. at the same rate.
Another argument for traffic is that changes in organic traffic volumes are going to be affected by fewer variables than revenue. The further the goal is from the lever you’re testing, the more data you have to collect to be sure that the test is actually what is causing the effect.
High impact tests
Finally, if you’ve made it this far you’re probably wondering about some test ideas that you have. Here is how I think about prioritizing SEO tests.
First, think about treatments that are high-SEO-impact. For me, title tags and meta descriptions are at the top of the list because, even if you aren’t able to affect rankings, they can have a significant impact on click-through rates. Another upside is that you will be able to see the effects of your test right on the search results pages. An “intitle: <my title tag template string>” search in Google will give you a sense of how many of your pages have been indexed: something you can check daily to see how Google is picking up your changes.
Second, consider Schema.org changes because those can also have a big impact on SERP CTR. The downside, I’ve learned, is that FAQ schema has the potential to actually hurt CTR if the searcher can answer their question right on the search page and never click through. Most other types of structured data that are reflected in the SERP will have a positive effect. For example, try omitting the star rating schema if the product is rated less than 3/5.
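As a sketch of what that conditional omission could look like in practice (the helper itself is hypothetical; the property names follow Schema.org’s Product and AggregateRating types):

```javascript
// Build AggregateRating JSON-LD only when the stars are worth showing;
// below the 3/5 threshold, return null and render no rating schema at all.
function ratingSchema(product) {
  if (product.rating < 3) return null;
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": product.name,
    "aggregateRating": {
      "@type": "AggregateRating",
      "ratingValue": product.rating,
      "reviewCount": product.reviewCount
    }
  };
}
```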
H1 tags, copy blocks, and page organization are other options, but be careful with these because they will affect the page’s UX. Copy blocks are likely to have the biggest effect in search because they can broaden a page’s keyword exposure.
At some point you really have to ask yourself: is this test actually better as a browser-based test? Is the change obvious enough that you should just make it rather than test it? Is this whole thing worth it, or are there better things I could be spending resources on? (That last one is a good one!)
Ok, I hope that helped you gain a little more of a grasp on SEO A/B testing. It was a little bit of a barf post but hey, I have an 8-month-old baby to watch after these days!