Analyzing XML Sitemap Files with Bash

I’ve been spending a lot of time working with sitemaps lately. They are a critical component of getting sites indexed by search engines, and they are a great way to learn about your competition’s architecture and search strategy. I mean, they literally map out the content that your competition is trying to add to search engine indexes and where they are trying to attract traffic.

The problem is, when sitemaps are larger than a few thousand URLs, or when they are not pretty-printed, or when index files and gzip compression get involved, the task of reading a sitemap becomes either slow, or manual and really slow. So… Bash! Crashing to the ground like an ancient and overly cryptic superhero, er… programming language, Bash is here to face off with its perfect use case.

If you are intrigued but unfamiliar with the Linux shell, aka “the terminal,” this is a perfect way to get your hands dirty and decide if it’s something you want to become more familiar with.

This post will introduce the syntax and a few powerful commands that will allow you to download, parse, and analyze sitemap files with Bash.

A Background on Bash

Bash (the Bourne Again SHell) has several welcoming characteristics. What JavaScript is to web browsers, Bash is to Unix-like operating systems such as Linux and macOS: a standard language that ships with the platform. This means you don’t have to worry about installing it; it’s already in the box!

Speed is also a big upside. The language itself is fast, and since you don’t have to write a lot of code, it’s possible to get a lot done quickly. It’s far from being the Swiss Army knife that Python is, but with the right use case, Bash is as powerful as its name sounds like it should be.

On the flip side, the worst thing about Bash, compared to Python and other modern languages, is that the lack of resources makes it hard to learn. “Man(ual)” pages are overly terse and sometimes feel like they were written in their own language. Hopefully, this intro will be a little more engaging.

Bash also has limited use cases. It’s great for working with file systems and text files, but it wouldn’t be my first choice for assembling anything that’s very interactive. But now that we have our use case, let’s Bash!

The Short of It

In two lines, we are going to request a sitemap index file and write every sitemap URL into a text file.

curl https://www.example.com/sitemap-index.xml | \
grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' > sitemaps.txt

We can then request those sitemap files and analyze the URLs listed in them:

curl https://www.example.com/some-sitemap.xml | \
gunzip | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | \
grep -e <filter pattern> | sort

With Python, this would take several libraries, more time reading docs, and more code to accomplish the same thing. With Bash, we only need six commands: curl, grep, gunzip, sed, sort, and uniq.

Bash programs are assembled like Legos (or plumbing, if that suits you better). So in order to understand how this all works, let’s take these programs apart and put them back together.

The Long of It

As you can see, Bash code is really terse. There’s a lot going on with little instruction, and little explanation. But be patient and keep reading, and this should all make sense.

Piping | data

You may have noticed a pattern: each line is broken up into pieces by the pipe character, “|”. “Piping” conveniently does what it sounds like it does. It sends data along a pipeline of programs, each of which manipulates that data and passes it on. More specifically, pipes send the output of one program (stdout), line by line, into the input (stdin) of the next program. To demonstrate, try this:

ls | head -n 3

That line of code says: list the contents of the current directory and pipe the listing to the head command, which outputs only the first three lines of its input. Similarly:

cat my-file.csv | sort > my-sorted-file.csv

That line of code says: read the contents of my-file.csv and pipe it to the sort command, which sorts the lines of the file alphabetically. The angle bracket (>) at the end of the line means, “put the output of these commands into a file named my-sorted-file.csv.” It’s all pretty simple, and it will make more sense as we rebuild the scripts above.
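One related detail worth knowing: > overwrites the target file each time, while >> appends to it. A quick sketch (the file name here is just an example):

ls > file-list.txt
ls >> file-list.txt

After those two commands, file-list.txt contains the directory listing twice: the first line created the file, and the second tacked another copy onto the end.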

cURL and gunzip to GET and unzip Sitemap Files

Most of what the two scripts do is text processing and filtering, but the first step is getting the sitemap files. You’ve probably come across curl if you’ve ever read through REST API documentation. It is a short command with a lot of options but, in our case, we can keep it simple and use curl as is. To ease into the curl command, let’s use it to find the location of a website’s sitemap.

Most sites list the URL of their sitemap in their robots.txt file. Making an HTTP GET request to a site’s robots.txt file is simple:

curl https://www.example.com/robots.txt

Now is also a good time to introduce grep. You can use grep to skip scanning through the robots.txt file and quickly find the sitemap URL. Just pipe the output of the robots.txt request to grep, filtering for the regular expression pattern “sitemap:”

curl https://www.example.com/robots.txt | grep -i -e 'sitemap:'

The result of this command should be the line of the robots.txt file that lists the location of the sitemap. Note that the -i flag makes the regular expression match case-insensitive. That’s useful because some sites start the line with a capital “S” in “Sitemap.”
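You can convince yourself that -i is doing its job by piping a made-up robots.txt line (this example line is hypothetical) through the same grep:

printf 'Sitemap: https://www.example.com/sitemap-index.xml\n' | grep -i -e 'sitemap:'

Without the -i flag, that line would be filtered out because of the capital “S.”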

To get more comfortable with curl, try it with the URL of this page and you will see that it prints a mess of HTML markup to the terminal window. (See below for pretty-printing XML.)

Many sites with really large sitemaps use gzip compression to reduce the amount of time it takes to download them. Using the curl command to get gzipped sitemaps doesn’t work too well on its own; it just prints a mess of ugly binary to the terminal. Luckily, there is a standard command, gunzip, to unzip compressed sitemap files. To read a zipped sitemap, just pipe the zipped response to gunzip and get readable XML back.

curl https://www.example.com/sitemap.xml | gunzip
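One caveat: gzip shows up in two different ways. If the server compresses the response on the fly (Content-Encoding: gzip), curl’s --compressed flag will decompress it for you. If the sitemap is published as a .xml.gz file, you still need gunzip. Roughly:

curl --compressed https://www.example.com/sitemap.xml
curl https://www.example.com/sitemap.xml.gz | gunzip

If one approach gives you binary garbage, try the other.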

But what happens if the output is not pretty-printed XML?

There is a handy command-line tool for formatting XML called xmllint:

curl https://www.example.com/sitemap.xml | gunzip | xmllint --format -

Finally, to avoid making tons of requests to the same sitemap while you’re learning, you can store the sitemap file on your computer by sending the output to a file:

curl https://www.example.com/sitemap.xml | gunzip > saved-sitemap.xml

Now, instead of requesting the sitemap every time, you can just cat the file, which outputs every line of a file from top to bottom, like so:

cat saved-sitemap.xml | head

That will show the first ten lines of the saved file. Or use tail for the last ten lines.

Parsing the XML and Finding URLs

Hopefully, piping data from program to program is a little more comfortable now, because this is where we get to start taking advantage of the better parts of Bash.

XML sitemaps follow a strict format, which makes them easy to parse. But there is a lot of extra text in them that is not particularly interesting. To find the part that interests us, the URLs, we will use grep, which sounds like a burp but is actually used to filter a stream that is piped to it.

curl https://www.example.com/sitemap.xml | grep -e <regular expression filter pattern>

Grep reads each line of input and, if the line matches the regular expression pattern after the -e flag, passes it along, filtering the contents of the input stream. We find all <loc> XML tags by filtering out everything that doesn’t match the expression ‘loc’.

curl https://www.example.com/sitemap.xml | grep -e 'loc'

This is where things get interesting.

The sed command is short for Stream EDitor. Like other Bash commands, it operates on one line at a time. Sed can be a very powerful tool for rewriting file contents, but for our purposes, it will extract sitemap URLs from the <loc> elements filtered in by the grep command.

curl https://www.example.com/sitemap.xml | grep -e 'loc' | sed 's|<loc>\(.*\)<\/loc>|\1|'

This chain of commands gets the sitemap, filters in all <loc> elements, then extracts the URL from each <loc> element, leaving only a list of URLs, one per line.

Sed lets you use regular expressions to determine what part of a line you want to keep or replace. A sed substitution has four parts: the s command, the regex to match, the replacement, and optional flags, each separated by a delimiter, in this case pipes ( | ). Note that the delimiter can be almost any character, like colons or commas, but I like to use pipes because they look cleaner when working with URLs.

The command works like this: sed 's|<regex to match>|<replacement>|<flags>'. Sed captures anything between the escaped parentheses \( and \) and stores it in the escaped \1 used in the replacement. So in sed 's|<loc>\(.*\)<\/loc>$|\1|', everything (the .*) between <loc> and </loc> is captured, and the whole line is replaced with the captured text, which is the URL.
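You can watch the substitution happen by feeding sed a single made-up <loc> line:

echo '<loc>https://www.example.com/page-1</loc>' | sed 's|<loc>\(.*\)<\/loc>$|\1|'

That prints just https://www.example.com/page-1, the captured group.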

That leaves the command that we started with:

curl https://www.example.com/some-sitemap.xml | grep -e loc | \
sed 's|<loc>\(.*\)<\/loc>$|\1|g' | grep -e <pattern> | sort

With that Bash code, you will get all the URLs in a sitemap, sorted alphabetically. Note that the \ at the end of a line is used to continue the commands onto the next line. Add | wc -l to the end of the line to get the number of lines in the output of those commands.
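For example, to count how many URLs a sitemap contains:

curl https://www.example.com/some-sitemap.xml | grep -e loc | \
sed 's|<loc>\(.*\)<\/loc>$|\1|g' | wc -l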

Now you have the tools to analyze not only large sitemap files but all kinds of large text files!

My interest in Bash has grown because it makes it so easy to compose commands together to get and store the results that you need. You now know all the basic building blocks for analyzing a sitemap, so try it for yourself. Try using grep to see what, and how many, files live in certain directories. Use sed to extract the first directory from each URL and pipe that to uniq to see how many first-level directories there are; a sketch of that exercise follows below. Any question that you have about the contents of a sitemap, you can now answer with Bash, and fast! Good luck!
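Here is one way to do that last exercise, assuming you saved a sitemap to saved-sitemap.xml earlier and that its URLs look like https://www.example.com/<directory>/<page>. Note that uniq -c only counts adjacent duplicates, hence the sort before it:

cat saved-sitemap.xml | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|' | \
sed 's|https://www.example.com/\([^/]*\)/.*|\1|' | sort | uniq -c | sort -rn

Each line of the output is a count followed by a first-level directory name, biggest directories first.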

There are a couple great books out there if you want to learn more about Bash as a data analysis tool:

Bash Cookbook and Data Science at the Command Line
