Analyzing XML Sitemap Files with Bash

I’ve been spending a lot of time working with sitemaps lately. They are a critical component of getting sites indexed by search engines and a great way to learn about your competition’s architecture and search strategy. They literally map out the content that your competition is trying to add to search engine indexes and where they are trying to attract traffic.

The problem is that when sitemaps are larger than a few thousand URLs, or when sitemaps are not pretty, or when index files and gzip compression get involved, the task of reading a sitemap becomes either slow or manual and really slow. So… Bash! Crashing to the ground like an ancient and overly cryptic superhero, er… programming language, Bash is here to face off with its perfect use case.

If you are intrigued but unfamiliar with Linux Shell, aka “the terminal,” this is a perfect way to get your hands dirty and decide if it’s something you want to become more familiar with.

This post will introduce the syntax and a few powerful commands for downloading, parsing, and analyzing sitemap files with Bash.

Want an easier way to analyze sitemaps? Check out my XML Sitemap Analyzer!

A Background on Bash

BASH (the Borne Again SHell) has a several welcoming characteristics. Like Javascript is to web browsers; Bash is a standard programming language to Linux operating systems like Mac OS and Arduino. This means you don’t have to worry about installing it— its already in the box!

Speed is also a big upside. The language itself is fast, and since you don’t have to write a lot of code its possible to get a lot done quickly. Its far from being the Swiss Army knife that Python is, but with the right use case, Bash is as powerful as its name sounds like it should be.

On the flip side, the worst thing about Bash, unlike Python and other modern languages, is the lack of resources make it hard to learn. “Man(ual)” pages are overly terse and sometimes feel like a they were written in their own language. Hopefully, this intro will be a little bit more engaging.

Bash also has limited use cases. It’s great for working with file systems and text files but it wouldn’t be my first choice for assembling anything thats very interactive. But now that we have our use case. Let’s Bash!

The Short of It

In two lines, we are going to request a sitemap index file and write every sitemap URL in a text file.

curl | \
grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' > sitemaps.txt

We can then request those sitemap files and analyze the URLs listed in them

curl | \
gunzip | grep -e loc|\  sed 's|<loc>\(.*\)<\/loc>$|\1|g' | \
grep -e <filter pattern> | sort

With Python, accomplishing the same thing would require several libraries, more time reading docs, and more code. With Bash, we only need six commands: curl, grep, gunzip, sed, sort, and uniq.

Bash programs are assembled like Legos (or plumbing, if that suits you better). So, to understand how this all works, let’s take these programs apart and put them back together.

The Long of It

As you can see, Bash code is really terse. There’s a lot going in with little instruction— and little explanation. But be patient and keep reading, and this should all make sense.

Piping | data

You may have recognized a pattern; each line is broken up into pieces by the pipe character “|.” “Piping” conveniently does what it sounds like it does. It sends data along a pipeline of programs, which then manipulate that data and pass it on. More specifically, they send the output of a program (Stdout), line by line, into the input (Stdin) of the next program. To demonstrate this, try this:

ls | head -n 3

That line of code says to list the contents of the current directory and pipe it to the head command only to output the first three lines of the input. Similarly:

cat my-file.csv | sort > my-sorted-file.csv

That line of code says to read the contents of the my-file.csv and pipe it to the sort command to sort the lines of that file alphabetically. The angle bracket (>) at the end of the line means, “put the output of these commands into a file named “my-sorted-file.csv.” It’s all pretty simple, and it will make more sense as we build the scripts above.

cURL and gunzip to GET and unzip Sitemap Files

The two scripts do most of the text processing and filtering, but the first step is getting the sitemap files. If you’ve ever read through REST API documentation, you’ve probably encountered curl. It is a short command with a lot of options, but in our case, we can keep it simple and just use curl as is. To ease into the curl command, let’s use it to find the location of the website’s sitemap.

Most sites list the uRL of their sitemap in their robots.txt file. Making a HTTP GET request to a site’s robots.txt file is simple:


Now is also a good time to introduce grep. You can use grep to skip scanning through the robots.txt file to find the sitemap URL quickly. Just pipe the output of the robots file to grep filtering for the regular expression pattern, “sitemap:”

curl | grep -e -i ‘sitemap:’

The result of this command should be the line of the robots.txt file that lists the location of the sitemap. Note that the grep -i means that the regular expression pattern match can be case-insensitive. That’s useful because some sites will start the line with a capital “S” in a sitemap.

To get more comfortable with curl, try curl with the URL of this page and you will see that it prints a mess of HTML markup to the terminal window. See below for pretty-printing XML.

Many sites with really large sitemaps use gzip compression to reduce the amount of time it takes to download them. Using the curl command to get gzipped sitemaps doesn’t work too will— it will just print a mess of ugly binary code to the terminal Luckily, BASH has a built in command, gunzip, to unzip compressed sitemap files. To unzip zipped sitemap files just pipe the zipped response to gunzip and get the readable XML.

curl | gunzip

But what happens if the output is not pretty printed xml?

Bash has a built-in for formatting XML called xmlllint:

curl | gunzip | xmllint --format -

Finally, to avoid making tons of requests to the same sitemap while you’re learning, you can store the sitemap file in on your computer by sending the output to a file:

curl | gunzip > saved-sitemap.xml

Now, instead of requesting the sitemap every time, you can just ‘cat’ the file which outputs every line of a file from top to bottom like so:

cat saved-sitemap.xml | head

That will show the first ten lines of the saved file. Or use tail for the last ten lines.

Parsing the XML and Finding URLs

Hopefully, piping data from program to program is a little more comfortable now, because this is where we get to start taking advantage of the better parts of Bash.

XML sitemaps follow a strict format which makes it easy to parse them. But there is a lot of extra text in them that is not particularly interesting. To find the part that is of interest to us, the URLs, we will use grep which sounds like a burp, but is actually used to filter a stream that is piped to it.

curl | grep -e <regular expression filter pattern>

Grep reads each line of input and, if it matches the regular expression pattern after the -e, then it passes it along, filtering the contents of the input stream. We find all <loc/> XML tags by filtering out everything that doesn’t match the the expression, ‘loc’.

curl | grep -e ‘loc’

This is where things get interesting.

The sed command is short for Stream EDitor. Like other Bash commands, it operates on one line at a time. Sed can be a very powerful tool for rewriting file contents, but for our purposes, it will extract sitemap URLs from the <loc/> elements filtered in by the grep command.

curl | grep -e ‘loc’ | sed 's|<loc>\(.*\)<\/loc>|\1|'

This chain of commands gets the sitemap, filters in all <loc> elements, then extracts the URL from the <loc> elements leaving only a list of URLs– one per line.

Sed allows you to use regular expressions to determine what part of a line you want to keep or replace. There are five parts to sed expressions separated by, in this case, pipes ( | ). Noe that pipes could be any character like colons or backslashes but I like to use pipes because it looks cleaner when working with URLs.

The command works like this sed '<sed mode>|<regex to extract>|<regex to replace>|<flags>' Sed extracts anything between the escaped parenthesis \( and \) and stores them in the escaped \1 in the regex to replace in the command, “sed ‘s|<loc>\(.*\)<\/loc>$|\1|’. So everything, aka dot star between the <loc> and the <\loc> is extracted and replaced the extracted text (which is the URL)

That leaves the command that we started with:

curl | grep -e loc | \

sed 's:<loc>\(.*\)<\/loc>$:\1:g' | grep -e <pattern> | sort

With that Bash code, you will get all the URLs in a sitemap sorted alphabetically. Note that the \ at the end of the line is used to continue the commands to the next line. Add add | wc -l to the end of the line to get the number of lines in the output of those commands.

Now you have the tools to analyze not only large sitemap files but all kinds of large text files!

My interest in Bash has grown because it makes it so easy compose functions together to get and store the results that you need. You now know all the basic building blocks for analyzing a sitemap, so now try for yourself. Try using grep to see what, and how many, files live in certain directories. Use sed extract the first directory and pipe that to uniq to see how many first level directories there. Any question that you have about the contents of a sitemap, you can now answer with Bash— and fast! Good luck!

There are a couple great books out there if you want to learn more about Bash as a data analysis tool:

Bash Cookbook and Data Science at the Command Line

Leave a Comment

Your email address will not be published. Required fields are marked *