SEO with the Google Search Console API and Python

The thing I enjoy most about SEO is thinking at scale. Postmates is fun because sometimes it's more appropriate to size opportunities on a logarithmic scale than a linear one.

But there is a challenge that comes along with that: opportunities scale logarithmically, but I don’t really scale… at all. That’s where scripting comes in.

SQL, Bash, Javascript, and Python regularly come in handy to identify opportunities and solve problems. This example demonstrates how scripting can solve some of the challenges of having a lot of potentially mutable data.

Scaling SEO with the Google Search Console API

Most, if not all, big ecommerce and marketplace sites are backed by databases. And the bigger these places are, the more likely they are to have multiple stakeholders managing and altering data in the database. From website users to customer support, to engineers, there are a number of ways that database records can change. As a result, the content of the site grows, changes, and sometimes disappears.

It’s very important to know when these changes occur and what effect the changes will have on search engine crawling, indexing and results. Log files can come in handy but the Google Search Console is a pretty reliable source of truth for what Google sees and acknowledges on your site.

Getting Started

This guide will help you start working with the Google Search Console API, specifically with the Crawl Errors report, but the script could easily be modified to query Google Search performance data or interact with sitemaps in GSC.

To get started, clone the Github Repository: https://github.com/trevorfox/google-search-console-api and follow the “Getting Started” steps on the README page. If you are unfamiliar with Github, don’t worry. This is an easy project to get you started.

Make sure you have everything from the “Getting Started” steps in place.

Now for the fun stuff!

Connecting to the API

This script uses a slightly different method to connect to the API. Instead of using the Client ID and Client Secret directly in the code, the Google API auth flow reads these values from the client_secret.json file. This way you don’t have to modify the webmaster.py file at all, as long as the client_secret.json file is in the /config folder.

# imports needed for this snippet
import pickle

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow

# OAUTH_SCOPE (and the other constants used below) is defined elsewhere in webmaster.py
try:
    credentials = pickle.load(open("config/credentials.pickle", "rb"))
except (OSError, IOError) as e:
    flow = InstalledAppFlow.from_client_secrets_file('client_secret.json', scopes=OAUTH_SCOPE)
    credentials = flow.run_console()
    pickle.dump(credentials, open("config/credentials.pickle", "wb"))

webmasters_service = build('webmasters', 'v3', credentials=credentials)

For convenience, the script saves the credentials to the project folder as a pickle file. Storing the credentials this way means you only have to go through the Web authorization flow the first time you run the script. After that, the script will use the stored and “pickled” credentials.

Querying Google Search Console with Python

The auth flow builds the “webmasters_service” object, which allows you to make authenticated API calls to the Google Search Console API. This is where Google documentation kinda sucks… I’m glad you came here.

The webmasters_service object exposes several methods, each corresponding to one of the five resources you can query in the API. Each resource, in turn, has verb-like methods (list, get, query, markAsFixed, and so on) that indicate how you want to interact with or query it.

The script currently uses the webmasters_service.urlcrawlerrorssamples().list() method to find how many crawled URLs had a given type of error.

gsc_data = webmasters_service.urlcrawlerrorssamples().list(siteUrl=SITE_URL, category=ERROR_CATEGORY, platform='web').execute()

It can then optionally call webmasters_service.urlcrawlerrorssamples().markAsFixed(…) to note that the URL error has been acknowledged, removing it from the Crawl Errors report.
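For reference, here is a minimal sketch of that call. It assumes SITE_URL and ERROR_CATEGORY are defined as in the script above, and page_url is a hypothetical variable holding the relative “pageUrl” value from one of the samples returned by list():

webmasters_service.urlcrawlerrorssamples().markAsFixed(
    siteUrl=SITE_URL,
    url=page_url,  # the relative "pageUrl" value from a list() sample
    category=ERROR_CATEGORY,
    platform='web'
).execute()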

Google Search Console API Methods

There are five ways to interact with the Google Search Console API. Each is listed below as webmasters_service because that is the variable name of the object in the script.

webmasters_service.urlcrawlerrorssamples()

This allows you to get details for a single URL and list details for several URLs. You can also programmatically mark URLs as fixed with the markAsFixed method. *Note that marking something as fixed only changes the data in Google Search Console. It does not tell Googlebot anything or change crawl behavior.

The resources are represented as follows. As you might imagine, this will help you find the source of broken links and get an understanding of how frequently your site is crawled.

{
  "pageUrl": "some/page-path",
  "urlDetails": {
    "linkedFromUrls": ["https://example.com/some/other-page"],
    "containingSitemaps": ["https://example.com/sitemap.xml"]
  },
  "lastCrawled": "2018-03-13T02:19:02.000Z",
  "firstDetected": "2018-03-09T11:15:15.000Z",
  "responseCode": 404
}
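To pull those same details for a single URL, call get() on the same resource. A sketch, again assuming page_url holds the relative “pageUrl” path:

sample = webmasters_service.urlcrawlerrorssamples().get(
    siteUrl=SITE_URL,
    url=page_url,            # e.g. "some/page-path"
    category=ERROR_CATEGORY,
    platform='web'
).execute()
print(sample.get('responseCode'), sample['urlDetails']['linkedFromUrls'])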

webmasters_service.urlcrawlerrorscounts()

Querying this resource returns the day-by-day counts you would need to recreate the chart in the URL Errors report.
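A minimal sketch of that query, using the same SITE_URL and ERROR_CATEGORY as above (latestCountsOnly=False asks for the full time series rather than just the most recent counts):

counts = webmasters_service.urlcrawlerrorscounts().query(
    siteUrl=SITE_URL,
    category=ERROR_CATEGORY,
    platform='web',
    latestCountsOnly=False
).execute()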

Crawl Errors

webmasters_service.searchanalytics()

This is probably what you are most excited about. This allows you to query your search console data with several filters and page through the response data to get way more data than you can get with a CSV export from Google Search Console. Come to think of it, I should have used this for the demo…

The response looks like this, with a “row” object for every record, depending on how you queried your data. In this case, only the “device” dimension was used, so there would be three “rows,” each corresponding to one device type.

{
  "rows": [
    {
      "keys": ["MOBILE"],
      "clicks": double,
      "impressions": double,
      "ctr": double,
      "position": double
    },
    ...
  ],
  "responseAggregationType": "auto"
}
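A query that would produce a response like the one above looks something like this (the dates are placeholders; adjust them and the dimensions to taste):

request = {
    'startDate': '2018-02-01',  # placeholder date range
    'endDate': '2018-02-28',
    'dimensions': ['device']
}
search_data = webmasters_service.searchanalytics().query(
    siteUrl=SITE_URL, body=request).execute()

for row in search_data.get('rows', []):
    print(row['keys'], row['clicks'], row['impressions'], row['ctr'], row['position'])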

webmasters_service.sites()

Get, list, add and delete sites from your Google Search Console account. This is perhaps really useful if you are a spammer creating hundreds or thousands of sites that you want to be able to monitor in Google Search Console.
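Listing your verified sites is a one-liner. A sketch:

site_list = webmasters_service.sites().list().execute()
for site in site_list.get('siteEntry', []):
    print(site['siteUrl'], site['permissionLevel'])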

webmasters_service.sitemaps()

Get, list, submit, and delete sitemaps in Google Search Console. If you want fine-grained detail about how your sitemaps are being indexed, this is the way to add all of your segmented sitemaps. The response will look like this:

{
  "path": "https://example.com/sitemap.xml",
  "lastSubmitted": "2018-03-04T12:51:01.049Z",
  "isPending": false,
  "isSitemapsIndex": true,
  "lastDownloaded": "2018-03-20T13:17:28.643Z",
  "warnings": "1",
  "errors": "0",
  "contents": [
    {
      "type": "web",
      "submitted": "62",
      "indexed": "59"
    }
  ]
}
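Submitting and listing sitemaps looks roughly like this (a sketch; the feedpath is the full URL of the sitemap you want Google to fetch):

# submit (or resubmit) a sitemap, then list everything submitted for the site
webmasters_service.sitemaps().submit(
    siteUrl=SITE_URL,
    feedpath='https://example.com/sitemap.xml'
).execute()

sitemap_list = webmasters_service.sitemaps().list(siteUrl=SITE_URL).execute()
for sitemap in sitemap_list.get('sitemap', []):
    print(sitemap['path'], sitemap.get('errors'), sitemap.get('warnings'))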

Modifying the Python Script

You might want to change the Search Console query or do something else with the response data. The query is in webmasters.py, and you can change the code to iterate through any query. The check method in checker.py is used to “operate” on every response resource. It can do things that are a lot more interesting than printing response codes.

Query all the Things!

I hope this helps you move forward with your API usage, Python scripting, and Search Engine Optimization… optimization. Any questions? Leave a comment. And don’t forget to tell your friends!

 

Fetch As Google – From Googlebot’s Perspective

The Google Search Console Fetch and Render Tool is like putting on a pair of Googlebot goggles and looking at your own site through those lenses. If you have a simple website (WordPress, for example) that hasn’t been modified too much, you may not appreciate why this is important. But the more complex your site becomes, with AJAX, Javascript rendering, or robots.txt rules, the more critical this tool becomes to understanding why your site is, or is not, optimized for crawling, and for search in general.

When you ask Google to Fetch and Render a URL, Google makes a few requests to that URL, one to your robots.txt file, and one to your favicon. Some of these requests matter more than others in terms of SEO signals. I set up some logging on a simple test site, rsvpz.co, to see if there was anything interesting going on, and hopefully to understand more about what these requests might signal about what Google and Googlebot care about in terms of crawling, rendering, and indexing.

The test pages were simple PHP pages with only a single line of content, “You’re invited!” The PHP was used to collect server variables and HTTP headers and send them to me as an email. This, along with server logs, is how I gathered this information. Now let’s dive in!

Fetch and Render

The two side-by-side views of the page are generated by rendering the page with two different user agents from what Google calls the Web Rendering Service (WRS).

Google Search Console Fetch as Google Tool

The rendering of the page under the heading, “This is how a visitor to your website would have seen the page” comes from the Google user agent:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Web Preview) Chrome/41.0.2272.118 Safari/537.36

As the user agent string suggests, this is essentially a Linux computer running a Chrome browser at version number 41. This is a little odd since that version was released in March of 2015. But this might be a good hint as to what technologies you can expect Googlebot to reliably recognize and support in discovering the content of your page.

Google also sets some limits on what you should expect the WRS to render: IndexedDB and WebSQL, Service Workers, and WebGL are not supported. For more detail on what browser technologies are supported by Chrome 41, check out caniuse.com.

The rendering in the Fetching tab under the heading, “Downloaded HTTP response” and the Rendering tab under the heading, “This is how Googlebot saw the page” both come from the same request. The user agent is:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Search Console) Chrome/41.0.2272.118 Safari/537.36

The user agent string is practically the same as the Google Web Preview user agent; the two differ only in name.

The most significant difference between these two requests is that this request sets the Cache-Control header to ‘no-cache’ to ensure that the content at the URL is as fresh as possible. As the RFC states: “This allows an origin server to prevent caching even by caches that have been configured to return stale responses to client requests.” This makes sense; Google wants to have the freshest index possible. Their ideal index would never be even a second old.

This is further demonstrated in how Google makes requests when you request that a URL is indexed.

Requesting Indexing

Requesting indexing is a great tool when you have new pages on your site that are time sensitive, or if your site or pages go down due to server bugs. It is the fastest way to let Googlebot know that everything is up and running. When you “Request Indexing,” you are asking Googlebot to crawl the page. This is the only way to do this; submitting sitemaps is an implicit request for Googlebot to crawl your pages, but it does not mean that all the URLs will be crawled.

Hello Googlebot

When you click “Request Indexing,” Googlebot requests your page not once, but twice. Unlike the previous requests, these requests come from “Googlebot” itself (Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)). This duplicate request may provide some insight into what Google thinks and cares about when crawling the internet.

The first request is similar to the requests mentioned above. The Cache-Control header is set to no-cache, ensuring that the requested resource is not stale. In my test case, the Accept-Language header was set to the default language of my Google Search Console account, even though I had not specified a default language for the site. The server is also in the US, so this makes sense.

Googlebot Request #1 HTTP Headers

Accept-Encoding: gzip,deflate,br
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
From: googlebot(at)googlebot.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: close
Cache-Control: no-cache
Accept-Language: en-US
Host: rsvpz.co

 

Seconds later, the second request comes along with two important changes. First, the Cache-Control header is no longer set. This implies that the request will accept a cached resource.

Why does Googlebot care about caching? My belief is that it is to understand if and how the page is being cached. This is important because caching has a big effect on speed: cached pages do not need to be rendered on the server every time they are requested, which avoids that server wait time. Whether a page is cached, and for how long, is also a signal that the content may not be updated frequently. Google can take this as a sign that it does not need to crawl that page as often as a page that changes every second in order to have the freshest version. Think of the difference between this blog post and the homepage of Amazon or a stock ticker page.

Googlebot Request #2 HTTP Headers

If-Modified-Since: Mon, 26 Feb 2018 18:13:14 GMT
Accept-Encoding: gzip,deflate,br
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
From: googlebot(at)googlebot.com
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Connection: close
Host: rsvpz.co

 

The second request, as you might have guessed, does not set a Cache-Control header, and it sends an If-Modified-Since header instead (as shown above). This way, Google is able to “diff” the cached and current versions of the page to see if, or how much, they have changed.

The other change between the first and second request is that Googlebot does not set the Accept-Language header, which allows the server to respond with its default language. This is likely used to understand if and how the page and site are internationalized.

Perhaps if I had set rel=”alternate” tags the crawl behavior would have been different. I will leave that experiment up to you.

I’ve spent a lot of time with Googlebot and SEO experiments lately. I am just starting to write about them. Sign up for email updates to learn more about Googlebot and leave a comment if you have any questions. Thanks for reading.

The three most important things I learned in 2017

2017 has been a wild year, to say the least. From waking up wondering if closed-mindedness was becoming mainstream to waking up wondering about the likelihood of nuclear war, to waking up in any one of over a dozen countries, there has been something remarkably unstable about 2017 (even for me).

But amidst all the shifting there was also something very grounding about 2017. I married an amazing woman and I finally found satisfaction in a day-to-day office job at Postmates. From starting the year as a remote freelancer in Hong Kong to ending it hanging shelves in our new apartment in Oakland, both change itself and the adaptation to less change have taught me a few things. Here are some things that come to mind.

 

 

The attainability of freedom

When I was a teenager, my family and I were visiting Yosemite National Park where I met a freelance photographer for National Geographic. At that moment, I knew I wanted a job that I could do from anywhere, one that would allow me to drift wherever my life might take me.

Overlooking the tea plantations of Sri Lanka as clouds slowly filled the valley below –  while auditing an ecommerce site – I realized that it had actually happened. I had clients who were willing to pay me a fair wage and didn’t mind if I was in Sri Lanka, Israel or the Moon as long as the quality of work was good.

What made this work was the same things that make any client relationship work: communication and dedication. By the time I was in Sri Lanka, I had worked remotely for this client for several months from Hong Kong. I had made sure that I was responsive to email in the evenings and early mornings and willing to take phone calls at 11 pm. And to be fair, by the time I was traveling, I was working only up to 20 hours a week. This ensured that I could remain extra-dedicated despite beach and jungle time while keeping rice and curry on the table.

It was great while it lasted, but with a fiancée and a wedding on the way, this life was short-lived, but real nonetheless.

 

The importance of disconnecting

When I look at my phone today, I see my emails, Instagram feed, and Coinbase losses in what Android developer mode calls “monochromacy.” This is part of an effort to do something that was introduced to me by Tristan Harris, former “Design Ethicist & Product Philosopher” at Google who now runs timewellspent.io. The effort is to use my phone as a tool, not a drug. It might sound crazy until you really acknowledge how well product designers and analysts are optimizing for our continued use of their products. Constant, continual use is what they want, and coincidentally, constant, continual use is a good way to describe addiction.

My black and white phone screen, quite surprisingly, has had a massive effect on my interest in continuing to pop in and out of apps when I open my phone. That and killing all the notifications that “aren’t from people” has been a great release from the expectation of instant gratification. It has been extremely freeing.

It can be grounding too. Last year, my then-fiancée and I decided to charge our phones outside of our bedroom at night. This too effected a remarkable increase in our communication with each other. We are now married, and oddly I recognize more and more each day how important that relationship is to foster and maintain. There is no post on Facebook that is more important than one word exchanged between us. It’s true. But it’s strange that we have to remind ourselves.

The importance of forgetting

Javascript: whatever you thought you knew doesn’t matter. Now, even my dog has opinions on transpilers and he scoffs at me for using jQuery. That is the story of technology and the paradigms that come and go with it. I can confirm that this is true in life as well.

Holding on to things, as I learned when I moved into a 400 sq. ft. apartment in Hong Kong, makes it hard to acquire new things. Holding on to old data is costly and it makes finding relevant data slower and more difficult. Holding on to old ideas takes up room for new ideas.

Early on at Postmates, I became aware of a lot of records in the database that were either inaccurate or had no value to users, yet we were serving these records as pages nonetheless, asking Googlebot to crawl them and often allowing them to be indexed for searchers to find. These pages didn’t do anything besides shift Googlebot’s crawl budget away from the high-quality pages on the site.

I’ve never taken so much pleasure in cleaning. We disabled tens or hundreds of thousands of records at a time, and gradually we watched as Googlebot hit these 404s and gave up on its futile effort to index the bad pages. So far, we’ve seen crawl behavior shifting: crawl volume has not decreased overall, but the number and frequency of good pages being crawled have increased. By discarding the bad, the good came to the light.

Change is not always that easy. Sometimes discarding is more difficult than making an API call. Sometimes clearing space means profiling, inspecting, and reconciling with what exists. I’ve come to believe that this, more than acquiring new information, is the important part of growth.

This last year, as new models of acceptable male behavior have become popularized, normalized, and sometimes flat out exposed, I have looked into a “Black Mirror” of sorts and seen what a lifetime of intentional or even passive ignorance becomes. Inspecting and reconciling with the signals of this ignorance has led to a cognitive dissonance between what I know to be right and what I was comfortable believing. It was jarring to reconcile my egalitarian principles with the societal norms that I had taken advantage of and implicitly endorsed. Clearing away the old junk brings more good to light.

This was also, in part, why I found the need to disconnect. I got tired of breathing the same air in my little corral of the Facebook graph. I got tired of seeing only what I was expected to agree with, and I became suspicious of the feedback loop that self-corrected me into a narrow view of the world. This began to look like the same conditions that led to the societal norms I mentioned above. I wanted to leave room for something different, something outside of this little world to explore.

 

 

If you follow this blog or googleappscripting.com, you know about my passion for learning. I write it because it helps me learn and it encourages others to explore. My hope for 2018 is the same as my hope for this blog- that open-mindedness and an explorative nature spreads. I look forward to exploring React and getting involved in the tech education community here in Oakland. I hope that you consider leaving or making a little space for some exploration too.

Analyzing XML Sitemap Files with Bash

I’ve been spending a lot of time working with sitemaps lately. They are a critical component of getting sites indexed by search engines and they are a great way to learn about your competition’s architecture and search strategy. I mean, they literally map out the content that your competition is trying to add to search engine indexes and where they are trying to attract traffic.

The problem is, when sitemaps are larger than a few thousand URLs, or when sitemaps are not pretty-printed, or when index files and gzip compression get involved, the task of reading a sitemap becomes either slow, or manual and really slow. So… Bash! Crashing to the ground like an ancient and overly cryptic superhero, er… programming language, Bash is here to face off with its perfect use case.

If you are intrigued but unfamiliar with the Linux shell, aka “the terminal,” this is a perfect way to get your hands dirty and decide if it’s something you want to become more familiar with.

This post will introduce the syntax, and a few powerful commands to allow you to download, parse, and analyze sitemap files with Bash.

A Background on Bash

Bash (the Bourne Again SHell) has several welcoming characteristics. Like Javascript is to web browsers, Bash is a standard scripting language of Unix-like operating systems such as Linux and macOS. This means you don’t have to worry about installing it; it’s already in the box!

Speed is also a big upside. The language itself is fast, and since you don’t have to write a lot of code, it’s possible to get a lot done quickly. It’s far from being the Swiss Army knife that Python is, but with the right use case, Bash is as powerful as its name sounds like it should be.

On the flip side, the worst thing about Bash, unlike Python and other modern languages, is that the lack of resources makes it hard to learn. “Man(ual)” pages are overly terse and sometimes feel like they were written in their own language. Hopefully, this intro will be a little bit more engaging.

Bash also has limited use cases. It’s great for working with file systems and text files, but it wouldn’t be my first choice for assembling anything that’s very interactive. But now that we have our use case, let’s Bash!

The Short of It

In two lines, we are going to request a sitemap index file and write every sitemap URL to a text file.

curl https://www.example.com/sitemap-index.xml | \
grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' > sitemaps.txt

We can then request those sitemap files and analyze the URLs listed in them:

curl https://www.example.com/some-sitemap.xml | \
gunzip | grep -e loc | sed 's|<loc>\(.*\)<\/loc>$|\1|g' | \
grep -e '<filter pattern>' | sort

With Python, this would take several libraries, more time reading docs, and more code to accomplish the same thing. With Bash, we only need 6 commands: curl, grep, gunzip, sed, sort, and uniq.

Bash programs are assembled like Legos (or plumbing, if that suits you better). So in order to understand how this all works, let’s take these programs apart and put them back together.

The Long of It

As you can see, Bash code is really terse. There’s a lot going on with little instruction, and little explanation. But be patient and keep reading, and this should all make sense.

Piping | data

You may have recognized a pattern: each line is broken up into pieces by the pipe character, “|”. “Piping” conveniently does what it sounds like it does. It sends data along a pipeline of programs which manipulate that data and pass it on. More specifically, each program sends its output (stdout), line by line, into the input (stdin) of the next program. To demonstrate, try this:

ls | head -n 3

That line of code says: list the contents of the current directory and pipe the listing to the head command, which outputs only the first three lines of its input. Similarly:

cat my-file.csv | sort > my-sorted-file.csv

That line of code says: read the contents of my-file.csv and pipe it to the sort command, which sorts the lines of that file alphabetically. The angle bracket (>) at the end of the line means “put the output of these commands into a file named my-sorted-file.csv.” It’s all pretty simple, and it will make more sense as we build the scripts above.

cURL and gunzip to GET and unzip Sitemap Files

Most of what the two scripts do is text processing and filtering, but the first step is getting the sitemap files. You’ve probably come across curl if you’ve ever read through REST API documentation. It is a short command with a lot of options but, in our case, we can keep it simple and just use curl as is. To ease into the curl command, let’s use it to find the location of a website’s sitemap.

Most sites list the URL of their sitemap in their robots.txt file. Making an HTTP GET request to a site’s robots.txt file is simple:

curl https://www.example.com/robots.txt

Now is also a good time to introduce grep. You can use grep to skip scanning through the robots.txt file to quickly find the sitemap URL. Just pipe the output of the robots file to grep filtering for the regular expression pattern, “sitemap:”

curl https://www.example.com/robots.txt | grep -i -e 'sitemap:'

The result of this command should be the line of the robots.txt file that lists the location of the sitemap. Note that the -i flag makes the regular expression pattern match case-insensitive. That’s useful because some sites start the line with a capital “S” in Sitemap.

To get more comfortable with curl, try curl with the URL of this page and you will see that it prints a mess of HTML markup to the terminal window. See below for pretty-printing XML.

Many sites with really large sitemaps use gzip compression to reduce the amount of time it takes to download them. Using the curl command to get gzipped sitemaps doesn’t work too well; it will just print a mess of ugly binary to the terminal. Luckily, there is a standard command, gunzip, to unzip compressed sitemap files. To unzip zipped sitemap files, just pipe the zipped response to gunzip and get readable XML.

curl https://www.example.com/sitemap.xml | gunzip

But what happens if the output is not pretty printed xml?

There is a handy command-line tool for formatting XML called xmllint:

curl https://www.example.com/sitemap.xml | gunzip | xmllint --format -

Finally, to avoid making tons of requests to the same sitemap while you’re learning, you can store the sitemap file on your computer by sending the output to a file:

curl https://www.example.com/sitemap.xml | gunzip > saved-sitemap.xml

Now, instead of requesting the sitemap every time, you can just ‘cat’ the file which outputs every line of a file from top to bottom like so:

cat saved-sitemap.xml | head

That will show the first ten lines of the saved file. Or use tail for the last ten lines.

Parsing the XML and Finding URLs

Hopefully, piping data from program to program is a little more comfortable now, because this is where we get to start taking advantage of the better parts of Bash.

XML sitemaps follow a strict format which makes it easy to parse them. But there is a lot of extra text in them that is not particularly interesting. To find the part that is of interest to us, the URLs, we will use grep which sounds like a burp, but is actually used to filter a stream that is piped to it.

curl https://www.example.com/sitemap.xml | grep -e <regular expression filter pattern>

Grep reads each line of input and, if it matches the regular expression pattern after the -e flag, passes it along, filtering the contents of the input stream. We find all <loc/> XML tags by filtering out everything that doesn’t match the expression 'loc'.

curl https://www.example.com/sitemap.xml | grep -e 'loc'

This is where things get interesting.

The sed command is short for Stream EDitor. Like other Bash commands, it operates on one line at a time. Sed can be a very powerful tool for rewriting file contents, but for our purposes, it will extract sitemap URLs from the <loc/> elements filtered in by the grep command.

curl https://www.example.com/sitemap.xml | grep -e 'loc' | sed 's|<loc>\(.*\)<\/loc>|\1|'

This chain of commands gets the sitemap, filters in all <loc> elements, then extracts the URL from the <loc> elements leaving only a list of URLs– one per line.

Sed allows you to use regular expressions to determine what part of a line you want to keep or replace. The parts of a sed expression are separated by a delimiter, in this case pipes ( | ). Note that the delimiter could be almost any character, such as a colon or a comma, but I like to use pipes because they look cleaner when working with URLs.

The command works like this: sed 's|<regex to match>|<replacement>|<flags>'. Sed captures anything between the escaped parentheses \( and \) and makes the captured text available as \1 in the replacement. So in the command sed 's|<loc>\(.*\)<\/loc>$|\1|', everything (the .*) between <loc> and </loc> is captured, and the whole line is replaced with just the captured text, which is the URL.

That leaves the command that we started with:

curl https://www.example.com/some-sitemap.xml | grep -e loc | \
sed 's|<loc>\(.*\)<\/loc>$|\1|g' | grep -e '<pattern>' | sort

With that Bash code, you will get all the URLs in a sitemap, sorted alphabetically. Note that the \ at the end of the line is used to continue the command on the next line. Add | wc -l to the end of the line to get the number of lines in the output of those commands.

Now you have the tools to analyze not only large sitemap files but all kinds of large text files!

My interest in Bash has grown because it makes it so easy to compose commands together to get and store the results that you need. You now know all the basic building blocks for analyzing a sitemap, so now try it for yourself. Try using grep to see what, and how many, files live in certain directories. Use sed to extract the first directory and pipe that to uniq to see how many first-level directories there are. Any question that you have about the contents of a sitemap, you can now answer with Bash, and fast! Good luck!

There are a couple great books out there if you want to learn more about Bash as a data analysis tool:

Bash Cookbook and Data Science at the Command Line

How I became a Technical Marketer

As you know, I love learning and I am a huge advocate for self-service education. Recently, I had the honor of being interviewed on the Measureschool Youtube channel where I got to share my path in learning and employing the technical side of digital marketing. We talked about some of the resources I mentioned in the series on Digital Marketing Technical Skills and in the long version I share my fascination with Github and Push Messaging.

I hope the video encourages others to start and/or continue their technical learning journey!

Julian Jueneman also just launched a community called techmarketer.io which is a great place to learn from and share with others who are diving into all things technical marketing.

Meanwhile, I have been slow on this blog lately but only because I have been more involved in a couple other projects:

  • Keyword Visualization Tool: Currently in super-beta mode, but I would love it if you tested it out and provided some feedback on how it could be more useful. If you get Javascript errors, please share those with me too!  :*(
  • GoogleAppScripting.com: It is what the name implies. Learn about how to use Google Apps Script to make networked data solutions for yourself!
  • GlobalReportNews.com: A fake-fake news site that allows you to make up stories and share them on social media to see how many people will click them. Here is the Medium article about what I learned in the process. Pretty crazy stuff!

Never stop learning!

JSON-LD Bulk Editor and File Generator for Google Sheets

As Schema.org JSON-LD usage spreads across the internet and its utility increases, I find myself creating a lot more JSON-LD templates for non-standard entities… and then I find myself copying them over and over for each of their sibling entities. Finally, I decided I had had enough of this and looked to solve this problem for myself and other SEOs in this same boat.

The Schema.org JSON-LD Bulk Editor for Google Sheets is a Google Spreadsheet template and script that takes a “prototype” JSON-LD template and allows you to create as many customized copies of it as you need by filling in a spreadsheet. It then allows you to save each JSON-LD schema into its own JSON file to add to your site or app.

JSON-LD Bulk Generator Use Cases

I admit, I believe in the future of the semantic web and because of that, I think there will be a lot more use of JSON-LD. Developers will be able to create a lot of interesting applications on top of all that nice and clean data and we will hopefully benefit from this easy access to information.

Google and others are already consuming JSON-LD but, because their primary focus is things like Products, Recipes, and Locations, developers have created JSON-LD generators for those specific and rigid schemas.

This tool allows you to create and edit the schema that perfectly fits your use case. 

Example Use Cases:

  • You need a lot of customized markup for non-standard entities like CreativeWorks or Intangibles subclasses
  • It’s not worth building a schema into your CMS just to markup a few similar but unique pages
  • You want to add JSON-LD to several similar but unique hard-coded HTML pages
  • You are using Google Tag Manager but don’t have access to the variables needed for the application/ld+json HTML tag.
  • Hosting a few .jsonld files is easier than modifying your database

If you are wondering about how to put JSON-LD onto the page with Google Tag Manager, learn more about embedding JSON-LD with Javascript.

How to Edit and Generate JSON-LD Files

 

TO BEGIN:    MAKE A COPY OF THE SPREADSHEET

Then follow these three steps:

Schema.org JSON LD Bulk Editor for Google Sheets Step 1

Step 1: Paste your “prototype” schema 

First, create a “prototype” schema that will be copied with unique values. Use a code editor to avoid any JSON syntax errors.

Then paste your prototype schema into the green box on the “Start” tab.

The schema of the prototype will be identical to that of its copies. You can modify the schema in the next step but it is easiest to start with a perfect prototype.

Step 2: Create and Edit the Spreadsheet Template

In the JSON-LD Bulk Editor menu, click Create schema editor.

Behind the scenes, Google Apps Script is “flattening” the JSON object and making a header row of object keys on the Generator tab and filling the second row with the values from the prototype object.

Edit JSON-LD in Google Spreadsheets

 

The header row shows the JSON object keys in “dot-notation.” This is similar to how Google Tag Manager, MongoDB, and Javascript in general allow you to access values within objects. So if the schema is:

{"aggregateRating": { 
  "@type": "AggregateRating",
  "bestRating": "100",
  "ratingCount": "24",
  "ratingValue": "87" }
}

The dot-notation for rating value would be aggregateRating.ratingValue. This works for arrays in the same way. The 0th element of an array would be someArray.0 and the 12th element of the array would be someArray.12.
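The flattening itself happens in Google Apps Script behind the scenes, but the idea is easy to see in a few lines of Python (purely illustrative, not the editor’s actual code):

def flatten(obj, prefix=''):
    # Flatten nested dicts/lists into {"dot.notation.key": value} pairs
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = '{}.{}'.format(prefix, key) if prefix else str(key)
        flat.update(flatten(value, path))
    return flat

# flatten({"aggregateRating": {"ratingValue": "87"}})
# -> {"aggregateRating.ratingValue": "87"}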

Now simply fill in a row of values that correspond to the headers for each entity.

Step 3: Generate some JSON-LD

Now that each entity has been filled out in the spreadsheet, it’s time to make some files! Just go to the JSON-LD Bulk Editor menu and click Save JSON files to Drive.

Behind the scenes, Google Apps Script builds the JSON object based on the values in the spreadsheet. The schema is then saved to a Google Drive folder called “JSON-LD Schema {{timestamp}}”. Each file is saved under the file name that was assigned on the spreadsheet with the file extension, .jsonld.

Save JSON-LD files to Google Drive

Your .jsonld files are ready!

Make Some Semantic Web!

I hope you will find this tool as useful as I have. For an easy way to use these JSON files, check out embedding JSON-LD with Javascript.

If this works for you or you see any improvements that could be made, it would be great to hear from you in the comments.

 

Using Hosted JSON-LD Files as applications/ld+json Scripts for SEO

Sometimes it’s just easier to separate concerns. Just like how stylesheets and .js scripts separate a page’s presentation and behavior from its markup, the same can be done with the JSON-LD schematic markup / structured data of the page. This example shows how you can add structured data for SEO using Javascript, which in many cases may prove to be much easier than messing around with your server-side code or CMS.

The following Javascript snippet shows how you can load a stored .jsonld file from your server onto your page as an application/ld+json script tag. Just add it to your page to pull in hosted JSON-LD files.

How to Add JSON-LD Markup with Javascript

The code below does the following four steps.

  1. The js makes a call to the local .jsonld file
  2. When the file is returned, an application/ld+json script tag is created
  3. The contents of the .jsonld file are inserted as the contents of the script tag
  4. The tag data is ready to be consumed by other applications

The script uses jQuery but the same could be achieved with plain js or any other library/framework of choice. This code can also be added by a tag management system such as Google Tag Manager.

 <script>
   // Add an application/ld+json schema tag to the page's <head>
   $.getJSON( "/your-schema-file.jsonld", function( data ) {
     $( "<script/>", {
       "type": "application/ld+json",
       "html": JSON.stringify(data)
     }).appendTo( "head" );
   });
 </script>

Does Google Think It’s Valid Structured Data?

Yes. See this demo for a live working example or go straight to Google’s Structured Data Testing Tool to see the valid results of the demo. Other crawlers may not recognize the markup because it is rendered onto the page using Javascript; the crawler must be able to run Javascript, which is not all that common.

Creating .jsonld Files

To learn about every single minute detail of creating .jsonld files see this spec about syntax. But essentially, .jsonld files are no different syntactically than JSON files. It is only the specific way that the JSON-LD files signify entities that differs from JSON.

If you need to create multiple JSON-LD files, checkout this Bulk JSON-LD Generator for Google Sheets.

I hope you find this useful. I would love to hear your thoughts in the comments.

Google Apps Script doGet and doPost Tutorial + 6 Web App Examples

Google Apps Script Web Apps

(This post has been moved to a site all about Google Apps Script Tutorials.)

If you are new to Google Apps Script or programming in general, Google Apps Script Web Apps may seem daunting. The good news is that it is easy to understand with a little trial and error.

This tutorial will give you 12 recipes to help you get started with the doGet and doPost functions and show you how to return HTML, JSON, and plain text. From there, you can go on to customize your Web App to your needs.  But first, let’s get the basics out of the way.

What do doGet and doPost do?

Since you are starting to build a web app, you should probably know about HTTP. HTTP is the specification for how different machines on the web send and receive information from one another. HTTP allows several types of “requests,” but what we see most often are GET and POST requests. doGet and doPost functions define how a Google Apps Script web app handles GET and POST requests from other machines and determine what they return or do.

A simple example of a GET request happens every time you enter a URL into your web browser. The web server at that domain then handles your request and returns an HTML web page. GET requests are the most common type of request on the internet. For our purposes, GET requests are made up of a domain, e.g., example.com, a path, e.g., /fruits/apples/, and often a query string, e.g., ?variety=fuji&count=4. With Google Apps Script, the domain will always be script.google.com, the path will always look something like /macros/s/AKf...ycb/exec (or /dev), and the query string can be whatever you want it to be, in order to specify how the web app dynamically generates the content it returns.

POST requests are similar to GET requests but slightly more powerful. Rather than asking for content, they are used to send data from one machine to another. The most common use of POST requests is web forms, like signup forms and ecommerce checkout pages. POST requests are not limited to sending form text though. They can transfer image, audio, and a number of other types of files. This tutorial will only cover JSON, but if you need more than that, you probably don’t need this tutorial!

Your First Web App “Hello World”

Let’s get a few easy examples out of the way. The next three code snippets serve essentially the same content in three different formats.

To try them out:

  1. Copy and paste them into a new Google Apps Script.
  2. Click Publish on the main menu and then Deploy as web app…
  3.  Click “Test web app for your latest code.”

Returning Plain Text

The ContentService.createTextOutput method is the most basic way to return content from a GET request, but it is the building block for most other types of content. The TextOutput object created by this method returns a response of type text/plain by default, but it can be set to ATOM, CSV, ICAL, JAVASCRIPT, JSON, RSS, TEXT, VCARD, and XML.

function doGet(){

 var textOutput = ContentService.createTextOutput("Hello World! Welcome to the web app.");
 return textOutput;
}

Returning JSON

This is a simple pattern. Just serialize the Javascript object as JSON using JSON.stringify(), and set the Content-Type header with the setMimeType() method. This is important when returning different media types because you don’t want to make the consumer of your response guess the MIME type. For more on this, see my Google Apps Script JSON REST API tutorial.

function doGet(){
 
 var appData = {
 "heading": "Hello World!",
 "body": "Welcome to the web app."
 };

 var JSONString = JSON.stringify(appData);
 var JSONOutput = ContentService.createTextOutput(JSONString);
 JSONOutput.setMimeType(ContentService.MimeType.JSON);
 return JSONOutput
}

Returning HTML

The HtmlService.createHtmlOutput() method has two purposes: to sanitize the HTML of any security risks before it is returned to the browser and to implicitly set the MIME type to text/html. To construct web pages, you can assemble the HTML string, as shown below, or you can use templates.

function doGet(){
 
 var HTMLString = "<style> h1,p {font-family: 'Helvetica', 'Arial'}</style>" 
 + "<h1>Hello World!</h1>"
 + "<p>Welcome to the Web App</p>";
 
 var HTMLOutput = HtmlService.createHtmlOutput(HTMLString);
 return HTMLOutput;
}

Dynamic Web Pages with the HTML Service

External Content, Query Strings, and HTML Templates

Let’s make this interesting. Google Apps Script can assemble web pages based on dynamic inputs. In the example below, we use an external CSS file and the query string parameters to create dynamic content. The doGet and doPost functions take one default argument (commonly e), which holds information about the request “event.” From that e object, you can pull information like query string values, as shown below. See the working version of the web app here.

To take this one step further, you can create HTML template files in Google Apps Script and evaluate those templates rather than creating the template in the script. Additionally, you can use information from the Google Drive, Gmail, and other Google APIs to populate your app.

function doGet(e){
 
  // use an externally hosted stylesheet
 var style = '<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet">';

 // get the query "greeting" parameter and set the default to "Hello" 
 var greeting = e.parameter.greeting || "Hello";
 // get the query "name" parameter and set the default to "World!"
 var name = e.parameter.name || "World";
 
 // create and use a template 
 var heading = HtmlService.createTemplate('<h1><?= greeting ?> <?= name ?>!</h1>')
 
 // set the template variables
 heading.greeting = greeting;
 heading.name = name;
 
 var content = "<p>Welcome to the web app.</p>";
 
 var HTMLOutput = HtmlService.createHtmlOutput();
 HTMLOutput.append(style);
 HTMLOutput.append(heading.evaluate().getContent());
 HTMLOutput.append(content);
 
 return HTMLOutput
}

Google Apps Script doPost Example

An API Proxy

This example was inspired by my last post about custom Slack slash commands. This doPost example demonstrates how to handle POST requests, access request query string parameters, make HTTP requests to other services (in this case, the Chuck Norris joke API) with UrlFetchApp, and break the script into functions.

The reason I made this was that I got data from the Slack app that was formatted as a query string,

&text=Trevor%20Fox

that I needed to use to get data from an API that would be formatted as JSON,

{ "type": "success", "value": { "id": 268, "joke": "Time waits for no man. Unless that man is Chuck Norris." } }

to then send back to the Slack app as plain text.

Time waits for no man. Unless that man is Trevor Fox.

This problem also exists if you want to use another service that requires that data be formatted in a specific way, like the IFTTT Maker Channel.

function doPost(e){
 
 // Get the "text" query string value
 // eg. &text=Your%20Name
 var name = e.parameter.text
 
 // Get the Chuck Norris Joke
 var chuckQuote = postChuckNorris(name);
 
 // Return plain text Output
 return ContentService.createTextOutput(chuckQuote);
}


// Make a call to the Chuck Norris joke API
// parse the JSON and return only the joke
function postChuckNorris(name){

 var queryString = makeQueryString(name)
 
 var chuckData = UrlFetchApp.fetch("http://api.icndb.com/jokes/random/" + queryString);
 
 var chuckJSON = JSON.parse(chuckData.getContentText());
 var chuckQuote = chuckJSON.value.joke;
 
 return chuckQuote
}


// Helper function to assemble the query string for
// calling the Chuck Norris API from a given name
function makeQueryString(name){
 var query = ""
 
 if (name !== undefined){
 
  var names = name.split(" ");
  query = "?firstName=" + names[0];
 
  if (names[1] !== undefined){
   query += '&lastName=' + names[1];
  }
 
 }

 return query
}

More than Just Returning Content

Dynamically Create and Save a PDF File to Google Drive

This example is similar to the dynamic HTML example above, but instead of making a dynamic web page, it creates and saves a PDF document in Google Drive. This would also work with doPost in a situation where you would want to send more information to the app when creating your PDF. Check out the live script. *Note, this will create a PDF file in your Google Drive each time you go to the URL.

This takes the functionality of the Google Apps Script web app one step further. This script demonstrates the capability to carry out any task or trigger any other service just from GET’ing or POST’ing to your web app.

function doGet(e){

 // get the query "greeting" parameter and set a default to "Hello" 
 var greeting = e.parameter.greeting || "Hello";
 // get the query "name" parameter and set a default to "World!"
 var name = e.parameter.name || "World";
 
 // create the PDF
 var pdf = makePDF(greeting,name)

 // save the PDF to Drive
 var driveFile = DriveApp.createFile(pdf).setName("greetings.pdf");

 // tell the user how to access it
 var fileURL = driveFile.getUrl();
 var fileName = driveFile.getName();
 
 var HTMLOutput = HtmlService.createHtmlOutput("<p>You made a PDF.</p>" 
 + "<p> You can download it here: "
 + '<a target="blank" href="' + fileURL + '">' + fileName + '</a></p>');
 return HTMLOutput
}

function makePDF(greeting, name){

 // use an externally hosted stylesheet
 var style = '<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet">';
 // create and use a template 
 var heading = HtmlService.createTemplate('<h1><?= greeting ?> <?= name ?>!</h1>')
 
 // set the template variables
 heading.greeting = greeting;
 heading.name = name;
 
 var content = "<p>Here is your PDF.</p>";
 
 var PDFOutput = HtmlService.createHtmlOutput('<div class="container">');
 PDFOutput.append(style);
 PDFOutput.append(heading.evaluate().getContent());
 PDFOutput.append(content);
 PDFOutput.append('</div>');
 
 var blob = Utilities.newBlob(PDFOutput.getContent(), "text/html", "text.html");
 var pdf = blob.getAs("application/pdf");
 
 return pdf
 
}

Publishing and Permissions

When you are ready to launch your new Google Apps Script web app, you should save a version and define who can access your app, and how, for both users and other machines. There are a couple important things to note.

If your app offers any capability to interact with a user’s own Google services (like their Google Drive files), the app should be set to Execute the app as: User accessing the web app.

If you want to allow other services to interact with your web app (as an API or proxy), the app should be set to Execute the app as: Me (myemail@gmail.com) and Who has access: Anyone, even anonymous.

 

I hope this gets you started! Let me know if you are having any trouble getting your app up and running in the comments. Happy hacking!

Slack Slash Commands for Everyone + 10 Free APIs

You might be new to the whole idea of “bots” or acronyms like REST, API, HTTP, and cURL. And you probably don’t know how to code your way out of a wet paper bag. But this doesn’t mean you cannot or should not get your hands dirty and have a little fun making a customized little tool for yourself and your team. This is a quick and easy “Hello World” tutorial for creating your own Slack slash commands that anybody, no matter how technically savvy, can follow.

Custom slash commands for Slack allow you to access information from across the internet without leaving your favorite messaging tool. In essence, slash commands let your Slack app act as a web browser, but instead of returning and displaying a fully rendered HTML web page, Slack displays a bit of text information from a slash command “bot.” It’s as simple as that.

How Custom Slash Commands Work

Slash commands work similarly to the way your web browser retrieves an HTML web page found at the URL in the address bar. The difference is that the Slack app retrieves the plaintext information found at the URL you set in your custom slash command. In both cases, the browser and the Slack app send an HTTP “GET” request to another server. Slack can also send a POST request to send information to a given location, but this tutorial is meant to be quick and easy for non-techies, so we are going to stick to retrieving plain text content.

So you are wondering, where can I get this text content if I can’t program a server to return something? Luckily, there are plenty of plaintext APIs that return useful, interesting, and sometimes silly information. Check them out by skipping to the list of plaintext APIs below.

Choose one of the APIs or find another one that you want to use. Then test the API by pasting the URL into your browser. If the browser displays only text without any formatted HTML, you’re ready to make your slash command!

Create Your Custom Slash Command

Let’s get started! First, go to: {{yourSlackTeam}}.slack.com/apps/manage/custom-integrations and select Slash Commands. On the next page, click Add Configuration. Then just follow the steps below.

create a slack slash command

  1. Setup Your Request

    Once you have found the text API that you want to use in your slash command, enter the Command name, the URL of the API request, and leave the Method as “GET.”  For this demo, we are using the Lorem Ipsum generator API found here: http://loripsum.net/api/plaintext. Click the link to see what it provides.

    name your slash command

  2. Customize the Appearance

    Here is the fun part. You get to decide what your slash command’s response looks like. Choose a cute name and upload an image or choose from any one of the many Slack emojis.
    customize the slack bot appearance

  3. Help Users Use It

    For the Description, provide a short explanation of what your slash command will return to the user.
    describe your slash command

  4. Try it out!

    Type in your slash command or type part of the command and use the tab-complete to fill in the command. Hit enter and boom! You have your first working slash command.
    use your slack slach command

10 Free Plain Text APIs for Custom Slash Commands

Now that you know how to make your own slash command integration, here are a few APIs to play around with. If you made one that you want to share or found one that you like, please leave a link in the comment section.

  1. Lorem Ipsum: Just like the one in the demo, get all the Lorem Ipsum text you could ever ask for at http://loripsum.net/api/plaintext
  2. Check Your IP Address: If you move around a lot and you need to know your IP address, just call this API: https://api.ipify.org
  3. Get a Currency Exchange Rate: With a free API key, you can get more API calls than you will ever need. Here is the URL for US Dollars to Hong Kong Dollars: https://www.exchangerate-api.com/USD/HKD?k=yourAPIKey
  4. History Trivia: Get a trivia fact about a random year in world history: http://numbersapi.com/random/year or a random date: http://numbersapi.com/random/date
  5. Random Number Trivia: Get a trivia fact about a random number. http://numbersapi.com/random/trivia
  6. Chuck Norris Jokes: Ok, this one returns JSON, not plain text, but it’s short enough to parse visually quite easily. Plus it’s just too funny to pass up. http://api.icndb.com/jokes/random
  7. Profound Quotes: Sometimes it’s just nice to hear some Nietzsche and then some Twain. Chat with the greats: http://api.forismatic.com/api/1.0/?method=getQuote&lang=en
  8. Yes/No Decision Making: Can’t decide whether you should do something? There’s an API for that. This is another JSON API, but since the response is so short, it is still easy to read. https://yesno.wtf/api
  9. Magic 8 Ball Decision Making: For the tough questions, you need a little magic. This one also returns human-friendly JSON, and can only answer one question per Slack command, so make it a good one! https://8ball.delegator.com/magic/JSON/Will%20I%20…%3F
  10. When does Donald Trump Leave Office? If you read the news in America, you may find yourself wondering this. Now Slack can answer this question at a moment’s notice. https://trevorfox.com/trumpout.php

Have fun! And remember, “with great power comes great responsibility.”

Easy Gantt Chart Timelines in Google Sheets (with Template)

Google Docs and Gantt charts are a perfect match. Google Spreadsheets offers the ability to share and update spreadsheets in real time, which is a major benefit for any project team, especially those who work in different locations or time zones. On top of that, you can’t beat the free price!

There are many projects that are complex enough to demand a formal task planning and management hub but do not justify a full-featured, premium application. This tutorial will show you how to take your ordinary task list and turn it into a dynamic visual timeline — a Google Spreadsheet Gantt chart.

Google Spreadsheet Gantt Chart

View the Sample Chart

View an even better sample chart from a reader

There are other Google Spreadsheet Gantt chart examples that use the Chart feature as the visualization. I like to use the SPARKLINE() function. This keeps the project task visualization in the same place as all the important details about each task such as the RACI assignments or progress updates.

Sparklines Work Better Than Charts

Sparklines are essentially just little data visualizations in spreadsheet cells. To learn more about how the sparkline feature works, check out these sparkline examples. To create the visualization, we are going to use “bar” for the value of “charttype.” Then we get a little bit clever with colors to show the start and end dates of each task. The SPARKLINE formula for each task visual looks like this:

=SPARKLINE({INT(taskStart)-INT(projectStart), INT(taskFinish)-INT(taskStart)},{"charttype","bar";"color1","white";"empty","zero"; "max",INT(projectFinish)-INT(projectStart)})

The projectStart and projectFinish values are the start and end date of the project, and the taskStart, and taskFinish values are the start and end dates for the task that is being shown in the timeline visualization.

SPARKLINE Gantt Chart

The reason everything is being wrapped in the INT() function is so that the dates can be subtracted from each other to provide the difference in days. The first argument to SPARKLINE puts two values in the array literal that are essentially:

{daysFromProjectStartToTaskStart, taskDurationInDays}

The SPARKLINE function then makes two bars: one colored "white" so as to be invisible, and the other colored blue (by default) or any color you choose by setting "color2". The value for "max" is the difference between the start and end of the project in days.

On the example template, there are a couple other features: a week-by-week ruler and the burndown visualization.

The week-by-week visualization uses a clever little formula to make an array of numbers, each incrementing by seven, as the first argument to SPARKLINE, displaying alternating colored bars for each week of the project’s duration.

split(rept("7,",round((int(projectEnd)-int(projectStart))/7)),",")

The burndown visualization shows the days that have already been burned through the project. This gives you a visual display of how well the project is keeping on track to its timeline. The first argument to SPARKLINE is a dynamic value, calculated by subtracting the project’s start date from the current date:

int(today())-int(projectStart)

Customizing your Timelines

Each SPARKLINE function takes arguments for color1 and color2. These values set the colors of the alternating bars in the bar visualization. For each task, color1 is set to white so as to be invisible. But color2 can be set to anything that may be useful for managing your project. Colors could be specified by task owner or type, or even set dynamically based on whether a task is ahead of schedule, in progress, late, etc…

Keep this in your Google Docs project folder with all of your other important project documentation for a neat project hub.