Preventing your site from being indexed, the right way

We said it in 2009, and we’ll say it again: it keeps amazing us that there are still people using just a robots.txt file to prevent indexing of their site in Google or Bing. As a result, their site shows up in the search engines anyway. You know why it keeps amazing us? Because robots.txt doesn’t actually do the latter, even though it does prevent indexing of your site. Let me explain how this works in this post.

For more on robots.txt, please read robots.txt: the ultimate guide.


There is a difference between being indexed and being listed in Google

Before we explain things any further, we need to go over some terms here first:

  • Indexed / Indexing
    The process of downloading a site or a page’s content to the server of the search engine, thereby adding it to its “index”.
  • Ranking / Listing / Showing
    Showing a site in the search result pages (aka SERPs).

So, while the most common process goes from Indexing to Listing, a site doesn’t have to be indexed to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents the search engine from indexing that page, it’ll still show the URL in the results if it can gather from other sources that the page might be worth looking at. In the old days, those sources could have been DMOZ or the Yahoo! directory, but I can imagine Google using, for instance, your Google My Business details these days, or the old data from those projects. There are plenty of other sites that summarize your website, after all.

Now if the explanation above doesn’t make sense, have a look at this 2009 Matt Cutts video explanation:

If you have reasons to keep your site, or parts of it, out of the search results, adding that request to the specific page you want to block, as Matt describes, is still the right way to go. But you’ll need to give Google a chance to see that meta robots tag. So, if you want to effectively hide pages from the search engines, you need to let them index those pages, even though that might seem contradictory. There are two ways of doing that.

Prevent listing of your page by adding a meta robots tag

The first option to prevent listing of your page is by using robots meta tags. We’ve got an ultimate guide on robots meta tags that’s more extensive, but it basically comes down to adding this tag to your page:

<meta name="robots" content="noindex,nofollow>

The issue with a tag like that is that you have to add it to each and every page.
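If your site runs on WordPress, you don’t have to paste that tag into every template by hand; a small snippet hooked into wp_head can print it on every page for you. This is just a minimal sketch under that assumption, not the way Yoast SEO does it:

// In your theme's functions.php or a small plugin: print the meta robots tag
// in the <head> of every page. Adjust the condition if you only want to block
// certain pages (e.g. is_search() or is_author()).
add_action( 'wp_head', function () {
	echo '<meta name="robots" content="noindex,nofollow">' . "\n";
} );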

Or by adding an X-Robots-Tag HTTP header

To make the process of adding the meta robots tag to every single page of your site a bit easier, the search engines came up with the X-Robots-Tag HTTP header. This allows you to specify an HTTP header called X-Robots-Tag and set the value as you would the meta robots tag’s value. The cool thing about this is that you can do it for an entire site at once. If your site is running on Apache, and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:

Header set X-Robots-Tag "noindex, nofollow"

And this would have the effect that the entire site can be indexed, but will never be shown in the search results.

So, get rid of that robots.txt file with Disallow: / in it. Use the X-Robots-Tag or that meta robots tag instead!

Read more: ‘The ultimate guide to the meta robots tag’ »

HTTP status codes and what they mean for SEO

HTTP status codes, like 404, 301 and 500, might not mean much to a regular visitor, but for SEOs they are incredibly important. Not only that, search engine spiders, like Googlebot, use these to determine the health of a site. These status codes offer a way of seeing what happens between the browser and the server. Several of these codes indicate an error, for instance, that the requested content can’t be found, while others simply suggest a successful delivery of the requested material. In this article, we’re taking a closer look at the most important HTTP header codes and what they mean for SEO.

What are HTTP status codes and why do you see them?


An HTTP status code is a message the server sends when a request made by a browser can or cannot be fulfilled. According to the official HTTP specifications, there are dozens of status codes, many of which you’re unlikely to come across. If you need a handy overview of status codes, including their code references, you can find one on HTTPstatuses.com.

To fully understand these codes, you have to know how a browser gets a web page. Every website visit starts by typing in the URL of a site or entering a search term in a search engine. The browser sends a request to the site’s IP address to get the associated web page. The server responds with a status code embedded in the HTTP header, telling the browser the result of the request. When everything is fine, an HTTP 200 status code is sent back to the browser, along with the content of the website.
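If you want to see that exchange for yourself, you can ask a server for just its headers. Here’s a minimal PHP sketch (example.com is a placeholder URL):

// get_headers() performs a request and returns the response headers;
// the first entry is the status line.
$headers = get_headers( 'https://example.com/' );
if ( $headers ) {
	echo $headers[0]; // e.g. "HTTP/1.1 200 OK"
}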

However, it is also possible that there’s something wrong with the requested content or server. It could be that the page is not found, which gives back a 404 error page, or there might be a temporary, technical issue with the server, resulting in a 500 Internal Server Error. These HTTP status codes are an important tool to evaluate the health of the site and its server. If a site regularly sends improper HTTP header codes to a search engine indexing its contents, it might cause problems that will hurt its rankings.

Different ranges

There are five different ranges of HTTP status codes, defining different aspects of the transaction process between the client and the server. Below you’ll find the five ranges and their main purposes:

  • 1xx – Informational
  • 2xx – Success
  • 3xx – Redirection
  • 4xx – Client error
  • 5xx – Server error

If you ever try to brew coffee in a teapot, your teapot will probably send you the status message 418: I’m a teapot.

Most important HTTP status codes for SEO

As we’ve said, the list of codes is long, but there are a couple that are especially important for SEOs and anyone working on their own site. We’ll do a quick rundown of these below:

200: OK / Success

This is how it probably should be; a client asks the server for content and the server replies with a 200 success message and the content the client needs. Both the server and the client are happy — and the visitor, of course. All messages in 2xx mean some sort of success.

301: Moved Permanently

A 301 HTTP header is used when the requested URL has permanently moved to a new location. As you work on your site, you will use this often, because you regularly need a 301 redirect to point an old URL to a new one. If you don’t, users will see a 404 error page when they try to open the old URL, and that’s not something you want. Using a 301 makes sure that the link value of the old URL transfers to the new URL.
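In PHP, sending such a redirect comes down to a Location header with the right status code. A minimal sketch, with a placeholder destination:

// Permanently redirect this URL to its new location; the third argument sets
// the status code (use 302 or 307 here for a temporary redirect instead).
header( 'Location: https://example.com/new-url/', true, 301 );
exit;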

Read more: ‘How to create a 301 redirect in WordPress’ »

302: Found

A 302 means that the target destination has been found, but it lives at a different location. However, it is a rather ambiguous status code, because it doesn’t say whether this is a temporary situation or not. Use a 302 redirect only if you want to temporarily redirect a URL to a different source and you are sure that you will use the same URL again. Since you tell search engines that the URL will be used again, none of the link value is transferred to the new URL, so you shouldn’t use a 302 when moving your domain or making big changes to your site structure, for instance. In this Ask Yoast video, Joost details the difference between the two.

307: Temporary Redirect

The 307 code replaces the 302 in HTTP/1.1 and could be seen as the only ‘true’ temporary redirect. You can use a 307 redirect if you need to temporarily redirect a URL to a new one while keeping the original request method intact. A 307 looks a lot like a 302, except that it says specifically that the URL has a temporary new location. The request can change over time, so the client has to keep using the original URL when making new requests.

403: Forbidden

A 403 tells the browser that the requested content is forbidden for the user. If they don’t have the correct credentials to log in, this content stays forbidden for that user.

404: Not Found

As one of the most visible status codes, the 404 HTTP header code is also one of the most important ones. When a server returns a 404 error, you know that the content has not been found and has probably been deleted. Try not to bother visitors with these messages: fix these errors as soon as possible. Use a redirect to send visitors from the old URL to a new article or page with related content.

Monitor these 404 messages in Google Search Console under Crawl errors and try to keep them to a minimum. A lot of 404 errors might be seen by Google as a sign of poor maintenance, which in turn might influence your overall rankings. If a page is broken and in fact should be gone from your site, a 410 sends a clearer signal to Google.

Keep reading: ‘404 error pages: check and fix’ »

410: Gone

The result of a 410 status code is the same as a 404, since the content has not been found. However, with a 410 you tell search engines that you deleted the requested content, so it’s much more specific than a 404. In a way, you order search engines to remove the URL from the index. Before you permanently delete something from your site, ask yourself if there is an equivalent of the page somewhere. If so, make a redirect; if not, maybe you shouldn’t delete it but improve it instead.
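If your pages are served through PHP, returning a 410 for content you deleted on purpose is a one-liner. A minimal sketch:

// Tell clients and search engines this content is gone for good.
http_response_code( 410 );
echo 'This page has been permanently removed.';
exit;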

Read on: ‘How to properly delete a page from your site (404 or 410?)’ »

451: Unavailable for Legal Reasons

A rather new addition, the 451 HTTP status code shows that the requested content has been removed for legal reasons. If you’ve received a takedown request or a judge has ordered you to take specific content offline, this is the code you should use to tell search engines what happened to the page.

Read more: ‘HTTP 451: Content unavailable for legal reasons’ »

500: Internal Server Error

A 500 error is a generic error message saying that the server encountered an unexpected condition that prevented it from fulfilling the request, without getting specific on what caused it. These errors could come from anywhere, maybe your web host is doing something funny or a script on your site is malfunctioning. Check your server’s logs to see where things go wrong.

503: Service Unavailable

A server sends a 503 error message when it is currently unable to handle the request due to an outage or overload. Use this status code whenever you require temporary downtime, for instance when you are doing maintenance on your site. This way, search engines know they can come back later to find your site in working order again.

Keep reading: ‘503: Handling site maintenance correctly for SEO’ »

Working with HTTP status codes

HTTP status codes are a big part of the lives of SEOs, and of search engine spiders for that matter. You’ll encounter them daily, and it’s key to understand what the different status codes mean. For instance, if you delete a page from your site, it’s very important that you know the difference between a 301 redirect and a 410. They serve different goals and, therefore, have different results.

If you want to get an idea of the kinds of status codes your site generates, you should log into your Google Search Console. Here, you’ll come across a page with crawl errors that the Googlebot found over a certain period of time. These crawl errors have to be fixed before your site can be indexed correctly. Or, you can connect Yoast SEO Premium with Google Search Console, view errors directly from the backend and fix the ones that need redirecting.

Manage redirects with Yoast SEO Premium

We get it, working with these things on a daily basis is time-consuming and pretty boring. However, if you use Yoast SEO Premium, creating redirects has never been easier. Every time you delete or move a post or page, the Redirects manager in Yoast SEO asks you whether you want to redirect it or not. Just pick the correct option and you’re good to go. Watch the video to see how easy it is to use.

That’s all, folks

Make yourself familiar with these codes, because you’ll see them pop up often. Knowing which redirects to use is an important skill that you’ll have to count on often when optimizing your site. One look at the crawl errors in Google Search Console should be enough to show you how much is going on under the hood.

Read on: ‘Crawl efficiency: making Google’s crawl easier’ »

Playing with the X-Robots-Tag HTTP header

Traditionally, you will use a robots.txt file on your server to manage what pages, folders, subdomains or other content search engines will be allowed to crawl. But did you know that there’s also such a thing as the X-Robots-Tag HTTP header? In this post we’ll discuss what the possibilities are and how this might be a better option for your blog.

Quick recap: robots.txt

Before we continue, let’s take a look at what a robots.txt file does. In a nutshell, what it does is tell search engines to not crawl a particular page, file or directory of your website.

Using this helps both you and search engines such as Google. By not providing access to certain unimportant areas of your website, you can save on your crawl budget and reduce the load on your server.

Please note that using the robots.txt file to hide your entire website for search engines is definitely not recommended.

Say hello to X-Robots-Tag

Back in 2007, Google announced that they had added support for the X-Robots-Tag. What this meant was that you could not only restrict access for search engines via a robots.txt file, but also programmatically set the directives you’d normally put in a robots meta tag in the headers of an HTTP response. Now, you might be thinking “But can’t I just use the robots meta tag instead?”. The answer is yes. And no. If you plan on programmatically blocking a particular page that is written in HTML, then using the meta tag should suffice. But if you want to keep, let’s say, an image out of the search results, then the HTTP response approach is the way to do this in code. Obviously, you can always use the latter method if you don’t feel like adding additional HTML to your website.

X-Robots-Tag directives

As Sebastian explained in 2008, there are two different kinds of directives: crawler directives and indexer directives. I’ll briefly explain the difference below.


Crawler directives

The robots.txt file only contains the so-called ‘crawler directives’, which tell search engines where they are or aren’t allowed to go. With the Allow directive, you can specify where search engines are allowed to crawl; the Disallow directive does the exact opposite. Additionally, you can use the Sitemap directive to help search engines out and have them crawl your website even faster.

Note that it’s also possible to fine-tune the directives for a specific search engine by using the User-agent directive in combination with the other directives.
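To make that concrete, here’s a small, purely hypothetical robots.txt that combines these crawler directives; the paths and sitemap URL are placeholders:

# Applies to all crawlers
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap_index.xml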

As Sebastian points out and explains thoroughly in another post, pages can still show up in search results if enough links point to them, despite being explicitly excluded with the Disallow directive. This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt won’t suffice.

Indexer directives

Indexer directives are directives that are set on a per-page and/or per-element basis. Up until July 2007, there were two: the microformat rel="nofollow", which means that the link it is applied to should not pass authority / PageRank, and the meta robots tag.

With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file(types) are indexed.

Example uses of the X-Robots-Tag

Theory is nice and all, but let’s see how you could use the X-Robots-Tag in the wild!

If you want to prevent search engines from showing files you’ve generated with PHP, you could add the following at the top of the header.php file, before any output is sent:

header("X-Robots-Tag: noindex", true);

This would not prevent search engines from following the links on those pages. If you want to do that, then alter the previous example as follows:

header("X-Robots-Tag: noindex, nofollow", true);

Now, although using this method in PHP has its benefits, you’ll most likely end up wanting to block specific filetypes altogether. The more practical approach would be to add the X-Robots-Tag to your Apache server configuration or a .htaccess file.

Imagine you run a website which also has some .doc files, but you don’t want search engines to index that filetype for a particular reason. On Apache servers, you could add the following to your configuration or .htaccess file:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>

Or, if you’d want to do this for both .doc and .pdf files:

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>

If you’re running Nginx instead of Apache, you can get a similar result by adding the following to the server configuration:

location ~* \.(doc|pdf)$ {
	add_header X-Robots-Tag "index, noarchive, nosnippet";
}

There are cases in which the robots.txt file itself might show up in search results. By using a variation of the previous method, you can prevent this from happening to your website:

<FilesMatch "robots\.txt">
Header set X-Robots-Tag "noindex"
</FilesMatch>

And in Nginx:

location = /robots.txt {
	add_header X-Robots-Tag "noindex";
}

Conclusion

As you can see based on the examples above, the X-Robots-Tag HTTP header is a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident. Nevertheless, it’s a great addition to your toolset if you know how to use it.

Read more: ‘Meta robots tag: the ultimate guide’ »

HTTP 451: content unavailable for legal reasons

At the end of last year, a new HTTP status code saw the light. This status code, HTTP 451, is intended to be shown specifically when content has been blocked for legal reasons. If you’ve received a takedown request, or have been ordered by a judge to delete content, this is the status code that allows you to indicate that. The upcoming Yoast SEO Premium 3.1 release will have support for this new status code, allowing you to set an HTTP 451 status code for pages.


What does HTTP 451 mean?

The HTTP 451 header was introduced with the specific purpose of making it explicitly clear when content is blocked for legal reasons. Or, in the wording of the official draft:

This status code can be used to provide transparency in circumstances where issues of law or public policy affect server operations.

While the end result is the same as, for instance, a 403 Forbidden status code, this status code makes it much clearer what is happening. It might even make you search just a little bit deeper for what you were looking for. The original idea stems from this blog post, which is worth a read.

How to set an HTTP 451 header

There are two ways to set an HTTP 451 header:

Deleting the post or page

In the upcoming Yoast SEO Premium 3.1 release we’ve changed what happens when you delete a post or page. You will now get the following notice:

[Screenshot: the notice shown when deleting a post or page]

The link underneath “Read this post” links to my earlier article about what to do when deleting a post or page. Because we’re assuming that most of the time when you delete content, it has nothing to do with a court order (we sure hope so), we haven’t added the 451 option here.

Creating a header without deleting the post or page

You can also just keep the post or page alive, which is especially useful if the court order, injunction or whatever it is that forces you to block the page has a time constraint. Simply go to the redirects screen of Yoast SEO and create a 451 header for that specific URL:

[Screenshot: creating a 451 header in the Yoast SEO redirects screen]

An HTTP 451 template file

Along with the changes that allow you to set a 451 HTTP header, we’ve also added the option to have an HTTP 451 template file in your theme. It’s as simple as copying the 404.php file in your theme to 451.php and modifying its content to show an appropriate message.

I honestly hope you’ll never need this HTTP error, but if you do, you know now that you can do the right thing, provided you’re using Yoast SEO Premium!

Should we move to an all HTTPS web?

There was a bit of tweeting in the SEO community today because Bing introduced an HTTPS version of their site and people thought that would mean they’d lose their keyword data. That’s not true if you take the right precautions. I thought I’d write a bit of an intro into how all this works, so you can make an informed decision on what to do, and I’ll tell you what we will do.

Referrer data and keywords

When you click from http://example.com to http://yoast.com, your browser tells the website you went to (yoast.com in this case) where you came from. It does this through an HTTP header called the referrer. The referrer holds the URL of the previous page you were on. So if the previous page you were on was a search result page, it could look like this:

http://example.com/?q=example+search

If you clicked on that search result and came to yoast.com, I could “parse” that referrer: I could check whether it holds a q variable and then see what you searched for. This is what analytics packages have been doing for quite a while now: they keep a list of websites that are search engines and then parse the referrer data for visits from those search engines to obtain the searched-for keywords. So your analytics rely on the existence of that referrer to determine the keywords people searched for when they came to your site. And this is where a search engine moving to HTTPS starts causing some trouble.
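As a rough illustration of what that parsing looks like (not the code any particular analytics package actually uses), you could pull the keyword out of a referrer like the one above in PHP:

// Grab the referrer, if any, and when it points at our imaginary search
// engine (example.com), extract the 'q' parameter as the searched-for keyword.
$referrer = isset( $_SERVER['HTTP_REFERER'] ) ? $_SERVER['HTTP_REFERER'] : '';
if ( $referrer && parse_url( $referrer, PHP_URL_HOST ) === 'example.com' ) {
	parse_str( (string) parse_url( $referrer, PHP_URL_QUERY ), $params );
	if ( ! empty( $params['q'] ) ) {
		echo 'Searched for: ' . htmlspecialchars( $params['q'] );
	}
}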

HTTP, HTTPS and referrer data

The HTTPS protocol is designed so that if you go from an HTTPS page to an HTTP page, you lose all referrer data. That’s necessary because you’re going from an encrypted to an unencrypted connection, and if you passed data along there, you’d be breaking the security. If you go from HTTP to HTTPS or from HTTPS to HTTPS, this is not the case and the referrer is kept intact.

So if all search engines were on HTTPS and your site wasn’t, you’d never get keyword data. The solution for that is simple though: move your website to HTTPS and you’d suddenly have all your data back. This is the case with Bing’s HTTPS implementation: if you search on it and go to an HTTPS page from their results, the keyword data is all there, as you’d expect.

Google’s not provided

“But, but, but” I hear you think: would moving to HTTPS get me all my Google keywords as well? No. Google does some trickery when you click on a URL: they actually redirect you through another URL so that the site you visit does get referrer data (showing that you came from Google). They hide the keyword though, as they say that’s private data. Even if you think they’re right that keywords are private data, the wrong bit about what Google is doing is that they are still sending your keyword data to AdWords advertisers. I’ve written about that before in stronger words. If they were truly concerned about your privacy, they’d hide that data too.

I’d argue, in fact, that Google is breaking the web more than Bing here: even though I’m going from HTTPS to HTTP, Google is telling the website I visited that I came from Google. It shouldn’t. That’s just wrong.

Is this “right” in the first place?

I’ve been thinking a lot about this. Of course, as a marketer, I love keyword data. I love knowing what people searched for; I love being able to profile based on that. But is it right? Let’s compare it with a real-world case: say you’re shopping in a mall. You leave store A, and they put a sticker on your back. You enter store B and the shopkeeper there takes the sticker from your back and can see what you looked for in store A. You would argue against that, wouldn’t you? Now if you walk from section to section in a store and the shopkeeper can see that and help you based on that, there’s arguably not that much wrong with that.

Of course there’s more to this: in real life a shopkeeper can see you, your clothes, your behaviour, etc. And of course, shopkeepers target based on that too. Targeting always happens; perhaps it’s just that people should be more aware of it. In quite a few cases, it might actually be deemed helpful by the user too.

I’m thinking the same is true for referrer data on the web: if you go from site A to site B, perhaps referrer data shouldn’t be passed along. Within a site, though, it’s probably better if you do get that data. This is exactly what Aviator does, a browser that touts itself as the most secure browser on the planet. I think it’s an interesting concept. While as a marketer I’d hate losing all that data, as a person I think it’s the right thing to do.

Another thing I should mention here is the EFF’s HTTPS Everywhere project (whose logo I used at the top of this post), which helps you use HTTPS on websites that offer HTTPS but don’t default to it.

Should we all go to HTTPS with our websites?

Now that Bing has launched its HTTPS version (even though the vast, vast majority of their users still get the HTTP version by default as you have to switch to it yourself), it makes even more sense to move your website to HTTPS.

Here at Yoast.com we’ve always had every page that contained a contact form and our checkout pages on HTTPS and everything else on HTTP. The reason for this was that HTTPS was slower than HTTP and we’d rather not put everything on HTTPS because of that. Google’s recent work on SPDY actually negates most of that speed issue though, if your hosting party supports it. It was one of my reasons to switch to Synthesis a while back.

There’s another issue with mixed HTTP / HTTPS websites: they’re horrible to maintain when you’re on WordPress because WordPress mostly sucks at it. When you’re on an HTTPS page all internal links will be HTTPS and vice versa, which is annoying for search engines too.

So we’ll be changing, moving everything to HTTPS somewhere in the coming weeks. My suggestion is you do that too. If we’re all on HTTPS, we all get referrer data from each other (for now at least), we get keyword data from search engines like Bing that play nice and we get a more secure web. I’d say that’s a win-win situation. I’d love to hear what you think!


HTTP 503: Handling site maintenance correctly for SEO

Last week I got a few messages from Google Webmaster Tools saying it couldn’t access the robots.txt file on a client’s site. It turns out the client didn’t handle scheduled downtime correctly, causing problems with Google. While this article covers some rather basic technical SEO, the last bit might be interesting for more advanced users. The message from Google Webmaster Tools read like this:

Over the last 24 hours, Googlebot encountered 41 errors while attempting to access your robots.txt. To ensure that we didn’t crawl any pages listed in that file, we postponed our crawl. Your site’s overall robots.txt error rate is 7.0%

HTTP status codes and search engines

A search engine constantly verifies whether the content it’s linking to still exists and hasn’t changed. It verifies two things:

  1. is the content still being served with the correct HTTP status code (HTTP 200);
  2. is it still the same content.

An HTTP 200 status code means: all is well, here is the content you asked for. It is the only correct status code for content. If content has moved, you can redirect it, either permanently, with an HTTP 301 header, or temporarily, with an HTTP 302 or 307 header.

If your server gives any other HTTP status header, it means the search engine can no longer find the content. If your server gives a 200 HTTP status code, but the page is in fact an error and says something like “File not found” or has very little content, Google will classify it as a soft 404 in Google Webmaster Tools.

There is only one proper way of telling the search engine that you’re doing site maintenance:

How server downtime works for search engines

If, during a crawl, a search engine finds that some content no longer exists, i.e. it returns a 404 HTTP status, it will usually remove that content from the search results until it can come back and verify that it’s there again. If this happens often, it’ll take longer and longer for the content to come back into the search results.

What you should be doing is giving a 503 HTTP status code. This is the definition of the 503 status code from the RFC that defines these status codes:

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

So, you have to send a 503 status code in combination with a Retry-After header. Basically you’re saying: hang on, we’re doing some maintenance, please come back in X minutes. That sounds a lot better than what a 404 error says: “Not Found”. A 404 literally means that the server can’t find anything to return for the URL that was given.

How do I send a 503 header?

In PHP the code for a 503 would be like this:

$protocol = "HTTP/1.0";
if ( "HTTP/1.1" == $_SERVER["SERVER_PROTOCOL"] )
  $protocol = "HTTP/1.1";
header( "$protocol 503 Service Unavailable", true, 503 );
header( "Retry-After: 3600" );

The delay time, 3600 in the above example, is given in seconds, so 3600 corresponds to 60 minutes. You can also specify the exact time when the visitor should come back, by sending a GMT date instead of the number of seconds. This would result in something like this:

header( "Retry-After: Fri, 19 Mar 2013 12:00:00 GMT" );

Use that with caution though, setting it to a wrong date might give unexpected results!

Our site is never down, we’re on WordPress

Nonsense. Every time you upgrade your core WordPress install, or update plugins, WordPress will show a maintenance page. The default page sends out a proper 503 header. You can replace the default page with a maintenance.php file in your wp-content folder, but if you do, you have to make sure that file sends out the proper 503 headers too. You can copy the code from the wp_maintenance() function.
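If you do roll your own maintenance.php, a minimal sketch could look like the following; it’s loosely modelled on what wp_maintenance() does, so do check the real function before relying on it:

<?php
// wp-content/maintenance.php: shown while WordPress updates core or plugins.
// Send a proper 503 plus a Retry-After hint so search engines come back later.
$protocol = isset( $_SERVER['SERVER_PROTOCOL'] ) ? $_SERVER['SERVER_PROTOCOL'] : 'HTTP/1.0';
header( "$protocol 503 Service Unavailable", true, 503 );
header( 'Content-Type: text/html; charset=utf-8' );
header( 'Retry-After: 600' );
?>
<p>Briefly unavailable for scheduled maintenance. Check back in a minute.</p>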

If your database is down, WordPress actually sends an internal server error, using the dead_db() function. If you’re doing planned maintenance on your database, therefore, you’ll need to set up a custom database error page, db-error.php, in your wp-content folder that sends a proper 503 header.

Beware caches!

So where did our client go wrong?

Funnily enough, our client had properly configured 503 headers on their server. There was an issue though: they use a Varnish cache, and that Varnish didn’t pass the 503 status code through correctly; it replaced it with a “general” HTTP 500 status, causing Google to send out that error email. I haven’t had a chance to test whether that is default Varnish behavior or something they broke, but it’s worth testing for your environment.

Pro tip: sending a 503 for your robots.txt

Per this post from Pierre Far of Google, if you send an HTTP 503 status code for your robots.txt, Google will halt all crawling on your domain until it’s allowed to crawl the robots.txt again. This is actually a very useful way of preventing load on your server while doing maintenance. It does still require you to send a 503 for every URL on your server, including all the static ones, but after Google has re-fetched the robots.txt, it’ll probably stop hammering your server(s) for a while.
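If you’re on Apache and want to try this, a heavily hedged .htaccess sketch could look like this; it assumes mod_rewrite and mod_headers are enabled, and 203.0.113.42 stands in for your own IP address so you can still reach the site yourself:

# Answer every request, robots.txt included, with a 503 during maintenance.
ErrorDocument 503 "Down for maintenance, please retry later."
Header always set Retry-After "3600"
RewriteEngine On
RewriteCond %{REMOTE_ADDR} !^203\.0\.113\.42$
RewriteRule .* - [R=503,L]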

Conclusion: know what HTTP headers you’re sending

While writing this article, I was reminded of a tweet quoting Vanessa Fox during last week’s SMX West:

I couldn’t agree more and would add to that: at all times. Now go check those headers!

