How to keep your page out of the search results

If you want to keep your page out of the search results, there are a number of things you can do. Most of them aren’t hard, and you can implement them without a ton of technical knowledge. If you can check a box, your content management system probably has an option for it, or allows nifty plugins like our own Yoast SEO to help you prevent the page from showing up in search results. In this post, I won’t give you complicated ways to go about this. I will simply tell you what steps to take and what to consider.

Why do you want to keep your page out of the search results?

It sounds like a simple question, but it’s not, really. Why do you want to keep your page out of the search results in the first place? If you don’t want that page indexed, perhaps you shouldn’t publish it? There are obvious reasons to keep, for instance, your internal search result pages out of Google’s search result pages, or a “Thank you” page after an order or newsletter subscription that is of no use to other visitors. But when it comes to your actual, informative pages, there really should be a good reason to block them. Feel free to drop yours in the comments below this post.

If you don’t have a good reason, simply don’t write that page.

Private pages

If your website contains a section that is targeted at, for instance, an internal audience or a so-called extranet, you should consider offering that information password-protected. A section of your site that can only be reached after filling out login details won’t be indexed. Search engines simply have no way to log in and visit these pages.
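
If your site runs on Apache, for example, a minimal sketch of such protection in an .htaccess file could look like this (the path to the .htpasswd password file is a placeholder, and the relevant Apache auth modules need to be enabled):

AuthType Basic
AuthName "Internal area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user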

If you are using WordPress, and are planning a section like this on your site, please read Chris Lema’s article about the membership plugins he compared.

Noindex your page

Like that aforementioned “Thank you” page, there might be more pages you want to block. And you might even have some left after critically reviewing whether certain pages should be on your site at all. The right way to keep a page out of the search results is to add a robots meta tag. We have written a lengthy article about that robots meta tag before; be sure to read that.

Adding it to your page is simple: you need to add that tag to the <head> section of your page, in the source code. You’ll find examples from the major search engines linked in the robots meta article as well.
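
As a minimal sketch, assuming a “Thank you” page you want to hide, the relevant part of the source would look something like this:

<head>
  <title>Thank you for your order</title>
  <meta name="robots" content="noindex" />
</head>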

Are you using WordPress, TYPO3 or Magento? Things are even easier. Please read on.

Noindex your page with Yoast SEO

The above-mentioned content management systems have the option to install our Yoast SEO plugin or extension. In that plugin or extension, you can noindex a page right from your editor.

In this example, I’ll use screenshots from the meta box in Yoast SEO for WordPress. You’ll find it in the post or page editor, below the copy you’ve written. In Magento and TYPO3 you can find it in similar locations.

How to keep your site out of the search results using Yoast SEO

Advanced tab Yoast SEO meta box

Click the Advanced tab in our Yoast SEO meta box. It’s the cog symbol on the left.
Use the selector at “Allow search engines to show this post/page in search results”: simply set it to “No” and you’re done.

The second option in the screenshot is about following the links on that page. That allows you to keep your page out of the search results while still letting search engines follow the links on that page, as these (internal) links matter for your other pages (again, read the robots meta article for more information). The third option: leave it as is; it reflects what you have set for the site-wide robots meta settings.

It’s really that simple: select the right value and your page will tell search engines to either keep it in or leave it out of the search results.

The last thing I want to mention here is: use this with care. This robots meta setting will truly prevent a page from being indexed, unlike a robots.txt suggestion to leave a page out of the search result pages, which Google might ignore when there are a lot of inbound links to the page.

If you want to read up on how to keep your site from being indexed, please read Preventing your site from being indexed, the right way. Good luck optimizing!

SEO for a new website: the very first things to do

How does a new website start ranking? Does it just magically appear in Google after you’ve launched it? What things do you have to do to start ranking in Google and get traffic from the search engines? Here, I explain the first steps you’ll need to take right after the launch of your new website. Learn how to start working on the SEO for a new website!

First: you’ll need to have an external link

One of my closest friends launched an online store for birthday party packages last week. It’s all in Dutch and it’s not WordPress (wrong choice of course, but I love her all the same :-)). After my friend launched her website, she celebrated and asked her friends, including me, what they thought of her new site. I love her site, but I couldn’t find her in Google, not even when I googled the exact domain name. My first question to my friend was: do you have another site linking to your site? And her answer was ‘no’. I linked to her site from my personal site and after half a day, her website popped up in the search results. So the very first step when working on SEO for a new website: get at least one external link.

Why do you need an external link?

Google is a search engine that follows links. For Google to know about your site, it has to find it by following a link from another site. Google found my friend’s site because I put a link to it on my personal site. When Google came around to crawl my site after I placed that link, it discovered the existence of my friend’s site and indexed it. After indexing the site, it started to show it in the search results.

Read more: ‘What does Google do?’ »

Next step: tweak your settings…

After that first link, your site will probably turn up in the search results. If it doesn’t, it could be that your site is set to noindex or is still blocked by robots.txt. If that’s the case, you’re telling Google not to index your site. Sometimes developers forget to turn either of these off after they’ve finished working on your site.
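
Two things to check, as a sketch. First, the robots.txt file in the root of your domain shouldn’t block the entire site the way a leftover development setting like this does:

User-agent: *
Disallow: /

Second, your pages shouldn’t still carry a leftover noindex robots meta tag in their <head>:

<meta name="robots" content="noindex" />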

Some pages are just not the best landing pages. You don’t want people landing on your checkout page, for instance. And you don’t want this page to compete with other – useful – content or product pages in the search results. Pages you never want to pop up in the search results (but there aren’t many of these) should have a noindex.

Yoast SEO can help you set these pages to noindex. That means Google will not save the page in its index and it won’t turn up in the search results.

Keep reading: ‘The ultimate guide to the robots meta tag’ »

Important third step: keyword research

My friend’s site now ranks on her domain name. That’s about it. She’s got some work to do to start ranking on other terms as well. When you want to improve the SEO for a new website, you have to carry out some proper keyword research. So go find out what your audience is searching for! What words do they use?

If you execute your keyword research properly, you’ll end up with a long list of search terms you want to be found for. Make sure to search for those terms in Google yourself. What results are there already? Who will be your online competitors for these search terms? What can you do to stand out from these results?

Read on: ‘Keyword research: the ultimate guide’ »

And then: write, write, write

Then you start writing. Write about all those topics that are important to your audience. Use the words you came up with in your keyword research. You need to have content about the topics you want to rank for to start ranking in the search results.

Read more: ‘How to write a high-quality and SEO-friendly blog post’ »

But also: improve those snippets

Take a look at your results in the search engines once you start ranking (the so-called snippets). Are the meta descriptions and the titles of the search results inviting? Are they tempting enough for your audience to click on them? Or should you write better ones?

Yoast SEO helps you to write great titles and meta descriptions. Use our snippet preview to create awesome snippets. That’ll really help in attracting traffic to your site.

Keep reading: ‘The snippet preview: what it means and how to use it?’ »

Think about site structure

Which pages and posts are most important? These should have other pages and posts linking to them. Make sure to link to your most important content. Google will follow your links, and the posts and pages that have the most internal links are the most likely to rank high in the search engines. Setting up such a structure is basically telling Google which articles are important and which aren’t. Our brand new text link counter can be a great help in checking whether you’re linking often enough to your most important content.

Read on: ‘Internal linking for SEO: why and how’ »

Finally: do some link building

Google follows links. Links are important. So get the word out. Reach out to other site owners – preferably of topically related websites – and ask them to write about your new site. If Google follows multiple links to your website, it’ll crawl it more often. This is crucial when you do the SEO for a new website, and it will eventually help your rankings. Don’t go overboard with link building for SEO, though; buying links is still a no-go:

Read more: ‘Link building: what not to do?’ »

Preventing your site from being indexed, the right way

We’ve said it in 2009, and we’ll say it again: it keeps amazing us that there are still people using just a robots.txt file to prevent indexing of their site in Google or Bing. As a result, their site shows up in the search engines anyway. You know why it keeps amazing us? Because robots.txt doesn’t actually keep your site out of the search results, even though it does prevent indexing of your site. Let me explain how this works in this post.

For more on robots.txt, please read robots.txt: the ultimate guide.

There is a difference between being indexed and being listed in Google

Before we explain things any further, we need to go over some terms here first:

  • Indexed / Indexing
    The process of downloading a site or a page’s content to the server of the search engine, thereby adding it to its “index”.
  • Ranking / Listing / Showing
    Showing a site in the search result pages (aka SERPs).

So, while the most common process goes from Indexing to Listing, a site doesn’t have to be indexed to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents indexing of that page by a search engine, the search engine will still show the URL in the results if it can gather from other variables that it might be worth looking at. In the old days, that could have been DMOZ or the Yahoo directory, but I can imagine Google using, for instance, your My Business details these days, or the old data from these projects. There are more sites that summarize your website, right?

Now if the explanation above doesn’t make sense, have a look at this 2009 Matt Cutts video explanation:

If you have reasons to prevent indexing of your website, adding that request to the specific page you want to block, like Matt is talking about, is still the right way to go. But you’ll need to inform Google about that meta robots tag. So, if you want to effectively hide pages from the search engines, you need them to be able to crawl those pages and see that tag, even though that might seem contradictory. There are two ways of doing that.

Prevent listing of your page by adding a meta robots tag

The first option to prevent listing of your page is by using robots meta tags. We’ve got an ultimate guide on robots meta tags that’s more extensive, but it basically comes down to adding this tag to your page:

<meta name="robots" content="noindex,nofollow">

The issue with a tag like that is that you have to add it to each and every page.

Or by adding an X-Robots-Tag HTTP header

To make the process of adding the meta robots tag to every single page of your site a bit easier, the search engines came up with the X-Robots-Tag HTTP header. This allows you to specify an HTTP header called X-Robots-Tag and set the value as you would the meta robots tag’s value. The cool thing about this is that you can do it for an entire site. If your site is running on Apache, and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:

Header set X-Robots-Tag "noindex, nofollow"

And this would have the effect that the entire site can be crawled, but will never be shown in the search results.
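
If your site runs on Nginx instead of Apache, a comparable sketch is to add the same header in your server configuration block:

add_header X-Robots-Tag "noindex, nofollow";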

So, get rid of that robots.txt file with Disallow: / in it. Use the X-Robots-Tag or that meta robots tag instead!

Read more: ‘The ultimate guide to the meta robots tag’ »

Ask Yoast: Block your site’s search results pages?

Every website should have decent internal search functionality that shows visitors search results that fit their search query. However, those search results pages on your site don’t need to be shown in Google’s search results. In fact, Google advises against this too; it’s not a great user experience to click on a Google search result, only to end up on a search result page of your site. Learn the best practice to prevent this from happening!

User experience is not the only reason to prevent Google from including these pages in their search results. Spam domains can also abuse your search results pages, which is what happened to Krunoslav from Croatia. He therefore emailed Ask Yoast:

“Some spam domains were linking to the search results pages on my WordPress site. So what could I do to block Google from accessing my site search results? Is there any code that I could put in robots.txt?”

Check out the video or read the answer below!

Block your search results pages?

In the video, we explain what you could do to prevent Google from showing your site’s search results:

“Well, to be honest, I don’t think I would block them. What you could do, is try two different things:

1. One is do nothing and run our Yoast SEO plugin. We’ll automatically noindex all the search result pages on your site. But if that leads to weird rankings or to other stuff that is not really working for you, then you could do another thing:

2. The second way is to block them and put a Disallow: /?s=* line in your robots.txt. This basically means that you’re blocking Google from crawling all of your search result URLs. I don’t know whether that’s the best solution though.

I would try noindex first and see if that does anything. If it doesn’t, then use the method of blocking your search results in your robots.txt.

Good luck!”
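
For reference, a robots.txt rule along the lines of that second option could look like the sketch below, assuming your site uses WordPress’s default ?s= search parameter (adjust the pattern if your search URLs look different):

User-agent: *
Disallow: /?s=
Disallow: /*?s=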

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers. Need some advice about SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘Block your site’s search results pages’ »

Block your site’s search result pages

Why should you block your internal search result pages for Google? Well, how would you feel if you were in dire need of an answer to your search query and ended up on the internal search pages of a certain website? That’s one crappy experience. Google thinks so too, and prefers you not to have these internal search pages indexed.

Google considers these search results pages to be of lower quality than your actual informational pages. That doesn’t mean these internal search pages are useless, but it does make sense to block them.

Back in 2007

10 Years ago, Google, or more specifically Matt Cutts, told us that we should block these pages in our robots.txt. The reason for that:

Typically, web search results don’t add value to users, and since our core goal is to provide the best search results possible, we generally exclude search results from our web search index. (Not all URLs that contain things like “/results” or “/search” are search results, of course.)
– Matt Cutts (2007)

Nothing changed, really. Even after 10 years of SEO changes, this remains the same. The Google Webmaster Guidelines still state that you should “Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages.” Furthermore, the guidelines state that webmasters should avoid techniques like automatically generated content, in this case, “Stitching or combining content from different web pages without adding sufficient value”.

However, blocking internal search pages in your robots.txt doesn’t seem like the right solution. In 2007, it even made more sense to simply redirect the user to the first result of these internal search pages. These days, I’d rather use a slightly different solution.

Blocking internal search pages in 2017

I believe nowadays, using a noindex, follow meta robots tag is the way to go instead. It seems Google ‘listens’ to that meta robots tag and sometimes ignores the robots.txt. That happens, for instance, when a surplus of backlinks to a blocked page tells Google it is of interest to the public anyway. We’ve already mentioned this in our Ultimate guide to robots.txt.

The 2007 reason is still the same in 2017, by the way: linking to search pages from search pages delivers a poor experience for a visitor. For Google, on a mission to deliver the best result for your query, it makes a lot more sense to link directly to an article or another informative page.

Yoast SEO will block internal search pages for you

If you’re on WordPress and using our plugin, you’re fine. We’ve got you covered:

Block internal search pages

That’s located at SEO › Titles & Metas › Archives. Most other content management systems allow for templates for your site’s search results as well, so adding a simple line of code to that template will suffice:
<meta name="robots" content="noindex,follow"/>
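
If you’re not using a plugin that handles this for you, a minimal sketch for a WordPress theme would be to output that tag from header.php only on search result pages, using WordPress’s built-in is_search() check:

<?php if ( is_search() ) : ?>
	<meta name="robots" content="noindex,follow"/>
<?php endif; ?>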

Meta robots AND robots.txt?

If you try to block internal search pages by adding that meta robots tag and disallowing these pages in your robots.txt, please think again. Just the meta robots tag will do. Otherwise, you’ll risk losing the link value of these pages (hence the follow in the meta tag). If Google obeys your robots.txt, it will never crawl the page and therefore never see the meta robots tag, right? And that’s not what you want. So just use the meta robots tag!

Back to you

Did you block your internal search results? And how did you do that? Go check for yourself! Any further insights or experiences are appreciated; just drop us a line in the comments.

Read more: ‘Robots.txt: the ultimate guide’ »

SEO basics: What is crawlability?

Ranking in the search engines requires a website with flawless technical SEO. Luckily, the Yoast SEO plugin takes care of (almost) everything on your WordPress site. Still, if you really want to get the most out of your website and keep outranking the competition, some basic knowledge of technical SEO is a must. In this post, I’ll explain one of the most important concepts of technical SEO: crawlability.

What is the crawler again?

A search engine like Google consists of a crawler, an index and an algorithm. The crawler follows links. When Google’s crawler finds your website, it reads it, and the content is saved in the index.

A crawler follows the links on the web. A crawler is also called a robot, a bot, or a spider. It goes around the internet 24/7. Once it comes to a website, it saves the HTML version of a page in a gigantic database, called the index. This index is updated every time the crawler comes around your website and finds a new or revised version of it. Depending on how important Google deems your site and the amount of changes you make on your website, the crawler comes around more or less often.

Read more: ‘SEO basics: what does Google do’ »

And what is crawlability?

Crawlability has to do with the possibilities Google has to crawl your website. Crawlers can be blocked from your site in a few ways. If your website or a page on your website is blocked, you’re saying to Google’s crawler: “do not come here”. In most of these cases, your site or the respective page won’t turn up in the search results.
There are a few things that could prevent Google from crawling (or indexing) your website:

  • If your robots.txt file blocks the crawler, Google will not come to your website or specific web page.
  • Before crawling your website, the crawler will take a look at the HTTP header of your page. This HTTP header contains a status code. If this status code says that a page doesn’t exist, Google won’t crawl your website (see the example response below). In the module about HTTP headers of our (soon to be launched!) Technical SEO training, we’ll tell you all about that.
  • If the robots meta tag on a specific page blocks the search engine from indexing that page, Google will crawl that page, but won’t add it to its index.
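
To make the HTTP header check a bit more concrete, the hypothetical response headers below are the kind of thing a crawler inspects: the status code on the first line tells it whether the page exists at all, and a robots directive (here sent as an X-Robots-Tag header instead of a meta tag) tells it to keep the page out of the index:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
X-Robots-Tag: noindex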

This flow chart might help you understand the process bots follow when attempting to index a page:

Want to learn all about crawlability?

Although crawlability is just the very basics of technical SEO (it has to do with all the things that enable Google to index your site), for most people it’s already pretty advanced stuff. Nevertheless, if you’re blocking – perhaps even without knowing! – crawlers from your site, you’ll never rank high in Google. So, if you’re serious about SEO, this should matter to you.

If you really want to understand all the technical aspects concerning crawlability, you should definitely check out our Technical SEO 1 training, which will be released this week. In this SEO course, we’ll teach you how to detect technical SEO issues and how to solve them (with our Yoast SEO plugin).

Keep reading: ‘How to get Google to crawl your site faster’ »

 

Ask Yoast: should I redirect my affiliate links?

There are several reasons for cloaking or redirecting affiliate links. For instance, it’s easier to work with affiliate links when you redirect them, plus you can make them look prettier. But do you know how to cloak affiliate links? We explained how the process works in one of our previous posts. This Ask Yoast is about the method of cloaking affiliate links we gave you in that post. Is it still a good idea to redirect affiliate links via the script we described?

Elias Nilson emailed us, saying that he read our article about cloaking affiliate links and he’s wondering if the solution is still up-to-date.

“Is it still a good idea to redirect affiliate links via the script you describe in your post?”

Check out the video or read the answer below!

Redirect affiliate links

Read this transcript to figure out if it is still a valid option to redirect affiliate links via the described script. Want to see the script directly? Read this post: ‘How to cloak affiliate links’:

“Honestly, yes. We recently updated the post about cloaking affiliate links, so the post, and therefore the script, is still up to date. Link cloaking, which only sounds negative because we use the word cloaking, is basically hiding from Google that you’re an affiliate. And if you’re an affiliate, that’s still something you want to do, because Google usually ranks original content that is not by affiliates better than affiliate content.

So, yes, I’d still recommend that method. The link will be below this post, so you can see the original post that we are referring to. It’s a very simple method to cloak your affiliate links, and I think it works in probably the best way that I know of.

So, keep going. Good luck.”

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers. Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘How to cloak your affiliate links’ »

Ask Yoast: nofollow layered navigation links?

If you have a big eCommerce site with lots of products, layered navigation can help your users to narrow down their search results. Layered or faceted navigation is an advanced way of filtering by providing groups of filters for (many) product attributes. In this filtering process, you might create a lot of URLs though, because the user will be able to filter and thereby group items in many ways, and those groups will all be available on separate URLs. So what should you do with all these URLs? Do you want Google to crawl them all?

In this Ask Yoast, we’ll answer a question from Daniel Jacobsen:

“Should I nofollow layered navigation links? And if so, why? Are there any disadvantages of this?”

Check out the video or read the answer below!

Layered navigation links

Read this transcript to learn how to deal with layered or faceted navigation links:

“The question is: “Why would you want to do that?” If you have too many URLs, so if you have a layered or faceted navigation that has far too many options – creating billions of different types of URLs for Google to crawl – then probably yes. At the same time you need to ask yourself: “Why does my navigation work that way?” And, “Can we make it any different?” But in a lot of eCommerce systems that’s very hard. So in those cases adding a nofollow to those links does actually help to prevent Google from indexing each and every one of the versions of your site.

I’ve worked on a couple of sites with faceted navigation that had over a billion variations in URLs, even though they only had like 10,000 products. If that’s the sort of problem you have, then yes, you need to nofollow them and maybe you even need to use your robots.txt file to exclude some of those variants. So for specific stuff that you don’t want indexed, for instance if you don’t want color indexed, you could add a robots.txt line that says: “Disallow everything that has color in the URL”. At that point you strip down what Google crawls and what it thinks is important. The problem with that is that if Google has links pointing at that version from somewhere else, those links don’t count for your site’s ranking either.

So it’s a bit of a quid pro quo, where you have to think about what is the best thing to do. It’s a tough decision. I really would suggest getting an experienced technical SEO to look at your site if it really is a problem, because it’s not a simple cut-and-paste solution that works the same for every site.

Good luck!”
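
As a sketch of that first suggestion, a nofollowed filter link in a faceted navigation could look like this (the URL and the color parameter are purely hypothetical examples):

<a href="/shoes?color=red" rel="nofollow">Red</a>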

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers! Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘Internal search for online shops: an essential asset’ »

Playing with the X-Robots-Tag HTTP header

Traditionally, you will use a robots.txt file on your server to manage what pages, folders, subdomains or other content search engines will be allowed to crawl. But did you know that there’s also such a thing as the X-Robots-Tag HTTP header? In this post we’ll discuss what the possibilities are and how this might be a better option for your blog.

Quick recap: robots.txt

Before we continue, let’s take a look at what a robots.txt file does. In a nutshell, what it does is tell search engines to not crawl a particular page, file or directory of your website.

Using this helps both you and search engines such as Google. By not providing access to certain unimportant areas of your website, you can save on your crawl budget and reduce the load on your server.

Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.

Say hello to X-Robots-Tag

Back in 2007, Google announced that they had added support for the X-Robots-Tag directive. What this meant was that you not only could restrict access for search engines via a robots.txt file, you could also programmatically set robots directives in the headers of an HTTP response. Now, you might be thinking “But can’t I just use the robots meta tag instead?”. The answer is yes. And no. If you plan on programmatically blocking a particular page that is written in HTML, then using the meta tag should suffice. But if you want to keep, let’s say, an image out of the search results, then you could use the HTTP response approach to do this in code. Obviously, you can always use the latter method if you don’t feel like adding additional HTML to your website.

X-Robots-Tag directives

As Sebastian explained in 2008, there are two different kinds of directives: crawler directives and indexer directives. I’ll briefly explain the difference below.

Crawler directives

The robots.txt file only contains the so-called ‘crawler directives’, which tell search engines where they are or aren’t allowed to go. With the Allow directive, you can specify where search engines are allowed to crawl; the Disallow directive does the exact opposite. Additionally, you can use the Sitemap directive to point search engines to your XML sitemap, which helps them crawl your website even faster.

Note that it’s also possible to fine-tune the directives for a specific search engine by using the User-agent directive in combination with the other directives.

As Sebastian points out and explains thoroughly in another post, pages can still show up in search results if enough links point to them, despite being explicitly excluded with the Disallow directive. This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt won’t suffice.
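
Put together, a purely illustrative robots.txt file using these crawler directives could look like the sketch below; the paths and the sitemap URL are placeholders, not recommendations:

# Rules for all crawlers
User-agent: *
Disallow: /internal/
Allow: /internal/public-report.pdf

# An extra rule for one specific crawler
User-agent: Googlebot
Disallow: /testing/

Sitemap: https://www.example.com/sitemap_index.xml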

Indexer directives

Indexer directives are directives that are set on a per page and/or per element basis. Up until July 2007, there were two directives: the microformat rel=”nofollow”, which means that that link should not pass authority / PageRank, and the Meta Robots tag.

With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file(types) are indexed.

Example uses of the X-Robots-Tag

Theory is nice and all, but let’s see how you could use the X-Robots-Tag in the wild!

If you want to prevent search engines from showing files you’ve generated with PHP, you could add the following at the top of your header.php file:

header("X-Robots-Tag: noindex", true);

This would not prevent search engines from following the links on those pages. If you want to do that, then alter the previous example as follows:

header("X-Robots-Tag: noindex, nofollow", true);

Now, although using this method in PHP has its benefits, you’ll most likely end up wanting to block specific filetypes altogether. The more practical approach would be to add the X-Robots-Tag to your Apache server configuration or a .htaccess file.

Imagine you run a website which also has some .doc files, but you don’t want search engines to index that filetype for a particular reason. On Apache servers, you should add the following line to the configuration / a .htaccess file:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

Or, if you’d want to do this for both .doc and .pdf files:

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

If you’re running Nginx instead of Apache, you can get a similar result by adding the following to the server configuration:

location ~* \.(doc|pdf)$ {
	add_header X-Robots-Tag "noindex, noarchive, nosnippet";
}

There are cases in which the robots.txt file itself might show up in search results. By using an alteration of the previous method, you can prevent this from happening to your website:

<FilesMatch "robots\.txt">
Header set X-Robots-Tag "noindex"
</FilesMatch>

And in Nginx:

location = /robots.txt {
	add_header X-Robots-Tag "noindex";
}

Conclusion

As you can see based on the examples above, the X-Robots-Tag HTTP header is a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident. Nevertheless, it’s a great addition to your toolset if you know how to use it.

Read more: ‘Meta robots tag: the ultimate guide’ »

Crawl budget optimization

Google doesn’t always spider every page on a site instantly. In fact, sometimes it can take weeks. This might get in the way of your SEO efforts. Your newly optimized landing page might not get indexed. At that point, it becomes time to optimize your crawl budget.

Crawl budget is the time Google has in a given period to crawl your site. It might crawl 6 pages a day, it might crawl 5,000 pages, it might even crawl 4,000,000 pages every single day. This depends on many factors, which we’ll discuss in this article. Some of these factors are things you can influence.

How does a crawler work?

A crawler like Googlebot gets a list of URLs to crawl on a site. It goes through that list systematically. It grabs your robots.txt file every once in a while to make sure it’s still allowed to crawl each URL, and then crawls the URLs one by one. Once a spider has crawled a URL and parsed the contents, it adds any new URLs it has found on that page to the to-do list.

Several events can make Google feel a URL has to be crawled. It might have found new links pointing at content, someone might have tweeted it, or it might have been updated in the XML sitemap, and so on. There’s no way to make a list of all the reasons why Google would crawl a URL, but when it determines it has to, it adds it to the to-do list.

What is crawl budget?

Crawl budget is the number of pages Google will crawl on your site on any given day. This number varies slightly from day to day, but overall it’s relatively stable. The number of pages Google crawls, your “budget”, is generally determined by the size of your site, the “health” of your site (how many errors Google encounters) and the number of links to your site. 

When is crawl budget an issue?

Crawl budget is not a problem if Google crawls enough pages to cover all the URLs on your site. But say your site has 250,000 pages and Google crawls 2,500 pages on this particular site each day, some of them (like the homepage) more often than others. It could then take up to 200 days before Google notices particular changes to your pages if you don’t act. Crawl budget is an issue now. If it crawls 50,000 a day, there’s no issue at all.

To quickly determine whether your site has a crawl budget issue, follow the steps below. This does assume your site has a relatively small number of URLs that Google crawls but doesn’t index (for instance because you added meta noindex).

  1. Determine how many pages you have on your site, the number of your URLs in your XML sitemaps might be a good start.
  2. Go into Google Search Console.
  3. Go to Crawl -> Crawl stats and take note of the average pages crawled per day.
  4. Divide the number of pages by the “Average crawled per day” number.
  5. If you end up with a number higher than ~10 (so you have 10x more pages than what Google crawls each day), you should optimize your crawl budget. If you end up with a number lower than 3, go read something else. 
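
As a quick worked example, take the numbers used earlier in this post:

250,000 pages / 2,500 pages crawled per day = 100

That ratio is well above 10, so a site like that should work on its crawl budget.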

What URLs is Google crawling?

You really should know which URLs Google is crawling on your site. The only “real” way of knowing that is looking at your site’s server logs. For larger sites I personally prefer using Logstash + Kibana for that. For smaller sites, the guys at Screaming Frog have released quite a nice little tool, aptly called SEO Log File Analyser (note the S, they’re Brits).

Get your server logs and look at them

Depending on your type of hosting, you might not always be able to grab your log files. However, if you even so much as think you need to work on crawl budget optimization because your site is big, you should get them. If your host doesn’t allow you to get them, change hosts.

Fixing your site’s crawl budget is a lot like fixing a car. You can’t fix it by looking at the outside; you’ll have to open up that engine. Looking at logs is going to be scary at first, and you’ll quickly find that there is a lot of noise in them. You’ll find a lot of commonly occurring 404s that you think are nonsense. But you have to fix them. You have to get through the noise and make sure your site is not drowned in tons of old 404s.

Increase your crawl budget

Let’s look at the things that actually improve how many pages Google can crawl on your site.

Website maintenance: reduce errors

Step one in getting more pages crawled is making sure that the pages that are crawled return one of two possible return codes: 200 (for “OK”) or 301 (for “Go here instead”). All other return codes are not OK. To figure this out, you have to look at your site’s server logs. Google Analytics and most other analytics packages will only track pages that served a 200. So you won’t find many of the errors on your site in there.

Once you’ve got your server logs, try to find common errors and fix them. The simplest way of doing that is to grab all the URLs that didn’t return 200 or 301 and order them by how often they were accessed. Fixing an error might mean that you have to fix code, or you might have to redirect a URL elsewhere. If you know what caused the error, you can try to fix the source too.
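
As a rough sketch of that approach in PHP, assuming your log is in the common Apache/Nginx “combined” format and lives at logs/access.log (both the path and the format depend on your own setup):

<?php
// Count hits per URL for every request that didn't return a 200 or a 301.
$counts = [];
$handle = fopen( 'logs/access.log', 'r' );
if ( ! $handle ) {
	exit( "Could not open log file\n" );
}
while ( ( $line = fgets( $handle ) ) !== false ) {
	// Combined log format contains: "GET /some/url HTTP/1.1" 404
	if ( preg_match( '#"[A-Z]+ (\S+) HTTP/[^"]*" (\d{3})#', $line, $m ) ) {
		if ( $m[2] !== '200' && $m[2] !== '301' ) {
			$key = $m[2] . ' ' . $m[1];
			$counts[ $key ] = ( $counts[ $key ] ?? 0 ) + 1;
		}
	}
}
fclose( $handle );

// Most frequently requested errors first.
arsort( $counts );
foreach ( array_slice( $counts, 0, 25, true ) as $key => $hits ) {
	echo $hits . "\t" . $key . PHP_EOL;
}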

Another good source to find errors is Google Search Console. Read this post by Michiel for more info on that. If you’re using Yoast SEO, connecting your site to Google Search Console through the plugin allows you to easily retrieve all those errors. If you’ve got Yoast SEO Premium, you can even redirect them away easily using the redirects manager.

Block parts of your site

If you have sections of your site that really don’t need to be in Google, block them using robots.txt. Only do this if you know what you’re doing, of course. One of the common problems we see on larger eCommerce sites is that they have a gazillion ways to filter products. Every filter might add new URLs for Google. In cases like these, you really want to make sure that you’re letting Google spider only one or two of those filters and not all of them.
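
For example, a purely hypothetical set of robots.txt rules that keeps color and size filter URLs out of Google’s crawl, while leaving the rest of the shop crawlable, could look like this (the parameter names are just placeholders for whatever your filters use):

User-agent: *
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=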

Reduce redirect chains

When you 301 redirect a URL, something weird happens. Google will see that new URL and add it to the to-do list. It doesn’t always follow it immediately; it adds it to its to-do list and just goes on. When you chain redirects, for instance when you redirect non-www to www and then http to https, you have two redirects everywhere, so everything takes longer to crawl.
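
One way to avoid such a chain on Apache, sketched here with example.com as a placeholder for your own domain, is to send both the http and the non-www variants straight to the final https://www URL in a single 301 (this assumes mod_rewrite is enabled):

RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]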

Get more links

This is easy to say, but hard to do. Getting more links is not just a matter of being awesome; it’s also a matter of making sure others know that you’re awesome too. It’s a matter of good PR and good engagement on social media. We’ve written extensively about link building, so I’d suggest reading these 3 posts:

  1. Link building from a holistic SEO perspective
  2. Link building: what not to do?
  3. 6 steps to a successful link building strategy

When you have an acute indexation problem, you should definitely look at your crawl errors, blocking parts of your site and at fixing redirect chains first. Link building is a very slow method to increase your crawl budget. On the other hand: if you intend on building a large site, link building needs to be part of your process.

AMP and your crawl budget

Google is telling everyone to use Accelerated Mobile Pages, in short: AMP. These are “lighter” versions of web pages, specifically aimed at mobile. The problem with AMP is that it means adding a separate URL for every page you have. You’d get example.com/page/ and example.com/page/amp/. This means you need double the crawl budget for your site. If you have crawl budget issues already, don’t start working on AMP just yet. We’ve written about it twice, but find that for sites that do not serve news, it’s not worth it yet.

TL;DR: crawl budget optimization is hard

Crawl budget optimization is not for the faint of heart. If you’re doing your site’s maintenance well, or your site is relatively small, it’s probably not needed. If your site is medium sized and well maintained, it’s fairly easy to do based on the above tricks. If you find, after looking at some error logs, that you’re in over your head, it might be time to call in someone more experienced.

Read more: ‘Robots.txt: the ultimate guide’ »