SEO for a new website: the very first things to do

How does a new website start ranking? Does it just magically appear in Google after you’ve launched it? What things do you have to do to start ranking in Google and get traffic from the search engines? Here, I explain the first steps you’ll need to take right after the launch of your new website. Learn how to start working on the SEO for a new website!

First: you’ll need to have an external link

One of my closest friends launched an online store for birthday party packages last week. It’s all in Dutch and it’s not WordPress (the wrong choice of course, but I love her all the same :-)). After my friend launched her website, she celebrated and asked her friends, including me, what they thought of her new site. I love her site, but I couldn’t find it in Google, not even when I googled the exact domain name. My first question to my friend was: do you have another site linking to your site? Her answer was ‘no’. I linked to her site from my personal site, and after half a day her website popped up in the search results. That’s the very first step when working on SEO for a new website: getting at least one external link.

Why do you need an external link?

Google is a search engine that follows links. For Google to know about your site, it has to find it by following a link from another site. Google found my friend’s site because I put a link to that site on my personal site. When Google came around to crawl my site after I put the link there, it discovered the existence of my friend’s site. And indexed it. After indexing the site, it started to show the site in the search results.

Read more: ‘What does Google do?’ »

Next step: tweak your settings…

After that first link, your site will probably turn up in the search results. If it doesn’t, it could be that your site is set to noindex or is still blocked by robots.txt. If that’s the case, you’re telling Google not to index your site. Sometimes developers forget to turn either of these off after they’ve finished working on your site.
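
You can check both of these yourself. A minimal sketch of what to look for: a meta robots tag like this in the <head> of your pages,

<meta name="robots" content="noindex,nofollow">

or a robots.txt file on your domain that blocks everything:

User-agent: *
Disallow: /

If you find either of these on a site that should be indexed, remove them (or ask your developer to do it).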

Some pages just aren’t the best landing pages. You don’t want people landing on your checkout page, for instance. And you don’t want this page to compete with other – useful – content or product pages in the search results. Pages you never want to pop up in the search results (but there aren’t many of these) should have a noindex.

Yoast SEO can help you set these pages to noindex. That means Google won’t save the page in its index and it won’t turn up in the search results.

Keep reading: ‘The ultimate guide to the robots meta tag’ »

Important third step: keyword research

My friend’s site now ranks for her domain name. That’s about it. She’s got some work to do to start ranking for other terms as well. When you want to improve the SEO for a new website, you have to carry out some proper keyword research. So go find out what your audience is searching for! Which words do they use?

If you execute your keyword research properly, you’ll end up with a long list of search terms you want to be found for. Make sure to search for those terms in Google yourself. What results are there already? Who will be your online competitors for these search terms? What can you do to stand out from these results?

Read on: ‘Keyword research: the ultimate guide’ »

And then: write, write, write

Then you start writing. Write about all those topics that are important to your audience. Use the words you came up with in your keyword research. You need to have content about the topics you want to rank for to start ranking in the search results.

Read more: ‘How to write a high quality and seo-friendly blog post’ »

But also: improve those snippets

Once you start ranking, take a look at your results in the search engines (the so-called snippets). Are the titles and meta descriptions of those results inviting? Are they tempting enough for your audience to click on them? Or should you write better ones?

Yoast SEO helps you to write great titles and meta descriptions. Use our snippet preview to create awesome snippets. That’ll really help in attracting traffic to your site.

Keep reading: ‘The snippet preview: what it means and how to use it?’ »

Think about site structure

Which pages and posts are most important? These should have other pages and posts linking to them, so make sure to link to your most important content. Google follows your links, and the posts and pages that have the most internal links are the most likely to rank high in the search engines. Setting up such a structure is basically telling Google which articles are important and which aren’t. Our brand new text link counter can be a great help to see whether you’re linking often enough to your most important content.

Read on: ‘Internal linking for SEO: why and how’ »

Finally: do some link building

Google follows links. Links are important. So get the word out. Reach out to other site owners – preferably of topically related websites – and ask them to write about your new site. If Google follows multiple links to your website, it’ll crawl it more often. This is crucial when doing SEO for a new website, and it will eventually help your rankings. Don’t go overboard with link building though; buying links is still a no-go:

Read more: ‘Link building: what not to do?’ »

Preventing your site from being indexed, the right way

We said it in 2009, and we’ll say it again: it keeps amazing us that there are still people using just a robots.txt file to try to keep their site out of Google or Bing. As a result, their site shows up in the search engines anyway. You know why it keeps amazing us? Because robots.txt doesn’t actually keep your site out of the search results, even though it does prevent indexing of your site. Let me explain how this works in this post.

For more on robots.txt, please read robots.txt: the ultimate guide.

There is a difference between being indexed and being listed in Google

Before we explain things any further, we need to go over some terms here first:

  • Indexed / Indexing
    The process of downloading a site or a page’s content to the server of the search engine, thereby adding it to its “index”.
  • Ranking / Listing / Showing
    Showing a site in the search result pages (aka SERPs).

So, while the most common process goes from indexing to listing, a site doesn’t have to be indexed to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents the search engine from indexing that page, Google will still show the URL in the results if it can gather from other variables that it might be worth looking at. In the old days, that could have been DMOZ or the Yahoo! directory, but I can imagine Google using, for instance, your Google My Business details these days, or the old data from those projects. After all, there are plenty of other sites that summarize your website.

Now if the explanation above doesn’t make sense, have a look at this 2009 Matt Cutts video explanation:

If you have reasons to prevent indexing of your website, adding that request to the specific pages you want to block, like Matt describes, is still the right way to go. But you’ll need to make sure Google can actually see that meta robots tag. So, if you want to effectively hide pages from the search engines, you need to let them crawl those pages, even though that might seem contradictory. There are two ways of doing that.

Prevent listing of your page by adding a meta robots tag

The first option to prevent listing of your page is by using robots meta tags. We’ve got an ultimate guide on robots meta tags that’s more extensive, but it basically comes down to adding this tag to your page:

<meta name="robots" content="noindex,nofollow">

The issue with a tag like that is that you have to add it to each and every page.

Or by adding an X-Robots-Tag HTTP header

To make the process of adding the meta robots tag to every single page of your site a bit easier, the search engines came up with the X-Robots-Tag HTTP header. This allows you to specify an HTTP header called X-Robots-Tag and set its value the same way you’d set the meta robots tag’s value. The cool thing about this is that you can do it for an entire site. If your site is running on Apache and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:

Header set X-Robots-Tag "noindex, nofollow"

This would have the effect that the entire site can still be crawled, but will never be shown in the search results.
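
If your site runs on Nginx instead of Apache, a similar site-wide header can be set in the server configuration. This is just a sketch, assuming you can edit the server block for your site:

add_header X-Robots-Tag "noindex, nofollow";

The effect is the same: the header is sent with every response, so the whole site stays out of the search results.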

So, get rid of that robots.txt file with Disallow: / in it. Use the X-Robots-Tag or that meta robots tag instead!

Read more: ‘The ultimate guide to the meta robots tag’ »

Ask Yoast: Block your site’s search results pages?

Every website should have a decent internal search functionality that shows the visitors search results that fit their search query. However, those search results pages on your site don’t need to be shown in Google’s search results. In fact, Google advises against this too; it’s not a great user experience to click on a Google search result, just to end up on a search result page of your site. Learn what’s best practice to prevent this from happening!

User experience is not the only reason to prevent Google from including these pages in their search results. Spam domains can also abuse your search results pages, which is what happened to Krunoslav from Croatia. He therefore emailed Ask Yoast:

“Some spam domains were linking to the search results pages on my WordPress site. So what could I do to block Google from accessing my site search results? Is there any code that I could put in robots.txt?”

Check out the video or read the answer below!

Block your search results pages?

In the video, we explain what you could do to prevent Google from showing your site’s search results:

“Well, to be honest, I don’t think I would block them. What you could do, is try two different things:

1. One is do nothing and run our Yoast SEO plugin. We’ll automatically noindex all the search result pages on your site. But if that leads to weird rankings or to other stuff that is not really working for you, then you could do another thing:

2. The second way is to block them and put a Disallow: /?s=* in your robots.txt. This basically means that you’re blocking Google from crawling all of your site’s search result pages. I don’t know whether that’s the best solution though.

I would try noindex first and see if that does anything. If it doesn’t, then use the method of blocking your search results in your robots.txt.

Good luck!”
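
For reference, the robots.txt rule from the transcript would look something like this. It’s a sketch, assuming the default WordPress setup where search results live on URLs with the ?s= parameter:

User-agent: *
Disallow: /?s=
# Only needed if your site also uses pretty /search/ URLs:
Disallow: /search/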

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers. Need some advice about SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘Block your site’s search results pages’ »

Block your site’s search result pages

Why should you block your internal search result pages for Google? Well, how would you feel if you were in dire need of an answer to your search query and ended up on the internal search pages of a certain website? That’s one crappy experience. Google thinks so too, and prefers that you don’t have these internal search pages indexed.

Google considers these search results pages to be of lower quality than your actual informational pages. That doesn’t mean these internal search pages are useless, but it does make sense to block them.

Back in 2007

Ten years ago, Google, or more specifically Matt Cutts, told us that we should block these pages in our robots.txt. The reason for that:

Typically, web search results don’t add value to users, and since our core goal is to provide the best search results possible, we generally exclude search results from our web search index. (Not all URLs that contain things like “/results” or “/search” are search results, of course.)
– Matt Cutts (2007)

Nothing changed, really. Even after 10 years of SEO changes, this remains the same. The Google Webmaster Guidelines still state that you should “Use the robots.txt file on your web server to manage your crawling budget by preventing crawling of infinite spaces such as search result pages.” Furthermore, the guidelines state that webmasters should avoid techniques like automatically generated content, in this case, “Stitching or combining content from different web pages without adding sufficient value”.

However, blocking internal search pages in your robots.txt doesn’t seem the right solution. In 2007, it even made more sense to simply redirect the user to the first result of these internal search pages. These days, I’d rather use a slightly different solution.

Blocking internal search pages in 2017

I believe nowadays, using a noindex, follow meta robots tag is the way to go instead. It seems Google ‘listens’ to that meta robots tag and sometimes ignores the robots.txt. That happens, for instance, when a surplus of backlinks to a blocked page tells Google it is of interest to the public anyway. We’ve already mentioned this in our Ultimate guide to robots.txt.

The 2007 reason is still the same in 2017, by the way: linking to search pages from search pages delivers a poor experience for a visitor. For Google, on a mission to deliver the best result for your query, it makes a lot more sense to link directly to an article or another informative page.

Yoast SEO will block internal search pages for you

If you’re on WordPress and using our plugin, you’re fine. We’ve got you covered:

Block internal search pages

That’s located at SEO › Titles & Metas › Archives. Most other content management systems allow for templates for your site’s search results as well, so adding a simple line of code to that template will suffice:
<meta name="robots" content="noindex,follow"/>

Meta robots AND robots.txt?

If you’re trying to block internal search pages by adding that meta robots tag and disallowing them in your robots.txt, please think again. Just the meta robots tag will do. Otherwise, you’ll risk losing the link value of these pages (hence the follow in the meta tag). If Google obeys your robots.txt, it will never crawl those pages and therefore never see the meta robots tag, right? And that’s not what you want. So just use the meta robots tag!

Back to you

Did you block your internal search results? And how did you do that? Go check for yourself! Any further insights or experiences are appreciated; just drop us a line in the comments.

Read more: ‘Robots.txt: the ultimate guide’ »

SEO basics: What is crawlability?

Ranking in the search engines requires a website with flawless technical SEO. Luckily, the Yoast SEO plugin takes care of (almost) everything on your WordPress site. Still, if you really want to get the most out of your website and keep outranking the competition, some basic knowledge of technical SEO is a must. In this post, I’ll explain one of the most important concepts of technical SEO: crawlability.

What is the crawler again?

A search engine like Google consists of a crawler, an index and an algorithm. The crawler follows links. When Google’s crawler finds your website, it reads it and saves its content in the index.

A crawler follows the links on the web. A crawler is also called a robot, a bot, or a spider. It goes around the internet 24/7. Once it comes to a website, it saves the HTML version of a page in a gigantic database, called the index. This index is updated every time the crawler comes around your website and finds a new or revised version of it. Depending on how important Google deems your site and the amount of changes you make on your website, the crawler comes around more or less often.

Read more: ‘SEO basics: what does Google do’ »

And what is crawlability?

Crawlability has to do with the possibilities Google has to crawl your website. Crawlers can be blocked from your site in a few different ways. If your website or a page on it is blocked, you’re saying to Google’s crawler: “do not come here”. In most of these cases, your site or the respective page won’t turn up in the search results.
There are a few things that could prevent Google from crawling (or indexing) your website:

  • If your robots.txt file blocks the crawler, Google will not come to your website or specific web page.
  • Before crawling your website, the crawler will take a look at the HTTP header of your page. This HTTP header contains a status code. If this status code says that a page doesn’t exist, Google won’t crawl that page. In the module about HTTP headers of our (soon to be launched!) Technical SEO training we’ll tell you all about that.
  • If the robots meta tag on a specific page blocks the search engine from indexing that page, Google will crawl that page, but won’t add it to its index.

This flow chart might help you understand the process bots follow when attempting to index a page:

Want to learn all about crawlability?

Although crawlability is just the very basics of technical SEO (it has to do with all the things that enable Google to index your site), for most people it’s already pretty advanced stuff. Nevertheless, if you’re blocking – perhaps even without knowing! – crawlers from your site, you’ll never rank high in Google. So, if you’re serious about SEO, this should matter to you.

If you really want to understand all the technical aspects concerning crawlability, you should definitely check out our Technical SEO 1 training, which will be released this week. In this SEO course, we’ll teach you how to detect technical SEO issues and how to solve them (with our Yoast SEO plugin).

Keep reading: ‘How to get Google to crawl your site faster’ »

 

Ask Yoast: should I redirect my affiliate links?

There are several reasons for cloaking or redirecting affiliate links. For instance, it’s easier to work with affiliate links when you redirect them, plus you can make them look prettier. But do you know how to cloak affiliate links? We explained how the process works in one of our previous posts. This Ask Yoast is about the method of cloaking affiliate links we gave you in that post. Is it still a good idea to redirect affiliate links via the script we described?

Elias Nilson emailed us, saying that he read our article about cloaking affiliate links and he’s wondering if the solution is still up-to-date.

“Is it still a good idea to redirect affiliate links via the script you describe in your post?”

Check out the video or read the answer below!

Redirect affiliate links

Read this transcript to figure out if it is still a valid option to redirect affiliate links via the described script. Want to see the script directly? Read this post: ‘How to cloak affiliate links’:

Honestly, yes. We recently updated the post about cloaking affiliate links, so the post, and therefore the script, is still up to date. Link cloaking sounds negative, because we use the word cloaking, but it’s basically hiding from Google that you’re an affiliate. And if you’re an affiliate, that’s still what you want to do, because usually Google ranks original content that is not by affiliates better than it ranks affiliates.

So, yes, I’d still recommend that method. The link will be below this post, so you can see the original post we’re referencing. It’s a very simple method to cloak your affiliate links and I think it works in probably the best way that I know of.

So, keep going. Good luck.

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers. Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘How to cloak your affiliate links’ »

Ask Yoast: nofollow layered navigation links?

If you have a big eCommerce site with lots of products, layered navigation can help your users to narrow down their search results. Layered or faceted navigation is an advanced way of filtering by providing groups of filters for (many) product attributes. In this filtering process, you might create a lot of URLs though, because the user will be able to filter and thereby group items in many ways, and those groups will all be available on separate URLs. So what should you do with all these URLs? Do you want Google to crawl them all?

In this Ask Yoast, we’ll answer a question from Daniel Jacobsen:

“Should I nofollow layered navigation links? And if so, why? Are there any disadvantages of this?”

Check out the video or read the answer below!

Layered navigation links

Read this transcript to learn how to deal with layered or faceted navigation links:

“The question is: “Why would you want to do that?” If you have too many URLs – so if you have a layered or faceted navigation that has far too many options, creating billions of different types of URLs for Google to crawl – then probably yes. At the same time you need to ask yourself: “Why does my navigation work that way?” And: “Can we make it any different?” But in a lot of eCommerce systems that’s very hard. So in those cases adding a nofollow to those links does actually help to prevent Google from indexing each and every one of the versions of your site.

I’ve worked on a couple of sites with faceted navigation that had over a billion variations in URLs, even though they only had like 10,000 products. If that’s the sort of problem you have, then yes, you need to nofollow them and maybe you even need to use your robots.txt file to exclude some of those variants. So specific stuff that you don’t want indexed, for instance, if you don’t want color indexed, you could do a robots.txt line that says: “Disallow for everything that has color in the URL”. At that point you strip down what Google crawls and what it thinks is important. The problem with that is, that if Google has links pointing at that version from somewhere else, those links don’t count for your site’s ranking either.

So it’s a bit of a quid pro quo, where you have to think about what is the best thing to do. It’s a tough decision. I really would suggest getting an experienced technical SEO to look at your site if it really is a problem, because it’s not a simple cut-and-paste solution that works the same for every site.

Good luck!”
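
To make the robots.txt part of that advice a little more concrete, here’s a sketch. It assumes a hypothetical color filter that lives in a ?color= query parameter; your faceted navigation will use its own parameter names:

User-agent: *
# Keep crawlers away from every URL variant that contains the color filter:
Disallow: /*color=

The nofollow part would then go on the filter links themselves, for example <a href="/shoes?color=red" rel="nofollow">Red</a>.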

Ask Yoast

In the series Ask Yoast we answer SEO questions from followers! Need help with SEO? Let us help you out! Send your question to ask@yoast.com.

Read more: ‘Internal search for online shops: an essential asset’ »

Playing with the X-Robots-Tag HTTP header

Traditionally, you will use a robots.txt file on your server to manage what pages, folders, subdomains or other content search engines will be allowed to crawl. But did you know that there’s also such a thing as the X-Robots-Tag HTTP header? In this post we’ll discuss what the possibilities are and how this might be a better option for your blog.

Quick recap: robots.txt

Before we continue, let’s take a look at what a robots.txt file does. In a nutshell, what it does is tell search engines to not crawl a particular page, file or directory of your website.

Using this helps both you and search engines such as Google. By not providing access to certain, unimportant areas of your website, you can save on your crawl budget and reduce the load on your server.

Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.

Say hello to X-Robots-Tag

Back in 2007, Google announced that they had added support for the X-Robots-Tag directive. What this meant was that you could not only restrict access for search engines via a robots.txt file, you could also programmatically set various robots.txt-related directives in the headers of an HTTP response. Now, you might be thinking “But can’t I just use the robots meta tag instead?”. The answer is yes. And no. If you plan on programmatically blocking a particular page that is written in HTML, then using the meta tag should suffice. But if you plan on blocking, let’s say, an image, then you could use the HTTP response approach to do this in code. Obviously you can always use the latter method if you don’t feel like adding additional HTML to your website.

X-Robots-Tag directives

As Sebastian explained in 2008, there are two different kinds of directives: crawler directives and indexer directives. I’ll briefly explain the difference below.

Crawler directives

The robots.txt file only contains the so-called ‘crawler directives’, which tell search engines where they are and aren’t allowed to go. The Allow directive specifies where search engines are allowed to crawl; the Disallow directive does the exact opposite. Additionally, you can use the Sitemap directive to help search engines out and let them crawl your website even faster.

Note that it’s also possible to fine-tune the directives for a specific search engine by using the User-agent directive in combination with the other directives.

As Sebastian points out and explains thoroughly in another post, pages can still show up in search results when there are enough links pointing to them, despite being explicitly excluded with the Disallow directive. This basically means that if you want to really hide something from the search engines, and thus from people using search, robots.txt won’t suffice.
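
Put together, a small robots.txt that uses these directives could look something like this (the paths and sitemap location are just hypothetical examples):

User-agent: *
Disallow: /private/
Allow: /private/brochure.pdf
Sitemap: https://www.example.com/sitemap.xml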

Indexer directives

Indexer directives are directives that are set on a per-page and/or per-element basis. Up until July 2007, there were two directives: the microformat rel=”nofollow”, which means that the link should not pass authority / PageRank, and the meta robots tag.

With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file(types) are indexed.

Example uses of the X-Robots-Tag

Theory is nice and all, but let’s see how you could use the X-Robots-Tag in the wild!

If you want to prevent search engines from showing files you’ve generated with PHP, you could add the following at the top of the header.php file, before any output is sent:

header("X-Robots-Tag: noindex", true);

This would not prevent search engines from following the links on those pages. If you want to do that, then alter the previous example as follows:

header("X-Robots-Tag: noindex, nofollow", true);

Now, although using this method in PHP has its benefits, you’ll most likely end up wanting to block specific filetypes altogether. The more practical approach would be to add the X-Robots-Tag to your Apache server configuration or a .htaccess file.

Imagine you run a website which also has some .doc files, but you don’t want search engines to index that filetype for a particular reason. On Apache servers, you should add the following line to the configuration / a .htaccess file:

<FilesMatch "\.doc$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

Or, if you’d want to do this for both .doc and .pdf files:

<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

If you’re running Nginx instead of Apache, you can get a similar result by adding the following to the server configuration:

location ~* \.(doc|pdf)$ {
	add_header  X-Robots-Tag "noindex, noarchive, nosnippet";
}

There are cases in which the robots.txt file itself might show up in search results. By using an alteration of the previous method, you can prevent this from happening to your website:

<FilesMatch "robots.txt">
Header set X-Robots-Tag "noindex"
</FilesMatch>

And in Nginx:

location = /robots.txt {
	add_header  X-Robots-Tag "noindex";
}

Conclusion

As you can see based on the examples above, the X-Robots-Tag HTTP header is a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident. Nevertheless, it’s a great addition to your toolset if you know how to use it.

Read more: ‘Meta robots tag: the ultimate guide’ »

Crawl budget optimization

Google doesn’t always spider every page on a site instantly. In fact, sometimes it can take weeks. This might get in the way of your SEO efforts. Your newly optimized landing page might not get indexed. At that point, it becomes time to optimize your crawl budget.

Crawl budget is the time Google has in a given period to crawl your site. It might crawl 6 pages a day, it might crawl 5,000 pages, it might even crawl 4,000,000 pages every single day. This depends on many factors, which we’ll discuss in this article. Some of these factors are things you can influence.

How does a crawler work?

A crawler like Googlebot gets a list of URLs to crawl on a site, and goes through that list systematically. It grabs your robots.txt file every once in a while to make sure it’s still allowed to crawl each URL, and then crawls the URLs one by one. Once a spider has crawled a URL and parsed its contents, it adds any new URLs it found on that page back onto the to-do list.

Several events can make Google feel that a URL has to be crawled. It might have found new links pointing at content, someone might have tweeted it, or it might have been updated in the XML sitemap, and so on. There’s no way to make a list of all the reasons why Google would crawl a URL, but when it determines it has to, it adds that URL to the to-do list.

What is crawl budget?

Crawl budget is the number of pages Google will crawl on your site on any given day. This number varies slightly from day to day, but overall it’s relatively stable. The number of pages Google crawls, your “budget”, is generally determined by the size of your site, the “health” of your site (how many errors Google encounters) and the number of links to your site. 

When is crawl budget an issue?

Crawl budget is not a problem if Google has to crawl a lot of URLs on your site and has allotted a lot of crawls. But say your site has 250,000 pages and Google crawls 2,500 pages on it each day. It will crawl some (like the homepage) more than others, so it could take up to 200 days before Google notices particular changes to your pages if you don’t act. That’s when crawl budget becomes an issue. If Google crawls 50,000 pages a day, there’s no issue at all.

To quickly determine whether your site has a crawl budget issue, follow the steps below. This does assume your site has a relatively small number of URLs that Google crawls but doesn’t index (for instance because you added meta noindex).

  1. Determine how many pages you have on your site; the number of URLs in your XML sitemaps might be a good start.
  2. Go into Google Search Console.
  3. Go to Crawl -> Crawl stats and take note of the average pages crawled per day.
  4. Divide the number of pages by the “Average crawled per day” number.
  5. If you end up with a number higher than ~10 (so you have 10x more pages than what Google crawls each day), you should optimize your crawl budget. If you end up with a number lower than 3, go read something else. 
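
To use the numbers from the example above: a site with 250,000 pages and an average of 2,500 pages crawled per day ends up at 250,000 / 2,500 = 100. That’s well above 10, so that site has a crawl budget problem worth fixing.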

What URLs is Google crawling?

You really should know which URLs Google is crawling on your site. The only “real” way of knowing that is looking at your site’s server logs. For larger sites I personally prefer using Logstash + Kibana for that. For smaller sites, the guys at Screaming Frog have released quite a nice little tool, aptly called SEO Log File Analyser (note the S, they’re Brits).

Get your server logs and look at them

Depending on your type of hosting, you might not always be able to grab your log files. However, if you even so much as think you need to work on crawl budget optimization because your site is big, you should get them. If your host doesn’t allow you to get them, change hosts.

Fixing your site’s crawl budget is a lot like fixing a car. You can’t fix it by looking at the outside, you’ll have to open up that engine. Looking at logs is going to be scary at first. You’ll quickly find that there is a lot of noise in logs. You’ll find a lot of commonly occurring 404s that you think are nonsense. But you have to fix them. You have to get through the noise and make sure your site is not drowned in tons of old 404s.

Increase your crawl budget

Let’s look at the things that actually improve how many pages Google can crawl on your site.

Website maintenance: reduce errors

Step one in getting more pages crawled is making sure that the pages that are crawled return one of two possible return codes: 200 (for “OK”) or 301 (for “Go here instead”). All other return codes are not OK. To figure this out, you have to look at your site’s server logs. Google Analytics and most other analytics packages will only track pages that served a 200. So you won’t find many of the errors on your site in there.

Once you’ve got your server logs, try to find common errors and fix them. The simplest way of doing that is by grabbing all the URLs that didn’t return 200 or 301 and then ordering them by how often they were accessed. Fixing an error might mean that you have to fix code, or you might have to redirect a URL elsewhere. If you know what caused the error, you can try to fix the source too.

Another good source to find errors is Google Search Console. Read this post by Michiel for more info on that. If you’re using Yoast SEO, connecting your site to Google Search Console through the plugin allows you to easily retrieve all those errors. If you’ve got Yoast SEO Premium, you can even redirect them away easily using the redirects manager.

Block parts of your site

If you have sections of your site that really don’t need to be in Google, block them using robots.txt. Only do this if you know what you’re doing, of course. One of the common problems we see on larger eCommerce sites is that they have a gazillion ways to filter products. Every filter might add new URLs for Google. In cases like these, you really want to make sure that you’re letting Google spider only one or two of those filters and not all of them.

Reduce redirect chains

When you 301 redirect a URL, something weird happens: Google sees the new URL and adds it to the to-do list. It doesn’t always follow the redirect immediately; it adds the new URL to its to-do list and just moves on. When you chain redirects, for instance when you redirect non-www to www and then http to https, you have two redirects everywhere, so everything takes longer to crawl.
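
A sketch of what collapsing such a chain could look like in an .htaccess file, assuming Apache with mod_rewrite and assuming https://www.example.com is the canonical version of your site; every http and non-www variant then gets a single 301 straight to the final URL:

# Send http and non-www variants to https://www. in one hop
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]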

Get more links

This is easy to say, but hard to do. Getting more links is not just a matter of being awesome, it’s also a matter of making sure others know that you’re awesome too. It’s a matter of good PR and good engagement on Social. We’ve written extensively about link building, I’d suggest reading these 3 posts:

  1. Link building from a holistic SEO perspective
  2. Link building: what not to do?
  3. 6 steps to a successful link building strategy

When you have an acute indexation problem, you should definitely look at your crawl errors, blocking parts of your site and at fixing redirect chains first. Link building is a very slow method to increase your crawl budget. On the other hand: if you intend on building a large site, link building needs to be part of your process.

AMP and your crawl budget

Google is telling everyone to use Accelerated Mobile Pages, in short: AMP. These are “lighter” versions of web pages, specifically aimed at mobile. The problem with AMP is that it means adding a separate URL for every page you have. You’d get example.com/page/ and example.com/page/amp/. This means you need double the crawl budget for your site. If you have crawl budget issues already, don’t start working on AMP just yet. We’ve written about it twice, but find that for sites that do not serve news, it’s not worth it yet.

TL;DR: crawl budget optimization is hard

Crawl budget optimization is not for the faint of heart. If you’re doing your site’s maintenance well, or your site is relatively small, it’s probably not needed. If your site is medium sized and well maintained, it’s fairly easy to do based on the above tricks. If you find, after looking at some error logs, that you’re in over your head, it might be time to call in someone more experienced.

Read more: ‘Robots.txt: the ultimate guide’ »

robots.txt: the ultimate guide

The robots.txt file is one of the primary ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers, and there are some extra rules that a few search engines use which can be useful too. This guide covers all the ways you can use robots.txt on your website. While it looks deceptively simple, making a mistake in your robots.txt can seriously harm your site, so make sure to read and understand this.

What is a robots.txt file?

A robots.txt file is a text file, following a strict syntax. It’s read by search engine spiders. These spiders are also called robots, hence the name. The syntax is strict simply because it has to be computer readable: there’s no reading between the lines here, something is either 1 or 0.

A quick aside on naming: a couple of developers once sat down and realized that they were, in fact, not robots but humans. So they created the humans.txt standard as a way of highlighting which people work on a site, amongst other things.

Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine spider developers. It’s not an official standard set by any standards organization, but all major search engines adhere to it.

What does the robots.txt file do?

Crawl directives

The robots.txt file is one of a few crawl directives. We have guides on all of them, find them here:

Crawl directives guides by Yoast »

Search engines index the web by spidering pages. They follow links to go from site A to site B to site C, and so on. Before a search engine spiders any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. The robots.txt file tells the search engine which URLs on that site it’s allowed to crawl.

A search engine will cache the robots.txt contents, but will usually refresh it multiple times a day. So changes will be reflected fairly quickly.


Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain. So if your domain is www.example.com, it should be found at http://www.example.com/robots.txt. Do be aware: if your domain responds without www. too, make sure it has the same robots.txt file! The same is true for http and https. When a search engine wants to spider the URL http://example.com/test, it will grab http://example.com/robots.txt. When it wants to spider that same URL but over https, it will grab the robots.txt from your https site too, so https://example.com/robots.txt.

It’s also very important that your robots.txt file is really called robots.txt. The name is case sensitive. Don’t make any mistakes in it or it will just not work.

Pros and cons of using robots.txt

Pro: crawl budget

Each site has an “allowance” in how many pages a search engine spider will crawl on that site, SEOs call this the crawl budget. By blocking sections of your site from the search engine spider, you allow your crawl budget to be used for other sections. Especially on sites where a lot of SEO clean up has to be done, it can be very beneficial to first quickly block the search engines from crawling a few sections.

blocking query parameters

One situation where crawl budget is especially important is when your site uses a lot of query string parameters to filter and sort. Let’s say you have 10 different query parameters, each with different values, that can be used in any combination. This leads to hundreds if not thousands of possible URLs. Blocking all query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.

This line would block all URLs on your site with a query string on it:

Disallow: /*?*

Con: not removing a page from search results

Using the robots.txt file, you can tell a spider where it cannot go on your site. You cannot tell a search engine which URLs it may not show in the search results. This means that not allowing a search engine to crawl a URL – called “blocking” it – does not mean that URL won’t show up in the search results. If the search engine finds enough links to that URL, it will include it; it just won’t know what’s on that page.

Screenshot of a result for a blocked URL in the Google search results

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to crawl that page and find the noindex tag, so the page should not be blocked by robots.txt.

Because the search engine can’t crawl the page, it can’t distribute the link value of the links pointing to your blocked pages. If it could crawl the page, but not index it, it could still spread the link value across the links it finds on that page. When a page is blocked with robots.txt, the link value is lost.
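
To make that concrete, here’s a quick sketch with a hypothetical /secret/ directory. This robots.txt rule does not reliably keep those pages out of the results, because Google can never crawl them and see a noindex tag (and the link value pointing at them is lost):

User-agent: *
Disallow: /secret/

Instead, leave the pages crawlable in robots.txt and put the meta robots tag on the pages themselves:

<meta name="robots" content="noindex,follow">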

robots.txt syntax

WordPress robots.txt

We have a complete article on how to best setup your robots.txt for WordPress. Note that you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each started by a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the most specific block that matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow are not case sensitive, so whether you write them lowercase or capitalize them is up to you. The values are case sensitive however: /photo/ is not the same as /Photo/. We like to capitalize directives for the sake of readability.

User-agent directive

The first bit of every block of directives is the user-agent. A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent. For instance, the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

A relatively simple User-agent: Googlebot  line will do the trick if you want to tell this spider what to do.

Note that most search engines have multiple spiders. They will use specific spiders for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
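
A sketch of those three sets of directives (the paths are just hypothetical examples):

User-agent: *
Disallow: /not-for-anyone/

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: Googlebot-News
Disallow: /not-for-google-news/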

The most common user agents for search engine spiders

Below is a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine   Field            User-agent
Baidu           General          baiduspider
Baidu           Images           baiduspider-image
Baidu           Mobile           baiduspider-mobile
Baidu           News             baiduspider-news
Baidu           Video            baiduspider-video
Bing            General          bingbot
Bing            General          msnbot
Bing            Images & Video   msnbot-media
Bing            Ads              adidxbot
Google          General          Googlebot
Google          Images           Googlebot-Image
Google          Mobile           Googlebot-Mobile
Google          News             Googlebot-News
Google          Video            Googlebot-Video
Google          AdSense          Mediapartners-Google
Google          AdWords          AdsBot-Google
Yahoo!          General          slurp
Yandex          General          yandex

Disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that spider can access all sections of your site.

User-agent: *
Disallow: /

The example above would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: *
Disallow:

The example above would, with only one character less, allow all search engines to crawl your entire site.

User-agent: googlebot
Disallow: /Photo

The example above would block Google from crawling the /Photo directory on your site and everything in it, which means all the subdirectories of the /Photo directory would not be spidered either. It would not block Google from crawling a /photo directory, as these values are case sensitive.

How to use wildcards / regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards. However, all major search engines do understand them. This means you can use lines like this to block groups of files:

Disallow: /*.php
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions. Be aware that not all search engines might understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php can’t be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.

Non-standard robots.txt crawl directives

On top of the Disallow and User-agent directives there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.

Allow directive

While not in the original “specification”, there was talk of an allow directive very early on. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.

noindex directive

One of the lesser known directives, noindex, is actually supported by Google. We think this is a very dangerous thing. If you want to keep a page out of the search results, you usually have a good reason for that, and using a method of blocking that page that only keeps it out of Google means you leave those pages open for other search engines. It could be very useful in a Googlebot-specific section of your robots.txt though, if you’re working on improving your crawl budget. Note that noindex in robots.txt isn’t officially supported by Google, so while it works now, it might stop working at some point.

host directive

Supported by Yandex (and not by Google, even though some posts say it is), this directive lets you decide whether you want the search engine to show example.com or www.example.com. Simply specifying it as follows does the trick:

host: example.com

Because only Yandex supports the host directive, we wouldn’t advise you to rely on it. Especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect www.yoast.com to yoast.com.

crawl-delay directive

Supported by Yahoo!, Bing and Yandex, the crawl-delay directive can be very useful to slow down these three, sometimes fairly crawl-hungry, search engines. These search engines have slightly different ways of reading the directive, but the end result is basically the same.

The line below would lead to Yahoo! and Bing waiting 10 seconds after each crawl action, while Yandex would only access your site once in every 10-second timeframe. A semantic difference, but interesting to know. Here’s the example crawl-delay line:

crawl-delay: 10

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds, you’re only allowing these search engines to access 8,640 pages a day (there are 86,400 seconds in a day, so 86,400 / 10 = 8,640 crawl actions). This might seem plenty for a small site, but on large sites it isn’t all that much. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.

sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – the location of your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we highly recommend that you do, because search engines’ webmaster tools programs will give you very valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.
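
For example, a sitemap line simply looks like this (swap in your own sitemap URL; if you use Yoast SEO, the sitemap index normally lives at /sitemap_index.xml):

Sitemap: https://www.example.com/sitemap_index.xml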

Read more: ‘several articles about Webmaster Tools’ »

Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we like to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the Crawl menu) and we’d highly suggest using that:

robots.txt tester

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally robots.txt-block your entire site into search engine oblivion.

Keep reading: ‘WordPress robots.txt example for great SEO’ »