Central Perk: SEOmoz Daily SEO Blog

Wednesday, 16 March 2011

Restricting Robot Access for Improved SEO

Posted: 15 Mar 2011 02:03 PM PDT

Left to their own devices, search engine spiders will often perceive important pages as junk, index content that shouldn’t serve as a user entry point, generate duplicate content, and cause a slew of other issues. Are you doing everything you can to guide bots through your website and make the most of each visit from search engine spiders?

It is a little like child-proofing a home. We use child safety gates to block access to certain rooms, add inserts to electrical outlets to ensure nobody gets electrocuted, and place dangerous items out of reach. At the same time we provide educational, entertaining, and safe items within easy access. You wouldn't open the front door of your unprepared home to a toddler, then pop out for a coffee and hope for the best.

Think of Googlebot as a toddler (if you need a more believable visual, try a really rich and very well-connected toddler). Left to roam the hazards unguided, it will likely leave you with a mess and some missed potential on your hands. Remove the choice to access the troublesome areas of your website and bots are more likely to focus on the good quality options at hand instead.

Restricting access to junk and hazards while making quality choices easily accessible is an important and often overlooked component of SEO.

Luckily, there are a number of tools that allow us to make the most of bot activity and keep bots out of trouble on our websites. Let's look at the four main robot restriction methods: the Meta Robots Tag, Robots.txt files, the X-Robots-Tag, and the Canonical Tag. We’ll quickly summarize how each method is implemented, cover the pros and cons of each, and provide examples of how each one can best be used.

CANONICAL TAG

The canonical tag is a page-level tag that is placed in the HTML head of a web page. It tells the search engines which URL is the canonical version of the page being displayed. Its purpose is to keep duplicate content out of the search engine index while consolidating your pages' strength into one ‘canonical’ page.

The code looks like this:

<link rel="canonical" href="http://example.com/quality-wrenches.htm"/>

There is a good example of this tag in action over at MyWedding. They used this tag to take care of tracking parameters important to the marketing team. Try this URL: http://www.mywedding.com/?utm_source=whatever-they-want-to-track. Right-click on the page, then view the source. You'll see the rel="canonical" entry on the page.
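If you would rather check for the tag with a script than by viewing the source, here is a minimal sketch using only Python's standard library. The sample page and the names `CanonicalFinder` and `find_canonical` are made up for illustration; the sketch simply reports the first rel="canonical" link it finds in whatever HTML you hand it.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs with lowercased names.
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attributes.get("href")


def find_canonical(html_text):
    """Return the canonical URL declared in html_text, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html_text)
    return finder.canonical


# Hypothetical page source, as it might be served for a URL with tracking parameters.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Quality Wrenches</title>
    <link rel="canonical" href="http://example.com/quality-wrenches.htm"/>
  </head>
  <body>...</body>
</html>
"""

print(find_canonical(SAMPLE_PAGE))
```

A page without the tag returns None, which is a quick way to spot templates where the canonical entry went missing.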

Pros

Relatively easy to implement. Your dev group can move on to bigger fish.
Can be used to source content across domains. This may be a good solution if you have syndication deals in the works but don't want to compromise your own search engine presence.

Cons

Relatively easy to implement incorrectly (see catastrophic canonicalization)
Search engine support can be spotty. The tag is a signal more than a command.
Doesn't correct the core issue.

Example Uses

There are usually other ways to canonicalize content, but sometimes this is a solid solution given all variables.
Cindy Krum, a Moz associate, recommends canonical tag use if you run into a sticky situation and your mobile site version is outranking your traditional site.
If you don't want to track your referral parameters with a cookie, the canonical tag is a good alternative.
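The tracking parameter case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats any query parameter whose name starts with "utm_" as tracking noise, and the helper names `canonical_url` and `canonical_link_tag` are invented here. Your own templates and parameter names will differ.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: tracking parameters are the ones whose names start with "utm_".
TRACKING_PREFIXES = ("utm_",)


def canonical_url(url):
    """Return url with tracking parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the cleaned-up URL."""
    return '<link rel="canonical" href="{}"/>'.format(
        escape(canonical_url(url), quote=True)
    )


print(canonical_link_tag("http://www.mywedding.com/?utm_source=whatever-they-want-to-track"))
```

Parameters that change what the page shows (a color filter, say) are kept, so only the tracking variations collapse into one canonical URL.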

ROBOTS.TXT

Robots.txt allows for some control of search engine robot access to a site; however, it does not guarantee a page won’t be indexed. It should be employed only when necessary. I generally recommend using the Meta tag “noindex” for keeping pages out of the index instead.

Pros

So easy a monkey could do it.
Great place to point out XML Sitemap files.

Cons

So easy a monkey could do it (see Serious Robots.txt Misuse)
Serves as a link juice block. Search engines are restricted from crawling the page content, so the (internal) links on it aren't followed and can't pass the value they deserve.

Example Uses

I recommend only using the robots.txt file to show that you have one. It shouldn't really restrict anything, but should point to the XML Sitemaps or an XML Sitemap directory file.
Check out the SEOmoz robots.txt file. It is fun and useful.
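To see what a "restrict nothing, just point to the Sitemap" file means to a crawler, here is a rough sketch using Python's built-in `urllib.robotparser` (the `site_maps()` call needs Python 3.8 or later). Both robots.txt files below are hypothetical, and the second one blocks a log-in page purely for contrast. Real search engine bots apply their own parsing rules, so treat this as an approximation.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Hypothetical file that restricts nothing and only advertises the XML Sitemap.
OPEN_ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: http://example.com/sitemap.xml
"""

# Hypothetical file that blocks the log-in page, shown for contrast.
BLOCKING_ROBOTS_TXT = """\
User-agent: *
Disallow: /login
"""

open_rules = parse_robots(OPEN_ROBOTS_TXT)
blocking_rules = parse_robots(BLOCKING_ROBOTS_TXT)

print(open_rules.site_maps())
print(open_rules.can_fetch("Googlebot", "http://example.com/login"))
print(blocking_rules.can_fetch("Googlebot", "http://example.com/login"))
```

With the blocking file, a well-behaved bot never fetches /login at all, which is exactly why the links on that page can't be followed.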

META ROBOTS TAG

The Meta robots tag creates page-level instructions for search engine bots. The Meta robots tag should be included in the head section of the HTML document. Here is some info on how the tag should look in your code.

[Image: Meta Robots Commands]

The Meta Robots Tag is my very favorite option. By using the 'noindex' tag, you keep content out of the index but the search engine spiders will still follow the links and pass the link love.
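For a concrete picture, the sketch below embeds a hypothetical log-in page carrying `<meta name="robots" content="noindex, follow">` in its head section and reads the directives back with Python's standard `html.parser`. The class and function names are invented for this example. It can double as a quick audit for pages that pair 'noindex' with 'nofollow' by accident.

```python
from html.parser import HTMLParser


class MetaRobotsFinder(HTMLParser):
    """Gathers the directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated; compare them case-insensitively.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    finder = MetaRobotsFinder()
    finder.feed(html_text)
    return finder.directives


# Hypothetical log-in page: kept out of the index, links still followed.
LOGIN_PAGE = """
<html>
  <head>
    <title>Log in</title>
    <meta name="robots" content="noindex, follow">
  </head>
  <body><a href="/products.htm">Products</a></body>
</html>
"""

directives = robots_directives(LOGIN_PAGE)
print(sorted(directives))
print("noindex" in directives and "nofollow" not in directives)
```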

Pros

Use of 'noindex' keeps a page out of the search index better than other options like a robots.txt file entry.
As long as you don't use the 'nofollow' tag, link juice can pass. Woot!
Fine tune your entries in the SERPs by specifying NOSNIPPET, NOODP, or NODIR. (You're getting all fancy on me now!)

Cons

Many quite smart folks use 'noindex, nofollow' together and miss out on the important link juice flow piece. :(

Example Uses

Imagine that your log-in page is the most linked to (and powerful) page on your website. You don't want it in the index, but you certainly don't want to add it to the robots.txt file because that is a link juice block.
Search result sort pages.
Paginated versions of pages.

X-ROBOTS-TAG

Since 2007 Google and other search engines have supported the X-Robots-Tag as a way to inform the bots about crawling and indexing preferences in the HTTP Header used to serve the file. The X-Robots-Tag is very useful for controlling indexation of non-HTML media types such as PDF documents.

Pros

Allows you to control indexation of unusual content like Excel files, PDFs, PPTs, and whatever else you've got hanging around.

Cons

This kind of weird content can be troublesome in the first place. Why not publish an HTML version on the web for indexation and this secondary file type for download, etc?

Example Uses

You offer product information on your site in HTML, but your marketing department also wants to make the beautiful PDF version available. You'd add the X-Robots-Tag to the HTTP response for the PDFs.
You have an awesome set of Excel templates that are link bait. If you're bothered by the Excel files outranking your HTML landing pages, you could add noindex to your X-Robots-Tag in the HTTP header.
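Here is one way the PDF and Excel cases could look in code. It is a sketch, not a recipe: it uses Python's built-in `http.server` as a stand-in for your real web server, and the extension list, `x_robots_value`, and `NoindexDownloadsHandler` are all assumptions made up for the example. On a production site you would more likely set the header in your server configuration.

```python
from http.server import SimpleHTTPRequestHandler
from urllib.parse import urlsplit

# Assumption: these are the downloadable file types to keep out of the index.
NOINDEX_EXTENSIONS = (".pdf", ".xls", ".xlsx", ".ppt", ".pptx")


def x_robots_value(request_path):
    """Return the X-Robots-Tag value for a request path, or None for no header."""
    path = urlsplit(request_path).path.lower()
    return "noindex" if path.endswith(NOINDEX_EXTENSIONS) else None


class NoindexDownloadsHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag to matching responses."""

    def end_headers(self):
        # Add the header just before the header block is closed.
        value = x_robots_value(self.path)
        if value is not None:
            self.send_header("X-Robots-Tag", value)
        super().end_headers()


print(x_robots_value("/downloads/wrench-catalog.pdf"))
print(x_robots_value("/quality-wrenches.htm"))
```

To try it locally, pass `NoindexDownloadsHandler` to `http.server.HTTPServer` and request a PDF; the HTML pages are served without the header, so they stay eligible for the index.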

Let's Turn This Ship Back Around

What was all the baby talk you started out with, Lindsay? Oh, that's right. Thanks. In your quest to bot-proof your website, you have a number of tools at your disposal. These differ greatly from those used for baby-proofing, but the end result is the same. Everybody (babies and bots) stays safe, on track, out of trouble, and focused on the most important stuff that is going to make a difference. Instead of baby gates and electric socket protectors you've got the Meta Robots Tag, Robots.txt files, the X-Robots-Tag, and the Canonical Tag.

In my personal order of preference, I'd go with...

Meta Robots Tag
Canonical Tag
X-Robots-Tag
Robots.txt file

Your Turn!

I would love, love, love to hear how you use each of the above robot control protocols for effective SEO. Please share your uses and experience in the comments and let the conversation flow.

Happy Optimizing!

Stock Photography by Photoxpress
