Saturday, December 15, 2012

Robots.txt file, Robots Meta tag and X-Robots-Tag



What is a robots.txt file?

A robots.txt file tells search engine spiders how they may crawl and index your content. By default, search engines are greedy: they want to index as much high-quality information as they can, and they will assume they can crawl everything unless you tell them otherwise. If you want search engines to index everything on your site, you do not need a robots.txt file (not even an empty one). The robots.txt file must live in the root of the domain; search engine bots only check for it there.
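
For example, on a site at www.example.com (a placeholder domain), the file would live at www.example.com/robots.txt. A minimal robots.txt that allows everything looks like this:

User-agent: *
Disallow: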

How do I create a robots.txt file?

The simplest robots.txt file uses two rules:

1.     User-Agent: the robot the following rule applies to. Google's crawlers include:

 Googlebot: crawls pages for Google's web index

 Googlebot-Mobile: crawls pages for Google's mobile index

 Googlebot-Image: crawls pages for Google's image index

 Mediapartners-Google: crawls pages to determine AdSense content (used only if you show AdSense ads on your site)

2.     Disallow: the pages you want to block

A user-agent is a specific search engine robot and the Disallow line lists the pages you want to block.

 To block the entire site, use a forward slash.

Disallow: /

To block a directory, follow the directory name with a forward slash.

Disallow: /private_directory/

To block a page, list the page.

Disallow: /private_file.html

 Note:- Googlebot is the search bot software used by Google.
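
Putting the two rules together, a complete robots.txt that blocks one directory and one file for all Google crawlers might look like this (the paths are hypothetical):

User-agent: Googlebot
Disallow: /private_directory/
Disallow: /private_file.html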

Can I allow pages?

Yes, Googlebot recognizes an extension to the robots.txt standard called Allow. This extension may not be recognized by all other search engine bots, so check with other search engines you're interested in to find out. The Allow line works exactly like the Disallow line. Simply list a directory or page you want to allow.

You may want to use Disallow and Allow together. For instance, to block access to all pages in a subdirectory except one, you could use the following entries:

User-Agent: Googlebot
Disallow: /folder1/

User-Agent: Googlebot
Allow: /folder1/myfile.html

If you want a rule to apply to all bots, use an asterisk (*) as the user-agent. It acts as a global wildcard.

User-Agent:  *

Disallow: /folder1/

You can use an asterisk (*) to match a sequence of characters. For instance, to block access to all subdirectories that begin with private, you could use the following entry:

User-Agent: Googlebot
Disallow: /private*/


You can use the robots.txt analysis tool in Google Webmaster Tools to check that your robots.txt file works as intended.

Robots Meta tag

When you block URLs from being indexed in Google via robots.txt, Google may still show those pages as URL-only listings in its search results. A better solution for completely keeping a particular page out of the index is to use a robots noindex meta tag on a per-page basis. You can tell search engines not to index a page, or not to index it and not to follow its outbound links, by inserting one of the following snippets in the HTML head of the document you do not want indexed:

    <meta name="robots" content="noindex">
    <meta name="robots" content="noindex,nofollow">
      <meta name="googlebot" content="noindex">
    Please note that if you block a page both in robots.txt and via the meta tags, search engines may never get to crawl the page to see the meta tags, so the URL may still appear in the search results as a URL-only listing.
    If your page still appears in results, it is probably because the search engines have not crawled your site since you added the tag.
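
    For placement, here is a minimal sketch of a page whose head carries the noindex directive (the title and body are only placeholders):

    <html>
    <head>
    <title>Private page</title>
    <meta name="robots" content="noindex,nofollow">
    </head>
    <body>
    ...
    </body>
    </html>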

Crawl Delay

Some search engines allow you to set crawl priorities. Google does not support the Crawl-delay directive directly, but you can adjust Google's crawl rate inside Google Webmaster Central.

Yahoo!'s Slurp crawler, for example, does support it; the robots.txt crawl-delay rule looks like
User-agent: Slurp
Crawl-delay: 5
where the 5 is in seconds.

To block access to all URLs that include a question mark (?), you could use the following entry:

User-agent: *
Disallow: /*?

You can use the $ character to match the end of a URL. For instance, to block all URLs that end with .asp, you could use the following entry:

User-agent: Googlebot
Disallow: /*.asp$
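
As a hypothetical illustration, this rule would block /catalog/page.asp but not /catalog/page.asp?id=5, because the $ anchors the match to the end of the URL.
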
Sources:-  http://sitemaps.blogspot.in/2006/02/using-robotstxt-file.html

http://tools.seobook.com/robots-txt/

Using the X-Robots-Tag HTTP header

The X-Robots-Tag can be used as an element of the HTTP header response for a given URL. Any directive that can be used in a robots meta tag can also be specified as an X-Robots-Tag. Here's an example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:

HTTP/1.1 200 OK

Date: Tue, 25 May 2010 21:42:43 GMT

(…)

X-Robots-Tag: noindex

(…)

Multiple X-Robots-Tag headers can be combined within the HTTP response, or you can specify a comma-separated list of directives. Here's an example of an HTTP header response which has a noarchive X-Robots-Tag combined with an unavailable_after X-Robots-Tag.

HTTP/1.1 200 OK

Date: Tue, 25 May 2010 21:42:43 GMT

(…)

X-Robots-Tag: noarchive

X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST

(…)

The X-Robots-Tag may optionally specify a user-agent before the directives. For instance, the following set of X-Robots-Tag HTTP headers can be used to conditionally allow showing of a page in search results for different search engines:

HTTP/1.1 200 OK

Date: Tue, 25 May 2010 21:42:43 GMT

(…)

X-Robots-Tag: googlebot: nofollow

X-Robots-Tag: otherbot: noindex, nofollow

(…)

Directives specified without a user-agent are valid for all crawlers. Neither the directive names nor the specified values are case sensitive.
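
How you send the header depends on your server. As a practical sketch (assuming an Apache server with mod_headers enabled), the following .htaccess snippet would add a noindex, noarchive X-Robots-Tag to every PDF file on the site:

<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, noarchive"
</Files>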

Sources:- https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
