Google Crawling and Indexation
An insufficient Google crawl rate and incomplete indexation are the
scourge of many websites, especially large and new ones. In this forum
and others, many members report indexation issues and ask how to solve
them. The same is true of SEO clients. My advice here focuses
on Google, but the same general principles apply to crawling and
indexation by Bing and Yahoo! as well.
First of all, Google indexation is hard to measure for a large site. False alarms often arise from Google's site: operator, which is supposed to report a site's indexation count: it works well for small sites but is wildly unreliable for large ones and tends to severely underreport the count. Webmaster Tools is better for this, though possibly also unreliable. If your site is enormous, there is simply no certain way of knowing how many pages Google has indexed. For additional helpful data, check Google Analytics to see the total number of pages that have received visits. I also recommend manually running cache: checks on all your most important pages and on various random secondary pages to get a further idea of how your site is doing on the indexation front.
The Google crawl rate cannot be reliably controlled, but it can be influenced by a number of positive factors (listed here roughly in descending order of importance).
• Domain importance. Google's Matt Cutts recently admitted, in an interview with Eric Enge, that your site's crawl rate and depth of crawling are roughly proportional to its PR. SEOs have long known this.
• Backlinks. PR is computed from backlinks, which are absolutely central to indexation. If a site's page count is growing fast but the site is not earning enough new links, this may suggest to Google that the content is of low quality (which is all but guaranteed to reduce your crawl and indexation rates).
• Deep Linking. Backlinks to individual pages (so-called "deep linking") are an effective way to ensure the indexation of those pages and to keep them in the main Google index (as distinct from the supplemental index). Internal links to the same pages also help. Make sure that at least your most important pages get enough of both kinds of links. These need to be followed links (i.e. they should not carry the rel="nofollow" attribute).
• Site navigation and hierarchy. To the extent possible, use a flat site hierarchy. (A good illustration is http://www.fanbase.com, with all the main categories appearing in the top-level navigation, enabling quick drilldown to individual pages.) This means (a) as few subdomains and nested subdirectories as possible and (b) that all important pages should be reachable in as few clicks as possible from the home page (more than 3-4 clicks is problematic).
• XML sitemaps. This is a must (a sample sitemap appears after this list). Here is one good tool -- http://www.xml-sitemaps.com -- for generating sitemaps; there are others too. Submit your sitemaps to the search engines via their webmaster tools. Further notes:
o Sitemaps support the optional <changefreq> and <priority> tags, whose use may influence the crawl, although the impact is likely to be minor.
o Check WMT for sitemap errors and fix them.
o Just recently, Michael Gray has recommended creating small sitemaps (of 100 pages or fewer) to supplement your regular sitemaps, which can help get new content indexed faster. He has found using a dedicated sitemap for fresh content to be highly effective. I have not tested this personally yet, but it makes sense and Michael's mileage counts for a great deal.
• Duplicate content reduction. In general, duplicate content on a site is not a significant problem and does not entail "Google penalties." However, on very large sites high-volume duplicated content (identical pages sitting under different URLs) can confuse Google and impede proper indexing. One classic example of duplication occurs under different forms of site URLs: those that include the www. subdomain and those that don't (e.g. http://example.com/file1.html and http://www.example.com/file1.html typically have the same content). The way to handle this and other kinds of duplication is via some form of URL canonicalization (see the next item).
• URL canonicalization. This means creating a single SEO-friendly and user-friendly URL for each page and letting Google know that that URL is the canonical one. The SEO reasons for canonicalization are various and go beyond indexation: (1) Google, in spite of occasional denials, may assign less importance to pages whose URLs contain extra slashes (subdirectories); (2) Google may sometimes have difficulties with parameter-laden URLs; (3) long, ugly URLs are a turnoff for site visitors; (4) a clear, well-structured, consistent URL convention is best for the user, for branding and for SEO; (5) canonicalization consolidates PageRank and link equity onto the canonical version of the page, giving it a better chance to rank. Depending on your platform, various rewrite engines (see http://en.wikipedia.org/wiki/Rewrite_engine) can be used to automate the rewriting of URLs from "ugly" into friendly ones. URL canonicalization can be performed in any of three ways:
o 301-redirect ("moved permanently") of all duplicate URLs to the canonical one. IMHO this is the most reliable method of canonicalization, but it may have certain overheads. (A sample set of rewrite rules for the www/non-www case appears after this list.)
o rel="canonical": Place a link of the form <link rel="canonical" href="http://example.com/canonical-url-example.html"> at the end of the <head> of each duplicate page. (Yes, it's OK for the canonical version to include this link to itself; and no, there is no limit on how many canonical links you can have.)
o "Display URLs as": the effect of this setting in the Google Webmaster Tools is similar to that of rel="canonical" and is the easiest option if you prefer not to write any code.
• URL stability and page uniqueness. While the issues surrounding duplicate content are fairly well known, one potential problem that is rarely discussed is the opposite. The term I have coined for it is multitasking URLs. Some applications display different dynamically generated content under the same URL (for example, content specific to the user's geographical location). The title tags for such pages may also be generated on the fly and contradict one another. I have seen this lead to a variety of indexation and search issues. For best results, the content of each page, whether dynamic or static, must be unique and must appear under its own proper, unique and stable URL and title tag.
• Unique title tags. If you use the same title tag across multiple pages, Google may assume that those pages are duplicates and be reluctant to index them. Make your titles unique.
• Manual crawl rate setting. Google's Webmaster Tools offers a choice between letting Google determine the crawl rate automatically and setting it manually via a slider. Although setting it manually to the maximum is unlikely to boost the crawl rate dramatically, it may bring about a marginal improvement.
• Original content. It's good for all your important pages to have significant and unique original content.
• Updates, feeds, pinging. Frequent content updates, both site-wide and on individual pages, can significantly improve the crawl rate. Further, exporting RSS feeds and implementing automated search engine pinging have a beneficial effect (a minimal pinging sketch appears after this list). Pinging resources include http://pingomatic.com/ and http://pingler.com/.
• Social Media. Links from social media, although they are nofollow, help Google discover and index new content. Including sharing buttons on your pages and promoting your content on social media sites can help get your pages into the index faster.
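As promised under the XML sitemaps item, here is a minimal sitemap sketch illustrating the <changefreq> and <priority> tags. The URLs and values are placeholders, not recommendations, so adjust them to your own pages and update patterns.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <changefreq>daily</changefreq>
        <priority>1.0</priority>
      </url>
      <url>
        <loc>http://www.example.com/widgets/blue-widget.html</loc>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>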
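For the 301-redirect method of canonicalization, here is a rough sketch of the www/non-www case on Apache with mod_rewrite (one of the rewrite engines mentioned above). example.com is a placeholder, and the rules should be tested on a copy of your site before going live.

    RewriteEngine On
    # Send every request for the bare domain to the www version with a
    # permanent (301) redirect, preserving the requested path.
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]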
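Finally, for the automated pinging mentioned under "Updates, feeds, pinging," here is a minimal Python sketch of the standard weblogUpdates.ping XML-RPC call. It assumes Ping-o-Matic's commonly documented XML-RPC endpoint at rpc.pingomatic.com; the site name and URL are placeholders, and you should verify the endpoint and the services it relays to before relying on it.

    import xmlrpc.client  # XML-RPC client from the Python standard library

    # Ping-o-Matic relays a single ping to many blog/search services.
    # (Endpoint assumed here; check Ping-o-Matic's own documentation.)
    server = xmlrpc.client.ServerProxy("http://rpc.pingomatic.com/")

    # weblogUpdates.ping(site_name, site_url) announces that the site has new content.
    response = server.weblogUpdates.ping("Example Site", "http://www.example.com/")

    # The standard response is a struct with 'flerror' (True on failure) and 'message'.
    print(response)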