Image: Yugos at the factory (www.autosoviet.altervista.org; original source unclear)

Let's talk real quickly about duplicate content. Many CMSs (content management systems), including (but not limited to) platforms like Magento and WordPress, have a tendency to create several URLs for the exact same content. This duplication can be caused by any number of things: category page sorting options, print-friendly versions of pages, cross-merchandised products and plenty more.

This duplication can create a real problem for search engines like Google. For one thing, it bogs down the crawler, forcing it to waste its crawl budget (which I touched on briefly in a previous post) on junk pages instead of valuable ones. It can also be difficult for the search engine to tell which URL it should send users to, especially if people are linking to multiple versions. So naturally, cleaning this up should be a high priority for any webmaster.

There are a few tools at the webmaster's disposal to help:

  • Canonical tags: Canonical tags are added to the head section of the page's code and tell search engines which version of a page to show by linking to the preferred URL (see the first sketch after this list). They can be very handy for managing duplicate content, although they are perhaps the hardest of these tools to use correctly. It's all too common to see canonical tags that point to the exact duplicate content they're supposed to be eliminating.
  • Meta robots tags: Like canonical tags, meta robots tags are added to the head section of the page's code. They contain directives for the search engine crawler, telling it whether to index the page and whether to follow the links on it (I generally recommend allowing crawlers to follow links); the second sketch below shows one. Further reading on the meta robots tag
  • Rel=Prev/Next: Unlike the two methods above, these directives have a very specific purpose. They're added to link tags in the head section of pages that belong to a paginated series, such as ecommerce category pages or articles that span several pages, to help search engines understand the relationship between the pages in the series (third sketch below). Google recommends using them, but also insists it is pretty good at figuring out pagination without them. Further reading on rel=prev/next
  • Robots.txt: This is a file uploaded to the root of your site (the top-level folder that contains all of its folders and files). In it, you tell crawlers which pages they are not allowed to crawl, and you can specify not only the content to block but also which crawler to block it from (see the last sketch below). The file can be a bit like a machete, though: great for putting the kibosh on large sections of duplicate content, but somewhat unwieldy when you need finer control. Further reading on robots.txt
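
To make the canonical tag concrete, here is a minimal sketch of what it looks like in a page's head section. The example.com URLs are hypothetical; the idea is that a sorted, duplicate view of a category page points back to the preferred, unparameterized URL:

    <!-- In the <head> of https://www.example.com/widgets?sort=price (a duplicate view) -->
    <!-- Tell search engines that the plain category URL is the one to index and show -->
    <link rel="canonical" href="https://www.example.com/widgets" />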
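
A meta robots tag also lives in the head section. This sketch assumes a hypothetical print-friendly duplicate page that you want kept out of the index while still letting the crawler follow its links:

    <!-- In the <head> of a print-friendly duplicate page -->
    <!-- Don't index this page, but do follow the links it contains -->
    <meta name="robots" content="noindex, follow" />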
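
For rel=prev/next, the link tags go in the head of each page in the paginated series. This sketch assumes a hypothetical three-page category and shows the tags you would place on page 2:

    <!-- In the <head> of https://www.example.com/widgets?page=2 -->
    <link rel="prev" href="https://www.example.com/widgets?page=1" />
    <link rel="next" href="https://www.example.com/widgets?page=3" />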
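
Finally, a robots.txt sketch. The paths are hypothetical; the point is that rules are grouped by user-agent, so you can block content from every crawler or from one crawler in particular. Note that a crawler with its own group ignores the * group, so shared rules have to be repeated there:

    # Applies to every crawler that doesn't have its own group below
    User-agent: *
    Disallow: /print/

    # Googlebot reads only this group, so shared rules are repeated here
    User-agent: Googlebot
    Disallow: /print/
    Disallow: /internal-search/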

All of these are great for fighting duplicate content when you can't stop it from being created in the first place. Note, however, that it's very easy to use these methods incorrectly and do significant damage to your website's rankings. Only people who have a good understanding of how these tools work should attempt to use them.