Machine learning and algorithm-based intelligence are impressive, but they often lack something that comes naturally to humans: common sense.

It is well known that putting the same content on multiple pages produces duplicate content. But what if you create pages about similar things that have important differences? Algorithms flag them as duplicates, though humans have no trouble telling such pages apart:

- E-commerce: similar products with multiple variations or significant differences
- Travel: hotel branches and destination packages with similar content
- Job ads: comprehensive listings of the same items
- Business: pages for local branches offering the same service in different regions

How does this happen? How can you find the problem? What can you do about it?

Risk of duplicate content

Duplicate content interferes with your ability to make your site visible to search users in the following ways:
- Loss of rankings, because unique pages unintentionally compete for the same keywords
- Inability to rank pages in a cluster, because Google has selected one page as the canonical version
- Loss of site authority due to large amounts of thin content

How machines identify duplicate content

Google uses an algorithm to determine whether two pages, or parts of pages, are duplicate content, which Google defines as content that is "appreciably similar."

Google's similarity detection is based on its patented Simhash algorithm, which analyzes blocks of content on a web page. It calculates a unique identifier for each block and combines them into a hash, or "fingerprint," for each page.

Scalability is important because the number of web pages is huge. Simhash is currently the only viable way to find duplicate content on a large scale.

Simhash fingerprints are:

- Cheap to calculate: they are established in a single crawl of the page.
- Easy to compare, thanks to their fixed length.
- Able to find near-duplicates: unlike many other algorithms, Simhash equates small changes on a page with small changes in the hash.

This last point means that the difference between any two fingerprints can be measured algorithmically and expressed as a percentage.

To reduce the cost of evaluating every pair of pages, Google uses the following techniques:

- Clustering: pages that are sufficiently similar are grouped together, and only fingerprints within a cluster need to be compared, because everything else is already classified as different.
- Estimation: for very large clusters, an average similarity is applied after a certain number of fingerprint pairs have been calculated.

Comparison of page fingerprints. Source: Near-duplicate document detection for web crawling (Google patent)

Finally, Google uses a weighted similarity rate that excludes certain blocks of identical content (boilerplate such as headers, navigation, sidebars, footers and disclaimers). It also takes the subject of the page into account, using n-gram analysis to determine which words on the page appear most often.
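
To make the fingerprint idea concrete, here is a minimal sketch of a Simhash-style fingerprint built from word n-grams, with the difference between two fingerprints expressed as a percentage. It is an illustration only, not Google's implementation: the 64-bit width, the use of SHA-256 as the per-feature hash, the choice of two-word n-grams, and the function names (`features`, `simhash`, `hamming_distance`, `similarity`) are all assumptions made for this example. It also does no boilerplate exclusion or clustering.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # fingerprint width (an assumption for this sketch)


def features(text, n=2):
    """Count the lowercase word n-grams in a piece of text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        grams = [" ".join(words)] if words else []
    else:
        grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams)


def feature_hash(feature):
    """Map one feature to a 64-bit integer (SHA-256 is an arbitrary choice)."""
    digest = hashlib.sha256(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text, n=2):
    """Build a fixed-length fingerprint from weighted per-bit votes."""
    votes = [0] * BITS
    for feature, weight in features(text, n).items():
        h = feature_hash(feature)
        for bit in range(BITS):
            # A set bit votes +weight, a clear bit votes -weight.
            votes[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Share of matching bits between two fingerprints, as a percentage."""
    return 100.0 * (BITS - hamming_distance(a, b)) / BITS


# Hypothetical page texts, used only to show how the functions are called.
page_a = "Blue cotton t-shirt, size M, free shipping on all orders over 50 euros"
page_b = "Red cotton t-shirt, size M, free shipping on all orders over 50 euros"
print(f"{similarity(simhash(page_a), simhash(page_b)):.1f}% similar")
```

Because every page is reduced to the same fixed number of bits, comparing two pages costs one XOR and a bit count regardless of page length, which is what makes this kind of fingerprint practical at scale.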