Duplicate content

Last updated

Duplicate content is a term used in the field of search engine optimization to describe content that appears on more than one web page. The duplicate content can be substantial parts of the content within or across domains and can be either exactly duplicate or closely similar. [1] When multiple pages contain essentially the same content, search engines such as Google and Bing can penalize or cease displaying the copying site in any relevant search results.

Contents

Types

Non-malicious

Non-malicious duplicate content may include variations of the same page, such as versions optimized for normal HTML, mobile devices, or printer-friendliness, or store items that can be shown via multiple distinct URLs. [1] Duplicate content issues can also arise when a site is accessible under multiple subdomains, such as with or without the "www." or where sites fail to handle the trailing slash of URLs correctly. [2] Another common source of non-malicious duplicate content is pagination, in which content and/or corresponding comments are divided into separate pages. [3]

Syndicated content is a popular form of duplicated content. If a site syndicates content from other sites, it is generally considered important to make sure that search engines can tell which version of the content is the original so that the original can get the benefits of more exposure through search engine results. [1] Ways of doing this include having a rel=canonical tag on the syndicated page that points back to the original, NoIndexing the syndicated copy, or putting a link in the syndicated copy that leads back to the original article. If none of these solutions are implemented, the syndicated copy could be treated as the original and gain the benefits. [4]

The number of possible URLs crawled being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

There may be similar content between different web pages in the form of similar product content. This is usually noticed in e-commerce websites where the usage of similar keywords for similar categories of products leads to this form of non-malicious duplicate content. This is often the case when new iterations and versions of products are released but the seller or the e-commerce website mods do not the whole product descriptions. [5]

Malicious

Malicious duplicate content refers to content that is intentionally duplicated in an effort to manipulate search results and gain more traffic. This is known as search spam. There is a number of tools available to verify the uniqueness of the content. [6] In certain cases, search engines penalize websites' and individual offending pages' rankings in the search engine results pages (SERPs) for duplicate content considered "spammy."

Detecting duplicate content

Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others. [7] [8]

Detection of plagiarism can be undertaken in a variety of ways. Human detection is the most traditional form of identifying plagiarism from written work. This can be a lengthy and time-consuming task for the reader [8] and can also result in inconsistencies in how plagiarism is identified within an organization. [9] Text-matching software (TMS), which is also referred to as "plagiarism detection software" or "anti-plagiarism" software, has become widely available, in the form of both commercially available products as well as open-source[ examples needed ] software. TMS does not actually detect plagiarism per se, but instead finds specific passages of text in one document that match text in another document.

Resolutions

If the content has been copied, there are multiple resolutions available to both parties. [10]

A HTTP 301 redirect (301 Moved Permanently) is a method of dealing with duplicate content to redirect users and search engine crawlers to the single pertinent version of the content. [1]

See also

Related Research Articles

<span class="mw-page-title-main">Web crawler</span> Software which systematically browses the World Wide Web

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing.

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system.

Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid traffic rather than direct traffic or paid traffic. Unpaid traffic may originate from different kinds of searches, including image search, video search, academic search, news search, and industry-specific vertical search engines.

Web syndication is making content available from one website to other sites. Most commonly, websites are made available to provide either summaries or full renditions of a website's recently added content. The term may also describe other kinds of content licensing for reuse.

<span class="mw-page-title-main">Metasearch engine</span> ALO.Online information retrieval tool

A metasearch engine is an online information retrieval tool that uses the data of a web search engine to produce its own results. Metasearch engines take input from a user and immediately query search engines for results. Sufficient data is gathered, ranked, and presented to the users.

Doorway pages are web pages that are created for the deliberate manipulation of search engine indexes (spamdexing). A doorway page will affect the index of a search engine by inserting results for particular phrases while sending visitors to a different page. Doorway pages that redirect visitors without their knowledge use some form of cloaking. This usually falls under Black Hat SEO.

URL redirection, also called URL forwarding, is a World Wide Web technique for making a web page available under more than one URL address. When a web browser attempts to open a URL that has been redirected, a page with a different URL is opened. Similarly, domain redirection or domain forwarding is when all pages in a URL domain are redirected to a different domain, as when wikipedia.com and wikipedia.net are automatically redirected to wikipedia.org.

Keyword stuffing is a search engine optimization (SEO) technique, considered webspam or spamdexing, in which keywords are loaded into a web page's meta tags, visible content, or backlink anchor text in an attempt to gain an unfair rank advantage in search engines. Keyword stuffing may lead to a website being temporarily or permanently banned or penalized on major search engines. The repetition of words in meta tags may explain why many search engines no longer use these tags. Nowadays, search engines focus more on the content that is unique, comprehensive, relevant, and helpful that overall makes the quality better which makes keyword stuffing useless, but it is still practiced by many webmasters.

A sitemap is a list of pages of a web site within a domain.

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Sitemaps is a protocol in XML format meant for a webmaster to inform search engines about URLs on a website that are available for web crawling. It allows webmasters to include additional information about each URL: when it was last updated, how often it changes, and how important it is in relation to other URLs of the site. This allows search engines to crawl the site more efficiently and to find URLs that may be isolated from the rest of the site's content. The Sitemaps protocol is a URL inclusion protocol and complements robots.txt, a URL exclusion protocol.

In blogging, a ping is an XML-RPC-based push mechanism by which a weblog notifies a server that its content has been updated. An XML-RPC signal is sent from the weblog to one or more Ping servers, as specified by originating weblog), to notify a list of their "Services" of new content on the weblog.

URL shortening is a technique on the World Wide Web in which a Uniform Resource Locator (URL) may be made substantially shorter and still direct to the required page. This is achieved by using a redirect which links to the web page that has a long URL. For example, the URL "https://example.com/assets/category_B/subcategory_C/Foo/" can be shortened to "https://example.com/Foo", and the URL "https://en.wikipedia.org/wiki/URL_shortening" can be shortened to "https://w.wiki/U". Often the redirect domain name is shorter than the original one. A friendly URL may be desired for messaging technologies that limit the number of characters in a message, for reducing the amount of typing required if the reader is copying a URL from a print source, for making it easier for a person to remember, or for the intention of a permalink. In November 2009, the shortened links of the URL shortening service Bitly were accessed 2.1 billion times.

Rogue security software is a form of malicious software and internet fraud that misleads users into believing there is a virus on their computer and aims to convince them to pay for a fake malware removal tool that actually installs malware on their computer. It is a form of scareware that manipulates users through fear, and a form of ransomware. Rogue security software has been a serious security threat in desktop computing since 2008. An early example that gained infamy was SpySheriff and its clones, such as Nava Shield.

Plagiarism detection or content similarity detection is the process of locating instances of plagiarism or copyright infringement within a work or document. The widespread use of computers and the advent of the Internet have made it easier to plagiarize the work of others.

In the field of search engine optimization (SEO), link building describes actions aimed at increasing the number and quality of inbound links to a webpage with the goal of increasing the search engine rankings of that page or website. Briefly, link building is the process of establishing relevant hyperlinks to a website from external sites. Link building can increase the number of high-quality links pointing to a website, in turn increasing the likelihood of the website ranking highly in search engine results. Link building is also a proven marketing tactic for increasing brand awareness.

Forum spam consists of posts on Internet forums that contains related or unrelated advertisements, links to malicious websites, trolling and abusive or otherwise unwanted information. Forum spam is usually posted onto message boards by automated spambots or manually with unscrupulous intentions with intent to get the spam in front of readers who would not otherwise have anything to do with it intentionally.

XRumer is a piece of software made for spamming online forums and comment sections. It is marketed as a program for search engine optimization and was created by BotmasterLabs. It is able to register and post to forums with the aim of boosting search engine rankings. The program is able to bypass security techniques commonly used by many forums and blogs to deter automated spam, such as account registration, client detection, many forms of CAPTCHAs, and e-mail activation before posting. The program utilises SOCKS and HTTP proxies in an attempt to make it more difficult for administrators to block posts by source IP, and features a proxy checking tool to verify the integrity and anonymity of the proxies used.

A canonical link element is an HTML element that helps webmasters prevent duplicate content issues in search engine optimization by specifying the "canonical" or "preferred" version of a web page. It is described in RFC 6596, which went live in April 2012.

Social spam is unwanted spam content appearing on social networking services, social bookmarking sites, and any website with user-generated content. It can be manifested in many ways, including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, fake friends, and personally identifiable information.

References

  1. 1 2 3 4 "Duplicate content". Google Inc. Retrieved 2016-01-07.
  2. "Duplicate content - Duplicate Content" . Retrieved 2011-12-19.
  3. "Duplicate Content: Causation and Significance". Effective Business Growth. Retrieved 15 May 2017.
  4. Enge, Eric (April 28, 2014). "Syndicated Content: Why, When & How". Search Engine Land. Third Door Media. Retrieved June 25, 2018.
  5. Avoid Penalized By Google On Duplicate Content
  6. Ahmad, Bilal (20 May 2011). "6 Free Duplicate Content Checker Tools". TechMaish.com. Retrieved 15 May 2017.
  7. Culwin, Fintan; Lancaster, Thomas (2001). "Plagiarism, prevention, deterrence and detection". CiteSeerX   10.1.1.107.178 . Archived from the original on 18 April 2021. Retrieved 2022-11-11 via The Higher Education Academy.
  8. 1 2 Bretag, T., & Mahmud, S. (2009). A model for determining student plagiarism: Electronic detection and academic judgement. Journal of University Teaching & Learning Practice, 6(1). Retrieved from http://ro.uow.edu.au/jutlp/vol6/iss1/6
  9. Macdonald, R., & Carroll, J. (2006). Plagiarism—a complex issue requiring a holistic institutional approach. Assessment & Evaluation in Higher Education, 31(2), 233–245. doi:10.1080/02602930500262536
  10. "Have Duplicate Content? It Can Kill Your Rankings". OrangeFox.com. OrangeFox. Retrieved 27 March 2016.