Spamdexing (also known as search engine spam, search engine poisoning, black-hat search engine optimization, search spam or web spam) [1] is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating related and/or unrelated phrases, to manipulate the relevance or prominence of resources indexed in a manner inconsistent with the purpose of the indexing system. [2] [3]
Spamdexing could be considered to be a part of search engine optimization, [4] although there are many SEO methods that improve the quality and appearance of the content of web sites and serve content useful to many users. [5]
Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, search-engine operators can quickly block the results listing from entire websites that use spamdexing, perhaps in response to user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful. Using unethical methods to make websites rank higher in search engine results than they otherwise would is commonly referred to in the SEO (search engine optimization) industry as "black-hat SEO". [6] These methods are more focused on breaking the search-engine-promotion rules and guidelines. In addition to this, the perpetrators run the risk of their websites being severely penalized by the Google Panda and Google Penguin search-results ranking algorithms. [7]
Common spamdexing techniques can be classified into two broad classes: content spam [5] (term spam) and link spam. [3]
The earliest known reference [2] to the term spamdexing is by Eric Convey in his article "Porn sneaks way back on Web", The Boston Herald, May 22, 1996, where he said:
The problem arises when site operators load their Web pages with hundreds of extraneous terms so search engines will list them among legitimate addresses. The process is called "spamdexing," a combination of spamming—the Internet term for sending users unsolicited information—and "indexing." [2]
Keyword stuffing had been used in the past to obtain top search engine rankings and visibility for particular phrases. This method is outdated and adds no value to rankings today. In particular, Google no longer gives good rankings to pages employing this technique.
Hiding text from the visitor is done in many different ways. Text colored to blend with the background, CSS z-index positioning to place text underneath an image — and therefore out of view of the visitor — and CSS absolute positioning to have the text positioned far from the page center are all common techniques. By 2005, many invisible text techniques were easily detected by major search engines.
"Noscript" tags are another way to place hidden content within a page. While they are a valid optimization method for displaying an alternative representation of scripted content, they may be abused, since search engines may index content that is invisible to most visitors.
Sometimes inserted text includes words that are frequently searched (such as "sex"), even if those terms bear little connection to the content of a page, in order to attract traffic to advert-driven pages.
In the past, keyword stuffing was considered to be either a white hat or a black hat tactic, depending on the context of the technique, and the opinion of the person judging it. While a great deal of keyword stuffing was employed to aid in spamdexing, which is of little benefit to the user, keyword stuffing in certain circumstances was not intended to skew results in a deceptive manner. Whether the term carries a pejorative or neutral connotation is dependent on whether the practice is used to pollute the results with pages of little relevance, or to direct traffic to a page of relevance that would have otherwise been de-emphasized due to the search engine's inability to interpret and understand related ideas. This is no longer the case. Search engines now employ themed, related keyword techniques to interpret the intent of the content on a page.
These techniques involve altering the logical view that a search engine has over the page's contents. They all aim at variants of the vector space model for information retrieval on text collections.
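As a rough illustration of why term repetition can pay off under a vector space model, the sketch below scores two pages against a query using raw term-frequency vectors and cosine similarity. It is a toy under stated assumptions: the function names, the sample texts, and the use of unweighted term counts are all illustrative, and real ranking systems use far more elaborate weighting.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map each whitespace-separated term to its raw count (illustrative weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and page texts, invented for this example.
query = tf_vector("cheap flights")
honest = tf_vector("we compare cheap flights from many airlines")
stuffed = tf_vector(
    "cheap flights cheap flights cheap flights cheap flights book now"
)

honest_score = cosine(query, honest)
stuffed_score = cosine(query, stuffed)
```

With these inputs the stuffed page scores higher than the honest one, because repetition pulls its vector toward the query terms. This is the weakness that content spam tries to exploit.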
Keyword stuffing is a search engine optimization (SEO) technique in which keywords are loaded into a web page's meta tags, visible content, or backlink anchor text in an attempt to gain an unfair rank advantage in search engines. Keyword stuffing may lead to a website being temporarily or permanently banned or penalized on major search engines. [8] The repetition of words in meta tags may explain why many search engines no longer use these tags. Search engines now focus more on content that is unique, comprehensive, relevant, and helpful, which makes keyword stuffing ineffective, although many webmasters still practice it.[ citation needed ]
Many major search engines have implemented algorithms that recognize keyword stuffing, and reduce or eliminate any unfair search advantage that the tactic may have been intended to gain, and oftentimes they will also penalize, demote or remove websites from their indexes that implement keyword stuffing.
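One simple signal such algorithms might look at is keyword density, the share of a page's words taken up by its most frequent term. The sketch below is a minimal heuristic under assumptions: the tokenizer, the function names, and the 0.3 threshold are hypothetical choices for illustration, not values any search engine is known to use.

```python
import re
from collections import Counter


def top_term_density(text):
    """Return (most frequent term, its share of all tokens) for a page's text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return "", 0.0
    term, count = Counter(tokens).most_common(1)[0]
    return term, count / len(tokens)


def looks_stuffed(text, threshold=0.3):
    """Flag text whose top term exceeds a hypothetical density threshold."""
    _, density = top_term_density(text)
    return density > threshold


term, density = top_term_density("Buy shoes, shoes, SHOES and more shoes online")
```

A production system would at least ignore common stop words and combine density with other signals, since short legitimate pages can also repeat a term often.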
Changes and algorithms specifically intended to penalize or ban sites using keyword stuffing include the Google Florida update (November 2003), Google Panda (February 2011), [9] Google Hummingbird (August 2013), [10] and Bing's September 2014 update. [11]
Headlines in online news sites are increasingly packed with just the search-friendly keywords that identify the story. Traditional reporters and editors frown on the practice, but it is effective in optimizing news stories for search. [12]
Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "no frame" sections, alt attributes, zero-sized DIVs, and "no script" sections. People manually screening red-flagged websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility. [13]
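A first-pass screen for some of these hiding tricks can be sketched with Python's standard `html.parser` module. The patterns and names below are assumptions chosen for illustration. The sketch inspects only inline `style` attributes, so it misses text hidden through external stylesheets or matching colors. A match is only a reason for manual review, because hidden text also has legitimate accessibility uses.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns for inline styles that take text out of view.
HIDING_PATTERNS = [
    re.compile(r"display\s*:\s*none"),
    re.compile(r"visibility\s*:\s*hidden"),
    re.compile(r"font-size\s*:\s*0(?![.\d])"),           # zero font size
    re.compile(r"(left|text-indent)\s*:\s*-\d{4,}px"),   # pushed far off-screen
]


class HiddenStyleFinder(HTMLParser):
    """Collect (tag, style) pairs whose inline style matches a hiding pattern."""

    def __init__(self):
        super().__init__()
        self.flagged = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").lower()
        if any(p.search(style) for p in HIDING_PATTERNS):
            self.flagged.append((tag, style))


def find_hidden_styles(html):
    finder = HiddenStyleFinder()
    finder.feed(html)
    return finder.flagged


sample = (
    '<p>Welcome</p>'
    '<div style="display:none">cheap pills cheap pills</div>'
    '<span style="font-size: 12px">visible text</span>'
)
flagged = find_hidden_styles(sample)
```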
Meta-tag stuffing involves repeating keywords in the meta tags, and using meta keywords that are unrelated to the site's content. This tactic has become ineffective: in September 2009, Google declared that it does not use the keywords meta tag in its online search ranking. [14]
"Gateway" or doorway pages are low-quality web pages created with very little content, which are instead stuffed with very similar keywords and phrases. They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page; autoforwarding can also be used for this purpose. In 2006, Google ousted vehicle manufacturer BMW for using "doorway pages" to the company's German site, BMW.de. [15]
Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website.[ citation needed ] The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank original websites for their own information and organization names.
Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers[ citation needed ] or automated using a thesaurus database or an artificial neural network.
Similarly to article spinning, some sites use machine translation to render their content in several languages, with no human editing, resulting in unintelligible texts that nonetheless continue to be indexed by search engines, thereby attracting traffic.
Link spam is defined as links between pages that are present for reasons other than merit. [16] Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm.[ citation needed ]
Link farms are tightly-knit networks of websites that link to each other for the sole purpose of exploiting the search engine ranking algorithms. These are also known facetiously as mutual admiration societies. [17] Use of link farms has declined greatly since the launch of Google's first Panda Update in February 2011, which introduced significant improvements in its spam-detection algorithm.
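To see why a link farm can help under a purely link-based ranking, the sketch below runs a simplified PageRank-style power iteration on a tiny invented graph. It is a textbook-style approximation with a hypothetical damping factor of 0.85 and made-up page names, not the algorithm any search engine actually deploys.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank evenly over all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented graph: "honest" and "spam" each get one organic link from "hub",
# but "spam" is also linked from three farm pages that link to each other.
links = {
    "hub": ["honest", "spam"],
    "honest": ["hub"],
    "spam": ["hub"],
    "farm1": ["farm2", "farm3", "spam"],
    "farm2": ["farm1", "farm3", "spam"],
    "farm3": ["farm1", "farm2", "spam"],
}
ranks = pagerank(links)
```

In this graph the spam page ends up ranked above the honest page even though both receive the same organic link, because the farm pages keep rank circulating among themselves and pass part of it on.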
Private blog networks (PBNs) are groups of authoritative websites used as a source of contextual links that point to the owner's main website to achieve higher search engine ranking. Owners of PBN websites use expired domains or auction domains that have backlinks from high-authority websites. Since 2014, Google has targeted and penalized PBN users on several occasions with massive deindexing campaigns. [18]
Putting hyperlinks where visitors will not see them is used to increase link popularity. Highlighted link text can help rank a webpage higher for matching that phrase.
A Sybil attack is the forging of multiple identities for malicious intent, named after the famous dissociative identity disorder patient and the book about her that shares her name, "Sybil". [19] [20] A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).
Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. Often these "splogs" are designed in a misleading manner that will give the effect of a legitimate website but upon close inspection will often be written using spinning software or be very poorly written with barely readable content. They are similar in nature to link farms. [21] [22]
Guest blog spam is the process of placing guest blogs on websites for the sole purpose of gaining a link to another website or websites. Unfortunately, these are often confused with legitimate forms of guest blogging with other motives than placing links. This technique was made famous by Matt Cutts, who publicly declared "war" against this form of link spam. [23]
Some link spammers utilize expired domain crawler software or monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their pages. However, it is possible but not confirmed that Google resets the link data on expired domains. [ citation needed ] To maintain all previous Google ranking data for the domain, it is advisable that a buyer grab the domain before it is "dropped".
Some of these techniques may be applied for creating a Google bomb—that is, to cooperate with other users to boost the ranking of a particular page for a particular query.
Web sites that can be edited by users can be used by spamdexers to insert links to spam sites if the appropriate anti-spam measures are not taken.
Automated spambots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block or at least slow down spambots.
Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming where automated software creates nonsense posts with links that are usually irrelevant and unwanted.
Comment spam is a form of link spam that has arisen in web pages that allow dynamic user editing such as wikis, blogs, and guestbooks. It can be problematic because agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links. [24]
Wiki spam is when a spammer uses the open editability of wiki systems to place links from the wiki site to the spam site.
Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the referee), by following a link from another web page (the referrer), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer log of those sites that have referrer logs. Since some Web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.
Because of the large amount of spam posted to user-editable webpages, Google proposed a "nofollow" tag that could be embedded with links. A link-based search engine, such as Google's PageRank system, will not use the link to increase the score of the linked website if the link carries a nofollow tag. This ensures that spamming links to user-editable websites will not raise the site's ranking with search engines. Nofollow is used by several major websites, including WordPress, Blogger and Wikipedia.[ citation needed ]
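How a link-based ranker might honor nofollow can be sketched with Python's standard `html.parser` module. The class and attribute names below are hypothetical. The sketch only separates links into those that would count toward a score and those that would be ignored, and it makes no claim about how any real crawler is implemented.

```python
from html.parser import HTMLParser


class EndorsedLinkCollector(HTMLParser):
    """Split a page's links into endorsed ones and those marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.endorsed = []
        self.ignored = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel holds space-separated link types, e.g. rel="nofollow noopener".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.ignored.append(href)
        else:
            self.endorsed.append(href)


collector = EndorsedLinkCollector()
collector.feed(
    '<a href="https://example.org/article">source</a>'
    '<a rel="nofollow noopener" href="https://spam.example/pills">comment link</a>'
)
```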
A mirror site is the hosting of multiple websites with conceptually similar content but using different URLs. Some search engines give a higher rank to results where the keyword searched for appears in the URL.
URL redirection is the taking of the user to another page without his or her intervention, e.g., using META refresh tags, Flash, JavaScript, Java or server-side redirects. However, a 301 redirect, or permanent redirect, is not considered malicious behavior.
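Of the mechanisms listed, META refresh is the easiest to spot in static HTML. The sketch below, which uses hypothetical names and a deliberately loose pattern for the `content` attribute, extracts the delay and target of any meta refresh found in a page. It says nothing about whether a given redirect is malicious.

```python
import re
from html.parser import HTMLParser

# Loose pattern for content="<seconds>; url=<target>" (an assumption, not a full parser).
REFRESH_RE = re.compile(
    r"^\s*(\d+)\s*[;,]\s*url\s*=\s*['\"]?([^'\"\s]+)", re.IGNORECASE
)


class MetaRefreshFinder(HTMLParser):
    """Collect (delay_seconds, target_url) pairs from meta refresh tags."""

    def __init__(self):
        super().__init__()
        self.redirects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        match = REFRESH_RE.match(attr.get("content") or "")
        if match:
            self.redirects.append((int(match.group(1)), match.group(2)))


def find_meta_refresh(html):
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.redirects


redirects = find_meta_refresh(
    '<head><meta http-equiv="refresh" '
    'content="0; url=https://example.com/landing"></head>'
)
```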
Cloaking refers to any of several means to serve a page to the search-engine spider that is different from that seen by human users. It can be an attempt to mislead search engines regarding the content on a particular web site. Cloaking, however, can also be used to ethically increase accessibility of a site to users with disabilities or provide human users with content that search engines aren't able to process or parse. It is also used to deliver content based on a user's location; Google itself uses IP delivery, a form of cloaking, to deliver results. Another form of cloaking is code swapping, i.e., optimizing a page for top ranking and then swapping another page in its place once a top ranking is achieved. Google refers to this type of redirect as a sneaky redirect. [25]
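One way to probe for cloaking is to request the same URL as a crawler and as a browser and compare what comes back. The sketch below assumes a caller-supplied `fetch(url, user_agent)` function, hypothetical User-Agent strings, and an arbitrary 0.5 similarity threshold. It would not catch cloaking keyed on IP address, and legitimate personalization can also trigger it.

```python
import re


def token_set(text):
    """Lowercased set of alphanumeric tokens in a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of the token sets of two texts."""
    set_a, set_b = token_set(a), token_set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def looks_cloaked(fetch, url, threshold=0.5):
    """Compare content served to a crawler-like and a browser-like User-Agent.

    `fetch` is a hypothetical callable (url, user_agent) -> page text.
    """
    crawler_view = fetch(url, "ExampleBot/1.0")
    browser_view = fetch(url, "Mozilla/5.0")
    return jaccard(crawler_view, browser_view) < threshold


def fake_cloaking_fetch(url, user_agent):
    """Stand-in for a real HTTP client, used only to exercise the sketch."""
    if "Bot" in user_agent:
        return "cheap pills cheap pills buy pills online now"
    return "welcome to our gardening blog"


def fake_honest_fetch(url, user_agent):
    return "welcome to our gardening blog"
```

Passing the fetch function in as a parameter keeps the comparison logic testable without network access; a real check would plug in an HTTP client that sets the User-Agent header.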
Spamdexed pages are sometimes eliminated from search results by the search engine.
Users can employ search operators for filtering. For Google, a keyword preceded by "-" (minus) omits from the search results any sites that contain the keyword in their pages or in their URLs. As an example, the search "-<unwanted site>" will eliminate sites that contain the word "<unwanted site>" in their pages and pages whose URL contains "<unwanted site>".
Users could also use the Google Chrome extension "Personal Blocklist (by Google)", launched by Google in 2011 as part of countermeasures against content farming. [26] Via the extension, users could block a specific page, or set of pages from appearing in their search results. As of 2021, the original extension appears to be removed, although similar-functioning extensions may be used.
Possible solutions to overcome search-redirection poisoning redirecting to illegal internet pharmacies include notification of operators of vulnerable legitimate domains. Further, manual evaluation of SERPs, previously published link-based and content-based algorithms as well as tailor-made automatic detection and classification engines can be used as benchmarks in the effective identification of pharma scam campaigns. [27]
Meta elements are tags used in HTML and XHTML documents to provide structured metadata about a Web page. They are part of a web page's head section. Multiple Meta elements with different attributes can be used on the same page. Meta elements can be used to specify page description, keywords and any other metadata not provided through the other head elements and attributes.
Search engine optimization (SEO) is the process of improving the quality and quantity of website traffic to a website or a web page from search engines. SEO targets unpaid search traffic rather than direct traffic, referral traffic, social media traffic, or paid traffic.
On the World Wide Web, a link farm is any group of websites that all hyperlink to other sites in the group for the purpose of increasing SEO rankings. In graph theoretic terms, a link farm is a clique. Although some link farms can be created by hand, most are created through automated programs and services. A link farm is a form of spamming the index of a web search engine. Other link exchange systems are designed to allow individual websites to selectively exchange links with other relevant websites, and are not considered a form of spamdexing.
Cloaking is a search engine optimization (SEO) technique in which the content presented to the search engine spider is different from that presented to the user's browser. This is done by delivering content based on the IP addresses or the User-Agent HTTP header of the user requesting the page. When a user is identified as a search engine spider, a server-side script delivers a different version of the web page, one that contains content not present on the visible page, or that is present but not searchable. The purpose of cloaking is sometimes to deceive search engines so they display the page when it would not otherwise be displayed. However, it can also be a functional technique for informing search engines of content they would not otherwise be able to locate because it is embedded in non-textual containers, such as video or certain Adobe Flash components. Since 2006, better methods of accessibility, including progressive enhancement, have been available, so cloaking is no longer necessary for regular SEO.
Spam in blogs is a form of spamdexing which utilizes internet sites that allow content to be publicly posted, in order to artificially inflate their website ranking by linking back to their web pages. Backlinking helps search algorithms determine the popularity of a web page, which plays a major role for search engines like Google and Microsoft Bing to decide a web page ranking on a certain search query. This helps the spammer's website to list ahead of other sites for certain searches, which helps them to increase the number of visitors to their website.
A metasearch engine is an online information retrieval tool that uses the data of a web search engine to produce its own results. Metasearch engines take input from a user and immediately query search engines for results. Sufficient data is gathered, ranked, and presented to the users.
From the point of view of a given web resource (referent), a backlink is a regular hyperlink on another web resource that points to the referent. A web resource may be a website, web page, or web directory.
Doorway pages are web pages that are created for the deliberate manipulation of search engine indexes (spamdexing). A doorway page will affect the index of a search engine by inserting results for particular phrases while sending visitors to a different page. Doorway pages that redirect visitors without their knowledge use some form of cloaking. This usually falls under Black Hat SEO.
URL redirection, also called URL forwarding, is a World Wide Web technique for making a web page available under more than one URL address. When a web browser attempts to open a URL that has been redirected, a page with a different URL is opened. Similarly, domain redirection or domain forwarding is when all pages in a URL domain are redirected to a different domain, as when wikipedia.com and wikipedia.net are automatically redirected to wikipedia.org.
The anchor text, link label, or link text is the visible, clickable text in an HTML hyperlink. The term "anchor" was used in older versions of the HTML specification for what is currently referred to as the "a element", or <a>. The HTML specification does not have a specific term for anchor text, but refers to it as "text that the a element wraps around". In XML terms, the anchor text is the content of the element, provided that the content is text.
A scraper site is a website that copies content from other websites using web scraping. The content is then mirrored with the goal of creating revenue, usually through advertising and sometimes by selling user data.
The sandbox effect is a theory about the way Google ranks web pages in its index. It is the subject of much debate—its existence has been written about since 2004, but not confirmed, with several statements to the contrary.
An SEO contest is a prize activity that challenges search engine optimization (SEO) practitioners to achieve high ranking under major search engines such as Google, Yahoo, and MSN using certain keyword(s). This type of contest is controversial because it often leads to massive amounts of link spamming as participants try to boost the rankings of their pages by any means available. The SEO competitors hold the activity without the promotion of a product or service in mind, or they may organize a contest in order to market something on the Internet. Participants can showcase their skills and potentially discover and share new techniques for promoting websites.
nofollow is a setting on a web page hyperlink that directs search engines not to use the link for page ranking calculations. It is specified in the page as a type of link relation; that is: <a rel="nofollow" ...>. Because search engines often calculate a site's importance according to the number of hyperlinks from other sites, the nofollow setting allows website authors to indicate that the presence of a link is not an endorsement of the target site's importance.
A search engine is a software system that provides hyperlinks to web pages and other relevant information on the Web in response to a user's query. The user inputs a query within a web browser or a mobile app, and the search results are often a list of hyperlinks, accompanied by textual summaries and images. Users also have the option of limiting the search to a specific type of results, such as images, videos, or news.
A search engine results page (SERP) is a webpage that is displayed by a search engine in response to a query by a user. The main component of a SERP is the listing of results that are returned by the search engine in response to a keyword query.
Adversarial information retrieval is a topic in information retrieval related to strategies for working with a data source where some portion of it has been manipulated maliciously. Tasks can include gathering, indexing, filtering, retrieving and ranking information from such a data source. Adversarial IR includes the study of methods to detect, isolate, and defeat such manipulation.
In the field of search engine optimization (SEO), link building describes actions aimed at increasing the number and quality of inbound links to a webpage with the goal of increasing the search engine rankings of that page or website. Briefly, link building is the process of establishing relevant hyperlinks to a website from external sites. Link building can increase the number of high-quality links pointing to a website, in turn increasing the likelihood of the website ranking highly in search engine results. Link building is also a proven marketing tactic for increasing brand awareness.
White fonting is the practice of inserting hidden keywords into the body of an electronic document, in order to influence the actions of a search program reviewing that document. The name white fonting comes from the practice of adding keywords to a webpage, using a white font on a white background, in an effort to hide the additional keywords from sight.
Trademark stuffing is a form of keyword stuffing, an unethical search engine optimization method used by webmasters and Internet marketers in order to manipulate search engine ranking results served by websites such as Google, Yahoo! and Microsoft Bing. A key characteristic of trademark stuffing is the intent of the infringer to confuse search engines and Internet users into thinking a website or web page is owned or otherwise authorized by the trademark owner. Trademark stuffing does not include using trademarks on third-party website pages within the boundaries of fair use. When used effectively, trademark stuffing enables infringing websites to capture search engine traffic that may have otherwise been received by an authorized website or trademark owner.