
Spamdexing (also known as search engine spam, search engine poisoning, black-hat search engine optimization, search spam or web spam) is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed, in a manner inconsistent with the purpose of the indexing system ("Word Spy - spamdexing" (definition), March 2003). Spamdexing could be considered to be a part of search engine optimization, although there are many search engine optimization methods that improve the quality and appearance of the content of web sites and serve content useful to many users.


Overview

Search engines use a variety of algorithms to determine relevancy ranking. Some of these include determining whether the search term appears in the body text or URL of a web page. Many search engines check for instances of spamdexing and will remove suspect pages from their indexes. Also, search-engine operators can quickly block the results listing from entire websites that use spamdexing, perhaps in response to user complaints of false matches. The rise of spamdexing in the mid-1990s made the leading search engines of the time less useful.

Using unethical methods to make websites rank higher in search engine results than they otherwise would is commonly referred to in the SEO (search engine optimization) industry as "black-hat SEO". These methods focus on breaking the search-engine-promotion rules and guidelines. In addition, the perpetrators run the risk of their websites being severely penalized by the Google Panda and Google Penguin search-results ranking algorithms.

Common spamdexing techniques can be classified into two broad classes: ''content spam'' (or ''term spam'') and ''link spam''.


History

The earliest known reference to the term ''spamdexing'' is by Eric Convey in his article "Porn sneaks way back on Web," The Boston Herald, May 22, 1996, where he said:
The problem arises when site operators load their Web pages with hundreds of extraneous terms so search engines will list them among legitimate addresses. The process is called "spamdexing," a combination of spamming
— the Internet term for sending users unsolicited information — and " indexing."


Content spam

These techniques involve altering the logical view that a search engine has over the page's contents. They all aim at variants of the vector space model for information retrieval on text collections.
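As a rough illustration of why the vector space model is a target, the sketch below scores pages against a query using raw term counts and cosine similarity. It is a simplified textbook form, not the scoring used by any particular search engine; the sample texts and the function names `term_vector` and `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a text to raw term counts (a bag-of-words vector)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and pages, chosen only to illustrate the effect.
query = term_vector("cheap flights")
honest_page = term_vector("we compare cheap flights from many airlines")
stuffed_page = term_vector(
    "cheap flights cheap flights cheap flights from many airlines"
)

honest_score = cosine_similarity(query, honest_page)
stuffed_score = cosine_similarity(query, stuffed_page)
print(honest_score, stuffed_score)
```

Repeating the query terms pulls the page vector toward the query vector, so the stuffed page scores higher even though it carries no more information. This is the weakness that content spam tries to exploit in naive term-based ranking.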


Keyword stuffing

Keyword stuffing involves the calculated placement of keywords within a page to raise the keyword count, variety, and density of the page. This makes a page appear relevant to a web crawler, so that it is more likely to be found. Example: A promoter of a Ponzi scheme wants to attract web surfers to a site where he advertises his scam. He places hidden text appropriate for a fan page of a popular music group on his page, hoping that the page will be listed as a fan site and receive many visits from music lovers. Older versions of indexing programs simply counted how often a keyword appeared, and used that to determine relevance levels. Most modern search engines have the ability to analyze a page for keyword stuffing and determine whether the frequency is consistent with other sites created specifically to attract search engine traffic. Also, large webpages are truncated, so that massive dictionary lists cannot be indexed on a single webpage. (However, spammers can circumvent this webpage-size limitation merely by setting up multiple webpages, either independently or linked to each other.)
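The keyword-density idea can be sketched in a few lines. This is only a minimal illustration: the tokenizing rule, the function names and the 15% threshold are assumptions for this example, not values any search engine is known to use.

```python
import re
from collections import Counter


def keyword_density(text, keyword):
    """Fraction of the words in `text` that equal `keyword` (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)


def looks_stuffed(text, keyword, threshold=0.15):
    """Flag a page whose density for `keyword` exceeds a hypothetical threshold."""
    return keyword_density(text, keyword) > threshold


# Hypothetical page texts.
normal = "Our band fan page lists tour dates and album reviews"
stuffed = "band band band tickets band band"
print(looks_stuffed(normal, "band"), looks_stuffed(stuffed, "band"))
```

A real detector would compare the density against what is typical for comparable pages rather than against a fixed cut-off, as the paragraph above suggests.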


Hidden or invisible text

Unrelated hidden text is disguised by making it the same color as the background, using a tiny font size, or hiding it within HTML code such as "no frame" sections, alt attributes, zero-sized DIVs, and "no script" sections. People manually screening red-flagged websites for a search-engine company might temporarily or permanently block an entire website for having invisible text on some of its pages. However, hidden text is not always spamdexing: it can also be used to enhance accessibility.
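One way a screening tool might surface candidate hidden text is to look for inline styles that suppress rendering. The sketch below uses Python's standard `html.parser` and checks only two inline-style hints (`display:none` and `visibility:hidden`). It ignores external stylesheets, background-matching colors and tiny fonts, so it is a partial heuristic under those assumptions; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Inline-style fragments treated as hiding hints (an assumption of this sketch).
HIDING_HINTS = ("display:none", "visibility:hidden")


class HiddenTextFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # one bool per open element: inside hidden markup?
        self.hidden_text = []  # text fragments found inside hidden elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        hides = any(hint in style for hint in HIDING_HINTS)
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(hides or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] and data.strip():
            self.hidden_text.append(data.strip())


def find_hidden_text(html):
    """Return the text fragments that sit inside inline-hidden elements."""
    finder = HiddenTextFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

Because hidden text also has legitimate uses such as accessibility, output like this is better treated as input for human review than as proof of spamdexing.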


Meta-tag stuffing

This involves repeating keywords in the meta tags, and using meta keywords that are unrelated to the site's content. This tactic has been ineffective since 2005.


Doorway pages

"Gateway" or
doorway pages are low-quality web pages created with very little content, which are instead stuffed with very similar keywords and phrases. They are designed to rank highly within the search results, but serve no purpose to visitors looking for information. A doorway page will generally have "click here to enter" on the page; autoforwarding can also be used for this purpose. In 2006, Google removed vehicle manufacturer BMW from its index for using "doorway pages" to the company's German site, BMW.de.


Scraper sites

Scraper sites are created using various programs designed to "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission. Such websites are generally full of advertising (such as pay-per-click ads), or they redirect the user to other sites. It is even feasible for scraper sites to outrank original websites for their own information and organization names.


Article spinning

Article spinning involves rewriting existing articles, as opposed to merely scraping content from other sites, to avoid penalties imposed by search engines for duplicate content. This process is undertaken by hired writers or automated using a thesaurus database or a neural network.


Machine translation

Similarly to article spinning, some sites use machine translation to render their content in several languages, with no human editing, resulting in unintelligible texts that nonetheless continue to be indexed by search engines, thereby attracting traffic.


Link spam

Link spam is defined as links between pages that are present for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. These techniques also aim at influencing other link-based ranking techniques such as the HITS algorithm.


Link farms

Link farms are tightly knit networks of websites that link to each other for the sole purpose of exploiting the search engine ranking algorithms. These are also known facetiously as ''mutual admiration societies''. Use of link farms declined greatly after the launch of Google's first Panda Update in February 2011, which introduced significant improvements in its spam-detection algorithm.
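The effect of a link farm on a link-based ranking can be seen with a textbook power-iteration form of PageRank. The sketch below is a simplified model, not Google's production algorithm; the damping factor 0.85, the iteration count and the tiny example graphs are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# A small web in which nobody links to "target".
base = {"a": ["b"], "b": ["c"], "c": ["a"], "target": ["a"]}

# The same web plus five farm pages that exchange links with "target".
farm_pages = [f"farm{i}" for i in range(5)]
farmed = {"a": ["b"], "b": ["c"], "c": ["a"], "target": list(farm_pages)}
for page in farm_pages:
    farmed[page] = ["target"]

base_rank = pagerank(base)
farmed_rank = pagerank(farmed)
print(base_rank["target"], farmed_rank["target"])
```

In this toy graph the mutually linking farm traps rank that would otherwise flow elsewhere, so the score of "target" rises sharply without any independent site linking to it.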


Private blog networks

Private blog networks (PBNs) are groups of authoritative websites used as a source of contextual links that point to the owner's main website to achieve higher search engine ranking. Owners of PBN websites use expired domains or auction domains that have backlinks from high-authority websites. Since 2014, Google has targeted and penalized PBN users on several occasions with massive deindexing campaigns.


Hidden links

Putting hyperlinks where visitors will not see them is used to increase link popularity. Highlighted link text can help rank a webpage higher for matching that phrase.


Sybil attack

A Sybil attack is the forging of multiple identities for malicious intent, named after the famous dissociative identity disorder patient and the book about her that shares her name, "Sybil". A spammer may create multiple web sites at different domain names that all link to each other, such as fake blogs (known as spam blogs).


Spam blogs

Spam blogs are blogs created solely for commercial promotion and the passage of link authority to target sites. These "splogs" are often designed to look like legitimate websites, but close inspection often shows that they were written with spinning software or are very poorly written, with barely readable content. They are similar in nature to link farms.


Guest blog spam

Guest blog spam is the process of placing guest blogs on websites for the sole purpose of gaining a link to another website or websites. Unfortunately, these are often confused with legitimate forms of guest blogging that have motives other than placing links. This technique was made famous by Matt Cutts, who publicly declared "war" against this form of link spam.


Buying expired domains

Some link spammers utilize expired domain crawler software or monitor DNS records for domains that will expire soon, then buy them when they expire and replace the pages with links to their pages. However, it is possible but not confirmed that Google resets the link data on expired domains. To maintain all previous Google ranking data for the domain, it is advisable that a buyer grab the domain before it is "dropped". Some of these techniques may be applied for creating a Google bomb — that is, to cooperate with other users to boost the ranking of a particular page for a particular query.


Using world-writable pages

Web sites that can be edited by users can be used by spamdexers to insert links to spam sites if the appropriate anti-spam measures are not taken. Automated spambots can rapidly make the user-editable portion of a site unusable. Programmers have developed a variety of automated spam prevention techniques to block or at least slow down spambots.


Spam in blogs

Spam in blogs is the placing or solicitation of links randomly on other sites, placing a desired keyword into the hyperlinked text of the inbound link. Guest books, forums, blogs, and any site that accepts visitors' comments are particular targets and are often victims of drive-by spamming where automated software creates nonsense posts with links that are usually irrelevant and unwanted.


Comment spam

Comment spam is a form of link spam that has arisen in web pages that allow dynamic user editing such as wikis, blogs, and guestbooks. It can be problematic because agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.


Wiki spam

Wiki spam is when a spammer uses the open editability of wiki systems to place links from the wiki site to the spam site.


Referrer log spamming

Referrer spam takes place when a spam perpetrator or facilitator accesses a web page (the ''referee''), by following a link from another web page (the ''referrer''), so that the referee is given the address of the referrer by the person's Internet browser. Some websites have a referrer log which shows which pages link to that site. By having a robot randomly access many sites enough times, with a message or specific address given as the referrer, that message or Internet address then appears in the referrer log of those sites that have referrer logs. Since some Web search engines base the importance of sites on the number of different sites linking to them, referrer-log spam may increase the search engine rankings of the spammer's sites. Also, site administrators who notice the referrer log entries in their logs may follow the link back to the spammer's referrer page.
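How a robot's forged referrer comes to dominate a referrer log can be illustrated by tallying referrer hostnames. The sketch below is only an assumption about how such a tally might be written: the log is modelled as a plain list of referrer strings, with "-" standing for a request that sent no referrer, and the hostnames are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_counts(referrers):
    """Count requests per referrer hostname, skipping entries without one."""
    hosts = (urlsplit(r).hostname for r in referrers)
    return Counter(h for h in hosts if h)


# Hypothetical log: one genuine referral, one direct visit, and twenty
# robot requests that all claim the same referrer address.
log = ["https://blog.example/post", "-"] + ["http://spam.example/pills"] * 20

counts = referrer_counts(log)
print(counts.most_common(1))
```

A site that publishes such a tally unfiltered would list the spammer's address at the top, which is the outcome the robot is after.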


Countermeasures

Because of the large amount of spam posted to user-editable webpages, Google proposed a "nofollow" tag (the rel="nofollow" link attribute) that could be embedded with links. A link-based search engine, such as Google's PageRank system, will not use the link to increase the score of the linked website if the link carries a nofollow tag. This ensures that spamming links to user-editable websites will not raise the sites' ranking with search engines. Nofollow is used by several major websites, including WordPress, Blogger and Wikipedia.
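How a crawler might separate links that count toward ranking from those marked nofollow can be sketched with the standard `html.parser` module. This only illustrates reading the `rel` attribute; the class and function names are hypothetical and the example URLs are placeholders.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.followed = []  # links that may pass ranking credit
        self.nofollow = []  # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel holds a space-separated list of tokens, e.g. "external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollow) lists of link targets found in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.followed, collector.nofollow
```

A link-based ranker built on this would feed only the first list into its link graph, so comment links that a site marks as nofollow add nothing to the spammer's score.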


Other types


Mirror websites

Mirroring is the hosting of multiple websites with conceptually similar content but using different URLs. Some search engines give a higher rank to results where the keyword searched for appears in the URL.


URL redirection

URL redirection is the taking of the user to another page without his or her intervention, ''e.g.'', using META refresh tags, Flash, JavaScript, Java or server-side redirects. However, a 301 redirect, or permanent redirect, is not considered malicious behavior.
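Of the redirection methods listed, the META refresh tag is the easiest to spot in static HTML. The sketch below collects such tags with the standard `html.parser` module. It handles only the common `content="N; url=..."` form, and the class and function names are assumptions for this example.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.redirects = []  # (delay_seconds, target_url) pairs

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag != "meta" or (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0; url=https://landing.example/"
        delay, _, rest = (attr.get("content") or "").partition(";")
        rest = rest.strip()
        if not rest.lower().startswith("url="):
            return  # a plain timed reload of the same page, not a redirect
        try:
            seconds = float(delay.strip())
        except ValueError:
            return
        self.redirects.append((seconds, rest[4:].strip("'\" ")))


def find_meta_redirects(html):
    """Return the META refresh redirects declared in `html`."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    return finder.redirects
```

A zero-second delay to a different site is a common pattern worth flagging for review. JavaScript and server-side redirects cannot be seen this way and would need a browser or the HTTP response itself.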


Cloaking

Cloaking refers to any of several means to serve a page to the search-engine spider that is different from that seen by human users. It can be an attempt to mislead search engines regarding the content on a particular web site. Cloaking, however, can also be used to ethically increase accessibility of a site to users with disabilities or provide human users with content that search engines aren't able to process or parse. It is also used to deliver content based on a user's location; Google itself uses IP delivery, a form of cloaking, to deliver results. Another form of cloaking is ''code swapping'', ''i.e.'', optimizing a page for top ranking and then swapping another page in its place once a top ranking is achieved. Google refers to these types of redirects as ''Sneaky Redirects''.
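One simple way to test for cloaking is to compare the text served to a crawler with the text served to an ordinary browser. The sketch below assumes both versions have already been fetched and reduced to plain text. It measures word overlap with Jaccard similarity, and the 0.5 threshold is an arbitrary assumption for this example.

```python
import re


def word_set(text):
    """Lower-cased set of word tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(crawler_text, browser_text, threshold=0.5):
    """True if the two versions share fewer words than the threshold allows."""
    return jaccard(word_set(crawler_text), word_set(browser_text)) < threshold


# Hypothetical page texts as seen by a crawler and by a browser.
print(looks_cloaked("cheap flights and hotel deals",
                    "online casino bonus play now"))
```

Since location-based delivery and accessibility variants also produce differing versions, a low similarity is a signal for closer inspection rather than a verdict.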


Countermeasures


Page omission by search engine

Spamdexed pages are sometimes eliminated from search results by the search engine.


Page omission by user

Users can employ search operators for filtering. For Google, a keyword preceded by "-" (minus) will omit sites that contain the keyword in their pages or in the URL of the pages from the search results. As an example, a search containing "-<keyword>" (with the unwanted term in place of <keyword>) will eliminate sites that contain that word in their pages and the pages whose URL contains it. Users could also use the Google Chrome extension "Personal Blocklist (by Google)", launched by Google in 2011 as part of countermeasures against content farming. Via the extension, users could block a specific page, or set of pages, from appearing in their search results. As of 2021, the original extension appears to have been removed, although similar-functioning extensions may be used.

Possible solutions to overcome search-redirection poisoning that redirects to illegal internet pharmacies include notification of operators of vulnerable legitimate domains. Further, manual evaluation of SERPs, previously published link-based and content-based algorithms, as well as tailor-made automatic detection and classification engines can be used as benchmarks in the effective identification of pharma scam campaigns.


See also

* Adversarial information retrieval
* Index (search engine) – overview of search engine indexing technology
* TrustRank
* Web scraping
* Microsoft SmartScreen
* Microsoft Defender

