
Email harvesting or scraping is the process of obtaining lists of email addresses using various methods. Typically these are then used for bulk email or spam.


Methods

The simplest method involves spammers purchasing or trading lists of email addresses from other spammers.

Another common method is the use of special software known as "harvesting bots" or "harvesters", which spider Web pages, postings on Usenet, mailing list archives, internet forums and other online sources to obtain email addresses from public data.

Spammers may also use a form of dictionary attack to harvest email addresses, known as a directory harvest attack, in which valid email addresses at a specific domain are found by guessing addresses built from common usernames. For example, the attacker tries alan@example.com and then other common usernames at example.com; any guesses that the recipient email server accepts for delivery, instead of rejecting, are added to the list of theoretically valid email addresses for that domain.

Another method of email address harvesting is to offer a product or service free of charge as long as the user provides a valid email address, and then use the addresses collected from users as spam targets. Common products and services offered are jokes of the day, daily Bible quotes, news or stock alerts, free merchandise, or even registered sex offender alerts for one's area. Another technique was used in late 2007 by the company iDate, which used email harvesting directed at subscribers to the Quechup website to spam the victim's friends and contacts.
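
Seen from the receiving mail server, the guessing pattern of a directory harvest attack appears as a burst of rejected recipients from one remote sender. The following is a minimal sketch of how a server might notice that pattern, not a description of any particular mail server's implementation. The class name, the function names, the threshold of one rejection and the 60-second window are all assumptions chosen for illustration, and the "server" is just an in-memory set of usernames.

```python
from collections import defaultdict, deque


class RejectedRecipientMonitor:
    """Track rejected recipients per remote sender within a sliding time window."""

    def __init__(self, max_rejects=1, window_seconds=60.0):
        # Hypothetical defaults; a real server would tune both values.
        self.max_rejects = max_rejects
        self.window_seconds = window_seconds
        self._rejects = defaultdict(deque)  # sender IP -> timestamps of rejections

    def record_reject(self, sender_ip, now):
        """Record one rejected recipient; return True if the sender should be refused."""
        times = self._rejects[sender_ip]
        times.append(now)
        # Drop rejections that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_rejects


def simulate(valid_users, attempts, monitor):
    """Replay (time, sender_ip, local_part) attempts against an in-memory user set."""
    blocked = set()
    accepted = []
    for now, sender_ip, local_part in attempts:
        if sender_ip in blocked:
            continue  # the server no longer talks to this sender
        if local_part in valid_users:
            accepted.append((sender_ip, local_part))
        elif monitor.record_reject(sender_ip, now):
            blocked.add(sender_ip)
    return accepted, blocked


if __name__ == "__main__":
    # Addresses from the documentation ranges 203.0.113.0/24 and 198.51.100.0/24.
    attempts = [
        (0.0, "203.0.113.5", "aaron"),
        (1.0, "203.0.113.5", "abby"),
        (2.0, "203.0.113.5", "alan"),
        (3.0, "198.51.100.7", "alan"),
    ]
    print(simulate({"alan", "postmaster"}, attempts, RejectedRecipientMonitor()))
```

In this sketch the first sender is refused after its second wrong guess, so its later correct guess never reaches the user list, while the second sender's single valid delivery goes through. A legitimate sender with a couple of stale addresses would be refused the same way, so a threshold this strict can disrupt legitimate email.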


Harvesting sources

Spammers may harvest email addresses from a number of sources. A popular method uses email addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Simply searching the Web with spambots for pages that list addresses, such as corporate staff directories or membership lists of professional societies, can yield thousands of addresses, most of them deliverable. Spammers have also subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally trawled these resources for email addresses. Spammers have also concluded that the email addresses at a business's domain generally all follow the same basic pattern, so they can accurately guess the addresses of employees whose addresses they have not harvested.

Many spammers use programs called web spiders to find email addresses on web pages. Usenet article message-IDs often look enough like email addresses that they are harvested as well. Spammers have also harvested email addresses directly from Google search results, without actually spidering the websites found in the search.

Spammer viruses may include a function which scans the victimized computer's disk drives (and possibly its network interfaces) for email addresses. These scanners discover email addresses which have never been exposed on the Web or in WHOIS. A compromised computer located on a shared network segment may capture email addresses from traffic addressed to its network neighbors. The harvested addresses are then returned to the spammer through the botnet created by the virus. In addition, the addresses are sometimes combined with other information and cross-referenced to extract financial and personal data.

A controversial tactic called "e-pending" involves appending email addresses to direct-marketing databases. Direct marketers normally obtain lists of prospects from sources such as magazine subscriptions and customer lists. By searching the Web and other resources for email addresses corresponding to the names and street addresses in their records, direct marketers can send targeted spam email. However, as with most spammer "targeting", this is imprecise; users have reported, for instance, receiving solicitations to mortgage their house at a specific street address, where the address was clearly a business address including mail stop and office number.

Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a hidden Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site. Users can defend against such abuses by turning off their mail program's option to display images, or by reading email as plain text rather than formatted. Likewise, spammers sometimes operate Web pages which purport to remove submitted addresses from spam lists. In several cases, these have been found to subscribe the entered addresses to receive more spam.

When a person fills out a web form, the submitted data is often sold to a spammer, using a web service or HTTP POST to transfer it. The transfer is immediate and drops the email address into various spammer databases. The revenue from the spammer is shared with the source. For instance, if someone applies online for a mortgage, the owner of the site may have made a deal with a spammer to sell the address. Spammers consider these the best addresses, because they are fresh and the user has just signed up for a product or service that is often marketed by spam.
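
The observation that Usenet message-IDs get harvested because they resemble addresses can be illustrated with a short sketch. The pattern below is a deliberately naive one of the kind a simple scanner might use. It is an assumption for illustration rather than the pattern of any known tool, and it is not a full address parser. The sample article and all names in it are made up. A site operator could run a check like this over their own pages to see what a simple scanner would pick up.

```python
import re

# Naive, illustrative pattern: some characters, an "@", then a dotted domain.
# This is an assumption for the sketch, not an RFC 5322 address parser.
NAIVE_ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Made-up Usenet-style article used only as test input.
SAMPLE_ARTICLE = """\
From: Alice Example <alice@example.org>
Message-ID: <20070914.123456.abc@news.example.net>
Subject: Re: test

Reply to me at alice@example.org or ask bob at example dot com.
"""


def find_address_like(text):
    """Return every substring of `text` that the naive pattern accepts."""
    return NAIVE_ADDRESS.findall(text)


if __name__ == "__main__":
    for hit in find_address_like(SAMPLE_ARTICLE):
        print(hit)
```

On the sample, the pattern picks up the real address twice and also the message-ID, which is not a mailbox. The munged "bob at example dot com" is not matched by this particular pattern.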


Legality

In many jurisdictions there are anti-spam laws in place that restrict the harvesting or use of email addresses.

In Australia, under the 2003 anti-spam legislation, the creation or use of email-address harvesting programs (address harvesting software) is illegal only if the programs are intended to be used to send unsolicited commercial email. The legislation is intended to prohibit emails with "an Australian connection": spam originating in Australia being sent elsewhere, and spam being sent to an Australian address.

New Zealand has similar restrictions contained in its Unsolicited Electronic Messages Act 2007.

In the United States, the CAN-SPAM Act of 2003 made it illegal to initiate commercial email to a recipient where the email address of the recipient was obtained by:
* Using an automated means that generates possible electronic mail addresses by combining names, letters, or numbers into numerous permutations.
* Using an automated means to extract electronic mail addresses from an Internet website or proprietary online service operated by another person, and such website or online service included, at the time the address was obtained, a notice stating that the operator of such website or online service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.

Furthermore, website operators may not distribute their legitimately collected lists. The CAN-SPAM Act of 2003 requires that operators of web sites and online services include a notice that the site or service will not give, sell, or otherwise transfer addresses, maintained by such website or online service, to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.


Countermeasures

; Address munging : Address munging (e.g., changing "bob@example.com" to "bob at example dot com") is a common technique to make harvesting email addresses more difficult. Though relatively easy to overcome, it is still effective (Silvan Mühlemann, "Nine ways to obfuscate e-mail addresses compared", 20 July 2008). It is somewhat inconvenient to users, who must examine the address and manually correct it.
; Images : Using images to display part or all of an email address is a very effective harvesting countermeasure. The processing required to automatically extract text from images is not economically viable for spammers. It is very inconvenient for users, who must type the address in manually.
; Contact forms : Email contact forms which send an email but do not reveal the recipient's address avoid publishing an email address in the first place. However, this method prevents users from composing in their preferred email client, limits message content to plain text, and does not automatically leave the user with a record of what they have said in their "sent" mail folder.
; JavaScript obfuscation : JavaScript email obfuscation produces a normal, clickable email link for users while obscuring the address from spiders. In the source code seen by harvesters, the email address is scrambled, encoded, or otherwise obfuscated. While very convenient for most users, it does reduce accessibility, e.g. for text-based browsers and screen readers, or for those not using a JavaScript-enabled browser.
; HTML obfuscation : In HTML, email addresses may be obfuscated in many ways, such as inserting hidden elements within the address or listing parts out of order and using CSS to restore the correct order. Each has the benefit of being transparent to most users, but none supports clickable email links and none is accessible to text-based browsers and screen readers.
; CAPTCHA : Requiring users to complete a CAPTCHA before giving out an email address is an effective harvesting countermeasure. A popular solution is the reCAPTCHA Mailhide service. (Note, 12.9.18: Mailhide is no longer supported.)
; CAN-SPAM Notice : To enable prosecution of spammers under the CAN-SPAM Act of 2003, a website operator must post a notice that "the site or service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages."
; Mail Server Monitoring : Email servers use a variety of methods to combat directory harvest attacks, including refusing to communicate with remote senders that have specified more than one invalid recipient address within a short time, but most such measures carry the risk of disrupting legitimate email.
; Spider Traps : A spider trap is a part of a website which is a honeypot designed to combat email harvesting spiders. Well-behaved spiders are unaffected, as the website's robots.txt file will warn spiders to stay away from that area, a warning that malicious spiders do not heed. Some traps block access from the client's IP as soon as the trap is accessed. Others, like a network tarpit, are designed to waste the time and resources of malicious spiders by slowly and endlessly feeding the spider useless information. The "bait" content may contain large numbers of fake addresses, a technique known as list poisoning, though some consider this practice harmful (robotcop.org: "Webmasters can respond to misbehaving spiders by trapping them, poisoning their databases of harvested e-mail addresses, or simply block them.").

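As an illustration of the HTML obfuscation entry in the list above, the following sketch generates markup for the two approaches it mentions: a hidden element inserted within the address, and text stored out of order that CSS puts back in order. The function names, the decoy text and the exact inline styles are assumptions for this example, and it is one possible rendering of the idea rather than a recommended or complete implementation.

```python
import html


def obfuscate_with_hidden_span(address, decoy="null"):
    """Return HTML that displays `address` but interrupts it in the source with a hidden decoy."""
    local, _, domain = address.partition("@")
    if not local or not domain:
        raise ValueError("expected an address of the form local@domain")
    return (
        f"{html.escape(local)}"
        f'<span style="display:none">{html.escape(decoy)}</span>'
        f"&#64;{html.escape(domain)}"  # "@" written as a character reference
    )


def obfuscate_reversed(address):
    """Return HTML that stores the address reversed and asks CSS to render it right-to-left."""
    # Escape after reversing so that any character references stay intact.
    reversed_text = html.escape(address[::-1])
    return (
        '<span style="unicode-bidi:bidi-override; direction:rtl">'
        f"{reversed_text}</span>"
    )


if __name__ == "__main__":
    print(obfuscate_with_hidden_span("bob@example.com"))
    print(obfuscate_reversed("bob@example.com"))
```

In both outputs the literal string "bob@example.com" never appears in the page source, which is what defeats a simple text scan. As the entry notes, neither form gives a clickable link, and both are unfriendly to text-based browsers and screen readers.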

See also

* Anti-spam techniques
* Email spam
* Web crawler
* Web scraping

