Email-address harvesting

Email harvesting or scraping is the process of obtaining lists of email addresses using various methods. Typically these are then used for bulk email or spam.

Methods

The simplest method involves spammers purchasing or trading lists of email addresses from other spammers.

Another common method is the use of special software known as "harvesting bots" or "harvesters", which spider Web pages, postings on Usenet, mailing list archives, internet forums and other online sources to obtain email addresses from public data.
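The pattern matching that harvesters rely on is simple, and a site owner can apply the same idea to their own pages to see which addresses are exposed. The following is a minimal sketch of such an audit, not a description of any particular harvesting tool; the function name `exposed_addresses` and the deliberately loose regular expression are assumptions made for illustration.

```python
import re

# Deliberately simple pattern (an assumption for this sketch); the full
# address syntax allowed by the mail standards is much broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def exposed_addresses(page_text: str) -> list:
    """Return the distinct address-like strings in page_text, in first-seen order."""
    seen = dict.fromkeys(m.group(0).lower() for m in EMAIL_RE.finditer(page_text))
    return list(seen)


sample = 'Contact <a href="mailto:bob@example.com">Bob</a> or alice at example dot com.'
print(exposed_addresses(sample))  # ['bob@example.com']
```

Only the plainly published address is found; the munged form "alice at example dot com" is not matched by this pattern.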

Spammers may also use a form of dictionary attack known as a directory harvest attack, in which valid email addresses at a specific domain are found by guessing common usernames at that domain. For example, the spammer tries alan@example.com, alana@example.com, alanb@example.com, and so on; any address that the recipient's mail server accepts for delivery, rather than rejecting, is added to the list of presumably valid addresses for that domain.

Another method of email address harvesting is to offer a product or service free of charge as long as the user provides a valid email address, and then use the addresses collected from users as spam targets. Common products and services offered are jokes of the day, daily Bible quotes, news or stock alerts, free merchandise, or even registered sex offender alerts for one's area. Another technique was used in late 2007 by the company iDate, which used email harvesting directed at subscribers to the Quechup website to spam the victims' friends and contacts. [1]

Harvesting sources

Spammers may harvest email addresses from a number of sources. A popular method uses email addresses which their owners have published for other purposes. Usenet posts, especially those in archives such as Google Groups, frequently yield addresses. Usenet article message-IDs often look enough like email addresses that they are harvested as well. Using spambots (also called web spiders) to search the Web for pages containing addresses, such as corporate staff directories or membership lists of professional societies, can yield thousands of addresses, most of them deliverable. Spammers have also harvested email addresses directly from Google search results, without actually spidering the websites found in the search. Spammers have subscribed to discussion mailing lists for the purpose of gathering the addresses of posters. The DNS and WHOIS systems require the publication of technical contact information for all Internet domains; spammers have illegally trawled these resources for email addresses. Spammers have also concluded that, for the domain names of businesses, all of the email addresses generally follow the same basic pattern, and they are thus able to guess accurately the addresses of employees whose addresses they have not harvested.

Spammer viruses may include a function that scans the infected computer's disk drives (and possibly its network interfaces) for email addresses. These scanners discover email addresses that have never been exposed on the Web or in WHOIS. A compromised computer located on a shared network segment may capture email addresses from traffic addressed to its network neighbors. The harvested addresses are then returned to the spammer through the botnet created by the virus. In addition, the addresses are sometimes combined with other information and cross-referenced to extract financial and personal data.

A controversial tactic called "e-pending" involves the appending of email addresses to direct-marketing databases. Direct marketers normally obtain lists of prospects from sources such as magazine subscriptions and customer lists. By searching the Web and other resources for email addresses corresponding to the names and street addresses in their records, direct marketers can send targeted spam email. However, as with most spammer "targeting", this is imprecise; users have reported, for instance, receiving solicitations to mortgage their house at a specific street address, where the address was clearly a business address including mail stop and office number.

Spammers sometimes use various means to confirm addresses as deliverable. For instance, including a hidden Web bug in a spam message written in HTML may cause the recipient's mail client to transmit the recipient's address, or any other unique key, to the spammer's Web site. [2] Users can defend against such abuses by turning off their mail program's option to display images, or by reading email as plain text rather than formatted.
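A mail client or filter can apply the defense described above by looking for images that would be fetched from a remote server when the message is displayed. The following is a minimal sketch using Python's standard `html.parser` module; the class name `RemoteImageFinder`, the helper `remote_images`, and the tracker URL are hypothetical, and a real client would need to consider other remote resources as well.

```python
from html.parser import HTMLParser


class RemoteImageFinder(HTMLParser):
    """Collect the src of every <img> that would trigger a network fetch."""

    def __init__(self):
        super().__init__()
        self.remote_sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Remote images use an absolute or protocol-relative URL.
        if src.lower().startswith(("http://", "https://", "//")):
            self.remote_sources.append(src)


def remote_images(html_body: str) -> list:
    """Return the remote image URLs referenced by an HTML message body."""
    finder = RemoteImageFinder()
    finder.feed(html_body)
    finder.close()
    return finder.remote_sources


# Hypothetical message: the query string carries a per-recipient key.
message = '<p>Sale!</p><img src="https://tracker.example/pixel.gif?id=abc123" width="1" height="1">'
print(remote_images(message))  # ['https://tracker.example/pixel.gif?id=abc123']
```

A client that finds any such URL can decline to load it until the user explicitly asks for images to be shown.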

Likewise, spammers sometimes operate Web pages which purport to remove submitted addresses from spam lists. In several cases, these have been found to subscribe the entered addresses to receive more spam. [3]

When a person fills out a web form, the submitted data is often sold to a spammer, with a web service or HTTP POST request used to transfer it. The transfer is immediate and places the email address in various spammer databases. The revenue from the sale is shared with the source. For instance, if someone applies online for a mortgage, the owner of the site may have made a deal with a spammer to sell the address. Spammers consider these the most valuable addresses, because they are fresh and the user has just signed up for a product or service of a kind that is often marketed by spam.

Legality

Many jurisdictions have anti-spam laws that restrict the harvesting or use of email addresses.

In Australia, under the 2003 anti-spam legislation, the creation or use of email-address harvesting programs (address harvesting software) is illegal only if the programs are intended to be used to send unsolicited commercial email. [4] [5] The legislation is intended to prohibit emails with 'an Australian connection': spam originating in Australia and sent elsewhere, and spam sent to an Australian address.

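New Zealand has similar restrictions contained in its Unsolicited Electronic Messages Act 2007. [6] In the United States, the CAN-SPAM Act of 2003 [7] made it illegal to initiate commercial email to a recipient whose email address was obtained by automated extraction from a website or online service that had posted a notice that it does not transfer addresses to other parties for the purpose of sending email, or by automated generation of possible addresses from combinations of names, letters, or numbers (a dictionary attack).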

Furthermore, website operators may not distribute their legitimately collected lists. The CAN-SPAM Act of 2003 requires that operators of web sites and online services should include a notice that the site or service will not give, sell, or otherwise transfer addresses, maintained by such website or online service, to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.

Countermeasures

Address munging
Address munging (e.g., changing "bob@example.com" to "bob at example dot com") is a common technique to make harvesting email addresses more difficult. Though relatively easy to overcome (as a Google search for such munged addresses shows), it is still effective. [8] [9] It is somewhat inconvenient for users, who must examine the address and manually correct it.
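The "at"/"dot" transformation, and the ease with which it can be reversed, can be sketched as follows. The function names `munge` and `demunge` are illustrative assumptions, and real munging schemes vary widely.

```python
import re


def munge(address: str) -> str:
    """Rewrite user@host.tld as 'user at host dot tld'."""
    local, _, domain = address.partition("@")
    return f"{local} at {domain.replace('.', ' dot ')}"


def demunge(text: str) -> str:
    """Reverse the common ' at ' / ' dot ' spelling, as a harvester might."""
    text = re.sub(r"\s+dot\s+", ".", text)
    return re.sub(r"\s+at\s+", "@", text)


print(munge("bob@example.com"))           # bob at example dot com
print(demunge("bob at example dot com"))  # bob@example.com
```

Because the reverse mapping takes only two substitutions, the protection depends on harvesters not bothering to apply it rather than on any real difficulty.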
Images
Using images to display part or all of an email address is a very effective harvesting countermeasure. The processing required to automatically extract text from images is not economically viable for spammers. It is, however, very inconvenient for users, who must type the address in manually.
Contact forms
Email contact forms which send an email but do not reveal the recipient's address avoid publishing an email address in the first place. However, this method prevents users from composing in their preferred email client, limits message content to plain text, and does not automatically leave users with a record of what they have said in their "sent" mail folder.
JavaScript obfuscation
JavaScript email obfuscation produces a normal, clickable email link for users while obscuring the address from spiders. In the source code seen by harvesters, the email address is scrambled, encoded, or otherwise obfuscated. [8] While very convenient for most users, it does reduce accessibility, e.g. for text-based browsers and screen readers, or for those not using a JavaScript-enabled browser. [10]
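One of many possible encodings is to store the address as a list of character codes that the browser reassembles at display time. The sketch below is a Python helper that generates such a script for inclusion in a page; the function name `js_obfuscated_link` is hypothetical, and the sketch assumes a simple address containing no characters that need HTML escaping.

```python
def js_obfuscated_link(address: str) -> str:
    """Return an HTML <script> that writes a mailto link built from character codes."""
    codes = ",".join(str(ord(ch)) for ch in address)
    return (
        "<script>"
        f"var a=String.fromCharCode({codes});"
        "document.write('<a href=\"mailto:'+a+'\">'+a+'</a>');"
        "</script>"
    )


snippet = js_obfuscated_link("bob@example.com")
print(snippet)
```

The generated markup never contains the literal address, so a harvester that reads the page source without executing scripts sees only numbers. Visitors without JavaScript see nothing, which is the accessibility cost noted above.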
HTML obfuscation
In HTML, email addresses may be obfuscated in many ways, such as inserting hidden elements within the address or listing parts out of order and using CSS to restore the correct order. Each has the benefit of being transparent to most users, but none support clickable email links and none are accessible to text-based browsers and screen readers.
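The hidden-element approach can be sketched as a small generator that surrounds the "@" sign with decoy text which CSS hides from sighted readers. The function name `html_obfuscated` and the decoy string are assumptions for illustration only.

```python
import re
from html import escape


def html_obfuscated(address: str, decoy: str = "nospam") -> str:
    """Insert hidden decoy spans on both sides of the '@' sign."""
    local, _, domain = address.partition("@")
    hidden = f'<span style="display:none">{escape(decoy)}</span>'
    return f"{escape(local)}{hidden}@{hidden}{escape(domain)}"


markup = html_obfuscated("bob@example.com")
print(markup)

# What a harvester sees if it simply strips the tags:
print(re.sub(r"<[^>]+>", "", markup))  # bobnospam@nospamexample.com
```

A browser that applies the style sheet shows "bob@example.com", while a tool that strips tags without applying CSS recovers a useless address. Text-based browsers and screen readers that ignore the style are misled in the same way, which is the accessibility drawback described above.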
CAPTCHA
Requiring users to complete a CAPTCHA before giving out an email address is an effective harvesting countermeasure. A popular solution was the reCAPTCHA Mailhide service, which is no longer supported (as noted in 2018). [11]
CAN-SPAM Notice
To enable prosecution of spammers under the CAN-SPAM Act of 2003, a website operator must post a notice that "the site or service will not give, sell, or otherwise transfer addresses maintained by such website or online service to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages." [12]
Mail Server Monitoring
Email servers use a variety of methods to combat directory harvest attacks, including refusing to communicate with remote senders that have specified more than one invalid recipient address within a short time. Most such measures, however, carry the risk of disrupting legitimate email.
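The rule of refusing senders that name more than one invalid recipient within a short time can be sketched as a sliding-window counter. This is a minimal illustration, not the behavior of any particular mail server; the class name `InvalidRecipientMonitor`, the 60-second window, and the threshold of one are all assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class InvalidRecipientMonitor:
    """Track invalid-recipient attempts per sender within a sliding time window."""

    def __init__(self, max_invalid: int = 1, window_seconds: float = 60.0):
        self.max_invalid = max_invalid
        self.window_seconds = window_seconds
        # sender IP -> timestamps of recent invalid recipient attempts
        self._failures = defaultdict(deque)

    def record_invalid(self, sender_ip: str, now: Optional[float] = None) -> None:
        """Note that sender_ip just named a recipient that does not exist."""
        now = time.monotonic() if now is None else now
        self._failures[sender_ip].append(now)

    def is_blocked(self, sender_ip: str, now: Optional[float] = None) -> bool:
        """True if sender_ip exceeded max_invalid failures inside the window."""
        now = time.monotonic() if now is None else now
        recent = self._failures[sender_ip]
        while recent and now - recent[0] > self.window_seconds:
            recent.popleft()  # forget failures older than the window
        return len(recent) > self.max_invalid


monitor = InvalidRecipientMonitor()
monitor.record_invalid("192.0.2.1", now=0.0)
print(monitor.is_blocked("192.0.2.1", now=1.0))    # False: one typo is tolerated
monitor.record_invalid("192.0.2.1", now=2.0)
print(monitor.is_blocked("192.0.2.1", now=3.0))    # True: second failure in window
print(monitor.is_blocked("192.0.2.1", now=100.0))  # False: the window has expired
```

The threshold and window are where the risk to legitimate email arises: a sender with an out-of-date mailing list can trip a strict setting just as a harvester does.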
Spider Traps
A spider trap is a part of a website which is a honeypot designed to combat email harvesting spiders. [13] Well-behaved spiders are unaffected, as the website's robots.txt file will warn spiders to stay away from that area, a warning that malicious spiders do not heed. Some traps block access from the client's IP as soon as the trap is accessed. [14] [15] [16] Others, like a network tarpit, are designed to waste the time and resources of malicious spiders by slowly and endlessly feeding the spider useless information. [17] The "bait" content may contain large numbers of fake addresses, a technique known as list poisoning, though some consider this practice harmful. [18] [19] [20] [21]
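The variant of the trap that bans a client as soon as it enters the forbidden area can be sketched as follows. The `/trap/` path, the class name `SpiderTrap`, and the use of bare HTTP status codes are assumptions for illustration; the robots.txt check uses Python's standard `urllib.robotparser` to show what a well-behaved spider would do.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content that tells all spiders to stay out of the trap area.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


class SpiderTrap:
    """Ban any client IP that requests a path under the trap prefix."""

    def __init__(self, trap_prefix: str = "/trap/"):
        self.trap_prefix = trap_prefix
        self.banned_ips = set()

    def handle(self, client_ip: str, path: str) -> int:
        """Return the HTTP status code to send for this request."""
        if client_ip in self.banned_ips:
            return 403
        if path.startswith(self.trap_prefix):
            self.banned_ips.add(client_ip)  # the client ignored robots.txt
            return 403
        return 200


def well_behaved_may_fetch(path: str) -> bool:
    """What a spider that honors robots.txt would decide for this path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("*", path)


trap = SpiderTrap()
print(well_behaved_may_fetch("/trap/list.html"))    # False: a polite spider stays away
print(trap.handle("203.0.113.5", "/trap/list.html"))  # 403, and the IP is now banned
print(trap.handle("203.0.113.5", "/index.html"))      # 403 for all further requests
```

Banning by IP address is a blunt instrument, since several users may share one address; the sketch leaves out expiry of bans and any allow-list.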

References

  1. Arthur, Charles (2007-09-13). "Do social network sites genuinely care about privacy?". The Guardian. Archived from the original on 2016-12-22. Retrieved 2007-10-30.
  2. Heather Harreld (5 December 2000). "Embedded HTML 'bugs' pose potential security risk". InfoWorld. Archived from the original on 2006-12-10. Retrieved 2007-01-06.
  3. "Spam Unsubscribe Services". The Spamhaus Project Ltd. 29 September 2005. Archived from the original on 2009-03-09. Retrieved 2007-01-06.
  4. "EFA Analysis of Australian Spam Bills 2003". efa.org.au. Electronic Frontiers Australia. 2003-11-01. Address Harvesting Software and Lists. Archived from the original on 2021-05-04.
  5. "Australia slams the door on spam". 2003-08-18. Archived from the original on 2007-02-03. Retrieved 2021-07-04.
  6. "Unsolicited Electronic Messages Act 2007 No 7, Public Act Subpart 2—Address-harvesting software and harvested-address lists". legislation.govt.nz. Archived from the original on 2021-02-17. Retrieved 2021-07-04.
  7. "Public Law 108–187" (PDF). Archived (PDF) from the original on 2006-01-04. Retrieved 2007-05-28.
  8. Silvan Mühlemann, 20 July 2008, Nine ways to obfuscate e-mail addresses compared
  9. Hohlfeld, Oliver; Graf, Thomas; Ciucu, Florin (2012). Longtime Behavior of Harvesting Spam Bots (PDF). ACM Internet Measurement Conference. Archived (PDF) from the original on 2014-07-25. Retrieved 2014-07-18.
  10. Roel Van Gils, A List Apart , 6 November 2007, Graceful Email Obfuscation Archived 2011-02-22 at the Wayback Machine
  11. "Mailhide: Free Spam Protection" . Retrieved 18 March 2023.
  12. "15 U.S. Code § 7704 - Other protections for users of commercial electronic mail" Archived 2016-09-19 at the Wayback Machine , Section a.4.b.1.A.i
  13. SEO Glossary Archived 2010-12-28 at the Wayback Machine : "A spider trap refers to either a continuous loop where spiders are requesting pages and the server is requesting data to render the page or an intentional scheme designed to identify (and "ban") spiders that do not respect robots.txt."
  14. Archived 2008-05-17 at the Wayback Machine A Spider Trap which bans clients which access it.
  15. Thomas Zeithaml, Spider Trap: How It Works Archived 2018-04-11 at the Wayback Machine
  16. Ralf D. Kloth, Trap bad bots in a bot trap Archived 2006-01-17 at the Wayback Machine
  17. "How to keep bad robots". fleiner.com. Archived from the original on 18 March 2023. Retrieved 18 March 2023.
  18. Ralf D. Kloth, Fight SPAM, catch Bad Bots Archived 2006-06-01 at the Wayback Machine : "Generating web pages with long lists of fake addresses to spoil the spam bot's address data base is not encouraged, because it is unknown if the spammers really care and on the other hand, the use of those addresses by spammers will cause additional traffic load on network links and involved innocent third party servers."
  19. Harvester Killer Archived 2008-04-11 at the Wayback Machine : generates fake emails and traps spiders in an endless loop.
  20. "Portability Support: Spider Blocking => Spider Trap - Detects and blocks bad bots". Archived from the original on 2011-07-06. Retrieved 2011-02-12. A Spider Trap which generates 5,000 fake email addresses and blocks the client from further access.
  21. robotcop.org Archived 2019-10-20 at the Wayback Machine : "Webmasters can respond to misbehaving spiders by trapping them, poisoning their databases of harvested e-mail addresses, or simply block them."