SpamBayes

Last updated
SpamBayes
Original author(s) Tim Peters
Initial releaseSeptember 2002
Stable release
1.0.4 / March 2005
Preview release
1.1a6 / December 6, 2008 (2008-12-06) [1]
Written in Python
Platform Cross-platform
Available in English only
Type E-mail filtering
License PSFL
Website spambayes.sourceforge.net

SpamBayes is a Bayesian spam filter written in Python which uses techniques laid out by Paul Graham in his essay "A Plan for Spam". It has subsequently been improved by Gary Robinson and Tim Peters, among others.

Contents

The most notable difference between a conventional Bayesian filter and the filter used by SpamBayes is that there are three classifications rather than two: spam, non-spam (called ham in SpamBayes), and unsure. The user trains a message as being either ham or spam; when filtering a message, the spam filters generate one score for ham and another for spam.

If the spam score is high and the ham score is low, the message will be classified as spam. If the spam score is low and the ham score is high, the message will be classified as ham. If the scores are both high or both low, the message will be classified as unsure.

This approach leads to a low number of false positives and false negatives, but it may result in a number of unsures which need a human decision.

Web filtering

Some work has gone into applying SpamBayes to filter internet content via a proxy web server. [2] [3] [4]

Related Research Articles

<span class="mw-page-title-main">Spamming</span> Unsolicited electronic messages, especially advertisements

Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising, for the purpose of non-commercial proselytizing, for any prohibited purpose, or simply repeatedly sending the same message to the same user. While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social spam, spam mobile apps, television advertising and file sharing spam. It is named after Spam, a luncheon meat, by way of a Monty Python sketch about a restaurant that has Spam in almost every dish in which Vikings annoyingly sing "Spam" repeatedly.

A Domain Name System blocklist, Domain Name System-based blackhole list, Domain Name System blacklist (DNSBL) or real-time blackhole list (RBL) is a service for operation of mail servers to perform a check via a Domain Name System (DNS) query whether a sending host's IP address is blacklisted for email spam. Most mail server software can be configured to check such lists, typically rejecting or flagging messages from such sites.

Bogofilter is a mail filter that classifies e-mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. It was originally written by Eric S. Raymond after he read Paul Graham's article "A Plan for Spam" and is now maintained together with a group of contributors by David Relson, Matthias Andree and Greg Louis.

<span class="mw-page-title-main">Apache SpamAssassin</span> Open-source e-mail spam filter

Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and is a part of the Apache Foundation since 2004.

Various anti-spam techniques are used to prevent email spam.

<span class="mw-page-title-main">Email spam</span> Unsolicited electronic advertising by email

Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

<span class="mw-page-title-main">Opera Mail</span>

Opera Mail is the email and news client developed by Opera Software. It was an integrated component within the Opera web browser from version 2 through 12. With the release of Opera 15 in 2013, Opera Mail became a separate product and is no longer bundled with Opera. Opera Mail version 1.0 is available for OS X and Windows. It features rich text support and inline spell checking, spam filtering, a contact manager, and supports POP3 and IMAP, newsgroups, and Atom and RSS feeds.

Email marketing is the act of sending a commercial message, typically to a group of people, using email. In its broadest sense, every email sent to a potential or current customer could be considered email marketing. It involves using email to send advertisements, request business, or solicit sales or donations. Email marketing strategies commonly seek to achieve one or more of three primary objectives: build loyalty, trust, or brand awareness. The term usually refers to sending email messages with the purpose of enhancing a merchant's relationship with current or previous customers, encouraging customer loyalty and repeat business, acquiring new customers or convincing current customers to purchase something immediately, and sharing third-party ads.

Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly applying anti-spam techniques. Filtering can be applied to incoming emails as well as to outgoing ones.

Norton Internet Security, developed by Symantec Corporation, is a discontinued computer program that provides malware protection and removal during a subscription period. It uses signatures and heuristics to identify viruses. Other features include a personal firewall, email spam filtering, and phishing protection. With the release of the 2015 line in summer 2014, Symantec officially retired Norton Internet Security after 14 years as the chief Norton product. It was superseded by Norton Security, a rechristened adaptation of the Norton 360 security suite.

In statistical hypothesis testing, a type I error, or a false positive, is the rejection of the null hypothesis when it is actually true. For example, an innocent person may be convicted. A type II error, or a false negative, is the failure to reject a null hypothesis that is actually false. For example: a guilty person may be not convicted.

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam or is not spam. The spammer hopes that the addition of random words that are unlikely to appear in a spam message will cause the spam filter to believe the message to be legitimate—a statistical type II error.

Backscatter is incorrectly automated bounce messages sent by mail servers, typically as a side effect of incoming spam.

The history of email spam reaches back to the mid-1990s when commercial use of the internet first became possible - and marketers and publicists began to test what was possible.

<span class="mw-page-title-main">Gary Robinson</span> American software engineer and mathematician

Gary Robinson is an American software engineer and mathematician and inventor notable for his mathematical algorithms to fight spam. In addition, he patented a method to use web browser cookies to track consumers across different web sites, allowing marketers to better match advertisements with consumers. The patent was bought by DoubleClick, and then DoubleClick was bought by Google. He is credited as being one of the first to use automated collaborative filtering technologies to turn word-of-mouth recommendations into useful data.

<i>CompuServe Inc. v. Cyber Promotions, Inc.</i>

CompuServe Inc. v. Cyber Promotions, Inc. was a ruling by the United States District Court for the Southern District of Ohio in 1997 that set an early precedent for granting online service providers the right to prevent commercial enterprises from sending unsolicited email advertising – also known as spam – to its subscribers. It was one of the first cases to apply United States tort law to restrict spamming on computer networks. The court held that Cyber Promotions' intentional use of CompuServe's proprietary servers to send unsolicited email was an actionable trespass to chattels and granted a preliminary injunction preventing the spammer from sending unsolicited advertisements to any email address maintained by CompuServe.

<span class="mw-page-title-main">Bayesian programming</span> Statistics concept

Bayesian programming is a formalism and a methodology for having a technique to specify probabilistic models and solve problems when less than the necessary information is available.

A cold email is an unsolicited e-mail that is sent to a receiver without prior contact. It could also be defined as the email equivalent of cold calling. Cold emailing is a subset of email marketing and differs from transactional and warm emailing.

References

  1. http://sourceforge.net/projects/spambayes/files/spambayes/1.1a6/CHANGELOG.txt/download
  2. Montanaro, Skip (2003-12-07). "[spambayes-dev] Web filtering" . Retrieved 2023-04-18.
  3. "[spambayes-dev] Web filtering". 7 December 2003.
  4. "OSDIR". 6 November 2020.