SpamBayes

Last updated
SpamBayes
Original author(s) Tim Peters
Initial releaseSeptember 2002
Stable release
1.0.4 / March 2005
Preview release
1.1a6 / December 6, 2008 (2008-12-06) [1]
Written in Python
Platform Cross-platform
Available in English only
Type E-mail filtering
License PSFL
Website spambayes.sourceforge.net

SpamBayes is a Bayesian spam filter written in Python which uses techniques laid out by Paul Graham in his essay "A Plan for Spam". It has subsequently been improved by Gary Robinson and Tim Peters, among others. [2]

Contents

The most notable difference between a conventional Bayesian filter and the filter used by SpamBayes is that there are three classifications rather than two: spam, non-spam (called ham in SpamBayes), and unsure. The user trains a message as being either ham or spam; when filtering a message, the spam filters generate one score for ham and another for spam.

If the spam score is high and the ham score is low, the message will be classified as spam. If the spam score is low and the ham score is high, the message will be classified as ham. If the scores are both high or both low, the message will be classified as unsure.

This approach leads to a low number of false positives and false negatives, but it may result in a number of unsures which need a human decision.

Web filtering

Some work has gone into applying SpamBayes to filter internet content via a proxy web server. [3] [4] [5]

Related Research Articles

<span class="mw-page-title-main">Spamming</span> Unsolicited electronic messages, especially advertisements

Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising, non-commercial proselytizing, or any prohibited purpose, or simply repeatedly sending the same message to the same user. While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social spam, spam mobile apps, television advertising and file sharing spam. It is named after Spam, a luncheon meat, by way of a Monty Python sketch about a restaurant that has Spam in almost every dish in which Vikings annoyingly sing "Spam" repeatedly.

Bogofilter is a mail filter that classifies e-mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. It was originally written by Eric S. Raymond after he read Paul Graham's article "A Plan for Spam" and is now maintained together with a group of contributors by David Relson, Matthias Andree and Greg Louis.

<span class="mw-page-title-main">Apache SpamAssassin</span> Open-source e-mail spam filter

Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and is a part of the Apache Foundation since 2004.

Various anti-spam techniques are used to prevent email spam.

<span class="mw-page-title-main">Email spam</span> Unsolicited electronic advertising by email

Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

Email marketing is the act of sending a commercial message, typically to a group of people, using email. In its broadest sense, every email sent to a potential or current customer could be considered email marketing. It involves using email to send advertisements, request business, or solicit sales or donations. Email marketing strategies commonly seek to achieve one or more of three primary objectives: build loyalty, trust, or brand awareness. The term usually refers to sending email messages with the purpose of enhancing a merchant's relationship with current or previous customers, encouraging customer loyalty and repeat business, acquiring new customers or convincing current customers to purchase something immediately, and sharing third-party ads.

Norton Internet Security, developed by Symantec Corporation, is a discontinued computer program that provides malware protection and removal during a subscription period. It uses signatures and heuristics to identify viruses. Other features include a personal firewall, email spam filtering, and phishing protection. With the release of the 2015 line in summer 2014, Symantec officially retired Norton Internet Security after 14 years as the chief Norton product. It was superseded by Norton Security, a rechristened adaptation of the original Norton 360 security suite. The suite was once again rebranded to Norton 360 in 2019.

SpamCop is an email spam reporting service, allowing recipients of unsolicited bulk or commercial email to report IP addresses found by SpamCop's analysis to be senders of the spam to the abuse reporting addresses of those IP addresses. SpamCop uses these reports to compile a list of computers sending spam called the "SpamCop Blocking List" or "SpamCop Blacklist" (SCBL).

The attention economy refers to the incentives of, especially advertising-driven companies, to maximize the time and attention their users give to the product they are selling.

A challenge–response system is a type of that automatically sends a reply with a challenge to the (alleged) sender of an incoming e-mail. It was originally designed in 1997 by Stan Weatherby, and was called Email Verification. In this reply, the purported sender is asked to perform some action to assure delivery of the original message, which would otherwise not be delivered. The action to perform typically takes relatively little effort to do once, but great effort to perform in large numbers. This effectively filters out spammers. Challenge–response systems only need to send challenges to unknown senders. Senders that have previously performed the challenging action, or who have previously been sent e-mail(s) to, would be automatically receive a challenge.

In statistical hypothesis testing, a type I error, or a false positive, is the rejection of the null hypothesis when it is actually true. For example, an innocent person may be convicted.

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam or is not spam. The spammer hopes that the addition of random words that are unlikely to appear in a spam message will cause the spam filter to believe the message to be legitimate—a statistical type II error.

Backscatter is incorrectly automated bounce messages sent by mail servers, typically as a side effect of incoming spam.

Brightmail Inc. was a San Francisco–based technology company focused on anti-spam filtering. Brightmail's system has a three-pronged approach to stopping spam, the Probe Network is a massive number of e-mail addresses established for the sole purpose of receiving spam. The Brightmail Logistics and Operations Center (BLOC) evaluates newly detected spam and issues rules for ISPs. The third approach is the Spam Wall, a filtering engine that identifies and screens out spam based on the updates from the BLOC.

The history of email spam reaches back to the mid-1990s when commercial use of the internet first became possible - and marketers and publicists began to test what was possible.

<span class="mw-page-title-main">Gary Robinson</span> American software engineer and mathematician

Gary Robinson is an American software engineer and mathematician and inventor notable for his mathematical algorithms to fight spam. In addition, he patented a method to use web browser cookies to track consumers across different web sites, allowing marketers to better match advertisements with consumers. The patent was bought by DoubleClick, and then DoubleClick was bought by Google. He is credited as being one of the first to use automated collaborative filtering technologies to turn word-of-mouth recommendations into useful data.

<i>CompuServe Inc. v. Cyber Promotions, Inc.</i>

CompuServe Inc. v. Cyber Promotions, Inc. was a ruling by the United States District Court for the Southern District of Ohio in 1997 that set an early precedent for granting online service providers the right to prevent commercial enterprises from sending unsolicited email advertising – also known as spam – to its subscribers. It was one of the first cases to apply United States tort law to restrict spamming on computer networks. The court held that Cyber Promotions' intentional use of CompuServe's proprietary servers to send unsolicited email was an actionable trespass to chattels and granted a preliminary injunction preventing the spammer from sending unsolicited advertisements to any email address maintained by CompuServe.

<span class="mw-page-title-main">Bayesian programming</span> Statistics concept

Bayesian programming is a formalism and a methodology for having a technique to specify probabilistic models and solve problems when less than the necessary information is available.

References

  1. "Download CHANGELOG.TXT (SpamBayes anti-spam)".
  2. Robinson, Gary (1 March 2003). "A Statistical Approach to the Spam Problem". Linux Journal. ISSN   1075-3583.
  3. Montanaro, Skip (2003-12-07). "[spambayes-dev] Web filtering" . Retrieved 2023-04-18.
  4. "[spambayes-dev] Web filtering". 7 December 2003.
  5. "OSDIR". 6 November 2020.