Sparse binary polynomial hashing

Sparse binary polynomial hashing (SBPH) is a generalization of Bayesian spam filtering that can match mutating phrases as well as single words.

SBPH automatically generates a large number of features from an incoming text, then uses statistics to weight each feature according to how well it predicts whether the text is spam or non-spam.
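
The feature-generation step can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a five-token sliding window in which the newest token is always kept and each earlier token is either kept or replaced by a placeholder. The window size, the `<skip>` placeholder, the function names, and the use of CRC32 as the hash are all assumptions made for the example.

```python
import zlib
from itertools import product

WINDOW = 5  # assumed window size, chosen for illustration


def sbph_features(text, window=WINDOW):
    """Return sparse phrase features for every token position in text."""
    tokens = text.split()
    features = []
    for i, newest in enumerate(tokens):
        # Up to window-1 tokens that precede the newest one, oldest first.
        history = tokens[max(0, i - window + 1):i]
        # Each mask keeps or skips every earlier token; the newest is always kept.
        for mask in product((False, True), repeat=len(history)):
            phrase = [tok if keep else "<skip>" for tok, keep in zip(history, mask)]
            phrase.append(newest)
            features.append(" ".join(phrase))
    return features


def feature_hash(feature, buckets=1 << 20):
    """Map a feature string to a bucket index (CRC32 is a stand-in hash)."""
    return zlib.crc32(feature.encode("utf-8")) % buckets
```

With a full window, each new token contributes 2^4 = 16 features, so a phrase such as "buy cheap pills now" and a variant with one word changed still share several features. The hashed bucket indices would then be counted per class to estimate each feature's weight.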


Related Research Articles

Naive Bayes classifier: Probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" which assume that the features are conditionally independent, given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.

Bogofilter is a mail filter that classifies e-mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. It was originally written by Eric S. Raymond after he read Paul Graham's article "A Plan for Spam" and is now maintained by David Relson, Matthias Andree and Greg Louis together with a group of contributors.

Apache SpamAssassin: Open-source e-mail spam filter

Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and has been part of the Apache Software Foundation since 2004.

Various anti-spam techniques are used to prevent email spam.

Email spam: Unsolicited electronic advertising by email

Email spam, also referred to as junk email, spam mail, or simply spam, consists of unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.

CRM114 is a program that takes a statistical approach to classifying data and is used especially for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.
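
A bag-of-words naive Bayes filter can be sketched in a few lines. The example below is a hedged illustration only: the function names, the whitespace tokenizer, and the add-one (Laplace) smoothing are assumptions, and it expects at least one training message per class.

```python
import math
from collections import Counter


def train(messages):
    """Count words per class from (text, label) pairs; label is 'spam' or 'ham'."""
    counts = {"spam": Counter(), "ham": Counter()}
    docs = Counter()
    for text, label in messages:
        docs[label] += 1
        counts[label].update(text.lower().split())
    return counts, docs


def spam_log_odds(text, counts, docs):
    """Return log P(spam | text) - log P(ham | text); positive means spam-like."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    # Prior odds from the number of training messages in each class.
    score = math.log(docs["spam"] / docs["ham"])
    for label, sign in (("spam", 1), ("ham", -1)):
        # Add-one smoothing so unseen words never give a zero probability.
        total = sum(counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += sign * math.log((counts[label][word] + 1) / total)
    return score
```

Each word contributes independently to the score, which is the "naive" independence assumption in practice: word order and co-occurrence are ignored.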

Opera Mail

Opera Mail is the email and news client developed by Opera Software. It was an integrated component within the Opera web browser from version 2 through 12. With the release of Opera 15 in 2013, Opera Mail became a separate product and is no longer bundled with Opera. Opera Mail version 1.0 is available for OS X and Windows. It features rich text support and inline spell checking, spam filtering, a contact manager, and supports POP3 and IMAP, newsgroups, and Atom and RSS feeds.

Email filtering is the processing of email to organize it according to specified criteria. The term can apply to the intervention of human intelligence, but most often refers to the automatic processing of messages at an SMTP server, possibly applying anti-spam techniques. Filtering can be applied to incoming emails as well as to outgoing ones.

SpamBayes is a Bayesian spam filter written in Python which uses techniques laid out by Paul Graham in his essay "A Plan for Spam". It has subsequently been improved by Gary Robinson and Tim Peters, among others.

Markovian discrimination in spam filtering is a method, based on the Markov model of probability theory, used in CRM114 and other spam filters to model the statistical behaviors of spam and nonspam more accurately than simple Bayesian methods do. A simple Bayesian model of written text contains only the dictionary of legal words and their relative probabilities. A Markovian model adds the relative transition probabilities that, given one word, predict what the next word will be. It is based on the theory of Markov chains by Andrey Markov, hence the name. In essence, a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.
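
The transition probabilities that a Markovian model adds can be estimated from word pairs. The sketch below is a simplified first-order illustration, not CRM114's actual method; the function name and the whitespace tokenizer are assumptions.

```python
from collections import Counter, defaultdict


def transition_probs(texts):
    """Estimate P(next word | current word) from a list of training texts."""
    pair_counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        # Count each adjacent (current, next) word pair.
        for current, following in zip(words, words[1:]):
            pair_counts[current][following] += 1
    # Normalize the counts for each current word into probabilities.
    return {
        current: {w: n / sum(c.values()) for w, n in c.items()}
        for current, c in pair_counts.items()
    }
```

Training one such table on spam and another on non-spam lets a filter compare how likely a given word sequence is under each, rather than scoring the words in isolation.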

LifeType: Blogging platform

LifeType is an open-source blogging platform with support for multiple blogs and users in a single installation. It is written in PHP and backed by a MySQL database. LifeType is licensed under the GNU General Public License.

Spam Bully is anti-spam software made by Axaware, LLC. Spam Bully uses Bayesian filtering to separate good emails from spam emails. Spam Bully 3 included a feature which performed automated clicks on spam mail, similar to some other software, such as the later AdNauseam browser extension. Its features include the ability to report spammers to their providers and the FTC, and the option of converting the Spam Bully toolbar into a variety of languages including Spanish, German, Italian and Russian.

The Anti-Spam SMTP Proxy (ASSP) is an open-source, Perl-based, platform-independent transparent SMTP proxy server.

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam or is not spam. The spammer hopes that the addition of random words that are unlikely to appear in a spam message will cause the spam filter to believe the message to be legitimate—a statistical type II error.

Mailwasher is email filtering software for Windows that can detect and delete spam from a user's email while it is still on the mail server, before it is downloaded to the user's computer.

Pocomail was an e-mail client for Microsoft Windows systems that was first developed by Poco Systems in 1999. It was originally designed to provide better spam and attack protection by using its own scripting methods (PocoScript) as opposed to using JavaScript and native Microsoft scripting.

Gary Robinson: American software engineer and mathematician

Gary Robinson is an American software engineer, mathematician, and inventor notable for his mathematical algorithms to fight spam. In addition, he patented a method to use web browser cookies to track consumers across different web sites, allowing marketers to better match advertisements with consumers. The patent was bought by DoubleClick, and then DoubleClick was bought by Google. He is credited as being one of the first to use automated collaborative filtering technologies to turn word-of-mouth recommendations into useful data.

Bayesian programming: Statistics concept

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.