Sparse binary polynomial hashing

Sparse binary polynomial hashing (SBPH) is a generalization of Bayesian spam filtering that can match mutating phrases as well as single words.

SBPH is a way of generating a large number of features from an incoming text automatically, and then using statistics to determine the weights for each of those features in terms of their predictive values for spam/nonspam evaluation.
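The feature-generation step described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the window length of five tokens, the rule that the newest token is always kept, the `<skip>` and `<pad>` placeholders, the add-one smoothing, and the function names `sbph_features` and `feature_weights` are all assumptions made for the example and are not taken from this article. A real filter would also hash each phrase to a fixed-size bucket index, which is omitted here.

```python
from collections import Counter
from itertools import product
import re

WINDOW = 5  # assumed window length; not specified in the text above


def sbph_features(text, window=WINDOW):
    """Return sparse phrases for every position of a sliding token window.

    For each token, the newest token is always kept and each of the
    (window - 1) older tokens is either kept or replaced by "<skip>",
    giving 2 ** (window - 1) features per token.
    """
    tokens = re.findall(r"\S+", text)
    features = []
    for i, newest in enumerate(tokens):
        older = tokens[max(0, i - window + 1):i]
        # Left-pad so that every window has exactly (window - 1) older slots.
        older = ["<pad>"] * (window - 1 - len(older)) + older
        for mask in product((False, True), repeat=window - 1):
            kept = [tok if keep else "<skip>" for tok, keep in zip(older, mask)]
            features.append(" ".join(kept + [newest]))
    return features


def feature_weights(spam_texts, ham_texts):
    """Estimate a per-feature spam weight from counts, with add-one smoothing.

    This simple ratio is a stand-in for whatever statistics a real filter uses.
    """
    spam = Counter(f for t in spam_texts for f in sbph_features(t))
    ham = Counter(f for t in ham_texts for f in sbph_features(t))
    return {
        f: (spam[f] + 1) / (spam[f] + ham[f] + 2)
        for f in spam.keys() | ham.keys()
    }


if __name__ == "__main__":
    feats = sbph_features("buy cheap pills now today")
    print(len(feats))
    print("buy <skip> pills <skip> today" in feats)
```

Because skipped positions are recorded explicitly, a phrase such as "buy <skip> pills <skip> today" still matches when a sender varies the words in the skipped slots, which is how this kind of feature can follow mutating phrases that a single-word filter would miss.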


Related Research Articles

Email Method of exchanging digital messages between people over a network

Electronic mail is a method of exchanging messages ("mail") between people using electronic devices. Email entered limited use in the 1960s, but users could only send to users of the same computer. Some systems also supported a form of instant messaging, where sender and receiver needed to be online simultaneously. Ray Tomlinson is credited as the inventor of networked email; in 1971, he developed the first system able to send mail between users on different hosts across the ARPANET, using the @ sign to link the user name with a destination server. By the mid-1970s, this was the form recognized as email.

Spam (Monty Python) Monty Python sketch

"Spam" is a Monty Python sketch, first televised in 1970 and written by Terry Jones and Michael Palin. In the sketch, two customers are lowered by wires into a greasy spoon café and try to order a breakfast from a menu that includes Spam in almost every dish, much to the consternation of one of the customers. As the waitress recites the Spam-filled menu, a group of Viking patrons drown out all conversations with a song, repeating "Spam, Spam, Spam, Spam… Lovely Spam! Wonderful Spam!".

Spamming Unsolicited electronic messages, especially advertisements

Spamming is the use of messaging systems to send multiple unsolicited messages (spam) to large numbers of recipients for the purpose of commercial advertising, for the purpose of non-commercial proselytizing, for any prohibited purpose, or simply sending the same message over and over to the same user. While the most widely recognized form of spam is email spam, the term is applied to similar abuses in other media: instant messaging spam, Usenet newsgroup spam, Web search engine spam, spam in blogs, wiki spam, online classified ads spam, mobile phone messaging spam, Internet forum spam, junk fax transmissions, social spam, spam mobile apps, television advertising and file sharing spam. It is named after Spam, a luncheon meat, by way of a Monty Python sketch set in a restaurant that has Spam in almost every dish, in which Vikings annoyingly sing "Spam" repeatedly.

Spamdexing is the deliberate manipulation of search engine indexes. It involves a number of methods, such as link building and repeating unrelated phrases, to manipulate the relevance or prominence of resources indexed, in a manner inconsistent with the purpose of the indexing system.

In statistics, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models, but coupled with kernel density estimation, they can achieve higher accuracy levels.

Internet forum Online discussion site

An Internet forum, or message board, is an online discussion site where people can hold conversations in the form of posted messages. They differ from chat rooms in that messages are often longer than one line of text, and are at least temporarily archived. Also, depending on the access level of a user or the forum set-up, a posted message might need to be approved by a moderator before it becomes publicly visible.

Mobile phone spam Unwanted communication through a mobile phone

Mobile phone spam is a form of spam directed at the text messaging or other communications services of mobile phones or smartphones. As the popularity of mobile phones surged in the early 2000s, frequent users of text messaging began to see an increase in the number of unsolicited commercial advertisements being sent to their telephones through text messaging. This can be particularly annoying for the recipient because, unlike in email, some recipients may be charged a fee for every message received, including spam. Mobile phone spam is generally less pervasive than email spam; in 2010, around 90% of email was spam. The amount of mobile spam varies widely from region to region. In North America, mobile spam steadily increased from 2008 to 2012 and is projected to account for half of all mobile phone traffic in 2019. In parts of Asia, up to 30% of messages were spam in 2012.

Email spam Unsolicited electronic advertising by e-mail

Email spam, also referred to as junk email or simply spam, consists of unsolicited messages sent in bulk by email (spamming).

Naive Bayes spam filtering

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify spam e-mail, an approach commonly used in text classification.

Zombie (computing) Network-connected computer that has been compromised and is used for malicious tasks without the owner being aware of it

In computing, a zombie is a computer connected to the Internet that has been compromised by a hacker, computer virus, computer worm, or trojan horse program and can be used to perform malicious tasks of one sort or another under remote direction. Botnets of zombie computers are often used to spread e-mail spam and launch denial-of-service attacks. Most owners of "zombie" computers are unaware that their system is being used in this way. Because the owner tends to be unaware, these computers are metaphorically compared to fictional zombies. A coordinated DDoS attack by multiple botnet machines also resembles a "zombie horde attack", as depicted in fictional zombie films.

A joe job is a spamming technique that sends out unsolicited e-mails using spoofed sender data. Early joe jobs aimed at tarnishing the reputation of the apparent sender or inducing the recipients to take action against them, but they are now typically used by commercial spammers to conceal the true origin of their messages and to trick recipients into opening emails apparently coming from a trusted source.

Spambot Computer spam program (malware)

A spambot is a computer program designed to assist in the sending of spam. Spambots usually create accounts and send spam messages with them. Web hosts and website operators have responded by banning spammers, leading to an ongoing struggle between them and spammers in which spammers find new ways to evade the bans and anti-spam programs, and hosts counteract these methods.

Email harvesting or scraping is the process of obtaining lists of email addresses using various methods. Typically these are then used for bulk email or spam.

A spam blog, also known as an auto blog or the neologism splog, is a blog which the author uses to promote affiliated websites, to increase the search engine rankings of associated sites or to simply sell links/ads.

Spamming, in the context of video games, refers to the repeated use of the same item or action. For example, "grenade spamming" is the act of a player throwing many grenades in succession into an area. In fighting games, one form of spamming would be to execute the same offensive maneuver or combo many times in succession.

Image spam Type of email spam

Image-based spam, or image spam, is a kind of email spam where the textual spam message is embedded into images that are then attached to spam emails. Since most email clients display the image file directly to the user, the spam message is conveyed as soon as the email is opened.

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

Make Money Fast Electronic chain letter

Make Money Fast is a title of an electronically forwarded chain letter created in 1988 which became so infamous that the term is often used to describe all sorts of chain letters forwarded over the Internet, by e-mail spam, or in Usenet newsgroups. In anti-spammer slang, the name is often abbreviated "MMF".

Forum spam consists of posts on Internet forums that contain related or unrelated advertisements, links to malicious websites, trolling and abusive or otherwise unwanted information. Forum spam is usually posted onto message boards by automated spambots or manually with unscrupulous intentions with one idea in mind: to get the spam in front of readers who would not otherwise have anything to do with it intentionally.

BTDigg

BTDigg is the first Mainline DHT search engine. It participates in the BitTorrent DHT network, supporting the network and mapping magnet links to a few torrent attributes, which are indexed and inserted into a database. For end users, BTDigg provides a full-text database search via a Web interface. The Web part of its search system retrieves matching information for a user's text query and supports queries in European and Asian languages. The project name is an acronym of BitTorrent Digger. It went offline in June 2016, reportedly due to index spam. The site returned later in 2016 at a dot-com domain, went offline again and is now online. The btdig.com site has its torrent crawler's source listed on GitHub, dhtcrawler2.