CRM114 (program)

CRM114 (full name: "The CRM114 Discriminator") is a program that takes a statistical approach to classifying data and is used especially for filtering email spam.

Origin of the name

The name comes from the CRM-114 Discriminator in the Stanley Kubrick movie Dr. Strangelove, a piece of radio equipment designed to filter out messages lacking a specific code prefix.

Operation

Whereas other statistical Bayesian spam filters work from the frequency of single words in email, CRM114 achieves a higher rate of spam recognition by generating hits from phrases up to five words in length. These phrases are used to form a Markov Random Field representing the incoming texts. With this additional contextual recognition, it is one of the more accurate spam filters available. Initial testing in 2002 by author Bill Yerazunis [1] gave an accuracy of 99.87%; [2] tests by Holden [3] and at TREC 2005 and 2006 [4] [5] gave results better than 99%, with significant variation depending on the particular corpus.
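The idea of matching on phrases rather than single words can be illustrated with a minimal Python sketch. It only enumerates contiguous phrases of one to five words; the function name `phrase_features` and the whitespace tokenization are assumptions made for illustration, and this is not CRM114's actual feature generator or its Markov Random Field weighting.

```python
def phrase_features(text, window=5):
    """Return every contiguous phrase of 1..window words found in text.

    Illustrative only: CRM114's real feature extraction is more elaborate.
    """
    words = text.split()  # naive whitespace tokenization (an assumption)
    features = []
    for start in range(len(words)):
        for length in range(1, window + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the text
            features.append(" ".join(words[start:start + length]))
    return features


if __name__ == "__main__":
    for feature in phrase_features("buy cheap pills now"):
        print(feature)
```

A single-word filter would see only four features in this message, while the phrase-based view also records word order and adjacency, such as "cheap pills now".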

CRM114's classifier can also be switched to use Littlestone's Winnow algorithm, character-by-character correlation, a variant of k-nearest neighbor (KNN) classification called Hyperspace, a bit-entropic classifier that uses entropy encoding to determine similarity, a support vector machine (SVM), mutual compressibility as calculated by a modified LZ77 algorithm, and other more experimental classifiers. The actual features matched are based on a generalization of skip-grams.
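Classification by mutual compressibility can be sketched in a few lines of Python. This example uses the standard `zlib` module (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for CRM114's modified LZ77 algorithm; the function names and the toy corpora are hypothetical, and the scoring rule is a simplification rather than CRM114's own.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


def classify(sample: str, corpora: dict) -> str:
    """Return the label whose corpus grows least when the sample is appended.

    A sample that shares many substrings with a corpus compresses well
    alongside it, so the extra compressed bytes act as a dissimilarity score.
    """
    best_label, best_cost = None, None
    for label, corpus in corpora.items():
        base = compressed_size(corpus.encode("utf-8"))
        joint = compressed_size((corpus + " " + sample).encode("utf-8"))
        cost = joint - base
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


if __name__ == "__main__":
    # Toy training text, invented for this example.
    corpora = {
        "spam": "buy cheap pills now cheap pills online buy now " * 20,
        "ham": "meeting agenda for the project review on monday morning " * 20,
    }
    print(classify("cheap pills online buy now", corpora))
    print(classify("agenda for the project review", corpora))
```

The approach needs no explicit tokenization, which is one reason compression-based classifiers are attractive for arbitrary byte streams.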

The CRM114 algorithms are multi-lingual (compatible with UTF-8 encodings) and null-safe. A voting set of CRM114 classifiers has been demonstrated to distinguish confidential from non-confidential documents written in Japanese with a detection rate better than 99.9% and a false alarm rate of 5.3%. [6]

CRM114 is a good example of pattern recognition software, demonstrating how machine learning can be accomplished with a reasonably simple algorithm. The program's C source code is available under the GPL.

At a deeper level, CRM114 is also a string pattern matching language, similar to grep or even Perl. Although it is Turing complete, it is highly tuned for matching text, and even a simple (recursive) definition of the factorial takes almost ten lines. Part of this is because the CRM114 language syntax is not positional, but declensional. As a programming language, it may be used for many applications other than detecting spam. CRM114 uses the TRE approximate-match regex engine, so it is possible to write programs that do not depend on exactly identical strings matching in order to function correctly.
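Approximate matching means accepting a match that is within a small number of edits (insertions, deletions, substitutions) of the pattern. The Python sketch below illustrates the concept for literal patterns using a standard dynamic-programming approach; it is not TRE's API or algorithm, and the names `best_fuzzy_distance` and `fuzzy_contains` are hypothetical.

```python
def best_fuzzy_distance(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    # prev[j]: fewest edits to align the pattern prefix seen so far with
    # some substring of text ending at position j. An empty pattern
    # matches anywhere at zero cost, so the first row is all zeros.
    prev = [0] * (len(text) + 1)
    for i, pattern_char in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, text_char in enumerate(text, start=1):
            curr[j] = min(
                prev[j] + 1,                               # pattern char left unmatched
                curr[j - 1] + 1,                           # extra char in text
                prev[j - 1] + (pattern_char != text_char),  # match or substitution
            )
        prev = curr
    return min(prev)


def fuzzy_contains(pattern: str, text: str, max_errors: int = 1) -> bool:
    """True if pattern occurs in text with at most max_errors edits."""
    return best_fuzzy_distance(pattern, text) <= max_errors


if __name__ == "__main__":
    print(fuzzy_contains("viagra", "cheap v1agra here"))
    print(fuzzy_contains("viagra", "hello world"))
```

Tolerating a bounded number of edits lets a filter catch deliberately misspelled words that an exact match would miss.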

CRM114 has been applied to email filtering in the KMail client [7] [8] and in a number of other applications, including detection of bots on Twitter and Yahoo, [9] [10] as well as serving as the first-level filter in the U.S. Department of Transportation's vehicle defect detection system. [11] It has also been used as a predictive method for classifying fault-prone software modules. [12]

References

  1. Garretson, Cara (2007-03-19). "The antispam man". Network World.
  2. "CRM114 gets 99.87%". Paul Graham's website. 2002-10-16.
  3. Spam Filtering II
  4. Spam Track Overview (2005) - TREC 2005
  5. Spam Track Overview (2006) - TREC 2006
  6. "Archived copy" (PDF). media.blackhat.com. Archived from the original (PDF) on 2011-07-08.
  7. "Removing spam mail with CRM114 and KMail". Archived from the original on 2019-10-01. Retrieved 2019-10-01.
  8. "kmail.antispamrc at KDE/kdepim-addons". GitHub. 12 June 2022.
  9. Chu, Zi; Gianvecchio, Steven; Wang, Haining; Jajodia, Sushil (November 2012). "Detecting Automation of Twitter Accounts: Are You a Human, Bot, or Cyborg?". IEEE Transactions on Dependable and Secure Computing. 9 (6): 811–824. doi:10.1109/TDSC.2012.75. ISSN 1545-5971.
  10. "Measurement and Classification of Humans and Bots in Internet Chat". Usenix. Retrieved 2023-01-16.
  11. Scovel III, Calvin L. (2015-06-18). Inadequate Data and Analysis Undermine NHTSA’s Efforts To Identify and Investigate Vehicle Safety Concerns (PDF) (Report). Office of Inspector General - U.S. Department of Transportation.
  12. Mizuno, Osamu; Ikami, Shiro; Nakaichi, Shuya; Kikuno, Tohru (May 2007). "Spam Filter Based Approach for Finding Fault-Prone Software Modules". Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007): 4–4. doi:10.1109/MSR.2007.29.