Apache SpamAssassin

Developer(s): Apache Software Foundation [1]
Initial release: April 20, 2001
Stable release: 4.0.1 [2] / 29 March 2024
Repository: SpamAssassin Repository
Written in: Perl, C
Operating system: Cross-platform
Type: Spam filter
License: Apache License 2.0
Website: spamassassin.apache.org

Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS-based and fuzzy-checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and has been part of the Apache Software Foundation since 2004.

The program can be integrated with the mail server to automatically filter all mail for a site. It can also be run by individual users on their own mailbox and integrates with several mail programs. Apache SpamAssassin is highly configurable; if used as a system-wide filter it can still be configured to support per-user preferences.

History

Apache SpamAssassin was created by Justin Mason, who had maintained a number of patches against an earlier program named filter.plx by Mark Jeftovic, which in turn was begun in August 1997. Mason rewrote all of Jeftovic's code from scratch and uploaded the resulting codebase to SourceForge on April 20, 2001. [3]

In summer 2004 the project became an Apache Software Foundation project and was later officially renamed Apache SpamAssassin. [4]

Methods of usage

Apache SpamAssassin is a Perl-based application (Mail::SpamAssassin in CPAN) which is usually used to filter all incoming mail for one or several users. It can be run as a standalone application or as a subprogram of another application (such as a Milter, SA-Exim, Exiscan, MailScanner, MIMEDefang, Amavis) or as a client (spamc) that communicates with a daemon (spamd). The client/server or embedded mode of operation has performance benefits, but under certain circumstances may introduce additional security risks.

Typically either variant of the application is set up in a generic mail filter program, or it is called directly from a mail user agent that supports this, whenever new mail arrives. Mail filter programs such as procmail can be made to pipe all incoming mail through Apache SpamAssassin with an adjustment to a user's procmailrc file.
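The piping arrangement described above can be sketched in Python. This is a minimal illustration, not a drop-in replacement for procmail: it hands a raw message to a filter command on standard input and returns whatever the command writes to standard output, which is how the spamc client is normally used. The function name filter_message is hypothetical, and real filtering assumes spamc is installed and a spamd daemon is reachable.

```python
import subprocess
from typing import Sequence


def filter_message(raw_message: bytes,
                   command: Sequence[str] = ("spamc",)) -> bytes:
    """Pipe a raw message through a filter command and return its output.

    The default command is spamc, which reads a message on stdin and
    writes the processed message to stdout. The function name and the
    idea of making the command a parameter are illustrative assumptions.
    """
    result = subprocess.run(
        list(command),
        input=raw_message,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Keeping the command a parameter makes it possible to try the function with any program that copies standard input to standard output before pointing it at a real spamc.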

Operation

Apache SpamAssassin comes with a large set of rules which are applied to determine whether an email is spam or not. Most rules are based on regular expressions that are matched against the body or header fields of the message, but Apache SpamAssassin also employs a number of other spam-fighting techniques. The rules are called "tests" in the SpamAssassin documentation.

Each test has a score value that will be assigned to a message if it matches the test's criteria. The scores can be positive or negative, with positive values indicating "spam" and negative "ham" (non-spam messages). A message is matched against all tests and Apache SpamAssassin combines the results into a global score which is assigned to the message. The higher the score, the higher the probability that the message is spam.

Apache SpamAssassin has an internal (configurable) score threshold to classify a message as spam. Usually a message will only be considered as spam if it matches multiple criteria; matching just a single test will not usually be enough to reach the threshold.
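The scoring scheme described above can be sketched as follows. This is a simplified illustration rather than SpamAssassin's actual engine: the rule names, patterns and scores are invented for the example, and the threshold of 5.0 is an assumed value standing in for whatever an installation configures.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str            # hypothetical rule ("test") name
    pattern: re.Pattern  # regular expression matched against the text
    score: float         # positive leans spam, negative leans ham


# Hypothetical rules and scores, chosen only for illustration.
RULES = [
    Rule("DEMO_FREE_MONEY", re.compile(r"free money", re.I), 3.0),
    Rule("DEMO_ACT_NOW", re.compile(r"act now", re.I), 2.5),
    Rule("DEMO_KNOWN_PROJECT", re.compile(r"quarterly report", re.I), -1.5),
]

THRESHOLD = 5.0  # assumed threshold; configurable in a real installation


def classify(text, rules=RULES, threshold=THRESHOLD):
    """Return (total score, names of matched rules, is_spam)."""
    hits = [rule for rule in rules if rule.pattern.search(text)]
    total = sum(rule.score for rule in hits)
    return total, [rule.name for rule in hits], total >= threshold
```

With these example rules, a text matching only DEMO_FREE_MONEY scores 3.0 and stays below the threshold, while one that also matches DEMO_ACT_NOW reaches 5.5 and is classified as spam, which mirrors the point that a single matching test is usually not enough.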

If Apache SpamAssassin considers a message to be spam, it can rewrite the message. In the default configuration, the original mail is attached as a MIME attachment to a report that contains a brief excerpt of the message and a description of the tests which resulted in the mail being classified as spam. If the score is below the configured threshold, information about the tests that matched and the total score is by default still added to the email headers, where it can be used in post-processing for less severe actions, such as tagging the mail as suspicious.
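Post-processing on the added headers might look like the following Python sketch. It assumes the commonly seen X-Spam-Status layout, for example "No, score=2.1 required=5.0 tests=HTML_MESSAGE,RDNS_NONE autolearn=no"; the exact fields vary by version and configuration, and the function name parse_spam_status is hypothetical.

```python
import re
from email import message_from_string


def parse_spam_status(raw_message: str):
    """Extract verdict, score, threshold and test names from X-Spam-Status.

    Returns None when the header is missing or lacks the assumed
    score= and required= fields.
    """
    header = message_from_string(raw_message).get("X-Spam-Status")
    if header is None:
        return None
    # Undo header folding: collapse line breaks and tabs into single spaces.
    unfolded = re.sub(r"\s+", " ", header).strip()
    flag, _, rest = unfolded.partition(",")
    # Rejoin test names that folding separated after a comma.
    rest = re.sub(r",\s+", ",", rest)
    fields = dict(re.findall(r"(\w+)=(\S*)", rest))
    if "score" not in fields or "required" not in fields:
        return None
    tests = [t for t in fields.get("tests", "").split(",") if t and t != "none"]
    return {
        "is_spam": flag.strip().lower() == "yes",
        "score": float(fields["score"]),
        "required": float(fields["required"]),
        "tests": tests,
    }
```

A mail client or delivery script could use the returned score to, say, mark messages that fall just below the threshold as suspicious.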

Apache SpamAssassin allows for per-user configuration of its behavior, even if installed as a system-wide service; the configuration can be read from a file or a database. In their configuration, users can specify individuals whose emails are never considered spam, or change the scores for certain rules. Users can also define a list of languages in which they want to receive mail, and Apache SpamAssassin then assigns a higher score to all mail that appears to be written in another language.

Apache SpamAssassin is based on heuristics (pattern recognition), and such software exhibits false positives and false negatives.

Network-based filtering methods

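Apache SpamAssassin also supports network-based checks, including DNS blocklists (DNSBL), URI blocklists such as SURBL, and fuzzy-checksum databases.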

More methods can be added reasonably easily by writing a Perl plug-in for Apache SpamAssassin.

Bayesian filtering

Apache SpamAssassin reinforces its rules through Bayesian filtering, in which a user or administrator "feeds" examples of good (ham) and bad (spam) mail into the filter so that it learns the difference between the two. For this purpose, Apache SpamAssassin provides the command-line tool sa-learn, which can be instructed to learn a single mail or an entire mailbox as either ham or spam.

Typically, the user will move unrecognized spam to a separate folder, and then run sa-learn on the folder of non-spam and on the folder of spam separately. Alternatively, if the mail user agent supports it, sa-learn can be called for individual emails. Regardless of the method used to perform the learning, SpamAssassin's Bayesian test will help score future e-mails based on this learning to improve the accuracy.
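The idea of learning from labelled examples can be illustrated with a toy naive Bayes classifier in Python. This is a deliberately simplified sketch and not SpamAssassin's own Bayes implementation or the behaviour of sa-learn: the class name TinyBayes, the tokenizer and the add-one smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class TinyBayes:
    """Toy naive Bayes token classifier (not SpamAssassin's implementation)."""

    def __init__(self):
        # Number of learned messages of each class that contain a token.
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        """Return the set of lower-cased word tokens in the text."""
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def learn(self, text, label):
        """Record one example; label must be 'spam' or 'ham'."""
        self.messages[label] += 1
        self.tokens[label].update(self.tokenize(text))

    def spam_probability(self, text):
        """Estimate P(spam | tokens present) using add-one smoothing."""
        n_spam, n_ham = self.messages["spam"], self.messages["ham"]
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for tok in self.tokenize(text):
            p_spam = (self.tokens["spam"][tok] + 1) / (n_spam + 2)
            p_ham = (self.tokens["ham"][tok] + 1) / (n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Numerically stable logistic function.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        odds = math.exp(log_odds)
        return odds / (1 + odds)
```

Feeding the classifier a few spam and ham examples with learn() shifts spam_probability() for later messages that share their vocabulary, which is the effect the training step is meant to have.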

Licensing

Apache SpamAssassin is free/open source software, licensed under the Apache License 2.0. Versions prior to 3.0 are dual-licensed under the Artistic License and the GNU General Public License.

sa-compile

sa-compile is a utility distributed with Apache SpamAssassin that compiles a SpamAssassin ruleset into a deterministic finite automaton that allows Apache SpamAssassin to use processor power more efficiently.

Testing Apache SpamAssassin

Apache SpamAssassin is designed to trigger on the GTUBE, a 68-byte string similar to the antivirus EICAR test file. If this string is inserted in an RFC 5322 formatted message and passed through the Apache SpamAssassin engine, Apache SpamAssassin will trigger with a weight of 1000.
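A test message of this kind can be built with Python's standard email package, as in the sketch below. The addresses, the subject line and the function name build_gtube_message are placeholders chosen for illustration; the GTUBE string itself must appear unmodified on a single line of the body.

```python
from email.message import EmailMessage

# The 68-byte GTUBE test string.
GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"


def build_gtube_message(sender: str, recipient: str) -> EmailMessage:
    """Build a minimal RFC 5322 message whose body contains GTUBE."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "GTUBE test"  # placeholder subject
    msg.set_content("This is a spam filter test.\n" + GTUBE + "\n")
    return msg
```

The serialized message (for example via as_bytes()) can then be passed to the filter to confirm that it is being invoked and is tagging mail as expected.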

Notes

  1. "Project Management Committee". The Apache Software Foundation. 2022. Retrieved 23 August 2023.
  2. Sidney Markowitz (29 March 2024). "[ANNOUNCE] Apache SpamAssassin 4.0.1 available". Retrieved 30 March 2024.
  3. "SpamAssassin Prehistory". Apache Foundation. Retrieved 19 December 2018.
  4. "SpamAssassin Project Incubation Status". Apache Foundation. Retrieved 19 December 2018.
