Image-based spam, [3] [4] or image spam, is a kind of email spam where the textual spam message is embedded into images, that are then attached to spam emails. [5] Since most of the email clients will display the image file directly to the user, the spam message is conveyed as soon as the email is opened (there is no need to further open the attached image file).
The goal of image spam is clearly to circumvent the analysis of the email’s textual content performed by most spam filters [5] (e.g., SpamAssassin, RadicalSpam, Bogofilter, SpamBayes). Accordingly, for the same reason, together with the attached image, often spammers add some “bogus” text to the email, namely, a number of words that are most likely to appear in legitimate emails and not in spam. The earlier image spam emails contained spam images in which the text was clean and easily readable, as shown in Fig. 1.
Consequently, optical character recognition tools were used to extract the text embedded into spam images, which could be then processed together with the text in the email’s body by the spam filter, or, more generally, by more sophisticated text categorization techniques. [3] [6] Further, signatures (e.g., MD5 hashing) were also generated to easily detected and block already known spam images. Spammers in turn reacted by applying some obfuscation techniques to spam images, similarly to CAPTCHAs, both to prevent the embedded text to be read by OCR tools, and to mislead signature-based detection. Some examples are shown in Fig. 2.
This raised the issue of improving image spam detection using computer vision and pattern recognition techniques. [3] [4] [7] [8]
In particular, several authors investigated the possibility of recognizing image spam with obfuscated images by using generic low-level image features (like number of colours, prevalent colour coverage, image aspect ratio, text area), image metadata, etc. [7] [8] [9] [10] (see [4] for a comprehensive survey). Notably, some authors also tried detecting the presence of text in attached images with artifacts denoting an adversarial attempt to obfuscate it. [11] [12] [13] [14]
Image spam started in 2004 and peaked at the end of 2006, when over 50% of spam was image spam. In mid-2007, it started declining, and practically disappeared in 2008. [1] The reason behind this phenomenon is not easy to understand. The decline of image spam can probably be attributed both to the improvement of the proposed countermeasures (e.g., fast image spam detectors based on visual features), and to the higher requirements in terms of bandwidth of image spam that force spammers to send a smaller amount of spam over a given time interval. Both factors might have made image spam less convenient for spammers than other kinds of spam. Nevertheless, at the end of 2011 a rebirth of image spam was detected, and image spam reached 8% of all spam traffic, albeit for a small period. [2]
Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.
Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.
Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most possible words.
Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and is a part of the Apache Foundation since 2004.
Various anti-spam techniques are used to prevent email spam.
Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.
CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.
Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.
Address munging is the practice of disguising an e-mail address to prevent it from being automatically collected by unsolicited bulk e-mail providers. Address munging is intended to disguise an e-mail address in a way that prevents computer software from seeing the real address, or even any address at all, but still allows a human reader to reconstruct the original and contact the author: an email address such as, "no-one@example.com", becomes "no-one at example dot com", for instance.
In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document and this kind of semantic labeling is the scope of the logical layout analysis.
VoIP spam or SPIT is unsolicited, automatically dialed telephone calls, typically using voice over Internet Protocol (VoIP) technology.
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating, and independent features is crucial to produce effective algorithms for pattern recognition, classification, and regression tasks. Features are usually numeric, but other types such as strings and graphs are used in syntactic pattern recognition, after some pre-processing step such as one-hot encoding. The concept of "features" is related to that of explanatory variables used in statistical techniques such as linear regression.
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.
Email harvesting or scraping is the process of obtaining lists of email addresses using various methods. Typically these are then used for bulk email or spam.
Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of the objects may vary somewhat in different view points, in many different sizes and scales or even when they are translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.
Forum spam consists of posts on Internet forums that contains related or unrelated advertisements, links to malicious websites, trolling and abusive or otherwise unwanted information. Forum spam is usually posted onto message boards by automated spambots or manually with unscrupulous intentions with intent to get the spam in front of readers who would not otherwise have anything to do with it intentionally.
Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.
Asprise OCR is a commercial optical character recognition and barcode recognition SDK library that provides an API to recognize text as well as barcodes from images and output in formats like plain text, XML and searchable PDF.
Scene text is text that appears in an image captured by a camera in an outdoor environment.
Claudia Eckert is a German computer scientist specializing in middleware, computer security, malware, and the use of machine learning techniques to detect malware. She is managing director of the Fraunhofer Institute for Applied and Integrated Security in Garching, Germany, and professor and chair for security in computer science in the School of Computation, Information and Technology of the Technical University of Munich.