Image spam

Last updated
Fig. 1. Example of a clean spam image Clean Spam.png
Fig. 1. Example of a clean spam image
Fig. 2. Examples of obfuscated spam images to evade OCR-based and signature-based detection Obfuscated spam.jpg
Fig. 2. Examples of obfuscated spam images to evade OCR-based and signature-based detection
Fig. 3. Average size of spam versus percentage of image spam Average size of spam versus percentage of Image Spam.jpg
Fig. 3. Average size of spam versus percentage of image spam
Fig. 4. Average size of spam versus percentage of image and ZIP/RAR spam (2011-2012, per week) Size of spam versus Image and ZIPRAR Spam.jpg
Fig. 4. Average size of spam versus percentage of image and ZIP/RAR spam (2011-2012, per week)

Image-based spam, [3] [4] or image spam, is a kind of email spam where the textual spam message is embedded into images, that are then attached to spam emails. [5] Since most of the email clients will display the image file directly to the user, the spam message is conveyed as soon as the email is opened (there is no need to further open the attached image file).

Contents

Technique

The goal of image spam is clearly to circumvent the analysis of the email’s textual content performed by most spam filters [5] (e.g., SpamAssassin, RadicalSpam, Bogofilter, SpamBayes). Accordingly, for the same reason, together with the attached image, often spammers add some “bogus” text to the email, namely, a number of words that are most likely to appear in legitimate emails and not in spam. The earlier image spam emails contained spam images in which the text was clean and easily readable, as shown in Fig. 1.

Detection

Consequently, optical character recognition tools were used to extract the text embedded into spam images, which could be then processed together with the text in the email’s body by the spam filter, or, more generally, by more sophisticated text categorization techniques. [3] [6] Further, signatures (e.g., MD5 hashing) were also generated to easily detected and block already known spam images. Spammers in turn reacted by applying some obfuscation techniques to spam images, similarly to CAPTCHAs, both to prevent the embedded text to be read by OCR tools, and to mislead signature-based detection. Some examples are shown in Fig. 2.

This raised the issue of improving image spam detection using computer vision and pattern recognition techniques. [3] [4] [7] [8]

In particular, several authors investigated the possibility of recognizing image spam with obfuscated images by using generic low-level image features (like number of colours, prevalent colour coverage, image aspect ratio, text area), image metadata, etc. [7] [8] [9] [10] (see [4] for a comprehensive survey). Notably, some authors also tried detecting the presence of text in attached images with artifacts denoting an adversarial attempt to obfuscate it. [11] [12] [13] [14]

History

Image spam started in 2004 and peaked at the end of 2006, when over 50% of spam was image spam. In mid-2007, it started declining, and practically disappeared in 2008. [1] The reason behind this phenomenon is not easy to understand. The decline of image spam can probably be attributed both to the improvement of the proposed countermeasures (e.g., fast image spam detectors based on visual features), and to the higher requirements in terms of bandwidth of image spam that force spammers to send a smaller amount of spam over a given time interval. Both factors might have made image spam less convenient for spammers than other kinds of spam. Nevertheless, at the end of 2011 a rebirth of image spam was detected, and image spam reached 8% of all spam traffic, albeit for a small period. [2]

See also

Related Research Articles

<span class="mw-page-title-main">Optical character recognition</span> Computer recognition of visual text

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image.

<span class="mw-page-title-main">Pattern recognition</span> Automated recognition of patterns and regularities in data

Pattern recognition is the automated recognition of patterns and regularities in data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent pattern. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

<span class="mw-page-title-main">Handwriting recognition</span> Ability of a computer to receive and interpret intelligible handwritten input

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available. A handwriting recognition system handles formatting, performs correct segmentation into characters, and finds the most possible words.

<span class="mw-page-title-main">Apache SpamAssassin</span> Open-source e-mail spam filter

Apache SpamAssassin is a computer program used for e-mail spam filtering. It uses a variety of spam-detection techniques, including DNS and fuzzy checksum techniques, Bayesian filtering, external programs, blacklists and online databases. It is released under the Apache License 2.0 and is a part of the Apache Foundation since 2004.

Various anti-spam techniques are used to prevent email spam.

Edge detection includes a variety of mathematical methods that aim at identifying edges, defined as curves in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. The same problem of finding discontinuities in one-dimensional signals is known as step detection and the problem of finding signal discontinuities over time is known as change detection. Edge detection is a fundamental tool in image processing, machine vision and computer vision, particularly in the areas of feature detection and feature extraction.

<span class="mw-page-title-main">Email spam</span> Unsolicited electronic advertising by e-mail

Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoidable, and repetitive. Email spam has steadily grown since the early 1990s, and by 2014 was estimated to account for around 90% of total email traffic.

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

<span class="mw-page-title-main">Naive Bayes spam filtering</span>

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

Address munging is the practice of disguising an e-mail address to prevent it from being automatically collected by unsolicited bulk e-mail providers. Address munging is intended to disguise an e-mail address in a way that prevents computer software from seeing the real address, or even any address at all, but still allows a human reader to reconstruct the original and contact the author: an email address such as, "no-one@example.com", becomes "no-one at example dot com", for instance.

In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document and this kind of semantic labeling is the scope of the logical layout analysis.

VoIP spam or SPIT is unsolicited, automatically dialed telephone calls, typically using voice over Internet Protocol (VoIP) technology.

<span class="mw-page-title-main">Feature (machine learning)</span> Measurable property or characteristic

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern recognition, classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of "feature" is related to that of explanatory variable used in statistical techniques such as linear regression.

Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

Email harvesting or scraping is the process of obtaining lists of email addresses using various methods. Typically these are then used for bulk email or spam.

Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, despite the fact that the image of the objects may vary somewhat in different view points, in many different sizes and scales or even when they are translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.

Forum spam consists of posts on Internet forums that contains related or unrelated advertisements, links to malicious websites, trolling and abusive or otherwise unwanted information. Forum spam is usually posted onto message boards by automated spambots or manually with unscrupulous intentions with intent to get the spam in front of readers who would not otherwise have anything to do with it intentionally.

<span class="mw-page-title-main">Adversarial machine learning</span> Research field that lies at the intersection of machine learning and computer security

Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.

Asprise OCR is a commercial optical character recognition and barcode recognition SDK library that provides an API to recognize text as well as barcodes from images and output in formats like plain text, xml and searchable PDF.

<span class="mw-page-title-main">Scene text</span> Text captured as part of outdoor surroundings in a photograph

Scene text is text that appears in an image captured by a camera in an outdoor environment.

References

  1. 1 2 IBM X-Force® 2010, Mid-Year Trend and Risk Report (August 2010).
  2. 1 2 IBM X-Force® 2012, Mid-Year Trend and Risk Report (September 2012).
  3. 1 2 3 Giorgio Fumera, Ignazio Pillai, Fabio Roli, "Spam filtering based on the analysis of text information embedded into images". Journal of Machine Learning Research (special issue on Machine Learning in Computer Security), vol. 7, pp. 2699-2720, 12/2006.
  4. 1 2 3 Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli,Biggio, Battista; Fumera, Giorgio; Pillai, Ignazio; Roli, Fabio (2011). "A survey and experimental evaluation of image spam filtering techniques, Pattern Recognition Letters". Pattern Recognition Letters. 32 (10): 1436–1446. doi:10.1016/j.patrec.2011.03.022. Volume 32, Issue 10, 15 July 2011, Pages 1436-1446, ISSN 0167-8655.
  5. 1 2 Li, Siyuan; Li, Ruiguang; Xu, Yuan; Zhou, Hao; Yan, Hanbing; Xu, Bin; Zhang, Honggang (2018-09-01). "WAF‐Based Chinese Character Recognition for Spam Image Filtering". Chinese Journal of Electronics. 27 (5): 1050–1055. doi: 10.1049/cje.2018.06.014 . ISSN   1022-4653.
  6. "Bayes OCR Spam Assassin's Plugin".
  7. 1 2 Aradhye, H., Myers, G., Herson, J. A., 2005. Image analysis for efficient cat egorization of image-based spam e-mail. In: Proc. Int. Conf. on Document Analysis and Recognition, pp. 914–918.
  8. 1 2 Dredze, M., Gevaryahu, R., Elias-Bachrach, A., 2007. Learning fast classifiers for image spam. In: Proc. 4th Conf. on Email and Anti-Spam (CEAS)
  9. Wu, C.-T., Cheng, K.-T., Zhu, Q., Wu, Y.-L., 2005. Using visual features for anti-spam filtering. In: Proc. IEEE Int. Conf. on Image Processing, Vol. III.pp. 501–504.
  10. Liu, Q., Qin, Z., Cheng, H., Wan, M., 2010. Efficient modeling of spam images. In: Int. Symp. on Intelligent Information Technology and Security Informatics. IEEE Computer Society, pp. 663–666.
  11. "Fuzzy - OCR Spam Assassin's Plugin".
  12. Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli , "Image Spam Filtering Using Visual Information", 14th Int. Conf. on Image Analysis and Processing (ICIAP 2007), Modena, Italy, IEEE Computer Society, pp. 105--110, 10/09/2007.
  13. Fabio Roli, Battista Biggio, Giorgio Fumera, Ignazio Pillai, Riccardo Satta , "Image Spam Filtering by Detection of Adversarial Obfuscated Text", Workshop on Neural Information Processing Systems (NIPS), Whistler, British Columbia, Canada, 08/12/2007.
  14. Battista Biggio, Giorgio Fumera, Ignazio Pillai, Fabio Roli , "Improving Image Spam Filtering Using Image Text Features", Fifth Conference on Email and Anti-Spam (CEAS 2008), Mountain View, CA, USA, 21/08/2008.