Steganalysis

Steganalysis is the study of detecting messages hidden using steganography; this is analogous to cryptanalysis applied to cryptography.

Overview

The goal of steganalysis is to identify suspected packages, determine whether or not they have a payload encoded into them, and, if possible, recover that payload.

Unlike cryptanalysis, in which intercepted data contains a message (though that message is encrypted), steganalysis generally starts with a pile of suspect data files, but little information about which of the files, if any, contain a payload. The steganalyst is usually something of a forensic statistician, and must start by reducing this set of data files (which is often quite large; in many cases, it may be the entire set of files on a computer) to the subset most likely to have been altered.

Basic techniques

The problem is generally handled with statistical analysis. A set of unmodified files of the same type, and ideally from the same source (for example, the same model of digital camera, or, if possible, the same digital camera; digital audio from the CD that MP3 files were "ripped" from; etc.) as the set being inspected, is analyzed for various statistics. Some of these are as simple as spectrum analysis, but since most image and audio files are now compressed with lossy compression algorithms such as JPEG and MP3, analysts also look for inconsistencies in the way this data has been compressed. For example, a common artifact in JPEG compression is "edge ringing", where high-frequency components (such as the high-contrast edges of black text on a white background) distort neighboring pixels. This distortion is predictable, and simple steganographic encoding algorithms will produce artifacts that are detectably unlikely. One such statistical test is sketched below.
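
The following is a minimal sketch of the chi-square attack on least-significant-bit (LSB) embedding described by Westfeld and Pfitzmann, assuming the suspect image has already been loaded as a NumPy array of 8-bit samples; the function name and the sparse-bin threshold are illustrative choices, not part of any standard tool.

    # Chi-square attack sketch: full LSB embedding tends to equalize the
    # histogram counts within each "pair of values" (2k, 2k+1). A close fit
    # to the equalized histogram is therefore evidence of embedding.
    import numpy as np
    from scipy.stats import chi2

    def chi_square_attack(samples: np.ndarray) -> float:
        hist = np.bincount(samples.ravel(), minlength=256).astype(float)
        even, odd = hist[0::2], hist[1::2]       # counts for values 2k, 2k+1
        expected = (even + odd) / 2.0            # what full embedding predicts
        mask = expected > 5                      # drop sparse bins (convention)
        dof = int(mask.sum()) - 1
        if dof < 1:
            return 0.0                           # too little data to decide
        stat = np.sum((even[mask] - expected[mask]) ** 2 / expected[mask])
        return 1.0 - chi2.cdf(stat, dof)         # near 1.0 suggests embedding

Values near 1.0 indicate that the pair-of-values histogram is suspiciously well equalized, which is what dense LSB embedding produces; values near 0.0 are consistent with an unmodified carrier.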

One case where detection of suspect files is straightforward is when the original, unmodified carrier is available for comparison. Comparing the package against the original file yields the differences caused by encoding the payload; from these, the payload itself can often be extracted, as in the sketch below.
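
A minimal sketch of such a known-cover comparison, assuming a naive sequential LSB-replacement scheme and two equally sized 8-bit carriers loaded as NumPy arrays (the function name is illustrative):

    # Known-cover extraction sketch: the changed positions mark how far the
    # embedding ran; the payload is then the LSBs of the suspect file's
    # samples, in scan order, up to the last modified sample.
    import numpy as np

    def extract_lsb_payload(cover: np.ndarray, suspect: np.ndarray) -> bytes:
        cover, suspect = cover.ravel(), suspect.ravel()
        changed = cover != suspect
        if not changed.any():
            return b""                           # no differences found
        last = int(np.nonzero(changed)[0][-1])   # extent of the embedding
        bits = suspect[: last + 1] & 1           # payload bits (LSBs)
        return np.packbits(bits).tobytes()

Note that roughly half of the payload bits will happen to coincide with the carrier's original LSBs and so produce no difference, which is why the sketch reads LSBs up to the last changed position rather than reading only the changed positions.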

Advanced techniques

Noise floor consistency analysis

In some cases, such as when only a single image is available, more complicated analysis techniques may be required. In general, steganography attempts to make distortion to the carrier indistinguishable from the carrier's noise floor. In practice, however, this is often improperly simplified to making the modifications to the carrier resemble white noise as closely as possible, rather than analyzing, modeling, and then consistently emulating the actual noise characteristics of the carrier. In particular, many simple steganographic systems simply modify the least-significant bit (LSB) of each sample; this gives the modified samples not only a different noise profile from unmodified samples, but also LSBs with a different noise profile than would be expected from analysis of their higher-order bits, which still show some amount of noise. Such LSB-only modification can be detected with appropriate algorithms, in some cases detecting encoding densities as low as 1% with reasonable reliability. [1]
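
The cited detection method (RS analysis) is considerably more sophisticated, but the underlying inconsistency can be illustrated with a much simpler heuristic, assuming a 2-D 8-bit grayscale image as a NumPy array; the function names and the use of horizontal-neighbor agreement are illustrative assumptions, not the patented algorithm.

    # Bit-plane consistency sketch: in a natural image, both the LSB plane
    # and the second bit plane retain some spatial structure. Dense
    # LSB-replacement pushes the LSB plane toward white noise while
    # leaving the higher planes untouched.
    import numpy as np

    def plane_agreement(img: np.ndarray, bit: int) -> float:
        plane = (img >> bit) & 1
        # Fraction of horizontally adjacent pixels whose bit agrees:
        # ~0.5 for white noise, noticeably higher for natural images.
        return float(np.mean(plane[:, :-1] == plane[:, 1:]))

    def lsb_whiteness_gap(img: np.ndarray) -> float:
        # Large positive gap: the LSB plane is "whiter" than the plane
        # above it would predict, hinting at LSB-replacement embedding.
        return plane_agreement(img, 1) - plane_agreement(img, 0)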

Further complications

Encrypted payloads

Detecting a probable steganographic payload is often only part of the problem, as the payload may have been encrypted first. Encrypting the payload is not always done solely to make its recovery more difficult. Most strong ciphers have the desirable property of making the payload appear indistinguishable from uniformly distributed noise, which can make detection efforts more difficult and saves the steganographic encoding technique the trouble of having to distribute the signal energy evenly (but see above concerning errors in emulating the native noise of the carrier).
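
This property can itself be tested: a candidate payload that fails standard randomness tests is unlikely to be ciphertext, while one that passes is at least consistent with an encrypted payload. Below is a minimal sketch using a chi-square goodness-of-fit test against a uniform byte distribution; the function name and threshold are illustrative, and the test is only meaningful for payloads of at least a few kilobytes.

    # Uniformity sketch: strong ciphers produce output whose byte
    # frequencies are statistically indistinguishable from uniform.
    import numpy as np
    from scipy.stats import chisquare

    def looks_like_ciphertext(payload: bytes, alpha: float = 0.01) -> bool:
        counts = np.bincount(np.frombuffer(payload, dtype=np.uint8),
                             minlength=256)
        _, p_value = chisquare(counts)   # null hypothesis: uniform bytes
        return p_value > alpha           # True: consistent with ciphertext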

Barrage noise

If inspection of a storage device is considered very likely, the steganographer may attempt to barrage a potential analyst with, effectively, misinformation. This may be a large set of files encoded with anything from random data, to white noise, to meaningless drivel, to deliberately misleading information. The encoding density on these files may be slightly higher than that of the "real" ones; likewise, the possible use of multiple algorithms of varying detectability should be considered. The steganalyst may be forced to check these decoys first, potentially wasting significant time and computing resources. The downside of this technique is that it makes it much more obvious that steganographic software was available and was used.

Conclusions and further action

Obtaining a warrant or taking other action based solely on steganalytic evidence is a very risky proposition unless a payload has been completely recovered and decrypted, because otherwise all the analyst has is a statistic indicating that a file may have been modified, and that the modification may have been the result of steganographic encoding. Because this will frequently be the case, steganalytic suspicions will often have to be backed up with other investigative techniques.

See also

Data compression
Lossy compression
Lossless compression
Steganography
Image compression
Discrete cosine transform
Speex
G.711
Compression artifact
Digital watermark
Generation loss
Lossless JPEG
StegoShare
Audio watermark
Cinavia
Steganography tools
OpenPuff
BPCS-steganography
Error level analysis
OutGuess

References

  1. U.S. Patent No. 6,831,991, "Reliable detection of LSB steganography in color and grayscale images"; Fridrich, Jessica, et al.; issued December 14, 2004. (This invention was made with Government support under grants F30602-00-1-0521 and F49620-01-1-0123 from the U.S. Air Force. The Government has certain rights in the invention.)
