ADX (file format)

CRI ADX
Developer(s): CRI Middleware
Initial release: 1996
Platform: Cross-platform
Type: Codec / File format
License: Proprietary
Website: CRI Middleware

CRI ADX is a proprietary audio container and lossy compression format developed by CRI Middleware specifically for use in video games; it is derived from ADPCM. Its most notable feature is a looping function that has proved useful for background sounds in the games that have adopted the format, including many games for the Sega Dreamcast as well as some PlayStation 2, GameCube and Wii games. One of the first games to use ADX was Burning Rangers, on the Sega Saturn. The Sonic the Hedgehog series and the majority of Sega games for home video consoles and PCs have continued to use this format for sound and voice recordings since the Dreamcast generation. Jet Set Radio Future for the original Xbox also used this format. [1]

The ADX toolkit also includes a sibling format, AHX, which uses a variant of MPEG-2 audio intended specifically for voice recordings, and a packaging archive, AFS, for bundling multiple CRI ADX and AHX tracks into a single container file.

Version 2 of the format (ADX2) uses the HCA and HCA-MX formats, which are usually bundled into container files with the extensions ACB and AWB. The AWB container, which mostly holds the binary data for the HCA files, is not to be confused with the unrelated audio format that uses the same extension.

General overview

CRI ADX is a lossy audio format, but unlike formats such as MP3, it does not apply a psychoacoustic model to the sound to reduce its complexity. Instead, the ADPCM model stores each sample as the error relative to a prediction function, which means more of the original signal survives the encoding process; it trades accuracy of representation for size by storing those errors in small samples, usually 4 bits each. The human auditory system's tolerance for the resulting noise makes the loss of accuracy barely noticeable.

Like other encoding formats, CRI ADX supports sample rates of up to 96000 Hz. However, the output sample depth is locked at 16 bits, generally because of the lack of precision that comes with small sample sizes. The file format itself can represent up to 255 channels, but in practice there seems to be an implicit limit of stereo (2-channel) audio. The only particularly distinctive feature that sets CRI ADX apart from other ADPCM formats is the integrated looping functionality, which lets an audio player optionally skip backwards after reaching a single specified point in the track to create a coherent loop. Hypothetically, this functionality could be used to skip forwards as well, but that would be redundant since the audio could simply be clipped with an editing program instead.

For playback outside CRI Middleware's in-house software, there are a few plugins for Winamp and some WAV conversion tools. FFmpeg also implements CRI ADX support, but its decoder is hard-coded and so can only properly decode 44100 Hz ADX files.

Technical description

The CRI ADX specification is not freely available; however, the most important elements of the structure have been reverse engineered and documented in various places on the web. As a side note, the AFS archive files that CRI ADX files are sometimes packed in are a simple variant of a tarball that uses numerical indices rather than names to identify the contents.

The ADX disk format is defined in big-endian. The identified sections of the main header are outlined below:

    Offset  Size  Field
    0x00    1     0x80
    0x01    1     0x00
    0x02    2     Copyright Offset
    0x04    1     Encoding Type
    0x05    1     Block Size
    0x06    1     Sample Bitdepth
    0x07    1     Channel Count
    0x08    4     Sample Rate
    0x0C    4     Total Samples
    0x10    2     Highpass Frequency
    0x12    1     Version
    0x13    1     Flags
    0x14    2     Loop Alignment Samples (v3)
    0x16    2     Loop Enabled (v3)
    0x18    4     Loop Enabled (v3)
    0x1C    4     Loop Begin Sample Index (v3)
    0x20    4     Loop Begin Byte Index (v3)
    0x24    4     Loop Enabled (v4) / Loop End Sample Index (v3)
    0x28    4     Loop Begin Sample Index (v4) / Loop End Byte Index (v3)
    0x2C    4     Loop Begin Byte Index (v4)
    0x30    4     Loop End Sample Index (v4)
    0x34    4     Loop End Byte Index (v4)
    0x38    ...   Zero or more bytes empty space
    ???     6     [CopyrightOffset - 2] ASCII (unterminated) string: "(c)CRI"
    ...           [CopyrightOffset + 4] Audio data starts here

Fields labelled "Unknown" contain either unknown data or are apparently just reserved (i.e. filled with null bytes). Fields labelled with 'v3' or 'v4' but not both are considered "Unknown" in the version they are not marked with. This header may be as short as 20 bytes (0x14), as determined by the copyright offset, which implicitly removes support for a loop since those fields are not present.
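As a rough illustration of the layout above, the following sketch reads the fixed 20-byte part of the header from a byte buffer. The struct and function names (`adx_basic_header`, `adx_parse_basic_header`, `read_be16`, `read_be32`) are hypothetical and not part of any CRI API, and the field widths are inferred from the offsets in the table, so treat it as a starting point rather than a reference parser.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of the first 0x14 bytes of the header. */
typedef struct {
    uint16_t copyright_offset;
    uint8_t  encoding_type;
    uint8_t  block_size;
    uint8_t  sample_bitdepth;
    uint8_t  channel_count;
    uint32_t sample_rate;
    uint32_t total_samples;
    uint16_t highpass_frequency;
    uint8_t  version;
    uint8_t  flags;
} adx_basic_header;

/* Read big-endian integers byte by byte, independent of the host byte order. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns true if the buffer starts with 0x80 0x00, is long enough, and
 * carries the "(c)CRI" marker at [copyright_offset - 2]. */
static bool adx_parse_basic_header(const uint8_t *data, size_t size,
                                   adx_basic_header *out)
{
    if (size < 0x14 || data[0] != 0x80 || data[1] != 0x00)
        return false;

    out->copyright_offset   = read_be16(data + 0x02);
    out->encoding_type      = data[0x04];
    out->block_size         = data[0x05];
    out->sample_bitdepth    = data[0x06];
    out->channel_count      = data[0x07];
    out->sample_rate        = read_be32(data + 0x08);
    out->total_samples      = read_be32(data + 0x0C);
    out->highpass_frequency = read_be16(data + 0x10);
    out->version            = data[0x12];
    out->flags              = data[0x13];

    /* The 6-byte marker ends exactly where the audio data begins. */
    if (out->copyright_offset < 2 || (size_t)out->copyright_offset + 4 > size)
        return false;
    return memcmp(data + out->copyright_offset - 2, "(c)CRI", 6) == 0;
}
```

A caller would pass the first bytes of the file and, on success, begin reading audio data at `copyright_offset + 4`. The version-specific loop fields are deliberately left out of this sketch.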

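The "Encoding Type" field should contain one of:

    0x02 for CRI ADX with pre-set prediction coefficients
    0x03 for standard CRI ADX
    0x04 for CRI ADX with an exponential scale
    0x10 or 0x11 for AHX

The "Version" field should contain one of:

    0x03 for CRI ADX 'version 3'
    0x04 for CRI ADX 'version 4'
    0x05 for a variant of CRI ADX 'version 4' without looping support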

When decoding AHX audio, the version field does not appear to have any meaning and can be safely ignored.

Files with encoding type '2' use 4 possible sets of prediction coefficients as listed below:

             Coefficient 0   Coefficient 1
    Set 0    0x0000          0x0000
    Set 1    0x0F00          0x0000
    Set 2    0x1CC0          0xF300
    Set 3    0x1880          0xF240
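The article does not say how these 16-bit values map to real-valued coefficients. One plausible reading, shown here purely as an assumption, is signed fixed point with 12 fractional bits (that is, the raw value divided by 4096). The names `adx_fixed_coefficients` and `adx_fixed_coefficient` are hypothetical.

```c
#include <stdint.h>

/* Raw coefficient pairs copied from the table above. */
static const uint16_t adx_fixed_coefficients[4][2] = {
    {0x0000, 0x0000},
    {0x0F00, 0x0000},
    {0x1CC0, 0xF300},
    {0x1880, 0xF240},
};

/* ASSUMPTION: each value is a signed 16-bit number with 12 fractional bits.
 * This interpretation is not stated in the text. */
static double adx_fixed_coefficient(unsigned set, unsigned index)
{
    int32_t raw = adx_fixed_coefficients[set & 3][index & 1];
    if (raw >= 0x8000)
        raw -= 0x10000;          /* reinterpret as two's-complement 16-bit */
    return raw / 4096.0;
}
```

Under that assumption, set 2 works out to 1.796875 and -0.8125, so the second coefficient is negative just as it is in the computed coefficients used for standard CRI ADX.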

Sample format

CRI ADX encoded audio data is broken into a series of 'blocks', each containing data for only one channel. The blocks are then laid out in 'frames', each of which consists of one block from every channel in ascending order. For example, a stereo (2-channel) stream would consist of Frame 1: left channel block, right channel block; Frame 2: left, right; etc. Blocks are almost always 18 bytes in size and contain 4-bit samples, though other sizes are technically possible. An example of such a block looks like this:

    Byte offset   Contents
    0-1           Predictor/Scale
    2-17          32 4-bit samples

The predictor index is a 3-bit integer that specifies which prediction coefficient set should be used to decode that block, while the scale is a 13-bit unsigned integer (big-endian like the header) which is essentially the amplification of all the samples in that block. The samples in the block must be decoded in bit-stream order, that is, from the most significant bit downwards. For example, when the sample size is 4 bits:

    Bit        7 6 5 4         3 2 1 0
    Contents   First sample    Second sample

The bits within each sample are not reversed. Each sample is signed, so for this example the value can range between -8 and +7 (which will be multiplied by the scale during decoding). Although any bit depth between 1 and 255 is made possible by the header, it is unlikely that one-bit samples would ever occur, as they can only represent the values {0, 1}, {-1, 0} or {-1, 1}, none of which is particularly useful for encoding music.
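To make the block layout concrete, here is a minimal sketch that unpacks one 18-byte block of 4-bit samples. It assumes the 3-bit predictor index occupies the top bits of the first 16-bit word and the 13-bit scale the remaining bits, which is one reading of the "Predictor/Scale" layout rather than a documented fact. The type and function names are hypothetical.

```c
#include <stdint.h>

#define ADX_BLOCK_SIZE        18                        /* the usual block size */
#define ADX_SAMPLES_PER_BLOCK ((ADX_BLOCK_SIZE - 2) * 2) /* two 4-bit samples per byte */

typedef struct {
    unsigned predictor;                      /* 3-bit coefficient set index */
    unsigned scale;                          /* 13-bit unsigned scale */
    int8_t   samples[ADX_SAMPLES_PER_BLOCK]; /* signed errors in -8..+7 */
} adx_block;

/* Convert a 4-bit two's-complement nibble to a signed value. */
static int8_t adx_nibble_to_signed(unsigned nibble)
{
    return (int8_t)(nibble >= 8 ? (int)nibble - 16 : (int)nibble);
}

static void adx_unpack_block(const uint8_t block[ADX_BLOCK_SIZE], adx_block *out)
{
    /* ASSUMPTION: predictor in the top 3 bits, scale in the low 13 bits. */
    unsigned word = ((unsigned)block[0] << 8) | block[1];
    out->predictor = word >> 13;
    out->scale     = word & 0x1FFF;

    for (int i = 0; i < ADX_BLOCK_SIZE - 2; ++i) {
        uint8_t byte = block[2 + i];
        /* Bits 7-4 hold the first sample, bits 3-0 the second. */
        out->samples[2 * i]     = adx_nibble_to_signed(byte >> 4);
        out->samples[2 * i + 1] = adx_nibble_to_signed(byte & 0x0F);
    }
}
```

For a block beginning `0x40 0x05 0x7F 0x80`, this yields predictor 2, scale 5 and the first four samples 7, -1, -8 and 0.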

CRI ADX decoding

An encoder for ADX can also be built by reversing the steps of the decoder. The code samples are written in C99.

Before a 'standard' CRI ADX can be either encoded or decoded, the set of prediction coefficients must be calculated. This is generally best done in the initialisation stage:

    #include <math.h>

    #ifndef M_PI
    #define M_PI acos(-1.0)
    #endif

    double a, b, c;
    a = sqrt(2.0) - cos(2.0 * M_PI * ((double)adx_header->highpass_frequency / adx_header->sample_rate));
    b = sqrt(2.0) - 1.0;
    c = (a - sqrt((a + b) * (a - b))) / b; // (a+b)*(a-b) = a*a-b*b, however the simpler formula loses accuracy in floating point

    // double coefficient[2];
    coefficient[0] = c * 2.0;
    coefficient[1] = -(c * c);

This code calculates the coefficients for predicting the current sample from the 2 previous samples. Once the decoder knows these coefficients, it can start decoding the stream:

    static int32_t*      past_samples;     // Previously decoded samples from each channel, zeroed at start (size = 2*channel_count)
    static uint_fast32_t sample_index = 0; // sample_index is the index of sample set that needs to be decoded next
    static ADX_header*   adx_header;

    // buffer is where the decoded samples will be put
    // samples_needed states how many sample 'sets' (one sample from every channel) need to be decoded to fill the buffer
    // looping_enabled is a boolean flag to control use of the built-in loop
    // Returns the number of sample 'sets' in the buffer that could not be filled (EOS)
    unsigned decode_adx_standard(int16_t* buffer, unsigned samples_needed, bool looping_enabled)
    {
        unsigned const samples_per_block = (adx_header->block_size - 2) * 8 / adx_header->sample_bitdepth;
        int16_t scale[adx_header->channel_count];

        if (looping_enabled && !adx_header->loop_enabled)
            looping_enabled = false;

        // Loop until the requested number of samples are decoded, or the end of file is reached
        while (samples_needed > 0 && sample_index < adx_header->total_samples)
        {
            // Calculate the number of samples that are left to be decoded in the current block
            unsigned sample_offset = sample_index % samples_per_block;
            unsigned samples_can_get = samples_per_block - sample_offset;

            // Clamp the samples we can get during this run if they won't fit in the buffer
            if (samples_can_get > samples_needed)
                samples_can_get = samples_needed;

            // Clamp the number of samples to be acquired if the stream isn't long enough or the loop trigger is nearby
            if (looping_enabled && sample_index + samples_can_get > adx_header->loop_end_index)
                samples_can_get = adx_header->loop_end_index - sample_index;
            else if (sample_index + samples_can_get > adx_header->total_samples)
                samples_can_get = adx_header->total_samples - sample_index;

            // Calculate the bit address of the start of the frame that sample_index resides in and record that location
            unsigned long started_at = (adx_header->copyright_offset + 4 +
                                        sample_index / samples_per_block * adx_header->block_size * adx_header->channel_count) * 8;

            // Read the scale values from the start of each block in this frame
            for (unsigned i = 0; i < adx_header->channel_count; ++i)
            {
                bitstream_seek(started_at + adx_header->block_size * i * 8);
                scale[i] = ntohs(bitstream_read(16));
            }

            // Pre-calculate the stop value for sample_offset
            unsigned sample_endoffset = sample_offset + samples_can_get;

            // Save the bitstream address of the first sample immediately after the scale in the first block of the frame
            started_at += 16;
            while (sample_offset < sample_endoffset)
            {
                for (unsigned i = 0; i < adx_header->channel_count; ++i)
                {
                    // Predict the next sample
                    double sample_prediction = coefficient[0] * past_samples[i*2 + 0] + coefficient[1] * past_samples[i*2 + 1];

                    // Seek to the sample offset, read and sign extend it to a 32bit integer
                    // Implementing sign extension is left as an exercise for the reader
                    // The sign extension will also need to include an endian adjustment if there are more than 8 bits
                    bitstream_seek(started_at + adx_header->sample_bitdepth * sample_offset +
                                   adx_header->block_size * 8 * i);
                    int_fast32_t sample_error = bitstream_read(adx_header->sample_bitdepth);
                    sample_error = sign_extend(sample_error, adx_header->sample_bitdepth);

                    // Scale the error correction value
                    sample_error *= scale[i];

                    // Calculate the sample by combining the prediction with the error correction
                    int_fast32_t sample = sample_error + (int_fast32_t)sample_prediction;

                    // Update the past samples with the newer sample
                    past_samples[i*2 + 1] = past_samples[i*2 + 0];
                    past_samples[i*2 + 0] = sample;

                    // Clamp the decoded sample to the valid range for a 16bit integer
                    if (sample > 32767)
                        sample = 32767;
                    else if (sample < -32768)
                        sample = -32768;

                    // Save the sample to the buffer then advance one place
                    *buffer++ = sample;
                }
                ++sample_offset;  // We've decoded one sample from every block, advance block offset by 1
                ++sample_index;   // This also means we're one sample further into the stream
                --samples_needed; // And so there is one less set of samples that need to be decoded
            }

            // Check if we hit the loop end marker, if we did we need to jump to the loop start
            if (looping_enabled && sample_index == adx_header->loop_end_index)
                sample_index = adx_header->loop_start_index;
        }

        return samples_needed;
    }

Most of the above should be straightforward C code. The 'ADX_header' pointer refers to the data extracted from the header as outlined earlier; it is assumed to have already been converted to host byte order. This implementation is not intended to be optimal, and external concerns such as the specific method of sign extension and the method of acquiring a bitstream from a file or network source have been ignored. Once it completes, there will be samples_needed sets (pairs, if the stream is stereo) of samples in the output buffer. The decoded samples will be in host-endian standard interleaved PCM format, i.e. left 16-bit, right 16-bit, left, right, etc. Finally, if looping is not enabled, or not supported, the function will return the number of sample spaces that were not used in the buffer. The caller can test whether this value is non-zero to detect the end of the stream, and drop the unused spaces or write silence into them if necessary.
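The decoder listing leaves `sign_extend` undefined. One possible implementation, given only as a sketch, is shown below. It handles widths of 1 to 32 bits and does not perform the endian adjustment that the listing's comments mention for samples wider than 8 bits.

```c
#include <stdint.h>

/* Interpret the low `bits` bits of `value` as a two's-complement number.
 * Widths above 32 are treated as 32; a width of 0 yields 0. */
static int_fast32_t sign_extend(uint_fast32_t value, unsigned bits)
{
    if (bits == 0)
        return 0;
    if (bits > 32)
        bits = 32;

    uint_fast32_t sign_bit = (uint_fast32_t)1 << (bits - 1);
    uint_fast32_t mask     = sign_bit | (sign_bit - 1);

    value &= mask; /* discard anything above the sample's own bits */

    /* Flipping the sign bit and then subtracting it maps the range
     * [sign_bit, mask] onto the negative numbers. */
    return (int_fast32_t)((int64_t)(value ^ sign_bit) - (int64_t)sign_bit);
}
```

With 4-bit samples, for instance, the raw values 0x7, 0x8 and 0xF become 7, -8 and -1 respectively.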

Encryption

CRI ADX supports a simple encryption scheme which XORs values from a linear congruential pseudorandom number generator with the block scale values. This method is computationally inexpensive to decrypt (in keeping with CRI ADX's real-time decoding) yet renders the encrypted files unusable without the key. The encryption is active when the "Flags" value in the header is 0x08. As XOR is symmetric, the same method is used to decrypt as to encrypt. The encryption key is a set of three 16-bit values: the multiplier, increment, and start values for the linear congruential generator (the modulus is 0x8000 to keep the values in the 15-bit range of valid block scales). Typically, all ADX files from a single game will use the same key.
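The scheme can be sketched as follows. The text does not say exactly when the generator advances, so this example assumes one step per block in stream order, with the start value applied to the first block; any blocks the real scheme leaves unencrypted are not modelled. The names `adx_key`, `adx_lcg_next` and `adx_xor_scales` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key layout: the three 16-bit values named in the text. */
typedef struct {
    uint16_t start;
    uint16_t multiplier;
    uint16_t increment;
} adx_key;

/* One step of the linear congruential generator, modulus 0x8000. */
static uint16_t adx_lcg_next(const adx_key *key, uint16_t state)
{
    return (uint16_t)(((uint32_t)state * key->multiplier + key->increment) & 0x7FFF);
}

/* XOR every block scale with successive generator values, in place.
 * ASSUMPTION: the generator advances once per block. Because XOR is
 * symmetric, calling this twice with the same key restores the input. */
static void adx_xor_scales(const adx_key *key, uint16_t *scales, size_t count)
{
    uint16_t state = key->start & 0x7FFF;
    for (size_t i = 0; i < count; ++i) {
        scales[i] ^= state;
        state = adx_lcg_next(key, state);
    }
}
```

Here `scales` stands for the 16-bit words read from the start of each block, already converted to host byte order. The same call serves for both encryption and decryption.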

The encryption method is vulnerable to known-plaintext attacks. If an unencrypted version of the same audio is known, the random number stream can easily be retrieved and the key parameters determined from it, rendering every CRI ADX file encrypted with that same key decryptable. The encryption method attempts to make this more difficult by not encrypting silent blocks (those with all sample nybbles equal to 0), as their scale is known to be 0.
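A minimal sketch of such an attack is given below. It assumes that three consecutive generator outputs have already been obtained by XORing encrypted scale values with the corresponding known plaintext scales. Because the modulus is only 0x8000, the multiplier can simply be brute-forced. Only the low 15 bits of the multiplier and increment affect the output, so that is all that can be recovered. The names `adx_lcg_params` and `adx_recover_lcg` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t multiplier; /* modulo 0x8000 */
    uint16_t increment;  /* modulo 0x8000 */
} adx_lcg_params;

/* List every (multiplier, increment) pair that maps x0 to x1 and x1 to x2.
 * Returns the number of matches found; at most `capacity` are stored in `out`. */
static size_t adx_recover_lcg(uint16_t x0, uint16_t x1, uint16_t x2,
                              adx_lcg_params *out, size_t capacity)
{
    size_t found = 0;
    uint32_t a = x0 & 0x7FFF, b = x1 & 0x7FFF, c = x2 & 0x7FFF;

    for (uint32_t m = 0; m < 0x8000; ++m) {
        /* Choose the increment that makes x0 -> x1 hold for this multiplier
         * (unsigned wrap-around is harmless: 0x8000 divides 2^32)... */
        uint32_t inc = (b - m * a) & 0x7FFF;
        /* ...then keep the pair only if it also explains x1 -> x2. */
        if (((m * b + inc) & 0x7FFF) == c) {
            if (found < capacity) {
                out[found].multiplier = (uint16_t)m;
                out[found].increment  = (uint16_t)inc;
            }
            ++found;
        }
    }
    return found;
}
```

More than one pair can match when the difference between x0 and x1 is even, in which case further known outputs are needed to narrow the candidates down.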

Even if the encrypted CRI ADX is the only sample available, it is possible to determine a key by assuming that the scale values of the decrypted CRI ADX must fall within a "low range". This method does not necessarily find the key used to encrypt the file, however. While it can always determine keys that produce an apparently correct output, errors may exist undetected. This is due to the increasingly random distribution of the lower bits of the scale values, which becomes impossible to separate from the randomness added by the encryption.

AHX decoding

AHX is an implementation of MPEG-2 audio, and its decoding method is basically the same as the standard's, making it possible to simply demultiplex the stream from the ADX container and feed it through a standard MPEG audio decoder such as mpg123. The CRI ADX header's "sample rate" and "total samples" fields usually match the original audio, but other fields such as the block size and sample bit depth will usually be zero, as will the looping fields.

References

  1. "imgur.com". Imgur. Retrieved 10 May 2023.