Auralization is a procedure designed to model and simulate the experience of acoustic phenomena rendered as a soundfield in a virtualized space. It is useful for configuring the soundscape of architectural structures, concert venues, and public spaces, as well as for creating coherent sound environments within virtual immersion systems.
The English term auralization was first used by Kleiner et al. in an article in the Journal of the Audio Engineering Society (AES) in 1991. [1]
Increases in computational power enabled the development of the first acoustic simulation software toward the end of the 1960s. [2]
Auralizations are experienced through systems that render virtual acoustic models: acoustic events recorded 'dry' (typically in an anechoic chamber) are projected into a virtual model of an acoustic space, the characteristics of which are determined by sampling its impulse response (IR). Once the IR has been determined, the soundfield that results in the target environment is simulated by convolving the dry recording with the IR.
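In standard notation (the symbols here are generic rather than drawn from a particular reference), if s(t) is the dry source signal and h(t) is the impulse response of the space, the auralized signal y(t) is

\[ y(t) = (s * h)(t) = \int_{-\infty}^{\infty} s(\tau)\, h(t - \tau)\, d\tau . \]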
The resulting sound is heard as it would be if emitted in that acoustic space.
For auralizations to be perceived as realistic, it is critical to emulate human hearing, taking into account the position and orientation of the listener's head with respect to the sound sources. For the IR data to be convolved convincingly, the acoustic response is captured either with a dummy head carrying a microphone at each ear, which records an emulation of the sound arriving at the locations of human ears, or with an ambisonic microphone array whose output is mixed down to a binaural signal. Head-related transfer function (HRTF) datasets can simplify the process: a monaural IR is measured or simulated, and the audio content is convolved with it to place the material in the target acoustic space; when the experience is rendered, the transfer function corresponding to the orientation of the listener's head is applied to simulate the corresponding spatial emanation of the sound.
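A minimal sketch of the binaural convolution step is given below, assuming a dry mono recording and a two-channel binaural room impulse response stored on disk; the file names and the choice of the soundfile and scipy libraries are illustrative assumptions, not a prescribed toolchain.

```python
# Minimal sketch: convolve a dry mono recording with a two-channel binaural
# room impulse response to produce a binaural auralization.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, fs = sf.read("dry_source.wav")          # anechoic ("dry") mono recording
brir, fs_ir = sf.read("binaural_rir.wav")    # two-channel binaural room impulse response
assert fs == fs_ir, "source and impulse response must share a sample rate"

left = fftconvolve(dry, brir[:, 0])          # sound arriving at the left ear
right = fftconvolve(dry, brir[:, 1])         # sound arriving at the right ear

binaural = np.stack([left, right], axis=1)
binaural /= np.max(np.abs(binaural))         # normalize to avoid clipping
sf.write("auralization.wav", binaural, fs)
```

Played back over headphones, the two convolved channels approximate the sound that would reach each ear in the measured space.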
In mathematics, convolution is a mathematical operation on two functions that produces a third function that expresses how the shape of one is modified by the other. The term convolution refers to both the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result. The integral is evaluated for all values of shift, producing the convolution function.
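A tiny numerical check, with illustrative values, of the point that the result does not depend on which function is reflected and shifted:

```python
# Discrete convolution is commutative: np.convolve(x, h) equals np.convolve(h, x).
import numpy as np

x = np.array([1.0, 2.0, 3.0])   # e.g. a short dry signal
h = np.array([0.5, 0.25])       # e.g. a two-tap impulse response

print(np.convolve(x, h))        # [0.5  1.25 2.   0.75]
print(np.convolve(h, x))        # identical result
print(np.allclose(np.convolve(x, h), np.convolve(h, x)))  # True
```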
A microphone array is any number of microphones operating in tandem; such arrays have many applications.
A head-related transfer function (HRTF), also known as anatomical transfer function (ATF), is a response that characterizes how an ear receives a sound from a point in space. As sound strikes the listener, the size and shape of the head, ears, ear canal, density of the head, size and shape of nasal and oral cavities, all transform the sound and affect how it is perceived, boosting some frequencies and attenuating others. Generally speaking, the HRTF boosts frequencies from 2–5 kHz with a primary resonance of +17 dB at 2,700 Hz. But the response curve is more complex than a single bump, affects a broad frequency spectrum, and varies significantly from person to person.
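A common formalization (the symbols here are illustrative) defines each ear's HRTF as the frequency-domain ratio of the sound pressure measured at the ear to the free-field pressure that would exist at the centre of the head with the listener absent, as a function of the source direction:

\[ H_{L,R}(f, \theta, \varphi) = \frac{P_{L,R}(f, \theta, \varphi)}{P_{0}(f)} , \]

where \( \theta \) and \( \varphi \) are the azimuth and elevation of the source.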
Ambisonics is a full-sphere surround sound format: in addition to the horizontal plane, it covers sound sources above and below the listener.
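As an illustration of how a single source direction, including height, is captured, the sketch below encodes a mono signal into classic first-order B-format; the 1/√2 gain on the W channel follows the traditional FuMa convention, and the function name and example values are assumptions for illustration.

```python
# Sketch of classic first-order (B-format) Ambisonics encoding of a mono source.
import numpy as np

def encode_bformat(signal, azimuth, elevation):
    """Return the W, X, Y, Z channels for a mono source at the given direction (radians)."""
    w = signal / np.sqrt(2.0)                          # omnidirectional component
    x = signal * np.cos(azimuth) * np.cos(elevation)   # front-back
    y = signal * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = signal * np.sin(elevation)                     # up-down: the full-sphere height component
    return np.stack([w, x, y, z], axis=0)

# Example: a 440 Hz tone placed 45 degrees to the left and 30 degrees above the listener.
fs = 48000
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
bformat = encode_bformat(tone, np.radians(45), np.radians(30))
```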
Surround sound is a technique for enriching the fidelity and depth of sound reproduction by using multiple audio channels from speakers that surround the listener. Its first application was in movie theaters. Prior to surround sound, theater sound systems commonly had three screen channels of sound that played from three loudspeakers located in front of the audience. Surround sound adds one or more channels from loudspeakers to the side or behind the listener that are able to create the sensation of sound coming from any horizontal direction around the listener.
An echo chamber is a hollow enclosure used to produce reverberation, usually for recording purposes. For example, the producers of a television or radio program might wish to produce the aural illusion that a conversation is taking place in a large room or a cave; these effects can be accomplished by playing the recording of the conversation inside an echo chamber, with an accompanying microphone to catch the reverberation. Nowadays, effects units are more widely used to create such effects, but echo chambers are still used today, such as the famous echo chambers at Capitol Studios.
Sound localization is a listener's ability to identify the location or origin of a detected sound in direction and distance.
The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.
A motion simulator or motion platform is a mechanism that creates the feeling of being in a real motion environment. In a simulator, the movement is synchronised with a visual display of the outside world (OTW) scene. Motion platforms can provide movement in all six degrees of freedom (DOF) that can be experienced by an object that is free to move, such as an aircraft or spacecraft: the three rotational degrees of freedom and the three translational or linear degrees of freedom.
The Soundfield microphone is an audio microphone composed of four closely spaced subcardioid or cardioid (unidirectional) microphone capsules arranged in a tetrahedron. It was invented by Michael Gerzon and Peter Craven, and is a part of, but not exclusive to, Ambisonics, a surround sound technology. It can function as a mono, stereo or surround sound microphone, optionally including height information.
Virtual acoustic space (VAS), also known as virtual auditory space, is a technique in which sounds presented over headphones appear to originate from any desired direction in space. The illusion of a virtual sound source outside the listener's head is created.
Wave field synthesis (WFS) is a spatial audio rendering technique, characterized by creation of virtual acoustic environments. It produces artificial wavefronts synthesized by a large number of individually driven loudspeakers. Such wavefronts seem to originate from a virtual starting point, the virtual source or notional source. Contrary to traditional spatialization techniques such as stereo or surround sound, the localization of virtual sources in WFS does not depend on or change with the listener's position.
Archaeoacoustics is a sub-field of archaeology and acoustics which studies the relationship between people and sound throughout history. It is an interdisciplinary field with methodological contributions from acoustics, archaeology, and computer simulation, and is broadly related to topics within cultural anthropology such as experimental archaeology and ethnomusicology. Since sound plays a part in many cultural practices, applying acoustical methods to the study of archaeological sites and artifacts may reveal new information about the civilizations examined.
Ambiophonics is a method in the public domain that employs digital signal processing (DSP) and two loudspeakers directly in front of the listener in order to improve reproduction of stereophonic and 5.1 surround sound for music, movies, and games in home theaters, gaming PCs, workstations, or studio monitoring applications. First implemented using mechanical means in 1986, today a number of hardware and VST plug-in makers offer Ambiophonic DSP. Ambiophonics eliminates crosstalk inherent in the conventional stereo triangle speaker placement, and thereby generates a speaker-binaural soundfield that emulates headphone-binaural sound, and creates for the listener improved perception of reality of recorded auditory scenes. A second speaker pair can be added in back in order to enable 360° surround sound reproduction. Additional surround speakers may be used for hall ambience, including height, if desired.
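A heavily simplified sketch of recursive crosstalk reduction in the spirit of Ambiophonic DSP is given below; the delay, attenuation, and number of passes are placeholder values, and real implementations such as RACE are considerably more refined.

```python
# Illustrative recursive crosstalk reduction: each pass adds an inverted,
# delayed, attenuated copy of one channel to the opposite speaker feed to
# counter the acoustic leakage around the listener's head. All parameter
# values are placeholders, not Ambiophonics specifications.
import numpy as np

def delay(x, n):
    """Delay x by n samples, zero-padded (no wrap-around)."""
    y = np.zeros_like(x)
    y[n:] = x[:-n]
    return y

def reduce_crosstalk(left, right, fs, delay_us=90.0, atten_db=2.0, passes=12):
    d = max(1, int(round(fs * delay_us / 1e6)))   # roughly an interaural delay, in samples
    g = 10.0 ** (-atten_db / 20.0)                # per-pass attenuation
    out_l = left.astype(float)
    out_r = right.astype(float)
    xl, xr = out_l.copy(), out_r.copy()
    for _ in range(passes):
        xl, xr = -g * delay(xr, d), -g * delay(xl, d)
        out_l = out_l + xl
        out_r = out_r + xr
    return out_l, out_r
```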
Crystal River Engineering Inc. was an American technology company best known for their pioneering work in HRTF based real-time binaural, or 3D sound processing hardware and software. The company was founded in 1989 by Scott Foster after he received a contract from NASA to create the audio component of VIEW, a virtual reality based training simulator for astronauts. Crystal River Engineering was acquired by Aureal Semiconductor in 1996.
3D sound localization refers to an acoustic technology that is used to locate the source of a sound in a three-dimensional space. The source location is usually determined from the direction of the incoming sound waves and the distance between the source and the sensors. It involves the design of the sensor arrangement and the associated signal processing techniques.
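One common building block for estimating source direction is the time difference of arrival (TDOA) between microphone pairs. The sketch below uses the generalized cross-correlation with phase transform (GCC-PHAT), a standard technique; the function and its parameters are written here for illustration only.

```python
# GCC-PHAT sketch: estimate the time difference of arrival between two
# microphone signals; combining several pairwise TDOAs constrains the
# source direction and, with known geometry, its 3-D position.
import numpy as np

def gcc_phat(sig_a, sig_b, fs):
    """Return the estimated time difference of arrival between two signals, in seconds."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n)
    B = np.fft.rfft(sig_b, n)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-12                                    # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(R, n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))    # reorder lags to -max_shift .. max_shift-1
    lag = np.argmax(np.abs(cc)) - max_shift                   # best-matching lag in samples
    return lag / fs
```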
Perceptual-based 3D sound localization is the application of knowledge of the human auditory system to develop 3D sound localization technology.
3D sound reconstruction is the application of reconstruction techniques to 3D sound localization technology. These methods of reconstructing three-dimensional sound are used to recreate sounds that match natural environments and provide spatial cues of the sound source. They also see applications in creating 3D visualizations of a sound field that include physical aspects of sound waves such as direction, pressure, and intensity. This technology is used in entertainment to reproduce a live performance through computer speakers, and in military applications to determine the location of sound sources. Reconstructing sound fields is also applicable to medical imaging, for example to measure points in an ultrasound field.
3D sound is most commonly defined as the everyday human experience of sound: sounds arrive at the ears from every direction and from varying distances, which together produce the three-dimensional aural image humans hear. Scientists and engineers who work with 3D sound aim to accurately synthesize this complexity of real-world sounds.
Cinematic virtual reality (cine-VR) is an immersive experience in which the audience can look around in 360 degrees while hearing spatialized audio specifically designed to reinforce the belief that the audience is actually in the virtual environment rather than watching it on a two-dimensional screen. Cine-VR differs from traditional virtual reality, which uses computer-generated worlds and characters more akin to interactive game engines; cine-VR uses live images captured through a camera, which makes it more like film.