Head-related transfer function

Figure: HRTF filtering effect

A head-related transfer function (HRTF) is a response that characterizes how an ear receives a sound from a point in space. As a sound strikes the listener, the size and shape of the head, ears, and ear canal, the density of the head, and the size and shape of the nasal and oral cavities all transform the sound and affect how it is perceived, boosting some frequencies and attenuating others. Generally speaking, the HRTF boosts frequencies from 2–5 kHz with a primary resonance of +17 dB at 2,700 Hz. But the response curve is more complex than a single bump, affects a broad frequency spectrum, and varies significantly from person to person.


A pair of HRTFs for two ears can be used to synthesize a binaural sound that seems to come from a particular point in space. It is a transfer function, describing how a sound from a specific point will arrive at the ear (generally at the outer end of the auditory canal). Some consumer home entertainment products designed to reproduce surround sound from stereo (two-speaker) headphones use HRTFs. Some forms of HRTF processing have also been included in computer software to simulate surround sound playback from loudspeakers.

Sound localization

Humans have just two ears, but can locate sounds in three dimensions – in range (distance), in direction above and below (elevation), in front and to the rear, as well as to either side (azimuth). This is possible because the brain, inner ear, and the external ears (pinna) work together to make inferences about location. This ability to localize sound sources may have developed in humans and ancestors as an evolutionary necessity since the eyes can only see a fraction of the world around a viewer, and vision is hampered in darkness, while the ability to localize a sound source works in all directions, to varying accuracy, [1] regardless of the surrounding light.

Humans estimate the location of a source by taking cues derived from one ear (monaural cues), and by comparing cues received at both ears (difference cues or binaural cues). Among the difference cues are time differences of arrival and intensity differences. The monaural cues come from the interaction between the sound source and the human anatomy, in which the original source sound is modified before it enters the ear canal for processing by the auditory system. These modifications encode the source location and may be captured via an impulse response which relates the source location and the ear location. This impulse response is termed the head-related impulse response (HRIR). Convolution of an arbitrary source sound with the HRIR converts the sound to that which would have been heard by the listener if it had been played at the source location, with the listener's ear at the receiver location. HRIRs have been used to produce virtual surround sound. [2] [3] [ example needed ]
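
As an illustration of this convolution step, the following minimal sketch (Python with NumPy/SciPy) renders a mono signal binaurally by convolving it with a left-ear and a right-ear HRIR. The sample rate, the tone used as a source, and the two toy HRIRs are placeholder assumptions rather than measured data; in practice the HRIRs would come from a measured data set, one pair per source direction.

```python
# Minimal sketch of binaural rendering by HRIR convolution.
# The HRIRs below are synthetic placeholders; in practice they would come
# from a measured data set (e.g. loaded from a SOFA file).
import numpy as np
from scipy.signal import fftconvolve

fs = 44100                                   # sample rate in Hz (assumed)
t = np.arange(fs) / fs                       # 1 second of audio
source = np.sin(2 * np.pi * 440 * t)         # mono source: a 440 Hz tone

# Toy HRIRs: a delayed, attenuated impulse per ear stands in for the
# measured head-related impulse responses h_L(t) and h_R(t).
hrir_left = np.zeros(256)
hrir_left[20] = 1.0
hrir_right = np.zeros(256)
hrir_right[45] = 0.6

# Convolving the source with each HRIR gives the signal at each ear.
x_left = fftconvolve(source, hrir_left)
x_right = fftconvolve(source, hrir_right)

binaural = np.stack([x_left, x_right], axis=1)   # 2-channel (L, R) output
```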

The HRTF is the Fourier transform of the HRIR.

The HRTFs for the left and right ear (expressed above as HRIRs) describe the filtering of a sound source x(t) before it is perceived at the left and right ears as xL(t) and xR(t), respectively.

The HRTF can also be described as the modifications to a sound from a direction in free air to the sound as it arrives at the eardrum. These modifications arise from the shape of the listener's outer ear, the shape of the listener's head and body, the acoustic characteristics of the space in which the sound is played, and so on. All these characteristics will influence how (or whether) a listener can accurately tell what direction a sound is coming from.

In the AES69-2015 standard, [4] the Audio Engineering Society (AES) has defined the SOFA file format for storing spatially oriented acoustic data like head-related transfer functions (HRTFs). SOFA software libraries and files are collected at the Sofa Conventions website. [5]

How HRTF works

The associated mechanism varies between individuals, as their head and ear shapes differ.

HRTF describes how a given sound wave input (parameterized as frequency and source location) is filtered by the diffraction and reflection properties of the head, pinna, and torso, before the sound reaches the transduction machinery of the eardrum and inner ear (see auditory system). Biologically, the source-location-specific prefiltering effects of these external structures aid in the neural determination of source location, particularly the determination of the source's elevation. [6]

Technical derivation

Figure: A sample frequency response of the two ears, left ear XL(f) (green) and right ear XR(f) (blue), for a sound source located to the upper front.
Figure: An example of how the HRTF tilt with azimuth, taken from a point of reference, is derived.

Linear systems analysis defines the transfer function as the complex ratio between the output signal spectrum and the input signal spectrum as a function of frequency. Blauert (1974; cited in Blauert, 1981) initially defined the transfer function as the free-field transfer function (FFTF). Other terms include free-field to eardrum transfer function and the pressure transformation from the free-field to the eardrum. Less specific descriptions include the pinna transfer function, the outer ear transfer function, the pinna response, or directional transfer function (DTF).

The transfer function H(f) of any linear time-invariant system at frequency f is:

H(f) = Output(f) / Input(f)

One method used to obtain the HRTF from a given source location is therefore to measure the head-related impulse response (HRIR), h(t), at the eardrum for the impulse δ(t) placed at the source. The HRTF H(f) is the Fourier transform of the HRIR h(t).
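
A minimal sketch of that relationship, assuming the HRIR is already available as a NumPy array (the sample rate and placeholder impulse response are assumptions):

```python
# The HRTF H(f) as the Fourier transform of a measured HRIR h(t).
import numpy as np

fs = 48000                        # sample rate of the measurement (assumed)
hrir = np.zeros(512)              # placeholder HRIR; a real one is measured
hrir[32] = 1.0                    # at the eardrum for an impulse at the source

hrtf = np.fft.rfft(hrir)                         # complex HRTF H(f)
freqs = np.fft.rfftfreq(len(hrir), d=1 / fs)     # frequency axis in Hz

magnitude_db = 20 * np.log10(np.abs(hrtf) + 1e-12)   # magnitude response
phase = np.unwrap(np.angle(hrtf))                     # phase response
```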

Even when measured for a "dummy head" of idealized geometry, HRTFs are complicated functions of frequency and the three spatial variables. For distances greater than 1 m from the head, however, the HRTF can be said to attenuate inversely with range. It is this far-field HRTF, H(f, θ, φ), that has most often been measured. At closer range, the difference in level observed between the ears can grow quite large, even in the low-frequency region within which negligible level differences are observed in the far field.

HRTFs are typically measured in an anechoic chamber to minimize the influence of early reflections and reverberation on the measured response. HRTFs are measured at small increments of θ such as 15° or 30° in the horizontal plane, with interpolation used to synthesize HRTFs for arbitrary positions of θ. Even with small increments, however, interpolation can lead to front-back confusion, and optimizing the interpolation procedure is an active area of research.
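
The crudest interpolation scheme is a linear cross-fade between the two nearest measured azimuths, sketched below under the assumption that the HRIRs are stored in a dictionary keyed by azimuth. Such time-domain cross-fading blurs the interaural delay, which is one reason naive interpolation can contribute to localization errors.

```python
import numpy as np

def interpolate_hrir(hrirs_by_azimuth, target_az):
    """Linearly cross-fade between the two nearest measured azimuths.

    hrirs_by_azimuth: dict mapping azimuth in degrees (0-359) to a 1-D
    NumPy HRIR array of equal length. This layout is an assumption for
    the sketch, not a standard storage format.
    """
    azimuths = np.array(sorted(hrirs_by_azimuth))
    target_az = target_az % 360
    below = azimuths[azimuths <= target_az]
    above = azimuths[azimuths > target_az]
    # Wrap around 0°/360° when the target lies outside the measured range.
    lower = below.max() if below.size else azimuths[-1] - 360
    upper = above.min() if above.size else azimuths[0] + 360
    w = (target_az - lower) / (upper - lower)     # cross-fade weight in [0, 1)
    h_lo = hrirs_by_azimuth[lower % 360]
    h_hi = hrirs_by_azimuth[upper % 360]
    return (1 - w) * h_lo + w * h_hi

# Example: placeholder HRIRs measured every 30 degrees.
measured = {az: np.eye(1, 128, az // 10).ravel() for az in range(0, 360, 30)}
h_interp = interpolate_hrir(measured, 47.0)
```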

In order to maximize the signal-to-noise ratio (SNR) in a measured HRTF, it is important that the impulse being generated be of high volume. In practice, however, it can be difficult to generate impulses at high volumes and, if generated, they can be damaging to human ears, so it is more common for HRTFs to be directly calculated in the frequency domain using a frequency-swept sine wave or by using maximum length sequences. User fatigue is still a problem, however, highlighting the need for the ability to interpolate based on fewer measurements.
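
The sketch below illustrates the swept-sine approach (an exponential sweep followed by deconvolution with its inverse filter); the sweep length, frequency range, and the `recorded` signal are placeholder assumptions standing in for an actual playback-and-capture chain.

```python
# Sketch of an exponential (logarithmic) sine-sweep measurement and the
# deconvolution that recovers the impulse response, in place of a loud impulse.
import numpy as np
from scipy.signal import chirp, fftconvolve

fs = 48000
T = 5.0                                    # sweep duration in seconds (assumed)
t = np.arange(int(T * fs)) / fs
f0, f1 = 20.0, 20000.0                     # sweep range in Hz (assumed)

sweep = chirp(t, f0=f0, f1=f1, t1=T, method="logarithmic")

# Inverse filter: the time-reversed sweep with an amplitude envelope that
# compensates the sweep's falling (pink) spectrum.
rate = np.log(f1 / f0) / T
inverse = sweep[::-1] * np.exp(-t * rate)

# `recorded` stands in for the microphone signal captured at the ear while
# the sweep plays from the measurement loudspeaker; using the sweep itself
# here, the recovered response is simply an impulse.
recorded = sweep
impulse_response = fftconvolve(recorded, inverse)    # ≈ HRIR (after trimming)
hrtf = np.fft.rfft(impulse_response)
```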

The head-related transfer function is involved in resolving the cone of confusion, the set of source locations on a cone centered on the interaural axis for which the ITD and ILD are essentially identical. When a sound reaches the ear, it can travel straight into the ear canal, or it can first reflect off the pinna and enter the ear canal a fraction of a second later. Because the sound contains many frequencies, many copies of the signal enter the ear at different times, depending on frequency (through reflection and diffraction at the structures of the ear, whose dimensions interact differently with high and low frequencies). These copies overlap, and certain components are enhanced (where the phases of the signals match) while others are cancelled (where the phases do not match). Essentially, the brain looks for frequency notches in the signal that correspond to particular known directions of sound.[ citation needed ]

If another person's ears were substituted, the individual would not immediately be able to localize sound, as the patterns of enhancement and cancellation would be different from those patterns the person's auditory system is used to. However, after some weeks, the auditory system would adapt to the new head-related transfer function. [7] The inter-subject variability in the spectra of HRTFs has been studied through cluster analyses. [8]

To assess this variation between individual ears, we can limit our perspective to the degrees of freedom of the head and its relation to the spatial domain. In doing so, we set aside head tilt and other coordinate parameters that add complexity. For the purpose of calibration we are concerned only with the direction of the source relative to the ears, that is, with a specific set of degrees of freedom. Some of the ways in which an expression to calibrate the HRTF can be deduced are:

  1. Localization of sound in virtual auditory space [9]
  2. HRTF phase synthesis [10]
  3. HRTF magnitude synthesis [11]

Localization of sound in virtual auditory space

A basic assumption in the creation of a virtual auditory space is that if the acoustical waveforms present at a listener's eardrums are the same under headphones as in free field, then the listener's experience should also be the same.

Typically, sounds generated from headphones are perceived as originating from within the head. In the virtual auditory space, the headphones should be able to "externalize" the sound. Using the HRTF, sounds can be spatially positioned using the technique described below. [9]

Let x1(t) represent an electrical signal driving a loudspeaker and y1(t) represent the signal received by a microphone at the listener's eardrum. Similarly, let x2(t) represent the electrical signal driving a headphone and y2(t) represent the microphone response to that signal. The goal of the virtual auditory space is to choose x2(t) such that y2(t) = y1(t). Applying the Fourier transform to these signals, we come up with the following two equations:

Y1 = X1LFM, and
Y2 = X2HM,

where L is the transfer function of the loudspeaker in the free field, F is the HRTF, M is the microphone transfer function, and H is the headphone-to-eardrum transfer function. Setting Y1 = Y2, and solving for X2 yields

X2 = X1LF/H.

By observation, the desired transfer function is

T = LF/H.

Therefore, in theory, if x1(t) is passed through this filter and the resulting x2(t) is played on the headphones, it should produce the same signal at the eardrum. Since the filter applies only to a single ear, a second filter must be derived for the other ear. This process is repeated for many places in the virtual environment to create an array of head-related transfer functions for each position to be recreated, while ensuring that the sampling conditions satisfy the Nyquist criterion.
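
A frequency-domain sketch of this filtering step is given below. The responses L, F, and H are placeholder arrays (in practice they would be measured), and a real design would regularize the division wherever |H| becomes very small.

```python
# Frequency-domain sketch of the virtual auditory space filter T = L*F/H
# applied to a loudspeaker signal x1(t) to obtain the headphone signal x2(t).
import numpy as np

n_fft = 1024
fs = 48000
freqs = np.fft.rfftfreq(n_fft, d=1 / fs)

# Placeholder complex responses; in practice L, F, and H are measured for
# the specific loudspeaker, listener, and headphones.
L = np.ones_like(freqs, dtype=complex)          # loudspeaker response
F = np.exp(-1j * 2 * np.pi * freqs * 5e-4)      # toy HRTF: a pure 0.5 ms delay
H = np.full(len(freqs), 0.9, dtype=complex)     # headphone-to-eardrum response

T = L * F / H        # desired filter; real designs regularize this division
                     # wherever |H| becomes very small

x1 = np.random.default_rng(0).standard_normal(n_fft)   # loudspeaker drive signal
X1 = np.fft.rfft(x1)
x2 = np.fft.irfft(X1 * T, n=n_fft)               # headphone signal for one ear
```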

HRTF phase synthesis

Phase estimation is less reliable in the very low part of the frequency band, and in the upper frequencies the phase response is affected by the features of the pinna. Earlier studies also show that the HRTF phase response is mostly linear and that listeners are insensitive to the details of the interaural phase spectrum as long as the interaural time delay (ITD) of the combined low-frequency part of the waveform is maintained. Accordingly, the phase response of a subject's HRTF is modeled as a time delay that depends on the direction and elevation of the source. [10]

The ITD scaling factor is a function of the anthropometric features. For example, a training set of N subjects would consider each HRTF phase and describe a single ITD scaling factor as the average delay of the group. This computed scaling factor can estimate the time delay as a function of the direction and elevation for any given individual. Converting the time delay to a phase response for the left and the right ears is trivial.

The HRTF phase can be described by the ITD scaling factor. This is in turn quantified by the anthropometric data of a given individual taken as the source of reference. For a generic case we consider β as a sparse vector,

β = [β1, β2, …, βN]^T,

that represents the subject's anthropometric features as a linear superposition of the anthropometric features from the training data (y' = β^T X), and then apply the same sparse vector directly to the scaling vector H. We can write this task as a minimization problem, for a non-negative shrinkage parameter λ:

β = argmin_β { Σa (ya − Σn βn Xn,a)² + λ Σn |βn| }

From this, the ITD scaling factor value H' is estimated as:

H' = Σn βn Hn = β^T H,

where the ITD scaling factors for all persons in the dataset are stacked in a vector H ∈ R^N, so the value Hn corresponds to the scaling factor of the n-th person.
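
A hedged sketch of this estimation using scikit-learn's Lasso is shown below; the feature matrix, scaling factors, and the final delay-to-phase conversion are illustrative placeholders rather than the exact procedure of the cited work.

```python
# Sketch: estimate a new subject's ITD scaling factor from anthropometric
# features via a sparse non-negative (LASSO) fit, then turn the resulting
# delay into a linear phase term. All data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

N, A = 30, 8                           # N training subjects, A features each
X = rng.normal(size=(N, A))            # anthropometric features of training set
H = 1.0 + 0.1 * rng.normal(size=N)     # their ITD scaling factors

y_new = rng.normal(size=A)             # features of the new subject

# Find a sparse, non-negative beta with y_new ≈ beta^T X (L1-penalized).
lasso = Lasso(alpha=0.1, positive=True, fit_intercept=False)
lasso.fit(X.T, y_new)                  # columns of X.T are the training subjects
beta = lasso.coef_                     # sparse vector beta, length N

H_new = beta @ H                       # same weights applied to the scaling factors

# Converting a time delay tau derived from the scaling factor into a phase
# response is then a linear-phase term exp(-j*2*pi*f*tau).
tau = 4e-4 * H_new                     # example delay in seconds (placeholder)
freqs = np.fft.rfftfreq(512, d=1 / 48000)
linear_phase = np.exp(-1j * 2 * np.pi * freqs * tau)
```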

HRTF magnitude synthesis

We solve the above minimization problem using the least absolute shrinkage and selection operator (LASSO). We assume that the HRTFs are represented by the same relation as the anthropometric features. [11] Therefore, once we learn the sparse vector β from the anthropometric features, we apply it directly to the HRTF tensor data, and the subject's HRTF values H'd,k are given by:

H'd,k = Σn βn Hn,d,k,

where the HRTFs for each subject are described by a tensor of size D × K, with D the number of HRTF directions and K the number of frequency bins. All HRTFs of the training set are stacked in a tensor H ∈ R^(N×D×K), so the value Hn,d,k corresponds to the k-th frequency bin for the d-th HRTF direction of the n-th person, and H'd,k corresponds to the k-th frequency bin for the d-th HRTF direction of the synthesized HRTF.
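
Continuing the sketch above, applying the learned sparse weights to a stacked HRTF tensor is a single tensor contraction; the tensor dimensions and data below are synthetic placeholders.

```python
# Sketch: synthesize a subject's HRTF magnitudes by applying the sparse
# weights beta (learned from anthropometric features, as above) to the
# stacked HRTF tensor of the training set.
import numpy as np

rng = np.random.default_rng(1)

N, D, K = 30, 72, 257                   # subjects, directions, frequency bins
H_train = rng.random((N, D, K))         # stacked HRTF magnitudes of training set
beta = rng.random(N) * (rng.random(N) > 0.8)   # a sparse weight vector

# H'[d, k] = sum_n beta[n] * H_train[n, d, k]
H_new = np.tensordot(beta, H_train, axes=1)    # synthesized HRTF, shape (D, K)
```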

HRTF from geometry

Accumulation of HRTF data has made it possible for a computer program to infer an approximate HRTF from head geometry. Two programs are known to do so, both open-source: Mesh2HRTF, [12] which runs a physical simulation on a full 3D mesh of the head, and EAC, which uses a neural network trained on existing HRTFs and works from photos and other rough measurements. [13]

Recording and playback technology

Recordings processed via an HRTF that approximates the listener's own, such as in a computer gaming environment (see A3D, EAX, and OpenAL), can be heard through stereo headphones or speakers and interpreted as if they comprise sounds coming from all directions, rather than just two points on either side of the head. The perceived accuracy of the result depends on how closely the HRTF data set matches the characteristics of one's own ears, though a generic HRTF may be preferred to an accurate one measured from one's own ear. [14] Some vendors, such as Apple and Sony, offer a variety of HRTFs from which the user selects according to their ear shape. [15]

Windows 10 and above come with Microsoft Spatial Sound included, the same spatial audio framework used on Xbox One and Hololens 2. On a Windows PC or an Xbox One, the framework can use several different downstream audio processors, including Windows Sonic for Headphones, Dolby Atmos, and DTS Headphone:X, to apply an HRTF. The framework can render both fixed-position surround sound sources and dynamic "object" sources that can move in space. [16]

Apple similarly has Spatial Sound for its devices used with headphones produced by Apple or Beats. For music playback to headphones, Dolby Atmos can be enabled and the HRTF applied. [17] The HRTF (or rather, the object positions) can vary with head tracking to maintain the illusion of direction. [18] Qualcomm Snapdragon has a similar head-tracked spatial audio system, used by some brands of Android phones. [19] YouTube uses head-tracked HRTF with 360-degree and VR videos. [20]

Linux is currently unable to directly process any of the proprietary spatial audio (surround plus dynamic objects) formats. SoundScape Renderer offers directional synthesis. [21] PulseAudio and PipeWire can each provide virtual surround (fixed-location channels) using an HRTF. Recent PipeWire versions can also provide dynamic spatial rendering using HRTFs, [22] though integration with applications is still in progress. Users can configure their own positional and dynamic sound sources, as well as simulate a surround speaker setup using existing configurations.

The cross-platform OpenAL Soft, an implementation of OpenAL, uses HRTFs for improved localization. [23]

Windows and Linux spatial audio systems support any model of stereo headphones, while Apple only allows spatial audio to be used with Apple or Beats-branded Bluetooth headsets.[ citation needed ]


Related Research Articles

Binaural recording

Binaural recording is a method of recording sound that uses two microphones, arranged with the intent to create a 3D stereo sound sensation for the listener of actually being in the room with the performers or instruments. This effect is often created using a technique known as dummy head recording, wherein a mannequin head is fitted with a microphone in each ear. Binaural recording is intended for replay using headphones and will not translate properly over stereo speakers. This idea of a three-dimensional or "internal" form of sound has also translated into useful advancement of technology in many things such as stethoscopes creating "in-head" acoustics and IMAX movies being able to create a three-dimensional acoustic experience.

Ambisonics

Ambisonics is a full-sphere surround sound format: in addition to the horizontal plane, it covers sound sources above and below the listener.

Audio analysis refers to the extraction of information and meaning from audio signals for analysis, classification, storage, retrieval, synthesis, etc. The observation mediums and interpretation methods vary, as audio analysis can refer to the human ear and how people interpret the audible sound source, or it could refer to using technology such as an audio analyzer to evaluate other qualities of a sound source such as amplitude, distortion, frequency response. Once an audio source's information has been observed, the information revealed can then be processed for the logical, emotional, descriptive, or otherwise relevant interpretation by the user.

3D audio effects are a group of sound effects that manipulate the sound produced by stereo speakers, surround-sound speakers, speaker-arrays, or headphones. This frequently involves the virtual placement of sound sources anywhere in three-dimensional space, including behind, above or below the listener.

Sound localization is a listener's ability to identify the location or origin of a detected sound in direction and distance.

Equal-loudness contour

An equal-loudness contour is a measure of sound pressure level, over the frequency spectrum, for which a listener perceives a constant loudness when presented with pure steady tones. The unit of measurement for loudness levels is the phon and is arrived at by reference to equal-loudness contours. By definition, two sine waves of differing frequencies are said to have equal-loudness level measured in phons if they are perceived as equally loud by the average young person without significant hearing impairment.

Virtual acoustic space (VAS), also known as virtual auditory space, is a technique in which sounds presented over headphones appear to originate from any desired direction in space. The illusion of a virtual sound source outside the listener's head is created.

Virtual surround is an audio system that attempts to create the perception that there are many more sources of sound than are actually present. In order to achieve this, it is necessary to devise some means of tricking the human auditory system into thinking that a sound is coming from somewhere that it is not. Most recent examples of such systems are designed to simulate the true (physical) surround sound experience using one, two or three loudspeakers. Such systems are popular among consumers who want to enjoy the experience of surround sound without the large number of speakers that are traditionally required to do so.

Interaural time difference

The interaural time difference, in humans or animals, is the difference in the arrival time of a sound between the two ears. It is important in the localization of sounds, as it provides a cue to the direction or angle of the sound source from the head. If a signal arrives at the head from one side, the signal has further to travel to reach the far ear than the near ear. This pathlength difference results in a time difference between the sound's arrivals at the ears, which is detected and aids the process of identifying the direction of the sound source.

Binaural fusion or binaural integration is a cognitive process that involves the combination of different auditory information presented binaurally, or to each ear. In humans, this process is essential in understanding speech as one ear may pick up more information about the speech stimuli than the other.

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Holophonics is a binaural recording system created by Hugo Zuccarelli that is based on the claim that the human auditory system acts as an interferometer. It relies on phase variance, just like stereophonic sound. The sound characteristics of holophonics are most clearly heard through headphones, though they can be effectively demonstrated with two-channel stereo speakers, provided that they are phase-coherent. The word "holophonics" is related to "acoustic hologram".

Ambiophonics is a method in the public domain that employs digital signal processing (DSP) and two loudspeakers directly in front of the listener in order to improve reproduction of stereophonic and 5.1 surround sound for music, movies, and games in home theaters, gaming PCs, workstations, or studio monitoring applications. First implemented using mechanical means in 1986, today a number of hardware and VST plug-in makers offer Ambiophonic DSP. Ambiophonics eliminates crosstalk inherent in the conventional stereo triangle speaker placement, and thereby generates a speaker-binaural soundfield that emulates headphone-binaural sound, and creates for the listener improved perception of reality of recorded auditory scenes. A second speaker pair can be added in back in order to enable 360° surround sound reproduction. Additional surround speakers may be used for hall ambience, including height, if desired.

Psychoacoustics is the branch of psychophysics involving the scientific study of sound perception and audiology—how the human auditory system perceives various sounds. More specifically, it is the branch of science studying the psychological responses associated with sound. Psychoacoustics is an interdisciplinary field including psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science.

Spatial hearing loss refers to a form of deafness that is an inability to use spatial cues about where a sound originates from in space. Poor sound localization in turn affects the ability to understand speech in the presence of background noise.

3D sound localization refers to an acoustic technology that is used to locate the source of a sound in a three-dimensional space. The source location is usually determined by the direction of the incoming sound waves and the distance between the source and sensors. It involves the structure arrangement design of the sensors and signal processing techniques.

Perceptual-based 3D sound localization is the application of knowledge of the human auditory system to develop 3D sound localization technology.

3D sound reconstruction is the application of reconstruction techniques to 3D sound localization technology. These methods of reconstructing three-dimensional sound are used to recreate sounds to match natural environments and provide spatial cues of the sound source. They also see applications in creating 3D visualizations on a sound field to include physical aspects of sound waves including direction, pressure, and intensity. This technology is used in entertainment to reproduce a live performance through computer speakers. The technology is also used in military applications to determine location of sound sources. Reconstructing sound fields is also applicable to medical imaging to measure points in ultrasound.

3D sound is most commonly defined as the daily human experience of sounds. The sounds arrive to the ears from every direction and varying distances, which contribute to the three-dimensional aural image humans hear. Scientists and engineers who work with 3D sound work to accurately synthesize the complexity of real-world sounds.

Binaural unmasking is a phenomenon of auditory perception discovered by Ira Hirsh. In binaural unmasking, the brain combines information from the two ears in order to improve signal detection and identification in noise. The phenomenon is most commonly observed when there is a difference between the interaural phase of the signal and the interaural phase of the noise. When such a difference is present there is an improvement in masking threshold compared to a reference situation in which the interaural phases are the same, or when the stimulus has been presented monaurally. Those two cases usually give very similar thresholds. The size of the improvement is known as the "binaural masking level difference" (BMLD), or simply as the "masking level difference".

References

  1. Daniel Starch (1908). Perimetry of the localization of sound. State University of Iowa. p. 35 ff.
  2. Begault, D.R. (1994) 3D sound for virtual reality and multimedia. AP Professional.
  3. So, R.H.Y., Leung, N.M., Braasch, J. and Leung, K.L. (2006) A low cost, Non-individualized surround sound system based upon head-related transfer functions. An Ergonomics study and prototype development. Applied Ergonomics, 37, pp. 695–707.
  4. "AES Standard AES69-2015: AES standard for file exchange - Spatial acoustic data file format". www.aes.org. Retrieved 2016-12-30.
  5. "Sofa Conventions Website". Acoustics Research Institute, a research institute of the Austrian Academy of Sciences.
  6. Blauert, J. (1997) Spatial hearing: the psychophysics of human sound localization. MIT Press.
  7. Hofman, Paul M.; Van Riswick, JG; Van Opstal, AJ (September 1998). "Relearning sound localization with new ears" (PDF). Nature Neuroscience. 1 (5): 417–421. doi:10.1038/1633. PMID   10196533. S2CID   10088534.
  8. So, R.H.Y., Ngan, B., Horner, A., Leung, K.L., Braasch, J. and Blauert, J. (2010) Toward orthogonal non-individualized head-related transfer functions for forward and backward directional sound: cluster analysis and an experimental study. Ergonomics, 53(6), pp.767-781.
  9. Carlile, S. (1996). Virtual Auditory Space: Generation and Applications (1 ed.). Berlin, Heidelberg: Springer. ISBN   9783662225967.
  10. Tashev, Ivan (2014). "HRTF phase synthesis via sparse representation of anthropometric features". 2014 Information Theory and Applications Workshop (ITA). pp. 1–5. doi:10.1109/ITA.2014.6804239. ISBN   978-1-4799-3589-5. S2CID   13232557.
  11. Bilinski, Piotr; Ahrens, Jens; Thomas, Mark RP; Tashev, Ivan; Platt, John C (2014). "HRTF magnitude synthesis via sparse representation of anthropometric features" (PDF). 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE ICASSP, Florence, Italy. pp. 4468–4472. doi:10.1109/ICASSP.2014.6854447. ISBN   978-1-4799-2893-4. S2CID   5619011.
  12. Ziegelwanger, H., and Kreuzer, W., Majdak, P. (2015). "Mesh2HRTF: An open-source software package for the numerical calculation of head-related transfer functions," in Proceedings of the 22nd International Congress on Sound and Vibration, Florence, Italy.
  13. Carvalho, Davi (17 April 2023). "EAC - Individualized HRTF Synthesis". GitHub .
  14. Armstrong, Cal; Thresh, Lewis; Murphy, Damian; Kearney, Gavin (23 October 2018). "A Perceptual Evaluation of Individual and Non-Individual HRTFs: A Case Study of the SADIE II Database". Applied Sciences. 8 (11): 2029. doi: 10.3390/app8112029 .
  15. "Spatial Audio: Part 1 - Current Formats & The Rise Of HRTF - The Broadcast Bridge - Connecting IT to Broadcast". The Broadcast Bridge. 7 December 2022.
  16. "Spatial Sound for app developers for Windows, Xbox, and Hololens 2 - Win32 apps". learn.microsoft.com. 27 April 2023.
  17. "About Spatial Audio with Dolby Atmos in Apple Music". Apple Support. 27 March 2023.
  18. "Listen with spatial audio for AirPods and Beats". Apple Support. 19 July 2023.
  19. "Spatial Audio". www.qualcomm.com.
  20. "Use spatial audio in 360-degree and VR videos - YouTube Help". support.google.com.
  21. "SoundScape Renderer". spatialaudio.net. 9 January 2013.
  22. "Filter Chain". gitlab.freedesktop.org/pipewire/pipewire. 14 April 2023.
  23. "OpenAL Soft - Software 3D Audio". openal-soft.org.