Friday, April 22, 2016

A Virtual Headphone Listening Test Method

Fig. 1 The Harman Headphone Virtualizer App allows listeners to make double-blind comparisons of different headphones through a single high-quality replicator headphone. The app has two listening modes: a sighted mode (shown) and a blind mode (not shown) in which listeners are not biased by non-auditory factors (brand, price, celebrity endorsement, etc.).

Early in our headphone research we realized there was a need to develop a listening test method that allowed us to conduct more controlled double-blind comparisons of different headphones. This was necessary to remove tactile cues (headphone weight and clamping force) and visual and psychological biases (e.g. headphone brand, price, celebrity endorsement, etc.) from listeners' sound quality judgments. While these factors (apart from clamping force) don't physically affect the sound of headphones, our previous research into blind vs. sighted listening tests [1] revealed that their cognitive influence affects listeners' loudspeaker preferences, often in adverse ways. In sighted tests, listeners were also less sensitive and less discriminating than in blind conditions when judging different loudspeakers, including their interaction with different music selections and loudspeaker positions in the room. For that reason, consumers should be dubious of loudspeaker and headphone reviews based solely on sighted listening.

While blind loudspeaker listening tests are possible with the addition of an acoustically transparent but visually opaque curtain, there is no simple way to hide the identity of a headphone while the listener is wearing it. In our first headphone listening tests, the experimenter substituted the different headphones onto the listener's head from behind so that they could not be visually identified. However, after a couple of trials, listeners began to identify certain headphones simply by their weight and clamping force. One of the easiest headphones for listeners to identify was the Audeze LCD-2, which was considerably heavier (522 grams) and less comfortable than the other headphones. The test was essentially no longer blind.

To that end, a virtual headphone method was developed whereby listeners could A/B different models of headphones virtualized through a single pair of headphones (the replicator headphone). Details of the method and its validation were presented at the 51st Audio Engineering Society International Conference on Loudspeakers and Headphones [2] in Helsinki, Finland in 2013. A PDF of the slide presentation can be found here.

Headphone virtualization is done by measuring the frequency response of the different headphones at the DRP (eardrum reference point) using a G.R.A.S. 45 AG, and then equalizing the replicator headphone to match the measured responses of the real headphones. In this way, listeners can make instantaneous A/B comparisons between any number of virtualized headphones through the same headphone, without visual and tactile cues biasing their judgment. More details about the method are in the slides and AES preprint.
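The equalization step can be sketched as follows. The curve values, sample rate, and filter length below are illustrative placeholders, not Harman's actual measurement data: the correction the replicator needs is simply the dB difference between the target headphone's measured response and the replicator's own response, realized here as an FIR filter.

```python
import numpy as np
from scipy.signal import firwin2, freqz

fs = 48_000  # sample rate in Hz (assumed)

# Hypothetical magnitude responses measured at the eardrum reference point
# (DRP), in dB, on a shared frequency grid. Real curves would come from the
# GRAS coupler measurements described above.
freqs = np.array([0.0, 100, 1_000, 4_000, 8_000, fs / 2])    # Hz
target_db = np.array([0.0, 3.0, 0.0, 2.0, -2.0, -2.0])       # headphone to emulate
replicator_db = np.array([0.0, 1.0, 0.0, -1.0, -3.0, -3.0])  # replicator headphone

# EQ needed so the replicator matches the target: difference of the two curves
correction_db = target_db - replicator_db

# Design a linear-phase FIR filter approximating that correction magnitude
gains = 10 ** (correction_db / 20)
taps = firwin2(511, freqs / (fs / 2), gains)

# Verify the realized correction at 4 kHz (designed value: +3 dB)
w, h = freqz(taps, worN=[4_000], fs=fs)
realized_db = 20 * np.log10(abs(h[0]))
```

Audio played through the replicator would then be convolved with `taps`. Above roughly 8-10 kHz the post notes that equalization was deliberately relaxed, so in practice the correction curve would be flattened there rather than matched exactly.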

An important question is: "How accurate are the virtual headphones compared to the actual headphones?" In terms of their linear acoustic performance they are quite similar. Fig. 2 compares the measured frequency responses of the actual versus virtualized headphones. The agreement is quite good up to 8-10 kHz, above which we did not aggressively equalize the headphones because of measurement errors and large variations related to headphone positioning, both on the coupler and on the listener's head.

Fig. 2 Frequency response measurements of the 6 actual versus virtualized headphones made on a GRAS 45 AG coupler with pinna. The dotted curves are based on the physical headphones and the solid curves are from the virtual (replicator) headphone. The measurements of the right channel (red curves) have been offset by 10 dB from the left channel (blue curves) for visual clarity.

More importantly, "Do the actual and virtual headphones sound similar?" To answer this question we performed a validation experiment in which listeners evaluated 6 different headphones using both the standard and virtual listening methods, giving preference and spectral balance ratings in each. For headphone preference ratings, the correlation between standard and virtual test results was r = 0.85. A correlation of 1 would be perfect agreement, but 0.85 is not bad, and hopefully more accurate than headphone ratings based on sighted evaluations.
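The agreement statistic is a Pearson correlation across the per-headphone mean preference ratings from the two methods. A minimal sketch, using invented ratings for six headphones (the real values are in the paper):

```python
import numpy as np

# Hypothetical mean preference ratings for six headphones, one value per
# headphone from each test method. Illustrative numbers only.
standard = np.array([7.1, 5.6, 6.4, 3.2, 4.8, 5.9])  # actual headphones
virtual = np.array([6.8, 5.9, 6.1, 3.9, 4.3, 6.2])   # virtualized headphones

# Pearson correlation between the two sets of ratings
r = np.corrcoef(standard, virtual)[0, 1]
```

A high r means the two methods rank and space the headphones similarly, even if individual ratings shift up or down between methods.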

We believe the differences between virtual and standard test results are in part due to nuisance variables that were not perfectly controlled across the two methods. A significant nuisance variable is likely headphone leakage, which affects the amount of bass heard depending on the fit of the headphone on the individual listener. This would have affected the results in the standard test but not in the virtual one, where we used an open-back replicator headphone that largely eliminates leakage variations across listeners. Headphone weight and tactile cues were present in the standard test but not the virtual test, which could also partly explain the differences in results. If these two variables could be better controlled, even higher accuracy could be achieved in virtual headphone listening.

Fig. 3 The mean listener preference ratings and 95% confidence intervals for the headphones rated using the Standard and Virtual listening test methods. In the Standard Method, listeners evaluated the actual headphones with tactile/weight biases and any leakage effects. In the Virtual Tests, there were no visual or tactile cues about the headphones.

Some additional benefits of virtual headphone testing were discovered besides eliminating sighted and psychological biases: the listening tests are faster, more efficient and more sensitive. When listeners can quickly switch and compare all of the headphones in a single trial, auditory memory is less of a factor, and they are better able to discriminate among the choices. Since this paper was written in 2013, we've improved the accuracy of the virtualization, in part by developing a custom pinna for our GRAS 45 CA that better simulates the leakage effects of headphones measured on real human subjects [3].

Finally, it's important to acknowledge what the virtual headphone method doesn't capture: 1) non-minimum-phase effects (mostly occurring at higher frequencies) and 2) non-linear distortions that are level-dependent. The effect of these two variables on the virtual headphone test method has recently been tested experimentally and will be the topic of a future blog posting. Stay tuned.


[1] Floyd Toole and Sean Olive, "Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Listening Tests, and Other Interesting Things," presented at the 97th AES Convention, preprint 3894 (1994). Download here.

[2] Sean E. Olive et al., "A Virtual Headphone Listening Test Methodology," presented at the 51st AES International Conference on Loudspeakers and Headphones, Helsinki, Finland (2013).

[3] Todd Welti, "Improved Measurement of Leakage Effects for Circum-Aural and Supra-Aural Headphones," presented at the 136th AES Convention (May 2014). Download here.


  1. Thanks Sean.

    Do you think beam steering by using a phased array of multiple headphone drivers per earcup could potentially fix the pinnae and coupling effects?

  2. What headphone was used to virtualize other headphones?


    Were these trained listeners?


    Some headphones significantly change in frequency response based upon their position on the head. For example:

    Do planar magnetic and electrostatic headphones vary less in frequency response than dynamic headphones when moved around the listener's head? The InnerFidelity measurements suggest that planars are more consistent in frequency response when moved around the head.


    Can this virtualization technique be applied to IEMs?


    What is ideal in a headphone for virtualization?

    1. Hi Dreyka:

      1. In the validation tests we used a Sennheiser HD518 circumaural headphone because it is open-back (controlled bass leak) and has good repeatability in its measured response on different listeners. While its response doesn't meet the Harman target, it is relatively smooth and extended, making it a good candidate for equalization. We've also used the Sennheiser HD800 for the same reasons: very consistent response on different listeners, smooth and easy to equalize.

      The listeners in these tests were trained.

      I can't comment on whether planar magnetic or electrostatic headphones are more consistent, but since they are open I suspect they would be good. The Stax SR-009 has very low measured distortion, so we've used it for virtualization experiments where we were interested in the audibility of distortion.

      We are currently doing some research where we are virtualizing IEM headphones and the preliminary results look very good. The key is controlling for leakage effects which we have managed to do by monitoring the response in the ear for leakage.

      The ideal headphone for virtualization has an extended flat smooth response (20 Hz to 25 kHz), no distortion, no leakage or controlled leakage, and a consistent response across all listeners. If you find such a headphone let me know :)


    2. I'd be interested to hear about whether there are perceptual differences in soundstage between open and closed headphones when listened to in an anechoic chamber.

      It was partly discussed here:

    3. How are you equalizing IEMs, and which IEMs are you using? A deep-fit IEM like the Etymotic is going to be different from the shallow fit that is much more typical for IEMs like the Sennheiser IE8.

      Wouldn't the different geometry of everyone's ear canals and the different volume of air in each canal affect frequency response? With the shallow fit that is common for many IEMs, the resonance from a sealed ear canal is around 7-8 kHz.

    4. Dreyka

      Yes, the insertion depth of the IEM will shift the frequency of the first resonance, but it occurs above 8 kHz. The question is: does it matter in terms of perceived sound quality?

      We are presenting a paper at the upcoming 141st AES Convention in Los Angeles where we compare objective and subjective evaluations of real and virtualized IEMs. The intra- and inter-listener agreement in terms of preferred IEMs was very good, which suggests that maybe this is not a huge issue. We carefully monitored and controlled leakage, which I think is the biggest issue in closed circumaural and IEM measurements (objective or subjective).

  3. Hi Sean,

    Do you think it is theoretically possible that electrostatics and planar magnetics exhibit less spatial variation because the wavefront generated by these transducers is more uniform over the radiating area, thereby reducing variations in pinna interactions?

  4. Hello Dr. Olive, and congratulations on your work.

    As an owner of an HD595 (similar to the HD518), I find their bass very weak. By "weak" I mean that even after EQ correction you don't feel the pressure on your eardrum. There is no tactility.
    Perhaps this has to do with the openness of the design and its inability to maintain high air pressure. This is depicted in CSD measurements as very low resonance at low frequencies, if I'm not mistaken.
    What do you think? Might this be the reason why the simulated Audeze LCD-2 (via the HD518) sounds thin (mid-centric) compared to the original? What is the significance of resonances in general?