Thursday, April 9, 2009

The Dishonesty of Sighted Listening Tests



An ongoing controversy within the high-end audio community is the efficacy of blind versus sighted listening tests of audio products. In a blind listening test, the listener has no specific knowledge of which products are being tested, which removes the psychological influence that a product's brand, design, price, and reputation have on the listener's impression of its sound quality. While double-blind protocols are standard practice in all fields of science, including consumer testing of food and wine, the audio industry remains stuck in the dark ages in this regard. The vast majority of audio equipment manufacturers and reviewers continue to rely on sighted listening to make important decisions about their products' sound quality.

An important question is whether sighted audio product evaluations produce honest and reliable judgments of how the product truly sounds.


A Blind Versus Sighted Loudspeaker Experiment

This question was tested in 1994, shortly after I joined Harman International as Manager of Subjective Evaluation [1]. My mission was to introduce formalized, double-blind product testing at Harman. To my surprise, this mandate met with rather strong opposition from some of the more entrenched marketing, sales, and engineering staff, who felt that, as trained audio professionals, they were immune to the influence of sighted biases. Unfortunately, at that time there were no published scientific studies in the audio literature to either support or refute their claims, so a listening experiment was designed to test this hypothesis directly. The details of the test are described in references 1 and 2.


A total of 40 Harman employees participated in these tests, giving preference ratings to four loudspeakers that covered a wide range of size and price. The test was conducted under both sighted and blind conditions using four different music selections.


The mean loudspeaker ratings and 95% confidence intervals are plotted in Figure 1 for both the sighted and blind tests. The sighted tests produced a significant increase in preference ratings for the larger, more expensive loudspeakers G and D. (Note: G and D were identical loudspeakers except for different crossovers, voiced ostensibly for differences in German and Northern European tastes, respectively. The negligible perceptual differences found between loudspeakers G and D in this test resulted in the creation of a single loudspeaker SKU for all of Europe, and the demise of an engineer who specialized in the lost art of German speaker voicing.)
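For readers who want to reproduce this kind of summary statistic, here is a minimal Python sketch (using NumPy and SciPy) of how a mean rating and its 95% confidence interval can be computed. The ratings below are simulated placeholders, not the study's data:

    import numpy as np
    from scipy import stats

    # Hypothetical ratings (0-10 preference scale) from 40 listeners for one
    # loudspeaker under one condition; the real data came from the blind and
    # sighted sessions described above.
    rng = np.random.default_rng(0)
    ratings = rng.normal(6.0, 1.2, size=40)

    mean = ratings.mean()
    sem = stats.sem(ratings)  # standard error of the mean
    # 95% confidence interval from the t-distribution with n - 1 degrees of freedom
    ci_low, ci_high = stats.t.interval(0.95, len(ratings) - 1, loc=mean, scale=sem)
    print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")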


Brand biases and employee loyalty to Harman products were also a factor in the sighted tests, since three of the four products (G, D, and S) were Harman branded. Loudspeaker T was a large, expensive ($3.6k) competitor's speaker that had received critical acclaim in the audiophile press for its sound quality. However, not even Harman brand loyalty could overpower listeners' prejudices about the relatively small size, low price, and plastic materials of loudspeaker S: in the sighted test it was less preferred than loudspeaker T, whereas in the blind test it was slightly preferred over loudspeaker T.


Loudspeaker positional effects were also a factor, since these tests were conducted before the construction of the Multichannel Listening Lab with its automated speaker shuffler. The positional effects on loudspeaker preference rating are plotted in Figure 2 for both blind and sighted tests. The positional effects on preference are clearly visible in the blind tests, yet they are almost completely absent in the sighted tests, where visual biases and cognitive factors dominated listeners' judgment of the auditory stimuli. Listeners were also less responsive to loudspeaker-program interactions in the sighted tests than under blind conditions. Finally, the tests found that experienced and inexperienced listeners (both male and female) tended to prefer the same loudspeakers, a result confirmed in a more recent, larger study; the experienced listeners were simply more consistent in their responses. As it turned out, the experienced listeners were neither more nor less immune to the effects of visual biases than inexperienced listeners.
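One conventional way to test for effects like these is an analysis of variance with an interaction term. The sketch below, with an assumed data file and hypothetical column names, shows how a position-by-condition interaction could be tested in Python with pandas and statsmodels; it is an illustration of the general approach, not the analysis used in the original study:

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm

    # Assumed layout: one row per trial, with columns
    #   rating (float), position (1-4), condition ("blind" or "sighted")
    df = pd.read_csv("ratings.csv")  # hypothetical file name

    # Two-way ANOVA: a significant position x condition interaction would mean
    # that positional effects differ between blind and sighted tests.
    model = ols("rating ~ C(position) * C(condition)", data=df).fit()
    print(anova_lm(model, typ=2))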


In summary, the sighted and blind loudspeaker listening tests in this study produced significantly different sound quality ratings. The psychological biases in the sighted tests were sufficiently strong that listeners were largely unresponsive to real changes in sound quality caused by acoustical interactions between the loudspeaker, its position in the room, and the program material. In other words, if you want an accurate and reliable measure of how an audio product truly sounds, the listening test must be done blind. It's time the audio industry grew up and acknowledged this fact, if it wants to retain the trust and respect of consumers. It may already be too late, according to Stereophile magazine founder Gordon Holt, who lamented in a recent interview:


“Audio as a hobby is dying, largely by its own hand. As far as the real world is concerned, high-end audio lost its credibility during the 1980s, when it flatly refused to submit to the kind of basic honesty controls (double-blind testing, for example) that had legitimized every other serious scientific endeavor since Pascal. [This refusal] is a source of endless derisive amusement among rational people and of perpetual embarrassment for me.”



References


[1] Floyd Toole and Sean Olive, “Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Listening Tests, and Other Interesting Things,” presented at the 97th AES Convention, preprint 3894 (1994).


[2] Floyd Toole, Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, Focal Press, 2008.

Saturday, April 4, 2009

Binaural Room Scanning Part 2: Calibration, Testing, and Validation



In part 1 of this article, I described how binaural room scanning works and why it has great potential as a tool for psychoacoustic research and product testing. In part 2, I will describe some errors inherent to all BRS systems and the calibration required to remove them. Finally, I will summarize some research that has focused on testing and validating the performance of BRS systems.


BRS Errors

Unfortunately, all binaural record/reproduction systems inherently produce errors in the signals captured by the mannequin and later reproduced through the headphones. The categories of BRS errors are summarized in Figure 2 [1]. Certain types of BRS errors (error 9) are easily removed with a correction filter. Individualized errors, related to physical differences between the shapes and sizes of listeners' ears, heads, and torsos and those of the mannequin, are more challenging.
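As an illustration of the general idea behind such a correction filter, here is a generic regularized inverse-filter sketch in Python; this is a common textbook approach, not Harman's actual calibration procedure, and the function name and regularization constant are my own:

    import numpy as np

    def correction_filter(measured_ir, target_ir, eps=1e-3):
        """Design a correction filter by regularized frequency-domain division,
        a common way to equalize a measured playback-chain impulse response
        toward a target response."""
        n = len(measured_ir) + len(target_ir) - 1
        H_meas = np.fft.rfft(measured_ir, n)
        H_tgt = np.fft.rfft(target_ir, n)
        # The eps term keeps the inverse bounded at frequencies where the
        # measured response has little energy (avoids huge boosts in nulls).
        H_corr = H_tgt * np.conj(H_meas) / (np.abs(H_meas) ** 2 + eps)
        return np.fft.irfft(H_corr, n)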


While it is possible to calibrate for and remove individualized errors, doing so can be expensive and time-consuming, making BRS a less practical tool for psychoacoustic research and testing. An important question, therefore, is whether correcting them leads to a significant perceptual improvement or difference in the listening test results; if an error has no significant impact on the results and conclusions, it is less of a concern. It is well known that humans can relearn and adapt to errors in their vision or hearing introduced through injury or artificial means, suggesting that listeners may do the same when listening through a BRS system.


BRS Calibration, Testing, and Validation

To answer some of the above questions, we have been conducting listening tests in parallel using both BRS and conventional in situ methods to determine whether they produce similar results. These tests have involved evaluating different loudspeakers auditioned in a reflective listening room [1], [2], and having listeners rate the sound quality of different automotive audio systems [3]. So far, we have found no statistically significant differences in the results between the two methods: listeners' loudspeaker and automotive audio system preference ratings are the same whether measured in situ or through the BRS system. It is important to note that the BRS calibration used for these tests was based on a single listener, suggesting that individualized calibrations may not be necessary. Listeners are apparently adapting to and ignoring many of the residual errors that remain after calibration. We suspect adaptation is enhanced in multiple-comparison listening tasks, where the BRS errors are constant and common among the different loudspeakers or car audio systems being evaluated. Using a different BRS system, other researchers have reported similarly good agreement between BRS and in situ tests conducted on different audio codecs [4] and on an automobile audio system manipulated to produce different spectral and spatial attributes [5].
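A minimal sketch of the kind of comparison involved is a paired test on matched ratings from the two methods. The numbers below are made up for illustration, not the published data:

    from scipy import stats

    # Illustrative (not measured) mean preference ratings for the same four
    # loudspeakers rated in situ and through the BRS system, paired by product.
    in_situ = [6.1, 5.4, 4.8, 3.9]
    brs = [6.0, 5.6, 4.7, 4.1]

    t, p = stats.ttest_rel(in_situ, brs)
    print(f"paired t = {t:.2f}, p = {p:.3f}")  # large p -> no evidence of a difference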


Future BRS Research

There remain many unanswered questions about the performance, calibration, and testing of BRS systems. Is it necessary to capture and simulate the whole-body vibration that listeners feel when listening in a car or another listening space where the low-frequency tactile element is significant? What is the best method for capturing and reproducing the nonlinear distortion of the audio system, which is normally not included in the binaural room impulse response? Given that auditory perception is part of a multimodal sensory experience, how important is it to include the visual cues (e.g., car and room interiors) that reinforce the auditory cues heard by the listener and prevent cognitive dissonance? These are questions we are currently investigating so that we can improve the overall accuracy and perceptual realism of BRS systems used in psychoacoustic research and product evaluations.


References


[1] Sean E. Olive, “Interaction Between Loudspeakers and Room Acoustics Influences Loudspeaker Preferences in Multichannel Audio Reproduction,” PhD Thesis, Schulich School of Music, McGill University, Montreal, Quebec, Canada (February 2008).


[2] Sean Olive, Todd Welti, and William L. Martens, “Listener Loudspeaker Preference Ratings Obtained In Situ Match Those Obtained Via a Binaural Room Scanning Measurement and Playback System,” presented at the 122nd Audio Eng. Soc. Convention, preprint 7034 (May 2007).

[3] Sean Olive and Todd Welti, “Validation of a Binaural Car Scanning System for Subjective Evaluation of Automotive Audio Systems,” to be presented at the 36th International Audio Eng. Soc. Conference, Dearborn, Michigan, USA (June 2-4, 2009).

[4] S. Bech, M-A. Gulbol, G. Martin, J. Ghani, and W. Ellermeier, “A Listening Test System for Automotive Audio - Part 2: Initial Verification,” preprint 6359, Proceedings of the 118th International Convention of the Audio Eng. Soc., Barcelona, Spain (May 2005).


[5] Søren Bech, Sylvain Choisel, and Patrick Hegarty, “A Listening Test System for Automotive Audio - Part 3: Comparison of Attribute Ratings Made in a Vehicle with Those Made Using an Auralisation System,” preprint 7224, Proceedings of the 123rd International Convention of the Audio Eng. Soc., Vienna, Austria (October 2007).