Saturday, May 30, 2009

Harman's "How to Listen" - A New Computer-based Listener Training Program


Trained listeners with normal hearing are used at Harman International for all standard listening tests related to research and competitive benchmarking of consumer, professional and automotive audio products. This article explains why we use trained listeners, and describes a new computer-based software program developed for training and selecting Harman listeners.


Why Train Listeners?

There are many compelling reasons for training listeners. First, trained listeners produce more discriminating and reliable judgments of sound quality than untrained listeners [1]. This means that fewer listeners are needed to achieve the same statistical confidence, resulting in considerable cost savings. Second, trained listeners are taught to identify, classify and rate important sound quality attributes using precise, well-defined terms to explain their preferences for certain audio systems and products. Vague audiophile terms such as “chocolaty”, “silky” or “the bass lacks pace, rhythm or musicality” are NOT part of the trained listener's vocabulary since these descriptors are not easily interpreted by audio engineers who must use the feedback from the listening tests to improve the product. Third, the Harman training itself, so far, has produced no apparent bias when comparing the loudspeaker preferences of trained versus untrained listeners [1]. This allows us to safely extrapolate the preferences of trained listeners to those of the general untrained population of listeners (e.g. most consumers).



Harman's “How to Listen” Listener Training Program

Harman’s “How to Listen” is a new computer-based software application that helps Harman scientists efficiently train and select listeners for psychoacoustic research and product evaluation. The self-administered program has 17 different training tasks that focus on four different attributes of sound quality: timbre (spectral effects), spatial attributes (localization and auditory imagery characteristics), dynamics, and nonlinear distortion artifacts. Each training task starts at a novice level and gradually advances in difficulty based on the listener's performance. Constant feedback on the listener's responses is provided to improve learning and performance. A presentation of the training software can be viewed in parts 1 and 2.


Spectral Training Tasks

There are two different spectral training tasks. In the Band Identification training task, the listener compares a reference (Flat) and an equalized version of the music program (EQ), and must determine which frequency band is affected by the equalization (see slide 5 of part 2). The filter types include peaks, dips, combinations of peaks and dips, high and low shelving filters, and low-pass, high-pass and band-pass filters. The task is aimed at teaching listeners to identify spectral distortions in precise, quantitative terms (filter type, frequency, Q and gain) that directly correspond to a frequency response measurement.
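
For readers curious about the signal processing involved: the peaks and dips described above are ordinary parametric (biquad) EQ filters. Below is a minimal Python sketch, using the widely known RBJ Audio EQ Cookbook formulas rather than Harman's actual implementation, showing how a single peaking filter with a given frequency, Q and gain could generate the "EQ" stimulus from the "Flat" reference.

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(fs, f0, gain_db, q):
    """Biquad peaking-filter coefficients (RBJ Audio EQ Cookbook)."""
    a_lin = 10.0 ** (gain_db / 40.0)        # amplitude factor
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
    a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
    return b / a[0], a / a[0]

# Example: a +6 dB, Q = 2 peak at 1 kHz, the kind of spectral
# "distortion" a trainee must detect and place in the right band.
fs = 48000
b, a = peaking_eq(fs, f0=1000.0, gain_db=6.0, q=2.0)
flat = np.random.randn(fs)        # stand-in for the "Flat" music program
eq = lfilter(b, a, flat)          # the equalized "EQ" stimulus
```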


At the easiest skill level, there are only two frequency band choices, and the equalization is easy to detect and classify. As the listener advances, however, the audio bandwidth is subdivided into more frequency bands, making the affected band harder to hear and identify.


The Spectral Plot training exercise takes this one step further. The listener compares different music selections equalized to simulate more complex frequency response shapes commonly found in measurements of loudspeakers in rooms and automotive spaces. The listener is given a choice of frequency response curves, which they must correctly match to the perceived spectral balances of the stimuli. This teaches listeners to draw the perceived timbre of an audio component as a frequency response curve. Once trained, listeners become quite adept at drawing the perceived spectral balance of different loudspeakers, and these graphs closely correspond to their acoustical measurements [2], [3].


Sound Quality Attribute Tasks

The purpose of these tasks is to familiarize the listener with each of the four sound quality attributes (timbre, spatial, dynamics and nonlinear distortion) and their sub-attributes, and to measure the listener's ability to reliably rate differences in an attribute's intensity. For example, in one task the listener must rank order two or more stimuli by the relative brightness or dullness of the processed music selection. As the difficulty of the task increases, the listener must rate more stimuli with incrementally smaller differences in the intensity of the tested attribute. Listener performance is calculated using Spearman's rank correlation coefficient, which expresses the degree to which the stimuli have been correctly rank ordered on the attribute scale.
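
To make the scoring concrete, here is how Spearman's rank correlation could be computed for a single trial. The listener ranks and stimuli below are hypothetical, and scipy's spearmanr simply stands in for whatever the program computes internally.

```python
from scipy.stats import spearmanr

# One hypothetical trial: the listener rank orders four stimuli by
# brightness (1 = dullest), versus the true ranks set by the DSP.
listener_ranks = [2, 1, 4, 3]
true_ranks = [1, 2, 4, 3]

rho, p = spearmanr(listener_ranks, true_ranks)
print(f"Spearman rho = {rho:.2f}")   # 0.80 here; 1.0 = perfect ordering
```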


Preference Training

In this task, the listener enters preference ratings for different music selections that have had one or more attributes (timbre, spatial, dynamics and nonlinear distortion) modified by incremental amounts.


By studying the interrelationship between the modification of these attributes and the preference ratings, Harman scientists can uncover how listeners weight different attributes when formulating their preferences. From this, the preference profile of a listener can be mapped based on the importance they place on certain sound quality attributes. The performance metric in the preference task is based on the F-statistic calculated from an ANOVA performed on the individual listener's data. The higher the F-statistic, the more discriminating and/or consistent the listener's ratings are - a highly desirable trait when selecting listeners.
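
As a small hypothetical illustration of that metric: a one-way ANOVA over one listener's repeated ratings of three stimuli. The data and the use of scipy.stats.f_oneway are illustrative only; the program's actual analysis may differ.

```python
from scipy.stats import f_oneway

# Hypothetical repeated ratings from one listener for three stimuli
# (four trials each). Large separation between stimuli relative to
# trial-to-trial scatter yields a large F-statistic.
stim_a = [7.5, 7.0, 7.2, 7.4]
stim_b = [5.1, 5.3, 4.9, 5.0]
stim_c = [3.2, 3.0, 3.4, 3.1]

f_stat, p = f_oneway(stim_a, stim_b, stim_c)
print(f"F = {f_stat:.1f}, p = {p:.4f}")
```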


Other Key Features

Harman’s “How to Listen” training software runs on both Windows and Mac OS X platforms, and includes a real-time DSP engine for manipulating the various sound quality attributes. Most common stereo and multichannel sound formats are supported. In “Practice Mode”, users can easily add their own music selections.


All of the training results from the 100+ listeners located at Harman locations worldwide are stored on a centralized database server. A web-based front end will allow listeners to log in to monitor their performance and compare it to that of other listeners currently in training. Of course, the identities of the other listeners always remain confidential.


Conclusion

In summary, Harman’s “How to Listen” is a new computer-based, self-guided software program that teaches listeners how to identify, classify and rate the quality of recorded and reproduced sounds according to their timbral, spatial, dynamic and nonlinear distortion attributes. The training program gives constant performance feedback and analytics that allow the software to adapt to the ability of the listener. These performance metrics are used for selecting the most discriminating and reliable listeners used for research and subjective testing of Harman audio products.


References

[1] Sean E. Olive, "Differences in Performance and Preference of Trained Versus Untrained Listeners in Loudspeaker Tests: A Case Study," J. Audio Eng. Soc., vol. 51, no. 9, pp. 806-825 (September 2003). Download for free here, courtesy of Harman International.


[2] Sean E. Olive, “A Multiple Regression Model for Predicting Loudspeaker Preference Using Objective Measurements: Part I - Listening Test Results,” presented at the 116th AES Convention (May 2004).


[3] Floyd E. Toole, Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, Focal Press (July 2008). Available from Amazon here.


Saturday, May 23, 2009

The Harman International Reference Listening Room

Last week I returned from the AES Munich Convention where I gave a paper entitled “A New Reference Listening Room for Consumer, Professional, and Automotive Audio Research.” It describes the features, scientific rationale, and acoustical performance of a new reference listening room designed and built for the purposes of conducting controlled listening tests and psychoacoustic research for consumer, professional, and automotive audio products. The main features of the room include quiet and adjustable room acoustics, a high-quality calibrated playback system, an in-wall loudspeaker mover, and complete automated control of listening tests performed in the room. A copy of my Munich AES presentation is available here.


The first prototype reference room was built at the Harman Northridge campus in 2007. Additional reference listening rooms have since been built at Harman locations in the UK and Germany, and a fourth is under construction in Farmington Hills, Michigan. We are in the process of measuring and calibrating the performance of the different rooms using acoustical measurements and binaural room scans, which will be evaluated for their perceptual similarity in sound quality.


With a standardized listening room and playback system, Harman scientists can conduct listener training, psychoacoustic research and product testing at different Harman locations throughout the world. The results from these different locations can be compared or pooled together since the room, playback system, and trained listeners are held constant. This brings greater testing efficiency and flexibility, and opens up new kinds of product research and listening tests for Harman in the future. Already, we are using the unique features of these rooms to conduct tightly controlled listening tests on consumer in-wall speakers, and to research and benchmark the performance of various commercial and prototype loudspeaker-room correction devices.


You will hear a lot more about the Harman International reference listening rooms in the near future because of the pivotal role they will play in the research, testing and subjective benchmarking of new Harman consumer, professional and automotive audio products. Just thinking about these research possibilities makes me truly excited!

Thursday, April 9, 2009

The Dishonesty of Sighted Listening Tests



An ongoing controversy within the high-end audio community is the efficacy of blind versus sighted audio product listening tests. In a blind listening test, the listener has no specific knowledge of what products are being tested, thereby removing the psychological influence that the product’s brand, design, price and reputation have on the listeners’ impression of its sound quality. While double-blind protocols are standard practice in all fields of science - including consumer testing of food and wine - the audio industry remains stuck in the dark ages in this regard. The vast majority of audio equipment manufacturers and reviewers continue to rely on sighted listening to make important decisions about the products’ sound quality.

An important question is whether sighted audio product evaluations produce honest and reliable judgments of how the product truly sounds.


A Blind Versus Sighted Loudspeaker Experiment

This question was put to the test in 1994, shortly after I joined Harman International as Manager of Subjective Evaluation [1]. My mission was to introduce formalized, double-blind product testing at Harman. To my surprise, this mandate met rather strong opposition from some of the more entrenched marketing, sales and engineering staff, who felt that as trained audio professionals they were immune to the influence of sighted biases. Unfortunately, at the time there were no published scientific studies in the audio literature to either support or refute their claims, so a listening experiment was designed to directly test this hypothesis. The details of this test are described in references 1 and 2.


A total of 40 Harman employees participated in these tests, giving preference ratings to four loudspeakers that covered a wide range of size and price. The test was conducted under both sighted and blind conditions using four different music selections.


The mean loudspeaker ratings and 95% confidence intervals are plotted in Figure 1 for both sighted and blind tests. The sighted tests produced a significant increase in preference ratings for the larger, more expensive loudspeakers G and D. (Note: G and D were identical loudspeakers except for different crossovers, ostensibly voiced for German and Northern European tastes, respectively. The negligible perceptual differences found between loudspeakers G and D in this test resulted in the creation of a single loudspeaker SKU for all of Europe, and the demise of an engineer who specialized in the lost art of German speaker voicing.)
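
For readers who want to produce this kind of plot from their own listening data, here is a minimal sketch of the underlying calculation: the mean rating and a t-distribution-based 95% confidence interval for one loudspeaker under each condition. The ratings below are randomly generated stand-ins, not the data behind Figure 1.

```python
import numpy as np
from scipy import stats

def mean_ci95(ratings):
    """Mean rating and 95% confidence half-width (t distribution)."""
    ratings = np.asarray(ratings, dtype=float)
    half = stats.sem(ratings) * stats.t.ppf(0.975, len(ratings) - 1)
    return ratings.mean(), half

# Randomly generated stand-ins for 40 listeners' ratings of one speaker
rng = np.random.default_rng(0)
blind = rng.normal(6.0, 1.2, 40)
sighted = rng.normal(7.1, 1.0, 40)
for name, r in (("blind", blind), ("sighted", sighted)):
    m, h = mean_ci95(r)
    print(f"{name}: {m:.2f} +/- {h:.2f}")
```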


Brand biases and employee loyalty to Harman products were also a factor in the sighted tests, since three of the four products (G, D, and S) were Harman branded. Loudspeaker T was a large, expensive ($3.6k) competitor's speaker that had received critical acclaim in the audiophile press for its sound quality. However, not even Harman brand loyalty could overpower listeners' prejudices associated with the relatively small size, low price, and plastic materials of loudspeaker S: in the sighted test it was less preferred than loudspeaker T, in contrast to the blind test, where it was slightly preferred over loudspeaker T.


Loudspeaker positional effects were also a factor, since these tests were conducted prior to the construction of the Multichannel Listening Lab with its automated speaker shuffler. The positional effects on loudspeaker preference rating are plotted in Figure 2 for both blind and sighted tests. The positional effects on preference are clearly visible in the blind tests, yet the effects are almost completely absent in the sighted tests, where visual biases and cognitive factors dominated listeners' judgment of the auditory stimuli. Listeners were also less responsive to loudspeaker-program effects in the sighted tests than under the blind test conditions. Finally, the tests found that experienced and inexperienced listeners (both male and female) tended to prefer the same loudspeakers, which has since been confirmed in a more recent, larger study. The experienced listeners were simply more consistent in their responses. As it turned out, the experienced listeners were no more or less immune to the effects of visual biases than inexperienced listeners.


In summary, the sighted and blind loudspeaker listening tests in this study produced significantly different sound quality ratings. The psychological biases in the sighted tests were sufficiently strong that listeners were largely unresponsive to real changes in sound quality caused by acoustical interactions between the loudspeaker, its position in the room, and the program material. In other words, if you want to obtain an accurate and reliable measure of how an audio product truly sounds, the listening test must be done blind. It's time the audio industry grew up and acknowledged this fact if it wants to retain the trust and respect of consumers. It may already be too late, according to Stereophile magazine founder Gordon Holt, who lamented in a recent interview:


“Audio as a hobby is dying, largely by its own hand. As far as the real world is concerned, high-end audio lost its credibility during the 1980s, when it flatly refused to submit to the kind of basic honesty controls (double-blind testing, for example) that had legitimized every other serious scientific endeavor since Pascal. [This refusal] is a source of endless derisive amusement among rational people and of perpetual embarrassment for me.”



References


[1] Floyd Toole and Sean Olive, "Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Listening Tests, and Other Interesting Things," presented at the 97th AES Convention, preprint 3894 (1994). Download here.


[2] Floyd Toole, Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, Focal Press, 2008.

Saturday, April 4, 2009

Binaural Room Scanning Part 2: Calibration, Testing, and Validation



In part 1 of this article, I described how binaural room scanning works and why it has great potential as a tool for psychoacoustic research and product testing. In part 2, I will describe some errors inherent to all BRS systems, which must be removed through proper calibration. Finally, I will summarize some research that has focused on testing and validating the performance of BRS systems.


BRS Errors

Unfortunately, all binaural record/reproduction systems inherently produce errors in the signals captured by the mannequin and later reproduced through the headphones. The categories of BRS errors are summarized in Figure 2 [1]. Certain types of BRS errors (error 9) are easily removed with a correction filter. Individualized errors, related to physical differences between the shapes and sizes of listeners' ears, heads and torsos and those of the mannequin, are more challenging.
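
As a rough illustration of what such a correction filter can look like, the sketch below computes a regularized frequency-domain inverse of a measured impulse response (e.g. the headphone-to-ear transfer function). This is a generic textbook technique offered for illustration only, not necessarily the calibration method used in our BRS system.

```python
import numpy as np

def inverse_filter(ir, n_fft=8192, reg=1e-3):
    """Regularized FFT-domain inverse of a measured impulse response.
    The regularization term keeps the gain bounded at frequencies
    where the measured response has little energy."""
    H = np.fft.rfft(ir, n_fft)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + reg)
    h = np.fft.irfft(H_inv, n_fft)
    h = np.roll(h, n_fft // 2)        # shift to make the filter causal
    return h * np.hanning(n_fft)      # window to suppress edge ringing

# Sanity check with a toy response: convolving the correction filter
# with the original IR should approximate a delayed unit impulse.
rng = np.random.default_rng(0)
ir = rng.normal(0.0, 0.05, 512)
ir[0] = 1.0                           # strong direct sound plus a "tail"
h = inverse_filter(ir)
check = np.convolve(ir, h)            # peak near sample n_fft // 2
```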


While it is possible to calibrate and remove individualized errors, doing so can be expensive and time-consuming, making BRS a less practical tool for psychoacoustic research and testing. An important question, therefore, is whether correcting these errors leads to a significant perceptual improvement or difference in the listening test results; if an error has no significant impact on the results and conclusions, it is less of a concern. It is well known that humans can re-learn and adapt to errors in their vision or hearing introduced through injury or artificial means, suggesting that listeners may do the same when listening through a BRS system.


BRS Calibration Testing and Validation

To answer some of the above questions, we have been conducting listening tests in parallel using both BRS and conventional in situ methods to determine whether they produce similar results. These tests have been conducted by having listeners evaluate different loudspeakers auditioned in a reflective listening room [1],[2], and the sound quality of different automotive audio systems [3]. So far, we have found no statistically significant differences in the results between the two methods: listeners' loudspeaker and automotive audio system preference ratings are the same whether measured in situ or through the BRS system. It is important to note that the BRS calibration used for these tests was based on a single listener, suggesting that individualized calibrations may not be necessary. Listeners are apparently adapting to, and ignoring, many of the residual errors that remain after calibration. We suspect adaptation is enhanced in multiple-comparison listening tasks, where the BRS errors are constant and common among the different loudspeakers or car audio systems being evaluated. Using a different BRS system, other researchers have reported similarly good agreement between BRS and in situ tests conducted on different audio CODECs [4], and on an automobile audio system manipulated to produce different spectral and spatial attributes [5].
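
To illustrate the kind of comparison involved, the sketch below checks agreement between hypothetical in situ and BRS preference ratings using a paired t-test and a correlation. The numbers are made up and the published analyses were fuller ANOVAs of the actual listening test data, so treat this only as a schematic.

```python
import numpy as np
from scipy import stats

# Hypothetical mean preference ratings for the same four loudspeakers,
# heard in situ and through the BRS system (not the published data)
in_situ = np.array([7.2, 5.8, 6.5, 4.9])
brs = np.array([7.0, 5.9, 6.3, 5.1])

t, p = stats.ttest_rel(in_situ, brs)   # paired: same speakers, two methods
r, _ = stats.pearsonr(in_situ, brs)    # agreement between the two methods
print(f"paired t = {t:.2f}, p = {p:.2f}, r = {r:.3f}")
```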


Future BRS Research

There remain many unanswered questions about the performance, calibration and testing of BRS systems. Is it necessary to capture and simulate the whole-body vibration that listeners feel when listening in a car or other listening space where the low-frequency tactile element is significant? What is the best method for capturing and reproducing the nonlinear distortion of the audio system, which is normally not included in the binaural room impulse response? Given that auditory perception is part of a multi-modal sensory experience, how important is it to include the visual cues (e.g. car and room interiors) that reinforce the auditory cues heard by the listener and prevent cognitive dissonance? These are questions we are currently investigating so that we can improve the overall accuracy and perceptual realism of BRS systems used in psychoacoustic research and product evaluations.


References


[1] Sean E. Olive, “Interaction Between Loudspeakers and Room Acoustics Influences Loudspeaker Preferences in Multichannel Audio Reproduction,” PhD Thesis, Schulich School of Music, McGill University, Montreal, Quebec, Canada, (February 2008).


[2] Olive, Sean, Welti, Todd, and Martens, William L., "Listener Loudspeaker Preference Ratings Obtained In Situ Match those Obtained Via a Binaural Room Scanning Measurement and Playback System," presented at the 122nd Audio Eng. Soc. Convention, preprint 7034 (May 2007). Download here.

[3] Olive, Sean and Welti, Todd, "Validation of a Binaural Car Scanning System for Subjective Evaluation of Automotive Audio Systems," to be presented at the 36th International Audio Eng. Conference, Dearborn, Michigan, USA (June 2-4, 2009).

[4] S. Bech, M.-A. Gulbol, G. Martin, J. Ghani, and W. Ellermeier, "A Listening Test System for Automotive Audio - Part 2: Initial Verification," preprint 6359, Proceedings of the 118th Convention of the Audio Eng. Soc., Barcelona, Spain (May 2005). Download here.


[5] Søren Bech, Sylvain Choisel and Patrick Hegarty, "A Listening Test System for Automotive Audio – Part 3: Comparison of Attribute Ratings Made in a Vehicle with Those Made Using an Auralisation System," preprint 7224, Proceedings of the 123rd Convention of the Audio Eng. Soc., Vienna, Austria (October 2007). Download here.

Tuesday, March 24, 2009

Binaural Room Scanning - A Powerful Tool For Audio Research & Testing


Binaural Room Scanning (BRS) is a powerful audio technology being used by Harman scientists to conduct innovative psychoacoustic research and listening tests that were previously impractical, or even impossible. The roots of BRS trace back to Studer (a Harman International company), which in the late 1990s developed a BRS processor that allowed recording engineers to remotely monitor their recordings via headphones through a virtual copy of their control room [1].


Unlike model-based auralization methods, BRS provides an auditory display based on actual acoustical measurements of the loudspeakers and listening environment, not simulations based on a model of the loudspeakers and room. For this reason, BRS reproductions are significantly more accurate and realistic than model-based auralizations.


BRS measurements of the loudspeakers and listening space are made with an anthropomorphically accurate binaural mannequin equipped with microphones in each ear (see top photo above). Measurements are made in 1-2 degree increments over a range of ±60 degrees by precisely rotating the mannequin's head via a stepper motor controlled by the BRS measurement computer. Each measurement is stored as a set of binaural room impulse responses (BRIRs) that provide the filters through which music is convolved and sent to a calibrated pair of high-quality headphones (see bottom photo above). A key feature of the BRS playback system is its ultrasonic head-tracker: it constantly monitors the position of the listener's head, sending the angular coordinates to the playback engine, which in turn switches to the corresponding set of measured BRIRs. In this way, BRS playback preserves the natural dynamic interaural cues used by humans to localize sound in rooms. Without these dynamic cues, headphones tend to produce sound images localized inside or near the head, with front-to-back reversals being quite common. Head-tracking is therefore necessary for accurate assessment of the true spatial qualities of the audio reproduction.
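
Conceptually, the playback engine does something like the following: pick the BRIR pair measured closest to the current head angle and convolve the audio through it. The Python sketch below illustrates the idea only; a real engine runs partitioned convolution in real time and crossfades between BRIR sets as the head moves, both of which this simplified block-based version omits.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_block(audio, head_angle, brirs, angles):
    """Convolve one audio block through the left/right BRIRs measured
    nearest to the listener's current head angle (degrees)."""
    nearest = angles[np.argmin(np.abs(angles - head_angle))]
    ir_left, ir_right = brirs[nearest]
    left = fftconvolve(audio, ir_left)[: len(audio)]
    right = fftconvolve(audio, ir_right)[: len(audio)]
    return np.stack([left, right])    # stereo headphone feed

# Hypothetical setup: noise stand-ins for BRIR pairs at 1-degree
# steps over +/-60 degrees, matching the measurement grid above
rng = np.random.default_rng(0)
angles = np.arange(-60, 61)
brirs = {a: (rng.normal(0, 0.01, 4800), rng.normal(0, 0.01, 4800))
         for a in angles}
out = render_block(rng.normal(0, 0.1, 1024), 12.4, brirs, angles)
```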


Current and Future Applications For BRS

As a research tool, BRS offers greater efficiencies and opportunities in how audio scientists research, develop and test audio products within home, professional and automotive listening spaces. BRS allows an unlimited number of acoustical variables to be manipulated, sequentially captured, and later evaluated in a highly repeatable and controlled manner. Using BRS, Harman researchers can do perceptual experiments and product evaluations that would otherwise be impractical or impossible using conventional in situ listening tests. This includes double-blind, controlled comparisons of different audio systems in different automobiles, concert halls or arenas, and home theaters.


BRS has already been used at Harman to study how the acoustical properties of the loudspeaker and listening room interact with each other, how these interactions affect the sound quality of the music reproduction, and the extent to which listeners adapt to the room acoustics when listening to multichannel audio systems [2],[3]. Over the next few years, BRS will help expand our current scientific understanding of how listeners perceive sound in rooms, so that we can optimize the sound quality of loudspeakers, acoustic spaces, and room-correction devices used to tame loudspeaker-room interactions. A BRS auditory display connected over the internet to a BRS database could even allow consumers to compare and select their most preferred loudspeaker model, concert hall seat, or automotive audio system configuration, without ever leaving the privacy of their home.


Finally, BRS brings enormous efficiencies, flexibility, and cost savings to psychoacoustic research and testing. The acoustical complexity of an automotive audio system can be captured and stored as a relatively small 10 MB file, which can then be emailed and evaluated anywhere in the world using a relatively inexpensive auditory display. The high costs of building ITU-R listening rooms and of transporting listeners, automobiles, and loudspeakers around the world for evaluation may soon be a thing of the past.


In the next installment, I will discuss some of the inherent errors found in all BRS systems, and how they can be removed through proper calibration. Some recent listening experiments will be described that validate the perceptual accuracy and performance of our BRS system.



References


[1] Horbach, Ulrich, Karamustafaoglu, Attila, Pellegrini, Renato, Mackensen, Philip, and Theile, Günther, "Design and Applications of a Data-Based Auralization System for Surround Sound," presented at the 106th Audio Eng. Soc. Convention, preprint 4976 (May 1999). Download here.


[2] Olive, Sean and Martens, William L., "Interaction between Loudspeakers and Room Acoustics Influences Loudspeaker Preferences in Multichannel Audio," presented at the 123rd Audio Eng. Soc. Convention, preprint 7196 (October 2007). Download here.


[3] Olive, Sean and Welti, Todd, "Validation of a Binaural Car Scanning Measurement System for Subjective Evaluation of Automotive Audio Systems," to be presented at the 36th International Audio Eng. Conference, Dearborn, Michigan, USA (June 2-4, 2009).

Sunday, January 11, 2009

What Loudspeaker Specifications Are Relevant to Sound Quality?



This past week I attended the International Loudspeaker Manufacturer's Association (ALMA) Winter Symposium in Las Vegas, where the theme was “Sound Quality in Loudspeaker Design and Manufacturing.” Over the course of three days there were presentations, round table discussions, and workshops from the industry's leading experts, all focused on improving the sound quality of the loudspeaker. Ironically, the important question of whether these improvements matter to consumers wasn't raised until the final hours of the symposium, in a panel discussion called “What loudspeaker specifications are relevant to perception?”

The panelists included myself, Steve Temme (Listen Inc.), Dr. Earl Geddes (GedLee), Laurie Fincham (THX), Mike Klasco (Menlo Scientific), and Dr. Floyd Toole (former VP Acoustic Engineering at Harman), who served as the panel moderator. Within about 30 minutes, a consensus was reached on the following points:

  1. The perception of loudspeaker sound quality is dominated by linear distortions, which can be accurately quantified and predicted using a set of comprehensive anechoic frequency response measurements (see my previous posting here).
  2. Both trained and untrained listeners tend to prefer the most accurate loudspeakers when measured under controlled double-blind listening conditions (see this article here).
  3. The relationship between perception and measurement of nonlinear distortions is less well understood and needs further research. Popular specifications like Total Harmonic Distortion (THD) and Intermodulation Distortion (IM) do not accurately reflect the distortion’s audibility and effect on the perceived sound quality of the loudspeaker.
  4. Current industry loudspeaker specifications are woefully inadequate in characterizing the sound quality of the loudspeaker. The commonly quoted “20 Hz - 20 kHz, ±3 dB” single-curve specification is a good example. Floyd Toole made the observation that there is more useful performance information on the side of a tire (see tire below) than is currently found on most loudspeaker spec sheets (see Floyd's new book "Sound Reproduction").

For the remaining hour, the discussion turned towards identifying the root cause of why loudspeaker performance specifications seem stuck in the Pleistocene Age, despite scientific advancements in loudspeaker psychoacoustics. Do consumers really care about loudspeaker sound quality? Or are they mostly satisfied with the status quo? Why do loudspeaker manufacturers continue to hide behind loudspeaker performance numbers that are mostly meaningless, and often misleading?

The evidence that consumers no longer care about sound quality is anecdotal, largely based on the recent down-market trend in consumer audio. Competition from digital cameras, flat panel video displays, MP3 players, computers, and GPS navigation devices has decimated the consumer's audio budget. This doesn't prove consumers care less about loudspeaker sound quality, only that there is less available money to purchase it. Marketing research studies indicate that sound quality remains an important factor in consumers' audio purchase decisions. Given the opportunity to hear different loudspeakers under controlled, unbiased listening conditions, consumers will tend to prefer the most accurate ones. Unfortunately, with the demise of the specialty audio dealer and the growth of internet-based sales, consumers rarely have the opportunity to audition different loudspeakers, even under the most biased and uncontrolled listening conditions. This is a perfect opportunity, and reason, for the industry to provide new loudspeaker specifications that accurately portray the perceived sound quality of the loudspeaker.

So why is the loudspeaker industry not moving more quickly towards this goal? In my view, complacency and fear are the major obstacles. The loudspeaker industry is very conservative and largely self-regulated. There are no regulatory agencies to force improvement, or even to check whether a product's quoted specifications are compliant with reality. Change will only occur as the result of competition, or of pressure exerted by consumers, industry trade organizations (e.g. CEDIA, CEA) or consumer product testing organizations like Consumer Reports. The fear of adopting a new specification stems from the realization that a company could no longer hide beneath the Emperor's new clothes (i.e. the current specifications): a perceptually relevant specification would clearly separate the good-sounding loudspeakers from the truly mediocre ones. In the future, a perception-based specification like the one illustrated to the right could provide ratings of overall sound quality and of various timbral, spatial and dynamic attributes. The consumer could then choose a loudspeaker based on these measured attributes.

In conclusion, all evidence suggests that consumers highly value sound quality when purchasing a loudspeaker, yet current loudspeaker specifications provide little guidance in this matter. It is time the loudspeaker industry grew up and realized this. Adopting a more perceptually meaningful loudspeaker specification would permit consumers to make smarter loudspeaker choices based on how the products actually sound. This would better serve the interests of consumers and of loudspeaker manufacturers who view the sound quality of a loudspeaker as its most important selling feature.

Saturday, January 3, 2009

Why Consumer Reports' Loudspeaker Accuracy Scores Are Not Accurate


For over 35 years, Consumer Reports magazine recommended loudspeakers to consumers based on what many audio scientists believe to be a flawed loudspeaker test methodology. Each loudspeaker was assigned an accuracy score related to the "flatness" of its sound power response measured in 1/3-octave bands. Consumers Union (CU) - the organization behind Consumer Reports - asserted that the sound power best predicts how good the loudspeaker sounds in a typical listening room. Until recently, this assertion had never been formally tested or validated in a published scientific study.
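
To make the flatness-based scoring described above concrete, here is a generic sketch of one way to score "flatness" from 1/3-octave band levels: average a measured response into 1/3-octave bands, then report the spread of the band levels in dB. This illustrates the general approach only; it is not CU's actual scoring formula.

```python
import numpy as np

def flatness_score(freqs, spl_db):
    """Illustrative flatness metric: average a measured response
    (freqs in Hz, levels in dB, both numpy arrays) into 1/3-octave
    bands spanning roughly 20 Hz to 20 kHz, then return the standard
    deviation of the band levels in dB (smaller = flatter)."""
    centers = 1000.0 * 2.0 ** (np.arange(-17, 14) / 3.0)
    levels = []
    for fc in centers:
        lo, hi = fc / 2 ** (1 / 6), fc * 2 ** (1 / 6)
        in_band = (freqs >= lo) & (freqs < hi)
        if in_band.any():
            levels.append(spl_db[in_band].mean())
    return float(np.std(levels))

# Toy response with broad ripples, just to exercise the function
freqs = np.logspace(np.log10(20), np.log10(20000), 500)
spl_db = 85 + np.sin(np.log(freqs))
print(f"flatness: {flatness_score(freqs, spl_db):.2f} dB")
```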

In 2004, the author conducted a study designed to address the following question: "Does the CU loudspeaker model accurately predict listeners' loudspeaker preference ratings?" (see Reference 1). A sample of 13 different loudspeaker models reviewed in the August 2001 edition of Consumer Reports was selected for the study. Over the course of several months, the 13 loudspeakers were subjectively evaluated by a panel of trained listeners in a series of controlled, double-blind listening tests. Comparative judgments were made among different groups of four speakers at a time using four different music programs. Loudspeaker positional biases were eliminated via an automated speaker shuffler. To control loudspeaker context effects, a balanced test design was used so that each loudspeaker was compared against the other 12 loudspeaker models an equal number of times. This produced a total of 2,912 preference, distortion and spectral balance ratings, in addition to 2,138 comments.

The above graph plots the mean listener loudspeaker preference rating and 95% confidence intervals (blue circles), and the corresponding CU predicted accuracy score (red squares), for each of the 13 loudspeakers. The agreement between the listener preference ratings and the CU accuracy scores is very poor indeed; in fact, the correlation between the two sets of ratings is actually negative (r = -0.22) and statistically insignificant (p = 0.46). The most preferred loudspeaker in the test group (loudspeaker 1) actually received the lowest CU accuracy score (76). Conversely, some of the least preferred loudspeakers (e.g. loudspeakers 9 and 10) received the highest CU accuracy scores. In conclusion, the CU accuracy scores do not accurately predict listeners' loudspeaker preference ratings. Since this study was published, CU has begun to reevaluate its loudspeaker testing methods. Hopefully, the new rating system will more accurately predict the perceived sound quality of loudspeakers in a typical listening room.
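
The correlation reported above is an ordinary product-moment (Pearson) correlation over the 13 pairs of scores. The sketch below shows the calculation with made-up stand-in numbers; the published data yielded r = -0.22, p = 0.46.

```python
from scipy import stats

# Made-up stand-ins for the 13 mean preference ratings and the 13 CU
# accuracy scores (not the published data)
preference = [6.9, 6.1, 5.8, 5.5, 5.2, 5.0, 4.8,
              4.6, 4.3, 4.0, 3.8, 3.5, 3.1]
cu_score = [76, 82, 88, 79, 91, 84, 80, 93, 95, 94, 83, 87, 90]

r, p = stats.pearsonr(preference, cu_score)
print(f"r = {r:.2f}, p = {p:.2f}")  # near zero or negative: no predictive value
```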

In the next installment of this article, I will explain why the CU loudspeaker model failed to accurately predict listeners' loudspeaker preferences, and show some new models that work much better in this regard.

Updated 1/5/2009: Today I was contacted by Consumer Reports, who informed me that since 2006 they no longer publish loudspeaker reviews based on the sound power model I tested in 2004. I was told their new model for predicting loudspeaker sound quality uses a combination of sound power and other analytics to better characterize what the listener hears in a room. In this regard, it is similar to the predictive model I developed, which I will discuss in an upcoming blog posting.

References

[1] Sean E. Olive, "A Multiple Regression Model for Predicting Loudspeaker Preference Using Objective Measurements: Part I - Listening Test Results," presented at the 116th AES Convention (May 2004).