Saturday, April 10, 2010

Evaluating the Sound Quality of iPod Music Stations: Part 2 - Listening Tests


Part 1 of this article described a listening test method used at Harman International for evaluating the sound quality of iPod music docking stations. In Part 2, I present the results of a recent competitive benchmarking listening test in which three popular Music Stations of comparable price were evaluated by a panel of trained listeners. Were listeners able to reliably formulate a preference among the different iPod Music Stations using this test method? And what were the underlying sound quality attributes that explain these preferences? Read on to find out.

Throughout this article, I will refer to slides in an accompanying PDF presentation, or you can watch a YouTube video of the presentation.


The Products Tested
A listening test was performed on three iPod Music Stations that retail for approximately the same price of $599: the Harman Kardon MS 100, the Bose SoundDock 10, and the Bowers & Wilkins Zeppelin (see slide 2). All three products provide iPod docking playback and an auxiliary input for external sources such as a CD player. The latter was used in these tests to reproduce a CD-quality stereo test signal fed from a digital sound source.


Listening Test Method
The Music Stations were evaluated in the Harman International Reference Listening Room (slide 4), described in detail in a previous blog posting. Each Music Station was positioned on a shelf attached to the Harman automated in-wall speaker mover, which provides the means for rapid multiple comparisons among the three products, all designed to be used in, on, or near a wall boundary. The Music Stations were level-matched to within 0.1 dB at the listening position by playing pink noise through each unit and adjusting its acoustic output to produce the same loudness as measured via the CRC stereo loudness meter.
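For readers curious how this kind of level matching might be automated, here is a minimal Python sketch. It assumes you have recorded pink noise from each unit at the listening position, and it uses simple broadband RMS as a stand-in for the CRC loudness measure actually used in the test; the function name and workflow are illustrative, not Harman's actual tooling.

```python
import numpy as np

def level_offset_db(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Gain in dB to apply to `candidate` so its level matches `reference`.

    Both arrays are microphone recordings of pink noise captured at the
    listening position, on the same sample scale. Broadband RMS is used
    here as a simple proxy for the loudness-meter reading.
    """
    rms_ref = np.sqrt(np.mean(reference.astype(float) ** 2))
    rms_cand = np.sqrt(np.mean(candidate.astype(float) ** 2))
    return 20.0 * np.log10(rms_ref / rms_cand)

# Adjust each unit's volume and re-measure until the offset is under 0.1 dB:
# offset = level_offset_db(noise_from_unit_a, noise_from_unit_b)
```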

All tests were performed double-blind with the identities of the products hidden via an acoustically transparent, but visually opaque screen. The listening panel consisted of 7 trained listeners with normal audiometric hearing. Each listener sat in the same seat situated on-axis to the Music Stations positioned at seated ear height, approximately 11 feet away (slide 5).

The Music Stations were evaluated using a multiple comparison (A/B/C) protocol whereby listeners could switch at will among the three products before entering their final comments and ratings for overall preference, distortion, and spectral balance. This was repeated using four different stereo music programs with one repeat (4 programs x 2 observations = 8 trials). In total, each listener provided 216 ratings (8 trials x 3 products x 9 ratings per product: one preference rating, one distortion rating, and seven spectral balance ratings), in addition to their comments. A typical test took 30-40 minutes. The presentation order of the music programs and Music Stations was randomized by the Harman Listening Test software to minimize any order-related biases in the results.
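The Harman Listening Test software is proprietary, but the randomization it performs is straightforward to illustrate. The sketch below (Python, with hypothetical names) generates the 8-trial schedule for one listener: the four programs each appear twice in shuffled order, and the mapping of products to the A/B/C switching buttons is re-shuffled on every trial so product identity stays hidden.

```python
import random

PRODUCTS = ["station_1", "station_2", "station_3"]  # blinded units
PROGRAMS = ["program_1", "program_2", "program_3", "program_4"]
REPEATS = 2                                         # each program heard twice

def make_schedule(seed: int) -> list[dict]:
    """Randomized 8-trial schedule (4 programs x 2 repeats) for one listener."""
    rng = random.Random(seed)
    trials = [p for p in PROGRAMS for _ in range(REPEATS)]
    rng.shuffle(trials)                             # randomize program order
    schedule = []
    for program in trials:
        buttons = PRODUCTS[:]
        rng.shuffle(buttons)                        # random A/B/C assignment
        schedule.append({"program": program,
                         "A": buttons[0], "B": buttons[1], "C": buttons[2]})
    return schedule
```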


Results: Overall Preference Ratings For the Music Stations
A repeated measures analysis of variance (ANOVA) was used to statistically establish the effects and interactions between the independent variables and the different sound quality ratings. The main effect was the Music Station factor, with no significant effects or interactions observed between the program material and the Music Stations. Note that in the following discussion, the brands/models of the Music Stations have been removed from the results, since this information is not relevant to the primary purpose of the research and this article. Instead, the Music Station products have been assigned the letters A, B, and C in descending order of their mean overall preference rating.
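As a rough illustration of that analysis, the sketch below runs a two-way repeated-measures ANOVA with statsmodels on a hypothetical long-format export of the ratings (one row per listener x station x program x repeat); the file name and column names are assumptions, not the actual Harman data format.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format export: one row per
# listener x station x program x repeat observation.
df = pd.read_csv("preference_ratings.csv")

# Two within-subject factors: Music Station and program. The pattern
# reported in this article would show up as a significant main effect
# of "station" and no significant station x program interaction.
result = AnovaRM(
    df,
    depvar="preference",
    subject="listener",
    within=["station", "program"],
    aggregate_func="mean",  # average over the repeat observations
).fit()
print(result)
```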

The mean preference ratings and upper 95% confidence intervals based on the 7 listeners are plotted in slide 7. Music Station A received a preference rating of 6.8 and was strongly preferred over Music Stations B (4.58) and C (4.08).


Individual Listener Preference
The individual listener preference ratings and upper 95% confidence intervals are plotted in slide 8. The intra- and inter-listener reliability of the ratings was generally quite high. All seven listeners rated Music Station A higher than the other two products, although some listeners, notably 55 and 64, were less discriminating and reliable than the other listeners. Both of these listeners had significantly less training and experience than the others, which previous studies have shown to be an important factor in listener performance.


Distortion Ratings
Nonlinear distortion includes audible buzzes, rattles, noise and other level-dependent distortions related to the performance of the electronics, transducers, and mechanical integrity of the product’s enclosure. In these tests, the average playback level was held constant (78 dB(B) slow), and listeners could not adjust it up or down. Under these test conditions, some listeners still felt there were audible differences in distortion (slide 9) with Music Station A (distortion rating = 7.19) having less distortion than Music Stations B (5.5) and C (4.94).

Some of these differences in subjective distortion ratings could be related to a “Halo Effect,” a scaling bias wherein listeners tend to rate the distortion of loudspeakers according to their overall preference ratings - even when the distortion is not audible. An example of “Halo Effect” bias was noted in a previous loudspeaker study by the author [1]. Reliable and accurate quantification of nonlinear distortion in perceptually meaningful terms will remain problematic until better subjective and objective measurements are developed.


Spectral Balance Ratings
Listeners rated the spectral balance of each Music Station across seven equally log-spaced frequency bands using a ±5-point scale. A rating of 0 indicates an ideal spectral balance, positive numbers indicate too much emphasis within the frequency band, and negative numbers indicate a deemphasis within the frequency band. Rating the spectral balance of an audio component is a highly specialized task that requires skill and practice, acquired here through Harman's “How to Listen” listener training software. A previous study [1] showed that spectral balance ratings are closely related to the measured anechoic listening window of the loudspeaker, although they may vary with changes in directivity and the ratio of direct-to-reflected sound at the listening location.
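The exact band edges are not given in the article, but seven equally log-spaced bands are easy to construct. The sketch below assumes a 20 Hz-20 kHz span; that span is my assumption, though it yields band centers near the 88 Hz and 1700 Hz values quoted below.

```python
import numpy as np

def log_spaced_bands(f_lo: float, f_hi: float, n_bands: int = 7):
    """Edges and geometric-mean centers of n equally log-spaced bands."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric mean of each band
    return edges, centers

edges, centers = log_spaced_bands(20.0, 20_000.0)
print(np.round(centers, 1))
# -> approx [32.8, 87.9, 235.7, 632.5, 1696.6, 4551.8, 12210.9]
```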

The mean spectral balance ratings averaged across all programs and listeners are plotted in slide 10. Listeners felt Music Station A had the flattest, most ideal spectral balance, apart from needing more upper and lower bass and less emphasis in the upper treble. Music Station B was judged to have too much emphasis in the upper bass (88 Hz) and too little in the upper midrange/treble. Music Station C was rated as having a slight overemphasis in the upper bass and a very uneven balance throughout the midrange, with a peak centered around 1700 Hz.


Listener Comments
Listeners provided comments describing the audible differences among the three Music Stations. The frequency with which a specific comment was used to describe each product is summarized in slide 11. The correlation between each descriptor and the product's preference rating is indicated by the correlation coefficient (r) shown in the bottom row of the table. The same table data are plotted in graphical form in slide 12.
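To make the r values concrete, here is a small Python sketch of the computation using two descriptors whose counts appear in the text below; the zero counts are illustrative placeholders, since the full table lives in slide 11.

```python
import numpy as np

preference = np.array([6.8, 4.58, 4.08])  # mean ratings for A, B, C (slide 7)

# Descriptor counts per product (A, B, C). The nonzero values are quoted
# in the text; the zeros are placeholders for the slide 11 table.
counts = {
    "neutral": np.array([16, 0, 0]),
    "colored": np.array([0, 13, 19]),
}

for word, n in counts.items():
    r = np.corrcoef(n, preference)[0, 1]  # Pearson correlation coefficient
    print(f"{word}: r = {r:+.2f}")
```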

The three most common descriptors applied to Music Station A were neutral (16), bright (9), and thin (9). These descriptors generally confirm the mean spectral balance ratings summarized in slide 10.

The three most frequent descriptors applied to Music Station B were colored (13), boomy bass (10), and uneven mids (6). The “boomy bass” is clearly suggested by the spectral balance ratings (see the large 88 Hz peak) in slide 10.

The three most frequent descriptors used to describe the sound quality of Music Station C were colored (19), uneven mids (9), and harsh (6). All three descriptors have a high negative correlation with the overall preference rating, and may explain the low preference rating this product received. The coloration and unevenness of the midrange are confirmed in the spectral balance ratings in slide 10. The harshness is most likely related to the spectral peak perceived around 1700 Hz.


Conclusions
This article summarized the results of a controlled, double-blind listening test performed on three comparably priced iPod Music Stations using seven trained listeners with normal hearing. The results provide evidence that the sound quality of Music Station A was strongly preferred over Music Stations B and C. There was strong consensus among all seven listeners, who rated Music Station A highest overall. The Music Station preference ratings can be largely explained by examining the perceived spectral balance ratings of the products, which are in turn closely related to listener comments on the sound quality of the products.

The most preferred product, Music Station A, was perceived to have the flattest, most ideal spectral balance, and elicited frequent comments about its neutral sound quality. As the spectral balance ratings deviated from flat or ideal, the products received frequent comments related to coloration, boomy bass, and uneven midrange. While the distortion ratings were highly correlated with preference, more investigation is needed to determine the extent to which they reflect a possible scaling bias known as the “halo effect.”

In part 3 of this article, I will present the objective measurements of these products - both anechoic and in-room acoustical measurements - to see if they can reliably predict the subjective ratings of the products reported here.


References
[1] Sean E. Olive, “A Multiple Regression Model for Predicting Loudspeaker Preference Using Objective Measurements: Part I - Listening Test Results,” presented at the 116th AES Convention, preprint 6113 (May 2004).

14 comments:

  1. so trying to get bass out of these small speakers by augmenting the mid-bass is a bad idea

    of course having a flat response in the mids and highs is important

    just the same as with bigger speakers

    what is the ideal bass roll-off for best integration with room gain (in this case table and wall gain), and at what frequency should it start?

  2. Hi Anonymous,

    Yes, boosting the mid bass is not a good idea since as soon as you put it against a boundary and/or desktop it will have too much bass, and listeners will downgrade it.

    I will be talking more about the acoustical interaction between the Music Station and room/wall boundaries in Part 3 of this article, where I discuss measurements. Stay tuned.


    Cheers
    Sean

  3. Hi Sean,

    do you really think people are buying these gadgets based on critical listening? Most iPod dock owners are going to boost bass and treble through some EQ preset anyway.

    Regards,
    Vuki

  4. Hi Vuki,

    I doubt that the majority of people who purchase these units make decisions based on critical listening they do themselves. It's difficult to do such listening in a store.

    However, there is sufficient research indicating that most consumers rank sound quality among the top three reasons for purchasing a particular product. If audio companies can effectively communicate to consumers that their products are best-in-class in terms of sound quality, then consumers have a compelling reason to purchase their products.

    There is a successful company whose primary slogan is "better sound through research." That alone should tell you that consumers respond to the message that good sound is important.

  5. Is that the company that produced Music Station B?

  6. Sean

    Somewhere I read that for these tests there is a standardized vocabulary. I think you said Floyd Toole was working on this project.

    Could you guide me to the correct book or paper for more information on this subject?

  7. @ Valentin,
    Floyd Toole came up with some standard sound quality terms he used for loudspeaker tests back at the NRC. You can find those in his 1985-86 AES papers on subjective/objective tests of loudspeakers. These were largely based on loudspeaker studies by the Swedish psychologist Gabrielsson and his colleagues, who did factor analysis on a large number of terms used to describe sound quality in controlled experiments. From this emerged 8-9 sound quality dimensions.

    Floyd also worked on the Harman Audio Glossary that you can find here:
    http://www.harman.com/EN-US/OurCompany/Technologyleadership/Pages/AudioGlossary.aspx


    What I would like to see developed is a standard set of sound quality terms that include audio examples that demonstrate the attribute. For example, I made a YouTube video demonstrating Bright/Dull here http://www.youtube.com/watch?v=3rxLFfCIC24

    You could have a whole set of demonstrations for Bright/Dull, Full/Thin, Coloration, Spaciousness, Envelopment, Apparent Source Width, Dynamic Compression, and different types of distortion/noise (hum, buzz, clipping, low bit-rate artifacts), etc.

    Cheers
    Sean

  8. Hi Sean,

    congratulations on this very insightful experiment. Your videos on YouTube really helped in understanding the setup. What an enormous effort you and your colleagues put into that experimental environment!

    Here are some critique points:
    - You could better explain how the subjects were recruited. It would be crucial to know that they were NOT associated in any way with your company. It would be nice to know some more background on the listeners, e.g. whether they have listened to iPod stations before or own one. How well do the subjects represent the population of buyers of such systems (young, maybe inexperienced listeners)?
    - What I found very surprising is the low variance in the overall rating. That leads one to believe that all subjects judge "overall SQ" the same way.
    - Slide 7 shows small "error" bars that (I think) indicate variance. The bars are very small. However Slide 8 shows high variance for B and C. Also Slide 8 shows "error" bars that I would not have expected for one single listener... did every listener judge multiple times?
    - The listeners are confusingly labeled 4,9,53,54,61,64 which could make one believe that you left some out... (1,2,3,4,5,6,7 would be better)
    - One thing you controlled for is the size of the room, which seems large compared to a typical listening room where one would put such a system (my belief; I cannot support it with data)

    Keep up the wonderful work, we need more experimental research in this area!

    Nico

  9. Hi Nico,

    Thanks for the nice feedback,
    1. The listeners for these tests are all Harman employees who are screened for normal hearing and then trained and selected based on their performance as listeners. We bring in listeners from time to time who are untrained just to make sure we can extrapolate their preferences to our own trained panel, and there is excellent correlation. Check out http://seanolive.blogspot.com/2008/12/part-2-differences-in-performances-of.html

    2. The low inter-intra listener variance in ratings is because of the training.

    3. I need to look into what you mean here, but I think it could be related to differences in scales.

    4. The numbers represent Listener IDs, and that is why they are not successive numbers. In publications I would probably make them successive. Thanks.

    5. The room (24' x 21' x 9') is perhaps a little larger than a typical domestic room, but it is necessary for 7-channel audio with a decent listening area for 6 listeners. We were careful not to play the iPod Stations at loud levels where they distorted.

  10. Hello,
    So H/K MS100 was rated as the best by the listeners?

  11. Dear Sean,

    So was Music Station A the Harman Kardon, and Music Station B the Bose SoundDock? I checked all your blog posts and still can't figure this out. The reason is that I have decided to buy one of these tomorrow and I'm really confused about which one. Based on the research I would like to go for Music Station A. Now if only I could figure that out!

  12. Hi Goldie,
    Yes A = Harman Kardon and B shall remain nameless.
    Cheers
    Sean

  13. Sean,

    Very insightful, I just have two questions:

    1) Were any of the listeners involved at ANY step of the product development? Someone who has spent time developing a product could easily pick out their "baby" in a blind test and therefore bias their preference toward the one they know is a Harman product...

    2) How securely are the different Music Stations tied to the desktop? It seems that you could be adding rigidity to each Station by tying it down (I know this is necessary to facilitate the motorized switching). In products which are typically susceptible to unwanted resonances, could this impact certain products more than others?

    Thanks!

  14. Thanks Joseph
    My answers are:
    1) No, none of the listeners were involved in the design of the products.
    2) We apply a strap to the top of the product to prevent it from falling off the speaker mover. When we measure the frequency response of a speaker with and without the strap in place, the measured responses are essentially the same. This suggests that the strap has no significant influence on the sound.
