Thursday, March 11, 2010

A Method for Training Listeners and Selecting Program Material for Listening Tests

The benefits of training listeners for subjective evaluation of reproduced sound are well documented [1]-[3]. Not only do trained listeners produce more discriminating and reliable sound quality ratings than untrained listeners, but they can report what they perceive in very precise, quantitative and meaningful terms.

One of the unexpected byproducts of listener training is that it identifies which music selections are most sensitive to distortions commonly found within the audio chain [4]. This is exactly what was found in a series of listener training experiments the author reported in a 1994 paper entitled, “A method for training listeners and selecting program material for listening tests.” The following sections summarize the findings of those early experiments, which helped establish an objective method for training and selecting listeners and program material used for listening tests at Harman International over the past 16 years. A slide presentation summarizing the paper can be downloaded here, and will be referred to throughout the following sections.

Matching the Sound of Spectral Distortions to Their Frequency Response Curve

A computer-based training task was designed where listeners were required to compare different spectral distortions added to programs and then match the frequency response curve of the filter that generated the distortion (see slides 4-5). This was repeated using eight different equalizations and twenty different music selections digitally edited into short 10-20 s loops.

The equalizations included ±3 dB shelving filters at low (100 Hz) and high (5 kHz) frequencies, and ±3 dB resonances (Q = 0.66) centered at 500 Hz and 2 kHz (slide 6). An unequalized version of the program (Flat) was always provided as a reference. The twenty music selections included classical, jazz and pop/rock genres with instrumentation that varied from solo instruments, speech and small combos to rock combos and orchestras (slide 7). Pink noise was also included since this continuous broadband signal has been found to produce the lowest detection thresholds of resonances in loudspeakers [5],[6].
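As a rough illustration of the eight equalizations described above, the sketch below builds them as second-order (biquad) filters using the widely published Audio EQ Cookbook formulas. The filters used in the original experiments may well have been implemented differently: the 44.1 kHz sample rate, the shelf slope (S = 1), and all function and variable names are assumptions made for illustration.

```python
import cmath
import math

FS = 44100.0  # assumed sample rate; not stated in the article


def peaking(f0, gain_db, q, fs=FS):
    """Peaking (resonance) biquad, Audio EQ Cookbook form."""
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    b = (1 + alpha * amp, -2 * cos_w0, 1 - alpha * amp)
    a = (1 + alpha / amp, -2 * cos_w0, 1 - alpha / amp)
    return b, a


def shelf(f0, gain_db, high, fs=FS):
    """Low or high shelving biquad with shelf slope S = 1 (an assumption)."""
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / 2 * math.sqrt(2)  # cookbook alpha for S = 1
    k = 2 * math.sqrt(amp) * alpha
    s = -1.0 if high else 1.0  # the cosine terms change sign for a high shelf
    b = (amp * ((amp + 1) - s * (amp - 1) * cos_w0 + k),
         s * 2 * amp * ((amp - 1) - s * (amp + 1) * cos_w0),
         amp * ((amp + 1) - s * (amp - 1) * cos_w0 - k))
    a = ((amp + 1) + s * (amp - 1) * cos_w0 + k,
         -s * 2 * ((amp - 1) + s * (amp + 1) * cos_w0),
         (amp + 1) + s * (amp - 1) * cos_w0 - k)
    return b, a


def gain_db(coeffs, freq, fs=FS):
    """Magnitude response in dB of a biquad at one frequency."""
    b, a = coeffs
    z1 = cmath.exp(-1j * 2 * math.pi * freq / fs)  # z^-1 on the unit circle
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return 20 * math.log10(abs(num / den))


# The eight equalizations; the unequalized (Flat) reference needs no filter.
EQUALIZATIONS = {
    "low shelf +3 dB @ 100 Hz": shelf(100, +3, high=False),
    "low shelf -3 dB @ 100 Hz": shelf(100, -3, high=False),
    "high shelf +3 dB @ 5 kHz": shelf(5000, +3, high=True),
    "high shelf -3 dB @ 5 kHz": shelf(5000, -3, high=True),
    "peak +3 dB @ 500 Hz": peaking(500, +3, 0.66),
    "peak -3 dB @ 500 Hz": peaking(500, -3, 0.66),
    "peak +3 dB @ 2 kHz": peaking(2000, +3, 0.66),
    "peak -3 dB @ 2 kHz": peaking(2000, -3, 0.66),
}
```

Calling gain_db on any entry over a range of frequencies traces out the kind of frequency response curve that listeners were asked to match to what they heard.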

Eight untrained listeners with normal hearing participated in the training exercises, which were conducted over five separate listening sessions consisting of 32 trials each (slides 8 and 9). The presentation order of the equalizations, trials, and programs was randomized to prevent any order-related biases. Each listener’s performance was based on the percentage of correct responses given over the course of the five training sessions.
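The randomization and scoring described above can be sketched as follows. This is a simplified model, not the original training software: it assumes each trial pairs one program with one equalization, and that a session crosses four programs with the eight equalizations to arrive at 32 trials. The labels and function names are hypothetical.

```python
import random

# Hypothetical short labels for the eight equalizations.
EQS = ["LS+3", "LS-3", "HS+3", "HS-3", "PK500+3", "PK500-3", "PK2k+3", "PK2k-3"]


def build_session(programs, eqs, rng):
    """Return every (program, equalization) pair in a random order."""
    trials = [(program, eq) for program in programs for eq in eqs]
    rng.shuffle(trials)  # randomized order guards against order-related bias
    return trials


def percent_correct(responses):
    """responses: list of (presented_eq, chosen_eq) pairs."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(1 for presented, chosen in responses if presented == chosen)
    return 100.0 * hits / len(responses)


# Assumed session layout: 4 programs x 8 equalizations = 32 trials.
session = build_session(["P1", "P2", "P3", "P4"], EQS, random.Random(2010))
```

Passing an explicit random.Random instance keeps a session reproducible, which helps when the same trial order has to be reconstructed later for analysis.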

The Results

The training results were statistically analyzed using a repeated measures analysis of variance (ANOVA) to determine the effect the different music programs, equalizations, and trials had on the listeners’ performance in correctly identifying the different equalizations (slide 11).
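For readers unfamiliar with the technique, here is a minimal sketch of a repeated measures ANOVA with a single within-subject factor. The analysis in the paper involved several factors (program, equalization, trial) and their interactions, so this shows only the core idea: each listener's overall level is removed before the condition effect is tested. The function name and the toy scores are made up for illustration and are not data from the study.

```python
def rm_anova_oneway(scores):
    """One-way repeated measures ANOVA on a balanced table.

    scores[s][c] is the score of subject s under condition c.
    Returns (F, df_condition, df_error).
    """
    n_subj = len(scores)
    n_cond = len(scores[0])
    if n_subj < 2 or n_cond < 2 or any(len(row) != n_cond for row in scores):
        raise ValueError("need a balanced table with >= 2 subjects and conditions")

    grand = sum(map(sum, scores)) / (n_subj * n_cond)
    cond_means = [sum(row[c] for row in scores) / n_subj for c in range(n_cond)]
    subj_means = [sum(row) / n_cond for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_cond = n_subj * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = n_cond * sum((m - grand) ** 2 for m in subj_means)
    # Subject-to-subject differences are removed from the error term.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = n_cond - 1
    df_error = (n_subj - 1) * (n_cond - 1)
    if ss_error == 0:
        return float("inf"), df_cond, df_error
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


# Toy table (illustrative only): 3 listeners x 3 conditions.
f_stat, df1, df2 = rm_anova_oneway([[2, 4, 6], [3, 4, 8], [4, 7, 7]])
```

Turning the F statistic into a p-value requires the F distribution, which a statistics package would normally supply.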

Listener Performance Is Strongly Influenced by Program Selection

The single largest effect on the listener’s performance was the program selection. Slide 13 plots the mean listener performance scores for each of the twenty programs averaged across all eight equalizations. The percentage of correct responses ranged from a high of 88% (pink noise) to a low of 54% (jazz piano trio). Listeners performed the task best when using broadband, spectrally dense continuous signals like pink noise or pop/rock selections like Tracy Chapman, Little Feat, and Jennifer Warnes. Listeners performed worse on programs featuring solo instruments, small combos and speech that produced more discontinuous and narrow band signals. More about this later.

Equalization Context Influences Listener Performance

The effect of equalization on listener performance was surprisingly small (slide 14). Listeners tended to identify the spectral distortions at low and high frequencies more accurately than the midband equalizations. The explanation can be found in the interaction between equalization and trial, which indicates that listener performance depended on which combinations of equalizations were presented within a trial. In other words, the context in which an equalization was presented influenced listener performance (slide 15). These contextual effects can be summarized as follows:

  1. Listeners gave more correct responses when the presented equalizations were more separated in frequency.
  2. Listeners gave more correct responses when presented with spectral boosts rather than notches; spectral notches were often confused with spectral peaks located at slightly higher frequencies.
  3. Low frequency boosts were often confused with high frequency cuts (and vice versa).
  4. Low frequency cuts were often confused with high frequency boosts (and vice versa).

Greater frequency separation between equalizations would produce more distinctive tonal or timbral differences, which would help identification. The second observation confirms previous research that has found spectral notches are more difficult to detect than spectral peaks of similar bandwidth [5]. The one exception is broadband dips, which have detection thresholds similar to those of resonance peaks with equivalent bandwidth [6]. Observations 3 and 4 are related to each other, and are more difficult to explain. At first glance, it seems implausible that boosts and cuts five octaves apart could be confused with one another. A possible explanation is that listeners are using information across the entire bandwidth to judge the perceived balance of the bass and treble. In this case, the slope or shape of the spectrum must be an important factor (slide 16). Since a boost at one end of the audio bandwidth and a cut of similar magnitude at the other end produce similar broadband shapes or slopes, this might explain why listeners confuse the two.
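The slope argument can be made concrete with a small numerical sketch. It uses idealized shelf shapes (not the filters from the experiment) and fits a straight line to gain in dB against log frequency from 20 Hz to 20 kHz. The shelf model, the fitting range, and all names are assumptions chosen for illustration.

```python
import math


def low_shelf_db(freq, gain_db, corner=100.0):
    """Idealized low shelf: close to full gain below the corner, none far above."""
    r2 = (freq / corner) ** 2
    return gain_db / (1 + r2)


def high_shelf_db(freq, gain_db, corner=5000.0):
    """Idealized high shelf: no gain far below the corner, close to full gain above."""
    r2 = (freq / corner) ** 2
    return gain_db * r2 / (1 + r2)


def tilt_db_per_octave(response, f_lo=20.0, f_hi=20000.0, points=200):
    """Least-squares slope of gain (dB) against log2(frequency)."""
    step = math.log2(f_hi / f_lo) / (points - 1)
    xs = [math.log2(f_lo) + i * step for i in range(points)]
    ys = [response(2 ** x) for x in xs]
    mean_x = sum(xs) / points
    mean_y = sum(ys) / points
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


bass_boost = tilt_db_per_octave(lambda f: low_shelf_db(f, +3))
treble_cut = tilt_db_per_octave(lambda f: high_shelf_db(f, -3))
bass_cut = tilt_db_per_octave(lambda f: low_shelf_db(f, -3))
treble_boost = tilt_db_per_octave(lambda f: high_shelf_db(f, +3))
```

In this model the bass boost and the treble cut both tilt the overall response downward toward high frequencies, while the bass cut and the treble boost both tilt it upward, which matches the pairs of equalizations that listeners confused.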

Program and EQ Interact to Influence Listener Performance

There was also a significant interaction between program and equalization that affected listener performance. This interaction effect was most apparent for the programs presented in training session 3 where listener performance varied significantly depending on the combination of programs and equalization presented to the listener (slide 18). It seems plausible that these differences were related to differences in the spectra of the programs, which was confirmed by plotting the average 1/3-octave spectra of the four programs (slide 19). The largest listener response errors tended to occur when the equalization fell in a frequency range where there was little spectral energy in the programs (e.g. Programs P10 (Stan Getz) and P19 (Canadian Brass)). It makes sense that listeners cannot easily analyze the spectral distortions if the program material does not contain signals that make them audible.

Not All Listeners Are Equal to the Task

No amount of training will make me eligible for the Canadian Olympic hockey team - even if I were 25 years younger. Some people simply lack the innate mental and physical raw material to perform a highly specialized task. This is also true for critical listening, as illustrated by the average performance scores of the eight listeners after five listening sessions (slide 20). Individual listener performance ranged from 82% (listener 4) to 31% (listener 3). All listeners had normal hearing. Therefore, the reason for this large inter-listener variance in performance is related to other factors such as listener motivation, attentiveness, and listening (and general) intelligence. Training data such as this can provide an objective, quantifiable metric for selecting the best listeners for audio product evaluations.

Practice Makes Perfect

A listener training task succeeds only if it leads to measurable improvement in performance with repetition. Slide 21 shows listener performance measured over five training sessions for the eight listeners tested. The graph shows monotonic improvement in performance from 65% correct responses to 80% after five training sessions. Additional training sessions would most likely realize further gains in performance for some subjects. In other words, the training works!

Programs With Wider and Flatter Spectrums Improve Listener Performance (Why Tracy Chapman is as Good as Pink Noise)

Spectrum analysis was performed on the different program selections to see if this could explain the strong effect of program on listener performance. The 1/3-octave spectrum of each program was plotted based on a long-term average taken over the entire length of the loop. When we looked at the spectrums of the programs it became clear that this was a significant predictor of how well listeners would perform their task.

Slide 22 plots the average spectrum of four groups of programs (5 programs in each group), rank ordered from highest to lowest according to the listener performance scores they produced. It clearly shows that the programs with the flattest and most extended spectrums (e.g. pink noise, pop/rock, full orchestra) were better suited for identifying spectral distortions. After pink noise, Tracy Chapman (program 2 in the above graph) had among the widest and flattest spectrums measured, and along with pink noise (program 1) registered the highest listener performance scores. Programs that had narrow band spectra with limited energy above and below 500 Hz (speech, solo instruments, small jazz and classical ensembles), concentrated in group 4, were less suited for identifying spectral distortions. While these groupings had some of the most musically entertaining selections, in the end, they were not good signals for detecting and characterizing spectral distortions in audio components.
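One way to put numbers on "wide and flat" is sketched below, starting from a program's long-term 1/3-octave band levels. The two descriptors, the 9 dB window, and the two spectra are assumptions for illustration: the spectra are synthetic stand-ins generated from formulas, not the measurements shown in the slides.

```python
import math

# 1/3-octave band centres on a base-2 series, roughly 20 Hz to 20 kHz (31 bands).
CENTRES = [1000.0 * 2 ** (n / 3) for n in range(-17, 14)]


def spread_db(levels):
    """Standard deviation of the band levels; 0 dB for a perfectly flat spectrum."""
    mean = sum(levels) / len(levels)
    return math.sqrt(sum((lv - mean) ** 2 for lv in levels) / len(levels))


def bands_within(levels, window_db=9.0):
    """Count the bands lying no more than window_db below the strongest band."""
    peak = max(levels)
    return sum(1 for lv in levels if peak - lv <= window_db)


# Synthetic stand-ins, NOT measured data:
# pink noise carries equal energy in every 1/3-octave band ...
flat = [0.0 for _ in CENTRES]
# ... while this narrow-band example falls 6 dB per octave away from 500 Hz.
narrow = [-6.0 * abs(math.log2(f / 500.0)) for f in CENTRES]

flat_summary = (spread_db(flat), bands_within(flat))
narrow_summary = (spread_db(narrow), bands_within(narrow))
```

A program that scores a low spread and a high band count by these measures excites every region where an equalization might fall, which is the property the best-performing programs shared.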


A listener training method has been described that teaches listeners how to identify spectral distortions according to their frequency response curve. Experimental evidence was shown indicating listeners improved their performance in this task after 5 training sessions, although not all listeners are equal in their performance.

Statistical analysis of the training data revealed that the program selections are the largest factor influencing listener performance in this task: programs with continuous broadband spectra (e.g. pink noise, Tracy Chapman, etc.) provide the best signals for characterizing spectral distortions, whereas programs with narrow band spectra (e.g. speech, solo instruments) provide poor signals for performing this task. Furthermore, listeners seem to confuse certain types of spectral distortions with others when the distortions presented share similarities in their frequency, bandwidth, and broadband spectral slope or shape.

Finally, it is important to remember that the training methods and programs discussed in this study focussed on perception and analysis of spectral distortions. While these types of distortions are the most dominant ones found in loudspeakers, microphones and listening rooms, there are other types of distortions for which a different set of programs is likely better suited to revealing their audibility and supporting their subjective analysis. The current Harman listener training software “How to Listen” includes training tasks on spectral distortion as well as spatial, dynamic and various types of nonlinear distortions, for which we hope to discover the optimal programs for detecting and analyzing their audibility. Stay tuned.


  1. Olive, Sean E., “Differences in Performance and Preference of Trained Versus Untrained Listeners in Loudspeaker Tests: A Case Study,” J. Audio Eng. Soc., vol. 51, no. 9, pp. 806-825 (September 2003). Download for free here, courtesy of Harman International.
  2. Bech, Soren, “Selection and Training of Subjects for Listening Tests on Sound-Reproducing Equipment,” J. Audio Eng. Soc., vol. 40, no. 7/8, pp. 590-610 (July 1992).
  3. Toole, Floyd E., “Subjective Measurements of Loudspeaker Sound Quality and Listener Performance,” J. Audio Eng. Soc., vol. 33, pp. 2-32 (Jan./Feb. 1985).
  4. Olive, Sean E., “A Method for Training Listeners and Selecting Program Material for Listening Tests,” presented at the 97th AES Convention, preprint 3893 (November 1994).
  5. Toole, Floyd E. and Sean E. Olive, “The Modification of Timbre by Resonances: Perception and Measurement,” J. Audio Eng. Soc., vol. 36, pp. 122-142 (March 1988).
  6. Olive, Sean E.; Schuck, Peter L.; Ryan, James G.; Sally, Sharon L.; Bonneville, Marc E., “The Detection Thresholds of Resonances at Low Frequencies,” J. Audio Eng. Soc., vol. 45, no. 3, pp. 116-128 (March 1997).
  7. Olive, Sean E., “Harman’s How to Listen - A New Computer-based Listener Training Program,” May 30, 2009.


  1. "program selections are the largest factor influencing listener performance in this task: programs with continuous broadband spectra (e.g. pink noise, Tracy Chapman,etc) provide the best signals for characterizing spectral distortions whereas programs with narrow band spectra (e.g. speech, solo instruments) provide poor signals for performing this task."

    The above is another reason why the typical "auditioning" process advocated by audiophiles is so flawed: people choose the wrong music.

    Before reading your article, I would have thought that solo instruments, voices, and small instrumental combos were particularly revealing of flaws. Just goes to show you what a little science can reveal -- that your intuition is incorrect by 180 degrees.

  2. Would love to try the software out and become a trained listener.

  3. Hi Anonymous,
    It's coming soon (sometime in May). Honestly..


  4. Hi Mark,

    Yes, indeed. There are many scientific papers in the literature where they use speech for sound quality experiments -- one of the least revealing programs of spectral distortions found in these tests.

    However, narrow band, spectrally sparse programs are useful for detecting other types of distortions and artifacts from perceptual codecs (MP3), mechanical buzzes/rattles, electrical hum, etc., because they provide little frequency and temporal masking of the artifacts. But they are not so good at revealing the most dominant distortions in loudspeakers and rooms: resonances and other frequency response problems.

  5. Hi Sean,

    I love your articles. But, since I'm not able to participate in your studies I can at least afford to criticize ;)
    What puzzles me about all this trained listeners stories is what's their purpose. As I can tell listeners are trained to some reasonably flat measuring loudspeakers (or headphones?)? So, I understand that they are pretty good at discerning freq. resp aberrations and distortions, but their reference is reproduced sound? I know it's again about "circle of confusion", but shouldn't those listeners have live, unamplified music as the reference?

    And off-topic question - since Revel Salon 2 and JBL Everest are both from Harman's stable are they designed by the same "principles"? Because listening to them at last year High-End in Munich I got impression that their balance is quite different.

    Best regards,

  6. Hi Vuki,

    The problem with using live acoustic music as a reference is that the recordings used to judge loudspeakers don't sound like these live events; indeed, most recordings of popular music purchased today are studio or ProTools creations that were never intended to have any resemblance to a live acoustic event.

    Secondly, we use a multiple comparison (A/B/C/D) loudspeaker paradigm, which allows listeners to separate many of the distortions in the recordings from those in the loudspeakers. It becomes moot whether the recording is perfect, although we try to select recordings that are well-balanced and relatively neutral. Also, the listener training process familiarizes the listeners with the idiosyncrasies of the recordings to minimize loudspeaker-program interactions.

    The Revel Salon 2 (cone and dome with wide dispersion waveguide) and the JBL Everest (compression driver with constant directivity horn) are quite different loudspeaker designs with slightly different design objectives, so, not surprisingly, they sound different due to differences in their measured directivities, bass extension, and nonlinear distortion; apart from that they have the same goals of flat on-axis and smooth off-axis response.

  7. My curiosity is killing me
 
    what are the slightly different design objectives
 
    sorry for the off topic

  8. Hi Valentin,

    I would say there is more emphasis on max SPL for the Everest, achieved through its high efficiency compression driver/constant directivity horn. If you've heard these speakers, you will notice immediately that they will play very loud with very low distortion. Beyond that the design goals are much the same.


  9. Thanks again Sean for allowing us a look through your window. This is a level of transparency unseen in this industry.

    Slide 13 Program Material #2 and #5 are the same with different results. Is this part of the observed increases in listener training?

    Also, regarding the section Equalization Context Influences Listener Performance, I can imagine that an auto eq with an ability to recognize program material and set up the appropriate eq for that material would be worth the effort. At least academically.

  10. DR. Sean thanks

    in Doctor Floyd book in page 399 figure 18.17
    he has a comparison of horn and direct radiator systems (JBL 1400 array, Revel studio, i think)

    and states that in double blind tests both would get equally good scores and would vary only with some "Program Material" i suppose this has to do with the directvity of the systems and the interaction of the room

    of course power is not an issue in this tests since both are powered at the same level

    so program as you state is extremely important in the lab results

    PD What is the reference level at Harman test and is there a test at higher volume does this change results

  11. Hi Sean,

    I still don't understand what's the point of having the listeners trained to some loudspeaker reference? As I understand it, they are some form of unreliable spectral analyzer.

    Best regards,

  12. Hi Vuki,

    The listener training teaches listeners to graphically draw the perceived spectral balance of any audio system (home, car, professional) in a more direct and efficient way compared to using descriptors such as full/thin, bright/dull, mid emphasis/de-emphasis, coloration, etc.

    After training, listeners become reliable spectrum analyzers.

  13. Hello Sean, can you please define spectral distortions? Thanks, Jason

  14. Hi Anonymous,

    Spectral distortions in the context of these experiments means linear distortions produced by the different equalizations. The equalizations change the frequency response of the playback chain and alter the original spectrum of the music signals evaluated.

  15. Thanks Sean, I hate to keep knocking at the door, but I really want to fully understand this...
    why linear distortions and not nonlinear or distortion?

  16. Hi Anonymous,

    In these experiments, we only studied linear distortions that are time and level invariant. I used the term "spectral distortion" because the training task was focussed on listeners learning to classify the distortions based on frequency/amplitude/Q.

    Nonlinear distortions are different in that the properties of the distortions vary with input level and can change over time (e.g. a loudspeaker heating up). Nonlinear distortions create new spectral components that were not in the original signal. I suppose you could also call that spectral distortion but that's not what I'm referring to.

  17. "It's coming soon (sometime in May). Honestly.."


    Training software

    cant wait :)

  18. "It's coming sometime in May."

    I've been patiently awaiting its arrival. Any updates?

  19. Nearing the end of May. Should we see this software soon?

  20. so now it's October
 
    will the software be coming out or has it been canceled
    i would really like to use it



  21. Hi Valentin,
    Our plans to release "How to Listen" were not cancelled - just delayed due to various reasons. We are aiming to release it in December.