Friday, July 9, 2010

Why Live-versus-Recorded Listening Tests Don't Work


Figure 1: Singer Frieda Hempel conducting a Tone Test at Edison Studios, NYC in 1918. Note that many of the listeners' ears are covered by the blindfolds, making it a double-blind and double-deaf listening test, since the experimenter, Edison, was deaf himself.


Recently I was asked how I could possibly prove or assert that listeners prefer accurate loudspeakers without having performed a live-versus-recorded listening test. This is a test where the listener compares a live musical performance to a recording of the performance reproduced through loudspeakers. The closer the sound quality of the reproduction is to that of the live performance, the more accurate the loudspeaker is deemed to be - at least in theory. In practice, these tests are usually riddled with so many uncontrolled listening test nuisance variables that the results are essentially meaningless. This article examines why live-versus-recorded listening tests are not suitable for serious scientific investigations of the perceived sound quality of recorded and reproduced sound.


Edison’s Tone Tests: “People will hear what you tell them to hear”
Thomas Edison was among the first audio engineers to embrace live-versus-recorded demonstrations. In 1910, he invented the Edison Diamond Disk Phonograph, which he claimed had “no tone” of its own. To prove it, a series of road shows involving 4,000 live-versus-recorded demonstrations of his phonograph was conducted in auditoriums across the United States. At some point during the live music performance there would be a switch over to the recorded performance, and apparently audience members could not tell the difference between the live and recorded performances.

After a 1916 live-versus-recorded demonstration in Carnegie Hall, the New York Evening Mail stated “the ear could not tell when it was listening to the phonograph alone, and when to actual voice and reproduction together. Only the eye could discover the truth by noting when the singer’s mouth was open or closed” [1].


By today’s standards, the fidelity of Edison’s disc phonograph was egregious in terms of its noise, distortion, limited dynamic range, bandwidth and frequency response (you can hear some of Edison’s recordings online here). It’s hard to imagine that listeners were fooled into thinking his Diamond Disk recording was indistinguishable from the live performance. In fact, we now know that Edison manipulated the tests to produce the results he wanted. First, he carefully chose the music and musicians to work within the technical limitations of his technology. Edison detested music with extreme dynamics, high tones, vibrato and complex textures because they were a challenge to his deafness and his Tone Tests. He selected and coached musicians to mimic the sound of their recordings to minimize the audible differences between live and recorded performances [1],[2].


Second, Edison was the consummate audio salesman and was known to say, “People will hear what you tell them to hear” [2]. The expectations and perceptions of his listeners were manipulated before the test to produce a more predictable outcome. Audience members were given a concert program before his Tone Tests that clearly told them exactly what they would hear, how amazing it would sound, and what an appropriate response would be:


“Those who hear this test will realize fully for the first time how literally true it is that Mr. Edison has made possible the re-creation of the artist’s voice. No more exacting test could be made to demonstrate that the New Edison actually does re-create the voice of the artist than to play it side by side with the artist who made the records. This is the final proof. Close your eyes. See if you can distinguish the voice of the New Edison from that of the artist. Did you ever believe it possible to re-create a voice? Note that the voice of the artist and the voice of the Edison are indistinguishable” [emphasis is mine] [3].


Figure 2: Another Edison Tone Test where extraneous biases related to sight and smell may have compromised the results based on the large number of listeners covering their noses. Perhaps a bad case of singer's halitosis made it possible to identify the live performance from the recorded one based on smell alone?


Other Live-Versus-Recorded Demonstrations

Following Edison’s live-versus-recorded demonstrations, other tests were conducted in the 1950s by Harry Olson at RCA, G.A. Briggs at Wharfedale, and Peter Walker at Quad [4]. A common problem with these demonstrations was double reverberation: the reverberation of the room was heard both in the recording, and again when it was reproduced through loudspeakers in the same room. This made it easier for listeners to tell the difference between the recorded and live performances.


Acoustic Research's Live-Versus-Recorded Demonstrations

During the 1960’s, Acoustic Research (AR), an American loudspeaker company, performed over 75 live-versus-recorded concerts in cities around the USA featuring The Fine Arts String Quartet, and the AR-3 loudspeaker [5],[6]. To solve the double reverberation problem, the recordings of the quartet were made in an anechoic chamber, or outdoors. Outdoor live-versus-recorded demonstrations had the added benefit that there were no room reflections in either the recording or the live performance. This made the demonstrations less sensitive to off-axis problems in the microphones and loudspeakers. It also relaxed the demands on the recording-reproduction to accurately capture and reproduce the complex spatial properties of a reverberant performing space.


The AR demonstrations apparently generated an enormous amount of free publicity in newspapers and audio magazines where it was reported that the reproduction of the recordings was virtually indistinguishable from the live performance. AR sales increased dramatically, to the point where in 1966 AR apparently owned 32% market share of loudspeakers sold in the United States.



A Live-Versus-Recorded Method For Testing Loudspeaker Accuracy

Edgar Villchur, head of Acoustic Research, to his credit, was a firm believer that loudspeakers should accurately reproduce the art (the recorded music) and not editorialize or enhance it. In a 1962 paper, he described a live-versus-recorded method for evaluating the accuracy of loudspeakers [7]. The method used a reference loudspeaker (the live performance) that was placed in the listening room with the loudspeaker-under-test. The goal of the loudspeaker-under-test was to accurately reproduce a previous recording of the reference loudspeaker playing white noise in an anechoic chamber. The original white noise signal was also fed to the reference loudspeaker during the listening test. The more similar the loudspeaker-under-test sounded to the reference speaker, the more accurate it was deemed to be, at least in theory.


Villchur acknowledged that the sensitivity and validity of the method depended on the quality of the reference loudspeaker, its directivity, and the choice of program material. White noise was more revealing of loudspeaker inaccuracies than music. His reference loudspeaker consisted of a single 2-inch midrange from an AR-3 loudspeaker, selected because he found that using multiple drivers caused acoustical interference that was audible in the anechoic chamber, but not so audible in a reverberant listening room; these differences would produce errors in the listening test. One wonders how a tiny 2-inch driver could have produced adequate high treble and low bass without distortion. These limitations would significantly limit the accuracy and usefulness of this listening test method.


Another problem with this method was that the anechoic loudspeaker recordings were made at a single point in space, and did not capture the directivity and off-axis characteristics of the reference loudspeaker. Unless the speaker-under-test had the same directivity and off-axis characteristics as the reference loudspeaker, it could never sound exactly the same in a reflective listening room. To compensate for these errors, Villchur used a trial-and-error process to find the best microphone position relative to the reference loudspeaker, where the timbre of the anechoic recording best matched the timbre of the reference loudspeaker when placed in a room. Adjusting the recording to mimic the sound of the live performance was the reverse of what Edison’s musicians did, but essentially it produced the same bias. (Edison would have been proud!)


Finally, it is not clear how Villchur controlled loudspeaker positional biases when comparing the reference loudspeaker to the loudspeaker-under-test. Loudspeaker positional biases have been shown to produce audible effects that are sometimes larger than the audible differences between different loudspeakers under test [9]. At Harman, these positional biases are eliminated via an automated speaker shuffler that places each loudspeaker in the same position of the room.


Summary of Problems with Live-versus-Recorded Tests

By today’s standards, the live-versus-recorded tests performed to date lack the necessary scientific controls and rigor to consider their results or conclusions accurate, repeatable and valid. Below are a few of the most significant psychological, physical, methodological or experimental listening variables that plague these types of tests. While it is possible to control some of these variables, others are either impossible, impractical or too expensive to control.



Sighted and Cross-Modality Biases

To date, most of the live-versus-recorded tests have been performed sighted, where non-auditory cues were available to allow the listener to identify whether they were hearing the live or reproduced sound source. These tests could have been easily made blind via an acoustically transparent curtain; however, scientific validity was apparently not the primary purpose of the test. The visual cues from the musicians (bowing, lip syncing) would also enhance the realism and presence of the reproduction, a well-known cognitive effect observed in research of binaural and virtual reality displays.


Listener Expectation, Authority Bias, Group Interaction Bias

In many of the public live-versus-recorded demonstrations, listeners' expectations were manipulated by information given to them by the organizers. In some cases, listeners were told what the expected response should be before the test began (see Edison's concert programs above). In large group settings, listeners' responses can be easily swayed by the opinions and reactions of other members of the group (a herd mentality), especially when an authority figure is present. In principle, these biases can be removed by testing each listener individually. In practice, the live and recorded performances would have to be replicated for every listener, which makes the tests too difficult, expensive, time-consuming, and impractical to use.


Qualifications of Listeners

None of the live-versus-recorded tests I've read about have reported the hearing and critical listening qualifications of the listeners who participated in them. These are important variables in the sensitivity and reliability of the test results, and can be easily quantified.


Live and Recorded Performances Must Be Identical

For live-versus-recorded tests to be valid, the live and recorded performances should be identical, having the same notes, intonation, tempo, dynamics, loudness, balance between instruments, and the same location and sense of space of the instruments. Otherwise, there are extraneous cues that allow listeners to readily identify the live and recorded performances. MIDI-controlled instruments (e.g. player pianos) are but one example of how this problem could be resolved.


Positional Biases from Live and Reproduced Sound Sources

Unless the live and reproduced (e.g. loudspeakers) sound sources occupy the same physical locations, the listener can always identify the live versus recorded versions based on the localized positions of the sound sources.


Errors in the Recording

The usefulness of live-versus-recorded methods for perceptual measurements of sound quality in the playback chain is severely limited by errors in the recording. The recording errors are not easily separated from the errors in the playback chain (see circle-of-confusion). Microphones and microphone techniques both contain errors that limit the timbral, spatial and dynamic accuracy of the recordings through which we judge loudspeakers. Apparently the most effective live-versus-recorded demonstrations were conducted outdoors - effectively an anechoic environment - where the off-axis performances of the microphones and loudspeakers, and the complex spatial cues of a reflective room were largely removed as factors from the experiment. However, results from outdoor live-versus-recorded tests cannot be generalized to how the loudspeakers would perform in real rooms, where the off-axis sounds provide a significant contribution towards the listener's impression of the loudspeaker.


Lack of Proper Scientific Protocols, Listener Response Data, Statistical Analysis, Results

The most interesting characteristic of live-versus-recorded tests is that they never seem to provide listener response data, statistical analysis or published results. Eyewitness reports written in newspapers or magazines do not constitute scientific evidence.
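As a minimal sketch of the kind of analysis these demonstrations never reported, the Python snippet below computes an exact one-sided binomial p-value for a blind live-versus-recorded identification task. The function name, the trial counts, and the 50% chance level are illustrative assumptions, not data or methods from any of the tests discussed here.

```python
from math import comb


def binomial_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """Exact one-sided p-value: probability of at least `correct` right
    answers in `trials` independent trials if the listener is only guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical data: one listener, 20 blind trials, 14 correct identifications.
p = binomial_p_value(correct=14, trials=20)
print(f"p = {p:.4f}")  # p = 0.0577
```

In this made-up case, 14 correct answers out of 20 would not quite reach the conventional 5% significance level, so the result could not rule out guessing. A show of hands in a concert hall provides none of the information needed for even this simple calculation.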


Accuracy is Not Applicable to Most Recordings Made Today

Most recordings made today are not intended to sound like the live performance. Anyone who heard Taylor Swift's live performance with Stevie Nicks at the 2010 Grammy Awards understands why. (Note: you can relive the magical moment on YouTube. Warning: this may be offensive for the musically-inclined). About 90% of commercial recordings are studio creations consisting of a series of overdubs, processed with auto-tuning, equalization, dynamic compression, and reverb sampled from an alien nation. For these recordings, there is no equivalent live performance to which the recording/reproduction can be compared for accuracy. The only reference is what the artist heard over the loudspeakers in the recording control room. If the important performance aspects of the playback system through which the art (the music and recording) was created can be reproduced in the home, then the consumer will hear an accurate reproduction of the music, as the artist intended. It is possible to achieve this if we adopt a science in the service of art philosophy towards audio recording and reproduction.


Conclusions

In reviewing the history of live-versus-recorded tests, most have been performed as elaborate sales and marketing demonstrations designed to fool listeners into believing that a product sounded much better and more accurate than it actually was. While live-versus-recorded tests have proven their merit as an effective marketing and sales tool, they have not yet proven themselves as a serious method for scientific experiments intended to advance our psychoacoustic understanding of music recording and reproduction.


The reason for this, I believe, is that live-versus-recorded tests do not adequately control important listening test nuisance variables, a prerequisite for accurate, reliable and scientifically valid results. It is not entirely coincidental that (to my knowledge) none of the live-versus-recorded tests to date has produced a single scientific publication or new psychoacoustic knowledge.


Hopefully, you now understand why I don’t conduct live-versus-recorded loudspeaker listening tests.


References

[1] Harvith, J., and Harvith, S. Edison, Musicians and the Phonograph: A Century in Retrospect, Greenwood Press, N.Y. (1987).

[2] Andre Millard, “Edison’s Tone Tests and the Ideal of Perfect Sound Reproduction,” from “Lost and Found Sounds”, NPR.

[3] Program for Edison Demonstration http://www.nipperhead.com/old/tonetest04.htm

[4] Wharfedale History: http://www.wharfedale.co.uk/About/History/tabid/66/Default.aspx

[5] Acoustic Research http://en.wikipedia.org/wiki/Acoustic_Research

[6] Edgar Villchur, http://edgarvillchur.com/

[7] Villchur, Edgar, “A Method of Testing Loudspeakers with Random Noise”, J. Audio Eng. Society, Vol. 10, Issue 4, pp. 306-309 (October 1962).

[8] Kissinger, John R. “The Development of the Simulated Live-vs-Recorded Test into a Design Tool”, presented at the 35th AES Convention, preprint 609 (October 1968).

[9] Olive, Sean E.; Schuck, Peter L.; Sally, Sharon L.; Bonneville, Marc E. “The Effects of Loudspeaker Placement on Listeners' Preference Ratings”, JAES Volume 42 Issue 9 pp. 651-669; September 1994.


12 comments:

  1. That's great Sean, I love the "people will hear what you tell them to hear." Hence the rise of mp3 and Bose...

    Thanks, Bill Jehle

  2. Hi, an excellent article as usual, much appreciated. I am not sure that criticising the way it has been done in the past is a valid reason not to do it properly yourself.

    I would be interested in your thoughts re: playback-of-recording-of-live-performance vs playback-of-recording-of-loudspeaker-playing-recording-of-live-performance. How close does this come to assessing the loudspeaker vs the live performance?

    Having said that, I cannot see any way around the different polar patterns of live music compared to any particular loudspeaker. If a piano and a violin, playing the same note, have different polar patterns, a loudspeaker cannot play back the same note with different polar patterns for each instrument. Hence live-vs-speaker is invalid.

    Grant

  3. Hi Grant,

    Thanks for your feedback. I thought it was appropriate to discuss how the method has been used in the past, with some constructive criticism. I find it interesting, and perhaps noteworthy, that the method has been used to fool people into believing something is better than it actually is. It would be interesting to explore more why that is.

    I did offer some suggestions how the method could be improved upon. However, I don't see myself ever using the method since I don't think it is very sensitive or practical for evaluating loudspeakers. The main problem is that no recordings or recording methods come close to capturing what you perceive in a live performance, so it seems moot to use this method for evaluating loudspeakers. I think binaural recordings probably come closest to capturing/reproducing the live performance.

    You are right about the mismatch between the directivity of instruments versus loudspeakers, and how that invalidates these experiments.

    The directivity of a typical forward facing loudspeaker very much approximates a human voice based on measurements of human voice at NRC. So I think Edison was probably smart in choosing singers for these experiments. The other way to get around this problem is to record and reproduce outdoors or in an anechoic chamber, which is what AR did with the live-vs-recorded demos of the Fine Arts Quartet.

  4. This will sound odd, but I think when the engineer cranks up the oboe in a Mahler symphony, that's like concert hall listening - actually, something between fulfilling Mahler's intentions and the "creative" listening you naturally are doing at a concert.

    For sure, can't put Ozawa doing Berlioz' Requiem in Massey Hall into my living room, but something is going right with hifi.

    I think Villchur was on the right track looking for a reference - I use an AR-1W in my system. I suggested to my client once they carry around a mechanical alarm clock and a recording of same for reference vis a vis noise annoyance from streetcars.

    Ben

  5. Hi Anonymous,

    I don't disagree with you in that recording engineers will make adjustments (e.g. add close mics) to compensate for limitations in 2-channel stereo, hall acoustics, restrictions in where they can place their mics, etc. These adjustments can make the recording sound more natural or pleasing.

    We are surrounded in life by lots of natural references (human voices, nature, environmental sounds, machinery, instruments, applause, etc.) that provide references (if included in recordings) when judging the fidelity of different loudspeakers. Through training, listeners become familiar with the programs and how they should sound.

    Listeners are also pretty good detectives when it comes to identifying resonances/frequency response aberrations, and distortions in loudspeakers.

    When doing multiple comparisons among loudspeakers, it becomes easier to separate the distortions in recordings from those in the loudspeakers - more so than a single stimulus test. Any distortions that are constant when switching among loudspeakers are associated with the recording and tend to fall into the background. If the sound quality changes as you switch to a different speaker that sonic character becomes associated with the speaker itself.

  6. Sean,

    If you can ignore its instances of pro-analog cheerleading, you might enjoy a book published last year called "Perfecting Sound Forever: An Aural History of Recorded Music" by Greg Milner. It too discusses a few of the 'live vs recorded' listening tests, in part to demonstrate how our idea of 'perfect sound' has changed with the times.

  7. jehle, mp3 is not an instance of 'people will hear what you tell them to hear', except to the extent that *anti*-mp3 writers tell people that mp3s sound terrible compared to lossless formats. Such claims aren't based on any thoughtful consideration of how mp3s work, or how listening comparisons should be done. It's just bias against the idea of 'lossyness'....and uncritical belief that if something 'measures' better, it MUST sound better.

  8. "During the 1960’s, Acoustic Research (AR), an American loudspeaker company, performed over 75 live-versus-recorded concerts in cities around the USA featuring The Fine Arts String Quartet, and the AR-3 loudspeaker [5],[6]. To solve the double reverberation problem, the recordings of the quartet were made in an anechoic chamber, or outdoors. Outdoor live-versus-recorded demonstrations had the added benefit that there were no room reflections in either the recording or the live performance. This made the demonstrations less sensitive to off-axis problems in the microphones and loudspeakers. It also relaxed the demands on the recording-reproduction to accurately capture and reproduce the complex spatial properties of a reverberant performing space."

    This statement is erroneous. The AR Live-vs.-Recorded concerts were *not* conducted outdoors. The recording of the musicians for these demonstrations was performed in anechoic (outdoors in this case) space to avoid "double reverberation," such that the AR-3s and the Fine Arts Quartet (as well as guitarist Gustavo Lopéz and the 1910 Nickelodeon) would be treated to a nearly identical acoustical environment.

    "To date, most of the live-versus-sighted tests have been performed sighted, where non-auditory cues were available to allow the listener to identify whether they were hearing the live or reproduced sound source. These tests could have been easily made blind via an acoustically transparent curtain; however, scientific validity was apparently not the primary purpose of the test. The visual cues from the musicians (bowing, lip syncing) would also enhance the realism and presence of the reproduction, a well-known
    cognitive effect observed in research of binaural and virtual reality displays."


    AR used several methods to avoid "visual clues" in identifying the point at which the musicians stopped playing and the loudspeakers began playing. AR made several recordings of actual performances (both the musicians and the speakers). The method commonly used throughout all of the concerts was to have the musicians "lift" the bows slightly, but continue "playing" while the AR-3s played. However, in the early stages the Fine Arts Quartet played a trick on the audience. Midway through the performance of the first-selected piece, First Violinist Leonard Sorkin stopped the music and asked the audience, "I'm curious, how many of you in the audience could detect when we switched from live music to the recorded music?" There was a show of hands in the audience. "I'm sorry to tell you this, but except for the first two bars, the entire piece was played by the speakers." By the way, there was never any signal equalization used on the speakers -- except to adjust (increase and attenuate) the treble control on the preamp for different acoustical conditions in different concert halls.

    ARHPG (Acoustic Research Historic Preservation Group)

  9. Dear ARHPG,
    Thank you for the correction regarding the Live-vs-Recorded AR outdoor concerts. I think that misconception probably came from photographs on the web showing the Quartet playing outdoors with AR speakers behind them. The speakers must have been added for publicity's sake only.

  10. Dear Dr. Olive,

    Thank you for your reply and comments.

    The outdoor photographs of the Fine Arts Quartet -- along with Edgar Villchur, Roy Allison and other individuals -- were taken in Woodstock, NY during the initial recording sessions, and the AR-3s were for monitoring purposes and to check playback levels, etc. Clearly, the AR-3s received unintended publicity in this picture, but AR never used these pictures in any of its advertisements. It should be mentioned that Dynaco and a magnetic-tape manufacturer, Concertapes (as well as Ampex and Sony), were also involved in the LvR concerts. Concertapes had a recording contract with The Fine Arts Quartet at the time, and the Quartet approached Edgar Villchur in 1959 about doing live-vs.-recorded concerts together.

    In 1956, the New York Audio League (Julian Hirsch) used four AR-1 loudspeakers, along with Bozak midrange drivers and the JansZen electrostatic tweeters, to reproduce the Aeolian-Skinner pipe organ in a Mt. Kisco, NY church, so LvR demonstrations had been done previously using AR equipment. This demonstration was also very successful, and it helped establish Acoustic Research's AR-1 as one of the finer low-frequency reproducers of the era.

    The Fine Arts Quartet LvR concerts were AR's most successful, in my view, and they were met with almost universal critical acclaim from music critics, equipment reviewers and the public alike. More than seventy-five concerts were performed across the country in most of the largest cities, such as New York, Boston, Washington, DC, Chicago, LA and San Francisco. The ensemble tone of the Fine Arts Quartet -- performed in relatively spacious enclosed spaces -- was reproduced well from accounts from the newspaper reports and popular high-fidelity magazines of the period.

    Subsequently, AR performed several concerts with classical guitarist Gustavo Lopez and the 1910 Seaburg Nickelodeon; while these demonstrations were quite successful, they lacked the closeness of playback of the Fine Arts Quartet. In the mid-1970s, AR used the then-new AR-10Pi (an updated version of the AR-3 and AR-3a) in a series of live-vs.-recorded demonstrations with jazz drummer Neil Grover. This LvR proved more difficult, as it took over 800-watt peaks to reproduce the drum "rim shots" and other drum sounds through the .5% efficient AR-10s. Amazingly, no AR driver was damaged except for a frozen voice coil caused by dc-offset from a failed Dunlap-Clarke Dreadnaught 1000 amplifier. Overall, I don't believe that these difficult demonstrations were as successful as AR's 1959-1963 Fine Arts Quartet series.

    ARHPG (Acoustic Research Historic Preservation Group)

  11. Steven Sullivan states that we only think MP3s sound worse because they measure worse and because we are prejudiced.

    Sorry Steve, You are flat out wrong here. There is a huge sonic difference. Go to a local high end stereo store and plug your little chip in and take a CD of the same music with you. You will either discover your error or need a hearing test.

    I listen to a lot of XM radio and enjoy it; not because it sounds as good as CD or a good LP, but because I am used to it. If I listen to CDs first, then it takes a while to adjust to the inferior format. When I go from XM to CD, I am immediately presented with a vastly superior sound.

    DanV.

  12. William Sommerwerck, July 19, 2013 at 8:39 AM

    I'm a degreed EE and AES member who used to review for "Stereophile". I stopped reviewing when I discovered that huge sonic differences heard one day (with matched levels) utterly vanished the next. (Don't read too much into that.) This didn't convert me to an ABX advocate, as that sort of testing has its own problems. (Using a test protocol designed for one form of scientific analysis is not necessarily valid for another.)

    Though at least one well-known speaker designer expressed his view that the AR/Dyna tests were valid, their validity is beside the point. They are interesting in their own right, simply for giving us an idea of what is or isn't audible, how listeners react when they /know/ there should be audible differences, etc.

    As an advocate of facsimile reproduction (I couldn't care less about studio recordings that have no acoustical counterpart), I'd like to propose an expensive and difficult project -- an attempt to see just how close one can come to "the original sound" -- the illusion that one is hearing the original performance.

    This would be a multi-year project, requiring long-term commitment to making live recordings using different mics and mic setups, and playing them through an even wider variety of speakers and speaker arrays.

    What would we learn from this? I don't know. But that's the point of scientific testing -- seeing what will happen. (As HP used to say... "We never stop asking What If?")
