Tuesday, October 2, 2018

Speech Perception


When you listen to someone speaking you generally focus on understanding their
meaning. One famous (in linguistics) way of saying this is that "we speak in order
to be heard, in order to be understood" (Jakobson et al., 1952). Our drive, as
listeners, to understand the talker leads us to focus on getting the words being
said, and not so much on exactly how they are pronounced. But sometimes a
pronunciation will jump out at you: somebody says a familiar word in an unfamiliar way and you just have to ask "Is that how you say that?" When we listen
to the phonetics of speech — to how the words sound and not just what they mean
— we as listeners are engaged in speech perception.
In speech perception, listeners focus attention on the sounds of speech and notice
phonetic details about pronunciation that are often not noticed at all in normal
speech communication. For example, listeners will often not hear, or not seem
to hear, a speech error or deliberate mispronunciation in ordinary conversation,
but will notice those same errors when instructed to listen for mispronunciations
(see Cole, 1973).

Testing mispronunciation detection.
As you go about your daily routine, try mispronouncing a word every now
and then to see if the people you are talking to will notice. For instance, if
the conversation is about a biology class you could pronounce it "biolochi."
After saying it this way a time or two you could tell your friends about your
little experiment and ask if they noticed any mispronounced words. Do people
notice mispronunciation more in word-initial position or in medial position?
With vowels more than consonants? In nouns and verbs more than in gram-
matical words? How do people look up words in their mental dictionary if
they don't notice when a sound has been mispronounced? Evidently, looking up words in the mental lexicon is a little different from looking up words
in a printed dictionary (try entering "biolochi" in Google). Do you find that
your friends think you are strange when you persist in mispronouncing words
on purpose?

So, in this chapter we're going to discuss speech perception as a phonetic mode
of listening, in which we focus on the sounds of speech rather than the words.
An interesting problem in phonetics and psycholinguistics is to find a way of measuring how much phonetic information listeners take in during normal conversation, but in this book we can limit our focus to the phonetic mode of listening.

5.1 Auditory Ability Shapes Speech Perception.

As we saw in chapter 4, speech perception is shaped by general properties of
the auditory system that determine what can and cannot be heard, what cues will
be recoverable in particular segmental contexts, and how adjacent sounds will
influence each other. For example, we saw that the cochlea's nonlinear frequency
scale probably underlies the fact that no language distinguishes fricatives on the
basis of frequency components above 6,000 Hz.
Two other examples illustrate how the auditory system constrains speech perception. The first example has to do with the difference between aspirated and
unaspirated stops. This contrast is signaled by a timing cue that is called the "voice
onset time" (abbreviated as VOT). VOT is a measure (in milliseconds) of the
delay of voicing onset following a stop release burst. There is a longer delay in
aspirated stops than in unaspirated stops — so in aspirated stops the vocal folds are
held open for a short time after the oral closure of the stop has been released.
That's how the short puff of air in voiceless aspirated stops is produced. It has
been observed that many languages have a boundary between aspirated and unaspirated stops at about 30 ms VOT. What is so special about a 30 ms delay between
stop release and onset of voicing?
Here's where the auditory system comes into play. Our ability as hearers
to detect the nonsimultaneous onsets of tones at different frequencies probably
underlies the fact that the most common voice onset time boundary across languages is at about ±30 ms. Consider two pure tones, one at 500 Hz and the other at 1,000 Hz. In a perception test (see, for example, the research studies by Pisoni,
1977, and Pastore and Farrington, 1996), we combine these tones with a small
onset asynchrony — the 500 Hz tone starts 20 ms before the 1,000 Hz tone. When
we ask listeners to judge whether the two tones were simultaneous or whether
one started a little before the other, we discover that listeners think that tones
separated by a 20 ms onset asynchrony start at the same time. Listeners don't
begin to notice the onset asynchrony until the separation is about 30 ms. This
parallelism between nonspeech auditory perception and a cross-linguistic phonetic
universal leads to the idea that the auditory system's ability to detect onset asynchrony is probably a key factor in this cross-linguistic phonetic property.
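
To make the comparison concrete, here is a minimal sketch (in Python, with assumed values for the sample rate and tone duration, since the original studies specify their own stimulus parameters) of the kind of two-tone onset-asynchrony stimulus used in these experiments:

import numpy as np

def asynchrony_stimulus(asynchrony_ms, dur_ms=200.0, fs=44100):
    """A 500 Hz tone starting at t=0 plus a 1,000 Hz tone starting asynchrony_ms later."""
    delay = int(fs * asynchrony_ms / 1000.0)      # onset asynchrony in samples
    n = int(fs * dur_ms / 1000.0)                 # duration of each tone in samples
    t = np.arange(n) / fs
    low = 0.5 * np.sin(2 * np.pi * 500 * t)       # leading 500 Hz tone
    high = 0.5 * np.sin(2 * np.pi * 1000 * t)     # lagging 1,000 Hz tone
    out = np.zeros(n + delay)
    out[:n] += low
    out[delay:delay + n] += high
    return out

# Listeners tend to call these "simultaneous" at 20 ms but not at 30 ms.
near_threshold = asynchrony_stimulus(20)
above_threshold = asynchrony_stimulus(30)
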
Example number two: another general property of the auditory system is probably at work in the perceptual phenomenon known as "compensation for coarticulation." This effect occurs in the perception of place of articulation in CV syllables.
The basic tool in this study is a continuum of syllables that ranges in equal acoustic
steps from [da] to [ga] (see figure 5.1). This figure needs a little discussion. At
the end of chapter 3 I introduced spectrograms, and in that section I mentioned
that the dark bands in a spectrogram show the spectral peaks that are due to
the vocal tract resonances (the formant frequencies). So in figure 5.1a we see a
sequence of five syllables with syllable number 1 labeled [da] and syllable number 5 labeled [ga]. In each syllable, the vowel is the same; it has a first formant frequency (F1) of about 900 Hz, a second formant frequency (F2) of about 1,100 Hz, an F3 at 2,500 Hz, and an F4 at 3,700 Hz. The difference between [da] and [ga] has to do with the brief formant movements (called formant transitions) at the start of each syllable. For [da] the F2 starts at 1,500 Hz and the F3 starts at 2,900 Hz, while for [ga] the F2 starts at 1,900 Hz and the F3 starts at 2,000 Hz. You'll notice that the main difference between [al] and [ar] in figure 5.1b is the F3 pattern at the end of the syllable.
Virginia Mann (1980) found that the perception of this [da]-[ga] continuum
depends on the preceding context. Listeners report that the ambiguous syllables
in the middle of the continuum sound like "ga" when preceded by the VC syllable
[al], and sound like "da" when preceded by [ar].
As the name implies, this "compensation for coarticulation" perceptual effect
can be related to coarticulation between the final consonant in the VC context
token ([al] or [ar]) and the initial consonant in the CV test token ([da]-[ga]). However, an auditory frequency contrast effect probably also plays a role. The way this explanation works is illustrated in figure 5.1b. The relative frequency of F3 distinguishes [da] from [ga] — F3 is higher in [da] than it is in [ga]. Interestingly, though, the perceived frequency of F3 may also be influenced by the frequency of the F3 just prior to [da/ga]. When F3 just prior to [da/ga] is low (as in [ar]), the [da/ga] F3 sounds contrastively higher, and when the F3 just prior is high, the [da/ga] F3 sounds lower. Lotto and Kluender (1998) tested this idea by replacing the precursor syllable with a simple sine wave that matched the ending frequency of the F3 of [ar], in one condition, or matched the ending F3 frequency of [al], in another condition. They found that these nonspeech isolated tones shifted the perception of the [da]-[ga] continuum in the same direction that the [ar] and [al] syllables did.
So evidently, at least a part of the compensation for coarticulation phenomenon
is due to a simple auditory contrast effect having nothing to do with the phonetic mode of perception.


Two explanations for one effect.
Compensation for coarticulation is controversial. For researchers who like to
think of speech perception in terms of phonetic perception — i.e. "hearing"
people talk — compensation for coarticulation is explained in terms of
coarticulation. Tongue retraction in [r] leads listeners to expect tongue
retraction in the following segment and thus a backish stop (more like "g")
can still sound basically like a "d" in the [r] context because of this
context-dependent expectation. Researchers who think that one should first
and foremost look for explanations of perceptual effects in the sensory input
system (before positing more abstract cognitive parsing explanations) are
quite impressed by the auditory contrast account.
It seems to me that the evidence shows that both of these explanations
are right. Auditory contrast does seem to occur with pure tone context tokens,
in place of [ar] or [al], but the size of the effect is smaller than it is with a
phonetic precursor syllable. The smaller size of the effect suggests that audi-
tory contrast is not the only factor. I've also done research with stimuli like
this where I present a continuum between [al] and [ar] as context for the
[da]-[ga] continuum. When both the precursor and the target syllable are
ambiguous, the identity of the target syllable (as "da" or "ga") depends on the
perceived identity of the precursor. That is, for the same acoustic token, if the
listener thinks that the context is "ar" he or she is more likely to identify
the ambiguous target as "da." This is clearly not an auditory contrast effect.
So, both auditory perception and phonetic perception seem to push
listeners in the same direction.

5.2 Phonetic Knowledge Shapes Speech Perception.
Of course, the fact that the auditory system shapes our perception of speech does
not mean that all speech perception phenomena are determined by our auditory
abilities. As speakers, not just hearers, of language, we are also guided by our knowledge of speech production. There are two main classes of perceptual effects that emerge from phonetic knowledge: categorical perception and phonetic coherence.

5.2.1 Categorical perception.

Take a look back at figure 5.1a. Here we have a sequence of syllables that shifts
gradually (and in equal acoustic steps) from a syllable that sounds like "da" at
one end to a syllable that sounds like "ga" at the other (see table 5.1). This type
of gradually changing sequence is called a stimulus continuum. When we play
these synthesized syllables to people and ask them to identify the sounds - with
an instruction like "please write down what you hear" - people usually call the
first three syllables "da" and the last two "ga." Their response seems very cat-
egorical: a syllable is either "da" or "ga." But, of course, this could be so simply
because we only have two labels for the sounds in the continuum, so by
definition people have to say either "da" or "ga." Interestingly, though — and this
is why we say that speech perception tends to be categorical — the ability to hear
the differences between the stimuli on the continuum is predictable from the labels
we use to identify the members of the continuum.

To illustrate this, suppose I play you the first two syllables in the continuum
shown in figure 5.1a — tokens number 1 and 2. Listeners label both of these as
"da," but they are slightly different from each other. Number 1 has a third for-
mant onset of 2,750 Hz while the F3 in token number 2 starts at 2,562 Hz. People
don't notice this contrast — the two syllables really do sound as if they are iden-
tical. The same thing goes for the comparisons of token 2 with token 3 and of
token 4 with token 5. But when you hear token 3 (a syllable that you would ordi-
narily label as "da") compared with token 4 (a syllable that you would ordinarily
label "ga"), the difference between them leaps out at you. The point is that in the
discrimination task — when you are asked to detect small differences — you don't
have to use the labels "da" or "ga." You should be able to hear the differences at
pretty much the same level of accuracy, no matter what label you would have put
on the tokens, because the difference is the same (188 Hz for F3 onset) for token
1 versus 2 as it is for token 3 versus 4. The curious fact is that even when you don't
have to use the labels "da" and "ga" in your listening responses, your perception
is in accordance with the labels — you can notice a 188 Hz difference when the
tokens have different labels and not so much when the tokens have the same label.
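
The stimulus steps described here can be written out directly. This short sketch lists the F3 onset values implied by the 188 Hz step size, together with the typical labeling responses mentioned above; the F2 onsets would be stepped in the same way (the code and the "typical" labels are illustrative, not the actual stimulus table from the original studies):

import numpy as np

f3_onsets = 2750 - 188 * np.arange(5)        # Hz: 2750, 2562, 2374, 2186, 1998
typical_labels = ["da", "da", "da", "ga", "ga"]
for token, (f3, label) in enumerate(zip(f3_onsets, typical_labels), start=1):
    print(f"token {token}: F3 onset = {f3} Hz, usually labeled '{label}'")
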
One classic way to present these hypothetical results is shown in figure 5.2
(see Liberman et al., 1957, for the original graph like this). This graph has two
"functions" — two lines — one for the proportion of times listeners will identify
a token as "da", and one for the proportion of times that listeners will be able to
accurately tell whether two tokens (say number 1 and number 2) are different from
each other. The first of these two functions is called the identification function,
and I have plotted it as if we always (probability equals 1) identify tokens 1, 2, and
3 as "da." The second of these functions is called the discrimination function,
and I have plotted a case where the listener is reduced to guessing when the tokens
being compared have the same label (where "guessing" equals probability of
correct detection of difference is 0.5), and where he or she can always hear the
difference between token 3 (labeled "da") and token 4 (labeled "ga"). The pattern
of response in figure 5.2 is what we mean by "categorical perception" — within-
category discrimination is at chance and between-category discrimination is per-
fect. Speech tends to be perceived categorically, though interestingly, just as with
compensation for coarticulation, there is an auditory perception component in
this kind of experiment, so that speech perception is never perfectly categorical.
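
Figure 5.2 plots hypothetical, idealized data, so it is easy to reconstruct. The sketch below (Python with matplotlib) draws the two functions exactly as described in the text: identification at 1.0 for tokens 1-3, within-category discrimination at the 0.5 guessing level, and between-category discrimination at 1.0:

import matplotlib.pyplot as plt

tokens = [1, 2, 3, 4, 5]
p_da = [1.0, 1.0, 1.0, 0.0, 0.0]                 # identification function
pair_midpoints = [1.5, 2.5, 3.5, 4.5]            # adjacent pairs 1-2, 2-3, 3-4, 4-5
p_correct = [0.5, 0.5, 1.0, 0.5]                 # discrimination function

fig, ax = plt.subplots()
ax.plot(tokens, p_da, "o-", label='proportion identified as "da"')
ax.plot(pair_midpoints, p_correct, "s--", label="proportion correct discrimination")
ax.set_xlabel("continuum token")
ax.set_ylabel("proportion")
ax.set_ylim(0, 1.05)
ax.legend()
plt.show()
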

Our tendency to perceive speech categorically has been investigated in many
different ways. One of the most interesting of these lines of research suggests
(to me at least) that categorical perception of speech is a learned phenomenon (see
Johnson and Ralston, 1994). It turns out that perception of sine wave analogs of
the [da] to [ga] continuum is much less categorical than is perception of normal-
sounding speech. Robert Remez and colleagues (Remez et al., 1981) pioneered
the use of sine wave analogs of speech to study speech perception. In sine wave
analogs, the formants are replaced by time-varying sinusoidal waves (see figure 5.3).
These signals, while acoustically comparable to speech, do not sound at all like
speech. The fact that we have a more categorical response to speech signals
than to sine wave analogs of speech suggests that there is something special
about hearing formant frequencies as speech versus hearing them as nonspeech,
video-game noises. One explanation of this is that as humans we have an innate
ability to recover phonetic information from speech so that we hear the intended,
categorical gestures of the speaker.
A simpler explanation of why speech tends to be heard categorically is that our
perceptual systems have been tuned by linguistic experience. As speakers, we have
somewhat categorical intentions when we speak — for instance, to say "dot" instead
of "got." So as listeners we evaluate speech in terms of the categories that we
have learned to use as speakers. Several kinds of evidence support this "acquired
categoriality" view of categorical perception.

For example, as you know from trying to learn the sounds of the International
Phonetic Alphabet, foreign speech sounds are often heard in terms of native sounds.
For instance, if you are like most beginners, when you were learning the implosive
sounds [ɓ], [ɗ], and [ɠ] it was hard to hear the difference between them and
plain voiced stops. This simple observation has been confirmed many times and
in many ways, and indicates that in speech perception, we hear sounds that we
are familiar with as talkers. Our categorical perception boundaries are determined
by the language that we speak. (The theories proposed by Best, 1995, and Flege,
1995, offer explicit ways of conceptualizing this.)

Categorical magnets.

One really interesting demonstration of the language-specificity of categor-
ical perception is the "perceptual magnet effect" (Kuhl et al., 1992). In this
experiment, you synthesize a vowel that is typical of the sound of [i] and
then surround it with vowels that systematically differ from the center
vowel. In figure 5.4 this is symbolized by the white star, and the white
circles surrounding it. A second set of vowels is synthesized, again in a radial
grid around a center vowel. This second set is centered not on a typical
[i] but instead on a vowel that is a little closer to the boundary between [i]
and [e].
When you ask adults if they can hear the difference between the center
vowel (one of the stars) and the first ring of vowels, it turns out that they
have a harder time distinguishing the white star (a prototypical [i]) from its
neighbors than they do distinguishing the black star (a non-prototypical [i])
from its neighbors. This effect is interesting because it seems to show that
categorical perception is a gradient within categories (note that all of the
vowels in the experiment sound like variants of [i], even the ones in the black
set that are close to the [i]/ [e] boundary). However, even more interesting
is the fact that the location of a perceptual magnet differs depending on
the native language of the listener — even when those listeners are mere
infants!




Here's another phenomenon that illustrates the phonetic coherence of speech
perception. Imagine that you make a video of someone saying "ba," "da," and
"ga." Now, you dub the audio of each of these syllables onto the video of the
others. That is, one copy of the video of [ba] now has the audio recording of [da] as its sound track, another has the audio of [ga], and so on. There are some
interesting confusions among audio/video mismatch tokens such as these, and
one of them in particular has become a famous and striking demonstration of
the phonetic coherence of speech perception.
Some of the mismatches just don't sound right at all. For example, when you
dub audio [da] onto video [ba], listeners will report that the token is "ba" (in accor-
dance with the obvious lip closure movement) but that it doesn't sound quite
normal.
The really famous audio/video mismatch is the one that occurs when you dub
audio [ba] onto video [ga]. The resulting movie doesn't sound like either of the
input syllables, but instead it sounds like "da"! This perceptual illusion is called
the McGurk effect after Harry McGurk, who first demonstrated it (McGurk and
MacDonald, 1976). It is a surprisingly strong illusion that only goes away when
you close your eyes. Even if you know that the audio signal is [ba], you can only
hear "da."
The McGurk effect is an illustration of how speech perception is a process
in which we deploy our phonetic knowledge to generate a phonetically coherent
percept. As listeners we combine information from our ears and our eyes to come
to a phonetic judgment about what is being said. This process taps specific pho-
netic knowledge, not just generic knowledge of speech movements. For instance,
Walker et al. (1995) demonstrated that audio / video integration is blocked when
listeners know the talkers, and know that the voice doesn't belong with the
face (in a dub of one person's voice onto another person's face). This shows that
phonetic coherence is a property of speech perception, and that phonetic coher-
ence is a learned perceptual capacity, based on knowledge we have acquired
as listeners.

McGurking ad nauseam.
The McGurk effect is a really popular phenomenon in speech perception,
and researchers have poked and prodded it quite a bit to see how it works.
In fact it is so popular we can make a verb out of the noun "McGurk effect"
— to "McGurk" is to have the McGurk effect. Here are some examples of
McGurking:
Babies McGurk (Rosenblum et al., 1997)
You can McGurk even when the TV is upside down (Campbell, 1994)
Japanese listeners McGurk less than English listeners (Sekiyama and
Tohkura, 1993)
Male faces can McGurk with female voices (Green et al., 1991)
A familiar face with the wrong voice doesn't McGurk (Walker et al., 1995).

5.3 Linguistic Knowledge Shapes Speech Perception.
We have seen so far that our ability to perceive speech is shaped partly by the
nonlinearities and other characteristics of the human auditory system, and we have
seen that what we hear when we listen to speech is partly shaped by the phonetic
knowledge we have gained as speakers. Now we turn to the possibility that speech
perception is also shaped by our knowledge of the linguistic structures of our native
language.
I have already included in section 5.2 (on phonetic knowledge) the fact that
the inventory of speech sounds in your native language shapes speech perception,
so in this section I'm not focusing on phonological knowledge when I say "lin-
guistic structures," but instead I will present some evidence of lexical effects in speech
perception — that is, that hearing words is different from hearing speech sounds.
I should mention at the outset that there is controversy about this point. I will
suggest that speech perception is influenced by the lexical status of the sound
patterns we are hearing, but you should know that some of my dear colleagues
will be disappointed that I'm taking this point of view.

Scientific method: on being convinced.
There are a lot of elements to a good solid scientific argument, and I'm not
going to go into them here. But I do want to mention one point about how
we make progress. The point is that no one individual gets to declare an
argument won or lost. I am usually quite impressed by my own arguments
and cleverness when I write a research paper. I think I've figured something
out and I would like to announce my conclusion to the world. However,
the real conclusion of my work is always written by my audience and it keeps
being written by each new person who reads the work. They decide if the
result seems justified or valid. This aspect of the scientific method, includ-
ing the peer review of articles submitted for publication, is part of what leads
us to the correct answers.
The question of whether speech perception is influenced by word processing
is an interesting one in this regard. The very top researchers — most clever, and
most forceful — in our discipline are in disagreement on the question. Some
people are convinced by one argument or set of results and others are more
swayed by a different set of findings and a different way of thinking about the
question. What's interesting to me is that this has been dragging on for a
long, long time. And what's even more interesting is that as the argument drags
on, and researchers amass more and more data on the question, the theories
start to blur into each other a little. Of course, you didn't read that here!

The way that "slips of the ear" work suggests that listeners apply their know-
ledge of words in speech perception. Zinny Bond (1999) reports perceptual errors
like "spun toffee" heard as "fun stocking" and "wrapping service" heard as
"wrecking service." In her corpus of slips of the ear, almost all of them are word
misperceptions, not phoneme misperceptions. Of course, sometimes we may mis-
hear a speech sound, and perhaps think that the speaker has mispronounced the
word, but Bond's research shows that listeners are inexorably drawn into hearing
words even when the communication process fails. This makes a great deal of
sense, considering that our goal in speech communication is to understand what
the other person is saying, and words (or more technically, morphemes) are the
units we trade with each other when we talk.
This intuition, that people tend to hear words, has been verified in a very clever
extension of the place of articulation experiment we discussed in sections 5.1 and
5.2. The effect, which is named the Ganong effect after the researcher who first
found it (Ganong, 1980), involves a continuum like the one in figure 5.1, but with
a word at one end and a nonword at the other. For example, if we added a final
[g] to our [da]-[ga] continuum we would have a continuum between the word "dog" and the nonword [gag]. What Ganong found, and what makes me think that speech perception is shaped partly by lexical knowledge, is that in this new continuum we will get more "dog" responses than we will get "da" responses in the [da]-[ga] continuum. Remember the idea of a "perceptual magnet" from above?
Well, in the Ganong effect words act like perceptual magnets; when one end of
the continuum is a word, listeners tend to hear more of the stimuli as a lexical
item, and fewer of the stimuli as the nonword alternative at the other end of the
continuum.
Ganong applied careful experimental controls using pairs of continua like
"tash"—"dash" and "task"—"dask" where we have a great deal of similarity
between the continuum that has a word on the /t/ end ("task"—"dask") and
the one that has a word on the /d/ end ("tash"—"dash"). That way there is less
possibility that the difference in number of "d" responses is due to small acoustic
differences between the continua rather than the difference in lexicality of the
endpoints. It has also been observed that the lexical effect is stronger when
the sounds to be identified are at the ends of the test words, as in "kiss"—"kish"
versus "fiss"—"fish." This makes sense if we keep in mind that it takes a little
time to activate a word in the mental lexicon.
A third perceptual phenomenon that suggests that linguistic knowledge (in the
form of lexical identity) shapes speech perception was called "phoneme restora-
tion" by Warren when he discovered it (Warren, 1970). Figure 5.7 illustrates phoneme
restoration. The top panel is a spectrogram of the word "legislation" and the bot-
tom panel shows a spectrogram of the same recording with a burst of broadband
noise replacing the [s]. When people hear the noise-replaced version of the sound
file in figure 5.7b they "hear" the [s] in "legislation." Arthur Samuel (1991)
reported an important bit of evidence suggesting that the [s] is really perceived
in the noise-replaced stimuli. He found that listeners can't really tell the differ-
ence between a noise-added version of the word (where the broadband noise is
simply added to the already existing [s]) and a noise-replaced version (where the
[s] is excised first, before adding noise). What this means is that the [s] is actually
perceived — it is restored — and thus that your knowledge of the word "legisla-
tion" has shaped your perception of this noise burst.

Jeff Elman and Jay McClelland (1988) provided another important bit of evid-
ence that linguistic knowledge shapes speech perception. They used the phoneme
restoration process to induce the perception of a sound that then participated in
a compensation for coarticulation. This two-step process is a little complicated,
but one of the most clever and influential experiments in the literature.
Step one: compensation for coarticulation. We use a [da]-[ga] continuum just like the one in figure 5.1, but instead of context syllables [al] and [ar], we use [as] and [aʃ]. There is a compensation for coarticulation using these fricative context syllables that is like the effect seen with the liquid contexts. Listeners hear more "ga" syllables when the context is [as] than when it is [aʃ].
Step two: phoneme restoration. We replace the fricative noises in the words "abolish" and "progress" with broadband noise, as was done to the [s] of "legislation" in figure 5.7. Now we have a perceived [s] in "progress" and a perceived [ʃ] in "abolish" but the signal has only noise at the ends of these words in our tokens.
The question is whether the restoration of [s] and [ʃ] in "progress" and "abolish"
is truly a perceptual phenomenon, or just something more like a decision bias
in how listeners will guess the identity of a word. Does the existence of a word
"progress" and the nonexistence of any word "progresh" actually influence
speech perception? Elman and McClelland's excellent test of this question was to
use "abolish" and "progress" as contexts for the compensation for coarticulation
experiment. The reasoning is that if the "restored" [s] produces a compensation
for coarticulation effect, such that listeners hear more "ga" syllables when these
are preceded by a restored [s] than when they are preceded by a restored [ʃ], then we would have to conclude that the [s] and [ʃ] were actually perceived by listeners — they were actually perceptually there and able to interact with the perception of the [da]—[ga] continuum. Guess what Elman and McClelland found? That's right: the phantom, not-actually-there [s] and [ʃ] caused compensation for
coarticulation — pretty impressive evidence that speech perception is shaped by
our linguistic knowledge.

5.4 Perceptual Similarity.
Now to conclude the chapter, I'd like to discuss a procedure for measuring
perceptual similarity spaces of speech sounds. This method will be useful in later
chapters as we discuss different types of sounds, their acoustic characteristics, and
then their perceptual similarities. Perceptual similarity is also a key parameter in
relating phonetic characteristics to language sound change and the phonological
patterns in language that arise from sound change.
The method involves presenting test syllables to listeners and asking them
to identify the sounds in the syllables. Ordinarily, with carefully produced "lab
speech" (that is, speech produced by reading a list of syllables into a microphone
in the phonetics lab) listeners will make very few misidentifications in this task,
so we usually add some noise to the test syllables to force some mistakes. The
noise level is measured as a ratio of the intensity of the noise compared with the
peak intensity of the syllable. This is called the signal-to-noise ratio (SNR) and
is measured in decibels. To analyze listeners' responses we tabulate them in a con-
fusion matrix. Each row in the matrix corresponds to one of the test syllables
(collapsing across all 10 tokens of that syllable) and each column in the matrix
corresponds to one of the responses available to listeners.
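
As a concrete illustration of the noise-mixing step, here is a rough sketch. One assumption to flag: it scales the noise against the RMS level of the syllable, whereas the text defines SNR against the peak intensity of the syllable, so a faithful replication would substitute that measurement:

import numpy as np

def mix_at_snr(syllable, snr_db, rng=np.random.default_rng(1)):
    """Add white noise to a syllable waveform at the requested SNR (in dB)."""
    noise = rng.standard_normal(len(syllable))
    rms_signal = np.sqrt(np.mean(syllable ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2))
    # Scale the noise so that 20*log10(rms_signal / rms_noise) equals snr_db.
    noise *= rms_signal / (rms_noise * 10 ** (snr_db / 20))
    return syllable + noise

# 0 dB SNR (the condition in table 5.2) means signal and noise at equal level:
# noisy_token = mix_at_snr(syllable_waveform, 0)
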

Table 5.2 shows the confusion matrix for the 0 dB SNR condition in George
Miller and Patricia Nicely's (1955) large study of consonant perception. Yep, these
data are old, but they're good. Looking at the first row of the confusion matrix
we see that [f] was presented 264 times and identified correctly as "f" 199 times
and incorrectly as "th" 46 times. Note that Miller and Nicely have more data for
some sounds than for others.
Even before doing any sophisticated data analysis, we can get some pretty quick
answers out of the confusion matrix. For example, why is it that "Keith" is sometimes pronounced "Keef" by children? Well, according to Miller and Nicely's data, [θ] was called "f" 85 times out of 232 — it was confused with "f" more often than
with any other speech sound tested. Cool. But it isn't clear that these data tell us
anything at all about other possible points of interest — for example, why "this"
and "that" are sometimes said with a [d] sound. To address that question we need
to find a way to map the perceptual "space" that underlies the confusions we observe
in our experiment. It is to this mapping problem we now turn.

5.4.1 Maps from distances.
So, we're trying to pull information out of a confusion matrix to get a picture of
the perceptual system that caused the confusions. The strategy that we will use
takes a list of distances and reconstructs them as a map. Consider, for example,
the list of distances below for cities in Ohio.
Columbus to Cincinnati, 107 miles
Columbus to Cleveland, 142 miles
Cincinnati to Cleveland, 249 miles

From these distances we can put these cities on a straight line as in figure 5.8a,
with Columbus located between Cleveland and Cincinnati. A line works to
describe these distances because the distance from Cincinnati to Cleveland is
simply the sum of the other two distances (107 + 142 = 249).
Here's an example that requires a two-dimensional plane.
Amsterdam to Groningen, 178 km
Amsterdam to Nijmegen, 120 km
Groningen to Nijmegen, 187 km
The two-dimensional map that plots the distances between these cities in the
Netherlands is shown in figure 5.8b. To produce this figure I put Amsterdam and
Groningen on a line and called the distance between them 178 km. Then I drew
an arc 120 km from Amsterdam, knowing that Nijmegen has to be somewhere
on this arc. Then I drew an arc 187 km from Groningen, knowing that Nijmegen
also has to be somewhere on this arc. So, Nijmegen has to be at the intersection of the two arcs — 120 km from Amsterdam and 187 km from Groningen. This
method of locating a third point based on its distance from two known points
is called triangulation. The triangle shown in figure 5.8b is an accurate depic-
tion of the relative locations of these three cities, as you can see in the map in
figure 5.9.
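
The triangulation step is just a little algebra on the three distances. A quick sketch (the coordinate frame, with Amsterdam at the origin and Groningen on the x-axis, is an arbitrary choice):

import math

d_ag, d_an, d_gn = 178.0, 120.0, 187.0   # Amsterdam-Groningen, Amsterdam-Nijmegen, Groningen-Nijmegen (km)
x = (d_ag**2 + d_an**2 - d_gn**2) / (2 * d_ag)   # Nijmegen's position along the Amsterdam-Groningen line
y = math.sqrt(d_an**2 - x**2)                     # and its distance off that line
print(f"Nijmegen lies about {x:.0f} km along and {y:.0f} km off the Amsterdam-Groningen line")
print(f"check: distance back to Groningen = {math.hypot(d_ag - x, y):.0f} km")   # ~187
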
You might be thinking to yourself, "Well, this is all very nice, but what does
it have to do with speech perception?" Good question. It turns out that we can
compute perceptual distances from a confusion matrix. And by using an extension
of triangulation called multidimensional scaling, we can produce a perceptual
map from a confusion matrix.
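
For readers who want to try this, here is a minimal sketch of the MDS step using scikit-learn's MDS class with precomputed dissimilarities; the city distances stand in for the perceptual distances derived below, and the orientation of the recovered map is arbitrary:

import numpy as np
from sklearn.manifold import MDS

labels = ["Amsterdam", "Groningen", "Nijmegen"]
distances = np.array([[0.0, 178.0, 120.0],
                      [178.0, 0.0, 187.0],
                      [120.0, 187.0, 0.0]])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances)
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:6.1f}, {y:6.1f})")
# The same call works on perceptual distances from a confusion matrix;
# only the rotation/reflection of the recovered map is arbitrary.
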

5.4.2 The perceptual map of fricatives.

In this section we will use multidimensional scaling (MDS) to map the percep-
tual space that caused the confusion pattern in table 5.2.
The first step in this analysis process is to convert confusions into distances.
We believe that this is a reasonable thing to try to do because we assume that
when things are close to each other in perceptual space they will get confused
with each other in the identification task. So the errors in the matrix in table 5.2
tell us what gets confused with what. Notice, for example, that the voiced con-
sonants [v], [ð], [z], and [d] are very rarely confused with the voiceless consonants [f], [θ], and [s]. This suggests that voiced consonants are close to each other in per-
ceptual space while voiceless consonants occupy some other region. Generalized
statements like this are all well and good, but we need to compute some specific
estimates of perceptual distance from the confusion matrix.
Here's one way to do it (I'm using the method suggested by the mathem-
atical psychologist Roger Shepard in his important 1972 paper "Psychological
representation of speech sounds"). There are two steps. First, calculate similarity
and then from the similarities we can derive distances.
Similarity is easy. The number of times that you think [f] sounds like "θ" is a reflection of the similarity of "f" and "θ" in your perceptual space. Also, "f"—"θ" similarity is reflected by the number of times you say that [θ] sounds like "f", so we will combine these two cells in the confusion matrix — [f] heard as "θ" and [θ] heard as "f." Actually, since there may be a different number of [f] and [θ] tokens
presented, we will take proportions rather than raw counts.
Notice that for any two items in the matrix we have a submatrix of four cells:
(a) is the submatrix of response proportions for the "f"/"θ" contrast from Miller
and Nicely's data. Note, for example, that the value 0.75 in this table is the pro-
portion of [f] tokens that were recognized as "f" (199/264 = 0.754). Listed with
the submatrix are two abstractions from it.

The variables in submatrix (b) code the proportions so that "p" stands for proportion, the first subscript letter stands for the row label and the second subscript letter stands for the column label. So p_θf is a variable that refers to the proportion of times that [θ] tokens were called "f." In these data p_θf is equal to 0.37. Submatrix (c) abstracts this a little further to say that for any two sounds i and j, we have a submatrix with confusions p_ij and p_ji (subscripts don't match) and correct answers p_ii and p_jj (subscripts match).

Asymmetry in confusion matrices.
Is there some deep significance in the fact that [θ] is called "f" more often
than [f] is called "th"? It may be that listeners had a bias against calling things
"th" — perhaps because it was confusing to have to distinguish between "th"
and "dh" on the answer sheet. This would seem to be the case in table 5.2
because there are many more "f" responses than "th" responses overall.
However, the relative infrequency of "s" responses suggests that we may not
want to rely too heavily on a response bias explanation, because the "s"-to-
[s] mapping is common and unambiguous in English. One interesting point
about the asymmetry of [f] and [θ] confusions is that the perceptual confusion matches the cross-linguistic tendency for sound change (that is, [θ] is
more likely to change into [f] than vice versa). Mere coincidence, or is there
a causal relationship? Shepard's method for calculating similarity from a
confusion matrix glosses over this interesting point and assumes that p_fθ and p_θf are two imperfect measures of the same thing — the confusability of "f" and "θ." These two estimates are thus combined to form one estimate of "f"—"θ" similarity. This is not to deny that there might be something
interesting to look at in the asymmetry, but only to say that for the purpose
of making perceptual maps the sources of asymmetry in the confusion matrix
are ignored.


Here is Shepard's method for calculating similarity from a confusion matrix.
We take the confusions between the two sounds and scale them by the correct
responses. In math, that's:

S_ij = (p_ij + p_ji) / (p_ii + p_jj)    (5.1)

In this formula, S_ij is the similarity between category i and category j. In the case of "f" and "θ" in Miller and Nicely's data (table 5.2), the calculation combines the two confusion proportions (p_fθ = 0.17 and p_θf = 0.37) and scales their sum by the sum of the correct-response proportions p_ff and p_θθ.
I should say that regarding this formula Shepard simply says that it "has been
found serviceable." Sometimes you can get about the same results by simply tak-
ing the average of the two confusion proportions p_ij and p_ji as your measure of
similarity, but Shepard's formula does a better job with a confusion matrix in which
one category has confusions concentrated between two particular responses,
while another category has confusions fairly widely distributed among possible
responses - as might happen, for example, when there is a bias against using one
particular response alternative.

OK, so that's how to get a similarity estimate from a confusion matrix. To get perceptual distance from similarity you simply take the negative of the natural log of the similarity:

d_ij = -ln(S_ij)

This is based on Shepard's Law, which states that the relationship between per-
ceptual distance and similarity is exponential. There may be a deep truth about
mental processing in this law - it comes up in all sorts of unrelated contexts (Shannon
and Weaver, 1949; Parzen, 1962), but that's a different topic.
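
Putting the two formulas together, here is a small sketch of the similarity and distance computation for the "f"/"θ" submatrix. The [f] row counts (199 correct, 46 "th") and the 85 [θ]-called-"f" responses come from the text; the count of correct "th" responses is a made-up placeholder, since table 5.2 itself is not reproduced here:

import math

def similarity_and_distance(row_i, row_j):
    """row_i = (correct count, confusion count, row total) for sound i;
    row_j = the same for sound j."""
    p_ii, p_ij = row_i[0] / row_i[2], row_i[1] / row_i[2]
    p_jj, p_ji = row_j[0] / row_j[2], row_j[1] / row_j[2]
    s_ij = (p_ij + p_ji) / (p_ii + p_jj)   # equation (5.1)
    d_ij = -math.log(s_ij)                 # distance = -ln(similarity)
    return s_ij, d_ij

# [f] row: 199 correct, 46 heard as "th", 264 presentations (from the text).
# [θ] row: 85 heard as "f" (from the text); 100 correct "th" responses is a
# hypothetical placeholder standing in for the real cell of table 5.2.
s, d = similarity_and_distance((199, 46, 264), (100, 85, 232))
print(f"similarity = {s:.2f}, perceptual distance = {d:.2f}")
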
Anyway, now we're back to map-making, except instead of mapping the relative
locations of Dutch cities in geographic space, we're ready to map the perceptual
space of English fricatives and "d." Table 5.3 shows the similarities calculated from
the Miller and Nicely confusion matrix (table 5.2) using equation (5.1).
The perceptual map based on these similarities is shown in figure 5.10. One of
the first things to notice about this map is that the voiced consonants are on one
side and the voiceless consonants are on the other. This captures the observation
that we made earlier, looking at the raw confusions, that voiceless sounds were
rarely called voiced, and vice versa. It is also interesting that the voiced and voice-
less fricatives are ordered in the same way on the vertical axis. This might be a
front/back dimension, or there might be an interesting correlation with some
acoustic aspect of the sounds.
In figure 5.10, I drew ovals around some clusters of sounds. These show
two levels of similarity among the sounds as revealed by a hierarchical cluster
analysis (another neat data analysis method available in most statistics software
packages - see Johnson, 2008, for more on this). At the first level of clustering
"0" and "f" cluster with each other and "v" and "d" cluster together in the
perceptual map. At a somewhat more inclusive level the sibilants are included with
their non-sibilant neighbors ("s" joins the voiceless cluster and "z" joins the
voiced cluster). The next level of clustering, not shown in the figure, puts [d] with
the voiced fricatives.
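
The clustering step can be sketched as follows with scipy's hierarchical clustering routines. The distance matrix below is a hypothetical stand-in chosen to have the qualitative structure just described (it is not table 5.3), so only the shape of the resulting dendrogram is meaningful:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

sounds = ["f", "θ", "s", "v", "ð", "z", "d"]
D = np.array([
    #  f    θ    s    v    ð    z    d
    [0.0, 0.8, 1.5, 2.5, 2.6, 3.0, 3.2],  # f
    [0.8, 0.0, 1.4, 2.6, 2.4, 3.0, 3.1],  # θ
    [1.5, 1.4, 0.0, 3.0, 2.9, 2.2, 3.3],  # s
    [2.5, 2.6, 3.0, 0.0, 0.9, 1.6, 1.9],  # v
    [2.6, 2.4, 2.9, 0.9, 0.0, 1.5, 1.8],  # ð
    [3.0, 3.0, 2.2, 1.6, 1.5, 0.0, 2.0],  # z
    [3.2, 3.1, 3.3, 1.9, 1.8, 2.0, 0.0],  # d
])

Z = linkage(squareform(D), method="average")   # agglomerative (average-link) clustering
dendrogram(Z, labels=sounds)
plt.show()
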

Combining cluster analysis with MDS gives us a pretty clear view of the
perceptual map. Note that these are largely just data visualization techniques; we
did not add any information to what was already in the confusion matrix (though
we did decide that a two-dimensional space adequately describes the pattern of
confusions for these sounds).
Concerning the realizations of "this" and "that" we would have to say that
these results indicate that the alternations [ð]—[d] and [ð]—[z] are not driven by auditory/perceptual similarity alone: there are evidently other factors at work —
otherwise we would find "vis" and "vat" as realizations of "this" and "that."


MDS and acoustic phonetics.

In acoustic phonetics one of our fundamental puzzles has been how to decide
which aspects of the acoustic speech signal are important and which things
don't matter. You look at a spectrogram and see a blob — the question is,
do listeners care whether that part of the sound is there? Does that blob
matter? Phoneticians have approached the "Does it matter?" problem in a
number of ways.
For example, we have looked at lots of spectrograms and asked concerning
the mysterious blob, "Is it always there?" One of the established facts of
phonetics is that if an acoustic feature is always, or even usually, present
then listeners will expect it in perception. This is even true of the so-called
"spit spikes" seen sometimes in spectograms of the lateral fricatives [+]
and 031 (A spit spike looks like a stop release burst — see chapter 8 - but
occurs in the middle of a fricative noise.) These sounds get a bit juicy, but
this somewhat tangential aspect of their production seems to be useful in
perception.
Another answer to "Does it matter?" has been to identify the origin of
the blob in the acoustic theory of speech production. For example, some-
times room reverberation can "add" shadows to a spectrogram. (Actually in
the days of reel-to-reel tape recorders we had to be careful of magnetic
shadows that crop up when the magnetic sound image transfers across layers
of tape on the reel.) If you have a theory of the relationship between speech
production and speech acoustics you can answer the question by saying,
"It doesn't matter because the talker didn't produce it." We'll be exploring
the acoustic theory of speech production in some depth in the remaining
chapters of this book.
One of my favorite answers to "Does it matter?" is "Cooper's rule." Franklin
Cooper, in his 1951 paper with Al Liberman and John Borst, commented
on the problem of discovering "the acoustic correlates of perceived speech."
They claimed that there are "many questions about the relation between
acoustic stimulus and auditory perception which cannot be answered
merely by an inspection of spectrograms, no matter how numerous and
varied these might be" (an important point for speech technologists to
consider). Instead they suggested that "it will often be necessary to make
controlled modifications in the spectrogram, and then to evaluate the
effects of these modifications on the sound as heard. For these purposes we
have constructed an instrument" (one of the first speech synthesizers). This
is a pretty beautiful direct answer. Does that blob matter? Well, leave it
out when you synthesize the utterance and see if it sounds like something
else.
And finally there is the MDS answer. We map the perceptual space and
then look for correlations between dimensions of the map and acoustic prop-
erties of interest (like the mysterious blob). If an acoustic feature is tightly
correlated with a perceptual dimension then we can say that that feature
probably does matter. This approach has the advantages of being based on
naturally produced speech, and of allowing the simultaneous exploration of
many acoustic parameters.


Recommended Reading
Best, C. T. (1995) A direct realist perspective on cross-language speech perception. In W.
Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Bond, Z. S. (1999) Slips of the Ear: Errors in the Perception of Casual Conversation, San Diego: Academic Press. A collection, and analysis, of misperception in "the wild" — in ordinary conversations.
Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, Cambridge, MA: MIT Press. The theory and evidence for a gestalt theory of audition — a very important book.
Campbell, R. (1994) Audiovisual speech: Where, what, when, how? Current Psychology of Cognition, 13, 76-80. On the perceptual resilience of the McGurk effect.
Cole, R. A. (1973) Listening for mispronunciations: A measure of what we hear during speech. Perception & Psychophysics, 13, 153-6. A study showing that people often don't hear mispronunciations in speech communication.
Cooper, F. S., Liberman, A. M., and Borst, J. M. (1951) The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Science, 37, 318-25. The source of "Cooper's rule."
Elman, J. L. and McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-65. One of the most clever, and controversial, speech perception experiments ever reported.
Flege, J. E. (1995) Second language speech learning: Theory, findings, and problems. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodo-logical issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Ganong, W. F. (1980) Phonetic categorization in auditory word recognition. Journal of Experimental Psychology: Human Perception and Performance, 6, 110-25. A highly influen-tial demonstration of how people are drawn to hear words in speech perception. The basic result is now known as "the Ganong effect."
Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-36. Integrating gender-mismatched voices and faces in the McGurk effect.
Jakobson, R., Fant, G., and Halle, M. (1952) Preliminaries to Speech Analysis, Cambridge, MA: MIT Press. A classic in phonetics and phonology in which a set of distinctive phono-logical features is defined in acoustic terms.
Johnson, K. and Ralston, J. V. (1994) Automaticity in speech perception: Some speech/ nonspeech comparisons. Phonetica, 51(4), 195-209. A set of experiments suggesting that over-learning accounts for some of the "specialness" of speech perception.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992) Linguistic experiences alter phonetic perception in infants by 6 months of age. Science, 255, 606-8. Demonstrating the perceptual magnet effect with infants.
Liberman, A. M., Harris, K. S., Hoffman H. S., and Griffith, B. C. (1957) The discrimina-tion of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-68. The classic demonstration of categorical perception in speech perception.
Lotto, A. J. and Kluender, K. R. (1998) General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602-19. A demonstration that at least a part of the compensation for coarticulation effect (Mann, 1980) is due to auditory contrast.
Mann, V. A. (1980) Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407-12. The original demonstration of compensation for coarticulation in sequences like [al da] and [ar ga].
McGurk, H. and MacDonald, J. (1976) Hearing lips and seeing voices. Nature, 264, 746-8. The audiovisual speech perception effect that was reported in this paper has come to be called "the McGurk effect."
Miller, G. A. and Nicely, P. E. (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338-52. A standard reference for the confusability of American English speech sounds.
Parzen, E. (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065-76. A method for estimating probability from instances.
Pastore, R. E. and Farrington, S. M. (1996) Measuring the difference limen for identification of order of onset for complex auditory stimuli. Perception & Psychophysics, 58(4), 510-26. On the auditory basis of the linguistic use of aspiration as a distinctive feature.
Pisoni, D. B. (1977) Identification and discrimination of the relative onset time of two-component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352-61. More on the auditory basis of the linguistic use of aspiration as a distinctive feature.
Rand, T. C. (1974) Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55(3), 678-80. The first demonstration of the duplex perception effect.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981) Speech perception with-out traditional speech cues. Science, 212, 947-50. The first demonstration of how people perceive sentences that have been synthesized using only time-varying sine waves.
Rosenblum, L. D., Schmuckler, M. A., and Johnson, J. A. (1997) The McGurk effect in infants. Perception & Psychophysics, 59, 347-57.
Sekiyama, K. and Tohkura, Y. (1993) Inter-language differences in the influence of visual cues in speech perception. Journal of Phonetics, 21, 427-44. These authors found that the McGurk effect is different for people who speak different languages.
Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication. Urbana: University of Illinois. The book that established "information theory."
Shepard, R. N. (1972) Psychological representation of speech sounds. In E. E. David and P. B. Denes (eds.), Human Communication: A unified view. New York: McGraw-Hill, 67-113. Measuring perceptual distance from a confusion matrix.
Walker, S., Bruce, V., and O'Malley, C. (1995) Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-33. A fascinating demonstration of how top-down knowledge may mediate the McGurk effect.
Warren, R. M. (1970) Perceptual restoration of missing speech sounds. Science, 167, 392-3. The first demonstration of the "phoneme restoration effect."



Sunday, September 30, 2018

Basic Audition

The human auditory system is not a high-fidelity system. Amplitude is compressed;
frequency is warped and smeared; and adjacent sounds may be smeared together.
Because listeners experience auditory objects, not acoustic records like waveforms
or spectrograms, it is useful to consider the basic properties of auditory percep-
tion as they relate to speech acoustics. This chapter starts with a brief discussion
of the anatomy and function of the peripheral auditory system, then discusses
two important differences between the acoustic and the auditory representation
of sound, and concludes with a brief demonstration of the difference between
acoustic analysis and auditory analysis using a computer simulation of auditory
response. Later chapters will return to the topics introduced here as they relate to
the perception of specific classes of speech sounds.

4.1 Anatomy of the Peripheral Auditory System.

The peripheral auditory system (that part of the auditory system not in the
brain) translates acoustic signals into neural signals; and in the course of the trans-
lation, it also performs amplitude compression and a kind of Fourier analysis of
the signal.
Figure 4.1 illustrates the main anatomical features of the peripheral auditory
system (see Pickles, 1988). Sound waves impinge upon the outer ear, and travel
down the ear canal to the eardrum. The eardrum is a thin membrane of skin which is stretched like the head of a drum at the end of the ear canal. Like the
membrane of a microphone, the eardrum moves in response to air pressure
fluctuations.


These movements are conducted by a chain of three tiny bones in the middle
ear to the fluid-filled inner ear. There is a membrane (the basilar membrane)
that runs down the middle of the conch-shaped inner ear (the cochlea). This mem-
brane is thicker at one end than the other. The thin end, which is closest to the
bone chain, responds to high-frequency components in the acoustic signal, while
the thick end responds to low-frequency components. Each auditory nerve fiber
innervates a particular section of the basilar membrane, and thus carries infor-
mation about a specific frequency component in the acoustic signal. In this way,
the inner ear performs a kind of Fourier analysis of the acoustic signal, breaking
it down into separate frequency components.
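
As a loose software analogy (not a model of cochlear mechanics), a discrete Fourier transform performs the same kind of decomposition on a sampled signal, here a compound of a 300 Hz and a 2,500 Hz sine wave with made-up amplitudes:

import numpy as np

fs = 10000                                    # sample rate (Hz), an arbitrary choice
t = np.arange(0, 0.1, 1 / fs)                 # 100 ms of signal
signal = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 2500 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
strongest = sorted(float(f) for f in freqs[np.argsort(spectrum)[-2:]])
print(strongest)                              # [300.0, 2500.0]
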

4.2 The Auditory Sensation of Loudness.

The auditory system imposes a type of automatic volume control via amplitude
compression, and as a result, it is responsive to a remarkable range of sound inten-
sities (see Moore, 1982). For instance, the air pressure fluctuations produced by
thunder are about 100,000 times larger than those produced by a whisper (see
table 4.1).

Table 4.1 A comparison of the acoustic and perceptual amplitudes of some
common sounds. The amplitudes are given in absolute air pressure fluctuation
(micro-Pascals, µPa), acoustic intensity (decibels sound pressure level, dB SPL),
and perceived loudness (sones).


How the inner ear is like a piano.

For an example of what I mean by "is responsive to," consider the way in
which piano strings respond to tones. Here's the experiment: go to your
school's music department and find a practice room with a piano in it. Open
the piano, so that you can see the strings. This works best with a grand or
baby grand, but can be done with an upright. Now hold down the pedal
that lifts the felt dampers from the strings and sing a steady note very loudly.
Can you hear any of the strings vibrating after you stop singing? This experiment usually works better if you are a trained opera singer, but an enthusiastic novice can also produce the effect. Because the loudest sine wave
components of the note you are singing match the natural resonant frequencies
of one or more strings in the piano, the strings can be induced to vibrate
sympathetically with the note you sing. The notion of natural resonant fre-
quency applies to the basilar membrane in the inner ear. The thick part nat-
urally vibrates sympathetically with the low-frequency components of an
incoming signal, while the thin part naturally vibrates sympathetically with
the high-frequency components.

Look at the values listed in the pressure column in the table. For most people,
a typical conversation is not subjectively ten times louder than a quiet office, even
though the magnitudes of their sound pressure fluctuations are. In general, subjective auditory impressions of loudness differences do not match sound pressure
differences. The mismatch between differences in sound pressure and loudness
has been noted for many years. For example, Stevens (1957) asked listeners to adjust
the loudness of one sound until it was twice as loud as another or, in another
task, until the first was half as loud as the second. Listeners' responses were con-
verted into a scale of subjective loudness, the units of which are called sones. The sone scale (for intensities above 40 dB SPL) can be calculated with the formula in equation (4.1), and is plotted in figure 4.2a. (I used the abbreviations "dB" and "SPL" in this sentence. We are on the cusp of introducing these terms formally in this section, but for the sake of accuracy I needed to say "above 40 dB SPL". "Decibel" is abbreviated "dB", and "sound pressure level" is abbreviated "SPL" — further definition of these is imminent.)

The sone scale shows listeners' judgments of relative loudness, scaled so that a
sound about as loud as a quiet office (2,000 µPa) has a value of 1, a sound that is
subjectively half as loud has a value of 0.5, and one that is twice as loud has a
value of 2. As is clear in the figure, the relationship between sound pressure and
loudness is not linear. For soft sounds, large changes in perceived loudness result
from relatively small changes in sound pressure, while for loud sounds, relatively
large pressure changes produce only small changes in perceived loudness. For
example, if peak amplitude changes from 100,000 µPa to 200,000 µPa, the change
in sones is from 10.5 to 16 sones, but a change of the same pressure magnitude
from 2,000,000 µPa to 2,100,000 µPa produces less than a 2 sone change in
loudness (from 64 to 65.9 sones).
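
Equation (4.1) itself appears with figure 4.2, but the arithmetic is easy to sketch. The Python snippet below uses the standard Stevens-style formulation, loudness in sones = 2 raised to the power (dB SPL - 40)/10, together with the 20 µPa reference for dB SPL that is introduced below; treat it as an illustration of the scale rather than a definitive statement of equation (4.1). It reproduces the sone values quoted above.

import math

REF = 20.0  # micro-Pascals: the standard dB SPL reference pressure

def db_spl(pressure_upa):
    """Convert a pressure fluctuation in micro-Pascals to dB SPL."""
    return 20 * math.log10(pressure_upa / REF)

def sones(pressure_upa):
    """Loudness in sones, 2 ** ((dB SPL - 40) / 10), valid above about 40 dB SPL."""
    return 2 ** ((db_spl(pressure_upa) - 40) / 10)

for p in (100_000, 200_000, 2_000_000, 2_100_000):
    print(f"{p:>9} uPa -> {db_spl(p):5.1f} dB SPL -> {sones(p):4.1f} sones")
# Prints roughly 10.5, 16, 64, and 65.9 sones: doubling the pressure of the
# soft sound adds about 5.5 sones, while the same pressure step added to the
# loud sound adds less than 2 sones.
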
Figure 4.2 also shows (in part b) an older relative acoustic intensity scale that
is named after Alexander Graham Bell. This unit, the bel, is too big
for most purposes, and it is more common to use tenths of a bel, or decibels
(abbreviated dB). This easily calculated scale is widely used in auditory phonetics
and psycho-acoustics, because it provides an approximation to the nonlinearity
of human loudness sensation.
As the difference between dB SPL and dB SL implies (see the box on "Decibels"),
perceived loudness varies as a function of frequency. Figure 4.3 illustrates the relationship between subjective loudness and dB SPL. The curve in the figure represents the intensities of a set of tones that have the same subjective loudness as a
1,000 Hz tone presented at 60 dB SPL. The curve is like the settings of a graphic
equalizer on a stereo. The lever on the left side of the equalizer controls the relative amplitude of the lowest-frequency components in the music, while the lever
on the right side controls the relative amplitude of the highest frequencies. This
equal loudness contour shows that you have to amplify the lowest and highest
frequencies if you want them to sound as loud as the middle frequencies
(whether this sounds good is another issue). So, as the figure shows, the auditory
system is most sensitive to sounds that have frequencies between 2 and 5 kHz.
Note also that sensitivity drops off quickly above 10 kHz. This was part of
my motivation for recommending a sampling rate of 22 kHz (11 kHz Nyquist
frequency) for acoustic/phonetic analysis.


Decibels.
Although it is common to express the amplitude of a sound wave in terms
of pressure or, once we have converted acoustic energy into electrical
energy, in volts, the decibel scale is a way of expressing sound amplitude
that is better correlated with perceived loudness. On this scale the relative
loudness of a sound is measured in terms of sound intensity (which is proportional to the square of the amplitude) on a logarithmic scale. Acoustic
intensity is the amount of acoustic power exerted by the sound wave's pressure fluctuation per unit of area. A common unit of measure for acoustic
intensity is Watts per square centimeter (W/cm²).
Consider a sound with average pressure amplitude x. Because acoustic
intensity is proportional to the square of amplitude, the intensity of x relative to a reference sound with pressure amplitude r is x²/r². A bel is the base-10
logarithm of this power ratio, log₁₀(x²/r²), and a decibel is 10 times this:
10 log₁₀(x²/r²). This formula can be simplified to 20 log₁₀(x/r) = dB.
There are two common choices for the reference level r in dB measurements. One is 20 µPa, the typical absolute auditory threshold (lowest audible
pressure fluctuation) of a 1,000 Hz tone. When this reference value is
used, the values are labeled dB SPL (for Sound Pressure Level). The other
common choice has a different reference pressure level
for each frequency. In this method, rather than use the absolute threshold
for a 1,000 Hz tone as the reference for all frequencies, the loudness of a
tone is measured relative to the typical absolute threshold level for a tone
at that frequency. When this method is used, the values are labeled dB SL
(for Sensation Level).
In speech analysis programs, amplitude may be expressed in dB relative
to the largest amplitude value that can be taken by a sample in the digital
speech waveform, in which case the amplitude values are negative numbers;
or it may be expressed in dB relative to the smallest amplitude value that
can be represented in the digital speech waveform, in which case the amplitude values are positive numbers. These choices for the reference level in
the dB calculation are used when it is not crucial to know the absolute
dB SPL value of the signal. For instance, calibration is not needed for
comparative RMS or spectral amplitude measurements.
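
A small Python sketch of the arithmetic in this box, applied to two of the reference choices just described (the specific sample values and the 16-bit full-scale reference are illustrative assumptions, not anything from the text):

import math

def db(x, r):
    """Relative amplitude in decibels: 20*log10(x/r), i.e. 10*log10(x**2 / r**2)."""
    return 20 * math.log10(x / r)

# dB SPL: the reference is 20 micro-Pascals, the nominal threshold of hearing
# for a 1,000 Hz tone.
print(db(2_000, 20))          # 2,000 uPa (a quiet office) -> 40.0 dB SPL

# dB relative to digital full scale: the reference is the largest value a
# sample can take (here 16-bit audio), so measured amplitudes come out negative.
print(db(1_000, 2 ** 15))     # about -30.3 dB re full scale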

4.3 Frequency Response of the Auditory System.

As discussed in section 4.1, the auditory system performs a running Fourier anal-
ysis of incoming sounds. However, this physiological frequency analysis is not the
same as the mathematical Fourier decomposition of signals. The main difference
is that the auditory system's frequency response is not linear. Just as a change of
1,000 µPa in a soft sound is not perceptually equivalent to a similar change in a
loud sound, so a change from 500 to 1,000 Hz is not perceptually equivalent to
a change from 5,000 to 5,500 Hz. This is illustrated in figure 4.4, which shows the
relationship between an auditory frequency scale called the Bark scale (Zwicker,
1961; Schroeder et al., 1979) and acoustic frequency in kHz. Zwicker (1975) showed
that the Bark scale is proportional to a scale of perceived pitch (the Mel scale) and
to distance along the basilar membrane. A tone with a frequency of 500 Hz has
an auditory frequency of 4.9 Bark, while a tone of 1,000 Hz is 8.5 Bark, a differ-
ence of 3.6 Bark. On the other hand, a tone of 5,000 Hz has an auditory frequency
of 19.2 Bark, while one of 5,500 Hz has an auditory frequency of 19.8 Bark, a
difference of only 0.6 Bark. The curve shown in figure 4.4 represents the fact that
the auditory system is more sensitive to frequency changes at the low end of the
audible frequency range than at the high end.
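
The Bark values quoted in this paragraph can be reproduced with the auditory frequency formula published by Schroeder et al. (1979), z = 7 * asinh(f/650). I am using that formula here only as a convenient sketch of the Hz-to-Bark conversion; the exact formula behind figure 4.4 may differ slightly.

import math

def hz_to_bark(f):
    """Auditory frequency in Bark, using the Schroeder et al. (1979) formula."""
    return 7 * math.asinh(f / 650)

for f in (500, 1000, 5000, 5500):
    print(f"{f:>5} Hz -> {hz_to_bark(f):4.1f} Bark")
# 500 and 1,000 Hz come out 3.6 Bark apart, while 5,000 and 5,500 Hz come out
# only 0.6 Bark apart, as in the text.
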
This nonlinearity in the sensation of frequency is related to the fact that the
listener's experience of the pitch of periodic sounds and of the timbre of complex
sounds is largely shaped by the physical structure of the basilar membrane.
Figure 4.5 illustrates the relationship between frequency and location along the
basilar membrane. As mentioned earlier, the basilar membrane is thin at its
base and thick at its apex; as a result, the base of the basilar membrane responds
to high-frequency sounds, and the apex to low-frequency sounds. As figure 4.5
shows, a relatively large portion of the basilar membrane responds to sounds below
1,000 Hz, whereas only a small portion responds to sounds between 12,000 and
13,000 Hz, for example. Therefore, small changes in frequency below 1,000 Hz
are more easily detected than are small changes in frequency above 12,000 Hz.
The relationship between auditory frequency and acoustic frequency shown in
figure 4.4 is due to the structure of the basilar membrane in the inner ear.



4.4 Saturation and Masking.

The sensory neurons in the inner ear and the auditory nerve that respond to sound
are physiological machines that run on chemical supplies (charged ions of sodium
and potassium). Consequently, when they run low on supplies, or are running at
maximum capacity, they may fail to respond to sound as vigorously as usual. This
comes up a lot, actually.
For example, after a short period of silence auditory nerve cell response to a
tone is much greater than it is after the tone has been playing for a little while.
During the silent period, the neurons can fully "recharge their batteries," as they
take on charged positive ions. So when a sound moves the basilar membrane in
the cochlea, the hair cells and the neurons in the auditory nerve are ready to fire.
The duration of this period of greater sensitivity varies from neuron to neuron
but is generally short, maybe 5 to 10 ms. Interestingly, this is about the duration
of a stop release burst, and it has been suggested that the greater sensitivity of
auditory neurons after a short period of silence might increase one's perceptual
acuity for stop release burst information. This same mechanism should make it
possible to hear changing sound properties more generally, because the relevant
silence (as far as the neurons are concerned) is lack of acoustic energy at the
particular center frequency of the auditory nerve cell. So, an initial burst of
activity would tend to decrease the informativeness of steady-state sounds relative
to acoustic variation, whether there has been silence or not.
Another aspect of how sound is registered by the auditory system over time is
the phenomenon of masking. In masking, the presence of one sound makes another,
nearby sound more difficult to hear. Masking has been called a "line busy" effect.
The idea is that if a neuron is firing in response to one sound, and another sound
would tend to be encoded also by the firing of that neuron, then the second sound
will not be able to cause much of an increment in firing, so the system will be
relatively insensitive to the second sound. We will discuss two types of masking
that are relevant for speech perception: "frequency masking" and "temporal
masking."

Figure 4.6 illustrates frequency masking, showing the results of a test that used
a narrow band of noise (the gray band in the figure) and a series of sine wave
tones (the open circles). The masker noise, in this particular illustration, is 90 Hz
wide and centered on 410 Hz, and its loudness is 70 dB SPL. The dots indicate
how much the amplitude must be increased for a tone to be heard in the presence of the masking noise (that is, the dots plot the elevation of threshold level
for the probe tones). For example, the threshold amplitude of a tone of 100 Hz
(the first dot) is not affected at all by the presence of the 410 Hz masker, but a
tone of 400 Hz has to be amplified by 50 dB to remain audible. One key aspect
of the frequency masking data shown in figure 4.6 is called the upward spread
of masking. Tones at frequencies higher than the masking noise show a greater
effect than tones below the masker. So, to hear a tone of 610 Hz (200 Hz higher
than the center of the masking noise) the tone must be amplified 38 dB above
its normal threshold loudness, while a tone of 210 Hz (200 Hz lower than
the center of the masking noise) needs hardly any amplification at all. This illustrates that low-frequency noises will tend to mask the presence of high-frequency
components.

The upward spread of masking: whence and what for?.

There are two things to say about the upward spread of masking. First, it
probably comes from the mechanics of basilar membrane vibration in the
cochlea. The pressure wave of a sine wave transmitted to the cochlea from
the bones of the middle ear travels down the basilar membrane, building
up in amplitude (i.e., displacing the membrane more and more) up to the
location of maximum response to the frequency of the sine wave (see figure
4.5), and then rapidly ceases displacing the basilar membrane. The upshot
is that the basilar membrane at the base of the cochlea, where higher frequencies are registered, is stimulated by low-frequency sounds, while the
apex of the basilar membrane, where lower frequencies are registered, is not
much affected by high-frequency sounds. So, the upward spread of masking
is a physiological by-product of the mechanical operation of this little fluid-filled coil in your head.
"So what?" you may say. Well, the upward spread of masking is used
to compress sound in MP3s. We mentioned audio compression before in
chapter 3 and said that raw audio can be a real bandwidth hog. The MP3
compression standard uses masking, and the upward spread of masking
in particular, to selectively leave frequency components out of compressed
audio. Those bits in the high-frequency tail in figure 4.6, that you wouldn't
be able to hear anyway? Gone: the format saves space by simply leaving out the inaudible bits.

The second type of masking is temporal masking. What happens here is that
sounds that come in a sequence may obscure each other. For example, a short,
soft sound may be perfectly audible by itself, but can be completely obscured if
it closely follows a much louder sound at the same frequency. There are a lot of
parameters to this "forward masking" phenomenon. In ranges that could affect
speech perception, we would note that the masking noise must be greater than
40 dB SPL and the frequency of the masked sound must match (or be a frequency
subset of) the masker. The masking effect drops off very quickly and is of almost
no practical significance after about 25 ms. In speech, we may see a slight effect
of forward masking at vowel offsets (vowels being the loudest sounds in speech).
Backward masking, where the audibility of a soft sound is reduced by a later-occurring
loud sound, is even less relevant for speech perception, though it is
an interesting puzzle how something could reach back in time and affect your
perception of a sound. It isn't really magic, though; just strong signals traveling
through the nervous system more quickly than weak ones.

4.5 Auditory Representations.

In practical terms what all this means is that when we calculate an acoustic power
spectrum of a speech sound, the frequency and loudness scales of the analyzing
device (for instance, a computer or a spectrograph) are not the same as the auditory system's frequency and loudness scales. Consequently, acoustic analyses of
speech sounds may not match the listener's experience. The resulting mismatch
is especially dramatic for sounds like some stop release bursts and fricatives that
have a lot of high-frequency energy and/or sudden amplitude changes. One way
to avoid this mismatch between acoustic analysis and the listener's experience
is to implement a functional model of the auditory system. Some examples of
the use of auditory models in speech analysis are Liljencrants and Lindblom (1972),
Bladon and Lindblom (1981), Johnson (1989), Lyons (1982), Patterson (1976), Moore
and Glasberg (1983), and Seneff (1988). Figure 4.7 shows the difference between
the auditory and acoustic spectra of a complex wave composed of a 500 Hz and
a 1,500 Hz sine wave component. The vertical axis is amplitude in dB, and the
horizontal axis shows frequency in Hz, marked on the bottom of the graph, and
Bark, marked on the top of the graph. I made this auditory spectrum, and others
shown in later figures, with a computer program (Johnson, 1989) that mimics the
frequency response characteristics shown in figure 4.4 and the equal loudness contour shown in figure 4.3. Notice that because the acoustic and auditory frequency
scales are different, the peaks are located at different places in the two representations, even though both spectra cover the frequency range from 0 to 10,000 Hz.
Almost half of the auditory frequency scale covers frequencies below 1,500 Hz,
while this same range covers less than two-tenths of the acoustic display. So, low-frequency
components tend to dominate the auditory spectrum. Notice too that
in the auditory spectrum there is some frequency smearing that causes the peak
at 11 Bark (1,500 Hz) to be somewhat broader than that at 5 Bark (500 Hz). This
spectral-smearing effect increases as frequency increases.
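
The auditory spectra in these figures come from Johnson's (1989) model, which is not reproduced here. The sketch below shows only the simplest part of the idea, warping an ordinary FFT spectrum onto a Bark frequency axis so that, as in figure 4.7, the region below 1,500 Hz spreads over nearly half of the display. It does not model equal-loudness weighting or the widening auditory filters that produce the spectral smearing.

import numpy as np

def hz_to_bark(f):
    # Schroeder et al. (1979) auditory frequency scale, as in the earlier sketch.
    return 7 * np.arcsinh(f / 650)

def bark_warped_spectrum(signal, fs, n_points=100):
    """FFT magnitude spectrum in dB, resampled onto an evenly spaced Bark axis.
    This is only a frequency warp, not a full auditory model."""
    spec_db = 20 * np.log10(np.abs(np.fft.rfft(signal)) + 1e-12)
    barks = hz_to_bark(np.fft.rfftfreq(len(signal), 1 / fs))
    grid = np.linspace(0, barks[-1], n_points)
    return grid, np.interp(grid, barks, spec_db)

# A complex wave with 500 Hz and 1,500 Hz components, as in figure 4.7.
fs = 22000
t = np.arange(0, 0.1, 1 / fs)
wave = np.sin(2 * np.pi * 500 * t) + np.sin(2 * np.pi * 1500 * t)
bark_axis, warped = bark_warped_spectrum(wave, fs)
# About 45 percent of the Bark axis now lies below 1,500 Hz (11 of roughly
# 24.7 Bark), versus under 15 percent of a linear 0-11 kHz axis.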


Figure 4.8 shows an example of the difference between acoustic and auditory
spectra of speech. The acoustic spectra of the release bursts of the clicks in Xhosa
are shown in (a), while (b) shows the corresponding auditory spectra. Like figure 4.7,
this figure shows several differences between acoustic and auditory spectra. First,
the region between 6 and 10 kHz (20.4-24 Bark in the auditory spectra), in which
the clicks do not differ very much, is not very prominent in the auditory spectra.
In the acoustic spectra this insignificant portion takes up two-fifths of the frequency
scale, while it takes up only one-fifth of the auditory frequency scale. This serves
to visually, and presumably auditorily, enhance the differences between the
spectra. Second, the auditory spectra show many fewer local peaks than do
the acoustic spectra. In this regard it should be noted that the acoustic spectra
shown in figure 4.8 were calculated using LPC analysis to smooth them; the FFT
spectra which were input to the auditory model were much more complicated
than these smooth LPC spectra. The smoothing evident in the auditory spectra,
on the other hand, is due to the increased bandwidths of the auditory filters at
high frequencies.
Auditory models are interesting, because they offer a way of looking at the speech
signal from the point of view of the listener. The usefulness of auditory models
in phonetics depends on the accuracy of the particular simulation of the peri-
pheral auditory system. Therefore, the illustrations in this book were produced by
models that implement only well-known, and extensively studied, nonlinearities
in auditory loudness and frequency response, and avoid areas of knowledge that
are less well understood for complicated signals like speech.
These rather conservative auditory representations suggest that acoustic ana-
lyses give only a rough approximation to the auditory representations that
listeners use in identifying speech sounds.
Recall from chapter 3 that digital spectrograms are produced by encoding
spectral amplitude in a series of FFT spectra as shades of gray in the spectrogram.
This same method of presentation can also be used to produce auditory spec-
trograms from sequences of auditory spectra. Figure 4.9 shows an acoustic
spectrogram and an auditory spectrogram of the Cantonese word [kai] "chicken" (see
figure 3.20). To produce this figure, I used a publicly available auditory model
(Lyons' cochlear model: Lyons, 1982; Slaney, 1988), which can be found at
http://linguistics.berkeley.edu/phonlab/resources/. The auditory spectrogram, which
is also called a cochleagram, combines features of auditory spectra and spectrograms.
As in a spectrogram, the simulated auditory response is represented with
spectral amplitude plotted as shades of gray, with time on the horizontal axis
and frequency on the vertical axis. Note that although the same frequency range
(0-11 kHz) is covered in both displays, the movements of the lowest concentrations
of spectral energy in the vowel are much more visually pronounced in the
cochleagram because of the auditory frequency scale.
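
The cochleagram in figure 4.9 was made with Lyons' cochlear model; the following is a much cruder stand-in that simply reuses the Bark warp from the earlier sketch, applying it frame by frame to get a spectrogram-like array with an auditory frequency axis. It assumes hz_to_bark and bark_warped_spectrum from that sketch are in scope, and the frame and step sizes are illustrative choices.

import numpy as np
# Assumes hz_to_bark and bark_warped_spectrum from the earlier sketch.

def bark_spectrogram(signal, fs, frame_ms=20, step_ms=5):
    """Stack Bark-warped spectra of successive Hanning-windowed frames into a
    rough cochleagram-style array (rows = time, columns = Bark). This is only
    a display warp, not Lyons' cochlear model."""
    frame = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    window = np.hanning(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame + 1, step)]
    return np.array([bark_warped_spectrum(fr, fs)[1] for fr in frames])

# Plotted as a gray-scale image (time on one axis, Bark on the other), the
# low-frequency movements of the signal spread over more of the display,
# much as in figure 4.9.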


Recommended Reading
Bladon, A. and Lindblom, B. (1981) Modeling the judgment of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414-22. Using an auditory model to predict vowel perception results.
Brödel, M. (1946) Three Unpublished Drawings of the Anatomy of the Human Ear, Philadelphia: Saunders. The source of figure 4.1.
Johnson, K. (1989) Contrast and normalization in vowel perception. Journal of Phonetics, 18, 229-54. Using an auditory model to predict vowel perception results.
Liljencrants, J. and Lindblom, B. (1972) Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839-62. Using an auditory model to predict cross-linguistic patterns in vowel inventories.
Lyons, R. F. (1982) A computational model of filtering, detection and compression in the cochlea. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1282-5. A simulation of the transduction of acoustic signals into auditory nerve signals. Slaney's (1988) implementation is used in this book to produce "cochleagrams."
Moore, B. C. J. (1982) An Introduction to the Psychology of Hearing, 2nd edn., New York: Academic Press. A comprehensive introduction to the behavioral measurement of human hearing ability (auditory psychophysics).
Moore, B. C. J. and Glasberg, B. R. (1983) Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74, 750-3. Description of the equivalent rectangular bandwidth (ERB) auditory frequency scale, showing how to calculate simulated auditory spectra from this information.
Patterson, R. D. (1976) Auditory filter shapes derived from noise stimuli. Journal of the Acoustical Society of America, 59, 640-54. Showing how to calculate simulated auditory spectra using an auditory filter bank.
Pickles, J. O. (1988) An Introduction to the Physiology of Hearing, 2nd edn., New York: Academic Press. An authoritative and fascinating introduction to the physics and chemistry of hearing, with a good review of auditory psychophysics as well.
Schroeder, M. R., Atal, B. S., and Hall, J. L. (1979) Objective measure of certain speech signal degradations based on masking properties of human auditory perception. In B. Lindblom and S. Öhman (eds.), Frontiers of Speech Communication Research, London: Academic Press, 217-29. Measurement and use of the Bark frequency scale.
Seneff, S. (1988) A joint synchrony/mean-rate model of auditory speech processing. Journal of Phonetics, 16, 55-76.
Slaney, M. (1988) Lyons' cochlear model. Apple Technical Report 13, Apple Corporate Library, Cupertino, CA. A publicly distributed implementation of Richard Lyons' (1982) simulation of hearing.
Stevens, S. S. (1957) Concerning the form of the loudness function. Journal of the Acoustical Society of America, 29, 603-6. An example of classic 1950s-style auditory psychophysics in which the sone scale was first described.
Zwicker, E. (1961) Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Journal of the Acoustical Society of America, 33, 248. An early description of the Bark scale of auditory critical frequency bands.
Zwicker, E. (1975) Scaling. In W. D. Keidel and W. D. Neff (eds.), Auditory System: Physiology (CNS), Behavioral Studies, Psychoacoustics, Berlin: Springer-Verlag. An overview of various frequency and loudness scales found in auditory psychophysics research.