Abstract. Defining ‘‘emotion’’ is a notorious problem. Without consensual
conceptualization and operationalization of exactly what phenomenon is to be studied,
progress in theory and research is difficult to achieve and fruitless debates are likely to
proliferate. A particularly unfortunate example is William James’s asking the
question ‘‘What is an emotion?’’ when he really meant ‘‘feeling’’, a misnomer that
started a debate which is still ongoing, more than a century later. This contribution
attempts to sensitize researchers in the social and behavioral sciences to the importance
of definitional issues and their consequences for distinguishing related but
fundamentally different affective processes, states, and traits. Links between scientific
and folk concepts of emotion are explored and ways to measure emotion and its
components are discussed.
One of the major drawbacks of social science research is the need to resort to everyday language concepts in both theory and empirical investigation. The inherent fuzziness and the constant evolution of these language categories as well as inter-language, inter-cultural, and inter-individual differences make it difficult to define central working concepts in the universal, invariant, and consensual fashion generally required by a systematic scientific approach. Isolated attempts to artificially create more appropriate concepts that are unaffected by the multiple connotations of natural language terms (e.g. Cattell’s attempt to create a new taxonomy of personality
traits using synthetic labels; Cattell, 1990) seem doomed to failure, not only because of the difficulty of obtaining widespread consensus in the scientific community but also because of the need of much of social science to work with lay persons’ self-report, which makes it mandatory to employ lay or naive concepts. The concept of ‘‘emotion’’ presents a particularly thorny problem. Even though the term is used very frequently, to the point of being extremely fashionable these days, the question ‘‘What is an emotion?’’ rarely generates the same answer from different individuals, scientists or laymen alike. William James tried to give an authoritative answer in 1884, but only started a continuing debate which is currently finding renewed vigor (Niedenthal et al., 2005). The number of scientific definitions proposed has grown to the point where counting seems quite hopeless (Kleinginna and Kleinginna already reviewed more than one hundred in 1981). In frustration, scientists have attempted to have recourse to the analysis of the everyday use of the folk concepts: emotions are what people say they are (e.g. Averill, 1980; Frijda et al., 1995). However, as the debate in this journal, following the report of the first quasi-representative study of emotional experience (Scherer et al., 2004; Scherer, 2004a), has shown, scholars from different disciplines in
the humanities and the social and behavioral sciences rarely agree on how to use this evidence. While this kind of conceptual and definitional discussion can have a stimulating effect in the short run, it can have stifling consequences for the advancement in the field and for collaborative research between different disciplines. At a time when it is increasingly recognized that affective and emotional phenomena need to be addressed in a genuinely interdisciplinary fashion (see the Handbook of the Affective Sciences; Davidson et al., 2003b), it becomes imperative to generate a minimal consensus about the defining features of the different types of affective phenomena. In this piece I do not systematically review these issues. Rather, I want to describe and defend a programmatic statement of a component process definition of emotion that I first proposed in 1982 in this journal (Scherer, 1982; see also Scherer, 1984a, 2001). Mention of ‘‘componential theories of emotion’’ is quite widespread today and the notion of emotions as component processes seems to gain increasing acceptance. Following a brief description of the component process definition, I examine what the defining characteristics of emotion are and how these differ from other affect states. In addition, I explore the problem of linking folk concepts of emotion to a scientific, component process conceptualization. Finally, I discuss how emotions can best be measured empirically and introduce two new instruments.
A component process definition of emotion and feeling.
In the framework of the component process model, emotion is defined as an episode of interrelated, synchronized changes in the states of all or most of the five organismic subsystems in response to the evaluation of an external or internal stimulus event as relevant to major concerns of the organism (Scherer, 1987, 2001). The components of an emotion episode are the respective states of the five subsystems and the process consists of the coordinated changes over time. Table 1 shows the relation between components and subsystems as well as presumed substrata and functions. Three of the components have long-standing status as modalities of emotion – expression, bodily symptoms and arousal, and subjective experience. The elicitation of action tendencies and the preparation of action have also been implicitly associated with emotional arousal (e.g. fight–flight tendencies) but it is only after explicit inclusion of these motivational consequences in componential theories (and Frijda’s forceful claim for the emotion-differentiating function of action tendencies, see Frijda, 1986, 1987), that these important features of emotion episodes have acquired the status of a major component in their own right. The inclusion of a cognitive, information processing component, as I have suggested above, is less consensual. Many theorists still prefer to see emotion and cognition as two independent but interacting systems. However, one can argue that all subsystems underlying emotion components function independently much of the time and that the special nature of emotion as a hypothetical construct consists of the coordination and synchronization of all of these systems during an emotion episode, driven by appraisal (Scherer, 2004b).
TABLE 1
Relationships between organismic subsystems and the functions and components of
emotion.
How can emotions, as defined above, be distinguished from other affective phenomena such as feelings, moods, or attitudes? Let us take the term feeling first. As shown in Table 1, the component process model reserves the use of this term for the subjective emotional experience component of emotion, presumed to have an important monitoring and regulation function. In fact, it is suggested that ‘‘feelings integrate the central representation of appraisal-driven response organization in emotion’’ (Scherer, 2004b), thus reflecting the total pattern of cognitive appraisal as well as motivational and somatic response patterning that underlies the subjective experience of an emotional episode. Using the term feeling, a single component denoting the subjective experience process, as a synonym for emotion, the total multi-modal component process, produces serious confusions and hampers our understanding of the phenomenon. In fact, it can be argued that the long-standing debate generated by William James’s peripheral theory of emotion is essentially due to James’s failure to make this important distinction: when in 1884 he asked ‘‘What is an emotion?’’, he really meant ‘‘What is a feeling?’’ (see Scherer, 2000a).
Using a design feature approach to distinguish emotion from other affective phenomena.
Having clarified the distinction between emotion and feeling, it remains to differentiate emotion (with feeling as one of its components) from other types of affective phenomena. Instances or tokens of these types, which can vary in degree of affectivity, are often called ‘‘emotions’’ in the literature (or at least implicitly assimilated with the concept). Examples are liking, loving, cheerful, contemptuous, or anxious. I have suggested four such types of affective phenomena that should be distinguished from emotion proper, although there may be some overlap in the meaning of certain words: preferences, attitudes, affective dispositions, and interpersonal stances. How can we differentially define these phenomena in comparison to emotion? The difficulty of differentiating emotion from other types of affective phenomena is reminiscent of a similar problem in defining the specificity of language in comparison with other types of communication systems, human or animal. The anthropological linguist Charles Hockett made a pioneering effort to define 13 elementary design features of communication systems, such as semanticity, arbitrariness, or discreteness, that can be used for the profiling of different types of communication, allowing him to specify the unique nature of language (Hockett, 1960; see summary in Hauser, 1996: 47–8).
I suggest that we use some of the elements of the definition of emotion suggested above for such a distinction. These elements or features can be seen as equivalent to design features in Hockett’s sense. They will now be described in detail.
Event focus.
The definition given above suggests that emotions are generally elicited by stimulus events. By this term I mean that something happens to the organism that stimulates or triggers a response after having been evaluated for its significance. Often such events will consist of natural phenomena like thunderstorms or the behavior of other people or animals that may have significance for our wellbeing. In other cases, one’s own behavior can be the event that elicits emotion, as in the case of pride, guilt, or shame. In addition to such events that are more or less external to the organism, internal events are explicitly considered as emotion elicitors by the definition. These could consist of sudden neuroendocrine or physiological changes or, more typically, of memories or images that might come to our mind. These recalled or imagined representations of events can be sufficient to generate strong emotions (see also the debate between Goldie, 2004, Parkinson, 2004, and Scherer, 2004a, in this journal). The need for emotions to be somehow connected to or anchored in a specific event, external or internal, rather than being free-floating, resulting from a strategic or intentional decision, or existing as a permanent feature of an individual, constitutes the event focus design feature.
Appraisal driven.
A central aspect of the component process definition of emotion is that the eliciting event and its consequences must be relevant to major concerns of the organism. This seems rather obvious as we do not generally get emotional about things or people we do not care about. In this sense, emotions can be seen as relevance detectors (Frijda, 1986; Scherer, 1984a). Componential theories of emotion generally assume that the relevance of an event is determined by a rather complex yet very rapidly occurring evaluation process that can occur on several levels of processing ranging from automatic and implicit to conscious conceptual or propositional evaluations (Leventhal and Scherer, 1987; van Reekum and Scherer, 1997). The component process model postulates that different emotions are produced by a sequence of cumulative stimulus evaluation or appraisal checks with emotion-specific outcome profiles (Ellsworth and Scherer, 2003; Scherer, 1984a, 1993, 2001). For the purposes of design feature analysis I suggest distinguishing between intrinsic and transactional (extrinsic) appraisal. Intrinsic appraisal evaluates the features of an object or person independently of the current needs and goals of the appraiser, based on genetic (e.g. sweet taste) or learned (e.g. bittersweet food) preferences (see Scherer, 1987, 1988). Transactional appraisal (see Lazarus, 1968, 1991) evaluates events and their consequences with respect to their conduciveness for salient needs, desires, or goals of the appraiser. The design features event focus and appraisal basis are linked, highlighting the adaptational functions of the emotions, helping to prepare appropriate behavioral reactions to events with potentially important consequences.
Response synchronization.
This design feature of the proposed emotion definition is also implied by the adaptational functions of emotion. If emotions prepare appropriate responses to events, the response patterns must correspond to the appraisal analysis of the presumed implications of the event. Given the importance of the eliciting event, which disrupts the flow of behavior, all or most of the subsystems of the organism must contribute to response preparation. The resulting massive mobilization of resources must be coordinated, a process which can be described as response synchronization (Scherer, 2000b, 2001). I believe that this is in fact one of the most important design features of emotion, one that in principle can be operationalized and measured empirically.
Rapidity of change.
Events, and particularly their appraisal, change rapidly, often because of new information or due to re-evaluations. As appraisal drives the patterning of the responses in the interest of adaptation, the emotional response patterning is also likely to change rapidly as a consequence. While we are in the habit of talking about ‘‘emotional states’’ these are rarely steady states. Rather, emotion processes are undergoing constant modification allowing rapid readjustment to changing circumstances or evaluations.
Behavioral impact.
Emotions prepare adaptive action tendencies and their motivational underpinnings. In this sense they have a strong effect on emotion-consequent behavior, often interrupting ongoing behavior sequences and generating new goals and plans. In addition, the motor expression component of emotion has a strong impact on communication which may also have important consequences for social interaction.
Intensity.
Given the importance of emotions for behavioral adaptation, one can assume the intensity of the response patterns and the corresponding emotional experience to be relatively high, suggesting that this may be an important design feature in distinguishing emotions from moods, for example.
Duration.
Conversely, as emotions imply massive response mobilization and synchronization as part of specific action tendencies, their duration must be relatively short in order not to tax the resources of the organism and to allow behavioral flexibility. In contrast, low-intensity moods that have little impact on behavior can be maintained for much longer periods of time without showing adverse effects.
Following Hockett’s example of characterizing different animal and human communication systems with the help of a set of design features, Table 2 shows an attempt to specify the profiles of different affective phenomena and the emotion design features described above (the table shows a revised version of the matrix first proposed in Scherer, 2000c). Based on these assumptions, one can attempt as follows to differentially define affective phenomena in distinguishing them from emotions.
1) Preferences. Relatively stable evaluative judgments in the sense of liking or disliking a stimulus, or preferring it or not over other objects or stimuli, should be referred to as preferences. By definition, stable preferences should generate intrinsic appraisal (intrinsic pleasantness check), independently of current needs or goals, although the latter might modulate the appraisal (Scherer, 1988). The affective states produced by encountering attractive or aversive stimuli (event focus) are stable and of relatively low intensity, and do not produce pronounced response synchronization. Preferences generate unspecific positive or negative feelings, with low behavioral impact except tendencies towards approach or avoidance.
2) Attitudes. Relatively enduring beliefs and predispositions towards specific objects or persons are generally called attitudes. Social psychologists have long identified three components of attitudes (see Breckler, 1984): a cognitive component (beliefs about the attitude object), an affective component (consisting mostly of differential valence), and a motivational or behavioral component (a stable action tendency with respect to the object, e.g. approach or avoidance). Attitude objects can be things, events, persons, and groups or categories of individuals. Attitudes do not need to be triggered by event appraisals although they may become more salient when encountering or thinking of the attitude object. The affective states induced by a salient attitude can be labeled with terms such as hating, valuing, or desiring. Intensity and response synchronization are generally weak and behavioral tendencies are often overridden by situational constraints. While it may seem prosaic, I suggest treating love as an interpersonal attitude with a very strong positive affect component rather than an emotion. The notion of loving someone seems to imply a long-term affective disposition rather than a brief episodic feeling, although thoughts of or the interaction with a loved person can produce strong and complex emotions, based on intrinsic and transactional appraisal and characterized by strong response synchronization. This is an example of how more stable affect dispositions can make the occurrence of an emotion episode more likely as well as introducing specific response patterns and feeling states.
3) Mood. Emotion psychologists have often discussed the difference between mood and emotion (e.g. Frijda, 2000). Generally, moods are considered as diffuse affect states, characterized by a relatively enduring predominance of certain types of subjective feelings that affect the experience and behavior of a person. Moods may often emerge without apparent cause that could be clearly linked to an event or specific appraisals. They are generally of low intensity and show little response synchronization, but may last over hours or even days. Examples are being cheerful, gloomy, listless, depressed, or buoyant.
4) Affect dispositions. Many stable personality traits and behavior tendencies have a strong affective core (e.g. nervous, anxious, irritable, reckless, morose, hostile, envious, jealous). These dispositions describe the tendency of a person to experience certain moods more frequently or to be prone to react with certain types of emotions, even upon slight provocation. Not surprisingly, certain terms like irritable or anxious can describe both affect dispositions as well as momentary moods or emotions and it is important to specify whether the respective term is used to qualify a personality disposition or an episodic state. Affect dispositions also include emotional pathology; while being in a depressed mood is quite normal, being always depressed may be a sign of an affective disturbance, including a clinical syndrome of depression requiring medical attention.
5) Interpersonal stances. The specificity of this category is that it is characteristic of an affective style that spontaneously develops or is strategically employed in the interaction with a person or a group of persons, coloring the interpersonal exchange in that situation (e.g. being polite, distant, cold, warm, supportive, contemptuous). Interpersonal stances are often triggered by events, such as encountering a certain person, but they are less shaped by spontaneous appraisal than by affect dispositions, interpersonal attitudes, and, most importantly, strategic intentions. Thus, when an irritable person encounters a disliked individual there may be a somewhat higher probability of the person adopting an interpersonal stance of hostility in the interaction as compared to an agreeable person. Yet it seems important to distinguish this affective phenomenon from other types, because of its specific instantiation in an interpersonal encounter and the intentional, strategic character that may characterize the affective style used throughout the interaction.
So far, I have pitted emotions against other types of affective phenomena. Recently (Scherer, 2004c), I have suggested the need to distinguish between different types of emotions: aesthetic emotions and utilitarian emotions. The latter correspond to the common garden-variety of emotions usually studied in emotion research such as anger, fear, joy, disgust, sadness, shame, guilt. These types of emotions can be considered utilitarian in the sense of facilitating our adaptation to events that have important consequences for our wellbeing. Such adaptive functions are the preparation of action tendencies (fight, flight), recovery and reorientation (grief, work), motivational enhancement (joy, pride), or the creation of social obligations (reparation). Because of their importance for survival and wellbeing, many utilitarian emotions are high-intensity emergency reactions, involving the synchronization of many organismic subsystems, as described above. In the case of aesthetic emotions, the functionality for an immediate adaptation to an event that requires the appraisal of goal relevance and coping potential is absent or much less pronounced. Kant defined aesthetic experience as ‘‘interesseloses Wohlgefallen’’ (disinterested pleasure; Kant, 2001), highlighting the complete absence of utilitarian considerations. Thus, the aesthetic experience of a work of visual art or a piece of music is not shaped by the appraisal of the work’s ability to satisfy my bodily needs, further my current goals or plans, or correspond to my social values. Rather, aesthetic emotions are produced by the appreciation of the intrinsic qualities of the beauty of nature, or the qualities of a work of art or an artistic performance. Examples of such aesthetic emotions are being moved or awed, being full of wonder, admiration, bliss, ecstasy, fascination, harmony, rapture, solemnity. The absence of utilitarian functions in aesthetic emotions does not mean that they are disembodied.
Music and many other forms of art can be demonstrated to produce physiological and behavioral changes (Bartlett, 1999; Scherer and Zentner, 2001). However, these bodily changes are not in the service of behavioral readiness or the preparation of specific, adaptive action tendencies (Frijda, 1986). For example, the most commonly reported bodily symptoms for intense aesthetic experiences are goose pimples, shivers, or moist eyes – all rather diffuse responses which contrast strongly with the arousal and action-oriented responses for many utilitarian emotions.
Exploring the semantic space of folk concepts of emotion.
How many emotions are there? I submit that there is currently no answer to this question. Proponents of discrete emotion theories, inspired by Darwin, have suggested different numbers of so-called basic emotions (Ekman, 1972, 1992; Izard, 1971, 1992; Tomkins, 1962, 1984). Most of these are utilitarian emotions as defined above and play an important role in adapting to frequently occurring and prototypically patterned types of significant events in the life of organisms. In consequence, emotions like anger, fear, joy, and sadness are relatively frequently experienced (with anger and joy outranking all others; see the quasi-representative actuarial survey reported by Scherer et al., 2004). Given the aspects of frequency and prototypicality, I have suggested calling these emotions
modal rather than basic, given that there is little consensus as to the meaning and criteria for how basic is to be defined (Scherer, 1994). Obviously, the small number of basic or modal emotions (something between 6 and 14 depending on the theorists) is hardly representative for the range of human (or possibly even animal) emotionality. I have argued (Scherer, 1984a) that there are as many different emotions as there are distinguishably different profiles of appraisal with corresponding response patterning. Using the definition proposed above, in particular the necessary criterion of response synchronization, the number of different emotions could be determined empirically. However, this proposal is only of academic interest as, in addition to conceptual problems such as the criterion for a sufficient level of response synchronization, problems of access to a vast range of emotional episodes and measurement problems render such an empirical assessment impossible. I suggest that we need to have recourse to the study of folk concepts of emotion in order to make headway on the question of the number and nature of discriminable types of emotions. If, in the evolution of languages, certain types of distinctions between different types of emotional processes have been considered important enough for communication to generate different words or expressions, social and behavioral scientists should consider these distinctions worthy of study. Not surprisingly, different scholars have made efforts to do just that (Levy, 1984; Lutz, 1988; Russell, 1991; Russell et al., 1995; Wierzbicka, 1999). The problem is to map the fuzzy and complex semantic fields of the folk emotion concepts onto the scientific construct definitions. 
This is particularly important as in distinguishing emotions the task is not to identify common semantic primitives (as suggested by Wierzbicka, 1999) but to examine fine-grained differences, spanning all of the components of the respective emotion processes, to grasp the specificity of the processes referenced by the respective terms. While dictionary definitions of emotion labels in different languages, as well as thesaurus entries, may be useful, reflecting the learned intuitions of the language experts responsible for the respective entries, this approach is neither sufficiently comprehensive nor consensual enough to be appropriate for scientific profiling of emotion terms. I submit that the design feature approach outlined above can be profitably used to establish semantic profiles of folk concepts of emotions represented by emotion terms from natural languages. Concretely, emotion terms can be rated by native speakers of different natural languages with respect to a number of items for each of the design features. For example, one can ask participants in such a study to imagine a person whose emotional experience at a particular point in time is consensually described by observers as ‘‘irritated’’. Then raters are asked to evaluate the typical eliciting and response characteristics that would warrant the description of the person’s emotional state with this label. This would include items on the eliciting event, the type of appraisal the person is likely to have made of the event and its consequences, the response patterns in the different components, and the behavioral impact (action tendencies) generated, as well as the intensity and duration of the associated experience. Table 3 shows an example of such a semantic grid based on a design feature approach. For each of four domains, respondents have to indicate how a typical person would appraise and respond to a typical eliciting event for a given affect label. 
The items relative to appraisal dimensions were adapted from the Geneva Appraisal Questionnaire (GAQ – see References) and items on response characteristics were modeled on a questionnaire used in two large-scale collaborative studies on cross-cultural similarities and differences in emotional experience (Scherer and Wallbott, 1994; Scherer et al., 1986). Semantic grid profiles for different emotion terms allow, at least if there is reasonable agreement between raters (in the sense of interrater reliability) the definition of the semantic field, the meaning, of an emotion term in the respective language. In addition to allowing the examination of subtle differences in the meanings of different emotion terms and providing similarity-of-profile data that can be used to statistically determine the relationships between members of emotion families and the overall structure of the semantic space for emotions, such data for different languages inform us about potential cultural and linguistic differences in emotion encoding. This aspect, apart from the scientific interest (Breugelmans et al., 2005; Fontaine et al., 2002), is of great value in ensuring comparability of instruments in intercultural studies.
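The similarity-of-profile idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the ratings are invented, the four unnamed grid items and the 1 to 5 scale are hypothetical, and mean pairwise correlation between raters is used as a simple stand-in for a formal interrater reliability coefficient. The function names `mean_profile` and `rater_agreement` are not part of any published instrument.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long rating vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mean_profile(ratings):
    """Average each grid item across raters (rows = raters, columns = items)."""
    return [mean(column) for column in zip(*ratings)]


def rater_agreement(ratings):
    """Mean pairwise correlation between raters (a rough agreement index)."""
    return mean(pearson(r1, r2) for r1, r2 in combinations(ratings, 2))


# Hypothetical data: three raters score each term on four grid items (1-5).
ratings = {
    "irritated": [[4, 2, 5, 1], [5, 2, 4, 1], [4, 1, 5, 2]],
    "angry": [[5, 1, 5, 1], [5, 2, 5, 1], [4, 1, 4, 2]],
    "sad": [[1, 5, 2, 4], [2, 4, 1, 5], [1, 5, 2, 5]],
}

# One mean profile per emotion term, then profile similarity for each pair.
profiles = {term: mean_profile(r) for term, r in ratings.items()}
similarity = {
    (a, b): pearson(profiles[a], profiles[b])
    for a in profiles
    for b in profiles
    if a < b
}
agreement = {term: rater_agreement(r) for term, r in ratings.items()}
```

With data of this kind, terms from the same emotion family should show highly correlated profiles, and the resulting similarity matrix could serve as input to cluster analysis or multidimensional scaling.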
How can emotions be measured?
If one accepts the definition of emotion outlined above, there is no single gold-standard method for its measurement. Rather, given the component process nature of the phenomenon, only convergent measurement via assessment of all component changes involved can provide a comprehensive measure of an emotion. In other words, in an ideal world of science, we would need to measure (1) the continuous changes in appraisal processes at all levels of central nervous system processing (i.e. the results of all of the appraisal checks, including their neural substrata), (2) the response patterns generated in the neuroendocrine, autonomic, and somatic nervous systems, (3) the motivational changes produced by the appraisal results, in particular action tendencies (including the neural signatures in the respective motor command circuits), (4) the patterns of facial and vocal expression as well as body movements, and (5) the nature of the subjectively experienced feeling state that reflects all of these component changes. Needless to say, such comprehensive measurement of emotion has never been performed and is unlikely to become standard procedure in the near future. However, there have been major advances in recent years with respect to measuring individual components such as appraisal (Scherer et al., 2001), brain mechanisms (Davidson et al., 2003a), physiological response patterns (Stemmler, 2003), and expressive behavior (Harrigan et al., 2005). While both nonverbal behavior (e.g. facial and vocal expression) and physiological indicators can be used to infer the emotional state of a person, there are no objective methods of measuring the subjective experience of a person during an emotion episode. 
Given the definition of feeling as a subjective cognitive representation, reflecting a unique experience of mental and bodily changes in the context of being confronted with a particular event, there is no access other than to ask the individual to report on the nature of the experience. In many cases researchers provide participants with more or less standardized lists of emotion labels with different kinds of answer formats to obtain information on the qualitative nature of the affective state experienced. However, the use of fixed response alternatives, while ensuring efficiency and standardization of data collection, has several serious disadvantages. One of the major ones is the possibility that one or several response alternatives can ‘‘prime’’ participants, i.e. suggest responses that they might not have chosen otherwise. The opposite problem is that a participant might want to respond with a category that is not provided in the list, thus forcing the person to respond with the closest alternative, or, if provided, with a residual category such as ‘‘other’’, with the specificity and accuracy of the data suffering in both cases. Even if one of the categories provided corresponds to the state experienced by the participant, he or she may not be familiar with the label chosen by the researcher, being used to referring to the affective state with a near synonym, for example, a more popular or slang expression (e.g. jittery in the place of anxious).
Free response measurement of emotional feeling – the Geneva Affect Label Coder.
To avoid such problems, researchers sometimes choose to use a free-response format, asking participants to respond with freely chosen labels or short expressions that in their mind best characterize the nature of the state they experienced. This is not a panacea as some participants, especially those who do not normally attempt to label and communicate their emotional responses, may have problems coming up with appropriate labels. In addition, one can expect individual differences in the range of the active vocabulary which may constrain the responses of some respondents. However, in general the advantages in specificity and accuracy of the responses and the elimination of the priming artifact would seem to privilege the use of a free-response format in cases in which maximal accuracy and a fine-grained resolution of the affect description are sought. Unfortunately, this advantage is compromised by the fact that it is generally impossible to analyze free responses in a quantitative, statistical fashion as their number is often extremely high and the response frequency per label extremely low. In consequence, researchers generally sort free responses into a more limited number of emotion categories, using notions of family resemblances and synonyms. To date, there is neither an established procedure for sorting free-response labels or expressions into a smaller number of overarching categories nor agreement as to the number and nature of a standard set of emotion categories. In general, researchers will determine a list of emotion categories in an eclectic fashion or based on a particular theory and then ask coders to classify free responses with more or less explicit coding instructions and more or less concern for reliability.
In the interest of the comparability and cumulativeness of findings from different studies, it seems desirable to develop a standard list of emotion categories to be regularly employed in research using free-response report of subjective feeling states and to use a reliable, standardized coding procedure. In this article, I suggest a pragmatic solution, the Geneva Affect Label Coder (GALC), based on an Excel macro program that attempts to recognize 36 affective categories commonly distinguished by words in natural languages and parses text databases for these terms and their synonyms (as based on established thesauri). I will briefly describe the development of the instrument in the context of a large-scale event sampling study of emotional experiences published in this journal (Scherer et al., 2004), where pertinent results are reported. As the instrument was intended for use in a wide variety of emotion-inducing contexts, I decided to choose a rather extensive list of semantic categories that index different types of affect-related experiences covering emotions, moods, and other types of transitory affect states (see the design feature approach discussed above). The 36 categories shown in Table 4 were chosen on the basis of both empirical grounds (occurring in a quasi-representative population survey of what respondents freely report when asked which emotion they experienced yesterday) and published surveys of emotion terms in different languages (Averill, 1975; Gehm and Scherer, 1988; Russell, 1983). An additional criterion for selection of a category was the existence of empirical research or theoretical discussion on specific differentiable states. The category terms shown in Table 4 have been chosen as category descriptors on the assumption that they denote the central meaning of a fuzzy category that is implied by a much larger number of established words or popular expressions, including metaphors.
The underlying assumption of the current approach is that the occurrence in verbal reports of any label or expression considered as being part of the family of affective states (denoted by an overarching category label) can be taken as evidence for the presence of a feeling state that is closely associated with the fuzzy category identified by the central concept. I selected the terms that constitute synonyms, near synonyms, or related emotion family members of the category labels based on extensive comparison of dictionary and thesaurus entries in English, German, and French. As Table 4 shows, each category, represented by the first term in the row, is indexed by a number of roots for adjectives or nouns denoting a related emotional state. Admittedly, the grouping of the related terms is currently based on my own judgment on the basis of the literature. The results of semantic grid studies, as described above, will allow the use of sophisticated cluster analysis and multidimensional scaling programs to empirically determine the well-foundedness of these linguistic intuitions. The program GALC, which incorporates look-up tables like the one shown in Table 4 for English, French, and German, allows searching for the occurrences of the indexed word stems in ASCII text files. Based on the presence of the respective word stems, the occurrence of one or two emotion categories will be determined by the program (the detection of two different categories indicating potential ambivalence or the presence of emotion blends). The program, consisting of an Excel file containing a macro parser program, can be freely downloaded for research use (see References).
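The stem look-up principle just described can be illustrated with a minimal sketch. It is not the GALC macro itself: the Python function `code_response`, the table `STEMS`, and the four abridged categories with their stems are hypothetical stand-ins for the 36-category look-up tables, chosen only to show how word stems in a free response might be mapped onto at most two categories.

```python
import re

# Hypothetical, heavily abridged look-up table: category label -> word stems.
# The stems below are illustrative assumptions, not the contents of Table 4.
STEMS = {
    "anger": ["anger", "angr", "furious", "irritat", "rage"],
    "anxiety": ["anxi", "jitter", "nervous", "worr"],
    "joy": ["joy", "delight", "elat", "cheer"],
    "sadness": ["sad", "sorrow", "grief", "gloom"],
}


def code_response(text, max_categories=2):
    """Return up to `max_categories` categories whose stems occur in `text`.

    Categories are listed in the order in which they are first detected;
    two detected categories may indicate ambivalence or a blend.
    """
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    for word in words:
        for category, stems in STEMS.items():
            if category in found:
                continue
            # A word counts as a hit when it begins with one of the stems.
            if any(word.startswith(stem) for stem in stems):
                found.append(category)
    return found[:max_categories]


if __name__ == "__main__":
    print(code_response("I felt jittery and then furious"))
```

Simple prefix matching of this kind will produce false positives for unrelated words that happen to share a stem, which is one reason why the grouping of terms needs empirical validation.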
Forced choice response measurement of feeling – the Geneva Emotion Wheel.
In many cases, especially those involving highly controlled experimental paradigms, the use of the free-response format is contraindicated, especially when fine-grained scalar measurement on a few standard feeling states is required for the purpose of comparison between experimental groups. Psychologists have used two major methods to obtain forced-choice self-reports of emotional experience:
(1) the discrete emotions approach, and (2) the dimensional approach. The first, the discrete emotions approach, goes back to the origin of language and the emergence of words and expressions describing clearly separable states. The approach has a venerable scientific history in the sense that since the dawn of behavioral science philosophers have used emotion words to analyze human emotional experience. Darwin (1998) has made this approach palatable for the biological and social sciences in showing the evolutionary continuity of a set of ‘‘basic emotions’’ and identifying observable physiological and expressive symptoms that accompany them. The discrete emotions approach relies on the categorization that is reflected in the organization of the semantic fields for emotion in natural languages. The justification for accepting the structure provided by language is the fact that the language-based categories seem to correspond to unique response patterns, i.e. emotion-category specific patterns of facial and vocal expressions as well as physiological response profiles.
Given the primary role of natural language categories for emotions as reflected by emotion words, the method of assessing self-report used by researchers adopting the discrete emotions approach is the use of scales with nominal, ordinal, or interval characteristics. Generally the researcher provides the respondent with a list of emotion terms and the latter is alternatively asked (1) to check terms that best describe the emotion experienced (nominal scale), (2) to indicate on a 3- to 5-point scale whether the respective emotion was experienced a little, somewhat, or strongly (ordinal scale), or (3) to use an analog scale to indicate how much an emotion has been experienced (e.g. on an underlying dimension from 0 to 100 – interval scale). Methods vary on whether respondents are to respond on only the most pertinent emotion scale, to respond on two or more scales to indicate possible blends, or to respond to all scales in a list (replying with none or 0 for categories that are not at all appropriate to describe the experience). While there are some standardized instruments of this kind (e.g. Izard’s Differential Emotion Scale; Izard, 1991), most investigators prefer to create ad hoc lists of emotion categories that seem relevant in a specific research context. While the results obtained with this approach are highly plausible and easily interpretable (given that widely shared language labels are used), there are serious problems of comparability of results across different studies in which widely different sets of emotion labels are used. Furthermore, the statistical analysis of these data suffers from the problem of an abundance of missing data (all scales with 0 or none as values) and the difficulty of analyzing and interpreting an extraordinary number of different blends of emotion (Scherer, 1998; Scherer and Ceschi, 2000).
The second method, the dimensional approach, was pioneered by Wilhelm Wundt (1905) who attempted to develop a structural description of subjective feeling as it is accessible through introspection. He suggested that these subjective feelings can be described by their position in a three-dimensional space formed by the dimensions of valence (positive–negative), arousal (calm–excited), and tension (tense–relaxed). Wundt believed that the mental phenomenon of feeling, as described by these three dimensions, covaried with measurable states of the body such as, for example, physiological arousal. Wundt’s suggestion has had an extraordinary impact, both on the measurement of feeling (e.g. Schlosberg, 1954) and on the emotional connotations of language concepts in general (e.g. Osgood et al., 1957). Given the difficulty of consistently distinguishing a third dimension (such as tension, control, or potency) from arousal or excitation, many modern dimensional theorists limit themselves to the valence and arousal dimensions, sometimes suggesting circular structures as most adapted to mapping emotional feelings into this two-dimensional space (Russell, 1983). Concretely, the methodology used in this approach consists in asking a respondent how positive or negative and how excited or aroused he or she feels (either in two separate steps or by providing a two-dimensional surface and asking the respondent to determine the appropriate position). In consequence, the emotional feeling of the person is described by a point in this valence-arousal space. This method of obtaining self-report of emotional feeling is simple and straightforward and generally quite reliable. It also lends itself to advanced statistical processing since interval scaling can be used quite readily. On the other hand, the results are restricted to the degrees of positive or negative feeling and of bodily excitation.
Most importantly, contrary to the discrete emotions approach, there is very little information on the type of event that has produced the emotion and the appraisal processes underlying the responses. One of the major drawbacks of this approach is the difficulty of knowing whether the valence dimension describes the intrinsic quality of an eliciting object or the quality of the feeling (which need not coincide). Even more importantly, it is difficult to differentiate the aspect of intensity of feeling from bodily excitation. Thus, extremely intensive anger is likely to be characterized by high arousal whereas intense sadness may be accompanied by very low arousal. Which of these two approaches is preferable? Until now, researchers have rarely specified why they chose one method over another. Generally, methodological choice has followed theoretical convictions as to the degree of differentiatedness of the emotion system that psychologists need to adopt to understand and predict emotional responses. However, one can apply more systematic criteria to justify particular choices. For example, how should one best describe the differences between two individuals who have just experienced an emotion as compared to differentiating between the feelings of the same person at different points in time? After all, psychological measurement is generally interested in describing differences between individuals or between states over time. Specifically, which are more comparable: two individuals who share the same point in valence-arousal space or two individuals who use the same word to describe their feelings? Chances are that two individuals who use the same verbal descriptor have more similar emotions than those sharing a point in semantic space. This can be easily demonstrated by the fact that both very fearful and very angry persons would be in a similar region of the two-dimensional space – negatively valenced high arousal (see Figure 1).
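The point about overlapping regions can be made concrete with a small numerical sketch. The coordinates below are invented purely for illustration (they are not data from any study), valence and arousal are assumed to be scaled from -1 to 1, and the names `reports` and `distance` are hypothetical.

```python
import math

# Hypothetical self-reports: (verbal label, valence, arousal).
# All coordinates are invented for illustration only.
reports = {
    "person_a": ("fear", -0.8, 0.8),
    "person_b": ("anger", -0.7, 0.9),
    "person_c": ("sadness", -0.7, -0.6),
}


def distance(first, second):
    """Euclidean distance between two reports in valence-arousal space."""
    _, v1, a1 = reports[first]
    _, v2, a2 = reports[second]
    return math.hypot(v1 - v2, a1 - a2)


if __name__ == "__main__":
    # Fear and anger carry different labels yet lie close together,
    # while sadness is far away on the arousal axis only.
    print(round(distance("person_a", "person_b"), 3))
    print(round(distance("person_a", "person_c"), 3))
```

Under these assumed values the fearful and the angry respondent are nearly indistinguishable by position, although their verbal labels point to quite different eliciting events and appraisals.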
While such regions in two-dimensional space can show sizeable overlap, verbal labels often uniquely identify major elements of the eliciting event (at least in terms of appraisal dimensions) as well as the integrated representation of response patterns. One of the potential shortcomings of dimensional approaches based on valence and arousal is that both dimensions are quite ambiguous. As mentioned above, it is often not clear whether a valence judgment (pleasant or unpleasant) concerns the appraisal of the nature of the stimulus object or event or rather the feeling induced by it. Similarly, arousal or activation ratings may refer to perceived activation in a situation (or image) or to the proprioceptive feeling of physiological arousal induced by the stimulus event. This ambiguity often exists even when the instructions given to participants clearly specify the meaning -- which is not always the case. If arousal ratings are meant to measure induced physiological activation or excitement, there is the additional problem that this interoception is often erroneous (Vaitl, 1996). Another criterion is the communicability of emotional states between individuals. To describe the coordinates of an individual’s position in valence-arousal space is unlikely to provide much information to others, including a researcher who is ignorant of the eliciting situation. Similarly, while some researchers may find it sufficient to know about valence or arousal, others may need more specific information on emotional experience to make reliable inferences. It is surprising that, given the central role of emotion self-report in this research area, there have been few attempts to develop new instruments that avoid some of the shortcomings of the existing approaches. In what follows I describe such an effort. The design characteristics for the instrument to be developed are as follows:
+ concentrating on the feeling component of emotion, in the sense of qualia, rather than asking respondents to judge concrete response characteristics such as sympathetic arousal;
+ going beyond a simple valence-arousal space in order to be better able to differentiate qualitatively different states that share the same region in this space;
+ relying on standard emotion labels in natural languages in order to capitalize on respondents’ intuitive understanding of the semantic field;
+ allowing systematic assessment of the intensity of the feeling;
+ going beyond the arbitrariness of choosing different sets of emotion terms and presenting them in very unsystematic fashion by building some emotion structure into the instrument;
+ presenting the instrument in a graphical form that is user-friendly, allowing the respondent to rapidly understand the principle and use the instrument in a reliable fashion.
Starting with the last point, I decided to use appraisal dimensions (or stimulus evaluation checks) to impose structure on the emotion categories (as described by natural language labels) to be used in the instrument. If one adopts the notion that emotions are elicited and differentiated by appraisal, then the structure of the emotion system should be largely determined by the major appraisal dimensions. As shown by numerous studies, the appraisal dimensions that seem to have the strongest impact on emotion differentiation are goal conduciveness (including valence) and coping potential (control/power). In consequence, I decided to arrange a number of frequently used and theoretically interesting emotion categories in a two-dimensional space formed by goal conduciveness vs goal obstructiveness on the one hand and high vs low control/power (reflecting the coping potential appraisal check) on the other. It is expected that different emotion terms can be appropriately arranged on these dimensions. Figure 1 shows an illustration. The graph shows the mapping of the terms Russell (1983) uses as markers for his claim of an emotion circumplex in two-dimensional valence by activity/arousal space (upper-case terms). Onto this representation I superimposed the two-dimensional structure based on similarity ratings of 80 German emotion terms (+, lower-case terms, translated to English) from an earlier study that demonstrated the justification for the assumption that semantic space may be organized by appraisal criteria (see Scherer, 1984b: 47–55). The plus (+) signs indicate the exact location of the terms in a two-dimensional space. Quite surprisingly, this simple superposition yields a remarkably good fit.
It also shows that adding additional terms makes Russell’s circumplex less of an obvious structural criterion – to obtain a perfect circle in a multidimensional scaling analysis seems to require the inclusion of non-emotion terms, as in the case of ‘‘sleepy, tired, and droopy’’ to mark the low arousal pole (as implicitly acknowledged by Russell himself; Russell, 1991: 439). More importantly for the present purposes, a 45° rotation of the axes corresponds rather nicely to an explanation of the distribution of the terms in a two-dimensional space formed by goal conduciveness and coping potential. As argued above, verbal report measures the component of subjectively experienced feeling. Feelings that are members of any one specific emotion family can be expected to vary most among each other with respect to intensity (e.g. irritation–anger–rage), which, as argued above, may correlate with but is not the same as physiological arousal. It was therefore decided to map the intensity dimension as the distance of an emotion category’s position in the goal conduciveness-coping potential space from the origin (see also Reisenzein, 1994; Russell, 1980: 1170). In line with the attempt to create a graphically intuitive presentation, members of each emotion family were represented as a set of circles with increasing circumference (comparable to a spike in a wheel). In the interest of the ease of reading, the number of emotion families was limited to 4 per quadrant, yielding a total of 16 (which seems reasonable considering that the upper limit of the number of ‘‘basic emotions’’ is often considered to be around 14). The choice of the concrete families was also in large part determined by what are generally considered to be either basic or fundamental emotions or those frequently studied in the field. Figure 2 shows the prototype of this instrument which because of its origin and shape has been called the Geneva Emotion Wheel (GEW).
In this first version of the GEW, presented on a computer screen, all members of an emotion family were identified by a specific label, which became visible when moving the mouse across a circle. First attempts at validation of the instrument (Baenziger et al., 2005) showed that it is difficult to reproduce the theoretically predicted intensity scaling of the terms on some of the ‘‘spikes’’ in the wheel. In consequence, in more recent versions of the GEW we have abandoned the effort to label intermediate intensities with different labels for members of the same emotion family. Rather, only the family as a whole is specified, asking participants to rate the intensity of an experienced or imagined emotion on the basis of the distance from the hub of the wheel and the size of the circles. The study of the reliability and validity of the instrument continues. Researchers interested in using the instrument can download a copy of the computer program or a paper-and-pencil version (see References). While further improvements seem possible, we feel that the GEW attains some of the aims outlined above and constitutes a useful addition to the methods toolbox in emotion research. While several instruments have been proposed that ask judges to conjointly evaluate two dimensions, such as valence and arousal (Cowie et al., 2000; Russell et al., 1989) or pleasantness and unpleasantness (Larsen et al., 2004), the Geneva Emotion Wheel may be the first such instrument to design the dimensional layout of the emotion qualities on pure appraisal dimensions (arrangement of emotion terms in two-dimensional space) and the intensity of the associated subjective feeling (distance from origin).
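The geometry of the wheel can be sketched in a few lines. This is only an illustration under stated assumptions, not the GEW program: it assumes 16 spokes spaced evenly around the hub, a horizontal goal conduciveness axis and a vertical coping potential axis, and a counter-clockwise rotation; the function names `locate` and `rotate` are hypothetical, and no assignment of particular emotion families to particular spokes is implied.

```python
import math

N_FAMILIES = 16  # 4 emotion families per quadrant, as described in the text
SECTOR = 2 * math.pi / N_FAMILIES  # assumed: equal angular width per family


def locate(x, y):
    """Map a point to (spoke index, intensity).

    x is assumed to run from goal obstructive (-) to goal conducive (+),
    y from low (-) to high (+) control/power. Intensity is the distance
    of the point from the hub (origin) of the wheel.
    """
    intensity = math.hypot(x, y)
    angle = math.atan2(y, x) % (2 * math.pi)
    index = int(angle // SECTOR) % N_FAMILIES
    return index, intensity


def rotate(x, y, degrees=45.0):
    """Rotate a point counter-clockwise about the origin.

    The direction of the 45 degree rotation is an assumption made here.
    """
    t = math.radians(degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))


if __name__ == "__main__":
    print(locate(3.0, 4.0))
    print(rotate(1.0, 0.0))
```

Because a rotation leaves distances from the origin unchanged, intensity defined as distance from the hub is the same whether a point is expressed on the original or on the rotated axes; only the angular position, and hence the spoke, changes.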
Conclusions.
The definition of emotions, distinguishing them from other affective states or traits, and measuring them in a comprehensive and meaningful way have been a constant challenge for emotion researchers in different disciplines of the social and behavioral sciences over a long period of time. I have no illusion about the fact that this contribution will be little more than a drop in an ocean of writing about these topics. Definitions cannot be proven. They need to be consensually considered as useful by a research community in order to guide research, make research comparable across laboratories and disciplines, and allow some degree of cumulativeness, and they are quite central for the development of instruments and measurement operations – as well as for the communication of results and the discussion between scientists. If this article, following the discussion of some of these issues in the wake of our actuarial study of Swiss emotions in this journal (Scherer et al., 2004), can help to at least raise the consciousness of the need for progress in this domain, it will have fulfilled its purpose.
---------------------------------------
Klaus Scherer studied economics and social sciences at the University of Cologne,
the London School of Economics and Harvard University (PhD 1970). After
teaching at the University of Pennsylvania, the University of Kiel and the University
of Giessen, he has been full professor of psychology at the University of
Geneva since 1985. He is the director of the recently established Swiss Centre
for Affective Sciences. His teaching and research activities focus on the nature
and function of emotion, in particular the study of cognitive appraisal of emotion-
eliciting events, and of facial and vocal emotion expression. His numerous
publications include monographs, contributed chapters and papers in international
journals. He has edited several collected volumes and handbooks, and
co-edits a book series on ‘‘Affective Science’’ for Oxford University Press. He is
the founding co-editor (with R. Davidson) of the journal Emotion. Author’s address:
Department of Psychology, University of Geneva, 40, Bd du Pont d’Arve, CH-1205
Geneva, Switzerland. [email: Klaus.Scherer@pse.unige.ch]
References
Averill, J.R. (1975) ‘‘A Semantic Atlas of Emotional Concepts’’, JSAS Catalog of
Selected Documents in Psychology 5, 330. (Ms. No. 421)
Averill, J.R. (1980) ‘‘A Constructivist View of Emotion’’, in R. Plutchik and
H. Kellerman (eds) Emotion: Vol. 1. Theory, Research, and Experience, pp. 305–40.
New York: Academic Press.
Baenziger, T., Tran, V. and Scherer, K.R. (2005) ‘‘The Emotion Wheel. A Tool for the
Verbal Report of Emotional Reactions’’, poster presented at the conference of the
International Society of Research on Emotion, Bari, Italy.
Bartlett, D.L. (1999) ‘‘Physiological Responses to Music and Sound Stimuli’’, in
D.A. Hodges (ed.) Handbook of Music Psychology, 2nd edn, pp. 343–85. San
Antonio, CA: IMR.
Breckler, S.J. (1984) ‘‘Empirical Validation of Affect, Behavior, and Cognition as
Distinct Components of Attitude’’, Journal of Personality and Social Psychology
47: 1191–205.
Breugelmans, S.M., Poortinga, Y.H., Ambadar, Z., Setiadi, B., Vaca, J.B. and
Widiyanto, P. (2005) ‘‘Body Sensations Associated with Emotions in Rarámuri
Indians, Rural Javanese, and Three Student Samples’’, Emotion 5: 166–74.
Cattell, R.B. (1990) ‘‘Advances in Cattellian Personality Theory’’, in L. A. Pervin (ed.)
Handbook of Personality: Theory and Research, pp. 101–10. New York: Guilford.
Cowie, R., Douglas-Cowie, E., Savvidou, S., McMahon, E., Sawey, M. and
Schröder, M. (2000) ‘‘FEELTRACE: An Instrument for Recording Perceived
Emotion in Real Time’’, in Proceedings of the ISCA Workshop on Speech and
Emotion, pp. 19–24. Belfast: Textflow.
Darwin, C. (1998) The Expression of Emotions in Man and Animals, ed. P. Ekman.
London: HarperCollins. (Orig. published 1872.)
Davidson, R.J., Pizzagalli, D., Nitschke, J.B. and Kalin, N.H. (2003a) ‘‘Parsing the
Subcomponents of Emotion and Disorders of Emotion: Perspectives from Affective
Neuroscience’’, in R. J. Davidson, K.R. Scherer and H. Goldsmith (eds) Handbook
of the Affective Sciences, pp. 8–24. New York and Oxford: Oxford University Press.
Davidson, R.J., Scherer, K.R. and Goldsmith, H., eds (2003b) Handbook of the Affective
Sciences. New York and Oxford: Oxford University Press.
Ekman, P. (1972) ‘‘Universals and Cultural Differences in Facial Expression of Emotion’’,
in J.R. Cole (ed.) Nebraska Symposium on Motivation, Vol. 19, pp. 207–83.
Lincoln: University of Nebraska Press.
Ekman, P. (1992) ‘‘An Argument for Basic Emotions’’, Cognition and Emotion 6(3/4):
169–200.
Ellsworth, P.C. and Scherer, K.R. (2003) ‘‘Appraisal Processes in Emotion’’, in
R.J. Davidson, H. Goldsmith and K.R. Scherer (eds) Handbook of the Affective
Sciences, pp. 572–95. New York and Oxford: Oxford University Press.
Fontaine, J.R.J., Poortinga, Y.H., Setiadi, B. and Markam, S.S. (2002) ‘‘Cognitive
Structure of Emotion Terms in Indonesia and The Netherlands’’, Cognition and
Emotion 16(1): 61–86.
Frijda, N.H. (1986) The Emotions. Cambridge: Cambridge University Press.
Frijda, N. H. (1987) ‘‘Emotion, Cognitive Structure, and Action Tendency’’, Cognition
and Emotion 1: 115–43.
Frijda, N.H. (2000) ‘‘The Psychologist’s Point of View’’, in M. Lewis and
J.M. Haviland-Jones (eds) Handbook of Emotions, 2nd edn, pp. 59–74. New York:
Guilford.
Frijda, N.H., Markam, S., Sato, K. and Wiers, R. (1995) ‘‘Emotions and Emotion
Words’’, in J. A. Russell, J.M. Fernandez-Dols, A. S. R. Manstead and
J. C. Wellenkamp (eds) Everyday Conceptions of Emotion: An Introduction to the
Psychology, Anthropology and Linguistics of Emotion, Vol. 81, pp. 121–43.
Dordrecht: Kluwer Academic Publishers.
Gehm, Th. and Scherer, K.R. (1988) ‘‘Factors Determining the Dimensions of
Subjective Emotional Space’’, in K.R. Scherer (ed.) Facets of Emotion: Recent
Research, pp. 99–114. Hillsdale, NJ: Erlbaum.
Geneva Affect Label Coder (GALC); available under Research Tools at
http://www.unige.ch/fapse/emotion
Geneva Appraisal Questionnaire (GAQ); available under Research Tools at
http://www.unige.ch/fapse/emotion
Geneva Emotion Wheel (GEW); available under Research Tools at
http://www.unige.ch/fapse/emotion
Goldie, P. (2004) ‘‘The Life of the Mind: Commentary on ‘Emotions in Everyday
Life’ ’’, Social Science Information 43(4): 591–8.
Harrigan, J., Rosenthal, R. and Scherer, K.R., eds (2005) The New Handbook of
Methods in Nonverbal Behavior Research. Oxford: Oxford University Press.
Hauser, M.D. (1996) The Evolution of Communication. Cambridge, MA: MIT Press.
Hockett, C.F. (1960) ‘‘The Origin of Speech’’, Scientific American 203: 88–96.
Izard, C.E. (1971) The Face of Emotion. New York: Appleton-Century-Crofts.
Izard, C.E. (1991) The Psychology of Emotions. New York: Plenum Press.
Izard, C.E. (1992) ‘‘Basic Emotions, Relations Among Emotions, and Emotion–
Cognition Relations’’, Psychological Review 99: 561–5.
James, W. (1884) ‘‘What Is An Emotion?’’, Mind 9: 188–205.
Kant, I. (2001) Kritik der Urteilskraft. Hamburg: Meiner. (Orig. published 1790.)
Kleinginna, P.R. and Kleinginna, A.M. (1981) ‘‘A Categorized List of Emotion Definitions
with Suggestions for a Consensual Definition’’, Motivation and Emotion 5:
345–79.
Larsen, J.T., Norris, C.J. and Cacioppo, J.T. (2004) ‘‘The Evaluative Space Grid:
A Single-item Measure of Positive and Negative Affect’’, unpublished manuscript,
Texas Tech University, Lubbock, TX.
Lazarus, R.S. (1968) ‘‘Emotions and Adaptation: Conceptual and Empirical Relations’’,
in W.J. Arnold (ed.) Nebraska Symposium on Motivation Vol. 16, pp. 175–
270. Lincoln, NE: University of Nebraska Press.
Lazarus, R.S. (1991) Emotion and Adaptation. New York: Oxford University Press.
Leventhal, H. and Scherer, K.R. (1987) ‘‘The Relationship of Emotion to Cognition:
A Functional Approach to a Semantic Controversy’’, Cognition and Emotion 1:
3–28.
Levy, R.I. (1984) ‘‘The Emotions in Comparative Perspective’’, in K.R. Scherer and
P. Ekman (eds) Approaches to Emotion, pp. 397–410. Hillsdale, NJ: Erlbaum.
Lutz, C. (1988) Unnatural Emotions: Everyday Sentiments on a Micronesian Atoll and
their Challenge to Western Theory. Chicago, IL: University of Chicago Press.
Niedenthal, P.M., Barsalou, L.W., Winkielman, P., Krauth-Gruber, S. and Ric, F.
(2005) ‘‘Embodiment in Attitudes, Social Perception, and Emotion’’, Personality
and Social Psychology Review 9: 184–211.
Osgood, C.E., Suci, G.J. and Tannenbaum, P.H. (1957) The Measurement of
Meaning. Urbana: University of Illinois Press.
Parkinson, B. (2004) ‘‘Auditing Emotions: What Should We Count?’’, Social Science
Information 43(4): 633–45.
Reisenzein, R. (1994) ‘‘Pleasure-arousal Theory and the Intensity of Emotions’’,
Journal of Personality and Social Psychology 67: 525–39.
Russell, J. A. (1980) ‘‘A Circumplex Model of Affect’’, Journal of Personality and
Social Psychology 39: 1161–78.
Russell, J.A. (1983) ‘‘Pancultural aspects of the human conceptual organization of
emotions’’, Journal of Personality and Social Psychology 45: 1281–8.
Russell, J.A. (1991) ‘‘Culture and the Categorization of Emotions’’, Psychological
Bulletin 110: 426–50.
Russell, J.A., Fernandez-Dols, J.M., Manstead, A.S.R. and Wellenkamp, J.C., eds
(1995) Everyday Conceptions of Emotion: An Introduction to the Psychology,
Anthropology and Linguistics of Emotion. Dordrecht: Kluwer Academic Publishers.
Russell, J. A., Weiss, A. and Mendelsohn, G. A. (1989) ‘‘Affect Grid: A Single-item
Scale of Pleasure and Arousal’’, Journal of Personality and Social Psychology 57:
493–502.
Scherer, K.R. (1982) ‘‘Emotion as a Process: Function, Origin, and Regulation’’,
Social Science Information 21: 555–70.
Scherer, K.R. (1984a) ‘‘On the Nature and Function of Emotion: A Component
Process Approach’’, in K.R. Scherer and P. Ekman (eds) Approaches to Emotion,
pp. 293–317. Hillsdale, NJ: Erlbaum.
Scherer, K.R. (1984b) ‘‘Emotion as a Multicomponent Process: A Model and Some
Cross-Cultural Data’’, in P. Shaver (ed.) Review of Personality and Social
Psychology, Vol. 5, pp. 37–63. Beverly Hills, CA: Sage.
Scherer, K.R. (1987) ‘‘Toward a Dynamic Theory of Emotion: The Component Process
Model of Affective States’’, Geneva Studies in Emotion and Communication 1:
1–98; available at: http://www.unige.ch/fapse/emotion/genstudies/genstudies.html
Scherer, K.R. (1988) ‘‘Criteria for Emotion-Antecedent Appraisal: A Review’’, in
V. Hamilton, G.H. Bower and N.H. Frijda (eds) Cognitive Perspectives on Emotion
and Motivation, pp. 89–126. Dordrecht: Kluwer.
Scherer, K. R. (1993) ‘‘Studying the Emotion-Antecedent Appraisal Process: An
Expert System Approach’’, Cognition and Emotion 7: 325–55.
Scherer, K.R. (1994) ‘‘Toward a Concept of ‘Modal Emotions’’’, in P. Ekman and
R.J. Davidson (eds) The Nature of Emotion: Fundamental Questions, pp. 25–31.
New York and Oxford: Oxford University Press.
Scherer, K.R. (1998) ‘‘Analyzing Emotion Blends’’, in A. Fischer (ed.) Proceedings
of the 10th Conference of the International Society for Research on Emotions,
pp. 142–8. Würzburg: ISRE Publications.
Scherer, K.R. (2000a) ‘‘Emotion’’, in M. Hewstone and W. Stroebe (eds) Introduction
to Social Psychology: A European Perspective, 3rd edn, pp. 151–91. Oxford:
Blackwell.
Scherer, K.R. (2000b) ‘‘Emotions as Episodes of Subsystem Synchronization Driven
by Nonlinear Appraisal Processes’’, in M.D. Lewis and I. Granic (eds) Emotion,
Development, and Self-Organization: Dynamic Systems Approaches to Emotional
Development, pp. 70–99. New York and Cambridge: Cambridge University Press.
Scherer, K.R. (2000c) ‘‘Psychological Models of Emotion’’, in J. Borod (ed.) The
Neuropsychology of Emotion, pp. 137–62. Oxford and New York: Oxford University
Press.
Scherer, K.R. (2001) ‘‘Appraisal Considered as a Process of Multi-Level Sequential
Checking’’, in K.R. Scherer, A. Schorr and T. Johnstone (eds) Appraisal Processes
in Emotion: Theory, Methods, Research, pp. 92–120. New York and Oxford: Oxford
University Press.
Scherer, K.R. (2004a) ‘‘Ways to Study the Nature and Frequency of Our Daily Emotions:
Reply to the Commentaries on ‘Emotions in Everyday Life’ ’’, Social Science
Information 43(4): 667–89.
Scherer, K.R. (2004b) ‘‘Feelings Integrate the Central Representation of Appraisal-
Driven Response Organization in Emotion’’, in A.S.R. Manstead, N.H. Frijda
and A.H. Fischer (eds) Feelings and Emotions: The Amsterdam Symposium,
pp. 136–57. Cambridge: Cambridge University Press.
Scherer, K.R. (2004c) ‘‘Which Emotions Can Be Induced by Music? What Are the
Underlying Mechanisms? And How Can We Measure Them?’’, Journal of New
Music Research 33(3): 239–51.
Scherer, K.R. and Ceschi, G. (2000) ‘‘Studying Affective Communication in the Airport:
The Case of Lost Baggage Claims’’, Personality and Social Psychology Bulletin
26(3): 327–39.
Scherer, K.R. and Wallbott, H.G. (1994) ‘‘Evidence for Universality and Cultural
Variation of Differential Emotion Response Patterning’’, Journal of Personality
and Social Psychology 66(2): 310–28.
Scherer, K.R. and Zentner, M.R. (2001) ‘‘Emotional Effects of Music: Production
Rules’’, in P. N. Juslin and J. A. Sloboda (eds) Music and Emotion: Theory and
Research, pp. 361–92. Oxford: Oxford University Press.
Scherer, K.R., Schorr, A. and Johnstone, T., eds (2001) Appraisal Processes in Emotion:
Theory, Methods, Research. New York and Oxford: Oxford University Press.
Scherer, K.R., Wallbott, H.G. and Summerfield, A.B., eds (1986) Experiencing Emotion:
A Crosscultural Study. Cambridge: Cambridge University Press.
Scherer, K.R., Wranik, T., Sangsue, J., Tran, V. and Scherer, U. (2004) ‘‘Emotions in
Everyday Life: Probability of Occurrence, Risk Factors, Appraisal and Reaction
Patterns’’, Social Science Information 43(4): 499–570.
Schlosberg, H. (1954) ‘‘Three Dimensions of Emotion’’, Psychological Review 61:
81–8.
Stemmler, G. (2003) ‘‘Methodological Considerations in the Psychophysiological
Study of Emotion’’, in R.J. Davidson, K.R. Scherer and H. Goldsmith (eds) Handbook
of the Affective Sciences, pp. 225–55. New York and Oxford: Oxford University
Press.
Tomkins, S.S. (1962) Affect, Imagery, Consciousness: Vol. 1. The Positive Affects.
New York: Springer.
Tomkins, S.S. (1984) ‘‘Affect Theory’’, in K.R. Scherer and P. Ekman (eds)
Approaches to Emotion, pp. 163–196. Hillsdale, NJ: Erlbaum.
Vaitl, D. (1996) ‘‘Interoception’’, Biological Psychology 42(1–2): 1–27.
Van Reekum, C.M. and Scherer, K.R. (1997) ‘‘Levels of Processing for Emotion-
Antecedent Appraisal’’, in G. Matthews (ed.) Cognitive Science Perspectives on
Personality and Emotion, pp. 259–300. Amsterdam: Elsevier Science.
Wierzbicka, A. (1999) Emotions Across Languages and Cultures. Cambridge:
Cambridge University Press.
Wundt, W. (1905) Grundzüge der physiologischen Psychologie. Leipzig: Engelmann.
Speech Perception
When you listen to someone speaking you generally focus on understanding their
meaning. One famous (in linguistics) way of saying this is that "we speak in order
to be heard, in order to be understood" (Jakobson et al., 1952). Our drive, as
listeners, to understand the talker leads us to focus on getting the words being
said, and not so much on exactly how they are pronounced. But sometimes a
pronunciation will jump out at you: somebody says a familiar word in an unfamiliar way and you just have to ask "Is that how you say that?" When we listen
to the phonetics of speech — to how the words sound and not just what they mean
— we as listeners are engaged in speech perception.
In speech perception, listeners focus attention on the sounds of speech and notice
phonetic details about pronunciation that are often not noticed at all in normal
speech communication. For example, listeners will often not hear, or not seem
to hear, a speech error or deliberate mispronunciation in ordinary conversation,
but will notice those same errors when instructed to listen for mispronunciations
(see Cole, 1973).
Testing mispronunciation detection.
As you go about your daily routine, try mispronouncing a word every now
and then to see if the people you are talking to will notice. For instance, if
the conversation is about a biology class you could pronounce it "biolochi."
After saying it this way a time or two you could tell your friends about your
little experiment and ask if they noticed any mispronounced words. Do people
notice mispronunciation more in word-initial position or in medial position?
With vowels more than consonants? In nouns and verbs more than in grammatical
words? How do people look up words in their mental dictionary if
they don't notice when a sound has been mispronounced? Evidently, looking up words in the mental lexicon is a little different from looking up words
in a printed dictionary (try entering "biolochi" in Google). Do you find that
your friends think you are strange when you persist in mispronouncing words
on purpose?
So, in this chapter we're going to discuss speech perception as a phonetic mode
of listening, in which we focus on the sounds of speech rather than the words.
An interesting problem in phonetics and psycholinguistics is to find a way of measuring how much phonetic information listeners take in during normal conversation, but in this book we can limit our focus to the phonetic mode of listening.
5.1 Auditory Ability Shapes Speech Perception.
As we saw in chapter 4, speech perception is shaped by general properties of
the auditory system that determine what can and cannot be heard, what cues will
be recoverable in particular segmental contexts, and how adjacent sounds will
influence each other. For example, we saw that the cochlea's nonlinear frequency
scale probably underlies the fact that no language distinguishes fricatives on the
basis of frequency components above 6,000 Hz.
Two other examples illustrate how the auditory system constrains speech perception. The first example has to do with the difference between aspirated and
unaspirated stops. This contrast is signaled by a timing cue that is called the "voice
onset time" (abbreviated as VOT). VOT is a measure (in milliseconds) of the
delay of voicing onset following a stop release burst. There is a longer delay in
aspirated stops than in unaspirated stops — so in aspirated stops the vocal folds are
held open for a short time after the oral closure of the stop has been released.
That's how the short puff of air in voiceless aspirated stops is produced. It has
been observed that many languages have a boundary between aspirated and unaspirated stops at about 30 ms VOT. What is so special about a 30 ms delay between
stop release and onset of voicing?
Here's where the auditory system comes into play. Our ability as hearers
to detect the nonsimultaneous onsets of tones at different frequencies probably
underlies the fact that the most common voice onset time boundary across languages is at about ±30 ms. Consider two pure tones, one at 500 Hz and the other
at 1,000 Hz. In a perception test (see, for example, the research studies by Pisoni,
1977, and Pastore and Farrington, 1996), we combine these tones with a small
onset asynchrony — the 500 Hz tone starts 20 ms before the 1,000 Hz tone. When
we ask listeners to judge whether the two tones were simultaneous or whether
one started a little before the other, we discover that listeners think that tones
separated by a 20 ms onset asynchrony start at the same time. Listeners don't
begin to notice the onset asynchrony until the separation is about 30 ms. This
parallelism between nonspeech auditory perception and a cross-linguistic phonetic
universal leads to the idea that the auditory system's ability to detect onset asynchrony is probably a key factor in this cross-linguistic phonetic property.
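To make the two-tone test concrete, here is a minimal Python sketch of how such a stimulus might be built. The 500 Hz and 1,000 Hz tones and the 20 ms onset asynchrony come from the description above; the sample rate, the 200 ms duration, and the function name two_tone_stimulus are assumptions made only for illustration, not details of the cited studies.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed; not given in the text)


def two_tone_stimulus(lead_hz=500.0, lag_hz=1000.0, asynchrony_ms=20.0,
                      duration_ms=200.0, sample_rate=SAMPLE_RATE):
    """Return the samples of two summed pure tones.

    The lagging tone stays silent until asynchrony_ms has elapsed.
    """
    n_total = int(round(sample_rate * duration_ms / 1000.0))
    n_delay = int(round(sample_rate * asynchrony_ms / 1000.0))
    delay_s = asynchrony_ms / 1000.0
    samples = []
    for n in range(n_total):
        t = n / sample_rate
        value = math.sin(2 * math.pi * lead_hz * t)
        if n >= n_delay:
            # The lagging tone starts at zero phase at its own onset.
            value += math.sin(2 * math.pi * lag_hz * (t - delay_s))
        samples.append(0.5 * value)  # halve so the sum stays within [-1, 1]
    return samples


stimulus = two_tone_stimulus()
```

Setting asynchrony_ms to 30 would give the separation at which, according to the passage above, listeners begin to notice that the onsets differ.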
Example number two: another general property of the auditory system is probably at work in the perceptual phenomenon known as "compensation for coarticulation." This effect occurs in the perception of place of articulation in CV syllables.
The basic tool in this study is a continuum of syllables that ranges in equal acoustic
steps from [da] to [ga] (see figure 5.1). This figure needs a little discussion. At
the end of chapter 3 I introduced spectrograms, and in that section I mentioned
that the dark bands in a spectrogram show the spectral peaks that are due to
the vocal tract resonances (the formant frequencies). So in figure 5.1a we see a
sequence of five syllables with syllable number 1 labeled [da] and syllable number 5 labeled [ga]. In each syllable, the vowel is the same; it has a first formant
frequency (F1) of about 900 Hz, a second formant frequency (F2) of about 1,100 Hz,
an F3 at 2,500 Hz, and an F4 at 3,700 Hz. The difference between [da] and [ga]
has to do with the brief formant movements (called formant transitions) at
the start of each syllable. For [da] the F2 starts at 1,500 Hz and the F3 starts at
2,900 Hz, while for [ga] the F2 starts at 1,900 Hz and the F3 starts at 2,000 Hz.
You'll notice that the main difference between [al] and [ar] in figure 5.1b is the
F3 pattern at the end of the syllable.
Virginia Mann (1980) found that the perception of this [da]-[ga] continuum
depends on the preceding context. Listeners report that the ambiguous syllables
in the middle of the continuum sound like "ga" when preceded by the VC syllable
[al], and sound like "da" when preceded by [ar].
As the name implies, this "compensation for coarticulation" perceptual effect
can be related to coarticulation between the final consonant in the VC context
token ([al] or [ar]) and the initial consonant in the CV test token ([da]-[ga]). However,
an auditory frequency contrast effect probably also plays a role. The way this explanation works is illustrated in figure 5.1b. The relative frequency of F3 distinguishes
[da] from [ga]: F3 is higher in [da] than it is in [ga]. Interestingly, though, the
perceived frequency of F3 may also be influenced by the frequency of the F3 just
prior to [da/ga]. When F3 just prior to [da/ga] is low (as in [ar]), the [da/ga] F3
sounds contrastively higher, and when the F3 just prior is high, the [da/ga] F3 sounds
lower. Lotto and Kluender (1998) tested this idea by replacing the precursor syllable
with a simple sine wave that matched the ending frequency of the F3 of [ar],
in one condition, or matched the ending F3 frequency of [al], in another condition. They found that these nonspeech isolated tones shifted the perception of
the [da]-[ga] continuum in the same direction that the [ar] and [al] syllables did.
So evidently, at least a part of the compensation for coarticulation phenomenon
is due to a simple auditory contrast effect having nothing to do with the phonetic
mode of perception.
Two explanations for one effect.
Compensation for coarticulation is controversial. For researchers who like to
think of speech perception in terms of phonetic perception (i.e. "hearing"
people talk), compensation for coarticulation is explained in terms of
coarticulation. Tongue retraction in [r] leads listeners to expect tongue
retraction in the following segment and thus a backish stop (more like "g")
can still sound basically like a "d" in the [r] context because of this
context-dependent expectation. Researchers who think that one should first
and foremost look for explanations of perceptual effects in the sensory input
system (before positing more abstract cognitive parsing explanations) are
quite impressed by the auditory contrast account.
It seems to me that the evidence shows that both of these explanations
are right. Auditory contrast does seem to occur with pure tone context tokens,
in place of [ar] or [al], but the size of the effect is smaller than it is with a
phonetic precursor syllable. The smaller size of the effect suggests that auditory
contrast is not the only factor. I've also done research with stimuli like
this where I present a continuum between [al] and [ar] as context for the
[da]-[ga] continuum. When both the precursor and the target syllable are
ambiguous, the identity of the target syllable (as "da" or "ga") depends on the
perceived identity of the precursor. That is, for the same acoustic token, if the
listener thinks that the context is "ar" he or she is more likely to identify
the ambiguous target as "da." This is clearly not an auditory contrast effect.
So, both auditory perception and phonetic perception seem to push
listeners in the same direction.
5.2 Phonetic Knowledge Shapes Speech Perception.
Of course, the fact that the auditory system shapes our perception of speech does
not mean that all speech perception phenomena are determined by our auditory
abilities. As speakers, not just hearers, of language, we are also guided by our knowledge of speech production. There are two main classes of perceptual effects that
emerge from phonetic knowledge: categorical perception and phonetic coherence.
5.2.1 Categorical perception.
Take a look back at figure 5.1a. Here we have a sequence of syllables that shifts
gradually (and in equal acoustic steps) from a syllable that sounds like "da" at
one end to a syllable that sounds like "ga" at the other (see table 5.1). This type
of gradually changing sequence is called a stimulus continuum. When we play
these synthesized syllables to people and ask them to identify the sounds - with
an instruction like "please write down what you hear" - people usually call the
first three syllables "da" and the last two "ga." Their response seems very cat-
egorical: a syllable is either "da" or "ga." But, of course, this could be so simply
because we only have two labels for the sounds in the continuum, so by
definition people have to say either "da" or "ga." Interestingly, though — and this
is why we say that speech perception tends to be categorical — the ability to hear
the differences between the stimuli on the continuum is predictable from the labels
we use to identify the members of the continuum.
To illustrate this, suppose I play you the first two syllables in the continuum
shown in figure 5.1a — tokens number 1 and 2. Listeners label both of these as
"da," but they are slightly different from each other. Number 1 has a third for-
mant onset of 2,750 Hz while the F3 in token number 2 starts at 2,562 Hz. People
don't notice this contrast — the two syllables really do sound as if they are iden-
tical. The same thing goes for the comparisons of token 2 with token 3 and of
token 4 with token 5. But when you hear token 3 (a syllable that you would ordi-
narily label as "da") compared with token 4 (a syllable that you would ordinarily
label "ga"), the difference between them leaps out at you. The point is that in the
discrimination task — when you are asked to detect small differences — you don't
have to use the labels "da" or "ga." You should be able to hear the differences at
pretty much the same level of accuracy, no matter what label you would have put
on the tokens, because the difference is the same (188 Hz for F3 onset) for token
1 versus 2 as it is for token 3 versus 4. The curious fact is that even when you don't
have to use the labels "da" and "ga" in your listening responses, your perception
is in accordance with the labels — you can notice a 188 Hz difference when the
tokens have different labels and not so much when the tokens have the same label.
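The arithmetic behind "equal acoustic steps" can be sketched in a few lines of Python. I assume here that the F3 onset falls linearly from the 2,750 Hz given for token 1 to the 2,000 Hz given in section 5.1 for [ga], across five tokens; the function name f3_onset_continuum is hypothetical. Under that assumption each step is 187.5 Hz, which agrees with the rounded values of 2,562 Hz and 188 Hz quoted above.

```python
def f3_onset_continuum(start_hz=2750.0, end_hz=2000.0, n_tokens=5):
    """Return F3 onset frequencies spaced in equal steps from start to end."""
    step = (end_hz - start_hz) / (n_tokens - 1)
    return [start_hz + i * step for i in range(n_tokens)]


onsets = f3_onset_continuum()

# Size of the F3 onset difference between each pair of neighboring tokens.
steps = [onsets[i] - onsets[i + 1] for i in range(len(onsets) - 1)]

for number, onset in enumerate(onsets, start=1):
    print(f"token {number}: F3 onset = {onset} Hz")
```

The point of the sketch is only that the physical difference is identical for every neighboring pair, so any difference in how well listeners discriminate the pairs cannot come from the step size.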
One classic way to present these hypothetical results is shown in figure 5.2
(see Liberman et al., 1957, for the original graph like this). This graph has two
"functions" — two lines — one for the proportion of times listeners will identify
a token as "da", and one for the proportion of times that listeners will be able to
accurately tell whether two tokens (say number 1 and number 2) are different from
each other. The first of these two functions is called the identification function,
and I have plotted it as if we always (probability equals 1) identify tokens 1, 2, and
3 as "da." The second of these functions is called the discrimination function,
and I have plotted a case where the listener is reduced to guessing when the tokens
being compared have the same label (where "guessing" equals probability of
correct detection of difference is 0.5), and where he or she can always hear the
difference between token 3 (labeled "da") and token 4 (labeled "ga"). The pattern
of response in figure 5.2 is what we mean by "categorical perception" — within-
category discrimination is at chance and between-category discrimination is per-
fect. Speech tends to be perceived categorically, though interestingly, just as with
compensation for coarticulation, there is an auditory perception component in
this kind of experiment, so that speech perception is never perfectly categorical.
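The idealized pattern just described can be written out as a small Python sketch. It assumes perfectly categorical labeling of the five tokens (1 to 3 as "da", 4 and 5 as "ga") and predicts discrimination of neighboring tokens from the labels alone; the variable and function names are mine, and real listeners would of course give noisier results.

```python
# Idealized labels for tokens 1 to 5 of the continuum.
labels = ["da", "da", "da", "ga", "ga"]

# Identification function: proportion of "da" responses for each token.
identification = [1.0 if label == "da" else 0.0 for label in labels]


def predicted_discrimination(token_labels):
    """Predict proportion correct for each pair of neighboring tokens.

    Same label: the listener is guessing (0.5).
    Different labels: the listener always hears the difference (1.0).
    """
    predictions = {}
    for i in range(len(token_labels) - 1):
        same_label = token_labels[i] == token_labels[i + 1]
        predictions[(i + 1, i + 2)] = 0.5 if same_label else 1.0
    return predictions


discrimination = predicted_discrimination(labels)
```

In this sketch the only pair with perfect discrimination is (3, 4), the pair that straddles the category boundary.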
Our tendency to perceive speech categorically has been investigated in many
different ways. One of the most interesting of these lines of research suggests
(to me at least) that categorical perception of speech is a learned phenomenon (see
Johnson and Ralston, 1994). It turns out that perception of sine wave analogs of
the [da] to [ga] continuum is much less categorical than is perception of normal-sounding
speech. Robert Remez and colleagues (Remez et al., 1981) pioneered
the use of sine wave analogs of speech to study speech perception. In sine wave
analogs, the formants are replaced by time-varying sinusoidal waves (see figure 5.3).
These signals, while acoustically comparable to speech, do not sound at all like
speech. The fact that we have a more categorical response to speech signals
than to sine wave analogs of speech suggests that there is something special
about hearing formant frequencies as speech versus hearing them as nonspeech,
video-game noises. One explanation of this is that as humans we have an innate
ability to recover phonetic information from speech so that we hear the intended,
categorical gestures of the speaker.
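For readers who want to hear what a sine wave analog is like, here is a rough Python sketch of the idea: each formant is replaced by one sinusoid whose frequency follows the formant track. The formant values for [da] are the ones given in section 5.1 (F1 is simply held at 900 Hz because no F1 onset is given). The 50 ms linear transitions, the 300 ms duration, the 16 kHz sample rate, the equal tone amplitudes, and the use of only three formants are all assumptions for illustration, not the procedure of Remez et al.

```python
import math

SAMPLE_RATE = 16000  # Hz (assumed)


def formant_track(onset_hz, vowel_hz, transition_ms=50.0, total_ms=300.0):
    """Frequency per sample: a linear glide to the vowel value, then steady."""
    n_total = int(round(total_ms * SAMPLE_RATE / 1000.0))
    n_trans = int(round(transition_ms * SAMPLE_RATE / 1000.0))
    track = []
    for i in range(n_total):
        if i < n_trans:
            track.append(onset_hz + (vowel_hz - onset_hz) * i / n_trans)
        else:
            track.append(vowel_hz)
    return track


def sine_wave_analog(tracks):
    """Sum one time-varying sinusoid per formant track."""
    n = len(tracks[0])
    mix = [0.0] * n
    for track in tracks:
        phase = 0.0
        for i, freq in enumerate(track):
            mix[i] += math.sin(phase)
            # Accumulate phase so the tone stays smooth as frequency changes.
            phase += 2.0 * math.pi * freq / SAMPLE_RATE
    return [x / len(tracks) for x in mix]


da_tracks = [
    formant_track(900.0, 900.0),    # F1 (no transition given in the text)
    formant_track(1500.0, 1100.0),  # F2
    formant_track(2900.0, 2500.0),  # F3
]
da_analog = sine_wave_analog(da_tracks)
```

The list da_analog holds samples between -1 and 1 that could be written to a sound file with whatever audio tool you prefer.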
A simpler explanation of why speech tends to be heard categorically is that our
perceptual systems have been tuned by linguistic experience. As speakers, we have
somewhat categorical intentions when we speak — for instance, to say "dot" instead
of "got." So as listeners we evaluate speech in terms of the categories that we
have learned to use as speakers. Several kinds of evidence support this "acquired
categoriality" view of categorical perception.
For example, as you know from trying to learn the sounds of the International
Phonetic Alphabet, foreign speech sounds are often heard in terms of native sounds.
For instance, if you are like most beginners, when you were learning the implosive
sounds [ɓ], [ɗ], and [ɠ] it was hard to hear the difference between them and
plain voiced stops. This simple observation has been confirmed many times and
in many ways, and indicates that in speech perception, we hear sounds that we
are familiar with as talkers. Our categorical perception boundaries are determined
by the language that we speak. (The theories proposed by Best, 1995, and Flege,
1995, offer explicit ways of conceptualizing this.)
Categorical magnets.
One really interesting demonstration of the language-specificity of categorical
perception is the "perceptual magnet effect" (Kuhl et al., 1992). In this
experiment, you synthesize a vowel that is typical of the sound of [i] and
then surround it with vowels that systematically differ from the center
vowel. In figure 5.4 this is symbolized by the white star, and the white
circles surrounding it. A second set of vowels is synthesized, again in a radial
grid around a center vowel. This second set is centered not on a typical
[i] but instead on a vowel that is a little closer to the boundary between [i]
and [e].
When you ask adults if they can hear the difference between the center
vowel (one of the stars) and the first ring of vowels, it turns out that they
have a harder time distinguishing the white star (a prototypical [i]) from its
neighbors than they do distinguishing the black star (a non-prototypical [i])
from its neighbors. This effect is interesting because it seems to show that
categorical perception is a gradient within categories (note that all of the
vowels in the experiment sound like variants of [i], even the ones in the black
set that are close to the [i]/ [e] boundary). However, even more interesting
is the fact that the location of a perceptual magnet differs depending on
the native language of the listener — even when those listeners are mere
infants!


Here's another phenomenon that illustrates the phonetic coherence of speech
perception. Imagine that you make a video of someone saying "ba," "da," and
"ga." Now, you dub the audio of each of these syllables onto the video of the
others. That is, one copy of the video of [ba] now has the audio recording of
[da] as its sound track, another has the audio of [ga], and so on. There are some
interesting confusions among audio/video mismatch tokens such as these, and
one of them in particular has become a famous and striking demonstration of
the phonetic coherence of speech perception.
Some of the mismatches just don't sound right at all. For example, when you
dub audio [da] onto video [ba], listeners will report that the token is "ba" (in accordance
with the obvious lip closure movement) but that it doesn't sound quite
normal.
The really famous audio/video mismatch is the one that occurs when you dub
audio [ba] onto video [ga]. The resulting movie doesn't sound like either of the
input syllables, but instead it sounds like "da"! This perceptual illusion is called
the McGurk effect after Harry McGurk, who first demonstrated it (McGurk and
MacDonald, 1976). It is a surprisingly strong illusion that only goes away when
you close your eyes. Even if you know that the audio signal is [ba], you can only
hear "da."
The McGurk effect is an illustration of how speech perception is a process
in which we deploy our phonetic knowledge to generate a phonetically coherent
percept. As listeners we combine information from our ears and our eyes to come
to a phonetic judgment about what is being said. This process taps specific phonetic
knowledge, not just generic knowledge of speech movements. For instance,
Walker et al. (1995) demonstrated that audio/video integration is blocked when
listeners know the talkers, and know that the voice doesn't belong with the
face (in a dub of one person's voice onto another person's face). This shows that
phonetic coherence is a property of speech perception, and that phonetic coher-
ence is a learned perceptual capacity, based on knowledge we have acquired
as listeners.
McGurking ad nauseam.
The McGurk effect is a really popular phenomenon in speech perception,
and researchers have poked and prodded it quite a bit to see how it works.
In fact it is so popular we can make a verb out of the noun "McGurk effect"
— to "McGurk" is to have the McGurk effect. Here are some examples of
McGurking:
Babies McGurk (Rosenblum et al., 1997)
You can McGurk even when the TV is upside down (Campbell, 1994)
Japanese listeners McGurk less than English listeners (Sekiyama and
Tohkura, 1993)
Male faces can McGurk with female voices (Green et al., 1991)
A familiar face with the wrong voice doesn't McGurk (Walker et al., 1995).
5.3 Linguistic Knowledge Shapes Speech Perception.
We have seen so far that our ability to perceive speech is shaped partly by the
nonlinearities and other characteristics of the human auditory system, and we have
seen that what we hear when we listen to speech is partly shaped by the phonetic
knowledge we have gained as speakers. Now we turn to the possibility that speech
perception is also shaped by our knowledge of the linguistic structures of our native
language.
I have already included in section 5.2 (on phonetic knowledge) the fact that
the inventory of speech sounds in your native language shapes speech perception,
so in this section I'm not focusing on phonological knowledge when I say "lin-
guistic structures," but instead I will present some evidence of lexical effects in speech
perception — that is, that hearing words is different from hearing speech sounds.
I should mention at the outset that there is controversy about this point. I will
suggest that speech perception is influenced by the lexical status of the sound
patterns we are hearing, but you should know that some of my dear colleagues
will be disappointed that I'm taking this point of view.
Scientific method: on being convinced.
There are a lot of elements to a good solid scientific argument, and I'm not
going to go into them here. But I do want to mention one point about how
we make progress. The point is that no one individual gets to declare an
argument won or lost. I am usually quite impressed by my own arguments
and cleverness when I write a research paper. I think I've figured something
out and I would like to announce my conclusion to the world. However,
the real conclusion of my work is always written by my audience and it keeps
being written by each new person who reads the work. They decide if the
result seems justified or valid. This aspect of the scientific method, includ-
ing the peer review of articles submitted for publication, is part of what leads
us to the correct answers.
The question of whether speech perception is influenced by word processing
is an interesting one in this regard. The very top researchers — most clever, and
most forceful — in our discipline are in disagreement on the question. Some
people are convinced by one argument or set of results and others are more
swayed by a different set of findings and a different way of thinking about the
question. What's interesting to me is that this has been dragging on for a
long, long time. And what's even more interesting is that as the argument drags
on, and researchers amass more and more data on the question, the theories
start to blur into each other a little. Of course, you didn't read that here!
The way that "slips of the ear" work suggests that listeners apply their know-
ledge of words in speech perception. Zinny Bond (1999) reports perceptual errors
like "spun toffee" heard as "fun stocking" and "wrapping service" heard as
"wrecking service." In her corpus of slips of the ear, almost all of them are word
misperceptions, not phoneme misperceptions. Of course, sometimes we may mis-
hear a speech sound, and perhaps think that the speaker has mispronounced the
word, but Bond's research shows that listeners are inexorably drawn into hearing
words even when the communication process fails. This makes a great deal of
sense, considering that our goal in speech communication is to understand what
the other person is saying, and words (or more technically, morphemes) are the
units we trade with each other when we talk.
This intuition, that people tend to hear words, has been verified in a very clever
extension of the place of articulation experiment we discussed in sections 5.1 and
5.2. The effect, which is named the Ganong effect after the researcher who first
found it (Ganong, 1980), involves a continuum like the one in figure 5.1, but with
a word at one end and a nonword at the other. For example, if we added a final
[g] to our [da]-[ga] continuum we would have a continuum between the word
"dog" and the nonword [gag]. What Ganong found, and what makes me think
that speech perception is shaped partly by lexical knowledge, is that in this new
continuum we will get more "dog" responses than we will get "da" responses in
the [da]-[ga] continuum. Remember the idea of a "perceptual magnet" from above?
Well, in the Ganong effect words act like perceptual magnets; when one end of
the continuum is a word, listeners tend to hear more of the stimuli as a lexical
item, and fewer of the stimuli as the nonword alternative at the other end of the
continuum.
Ganong applied careful experimental controls using pairs of continua like
"tash"–"dash" and "task"–"dask" where we have a great deal of similarity
between the continuum that has a word on the /t/ end ("task"–"dask") and
the one that has a word on the /d/ end ("tash"–"dash"). That way there is less
possibility that the difference in number of "d" responses is due to small acoustic
differences between the continua rather than the difference in lexicality of the
endpoints. It has also been observed that the lexical effect is stronger when
the sounds to be identified are at the ends of the test words, as in "kiss"—"kish"
versus "fiss"—"fish." This makes sense if we keep in mind that it takes a little
time to activate a word in the mental lexicon.
A third perceptual phenomenon that suggests that linguistic knowledge (in the
form of lexical identity) shapes speech perception was called "phoneme restora-
tion" by Warren when he discovered it (Warren, 1970). Figure 5.7 illustrates phoneme
restoration. The top panel is a spectrogram of the word "legislation" and the bot-
tom panel shows a spectrogram of the same recording with a burst of broadband
noise replacing the [s]. When people hear the noise-replaced version of the sound
file in figure 5.7b they "hear" the [s] in "legislation." Arthur Samuel (1991)
reported an important bit of evidence suggesting that the [s] is really perceived
in the noise-replaced stimuli. He found that listeners can't really tell the differ-
ence between a noise-added version of the word (where the broadband noise is
simply added to the already existing [s]) and a noise-replaced version (where the
[s] is excised first, before adding noise). What this means is that the [s] is actually
perceived — it is restored — and thus that your knowledge of the word "legisla-
tion" has shaped your perception of this noise burst.
Jeff Elman and Jay McClelland (1988) provided another important bit of evidence
that linguistic knowledge shapes speech perception. They used the phoneme
restoration process to induce the perception of a sound that then participated in
a compensation for coarticulation. This two-step process is a little complicated,
but it is one of the most clever and influential experiments in the literature.
Step one: compensation for coarticulation. We use a [da]-[ga] continuum just like
the one in figure 5.1, but instead of context syllables [al] and [ar], we use [as] and
[aʃ]. There is a compensation for coarticulation using these fricative context
syllables that is like the effect seen with the liquid contexts. Listeners hear more
"ga" syllables when the context is [as] than when it is [aʃ].
Step two: phoneme restoration. We replace the fricative noises in the words
"abolish" and "progress" with broadband noise, as was done to the [s] of "legislation"
in figure 5.7. Now we have a perceived [s] in "progress" and a perceived [ʃ]
in "abolish" but the signal has only noise at the ends of these words in our tokens.
The question is whether the restoration of [s] and [ʃ] in "progress" and "abolish"
is truly a perceptual phenomenon, or just something more like a decision bias
in how listeners will guess the identity of a word. Does the existence of a word
"progress" and the nonexistence of any word "progresh" actually influence
speech perception? Elman and McClelland's excellent test of this question was to
use "abolish" and "progress" as contexts for the compensation for coarticulation
experiment. The reasoning is that if the "restored" [s] produces a compensation
for coarticulation effect, such that listeners hear more "ga" syllables when these
are preceded by a restored [s] than when they are preceded by a restored [ʃ],
then we would have to conclude that the [s] and [ʃ] were actually perceived by
listeners: they were actually perceptually there and able to interact with the perception
of the [da]-[ga] continuum. Guess what Elman and McClelland found?
That's right: the phantom, not-actually-there [s] and [ʃ] caused compensation for
coarticulation, which is pretty impressive evidence that speech perception is shaped by
our linguistic knowledge.
5.4 Perceptual Similarity.
Now to conclude the chapter, I'd like to discuss a procedure for measuring
perceptual similarity spaces of speech sounds. This method will be useful in later
chapters as we discuss different types of sounds, their acoustic characteristics, and
then their perceptual similarities. Perceptual similarity is also a key parameter in
relating phonetic characteristics to language sound change and the phonological
patterns in language that arise from sound change.
The method involves presenting test syllables to listeners and asking them
to identify the sounds in the syllables. Ordinarily, with carefully produced "lab
speech" (that is, speech produced by reading a list of syllables into a microphone
in the phonetics lab) listeners will make very few misidentifications in this task,
so we usually add some noise to the test syllables to force some mistakes. The
noise level is measured as a ratio of the intensity of the noise compared with the
peak intensity of the syllable. This is called the signal-to-noise ratio (SNR) and
is measured in decibels. To analyze listeners' responses we tabulate them in a con-
fusion matrix. Each row in the matrix corresponds to one of the test syllables
(collapsing across all 10 tokens of that syllable) and each column in the matrix
corresponds to one of the responses available to listeners.
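To make the bookkeeping concrete, here is a minimal sketch of the two steps just described: expressing a signal-to-noise ratio in decibels and tallying responses into a confusion matrix. It assumes the usual convention that SNR in dB is ten times the base-10 log of the signal-to-noise intensity ratio. The function names and the handful of trials are hypothetical, invented purely for illustration.

```python
import math
from collections import Counter

def snr_db(signal_intensity, noise_intensity):
    """Signal-to-noise ratio in decibels from two intensity values."""
    return 10.0 * math.log10(signal_intensity / noise_intensity)

def confusion_matrix(trials, stimuli, responses):
    """Tabulate (stimulus, response) pairs: one row per stimulus, one column per response."""
    counts = Counter(trials)
    return [[counts[(s, r)] for r in responses] for s in stimuli]

# Hypothetical trials, invented for illustration only.
trials = [("f", "f"), ("f", "th"), ("f", "f"), ("th", "f"), ("th", "th")]
labels = ["f", "th"]
matrix = confusion_matrix(trials, labels, labels)

# Equal signal and noise intensities give 0 dB.
equal_snr = snr_db(1.0, 1.0)
```

Each row of `matrix` collapses across all tokens of one stimulus, so row totals can differ when some sounds were presented more often than others.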
Table 5.2 shows the confusion matrix for the 0 dB SNR condition in George
Miller and Patricia Nicely's (1955) large study of consonant perception. Yep, these
data are old, but they're good. Looking at the first row of the confusion matrix
we see that [f] was presented 264 times and identified correctly as "f" 199 times
and incorrectly as "th" 46 times. Note that Miller and Nicely have more data for
some sounds than for others.
Even before doing any sophisticated data analysis, we can get some pretty quick
answers out of the confusion matrix. For example, why is it that "Keith" is sometimes
pronounced "Keef" by children? Well, according to Miller and Nicely's data,
[θ] was called "f" 85 times out of 232; it was confused with "f" more often than
with any other speech sound tested. Cool. But it isn't clear that these data tell us
anything at all about other possible points of interest — for example, why "this"
and "that" are sometimes said with a [d] sound. To address that question we need
to find a way to map the perceptual "space" that underlies the confusions we observe
in our experiment. It is to this mapping problem we now turn.
5.4.1 Maps from distances.
So, we're trying to pull information out of a confusion matrix to get a picture of
the perceptual system that caused the confusions. The strategy that we will use
takes a list of distances and reconstructs them as a map. Consider, for example,
the list of distances below for cities in Ohio.
Columbus to Cincinnati, 107 miles
Columbus to Cleveland, 142 miles
Cincinnati to Cleveland, 249 miles
From these distances we can put these cities on a straight line as in figure 5.8a,
with Columbus located between Cleveland and Cincinnati. A line works to
describe these distances because the distance from Cincinnati to Cleveland is
simply the sum of the other two distances (107 + 142 = 249).
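The additivity test for a one-dimensional map can be sketched in a few lines. This is only an illustration of the reasoning above; the function names are hypothetical, and the second check uses the Dutch distances given next in the text.

```python
def fits_on_line(d1, d2, d3, tol=1e-9):
    """True if the longest distance equals the sum of the other two."""
    longest = max(d1, d2, d3)
    return abs((d1 + d2 + d3 - longest) - longest) <= tol

def line_positions(first_leg, second_leg):
    """Coordinates of three collinear points, given the two shorter legs."""
    return [0.0, float(first_leg), float(first_leg + second_leg)]

# Ohio distances from the text (miles): 107 + 142 = 249, so a line works.
ohio_fits = fits_on_line(107, 142, 249)
positions = dict(zip(["Cincinnati", "Columbus", "Cleveland"],
                     line_positions(107, 142)))

# Dutch distances from the text (km): no two of them sum to the third.
dutch_fits = fits_on_line(178, 120, 187)
```

When the check fails, as it does for the Dutch cities, the points need at least one more dimension.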
Here's an example that requires a two-dimensional plane.
Amsterdam to Groningen, 178 km
Amsterdam to Nijmegen, 120 km
Groningen to Nijmegen, 187 km
The two-dimensional map that plots the distances between these cities in the
Netherlands is shown in figure 5.8b. To produce this figure I put Amsterdam and
Groningen on a line and called the distance between them 178 km. Then I drew
an arc 120 km from Amsterdam, knowing that Nijmegen has to be somewhere
on this arc. Then I drew an arc 187 km from Groningen, knowing that Nijmegen
also has to be somewhere on this arc. So, Nijmegen has to be at the intersection of the two arcs.
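The arc construction for a three-city map can be sketched as a small calculation. This is a hedged illustration rather than the procedure used to draw figure 5.8b: it puts Amsterdam at the origin and Groningen on the x-axis, then solves for the point where the two arcs cross. The function name is hypothetical, and the orientation of the resulting map is arbitrary, since rotating or mirroring it leaves every distance unchanged.

```python
import math

def third_point(d_ab, d_ac, d_bc):
    """Place A at (0, 0) and B at (d_ab, 0); return C where the two arcs cross.

    Of the two mirror-image intersections, the one with y >= 0 is returned.
    """
    x = (d_ab**2 + d_ac**2 - d_bc**2) / (2.0 * d_ab)
    y_squared = d_ac**2 - x**2
    if y_squared < 0:
        raise ValueError("distances violate the triangle inequality")
    return x, math.sqrt(y_squared)

# Distances from the text (km).
amsterdam = (0.0, 0.0)
groningen = (178.0, 0.0)
nijmegen = third_point(178.0, 120.0, 187.0)

# Recompute the two distances from the solved coordinates as a check.
to_amsterdam = math.hypot(nijmegen[0] - amsterdam[0], nijmegen[1] - amsterdam[1])
to_groningen = math.hypot(nijmegen[0] - groningen[0], nijmegen[1] - groningen[1])
```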
The variables in submatrix (b) code the proportions so that "p" stands for
proportion, the first subscript letter stands for the row label and the second
subscript letter stands for the column label. So $p_{\theta f}$ is a variable that refers to the
proportion of times that [θ] tokens were called "f." In these data $p_{\theta f}$ is equal
to 0.37. Submatrix (c) abstracts this a little further to say that for any two sounds
i and j, we have a submatrix with confusions (subscripts don't match) and
correct answers (subscripts match).
Asymmetry in confusion matrices.
Is there some deep significance in the fact that [θ] is called "f" more often
than [f] is called "th"? It may be that listeners had a bias against calling things
"th," perhaps because it was confusing to have to distinguish between "th"
and "dh" on the answer sheet. This would seem to be the case in table 5.2
because there are many more "f" responses than "th" responses overall.
However, the relative infrequency of "s" responses suggests that we may not
want to rely too heavily on a response bias explanation, because the "s"-to-[s]
mapping is common and unambiguous in English. One interesting point
about the asymmetry of [f] and [θ] confusions is that the perceptual confusion
matches the cross-linguistic tendency for sound change (that is, [θ] is
more likely to change into [f] than vice versa). Mere coincidence, or is there
a causal relationship? Shepard's method for calculating similarity from a
confusion matrix glosses over this interesting point and assumes that $p_{f\theta}$
and $p_{\theta f}$ are two imperfect measures of the same thing: the confusability of
"f" and "θ." These two estimates are thus combined to form one estimate
of "f"-"θ" similarity. This is not to deny that there might be something
interesting to look at in the asymmetry, but only to say that for the purpose
of making perceptual maps the sources of asymmetry in the confusion matrix
are ignored.
Here is Shepard's method for calculating similarity from a confusion matrix.
We take the confusions between the two sounds and scale them by the correct
responses. In math, that's:

$$S_{ij} = \frac{p_{ij} + p_{ji}}{p_{ii} + p_{jj}} \qquad (5.1)$$

In this formula, $S_{ij}$ is the similarity between category i and category j. In the case
of "f" and "θ" in Miller and Nicely's data (table 5.2) the calculation is:
I should say that regarding this formula Shepard simply says that it "has been
found serviceable." Sometimes you can get about the same results by simply taking
the average of the two confusion proportions $p_{ij}$ and $p_{ji}$ as your measure of
similarity, but Shepard's formula does a better job with a confusion matrix in which
one category has confusions concentrated between two particular responses,
while another category has confusions fairly widely distributed among possible
responses, as might happen, for example, when there is a bias against using one
particular response alternative.
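As a rough sketch of the similarity calculation, the function below assumes that the formula scales the two confusion proportions by the two correct-response proportions, that is, (p_ij + p_ji) / (p_ii + p_jj). The function name and the counts are hypothetical; the counts are invented for illustration and are not Miller and Nicely's data.

```python
def shepard_similarity(counts, i, j):
    """Similarity of categories i and j from a confusion matrix of raw counts.

    counts[row][col] is the number of times stimulus `row` drew response `col`.
    Counts are converted to row proportions before the formula is applied.
    """
    def p(row, col):
        return counts[row][col] / sum(counts[row].values())

    return (p(i, j) + p(j, i)) / (p(i, i) + p(j, j))

# Hypothetical counts, invented for illustration.
counts = {
    "a": {"a": 80, "b": 15, "c": 5},
    "b": {"a": 30, "b": 60, "c": 10},
    "c": {"a": 5, "b": 5, "c": 90},
}
s_ab = shepard_similarity(counts, "a", "b")
s_ba = shepard_similarity(counts, "b", "a")
```

Because both confusion proportions go into the numerator, the result is the same whichever sound is named first, which is exactly how the asymmetry in the raw matrix gets set aside.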
OK, so that's how to get a similarity estimate from a confusion matrix. To get
perceptual distance from similarity you simply take the negative of the natural
log of the similarity:

$$d_{ij} = -\ln(S_{ij})$$

This is based on Shepard's Law, which states that the relationship between perceptual
distance and similarity is exponential. There may be a deep truth about
mental processing in this law, which comes up in all sorts of unrelated contexts (Shannon
and Weaver, 1949; Parzen, 1962), but that's a different topic.
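The conversion from similarity to distance is a one-liner, sketched here with a hypothetical function name. The guard against non-positive input is my own addition, since the log is undefined there. Note that a similarity of 1 maps to a distance of 0, and a similarity above 1 (possible if two sounds are confused more often than they are identified correctly) would give a negative distance.

```python
import math

def perceptual_distance(similarity):
    """Distance as the negative natural log of a positive similarity value."""
    if similarity <= 0.0:
        raise ValueError("similarity must be greater than 0")
    return -math.log(similarity)

# More similar pairs end up closer together on the map.
near = perceptual_distance(0.5)
far = perceptual_distance(0.1)
```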
Anyway, now we're back to map-making, except instead of mapping the relative
locations of Dutch cities in geographic space, we're ready to map the perceptual
space of English fricatives and "d." Table 5.3 shows the similarities calculated from
the Miller and Nicely confusion matrix (table 5.2) using equation (5.1).
The perceptual map based on these similarities is shown in figure 5.10. One of
the first things to notice about this map is that the voiced consonants are on one
side and the voiceless consonants are on the other. This captures the observation
that we made earlier, looking at the raw confusions, that voiceless sounds were
rarely called voiced, and vice versa. It is also interesting that the voiced and voice-
less fricatives are ordered in the same way on the vertical axis. This might be a
front/back dimension, or there might be an interesting correlation with some
acoustic aspect of the sounds.
In figure 5.10, I drew ovals around some clusters of sounds. These show
two levels of similarity among the sounds as revealed by a hierarchical cluster
analysis (another neat data analysis method available in most statistics software
packages; see Johnson, 2008, for more on this). At the first level of clustering
"θ" and "f" cluster with each other and "v" and "ð" cluster together in the
perceptual map. At a somewhat more inclusive level the sibilants are included with
their non-sibilant neighbors ("s" joins the voiceless cluster and "z" joins the
voiced cluster). The next level of clustering, not shown in the figure, puts [d] with
the voiced fricatives.
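For readers who want to see what a hierarchical cluster analysis does under the hood, here is a minimal sketch using average linkage. That linkage rule is an assumption on my part, since the text does not say which one was used for figure 5.10. The item labels and distances are hypothetical, invented for illustration.

```python
from itertools import combinations

def average_linkage(distances, labels):
    """Agglomerative clustering; returns the merges in the order they happen.

    distances maps frozenset({a, b}) to the distance between items a and b.
    """
    clusters = [frozenset([label]) for label in labels]
    merges = []

    def between(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        pairs = [distances[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest together.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merges.append((set(c1), set(c2), between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Hypothetical distances among four items, invented for illustration.
labels = ["w", "x", "y", "z"]
raw = {("w", "x"): 1.0, ("y", "z"): 1.5, ("w", "y"): 4.0,
       ("w", "z"): 4.5, ("x", "y"): 4.2, ("x", "z"): 4.8}
distances = {frozenset(pair): d for pair, d in raw.items()}
merges = average_linkage(distances, labels)
```

The order of the merges corresponds to the nesting of the ovals on a perceptual map: early merges are tight clusters, later merges are more inclusive ones.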
Combining cluster analysis with MDS gives us a pretty clear view of the
perceptual map. Note that these are largely just data visualization techniques; we
did not add any information to what was already in the confusion matrix (though
we did decide that a two-dimensional space adequately describes the pattern of
confusions for these sounds).
Concerning the realizations of "this" and "that" we would have to say that
these results indicate that the alternations [ð]-[d] and [ð]-[z] are not driven by
auditory/perceptual similarity alone. There are evidently other factors at work;
otherwise we would find "vis" and "vat" as realizations of "this" and "that."
MDS and acoustic phonetics.
In acoustic phonetics one of our fundamental puzzles has been how to decide
which aspects of the acoustic speech signal are important and which things
don't matter. You look at a spectrogram and see a blob — the question is,
do listeners care whether that part of the sound is there? Does that blob
matter? Phoneticians have approached the "Does it matter?" problem in a
number of ways.
For example, we have looked at lots of spectrograms and asked concerning
the mysterious blob, "Is it always there?" One of the established facts of
phonetics is that if an acoustic feature is always, or even usually, present
then listeners will expect it in perception. This is even true of the so-called
"spit spikes" seen sometimes in spectrograms of the lateral fricatives [ɬ]
and [ɮ]. (A spit spike looks like a stop release burst, described in chapter 8, but
occurs in the middle of a fricative noise.) These sounds get a bit juicy, but
this somewhat tangential aspect of their production seems to be useful in
perception.
Another answer to "Does it matter?" has been to identify the origin of
the blob in the acoustic theory of speech production. For example, some-
times room reverberation can "add" shadows to a spectrogram. (Actually in
the days of reel-to-reel tape recorders we had to be careful of magnetic
shadows that crop up when the magnetic sound image transfers across layers
of tape on the reel.) If you have a theory of the relationship between speech
production and speech acoustics you can answer the question by saying,
"It doesn't matter because the talker didn't produce it." We'll be exploring
the acoustic theory of speech production in some depth in the remaining
chapters of this book.
One of my favorite answers to "Does it matter?" is "Cooper's rule." Franklin
Cooper, in his 1951 paper with Al Liberman and John Borst, commented
on the problem of discovering "the acoustic correlates of perceived speech."
They claimed that there are "many questions about the relation between
acoustic stimulus and auditory perception which cannot be answered
merely by an inspection of spectrograms, no matter how numerous and
varied these might be" (an important point for speech technologists to
consider). Instead they suggested that "it will often be necessary to make
controlled modifications in the spectrogram, and then to evaluate the
effects of these modifications on the sound as heard. For these purposes we
have constructed an instrument" (one of the first speech synthesizers). This
is a pretty beautiful direct answer. Does that blob matter? Well, leave it
out when you synthesize the utterance and see if it sounds like something
else.
And finally there is the MDS answer. We map the perceptual space and
then look for correlations between dimensions of the map and acoustic prop-
erties of interest (like the mysterious blob). If an acoustic feature is tightly
correlated with a perceptual dimension then we can say that that feature
probably does matter. This approach has the advantages of being based on
naturally produced speech, and of allowing the simultaneous exploration of
many acoustic parameters.
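The last step of the MDS answer, checking whether a map dimension lines up with an acoustic property, comes down to a correlation. Here is a minimal sketch using Pearson's r. The choice of statistic is my assumption, and both lists of numbers are hypothetical values invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, invented for illustration: positions of five sounds on
# one map dimension, and one acoustic measurement (in Hz) for the same sounds.
map_dimension = [-1.2, -0.5, 0.1, 0.6, 1.0]
acoustic_measure = [2100.0, 2900.0, 3600.0, 4300.0, 4600.0]
r = pearson_r(map_dimension, acoustic_measure)
```

A value of r near 1 or -1 would suggest that the acoustic property probably does matter to listeners; a value near 0 would suggest looking elsewhere.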
Recommended Reading
Best, C. T. (1995) A direct realist perspective on cross-language speech perception. In W.
Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Bond, Z. S. (1999) Slips of the Ear: Errors in the Perception of Casual Conversation, San Diego: Academic Press. A collection, and analysis, of misperception in "the wild," that is, in ordinary conversations.
Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, Cambridge, MA: MIT Press. The theory and evidence for a gestalt theory of audition — a very important book.
Campbell, R. (1994) Audiovisual speech: Where, what, when, how? Current Psychology of Cognition, 13, 76-80. On the perceptual resilience of the McGurk effect.
Cole, R. A. (1973) Listening for mispronunciations: A measure of what we hear during speech. Perception & Psychophysics, 13, 153-6. A study showing that people often don't hear mispronunciations in speech communication.
Cooper, F. S., Liberman, A. M., and Borst, J. M. (1951) The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Science, 37, 318-25. The source of "Cooper's rule."
Elman, J. L. and McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-65. One of the most clever, and controversial, speech perception experiments ever reported.
Flege, J. E. (1995) Second language speech learning: Theory, findings, and problems. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Ganong, W. F. (1980) Phonetic categorization in auditory word recognition. Journal of Experimental Psychology: Human Perception and Performance, 6, 110-25. A highly influential demonstration of how people are drawn to hear words in speech perception. The basic result is now known as "the Ganong effect."
Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-36. Integrating gender-mismatched voices and faces in the McGurk effect.
Jakobson, R., Fant, G., and Halle, M. (1952) Preliminaries to Speech Analysis, Cambridge, MA: MIT Press. A classic in phonetics and phonology in which a set of distinctive phonological features is defined in acoustic terms.
Johnson, K. and Ralston, J. V. (1994) Automaticity in speech perception: Some speech/nonspeech comparisons. Phonetica, 51(4), 195-209. A set of experiments suggesting that over-learning accounts for some of the "specialness" of speech perception.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992) Linguistic experiences alter phonetic perception in infants by 6 months of age. Science, 255, 606-8. Demonstrating the perceptual magnet effect with infants.
Liberman, A. M., Harris, K. S., Hoffman, H. S., and Griffith, B. C. (1957) The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-68. The classic demonstration of categorical perception in speech perception.
Lotto, A. J. and Kluender, K. R. (1998) General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602-19. A demonstration that at least a part of the compensation for coarticulation effect (Mann, 1980) is due to auditory contrast.
Mann, V. A. (1980) Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407-12. The original demonstration of compensation for coarticulation in sequences like [al da] and [ar ga].
McGurk, H. and MacDonald, J. (1976) Hearing lips and seeing voices. Nature, 264, 746-8. The audiovisual speech perception effect that was reported in this paper has come to be called "the McGurk effect."
Miller, G. A. and Nicely, P. E. (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338-52. A standard reference for the confusability of American English speech sounds.
Parzen, E. (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065-76. A method for estimating probability from instances.
Pastore, R. E. and Farrington, S. M. (1996) Measuring the difference limen for identification of order of onset for complex auditory stimuli. Perception & Psychophysics, 58(4), 510-26. On the auditory basis of the linguistic use of aspiration as a distinctive feature.
Pisoni, D. B. (1977) Identification and discrimination of the relative onset time of two-component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352-61. More on the auditory basis of the linguistic use of aspiration as a distinctive feature.
Rand, T. C. (1974) Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55(3), 678-80. The first demonstration of the duplex perception effect.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981) Speech perception without traditional speech cues. Science, 212, 947-50. The first demonstration of how people perceive sentences that have been synthesized using only time-varying sine waves.
Rosenblum, L. D., Schmuckler, M. A., and Johnson, J. A. (1997) The McGurk effect in infants. Perception & Psychophysics, 59, 347-57.
Sekiyama, K. and Tohkura, Y. (1993) Inter-language differences in the influence of visual cues in speech perception. Journal of Phonetics, 21, 427-44. These authors found that the McGurk effect is different for people who speak different languages.
Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication. Urbana: University of Illinois. The book that established "information theory."
Shepard, R. N. (1972) Psychological representation of speech sounds. In E. E. David and P. B. Denes (eds.), Human Communication: A unified view. New York: McGraw-Hill, 67-113. Measuring perceptual distance from a confusion matrix.
Walker, S., Bruce, V., and O'Malley, C. (1995) Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-33. A fascinating demonstration of how top-down knowledge may mediate the McGurk effect.
Warren, R. M. (1970) Perceptual restoration of missing speech sounds. Science, 167, 392-3. The first demonstration of the "phoneme restoration effect."
When you listen to someone speaking you generally focus on understanding their
meaning. One famous (in linguistics) way of saying this is that "we speak in order
to be heard, in order to be understood" (Jakobson et al., 1952). Our drive, as
listeners, to understand the talker leads us to focus on getting the words being
said, and not so much on exactly how they are pronounced. But sometimes a
pronunciation will jump out at you: somebody says a familiar word in an unfamiliar way and you just have to ask "Is that how you say that?" When we listen
to the phonetics of speech — to how the words sound and not just what they mean
— we as listeners are engaged in speech perception.
In speech perception, listeners focus attention on the sounds of speech and notice
phonetic details about pronunciation that are often not noticed at all in normal
speech communication. For example, listeners will often not hear, or not seem
to hear, a speech error or deliberate mispronunciation in ordinary conversation,
but will notice those same errors when instructed to listen for mispronunciations
(see Cole, 1973).
Testing mispronunciation detection.
As you go about your daily routine, try mispronouncing a word every now
and then to see if the people you are talking to will notice. For instance, if
the conversation is about a biology class you could pronounce it "biolochi."
After saying it this way a time or two you could tell your friends about your
little experiment and ask if they noticed any mispronounced words. Do people
notice mispronunciation more in word-initial position or in medial position?
With vowels more than consonants? In nouns and verbs more than in gram-
matical words? How do people look up words in their mental dictionary if
they don't notice when a sound has been mispronounced? Evidently, looking up words in the mental lexicon is a little different from looking up words
in a printed dictionary (try entering "biolochi" in Google). Do you find that
your friends think you are strange when you persist in mispronouncing words
on purpose?
So, in this chapter we're going to discuss speech perception as a phonetic mode
of listening, in which we focus on the sounds of speech rather than the words.
An interesting problem in phonetics and psycholinguistics is to find a way of measuring how much phonetic information listeners take in during normal conversation, but in this book we can limit our focus to the phonetic mode of listening.
5.1 Auditory Ability Shapes Speech Perception.
As we saw in chapter 4, speech perception is shaped by general properties of
the auditory system that determine what can and cannot be heard, what cues will
be recoverable in particular segmental contexts, and how adjacent sounds will
influence each other. For example, we saw that the cochlea's nonlinear frequency
scale probably underlies the fact that no language distinguishes fricatives on the
basis of frequency components above 6,000 Hz.
Two other examples illustrate how the auditory system constrains speech perception. The first example has to do with the difference between aspirated and
unaspirated stops. This contrast is signaled by a timing cue that is called the "voice
onset time" (abbreviated as VOT). VOT is a measure (in milliseconds) of the
delay of voicing onset following a stop release burst. There is a longer delay in
aspirated stops than in unaspirated stops — so in aspirated stops the vocal folds are
held open for a short time after the oral closure of the stop has been released.
That's how the short puff of air in voiceless aspirated stops is produced. It has
been observed that many languages have a boundary between aspirated and unaspirated stops at about 30 ms VOT. What is so special about a 30 ms delay between
stop release and onset of voicing?
Here's where the auditory system comes into play. Our ability as hearers
to detect the nonsimultaneous onsets of tones at different frequencies probably
underlies the fact that the most common voice onset time boundary across languages is at about ±30 ms. Consider two pure tones, one at 500 Hz and the other
at 1,000 Hz. In a perception test (see, for example, the research studies by Pisoni,
1977, and Pastore and Farrington, 1996), we combine these tones with a small
onset asynchrony — the 500 Hz tone starts 20 ms before the 1,000 Hz tone. When
we ask listeners to judge whether the two tones were simultaneous or whether
one started a little before the other, we discover that listeners think that tones
separated by a 20 ms onset asynchrony start at the same time. Listeners don't
begin to notice the onset asynchrony until the separation is about 30 ms. This
parallelism between nonspeech auditory perception and a cross-linguistic phonetic
universal leads to the idea that the auditory system's ability to detect onset asynchrony is probably a key factor in this cross-linguistic phonetic property.
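To get a feel for the stimuli in this kind of test, here is a minimal sketch that builds a two-tone signal with a given onset asynchrony. The 500 Hz and 1,000 Hz tones and the 20 ms and 30 ms leads come from the description above. The duration, sample rate, amplitude scaling, and function name are my own assumptions and are not taken from the cited studies.

```python
import math

def two_tone_stimulus(lead_ms, duration_ms=200, rate=16000,
                      low_hz=500.0, high_hz=1000.0):
    """Return samples in which the low tone starts `lead_ms` before the high tone."""
    lead_samples = round(rate * lead_ms / 1000)
    total = round(rate * duration_ms / 1000)
    samples = []
    for n in range(total):
        value = math.sin(2 * math.pi * low_hz * n / rate)
        if n >= lead_samples:
            # The high tone begins at its own phase zero once the lead has elapsed.
            value += math.sin(2 * math.pi * high_hz * (n - lead_samples) / rate)
        # Halve the sum so the combined signal stays within [-1, 1].
        samples.append(0.5 * value)
    return samples

stimulus_20 = two_tone_stimulus(20)  # typically judged simultaneous
stimulus_30 = two_tone_stimulus(30)  # asynchrony begins to be noticed
```

The two lists differ only in the 10 ms stretch between samples 320 and 480, where the first already contains the 1,000 Hz tone and the second does not.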
Example number two: another general property of the auditory system is probably at work in the perceptual phenomenon known as "compensation for coarticulation." This effect occurs in the perception of place of articulation in CV syllables.
The basic tool in this study is a continuum of syllables that ranges in equal acoustic
steps from [da] to [ga] (see figure 5.1). This figure needs a little discussion. At
the end of chapter 3 I introduced spectrograms, and in that section I mentioned
that the dark bands in a spectrogram show the spectral peaks that are due to
the vocal tract resonances (the formant frequencies). So in figure 5.1a we see a
sequence of five syllables with syllable number 1 labeled [da] and syllable number 5 labeled [ga]. In each syllable, the vowel is the same; it has a first formant
frequency (F1) of about 900 Hz, a second formant frequency (F2) of about 1,100 Hz,
an F3 at 2,500 Hz, and an F4 at 3,700 Hz. The difference between [da] and [ga]
has to do with the brief formant movements (called formant transitions) at
the start of each syllable. For [da] the F2 starts at 1,500 Hz and the F3 starts at
2,900 Hz, while for [ga] the F2 starts at 1,900 Hz and the F3 starts at 2,000 Hz.
You'll notice that the main difference between [al] and [ar] in figure 5.1b is the
F3 pattern at the end of the syllable.
Virginia Mann (1980) found that the perception of this [da]-[ga] continuum
depends on the preceding context. Listeners report that the ambiguous syllables
in the middle of the continuum sound like "ga" when preceded by the VC syllable
[al], and sound like "da" when preceded by [ar].
As the name implies, this "compensation for coarticulation" perceptual effect
can be related to coarticulation between the final consonant in the VC context
token ([al] or [ar]) and the initial consonant in the CV test token ([da]-[ga]). However,
an auditory frequency contrast effect probably also plays a role. The way this explanation works is illustrated in figure 5.1b. The relative frequency of F3 distinguishes
[da] from [ga]: F3 is higher in [da] than it is in [ga]. Interestingly, though, the
perceived frequency of F3 may also be influenced by the frequency of the F3 just
prior to [da/ga]. When F3 just prior to [da/ga] is low (as in [ar]), the [da/ga] F3
sounds contrastively higher, and when the F3 just prior is high, the [da/ga] F3 sounds
lower. Lotto and Kluender (1998) tested this idea by replacing the precursor syllable
with a simple sine wave that matched the ending frequency of the F3 of [ar],
in one condition, or matched the ending F3 frequency of [al], in another condition. They found that these nonspeech isolated tones shifted the perception of
the [da]-[ga] continuum in the same direction that the [ar] and [al] syllables did.
So evidently, at least a part of the compensation for coarticulation phenomenon
is due to a simple auditory contrast effect having nothing to do with the phonetic
mode of perception.
Two explanations for one effect.
Compensation for coarticulation is controversial. For researchers who like to
think of speech perception in terms of phonetic perception (i.e. "hearing"
people talk), compensation for coarticulation is explained in terms of
coarticulation. Tongue retraction in [r] leads listeners to expect tongue
retraction in the following segment and thus a backish stop (more like "g")
can still sound basically like a "d" in the [r] context because of this
context-dependent expectation. Researchers who think that one should first
and foremost look for explanations of perceptual effects in the sensory input
system (before positing more abstract cognitive parsing explanations) are
quite impressed by the auditory contrast account.
It seems to me that the evidence shows that both of these explanations
are right. Auditory contrast does seem to occur with pure tone context tokens,
in place of [ar] or [al], but the size of the effect is smaller than it is with a
phonetic precursor syllable. The smaller size of the effect suggests that audi-
tory contrast is not the only factor. I've also done research with stimuli like
this where I present a continuum between [al] and [ar] as context for the
[da]-[ga] continuum. When both the precursor and the target syllable are
ambiguous, the identity of the target syllable (as "da" or "ga") depends on the
perceived identity of the precursor. That is, for the same acoustic token, if the
listener thinks that the context is "ar" he or she is more likely to identify
the ambiguous target as "da." This is clearly not an auditory contrast effect.
So, both auditory perception and phonetic perception seem to push
listeners in the same direction.
5.2 Phonetic Knowledge Shapes Speech Perception.
Of course, the fact that the auditory system shapes our perception of speech does
not mean that all speech perception phenomena are determined by our auditory
abilities. As speakers, not just hearers, of language, we are also guided by our knowledge of speech production. There are two main classes of perceptual effects that
emerge from phonetic knowledge: categorical perception and phonetic coherence.
5.2.1 Categorical perception.
Take a look back at figure 5.1a. Here we have a sequence of syllables that shifts
gradually (and in equal acoustic steps) from a syllable that sounds like "da" at
one end to a syllable that sounds like "ga" at the other (see table 5.1). This type
of gradually changing sequence is called a stimulus continuum. When we play
these synthesized syllables to people and ask them to identify the sounds - with
an instruction like "please write down what you hear" - people usually call the
first three syllables "da" and the last two "ga." Their response seems very cat-
egorical: a syllable is either "da" or "ga." But, of course, this could be so simply
because we only have two labels for the sounds in the continuum, so by
definition people have to say either "da" or "ga." Interestingly, though — and this
is why we say that speech perception tends to be categorical — the ability to hear
the differences between the stimuli on the continuum is predictable from the labels
we use to identify the members of the continuum.
To illustrate this, suppose I play you the first two syllables in the continuum
shown in figure 5.1a — tokens number 1 and 2. Listeners label both of these as
"da," but they are slightly different from each other. Number 1 has a third for-
mant onset of 2,750 Hz while the F3 in token number 2 starts at 2,562 Hz. People
don't notice this contrast — the two syllables really do sound as if they are iden-
tical. The same thing goes for the comparisons of token 2 with token 3 and of
token 4 with token 5. But when you hear token 3 (a syllable that you would ordi-
narily label as "da") compared with token 4 (a syllable that you would ordinarily
label "ga"), the difference between them leaps out at you. The point is that in the
discrimination task — when you are asked to detect small differences — you don't
have to use the labels "da" or "ga." You should be able to hear the differences at
pretty much the same level of accuracy, no matter what label you would have put
on the tokens, because the difference is the same (188 Hz for F3 onset) for token
1 versus 2 as it is for token 3 versus 4. The curious fact is that even when you don't
have to use the labels "da" and "ga" in your listening responses, your perception
is in accordance with the labels — you can notice a 188 Hz difference when the
tokens have different labels and not so much when the tokens have the same label.
One classic way to present these hypothetical results is shown in figure 5.2
(see Liberman et al., 1957, for the original graph like this). This graph has two
"functions" — two lines — one for the proportion of times listeners will identify
a token as "da", and one for the proportion of times that listeners will be able to
accurately tell whether two tokens (say number 1 and number 2) are different from
each other. The first of these two functions is called the identification function,
and I have plotted it as if we always (probability equals 1) identify tokens 1, 2, and
3 as "da." The second of these functions is called the discrimination function,
and I have plotted a case where the listener is reduced to guessing when the tokens
being compared have the same label (where "guessing" means that the probability of
correctly detecting a difference is 0.5), and where he or she can always hear the
difference between token 3 (labeled "da") and token 4 (labeled "ga"). The pattern
of response in figure 5.2 is what we mean by "categorical perception" — within-
category discrimination is at chance and between-category discrimination is per-
fect. Speech tends to be perceived categorically, though interestingly, just as with
compensation for coarticulation, there is an auditory perception component in
this kind of experiment, so that speech perception is never perfectly categorical.
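To make the two functions concrete, here is a minimal Python sketch of the idealized pattern just described. It assumes a five-token continuum in which tokens 1 to 3 are always labeled "da" and tokens 4 and 5 are always labeled "ga"; the names `LABELS`, `identification`, and `discrimination` are mine, and real listeners would of course give noisier numbers.

```python
# Idealized categorical perception for a five-token [da]-[ga] continuum.
# Assumption: tokens 1-3 are always labeled "da" and tokens 4-5 "ga",
# following the hypothetical pattern described for figure 5.2.
LABELS = {1: "da", 2: "da", 3: "da", 4: "ga", 5: "ga"}


def identification(token):
    """Probability of labeling the token as "da" (1.0 or 0.0 here)."""
    return 1.0 if LABELS[token] == "da" else 0.0


def discrimination(token_a, token_b):
    """Probability of correctly detecting a difference between two tokens.

    Same label -> guessing (0.5); different labels -> always correct (1.0).
    """
    return 0.5 if LABELS[token_a] == LABELS[token_b] else 1.0


identification_function = [identification(t) for t in range(1, 6)]
discrimination_function = [discrimination(t, t + 1) for t in range(1, 5)]

print(identification_function)   # [1.0, 1.0, 1.0, 0.0, 0.0]
print(discrimination_function)   # [0.5, 0.5, 1.0, 0.5]
```

The single peak in the discrimination list, at the token 3 versus token 4 comparison, is the signature of perfectly categorical perception.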
Our tendency to perceive speech categorically has been investigated in many
different ways. One of the most interesting of these lines of research suggests
(to me at least) that categorical perception of speech is a learned phenomenon (see
Johnson and Ralston, 1994). It turns out that perception of sine wave analogs of
the [da] to [ga] continuum is much less categorical than is perception of normal-sounding
speech. Robert Remez and colleagues (Remez et al., 1981) pioneered
the use of sine wave analogs of speech to study speech perception. In sine wave
analogs, the formants are replaced by time-varying sinusoidal waves (see figure 5.3).
These signals, while acoustically comparable to speech, do not sound at all like
speech. The fact that we have a more categorical response to speech signals
than to sine wave analogs of speech suggests that there is something special
about hearing formant frequencies as speech versus hearing them as nonspeech,
video-game noises. One explanation of this is that as humans we have an innate
ability to recover phonetic information from speech so that we hear the intended,
categorical gestures of the speaker.
A simpler explanation of why speech tends to be heard categorically is that our
perceptual systems have been tuned by linguistic experience. As speakers, we have
somewhat categorical intentions when we speak — for instance, to say "dot" instead
of "got." So as listeners we evaluate speech in terms of the categories that we
have learned to use as speakers. Several kinds of evidence support this "acquired
categoriality" view of categorical perception.
For example, as you know from trying to learn the sounds of the International
Phonetic Alphabet, foreign speech sounds are often heard in terms of native sounds.
For instance, if you are like most beginners, when you were learning the implosive
sounds [ɓ], [ɗ], and [ɠ] it was hard to hear the difference between them and
plain voiced stops. This simple observation has been confirmed many times and
in many ways, and indicates that in speech perception, we hear sounds that we
are familiar with as talkers. Our categorical perception boundaries are determined
by the language that we speak. (The theories proposed by Best, 1995, and Flege,
1995, offer explicit ways of conceptualizing this.)
Categorical magnets.
One really interesting demonstration of the language-specificity of categor-
ical perception is the "perceptual magnet effect," (Kuhl et al., 1992). In this
experiment, you synthesize a vowel that is typical of the sound of [i] and
then surround it with vowels that systematically differ from the center
vowel. In figure 5.4 this is symbolized by the white star, and the white
circles surrounding it. A second set of vowels is synthesized, again in a radial
grid around a center vowel. This second set is centered not on a typical
[i] but instead on a vowel that is a little closer to the boundary between [i]
and [e].
When you ask adults if they can hear the difference between the center
vowel (one of the stars) and the first ring of vowels, it turns out that they
have a harder time distinguishing the white star (a prototypical [i]) from its
neighbors than they do distinguishing the black star (a non-prototypical [i])
from its neighbors. This effect is interesting because it seems to show that
categorical perception is a gradient within categories (note that all of the
vowels in the experiment sound like variants of [i], even the ones in the black
set that are close to the [i]/ [e] boundary). However, even more interesting
is the fact that the location of a perceptual magnet differs depending on
the native language of the listener — even when those listeners are mere
infants!


Here's another phenomenon that illustrates the phonetic coherence of speech
perception. Imagine that you make a video of someone saying "ba," "da," and
"ga." Now, you dub the audio of each of these syllables onto the video of the
others. That is, one copy of the video of [ba] now has the audio recording of
[da] as its sound track, another has the audio of [ga], and so on. There are some
interesting confusions among audio/video mismatch tokens such as these, and
one of them in particular has become a famous and striking demonstration of
the phonetic coherence of speech perception.
Some of the mismatches just don't sound right at all. For example, when you
dub audio [da] onto video [ba], listeners will report that the token is "ba" (in accordance
with the obvious lip closure movement) but that it doesn't sound quite
normal.
The really famous audio/video mismatch is the one that occurs when you dub
audio [ba] onto video [ga]. The resulting movie doesn't sound like either of the
input syllables, but instead it sounds like "da"! This perceptual illusion is called
the McGurk effect after Harry McGurk, who first demonstrated it (McGurk and
MacDonald, 1976). It is a surprisingly strong illusion that only goes away when
you close your eyes. Even if you know that the audio signal is [ba], you can only
hear "da."
The McGurk effect is an illustration of how speech perception is a process
in which we deploy our phonetic knowledge to generate a phonetically coherent
percept. As listeners we combine information from our ears and our eyes to come
to a phonetic judgment about what is being said. This process taps specific pho-
netic knowledge, not just generic knowledge of speech movements. For instance,
Walker et al. (1995) demonstrated that audio/video integration is blocked when
listeners know the talkers, and know that the voice doesn't belong with the
face (in a dub of one person's voice onto another person's face). This shows that
phonetic coherence is a property of speech perception, and that phonetic coher-
ence is a learned perceptual capacity, based on knowledge we have acquired
as listeners.
McGurking ad nauseam.
The McGurk effect is a really popular phenomenon in speech perception,
and researchers have poked and prodded it quite a bit to see how it works.
In fact it is so popular we can make a verb out of the noun "McGurk effect"
— to "McGurk" is to have the McGurk effect. Here are some examples of
McGurking:
Babies McGurk (Rosenblum et al., 1997)
You can McGurk even when the TV is upside down (Campbell, 1994)
Japanese listeners McGurk less than English listeners (Sekiyama and
Tohkura, 1993)
Male faces can McGurk with female voices (Green et al., 1991)
A familiar face with the wrong voice doesn't McGurk (Walker et al., 1995).
5.3 Linguistic Knowledge Shapes Speech Perception.
We have seen so far that our ability to perceive speech is shaped partly by the
nonlinearities and other characteristics of the human auditory system, and we have
seen that what we hear when we listen to speech is partly shaped by the phonetic
knowledge we have gained as speakers. Now we turn to the possibility that speech
perception is also shaped by our knowledge of the linguistic structures of our native
language.
I have already included in section 5.2 (on phonetic knowledge) the fact that
the inventory of speech sounds in your native language shapes speech perception,
so in this section I'm not focusing on phonological knowledge when I say "lin-
guistic structures," but instead I will present some evidence of lexical effects in speech
perception — that is, that hearing words is different from hearing speech sounds.
I should mention at the outset that there is controversy about this point. I will
suggest that speech perception is influenced by the lexical status of the sound
patterns we are hearing, but you should know that some of my dear colleagues
will be disappointed that I'm taking this point of view.
Scientific method: on being convinced.
There are a lot of elements to a good solid scientific argument, and I'm not
going to go into them here. But I do want to mention one point about how
we make progress. The point is that no one individual gets to declare an
argument won or lost. I am usually quite impressed by my own arguments
and cleverness when I write a research paper. I think I've figured something
out and I would like to announce my conclusion to the world. However,
the real conclusion of my work is always written by my audience and it keeps
being written by each new person who reads the work. They decide if the
result seems justified or valid. This aspect of the scientific method, includ-
ing the peer review of articles submitted for publication, is part of what leads
us to the correct answers.
The question of whether speech perception is influenced by word processing
is an interesting one in this regard. The very top researchers — most clever, and
most forceful — in our discipline are in disagreement on the question. Some
people are convinced by one argument or set of results and others are more
swayed by a different set of findings and a different way of thinking about the
question. What's interesting to me is that this has been dragging on for a
long, long time. And what's even more interesting is that as the argument drags
on, and researchers amass more and more data on the question, the theories
start to blur into each other a little. Of course, you didn't read that here!
The way that "slips of the ear" work suggests that listeners apply their know-
ledge of words in speech perception. Zinny Bond (1999) reports perceptual errors
like "spun toffee" heard as "fun stocking" and "wrapping service" heard as
"wrecking service." In her corpus of slips of the ear, almost all of them are word
misperceptions, not phoneme misperceptions. Of course, sometimes we may mis-
hear a speech sound, and perhaps think that the speaker has mispronounced the
word, but Bond's research shows that listeners are inexorably drawn into hearing
words even when the communication process fails. This makes a great deal of
sense, considering that our goal in speech communication is to understand what
the other person is saying, and words (or more technically, morphemes) are the
units we trade with each other when we talk.
This intuition, that people tend to hear words, has been verified in a very clever
extension of the place of articulation experiment we discussed in sections 5.1 and
5.2. The effect, which is named the Ganong effect after the researcher who first
found it (Ganong, 1980), involves a continuum like the one in figure 5.1, but with
a word at one end and a nonword at the other. For example, if we added a final
[g] to our [da]-[ga] continuum we would have a continuum between the word
"dog" and the nonword [gag]. What Ganong found, and what makes me think
that speech perception is shaped partly by lexical knowledge, is that in this new
continuum we will get more "dog" responses than we will get "da" responses in
the [da]-[ga] continuum. Remember the idea of a "perceptual magnet" from above?
Well, in the Ganong effect words act like perceptual magnets; when one end of
the continuum is a word, listeners tend to hear more of the stimuli as a lexical
item, and fewer of the stimuli as the nonword alternative at the other end of the
continuum.
Ganong applied careful experimental controls using pairs of continua like
"tash"—"dash" and "task"—"dask" where we have a great deal of similarity
between the continuum that has a word on the /t/ end ("task"—"dask") and
the one that has a word on the /d/ end ("tash"—"dash"). That way there is less
possibility that the difference in number of "d" responses is due to small acoustic
differences between the continua rather than the difference in lexicality of the
endpoints. It has also been observed that the lexical effect is stronger when
the sounds to be identified are at the ends of the test words, as in "kiss"—"kish"
versus "fiss"—"fish." This makes sense if we keep in mind that it takes a little
time to activate a word in the mental lexicon.
A third perceptual phenomenon that suggests that linguistic knowledge (in the
form of lexical identity) shapes speech perception was called "phoneme restora-
tion" by Warren when he discovered it (Warren, 1970). Figure 5.7 illustrates phoneme
restoration. The top panel is a spectrogram of the word "legislation" and the bot-
tom panel shows a spectrogram of the same recording with a burst of broadband
noise replacing the [s]. When people hear the noise-replaced version of the sound
file in figure 5.7b they "hear" the [s] in [lɛdʒɪsleɪʃn̩]. Arthur Samuel (1991)
reported an important bit of evidence suggesting that the [s] is really perceived
in the noise-replaced stimuli. He found that listeners can't really tell the differ-
ence between a noise-added version of the word (where the broadband noise is
simply added to the already existing [s]) and a noise-replaced version (where the
[s] is excised first, before adding noise). What this means is that the [s] is actually
perceived — it is restored — and thus that your knowledge of the word "legisla-
tion" has shaped your perception of this noise burst.
Jeff Elman and Jay McClelland (1988) provided another important bit of evidence
that linguistic knowledge shapes speech perception. They used the phoneme
restoration process to induce the perception of a sound that then participated in
a compensation for coarticulation. This two-step process is a little complicated,
but it is one of the most clever and influential experiments in the literature.
Step one: compensation for coarticulation. We use a [da]-[ga] continuum just like
the one in figure 5.1, but instead of context syllables [al] and [ar], we use [as] and
[aʃ]. There is a compensation for coarticulation using these fricative context
syllables that is like the effect seen with the liquid contexts. Listeners hear more
"ga" syllables when the context is [as] than when it is [aʃ].
Step two: phoneme restoration. We replace the fricative noises in the words
"abolish" and "progress" with broadband noise, as was done to the [s] of "legislation"
in figure 5.7. Now we have a perceived [s] in "progress" and a perceived [ʃ]
in "abolish" but the signal has only noise at the ends of these words in our tokens.
The question is whether the restoration of [s] and [ʃ] in "progress" and "abolish"
is truly a perceptual phenomenon, or just something more like a decision bias
in how listeners will guess the identity of a word. Does the existence of a word
"progress" and the nonexistence of any word "progresh" actually influence
speech perception? Elman and McClelland's excellent test of this question was to
use "abolish" and "progress" as contexts for the compensation for coarticulation
experiment. The reasoning is that if the "restored" [s] produces a compensation
for coarticulation effect, such that listeners hear more "ga" syllables when these
are preceded by a restored [s] than when they are preceded by a restored [ʃ],
then we would have to conclude that the [s] and [ʃ] were actually perceived by
listeners — they were actually perceptually there and able to interact with the per-
ception of the [da]—[ga] continuum. Guess what Elman and McClelland found?
That's right, the phantom, not-actually-there [s] and [ʃ] caused compensation for
coarticulation — pretty impressive evidence that speech perception is shaped by
our linguistic knowledge.
5.4 Perceptual Similarity.
Now to conclude the chapter, I'd like to discuss a procedure for measuring
perceptual similarity spaces of speech sounds. This method will be useful in later
chapters as we discuss different types of sounds, their acoustic characteristics, and
then their perceptual similarities. Perceptual similarity is also a key parameter in
relating phonetic characteristics to language sound change and the phonological
patterns in language that arise from sound change.
The method involves presenting test syllables to listeners and asking them
to identify the sounds in the syllables. Ordinarily, with carefully produced "lab
speech" (that is, speech produced by reading a list of syllables into a microphone
in the phonetics lab) listeners will make very few misidentifications in this task,
so we usually add some noise to the test syllables to force some mistakes. The
noise level is measured as a ratio of the intensity of the noise compared with the
peak intensity of the syllable. This is called the signal-to-noise ratio (SNR) and
is measured in decibels. To analyze listeners' responses we tabulate them in a con-
fusion matrix. Each row in the matrix corresponds to one of the test syllables
(collapsing across all 10 tokens of that syllable) and each column in the matrix
corresponds to one of the responses available to listeners.
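The bookkeeping in this method can be sketched in a few lines of Python. This is only an illustration under stated assumptions: I take the decibel value to be ten times the base-10 log of the intensity ratio, and the function names `snr_db` and `confusion_matrix` and the handful of trials are invented for the example.

```python
import math
from collections import Counter


def snr_db(signal_intensity, noise_intensity):
    """Signal-to-noise ratio in decibels from two intensities (same units)."""
    return 10.0 * math.log10(signal_intensity / noise_intensity)


def confusion_matrix(trials, stimuli, responses):
    """Tabulate (stimulus, response) pairs: one row per stimulus."""
    counts = Counter(trials)
    return [[counts[(s, r)] for r in responses] for s in stimuli]


# Made-up trials for illustration only (not Miller and Nicely's data).
trials = [("f", "f"), ("f", "th"), ("f", "f"), ("th", "f"), ("th", "th")]
matrix = confusion_matrix(trials, stimuli=["f", "th"], responses=["f", "th"])

print(snr_db(1.0, 1.0))  # 0.0: equal intensities give 0 dB SNR
print(matrix)            # [[2, 1], [1, 1]]
```

Each row of `matrix` sums to the number of times that syllable was presented, which is what lets us turn counts into proportions later on.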
Table 5.2 shows the confusion matrix for the 0 dB SNR condition in George
Miller and Patricia Nicely's (1955) large study of consonant perception. Yep, these
data are old, but they're good. Looking at the first row of the confusion matrix
we see that [f] was presented 264 times and identified correctly as "f" 199 times
and incorrectly as "th" 46 times. Note that Miller and Nicely have more data for
some sounds than for others.
Even before doing any sophisticated data analysis, we can get some pretty quick
answers out of the confusion matrix. For example, why is it that "Keith" is sometimes
pronounced "Keef" by children? Well, according to Miller and Nicely's data,
[θ] was called "f" 85 times out of 232: it was confused with "f" more often than
with any other speech sound tested. Cool. But it isn't clear that these data tell us
anything at all about other possible points of interest — for example, why "this"
and "that" are sometimes said with a [d] sound. To address that question we need
to find a way to map the perceptual "space" that underlies the confusions we observe
in our experiment. It is to this mapping problem we now turn.
5.4.1 Maps from distances.
So, we're trying to pull information out of a confusion matrix to get a picture of
the perceptual system that caused the confusions. The strategy that we will use
takes a list of distances and reconstructs them as a map. Consider, for example,
the list of distances below for cities in Ohio.
Columbus to Cincinnati, 107 miles
Columbus to Cleveland, 142 miles
Cincinnati to Cleveland, 249 miles
From these distances we can put these cities on a straight line as in figure 5.8a,
with Columbus located between Cleveland and Cincinnati. A line works to
describe these distances because the distance from Cincinnati to Cleveland is
simply the sum of the other two distances (107 + 142 = 249).
Here's an example that requires a two-dimensional plane.
Amsterdam to Groningen, 178 km
Amsterdam to Nijmegen, 120 km
Groningen to Nijmegen, 187 km
The two-dimensional map that plots the distances between these cities in the
Netherlands is shown in figure 5.8b. To produce this figure I put Amsterdam and
Groningen on a line and called the distance between them 178 km. Then I drew
an arc 120 km from Amsterdam, knowing that Nijmegen has to be somewhere
on this arc. Then I drew an arc 187 km from Groningen, knowing that Nijmegen
also has to be somewhere on this arc. So, Nijmegen has to be at the intersection
of the two arcs — 120 km from Amsterdam and 187 km from Groningen. This
method of locating a third point based on its distance from two known points
is called triangulation. The triangle shown in figure 5.8b is an accurate depic-
tion of the relative locations of these three cities, as you can see in the map in
figure 5.9.
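The two-arc construction can also be written as a short calculation. The sketch below is one way to do it, assuming we put the first city at the origin and the second on the x-axis; the function name `triangulate` is my own. The third city could equally well sit below the line (the mirror image), so the code simply picks the solution above it.

```python
import math


def triangulate(d_ab, d_ac, d_bc):
    """Place A at (0, 0), B at (d_ab, 0), and solve for C above the line.

    C lies where the circle of radius d_ac around A meets the circle of
    radius d_bc around B (the two arcs described in the text).
    """
    x = (d_ac**2 - d_bc**2 + d_ab**2) / (2.0 * d_ab)
    y_squared = d_ac**2 - x**2
    if y_squared < 0:
        raise ValueError("distances violate the triangle inequality")
    return (x, math.sqrt(y_squared))


# Amsterdam-Groningen 178 km, Amsterdam-Nijmegen 120 km,
# Groningen-Nijmegen 187 km.
nijmegen = triangulate(178.0, 120.0, 187.0)
print(nijmegen)  # roughly (31.2, 115.9)

# The Ohio distances are collinear, so the third city lands on the line.
# Cincinnati-Cleveland 249, Cincinnati-Columbus 107, Cleveland-Columbus 142.
columbus = triangulate(249.0, 107.0, 142.0)
print(columbus)  # (107.0, 0.0)
```

A zero y-coordinate is the numerical counterpart of saying that one dimension is enough for the Ohio cities, while the Dutch cities need two.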
You might be thinking to yourself, "Well, this is all very nice, but what does
it have to do with speech perception?" Good question. It turns out that we can
compute perceptual distances from a confusion matrix. And by using an extension
of triangulation called multidimensional scaling, we can produce a perceptual
map from a confusion matrix.
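To give a feel for what multidimensional scaling does, here is a toy version in plain Python. It is not the algorithm used in statistics packages, and not necessarily the one behind the figures in this chapter: it is a simple gradient descent that nudges points around until their distances match the target distances. The function name `mds_2d`, the step size, and the number of iterations are all assumptions chosen for a small example.

```python
import math


def mds_2d(dist, n_iter=5000, lr=0.05):
    """Toy metric MDS: find 2-D points whose distances approximate dist.

    dist is a symmetric matrix (list of lists) of target distances.
    Plain gradient descent on the squared-error "stress"; fine for a
    handful of points, not a substitute for a statistics package.
    """
    n = len(dist)
    # Deterministic, non-collinear starting layout: points on a unit circle.
    pts = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(n_iter):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pts[i][0] - pts[j][0]
                dy = pts[i][1] - pts[j][1]
                r = math.hypot(dx, dy) or 1e-12  # avoid division by zero
                scale = 2.0 * (r - dist[i][j]) / r
                grads[i][0] += scale * dx
                grads[i][1] += scale * dy
        for i in range(n):
            pts[i][0] -= lr * grads[i][0]
            pts[i][1] -= lr * grads[i][1]
    return pts


# Amsterdam, Groningen, Nijmegen (km), using the distances from the text.
dutch = [[0, 178, 120],
         [178, 0, 187],
         [120, 187, 0]]
amsterdam, groningen, nijmegen = mds_2d(dutch)
print(round(math.dist(amsterdam, groningen)),
      round(math.dist(amsterdam, nijmegen)),
      round(math.dist(groningen, nijmegen)))  # 178 120 187
```

The recovered map is only determined up to rotation, reflection, and shifting, so the coordinates themselves are arbitrary; it is the pattern of distances that carries the information.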
5.4.2 The perceptual map of fricatives.
In this section we will use multidimensional scaling (MDS) to map the percep-
tual space that caused the confusion pattern in table 5.2.
The first step in this analysis process is to convert confusions into distances.
We believe that this is a reasonable thing to try to do because we assume that
when things are close to each other in perceptual space they will get confused
with each other in the identification task. So the errors in the matrix in table 5.2
tell us what gets confused with what. Notice, for example, that the voiced consonants
[v], [ð], [z], and [d] are very rarely confused with the voiceless consonants
[f], [θ], and [s]. This suggests that voiced consonants are close to each other in perceptual
space while voiceless consonants occupy some other region. Generalized
statements like this are all well and good, but we need to compute some specific
estimates of perceptual distance from the confusion matrix.
Here's one way to do it (I'm using the method suggested by the mathem-
atical psychologist Roger Shepard in his important 1972 paper "Psychological
representation of speech sounds"). There are two steps. First, calculate similarity
and then from the similarities we can derive distances.
Similarity is easy. The number of times that you think [f] sounds like "θ" is a
reflection of the similarity of "f" and "θ" in your perceptual space. Also, "f"-"θ"
similarity is reflected by the number of times you say that [θ] sounds like "f", so
we will combine these two cells in the confusion matrix: [f] heard as "θ" and [θ]
heard as "f." Actually, since there may be a different number of [f] and [θ] tokens
presented, we will take proportions rather than raw counts.
Notice that for any two items in the matrix we have a submatrix of four cells:
(a) is the submatrix of response proportions for the "f"/"θ" contrast from Miller
and Nicely's data. Note, for example, that the value 0.75 in this table is the proportion
of [f] tokens that were recognized as "f" (199/264 = 0.754). Listed with
the submatrix are two abstractions from it.
The variables in submatrix (b) code the proportions so that "p" stands for
proportion, the first subscript letter stands for the row label and the second subscript
letter stands for the column label. So $p_{\theta f}$ is a variable that refers to the
proportion of times that [θ] tokens were called "f." In these data $p_{\theta f}$ is equal
to 0.37. Submatrix (c) abstracts this a little further to say that for any two sounds
i and j, we have a submatrix with confusions (subscripts don't match) and
correct answers (subscripts match):
$$\text{(b)}\; \begin{pmatrix} p_{ff} & p_{f\theta} \\ p_{\theta f} & p_{\theta\theta} \end{pmatrix} \qquad \text{(c)}\; \begin{pmatrix} p_{ii} & p_{ij} \\ p_{ji} & p_{jj} \end{pmatrix}$$
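The step from counts to proportions is just a division by the row total. The sketch below uses only the counts quoted in this chapter (264 presentations of [f], 232 of [θ], and three of the four cells); the remaining cell is left out because its count is not quoted here, and the dictionary names are my own.

```python
# Counts quoted in the text from Miller and Nicely's 0 dB SNR data.
presented = {"f": 264, "θ": 232}
counts = {("f", "f"): 199, ("f", "θ"): 46, ("θ", "f"): 85}

# p[(row, column)] = proportion of `row` tokens that drew response `column`.
p = {(stim, resp): n / presented[stim] for (stim, resp), n in counts.items()}

for (stim, resp), value in p.items():
    print(f'[{stim}] heard as "{resp}": {value:.2f}')
# [f] heard as "f": 0.75
# [f] heard as "θ": 0.17
# [θ] heard as "f": 0.37
```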
Asymmetry in confusion matrices.
Is there some deep significance in the fact that [θ] is called "f" more often
than [f] is called "th"? It may be that listeners had a bias against calling things
"th" — perhaps because it was confusing to have to distinguish between "th"
and "dh" on the answer sheet. This would seem to be the case in table 5.2
because there are many more "f" responses than "th" responses overall.
However, the relative infrequency of "s" responses suggests that we may not
want to rely too heavily on a response bias explanation, because the "s"-to-
[s] mapping is common and unambiguous in English. One interesting point
about the asymmetry of [f] and [θ] confusions is that the perceptual confusion
matches the cross-linguistic tendency for sound change (that is, [θ] is
more likely to change into [f] than vice versa). Mere coincidence, or is there
a causal relationship? Shepard's method for calculating similarity from a
confusion matrix glosses over this interesting point and assumes that $p_{f\theta}$
and $p_{\theta f}$ are two imperfect measures of the same thing: the confusability of
"f" and "θ." These two estimates are thus combined to form one estimate
of "f"-"θ" similarity. This is not to deny that there might be something
interesting to look at in the asymmetry, but only to say that for the purpose
of making perceptual maps the sources of asymmetry in the confusion matrix
are ignored.
Here is Shepard's method for calculating similarity from a confusion matrix.
We take the confusions between the two sounds and scale them by the correct
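responses. In math, that's:
$$S_{ij} = \frac{p_{ij} + p_{ji}}{p_{ii} + p_{jj}} \tag{5.1}$$
In this formula, $S_{ij}$ is the similarity between category i and category j. In the case
of "f" and "θ" in Miller and Nicely's data (table 5.2) the calculation is:
$$S_{f\theta} = \frac{p_{f\theta} + p_{\theta f}}{p_{ff} + p_{\theta\theta}} = \frac{0.17 + 0.37}{0.75 + p_{\theta\theta}}$$
where $p_{\theta\theta}$ is the proportion of [θ] tokens correctly called "θ" in table 5.2.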
I should say that regarding this formula Shepard simply says that it "has been
found serviceable." Sometimes you can get about the same results by simply taking
the average of the two confusion proportions $p_{ij}$ and $p_{ji}$ as your measure of
similarity, but Shepard's formula does a better job with a confusion matrix in which
one category has confusions concentrated between two particular responses,
while another category has confusions fairly widely distributed among possible
responses - as might happen, for example, when there is a bias against using one
particular response alternative.
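As a small sketch of the arithmetic, the similarity calculation (confusions scaled by correct responses) and the simpler averaging alternative can be put side by side. The function names are mine, and the proportions below are hypothetical values chosen only to show the calculation; they are not taken from table 5.2.

```python
def shepard_similarity(p_ij, p_ji, p_ii, p_jj):
    """Confusions between i and j, scaled by the correct responses."""
    return (p_ij + p_ji) / (p_ii + p_jj)


def mean_confusion(p_ij, p_ji):
    """The simpler alternative mentioned in the text: average the confusions."""
    return (p_ij + p_ji) / 2.0


# Hypothetical proportions, chosen only to show the arithmetic.
s = shepard_similarity(p_ij=0.2, p_ji=0.4, p_ii=0.7, p_jj=0.5)
print(round(s, 3))                          # 0.5
print(round(mean_confusion(0.2, 0.4), 3))   # 0.3
```

The two measures differ by the scaling in the denominator: when correct responses are rare for a pair of sounds, the scaled similarity goes up relative to the plain average.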
OK, so that's how to get a similarity estimate from a confusion matrix. To get
perceptual distance from similarity you simply take the negative of the natural
log of the similarity:
$$d_{ij} = -\ln(S_{ij})$$
This is based on Shepard's Law, which states that the relationship between per-
ceptual distance and similarity is exponential. There may be a deep truth about
mental processing in this law - it comes up in all sorts of unrelated contexts (Shannon
and Weaver, 1949; Parzen, 1962), but that's a different topic.
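The conversion from similarity to distance is a one-liner. In this sketch the function name `perceptual_distance` and the example similarity values are my own, picked arbitrarily to show how quickly distance grows as similarity shrinks.

```python
import math


def perceptual_distance(similarity):
    """Negative natural log of similarity: smaller similarity, larger distance."""
    if similarity <= 0:
        raise ValueError("similarity must be positive")
    return -math.log(similarity)


for s in (0.9, 0.5, 0.1):  # arbitrary similarity values for illustration
    print(s, round(perceptual_distance(s), 3))
# 0.9 0.105
# 0.5 0.693
# 0.1 2.303
```

A pair of sounds that is never confused has a similarity of zero, for which the log is undefined; the sketch raises an error in that case rather than guessing at a distance.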
Anyway, now we're back to map-making, except instead of mapping the relative
locations of Dutch cities in geographic space, we're ready to map the perceptual
space of English fricatives and "d." Table 5.3 shows the similarities calculated from
the Miller and Nicely confusion matrix (table 5.2) using equation (5.1).
The perceptual map based on these similarities is shown in figure 5.10. One of
the first things to notice about this map is that the voiced consonants are on one
side and the voiceless consonants are on the other. This captures the observation
that we made earlier, looking at the raw confusions, that voiceless sounds were
rarely called voiced, and vice versa. It is also interesting that the voiced and voice-
less fricatives are ordered in the same way on the vertical axis. This might be a
front/back dimension, or there might be an interesting correlation with some
acoustic aspect of the sounds.
In figure 5.10, I drew ovals around some clusters of sounds. These show
two levels of similarity among the sounds as revealed by a hierarchical cluster
analysis (another neat data analysis method available in most statistics software
packages - see Johnson, 2008, for more on this). At the first level of clustering
"θ" and "f" cluster with each other and "v" and "ð" cluster together in the
perceptual map. At a somewhat more inclusive level the sibilants are included with
their non-sibilant neighbors ("s" joins the voiceless cluster and "z" joins the
voiced cluster). The next level of clustering, not shown in the figure, puts [d] with
the voiced fricatives.
Combining cluster analysis with MDS gives us a pretty clear view of the
perceptual map. Note that these are largely just data visualization techniques; we
did not add any information to what was already in the confusion matrix (though
we did decide that a two-dimensional space adequately describes the pattern of
confusions for these sounds).
Concerning the realizations of "this" and "that" we would have to say that
these results indicate that the alternations [ð]-[d] and [ð]-[z] are not driven by
auditory/perceptual similarity alone: there are evidently other factors at work;
otherwise we would find "vis" and "vat" as realizations of "this" and "that."
MDS and acoustic phonetics.
In acoustic phonetics one of our fundamental puzzles has been how to decide
which aspects of the acoustic speech signal are important and which things
don't matter. You look at a spectrogram and see a blob — the question is,
do listeners care whether that part of the sound is there? Does that blob
matter? Phoneticians have approached the "Does it matter?" problem in a
number of ways.
For example, we have looked at lots of spectrograms and asked concerning
the mysterious blob, "Is it always there?" One of the established facts of
phonetics is that if an acoustic feature is always, or even usually, present
then listeners will expect it in perception. This is even true of the so-called
"spit spikes" seen sometimes in spectrograms of the lateral fricatives [ɬ]
and [ɮ]. (A spit spike looks like a stop release burst - see chapter 8 - but
occurs in the middle of a fricative noise.) These sounds get a bit juicy, but
this somewhat tangential aspect of their production seems to be useful in
perception.
Another answer to "Does it matter?" has been to identify the origin of
the blob in the acoustic theory of speech production. For example, some-
times room reverberation can "add" shadows to a spectrogram. (Actually in
the days of reel-to-reel tape recorders we had to be careful of magnetic
shadows that crop up when the magnetic sound image transfers across layers
of tape on the reel.) If you have a theory of the relationship between speech
production and speech acoustics you can answer the question by saying,
"It doesn't matter because the talker didn't produce it." We'll be exploring
the acoustic theory of speech production in some depth in the remaining
chapters of this book.
One of my favorite answers to "Does it matter?" is "Cooper's rule." Franklin
Cooper, in his 1951 paper with Al Liberman and John Borst, commented
on the problem of discovering "the acoustic correlates of perceived speech."
They claimed that there are "many questions about the relation between
acoustic stimulus and auditory perception which cannot be answered
merely by an inspection of spectrograms, no matter how numerous and
varied these might be" (an important point for speech technologists to
consider). Instead they suggested that "it will often be necessary to make
controlled modifications in the spectrogram, and then to evaluate the
effects of these modifications on the sound as heard. For these purposes we
have constructed an instrument" (one of the first speech synthesizers). This
is a pretty beautiful direct answer. Does that blob matter? Well, leave it
out when you synthesize the utterance and see if it sounds like something
else.
And finally there is the MDS answer. We map the perceptual space and
then look for correlations between dimensions of the map and acoustic prop-
erties of interest (like the mysterious blob). If an acoustic feature is tightly
correlated with a perceptual dimension then we can say that that feature
probably does matter. This approach has the advantages of being based on
naturally produced speech, and of allowing the simultaneous exploration of
many acoustic parameters.
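The "look for correlations" step can be sketched as an ordinary Pearson correlation between one dimension of the map and one acoustic measurement per sound. Everything specific here is a placeholder: the function name `pearson_r` is mine, and the coordinates and acoustic values are invented numbers, not measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers, not measurements: one map coordinate per sound and
# one acoustic value per sound (for example, a spectral peak in Hz).
map_dimension = [-1.2, -0.4, 0.3, 1.1]
acoustic_value = [2500.0, 3100.0, 4200.0, 5600.0]
print(round(pearson_r(map_dimension, acoustic_value), 2))
```

A value near 1 or -1 would suggest that the acoustic property lines up with that perceptual dimension; a value near 0 would suggest that it does not, at least not linearly.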
Recommended Reading
Best, C. T. (1995) A direct realist perspective on cross-language speech perception. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Bond, Z. S. (1999) Slips of the Ear: Errors in the Perception of Casual Conversation, San Diego: Academic Press. A collection, and analysis, of misperception in "the wild" — in ordinary conversations.
Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, Cambridge, MA: MIT Press. The theory and evidence for a gestalt theory of audition — a very important book.
Campbell, R. (1994) Audiovisual speech: Where, what, when, how? Current Psychology of Cognition, 13, 76-80. On the perceptual resilience of the McGurk effect.
Cole, R. A. (1973) Listening for mispronunciations: A measure of what we hear during speech. Perception & Psychophysics, 13, 153-6. A study showing that people often don't hear mispronunciations in speech communication.
Cooper, F. S., Liberman, A. M., and Borst, J. M. (1951) The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proceedings of the National Academy of Sciences, 37, 318-25. The source of "Cooper's rule."
Elman, J. L. and McClelland, J. L. (1988). Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes. Journal of Memory and Language, 27, 143-65. One of the most clever, and controversial, speech perception experiments ever reported.
Flege, J. E. (1995) Second language speech learning: Theory, findings, and problems. In W. Strange (ed.), Speech Perception and Linguistic Experience: Theoretical and methodological issues in cross-language speech research, Timonium, MD: York Press, 167-200. Describes a theory of cross-language speech perception in which listeners map new, unfamiliar sounds on to their inventory of native-language sounds.
Ganong, W. F. (1980) Phonetic categorization in auditory word recognition. Journal of Experimental Psychology: Human Perception and Performance, 6, 110-25. A highly influential demonstration of how people are drawn to hear words in speech perception. The basic result is now known as "the Ganong effect."
Green, K. P., Kuhl, P. K., Meltzoff, A. N., and Stevens, E. B. (1991) Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50, 524-36. Integrating gender-mismatched voices and faces in the McGurk effect.
Jakobson, R., Fant, G., and Halle, M. (1952) Preliminaries to Speech Analysis, Cambridge, MA: MIT Press. A classic in phonetics and phonology in which a set of distinctive phonological features is defined in acoustic terms.
Johnson, K. and Ralston, J. V. (1994) Automaticity in speech perception: Some speech/nonspeech comparisons. Phonetica, 51(4), 195-209. A set of experiments suggesting that over-learning accounts for some of the "specialness" of speech perception.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992) Linguistic experiences alter phonetic perception in infants by 6 months of age. Science, 255, 606-8. Demonstrating the perceptual magnet effect with infants.
Liberman, A. M., Harris, K. S., Hoffman, H. S., and Griffith, B. C. (1957) The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54, 358-68. The classic demonstration of categorical perception in speech perception.
Lotto, A. J. and Kluender, K. R. (1998) General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602-19. A demonstration that at least a part of the compensation for coarticulation effect (Mann, 1980) is due to auditory contrast.
Mann, V. A. (1980) Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28, 407-12. The original demonstration of compensation for coarticulation in sequences like [al da] and [or ga].
McGurk, H. and MacDonald, J. (1976) Hearing lips and seeing voices. Nature, 264, 746-8. The audiovisual speech perception effect that was reported in this paper has come to be called "the McGurk effect."
Miller, G. A. and Nicely, P. E. (1955) An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27, 338-52. A standard reference for the confusability of American English speech sounds.
Parzen, E. (1962) On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065-76. A method for estimating probability from instances.
Pastore, R. E. and Farrington, S. M. (1996) Measuring the difference limen for identification of order of onset for complex auditory stimuli. Perception & Psychophysics, 58(4), 510-26. On the auditory basis of the linguistic use of aspiration as a distinctive feature.
Pisoni, D. B. (1977) Identification and discrimination of the relative onset time of two-component tones: Implications for voicing perception in stops. Journal of the Acoustical Society of America, 61, 1352-61. More on the auditory basis of the linguistic use of aspiration as a distinctive feature.
Rand, T. C. (1974) Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55(3), 678-80. The first demonstration of the duplex perception effect.
Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981) Speech perception without traditional speech cues. Science, 212, 947-50. The first demonstration of how people perceive sentences that have been synthesized using only time-varying sine waves.
Rosenblum, L. D., Schmuckler, M. A., and Johnson, J. A. (1997) The McGurk effect in infants. Perception & Psychophysics, 59, 347-57.
Sekiyama, K. and Tohkura, Y. (1993) Inter-language differences in the influence of visual cues in speech perception. Journal of Phonetics, 21, 427-44. These authors found that the McGurk effect is different for people who speak different languages.
Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication. Urbana: University of Illinois. The book that established "information theory."
Shepard, R. N. (1972) Psychological representation of speech sounds. In E. E. David and P. B. Denes (eds.), Human Communication: A unified view. New York: McGraw-Hill, 67-113. Measuring perceptual distance from a confusion matrix.
Walker, S., Bruce, V., and O'Malley, C. (1995) Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57, 1124-33. A fascinating demonstration of how top-down knowledge may mediate the McGurk effect.
Warren, R. M. (1970) Perceptual restoration of missing speech sounds. Science, 167, 392-3. The first demonstration of the "phoneme restoration effect."