Sunday, September 30, 2018

Basic Audition

The human auditory system is not a high-fidelity system. Amplitude is compressed;
frequency is warped and smeared; and adjacent sounds may be smeared together.
Because listeners experience auditory objects, not acoustic records like waveforms
or spectrograms, it is useful to consider the basic properties of auditory percep-
tion as they relate to speech acoustics. This chapter starts with a brief discussion
of the anatomy and function of the peripheral auditory system, then discusses
two important differences between the acoustic and the auditory representation
of sound, and concludes with a brief demonstration of the difference between
acoustic analysis and auditory analysis using a computer simulation of auditory
response. Later chapters will return to the topics introduced here as they relate to
the perception of specific classes of speech sounds.

4.1 Anatomy of the Peripheral Auditory System.

The peripheral auditory system (that part of the auditory system not in the
brain) translates acoustic signals into neural signals; and in the course of the trans-
lation, it also performs amplitude compression and a kind of Fourier analysis of
the signal.
Figure 4.1 illustrates the main anatomical features of the peripheral auditory
system (see Pickles, 1988). Sound waves impinge upon the outer ear, and travel
down the ear canal to the eardrum. The eardrum is a thin membrane of skin which is stretched like the head of a drum at the end of the ear canal. Like the
membrane of a microphone, the eardrum moves in response to air pressure
fluctuations.


These movements are conducted by a chain of three tiny bones in the middle
ear to the fluid-filled inner ear. There is a membrane (the basilar membrane)
that runs down the middle of the conch-shaped inner ear (the cochlea). This mem-
brane is thicker at one end than the other. The thin end, which is closest to the
bone chain, responds to high-frequency components in the acoustic signal, while
the thick end responds to low-frequency components. Each auditory nerve fiber
innervates a particular section of the basilar membrane, and thus carries infor-
mation about a specific frequency component in the acoustic signal. In this way,
the inner ear performs a kind of Fourier analysis of the acoustic signal, breaking
it down into separate frequency components.

4.2 The Auditory Sensation of Loudness.

The auditory system imposes a type of automatic volume control via amplitude
compression, and as a result, it is responsive to a remarkable range of sound inten-
sities (see Moore, 1982). For instance, the air pressure fluctuations produced by
thunder are about 100,000 times larger than those produced by a whisper (see
table 4.1).

Table 4.1 A comparison of the acoustic and perceptual amplitudes of some
common sounds. The amplitudes are given in absolute air pressure fluctuation
(micro-Pascals, µPa), acoustic intensity (decibels sound pressure level, dB SPL),
and perceived loudness (sones).


How the inner ear is like a piano.

For an example of what I mean by "is responsive to," consider the way in
which piano strings respond to tones. Here's the experiment: go to your
school's music department and find a practice room with a piano in it. Open
the piano, so that you can see the strings. This works best with a grand or
baby grand, but can be done with an upright. Now hold down the pedal
that lifts the felt dampers from the strings and sing a steady note very loudly.
Can you hear any of the strings vibrating after you stop singing? This experi-
ment usually works better if you are a trained opera singer, but an enthu-
siastic novice can also produce the effect. Because the loudest sine wave
components of the note you are singing match the natural resonant frequencies
of one or more strings in the piano, the strings can be induced to vibrate
sympathetically with the note you sing. The notion of natural resonant fre-
quency applies to the basilar membrane in the inner ear. The thick part nat-
urally vibrates sympathetically with the low-frequency components of an
incoming signal, while the thin part naturally vibrates sympathetically with
the high-frequency components.

Look at the values listed in the pressure column in the table. For most people,
a typical conversation is not subjectively ten times louder than a quiet office, even
though the magnitudes of their sound pressure fluctuations are. In general, subjective auditory impressions of loudness differences do not match sound pressure
differences. The mismatch between differences in sound pressure and loudness
has been noted for many years. For example, Stevens (1957) asked listeners to adjust
the loudness of one sound until it was twice as loud as another or, in another
task, until the first was half as loud as the second. Listeners' responses were con-
verted into a scale of subjective loudness, the units of which are called sones. The
sone scale for intensities above 40 dB SPL can be calculated with the formula in
equation (4.1), and is plotted in figure 4.2a. (I used the abbreviations "dB" and
"SPL" in this sentence. We are on the cusp of introducing these terms formally
in this section, but for the sake of accuracy I needed to say "above 40 dB SPL".
"Decibel" is abbreviated "dB", and "sound pressure level" is abbreviated "SPL" —
further definition of these is imminent.)

The sone scale shows listeners' judgments of relative loudness, scaled so that a
sound about as loud as a quiet office (2,000 µPa) has a value of 1, a sound that is
subjectively half as loud has a value of 0.5, and one that is twice as loud has a
value of 2. As is clear in the figure, the relationship between sound pressure and
loudness is not linear. For soft sounds, large changes in perceived loudness result
from relatively small changes in sound pressure, while for loud sounds, relatively
large pressure changes produce only small changes in perceived loudness. For
example, if peak amplitude changes from 100,000 µPa to 200,000 µPa, the change
in sones is from 10.5 to 16 sones, but a change of the same pressure magnitude
from 2,000,000 µPa to 2,100,000 µPa produces less than a 2 sone change in loud-
ness (from 64 to 65.9 sones).
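Equation (4.1) itself is not reproduced in this excerpt, but the values just quoted are consistent with the standard Stevens relation in which loudness doubles for every 10 dB increase above 40 dB SPL. The following minimal Python sketch, assuming that relation and the 20 µPa reference pressure defined more fully in the "Decibels" box below, reproduces the numbers in this paragraph; the function names are illustrative, not taken from the book.

    import math

    def db_spl(pressure_upa, reference_upa=20.0):
        # sound pressure level in dB re 20 micro-Pascals
        return 20.0 * math.log10(pressure_upa / reference_upa)

    def sones(pressure_upa):
        # assumed Stevens relation: loudness doubles per 10 dB above 40 dB SPL
        return 2.0 ** ((db_spl(pressure_upa) - 40.0) / 10.0)

    for p in (100_000, 200_000, 2_000_000, 2_100_000):
        print(p, round(sones(p), 1))   # about 10.5, 16.0, 64.0, and 65.9 sones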
Figure 4.2 also shows (in part b) an older relative acoustic intensity scale that
is named after Alexander Graham Bell. This unit of loudness, the bel, is too big
for most purposes, and it is more common to use tenths of a bel, or decibels
(abbreviated dB). This easily calculated scale is widely used in auditory phonetics
and psycho-acoustics, because it provides an approximation to the nonlinearity
of human loudness sensation.
As the difference between dB SPL and dB SL implies (see box on "Decibels"),
perceived loudness varies as a function of frequency. Figure 4.3 illustrates the rela-
tionship between subjective loudness and dB SPL. The curve in the figure repre-
sents the intensities of a set of tones that have the same subjective loudness as a
1,000 Hz tone presented at 60 dB SPL. The curve is like the settings of a graphic
equalizer on a stereo. The lever on the left side of the equalizer controls the rela-
tive amplitude of the lowest-frequency components in the music, while the lever
on the right side controls the relative amplitude of the highest frequencies. This
equal loudness contour shows that you have to amplify the lowest and highest
frequencies if you want them to sound as loud as the middle frequencies 
(whether this sounds good is another issue), so, as the figure shows, the auditory
system is most sensitive to sounds that have frequencies between 2 and 5 kHz.
Note also that sensitivity drops off quickly above 10 kHz. This was part of
my motivation for recommending a sampling rate of 22 kHz (11 kHz Nyquist
frequency) for acoustic/phonetic analysis.


Decibels.
Although it is common to express the amplitude of a sound wave in terms
of pressure, or, once we have converted acoustic energy into electrical
energy, in volts, the decibel scale is a way of expressing sound amplitude
that is better correlated with perceived loudness. On this scale the relative
loudness of a sound is measured in terms of sound intensity (which is pro-
portional to the square of the amplitude) on a logarithmic scale. Acoustic
intensity is the amount of acoustic power exerted by the sound wave's pres-
sure fluctuation per unit of area. A common unit of measure for acoustic
intensity is Watts per square centimeter (W/cm²).
Consider a sound with average pressure amplitude x. Because acoustic
intensity is proportional to the square of amplitude, the intensity of x rela-
tive to a reference sound with pressure amplitude r is x²/r². A bel is the base-
10 logarithm of this power ratio: log10(x²/r²), and a decibel is 10 times this:
10 log10(x²/r²). This formula can be simplified to 20 log10(x/r) = dB.
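As a quick check of the algebra, here is a small Python sketch (my own illustration, not from the text) showing that the intensity form and the simplified pressure form of the decibel formula give the same value:

    import math

    def db_from_intensity(x, r):
        # 10 * log10 of the intensity ratio, with intensity proportional to amplitude squared
        return 10.0 * math.log10((x ** 2) / (r ** 2))

    def db_from_pressure(x, r):
        # the simplified form: 20 * log10 of the pressure ratio
        return 20.0 * math.log10(x / r)

    # a pressure 10 times the reference is 20 dB; 100 times is 40 dB
    print(db_from_intensity(200.0, 20.0), db_from_pressure(200.0, 20.0))    # 20.0 20.0
    print(db_from_intensity(2000.0, 20.0), db_from_pressure(2000.0, 20.0))  # 40.0 40.0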
There are two common choices for the reference level r in dB measure-
ments. One is 20 µPa, the typical absolute auditory threshold (lowest audible
pressure fluctuation) of a 1,000 Hz tone. When this reference value is
used, the values are labeled dB SPL (for Sound Pressure Level). The other
common choice for the reference level has a different reference pressure level
for each frequency. In this method, rather than use the absolute threshold
for a 1,000 Hz tone as the reference for all frequencies, the loudness of a
tone is measured relative to the typical absolute threshold level for a tone
at that frequency. When this method is used, the values are labeled dB SL
(for Sensation Level).
In speech analysis programs, amplitude may be expressed in dB relative
to the largest amplitude value that can be taken by a sample in the digital
speech waveform, in which case the amplitude values are negative numbers;
or it may be expressed in dB relative to the smallest amplitude value that
can be represented in the digital speech waveform, in which case the ampli-
tude values are positive numbers. These choices for the reference level in
the dB calculation are used when it is not crucial to know the absolute
dB SPL value of the signal. For instance, calibration is not needed for
comparative RMS or spectral amplitude measurements.
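For a concrete illustration of these two reference choices, the sketch below assumes a 16-bit waveform, where the largest possible sample magnitude is 32,768 and the smallest nonzero step is 1 (my assumption for illustration, not a description of any particular analysis program); it shows why one convention yields negative dB values and the other positive ones:

    import math

    def db_re(amplitude, reference):
        # amplitude in dB relative to an arbitrary reference amplitude
        return 20.0 * math.log10(amplitude / reference)

    peak = 1000.0                          # some measured sample amplitude on a 16-bit scale
    print(round(db_re(peak, 32768.0), 1))  # re the largest possible value: about -30.3 dB
    print(round(db_re(peak, 1.0), 1))      # re the smallest representable step: about 60.0 dB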

4.3 Frequency Response of the Auditory System.

As discussed in section 4.1, the auditory system performs a running Fourier anal-
ysis of incoming sounds. However, this physiological frequency analysis is not the
same as the mathematical Fourier decomposition of signals. The main difference
is that the auditory system's frequency response is not linear. Just as a change of
1,000 µPa in a soft sound is not perceptually equivalent to a similar change in a
loud sound, so a change from 500 to 1,000 Hz is not perceptually equivalent to
a change from 5,000 to 5,500 Hz. This is illustrated in figure 4.4, which shows the
relationship between an auditory frequency scale called the Bark scale (Zwicker,
1961; Schroeder et al., 1979), and acoustic frequency in kHz. Zwicker (1975) showed
that the Bark scale is proportional to a scale of perceived pitch (the Mel scale) and
to distance along the basilar membrane. A tone with a frequency of 500 Hz has
an auditory frequency of 4.9 Bark, while a tone of 1,000 Hz is 8.5 Bark, a differ-
ence of 3.6 Bark. On the other hand, a tone of 5,000 Hz has an auditory frequency
of 19.2 Bark, while one of 5,500 Hz has an auditory frequency of 19.8 Bark, a
difference of only 0.6 Bark. The curve shown in figure 4.4 represents the fact that
the auditory system is more sensitive to frequency changes at the low end of the
audible frequency range than at the high end.
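The Bark values quoted in this paragraph are close to what the inverse-hyperbolic-sine approximation associated with Schroeder et al. (1979) produces, so a minimal Python sketch of that mapping (my reconstruction, not necessarily the exact formula behind figure 4.4) looks like this:

    import math

    def hz_to_bark(f_hz):
        # Schroeder et al. (1979)-style approximation: Bark = 7 * asinh(f / 650)
        return 7.0 * math.asinh(f_hz / 650.0)

    for f in (500, 1000, 5000, 5500):
        print(f, round(hz_to_bark(f), 1))
    # roughly 5.0, 8.5, 19.2, and 19.8 Bark, close to the values quoted above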
This nonlinearity in the sensation of frequency is related to the fact that the
listener's experience of the pitch of periodic sounds and of the timbre of complex
sounds is largely shaped by the physical structure of the basilar membrane.
Figure 4.5 illustrates the relationship between frequency and location along the
basilar membrane. As mentioned earlier, the basilar membrane is thin at its
base and thick at its apex; as a result, the base of the basilar membrane responds
to high-frequency sounds, and the apex to low-frequency sounds. As figure 4.5
shows, a relatively large portion of the basilar membrane responds to sounds below
1,000 Hz, whereas only a small portion responds to sounds between 12,000 and
13,000 Hz, for example. Therefore, small changes in frequency below 1,000 Hz
are more easily detected than are small changes in frequency above 12,000 Hz.
The relationship between auditory frequency and acoustic frequency shown in
figure 4.4 is due to the structure of the basilar membrane in the inner ear.



4.4 Saturation and Masking.
The sensory neurons in the inner ear and the auditory nerve that respond to sound
are physiological machines that run on chemical supplies (charged ions of sodium
and potassium). Consequently, when they run low on supplies, or are running at
maximum capacity, they may fail to respond to sound as vigorously as usual. This
comes up a lot, actually.
For example, after a short period of silence auditory nerve cell response to a
tone is much greater than it is after the tone has been playing for a little while.
During the silent period, the neurons can fully "recharge their batteries," as they
take on charged positive ions. So when a sound moves the basilar membrane in
the cochlea, the hair cells and the neurons in the auditory nerve are ready to fire.
The duration of this period of greater sensitivity varies from neuron to neuron
but is generally short, maybe 5 to 10 ms. Interestingly, this is about the duration
of a stop release burst, and it has been suggested that the greater sensitivity of
auditory neurons after a short period of silence might increase one's perceptual
acuity for stop release burst information. This same mechanism should make it
possible to hear changing sound properties more generally, because the relevant
silence (as far as the neurons are concerned) is lack of acoustic energy at the
particular center frequency of the auditory nerve cell. So, an initial burst of
activity would tend to decrease the informativeness of steady-state sounds relative
to acoustic variation, whether there has been silence or not.
Another aspect of how sound is registered by the auditory system in time is
the phenomenon of masking. In masking, the presence of one sound makes another,
nearby sound more difficult to hear. Masking has been called a "line busy" effect.
The idea is that if a neuron is firing in response to one sound, and another sound
would tend to be encoded also by the firing of that neuron, then the second sound
will not be able to cause much of an increment in firing — thus the system will be
relatively insensitive to the second sound. We will discuss two types of masking
that are relevant for speech perception: "frequency masking" and "temporal
masking."

Figure 4.6 illustrates frequency masking, showing the results of a test that used
a narrow band of noise (the gray band in the figure) and a series of sine wave
tones (the open circles). The masker noise, in this particular illustration, is 90 Hz
wide and centered on 410 Hz, and its loudness is 70 dB SPL. The dots indicate
how much the amplitude must be increased for a tone to be heard in the pre-
sence of the masking noise (that is, the dots plot the elevation of threshold level
for the probe tones). For example, the threshold amplitude of a tone of 100 Hz
(the first dot) is not affected at all by the presence of the 410 Hz masker, but a
tone of 400 Hz has to be amplified by 50 dB to remain audible. One key aspect
of the frequency masking data shown in figure 4.6 is called the upward spread
of masking. Tones at frequencies higher than the masking noise show a greater
effect than tones below the masker. So, to hear a tone of 610 Hz (200 Hz higher
than the center of the masking noise) the tone must be amplified 38 dB higher
than its normal threshold loudness, while a tone of 210 Hz (200 Hz lower than
the center of the masking noise) needs hardly any amplification at all. This illus-
trates that low-frequency noises will tend to mask the presence of high-frequency
components.

The upward spread of masking: whence and what for?

There are two things to say about the upward spread of masking. First, it
probably comes from the mechanics of basilar membrane vibration in the
cochlea. The pressure wave of a sine wave transmitted to the cochlea from
the bones of the middle ear travels down the basilar membrane, building
up in amplitude (i.e. displacing the membrane more and more), up to the
location of maximum response to the frequency of the sine wave (see figure
4.5), and then rapidly ceases displacing the basilar membrane. The upshot
is that the basilar membrane at the base of the cochlea, where higher fre-
quencies are registered, is stimulated by low-frequency sounds, while the
apex of the basilar membrane, where lower frequencies are registered, is not
much affected by high-frequency sounds. So, the upward spread of masking
is a physiological by-product of the mechanical operation of this little fluid-
filled coil in your head.
"So what?" you may say. Well, the upward spread of masking is used
to compress sound in MP3s. We mentioned audio compression before in
chapter 3 and said that raw audio can be a real bandwidth hog. The MP3
compression standard uses masking, and the upward spread of masking
in particular, to selectively leave frequency components out of compressed
audio. Those bits in the high-frequency tail in figure 4.6, that you wouldn't
be able to hear anyway? Gone. Save space by simply leaving out the inaudible bits.

The second type of masking is temporal masking. What happens here is that
sounds that come in a sequence may obscure each other. For example, a short,
soft sound may be perfectly audible by itself, but can be completely obscured if
it closely follows a much louder sound at the same frequency. There are a lot of
parameters to this "forward masking" phenomenon. In ranges that could affect
speech perception, we would note that the masking noise must be greater than
40 dB SPL and the frequency of the masked sound must match (or be a frequency
subset of) the masker. The masking effect drops off very quickly and is of almost
no practical significance after about 25 ms. In speech, we may see a slight effect
of forward masking at vowel offsets (vowels being the loudest sounds in speech).
Backward masking, where the audibility of a soft sound is reduced by a later-
occurring loud sound, is even less relevant for speech perception, though it is
an interesting puzzle how something could reach back in time and affect your
perception of a sound. It isn't really magic, though; just strong signals traveling
through the nervous system more quickly than weak ones.

4.5 Auditory Representations.

In practical terms what all this means is that when we calculate an acoustic power
spectrum of a speech sound, the frequency and loudness scales of the analyzing
device (for instance, a computer or a spectrograph) are not the same as the audi-
tory system's frequency and loudness scales. Consequently, acoustic analyses of
speech sounds may not match the listener's experience. The resulting mismatch
is especially dramatic for sounds like some stop release bursts and fricatives that
have a lot of high-frequency energy and/or sudden amplitude changes. One way
to avoid this mismatch between acoustic analysis and the listener's experience
is to implement a functional model of the auditory system. Some examples of
the use of auditory models in speech analysis are Liljencrants and Lindblom (1972),
Bladon and Lindblom (1981), Johnson (1989), Lyons (1982), Patterson (1976), Moore
and Glasberg (1983), and Seneff (1988). Figure 4.7 shows the difference between
the auditory and acoustic spectra of a complex wave composed of a 500 Hz and
a 1,500 Hz sine wave component. The vertical axis is amplitude in dB, and the
horizontal axis shows frequency in Hz, marked on the bottom of the graph, and
Bark, marked on the top of the graph. I made this auditory spectrum, and others
shown in later figures, with a computer program (Johnson, 1989) that mimics the
frequency response characteristics shown in figure 4.4 and the equal loudness con-
tour shown in figure 4.3. Notice that because the acoustic and auditory frequency
scales are different, the peaks are located at different places in the two represen-
tations, even though both spectra cover the frequency range from 0 to 10,000 Hz.
Almost half of the auditory frequency scale covers frequencies below 1,500 Hz,
while this same range covers less than two-tenths of the acoustic display. So, low-
frequency components tend to dominate the auditory spectrum. Notice too that
in the auditory spectrum there is some frequency smearing that causes the peak
at 11 Bark (1,500 Hz) to be somewhat broader than that at 5 Bark (500 Hz). This
spectral-smearing effect increases as frequency increases.


Figure 4.8 shows an example of the difference between acoustic and auditory
spectra of speech. The acoustic spectra of the release bursts of the clicks in Xhosa
are shown in (a), while (b) shows the corresponding auditory spectra. Like figure 4.7,
this figure shows several differences between acoustic and auditory spectra. First,
the region between 6 and 10 kHz (20-24 Bark in the auditory spectra), in which
the clicks do not differ very much, is not very prominent in the auditory spectra.
In the acoustic spectra this insignificant portion takes up two-fifths of the frequency
scale, while it takes up only one-fifth of the auditory frequency scale. This serves
to visually; and presumably auditorily, enhance the differences between the
spectra. Second, the auditory spectra show many fewer local peaks than do
the acoustic spectra. In this regard it should be noted that the acoustic spectra
shown in figure 4.8 were calculated using LPC analysis to smooth them; the FFT
spectra which were input to the auditory model were much more complicated
than these smooth LPC spectra. The smoothing evident in the auditory spectra,
on the other hand, is due to the increased bandwidths of the auditory filters at
high frequencies.
Auditory models are interesting, because they offer a way of looking at the speech
signal from the point of view of the listener. The usefulness of auditory models
in phonetics depends on the accuracy of the particular simulation of the peri-
pheral auditory system. Therefore, the illustrations in this book were produced by
models that implement only well-known, and extensively studied, nonlinearities
in auditory loudness and frequency response, and avoid areas of knowledge that
are less well understood for complicated signals like speech.
These rather conservative auditory representations suggest that acoustic ana-
lyses give only a rough approximation to the auditory representations that
listeners use in identifying speech sounds.
Recall from chapter 3 that digital spectrograms are produced by encoding
spectral amplitude in a series of FFT spectra as shades of gray in the spectrogram.
This same method of presentation can also be used to produce auditory spec-
trograms from sequences of auditory spectra. Figure 4.9 shows an acoustic
spectrogram and an auditory spectrogram of the Cantonese word [kai] "chicken" (see
figure 3.20). To produce this figure, I used a publicly available auditory model (Lyons'
cochlear model (Lyons, 1982; Slaney, 1988), which can be found at:
http://linguistics.berkeley.edu/phonlab/resources/). The auditory spectrogram, which
is also called a cochleagram, combines features of auditory spectra and spectro-
grams. As in a spectrogram, the simulated auditory response is represented with
spectral amplitude plotted as shades of gray, with time on the horizontal axis
and frequency on the vertical axis. Note that although the same frequency range
(0-11 kHz) is covered in both displays, the movements of the lowest concentra-
tions of spectral energy in the vowel are much more visually pronounced in the
cochleagram because of the auditory frequency scale.


Recommended Reading
Bladon, A. and Lindblom, B. (1981) Modeling the judgement of vowel quality differences. Journal of the Acoustical Society of America, 69, 1414-22. Using an auditory model to predict vowel perception results.
Brödel, M. (1946) Three Unpublished Drawings of the Anatomy of the Human Ear, Philadelphia: Saunders. The source of figure 4.1.
Johnson, K. (1989) Contrast and normalization in vowel perception. Journal of Phonetics, 18, 229-54. Using an auditory model to predict vowel perception results.
Liljencrants, J. and Lindblom, B. (1972) Numerical simulation of vowel quality systems: The role of perceptual contrast. Language, 48, 839-62. Using an auditory model to predict cross-linguistic patterns in vowel inventories.
Lyons, R. F. (1982) A computational model of filtering, detection and compression in the cochlea. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1282-5. A simulation of the transduction of acoustic signals into auditory nerve signals. Slaney's (1988) implementation is used in this book to produce "cochleagrams."
Moore, B. C. J. (1982) An Introduction to the Psychology of Hearing, 2nd edn., New York: Academic Press. A comprehensive introduction to the behavioral measurement of human hearing ability — auditory psychophysics.
Moore, B. C. J. and Glasberg, B. R. (1983) Suggested formulae for calculating auditory-filter bandwidths and excitation patterns. Journal of the Acoustical Society of America, 74, 750-3. Description of the equivalent rectangular bandwidth (ERB) auditory frequency scale and showing how to calculate simulated auditory spectra from this information.
Patterson, R. D. (1976) Auditory filter shapes derived from noise stimuli. Journal of the Acoustical Society of America, 59, 640-54. Showing how to calculate simulated auditory spectra using an auditory filter-bank.
Pickles, J. O. (1988) An Introduction to the Physiology of Hearing, 2nd edn., New York: Academic Press. An authoritative and fascinating introduction to the physics and chemistry of hearing, with a good review of auditory psychophysics as well.
Schroeder, M. R., Atal, B. S., and Hall, J. L. (1979) Objective measure of certain speech signal degradations based on masking properties of human auditory perception. In B. Lindblom and S. Öhman (eds.), Frontiers of Speech Communication Research, London: Academic Press, 217-29. Measurement and use of the Bark frequency scale.

Seneff, S. (1988) A joint synchrony/mean-rate model of auditory speech processing. Journal of Phonetics, 16, 55-76.
Slaney, M. (1988) Lyons' cochlear model. Apple Technical Report, 13. Apple Corporate Library, Cupertino, CA. A publicly distributed implementation of Richard Lyons' (1982) simulation of hearing.
Stevens, S. S. (1957) Concerning the form of the loudness function. Journal of the Acoustical Society of America, 29, 603-6. An example of classic 1950s-style auditory psychophysics in which the sone scale was first described.
Zwicker, E. (1961) Subdivision of the audible frequency range into critical bands (Frequenzgruppen). Journal of the Acoustical Society of America, 33, 248. An early description of the Bark scale of auditory critical frequency bands.
Zwicker, E. (1975) Scaling. In W. D. Keidel and W. D. Neff (eds.), Auditory System: Physiology (CNS), Behavioral Studies, Psychoacoustics, Berlin: Springer-Verlag. An overview of various frequency and loudness scales that were found in auditory psychophysics research.



Tuesday, September 4, 2018

Researcher degrees of freedom in phonetic research

Researcher degrees of freedom in phonetic research
Timo B. Roettger
Department of Linguistics, Northwestern University, Evanston, United States
timo.roettger@northwestern.edu
Abstract
Every data analysis is characterized by a multitude of decisions, so-called “researcher degrees
of freedom” (Simmons, Nelson, & Simonsohn, 2011), that can affect its outcome and the conclusions
we draw from it. Intentional or unintentional exploitation of researcher degrees of
freedom can have dramatic consequences for our results and interpretations, increasing the
likelihood of obtaining false positives. It is argued that quantitative phonetics faces a large
number of researcher degrees of freedom due to its scientific object, speech, being inherently
multidimensional and exhibiting complex interactions between multiple co-varying layers. A
Type-I error simulation is presented that demonstrates the severe false error inflation when exploring
researcher degrees of freedom. It is argued that, combined with common cognitive fallacies,
unintentional exploitation of researcher degrees of freedom introduces strong bias and
poses a serious challenge to quantitative phonetics as an empirical science. This paper discusses
potential remedies for this problem including adjusting the threshold for significance, drawing
a clear line between confirmatory and exploratory analyses via preregistered reports, open,
honest and transparent practices in communicating data analytical decisions, and direct replications.
Keywords
false positive, methodology, preregistration, replication, reproducibility, speech production, statistical
analysis
Acknowledgment
I would like to thank Aviad Albert, Eleanor Chodroff, Emily Cibelli, Jennifer Cole, Tommy
Denby, Matt Goldrick, James Kirby, Nicole Mirea, Limor Raviv, Shayne Sloggett, Lukas Soenning,
Mathias Stoeber, Bodo Winter, and two anonymous reviewers for valuable feedback on
earlier drafts of this paper. All remaining errors are my own.

1. Introduction
In our endeavor to understand human language, phoneticians and phonologists explore speech
behavior, generate hypotheses about aspects of the speech transmission process, and collect
and analyze suitable data in order to eventually put these hypotheses to the test. We proceed to
the best of our knowledge and capacities, striving to make this process as accurate as possible,
but, as in any other scientific discipline, this process can lead to unintentional misinterpretations
of our data. Within a scientific community, we need to have an open discourse about such
misinterpretations, raise awareness of their consequences, and find possible solutions to them.
In the most widespread framework of statistical inference, null hypothesis significance testing,
one common misinterpretation of data is a false positive, i.e. erroneously rejecting a null
hypothesis (Type I error)1. When undetected, false positives can have far reaching consequences,
potentially leading to bold theoretical claims which may misguide future research.
These types of misinterpretations can be persistent through time because our publication apparatus
neither incentivizes publishing null results nor direct replication attempts, biasing the
scientific record towards novel positive findings. As a result, there may be a large number of
null results in the “file drawer” that will never see the light of day (e.g. Sterling, 1959).
False positives can have many causes. One factor that has received little attention is the
so called “garden of forking paths” (Gelman & Loken, 2014) or what Simmons, Nelson, and Simonsohn (2011) call “researcher degrees of freedom”. Data analysis - that is, the path we choose
from the raw data to the results section of a paper - is a complex process. We can look at data
from different angles and each way to look at them may lead to different methodological and
analytical choices. The potential choices in the process of data analysis are collectively referred
to as researcher degrees of freedom. Said choices, however, are usually not specified in advance
but are commonly made in an ad hoc fashion, after having explored several aspects of the
data and analytical choices. In other words, they are data-contingent rather than motivated on
independent, subject-matter grounds. This is often not an intentional process but a result of
cognitive biases. Intentional or not, exploiting researcher degrees of freedom can increase false
positive rates. This problem is shared by all quantitative scientific fields (Gelman & Loken,
2014; Simmons et al., 2011; Wicherts et al., 2016), but has not been extensively discussed for
the specific characteristics of phonetic analyses.

1 There are other errors that can happen and that are important to discuss. Most closely related to the
present discussion are Type II (Thomas et al., 1985), Type M, and Type S errors (Gelman & Carlin,
2014). Within quantitative linguistics, there are several recent discussions of these errors, see Kirby &
Sonderegger (2018), Nicenboim et al. (2018), Nicenboim & Vasishth (2016), Vasishth & Nicenboim
(2016).


In this paper it will be argued that analyses in quantitative phonetics face a high number
of researcher degrees of freedom due to the inherent multidimensionality of speech behavior,
which is the outcome of a complex interaction between different meaning-signaling layers.
This article will discuss relevant researcher degrees of freedom in quantitative phonetic research,
reasons as to why exploiting researcher degrees of freedom is potentially harmful for
phonetics as a cumulative empirical endeavor, and possible remedies to these issues.
The remainder of this paper is organized as follows: In §2, I will review the concept of
researcher degrees of freedom and how they can lead to misinterpretations in combination with
certain cognitive biases. In §3, I will argue that researcher degrees of freedom are particularly
prevalent in quantitative phonetics, focusing on the analysis of speech production data. In §4, I
will present a simple Type-I error rate simulation, demonstrating that mere chance leads to a
large inflation of false-positives when we exploit common researcher degrees of freedom. In §5,
I will discuss possible ways to contain the probability of false positives due to researcher degrees
of freedom, discussing adjustments of the alpha level (§5.1), stressing the importance of a
more rigorous distinction between confirmatory and exploratory analyses (§5.2), preregistrations
and registered reports (§5.3), transparent reporting (§5.4), and direct replications (§5.5).
2. Researcher degrees of freedom
Every data analysis is characterized by a multitude of decisions that can affect its outcome and,
in turn, the conclusions we draw from it (Gelman & Loken, 2014; Simmons et al., 2011;
Wicherts et al., 2016). Among the decisions that need to be made during the process of learning
from data are the following: What do we measure? What predictors and what mediators do we
include? What type of statistical models do we use?
There are many choices to make and most of them can have an influence on the result that
we obtain. Researchers usually do not make all of these decisions prior to data collection: Rather
they usually explore the data and possible analytical choices to eventually settle on one
‘reasonable’ analysis plan which, ideally, yields a statistically convincing result. Depending on
the statistical framework we are operating in, our statistical results are affected by the number
of hidden analyses performed.
In most scientific papers, statistical inference is drawn by means of null hypothesis-significance-
testing (NHST, Gigerenzer et al., 2004; Lindquist, 1940). Because NHST is by a large margin the most common inferential framework used in quantitative phonetics, the present discussion
is conceived with NHST in mind. However, most of the arguments are relevant for
other statistical frameworks, too. Traditional NHST performs inference by assuming that the
null hypothesis is true in the population of interest. For concreteness, assume we are conducting
research on an isolated undocumented language and investigate the phonetic instantiation
of phonological contrasts. We suspect the language has word stress, i.e. one syllable is phonologically
‘stronger’ than other syllables. We are interested in how word stress is phonetically
manifested and pose the following hypothesis:
(1) There is a phonetic difference between stressed and unstressed syllables.
NHST allows us to compute the probability of observing a result at least as extreme as a test
statistic (e.g. t-value), assuming that the null hypothesis is true (the p-value). In our concrete
example, the p-value tells us the probability of observing our data or more extreme data, if
there really was no difference between stressed and unstressed syllables (null hypothesis). Receiving
a p-value below a certain threshold (commonly 0.05) is then interpreted as evidence to
claim that the probability of the data, if the null hypothesis was in fact true (no difference between
stressed and unstressed syllables), is sufficiently low.
Within this framework, any difference between conditions that yields a p-value below 0.05
is practically considered sufficient to reject the null hypothesis and claim that there is a difference.
However, these tests have a natural false positive rate, i.e. given a p-value of 0.05, there
is a 5% probability that our data accidentally suggests that the null can be refuted. When we
accept the arbitrary threshold of 0.05 to distinguish ‘significant’ from ‘non-significant’ findings
(which is under heated dispute at the moment, see Benjamin et al., 2018, see also §5.1), we
have to acknowledge this level of uncertainty surrounding our results which comes with a certain
base rate of false positives.
If, for example, we decide to measure only a single phonetic parameter (e.g. vowel duration)
to test the hypothesis in (1), 5% would be the base rate of false positives, given a p-value
of 0.05. However, this situation changes if we measure more than one parameter. For example,
we could test, say, vowel duration, average intensity, and average f0 (all common phonetic correlates
of word stress, e.g. Gordon & Roettger, 2017), amounting to three null hypothesis significance
tests. One of these analyses may turn out to be significant, yielding a p-value of 0.05 or
lower. We could go ahead and write our paper, claiming that stressed and unstressed syllables
are phonetically different in the language under investigation.
But this procedure increases the chances of finding a false positive. If n independent comparisons
are performed, the false positive rate would be 1 − (1 − 0.05)^n instead of 0.05; three tests, for example, will produce a false positive rate of approximately 14% (i.e. 1 − 0.95 * 0.95
* 0.95 = 1 − 0.857 = 0.143). Why is that? Assuming that we could get a significant result
with a p-value of 0.05 by chance in 5% of cases, the more often we look at random samples,
the more often we will accidentally find a significant result (e.g. Tukey, 1953).
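A minimal simulation of this inflation (my own sketch, not the Type-I error simulation reported in §4 of this paper; it assumes three independent, normally distributed measures with no true group difference and uses numpy and scipy) makes the point concrete:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sim, n_per_group, n_measures, alpha = 10_000, 20, 3, 0.05

    false_positives = 0
    for _ in range(n_sim):
        # the null hypothesis is true: the two groups do not differ on any measure
        stressed = rng.normal(size=(n_per_group, n_measures))
        unstressed = rng.normal(size=(n_per_group, n_measures))
        p_values = stats.ttest_ind(stressed, unstressed, axis=0).pvalue
        if (p_values < alpha).any():    # claim "a difference" if any measure is significant
            false_positives += 1

    print(false_positives / n_sim)      # roughly 0.14 rather than the nominal 0.05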
Generally, this reasoning can be applied to all researcher degrees of freedom. With
every analytical decision, with every forking path in the analytical labyrinth, with every researcher
degree of freedom, we increase the likelihood of finding significant results due to
chance. In other words, the more we look, explore, and dredge the data, the greater the likelihood
of finding a significant result. Exploiting researcher degrees of freedom until significance
is reached has been called out as a harmful practice for scientific progress (John, Loewenstein, &
Prelec, 2012). Two often-discussed instances of such harmful practices are HARKing (Hypothesizing
After Results are Known, e.g. Kerr, 1998) and p-hacking (e.g. Simmons et al., 2011).
HARKing refers to the practice of presenting relationships that have been obtained after data
collection as if they were hypothesized in advance. P-hacking refers to the practice of hunting
for significant results in order to ultimately report these results as if confirming the planned
analysis. While such exploitations of researcher degrees of freedom are certainly harmful to the
scientific record, there are good reasons to believe that they are, more often than not, unintentional.
People are prone to cognitive biases. Our cognitive system craves coherence, seeing patterns
in randomness (apophenia, Brugger, 2001); we weigh evidence in favor of our preconceptions
more strongly than evidence that challenges our established views (confirmation bias,
Nickerson, 1998); we perceive events as being plausible and predictable after they have occurred
(hindsight bias, Fischhoff, 1975).2 Scientists are no exception. For example, Bakker and
Wicherts (2011) analyzed statistical errors in over 250 psychology papers. They found that
more than 90% of the mistakes were in favor of the researchers’ expectations, making a nonsignificant finding significant. Fugelsang et al. (2004) investigated how scientists evaluate data
that are either consistent or inconsistent with prior expectations. They showed that when researchers
are faced with results that disconfirm their expectations, they are likely to blame the
methodology employed while results that confirmed their expectations were rarely critically
evaluated.
We work in a system that rewards and incentivizes positive results more than negative results
(John et al., 2012), so we have the natural desire to find a positive result in order to publish our findings. A large body of research suggests that when we are faced with multiple decisions,
we may end up convincing ourselves that the decision with the most rewarding outcome
is the most justified one (e.g. Dawson et al., 2002; Hastorf & Cantril, 1954). In light of the dynamic
interplay of cognitive biases and our current incentive structure in academia, having
many analytical choices may lead us to unintentionally exploit these choices during data analysis.
This can inflate the number of false positives.
2 See Greenland (2017) for a discussion of cognitive biases that are more specific to statistical analyses.

3. The garden of forking paths in quantitative phonetics
Quantitative phonetics is no exception to the discussed issues and may in fact be particularly at
risk because its very scientific object offers a considerable number of perspectives and decisions
along the data analysis path. The next section will discuss researcher degrees of freedom in
phonetics, and will, for exposition purposes, focus on speech production research. It turns out
that the type of data that we are collecting, i.e. acoustic or articulatory data, opens up many
different forking paths (for a more discipline-neutral assessment of researcher degrees of freedom,
see Wicherts et al., 2016). I discuss the following four sets of decisions (see Fig. 1 for an
overview): choosing phonetic parameters (§3.1), operationalizing chosen parameters (§3.2),
discarding data (§3.3), and choosing speech-related independent variables (§3.4). These distinctions
are made for convenience and I acknowledge that there are no clear boundaries between
these four sets. They highly overlap and inform each other to different degrees.
Figure 1: Schematic depiction of decision procedure during data analysis that can lead
to an increased false positive rate. Along the first analysis pipeline (blue), decisions are
made as to what phonetic parameters are measured (§3.1), how they are operationalized
(§3.2), what data is kept and what data is discarded (§3.3) and what additional independent
variables are measured (§3.4). The results are statistically analyzed (which
comes with its own set of researcher degrees of freedom, see Wicherts et al., 2016) and
interpreted. If the results are as expected and/or desired, the study will be published. If
not, cognitive biases facilitate reassessments of earlier analytical choices (red arrow) (or
a reformulation of hypotheses, i.e. HARKing), increasing the false positive rate.
3.1 Choosing phonetic parameters
When conducting a study on speech production, the first important analytical decision to test a
hypothesis is the question of operationalization, i.e. how to measure the phenomenon of interest.
For example, how do we measure whether two sounds are phonetically identical, whether
one syllable in the word is more prominent than others, or whether two discourse functions are
produced with different prosodic patterns? In other words, how do we quantitatively capture
relevant features of speech?
Speech categories are inherently multidimensional and vary through time. The acoustic
parameters for one category are usually asynchronous, i.e. appear at different points of time in
the unfolding signal and overlap with parameters for other categories (e.g. Jongman et al.,
2000; Lisker, 1986; Summerfield, 1984; Winter, 2014). If we zoom in and look at a well-researched
speech phenomenon such as the voicing distinction between stops in English, it turns
out that the distinction between voiced and voiceless stops can be manifested by many different
acoustic parameters such as voice onset time (e.g. Lisker & Abramson, 1963), formant transitions
(e.g. Benkí, 2001), pitch in the following vowel (e.g. Haggard et al., 1970), the duration
of the preceding vowel (e.g. Raphael, 1972), the duration of the closure (e.g. Lisker, 1957), as
well as spectral differences within the stop release (e.g. Repp, 1979). Even temporally dislocated
acoustic parameters correlate with voicing, for example, in the words led versus let, voicing
correlates can be found in the acoustic manifestation of the initial /l/ of the word (Hawkins
& Nguyen, 2004). These acoustic aspects correlate with their own set of articulatory configurations,
determined by a complex interplay of different supralaryngeal and laryngeal gestures, coordinated
with each other in intricate ways.
This multiplicity of phonetic cues exponentially grows if we look at larger temporal
windows as it is the case for suprasegmental aspects of speech. Studies investigating acoustic
correlates of word-level prominence, for example, have been using many different measurements
including temporal characteristics (duration of certain segments or subphonemic intervals),
spectral characteristics (intensity measures, formants, and spectral tilt), and measurements
related to fundamental frequency (f0) (Gordon & Roettger, 2017).
Looking at even larger domains, the prosodic expression of pragmatic functions can be
expressed by a variety of structurally different acoustic cues which can be distributed throughout
the whole utterance. Discourse functions are systematically expressed by multiple tonal
events differing in their position, shape, and alignment (e.g. Niebuhr et al., 2011). They can
also be expressed by global or local pitch scaling, as well as acoustic information within the
temporal or spectral domain (e.g. Cangemi, 2015; Ritter & Roettger, 2015; van Heuven & van
Zanten 2005).3 All of these dimensions are potential candidates for our operationalization and
therefore represent researcher degrees of freedom.
If we ask questions such as “is a stressed syllable phonetically different from an unstressed
syllable?”, out of all potential measures, any single measure has the potential to reject
the corresponding null hypothesis (there is no difference). But which measurement should we
pick? There are often many potential phonetic correlates for the relevant phonetic difference
under scrutiny. Looking at more than one measurement seems to be reasonable. However, looking
at many phonetic exponents to test a single global hypothesis increases the false error rate.
If we were to test 20 dimensions of the speech signal repeatedly, on average one of these tests
will, by mere chance, result in a spurious significant result (at a 0.05 alpha level). We obviously
have justified preconceptions about which phonetic dimensions may be good candidates
for certain functional categories, informed by a long list of references. However, while these
preconceptions can certainly help us make theoretically-informed decisions, they may also bear
the risk for ad hoc justifications of analytical choices that happen after having explored researcher
degrees of freedom.
One may further object that dimensions of speech are not independent of each other,
i.e. many phonetic dimensions co-vary systematically. Such covariation between multiple
acoustic dimensions can, for example, arise because the dimensions originate from the same underlying articulatory
configurations. For example, VOT and onset f0, the fundamental frequency at the onset
of the vowel following the stop, are systematically co-varying across languages, which has been
argued to be a biomechanical consequence of articulatory and/or aerodynamic configurations
(Hombert, Ohala, & Ewan, 1979). Löfqvist et al. (1989) showed that when producing voiceless
consonants, speakers exhibit higher levels of activity in the cricothyroid muscle and in turn
greater vocal fold tension. Greater vocal fold tension is associated with higher rates of vocal
fold vibration leading to increased f0. Since VOT and onset f0 are presumably originating from
the same articulatory configuration, one could argue that we do not have to correct for multiple
testing when measuring these two acoustic dimensions. However, as will be shown in §4,
correlated measures lead to false positive inflation rates that are nearly as high as in independent
multiple tests (see also von der Malsburg & Angele, 2017).
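To illustrate this last point, the following variant of the earlier sketch (again my own illustration, assuming two measures, say VOT and onset f0, correlated at r = 0.7 under the null; it is not a reproduction of von der Malsburg & Angele's analysis) still shows a clearly inflated familywise false positive rate:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n_sim, n, alpha = 10_000, 20, 0.05
    cov = [[1.0, 0.7], [0.7, 1.0]]      # two correlated dependent measures, no true effect

    hits = 0
    for _ in range(n_sim):
        group_a = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        group_b = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        p_values = stats.ttest_ind(group_a, group_b, axis=0).pvalue
        if (p_values < alpha).any():
            hits += 1

    # noticeably above 0.05, though a little below the ~0.10 expected for two independent tests
    print(hits / n_sim)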

3 Speech is not an isolated channel of communication, it co-occurs in rich interactional contexts. Beyond acoustic and articulatory dimensions, spoken communication is accompanied by non-verbal modalities such as body posture, eye gaze direction, head movement and facial expressions, all of which have been shown to contribute to comprehension and may thus be considered relevant parameters to measure (Cummins, 2012; Latif et al., 2014; Prieto et al., 2015; Rochet-Capellan et al., 2008; Yehia et al., 1998).
3.2 Operationalizing chosen parameters
The garden of forking paths does not stop here. There are many different ways to operationalize
the dimensions of speech that we have chosen. For example, when we want to extract specific
acoustic parameters of a particular interval of the speech signal, we need to operationalize
how to decide on a given interval. We usually have objective annotation procedures and clear
guidelines that we agree on prior to the annotation process, but these decisions have to be considered
researcher degrees of freedom and can potentially be revised after having seen the results
of the statistical analysis.
Irrespective of the actual annotation, we can look at different acoustic domains. For example,
a particular acoustic dimension such as duration or pitch can be operationalized differently
with respect to its domain and the way it is assessed: In their survey of over a hundred
acoustic studies on word stress correlates, Gordon and Roettger (2017) encountered dozens of
different approaches to quantitatively operationalizing f0, intensity, and spectral tilt as correlates
of word stress. Some studies took the whole syllable as a domain, others targeted the
mora, the rhyme, the coda or individual segments. Specific measurements for f0 and intensity
included the mean, the minimum, the maximum, or even the standard deviation over a certain
domain. Alternative measures included the value at the midpoint of the domain, at the onset
and offset of the domain, or the slope between onset and offset. The measurement of spectral
tilt was also variegated. Some studies measured relative intensity of different frequency bands,
where the choice of frequency bands varied considerably across studies. Yet other studies measured
relative intensity of the first two harmonics.
Similarly, time-series data such as pitch curves, formant trajectories or articulatory
movements have been analyzed as sequences of static landmarks (“magic moments”, Vatikiotis-
Bateson, Barbosa & Best, 2014), with large differences across studies regarding the identity and
the operationalization of these landmarks. For example, articulatory studies looking at intragestural
coordination commonly measure gestural onsets, displacement metrics, peak velocity,
the onset and offset of gestural plateaus, or stiffness (i.e. relating peak velocity to displacement).
Alternatively, time-series data can be analyzed holistically as continuous trajectories,
differing with regard to the degrees of smoothing applied (e.g. Wieling, 2018).
In multidimensional datasets such as acoustic or articulatory data, there may be thousands
of sensible analysis pathways. Beyond that, there are many different ways to process
these raw measurements with regard to relevant time windows and spectral regions of interest;
there are many possibilities of transforming or normalizing the raw data or smoothing and interpolating trajectories.
3.3 Discarding data
Setting aside the multidimensional nature of speech and assuming that we actually have a priori
decided on what phonetic parameters to measure (see 3.1) and how to measure and process
them (see 3.2), we are now faced with additional choices. We let two annotators segment the
speech stream and annotate the signal according to clearly specified segmentation criteria. During
this process, the annotators notice noteworthy patterns that either make the preset operationalization
of the acoustic criteria ambiguous or introduce possible confounds.
The segmentation process may be difficult due to undesired speaker behavior, e.g. hesitations,
disfluencies, or mispronunciations. There may be issues related to the quality of the recording
such as background noise, signal interference, technical malfunctions or interruptive
external events (something that happens quite frequently during field work).
Another aspect of the data that may strike us as problematic when we extract phonetic
information are other linguistic factors that could interfere with our research question. Speech
consists of multiple information channels including acoustic parameters that distinguish words
from each other and acoustic parameters that structure prosodic constituents into rhythmic
units, structure utterances into meaningful units, signal discourse relations or deliver indexical
information about the social context. Dependent on what we are interested in, the variable use
of these information channels may interfere with our research question. For example, a speaker
may produce utterances that differ in their phrase-level prosodic make-up. In controlled production
studies, speakers often have to produce a very restricted set of sentences. Speakers may
for whatever reason (boredom, fatigue, etc.) insert prosodic boundaries or alter the information
structure of an utterance, which, in turn, may drastically affect the phonetic form of other parts
of the signal. For example, segments of accented words have been shown to be phonetically enhanced,
making them longer, louder, and more clearly articulated (e.g. Cho & Keating, 2009;
Harrington et al., 2000). Related to this point, speakers may use different voice qualities, some
of which will make the acoustic extraction of certain parameters difficult. For example, if we
were interested in extracting f0, parts of the signal that are produced with a creaky voice may
not be suitable; or if we were interested in spectral properties of the segments, parts of the signal
produced in falsetto may not be suitable.
It is reasonable to ‘clean’ the data and remove data points that we consider as not desirable,
i.e. productions that diverge from the majority of productions (e.g. unexpected phrasing,
hesitations, laughter, etc.). These undesired productions may interfere with our research hypothesis
and may mask the signal, i.e. make it less likely to find systematic patterns. We can exclude these data points in a list-wise fashion, i.e. subjects who exhibit a certain amount of
undesired behavior (set by a certain threshold) may be entirely excluded. Alternatively, we can
exclude trials on a pair-wise level across levels of a predictor. We can also simply exclude problematic tokens on a trial-by-trial basis. All of these choices change our final data set and, thus,
may affect the overall result of our analysis.
To exemplify the point, let us assume recorded utterances exhibit prosodic variation:
Sometimes the target word carries a pitch accent and sometimes it is deaccented. It is easily
justifiable to only look at the accented tokens, as the segmental cues are likely to be enhanced
in these contexts. It is equally justifiable to do the opposite, only looking at deaccented variants
as this should make a genuine lexical contrast more difficult to obtain and, in turn, makes the
refutation of the null hypothesis (i.e. there is no difference) more difficult.

3.4 Choosing independent variables
Often, subsetting the data or excluding whole clusters of data can have impactful consequences,
as we would discard a large amount of our collected data. Instead of discarding data
due to unexpected covariates, we can add these covariates as independent variables to our
model. In our example, we could include the factor accentuation as an interaction term in
our analysis to see whether the investigated effect interacts with accentuation. If we
find either a main effect of stress on our measurements or an interaction with accentuation,
we would probably proceed and reject the null hypothesis (that there is no difference between
stressed and unstressed syllables). This rationale can be applied to external covariates, too. For
example, in prosodic research, researchers commonly add the sex of their speakers to their
models, either as a main effect or as an interaction term, so as to control for the large sex-specific
f0 variability. Even though this reasoning is justified by independent factors, it represents a set
of researcher degrees of freedom.
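To make these options concrete, here is a minimal R sketch; the data frame d and the columns duration, stress, and accent are hypothetical placeholders rather than variables from any particular study:

    # Three ways of handling the accent covariate when testing for a stress effect
    m1 <- lm(duration ~ stress, data = d)            # ignore accentuation altogether
    m2 <- lm(duration ~ stress + accent, data = d)   # add accent as a main effect
    m3 <- lm(duration ~ stress * accent, data = d)   # let stress interact with accent
    summary(m3)  # check whether stress or its interaction with accent reaches significance

Each of these models is defensible on its own; the problem arises when several of them are tried and only the one that 'works' is reported.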
Figure 2: Schematic example of forking analytical paths as described in §3. First, the
researchers choose from a number of possible phonetic parameters (1), then they choose
in what temporal domain they measure this parameter and how they operationalize it
(2). After extracting the data, the researchers must decide how to deal with dimensions
of speech that are orthogonal to what they are primarily interested in (say the difference
between stressed and unstressed syllables: blue vs. green dots, 3). For example, speakers
might produce words either with or without a pitch accent (light vs. dark dots). They
can discard undesired observations (e.g. discard all deaccented target words), discard
clusters (e.g. discard all contrasting pairs that contain at least one deaccented target
word), or keep the entire data set. If they decide to keep them, the question arises as to
whether to include these moderators and, if so, whether to include them as a simple
predictor or add them to an interaction term (4). Note that these analytical decisions
are just a subset of all possible researcher degrees of freedom at each stage.

To summarize, the multidimensional nature of speech offers a myriad of different ways
to look at our data. It allows us to choose dependent variables from a large pool of candidates;
it allows us to measure the same dependent variable in alternative ways; and it allows us to
preprocess our data in different ways, for example via normalization or smoothing algorithms.
Moreover, the complex interplay of different levels of speech phenomena introduces the possibility
of correcting or discarding data during annotation in a non-blinded fashion; it allows us to
measure other variables that can be used as covariates, mediators, or moderators. These variables
could also enable further exclusion of participants (see Figure 2 for a schematic example).
While there are often good reasons to go down one forking path rather than another,
the sheer number of possible ways to analyze speech comes with the danger of exploring many
of these paths and picking those that yield a statistically significant result. Such practices are
problematic, precisely because they are often unintentional, as they drastically increase
the rate of false positives.
It is important to note that these researcher degrees of freedom are not restricted to speech
production studies. In more psycholinguistically oriented studies in our field, we can, for
example, examine online speech perception by monitoring eye movements (e.g. Tanenhaus et
al., 1995) or hand movements (Spivey et al., 2005). There are many different informative relationships between these measured motoric patterns and aspects of the speech signal. For example,
in eye-tracking studies, we can investigate different aspects of the visual field (foveal, parafoveal,
peripheral); we can look at the number and time course of different fixations of a region
(e.g. first, second, third), the duration of a fixation, or the sum of fixations before entering or exiting
a particular region (see von der Malsburg & Angele, 2017, for a discussion).
More generally, the issue of researcher degrees of freedom is relevant for all quantitative
scientific disciplines (see Wicherts et al., 2016). So far, we have not discussed much of the
analytical flexibility that is specific to statistical modeling: There are choices regarding the
model type, model architecture, and the model criticism procedure, all of which come with
their own set of researcher degrees of freedom. Note that having these choices is not problematic
per se, but unintentionally exploring these choices before a final analytical commitment has
been made may inflate false positive rates. Depending on the outcome and on our preconception
of what we expect to find, confirmation and hindsight bias may lead us to believe that there
are justified reasons to look at the data in one particular way, and then another, until we reach a satisfying (significant)
result. Again, this is a general issue in scientific practice and applies to other disciplines,
too. However, given the multidimensionality of speech as well as the intricate interaction of
different levels of speech behavior, the unintentional exploitation of researcher degrees
of freedom is particularly likely in our field. To demonstrate the possible severity of the issue, the following section presents a simulation that estimates false positive rates under
analytical decisions that speech production studies commonly face.

4. Simulating researcher degrees of freedom exploitation
In this section, a simulation is presented which shows that exploiting researcher degrees of
freedom increases the probability of false positives, i.e. erroneously rejecting the null hypothesis.
The simulation was conducted with R (R Core Team, 2016)4 in order to demonstrate the effect
of two different sets of researcher degrees of freedom: testing multiple dependent variables
and adding a binomial covariate to the model (for similar simulations, see e.g. Barr et al.,
2013; Winter, 2011, 2015). Additionally, the correlation between dependent variables is varied,
with one set of simulations assuming three entirely independent measurements (r = 0) and one
set of simulations assuming measurements that are highly correlated with each other (r = 0.5).
The script to reproduce these simulations is publicly available here (osf.io/6nsfk)5.
The simulation is based on the following dummy experiment: A group of researchers analyzes
a speech production data set to test the hypothesis that stressed and unstressed syllables
are phonetically different from each other. They collect data from 64 speakers producing words
of either stress category. Speakers vary regarding the intonational form of their utterances,
with approximately half of the speakers producing a pitch accent on the target word and the
other half deaccenting the target word.
As opposed to a real-world scenario, we know the true underlying effect, since we draw
values from a normal distribution around a mean value that we specify. In the present simulation,
there is no difference between stressed and unstressed syllables in the “population”, i.e.
values for stressed and unstressed syllables are drawn from the same underlying distribution.
However, due to chance sampling, there are always going to be small differences between
stressed and unstressed syllables in any given sample. Whether the word carried an accent or
not is randomly assigned to data points as well. In other words, there is no true effect of stress,
nor is there an effect of accent on the observed productions.
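For illustration, a minimal R sketch of such a null data set could look as follows; it is not the actual simulation script (which is linked above, osf.io/6nsfk), and the variable names and distributional parameters are arbitrary placeholders:

    set.seed(123)
    n <- 64
    d <- data.frame(
      stress   = rep(c("stressed", "unstressed"), each = n / 2),
      accent   = sample(c("accented", "deaccented"), n, replace = TRUE),
      # both stress categories are drawn from one and the same distribution,
      # so any observed difference reflects sampling variability alone
      duration = rnorm(n, mean = 100, sd = 20)
    )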

4 The script utilizes the MASS package (Venables & Ripley, 2002) and the tidyverse package
(Wickham, 2017).
5 The simulation was inspired by Simmons et al.’s (2011) simulation which is publicly available
here: https://osf.io/a67ft/.

We simulated 10,000 datasets and tested for the effect of stress on dependent variables
under different scenarios. In our dummy scenarios, the researchers explored the following researcher
degrees of freedom:
(i) Instead of measuring a single dependent variable to refute the null hypothesis, the
dummy researchers measured three dependent variables (say, intensity, f0 and duration). Measuring
multiple aspects of the signal is common practice within the phonetic literature across
phenomena. In fact, measuring only three dependent variables is probably on the lower end of
what researchers commonly do. However, phonetic aspects are often correlated, some of them
potentially being generated by the same biomechanical mechanisms. To account for this aspect
of speech, these variables were generated as either entirely uncorrelated or highly correlated with each other (r = 0.5).
(ii) The dummy researchers explored the effect of the accent covariate and tested
whether the inclusion of this variable as a main effect or as an interaction term with stress yields a
significant effect (or interaction effect). Accent (0, 1) was randomly assigned to data rows with
a probability of 0.5, i.e. there is no inherent relationship between accent and the measured parameters.
These two scenarios correspond to researcher degrees of freedom 1 and 4 in Fig. 1 and
2. Based on these scenarios, the dummy researchers ran simple linear models on the respective dependent
variables with stress as a predictor (and accent as well as their interaction in scenario
ii). The simulation counts the number of times the researchers would obtain at least one significant
result (p-value < 0.05) when exploring the garden of forking paths described above.
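The following R sketch condenses the logic of these two scenarios; it is a simplified illustration rather than the published script (osf.io/6nsfk), the number of runs is reduced, and scenario (ii) is collapsed into a single interaction model, so the exact proportions it produces will differ somewhat from those reported below:

    library(MASS)  # for mvrnorm()

    set.seed(1)
    n_subj <- 64
    n_sims <- 1000        # the reported simulation uses 10,000 runs
    rho    <- 0.5         # correlation between the three dependent variables

    one_sim <- function() {
      stress <- rep(0:1, each = n_subj / 2)                     # no true effect of stress
      accent <- rbinom(n_subj, 1, 0.5)                          # randomly assigned covariate
      Sigma  <- matrix(rho, 3, 3); diag(Sigma) <- 1
      y      <- mvrnorm(n_subj, mu = rep(0, 3), Sigma = Sigma)  # three correlated DVs

      # p-values for stress in three separate simple models (scenario i) ...
      p_dv <- sapply(1:3, function(i)
        summary(lm(y[, i] ~ stress))$coefficients["stress", 4])
      # ... and for stress and its interaction with accent (simplified scenario ii)
      cf    <- summary(lm(y[, 1] ~ stress * accent))$coefficients
      p_cov <- cf[c("stress", "stress:accent"), 4]

      c(baseline = p_dv[1] < 0.05,               # one DV, no covariate
        three_dv = any(p_dv < 0.05),             # at least one of three DVs significant
        combined = any(c(p_dv, p_cov) < 0.05))   # three DVs plus the covariate analyses
    }

    res <- replicate(n_sims, one_sim())
    rowMeans(res)  # proportion of simulated 'studies' with at least one significant result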
This simulation (as well as the scenario it is based on) is admittedly a simplification of
real-world scenarios and might therefore not be fully representative. As discussed above, researcher
degrees of freedom are manifold in phonetic investigations, so any simulation
has to be a simplification of the real state of affairs. For example, the simulation is based
on a regression model for a between-subject design and therefore does not take within-subject variation
into account. It is thus important to note that the 'true' false positive rates in any given experimental
scenario are not expected to be precisely as reported here, nor are the relative effects of
the various researcher degrees of freedom necessarily identical to those found in the simulation. Despite these necessary simplifications,
the generated numbers are informative and illustrate the
underlying principles of how exploiting researcher degrees of freedom can impact the false positive
rate.
We should expect about 5% significant results for an alpha level of 0.05 (the commonly
accepted threshold in NHST). Knowing that there is no stress contrast in our simulated population,
any significant result is by definition a false positive. Given the proposed alpha level, we
thus expect that on average 1 out of 20 studies will show a false positive. Fig. 3 illustrates the
results based on 10,000 simulations.
Figure 3: The x axis depicts the proportion of studies for which at least one attempted
analysis was significant at a 0.05 level. Light grey bars indicate the results for uncorrelated
measures, dark grey bars indicate results for highly correlated measures (r = 0.5).
Results were obtained by running three separate linear models, one for each dependent variable
(rows 1-3); by running one linear model with the main effect only, one model
with an accent main effect added as a covariate, and one model with an analysis of
covariance including an accent interaction (row 4; a significant effect is reported if the effect
of the condition or its interaction with accent was significant in any of these analyses).
Rows 5-6 combine multiple dependent variables with the addition of a covariate. The dashed
line indicates the expected false positive base rate of 5%.

The baseline scenario (only one measurement, no covariate added) provides the expected base
rate of false positives (5%, see row 1 in Fig. 3). Departing from this baseline, the false positive
rate increases substantially (rows 2-6). Looking first at the results for the uncorrelated measurements
(light grey bars), the simulation indicates that when the researchers measure three dependent variables (rows 2-3), e.g. three possible acoustic exponents, they obtain a significant
effect in 14.6% of all cases. In other words, by testing three different measurements, the false positive
rate increases by a factor of almost 3. When researchers explore the possibility of the effect
under scrutiny co-varying with a random binomial covariate, they obtain a significant effect
in 11.9% of cases, increasing the false positive rate by a factor of 2.3. These numbers may
strike one as rather low, but they are based on only a small number of researcher degrees of
freedom. Whereas in the simulation these choices are made entirely at random, in real-world scenarios they are potentially informed
by the researcher's preconceptions and expectations and clouded by cognitive
biases.
Moreover, we usually walk through the garden of forking paths and unintentionally exploit
multiple different researcher degrees of freedom at the same time. If the dummy researchers
measure three different acoustic exponents and additionally run analyses with the
added covariate, the false positive rate increases to 31.5%, a factor of over 6, i.e.
it is more than six times larger than our agreed-upon rate of 5%.
One might argue that the acoustic measurements that we take are highly correlated, making
the choice between them less impactful since they measure the same thing. To anticipate
this remark, the same simulations were run with highly correlated measurements (r = 0.5, see
dark grey bars in Fig. 3). Although the false positive rate is slightly lower than for the uncorrelated
measures, we still obtain a large number of false positives. If the dummy researchers measure
three different acoustic exponents and run additional analyses for the added covariate, the
false positive rate is 28%. Thus, despite the measurements being highly correlated, the false positive
rate is still very high (see also von der Malsburg & Angele, 2017, for a discussion of eye-tracking data).
To put these numbers into context, imagine a journal that publishes 10 papers per issue
and each paper reports only one result. Given the scenario described above, 3-4 papers of this
issue would report a false positive.
Keep in mind that the presented simulation is necessarily a simplification of real-world
data sets. The true false positive rates might differ from those presented here. However, even
within a simplified scenario, the present simulation serves as a proof of concept that exploiting
researcher degrees of freedom can impact the false positive rate quite substantially.
Another limitation of the present simulation (and of most simulations of this nature) is
that it is blind to the direction of the effect. Here, we counted every p-value below our set alpha
level as a false positive. In real-world scenarios, however, one could argue that we often
have directional hypotheses. For example, we expect unstressed syllables to exhibit ‘significantly’ shorter durations/lower intensity/lower f0, etc. It could be argued that finding a significant
effect going in the ‘wrong’ direction (i.e. unstressed syllables being longer) is less likely to be considered trustworthy
evidence. However, due to cognitive biases such as
hindsight bias (Fischhoff, 1975) and overconfident beliefs in the replicability of significant results
(Vasishth, Mertzen, & Jäger, 2018), the subjective expectedness of an effect's direction
might be misleading.
The issues discussed here are general issues of data analysis within a particular inferential
framework and are not specific to the phonetic sciences. However, we need to be aware of
these issues and we need to have an open discourse about them. It turns out that one of the aspects
that makes our scientific object so interesting, the complexity of speech, can pose a methodological
burden. But there are ways to substantially decrease the risk of producing false positives.

5. Possible remedies
The demonstrated false-positive rates are not an inevitable fate. In the following, I will discuss
possible strategies to counter the threat posed by researcher degrees of freedom, focusing on five
different topics: I discuss the possibility of adjusting the alpha level as a way to reduce false
positives (§5.1). Alternatively, I propose that drawing a clear line between exploratory and confirmatory
analyses (§5.2) and committing to researcher degrees of freedom prior to data collection
with preregistered protocols (§5.3) can limit the number of false positives. Additionally, a
valuable complementary practice may lie in open, honest and transparent reporting on how
and what was done during the data analysis procedure (§5.4). While transparency cannot per
se limit the exploitation of researcher degrees of freedom, it can facilitate their detection. Finally,
it is argued that direct replications are our strongest strategy against false positives and
researcher degrees of freedom (§5.5).

5.1 Adjusting the significance threshold
The increased false positive rates as discussed in this paper are closely linked to null hypothesis
significance testing which, in its contemporary form, constitutes a dichotomous statistical decision
procedure based on a preset threshold (the alpha level). At first glance, an obvious solution
to reduce the likelihood of obtaining false positives is to adjust the alpha level, either by
correcting it for relevant researcher degrees of freedom or by lowering the decision threshold
for significance a priori.
First, one could correct the alpha level as a function of the number of exploited researcher
degrees of freedom. This solution has mainly been discussed in the context of multiple comparisons,
in which the researcher corrects the alpha level threshold according to the number of
tests performed (Benjamini & Hochberg, 1995; Tukey, 1954). If we were to measure three
acoustic parameters to test a global null hypothesis, which can be refuted by a single statistically
significant result, we would lower the alpha level to account for three tests. These corrections
can be done for example via the Bonferroni or the Šidák method.6
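As a small numerical illustration of the two corrections for three tests at an overall alpha level of 0.05 (following the formulas given in footnote 6):

    alpha   <- 0.05
    n_tests <- 3
    alpha / n_tests                  # Bonferroni: about 0.0167 per test
    1 - (1 - alpha)^(1 / n_tests)    # Šidák:      about 0.0170 per test
    # equivalently, the p-values themselves can be adjusted, e.g. p.adjust(p, method = "bonferroni")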
One may object that corrections for multiple testing are not reasonable in the case of
speech production on the grounds that acoustic measures are usually correlated. In this case,
correcting for multiple tests may be too conservative. However, as demonstrated in §4 and discussed
by von der Malsburg & Angele (2017), multiple testing of highly correlated measures
leads to false positive rates that are nearly as high as for independent multiple tests. Thus, a
multiple comparisons correction is necessary even with correlated measures in order to obtain
the conventional false positive rate of 5%.
For the concrete simulation presented in §4, we could conservatively correct for the number
of tests that we have performed, e.g. correcting for testing three dependent variables. While
such corrections capture the simulated false positive rate for multiple tests on uncorrelated
measures well, they are too conservative for scenarios in
which the measures are correlated to varying degrees.
Given the large analytical decision space that we have discussed above, it remains unclear
how much to correct the alpha level for individual researcher degrees of freedom. Moreover,
given that speech production experiments usually yield a limited amount of data, strong
alpha level corrections can drastically inflate false negatives (Type II errors, e.g. Thomas et al.,
1985), i.e. erroneously accepting the null hypothesis. Alpha level correction also does not help
us detect the exploitation of researcher degrees of freedom. If our dummy researchers have measured
20 different variables, but only report 3 of them in the published manuscript, correcting
the alpha level for the reported three variables is too anti-conservative.

6 One approach is to use an additive (Bonferroni) inequality: For n tests, the alpha level for each test is given by the overall alpha level divided by n. A second approach is to use a multiplicative inequality (Šidák): For n tests, the alpha level for each test is calculated by taking 1 minus the nth root of the complement of the overall alpha level.

Alternatively, one could adopt a more conservative alpha level that we agree on a priori within the
quantitative sciences. Benjamin et al. (2018) recently made such a proposal, recommending
that we lower our commonly agreed alpha level from p ≤ 0.05 to p ≤ 0.005. All else being
equal, lowering the alpha level will reduce the absolute number of false positives (e.g. in the
simulation above, we would only obtain around 0.5% to 3% of false positives across scenarios).
However, it has been argued that lowering the commonly accepted threshold for significance
potentially increases the risk of scientific practices that could harm the scientific record even
further (Amrhein & Greenland, 2018; Amrhein, Korner-Nievergelt, & Roth, 2017; Greenland,
2017; Lakens et al., 2018): In order to demonstrate a ‘significant’ result, studies would need
substantially larger sample sizes7; without such larger samples, the rate of false negatives increases. Requiring larger sample
sizes also increases the time and effort associated with a study (while the possible academic
rewards remain the same), potentially reducing the willingness of researchers to conduct
direct replication studies even further (see §5.5, Lakens et al., 2018). Lowering the threshold for
what counts as significant is also unlikely to tackle issues such as p-hacking, HARKing, or publication
bias. Instead, it might lead to an amplification of cognitive biases that justify the exploration
of alternative analytical pipelines (Amrhein & Greenland, 2018; Amrhein, Korner-Nievergelt, &
Roth, 2017). In sum, although adjusting the alpha level appears to limit false positives, it
is not always clear by how much to adjust it, and doing so might introduce other dynamics that potentially
harm the scientific record.
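To get a feel for the sample-size consequences of such a stricter threshold, base R's power.t.test() can be used; the standardized effect size of d = 0.5 below is purely illustrative (cf. footnote 7):

    n_05  <- power.t.test(delta = 0.5, sd = 1, power = 0.8,
                          sig.level = 0.05,  type = "two.sample")$n  # roughly 64 per group
    n_005 <- power.t.test(delta = 0.5, sd = 1, power = 0.8,
                          sig.level = 0.005, type = "two.sample")$n  # roughly 106 per group
    n_005 / n_05  # about 1.7, in line with the ~70% increase cited in footnote 7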
5.2 Flagging analyses as exploratory vs. confirmatory
One important remedy to tackle the issue of researcher degrees of freedom is to draw a clear
line between exploratory and confirmatory analyses, two conceptually separate phases in scientific
discovery (Box, 1976; Tukey, 1980). In an exploratory analysis, we observe patterns and
relationships which lead to the generation of concrete hypotheses as to how these observations
can be explained. These hypotheses can then be challenged by collecting new data (e.g. in controlled
experiments). Putting our predictions under targeted scrutiny helps us revise our theories
based on confirmatory analyses. Our revised models can then be further informed by additional
exploration of the available data. This iterative process of alternating exploration and
confirmation advances our knowledge. This is hardly news to the reader. However, quantitative
research in science in general, as well as in our field in particular, often blurs the line between
these two types of data analysis.

7 E.g. achieving 80% power with an alpha level of 0.005, compared with an alpha level of 0.05, requires 70% larger sample sizes for between-subjects designs with two-sided tests (see Lakens et al., 2018).

Exploratory and confirmatory analyses should be considered complementary to each
other. Unfortunately, when it comes to publishing our work, they are not weighted equally.
Confirmatory analyses have a superior status, determining the way we frame our papers and
the way funding agencies expect successful proposals to look. This asymmetry can have
harmful consequences that I have already discussed: HARKing and p-hacking. It may also incentivize
researchers to sidestep clear-cut distinctions between exploratory and confirmatory
findings. The publication apparatus forces our thoughts into a confirmatory mindset, while we
often want to explore the data and generate hypotheses. For example, we want to explore what
the most important phonetic exponents of a particular functional contrast are. We may not necessarily
have a concrete prediction that we want to test at this stage, but we want to understand
patterns in speech with respect to their function. Exploratory analyses are necessary to
establish standards as to how aspects of speech relate to linguistic, cognitive, and social variables.
Once we have established such standards, we can agree to only look at relevant phonetic
dimensions, reducing the analytical flexibility with regard to what and how to measure (§3.1-
3.2).
Researcher degrees of freedom mainly affect the confirmatory part of scientific discovery;
they do not restrict our attempts to explore our data. But claims based on exploration
should be cautious. After having looked at 20 acoustic dimensions, any seemingly systematic
pattern may be spurious. Instead, this exploratory step should generate new hypotheses which
we can then confirm or disconfirm using a new dataset. In many experiments, prior to data collection,
it may simply not be clear how a functional contrast will manifest itself phonetically.
Presenting such exploratory analyses as confirmatory may hinder replicability and may give a
false feeling of certainty regarding the results (Vasishth et al., 2018). In a multivariate setting,
which is the standard setting for phonetic research, there are multiple dimensions to the data
that can inform our theories. Exploring these dimensions may often be more valuable than just
a single confirmatory test of a single hypothesis (Baayen et al., 2017). These cases make the
conceptual line between confirmatory and exploratory analyses so important. We should explore
our data. Yes. Yet we should not pretend that we are testing concrete hypotheses when
doing so.
Although our academic incentive system makes drawing this line difficult, journals are becoming
aware of this issue and have started to create incentives to explicitly publish exploratory analyses
(for example Cortex, see McIntosh, 2017). One way of ensuring a clear separation
between exploratory and confirmatory analyses is through preregistrations and registered reports
(Nosek et al., 2018; Nosek & Lakens, 2014).

5.3 Preregistrations and registered reports
A preregistration is a time-stamped document in which researchers specify exactly how they
plan to collect their data and how they plan to conduct their confirmatory analyses. Such reports
can differ with regard to the details provided, ranging from basic descriptions of the
study design to detailed procedural and statistical specifications up to the publication of scripts.
Preregistrations can be a powerful tool to reduce researcher degrees of freedom because
researchers are required to commit to certain decisions prior to observing the data. Additionally,
public preregistration can at least help to reduce issues related to publication bias, i.e. the
tendency to publish positive results more often than null results (Franco et al., 2014; Sterling,
1959), as the number of failed attempts to reject a hypothesis can be tracked transparently (if
the studies were conducted).
There are several websites that offer services and / or incentives to preregister studies
prior to data collection, such as AsPredicted (AsPredicted.org) and the Open Science Framework
(osf.io). These platforms allow us to time-log reports and either make them publicly available
or grant anonymous access only to a specific group of people (such as reviewers and editors
during the peer-review process).
A particularly useful type of preregistration is the peer-reviewed registered report, which an
increasing number of scientific journals have already adopted (Nosek et al., 2018; Nosek & Lakens,
2014, see cos.io/rr for a list of journals that have adopted this model).8 These protocols include
the theoretical rationale of the study and a detailed methodological description. In other words,
a registered report is a full-fledged manuscript minus the results and discussion sections. These
reports are then critically assessed by peer reviewers, allowing the authors to refine their methodological
design. Upon acceptance, the publication of the study results is in-principle guaranteed,
no matter whether the results turn out to provide evidence for or against the researcher’s
predictions.
For experimental phonetics, a preregistration or registered report would ideally include
a detailed description of what is measured and how exactly it is measured/operationalized, as
well as a detailed catalogue of objective inclusion criteria (in addition to other key aspects of
the method including all relevant researcher degrees of freedom related to preprocessing, postprocessing,
statistical modelling, etc.; see Wicherts et al., 2016). Committing to these decisions
prior to data collection can reduce the danger of unintentionally exploiting researcher degrees
of freedom.

8 Note that journals that are specific to quantitative linguistics have not adopted these practices yet.

At first sight, there appear to be several challenges that come with preregistrations (see
Nosek et al., 2018, for an exhaustive discussion). For example, after starting to collect data, we
might realize that our preset exclusion criteria do not capture an important behavioral aspect
of our experiment (e.g., some speakers may produce undesired phrase-level prosodic patterns
which we did not anticipate). These patterns, however, interfere with our research question.
Deviations from our data collection and analysis plan are common. In this scenario, we could
change our preregistration and document these changes alongside our reasons as to why and
when (i.e. after how many observations) we made these changes. This procedure still carries a
substantially lower risk of cognitive biases impacting our conclusions than a situation
in which we did not preregister at all.
Researchers working with corpora may object that preregistrations cannot be applied
to their investigations because their primary data has already been collected. But preregistration
of analyses can still be performed. Although, ideally, we limit researcher degrees of freedom
prior to having seen the data, we can (and should) preregister analyses after having seen
pilot data, parts of the study, or even whole corpora. When researchers generate a hypothesis
that they want to confirm with a corpus dataset, they can preregister analytic plans and commit
to how evidence will be interpreted before analyzing the data.
Another important challenge when preregistering our studies is predicting appropriate
inferential models. Preregistering a data analysis necessitates knowledge about the nature of
the data. For example, we might preregister an analysis assuming that our measurements are
normally distributed. After collecting our data, we might realize that the data have heavy right
tails, calling for a log-transformation; thus, our preregistered analysis might not be appropriate.
One solution to this challenge is to define data-analytical procedures in advance that allow
us to evaluate distributional aspects of the data and potential data transformations irrespective
of the research question. Alternatively, we could preregister a decision tree. This may actually
be tremendously useful for people using hierarchical linear models (often referred to as linear
mixed effects models). When using appropriate random effect structures (see Barr et al., 2013;
Bates et al., 2015), these models are known to run into convergence issues (e.g. Kimball et al.,
in press). To remedy such convergence issues, a common strategy is to drop complex random
effect terms incrementally (Bates et al., 2015). Since we do not know in advance whether a model will
converge, a concrete plan of how to reduce model complexity can be preregistered
(see the preregistration of Roettger & Franke, 2017, for an example).
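A preregistered decision tree of this kind can be quite simple. The following sketch uses lme4; the model formula, the variable names, and the single fallback step are hypothetical and would be adapted to the study at hand:

    library(lme4)

    fit_with_fallback <- function(d) {
      # Step 1: maximal preregistered model (by-speaker intercepts and stress slopes)
      full <- tryCatch(
        lmer(duration ~ stress + (1 + stress | speaker), data = d),
        warning = function(w) NULL,   # treat convergence warnings as a failure
        error   = function(e) NULL
      )
      if (!is.null(full) && !isSingular(full)) return(full)
      # Step 2 (preregistered fallback): drop the by-speaker random slope
      lmer(duration ~ stress + (1 | speaker), data = d)
    }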
Preregistrations and registered reports help us draw a line between the hypotheses we
intend to test and data exploration. Any exploration of the data beyond the preregistered report
has to be considered hypothesis-generating only. If we are open and transparent about this distinction and ideally publicly indicate where we draw the line between the two,
we can limit false positives due to researcher degrees of freedom exploitation in our confirmatory
analyses and commit more honestly to subsequent exploratory analyses.

5.4 Transparency
The credibility of scientific findings is mainly rooted in the evidence supporting them. We assess
the validity of this evidence by constantly reviewing and revising our methodologies, and by
extending and replicating findings. This becomes difficult if parts of the process are not transparent
or cannot be evaluated. For example, it is difficult to evaluate whether the exploitation of
researcher degrees of freedom is an issue for any given study if the authors are not transparent
about when they made which analytical decisions. We end up having to trust the authors. Trust
is good; control is better. We should be aware of, and open about, researcher degrees of freedom
and communicate this aspect of our data analyses to our peers as honestly as we can. If we
have measured ten phonetic parameters, we should provide this information; if we have run
different analyses based on different exclusion criteria, we should say so and discuss the results
of these analyses. An open, honest and transparent research culture is desirable. As argued
above, public preregistration can facilitate transparency of what analytical decisions we make
and when we have made them. Being transparent does, of course, not prevent p-hacking, HARKing,
or other exploitations of researcher degrees of freedom, but it makes these harmful practices
detectable.
For our field, transparency with regard to our analyses has many advantages (e.g.
Nicenboim et al., 2018).9 As analysis is subjective in the sense that it incorporates the researcher's
beliefs and assumptions about a study system (McElreath, 2016), the only way to
make analyses objectively assessable is to be transparent about this aspect of empirical work.
Transparency then allows other researchers to draw their own conclusions as to which researcher
degrees of freedom were present and how they may have affected the original conclusion.

9 Beyond sharing data tables and analysis scripts, it would be desirable to share raw acoustic or articulatory files. However, making these types of data available relies on getting permission from participants in advance (as acoustic data is inherently identifiable). If we collect participants' consent, making raw speech production data available to the community would greatly benefit evidence accumulation in our field. We can share these data on online repositories such as OSCAAR (the Online Speech/Corpora Archive and Analysis Resource: https://oscaar.ci.northwestern.edu/).

In order to facilitate such transparency, we need to agree on how to report aspects of
our analyses. While preregistrations and registered reports lead to better discoverability of researcher
degrees of freedom, they do not necessarily allow us to systematically evaluate them.
We need institutionalized standards, as many other disciplines have already developed. There
are many reporting guidelines that offer standards for reporting methodological choices (see
the Equator Network for an aggregation of these guidelines: http://www.equator-network.org/).
Systematic reviews and meta-analyses such as Gordon & Roettger (2017) and
Roettger & Gordon (2017) can be helpful departure points for creating an overview of possible analytical decisions and their associated degrees of freedom (e.g. what is measured; how it is
measured, operationalized, processed, and extracted; what data are excluded; when and how
they are excluded, etc.). Such guidelines, however, are only effective when a community, including
journals, editors, reviewers, and authors, agrees on their value and applies them.
5.5 Direct replications
The remedies discussed above help us both to limit the exploitation of researcher degrees of
freedom and to make them more detectable, but none of these strategies is a bullet-proof protection
against false positives. To ultimately avoid the impact of false positives on the scientific
record, we should increase our efforts to directly replicate previous research, defined here as
the repetition of the experimental methods that led to a reported finding.
The call for more replication is not original. Replication has always been considered a
tremendously important aspect of the scientific method (e.g. Campbell, 1969; Kuhn, 1962; Popper,
1934/1992; Rosenthal, 1991), and recent coordinated efforts to replicate published results
in the social sciences have uncovered unexpectedly low replicability rates, a state of affairs that
has been termed the ‘replication crisis’. For example, the Open Science Collaboration (2015)
tried to replicate 100 studies that were published in three high-ranking psychology journals.
They assessed whether the replications and the original experiments yielded the same result
and found that only about one third to one half of the original findings (depending on the definition
of replication) were also observed in the replication study. This lack of replicability is
not restricted to psychology. Concerns about the replicability of findings have been raised for
medical sciences (e.g. Ioannidis, 2005), neuroscience (Wager, Lindquist, Nichols, Kober, & van
Snellenberg, 2009), genetics (Hewitt, 2012), cancer research (Errington et al., 2014), and economics
(Camerer et al., 2016).
Most importantly, it is a very real problem for quantitative linguistics, too. For example,
Nieuwland et al. (2018) recently tried to replicate a seminal study by DeLong et al. (2005)
which is considered a landmark study for the predictive processing literature and which has been cited over 500 times. In their preregistered multi-site replication attempt (9 laboratories,
334 subjects), Nieuwland et al. were not able to replicate some of the key findings of the original
study.
Stack, James, & Watson (2018) recently failed to replicate a well-cited effect of rapid
syntactic adaptation by Fine, Jaeger, Farmer, and Qian (2013). After failing to find the original
effect in an extension, they went back and attempted a direct replication of the original study with appropriate
statistical power. They found no evidence for rapid syntactic adaptation.
Possible reasons for the above-cited failures to replicate are manifold. As has been argued
here, the exploitation of researcher degrees of freedom is one reason why there is a large
number of false positives. Combined with other statistical issues such as low power (e.g. for recent
discussion see Kirby & Sonderegger, 2018; Nicenboim et al., 2018), violations of the independence
assumption (Nicenboim & Vasishth, 2016; Winter, 2011, 2015), and the ‘significance’
filter (i.e. treating results as publishable because p < 0.05, which leads to overoptimistic expectations of
replicability, see Vasishth et al., 2018), it is to be expected that a large number of experimental
phonetic findings may not stand the test of time.
The above replication failures sparked a tremendously productive discourse throughout
the quantitative sciences and led to quick methodological advancements and best practice recommendations.
For example, there are several coordinated efforts to directly replicate important
findings by multi-site projects such as the ManyBabies project (Frank et al., 2017) and
Registered Replication Reports (Simons, Holcombe, & Spellman, 2014). These coordinated efforts
can help us put theoretical foundations on a firmer footing. However, the logistic and
monetary resources associated with such large-scale projects are not always available
to everyone in the field.
Replication studies are not very popular because the necessary investment of time and resources
is not appropriately rewarded in contemporary academic incentive systems (Koole &
Lakens, 2012; Makel, Plucker, & Hegarty, 2012; Nosek, Spies, & Motyl, 2012). Both successful
replications (Madden, Easley, & Dunn, 1995) and repeated failures to replicate (e.g. Doyen,
Klein, Pichon, & Cleeremans, 2012) are only rarely published, and if they are, it is usually
in less prestigious outlets than the original findings. To overcome the
asymmetry between the cost of direct replication studies and the presently low academic payoff
for them, we as a research community must re-evaluate the value of direct replications. Funding
agencies, journals, editors, and reviewers should start valuing direct replication attempts, be they
successful replications or replication failures, as much as they value novel findings. For example,
we could either dedicate existing journal space to direct replications (e.g. as an article
type) or create new journals that are specifically dedicated to replication studies.
As soon as we make publishing replications easier, more researchers will be inclined
to replicate both their own work and the work of others. Only by replicating empirical results
and evaluating the accumulated evidence can we substantiate previous findings and extend
their external validity (Ventry & Schiavetti, 1986).

6. Summary and concluding remarks
This article has discussed researcher degrees of freedom in the context of quantitative phonetics.
Researcher degrees of freedom concern all possible analytical choices that may influence
the outcome of our analysis. In a null-hypothesis-significance testing framework of inference,
intentional or unintentional exploitation of researcher degrees of freedom can have a dramatic
impact on our results and interpretations, increasing the likelihood of obtaining false positives.
Quantitative phonetics faces a large number of researcher degrees of freedom due to its scientific
object being inherently multidimensional and exhibiting complex interactions between
many co-varying layers of speech. A Type I error simulation demonstrated substantial false positive
rates when combining just two researcher degrees of freedom, namely testing more than one
phonetic measurement and including a speech-relevant covariate in the analysis. It has been
argued that, combined with common cognitive fallacies, the unintentional exploitation of researcher
degrees of freedom introduces strong bias and poses a serious challenge to quantitative
phonetics as an empirical science.
Several potential remedies for this problem have been discussed. When operating in the
NHST statistical framework, we can reconsider our preset threshold for significance, a strategy
that has been argued to come with its own set of shortcomings. Several alternative solutions
have been proposed (see Fig. 4).

Figure 4: Schematic depiction of a decision procedure during data analysis that limits
false positives: Prior to data collection, the researcher commits to an analysis pipeline
via preregistration/registered reports, leading to a clear separation of confirmatory
(blue) and exploratory analysis (light green). The analysis is executed accordingly and
the results are interpreted with regard to the confirmatory analysis. After the confirmatory
analysis, the researcher can revisit the decision procedure and explore the data
(green arrow). The interpretation of the confirmatory analysis and potentially insights
gained from the exploratory analysis are published alongside an open and transparent
track record of all analytical steps (preregistration, code and data for both confirmatory
and exploratory analysis). Finally, either prior to publication or afterwards, the study is
directly replicated (dark green arrow) by either the same research group or independent
researchers in order to substantiate the results.

We should draw an explicit line between confirmatory and exploratory analyses. One way to
enforce such a line is via preregistrations or registered reports, records of the experimental
design and the analysis plan that are committed to prior to data collection and analysis. While preregistration
offers better detectability of researcher degrees of freedom, standardized reporting
guidelines and transparent reporting might facilitate a more objective assessment of these researcher
degrees of freedom by other researchers. Yet all of these proposals come with their
own limitations and challenges. A complementary strategy to limit false positives lies in direct
replications, a form of research that is not well rewarded within the present academic system.
We as a community need to openly discuss such issues and find practically feasible
solutions to them. Possible solutions must not only be feasible from a logistic perspective
but should also avoid punishing rigorous methodology within our academic incentive system.
Explicitly labeling our work as exploratory, being transparent about potential bias due to researcher
degrees of freedom, or attempting direct replications may make it more difficult to be rewarded
for our work (i.e. by being able to publish our studies in prestigious journals). Thus,
authors, reviewers, and editors alike need to be aware of these methodological challenges. The
present paper was conceived in the spirit of such an open discourse. Indeed, the single most powerful
solution to the methodological challenges described in this paper is engaging in a critical
and open discourse about our methods and analyses.

7. References