
Correlations Between Genetic and Linguistic Data.

5.1 The ‘New Synthesis’.
An approach which can diagnose and use even unknown loans turns out
to be of considerable relevance when we turn to another area of controversy,
this time moving beyond linguistics per se to a set of observed
though contentious interdisciplinary correlations. It is easy to see language
families as abstractions, and perhaps our conventional representation
of each language, reconstructed or attested, as a single node
encourages that kind of thinking. But reconstructed languages in the
past, like languages today, must have had speakers; so it follows that
human histories, rather than simply linguistic histories, are necessarily
involved.
However, it is impossible to find out about those human histories
through linguistic work alone: we have to take an interdisciplinary approach.
One possibility, which is much in the scientific news at the moment,
is the ‘new synthesis’; and its proponents argue that we can bring
together evidence from linguistics, genetics, and archaeology, assess
whether meaningful correlations exist between these disciplines, then use
this cumulative evidence to provide clues to the histories of human populations.
Moreover, work of this kind might help us understand the features
of populations today, by revealing prehistoric affiliations and contacts.
The idea of constructing trees for linguistic and genetic groupings and
measuring the degree of similarity between them is, of course, not a new
one: we might trace the start of work on mappings between linguistics and
genetics to the publication of the well-known Cavalli-Sforza et al. (1988)
parallel linguistic–genetic tree, which is shown in Figure 5.1.
As is well known, there has been a good deal of criticism of this tree
(Bateman et al. 1990; McMahon and McMahon 1995; Sims-Williams
1998), both because of lack of independence of the populations sampled
(some genetic populations, like Na-Dene, are defined on the basis of the
language spoken in the community, which rather begs the question) and
because it includes very long-range comparisons—many historical linguists
would consider constructs like Amerind, Nostratic, and Eurasiatic
essentially unfounded (Campbell 1988; McMahon and McMahon 1995).
However, it would naturally be unreasonable to reject any prospect of
meaningful matches because of problems in a single early application. We
can, however, learn two lessons from the Cavalli-Sforza tree and its
critics. First, any reasonable attempt to establish correlations must be based on good genetics, and good linguistics: if the methodology or
results on either side are suspect, the best we can hope for is Matisoff’s
vision (1990) of two drunks supporting each other. Good science cannot
come to good conclusions on the basis of bad data. It follows that our
discussion below will rule out from the start any correlations based on
Greenberg’s mass comparison (see Ch. 1 above), or involving long-range
language megafamilies—which we see as currently unsupported. Second,
we cannot expect any successful and enlightening attempt to identify
correlations between genetics and linguistics to involve straightforward,
one-to-one matches: both genetic and linguistic histories are too complex
for that. Indeed, it is precisely the common, but independent, characteristics
of the two systems, since both involve gradual divergence along with
contact phenomena, that make parallels between the two, methodologically
and in terms of results, so attractive. But dealing with that complexity
will involve the interpretation of equally complex patterns, not writing
equations of the x = y type.
Even accepting these concerns, hopes for the so-called ‘new synthesis’
of disciplines are high: Cavalli-Sforza (2000: vii), for instance, introduces
his recent book as follows:
This book surveys the research on human evolution from the many different
fields of study that contribute to our knowledge. It is a history of the last hundred
thousand years, relying on archaeology, genetics, and linguistics. Happily, these
three disciplines are now generating many new data and insights. All of them can
be expected to converge toward a common story, and behind them must lie a
single history. Singly, each approach has many lacunae, but hopefully their
synthesis can help to fill the gaps.
However, the promise of the new synthesis has not yet been achieved, and
Renfrew (1999: 1–2) is rather more cautious in his assessment, noting
that ‘We may be on the brink of seeing some convergence in our understanding
of issues of genetic diversity, cultural diversity, and linguistic
diversity. It may be possible, then, to work toward a uniWed reconstruction
of the history of human populations. It is much needed, because
certainly we do not have such a uniWed history at the moment.’ Sims-
Williams (1998) also provides a careful, critical overview of the whole
area and its prospects. Before proceeding to consider some suggested
correlations, then, it is important to review some possible misunderstandings
and problems.

5.2 Correlations Between Genetics and Linguistics: Cautions and Caveats.
We should begin by defining, at least in a very general way, just what we
mean by correlations between genetics and linguistics. Most importantly,
of course, there is no claim of determinism between genetics and linguistics
(which at its most problematic and simplistic would mean that the genes an
individual carries determine the language she speaks). This is a notoriously
difficult area, since we do not wish to reject the hypothesis that there is a
genetic component underlying language either: it is entirely possible for the
human species to be predisposed to language learning and use without any
of that genetic hard-wiring corresponding to characteristics of English, as
opposed to Estonian or Quechua. Exploring these issues further is beyond
the scope of this book; but it is worth simply reiterating that establishing
population-level correlations between linguistic and genetic features does
not imply any causal connection between the two systems.
It is also worth stressing the necessity of working at the population
level when exploring these correlations. Calculations will ideally be based
on many individuals, involving averages and probabilities across groups,
not absolute values for individuals. Although molecular genetics has had
a higher public profile recently, because of the advances in the Human
Genome Project, for instance, it is not the best genetic correlate
for linguistic variation. One of the immediate concerns about cross-disciplinary
comparisons of the kind we are proposing involves the apparently
discrete and absolute nature of genetic haplotypes at the DNA
level, where we either have a sequence of GTA, or ATG, and not something
in between; quite reasonably, linguists see this as contrasting absolutely
with the naturally variable and choice-ridden data of language.
However, the right comparison, we believe, is with population genetics, or
evolutionary biology, rather than molecular genetics. When we scale up
the molecular material for a whole population we do see variation and
‘choice’, so that at the same locus we might find 10% of one population
with GTA and 90% with ATG, and the opposite ratio in a second
population, while yet a third has 10% GTA and 90% ATA. This looks
much more like the shifting, variable patterns linguists know and love.
It is also worth reminding ourselves that populations are abstractions,
like speech communities. It would clearly be unrealistic to expect, in a sociolinguistic survey, that each member of a speech community would
use a particular variant 33% of the time, or that all middle-class women
would use that same variant in precisely 95% of their formal speech (even
assuming that we are confident we can define relativistic constructs like
‘middle-class’ or ‘formal’ in such a definite and non-overlapping way). As
linguists, however, we can transcend the individual level, and interpret
data of this kind in terms of its reality for the speech community. Individual
speakers are not robbed of their identity or their uniqueness by
being grouped together into broader categories; and for different purposes
we can study the individual or the group.
Exactly the same is true of genetic studies of populations. Individuals are
important; but there are some studies for which we need to take a broader
view, and categorize people into groups according to the average of their
genetic characteristics. It may seem unlikely, looking at cosmopolitan,
modern, urban European populations, for instance, that we can ever
reach any meaningful conclusion on their genetic characteristics, since
each individual will have his or her own highly specific history. But averaging
over a sufficiently large number of individuals can indeed reveal
particular frequent, key attributes for the group, alongside the individual
markers which signal a history outside that population as unusual and
marginal. Put in linguistic terms, we might doubt, listening to several
speakers from the same area, that we can subsume their distinct and
individual accents under a single system; but grouping together a whole
range of such speakers may well reveal shared characteristics. We both have
noticeably Scots accents; closer inspection reveals that one author has
acquired a marginal contrast of /æ/ versus /ɑ/ over 15 years of living in
England, though typically Scots lack this distinction and have a single,
undifferentiated low mid /a/ vowel (the other author is holding out and
has no such contrast). This does not remove the general impression of
Scottishness when either of us speaks; and it does not contradict the
observation that most Scots lack the Sam–psalm opposition. Both observations
are valid, and relevant for different purposes. It is also important to
note that contemporary urban populations, with their history of input from
widely divergent genetic (and linguistic) sources, are by no means the norm
either diachronically or diatopically: smaller, closer-knit communities with
greater continuity represent a more usual basis for human histories.
This has three implications for work of the sort discussed below. First,
it is important that we should collect both linguistic and genetic data from ‘older’, more isolated populations before admixture levels out many of
the signals in which we are interested. Second, those of us who are urban
speakers and rejoice in our mixed and exotic heritage have to accept that
we are relatively unusual in global terms, and therefore that our own
experiences and expectations do not amount to a necessary rejection of
these methods and results. Equally, however, we cannot simply ignore all
those mixed populations that do exist, and of course admixture at some
level occurs even in the smallest and most traditional of groups (see
McMahon 2004): hence, we must as a matter of urgency investigate
means of recognizing and, where necessary, factoring out admixture.
This will be a recurring theme in the discussion below, and, as we shall
see, diVerences in practice here between geneticists and historical linguists
represent a significant threat to progress in the ‘new synthesis’.
These preliminaries are important in breaking down possible misperceptions
of the meaning of genetic/linguistic correlations. Turning to a
broad definition of those correlations, we mean simply that, all else being
equal, when the languages spoken by two populations are closely related
we might expect genes present in the two populations to be similar; and,
conversely, when the languages are only distantly related (or unrelated)
the genetic profiles of the two populations should also show considerable
differentiation. We can then hope to use these affinities between linguistics
and genetics to help us cast light on the histories of particular populations.
Since populations, after all, consist of people who both carry genes
and use languages, it might be more surprising if there were no correlations
between genetic and linguistic configurations. The general observation
goes back to Darwin (1996 [1859]: 342), who suggested that ‘If we
possessed a perfect pedigree of mankind, a genealogical arrangement of
the races of man would afford the best classification of the various
languages now spoken throughout the world.’ This might now seem
somewhat overstated: clearly, ‘The correlation between genes and languages
cannot be perfect’, because both languages and genes can be
replaced independently; but the relationship ‘Nevertheless . . . remains
positive and statistically significant’ (Cavalli-Sforza 2000: 167).
This correlation is supported by a range of recent studies, and we explore
several of these in detail in the following section. To take just one example,
Barbujani (1997: 1011) reports that ‘In Europe, for example, . . . several
inheritable diseases differ, in their incidence, between geographically
close but linguistically distant populations’. In this case and others we find a general and telling statistical correlation between genetic and
linguistic features, which reflects interesting and investigable parallelism
rather than determinism. Where we observe genetic and linguistic parallels
today, we therefore hypothesize earlier ancestral identity: as Barbujani
(ibid. 1014) observes:
Population admixture and linguistic assimilation should have weakened the
correspondence between patterns of genetic and linguistic diversity. The fact
that such patterns are, on the contrary, well correlated at the allele-frequency
level . . . suggests that parallel linguistic and allele-frequency change were not the
exception, but the rule.
The ‘new synthesis’ may look promising, but at present it is limited, since
most recent work has involved correlations between archaeology and
genetics: Renfrew and Boyle (2000) coin the term ‘archaeogenetics’ for
exactly this bilateral disciplinary match. There remain some doubts over
the feasibility of including linguistic evidence, in large part because of the
generally non-quantitative approaches favoured by historical and comparative
linguists, and the consequent difficulties of establishing repeatable,
demonstrably correct results, let alone parallels with other
disciplines. Archaeology and, to an even greater extent, genetics are
quantitative in their approaches and methods, and in their evaluations
of results; and if their practitioners are to understand and use historical-linguistic
data, linguists must therefore deal in probabilities and degrees
of relatedness. Here we have a further motivation for the development of
quantitative methods in historical linguistics: if we are genuinely interested
in interdisciplinary research, and do not supply numbers of our
choosing, we cannot be surprised if archaeologists and geneticists attempt
to provide their own. To give just one example, Poloni et al. (1997: 1017–
18) adopt the following methodology:
Linguistic distances between pairs of populations were defined as simple dissimilarity
indexes . . . two populations within the same language family are set to a
distance of 3 if they belong to different subfamilies; their distance is decreased by
1 for each shared level of classification—up to three shared levels, where their
distance is set to 0 . . . a dissimilarity index of 8 was arbitrarily assigned to any pair
of populations belonging to different language families.
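To make the scheme quoted above concrete, here is a minimal sketch of such a dissimilarity index in Python. The classification tuples and language names are invented for illustration only and are not Poloni et al.'s actual codings; the index simply starts at 3 for two languages in the same family, subtracts one for each further shared level, and returns 8 for languages in different families.

```python
# Sketch of a Poloni-style linguistic dissimilarity index (illustrative only).

def linguistic_distance(class_a, class_b):
    """class_a, class_b: tuples of (family, subfamily, level 2, level 3)."""
    if class_a[0] != class_b[0]:
        return 8                          # different families: arbitrary maximum
    shared = 0
    for a, b in zip(class_a[1:], class_b[1:]):
        if a == b:
            shared += 1                   # one point off for each shared sub-level
        else:
            break
    return max(3 - shared, 0)             # same family: start at 3, floor at 0

# Invented classifications, purely for the example
french  = ("Indo-European", "Italic", "Romance", "Gallo-Romance")
italian = ("Indo-European", "Italic", "Romance", "Italo-Romance")
english = ("Indo-European", "Germanic", "West Germanic", "Anglic")
finnish = ("Uralic", "Finno-Ugric", "Finnic", "Finnish")

print(linguistic_distance(french, italian))   # 1: two shared sub-levels
print(linguistic_distance(french, english))   # 3: same family, different subfamilies
print(linguistic_distance(french, finnish))   # 8: different families
```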
What this means is that Poloni and her colleagues, urgently requiring
some numbers to feed into their computations, have almost arbitrarily
assigned grades of relatedness of 0, 1, 2, and 3 to pairs of languages, with a score of 8 for pairs generally thought to be unrelated. If we as linguists feel
that these are crude overgeneralizations, then the onus is very much on us
to provide better-reasoned alternatives. Not all colleagues may agree that
linguists should feel under any obligation to change the way we do linguistics,
just because other disciplines are interested in our results; Smith (1989:
185) takes the more insular view that ‘linguistic theory is not affected by the
fact that its subject matter can also be of interest to others: the hydrologist’s
theories are not affected by spitting’. As the last four chapters suggest,
we take a very different view, and see the development of quantitative
and computational methods as crucial to progress within historical and
comparative linguistics, for discipline-internal as well as interdisciplinary
reasons. Clearly, we are not alone, either: after a gap following Embleton
(1986), there is now something of a resurgence in interest in quantitative
methods among historical linguists and colleagues in other disciplines
(Kessler 2001; Ringe, Warnow and Taylor 2002; McMahon and
McMahon 2003, 2004; Heggarty et al. forthcoming; Forster, Toth, and
Bandelt 1998; Forster and Toth 2003; Gray and Atkinson 2003; Renfrew,
McMahon, and Trask 2000). As quantitative methods develop further,
one of the main barriers to integrating linguistics into the ‘new synthesis’
seems set to disappear. It is therefore timely to consider some general issues
relating to correlations specifically between linguistics and genetics.

5.3 Evidence for Correlations.
5.3.1 Genetic Evidence and Sampling.
A range of recent studies in the genetics literature discuss evidence for
correlations between genetics and linguistics at the population level.
Looking ahead, we shall see that there are interesting parallelisms, but
that correlations seem less significant in some cases than others. As we
shall show, a very influential factor here, which has not so far been taken
into account, is the different attitudes of linguists and geneticists to
admixture between systems.
Considerations of space mean it is possible to discuss only four studies:
Sokal (1988), Poloni et al. (1997), Gray and Jordan (2000), and Rosser
et al. (2000). An overview of current literature on genetics–linguistics
correlations at the population level is provided by McMahon (2004), who also focuses on two general issues, namely the type of genetic evidence
used and the techniques and rationale involved in sampling.
First, no single type of genetic feature is consistently included in these
comparisons: we cannot simply say that all relevant studies compare
‘genes’. Three main types of genetic evidence have been included in
studies of correlations with language, and these are the so-called ‘classical
set’ of genetic polymorphisms, such as the ABO blood groups (see
Cavalli-Sforza et al. 1994); mitochondrial DNA; and Y-chromosome
material. There are other genetic systems which promise to be even
more informative, notably involving microsatellite DNA and repeat sequences
which are unique events and therefore provide excellent markers
for group membership, but these have not yet been applied in interdisciplinary
research. Not all these genetic markers seem to correlate equally
well with linguistics: Poloni et al. (1997) argue that the clearest results
typically come from comparisons with Y-chromosome DNA, which
contains the gene determining maleness and is therefore passed on only
from fathers to sons; and McMahon (2004) surveys a range of studies of
Europe which indicate that prioritizing classical set, Y-chromosome, or
mitochondrial DNA evidence (the last being passed on only through the
female line) can give diVerent results. This might seem to constitute an
open and shut case for rejecting such correlations altogether; but it is
much more likely to indicate that men and women in populations may
sometimes have different histories, providing both a more complex and a
more interesting picture. In turn, this may reflect the higher variability in
male reproductive success, as well as indicating cases of partial migration
or organized intermarriage systems between groups.
Even more important and potentially problematic is the issue of population
sampling (Cavalli-Sforza et al. 1994; Moore 1994). Most genetic
variants predate the geographical break-up of the human species and
therefore diVer between human groups only in relative frequency; it
follows that investigators cannot validly define populations after the
fact on the basis of the variants they do or do not have, but must crucially
define the boundaries of the population in advance in order to be explanatory.
Random sampling on a physical-grid approach might be ideal,
but is both socio-politically and scientiWcally challenging; and so far
sampling has often been on the basis of named, culturally significant
groups, such as villages or ethnic groupings. These are commonly
defined by language affiliation, with unfortunate consequences for the independence of linguistic and genetic data. In addition, the result of such
non-random sampling is that small, disappearing tribal groups characterized
on the basis of their language are often treated as equivalent to
similar-sized samples drawn from large, modern nation states. As
MacEachern (2000: 361) points out ‘the Hadza of Tanzania, with a total
population of about 1,000, occupy the same analytical status in Cavalli-
Sforza et al.’s regional genetic reconstructions [1994] as do the South
Chinese (approximate population 500,000,000) and the French (approximate
population 60,000,000), yet these three ethnonyms define entirely
different types of human population unit’. McMahon (2004: 4) notes that
This approach has provided perfectly acceptable samples for addressing large-scale
questions of human origins, such as the Out of Africa vs Multi-Regional
Hypotheses . . . where only a few representative populations are required from
each continent. However, when we are asking questions about the relationships
between human groups and their languages, to base the sampling criteria in one
domain on data from the other automatically weakens the importance of any
relationships detected.
It could be argued that sampling strategies based on language groupings
might be appropriate for groupings with pre-agricultural social organization;
but powerful evidence against this simplistic assumption is provided
by the extended studies of the Yanomami tribal groups living in the
Amazon basin of South America summarized in Merriweather et al.
(2000). Fission and fusion, intermarriage and warfare amongst the
roughly 150 villages that make up this linguistic group have led to a
situation where several villages are genetically closer to geographically
close but linguistically and culturally distinct groups than they are to
other Yanomami villages. These hunter-gatherer villages have at least as
much evidence of complex interactions as anywhere else, so that a choice
of members from a single village to represent the Yanomami could be as
misleading as choosing a group of Londoners to represent Western
Europeans. We shall touch on issues of evidence and sampling in connection
with the four studies to be discussed below.
5.3.2 Four Specific Studies.
Our first case study is Gray and Jordan (2000), which reports on the use of
unpublished data from Blust’s Comparative Austronesian Dictionary to
construct a phylogeny of the Austronesian languages. The main idea behind this paper was to test two competing hypotheses on the origin and spread
of Austronesian languages and speakers: these are the ‘express-train-to-
Polynesia’ and ‘tangled-bank’ models. Gray and Jordan’s phylogeny was
strongly congruent with the ‘express-train’ model, which is well supported
by archaeology and all three types of molecular genetic evidence, and
assumes a rapid population expansion from an original source population
in Taiwan, with a unidirectional series of population movements covering
the 10,000 km to Polynesia in approximately two millennia. Any possible
contribution of contact is also minimized, since the archaeological culture
of these early Austronesians exploited island coastlines, meaning that where
they did arrive on already populated islands they were unlikely to interact
much with the inland-dwelling prior inhabitants; many other islands would
have been unpopulated. We might anticipate that this sort of history,
involving continual change with clear punctuations as populations split
and move on, and rather little contact, might produce patterns in keeping
with relatively simple models of divergence, like the family tree. Gray and
Jordan’s paper is a paradigm case of the approach advocated in earlier
chapters, where quantitative work can validate existing proposals; here they
provide evidence for a particular tree of the Austronesian languages which
was originally put forward by comparative linguists, but can be shown to be
supported by data from other disciplines.
Our second case study, Poloni et al. (1997), used genetic data mainly
from a single region of the Y-chromosome, in 45 published populations,
and 13 collected by their own group. The total sample included 3,767
individuals, with a worldwide distribution, but some bias in favour
of African and European populations. Poloni et al. demonstrate a
strong correlation between linguistic and genetic distance among their 58
populations; in particular, they identify four essentially non-overlapping
clusters on the basis of members’ genetic characteristics and whether
they spoke an Indo-European, Khoisan, Niger-Congo, or Afro-Asiatic
language.
Our third case, Sokal (1988), again seems to support the existence and
exploitability of correlations between linguistics and genetics; this time
the genetic data studied were ‘classical-set’ autosomal genetic polymorphisms.
Sokal demonstrates significant correlations between languages and
genes across Eurasia using a simple model of linguistic distance, with
languages within a subfamily (such as Romance) being set at 0 distance,
while languages in different subfamilies within a family (such as a Romance and a Germanic language within Indo-European) were set to 1,
and those from different families (such as Turkish, from the putative
Altaic family, and Hungarian from Finno-Ugric) were set at 2. Sokal
carried out Mantel correlation analyses of the genetic-distance matrices,
using several different estimators for genetic distance, against the resulting
linguistic-distance matrix; these were significant for over half of the
genetic loci studied.
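The kind of Mantel analysis involved can be sketched briefly. The matrices below are invented for four populations (linguistic distances on Sokal's 0/1/2 scale, genetic distances made up), and the permutation of population labels is the standard way of attaching a significance level to a correlation between distance matrices; this is a hedged illustration of the technique, not Sokal's own code or data.

```python
# Minimal Mantel test: correlate two distance matrices, permute labels for a p-value.
import numpy as np

def mantel(dist_x, dist_y, permutations=999, seed=None):
    """Return (r, p) for two square symmetric distance matrices."""
    rng = np.random.default_rng(seed)
    n = dist_x.shape[0]
    iu = np.triu_indices(n, k=1)                 # upper-triangle entries only
    x, y = dist_x[iu], dist_y[iu]
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(permutations):
        perm = rng.permutation(n)
        y_perm = dist_y[np.ix_(perm, perm)][iu]  # permute rows and columns together
        if np.corrcoef(x, y_perm)[0, 1] >= r_obs:
            count += 1
    return r_obs, (count + 1) / (permutations + 1)

# Toy data: four populations; linguistic distances on the 0/1/2 scale
ling = np.array([[0, 0, 1, 2],
                 [0, 0, 1, 2],
                 [1, 1, 0, 2],
                 [2, 2, 2, 0]], dtype=float)
gene = np.array([[0.00, 0.02, 0.05, 0.09],
                 [0.02, 0.00, 0.06, 0.08],
                 [0.05, 0.06, 0.00, 0.10],
                 [0.09, 0.08, 0.10, 0.00]])

print(mantel(gene, ling, seed=1))
```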
Although these initial investigations of correlations between linguistic
and genetic evidence look positive, clouds start to gather on the horizon
when we look at a fourth recent paper. Rosser et al. (2000) used by far the
most extensive molecular data set to date, but, perhaps paradoxically,
genetics–linguistics correlations here become rather more elusive. Rosser
et al. studied 11 separate Y-chromosome polymorphisms on 3,616
chromosomes drawn from 47 European populations, and their main
suggestion is that the primary determinant of both the linguistic and the
genetic variation seems to be geography. In other words, variation in
both linguistic and genetic terms relies on the degree of physical distance
between populations. Where the populations compared are on
different continents, so that there is considerable physical distance
between them, we would expect, and indeed find, a good deal of linguistic
and genetic distance too. Exceptional cases of large linguistic and
genetic differences between geographically close populations are often
associated with clearly identifiable local barriers, such as mountain ranges
or stretches of water: for instance, as Rosser et al. (ibid.) note, the Georgian
and Ossetic populations are geographically close, but are genetically
and linguistically distinct, and separated by the Caucasus mountains.
There are consequently two alternative accounts for our linguistics–
genetics correlations. We may have found the real explanatory factor, in
the shape of geography; the apparent correlation between languages and
genes is then revealed as secondary. On the other hand, it might be that
the indubitable effect of geography is not the main, or the only, factor but
is masking a true correlation between genetics and linguistics which
reflects shared population history. One way to reduce the confounding
effect of a third common variable is to use a statistical technique known as
autocorrelation analysis to ‘remove’ the effect of the third variable,
leaving a partial correlation of the other two variables of interest. For a
partial correlation of genes and language with geography held constant,
this amounts to asking what the correlation for language and genes would be for all those populations with the same geographic distance from each
other. This then isolates the relevant component of total variation, revealing
the extent to which a knowledge of the genetic relationships
between populations can be used to infer the relationship between their
languages and vice versa.
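A hedged sketch of that partial-correlation step follows, using the standard first-order partial-correlation formula on invented vectors of pairwise distances; a full partial Mantel test would combine this with the label-permutation machinery sketched earlier to attach a significance level.

```python
# Partial correlation of linguistic and genetic distance, with geography held constant.
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Invented pairwise distances (one entry per population pair)
ling_d = np.array([0.1, 0.4, 0.5, 0.3, 0.6, 0.7])
gene_d = np.array([0.2, 0.5, 0.6, 0.3, 0.7, 0.8])
geo_d  = np.array([100, 400, 500, 250, 600, 900])   # kilometres

print(partial_corr(ling_d, gene_d, geo_d))
```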
From this point of view the main difference between the studies we have
discussed is that Poloni et al. (1997) did not test for the contribution of
geography: they did refer to distance, but of course distance is only one
aspect of geography, since, as we have seen already, populations separated
by the same distance may be more or less close in languages and
genes depending on whether that distance includes a major barrier like a
mountain or sea. On the other hand, Poloni et al. (ibid.) were working
across continents, and most of their largest-distance figures in language
and genes correspond to populations on either side of these geographical
barriers. Sokal (1988) removed the effect of geographic distance and
found a reduced but still significant correlation between language and
genes. Gray and Jordan (2000) are not dealing with populations on
different continents in a political sense, but certainly these populations
are divided by isolating stretches of water. Similarly, in the Cavalli-Sforza
et al. (1988) tree those correlations that seem most convincing and robust
are again those that operate across continents. Rosser et al. (2000) were
most careful in their treatment of geography, since they considered both
local barriers and distance; their conclusion was that, within continents,
geography is by far the greatest explanatory force for genetic distances,
eclipsing the contribution of language as an independent barrier to gene
flow. However, even in their work significant correlations between linguistic
and genetic characteristics of populations were found where samples
include populations on diVerent continents or otherwise separated by
major physical barriers: although Rosser et al. included only European
populations in their main analysis, they did also consider two African
populations, and in comparisons involving these groups the linguistic–
genetic correlations did become significant.

5.3.3 The Contribution of Contact.
It is self-evident that the likelihood of contact and interbreeding is much
lower for populations on different land masses or separated by a major
physical boundary than for adjacent or physically close populations. Indeed, before the development of relatively recent technological
innovations, simple distance even within continents would have correlated
very strongly indeed with the likelihood of contact between members
of different populations. Sewall Wright, whose work in the 1930s led
him to be acknowledged as the father of modern population genetics,
is said to have held that the single most important factor in reducing
the level of inbreeding in human populations was the invention of the
bicycle, since before this the norm was for marriage within five miles of
one’s birthplace, whereas afterwards population admixture quickly became
the rule. Genes in populations do naturally change and diverge; this
is the basis of the fundamental speciation model of isolation by distance.
However, the further apart two populations are geographically, the
greater the divergence is likely to be, because in geographically close
populations interbreeding and consequent admixture will cause genetic
convergence, running counter to the effects of normal divergence. In the
most distant cases we would not find even the very limited amount of
admixture required (in the order of 1 or 2 individuals per generation
(Nei 1987)) to prevent those populations from diverging. It follows that
we should expect to find considerably less genetic distance between
geographically close populations which are not separated by any significant
physical barrier—and if there is anything in the claims of correlation
between genetics and linguistics, we should expect that relatively
small genetic distance to be paralleled by less linguistic distance. Of course,
these distinctions are all more difficult to observe in studies sampling
only modern, mobile groups, since technological innovation has led
to a greater likelihood of interbreeding between even the most distant
populations.
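As an aside not drawn from the passage above, the figure of one or two individuals per generation can be made concrete with a standard textbook approximation, Wright's island model, under which equilibrium differentiation between populations is roughly

$$F_{ST} \approx \frac{1}{4Nm + 1}$$

where $N$ is the effective population size and $m$ the migration rate, so that $Nm$ is the number of migrants exchanged per generation. With $Nm$ around 1 the expected differentiation is only about 0.2, and it falls quickly as $Nm$ grows, which is why such small amounts of admixture are enough to hold divergence in check.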
These expectations are supported by our knowledge that contact between
two populations does not only have the genetic effect caused by
interbreeding. Contact is also possible at a linguistic level, and has its own
consequences there (see Thomason 2001; Ch. 3 above). Depending on the
intensity of contact, and on other imponderables like language attitudes,
prestige, and so on, these effects may range from the occasional, nativized
lexical item to wholesale structural borrowing, convergence, pidginization
and creolization, language mixing, and the like. And just as interbreeding
was less likely, at least until relatively recently, for geographically distant
populations, so language contact might be expected to be less intense the
further apart two speech communities are. If neither genetic nor linguistic mixing takes place to any great extent
between populations on diVerent continents, or with a major physical
boundary separating them, then one can well understand why the correlation
between the two types of evidence seems relatively strong for populations
under these circumstances: greatest genetic distance equals greatest
linguistic distance. However, where populations are geographically close,
with no intervening physical barrier, one would equally expect increased
similarity at the genetic level to be mirrored in increased linguistic similarity;
and here we have a paradox, because within a continental mass Rosser
et al. (2000) suggest that the correlations are less significant. In other
words, where populations are geographically adjacent one would expect
recent history, and its consequences in terms of admixture, to blur to an
equivalent degree any more distant historical relationships in both genetics
and linguistics. What seems to happen, however, is that some of these
geographically close populations remain more distant linguistically than
would be anticipated given the probability of recent contact: here, the
expected correlation with genetics is disturbed. For example, in northern
populations of Europe a particular Y-chromosomal haplotype has been
associated with the expansion of the Ugric-speaking peoples along an
eastern to western axis (Zerjal et al. 1997). Rosser et al. (2000) identify
this particular haplotype (HG16) in all the Finno-Ugric speaking populations,
but also in the adjacent Indo-European-speaking Lithuanian and
Latvian populations. However, although the historical spread and current
distribution of this haplotype is an excellent example of genetic contact
and diffusion, it is happening across the most significant language-family
boundary in the region, between Indo-European and Finno-Ugric.
There are two approaches to interpreting this apparent paradox. Either
contact between populations does not have any linguistic consequences—
or only very minor ones. Or contact-induced change is going on all right in
both linguistics and genetics, but linguists and geneticists handle admixture
in very diVerent ways. It does not even seem worth testing the hypothesis
that language contact does not happen—putting it at its most cartoonishly
simple, there are simply many more opportunities for conversation than
for interbreeding, especially where the latter must end up in the production
of viable offspring if the genetic profile of a population is to be affected by
admixture. But there has certainly been a long-standing tendency in comparative
linguistics to marginalize or exclude contact-induced changes, as
we have seen in earlier chapters. Contact-induced changes are problematic: they can lead to erroneous hypotheses in terms of family-tree construction,
and to false steps in reconstruction. And if our priority is the construction
of linguistic family trees, it is only natural that we should attempt to remove
the effects of changes which are out of keeping with the tree model, whether
by pre-selecting basic vocabulary lists, which should be relatively resistant
to contact, or by excluding languages with non-tree-like histories, which
Thomason and Kaufman (1988) describe as ‘non-genetic’. In fact, the kind
of discrepancy that arises can be illustrated quite straightforwardly by
considering the case of French. In all our PHYLIP trees discussed in
Chapter 4, French falls squarely and consistently inside Romance.
However, we can also construct a genetic tree for the French population—
admittedly a highly idealized concept, taking into account the concerns
about sampling expressed in 5.3.1 above. Our tree, which covers a range
of European populations, is based on average genetic distance for 88
‘classical-set’ genetic polymorphisms from Cavalli-Sforza et al. (1994),
and appears as Figure 5.2. It also shows that French falls clearly inside
Germanic, producing a complete lack of parity, in this case, between
linguistic and genetic trees.
This disparity arises, then, because if historical linguists can exclude
borrowings they will—and they will certainly prioritize data which seem
less amenable to external influence. This, however, is exactly what geneticists
do not do. The tendency in population genetics has been to
recognize and accept migration, and its genetic consequences, and there
is a significant history of attempts to provide measures of interpopulation
exchange, and indeed models of how this might happen, and the extent of
its effects, under particular circumstances. This is all part of quantitative
work in genetics which aims to calculate equilibrium gene frequencies and
levels of variability, and to assess the contribution of the different forces
affecting populations, namely mutation, migration, drift, and selection.
Given this discrepancy in practice between linguistics and genetics, our
hypothesis is that the actions of linguists in denying, downplaying, or
attempting to screen out the effects of borrowing may have created the
appearance of non-significance in the correlation between linguistic and
genetic variation for certain populations within continental land masses.
The exclusion of borrowings will automatically prioritize and emphasize
data indicating common ancestry and earlier history for the linguistic
systems concerned, while the genetic systems for the same populations
will also include any more recent innovations due to contact and admixture. This would create an obvious mismatch between those systems,
which would then appear less comparable in geographically adjacent
populations, disrupting the overall correlation between linguistics and
genetics. But this discrepancy has not arisen because the histories are
different—it has arisen because the histories are the same, but half the
linguistic history is being analysed out! That is, our linguistic methodologies
attempt to exclude contact-induced changes, and this conspires
against the recognition of parallels between linguistics and genetics.
Matters get even worse, though, when we consider that last sentence in
more detail—our linguistic methodologies attempt to exclude contact-induced
changes, yes; but we know that they do not always succeed. Basic
meaning lists, as we showed in Chapter 4, have on average 12.3% loans (at
least for the five languages we sampled); and even where we are dealing with
well-attested and intensively studied languages, as in the DKB database for Indo-European, and agreed mechanisms for marking and filtering out
loans, we have seen that errors can inevitably be identified at some level.
This presents even more of a problem, because it means we cannot even rely
on the disparity between linguistic and genetic trees being a consistent one:
the degree of mismatch we find will depend on how many loans linguists
have missed (for instance, because of gaps in their knowledge of certain
systems), and that is hardly a factor amenable to statistical modelling.
What our methods offer is a way of avoiding the problems arising from
these inconsistencies of practice. We can retain all the data, loans and all,
for ‘new-synthesis’ type work where we are undertaking cross-disciplinary
comparison and thinking about population histories; but we can then
identify the borrowings and exclude them later for purely linguistic
work. In Kessler’s terms, we can retain a database reflecting ‘historical
connectedness’ for comparison with genetics or archaeology, but prioritize
true cognates when we are drawing our family trees. If our methods
can reliably exclude loans, we may not even in principle be restricted to
Swadesh-type lists in future, since we will be able to assess the different
contributions of different meanings by rerunning our programs and isolating
which meanings are contributing to shifts of languages between runs.

5.4 Looking Forward.
We ended the last section with a bright prospect; but there are two
outstanding issues to be considered before we leave the topic of linguistic–
genetic correlations. First, identifying some apparent disparities
between linguistic and genetic trees does not remove the possibility that
there may also be real ones: we noted at the outset that comparing these
two independent systems would always be a complex undertaking, not a
simple one-to-one match, and population movements might have highly
significant, but variable, effects on correlations between genes and language.
Large-scale directed migration might result in the total replacement
of the resident population and their language; or the ‘newcomers’
may join with the local population, forming a composite group speaking
the new language. If the number of incomers is small, they may be
amalgamated into the resident population and learn to speak the original
language of the area, leaving only a genetic signal in the resulting population.
Alternatively, an invading elite may generate an effective cultural change, including resultant replacement of the local language, without
significant genetic influence.
In other words, there is no intrinsic reason why genes and culture
should show identical lines of descent. Indeed, in areas of Australia native
languages appear to be attached more to a particular geographic locality
than to any particular resident group of humans: individual Aboriginal
tribes in the area are multilingual and speak the language appropriate to
their physical position in the landscape (David Nash, personal communication).
Thus, as McMahon (2004: 4) notes, ‘one extreme possibility for
language replacement would be for a language to no longer be spoken as
a first language by any single group, but rather be used by two genetically
distinct tribes whose ranges overlap where that language was originally
spoken by a now extinct third tribe which shared little genetically with
either group of current speakers’. Our methods offer the prospect of
unearthing real correlations between linguistic and genetic features in
cases where earlier differences in disciplinary practice have obscured
them: but they can do nothing to resolve those cases where the correlations
really do break down, and we must accept that these exist too.
Finally, there is a further question of representation. Our work in
Chapter 4 and the investigations reported here have been based on
family-tree models; and yet those are by no means universally accepted
for language. Dixon (1997) argues that there are areas of the world,
notably Australia, and perhaps periods of equilibrium for other language
groups, where convergence will be more important than divergence, and
the tree offers an inappropriate model. Thomason and Kaufman (1988)
see pidgins, creoles, and mixed languages as non-genetic, and therefore as
intrinsically incompatible with the family tree. If we are serious about
rehabilitating contact-induced change, and want to be able to account for
both aspects of Kessler’s ‘historical connectedness’ (2001), then our concentration
on trees is problematic.
On the one hand, there will be situations and language groups for
which the tree is a wholly appropriate model: it certainly has the advantage
of familiarity, clarity of representation, and a built-in diachronic
aspect through its vertical dimension. If we have methods which can
isolate features arising from contact and exclude those, then arguably
for many languages we have a better case than ever for using trees. On the
other hand, how are we to represent relationships between languages at
the stage before we exclude contact; or in cases where we specifically want to focus on the contribution of contact; or in situations where we are
carrying out analyses to assess what the contribution of contact might be?
There is something inherently unsettling about using tree-drawing and
tree-selection programs specifically to isolate features and changes incompatible
with trees, as with the shifts of English, Frisian, and Romanian in
the hihi versus lolo trees in Chapter 4. However, if we have learned one
thing so far, it is that biology, and speciWcally population genetics, has
many of the same potential problems as comparative linguistics; and,
moreover, that many of these problems have already been successfully
confronted. It is therefore unsurprising to find alternative programs,
beginning with Network (Bandelt et al. 1995; Bandelt, Forster, and
Röhl 1999), which allow the representation of features arising through
both common ancestry and contact; and in the next chapter we turn to an
investigation of network representations for language.

6. Climbing Down from the Trees: Network Models.
6.1 Network Representations in Biology.
6.1.1 Problems with Trees.
The fundamental problems with family trees are the degree of idealization
they necessitate and their essential incompatibility with the forces of
contact-induced change which, as we have been arguing throughout this
book, are as important for at least some languages as descent with
differentiation from a common ancestor. Some languages will have an
essentially tree-like history, while others are primarily contact languages.
Historical linguists, with their propensity for designing opposing
methods, might suggest the tree model for one extreme and the wave
model for the other: but this does not get around the problem that most
languages will occupy some position on the cline between these two end
points. We can foresee long and unproductive struggles over deciding
when each model is to be used, missing crucial bits of data for each system
in the process, whereas in a perfect world what we need is a single model
which could sort out for us how much of a language is tree-like and how
much non-tree-like, and display the two driving forces, and resulting
language features, diVerently.
Fortunately, we do not need to wait for a perfect world to find such a
model: it is already under development in biology. As Pagel (2000: 190–1)
notes:
linguists should bear in mind that the glaring, even embarrassing, exceptions
are not confined to linguistic evolution. Thus, biological evolution witnesses
horizontal transmission of genetic information just as words are borrowed horizontally between languages . . . Evolutionary biology’s response to these phenomena
has been to develop, among many other methods, more sophisticated
techniques for detecting gene transfer, identifying convergence, and measuring
rates of evolution.
We shall return to the vexed question of rates of change in the next
chapter, but detecting and displaying gene transfer and convergence are
directly relevant to the analysis of contact-induced change in linguistics,
and both are tackled in computer programs based on networks.
Network models, at first glance, seem just too good to be true. Bryant,
Filimon, and Gray (in preparation: 2) suggest that what we need in
dealing with population histories is:
an analytic approach that enables us to assess where on the continuum between a
pure tree and a totally tangled network any particular case may lie. More
specifically, this approach should be able both to identify the particular populations
where admixture has occurred and detail the exact characters that were
borrowed.
Network representations can indeed achieve these goals; but to understand
how, we must return to their origin, in dealing with molecular
genetic data at the level of the individual.
6.1.2 Networks in Genetics.
The original Network program (Bandelt et al. 1995; Bandelt, Forster, and
Röhl 1999; Forster et al. 2001; <http://www.fluxus-engineering.com>,
accessed March 2005) was initially developed to deal with cases where a
particular genetic sequence has more than one possible history. If there is
more than one possible history, then there is more than one possible tree.
Network both analyses and represents this ambiguity by collapsing the
alternative possible trees into a single network graph. For parts of the
sequence where there is only one possible history the diagram will look
tree-like; but where there are multiple possible histories the program
draws a reticulation, or a box shape, to indicate that the data are compatible
with more than one tree structure. An example is given in
Figure 6.1, and discussed immediately below.
Figure 6.1 has two parts. The first is a sequence of 12 bases (each base
being an A, C, G, or T) of mitochondrial DNA for six molecules A–F.
A represents the common ancestral state, and the differences between this and the other five molecules are shown in bold: these are state changes, or
mutations. Network is essentially a cladistic analysis (Ridley 1986; Page
and Holmes 1998; Skelton and Smith 2002; McMahon and McMahon
forthcoming b), which means that only the mutations are relevant in
constructing the tree: unchanging characters are ignored, as they are
uninformative.
The graph derived from this set of sequences is the other half of
Figure 6.1. It shows the six molecules A–F, and is predominantly tree-like,
with the exception of the reticulation joining molecules D, E, and F.
This reticulation reflects the fact that F shares a mutation at base 5 with D, and a mutation at base 9 with E. But because these molecules are
mitochondrial DNA, which can be inherited only through the female line
and therefore from an individual’s mother, F cannot be the direct descendant
of both D and E. The problem, in other words, is that these facts
are compatible with two possible trees: either the mutation at base 5
happened in D and was inherited into F, giving the tree in Figure 6.2a,
or base 9 changed in E and was inherited into F, giving the tree in
Figure 6.2b.
In either case one mutation happened once and is then inherited; and
the other base has been affected by two independent mutations (the
second in each case marked with an asterisk). If Figure 6.2(a) is right,
both E and F have independently experienced mutations in base 9, and if
Figure 6.2(b) is right, then both D and F have independent mutations in
base 5. It is not possible in principle to sort out the actual order of
branching from the data we have, so Network simply records the ambiguity
in the reticulation it draws. The method respects the assumption,
common also in linguistic family trees, that there can be only one direct
ancestor in each case, but also signals the fact that we can interpret the
available data as pointing to two possible candidate common ancestors,
and invoke an alternative process, here independent mutation, for the
remainder of the data.
In the case of these molecular data we are always dealing with a choice
for each state change between ancestral mutation and inheritance, or
independent, spontaneous mutation. A reticulation means we can’t decide
which happened: either is possible. What cannot be going on at this
individual level is any kind of borrowing from an unrelated individual:
remember that the mitochondrial DNA can only be inherited from the mother, or mutate in situ. However, when we are dealing with autosomal
DNA (not mitochondrial or Y-chromosome material, but the majority of
genes, which are inherited in two copies, one from each parent) common
states can again arise from common ancestry or from independent,
multiple mutation, but also from recombination. In this case, we need
to envisage the three-generation series of events shown in Figure 6.3.
In the Wrst generation the mother and father each have two copies of a
particular gene sequence, which they have in turn inherited from their
parents. At generation 2 they each bequeath one copy of their own gene
sequence to their child, who receives two ‘pure’ versions of that chromosome,
one from each parent. However, in this second-generation individual
a process of recombination occurs, such that the ‘pure’ genetic sequences inherited from his/her father and mother are reshuffled together,
giving two mixed molecules. One of these mixed chromosomes is
then, at generation 3, passed on to our individual’s child, who will
therefore carry as one of his/her copies of that chromosome a mixed
sequence, containing bases from the grandmother, mixed with other
bases ‘borrowed’ from the grandfather.
In this case too Network would draw a reticulation. This time the
reticulation does not mean the chromosome in question has two possible
histories, one involving inheritance and the other independent mutation,
and that we cannot choose between them. Instead, this reticulation shows
that the inherited ‘mixed’ chromosome has two actual histories at the
same time: parts of it come from two different ancestors. In both these
examples note that we do not spend time fretting about whether we should
employ a tree-drawing or a network-drawing program: although Network,
naturally enough, draws networks where they are appropriate,
cases where there has been neither ‘borrowing’ (recombination) nor parallel
development (the same mutation independently twice or more, also
known as homoplasy) will automatically be represented with the most
likely tree. That is, the program involved draws a tree when the relationships
are clear and tree-like, and a more complex network when the
connections are more complex or ambiguous and show more interaction.
6.1.3 Split Decomposition.
The option of a program which generates trees or networks depending on
the data, or, more accurately, which generates trees interrupted where
appropriate by reticulations, is clearly of interest for linguistic data too;
but before we turn to these applications we should first say a little more
about exactly how the data are analysed by Network. The central process
here is split decomposition (Bandelt and Dress 1992), a technique for
dividing data into natural groups.
Split decomposition for this kind of Network analysis, based on different
states, involves three steps (adapted from the excellent account in
Bryant, Filimon, and Gray, in preparation).
First, we need to identify something to count, and then count it. In the
case of our sequences in Figure 6.1 we are counting the number of
molecules with each state: so, we count 6 with the value, or state, A at
position 1; 5 with state G at position 2 and 1 with state C at position 2; 1 with state A and 5 with state T at position 3; 1 with state C and 5 with
state G at position 4; 4 with state A and 2 with state C at position 5; and
so on.
Second, we then work out the splits generated by this collection of
values. This is achieved by figuring out the minimum number of changes
of state between each pair of sequences. So, again for the data in
Figure 6.1, position 1 tells us nothing at all: there are no changes of
state, since all 6 molecules share the same state, and we cannot therefore
use this site to split the data. Our first split emerges at position 2, where
sequence A is the only one with state C, splitting it effectively from the
other sequences, which all share state G. At position 3 we have a further
split of C from all the other sequences; and at position 5, D and F are
separated from all the rest.
The third and final stage involves plotting the resulting splits; this is the
role of the Network program. Network will draw branch lengths depending
on the number of state changes separating particular nodes, and will,
as we have seen, incorporate reticulations where there is more than one
possible source for a derived state, giving the diagram in Figure 6.1. So,
for instance, at position 5 Network will make a split between D and F on
the one hand, and all the other sequences on the other; but at position 9 it
encounters data which seem to force a split of E and F as against the rest.
Clearly, these two splits are incompatible: they cannot be displayed on the
same tree, unless we avail ourselves of the possibility of using reticulations
to collapse the two possible trees into the same superordinate graph.
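To make the three steps concrete, a short sketch in Python of the counting and splitting stages is given below. The six sequences are a hypothetical reconstruction consistent with the description of Figure 6.1 just given (six molecules, with informative sites at positions 2, 3, 4, 5, and 9), not the actual figure data, and the simple pairwise compatibility test at the end shows only the schematic core of the procedure, not the full Bandelt and Dress (1992) computation.

from collections import defaultdict

# Hypothetical stand-ins for the six sequences of Figure 6.1, consistent with the
# state counts described in the text; the assignment of states to particular
# sequences at positions 4 and 6-8 is an assumption made for illustration.
seqs = {
    "A": "ACTGAGGGG",
    "B": "AGTCAGGGG",
    "C": "AGAGAGGGG",
    "D": "AGTGCGGGG",
    "E": "AGTGAGGGT",
    "F": "AGTGCGGGT",
}

def splits_per_site(seqs):
    """Steps 1 and 2: count the states at each position and turn them into splits."""
    length = len(next(iter(seqs.values())))
    informative = []
    for pos in range(length):
        groups = defaultdict(set)
        for name, seq in seqs.items():
            groups[seq[pos]].add(name)
        if len(groups) > 1:                      # sites with a single state tell us nothing
            informative.append((pos + 1, [frozenset(g) for g in groups.values()]))
    return informative

def compatible(s1, s2, taxa):
    """Two splits fit on one tree iff at least one of their four intersections is empty."""
    return any(len(x & y) == 0
               for x in (s1, taxa - s1)
               for y in (s2, taxa - s2))

taxa = frozenset(seqs)
sites = splits_per_site(seqs)
for pos, groups in sites:
    print("position", pos, "splits off", sorted(min(groups, key=len)))

# Step 3 in outline: incompatible splits are the ones Network must show as reticulations.
binary = [(pos, min(groups, key=len)) for pos, groups in sites]
for i, (p1, s1) in enumerate(binary):
    for p2, s2 in binary[i + 1:]:
        if not compatible(s1, s2, taxa):
            print("positions", p1, "and", p2, "conflict: a reticulation is needed")

Run on these sequences, the sketch reports that positions 5 and 9 conflict, which is exactly the situation the text describes for the split of D and F against the split of E and F.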

6. 2 Applying Network to Linguistic Data.
6. 2. 1 Comparing Linguistic and Biological Data.
In extending Network to language data it is first important to show how
those data can be seen as comparable to the biological sequence data we
have been considering so far. A meaning list (our 200-item Swadesh list;
or our reduced, more conservative (hihi) or more changeable (lolo) sublists)
can be regarded as essentially equivalent to a chromosomal sequence
of bases. As noted in Chapter 4, the real data we are interested in for
quantitative lexical-list comparisons are not the individual lexical items
themselves, with their individual and highly language-specific shapes.
What we need is to convert those items into states, comparable with our A, C, G, and T base labels. The Dyen, Kruskal, and Black (1992)
database we have been using incorporates this step, since for each meaning
a list of states is provided, with each numerical code signalling lexical
items that are cognate, or borrowed, or missing, or unique (see Table 6.1).
In Table 6.1, 0 is used to indicate missing data, and codes between 030
and 050 mark unique states and borrowed items. Other numbers group
lexical items deemed to have arisen from the same item in the common
ancestor: 003, or 401, or 200, in other words, mark cognate sets. Finally,
in the case of borrowed items, the bracketed code following indicates the
class the item would belong to if it were mistakenly classed as a cognate
rather than as a loan. This additional bracketed code therefore also
indicates the likely source of the borrowed item.
As Table 6.1 shows, the equivalent of each sequence in our genetic data
is the coded list for a particular language or variety. If for a particular
position (or, in linguistic terms, meaning slot) we have a consistent value
through all the languages of say 200, then we have an uninformative site,
where every list retains a cognate, as in Meaning 109 in Table 6.1. If we
have more than one cognate class, then Network will insert a split at the
appropriate point, as between 003 and 004 for meaning 001. If there is a
borrowing, as with 031 for meaning 003, then this is the equivalent of one
of our recombined elements from Figure 6.3, where the common ancestral
signal for the language list as a whole includes elements introduced
from elsewhere by ‘mixing’. In such cases, employing a special code from
the range 030–050 instructs the program to ignore the item in question.
On the other hand, if we mistakenly coded animal in English as 401, cognate with the forms in the Romance languages, we would expect a split between English and Danish, and therefore a reticulation, since this character will be incompatible with the tree predicted by the overall
pattern of cognacy for these lists.
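A small Python sketch may help to make these coding conventions concrete. The code numbers follow the conventions just described for Table 6.1 (0 for missing data, 030–050 for unique states and borrowings, other numbers for cognate classes), but the class number assigned to Danish below is simply an assumption for the sake of the example; the point is how such codes become character states, with correctly coded borrowings dropping out of the cognacy signal.

def code_to_state(code):
    """Translate one Dyen, Kruskal, and Black-style code into a character state."""
    if code == 0:            # missing data
        return None
    if 30 <= code <= 50:     # unique state or borrowing: ignored for cognacy
        return None
    return code              # otherwise the code labels a cognate class

# Hypothetical coding for the meaning 'animal' in four languages; English animal is
# a Romance loan and so carries a code in the 030-050 range.
coded_animal = {
    "English": 31,     # borrowing (the database adds a bracketed code showing its likely source)
    "Danish": 200,     # native cognate class; the class number here is invented
    "French": 401,     # Romance cognate class
    "Italian": 401,
}

states = {language: code_to_state(code) for language, code in coded_animal.items()}
print(states)
# {'English': None, 'Danish': 200, 'French': 401, 'Italian': 401}
# French and Italian share a state, so this meaning supports a Romance grouping;
# English, correctly coded as a loan, contributes no split at all.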

6. 2. 2 Network and Borrowing: Simulated Data.
Borrowing, of course, must be the prime candidate for the type of effect
we would hope to detect using programs like Network, so it seems
appropriate to begin with an application to a situation of this kind. In
Chapter 4 we considered simulated data for the ‘real’ history of a family,
with higher and lower degrees of mutation (corresponding to our lolo and
hihi data for Indo-European), and with different degrees of borrowing.
At each stage of the simulation trees were plotted using PHYLIP
(Felsenstein 2001). Let us see what happens when we apply Network to
these simulated data instead.
The idea of using simulated data, as for the tree analyses in Chapter 4,
is that if borrowing of particular types and intensities creates a typical
signal in Network, we can look for just that signal, or variants of it, when
we apply Network in real cases. Figure 6.4 shows a network for our
simulated hihi list. We selected the characters to include here by starting
from a full 200-item simulated list with a variable mutation rate, then choosing the 25 items which were changing most slowly. These least
changeable, most conservative characters are clearly the closest analogue
to our real hihi sublist. For the full list we set borrowing from language B
to language A at 10%, but the fact that we derive such a straightforward
tree in Figure 6.4 suggests that in this sublist there has probably been no
borrowing at all from B to A.
For comparison, Figure 6.5 shows our simulated version of the lolo list,
which is set to change twice as fast. Here, it is evident that Network has had
insuperable problems in constructing a single tree, since there are reticulations
towards the root of the tree, and these are clearly linking languages A
and B, which fall on opposite sides of the group of reticulations.
These differently shaped graphs may look convincing: we certainly get
a different signal for cases where the rate of borrowing is likely to be
higher as opposed to cases where we would expect less borrowing, since
the mutation rate underlying Figure 6.4 is lower than that for Figure 6.5. However, so far this is essentially circumstantial: we have a plausible account of the differences, but no real evidence that it is the right one.
After all, we have seen that for biological data the same patterns of
reticulation can reflect independent mutation as well as recombination.
We could opt to rerun Network multiple times, bootstrapping the graphs
by assessing how often we get the same picture from the same data, or the
same data minus individual items on each run; this is a good way of
testing which items are actually causing any discrepancies between the
two Wgures. However, bootstrapping is a particularly time-consuming
process, and a further useful property of Network provides us with a
convenient short cut. Not only does Network variably construct trees or
more complex graphs depending on the complexity of the relationships in
the data, it also accompanies each graph with a list of the data points
which are most difficult to reconcile with the tree—in other words, those
which are behaving in the most non-tree-like way.
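The principle behind this list of problem characters can be sketched in a few lines of Python. The reference tree and the characters below are invented for illustration, and Network's own computation is of course more sophisticated; the point is simply that a character whose split cannot be reconciled with the splits of the displayed tree is flagged for inspection.

taxa = frozenset("ABCDEF")

# Splits (clades) of a hypothetical reference tree ((A,B),(C,(D,(E,F)))).
tree_splits = [frozenset("AB"), frozenset("EF"), frozenset("DEF"), frozenset("CDEF")]

# Each character is summarized by the set of taxa sharing its derived state;
# these three characters are invented examples.
characters = {
    "meaning_012": frozenset("AB"),   # fits the tree
    "meaning_047": frozenset("DF"),   # conflicts with the (E,F) grouping, e.g. a loan
    "meaning_103": frozenset("EF"),   # fits the tree
}

def compatible(s1, s2, taxa):
    """Standard split-compatibility test: some intersection cell must be empty."""
    return any(len(x & y) == 0
               for x in (s1, taxa - s1)
               for y in (s2, taxa - s2))

problem_characters = [name for name, split in characters.items()
                      if not all(compatible(split, ts, taxa) for ts in tree_splits)]
print(problem_characters)
# ['meaning_047'] -- the non-tree-like character, to be handed to a linguist for checking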
When we unpack the Network program and access these lists of
non-tree-like characters we find that there are a few inconsistent items for
Figure 6.4. However, none of these is a borrowing from language B to A.
On the other hand, of the 25 items in the simulated lolo list, where we
would anticipate that borrowing should be more common, we find four
cases of loans from B into A. Four loans out of 25 items does represent
more than 10% borrowing; but our 10% setting was for the list as a whole,
and if these 25 items are the ones changing most rapidly we would expect
(as for the real lolo data) that borrowing would be particularly concentrated
in this sublist. Three of these borrowings are included in the reticulations
towards the root of the tree, and the complex pattern of
reticulations here indicates that some of these items are also shared by
the sisters of A or B, leading to multiplex links between languages. The
fourth item has a shared state, by sheer accident, with one language outside
the branches for A and B, and therefore does not give a signal leading to an
A–B reticulation, though it does appear in the list of problematic items.

6. 2. 3 Network and Borrowing: Real Data.
Turning to an equivalent case for real rather than simulated data,
Figure 6.6 shows the output of Network for the hihi, most conservative
sublist for Romance and Germanic. Since the mutation rate is known to
be relatively low for these items, and we have already established that
within Germanic at least none of the known borrowings appears in this sublist (Embleton 1986; Ch. 4 above), it is perhaps unsurprising but
helpfully affirming that Figure 6.6 is highly tree-like. There is a single
reticulation within Romance, reflecting the amount of interborrowing we
know has taken place historically across the Romance group, and some
reticulations towards the root, but these reflect uncertainties in the bigger
Indo-European picture, not relationships specifically between Romance
and Germanic.
Figure 6.7 shows the graph for Romance and Germanic with the least
conservative, lolo sublist, and has considerably more reticulations, especially
at the root of the tree and within Romance.
However, we might also expect Figure 6.7 to incorporate reticulations
for Germanic, since we know for a fact that there are interborrowings within Germanic in this lolo sublist: recall from Chapter 4 that Embleton
(ibid.) lists ‘wing’, ‘left (hand)’, ‘to pull’, ‘to push’, ‘river’, and ‘to throw’
as falling into this category, and all are included in our least conservative
meanings. Yet there are no reticulations in either Figure 6.6 or Figure 6.7
for Germanic.
There are two responses we can make to this. First, the extent to which
Network will display reticulations depends crucially on the setting of an
internal parameter, epsilon (Bandelt, Forster, and Röhl 1999). The value for
this epsilon parameter determines the sensitivity of Network to conflicting
signals in the data, and therefore sets the number of reticulations
which will be visualized. Where epsilon is low, Network will tend not to
display groupings with low support (in other words, links that involve a
small number of characters) as reticulations, but will show new mutations
on the relevant branches of the tree instead. Where epsilon is high,
Network will attempt to show any connection as a reticulation, though
this can lead to particularly complex, multidimensional graphs, in which
the signal is arguably impossible to disentangle from the noise. We can see
this eVect by comparing Figure 6.7, where epsilon is set low, with
Figure 6.8, where it is considerably higher (< 1 versus 2 respectively). Figure 6.8(a) shows reticulations for Germanic in abundance. But the
problem is that there are so many of them, both between and within
groups, that the Network is almost impossible to interpret. We have a
choice, then, between two types of output from Network. In Figure 6.7
we do not see the reticulations that signal contact in every case, but can
check that Network has in fact experienced difficulty in reconciling data with the displayed tree, and that these data are in fact known borrowings,
by accessing the list of problem characters which Network generates
automatically. For Figure 6.7 the problem items do include all those
items which are (i) borrowed from one Germanic language to another,
and (ii) miscoded in the Dyen, Kruskal, and Black database as cognates
(see Ch. 4 above). Alternatively, we can visualize an increased number of
reticulations as in Figure 6.8, but will still have to sift through the list of
problem cases to access the linguistic reasons for each link in the graph.
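The trade-off that epsilon controls can be caricatured in a few lines of Python. This is emphatically not the median-joining computation of Bandelt, Forster, and Röhl (1999): the support counts, the threshold, and the named groupings are all invented. The sketch only illustrates the behaviour described above, whereby a higher tolerance lets more weakly supported conflicting groupings survive to be drawn as reticulations.

# Candidate groupings, each with an invented count of supporting characters and a
# flag marking whether it conflicts with the best-supported tree.
candidate_splits = [
    ("West Germanic core", 12, False),
    ("North Germanic core", 10, False),
    ("English with North Germanic", 2, True),          # weak, conflicting signal (loans)
    ("Frisian with Dutch/Flemish/Afrikaans", 1, True),
]

SUPPORT_NEEDED = 3   # hypothetical support required before a grouping is drawn at all

def visualized_reticulations(splits, epsilon):
    """Higher epsilon lets weaker conflicting groupings through as reticulations."""
    return [name for name, support, conflicts in splits
            if conflicts and support + epsilon >= SUPPORT_NEEDED]

print(visualized_reticulations(candidate_splits, epsilon=0))
# [] -- low setting: weak conflicts are absorbed into the tree as extra mutations
print(visualized_reticulations(candidate_splits, epsilon=2))
# both conflicting groupings are now displayed, at the cost of a busier graph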
However, it is also worth noting that Network, even with epsilon set
low, shows a particularly clear and sensitive reaction to our undiagnosed
loans within Germanic. Specifically, in the most conservative graph in
Figure 6.6 English is contained in a cluster with German, and is squarely
within West Germanic, while in the least conservative graph in Figure 6.7
English has shifted altogether into North Germanic. Likewise, Frisian, in
the same least conservative tree, is clustered with Dutch/Flemish/Afrikaans.
Although we are not seeing reticulations in these cases, we do find
that subgroups shift within the trees depending on the presence of loans in
the data. By way of illustration, Figure 6.9 shows the output of Network
with two different codings for the single item ‘wing’, which is erroneously
coded in the Dyen, Kruskal, and Black database as cognate between
North Germanic and English, although we know in fact that this is a
loan into English.
In Figure 6.9(a), English appears quite clearly as a North Germanic
language. In Figure 6.9(b), however, the recoding of that single item as a loan means English appears outside the North Germanic branch. Indeed,
in Figure 6.9(b) English falls outside Germanic altogether, due to the
influence of borrowings from Romance. These have been entirely appropriately
coded as borrowings, or unique items, in the database, but the
cumulative effect of all these unique states is to distance English from
the other Germanic languages which do not share them. It is notable that
the coding of even a single item can have such a powerful effect on the
structure of the graph.
Clearly, further consideration has to be given to the interpretation of
different network patterns, and to the most appropriate settings for epsilon.
In any case, we cannot see these programs (just as we argued for the
tree generation and selection metrics in Chapter 4) as standing alone: the
programs can help us visualize the issues in the data, and focus on the
problematic data points, but we still need linguists with detailed knowledge
of the languages in question to sort out the real significance of each
point. This will also help with a final problematic aspect of Network which
we have already encountered for biological data. In biology the presence
of a reticulation need not always mean recombination, the closest analogue
for individual DNA data of linguistic borrowing. Reticulations can
also signal convergent evolution (homoplasy), where the same pattern has
arisen more than once by chance; or shared retentions from the common
ancestral form which are maintained in certain cases but lost in others. Of
course, parallel changes and shared retentions are not unknown in historical
linguistics either, so that expert linguistic knowledge is invaluable in
sorting out which of the problem cases can be ascribed to either of these
less common causes, and which are more likely to reflect contact. In our
simulation work we have undertaken a rather partial, indirect test of the
eVect of shared retentions. As noted in Chapter 4, we found that trees
drawn on the basis of simulated data began to be disrupted only at a rate of
5% borrowing, and then only on 15% of runs; more commonly, disruption
of tree structure was observed at 10% borrowing. It seems unlikely that we
should find as many as 5% shared retentions, let alone 10%; this will
depend on the histories of the languages concerned, but in our simulations
we found a maximum of 2% shared retentions, with an average of around
0.9%. Nonetheless, this is a further indication that linguists will still have
to consider the problem data points generated by Network carefully to
ensure that we are not over-interpreting the existence of borrowing where
other factors may be responsible.

6. 3 Distance-based Network Methods.
6. 3. 1 Distance-based Versus Character-based Approaches.
Network clearly offers an interesting range of possibilities for representing
and interpreting conflicting signals in data sets, signals which for linguistic
data may indicate borrowing. We have shown that tests of the method
on Indo-European data provide support for Network, which generates
meaningfully different graphs for sublists including more and fewer signals
of contact, and effectively isolates the items responsible in the form of
reticulations, lists of problem characters, or both. It would seem appropriate
to assess whether we can now apply Network to cases where we are
not so sure about the linguistic history, to see whether we can reach some
clarity on the basis of patterns we have observed for known histories.
However, before going on to this next step we should report some recent
advances in network methodology.
All the illustrations we have provided so far have involved the application
of Network, and of the underlying technique of split decomposition,
to character data. But this brings inevitable limitations. In biological
applications, character-based approaches are applicable only at the level
of the individual, in comparing particular molecules, although there exists
a much clearer analogue for contact-induced change in linguistics at the
level of biological populations. If we were to apply Network to the
relationship and histories of populations, we would first have to determine
the network for each molecule in the sample, then as a second-order
problem assess the distribution of those molecular patterns in populations
by plotting each molecule on a map, for instance, to show where its
carriers are most typically located. What we cannot do if we are dealing
only with character data is to achieve an easily read composite network
graph for all the molecules we wish to consider and their relative frequencies
in diVerent populations. However, we could do this if we were dealing
with composite distances between populations based on a summation of
all those individual molecules. Recent network-based approaches have
therefore shifted from a character-based to a distance-based method, as in
Splitstree (Huson 1998) and NeighbourNet (Bryant and Moulton 2004),
though these are still very much models under development.
The development of distance-based metrics also brings considerable
advantages for linguistic applications, though these are really still in their infancy (Bryant 2004; Holden and Gray 2004; Bryant, Filimon, and
Gray in preparation). In particular, as we saw in Chapter 3 above, there is
an inherent difficulty in using character-based approaches for language
data: the characters chosen for one language group (say, Ringe, Warnow,
and Taylor’s phonological and morphological characters for Indo-European
(2002)) will be highly unlikely to generalize to other language
groups, since they have been selected particularly as specific innovations
which are salient in subgrouping for that family. It is true that this is not
such a major problem for lexical data, since with our Swadesh lists we are
by convention dealing with a set list containing set slots. However, there
are still difficulties here. How are we to compare, for instance, a conventional
Swadesh list with the adjusted variety developed for Australian
languages by Alpher and Nash (1999)? Alternatively, even where we
might keep a particular slot there can be serious difficulties in determining
which item should Wll it for a given language or group: for instance, in
Quechua there are two words for ‘brother’, depending on whether we are
discussing a man’s brother or a woman’s brother, and up to five words for
‘wash’, depending on whether we are washing hands or clothes, for
instance.
This Quechua problem is the direct motivation for Heggarty’s proposal
(forthcoming) to extend lexicostatistical comparison to provide a more
nuanced means of comparing lexical semantics. As we have seen, modified
Swadesh lists already exist in the literature, and on the model of
Matisoff’s CALMSEA list (1978, 2000), containing meanings Culturally
and Linguistically Meaningful for South-east Asia, Heggarty proposes a
parallel 150-meaning CALMA list, incorporating meanings Culturally
and Linguistically Meaningful for the Andes (for a full list see Heggarty
forthcoming and McMahon, Heggarty, McMahon, and Slaska forthcoming).
The CALMA list is altered in several ways, discussed in detail
in Section 6.3.4 below; but for the moment the most relevant modification
is that a single list-meaning may be split into several discrete subsenses
where the data warrant such treatment. For instance, the Andean
languages commonly distinguish two senses for the Swadesh meaning
‘sun’, namely ‘celestial object’ on the one hand and ‘sunlight/heat’ on the
other. Some Andean varieties will have one form for each of these two
senses, as for instance Atalla Quechua has inti for the ‘celestial object’
sense, and rupa-y for ‘sunlight/heat’. Laraos Quechua has both forms too,
but it uses inti for both these subsenses, and rupa-y only in the verb root ‘be hot (sunny), burn’. Puki Aymara, however, has only inti, and rupa-y is
entirely unknown. Comparing these varieties reveals a complex set of
patterns of overlap, which can be expressed as weighted calculations of
degree of similarity; but this immediately means we will require more
sophisticated calculations than the usual 0 or 1. Weighted values of this
kind, as we shall see, are quite typically incorporated in distance-based
calculations, though they disrupt the assumptions of character-based
approaches.
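The effect of subsenses on the scoring can be illustrated with a very simple Python sketch. The averaging used here is a deliberately crude stand-in, not Heggarty's weighting method (which, as shown in (2) in Section 6.3.4 below, also takes intelligibility and partial understanding into account and gives Laraos–Atalla 0.83 rather than the 0.5 produced here); it serves only to show why the resulting values are no longer confined to 0 and 1.

# Forms for the two subsenses of 'sun' in three Andean varieties, as described above.
sun = {
    "Atalla Quechua": {"celestial object": "inti", "sunlight/heat": "rupa-y"},
    "Laraos Quechua": {"celestial object": "inti", "sunlight/heat": "inti"},
    "Puki Aymara":    {"celestial object": "inti", "sunlight/heat": "inti"},
}

def crude_similarity(variety_a, variety_b, data):
    """Average, over subsenses, of whether the two varieties use a matching form."""
    senses = data[variety_a].keys()
    matches = [data[variety_a][s] == data[variety_b][s] for s in senses]
    return sum(matches) / len(matches)

print(crude_similarity("Laraos Quechua", "Puki Aymara", sun))     # 1.0
print(crude_similarity("Laraos Quechua", "Atalla Quechua", sun))  # 0.5, already neither 0 nor 1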

6. 3. 2 Split Decomposition and Distance Data.
How, then, does the split-decomposition approach work for distance
data? Rather than dealing with the approach in biology first and extending
this to language data, we shall turn immediately to linguistic applications.
In outline, this approach overlaps significantly with the earlier description
of character-based split decomposition, though there are some additional
steps (see again Bryant, Filimon, and Gray in preparation).
For Splitstree (Huson 1998) the first stage is to derive a distance matrix
from the data we are using. In many cases this will involve simply adding
up the 1 and 0 values for whether items are cognate or not across the
whole list, though, as noted above, innovations whereby a wider range of
intermediate values is included can in principle be accommodated (Heggarty
forthcoming). The second step is to generate splits on the basis of
the data. For character-based approaches this is a straightforward process,
since it involves essentially spotting differences and generating splits
accordingly; but for distance-based approaches there is added complexity.
For a maximum of four languages or groups, split decomposition
calculates the maximum distance between each pair, along with the distance
separating the two languages within each pair. If the distance
between pairs is greater than the distance within a pair, then Splitstree
generates a split with a branch length equal to that positive value. Cumulatively,
these calculations of distance generate a tree, and where we
find conflict between the splits, then reticulations will be introduced.
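The quartet condition just described can be written out as a short Python sketch. The distances below are invented, and the isolation index used here is the standard textbook formulation rather than the exact computation inside Splitstree; it shows how a split is generated only when the between-pair distances exceed the within-pair ones.

# Hypothetical lexical distances (share of non-matching items) for four languages.
d = {
    ("A", "B"): 0.10, ("C", "D"): 0.12,
    ("A", "C"): 0.40, ("A", "D"): 0.37,
    ("B", "C"): 0.36, ("B", "D"): 0.38,
}

def dist(x, y):
    return 0.0 if x == y else d.get((x, y), d.get((y, x)))

def isolation_index(pair1, pair2):
    """Support for the split pair1 | pair2; a positive value means the split is drawn."""
    (w, x), (y, z) = pair1, pair2
    across = max(dist(w, y) + dist(x, z), dist(w, z) + dist(x, y))
    within = dist(w, x) + dist(y, z)
    return 0.5 * (across - within)

print(isolation_index(("A", "B"), ("C", "D")))   # ~0.28: the split AB | CD is supported
print(isolation_index(("A", "C"), ("B", "D")))   # negative: no split AC | BD is drawn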
The problem with Splitstree is that it experiences difficulty in dealing
with large numbers of languages or particularly complex and messy
signals. As Bryant, Filimon, and Gray (in preparation) note, introducing
more languages and splits leads inevitably to a reduction in the values for
branch lengths, so that it is harder to generate cases of conXict. This means that graphs based on bigger and more complex data sets tend to
become more tree-like by default, because the amount of data predisposes
to small or negative values for differences between groups, and reticulations
therefore rarely arise. Clearly, this makes Splitstree problematic for
large data sets, as we shall see below. This problem is being addressed in
the development of NeighbourNet, which uses an algorithm similar to the
neighbour-joining approach for trees (Saitou and Nei 1987). The details
are too complex for full discussion here (though see Bryant and Moulton
2004; Bryant, Filimon and Gray in preparation), but the consequence is
that NeighbourNet can deal with much larger data sets and will be able to
generate splits and reticulations in even relatively messy cases. The potential
drawback, on the other hand, is that NeighbourNet may be such a
robust heuristic that it will generate splits and identify conXicts even
where the data do not really support them. Furthermore, NeighbourNet
is essentially a phenetic method: that is, it works on the basis of observed
similarities and distances between languages at a particular time, and
does not explicitly seek to reconstruct a history for the group. Outputs
from NeighbourNet are strictly phenograms, which give an indication of
relative distance, rather than phylograms, which attempt to reconstruct
the historical pattern and order of branching.
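For readers unfamiliar with the agglomerative logic involved, the selection step of neighbour-joining (Saitou and Nei 1987), which NeighbourNet generalizes, can be sketched in Python as follows. The distances are invented, and a full implementation would go on to create a new node, recompute distances, and iterate; NeighbourNet's own construction of a circular split system is considerably more involved (Bryant and Moulton 2004).

import itertools

languages = ["A", "B", "C", "D", "E"]

# Invented pairwise lexical distances.
d = {
    ("A", "B"): 0.10, ("A", "C"): 0.32, ("A", "D"): 0.34, ("A", "E"): 0.30,
    ("B", "C"): 0.33, ("B", "D"): 0.35, ("B", "E"): 0.31,
    ("C", "D"): 0.12, ("C", "E"): 0.36, ("D", "E"): 0.38,
}

def dist(x, y):
    return 0.0 if x == y else d.get((x, y), d.get((y, x)))

def q_value(i, j, taxa):
    """Saitou and Nei's Q criterion: the pair with the lowest Q is joined first."""
    n = len(taxa)
    return ((n - 2) * dist(i, j)
            - sum(dist(i, k) for k in taxa)
            - sum(dist(j, k) for k in taxa))

pairs = list(itertools.combinations(languages, 2))
first_join = min(pairs, key=lambda pair: q_value(pair[0], pair[1], languages))
print(first_join)   # ('C', 'D'): the closest 'neighbours' under the Q criterion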
It follows that an urgent priority for linguists is to assess the operation
of these different clustering programs on known data, to allow us to
identify the patterns we observe as characteristic of particular types of
history. If we do not find consistent representations for consistent types of
input, we have a problem with the programs. If we do, then we can start
to generalize these approaches to less securely understood cases. We
therefore make no apology for returning to our simulations and Indo-European data again.

6. 3. 3 Distance-based Methods and Linguistic Data.
It is straightforward to show that NeighbourNet is indeed more sensitive
to the presence of contact in the linguistic data than Splitstree. Figure 6.10
shows the output of NeighbourNet on the left and Splitstree on the right
for our simulated least conservative data, with 10% borrowing, calculated
and printed using the Splitstree 4 beta test version (1 June 2004 release;
<http://www-ab.informatik.uni-tuebingen.de/software/jsplits.welcome_
en.html>, accessed March 2005). Clearly, both programs construct reticulations, though they are markedly more extensive for Neighbour-
Net. Note that borrowing here is from language A into the unrelated
language M (rather than between two of the related languages, such as A
and B, as was the case in previous simulations). M has no similarity, prior
to borrowing, to any of the other simulated languages; we have selected
this option here to minimize any potential interference from shared
retentions (see below).
Although we have not shown them here, parallel graphs for our simulated
most conservative data set show a more striking difference, since
Splitstree here gives a completely tree-like representation, while NeighbourNet
includes a small number of minor reticulations. As we have
noted above, these most conservative data do not include any borrowings,
so contact cannot be the explanation for these reticulations. In fact,
NeighbourNet is here picking up the extremely small percentage of
shared retentions, where states from the common ancestor are by chance
retained in languages from different branches of the tree, though their
immediate sisters have lost those characters. What is more difficult is
checking the cause of such reticulations for Splitstree and NeighbourNet:
because these programs are operating on distance data, they are applied
to a matrix of numbers rather than to the lexical material itself, and they
cannot therefore point us directly to the individual items responsible for any particular reticulation.

6. 3. 4 Applying NeighbourNet Beyond Indo-European.
All these methods require further comparison and testing, and all remain
under active development (see Forster, Polzin, and Röhl 2005 for Network),
but for the moment it seems particularly worth continuing with NeighbourNet, since the algorithm here is apparently set to a good
level for detecting relevant signals in linguistic data, without being
overburdened by too much noise. It has the further advantage of operating
extremely quickly, with runs for our entire Dyen, Kruskal, and Black
database taking 40–50 seconds on a 700 MHz PC.
In Chapter 2 we discussed Embleton’s stepwise approach to the development
of new methods, which is repeated for convenience in (1) below:
(1) Embleton (1986: 3)—steps in quantitative analysis
(i) to devise a procedure, based on theoretical grounds, on a
particular model, or on past experience . . .
(ii) to verify the procedure by applying it to some data where there
already exists a large body of linguistic opinion for comparison,
often Indo-European data . . . this may lead to revision of
the procedure of stage (i), or at the extreme to its total abandonment;
(iii) to apply the procedure to data where linguistic opinions have
not yet been produced, have not yet been firmly established, or
perhaps are even in conflict. In practice, this usually means
application to non-Indo-European data. . . .
For NeighbourNet and analogous methods we have now passed Stages 1
and 2, and must devise some appropriate Stage 3 tests. Though there are
in principle many situations to which we could apply such methods, and
indeed many where linguists would be particularly keen to have a diagnostic
for working out the most likely history, we shall content ourselves
for the moment with two small demonstrations of the method. Both are
designed to assess how NeighbourNet performs when faced with particularly
strong and pervasive evidence of contact; the second applies the
method to a single situation in South America where the evidence is
equivocal as between a hypothesis of common ancestry or a long period
of contact and convergence.
First, if we are to pursue our goal of not ignoring or excluding borrowing
but learning how to diagnose and use it, we must consider cases of
contact-induced change more radical than the few undiagnosed loans our
tree- and network-drawing methods have unearthed in a single Indo-
European database, extensive though this is. If there is one geographical
area where linguists agree that the effects of contact have been particularly
widespread, it is Australia. True, the Australian linguistic situation is anything but settled: some linguists argue for at least one substantial,
old language family, Pama-Nyungan, with other groupings and isolates
(Koch 1997; Bowern and Koch 2004; Evans 2004), while others contend
that Pama-Nyungan is not demonstrable, and that the Australian languages
are connected primarily by long-standing contact relationships
(Dixon 1980, 2001, 2002). This, then, seems an area ripe for methodological
innovation.
Our own test here is a very small-scale one, using a severely limited
corpus of data from 26 languages of south-eastern Western Australia
analysed and published by David Nash (2002). The interrelationships of
these languages are poorly understood, and the available data frequently
consist of an incomplete meaning list (based on the Alpher and Nash
(1999) modified Swadesh list), collected in most cases from a single
speaker. These data possibly represent a worst-case test for phylogenetic
methods: we cannot select our data, since the sources are intrinsically
limited, and we have no comparative-method work to fall back on, so
that judgements of likely cognacy are necessarily based simply on recurrent
similarity.
For comparison, a NeighbourNet graph for the whole Dyen, Kruskal,
and Black database (hence, 200 items for 95 languages and varieties) is
shown in Figure 6.12. This clearly produces a tree, though there are
obvious reticulations too, particularly in the Balto-Slavic group (note
that Splitstree provides essentially the same outline topology, but with
considerably fewer reticulations, as we would expect from the comparisons
made earlier).
However, Figure 6.13, drawn again using both Splitstree and NeighbourNet
for comparison, shows that the phylogenetic signal is very considerably
weaker in the Australian data. In Figure 6.13(a), the Splitstree
graph collapses 20 of the 26 languages as a single node: any phylogenetic
structure there may be has been concealed completely by the effects of
contact. As discussed above, Splitstree therefore has the dual disadvantage,
at least in current versions, of maximizing tree-like structure
and failing to illustrate signals of contact, but equally failing to discern
tree-ness when the data set is very complex. The NeighbourNet graph in
Figure 6.13(b) constitutes a step forward, with some vestiges of a tree-like
signal emerging, though the volume of reticulations is still considerable.
Note that in both these graphs the alphabetical language codes are those
used by Nash (2002). The next step would be to ask linguists with a particular interest in
these languages to assess whether the groups which are emerging in the
NeighbourNet graph are likely to be linked primarily by the apparently
underlying phylogenetic structure rather than by reticulations; this is not
something we can pursue further here. Finally, however, note that putting
even this highly convergent data through a tree-drawing program like
PHYLIP (Felsenstein 2001) Neighbour, which operates on a neighbour-joining
algorithm, will inevitably produce a tree. Figure 6.14 shows a tree
of this kind, drawn for Nash’s Australian data using the neighbour-joining
algorithm also available in Splitstree. Programs which cannot
analyse out conflicts in the data cannot diagnose the effects of contact;
and the fact that they draw trees cannot be taken as evidence that we have
languages with a fundamentally tree-like history, since trees, by definition, are all they can conceivably draw. Bootstrapping would, it is true, be
highly likely to show very poor support for any particular tree configuration
with data of this type; but the fact remains that such doubts could
only emerge from further processing and testing of the tree, whereas
network-based approaches oVer us the possibility of establishing how
tree-like our data are from the outset.
These Australian data, however, are unlikely to be accepted as anything
approaching a cast-iron test of any method or model, partly because
the Australian situation is recognized as such a recalcitrant one, and
partly because the data analysed here are so fragmentary. In addition, though we turn in detail to issues of dating and time depth in the next
chapter, Australia is generally agreed to have a long settlement history,
dating back to at least 40,000 years bp (before the present): this leaves a
great deal of time available for linguistic splits and differentiation to
accumulate, and for contact relationships to develop and change,
meaning that the chances of successfully recovering a single, accurate
history for these languages are inevitably small using any method. If the
social and historical forces obtaining since settlement have predisposed to
contact and convergence, these effects will be correspondingly greater.
Our second test, therefore, involves a range of Andean languages. Here,
the issues are at least clearer, and we have access to a considerable
database of material collected first-hand by Paul Heggarty, mainly between
2001 and 2004.
The material we have used here from Heggarty’s database (see also
Heggarty forthcoming) involves 150 lexical items from each of 14 varieties
of Quechua; 3 varieties of Aymara; and Kawki and Jaqaru, which
are typically classified as independent Aymaran languages. The central
question here is whether Quechua and Aymaran are related, or whether
the undoubted affinities between them rather reflect extensive contact—
the Quechumaran question. Terminology in this area is rather fluid, with
various proposals for the name of the cluster containing Aymara, Jaqaru, and Kawki: this has variously been called Aru, Jaqi, and Aymaran. Here,
we shall use Aymara when referring to the single language, three varieties
of which are included in Heggarty’s database, and Aymaran for the family
containing Aymara plus the subgroup composed of Jaqaru and Kawki.
As discussed briefly in Section 6.3.1 above, Heggarty’s 150-item
CALMA meaning list overlaps significantly with the Swadesh list used
elsewhere in this book. However, the CALMA list is modified in four
ways. First, not all 200 meanings from the Swadesh list were collected,
either because they refer to concepts not native to or otherwise unknown
in the Andes, or more commonly because two Swadesh meanings share
the same root in Andean languages: this is the case for ‘one’ and ‘other’,
for example, or ‘woman’ and ‘wife’. The shared roots mean these pairs of
items are not independent, so they have been collapsed into a single slot in
the CALMA list. Second, a few pan-Andean items which provide good
indicators of local relatedness through particular correspondences were
included: these are similar to, though not identical with, items in the
Swadesh list. For instance, CALMA includes ‘fox’, ‘be ill’, and ‘fingernail’
in place of ‘wolf’, ‘sick’, and ‘claw’ respectively. Third, where all the
languages and varieties Heggarty is comparing differentiate between
meanings in a similar way (as, for example, with the case of ‘man’s
brother’ versus ‘woman’s brother’, a distinction consistently expressed
by two discrete forms in Quechua) Heggarty splits this single meaning
from the Swadesh list into two separate list meanings. This means that
‘brother’, and similarly ‘sister’, ‘old’, and ‘young’ each occupy two slots in
the CALMA list. Finally, cases where Andean languages may have
different forms for two or more subsenses, as in the case of ‘sun’, with
the subsenses ‘celestial object’ and ‘sunlight/heat’, are assigned a single
slot in the CALMA list. However, the score when two languages or
varieties are compared will be intermediate between 1 and 0, depending
on the overlap of subsenses between the varieties.
This brings us to a further innovation inherent in Heggarty’s revised
lexicostatistical model. Not only does comparison between languages and
varieties involve weighting and intermediate values (and recall that this
will require analysis using a distance-based rather than a character-based
Network approach), but we also have to revise our initial stage of data
processing. Up to now we have worked with the Dyen, Kruskal, and
Black database, which incorporates judgements of whether items are
cognate or not; these are reflected in the codings (recall Table 6.1 above) of 401 for all the Romance ‘animal’ forms, but the unique state
031 assigned to English animal, to mark it as a likely borrowing. However,
assigning values based on cognacy judgements in the Andean situation
simply begs the Quechumaran question. How can we reach an
objective evaluation of whether the undoubted similarities we find here
are due to common ancestry or contact if we use terminology and codings
which presuppose that items are cognate?
The answer here involves a revision of our terminology which is not
purely cosmetic, but underlines a difference in approach. Where we have
considerable knowledge of the histories of languages, and of their likely
relationships (and this is likely to involve prior application of the comparative
method), we can use traditional lexicostatistics, mark up our lists
according to plausible cognacy, and talk about cognates as we have done
for Indo-European and for our simulated data (where we know the
history because we created it) in the examples above. However, when
we turn to less securely charted linguistic waters we need to be more
circumspect about what we are and are not claiming, and must make our
comparisons on a more neutral basis. Heggarty (forthcoming) therefore
suggests that in such cases we refer not to cognates but to correlates
between languages.
What this means is that in situations where we indubitably find significant
numbers of matches between languages but it would appear that
any signal in the data lies beyond the reach of the comparative method,
we should deliberately not beg the question of whether such matches are
cognates or loanwords, but should use a term neutral between the two
possible interpretations. Correlates, then, are striking form-to-meaning
correspondences which are highly unlikely to be due to chance, but might
well reflect either common ancestry or contact. Heggarty (forthcoming)
suggests that potential correlates should be rated on a 0–7 scale expressing
levels of ‘plausibility’. These different scores express how far the
degree of phonetic similarity observed between correlate sets appears to
constitute a correlation significantly greater than chance. To give some
outline examples, a case for the ‘sun’ slot where two varieties both have
inti, and therefore identical forms, would score 7; inti–rupa-y would score
0, at the other extreme, since there is really no basis for assuming correlateness
in this case; p’iqi–piqa is rated at 5, but *qulu – *urqu (= 3) and
*huma – *qam (= 2) are seen as less convincing and more speculative.
These assessments are based on a number of principles, discussed in Heggarty (forthcoming), and are predicated on known sound changes
typical within the Andean languages. They also draw on Cerrón-Palomino’s
(2000: 311) categories of obvious loanwords, very probable
cognates, probable cognates, and obviously unrelated forms. Beyond
that, there is some subjectivity in these characterizations; but the nature
of the scales involved will mean the impact of such subjectivity on the
figures is kept to a minimum, causing generally at worst a shift of the
order of 0.1 to 0.2. A misidentification in traditional lexicostatistics, of
course, would mean a shift of 0 to 1 or vice versa.
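The arithmetic behind this claim is easy to make explicit. Assume, purely for illustration, that a plausibility rating is converted into a numerical weight by dividing by 7 (Heggarty's actual conversion may well differ); a short Python fragment then shows the size of the error introduced by a one-step misjudgement, as against a full cognate/non-cognate flip.

def weight(plausibility, scale_max=7):
    """Hypothetical conversion of a 0-7 plausibility rating into a similarity weight."""
    return plausibility / scale_max

print(round(weight(5) - weight(4), 2))   # 0.14: a one-step misrating on the 0-7 scale
print(1 - 0)                             # 1: a cognacy misjudgement in binary lexicostatistics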
Heggarty’s approach is also gradient in another way, since he allows
intermediate values between 1 and 0 in comparisons, based on the degree of
intelligibility between languages (or indeed varieties, making this approach
helpfully applicable to dialect as well as language relationships). Often, of
course, we will still be dealing with values of 1 or 0. For instance, where
there is full, total intelligibility between two varieties, a coding of 1 will be
entered: this is the case for Laraos Quechua and Puki Aymara, both of
which use the single, identical form inti for the list-meaning ‘sun’, making
no particular distinction of separate subsenses. At the other extreme,
Chetilla Quechua uses only rupa-y, while Puki Aymara uses only inti, and
the two words clearly do not resemble one another formally; the obvious
similarity coding is 0. There are, however, a range of intermediate values,
which are shown in (2); we cannot go into details here on the precise method
of calculation used, but see Heggarty (forthcoming) and McMahon, Heggarty,
McMahon and Slaska (forthcoming) for further information.
(2) Correlate scoring on a descending scale of mutual intelligibility
(i) Laraos – Puki Full correlates inti in all senses of ‘sun’.
Score 1
(ii) Laraos – Atalla Full correlates inti for the ‘celestial-object’
subsense.
For ‘sunlight’, Laraos speaker uses rupa-y
only for ‘burn’, otherwise inti; Atalla
speaker uses rupa-y for ‘heat of the sun’,
otherwise inti. Score 0.83
(iii) Puki – Atalla Full correlates inti for the ‘celestial-object’
subsense.
For ‘sunlight’, Atalla speaker uses rupa-y
for ‘heat of the sun’, otherwise inti; Puki
speaker has only inti. Score 0.78
(iv) Chetilla – Atalla For the main ‘celestial-object’ subsense,
Chetilla speaker has only rupa-y; Atalla
speaker has inti, but will understand
rupa-y as ‘heat of the sun’.
For ‘sunlight’, full correlates rupa-y.
Score 0.56
(v) Chetilla – Laraos Chetilla speaker uses only rupa-y; Laraos
speaker uses inti, and will understand
rupa-y only as ‘be hot (sunny), burn’.
Score 0.17
(vi) Chetilla – Puki No correlates in either sense; Chetilla
speaker uses only rupa-y for both senses,
Puki speaker uses only inti. Lexemes are
not correlate. Score 0
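Scores of this kind are exactly what a distance-based program needs. The short Python sketch below turns the single-meaning scores in (2) into a toy distance matrix of the sort NeighbourNet takes as input; in a real analysis each cell would of course be an average over the whole 150-item CALMA list rather than over one meaning, so the figures are purely illustrative.

# Similarity scores for 'sun' taken from (2); distance is taken here as 1 - similarity.
similarity = {
    ("Laraos", "Puki"): 1.00,
    ("Laraos", "Atalla"): 0.83,
    ("Puki", "Atalla"): 0.78,
    ("Chetilla", "Atalla"): 0.56,
    ("Chetilla", "Laraos"): 0.17,
    ("Chetilla", "Puki"): 0.00,
}

varieties = ["Atalla", "Chetilla", "Laraos", "Puki"]

def distance(a, b):
    if a == b:
        return 0.0
    return 1.0 - similarity.get((a, b), similarity.get((b, a)))

print("          " + "".join(f"{v:>10}" for v in varieties))
for a in varieties:
    print(f"{a:>10}" + "".join(f"{distance(a, b):10.2f}" for b in varieties))
# A matrix of this kind, averaged over a full meaning list, is the input from which
# a distance-based program such as NeighbourNet computes its splits.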
These measures of intelligibility and overlap between subsenses provide
a graded rating of similarity, but that is all—any network plotted from
these results would be purely a phenogram, giving information on distance,
and not a phylogram, which tells us about the history of the different
groups. And yet, especially if we include Aymaran as well as Quechua
data, it is precisely insight into the more likely history that we need. How,
then, are we to reconcile our intentionally neutral, correlate-based approach
with our search for a resolution to the Quechumaran problem?
The answer, again, lies in our use of sublists, which allow us to place a
historical interpretation on our phenetic results. In keeping with our
methodology for Indo-European, we have excerpted from Heggarty’s
database two groups of 30 items corresponding to our earlier hihi and
lolo sublists—those which are most retentive on the one hand, and those
most prone to change and borrowing on the other. These sublists are
shown in (3), and though membership is not identical with the hihi and
lolo lists for Indo-European, the overlap has been maximized as much as
possible given the diVerent compositions of the Swadesh and CALMA
lists (overlapping items are shown in bold). As (3) shows, 25 items from
the Andean hihi list are also included in the Indo-European one; the
overlap for the lolo list looks poorer, at 18, but recall that our Indo-
European lolo list included only 23 items: we increased the hihi list to 30
to compensate for the presence of 6 totally uninformative meanings,
which were cognate across the entire group.
Although these sublists are not identical to those considered earlier for
Indo-European, they can be shown to be differentially affected by borrowing
in the same way. Spanish borrowings can be identified relatively
readily in all the Andean languages and varieties, and we find an average
of 2.7% Spanish loans in the hihi sublist, but 6.7% in the lolo sublist,
nearly three times as high. This difference is significant at the p < 0.001
level (paired t-test; t = 4.1, df = 18).
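For readers who wish to replicate this kind of check, the test itself is a routine paired comparison, sketched below in Python with scipy. The nineteen per-variety percentages are invented stand-ins whose means roughly match the 2.7% and 6.7% reported above (df = 18 implies nineteen paired values); they will not reproduce t = 4.1 exactly.

from scipy.stats import ttest_rel

# Invented per-variety percentages of Spanish loans in each sublist (19 varieties).
hihi_loans = [0.0, 3.3, 3.3, 0.0, 6.7, 3.3, 0.0, 3.3, 3.3, 0.0,
              3.3, 6.7, 0.0, 3.3, 3.3, 0.0, 3.3, 6.7, 3.3]
lolo_loans = [3.3, 6.7, 10.0, 3.3, 10.0, 6.7, 3.3, 6.7, 10.0, 3.3,
              6.7, 10.0, 6.7, 6.7, 10.0, 3.3, 6.7, 10.0, 6.7]

t_stat, p_value = ttest_rel(lolo_loans, hihi_loans)
print(f"t = {t_stat:.2f}, df = {len(hihi_loans) - 1}, p = {p_value:.5f}")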
This operational difference between the more and less conservative
sublists is encouraging, but it remains to be seen whether these can also
be used as a basis for deciding between the alternative histories of Quechua
and Aymaran. Recall that our use of graded similarity scores in
some comparisons, following Heggarty’s introduction of subsenses,
means distance-based rather than character-based programs are clearly
more appropriate. The graphs in Figure 6.15 were therefore generated
using NeighbourNet.
For the networks in Figure 6.15, Spanish loans have been excluded by
marking them, as usual, as unique states. It is very clear in both these
graphs that the 14 Quechua dialects cluster together; so do the three
Aymara varieties, plus Jaqaru and Kawki, which however constitute a
separate branch within Aymaran. The most interesting aspect of these graphs, however, is the relative distance between the Quechua cluster and the Aymaran cluster in the two sublists, which turns out to be considerably greater for the hihi than for the lolo meanings. These figures can be set against equivalent calculations for the two subgroups within Aymaran, and for a sample of Germanic
and a sample of Romance languages from the Dyen, Kruskal, and Black
database, which represent two Indo-European groups of comparable
sizes and overall lexical distance from each other. Here, the comparable
distances are 52% for the lolo sublist and 32% for the hihi group. The
figures are different, clearly, but the pattern is the same: for the two
Aymaran groups, and the two Indo-European ones, we find greater
distance for the lolo sublist, and greater similarity for the hihi meanings.
This is precisely the opposite of the pattern shown in Figure 6.15 for
Quechua compared with Aymaran.
What the Aymaran-internal and Indo-European calculations have in
common, of course, is that they involve languages which are known to
belong to a single group, whether at the subfamily or family level. In the
case of Quechua compared to Aymaran, this is not necessarily the case:
and indeed our figures would seem to argue against common ancestry, and
for a relationship of contact. Common ancestry would appear to correspond
regularly to calculations of greater distance for the lolo subgroup
than for the hihi one, simply because the lolo items, by definition, are more
likely to change. The opposite trend, as in the striking figures for Quechua
as compared with Aymaran, where three times as much distance is apparent
in the hihi as in the lolo items, favours an argument of contact rather
than common ancestry. If two groups show greater affinities in the sublist
which is more prone to contact, then contact seems, to put it bluntly, the
most appropriate explanation. The fact that this is precisely the opposite
balance to those cases where we can be much more confident that we do
have common ancestry strengthens this conclusion; at this stage it may be
accidental that the threefold additional distance between Quechua and
Aymaran hihi sublists matches the threefold additional Spanish loans we
found on average in the Andean lolo sublists, but the parallel is at least
indicative and worthy of further investigation. None of this proves that
Quechua and Aymaran never shared a common ancestor; but it does
suggest a very significant influence of contact as the main determinant of
the lexical similarities between the two groups.
6. 4 The Uses of Computational and Quantitative Methods.
The computational and quantitative techniques we have illustrated here
are intended to identify, represent, and elucidate problems in linguistic classification. We have shown that it is possible to test different programs
and applications against cases of known language histories, whether these
involve common ancestry or contact or both. We have argued that
network approaches in general are more flexible and insightful than
those based only on trees, and that these allow us access to a means of
representing the whole history of languages, not only those aspects which
derive by descent with modification from a single ancestral state. Nonetheless,
each of the models we have illustrated has its advantages and
disadvantages: for example, Network generates an extremely helpful list
of those items incompatible with a tree structure, which can then be
checked individually by linguists who know the languages involved well;
on the other hand, it is suitable only for character-based data. NeighbourNet,
which works on distance data and seems optimal for discerning
the effects of contact in lexical lists, cannot by its nature list the problem
characters, since it is based on figures derived from a composite character
list. The data we have, and the way we wish to analyse them, will therefore
determine which program we decide to use.
It is worth reiterating that the first and essential stage of working with
these programs and representations involves helping linguists to demonstrate
that the insights already achieved through purely linguistic
methods are sound, and to test and perhaps refute less likely hypotheses.
At the same time, increased awareness of the benefits of such computational
techniques in known cases may convince more historical linguists
that they can also be generalized into the unknown. Most of this chapter
has been devoted to demonstrating that new quantificational and computational
methods can affirm what linguists feel they already know, and
for this we make no apology. It is only when such methods are accepted in
these standard cases, tested and affirmed by simulations based on what
linguistic methods have already established, that linguists are likely to
trust them in resolving other and more complex cases where linguistic
opinion persistently differs, and the data do not allow a purely linguistic
resolution. Our own illustrations of the merits of these methods for such
unclear cases are tentative and preliminary, but we hope they indicate the
possible benefits of quantitative and computational approaches in future
research.
However, one potential problem remains with the methods we have
used so far: all are based on second-order data coding, whether this
involves assessments of likely cognacy, or degree of intelligibility for correlates. Ideally, we might wish to introduce alternative methods for
first-order comparison of linguistic data (which will require sophisticated
measures of similarity), and also to explore the possibilities for quantitative
work outside the lexicon; we return to these issues in Chapter 8. First,
however, we turn to another pressing question. Our trees and networks
contain nodes and reticulations: can we use linguistic data to suggest
dates for these, and if so, are those dates likely to be accurate?
