- 1. Introduction
- Chapter 2: Background
- Chapter 3: Data
- Chapter 4: Methodology
- Thesis Outline
- Appendices
- Appendix A: Speaker Metadata
- Appendix B: Per-Speaker Concept Forms and Cognate Decisions
- Appendix C: Phonetic Normalization Rules
- Appendix D: Cognate-State Coding — Methodology and Matrix Revisions
- Appendix E: Cognate Decision Notes Assessment Prompt
- Appendix F: Supplementary Data
- Appendix G: Review Tool
The central Zagros range runs northwest to southeast along the Iraq-Iran border, dividing the lowland plains of eastern Iraq from the Iranian plateau. Between the ridgelines, narrow valleys and intermontane basins have long sheltered small, relatively isolated speech communities. The core of the Southern Kurdish (SK) speech area runs along both sides of the border — Kermanshah and Ilam on the Iranian side, Diyala governorate and Khanaqin on the Iraqi side (see Figure 1, §3.2, for a map of speaker origins within this region). The terrain is rough enough that two villages thirty kilometres apart can be harder to reach than two cities three hundred kilometres apart by road, and the linguistic fragmentation often reflects this.
Fattah (2000: 56) calls this region a "shatter zone" (zone d'éclatement), a term drawn from historical geography for areas where political fragmentation and geographic compartmentalization complicate dialect classification. The label captures the political complexity of the borderlands — Ottoman-Persian frontier zones, tribal confederacies, forced resettlement — but it can mislead if read as a claim about linguistic divergence per se. As Belelli (2019: 75-76) has argued, SK varieties in fact form "a rather compact dialect continuum, unified by a fair degree of mutual intelligibility based on shared phonological, morphosyntactic and lexical features." The classification difficulty lies not in dramatic linguistic divergence but in the fact that the differences that do exist — many of them relatively minor — refuse to align into neat isogloss bundles. Within the SK area, Kurdish varieties coexist with Arabic, Persian, and Gorani — each exerting contact pressure at different social registers and in different domains. Arabic dominates administrative life on the Iraqi side; Persian does the same in Iran. Gorani — often lumped with Kurdish despite being a distinct Northwest Iranian language — is spoken in enclaves scattered through the same mountain valleys. Borrowing and code-switching are not occasional features of this ecology; they are the background against which all variety-internal change takes place.
Mutual intelligibility among SK varieties is generally high, but does not degrade smoothly with geographic distance. Speakers of Kalhori in Kermanshah province and speakers of Faili in Khanaqin, separated by roughly 200 kilometres, report partial but inconsistent comprehension. The breakdown tracks not just lexical divergence but structural differences in verbal morphology and case marking (Haig 2008: 103). Complicating matters, large-scale displacement of Faili Kurds from Iraq during the 1970s and 1980s scattered communities across Iran and the broader diaspora — disrupting dialect networks and making fieldwork on certain varieties difficult. Any phylogenetic study of SK has to start from this.
Kurdish dialectology has worked from MacKenzie's (1961: 142) tripartite division since the 1960s — Northern (Kurmanji), Central (Sorani), and Southern Kurdish, separated primarily by the fate of intervocalic -d-: retained in the north, weakened to a fricative in the centre, and lost entirely in many southern varieties. The isogloss has held up as a first-order classifier. The trouble begins when one looks inside Southern Kurdish and asks what structure, if any, holds among its constituent varieties.
Fattah (2000: 56) documents at least 35 distinct SK varieties and argues that Laki, traditionally grouped with Southern Kurdish, should be removed on phonological grounds. Even setting Laki aside, the remaining varieties resist neat subgrouping. The central question — whether Faili, Kalhori, and Khanaqini form a hierarchical subgroup within SK, or simply radiate from a common ancestor without recoverable internal branching (a flat star topology) — has not been settled.
Isoglosses have not resolved it, because they disagree. Phonological features group Faili with Kalhori; morphosyntactic features, particularly patterns of ergativity loss, group Kalhori with Khanaqini (Haig 2008: 103). When different feature classes produce contradictory trees, bundling isoglosses on a map offers no principled way forward — adjudicating between them requires a weighting scheme, and none has been agreed upon. Mohammadirad (forthcoming) complicates things further, suggesting the SK area is better understood as a "transitional zone" between Central Kurdish and the broader West Iranian continuum described by Paul (1998). That framing raises a harder question: is Southern Kurdish a genetic clade at all, or an areal grouping held together by geography and contact?
Bayesian inference handles this kind of problem without forcing a resolution. Rather than selecting one best tree, the method samples a posterior distribution — a probability-weighted ensemble of trees that makes the uncertainty explicit. Competing topologies, such as the Faili-Kalhori subgroup versus the flat star, receive posterior probabilities that can be compared directly. The data do not have to pick a winner; they report how much support each hypothesis actually has.
The approach has precedent. Gray and Atkinson (2003: 435) applied Bayesian inference to Indo-European using lexical cognate data; Bouckaert et al. (2012) extended the framework to include geographic diffusion modeling. More recently, Auderset et al. (2023: 35) used similar methods to resolve internal structure within Mixtecan — a family with analogous problems of shallow time depth and heavy contact. These studies show that computational phylogenetics can produce informative results even when the signal is weak or partially obscured by borrowing.
I use BEAST 2 (Bouckaert et al. 2014) with a covarion substitution model, which allows evolutionary rates to vary across both characters and branches. For dialect-level data, where some lexical items change rapidly under contact pressure while others remain stable, that flexibility is essential. Model comparison uses Bayes Factors (Kass and Raftery 1995: 773) to choose among alternative tree topologies and substitution models.
This thesis is a proof of concept, not a definitive classification of Southern Kurdish. The dataset is an 85-item borrowing-resistant wordlist elicited from 11 speakers representing 6 SK varieties: Faili, Kalhori, Khanaqini, Qasri, Mandali, and Sahana. No outgroups or comparative taxa are included; the tree is rooted agnostically from the SK data alone. Laki is excluded because no sufficiently detailed published wordlist was available at the time of data collection. The sample covers varieties from the western and southern portions of the SK speech area, overlapping with at least four of Fattah's (2000) seven dialect subgroups — Kalhori-Sanjābi-Zangane, Laki-Kermānshāhi, and likely Malekshāhi and Badre'i. The Bijāri and Kolyā'i groups, located further north inside Iran, are not represented. A full comparison with Fattah's classification is therefore not possible, though the results can be evaluated against the groups the sample does cover.
Three research questions frame the analysis:
- Does the lexical evidence support a hierarchical internal structure for Southern Kurdish, or is the topology better described as an unresolved polytomy?
- If hierarchical structure exists, which varieties cluster together, and with what posterior probability?
- How do the results compare with the subgrouping hypotheses proposed by MacKenzie (1961), Fattah (2000), and Mohammadirad (forthcoming)?
The remainder of the thesis is organised as follows. Chapter 2 reviews the linguistic background of the SK area and introduces the Bayesian phylogenetic framework — including why a probabilistic approach handles the conflicting signal of a contact zone better than parsimony or maximum likelihood does. Chapter 3 describes the fieldwork, speaker cohort, and wordlist design, including the specific decisions that determined which 85 items made the final cut. Chapter 4 lays out the full methodology: IPA transcription, LexStat cognate detection, and the BEAST 2 inference pipeline with its covarion substitution model. Chapter 5 presents the results. Chapter 6 returns to the central questions and considers what a more complete dataset would need to look like to resolve the remaining ambiguity.
The varieties collectively labeled "Southern Kurdish" occupy a geographically fragmented territory along the central and southern Zagros mountain range, spanning the Iraqi-Iranian borderlands. On the Iranian side, this territory encompasses the provinces of Kermanshah and Ilam, extending as far south as Dehloran and as far east as Asadabad in Hamadan Province. On the Iraqi side, the speech area covers the districts of Khanaqin, Mandali, and Badra within the Diyala and Wasit governorates (Fattah 2000: VII). Unlike the expansive plateaus characterizing the Kurmanji-speaking regions of Anatolia, the Southern Kurdish domain is defined by a series of parallel ridges and deep, narrow valleys carved by the tributaries of the Sirwan and Karkheh rivers. This rugged terrain acts both as a natural fortress and as a barrier to easy communication, historically isolating communities in individual valleys — though, as Belelli (2019: 75-76) observes, the resulting varieties are less dramatically divergent than their geographic separation might suggest, forming a compact continuum with considerable mutual intelligibility. To the east and southeast, the region grades into the territory of the Lurs; to the north, it borders the Sorani-speaking Mokriyan and Ardalan regions, as well as the enduring pockets of Gorani and Hawrami speakers. This position places Southern Kurdish speakers at a critical geographical interface the meeting point of the Mesopotamian plains and the Iranian plateau.
The systematic study of these varieties begins with MacKenzie (1961), whose Kurdish Dialect Studies established the foundational tripartite classification of Kurdish into Northern (Kurmanji), Central (Sorani), and Southern groups. MacKenzie identified the lenition of Old Iranian intervocalic -d- (yielding reflexes such as -l-, -y-, -w-, or deletion) as a primary diagnostic isogloss for the Southern group, distinguishing it phonologically from Central Kurdish (MacKenzie 1961: 142). Under this taxonomy, a wide range of heterogeneous dialects Kalhori, Feyli, Kordali, Laki, and others were grouped under a single "Southern Kurdish" umbrella on the basis of shared phonological innovations. For decades, this classification served as the standard in Iranian linguistics, positing a unified ancestral origin for these varieties distinct from the Sorani group.
This unitary model was fundamentally challenged by Fattah (2000) in his comprehensive study Les dialectes kurdes meridionaux, the product of approximately twenty-five years of fieldwork across the Southern Zagros. Fattah documented no fewer than thirty-five distinct varieties within the region traditionally labeled "Southern Kurdish" (Fattah 2000: 22-40), ranging from the geographically detached Bijari variety in the north to Feyli in the south, and including well-documented groups such as Kermanshahi, Sanjabi, Kalhori, and Ilami alongside less-known varieties like Erkawazi, Mekhasi, and Warmizyar. The sheer number of identified varieties raised questions about the coherence of MacKenzie's grouping — though the variety count can overstate the actual linguistic distance. As Haig (personal communication, 2026) notes, SK syntax is highly convergent, phonology at the phonemic level is similar across varieties, and morphological differences tend to be modest. The classification problem is not one of extreme divergence but of non-aligning minor differences — a pattern where no single isogloss neatly separates one group from another.
Fattah also proposed a preliminary internal classification, grouping his SK varieties into seven dialect subgroups arranged roughly from north to south: (1) Bijāri, the isolated northernmost enclave in Kordestān Province; (2) Kolyā'i, covering the northern and eastern reaches of Kermānshāh Province into Hamadān; (3) Laki-Kermānshāhi, centred on the Sahne and Harsin counties; (4) Kalhori-Sanjābi-Zangane, the geographically largest group, stretching from west of Kermānshāh city through the Kalhor territories to the Iraqi border at Khānaqin; (5) Malekshāhi, in northwestern Ilām Province and across to Zurbātiya in Iraq; (6) Badre'i, in Darre Shahr county and the Baghdad diaspora; and (7) Kordali, at the southern periphery in Dehlorān and Ābdānān (Fattah 2000: 22-40; cf. Belelli 2019: 78-80, Fig. 2). The classification has significant limitations. As Belelli (2019: 80-81) observes, Fattah's subgroups are primarily ethnolects — defined by tribal affiliation rather than by strictly linguistic criteria — and he never specifies which diagnostic features set each group apart from the others. The groups show considerable internal variation, and several varieties are transitional between subgroups: Sahne, nominally Laki-Kermānshāhi, often aligns linguistically with Kolyā'i, while Xānaqin diverges from the Kalhori-Sanjābi-Zangane group to which it is nominally assigned. No quantitative assessment of dialect distances has been attempted (Belelli 2019: 81). The classification remains impressionistic (Haig, personal communication, 2026), but it is the only concrete internal classification of SK that exists, and any new proposal must engage with it.
Fattah's most consequential argument concerned the status of Laki. He observed that Laki preserved a feature of fundamental typological significance: split ergativity. In Laki, the ergative construction is maintained in past transitive clauses, aligning it with Kurmanji (Northern Kurdish) and distinguishing it sharply from all other Southern Kurdish varieties, which have shifted to a nominative-accusative alignment (Fattah 2000: 55-56). This morphosyntactic criterion led Fattah to propose that Laki cannot be considered a dialect of Southern Kurdish at all, but rather constitutes a separate branch what has sometimes been termed the "fourth language" of the Kurdish family, alongside Kurmanji, Sorani, and a reduced Southern Kurdish core. The Laki-speaking territory, stretching from Khorramabad to east of Kermanshah and from Holeylan to Harsin (Fattah 2000: 56-57), with enclaves as far as Khorasan and Mazandaran provinces, represents a substantial speech community whose taxonomic placement remains contested.
The question of Laki's position has been further explored by Anonby (2003), who examined the broader Luri-Laki-Kurdish boundary and found that the taxonomic divisions in this region resist clean resolution. The difficulty lies in the fact that Laki shares lexical and phonological features with Southern Kurdish while retaining the morphosyntactic ergativity characteristic of more northern varieties a combination that defies placement in any single branch of a tree model. Fattah's reclassification left a reduced Southern Kurdish core centered on Kalhori and Feyli, but it simultaneously opened a broader debate about where the boundaries of "Kurdish" itself should be drawn.
Contemporary scholarship has moved toward a more nuanced position. Haig (2008: 103) notes that structural innovations in the Southern Zagros such as the widespread loss of ergativity in most Southern Kurdish varieties conflict with the phonological isoglosses that MacKenzie used to define the group. A feature-based classification thus produces contradictory results depending on whether phonological or morphosyntactic criteria are prioritized. Mohammadirad (forthcoming) extends this observation, arguing that the Southern Zagros functions as a "transitional zone" where Northwestern Iranian traits (typical of Kurdish) and Southwestern Iranian traits (typical of Persian and Luri) overlap without bundling into neat isogloss packages. In this view, "Southern Kurdish" is less a genetic clade than a label of convenience for a complex contact zone. Belelli (2019; forthcoming), in her work on the Southern Kurdish dialect continuum and on the verb in Southern Kurdish, and Borjian (2024), in a recent grammatical description of the Kermanshahi dialect, offer additional perspectives on the distinctiveness of specific Southern Kurdish varieties, though the fundamental question remains open: do these dialects descend from a common Southern Kurdish ancestor, or has prolonged contact created the appearance of unity where none exists genealogically?
The following table summarises the principal classification proposals for SK and its position within Kurdish and West Iranian:
| Source | Proposed grouping | Basis | Status of Laki | Key limitation |
|---|---|---|---|---|
| MacKenzie (1961) | Three-way split: NK / CK / SK (broad, including Laki) | Phonological isoglosses (lenition of -d-) | Included in SK | Single criterion; lumps heterogeneous varieties |
| Fattah (2000) | 7 SK subgroups (BIJ, KOL, L-KER, KSZ, MAL, BAD, KOR); Laki removed | Mixed: tribal, geographic, phonological, morphosyntactic | Separate branch ("fourth language") | Impressionistic; diagnostic features unspecified; ethnolects not geolects |
| Anonby (2003) | Laki–Luri–Kurdish boundary unresolvable by tree model | Lexical + morphosyntactic | Intermediate: shares features with both SK and Luri | No internal SK classification proposed |
| Haig (2008) | Feature-based classification produces contradictory results | Structural innovations vs. phonological isoglosses | Not addressed directly | Demonstrates the problem, not a solution |
| Aliakbari et al. (2014) | "Ilāmi (Feyli) dialect group" for southern varieties | Geographic + ethnographic | Not addressed | Different internal groupings from Fattah for Ilām Province |
| Belelli (2019) | SK = compact dialect continuum; Fattah's groups not sharply defined | Critical review of Fattah + own fieldwork | L-KER transitional toward Laki | No alternative classification offered |
| Mohammadirad (forthcoming) | SK = "transitional zone" between NW and SW Iranian | Isogloss overlap, contact features | Not addressed | Questions whether SK is a genetic clade at all |
The reason these proposals disagree can be illustrated by examining a handful of diagnostic features across Fattah's subgroups. The following table, drawing on Belelli (2019: 81-88) and Fattah (2000: 280-375), shows how key features cut across rather than align with group boundaries:
| Feature | BIJ | KOL | L-KER | KSZ | MAL | BAD | KOR | Classification signal |
|---|---|---|---|---|---|---|---|---|
| Intervocalic -d- loss | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Unites all SK vs. CK (MacKenzie's criterion) |
| Ergativity retained | — | — | — | — | — | — | — | Lost in all SK; retained in Laki/Kurmanji |
| /v/ (vs. cSK /w/) | — | — | ✓ | — | — | — | some | Links L-KER to Laki, separates from cSK |
| Present prefix ma- | — | — | ✓ | — | — | — | — | Links L-KER to Laki only |
| Reflexive wiž (vs. cSK xwa-/xo-) | — | — | ✓ | — | — | — | — | Links L-KER to Laki only |
| Animacy in demonstratives | ✓ | — | — | ✓ | ✓ | ✓ | ✓ | Groups BIJ + southern SK, excludes KOL + L-KER |
| Dark /ɫ/ phonemic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Shared across SK (except Kermanshah city, Mandali) |
| Past participle suffix | -īg | -ī/-e | -ī/-e | -ī/-e | -īg | -īg | -a | BIJ + MAL + BAD vs. KOL + L-KER + KSZ vs. KOR alone |
The pattern illustrates the core problem: features 1-2 define SK as a group, but features 3-5 pull Laki-Kermānshāhi toward Laki, feature 6 cuts across groups in yet another way, and feature 8 creates a three-way split that matches none of the above. No single feature or feature class produces a clean internal tree — the signal is genuinely conflicting. No existing study has applied quantitative methods to the internal classification of SK — the gap this thesis addresses.
The extreme dialectal fragmentation of the region where neighboring valleys may speak significantly different varieties can be attributed in part to its historical status as a frontier zone. Throughout the early modern period, the Southern Zagros formed the volatile boundary between the Ottoman Empire and the Persian Safavid (and later Qajar) dynasties. Central state control was often weak or intermittent, allowing tribal confederacies the Kalhor, the Feyli, the Sanjabi, and others to emerge as the primary sociopolitical units. Dialect boundaries frequently tracked with tribal allegiances rather than mere geography, such that a "dialect" was often the marker of a specific tribal identity. Seasonal transhumance between lowland and highland pastures prevented the stabilization of a single regional standard, while imperial policies of forced resettlement (deportation) of rebellious tribes transplanted speech communities and created a patchwork linguistic map. The result is a region where historical instability, topography, and tribalism have conspired to preserve a rich but fragmented linguistic mosaic.
The situation is further complicated by the historical presence of Gorani in the same geographic space. Gorani served for centuries as a prestige literary and religious language across much of the Southern Zagros, particularly among the Yarsani (Ahl-e Haqq) communities (Mahmoudveysi et al. 2012). The interaction between vernacular Kurdish dialects and this Gorani superstrate introduced varying degrees of lexical and structural borrowing into different Southern Kurdish varieties, creating yet another layer of similarity that is areally rather than genetically motivated. As Haig has cautioned, apparent lexical closeness between Gorani and Southern Kurdish varieties may reflect centuries of geographic co-existence and mutual borrowing rather than deep genealogical affinity (Haig, personal communication, January 2026) a distinction that purely lexical data cannot resolve on its own.
The contemporary situation adds urgency to the documentation of these varieties. Over the past century, the Southern Kurdish speech area has been subject to profound demographic upheaval. The Iran-Iraq War (1980-1988) devastated the borderlands, displacing entire communities from towns such as Khanaqin, Mandali, and Qasr-e Shirin. The Ba'athist Arabization campaigns of the 1970s and 1980s forcibly relocated Kurdish populations from Diyala Governorate and replaced them with Arab settlers, severing generations-old ties between speech communities and their geographic bases. On the Iranian side, urbanization has drawn speakers from rural valleys into Kermanshah, Ilam, and Tehran, where the prestige of Persian in education, media, and administration accelerates language shift. Younger speakers in urban settings increasingly use Persian as their primary language of daily interaction, reserving Kurdish for intra-family communication or abandoning it altogether. The result is a situation in which the most linguistically conservative speakers those who preserve the full range of phonological distinctions documented by Fattah are overwhelmingly elderly, rural, and declining in number. Several of the varieties Fattah catalogued in the 1970s and 1980s may no longer be spoken in their traditional form; others survive only in displaced communities in Baghdad, Erbil, or the European diaspora, where contact with Arabic, Sorani, or the host-country language introduces new layers of convergence. Any empirical study of Southern Kurdish internal structure must therefore contend not only with the historical complexity of the dialect landscape but also with the reality that this landscape is actively eroding.
It is this unresolved state of affairs a classification built on contradictory evidence, challenged by internal diversity, complicated by tribal fragmentation and imperial displacement, and obscured by layers of contact with Gorani, Arabic, and Persian that motivates the present study. If traditional qualitative methods have failed to produce consensus on the internal structure of Southern Kurdish, a quantitative approach capable of handling conflicting signals may offer new traction.
The difficulty of classifying Southern Kurdish is not unique to Kurdish linguistics. It is a specific instance of a general problem that has occupied historical linguistics since its inception: the inadequacy of discrete models when applied to continuous data.
The dominant metaphor in historical linguistics remains the Stammbaumtheorie, or Family Tree Model, formalized by August Schleicher in the mid-nineteenth century. The tree model posits that languages diverge through clean bifurcations: a parent language splits into daughter languages, which then develop independently. Shared innovations between daughter languages are interpreted as evidence of common ancestry, and the resulting tree represents a nested hierarchy of genetic relationships. The model is powerful and intuitive, and it has proven remarkably successful for reconstructing deep genealogical relationships the Indo-European family tree being the paradigmatic example.
However, the tree model carries a critical assumption: that once a speech community separates, it ceases to share significant innovations with its sister branches. This assumption is routinely violated in dialect geography, where adjacent varieties maintain contact and continue to exchange features long after any putative "split." The result is the dialect continuum a situation in which transitions between varieties are gradual rather than abrupt, mutual intelligibility decreases with distance, and isoglosses refuse to bundle in the neat patterns that a tree model would predict.
The limitations of the tree were recognized early. In 1872, Johannes Schmidt proposed the Wellentheorie, or Wave Theory, which models linguistic change as innovations radiating outward from a center, weakening with distance. The Wave Theory captures a reality that the tree cannot: that features diffuse laterally across established boundaries, creating overlapping distributions that resist hierarchical classification. Applied to the Southern Zagros, the Wave Theory offers a plausible account of why phonological isoglosses and morphosyntactic isoglosses cross-cut one another they represent different waves of innovation, propagated at different times from different centers, and there is no reason to expect them to align.
Paul (1998) brought this theoretical tension into sharp focus for the West Iranian context. In his analysis of the position of Zazaki among West Iranian languages, Paul suggested that the entire West Iranian family not merely Kurdish functions as a dialect continuum stretching from Northwestern varieties (Kurdish, Zazaki, Gorani, Tatic) through transitional forms to Southwestern varieties (Persian, Luri). The isoglosses defining "Northwestern Iranian" as a genetic grouping are often retentions rather than shared innovations: features preserved from Proto-Iranian that were subsequently lost in the Southwest (cf. Korn 2019: 239). A grouping defined by retentions, Paul argued, is inherently weaker than one defined by exclusive innovations, because retentions can be independently preserved in unrelated lineages. The consequence is that "Northwestern Iranian" may be a retention-based grouping rather than a tight genetic clade — its member varieties (Kurdish, Zazaki, Hawrami, Balochi, Mazanderani) are geographically scattered, and the features they share may have been independently preserved rather than jointly inherited. The same uncertainty propagates downward to its supposed subgroups, including Kurdish.
For Southern Kurdish specifically, the continuum problem is acute. The varieties of the Southern Zagros sit at the intersection of multiple diffusion zones. Arabic vocabulary has penetrated from the west through centuries of Islamic administration (Opengin 2020: 459); Persian influence radiates from the east through state education, media, and commerce; and Gorani features have diffused through centuries of religious and literary prestige. Each of these contact layers creates a "wave" of similarity that has nothing to do with shared ancestry. The influx of Arabic vocabulary into Kurdish, Persian, and Zazaki creates a superficial layer of similarity across the entire West Iranian spectrum, while the long-standing political and cultural dominance of Persian has led to the progressive "Persianization" of certain Southern Kurdish varieties, particularly in syntax and idiom. These contact phenomena act as noise in the phylogenetic signal: a feature shared by a Southern Kurdish variety and Persian might be misinterpreted as evidence of genetic closeness when in reality it reflects centuries of bilingualism.
The challenge for any classificatory method is therefore to disentangle these layers: to separate (1) genetic retentions inherited from a common ancestor, (2) shared innovations arising from early dialect proximity, and (3) later convergence effects driven by contact with prestige languages. The failure to distinguish these layers leads to the misclassification of contact-induced similarity as genetic kinship a pitfall that the strict application of the tree model often invites.
Attempts to address the continuum problem computationally have yielded instructive precedents. Heggarty et al. (2023) applied Bayesian phylogenetic methods to the Quechua family of South America, a case that shares structural parallels with Southern Kurdish: a dialect continuum spread across rugged terrain, with a history of imperial disruption (Inca, then Spanish) and ongoing contact-driven convergence. Their analysis demonstrated that probabilistic tree-building could recover meaningful subgroupings even where traditional isogloss-based methods had produced contradictory classifications, though they also found that the posterior distributions at certain nodes reflected genuine ambiguity in the data rather than methodological failure. Cathcart (2020: 1), working specifically on West Iranian, applied phylogenetic network methods to a broader set of Iranian languages and confirmed that the Northwestern-Southwestern divide is not a clean bifurcation but a gradient, with Kurdish varieties occupying an intermediate position that reflects both retained archaisms and contact-driven innovations. These studies suggest that computational methods do not dissolve the continuum problem but reframe it: instead of forcing a binary classification, they quantify the degree to which the data supports competing hypotheses and identify the specific loci where conflicting signals are strongest.
Traditional qualitative methods comparing isoglosses, weighing their relative importance, debating which features are "diagnostic" have not produced consensus for Southern Kurdish. The disagreements between MacKenzie, Fattah, and the transitional-zone school are not failures of scholarship but symptoms of a fundamental mismatch between the analytical tool (the discrete tree) and the empirical reality (the contact-saturated continuum). What is needed is a method that does not force an either-or choice between tree and wave, but instead quantifies the degree to which the data supports one topology over another and honestly reports when the signal is ambiguous. This is the promise of Bayesian phylogenetic inference.
Bayesian phylogenetic methods entered historical linguistics through the landmark study of Gray and Atkinson (2003: 435), which applied computational techniques originally developed for molecular biology to the Indo-European family. By analyzing cognate data across Indo-European languages using Bayesian Markov Chain Monte Carlo (MCMC) algorithms, Gray and Atkinson demonstrated that probabilistic tree-building could recover known subgroupings with high statistical confidence while simultaneously dating divergence events. The study was controversial its dating estimates were contested by historical linguists who argued the method produced impossibly old dates supporting the Anatolian hypothesis over the Steppe/Kurgan model (Balter 2004; Heggarty 2006), by scholars who demonstrated that forcing linguistic data into bifurcating trees misinterprets contact-induced similarity as shared inheritance (McMahon and McMahon 2005), and most forcefully by Pereltsvaig and Lewis (2015), who argued that the approach amounted to "glottochronology with bells and whistles," privileging mathematical abstraction over archaeological and linguistic evidence. More broadly, critics questioned the transferability of biological models to cultural data, noting that language evolution is heavily reticulate due to borrowing and contact (McMahon and McMahon 2005), that binary cognate coding oversimplifies complex linguistic relationships (Heggarty 2006), that the tree model is fundamentally unsuited for dialect chains and linkage situations (François 2014), and that phylogenetic trees model only "spread zones" while failing to capture the long-term equilibrium of accretion zones (Nichols 1992). Despite these critiques, the study established a methodological precedent that has since been extended to language families worldwide, including Austronesian (Gray, Drummond and Greenhill 2009), Semitic (Kitchen et al. 2009), Japonic (Lee and Hasegawa 2011), Bantu (Grollemund et al. 2015), Turkic (Hruschk... [truncated]
The core logic of Bayesian inference differs fundamentally from the deterministic approaches of traditional classification. Where a parsimony analysis seeks the single tree that minimizes the total number of character-state changes, and a maximum likelihood analysis seeks the single tree that maximizes the probability of the observed data, Bayesian inference produces a posterior distribution a probability-weighted sample of thousands of plausible trees (Larget and Simon 1999; Felsenstein 2004: Ch. 18). Each tree in the sample represents a hypothesis about the evolutionary history of the data, and the frequency with which a given relationship appears in the sample provides a direct measure of confidence in that relationship (Huelsenbeck et al. 2001: 2311). Unlike bootstrap support values, which measure sampling consistency, posterior probabilities represent the probability that a given clade is correct given the data and model (Yang 2006: 5.3). A clade appearing in 95% of sampled trees receives a posterior probability of 0.95; one appearing in only 55% is flagged as uncertain, likely reflecting conflicting signals in the data (Felsenstein 2004; Drummond and Bouckaert 2015).
This quantification of uncertainty is precisely what the Southern Kurdish classification problem demands. A traditional analysis must declare a variety either "inside" or "outside" a group; a Bayesian analysis can report that the data supports inclusion with, say, 0.72 probability an honest representation of ambiguity that is more informative than a forced binary choice.
The inference proceeds through MCMC sampling. The algorithm begins with an initial tree and a set of model parameters a substitution model governing how character states change over time, a clock model specifying the rate of change, and a tree prior encoding assumptions about the branching process. The algorithm then proposes small modifications swapping branches, adjusting rates and accepts or rejects each proposal based on how well the modified tree explains the observed data. Over millions of iterations, the chain converges on the high-probability regions of "tree space," producing a representative sample of the posterior distribution. Convergence is assessed by examining Effective Sample Size (ESS) values for key parameters; ESS values exceeding 200 indicate that the chain has explored the distribution adequately rather than becoming trapped in a local optimum (Drummond and Bouckaert 2015: 128; Rambaut et al. 2018).
The substitution model governs how character states (in this case, cognate presence or absence, coded as binary 1/0) change over evolutionary time. For binary linguistic data, the Binary Continuous-Time Markov Chain (CTMC) model (Lewis 2001; Bouckaert et al. 2012, 2014) estimates the rates at which cognates are gained and lost along the branches of the tree. Gamma-distributed rate heterogeneity (Yang 1994), adapted for linguistic data by Atkinson and Gray (2005), accounts for the empirical observation that some lexical items evolve faster than others function words and basic body-part terms are notoriously resistant to replacement (Swadesh 1952; Pagel, Atkinson and Meade 2007), while cultural vocabulary turns over rapidly (Tadmor, Haspelmath and Taylor 2010). The tree prior specifies assumptions about the branching process itself; the Yule model (Gernhard 2008), which assumes a constant rate of lineage splitting without extinction, provides a parsimonious baseline appropriate for datasets representing distinct varieties rather than population-level samples (Chang et al. 2015; Bouckaert et al. 2012).
The fundamental limitation of any tree-based method including Bayesian phylogenetics is the assumption that evolution is fundamentally bifurcating. Languages do not always split cleanly; they borrow, converge, and hybridize. Bayesian inference does not eliminate this assumption, but it provides tools to detect its violations. When the data contains strong conflicting signals as would be expected in a contact-saturated continuum the posterior distribution reflects this conflict: the consensus tree shows low support values at the contested nodes, and visualization tools such as DensiTree display the full cloud of sampled trees, revealing areas of the topology where alternative groupings compete. Complementary network methods, such as Neighbor-Net (Bryant and Moulton 2004: 255), can be employed alongside the Bayesian analysis to explicitly visualize non-tree-like signal, distinguishing areas of clean divergence from areas of reticulation where contact has blurred the genealogical record.
The software ecosystem for Bayesian phylogenetic inference has matured considerably since Gray and Atkinson's original study. BEAST 2 (Bouckaert et al. 2014), the successor to BEAST, has emerged as the standard platform for linguistic phylogenetics due to its modular architecture, which allows researchers to specify substitution models, clock models, and tree priors independently and to compare competing configurations through Bayes factor analysis. Alternative frameworks exist MrBayes (Ronquist et al. 2012: 539) remains widely used in biology, and RevBayes (Höhna et al. 2016: 726) offers a flexible scripting environment but BEAST 2's extensive use in linguistic studies, from Chang et al. (2015: 194) on Indo-European to Grollemund et al. (2015: 13296) on Bantu, provides a rich body of methodological precedent and validated parameter settings that can be adapted to new datasets. The choice of BEAST 2 for the present study is motivated by this accumulated expertise and by the availability of linguistic-specific packages, including the binary covarion model that accommodates the rate heterogeneity characteristic of lexical evolution.
It should be noted that the scale of the present study differs substantially from the large comparative projects that have defined the field. Auderset et al. (2023: 35), whose analysis of the Mixtecan family serves as a methodological model for this thesis, sampled over one hundred varieties across a language family with high internal diversity. The present dataset eleven speakers representing six Southern Kurdish varieties is deliberately narrower in scope. The claims made on the basis of this analysis must be proportionally modest. This thesis does not aim to resolve the prehistory of West Iranian or to produce a definitive classification of Kurdish. It is conceived as a proof of concept: a demonstration that Bayesian phylogenetic methods, applied to carefully curated lexical data from a well-defined geographic zone, can provide new empirical traction on classification questions that qualitative methods have left unresolved. The methodology from fieldwork through transcription, cognate detection, and probabilistic inference is designed to be replicable and scalable. If the approach proves viable for the internal structure of Southern Kurdish, it can be extended to the broader Kurdish family and, ultimately, to the West Iranian continuum as a whole.
Eleven native speakers representing six SK varieties contributed recordings to the dataset. I stratified the cohort to capture the geographic and social breadth of the SK speech area rather than simply recruiting the most accessible speakers. The cohort includes speakers from both sides of the Iraq-Iran border — Khanaqin and the Baghdad diaspora on the Iraqi side; Kermanshah, Qasr-e Shirin, and Sahana on the Iranian side — but is concentrated in the western and southern SK region. Of Fattah's seven subgroups, the sample overlaps with at least four (Kalhori-Sanjābi-Zangane, Laki-Kermānshāhi, and likely Malekshāhi and Badre'i via the Faili diaspora speakers), while the Bijāri and Kolyā'i groups further north in Iran are absent. The two Faili speakers (Fail01, Fail02) originated from the Baghdad diaspora — a community whose Kurdish was formed under the forced deportations of the 1980s, and which has since lived alongside Iraqi Arabic and Persian in ways that left traces in the lexicon. The Khanaqin group contains the most speakers: four (Khan01–Khan04) from a single location, which allows idiolectal variation within Khanaqini to be separated from the inter-variety signal that is the analysis's primary target. From Iran, I recorded a Kalhori speaker (Kalh01) from Klaw Drizh near Kermanshah, a Qasri speaker (Qasr01) from Qasr-e Shirin, and Saha01 from the town of Sahana in Sahne county, Kermanshah Province — varieties their speakers called "Kalhori," "Qasri," and "Sahane" (or "Kurdi Xwareen") respectively. Kalh01 and Saha01 both originate from the Kermanshah area but fall into different Fattah subgroups — Kalhori-Sanjābi-Zangane and Laki-Kermānshāhi respectively — illustrating the non-alignment of geographic proximity and dialect classification that characterizes the region. Mand01, born in Khanaqin but associated with the Mandali variety, sits on the Arab-Kurdish linguistic border. Finally, a reference dataset (Oxfo01), contributed by Masoud Mohammadirad of the University of Cambridge, links the fieldwork corpus to the varieties documented in the forthcoming Oxford Handbook of Kurdish Linguistics (Haig et al., forthcoming).
The following table maps each speaker to Fattah's (2000) seven-group classification (see §2.1, Table 1), illustrating the sample's coverage and its gaps:
| Speaker | Thesis variety | Origin | Fattah subgroup | Notes |
|---|---|---|---|---|
| Fail01 | Faili | Baghdad diaspora | MAL / BAD (uncertain) | Feyli community; Fattah places some Baghdad SK in both groups |
| Fail02 | Faili | Baghdad diaspora | MAL / BAD (uncertain) | Same community as Fail01 |
| Kalh01 | Kalhori | Klaw Drizh, Kermanshah | KSZ | Self-identifies as Kalhori |
| Khan01 | Khanaqini | Khanaqin / Baghdad | KSZ (nominally) | Belelli notes Xānaqin diverges from KSZ |
| Khan02 | Khanaqini | Khanaqin / Baghdad | KSZ (nominally) | — |
| Khan03 | Khanaqini | Khanaqin / Baghdad | KSZ (nominally) | Variety identified as Kalhori-Khanaqini |
| Khan04 | Khanaqini | Khanaqin | KSZ (nominally) | Youngest speaker; contact variables from Lori-speaking mother |
| Qasr01 | Qasri | Qasr-e Shirin | KSZ | Sanjābi territory; cross-border speaker |
| Mand01 | Mandali | Khanaqin / Mandali | KSZ / MAL (border) | Self-identifies as Mandali; wife is Khanaqini |
| Saha01 | Sahana | Sahne, Kermanshah | L-KER | Xwareen variety; transitional group per Belelli (2019) |
| Oxfo01 | Reference | Bijār or Ilām (TBC) | BIJ or unknown | Mohammadirad contribution; variety to be confirmed |
The sample is concentrated in the Kalhori-Sanjābi-Zangane (KSZ) group, with single representatives of Laki-Kermānshāhi (L-KER) and possibly Bijāri (BIJ) via the reference variety. The Kolyā'i (KOL) and Kordali (KOR) groups are absent. The Faili speakers' exact placement within Fattah's scheme is uncertain — the Baghdad SK diaspora straddles both his Malekshāhi and Badre'i groups (Belelli 2019: 79), and the label "Feyli" itself functions not as a dialect-specific designation but as a cover term mapping to the broader Ilāmi dialect cluster comprising Malekshāhi, Badre'i, and Kordali (Aliakbari et al. 2014: 7-8; cf. Fattah 2000: 70-74). Without precise data on the speakers' family origins predating their displacement to Baghdad — or systematic acoustic analysis of diagnostic phonological variables such as the retention versus replacement of dark /ɫ/ by clear /l/, which Belelli (2019: 83) identifies as characteristic of the Mandali area within the Malekshāhi zone — assigning these speakers to a single Fattah subgroup would be speculative. The uncertainty is recorded honestly in Table 2 above. This imbalance across subgroups is an unavoidable consequence of fieldwork constraints (§1.4), but it means the phylogenetic results will have stronger resolution for the KSZ cluster than for inter-group relationships.
The cohort runs from roughly 26 to 64 years old — five women, six men. In principle, age-graded variation could reveal ongoing sound changes or lexical replacement in progress: younger speakers exposed to more Sorani and Persian schooling might show higher rates of borrowing than older speakers who acquired Kurdish in predominantly monolingual home environments. In practice, however, the age range of this sample is too narrow and too skewed toward middle-aged and older speakers to support robust apparent-time inferences. Ten of the eleven speakers were born between 1962 and 1984, forming a relatively homogeneous generational cohort; only Khan04 (born c. 2000) represents a younger generation. Any age-related patterns in the data should therefore be interpreted cautiously — they may reflect individual biography rather than generational change.
Gender also matters here, though in ways that depart from the standard sociolinguistic pattern. In the variationist tradition established by Labov (1990), women typically lead linguistic change, adopting prestige forms and innovations faster than men. In the Kurdish context, the pattern inverts — or at least complicates — because the relevant prestige languages (Persian, Arabic) are not internal innovations but external state languages, and access to them is mediated by gendered structures of education and employment. Male speakers, historically more integrated into the state economy through military service, government employment, and commerce, show higher rates of Persian and Arabic lexical borrowing. Female speakers — particularly in older generations and rural communities — were often excluded from formal education and public-sphere employment, and as a result preserve conservative phonological and lexical features that have been leveled in male speech. The intersection of gender, education, and institutional power thus produces an outcome opposite to what a straightforward application of Labov's gender paradox would predict: it is the socially less mobile speakers who retain the more conservative forms, not because of any inherent conservatism, but because the channels through which contact-induced change enters the community — schools, markets, bureaucracy — were disproportionately accessible to men.
Fail01 (male, born 1984) spent six years in Baghdad before moving to Tehran (seven years) and Ilam (five years), settling in Sulaymaniyah around 2002; he is multilingual in Faili, Arabic, Farsi, Kalhori, and English. Fail02 (female, born 1968) spent twelve years in Baghdad followed by twenty-four years in Tehran until she moved to Sulaymaniyah in 2019. Kalh01 is an unusual case: born in 1981 in Klaw Drizh, he grew up there for twenty-five years before stints in Tehran and southern Iran, and has lived in Germany since 2016. He speaks Kalhori, Farsi, English, Sorani, and German, with passive Kurmanji and formal Arabic; he also holds an MA in General Linguistics, with a thesis on Kalhori phonology drawn from fieldwork in his home region — making him both a native speaker and a trained observer of his own variety. Khan01 (female, born 1970) spent her entire life in Baghdad from birth until she left in 2006; her son Muhamed worked as translator across two sessions in April 2023, speaking Arabic to her during elicitation. Khan02 (female, born 1967) spent twenty-five years in Khanaqin and sixteen in Baghdad before moving to Sulaymaniyah; she holds a BA in Kurdish from Baghdad and speaks Germiani alongside Khanaqini, Arabic, and some Turkish. Khan03 (male, born 1978) settled in the Sulaymaniyah-Khanaqin area after ten years in Baghdad and five in Khanaqin; his variety is identified as Kalhori-Khanaqini, with Arabic and Farsi as secondary languages. The youngest speaker, Khan04 (female, born circa 2000 in Khanaqin), had recently moved to Sulaymaniyah for the first time at the time of recording; her mother speaks Lori and her father is from Sulaymaniyah, introducing contact variables that need to be kept in mind when reading her data. Qasr01 (male, born 1973) grew up in Qasr-e Shirin, spent years in Kermanshah and Karaj-Tehran, then crossed into Iraq around a decade ago — his speech is a direct link between the Iranian and Iraqi sides of the Southern Kurdish area. Mand01 (male, born circa 1962) was born in Khanaqin, spent roughly twenty years there and in Mandali and Baghdad, and has lived in Sulaymaniyah since 2006; he identifies with the Mandali variety but speaks Khanaqini at home with his wife, alongside Arabic, Farsi, and Turkmen. Saha01 (female, born circa 1978 in Sahana) left her hometown after twenty-two years for periods in Kermanshah and Tehran, and arrived in Sulaymaniyah about eighteen months before the recording; she calls her language "Sahane" or "Kurdi Xwareen," speaks Farsi, Laki, and some Sorani, and uses her L1 at home. The reference variety (Oxfo01) was documented by Mohammadirad for the Oxford Handbook corpus and likely reflects the Bijar or Ilam varieties central to his research (Mohammadirad, personal communication).
Six of the eleven speakers have lived in Baghdad, Tehran, or Karaj at some point; three have crossed the Iraq-Iran border. This pattern of displacement and multilingualism is not an anomaly to be controlled away but the defining sociolinguistic reality of the Southern Kurdish borderlands — a reality that any analysis of this region must confront rather than exclude.
What this mobility means for the phylogenetic analysis is worth stating plainly. If a speaker left Sahana at age twenty-two and spent the next two decades in Kermanshah and Tehran, the lexical items she produces in 2024 may reflect not only her hometown variety but also accommodation to the speech of those cities. The same applies to speakers who spent formative decades in Baghdad, where Arabic and Sorani exert constant pressure on the Kurdish lexicon. The "original regional signal" that a phylogenetic tree attempts to recover is therefore filtered through layers of adult contact — and there is no reliable method for stripping those layers away. This does not invalidate the analysis, but it does mean the resulting tree should be read as a classification of speakers' current repertoires rather than of idealized, pre-contact dialect systems. Where the tree shows unexpected clustering or low posterior support at a node, speaker mobility is one of several possible explanations that the discussion chapter will need to weigh.
Zaem mediated most fieldwork sessions — a Germiyani-Khanaqini bilingual from Sulaymaniyah (born circa 2000) who speaks Germiyani (L1), Khanaqini, Arabic, and English, and visits family in Khanaqin regularly. His knowledge of the Khanaqini variety and the local community networks made establishing the rapport needed for naturalistic elicitation considerably easier. Perya, a professional legal translator, handled the Sahana session, using Sorani and Farsi during elicitation and English with me. The variation in elicitation languages across sessions — Arabic, Sorani, Khanaqini, and English — is a potential confound that needs to be acknowledged: different prompt languages may activate different lexical access pathways, potentially influencing which variant of a target item a speaker produces. I mitigated this through the bilingual check protocol described above.
I collected the data within what Fattah (2000: 56) terms a linguistic "shatter zone" (zone d'éclatement) — the borderlands of the Southern Zagros spanning from Khanaqin in Iraq through Kermanshah and Ilam in Iran. In this region, Kurdish vernaculars coexist with Arabic and Persian as dominant state languages, producing a linguistic ecology defined by multilingualism, code-switching, and contact-induced change. Standard dialectological methods, which often rely on formal elicitation in institutional settings, risk triggering exactly the kind of register shift that obscures the vernacular baseline. Labov's (1972) Observer's Paradox is particularly acute in communities with a history of linguistic suppression, where formal interview contexts prime the use of prestige varieties — Standard Sorani, Arabic, or Persian — rather than the local Southern Kurdish form.
To mitigate this, I conducted recordings in informal settings rather than institutional environments. Most Iraqi-based sessions took place in a monastery library in Sulaymaniyah, a semi-private space familiar to the local community. I recorded Saha01 in a purpose-built studio with soft furnishings and wooden panelling that minimized both acoustic interference and the formality of the setting. For Kalh01 in Germany, I interviewed him at his dining table in Moers — a domestic setting that reproduced the informal conditions of the Iraqi sessions as closely as possible. I used a Rode Wireless Go II microphone system paired with a Rode VideoMic Pro+ mounted on a Sony A7S III camera. All sessions were recorded in 4K at 60 frames per second, a deliberate choice that allowed frame-by-frame analysis of lip rounding, tongue position, and jaw movement during post-processing — providing visual corroboration for segments that were acoustically ambiguous. The informal settings occasionally introduced ambient noise, but I treated this as an acceptable trade-off: acoustic noise can be cleaned in post-processing, whereas the sociolinguistic distortion of a formal register, once recorded, cannot be removed.
I conducted fieldwork across multiple sessions between April 2023 and May 2025 — three periods in Iraq (April 2023, November 2023, and April–May 2025) and one in Germany (May 2025, Kalh01).
To ensure semantic comparability across the diverse speaker group, I used a translation task for elicitation. The elicitation language varied by speaker and translator: Arabic for Khan01 (mediated by her son), Sorani for Qasr01 and others via the primary translator, Khanaqini for Khan03, and English for Kalh01. I gave speakers prompts and carrier sentences, supplemented by physical gestures, pointing, and direct questions such as "what is this called?" The primary translator, Zaem, was particularly effective in Khanaqini sessions — his ability to clarify target meanings in the speakers' own variety reduced misunderstandings and minimized reliance on contact-language mediation. While translation tasks carry a risk of lexical priming, this method was necessary to secure the specific items required for phylogenetic analysis, which spontaneous speech often fails to produce in sufficient density. Crucially, prompts never contained the target word in Southern Kurdish, so speakers were not primed with the form being elicited. When an informant's response suggested a misunderstanding, Zaem would embed the target meaning in a carrier sentence in another language, or rephrase the question, until the intended concept was clear. Each speaker repeated every lexical item three times. I took the second repetition as the primary datum: the first often reflects an initial search process — hesitation, self-correction, or code-switching into the prompt language — while the third may introduce hyper-articulation or fatigue effects. Where the second repetition was degraded by noise or disfluency, I substituted the first.
I obtained verbal informed consent from all participants — a deliberate choice given the political sensitivity of the region and the history of displacement affecting communities such as the Faili Kurds. Speakers understood that all data would be used strictly for linguistic analysis and that their identities would be anonymized in the resulting publications. For video recordings, I informed participants that only the lower portion of the face (mouth and nose downward) could be made publicly available, ensuring they remained visually unidentifiable.
The construction of a phylogenetically informative wordlist in a contact-saturated zone requires a principled approach to distinguishing genealogical signal from areal noise. The theoretical framework for this distinction is provided by the Loanword Typology Project (Haspelmath and Tadmor 2009), which demonstrates that lexical borrowing is not stochastic but governed by predictable semantic constraints. "Cultural" vocabulary -- terms relating to religion, law, administration, and technology -- is highly susceptible to replacement under conditions of superstratal influence, while "basic" vocabulary -- body parts, motion verbs, fundamental physical properties -- demonstrates remarkable cross-linguistic stability (Haspelmath and Tadmor 2009: 22-34; Tadmor, Haspelmath and Taylor 2010).
The stability hierarchy documented by Haspelmath and Tadmor is not merely an empirical generalization but reflects deep functional constraints on lexical replacement. Body-part terms, for instance, are acquired early in language development, used with high frequency, and embedded in dense collocational networks that resist item-by-item substitution; a language can borrow the word for "telephone" without disrupting its lexicon, but borrowing the word for "hand" requires restructuring dozens of idiomatic expressions, compounds, and metaphorical extensions. Conversely, cultural vocabulary sits at the periphery of the lexical system, loosely connected and easily replaced when a new prestige source provides an alternative. The practical consequence is that a wordlist designed for phylogenetic inference must be weighted toward the stable core, even at the cost of excluding semantically rich but borrowing-prone domains.
For the Southern Kurdish region, where centuries of contact with Arabic (as the language of religion and historical administration) and Persian (as the language of state and high culture) have saturated the cultural lexicon, a naive wordlist would be methodologically disastrous. A standard list containing terms such as judge (hakim), religion (din), or court (mahkama) would cluster varieties based on their political and religious contact history rather than their linguistic genealogy. The primary imperative was therefore to construct a dataset that maximizes resistance to borrowing.
The resulting wordlist consists of eighty-five items, derived from the intersection of two comparative corpora: the Jena-Bamberg Iranian List (JBIL) combined with the Kurdish Lexical Questionnaire (KLQ), which together comprise roughly 500 lexical items developed by Haig and Anonby for fieldwork across the Iranian language family, and the wordlist used in the Oxford Handbook of Kurdish Linguistics (Haig et al., forthcoming), which comprises 125 items based on a modified Swadesh list supplemented with Kurdish-diagnostic terms. The decision to narrow the dataset to items overlapping with the Oxford Handbook wordlist was deliberate: by aligning with a pre-existing, independently compiled list, the present study produces results that are directly comparable to the Oxford Handbook corpus and provides a foundation on which future phylogenetic analyses can expand. Rather than imputing missing data or tolerating gaps, the analysis was restricted to items attested with high quality across all surveyed varieties -- a "lowest common denominator" approach that sacrifices breadth for reliability.
The semantic composition of the eighty-five items is heavily weighted toward domains that the Loanword Typology Project identifies as the most resistant to replacement: basic verbs (25.9%), including items such as eat, drink, die, kill, sleep, and give; body parts (23.5%), including head, eye, hand, liver, heart, and blood; natural phenomena (23.5%), including water, fire, sun, star, and stone; and core adjectives and properties (18.8%), including big, small, long, and new. Function words and kinship terms account for the remainder. Crucially, all items susceptible to borrowing from the Islamic-administrative sphere were excluded.
To validate the quality of this intersection, the list was benchmarked against the Leipzig-Jakarta List (Haspelmath and Tadmor 2009), an empirically derived inventory of the one hundred meanings most resistant to borrowing cross-linguistically. The overlap stands at 92% (seventy-eight of eighty-five items), confirming that the data-driven intersection converged on the most stable portion of the lexicon. While eighty-five items is a smaller dataset than some macro-family studies, Gray and Atkinson (2003) demonstrated that accurate, borrowing-resistant lists produce more reliable trees than larger, noisier ones. The dataset represents the "hard core" of the Southern Kurdish lexicon -- the stratum most likely to preserve genuine genealogical signal beneath the layers of contact.
A further consideration in wordlist design concerns the treatment of compound numerals, which present a specific methodological challenge. Languages frequently express teen numbers (eleven through nineteen) as transparent compounds of a base and a unit -- "ten-three" for thirteen, for instance. When two varieties share the compound structure but differ in one component, coding the compound as a single unit conflates independent etymological judgments. Following the recommendation of Bouckaert et al. (2014), compound numerals were decomposed into their constituent morphemes, each coded independently. Where decomposition proved ambiguous -- as with lexicalized compounds whose internal structure is no longer transparent to speakers -- the item was excluded rather than force-fitted into a coding scheme. This conservative treatment reflects a broader principle governing the entire dataset: uncertainty is flagged and excluded rather than resolved by assumption.
The decision to work with eighty-five items rather than a larger inventory also reflects the practical constraints of fieldwork in a conflict-affected region. Not all speakers could be recorded under identical conditions; some sessions were interrupted, others conducted in noisy environments that degraded specific items. Rather than retaining items with uneven attestation and introducing systematic gaps into the character matrix, the wordlist was trimmed to the intersection of high-quality attestations across all eleven speakers. The resulting dataset is small by the standards of large-scale phylogenetic studies -- Auderset et al. (2023: 35) used 209 concepts across 137 varieties -- but it is complete. Every cell in the character matrix contains a verified datum, eliminating the need for imputation or missing-data handling that can introduce artifacts into Bayesian inference.
Getting from raw audio to a phylogenetically analysable character matrix takes two linked stages: transcription and cognate identification. Neither stage is a black box. Both rely on a combination of manual judgment and computational processing — the logic of Computer-Assisted Language Comparison (CALC; List 2016), which uses automation to scale human expertise rather than replace it.
The primary transcription was done by hand, in IPA. Southern Kurdish makes this necessary in a way that other well-resourced languages do not. The fronted rounded vowels /oe/ and /uu/ — and their degree of centralisation in specific varieties — are among the main diagnostics for distinguishing Southern Kurdish dialects from one another, but these contrasts fall within the acoustic space that standard ASR models routinely conflate. No amount of phoneme inventory fine-tuning fixes this at the level of a system trained primarily on European languages. The transcription had to be human-primary.
To catch inconsistencies that accumulate across a multi-speaker dataset, I supplemented the manual transcriptions with an ASR validation pass using the Wav2Vec2 framework (Baevski et al. 2020), specifically the XLSR-53 variant. This model learns acoustic representations from raw unlabelled audio, then adapts to a target language via fine-tuning on small labelled datasets. For low-resource varieties like Kurdish, that architecture is more practical than the alternatives, which typically require thousands of hours of transcribed speech. The ASR output was never treated as ground truth. Raw audio was segmented into word-level clips, processed to produce candidate IPA strings, and then compared against my manual transcriptions using Levenshtein distance. Items with large discrepancies were flagged and re-verified. The purpose was to catch drift — the tendency for transcription conventions to shift subtly across sessions — not to outsource phonological judgment to a machine.
Before transcription was possible, the audio required substantial preparation. The fieldwork recordings range from roughly two and a half hours to over five hours each and contain target items embedded in conversational frames, overlapping talk, background noise, and variable distance from the microphone. A Kurdish research assistant carried out initial lexical segmentation for a subset of the recordings, marking target items in Adobe Audition and exporting timestamped regions as CSV files. An earlier version of the pipeline extracted individual clips from these timestamps, but the approach proved brittle — segments were sometimes too short, speaker metadata was inconsistent across the concatenated files the assistant had received, and a further set of recordings (see section 3.2) arrived with no segmentation at all. The PARSE workstation described in section 4.4 replaced this workflow with an annotation pipeline that works directly from the original recordings without generating segmented audio files. A quality control pass excluded any item where overlapping speech, noise, or hesitation made the target form ambiguous. That last judgment — whether a token is usable — had to be made case by case. There is no automated criterion that reliably captures the pragmatic difference between a speaker rephrasing on the fly and a speaker genuinely producing the target word.
Cognate identification used LingPy (List et al. 2018) and the LexStat algorithm. LexStat is preferable to simple edit-distance methods because it derives language-specific sound-correspondence matrices from the data rather than applying a universal distance metric. Alignments were initialised using the Sound Class Algorithm, which groups sounds into broad phonetic categories to absorb allophonic variation before the clustering step. Thresholds between 0.55 and 0.65 were tested — a range chosen because the Southern Kurdish varieties sit close enough to one another that small threshold shifts can materially affect which forms are grouped together.
Automated cognate detection gets a lot right, but it also makes characteristic errors. Following Rama et al. (2018), who showed that automated approaches approach but do not reach expert accuracy, I cross-checked the LexStat clusters against the historical phonology documented in MacKenzie (1961). Two recurring error types became apparent in the Southern Kurdish data. The algorithm grouped the Arabic loanword waqt (time) as cognate across varieties — the phonetic similarity is real, but it reflects shared borrowing, not shared inheritance. Conversely, it failed to link dast (hand) with das (hand), because the final consonant cluster deletion looked like a mismatch rather than a regular phonological process. Manual correction addressed both.
Borrowings required particular care. In the Southern Kurdish shatter zone, Arabic and Persian loanwords penetrate even the basic vocabulary that other language families tend to keep inherited. The question of what counts as borrowed is not always clean — as Haig (personal communication, January 2026) has noted, the line between inherited and borrowed in West Iranian is often a matter of degree rather than kind, especially for early Persian loanwords whose entry predates systematic documentation. Transparent recent borrowings, where the phonological form closely mirrors the Arabic or Persian source without nativisation, were excluded outright. Partially nativised items — those where vowel harmony or consonant lenition marks the form as having passed through the Kurdish phonological system — went to manual review and were excluded unless the nativisation itself showed variety-specific patterns that might carry phylogenetic signal. Items of genuinely ambiguous etymology were retained but annotated, and their effect on the tree was tested through sensitivity analysis.
The final cognate matrix was encoded in binary: each variety received 1 (cognate present) or 0 (cognate absent) for each cognate class. Where a variety produced multiple forms for a single meaning — a native term alongside a borrowing, for instance — each form was coded as a separate cognate class. This expansion of multistate slots into binary columns follows Gray and Atkinson (2003) and avoids the information loss that comes from arbitrarily selecting one form per variety per meaning slot. The resulting matrix covers eighty-five meaning slots, expanded to [VERIFY] cognate columns across eleven varieties.
The binary character matrix was converted to NEXUS format for analysis in BEAST 2 (Bouckaert et al. 2014). Choosing a Bayesian framework is not incidental. Parsimony and maximum likelihood both select a single best tree and report it. Bayesian inference produces a distribution over trees — a probability-weighted ensemble that quantifies how much uncertainty the data actually contains. In a dataset drawn from a contact-saturated continuum, where conflicting signals are expected rather than exceptional, that distribution is the result, not a concession to imprecision.
The primary substitution model is the Binary Covarion (Tuffley and Steel 1998; Penny et al. 2001), now the field standard for cognate data in linguistic phylogenetics (Hoffmann, Bouckaert and Greenhill 2021: 119). The simpler alternative — the Binary Continuous-Time Markov Chain (CTMC) — assumes each cognate evolves at a constant rate across the whole tree. The Covarion relaxes this by allowing individual characters to switch between a fast mode and a slow mode at any point in the tree. Internally, each cognate occupies one of four hidden states: fast-absent, fast-present, slow-absent, slow-present. Four free parameters govern the transitions between these states: visible state frequencies, hidden state frequencies, the switch rate between modes, and the slow mutation rate.
Why does this matter for Southern Kurdish? In a contact zone, the rate at which a variety replaces vocabulary is not constant over time. A community might remain relatively isolated for generations — slow mode — then experience a burst of lexical exchange as political boundaries shift or populations are displaced, pushing it toward fast mode. The Covarion models exactly this kind of within-character temporal variability. A plain CTMC with Gamma-distributed rate heterogeneity handles the fact that different cognates evolve at different speeds, but it assigns each character a fixed rate drawn from that distribution. The Covarion adds the possibility that any cognate's rate can change within its own history on the tree — a different and more realistic claim for contact-zone data (Bouckaert and Robbeets 2017).
This also marks a change from the UZH term paper (Ardelean 2025), which used Binary CTMC with Gamma rate heterogeneity. The Covarion doesn't abandon that sensitivity — it subsumes between-site variation while adding within-site temporal switching (Cathcart, Karakostis and Jager 2023). A parallel CTMC + Gamma analysis runs alongside as a sensitivity check: where both models produce compatible topologies, that convergence is evidence of a robust genealogical signal; where they diverge, the divergence itself becomes something to explain in Chapter 5.
The tree prior is the Yule model — a pure birth process with a constant lineage-splitting rate and no extinction term. More complex priors exist. The Birth-Death process models extinction alongside speciation; coalescent priors are designed for population-level genetic samples. For eleven dialect varieties rather than population samples, the Yule prior's simplicity is an asset. It imposes few demographic assumptions and lets the cognate data carry most of the inferential weight.
A Strict Clock was applied, assuming a uniform rate of lexical replacement across the entire tree. In a shatter zone, this invites scepticism — varieties with heavier contact exposure might plausibly replace vocabulary faster than isolated ones. Three considerations push toward the conservative choice. The varieties are geographically tight and their divergence relatively shallow, which limits how much rate variation could accumulate even if the mechanism existed. A Relaxed Clock adds parameters the model must estimate from a small dataset, increasing the risk of overfitting rather than capturing genuine rate differences. Perhaps most importantly, any rate variation visible in this dataset is more plausibly contact noise than a real difference in the pace of lexical change — and letting the clock relax would absorb that noise into rate estimates rather than treating it as signal to be set aside. The Strict Clock keeps the model conservative.
The MCMC analysis ran for 10,000,000 generations, sampling every 1,000 states. Convergence was assessed in Tracer (Rambaut et al. 2018): all key parameters — posterior probability, likelihood, prior, tree height, and substitution model parameters — required Effective Sample Size values above 200. The first 10% of sampled trees were discarded as burn-in. The remaining sample was summarised as a Maximum Clade Credibility (MCC) tree in TreeAnnotator, with node heights set to the median of the posterior distribution. MCC trees are not a single "best" tree in the way a maximum likelihood tree is; they represent the topology whose clade membership is most consistently endorsed across the entire posterior sample. Node posterior probabilities tell you how reliably each split appeared in that sample.
Tree uncertainty was additionally visualised using DensiTree (Bouckaert and Heled 2014). DensiTree overlays every sampled tree in a single figure. Where a clade is well-supported, hundreds of trees draw the same branching pattern and the result is a dense, dark band. Where topologies conflict, overlapping lines spread into a diffuse cloud. The figure makes phylogenetic uncertainty legible in a way that individual node posterior probabilities don't quite capture — you see the shape of disagreement across the whole tree, not just a number attached to a single node.
Multiple independent MCMC runs were conducted from different random starting seeds. Consistent posterior distributions across runs indicate that the chains explored the parameter space adequately rather than converging on different local optima. Sensitivity analyses compare the primary Binary Covarion model against Binary CTMC with Gamma rate heterogeneity, and also test a Relaxed Clock (Drummond et al. 2006) against the Strict Clock, a Birth-Death prior against the Yule prior, and a leave-one-out analysis removing each variety in turn to assess whether any single taxon exerts disproportionate influence on the inferred topology. Results are in Chapter 5.
Before building any phylogenetic tree, it is worth asking whether the data has a tree-like structure at all. The contact area between languages leads to convergent borrowing and structural diffusion which produces a reticulate signal that standard bifurcating tree models fail to represent accurately. Two metrics were computed prior to the Bayesian analysis. Delta scores (Holland et al. 2002: 2051) measure how consistently the distance relationships among any set of four varieties conform to a single unrooted tree; values near zero are tree-like, values near one indicate pervasive conflicting signal. Q-residuals provide a related check of how well a tree-derived distance matrix matches the observed pairwise distances.
The main visualisation tool for non-tree-like signal is the Neighbor-Net split graph (Bryant and Moulton 2004: 255). Where a bifurcating tree forces every conflicting signal into a single branching point, Neighbor-Net represents it as a parallelogram — a visible split in the network where two incompatible groupings each claim some support from the data. Reading the Neighbor-Net alongside the Bayesian MCC tree gives a more complete picture than either alone. High posterior probability at a node that also appears as a clean split in the network is good evidence of genuine genealogical signal. Where a node shows low posterior probability and the corresponding network position shows strong reticulation, contact and genealogy are genuinely entangled — and the tree-based model is straining to represent something that may not actually be tree-like.
No outgroups were included. The dataset covers eleven Southern Kurdish speakers representing six varieties, and no external comparative taxa appear in the character matrix. The tree was rooted agnostically from the Southern Kurdish data alone — no variety was assigned an outgroup role and no monophyly constraints were imposed. This approach, which Haig (personal communication, January 2026) has described as working from a "blank mind," allows the internal structure of Southern Kurdish to emerge from the data rather than from any prior assumption about which variety is most peripheral or most archaic. If the cognate matrix contains enough signal to resolve that structure, the posterior distribution will reflect it. If the signal is genuinely ambiguous — as contact-zone data often is — the posterior will show that too, through low node probabilities and the diffuse topology visible in DensiTree.
The methodology described in the preceding sections assumes that individual lexical items can be located, segmented, and transcribed from the source recordings with reasonable efficiency. In practice, this was not straightforward. The eleven fieldwork recordings range from roughly two and a half hours to over five hours each, often split across multiple WAV files, and each contains responses to roughly five hundred items from the combined JBIL and KLQ wordlists embedded in conversational frames, metalinguistic commentary, and ambient noise. While the ASR timestamps were generally reliable where they existed, they were missing entirely for stretches of several recordings. The initial segmentation described in section 4.1 covered only a subset of these recordings and produced segments that were not always reliable. Across eleven speakers, the eighty-five phylogenetically relevant items had to be located within recordings where the elicitation order did not always follow the wordlist sequence. Manually scanning roughly thirty hours of audio to find each one would have been prohibitively slow. It would also have introduced exactly the kind of inconsistency that a systematic pipeline is meant to prevent.
I built PARSE (Phonetic Analysis and Review Source Explorer) to handle this. It is a browser-based workstation with two modes that divide the annotation work along its natural axis. In Annotate mode, I work with a single speaker's recording. A waveform display allows navigation of the full original files, draggable regions define precise start and end times for each lexical item, and four annotation tiers are maintained simultaneously: IPA, orthography, concept label, and speaker identifier. In Compare mode, the same data appears as a concept-by-speaker matrix, where each cell shows the IPA and orthographic forms for a given speaker-concept pair alongside controls for cognate adjudication. The two modes share the same underlying annotation files, so edits in one are immediately visible in the other. This separation matters because the two tasks require different cognitive orientations. Segmenting and transcribing a single speaker's recording is close listening, focused on acoustic detail. Comparing forms across speakers is pattern recognition, focused on which similarities reflect shared inheritance and which reflect borrowing or coincidence.
The core of PARSE is a speech processing pipeline built around the razhan/whisper-base-sdh model (Abdulrahman [YEAR]), a Whisper variant fine-tuned on Southern Kurdish speech data. This model provides orthographic transcription in Kurdish Arabic script together with word-level timestamps. Voice activity detection, handled by Silero VAD, segments the continuous recording into candidate speech regions before the model processes each one. I tuned the VAD parameters for the specific characteristics of the elicitation recordings. A lower activation threshold than the default catches soft-spoken consultants recorded at variable distances from the microphone, sometimes outdoors. A minimum silence duration of three hundred milliseconds between detected speech segments prevents the merging of stimulus-response pairs that a more aggressive setting would collapse into single units. That last adjustment fixed a recurring problem: five to thirty-nine consecutive items were being grouped into one transcript segment, which made downstream processing impossible. The problem was specific to the elicitation format, where the interviewer poses a stimulus in Sorani or Persian and the speaker responds after a brief pause. Standard VAD settings treat that pause as part of the same utterance.
The full processing chain from raw field recording to ranked annotation candidates proceeds as follows:
Raw 2-5 h WAV(s) (Southern Kurdish elicitation)
|
v
1. Silero VAD segments speech regions
(threshold 0.35, min silence 300 ms)
|
v
2. razhan/whisper-base-sdh (faster-whisper, CUDA float16)
--> orthographic words + word-level timestamps
|
|-- >= 50 words -------------------- < 50 words ------|
v v
3a. Targeted wav2vec2-xlsr-53 refinement 3b. Chunked wav2vec2 fallback
+/-0.75 s virtual windows per word 10 s windows, 1 s overlap
--> phoneme-level IPA + confidence full-file coverage
| |
|---------------- merge -------------------------------|
|
v
4. CTC phoneme-to-word grouping (150 ms gap threshold)
+ Kurdish Arabic-script --> IPA backup per word
|
v
5. Parallel candidate scoring
|-- A: Within-speaker repetition detection (30 s window, cluster 2-4)
|-- B: Cross-speaker concept matching (4-strategy cascade)
|
v
6. Weighted confidence scoring (formula below)
|
v
7. Per-concept ranked candidates --> Annotate mode suggestion cards
|
v
8. Human review --> final IPA + orthographic annotation
|
v
9. Compare mode --> cognate adjudication (accept / split / merge / cycle)
|
v
10. Export: LingPy TSV (COGIDs + borrowing flags) --> BEAST 2 [VERIFY]
The pipeline locates target items. It does not produce final transcriptions. When I open a concept in Annotate mode, PARSE presents ranked candidate segments — time regions where the pipeline believes the target word occurs — scored by confidence and displayed as cards with timestamps and match quality indicators. The models generate their own IPA and orthographic transcriptions for each candidate, which serve as a cross-check against the field transcriptions I produced during elicitation (see section 4.1). I listen, verify the region boundaries, and where the model output diverges from my field notes, re-examine the audio to determine which is correct. Nothing is saved without an explicit action. The pipeline narrows the search space and proposes a transcription. The human makes the judgment. This division follows the same logic as the ASR validation pass described in section 4.1, where the Wav2Vec2 output was treated not as ground truth but as a systematic cross-check. In PARSE, the speech processing output serves primarily as a locator. The models find approximately where a word is and propose what it might be. I verify both.
When timestamps are missing or unreliable for a given concept, a second system takes over. I call it the Lexical Anchor Alignment System. It treats verified items from the wordlist as known reference points and uses them to bootstrap the location of items that the initial pipeline failed to timestamp. The underlying logic is straightforward: if a dozen items have already been located and verified in a given recording, their positions provide a rough map of the elicitation sequence. A missing item that should fall between two known anchors is likely to appear in a predictable time window. The system formalises this intuition through two complementary approaches.
The first detects repetitions within a single speaker's recording. Each item in the elicitation is typically produced two to four times in succession, so clusters of phonetically similar forms appearing within a thirty-second window are strong candidates for a repeated target item. Phonetic similarity here is measured by normalised Levenshtein distance on IPA strings, with a threshold low enough to accommodate the natural variation between repetitions of the same form. The second approach matches across speakers. It compares the phonetic form of an unassigned segment against verified annotations from other speakers for the same concept, using a priority-ordered set of matching strategies. An exact orthographic match receives the highest confidence. Fuzzy orthographic and phonetic rule-based matches receive progressively less. A set of phonetic variation rules, encoding documented Southern Kurdish alternations such as onset voicing (k/g, t/d, p/b), nucleus variation (e/a), and coda deletion, expands the comparison space so that legitimate dialectal variants are not rejected as mismatches. These rules are specific to the varieties in the dataset and were derived from the phonological patterns documented in MacKenzie (1961) and observed during transcription.
These two signals feed into a weighted scoring function together with two additional components: the acoustic confidence of the original transcription, and a positional prior derived from the median timestamp of each concept across all speakers who have already been processed. Phonetic similarity carries the greatest weight at fifty percent, followed by repetition evidence at twenty-five percent, positional proximity at fifteen percent, and cluster coherence at ten percent. The positional component applies a tolerance window of forty-five seconds, decaying linearly from full credit at the expected position to zero at the boundary. Forty-five seconds is wide enough to accommodate the variation in elicitation pacing across speakers while narrow enough to penalise candidates that appear far from where the concept ought to fall. The confidence formula is deliberately additive, so that a strong phonetic match survives even when repetition or positional evidence is weak:
Let
The system presents its top-ranked candidates. I review each one against the audio and either accept, adjust, or discard it. The system never assigns a concept on its own.
A property of this design is that it improves as the dataset grows. Each verified annotation adds to the reference pool that the cross-speaker matching draws on. A concept that was difficult to locate for the third speaker becomes easier to find for the eighth, because the system now has seven verified forms to compare against. This is not a learning algorithm in the machine learning sense. It is a straightforward consequence of having more reference data. But it means that the most labour-intensive phase of the annotation is the beginning, and the work becomes progressively faster as the reference pool fills in.
In Compare mode, cognate adjudication follows the LexStat analysis described in section 4.1 but with an additional layer of human oversight. The matrix displays computed cognate groupings for each concept. I can accept the grouping as computed, split a speaker into a separate cognate class, merge all forms into one class, or cycle a speaker through available classes. These manual overrides are stored in a separate enrichments layer that preserves both the computed result and the correction, so the reasoning behind each decision remains recoverable. This matters for reproducibility: another researcher can see not only the final cognate assignments but also where they diverge from the algorithmic output and why. Borrowing adjudication works on the same principle. Contact language similarity scores for Arabic and Persian appear alongside each form, computed by comparing the IPA against known forms in those languages with the same phonetic variation rules applied. I mark items as native, borrowed, or uncertain. Items flagged as borrowed enter the export so that the downstream analysis can handle them according to the exclusion criteria in section 4.1.
Processed annotations are exported in LingPy-compatible format with cognate class identifiers and borrowing flags, feeding into the BEAST 2 pipeline described in sections 4.1 and 4.2 [VERIFY]. One design constraint runs through the entire system: audio playback always references the full original recording via stored timestamps. No segmented audio files are generated at any point. This preserves the archival integrity of the field recordings while allowing precise, reproducible access to any annotated segment.
The pipeline described across this chapter handles each major challenge of the Southern Kurdish dataset in turn. Human-primary transcription addresses the phonological subtlety that automated models miss. The PARSE workstation scales that human judgment across thirty hours of field recordings without sacrificing the transparency or reproducibility that manual-only processing would lack. Explicit borrowing exclusion manages the contact saturation that makes even basic vocabulary unreliable as a genealogical signal. Probabilistic inference with documented uncertainty quantification handles the classificatory ambiguity that makes a single best-tree answer the wrong kind of answer for this dataset. Every threshold, exclusion criterion, and model specification is documented here in enough detail to allow independent replication — or adaptation to other low-resource dialect continua where the same pressures apply.
Working Title: Quantifying Dialect Divergence in Southern Kurdish: A Bayesian Phylogenetic Approach
Student: Lucas Ardelean Supervisor: Prof. Geoffrey Haig (Uni Bamberg) External Reviewer: Sarah Babinski (UZH, Bayesian specialist) Deadline: End of May 2026
- 1.1 The Classification Puzzle: Concrete geography (Zagros ridges, Kermanshah, Ilam, Diyala); Fattah's "shatter zone" (2000: 56) reframed via Belelli (2019) as compact continuum with blurred boundaries; multilingual ecology; intelligibility patterns; Faili displacement.
- 1.2 The Classification Problem: MacKenzie's tripartite division (1961: 142); Fattah's 35 varieties and Laki removal; flat star vs. hierarchical tree for Faili/Kalhori/Khanaqini; contradictory isoglosses; Mohammadirad's "transitional zone" argument.
- 1.3 A Bayesian Solution: Posterior distributions handle ambiguity; Gray & Atkinson 2003, Bouckaert et al. 2012; BEAST 2 + Covarion; Bayes Factors.
- 1.4 Scope & Contributions: Proof of concept; Laki excluded; 11 speakers, 85 items, 6 varieties; 3 research questions; chapter roadmap.
- 2.1 Southern Kurdish Varieties: Geographic distribution; MacKenzie (1961); Fattah (2000) and 35 varieties; Fattah's 7-group classification (Bijāri, Kolyā'i, Laki-Kermānshāhi, Kalhori-Sanjābi-Zangane, Malekshāhi, Badre'i, Kordali) via Belelli (2019); Laki controversy; Gorani superstrate; contemporary displacement and language shift.
- 2.2 The Dialect Continuum Challenge: Stammbaumtheorie vs. Wellentheorie; Paul (1998) on West Iranian continuum; contact layers (Arabic, Persian, Gorani); computational precedents (Heggarty et al. 2023, Cathcart 2020).
- 2.3 Bayesian Phylogenetics: Gray & Atkinson (2003) and critics; extensions to other families; posterior distributions; clade support; ESS threshold; CTMC model; Gamma rates; lexical stability; Yule prior; BEAST 2 ecosystem; proof-of-concept framing.
- 3.1 Speaker Cohort: 11 speakers, 6 varieties; Fattah-group mapping; demographic composition; age and gender discussion; detailed profiles; displacement and mobility as defining sociolinguistic reality.
- 3.2 Fieldwork Design: Shatter zone context; Observer's Paradox mitigation; recording settings; equipment (Rode Wireless Go II + VideoMic Pro+ on Sony A7S III, 4K/60fps); elicitation via carrier sentences, gestures, pointing; translators; verbal informed consent + video face-cropping policy.
- 3.3 The Lexical Dataset: Borrowability theory (Haspelmath & Tadmor 2009); JBIL + KLQ (~500 items combined) intersected with Oxford Handbook wordlist; 85-item borrowing-resistant core; Leipzig-Jakarta validation (92% overlap); compound numeral treatment.
- 4.1 Transcription: Human-primary IPA transcription for SK vowels; Wav2Vec2 ASR validation.
- 4.2 Cognate Detection: LingPy LexStat (threshold 0.55-0.65); manual correction guided by MacKenzie (1961) phonology.
- 4.3 Phylogenetic Inference: BEAST 2; Binary Covarion (primary) + Binary CTMC+Gamma (sensitivity); Strict Clock; Yule Prior; agnostic rooting (no outgroups; SK data only); MCMC parameters.
- 5.1 Cognate clustering outcomes.
- 5.2 Phylogenetic trees (MCC tree, posterior support).
- 5.3 DensiTree visualization.
- 5.4 Bayes Factors (testing specific topological hypotheses).
- 6.1 Clade vs. gradient: does the data support discrete SK subgroups?
- 6.2 The position of outliers (Mandali, Qasri).
- 6.3 Methodological validity: did the hybrid LingPy approach work?
- 6.4 Limitations (sample size, ASR noise).
- 7.1 Summary of findings.
- 7.2 Proof of concept established.
- 7.3 Future work (scaling to full West Iranian family).
The appendices are ordered to follow the analytic pipeline: speaker cohort → elicited forms → phonetic normalization → cognate-state coding → documentation quality assessment → supplementary data. Appendix A is reproduced inline below; Appendices B–F are standalone documents under appendix/.
| Speaker | Variety | Gender | Birth year | Birthplace | Key residences | Languages | Education | Fattah group |
|---|---|---|---|---|---|---|---|---|
| Fail01 | Faili | M | 1984 | — (Baghdad diaspora) | Baghdad (6 yr), Tehran (7 yr), Ilam (5 yr), Sulaymaniyah (since c. 2002) | Faili, Arabic, Farsi, Kalhori, English | — | MAL/BAD (uncertain) |
| Fail02 | Faili | F | 1968 | — (Baghdad diaspora) | Baghdad (12 yr), Tehran (24 yr), Sulaymaniyah (since 2019) | Faili, Arabic, Farsi | — | MAL/BAD (uncertain) |
| Kalh01 | Kalhori | M | 1981 | Klaw Drizh, Kermanshah | Klaw Drizh (25 yr), Tehran, southern Iran, Germany (since 2016) | Kalhori, Farsi, English, Sorani, German, passive Kurmanji, formal Arabic | MA General Linguistics | KSZ |
| Khan01 | Khanaqini | F | 1970 | Baghdad | Baghdad (36 yr), Sulaymaniyah (since 2006) | Khanaqini, Arabic | — | KSZ (nominally) |
| Khan02 | Khanaqini | F | 1967 | Khanaqin | Khanaqin (25 yr), Baghdad (16 yr), Sulaymaniyah | Khanaqini, Germiani, Arabic, some Turkish | BA Kurdish (Baghdad) | KSZ (nominally) |
| Khan03 | Khanaqini | M | 1978 | — | Baghdad (10 yr), Khanaqin (5 yr), Sulaymaniyah area | Kalhori-Khanaqini, Arabic, Farsi | — | KSZ (nominally) |
| Khan04 | Khanaqini | F | c. 2000 | Khanaqin | Khanaqin, Sulaymaniyah (recent) | Khanaqini, Sorani | — | KSZ (nominally) |
| Qasr01 | Qasri | M | 1973 | Qasr-e Shirin | Qasr-e Shirin, Kermanshah, Karaj-Tehran, Iraq (since c. 2014) | Qasri, Farsi, Arabic | — | KSZ |
| Mand01 | Mandali | M | c. 1962 | Khanaqin | Khanaqin, Mandali, Baghdad (20 yr), Sulaymaniyah (since 2006) | Mandali, Khanaqini, Arabic, Farsi, Turkmen | — | KSZ/MAL (border) |
| Saha01 | Sahana | F | c. 1978 | Sahana (Sahne) | Sahana (22 yr), Kermanshah, Tehran, Sulaymaniyah (since c. 2022) | Xwareen, Farsi, Laki, some Sorani | — | L-KER |
| Oxfo01 | Reference | — | — | — | — | — | — | BIJ? (TBC) |
Notes: Birth years marked "c." are approximate. Education listed only where known. Fattah group assignments follow the mapping in §3.2 (Table 2); see issues #20 and #21 for unresolved assignments. "—" indicates data not available or not applicable.
Per-concept source-item numbers across the elicitation surveys, the recorded IPA and orthographic forms for each of the ten analytic speakers, and the per-concept cognate-set decisions from PARSE compare mode.
→ appendix/Appendix_B_Concept_Forms_and_Cognate_Decisions_2026-06-03.md
The phonological normalization rules applied when building the binary character matrix for the Bayesian analyses — the prose companion to the machine-applied ruleset in Appendix F.
→ appendix/Appendix_C_Phonetic_Normalization_Rules.md
The data sources and coding methodology underlying the lexical character matrix, with the revisions recommended on the basis of the data-sourcing rules, polymorphism-handling protocol, and per-concept research record.
→ appendix/Appendix_D_Cognate_State_Coding_2026-06-03.md
The auditor prompt used to assess whether the cognate-state decision notes, summaries, and logs adequately document their reasoning, raw evidence, current coding, and Bayesian implications.
→ appendix/Appendix_E_Cognate_Decision_Notes_Assessment_Prompt_2026-05-18.xml
Machine-readable artifacts for reproducibility: the applied phonetic-normalization ruleset, the per-concept cognate-set / loan-flag / discarded artifacts, and the ten per-speaker time-aligned PARSE annotations.
→ appendix/Appendix_F_Supplementary_Data/
An interactive browser tool for inspecting the cognate decisions: the ten-speaker matrix with row-expand variants, a per-decision notes overlay, and 606 audio clips with their timestamps.
→ Live (loads the tool): https://ardeleanlucas.github.io/thesis/review_tool/
→ In-repo submission copy: review_tool/
