Voicing the Voiceless: Developing a Dholuo Parallel Corpus for Natural Language Processing
DOI:
https://doi.org/10.58721/jllcs.v5i1.1736
Keywords:
Dholuo, Dictionaries, Low-resource languages, Parallel corpus
Abstract
Despite the exponential growth of Natural Language Processing (NLP) technologies worldwide, African indigenous languages remain severely underrepresented in digital language resources. Dholuo, a Western Nilotic language spoken by approximately four million people in Kenya and Tanzania, lacks the structured parallel corpora needed to support machine translation, speech recognition, and other AI-driven language tools. This paper reports on the design, methodology, and preliminary findings of a corpus development initiative funded by United States International University-Africa (USIU-Africa) to build a Dholuo–English parallel corpus comprising 20,000 translated sentence pairs and 30 hours of transcribed speech data. Preliminary findings indicate that approximately 18,400 parallel sentence pairs have been collected, of which 31% contain at least one figurative element. This confirms the centrality of figurative expression in authentic Dholuo discourse and validates the native-speaker-led, community-engaged collection methodology. Drawing on community-driven data collection, crowdsourced translation platforms, and the Living Dictionaries digital lexicography tool (livingdictionaries.app), the study integrates culturally embedded linguistic features, including metaphors (weche mitiyo kodo kakaranyisi), proverbs (Ngeche), riddles (Ponge), and folktales (sigendni mochuogi), into the corpus architecture. The paper situates the initiative within broader debates about digital linguistic equity, FAIR data principles, the role of indigenous languages in sustainable development, and the decolonisation of computational linguistics; to ensure interoperability, text data are deposited in CoNLL-U format and speech is archived as WAV with TextGrid annotation files.
The paper argues that Dholuo corpus construction must go beyond lexical tokenisation to capture the pragmatic and metaphorical richness that characterises oral-indigenous discourse; to that end, figurative items are tagged with semantic domain labels in the Living Dictionaries companion platform. The findings demonstrate that native-speaker validation using an 80% accuracy threshold, combined with the Living Dictionaries platform for orthographic consistency checking and open-access deposition on Zenodo and Mozilla Common Voice, constitutes a replicable and community-empowering model for enabling NLP in low-resource African languages. The corpus is partially multimodal: the Mozilla Common Voice speech recordings are sentence-aligned with the text pairs, while the interview and narration recordings constitute a separate spontaneous speech sub-corpus intended for ASR development.
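The validation and tagging steps described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the 80% accuracy threshold is applied, so the sketch assumes it means the share of native-speaker reviewers who judge a translation accurate. The `SentencePair` class, its field names, and the helper functions are hypothetical and are not taken from the project's tooling; the sentence strings are placeholders, not corpus data.

```python
from dataclasses import dataclass, field
from typing import List

# Threshold stated in the abstract; applying it to reviewer agreement is an assumption.
ACCURACY_THRESHOLD = 0.80


@dataclass
class SentencePair:
    """One Dholuo-English pair with hypothetical review and tagging fields."""
    dholuo: str
    english: str
    # One entry per native-speaker reviewer: True if the translation was judged accurate.
    votes: List[bool] = field(default_factory=list)
    # Semantic domain labels for figurative items (label names are illustrative).
    figurative_tags: List[str] = field(default_factory=list)


def agreement(pair: SentencePair) -> float:
    """Share of reviewers who judged the pair accurate (0.0 if unreviewed)."""
    return sum(pair.votes) / len(pair.votes) if pair.votes else 0.0


def validated(pairs: List[SentencePair],
              threshold: float = ACCURACY_THRESHOLD) -> List[SentencePair]:
    """Keep only pairs whose reviewer agreement meets the threshold."""
    return [p for p in pairs if agreement(p) >= threshold]


def figurative_share(pairs: List[SentencePair]) -> float:
    """Fraction of pairs carrying at least one figurative tag."""
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p.figurative_tags) / len(pairs)


# Placeholder pairs only; real corpus sentences are not reproduced here.
pairs = [
    SentencePair("<Dholuo sentence 1>", "<English sentence 1>",
                 votes=[True, True, True, True, False],
                 figurative_tags=["proverb"]),
    SentencePair("<Dholuo sentence 2>", "<English sentence 2>",
                 votes=[True, True, False]),
    SentencePair("<Dholuo sentence 3>", "<English sentence 3>",
                 votes=[True, True, True, True, True]),
]

kept = validated(pairs)
share = figurative_share(kept)
```

Under these assumptions, a statistic such as the reported proportion of pairs with a figurative element would be computed over whichever set of pairs the project counts as collected; the sketch computes it over the validated subset purely for illustration.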
License
Copyright (c) 2026 Journal of Linguistics, Literary and Communication Studies

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
