Addressing Language Gaps in African AI Development
A significant gap in Africa’s artificial intelligence landscape is that many AI systems cannot reliably determine the language of a given text. This limitation keeps many African languages out of the training data for more sophisticated AI tools. A newly released open-source model aims to bridge that gap.
Launch of CommonLingua Model
On April 28, the AI research firm Pleias, in collaboration with the GSMA (GSM Association), unveiled CommonLingua, a language identification (LID) model covering 334 languages, including 61 African languages from eight language families. It is the first release from the GSMA’s AI Language Model for Africa by Africa project, which aims to address the language disparity in AI on the continent.
The Importance of Language Classification
CommonLingua addresses a fundamental step in the AI development pipeline: before models can be built for languages such as Swahili, Yoruba, or Wolof, texts in those languages must first be identified correctly. Existing tools such as fastText, GlotLID, and OpenLID focus predominantly on European and Asian languages and often misclassify African-language texts as English or French. The downstream cost is clear: even the most advanced AI models are roughly 30 percentage points less accurate on African languages than on widely spoken global languages.
Performance of CommonLingua
CommonLingua achieves 83% accuracy and a macro-F1 score of 0.79 on the newly established CommonLID benchmark, outperforming leading language identification models by more than 10 percentage points under the same test conditions while using roughly 300 times fewer parameters. At just 8 MB, the model processes around 20 texts per second on a standard CPU and up to 3,000 texts per second on a single GPU, making it practical for resource-constrained environments.
Diverse Language Coverage
The model’s coverage spans the Bantu, Niger-Congo, West African, Afro-Asiatic, Semitic, Cushitic, Chadic, Berber, Nilo-Saharan, Pidgin, and Creole language groups. Notably, CommonLingua operates directly on raw text byte sequences, so it can handle multiple writing systems, including Latin, Arabic, Ethiopic, N’Ko, and Tifinagh, without any language-specific tokenization.
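The article does not describe CommonLingua’s internals, but the general idea of byte-level language identification can be illustrated with a toy classifier: build a byte-trigram frequency profile per language from a small text sample, then match new text to the closest profile by cosine similarity. The language codes and sample sentences below are illustrative assumptions, not data from the model.

```python
from collections import Counter

def byte_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping n-byte sequences in the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Tiny illustrative training samples (real systems use far more data).
SAMPLES = {
    "swa": "habari ya asubuhi rafiki yangu karibu sana nyumbani kwetu leo",
    "eng": "good morning my friend you are very welcome to our home today",
    "fra": "bonjour mon ami tu es le bienvenu chez nous aujourd'hui",
}

PROFILES = {lang: byte_ngrams(text) for lang, text in SAMPLES.items()}

def identify(text: str) -> str:
    """Return the language code whose trigram profile best matches text."""
    grams = byte_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))

print(identify("asante sana rafiki yangu"))      # matches the Swahili profile
print(identify("thank you very much my friend")) # matches the English profile
```

Because everything operates on raw bytes, the same code works unchanged for non-Latin scripts such as Arabic or Ethiopic, which is the practical appeal of tokenization-free approaches in this setting.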
Advancing AI Infrastructure in Africa
Pierre-Carl Langlais, co-founder and CTO of Pleias, highlighted that effective language identification serves as a foundational component for subsequent advancements in African AI development. Lewis Powell, Director of AI Initiatives at the GSMA, emphasized that addressing these foundational infrastructure challenges is essential for progress. Collaborative tools such as CommonLingua are crucial for creating AI systems that authentically represent the continent’s linguistic diversity.
Commitment to Open Data and Future Dialogue
The model was trained exclusively on openly licensed and public-domain data, and all datasets are shared under permissive licenses. The GSMA and its partners plan to continue the discussion at the upcoming MWC26 Kigali event in June.
