Jun 27, 2024
Policy Primer - The AI Language Gap
Current state-of-the-art AI large language models cover only a small percentage of the world’s languages. This policy primer describes how this “language gap” in AI came to be, its potential consequences, and considerations for how those working in policy and governance can help to address it.

Authors
Cohere Labs team
Abstract
More than 7,000 languages are spoken around the world today, but current state-of-the-art AI large language models cover only a small percentage of them and favor North American language and cultural perspectives. This is in part because many non-English languages are considered “low-resource,” meaning they are less prominent within computer science research and lack the high-quality datasets necessary for training language models. This language gap in AI has several undesirable consequences: 1) Many language speakers and communities may be left behind as language models that do not cover their language become increasingly integral to economies and societies. 2) The lack of linguistic diversity in models can introduce biases that reflect Anglo-centric and North American viewpoints and undermine other cultural perspectives. 3) The safety of all language models is compromised without multilingual capabilities, creating opportunities for malicious users and exposing vulnerable users to harm. There are many global efforts to address the language gap in AI, including Cohere For AI’s Aya project — a global initiative that has developed and publicly released multilingual language models and datasets covering 101 languages. However, more work is needed. To contribute to efforts to address the AI language gap, we offer four considerations for those working in policy and governance around the world: 1) Direct resources towards multilingual research and development. 2) Support multilingual dataset creation. 3) Recognize that the safety of all language models is improved through multilingual approaches. 4) Foster knowledge-sharing and transparency among researchers, developers, and communities.
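Because the Aya models and datasets are publicly released, researchers and practitioners can experiment with them directly. The sketch below illustrates one way to query an openly released Aya model through the Hugging Face transformers library; the repository id "CohereForAI/aya-101", the sequence-to-sequence interface, and the example prompt are assumptions for illustration and should be checked against the official model card before use.

```python
# Minimal sketch (not an official example): prompting a publicly released
# multilingual Aya model via Hugging Face transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "CohereForAI/aya-101"  # assumed repository id; verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Prompt the model in a non-English language (here, Turkish).
prompt = "Türkçe konuşan bir asistan mısın?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```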
Related works
Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs