Policy Primer - The AI Language Gap

AUTHORS

Cohere Labs team

ABSTRACT

More than 7,000 languages are spoken around the world today, but current state-of-the-art AI large language models cover only a small percentage of them and favor North American language and cultural perspectives. This is in part because many non-English languages are considered “low-resource,” meaning they are less prominent within computer science research and lack the high-quality datasets necessary for training language models. This language gap in AI has several undesirable consequences: 1) Speakers of many languages, and their communities, may be left behind as language models that do not cover their language become increasingly integral to economies and societies. 2) The lack of linguistic diversity in models can introduce biases that reflect Anglo-centric and North American viewpoints and undermine other cultural perspectives. 3) The safety of all language models is compromised without multilingual capabilities, creating opportunities for malicious users and exposing vulnerable users to harm.

There are many global efforts to address the language gap in AI, including Cohere For AI’s Aya project, a global initiative that has developed and publicly released multilingual language models and datasets covering 101 languages. However, more work is needed. To contribute to these efforts, we offer four considerations for those working in policy and governance around the world: 1) Direct resources towards multilingual research and development. 2) Support multilingual dataset creation. 3) Recognize that the safety of all language models is improved through multilingual approaches. 4) Foster knowledge-sharing and transparency among researchers, developers, and communities.