Policy Primer - Translating Safety
AUTHORS
Aidan Peppin, Marzieh Fadaee, Beyza Ermis, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Sara Hooker
ABSTRACT
Global AI safety efforts have gained traction and momentum, but a critical challenge remains: how to ensure safety across diverse languages and cultures.
This challenge is often overlooked, or absent entirely, in governance and research efforts to advance AI safety. Safety alignment efforts primarily focus on English or other monolingual settings, leaving safety gaps in other languages. As a result, many risks remain unaddressed, or are amplified, for non-English speakers.
Addressing multilingual safety is complex, requiring global harms to be reconciled with unique local contexts. Most current approaches to improving model safety are language-specific, and reliable evaluation datasets exist for only a handful of languages.
Despite these challenges, progress is being made. Many researchers around the world, including Cohere For AI, are working to close these language gaps, offering potential solutions to enhance AI safety across diverse linguistic and cultural contexts.
This Policy Primer summarises several promising avenues for addressing the language gap in AI safety, including: collecting robust multilingual evaluation data; distilling diverse safety instructions into models; adapting preference training to multilingual and multicultural contexts; merging models to increase performance; adapting evaluations across languages; and developing safety techniques for toxicity that keep pace with the natural evolution of language.
From this research, we identify five recommendations for researchers and policymakers seeking to improve AI safety for everyone:
- AI safety and alignment efforts should not be monolithic or monolingual.
- Multilingual safety should be addressed throughout the model training lifecycle.
- Including more languages in safety mitigation can provide gains across all contexts.
- Reporting on models’ coverage of different languages is critical.
- Curating data using human annotators with experiences and perspectives covering different languages and cultures is key.