Feb 13, 2024
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
The Aya Collection is the largest collection of multilingual instruction fine-tuning datasets to date, with 513 million prompts and completions across 114 languages. We fully open-source the collection, which includes rare, human-curated annotations from fluent speakers worldwide.

Authors
Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura OMahony, Mike Zhang, Ramith Hettiarachchi, Joseph Wilson, Marina Machado, Luisa Souza Moura, Dominik Krzemiński, Hakimeh Fadaei, Irem Ergün, Ifeoma Okoh, Aisha Alaagib, Oshan Mudannayake, Zaid Alyafeai, Vu Minh Chien, Sebastian Ruder, Surya Guthikonda, Emad A. Alghamdi, Sebastian Gehrmann, Niklas Muennighoff, Max Bartolo, Julia Kreutzer, Ahmet Üstün, Marzieh Fadaee, Sara Hooker