Oct 25, 2021
No News is Good News: A Critique of the One Billion Word Benchmark
Authors
Helen Ngo, João G.M. Araújo, Jeffrey Hui, Nicholas Frosst
Abstract
The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of the benchmark corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distributional shift over time make it poorly suited for measuring language modeling ability, and we discuss the potential impact and considerations for researchers building language models and evaluation datasets.