Oct 25, 2021

No News is Good News: A Critique of the One Billion Word Benchmark

Authors

Helen Ngo, João G.M. Araújo, Jeffrey Hui, Nicholas Frosst

Abstract

The One Billion Word Benchmark is a dataset derived from the WMT 2011 News Crawl, commonly used to measure language modeling ability in natural language processing. We train models solely on Common Crawl web scrapes partitioned by year, and demonstrate that they perform worse on this task over time due to distributional shift. Analysis of this corpus reveals that it contains several examples of harmful text, as well as outdated references to current events. We suggest that the temporal nature of news and its distribution shift over time make it poorly suited for measuring language modeling ability, and discuss the potential impact and considerations for researchers building language models and evaluation datasets.
