Jun 15, 2026

The Culture Funnel: You can’t align what isn’t in the data

Current cultural alignment approaches are limited by a "cultural data funnel": explicit cultural signals decline sharply during post-training, and geographically concentrated data dominates. Multilinguality broadens geographic diversity but does not balance representation, so improving cultural representation and benchmark performance requires a shift in training pipeline focus.

Authors


Ananya Sahu, Mehrnaz Mofakhami, Daniel D’souza, Thomas Euyang, Julia Kreutzer, and Marzieh Fadaee

Abstract


Current cultural alignment approaches focus on inference-time interventions, assuming models already contain sufficient cultural knowledge. We argue that modern LLM pipelines suffer from a cultural data funnel. Using a multidimensional tagging framework across pretraining, fine-tuning, alignment, and reasoning datasets, we show that explicit cultural signals decline sharply during post-training, while geographically concentrated, task-specialized data dominates. Multilinguality enhances the geographic diversity of cultural knowledge but does not ensure balanced representation. Our tags improve downstream cultural benchmark performance, demonstrating that advances in cultural alignment require a shift of focus in training data pipelines.
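One way to picture the funnel measurement described above is the minimal sketch below. It is not the authors' tagging framework: the sample records, the region labels, the function name `funnel_stats`, and the two metrics (share of samples with an explicit cultural tag, and normalized entropy of region tags as a rough evenness score) are all assumptions chosen for illustration, and the toy data are invented rather than taken from the paper.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical tagged samples: (pipeline stage, region tag).
# A region of None means no explicit cultural signal was tagged.
# These records are invented for illustration only.
SAMPLES = [
    ("pretraining", "South Asia"),
    ("pretraining", "West Africa"),
    ("pretraining", "North America"),
    ("pretraining", None),
    ("fine-tuning", "North America"),
    ("fine-tuning", None),
    ("fine-tuning", None),
    ("fine-tuning", None),
]


def funnel_stats(samples):
    """Per stage: share of culturally tagged samples and region evenness."""
    by_stage = defaultdict(list)
    for stage, region in samples:
        by_stage[stage].append(region)

    stats = {}
    for stage, regions in by_stage.items():
        tagged = [r for r in regions if r is not None]
        counts = Counter(tagged)
        total = len(tagged)
        # Shannon entropy of the region distribution among tagged samples.
        entropy = -sum((c / total) * log(c / total) for c in counts.values())
        # Normalize by the maximum entropy for the observed number of regions.
        max_entropy = log(len(counts)) if len(counts) > 1 else 0.0
        stats[stage] = {
            "cultural_share": total / len(regions),
            "region_evenness": entropy / max_entropy if max_entropy else 0.0,
        }
    return stats


if __name__ == "__main__":
    for stage, values in funnel_stats(SAMPLES).items():
        print(stage, values)
```

In this toy setup, a falling `cultural_share` from one stage to the next would correspond to the decline in explicit cultural signals, and a low `region_evenness` to geographic concentration. A real analysis would need the paper's own tag dimensions and datasets.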

Related works