
The Leaderboard Illusion
Tags: Evaluation, Language Models

Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
Tags: Multilingual, Evaluation, Language Models

Kaleidoscope: Exams for Multilingual Vision Evaluation
Tags: Evaluation, Open Source, Multilingual, Generative Models, Multimodal

From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
Tags: Code, Collaboration, Evaluation, Reasoning, Tooling

Global MMLU
Tags: Evaluation, Open Source, Multilingual, Generative Models

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Tags: Data, Evaluation, Generative Models, Multilingual, Open Source, Language Models

M-RewardBench: Evaluating Reward Models in Multilingual Settings
Tags: Multilingual, Data, Evaluation, Open Release, Collaboration

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation
Tags: Evaluation, Reproducibility, Language, Generative Models

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation
Tags: Evaluation, Efficiency, Language, Generative Models

No News is Good News: A Critique of the One Billion Word Benchmark
Tags: Responsible AI, Evaluation