A Standardized Format for AI Evaluation Data


Team: Jan Batzner, Leshem Choshen, Subho Majumdar

Category: Infrastructure

Tags:

Description

The AI evaluation ecosystem currently lacks standardized methods for storing, sharing, and comparing evaluation results across models and benchmarks. This fragmentation leads to unnecessary duplication of compute-intensive evaluations, difficulties in reproducing results, and barriers to comprehensive cross-model analysis.

We plan to address these problems by developing a standardized format that captures the complete evaluation lifecycle. The format will provide a clear, extensible structure for documenting evaluation inputs (hyperparameters, prompts, datasets), outputs, metrics, and metadata, enabling efficient storage, retrieval, sharing, and comparison of evaluation results across the AI research community.

Building on this foundation, we will create a centralized repository with both raw data access and API interfaces, allowing researchers to contribute evaluation runs and retrieve cached results. The project will integrate with popular evaluation frameworks (LM-eval, HELM, Unitxt) and provide SDKs to simplify adoption. We will also populate the repository with evaluation results from leading AI models across diverse benchmarks, creating a resource that reduces computational redundancy and facilitates deeper comparative analysis. By bridging the gap between rigorous evaluation practices and practical implementation, this project aims to enhance reproducibility, reduce resource requirements, and accelerate research in AI evaluation science.
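
As a rough illustration of what one record in such a format could contain, the sketch below defines a minimal schema using Python dataclasses. The class names (`EvaluationInputs`, `EvaluationRun`), field names, and serialization choice are assumptions made for this sketch, not a finalized specification of the proposed format.

```python
from dataclasses import dataclass, field, asdict
from typing import Any
import json


@dataclass
class EvaluationInputs:
    """What went into the run: model, benchmark, prompts, and hyperparameters."""
    model_id: str                                  # e.g. a model hub identifier
    benchmark: str                                 # e.g. "mmlu"
    prompt_template: str                           # prompt format used for the task
    hyperparameters: dict[str, Any] = field(default_factory=dict)


@dataclass
class EvaluationRun:
    """A single self-describing evaluation record suitable for caching and sharing."""
    inputs: EvaluationInputs
    outputs: list[dict[str, Any]]                  # per-example model outputs
    metrics: dict[str, float]                      # aggregate scores
    metadata: dict[str, Any]                       # framework, version, hardware, timestamps

    def to_json(self) -> str:
        """Serialize the run to JSON for upload to a shared repository."""
        return json.dumps(asdict(self), indent=2)


# Hypothetical usage: package a run produced by any evaluation framework.
run = EvaluationRun(
    inputs=EvaluationInputs(
        model_id="example-org/example-7b",
        benchmark="mmlu",
        prompt_template="zero-shot",
        hyperparameters={"temperature": 0.0, "max_tokens": 32},
    ),
    outputs=[{"question_id": "q1", "prediction": "B"}],
    metrics={"accuracy": 0.71},                    # placeholder value for illustration
    metadata={"framework": "lm-eval", "framework_version": "0.4.x"},
)
print(run.to_json())
```

Because the inputs are captured in a structured way, a repository could key cached results on a hash of the `inputs` block, letting researchers check for an existing run before re-evaluating; this is one possible design for reducing computational redundancy, not a committed implementation detail.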