Croissant: A Metadata Format for ML-Ready Datasets

Bibliographic Details
Title: Croissant: A Metadata Format for ML-Ready Datasets
Authors: Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Giner-Miguelez, Joan; Gijsbers, Pieter; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis; Kuchnik, Michael; Krishna, Satyapriya; Lesage, Sylvain; Lhoest, Quentin; Marcenac, Pierre; Maskey, Manil; Mattson, Peter; Oala, Luis; Oderinwale, Hamidah; Ruyssen, Pierre; Santos, Tim; Shinde, Rajat; Simperl, Elena; Suresh, Arjun; Thomas, Goeffry; Tykhonov, Slava; Vanschoren, Joaquin; Varma, Susheel; van der Velde, Jos; Vogler, Steffen; Wu, Carole-Jean; Zhang, Luyao
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Databases, Computer Science - Information Retrieval
More Details: Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly-used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, complete, yet concise.
Comment: Published in the NeurIPS 2024 Datasets and Benchmarks Track. A shorter version appeared earlier in the Proceedings of the ACM SIGMOD/PODS '24 Data Management for End-to-End Machine Learning (DEEM) Workshop: https://dl.acm.org/doi/10.1145/3650203.3663326
Document Type: Working Paper
DOI: 10.1145/3650203.3663326
Access URL: http://arxiv.org/abs/2403.19546
Accession Number: edsarx.2403.19546
Database: arXiv