Croissant (metadata format)

Croissant is a metadata format designed to support the sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories such as Hugging Face, Kaggle, Dataverse and OpenML. [1] [2]

Structure

Croissant builds on schema.org, is expressed primarily in JSON-LD, and divides metadata into four layers: Dataset Metadata, Resource, Structure and Semantic. [1] [3]
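As a rough illustration of how the four layers might sit together in one JSON-LD document, the following Python sketch builds a small Croissant-style description using only the standard library. The dataset name, file name and URL are hypothetical placeholders, the `@context` is heavily abbreviated compared with a real Croissant file, and the assignment of properties to layers is an assumption made for illustration; the result has not been validated against the Croissant specification.

```python
import json


def build_croissant_sketch() -> dict:
    """Return a minimal, unvalidated Croissant-style JSON-LD document.

    All names, identifiers and URLs are hypothetical placeholders.
    """
    return {
        # Abbreviated context (assumption): real Croissant files declare more terms.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        # Dataset Metadata layer: schema.org-style description of the dataset.
        "@type": "sc:Dataset",
        "name": "toy_reviews",
        "description": "A hypothetical dataset of short text reviews.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # Resource layer: the files that make up the dataset.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "reviews.csv",
                "contentUrl": "https://example.org/reviews.csv",
                "encodingFormat": "text/csv",
            }
        ],
        # Structure layer: how records and fields map onto the resources.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "reviews",
                "field": [
                    {
                        "@type": "cr:Field",
                        "@id": "reviews/text",
                        # Semantic layer (assumed mapping): a typed meaning for the field.
                        "dataType": "sc:Text",
                        "source": {
                            "fileObject": {"@id": "reviews.csv"},
                            "extract": {"column": "text"},
                        },
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Serialize to JSON-LD text, as it might be stored alongside the dataset.
    print(json.dumps(build_croissant_sketch(), indent=2))
```

Because the document is ordinary JSON-LD, it can be stored next to the data files or embedded in a dataset landing page, where generic schema.org tooling can read the dataset-level properties without understanding the Croissant-specific ones.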

It also provides a default extension for metadata related to responsible AI. [1] [2]

The use of a standard machine-readable structure increases, for example, the discoverability of datasets in search engines such as Google Dataset Search. [4] [5]

History

Croissant was posted on arXiv in March 2024 and published in the proceedings of NeurIPS 2024. [1] [6] [7] It began as a community-driven effort within the MLCommons Croissant Working Group, which includes stakeholder organizations from academia and industry such as Google, the Open Data Institute, Sage Bionetworks and King's College London. [1] [8]

Variants of Croissant have been developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets. [9] Other technical extensions, such as support for RDF, soon followed. [10] [11]

References

  1. Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Gijsbers, Pieter; Giner-Miguelez, Joan; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis; Krishna, Satyapriya; Kuchnik, Michael; Lesage, Sylvain; Lhoest, Quentin; Marcenac, Pierre; Maskey, Manil (2024-12-16). "Croissant: A Metadata Format for ML-Ready Datasets". Advances in Neural Information Processing Systems. 37: 82133–82148.
  2. Bischl, Bernd; Casalicchio, Giuseppe; Das, Taniya; Feurer, Matthias; Fischer, Sebastian; Gijsbers, Pieter; Mukherjee, Subhaditya; Müller, Andreas C.; Németh, László; Oala, Luis; Purucker, Lennart; Ravi, Sahithya; Rijn, Jan N. van; Singh, Prabhant; Vanschoren, Joaquin (2025-07-11). "OpenML: Insights from 10 years and more than a thousand papers". Patterns. 6 (7) 101317. doi:10.1016/j.patter.2025.101317. ISSN 2666-3899. PMC 12416095. PMID 40926970.
  3. Meroño-Peñuela, Albert; Simperl, Elena; Kurteva, Anelia; Reklos, Ioannis (2025-05-01). "KG.GOV: Knowledge graphs as the backbone of data governance in AI". Journal of Web Semantics. 85 100847. doi:10.1016/j.websem.2024.100847. ISSN 1570-8268.
  4. Giner-Miguelez, Joan; Gómez, Abel; Cabot, Jordi (2025-01-13). "On the Readiness of Scientific Data Papers for a Fair and Transparent Use in Machine Learning". Scientific Data. 12 (1): 61. Bibcode:2025NatSD..12...61G. doi:10.1038/s41597-025-04402-4. ISSN 2052-4463. PMC 11730645. PMID 39805856.
  5. Hulsebos, Madelon; Lin, Wenjing; Shankar, Shreya; Parameswaran, Aditya (2024-06-18). "It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?". Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics. HILDA 24. New York, NY, USA: Association for Computing Machinery. pp. 1–4. doi:10.1145/3665939.3665959. ISBN 979-8-4007-0693-6.
  6. "NeurIPS Poster Croissant: A Metadata Format for ML-Ready Datasets". neurips.cc. Retrieved 2025-10-14.
  7. Akhtar, Mubashara; Benjelloun, Omar; Conforti, Costanza; Foschini, Luca; Giner-Miguelez, Joan; Gijsbers, Pieter; Goswami, Sujata; Jain, Nitisha; Karamousadakis, Michalis (2024-12-09), "Croissant: A Metadata Format for ML-Ready Datasets", Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, pp. 1–6, arXiv:2403.19546, doi:10.1145/3650203.3663326, ISBN 979-8-4007-0611-0
  8. "Transforming AI data governance with Croissant: a new standard for ML metadata". The ODI. 2024-06-19. Retrieved 2025-10-14.
  9. Shinde, Rajat; Koehl, Derek (NASA IMPACT) (2024-03-28). "Introducing Croissant: A Format for Machine Learning Datasets | NASA Earthdata". www.earthdata.nasa.gov. Retrieved 2025-10-14.
  10. Bolleman, Jerven. "An assessment of Croissant ML metadata descriptors for AI-ready datasets". osf.io. doi:10.37044/osf.io/4sgdq_v1. Retrieved 2025-10-14.
  11. Steinberg, David. "Bridging Machine Learning and Semantic Web: A Case Study on Converting Hugging Face Metadata to RDF". osf.io. Retrieved 2025-10-14.