Croissant is a metadata format design to support sharing of datasets for machine learning applications. It is a platform-agnostic schema used to standardize metadata in data repositories like Hugging Face, kaggle, Dataverse and OpenML.[1][2]
Croissant builds upon schema.org, uses primarily JSON-LD, and divides metadata in four "layers": Dataset Metadata, Resource, Structure and Semantic:[1][3]
The Dataset Metadata layer constrains which schema.org properties should be used, including additional properties, linking together the resources (files) of the dataset with general metadata, like licensing and citation information.
The Resource layer describes the individual files and sets of those using two new classes, FileObject and FileSet. A FileSet may be a collection of related images.
The Structure layer specifies how the files are organized in the dataset. A RecordSet class describes how resources are present, configurations that may very a lot between modality. This specification facilitates interoperability of the datasets.
Finally, the Semantic layer adds information for practical reuse of the dataset, such as splits for train, test and validation subsets.
It also provides a default extension for metadata related to responsible AI.[1][2]
Variations of Croissant are developed to support datasets in different areas of research, such as Geo-Croissant for geospatial datasets.[9] Other technical extensions, such as support for RDF, soon followed.[10][11]
This page is based on this Wikipedia article Text is available under the CC BY-SA 4.0 license; additional terms may apply. Images, videos and audio are available under their respective licenses.