Executive Order 14168: This repository is under review for potential modification in compliance with Administration directives.
📚 View Documentation & Examples · CLI Reference
A LinkML schema for the Datasheets for Datasets model as published in Datasheets for Datasets. Inspired by datasheets as used in the electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on". To this end the authors created a series of topics and over 50 questions addressing different aspects of datasets, also useful in an AI/ML context. An example of a completed datasheet can be found here: Structured dataset documentation: a datasheet for CheXpert
Google is working with a different model called Data Cards, which in practice is close to the original Datasheets for Datasets template.
This repository stores a LinkML schema representation for the original Datasheets for Datasets model, representing the topics, sets of questions, and expected entities and fields in the answers (work in progress). Beyond a less structured markdown template for this model (e.g. template for datasheet for dataset) we are not aware of any other structured form representing Datasheets for Datasets.
We are also tracking related developments, such as augmented Datasheets for Datasets models as in Augmented Datasheets for Speech Datasets and Ethical Decision-Making.
Curated comprehensive datasheets for each Bridge2AI data-generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
The D4D-Core schema is the curated, interop-focused subset of D4D — the recommended starting point for new datasheets and for systems that exchange datasheets with RO-Crate / FAIRSCAPE / DCAT consumers. Every slot in d4d-core is paired with a SKOS-aligned external term in the Semantic Exchange Layer.
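To illustrate the slot-to-external-term pairing, here is a minimal Python sketch. The slot names and mappings below are hypothetical examples invented for illustration, not copied from the actual schema or the SKOS alignment file.

```python
# Illustrative sketch only: the slots and external terms below are
# hypothetical examples of a d4d-core -> Semantic Exchange Layer pairing,
# not taken from the real schema artifacts.
SEMANTIC_EXCHANGE = {
    # d4d-core slot     (SKOS predicate, external term) -- assumptions
    "title":            ("skos:exactMatch", "dcterms:title"),
    "license":          ("skos:exactMatch", "dcterms:license"),
    "download_url":     ("skos:closeMatch", "dcat:downloadURL"),
}

def external_term(slot):
    """Return the (predicate, external term) pair for a d4d-core slot, or None."""
    return SEMANTIC_EXCHANGE.get(slot)

print(external_term("title"))  # ('skos:exactMatch', 'dcterms:title')
```

The authoritative pairings live in the SKOS alignment and SSSOM files described below.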
| Artifact | Path | Description |
|---|---|---|
| Source schema | `src/data_sheets_schema/schema/data_sheets_schema_core.yaml` | Core schema entry point (imports `D4D_Core.yaml`) |
| Module | `src/data_sheets_schema/schema/D4D_Core.yaml` | `CoreDataset`, `CoreDatasetCollection`, `CoreDistribution` definitions |
| Merged form | `src/data_sheets_schema/schema/data_sheets_schema_core_all.yaml` | Single-file merged schema (auto-generated) |
| HTML examples | Bridge2AI generating-center datasheets (above) | Curated d4d-core renderings |
| Validate / build | `make validate-core`, `make gen-core-schema`, `make lint-core` | Core-schema-only Make targets |
Scope: ~95 fields across CoreDataset, CoreDatasetCollection, CoreDistribution and supporting classes (Person, Organization, Creator, Grant, FundingMechanism). The full schema (data_sheets_schema.yaml, ~284 attributes) remains the extended reservoir.
The Semantic Exchange Layer is the canonical SKOS + SSSOM mapping that lets a D4D datasheet round-trip through RO-Crate, FAIRSCAPE EVI, schema.org, DCAT, and Croissant RAI. All artifacts live in two directories:
| Artifact | Path | Description |
|---|---|---|
| SKOS alignment (authoritative) | `src/data_sheets_schema/semantic_exchange/d4d_rocrate_skos_alignment.ttl` | 100+ `skos:exactMatch` / `closeMatch` / `relatedMatch` triples |
| Semantic SSSOM | `src/data_sheets_schema/semantic_exchange/d4d_rocrate_sssom_mapping.tsv` | 19-column SSSOM with `json_path` / `pydantic` / `interface` columns |
| URI SSSOM | `d4d_rocrate_sssom_uri_mapping.tsv` + `_comprehensive.tsv` | Auto-regenerated URI variants |
| Structural SSSOM | `data/semantic_exchange/d4d_rocrate_structural_mapping.sssom.tsv` | sssom-py-compatible 17-column structural mapping |
| Generators | `src/semantic_exchange/` | Scripts that derive the URI/comprehensive/structural variants |
| Tests | `tests/test_semantic_exchange/` + `tests/test_fairscape_integration/` | SSSOM column/structure validation |
| Add a new mapping | `/d4d-add-mapping` Claude Code skill (command) | Schema-driven workflow for new SSSOM rows |
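The tests validate SSSOM column structure; a minimal sketch of such a check is shown below. The sample row is invented for illustration, but the four required columns (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`) come from the SSSOM specification; the repo's actual tests may check more.

```python
import csv
import io

# Columns the SSSOM spec requires on every mapping row.
REQUIRED = {"subject_id", "predicate_id", "object_id", "mapping_justification"}

# Invented sample row for illustration only.
sample = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "d4d:title\tskos:exactMatch\tdcterms:title\tsemapv:ManualMappingCuration\n"
)

def missing_columns(tsv_text):
    """Return the set of required SSSOM columns absent from a TSV header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return REQUIRED - set(reader.fieldnames or [])

print(missing_columns(sample))  # set()
```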
Build / validate:

```shell
make gen-sssom-all    # regenerate URI + comprehensive + structural variants
poetry run pytest tests/test_semantic_exchange tests/test_fairscape_integration -v
```

Browse the source code repository:
- `src/data/examples/` - example YAML data
- `project/` - project files (do not edit these)
- `src/` - source files (edit these)
  - `src/data_sheets_schema/schema/` - LinkML schema (edit this); `data_sheets_schema_core.yaml` is the d4d-core entry point
  - `src/data_sheets_schema/semantic_exchange/` - canonical SKOS + SSSOM exchange-layer artifacts
  - `src/data_sheets_schema/datamodel/` - generated Python datamodel
  - `src/semantic_exchange/` - SSSOM/SKOS generator scripts
- `data/semantic_exchange/` - structural SSSOM + analysis docs
- `tests/` - Python tests (`test_semantic_exchange/`, `test_fairscape_integration/`, …)
This branch introduces a unified d4d CLI for the Datasheets for Datasets workflow. The command is exposed through Poetry:
```shell
poetry install
poetry run d4d --help
```

After installation you can also invoke it as `d4d`, but `poetry run d4d` is the safest form while developing in the repo.
Most subcommands currently expect a repository checkout because they import repo-local code from src/ and .claude/agents/scripts/.
The CLI is organized into six top-level groups:
- `d4d download`: fetch, preprocess, and concatenate source materials
- `d4d evaluate`: run presence-based and LLM-based evaluations
- `d4d render`: render datasheets and evaluation outputs to HTML
- `d4d rocrate`: parse, merge, and transform RO-Crate metadata
- `d4d schema`: inspect schema metrics and validate D4D YAML
- `d4d utils`: inspect pipeline status and validate preprocessing output
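The two-level group/command layout can be sketched with stdlib `argparse`. This is a hypothetical illustration of the structure only; the real `d4d` CLI is exposed through Poetry and its actual implementation, subcommands, and options are documented in the CLI Reference.

```python
import argparse

# Hypothetical sketch of the six top-level groups; subcommand names are
# taken from the examples in this README, not an exhaustive listing.
def build_parser():
    parser = argparse.ArgumentParser(prog="d4d")
    groups = parser.add_subparsers(dest="group", required=True)
    for name, commands in {
        "download": ["sources", "preprocess", "concatenate"],
        "evaluate": ["presence", "llm"],
        "render": ["html", "evaluation", "generate-all"],
        "rocrate": ["parse", "merge", "transform"],
        "schema": ["validate"],
        "utils": ["status"],
    }.items():
        sub = groups.add_parser(name).add_subparsers(dest="command", required=True)
        for command in commands:
            # Not every real subcommand takes --project; illustrative only.
            sub.add_parser(command).add_argument("--project")
    return parser

args = build_parser().parse_args(["download", "sources", "--project", "AI_READI"])
print(args.group, args.command, args.project)  # download sources AI_READI
```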
Full option-by-option documentation is available in the docs site: CLI Reference.
Download, preprocess, and concatenate source documents for one project:
```shell
poetry run d4d download sources --project AI_READI
poetry run d4d download preprocess --project AI_READI
poetry run d4d download concatenate --project AI_READI
```

Evaluate generated datasheets:
```shell
poetry run d4d evaluate presence --project AI_READI --method gpt5
poetry run d4d evaluate llm \
    --file data/d4d_concatenated/gpt5/AI_READI_d4d.yaml \
    --project AI_READI \
    --method gpt5 \
    --rubric both
```

Render and validate outputs:
```shell
poetry run d4d render html \
    docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
    -o /tmp/AI_READI_d4d.html

poetry run d4d render html \
    docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
    --template linkml \
    -o /tmp/AI_READI_d4d_linkml.html

poetry run d4d render evaluation \
    data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json \
    -o /tmp/AI_READI_evaluation.html

poetry run d4d schema validate docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml

poetry run d4d utils status --quick
```

Notes:

- `d4d evaluate llm` requires `ANTHROPIC_API_KEY`.
- `d4d render html --template human-readable` renders a single datasheet YAML file to the exact `--output` path you provide and copies `datasheet-common.css` into the same directory so the HTML remains styled when opened directly.
- `d4d render html --template linkml` renders the same structured input into the more technical LinkML-style HTML view.
- `d4d render evaluation` renders evaluation JSON directly and auto-detects `rubric10` vs `rubric20` unless you specify `--rubric`.
- Evaluation naming is now consistent: if you omit `-o`, rubric10 outputs default to `*_evaluation.html`, while rubric20 outputs default to `*_evaluation_rubric20.html`.
- `d4d render generate-all` is a convenience command that points users to the bulk HTML generation workflow (`make gen-d4d-html`).
- `d4d schema` and `d4d rocrate` rely on helper scripts in `.claude/agents/scripts/`, so running from a repository checkout is important.
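Rubric auto-detection could plausibly be inferred from the evaluation file's path, as in this illustrative sketch; the real `d4d render evaluation` logic may inspect the JSON contents instead and is not shown here.

```python
from pathlib import Path

# Hypothetical guess at rubric auto-detection: look for "rubric20" anywhere
# in the path, falling back to rubric10. Illustrative only.
def detect_rubric(path):
    return "rubric20" if "rubric20" in Path(path).as_posix() else "rubric10"

print(detect_rubric(
    "data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json"
))  # rubric10
```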
Details
Use the `make` command to generate project artefacts:

- `make all`: make everything
- `make deploy`: deploys site
This project was made with linkml-project-cookiecutter.