Executive Order 14168: This repository is under review for potential modification in compliance with Administration directives.
📚 View Documentation & Examples · CLI Reference
A LinkML schema for the Datasheets for Datasets model as published in Datasheets for Datasets. Inspired by datasheets as used in the electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on". To this end the authors created a series of topics and over 50 questions addressing different aspects of datasets, also useful in an AI/ML context. An example of a completed datasheet can be found here: Structured dataset documentation: a datasheet for CheXpert
Google is working with a different model called Data Cards, which in practice is close to the original Datasheets for Datasets template.
This repository stores a LinkML schema representation for the original Datasheets for Datasets model, representing the topics, sets of questions, and expected entities and fields in the answers (work in progress). Beyond a less structured markdown template for this model (e.g. template for datasheet for dataset) we are not aware of any other structured form representing Datasheets for Datasets.
We are also tracking related developments, such as augmented Datasheets for Datasets models as in Augmented Datasheets for Speech Datasets and Ethical Decision-Making.
Curated comprehensive datasheets for each Bridge2AI data-generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
The D4D-Core schema is the curated, interop-focused subset of D4D — the recommended starting point for new datasheets and for systems that exchange datasheets with RO-Crate / FAIRSCAPE / DCAT consumers. Every slot in d4d-core is paired with a SKOS-aligned external term in the Semantic Exchange Layer.
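To illustrate the slot-to-external-term pairing, here is a minimal Python sketch. The slot names and mappings below are hypothetical examples invented for illustration, not copied from the actual schema or the SKOS alignment file.

```python
# Illustrative sketch only: the slots and external terms below are
# hypothetical examples of a d4d-core -> Semantic Exchange Layer pairing,
# not taken from the real schema artifacts.
SEMANTIC_EXCHANGE = {
    # d4d-core slot     (SKOS predicate, external term) -- assumptions
    "title":            ("skos:exactMatch", "dcterms:title"),
    "license":          ("skos:exactMatch", "dcterms:license"),
    "download_url":     ("skos:closeMatch", "dcat:downloadURL"),
}

def external_term(slot):
    """Return the (predicate, external term) pair for a d4d-core slot, or None."""
    return SEMANTIC_EXCHANGE.get(slot)

print(external_term("title"))  # ('skos:exactMatch', 'dcterms:title')
```

The authoritative pairings live in the SKOS alignment and SSSOM files described below.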
| Artifact | Path | Description |
|---|---|---|
| Source schema | `src/data_sheets_schema/schema/data_sheets_schema_core.yaml` | Core schema entry point (imports `D4D_Core.yaml`) |
| Module | `src/data_sheets_schema/schema/D4D_Core.yaml` | `CoreDataset`, `CoreDatasetCollection`, `CoreDistribution` definitions |
| Merged form | `src/data_sheets_schema/schema/data_sheets_schema_core_all.yaml` | Single-file merged schema (auto-generated) |
| HTML examples | Bridge2AI generating-center datasheets (above) | Curated d4d-core renderings |
| Validate / build | `make validate-core`, `make gen-core-schema`, `make lint-core` | Core-schema-only Make targets |
Scope: ~95 fields across CoreDataset, CoreDatasetCollection, CoreDistribution and supporting classes (Person, Organization, Creator, Grant, FundingMechanism). The full schema (data_sheets_schema.yaml, ~284 attributes) remains the extended reservoir.
The Semantic Exchange Layer is the canonical SKOS + SSSOM mapping that lets a D4D datasheet round-trip through RO-Crate, FAIRSCAPE EVI, schema.org, DCAT, and Croissant RAI. All artifacts live in two directories:
| Artifact | Path | Description |
|---|---|---|
| SKOS alignment (authoritative) | `src/data_sheets_schema/semantic_exchange/d4d_rocrate_skos_alignment.ttl` | 100+ `skos:exactMatch` / `closeMatch` / `relatedMatch` triples |
| Semantic SSSOM | `src/data_sheets_schema/semantic_exchange/d4d_rocrate_sssom_mapping.tsv` | 19-column SSSOM with `json_path` / `pydantic` / `interface` columns |
| URI SSSOM | `d4d_rocrate_sssom_uri_mapping.tsv` + `_comprehensive.tsv` | Auto-regenerated URI variants |
| Structural SSSOM | `data/semantic_exchange/d4d_rocrate_structural_mapping.sssom.tsv` | sssom-py-compatible 17-column structural mapping |
| Generators | `src/semantic_exchange/` | Scripts that derive the URI/comprehensive/structural variants |
| Tests | `tests/test_semantic_exchange/` + `tests/test_fairscape_integration/` | SSSOM column/structure validation |
| Add a new mapping | `/d4d-add-mapping` Claude Code skill (command) | Schema-driven workflow for new SSSOM rows |
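The tests validate SSSOM column structure; a minimal sketch of such a check is shown below. The sample row is invented for illustration, but the four required columns (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`) come from the SSSOM specification; the repo's actual tests may check more.

```python
import csv
import io

# Columns the SSSOM spec requires on every mapping row.
REQUIRED = {"subject_id", "predicate_id", "object_id", "mapping_justification"}

# Invented sample row for illustration only.
sample = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "d4d:title\tskos:exactMatch\tdcterms:title\tsemapv:ManualMappingCuration\n"
)

def missing_columns(tsv_text):
    """Return the set of required SSSOM columns absent from a TSV header."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return REQUIRED - set(reader.fieldnames or [])

print(missing_columns(sample))  # set()
```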
Build / validate:

```shell
make gen-sssom-all    # regenerate URI + comprehensive + structural variants
poetry run pytest tests/test_semantic_exchange tests/test_fairscape_integration -v
```

Browse the source code repository:
- `src/data/examples/` - example YAML data
- `project/` - project files (do not edit these)
- `src/` - source files (edit these)
  - `src/data_sheets_schema/schema/` - LinkML schema (edit this); `data_sheets_schema_core.yaml` is the d4d-core entry point
  - `src/data_sheets_schema/semantic_exchange/` - canonical SKOS + SSSOM exchange-layer artifacts
  - `src/data_sheets_schema/datamodel/` - generated Python datamodel
  - `src/semantic_exchange/` - SSSOM/SKOS generator scripts
- `data/semantic_exchange/` - structural SSSOM + analysis docs
- `tests/` - Python tests (`test_semantic_exchange/`, `test_fairscape_integration/`, …)
This branch introduces a unified d4d CLI for the Datasheets for Datasets workflow. The command is exposed through Poetry:
```shell
poetry install
poetry run d4d --help
```

After installation you can also invoke it as `d4d`, but `poetry run d4d` is the safest form while developing in the repo.
Most subcommands currently expect a repository checkout because they import repo-local code from src/ and .claude/agents/scripts/.
The CLI is organized into six top-level groups:
- `d4d download`: fetch, preprocess, and concatenate source materials
- `d4d evaluate`: run presence-based and LLM-based evaluations
- `d4d render`: render datasheets and evaluation outputs to HTML
- `d4d rocrate`: parse, merge, and transform RO-Crate metadata
- `d4d schema`: inspect schema metrics and validate D4D YAML
- `d4d utils`: inspect pipeline status and validate preprocessing output
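The two-level group/command layout can be sketched with stdlib `argparse`. This is a hypothetical illustration of the structure only; the real `d4d` CLI is exposed through Poetry and its actual implementation, subcommands, and options are documented in the CLI Reference.

```python
import argparse

# Hypothetical sketch of the six top-level groups; subcommand names are
# taken from the examples in this README, not an exhaustive listing.
def build_parser():
    parser = argparse.ArgumentParser(prog="d4d")
    groups = parser.add_subparsers(dest="group", required=True)
    for name, commands in {
        "download": ["sources", "preprocess", "concatenate"],
        "evaluate": ["presence", "llm"],
        "render": ["html", "evaluation", "generate-all"],
        "rocrate": ["parse", "merge", "transform"],
        "schema": ["validate"],
        "utils": ["status"],
    }.items():
        sub = groups.add_parser(name).add_subparsers(dest="command", required=True)
        for command in commands:
            # Not every real subcommand takes --project; illustrative only.
            sub.add_parser(command).add_argument("--project")
    return parser

args = build_parser().parse_args(["download", "sources", "--project", "AI_READI"])
print(args.group, args.command, args.project)  # download sources AI_READI
```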
Full option-by-option documentation is available in the docs site: CLI Reference.
Download, preprocess, and concatenate source documents for one project:
```shell
poetry run d4d download sources --project AI_READI
poetry run d4d download preprocess --project AI_READI
poetry run d4d download concatenate --project AI_READI
```

Evaluate generated datasheets:
```shell
poetry run d4d evaluate presence --project AI_READI --method gpt5
poetry run d4d evaluate llm \
    --file data/d4d_concatenated/gpt5/AI_READI_d4d.yaml \
    --project AI_READI \
    --method gpt5 \
    --rubric both
```

Render and validate outputs:
```shell
poetry run d4d render html \
    docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
    -o /tmp/AI_READI_d4d.html

poetry run d4d render html \
    docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
    --template linkml \
    -o /tmp/AI_READI_d4d_linkml.html

poetry run d4d render evaluation \
    data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json \
    -o /tmp/AI_READI_evaluation.html

poetry run d4d schema validate docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml

poetry run d4d utils status --quick
```

Notes:

- `d4d evaluate llm` requires `ANTHROPIC_API_KEY`.
- `d4d render html --template human-readable` renders a single datasheet YAML file to the exact `--output` path you provide and copies `datasheet-common.css` into the same directory so the HTML remains styled when opened directly.
- `d4d render html --template linkml` renders the same structured input into the more technical LinkML-style HTML view.
- `d4d render evaluation` renders evaluation JSON directly and auto-detects `rubric10` vs `rubric20` unless you specify `--rubric`.
- Evaluation naming is now consistent: if you omit `-o`, rubric10 outputs default to `*_evaluation.html`, while rubric20 outputs default to `*_evaluation_rubric20.html`.
- `d4d render generate-all` is a convenience command that points users to the bulk HTML generation workflow (`make gen-d4d-html`).
- `d4d schema` and `d4d rocrate` rely on helper scripts in `.claude/agents/scripts/`, so running from a repository checkout is important.
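Rubric auto-detection could plausibly be inferred from the evaluation file's path, as in this illustrative sketch; the real `d4d render evaluation` logic may inspect the JSON contents instead and is not shown here.

```python
from pathlib import Path

# Hypothetical guess at rubric auto-detection: look for "rubric20" anywhere
# in the path, falling back to rubric10. Illustrative only.
def detect_rubric(path):
    return "rubric20" if "rubric20" in Path(path).as_posix() else "rubric10"

print(detect_rubric(
    "data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json"
))  # rubric10
```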
Details
Use the `make` command to generate project artefacts:

- `make all`: make everything
- `make deploy`: deploys site
This project was made with linkml-project-cookiecutter.