FAIR Data,
Grounded AI

A knowledge graph linking 9.3 million data citations across nearly 2,000 data repositories — demonstrating how FAIR principles power trustworthy, traceable AI answers grounded in structured evidence.

101M RDF Triples · 9.3M Data Citations · 1.3M Publications · 3.2M Unique Datasets · 1,979 Repositories · 3 Linked Data Sources
Why It Matters

Grounded AI vs Ungrounded AI

The same question, two approaches. One traces every claim to structured evidence in a knowledge graph. The other relies on statistical patterns from training data — and invents plausible-sounding facts.

Knowledge-grounded (GraphRAG)
“Which repository has the most data citations?”
ENA (European Nucleotide Archive) leads with 3.76 million data citations across 1.34 million unique datasets, followed by PDB with 1.73M citations. Of the nearly 2,000 repositories tracked, only Figshare and Dryad mint DOIs — the vast majority use repository-specific accession numbers.

✓ Every number verified against the knowledge graph.
Repositories graph (pre-computed stats) → Citations graph (CZI text-mining)
SPARQL query executed · 9 results returned · Full provenance available
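A query behind an answer like this might look as follows in SPARQL. The prefix and property names here are assumptions for illustration only; the graph's actual custom ontology will differ.

```sparql
# Illustrative sketch: the ex: prefix and predicate names are
# assumed, not the knowledge graph's real ontology.
PREFIX ex: <https://example.org/ontology/>

SELECT ?repository ?citationCount
WHERE {
  ?repository a ex:Repository ;
              ex:citationCount ?citationCount .
}
ORDER BY DESC(?citationCount)
LIMIT 10
```

Because the repository statistics are pre-computed in a dedicated named graph, a ranking query like this can return instantly rather than aggregating 9.3M citation links at query time.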
Ungrounded LLM (no knowledge graph)
“Which repository has the most data citations?”
Based on general knowledge, GenBank is likely the most cited data repository, with ✗ millions of citations across the life sciences. Other highly cited repositories include ✗ ArrayExpress and the ✗ NCBI Sequence Read Archive. Data citation practices vary widely across disciplines.

✗ Fabricated specifics — GenBank is not in this corpus. ArrayExpress and SRA are not tracked. No actual numbers provided.
⚠ No data source queried · No provenance chain · No verification possible · Trained on statistical patterns only
Knowledge-grounded (GraphRAG)
“What percentage of publications include ORCID identifiers?”
ORCID adoption in data-citing publications has grown from 8% in 2012 to 74% in 2023 — a near-tenfold increase. This trajectory suggests the research community is steadily embracing persistent author identifiers, though 26% of recent publications still lack any ORCID.

✓ Year-by-year data from 1.3M publications in the knowledge graph.
Analytics graph (pre-computed) → Publications graph (Crossref)
SPARQL query executed · 14 years of data · Full provenance available
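The "near-tenfold" claim is plain arithmetic on the two endpoint figures quoted in the answer above:

```python
# ORCID adoption rates cited in the grounded answer (percent of
# data-citing publications with at least one author ORCID).
adoption_2012 = 8
adoption_2023 = 74

fold_change = adoption_2023 / adoption_2012   # 9.25 -> "near-tenfold"
still_missing = 100 - adoption_2023           # 26% still lack any ORCID

print(f"{fold_change:.2f}x growth; {still_missing}% without ORCID")
# -> 9.25x growth; 26% without ORCID
```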
Ungrounded LLM (no knowledge graph)
“What percentage of publications include ORCID identifiers?”
ORCID adoption has been growing steadily. As of recent estimates, ✗ approximately 40-50% of scholarly publications include at least one ORCID. ✗ A 2022 study found that ORCID usage reached about 45% in major publisher outputs. Adoption varies significantly by discipline and publisher.

✗ Vague and inaccurate — no specific dataset, no year-by-year trend, invented study reference.
⚠ No data source queried · Invented citation · Approximate numbers from training data · Cannot be verified

FAIR data doesn’t just improve data management — it makes AI trustworthy. Every answer from the knowledge graph comes with a provenance trail: which data sources were queried, which graphs were traversed, and how many results were returned. Try it yourself →

Measuring FAIR

FAIR Scorecard

FAIR isn’t abstract — it’s measurable. None of these scores are 100%, and that’s the point. Even partial FAIR compliance unlocks powerful cross-source queries, traceable AI, and insights that would be impossible with siloed data. Imagine what becomes possible as these numbers climb.

F · Findable · 33%
33% of datasets have persistent identifiers (DOIs). The remaining 67% use repository-specific accession numbers.

A · Accessible · 70%
70% of publications include at least one author ORCID, enabling traceable attribution and access to creator profiles.

I · Interoperable · 85%
85% of datasets have standardised subject classifications, enabling cross-repository discovery and linking.

R · Reusable · 60%
60% of datasets have machine-readable licenses. 52% of publications include funder metadata for provenance.

Scores derived from 1.3M publications (Crossref) and 1.0M DOI-minted datasets (DataCite) in the knowledge graph. February 2026 snapshot.
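Each score is a simple coverage ratio. A minimal sketch, using the rounded headline counts from this page (which is why it lands a couple of points under the reported 33% — the live score uses exact counts):

```python
def fair_score(n_meeting: int, n_total: int) -> int:
    """Percentage of records meeting a FAIR criterion, rounded."""
    return round(100 * n_meeting / n_total)

# Findability, from the rounded counts quoted on this page:
# 1.0M DOI-minted datasets out of 3.2M unique datasets.
print(fair_score(1_000_000, 3_200_000))  # -> 31 (exact counts give 33)
```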

This knowledge graph was built from nearly 2,000 repositories and three data sources. There are thousands more. Every additional FAIR-compliant repository, every DOI minted instead of an accession number, every ORCID added to a publication expands what's queryable. The 33% Findability score isn't a failure; it's 1.0 million datasets already discoverable through persistent identifiers, with 2.2 million more waiting to become machine-readable. Perfect shouldn't be the enemy of good — and good is already remarkably powerful.

The Data

1,979 Repositories, One Graph

Data from domain-specific repositories (ENA, PDB, GEO, UniProt) and general-purpose repositories (Figshare, Dryad, ICPSR) — each with different identifier practices — unified through a knowledge graph. Showing the top 12 by citation count.

Repository · Citations · Datasets · Publications
European Nucleotide Archive 3,755,354 1,341,067 993,399
Protein Data Bank 1,729,783 212,259 432,862
dbSNP 890,431 95,294 61,137
CCDC 684,149 660,698 297,412
GEO 489,706 92,391 89,619
NCBI RefSeq 259,548 97,441 84,297
UniProt 257,986 87,420 76,172
ICPSR 242,946 17,929 30,277
BioProject 113,611 68,026 56,957
Figshare 112,594 107,013 28,102
Dryad 109,041 44,394 44,509
Ensembl 106,217 35,771 40,252

Showing top 12 of 1,979 repositories by citation count. Numbers queried live from the knowledge graph.

A note on precision. Citation links are extracted by CZI’s text mining, which pattern-matches accession numbers in full-text papers. For accession-based repositories (ENA, PDB, dbSNP), this produces false positives — “A549” is a cell line, not a nucleotide sequence; “6MWT” is a clinical test, not a protein structure. DOI-minted repositories (Figshare, Dryad, Zenodo) don’t have this problem: a DOI is unambiguous. This is precisely the argument for persistent identifiers — FAIR data is not just findable, it’s distinguishable from noise.
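The ambiguity is easy to reproduce. A hedged sketch with a deliberately simplified accession-style pattern — real text-mining pipelines are far more sophisticated, and the pattern and token list here are illustrative only:

```python
import re

# Naive accession-style pattern: uppercase letters followed by digits.
# Real repository accession formats are stricter, but full-text mining
# still collides with look-alike tokens such as cell-line names.
ACCESSION_LIKE = re.compile(r"[A-Z]{1,4}\d{3,6}")

tokens = [
    "A549",      # cell line, not a nucleotide sequence
    "6MWT",      # clinical test (starts with a digit; this naive
                 # pattern happens to miss it)
    "GSE12345",  # a plausible GEO-style accession
]
hits = [t for t in tokens if ACCESSION_LIKE.fullmatch(t)]
print(hits)  # -> ['A549', 'GSE12345']  (the cell line is a false positive)
```

A DOI such as `10.5061/dryad.xxxx` never collides with patterns like this, which is the unambiguity argument made above.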

How It Works

From Data to Grounded AI

Three open data sources are linked through a knowledge graph, enabling AI that can trace every answer back to its evidence. This is GraphRAG in practice — retrieval-augmented generation grounded in structured, FAIR data.

CZI Text Mining

9.3M data citations extracted from full-text publications by the Chan Zuckerberg Initiative. The raw link between papers and datasets.

Crossref Enrichment

1.3M publications enriched with titles, authors, ORCIDs, journals, funders, and citation counts. The scholarly metadata layer.

DataCite Metadata

1.0M datasets with creators, subjects, licenses, and download counts. Rich descriptive metadata for DOI-minted research data.

Knowledge Graph

101M triples in GraphDB, modelled with a custom ontology. Seven named graphs with pre-computed analytics for instant queries.

AI Chat (GraphRAG)

Natural language questions are translated to SPARQL, executed against the graph, and interpreted with full provenance. Every answer is traceable.
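In outline, the translate-execute-interpret loop described above can be sketched as follows. Every name, the query string, and the inline result are illustrative stand-ins, not the production system; the citation count is the ENA figure from the table above.

```python
def nl_to_sparql(question: str) -> str:
    """Stand-in for the LLM translation step (illustrative only)."""
    return ("SELECT ?repo ?citations WHERE { ?repo ?p ?citations } "
            "ORDER BY DESC(?citations) LIMIT 10")

def run_query(sparql: str) -> list[dict]:
    """Stand-in for GraphDB execution; returns mock bindings."""
    return [{"repo": "European Nucleotide Archive",
             "citations": 3_755_354}]

def answer(question: str) -> dict:
    """GraphRAG in miniature: translate, execute, answer with provenance."""
    sparql = nl_to_sparql(question)
    results = run_query(sparql)
    return {
        "answer": results[0],
        "provenance": {
            "query": sparql,                          # what was asked
            "graphs": ["repositories", "citations"],  # what was traversed
            "result_count": len(results),             # what came back
        },
    }

resp = answer("Which repository has the most data citations?")
print(resp["answer"]["repo"])  # -> European Nucleotide Archive
```

The design point is the returned `provenance` dict: because the answer is assembled from query results rather than generated text, every claim carries the query, graphs, and result count needed to verify it.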

Open & Reproducible

Built on open data (DataCite Data Citation Corpus), open standards (RDF, SPARQL, FAIR), and open-source tools. Fully reproducible.

DataCite Data Citation Corpus (citation links via CZI) + Crossref (publication metadata) + DataCite (dataset metadata) → Knowledge Graph (101M triples in GraphDB) → Grounded AI (traceable answers via GraphRAG)