FAIR Data,
Grounded AI

A knowledge graph linking 9.3 million data citations across nearly 2,000 data repositories — demonstrating how FAIR principles power trustworthy, traceable AI answers grounded in structured evidence.

101M RDF Triples · 9.3M Data Citations · 1.3M Publications · 3.2M Unique Datasets · 1,979 Repositories · 3 Linked Data Sources
Why It Matters

Grounded AI vs Ungrounded AI

The same question, two approaches. One traces every claim to structured evidence in a knowledge graph. The other relies on statistical patterns from training data — and invents plausible-sounding facts.

Knowledge-grounded (GraphRAG)
“Which repository has the most data citations?”
ENA (European Nucleotide Archive) leads with 3.76 million data citations across 1.34 million unique datasets, followed by PDB with 1.73M citations. Of the nearly 2,000 repositories tracked, only Figshare and Dryad mint DOIs — the vast majority use repository-specific accession numbers.

✓ Every number verified against the knowledge graph.
Repositories graph (pre-computed stats) → Citations graph (CZI text-mining)
SPARQL query executed · 9 results returned · Full provenance available
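A query behind an answer like this might look as follows in SPARQL. The prefix and property names here are assumptions for illustration only; the graph's actual custom ontology will differ.

```sparql
# Illustrative sketch: the ex: prefix and predicate names are
# assumed, not the knowledge graph's real ontology.
PREFIX ex: <https://example.org/ontology/>

SELECT ?repository ?citationCount
WHERE {
  ?repository a ex:Repository ;
              ex:citationCount ?citationCount .
}
ORDER BY DESC(?citationCount)
LIMIT 10
```

Because the repository statistics are pre-computed in a dedicated named graph, a ranking query like this can return instantly rather than aggregating 9.3M citation links at query time.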
Ungrounded LLM (no knowledge graph)
“Which repository has the most data citations?”
Based on general knowledge, GenBank is likely the most cited data repository, with ✗ millions of citations across the life sciences. Other highly cited repositories include ✗ ArrayExpress and the ✗ NCBI Sequence Read Archive. Data citation practices vary widely across disciplines.

✗ Fabricated specifics — GenBank is not in this corpus. ArrayExpress and SRA are not tracked. No actual numbers provided.
⚠ No data source queried · No provenance chain · No verification possible · Trained on statistical patterns only
Knowledge-grounded (GraphRAG)
“What percentage of publications include ORCID identifiers?”
ORCID adoption in data-citing publications has grown from 8% in 2012 to 74% in 2023 — a near-tenfold increase. This trajectory suggests the research community is steadily embracing persistent author identifiers, though 26% of recent publications still lack any ORCID.

✓ Year-by-year data from 1.3M publications in the knowledge graph.
Analytics graph (pre-computed) → Publications graph (Crossref)
SPARQL query executed · 14 years of data · Full provenance available
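The "near-tenfold" claim is plain arithmetic on the two endpoint figures quoted in the answer above:

```python
# ORCID adoption rates cited in the grounded answer (percent of
# data-citing publications with at least one author ORCID).
adoption_2012 = 8
adoption_2023 = 74

fold_change = adoption_2023 / adoption_2012   # 9.25 -> "near-tenfold"
still_missing = 100 - adoption_2023           # 26% still lack any ORCID

print(f"{fold_change:.2f}x growth; {still_missing}% without ORCID")
# -> 9.25x growth; 26% without ORCID
```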
Ungrounded LLM (no knowledge graph)
“What percentage of publications include ORCID identifiers?”
ORCID adoption has been growing steadily. As of recent estimates, ✗ approximately 40-50% of scholarly publications include at least one ORCID. ✗ A 2022 study found that ORCID usage reached about 45% in major publisher outputs. Adoption varies significantly by discipline and publisher.

✗ Vague and inaccurate — no specific dataset, no year-by-year trend, invented study reference.
⚠ No data source queried · Invented citation · Approximate numbers from training data · Cannot be verified

FAIR data doesn’t just improve data management — it makes AI trustworthy. Every answer from the knowledge graph comes with a provenance trail: which data sources were queried, which graphs were traversed, and how many results were returned. Try it yourself →

Measuring FAIR

FAIR Scorecard

FAIR isn’t abstract — it’s measurable. None of these scores are 100%, and that’s the point. Even partial FAIR compliance unlocks powerful cross-source queries, traceable AI, and insights that would be impossible with siloed data. Imagine what becomes possible as these numbers climb.

F · Findable · 33%
33% of datasets have persistent identifiers (DOIs). The remaining 67% use repository-specific accession numbers.

A · Accessible · 70%
70% of publications include at least one author ORCID, enabling traceable attribution and access to creator profiles.

I · Interoperable · 85%
85% of datasets have standardised subject classifications, enabling cross-repository discovery and linking.

R · Reusable · 60%
60% of datasets have machine-readable licenses. 52% of publications include funder metadata for provenance.

Scores derived from 1.3M publications (Crossref) and 1.0M DOI-minted datasets (DataCite) in the knowledge graph. February 2026 snapshot.
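Each score is a simple coverage ratio. A minimal sketch, using the rounded headline counts from this page (which is why it lands a couple of points under the reported 33% — the live score uses exact counts):

```python
def fair_score(n_meeting: int, n_total: int) -> int:
    """Percentage of records meeting a FAIR criterion, rounded."""
    return round(100 * n_meeting / n_total)

# Findability, from the rounded counts quoted on this page:
# 1.0M DOI-minted datasets out of 3.2M unique datasets.
print(fair_score(1_000_000, 3_200_000))  # -> 31 (exact counts give 33)
```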

This knowledge graph was built from nearly 2,000 repositories and three data sources. There are thousands more. Every additional FAIR-compliant repository, every DOI minted instead of an accession number, every ORCID added to a publication expands what's queryable. The 33% Findability score isn't a failure; it's 1.0 million datasets already discoverable through persistent identifiers, with 2.2 million more waiting to become machine-readable. Perfect shouldn't be the enemy of good — and good is already remarkably powerful.

The Data

1,979 Repositories, One Graph

Data from domain-specific repositories (ENA, PDB, GEO, UniProt) and general-purpose repositories (Figshare, Dryad, ICPSR) — each with different identifier practices — unified through a knowledge graph. Showing the top 12 by citation count.

Repository · Citations · Datasets · Publications
European Nucleotide Archive 3,755,354 1,341,067 993,399
Protein Data Bank 1,729,783 212,259 432,862
dbSNP 890,431 95,294 61,137
CCDC 684,149 660,698 297,412
GEO 489,706 92,391 89,619
NCBI RefSeq 259,548 97,441 84,297
UniProt 257,986 87,420 76,172
ICPSR 242,946 17,929 30,277
BioProject 113,611 68,026 56,957
Figshare 112,594 107,013 28,102
Dryad 109,041 44,394 44,509
Ensembl 106,217 35,771 40,252

Showing top 12 of 1,979 repositories by citation count. Numbers queried live from the knowledge graph.

A note on precision. Citation links are extracted by CZI’s text mining, which pattern-matches accession numbers in full-text papers. For accession-based repositories (ENA, PDB, dbSNP), this produces false positives — “A549” is a cell line, not a nucleotide sequence; “6MWT” is a clinical test, not a protein structure. DOI-minted repositories (Figshare, Dryad, Zenodo) don’t have this problem: a DOI is unambiguous. This is precisely the argument for persistent identifiers — FAIR data is not just findable, it’s distinguishable from noise.
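The ambiguity is easy to reproduce. A hedged sketch with a deliberately simplified accession-style pattern — real text-mining pipelines are far more sophisticated, and the pattern and token list here are illustrative only:

```python
import re

# Naive accession-style pattern: uppercase letters followed by digits.
# Real repository accession formats are stricter, but full-text mining
# still collides with look-alike tokens such as cell-line names.
ACCESSION_LIKE = re.compile(r"[A-Z]{1,4}\d{3,6}")

tokens = [
    "A549",      # cell line, not a nucleotide sequence
    "6MWT",      # clinical test (starts with a digit; this naive
                 # pattern happens to miss it)
    "GSE12345",  # a plausible GEO-style accession
]
hits = [t for t in tokens if ACCESSION_LIKE.fullmatch(t)]
print(hits)  # -> ['A549', 'GSE12345']  (the cell line is a false positive)
```

A DOI such as `10.5061/dryad.xxxx` never collides with patterns like this, which is the unambiguity argument made above.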

How It Works

From Data to Grounded AI

Three open data sources are linked through a knowledge graph, enabling AI that can trace every answer back to its evidence. This is GraphRAG in practice — retrieval-augmented generation grounded in structured, FAIR data.

CZI Text Mining

9.3M data citations extracted from full-text publications by the Chan Zuckerberg Initiative. The raw link between papers and datasets.

Crossref Enrichment

1.3M publications enriched with titles, authors, ORCIDs, journals, funders, and citation counts. The scholarly metadata layer.

DataCite Metadata

1.0M datasets with creators, subjects, licenses, and download counts. Rich descriptive metadata for DOI-minted research data.

Knowledge Graph

101M triples in GraphDB, modelled with a custom ontology. Seven named graphs with pre-computed analytics for instant queries.

AI Chat (GraphRAG)

Natural language questions are translated to SPARQL, executed against the graph, and interpreted with full provenance. Every answer is traceable.
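In outline, the translate-execute-interpret loop described above can be sketched as follows. Every name, the query string, and the inline result are illustrative stand-ins, not the production system; the citation count is the ENA figure from the table above.

```python
def nl_to_sparql(question: str) -> str:
    """Stand-in for the LLM translation step (illustrative only)."""
    return ("SELECT ?repo ?citations WHERE { ?repo ?p ?citations } "
            "ORDER BY DESC(?citations) LIMIT 10")

def run_query(sparql: str) -> list[dict]:
    """Stand-in for GraphDB execution; returns mock bindings."""
    return [{"repo": "European Nucleotide Archive",
             "citations": 3_755_354}]

def answer(question: str) -> dict:
    """GraphRAG in miniature: translate, execute, answer with provenance."""
    sparql = nl_to_sparql(question)
    results = run_query(sparql)
    return {
        "answer": results[0],
        "provenance": {
            "query": sparql,                          # what was asked
            "graphs": ["repositories", "citations"],  # what was traversed
            "result_count": len(results),             # what came back
        },
    }

resp = answer("Which repository has the most data citations?")
print(resp["answer"]["repo"])  # -> European Nucleotide Archive
```

The design point is the returned `provenance` dict: because the answer is assembled from query results rather than generated text, every claim carries the query, graphs, and result count needed to verify it.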

Open & Reproducible

Built on open data (DataCite Data Citation Corpus), open standards (RDF, SPARQL, FAIR), and open-source tools. Fully reproducible.

DataCite Data Citation Corpus (citation links via CZI) + Crossref (publication metadata) + DataCite (dataset metadata) → Knowledge Graph (101M triples in GraphDB) → Grounded AI (traceable answers via GraphRAG)