There is a version of the genomic data management story that the field tells about itself, and it is broadly accurate in outline. The National Center for Biotechnology Information maintains more than 40 distinct repositories, knowledgebases, and services: PubMed, GenBank, RefSeq, dbSNP, ClinVar, dbGaP, the Gene Expression Omnibus, the Comparative Genomics Resource, and dozens of others. Together they represent one of the most extraordinary concentrations of biomedical information ever assembled. They are freely accessible. They are continuously updated. They are staffed by scientists who take the infrastructure's quality seriously.
And they are, according to NCBI's own 2026 annual database resources paper in Nucleic Acids Research, committed to alignment with the FAIR Guiding Principles for data management. The FAIR principles (Findable, Accessible, Interoperable, Reusable) are not aspirational decoration in NCBI's public documentation. They are stated operational commitments, with individual resources publishing details of their specific alignment efforts.
The version of the story the field tells about itself is therefore true. What it omits is the gap between commitment and capability — the specific, documented, largely unresolved distance between what FAIR requires and what the biomedical data infrastructure currently delivers.
What NCBI's Own Data Shows
The 2026 NCBI database resources paper, published by Sayers et al. in January 2026, is an annual tradition: a comprehensive account of what each NCBI resource has done in the previous year, what significant updates have been made, and what the resource's current state looks like. It is useful precisely because it is detailed — it tells the field what exists and how it functions, rather than offering aspirational framing.
Read against the FAIR principles framework, the paper reveals a picture that is more complicated than the headline numbers suggest. GenBank, the largest open repository of nucleotide sequence data in the world — holding over 2.8 trillion base pairs from more than 500,000 organisms — has made significant progress on the interoperability dimension of FAIR through the BioProject and BioSample record system, which links sequence records from the same research effort across different NCBI repositories and connects them to supporting literature in PubMed and associated funding sources.
This is genuine progress. The ability to navigate from a GenBank submission to its associated clinical data in dbGaP, to the relevant publications in PubMed, and to the funding source that enabled the research is a significant improvement over the isolated record model that characterised earlier repository design. It partially fulfils the I-principles of FAIR — the use of formal, accessible, shared, and broadly applicable language for knowledge representation — and the R1 principle requiring that data be described with accurate and relevant attributes.
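The linking model is concrete enough to sketch. NCBI's public E-utilities API exposes these cross-repository links programmatically through its elink endpoint; the fragment below is a minimal sketch, using a placeholder record identifier rather than a real study, that walks from a nucleotide record to its BioProject and on to linked PubMed entries.

```python
# A minimal sketch of cross-repository link traversal via NCBI E-utilities.
# The UID below is a placeholder, not a real record.
import requests

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def linked_uids(dbfrom: str, dbto: str, uid: str) -> list[str]:
    """Return the UIDs in `dbto` that NCBI links to `uid` in `dbfrom`."""
    resp = requests.get(
        ELINK,
        params={"dbfrom": dbfrom, "db": dbto, "id": uid, "retmode": "json"},
        timeout=30,
    )
    resp.raise_for_status()
    uids: list[str] = []
    for linkset in resp.json().get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            uids.extend(linksetdb.get("links", []))
    return uids

nuccore_uid = "123456789"  # placeholder nucleotide record UID
for project_uid in linked_uids("nuccore", "bioproject", nuccore_uid):
    pmids = linked_uids("bioproject", "pubmed", project_uid)
    print(f"BioProject {project_uid} -> PubMed {pmids}")
```

The links are real, stable, and machine-traversable; that much of the interoperability story holds up.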
What it does not fully address is the F-dimension: Findable. The principle requires that data be registered in a searchable resource with globally unique and persistent identifiers. NCBI provides persistent identifiers — accession numbers, RefSeq identifiers, PMID numbers — that are stable and widely used. But the discovery problem for biomedical data is not primarily an identifier problem. It is a description problem.
The Discovery Problem in Biomedical Data
A 2023 paper in Scientific Data identifying barriers to FAIR data practices for biomedical data, published in the context of the NIH's updated Data Management and Sharing Policy that came into effect that year and mandates timely sharing of all NIH-funded data, frames the discovery challenge precisely. If the policy is to shift the data sharing culture, improve research reproducibility, and promote data reuse, researchers must first be able to discover that relevant data exists, and that first step is compounded by problems of incentives, standardization, and coordination.
The specific mechanism of the problem is this: biomedical datasets are described using metadata that is frequently insufficient, inconsistent, or both. A researcher asking whether there are existing datasets relevant to a specific clinical question about, say, immune responses in a particular patient population, faces a discovery landscape in which the vocabulary used to describe datasets in GenBank is not necessarily the vocabulary used to describe related datasets in dbGaP, which is not necessarily the vocabulary used in the associated clinical trial records in ClinicalTrials.gov. The BioProject linking system partially addresses this, but the underlying metadata heterogeneity is a structural problem that linking alone does not resolve.
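To make the failure mode concrete, consider a toy example. The three records below are invented, as are their field names and values, but they mimic the kind of vocabulary drift that linking alone cannot repair: the same study, described three ways, of which a naive keyword search finds only one.

```python
# Hypothetical metadata for the same study as it might surface in three
# repositories. All field names and values are invented for illustration.
records = {
    "GenBank-style": {"organism": "Homo sapiens",
                      "isolation_source": "PBMC",
                      "host_disease": "type 2 diabetes"},
    "dbGaP-style": {"study_disease": "Diabetes Mellitus, Type 2",
                    "sample_type": "peripheral blood mononuclear cell"},
    "ClinicalTrials.gov-style": {"condition": "T2DM",
                                 "biospecimen": "whole blood derivative"},
}

# A naive keyword search over the raw metadata finds one record and misses
# two, even though all three describe the same condition and sample type.
query = "type 2 diabetes"
for name, record in records.items():
    hit = any(query in value.lower() for value in record.values())
    print(f"{name}: {'found' if hit else 'missed'}")
```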
This is not a problem unique to NCBI. The 2025 paper "Making Genomic Data FAIR Through Effective Data Portals," published in Scientific Data by Speir et al. from the Genomics Institute at UC Santa Cruz and EMBL-EBI, examines the challenge across the broader ecosystem of genomic data portals — ENCODE, GTEx, HuBMAP, the 4D Nucleome Data Portal, and others. The paper's central argument is that genomic data portals collect, annotate, and make data files available to researchers and increasingly to AI algorithms, but that the effectiveness of this function depends critically on the quality and consistency of the metadata that makes individual data objects discoverable and interpretable in the context of related resources.
Speir et al. identify a specific tension: the portals operated by consortium-specific Data Coordination Centers (DCCs), which hold some of the most carefully curated biomedical datasets available, are designed and maintained independently of each other, with their own metadata schemas, ontological frameworks, and interface conventions. The controlled-access data model, used by NIH for sensitive human subjects data in dbGaP and AnVIL, adds another layer of fragmentation: because the access approval process operates separately from the discovery process, it is effectively impossible for a researcher to determine, before applying for access, whether the data they need exists at all.
TRUST and the Longer Problem
The NCBI 2026 paper invokes not only FAIR but the TRUST Principles for digital repositories — Transparency, Responsibility, User Focus, Sustainability, Technology. The TRUST principles, developed to extend FAIR toward questions of institutional reliability and long-term preservation, address something that FAIR itself does not: the question of whether a repository will still exist, and will still provide access, in ten or twenty years.
For biomedical data, this is not an abstract concern. The field has accumulated datasets that required years of patient recruitment, clinical data collection, and genomic analysis to produce — datasets whose research value extends well beyond the immediate period of the study that generated them. Long-term access to these datasets depends on institutions that are funded, maintained, and technically current across timescales that exceed any individual grant cycle or technology platform's operational life.
NCBI's sustained operation — funded by the National Library of Medicine, maintained across administrations, technically upgraded over decades — is the field's primary example of what TRUST-compliant infrastructure looks like in practice. But NCBI cannot hold everything. The ecosystem of smaller repositories and consortium-specific portals that host the most specialized biomedical datasets operates with significantly less institutional stability. When a Data Coordination Center's funding ends, when a principal investigator moves institutions, when a small repository runs out of storage budget — the data it held may not disappear immediately, but its long-term accessibility becomes genuinely uncertain.
What Needs to Change and What Is Actually Changing
The NIH Data Management and Sharing Policy that came into full effect in January 2023 represents the most significant policy lever the field has yet applied to the FAIR gap. By mandating that all NIH-funded research share its data according to a Data Management and Sharing Plan, and by requiring that those plans be reviewed and approved before funding is awarded, the policy puts FAIR compliance on the critical path of the research process rather than treating it as a post-publication aspiration.
The practical implementation has been uneven. Data Management and Sharing Plans vary significantly in quality. Repository selection is often driven by what is most convenient for the investigator rather than what produces the most FAIR-compliant outcome. The burden of producing FAIR-compliant metadata falls primarily on the researcher, whose training and incentives are oriented toward scientific discovery rather than information science.
What is actually changing — slowly, structurally — is the tooling. The NCBI Submission Portal, which continues to expand support for eukaryotic nuclear sequences as part of the transition from the older BankIt system, now provides interactive workflows that guide submitters through the creation of BioProject and BioSample records that improve FAIR alignment. The submission process is doing more of the metadata work, rather than relying on the submitter to independently understand what metadata is required for downstream discovery and reuse.
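What "the submission process doing the metadata work" means in practice is validation at the point of entry. The sketch below is illustrative only: the required-attribute set is invented rather than NCBI's actual BioSample schema, but it shows the shape of the check that submission tooling can run before a record ever reaches a repository.

```python
# A minimal sketch of submission-time metadata validation, loosely modeled
# on BioSample attribute packages. This required-field set is invented for
# illustration, not NCBI's actual schema.
REQUIRED_ATTRIBUTES = {"organism", "isolation_source",
                       "collection_date", "geo_loc_name"}

def validate_biosample(attributes: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    problems = [f"missing required attribute: {name}"
                for name in sorted(REQUIRED_ATTRIBUTES - attributes.keys())]
    problems += [f"empty value for attribute: {name}"
                 for name, value in attributes.items() if not value.strip()]
    return problems

submission = {"organism": "Homo sapiens",
              "collection_date": "2024-06-01",
              "geo_loc_name": ""}
for problem in validate_biosample(submission):
    print(problem)
```

Checks of this kind cost the submitter minutes; their absence costs every downstream searcher the ability to find the record at all.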
This is the right direction. It does not yet constitute a solution to the discovery problem at the scale that the NIH Data Sharing Policy envisions.
The Library Perspective
Biomedical libraries occupy a specific position in this challenge. They have professional expertise that the research community largely lacks: in metadata, in controlled vocabularies, in the design of discovery systems. And they are structurally positioned to engage with both the policy dimension and the technical dimension of the FAIR gap in ways that neither funders nor researchers can manage independently.
What the field needs from its library community is not more FAIR compliance documentation. It is operational engagement with the discovery problem — the development of metadata enhancement tools, the maintenance of crosswalks between the vocabulary systems used by different biomedical repositories, the design of researcher-facing interfaces that make FAIR metadata production a natural consequence of normal research workflow rather than a separate compliance burden.
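A crosswalk of the kind described here can start usefully small. In the sketch below, the MeSH descriptor is real (D003924, "Diabetes Mellitus, Type 2"), but the source labels and the mapping table itself are invented for illustration. The point is that once repository-specific labels resolve to a shared controlled term, records like the three in the earlier toy example become jointly discoverable.

```python
# A toy vocabulary crosswalk: repository-specific disease labels normalized
# to a shared controlled term. The MeSH descriptor D003924 is real; the
# source labels and this mapping table are illustrative.
CROSSWALK = {
    "type 2 diabetes": ("D003924", "Diabetes Mellitus, Type 2"),
    "diabetes mellitus, type 2": ("D003924", "Diabetes Mellitus, Type 2"),
    "t2dm": ("D003924", "Diabetes Mellitus, Type 2"),
}

def normalize(label: str) -> tuple[str, str] | None:
    """Map a repository-specific label to a (MeSH ID, preferred term) pair."""
    return CROSSWALK.get(label.strip().lower())

for raw in ["Type 2 Diabetes", "T2DM", "Diabetes Mellitus, Type 2"]:
    print(raw, "->", normalize(raw))
```

Building and maintaining such tables at scale, across hundreds of repositories and evolving ontologies, is exactly the kind of unglamorous, high-leverage work that library expertise is suited to.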
The JCDL community has contributed significantly to this problem space over more than two decades. The 2026 conference tracks on Metadata and Semantics and on Infrastructure and Systems are precisely the right forums for advancing this work. The genomic data ecosystem is large, consequential, and partially broken in ways that digital library expertise is specifically equipped to address. The gap between stated commitment and operational capability does not close by itself.