What FAIR Was Designed to Do
The FAIR principles address a specific and genuine problem in research data management: data generated in one context could not reliably be discovered, accessed, and reused in another. Datasets produced in publicly funded research were routinely published with metadata too thin to allow discovery, in formats that prevented machine-readable access, under licensing terms that were unclear or restrictive, and with provenance documentation inadequate for any downstream user to assess the data's quality and limitations.
Wilkinson et al.'s key insight was that the primary beneficiary of FAIR data infrastructure was not the human researcher but the machine: "It is our intent that the principles apply not only to 'data' in the conventional sense, but also to the algorithms, tools, and workflows that led to that data." The FAIR principles were, from their inception, designed to enable machine-readable data discovery and access. The F4 principle specifies that metadata and data are "registered or indexed in a searchable resource" — the infrastructure component that enables automated discovery. The I1 principle requires that metadata use "a formal, accessible, shared, and broadly applicable language for knowledge representation" — specifically to enable machines to exchange and read data without human intermediation.
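What F4 and I1 jointly enable can be made concrete with a small sketch: a metadata record expressed in a shared vocabulary (here, JSON-LD-style keys borrowed from schema.org) sitting in a searchable index that a machine can query without human intermediation. The field names and the toy index are illustrative assumptions, not a real repository API.

```python
# Illustrative sketch of machine-actionable FAIR metadata.
# I1: the record uses a shared, formal vocabulary (schema.org keys).
# F4: the record is registered in a searchable resource (the toy index).
RECORD = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "identifier": "doi:10.xxxx/example",   # placeholder identifier
    "name": "Example survey dataset",
    "keywords": ["soil chemistry", "survey"],
}

INDEX = [RECORD]  # stands in for a repository's search index


def discover(index, keyword):
    """Automated discovery: a machine finds the dataset by keyword,
    with no human reading a landing page."""
    return [r["identifier"] for r in index
            if keyword in r.get("keywords", [])]
```

A crawler or agent calling `discover(INDEX, "survey")` retrieves the identifier directly, which is precisely the machine-first access pattern the principles were written to enable.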
In this sense, the FAIR principles are not being subverted by AI-mediated data consumption. They are being fulfilled — but in ways that expose the limits of the original framework's scope.
What the Machine-Learning Turn Introduces
The critical difference between FAIR-compliant human data reuse and AI-mediated collection ingestion is not scale, though scale matters. It is the nature of the reuse relationship and the absence of the accountability mechanisms the FAIR framework implicitly assumes.
FAIR assumes a discoverable reuse event: a researcher finds a dataset, assesses its provenance and licensing, downloads it, and uses it in a way that is documented and attributable. The R1.1 principle requires that metadata include "a clear and accessible data usage license." The R1.2 principle requires that metadata be "associated with detailed provenance." These requirements are meaningful in a world where reuse is mediated by a human who reads the license and traces the provenance.
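The reuse event FAIR assumes can be sketched as a pre-reuse check: before anything is downloaded, the R1.1 license and R1.2 provenance attached to the metadata are inspected and the decision is attributable. The field names below are assumptions for illustration, not a fixed schema.

```python
# A minimal sketch of the human-granularity reuse check that R1.1 and
# R1.2 presuppose. Field names ("license", "provenance") are illustrative.

def assess_reuse(metadata):
    """Return (ok, reasons): can this object be responsibly reused?"""
    reasons = []
    if not metadata.get("license"):      # R1.1: clear usage license
        reasons.append("no clear usage license")
    if not metadata.get("provenance"):   # R1.2: detailed provenance
        reasons.append("no provenance chain")
    return (len(reasons) == 0, reasons)


record = {
    "title": "Oral history transcripts, 1970-1985",
    "license": "CC-BY-4.0",
    "provenance": ["recorded 1970-1985", "digitised 2004", "deposited 2010"],
}
```

The check is trivial precisely because it happens once, per object, by a party who can be held to the license it reads; that is the granularity the next section argues training pipelines do not operate at.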
AI model training is not that kind of reuse. A foundation model is trained on a corpus assembled by automated web crawling, repository harvesting, and licensed data acquisition. The individual data objects in that corpus — research articles, digitised manuscripts, catalog records, oral history transcripts, metadata schemas — are processed in aggregate at a scale that makes individual license review and provenance attribution practically impossible. The model does not reuse a dataset. It consumes it, at a granularity below the level of the data object, in a process that produces outputs with no traceable relationship to any specific input.
Three consequences follow. First, FAIR-compliant data is being consumed by systems that are not FAIR-compliant in their output: the provenance chain that FAIR requires for input data is severed at the training stage. Second, licensing terms that govern human data reuse — and that the FAIR framework assumes will be read and respected — may not be technically enforceable against automated training pipelines that operate below the granularity at which licenses apply. Third, the "Reusable" dimension of FAIR — which requires clear data usage licenses and detailed provenance — is being satisfied only in the direction of input, not output: researchers can trace where a FAIR dataset came from, but they cannot trace what a model trained on that dataset "knows" back to any specific source.
The Cataloger's Dimension: Authority, Deduplication, and the Amplification of Bias
Library collections occupy a specific position in this problem because of the structured metadata that accompanies them. A research repository that indexes under FAIR principles provides not just data objects but the semantic infrastructure of subject headings, authority records, classification codes, and controlled vocabularies that describe those objects. When an AI system ingests a FAIR-compliant library catalog, it ingests not just the bibliographic records but the implicit knowledge organisation encoded in decades of cataloging decisions.
Those decisions carry historical biases. Library of Congress Subject Headings — the most widely used controlled vocabulary in North American library cataloging — encode the conceptual frameworks of a predominantly white, Western, English-language cataloging tradition. Terms that were considered neutral professional practice when first assigned now represent contested or harmful characterisations of Indigenous peoples, religious minorities, and marginalised communities. The field has been engaged in a slow and incomplete process of updating this vocabulary for decades.
When AI systems train on catalog data without engaging with this dimension of its history, they amplify the biases encoded in the metadata as if they were neutral facts about the world. The subject headings assigned to a collection of materials about a particular community become, in the model's representations, the correct way to categorise that community — irrespective of whether those headings reflect the community's own frameworks for understanding itself.
This is not a problem that FAIR as currently specified addresses. FAIR requires rich metadata and provenance documentation. It does not require that the metadata itself be assessed for bias, or that training pipelines be informed of known limitations in the vocabulary systems on which the metadata depends.
Toward an Extended Stewardship Contract
The extension that the machine-learning turn requires is not a replacement of FAIR but an additional dimension: Accountable. The A that FAIR currently uses for Accessible needs a companion that addresses the specific accountability gap that AI-mediated reuse creates.
Four elements form a workable extension. First, training consent infrastructure: data objects in library repositories and research collections should be associated with machine-readable training consent metadata that specifies whether and under what conditions the object may be used in AI model training, analogous to the robots.txt convention in web indexing but with more granular capability. Second, model provenance disclosure: institutions that deploy AI systems trained on library collections should be required to disclose, at the level of collection rather than individual object, what collections were used in training — enabling downstream users to assess what the system "knows" and what biases it may carry.
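The first element can be sketched concretely. No standard for per-object training consent currently exists; the record structure and field names below ("ai_training", "conditions") are invented here purely to show what a robots.txt-like convention with object-level granularity might look like.

```python
# Hypothetical per-object training consent metadata -- analogous to
# robots.txt, but scoped to AI training and attached to a data object.
# All names are assumptions; no such standard exists today.
CONSENT = {
    "object_id": "archive:coll-042/item-17",
    "ai_training": "conditional",   # "allow" | "deny" | "conditional"
    "conditions": ["non-commercial", "provenance-disclosure-required"],
}


def may_train_on(consent, *, commercial, will_disclose_provenance):
    """Decide whether a training pipeline may ingest this object."""
    policy = consent.get("ai_training", "deny")  # absent metadata = deny
    if policy == "allow":
        return True
    if policy == "deny":
        return False
    conditions = set(consent.get("conditions", []))
    if "non-commercial" in conditions and commercial:
        return False
    if "provenance-disclosure-required" in conditions and not will_disclose_provenance:
        return False
    return True
```

The design choice worth noting is the default: absent consent metadata maps to deny, inverting the current de facto regime in which absence of a signal is treated as permission.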
Third, vocabulary bias documentation: controlled vocabularies used in library cataloging should carry associated documentation of known historical biases, update histories, and contested terms — metadata about the metadata — that can be consumed by AI training pipelines and used to inform model outputs or flag uncertain representations.
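"Metadata about the metadata" can also be made concrete. The sketch below imagines a vocabulary term record that carries its own status and revision history, so a training pipeline can flag contested headings rather than ingest them as neutral fact. The structure is hypothetical: the Library of Congress does publish subject heading revisions (the 2021 replacement of "Illegal aliens" is a real example), but not in this machine-readable, bias-annotated form.

```python
# Hypothetical bias documentation attached to a controlled vocabulary.
# The record structure is invented; the 2021 LCSH revision it mentions
# is a real, documented change.
VOCAB = {
    "Illegal aliens": {
        "status": "contested",
        "replaced_by": ["Noncitizens", "Illegal immigration"],
        "note": "Contested for years; revised by the Library of Congress in 2021.",
    },
    "Soil chemistry": {"status": "current"},
}


def flag_contested(headings, vocab):
    """Return the subset of assigned headings known to be contested,
    so a pipeline can down-weight or annotate them."""
    return [h for h in headings
            if vocab.get(h, {}).get("status") == "contested"]
```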
Fourth, and most fundamentally: the library field needs to insert itself into the governance conversations about AI training data standards that are currently being held in technology and policy circles without significant input from information professionals. The FAIR principles succeeded because researchers, funders, and publishers agreed that the framework applied to them. An equivalent commitment from AI developers and deployers to an extended stewardship standard for library and archival collections will require the same kind of cross-sector agreement — and the organisations with the most credibility to broker it are the ones that have been managing these collections for the longest time.
The field that spent a century building the infrastructure of trustworthy information is the field best positioned to specify what trustworthy AI-mediated information access requires. What it has not yet done is claim that position with the urgency the moment demands.