JCDL 2004
Digital Libraries Summit

Model Collapse and What Happens When AI Starts Learning From Itself

The large language models that have reshaped how we write, search, and learn were built on a single irreplaceable resource: the vast accumulation of text that humans wrote and put on the internet over decades. That corpus was the raw material, the ore from which the models were smelted. But the models have now begun to flood the very source they drew from. A growing share of new text online is machine-generated, and the next generation of models is increasingly learning from it. This sets up a strange and consequential feedback loop, one researchers have given a suitably ominous name: model collapse. Understanding it turns out to matter enormously for anyone who cares about the long-term integrity of human knowledge.

A copy of a copy of a copy

The clearest way to grasp model collapse is by analogy to photocopying. Make a copy of a photograph, then a copy of that copy, then a copy of that, and so on. Each generation looks roughly like the last, but errors accumulate, fine detail is lost, and after enough iterations the image degrades into a blurred, distorted ghost of the original. Something structurally similar happens when a generative model is trained substantially on the output of previous generative models.

Researchers studying this have found that models trained on the output of earlier models tend to drift. They lose the rare cases, the unusual events, the long tail of genuine variety that lived in the original human data, and converge instead toward the bland, the average, the most probable. The mechanism is largely statistical: any finite sample under-represents rare events, so each model's estimate of the distribution trims the tails a little, and the next model inherits the trimmed version as if it were the whole. Across successive generations the diversity of the output narrows, oddities vanish, and quality degrades. A model trained substantially on synthetic data does not merely fail to improve; it actively gets worse, forgetting the richness of the real distribution it once approximated.
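The loss of the long tail can be seen in a toy simulation. The sketch below is not how any real language model is trained. It assumes a "model" that is nothing more than the observed frequencies of a fixed set of categories, and each generation is re-estimated from a finite sample drawn from the previous one. The function name `resample_generations` and all of its parameters are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def resample_generations(weights, sample_size, generations, seed=0):
    """Toy stand-in for training each generation on the previous one's output.

    Returns the number of categories that still have nonzero probability
    at generation 0 (the original data) and after each later generation.
    """
    rng = random.Random(seed)
    categories = list(range(len(weights)))
    total = sum(weights)
    probs = [w / total for w in weights]

    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        # "Generate" a finite corpus from the current model.
        sample = rng.choices(categories, weights=probs, k=sample_size)
        # "Train" the next model: its probabilities are the sample frequencies.
        counts = Counter(sample)
        probs = [counts[c] / sample_size for c in categories]
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes


if __name__ == "__main__":
    # Assumed long-tailed starting distribution: weight 1/rank for 100 categories.
    long_tail = [1 / (rank + 1) for rank in range(100)]
    print(resample_generations(long_tail, sample_size=200, generations=20))
```

In this toy setting a category that fails to appear in one generation's sample gets probability zero and can never return, so the count of surviving categories can only stay flat or fall. Real systems are far more complicated, but the one-way ratchet is the same intuition as the photocopy.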

Why this is more than a technical curiosity

It would be tempting to file this under the engineering problems that future systems will simply solve. But the mechanism reveals something profound about what these models actually are and what they depend on. A generative model is, at bottom, a compression of human expression. When it begins training on its own kind, it starts compressing a compression, and the information that made the original valuable — the texture, the exceptions, the hard-won specificity of real human knowledge — leaks away at each pass.

The danger is amplified by the fact that synthetic and human text are increasingly difficult to tell apart, and the internet does not come neatly labelled. As machine-generated content proliferates across the open web, the well from which future models drink becomes progressively more polluted, and there is no easy filter to separate the authentic source water from the recycled. The commons of human expression, which everyone assumed was an inexhaustible given, turns out to be both finite and corruptible.

The new scarcity: verified human knowledge

Here is where the concerns of the digital-library world move sharply into focus. If the open internet is becoming an unreliable, self-polluting training source, then collections of verified, curated, provenance-rich human knowledge acquire a new and serious value. A well-described archive of authentic human work — scholarship, literature, primary documents, datasets with known origins — is exactly the kind of clean, trustworthy source that becomes precious in an age of synthetic flood.

This is a striking reversal of a decade's assumptions. The conventional wisdom held that the open web's scale made curation almost quaint; why painstakingly catalogue when you could simply ingest everything? Model collapse suggests the opposite. Scale without provenance becomes a liability, while curation — knowing what something is, where it came from, and that a human actually produced it — becomes the scarce and valuable thing. The disciplines libraries never abandoned, of describing, verifying, and preserving sources, turn out to be exactly what the information ecosystem most urgently needs.

Provenance as the antidote

If the problem is that we can no longer tell authentic human knowledge from its synthetic echo, then provenance is the closest thing to a cure. Knowing the chain of origin of a piece of content — who created it, when, on what basis — is what allows it to be trusted as training data, as evidence, as knowledge. Provenance is, and has always been, a library's core competence: the citation, the catalogue record, the documented chain of custody that lets a later user verify rather than guess.

In a world where generated content is cheap and indistinguishable, the ability to certify "this was made by a human, here, at this time, from these sources" stops being archival housekeeping and becomes critical infrastructure. The institutions that can guarantee the human origin and integrity of a collection hold something the open web is rapidly losing the ability to provide.
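To make the idea of a documented chain of origin slightly more concrete, here is a minimal sketch of a provenance record. It assumes a record with just four fields (creator, creation date, sources, and a SHA-256 digest of the content); the names `ProvenanceRecord`, `describe`, and `matches` are hypothetical and do not correspond to any particular cataloguing standard. A digest like this only shows that content has not changed since it was described. It cannot, on its own, prove that a human wrote it; that assurance still has to come from the institution that made the record.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical minimal record: who, when, from what, and a content digest."""
    creator: str
    created: str              # date as a string, e.g. "2004-06-07"
    sources: Tuple[str, ...]  # identifiers of the works this item draws on
    sha256: str               # hex digest of the content when it was described


def describe(content: bytes, creator: str, created: str,
             sources: Iterable[str] = ()) -> ProvenanceRecord:
    """Create a record for a piece of content at the moment it is catalogued."""
    digest = hashlib.sha256(content).hexdigest()
    return ProvenanceRecord(creator, created, tuple(sources), digest)


def matches(record: ProvenanceRecord, content: bytes) -> bool:
    """Check that the content is byte-for-byte what the record described."""
    return hashlib.sha256(content).hexdigest() == record.sha256


if __name__ == "__main__":
    text = b"An example primary document."
    record = describe(text, creator="Example Author", created="2004-06-07",
                      sources=["example-source-id"])
    print(matches(record, text))                 # unchanged content
    print(matches(record, text + b" (edited)"))  # altered content
```

The design choice worth noting is that the record is frozen: once a description is issued, later users verify against it rather than amend it.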

What we owe the well

Model collapse is, in the end, a parable about a commons. For a generation, we treated the accumulated written knowledge of humanity as a free and limitless input, something to be scraped and consumed without thought for renewal. The feedback loop now closing reminds us that the well can be fouled, that a resource everyone draws on and no one tends will eventually fail, and that the authentic human record is not an infinite background condition but a thing that must be actively protected.

The response is not to abandon the technology, which is too powerful and too useful to refuse, but to take seriously the stewardship that keeps it honest: preserving verifiable human knowledge, insisting on provenance, and maintaining curated collections whose authenticity can be trusted precisely because someone took responsibility for them. The next century of digital libraries may find that its most important task is not competing with the machines but doing the one thing the machines cannot do for themselves — keeping a clean, true, human record of what we actually knew, so that there remains something worth learning from at all.
