
What Born-Digital Archives Are Losing Before Anyone Notices

There is a category of loss that archivists have spent two decades naming and still have not solved: the loss that accumulates not through flood, fire, or institutional collapse, but through the perfectly ordinary passage of time in a digital ecosystem that does not stand still. A .wp4 file created in WordPerfect 4 in 1988 is, technically, preserved. The bits are intact. The checksums pass. The storage media is functioning. The file exists in exactly the form it was saved. And it is, for practical purposes, unreadable — because the software that rendered it no longer runs on any hardware currently in production, and the hardware that ran that software requires components that have not been manufactured for thirty years. This is format obsolescence. It is the defining preservation challenge of the born-digital era, and it is happening right now, in every archive that holds materials created on computers before approximately 2000, and in a significant portion of materials created after.

The National Archives and Records Administration's Digital Preservation Strategy 2022–2026 identifies Format and Media Sustainability as one of five core strategic pillars — a recognition that the problem is not hypothetical but operational, present in active collections, requiring resources and decisions that most institutions have not yet fully allocated.

The Specific Mechanism of Obsolescence

Understanding format obsolescence requires understanding what a digital file actually is. Unlike a book, which contains its content in a form directly perceptible to a human reader with adequate light and literacy, a digital file is a sequence of binary data that requires a software layer to interpret and a hardware layer to run that software. The file's content is not accessible to the human directly — it is accessible to the software, which makes it accessible to the human. Remove the software, and you have a sequence of bits that encodes information no one can retrieve.

This dependency chain creates multiple failure surfaces. Physical storage media degrades — magnetic tape loses its magnetic alignment, optical discs develop bit rot, solid state storage leaks charge over time. The hardware that reads the storage media becomes obsolete, replaced by new architectures that may not support older media formats. The operating systems that run on the new hardware may not support the old software. The old software, even when the executable survives, may not function correctly on modern operating systems without specific compatibility layers that themselves require maintenance.

The Digital Preservation Coalition defines digital preservation as "the series of managed activities necessary to ensure continued access to digital materials for as long as necessary." The qualifying phrase — "as long as necessary" — contains most of the field's practical difficulty. Necessary for whom? Determined by what authority? Over what planning horizon? These are institutional and political questions as much as technical ones, and archives have been answering them inconsistently, inadequately, and often only after specific materials have already become inaccessible.

Three Preservation Strategies and Their Trade-offs

The field has converged on three primary approaches to born-digital preservation, each with specific costs, capabilities, and limitations. They are not alternatives to each other — the most effective preservation programs deploy all three in combination — but they represent distinct philosophical and technical commitments.
Bit-level preservation is the foundation. An exact copy of the file's content information and data structure is created at ingest, stored in two or more geographically distributed locations, and verified continuously through checksum comparison. This addresses the threat of physical loss and storage media failure. It does not address the threat of format obsolescence: a perfectly preserved .wp4 file is still a .wp4 file, and bit-level preservation does nothing to ensure that .wp4 remains renderable.

Format migration converts at-risk files to current stable formats before they become inaccessible. A collection of WordPerfect documents is migrated to a PDF/A or plain text representation. The document's information content is preserved; its original format is lost. This trade-off is sometimes acceptable — if the content is what matters, the migration preserves what matters. But migration also carries significant risks: automated migration processes can introduce errors, subtle formatting information may not survive conversion, and the archival record of the original format is replaced by an interpretation of it. The Smithsonian Institution Archives employs migration as one of three preservation prongs, normalising files to stable formats while retaining originals where storage permits — a practice that acknowledges both the necessity and the cost of the approach.

Emulation preserves the original software environment rather than migrating the content. An emulator reproduces the behaviour of the original hardware and operating system on modern infrastructure, allowing the original software to run and the original files to be rendered. The famous early example came from Emory University's Rose Library, which in 2009 constructed a convincing digital replica of Salman Rushdie's Power Macintosh 5400 — preserving not just his documents but the specific environment in which they were created. Emulation preserves the most, but it is also the most resource-intensive approach. Building and maintaining accurate emulation environments for the dozens of hardware platforms and hundreds of software applications represented in a large born-digital archive requires specialised expertise and ongoing investment that most institutions cannot sustain at scale.
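
To make the first of these concrete, the sketch below shows the fixity cycle that bit-level preservation rests on: hash every file at ingest, record the digests in a manifest, and re-verify on a schedule. The directory layout, manifest format, and collection paths here are illustrative placeholders rather than the conventions of any particular repository system.

```python
import hashlib
import json
from pathlib import Path

CHUNK = 1024 * 1024  # read files in 1 MiB chunks to keep memory use flat


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file's bytes."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


def write_manifest(collection: Path, manifest: Path) -> None:
    """Record a checksum for every file in the collection at ingest time."""
    digests = {
        str(p.relative_to(collection)): sha256_of(p)
        for p in sorted(collection.rglob("*")) if p.is_file()
    }
    manifest.write_text(json.dumps(digests, indent=2))


def verify_manifest(collection: Path, manifest: Path) -> list[str]:
    """Return paths whose current digest no longer matches the manifest."""
    recorded = json.loads(manifest.read_text())
    failures = []
    for rel, digest in recorded.items():
        target = collection / rel
        if not target.exists() or sha256_of(target) != digest:
            failures.append(rel)
    return failures


if __name__ == "__main__":
    root = Path("archive/collection-001")    # hypothetical collection root
    manifest = Path("archive/manifest.json")  # hypothetical manifest location
    if not manifest.exists():
        write_manifest(root, manifest)
    else:
        print(verify_manifest(root, manifest))
```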

The OAIS Reference Model and Its Implementation Gap

The Open Archival Information System reference model, published by the Consultative Committee for Space Data Systems and adopted as ISO 14721, provides the conceptual framework within which most serious digital preservation programs operate. OAIS defines the functional components of a preservation system — ingest, archival storage, data management, administration, preservation planning, and access — and describes the information packages that flow between them. It is the closest thing the field has to a shared architecture.
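
OAIS deliberately stops short of prescribing how an information package is serialised on disk. One convention widely used in practice is BagIt (RFC 8493), in which payload files sit under a data/ directory alongside a checksum manifest and package-level metadata. The sketch below assembles a minimal bag by hand purely to illustrate that structure; a production workflow would rely on a maintained library and record far richer metadata, and the paths and organisation name are placeholders.

```python
import hashlib
import shutil
from datetime import date
from pathlib import Path


def make_bag(source: Path, bag_dir: Path) -> None:
    """Package source files as a minimal BagIt bag: data/ payload plus tag files."""
    payload = bag_dir / "data"
    payload.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for f in sorted(source.rglob("*")):
        if not f.is_file():
            continue
        rel = f.relative_to(source)
        dest = payload / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, dest)  # copy payload, preserving timestamps
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    # Tag files required or commonly expected by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    (bag_dir / "bag-info.txt").write_text(
        f"Bagging-Date: {date.today().isoformat()}\n"
        "Source-Organization: Example Archive\n"  # placeholder metadata
    )


if __name__ == "__main__":
    make_bag(Path("incoming/accession-042"), Path("storage/accession-042-bag"))
```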

The implementation gap between OAIS as a reference model and OAIS as a deployed system in real institutions is significant and, in much of the literature, underdiscussed. The model describes what a preservation system should do; it does not specify how to do it with constrained budgets, legacy infrastructure, and staff trained primarily in traditional archival practice rather than digital systems. The University at Buffalo Libraries Special Collections uses Preservica — an OAIS-compliant workflow suite — for its born-digital management, with a primary strategy centred on normalising files for preservation and presentation. This is a representative model for well-resourced academic libraries. It is not representative of the field as a whole, where many institutions lack the staff, software, or storage infrastructure to implement OAIS in any meaningful sense.
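
As one illustration of what a single normalisation step can look like outside a packaged suite, the sketch below creates PDF derivatives of at-risk word-processing files while leaving the originals untouched. It assumes LibreOffice is installed and available as soffice; the extension list is a placeholder policy, and plain PDF output stands in for the PDF/A target a preservation-grade pathway would require.

```python
import subprocess
from pathlib import Path

# Extensions treated as "at risk" here are illustrative only; a real profile
# would come from a format registry and local policy.
AT_RISK = {".doc", ".wpd", ".xls", ".ppt"}


def normalise(collection: Path, derivatives: Path) -> None:
    """Create derivative copies of at-risk files, leaving the originals in place."""
    derivatives.mkdir(parents=True, exist_ok=True)
    for f in sorted(collection.rglob("*")):
        if f.suffix.lower() not in AT_RISK:
            continue
        # LibreOffice headless conversion; produces plain PDF, not PDF/A.
        subprocess.run(
            ["soffice", "--headless", "--convert-to", "pdf",
             str(f), "--outdir", str(derivatives)],
            check=True,
        )


if __name__ == "__main__":
    normalise(Path("archive/collection-001"), Path("archive/derivatives"))
```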

NARA's 2022–2026 strategy mandates tools for forensic identification and format characterisation, including file format identification, format validation against documented specifications, and technical metadata extraction. This is the correct approach — a preservation system that does not know what it holds cannot prioritise its interventions. But building and maintaining those tools across a collection of tens of millions of records, acquired across decades, in formats ranging from standardised to idiosyncratic, is an engineering problem of genuine complexity that the strategy document acknowledges without fully resolving.
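
File format identification typically starts with signature matching: compare a file's leading bytes against known magic numbers and fall back to weaker evidence such as the file extension. The sketch below uses a toy table of a few signatures to show the idea; real tools such as DROID or fido match against the full PRONOM signature registry rather than anything this small.

```python
from pathlib import Path

# Toy signature table of (magic bytes, label) pairs; illustrative only.
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP container (including OOXML/ODF)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (legacy Office)"),
    (b"\xffWPC", "WordPerfect document"),
]


def identify(path: Path) -> str:
    """Return a coarse format label from leading bytes, else fall back to extension."""
    with path.open("rb") as f:
        header = f.read(16)
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return f"unidentified ({path.suffix or 'no extension'})"


if __name__ == "__main__":
    for f in sorted(Path("archive/collection-001").rglob("*")):
        if f.is_file():
            print(f"{f}: {identify(f)}")
```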

The Proprietary Format Problem

The field's consensus position on file format selection — that chosen formats should be "open, standard, non-proprietary, and well-established" — is correct and widely endorsed. It is also almost entirely retrospective in its application. The archives that most urgently need preservation attention are not archives of material created in open formats; they are archives of material created in whatever formats were standard at the time, which for most of the personal computing era meant Microsoft Word, Excel, PowerPoint, various versions of WordPerfect, Adobe's proprietary formats, and dozens of specialist applications in scientific, legal, and creative domains.

The wide adoption of proprietary file formats has created a situation in which only the program that created the file — or in some cases, only a specific version of that program — can be used to open it correctly. Format documentation, when it exists, is often incomplete, proprietary, or subject to licensing restrictions that complicate open implementation. The result is that a significant portion of the born-digital cultural record of the last four decades exists in formats whose full specification is controlled by companies that may not prioritise preservation and whose business decisions can render those formats unreadable overnight.

What the Next Decade Requires

The field has diagnosed the problem accurately. What it has not done is build the institutional infrastructure to address it at scale.

Three things are needed that the current state of the field does not reliably provide. First, format risk assessment capacity — the ability to survey a collection and identify which formats are at greatest risk of imminent obsolescence, so that preservation resources can be allocated to the highest-priority materials before access is lost. PRONOM, the UK National Archives' file format registry, provides the technical infrastructure for this; deployment of format assessment tools against real collections at real institutions remains inconsistent.
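
A first pass at that kind of risk assessment can be as simple as the sketch below: walk a collection, tally the formats present, and report how much of the holdings fall into formats on a locally maintained at-risk list. The extension-keyed risk set is a placeholder; a real assessment would key on PRONOM identifiers and weigh factors such as software availability and specification openness.

```python
from collections import Counter
from pathlib import Path

# Placeholder risk policy keyed by extension; a real assessment would use
# registry identifiers and institutional policy rather than extensions.
AT_RISK = {".wp4", ".wpd", ".doc", ".xls", ".ppt", ".mdb"}


def survey(collection: Path) -> None:
    """Count formats in a collection and report the at-risk share."""
    counts = Counter(
        p.suffix.lower() or "(none)"
        for p in collection.rglob("*") if p.is_file()
    )
    flagged = {ext: n for ext, n in counts.items() if ext in AT_RISK}
    total = sum(counts.values())
    at_risk_total = sum(flagged.values())

    print(f"{total} files surveyed, {at_risk_total} in at-risk formats")
    for ext, n in sorted(flagged.items(), key=lambda kv: -kv[1]):
        print(f"  {ext}: {n}")


if __name__ == "__main__":
    survey(Path("archive/collection-001"))
```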

Second, shared emulation infrastructure — the capacity to run original software environments without requiring each institution to build and maintain its own emulation stack. The Software Preservation Network in the United States and the Software Heritage initiative in Europe represent steps in this direction. They are not yet at the scale the problem requires.

Third, and most difficult: honest institutional accounting of the gap between stated preservation commitments and actual preservation capability. An institution that holds born-digital materials and lacks the staff, storage, and software to implement even basic preservation workflows is not preserving those materials, regardless of what its collection policy says. Closing that gap requires resources that acquisitions decisions do not currently account for and that institutional leadership has not consistently prioritised.

The quiet emergency is not quiet because it is small. It is quiet because bits, unlike books, do not visibly deteriorate. They simply stop being readable. And by the time an institution discovers that a format has become obsolete, the moment for cost-effective intervention is often already past.
