Surveillance of the Past in the AI Archive

by Todd Fine (PhD candidate in History)

After the “archival turn” in the humanities placed the purpose, creation, and content of historical archives in doubt, dramatic innovations in artificial intelligence have come along to spotlight the critique. A cultural battle of our time is whether AI will further entrench the problems of the archive or disabuse researchers and the academic disciplines of biases rooted in the archive, possibly freeing humanity from its deepest mental fallacies.

At first glance, the prospects don’t look good. Generative AI is a creature of the archive. One could even say that Large Language Models are the archive itself, compressed into a giant, indecipherable matrix and brought to life. The archive is the food. The new AI models are built from stored documents, compiled datasets of all kinds of texts and images accumulated on the internet, including books and newspaper articles. The choice of data intrinsically forms the model.

Biases reflected in the archive (toward the wealthy and powerful, toward men and dominant racial identities, and toward Western countries) express themselves in the mass of content stored on the internet. The anonymity of internet communication, and the internet’s origins as a place of leisure for the relatively wealthy and privileged, have fueled the articulation of racist, sexist, and classist ideas in internet discourse, in many cases beyond what is acceptable to utter in physical social settings.

On the other hand, it does seem that a broader spectrum of people find better expression in stored electronic forums and on social media (albeit through a corporate filter) than in the traditional institutional archive. The struggles of the poor and marginalized may not be as invisible and obscured for AI models as they are in our histories based on traditional archives.

While concerns about the social biases of AI gain broad attention, academics have only begun to speculate about how these AI models will be used to study the existing mass historical archive. This accelerated restructuring of knowledge from historical archives might reinforce the archive’s biases if countermeasures are not taken, but it could also extract other information about history that the archive sought to elide.

One truth that humanities researchers know all too well is that many archival collections are hardly utilized, due to their scope, the high expense of archive work, and a general decline in jobs and support. Archival depositors may also have overestimated how important their lives and activities would be to future generations who face their own problems.

The physical archives that scholars rely on have been slow to be digitized, due to copyright concerns and the expense of scanning and hosting. Archives also became a kind of institutional property, held more tightly than academic principles should have tolerated in the internet era. Whether the thirst for data on which to train AI systems will accelerate the digitization of historical documents (perhaps supported by AI investment and profits) remains an open question. If preservation and safety issues can be resolved satisfactorily, robots might be able to scan the varied forms of documents in historical archives. There might even be a national or international push to take these responsibilities away from strained academic institutions and shift them toward ambitious government and corporate programs (as was done in the early 2000s with library book scanning). A government crash digitization program might be a blessing if managed in good faith for the public good. The historical archive probably should not become a private asset, as publishers and tech companies have tried to do with the internet archive.

An acceleration of scanning, combined with more sophisticated Optical Character Recognition techniques and supplemented by AI visual and textual intelligence, will lead to entirely different forms of research. Questions that in the past might have seemed like looking for a needle in a large archival haystack will be pursued with the help of AI agents (which know every language and every human biography) that can read every document in an archive looking for the answer. Search will no longer be done only by textual algorithms looking for words, an exercise that will increasingly seem primitive and a waste of time. Research on archival correspondence, which currently requires a one-sided examination of what a single archive includes, could be transformed by AI agents that have access to all sides of a correspondence (and related correspondence) held by a variety of physical archives. AI could be used to create a master correspondence database, a sort of human genome project of social history.

Materials that have been hard to access (documents in obscure or forgotten languages, with hard-to-read handwriting, or in physically challenging formats like old papyri) will finally be made accessible, instantly translated into every possible language. Some of the first successful projects involving advanced AI show this promise, for example, efforts to transcribe obsolete German Fraktur typefaces and handwriting styles like Kurrent and Sütterlin.

This sudden promise of the rapid digitization of the historical archive, however, is only the beginning of the changes that the AI-empowered encounter with the archive may trigger. Surveillance of our lives may be matched by surveillance of the past, a prospect both seductive and terrifying. Imagine history produced by an entity that has access to “biographies” of every past person, each as complete as the archive can construct. The archive’s biases might still come through, but its attempts to elide might be overwhelmed by the model’s determination to construct an accurate world model. The archive’s errors and distortions could have catastrophic effects on AI reconstruction, or they might collapse in the face of the model’s ruthless rationalizations in search of truth. (And moral scandals about the personal lives of research subjects will surely proliferate to the point of becoming trite.)

Intelligent AI models, which already demonstrate the ability to create “world models” that enable realistic reconstruction of speech and of static and moving imagery, may be able to create models of historical action that transform current ideas about causality. With more complete “surveillance” world models, AI might give us new perspectives on the social, economic, and ecological dynamics of past societies, recasting classic debates on idealism vs. materialism, the bases of geopolitics, and the role of the individual in history in ways that humans have never considered.

At the same time, human sciences and academic disciplines have still failed in most of their attempts to understand human behavior and psychology. The human mind is one subject that consistently evades quantitative analysis. Even with the archive laid bare and their own complex sensor systems installed, AI models may never be able to truly understand a single human experience as we perceive it. AI’s attempt to create “world models” of history may always be too simplistic, without understanding human experience. Addressing this gap of understanding might become the central problematic of the rest of our lives. Creating a “new archive” that better reflects human experience, and which fights the biases of the archive that we know, might be our best hope.
