Document Preservation and Retrieval

In the old days, the work of historians was much like that of the Indiana Jones character: traveling across the globe seeking ancient manuscripts and other works. Some historians still must do so, and others prefer to, but a tremendous amount of historical information is now available through the internet. Saving old physical works in digital form is another major component of digital history. This lesson concerns document preservation and retrieval: literally saving old information with new technology.

Objectives

Learn about using digital technologies to preserve documents, photos and other 2D materials.
Learn about the various file types, and how to store, organize and protect digital archives.

Traditional Means of Document Preservation

There were several ancient means of recording historical events. The earliest may have been storytelling and song, which helped people learn and remember past experiences and events in their society. Cave paintings may have been another early means. When language began to be written (recorded in an external physical form), clay tablets, stone engravings and papyrus scrolls were used. Eventually other forms of paper were used as well, and sheets were combined into books. While inventions such as the printing press resulted in the proliferation of written works, such technologies were mere improvements of the same approach.

Phonographs, film and magnetic materials in the nineteenth and twentieth centuries finally made breakthroughs in recording events and other information. Recordings of audio and visual events could be made so that future persons could experience direct sensory perceptions of those events rather than reading about them. Further, historical documents could be stored in film version and retrieved and photocopied at will. Yet these technologies were not digital.

Saving Old Information With New Technology

Digital technology allows for the preservation, storage and retrieval of historical documents. Yet technology of one kind or another has allowed this for thousands of years, so what is special about digital technology?

The term digital refers to recording and processing information as strings of numbers, ultimately as the binary digits 0 and 1. This allows the information to be processed by computers. Computers are fast, and the information within them can be transmitted and transformed with relative ease. That means that historical recordings can be reproduced almost instantly. Documents can be retrieved quickly, and searches for names, places and terms can be performed easily across millions of documents. Of course the term easy is a relative one, as compared to such searches without digital technology. Searching can still require thinking and skill, but the ratio of brain work to mere mechanical, manual activities has increased significantly.
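To make the idea of "strings of numbers" concrete, here is a minimal Python sketch that turns a short piece of text into binary digits and back again. The function names are invented for this illustration, and the UTF-8 encoding is an assumption; other encodings would produce different bits.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as space-separated groups of 8 bits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Reverse to_bits: turn groups of 8 bits back into text."""
    data = bytes(int(group, 2) for group in bits.split())
    return data.decode("utf-8")


encoded = to_bits("Rome")
print(encoded)             # 01010010 01101111 01101101 01100101
print(from_bits(encoded))  # Rome
```

Each letter becomes one group of eight binary digits here. Because the round trip loses nothing, a digital copy can be reproduced exactly as many times as needed.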

Digital records often begin life in image form. Often they are then processed using optical character recognition (OCR) and either include a text “layer” or are converted into a text document.

Digital Records and Archives

Let’s first discuss information that is already present on the internet or in other digital forms. There are several important aspects of digital information:

What it is
In what form it is
Where it is located
How to access it
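One way to keep track of these four aspects is a simple catalogue record. The following Python sketch is only an illustration: the class name, field names and sample entries are all made up for this lesson, not taken from any real archive system.

```python
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """One catalogue entry; the four fields mirror the four aspects above."""
    description: str  # what it is
    file_format: str  # in what form it is
    location: str     # where it is located
    access: str       # how to access it


# Entirely made-up sample entries, for illustration only.
records = [
    SourceRecord("Town council minutes, 1923", "pdf", "county archive server", "open"),
    SourceRecord("Oral history interview", "magnetic tape", "university library", "on-site only"),
    SourceRecord("Journal article on the council", "pdf", "publisher website", "paywall"),
]


def filter_by_access(items, access):
    """Return only the records whose access field equals `access`."""
    return [record for record in items if record.access == access]


for record in filter_by_access(records, "open"):
    print(record.description, "->", record.location)
```

Recording the answers in a consistent shape makes it possible to sort or filter sources later, for example to list everything that can be read without visiting a library in person.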

Is there one answer to rule them all? No! As a historian, you may be confronted with a tremendous variety of answers to these questions. Some sources will be on floppy disks, CDs or even magnetic tapes. Some will be behind paywalls. Some will be in file formats which modern computers cannot read. You can savor the exotic possibilities later. For now, the most common cases will be covered.

Images of Primary Source Documents

A three-thousand-year-old clay tablet can be converted into digital form by simply photographing it. The image will be stored in a file. If you have technology that can access, read and display the image, then you can see much of the information contained in that ancient tablet. It might be even better if the contents of the tablet were searchable, in the form of a text file. (Sometimes they will be, sometimes they won’t.)

Examples of images include scans of documents and photographs. Sometimes there might be other representations, such as vector graphics.

Text-Based Primary Source Documents

Primary sources in digital form can be original digital documents (such as notes from a meeting typed on a word processor) or indirect digital copies, such as a typed-up copy of a newspaper article. Either way, text-based documents are ultimately stored in files.
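Because text-based documents live in files, they can be searched by a short program. The sketch below, written in Python, looks through every “.txt” file in a folder for a term, ignoring upper and lower case. The function name, file names and file contents are hypothetical and exist only for this demonstration.

```python
import tempfile
from pathlib import Path


def search_folder(folder: Path, term: str) -> list[tuple[str, int, str]]:
    """Return (file name, line number, line) for each line containing `term`."""
    needle = term.casefold()
    hits = []
    for path in sorted(folder.glob("*.txt")):
        # errors="replace" keeps the search going if a file has odd characters.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.casefold():
                hits.append((path.name, number, line.strip()))
    return hits


# Demo with two made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "minutes.txt").write_text(
        "Meeting opened.\nMr. Alder spoke on the bridge.\n", encoding="utf-8"
    )
    (folder / "letter.txt").write_text(
        "Dear Sir,\nThe BRIDGE is unsafe.\n", encoding="utf-8"
    )
    demo_hits = search_folder(folder, "bridge")

for hit in demo_hits:
    print(hit)
```

A plain substring search like this one will miss spelling variants and OCR errors, so real projects often need something more forgiving. It does show why text files are so much easier to search than images.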

Secondary Sources

Older secondary sources may be processed in the same manner as old primary documents. Newer secondary sources are probably already natively in a digital form and can be found by an ordinary web or database search. Unfortunately, many of these sources are behind a paywall. If you don’t want to pay, go through a library. If your university supports OneSearch, this is the easiest way to start looking. Otherwise, your library may have research guides concerning which materials it has available and how to access them. (Each library is somewhat different.)

File Types

Text Files

There are several common forms of text files.

Pure text files end with “.txt”. They contain only text, with no formatting information or other types of content. This form is generally easy for humans to read and is suitable for computer program code. These are sometimes called plain text files.
Rich Text Format files contain some formatting information and end with “.rtf”. These are generally not suitable for computer program code.
Some older word processors read and generate files that end with “.doc”. These can contain formatting information, images and other types of content.
Some newer word processors generate files that end with “.docx”. They are similar to .doc files in terms of content and information, but have a significantly different file format.
Comma-separated value files store values separated by commas. Strictly speaking, these are text files, but in a form that is readable by databases, spreadsheets and other specialized software. Collections of public and historical records are often exported in this format. They often end with “.csv”.
Structured Information files are similar to database records, and may be considerably more complex than a simple .csv file. They may end with “.xml”.
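The last two file types in the list above can both be read with Python's standard library. The following sketch reads the same two records once from CSV text and once from XML text. The names, years, places and XML tag names are made up for illustration; real exports will have their own columns and tags.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data; real exports will have different columns and tags.
CSV_TEXT = """name,year,place
Ada Alder,1901,Springfield
Ben Birch,1899,Riverton
"""

XML_TEXT = """<records>
  <person year="1901"><name>Ada Alder</name><place>Springfield</place></person>
  <person year="1899"><name>Ben Birch</name><place>Riverton</place></person>
</records>"""


def read_csv_rows(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_xml_rows(text):
    """Parse the XML text into the same list-of-dictionaries shape."""
    root = ET.fromstring(text)
    return [
        {
            "name": person.findtext("name"),
            "year": person.get("year"),  # stored as an attribute in this sample
            "place": person.findtext("place"),
        }
        for person in root.findall("person")
    ]


csv_rows = read_csv_rows(CSV_TEXT)
xml_rows = read_xml_rows(XML_TEXT)
print(csv_rows[0]["name"], csv_rows[0]["year"])
print(xml_rows == csv_rows)
```

Note that every value arrives as text, including the years. The CSV version is flat, one row per record, while the XML version could nest further detail inside each person, which is the extra complexity the list above refers to.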

Archival Files

Archival files may contain primarily text, but they often contain additional elements such as formatting. The Portable Document Format (PDF) can preserve formatting, but it has several possible deficiencies that can make retrieving the document in its original form problematic. PDF/A, a variant of PDF designed for long-term preservation, is therefore a preferred archival file format. It attempts to maintain device independence, self-containment and self-documentation. For example, it requires embedded fonts rather than linked fonts, in case the linked source is no longer available. The PDF/A format prohibits the inclusion of audio, video, JavaScript and executable content. So this format may be suitable for traditional print media but not for your favorite video game.

There are several versions of PDF/A files, such as PDF/A-1, PDF/A-2, and PDF/A-3, with the higher numbers allowing for embedding more advanced content such as richer graphics. (If interactivity is required, the PDF/E format might be considered, albeit at the loss of some portability.)
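A PDF/A file states which version it claims to follow in its embedded XMP metadata, using the properties pdfaid:part and pdfaid:conformance. The Python sketch below looks for that statement in the raw bytes of a file. It is a rough hint, not a validator: it assumes the metadata is stored uncompressed and readable, it only reports what the file claims about itself, and the function name and sample bytes are invented for this illustration. Checking real conformance requires dedicated validation software.

```python
import re

# Match both XMP spellings: <pdfaid:part>1</pdfaid:part> and pdfaid:part="1".
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def declared_pdfa_level(data: bytes):
    """Return a label such as 'PDF/A-1b' if the bytes declare one, else None."""
    if not data.startswith(b"%PDF-"):
        return None  # does not even begin like a PDF file
    part = _PART.search(data)
    if part is None:
        return None  # no PDF/A claim found (or the metadata is not readable)
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"


# Hand-made stand-in for the start of a file; not a complete, valid PDF.
sample = (
    b"%PDF-1.4\n"
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>\n"
)
print(declared_pdfa_level(sample))                          # PDF/A-1b
print(declared_pdfa_level(b"%PDF-1.7\nno metadata here"))   # None
```

For a file on disk, the bytes could be obtained with Path("some_file.pdf").read_bytes(), where the file name is a placeholder.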

Image Files

There are several common types of image files:

.gif: good for simple illustrations
.jpg: good for photographs; a lossy compressed format that saves disk space and loads faster, at some cost in image detail
.png: good for illustrations and photographs, with compression that loses no detail; very old software may not support it
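A file's extension can be wrong or missing, especially in old collections. Each of the three formats above begins with a fixed, well-known sequence of bytes, so a program can check the content itself. This Python sketch does so; the function name is invented for this illustration, and the sample byte strings are only the opening bytes, not complete images.

```python
def sniff_image_type(data: bytes) -> str:
    """Guess the image type from the first bytes of the file content."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    return "unknown"


# Opening bytes only; these are not complete, viewable images.
print(sniff_image_type(b"GIF89a" + b"\x00" * 10))         # gif
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 10))  # jpg
print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 10))  # png
print(sniff_image_type(b"plain text"))                    # unknown
```

For a real file, only the first few bytes need to be read before calling the function.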

Video Files