Records and Archival Management of World Wide Web Sites
*Mr. LeFurgy has held a variety of positions over the last 20 years with U.S. local and national government archival institutions. The opinions expressed are solely those of the author.
Since first gaining broad public attention in the mid-1990s, the World Wide Web has rapidly expanded across the cultural landscape. Virtually all organizations, including most government agencies, have set up web sites to provide information and conduct business. As web sites grow, so does dependency on them for accountability, evidence, and other purposes that require recorded documentation. Governments find they must take steps to manage web sites as information resources and, in some cases, to preserve sites as archival records. This is a terrifying prospect. Web sites are maddeningly different from paper records and even differ from databases, e-mail, and other electronic records.
How can the well-intentioned archivist or records manager cope?
Unfortunately, there are no easy answers. The web is still new and the technology upon which it is based is constantly changing. A period of trial, error, and learning lies ahead before there are broadly applicable philosophies and techniques for effectively managing web records. Despite the frustrating lack of a silver bullet, there are some concepts and approaches that archivists and records managers can consider right now in their efforts to tame the web.
Developing a Management Structure for Web Sites
The level of effort needed to manage web site records will vary. Since the costs of possible approaches vary significantly, it is wise to select an option tailored to the needs of the individual site. One concept that has enjoyed recent popularity is based on risk management: an organization assigns a high, medium, or low risk level to its site (or to sections of the site). Risk is defined in terms of potential legal, operational, or financial requirements that might be associated with the site and its information. For example, if the site is used to file benefit claims, it likely would have a high-risk designation since there is a good chance that someone might contest the process or its results. The assigned level of risk will have a direct bearing on choices for site recordkeeping. This concept is best explained in a publication of the Canadian government, An Approach to Managing Internet and Intranet Information for Long Term Access and Accountability.
Organizations will use risk perception and other factors (including potential research value) to establish appropriate recordkeeping methods. Possible approaches range from simple techniques (such as filing hard copies in a desk drawer) to complex technological solutions (such as electronic filing with version control by means of a records management application).
Appraising Web Records
Appraisal of web records is complicated by two opposing factors. The first is that a web site is often a singular collection that can provide important evidential and informational value. With its metadata tags and links, a web version of a document differs significantly from hard copy and other versions. Because it is easy to update, the web version of a document may be the most current. An electronic format also can make a document significantly easier to use, especially for searching and copying. This argument in favor of archival retention is forcefully counterbalanced by the second factor: the large majority of web sites cannot ensure reliable recordkeeping. Most sites do not provide for secure filing and cannot guarantee that all information presented is complete or accurate. And while things may change in the future, the critical documentary evidence of an organization’s activities most often resides somewhere other than the web, such as in other electronic systems or in paper documents.
Even with their shortcomings, however, chances are that at least some web sites and related documentation do merit a place in archival collections. The trick is to determine what to save. The decision can be helped by grouping web site records into three categories:
Appraisal decisions for each category can vary. If the primary interest is in preserving the information posted on a site, appraisal can zero in on the third category, and within that, perhaps only on parts of a site. Interest in preserving the actual "look and feel" of the site will require capturing everything in category 3, as well as some documentation in category 2. If an organization is placing great emphasis on using the web to fulfill a core mission (such as through e-commerce or e-government), it might be appropriate to preserve some elements of category 1 to document the transition. Whatever immediate appraisal decisions are made, however, it is important to recognize that changes in web site use or technology compel periodic reassessments.
Many web sites now enable access to databases and other large information collections. Such collections are typically separate from the web site itself: software retrieves information from a database and presents it on the web. Since they are structurally separate, it typically makes sense to appraise, and if necessary, preserve, a database separately from the web site. Appraisal efforts should also take account of different types of web sites, most particularly Intranets (information made available only within an organization) and Extranets (information made available only to specified individuals outside the organization). Where they are used, the content and purpose of such sites can vary greatly, which will influence appraisal decisions. The National Archives of Australia has provided some basic information about management of such records.
Capturing, Preserving, and Accessing Archival Web Sites
A number of software packages will automatically copy the site at a specified Uniform Resource Locator (URL) and store the files on a PC hard drive. Examples include Teleport Pro, HTTrack, and WebCopier; there are many others to choose from. Such software basically duplicates or "mirrors" a site as it appears on its host computer, although there are options to exclude image files, parse embedded software files (such as Java), and limit the extent to which linked or lower level pages are captured. The extent to which these options are used depends on appraisal, preservation, and access considerations.
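The capture options mentioned above (excluding image files and limiting how far linked pages are followed) can be pictured as a simple scope test applied to each link a copying tool encounters. The following Python sketch is illustrative only: it does not reproduce how Teleport Pro, HTTrack, or WebCopier work internally, and the function name, parameter names, default depth, and list of image extensions are all assumptions made for the example.

```python
from urllib.parse import urlparse
import posixpath

# Assumed list of image extensions; a real tool would let the user configure this.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png", ".tif", ".tiff"}


def in_capture_scope(url, start_url, depth, max_depth=3, include_images=True):
    """Return True if url should be copied under the chosen capture options.

    depth is the number of links followed from the starting page to reach url.
    """
    start = urlparse(start_url)
    target = urlparse(url)

    # Stay on the host being archived; links to other sites are not followed.
    if target.netloc.lower() != start.netloc.lower():
        return False

    # Limit how many link levels below the starting page are captured.
    if depth > max_depth:
        return False

    # Optionally leave out image files.
    extension = posixpath.splitext(target.path)[1].lower()
    if not include_images and extension in IMAGE_EXTENSIONS:
        return False

    return True
```

Which options to set remains an appraisal decision: a capture aimed only at posted information might exclude images and use a shallow depth, while a "look and feel" capture would include everything.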
Preserving a web site for ongoing access is challenging. All aspects of computer technology have a tendency toward rapid obsolescence. Today’s electronic files may be difficult to access in 20 years because the computer software and hardware needed to interpret and present the information may not be available. This is especially true of proprietary technology: a company with current popular products could easily be out of business or using different technology in just a few short years. This leaves archival collections of electronic records vulnerable to obsolescence as well. While there is hope that archivists can one day have tools to cope with this threat, there is no current assurance that electronic information tied to a proprietary format can be kept accessible into the future. Non-proprietary formats, on the other hand, can be kept accessible. The U.S. National Archives and Records Administration has a non-proprietary transfer standard that involves use of ASCII software files stored on magnetic tape.
Basic web text documents in Hypertext Markup Language (HTML) can be saved according to non-proprietary standards. But it is readily apparent that web sites are full of proprietary file formats, including Java, ActiveX or other applets; .jpg, .gif, and .tiff images; and Word, WordPerfect, and Adobe .pdf documents. Since such files are often a critical element of a web site, saving just the HTML text is an incomplete solution. The best strategy at this point for preserving a web site or a section of it is to copy all pertinent directories and files as they exist on the host computer. This provides a full portrait of the site and is also the easiest way to use the copying tools. (Some of the applets and image files may prove unreadable in the future, but full capture will provide at least the potential for viewing the site as it existed; full capture also is the best way to preserve a site’s content, context, and structure.) At least two exact copies should be made, including at least one on removable storage media such as magnetic tape or CD-ROM. The copies should be periodically checked and recopied to ensure that the media remain readable; if possible, it is wise to store the media in controlled environmental conditions.
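One way to confirm that two copies are in fact exact, and that a copy is still readable at a later check, is to compare checksums of every file. The Python sketch below is a minimal illustration, not a prescribed procedure: the function names are hypothetical, SHA-256 is simply one reasonable choice of checksum, and each file is read into memory whole, which would need adjusting for very large files.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def compare_copies(master_root, copy_root):
    """Report files missing from, extra in, or different in the second copy."""
    master = build_manifest(master_root)
    copy = build_manifest(copy_root)
    shared = master.keys() & copy.keys()
    return {
        "missing": sorted(set(master) - set(copy)),
        "extra": sorted(set(copy) - set(master)),
        "changed": sorted(p for p in shared if master[p] != copy[p]),
    }
```

If all three lists come back empty, the two copies match file for file. A file that can no longer be read from aging media would instead raise an error during the check, which is itself a signal that recopying is overdue.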
The frequency for capturing a site depends again on appraisal considerations. If the site documents a temporary organization or function, it might be best to capture only the final version. An ongoing entity could be handled with periodic copying of the whole site or alternatively with copies of changes made to site content. This capture is separate and distinct from systems backups made as part of regular computer operations. Typically made with specialty software, systems backups do not serve long-term preservation purposes.
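For an ongoing site, the alternative of keeping only the changes between captures presupposes some way of telling what changed. The Python sketch below assumes that each capture has been summarized as a mapping from file path to a content checksum; the function names, file names, and page contents are invented for illustration.

```python
import hashlib


def content_digest(data):
    """Return a SHA-256 checksum for the given bytes."""
    return hashlib.sha256(data).hexdigest()


def changes_between_captures(previous, current):
    """Compare two capture summaries (relative path -> content checksum)."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical summaries of two captures of the same site.
january = {
    "index.html": content_digest(b"<h1>Welcome</h1>"),
    "about.html": content_digest(b"About us"),
}
march = {
    "index.html": content_digest(b"<h1>Welcome back</h1>"),
    "news.html": content_digest(b"News"),
}

print(changes_between_captures(january, march))
# {'added': ['news.html'], 'removed': ['about.html'], 'modified': ['index.html']}
```

Only the added and modified files would need to be copied for the later capture, although the list of removed files is worth recording too, since it documents what disappeared from the site.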
In conjunction with an effort to mirror a web site, it is important to document technical issues and other aspects of the site. This is necessary to understand the original purpose of the site, as well as its technical parameters. It might be appropriate to prepare a narrative and collect useful policy statements, project plans, and other descriptions of the site that may exist. Printing certain portions of the site, such as top-level site pages, could also be worthwhile for ease of reference. Technical documentation should include an overview of the types of file formats and software used within the site (such as .pdf, Word, .jpg, .gif, Java, and so forth); this description should also include the versions of the formats and software, if known. A site map (a hierarchical list of directories and files included within the site) annotated with useful descriptions will greatly ease future use of the information. Details regarding the number and type of storage media used are also important.
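A first draft of the site map and of the file format overview can be generated from the captured copy itself. The Python sketch below is one possible approach under simple assumptions: the function names are hypothetical, formats are identified only by file extension, and format or software versions are not detected. Descriptive annotations would still have to be added by someone who knows the site.

```python
from collections import Counter
from pathlib import Path


def site_map_lines(root):
    """Return an indented, hierarchical listing of directories and files."""
    root = Path(root)
    lines = [root.name + "/"]

    def walk(directory, level):
        # List directories first, then files, each group in name order.
        entries = sorted(
            directory.iterdir(), key=lambda p: (p.is_file(), p.name.lower())
        )
        for entry in entries:
            indent = "  " * level
            if entry.is_dir():
                lines.append(f"{indent}{entry.name}/")
                walk(entry, level + 1)
            else:
                lines.append(f"{indent}{entry.name}")

    walk(root, 1)
    return lines


def format_inventory(root):
    """Count the files under root by (lowercased) file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return dict(counts)
```

Saving the output of both functions as plain text alongside the capture keeps the documentation in a non-proprietary form.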
The simplest method to provide access to a copy of a smallish web site is to store the copy in a separate directory on a hard disk. The dates and content of each directory can be listed to facilitate reference. This would permit quick access to the information, either in-house or through the web. If the amount of data precludes keeping an online reference copy, some variety of removable media can be used. Where multiple media are involved, descriptive labels must be used. Regardless of the method used to provide access, a second copy of the information must be maintained separately, preferably as physically distant from the first copy as possible.
The ideas and approaches outlined here offer no guarantee that web sites can be appraised and preserved with complete success. We do not yet know what parts of web sites will be most important in terms of historical documentation, and this makes it hard to settle on a firm appraisal policy. We do not know how quickly and how radically web technology will change, and this makes it difficult to prescribe capture and preservation standards. We do know, however, that the web is a historic phenomenon and that it is necessary to dig in and do our best to ensure that it is addressed as such. From that practical experience will come improved tools and techniques that archivists and records managers need to deal with web records.