UVC-based preservation

Last updated

UVC-based preservation is an archival strategy for handling the preservation of digital objects. It employs the use of a Universal Virtual Computer (UVC)—a virtual machine (VM) specifically designed for archival purposes, that allows both emulation and migration to a language-neutral format like XML. [1]

Contents

Background to the development of a UVC approach

Digital preservation problem

Preservation of digital resources is of a paramount importance for deposit libraries, research libraries, archives, government agencies, and actually most organizations. [1] The dominant approach to digital preservation is migration. Migration entails making periodic transformations of archived information into new logical formats as their native formats, or the software or hardware on which they depend becomes obsolete. The notable danger of migration is data loss, and possible loss of original functionality or the ‘look and feel’ of the original format. Furthermore, digital migrations are time consuming and costly as the process requires converting the format of every document, in addition to copying converted bit streams to new media as necessary.

Emulation theory

Jeff Rothenberg caused a bit of stir in organizations concerned and responsible for digital preservation with his report in 1999: "Avoiding technological quicksand: Finding a viable technical foundation for digital preservation". He states that there are no viable solutions to ensure that digital information will be readable in the future. The proposed solutions of relying on standards and migrations are labeled time consuming and ultimately incapable of preserving digital documents in their original form. He suggests:

"an ideal approach should provide a single, extensible, long-term solution that can be designed once and for all and applied uniformly, automatically, and in synchrony (for example, at every future refresh cycle) to all types of documents and all media, with minimal human intervention."

He proposes that the best way to satisfy the above criteria is Emulation by; developing an emulator that will run on unknown future computers; developing techniques to capture the metadata needed to find, access and recreate the document; developing techniques for encapsulating documents, their attendant metadata, software, and emulator specifications.

In 2000, he suggests implementing an emulation-based preservation approach in which emulator specification are expressed as programs and interpreted by an emulator specification interpreter program written for an emulation virtual machine.

Rothenberg's approach was met with skepticism and considered too technically challenging, too expensive and too time consuming, and therefore an economic risk (without the support of empirical evidence). (See further reading section)

UVC concept development

Role of IBM

Raymond A. Lorie, during his employment at IBM Research Centre Almaden, initiated the development of a UVC-based solution to long-term digital preservation. [2] He describes the approach as ‘Universal’ because its definition is so basic that it will endure forever, ‘Virtual’ because it will never have to be physically built and it is a ‘Computer’ in its functionality.

IBM (NL), the asset owner of the UVC, continues to develop the UVC concept within the PLANETS project. Raymond van Diessen is responsible for extending the application of the UVC concept to preserve more complex objects.

Role of the National Library of the Netherlands

The National Library of the Netherlands (Koninklijke Bibliotheek, KB) played a major role in demonstrating that emulation based on the UVC concept is a viable option for long-term digital preservation.

In 2000, the emulation advocate, Jeff Rothenberg participated in a study with the KB to test and evaluate the feasibility of using emulation as a long-term preserving strategy. His method was to use software emulation to reproduce the behaviour of obsolete computing platforms on newer platforms offering a way of running a digital document’s original software in the far future, thereby recreating the content, behaviour, and ‘look and feel’ of the original document. [3] Rothenberg was criticized for trying to preserve the wrong thing by suggesting to emulate the behavior of old hardware platforms and operating systems to access the original data through the original software program associated with it. Raymond A. Lorie recognized the difficulties in trying to create a program to emulate a 'real' machine on a future platform and realised that this approach was overkill for the purpose of preserving digital objects. Instead he introduced a novel approach of data/program archiving using a ‘Universal Virtual Computer’. [2] The concept of the UVC-based preservation strategy was implemented by the KB and tested on PDF files as part of a KB/IBM ‘Long Term Preservation’ (LTP) study. [4] Creating a UVC for PDF documents is more complex. Instead the KB decided on developing a UVC for images because this approach would also cover PDF documents (a PDF file can easily be converted to a series of images). The UVC-based approach resulted in the UVC as one of the permanent access tools for JPEG/GIF87 images within the Preservation Subsystem of the KB’s e-Depot. [5] Following the successful implementation of the UVC, the KB has continued to develop their emulation strategy for long-term digital preservation by focusing on 'full' or hardware emulation. This approach delivered a durable x86 component-based computer emulator: Dioscuri, the first modular emulator for digital preservation. [6]

UVC-based preservation

The Universal Virtual Computer is part of a broader concept, called the UVC-based preservation method. This method allows digital objects (like text documents, spreadsheets, images, sound waves, etc.) to be reconstructed in its original appearance anytime in the future. The methods are programs written in the machine language of a Universal Virtual Computer (UVC). The UVC is completely independent of the architecture of the computer on which it runs.

The UVC itself is a program which contains a set of instructions rather than a physical computer. It will run as a software application on a future platform. Because we do not know at this time which hardware is available in the future, the UVC must be created at the time we want to access a particular document from the repository. This UVC then forms the platform on which programs can run that have been specifically written for such UVC in the past. Creating an emulation program for the UVC in the future is much simpler than trying to emulate a 'real' machine.

Application description

The method of a UVC-based preservation strategy differentiates between data archiving which does not require full emulation, and program archiving which does. For archiving data, the UVC is used to archive methods which interpret the stored data stream. [2] The methods are programs written in the machine language of a Universal Virtual Computer (UVC). The UVC program is completely independent of the architecture of the computer on which it runs.

Data archiving

Data archiving reconstructs the 'look and feel' of the original file but not the functionality of the original format. If the electronic form of the document is only used for compact storage or if the way the document looks to the human eye is all there is, then it suffices to archive the document as an image. If additional functionality is needed, such as text searching, storing only the image is not enough. In this case the text also needs to be archived along with the image of the document. By restoring the original appearance of a file as an image a future user can see what the original file looks like in page layout, style, font etc. The text itself needs to be exported i.e. in ASCII format and can be saved as a sequence of homogeneous elements (all presentation attributes like font, size, etc. are the same for all characters) because the page image shows the exact look of the page. In this case the UVC program of the data has two parts, one to decode the text and one to decode the image.

What it entails

The data contained in the bit stream is stored with an internal representation, extracted from the data stream, of logical data elements that obey a certain schema in a certain data model. A decoding algorithm (method) extracts the various data elements from the internal representation and returns them tagged according to the schema. An additional schema (schema to read schemas) with information of the schema is similarly stored with the data together with a method to decode the schema to read schemas.

Logical Data View

The logical data model is kept simple in order to minimize the amount of description accompanying the data and to decrease the difficulty of understanding the structure of the data. The data model chosen for the UVC-based preservation method linearizes the data elements into a hierarchy of tagged elements organized using a XML-like approach. The tagged data elements are extracted from the data stream of the digital file. A tag specifies the role that the data element plays in the data structure. The element tags hold the specific information about the content of the data in a technology-independent manner. Furthermore, the data elements tagged according to the schema are returned to the client in a Logical Data View (LDV)

Example of Logical Data View

<Collection>  <landscape>   <name>    sugarloaf   <category>   mountains   <picture>    <year>   1916    <nbr_lines>  1500    <dots_per_line>  2100     <line>  …
The schema (format decoder)

More information is needed about the various data elements in order to humanly understand what each element means, information such as the place of the tags in the hierarchy, the type of data (numeric, characters), together with some information on the semantics of the data. For example, the image has two attributes, width and heights, indicating that width times height pixels follow; but are these pixels stored line by line or column by column? Or, for colored pictures, how to interpret the RGB values in order to recreate the right color? This extra information is also called metadata. The schema is clearly application-dependent as it describes the structure and meaning of the tags as parts of a specific information type.

Schema to read the schemas (Logical Data Schema (LDS))

If in the future a user gets the tagged data elements, he/she will generally not understand the meaning of the data and the relationships between them and the future user will need additional information on the logical structure. In other word, a schema to read the metadata schema is needed. A simple solution adopted for the UVC approach is a method for the schema similar to the method for the data: the schema information is stored in an internal representation, and accompanied by a method to decode it.

At this point, what will be included in the archive is: the data itself, the metadata, a UVC program to decode the data, and a UVC program to decode the metadata.

Program archiving

The UVC method for data archiving can be extended for program archiving. Program archiving involves archiving the behavior and functionality of a program and may involve archiving the operating system as well. Archiving the operating system may not be needed if the program is only a series of native instructions of the operating system. However, the operating system must be archived if the digital object is a full-fledged system with Input/Output interactions.

If no Input/Output interactions are needed it suffices to archive the operating system's program. In this case, using a similar method as described above, the following needs to be stored at archiving time:

  • the program
  • a UVC program that emulates the instruction sets of the operating system on which the original program (i.e. software application) runs.

In the future - the UVC interprets the UVC code which will yield the same result as the original program running on the original operating system.

When Input/Output interactions are involved things become more complicated as an additional UVC program that mimics the functioning of the Input/Output device processor must be archived. This UVC program will produce an Input/Output data structure.

In the future - a mapping of the data structure needs to be written to the actual device.

The UVC method replaces the need for a multitude of standards (one for each format) by a single standard on the UVC method. That standard should cover: the UVC functional specifications, the interface to call the methods, the model for the schema and for the schema to read schemas

Specification

The central idea of UVC-based preservation is that digital objects preserved in an archive can be reconstructed anytime in the future without losing the meaning of that object. The UVC architecture is influenced by the characteristics as a real existing computer. It contains a memory, registers and a set of low-level instructions. The architecture differs from a 'real' computer in that it never has to be physically implemented. Consequently there are no actual physical cost. The core element of the UVC is its segment-based memory. It uses segments of memory to store distinct parts of the data. This segment-based design prevents the allocated memory to be accidentally overwritten by other applications as it does not share its memory space. [7]

Conceptual model

Together with the original data it is possible to reconstruct the meaning of each particular digital object. The UVC can be seen as the heart of the system. Like the Java Virtual Machine and the Common Language Runtime, the UVC is actually an emulator which allows a program to run on virtual instances of the necessary, usually obsolete, hardware, and will continue to emulate the necessary hardware as technology continues to evolve. Because we do not know at this time which hardware is available in the future, the UVC must be created at the time we want to access a particular document from the repository. The UVC forms the platform on which programs specifically written for the UVC can run.

What needs to be done

Different steps must be taken at archiving time (present) and retrieval time (future).

At archiving time

Step 1 - Define the appropriate logical schema for a given application

Step 2 - Choose an internal representation and associate a UVC program P with the data. This is part of the normal design of an application

Step 3 – Write the UVC program for data interpretation

Step 4 - Archive the schema information by storing an internal representation of the schema information in the bit stream together with a UVC program Q to decode it. Since the structure of the schema is the same for all applications, a schema to read schemata is chosen once and for all.

At retrieval time

Step 1 - Create an emulator on the current platform. Because of the simplicity of the UVC concept, it is fairly easy for skilled software developers to construct a UVC emulator for a particular platform of the time

Step 2 - Develop a Logical Data Viewer (a restore program to restore the data). This is an application program that reads the UVC object code and the bit stream and invokes the emulator to execute the UVC program i.e. the program controls the UVC and all input/output interaction between it

Step 3 - Write a restore program to restore the schema. Since the logical view for the schema information is fixed a single restore program may actually support all applications. If the future client already knows the logical view for the documents being restored then the schema does not necessarily needs retrieving. Furthermore, the schema only needs to be requested once for a collection of documents of the same type

UVC convention

The UVC convention includes the information items that need to be archived today, and preserved indefinitely to enable the retrieval of digital objects in the future. Included in the convention are;

The convention must be 'written in stone'. It can be saved digitally, on paper and/or in micrographic media.

Preservation System

Components

UVC-based preservation as the central idea of the UVC-based preservation method is based on four different components. These are:

  • UVC program (format decoder)
  • Logical data Schema (LDS) with information type description
  • Universal Virtual Computer
  • Logical data viewer (UVC interface)
    • UVC interpreter
    • Restoration program

Fig. 2 UVC and its components UVC and its components.jpg

Description of method

The UVC program decodes the file format of a digital object. This format decoder program runs on the UVC, which is the platform-independent layer, independent of future hard- and software changes. Executing the format decoder delivers the element tags. These elements build the Logical Data View (LDV) of the data, which is quite similar to XML. The LDV is an instantiation of the LDS, describing the structure and meaning of the tags as parts of a specific information type.

All these components are controlled by a Logical Data Viewer simply called viewer. For reconstruction, the viewer starts the UVC and feeds it with the data of the digital object to a format decoder running on top of the UVC. In return it retrieves an LDV and reconstructs a specific representation of the original object’s meaning.

Performance

The architecture relies on concepts that have existed since the beginning of the computer era: memory, registers and basic instructions without secondary features often introduced to improve the execution performance. The performance is of secondary concern as the UVC programs are run mostly to restore the data and not work with them.

Speed is not a real concern either as future machines will be much faster, and an emulation of the UVC on a future machine will consequently run much faster. Furthermore, the flexibility of the UVC is more important than the speed of execution. Even so, performance can always still be improved.

Implementation

The UVC for data archiving i.e. the archiving of static files, is proven to work in an operational digital archiving environment. The UVC is one of the permanent access tools for images at the KB.

UVC for Images

The UVC is proven to successfully restore digital objects in their original form. The application is simple because with images no functionality is needed. The approach to develop a UVC for JPEG images is justified as most formats can be converted to this format. For example, a PDF document can be displayed as a series of JPEG images thereby retaining the 'look and feel' of the original digital object but sacrificing the functionality. Furthermore, the application for JPEG images can be easily adopted to emulate TIFF images by making a small adjustment to the Logical Data Schema.

The approach can also be applied to all other objects which do not contain behavioral aspects. For example, interpreters have been written (partially) for Excel, Lotus 1-2-3 and PDF. However, these applications only handle the static features of the formats.

UVC-based emulation

UVC-based emulation uses the UVC as a universal platform on which a platform independent emulator can be built. The UVC (software program) recreates a simple general purpose computer and can be easily implemented on any computer platform now and in the future. With this strategy future users should always be able to access and view the original object. The official UVC specification must be preserved at preservation time. Also decoders must be developed for each specific file format and a LSD is required for each type of digital object, defining object types by image, sound, spreadsheet, text, etc. And of course the original objects should be preserved as well. [8]

Complex objects/dynamic content

As mentioned before, the UVC-based approach has only been effectively implemented for static files. The technology continues to be developed by Raymond van Diessen (IBM) to include dynamic objects by exploiting the communication facility between the UVC program and a future application. [9]

Alternative emulation approaches

Other emulation approaches are stacked emulation, migrated emulation and Emulation Virtual Machine (VM).

Stacked emulation

Stacked emulation is platform dependent emulation that requires, over time, multiple emulators running on top of each other to reconstruct a historic platform. This brings better performance and functionality but lacks compatibility between platforms. This approach can mainly be found in the gaming industry.

Migrated emulation

Migrated emulation involves creating a platform dependent emulator that must be migrated (adapted) to subsequent newer hosts. When the particular operating system on which the emulator is created becomes obsolete, the emulator is translated to run on the new current platform. This approach is a strategy with high risks

Emulation Virtual Machine (EVM)

The EVM was presented by Jeff Rothenberg in 1999 and involves introducing an additional layer between the host platform and emulator and is said to be platform and time independent. This approach uses a virtual machine, and an emulator specification interpreter. It is said to be platform and time independent. It is quite complex as an emulation specification needs to be written for the computer platform on which the original software runs. The specification is then interpreted by an emulation specification interpreter that creates an emulator for the old platform. Both the interpreter and the created emulator run on the EVM.

Cost implications

Copyright issues for this approach are not expected to be different from those of any other approach.

If intellectual property rights exist for a format this issue has to be taken up with the format owners. Similarly, to 'UVC-enable" applications the source code is required from the developer and therefore permission from the owner. Lastly, for hardware emulation, all the relevant licenses of the software running on the system are required.

Historical timeline

See also

Related Research Articles

In computing, a virtual machine (VM) is an emulation of a computer system. Virtual machines are based on computer architectures and provide functionality of a physical computer. Their implementations may involve specialized hardware, software, or a combination.

A disk image, in computing, is a computer file containing the contents and structure of a disk volume or of an entire data storage device, such as a hard disk drive, tape drive, floppy disk, optical disc, or USB flash drive. A disk image is usually made by creating a sector-by-sector copy of the source medium, thereby perfectly replicating the structure and contents of a storage device independent of the file system. Depending on the disk image format, a disk image may span one or more computer files.

MAME Arcade game emulation software

MAME is a free and open-source emulator designed to recreate the hardware of arcade game systems in software on modern personal computers and other platforms. Its intention is to preserve gaming history by preventing vintage games from being lost or forgotten. It does this by emulating the inner workings of the emulated arcade machines; the ability to actually play the games is considered "a nice side effect". Joystiq has listed MAME as an application that every Windows and Mac gamer should have.

ROM image data dump from a ROM chip

A ROM image, or ROM file, is a computer file which contains a copy of the data from a read-only memory chip, often from a video game cartridge, or used to contain a computer's firmware, or from an arcade game's main board. The term is frequently used in the context of emulation, whereby older games or firmware are copied to ROM files on modern computers and can, using a piece of software known as an emulator, be run on a different device than which they were designed for. ROM burners are used to copy ROM images to hardware, such as ROM cartridges, or ROM chips, for debugging and QA testing.

BBC Domesday Project crowdsourced born-digital description of the UK, published in 1986

The BBC Domesday Project was a partnership between Acorn Computers, Philips, Logica and the BBC to mark the 900th anniversary of the original Domesday Book, an 11th-century census of England. It has been cited as an example of digital obsolescence on account of the physical medium used for data storage.

Digital obsolescence

Digital obsolescence is a situation where a digital resource is no longer readable because of its archaic format. There are two main categories of digital obsolescence:

  1. Software: the software needed to access the digital file becomes obsolete. An example here would be WordStar, a word processor popular in the 1980s which used a closed data format. WordStar is no longer readily available and modern machines are not configured to read data in this format.
  2. Hardware: the hardware needed to access the digital file becomes obsolete and thus no longer available. One example is the floppy disk and floppy disk drive; the disks are no longer commercially available and modern computers no longer have the drives to read them built in as standard.

In library and archival science, digital preservation is a formal endeavor to ensure that digital information of continuing value remains accessible and usable. It involves planning, resource allocation, and application of preservation methods and technologies, and it combines policies, strategies and actions to ensure access to reformatted and "born-digital" content, regardless of the challenges of media failure and technological change. The goal of digital preservation is the accurate rendering of authenticated content over time. The Association for Library Collections and Technical Services Preservation and Reformatting Section of the American Library Association, defined digital preservation as combination of "policies, strategies and actions that ensure access to digital content over time." According to the Harrod's Librarian Glossary, digital preservation is the method of keeping digital material alive so that they remain usable as technological advances render original hardware and software specification obsolete.

Amiga software is computer software engineered to run on the Amiga personal computer. Amiga software covers many applications, including productivity, digital art, games, commercial, freeware and hobbyist products. The market was active in the late 1980s and early 1990s but then dwindled. Most Amiga products were originally created directly for the Amiga computer, and were not ported from other platforms.

Digital dark age conceptual situation in which large swathes of historic digitised materials are inaccessible

The digital dark age is a lack of historical information in the digital age as a direct result of outdated file formats, software, or hardware that becomes corrupt, scarce, or inaccessible as technologies evolve and data decay. Future generations may find it difficult or impossible to retrieve electronic documents and multimedia, because they have been recorded in an obsolete and obscure file format, or on an obsolete physical medium, for example, floppy disks. The name derives from the term Dark Ages in the sense that there could be a relative lack of records in the digital age, as documents are transferred to digital formats and original copies are lost. An early mention of the term was at a conference of the International Federation of Library Associations and Institutions (IFLA) in 1997. The term was also mentioned in 1998 at the Time and Bits conference, which was co-sponsored by the Long Now Foundation and the Getty Conservation Institute.

The conservation and restoration of new media art is the study and practice of techniques for sustaining new media art created using from materials such as digital, biological, performative, and other variable media.

Digital artifactual value, a preservation term, is the intrinsic value of a digital object, rather than the informational content of the object. Though standards are lacking, born-digital objects and digital representations of physical objects may have a value attributed to them as artifacts.

In computing, virtualization refers to the act of creating a virtual version of something, including virtual computer hardware platforms, storage devices, and computer network resources.

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.

Metadata Data about data

Metadata is "data that provides information about other data". In other words, it is "data about data". Many distinct types of metadata exist, including descriptive metadata, structural metadata, administrative metadata, reference metadata and statistical metadata.

Video game console emulator Program that reproduces video game consoles behavior

A video game console emulator is a type of emulator that allows a computing device to emulate a video game console's hardware and play its games on the emulating platform. More often than not, emulators carry additional features that surpass the limitations of the original hardware, such as broader controller compatibility, timescale control, greater performance, clearer quality, easier access to memory modifications, one-click cheat codes, and unlocking of gameplay features. Emulators are also a useful tool in the development process of homebrew demos and the creation of new games for older, discontinued, or more rare consoles.

Emulator system that emulates a real system such that the behavior closely resembles the behavior of the real system

In computing, an emulator is hardware or software that enables one computer system to behave like another computer system. An emulator typically enables the host system to run software or use peripheral devices designed for the guest system. Emulation refers to the ability of a computer program in an electronic device to emulate another program or device. Many printers, for example, are designed to emulate HP LaserJet printers because so much software is written for HP printers. If a non-HP printer emulates an HP printer, any software written for a real HP printer will also run in the non-HP printer emulation and produce equivalent printing. Since at least the 1990s, many video game enthusiasts have used emulators to play classic arcade games from the 1980s using the games' original 1980s machine code and data, which is interpreted by a current-era system.

The following

The conservation and restoration of time-based media art is the study and practice of conserving time-based media and its components. The conservation and restoration of time-based media art is a complex undertaking within the field of conservation that includes understanding both physical and digital conservation methods; there are many facets of time-based media conservation. Its major aim is to detect and monitor short and long-term changes that an artwork may undergo in response to its environment, technological developments, exhibition-design, or technicians' preferences.

Video game preservation form of digital preservation applied to video games

Video game preservation is a form of preservation applied to the video game industry that includes, but is not limited to digital preservation. Such preservation efforts include archiving development source code and art assets, digital copies of video games, emulation of video game hardware, maintenance and preservation of specialized video game hardware such as arcade games and video game consoles, and digitization of print video game magazines and books prior to the Digital Revolution.

References

  1. 1 2 Lorie R. A., 2002. A Methodology and System for Preserving Digital Data. Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, Portland, Oregon, USA. 14–18 July 2002. New York, NY: Association of Computing Machinery. pp. 312-319 doi : 10.1145/544220.544296
  2. 1 2 3 Lorie R. A., 2001. Long term preservation of digital information. Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries, Roanoke, Virginia, United States. 24–28 June 2001. New York, NY: Association of Computing Machinery. pp. 346-352 doi : 10.1145/379437.379726
  3. Rothenberg, J., 2000. Experiment in using digital emulation to preserve digital publications. NEDLIB Report Series 1 [online] Den Haag: National Library of the Netherlands
  4. Lorie, R. A., 2002. The UVC: a Method for Preserving Digital Documents – Proof of Concept. IBM/KB Long-term Preservation Study. Amsterdam: IBM Netherlands
  5. Wijngaarden H., Oltmans, E., 2004. Digital Preservation and Permanent Access: The UVC for Images. Proceedings of the Imaging Science & Technology Archiving Conference, San Antonio, USA. 23 April 2004 pp. 254-259
  6. van der Hoeven J. R., Lohman B., Verdegem R., 2007. Emulation for Digital Preservation in Practice: The Results. The International Journal of Digital Curation, 2(2), pp. 123-132
  7. van Diessen R. J., van der Hoeven J. R., van der Meer K., 2005. Development of a Universal Virtual Computer (UVC) for long-term preservation of digital objects, 31(3), pp. 196-208 doi : 10.1177/0165551505052347
  8. van der Hoeven, J.R., van Wijngaarden, H., Verdegem, R., Slats, J., 2005. Emulation – a viable preservation strategy. [online] Koninklijke Bibliotheek / Nationaal Archief: The Hague, The Netherlands.
  9. Lorie R. A., van Diessen R. J., 2005. Long-term preservation of complex processes. Archiving 2005, Vol. 2, pp. 14-19

Further reading