Data archaeology

There are two conceptualisations of data archaeology, the technical definition and the social science definition.

Data archaeology (also data archeology) in the technical sense refers to the art and science of recovering computer data encoded and/or encrypted in now obsolete media or formats. Data archaeology can also refer to recovering information from damaged electronic formats after natural disasters or human error.

It entails rescuing old data trapped in outdated, archaic or obsolete storage formats, such as floppy disks, magnetic tape and punch cards, and transferring or transforming that data into more usable formats.

Data archaeology in the social sciences usually involves an investigation into the source and history of datasets and the construction of these datasets. It involves mapping out the entire lineage of data, its nature and characteristics, its quality and veracity and how these affect the analysis and interpretation of the dataset.

The findings of data archaeology affect how far the conclusions drawn from data analysis can be trusted. [1]

The term data archaeology originally appeared in 1993 as part of the Global Oceanographic Data Archaeology and Rescue Project (GODAR). The original impetus for data archaeology came from the need to recover computerised records of climatic conditions stored on old computer tape, which can provide valuable evidence for testing theories of climate change. Such recovery work allowed the reconstruction of an image of the Arctic that had been captured by the Nimbus 2 satellite on September 23, 1966, in higher resolution than ever seen before from this type of data. [2]

NASA also uses the services of data archaeologists to recover information stored on 1960s-era computer tape, as exemplified by the Lunar Orbiter Image Recovery Project (LOIRP). [3]

Recovery

There is a distinction between data recovery and data intelligibility. One may be able to recover data but not understand it. For data archaeology to be effective, the data must be intelligible. [4]
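The gap between recovery and intelligibility can be illustrated with a minimal sketch. It assumes, purely for illustration, that the recovered bytes were written in EBCDIC (code page 037); the encoding, the sample text and the function names are hypothetical and do not come from any of the projects described here.

```python
# Simulate bytes read back from an obsolete medium.
# ASSUMPTION: the original system used EBCDIC code page 037.
def recover_bytes() -> bytes:
    return "CLIMATE RECORD 1966".encode("cp037")


def decode_naively(raw: bytes) -> str:
    # Latin-1 maps every byte to some character, so this never fails,
    # but the result is unreadable when the guess about the format is wrong.
    return raw.decode("latin-1")


def decode_with_known_format(raw: bytes, codec: str = "cp037") -> str:
    # Knowing the original encoding is what makes the data intelligible.
    return raw.decode(codec)


raw = recover_bytes()
print(repr(decode_naively(raw)))      # recovered, but not intelligible
print(decode_with_known_format(raw))  # recovered and intelligible
```

In both calls every byte has been recovered intact; only the second, which relies on knowledge of the original format, yields readable text.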

A term closely related to data archaeology is data lineage. The first step in performing data archaeology is an investigation into the lineage of the data. Data lineage comprises the history of the data, their source and any alterations or transformations they have undergone. Data lineage can be found in the metadata of a dataset, the paradata of a dataset or any accompanying identifiers (methodological guides etc.). With data archaeology comes methodological transparency, which is the degree to which the data user can access the data history. The level of methodological transparency available determines not only how much can be recovered, but also how well the data can be understood. Data lineage investigation covers which instruments were used, what the selection criteria were, the measurement parameters and the sampling frameworks. [1]
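One way to picture a lineage record is as a small data structure that holds the source, instruments, selection criteria, sampling framework and transformation history of a dataset. The following is a sketch only; the class names, field names and sample values are hypothetical and do not follow any particular metadata standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LineageStep:
    description: str   # what was done to the data
    performed_by: str  # person or institution responsible


@dataclass
class DataLineage:
    source: str
    instruments: List[str]
    selection_criteria: str
    sampling_framework: str
    steps: List[LineageStep] = field(default_factory=list)

    def record(self, description: str, performed_by: str) -> None:
        """Append one alteration or transformation to the history."""
        self.steps.append(LineageStep(description, performed_by))

    def transparency_gaps(self) -> List[str]:
        """Return the names of lineage fields that were left empty."""
        gaps = []
        for name in ("source", "selection_criteria", "sampling_framework"):
            if not getattr(self, name).strip():
                gaps.append(name)
        if not self.instruments:
            gaps.append("instruments")
        return gaps


# Hypothetical example values, for illustration only.
lineage = DataLineage(
    source="field survey",
    instruments=["paper questionnaire"],
    selection_criteria="",
    sampling_framework="stratified random sample",
)
lineage.record("transcribed to spreadsheet", "survey team")
lineage.record("recoded missing values", "analyst")
print(lineage.transparency_gaps())
```

A non-empty list of gaps signals limited methodological transparency: the user cannot reconstruct that part of the data history.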

In the socio-political sense, data archaeology involves the analysis of data assemblages to reveal their discursive and material socio-technical elements and apparatuses. This kind of analysis can reveal the politics of the data being analysed and thus that of the institution that produced them. Archaeology in this sense refers to the provenance of data. It involves mapping the sites, formats and infrastructures through which data flow and are altered or transformed over time. It is concerned with the life of data and the politics that shape their circulation. This serves to expose the key actors, practices and praxes at play and their roles. It can be accomplished in two steps. The first is accessing and assessing the technical stack of the data (the infrastructure and material technologies used to build or gather the data) to understand the physical representation of the data. The second is analysing the contextual stack of the data, which shapes how the data are constructed, used and analysed. This can be done through a variety of processes, such as conducting interviews, analysing technical and policy documents and investigating the effect of the data on a community or the institutional, financial, legal and material framing. This can be attained by creating a data assemblage. [1]

Data archaeology charts the way data moves across different sites and can sometimes encounter data friction. [5]

Disaster recovery

Data archaeologists can also use data recovery after natural disasters such as fires, floods, earthquakes, or even hurricanes. For example, in 1995 during Hurricane Marilyn the National Media Lab assisted the National Archives and Records Administration in recovering data at risk due to damaged equipment. The hardware was damaged from rain, salt water, and sand, yet it was possible to clean some of the disks and refit them with new cases thus saving the data within. [4]

Recovery techniques

Data stored in outdated formats like the floppy disk have to be restored to newer formats

When deciding whether to attempt recovery, the cost must be taken into account. Given enough time and money, most data can be recovered. In the case of magnetic media, which are the most common type used for data storage, various techniques can be used to recover the data, depending on the type of damage. [4] :17

Humidity can cause tapes to become unusable as they begin to deteriorate and become sticky. In this case, a heat treatment can be applied to fix this problem, by causing the oils and residues to either be reabsorbed into the tape or evaporate off the surface of the tape. However, this should only be done in order to provide access to the data so it can be extracted and copied to a medium that is more stable. [4] :17–18

Lubrication loss is another source of damage to tapes. This is most commonly caused by heavy use, but can also be a result of improper storage or natural evaporation. As a result of heavy use, some of the lubricant can remain on the read-write heads which then collect dust and particles. This can cause damage to the tape. Loss of lubrication can be addressed by re-lubricating the tapes. This should be done cautiously, as excessive re-lubrication can cause tape slippage, which in turn can lead to media being misread and the loss of data. [4] :18

Water exposure will damage tapes over time. This often occurs in a disaster situation. If the media is in salty or dirty water, it should be rinsed in fresh water. The process of cleaning, rinsing, and drying wet tapes should be done at room temperature in order to prevent heat damage. Older tapes should be recovered prior to newer tapes, as they are more susceptible to water damage. [4] :18

The next step (after investigating the data lineage) is to establish what counts as good data and bad data, to ensure that only the 'good' data are migrated to the new data warehouse or repository. A good example of bad data, in the technical sense, is test data.
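The separation of good and bad data before migration can be sketched as a simple filter. The rule used here for spotting test data (an explicit `is_test` flag or a name beginning with "test") is a hypothetical assumption; real criteria depend on the dataset and its lineage.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def is_test_record(record: Record) -> bool:
    # ASSUMPTION: test data is either flagged explicitly or named "test...".
    flagged = record.get("is_test") == "yes"
    named = record.get("name", "").lower().startswith("test")
    return flagged or named


def split_for_migration(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (good, bad): only the first list should be migrated."""
    good: List[Record] = []
    bad: List[Record] = []
    for record in records:
        (bad if is_test_record(record) else good).append(record)
    return good, bad


# Hypothetical sample records.
sample = [
    {"name": "Station A", "is_test": "no"},
    {"name": "test entry 1", "is_test": "no"},
    {"name": "Station B", "is_test": "yes"},
]
good, bad = split_for_migration(sample)
print(len(good), "records to migrate;", len(bad), "held back")
```

Keeping the rejected records in a separate list, rather than discarding them, leaves a trace of what was excluded and why.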

Prevention

To avoid the need for data archaeology, creators and holders of digital documents should take care to employ digital preservation.

Storing data on an offshore server is a good preventive measure against data loss

Another effective preventive measure is the use of offshore backup facilities that would not be affected should a disaster occur. From these backup servers, copies of the lost data can be retrieved. A multi-site and multi-technique data distribution plan is advised for optimal data recovery, especially when dealing with big data. The TCP/IP method, snapshot recovery, mirror sites and tapes safeguarding data in a private cloud are also good preventive methods, as is transferring data daily from mirror sites to emergency servers. [6]
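The idea of distributing copies across several sites and confirming that each copy is intact can be sketched as follows. This is an illustration only: local directories stand in for remote sites, and the function names and file names are hypothetical rather than taken from the cited work.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, site_dirs: List[Path]) -> Dict[str, bool]:
    """Copy source to every site and report whether each copy verifies."""
    expected = sha256_of(source)
    results: Dict[str, bool] = {}
    for site in site_dirs:
        site.mkdir(parents=True, exist_ok=True)
        copy = site / source.name
        shutil.copy2(source, copy)
        results[str(site)] = sha256_of(copy) == expected
    return results


# ASSUMPTION: temporary directories play the role of separate backup sites.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "records.dat"
    original.write_bytes(b"example payload")
    results = replicate(original, [root / "site_a", root / "site_b"])
    all_verified = all(results.values())

print("all copies verified:", all_verified)
```

Verifying a checksum after each copy means a damaged replica is noticed when it is made, not when it is needed for recovery.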

References

  1. Kitchin, Rob (2022). The Data Revolution. Sage.
  2. "Techno-archaeology rescues climate data from early satellites". U.S. National Snow and Ice Data Center (NSIDC), January 2010. Archived 2010-11-26 at the Wayback Machine.
  3. "LOIRP Overview". NASA website, November 14, 2008. Archived.
  4. Study on website October 23, 2011
  5. Bates, Jo (2016). "Data Journeys: Capturing the socio-material constitution of data objects and flows". Big Data and Society. 3 (2): 1–12. doi:10.1177/2053951716654502. S2CID 54719310.
  6. Chang, V (2015). "Towards a Big Data system disaster recovery in a Private Cloud" (PDF). Ad Hoc Networks. 35: 65–82. doi:10.1016/j.adhoc.2015.07.012. S2CID 18230189 via Elsevier.
