This page is a timeline of digital preservation and Web archiving. It covers various aspects of saving and preserving digital data, whether they are born-digital or not.
Digital preservation encompasses a variety of efforts and technologies, so its history can be viewed through various trends in these separate efforts:
|Year||Month and date||Topic||Details|
|1972|| ||Versioning||Marc Rochkind develops the Source Code Control System at Bell Labs.|
|1982||October||Physical storage||The compact disc (CD) as well as the CD player first become commercially available in Japan.|
|1987||June||Physical storage||The term "RAID" is invented by David Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley in 1987. In their June 1988 paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)", presented at the SIGMOD conference, they would argue that the top performing mainframe disk drives of the time could be beaten on performance by an array of the inexpensive drives that had been developed for the growing personal computer market. Although failures would rise in proportion to the number of drives, by configuring for redundancy, the reliability of an array could far exceed that of any large single drive.|
|1989||November 13||Versioning||Continuous data protection, the technique of backing up computer data by automatically saving a copy of every change made to that data, is patented by British entrepreneur Pete Malcolm.|
|1990|| || ||Possibly the earliest reference to the term "digital preservation" (to mean converting analog media to digital and preserving in digital form) is from this year.|
|1996||January||Web archiving||The initial version of the command-line downloading program Wget, then known as Geturl, is released.|
|1996|| ||Web archiving||The Internet Archive is founded by Brewster Kahle.|
|1996||April||Web archiving||Alexa Internet is founded by Brewster Kahle. Since this year, Alexa Internet has donated its crawl data to the Internet Archive.|
|1996|| || ||Preserving Digital Information: Report of the Task Force on Archiving of Digital Information (Donald Waters, John Garrett, eds.) is published. It became a fundamental document in the field of digital preservation that helped set out key concepts, requirements, and challenges.|
|1997||April 8||Web archiving||cURL, a computer software project providing a library and command-line tool for transferring data using various protocols, releases its initial version of the tool. It is known at this point as HttpGet, would briefly rename itself to urlget, and would finally rename itself to cURL in March 1998. cURL can be used to download files over a network.|
|1998||May||Web archiving||The first version of HTTrack, a free and open source Web crawler and offline browser, is released.|
|2000|| || ||The National Digital Information Infrastructure and Preservation Program (NDIIPP) launches.|
|2001||October||Web archiving||The Wayback Machine is launched.|
|2001||October 14|| ||Version 1.0 of the Parity Volume Set specification, used in Par1, is published.|
|2002||January||Web archiving||TinyURL, the first notable URL shortening service, is launched.|
|2003||July|| ||The International Internet Preservation Consortium is founded.|
|2005|| ||Cloud storage||Box is launched as Box.net.|
|2005||April 7||Versioning||The initial version of Git, a version control system with support for data integrity, is released.|
|2005||April 29||Web archiving||Safari version 2.0 introduces the ability to save complete websites using the proprietary WebArchive format (details at Safari version history).|
|2005||August 1||Physical storage||The article "Kryder's Law" is published. The law observes that magnetic disk areal storage density has been increasing very quickly.|
|2005||August||Versioning||Writely, a web-based word processor created by the software company Upstartle, launches. By January 2006, Writely would have support for revision history. Upstartle would later be acquired by Google and Writely would be integrated into Google Docs.|
|2005||October 31||File system||The first implementation of ZFS, a file system that includes protection against data corruption, is integrated into Solaris.|
|2006||March 14||Cloud storage||Amazon Web Services launches by releasing the Simple Storage Service (S3), intended for storing individual files (called objects) in a highly redundant and available fashion. S3 is designed for at least 99.999999999% durability (i.e., that percentage of objects is expected to survive after a year) and 99.99% availability (i.e., that percentage of objects is accessible at any given time). The cost of S3 storage dropped over the next decade, reaching 2.3 cents a GB effective December 1, 2016. S3 has been widely used by corporations, libraries, and governments to digitize data for long-term storage.|
|2007||January 30||Versioning||Microsoft Office 2007 is released. Word 2007 introduces the ability to track changes in documents.|
|2007||June||Cloud storage||Dropbox is founded by MIT students Drew Houston and Arash Ferdowsi, as a startup company from the American seed accelerator Y Combinator.|
|2007||September 21||Physical storage||The initial version of Paperkey is released. Paperkey is a free software implementation of a paper key. It extracts the essential secret bytes from an OpenPGP private key, which can then be printed to paper.|
|2007||October 26||Versioning||Apple releases the initial version of Time Machine.|
|2007|| ||Physical storage||Two programs for densely storing information on paper are released: PaperBack and Twibright Labs' Optar.|
|2007|| ||Federal Agencies Digital Guidelines Initiative (FADGI)||FADGI is a collaborative effort of 20 federal agencies to articulate common sustainable practices and guidelines for digitized and born-digital historical, archival and cultural content. Two working groups study issues specific to two major areas, Still Image and Audio-Visual.|
|2008|| ||Web archiving||The URL shortening service Bitly is launched.|
|2008||April 10||Versioning||GitHub, a web-based Git repository hosting service, is launched. GitHub would popularize version control and Git. GitHub would also play an important role in encouraging people to make their source code freely available for posterity, allowing others to fork the code and acting as a de facto archive. In addition to software projects, GitHub would also be used to host code repositories for scientific research as well as for hosting and backing up websites and content.|
|2008||November 20||Digitizing||The prototype for Europeana launches.|
|2009||January 6||Web archiving||The Archive Team begins operating. Its first big effort, for which it receives press coverage, is to download GeoCities data before the service shuts down.|
|2009|| ||Web archiving||SocialSafe Ltd, the company responsible for developing SocialSafe, is founded.|
|2009||March 23||File system||The initial version of Btrfs, a file system that supports checksums, incremental backups, and the ability to repair errors, is released as part of the Linux kernel version 2.6.29.|
|2009||May 15||Web archiving||The WARC file format is published as the standard ISO 28500:2009 1st edition.|
|2009||October 26||Web archiving||Yahoo! GeoCities, a web hosting service founded in 1994, closes its United States branch. Various attempts at archiving GeoCities are made. The site would continue to be available only in Japan.|
|2010||April 14||Web archiving||Twitter announces that it will donate its archive of public Tweets to the Library of Congress.|
|2010||December 1||Web archiving||The Memento Project provides a standard for interoperability between web archives and the live web. Memento wins the Digital Preservation Award 2010 because "Memento offers an elegant and easily deployed method that reunites web archives with their home on the live web. It opens web archives to tens of millions of new users and signals a dramatic change in the way we use and perceive digital archives."|
|2011||June 28||Web archiving||Google Takeout is launched by the Google Data Liberation Front.|
|2012||August 1||File system||Microsoft introduces ReFS. ReFS has a number of features related to digital preservation including integrity checking and data scrubbing, protection against data degradation, built-in handling of hard disk drive failure and redundancy, and integration of the RAID functionality.|
|2012||August 21||Cloud storage||Amazon Web Services launches Amazon Glacier, an addition to its S3 offerings with lower storage costs than S3 (initially 1 cent per GB). Glacier is intended for long-term archival in cases where retrieval is rare; therefore retrieval is costly and slow. Glacier offers the same durability as the standard S3 offering. In December 2016, the price of Glacier is reduced to 0.4 cents per GB. Glacier has been used by governments, corporations, and libraries for low-cost long-term archival. It has also been recommended for use for personal backups when frequent access is not needed.|
|2013||April 6||Web archiving||In the United Kingdom, the Legal Deposit Libraries (Non-Print Works) Regulations come into force, bringing digital and online material under the scope of the UK's legal deposit. Previously, the Legal Deposit Libraries Act 2003 had given the Secretary of State the powers to make regulations governing the deposit of non-print publications, but such regulations were never made at that time.|
|2013||April 18||Digitizing||The Digital Public Library of America launches.|
|2013||July 1||Web archiving||Google Reader, an RSS/Atom feed aggregator operated by Google, shuts down after having launched in 2005. The shutdown prompts an effort to archive the feed data from the service.|
|2013||December||Web archiving||The Memento Project is published as a standard in RFC 7089.|
|2017||August||Web archiving||The WARC file format is published as the standard ISO 28500:2017 2nd edition.|
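The redundancy argument in the 1987 RAID entry can be illustrated with XOR parity, the technique used by parity-based RAID levels: one extra block lets an array rebuild any single lost block. The following is a minimal sketch, not an implementation of any particular RAID level; the function names and the three sample "disks" are illustrative assumptions.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column, one byte from each.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # Hypothetical contents of three data disks (same block size).
    data = [b"disk-one", b"disk-two", b"disk-3!!"]
    parity = xor_blocks(data)
    # Pretend the second disk failed and rebuild it from the rest.
    print(rebuild_missing([data[0], data[2]], parity))
```

Because XOR is its own inverse, combining the parity block with the surviving blocks cancels them out and leaves the missing one. A single parity block protects against one failure only; tolerating more requires additional redundancy.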
The system utility fsck is a tool for checking the consistency of a file system in Unix and Unix-like operating systems, such as Linux, macOS, and FreeBSD. The equivalent programs on MS-DOS and Microsoft Windows are CHKDSK, SFC, and SCANDISK.
GNU Wget is a computer program that retrieves content from web servers. It is part of the GNU Project. Its name derives from "World Wide Web" and "get". It supports downloading via HTTP, HTTPS, and FTP.
cURL is a computer software project providing a library (libcurl) and command-line tool (curl) for transferring data using various network protocols. The name stands for "Client for URL".
Filesystem in Userspace (FUSE) is a software interface for Unix and Unix-like computer operating systems that lets non-privileged users create their own file systems without editing kernel code. This is achieved by running file system code in user space while the FUSE module provides only a bridge to the actual kernel interfaces.
Extended file attributes are file system features that enable users to associate computer files with metadata not interpreted by the filesystem, whereas regular attributes have a purpose strictly defined by the filesystem. Unlike forks, which can usually be as large as the maximum file size, extended attributes are usually limited in size to a value significantly smaller than the maximum file size. Typical uses include storing the author of a document, the character encoding of a plain-text document, or a checksum, cryptographic hash or digital certificate, and discretionary access control information.
Amazon S3 or Amazon Simple Storage Service is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Amazon S3 uses the same scalable storage infrastructure that Amazon.com uses to run its e-commerce network. Amazon S3 can store any type of object, which allows uses like storage for Internet applications, backups, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage. AWS launched Amazon S3 in the United States on March 14, 2006, then in Europe in November 2007.
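To make the S3 design targets quoted in the timeline (99.999999999% durability, 99.99% availability) more concrete, the percentages can be converted into expected counts. This is a rough sketch that follows the timeline's reading of those figures; the function name and the collection size of ten million objects are illustrative assumptions, and the results are averages implied by a design target, not guarantees.

```python
from fractions import Fraction


def expected_shortfall(object_count, percent):
    """Expected number of objects falling outside the given percentage.

    `percent` is passed as a string so that Fraction parses it exactly,
    avoiding floating-point rounding on values such as 99.999999999.
    """
    return object_count * (1 - Fraction(percent) / 100)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical collection size

    lost_per_year = expected_shortfall(objects, "99.999999999")
    print("expected objects lost per year:", lost_per_year)
    print("average years per lost object:", 1 / lost_per_year)

    unavailable = expected_shortfall(objects, "99.99")
    print("expected objects inaccessible at a given time:", unavailable)
```

Under these assumptions, a collection of ten million objects would lose one object about every 10,000 years on average, while roughly 1,000 objects could be inaccessible at any given moment.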
Amazon Elastic Compute Cloud (EC2) is a part of Amazon.com's cloud-computing platform, Amazon Web Services (AWS), that allows users to rent virtual computers on which to run their own computer applications. EC2 encourages scalable deployment of applications by providing a web service through which a user can boot an Amazon Machine Image (AMI) to configure a virtual machine, which Amazon calls an "instance", containing any software desired. A user can create, launch, and terminate server-instances as needed, paying by the second for active servers – hence the term "elastic". EC2 provides users with control over the geographical location of instances that allows for latency optimization and high levels of redundancy. In November 2010, Amazon switched its own retail website platform to EC2 and AWS.
Btrfs is a computer storage format that combines a file system based on the copy-on-write (COW) principle with a logical volume manager, developed together. It was founded by Chris Mason in 2007 for use in Linux, and since November 2013, the file system's on-disk format has been declared stable in the Linux kernel.
Ceph is a free and open-source software-defined storage platform that provides object storage, block storage, and file storage built on a common distributed cluster foundation. Ceph provides completely distributed operation without a single point of failure and scalability to the exabyte level, and is freely available. Since version 12 (Luminous), Ceph does not rely on any other, conventional filesystem and directly manages HDDs and SSDs with its own storage backend BlueStore and can expose a POSIX filesystem.
Dropbox is a file hosting service operated by the American company Dropbox, Inc., headquartered in San Francisco, California, U.S. that offers cloud storage, file synchronization, personal cloud, and client software. Dropbox was founded in 2007 by MIT students Drew Houston and Arash Ferdowsi as a startup company, with initial funding from seed accelerator Y Combinator.
This is a comparison of file hosting services that are currently active. File hosting services are a particular kind of online file storage; however, various products that are designed for online file storage may not have features or characteristics that others designed for sharing files have.
ExpanDrive is a network filesystem client for macOS, Microsoft Windows and Linux that facilitates mapping a local volume to many different types of cloud storage. When a server is mounted with ExpanDrive, any program can read, write, and manage remote files as if they were stored locally. This is different from most file transfer clients because it is integrated into all applications on the operating system. It also does not require a file to be downloaded to access portions of the content. ExpanDrive is commercial software, at a cost of $49.95 per license. A 7-day, unrestricted demo is available for evaluation.
An Amazon Machine Image (AMI) is a special type of virtual appliance that is used to create a virtual machine within the Amazon Elastic Compute Cloud ("EC2"). It serves as the basic unit of deployment for services delivered using EC2.
The most widespread standard for configuring multiple hard disk drives is RAID, which comes in a number of standard configurations and non-standard configurations. Non-RAID drive architectures also exist, and are referred to by acronyms with tongue-in-cheek similarity to RAID:
BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" and "tags," which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method, sometimes referred to as "bag it and tag it."
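The bag layout described above (a payload plus a manifest of checksums) can be sketched in a few lines. This is a simplified illustration rather than a complete BagIt implementation: it writes only the `data/` payload directory, the `bagit.txt` declaration, and a `manifest-sha256.txt` file, and it ignores optional tag files and path encoding rules. The function names `make_bag` and `verify_bag` and the sample file names are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write payload files under data/ and a SHA-256 manifest beside them.

    `payload` maps relative file names (using "/") to their bytes.
    """
    bag = Path(bag_dir)
    data_dir = bag / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest entries pair a checksum with a path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration: the required tag file identifying this directory as a bag.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return the payload paths whose checksum no longer matches the manifest."""
    bag = Path(bag_dir)
    mismatches = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            mismatches.append(rel_path)
    return mismatches
```

Calling `make_bag("my-bag", {"report.txt": b"..."})` and later `verify_bag("my-bag")` returns an empty list while the payload is intact, and the affected paths once a file has been altered. Storing the checksums alongside the data is what lets a receiver confirm that a transfer or a long period in storage has not corrupted the content.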
Amazon S3 Glacier is an online file storage web service that provides storage for data archiving and backup.
This is a timeline of Amazon Web Services, which offers a suite of cloud computing services that make up an on-demand computing platform.
Zstandard, commonly known by the name of its reference implementation zstd, is a lossless data compression algorithm developed by Yann Collet at Facebook. Zstd is the reference implementation in C. Version 1 of this implementation was released as open-source software on 31 August 2016.
Rclone is an open-source, multi-threaded, command-line computer program to manage or migrate content on cloud and other high-latency storage. Its capabilities include sync, transfer, crypt, cache, union, compress and mount. The rclone website lists supported backends including S3 and Google Drive.
CBS released the world's first commercially available CD, a reissue of Billy Joel's 52nd Street, in Japan in October 1982. Philips missed the production deadline so the international release was put back to March 1983.
On October 1, 1982, Sony ignited a digital audio revolution with the release of the world's first commercial compact disc player, the CDP-101 (above), in Japan.
Patterson recalled the beginnings of his RAID project in 1987. […] 1988: David A. Patterson leads a team that defines RAID standards for improved performance, reliability and scalability.
Filing date Nov 13, 1989
The earliest reference that I could find in English to the "digital preservation" of data occurs in the context of the research that Anne Kenney and Lynne Personnius undertook in 1990 at the Cornell University Library in conjunction with the Xerox Corporation.
Wget 1.4.0 [formerly known as Geturl] is an extensive rewrite of Geturl. This NEWS file is included in source distributions of Wget.
Daniel simply adopted an existing command-line open-source tool, httpget, that Brazilian Rafael Sagula had written and recently released version 0.1 of. After a few minor adjustments, it did just what he needed. […] HttpGet 1.0 was released on April 8th 1997 with brand new HTTP proxy support.
The first release was in May 1998, but only as binaries.
2000 - NDIIP legislation is passed
So the 24-year-old Web developer from Blaine, Minnesota, launched TinyURL.com in January 2002, a free site that converts huge strings of characters into more manageable snippets.
Twitter is a rare case: it has arranged to archive all of its tweets at the Library of Congress. […] The U.K. has what's known as a legal-deposit law; it requires copies of everything published in Britain to be deposited in the British Library. In 2013, that law was revised to include everything published on the U.K. Web.
Development for Box, then Box.net, started at the end of 2004, but really got off the ground and went online in 2005 during their sophomore years of college.
Box – which now competes with Redmond's very own Microsoft SharePoint – had been started in early '05 from college dorm rooms in California and North Carolina.
In older versions of Safari, "saving" a Web page saved only its HTML source code; images and other embedded content were lost. Fortunately, Apple fixed this in Safari 2.0: the Save As command includes a Web Archive option, which saves nearly everything on the page, including images.
Writely saves all the revisions each time you edit, so that you can go back and see what has been edited at each revision.
And today, 10/31/2005, we integrated into Solaris.
Dropbox was founded by Drew Houston and Arash Ferdowsi in 2007, and received seed funding from Y Combinator.
Paperkey extracts just those secret bytes and prints them. From the NEWS file of the http://www.jabberwocky.com/software/paperkey/paperkey-1.4.tar.gz source: "Noteworthy changes in version 0.5 (2007-09-21) […] Initial release."
Introduced in 2008, Bitly has grown rapidly because, along with shortening URLs for character-limited social media like Twitter, it helps users monitor how others subsequently share the links that they share.
2008 Europeana's prototype is launched on November 20th by Viviane Reding, European Commissioner for Information Society and Media, and the President of the Commission, José Manuel Barroso.
Linux 2.6.29 kernel released on 23 March 2009. […] Btrfs is a new filesystem developed from scratch following the design principles of filesystems like ZFS, WAFL, etc.
It is our pleasure to donate access to the entire archive of public Tweets to the Library of Congress for preservation and research. […] there are some specifics regarding this arrangement. Only after a six-month delay can the Tweets be used for internal library use, for non-commercial research, public display by the library itself, and preservation.
As of 6 April 2013, legal deposit also covers material published digitally and online, so that the Legal Deposit Libraries can provide a national archive of the UK's non-print published material, such as websites, blogs, e-journals and CD-ROMs.