Parchive

Filename extension: .par, .par2, .p??, (.par3 future)
Type of format: Erasure code, archive file

Parchive (a portmanteau of parity archive, and formally known as Parity Volume Set Specification [1] [2] ) is an erasure code system that produces par files for checksum verification of data integrity, with the capability to perform data recovery operations that can repair or regenerate corrupted or missing data.

Parchive was originally written to solve the problem of reliable file sharing on Usenet, [3] but it can be used for protecting any kind of data from data corruption, disc rot, bit rot, and accidental or malicious damage. Despite the name, Parchive uses more advanced techniques (specifically error correction codes) than simplistic parity methods of error detection.

As of 2014, PAR1 is obsolete, PAR2 is mature for widespread use, and PAR3 is a discontinued experimental version developed by MultiPar author Yutaka Sawada. [4] [5] [6] [7] The original SourceForge Parchive project has been inactive since April 30, 2015. [8] PAR2 specification author Michael Nahas has been working on a new PAR3 specification since April 28, 2019. An alpha version of the PAR3 specification was published on January 29, 2022, [9] while the program itself is still being developed.

History

Parchive was intended to increase the reliability of transferring files via Usenet newsgroups. Usenet was originally designed for informal conversations, and the underlying protocol, NNTP, was not designed to transmit arbitrary binary data. Another limitation, which was acceptable for conversations but not for files, was that messages were normally fairly short and limited to 7-bit ASCII text. [10]

Various techniques were devised to send files over Usenet, such as uuencoding and Base64. Later Usenet software allowed 8-bit Extended ASCII, which permitted new techniques like yEnc. Large files were broken up to reduce the effect of a corrupted download, but the unreliable nature of Usenet remained.

With the introduction of Parchive, parity files could be created that were then uploaded along with the original data files. If any of the data files were damaged or lost while being propagated between Usenet servers, users could download parity files and use them to reconstruct the damaged or missing files. Parchive included the construction of small index files (*.par in version 1 and *.par2 in version 2) that do not contain any recovery data. These indexes contain file hashes that can be used to quickly identify the target files and verify their integrity.

Because the index files were so small, they minimized the amount of extra data that had to be downloaded from Usenet to verify that the data files were all present and undamaged, or to determine how many parity volumes were required to repair any damage or reconstruct any missing files. They were most useful in version 1 where the parity volumes were much larger than the short index files. These larger parity volumes contain the actual recovery data along with a duplicate copy of the information in the index files (which allows them to be used on their own to verify the integrity of the data files if there is no small index file available).
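The verification step that an index file makes possible can be sketched as follows. This is a minimal illustration and not the Parchive file format: the index is modelled as a plain dictionary, MD5 is used only as an example digest, and the names build_index and find_damaged are hypothetical.

```python
import hashlib


def build_index(files):
    """Map each file name to a digest of its contents (stand-in for an index file)."""
    return {name: hashlib.md5(data).hexdigest() for name, data in files.items()}


def find_damaged(index, downloaded):
    """Return the names of files that are missing or whose digest does not match."""
    damaged = []
    for name, expected in index.items():
        data = downloaded.get(name)
        if data is None or hashlib.md5(data).hexdigest() != expected:
            damaged.append(name)
    return sorted(damaged)


original = {"part1.rar": b"alpha", "part2.rar": b"bravo", "part3.rar": b"charlie"}
index = build_index(original)

# Simulate a download where one file is corrupted and one never arrived.
downloaded = {"part1.rar": b"alpha", "part2.rar": b"brXvo"}
damaged = find_damaged(index, downloaded)
print(damaged)
```

In version 1, where repair works on whole files, the length of this list corresponds to the number of parity volumes that would have to be downloaded.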

In July 2001, Tobias Rieper and Stefan Wehlus proposed the Parity Volume Set specification, and with the assistance of other project members, version 1.0 of the specification was published in October 2001. [11] Par1 used Reed–Solomon error correction to create new recovery files. Any of the recovery files can be used to rebuild a missing file from an incomplete download.

Version 1 became widely used on Usenet, but it did suffer some limitations:

In January 2002, Howard Fukada proposed a new Par2 specification with two significant changes: data verification and repair should work on blocks of data rather than whole files, and the algorithm should switch to 16-bit numbers rather than the 8-bit numbers that PAR1 used. Michael Nahas and Peter Clements took up these ideas in July 2002, with additional input from Paul Nettle and Ryan Gallagher (who both wrote Par1 clients). Version 2.0 of the Parchive specification was published by Michael Nahas in September 2002. [14]

Peter Clements then went on to write the first two Par2 implementations, QuickPar and par2cmdline. par2cmdline has been abandoned since 2004, and Paul Houle created phpar2 to supersede it. Yutaka Sawada created MultiPar to supersede QuickPar. MultiPar uses par2j.exe (which is partially based on par2cmdline's optimization techniques) as its backend engine.

Versions

Versions 1 and 2 of the file format are incompatible. (However, many clients support both.)

Par1

Given the files f1, f2, ..., fn, a Par1 Parchive consists of an index file (f.par), which is a CRC-type file with no recovery blocks, and a number of "parity volumes" (f.p01, f.p02, etc.). If one original file is missing (for example, f2), it can be recreated from all of the other original files and any one of the parity volumes. Likewise, two missing files can be recreated from any two of the parity volumes, and so forth. [15]
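The one-missing-file case can be illustrated with plain XOR parity. This is a deliberate simplification: Par1 uses Reed–Solomon coding, which is what allows several missing files to be rebuilt from several parity volumes, whereas XOR can only ever repair a single loss. The function names are hypothetical, and the sketch assumes the length of the missing file is known from elsewhere (for example, from an index).

```python
def xor_parity(files):
    """Build one parity volume as the byte-wise XOR of all files (zero-padded)."""
    size = max(len(f) for f in files)
    parity = bytearray(size)
    for data in files:
        for i, byte in enumerate(data):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving, parity, missing_length):
    """Rebuild the single missing file from the survivors and the parity volume."""
    rebuilt = bytearray(parity)
    for data in surviving:
        for i, byte in enumerate(data):
            rebuilt[i] ^= byte
    # Drop the zero padding that was added when the parity was computed.
    return bytes(rebuilt[:missing_length])


f1, f2, f3 = b"first file", b"second", b"third one"
p01 = xor_parity([f1, f2, f3])

# Pretend f2 was lost: rebuild it from f1, f3 and the parity volume.
restored = recover_missing([f1, f3], p01, len(f2))
print(restored == f2)
```

Because XOR-ing a value twice cancels it out, removing the surviving files from the parity leaves exactly the missing one.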

Par1 supports up to a total of 256 source and recovery files.

Par2

Par2 files generally use this naming/extension system: filename.vol000+01.PAR2, filename.vol001+02.PAR2, filename.vol003+04.PAR2, filename.vol007+06.PAR2, etc. The number after "vol" is the number of the first recovery block within the PAR2 file, and the number after the "+" is how many recovery blocks the file contains. If the index file of a download states that 4 blocks are missing, the easiest way to repair the files is to download filename.vol003+04.PAR2. However, because recovery blocks are interchangeable, filename.vol007+06.PAR2 is also acceptable. There is also an index file, filename.PAR2, which is identical in function to the small index file used in PAR1.
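The naming scheme can be decoded mechanically. The following sketch parses the "vol" and "+" numbers and picks the smallest single volume that covers a given amount of damage; the helper names are hypothetical, and combining several smaller volumes (which would also work) is not handled.

```python
import re

# Pattern for names such as "filename.vol003+04.PAR2" (extension matched case-insensitively).
VOLUME_NAME = re.compile(
    r"^(?P<base>.+)\.vol(?P<first>\d+)\+(?P<count>\d+)\.par2$", re.IGNORECASE
)


def parse_volume(name):
    """Return (first_block, block_count) for a recovery volume, or None for other files."""
    match = VOLUME_NAME.match(name)
    if match is None:
        return None
    return int(match.group("first")), int(match.group("count"))


def smallest_sufficient(names, missing_blocks):
    """Pick the volume with the fewest blocks that still covers the damage, if any."""
    candidates = []
    for name in names:
        parsed = parse_volume(name)
        if parsed is not None and parsed[1] >= missing_blocks:
            candidates.append((parsed[1], name))
    return min(candidates)[1] if candidates else None


names = [
    "filename.PAR2",
    "filename.vol000+01.PAR2",
    "filename.vol001+02.PAR2",
    "filename.vol003+04.PAR2",
    "filename.vol007+06.PAR2",
]
print(parse_volume("filename.vol003+04.PAR2"))
print(smallest_sufficient(names, 4))
```

The index file filename.PAR2 does not match the pattern, so it is never offered as a recovery volume.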

The Par2 specification supports up to 32,768 source blocks and up to 65,535 recovery blocks. Input files are split into multiple equal-sized blocks so that recovery files do not need to be the size of the largest input file.
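The relationship between block size and the source block limit can be checked with a little arithmetic. The sketch below assumes that each input file is split on its own, so the last block of every file may be only partly filled; the file sizes and function names are made up for illustration.

```python
MAX_SOURCE_BLOCKS = 32768  # source block limit quoted above


def source_blocks(file_sizes, block_size):
    """Count the blocks needed when each file is split into block_size pieces."""
    # Ceiling division per file: a partly filled final block still counts as one.
    return sum((size + block_size - 1) // block_size for size in file_sizes)


def fits_limit(file_sizes, block_size):
    """Report whether the chosen block size stays within the source block limit."""
    return source_blocks(file_sizes, block_size) <= MAX_SOURCE_BLOCKS


sizes = [15_000_000, 15_000_000, 7_340_032]  # hypothetical input files, in bytes

for block_size in (1024, 4096):
    print(block_size, source_blocks(sizes, block_size), fits_limit(sizes, block_size))
```

A smaller block size gives finer-grained repair but uses up the block budget faster, so larger data sets need larger blocks.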

Although Unicode is mentioned in the PAR2 specification as an option, most PAR2 implementations do not support Unicode.

Directory support is included in the PAR2 specification, but most or all implementations do not support it.

Par3

The Par3 specification was originally planned to be published as an enhancement over the Par2 specification. However, to date it has remained closed source under specification owner Yutaka Sawada.

A discussion on a new format started in the GitHub issue section of the maintained fork of par2cmdline on January 29, 2019. The discussion led to a new format, which is also named Par3. The new Par3 format's specification is published on GitHub, but remained an alpha draft as of January 28, 2022. The specification is written by Michael Nahas, the author of the Par2 specification, with help from Yutaka Sawada, animetosho and malaire.

The new format claims to have multiple advantages over the Par2 format, including:

Software

Multi-Platform

Windows

Mac OS X

POSIX

Software for POSIX conforming operating systems:

References

  1. Re: Correction to Parchive on Wikipedia, reply #3, by Yutaka Sawada: "Their formal title are 'Parity Volume Set Specification 1.0' and 'Parity Volume Set Specification 2.0.'" Archived 2014-10-14 at the Wayback Machine.
  2. Re: Correction to Parchive on Wikipedia, reply #3, by Yutaka Sawada: "Their formal title are 'Parity Volume Set Specification 1.0' and 'Parity Volume Set Specification 2.0.'"
  3. "Parchive: Parity Archive Volume Set". Retrieved 2009-10-29. The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet.
  4. "possibility of new PAR3 file". Archived from the original on 2012-07-07. Retrieved 2012-07-01.
  5. "Question about your usage of PAR3". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
  6. "Risk of undetectable intended modification". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
  7. "PAR3 specification proposal not finished as of April 2011". Archived from the original on 2014-03-09. Retrieved 2012-07-01.
  8. "Parchive: Parity Archive Tool". Retrieved 2020-05-20.
  9. "Parity Volume Set Specification 3.0 [2022-01-28 ALPHA DRAFT]". Michael Nahas, Yutaka-Sawada, animetosho, and malaire.
  10. Kantor, Brian; Lapsley, Phil (February 1986). "Character Codes". Network News Transfer Protocol. IETF. p. 5. sec. 2.2. doi:10.17487/RFC0977. RFC 977. Retrieved 2009-10-29.
  11. Nahas, Michael (2001-10-14). "Parity Volume Set Specification v1.0". Retrieved 2017-06-19.
  12. Plank, James S.; Ding, Ying (April 2003). "Note: Correction to the 1997 Tutorial on Reed-Solomon Coding". Retrieved 2009-10-29.
  13. Plank, James S. (September 1997). "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems". Retrieved 2009-10-29.
  14. Nahas, Michael; Clements, Peter; Nettle, Paul; Gallagher, Ryan (2003-05-11). "Parity Volume Set Specification 2.0". Retrieved 2009-10-29.
  15. Wang, Wallace (2004-10-25). "Finding movies (or TV shows): Recovering missing RAR files with PAR and PAR2 files". Steal this File Sharing Book (1st ed.). San Francisco, California: No Starch Press. pp. 164–167. ISBN 978-1-59327-050-6. Retrieved 2009-09-24.
  16. "MultiPar works with PCBSD 9.0". Archived from the original on 2013-09-28. Retrieved 2012-02-27.
  17. Working on Ubuntu 18.04 via wine [dead link]
  18. "contacted you, asking about sourcecode". Archived from the original on 2013-09-26. Retrieved 2013-09-21.