File verification

File verification is the process of using an algorithm to verify the integrity of a computer file, usually by means of a checksum. It can be done by comparing two files bit by bit, but this requires two copies of the same file and may miss systematic corruption that affects both copies. A more popular approach is to generate a hash of the copied file and compare it to the hash of the original file.
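For illustration, a minimal Python sketch of the bit-by-bit approach, streaming both copies in fixed-size chunks so that large files need not fit in memory (the function name and file paths are hypothetical):

    def files_identical(path_a, path_b, chunk_size=1 << 16):
        """Compare two files byte by byte, streaming in chunks."""
        with open(path_a, "rb") as a, open(path_b, "rb") as b:
            while True:
                chunk_a = a.read(chunk_size)
                chunk_b = b.read(chunk_size)
                if chunk_a != chunk_b:
                    return False   # content differs, or one file is shorter
                if not chunk_a:    # both files exhausted at the same point
                    return True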

Integrity verification

File integrity can be compromised, which is usually referred to as the file becoming corrupted. A file can become corrupted in a variety of ways: faulty storage media, errors in transmission, write errors during copying or moving, software bugs, and so on.

Hash-based verification ensures that a file has not been corrupted by comparing the file's hash value to a previously calculated value. If these values match, the file is presumed to be unmodified. Due to the nature of hash functions, a hash collision could let a corrupted file pass verification, but the likelihood of a collision arising from random corruption is negligible.
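For example, a sketch of hash-based verification using Python's standard hashlib module; the expected digest would come from a trusted record made while the file was known to be good (the file name and digest below are hypothetical):

    import hashlib

    def file_sha256(path, chunk_size=1 << 16):
        """Return the hex SHA-256 digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Hypothetical digest recorded when the file was known to be good.
    expected = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
    if file_sha256("backup.img") == expected:
        print("file is presumed unmodified")
    else:
        print("file is corrupted or was altered")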

Authenticity verification

It is often desirable to verify that a file has not been modified in transmission or storage by untrusted parties, for example, to include malicious code such as viruses or backdoors. To verify authenticity, a classical hash function is not enough, as such functions are not designed to be collision resistant; it is computationally trivial for an attacker to cause deliberate hash collisions, meaning that a malicious change in the file is not detected by a hash comparison. In cryptography, substituting a file that matches a given hash value is known as a second-preimage attack.

For this purpose, cryptographic hash functions are often employed. As long as the hash sums cannot be tampered with (for example, if they are communicated over a secure channel), the files can be presumed to be intact. Alternatively, digital signatures can be employed to assure tamper resistance.
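As a sketch of the signature approach, here using the third-party Python package cryptography with an Ed25519 key generated on the spot purely for illustration (in practice the publisher's key pre-exists and only the public half is distributed; the file name is hypothetical):

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    data = open("release.tar.gz", "rb").read()   # hypothetical file

    # The publisher signs the file with the private key...
    private_key = Ed25519PrivateKey.generate()   # generated here for illustration
    signature = private_key.sign(data)

    # ...and recipients verify it with the matching public key.
    public_key = private_key.public_key()
    try:
        public_key.verify(signature, data)
        print("signature valid: file is authentic and intact")
    except InvalidSignature:
        print("verification failed: file or signature was altered")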

File formats

A checksum file is a small file that contains the checksums of other files.

There are a few well-known checksum file formats. [1]

Several utilities, such as md5deep, can use such checksum files to automatically verify an entire directory of files in one operation.

The particular hash algorithm used is often indicated by the file extension of the checksum file.

The ".sha1" file extension indicates a checksum file containing 160-bit SHA-1 hashes in sha1sum format.

The ".md5" file extension, or a file named "MD5SUMS", indicates a checksum file containing 128-bit MD5 hashes in md5sum format.

The ".sfv" file extension indicates a checksum file containing 32-bit CRC32 checksums in simple file verification format.

The "crc.list" file indicates a checksum file containing 32-bit CRC checksums in brik format.

As of 2012, the best-practice recommendation is to use SHA-2 or SHA-3 to generate new file integrity digests, and to accept MD5 and SHA-1 digests for backward compatibility only if stronger digests are not available. The theoretically weaker SHA-1, the weaker MD5, and the much weaker CRC were previously in common use for file integrity checks. [2] [3] [4] [5] [6] [7] [8] [9] [10]

CRC checksums cannot be used to verify the authenticity of files, as CRC32 is not a collision-resistant hash function: even if the hash sum file is not tampered with, it is computationally trivial for an attacker to replace a file with one having the same CRC digest as the original, meaning that a malicious change in the file is not detected by a CRC comparison.
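How cheap CRC32 collisions are can be shown with a birthday search: among roughly 2^16 random inputs, two distinct strings with the same CRC32 are likely to appear, which the following sketch (using Python's zlib) typically finds in well under a second:

    import os
    import zlib

    # Birthday search: ~sqrt(2^32) = 65536 random inputs usually suffice.
    seen = {}
    while True:
        data = os.urandom(8)
        crc = zlib.crc32(data)
        if crc in seen and seen[crc] != data:
            print(f"collision: {seen[crc].hex()} and {data.hex()} -> {crc:#010x}")
            break
        seen[crc] = data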

Related Research Articles

Checksum

A checksum is a small-sized block of data derived from another block of digital data for the purpose of detecting errors that may have been introduced during its transmission or storage. By themselves, checksums are often used to verify data integrity but are not relied upon to verify data authenticity.
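As a toy illustration of the idea (not a format any real tool uses), a one-byte additive checksum already catches many accidental changes, though far from all:

    def additive_checksum(data: bytes) -> int:
        """Toy 8-bit checksum: the sum of all bytes modulo 256."""
        return sum(data) % 256

    print(additive_checksum(b"hello, world"))   # stored alongside the data
    print(additive_checksum(b"hello, World"))   # a one-byte change shifts the sum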

HMAC

In cryptography, an HMAC is a specific type of message authentication code (MAC) involving a cryptographic hash function and a secret cryptographic key. As with any MAC, it may be used to simultaneously verify both the data integrity and authenticity of a message.
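A minimal sketch with Python's standard hmac module; the shared secret here is hypothetical and would in practice be distributed out of band:

    import hashlib
    import hmac

    key = b"hypothetical shared secret"
    message = b"important file contents"

    # The sender computes the tag and transmits it with the message.
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    # The receiver recomputes the tag and compares in constant time.
    recomputed = hmac.new(key, message, hashlib.sha256).hexdigest()
    print(hmac.compare_digest(tag, recomputed))   # True: intact and authentic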

The MD5 message-digest algorithm is a widely used hash function producing a 128-bit hash value. MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function MD4, and was specified in 1992 as RFC 1321.
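For instance, two of the reference test vectors from the test suite in RFC 1321 can be reproduced with Python's standard hashlib:

    import hashlib

    # Test vectors from the test suite in RFC 1321.
    assert hashlib.md5(b"").hexdigest() == "d41d8cd98f00b204e9800998ecf8427e"
    assert hashlib.md5(b"abc").hexdigest() == "900150983cd24fb0d6963f7d28e17f72"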

In cryptography, SHA-1 is a hash function which takes an input and produces a 160-bit (20-byte) hash value known as a message digest – typically rendered as 40 hexadecimal digits. It was designed by the United States National Security Agency, and is a U.S. Federal Information Processing Standard. The algorithm has been cryptographically broken but is still widely used.

Hash collision

In computer science, a hash collision or hash clash occurs when two pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of bits.

Cryptographic hash function

A cryptographic hash function (CHF) is a hash algorithm that has special properties desirable for a cryptographic application, such as preimage resistance and collision resistance.

md5sum is a computer program that calculates and verifies 128-bit MD5 hashes, as described in RFC 1321. The MD5 hash functions as a compact digital fingerprint of a file. As with all such hashing algorithms, there is theoretically an unlimited number of files that will have any given MD5 hash. However, it is very unlikely that any two non-identical files in the real world will have the same MD5 hash, unless they have been specifically created to have the same hash.

In cryptography, a message authentication code (MAC), sometimes known as an authentication tag, is a short piece of information used for authenticating a message; in other words, for confirming that the message came from the stated sender and has not been changed. The MAC value protects a message's data integrity, as well as its authenticity, by allowing verifiers to detect any changes to the message content.

Simple file verification (SFV) is a file format for storing CRC32 checksums of files to verify the integrity of files. SFV is used to verify that a file has not been corrupted, but it does not otherwise verify the file's authenticity. The .sfv file extension is usually used for SFV files.
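A sketch of SFV-style verification using Python's zlib; each line of an .sfv file carries a file name followed by its eight-digit hex CRC32, and lines beginning with ';' are comments (the .sfv file name below is hypothetical):

    import zlib

    def crc32_of_file(path):
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                crc = zlib.crc32(chunk, crc)   # running CRC over the chunks
        return crc

    def verify_sfv(sfv_path):
        with open(sfv_path) as sfv:
            for line in sfv:
                line = line.strip()
                if not line or line.startswith(";"):   # ';' starts a comment
                    continue
                name, expected = line.rsplit(None, 1)
                ok = crc32_of_file(name) == int(expected, 16)
                print(name, "OK" if ok else "FAILED")

    verify_sfv("release.sfv")   # hypothetical file name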

cksum

cksum is a command in Unix and Unix-like operating systems that generates a checksum value for a file or stream of data. The cksum command reads each file given in its arguments, or standard input if no arguments are provided, and outputs the file's 32-bit cyclic redundancy check (CRC) checksum and byte count. The CRC output by cksum is different from the CRC-32 used in zip, PNG and zlib.
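A from-scratch sketch of that checksum as specified by POSIX (IEEE Std 1003.1): the generator polynomial 0x04C11DB7 is applied most-significant-bit first with a zero initial value, the data length is appended least-significant byte first, and the result is complemented. This is an illustrative reading of the specification, not the system utility itself:

    def posix_cksum(data: bytes) -> int:
        """CRC as described for the POSIX cksum utility."""
        def feed(crc: int, byte: int) -> int:
            crc ^= byte << 24
            for _ in range(8):
                if crc & 0x80000000:
                    crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
                else:
                    crc = (crc << 1) & 0xFFFFFFFF
            return crc

        crc = 0
        for b in data:
            crc = feed(crc, b)
        n = len(data)   # append the length, least significant byte first
        while n:
            crc = feed(crc, n & 0xFF)
            n >>= 8
        return crc ^ 0xFFFFFFFF   # final complement

    print(posix_cksum(b"hello, world"))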

The Secure Hash Algorithms are a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS), including SHA-1, SHA-2, and SHA-3.

In cryptography, a collision attack on a cryptographic hash tries to find two inputs producing the same hash value, i.e. a hash collision. This is in contrast to a preimage attack where a specific target hash value is specified.

Digest access authentication

Digest access authentication is one of the agreed-upon methods a web server can use to negotiate credentials, such as username or password, with a user's web browser. This can be used to confirm the identity of a user before sending sensitive information, such as online banking transaction history. It applies a hash function to the username and password before sending them over the network. In contrast, basic access authentication uses the easily reversible Base64 encoding instead of hashing, making it non-secure unless used in conjunction with TLS.
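A sketch of the core computation in its simplest form (RFC 2617 without the qop extensions); every credential and challenge value below is hypothetical:

    import hashlib

    def md5_hex(s: str) -> str:
        return hashlib.md5(s.encode()).hexdigest()

    # Hypothetical values; realm and nonce come from the server's challenge.
    user, realm, password = "alice", "example.com", "s3cret"
    method, uri, nonce = "GET", "/account", "dcd98b7102dd2f0e8b11d0f600bfb0c0"

    ha1 = md5_hex(f"{user}:{realm}:{password}")   # hash of the credentials
    ha2 = md5_hex(f"{method}:{uri}")              # hash of the request line
    response = md5_hex(f"{ha1}:{nonce}:{ha2}")    # value returned to the server
    print(response)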

Magnet URI scheme

Magnet is a URI scheme that defines the format of magnet links, a de facto standard for identifying files (URN) by their content, via cryptographic hash value rather than by their location.
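For illustration, a magnet link can be assembled from an already-known content hash; the 40-digit hex digest and display name below are hypothetical:

    from urllib.parse import quote

    info_hash = "c12fe1c06bba254a9dc9f519b335aa7c1367a88a"   # hypothetical SHA-1
    display_name = "example file.iso"                        # hypothetical name

    magnet = f"magnet:?xt=urn:btih:{info_hash}&dn={quote(display_name)}"
    print(magnet)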

SHA-2 is a set of cryptographic hash functions designed by the United States National Security Agency (NSA) and first published in 2001. They are built using the Merkle–Damgård construction, from a one-way compression function itself built using the Davies–Meyer structure from a specialized block cipher.

Merkle–Damgård construction

In cryptography, the Merkle–Damgård construction or Merkle–Damgård hash function is a method of building collision-resistant cryptographic hash functions from collision-resistant one-way compression functions. This construction was used in the design of many popular hash algorithms such as MD5, SHA-1 and SHA-2.
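A toy sketch of the construction, with a made-up 64-bit compression function that only illustrates the chaining and the length padding (Merkle–Damgård strengthening); real designs use carefully analyzed compression functions:

    def toy_compress(state: int, block: bytes) -> int:
        """Made-up compression function, for illustration only."""
        for b in block:
            state = ((state * 1099511628211) ^ b) & 0xFFFFFFFFFFFFFFFF
        return state

    def toy_md_hash(message: bytes, block_size: int = 8) -> int:
        # Strengthening: a 0x80 marker, zero padding, then the message length.
        padded = message + b"\x80"
        padded += b"\x00" * (-len(padded) % block_size)
        padded += len(message).to_bytes(block_size, "big")
        state = 0x6A09E667F3BCC908   # arbitrary fixed initial value (IV)
        for i in range(0, len(padded), block_size):
            state = toy_compress(state, padded[i:i + block_size])
        return state

    print(hex(toy_md_hash(b"hello")))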

sha1sum is a computer program that calculates and verifies SHA-1 hashes. It is commonly used to verify the integrity of files. It is installed by default on most Linux distributions. Typically distributed alongside sha1sum are sha224sum, sha256sum, sha384sum and sha512sum, which use a specific SHA-2 hash function and b2sum, which uses the BLAKE2 cryptographic hash function.

md5deep is a software package used in the computer security, system administration and computer forensics communities to run large numbers of files through any of several different cryptographic digests. It was originally authored by Jesse Kornblum, at the time a special agent of the Air Force Office of Special Investigations. As of 2017, he still maintains it.

Fingerprint (computing)

In computer science, a fingerprinting algorithm is a procedure that maps an arbitrarily large data item to a much shorter bit string, its fingerprint, which uniquely identifies the original data for all practical purposes, just as human fingerprints uniquely identify people. This fingerprint may be used for data deduplication purposes. This is also referred to as file fingerprinting, data fingerprinting, or structured data fingerprinting.
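A sketch of fingerprint-based deduplication, grouping the files under a directory by their SHA-256 digest (the directory name is hypothetical):

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def fingerprint(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    # Files sharing a fingerprint are duplicates for practical purposes.
    groups = defaultdict(list)
    for p in Path("photos").rglob("*"):   # hypothetical directory
        if p.is_file():
            groups[fingerprint(p)].append(p)

    for digest, paths in groups.items():
        if len(paths) > 1:
            print(digest, [str(p) for p in paths])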

crypt is a POSIX C library function. It is typically used to compute the hash of user account passwords. The function outputs a text string which also encodes the salt, and identifies the hash algorithm used. This output string forms a password record, which is usually stored in a text file.
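A brief sketch using Python's crypt module, a thin wrapper over the C function (Unix-only, and removed from the standard library in Python 3.13); the password is hypothetical:

    import crypt   # Unix-only; removed in Python 3.13

    salt = crypt.mksalt(crypt.METHOD_SHA512)
    record = crypt.crypt("hunter2", salt)   # hypothetical password
    print(record)   # "$6$<salt>$<hash>": method, salt and hash in one string

    # To verify a login attempt, re-hash using the stored record as the salt.
    print(crypt.crypt("hunter2", record) == record)   # True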

References

  1. "Checksum".
  2. NIST. "NIST's policy on hash functions" Archived 2011-06-09 at the Wayback Machine . 2012.
  3. File Transfer Consulting. "Integrity".
  4. "Intrusion Detection FAQ: What is the role of a file integrity checker like Tripwire in intrusion detection?" Archived 2014-10-12 at the Wayback Machine .
  5. Hacker Factor. "Tutorial: File Digest".
  6. Steve Mead. "Unique File Identification in the National Software Reference Library" p. 4.
  7. Del Armstrong. "An Introduction To File Integrity Checking On Unix Systems". 2003.
  8. "Cisco IOS Image Verification"
  9. Elizabeth D. Zwicky, Simon Cooper, D. Brent Chapman. "Building Internet Firewalls". p. 296.
  10. Simson Garfinkel, Gene Spafford, Alan Schwartz. "Practical UNIX and Internet Security". p. 630.
