Freedup

Last updated

freedup is a program to scan directories or file lists for duplicate files. The file lists may be provided to an input pipe or internally generated using find with provided options. There are more options to specify the search conditions more detailed. Other options influence the performed actions, i.e. whether to display only or to specify what kind of link under which circumstances. freedup first compares file sizes, then on equal sizes the MD5 signatures, and before taking actions a byte-by-byte check for verification is performed. An interactive mode allows to decide individually which files to link soft or hard or to delete.

Contents

The comparison by ignoring metadata tags and comments is a unique feature of freedup. Filesize, start and end of unique content is kept for later processing. Comparing sound files you may ignore the tags, e.g. whether one is tagged with an ID3v1-tag while another sound file with identical music is tagged with ID3v2. It also works, if you copied and retagged the copy to fit into another album. This works for JPEG files (Exif) and mp4-Movies as well. An auto-Mode is supported to instruct freedup to ignore all tags that are recognized. The author will extend this function on demand, if there is sufficient documentation how to strip the tags.

freedup is written in POSIX compliant C and is released under the GNU General Public License. Its complexity is O(n log n) for full file comparison. This is done for equally long files after sorting according to filesize using qsort(). [1]

See also

Related Research Articles

<span class="mw-page-title-main">PNG</span> Family of lossless-compression image file formats

Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".

Waveform Audio File Format is an audio file format standard for storing an audio bitstream on personal computers. The format was developed and published for the first time in 1991 by IBM and Microsoft. It is the main format used on Microsoft Windows systems for uncompressed audio. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.

In computing, the utility diff is a data comparison tool that computes and displays the differences between the contents of files. Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other. The utility displays the changes in one of several standard formats, such that both humans or computers can parse the changes, and use them for patching.

Berkeley sockets is an application programming interface (API) for Internet sockets and Unix domain sockets, used for inter-process communication (IPC). It is commonly implemented as a library of linkable modules. It originated with the 4.2BSD Unix operating system, which was released in 1983.

In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own, such as devices that use magnetic tape. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.

Direct Client-to-Client (DCC) is an IRC-related sub-protocol enabling peers to interconnect using an IRC server for handshaking in order to exchange files or perform non-relayed chats. Once established, a typical DCC session runs independently from the IRC server. Originally designed to be used with ircII it is now supported by many IRC clients. Some peer-to-peer clients on napster-protocol servers also have DCC send/get capability, including TekNap, SunshineUN and Lopster. A variation of the DCC protocol called SDCC, also known as DCC SCHAT supports encrypted connections. An RFC specification on the use of DCC does not exist.

<span class="mw-page-title-main">Comma-separated values</span> File format used to store data

Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.

dd is a command-line utility for Unix, Plan 9, Inferno, and Unix-like operating systems and beyond, the primary purpose of which is to convert and copy files. On Unix, device drivers for hardware and special device files appear in the file system just like normal files; dd can also read and/or write from/to these files, provided that function is implemented in their respective driver. As a result, dd can be used for tasks such as backing up the boot sector of a hard drive, and obtaining a fixed amount of random data. The dd program can also perform conversions on the data as it is copied, including byte order swapping and conversion to and from the ASCII and EBCDIC text encodings.

Ctags is a programming tool that generates an index file of names found in source and header files of various programming languages to aid code comprehension. Depending on the language, functions, variables, class members, macros and so on may be indexed. These tags allow definitions to be quickly and easily located by a text editor, a code search engine, or other utility. Alternatively, there is also an output mode that generates a cross reference file, listing information about various names found in a set of language files in human-readable form.

Client-to-client protocol (CTCP) is a special type of communication between Internet Relay Chat (IRC) clients.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

du (Unix) Standard Unix program

du is a standard Unix program used to estimate file space usage—space used under a particular directory or files on a file system. A Windows commandline version of this program is part of Sysinternals suite by Mark Russinovich.

cmp (Unix) Computer file comparison utility

In computing, cmp is a command-line utility on Unix and Unix-like operating systems that compares two files of any type and writes the results to the standard output. By default, cmp is silent if the files are the same; if they differ, the byte and line number at which the first difference occurred is reported. The command is also available in the OS-9 shell.

The seven standard Unix file types are regular, directory, symbolic link, FIFO special, block special, character special, and socket as defined by POSIX. Different OS-specific implementations allow more types than what POSIX requires. A file's type can be identified by the ls -l command, which displays the type in the first character of the file-system permissions field.

File locking is a mechanism that restricts access to a computer file, or to a region of a file, by allowing only one user or process to modify or delete it at a specific time and to prevent reading of the file while it's being modified or deleted.

In Unix-like operating systems, find is a command-line utility that locates files based on some user-specified criteria and either prints the pathname of each matched object or, if another action is requested, performs that action on each matched object.

cpio is a general file archiver utility and its associated file format. It is primarily installed on Unix-like computer operating systems. The software utility was originally intended as a tape archiving program as part of the Programmer's Workbench (PWB/UNIX), and has been a component of virtually every Unix operating system released thereafter. Its name is derived from the phrase copy in and out, in close description of the program's use of standard input and standard output in its operation.

In computing, a shebang is the character sequence consisting of the characters number sign and exclamation mark at the beginning of a script. It is also called sharp-exclamation, sha-bang, hashbang, pound-bang, or hash-pling.

The following tables compare general and technical information for a number of file systems.

cat (Unix) Unix command utility

cat is a standard Unix utility that reads files sequentially, writing them to standard output. The name is derived from its function to (con)catenate files . It has been ported to a number of operating systems.

References

  1. "What is POSIX (Portable Operating System Interface)?". WhatIs.com. Retrieved 2023-04-16.