Freedup

Last updated

freedup is a program to scan directories or file lists for duplicate files. The file lists may be provided to an input pipe or internally generated using find with provided options. There are more options to specify the search conditions more detailed. Other options influence the performed actions, i.e. whether to display only or to specify what kind of link under which circumstances. freedup first compares file sizes, then on equal sizes the MD5 signatures, and before taking actions a byte-by-byte check for verification is performed. An interactive mode allows to decide individually which files to link soft or hard or to delete.

Contents

The comparison by ignoring tags and comments is a unique feature of freedup. Filesize, start and end of unique content is kept for later processing. Comparing music you may ignore the tags, e.g. whether one is tagged with an v1-tag while another sound file with identical music is tagged with v2. It also works, if you copied and retagged the copy to fit into another album. This works for jpegs (Exif) and mp4-Movies as well. An auto-Mode is supported to instruct freedup to ignore all tags that are recognized. The author will extend this function on demand, if there is sufficient documentation howto strip the tags.

freedup is written in POSIX compliant C and is released under the GNU General Public License. Its complexity is O(n log n) for full file comparison. This is done for equally long files after sorting according to filesize using qsort(). [1]

See also

Related Research Articles

<span class="mw-page-title-main">PNG</span> Family of lossless compression file formats for image files

Portable Network Graphics is a raster-graphics file format that supports lossless data compression. PNG was developed as an improved, non-patented replacement for Graphics Interchange Format (GIF)—unofficially, the initials PNG stood for the recursive acronym "PNG's not GIF".

Waveform Audio File Format is an audio file format standard, developed by IBM and Microsoft, for storing an audio bitstream on PCs. It is the main format used on Microsoft Windows systems for uncompressed audio. The usual bitstream encoding is the linear pulse-code modulation (LPCM) format.

Berkeley sockets is an application programming interface (API) for Internet sockets and Unix domain sockets, used for inter-process communication (IPC). It is commonly implemented as a library of linkable modules. It originated with the 4.2BSD Unix operating system, which was released in 1983.

In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.

Direct Client-to-Client (DCC) is an IRC-related sub-protocol enabling peers to interconnect using an IRC server for handshaking in order to exchange files or perform non-relayed chats. Once established, a typical DCC session runs independently from the IRC server. Originally designed to be used with ircII it is now supported by many IRC clients. Some peer-to-peer clients on napster-protocol servers also have DCC send/get capability, including TekNap, SunshineUN and Lopster. A variation of the DCC protocol called SDCC, also known as DCC SCHAT supports encrypted connections. An RFC specification on the use of DCC does not exist.

dd is a command-line utility for Unix, Plan 9, Inferno, and Unix-like operating systems and beyond, the primary purpose of which is to convert and copy files. On Unix, device drivers for hardware and special device files appear in the file system just like normal files; dd can also read and/or write from/to these files, provided that function is implemented in their respective driver. As a result, dd can be used for tasks such as backing up the boot sector of a hard drive, and obtaining a fixed amount of random data. The dd program can also perform conversions on the data as it is copied, including byte order swapping and conversion to and from the ASCII and EBCDIC text encodings.

Ctags is a programming tool that generates an index file of names found in source and header files of various programming languages to aid code comprehension. Depending on the language, functions, variables, class members, macros and so on may be indexed. These tags allow definitions to be quickly and easily located by a text editor, a code search engine, or other utility. Alternatively, there is also an output mode that generates a cross reference file, listing information about various names found in a set of language files in human-readable form.

Client-to-client protocol (CTCP) is a special type of communication between Internet Relay Chat (IRC) clients.

cp (Unix) Unix command utility

In computing, cp is a command in various Unix and Unix-like operating systems for copying files and directories. The command has three principal modes of operation, expressed by the types of arguments presented to the program for copying a file to another file, one or more files to a directory, or for copying entire directories to another directory.

rzip is a huge-scale data compression computer program designed around initial LZ77-style string matching on a 900 MB dictionary window, followed by bzip2-based Burrows–Wheeler transform and entropy coding (Huffman) on 900 kB output chunks.

du (Unix) Standard Unix program

du is a standard Unix program used to estimate file space usage—space used under a particular directory or files on a file system. A Windows commandline version of this program is part of Sysinternals suite by Mark Russinovich.

cmp (Unix)

In computing, cmp is a command-line utility on Unix and Unix-like operating systems that compares two files of any type and writes the results to the standard output. By default, cmp is silent if the files are the same; if they differ, the byte and line number at which the first difference occurred is reported. The command is also available in the OS-9 shell.

The seven standard Unix file types are regular, directory, symbolic link, FIFO special, block special, character special, and socket as defined by POSIX. Different OS-specific implementations allow more types than what POSIX requires. A file's type can be identified by the ls -l command, which displays the type in the first character of the file-system permissions field.

File locking is a mechanism that restricts access to a computer file, or to a region of a file, by allowing only one user or process to modify or delete it at a specific time and to prevent reading of the file while it's being modified or deleted.

In Unix-like and some other operating systems, find is a command-line utility that locates files based on some user-specified criteria and either prints the pathname of each matched object or, if another action is requested, performs that action on each matched object.

cpio is a general file archiver utility and its associated file format. It is primarily installed on Unix-like computer operating systems. The software utility was originally intended as a tape archiving program as part of the Programmer's Workbench (PWB/UNIX), and has been a component of virtually every Unix operating system released thereafter. Its name is derived from the phrase copy in and out, in close description of the program's use of standard input and standard output in its operation.

In computing, a shebang is the character sequence consisting of the characters number sign and exclamation mark at the beginning of a script. It is also called sharp-exclamation, sha-bang, hashbang, pound-bang, or hash-pling.

<span class="mw-page-title-main">DirSync Pro</span> Computer file synchronization software

DirSync Pro is an open-source file synchronization and backup utility for Windows, Linux and macOS. DirSync Pro is based on the program Directory Synchronize (DirSync), which was first released in February 2003 by Elias Gerber. He subsequently developed it with Frank Gerbig and T. Groetzner. DirSync Pro was released by O. Givi in July 2008, based on a branch of the DirSync code. Many parts of DirSync Pro have gone through major rewriting and redesign ever since.

<span class="mw-page-title-main">Unified Code Count</span>

The Unified Code Counter (UCC) is a comprehensive software lines of code counter produced by the USC Center for Systems and Software Engineering. It is available to the general public as open source code and can be compiled with any standard ANSI C++ compiler.

cat (Unix) Unix command utility

cat is a standard Unix utility that reads files sequentially, writing them to standard output. The name is derived from its function to (con)catenate files. It has been ported to a number of operating systems.

References

  1. "What is POSIX (Portable Operating System Interface)?". WhatIs.com. Retrieved 2023-04-16.