File sequence

In computing, as well as in non-computing contexts, a file sequence is a well-ordered, finite collection of files, usually related to each other in some way.

In computing, file sequences should ideally obey some kind of locality of reference principle: not only should all the files belonging to the same sequence be referenced close to one another, but the closer two files are with respect to the ordering relation, the more this should hold. Explicit file sequences are, in fact, sequences whose filenames all end with a numeric or alphanumeric tag (excluding the file extension).

The aforementioned locality of reference usually pertains to the data, to the metadata (e.g. the filenames or last-access dates), or to the physical proximity of the files within the storage media they reside on. In the latter sense it is better to speak of file contiguity (see below).

Identification

Every GUI program displays the contents of a folder by ordering its files according to some criterion, usually related to the files' metadata, such as the filename. The default criterion is the alphanumeric ordering of filenames, although some operating systems do this in "smarter" ways than others: for example, file2.ext should ideally be placed before file10.ext, as GNOME Files and Thunar do, whereas, alphanumerically, it comes after it (more on that later). Other criteria exist, such as ordering files by their file type (or by their extension) and, within the same type, by either filename or last-access date, and so on.
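
As an illustration of the difference between plain alphanumeric ordering and the "smarter", natural ordering mentioned above, the following minimal Python sketch sorts embedded numbers by value rather than character by character; the helper name natural_key is illustrative and is not taken from any particular file manager:

```python
import re

def natural_key(filename: str):
    """Split a filename into text and number chunks so that embedded
    numbers compare by value rather than character by character."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]

names = ["file10.ext", "file2.ext", "file1.ext"]
print(sorted(names))                   # alphanumeric: file1, file10, file2
print(sorted(names, key=natural_key))  # "natural":    file1, file2, file10
```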

For this reason, when a file sequence has a stronger locality of reference, particularly when it is related to the files' actual contents, it is better to highlight this fact by letting the sequence's well-ordering induce an alphanumeric ordering of the filenames too. That is the case of explicit file sequences.

Explicit file sequences

Explicit file sequences have the same filename (including the file extension, which confirms their contents' locality of reference) except for the final part before the extension, which is a run of numeric, alphanumeric or purely alphabetical characters that forces a specific ordering; such sequences should also ideally be located within the same directory.

In this sense, any files sharing the same filename (and possibly extension) and differing only by the sequence number at the end of the filename automatically belong to the same file sequence, at least when they are located in the same folder. It is also part of many naming conventions that number-indexed file sequences (in any number base) whose indices span at most a fixed number of digits pad the index with leading zeroes, so that:

- all filenames in the sequence have the same length;
- the alphanumeric ordering of the filenames coincides with the numeric ordering of the sequence.

To better explain the latter point, consider that, strictly speaking, file2.ext (the second file in the sequence) comes alphanumerically after file100.ext, which is actually the hundredth. By renaming the second file to file002.ext, with two leading zeroes (and padding the rest of the sequence likewise), the problem is solved for the whole sequence.
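
The effect of zero-padding on the ordering can be checked with a short Python sketch (the filenames are purely illustrative):

```python
# Without padding, the alphanumeric order diverges from the numeric order.
unpadded = [f"file{i}.ext" for i in (1, 2, 10, 100)]
print(sorted(unpadded))
# ['file1.ext', 'file10.ext', 'file100.ext', 'file2.ext']

# Padding every index to a fixed width (three digits here) makes both orders coincide.
padded = [f"file{i:03d}.ext" for i in (1, 2, 10, 100)]
print(sorted(padded))
# ['file001.ext', 'file002.ext', 'file010.ext', 'file100.ext']
```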

Examples of explicit file sequences include: file00000.ext, file00001.ext, file00002.ext, …, file02979.ext (five-digit zero-padding), and a hexadecimally ordered sequence of 256 files: tag_00.ext, tag_01.ext, …, tag_09.ext, tag_0A.ext, …, tag_0F.ext, tag_10.ext, …, tag_FF.ext (two-digit zero-padding, i.e. at most one leading zero).

Software and programming conventions usually represent a file sequence as a single virtual file object, whose name is conventionally written in C-like formatted-string notation to indicate where the sequence number is located in the filename and how it is formatted. For the two examples above, that would be file%05d.ext and tag_%02X.ext, respectively, whereas for the former sequence the same convention without leading zeroes would be file%5d.ext. Note, however, that such notation is usually not valid at the operating-system and command-line interface levels, because the '%'-based pattern is not a valid regular expression or wildcard, nor is '%' a universally legal filename character: that notation just stands as a placeholder for the virtual file-like object representing the whole explicit file sequence.
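
The following Python sketch shows, under simplified assumptions, how such a pattern can be derived from a list of filenames; the function find_sequences and its regular expression are illustrative and do not reproduce the behaviour of any particular software package:

```python
import re
from collections import defaultdict

# Group filenames that differ only by a trailing number (before the extension)
# and describe each group with a printf-like pattern such as "file%05d.ext".
SEQ_RE = re.compile(r"^(?P<prefix>.*?)(?P<index>\d+)(?P<ext>\.[^.]+)?$")

def find_sequences(filenames):
    groups = defaultdict(list)
    for name in filenames:
        match = SEQ_RE.match(name)
        if match:
            key = (match["prefix"], len(match["index"]), match["ext"] or "")
            groups[key].append(int(match["index"]))
    return {f"{prefix}%0{width}d{ext}": sorted(indices)
            for (prefix, width, ext), indices in groups.items()}

files = ["file00000.ext", "file00001.ext", "file02979.ext", "notes.txt"]
print(find_sequences(files))  # {'file%05d.ext': [0, 1, 2979]}
```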

Notable software packages that acknowledge explicit file sequences as single filesystem objects, a practice typical of the audio/video post-production industry (see below), include products by Autodesk, Quantel, daVinci and DVS, as well as Adobe After Effects.

File scattering

A file sequence located on a mass storage device is said to be contiguous if:

- each of its files is stored in contiguous blocks (i.e. it is not fragmented);
- files that are consecutive in the sequence are stored in consecutive regions of the device, following the sequence order.

File contiguity is a more practical requirement for file sequences than mere locality of reference, because it relates more to the storage medium hosting the whole sequence than to the sequence itself (or its metadata). At the same time, it is a "high-level" feature, because it is not tied to the physical and technical details of the mass storage itself: in particular, file contiguity is realized in different ways according to the storage device's architecture and actual filesystem structure. At "low level", each file in a contiguous sequence must be placed in contiguous blocks, apart from reserved areas or special metadata required by the filesystem (like inodes or inter-sector headers) that may actually interleave them.

File contiguity is, in most practical applications, "invisible" at the operating-system and user levels, since all the files in a sequence are always available to applications in the same way, regardless of their physical location on the storage device (operating systems hide the filesystem internals from higher-level services). Nevertheless, file contiguity may matter for I/O performance when the sequence has to be read or written in the shortest possible time. In some contexts (like optical disk burning; see also below), data in a file sequence must be accessed in the same order as the file sequence itself; in other contexts, "random" access to the sequence may be required. In both cases, most professional filesystems provide faster access strategies for contiguous files than for non-contiguous ones. Data pre-allocation is crucial for write access, whereas burst read speeds are achievable only for contiguous data.
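
As an illustration of pre-allocation, the following Python sketch reserves space for every file of a sequence before any data is written, assuming a Linux system; the frame size, frame count and filenames are hypothetical:

```python
import os

FRAME_SIZE = 9 * 1024 * 1024   # hypothetical size of one frame file, in bytes
FRAME_COUNT = 100              # hypothetical length of the sequence

# Reserve space for every file of the sequence up front.  On Linux,
# os.posix_fallocate() allocates the blocks immediately, giving the filesystem
# a chance to keep each file (and the whole sequence) in nearby extents;
# whether it actually does so depends on the filesystem and its free space.
for i in range(FRAME_COUNT):
    fd = os.open(f"frame{i:05d}.dpx", os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        os.posix_fallocate(fd, 0, FRAME_SIZE)
    finally:
        os.close(fd)
```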

When a file sequence is not contiguous, it is said to be scattered, since its files are stored in sparse locations on the storage device. File scattering is the process by which a file sequence is allocated (or re-allocated) in a non-contiguous way. It is often associated with file fragmentation, where each file is also stored in several non-contiguous blocks; the mechanisms contributing to the former are usually a common cause of the latter too. The act of reducing file scattering, by allocating (in the first place) or moving (for already-stored data) the files of a sequence close together on the storage medium, is called (file) descattering. A few defragmentation strategies and dedicated software packages are able both to defragment single files and to descatter file sequences.

Multimedia file sequences

There are many contexts in which explicit file sequences are particularly important: incremental backups, periodic logs, and multimedia files captured or created with a chronological locality of reference. In the latter case, explicit file numbering is extremely important in order to give both software and end users a way to discern the sequential order of the contents stored therein. For example, digital cameras and similar devices save all the picture files in the same folder (until it either reaches its maximum file-number capacity, or a new event such as the passing of midnight or the switching of the device takes place), each name ending with a sequence number: it would be very impractical to choose a filename for each shot at the very moment of shooting, so the camera firmware/software picks one which is perfectly identifiable by its sequence number. With the aid of other metadata (and usually of specialized PC software), users can later discern the multimedia contents and re-organize them if needed.
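
A minimal Python sketch of such a numbering scheme is shown below; the IMG_ prefix, the four-digit width and the folder layout are assumptions made for illustration, not the behaviour of any actual camera firmware:

```python
import os
import re

def next_shot_name(folder, prefix="IMG_", width=4, ext=".jpg"):
    """Return the next free zero-padded filename in the folder,
    e.g. IMG_0001.jpg, IMG_0002.jpg, ... (illustrative scheme only)."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d{{{width}}}){re.escape(ext)}$")
    used = [int(m.group(1)) for name in os.listdir(folder)
            if (m := pattern.match(name))]
    return f"{prefix}{max(used, default=0) + 1:0{width}d}{ext}"

# If the folder already holds IMG_0001.jpg and IMG_0002.jpg,
# next_shot_name(".") returns "IMG_0003.jpg".
```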

The Digital Intermediate example

A typical example where explicit file sequences, as well as their contiguity, become crucial is the digital intermediate (DI) workflow for the motion picture and video industries. In such contexts, video data need to maintain the highest quality and be ready for visualization (usually in real time, if not faster). Video data are usually acquired from either a digital video camera or a motion picture film scanner and stored as file sequences (much as a common photographic camera does), and they need to be post-produced in several steps, including at least editing, conforming and colour correction. That requires:

- each frame to be stored as a separate file belonging to an explicit file sequence, so that software can identify and order the frames;
- the whole sequence to be stored contiguously on the storage medium;
- the storage system to sustain the throughput needed for real-time playback.

Consider that a single frame in a DI project currently ranges from 9 MB to 48 MB in size (depending upon resolution and colour depth), whereas the video refresh rate is commonly 24 or 25 frames per second (if not faster); any storage system meant to play such contents in real time thus needs a minimum overall throughput of 220 MB/s to 1.2 GB/s, respectively. With those numbers, all the above requirements (particularly file contiguity, given current storage performance) become strictly mandatory.
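
The quoted figures follow from multiplying the size of one frame by the refresh rate, as this small Python check shows (decimal megabytes are assumed):

```python
def min_throughput_mb_per_s(frame_size_mb, fps):
    """Sustained throughput needed to deliver one frame file per refresh interval."""
    return frame_size_mb * fps

print(min_throughput_mb_per_s(9, 24))   # 216 MB/s, roughly the quoted 220 MB/s
print(min_throughput_mb_per_s(48, 25))  # 1200 MB/s, i.e. the quoted 1.2 GB/s
```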


Related Research Articles

ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It used to be the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs Filesystem in a 1998 paper, and later in a February 1999 kernel mailing list posting. The filesystem was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling, which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

<span class="mw-page-title-main">Apache Subversion</span> Free and open-source software versioning and revision control system

Apache Subversion is a software versioning and revision control system distributed as open source under the Apache License. Software developers use Subversion to maintain current and historical versions of files such as source code, web pages, and documentation. Its goal is to be a mostly compatible successor to the widely used Concurrent Versions System (CVS).

In computing, tar is a computer software utility for collecting many files into one archive file, often referred to as a tarball, for distribution or backup purposes. The name is derived from "tape archive", as it was originally developed to write data to sequential I/O devices with no file system of their own. The archive data sets created by tar contain various file system parameters, such as name, timestamps, ownership, file-access permissions, and directory organization. POSIX abandoned tar in favor of pax, yet tar sees continued widespread use.

<span class="mw-page-title-main">Filename</span> Text string used to uniquely identify a computer file

A filename or file name is a name used to uniquely identify a computer file in a file system. Different file systems impose different restrictions on filename lengths.

The inode is a data structure in a Unix-style file system that describes a file-system object such as a file or a directory. Each inode stores the attributes and disk block locations of the object's data. File-system object attributes may include metadata, as well as owner and permission data.

Files-11 is the file system used in the RSX-11 and OpenVMS operating systems from Digital Equipment Corporation. It supports record-oriented I/O, remote network access, and file versioning. The original ODS-1 layer is a flat file system; the ODS-2 version is a hierarchical file system, with support for access control lists.

A FourCC is a sequence of four bytes used to uniquely identify data formats. It originated from the OSType or ResType metadata system used in classic Mac OS and was adopted for the Amiga/Electronic Arts Interchange File Format and derivatives. The idea was later reused to identify compressed data types in QuickTime and DirectShow.

<span class="mw-page-title-main">File system</span> Format or program for storing files and directories

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names is called a "file system."

Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file system software is available under the GNU General Public License and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site systems. Since June 2005, Lustre has consistently been used by at least half of the top ten, and more than 60 of the top 100 fastest supercomputers in the world, including the world's No. 1 ranked TOP500 supercomputer in November 2022, Frontier, as well as previous top supercomputers such as Fugaku, Titan and Sequoia.

In computer science, a record-oriented filesystem is a file system where data is stored as collections of records. This is in contrast to a byte-oriented filesystem, where the data is treated as an unformatted stream of bytes. There are several different possible record formats; the details vary depending on the particular system. In general the formats can be fixed-length or variable length, with different physical organizations or padding mechanisms; metadata may be associated with the file records to define the record length, or the data may be part of the record. Different access methods for records may be provided, for example records may be retrieved in sequential order, by key, or by record number.

Design rule for Camera File system (DCF) is a JEITA specification which defines a file system for digital cameras, including the directory structure, file naming method, character set, file format, and metadata format. It is currently the de facto industry standard for digital still cameras. The file format of DCF conforms to the Exif specification, but the DCF specification also allows use of any other file formats. As of 2021, the latest version of the standard was 2.0, issued in 2010.

ext4 is a journaling file system for Linux, developed as the successor to ext3.

Disk encryption is a technology which protects information by converting it into code that cannot be deciphered easily by unauthorized people or processes. Disk encryption uses disk encryption software or hardware to encrypt every bit of data that goes on a disk or disk volume. It is used to prevent unauthorized access to data storage.

Filesystem-level encryption, often called file-based encryption, FBE, or file/folder encryption, is a form of disk encryption where individual files or directories are encrypted by the file system itself.

<span class="mw-page-title-main">File system fragmentation</span> Condition where a segmented file system is used inefficiently

In computing, file system fragmentation, sometimes called file system aging, is the tendency of a file system to lay out the contents of files non-continuously to allow in-place modification of their contents. It is a special case of data fragmentation. File system fragmentation negatively impacts seek time in spinning storage media, which is known to hinder throughput. Fragmentation can be remedied by re-organizing files and free space back into contiguous areas, a process called defragmentation.

A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.

File carving is the process of reassembling computer files from fragments in the absence of filesystem metadata.

A journaling file system is a file system that keeps track of changes not yet committed to the file system's main part by recording the goal of such changes in a data structure known as a "journal", which is usually a circular log. In the event of a system crash or power failure, such file systems can be brought back online more quickly with a lower likelihood of becoming corrupted.

BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" and "tags," which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method, sometimes referred to as "bag it and tag it."

ZFS is a file system with volume management capabilities. It began as part of the Sun Microsystems Solaris operating system in 2001. Large parts of Solaris – including ZFS – were published under an open source license as OpenSolaris for around 5 years from 2005 before being placed under a closed source license when Oracle Corporation acquired Sun in 2009–2010. During 2005 to 2010, the open source version of ZFS was ported to Linux, Mac OS X and FreeBSD. In 2010, the illumos project forked a recent version of OpenSolaris, including ZFS, to continue its development as an open source project. In 2013, OpenZFS was founded to coordinate the development of open source ZFS. OpenZFS maintains and manages the core ZFS code, while organizations using ZFS maintain the specific code and validation processes required for ZFS to integrate within their systems. OpenZFS is widely used in Unix-like systems.