File carving

File carving is the process of reassembling computer files from fragments in the absence of filesystem metadata.

Introduction and basic principles

All filesystems contain some metadata describing the files they store. At a minimum, this includes the hierarchy of folders and files, with names for each. The filesystem also records the physical locations on the storage device where each file's data is kept. As explained below, a file might be scattered in fragments at different physical addresses.

File carving is the process of trying to recover files without this metadata. This is done by analyzing the raw data and identifying what it is (text, executable, PNG, MP3, etc.). This can be done in different ways, but the simplest is to look for file signatures or "magic numbers" that mark the beginning and/or end of a particular file type. [1] For instance, every Java class file has as its first four bytes the hexadecimal value CA FE BA BE. Some file types also have footers, making it just as simple to identify the end of the file.
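
As an illustration of signature-based carving, the Python sketch below scans a raw image for a few well-known header and footer byte patterns; the signature table and the image filename are illustrative assumptions, not an exhaustive or authoritative list.

```python
# Minimal sketch of signature scanning: report every offset at which a known
# header or footer pattern occurs in a raw image. The signature table and the
# image filename ("disk.img") are illustrative assumptions.

SIGNATURES = {
    "jpeg":       {"header": b"\xFF\xD8\xFF",      "footer": b"\xFF\xD9"},
    "png":        {"header": b"\x89PNG\r\n\x1a\n", "footer": b"IEND\xaeB`\x82"},
    "java_class": {"header": b"\xCA\xFE\xBA\xBE",  "footer": None},
}

def find_signatures(image_path):
    """Yield (file_type, kind, offset) for every header/footer match."""
    with open(image_path, "rb") as f:
        data = f.read()          # fine for small images; use mmap for large ones
    for ftype, sig in SIGNATURES.items():
        for kind in ("header", "footer"):
            pattern = sig[kind]
            if pattern is None:
                continue
            offset = data.find(pattern)
            while offset != -1:
                yield ftype, kind, offset
                offset = data.find(pattern, offset + 1)

if __name__ == "__main__":
    for ftype, kind, offset in sorted(find_signatures("disk.img"), key=lambda hit: hit[2]):
        print(f"{offset:#010x}  {ftype:<11} {kind}")
```

Real carvers use much larger signature databases and typically apply additional checks, for example restricting matches to cluster boundaries, to reduce false positives.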

Most file systems, such as the FAT family and UNIX's Fast File System, work with the concept of clusters of an equal and fixed size. For example, a FAT32 file system might be broken into clusters of 4 KiB each. Any file smaller than 4 KiB fits into a single cluster, and there is never more than one file in each cluster. Files that take up more than 4 KiB are allocated across many clusters. Sometimes these clusters are all contiguous, while other times they are scattered across two or potentially many more so-called fragments, with each fragment containing a number of contiguous clusters storing one part of the file's data. Obviously, large files are more likely to be fragmented.

Thus, finding the header of a file means that the first fragment of the file is found, but the other fragments might be scattered anywhere else on the partition, making file carving much more challenging. By studying how file systems actually fragment files and applying statistics, it is possible to make qualified guesses as to which fragments might fit together. Candidate fragments are then put together in various permutations and tested to see whether they form a valid file. For some file types the software can easily test whether pieces fit, while for others it might accidentally fit the pieces together incorrectly.
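
As a small worked example of the cluster arithmetic above, the sketch below maps byte offsets to cluster numbers and computes how many clusters a file occupies, assuming 4 KiB clusters and a data area that starts at offset 0 (both assumptions, for illustration only).

```python
# Cluster arithmetic under two simplifying assumptions: 4 KiB clusters and a
# data area beginning at byte offset 0 of the image.

CLUSTER_SIZE = 4 * 1024   # 4 KiB, as in the FAT32 example above

def cluster_of(offset: int) -> int:
    """Cluster number containing the given byte offset."""
    return offset // CLUSTER_SIZE

def clusters_needed(file_size: int) -> int:
    """Number of clusters a file of this size occupies (ceiling division)."""
    return -(-file_size // CLUSTER_SIZE)

print(cluster_of(0x2A000))       # 42: a header found 168 KiB into the image sits in cluster 42
print(clusters_needed(10_000))   # 3: a 10,000-byte file needs three 4 KiB clusters
```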

Simson Garfinkel [2] reported fragmentation statistics collected from over 350 disks containing FAT, NTFS and UFS file systems. He showed that while fragmentation in a typical disk is low, the fragmentation rate of forensically important files such as email, JPEG images and Word documents is relatively high: the fraction of files split into two or more fragments was 16% for JPEG files, 17% for Word documents, 22% for AVI files and 58% for PST files (Microsoft Outlook). Pal, Shanmugasundaram, and Memon [3] presented an efficient algorithm based on a greedy heuristic and alpha-beta pruning for reassembling fragmented images. Pal, Sencar, and Memon [4] introduced sequential hypothesis testing as an effective mechanism for detecting fragmentation points. Richard and Roussev [5] presented Scalpel, an open-source file-carving tool [6] available since 2005 and initially based on Foremost. [7]

File carving is a highly complex task, with a potentially huge number of permutations to try. To make this task tractable, carving software typically makes extensive use of models and heuristics. This is necessary not only from a standpoint of execution time, but also for the accuracy of the results. State-of-the-art file carving algorithms use statistical techniques like sequential hypothesis testing for determining fragmentation points.
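
The sketch below shows the general shape of such a sequential test: evidence from successive blocks is accumulated into a log-likelihood ratio until it crosses a decision threshold. The match-probability model, the thresholds and the boolean per-block scores are invented placeholders for illustration, not the model of the published algorithm.

```python
import math

# Schematic sequential probability ratio test (SPRT) for a fragmentation point.
# The probabilities and thresholds below are invented placeholders; a real
# carver would derive per-block evidence from content models of the file type.

P_MATCH_SAME_FILE = 0.9   # assumed chance a block "matches" when it continues the file
P_MATCH_RANDOM    = 0.3   # assumed chance an unrelated block happens to match

def fragmentation_point(block_matches, accept=-3.0, reject=3.0):
    """Return the index of the first block judged not to belong to the file,
    or None if every block is accepted. block_matches is an iterable of
    booleans (True = block looks like a continuation of the file)."""
    llr = 0.0  # log-likelihood ratio of "unrelated data" vs "same file"
    for i, matched in enumerate(block_matches):
        if matched:
            llr += math.log(P_MATCH_RANDOM / P_MATCH_SAME_FILE)              # evidence for continuation
        else:
            llr += math.log((1 - P_MATCH_RANDOM) / (1 - P_MATCH_SAME_FILE))  # evidence against
        if llr >= reject:
            return i          # fragmentation point: the file appears to end before block i
        if llr <= accept:
            llr = 0.0         # strong evidence of continuation; restart the test
    return None

print(fragmentation_point([True, True, True, False, False, True]))  # 4 with these placeholder values
```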

Motivation

In most cases, when a file is deleted, the entry in the file system metadata is removed but the actual data is still on the disk. File carving can be used to recover data from a hard disk where the metadata was removed or otherwise damaged. This process may be successful even after a drive is formatted or repartitioned.

File carving can be performed using free or commercial software and is often performed in conjunction with computer forensics examinations or alongside other recovery efforts (e.g. hardware repair) by data recovery companies. [8] Whereas the primary goal of data recovery is to recover the file content, computer forensics examiners are often just as interested in the metadata such as who owned a file, where it was stored, and when it was last modified. [9] Thus, while a forensic examiner could use file carving to prove that a file was once stored on a hard drive, he or she might need to seek out other evidence to prove who put it there.

Carving schemes

Bifragment gap carving

Garfinkel [2] introduced the use of fast object validation for reassembling files that have been split into two pieces. This technique is referred to as Bifragment Gap Carving (BGC). A set of starting fragments and a set of finishing fragments are identified. The fragments are reassembled if together they form a valid object.
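
A minimal sketch of this idea is shown below; the list of fixed-size blocks between a header and a footer, and the fast validate callback (for example, an image decoder wrapped in try/except), are assumptions supplied by the caller rather than part of the published method.

```python
# Minimal sketch of bifragment gap carving. `blocks` holds the fixed-size
# blocks from the one containing a header to the one containing a matching
# footer, in disk order; `validate` is any fast object validator. Both are
# assumptions supplied by the caller.

def bifragment_gap_carve(blocks, validate):
    whole = b"".join(blocks)
    if validate(whole):                          # perhaps the file is not fragmented at all
        return whole
    n = len(blocks)
    for gap_start in range(1, n - 1):            # the gap may not remove the header block...
        for gap_end in range(gap_start, n - 1):  # ...nor the footer block
            candidate = b"".join(blocks[:gap_start] + blocks[gap_end + 1:])
            if validate(candidate):
                return candidate                 # first reconstruction that validates
    return None
```

Because every gap position and length is tried, the number of validations in this sketch grows quadratically with the number of blocks between header and footer, which is why fast object validation matters.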

SmartCarving

Pal [3] developed a carving scheme that is not limited to bifragmented files. The technique, known as SmartCarving, makes use of heuristics regarding the fragmentation behavior of known filesystems. The algorithm has three phases: preprocessing, collation, and reassembly. In the preprocessing phase, blocks are decompressed and/or decrypted if necessary. In the collation phase, blocks are sorted according to their file type. In the reassembly phase, the blocks are placed in sequence to reproduce the deleted files. The SmartCarving algorithm is the basis for the Adroit Photo Forensics and Adroit Photo Recovery applications from Digital Assembly.
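
The sketch below mirrors the three phases in schematic form; the identity preprocessing step, the printable-byte classifier and the greedy, caller-supplied adjacency score are simplistic stand-ins for the filesystem-aware heuristics of the actual algorithm.

```python
# Schematic three-phase pipeline in the spirit of SmartCarving. Every heuristic
# here (identity preprocessing, printable-byte classification, greedy chaining
# by a caller-supplied adjacency score) is a simplified placeholder.

def preprocess(blocks, transform=None):
    """Phase 1: decompress/decrypt raw blocks if necessary (identity by default)."""
    return [transform(b) if transform else b for b in blocks]

def collate(blocks):
    """Phase 2: bucket block indices by a crude type guess (text vs. binary)."""
    buckets = {"text": [], "binary": []}
    for i, block in enumerate(blocks):
        printable = sum(32 <= byte < 127 or byte in (9, 10, 13) for byte in block)
        kind = "text" if printable / max(len(block), 1) > 0.85 else "binary"
        buckets[kind].append(i)
    return buckets

def reassemble(blocks, indices, adjacency_score):
    """Phase 3: greedily chain blocks of one bucket by best pairwise adjacency score."""
    remaining = list(indices)
    order = [remaining.pop(0)]
    while remaining:
        best = max(remaining, key=lambda j: adjacency_score(blocks[order[-1]], blocks[j]))
        order.append(best)
        remaining.remove(best)
    return b"".join(blocks[i] for i in order)
```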

Carving memory dumps

Snapshots of computers' volatile memory (i.e. RAM) can also be carved. Memory-dump carving is routinely used in digital forensics, allowing investigators to access ephemeral evidence such as recently viewed images and web pages, documents, chats and communications conducted via social networks. For example, LiME [10] can be used in conjunction with Volatility [11] to perform such a task. If an encrypted volume (TrueCrypt, BitLocker, PGP Disk) was in use, the binary keys to the encrypted containers can be extracted from memory and used to mount such volumes instantly. Because the content of volatile memory is often fragmented, Belkasoft developed a proprietary carving algorithm (BelkaCarving) to enable carving of fragmented memory dumps.
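
As a simple illustration (not a substitute for structure-aware tools such as Volatility), the sketch below carves URL strings, traces of recently visited web pages, out of a raw memory dump; the dump filename is an assumption.

```python
import re

# Minimal sketch: pull URL strings (traces of recently visited web pages) out
# of a raw memory dump. The dump filename is an assumption; structure-aware
# tools reconstruct far richer evidence from the same data.

URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]{4,200}")

def carve_urls(dump_path):
    with open(dump_path, "rb") as f:
        data = f.read()          # use mmap for multi-gigabyte dumps
    return sorted({m.group().decode("ascii", "replace") for m in URL_PATTERN.finditer(data)})

if __name__ == "__main__":
    for url in carve_urls("memory.lime"):
        print(url)
```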

See also

New Technology File System (NTFS) is a proprietary journaling file system developed by Microsoft. Starting with Windows NT 3.1, it is the default file system of the Windows NT family. It superseded File Allocation Table (FAT) as the preferred filesystem on Windows and is supported in Linux and BSD as well. NTFS reading and writing support is provided using a free and open-source kernel implementation known as NTFS3 in Linux and the NTFS-3G driver in BSD. By using the convert command, Windows can convert FAT32/16/12 into NTFS without the need to rewrite all files. NTFS uses several files typically hidden from the user to store metadata about other files stored on the drive which can help improve speed and performance when reading data. Unlike FAT and High Performance File System (HPFS), NTFS supports access control lists (ACLs), filesystem encryption, transparent compression, sparse files and file system journaling. NTFS also supports shadow copy to allow backups of a system while it is running, but the functionality of the shadow copies varies between different versions of Windows.

ext3, or third extended filesystem, is a journaled file system that is commonly used by the Linux kernel. It used to be the default file system for many popular Linux distributions. Stephen Tweedie first revealed that he was working on extending ext2 in Journaling the Linux ext2fs Filesystem in a 1998 paper, and later in a February 1999 kernel mailing list posting. The filesystem was merged with the mainline Linux kernel in November 2001 from 2.4.15 onward. Its main advantage over ext2 is journaling, which improves reliability and eliminates the need to check the file system after an unclean shutdown. Its successor is ext4.

<span class="mw-page-title-main">Defragmentation</span> Rearrangement of sectors on a hard disk into contiguous units

In the maintenance of file systems, defragmentation is a process that reduces the degree of fragmentation. It does this by physically organizing the contents of the mass storage device used to store files into the smallest number of contiguous regions. It also attempts to create larger regions of free space using compaction to impede the return of fragmentation. Some defragmentation utilities try to keep smaller files within a single directory together, as they are often accessed in sequence.

<span class="mw-page-title-main">Computer forensics</span> Branch of digital forensic science

Computer forensics is a branch of digital forensic science pertaining to evidence found in computers and digital storage media. The goal of computer forensics is to examine digital media in a forensically sound manner with the aim of identifying, preserving, recovering, analyzing and presenting facts and opinions about the digital information.

In computing, the Global File System 2 or GFS2 is a shared-disk file system for Linux computer clusters. GFS2 allows all members of a cluster to have direct concurrent access to the same shared block storage, in contrast to distributed file systems which distribute data throughout the cluster. GFS2 can also be used as a local file system on a single computer.

<span class="mw-page-title-main">File system</span> Format or program for storing files and directories

In computing, a file system or filesystem is a method and data structure that the operating system uses to control how data is stored and retrieved. Without a file system, data placed in a storage medium would be one large body of data with no way to tell where one piece of data stopped and the next began, or where any piece of data was located when it was time to retrieve it. By separating the data into pieces and giving each piece a name, the data are easily isolated and identified. Taking its name from the way a paper-based data management system is named, each group of data is called a "file". The structure and logic rules used to manage the groups of data and their names is called a "file system."

The Smart File System (SFS) is a journaling filesystem used on Amiga computers and AmigaOS-derived operating systems. It is designed for performance, scalability and integrity, offering improvements over standard Amiga filesystems as well as some special or unique features.

The Write Anywhere File Layout (WAFL) is a proprietary file system that supports large, high-performance RAID arrays, quick restarts without lengthy consistency checks in the event of a crash or power failure, and growing the filesystem's size quickly. It was designed by NetApp for use in its storage appliances like NetApp FAS, AFF, Cloud Volumes ONTAP and ONTAP Select.

In computing, an extent is a contiguous area of storage reserved for a file in a file system, represented as a range of block numbers, or tracks on count key data devices. A file can consist of zero or more extents; one file fragment requires one extent. The direct benefit is in storing each range compactly as two numbers, instead of canonically storing every block number in the range. Also, extent allocation results in less file fragmentation.

Undeletion is a feature for restoring computer files which have been removed from a file system by file deletion. Deleted data can be recovered on many file systems, but not all file systems provide an undeletion feature. Recovering data without an undeletion facility is usually called data recovery, rather than undeletion. Undeletion can help prevent users from accidentally losing data, but it can also pose a computer security risk, since users may not be aware that deleted files remain accessible.

In Linux, Logical Volume Manager (LVM) is a device mapper framework that provides logical volume management for the Linux kernel. Most modern Linux distributions are LVM-aware to the point of being able to have their root file systems on a logical volume.

<span class="mw-page-title-main">TestDisk</span>

TestDisk is a free and open-source data recovery utility that helps users recover lost partitions or repair corrupted filesystems. TestDisk can collect detailed information about a corrupted drive, which can then be sent to a technician for further analysis. TestDisk supports DOS, Microsoft Windows, Linux, FreeBSD, NetBSD, OpenBSD, SunOS, and macOS. It handles both non-partitioned and partitioned media; in particular, it recognizes the GUID Partition Table (GPT), Apple partition map, PC/Intel BIOS partition tables, Sun Solaris slices and the Xbox fixed partitioning scheme. TestDisk uses a command-line user interface.

<span class="mw-page-title-main">PhotoRec</span> Open source data recovery software

PhotoRec is a free and open-source data recovery utility with a text-based user interface that uses data carving techniques. It is designed to recover lost files from digital camera memory cards, hard disks and CD-ROMs. It can recover files with more than 480 file extensions, and custom file signatures can be added to detect less common file types.

The following tables compare general and technical information for a number of file systems.

ext4 is a journaling file system for Linux, developed as the successor to ext3.

Disk encryption is a technology which protects information by converting it into code that cannot be deciphered easily by unauthorized people or processes. Disk encryption uses disk encryption software or hardware to encrypt every bit of data that goes on a disk or disk volume. It is used to prevent unauthorized access to data storage.

Anti–computer forensics or counter-forensics are techniques used to obstruct forensic analysis.

Photo recovery is the process of salvaging digital photographs from damaged, failed, corrupted, or inaccessible secondary storage media when it cannot be accessed normally. Photo recovery can be considered a subset of the overall data recovery field.

Optimistic decompression is a digital forensics technique in which each byte of an input buffer is examined for the possibility of compressed data. If data is found that might be compressed, a decompression algorithm is invoked to perform a trial decompression. If the decompressor does not produce an error, the decompressed data is processed. The decompressor is thus called optimistically, that is, with the hope that it might be successful.

References

  1. "File Signatures".
  2. Simson Garfinkel, "Carving Contiguous and Fragmented Files with Fast Object Validation", archived 2012-05-23 at the Wayback Machine, in Proceedings of the 2007 Digital Forensics Research Workshop, DFRWS, Pittsburgh, PA, August 2007.
  3. A. Pal and N. Memon, "Automated reassembly of file fragmented images using greedy algorithms" (URL now invalid), IEEE Transactions on Image Processing, February 2006, pp. 385–393.
  4. A. Pal, T. Sencar and N. Memon, "Detecting File Fragmentation Point Using Sequential Hypothesis Testing" (URL now invalid), Digital Investigation, Fall 2008.
  5. Richard, Golden; Roussev, V., "Scalpel: a frugal, high performance file carver", archived 2019-02-09 at the Wayback Machine, in Proceedings of the 2005 Digital Forensics Research Workshop, DFRWS, August 2005.
  6. https://github.com/sleuthkit/scalpel
  7. https://foremost.sourceforge.net/
  8. "Professional Data Recovery Services | SERT Data Recovery Company". Archived from the original on 2015-05-12. Retrieved 2015-05-05.
  9. "Understanding Deleted Files"
  10. https://github.com/504ensicsLabs/LiME
  11. https://github.com/volatilityfoundation/volatility