Chunking (computing)

Last updated January 11, 2024

In computer programming, chunking has multiple meanings.

In memory management

Typical modern software systems allocate memory dynamically from structures known as heaps. Calls are made to heap-management routines to allocate and free memory. Heap management involves some computation time and can be a performance issue. Chunking refers to strategies for improving performance by using special knowledge of a situation to aggregate related memory-allocation requests. For example, if it is known that a certain kind of object will typically be required in groups of eight, instead of allocating and freeing each object individually, making sixteen calls to the heap manager, one could allocate and free an array of eight of the objects, reducing the number of calls to two.

In HTTP message transmission

Chunking is a specific feature of the HTTP 1.1 protocol.^[1] Here, the meaning is the opposite of that used in memory management. It refers to a facility that allows inconveniently large messages to be broken into conveniently-sized smaller "chunks".

In data deduplication, data synchronization and remote data compression

In data deduplication, data synchronization and remote data compression, Chunking is a process to split a file into smaller pieces called chunks by the chunking algorithm. It can help to eliminate duplicate copies of repeating data on storage, or reduces the amount of data sent over the network by only selecting changed chunks. The Content-Defined Chunking (CDC) algorithm like Rolling hash and its variants have been the most popular data deduplication algorithms for the last 15 years.^[2]

Related Research Articles

In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced; such memory is called garbage. Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.

<span class="mw-page-title-main">Cache (computing)</span> Additional storage that enables faster access to main storage

In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere. A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

Hyphanet is a peer-to-peer platform for censorship-resistant, anonymous communication. It uses a decentralized distributed data store to keep and deliver information, and has a suite of free software for publishing and communicating on the Web without fear of censorship. Both Freenet and some of its associated tools were originally designed by Ian Clarke, who defined Freenet's goal as providing freedom of speech on the Internet with strong anonymity protection.

Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

Java and C++ are two prominent object-oriented programming languages. By many language popularity metrics, the two languages have dominated object-oriented and high-performance software development for much of the 21st century, and are often directly compared and contrasted. Java's syntax was based on C/C++.

Memory management is a form of resource management applied to computer memory. The essential requirement of memory management is to provide ways to dynamically allocate portions of memory to programs at their request, and free it for reuse when no longer needed. This is critical to any advanced computer system where more than a single process might be underway at any time.

A database engine is the underlying software component that a database management system (DBMS) uses to create, read, update and delete (CRUD) data from a database. Most database management systems include their own application programming interface (API) that allows the user to interact with their underlying engine without going through the user interface of the DBMS.

C dynamic memory allocation refers to performing manual memory management for dynamic memory allocation in the C programming language via a group of functions in the C standard library, namely malloc, realloc, calloc, aligned_alloc and free.

In computer programming, tracing garbage collection is a form of automatic memory management that consists of determining which objects should be deallocated by tracing which objects are reachable by a chain of references from certain "root" objects, and considering the rest as "garbage" and collecting them. Tracing garbage collection is the most common type of garbage collection – so much so that "garbage collection" often refers to tracing garbage collection, rather than other methods such as reference counting – and there are a large number of algorithms used in implementation.

In computer science, thrashing occurs in a system with virtual memory when a computer's real storage resources are overcommitted, leading to a constant state of paging and page faults, slowing most application-level processing. This causes the performance of the computer to degrade or collapse. The situation can continue indefinitely until the user closes some running applications or the active processes free up additional virtual memory resources.

In computer storage, fragmentation is a phenomenon in which storage space, main storage or secondary storage, is used inefficiently, reducing capacity or performance and often both. The exact consequences of fragmentation depend on the specific system of storage allocation in use and the particular form of fragmentation. In many cases, fragmentation leads to storage space being "wasted", and in that case the term also refers to the wasted space itself.

Slab allocation is a memory management mechanism intended for the efficient memory allocation of objects. In comparison with earlier mechanisms, it reduces fragmentation caused by allocations and deallocations. This technique is used for retaining allocated memory containing a data object of a certain type for reuse upon subsequent allocations of objects of the same type. It is analogous to an object pool, but only applies to memory, not other resources.

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

WAN optimization is a collection of techniques for improving data transfer across wide area networks (WANs). In 2008, the WAN optimization market was estimated to be $1 billion, and was to grow to $4.4 billion by 2014 according to Gartner, a technology research firm. In 2015 Gartner estimated the WAN optimization market to be a $1.1 billion market.

<span class="mw-page-title-main">Hashed array tree</span>

In computer science, a hashed array tree (HAT) is a dynamic array data-structure published by Edward Sitarski in 1996, maintaining an array of separate memory fragments to store the data elements, unlike simple dynamic arrays which maintain their data in one contiguous memory area. Its primary objective is to reduce the amount of element copying due to automatic array resizing operations, and to improve memory usage patterns.

Byte serving is the process introduced in HTTP protocol 1.1 of sending only a portion of a message from a server to a client. Byte serving begins when an HTTP server advertises its willingness to serve partial requests using the Accept-Ranges response header. A client then requests a specific part of a file from the server using the Range request header. If the range is valid, the server sends it to the client with a 206 Partial Content status code and a Content-Range header listing the range sent. If the range is invalid, the server responds with a 416 Requested Range Not Satisfiable status code.

Microsoft SQL Server is a proprietary relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent.

A flash memory controller manages data stored on flash memory and communicates with a computer or electronic device. Flash memory controllers can be designed for operating in low duty-cycle environments like memory cards, or other similar media for use in PDAs, mobile phones, etc. USB flash drives use flash memory controllers designed to communicate with personal computers through the USB port at a low duty-cycle. Flash controllers can also be designed for higher duty-cycle environments like solid-state drives (SSD) used as data storage for laptop computer systems up to mission-critical enterprise storage arrays.

ONTAP or Data ONTAP or Clustered Data ONTAP (cDOT) or Data ONTAP 7-Mode is NetApp's proprietary operating system used in storage disk arrays such as NetApp FAS and AFF, ONTAP Select, and Cloud Volumes ONTAP. With the release of version 9.0, NetApp decided to simplify the Data ONTAP name and removed the word "Data" from it, and remove the 7-Mode image, therefore, ONTAP 9 is the successor of Clustered Data ONTAP 8.

References

↑ "HTTP/1.1: Protocol Parameters" . Retrieved 2019-12-10.
↑ FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication (PDF). USENIX ATC ’16. 2016. Retrieved 2019-12-10.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "HTTP/1.1: Protocol Parameters" . Retrieved 2019-12-10.

[2] FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication (PDF). USENIX ATC ’16. 2016. Retrieved 2019-12-10.

[1]

[2]