Data binning

Last updated

Data binning, also called data discrete binning or data bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values which fall into a given small interval, a bin , are replaced by a value representative of that interval, often a central value (mean or median).[ citation needed ] It is related to quantization: data binning operates on the abscissa axis while quantization operates on the ordinate axis. Binning is a generalization of rounding.

Contents

Statistical data binning is a way to group numbers of more-or-less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals (for example, grouping every five years together). It can also be used in multivariate statistics, binning in several dimensions at once.

In digital image processing, "binning" has a very different meaning. Pixel binning is the process of combining blocks of adjacent pixels throughout an image, by summing or averaging their values, during or after readout. It reduces the amount of data; also the relative noise level in the result is lower.

Example usage

Histograms are an example of data binning used in order to observe underlying frequency distributions. They typically occur in one-dimensional space and in equal intervals for ease of visualization.

Data binning may be used when small instrumental shifts in the spectral dimension from mass spectrometry (MS) or nuclear magnetic resonance (NMR) experiments will be falsely interpreted as representing different components, when a collection of data profiles is subjected to pattern recognition analysis. A straightforward way to cope with this problem is by using binning techniques in which the spectrum is reduced in resolution to a sufficient degree to ensure that a given peak remains in its bin despite small spectral shifts between analyses. For example, in NMR the chemical shift axis may be discretized and coarsely binned, and in MS the spectral accuracies may be rounded to integer atomic mass unit values. Also, several digital camera systems incorporate an automatic pixel binning function to improve image contrast. [1]

Binning is also used in machine learning to speed up [2] the decision-tree boosting method for supervised classification and regression in algorithms such as Microsoft's LightGBM and scikit-learn's Histogram-based Gradient Boosting Classification Tree.

See also

Related Research Articles

<span class="mw-page-title-main">Histogram</span> Graphical representation of the distribution of numerical data

A histogram is an approximate representation of the distribution of numerical data. The term was first introduced by Karl Pearson. To construct a histogram, the first step is to "bin" the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often of equal size.

<span class="mw-page-title-main">Compression artifact</span> Distortion of media caused by lossy data compression

A compression artifact is a noticeable distortion of media caused by the application of lossy compression. Lossy data compression involves discarding some of the media's data so that it becomes small enough to be stored within the desired disk space or transmitted (streamed) within the available bandwidth. If the compressor cannot store enough data in the compressed version, the result is a loss of quality, or introduction of artifacts. The compression algorithm may not be intelligent enough to discriminate between distortions of little subjective importance and those objectionable to the user.

<span class="mw-page-title-main">Sampling (signal processing)</span> Measurement of a signal at discrete time intervals

In signal processing, sampling is the reduction of a continuous-time signal to a discrete-time signal. A common example is the conversion of a sound wave to a sequence of "samples". A sample is a value of the signal at a point in time and/or space; this definition differs from the term's usage in statistics, which refers to a set of such values.

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

In image processing and photography, a color histogram is a representation of the distribution of colors in an image. For digital images, a color histogram represents the number of pixels that have colors in each of a fixed list of color ranges, that span the image's color space, the set of all possible colors.

<span class="mw-page-title-main">Canny edge detector</span> Image edge detection algorithm

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. It was developed by John F. Canny in 1986. Canny also produced a computational theory of edge detection explaining why the technique works.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

<span class="mw-page-title-main">Octree</span> Tree data structure in which each internal node has exactly eight children, to partition a 3D space

An octree is a tree data structure in which each internal node has exactly eight children. Octrees are most often used to partition a three-dimensional space by recursively subdividing it into eight octants. Octrees are the three-dimensional analog of quadtrees. The word is derived from oct + tree. Octrees are often used in 3D graphics and 3D game engines.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

Dither is an intentionally applied form of noise used to randomize quantization error, preventing large-scale patterns such as color banding in images. Dither is routinely used in processing of both digital audio and video data, and is often one of the last stages of mastering audio to a CD.

<span class="mw-page-title-main">Gaussian blur</span> Type of image blur produced by a Gaussian function

In image processing, a Gaussian blur is the result of blurring an image by a Gaussian function.

In computer graphics, image tracing, raster-to-vector conversion or raster vectorization is the conversion of raster graphics into vector graphics.

<span class="mw-page-title-main">Median filter</span> Non-linear digital filtering technique to remove noise

The median filter is a non-linear digital filtering technique, often used to remove noise from an image or signal. Such noise reduction is a typical pre-processing step to improve the results of later processing. Median filtering is very widely used in digital image processing because, under certain conditions, it preserves edges while removing noise, also having applications in signal processing.

<span class="mw-page-title-main">Histogram equalization</span> Method in image processing of contrast adjustment using the images histogram

Histogram equalization is a method in image processing of contrast adjustment using the image's histogram.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

Adaptive histogram equalization (AHE) is a computer image processing technique used to improve contrast in images. It differs from ordinary histogram equalization in the respect that the adaptive method computes several histograms, each corresponding to a distinct section of the image, and uses them to redistribute the lightness values of the image. It is therefore suitable for improving the local contrast and enhancing the definitions of edges in each region of an image.

In various science/engineering applications, such as independent component analysis, image analysis, genetic analysis, speech recognition, manifold learning, and time delay estimation it is useful to estimate the differential entropy of a system or process, given some observations.

Median cut is an algorithm to sort data of an arbitrary number of dimensions into series of sets by recursively cutting each set of data at the median point along the longest dimension. Median cut is typically used for color quantization. For example, to reduce a 64k-colour image to 256 colours, median cut is used to find 256 colours that match the original data well.

Local binary patterns (LBP) is a type of visual descriptor used for classification in computer vision. LBP is the particular case of the Texture Spectrum model proposed in 1990. LBP was first described in 1994. It has since been found to be a powerful feature for texture classification; it has further been determined that when LBP is combined with the Histogram of oriented gradients (HOG) descriptor, it improves the detection performance considerably on some datasets. A comparison of several improvements of the original LBP in the field of background subtraction was made in 2015 by Silva et al. A full survey of the different versions of LBP can be found in Bouwmans et al.

LightGBM, short for light gradient-boosting machine, is a free and open-source distributed gradient-boosting framework for machine learning, originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks. The development focus is on performance and scalability.

References

  1. "Use of binning in photography". Nikon, FSU. Retrieved 2011-01-18.
  2. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". Neural Information Processing Systems (NIPS). Retrieved 2019-12-18.