Data augmentation

Last updated

Data augmentation is a statistical technique which allows maximum likelihood estimation from incomplete data. [1] [2] Data augmentation has important applications in Bayesian analysis, [3] and the technique is widely used in machine learning to reduce overfitting when training machine learning models, [4] achieved by training models on several slightly-modified copies of existing data.

Contents

Synthetic oversampling techniques for traditional machine learning

Synthetic Minority Over-sampling Technique (SMOTE) is a method used to address imbalanced datasets in machine learning. In such datasets, the number of samples in different classes varies significantly, leading to biased model performance. For example, in a medical diagnosis dataset with 90 samples representing healthy individuals and only 10 samples representing individuals with a particular disease, traditional algorithms may struggle to accurately classify the minority class. SMOTE rebalances the dataset by generating synthetic samples for the minority class. For instance, if there are 100 samples in the majority class and 10 in the minority class, SMOTE can create synthetic samples by randomly selecting a minority class sample and its nearest neighbors, then generating new samples along the line segments joining these neighbors. This process helps increase the representation of the minority class, improving model performance. [5]

Data augmentation for image classification

When convolutional neural networks grew larger in mid-1990s, there was a lack of data to use, especially considering that some part of the overall dataset should be spared for later testing. It was proposed to perturb existing data with affine transformations to create new examples with the same labels, [6] which were complemented by so-called elastic distortions in 2003, [7] and the technique was widely used as of 2010s. [8] Data augmentation can enhance CNN performance and acts as a countermeasure against CNN profiling attacks. [9]

Data augmentation has become fundamental in image classification, enriching training dataset diversity to improve model generalization and performance. The evolution of this practice has introduced a broad spectrum of techniques, including geometric transformations, color space adjustments, and noise injection. [10]

Geometric Transformations

Geometric transformations alter the spatial properties of images to simulate different perspectives, orientations, and scales. Common techniques include:

Color Space Transformations

Color space transformations modify the color properties of images, addressing variations in lighting, color saturation, and contrast. Techniques include:

Noise Injection

Injecting noise into images simulates real-world imperfections, teaching models to ignore irrelevant variations. Techniques involve:

Data augmentation for signal processing

Residual or block bootstrap can be used for time series augmentation.

Biological signals

Synthetic data augmentation is of paramount importance for machine learning classification, particularly for biological data, which tend to be high dimensional and scarce. The applications of robotic control and augmentation in disabled and able-bodied subjects still rely mainly on subject-specific analyses. Data scarcity is notable in signal processing problems such as for Parkinson's Disease Electromyography signals, which are difficult to source - Zanini, et al. noted that it is possible to use a generative adversarial network (in particular, a DCGAN) to perform style transfer in order to generate synthetic electromyographic signals that corresponded to those exhibited by sufferers of Parkinson's Disease. [11]

The approaches are also important in electroencephalography (brainwaves). Wang, et al. explored the idea of using deep convolutional neural networks for EEG-Based Emotion Recognition, results show that emotion recognition was improved when data augmentation was used. [12]

A common approach is to generate synthetic signals by re-arranging components of real data. Lotte [13] proposed a method of "Artificial Trial Generation Based on Analogy" where three data examples provide examples and an artificial is formed which is to what is to . A transformation is applied to to make it more similar to , the same transformation is then applied to which generates . This approach was shown to improve performance of a Linear Discriminant Analysis classifier on three different datasets.

Current research shows great impact can be derived from relatively simple techniques. For example, Freer [14] observed that introducing noise into gathered data to form additional data points improved the learning ability of several models which otherwise performed relatively poorly. Tsinganos et al. [15] studied the approaches of magnitude warping, wavelet decomposition, and synthetic surface EMG models (generative approaches) for hand gesture recognition, finding classification performance increases of up to +16% when augmented data was introduced during training. More recently, data augmentation studies have begun to focus on the field of deep learning, more specifically on the ability of generative models to create artificial data which is then introduced during the classification model training process. In 2018, Luo et al. [16] observed that useful EEG signal data could be generated by Conditional Wasserstein Generative Adversarial Networks (GANs) which was then introduced to the training set in a classical train-test learning framework. The authors found classification performance was improved when such techniques were introduced.

Mechanical signals

The prediction of mechanical signals based on data augmentation brings a new generation of technological innovations, such as new energy dispatch, 5G communication field, and robotics control engineering. [17] In 2022, Yang et al. [17] integrate constraints, optimization and control into a deep network framework based on data augmentation and data pruning with spatio-temporal data correlation, and improve the interpretability, safety and controllability of deep learning in real industrial projects through explicit mathematical programming equations and analytical solutions.

See also

Related Research Articles

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent pattern. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

Unsupervised learning is a method in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and then generate imaginative content from it.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

Within statistics, oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set. These terms are used both in statistical sampling, survey design methodology and in machine learning.

<span class="mw-page-title-main">Time delay neural network</span>

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

There are many types of artificial neural networks (ANN).

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is the subset of machine learning methods based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Feature learning</span> Set of learning techniques in machine learning

In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.

<span class="mw-page-title-main">MNIST database</span> Database of handwritten digits

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.

Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.

<span class="mw-page-title-main">Quantum machine learning</span> Interdisciplinary research area at the intersection of quantum physics and machine learning

Quantum machine learning is the integration of quantum algorithms within machine learning programs.

<span class="mw-page-title-main">Generative adversarial network</span> Deep learning method

A generative adversarial network (GAN) is a class of machine learning frameworks and a prominent framework for approaching generative AI. The concept was initially developed by Ian Goodfellow and his colleagues in June 2014. In a GAN, two neural networks contest with each other in the form of a zero-sum game, where one agent's gain is another agent's loss.

An MRI artifact is a visual artifact in magnetic resonance imaging (MRI). It is a feature appearing in an image that is not present in the original object. Many different artifacts can occur during MRI, some affecting the diagnostic quality, while others may be confused with pathology. Artifacts can be classified as patient-related, signal processing-dependent and hardware (machine)-related.

EEG analysis is exploiting mathematical signal analysis methods and computer technology to extract information from electroencephalography (EEG) signals. The targets of EEG analysis are to help researchers gain a better understanding of the brain; assist physicians in diagnosis and treatment choices; and to boost brain-computer interface (BCI) technology. There are many ways to roughly categorize EEG analysis methods. If a mathematical model is exploited to fit the sampled EEG signals, the method can be categorized as parametric, otherwise, it is a non-parametric method. Traditionally, most EEG analysis methods fall into four categories: time domain, frequency domain, time-frequency domain, and nonlinear methods. There are also later methods including deep neural networks (DNNs).

<span class="mw-page-title-main">StyleGAN</span> Novel generative adversarial network

StyleGAN is a generative adversarial network (GAN) introduced by Nvidia researchers in December 2018, and made source available in February 2019.

<span class="mw-page-title-main">Variational autoencoder</span> Deep learning generative model to encode data representation

In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods.

An energy-based model (EBM) (also called a Canonical Ensemble Learning(CEL) or Learning via Canonical Ensemble (LCE)) is an application of canonical ensemble formulation of statistical physics for learning from data problems. The approach prominently appears in generative models (GMs).

Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological features identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data while eliminating the need for explicit instructions and programming. Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components including the solid earth, atmosphere, hydrosphere and biosphere.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

References

  1. Dempster, A.P.; Laird, N.M.; Rubin, D.B. (1977). "Maximum Likelihood from Incomplete Data Via the EM Algorithm". Journal of the Royal Statistical Society. Series B (Methodological). 39 (1): 1–22. doi:10.1111/j.2517-6161.1977.tb01600.x.
  2. Rubin, Donald (1987). "Comment: The Calculation of Posterior Distributions by Data Augmentation". Journal of the American Statistical Association. 82 (398). doi:10.2307/2289460. JSTOR   2289460.
  3. Jackman, Simon (2009). Bayesian Analysis for the Social Sciences. John Wiley & Sons. p. 236. ISBN   978-0-470-01154-6.
  4. Shorten, Connor; Khoshgoftaar, Taghi M. (2019). "A survey on Image Data Augmentation for Deep Learning". Mathematics and Computers in Simulation. 6. springer: 60. doi: 10.1186/s40537-019-0197-0 .
  5. Wang, Shujuan; Dai, Yuntao; Shen, Jihong; Xuan, Jingxue (2021-12-15). "Research on expansion and classification of imbalanced data based on SMOTE algorithm". Scientific Reports. 11 (1): 24039. Bibcode:2021NatSR..1124039W. doi:10.1038/s41598-021-03430-5. ISSN   2045-2322. PMC   8674253 . PMID   34912009.
  6. Yann Lecun; et al. (1995). Learning algorithms for classification: A comparison on handwritten digit recognition (Conference paper). World Scientific. pp. 261–276. Retrieved 14 May 2023.{{cite book}}: |website= ignored (help)
  7. Simard, P.Y.; Steinkraus, D.; Platt, J.C. (2003). "Best practices for convolutional neural networks applied to visual document analysis". Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. Vol. 1. pp. 958–963. doi:10.1109/ICDAR.2003.1227801. ISBN   0-7695-1960-1. S2CID   4659176.
  8. Hinton, Geoffrey E.; Srivastava, Nitish; Krizhevsky, Alex; Sutskever, Ilya; Salakhutdinov, Ruslan R. (2012). "Improving neural networks by preventing co-adaptation of feature detectors". arXiv: 1207.0580 [cs.NE].
  9. Cagli, Eleonora; Dumas, Cécile; Prouff, Emmanuel (2017). "Convolutional Neural Networks with Data Augmentation Against Jitter-Based Countermeasures: Profiling Attacks Without Pre-processing". In Fischer, Wieland; Homma, Naofumi (eds.). Cryptographic Hardware and Embedded Systems – CHES 2017. Lecture Notes in Computer Science. Vol. 10529. Cham: Springer International Publishing. pp. 45–68. doi:10.1007/978-3-319-66787-4_3. ISBN   978-3-319-66787-4. S2CID   54088207.
  10. Shorten, Connor; Khoshgoftaar, Taghi M. (2019-07-06). "A survey on Image Data Augmentation for Deep Learning". Journal of Big Data. 6 (1): 60. doi: 10.1186/s40537-019-0197-0 . ISSN   2196-1115.
  11. Anicet Zanini, Rafael; Luna Colombini, Esther (2020). "Parkinson's Disease EMG Data Augmentation and Simulation with DCGANs and Style Transfer". Sensors. 20 (9): 2605. Bibcode:2020Senso..20.2605A. doi: 10.3390/s20092605 . ISSN   1424-8220. PMC   7248755 . PMID   32375217.
  12. Wang, Fang; Zhong, Sheng-hua; Peng, Jianfeng; Jiang, Jianmin; Liu, Yan (2018). "Data Augmentation for EEG-Based Emotion Recognition with Deep Convolutional Neural Networks". MultiMedia Modeling. Lecture Notes in Computer Science. Vol. 10705. pp. 82–93. doi:10.1007/978-3-319-73600-6_8. ISBN   978-3-319-73599-3. ISSN   0302-9743.
  13. Lotte, Fabien (2015). "Signal Processing Approaches to Minimize or Suppress Calibration Time in Oscillatory Activity-Based Brain–Computer Interfaces" (PDF). Proceedings of the IEEE. 103 (6): 871–890. doi:10.1109/JPROC.2015.2404941. ISSN   0018-9219. S2CID   22472204.
  14. Freer, Daniel; Yang, Guang-Zhong (2020). "Data augmentation for self-paced motor imagery classification with C-LSTM". Journal of Neural Engineering. 17 (1): 016041. Bibcode:2020JNEng..17a6041F. doi:10.1088/1741-2552/ab57c0. hdl: 10044/1/75376 . ISSN   1741-2552. PMID   31726440. S2CID   208034533.
  15. Tsinganos, Panagiotis; Cornelis, Bruno; Cornelis, Jan; Jansen, Bart; Skodras, Athanassios (2020). "Data Augmentation of Surface Electromyography for Hand Gesture Recognition". Sensors. 20 (17): 4892. Bibcode:2020Senso..20.4892T. doi: 10.3390/s20174892 . ISSN   1424-8220. PMC   7506981 . PMID   32872508.
  16. Luo, Yun; Lu, Bao-Liang (2018). "EEG Data Augmentation for Emotion Recognition Using a Conditional Wasserstein GAN". 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Vol. 2018. pp. 2535–2538. doi:10.1109/EMBC.2018.8512865. ISBN   978-1-5386-3646-6. PMID   30440924. S2CID   53105445.
  17. 1 2 Yang, Yang (2022). "Wind speed forecasting with correlation network pruning and augmentation: A two-phase deep learning method". Renewable Energy. 198 (1): 267–282. arXiv: 2306.01986 . doi:10.1016/j.renene.2022.07.125. ISSN   0960-1481. S2CID   251511199.