This article has multiple issues. Please help improve it or discuss these issues on the talk page . (Learn how and when to remove these template messages)
|
Part of a series on |
Machine learning and data mining |
---|
Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by neural circuitry. [1] [lower-alpha 1] While some of the computational implementations ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron. [1] Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter". [2]
Later, advances in hardware and the development of the backpropagation algorithm as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s, saw the development of a deep neural network (a neural network with many layers) called AlexNet. [3] It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring, and further increasing interest in ANNs. [4] The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language, [5] and is the predominant architecture used by large language models, such as GPT-4. Diffusion models were first described in 2015, and began to be used by image generation models such as DALL-E in the 2020s.[ citation needed ]
The simplest kind of feedforward neural network is a linear network, which consists of a single layer of output nodes; the inputs are fed directly to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each node. The mean squared errors between these calculated outputs and a given target values are minimized by creating an adjustment to the weights. This technique has been known for over two centuries as the method of least squares or linear regression. It was used as a means of finding a good rough linear fit to a set of points by Legendre (1805) and Gauss (1795) for the prediction of planetary movement. [6] [7] [8] [9] [10]
Warren McCulloch and Walter Pitts [11] (1943) also considered a non-learning computational model for neural networks. [12] This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata. [13]
In the early 1940s, D. O. Hebb [14] created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. Farley and Clark [15] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956). [16]
Rosenblatt [1] (1958) created the perceptron, an algorithm for pattern recognition. With mathematical notation, Rosenblatt described circuitry not in the basic perceptron, such as the exclusive-or circuit that could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells. [17]
Some say that research stagnated following Minsky and Papert (1969), [18] who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. However, by the time this book came out, methods for training multilayer perceptrons (MLPs) by deep learning were already known. [9]
The first deep learning MLP was published by Alexey Grigorevich Ivakhnenko and Valentin Lapa in 1965, as the Group Method of Data Handling. [19] [20] [21] This method employs incremental layer by layer training based on regression analysis, where useless units in hidden layers are pruned with the help of a validation set.
The first deep learning MLP trained by stochastic gradient descent [22] was published in 1967 by Shun'ichi Amari. [23] [9] In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned useful internal representations to classify non-linearily separable pattern classes. [9]
The backpropagation algorithm is an efficient application of the Leibniz chain rule (1673) [24] to networks of differentiable nodes. [9] It is also known as the reverse mode of automatic differentiation or reverse accumulation, due to Seppo Linnainmaa (1970). [25] [26] [27] [28] [9] The term "back-propagating errors" was introduced in 1962 by Frank Rosenblatt, [29] [9] but he did not have an implementation of this procedure, although Henry J. Kelley had a continuous precursor of backpropagation [30] already in 1960 in the context of control theory. [9] In 1982, Paul Werbos applied backpropagation to MLPs in the way that has become standard. [31] In 1986, David E. Rumelhart et al. published an experimental analysis of the technique. [32]
Wilhelm Lenz and Ernst Ising created and analyzed the Ising model (1925) [33] which is essentially a non-learning artificial recurrent neural network (RNN) consisting of neuron-like threshold elements. [9] In 1972, Shun'ichi Amari made this architecture adaptive. [34] [9] His learning RNN was popularised by John Hopfield in 1982. [35]
Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982. [36] [37] SOMs are neurophysiologically inspired [38] artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.
SOMs create internal representations reminiscent of the cortical homunculus, [39] a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.
The origin of the CNN architecture is the "neocognitron" [40] introduced by Kunihiko Fukushima in 1980. [41] [42] It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function. [43] [9] The rectifier has become the most popular activation function for CNNs and deep neural networks in general. [44]
The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance. [45] It did so by utilizing weight sharing in combination with backpropagation training. [46] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one. [45]
In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system. [47] [48]
In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days. [49] Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang, et al. modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991 [50] and breast cancer detection in mammograms in 1994. [51]
In 1990 Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker independent isolated word recognition system. [52] In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch. [53] [54] [55] [56] Max-pooling is often used in modern CNNs. [57]
LeNet-5, a 7-level CNN by Yann LeCun et al. in 1998, [58] that classifies digits, was applied by several banks to recognize hand-written numbers on checks (British English : cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more layers of CNNs, so this technique is constrained by the availability of computing resources.
In 2010, Backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants. [59] Behnke (2003) relied only on the sign of the gradient (Rprop) [60] on problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992. [61]
In 2011, a deep GPU-based CNN called "DanNet" by Dan Ciresan, Ueli Meier, and Juergen Schmidhuber achieved human-competitive performance for the first time in computer vision contests. [62] Subsequently, a similar GPU-based CNN by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge 2012. [63] A very deep CNN with over 100 layers by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun of Microsoft won the ImageNet 2015 contest. [64]
ANNs were able to guarantee shift invariance to deal with small and large natural objects in large cluttered scenes, only when invariance extended beyond shift, to all ANN-learned concepts, such as location, type (object class label), scale, lighting and others. This was realized in Developmental Networks (DNs) [65] whose embodiments are Where-What Networks, WWN-1 (2008) [66] through WWN-7 (2013). [67]
In 1991, Juergen Schmidhuber published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss. [68] [69] [70] The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity." Earlier adversarial machine learning systems "neither involved unsupervised neural networks nor were about modeling data nor used gradient descent." [70]
In 2014, this adversarial principle was used in a generative adversarial network (GAN) by Ian Goodfellow et al. [71] Here the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. This can be used to create realistic deepfakes. [72]
In 1992, Schmidhuber also published another type of gradient-based adversarial neural networks where the goal of the zero-sum game is to create disentangled representations of input patterns. This was called predictability minimization. [73] [74]
Nvidia's StyleGAN (2018) [75] is based on the Progressive GAN by Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. [76] Here the GAN generator is grown from small to large scale in a pyramidal fashion. StyleGANs improve consistency between fine and coarse details in the generator network.
Many modern large language models such as ChatGPT, GPT-4, and BERT use a feedforward neural network called Transformer by Ashish Vaswani et. al. in their 2017 paper "Attention Is All You Need." [77] Transformers have increasingly become the model of choice for natural language processing problems, [78] replacing recurrent neural networks (RNNs) such as long short-term memory (LSTM). [79]
Basic ideas for this go back a long way: in 1992, Juergen Schmidhuber published the Transformer with "linearized self-attention" (save for a normalization operator), [80] which is also called the "linear Transformer." [81] [82] [9] He advertised it as an "alternative to RNNs" [80] that can learn "internal spotlights of attention," [83] and experimentally applied it to problems of variable binding. [80] Here a slow feedforward neural network learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns called "FROM" and "TO" which in Transformer terminology are called "key" and "value" for "self-attention." [82] This fast weight "attention mapping" is applied to queries. The 2017 Transformer [77] combines this with a softmax operator and a projection matrix. [9]
Transformers are also increasingly being used in computer vision. [84]
In the 1980s, backpropagation did not work well for deep FNNs and RNNs. Here the word "deep" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. [85] The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For an FNN, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For RNNs, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited.
To overcome this problem, Juergen Schmidhuber (1992) proposed a self-supervised hierarchy of RNNs pre-trained one level at a time by self-supervised learning. [86] This "neural history compressor" uses predictive coding to learn internal representations at multiple self-organizing time scales. [9] The deep architecture may be used to reproduce the original data from the top level feature activations. [86] The RNN hierarchy can be "collapsed" into a single RNN, by "distilling" a higher level "chunker" network into a lower level "automatizer" network. [86] [9] In 1993, a chunker solved a deep learning task whose CAP depth exceeded 1000. [87] Such history compressors can substantially facilitate downstream supervised deep learning. [9]
Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine [88] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations. [89] [90] In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos. [91]
Sepp Hochreiter's diploma thesis (1991) [92] was called "one of the most important documents in the history of machine learning" by his supervisor Juergen Schmidhuber. [9] Hochreiter not only tested the neural history compressor, [86] but also identified and analyzed the vanishing gradient problem. [92] [93] He proposed recurrent residual connections to solve this problem. This led to the deep learning method called long short-term memory (LSTM), published in 1997. [94] LSTM recurrent neural networks can learn "very deep learning" tasks [85] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by Felix Gers, Schmidhuber and Fred Cummins. [95] LSTM has become the most cited neural network of the 20th century. [9]
In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used LSTM principles to create the Highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks. [96] [97] 7 months later, Kaiming He, Xiangyu Zhang; Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless Highway network variant called Residual neural network. [98] This has become the most cited neural network of the 21st century. [9]
In 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU [43] of Kunihiko Fukushima also helps to overcome the vanishing gradient problem, [99] compared to widely used activation functions prior to 2011.
The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s. [100]
Computational devices were created in CMOS, for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices [101] for very large scale principal components analyses and convolution may create a new class of neural computing because they are fundamentally analog rather than digital (even though the first implementations may use digital devices). [102] Ciresan and colleagues (2010) [103] in Schmidhuber's group showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.
Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning. [104] [105] For example, the bi-directional and multi-dimensional long short-term memory (LSTM) [106] [107] [108] [109] of Graves et al. won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three languages to be learned. [108] [107]
Ciresan and colleagues won pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, [110] the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge [111] and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance [62] on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem.
Researchers demonstrated (2010) that deep neural networks interfaced to a hidden Markov model with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary speech recognition tasks such as voice search.[ citation needed ]
GPU-based implementations [112] of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition, [110] the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge, [111] the ImageNet Competition [63] and others.
Deep, highly nonlinear neural architectures similar to the neocognitron [113] and the "standard architecture of vision", [114] inspired by simple and complex cells, were pre-trained with unsupervised methods by Hinton. [90] [89] A team from his lab won a 2012 contest sponsored by Merck to design software to help find molecules that might identify new drugs. [115]
Artificial neural networks are a branch of machine learning models that are built using principles of neuronal organization discovered by connectionism in the biological neural networks constituting animal brains.
Jürgen Schmidhuber is a German computer scientist noted for his work in the field of artificial intelligence, specifically artificial neural networks. He is a scientific director of the Dalle Molle Institute for Artificial Intelligence Research in Switzerland. He is also director of the Artificial Intelligence Initiative and professor of the Computer Science program in the Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) division at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia.
A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by direction of the flow of information between its layers. In contrast to the uni-directional feedforward neural network, it is a bi-directional artificial neural network, meaning that it allows the output from some nodes to affect subsequent input to the same nodes. Their ability to use internal state (memory) to process arbitrary sequences of inputs makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. The term "recurrent neural network" is used to refer to the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class of finite impulse response. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that can not be unrolled.
A neural network is a neural circuit of biological neurons, sometimes also called a biological neural network, or a network of artificial neurons or nodes in the case of an artificial neural network.
An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of hidden neurons are fixed and randomly assigned. The weights of output neurons can be learned so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behavior is non-linear, the only weights that are modified during training are for the synapses that connect the hidden neurons to output neurons. Thus, the error function is quadratic with respect to the parameter vector and can be differentiated easily to a linear system.
Long short-term memory (LSTM) network is a recurrent neural network (RNN), aimed to deal with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide a short-term memory for RNN that can last thousands of timesteps, thus "long short-term memory". It is applicable to classification, processing and predicting data based on time series, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.
Kunihiko Fukushima is a Japanese computer scientist, most noted for his work on artificial neural networks and deep learning. He is currently working part-time as a senior research scientist at the Fuzzy Logic Systems Institute in Fukuoka, Japan.
There are many types of artificial neural networks (ANN).
Deep learning is the subset of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
Convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns feature engineering by itself via filters optimization. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in the fully-connected layer 10,000 weights would be required for processing an image sized 100 × 100 pixels. However, applying cascaded convolution kernels, only 25 neurons are required to process 5x5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
In machine learning, the vanishing gradient problem is encountered when training recurrent neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training each of the neural networks weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that as the sequence length increases, the gradient magnitude typically is expected to decrease, slowing the training process. In the worst case, this may completely stop the neural network from further training. As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (0,1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient decreases exponentially with n while the early layers train very slowly.
Bidirectional recurrent neural networks (BRNN) connect two hidden layers of opposite directions to the same output. With this form of generative deep learning, the output layer can get information from past (backwards) and future (forward) states simultaneously. Invented in 1997 by Schuster and Paliwal, BRNNs were introduced to increase the amount of input information available to the network. For example, multilayer perceptron (MLPs) and time delay neural network (TDNNs) have limitations on the input data flexibility, as they require their input data to be fixed. Standard recurrent neural network (RNNs) also have restrictions as the future input information cannot be reached from the current state. On the contrary, BRNNs do not require their input data to be fixed. Moreover, their future input information is reachable from the current state.
This page is a timeline of machine learning. Major discoveries, achievements, milestones and other major events in machine learning are included.
AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto.
Felix Gers is a professor of computer science at Berlin University of Applied Sciences Berlin. With Jürgen Schmidhuber and Fred Cummins, he introduced the forget gate to the long short-term memory recurrent neural network architecture. This modification of the original architecture has been shown to be crucial to the success of the LSTM at such tasks as speech and handwriting recognition.
In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous artificial neural networks. It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by Long Short-Term Memory (LSTM) recurrent neural networks. The advantage of a Highway Network over the common deep neural networks is that it solves or partially prevents the vanishing gradient problem, thus leading to easier to optimize neural networks. The gating mechanisms facilitate information flow across many layers.
A Residual Neural Network is a deep learning model in which the weight layers learn residual functions with reference to the layer inputs. A Residual Network is a network with skip connections that perform identity mappings, merged with the layer outputs by addition. It behaves like a Highway Network whose gates are opened through strongly positive bias weights. This enables deep learning models with tens or hundreds of layers to train easily and approach better accuracy when going deeper. The identity skip connections, often referred to as "residual connections", are also used in the 1997 LSTM networks, Transformer models, the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.
LeNet is a convolutional neural network structure proposed by LeCun et al. in 1998. In general, LeNet refers to LeNet-5 and is a simple convolutional neural network. Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons can respond to a part of the surrounding cells in the coverage range and perform well in large-scale image processing.
A layer in a deep learning model is a structure or network topology in the model's architecture, which takes information from the previous layers and then passes it to the next layer.
{{cite book}}
: |journal=
ignored (help){{cite journal}}
: Cite journal requires |journal=
(help)Rectifier and softplus activation functions. The second one is a smooth version of the first.