Sepp Hochreiter

Born: 14 February 1967 (age 56)
Nationality: German
Alma mater: Technische Universität München
Scientific career
Fields: Machine learning, bioinformatics
Institutions: Johannes Kepler University Linz
Thesis: Generalisierung bei neuronalen Netzen geringer Komplexität (1999)
Doctoral advisor: Wilfried Brauer

Josef "Sepp" Hochreiter (born 14 February 1967) is a German computer scientist. Since 2018 he has led the Institute for Machine Learning at the Johannes Kepler University Linz, after having led the Institute of Bioinformatics there from 2006 to 2018. In 2017 he became the head of the Linz Institute of Technology (LIT) AI Lab. Hochreiter is also a founding director of the Institute of Advanced Research in Artificial Intelligence (IARAI). [1] Previously, he was at the Technical University of Berlin, at the University of Colorado at Boulder, and at the Technical University of Munich. He is a chair of the Critical Assessment of Massive Data Analysis (CAMDA) conference. [2]

Hochreiter has made contributions in the fields of machine learning, deep learning and bioinformatics, most notably the development of the long short-term memory (LSTM) neural network architecture, [3] [4] but also in meta-learning, [5] reinforcement learning [6] [7] and biclustering with application to bioinformatics data.

Scientific career

Long short-term memory (LSTM)

Hochreiter developed the long short-term memory (LSTM) neural network architecture in his diploma thesis in 1991, leading to the main publication in 1997. [3] [4] LSTM overcomes the problem of numerical instability in training recurrent neural networks (RNNs) that prevents them from learning from long sequences (vanishing or exploding gradient). [3] [8] [9] In 2007, Hochreiter and others successfully applied LSTM with an optimized architecture to very fast protein homology detection without requiring a sequence alignment. [10] LSTM networks have also been used in Google Voice for transcription [11] and search, [12] and in the Google Allo chat app for generating response suggestions with low latency. [13]
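
The vanishing or exploding gradient mentioned above can be illustrated with a toy calculation. The following is a minimal sketch, not taken from Hochreiter's thesis: it assumes a scalar linear recurrence h_t = w * h_{t-1}, and the function name `gradient_through_time` is hypothetical. By the chain rule, the gradient of the final state with respect to the initial state is a product of one factor w per time step.

```python
def gradient_through_time(recurrent_weight: float, steps: int) -> float:
    """Return d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1}.

    Illustrative assumption: a single scalar unit with no nonlinearity.
    """
    grad = 1.0
    for _ in range(steps):
        # Chain rule: each time step contributes one factor of w.
        grad *= recurrent_weight
    return grad


if __name__ == "__main__":
    for w in (0.5, 1.0, 2.0):
        print(f"w={w}: gradient after 10 steps = {gradient_through_time(w, 10)}")
```

With |w| < 1 the gradient shrinks geometrically with sequence length, with |w| > 1 it grows geometrically, and only w = 1 leaves it unchanged. That last case is the rough intuition behind keeping a recurrent path with unit weight inside a memory cell.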

Other machine learning contributions

Beyond LSTM, Hochreiter has developed "Flat Minimum Search" to increase the generalization of neural networks [14] and introduced rectified factor networks (RFNs) for sparse coding [15] [16] which have been applied in bioinformatics and genetics. [17] Hochreiter introduced modern Hopfield networks with continuous states [18] and applied them to the task of immune repertoire classification. [19]
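
As a rough illustration of a Hopfield network with continuous states, the sketch below implements one retrieval step of the form "new state = stored patterns weighted by a softmax over their similarity to the current state", which is the update rule described in the cited paper [18]. The function names, the inverse temperature value `beta`, and the toy patterns are assumptions made for this example, not part of the original work.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_update(patterns, state, beta=1.0):
    """One retrieval step: softmax-weighted average of the stored patterns.

    patterns: list of stored vectors (lists of floats, equal length).
    state:    current query vector.
    beta:     inverse temperature; larger values sharpen retrieval.
    """
    # Similarity of the current state to every stored pattern (dot product).
    scores = [beta * sum(p_i * s_i for p_i, s_i in zip(p, state)) for p in patterns]
    weights = softmax(scores)
    # New state is the weighted combination of the stored patterns.
    return [
        sum(w * p[d] for w, p in zip(weights, patterns))
        for d in range(len(state))
    ]


if __name__ == "__main__":
    stored = [[1.0, 0.0], [0.0, 1.0]]
    noisy_query = [0.9, 0.1]
    print(hopfield_update(stored, noisy_query, beta=10.0))
```

In this toy setting a noisy query close to the first stored pattern is pulled almost entirely onto that pattern in a single step. A smaller `beta` would instead return a blend of the stored patterns.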

Hochreiter worked with Jürgen Schmidhuber in the field of reinforcement learning on actor-critic systems that learn by "backpropagation through a model". [6] [20]

Hochreiter has been involved in the development of factor analysis methods with application to bioinformatics, including FABIA for biclustering, [21] HapFABIA for detecting short segments of identity by descent [22] and FARMS for preprocessing and summarizing high-density oligonucleotide DNA microarrays to analyze RNA gene expression. [23]

In 2006, Hochreiter and others proposed an extension of the support vector machine (SVM), the "Potential Support Vector Machine" (PSVM), [24] which can be applied to non-square kernel matrices and can be used with kernels that are not positive definite. Hochreiter and his collaborators have applied PSVM to feature selection, including gene selection for microarray data. [25] [26] [27]

Awards

Hochreiter was awarded the IEEE CIS Neural Networks Pioneer Award in 2021 for his work on LSTM. [28]

References

  1. "IARAI – INSTITUTE OF ADVANCED RESEARCH IN ARTIFICIAL INTELLIGENCE". www.iarai.ac.at. Retrieved 2021-02-13.
  2. "CAMDA 2021". 20th International Conference on Critical Assessment of Massive Data Analysis. Retrieved 2021-02-13.
  3. Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science.
  4. Hochreiter, S.; Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
  5. Hochreiter, S.; Younger, A. S.; Conwell, P. R. (2001). Learning to Learn Using Gradient Descent (PDF). Lecture Notes in Computer Science - ICANN 2001. Lecture Notes in Computer Science. Vol. 2130. pp. 87–94. CiteSeerX 10.1.1.5.323. doi:10.1007/3-540-44668-0_13. ISBN 978-3-540-42486-4. ISSN 0302-9743.
  6. Hochreiter, S. (1991). Implementierung und Anwendung eines neuronalen Echtzeit-Lernalgorithmus für reaktive Umgebungen (PDF) (Report). Technical University Munich, Institute of Computer Science.
  7. Arjona-Medina, J. A.; Gillhofer, M.; Widrich, M.; Unterthiner, T.; Hochreiter, S. (2018). "RUDDER: Return Decomposition for Delayed Rewards". arXiv:1806.07857 [cs.LG].
  8. Hochreiter, S. (1998). "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 06 (2): 107–116. doi:10.1142/S0218488598000094. ISSN 0218-4885.
  9. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. (2000). Kolen, J. F.; Kremer, S. C. (eds.). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Networks. New York City: IEEE Press. pp. 237–244. CiteSeerX 10.1.1.24.7321.
  10. Hochreiter, S.; Heusel, M.; Obermayer, K. (2007). "Fast model-based protein homology detection without alignment". Bioinformatics. 23 (14): 1728–1736. doi:10.1093/bioinformatics/btm247. PMID 17488755.
  11. "The neural networks behind Google Voice transcription".
  12. "Google voice search: faster and more accurate".
  13. Khaitan, Pranav (May 18, 2016). "Chat Smarter with Allo". Google AI Blog. Retrieved 2021-10-20.
  14. Hochreiter, S.; Schmidhuber, J. (1997). "Flat Minima". Neural Computation. 9 (1): 1–42. doi:10.1162/neco.1997.9.1.1. PMID 9117894. S2CID 733161.
  15. Clevert, D.-A.; Mayr, A.; Unterthiner, T.; Hochreiter, S. (2015). "Rectified Factor Networks". arXiv:1502.06464v2 [cs.LG].
  16. Clevert, D.-A.; Mayr, A.; Unterthiner, T.; Hochreiter, S. (2015). Rectified Factor Networks. Advances in Neural Information Processing Systems 29. arXiv:1502.06464.
  17. Clevert, D.-A.; Unterthiner, T.; Povysil, G.; Hochreiter, S. (2017). "Rectified factor networks for biclustering of omics data". Bioinformatics. 33 (14): i59–i66. doi:10.1093/bioinformatics/btx226. PMC 5870657. PMID 28881961.
  18. Ramsauer, H.; Schäfl, B.; Lehner, J.; Seidl, P.; Widrich, M.; Gruber, L.; Holzleitner, M.; Pavlović, M.; Sandve, G. K.; Greiff, V.; Kreil, D.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. (2020). "Hopfield Networks is All You Need". arXiv:2008.02217 [cs.NE].
  19. Widrich, M.; Schäfl, B.; Ramsauer, H.; Pavlović, M.; Gruber, L.; Holzleitner, M.; Brandstetter, J.; Sandve, G. K.; Greiff, V.; Hochreiter, S.; Klambauer, G. (2020). "Modern Hopfield Networks and Attention for Immune Repertoire Classification". arXiv:2007.13505 [cs.LG].
  20. Schmidhuber, J. (1990). Making the world differentiable: On Using Fully Recurrent Self-Supervised Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments (PDF) (Technical report). Technical University Munich, Institute of Computer Science. FKI-126-90 (revised).
  21. Hochreiter, Sepp; Bodenhofer, Ulrich; Heusel, Martin; Mayr, Andreas; Mitterecker, Andreas; Kasim, Adetayo; Khamiakova, Tatsiana; Van Sanden, Suzy; Lin, Dan; Talloen, Willem; Bijnens, Luc; Göhlmann, Hinrich W. H.; Shkedy, Ziv; Clevert, Djork-Arné (2010-06-15). "FABIA: factor analysis for bicluster acquisition". Bioinformatics. 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC 2881408. PMID 20418340.
  22. Hochreiter, S. (2013). "HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data". Nucleic Acids Research. 41 (22): e202. doi:10.1093/nar/gkt1013. PMC 3905877. PMID 24174545.
  23. Hochreiter, S.; Clevert, D.-A.; Obermayer, K. (2006). "A new summarization method for affymetrix probe level data". Bioinformatics. 22 (8): 943–949. doi:10.1093/bioinformatics/btl033. PMID 16473874.
  24. Hochreiter, S.; Obermayer, K. (2006). "Support Vector Machines for Dyadic Data". Neural Computation. 18 (6): 1472–1510. CiteSeerX 10.1.1.228.5244. doi:10.1162/neco.2006.18.6.1472. PMID 16764511. S2CID 26201227.
  25. Hochreiter, S.; Obermayer, K. (2006). Nonlinear Feature Selection with the Potential Support Vector Machine. Feature Extraction, Studies in Fuzziness and Soft Computing. pp. 419–438. doi:10.1007/978-3-540-35488-8_20. ISBN 978-3-540-35487-1.
  26. Hochreiter, S.; Obermayer, K. (2003). "Classification and Feature Selection on Matrix Data with Application to Gene-Expression Analysis". 54th Session of the International Statistical Institute. Archived from the original on 2012-03-25.
  27. Hochreiter, S.; Obermayer, K. (2004). "Gene Selection for Microarray Data". Kernel Methods in Computational Biology. MIT Press: 319–355. Archived from the original on 2012-03-25.
  28. "Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021 - IARAI". www.iarai.ac.at. 24 July 2020. Retrieved 3 June 2021.