Sepp Hochreiter

Hochreiter in 2012
Born: 14 February 1967 (age 57)
Nationality: German
Alma mater: Technische Universität München
Scientific career
Fields: Machine learning, bioinformatics
Institutions: Johannes Kepler University Linz
Thesis: Generalisierung bei neuronalen Netzen geringer Komplexität (1999)
Doctoral advisor: Wilfried Brauer

Josef "Sepp" Hochreiter (born 14 February 1967) is a German computer scientist. Since 2018 he has led the Institute for Machine Learning at the Johannes Kepler University of Linz after having led the Institute of Bioinformatics from 2006 to 2018. In 2017 he became the head of the Linz Institute of Technology (LIT) AI Lab. Hochreiter is also a founding director of the Institute of Advanced Research in Artificial Intelligence (IARAI). [1] Previously, he was at the Technical University of Berlin, at University of Colorado Boulder, and at the Technical University of Munich. He is a chair of the Critical Assessment of Massive Data Analysis (CAMDA) conference. [2]

Hochreiter has made contributions in the fields of machine learning, deep learning and bioinformatics, most notably the development of the long short-term memory (LSTM) neural network architecture, [3] [4] but also in meta-learning, [5] reinforcement learning [6] [7] and biclustering, with applications to bioinformatics data.

Scientific career

Long short-term memory (LSTM)

Hochreiter developed the long short-term memory (LSTM) neural network architecture in his 1991 diploma thesis, leading to the main publication in 1997. [3] [4] LSTM overcomes the vanishing and exploding gradient problem, a numerical instability that prevents recurrent neural networks (RNNs) from learning dependencies across long sequences. [3] [8] [9] In 2007, Hochreiter and others successfully applied LSTM with an optimized architecture to very fast protein homology detection without requiring a sequence alignment. [10] LSTM networks have also been used in Google Voice for transcription [11] and search, [12] and in the Google Allo chat app for generating response suggestions with low latency. [13]
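
In its commonly cited form (a sketch; notation varies across publications, and the forget gate shown here was a later addition by Gers, Schmidhuber and Cummins), an LSTM cell updates its cell state c_t and hidden state h_t at time step t as

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)

where \sigma is the logistic sigmoid and \odot denotes element-wise multiplication; the additive update of the cell state c_t is what lets error signals survive over long time lags.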

Other machine learning contributions

Beyond LSTM, Hochreiter developed "Flat Minimum Search" to improve the generalization of neural networks [14] and introduced rectified factor networks (RFNs) for sparse coding, [15] [16] which have been applied in bioinformatics and genetics. [17] He also introduced modern Hopfield networks with continuous states [18] and applied them to the task of immune repertoire classification. [19]
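
As a sketch of the idea, using the notation of the 2020 paper (X is the matrix whose columns are the stored patterns, \xi is the state or query, and \beta is an inverse temperature), the continuous modern Hopfield network retrieves a pattern by iterating the update rule

\xi^{new} = X \operatorname{softmax}(\beta X^{\top} \xi)

which typically converges in one step and is closely related to the attention mechanism used in transformers.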

Hochreiter worked with Jürgen Schmidhuber in the field of reinforcement learning on actor-critic systems that learn by "backpropagation through a model". [6] [20]

Hochreiter has been involved in the development of factor analysis methods with applications to bioinformatics, including FABIA for biclustering, [21] HapFABIA for detecting short segments of identity by descent, [22] and FARMS for preprocessing and summarizing high-density oligonucleotide DNA microarrays to analyze RNA gene expression. [23]
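
Roughly, and as a sketch rather than a full description of the method, FABIA models a gene expression matrix X as a sum of p biclusters, each an outer product of two sparse vectors, plus additive noise \Upsilon:

X = \sum_{i=1}^{p} \lambda_i z_i^{\top} + \Upsilon

Sparsity of the loadings \lambda_i and factors z_i is encouraged by heavy-tailed (Laplace-like) priors, so each bicluster involves only a subset of the genes and samples.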

In 2006, Hochreiter and others proposed an extension of the support vector machine (SVM), the "Potential Support Vector Machine" (PSVM), [24] which can be applied to non-square kernel matrices and can be used with kernels that are not positive definite. Hochreiter and his collaborators have applied PSVM to feature selection, including gene selection for microarray data. [25] [26] [27]

Awards

Hochreiter was awarded the IEEE CIS Neural Networks Pioneer Award in 2021 for his work on LSTM. [28]

Related Research Articles

<span class="mw-page-title-main">Neural network (machine learning)</span> Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

<span class="mw-page-title-main">Jürgen Schmidhuber</span> German computer scientist

Jürgen Schmidhuber is a German computer scientist noted for his work in the field of artificial intelligence, specifically artificial neural networks. He is a scientific director of the Dalle Molle Institute for Artificial Intelligence Research in Switzerland. He is also director of the Artificial Intelligence Initiative and professor of the Computer Science program in the Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) division at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia.

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by the direction of information flow between its layers. In contrast to the uni-directional feedforward neural network, an RNN allows the output from some nodes to affect subsequent input to the same nodes. Its ability to use internal state (memory) to process arbitrary sequences of inputs makes it applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. The term "recurrent neural network" refers to the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class with a finite impulse response. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.
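
A minimal sketch of this recurrence (one common formulation among many, with illustrative symbols W_h, W_x and b for the recurrent weights, input weights and bias): the hidden state h_t at time step t is computed from the current input x_t and the previous hidden state h_{t-1} as

h_t = \sigma(W_h h_{t-1} + W_x x_t + b)

so information from earlier inputs can persist in h_t and influence later outputs.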

Biclustering, also called block clustering, co-clustering or two-mode clustering, is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.

Meta-learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments. As of 2017, the term had not found a standard interpretation; however, the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, in order to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term "learning to learn".

<span class="mw-page-title-main">Echo state network</span> Type of reservoir computer

An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of the hidden neurons are fixed and randomly assigned. The weights of the output neurons can be learned so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behavior is non-linear, the only weights modified during training are those of the synapses that connect the hidden neurons to the output neurons. Thus, the error function is quadratic in the parameter vector and can be differentiated easily, reducing training to solving a linear system.

<span class="mw-page-title-main">Long short-term memory</span> Artificial recurrent neural network architecture used in deep learning

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide a short-term memory for RNNs that can last thousands of timesteps, hence the name "long short-term memory". It is applicable to classification, and to processing and predicting data based on time series, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.

<span class="mw-page-title-main">Activation function</span> Artificial neural network node function

The activation function of a node in an artificial neural network is a function that calculates the output of the node from its individual inputs and their weights. Nontrivial problems can be solved using only a few nodes if the activation function is nonlinear. Modern activation functions include the GELU, a smooth version of the ReLU, which was used in the 2018 BERT model; the logistic (sigmoid) function, used in the 2012 speech recognition model developed by Hinton et al.; and the ReLU, used in the 2012 AlexNet computer vision model and in the 2015 ResNet model.

<span class="mw-page-title-main">Encog</span> Machine learning framework

Encog is a machine learning framework available for Java and .Net. Encog supports different learning algorithms such as Bayesian Networks, Hidden Markov Models and Support Vector Machines. However, its main strength lies in its neural network algorithms. Encog contains classes to create a wide variety of networks, as well as support classes to normalize and process data for these neural networks. Encog trains using many different techniques. Multithreading is used to allow optimal training performance on multicore machines.

There are many types of artificial neural networks (ANNs).

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. The methods used can be supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Rectifier (neural networks)</span> Activation function

In the context of artificial neural networks, the rectifier or ReLU activation function is an activation function defined as the positive part of its argument: f(x) = max(0, x).

In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that as the network depth or sequence length increases, the gradient magnitude typically decreases, slowing the training process. In the worst case, this may completely stop the neural network from further training. As one example of the cause, traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the early layers in an n-layer network, meaning that the gradient decreases exponentially with n and the early layers train very slowly.
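
A sketch of why this happens in an n-layer (or n-step unrolled) network with hidden states h_1, ..., h_n: by the chain rule, the gradient of the loss \mathcal{L} with respect to an early state is a product of layer-to-layer Jacobians,

\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_n} \prod_{t=2}^{n} \frac{\partial h_t}{\partial h_{t-1}}

so if each factor has norm below 1 the product shrinks roughly exponentially in n, and it explodes if the factors are consistently larger than 1.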

Alex Graves is a computer scientist and research scientist at DeepMind.

Felix Gers is a professor of computer science at the Berlin University of Applied Sciences. With Jürgen Schmidhuber and Fred Cummins, he introduced the forget gate to the long short-term memory recurrent neural network architecture. This modification of the original architecture has been shown to be crucial to the success of LSTM at tasks such as speech and handwriting recognition.

In machine learning, the Highway Network was the first working very deep feedforward neural network with hundreds of layers, much deeper than previous artificial neural networks. It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by long short-term memory (LSTM) recurrent neural networks. The advantage of a Highway Network over common deep neural networks is that it mitigates the vanishing gradient problem, making the network easier to optimize. The gating mechanisms facilitate information flow across many layers.
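
As a sketch, in the simplified form where the carry gate is set to 1 - T, a highway layer combines a nonlinear transform H of the input x with the input itself using a learned transform gate T:

y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))

When T is close to 0 the layer simply passes its input through unchanged, which lets gradients flow across many layers.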

<span class="mw-page-title-main">Residual neural network</span> Deep learning method

A residual neural network is a seminal deep learning model in which the weight layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition and won that year's ImageNet Large Scale Visual Recognition Challenge.
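
As a sketch of the basic building block, a residual block computes

y = F(x, \{W_i\}) + x

where F is the residual function learned by a few stacked weight layers and x is added back through an identity shortcut connection.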

<span class="mw-page-title-main">EMRBots</span>

EMRBots are experimental artificially generated electronic medical records (EMRs). The aim of EMRBots is to allow non-commercial entities to use the artificial patient repositories to practice statistical and machine-learning algorithms. Commercial entities can also use the repositories for any purpose, as long as they do not create software products using the repositories.

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by biological neural circuitry. While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by the psychologist Frank Rosenblatt, who developed the perceptron. Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter".

<span class="mw-page-title-main">Transformer (deep learning architecture)</span> Machine learning algorithm used for natural-language processing

A transformer is a deep learning architecture developed by Google and based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, builds on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation, and on the Fast Weight Controller, similar to a transformer, proposed in 1992.
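
The core operation is scaled dot-product attention as defined in that paper: given query, key and value matrices Q, K and V with key dimension d_k,

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

Multi-head attention runs several such attention operations in parallel over learned projections of Q, K and V and concatenates the results.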

References

  1. "IARAI – INSTITUTE OF ADVANCED RESEARCH IN ARTIFICIAL INTELLIGENCE". www.iarai.ac.at. Retrieved 2021-02-13.
  2. "CAMDA 2021". 20th International Conference on Critical Assessment of Massive Data Analysis. Retrieved 2021-02-13.
  3. Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (diploma thesis). Technical University Munich, Institute of Computer Science.
  4. Hochreiter, S.; Schmidhuber, J. (1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID   9377276. S2CID   1915014.
  5. Hochreiter, S.; Younger, A. S.; Conwell, P. R. (2001). "Learning to Learn Using Gradient Descent". Artificial Neural Networks — ICANN 2001 (PDF). Lecture Notes in Computer Science. Vol. 2130. pp. 87–94. CiteSeerX   10.1.1.5.323 . doi:10.1007/3-540-44668-0_13. ISBN   978-3-540-42486-4. ISSN   0302-9743. S2CID   52872549.
  6. Hochreiter, S. (1991). Implementierung und Anwendung eines neuronalen Echtzeit-Lernalgorithmus für reaktive Umgebungen (PDF) (Report). Technical University Munich, Institute of Computer Science.
  7. Arjona-Medina, J. A.; Gillhofer, M.; Widrich, M.; Unterthiner, T.; Hochreiter, S. (2018). "RUDDER: Return Decomposition for Delayed Rewards". arXiv: 1806.07857 [cs.LG].
  8. Hochreiter, S. (1998). "The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions". International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 06 (2): 107–116. doi:10.1142/S0218488598000094. ISSN   0218-4885. S2CID   18452318.
  9. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. (2000). Kolen, J. F.; Kremer, S. C. (eds.). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Networks. New York City: IEEE Press. pp. 237–244. CiteSeerX   10.1.1.24.7321 .
  10. Hochreiter, S.; Heusel, M.; Obermayer, K. (2007). "Fast model-based protein homology detection without alignment". Bioinformatics. 23 (14): 1728–1736. doi: 10.1093/bioinformatics/btm247 . PMID   17488755.
  11. "The neural networks behind Google Voice transcription". 11 August 2015.
  12. "Google voice search: faster and more accurate". 24 September 2015.
  13. Khaitan, Pranav (May 18, 2016). "Chat Smarter with Allo". Google AI Blog. Retrieved 2021-10-20.
  14. Hochreiter, S.; Schmidhuber, J. (1997). "Flat Minima". Neural Computation. 9 (1): 1–42. doi:10.1162/neco.1997.9.1.1. PMID   9117894. S2CID   733161.
  15. Clevert, D.-A.; Mayr, A.; Unterthiner, T.; Hochreiter, S. (2015). "Rectified Factor Networks". arXiv: 1502.06464v2 [cs.LG].
  16. Clevert, D.-A.; Mayr, A.; Unterthiner, T.; Hochreiter, S. (2015). Rectified Factor Networks. Advances in Neural Information Processing Systems 29. arXiv: 1502.06464 .
  17. Clevert, D.-A.; Unterthiner, T.; Povysil, G.; Hochreiter, S. (2017). "Rectified factor networks for biclustering of omics data". Bioinformatics. 33 (14): i59–i66. doi:10.1093/bioinformatics/btx226. PMC   5870657 . PMID   28881961.
  18. Ramsauer, H.; Schäfl, B.; Lehner, J.; Seidl, P.; Widrich, M.; Gruber, L.; Holzleitner, M.; Pavlović, M.; Sandve, G. K.; Greiff, V.; Kreil, D.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. (2020). "Hopfield Networks is All You Need". arXiv: 2008.02217 [cs.NE].
  19. Widrich, M.; Schäfl, B.; Ramsauer, H.; Pavlović, M.; Gruber, L.; Holzleitner, M.; Brandstetter, J.; Sandve, G. K.; Greiff, V.; Hochreiter, S.; Klambauer, G. (2020). "Modern Hopfield Networks and Attention for Immune Repertoire Classification". arXiv: 2007.13505 [cs.LG].
  20. Schmidhuber, J. (1990). Making the world differentiable: On Using Fully Recurrent Self-Supervised Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments (PDF) (Technical report). Technical University Munich, Institute of Computer Science. FKI-126-90 (revised).
  21. Hochreiter, Sepp; Bodenhofer, Ulrich; Heusel, Martin; Mayr, Andreas; Mitterecker, Andreas; Kasim, Adetayo; Khamiakova, Tatsiana; Van Sanden, Suzy; Lin, Dan; Talloen, Willem; Bijnens, Luc; Göhlmann, Hinrich W. H.; Shkedy, Ziv; Clevert, Djork-Arné (2010-06-15). "FABIA: factor analysis for bicluster acquisition". Bioinformatics. 26 (12): 1520–1527. doi:10.1093/bioinformatics/btq227. PMC   2881408 . PMID   20418340.
  22. Hochreiter, S. (2013). "HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data". Nucleic Acids Research. 41 (22): e202. doi:10.1093/nar/gkt1013. PMC   3905877 . PMID   24174545.
  23. Hochreiter, S.; Clevert, D.-A.; Obermayer, K. (2006). "A new summarization method for affymetrix probe level data". Bioinformatics. 22 (8): 943–949. doi: 10.1093/bioinformatics/btl033 . PMID   16473874.
  24. Hochreiter, S.; Obermayer, K. (2006). "Support Vector Machines for Dyadic Data". Neural Computation. 18 (6): 1472–1510. CiteSeerX   10.1.1.228.5244 . doi:10.1162/neco.2006.18.6.1472. PMID   16764511. S2CID   26201227.
  25. Hochreiter, S.; Obermayer, K. (2006). Nonlinear Feature Selection with the Potential Support Vector Machine. Feature Extraction, Studies in Fuzziness and Soft Computing. pp. 419–438. doi:10.1007/978-3-540-35488-8_20. ISBN   978-3-540-35487-1.
  26. Hochreiter, S.; Obermayer, K. (2003). "Classification and Feature Selection on Matrix Data with Application to Gene-Expression Analysis". 54th Session of the International Statistical Institute. Archived from the original on 2012-03-25.
  27. Hochreiter, S.; Obermayer, K. (2004). "Gene Selection for Microarray Data". Kernel Methods in Computational Biology. MIT Press: 319–355. Archived from the original on 2012-03-25.
  28. "Sepp Hochreiter receives IEEE CIS Neural Networks Pioneer Award 2021 - IARAI". www.iarai.ac.at. 24 July 2020. Retrieved 3 June 2021.