Layered hidden Markov model

Last updated

The layered hidden Markov model (LHMM) is a statistical model derived from the hidden Markov model (HMM). A layered hidden Markov model (LHMM) consists of N levels of HMMs, where the HMMs on level i + 1 correspond to observation symbols or probability generators at level i. Every level i of the LHMM consists of Ki HMMs running in parallel. [1]

Hidden Markov model statistical Markov model

Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved states.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data. A statistical model represents, often in considerably idealized form, the data-generating process.

Contents

Background

LHMMs are sometimes useful in specific structures because they can facilitate learning and generalization. For example, even though a fully connected HMM could always be used if enough training data were available, it is often useful to constrain the model by not allowing arbitrary state transitions. In the same way it can be beneficial to embed the HMM in a layered structure which, theoretically, may not be able to solve any problems the basic HMM cannot, but can solve some problems more efficiently because less training data is needed.

The layered hidden Markov model

A layered hidden Markov model (LHMM) consists of levels of HMMs where the HMMs on level corresponds to observation symbols or probability generators at level . Every level of the LHMM consists of HMMs running in parallel.

A layered hidden Markov model LHMM.png
A layered hidden Markov model

At any given level in the LHMM a sequence of observation symbols can be used to classify the input into one of classes, where each class corresponds to each of the HMMs at level . This classification can then be used to generate a new observation for the level HMMs. At the lowest layer, i.e. level , primitive observation symbols would be generated directly from observations of the modeled process. For example, in a trajectory tracking task the primitive observation symbols would originate from the quantized sensor values. Thus at each layer in the LHMM the observations originate from the classification of the underlying layer, except for the lowest layer where the observation symbols originate from measurements of the observed process.

It is not necessary to run all levels at the same time granularity. For example, it is possible to use windowing at any level in the structure so that the classification takes the average of several classifications into consideration before passing the results up the layers of the LHMM. [2]

Instead of simply using the winning HMM at level as an input symbol for the HMM at level it is possible to use it as a probability generator by passing the complete probability distribution up the layers of the LHMM. Thus instead of having a "winner takes all" strategy where the most probable HMM is selected as an observation symbol, the likelihood of observing the th HMM can be used in the recursion formula of the level HMM to account for the uncertainty in the classification of the HMMs at level . Thus, if the classification of the HMMs at level is uncertain, it is possible to pay more attention to the a-priori information encoded in the HMM at level .

In probability theory and statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. For instance, if the random variable X is used to denote the outcome of a coin toss, then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails. Examples of random phenomena can include the results of an experiment or survey.

A LHMM could in practice be transformed into a single layered HMM where all the different models are concatenated together. [3] Some of the advantages that may be expected from using the LHMM over a large single layer HMM is that the LHMM is less likely to suffer from overfitting since the individual sub-components are trained independently on smaller amounts of data. A consequence of this is that a significantly smaller amount of training data is required for the LHMM to achieve a performance comparable of the HMM. Another advantage is that the layers at the bottom of the LHMM, which are more sensitive to changes in the environment such as the type of sensors, sampling rate etc. can be retrained separately without altering the higher layers of the LHMM.

Overfitting production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably

In statistics, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably". An overfitted model is a statistical model that contains more parameters than can be justified by the data. The essence of overfitting is to have unknowingly extracted some of the residual variation as if that variation represented underlying model structure.

See also

Related Research Articles

In statistics, a likelihood function is a function of parameters within the parameter space that describes the probability of obtaining the observed data . It is proportional—up to a function of only the observed data—to the joint probability distribution of given . The likelihood principle states that all relevant information for inference about is contained in the likelihood function for the observed data given the assumed statistical model. The case for using likelihood in the foundation of statistics was first made by the founder of modern statistics, R. A. Fisher, who believed it to be a self-contained framework for statistical modelling and inference. But the likelihood function also plays a fundamental role in frequentist and Bayesian statistics.

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Markov chain mathematical system

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

Perceptron algorithm for supervised learning of binary classifiers

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

Kalman filter algorithm

In statistics and control theory, Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe. The filter is named after Rudolf E. Kálmán, one of the primary developers of its theory.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models.

Expectation–maximization algorithm iterative method for finding maximum likelihood estimates in statistical models

In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

In electrical engineering, computer science, statistical computing and bioinformatics, the Baum–Welch algorithm is used to find the unknown parameters of a hidden Markov model (HMM). It makes use of a forward-backward algorithm.

The forward algorithm, in the context of a hidden Markov model (HMM), is used to calculate a 'belief state': the probability of a state at a certain time, given the history of evidence. The process is also known as filtering. The forward algorithm is closely related to, but distinct from, the Viterbi algorithm.

Pattern theory, formulated by Ulf Grenander, is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language.

Conditional random field

Conditional random fields (CRFs) are a class of statistical modeling method often applied in pattern recognition and machine learning and used for structured prediction. CRFs fall into the sequence modeling family. Whereas a discrete classifier predicts a label for a single sample without considering "neighboring" samples, a CRF can take context into account; e.g., the linear chain CRF predicts sequences of labels for sequences of input samples.

The forward–backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions , i.e. it computes, for all hidden state variables , the distribution . This inference task is usually called smoothing. The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.

The hierarchical hidden Markov model (HHMM) is a statistical model derived from the hidden Markov model (HMM). In an HHMM each state is considered to be a self-contained probabilistic model. More precisely each state of the HHMM is itself an HHMM.

The discrete phase-type distribution is a probability distribution that results from a system of one or more inter-related geometric distributions occurring in sequence, or phases. The sequence in which each of the phases occur may itself be a stochastic process. The distribution can be represented by a random variable describing the time until absorption of an absorbing Markov chain with one absorbing state. Each of the states of the Markov chain represents one of the phases.

Time-inhomogeneous hidden Bernoulli model (TI-HBM) is an alternative to hidden Markov model (HMM) for automatic speech recognition. Contrary to HMM, the state transition process in TI-HBM is not a Markov-dependent process, rather it is a generalized Bernoulli process. This difference leads to elimination of dynamic programming at state-level in TI-HBM decoding process. Thus, the computational complexity of TI-HBM for probability evaluation and state estimation is . The TI-HBM is able to model acoustic-unit duration by using a built-in parameter named survival probability. The TI-HBM is simpler and faster than HMM in a phoneme recognition task, but its performance is comparable to HMM.

In machine learning, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. MEMMs find applications in natural language processing, specifically in part-of-speech tagging and information extraction.

In statistics and machine learning, the hierarchical Dirichlet process (HDP) is a nonparametric Bayesian approach to clustering grouped data. It uses a Dirichlet process for each group of data, with the Dirichlet processes for all groups sharing a base distribution which is itself drawn from a Dirichlet process. This method allows groups to share statistical strength via sharing of clusters across groups. The base distribution being drawn from a Dirichlet process is important, because draws from a Dirichlet process are atomic probability measures, and the atoms will appear in all group-level Dirichlet processes. Since each atom corresponds to a cluster, clusters are shared across all groups. It was developed by Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David Blei and published in the Journal of the American Statistical Association in 2006, as a formalization and generalization of the infinite hidden Markov model published in 2002.

Time-series segmentation is a method of time-series analysis in which an input time-series is divided into a sequence of discrete segments in order to reveal the underlying properties of its source. A typical application of time-series segmentation is in speaker diarization, in which an audio signal is partitioned into several pieces according to who is speaking at what times. Algorithms based on change-point detection include sliding windows, bottom-up, and top-down methods. Probabilistic methods based on hidden Markov models have also proved useful in solving this problem.

Extreme learning machine

Extreme learning machines are feedforward neural networks for classification, regression, clustering, sparse approximation, compression and feature learning with a single layer or multiple layers of hidden nodes, where the parameters of hidden nodes need not be tuned. These hidden nodes can be randomly assigned and never updated, or can be inherited from their ancestors without being changed. In most cases, the output weights of hidden nodes are usually learned in a single step, which essentially amounts to learning a linear model. The name "extreme learning machine" (ELM) was given to such models by its main inventor Guang-Bin Huang.

References

  1. N. Oliver, A. Garg and E. Horvitz, "Layered Representations for Learning and Inferring Office Activity from Multiple Sensory Channels", Computer Vision and Image Understanding, vol. 96, p. 163180, 2004.
  2. D. Aarno and D. Kragic "Evaluation of Layered HMM for Motion Intention Recognition", IEEE International Conference on Advanced Robotics, 2007
  3. D. Aarno and D. Kragic: "Layered HMM for Motion Intention Recognition", IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.