Morphological dictionary

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between surface forms and lexical forms of words. Surface forms of words are those found in natural language text. The lexical form corresponding to a surface form is the lemma followed by grammatical information (for example the part of speech, gender and number). In English, give, gives, giving, gave and given are surface forms of the verb give. The lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

Notable examples and formalisms

Universal Morphologies

Inspired by the success of Universal Dependencies for the cross-linguistic annotation of syntactic dependencies, similar efforts have emerged for morphology, e.g., UniMorph [1] and UDer. [2] Both use simple tab-separated formats with one form per row, together with its inflection information (UniMorph) or its derivation (UDer):

aalen   aalend  V.PTCP;PRS

aalen   aalen   V;IND;PRS;1;PL

aalen   aalen   V;IND;PRS;3;PL

aalen   aalen   V;NFIN

(UniMorph, German. Columns are LEMMA, FORM, FEATURES)
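As an illustration, rows in this layout can be read with a few lines of Python. This is a minimal sketch, assuming that the columns are separated by a single tab character and that the features are separated by semicolons, as in the sample above; the names `UniMorphEntry` and `parse_unimorph_line` are hypothetical and not part of any UniMorph tooling.

```python
from typing import NamedTuple, Tuple


class UniMorphEntry(NamedTuple):
    lemma: str
    form: str
    features: Tuple[str, ...]  # feature labels, kept as opaque strings


def parse_unimorph_line(line: str) -> UniMorphEntry:
    """Parse one LEMMA<TAB>FORM<TAB>FEATURES row (assumed layout)."""
    lemma, form, features = line.rstrip("\n").split("\t")
    return UniMorphEntry(lemma, form, tuple(features.split(";")))


# The four sample rows, written with explicit tab characters.
SAMPLE = (
    "aalen\taalend\tV.PTCP;PRS\n"
    "aalen\taalen\tV;IND;PRS;1;PL\n"
    "aalen\taalen\tV;IND;PRS;3;PL\n"
    "aalen\taalen\tV;NFIN\n"
)

entries = [parse_unimorph_line(row) for row in SAMPLE.splitlines() if row]
for entry in entries:
    print(entry.lemma, entry.form, entry.features)
```

Each row yields one (lemma, form, features) triple; no alignment between the characters of the form and the individual features is recorded.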

In UDer, additional information (part of speech) is encoded within the columns:

abändern_V      Abänderung_Nf   dVN07>

Abarbeiten_Nn   abarbeiten_V    dNV09>

abartig_A       Abartigkeit_Nf  dAN03>

Abart_Nf        abartig_A       dNA05>

abbaggern_V     Abbaggern_Nn    dVN09>

(UDer, German DErivBase 0.5. Columns are BASE, DERIVED, RULE)
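A row of this kind can be split into its parts as follows. This is a sketch under the assumption that the columns are tab-separated and that the tag is whatever follows the last underscore of an item; the tags and the rule identifier are kept as opaque strings, and the function names are hypothetical.

```python
from typing import Tuple


def split_tag(item: str) -> Tuple[str, str]:
    """Split 'Abart_Nf' into ('Abart', 'Nf'); the tag is empty if there is no underscore."""
    lemma, sep, tag = item.rpartition("_")
    if not sep:
        return item, ""
    return lemma, tag


def parse_uder_line(line: str) -> Tuple[Tuple[str, str], Tuple[str, str], str]:
    """Parse one BASE<TAB>DERIVED<TAB>RULE row (assumed layout)."""
    base, derived, rule = line.rstrip("\n").split("\t")
    return split_tag(base), split_tag(derived), rule


row = "abändern_V\tAbänderung_Nf\tdVN07>"
print(parse_uder_line(row))
```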

At the time of writing (2021), all of these are non-aligned morphological dictionaries (see below). Their simple format is particularly well suited to the application of machine learning techniques, and UniMorph in particular has been the subject of numerous shared tasks.

Finite State Transducers

Finite State Transducers (FSTs) are a popular technique for the computational handling of morphology, especially inflectional morphology. In rule-based morphological parsers, both lexicon and rules are normally formalized as finite state automata and subsequently combined. They thus require morphological dictionaries with specific processing instructions (which often have a linguistic interpretation but are technically treated like arbitrary string symbols). [3] Popular FST packages such as SFST [4] (as available from the fst package in Debian and Ubuntu) allow users to define application-specific file formats for morphological lexica that bundle different pieces of morphological information with every individual morpheme. These are thus aligned morphological dictionaries, but very rich (and also idiosyncratic) in structure.


Sample data from SMOR [5] (German SFST grammar):

<Base_Stems>Aachen<NN><base><nativ><Name-Neut_s>

<Base_Stems>Aal<NN><base><nativ><NMasc_es_e>

<Base_Stems>Aarau<NN><base><nativ><Name-Neut_s>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<>:e<>:n<NN><SUFF><kompos><frei>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><base><frei><NMasc_en_en>

<Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><deriv><frei>

Interlinear Glossed Text editors

Interlinear Glossed Text (IGT) is a popular formalism in language documentation, linguistic typology and other branches of linguistics and the philologies. Although IGT can be created with a conventional text editor and no specialized software, such software has been developed, with notable examples including Toolbox, [6] the FieldWorks Language Explorer (FLEx) [7] and open source alternatives such as Xigt. [8] Toolbox and FLEx support semi-automated annotation by means of an internal morphological dictionary. Whenever a morphological segment is encountered for which an annotation can be found in the dictionary, this annotation is applied. Whenever a morphological segment is newly annotated, the annotation is stored in the dictionary. FLEx and Toolbox provide different editor functionalities for annotating text and editing dictionaries, so that additional information beyond that found in annotations can be added, but at their core, their formats provide aligned morphological dictionaries.
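The lookup-and-store cycle described above can be sketched in Python. This is an illustration of the general idea only, not of how Toolbox or FLEx are implemented; the class `GlossLexicon`, the function `annotate` and the `ask` callback (which stands in for the human annotator) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


class GlossLexicon:
    """Toy morphological dictionary mapping segments to glosses."""

    def __init__(self) -> None:
        self._glosses: Dict[str, str] = {}

    def suggest(self, segment: str) -> Optional[str]:
        return self._glosses.get(segment)

    def record(self, segment: str, gloss: str) -> None:
        self._glosses[segment] = gloss


def annotate(segments: List[str], lexicon: GlossLexicon,
             ask: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Reuse stored glosses; ask the annotator only for unknown segments."""
    result = []
    for segment in segments:
        gloss = lexicon.suggest(segment)
        if gloss is None:
            gloss = ask(segment)            # new annotation by the user
            lexicon.record(segment, gloss)  # stored for later reuse
        result.append((segment, gloss))
    return result


lexicon = GlossLexicon()
manual = {"house": "house", "-s": "PL"}
first = annotate(["house", "-s"], lexicon, ask=manual.__getitem__)
# On the second pass every segment is known, so `ask` is never called.
second = annotate(["house", "-s"], lexicon, ask=lambda s: "?")
print(first, second)
```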

FLEx and Xigt are based on XML formats, whereas Toolbox uses a plain text format with idiosyncratic "markers". FLEx and Toolbox are not directly interoperable with each other, but a semi-automated converter from Toolbox to FLEx does exist. Xigt comes with FLEx and Toolbox importers, but is less widely used than either FLEx or Toolbox. The formats of FLEx and Toolbox are not intended for human consumption, nor are they well supported by any processing software other than their native tools.

OntoLex-Morph: A community standard for morphological dictionaries

OntoLex is a community standard for machine-readable dictionaries on the web. In 2019, the OntoLex-Morph module was proposed to facilitate data modelling of morphology in lexicography, as well as to provide a data model for morphological dictionaries for Natural Language Processing. [9] OntoLex-Morph supports both aligned and non-aligned morphological dictionaries. A specific goal is to establish interoperability between and among IGT dictionaries, FST lexicons and morphological dictionaries used for machine learning.

Types and structure of morphological dictionaries

Aligned morphological dictionaries

In an aligned morphological dictionary, the correspondence between the surface form and the lexical form of a word is aligned at the character level, for example:

(h,h) (o,o) (u,u) (s,s) (e,e) (s,n) (θ,pl)

Where θ is the empty symbol and n signifies "noun", and pl signifies "plural".

In the example the left hand side is the surface form (input), and the right hand side is the lexical form (output). This order is used in morphological analysis where a lexical form is generated from a surface form. In morphological generation this order would be reversed.

Formally, if $\Sigma$ is the alphabet of the input symbols and $\Gamma$ is the alphabet of the output symbols, an aligned morphological dictionary is a subset $A \subseteq L^{*}$, where

$$L = (\Sigma \cup \{\theta\}) \times (\Gamma \cup \{\theta\})$$

is the alphabet of all the possible alignments including the empty symbol. That is, an aligned morphological dictionary is a set of strings in $L^{*}$.
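The "houses" example can be written down as a list of (input, output) symbol pairs. The following Python sketch, with hypothetical function names, shows how the surface form and the lexical form are read off an alignment by dropping the empty symbol, and how swapping the two sides gives the direction used in generation.

```python
from typing import List, Tuple

EMPTY = "θ"  # the empty symbol

Alignment = List[Tuple[str, str]]

# One aligned entry: (surface symbol, lexical symbol) pairs for "houses".
houses: Alignment = [
    ("h", "h"), ("o", "o"), ("u", "u"), ("s", "s"),
    ("e", "e"), ("s", "n"), (EMPTY, "pl"),
]


def surface_form(alignment: Alignment) -> str:
    """Concatenate the input side, skipping empty symbols."""
    return "".join(s for s, _ in alignment if s != EMPTY)


def lexical_form(alignment: Alignment) -> str:
    """Concatenate the output side, skipping empty symbols."""
    return "".join(l for _, l in alignment if l != EMPTY)


def invert(alignment: Alignment) -> Alignment:
    """Swap input and output, as needed for morphological generation."""
    return [(l, s) for s, l in alignment]


print(surface_form(houses), lexical_form(houses))
```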

Non-aligned morphological dictionaries (full-form dictionaries)

A non-aligned morphological dictionary (or full-form dictionary) is simply a set of pairs of input and output strings. A non-aligned morphological dictionary would represent the previous example as:

(houses, housenpl)

It is possible to convert a non-aligned dictionary into an aligned dictionary. Besides trivial alignments to the left or to the right, linguistically motivated alignments which align characters to their corresponding morphemes are possible.
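The two trivial alignments can be sketched in Python as follows. The sketch assumes that the lexical form has already been split into symbols, since tags such as "pl" are single symbols spanning several characters; the function names are hypothetical, and a linguistically motivated alignment would need additional knowledge about morpheme boundaries.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

EMPTY = "θ"  # the empty symbol


def align_left(surface: Sequence[str], lexical: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair symbols from the left; pad the shorter side at the right end."""
    return list(zip_longest(surface, lexical, fillvalue=EMPTY))


def align_right(surface: Sequence[str], lexical: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair symbols from the right; pad the shorter side at the left end."""
    pairs = zip_longest(reversed(surface), reversed(lexical), fillvalue=EMPTY)
    return list(reversed(list(pairs)))


lexical_symbols = ["h", "o", "u", "s", "e", "n", "pl"]
print(align_left("houses", lexical_symbols))
print(align_right("houses", lexical_symbols))
```

For this particular entry, the left alignment happens to coincide with the aligned example given earlier, whereas the right alignment shifts every character by one position.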

Lexical ambiguities

Frequently there exists more than one lexical form associated with a surface form of a word. For example, "house" may be a noun in the singular, /haʊs/, or may be a verb in the present tense, /haʊz/. As a result of this it is necessary to have a function which relates input strings with their corresponding output strings.

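If we define the set $E$ of input words such that $E = \{w : (w, w') \in A\}$, where $A$ is read as the set of (input, output) string pairs in the dictionary, the correspondence function would be $\tau : E \to 2^{\Gamma^{*}}$ (with $\Gamma$ the output alphabet), defined as $\tau(w) = \{w' : (w, w') \in A\}$.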
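In code, such a correspondence function amounts to grouping the output strings by input string. The following Python sketch builds it from a list of (surface, lexical) pairs; the function name and the tag strings in the sample data are hypothetical and only serve to illustrate an ambiguous entry.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_correspondence(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map every input string to the set of all its output strings."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for surface, lexical in pairs:
        table[surface].add(lexical)
    return dict(table)


# Illustrative entries; the tag notation is an assumption of this sketch.
pairs = [
    ("house", "house<n><sg>"),
    ("house", "house<v><pres>"),
    ("houses", "house<n><pl>"),
]

tau = build_correspondence(pairs)
print(sorted(tau["house"]))
```

The keys of the resulting table correspond to the set of input words, and an ambiguous surface form such as "house" is mapped to more than one lexical form.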

References

  1. Kirov, Christo, Ryan Cotterell, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui et al. "UniMorph 2.0: universal morphology." In LREC (2018).
  2. Kyjánek, L., Žabokrtský, Z., Ševčíková, M., & Vidra, J. (2019, September). Universal derivations kickoff: a collection of harmonized derivational resources for eleven languages. In Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology (pp. 101-110).
  3. "A Short History of Two-Level Morphology". www.ling.helsinki.fi. Retrieved 2021-11-30.
  4. Schmid, Helmut. "A programming language for finite state transducers." In FSMNLP, vol. 4002, pp. 308-309. 2005.
  5. Schmid, Helmut, Arne Fitschen, and Ulrich Heid. "SMOR: A German computational morphology covering derivation, composition and inflection." In LREC, pp. 1-263. 2004.
  6. "Field Linguist's Toolbox". software.sil.org. Retrieved 2021-11-27.
  7. "FieldWorks". software.sil.org. Retrieved 2021-11-27.
  8. "XIGT". XIGT. Retrieved 2021-11-27.
  9. Klimek, B., McCrae, J. P., Bosque-Gil, J., Ionov, M., Tauber, J. K., & Chiarcos, C. (2019). Challenges for the representation of morphology in ontology lexicons. Proceedings of eLex.