HFST

Helsinki Finite-State Technology
Developer(s) HFST team
Initial release 2008
Stable release 3.15.4 / February 13, 2021 [1]
Repository
Written in C++, Prolog, Python
Operating system Cross-platform: Linux, Mac OS X, Windows
Platform x86
Available in English
Type Finite-state toolkit
License GPLv3, part Apache
Website hfst.github.io

Helsinki Finite-State Technology (HFST) is a computer programming library and set of utilities for natural language processing with finite-state automata and finite-state transducers. It is free and open-source software, released under a mix of the GNU General Public License version 3 (GPLv3) and the Apache License.

Features

The library provides a common interface to multiple interchangeable backends, such as OpenFST, foma and SFST. The utilities comprise various compilers, such as hfst-twolc (a compiler for morphological two-level rules), [2] hfst-lexc (a compiler for lexicon definitions) and hfst-regexp2fst (a regular expression compiler). Functions from Xerox's proprietary scripting language xfst are duplicated in hfst-xfst, and those of the pattern matching utility pmatch in hfst-pmatch, which goes beyond the finite-state formalism by supporting recursive transition networks (RTNs).

The library and utilities are written in C++, with an interface to the library in Python and a utility for looking up results from transducers ported to Java and Python.

Transducers in HFST may carry weights, depending on the backend. Weighted FST operations are currently possible only via the OpenFST backend. HFST also provides two native backends, one designed for fast lookup (hfst-optimized-lookup) and the other for format interchange; both can be weighted.
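
As a rough illustration of what a weighted lookup involves, the following sketch implements a toy weighted transducer in plain Python. It does not use HFST or any of its backends: the class `WeightedFST`, its methods, the example word and the weights are all hypothetical, chosen only for this example. Weights are assumed to be added along a path, with lower totals ranked first (as in the tropical semiring), and arcs with empty input are not supported.

```python
from collections import defaultdict


class WeightedFST:
    """Toy weighted finite-state transducer (illustrative only, not the HFST API)."""

    def __init__(self, start=0):
        self.start = start
        self.finals = {}                # state -> final weight
        self.arcs = defaultdict(list)   # state -> [(input, output, weight, target)]

    def add_arc(self, src, inp, out, weight, dst):
        self.arcs[src].append((inp, out, weight, dst))

    def set_final(self, state, weight=0.0):
        self.finals[state] = weight

    def lookup(self, symbols):
        """Return all (output, weight) analyses of the input, lowest weight first."""
        results = []
        # Each entry: (state, position in input, output so far, accumulated weight)
        stack = [(self.start, 0, "", 0.0)]
        while stack:
            state, pos, out, weight = stack.pop()
            if pos == len(symbols):
                # Input consumed: accept only if the current state is final.
                if state in self.finals:
                    results.append((out, weight + self.finals[state]))
                continue
            for inp, arc_out, arc_weight, dst in self.arcs[state]:
                if inp == symbols[pos]:
                    stack.append((dst, pos + 1, out + arc_out, weight + arc_weight))
        return sorted(results, key=lambda pair: pair[1])


# Hypothetical analyzer: "runs" is ambiguous between a verb and a noun reading.
analyzer = WeightedFST()
analyzer.add_arc(0, "r", "r", 0.0, 1)
analyzer.add_arc(1, "u", "u", 0.0, 2)
analyzer.add_arc(2, "n", "n", 0.0, 3)
analyzer.add_arc(3, "s", "+V+3Sg", 0.5, 4)   # cheaper, so ranked first
analyzer.add_arc(3, "s", "+N+Pl", 1.5, 4)
analyzer.set_final(4)

for analysis, weight in analyzer.lookup("runs"):
    print(analysis, weight)
```

Both readings of the input are returned, ordered by total path weight, and an input that does not end in a final state yields no analyses. A dedicated lookup format would store the same information far more compactly than the dictionaries used here.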

Uses

HFST has been used for writing various linguistic tools, such as spell-checkers, hyphenators, and morphologies. [3] [4] Morphological dictionaries written in other formalisms have also been converted to HFST's formats. [5]

Notes

  1. "Releases · hfst/hfst". github.com. Retrieved 2021-04-12.
  2. "A Short History of Two-Level Morphology".
  3. "GitHub - flammie/Omorfi: Open morphology for Finnish". GitHub. 2019-02-23.
  4. "How to Configure and Optimise Spellers".
  5. "Helsinki Finite-State Technology - Browse /Resources at SourceForge.net".

References

Lindén, Krister; Axelson, Erik; Drobac, Senka; Hardwick, Sam; Kuokkala, Juha; Niemi, Jyrki; Pirinen, Tommi; Silfverberg, Miikka (2013). "HFST - A System for Creating NLP Tools". In Mahlow, Cerstin; Piotrowski, Michael (eds.). Systems and Frameworks for Computational Morphology. Communications in Computer and Information Science. Vol. 380. Humboldt-Universität in Berlin: Springer. pp. 53–71.