LOLITA

LOLITA is a natural language processing system developed by Durham University between 1986 and 2000. The name is an acronym for "Large-scale, Object-based, Linguistic Interactor, Translator and Analyzer".

LOLITA was developed by Roberto Garigliano and colleagues between 1986 and 2000. It was designed as a general-purpose tool for processing unrestricted text that could be the basis of a wide variety of applications. At its core was a semantic network containing some 90,000 interlinked concepts. Text could be parsed and analysed then incorporated into the semantic net, where it could be reasoned about (Long and Garigliano, 1993). Fragments of the semantic net could also be rendered back to English or Spanish.
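
The published descriptions do not give LOLITA's internal code, but the general shape of such a store can be sketched in Haskell, the system's implementation language. The following is a minimal, illustrative sketch only; the type names, relation set and functions are assumptions for exposition, not LOLITA's actual interfaces:

    -- Illustrative sketch of a semantic-network style store.
    -- All names here are hypothetical, not taken from LOLITA.
    import qualified Data.Map.Strict as M

    type Concept  = String                    -- e.g. "company", "acquire"
    data Relation = IsA | AgentOf | ObjectOf | HasProperty
      deriving (Eq, Ord, Show)

    -- The net maps each concept to its outgoing typed links.
    newtype SemNet = SemNet (M.Map Concept [(Relation, Concept)])
      deriving Show

    emptyNet :: SemNet
    emptyNet = SemNet M.empty

    -- Incorporate one analysed fact into the net.
    assert :: Concept -> Relation -> Concept -> SemNet -> SemNet
    assert s r o (SemNet m) = SemNet (M.insertWith (++) s [(r, o)] m)

    -- Follow links of a given type, the basis of simple reasoning.
    neighbours :: Concept -> Relation -> SemNet -> [Concept]
    neighbours s r (SemNet m) =
      [o | (r', o) <- M.findWithDefault [] s m, r' == r]

    -- A crude rendering of stored links back to text.
    render :: SemNet -> [String]
    render (SemNet m) =
      [ s ++ " --" ++ show r ++ "--> " ++ o
      | (s, links) <- M.toList m, (r, o) <- links ]

    main :: IO ()
    main = mapM_ putStrLn (render (assert "IBM" IsA "company" emptyNet))

The real net was far richer, holding some 90,000 interlinked concepts; the sketch only shows the data-structure idea of concepts joined by typed links.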

Several applications were built using the system, including financial information analysers and information extraction tools for DARPA's "Message Understanding Conference" competitions (MUC-6 and MUC-7). The latter involved processing original Wall Street Journal articles, to perform tasks such as identifying key job changes in businesses and summarising articles. LOLITA was one of a small number of systems worldwide to compete in all sections of the tasks. A system description and an analysis of the MUC-6 results were written by Callaghan (Callaghan, 1998).

LOLITA was an early example of a substantial application written in a functional language: it consisted of around 50,000 lines of Haskell, with around 6,000 lines of C. It was also a complex and demanding application, in which many aspects of Haskell proved invaluable during development.

LOLITA was designed to handle unrestricted text, so that ambiguity at various levels was unavoidable and significant. Laziness was essential in handling the explosion of syntactic ambiguity resulting from a large grammar, and it was much used in handling semantic ambiguity too. The system used multiple "domain-specific embedded languages" for semantic and pragmatic processing and for generation of natural language text from the semantic net. Also important was the ability to work with complex abstractions and to prototype new analysis algorithms quickly. [1]
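
The role of laziness can be illustrated with a deliberately naive sketch (not LOLITA code): the full set of alternative analyses of a phrase is defined as an ordinary list, but only the analyses that are actually inspected ever get built.

    -- Illustrative only: lazy evaluation means the full list of alternative
    -- analyses is *defined* but only *computed* on demand.
    data Tree = Leaf String | Branch Tree Tree
      deriving Show

    -- A naive enumeration of all binary bracketings of a phrase, which
    -- grows exponentially (Catalan numbers) with phrase length.
    bracketings :: [String] -> [Tree]
    bracketings [w] = [Leaf w]
    bracketings ws  =
      [ Branch l r
      | i <- [1 .. length ws - 1]
      , l <- bracketings (take i ws)
      , r <- bracketings (drop i ws)
      ]

    main :: IO ()
    main = do
      let parses = bracketings (words "the man saw the dog in the park with a telescope")
      -- Thanks to laziness only the first few parses are ever built,
      -- even though 'parses' denotes many thousands of alternatives.
      mapM_ print (take 3 parses)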

Later systems based on the same design include Concepts and SenseGraph.

Related Research Articles

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.

A semantic network, or frame network, is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.
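
As a concrete, purely illustrative example, the classic "a canary is a bird, a bird is an animal" fragment can be stored as triples in Haskell, with a transitive lookup over the graph (the data and function names are hypothetical):

    -- A semantic network in its simplest standardized form: a list of
    -- (subject, relation, object) triples.
    type Triple = (String, String, String)

    net :: [Triple]
    net =
      [ ("canary", "is-a", "bird")
      , ("bird",   "is-a", "animal")
      , ("bird",   "can",  "fly")
      ]

    -- Transitive "is-a" lookup over the graph (assumes no cycles).
    isA :: String -> String -> Bool
    isA x y =
      (x, "is-a", y) `elem` net
        || or [ isA z y | (s, r, z) <- net, s == x, r == "is-a" ]

    main :: IO ()
    main = print (isA "canary" "animal")   -- True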

In computer engineering, a hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, most commonly to design ASICs and program FPGAs.

Lexical tokenization is the conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, etc. In the case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is related to the type of tokenization used in large language models (LLMs), but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.
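
A minimal illustrative lexer for a toy programming-language fragment might classify whitespace-separated lexemes like this (a sketch only, assuming very simple token shapes):

    import Data.Char (isAlpha, isDigit)

    -- Lexical categories for a toy language.
    data Token
      = Identifier String
      | Number     String
      | Operator   String
      | Unknown    String
      deriving Show

    classify :: String -> Token
    classify s
      | all isDigit s                      = Number s
      | all isAlpha s                      = Identifier s
      | s `elem` ["+", "-", "*", "/", "="] = Operator s
      | otherwise                          = Unknown s

    -- Tokenize by whitespace, then classify each lexeme.
    tokenize :: String -> [Token]
    tokenize = map classify . words

    main :: IO ()
    main = mapM_ print (tokenize "x = y + 42")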

Syntactic ambiguity, also known as structural ambiguity, amphiboly, or amphibology, is characterized by the potential for a sentence to yield multiple interpretations due to its ambiguous syntax. This form of ambiguity is not derived from the varied meanings of individual words but rather from the relationships among words and clauses within a sentence, concealing interpretations beneath the word order. Consequently, a sentence presents as syntactically ambiguous when it permits reasonable derivation of several possible grammatical structures by an observer.
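
The classic example is "I saw the man with the telescope": the prepositional phrase can attach to the verb or to the noun. The two readings can be written down as simple trees (illustrative Haskell, not the output of any particular parser):

    -- Two structures for the ambiguous sentence
    -- "I saw the man with the telescope".
    data Phrase
      = Word String
      | Node String [Phrase]   -- category label and constituents
      deriving Show

    -- Reading 1: the PP modifies the verb (the seeing used a telescope).
    reading1 :: Phrase
    reading1 =
      Node "S" [ Word "I"
               , Node "VP" [ Word "saw"
                           , Node "NP" [Word "the", Word "man"]
                           , Node "PP" [Word "with", Word "the", Word "telescope"] ] ]

    -- Reading 2: the PP modifies the noun (the man has the telescope).
    reading2 :: Phrase
    reading2 =
      Node "S" [ Word "I"
               , Node "VP" [ Word "saw"
                           , Node "NP" [ Word "the", Word "man"
                                       , Node "PP" [ Word "with", Word "the"
                                                   , Word "telescope" ] ] ] ]

    main :: IO ()
    main = mapM_ print [reading1, reading2]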

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing like automatic annotation and content extraction out of images/audio/video/documents could be seen as information extraction.
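
A toy illustration of the idea, loosely in the spirit of the MUC management-succession task mentioned above (the pattern and record type are hypothetical, not any real system's rules):

    -- Toy information-extraction pattern: pull a structured record out of
    -- sentences of one fixed shape, e.g. "Smith was named CEO of Acme".
    data JobEvent = JobEvent
      { person  :: String
      , role    :: String
      , company :: String
      } deriving Show

    extract :: String -> Maybe JobEvent
    extract sentence =
      case words sentence of
        (p : "was" : "named" : r : "of" : c : _) -> Just (JobEvent p r c)
        _                                        -> Nothing

    main :: IO ()
    main = print (extract "Smith was named CEO of Acme")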

Web development is the work involved in developing a website for the Internet or an intranet. Web development can range from developing a simple single static page of plain text to complex web applications, electronic businesses, and social network services. A more comprehensive list of tasks to which Web development commonly refers, may include Web engineering, Web design, Web content development, client liaison, client-side/server-side scripting, Web server and network security configuration, and e-commerce development.

The Ada Semantic Interface Specification (ASIS) is a layered, open architecture providing vendor-independent access to the Ada Library Environment. It allows for the static analysis of Ada programs and libraries. It is an open, published interface library that consists of the Ada environment and its tools and applications.

Semantic AI is a privately held software company headquartered in San Diego, California, with offices in the National Capital Region. Semantic AI is a Delaware C-corporation that offers patented, graph-based knowledge discovery, analysis and visualization software technology. Its original product is a link analysis software application called Semantica Pro, and it introduced a web-based analytical environment called the Cortex Enterprise Intelligence Platform, or Cortex EIP.

The term conceptual model refers to any model that is formed after a conceptualization or generalization process. Conceptual models are often abstractions of things in the real world, whether physical or social. Semantic studies are relevant to various stages of concept formation. Semantics is fundamentally a study of concepts, the meaning that thinking beings give to various elements of their experience.

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing. The problem is non-trivial, because while some written languages have explicit word boundary markers, such as the word spaces of written English and the distinctive initial, medial and final letter shapes of Arabic, such signals are sometimes ambiguous and not present in all written languages.

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols, the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

A workflow application is a software application that automates, to at least some degree, a process or processes. The processes are usually business-related but can be any process that requires a series of steps to be automated via software. Some steps of the process may require human intervention, such as approval or the development of custom text, but functions that can be automated should be handled by the application. Advanced applications allow users to introduce new components into the operation.

In computer programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.
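
A minimal sketch of the idea in Haskell (not any particular combinator library): a parser is a function from the remaining input to a list of (result, rest) pairs, and combinators build bigger parsers out of smaller ones.

    -- Minimal parser-combinator sketch. Returning a *list* of results means
    -- ambiguity falls out naturally: every successful analysis is kept.
    newtype Parser a = Parser { runParser :: String -> [(a, String)] }

    -- Accept a single character satisfying a predicate.
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy p = Parser $ \s -> case s of
      (c:cs) | p c -> [(c, cs)]
      _            -> []

    char :: Char -> Parser Char
    char c = satisfy (== c)

    -- Sequencing: run one parser, then the next on the remaining input.
    andThen :: Parser a -> Parser b -> Parser (a, b)
    andThen pa pb = Parser $ \s ->
      [ ((a, b), s'') | (a, s') <- runParser pa s, (b, s'') <- runParser pb s' ]

    -- Choice: try both alternatives and keep all successes.
    orElse :: Parser a -> Parser a -> Parser a
    orElse pa pb = Parser $ \s -> runParser pa s ++ runParser pb s

    main :: IO ()
    main = print (runParser (char 'a' `andThen` char 'b') "abc")
    -- prints [(('a','b'),"c")]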

Natural-language user interface is a type of computer human interface where linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications.

Haskell is a general-purpose, statically-typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell has pioneered a number of programming language features such as type classes, which enable type-safe operator overloading, and monadic input/output (IO). It is named after logician Haskell Curry. Haskell's main implementation is the Glasgow Haskell Compiler (GHC).
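
A few of those features in one small, illustrative program: a type class providing overloading, an infinite list that laziness makes safe to define, and monadic IO to print the results.

    -- A type class: overloaded pretty-printing.
    class Pretty a where
      pretty :: a -> String

    instance Pretty Bool where
      pretty True  = "yes"
      pretty False = "no"

    instance Pretty Int where
      pretty n = show n

    -- An infinite list; lazy evaluation builds only the prefix demanded.
    naturals :: [Int]
    naturals = [0 ..]

    main :: IO ()
    main = do
      putStrLn (pretty True)                    -- type-class dispatch
      putStrLn (pretty (sum (take 5 naturals))) -- only 5 elements evaluated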

SemEval is an ongoing series of evaluations of computational semantic analysis systems; it evolved from the Senseval word sense evaluation series. The evaluations are intended to explore the nature of meaning in language. While meaning is intuitive to humans, transferring those intuitions to computational analysis has proved elusive.

The following outline is provided as an overview of and topical guide to natural-language processing.

NetOwl is a suite of multilingual text and identity analytics products that analyze big data in the form of text data – reports, web, social media, etc. – as well as structured entity data about people, organizations, places, and things.

References