The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences (created 1986). [1] It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts. UMLS further provides facilities for natural language processing. It is intended to be used mainly by developers of systems in medical informatics.
UMLS consists of Knowledge Sources (databases) and a set of software tools.
The UMLS was designed and is maintained by the US National Library of Medicine, is updated quarterly and may be used for free. The project was initiated in 1986 by Donald A.B. Lindberg, M.D., then Director of the Library of Medicine, and directed by Betsy Humphreys. [2]
The number of biomedical resources available to researchers is enormous. This is often a problem, because searches of the medical literature retrieve a large volume of documents. The purpose of the UMLS is to enhance access to this literature by facilitating the development of computer systems that understand biomedical language. This is achieved by overcoming two significant barriers: "the variety of ways the same concepts are expressed in different machine-readable sources & by different people" and "the distribution of useful information among many disparate databases & systems".
Users of the system are required to sign a "UMLS agreement" and file brief annual usage reports. Academic users may use the UMLS free of charge for research purposes. Commercial or production use requires copyright licenses for some of the incorporated source vocabularies.
The Metathesaurus forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, all of which stem from the over 100 incorporated controlled vocabularies and classification systems. Some examples of the incorporated controlled vocabularies are CPT, ICD-10, MeSH, SNOMED CT, DSM-IV, LOINC, WHO Adverse Drug Reaction Terminology, UK Clinical Terms, RxNorm, Gene Ontology, and OMIM (see full list).
The Metathesaurus is organized by concept, and each concept has specific attributes defining its meaning and is linked to the corresponding concept names in the various source vocabularies. Numerous relationships between the concepts are represented, for instance hierarchical ones such as "isa" for subclasses and "is part of" for subunits, and associative ones such as "is caused by" or "in the literature often occurs close to" (the latter being derived from Medline).
The scope of the Metathesaurus is determined by the scope of the source vocabularies. If different vocabularies use different names for the same concept, or if they use the same name for different concepts, then this will be faithfully represented in the Metathesaurus. All hierarchical information from the source vocabularies is retained in the Metathesaurus. Metathesaurus concepts can also link to resources outside of the database, for instance gene sequence databases.
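The concept-centred organization described above can be illustrated with a minimal Python sketch. It is only an illustration: the class `Concept`, the function `translate`, the vocabulary labels `VOCAB_A` and `VOCAB_B`, and the concept identifiers are all hypothetical and do not reflect the actual Metathesaurus file layout or its identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept, linked to its names in each source vocabulary (toy model)."""
    concept_id: str  # hypothetical identifier, not a real UMLS identifier
    names: dict = field(default_factory=dict)  # source vocabulary -> list of names


def translate(concepts, name, source, target):
    """Map a name used by `source` to the names `target` uses for the same concept.

    Returns {concept_id: [target names]}. More than one key means the source
    vocabulary uses the same name for different concepts.
    """
    return {
        c.concept_id: c.names.get(target, [])
        for c in concepts
        if name in c.names.get(source, [])
    }


# Hypothetical data: two toy vocabularies that name the same concepts differently.
CONCEPTS = [
    Concept("X001", {"VOCAB_A": ["heart attack"], "VOCAB_B": ["myocardial infarction"]}),
    Concept("X002", {"VOCAB_A": ["cold"], "VOCAB_B": ["common cold"]}),
    Concept("X003", {"VOCAB_A": ["cold"], "VOCAB_B": ["cold temperature"]}),
]

heart = translate(CONCEPTS, "heart attack", "VOCAB_A", "VOCAB_B")
cold = translate(CONCEPTS, "cold", "VOCAB_A", "VOCAB_B")
```

In this sketch, `heart` holds a single concept, while `cold` holds two, showing how an ambiguous name in one vocabulary stays attached to separate concepts rather than being merged.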
Each concept in the Metathesaurus is assigned one or more semantic types (categories), which are linked with one another through semantic relationships. [3] The semantic network is a catalog of these semantic types and relationships. This is a rather broad classification; there are 127 semantic types and 54 relationships in total.
The major semantic types are organisms, anatomical structures, biologic function, chemicals, events, physical objects, and concepts or ideas. The links among semantic types define the structure of the network and show important relationships between the groupings and concepts. The primary link between semantic types is the "isa" link, establishing a hierarchy of types. The network also has 5 major categories of non-hierarchical (or associative) relationships, which constitute the remaining 53 relationship types. These are "physically related to", "spatially related to", "temporally related to", "functionally related to" and "conceptually related to". [3]
The information about a semantic type includes an identifier, definition, examples, hierarchical information about the encompassing semantic type(s), and associative relationships. Associative relationships within the Semantic Network are very weak. They capture at most some-some relationships, i.e. they capture the fact that some instance of the first type may be connected by the salient relationship to some instance of the second type. Phrased differently, they capture the fact that a corresponding relational assertion is meaningful (though it need not be true in all cases).
An example of an associative relationship is "may-cause". Applied to the pair of terms (smoking, lung cancer), it yields the assertion: smoking "may-cause" lung cancer.
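The interplay between the "isa" hierarchy and the weak, some-some associative relationships can be sketched in Python as follows. The type names, the single `may_cause` link and the function names are toy assumptions chosen to mirror the smoking example; they are not taken from the actual Semantic Network. The sketch also assumes that an associative link stated for a type applies to its descendants.

```python
# Toy "isa" hierarchy: child type -> parent type (hypothetical names).
ISA = {
    "Individual Behavior": "Behavior",
    "Behavior": "Event",
    "Neoplastic Process": "Disease",
    "Disease": "Event",
}

# Toy associative links between types: (subject type, relation, object type).
ASSOCIATIVE = {("Behavior", "may_cause", "Disease")}


def ancestors(semantic_type):
    """Yield the type itself, then each type above it in the isa hierarchy."""
    while semantic_type is not None:
        yield semantic_type
        semantic_type = ISA.get(semantic_type)


def is_meaningful(subject_type, relation, object_type):
    """True if the assertion is permitted, i.e. meaningful.

    This does not say the assertion holds for every pair of instances,
    only that it may hold for some of them.
    """
    return any(
        (s, relation, o) in ASSOCIATIVE
        for s in ancestors(subject_type)
        for o in ancestors(object_type)
    )


smoking_causes_cancer = is_meaningful("Individual Behavior", "may_cause", "Neoplastic Process")
cancer_causes_smoking = is_meaningful("Neoplastic Process", "may_cause", "Individual Behavior")
```

Here `smoking_causes_cancer` is true because the link declared between the two parent types is found by walking up the hierarchy, whereas the reversed assertion is rejected as not meaningful in this toy network.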
The SPECIALIST Lexicon contains information about common English vocabulary, biomedical terms, terms found in MEDLINE and terms found in the UMLS Metathesaurus. Each entry contains syntactic (how words are put together to create meaning), morphological (form and structure) and orthographic (spelling) information. A set of Java programs uses the lexicon to work through the variations in biomedical texts by relating words by their parts of speech, which can be helpful in web searches or searches through an electronic medical record.
Entries may be one-word or multiple-word terms. Records contain four parts: the base form (e.g. "run" for "running"); the part of speech (of which SPECIALIST recognizes eleven); a unique identifier; and any available spelling variants. For example, a query for "anesthetic" would return the following: [4]
{
base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
cat=noun
variants=reg
}
{
base=anaesthetic
spelling_variant=anesthetic
entry=E0008770
cat=adj
variants=inv
position=attrib(3)
}
The SPECIALIST lexicon is available in two formats. The "unit record" format can be seen above, and comprises slots and fillers. A slot is the element (e.g. "base=" or "spelling_variant=") and the fillers are the values attributable to that slot for that entry. The "relational table" format is not yet normalized and contains a great deal of redundant data in the files.
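The slot-and-filler structure of the unit record format can be sketched with a small Python parser. This is only an illustration based on the two records shown above, not the parser shipped with the SPECIALIST tools. The function name `parse_unit_records` is hypothetical, and the sketch assumes that fillers contain no whitespace, which will not hold for multiple-word terms.

```python
import re

# The two example records for "anesthetic", as given in the text above.
SAMPLE = (
    "{ base=anaesthetic spelling_variant=anesthetic entry=E0008769 "
    "cat=noun variants=reg } "
    "{ base=anaesthetic spelling_variant=anesthetic entry=E0008770 "
    "cat=adj variants=inv position=attrib(3) }"
)


def parse_unit_records(text):
    """Parse brace-delimited unit records into a list of dicts.

    Each dict maps a slot name to the list of its fillers, because a slot
    such as spelling_variant may occur more than once in one record.
    Assumption: fillers contain no whitespace.
    """
    records = []
    for body in re.findall(r"\{([^{}]*)\}", text):
        record = {}
        for token in body.split():
            slot, _, filler = token.partition("=")
            record.setdefault(slot, []).append(filler)
        records.append(record)
    return records


records = parse_unit_records(SAMPLE)
categories = [r["cat"][0] for r in records]
```

Because the parser splits on any whitespace, it accepts records written on a single line as well as records written with one slot per line.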
Given the size and complexity of the UMLS and its permissive policy on integrating terms, errors are inevitable. [5] Errors include ambiguity and redundancy, hierarchical relationship cycles (a concept is both an ancestor and a descendant of another), missing ancestors (semantic types of parent and child concepts are unrelated), and semantic inversion (the child/parent relationship with the semantic types is not consistent with the concepts). [6]
These errors are discovered and resolved by auditing the UMLS. Manual audits can be very time-consuming and costly, and researchers have attempted to address the issue in a number of ways. Automated tools can be used to search for these errors. For structural inconsistencies (such as loops), a trivial solution based on the order would work. However, the same would not apply when the inconsistency is at the term or concept level (the context-specific meaning of a term). [7] This requires an informed search strategy to be used (knowledge representation).
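Detection of hierarchical relationship cycles, one of the structural inconsistencies mentioned above, can be sketched as a depth-first search over parent-to-child links. This is a generic graph technique shown for illustration, not the auditing method of any of the cited studies. The function name `find_cycle` and the concept identifiers are hypothetical.

```python
def find_cycle(children):
    """Return one cycle as a list of concept ids, or None if there is none.

    `children` maps each concept id to the ids of its direct children.
    A node is "in progress" while its descendants are being explored;
    reaching an in-progress node again means a concept is its own ancestor.
    """
    NEW, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for child in children.get(node, ()):
            child_state = state.get(child, NEW)
            if child_state == IN_PROGRESS:
                # Close the loop: slice the current path from the repeated node.
                return path[path.index(child):] + [child]
            if child_state == NEW:
                found = visit(child)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(children):
        if state.get(node, NEW) == NEW:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical hierarchies: the first contains a loop, the second does not.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
acyclic = {"A": ["B", "C"], "B": ["C"], "C": []}

loop = find_cycle(cyclic)
no_loop = find_cycle(acyclic)
```

The recursion depth grows with the depth of the hierarchy, so a very deep hierarchy would call for an iterative version with an explicit stack.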
In addition to the knowledge sources, the National Library of Medicine also provides supporting tools.
In information science, an ontology encompasses a representation, formal naming, and definitions of the categories, properties, and relations between the concepts, data, or entities that pertain to one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of terms and relational expressions that represent the entities in that subject area. The field which studies ontologies so conceived is sometimes referred to as applied ontology.
A glossary, also known as a vocabulary or clavis, is an alphabetical list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a glossary appears at the end of a book and includes terms within that book that are either newly introduced, uncommon, or specialized. While glossaries are most commonly associated with non-fiction books, fiction novels sometimes include a glossary for unfamiliar terms.
A medical classification is used to transform descriptions of medical diagnoses or procedures into standardized statistical codes in a process known as clinical coding. Diagnosis classifications list diagnosis codes, which are used to track diseases and other health conditions, inclusive of chronic diseases such as diabetes mellitus and heart disease, and infectious diseases such as norovirus, the flu, and athlete's foot. Procedure classifications list procedure codes, which are used to capture interventional data. These diagnosis and procedure codes are used by health care providers, government health programs, private health insurance companies, workers' compensation carriers, software developers, and others for a variety of applications in medicine, public health and medical informatics.
OpenGALEN is a not-for-profit organisation that provides an open source medical terminology. This terminology is written in a formal language called GRAIL and also distributed in OWL.
The Systematized Nomenclature of Medicine (SNOMED) is a systematic, computer-processable collection of medical terms, in human and veterinary medicine, to provide codes, terms, synonyms and definitions which cover anatomy, diseases, findings, procedures, microorganisms, substances, etc. It allows a consistent way to index, store, retrieve, and aggregate medical data across specialties and sites of care. Although now international, SNOMED was started in the U.S. by the College of American Pathologists (CAP) in 1973 and revised into the 1990s. In 2002 CAP's SNOMED Reference Terminology was merged with, and expanded by, the National Health Service's Clinical Terms Version 3 to produce SNOMED CT.
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
Logical Observation Identifiers Names and Codes (LOINC) is a database and universal standard for identifying medical laboratory observations. First developed in 1994, it was created and is maintained by the Regenstrief Institute, a US nonprofit medical research organization. LOINC was created in response to the demand for an electronic clinical care and management database and is publicly available at no cost.
SNOMED CT or SNOMED Clinical Terms is a systematically organized computer-processable collection of medical terms providing codes, terms, synonyms and definitions used in clinical documentation and reporting. SNOMED CT is considered to be the most comprehensive, multilingual clinical healthcare terminology in the world. The primary purpose of SNOMED CT is to encode the meanings that are used in health information and to support the effective clinical recording of data with the aim of improving patient care. SNOMED CT provides the core general terminology for electronic health records. SNOMED CT comprehensive coverage includes: clinical findings, symptoms, diagnoses, procedures, body structures, organisms and other etiologies, substances, pharmaceuticals, devices and specimens.
The Diseases Database is a free website that provides information about the relationships between medical conditions, symptoms, and medications. The database is run by Medical Object Oriented Software Enterprises Ltd, a company based in London.
MEDCIN is a system of standardized medical terminology, a proprietary medical vocabulary developed by Medicomp Systems, Inc. MEDCIN is a point-of-care terminology, intended for use in Electronic Health Record (EHR) systems, and it includes over 280,000 clinical data elements encompassing symptoms, history, physical examination, tests, diagnoses and therapy. This clinical vocabulary contains over 38 years of research and development as well as the capability to cross map to leading codification systems such as SNOMED CT, CPT, ICD-9-CM/ICD-10-CM, DSM, LOINC, CDT, CVX, and the Clinical Care Classification (CCC) System for nursing and allied health.
DeCS – Health Sciences Descriptors is a structured and trilingual thesaurus created by BIREME (Latin American and Caribbean Center on Health Sciences Information) in 1986 for indexing scientific journal articles, books, proceedings of congresses, technical reports and other types of materials, as well as for searching and retrieving scientific information in LILACS, MEDLINE and other databases. In the Virtual Health Library (VHL), DeCS is the tool that permits navigation between records and sources of information through controlled concepts organized in Portuguese, Spanish and English.
Yves A. Lussier is a physician-scientist conducting research in Precision medicine, Translational bioinformatics and Personal Genomics. As a co-founder of Purkinje, he pioneered the commercial use of controlled medical vocabulary organized as directed semantic networks in electronic medical records, as well as Pen computing for clinicians.
NeuroLex is a lexicon of neuroscience concepts supported by the Neuroscience Information Framework project, which is funded by the NIH Blueprint for Neuroscience Research. It is the lexical part of the NIF knowledge base. NeuroLex is intended to make literature review easier and to ensure consistent terminology and usage across researchers for the topics of experimental, clinical, and translational neuroscience, and for genetic and genomic resources. It is structured as a semantic wiki, using Semantic MediaWiki.
Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications to find new relationships between existing knowledge. Literature-based discovery aims to discover new knowledge by connecting information that has been explicitly stated in the literature to deduce connections that have not been explicitly stated.
The Clinical Care Classification (CCC) System is a standardized, coded nursing terminology that identifies the discrete elements of nursing practice. The CCC provides a unique framework and coding structure. It is used for documenting the plan of care, following the nursing process, in all health care settings.
Apache cTAKES: clinical Text Analysis and Knowledge Extraction System is an open-source Natural Language Processing (NLP) system that extracts clinical information from electronic health record unstructured text. It processes clinical notes, identifying types of clinical named entities — drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context, and negated/not negated.
RxNorm is short for "medical prescription normalized".
Carol Friedman is a scientist and biomedical informatician. She is among the pioneers in the use of expert systems in medical language processing and in the explicit medical concept representation that underpins the use of entity–attribute–value modeling in electronic medical records.
Translational bioinformatics (TBI) is a field that emerged in the 2010s to study health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools, which can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining and the analysis of biomedical informatics data in order to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations and interpreting biological information to suggest therapy treatments and predict health outcomes.
Betsy L. Humphreys is an American medical librarian and health informatician known for leading the cross-institutional efforts to establish biomedical terminology standards such as SNOMED CT and the Unified Medical Language System. She was the deputy director of the National Library of Medicine from 2005 until her retirement in 2017, serving as acting director from 2015 to 2016.