Taxonomy

Taxonomy is the practice and science of categorization or classification.

[Figure: Generalized scheme of taxonomy (hierarchical clustering diagram)]

A taxonomy (or taxonomical classification) is a scheme of classification, especially a hierarchical classification, in which things are organized into groups or types. Among other things, a taxonomy can be used to organize and index knowledge (stored as documents, articles, videos, etc.), such as in the form of a library classification system, or a search engine taxonomy, so that users can more easily find the information they are searching for. Many taxonomies are hierarchies (and thus, have an intrinsic tree structure), but not all are.

Originally, taxonomy referred only to the categorisation of organisms or a particular categorisation of organisms. In a wider, more general sense, it may refer to a categorisation of things or concepts, as well as to the principles underlying such a categorisation. Taxonomy organizes taxonomic units known as "taxa" (singular "taxon").

Taxonomy is different from meronomy, which deals with the categorisation of parts of a whole.

Etymology

The word was coined in 1813 by the Swiss botanist A. P. de Candolle and is irregularly compounded from the Greek τάξις, taxis 'order' and νόμος, nomos 'law', connected by the French form -o-; the regular form would be taxinomy, as used in the Greek reborrowing ταξινομία. [1] [2]

Applications

Wikipedia categories form a taxonomy, [3] which can be extracted by automatic means. [4] A 2009 study showed that a manually constructed taxonomy, such as that of computational lexicons like WordNet, can be used to improve and restructure the Wikipedia category taxonomy. [5]

In a broader sense, taxonomy also applies to relationship schemes other than parent-child hierarchies, such as network structures. Taxonomies may then include a single child with multiple parents; for example, "Car" might appear with both parents "Vehicle" and "Steel Mechanisms". To some, however, this merely means that "car" is a part of several different taxonomies. [6] A taxonomy might also simply be an organization of kinds of things into groups, or an alphabetical list; here, however, the term vocabulary is more appropriate. In current usage within knowledge management, taxonomies are considered narrower than ontologies, since ontologies apply a larger variety of relation types. [7]
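The multi-parent case described above can be sketched as a directed graph in which each term maps to a list of parents. This is only an illustration: the "Thing" root and the helper names `ancestors` and `is_strict_hierarchy` are assumptions added for the example, not part of any cited scheme.

```python
from collections import deque

# Hypothetical child -> parents mapping. "Car" has two parents, as in the
# example above; the shared "Thing" root is an assumption for illustration.
PARENTS = {
    "Car": ["Vehicle", "Steel Mechanisms"],
    "Vehicle": ["Thing"],
    "Steel Mechanisms": ["Thing"],
    "Thing": [],
}


def ancestors(term, parents):
    """Return every category reachable by following parent links from term."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another parent
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def is_strict_hierarchy(parents):
    """A strict parent-child hierarchy allows at most one parent per term."""
    return all(len(p) <= 1 for p in parents.values())
```

Under these assumptions, `ancestors("Car", PARENTS)` collects both parent branches, and `is_strict_hierarchy(PARENTS)` is false because "Car" has two parents.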

Mathematically, a hierarchical taxonomy is a tree structure of classifications for a given set of objects. It is also called a containment hierarchy. At the top of this structure is a single classification, the root node, that applies to all objects. Nodes below this root are more specific classifications that apply to subsets of the total set of classified objects. Reasoning proceeds from the general to the more specific.
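A containment hierarchy of this kind can be sketched as a small tree in which each node's extent (the set of objects it applies to) includes the extents of its children. The `Node` class and the `extent` and `path_to` helpers are hypothetical names chosen for this sketch; the animal, mammal, dog and cat labels are borrowed from the biology example used later in the article.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    members: set = field(default_factory=set)  # objects classified directly here

    def add(self, child):
        """Attach a more specific classification below this one."""
        self.children.append(child)
        return child

    def extent(self):
        """All objects this classification applies to, including descendants'."""
        result = set(self.members)
        for child in self.children:
            result |= child.extent()
        return result


def path_to(node, obj, trail=()):
    """Walk from the general to the specific; return the chain of
    classification names leading to obj, or None if obj is not classified."""
    trail = trail + (node.name,)
    if obj in node.members:
        return list(trail)
    for child in node.children:
        found = path_to(child, obj, trail)
        if found:
            return found
    return None


# The root applies to every classified object.
root = Node("animal")
mammal = root.add(Node("mammal"))
dog = mammal.add(Node("dog", members={"Fido"}))
cat = mammal.add(Node("cat", members={"Fluffy"}))
```

In this sketch `dog.extent()` is a subset of `mammal.extent()`, which in turn is a subset of `root.extent()`, mirroring the subset relation described above.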

By contrast, in the context of legal terminology, an open-ended contextual taxonomy is employed: a taxonomy holding only with respect to a specific context. In scenarios taken from the legal domain, a formal account of the open texture of legal terms is modeled, which suggests varying notions of the "core" and "penumbra" of the meanings of a concept. Reasoning proceeds from the specific to the more general. [8]

History

Anthropologists have observed that taxonomies are generally embedded in local cultural and social systems, and serve various social functions. Perhaps the best-known and most influential study of folk taxonomies is Émile Durkheim's The Elementary Forms of Religious Life. A more recent treatment of folk taxonomies (including the results of several decades of empirical research), together with a discussion of their relation to scientific taxonomy, can be found in Scott Atran's Cognitive Foundations of Natural History. Folk taxonomies of organisms have been found in large part to agree with scientific classification, at least for the larger and more obvious species, which means that folk taxonomies are not based purely on utilitarian characteristics. [9]

In the seventeenth century the German mathematician and philosopher Gottfried Leibniz, following the work of the thirteenth-century Majorcan philosopher Ramon Llull on his Ars generalis ultima, a system for procedurally generating concepts by combining a fixed set of ideas, sought to develop an alphabet of human thought. Leibniz intended his characteristica universalis to be an "algebra" capable of expressing all conceptual thought. The concept of creating such a "universal language" was frequently examined in the 17th century, also notably by the English philosopher John Wilkins in his work An Essay towards a Real Character and a Philosophical Language (1668), from which the classification scheme in Roget's Thesaurus ultimately derives.

Taxonomy in various disciplines

Natural sciences

Taxonomy in biology encompasses the description, identification, nomenclature, and classification of organisms. Uses of taxonomy include:

Business and economics

Uses of taxonomy in business and economics include:

Computing

Software engineering

Vegas et al. [10] make a compelling case to advance the knowledge in the field of software engineering through the use of taxonomies. Similarly, Ore et al. [11] provide a systematic methodology to approach taxonomy building in software engineering related topics.

Several taxonomies have been proposed in software testing research to classify techniques, tools, concepts and artifacts. The following are some example taxonomies:

  1. A taxonomy of model-based testing techniques [12]
  2. A taxonomy of static-code analysis tools [13]

Engström et al. [14] suggest and evaluate the use of a taxonomy to bridge the communication between researchers and practitioners engaged in the area of software testing. They have also developed a web-based tool [15] to facilitate and encourage the use of the taxonomy. The tool and its source code are available for public use. [16]

Other uses of taxonomy in computing

Education and academia

Uses of taxonomy in education include:

Safety

Uses of taxonomy in safety include:

Other taxonomies

Research publishing

Citing inadequacies with current practices in listing authors of papers in medical research journals, Drummond Rennie and co-authors called, in a 1997 article in JAMA, the Journal of the American Medical Association, for

a radical conceptual and systematic change, to reflect the realities of multiple authorship and to buttress accountability. We propose dropping the outmoded notion of author in favor of the more useful and realistic one of contributor. [17] :152

Since 2012, several major academic and scientific publishing bodies have mounted Project CRediT to develop a controlled vocabulary of contributor roles. [18] Known as CRediT (Contributor Roles Taxonomy), this is an example of a flat, non-hierarchical taxonomy; however, it does include an optional, broad classification of the degree of contribution: lead, equal or supporting. Amy Brand and co-authors summarise their intended outcome as:

Identifying specific contributions to published research will lead to appropriate credit, fewer author disputes, and fewer disincentives to collaboration and the sharing of data and code. [17] :151

As of mid-2018, this taxonomy apparently restricts its scope to research outputs, specifically journal articles; however, it does rather unusually "hope to … support identification of peer reviewers". [18] (As such, it has not yet defined terms for such roles as editor or author of a chapter in a book of research results.) Version 1, established by the first Working Group in the (northern) autumn of 2014, identifies 14 specific contributor roles using the following defined terms:

Reception has been mixed, with several major publishers and journals planning to have implemented CRediT by the end of 2018, whilst almost as many are not persuaded of the need or value of using it. For example,

The National Academy of Sciences has created a TACS (Transparency in Author Contributions in Science) webpage to list the journals that commit to setting authorship standards, defining responsibilities for corresponding authors, requiring ORCID iDs, and adopting the CRediT taxonomy. [19]

The same webpage has a table listing 21 journals (or families of journals), of which:

The taxonomy is an open standard conforming to the OpenStand principles, [20] and is published under a Creative Commons licence. [18]

Taxonomy for the web

Websites with a well-designed taxonomy or hierarchy are easily understood by users, because users can develop a mental model of the site structure. [21]

Guidelines for writing taxonomy for the web include:

Is-a and has-a relationships, and hyponymy

Two of the predominant types of relationships in knowledge-representation systems are predication and the universally quantified conditional. Predication relationships express the notion that an individual entity is an example of a certain type (for example, John is a bachelor), while universally quantified conditionals express the notion that a type is a subtype of another type (for example, "A dog is a mammal", which means the same as "All dogs are mammals"). [22]
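As a loose analogy only, the two relationship types above resemble instance checks and subclass checks in an object-oriented language. The sketch below uses Python's built-in `isinstance` and `issubclass`; the class names simply mirror the examples in the text, and Python's class system is not claimed to be a faithful model of knowledge-representation semantics.

```python
class Mammal:
    """A type."""


class Dog(Mammal):
    """A subtype: every Dog is also a Mammal."""


class Bachelor:
    """Another type."""


john = Bachelor()

# Predication: an individual is an example of a type ("John is a bachelor").
john_is_bachelor = isinstance(john, Bachelor)

# Universally quantified conditional: a type is a subtype of another type
# ("All dogs are mammals").
dogs_are_mammals = issubclass(Dog, Mammal)

# Consequence of the subtype relation: any individual dog is also a mammal.
any_dog_is_mammal = isinstance(Dog(), Mammal)
```

The first check concerns one individual, while the second is a statement about every possible member of a type, which is the distinction the paragraph above draws.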

The "has-a" relationship is quite different: an elephant has a trunk; a trunk is a part, not a subtype of elephant. The study of part-whole relationships is mereology.

Taxonomies are often represented as is-a hierarchies where each level is more specific than the level above it (in mathematical terms, each level is "a subset of" the level above). For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dogs and cats, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered as instances of a concept. For example, Fido is an instance of the concept dog and Fluffy is a cat. [23]

In linguistics, is-a relations are called hyponymy. When one word describes a category, but another describes some subset of that category, the larger term is called a hypernym with respect to the smaller, and the smaller is called a hyponym with respect to the larger. Such a hyponym, in turn, may have further subcategories for which it is a hypernym. In the simple biology example, dog is a hypernym with respect to its subcategory collie, which in turn is a hypernym with respect to Fido, which is one of its hyponyms. Typically, however, hyponym is used to refer to subcategories rather than single individuals.
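The hypernym and hyponym relations above can be sketched with a mapping from each term to its direct hypernym. The mapping and the function names are assumptions made for this illustration; following the usage note above, only categories (not individuals such as Fido) are included.

```python
# Direct hypernym of each term, following the biology example in the text.
HYPERNYM_OF = {"collie": "dog", "dog": "mammal", "mammal": "animal"}


def hypernyms(term):
    """List the progressively more general terms above term, nearest first."""
    chain = []
    while term in HYPERNYM_OF:
        term = HYPERNYM_OF[term]
        chain.append(term)
    return chain


def hyponyms(term):
    """List every more specific term below term, depth-first."""
    result = []
    for specific, general in HYPERNYM_OF.items():
        if general == term:
            result.append(specific)
            result.extend(hyponyms(specific))
    return result


def is_hyponym_of(specific, general):
    """True if general appears anywhere above specific in the hierarchy."""
    return general in hypernyms(specific)
```

Here the same term can play both roles: "dog" is a hyponym of "mammal" and, at the same time, a hypernym of "collie".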

Research

[Figure: Comparison of categories of small and large populations]

Researchers reported that large populations consistently develop highly similar category systems. This may be relevant to lexical aspects of large communication networks and cultures such as folksonomies and language or human communication, and sense-making in general. [24] [25]

Notes

  1. Oxford English Dictionary. Oxford University Press. 1910. (partially updated December 2021), s.v.
  2. Review of Aperçus de Taxinomie Générale in Nature 60:489–490 (1899). Archived 2023-01-26 at the Wayback Machine.
  3. Zirn, Cäcilia, Vivi Nastase and Michael Strube. 2008. "Distinguishing Between Instances and Classes in the Wikipedia Taxonomy" (video lecture). Archived 2019-12-20 at the Wayback Machine 5th Annual European Semantic Web Conference (ESWC 2008).
  4. S. Ponzetto and M. Strube. 2007. "Deriving a large scale taxonomy from Wikipedia" Archived 2017-08-14 at the Wayback Machine . Proc. of the 22nd Conference on the Advancement of Artificial Intelligence, Vancouver, B.C., Canada, pp. 1440–1445.
  5. S. Ponzetto, R. Navigli. 2009. "Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia". Proc. of the 21st International Joint Conference on Artificial Intelligence (IJCAI 2009), Pasadena, California, pp. 2083–2088.
  6. Jackson, Joab. "Taxonomy's not just design, it's an art," Archived 2020-02-05 at the Wayback Machine Government Computer News (Washington, D.C.). September 2, 2004.
  7. Suryanto, Hendra and Paul Compton. "Learning classification taxonomies from a classification knowledge based system." Archived 2017-08-09 at the Wayback Machine University of Karlsruhe; "Defining 'Taxonomy'," Archived 2017-08-09 at the Wayback Machine Straights Knowledge website.
  8. Grossi, Davide, Frank Dignum and John-Jules Charles Meyer. (2005). "Contextual Taxonomies" in Computational Logic in Multi-Agent Systems, pp. 33–51 [ dead link ].
  9. Kenneth Boulding; Elias Khalil (2002). Evolution, Order and Complexity. Routledge. ISBN   9780203013151. p. 9
  10. Vegas, S. (2009). "Maturing software engineering knowledge through classifications: A case study on unit testing techniques". IEEE Transactions on Software Engineering. 35 (4): 551–565. CiteSeerX   10.1.1.221.7589 . doi:10.1109/TSE.2009.13. S2CID   574495.
  11. Ore, S. (2014). "Critical success factors taxonomy for software process deployment". Software Quality Journal. 22 (1): 21–48. doi:10.1007/s11219-012-9190-y. S2CID   18047921.
  12. Utting, Mark (2012). "A taxonomy of model-based testing approaches". Software Testing, Verification & Reliability. 22 (5): 297–312. doi:10.1002/stvr.456. S2CID   6782211. Archived from the original on 2019-12-20. Retrieved 2017-04-23.
  13. Novak, Jernej (May 2010). "Taxonomy of static code analysis tools". Proceedings of the 33rd International Convention MIPRO: 418–422. Archived from the original on 2022-06-27. Retrieved 2020-03-03.
  14. Engström, Emelie (2016). "SERP-test: a taxonomy for supporting industry–academia communication". Software Quality Journal. 25 (4): 1269–1305. doi:10.1007/s11219-016-9322-x. S2CID   34795073.
  15. "SERP-connect". Archived from the original on 2021-08-28. Retrieved 2021-08-28.
  16. Engstrom, Emelie (4 December 2019). "SERP-connect backend". GitHub . Archived from the original on 10 December 2019. Retrieved 25 October 2016.
  17. Brand, Amy; Allen, Liz; Altman, Micah; Hlava, Marjorie; Scott, Jo (1 April 2015). "Beyond authorship: attribution, contribution, collaboration, and credit". Learned Publishing. 28 (2): 151–155. doi:10.1087/20150211. S2CID 45167271.
  18. "CRediT". CASRAI. 2 May 2018. Archived from the original (online) on 12 June 2018. Retrieved 13 June 2018.
  19. "Transparency in Author Contributions in Science (TACS)" (online). National Academy of Sciences. 2018. Archived from the original on 19 May 2019. Retrieved 13 June 2018.
  20. "OpenStand". OpenStand. Archived from the original on 18 September 2019. Retrieved 13 June 2018.
  21. Morville, Peter; Rosenfeld, Louis (2007). Information Architecture for the World Wide Web (3rd ed.). Sebastopol, CA: O'Reilly. ISBN 9780596527341. OCLC 86110226.
  22. Ronald J. Brachman; What IS-A is and isn't. An Analysis of Taxonomic Links in Semantic Networks Archived 2020-06-30 at the Wayback Machine . IEEE Computer, 16 (10); October 1983.
  23. Brachman, Ronald (October 1983). "What IS-A is and isn't. An Analysis of Taxonomic Links in Semantic Networks". IEEE Computer. 16 (10): 30–36. doi:10.1109/MC.1983.1654194. S2CID   16650410.
  24. "Why independent cultures think alike when it comes to categories: It's not in the brain". phys.org. Archived from the original on 25 January 2021. Retrieved 13 February 2021.
  25. Guilbeault, Douglas; Baronchelli, Andrea; Centola, Damon (12 January 2021). "Experimental evidence for scale-induced category convergence across populations". Nature Communications. 12 (1): 327. Bibcode:2021NatCo..12..327G. doi:10.1038/s41467-020-20037-y. ISSN 2041-1723. PMC 7804416. PMID 33436581. Available under CC BY 4.0.
