Taxonomy is a practice and science concerned with classification or categorization. Typically, there are two parts to it: the development of an underlying scheme of classes (a taxonomy) and the allocation of things to the classes (classification).
Originally, taxonomy referred only to the classification of organisms on the basis of shared characteristics. Today it also has a more general sense. It may refer to the classification of things or concepts, as well as to the principles underlying such work. Thus a taxonomy can be used to organize species, documents, videos or anything else.
A taxonomy organizes taxonomic units known as "taxa" (singular "taxon"). Many are hierarchies.
One function of a taxonomy is to help users more easily find what they are searching for. This may be effected in ways that include a library classification system and a search engine taxonomy.
The word was coined in 1813 by the Swiss botanist A. P. de Candolle and is irregularly compounded from the Greek τάξις, taxis 'order' and νόμος, nomos 'law', connected by the French form -o-; the regular form would be taxinomy, as used in the Greek reborrowing ταξινομία. [1] [2]
Wikipedia categories form a taxonomy, [3] which can be extracted by automatic means. [4] As of 2009, it has been shown that a manually constructed taxonomy, such as that of computational lexicons like WordNet, can be used to improve and restructure the Wikipedia category taxonomy. [5]
In a broader sense, taxonomy also applies to relationship schemes other than parent-child hierarchies, such as network structures. Taxonomies may then include a single child with multiple parents; for example, "Car" might appear with both parents "Vehicle" and "Steel Mechanisms". To some, however, this merely means that "car" is a part of several different taxonomies. [6] A taxonomy might also simply be an organization of kinds of things into groups, or an alphabetical list; here, however, the term vocabulary is more appropriate. In current usage within knowledge management, taxonomies are considered narrower than ontologies since ontologies apply a larger variety of relation types. [7]
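A network-style taxonomy in which a term has more than one parent can be sketched as a mapping from each term to its list of parents. The following Python is only an illustration: the mapping `PARENTS`, the function `ancestors`, and the shared top term "Thing" are assumptions introduced here, not part of any cited scheme.

```python
# Child -> parents mapping; a term may have more than one parent.
# "Thing" is a hypothetical common root added for the example.
PARENTS = {
    "Car": ["Vehicle", "Steel Mechanisms"],
    "Vehicle": ["Thing"],
    "Steel Mechanisms": ["Thing"],
    "Thing": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every broader term reachable from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:  # a shared ancestor is visited only once
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


print(sorted(ancestors("Car")))  # ['Steel Mechanisms', 'Thing', 'Vehicle']
```

Because "Car" has two parents, the structure is a directed acyclic graph rather than a tree, which is why the traversal tracks the terms it has already seen.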
Mathematically, a hierarchical taxonomy is a tree structure of classifications for a given set of objects. It is also named containment hierarchy. At the top of this structure is a single classification, the root node, that applies to all objects. Nodes below this root are more specific classifications that apply to subsets of the total set of classified objects. The progress of reasoning proceeds from the general to the more specific.
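As a minimal sketch of this tree structure, the following Python models each classification as a node whose objects form a subset of its parent's objects. The class name `TaxonNode`, its methods, and the sample data are illustrative assumptions rather than any standard API.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonNode:
    """One classification in a hierarchical taxonomy (hypothetical helper)."""
    name: str
    members: set[str]
    children: list["TaxonNode"] = field(default_factory=list)

    def add_child(self, name: str, members: set[str]) -> "TaxonNode":
        # A more specific classification may only cover a subset of its parent.
        if not members <= self.members:
            raise ValueError(f"{name!r} is not contained in {self.name!r}")
        child = TaxonNode(name, set(members))
        self.children.append(child)
        return child

    def classify(self, obj: str) -> list[str]:
        """Return the classifications of `obj`, from general to specific."""
        if obj not in self.members:
            return []
        for child in self.children:
            sub = child.classify(obj)
            if sub:
                return [self.name] + sub
        return [self.name]


# Illustrative data: the root applies to every classified object.
root = TaxonNode("animal", {"Fido", "Fluffy", "Tweety"})
mammal = root.add_child("mammal", {"Fido", "Fluffy"})
mammal.add_child("dog", {"Fido"})
mammal.add_child("cat", {"Fluffy"})

print(root.classify("Fido"))    # ['animal', 'mammal', 'dog']
print(root.classify("Tweety"))  # ['animal']
```

The returned path runs from the root downward, mirroring reasoning that proceeds from the general to the more specific.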
By contrast, in the context of legal terminology, an open-ended contextual taxonomy is employed—a taxonomy holding only with respect to a specific context. In scenarios taken from the legal domain, a formal account of the open-texture of legal terms is modeled, which suggests varying notions of the "core" and "penumbra" of the meanings of a concept. The progress of reasoning proceeds from the specific to the more general. [8]
Anthropologists have observed that taxonomies are generally embedded in local cultural and social systems, and serve various social functions. Perhaps the best-known and most influential study of folk taxonomies is Émile Durkheim's The Elementary Forms of Religious Life. A more recent treatment of folk taxonomies (including the results of several decades of empirical research) and the discussion of their relation to the scientific taxonomy can be found in Scott Atran's Cognitive Foundations of Natural History. Folk taxonomies of organisms have been found in large part to agree with scientific classification, at least for the larger and more obvious species, which means that folk taxonomies are not based purely on utilitarian characteristics. [9]
In the seventeenth century the German mathematician and philosopher Gottfried Leibniz, following the work of the thirteenth-century Majorcan philosopher Ramon Llull on his Ars generalis ultima, a system for procedurally generating concepts by combining a fixed set of ideas, sought to develop an alphabet of human thought. Leibniz intended his characteristica universalis to be an "algebra" capable of expressing all conceptual thought. The concept of creating such a "universal language" was frequently examined in the 17th century, also notably by the English philosopher John Wilkins in his work An Essay towards a Real Character and a Philosophical Language (1668), from which the classification scheme in Roget's Thesaurus ultimately derives.
Taxonomy in biology encompasses the description, identification, nomenclature, and classification of organisms. Uses of taxonomy include:
Uses of taxonomy in business and economics include:
Vegas et al. [10] make a compelling case to advance the knowledge in the field of software engineering through the use of taxonomies. Similarly, Ore et al. [11] provide a systematic methodology to approach taxonomy building in software engineering related topics.
Several taxonomies have been proposed in software testing research to classify techniques, tools, concepts and artifacts. The following are some example taxonomies:
Engström et al. [14] suggest and evaluate the use of a taxonomy to bridge the communication between researchers and practitioners engaged in the area of software testing. They have also developed a web-based tool [15] to facilitate and encourage the use of the taxonomy. The tool and its source code are available for public use. [16]
Uses of taxonomy in education include:
Uses of taxonomy in safety include:
Citing inadequacies with current practices in listing authors of papers in medical research journals, Drummond Rennie and co-authors called, in a 1997 article in JAMA (the Journal of the American Medical Association), for
a radical conceptual and systematic change, to reflect the realities of multiple authorship and to buttress accountability. We propose dropping the outmoded notion of author in favor of the more useful and realistic one of contributor. [17] : 152
In 2012, several major academic and scientific publishing bodies mounted Project CRediT to develop a controlled vocabulary of contributor roles. [18] Known as CRediT (Contributor Roles Taxonomy), this is an example of a flat, non-hierarchical taxonomy; however, it does include an optional, broad classification of the degree of contribution: lead, equal or supporting. Amy Brand and co-authors summarise their intended outcome as:
Identifying specific contributions to published research will lead to appropriate credit, fewer author disputes, and fewer disincentives to collaboration and the sharing of data and code. [17] : 151
CRediT comprises 14 specific contributor roles using the following defined terms:
The taxonomy is an open standard conforming to the OpenStand principles, [19] and is published under a Creative Commons licence. [18]
Websites with a well-designed taxonomy or hierarchy are easily understood by users, because users can develop a mental model of the site structure. [20]
Guidelines for writing taxonomy for the web include:
Frederick Suppe [21] distinguished two senses of classification: a broad meaning, which he called "conceptual classification" and a narrow meaning, which he called "systematic classification".
About conceptual classification Suppe wrote: [21] : 292 "Classification is intrinsic to the use of language, hence to most if not all communication. Whenever we use nominative phrases we are classifying the designated subject as being importantly similar to other entities bearing the same designation; that is, we classify them together. Similarly the use of predicative phrases classifies actions or properties as being of a particular kind. We call this conceptual classification, since it refers to the classification involved in conceptualizing our experiences and surroundings"
About systematic classification Suppe wrote: [21] : 292 "A second, narrower sense of classification is the systematic classification involved in the design and utilization of taxonomic schemes such as the biological classification of animals and plants by genus and species."
Two of the predominant types of relationships in knowledge-representation systems are predication and the universally quantified conditional. Predication relationships express the notion that an individual entity is an example of a certain type (for example, John is a bachelor), while universally quantified conditionals express the notion that a type is a subtype of another type (for example, "A dog is a mammal", which means the same as "All dogs are mammals"). [22]
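The difference between the two relationship types can be sketched with finite sets, where a type is modelled as the set of individuals it applies to. This is only an illustration under that assumption; the individuals other than John are invented for the example.

```python
# Types modelled as sets of individuals (illustrative data only).
bachelor = {"John"}
dog = {"Fido", "Rex"}
mammal = {"Fido", "Rex", "Fluffy"}

# Predication: an individual is an example of a type.
john_is_a_bachelor = "John" in bachelor

# Universally quantified conditional: for every x, dog(x) implies mammal(x).
all_dogs_are_mammals = all(x in mammal for x in dog)

# For finite sets the conditional coincides with the subset test.
same_as_subset = all_dogs_are_mammals == (dog <= mammal)

print(john_is_a_bachelor, all_dogs_are_mammals, same_as_subset)  # True True True
```

Predication relates an individual to a type (membership), whereas the quantified conditional relates one type to another (inclusion).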
The "has-a" relationship is quite different: an elephant has a trunk; a trunk is a part, not a subtype of elephant. The study of part-whole relationships is mereology.
Taxonomies are often represented as is-a hierarchies where each level is more specific than the level above it (in mathematical language is "a subset of" the level above). For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dogs and cats, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered as instances of a concept. For example, Fido is-an instance of the concept dog and Fluffy is-a cat. [23]
In linguistics, is-a relations are called hyponymy. When one word describes a category, but another describes some subset of that category, the larger term is called a "hypernym" with respect to the smaller, and the smaller is called a "hyponym" with respect to the larger. Such a hyponym, in turn, may have further subcategories for which it is a hypernym. In the simple biology example, dog is a hypernym with respect to its subcategory collie, which in turn is a hypernym with respect to Fido, which is one of its hyponyms. Typically, however, the terms are applied to relations between categories rather than to single individuals.
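The chain in the collie example can be sketched as a lookup table from each term to its immediate hypernym. The names `HYPERNYM_OF`, `hypernyms` and `is_hyponym_of` are hypothetical helpers, and the single-parent table is a simplifying assumption.

```python
# Each term points to its immediate hypernym (data taken from the examples above).
HYPERNYM_OF = {
    "Fido": "collie",
    "collie": "dog",
    "dog": "mammal",
    "mammal": "animal",
}


def hypernyms(term):
    """Return all broader terms of `term`, nearest first."""
    chain = []
    while term in HYPERNYM_OF:
        term = HYPERNYM_OF[term]
        chain.append(term)
    return chain


def is_hyponym_of(narrow, broad):
    """True if `broad` appears anywhere above `narrow` in the chain."""
    return broad in hypernyms(narrow)


print(hypernyms("collie"))            # ['dog', 'mammal', 'animal']
print(is_hyponym_of("collie", "dog")) # True
print(is_hyponym_of("dog", "collie")) # False
```

The relation is transitive but not symmetric: collie is a hyponym of animal, while dog is not a hyponym of collie.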
Researchers reported that large populations consistently develop highly similar category systems. This may be relevant to lexical aspects of large communication networks and cultures such as folksonomies and language or human communication, and sense-making in general. [24] [25]
Hull (1998) suggested "The fundamental elements of any classification are its theoretical commitments, basic units and the criteria for ordering these basic units into a classification". [26]
There is a widespread opinion in knowledge organization and related fields that such classes correspond to concepts. We can, for example, classify "waterfowl" into the classes "ducks", "geese", and "swans"; we can also say, however, that the concept "waterfowl" is a generic broader term in relation to the concepts "ducks", "geese", and "swans". This example demonstrates the close relationship between classification theory and concept theory. A main opponent of concepts as units is Barry Smith. [27] Arp, Smith and Spear (2015) discuss ontologies and criticize the conceptualist understanding. [28] : 5ff The book states (p. 7): “The code assigned to France, for example, is ISO 3166 – 2:FR and the code is assigned to France itself — to the country that is otherwise referred to as Frankreich or Ranska. It is not assigned to the concept of France (whatever that might be).” Smith's alternative to concepts as units is based on a realist orientation: when scientists make successful claims about the types of entities that exist in reality, they are referring to objectively existing entities which realist philosophers call universals or natural kinds. Smith's main argument - with which many followers of the concept theory agree - seems to be that classes cannot be determined by introspective methods, but must be based on scientific and scholarly research. Whether units are called concepts or universals, the problem is to decide when a thing (say a "blackbird") should be considered a natural class. In the case of blackbirds, for example, recent DNA analyses have led to a reconsideration of the concept (or universal) "blackbird" and found that what was formerly considered one species (with subspecies) is in reality many different species, which have merely evolved similar characteristics in adapting to their ecological niches. [29] : 141
An important argument for considering concepts the basis of classification is that concepts are subject to change and that they change when scientific revolutions occur. Our concepts of many birds, for example, have changed with recent developments in DNA analysis and the influence of the cladistic paradigm - and have demanded new classifications. Smith's example of France demands an explanation. First, France is not a general concept, but an individual concept. Next, the legal definition of France is determined by the conventions that France has made with other countries. It is still a concept, however, as Leclercq (1978) demonstrates with the corresponding concept Europe. [30]
Hull (1998) continued: [26] "Two fundamentally different sorts of classification are those that reflect structural organization and those that are systematically related to historical development." What is meant is that, in biological classification, classification by the anatomical traits of organisms is one kind, and classification in relation to the evolution of species is another (in the section below, we expand these two fundamental sorts of classification to four). Hull adds that in biological classification, evolution supplies the theoretical orientation. [26]
Ereshefsky (2000) presented and discussed three general philosophical schools of classification: "essentialism, cluster analysis, and historical classification. Essentialism sorts entities according to causal relations rather than their intrinsic qualitative features." [31]
These three categories may, however, be considered parts of broader philosophies. Four main approaches to classification may be distinguished: (1) logical and rationalist approaches, including "essentialism"; (2) empiricist approaches, including cluster analysis; (3) historical and hermeneutical approaches, including Ereshefsky's "historical classification"; and (4) pragmatic, functionalist and teleological approaches (not covered by Ereshefsky). It is important to notice that empiricism is not the same as empirical study, but a certain ideal of doing empirical studies: with the exception of the logical approaches, all of these are based on empirical studies, but they base their studies on different philosophical principles. In addition there are combined approaches (e.g., the so-called "evolutionary taxonomy", which mixes historical and empiricist principles).
Logical division [32] (top-down classification or downward classification) is an approach that divides a class into subclasses and then divides subclasses into their subclasses, and so on, which finally forms a tree of classes. The root of the tree is the original class, and the leaves of the tree are the final classes. Plato advocated a method based on dichotomy, which was rejected by Aristotle and replaced by the method of definitions based on genus, species, and specific difference. [33] The method of facet analysis (cf., faceted classification) is primarily based on logical division. [34] This approach tends to classify according to "essential" characteristics, a widely discussed and criticized concept (cf., essentialism). These methods may overall be related to the rationalist theory of knowledge.
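One common way to state the criteria of logical division is that the subclasses must be mutually exclusive and jointly coextensive with the class they divide. The following Python sketch checks both conditions for finite sets; the function name `is_logical_division` and the sample species are assumptions made for illustration.

```python
from itertools import combinations


def is_logical_division(parent, subclasses):
    """True if the subclasses are pairwise disjoint and together equal `parent`."""
    parts = list(subclasses.values())
    exclusive = all(a.isdisjoint(b) for a, b in combinations(parts, 2))
    exhaustive = set().union(*parts) == parent
    return exclusive and exhaustive


# Illustrative data only.
waterfowl = {"mallard", "greylag goose", "mute swan"}

good = {
    "ducks": {"mallard"},
    "geese": {"greylag goose"},
    "swans": {"mute swan"},
}
overlapping = {
    "ducks": {"mallard"},
    "white birds": {"mute swan"},
    "swans": {"mute swan"},
}

print(is_logical_division(waterfowl, good))         # True
print(is_logical_division(waterfowl, overlapping))  # False
```

The second division fails on both counts: two subclasses share a member, and the greylag goose is left unclassified.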
"Empiricism alone is not enough: a healthy advance in taxonomy depends on a sound theoretical foundation" [35] : 548
Phenetics or numerical taxonomy [36] is by contrast bottom-up classification, where the starting point is a set of items or individuals, which are classified by putting those with shared characteristics as members of a narrow class and proceeding upward. Numerical taxonomy is an approach based solely on observable, measurable similarities and differences of the things to be classified. Classification is based on overall similarity: the elements that are most alike in most attributes are classified together. But it is based on statistics, and therefore does not fulfill the criteria of logical division (e.g., to produce classes that are mutually exclusive and jointly coextensive with the class they divide). Some people will argue that this is not classification/taxonomy at all, but such an argument must consider the definitions of classification (see above). These methods may overall be related to the empiricist theory of knowledge.
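Bottom-up grouping by overall similarity can be sketched as a small agglomerative procedure. The Python below is an illustration under stated assumptions, not a reference implementation of numerical taxonomy: the character table `TRAITS` is invented, similarity is the simple matching coefficient, and clusters are compared by average pairwise similarity.

```python
from itertools import combinations

# Hypothetical binary characters: feathers, flies, fur, swims, lays eggs.
TRAITS = {
    "sparrow": (1, 1, 0, 0, 1),
    "robin":   (1, 1, 0, 0, 1),
    "penguin": (1, 0, 0, 1, 1),
    "otter":   (0, 0, 1, 1, 0),
}


def similarity(a, b):
    """Simple matching coefficient: share of characters with equal states."""
    matches = sum(x == y for x, y in zip(TRAITS[a], TRAITS[b]))
    return matches / len(TRAITS[a])


def cluster_similarity(c1, c2):
    """Average pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def agglomerate(items):
    """Repeatedly merge the two most similar clusters; return the merges in order."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append(c1 | c2)
    return merges


for merged in agglomerate(TRAITS):
    print(sorted(merged))
# ['robin', 'sparrow']
# ['penguin', 'robin', 'sparrow']
# ['otter', 'penguin', 'robin', 'sparrow']
```

The narrowest class forms first from the most similar individuals and broader classes are built upward from it. Nothing in the procedure consults a prior scheme of classes.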
Genealogical classification is classification of items according to their common heritage. This must also be done on the basis of some empirical characteristics, but these characteristics are developed by the theory of evolution. Charles Darwin's [37] main contribution to classification theory was not just his claim "... all true classification is genealogical ..." but that he provided operational guidance for classification. [38] : 90–92 Genealogical classification is not restricted to biology, but is also much used in, for example, classification of languages, and may be considered a general approach to classification. These methods may overall be related to the historicist theory of knowledge. One of the main schools of historical classification is cladistics, which is today dominant in biological taxonomy, but also applied to other domains.
The historical and hermeneutical approaches are not restricted to the development of the object of classification (e.g., animal species) but are also concerned with the subject of classification (the classifiers) and their embeddedness in scientific traditions and other human cultures.
Pragmatic classification (and functional [39] and teleological classification) is the classification of items that emphasizes the goals, purposes, consequences, [40] interests, values and politics of classification. It is, for example, classifying animals into wild animals, pests, domesticated animals and pets. Kitchenware (tools, utensils, appliances, dishes, and cookware used in food preparation, or the serving of food) is also an example of a classification which is not based on any of the above-mentioned three methods, but clearly on pragmatic or functional criteria. Bonaccorsi, et al. (2019) discuss the general theory of functional classification and applications of this approach to patent classification. [39] Although the examples may suggest that pragmatic classifications are primitive compared to established scientific classifications, they must be considered in relation to the pragmatic and critical theory of knowledge, which considers all knowledge as influenced by interests. [41] Ridley (1986) wrote: [42] : 191 "teleological classification. Classification of groups by their shared purposes, or functions, in life - where purpose can be identified with adaptation. An imperfectly worked-out, occasionally suggested, theoretically possible principle of classification that differs from the two main such principles, phenetic and phylogenetic classification".
Natural classification is a concept closely related to the concept of natural kind. Carl Linnaeus is often recognized as the first scholar to have clearly differentiated "artificial" and "natural" classifications. [43] [44] A natural classification is one that, using Plato's metaphor, is “carving nature at its joints”. [45] Although Linnaeus considered natural classification the ideal, he recognized that his own system (at least partly) represented an artificial classification.
John Stuart Mill explained the artificial nature of the Linnaean classification and suggested the following definition of a natural classification:
"The Linnæan arrangement answers the purpose of making us think together of all those kinds of plants, which possess the same number of stamens and pistils; but to think of them in that manner is of little use, since we seldom have anything to affirm in common of the plants which have a given number of stamens and pistils." [46] : 498 "The ends of scientific classification are best answered, when the objects are formed into groups respecting which a greater number of general propositions can be made, and those propositions more important, than could be made respecting any other groups into which the same things could be distributed." [46] : 499 "A classification thus formed is properly scientific or philosophical, and is commonly called a Natural, in contradistinction to a Technical or Artificial, classification or arrangement." [46] : 499
Ridley (1986) provided the following definitions: [42]
Stamos (2004) [47] : 138 wrote: "The fact is, modern scientists classify atoms into elements based on proton number rather than anything else because it alone is the causally privileged factor [gold is atomic number 79 in the periodic table because it has 79 protons in its nucleus]. Thus nature itself has supplied the causal monistic essentialism. Scientists in their turn simply discover and follow (where "simply" ≠ "easily")."
The periodic table is the classification of the chemical elements which is in particular associated with Dmitri Mendeleev (cf., History of the periodic table). An authoritative work on this system is Scerri (2020). [48] Hubert Feger (2001; numbered listing added) wrote about it: [49] : 1967–1968 "A well-known, still used, and expanding classification is Mendeleev's Table of Elements. It can be viewed as a prototype of all taxonomies in that it satisfies the following evaluative criteria:
Bursten (2020) wrote, however "Hepler-Smith, a historian of chemistry, and I, a philosopher whose work often draws on chemistry, found common ground in a shared frustration with our disciplines’ emphases on the chemical elements as the stereotypical example of a natural kind. The frustration we shared was that while the elements did display many hallmarks of paradigmatic kindhood, elements were not the kinds of kinds that generated interesting challenges for classification in chemistry, nor even were they the kinds of kinds that occupied much contemporary critical chemical thought. Compounds, complexes, reaction pathways, substrates, solutions – these were the kinds of the chemistry laboratory, and rarely if ever did they slot neatly into taxonomies in the orderly manner of classification suggested by the Periodic Table of Elements. A focus on the rational and historical basis of the development of the Periodic Table had made the received view of chemical classification appear far more pristine, and far less interesting, than either of us believed it to be." [50]
Linnaean taxonomy is the particular form of biological classification (taxonomy) set up by Carl Linnaeus, as set forth in his Systema Naturae (1735) and subsequent works. A major discussion in the scientific literature is whether a system that was constructed before Charles Darwin's theory of evolution can still be fruitful and reflect the development of life. [51] [52]
Astronomy is a fine example of how Kuhn's (1962) theory of scientific revolutions (or paradigm shifts) influences classification. [53] For example:
Hornbostel–Sachs is a system of musical instrument classification devised by Erich Moritz von Hornbostel and Curt Sachs, and first published in 1914. [54] In the original classification, the top categories are:
A fifth top category,
Each top category is subdivided and Hornbostel-Sachs is a very comprehensive classification of musical instruments with wide applications. In Wikipedia, for example, all musical instruments are organized according to this classification.
In contrast to, for example, the astronomical and biological classifications presented above, the Hornbostel-Sachs classification seems very little influenced by research in musicology and organology. It is based on huge collections of musical instruments, but seems more a system imposed upon the universe of instruments than a system with organic connections to scholarly theory. It may therefore be interpreted as a system based on logical division and rationalist philosophy.
Diagnostic and Statistical Manual of Mental Disorders (DSM) is a classification of mental disorders published by the American Psychiatric Association (APA). The first edition of the DSM was published in 1952, [55] and the newest, fifth edition was published in 2013. [56] In contrast to, for example, the periodic table and the Hornbostel-Sachs classification, the principles for classification have changed much during its history. The first edition was influenced by psychodynamic theory. The DSM-III, published in 1980, [57] adopted an atheoretical, “descriptive” approach to classification. [58] The system is very important for all people involved in psychiatry, whether as patients, researchers or therapists (in addition to insurance companies), but the system is strongly criticized and does not have the same scientific status as many other classifications. [59]
A hierarchy is an arrangement of items that are represented as being "above", "below", or "at the same level as" one another. Hierarchy is an important concept in a wide variety of fields, such as architecture, philosophy, design, mathematics, computer science, organizational theory, systems theory, systematic biology, and the social sciences.
Knowledge representation and reasoning is a field of artificial intelligence (AI) dedicated to representing information about the world in a form that a computer system can use to solve complex tasks, such as diagnosing a medical condition or having a natural-language dialog. Knowledge representation incorporates findings from psychology about how humans solve problems and represent knowledge, in order to design formalisms that make complex systems easier to design and build. Knowledge representation and reasoning also incorporates findings from logic to automate various kinds of reasoning.
Linnaean taxonomy can mean either of two related concepts:
In biology, taxonomy is the scientific study of naming, defining (circumscribing) and classifying groups of biological organisms based on shared characteristics. Organisms are grouped into taxa and these groups are given a taxonomic rank; groups of a given rank can be aggregated to form a more inclusive group of higher rank, thus creating a taxonomic hierarchy. The principal ranks in modern use are domain, kingdom, phylum, class, order, family, genus, and species. The Swedish botanist Carl Linnaeus is regarded as the founder of the current system of taxonomy, as he developed a ranked system known as Linnaean taxonomy for categorizing organisms and binomial nomenclature for naming organisms.
WordNet is a lexical database of semantic relations between words that links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. It can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. It was first created in the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website. There are now WordNets in more than 200 languages.
In information science, an ontology encompasses a representation, formal naming, and definitions of the categories, properties, and relations between the concepts, data, or entities that pertain to one, many, or all domains of discourse. More simply, an ontology is a way of showing the properties of a subject area and how they are related, by defining a set of terms and relational expressions that represent the entities in that subject area. The field which studies ontologies so conceived is sometimes referred to as applied ontology.
Classification is the activity of assigning objects to some pre-existing classes or categories. This is distinct from the task of establishing the classes themselves. Examples include diagnostic tests, identifying spam emails and deciding whether to give someone a driving license.
Nomenclature is a system of names or terms, or the rules for forming these terms in a particular field of arts or sciences. The principles of naming vary from the relatively informal conventions of everyday speech to the internationally agreed principles, rules, and recommendations that govern the formation and use of the specialist terminology used in scientific and any other disciplines.
A folk taxonomy is a vernacular naming system, as distinct from scientific taxonomy. Folk biological classification is the way people traditionally describe and organize the world around them, typically making generous use of form taxa such as "shrubs", "bugs", "ducks", "fish", "algae", "vegetables", or of economic criteria such as "game animals", "pack animals", "weeds" and other like terms.
In information science and ontology, a classification scheme is an arrangement of classes or groups of classes. The activity of developing the schemes bears similarity to taxonomy, but with perhaps a more theoretical bent, as a single classification scheme can be applied over a wide semantic spectrum while taxonomies tend to be devoted to a single topic.
In philosophy, the term formal ontology is used to refer to an ontology defined by axioms in a formal language with the goal to provide an unbiased view on reality, which can help the modeler of domain- or application-specific ontologies to avoid possibly erroneous ontological assumptions encountered in modeling large-scale ontologies.
Gellish is an ontology language for data storage and communication, designed and developed by Andries van Renssen since the mid-1990s. It started out as an engineering modeling language but evolved into a universal and extendable conceptual data modeling language with general applications. Because it includes domain-specific terminology and definitions, it is also a semantic data modelling language and the Gellish modeling methodology is a member of the family of semantic modeling methodologies.
Knowledge organization (KO), organization of knowledge, organization of information, or information organization is an intellectual discipline concerned with activities such as document description, indexing, and classification that serve to provide systems of representation and order for knowledge and information objects. According to The Organization of Information by Joudrey and Taylor, information organization:
examines the activities carried out and tools used by people who work in places that accumulate information resources for the use of humankind, both immediately and for posterity. It discusses the processes that are in place to make resources findable, whether someone is searching for a single known item or is browsing through hundreds of resources just hoping to discover something useful. Information organization supports a myriad of information-seeking scenarios.
Frames are an artificial intelligence data structure used to divide knowledge into substructures by representing "stereotyped situations".
Ontology learning is the automatic or semi-automatic creation of ontologies, including extracting the corresponding domain's terms and the relationships between the concepts that these terms represent from a corpus of natural language text, and encoding them with an ontology language for easy retrieval. As building ontologies manually is extremely labor-intensive and time-consuming, there is great motivation to automate the process.
Contemporary ontologies share many structural similarities, regardless of the ontology language in which they are expressed. Most ontologies describe individuals (instances), classes (concepts), attributes, and relations.
In library and information science, documents are classified and searched by subject, as well as by other attributes such as author, genre and document type. This makes "subject" a fundamental term in this field. Library and information specialists assign subject labels to documents to make them findable. There are many ways to do this, and there is not always consensus about which subject should be assigned to a given document. To optimize subject indexing and searching, we need to have a deeper understanding of what a subject is. The question "what is to be understood by the statement 'document A belongs to subject category X'?" has been debated in the field for more than 100 years.
The following outline is provided as an overview of and topical guide to natural-language processing:
Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence.
The Information Coding Classification (ICC) is a classification system covering almost all of the 6500 extant knowledge fields. Its conceptualization goes beyond the scope of the well-known library classification systems, such as Dewey Decimal Classification (DDC), Universal Decimal Classification (UDC), and Library of Congress Classification (LCC), by extending also to knowledge systems that have not so far been used to classify literature. ICC presents a flexible universal ordering system for both literature and other kinds of information, set out as knowledge fields. From a methodological point of view, ICC differs from the above-mentioned systems along the following three lines: