Automated species identification

Automated species identification is a method of making the expertise of taxonomists available to ecologists, parataxonomists and others via digital technology and artificial intelligence. Today, most automated identification systems rely on images of the species to be identified.[1] A classifier is trained on precisely identified images of each species. Once it has been exposed to a sufficient amount of training data, the classifier can identify the trained species in previously unseen images.
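
The train-then-identify workflow can be illustrated with a minimal sketch. It assumes that each image has already been reduced to a short numeric feature vector; the nearest-centroid rule, the function names, the species labels and the measurements are all hypothetical choices made for illustration, and are not taken from any of the systems described in this article.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """samples: list of (feature_vector, species_label) pairs.
    Returns one mean feature vector (centroid) per species."""
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(vectors) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def identify(centroids, features):
    """Return the species whose centroid lies closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical two-number features (for example, wing length and width in mm).
training = [
    ((1.0, 0.4), "species_a"),
    ((1.2, 0.5), "species_a"),
    ((2.9, 1.1), "species_b"),
    ((3.1, 0.9), "species_b"),
]
centroids = train_centroids(training)

# A previously unseen specimen is assigned to the nearest trained species.
prediction = identify(centroids, (2.8, 1.0))
```

Real systems typically learn far richer representations from the images themselves, but the division into a training step and an identification step is the same.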

Introduction

The automated identification of biological objects such as insects (individuals) and/or groups (e.g., species, guilds, characters) has been a dream among systematists for centuries. The goal of some of the first multivariate biometric methods was to address the perennial problem of group discrimination and inter-group characterization. Despite much preliminary work in the 1950s and '60s, progress in designing and implementing practical systems for fully automated identification of biological objects has proven frustratingly slow. As recently as 2004, Dan Janzen [2] updated the dream for a new audience:

The spaceship lands. He steps out. He points it around. It says 'friendly–unfriendly—edible–poisonous—safe–dangerous—living–inanimate'. On the next sweep it says 'Quercus oleoides—Homo sapiens—Spondias mombin—Solanum nigrum—Crotalus durissus—Morpho peleides—serpentine'. This has been in my head since reading science fiction in ninth grade half a century ago.

The species identification problem

DFE - the graphical interface of the Daisy system. The image is the wing of a biting midge, Culicoides sp.; some Culicoides species are vectors of Bluetongue. Others may also be vectors of Schmallenberg virus, an emerging disease of livestock, especially sheep.
(Credit: Mark A. O'Neill)

Janzen's preferred solution to this classic problem involved building machines to identify species from their DNA. However, recent developments in computer architectures, as well as innovations in software design, have placed the tools needed to realize Janzen's vision in the hands of the systematics and computer science communities not several years hence, but now; and not just for creating DNA barcodes, but also for identification based on digital images.

A survey published in 2004 [3] examined why automated species identification had not become widely employed at that time and whether it would be a realistic option for the future. The authors found that "a small but growing number of studies sought to develop automated species identification systems based on morphological characters". An overview of 20 studies analyzing species' structures, such as cells, pollen, wings, and genitalia, showed identification success rates between 40% and 100% on training sets with 1 to 72 species. However, the authors also identified four fundamental problems with these systems: (1) training sets: these were too small (5–10 specimens per species), and extending them may be difficult, especially for rare species; (2) errors in identification: these had not been studied sufficiently to handle them or to find systematic patterns in them; (3) scaling: studies considered only small numbers of species (<200 species); and (4) novel species: systems are restricted to the species they have been trained on and will classify any novel observation as one of the known species.

A survey published in 2017 [4] systematically compared and discussed progress and findings in automated plant species identification over the preceding decade (2005–2015). During this period, 120 primary studies were published in high-quality venues, mainly by authors with a computer science background. These studies propose a wealth of computer vision approaches, i.e., features that reduce the high dimensionality of the pixel-based image data while preserving the characteristic information, as well as classification methods. The vast majority of these studies analyze leaves for identification, while only 13 propose methods for flower-based identification, because leaves are easier to collect and image and are available for most of the year. The proposed features capture generic object characteristics, i.e., shape, texture, and color, as well as leaf-specific characteristics, i.e., venation and margin. The majority of studies still evaluated their methods on datasets containing no more than 250 species. However, there is progress in this regard: one study uses a dataset with more than 2,000 species [5] and another with more than 20,000 [6].
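
How a feature reduces pixel data to a few characteristic numbers can be shown with a small sketch. It assumes a leaf image that has already been segmented into a binary mask; the two shape descriptors (bounding-box aspect ratio and extent), the function name and the toy mask are illustrative assumptions, not the specific features proposed in the surveyed studies.

```python
def shape_features(mask):
    """mask: rows of 0/1 values, where 1 marks a leaf pixel.
    Returns (aspect_ratio, extent) of the foreground region."""
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, value in enumerate(row)
        if value
    ]
    if not coords:
        raise ValueError("mask contains no foreground pixels")
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    aspect_ratio = width / height            # elongated vs. compact outline
    extent = len(coords) / (width * height)  # share of the bounding box filled
    return aspect_ratio, extent


# Hypothetical 4 x 6 pixel mask of a roughly diamond-shaped leaf.
toy_leaf = [
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 0],
]
features = shape_features(toy_leaf)
```

Here 24 pixel values are condensed into two numbers that a classifier can compare across specimens; texture, color, venation and margin features follow the same idea with different measurements.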

A system developed in 2022 [7] showed that automated identification can achieve accuracy high enough for use in an automated insect surveillance system based on electronic traps. With classifiers trained on a few hundred images, it correctly identified fruit flies, and it can be used for continuous monitoring aimed at detecting species invasions or pest outbreaks. Several aspects contribute to the success of this system. Primarily, using e-traps provides a standardized setting: even though the traps are deployed in different countries and regions, the visual variability in terms of size, view angle, and illumination is controlled. This suggests that trap-based systems may be easier to develop than free-view systems for automatic pest identification.

There is a shortage of specialists who can identify the very biodiversity whose preservation has become a global concern. In commenting on this problem in palaeontology in 1993, Roger Kaesler [8] recognized:

"... we are running out of systematic palaeontologists who have anything approaching synoptic knowledge of a major group of organisms ... Palaeontologists of the next century are unlikely to have the luxury of dealing at length with taxonomic problems ... Palaeontology will have to sustain its level of excitement without the aid of systematists, who have contributed so much to its success."

This expertise deficiency cuts as deeply into those commercial industries that rely on accurate identifications (e.g., agriculture, biostratigraphy) as it does into a wide range of pure and applied research programmes (e.g., conservation, biological oceanography, climatology, ecology). It is also commonly, though informally, acknowledged that the technical, taxonomic literature of all organismal groups is littered with examples of inconsistent and incorrect identifications. This is due to a variety of factors, including taxonomists being insufficiently trained and skilled in making identifications (e.g., using different rules-of-thumb in recognizing the boundaries between similar groups), insufficiently detailed original group descriptions and/or illustrations, inadequate access to current monographs and well-curated collections and, of course, taxonomists having different opinions regarding group concepts. Peer review only weeds out the most obvious errors of commission or omission in this area, and then only when an author provides adequate representations (e.g., illustrations, recordings, and gene sequences) of the specimens in question.

Systematics too has much to gain from the further development and use of automated identification systems. In order to attract both personnel and resources, systematics must transform itself into a "large, coordinated, international scientific enterprise".[9] Many have identified use of the Internet, especially via the World Wide Web, as the medium through which this transformation can be made. While establishment of a virtual, GenBank-like system for accessing morphological data, audio clips, video files and so forth would be a significant step in the right direction, improved access to observational information and/or text-based descriptions alone will not successfully address either the taxonomic impediment or the problem of low identification reproducibility. Instead, the inevitable subjectivity associated with making critical decisions on the basis of qualitative criteria must be reduced or, at the very least, embedded within a more formally analytic context.

SDS protein gel images of sphinx moth caterpillars. Such gels can be used in a similar way to DNA fingerprinting.

Properly designed, flexible, and robust automated identification systems, organized around distributed computing architectures and referenced to authoritatively identified collections of training set data (e.g., images and gene sequences), can, in principle, provide all systematists with access to the electronic data archives and the necessary analytic tools to handle routine identifications of common taxa. Properly designed systems can also recognize when their algorithms cannot make a reliable identification and refer that image to a specialist (whose address can be accessed from another database). Such systems can also include elements of artificial intelligence and so improve their performance the more they are used. Once morphological (or molecular) models of a species have been developed and demonstrated to be accurate, these models can be queried to determine which aspects of the observed patterns of variation and variation limits are being used to achieve the identification, thus opening the way for the discovery of new and (potentially) more reliable taxonomic characters.
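
The idea of a system that recognizes when it cannot make a reliable identification can be sketched as a simple reject option. The sketch assumes that a classifier outputs a confidence score between 0 and 1 for each known species; the function name, the threshold of 0.8 and the species labels are hypothetical and chosen only for illustration.

```python
def identify_or_refer(scores, min_score=0.8):
    """scores: mapping of species name -> classifier confidence in [0, 1].
    Returns the best-scoring species, or None to signal that the image
    should be referred to a specialist."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


# A clear-cut case is identified automatically.
confident = identify_or_refer({"species_a": 0.93, "species_b": 0.07})

# An ambiguous case falls below the threshold and is referred instead.
uncertain = identify_or_refer({"species_a": 0.55, "species_b": 0.45})
```

A rule of this kind trades coverage for reliability: raising the threshold sends more images to specialists but reduces the number of confident misidentifications. It does not by itself solve the novel-species problem, since a classifier may still assign a high score to a species it was never trained on.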

References cited

  1. Wäldchen, Jana; Mäder, Patrick (November 2018). Cooper, Natalie (ed.). "Machine learning for image based species identification". Methods in Ecology and Evolution. 9 (11): 2216–2225. doi:10.1111/2041-210X.13075. S2CID 91666577.
  2. Janzen, Daniel H. (March 22, 2004). "Now is the time". Philosophical Transactions of the Royal Society of London. B. 359 (1444): 731–732. doi:10.1098/rstb.2003.1444. PMC 1693358. PMID 15253359.
  3. Gaston, Kevin J.; O'Neill, Mark A. (March 22, 2004). "Automated species recognition: why not?". Philosophical Transactions of the Royal Society of London. B. 359 (1444): 655–667. doi:10.1098/rstb.2003.1442. PMC 1693351. PMID 15253351.
  4. Wäldchen, Jana; Mäder, Patrick (2017-01-07). "Plant Species Identification Using Computer Vision Techniques: A Systematic Literature Review". Archives of Computational Methods in Engineering. 25 (2): 507–543. doi:10.1007/s11831-016-9206-z. ISSN 1134-3060. PMC 6003396. PMID 29962832.
  5. Joly, Alexis; Goëau, Hervé; Bonnet, Pierre; Bakić, Vera; Barbe, Julien; Selmi, Souheil; Yahiaoui, Itheri; Carré, Jennifer; Mouysset, Elise (2014-09-01). "Interactive plant identification based on social image data". Ecological Informatics. Special Issue on Multimedia in Ecology and Environment. 23: 22–34. doi:10.1016/j.ecoinf.2013.07.006.
  6. Wu, Huisi; Wang, Lei; Zhang, Feng; Wen, Zhenkun (2015-08-01). "Automatic Leaf Recognition from a Big Hierarchical Image Database". International Journal of Intelligent Systems. 30 (8): 871–886. doi:10.1002/int.21729. ISSN 1098-111X. S2CID 12917626.
  7. Diller, Yoshua; Shamsian, Aviv; Shaked, Ben; Altman, Yam; Danziger, Bat-Chen; Manrakhan, Aruna; Serfontein, Leani; Bali, Elma; Wernicke, Matthias; Egartner, Alois; Colacci, Marco; Sciarretta, Andrea; Chechik, Gal; Alchanatis, Victor; Papadopoulos, Nikos T. (2022-06-28). "A real-time remote surveillance system for fruit flies of economic importance: sensitivity and image analysis" (PDF). Journal of Pest Science. 96 (2): 611–622. doi:10.1007/s10340-022-01528-x. ISSN 1612-4766. S2CID 250127830.
  8. Kaesler, Roger L. (1993). "A window of opportunity: peering into a new century of palaeontology". Journal of Paleontology. 67 (3): 329–333. Bibcode:1993JPal...67..329K. doi:10.1017/S0022336000036805. JSTOR 1306022. S2CID 133097253.
  9. Wheeler, Quentin D. (2003). "Transforming taxonomy" (PDF). The Systematist (22): 3–5.
  10. "iNaturalist Computer Vision Explorations". iNaturalist.org. 2017-07-27. Retrieved 2017-08-12.
  11. "How Google Photos tells the difference between dogs, cats, bears, and any other animal in your photos". 2015-06-04.
  12. MLMU.cz - FlowerChecker: Exciting journey of one ML startup – O. Veselý & J. Řihák, retrieved 2022-01-12
  13. "Tvůrci FlowerCheckeru spouštějí Shazam pro kytky. Plant.id staví na AI". 7 May 2018. Archived from the original on 12 May 2018. Retrieved 11 May 2018.

Among the species identification systems with public home pages, the SPIDA and DAISY systems are essentially generic and capable of classifying any image material presented. The ABIS and DrawWing systems are restricted to insects with membranous wings, as they operate by matching a specific set of characters based on wing venation.
