Biodiversity informatics

Biodiversity informatics is the application of informatics techniques to biodiversity information, such as taxonomy, biogeography or ecology. It is defined as the application of information technology to the management, algorithmic exploration, analysis and interpretation of primary data regarding life, particularly at the species level of organization. [1] Modern computer techniques can yield new ways to view and analyze existing information, as well as predict future situations (see niche modelling). The term was coined only around 1992, but with rapidly growing data sets the discipline has become useful in numerous studies and applications, such as the construction of taxonomic databases and geographic information systems. Biodiversity informatics contrasts with "bioinformatics", a term often used synonymously with the computerized handling of data in the specialized area of molecular biology.

Overview

Biodiversity informatics (distinct from, but linked to, bioinformatics) is the application of information technology methods to the problems of organizing, accessing, visualizing and analyzing primary biodiversity data. Primary biodiversity data comprise names, observations and records of specimens, and the genetic and morphological data associated with a specimen. Biodiversity informatics may also have to cope with managing information from unnamed taxa, such as that produced by environmental sampling and sequencing of mixed-field samples. The term is also used to cover the computational problems specific to the names of biological entities. These include the development of algorithms to cope with variant representations of identifiers such as species names and authorities, the multiple classification schemes within which these entities may reside according to the preferences of different workers in the field, and the syntax and semantics by which the content of taxonomic databases can be made machine queryable and interoperable for biodiversity informatics purposes...

History of the discipline

Biodiversity informatics can be considered to have commenced with the construction of the first computerized taxonomic databases in the early 1970s. It progressed through the subsequent development of distributed search tools towards the late 1990s, including the Species Analyst from the University of Kansas, the North American Biodiversity Information Network (NABIN), CONABIO in Mexico, INBio in Costa Rica, and others; [2] the establishment of the Global Biodiversity Information Facility in 2001; and the parallel development of a variety of niche modelling and other tools to operate on digitized biodiversity data from the mid-1980s onwards (e.g. see [3]). In September 2000, the U.S. journal Science devoted a special issue to "Bioinformatics for Biodiversity"; [4] the journal Biodiversity Informatics commenced publication in 2004; and several international conferences through the 2000s have brought together biodiversity informatics practitioners, including the London e-Biosphere conference in June 2009. A supplement to the journal BMC Bioinformatics (Volume 10 Suppl 14 [5]), published in November 2009, also deals with biodiversity informatics.

History of the term

According to correspondence reproduced by Walter Berendsohn, [6] the term "Biodiversity Informatics" was coined by John Whiting in 1992 to cover the activities of an entity known as the Canadian Biodiversity Informatics Consortium, a group involved with fusing basic biodiversity information with environmental economics and geospatial information in the form of GPS and GIS. Subsequently, the term appears to have lost any obligate connection with the GPS/GIS world and to have become associated with the computerized management of any aspect of biodiversity information (e.g. see [7]).

Digital taxonomy (systematics)

Global list of all species

One major goal for biodiversity informatics is the creation of a complete master list of currently recognised species of the world. This goal has been achieved to a large extent by the Catalogue of Life project, which lists more than 2 million species in its 2022 Annual Checklist. [8] A similar effort for fossil taxa, the Paleobiology Database, [9] documents more than 100,000 names for fossil species, out of an unknown total number.

Genus and species scientific names as unique identifiers

Application of the Linnaean system of binomial nomenclature for species, and uninomials for genera and higher ranks, has led to many advantages but also problems with homonyms (the same name being used for multiple taxa, either inadvertently or legitimately across multiple kingdoms), synonyms (multiple names for the same taxon), as well as variant representations of the same name due to orthographic differences, minor spelling errors, variation in the manner of citation of author names and dates, and more. In addition, names can change through time on account of changing taxonomic opinions (for example, the correct generic placement of a species, or the elevation of a subspecies to species rank or vice versa), and also the circumscription of a taxon can change according to different authors' taxonomic concepts. One proposed solution to this problem is the usage of Life Science Identifiers (LSIDs) for machine-machine communication purposes, although there are both proponents and opponents of this approach.
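The variant-representation problem described above can be illustrated with a minimal sketch. It assumes a simplified rule (the first two tokens of a name string are the genus and the specific epithet, and everything after them is authorship) and uses the general-purpose similarity matching in Python's standard `difflib` module. The function names, the cutoff value and the reference list are hypothetical; dedicated taxonomic name-matching tools use considerably more elaborate rules.

```python
import difflib
import re
from typing import List, Optional


def canonical_name(raw: str) -> str:
    """Reduce a name string to a lowercase 'genus epithet' form.

    Simplifying assumption: the first two tokens are the genus and the
    specific epithet; trailing author names and dates are dropped.
    """
    tokens = re.sub(r"[^A-Za-z\s-]", " ", raw).split()
    return " ".join(tokens[:2]).lower()


def match_name(raw: str, reference: List[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the reference name closest to `raw`, or None if none is close enough."""
    by_canonical = {canonical_name(name): name for name in reference}
    hits = difflib.get_close_matches(
        canonical_name(raw), list(by_canonical), n=1, cutoff=cutoff
    )
    return by_canonical[hits[0]] if hits else None


# Hypothetical reference list standing in for a taxonomic database.
reference_names = ["Puma concolor", "Panthera leo", "Quercus robur"]

print(match_name("Puma concolour (Linnaeus, 1771)", reference_names))  # spelling variant
print(match_name("PANTHERA LEO L.", reference_names))                  # case and author variant
print(match_name("Homo sapiens", reference_names))                     # not in the list
```

String similarity alone cannot separate homonyms or resolve synonyms, since both require taxonomic knowledge rather than spelling comparison; the sketch addresses only orthographic and citation variants.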

A consensus classification of organisms

Organisms can be classified in a multitude of ways (see main page Biological classification), which can create design problems for Biodiversity Informatics systems aimed at incorporating either a single or multiple classifications to suit the needs of users, or at guiding them towards a single "preferred" system. Whether a single consensus classification system can ever be achieved remains an open question; however, the Catalogue of Life commissioned activity in this area, [10] which was followed by a published system proposed in 2015 by M. Ruggiero and co-workers. [11]

Biodiversity Maps

Data Flow Diagram for Biodiversity Map data collection. Shows: Collectors and Maintainers of Spatio-Temporal Species Data and types of data used in Biodiversity Maps. Individual contributors supply range maps for species, common habitats for a given species, and local adaptation information. Larger organizations supply aggregated checklists and distribution information from individual contributors as well as any survey data from studies. Point databases hold point records that describe exact location, species, and characteristics of a sighting.
A species richness map is a type of Biodiversity map that uses color to show quantity or density of species in an area. This map shows the counts of bird species across the Americas. Darker blues represent richer areas.

Biodiversity maps provide a cartographic representation of spatial biodiversity data. [12] These data can be used in conjunction with Species Checklists to help with biodiversity conservation efforts. Biodiversity maps can help reveal patterns of species distribution and range changes, which may reflect biodiversity loss, habitat degradation, or changes in species composition. Combined with urban development data, maps can inform land management by modeling scenarios which might impact biodiversity.

Biodiversity maps can be produced in a variety of ways. Traditionally, range maps were hand-drawn based on literature reports, but increasingly large-scale data are used, e.g. from citizen science projects (such as iNaturalist) and digitized museum collections (such as VertNet). GIS tools such as ArcGIS, or R packages such as dismo, can aid in species distribution modeling (ecological niche modeling) and even predict impacts of ecological change on biodiversity. [13] GBIF, OBIS, and IUCN maintain large web-based repositories of species spatio-temporal data that are the source of many existing biodiversity maps.
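As a rough illustration of how point records can feed a species richness map, the following sketch bins occurrence points into square latitude/longitude cells and counts the distinct species in each cell. The records, the function name and the one-degree cell size are hypothetical, and the approach is deliberately simplistic: cells defined in degrees are not equal in area, and raw counts are not corrected for uneven sampling effort.

```python
import math
from collections import defaultdict

# Hypothetical point records: (species, latitude, longitude) in decimal degrees.
records = [
    ("Turdus migratorius", 40.7, -74.0),
    ("Cardinalis cardinalis", 40.2, -74.9),
    ("Turdus migratorius", 40.9, -74.3),
    ("Ara macao", -3.4, -60.1),
]


def richness_grid(points, cell_size=1.0):
    """Count distinct species per square cell of `cell_size` degrees.

    Each cell is keyed by the (row, column) index of its south-west corner.
    """
    species_in_cell = defaultdict(set)
    for species, lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        species_in_cell[cell].add(species)
    return {cell: len(names) for cell, names in species_in_cell.items()}


grid = richness_grid(records)
for cell, count in sorted(grid.items()):
    print(cell, count)
```

The resulting dictionary of cell counts is the kind of table a mapping tool would then colour by value to produce a richness map.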

| Biodiversity Maps | Description | Link |
| --- | --- | --- |
| Map of Life (MOL) | A scalable web platform geared for large biodiversity and environmental data [14] | mol.org |
| The Map of Biodiversity Importance | Identifies areas of biodiversity importance critical to preventing extinctions in the contiguous United States | https://www.natureserve.org/map-biodiversity-importance |
| Biodiversity Maps (National Biodiversity Data Centre) | An overview of the state of knowledge on the distribution of Ireland's biodiversity | https://maps.biodiversityireland.ie/ |
| Saving Nature | Biodiversity Maps that depict patterns to guide conservation efforts | https://savingnature.com/our-biodiversity-maps/ |

Mobilizing primary biodiversity information

"Primary" biodiversity information can be considered the basic data on the occurrence and diversity of species (or indeed, any recognizable taxa), commonly in association with information regarding their distribution in either space, time, or both. Such information may be in the form of retained specimens and associated information, for example as assembled in the natural history collections of museums and herbaria, or as observational records, for example either from formal faunal or floristic surveys undertaken by professional biologists and students, or as amateur and other planned or unplanned observations including those increasingly coming under the scope of citizen science. Providing online, coherent digital access to this vast collection of disparate primary data is a core Biodiversity Informatics function that is at the heart of regional and global biodiversity data networks, examples of the latter including OBIS and GBIF.

As a secondary source of biodiversity data, relevant scientific literature can be parsed either by humans or (potentially) by specialized information retrieval algorithms to extract the relevant primary biodiversity information that is reported therein, sometimes in aggregated / summary form but frequently as primary observations in narrative or tabular form. Elements of such activity (such as extracting key taxonomic identifiers, keywording / index terms, etc.) have been practiced for many years at a higher level by selected academic databases and search engines. However, for the maximum Biodiversity Informatics value, the actual primary occurrence data should ideally be retrieved and then made available in a standardized form or forms; for example both the Plazi and INOTAXA projects are transforming taxonomic literature into XML formats that can then be read by client applications, the former using TaxonX-XML [15] and the latter using the taXMLit format. The Biodiversity Heritage Library is also making significant progress in its aim to digitize substantial portions of the out-of-copyright taxonomic literature, which is then subjected to optical character recognition (OCR) so as to be amenable to further processing using biodiversity informatics tools.
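A very small sketch of the name-extraction step mentioned above is given here. It assumes a naive heuristic, a capitalised word followed by a lowercase word, filtered against a list of known genus names; the pattern, the function name and the genus list are hypothetical and do not reproduce the method of Plazi, INOTAXA or any other named project.

```python
import re

# Candidate pattern: capitalised word followed by a lowercase word,
# e.g. "Puma concolor". On its own it also matches ordinary phrases
# such as "The survey", hence the genus filter below.
CANDIDATE = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def find_binomials(text, known_genera):
    """Return candidate binomials whose first word is a known genus name."""
    found = []
    for genus, epithet in CANDIDATE.findall(text):
        if genus in known_genera:
            found.append(f"{genus} {epithet}")
    return found


# Hypothetical sentence and genus list for illustration.
text = (
    "The survey recorded Puma concolor near the river, "
    "and Quercus robur dominated the canopy."
)
names = find_binomials(text, {"Puma", "Quercus"})
print(names)
```

Real literature poses further difficulties that this heuristic ignores, including abbreviated genus names, OCR errors and names split across line breaks.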

Standards and protocols

In common with other data-related disciplines, Biodiversity Informatics benefits from the adoption of appropriate standards and protocols in order to support machine-machine transmission and interoperability of information within its particular domain. Examples of relevant standards include the Darwin Core XML schema for specimen- and observation-based biodiversity data developed from 1998 onwards, plus extensions of the same, Taxonomic Concept Transfer Schema, [16] plus standards for Structured Descriptive Data, [17] and Access to Biological Collection Data (ABCD); [18] while data retrieval and transfer protocols include DiGIR (now mostly superseded) and TAPIR (TDWG Access Protocol for Information Retrieval). [19] Many of these standards and protocols are currently maintained, and their development overseen, by Biodiversity Information Standards (TDWG).
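To make the idea of a shared vocabulary concrete, the sketch below builds a single occurrence record whose field names are Darwin Core terms and serialises it as CSV. The choice of CSV rather than XML, the subset of terms, the identifier value and the `problems` helper are assumptions made for illustration; the Darwin Core documentation maintained by TDWG is the authority on terms and recommended values.

```python
import csv
import io

# A small subset of Darwin Core terms used as column names.
DWC_FIELDS = [
    "occurrenceID", "basisOfRecord", "scientificName",
    "eventDate", "decimalLatitude", "decimalLongitude",
]

record = {
    "occurrenceID": "urn:example:occurrence:0001",  # hypothetical identifier
    "basisOfRecord": "HumanObservation",
    "scientificName": "Puma concolor (Linnaeus, 1771)",
    "eventDate": "2021-06-14",
    "decimalLatitude": "-33.45",
    "decimalLongitude": "-70.66",
}


def problems(rec):
    """Return a list of basic problems found in a record (hypothetical checks)."""
    issues = [f"missing {field}" for field in DWC_FIELDS if not rec.get(field)]
    try:
        lat = float(rec.get("decimalLatitude"))
        lon = float(rec.get("decimalLongitude"))
    except (TypeError, ValueError):
        issues.append("coordinates are not numeric")
        return issues
    if not -90 <= lat <= 90:
        issues.append("decimalLatitude out of range")
    if not -180 <= lon <= 180:
        issues.append("decimalLongitude out of range")
    return issues


# Serialise the record with the term names as the CSV header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS)
writer.writeheader()
writer.writerow(record)
csv_text = buffer.getvalue()

print(problems(record))
print(csv_text)
```

Because both sender and receiver agree on the meaning of each column name, a file like this can be exchanged between systems without a bespoke mapping for every pair of databases.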

Current activities

At the 2009 e-Biosphere conference in the U.K., [20] the following themes were adopted, which are indicative of the broad range of current Biodiversity Informatics activities and how they might be categorized:

A post-conference workshop of key persons with current significant Biodiversity Informatics roles also resulted in a Workshop Resolution that stressed, among other aspects, the need to create durable, global registries for the resources that are basic to biodiversity informatics (e.g., repositories, collections); complete the construction of a solid taxonomic infrastructure; and create ontologies for biodiversity data. [21]

Example projects

Global:

Regional / national projects:

A listing of over 600 current biodiversity informatics related activities can be found at the TDWG "Biodiversity Information Projects of the World" database. [22]

See also

Related Research Articles

Bioinformatics: Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

Biological database

Biological databases are libraries of biological sciences, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, clinical effects of mutations as well as similarities of biological sequences and structures.

Integrated Taxonomic Information System: Authoritative taxonomic information on plants, animals, fungi, and microbes

The Integrated Taxonomic Information System (ITIS) is an American partnership of federal agencies designed to provide consistent and reliable information on the taxonomy of biological species. ITIS was originally formed in 1996 as an interagency group within the US federal government, involving several US federal agencies, and has now become an international body, with Canadian and Mexican government agencies participating. The database draws from a large community of taxonomic experts. Primary content staff are housed at the Smithsonian National Museum of Natural History and IT services are provided by a US Geological Survey facility in Denver. The primary focus of ITIS is North American species, but many biological groups exist worldwide and ITIS collaborates with other agencies to increase its global coverage.

The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis. GO is part of a larger classification effort, the Open Biomedical Ontologies, being one of the Initial Candidate Members of the OBO Foundry.

Biodiversity Information Standards (TDWG), originally called the Taxonomic Databases Working Group, is a non-profit scientific and educational association that works to develop open standards for the exchange of biodiversity data, facilitating biodiversity informatics. It is affiliated with the International Union of Biological Sciences. It is best known for the Darwin Core standard for exchanging biodiversity data, which has been used by the Global Biodiversity Information Facility to collect millions of biological observations from museums and other organizations from around the world.

Global Biodiversity Information Facility: Aggregator of scientific data on biodiversity; data portal

The Global Biodiversity Information Facility (GBIF) is an international organisation that focuses on making scientific data on biodiversity available via the Internet using web services. The data are provided by many institutions from around the world; GBIF's information architecture makes these data accessible and searchable through a single portal. Data available through the GBIF portal are primarily distribution data on plants, animals, fungi, and microbes for the world, and scientific names data.

Life Science Identifiers are a way to name and locate pieces of information on the web. Essentially, an LSID is a unique identifier for some data, and the LSID protocol specifies a standard way to locate the data. They are a little like DOIs used by many publishers.

The Access to Biological Collections Data (ABCD) schema is a highly structured data exchange and access model for taxon occurrence data, i.e. primary biodiversity data.

The National Centre for Text Mining (NaCTeM) is a publicly funded text mining (TM) centre. It was established to provide support, advice and information on TM technologies and to disseminate information from the larger TM community, while also providing services and tools in response to the requirements of the United Kingdom academic community.

The World Register of Marine Species (WoRMS) is a taxonomic database that aims to provide an authoritative and comprehensive list of names of marine organisms.

Darwin Core is an extension of Dublin Core for biodiversity informatics. It is meant to provide a stable standard reference for sharing information on biological diversity (biodiversity). The terms described in this standard are a part of a larger set of vocabularies and technical specifications under development and maintained by Biodiversity Information Standards (TDWG).

A taxonomic database is a database created to hold information on biological taxa – for example groups of organisms organized by species name or other taxonomic identifier – for efficient data management and information retrieval. Taxonomic databases are routinely used for the automated construction of biological checklists such as floras and faunas, both for print publication and online; to underpin the operation of web-based species information systems; as a part of biological collection management; and, in some cases, to provide the taxon management component of broader science or biology information systems. They are also a fundamental contribution to the discipline of biodiversity informatics.

Plazi is a Swiss-based international non-profit association supporting and promoting the development of persistent and openly accessible digital bio-taxonomic literature. Plazi is a cofounder of the Biodiversity Literature Repository and maintains this digital taxonomic literature repository at Zenodo to provide access to FAIR data converted from taxonomic publications using the TreatmentBank service. It also enhances submitted taxonomic treatments by creating a version in the XML format Taxpub, and educates about the importance of maintaining open access to scientific discourse and data. It is a contributor to the evolving e-taxonomy in the field of Biodiversity Informatics.

AnimalBase is a project brought to life in 2004 and is maintained by the University of Göttingen, Germany. The goal of the AnimalBase project is to digitize early zoological literature, provide copyright-free open access to zoological works, and provide manually verified lists of names of zoological genera and species as a free resource for the public. AnimalBase contributed to opening up the classical taxonomic literature, which is considered as useful because access to early literature can be difficult for researchers who need the old sources for their taxonomic research.

Catherine N. Norton: American librarian

Catherine Norton was an American librarian. She was the first Director of Information Systems at the Marine Biological Laboratory (MBL).

Plinian Core is a set of vocabulary terms that can be used to describe different aspects of biological species information. Under "biological species information", all kinds of properties or traits related to taxa, biological and non-biological, are included. Thus, for instance, terms pertaining to descriptions, legal aspects, conservation, management, demographics, nomenclature, or related resources are incorporated.

Interim Register of Marine and Nonmarine Genera: Taxonomic database

The Interim Register of Marine and Nonmarine Genera (IRMNG) is a taxonomic database which attempts to cover published genus names for all domains of life, from 1758 in zoology up to the present, arranged in a single, internally consistent taxonomic hierarchy, for the benefit of Biodiversity Informatics initiatives plus general users of biodiversity (taxonomic) information. In addition to containing just over 500,000 published genus name instances as at May 2023, the database holds over 1.7 million species names, although this component of the data is not maintained in as current or complete state as the genus-level holdings. IRMNG can be queried online for access to the latest version of the dataset and is also made available as periodic snapshots or data dumps for import/upload into other systems as desired. The database was commenced in 2006 at the then CSIRO Division of Marine and Atmospheric Research in Australia and, since 2016, has been hosted at the Flanders Marine Institute (VLIZ) in Belgium.

Avibase is an online taxonomic database that organizes bird taxonomic and distribution data globally. The database relies on the notion of taxonomic concepts rather than taxonomic names. Avibase incorporates and organizes taxonomic data from the main avian taxonomic publishers and other regional sources. Taxonomic concepts in over 230 different taxonomic sources have been mapped and cross-referenced to Avibase concepts.

Tony Rees (scientist)

Anthony J. J. ("Tony") Rees is a British-born software developer, data manager and biologist resident in Australia since 1986, and previously a data manager with CSIRO Marine and Atmospheric Research. He is responsible for developing a number of software systems currently used in science data management, including c-squares, Taxamatch, and IRMNG, the Interim Register of Marine and Nonmarine Genera. He has also been closely involved with the development of other biodiversity informatics initiatives including the Ocean Biogeographic Information System (OBIS), AquaMaps, and the iPlant Taxonomic Name Resolution Service (TNRS).

References

  1. Soberón, J., & Peterson, A. T. (2004). Biodiversity informatics: Managing and applying primary biodiversity data. Philosophical Transactions of the Royal Society B: Biological Sciences, 359(1444), 689–698.
  2. Krishtalka L, Humphrey PS (2000). "Can Natural History Museums Capture the Future?". BioScience. 50 (7): 611–617. doi:10.1641/0006-3568(2000)050[0611:CNHMCT]2.0.CO;2. hdl:1808/16508.
  3. Peterson AT, Vieglais D (2001). "Predicting Species Invasions Using Ecological Niche Modeling: New Approaches from Bioinformatics Attack a Pressing Problem" (PDF). BioScience. 51 (5): 363–371. doi:10.1641/0006-3568(2001)051[0363:PSIUEN]2.0.CO;2. Archived from the original (PDF) on 2016-08-07. Retrieved 2009-10-09.
  4. "Bioinformatics for Biodiversity?". Science. 289: 2229–2440. 2000.
  5. "Biodiversity Informatics". BMC Bioinformatics. 10 Suppl 14. 2009. Archived from the original on 2010-01-27. Retrieved 2009-11-15.
  6. "'Biodiversity Informatics', The Term". Retrieved 2009-08-06.
  7. Bisby FA; et al. (2000). "The Quiet Revolution: Biodiversity Informatics and the Internet". Science. 289 (5488): 2309–2312. Bibcode:2000Sci...289.2309B. doi:10.1126/science.289.5488.2309. PMID 11009408. S2CID 31852825.
  8. "Catalogue of Life - 2016 Annual Checklist : The 2016 Annual Checklist". www.catalogueoflife.org. Retrieved 2021-09-08.
  9. "the Paleobiology Database". Retrieved 2009-08-06.
  10. "Towards a management hierarchy (classification) for the Catalogue of Life. Draft Discussion Document by Dr. Dennis P. Gordon, May 2009". Archived from the original on 2009-08-08. Retrieved 2009-08-06.
  11. Ruggiero, M.A.; Gordon, D.P.; Orrell, T.M.; Bailly, N.; Bourgoin, T.; Brusca, R.C.; et al. (2015). "A higher level classification of all living organisms". PLOS ONE. 10 (4): e0119248. Bibcode:2015PLoSO..1019248R. doi:10.1371/journal.pone.0119248. PMC 4418965. PMID 25923521.
  12. "Biodiversity Maps: Transforming Data into Visual Tools into Meaningful Action for Biodiversity Conservation". 2016-11-30. Retrieved 2022-05-05.
  13. Elith, Jane; Franklin, Janet (2013), "Species Distribution Modeling", Encyclopedia of Biodiversity, Elsevier, pp. 692–705, doi:10.1016/b978-0-12-384719-5.00318-x, ISBN 978-0-12-384720-1, S2CID 82987545, retrieved 2022-05-05
  14. Jetz, Walter; McPherson, Jana M.; Guralnick, Robert P. (2012). "Integrating biodiversity distribution knowledge: toward a global map of life". Trends in Ecology & Evolution. 27 (3): 151–159. doi:10.1016/j.tree.2011.09.007. PMID 22019413.
  15. "TaxonX". SourceForge. Retrieved 2021-09-08.
  16. "Taxonomic Concept Transfer Schema (TCS)". Biodiversity Information Standards (TDWG).
  17. "Structured Descriptive Data". Biodiversity Information Standards (TDWG).
  18. "Access to Biological Collection Data (ABCD)". Biodiversity Information Standards (TDWG).
  19. "GitHub - tdwg/tapir: TDWG Access Protocol for Information Retrieval (TAPIR)". GitHub. 16 June 2020. Retrieved 2021-09-08.
  20. "Home". e-biosphere09.org.
  21. "Archived copy" (PDF). www.e-biosphere09.org. Archived from the original (PDF) on 26 February 2012. Retrieved 12 January 2022.
  22. "TDWG: Biodiversity Information Projects of the World". www.tdwg.org. Archived from the original on 14 July 2009. Retrieved 12 January 2022.

Further reading