Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications (the "literature") to find new relationships between pieces of existing knowledge (the "discovery"). Literature-based discovery aims to discover new knowledge by connecting information that has been explicitly stated in the literature in order to deduce connections that have not been explicitly stated. [1]
LBD can help researchers quickly discover and explore hypotheses, gain information on relevant advances inside and outside of their niches, and increase interdisciplinary information sharing. [1]
The most basic and widespread type of LBD is called the ABC paradigm because it centers on three concepts called A, B and C. [2] [3] [4] It states that if there is a connection between A and B and one between B and C, then there is a connection between A and C which, if not explicitly stated, remains to be explored. [1]
The LBD technique was pioneered by Don R. Swanson in the 1980s. [5] He hypothesized that the combination of two separately published results, one indicating an A-B relationship and another a B-C relationship, is evidence of an A-C relationship which is unknown or unexplored. He used this reasoning to propose fish oil as a treatment for Raynaud syndrome, due to their shared relationship with blood viscosity. [6] This hypothesis was later shown to have merit in a prospective study, [7] and he went on to propose other discoveries using similar methods. [8] [9] [10] [1]
Swanson linking is a term proposed in 2003 [11] that refers to connecting two pieces of knowledge previously thought to be unrelated. [12] For example, it may be known that illness A is caused by chemical B, and that drug C reduces the amount of chemical B in the body. However, because the respective articles were published separately from one another (so-called "disjoint data"), the relationship between illness A and drug C may be unknown. Swanson linking aims to find these relationships and report them.
Although the ABC paradigm is widely used, critics of the system have argued that much of science is not captured in simple assertions but is instead built from analogies and images at a higher level of abstraction. [13]
LBD generally comes in two flavours: open and closed discovery. In open discovery, only A is given; the approach finds Bs and uses them to return possibly interesting Cs to the user, thus generating hypotheses from A. In closed discovery, both A and C are given, and the approach seeks the Bs that can link the two, thus testing a hypothesis about A and C. [1]
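To make the two modes concrete, the following Python sketch implements open and closed discovery over a toy set of term co-occurrence pairs; the terms and pairs are invented for illustration and do not come from any real corpus.

```python
from collections import defaultdict

# Toy "literature": pairs of terms that co-occur in some article.
cooccurrences = {
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "raynaud syndrome"),
    ("fish oil", "platelet aggregation"),
    ("platelet aggregation", "raynaud syndrome"),
}

# Undirected adjacency map: term -> set of co-occurring terms.
neighbours = defaultdict(set)
for x, y in cooccurrences:
    neighbours[x].add(y)
    neighbours[y].add(x)

def open_discovery(a):
    """Given A, return candidate C terms reached via any linking B,
    excluding terms already directly connected to A."""
    candidates = defaultdict(set)  # C -> set of linking B terms
    for b in neighbours[a]:
        for c in neighbours[b]:
            if c != a and c not in neighbours[a]:
                candidates[c].add(b)
    return candidates

def closed_discovery(a, c):
    """Given A and C, return the B terms that link them."""
    return neighbours[a] & neighbours[c]

print(dict(open_discovery("fish oil")))
# {'raynaud syndrome': {'blood viscosity', 'platelet aggregation'}}
print(closed_discovery("fish oil", "raynaud syndrome"))
```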
A number of systems for performing literature-based discovery have been developed over the years, extending Don Swanson's original idea, and the evaluation of the quality of such systems is an active area of research. [14] Some systems include web versions for increased user-friendliness. [15] An approach common to many systems is the use of MeSH terms to represent scientific articles; it is used by the systems Manjal, BITOLA and LitLinker. [16]
One well-known system within the field is called Arrowsmith and is tailored to find connections between two disjoint sets of articles, an approach labeled "two-node" search. [17] [18]
Another well-known system, LION LBD, [19] uses PubTator [20] for annotating PubMed scientific articles with concepts such as chemicals, genes/proteins, mutations, diseases and species; as well as sentence-level annotation of cancer hallmarks that describe fundamental cancer processes and behaviour. [21] It uses co-occurrence metrics to rank relations between concepts and performs both open and closed discovery. [1]
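As a rough illustration of co-occurrence-based ranking of the kind LION LBD builds on, the sketch below scores a concept pair by normalized pointwise mutual information (NPMI) over hypothetical document annotations; the concepts, document IDs and choice of metric are assumptions made for the example, not the system's actual implementation.

```python
import math

# Hypothetical annotations: concept -> set of document IDs it occurs in.
docs = {
    "interferon gamma": {1, 2, 3, 5},
    "NLRP3":            {2, 3, 4},
    "migraine":         {5},
}
N = 6  # total number of documents in the toy collection

def npmi(a, b):
    """Normalized pointwise mutual information of two concepts,
    ranging from -1 (never together) to 1 (always together)."""
    pa = len(docs[a]) / N
    pb = len(docs[b]) / N
    pab = len(docs[a] & docs[b]) / N
    if pab == 0:
        return -1.0
    return math.log(pab / (pa * pb)) / -math.log(pab)

print(npmi("interferon gamma", "NLRP3"))
```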
While many LBD systems are based on traditional statistical methods, [16] others leverage more sophisticated machine learning methods, such as neural networks. [1] Some LBD systems represent the connections between concepts as a knowledge graph and thus employ techniques from graph theory. [22] The graph-based representation is also the foundation for LBD systems that employ graph databases such as Neo4j, enabling discovery via graph query languages such as Cypher. [23]
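As an illustration of the graph-database approach, the sketch below runs an ABC-style open-discovery query through the official neo4j Python driver; the Concept node label, the COOCCURS_WITH relationship type, the connection details and the seed term are all assumptions made for the example, not the schema of any published system.

```python
from neo4j import GraphDatabase

# Cypher: find Cs reachable from A via some B, with no direct A-C edge,
# ranked by the number of distinct linking B terms.
query = """
MATCH (a:Concept {name: $a_name})-[:COOCCURS_WITH]-(b:Concept)
      -[:COOCCURS_WITH]-(c:Concept)
WHERE c <> a AND NOT (a)-[:COOCCURS_WITH]-(c)
RETURN c.name AS candidate, count(DISTINCT b) AS linking_terms
ORDER BY linking_terms DESC
LIMIT 10
"""

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(query, a_name="Raynaud syndrome"):
        print(record["candidate"], record["linking_terms"])
driver.close()
```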
Graph-based LBD systems represent the relations between concepts using different relation types, such as those in the UMLS Semantic Network. [24] Some approaches go further and apply contextualized relations, [25] an approach also used by the Gene Ontology for its Causal Activity Modeling (GO-CAM). [26]
Besides extracting information from the body of scientific articles, LBD systems often employ structured knowledge from biocurated biological resources, such as Online Mendelian Inheritance in Man (OMIM). [27]
A common task in literature-based discovery is assigning words or concepts to different semantic types. A concept might be classified under one type or several. For example, in the Unified Medical Language System (UMLS), the term migraine is classified under the type disease or syndrome, while the term magnesium falls under two types: biologically active substance and element, ion, or isotope. [16] The typing of concepts focuses the discovery of connections on particular classes of concepts, e.g. disease-gene or disease-drug relationships. [16]
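A minimal sketch of such type-based filtering follows, assuming a hand-written lookup table of UMLS-style semantic types; a real system would query the UMLS itself rather than a hard-coded dictionary.

```python
# Hypothetical concept -> semantic types table (UMLS-style names).
semantic_types = {
    "migraine":  {"Disease or Syndrome"},
    "magnesium": {"Biologically Active Substance",
                  "Element, Ion, or Isotope"},
    "serotonin": {"Biologically Active Substance"},
}

def filter_by_type(candidates, wanted):
    """Keep only candidates with at least one wanted semantic type."""
    return [c for c in candidates
            if semantic_types.get(c, set()) & wanted]

# e.g. restrict a disease-drug search to substance-like candidates
print(filter_by_type(["magnesium", "serotonin", "migraine"],
                     {"Biologically Active Substance"}))
```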
The evaluation of literature-based discoveries is challenging and includes both experimental and in silico methods. [45] Evaluation methods try to quantify the knowledge generated by a system, which should be delivered in an amount and at a level of richness that is useful to scientists. [46]
Evaluation is difficult in LBD for several reasons: disagreement about the role of LBD systems in research, and thus about what makes a successful one; the difficulty of determining how useful, interesting or actionable a discovery is; and the difficulty of objectively defining a 'discovery', which hinders the creation of a standard evaluation set quantifying when a discovery has been replicated or found. [1]
A popular evaluation method in LBD is to replicate previous discoveries. [4] [47] [48] These are usually LBD-based discoveries, as they are relatively easy to quantify compared to other discoveries. However, only a handful of such discoveries exist, and approaches tuned to perform well on them might not generalise. In this type of evaluation, the literature published before the discovery to be replicated is used to generate a ranked list of discovery candidates as target or linking terms. Success is measured by reporting the rank of the term(s) of interest: the higher the rank, the better the approach.
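A minimal sketch of this rank-based reporting, with an invented candidate list:

```python
# Ranked output of a hypothetical LBD run, best candidate first.
ranked_candidates = ["platelet aggregation", "fish oil", "magnesium"]

def rank_of(term, ranked):
    """1-based rank of a gold term, or None if it was not retrieved."""
    return ranked.index(term) + 1 if term in ranked else None

print(rank_of("fish oil", ranked_candidates))  # 2
```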
Literature- or time-slicing involves splitting the existing literature at a point in time. The LBD system is then exposed only to the literature before the split and is evaluated by how many of the discoveries from the later period it can find. LBD systems have used term co-occurrences, [49] relationships from external biomedical resources (e.g. SemMedDB) [50] and semantic relationships [51] to generate the gold standards. A high-precision alternative is to have domain experts generate the gold standard, [52] but this is time-consuming, expensive and tends to produce low recall. [1]
The advantage of time-slicing over the replication of previous discoveries is evaluation on a large number of test instances. This raises the need for evaluation metrics that can quantify performance on large ranked lists. [1] LBD works have used metrics popular in information retrieval, [53] including precision, recall, area under the curve (AUC), precision at k, mean average precision (MAP) and others. [1]
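The sketch below shows how a time-sliced gold standard might be scored with precision at k and average precision; the gold set and ranked list are invented for illustration.

```python
# Gold standard: concept pairs (here abbreviated to candidate IDs)
# that first appear in the literature after the cut-off date.
gold = {"c1", "c4", "c7"}
# System output over the pre-cut-off literature, best candidate first.
ranked = ["c1", "c2", "c4", "c3", "c7", "c5"]

def precision_at_k(ranked, gold, k):
    """Fraction of the top k candidates that are gold discoveries."""
    return sum(1 for c in ranked[:k] if c in gold) / k

def average_precision(ranked, gold):
    """Mean of precision values at each rank where a gold item occurs."""
    hits, total = 0, 0.0
    for i, c in enumerate(ranked, start=1):
        if c in gold:
            hits += 1
            total += hits / i
    return total / len(gold) if gold else 0.0

print(precision_at_k(ranked, gold, 3))   # 2/3
print(average_precision(ranked, gold))   # (1/1 + 2/3 + 3/5) / 3
```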
Proposing new discoveries or treatments goes beyond replicating past discoveries or predicting time-sliced instances of a particular relationship, and shows that a system is capable of being used in realistic situations. [54] [47] [55] [56] This is usually accompanied by peer-reviewed publication in the domain or by vetting by a domain expert. [1]
The automation of literature-based discovery relies heavily on text mining. [58]
The language of scientific articles often includes ambiguities, and an important step towards coherent parsing of the literature is extracting the sense of each term in the context in which it is used, a task called word-sense disambiguation (WSD). [59] For example, the gene terms CT (PCYT1A) and MR (NR3C2) can be confused with the acronyms for computed tomography and magnetic resonance, requiring sophisticated disambiguation systems. [60] Terms are often reconciled to ontologies or other sources of unique identifiers, such as the Unified Medical Language System (UMLS). [61] This process of mapping multiple different surface forms to a single name or identifier is called normalization. [57]
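A minimal sketch of normalization, using a hand-written synonym table with UMLS-style concept identifiers; the identifiers here are illustrative, and real systems use full UMLS lexicons together with context-aware WSD rather than a fixed dictionary.

```python
# Hypothetical synonym table: surface form -> concept identifier.
synonyms = {
    "raynaud syndrome":   "C0034734",
    "raynaud's disease":  "C0034734",
    "raynaud phenomenon": "C0034734",
    "fish oil":           "C0016157",
}

def normalize(term):
    """Map a surface form to its concept identifier, if known."""
    return synonyms.get(term.strip().lower())

print(normalize("Raynaud's Disease"))  # C0034734
```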
LBD has already been used in different ways to identify new connections between biomedical entities, as well as new candidate genes and treatments for illnesses. [62] [1]
LBD has seen use in drug development and repurposing [54] [63] as well as in predicting adverse drug reactions. [64] [65] [1]
The method of literature-based discovery has been used to search for treatments for a number of human diseases.
The approach has also been used to propose associations between genes and particular diseases, [70] such as breast cancer. [71]
In the context of systems vaccinology, it was used to identify proteins that are related to interferon gamma and play a role in the response to vaccines. [57]
It has also been used to propose mechanisms for currently used drugs. [72]
LBD has been explored as a tool to identify diagnostic and prognostic biomarkers for diseases, e.g. for the risk of type 2 diabetes. [73]
Besides providing scientific hypotheses about the world, LBD has also been used to improve data analysis, via the automatic identification of possible confounding factors from the medical literature. [74]
It has also been used to better understand disease etiology and the relationships between different diseases, for example by looking for the genes connecting myocardial infarction and depression, [75] and for connections between psychiatric and somatic diseases. [76]
LBD has mostly been deployed in the biomedical domain, but it has also been applied outside of it, for example to research into water purification systems, to accelerating the development of developing countries, and to identifying promising research collaborations. [77] [78] [79]