Established | 2004 |
---|---|
Parent institution | Department of Computer Science, University of Manchester |
Affiliation | University of Manchester |
Director | Sophia Ananiadou |
Location | , |
Website | www |
The National Centre for Text Mining (NaCTeM) [1] is a publicly funded text mining (TM) centre. It was established to provide support, advice and information on TM technologies and to disseminate information within the larger TM community, while also providing services and tools in response to the requirements of the United Kingdom academic community.
The software tools and services which NaCTeM supplies allow researchers to apply text mining techniques to problems within their specific areas of interest – examples of these tools are highlighted below. In addition to providing services, the centre is also involved in, and makes significant contributions to, the text mining research community both nationally and internationally in initiatives such as Europe PubMed Central.
The centre is located in the Manchester Institute of Biotechnology and is operated and organised by the Department of Computer Science, University of Manchester. NaCTeM contributes expertise in natural language processing and information extraction, including named-entity recognition and the extraction of complex relationships (or events) that hold between named entities, along with parallel and distributed data mining systems in biomedical and clinical applications.
TerMine is a domain independent method for automatic term recognition which can be used to help locate the most important terms in a document and automatically rank them. [2]
AcroMine finds all known expanded forms of acronyms as they have appeared in Medline entries. Conversely, it can be used to find possible acronyms of expanded forms as they have previously appeared in Medline, and it disambiguates them. [3]
Medie is an intelligent search engine for the semantic retrieval of sentences containing biomedical correlations from Medline abstracts. [4] [5]
Facta+ is a Medline search engine for finding associations between biomedical concepts. [6]
Facta+ Visualizer is a web application that aids in understanding FACTA+ search results through intuitive graphical visualisation. [7]
KLEIO is a faceted semantic information retrieval system over Medline abstracts.
Europe PMC EvidenceFinder helps users to explore facts that involve entities of interest within the full text articles of the Europe PubMed Central database. [8]
EUPMC Evidence Finder for Anatomical entities with meta-knowledge is similar to the Europe PMC EvidenceFinder, allowing exploration of facts involving anatomical entities within the full text articles of the Europe PubMed Central database. Facts can be filtered according to various aspects of their interpretation (e.g., negation, certainty level, novelty).
Info-PubMed provides information and graphical representation of biomedical interactions extracted from Medline using deep semantic parsing technology. This is supplemented with a term dictionary consisting of over 200,000 protein/gene names and identification of disease types and organisms.
ASCOT is an efficient, semantically-enhanced search application, customised for clinical trial documents. [9]
HOM is a semantic search system over historical medical document archives.
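The acronym lookup described for AcroMine above can be illustrated with a very small sketch. This is not AcroMine's actual method, which the article does not describe. It is a simplified heuristic that assumes acronyms are introduced as "long form (ACR)" and that the initials of the preceding words spell the acronym. The function name `find_acronym_pairs` and the sample sentence are hypothetical.

```python
import re


def find_acronym_pairs(text):
    """Return (acronym, expansion) pairs for 'long form (ACR)' patterns.

    Simplified heuristic for illustration only: an expansion is accepted
    when the initials of the words just before the parenthesis spell the
    acronym exactly.
    """
    pairs = []
    # Acronym candidate: 2 to 10 capital letters inside parentheses.
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        preceding = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        candidate = preceding[-len(acronym):]
        # Keep the candidate only if its word initials match the acronym.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter
            for word, letter in zip(candidate, acronym)
        ):
            pairs.append((acronym, " ".join(candidate)))
    return pairs


# Hypothetical sample text, written for this example.
sample = (
    "The system performs named entity recognition (NER) "
    "and automatic term recognition (ATR)."
)
print(find_acronym_pairs(sample))
```

A real system would also need to handle expansions that skip function words, mixed-case acronyms, and acronyms with several competing expansions, which is where disambiguation comes in.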
BioLexicon is a large-scale terminological resource for the biomedical domain. [10]
GENIA is a collection of reference materials for the development of biomedical text mining systems.
GREC is a semantically annotated corpus of Medline abstracts intended for training IE systems and/or resources which are used to extract events from biomedical literature. [11]
This is a corpus of Medline abstracts annotated by experts with metabolite and enzyme names.
A collection of corpora manually annotated with fine-grained, species-independent anatomical entities, to facilitate the development of text mining systems that can carry out detailed and comprehensive analyses of biomedical scientific text. [12] [13]
This is an enrichment of the GENIA Event corpus, in which events are enriched with various levels of information pertaining to their interpretation. The aim is to allow systems to be trained that can distinguish between events that represent factual information and those that represent experimental analyses, definite information from speculated information, etc. [14]
The objective of the Argo project is to develop a workbench for analysing (primarily annotating) textual data. The workbench, which is accessed as a web application, supports the combination of elementary text-processing components to form comprehensive processing workflows. It provides functionality to manually intervene in the otherwise automatic process of annotation by correcting or creating new annotations, and facilitates user collaboration by providing sharing capabilities for user-owned resources. Argo benefits users such as text-analysis designers by providing an integrated environment for the development of processing workflows; annotators/curators by providing manual annotation functionalities supported by automatic pre-processing and post-processing; and developers by providing a workbench for testing and evaluating text analytics.
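The idea of combining elementary components into a workflow, with room for manual correction, can be sketched in a few lines. This is not Argo's interface, which is a web application. Every name below (`Document`, `tokenise`, `tag_capitalised`, `run_workflow`, `apply_corrections`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    # Each annotation is a (start, end, label) tuple over `text`.
    annotations: list = field(default_factory=list)


def tokenise(doc):
    """Elementary component: one 'token' annotation per whitespace-separated word."""
    position = 0
    for word in doc.text.split():
        start = doc.text.index(word, position)
        end = start + len(word)
        doc.annotations.append((start, end, "token"))
        position = end
    return doc


def tag_capitalised(doc):
    """Elementary component: mark capitalised tokens as candidate names."""
    for start, end, label in list(doc.annotations):
        if label == "token" and doc.text[start].isupper():
            doc.annotations.append((start, end, "candidate_name"))
    return doc


def run_workflow(doc, components):
    """Apply the components in order, each consuming the previous output."""
    for component in components:
        doc = component(doc)
    return doc


def apply_corrections(doc, remove=(), add=()):
    """Manual intervention: drop wrong annotations and add missing ones."""
    to_remove = set(remove)
    doc.annotations = [a for a in doc.annotations if a not in to_remove]
    doc.annotations.extend(add)
    return doc


doc = run_workflow(
    Document("Argo supports Medline abstracts"),
    [tokenise, tag_capitalised],
)
names = [doc.text[s:e] for s, e, label in doc.annotations if label == "candidate_name"]
print(names)

# A curator decides that the first candidate is not wanted and removes it.
doc = apply_corrections(doc, remove=[(0, 4, "candidate_name")])
corrected = [doc.text[s:e] for s, e, label in doc.annotations if label == "candidate_name"]
print(corrected)
```

Keeping every component as a function from document to document is what makes them freely combinable. The manual correction step is just another transformation applied after the automatic ones.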
Big mechanisms are large, explanatory models of complicated systems in which interactions have important causal effects. Whilst the collection of big data is increasingly automated, the creation of big mechanisms remains a largely human effort, which is becoming increasingly challenging owing to the fragmentation and distribution of knowledge. The ability to automate the construction of big mechanisms could have a major impact on scientific research. As one of a number of different projects that make up the big mechanism programme, funded by DARPA, this project aims to assemble an overarching big mechanism from the literature and prior experiments and to utilise this for the probabilistic interpretation of new patient panomics data. The project aims to integrate machine reading of the cancer literature with probabilistic reasoning across cancer claims using specially designed ontologies, computational modelling of cancer mechanisms (pathways), automated hypothesis generation to extend knowledge of the mechanisms and a 'Robot Scientist' that performs experiments to test the hypotheses. A repeated cycle of text mining, modelling, experimental testing, and worldview updating is intended to lead to increased knowledge about cancer mechanisms.
Pathtext/Refine is a system designed to integrate a pathway visualiser, text mining systems and annotation tools. [15] [16]
This project aims to produce a knowledge repository of Philippine biodiversity by combining the domain-relevant expertise and resources of Philippine partners with the text mining-based big data analytics of the University of Manchester's National Centre for Text Mining. The repository will be a synergy of different types of information, e.g., taxonomic, occurrence, ecological, biomolecular, biochemical, thus providing users with a comprehensive view on species of interest that will allow them to (1) carry out predictive analysis on species distributions, and (2) investigate potential medicinal applications of natural products derived from Philippine species.
This is a collaboration with the Text-Mining group at the European Bioinformatics Institute (EBI) and Mimas (data centre), forming a work package in the Europe PubMed Central project (formerly UKPMC) hosted and coordinated by the British Library. Europe PMC, as a whole, forms a European version of the PubMed Central paper repository, in collaboration with the National Institutes of Health (NIH) in the United States. Europe PMC is funded by a consortium of biomedical research funders. NaCTeM's contribution to this major project is the application of text mining solutions to enhance information retrieval and knowledge discovery. As such, it applies technology developed in other NaCTeM projects on a large scale and in a prominent resource for the biomedical community.
This project aims to transform the Biodiversity Heritage Library (BHL) into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project integrates novel text mining methods, visualisation, crowdsourcing and social media into the BHL. The resulting digital resource will provide fully interlinked and indexed access to the full content of BHL library documents, via semantically enhanced and interactive browsing and searching capabilities, allowing users to locate precisely the information of interest to them in an easy and efficient manner.
This project aims to conduct novel research in text mining and machine learning to transform the way in which evidence-based public health (EBPH) reviews are conducted. The aims of the project include developing new unsupervised text mining methods to derive term similarities, supporting screening during EBPH reviews, and creating new algorithms for ranking and visualising meaningful associations of multiple types in a dynamic and iterative manner. These newly developed methods will be evaluated in EBPH reviews, based on implementation of a pilot, to ascertain the level of transformation in EBPH reviewing.
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process. Text mining usually involves the process of structuring the input text, deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling.
In the field of bioinformatics, a sequence database is a type of biological database that is composed of a large collection of computerized ("digital") nucleic acid sequences, protein sequences, or other polymer sequences stored on a computer. The UniProt database is an example of a protein sequence database. As of 2013 it contained over 40 million sequences and is growing at an exponential rate. Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.
PubMed is a free database including primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
BioCreAtIvE is a community-wide effort for evaluating information extraction and text mining developments in the biological domain.
The Open Biological and Biomedical Ontologies (OBO) Foundry is a group of people dedicated to building and maintaining ontologies related to the life sciences. The OBO Foundry establishes a set of principles for ontology development, with the aim of creating a suite of interoperable reference ontologies in the biomedical domain. Currently, there are more than a hundred ontologies that follow the OBO Foundry principles.
The bibliome is the totality of the biological text corpus. This term was coined around 2000 at the EBI to denote the importance of biological text information. Similar terms that have been less frequently used are literaturome and textome. By approximate analogy to widely used terms like genome, metabolome, proteome, and transcriptome, this -ome would properly refer to the literature of a specified or contextually implied field, hence: biological bibliome, political bibliome, etc. However, the term has not (yet) been applied outside the biological and medical sciences, so it currently applies by default just to the biomedical fields. It would make little sense to apply it to a particular body of texts such as MEDLINE, despite a natural analogy that might seem to suggest this: the terms genome, proteome, channelome, metabolome, and transcriptome all usually assume a specific organism or cell set and a specific time point. Following this analogy would make little sense because there is already an established term for this purpose: corpus.
Robert David Stevens is a professor of bio-health informatics and former Head of the Department of Computer Science at the University of Manchester.
The Open Regulatory Annotation Database is designed to promote community-based curation of regulatory information. Specifically, the database contains information about regulatory regions, transcription factor binding sites, regulatory variants, and haplotypes.
Europe PubMed Central is an open-access repository that contains millions of biomedical research works. It was known as UK PubMed Central until 1 November 2012.
A relationship extraction task requires the detection and classification of semantic relationship mentions within a set of artifacts, typically from text or XML documents. The task is very similar to that of information extraction (IE), but IE additionally requires the removal of repeated relations (disambiguation) and generally refers to the extraction of many different relationships.
A co-occurrence network, sometimes referred to as a semantic network, is a method of analysing text that includes a graphic visualization of potential relationships between people, organizations, concepts, biological organisms like bacteria, or other entities represented within written material. The generation and visualization of co-occurrence networks has become practical with the advent of electronically stored text amenable to text mining.
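As a minimal sketch of how the edges of such a network might be counted, the following assumes that two entities co-occur when both appear in the same sentence, and that entity mentions can be found by case-insensitive substring matching. Both are simplifying assumptions. The function name `cooccurrence_edges`, the entity list and the sentences are hypothetical toy inputs.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(sentences, entities):
    """Count, for each pair of entities, the sentences that mention both."""
    edges = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        # Entities mentioned in this sentence (naive substring match).
        present = sorted(e for e in entities if e.lower() in lowered)
        # Every unordered pair of co-mentioned entities gains one count.
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


# Toy sentences written for this example, not taken from a corpus.
sentences = [
    "Aspirin inhibits COX-1 and reduces inflammation.",
    "COX-1 is linked to inflammation.",
    "Aspirin is widely used.",
]
entities = ["aspirin", "cox-1", "inflammation"]

edges = cooccurrence_edges(sentences, entities)
for (left, right), weight in sorted(edges.items()):
    print(left, right, weight)
```

The resulting weighted pairs can be passed to any graph library for visualisation. A practical system would replace substring matching with named-entity recognition and would usually normalise the raw counts.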
Jun'ichi Tsujii is a Japanese computer scientist specializing in natural language processing and text mining, particularly in the field of biology and bioinformatics.
Literature-based discovery (LBD), also called literature-related discovery (LRD), is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications to find new relationships between existing knowledge. Literature-based discovery aims to discover new knowledge by connecting information that has been explicitly stated in the literature to deduce connections that have not been explicitly stated.
Anne O'Tate is a free, web-based application that analyses sets of records identified on PubMed, the bibliographic database of articles from over 5,500 biomedical journals worldwide. While PubMed has its own wide range of search options to identify sets of records relevant to a researcher's query, it lacks the ability to analyse these sets of records further, a process for which the terms text mining and drill down have been used. Anne O'Tate is able to perform such analysis and can process sets of up to 25,000 PubMed records.
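One simple way to picture this is to treat explicitly stated links as pairs of concepts and to look for concepts that are reachable only through an intermediate one. The sketch below assumes that setting. It is an illustration, not a description of any particular LBD system. The function name `indirect_candidates` and the placeholder concept labels are hypothetical.

```python
from collections import defaultdict


def indirect_candidates(stated_links, start):
    """Find concepts linked to `start` only via an intermediate concept.

    Returns a dict mapping each candidate concept to the set of
    intermediate concepts that connect it to `start`.
    """
    neighbours = defaultdict(set)
    for left, right in stated_links:
        neighbours[left].add(right)
        neighbours[right].add(left)

    direct = neighbours[start]
    candidates = defaultdict(set)
    for middle in direct:
        for target in neighbours[middle]:
            # Skip the start itself and anything already stated directly.
            if target != start and target not in direct:
                candidates[target].add(middle)
    return dict(candidates)


# Placeholder concepts: A is explicitly linked to B1, B2 and D.
# C is explicitly linked to B1 and B2, but never to A.
stated_links = [("A", "B1"), ("A", "B2"), ("B1", "C"), ("B2", "C"), ("A", "D")]

hypotheses = indirect_candidates(stated_links, "A")
print(hypotheses)
```

Here the link between A and C is never stated, yet two intermediate concepts support it, so it is returned as a hypothesis for further investigation. The number of supporting intermediates is one possible way to rank such candidates.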
Teresa K. Attwood is a professor of Bioinformatics in the Department of Computer Science and School of Biological Sciences at the University of Manchester and a visiting fellow at the European Bioinformatics Institute (EMBL-EBI). She held a Royal Society University Research Fellowship at University College London (UCL) from 1993 to 1999 and at the University of Manchester from 1999 to 2002.
In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.
Sophia Ananiadou is a Greek-British computer scientist and computational linguist. She led the development of and directs the National Centre for Text Mining (NaCTeM) in the United Kingdom. She is also Professor in Computer Science in the Department of Computer Science at the University of Manchester.
Biocuration is the field of life sciences dedicated to organizing biomedical data, information and knowledge into structured formats, such as spreadsheets, tables and knowledge graphs. The biocuration of biomedical knowledge is made possible by the cooperative work of biocurators, software developers and bioinformaticians and is at the base of the work of biological databases.