Disease informatics (also known as infectious disease informatics) studies the knowledge production, sharing, modeling, and management of infectious diseases. [1] The field became more widely studied as a by-product of the rapid increase in the amount of widely available biomedical and clinical data, and in response to demand for useful analyses of those data. [1]
Considering infectious diseases contribute to millions of deaths every year, the ability to identify and understand disease diffusion is crucial for society to apply control and prevention measures. [2] The knowledge gained by researchers in the field of disease informatics can be used to aid policymakers' decisions on issues such as spreading public awareness, updating the training of health professionals, and buying vaccines. [2]
Aside from aiding policymakers' decisions, the goals of disease informatics include increased identification of biomarkers for transmissibility, improved vaccine design, a deeper understanding of host-pathogen interactions, and the optimization of antimicrobial development. [1]
The use of artificial intelligence (AI) tools, such as machine learning and natural language processing (NLP), in disease informatics increases efficiency by automating and speeding up several data analysis processes. Advances in AI and increased accessibility of data aid predictive modeling and public health surveillance. AI uses predictive modeling to examine vast data sets and forecast future outcomes, improving the ability to predict disease outbreaks and helping to guide public health treatments. [3] AI also provides a valuable avenue by combining spatial modeling with geographic information system (GIS) data to uncover geographical patterns (for example disease clusters), supporting data-driven decision-making for local-level predictions of disease diffusion. [3] As AI continues to develop, further advances in its use in disease informatics are expected.
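The idea of uncovering disease clusters from location data can be illustrated with a minimal sketch. It assumes case locations are already available as planar (x, y) coordinates and simply counts cases per grid cell; the function name `grid_hotspots`, the cell size, and the threshold are hypothetical choices for illustration, not a method described in the cited sources, and real GIS workflows use far more careful spatial statistics.

```python
from collections import Counter


def grid_hotspots(cases, cell_size, min_cases):
    """Bin (x, y) case locations into square cells and return crowded cells.

    Returns a dict mapping (column, row) cell indices to case counts for
    every cell holding at least `min_cases` cases.
    """
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in cases
    )
    return {cell: n for cell, n in counts.items() if n >= min_cases}


# Hypothetical case coordinates, in kilometres on a local grid.
cases = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (0.4, 0.4), (5.1, 5.2), (9.7, 2.3)]
hotspots = grid_hotspots(cases, cell_size=1.0, min_cases=3)
print(hotspots)
```

With these made-up points, only the cell containing the four cases near the origin passes the threshold; the two isolated cases are ignored.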
Machine learning (ML) techniques aid disease informatics through their capability to predict, spatially and temporally, the progression and transmission of infectious diseases. [2] In disease informatics, ML algorithms are used to analyze large, complex data sets and identify patterns across varying types of data, such as demographics, electronic health records, and environmental conditions. [2] Commonly used ML techniques include decision trees, random forests, support vector machines, and deep learning networks. [2] Researchers can apply these tools to data sets (for example genomic data, social media posts, and health records) to predict the potential sources of an outbreak, the likelihood of an individual contracting a certain disease, and the number of cases of a disease in a given region. [2] According to numerous studies, ML models have proven to be just as accurate as traditional statistical methods at predicting the spread and onset of diseases, especially when multiple ML models are used concurrently. [2]
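As a rough illustration of the simplest of these techniques, the sketch below fits a decision stump (a decision tree with a single split) to one feature. The data are invented, and the feature (body temperature), labels, and function names are assumptions made for this example; production work would normally rely on an established ML library and far richer data.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that best separates binary labels.

    Every observed value is tried as a threshold; the rule "predict 1 when
    x >= threshold" with the fewest training errors wins.
    """
    best = None
    for t in sorted(set(xs)):
        preds = [1 if x >= t else 0 for x in xs]
        errors = sum(p != y for p, y in zip(preds, ys))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]


def predict(threshold, x):
    """Apply the learned one-split rule to a new observation."""
    return 1 if x >= threshold else 0


# Hypothetical training data: temperature in degrees Celsius,
# label 1 = infection later confirmed, 0 = not confirmed.
temps = [36.5, 36.8, 37.0, 37.9, 38.4, 39.1]
labels = [0, 0, 0, 1, 1, 1]

threshold = fit_stump(temps, labels)
print(threshold, predict(threshold, 38.0), predict(threshold, 36.9))
```

A random forest extends this idea by training many deeper trees on resampled data and combining their votes.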
Text mining has become a beneficial avenue for querying large amounts of data to aid in gene mapping and genome analysis. [1] It makes it possible to query medical databases for processes such as genomic mapping, integrating genomic and proteomic data to map genes and highlight their interrelationships with various diseases. [1] Data on targeted sequences can be retrieved in two ways: by similarity search or by keyword search. A similarity search (using software such as BLAST) is performed by entering a known sequence as a query to find sequences that are similar to it. A keyword search (public tools include SRS, Entrez, and ACNUC) uses annotations that define the features of genes, such as sequence positions, to retrieve the desired gene sequences. [1]
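The difference between the two retrieval modes can be sketched on a toy in-memory database. The similarity search here ranks records by the overlap of their k-mer sets (Jaccard similarity), which is a deliberately simple stand-in and not the algorithm used by BLAST; the keyword search matches a term against free-text annotations. The record names, sequences, and annotations are all made up.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_search(query, database, k=3):
    """Rank records by Jaccard similarity between k-mer sets, best first."""
    q = kmers(query, k)
    scored = []
    for name, record in database.items():
        s = kmers(record["seq"], k)
        union = q | s
        score = len(q & s) / len(union) if union else 0.0
        scored.append((score, name))
    return [name for score, name in sorted(scored, reverse=True)]


def keyword_search(term, database):
    """Return names of records whose annotation mentions the term."""
    term = term.lower()
    return sorted(
        name for name, record in database.items()
        if term in record["annotation"].lower()
    )


# Hypothetical records; real databases hold far longer sequences.
database = {
    "geneA": {"seq": "ATGGCGTACGTT", "annotation": "hemagglutinin, surface glycoprotein"},
    "geneB": {"seq": "TTTTCCCCGGGG", "annotation": "polymerase subunit"},
    "geneC": {"seq": "ATGGCGTACGAA", "annotation": "hemagglutinin precursor"},
}

print(similarity_search("ATGGCGTACGTT", database))
print(keyword_search("hemagglutinin", database))
```

The similarity search needs only the sequence itself, whereas the keyword search depends entirely on how well the records have been annotated.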
Through a process called syndromic surveillance, a form of public health surveillance, data analysis methods can be used to predict potential disease outbreaks by detecting timely, pre-diagnosis health indicators. [4] Syndromic surveillance combines demographic data (age, gender, ethnicity, etc.) with patient visit data (admission status, chief complaint, type of office visit, etc.), which can be run through natural language processing to highlight potential predictors of an outbreak. [4] Because predicting possible outbreaks is time-sensitive, chief complaint data is valuable: it is available much more quickly than formal diagnosis data from physicians' offices. [4] The key to successfully harnessing surveillance data for disease informatics is to use more than one source. Other important sources that are commonly used concurrently include the following: [4]
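A minimal sketch of the chief-complaint part of such a pipeline is shown below. It assumes free-text complaints are mapped to syndromes by simple keyword matching and that an alert is raised when today's count exceeds the recent mean by a chosen number of standard deviations. The keyword table, the `z` cutoff, and the sample counts are hypothetical; operational systems use more robust NLP and detection statistics.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical keyword table mapping syndromes to trigger phrases.
SYNDROME_KEYWORDS = {
    "respiratory": ("cough", "shortness of breath", "sore throat"),
    "gastrointestinal": ("vomiting", "diarrhea", "nausea"),
}


def classify(complaint):
    """Assign a free-text chief complaint to the first matching syndrome."""
    text = complaint.lower()
    for syndrome, words in SYNDROME_KEYWORDS.items():
        if any(word in text for word in words):
            return syndrome
    return "other"


def daily_counts(visits, syndrome):
    """Count visits per day whose complaint maps to the given syndrome."""
    return Counter(
        day for day, complaint in visits if classify(complaint) == syndrome
    )


def is_alert(baseline, today, z=2.0):
    """Flag today's count if it exceeds baseline mean + z standard deviations."""
    return today > mean(baseline) + z * stdev(baseline)


visits = [
    ("2024-01-01", "Cough and fever"),
    ("2024-01-01", "Ankle injury"),
    ("2024-01-02", "Sore throat, cough"),
    ("2024-01-02", "Vomiting since last night"),
]
print(daily_counts(visits, "respiratory"))

baseline = [4, 5, 3, 6, 4, 5, 4]  # hypothetical counts for the previous week
print(is_alert(baseline, today=12), is_alert(baseline, today=6))
```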
The accuracy of these AI tools and techniques depends on providing them with high-quality, comprehensive data. Accessing and collecting such data remains an ongoing challenge because most of the data pulled is incomplete, noisy, and affected by human error (e.g. in grammar, abbreviations, and spelling), which means the data must undergo thorough cleaning before it can be used. [2] [4]
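The kind of cleaning described above can be sketched for free-text chief complaints. The abbreviation and misspelling tables below are small hypothetical examples; a real pipeline would draw on curated clinical vocabularies and would also need to handle missing fields and ambiguous abbreviations.

```python
import re

# Hypothetical lookup tables; real systems use curated vocabularies.
ABBREVIATIONS = {
    "sob": "shortness of breath",
    "n/v": "nausea and vomiting",
    "abd": "abdominal",
}
MISSPELLINGS = {"diarhea": "diarrhea", "feaver": "fever"}


def clean_complaint(raw):
    """Normalize case and punctuation, fix known typos, expand abbreviations."""
    text = raw.lower().strip()
    # Keep letters, digits, slashes, and whitespace; drop other punctuation.
    text = re.sub(r"[^a-z0-9/\s]", " ", text)
    tokens = text.split()
    tokens = [MISSPELLINGS.get(t, t) for t in tokens]
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)


print(clean_complaint("  SOB, feaver!! "))
print(clean_complaint("N/V and diarhea"))
```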
The data collected also come from numerous sources (due to differences in data availability and governance) that use varying formats and software, which creates a need for some form of standardized infrastructure to better integrate and manage data. [3] A standardized taxonomy for data analysis and predictive modeling would facilitate research collaboration, accelerate decisions, and help researchers select the right predictive models. [3]
One method being used is federated learning, which allows an AI model to be trained across multiple centers without sharing raw data, keeping the data safe at its source. [3] However, this approach is still affected by differences in formatting and software across centers, which make model convergence harder to ensure, so algorithmic improvements are needed.
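The core loop of federated learning can be sketched with federated averaging, one common aggregation scheme (the cited source does not say which scheme is used). In this toy version each center fits a one-parameter model y = w * x on its own data and sends back only the updated weight; the coordinator averages the weights in proportion to each center's sample count. The data, learning rate, and round counts are invented for illustration.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one center's private data."""
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(global_w, sites, rounds=50):
    """Train across sites; only weights, never raw records, are exchanged."""
    for _ in range(rounds):
        updates = [(local_update(global_w, data), len(data)) for data in sites]
        total = sum(n for _, n in updates)
        # Weight each site's contribution by its number of samples.
        global_w = sum(w * n for w, n in updates) / total
    return global_w


# Hypothetical private data sets held by two centers, roughly y = 2x.
sites = [
    [(1.0, 2.1), (2.0, 3.9)],
    [(3.0, 6.2), (4.0, 7.8), (5.0, 10.1)],
]
w = federated_average(0.0, sites)
print(round(w, 2))
```

The raw (x, y) pairs never leave `local_update`, which is the property that makes the approach attractive for sensitive health data.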
Another concern is the potential for bias and overfitting in predictive models, which could lead to inaccurate predictions. [2] Human error can persist even when these tools are used to automate tasks, because AI tools that are trained incorrectly will produce inaccurate results. One study suggests that combining AI with wearable devices and other emerging technology could ease some of these challenges by providing real-time data for the models to use. This could make the data more accurate in its raw form, reduce the time spent cleaning it, and allow the models to make more accurate predictions. [3]
A critical concern for using AI and predictive modeling in disease informatics is data security and privacy. The data sources being used (electronic health records, demographics, etc.) contain highly sensitive information that must be protected for all parties involved. Any models or techniques being used need to be in compliance with local governmental regulations and laws such as HIPAA in the United States. The data used must also undergo rigorous data anonymization and de-identification protocols to protect patient privacy. [3]
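A few common de-identification steps can be sketched as follows: replacing the patient identifier with a keyed hash, generalizing exact age into a band, truncating the postal code, and dropping direct identifiers such as names. The field names and key are hypothetical, and this sketch on its own should not be assumed to satisfy HIPAA or any other regulation; real protocols are defined by the applicable rules and reviewed by privacy experts.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key):
    """Replace an identifier with a keyed hash so records can still be linked."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def generalize_age(age, width=10):
    """Convert an exact age into a coarser band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record, secret_key):
    """Keep only the fields needed for analysis, in reduced-precision form."""
    return {
        "pid": pseudonymize(record["patient_id"], secret_key),
        "age_band": generalize_age(record["age"]),
        "zip3": record["zip"][:3],
        "complaint": record["complaint"],
    }


# Hypothetical record and key; a real key must be stored securely.
key = b"example-secret-key"
record = {
    "patient_id": "MRN-000123",
    "name": "Jane Doe",
    "age": 37,
    "zip": "15213",
    "complaint": "cough and fever",
}
safe = deidentify(record, key)
print(safe["age_band"], safe["zip3"], "name" in safe)
```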
Through the further use and growth of explainable AI (XAI), researchers and all parties involved can ensure transparency and accountability when using data analysis and computational methods in the field of disease informatics. XAI provides explanations of how the algorithms being used work, why they were chosen, what knowledge they produce, and so on. [3]
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The process of analyzing and interpreting data is sometimes referred to as computational biology; however, the distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.
Data mining is the process of extracting and discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting information from a data set and transforming the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Advances in the field of deep learning have allowed neural networks to surpass many previous approaches in performance.
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.
Public health surveillance is, according to the World Health Organization (WHO), "the continuous, systematic collection, analysis and interpretation of health-related data needed for the planning, implementation, and evaluation of public health practice." Public health surveillance may be used to track emerging health-related issues at an early stage and find active solutions in a timely manner. Surveillance systems are generally called upon to provide information regarding when and where health problems are occurring and who is affected.
Public health informatics has been defined as the systematic application of information and computer science and technology to public health practice, research, and learning. It is one of the subdomains of health informatics, data management applied to medical systems.
Disease surveillance is an epidemiological practice by which the spread of disease is monitored in order to establish patterns of progression. The main role of disease surveillance is to predict, observe, and minimize the harm caused by outbreak, epidemic, and pandemic situations, as well as increase knowledge about which factors contribute to such circumstances. A key part of modern disease surveillance is the practice of disease case reporting.
Imaging informatics, also known as radiology informatics or medical imaging informatics, is a subspecialty of biomedical informatics that aims to improve the efficiency, accuracy, usability and reliability of medical imaging services within the healthcare enterprise. It is devoted to the study of how information about and contained within medical images is retrieved, analyzed, enhanced, and exchanged throughout the medical enterprise.
Real-time outbreak and disease surveillance system (RODS) is a syndromic surveillance system developed by the University of Pittsburgh, Department of Biomedical Informatics. It is "prototype developed at the University of Pittsburgh where real-time clinical data from emergency departments within a geographic region can be integrated to provide an instantaneous picture of symptom patterns and early detection of epidemic events."
The Influenza Research Database (IRD) is an integrative and comprehensive publicly available database and analysis resource to search, analyze, visualize, save and share data for influenza virus research. IRD is one of the five Bioinformatics Resource Centers (BRC) funded by the National Institute of Allergy and Infectious Diseases (NIAID), a component of the National Institutes of Health (NIH), which is an agency of the United States Department of Health and Human Services.
Infoveillance is a type of syndromic surveillance that specifically utilizes information found online. The term, along with the term infodemiology, was coined by Gunther Eysenbach to describe research that uses online information to gather information about human behavior.
Translational bioinformatics (TBI) is a field that emerged in the 2010s to study health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools, which can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining and the analysis of biomedical informatics in order to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations and interpreting biological information to suggest therapy treatments and predict health outcomes.
In bioinformatics, a Gene Disease Database is a systematized collection of data, typically structured to model aspects of reality, in a way to comprehend the underlying mechanisms of complex diseases, by understanding multiple composite interactions between phenotype-genotype relationships and gene-disease mechanisms. Gene Disease Databases integrate human gene-disease associations from various expert curated databases and text mining derived associations including Mendelian, complex and environmental diseases.
Genomic and medical data refers to an area within genetics that concerns the recording, sequencing and analysis of an organism's genome.
Artificial intelligence in healthcare is the application of artificial intelligence (AI) to analyze and understand complex medical and healthcare data. In some cases, it can exceed or augment human capabilities by providing better or faster ways to diagnose, treat, or prevent disease.
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Explainable AI (XAI), often overlapping with interpretable AI, or explainable machine learning (XML), is a field of research within artificial intelligence (AI) that explores methods that provide humans with the ability of intellectual oversight over AI algorithms. The main focus is on the reasoning behind the decisions or predictions made by the AI algorithms, to make them more understandable and transparent. This addresses users' requirement to assess safety and scrutinize the automated decision making in applications. XAI counters the "black box" tendency of machine learning, where even the AI's designers cannot explain why it arrived at a specific decision.
Automated decision-making (ADM) involves the use of data, machines and algorithms to make decisions in a range of contexts, including public administration, business, health, education, law, employment, transport, media and entertainment, with varying degrees of human oversight or intervention. ADM involves large-scale data from a range of sources, such as databases, text, social media, sensors, images or speech, that is processed using various technologies including computer software, algorithms, machine learning, natural language processing, artificial intelligence, augmented intelligence and robotics. The increasing use of automated decision-making systems (ADMS) across a range of contexts presents many benefits and challenges to human society requiring consideration of the technical, legal, ethical, societal, educational, economic and health consequences.
Acoustic epidemiology refers to the study of the determinants and distribution of disease. It also refers to the analysis of sounds produced by the body through a single tool or a combination of diagnostic tools.