Population informatics

The field of population informatics is the systematic study of populations via secondary analysis of massive data collections (termed "big data") about people. Scientists in the field refer to this massive data collection as the social genome, denoting the collective digital footprint of our society. Population informatics applies data science to social genome data to answer fundamental questions about human society and population health, much as bioinformatics applies data science to human genome data to answer questions about individual health. It is an emerging research area at the intersection of the social, behavioral, economic, and health (SBEH) sciences, computer science, and statistics, in which quantitative methods and computational tools are used to answer fundamental questions about our society.


Introduction

History

The term was first used in August 2012 when the Population Informatics Lab was founded at the University of North Carolina at Chapel Hill by Dr. Hye-Chung Kum. The term was first defined in a peer-reviewed article in 2013 [1] and further elaborated on in another article in 2014. [2] The first Workshop on Population Informatics for Big Data was held at the ACM SIGKDD conference in Sydney, Australia, in August 2015.

Goals

Population informatics aims to study social, behavioral, economic, and health phenomena using massive data collections about people, also known as social genome data. Its primary goal is to increase the understanding of social processes by developing and applying computationally intensive techniques to social genome data.[ citation needed ]

Approaches

Record linkage, the task of finding records that refer to the same entity across different data sources, is a major activity in population informatics because most digital traces about people are fragmented across many heterogeneous databases, which must be linked before analysis can be done.[ citation needed ]
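As a concrete illustration, the sketch below links two hypothetical patient lists by blocking on date of birth and then comparing names with an approximate string similarity. The field names, blocking key, and similarity threshold are illustrative assumptions, not a prescribed method.

    from difflib import SequenceMatcher

    # Hypothetical fragmented records from two sources with no shared identifier.
    hospital = [
        {"id": "H1", "name": "Jonathan Smith", "dob": "1980-05-02"},
        {"id": "H2", "name": "Maria Gonzales", "dob": "1975-11-30"},
    ]
    registry = [
        {"id": "R9", "name": "Jon Smith", "dob": "1980-05-02"},
        {"id": "R4", "name": "Mary Gonzalez", "dob": "1975-11-30"},
    ]

    def name_similarity(a, b):
        """Approximate string similarity in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def link(source_a, source_b, threshold=0.75):
        """Block on date of birth, then compare names within each block."""
        links = []
        for rec_a in source_a:
            for rec_b in source_b:
                if rec_a["dob"] != rec_b["dob"]:  # blocking: skip unlikely pairs
                    continue
                score = name_similarity(rec_a["name"], rec_b["name"])
                if score >= threshold:
                    links.append((rec_a["id"], rec_b["id"], round(score, 2)))
        return links

    print(link(hospital, registry))  # candidate links such as ('H1', 'R9', ...)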

Once relevant datasets are linked, the next task is usually to develop valid, meaningful measures to answer the research question. Because the data were collected for other purposes, with no intent to answer the question at hand, developing measures often involves iterating between inductive and deductive approaches with the data and the research question until usable measures emerge. Developing meaningful and useful measures from existing data is a major challenge in many research projects. In computational fields, these measures are often called features.[ citation needed ]
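For example, a study of repeat hospital use might need a "readmitted within 30 days" measure that exists in no single source. The sketch below derives such a feature from hypothetical linked admission events; the field names and the 30-day window are illustrative assumptions.

    from datetime import date

    # Hypothetical admission events, keyed by the person identifier
    # produced by the record-linkage step.
    admissions = [
        {"person": "P1", "admitted": date(2015, 1, 5), "discharged": date(2015, 1, 9)},
        {"person": "P1", "admitted": date(2015, 1, 30), "discharged": date(2015, 2, 2)},
        {"person": "P2", "admitted": date(2015, 3, 12), "discharged": date(2015, 3, 15)},
    ]

    def readmitted_within_30_days(events):
        """Per-person indicator: any admission within 30 days of a prior discharge."""
        by_person = {}
        for e in sorted(events, key=lambda e: e["admitted"]):
            by_person.setdefault(e["person"], []).append(e)
        return {
            person: any(
                (curr["admitted"] - prev["discharged"]).days <= 30
                for prev, curr in zip(stays, stays[1:])
            )
            for person, stays in by_person.items()
        }

    print(readmitted_within_30_days(admissions))  # {'P1': True, 'P2': False}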

Finally, with the datasets linked and the required measures developed, the analytic dataset is ready for analysis. Common analysis methods include traditional hypothesis-driven research as well as more inductive approaches such as data science and predictive analytics.
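As a minimal illustration of the predictive-analytics style of analysis, the sketch below fits a logistic regression to a toy analytic dataset; the features, the outcome, and the use of scikit-learn are illustrative assumptions rather than a recommended model.

    from sklearn.linear_model import LogisticRegression

    # Toy analytic dataset: one row per person, columns are derived measures
    # (age, prior admissions, chronic-condition flag); y is a binary outcome.
    X = [
        [34, 1, 0],
        [67, 3, 1],
        [45, 0, 0],
        [72, 4, 1],
        [29, 0, 0],
        [58, 2, 1],
    ]
    y = [0, 1, 0, 1, 0, 1]

    model = LogisticRegression()
    model.fit(X, y)

    # Predicted probability of the outcome for a new, unseen person.
    print(model.predict_proba([[60, 2, 1]])[0][1])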

Relation to other fields

Computational social science refers to the academic sub-disciplines concerned with computational approaches to the social sciences, in which computers are used to model, simulate, and analyze social phenomena. Fields include computational economics and computational sociology. The seminal article on computational social science is Lazer et al. (2009),[3] a summary of a workshop of the same title held at Harvard. However, the article does not define the term computational social science precisely.

In general, computational social science is a broader field that encompasses population informatics. Besides population informatics, it also includes complex simulations of social phenomena. Complex simulation models often use results from population informatics to calibrate their real-world parameters.[ citation needed ]

Data Science for Social Good (DSSG) is another related, emerging field. DSSG is broader, applying data science to any social problem; this includes the study of human populations but also many problems that do not use any data about people.[ citation needed ]

Population reconstruction is the multidisciplinary field of reconstructing specific (historical) populations by linking data from diverse sources, leading to rich, novel resources for study by social scientists. [4]

The first Workshop on Population Informatics for Big Data was held at the ACM SIGKDD conference in Sydney, Australia, in 2015. The workshop brought together computer science researchers as well as public health practitioners and researchers. This Wikipedia article was started at the workshop.

The International Population Data Linkage Network (IPDLN) facilitates communication between centres that specialize in data linkage and users of the linked data. Producers and users alike are committed to the systematic application of data linkage to produce community benefit in the population and health-related domains.

Challenges

Three major challenges specific to population informatics are:

  1. Preserving privacy of the subjects of the data – due to increasing concerns about privacy and confidentiality, sharing or exchanging sensitive data about subjects across different organizations is often not allowed. Therefore, population informatics often needs to be applied to encrypted data or in a privacy-preserving setting (see the sketch after this list). [1] [5] [6]
  2. The need for error bounds on the results – since real-world data often contain errors and variations, error bounds need to be used (for approximate matching) so that real decisions with a direct impact on people can be made based on these results. [7] [8] Research on error propagation through the full data pipeline, from data integration to final analysis, is also important. [9]
  3. Scalability – databases are continuously growing in size, which makes population informatics computationally expensive in terms of the size and number of data sources. [10] Scalable algorithms need to be developed to provide efficient and practical population informatics applications in real-world contexts.
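To illustrate the privacy-preserving setting mentioned in item 1, the following minimal sketch has two parties exchange keyed hashes of normalized identifiers instead of plaintext, so that exact matches can be found without revealing names. The shared key, the normalization, and the field choices are illustrative assumptions; practical privacy-preserving record linkage uses considerably more sophisticated protocols, for example to support approximate matching.

    import hashlib
    import hmac

    SHARED_KEY = b"agreed-out-of-band"  # hypothetical key known only to the linking parties

    def encode(name, dob):
        """Normalize an identifier and replace it with a keyed hash."""
        normalized = f"{name.strip().lower()}|{dob}"
        return hmac.new(SHARED_KEY, normalized.encode(), hashlib.sha256).hexdigest()

    # Each party encodes its own records locally and shares only the hashes.
    party_a = {encode("Jon Smith", "1980-05-02"): "A-17"}
    party_b = {encode("jon smith ", "1980-05-02"): "B-03"}

    # Either side (or a third party) finds exact matches without seeing identifiers.
    matches = [(party_a[h], party_b[h]) for h in party_a if h in party_b]
    print(matches)  # [('A-17', 'B-03')]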

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, chemistry, physics, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using computational and statistical techniques.

<span class="mw-page-title-main">Computational biology</span> Branch of biology

Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has foundations in applied mathematics, chemistry, and genetics. It differs from biological computing, a subfield of computer engineering which uses bioengineering to build computers.

<span class="mw-page-title-main">Health informatics</span> Applications of information processing concepts and machinery in medicine

Health informatics is the field of science and engineering that aims at developing methods and technologies for the acquisition, processing, and study of patient data, which can come from different sources and modalities, such as electronic health records, diagnostic test results, medical scans. The health domain provides an extremely wide variety of problems that can be tackled using computational techniques.

Record linkage is the task of finding records in a data set that refer to the same entity across different data sources. Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier, which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked.

<span class="mw-page-title-main">Electronic health record</span> Digital collection of patient and population electronically stored health information

An electronic health record (EHR) is the systematized collection of patient and population electronically stored health information in a digital format. These records can be shared across different health care settings. Records are shared through network-connected, enterprise-wide information systems or other information networks and exchanges. EHRs may include a range of data, including demographics, medical history, medication and allergies, immunization status, laboratory test results, radiology images, vital signs, personal statistics like age and weight, and billing information.

Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical and molecular biology domains. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies developed through studies in this field are frequently applied to the biomedical and molecular biology literature available through services such as PubMed.

Public health informatics has been defined as the systematic application of information and computer science and technology to public health practice, research, and learning. It is one of the subdomains of health informatics.

Mark Bender Gerstein is an American scientist working in bioinformatics and Data Science. As of 2009, he is co-director of the Yale Computational Biology and Bioinformatics program.

Personal genomics or consumer genetics is the branch of genomics concerned with the sequencing, analysis and interpretation of the genome of an individual. The genotyping stage employs different techniques, including single-nucleotide polymorphism (SNP) analysis chips, or partial or full genome sequencing. Once the genotypes are known, the individual's variations can be compared with the published literature to determine likelihood of trait expression, ancestry inference and disease risk.

<span class="mw-page-title-main">Big data</span> Information assets characterized by high volume, velocity, and variety

Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many entries (rows) offer greater statistical power, while data with higher complexity may lead to a higher false discovery rate. Though used sometimes loosely partly because of a lack of formal definition, the interpretation that seems to best describe big data is the one associated with large body of information that we could not comprehend when used only in smaller amounts.

Infoveillance is a type of syndromic surveillance that specifically utilizes information found online. The term, along with the term infodemiology, was coined by Gunther Eysenbach to describe research that uses online information to gather information about human behavior.

Translational bioinformatics (TBI) is an emerging field in the study of health informatics, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data to formulate knowledge and medical tools, which can be utilized by scientists, clinicians, and patients. Furthermore, it involves applying biomedical research to improve human health through the use of computer-based information system. TBI employs data mining and analyzing biomedical informatics in order to generate clinical knowledge for application. Clinical knowledge includes finding similarities in patient populations, interpreting biological information to suggest therapy treatments and predict health outcomes.

The social genome is the collection of data about members of a society that is captured in ever-larger and ever-more complex databases. Some have used the term digital footprint to refer to individual traces.

Behavior informatics (BI) is the informatics of behaviors so as to obtain behavior intelligence and behavior insights. BI is a research method combining science and technology, specifically in the area of engineering. The purpose of BI includes analysis of current behaviors as well as the inference of future possible behaviors. This occurs through pattern recognition.

Data re-identification or de-anonymization is the practice of matching anonymous data with publicly available information, or auxiliary data, in order to discover the individual to which the data belong. This is a concern because companies with privacy policies, health care providers, and financial institutions may release the data they collect after the data has gone through the de-identification process.

DNA encryption is the process of hiding or perplexing genetic information by a computational method in order to improve genetic privacy in DNA sequencing processes. The human genome is complex and long, but it is very possible to interpret important, and identifying, information from smaller variabilities, rather than reading the entire genome. A whole human genome is a string of 3.2 billion base paired nucleotides, the building blocks of life, but between individuals the genetic variation differs only by 0.5%, an important 0.5% that accounts for all of human diversity, the pathology of different diseases, and ancestral story. Emerging strategies incorporate different methods, such as randomization algorithms and cryptographic approaches, to de-identify the genetic sequence from the individual, and fundamentally, isolate only the necessary information while protecting the rest of the genome from unnecessary inquiry. The priority now is to ascertain which methods are robust, and how policy should ensure the ongoing protection of genetic privacy.

<span class="mw-page-title-main">Lucila Ohno-Machado</span> Biomedical engineer

Lucila Ohno-Machado is a biomedical engineer and the chair of the Department of Biomedical Informatics and associate dean for informatics and technology at UC San Diego. She is an elected member of the American Society for Clinical Investigation and the National Academy of Medicine.

<span class="mw-page-title-main">Genome informatics</span>

Genome Informatics is a scientific study of information processing in genomes.

<span class="mw-page-title-main">Melissa Haendel</span> American bioinformaticist

Melissa Anne Haendel is an American bioinformaticist who is the Chief Research Informatics Officer of the Anschutz Medical Campus of the University of Colorado as well as a Professor of Biochemistry and Molecular Genetics and the Marsico Chair in Data Science. She serves as Director of the Center for Data to Health (CD2H). Her research makes use of data to improve the discovery and diagnosis of diseases. During the COVID-19 pandemic, Haendel joined with the National Institutes of Health to launch the National COVID Cohort Collaborative (N3C), which looks to identify the risk factors that can predict severity of disease outcome and help to identify treatments.

Biomedical data science is a multidisciplinary field which leverages large volumes of data to promote biomedical innovation and discovery. Biomedical data science draws from various fields including Biostatistics, Biomedical informatics, and machine learning, with the goal of understanding biological and medical data. It can be viewed as the study and application of data science to solve biomedical problems. Modern biomedical datasets often have specific features which make their analyses difficult, including:

References

  1. Kum, Hye-Chung; Ahalt, Stanley (2013). "Privacy-by-Design: Understanding Data Access Models for Secondary Data". AMIA Joint Summits on Translational Science Proceedings. 2013: 126–130. ISSN 2153-4063. PMC 3845756. PMID 24303251.
  2. Kum, Hye-Chung; Krishnamurthy, A.; Machanavajjhala, A.; Ahalt, S.C. (2014). "Social Genome: Putting Big Data to Work for Population Informatics". Computer. 47 (1): 56–63. doi:10.1109/MC.2013.405. ISSN 0018-9162. S2CID 6275413.
  3. Lazer, David; Pentland, Alex (Sandy); Adamic, Lada; Aral, Sinan; Barabasi, Albert Laszlo; Brewer, Devon; Christakis, Nicholas; Contractor, Noshir; Fowler, James (2009). "Life in the network: the coming age of computational social science". Science. 323 (5915): 721–723. doi:10.1126/science.1167742. ISSN 0036-8075. PMC 2745217. PMID 19197046.
  4. Bloothooft, G.; Christen, P.; Mandemakers, K.; Schraagen, M. (2015). Population Reconstruction. Springer. doi:10.1007/978-3-319-19884-2. ISBN 978-3-319-19883-5.
  5. Vatsalan, Dinusha; Christen, Peter; Verykios, Vassilios S. (2013). "A taxonomy of privacy-preserving record linkage techniques". Information Systems. 38 (6): 946–969. doi:10.1016/j.is.2012.11.005.
  6. Kum, Hye-Chung; Krishnamurthy, Ashok; Machanavajjhala, Ashwin; Reiter, Michael K.; Ahalt, Stanley (2014). "Privacy preserving interactive record linkage (PPIRL)". Journal of the American Medical Informatics Association. 21 (2): 212–220. doi:10.1136/amiajnl-2013-002165. ISSN 1067-5027. PMC 3932473. PMID 24201028.
  7. Christen, Peter (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer. doi:10.1007/978-3-642-31164-2.
  8. Christen, Peter; Vatsalan, Dinusha; Fu, Zhichun (2015). "Advanced Record Linkage Methods and Privacy Aspects for Population Reconstruction: A Survey and Case Studies". In Population Reconstruction. Springer. pp. 87–110. doi:10.1007/978-3-319-19884-2_5.
  9. Lahiri, P.; Larsen, Michael D. (2005). "Regression Analysis with Linked Data". Journal of the American Statistical Association. 100 (469): 222–230. CiteSeerX 10.1.1.143.1706. doi:10.1198/016214504000001277. JSTOR 27590532. S2CID 15873588.
  10. Ranbaduge, Thilina; Vatsalan, Dinusha; Christen, Peter (2015). "Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage". PAKDD 2015. Springer. pp. 549–561. doi:10.1007/978-3-319-18032-8_43.