Materials informatics is a field of study that applies the principles of informatics and data science to materials science and engineering in order to improve the understanding, use, selection, development, and discovery of materials. The term "materials informatics" is frequently used interchangeably with "data science", "machine learning", and "artificial intelligence" by the community. It is an emerging field whose aim is high-speed, robust acquisition, management, analysis, and dissemination of diverse materials data, with the goal of greatly reducing the time and risk required to develop, produce, and deploy new materials, a process that today generally takes more than 20 years. [1] [2] [3] The field is not limited to traditional understandings of the relationship between materials and information. Narrower interpretations include combinatorial chemistry, process modeling, materials databases, materials data management, and product life cycle management. Materials informatics lies at the convergence of these concepts but also transcends them: it has the potential to achieve greater insight and deeper understanding by applying lessons learned from data gathered on one type of material to others. By gathering appropriate metadata, the value of each individual data point can be greatly expanded.
Databases are essential for any informatics research and application. In materials informatics, many databases exist containing both empirical data obtained experimentally and theoretical data obtained computationally. Big data suitable for machine learning is particularly difficult to obtain experimentally because of the lack of a standard for reporting data and the variability of experimental environments. This scarcity of big data has led to a growing effort to develop machine learning techniques that make effective use of extremely small data sets. On the other hand, large, uniform databases of theoretical density functional theory (DFT) calculations do exist, and these have proven their utility in high-throughput materials screening and discovery; common examples include the Materials Project and the NIST-JARVIS infrastructure, both described below.
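To illustrate the small-data regime described above, the following is a minimal, hypothetical sketch of a common materials-informatics workflow: fit a surrogate model to a few dozen measured samples, validate it by cross-validation, and use it to rank a large pool of candidate materials for follow-up experiments or DFT calculations. The descriptors and all data values are invented for illustration only:

```python
# A minimal sketch of a surrogate-model screening loop on a small,
# synthetic dataset. Descriptor names and values are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical descriptors (e.g. mean atomic radius, electronegativity
# difference, valence-electron count) for 40 "measured" materials,
# the small-data regime typical of experimental results.
X_known = rng.uniform(0.0, 1.0, size=(40, 3))
y_known = X_known @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.05, 40)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# With so few samples, cross-validation is essential to avoid
# overestimating predictive accuracy.
scores = cross_val_score(model, X_known, y_known, cv=5, scoring="r2")
print(f"cross-validated R^2: {scores.mean():.2f} +/- {scores.std():.2f}")

# Screen a larger pool of hypothetical candidates and report the most
# promising ones for follow-up experiments or DFT calculations.
model.fit(X_known, y_known)
X_candidates = rng.uniform(0.0, 1.0, size=(1000, 3))
ranking = np.argsort(model.predict(X_candidates))[::-1]
print("top candidate indices:", ranking[:5])
```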
The concept of materials informatics is addressed by the Materials Research Society. For example, materials informatics was the theme of the December 2006 issue of the MRS Bulletin. The issue was guest-edited by John Rodgers of Innovative Materials, Inc., and David Cebon of Cambridge University, who described the "high payoff for developing methodologies that will accelerate the insertion of materials, thereby saving millions of investment dollars."
The editors adopted a limited definition of materials informatics, one primarily concerned with computational methods for processing and interpreting data. They stated that "specialized informatics tools for data capture, management, analysis, and dissemination" and "advances in computing power, coupled with computational modeling and simulation and materials properties databases" would enable such accelerated insertion of materials.
A broader definition of materials informatics goes beyond the use of computational methods to carry out the same experimentation. [4] It views materials informatics as a framework in which a measurement or computation is one step in an information-based learning process that uses the power of a collective to achieve greater efficiency in exploration. When properly organized, this framework crosses materials boundaries to uncover fundamental knowledge of the basis of physical, mechanical, and engineering properties. [5]
While many believe in the future of informatics in the materials development and scaling process, many challenges remain. Hill et al. write that "Today, the materials community faces serious challenges to bringing about this data-accelerated research paradigm, including diversity of research areas within materials, lack of data standards, and missing incentives for sharing, among others. Nonetheless, the landscape is rapidly changing in ways that should benefit the entire materials research enterprise." [6] This tension between traditional materials development methodologies and computational, machine learning, and analytics-based approaches will likely persist for some time as the materials industry overcomes the cultural barriers to fully embracing such new ways of thinking.
The overarching goals of bioinformatics and systems biology may provide a useful analogy. Andrew Murray of Harvard University expresses the hope that such an approach "will save us from the era of 'one graduate student, one gene, one PhD'". [7] Similarly, the goal of materials informatics is to save us from the era of one graduate student, one alloy, one PhD. Such goals will require more sophisticated strategies and research paradigms than simply applying data-science methods to the same set of tasks currently undertaken by students.
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics, and statistics to analyze and interpret biological data. The process of analyzing and interpreting data is sometimes referred to as computational biology; however, the distinction between the two terms is often disputed. To some, the term computational biology refers to building and using models of biological systems.
Computational chemistry is a branch of chemistry that uses computer simulations to assist in solving chemical problems. It uses methods of theoretical chemistry incorporated into computer programs to calculate the structures and properties of molecules, groups of molecules, and solids. The importance of the subject stems from the fact that, with the exception of some relatively recent findings related to the hydrogen molecular ion, achieving an accurate quantum mechanical description of chemical systems analytically, or in closed form, is not feasible. The complexity inherent in the many-body problem further compounds the difficulty of describing quantum mechanical systems in detail. While computational results normally complement information obtained by chemical experiments, they can occasionally predict unobserved chemical phenomena.
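For a concrete sense of what such a calculation looks like, the following is a minimal sketch assuming the open-source PySCF package (one of many equivalent quantum chemistry codes). It numerically approximates the ground-state energy of the hydrogen molecule, a system with no exact closed-form solution, using the Hartree-Fock method:

```python
# A minimal sketch, assuming PySCF is installed, of the kind of
# calculation computational chemistry automates: a Hartree-Fock
# estimate of the ground-state energy of H2.
from pyscf import gto, scf

# Build the H2 molecule at roughly its equilibrium bond length (0.74 A).
mol = gto.M(atom="H 0 0 0; H 0 0 0.74", basis="sto-3g")

# Solve the restricted Hartree-Fock equations self-consistently; this is
# an approximate numerical treatment of the many-body problem.
mf = scf.RHF(mol)
energy = mf.kernel()  # total electronic energy in Hartree
print(f"RHF/STO-3G energy of H2: {energy:.6f} Ha")
```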
Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has foundations in applied mathematics, chemistry, and genetics. It differs from biological computing, a subfield of computer science and engineering which uses bioengineering to build computers.
Biological databases are libraries of biological science information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, and clinical effects of mutations, as well as similarities of biological sequences and structures.
Biomedical text mining refers to the methods and study of how text mining may be applied to texts and literature of the biomedical domain. As a field of research, biomedical text mining incorporates ideas from natural language processing, bioinformatics, medical informatics and computational linguistics. The strategies in this field have been applied to the biomedical literature available through services such as PubMed.
Neuroinformatics is the emergent field that combines informatics and neuroscience. It is concerned with neuroscience data and with information processing by artificial neural networks. There are three main directions in which neuroinformatics is applied: the development of tools and databases for the management and sharing of neuroscience data, the development of tools for analyzing and modeling these data, and the development of computational models of the nervous system and neural processes.
Erik Bongcam-Rudloff is a Chilean-born Swedish biologist and computer scientist. He received his doctorate in medical sciences from Uppsala University in 1994. He is Professor of Bioinformatics and the head of SLU-Global Bioinformatics Centre at the Swedish University of Agricultural Sciences. His main research deals with development of bioinformatics solutions for the Life Sciences community.
Computational epigenetics uses statistical methods and mathematical modelling in epigenetic research. Due to the recent explosion of epigenome datasets, computational methods play an increasing role in all areas of epigenetic research.
Søren Brunak is a Danish biological and physical scientist working in bioinformatics, systems biology, and medical informatics. He is a professor of Disease Systems Biology at the University of Copenhagen and professor of bioinformatics at the Technical University of Denmark. As Research Director at the Novo Nordisk Foundation Center for Protein Research at the University of Copenhagen Medical School, he leads a research effort where molecular-level systems biology data are combined with phenotypic data from the healthcare sector, such as electronic patient records, registry information, and biobank questionnaires. A major aim is to understand the network biology basis for time-ordered comorbidities and discriminate between treatment-related disease correlations and other comorbidities in disease trajectories. Søren Brunak also holds a position as a Medical Informatics Officer at Rigshospitalet, the Capital Region of Denmark.
The International Society for Computational Biology (ISCB) is a scholarly society for researchers in computational biology and bioinformatics. The society was founded in 1997 to provide a stable financial home for the Intelligent Systems for Molecular Biology (ISMB) conference and has grown to become a larger society working towards advancing understanding of living systems through computation and for communicating scientific advances worldwide.
The Clean Energy Project (CEP) was a virtual high-throughput discovery and design effort, now concluded, for the next generation of plastic solar cell materials. It studied millions of candidate structures to identify suitable compounds for harvesting renewable energy from the sun and for other organic electronic applications. It ran on the BOINC platform.
Lawrence E. Hunter is a Professor and Director of the Center for Computational Pharmacology and of the Computational Bioscience Program at the University of Colorado School of Medicine and Professor of Computer Science at the University of Colorado Boulder. He is internationally known for his work on knowledge-driven extraction of information from the primary biomedical literature, the semantic integration of knowledge resources in molecular biology, and the use of knowledge in the analysis of high-throughput data, as well as for his foundational work in computational biology, which led to the genesis of the major professional organization in the field and two international conferences.
Translational bioinformatics (TBI) is a field of health informatics that emerged in the 2010s, focused on the convergence of molecular bioinformatics, biostatistics, statistical genetics, and clinical informatics. Its focus is on applying informatics methodology to the increasing amount of biomedical and genomic data in order to formulate knowledge and medical tools that can be utilized by scientists, clinicians, and patients. It also involves applying biomedical research to improve human health through the use of computer-based information systems. TBI employs data mining of biomedical data to generate clinical knowledge for application, such as finding similarities in patient populations, interpreting biological information to suggest therapies, and predicting health outcomes.
Gerbrand Ceder is a Belgian–American scientist who is a professor and the Samsung Distinguished Chair in Nanoscience and Nanotechnology Research at the University of California, Berkeley. He has a joint appointment as a senior faculty scientist in the Materials Sciences Division of Lawrence Berkeley National Laboratory. He is notable for his pioneering research in high-throughput computational materials design, and in the development of novel lithium-ion battery technologies. He is co-founder of the Materials Project, an open-source online database of ab initio calculated material properties, which inspired the Materials Genome Initiative by the Obama administration in 2011. He was previously the Founder and CTO of Pellion Technologies, which aimed to commercialize magnesium-ion batteries. In 2017 Gerbrand Ceder was elected a member of the National Academy of Engineering, "For the development of practical computational materials design and its application to the improvement of energy storage technology."
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Nanoinformatics is the application of informatics to nanotechnology. It is an interdisciplinary field that develops methods and software tools for understanding nanomaterials, their properties, and their interactions with biological entities, and using that information more efficiently. It differs from cheminformatics in that nanomaterials usually involve nonuniform collections of particles that have distributions of physical properties that must be specified. The nanoinformatics infrastructure includes ontologies for nanomaterials, file formats, and data repositories.
Kristin Aslaug Persson is a Swedish/Icelandic American physicist and chemist. She was born in Lund, Sweden, in 1971, to Eva Haettner-Aurelius and Einar Benedikt Olafsson. She is the Daniel M. Tellep Distinguished Professor of Materials Science and Engineering at the University of California, Berkeley, and a faculty senior staff scientist at Lawrence Berkeley National Laboratory. Between 2020 and 2024, she served as the director of the Molecular Foundry, a national user facility managed by the US Department of Energy at Lawrence Berkeley National Laboratory. Persson is the director and founder of the Materials Project, a multi-national effort to compute the properties of all inorganic materials. Her research group focuses on the data-driven computational design and prediction of new materials for clean energy production and storage applications. In 2024, Persson was elected a member of the Royal Swedish Academy of Sciences, in the class of Chemistry.
The Materials Project is an open-access database offering material properties to accelerate the development of technology by predicting how new materials, both real and hypothetical, can be used. The project was established in 2011 with an emphasis on battery research, but it includes property calculations for many areas of clean energy systems such as photovoltaics, thermoelectric materials, and catalysts. Most of the approximately 35,000 known molecules and over 130,000 inorganic compounds are included in the database.
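As a hedged illustration of how the database is typically accessed programmatically, the following sketch retrieves the computed crystal structure of silicon (Materials Project identifier mp-149) using the legacy MPRester client from the pymatgen library; the project's client API has evolved over time (a newer mp-api client also exists), so the current documentation should be consulted:

```python
# A minimal sketch of querying the Materials Project; assumes pymatgen
# is installed and that you have a free personal API key from
# materialsproject.org. Calls follow the legacy MPRester interface.
from pymatgen.ext.matproj import MPRester

API_KEY = "YOUR_API_KEY"  # placeholder; substitute your own key

with MPRester(API_KEY) as mpr:
    # mp-149 is the Materials Project identifier for silicon.
    structure = mpr.get_structure_by_material_id("mp-149")
    print(structure.composition.reduced_formula)  # -> Si
    print(structure.lattice)  # computed lattice parameters
```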
Biomedical data science is a multidisciplinary field that leverages large volumes of data to promote biomedical innovation and discovery. It draws from various fields, including biostatistics, biomedical informatics, and machine learning, with the goal of understanding biological and medical data, and can be viewed as the study and application of data science to biomedical problems. Modern biomedical datasets often have specific features that make their analysis difficult.
Kamal Choudhary is an Indian American physicist and computational materials scientist in the thermodynamics and kinetics group at the National Institute of Standards and Technology. He is most notable for establishing the NIST-JARVIS infrastructure for data-driven materials design and materials informatics. He is also an associate editor of the journals npj Computational Materials and Scientific Data.