Data Science and Predictive Analytics

Data Science and Predictive Analytics: Biomedical and Health Applications using R
Cover of the official 2nd edition
Author Ivo D. Dinov
Language English
Series The Springer Series in Applied Machine Learning
Subject Computer science, data science, artificial intelligence
Publisher Springer
Publication date 2018 (1st ed.), 2023 (2nd ed.)
Publication place Switzerland
Media type Print (hardcover and softcover), electronic (PDF and EPub)
ISBN 978-3-031-17483-4, 978-3-319-72346-4, 978-3-031-17485-8, 978-3-031-17482-7

The first edition of the textbook Data Science and Predictive Analytics: Biomedical and Health Applications using R, authored by Ivo D. Dinov, was published in August 2018 by Springer. [1] The second edition of the book was printed in 2023. [2]

Contents

This textbook covers some of the core mathematical foundations, computational techniques, and artificial intelligence approaches used in data science research and applications. [3]

Using the statistical computing platform R and a broad range of biomedical case studies, the 23 chapters of the first edition provide explicit examples of importing, exporting, processing, modeling, visualizing, and interpreting large, multivariate, incomplete, heterogeneous, and longitudinal datasets (big data). [4]

Structure

First edition table of contents

The first edition of the Data Science and Predictive Analytics (DSPA) textbook [1] is divided into the following 23 chapters, each progressively building on the previous content.

  1. Motivation
  2. Foundations of R
  3. Managing Data in R
  4. Data Visualization
  5. Linear Algebra & Matrix Computing
  6. Dimensionality Reduction
  7. Lazy Learning: Classification Using Nearest Neighbors
  8. Probabilistic Learning: Classification Using Naive Bayes
  9. Decision Tree Divide and Conquer Classification
  10. Forecasting Numeric Data Using Regression Models
  11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
  12. Apriori Association Rules Learning
  13. k-Means Clustering
  14. Model Performance Assessment
  15. Improving Model Performance
  16. Specialized Machine Learning Topics
  17. Variable/Feature Selection
  18. Regularized Linear Modeling and Controlled Variable Selection
  19. Big Longitudinal Data Analysis
  20. Natural Language Processing/Text Mining
  21. Prediction and Internal Statistical Cross Validation
  22. Function Optimization
  23. Deep Learning, Neural Networks

Second edition table of contents

The significantly reorganized revised edition of the book (2023) [2] expands and modernizes the presented mathematical principles, computational methods, data science techniques, model-based machine learning and model-free artificial intelligence algorithms. The 14 chapters of the new edition start with an introduction and progressively build foundational skills to naturally reach biomedical applications of deep learning.

  1. Introduction
  2. Basic Visualization and Exploratory Data Analytics
  3. Linear Algebra, Matrix Computing, and Regression Modeling
  4. Linear and Nonlinear Dimensionality Reduction
  5. Supervised Classification
  6. Black Box Machine Learning Methods
  7. Qualitative Learning Methods—Text Mining, Natural Language Processing, and Apriori Association Rules Learning
  8. Unsupervised Clustering
  9. Model Performance Assessment, Validation, and Improvement
  10. Specialized Machine Learning Topics
  11. Variable Importance and Feature Selection
  12. Big Longitudinal Data Analysis
  13. Function Optimization
  14. Deep Learning, Neural Networks

Reception

The materials in the Data Science and Predictive Analytics (DSPA) textbook have been peer-reviewed in the Journal of the American Statistical Association, [5] International Statistical Institute’s ISI Review Journal, [3] and the Journal of the American Library Association. [4] Many scholarly publications reference the DSPA textbook. [6] [7]

As of January 17, 2021, the electronic version of the first edition (ISBN 978-3-319-72347-1) is freely available on SpringerLink [8] and has been downloaded over 6 million times. The textbook is available worldwide in print (hardcover and softcover) and electronic formats (PDF and EPub), is held by many college and university libraries, [9] and has been used for data science, computational statistics, and analytics classes at various institutions. [10]

References