Summary
Background
Traditional statistical tests are unable to handle large numbers of variables. The simplest method to reduce large numbers of variables is the use of add-up scores. But add-up scores do not account for the relative importance of the separate variables, their interactions, and differences in units. Machine learning can be defined as the knowledge for making predictions that is obtained by processing training data through a computer. If data sets involve multiple variables, data analyses will be complex, and modern computationally intensive methods will have to be applied for analysis.
Objective and Methods
The current book, using real data examples as well as simulated data, reviews important methods relevant for health care and research, although little used in the field so far.
Results and Conclusions
One of the first machine learning methods used in health research is logistic regression for health profiling, where single combinations of x-variables are used to predict the risk of a medical event in single persons.
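As a minimal sketch of the idea (not the book's own example, which uses SPSS), a logistic model fitted to a few hypothetical patient records turns a combination of x-variables into an individual risk estimate; the data below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical records: age (years) and systolic blood pressure (mmHg)
# for eight patients; 1 = medical event occurred, 0 = no event.
X = np.array([[45, 120], [60, 150], [52, 135], [70, 160],
              [38, 110], [65, 155], [48, 125], [72, 165]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted event risk for a single new person (age 58, pressure 140 mmHg)
risk = model.predict_proba([[58, 140]])[0, 1]
print(round(risk, 2))
```

The fitted model returns a probability between 0 and 1 for each individual profile, which is exactly the "risk of a medical event in single persons" described above.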
A wonderful method for analyzing imperfect data with multiple variables is optimal scaling.
Partial correlation analysis is the best method for removing interaction effects from large clinical data sets.
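A partial correlation can be computed by correlating the residuals of two variables after regressing out a third; a small numpy sketch with simulated data (the confounder `z` and noise levels are assumptions for illustration):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    # Residuals of x and y after least-squares regression on z (with intercept)
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(1)
z = rng.normal(size=200)                    # shared confounder
x = z + rng.normal(scale=0.5, size=200)
y = z + rng.normal(scale=0.5, size=200)

raw = np.corrcoef(x, y)[0, 1]               # inflated by the confounder
adjusted = partial_corr(x, y, z)            # close to zero after adjustment
print(round(raw, 2), round(adjusted, 2))
```

The raw correlation is large only because both variables share `z`; the partial correlation removes that shared effect.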
Mixed linear modeling (1), binary partitioning (2), item response modeling (3), time-dependent predictor analysis (4) and autocorrelation (5) are linear or loglinear regression methods suitable for assessing data with respectively repeated measures (1), binary decision trees (2), exponential exposure-response relationships (3), different values at different periods (4) and seasonal differences (5).
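To illustrate one of these methods, binary partitioning (2), a decision tree with a single split can be fitted to hypothetical data; the exposure values and threshold below are invented for the sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: the event (1) occurs mainly above a threshold of the exposure
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A depth-1 tree performs a single binary partition of the exposure range
tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(tree))  # shows the binary split the tree found
```

The printed tree shows the cut-point that best separates patients with and without the event, which is the basic building block of a binary decision tree.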
Clinical data sets with non-linear relationships between exposure and outcome variables require special analysis methods, and can usually be adequately analyzed with neural network methods like multilayer perceptron networks and radial basis function networks.
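A minimal sketch of a multilayer perceptron fitted to a non-linear exposure-outcome relationship; the quadratic relationship and network size are assumptions chosen for illustration, not taken from the book:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical non-linear exposure-outcome relationship: outcome = exposure^2 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(scale=0.1, size=200)

# A small multilayer perceptron can approximate the curved relationship,
# which an ordinary linear regression could not capture
net = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=5000,
                   random_state=0).fit(x, y)
r2 = net.score(x, y)  # R^2 on the training data
print(round(r2, 2))
```

A straight-line model would explain almost none of this variance, while the network follows the curve.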
Clinical data with multiple exposure variables are usually analyzed using analysis of (co)variance (AN(C)OVA), but this method does not adequately account for the relative importance of the variables and their interactions. Factor analysis and hierarchical cluster analysis address all of these limitations.
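A brief sketch of hierarchical cluster analysis on simulated patients; the two groups and their separation are invented to show how the method recovers structure among multiple exposure variables:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
# Two hypothetical groups of patients, each measured on three exposure variables
group_a = rng.normal(loc=0.0, size=(10, 3))
group_b = rng.normal(loc=5.0, size=(10, 3))
data = np.vstack([group_a, group_b])

# Ward linkage builds the cluster tree; cut it into two clusters
labels = fcluster(linkage(data, method="ward"), t=2, criterion="maxclust")
print(labels)
```

The cluster labels separate the two simulated patient groups without the analyst specifying any grouping in advance.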
Data with multiple outcome variables are usually analyzed with multivariate analysis of (co)variance (MAN(C)OVA). However, this has the same limitations as ANOVA. Partial least squares analysis, discriminant analysis, and canonical regression address all of these limitations.
Fuzzy modeling is a method suitable for modeling soft data, like data that are partially true or response patterns that are different at different times.
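The idea of "partially true" data can be sketched with fuzzy membership functions; the triangular shape and the blood-pressure cut-offs below are illustrative assumptions, not values from the book:

```python
def triangular(x, a, b, c):
    """Triangular fuzzy membership: 0 at a, rising to 1 at b, falling to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# A systolic pressure of 135 mmHg is partially "normal" and partially "high":
# it belongs to both fuzzy sets, with different degrees of membership
normal = triangular(135, 90, 120, 140)   # membership in "normal"
high = triangular(135, 130, 160, 190)    # membership in "high"
print(normal, high)
```

Instead of forcing the value into a single category, fuzzy modeling keeps both memberships and propagates them through the model.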
Introduction
Traditional statistical tests are unable to handle large numbers of variables. The simplest method to reduce large numbers of variables is the use of add-up scores. But add-up scores do not account for the relative importance of the separate variables, their interactions, and differences in units.
Principal components analysis and partial least squares analysis, hierarchical cluster analysis, optimal scaling and canonical regression are modern computationally intensive methods, currently often listed as machine learning methods. This is because the computations they make are far too complex to perform without the help of a computer, and because they turn input information into knowledge, which is in human terms a kind of learning process.
An additional advantage is that the novel methods are able to address all of the limitations of the traditional methods. Although widely used in the fields of behavioral sciences, social sciences, marketing, operational research and applied sciences, they are virtually unused in medicine. This is a pity, given the omnipresence of large numbers of variables in this field of research. However, this is probably just a matter of time, now that the methods are increasingly available in SPSS statistical software and many other packages.
We will start with logistic regression for health profiling, where single combinations of x-variables are used to predict the risk of a medical event in single persons. The book ends with fuzzy modeling, a method for modeling soft data, like data that are partially true or response patterns that are different at different times.
A nice thing about the novel methodologies, thus, is that, unlike traditional methods like ANOVA and MANOVA, they can not only handle large data files with numerous exposure and outcome variables, but can also do so in a relatively unbiased way.
The current book serves as an introduction to machine learning methods in clinical research, and was written as a hand-hold presentation accessible to clinicians, and as a must-read publication for those new to the methods. It is the authors' experience, as master class professors, that students are eager to master adequate command of statistical software. For their benefit, all of the steps of the novel methods, from logging in to the final result, will be given using SPSS statistical software in most of the chapters. We will end this initial chapter with some machine learning terminology.
Machine Learning Terminology
Artificial Intelligence
Engineering method that simulates the structures and operating principles of the human brain.
Bootstraps
Machine learning methods are computationally intensive. Computers make use of bootstraps, otherwise called random sampling from the data with replacement, in order to facilitate the calculations. The bootstrap is a Monte Carlo method.
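Random sampling with replacement can be sketched in a few lines; the cholesterol reductions below are invented data, and the percentile confidence interval is one common way (among several) of summarizing the bootstrap distribution:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical cholesterol reductions (mmol/l) in ten patients
data = np.array([0.8, 1.2, 0.5, 1.9, 0.7, 1.1, 0.3, 1.4, 0.9, 1.6])

# Draw 2000 bootstrap samples: each is a random sample from the data
# WITH replacement, the same size as the original data set
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(2000)])

# Percentile 95% confidence interval for the mean reduction
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(round(lo, 2), round(hi, 2))
```

Because the interval is built by resampling rather than by a formula, the method works even when the sampling distribution of the statistic is hard to derive analytically.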
Canonical Regression
Multivariate method. ANOVA/ANCOVA (analysis of (co)variance) and MANOVA/MANCOVA (multivariate analysis of (co)variance) are the standard methods for the analysis of data with respectively multiple independent and dependent variables. A problem with these methods is that they rapidly lose statistical power with increasing numbers of variables, and that computer commands may not be executed due to numerical problems with higher order calculations among components. Also, clinically, we are often more interested in the combined effects of the clusters of variables than in the separate effects of the different variables. As a simple solution, composite variables can be used as add-up sums of separate variables, but add-up sums do not account for the relative importance of the separate variables, their interactions, and differences in units. Canonical analysis can account for all of that, and, unlike MANCOVA, gives, in addition to test statistics of the separate variables, overall test statistics of entire sets of variables.