1. An Introduction to Structural Equation Models
The past two decades have witnessed a remarkable acceleration of interest in structural equation modeling (SEM) methods in many areas of research. In the social sciences, researchers often distinguish SEM approaches from more powerful systems of regression equation approaches by the inclusion of unobservable constructs (called latent variables in the SEM vernacular), and by the use of computationally intensive iterative searches for coefficients that fit the data. The expansion of statistical analysis to encompass unmeasurable constructs using SEM, canonical correlation, Likert scale quantification, principal components, and factor analysis has vastly extended the scope and relevance of the social sciences over the past century. Subjects that were previously the realm of abstract argumentation have been transported into the mainstream of scientific research.
The products of SEM statistical analysis algorithms fall into three groups: (1) pairwise canonical correlations between prespecified latent variables computed from observable data (from the so-called partial least squares path analysis, or PLS-PA, approaches); (2) multivariate canonical correlation matrices for prespecified networks of latent variables computed from observable data (from a group of computer-intensive search algorithms originating with Karl Jöreskog); and (3) systems of regression approaches that fit data to networks of observable variables. A fourth approach is fast emerging with the introduction of powerful new social network analysis tools. These allow both visualization and network-specific statistics that draw on an old and rich literature in graph theory and physical network effects.
Most of the PLS-PA algorithms are variations on an incompletely documented software package released in 1980 (Lohmöller, ) canonical correlation methods.
Two different covariance structure algorithms are widely used: (1) LISREL (an acronym for LInear Structural RELations) (K. G. Jöreskog, ). Variations on these algorithms have been implemented in EQS, TETRAD, and other packages.
Methods in systems of equation modeling and social network analytics are not as familiar in the social sciences as the first two methods, but offer comparatively more analytical power. Accessible and comprehensive tools for these additional approaches are covered in this book, as are research approaches to take advantage of the additional explanatory power that these approaches offer to social science research.
The breadth of application of SEM methods has been expanding, with SEM increasingly applied to exploratory, confirmatory, and predictive analysis through a variety of ad hoc topics and models. SEM is particularly useful in the social sciences, where many if not most key concepts are not directly observable, and models that inherently estimate latent variables are desirable. Because many key concepts in the social sciences are inherently latent, questions of construct validity and methodological soundness take on a particular urgency. The popularity of SEM path analysis methods in the social sciences in one sense reflects a more holistic, and less blatantly causal, interpretation of many real-world phenomena than is found in the natural sciences. Direction in the directed network models of SEM arises from presumed cause-effect assumptions made about reality. Social interactions and artifacts are often epiphenomena: secondary phenomena that are difficult to link directly to causal factors. An example of a physiological epiphenomenon is the time to complete a 100-m sprint. I may be able to improve my sprint time from 12 to 11 s, but I will have difficulty attributing that improvement to any direct causal factors, like diet, attitude, and weather. The 1-s improvement in sprint time is an epiphenomenon: the holistic product of the interaction of many individual factors. Such epiphenomena lie at the core of many sociological and psychological theories, and yet are impossible to measure directly. SEM provides one pathway to quantify concepts and theories that previously had existed only in the realm of ideological disputations.
To this day, how to assess suitable sample size requirements remains a worrisome question in SEM-based studies. The number of degrees of freedom in structural model estimation increases with the number of potential combinations of latent variables, while the information supplied in estimating increases with the number of measured parameters (i.e., indicators) times the number of observations (i.e., the sample size); both are nonlinear in model parameters. This implies that requisite sample size is not a linear function solely of indicator count, even though such heuristics are widely invoked in justifying SEM sample size. Monte Carlo simulation in this field has lent support to the nonlinearity of sample size requirements. Sample size formulas for SEM are provided in the latter part of this book, along with assessments of existing rules of thumb. Contrary to much of the conventional wisdom, sample size for a particular model is constant across methods: PLS-PA, LISREL, and systems of regression approaches all require similar sample sizes when similar models are tested at similar power and significance levels. None of these methods generates information that is not in the sample, though at the margin particular methods may use sample information more efficiently in specific situations.
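The counting argument in this paragraph can be made concrete with a small sketch. It rests on one simplifying reading of the text, which is an assumption and not a formula from this book: "potential combinations of latent variables" is taken to mean unordered pairs of latent variables, and the information supplied is taken to be the raw count of indicators times observations. The function names `potential_links` and `data_points` are hypothetical, introduced only for illustration.

```python
from math import comb


def potential_links(num_latent: int) -> int:
    """Count unordered pairs of latent variables (assumed reading of
    'potential combinations'); this grows quadratically in num_latent."""
    return comb(num_latent, 2)


def data_points(num_indicators: int, sample_size: int) -> int:
    """Count raw measured values: indicators times observations."""
    return num_indicators * sample_size


# Hold the indicator count and the sample size fixed, and vary only how
# the indicators are grouped into latent variables.
num_indicators, sample_size = 12, 200
for num_latent in (2, 3, 4, 6):
    print(
        f"{num_latent} latent variables: "
        f"{potential_links(num_latent)} potential links, "
        f"{data_points(num_indicators, sample_size)} data points"
    )
```

Under these assumptions, the same 12 indicators and 200 observations supply 2400 data points whether the model has 2 latent variables (1 potential link) or 6 (15 potential links). This is one way to see why a rule of thumb based on indicator count alone cannot track what the model is asked to estimate.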
1.1 The Problem Domains of Structural Equation Models
Many real-world phenomena involve networks of theoretical constructs of interest to both the natural and social sciences. Structural equation modeling has evolved to help specify real-world network models to fit observations to theory. Early approaches lacked the computational power to do much more than trace out pathways along the networks under study. These so-called path models were initially applied in the natural sciences to map networks of heritable genetic traits: constructs such as black hair, long ears, and so forth in laboratory animals, with relationship links defined by ancestry. This early research was directed towards developing useful models of inheritance from straightforward observation, without the benefit of preexisting theories. If a model did not at first fit the data, researchers had easy access to additional observations that theoretically could be replicated without end, simply by breeding another generation.
Quantification of the social sciences during the mid-twentieth century demanded statistical methods that could assess the abstract and often unobservable constructs inherent in these softer research areas. Many social science observations, e.g., a year of economic performance in the US economy, could never be repeated or replicated. Data were the product of quasi-experiments: non-replicable, with potential biases controlled via expanded scope rather than replication of the experiment. Early work, which empirically tested pairwise relationships between soft constructs using canonical correlations (Hotelling, ), focused on testing hypotheses from simple theories about the structural relationships between unobservable quantities in the social sciences. Applications started with economics, but found greater usefulness in measuring unobservable model constructs such as intelligence, trust, value, and so forth in psychology, sociology, and consumer sentiment. These approaches remain popular today, as many central questions in the social sciences involve networks of abstract ideas that are often hard to measure directly.
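To make the idea of a canonical correlation between two soft constructs concrete, the following is a minimal sketch in Python with NumPy. It is not the procedure of any particular SEM package. It computes canonical correlations with a standard QR-plus-SVD approach and assumes each indicator block has full column rank. The data are simulated purely for illustration: the two latent constructs, the loadings, the noise level, and the helper name `canonical_correlations` are all assumptions introduced here.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between two indicator blocks.

    X, Y: arrays of shape (n_obs, p) and (n_obs, q), each assumed to
    have full column rank. Returns min(p, q) values in descending order.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Center each indicator so that inner products act as covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two blocks.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx' Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Guard against tiny floating-point overshoot beyond [0, 1].
    return np.clip(s, 0.0, 1.0)


# Simulated example: two unobserved constructs, each measured by three
# noisy indicators (all numbers below are illustrative assumptions).
rng = np.random.default_rng(0)
n = 500
latent_a = rng.standard_normal(n)
latent_b = 0.6 * latent_a + 0.8 * rng.standard_normal(n)
loadings = np.array([0.9, 0.8, 0.7])
indicators_a = latent_a[:, None] * loadings + 0.5 * rng.standard_normal((n, 3))
indicators_b = latent_b[:, None] * loadings + 0.5 * rng.standard_normal((n, 3))

rho = canonical_correlations(indicators_a, indicators_b)
print("canonical correlations:", rho)
```

Only the observed indicator blocks enter the calculation. The latent constructs themselves are never measured, so the leading canonical correlation serves as an estimate of how strongly the two constructs are related, attenuated by measurement noise in the indicators.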