1. Introduction
In this book, a particular class of models is considered: multistate models. Multistate models are ideally suited to model life histories. At a given instant, an individual has a set of attributes, such as marital status, employment status, living arrangement, health status and place of residence. In multistate analysis, a person with a given set of attributes is said to occupy a given state, and persons with the same attributes occupy the same state. When an attribute changes, the person moves to a different state. Most personal attributes change in the life course, implying transitions between states. Marriage, marriage dissolution, birth of a child, job change, migration, onset of disability and death are events that imply a transition between states. The set of possible states is the state space. The state variable is the state an individual occupies at a given time or age. If individuals are combined in cohorts or populations, the state variable is the number of individuals in a state at a given time or age. The life course is operationalised as a sequence of states and transitions between states. Two types of states are distinguished: states that can be entered and left (transient states) and states that can be entered but not left (absorbing states). Age is not a personal attribute; it is a time scale. Different time scales may be used to measure time to transition, calendar time and age being the most common time measurements.
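The representation of a life course as a sequence of states and transitions can be sketched in a few lines. The following is a minimal illustration only: the state labels, the example trajectory and the helper names `transitions` and `state_at` are hypothetical and are not taken from any package discussed in this book.

```python
# Hypothetical state space for a marital-status example; "dead" is absorbing.
STATE_SPACE = ("single", "married", "divorced", "widowed", "dead")
ABSORBING = {"dead"}

# A life course as a chronologically ordered list of (age at entry, state).
# The ages are invented for illustration.
life_course = [(0, "single"), (27, "married"), (41, "divorced"),
               (45, "married"), (83, "dead")]


def transitions(trajectory):
    """Return (age, origin, destination) for each move between states."""
    return [(age, origin, destination)
            for (_, origin), (age, destination)
            in zip(trajectory, trajectory[1:])]


def state_at(trajectory, age):
    """Return the state occupied at the given age (the state variable)."""
    occupied = None
    for start, state in trajectory:
        if start <= age:
            occupied = state
    return occupied


moves = transitions(life_course)
```

Here `moves` lists four transitions, and `state_at(life_course, 43)` returns the state occupied at age 43. Age enters only as the time scale along which the states are ordered.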
The multistate model is approached from a survival analysis perspective. Survival analysis is a subfield of statistics that studies the occurrence and timing of events. An event is an outcome of a stochastic process. The occurrence of the event and the waiting time to the event are random variables with characteristic distributions. A stochastic process model implies a parametric model of the waiting time to the event. For instance, a model that assumes that the event occurs at a constant rate implies an exponential waiting time distribution. A model that assumes that the rate increases exponentially with duration leads to a Gompertz distribution of time-to-event. Instead of using a model, the empirical distribution of waiting times may be used directly to estimate event rates. In that case, no stochastic process model and associated waiting time distribution are assumed. The method is known as the non-parametric approach.
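The link between a rate model and the waiting time distribution it implies can be made concrete with a short sketch. It assumes a constant rate for the exponential case and a rate of the form a·exp(b·t) for the Gompertz case; the function names and the parameter names `a` and `b` are illustrative choices, not notation used elsewhere in the book.

```python
import math


def exponential_survival(rate, t):
    """Probability that the event has not yet occurred at duration t
    when the event rate is constant."""
    return math.exp(-rate * t)


def gompertz_hazard(a, b, t):
    """Event rate that changes exponentially with duration: a * exp(b * t)."""
    return a * math.exp(b * t)


def gompertz_survival(a, b, t):
    """Survival function implied by the Gompertz rate (requires b != 0):
    S(t) = exp(-(a / b) * (exp(b * t) - 1))."""
    return math.exp(-(a / b) * (math.exp(b * t) - 1.0))
```

In both cases the survival function is the exponential of minus the cumulated rate, which is why specifying the rate is enough to fix the waiting time distribution.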
It is often useful to distinguish event types. For instance, upon completion of college education and receipt of a bachelor's degree, a person may move on to graduate school, get a job, take time off for travel or get involved in another activity. These activities compete for the individual's time. They are competing destinations and competing risks. Another example: marital dissolution is an event caused by the death of the spouse or by divorce. Death of the spouse and divorce are competing causes of marriage dissolution. They compete to be the reason for marriage dissolution. In multistate analysis, competing risks are everywhere, and the modelling of competing risks is an important part of multistate modelling.
In multistate modelling, the life course is modelled as a continuous-time Markov process, which may be written as a system of differential equations. The parameters of the model are instantaneous transition rates, also referred to as hazard rates. They are estimated from data by tracking event occurrences and persons at risk of the event. To experience an event, a person has to be at risk. For example, only married persons are at risk of divorce. Partners who are not married may separate, and a separation may be perceived as a divorce, but it is not a divorce. The risk concept is central to the study of life histories. To determine the probability of an event at a given age, event occurrences at that age and persons at risk need to be recorded. Tracking events and persons is complicated when (a) people can enter, leave and re-enter the population at risk at any time during a period of observation, (b) people may leave for reasons unrelated to the study or (c) observations do not cover the entire sequence of entries and exits but only a segment of that sequence: the segment in the observation period or observation window. The third complication implies that the observation starts after some people have already experienced the event or ends before all people included in the observation have experienced the event. The statistical theory for estimating hazard rates and probabilities by counting events and tracking exposure times is the counting process theory (Andersen et al. 1993; Aalen et al.). It is the main theory applied in this book. A counting process tracks event occurrences, and an at-risk process keeps track of who is exposed. Occurrences are related to exposures (population at risk and exposure times). Transition counts, risk sets and exposure times provide the necessary information to derive transition rates. One approach is to update and cumulate the transition rate each time a transition is recorded. Life history measures are computed from cumulated hazards.
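The idea of updating and cumulating the rate at each recorded event can be sketched as follows. The estimator shown is commonly known as the Nelson-Aalen estimator of the cumulative hazard. The sketch assumes a single type of event and a simple record layout of (entry time, exit time, event indicator) per person; that layout, the function name `nelson_aalen` and the example data are hypothetical and are not the format of any package discussed in this book.

```python
def nelson_aalen(records):
    """Cumulative hazard from a list of (entry, exit, event) records.

    entry : time at which the person comes under observation (at risk)
    exit  : time at which the person experiences the event or is censored
    event : True if the observation ends with the event, False if censored
    Returns one tuple (time, at_risk, events, cumulative_hazard) per
    distinct event time.
    """
    event_times = sorted({exit_ for _, exit_, event in records if event})
    cumulative = 0.0
    steps = []
    for t in event_times:
        # Risk set: persons who entered before t and have not left before t.
        at_risk = sum(1 for entry, exit_, _ in records if entry < t <= exit_)
        # Counting process increment: number of events recorded at t.
        events = sum(1 for _, exit_, event in records
                     if event and exit_ == t)
        cumulative += events / at_risk
        steps.append((t, at_risk, events, cumulative))
    return steps


# Invented example: the last person enters observation late (at time 2).
example = [(0, 2, True), (0, 3, False), (1, 4, True),
           (0, 4, True), (2, 5, False)]
steps = nelson_aalen(example)
```

Late entry and censoring are handled through the risk set alone: a person contributes to the denominator only while under observation and at risk.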
In the book, the method is contrasted with an alternative method, which also counts events and tracks exposure times. Instead of estimating hazard rates each time an event occurs, the rates are estimated for time periods. During a period of 1 year, say, the event count and exposure time are determined, and the hazard rate is computed as the ratio of occurrences to exposure time. This approach to estimating occurrence-exposure rates is common in demography, epidemiology and other disciplines. Both methods are covered in this book. The first method is implemented in statistical packages for multistate modelling discussed in this book. The second method is implemented in Biograph.
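A minimal sketch of the occurrence-exposure calculation is given below. It assumes the same hypothetical (entry, exit, event) record layout, a single type of event and user-supplied interval boundaries; it illustrates the principle and is not the Biograph implementation.

```python
def occurrence_exposure_rates(records, breaks):
    """Occurrence-exposure rate per interval (lower, upper].

    records : list of (entry, exit, event) tuples, one per person
    breaks  : increasing interval boundaries, e.g. [0, 1, 2] for two
              one-year intervals
    Returns one tuple (lower, upper, events, exposure, rate) per interval.
    """
    rows = []
    for lower, upper in zip(breaks[:-1], breaks[1:]):
        # Exposure: time each person spends under observation in the interval.
        exposure = sum(max(0.0, min(exit_, upper) - max(entry, lower))
                       for entry, exit_, _ in records)
        # Occurrences: events recorded in the interval.
        events = sum(1 for _, exit_, event in records
                     if event and lower < exit_ <= upper)
        rate = events / exposure if exposure > 0 else float("nan")
        rows.append((lower, upper, events, exposure, rate))
    return rows


# Invented example with three persons and two one-year intervals.
example = [(0.0, 1.5, True), (0.0, 2.0, False), (0.5, 2.0, True)]
rows = occurrence_exposure_rates(example, [0, 1, 2])
```

In this invented example no event falls in the first interval, and two events against 2.5 person-years of exposure fall in the second. The rate is treated as constant within each interval, which is the piecewise-constant assumption behind occurrence-exposure rates.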
Biograph tracks transitions and the population at risk of a transition. The package relies on life history data, collected retrospectively in cross-sectional surveys or prospectively in follow-up studies. Life history data come in a variety of formats. Most empirical studies organise data by life domain, e.g. employment, partnership and marriage, family and fertility, health and migration. For the study of life histories, events need to be ordered chronologically by time of occurrence, and populations at risk at these times must be determined. Biograph uses a particular chronological format, known as the wide format (see later). Other authors use a different format. For that reason, a number of functions are included in Biograph that convert one data format into another. The Biograph format is the data structure of a Biograph object.
The graphics capabilities of R motivated the visualisation of life histories. The methods presented in the book should be considered as a first step towards visualisation of life history data. In the demographic tradition, individual lifelines are presented in an age-time diagram with age on the y-axis and calendar time on the x-axis. In several textbooks, the diagram is used to show how measurements and estimates vary by age, period and cohort. The diagram is known as the Lexis diagram. Biograph uses two packages to display life histories in the Lexis diagram: the Epi package, which includes functions to produce Lexis diagrams, and the ggplot2 package. Some functions in Biograph call functions of another package on CRAN with considerable graphics capabilities: TraMineR.
Biograph was designed to make life history data analysis accessible to a large group of students and researchers. The package includes a step-by-step method for tracking event occurrences and populations at risk and for calculating rates of transition between states. The rates are then used to predict the probability of a particular transition (transition probability), the probability of being in a given state at a given age (state probability) and the expected time spent in each of the states (state occupation times).
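The step from transition rates to transition probabilities and state occupation times can be sketched for the simplest case of rates that are constant over time. The sketch assumes a rate matrix Q whose off-diagonal element (i, j) is the rate of moving from state i to state j and whose rows sum to zero, so that the transition probabilities over a duration t are given by the matrix exponential of Q·t. This sign and orientation convention, the three illustrative states, the invented rates and all function names are assumptions made for the example; they are not the notation or code of Biograph.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transition_probabilities(q, t, terms=30):
    """P(t) = exp(Q t) by a truncated Taylor series.

    Adequate for small matrices and moderate values of rate * t."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def occupation_times(q, horizon, steps=1000):
    """Expected time spent in state j before the horizon, by state i
    occupied at time 0, using the trapezoid rule on P(s)."""
    n = len(q)
    h = horizon / steps
    p_step = transition_probabilities(q, h)
    p = identity(n)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        p_next = mat_mul(p, p_step)
        for i in range(n):
            for j in range(n):
                total[i][j] += 0.5 * h * (p[i][j] + p_next[i][j])
        p = p_next
    return total


# Invented rates for states 0 = healthy, 1 = disabled, 2 = dead (absorbing).
Q = [[-0.07, 0.05, 0.02],
     [0.03, -0.09, 0.06],
     [0.00, 0.00, 0.00]]
P10 = transition_probabilities(Q, 10.0)   # transition probabilities, 10 years
T10 = occupation_times(Q, 10.0)           # expected years spent in each state
```

Row i of `P10` gives the state probabilities after 10 years for a person who starts in state i, and row i of `T10` gives the expected number of years that person spends in each state during those 10 years. With rates that vary by age, the same calculation would be applied interval by interval and the resulting matrices multiplied.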