In social sciences, sequence analysis (SA) is concerned with the analysis of sets of categorical sequences that typically describe longitudinal data. Analyzed sequences are encoded representations of, for example, individual life trajectories such as family formation, school-to-work transitions, or working careers, but they may also describe daily or weekly time use or represent the evolution of observed or self-reported health, of political behaviors, or the development stages of organizations. Such sequences are chronologically ordered, unlike words or DNA sequences, for example.
SA is a longitudinal analysis approach that is holistic in the sense that it considers each sequence as a whole. SA is essentially exploratory. Broadly, SA provides a comprehensible overall picture of sets of sequences with the objective of characterizing the structure of the set of sequences, finding the salient characteristics of groups, identifying typical paths, comparing groups, and more generally studying how the sequences are related to covariates such as sex, birth cohort, or social origin.
Introduced in the social sciences in the 1980s by Andrew Abbott, [1] [2] SA has gained much popularity after the release of dedicated software such as the SQ [3] and SADI [4] add-ons for Stata and the TraMineR R package [5] with its companions TraMineRextras [6] and WeightedCluster. [7]
Despite some connections, the aims and methods of SA in social sciences strongly differ from those of sequence analysis in bioinformatics.
Sequence analysis methods were first imported into the social sciences from the information and biological sciences (see Sequence alignment) by the University of Chicago sociologist Andrew Abbott in the 1980s, and they have since developed in ways that are unique to the social sciences. [8] Scholars in psychology, economics, anthropology, demography, communication, political science, learning sciences, organizational studies, and especially sociology have been using sequence methods ever since.
In sociology, sequence techniques are most commonly employed in studies of patterns of life-course development, cycles, and life histories. [9] [10] [11] [12] There has been a great deal of work on the sequential development of careers, [13] [14] [15] and there is increasing interest in how career trajectories intertwine with life-course sequences. [16] [17] Many scholars have used sequence techniques to model how work and family activities are linked in household divisions of labor and the problem of schedule synchronization within families. [18] [19] [20] The study of interaction patterns is increasingly centered on sequential concepts, such as turn-taking, the predominance of reciprocal utterances, and the strategic solicitation of preferred types of responses (see Conversation Analysis). Social network analysts (see Social network analysis) have begun to turn to sequence methods and concepts to understand how social contacts and activities are enacted in real time, [21] [22] and to model and depict how whole networks evolve. [23] Social network epidemiologists have begun to examine social contact sequencing to better understand the spread of disease. [24] Psychologists have used those methods to study how the order of information affects learning, and to identify structure in interactions between individuals (see Sequence learning).
Many of the methodological developments in sequence analysis came on the heels of a special section devoted to the topic in a 2000 issue [10] of Sociological Methods & Research, which hosted a debate over the use of the optimal matching (OM) edit distance for comparing sequences. In particular, sociologists objected to the descriptive and data-reducing orientation of optimal matching, as well as to a lack of fit between bioinformatic sequence methods and uniquely social phenomena. [25] [26] The debate has given rise to several methodological innovations (see Pairwise dissimilarities below) that address limitations of early sequence comparison methods developed in the 20th century. In 2006, David Stark and Balazs Vedres [23] proposed the term "social sequence analysis" to distinguish the approach from bioinformatic sequence analysis. However, with the exception of the book by Benjamin Cornwell, [27] the term has seldom been used, probably because the context prevents any confusion in the SA literature. Sociological Methods & Research organized a special issue on sequence analysis in 2010, leading to what Aisenbrey and Fasang [28] referred to as the "second wave of sequence analysis", which mainly extended optimal matching and introduced other techniques to compare sequences. Alongside sequence comparison, recent advances in SA have concerned, among others, the visualization of sets of sequence data, [5] [29] the measurement and analysis of the discrepancy of sequences, [30] the identification of representative sequences, [31] and the development of summary indicators of individual sequences. [32] Raab and Struffolino [33] have conceived more recent advances as the third wave of sequence analysis.
This wave is largely characterized by the effort of bringing together the stochastic and the algorithmic modeling culture [34] by jointly applying SA with more established methods such as analysis of variance, event history analysis, Markovian modeling, social network analysis, or causal analysis and statistical modeling in general. [35] [36] [37] [27] [30] [38] [39]
The analysis of sequence patterns has foundations in sociological theories that emerged in the middle of the 20th century. [27] Structural theorists argued that society is a system that is characterized by regular patterns. Even seemingly trivial social phenomena are ordered in highly predictable ways. [40] This idea serves as an implicit motivation behind social sequence analysts' use of optimal matching, clustering, and related methods to identify common "classes" of sequences at all levels of social organization, a form of pattern search. This focus on regularized patterns of social action has become an increasingly influential framework for understanding microsocial interaction and contact sequences, or "microsequences." [41] This is closely related to Anthony Giddens's theory of structuration, which holds that social actors' behaviors are predominantly structured by routines, which in turn provide predictability and a sense of stability in an otherwise chaotic and rapidly moving social world. [42] This idea is also echoed in Pierre Bourdieu's concept of habitus, which emphasizes the emergence and influence of stable worldviews that guide everyday action and thus produce predictable, orderly sequences of behavior. [43] The role of routine as a structuring influence on social phenomena was first illustrated empirically by Pitirim Sorokin, who led a 1939 study that found that daily life is so routinized that a given person is able to predict with about 75% accuracy how much time they will spend doing certain things the following day. [44]
Talcott Parsons's argument [40] that all social actors are mutually oriented to their larger social systems (for example, their family and larger community) through social roles also underlies social sequence analysts' interest in the linkages that exist between different social actors' schedules and ordered experiences, which has given rise to a considerable body of work on synchronization between social actors and their social contacts and larger communities. [19] [18] [45] All of these theoretical orientations together warrant critiques of the general linear model of social reality, which as applied in most work implies that society is either static or that it is highly stochastic in a manner that conforms to Markov processes. [1] [46] This concern inspired the initial framing of social sequence analysis as an antidote to general linear models. It has also motivated recent attempts to model sequences of activities or events as elements that link social actors in non-linear network structures. [47] [48] This work, in turn, is rooted in Georg Simmel's theory that experiencing similar activities, experiences, and statuses serves as a link between social actors. [49] [50]
In demography and historical demography, from the 1980s the rapid appropriation of the life course perspective and methods was part of a substantive paradigmatic change that implied a stronger embedment of demographic processes into social sciences dynamics. After a first phase with a focus on the occurrence and timing of demographic events studied separately from each other with a hypothetico-deductive approach, from the early 2000s [34] [51] the need to consider the structure of life courses and to do justice to their complexity led to a growing use of sequence analysis with the aim of pursuing a holistic approach. At an inter-individual level, pairwise dissimilarities and clustering appeared as the appropriate tools for revealing the heterogeneity in human development. For example, the meta-narrations contrasting individualized Western societies with collectivist societies in the South (especially in Asia) were challenged by comparative studies revealing the diversity of pathways to legitimate reproduction. [52] At an intra-individual level, sequence analysis integrates the basic life course principle that individuals interpret and make decisions about their lives according to their past experiences and their perception of contingencies. [34] Interest in this perspective was also promoted by the changes in individuals' life courses for cohorts born between the beginning and the end of the 20th century. These changes have been described as de-standardization, de-synchronization, and de-institutionalization. [53] Among the drivers of these dynamics, the transition to adulthood is key: [54] for more recent birth cohorts this crucial phase along individual life courses implied a larger number of events and lengths of the state spells experienced.
For example, many postponed leaving the parental home and the transition to parenthood, in some contexts cohabitation replaced marriage as a long-lasting living arrangement, and the birth of the first child occurs more frequently while parents cohabit rather than within wedlock. [55] Such complexity needed to be measured in order to compare quantitative indicators across birth cohorts [11] [56] (see [57] for an extension of this questioning to populations from low- and middle-income countries). Demography's long-standing ambition to develop a 'family demography' has found in sequence analysis a powerful tool to address research questions at the crossroads with other disciplines: for example, multichannel techniques [58] represent precious opportunities to deal with the issue of compatibility between working and family lives. [59] [37] Similarly, more recent combinations of sequence analysis and event history analysis have been developed (see [36] for a review) and can be applied, for instance, to understanding the link between demographic transitions and health.
The analysis of temporal processes in the domain of political sciences [60] concerns how institutions, that is, systems and organizations (regimes, governments, parties, courts, etc.), crystallize political interactions, formalize legal constraints, and impose a degree of stability or inertia. Special importance is given to, first, the role of contexts, which confer meaning to trends and events, while shared contexts offer shared meanings; second, to changes over time in power relationships, and, subsequently, asymmetries, hierarchies, contention, or conflict; and, finally, to historical events that are able to shape trajectories, such as elections, accidents, inaugural speeches, treaties, revolutions, or ceasefires. Empirically, political sequences' unit of analysis can be individuals, organizations, movements, or institutional processes. Depending on the unit of analysis, the sample sizes may be limited to a few cases (e.g., regions in a country when considering the turnover of local political parties over time) or include a few hundred (e.g., individuals' voting patterns). Three broad kinds of political sequences may be distinguished. The first and most common is careers, that is, formal, mostly hierarchical positions along which individuals progress in institutional environments, such as parliaments, cabinets, administrations, parties, unions or business organizations. [61] [62] [63] We may call trajectories the political sequences that develop in more informal and fluid contexts, such as activists evolving across various causes and social movements, [64] [65] or voters navigating a political and ideological landscape across successive polls. [66]
Finally, processes relate to non-individual entities, such as: public policies developing through successive policy stages across distinct arenas; [67] sequences of symbolic or concrete interactions between national and international actors in diplomatic and military contexts; [68] [69] and development of organizations or institutions, such as pathways of countries towards democracy (Wilson 2014). [70]
A sequence s is an ordered list of elements (s1, s2, ..., sl) taken from a finite alphabet A. For a set S of sequences, three sizes matter: the number n of sequences, the size a = |A| of the alphabet, and the length l of the sequences (which may differ from one sequence to another). In social sciences, n is generally between a few hundred and a few thousand, the alphabet size remains limited (most often fewer than 20 states), and sequence length rarely exceeds 100.
We may distinguish between state sequences and event sequences, [71] where states last while events occur at one time point and do not last but contribute, possibly together with other events, to state changes. For instance, the joint occurrence of the two events leaving home and starting a union provokes a state change from 'living at home with parents' to 'living with a partner'.
When a state sequence is represented as the list of states observed at the successive time points, the position of each element in the sequence conveys this time information and the distance between positions reflects duration. An alternative, more compact representation of a sequence is the list of the successive spells stamped with their duration, where a spell (also called episode) is a substring of consecutive positions in the same state. For example, in aabbbc, bbb is a spell of length 3 in state b, and the whole sequence can be represented as (a,2)-(b,3)-(c,1). [71]
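As an illustration of the spell representation, the following minimal Python sketch converts a state sequence into its list of (state, duration) spells and back. The function names `to_spells` and `from_spells` are hypothetical, and the sketch assumes that each state is coded as a single character.

```python
from itertools import groupby


def to_spells(sequence):
    """Collapse a state sequence into a list of (state, duration) spells."""
    # groupby yields one group per run of consecutive identical states.
    return [(state, sum(1 for _ in run)) for state, run in groupby(sequence)]


def from_spells(spells):
    """Expand (state, duration) spells back into a position-wise sequence.

    Assumes each state is a single-character string (illustrative coding).
    """
    return "".join(state * duration for state, duration in spells)


spells = to_spells("aabbbc")
print(spells)               # [('a', 2), ('b', 3), ('c', 1)]
print(from_spells(spells))  # aabbbc
```

The two representations carry the same information as long as the time unit of the positions is known.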
A crucial point when looking at state sequences is the timing scheme used to time align the sequences. This could be the historical calendar time, or a process time such as age, i.e. time since birth.
In event sequences, positions do not convey any time information. Therefore event occurrence time must be explicitly provided (as a timestamp) when it matters.
SA is essentially concerned with state sequences.
Conventional SA consists essentially in building a typology of the observed trajectories. Abbott and Tsay (2000) [10] describe this typical SA as a three-step program: 1. Coding individual narratives as sequences of states; 2. Measuring pairwise dissimilarities between sequences; and 3. Clustering the sequences from the pairwise dissimilarities. However, SA is much more (see e.g. [35] [8] ) and also encompasses, among others, the description and visual rendering of sets of sequences, ANOVA-like analysis and regression trees for sequences, the identification of representative sequences, the study of the relationship between linked sequences (e.g. dyadic, linked-lives, or various life dimensions such as occupation, family, health), and sequence-network approaches.
Given an alignment rule, a set of sequences can be represented in tabular form with sequences in rows and columns corresponding to the positions in the sequences.
To describe such data, we may look at the columns and consider the cross-sectional state distributions at the successive positions.
The chronogram or density plot of a set of sequences renders these successive cross-sectional distributions.
For each (column) distribution we can compute characteristics such as entropy or modal state and look at how these values evolve over the positions (see [5] pp 18–21).
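The column-wise description can be sketched in a few lines of Python. The four toy sequences below are invented for illustration, the helper names are hypothetical, and the entropy uses the natural logarithm without normalization, which is only one of several possible conventions.

```python
from collections import Counter
from math import log

# Toy data (assumption): four aligned sequences of equal length.
sequences = ["aabbbc", "abbbcc", "aaabbb", "bbbccc"]


def state_distribution(column):
    """Relative frequency of each state observed at one position."""
    counts = Counter(column)
    n = len(column)
    return {state: count / n for state, count in counts.items()}


def entropy(distribution):
    """Shannon entropy (natural log) of a cross-sectional distribution."""
    return -sum(p * log(p) for p in distribution.values() if p > 0)


# zip(*sequences) turns the rows (sequences) into columns (positions).
columns = list(zip(*sequences))
for position, column in enumerate(columns, start=1):
    dist = state_distribution(column)
    modal_state = max(dist, key=dist.get)  # ties go to the first state seen
    print(position, dist, modal_state, round(entropy(dist), 3))
```

An entropy of 0 at a position means that all sequences are in the same state there; larger values indicate a more diverse cross-sectional distribution.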
Alternatively, we can look at the rows. The index plot, [73] where each sequence is represented as a horizontal stacked bar or line, is the basic plot for rendering individual sequences.
We can compute characteristics of the individual sequences and examine the cross-sectional distribution of these characteristics.
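A minimal Python sketch of this row-wise view follows. It computes two simple characteristics per sequence, the number of transitions and the time spent in each state, and then tabulates the first across the set. The toy sequences and function names are assumptions chosen for illustration and are not the indicators of any particular package.

```python
from collections import Counter

# Toy data (assumption): four state sequences, one character per position.
sequences = ["aabbbc", "abbbcc", "aaabbb", "bbbccc"]


def n_transitions(sequence):
    """Number of state changes between consecutive positions."""
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)


def time_in_states(sequence):
    """Number of positions spent in each visited state."""
    return dict(Counter(sequence))


print(n_transitions("aabbbc"))   # 2
print(time_in_states("aabbbc"))  # {'a': 2, 'b': 3, 'c': 1}

# Cross-sectional distribution of the indicator over the set of sequences.
transition_counts = Counter(n_transitions(s) for s in sequences)
print(transition_counts)
```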
Main indicators of individual sequences [32]
State sequences can nicely be rendered graphically and such plots prove useful for interpretation purposes. As shown above, the two basic plots are the index plot that renders individual sequences and the chronogram that renders the evolution of the cross-sectional state distribution along the timeframe. Chronograms (also known as status proportion plots or state distribution plots) completely overlook the diversity of the sequences, while index plots are often too scattered to be readable. Relative frequency plots and plots of representative sequences attempt to increase the readability of index plots without falling into the oversimplification of a chronogram. In addition, there are many plots that focus on specific characteristics of the sequences. Below is a list of plots that have been proposed in the literature for rendering large sets of sequences. For each plot, we give examples of software (details in section Software) that produce it.
Pairwise dissimilarities between sequences serve to compare sequences and many advanced SA methods are based on these dissimilarities. The most popular dissimilarity measure is optimal matching (OM), i.e. the minimal cost of transforming one sequence into the other by means of indel (insert or delete) and substitution operations, where the costs of these elementary operations may depend on the states involved. SA is so intimately linked with OM that it is sometimes named optimal matching analysis (OMA).
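The definition of OM as a minimal transformation cost can be illustrated with a standard dynamic-programming sketch in Python. The function name `om_distance`, the default indel cost of 1, and the default constant substitution cost of 2 are assumptions made for this example; in practice the costs are chosen or estimated by the analyst.

```python
def om_distance(x, y, indel=1.0, sub_cost=None):
    """Minimal total cost of turning sequence x into sequence y.

    indel    -- cost of one insertion or deletion (assumed constant here)
    sub_cost -- function (a, b) -> cost of substituting state a by state b;
                defaults to 0 for identical states and 2 otherwise
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 2.0

    n, m = len(x), len(y)
    # d[i][j] = minimal cost of transforming x[:i] into y[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel  # delete the first i states of x
    for j in range(1, m + 1):
        d[0][j] = j * indel  # insert the first j states of y

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                               # deletion
                d[i][j - 1] + indel,                               # insertion
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),    # substitution
            )
    return d[n][m]


print(om_distance("aabbbc", "abbbcc"))  # 2.0: delete one a, insert one c
```

Passing a state-dependent `sub_cost` function makes the cost of a substitution depend on the states involved, as described above.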
There are roughly three categories of dissimilarity measures: [86]
Pairwise dissimilarities between sequences give access to a series of techniques to discover holistic structuring characteristics of the sequence data. In particular, dissimilarities between sequences can serve as input to cluster algorithms and multidimensional scaling, but they also make it possible to identify medoids or other representative sequences, define neighborhoods, measure the discrepancy of a set of sequences, carry out ANOVA-like analyses, and grow regression trees.
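Two of these uses can be sketched directly from a dissimilarity matrix in Python: finding the medoid, understood here as the sequence with the smallest total dissimilarity to all others, and listing the neighborhood of a sequence within a given radius. The 3 x 3 matrix and the function names are invented for illustration.

```python
def medoid_index(dist):
    """Index of the row with the smallest sum of dissimilarities to all rows."""
    totals = [sum(row) for row in dist]
    return min(range(len(dist)), key=totals.__getitem__)


def neighborhood(dist, index, radius):
    """Indices of the sequences within `radius` of sequence `index` (itself included)."""
    return [j for j, d in enumerate(dist[index]) if d <= radius]


# Toy symmetric dissimilarity matrix between three sequences (assumption).
dist = [
    [0.0, 2.0, 4.0],
    [2.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

print(medoid_index(dist))          # 1
print(neighborhood(dist, 0, 2.0))  # [0, 1]
```

Because only the matrix is needed, the same code applies whatever dissimilarity measure produced it.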
Although dissimilarity-based methods play a central role in social SA, essentially because of their ability to preserve the holistic perspective, several other approaches also prove useful for analyzing sequence data.
Some recent advances can be conceived as the third wave of SA. [33] This wave is largely characterized by the effort of bringing together the stochastic and the algorithmic modeling culture by jointly applying SA with more established methods such as analysis of variance, event history, network analysis, or causal analysis and statistical modeling in general. Some examples are given below; see also "Other methods of analysis".
Although SA witnesses a steady inflow of methodological contributions that address the issues raised two decades ago, [28] some pressing open issues remain. [36] Among the most challenging, we can mention:
Up-to-date information on advances, methodological discussions, and recent relevant publications can be found on the Sequence Analysis Association webpage.
These techniques have proved valuable in a variety of contexts. In life-course research, for example, research has shown that retirement plans are affected not just by the last year or two of one's life, but also by how one's work and family careers unfolded over a period of several decades. People who followed an "orderly" career path (characterized by consistent employment and gradual ladder-climbing within a single organization) retired earlier than others, including people who had intermittent careers, those who entered the labor force late, as well as those who enjoyed regular employment but who made numerous lateral moves across organizations throughout their careers. [12] In the field of economic sociology, research has shown that firm performance depends not just on a firm's current or recent social network connectedness, but also on the durability or stability of its connections to other firms. Firms that have more "durably cohesive" ownership network structures attract more foreign investment than less stable or poorly connected structures. [23] Research has also used data on everyday work activity sequences to identify classes of work schedules, finding that the timing of work during the day significantly affects workers' abilities to maintain connections with the broader community, such as through community events. [19] More recently, social sequence analysis has been proposed as a meaningful approach to study trajectories in the domain of creative enterprise, allowing the comparison among the idiosyncrasies of unique creative careers. [129] While other methods for constructing and analyzing whole sequence structure have been developed during the past three decades, including event structure analysis, [117] [118] OM and other sequence comparison methods form the backbone of research on whole sequence structures.
Some examples of application include:
Sociology
Demography and historical demography
Political sciences
Education and learning sciences
Psychology
Medical research
Survey methodology
Geography
Two main statistical computing environments offer tools to conduct a sequence analysis in the form of user-written packages: Stata and R.
The first international conference dedicated to social-scientific research that uses sequence analysis methods, the Lausanne Conference on Sequence Analysis (LaCOSA), was held in Lausanne, Switzerland in June 2012. [157] A second conference (LaCOSA II) was held in Lausanne in June 2016. [158] [159] The Sequence Analysis Association (SAA) was founded at the International Symposium on Sequence Analysis and Related Methods, in October 2018 at Monte Verità, TI, Switzerland. The SAA is an international organization whose goal is to organize symposia, training courses, and related events, and to facilitate scholars' access to sequence analysis resources.