SDTM

SDTM (Study Data Tabulation Model) defines a standard structure for human clinical trial (study) data tabulations and for nonclinical study data tabulations that are to be submitted as part of a product application to a regulatory authority such as the United States Food and Drug Administration (FDA). The Submission Data Standards team of Clinical Data Interchange Standards Consortium (CDISC) defines SDTM.

On July 21, 2004, SDTM was selected as the standard specification for submitting tabulation data to the FDA for clinical trials, and on July 5, 2011, for nonclinical studies. Eventually, all data submissions will be expected to conform to this format. As a result, clinical and nonclinical data managers will need to become proficient in the SDTM to prepare submissions and to apply the SDTM structures, where appropriate, for operational data management.

Background

SDTM is built around the concept of observations collected about subjects who participated in a clinical study. Each observation can be described by a series of variables, corresponding to a row in a dataset or table. Each variable can be classified according to its Role. A Role determines the type of information conveyed by the variable about each distinct observation and how it can be used. Variables can be classified into four major roles:

- Identifier variables, which identify the study, the subject of the observation, the domain, and the sequence number of the record
- Topic variables, which specify the focus of the observation (such as the name of a lab test)
- Timing variables, which describe the timing of an observation (such as start date and end date)
- Qualifier variables, which include additional illustrative text or numeric values that describe the results or additional traits of the observation (such as units or descriptive adjectives)

A fifth type of variable role, Rule, can express an algorithm or executable method to define start, end, or looping conditions in the Trial Design model.

The set of Qualifier variables can be further categorized into five sub-classes:

- Grouping Qualifiers, which group together a collection of observations within the same domain
- Result Qualifiers, which describe the specific results associated with the Topic variable in a Findings dataset
- Synonym Qualifiers, which specify an alternative name for a particular variable in an observation
- Record Qualifiers, which define additional attributes of the observation record as a whole
- Variable Qualifiers, which further modify or describe a specific variable within an observation

For example, in the observation, 'Subject 101 had mild nausea starting on Study Day 6,' the Topic variable value is the term for the adverse event, 'NAUSEA'. The Identifier variable is the subject identifier, '101'. The Timing variable is the study day of the start of the event, which captures the information, 'starting on Study Day 6', while an example of a Record Qualifier is the severity, the value for which is 'MILD'.

Additional Timing and Qualifier variables could be included to provide the necessary detail to adequately describe an observation.
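The nausea example above can be sketched as a single record, with each variable annotated by its role. This is an illustrative sketch only; the study identifier value is hypothetical, while the variable names follow the adverse-event (AE) domain convention.

```python
# One SDTM observation as a single record; comments give each
# variable's role in the model.
observation = {
    "STUDYID": "ABC123",   # Identifier: study identifier (hypothetical value)
    "DOMAIN": "AE",        # Identifier: two-character domain code
    "USUBJID": "101",      # Identifier: unique subject identifier
    "AETERM": "NAUSEA",    # Topic: the reported adverse event term
    "AESEV": "MILD",       # Record Qualifier: severity of the event
    "AESTDY": 6,           # Timing: study day of the start of the event
}
```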

Datasets and domains

Observations are normally collected for all subjects in a series of domains. A domain is defined as a collection of logically related observations with a topic-specific commonality about the subjects in the trial. The logic of the relationship may relate to the scientific matter of the data, or to its role in the trial.

Typically, each domain is represented by a dataset, but it is possible to have information relevant to the same topicality spread among multiple datasets. Each dataset is distinguished by a unique, two-character DOMAIN code that should be used consistently throughout the submission. This DOMAIN code is used in the dataset name, the value of the DOMAIN variable within that dataset, and as a prefix for most variable names in the dataset.
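The naming convention above can be expressed as a small consistency check. This is a minimal sketch, not a validation tool: the set of unprefixed variables below is a representative sample of the shared identifier and timing variables, not an exhaustive list.

```python
# Shared variables that are never prefixed with the domain code
# (a representative subset, for illustration only).
UNPREFIXED = {"STUDYID", "DOMAIN", "USUBJID", "VISIT", "VISITNUM"}

def follows_domain_convention(dataset_name, records):
    """Check the SDTM DOMAIN-code convention: the two-character code
    appears in the dataset name, in the DOMAIN variable of every
    record, and as a prefix on most variable names."""
    code = dataset_name[:2].upper()
    for rec in records:
        if rec.get("DOMAIN") != code:
            return False
        if any(v not in UNPREFIXED and not v.startswith(code) for v in rec):
            return False
    return True
```

For example, an "ae" dataset whose study-specific variables all begin with "AE" passes the check, while one containing an unprefixed variable such as TERM does not.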

The dataset structure for observations is a flat file representing a table with one or more rows and columns. Normally, one dataset is submitted for each domain. Each row of the dataset represents a single observation and each column represents one of the variables. Each dataset or table is accompanied by metadata definitions that provide information about the variables used in the dataset. The metadata are described in a data definition document named 'Define' that is submitted along with the data to regulatory authorities.

The Submission Metadata Model uses seven distinct metadata attributes to be defined for each dataset variable in the metadata definition document:

- Variable Name
- Variable Label
- Type
- Controlled Terms or Format
- Origin
- Role
- Comments

Data stored in dataset variables include both raw values (as originally collected) and derived values (e.g., values converted into standard units, or computed on the basis of multiple values, such as an average). In the SDTM, only the name, label, and type are listed, together with a set of CDISC guidelines that provide a general description for each variable used by a general observation class.

Comments are included as necessary according to the needs of individual studies. The presence of an asterisk (*) in the 'Controlled Terms or Format' column indicates that a discrete set of values (controlled terminology) is expected to be made available for this variable. This set of values may be sponsor-defined in cases where standard vocabularies have not yet been defined (represented by a single *) or from an external published source such as MedDRA (represented by **).
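A Define document's variable-level metadata might be sketched as follows. The lowercase keys are informal stand-ins for the seven attributes described above, and the values are illustrative rather than taken from any real submission; the '*' convention for sponsor-defined controlled terminology follows the description in the preceding paragraph.

```python
# Sketch of variable-level metadata as it might appear in a Define
# data definition document (illustrative values only).
metadata = [
    {"name": "AETERM", "label": "Reported Term for the Adverse Event",
     "type": "Char", "controlled_terms": None, "origin": "CRF",
     "role": "Topic", "comment": None},
    {"name": "AESEV", "label": "Severity/Intensity",
     "type": "Char", "controlled_terms": "*",   # '*': sponsor-defined code list
     "origin": "CRF", "role": "Record Qualifier", "comment": None},
]

# Variables flagged '*' or '**' are expected to carry controlled terminology.
needs_terminology = [m["name"] for m in metadata
                     if m["controlled_terms"] in ("*", "**")]
```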

Special-purpose domains

The CDISC Version 3.x Submission Data Domain Models include special-purpose domains that have a fixed structure and cannot be extended with any additional Qualifier or Timing variables beyond those specified.

Additional fixed structure, non-extensible special-purpose domains are discussed in the Trial Design model.

The general domain classes

Most observations collected during the study (other than those represented in special-purpose domains) should be divided among three general observation classes: Interventions, Events, or Findings.

- The Interventions class captures investigational, therapeutic, and other treatments administered to the subject, whether specified by the study protocol (e.g., exposure to study drug), coincident with the study assessment period (e.g., concomitant medications), or self-administered by the subject (such as use of alcohol, tobacco, or caffeine).
- The Events class captures planned protocol milestones (such as randomization and study completion) and occurrences, conditions, or incidents independent of planned study evaluations that happen during the trial (such as adverse events) or before the trial (such as medical history).
- The Findings class captures the observations resulting from planned evaluations that address specific questions, such as laboratory tests, ECG testing, and responses to questionnaires.

In most cases, the identification of the general class appropriate to a specific collection of data by topicality is straightforward. Often the Findings general class is the best choice for general observational data collected as measurements or responses to questions. In cases when the topicality may not be as clear, the choice of class may be based more on the scientific intent of the protocol or analysis plan or the data structure.

All datasets based on any of the general observation classes share a set of common Identifier variables and Timing variables. Three general rules apply when determining which variables to include in a domain:

- The Identifier variables STUDYID, USUBJID, DOMAIN, and the domain-prefixed sequence number (--SEQ) are required in all domains based on the general observation classes; other Identifiers may be added as needed.
- Any Timing variables are permissible for use in any submission dataset based on a general observation class, except where restricted by specific domain assumptions.
- Any additional Qualifier variables from the same general observation class may be added to a domain model, except where restricted by specific domain assumptions.
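The shared Identifier variables can be sketched as a small helper. This is an illustrative sketch: the record sequence number takes the domain code as its prefix (e.g., AESEQ in the adverse-event domain), while the other identifiers are the same in every general-class dataset.

```python
def common_identifiers(domain_code):
    """Identifier variables shared by datasets based on the general
    observation classes; the sequence number is domain-prefixed."""
    return ["STUDYID", "DOMAIN", "USUBJID", domain_code + "SEQ"]
```

For example, `common_identifiers("AE")` yields the identifiers of an adverse-event dataset, ending in AESEQ.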

The CDISC standard domain models (SDTMIG 3.2)

Special-Purpose Domains:

- Demographics (DM)
- Comments (CO)
- Subject Elements (SE)
- Subject Visits (SV)

Interventions General Observation Class:

- Concomitant Medications (CM)
- Exposure (EX)
- Exposure as Collected (EC)
- Procedures (PR)
- Substance Use (SU)

Events General Observation Class:

- Adverse Events (AE)
- Clinical Events (CE)
- Disposition (DS)
- Protocol Deviations (DV)
- Healthcare Encounters (HO)
- Medical History (MH)

Findings General Observation Class (including, among others):

- Drug Accountability (DA)
- Death Details (DD)
- ECG Test Results (EG)
- Inclusion/Exclusion Criteria Not Met (IE)
- Laboratory Test Results (LB)
- Microscopic Findings (MI)
- Physical Examination (PE)
- Questionnaires (QS)
- Disease Response (RS)
- Subject Characteristics (SC)
- Tumor Identification (TU)
- Tumor Results (TR)
- Vital Signs (VS)

Findings About:

- Findings About Events or Interventions (FA)

Trial Design Domains:

- Trial Arms (TA)
- Trial Elements (TE)
- Trial Visits (TV)
- Trial Inclusion/Exclusion Criteria (TI)
- Trial Summary (TS)
- Trial Disease Assessments (TD)

Special-Purpose Relationship Datasets:

- Supplemental Qualifiers (SUPP--, also known as SUPPQUAL)
- Related Records (RELREC)

Limitations and criticism of standards

One criticism of the SDTM standards is that they change continually, with new versions released frequently. CDISC states that SDTM versions are backward compatible, but this claim has been disputed: new domains, such as the Exposure as Collected (EC) domain, continue to be added, and backward compatibility with earlier domains is not always maintained. [1] In practice, this means that data often cannot be mapped from an EDC database to the SDTM until the clinical trial completes. Critics have also noted that the controlled terminology covers only a small subset of the National Cancer Institute's terminology. [2]

References

  1. PHUSE (2011). "SDTM Implementation Guide – Clear as Mud" (PDF). Lex Jansen. Retrieved 17 December 2015.
  2. "CDISC Terminology". National Cancer Institute, NIH. cancer.gov. Retrieved 17 December 2015.