Synthetic data

Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. [1]

Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated.

Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public; [2] synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.

Usefulness

Synthetic data is generated to meet specific needs or conditions that may not be found in the original, real data. This is useful when designing many kinds of systems, from simulations based on theoretical values to database processors, because it helps detect and solve unexpected issues such as information processing limitations. Synthetic data is often generated to represent the authentic data and allows a baseline to be set. [3] Another benefit of synthetic data is that it protects the privacy and confidentiality of authentic data while still allowing for use in testing systems.

A science article's abstract, quoted below, describes software that generates synthetic data for testing fraud detection systems. "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment." [3] In defense and military contexts, synthetic data is seen as a potentially valuable tool to develop and improve complex AI systems, particularly in contexts where high-quality real-world data is scarce. [4]

History

Scientific modelling of physical systems, which makes it possible to run simulations that estimate, compute, or generate data points that have not been observed in reality, has a long history that runs concurrent with the history of physics itself. For example, research into the synthesis of audio and voice can be traced back to the 1930s and before, driven forward by developments such as the telephone and audio recording. Digitization gave rise to software synthesizers from the 1970s onwards.

In the context of privacy-preserving statistical analysis, Rubin introduced the idea of fully synthetic data in 1993. [5] Rubin originally designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records, thereby preserving the anonymity of the households. [6] Later that year, Little introduced the idea of partially synthetic data, using it to synthesize the sensitive values on the public use file. [7]

In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayesian bootstrap) to do the sampling. [6] Later important contributors to the development of synthetic data generation include Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data, and they also developed the technique of Sequential Regression Multivariate Imputation. [6]

Calculations

Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms". [8]

Synthetic data can be generated through the use of random lines with different orientations and starting positions. [9] More complicated datasets can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that best fits the data. This model or equation is called a synthesizer build, and it can be used to generate more data. [10]
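As a rough illustration of the random-line approach mentioned above, the following Python sketch samples points along lines whose starting position and orientation are drawn at random. The function names, the 100-unit extent, and the point counts are assumptions made for this example; they are not taken from the cited work.

```python
import math
import random


def random_line_points(rng, n_points=20, extent=100.0):
    """Sample evenly spaced points along one randomly placed line."""
    x0 = rng.uniform(0.0, extent)       # random starting position (assumed range)
    y0 = rng.uniform(0.0, extent)
    theta = rng.uniform(0.0, math.pi)   # random orientation in radians
    step = extent / n_points            # spacing between consecutive points
    return [
        (x0 + i * step * math.cos(theta), y0 + i * step * math.sin(theta))
        for i in range(n_points)
    ]


def make_line_dataset(n_lines=5, seed=0):
    """Build a dataset of several random lines; the seed makes it repeatable."""
    rng = random.Random(seed)
    return [random_line_points(rng) for _ in range(n_lines)]


dataset = make_line_dataset()
print(len(dataset), "lines,", len(dataset[0]), "points each")
```

Because every point is produced by a known line, the true parameters are available as ground truth when evaluating an algorithm on this data.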

Constructing a synthesizer build involves constructing a statistical model. In a linear regression example, the original data can be plotted and a best-fit line created from the data. This line is a synthesizer created from the original data. The next step is to generate more synthetic data from the synthesizer build, that is, from the equation of this line. The new data can then be used for studies and research while protecting the confidentiality of the original data. [10]
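The linear regression example can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the "original" data values are hypothetical, the function names are invented for the example, and adding Gaussian noise scaled to the residual spread is one plausible way to draw new points from the fitted line, not a method prescribed by the source.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_std(xs, ys, slope, intercept):
    """Spread of the original points around the fitted line."""
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return (sse / (len(xs) - 2)) ** 0.5


def synthesize(slope, intercept, sigma, n, x_min, x_max, seed=0):
    """Draw n new (x, y) points from the fitted line plus Gaussian noise."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(x_min, x_max)
        points.append((x, slope * x + intercept + rng.gauss(0.0, sigma)))
    return points


# Hypothetical original data, used only to build the synthesizer.
original_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
original_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

slope, intercept = fit_line(original_x, original_y)
sigma = residual_std(original_x, original_y, slope, intercept)
synthetic = synthesize(slope, intercept, sigma, n=100, x_min=1.0, x_max=6.0)
```

Only the fitted parameters (slope, intercept, and noise level) are needed to produce the synthetic points, so the original records do not have to be shared alongside them.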

David Jensen from the Knowledge Discovery Laboratory explains how to generate synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model." [10] To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, Proximity can generate synthetic data having one of several types of graph structure: random graphs that are generated by some random process, lattice graphs having a ring structure, lattice graphs having a grid structure, and so on. [10] In all cases, data generation follows the same steps:

  1. Generate the empty graph structure.
  2. Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively. [10]
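The two steps above can be sketched in Python for the ring-lattice case. This is an illustrative approximation, not Proximity's actual algorithm or API: the function names, the `autocorrelation` and `sweeps` parameters, and the neighbour-copying rule used to make values depend on related objects are all assumptions introduced for the example.

```python
import random


def ring_lattice(n, k=1):
    """Step 1: build the empty graph structure, a ring in which each node
    links to its k nearest neighbours on each side (assumes n > 2 * k)."""
    return {
        i: sorted({(i + d) % n for d in range(-k, k + 1) if d != 0})
        for i in range(n)
    }


def generate_attributes(graph, prior, autocorrelation=0.8, sweeps=5, seed=0):
    """Step 2: assign attribute values from user-supplied prior probabilities,
    revisiting nodes so that values can depend on neighbouring values."""
    rng = random.Random(seed)
    values = list(prior)
    weights = [prior[v] for v in values]
    # Initial assignment drawn from the prior alone.
    labels = {node: rng.choices(values, weights)[0] for node in graph}
    # Collective passes: with probability `autocorrelation` a node copies a
    # random neighbour's value, otherwise it is redrawn from the prior.
    for _ in range(sweeps):
        for node, neighbours in graph.items():
            if rng.random() < autocorrelation:
                labels[node] = labels[rng.choice(neighbours)]
            else:
                labels[node] = rng.choices(values, weights)[0]
    return labels


graph = ring_lattice(10)
labels = generate_attributes(graph, {"red": 0.3, "blue": 0.7})
```

Raising the assumed `autocorrelation` parameter makes linked nodes more likely to share a value, which is one simple way to produce a dataset with a chosen degree of auto-correlation.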

Applications

Fraud detection and confidentiality systems

Fraud detection and confidentiality systems are tested and trained using synthetic data. Specific algorithms and generators are designed to create realistic data, [11] which then helps teach a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data were not used, the software would only be trained to react to the situations provided by the authentic data, and it might not recognize another type of intrusion. [3]

Scientific research

Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing.

Real data can contain information that researchers may not want released, [12] so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual.

Machine learning

Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault. [13] In general, synthetic data has several natural advantages:

This usage of synthetic data has been proposed for computer vision applications, in particular object detection, where the synthetic environment is a 3D model of the object, [14] and learning to navigate environments by visual information.

At the same time, transfer learning remains a nontrivial problem, and synthetic data has not yet become ubiquitous. Research results indicate that adding a small amount of real data significantly improves transfer learning with synthetic data. Advances in generative adversarial networks (GANs) lead to the natural idea that one can produce data and then use it for training. Since at least 2016, such adversarial training has been successfully used to produce synthetic data of sufficient quality to produce state-of-the-art results in some domains, without even needing to re-mix real data in with the generated synthetic data. [15]

Examples

In 1987, a Navlab autonomous vehicle used 1200 synthetic road images as one approach to training. [16]

In 2021, Microsoft released a database of 100,000 synthetic faces, based on 500 real faces, that claims to "match real data in accuracy". [16] [17]

References

  1. "What is synthetic data? - Definition from WhatIs.com". SearchCIO. Retrieved 2022-09-08.
  2. Nikolenko, Sergey I. (2021). Synthetic Data for Deep Learning. Springer Optimization and Its Applications. Vol. 174. doi:10.1007/978-3-030-75178-4. ISBN 978-3-030-75177-7. S2CID 202750227.
  3. Barse, E.L.; Kvarnström, H.; Jonsson, E. (2003). Synthesizing test data for fraud detection systems. Proceedings of the 19th Annual Computer Security Applications Conference. IEEE. doi:10.1109/CSAC.2003.1254343.
  4. Deng, Harry (30 November 2023). "Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer". United Nations Institute for Disarmament Research.
  5. "Discussion: Statistical Disclosure Limitation". Journal of Official Statistics. 9: 461–468. 1993.
  6. Abowd, John M. "Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods. [Powerpoint slides]". Retrieved 17 February 2011.
  7. "Statistical Analysis of Masked Data". Journal of Official Statistics. 9: 407–426. 1993.
  8. Jackson, Charles; Murphy, Robert F.; Kovačević, Jelena (September 2009). "Intelligent Acquisition and Learning of Fluorescence Microscope Data Models" (PDF). IEEE Transactions on Image Processing. 18 (9): 2071–84. Bibcode:2009ITIP...18.2071J. doi:10.1109/TIP.2009.2024580. PMID 19502128. S2CID 3718670.
  9. Wang, Aiqi; Qiu, Tianshuang; Shao, Longtan (July 2009). "A Simple Method of Radial Distortion Correction with Centre of Distortion Estimation". Journal of Mathematical Imaging and Vision. 35 (3): 165–172. doi:10.1007/s10851-009-0162-1. S2CID 207175690.
  10. David Jensen (2004). "6. Using Scripts". Proximity 4.3 Tutorial.
  11. Deng, Robert H.; Bao, Feng; Zhou, Jianying (December 2002). Information and Communications Security. Proceedings of the 4th International Conference, ICICS 2002 Singapore. ISBN 9783540361596.
  12. Abowd, John M.; Lane, Julia (June 9–11, 2004). New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers. Privacy in Statistical Databases: CASC Project Final Conference, Proceedings. Barcelona, Spain. doi:10.1007/978-3-540-25955-8_22.
  13. Patki, Neha; Wedge, Roy; Veeramachaneni, Kalyan. The Synthetic Data Vault. Data Science and Advanced Analytics (DSAA) 2016. IEEE. doi:10.1109/DSAA.2016.49.
  14. Peng, Xingchao; Sun, Baochen; Ali, Karim; Saenko, Kate (2015). "Learning Deep Object Detectors from 3D Models". arXiv:1412.7122 [cs.CV].
  15. Shrivastava, Ashish; Pfister, Tomas; Tuzel, Oncel; Susskind, Josh; Wang, Wenda; Webb, Russ (2016). "Learning from Simulated and Unsupervised Images through Adversarial Training". arXiv:1612.07828 [cs.CV].
  16. "Neural Networks Need Data to Learn. Even If It's Fake". June 2023. Retrieved 17 June 2023.
  17. Wood, Erroll; Baltrušaitis, Tadas; Hewitt, Charlie; Dziadzio, Sebastian; Cashman, Thomas J.; Shotton, Jamie (2021). "Fake It Till You Make It: Face Analysis in the Wild Using Synthetic Data Alone". Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV): 3681–3691. arXiv:2109.15102.
