Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated. Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public; synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.

Usefulness

Synthetic data is generated to meet specific needs or conditions that may not be found in the original, real data. This is useful when designing any type of system, because the synthetic data serves as a simulation or as a theoretical value or situation. It allows designers to take unexpected results into account and to have a basic solution or remedy ready if the results prove unsatisfactory. Synthetic data is often generated to represent the authentic data and allows a baseline to be set. Another benefit of synthetic data is that it protects the privacy and confidentiality of authentic data. As stated previously, synthetic data is used in testing and creating many different types of systems. The abstract of an article describing software that generates synthetic data for testing fraud detection systems further explains its use and importance: "This enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment."

History

Scientific modelling of physical systems, which makes it possible to run simulations that estimate, compute, or generate data points that have not been observed in reality, has a long history that runs concurrently with the history of physics itself. For example, research into the synthesis of audio and voice can be traced back to the 1930s and before, driven forward by developments such as the telephone and audio recording. Digitization gave rise to software synthesizers from the 1970s onwards.

In the context of privacy-preserving statistical analysis, the idea of original fully synthetic data was created by Rubin in 1993. Rubin originally designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records, thereby preserving the anonymity of the households. Later that year, the idea of original partially synthetic data was created by Little, who used it to synthesize the sensitive values on the public use file. In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling. Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. Similarly, they came up with the technique of Sequential Regression Multivariate Imputation.

Calculations

Researchers often test a framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms". Synthetic data can be generated through the use of random lines having different orientations and starting positions. Datasets can get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create the model or equation that best fits the data. This model or equation is called a synthesizer build, and it can be used to generate more data. Constructing a synthesizer build involves constructing a statistical model. In a linear regression example, the original data can be plotted and a best-fit line created from the data. This line is a synthesizer created from the original data. The next step is to generate more synthetic data from the synthesizer build, that is, from the equation of this line. In this way, the new data can be used for studies and research, and it protects the confidentiality of the original data.

David Jensen from the Knowledge Discovery Laboratory explains how to generate synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, Proximity can generate synthetic data having one of several types of graph structure: random graphs that are generated by some random process; lattice graphs having a ring structure; lattice graphs having a grid structure, etc. In all cases, the data generation follows the same process:

1. Generate the empty graph structure.
2. Generate attribute values based on user-supplied prior probabilities.

Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
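The two-step graph generation process described in this section can be sketched roughly as follows. This is a minimal illustration, not the procedure of any particular tool: the function names (`ring_lattice`, `random_graph`, `assign_attributes`) and the `copy_prob` and `sweeps` parameters are assumptions made for the example, and "collective" assignment is approximated by letting each node repeatedly copy a neighbor's value, which is one simple way to induce auto-correlation.

```python
import random

def ring_lattice(n, k=1):
    """Step 1 (ring variant): empty structure, each node linked to k neighbors per side."""
    if n <= 2 * k:
        raise ValueError("need n > 2 * k for a proper ring")
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            edges[i].add(j)
            edges[j].add(i)
    return edges

def random_graph(n, p, rng):
    """Step 1 (random variant): each possible edge is included with probability p."""
    edges = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                edges[i].add(j)
                edges[j].add(i)
    return edges

def assign_attributes(edges, prior, copy_prob, sweeps, rng):
    """Step 2: draw values from the prior, then let related nodes influence each other.

    The copy-a-neighbor rule is an assumption of this sketch; it makes linked
    nodes more likely to share a value than the prior alone would.
    """
    values = list(prior)
    weights = [prior[v] for v in values]
    attrs = {node: rng.choices(values, weights)[0] for node in edges}
    for _ in range(sweeps):
        for node in edges:
            neighbors = sorted(edges[node])
            if neighbors and rng.random() < copy_prob:
                attrs[node] = attrs[rng.choice(neighbors)]
    return attrs

rng = random.Random(42)
ring = ring_lattice(10)
attrs = assign_attributes(ring, {"red": 0.5, "blue": 0.5},
                          copy_prob=0.8, sweeps=5, rng=rng)
```

Raising `copy_prob` or `sweeps` strengthens the dependence between linked nodes; setting `copy_prob` to 0 leaves every attribute as an independent draw from the prior.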
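The linear regression example from this section can also be sketched in a few lines. The sketch below fits a least-squares line to some original data and then samples new points from it. The "original" data here is invented for illustration, and adding Gaussian noise scaled to the residuals is an assumption of this sketch; the text only says that new data is generated from the fitted line.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, residual_std)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # n - 2 degrees of freedom, because two parameters were estimated
    residual_std = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return slope, intercept, residual_std

def synthesize(model, n, x_min, x_max, rng):
    """Generate n synthetic (x, y) points from the fitted line plus noise."""
    slope, intercept, residual_std = model
    rows = []
    for _ in range(n):
        x = rng.uniform(x_min, x_max)
        y = slope * x + intercept + rng.gauss(0.0, residual_std)
        rows.append((x, y))
    return rows

# Hypothetical "original" data, made up for this example only
rng = random.Random(0)
original_x = [float(i) for i in range(20)]
original_y = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in original_x]

model = fit_line(original_x, original_y)   # the "synthesizer build"
synthetic = synthesize(model, 100, min(original_x), max(original_x), rng)
```

None of the synthetic rows is an original record, but together they follow the same fitted trend, which is what makes them usable for study in place of the original data.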

Applications

Fraud detection and confidentiality systems

Fraud detection and confidentiality systems are tested and trained using synthetic data. Specific algorithms and generators are designed to create realistic data, which then helps teach a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data were not used, the software would be trained to react only to the situations provided by the authentic data, and it might not recognize another type of intrusion.
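As a toy illustration of the intrusion detection example, the sketch below mixes synthetic intrusion records into benign data and checks a detector against the known labels. Everything in it is hypothetical: the record fields, the value ranges, and the threshold-based `detector` are placeholders invented for the example, not a description of any real system.

```python
import random

def make_normal_events(n, rng):
    """Stand-in for authentic data: login attempts per minute, all benign."""
    return [{"attempts": rng.randint(1, 5), "label": "normal"} for _ in range(n)]

def inject_intrusions(events, n_intrusions, rng):
    """Add synthetic intrusion records of a kind absent from the authentic data."""
    synthetic = [{"attempts": rng.randint(50, 200), "label": "intrusion"}
                 for _ in range(n_intrusions)]
    mixed = events + synthetic
    rng.shuffle(mixed)
    return mixed

def detector(event, threshold=20):
    """Placeholder detector: flags unusually high attempt counts."""
    return "intrusion" if event["attempts"] > threshold else "normal"

rng = random.Random(1)
test_set = inject_intrusions(make_normal_events(95, rng), 5, rng)

# Because the labels are known by construction, the detector can be scored exactly.
detected = sum(1 for e in test_set
               if e["label"] == "intrusion" and detector(e) == "intrusion")
false_alarms = sum(1 for e in test_set
                   if e["label"] == "normal" and detector(e) == "intrusion")
```

The authentic-style records alone contain no intrusions, so without the injected records there would be nothing to measure the detector's hit rate against.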

Scientific research

Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing. Real data can contain information that researchers may not want released, so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues, since it holds no personal information and cannot be traced back to any individual.

Machine learning

Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to construct general-purpose synthetic data generators to enable data science experiments. In general, synthetic data has several natural advantages:

* once the synthetic environment is ready, it is fast and cheap to produce as much data as needed;
* synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impossible to obtain by hand;
* the synthetic environment can be modified to improve the model and training;
* synthetic data can be used as a substitute for certain real data segments that contain, e.g., sensitive information.

This usage of synthetic data has been proposed for computer vision applications, in particular object detection, where the synthetic environment is a 3D model of the object, and learning to navigate environments by visual information. At the same time, transfer learning remains a nontrivial problem, and synthetic data has not become ubiquitous yet. Research results indicate that adding a small amount of real data significantly improves transfer learning with synthetic data. Advances in generative adversarial networks (GANs) lead to the natural idea that one can produce data and then use it for training. This fully synthetic approach has not yet materialized, although GANs and adversarial training in general are already successfully used to improve synthetic data generation. Currently, synthetic data is used in practice for emulated environments for training self-driving cars (in particular, using realistic computer games for synthetic environments), point tracking, and retail applications, with techniques such as domain randomization for transfer learning.
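Two of the ideas in this section, exact labels and domain randomization, can be illustrated with a deliberately small sketch. It stands in for a real rendering pipeline: instead of a 3D model it draws a bright square on a noisy 2D grid, randomizing position, size, brightness, and noise for every sample. The function name `render_sample` and all value ranges are assumptions chosen for the example.

```python
import random

def render_sample(size, rng):
    """Return (image, label): a bright square on a noisy background.

    The label is the exact bounding box, known because the generator placed it.
    """
    side = rng.randint(3, size // 2)
    top = rng.randint(0, size - side)
    left = rng.randint(0, size - side)
    background = rng.uniform(0.0, 0.4)   # randomized background brightness
    foreground = rng.uniform(0.6, 1.0)   # randomized object brightness
    noise = rng.uniform(0.0, 0.05)       # randomized noise level

    # Background pixels: base brightness plus Gaussian noise, clipped to [0, 1]
    image = [[min(1.0, max(0.0, background + rng.gauss(0.0, noise)))
              for _ in range(size)]
             for _ in range(size)]
    # Paint the object over the background
    for r in range(top, top + side):
        for c in range(left, left + side):
            image[r][c] = foreground

    label = {"top": top, "left": left, "height": side, "width": side}
    return image, label

rng = random.Random(7)
dataset = [render_sample(16, rng) for _ in range(50)]
```

Producing more samples only means calling `render_sample` again, and each label is correct by construction. Whether a model trained on such data transfers to real images is a separate question, as the text notes.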
