If you are developing software that runs on data, you have probably already faced the following problem: how do you test the software without proper data to run it on? How do you benchmark its performance? How do you validate its robustness to erroneous data?
Running on real data is an option, but you need access to it. Even assuming you do – although recent regulatory advances (we’re looking at you, GDPR) restrict such access drastically – you still often have access to a single dataset, with little diversity in size or in the types of data issues you will face. This is not ideal for developing robust, well-checked software, for which you want to benchmark multiple use cases, varying the many parameters that impact your software’s performance: What if we double our subscriber base? What if our customers are 10% more active? What if…
You are not alone. Many players in the so-called Big Data field are facing the same issue, for which a new type of need has emerged: synthetic data.
What is it all about? Just as a scientist may need to produce synthetic material to conduct experiments at low risk, the data scientist will at some point have to produce synthetic or fake data that has the same, or almost the same, properties as the real data, with the ability to vary those properties to test diverse scenarios.
At Real Impact Analytics, we face these problems daily. That’s why we created Trumania, our synthetic data generator. It allows us to create very diverse, realistic scenarios that generate the data we need, based on empirical distributions. This way, we create a parameterized dataset that represents realistic yet fake data.
Generating synthetic data is not entirely new, and there are three major approaches to the problem. They apply to various data contexts, but we will succinctly explain them here with the example of Call Detail Records or CDRs (i.e. the data records produced by a telecommunications network that document the details of a phone call or text message).
Schema-based generation
Schema-based solutions are the simplest way to generate synthetic data: they simply respect a pre-defined schema, without aiming to reproduce any correlation or statistical distribution. These tools require as input an expected schema (the name and type of each field), and draw random outputs from pre-defined generators. For example, customer numbers in CDRs could be random numbers of a pre-defined length, and timestamps could be random timestamps drawn from a given time window.
It is important to understand that with these tools, entries have no correlation with each other: all data is drawn at random from simple generators.
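A minimal sketch of what such a schema-driven generator might look like in Python (the field names, number lengths and generators are illustrative assumptions, not any specific tool’s API); every field is drawn independently, so no two fields are correlated:

```python
import random
import datetime

def random_msisdn(length=10):
    """Random customer number of a pre-defined length."""
    return "".join(random.choice("0123456789") for _ in range(length))

def random_timestamp(start, end):
    """Random timestamp uniformly drawn in the window [start, end]."""
    delta = (end - start).total_seconds()
    return start + datetime.timedelta(seconds=random.uniform(0, delta))

def generate_cdrs(n, start, end):
    """Draw n CDRs, each field independently, from simple generators."""
    return [
        {
            "caller": random_msisdn(),
            "callee": random_msisdn(),
            "timestamp": random_timestamp(start, end),
            "duration_s": random.randint(1, 3600),
        }
        for _ in range(n)
    ]

start = datetime.datetime(2024, 1, 1)
end = datetime.datetime(2024, 1, 2)
cdrs = generate_cdrs(100, start, end)
```

Such output respects the schema (types, lengths, value ranges) but nothing more: the distribution of calls over time is flat, and callers never call the same people twice more often than chance dictates.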
Distribution-based generation
The principle here is to observe real-world statistical distributions in the original data and reproduce fake data by drawing numbers from those observed distributions. These distributions can be multivariate, meaning that correlations between multiple variables can be inferred to represent more complex interactions. For example, if the goal is to reproduce the same telecom activity level over time and over geographical distance as in a real customer base, we will:
- Observe what is the real distribution of calls over time and distance on CDRs
- Create an artificial base of customers
- Simulate calls from these customers with time stamps and geographical positions, respecting the distribution that was observed, all other fields in the CDR being randomly generated
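The steps above can be sketched in Python with the standard library alone (the observed histogram and customer base here are hypothetical placeholders for numbers you would infer from real CDRs):

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the simulation is reproducible

# Step 1: the "observed" distribution of calls per hour of day,
# as it would be measured on real CDRs (hypothetical numbers).
observed_calls_per_hour = Counter({8: 120, 12: 300, 18: 450, 22: 90})
hours = list(observed_calls_per_hour)
weights = [observed_calls_per_hour[h] for h in hours]

# Step 2: an artificial base of customers.
customers = [f"user_{i:04d}" for i in range(50)]

# Step 3: simulate calls whose hour of day respects the observed
# distribution; all other fields are randomly generated.
def simulate_calls(n):
    return [
        {
            "caller": random.choice(customers),
            "hour": random.choices(hours, weights=weights, k=1)[0],
        }
        for _ in range(n)
    ]

calls = simulate_calls(10_000)
```

With enough samples, the hourly histogram of the fake calls converges to the observed one, so the peak hour of the synthetic data matches the peak hour of the real data.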
Agent-based modelling
The principle is to create a physical model that explains the observed behaviour, then reproduce random data using this model. It is generally agreed that observed features of a complex system can often emerge from a set of simple rules. Take the example of simulating CDRs with a certain temporal pattern. A simple physical system would be one where users engage in activities over time, and their probability of making a phone call varies with the number of activities in which they’re engaged. Then, depending on how activities are distributed over time, a time pattern similar to the one observed in real data can be recreated with such a physical model. Again, CDRs are created while running the simulation over time, with all other fields generated randomly.
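A toy version of this simple physical system could look as follows (the activity rule and the 0.1 calling probability per activity are assumptions made up for the sketch, not a validated model):

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

class User:
    """An agent whose calling probability depends on its activities."""

    def __init__(self, name):
        self.name = name

    def activities_at(self, hour):
        # Simple rule: users engage in more activities during the day
        # (8h-20h) than at night.
        return random.randint(0, 5) if 8 <= hour < 20 else random.randint(0, 1)

    def maybe_call(self, hour):
        # The probability of placing a call grows with the number of
        # activities the user is engaged in.
        p = min(1.0, 0.1 * self.activities_at(hour))
        return random.random() < p

def run_simulation(users, hours=24):
    """Run the model over time; emit a CDR each time an agent calls."""
    cdrs = []
    for hour in range(hours):
        for user in users:
            if user.maybe_call(hour):
                cdrs.append({"caller": user.name, "hour": hour})
    return cdrs

users = [User(f"user_{i}") for i in range(100)]
cdrs = run_simulation(users)
```

Even though no temporal distribution was ever written down, the day/night call pattern emerges from the simple activity rule, which is exactly the appeal of this approach.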
Each approach has its advantages and its drawbacks. Creating random data from a schema is very fast and very useful for testing the basic functionality of a piece of code, but won’t help much in benchmarking complex algorithms, where the relationships between fields do matter. Inferring distributions from real data is a fast way to reproduce realistic data; however, as multivariate models grow, the risk becomes high that observed correlations are overfitted, limiting the possibilities to simulate complex scenarios. On the other side of the playground, agent-based models do allow very complex scenarios, but limit the realism of their simulations to the veracity of their models, which are often subject to debate.
At Real Impact Analytics, we decided to take the best of the two latter options by creating a generator based on a lightweight agent-based modelling framework that uses inferred distributions. Trumania represents every synthetic dataset as a scenario, in which dimensional data is generated either from classical parametric random generators or from inferred non-parametric distributions. Fact data is then generated as interactions between agents, which use the dimensional data as input to determine their behaviour.
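The hybrid idea can be illustrated in a few lines of Python (this is a sketch of the concept, not Trumania’s actual API; the observed call counts and the single-day simulation are assumptions). Dimensional data is resampled from an inferred, non-parametric distribution, and fact data then emerges from interactions between agents:

```python
import random

random.seed(7)  # fixed seed so the simulation is reproducible

# Inferred, non-parametric distribution: daily call counts as observed
# on real customers (hypothetical sample).
observed_daily_calls = [0, 1, 1, 2, 2, 2, 3, 5, 8, 13]

# Dimensional data: a base of agents whose activity level is drawn by
# resampling the observed values (a simple bootstrap).
agents = [
    {"id": f"user_{i:03d}", "daily_calls": random.choice(observed_daily_calls)}
    for i in range(200)
]

# Fact data: generated as interactions between agents, each agent
# calling randomly chosen peers as often as its activity level dictates.
def simulate_day(agents):
    cdrs = []
    for agent in agents:
        for _ in range(agent["daily_calls"]):
            callee = random.choice(agents)["id"]
            cdrs.append({"caller": agent["id"], "callee": callee})
    return cdrs

cdrs = simulate_day(agents)
```

The realism of the agents’ activity levels comes from the data, while the structure of the interactions comes from the model, which is the combination described above.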
Synthetic data is not real data
Generating data that looks like the real thing may seem like a fantastic playground for businessmen. If they could replicate datasets, they could run simulations that predict consumer behavior and consequently set up winning strategies! The reality is unfortunately not that exciting because, as we mentioned earlier, synthetic models only replicate specific properties of the data. They cannot match a given dataset perfectly; they can only simulate general trends.
To make an analogy, you could possibly simulate an advertising campaign in SimCity, but it would teach you nothing about the real-world response to your campaign, because SimCity inhabitants do not react to the product like humans do. You are always bound by what you can model. However, using synthetic data has some great advantages, too. First, it is useful for visualization purposes and for testing the scalability as well as the robustness of new algorithms. This is absolutely key for everyone working on big data applications. Second, the resulting indicators can be shared broadly, often as open data, so it contributes to the community’s general big data and algorithmic knowledge.
Synthetic data is a useful tool for safely sharing data, for testing the scalability of algorithms and for testing the performance of new software. It cannot be used for research purposes, however, as it only aims at reproducing specific properties of the data. Producing quality synthetic data is complicated: the more complex the system, the more difficult it is to keep track of all the features that need to be similar to real data.
As we perceive this as a global issue for the Big Data community, and as we, as a Big Data player, have often taken advantage of the community’s open-source contributions, we have decided to open-source our solution, Trumania. We believe such a tool can help the many data scientists and developers struggling while developing their solutions, and that Trumania’s generic approach makes it a great tool for diverse applications. We encourage people interested in the field to join the community, suggest improvements and, why not, contribute!
By Gautier Krings, former Chief Scientist at Real Impact Analytics