The digital world is producing data at an exponential rate. Many businesses try to take advantage of so-called “big data”, whether they handle data themselves (like banks, telecom operators and retailers) or are new players positioning themselves as big data experts. To exploit this data and extract value from it, these actors develop specific software and scripts that compute the relevant metrics on the raw data. This software needs to be tested and its results validated, its performance needs to be benchmarked, and its robustness to erratic data needs to be verified.
How can one test data management software without proper data to run it on? One option would be to run it on actual data. Unfortunately, doing so raises legal hurdles, as privacy laws are becoming increasingly strict all over the planet. In order to take advantage of big data without transgressing any (useful!) laws, big data players face a fairly new need: synthetic data.
What is it all about? Just as a scientist may need to produce synthetic material to conduct experiments at low risk, the data scientist will at some point have to produce synthetic, or fake, data that has the same or almost the same properties as the real data.
How can this be done? It’s not an easy job. The more features the dataset has, the more complex it is to reproduce comparable data, and the more computing power it requires. A trade-off between quality and cost is unavoidable.
There are two major ways to generate synthetic data. Both can apply to various data contexts, but we will briefly explain them here with the example of Call Detail Records or CDRs (i.e. the data records produced by a telephone exchange or other telecommunications equipment that document the details of a phone call or text message).
Drawing numbers from a distribution
The principle is to observe real-world statistical distributions in the original data and produce fake data by drawing random numbers from them. For example, if the goal is to reproduce the same telecom activity level over time as a real customer base, we will:
- Observe the real distribution of activity over time in the CDRs
- Create an artificial base of customers
- Simulate calls from these customers with timestamps that respect the observed distribution, all other fields in the CDR being randomly generated
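The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the generator discussed in this article: the function names, the CDR field names (caller, callee, start, duration_s) and the choice of an hourly histogram as "the distribution" are all hypothetical.

```python
import random
from collections import Counter
from datetime import datetime, timedelta


def hourly_distribution(timestamps):
    """Step 1: share of observed calls falling in each hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(hour, 0) / total for hour in range(24)]


def generate_cdrs(weights, n_customers, n_calls, day, seed=None):
    """Steps 2 and 3: build an artificial customer base and simulate calls.

    Only the hour of each call follows the observed distribution;
    every other field is drawn at random (hypothetical field names).
    """
    rng = random.Random(seed)
    # Step 2: artificial customers (n_customers must be at least 2).
    customers = [f"C{i:05d}" for i in range(n_customers)]
    # Step 3: draw the hour of each call from the observed weights.
    hours = rng.choices(range(24), weights=weights, k=n_calls)
    cdrs = []
    for hour in hours:
        caller, callee = rng.sample(customers, 2)  # two distinct customers
        start = day + timedelta(hours=hour, seconds=rng.randrange(3600))
        cdrs.append({
            "caller": caller,
            "callee": callee,
            "start": start,
            "duration_s": rng.randint(1, 600),  # arbitrary random duration
        })
    return cdrs
```

Calling hourly_distribution on the timestamps of real CDRs and passing the result to generate_cdrs yields records whose hourly activity profile approaches the observed one as the number of calls grows, while revealing nothing else about the original customers.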
Agent-based-like modelling
The principle is to create a physical model that explains the observed behaviour, then produce random data using this model. It is generally agreed that observed features of a complex system can often emerge from a set of simple rules. Take the example of simulating CDRs with a certain temporal pattern. A simple physical system would be one where users engage in activities over time, and their probability of making a phone call varies with the number of activities in which they’re engaged. Then, depending on how activities are distributed over time, a time pattern similar to the one observed in real data can be recreated with such a physical model. As before, CDRs are created while running the simulation over time, with all other fields created randomly.
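A minimal sketch of such a model, again in Python, could look like the following. Everything specific here is an assumption made for illustration: the activity schedule, the probabilities, the linear link between the number of activities and the call probability, and the CDR field names.

```python
import random
from datetime import datetime, timedelta

# Hypothetical activity schedule: (name, first hour, end hour exclusive).
ACTIVITIES = [("work", 9, 17), ("lunch", 12, 14), ("social", 18, 22)]


class User:
    """An agent that takes part in a random subset of the activities."""

    def __init__(self, user_id, rng, join_prob=0.7):
        self.user_id = user_id
        self.activities = [a for a in ACTIVITIES if rng.random() < join_prob]

    def active_count(self, hour):
        """Number of activities this user is engaged in at the given hour."""
        return sum(1 for _, start, end in self.activities if start <= hour < end)


def simulate_day(n_users, day, base_prob=0.02, per_activity_prob=0.15, seed=None):
    """Run the simulation hour by hour and emit one CDR per call made.

    The call probability grows linearly with the number of ongoing
    activities (an assumed rule); n_users must be at least 2.
    """
    rng = random.Random(seed)
    users = [User(f"U{i:04d}", rng) for i in range(n_users)]
    cdrs = []
    for hour in range(24):
        for user in users:
            p_call = min(1.0, base_prob + per_activity_prob * user.active_count(hour))
            if rng.random() < p_call:
                callee = rng.choice([u for u in users if u is not user])
                start = day + timedelta(hours=hour, seconds=rng.randrange(3600))
                cdrs.append({
                    "caller": user.user_id,
                    "callee": callee.user_id,
                    "start": start,
                    "duration_s": rng.randint(1, 600),  # arbitrary random duration
                })
    cdrs.sort(key=lambda record: record["start"])
    return cdrs
```

Here the daily pattern is not imposed directly: it emerges from how the activities are spread over the day, so changing the schedule changes the shape of the simulated traffic.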
In both cases, it is simple to generate data that fits one or two features of a dataset, but the complexity grows quickly as more features are added. At Real Impact Analytics, we are working on a generic generator that simplifies the way such complex data is created. We combine lightweight agent-based modelling with real-data distributions, and bundle this into a powerful modelling engine. It models even complex interactions through a set of simple choices and transformations by the agents.
Advantages and issues of using synthetic data
Generating data that looks like the real thing may seem a fantastic playground for business people. If they can replicate datasets, they could run simulations that allow them to predict consumer behavior and consequently set up winning strategies! The reality is unfortunately not that exciting because, as we mentioned earlier, synthetic models only replicate specific properties of the data. They cannot match a given dataset perfectly; they only simulate general trends.
To make an analogy, you could possibly simulate an advertising campaign in SimCity, but it would teach you nothing about the real-world response to your campaign, because SimCity inhabitants do not react to the product like humans. You are always bound by what you can model. However, using synthetic data has some great advantages, too. First, it can be useful for visualization purposes and for testing the scalability and robustness of new algorithms. This is absolutely key for everyone who works on big data applications. Second, the resulting indicators can be shared broadly, often as open data, which contributes to the community’s general knowledge of big data and algorithms.
Synthetic data is a useful tool to safely share data for testing the scalability of algorithms and the performance of new software. It cannot be used for research purposes, however, as it only aims at reproducing specific properties of the data. Producing quality synthetic data is complicated because the more complex the system, the more difficult it is to keep track of all the features that need to be similar to real data.
The advent of tougher privacy regulations is making it necessary for data owners to prepare themselves for restricted access to private data (even their own!). As big data tools become increasingly widespread, an investment in simulating real data is critical. Whether they develop an in-house generator or pay for ad hoc development, companies will have to include these new realities in their strategic planning.
By Gautier Krings, Chief Scientist at Real Impact Analytics