(This post has been a long time in the making. According to my notes, I'd planned to write about synthetic data in 2016 and later delivered a guest lecture on the topic in 2018. Now, in 2025, a Bluesky post by Frank Wiles of RevSys jogged my memory … so I decided to finally get to it. Consider this blog post to be a longer, more detailed version of my off-the-cuff reply to Frank.)
We often think of data as something we collect, a product of some real-world process. We can also generate data – that is, create artificial records – to backfill a data deficit.
Generating synthetic data isn't as easy as it may look, though. Over the coming pages I'll walk through several considerations to help you steer clear of pitfalls.
True to form, this will be a very high-level post. The code sketches sprinkled throughout are just that: sketches, not production-ready recipes. I'll explain the reason for staying high-level near the end.
Not only is generating synthetic data fun, and occasionally funny, it's a hell of a way to stretch your brain as a data practitioner or software developer. In part because the first step involves understanding the problem you're trying to solve.
In this case you're not just "synthesizing data"; you're playing the role of a data-generating process (DGP). That means you have to understand the DGP itself, sure. You also have to understand the DGP through the lens of the downstream consumer(s) of the synthetic data. Different aspects of the data, and therefore of your artificial DGP, will matter more than others.
To develop that picture of what matters for the task at hand, you'll work through questions such as:
1/ "Why do we need this data?" How will it be used? Are we testing a data pipeline? Filling out web pages to check formatting? Something else?
2/ "How realistic does it have to be?" Do we just need random characters or digits to fill slots? Or will the downstream process be more discerning, like an accounting system that checks for anomalies?
3/ "How much data do we need?" Is this a quick one-off where we don't need to worry about reproducibility? Will this require a steady stream of synthetic data for ongoing work? Do we need enough data to overwhelm a downstream system, in order to stress-test it?
Having sussed out which parts of the DGP you'll need to emulate, you can size up your tools and methods.
The first method is to use your existing data. Seriously. If you're able to do that, generating synthetic data is overkill. Pulling sample rows from your database is faster, cheaper, and cleaner than writing code to simulate those rows. (Assuming your data storage isn't all over the place. But that's another story.)
Using real data isn't always an option, though. Maybe you need synthetic data for privacy reasons. Or you don't have enough real data for what you need to test. Or you need records that your system hasn't seen before, in order to catch corner cases. Perhaps you have no data at all because an upstream vendor isn't ready to provide it. Keep in mind that the "upstream vendor" may be another department in your company, one that needs extra time to properly gather or obscure it. Or maybe they're holding out because someone in their leadership chain is feuding with someone in your leadership chain. It happens.
When choosing your tools to generate that data, consider how realistic it needs to be. Will any old numbers do? Fire up your favorite (pseudo-)random number generator and you're done.
If downstream consumers need more realistic numbers, you still have options. You'll first need to run summary statistics across the existing, real dataset to get a rough idea of the required shape and range of values: "must be a number smaller than 100"; "messages usually run about 100'ish words"; and so on.
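To make that concrete, here's a minimal sketch (in Python, with made-up numbers standing in for your real dataset) of deriving bounds from real data and then sampling inside them:

```python
import random

# A handful of real observations pulled from the production database
# (these particular values are made up for illustration).
real_values = [42.1, 57.3, 61.0, 48.8, 73.4, 55.2]

# Summary stats give a rough shape: "must fall between the observed low and high."
low, high = min(real_values), max(real_values)

def synthetic_value():
    """Draw a value that stays within the observed range."""
    return round(random.uniform(low, high), 2)

synthetic = [synthetic_value() for _ in range(1_000)]
```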
Given that, we can apply some of this theory to three examples:
Let's say we run a trading operation and we need pricing (tick) data to test our systems. As a simplifying assumption, we only need the prices but not any timestamps. Essentially, we need a Big List of Numbers.
We should be able to pull values from a random number generator and be done with it. Right?
Yes.
But also, no.
Beneath the surface there are some weird spots to consider.
The randomness is certainly there. Tick data is often described as a "random walk" because the numbers appear to bob up and down with no rhyme or reason: "up 5.2, down 3.7, down 2.25, up 19.8 …" But it can't be too random. Since the goal is to test a trading system, we need the numbers to be somewhat realistic – even if we also need them to be occasionally weird so we can smoke out corner cases.
As a first step, you could use your historical market data. Share prices are public information, so there are no privacy concerns about reusing them. It's realistic data because it is real. Best of all, you already have it!
Replaying historical data is a common test in a trading environment. This is extremely useful data, but it's also limited because it describes what's already happened. Sometimes you need to test your trading system on things that have not happened (yet) so you can check for weak points in your code and processes.
As a next step, you could use historical data but reorder the values. This way you still have the same price movements (maintaining the air of realism) but it's a brand-new stream of data because your system has never seen those price movements in that sequence. Maybe a jump from +0.2 to -9.5 will be enough to trigger a bug.
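Here's one way to do that reshuffle – a minimal sketch that reorders the price movements (deltas) rather than the raw prices, so each step stays a realistic size even in its new position:

```python
import random

# Historical closing prices (illustrative values).
history = [100.0, 100.2, 99.7, 101.5, 98.3, 99.0]

# Work with the movements rather than the raw prices.
moves = [later - earlier for earlier, later in zip(history, history[1:])]
random.shuffle(moves)

# Replay the shuffled moves from the original starting price.
prices = [history[0]]
for move in moves:
    prices.append(round(prices[-1] + move, 2))
```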
From there, you could corrupt the data by randomly making values several times larger or smaller. Or you could make them zero. Or change the signs on the values such that 90% of the prices are negative. This may sound silly, but your system probably wants to treat extreme price movements differently than movements within some given range. Exchanges typically halt trading if a given price moves far enough, quickly enough, because the assumption is that this move happened in error.
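The corruption step can also stay small. In this sketch, the fraction and the particular ways of mangling a value are arbitrary placeholders you'd tune to your own tests:

```python
import random

def corrupt(prices, fraction=0.05):
    """Randomly mangle a fraction of the prices to smoke out corner cases."""
    mangled = list(prices)
    for i, value in enumerate(mangled):
        if random.random() < fraction:
            mangled[i] = random.choice([
                value * 10,     # wildly too large
                value / 10,     # wildly too small
                0.0,            # an unexpected zero
                -abs(value),    # a negative price
            ])
    return mangled
```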
If you need to generate data from scratch, the random number generator is still your friend. And with enough historical data, you can gather summary stats to make the synthetic data more realistic by setting upper and lower bounds on values.
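A from-scratch version might look like the following sketch, where the step size and the price bounds are placeholders for values you would derive from summary stats over real historical data:

```python
import random

def random_walk(start=100.0, steps=1_000, max_move=2.0, floor=0.01, ceiling=500.0):
    """Generate a synthetic price series as a bounded random walk.

    max_move, floor, and ceiling stand in for values derived from
    summary stats over real historical data.
    """
    prices = [start]
    for _ in range(steps):
        move = random.uniform(-max_move, max_move)
        prices.append(round(min(max(prices[-1] + move, floor), ceiling), 2))
    return prices
```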
The next example will require multiple fields for each record. Let's say we run a mobile phone operator and we need to generate some call records. Each record includes the phone numbers involved, the start time, and the duration. For simplicity we'll assume that each call has two participants (so, no group calls) and only US-based callers (ten-digit phone numbers) from the same general area (which limits the number of area codes).
This example is surprisingly amenable to simple methods (there's a code sketch after the walkthrough below):
The area codes will be part of a predefined set. For New York City, say, that would be 212, 646, 917, 347, and 718. Draw two of these and set them off to the side.
To create the remaining seven digits – a combination of the central office code and subscriber number, in technical terms – randomly draw a pair of values from the range 1,110,000 to 9,999,999. (Reformatted for legibility, that would be "111-0000" to "999-9999".) Combine these with the two randomly-chosen area codes to get a pair of ten-digit phone numbers.
While it's unlikely this process would return a pair of identical numbers, it's possible. You could filter these out. Or, you could intentionally insert some cases in which one number calls itself. That corner case would certainly throw some systems for a loop.
What about call start times? You want these to be realistic, so you grab several thousand real start times from your database and randomly pick from that list to fill out the synthetic dataset. Maybe you over-weight certain times of day to simulate an unexpected spike in traffic due to some emergency event.
The same holds for call durations: you could sample real values from your database, and perhaps randomly extend or shrink certain values to introduce extrema.
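Putting those pieces together, a call-record generator can stay quite small. In this sketch the field names are arbitrary, and sampled_start_times and sampled_durations stand in for values pulled from your real database:

```python
import random

# Area codes for the region in question (here, the NYC set from above).
AREA_CODES = ["212", "646", "917", "347", "718"]

def random_phone_number():
    """Ten-digit US number: a known area code plus seven random digits."""
    return random.choice(AREA_CODES) + f"{random.randint(1_110_000, 9_999_999):07d}"

def random_call_record(sampled_start_times, sampled_durations):
    """Build one synthetic call record from real samples plus random numbers."""
    caller, callee = random_phone_number(), random_phone_number()
    if random.random() < 0.01:
        callee = caller        # the "number calls itself" corner case
    duration = random.choice(sampled_durations)
    if random.random() < 0.02:
        duration *= 10         # stretch a few durations to create extrema
    return {
        "caller": caller,
        "callee": callee,
        "start_time": random.choice(sampled_start_times),
        "duration_seconds": duration,
    }
```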
For the final example, let's say we need to generate text data. Perhaps we're filling out web page layouts, or load-testing an app that processes web forms.
As usual, we start with the easy version and then work our way up to something more complicated. The easiest test text of all is Lorem ipsum. You could also use one of its tastier variants, like Bacon ipsum.
Given enough time and enough sample documents, you could scramble that corpus to create a domain-specific Lorem ipsum.
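The scramble itself can be nearly a one-liner – sample words, with replacement, from your own corpus:

```python
import random

def domain_ipsum(corpus_text, num_words=50):
    """Scramble an existing corpus into domain-flavored filler text."""
    words = corpus_text.split()
    return " ".join(random.choices(words, k=num_words))
```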
With a little more elbow grease you could use that same data to develop a generative system. That includes genAI bots (via LLMs), yes. It also includes Markov bots, restricted Boltzmann machines, generative adversarial networks (GANs), and a host of other technologies.
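As a taste of the simplest of those, here's a tiny first-order, word-level Markov bot. It only captures which word tends to follow which, but for filler text that's often enough:

```python
import random
from collections import defaultdict

def build_model(corpus_text):
    """Map each word to the words that follow it in the corpus."""
    words = corpus_text.split()
    model = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        model[current].append(nxt)
    return model

def generate(model, num_words=30):
    """Walk the chain: each word is drawn from its predecessor's followers."""
    word = random.choice(list(model))
    output = [word]
    for _ in range(num_words - 1):
        followers = model.get(word)
        word = random.choice(followers) if followers else random.choice(list(model))
        output.append(word)
    return " ".join(output)
```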
The entire purpose of generative systems is to make up (somewhat-)realistic data. The ML-based systems boil down to: "give me a record that was not in your training data, but could reasonably have fit in with that training data." That said, I'll note that the definition of "realistic" is based on both the training data and the underlying tools. If the tools only pick up on grammatical patterns but you need factual patterns, beware. But if you just need filler text, this is the way to go.
Assuming that speed is a high priority, the popular genAI bots are a good start. It's hard to beat "a few lines of API calls" when it comes to generating synthetic data, though there are some tradeoffs. The greatest caveat of LLMs – of generative systems in general, but LLMs in particular – is the risk of so-called "hallucinations." These are the generated artifacts we don't like, usually because they are some mix of incorrect and inappropriate. Hallucinations stem from the models' inherent randomness.
If you're using LLMs to generate text that you'll put in front of customers, certainly, hallucinations are a concern and you should watch out for them. But going back to what I said earlier: the entire purpose of generative systems is to make up (somewhat-)realistic data. One reason LLMs shine for generating synthetic data is that you can use their inherent randomness to your advantage. If you squint just right, you can treat a generative system as a special flavor of a distribution from which to sample.
(That was the driving force behind my genAI art project, which generates fortune cookies.)
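For completeness, here's what "a few lines of API calls" can look like. This sketch uses the OpenAI Python client as one example; the model name, the prompt, and the environment setup are assumptions you'd adapt to whichever provider you use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whatever model you have access to
    messages=[
        {"role": "system",
         "content": "You generate realistic but entirely fictional customer messages."},
        {"role": "user",
         "content": "Write five short, varied complaint messages about a billing error."},
    ],
)

synthetic_text = response.choices[0].message.content
```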
I hope this post helps you the next time you need to develop a synthetic dataset.
Looking at my notes from 2016, I was pleasantly surprised to see that the core ideas still hold up nine years later. The arrival of GANs and LLMs adds new tools to your toolbox, but you still need to start with the basic steps: "understand the DGP" and "understand the downstream consumers." Be mindful of anyone who wants to dive into writing code before you've sorted those out.