A tension every data lead faces
Gartner predicted that by 2024, 60% of the data used for AI and analytics projects would be synthetically generated, up from roughly 1% in 2021 – a forecast the industry debated at the time and is still reconciling against ground truth. Whatever the exact share today, the direction is unambiguous: synthetic data has moved from research curiosity to mainstream pipeline. So has the counter-pressure. Andrew Ng's "data-centric AI" framing reminded the field that label quality, not model architecture, is the usual ceiling on production performance – and label quality, for now, still leans heavily on humans.
The result is a genuine strategic question inside every enterprise ML program: for any given dataset, should you generate, label, or both? The honest answer is that it depends on what you're training, where it will run, and what failure mode you can tolerate. Here is how we think about the tradeoff.
Where synthetic data wins
Synthetic data earns its keep when the problem is physics-bounded, the edge cases are rare or dangerous to collect, or the volume required would be prohibitively expensive to annotate.
- Autonomous systems. Waymo has publicly reported tens of billions of simulated miles in CarCraft and Waymax to complement real-world driving, covering weather, cut-ins, and pedestrian dart-outs – events that are statistically rare in real data. NVIDIA DRIVE Sim and Isaac Sim occupy the same role for robotics and AV customers.
- Privacy-constrained domains. Healthcare and financial services often cannot share real records across borders; synthetic patient records and transaction streams let ML teams train, test, and benchmark without triggering GDPR, HIPAA, or cross-border transfer reviews.
- Imbalanced-class augmentation. When positive examples are rare – fraud, device failure, rare diseases – generating plausible synthetic positives (GANs, diffusion, programmatic labelling) can lift recall where human collection would take years.
- Safety-critical red teaming. Prompt-injection corpora, adversarial images, jailbreak attempts – curated synthetically, these are often the only way to stress-test a deployed model at volume.
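The imbalanced-class augmentation mentioned above can be sketched minimally as SMOTE-style interpolation between rare positives. This is a toy illustration with made-up data – a production pipeline would reach for a GAN, a diffusion model, or a library such as imbalanced-learn rather than this hand-rolled version:

```python
import numpy as np

def synth_positives(X_pos, n_new, k=3, seed=None):
    """SMOTE-style oversampling: create each synthetic positive by
    interpolating between a random rare positive and one of its
    k nearest positive neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_pos)
    out = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to every other positive
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        # pick one of the k nearest neighbours (index 0 is the point itself)
        nbr = rng.choice(np.argsort(d)[1 : k + 1])
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append(X_pos[i] + lam * (X_pos[nbr] - X_pos[i]))
    return np.array(out)

# Ten rare fraud positives in a 4-feature space (illustrative only)
X_pos = np.random.default_rng(0).normal(size=(10, 4))
X_synth = synth_positives(X_pos, n_new=50, seed=0)
```

Because each synthetic point is a convex combination of two real positives, it stays inside the envelope of the observed minority class – plausible, but by construction it cannot discover genuinely new fraud patterns, which is why the human-labelled boundary still matters.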
Where human annotation still wins
Human-in-the-loop annotation remains the anchor when the task requires judgement, the output must hold in court or clinic, or the distribution drifts faster than simulators can model.
- Subjective or culturally grounded labels. Content moderation, sentiment, toxicity, legal categorisation – these categories do not survive a reduction to rules, and synthetic generators trained on old data entrench yesterday's blind spots.
- Safety-critical domains. Radiology, pathology, clinical-decision support, and autonomous-driving perception systems under regulator scrutiny still require human ground truth; regulators and insurers ask who labelled what.
- Long-tail and drift. Real users behave in ways your simulator didn't anticipate. Human labelling on a rolling sample of production traffic is the cheapest insurance against silent performance decay.
- Low-resource languages and scripts. Synthetic text quality degrades steeply outside English, Mandarin, and a handful of high-resource languages. Across APAC – Thai, Vietnamese, Bahasa Indonesia, Tagalog, Khmer – quality gains usually come from in-language human labelling before any generator can be trusted.
The hybrid is usually the right answer
Most mature enterprise pipelines do not pick one. They use synthetic data to fill volume and cover rare events, human labelling to anchor ground truth on the decision boundary, and active-learning loops to route uncertain predictions back to humans. The cost gap – high-quality synthetic labels run roughly 10-100× cheaper per label than expert human annotation, depending on domain – makes the economics straightforward once you separate the two questions: "can this label be generated?" and "can this label be trusted without a human signing it off?"
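A back-of-envelope illustration of that economics. The per-label prices here are assumptions for the sake of arithmetic, not quotes – the point is the shape of the split, not the absolute figures:

```python
# Illustrative cost model for a mixed synthetic + human pipeline.
# Prices are assumed, not quoted; the synthetic/human ratio (~40x)
# sits inside the 10-100x range cited above.
human_cost_per_label = 2.00   # expert annotation, USD
synth_cost_per_label = 0.05   # generated label, USD

n_synthetic = 1_000_000   # volume and rare-event coverage
n_human     = 20_000      # decision-boundary ground truth

total = n_synthetic * synth_cost_per_label + n_human * human_cost_per_label
print(f"synthetic: ${n_synthetic * synth_cost_per_label:,.0f}")
print(f"human:     ${n_human * human_cost_per_label:,.0f}")
print(f"total:     ${total:,.0f}")
# An all-human run at the same total volume would cost ~$2.04M
# at these assumed prices - over 20x the blended pipeline.
```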
A practical pipeline we see working in production: pre-train or baseline-train on synthetic, fine-tune or RLHF on human labels from the target distribution, monitor with active learning, and re-label the delta. The ratio of synthetic to human shifts across domains – in AV perception it might be 1000:1, in legal-document classification it is closer to 1:1 – but the pattern holds.
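The "monitor with active learning, re-label the delta" step above reduces, in its simplest form, to uncertainty sampling: send low-confidence production predictions to the human queue and accept the rest as provisional machine labels. A minimal sketch, with illustrative data and a hypothetical confidence threshold:

```python
import numpy as np

def route_for_labelling(probs, threshold=0.6):
    """Uncertainty sampling: predictions whose top-class probability
    falls below `threshold` go to the human labelling queue; the rest
    are accepted as provisional machine labels.
    `probs` is an (n_samples, n_classes) array of softmax outputs."""
    confidence = probs.max(axis=1)
    to_human = np.where(confidence < threshold)[0]
    auto_ok = np.where(confidence >= threshold)[0]
    return to_human, auto_ok

# An illustrative batch of production predictions over 3 classes
probs = np.array([
    [0.95, 0.03, 0.02],   # confident  -> auto-label
    [0.40, 0.35, 0.25],   # uncertain  -> human queue
    [0.55, 0.30, 0.15],   # uncertain  -> human queue
    [0.70, 0.20, 0.10],   # confident  -> auto-label
])
to_human, auto_ok = route_for_labelling(probs)
```

Real deployments typically replace the raw max-probability score with margin or entropy measures and add per-class thresholds, but the routing logic is the same: human budget goes where the model is least sure.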
A decision framework
Before committing budget to either path, pressure-test four questions:
- Can the label be grounded in physics or a formal rule? If yes, simulation or programmatic labelling is likely faster and cheaper.
- Is the failure mode regulated or adversarial? If yes, assume human labels are required on the decision boundary – for audit, not just accuracy.
- Is your test distribution stationary? If no, build the human-in-the-loop muscle before you scale synthetic generation, or drift will silently eat your margins.
- What does your data-use contract say? Synthetic data derived from a licensed corpus can inherit the licence. Read the fine print before assuming synthetic means IP-clean.
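The four questions above can be encoded as a rough triage function. This is a sketch with hypothetical names – a starting point for scoping, not a substitute for the judgement the questions actually call for:

```python
from dataclasses import dataclass

@dataclass
class DatasetProfile:
    # The four pressure-test questions from the framework above
    physics_or_rule_grounded: bool   # label derivable from physics/rules?
    regulated_or_adversarial: bool   # regulated or adversarial failure mode?
    stationary_distribution: bool    # is the test distribution stationary?
    licence_clean: bool              # is the source corpus licence clear?

def recommend(p: DatasetProfile) -> list[str]:
    """Map the four answers to a non-binding starting strategy."""
    plan = []
    if p.physics_or_rule_grounded:
        plan.append("lead with simulation / programmatic labelling")
    if p.regulated_or_adversarial:
        plan.append("human labels on the decision boundary, with audit trail")
    if not p.stationary_distribution:
        plan.append("stand up human-in-the-loop drift monitoring first")
    if not p.licence_clean:
        plan.append("legal review of the source corpus before generating")
    return plan or ["hybrid default: synthetic volume + human ground truth"]

# Example: a fraud-detection dataset - not rule-grounded, adversarial,
# drifting distribution, licence already cleared
fraud = DatasetProfile(False, True, False, True)
```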
Where DataX Power fits
Our data practice operates exactly this split. Our annotation sub-service delivers high-quality human labelling – including strong coverage across APAC languages – while our AI practice designs the synthetic-generation, active-learning, and QA layers that make the combined pipeline cheaper and more defensible than either half on its own. If your team is trying to cost, scope, or debug an annotation strategy, that's the shape of conversation we have most weeks.