17.02.2025

Synthetic Data: A quick cure-all?

by Marianna Capasso and Payal Arora, Utrecht University

3 min read 

A 2024 New York Times investigation revealed that Big Tech companies like OpenAI, Meta, and Google are consuming data faster than it is being produced to feed their Large Language Models. If current trends continue, researchers predict that we will run out of high-quality data to fuel AI models before 2026. In their desperate hunt for data, these companies have ignored data regulations and navigated a legal grey area, gathering copyrighted data from across the internet without permission. This goes beyond intellectual property rights: we also have to grapple with the fair use of public data, especially as the majority world is fast coming online, bringing its data with it, while having far fewer resources to counter Big Tech. As more AI companies join the chase for new data to develop their products, to what lengths will they go, and at whose cost?

Against this background, we are witnessing the rise of synthetic data ‒ data that is algorithmically generated and could serve as a novel solution to this data crisis. In 2022 the MIT Technology Review listed synthetic data among the ‘Top 10 Breakthrough Technologies’ of the year, since it can be used to fill the gap in areas where real data is scarce. Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models, arguing that “the most valuable data will be the data we create, not the data we collect.” This shifts the focus from collecting to creating data in the pursuit of building AI tools and services. Synthetic data can also be a powerful tool for addressing the various biases found in existing datasets, but only if we aim not to reproduce but to recalibrate what counts as quality data through an equitable lens.

Geopolitics of Data Gaps

The lack of data is indeed a problem, often driven by geopolitics and cultural norms, not just by a lack of resources. For instance, sufficient data simply does not exist on women’s healthcare and education under the Taliban in Afghanistan, or in Sudan amid the ongoing humanitarian crisis. This matters because the UN estimates that two billion people, or 25% of the world’s population, now live in conflict-affected areas. Another reason for data deficiency is data’s sensitive status: the revelation of certain personal characteristics, such as sexual orientation, can be criminalized, as in the case of Uganda. The country’s Anti-Homosexuality Act of 2023 has enabled a climate of impunity for attacks against LGBTQ people. In such a context, being visible and heard and claiming one’s sexual orientation online comes with significant risk. According to Statista 2024, 64 countries criminalize homosexuality, most of them located in the Middle East, Africa, and Asia. Given the high risks and stakes involved, synthetic data could be better, safer, and faster than real data for such use cases. However, it is not a quick cure-all.

Synthetic Data Quality

The lack of data is just one side of the problem. The other, equally important side is the quality of data. Quality rests on measures of authentic representation, privacy and fairness, and data justice. How can we ensure that synthetic data is ‘high quality’ data to train AI models? Real advances in quality synthetic data generation rest on how we handle the trade-offs involved in addressing data deficits and biases, be it between inclusion and risk, or between privacy and utility.

Synthetic datasets aim to overcome the limitations of real datasets; yet, to be valuable, they must hold up in real-world scenarios. As LLMs are trained more and more on synthetic data, one phenomenon that may occur is model collapse: the quality and diversity of generative models decrease over successive generations, no matter how much data you add, and this may amplify biases. More data does not mean more insights. These limitations are likely to persist without critical attention to how we ethically capture the nuanced socio-cultural complexity found in real-world scenarios, and without integrating empirically better techniques, additional verification steps, and careful data curation into LLMs’ training data.
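To make the mechanism behind model collapse concrete, here is a deliberately simplified sketch of our own (an illustration, not the method of any study cited above): a ‘generative model’ that is just a Gaussian fitted to its training data. When each generation is trained only on samples produced by the previous one, estimation errors compound and the distribution’s diversity tends to decay.

```python
# A toy illustration of model collapse (an illustrative sketch, not a
# real LLM pipeline): each generation "trains" a Gaussian on samples
# drawn from the previous generation's model. Estimation noise
# compounds, and the learned distribution's diversity (its standard
# deviation) tends to shrink over generations.
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_generations = 50, 300

# Generation 0: "real" data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for gen in range(n_generations + 1):
    mu, sigma = data.mean(), data.std()  # fit the "model" to current data
    if gen % 50 == 0:
        print(f"generation {gen:3d}: std = {sigma:.4f}")
    # The next generation sees only synthetic samples from this model.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)
```

In any single run the standard deviation fluctuates, but its expected trend is downward: over enough generations the fitted distribution degenerates. This is a one-dimensional analogue of a generative model gradually losing the rare, diverse tails of its original training distribution.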

Fit for Purpose

How can we ensure that synthetic data is fit for purpose? Equitable use-case analysis is essential before training a model and generating synthetic data. There is no one-size-fits-all for synthetic datasets, as different domains (e.g., recruitment, fraud detection, healthcare) may require different solutions depending on the stakeholders involved, as well as methods that address the needs and values of marginalized groups who may be impacted by the use of synthetic data. To assess the openness and transparency of data ecosystems, we need coordination between the different public and private actors involved in data generation, who must share the information needed to address questions of fair data use. Instead of invoking synthetic data as a neutral ‘technical fix’ or a quick ‘cure-all’ for complex social problems like data scarcity, data protection, and discrimination, we need to meaningfully account for the varied cultural values that underpin and configure the contexts in which data are generated, so as to avoid perpetuating biases and inequalities.

The Inclusive AI Lab at Utrecht University is dedicated to helping build inclusive, responsible, and ethical AI data, tools, services, and platforms that prioritize the needs, concerns, experiences, and aspirations of chronically neglected user communities and their environments, with a special focus on the Global South. It is a women-led, cross-disciplinary, public-private stakeholder initiative co-founded by Payal Arora, Professor of Inclusive AI Cultures at Utrecht University’s Department of Media and Culture Studies, and Dr. Laura Herman, Head of AI Research at Adobe.

Dr. Marianna Capasso is a Postdoctoral Researcher in Philosophy & Ethics of Technology at Utrecht University, where she leads the Cross-Cultural AI Ethics Cluster at the Inclusive AI Lab. She is working on FINDHR, an EU-funded project that aims to facilitate the prevention, detection, and management of discrimination in algorithmic hiring. Marianna’s areas of expertise include the socio-ethical implications of synthetic data generation and use, the ethics of algorithmic discrimination, and the potential of AI to shape and transform the future of work. Prior to this, Marianna was a Postdoctoral Researcher at Erasmus University Rotterdam and at the Sant'Anna School of Advanced Studies in Pisa.

Prof. Dr. Payal Arora is a Professor of Inclusive AI Cultures at Utrecht University and co-founder of FemLab, a feminist futures-of-work initiative, and the Inclusive AI Lab, a Global South-centered data debiasing initiative. She is a digital anthropologist with two decades of user-experience research in the Global South. She is the author of award-winning books including ‘The Next Billion Users’ with Harvard University Press and ‘From Pessimism to Promise: Lessons from the Global South on Designing Inclusive Tech’ with MIT Press. Forbes called her the ‘next billion champion’ and the ‘right kind of person to reform tech.’

Technology, Employment and Wellbeing is an FES blog that offers original insights on the ways new technologies impact the world of work. The blog brings together views from tech practitioners, academic researchers, trade union representatives, and policy makers.
