by Marianna Capasso and Payal Arora, Utrecht University
3 min read
A 2024 New York Times investigation revealed that Big Tech companies like OpenAI, Meta, and Google are consuming data faster than it is being produced to feed their Large Language Models. If current trends continue, researchers predict that we will run out of high-quality data to fuel AI models before 2026. In their desperate hunt for data, these companies have ignored data regulations and navigated a legal grey area, gathering copyrighted data from across the internet without permission. This goes beyond intellectual property rights: we also have to grapple with the fair use of public data, especially as the majority world is fast coming online, and so is its data, while having far fewer resources to counter Big Tech. As more AI companies join the chase for new data to develop their products, to what lengths will they go, and at whose cost?
Against this background, we are witnessing the rise of synthetic data ‒ data that is algorithmically generated and could serve as a novel solution to this data crisis. In 2022 the MIT Technology Review listed synthetic data among the ‘Top 10 Breakthrough Technologies’ of the year, since it can be used to fill the gap in areas where real data is scarce. Gartner estimates that by 2030 synthetic data will completely overshadow real data in building AI models, arguing that “the most valuable data will be the data we create, not the data we collect.” This shifts the focus from collecting to creating data in the pursuit of building AI tools and services. Synthetic data can also be a powerful tool for addressing the various biases found in existing datasets, but only if we aim not to reproduce but to recalibrate what counts as quality data through an equitable lens.
The lack of data is indeed a problem, often driven by geopolitics and cultural norms, and not just a lack of resources. For instance, sufficient data simply does not exist in regions of conflict, whether on women’s healthcare and education under the Taliban in Afghanistan or amid the ongoing humanitarian crisis in Sudan. This matters, as the UN estimates that two billion people, or 25% of the world’s population, now live in conflict-affected areas. Another reason for data deficiency is sensitivity: the revelation of certain personal characteristics, such as sexual orientation, can be criminalized, as in the case of Uganda. The country’s Anti-Homosexuality Act of 2023 has enabled a climate of impunity for attacks against LGBTQ people. In such a context, being visible and heard and claiming one’s sexual orientation online comes with significant risk. According to Statista (2024), 64 countries criminalize homosexuality, most of them located in the Middle East, Africa, and Asia. Given the high risks and stakes involved, synthetic data could be better, safer, and faster than real data for such use cases. However, it is not a quick cure-all.
The lack of data is just one side of the problem. The other, equally important side is the quality of data. Quality rests on measures of authentic representation, privacy and fairness, and data justice. How can we ensure that synthetic data is ‘high quality’ data to train AI models? Real advances in quality synthetic data generation rest on how we handle the trade-offs involved in addressing data deficits and biases, be it between inclusion and risk, or between privacy and utility.
Synthetic datasets aim to overcome the limitations of real datasets, yet, to be valuable, they must hold up in real-world scenarios. As LLMs are trained more and more on synthetic data, one phenomenon that may occur is model collapse: the quality and diversity of generative models decrease over successive generations, no matter how much data is added, and this may amplify biases. More data does not mean more insights. These limitations are likely to persist without critical attention to how to ethically capture the socio-cultural and nuanced complexity found in real-world scenarios, and to how empirically better techniques, additional verification steps, and careful data curation can be integrated into LLMs’ training data.
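To make the mechanism concrete, here is a minimal toy sketch of model collapse ‒ our own illustration with made-up numbers, not drawn from the article or any specific study. A simple Gaussian “model” is repeatedly refitted to its own synthetic output, and its diversity, measured as the spread of the data, tends to shrink generation after generation even though the volume of data stays constant.

```python
import numpy as np

# Toy illustration of model collapse (hypothetical parameters, purely for intuition):
# each "generation" fits a Gaussian to the previous generation's synthetic output,
# then the next generation is trained only on samples from that fitted model.
rng = np.random.default_rng(42)

n = 50                                           # synthetic samples produced per generation
data = rng.normal(loc=0.0, scale=1.0, size=n)    # "real" data with standard deviation 1.0

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()          # fit the model to the current data
    data = rng.normal(mu, sigma, n)              # next generation sees only synthetic data
    if generation % 20 == 0:
        print(f"generation {generation:3d}: std = {sigma:.3f}")

# The printed standard deviation tends to drift towards zero: diversity shrinks
# over generations, even though the amount of training data never decreases.
```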
How can we ensure that synthetic data is fit for purpose? Equitable use-case analysis is essential before training a model and generating synthetic data. There is no one-size-fits-all for synthetic datasets, as different domains (e.g., recruitment, fraud detection, healthcare) may require different solutions depending on the stakeholders involved, as well as methods to address the needs and values of marginalized groups that may be impacted by the use of synthetic data. To assess the openness and transparency of data ecosystems, we need coordination between the different public and private actors involved in data generation, who need to share the information required to address questions of fair data use. Rather than invoking synthetic data as a neutral ‘technical fix’ or a quick ‘cure-all’ for complex social problems like data scarcity, data protection, and discrimination, we need to meaningfully account for the varied cultural values that underpin and configure the context in which the data are generated, so as to avoid perpetuating biases or inequalities.
The Inclusive AI Lab at Utrecht University is dedicated to helping build inclusive, responsible, and ethical AI data, tools, services, and platforms that prioritize the needs, concerns, experiences, and aspirations of chronically neglected user communities and their environments, with a special focus on the Global South. It is a women-led, cross-disciplinary, public-private stakeholder initiative co-founded by Payal Arora, Professor of Inclusive AI Cultures at the Department of Media and Culture Studies at Utrecht University, and Dr. Laura Herman, Head of AI Research at Adobe.
Dr. Marianna Capasso is a Postdoctoral Researcher in Philosophy & Ethics of Technology at Utrecht University, where she leads the Cross-Cultural AI Ethics Cluster at the Inclusive AI Lab. She works on the EU-funded project FINDHR, which aims to facilitate the prevention, detection, and management of discrimination in algorithmic hiring. Marianna’s areas of expertise include the socio-ethical implications of synthetic data generation and use, the ethics of algorithmic discrimination, and the potential of AI to shape and transform the future of work. Prior to this, Marianna was a Postdoctoral Researcher at Erasmus University Rotterdam and at the Sant'Anna School of Advanced Studies in Pisa.
Prof. Dr. Payal Arora is a Professor of Inclusive AI Cultures at Utrecht University and co-founder of FemLab, a feminist futures-of-work initiative, and the Inclusive AI Lab, a Global South-centered debiasing data initiative. She is a digital anthropologist with two decades of user-experience research in the Global South. She is the author of award-winning books including ‘The Next Billion Users’ with Harvard University Press and ‘From Pessimism to Promise: Lessons from the Global South on Designing Inclusive Tech’ with MIT Press. Forbes called her the ‘next billion champion’ and the ‘right kind of person to reform tech.’
Technology, Employment and Wellbeing is an FES blog that offers original insights into the ways new technologies impact the world of work. The blog brings together views from tech practitioners, academic researchers, trade union representatives and policy makers.