Synthetic data has generated considerable buzz in recent times, particularly since the advent of generative AI. Its use in privacy-enhancing technologies, its applications in simulation, and its ability to fill data gaps for underrepresented subjects make it an essential ingredient for data analytics.
Synthetic data, often referred to as artificially generated data, is typically produced by algorithms while retaining the statistical characteristics of real data. It is often used to validate mathematical models and to train machine learning models.
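A minimal sketch of this idea, using invented numbers purely for illustration: fit simple summary statistics to a "real" dataset, then sample synthetic values that preserve those statistics without reusing any individual record.

```python
import random
import statistics

# Hypothetical "real" transaction amounts (invented for illustration).
real_amounts = [120.5, 98.0, 143.2, 110.7, 87.9, 132.4, 101.1, 125.8]

# Fit simple summary statistics to the real data.
mu = statistics.mean(real_amounts)
sigma = statistics.stdev(real_amounts)

# Sample synthetic values that preserve those statistics, without
# copying any individual real record.
random.seed(42)
synthetic_amounts = [random.gauss(mu, sigma) for _ in range(1000)]

print(round(statistics.mean(synthetic_amounts), 1))   # close to mu
print(round(statistics.stdev(synthetic_amounts), 1))  # close to sigma
```

Real generators model far richer structure (correlations, categorical fields, time ordering), but the principle is the same: the synthetic sample mirrors the statistics, not the records.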
To make the concept of synthetic data understandable, let’s take an example from the real world. Think of synthetic data as a flight simulator for pilots. It’s not a plane or a real sky, but it’s made to feel very real so pilots can practice safely. Now imagine a real-world event like a football game happening live. This game is like actual data: it’s happening for real, with real players and real scores. Synthetic data is the practice game with no real crowd or stakes, while the real-world event is the big game day where everything counts.
In the modern world, one of the biggest use cases of synthetic data is combating financial crime and preventing fraud related to identity theft. Many ID verification providers that use biometric verification need large datasets of facial images to train their models, but they cannot access them due to privacy concerns. This is where synthetic data comes into play: instead of using real-life data, ID verification vendors train their algorithms on synthetic data.
But is it prudent, or even safe, to use synthetic data to train models that are later used to detect fake images and documents? Before answering this question, let’s look at some industry use cases.
Real-life examples of synthetic data application
Artificially generated data is widely used for simulations and theoretical values, for preserving the privacy and confidentiality of real data, and for testing and training fraud detection systems. Industries with real-life synthetic data applications include technology, healthcare, insurance, clinical research, and multinational financial services.
In healthcare, synthetic data is used to build applications like patient dashboards without accessing patients’ sensitive information, ensuring privacy and enabling the testing of edge cases. For example, Anthem, a health insurance company, is working with Google Cloud to generate synthetic data for use in dashboards while ensuring data confidentiality.
Waymo employs synthetic data to train and test its autonomous vehicles, enabling control over complex scenarios and weather conditions that are challenging to capture with real-world data.
Amazon is making use of synthetic data to train Alexa in a new language with tools that generate grammar and sentences, optimizing natural language processing models for precision and effectiveness.
How Well Can Synthetic Data Be Used in Fighting Financial Crime?
The pros and cons of using synthetic data differ according to each industry’s dynamics. As far as fighting financial crime is concerned, this artificially generated data presents both challenges and opportunities.
Traditional financial institutions like big banks use rule-based techniques such as AML watchlist screening to detect money laundering, coupled with well-defined processes for spotting anomalies in transactions. However, building tools to detect anomalies in mobile money transactions is a challenge, primarily because of the lack of available data to train detection models.
Here, too, synthetic data can fill the gap.
An example of using synthetic data to prevent money laundering through mobile services is PaySim, a financial simulator that models the lifecycle of transactions across multiple systems and enables thorough testing of payment platforms.
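A simulator of this kind can be sketched, in heavily simplified form, as a loop that emits synthetic mobile-money transactions and injects a small fraction of labelled fraudulent ones. All field names, rates, and amounts below are invented for illustration; this is not PaySim’s actual schema or model.

```python
import random

random.seed(7)

def simulate_transactions(n, fraud_rate=0.02):
    """Generate n synthetic mobile-money transactions.

    A small fraction are labelled as fraud and given the kind of
    pattern a detector should learn (e.g. draining an account in
    one large transfer). Purely illustrative.
    """
    txns = []
    for i in range(n):
        is_fraud = random.random() < fraud_rate
        if is_fraud:
            # Fraud pattern: near-total balance transfer.
            balance = random.uniform(1_000, 10_000)
            amount = balance * random.uniform(0.9, 1.0)
            tx_type = "TRANSFER"
        else:
            balance = random.uniform(100, 10_000)
            amount = random.uniform(5, 500)
            tx_type = random.choice(["PAYMENT", "CASH_OUT", "TRANSFER"])
        txns.append({
            "id": i,
            "type": tx_type,
            "amount": round(amount, 2),
            "old_balance": round(balance, 2),
            "is_fraud": is_fraud,
        })
    return txns

data = simulate_transactions(10_000)
print(sum(t["is_fraud"] for t in data))  # roughly 2% of 10,000
```

Because every record carries a ground-truth `is_fraud` label, a detection model can be trained and evaluated on this data without touching any real customer transaction.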
Furthermore, accessing real data can be difficult due to privacy concerns, legal restrictions, and technical issues related to size, diversity, and interpretation. Enabling innovation and building data-dependent services while upholding the fundamental principles of privacy and fair data access is a challenge that synthetic data can help overcome, because it allows the sharing of structures, patterns, and content similar to real data without the risks of exposing real data.
Additionally, synthetic data can be used to replicate real-life events, creating corresponding data that enhances the training of ML models.
As per a report by the Alan Turing Institute, “Synthetic data can be shared between companies, departments, and research units for synergistic benefits,” and “By using synthetic data, organizations can store the relationships and statistical patterns of their data, without having to store individual level data”.
The detection of financial crime involves analyzing extensive data to identify fraudulent activities, unusual transactions, and suspicious behavior. Synthetic data can significantly improve this analysis by offering new insight and enhancing systems’ capability to detect fraud, bolstering their overall performance.
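As a minimal, hypothetical illustration of training a detector on synthetic data: learn a normal range from synthetic transaction amounts, then flag transactions that fall far outside it. This is a simple z-score rule with invented numbers, not a production AML technique.

```python
import statistics

# Hypothetical synthetic "normal" transaction amounts used for training.
synthetic_training = [50, 55, 48, 60, 52, 47, 58, 53, 49, 56]
mu = statistics.mean(synthetic_training)
sigma = statistics.stdev(synthetic_training)

def is_suspicious(amount, z_threshold=3.0):
    """Flag a transaction whose amount is far from the learned norm."""
    return abs(amount - mu) / sigma > z_threshold

incoming = [54, 51, 980, 57]  # one obviously anomalous amount
flags = [a for a in incoming if is_suspicious(a)]
print(flags)  # [980]
```

Real systems use far richer features (counterparties, velocity, device signals) and models, but the workflow is the same: learn "normal" from data you are allowed to use, then score live traffic against it.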
Why Can Synthetic Data Often Be Unreliable?
There are primarily two reasons. First, if synthetic data becomes good enough to imitate real-life data, it can also assist criminals in committing synthetic identity fraud. Secondly, synthetic data created by generative AI can be faulty due to errors in the handling of the real data used to generate it. Relying on such artificially generated datasets to train AML models for detecting anomalies in financial transactions is therefore not always a prudent approach.
Because synthetic data is often produced based on assumptions, it may not always reflect the diversity and variability of real data. A lack of diversity simply reinforces existing tools and techniques while duplicating the biases inherent in the original data.
Nor can it be ignored that synthetic data may fail to capture the complexity and detail of real data. Moreover, ensuring the privacy and confidentiality of synthetic data can be challenging, particularly when it is prone to misuse.
Regulating the use of Synthetic data
With the ever-increasing risk of financial crime in fintech and e-commerce, often originating through social networking websites, authorities are now prompting financial institutions to explore novel technologies and strategies to enhance the identification and mitigation of financial crime risks.
The future of cybersecurity is increasingly influenced by the development of generative adversarial networks (GANs) for creating synthetic data. A GAN consists of two parts: a generator and a discriminator. The two compete, with the generator creating synthetic data and the discriminator trying to detect it. Organizations use GANs to produce synthetic data that counters fraud, enhances privacy, and preserves ethical standards.
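The adversarial idea can be sketched in toy form. Here the "generator" is a single number (the mean of its output) and the "discriminator" a single threshold, updated in alternation; every value below is invented for illustration. This is a caricature under strong simplifying assumptions, not a real GAN, which would use neural networks trained by gradient descent.

```python
import random

random.seed(0)
REAL_MEAN = 5.0  # the statistic of the "real" data (invented for illustration)

def discriminator_score(x, threshold):
    """Higher score = looks more 'real' (closer to the learned threshold)."""
    return 1.0 / (1.0 + abs(x - threshold))

gen_mean = 0.0    # the generator's single learnable parameter
threshold = 0.0   # the discriminator's single learnable parameter

for _ in range(2000):
    # Discriminator step: drift its notion of "real" toward real samples.
    real = random.gauss(REAL_MEAN, 1.0)
    threshold += 0.01 * (real - threshold)

    # Generator step: nudge gen_mean in whichever direction fools the
    # discriminator more (the same noise is reused for a fair comparison).
    noise = random.gauss(0.0, 1.0)
    base = discriminator_score(gen_mean + noise, threshold)
    up = discriminator_score(gen_mean + 0.1 + noise, threshold)
    gen_mean += 0.05 if up > base else -0.05

print(round(gen_mean, 1))  # should end up near REAL_MEAN
```

The competition is what matters: as the discriminator gets better at spotting fakes, the generator is pushed to produce output statistically closer to the real data.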
As discussed above, tools like synthetic data can be used both for good and for ill, as is the case with AI more broadly. Efficient, welfare-focused real-life applications depend on multiple factors, including regulatory support from authorities grounded in thorough research.
Government organizations, think tanks, and financial-sector bodies are exploring the potential harms and benefits of synthetic data across different sectors. They are investigating issues such as data confidentiality, bias mitigation, and improvements in AI model resilience, while also weighing challenges like ethical concerns and data quality.
The Royal Society, in collaboration with The Alan Turing Institute, published a report, “Synthetic Data Survey”, researching the use and development of synthetic data in the modern world. The research concluded that synthetic data can be a promising technology for privacy and confidentiality, but that it can pose serious problems if not used properly. Before adopting a new technology, people must learn more about it and keep in mind how it can affect society.
The National Institute of Standards and Technology (NIST) published research on “Issues in Synthetic Data Generation for Advanced Manufacturing”. This research explored the challenges and issues of synthetic data generation and suggested desirable characteristics for the creation of synthetic data.
JP Morgan Chase & Co., the American multinational financial services firm, published a paper highlighting the growing concern about the generation of synthetic data in the financial domain and the challenges associated with synthetic data in other domains.
While private- and public-sector organizations are bringing the uses of, and challenges associated with, synthetic data into focus, governments are also taking initial steps toward a regulatory approach. The Financial Conduct Authority (FCA) of the UK established a Synthetic Data Expert Group (SDEG) in March 2023 to explore the potential of synthetic data in the financial sector. The group comprises a diverse panel of 21 specialists from the financial sector, public agencies, data and tech providers, and consumer groups. Recently, the FCA shared an update on the work of the SDEG.
While the UK’s approach is more specific, other jurisdictions may also consider regulating the use of synthetic data, either separately or as part of broader AI (artificial intelligence) regulations. Sooner or later, state authorities will need to issue guidance on the proper use of this tool, developed in consultation with industry. Otherwise, reliance on this technology in the absence of standards can have undesirable consequences.