Harnessing synthetic data: Advancing innovation with privacy-enhanced insights

Synthetic data has emerged as an innovative way to leverage high-quality data without compromising customer trust or causing any agency angst.

By Ryan Jackson

In today’s data-driven world, businesses across industries are increasingly turning to data analytics to optimize operations, reduce risks and offer personalized services to customers. The banking industry is no exception. However, the collection and processing of sensitive customer data also raises concerns around privacy and security, and regulators and customers expect banks to have proper controls in place. Driven by these concerns, synthetic data has emerged as an innovative way to leverage high-quality data without compromising customer trust or causing any agency angst.

What is synthetic data?

Synthetic data refers to artificially generated data that mimics the statistical properties of real-world data. It is created by learning patterns and relationships from existing real-world data and then generating new data points that reflect these patterns. Unlike real-world data, synthetic data does not contain any sensitive personal information of actual customers, making it a safer alternative for use in the highly regulated banking industry.
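To make the learn-then-generate idea concrete, here is a minimal Python sketch. It illustrates the concept under stated assumptions and does not describe any bank's or vendor's actual method. The "real" data is itself simulated, the two columns (account balance and monthly spend) and all function names are hypothetical, and the generator is a simple two-variable Gaussian fit. Production tools typically use far richer models.

```python
import math
import random
import statistics


def make_stand_in_real(n, rng):
    """Simulated stand-in for real records: (balance, monthly_spend). Hypothetical columns."""
    rows = []
    for _ in range(n):
        balance = rng.gauss(5000, 1500)
        spend = 0.3 * balance + rng.gauss(0, 300)  # spend is correlated with balance
        rows.append((balance, spend))
    return rows


def fit_gaussian_2d(rows):
    """Learn the means, standard deviations and correlation of the two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "std_x": std_x, "std_y": std_y,
            "rho": cov / (std_x * std_y)}


def sample_synthetic(params, n, rng):
    """Generate new rows that follow the learned pattern without copying any real row."""
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["std_x"] * z1
        # Mix z1 and z2 so the synthetic columns keep the learned correlation.
        y = params["mean_y"] + params["std_y"] * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    rng = random.Random(7)
    real = make_stand_in_real(5000, rng)
    learned = fit_gaussian_2d(real)
    synthetic = sample_synthetic(learned, 5000, rng)
    print("real correlation:     ", round(learned["rho"], 3))
    print("synthetic correlation:", round(fit_gaussian_2d(synthetic)["rho"], 3))
```

Running the script prints the correlation measured in the stand-in data next to the correlation measured in the generated data. The two should be close, which is the sense in which synthetic data "mimics the statistical properties" of its source.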

The use of synthetic data is gaining traction in the banking industry, particularly for its potential to test software and applications, enhance training of machine learning models, and create large and diverse datasets. By using synthetic data, banks can avoid exposing sensitive customer information to potential breaches or misuse, while still reaping the benefits of data-driven insights.

There are several ways banks can gain access to synthetic data. Banks can develop their own synthetic data generation capabilities internally. This involves setting up a new team or using existing resources with expertise in data generation techniques. Banks can also use third-party data generation platforms that specialize in creating synthetic data or data marketplaces that offer pre-built synthetic datasets. Lastly, banks can seek to establish partnerships with companies that specialize in synthetic data generation or participate in industry collaborations or consortiums that focus on generating synthetic data for common use cases.

How can banks employ synthetic data?

Vendor due diligence. Synthetic data can play an important role in how banks evaluate third-party vendor technologies. To effectively validate solutions, banks need to use high-quality data instead of relying solely on “dummy” (or made up) data, which can often lead to subpar validation and outcomes. Synthetic data can mimic various types of data, such as customer profiles, transactions or user behavior, and it allows banks to test vendor solutions in a realistic environment. As discussed by Madhu Narasimhan of Wells Fargo, “Synthetic data allows us to carry out our experiments at scale.” Using synthetic data, banks can assess how well the technology performs with different data inputs and complex use cases. This allows banks to more adequately test solutions to gain firsthand insights into a particular software’s capabilities. Synthetic data can also be used to evaluate the effectiveness of fraud detection algorithms in a safe and controlled environment, helping banks improve the accuracy and speed of fraud detection.

Model training. Another area where synthetic data could prove useful for banks is in training machine learning models, particularly those used for fraud detection. Fraudulent activities can have a significant impact on a bank’s bottom line and erode customer trust. By training machine learning models on synthetic data, banks can help those models better identify patterns and anomalies that could indicate fraudulent behavior once they are put to work on actual customer data. JPMorganChase, for example, has used synthetic data to train fraud detection models. Synthetic data can also be used to create large and diverse datasets for training credit risk models without exposing any customer information. Banks rely on accurate credit risk models to make informed lending decisions and manage their loan portfolios effectively. By using synthetic data, banks can improve the accuracy and fairness of credit risk models, while also reducing the risk of bias and discrimination.

Advantages and disadvantages of synthetic data

The benefits of synthetic data for banks are numerous:

First, synthetic data can be generated quickly and at scale, enabling banks to create large datasets for testing new software and applications and training machine learning models. This can accelerate the typical development cycle and get products to market faster.
Second, synthetic data does not contain any sensitive information, making it a safe alternative for data sharing and analysis without compromising customer privacy.
Third, using synthetic data is often less expensive than acquiring and storing real data. Banks can reduce costs associated with data collection, storage and analysis, enabling them to optimize operations and improve efficiency.
Finally, synthetic data can be leveraged by banks to generate datasets that exhibit greater diversity and representativeness. This practice could assist banks in enhancing the accuracy and effectiveness of their machine learning models, ultimately resulting in improved predictions and outcomes.

However, synthetic data also comes with potential downsides and risks. One potential drawback is reliability: synthetic data may not accurately reflect the complexity and variability of real-world data, which could lead to biased or inaccurate machine learning models.

Moreover, synthetic data does not entirely eliminate bias. It is created by learning from existing data, which means that any biases or errors in the existing data could also be replicated in synthetic data. Another challenge is that the regulatory framework around synthetic data is currently limited, with no public guidance issued, which could make it difficult for banks to put synthetic data into operation.

What can banks do?

Banks embarking on the adoption of synthetic data should approach the decision with careful consideration and a strategic mindset. Ideally, the process should begin with identifying specific use cases where synthetic data can bring value, such as fraud detection, credit risk assessment or customer analytics. Next, banks should assess their data needs, evaluating the volume, variety and quality of data required for effective model training and validation. Scalability, performance and customization capabilities of synthetic data generation methods should also be considered.

To validate the viability and effectiveness of synthetic data, banks should develop pilot projects or proof-of-concept initiatives. These projects will help assess the performance of models trained on synthetic data against those trained on real data, measuring accuracy, decision-making capabilities, and operational efficiencies. Banks should also continuously monitor and evaluate how their synthetic data performs, which can improve its quality and relevance over time. Collaboration with industry peers, academic institutions and synthetic data experts can help banks stay at the forefront of developments, sharing insights and learning best practices for responsible adoption of synthetic data in the banking sector.
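A pilot comparison of this kind can be sketched in a few lines of Python. The example below is a simplified illustration, not a recommended methodology. It assumes a simulated stand-in for real labeled transactions, two hypothetical features (transaction amount and transactions per hour), a deliberately basic per-class Gaussian generator, and a nearest-centroid classifier chosen only to keep the code short. One model is trained on the stand-in real data and another on synthetic data, and both are scored on the same held-out real records.

```python
import math
import random
import statistics


def make_stand_in_real(n, rng):
    """Simulated stand-in for real transactions: (amount, txns_per_hour, is_fraud)."""
    rows = []
    for _ in range(n):
        if rng.random() < 0.5:  # balanced classes, purely for simplicity
            rows.append((rng.gauss(120, 30), rng.gauss(5, 1.5), 1))
        else:
            rows.append((rng.gauss(50, 20), rng.gauss(2, 1), 0))
    return rows


def synthesize(rows, n_per_class, rng):
    """Fit an independent Gaussian per feature and class, then sample new labeled rows."""
    out = []
    for label in (0, 1):
        features = [r[:2] for r in rows if r[2] == label]
        stats = [(statistics.fmean(col), statistics.stdev(col)) for col in zip(*features)]
        for _ in range(n_per_class):
            out.append(tuple(rng.gauss(m, s) for m, s in stats) + (label,))
    return out


def fit_centroids(rows):
    """'Train' a nearest-centroid model: one mean feature vector per class."""
    centroids = {}
    for label in (0, 1):
        features = [r[:2] for r in rows if r[2] == label]
        centroids[label] = tuple(statistics.fmean(col) for col in zip(*features))
    return centroids


def accuracy(centroids, rows):
    """Share of rows whose nearest centroid matches the true label."""
    correct = 0
    for *features, label in rows:
        predicted = min(centroids, key=lambda k: math.dist(features, centroids[k]))
        correct += predicted == label
    return correct / len(rows)


if __name__ == "__main__":
    rng = random.Random(11)
    real = make_stand_in_real(2000, rng)
    train, holdout = real[:1500], real[1500:]
    synthetic = synthesize(train, 750, rng)
    print("trained on real:     ", accuracy(fit_centroids(train), holdout))
    print("trained on synthetic:", accuracy(fit_centroids(synthetic), holdout))
```

The point of the design is that both models are judged on the same held-out real records, so any gap between the two scores reflects what was lost by training on synthetic data. In an actual pilot, the generator, the model and the metrics would be the bank's own, and the same comparison could be repeated over time as part of ongoing monitoring.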

By developing an adoption framework based on these considerations, banks can successfully adopt and leverage synthetic data to enhance their data-driven decision-making processes, manage risks, and drive innovation across the industry.