Synthetic data: A safer, smarter solution for training AI?

In episode two of Voices in Tech, CEO Harry Keen introduces us to his synthetic data startup, Hazy, and we hear from PwC’s Marcus Hartmann on how companies can safely use their data to create value. Listen to the audio.

March 19, 2024

Harry Keen, CEO of Hazy, shares how his startup is helping companies use synthetic data to strike the right balance between security and innovation.

About Voices in Tech

Hosted by series executive producer Shana Ting Lipton of PwC, Voices in Tech brings emerging tech to life through the stories of the companies solving big business problems at the intersection of technological acceleration and human innovation.

Businesses may be sitting on a gold mine of data that’s crucial for innovation. But it can be tricky to make use of it securely—whether it’s sharing sensitive information with prospective third-party software vendors or using it to train AI. Increasingly, companies are turning to a relatively novel enterprise solution to end the stand-off between data compliance and innovation: synthetic data, which is artificially created but often based on real-world datasets. In episode two of Voices in Tech, Harry Keen, the CEO and cofounder of synthetic data pioneer Hazy, talks about what it was like being early to market with a cutting-edge resource that most organizations didn’t fully grasp, and how companies should approach it going forward; Marcus Hartmann, chief data officer for PwC Germany, adds his insights.

Guest: Harry Keen, CEO and cofounder, Hazy

Featuring: Marcus Hartmann, Partner, PwC Germany, and Chief Data Officer, PwC Germany and Europe

Harry Keen: It definitely was an uphill battle, and six years ago when we started the business, not a single person knew what synthetic data was. There’s been a huge transition. There’s regulators writing about it. There’s companies like Gartner advising their customers about it. There’s data conferences that have entire sections dedicated to synthetic data. So, I think we’re at a really exciting moment for synthetic data. It’s come a long way.

Shana Ting Lipton: That’s Harry Keen, CEO and cofounder of Hazy. Today, we hear the story of his startup—a synthetic data pioneer—and how it uses generative AI to help companies make the most of their data while enhancing privacy.

I’m your host, Shana Ting Lipton, with PwC. And this is Voices in Tech, from our management publication, strategy+business, bringing you stories from the intersection of technological acceleration and human innovation.

In 2006, Netflix released an anonymous dataset of movie ratings by half a million of its users, with a challenge to the public: “Improve on our movie recommendation algorithm.” Researchers at the University of Texas may not have won the contest, but they managed to reidentify individual records [of Netflix users], using IMDb as a background source.

Data protection laws and technology have come a long way since then, so something like this could never happen today…or could it? Here’s Marcus Hartmann, partner at PwC Germany and chief data officer for PwC Germany and Europe.

Marcus Hartmann: Traditional types of anonymized data such as mask[ing] will never bring full safety, and this is what we saw in the Netflix case, right? A risk of reidentification always remains, depending on the type of data, functionality, use….

The traditional anonymization technologies at this time [were] a good starting point, but today also has different technology capabilities, and requires definitely a new way of anonymiz[ing] or even synthesiz[ing] information. But with synthetic data we can really go the next step into the future in using data really cross-industry, cross–use case, while preserving the sensitivity of information and the sovereignty of the data owner.
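The reidentification risk Hartmann describes can be illustrated with a toy linkage attack. This is a minimal sketch on made-up records, not the method the researchers used on the Netflix data: the user IDs, movie titles, and the simple overlap score are all assumptions chosen for illustration. It shows how a handful of publicly known ratings can single out one "anonymous" record.

```python
# Toy linkage attack on hypothetical data (not the Netflix or IMDb records).
# Each anonymized record is a set of (title, rating) pairs with the name removed.
anonymized = {
    "user_0417": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4), ("Movie D", 1)},
    "user_0932": {("Movie A", 3), ("Movie E", 5), ("Movie F", 4)},
    "user_1550": {("Movie B", 2), ("Movie C", 4), ("Movie G", 5)},
}

# Ratings the same person posted publicly under their real name (assumed).
public_profile = {("Movie A", 5), ("Movie C", 4), ("Movie D", 1)}


def best_match(public, records):
    """Rank anonymized records by how many public ratings they share."""
    scores = {pid: len(public & ratings) for pid, ratings in records.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0], ranked[1]


top, runner_up = best_match(public_profile, anonymized)
# A large gap between the best and second-best score suggests a unique match.
print(top, runner_up)
```

Removing names did not help here because the combination of ratings itself acts as a fingerprint.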

Keen: So synthetic data is machine learning–generated data that ultimately doesn’t contain any of the real information of a source dataset, but it sort of mimics all the properties of that source dataset. It doesn’t contain any sort of sensitive information but is statistically representative of that source dataset. And that means you can go and use it for machine learning and AI training and experimentation.
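As a rough illustration of "statistically representative" data, the sketch below fits a very simple model to a source table and then samples new rows from the model instead of copying real ones. It is not Hazy's method: the column names, the source rows, and the two-step categorical model are assumptions for illustration, and a model this simple offers no privacy guarantee on its own.

```python
import random
from collections import Counter, defaultdict

# Hypothetical source rows: (account_type, channel). Not real customer data.
source = (
    [("current", "app")] * 50
    + [("current", "branch")] * 10
    + [("savings", "app")] * 15
    + [("savings", "branch")] * 25
)


def fit(rows):
    """Estimate P(first column) and P(second column | first column) from counts."""
    first = Counter(a for a, _ in rows)
    second_given_first = defaultdict(Counter)
    for a, b in rows:
        second_given_first[a][b] += 1
    return first, second_given_first


def sample(model, n, rng):
    """Draw n synthetic rows from the fitted distributions."""
    first, second_given_first = model
    out = []
    for _ in range(n):
        a = rng.choices(list(first), weights=list(first.values()))[0]
        cond = second_given_first[a]
        b = rng.choices(list(cond), weights=list(cond.values()))[0]
        out.append((a, b))
    return out


rng = random.Random(0)
synthetic = sample(fit(source), 10_000, rng)

# Compare one statistic between the source and the synthetic table.
target = ("current", "app")
real_share = sum(1 for r in source if r == target) / len(source)
synth_share = sum(1 for r in synthetic if r == target) / len(synthetic)
print(f"real: {real_share:.3f}  synthetic: {synth_share:.3f}")
```

The synthetic rows are drawn from the learned distribution, so aggregate statistics stay close to the source while no row is a direct copy of a specific source record.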

Lipton: Synthetic data has been around for decades but recently gained traction in enterprise, thanks to advances in AI and machine learning, which allow synthetic data to be generated at scale. Big tech companies like Microsoft and Amazon have led the way in adoption. Among startups, London’s Hazy was early to market when it introduced its synthetic data product to businesses in 2017. CEO Harry Keen recalls how it all started.

Keen: So, I met James Arthur at Opendesk. Opendesk was an open-source furniture company, trying to be the GitHub for furniture. We were both working in the tech team. He was actually CTO. And one of our challenges was working with remote software developers and giving them access to our customer database to build software on top of.

Surely, there must have been a quick way to produce this sort of anonymous version of your kind of core database that allowed your developers to go and work on, without having to give them access to the, sort of, crown jewels.

So, James and I started working together, and we knew Dr. Luke Robinson. He’s very much the sort of ex-Cambridge quantum physicist character, brings in a very academic angle to the business, and brought in some really new, fresh ideas for how we could solve this problem. And James and I were very much on the executional side, both on the technical and commercial.

Lipton: They got to work quickly, creating a prototype that ultimately helped them get into an accelerator, the UK’s CyLon. This early model was based on redacting data or data masking, which is different from the synthetic data solution that’s central to their business today. And the founders soon came to an important realization that completely reshaped their core product.

Keen: The tidal shift we were seeing was a big interest in machine learning and data science. This is back in 2017, so this is way before the generative AI revolution—a general global shift of moving towards more data-driven insights, and everyday businesses were able to start taking advantage of it. However, what we realized was that access to their data was actually becoming a really big bottleneck because of the sensitivity of the information there. And these techniques around anonymization, because they, (a) aren’t particularly private, as it turns out, and (b) destroy a lot of the information, were not capable of serving those more advanced data science, machine learning, AI use cases, where you really need high-fidelity data. And that’s where synthetic data came in.

You can get much more utility and fidelity from the private data output than you can with the traditional anonymization techniques—and at the same level, or even better levels, of privacy guarantees as well.

So, in theory, at least, synthetic data should be a replacement for, and I think it is becoming a replacement for, many of these old techniques, so: data masking, data anonymization. And where big enterprises are using those today, they can absolutely be replaced by synthetic data.

Lipton: Being early to market meant that Hazy was faced with the challenge of educating customers on what appeared to be a novel privacy-preserving solution. But, around this time, companies were busy preparing to comply with Europe’s new General Data Protection Regulation, which threatened tough penalties for organizations that violated its standards. Marcus Hartmann.

Hartmann: So GDPR really boosts the awareness about data sovereignty and boosts, of course, also the technology developments of new privacy-preserving technologies, and synthetic data is a part of this. These kinds of developments really evolved in other non-European territories, like in the US, but also in Asia-Pacific markets. Without the GDPR, I would say, we would never have seen this kind of growing number of solutions around synthetic data.

Keen: What we realized quite early on is that synthetic data on its own wasn’t going to be able to produce sufficiently anonymous data under the definitions in the GDPR.

So, one of the early developments in our technology and in our technique was combining differential privacy—which is effectively a technique for adding clever noise into a dataset, so it makes it more difficult to reidentify individuals in that information—and synthetic data.
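The "clever noise" Keen mentions can be sketched with the textbook Laplace mechanism applied to a single count. This is an assumption-laden illustration rather than a description of how Hazy combines differential privacy with synthetic data: the function names, the count of 1,000, and the epsilon values are hypothetical. A smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, built as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with noise scaled to sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(samples, truth):
    """Average distance between the noisy releases and the true value."""
    return sum(abs(s - truth) for s in samples) / len(samples)


rng = random.Random(42)
true_count = 1_000  # hypothetical number of matching customers

# Smaller epsilon: stronger privacy, noisier answers.
strong = [private_count(true_count, 0.1, rng) for _ in range(5_000)]
# Larger epsilon: weaker privacy, answers closer to the truth.
weak = [private_count(true_count, 2.0, rng) for _ in range(5_000)]

strong_err = mean_abs_error(strong, true_count)
weak_err = mean_abs_error(weak, true_count)
print(f"epsilon=0.1 error: {strong_err:.2f}  epsilon=2.0 error: {weak_err:.2f}")
```

The trade-off is visible in the two error figures: the noise that makes it harder to infer whether any one individual is in the data also makes each released number less precise.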

Lipton: Hazy fine-tuned its technique at University College London, one of the world’s leading institutions for foundational research in AI. That’s where the startup received spinout status, giving it access to top data scientists and funding. Keen and his team were onto something, but they still needed to identify the right product market fit.

Keen: So, we came to the conclusion after all of this work that, actually, financial services is the perfect market for this product. They’ve got huge data stores of really sensitive information. They’ve got legacy infrastructure. They’ve got governance controls. They’ve got regulators breathing down their necks, making sure they treat their data really carefully. And on top of that, they have a real pressure to innovate from challenger banks coming through and stealing market share. So, this sort of tension between security and innovation is felt really, really strongly in that sector. And, actually, synthetic data can come in and solve that problem really neatly.

Lipton: They got the chance to apply their solution with early customers like [the UK’s] Nationwide Building Society, which also became an investor in 2018, and the UK Ministry of Defence.

Keen: This groundswell of data privacy was sort of, seemed to be cresting into a bit of a tidal wave, with the GDPR hitting the airwaves. Privacy was a hot topic, and we just happened to be right in the eye of the storm at that moment. And Microsoft took notice, and we ended up winning, it was called, their Innovate AI competition, that came with a million dollars of funding and a bunch of Azure credits, which really helped us develop our products. But that was a really powerful moment for us.

Lipton: So, how has the startup been able to help large organizations transform and innovate?

Keen: So, a great example we love to use is Nationwide Building Society. So, they are a big, big financial-services organization. They’ve got huge IT infrastructure, and certainly, as a small startup, we’re not going to come in and suddenly sweep across their entire data-provisioning system and change it overnight.

They identified this area where they had the ability to test synthetic data. This was, sort of, four or five years ago now, where they were trying to evaluate third-party software vendors.

The challenge with that process is always that provisioning datasets to these untested, sometimes very small, third parties, is a really lengthy process. It’s six, nine months, sometimes 12 months–plus, and that just kills the sort of innovation cycle and makes it really difficult for these big businesses to adopt new technologies. So that was a really focused problem where we could go in, we could create a synthetic transaction dataset that they could then provision really quickly, and we’re talking in a matter of days, into sandbox environments, and they could invite third-party vendors in to quickly test and evaluate the efficacy of what they were claiming to be able to do.

Lipton: Around 2021, VCs became more bullish on synthetic data startups—which seemed to be everywhere—from Gretel.ai to Mostly AI.

Hartmann: The practice around building and creating synthetic data is a very, I would say, complex approach, and this is also the reason why we see the growth of so many new startups and research companies, because the topic is very complex, as I said. It’s a great technology, and I see a lot of new use cases in the future.

In the mobility data space, you’re using so [much] different information for a completely new business model, and synthetic data enable[s] all the parties to join this ecosystem to share information, which enables [them] to create a new innovative business model, or business models, and really also enables the participants really to create new products and services.

Keen: It’s great, firstly, that a category has formed around this technology. Certainly, when we were starting, we felt a little bit like we were just banging the drum all on our own. And actually having more competitors in the space, it really feeds the perception amongst customers that this is absolutely a technology that’s going to be transformative in the way they do their business.

Lipton: In 2022, after OpenAI introduced the public to ChatGPT, suddenly everyone was talking about generative AI, which is capable of producing synthetic data. Hang on—how is this different from what Hazy is doing?

Keen: Synthetic data is a generative AI technology. It’s in the broader family that includes large language models. If you’re really trying to get accurate synthetic data that’s very representative of your source information, for whatever use case, you want to use that information to train a generative model. And actually some of the generative adversarial networks or Bayesian approaches that we use are actually very good at that. And, actually, large language models are perhaps more suited to sort of free text information and less-structured information.

Our customers also value things like performance and being able to run these models quickly and being able to run them in their own infrastructure and on not really expensive infrastructure, as well. So, there’s a whole host of additional factors that you’d need to consider when you’re trying to deploy this, actually, with a customer.

Lipton: Looking ahead, there’s still work to be done to educate companies at different levels of synthetic data maturity.

Keen: It definitely was an uphill battle, and six years ago, when we started the business, not a single person knew what synthetic data was. There’s been a huge transition. There’s regulators writing about it, there’s companies like Gartner advising their customers about it. There’s data conferences that have entire sections dedicated to synthetic data.

The next stage is for businesses to start baking this a bit more deeply into their data-provisioning systems, in just the way they really use data across their whole business, and start moving it away from, sort of, more isolated use cases into the more general sort of horizontal IT sort of stack.

Hartmann: You can do it in the sandbox. You can do it in the MVP context, but if you want to leverage the power of synthetic information, synthetic data has really the potential to be set in the middle of data management functions and procedures.

Keen: So, I think we’re at a really exciting moment for synthetic data. It’s come a long way. But now the education is: how do we really expand this across the enterprise and across our business to get real value out of it? How do I scale this up in a way that makes it really valuable to my business?

Lipton: Thanks for listening, and stay tuned for more episodes of Voices in Tech, brought to you by PwC’s strategy+business.
