
Building a Competitive Data Moat Starts with Privacy

As we’ve mentioned in our blog on how we invest in R&D at Georgian, widely available ML, NLP, and speech processing tools are making it easier for companies to build rudimentary AI systems. 

This is a challenge for companies that have spent the last few years building a moat around their in-house ML and NLP technologies, as the technology itself is no longer a sustainable differentiator. At the same time, access to more data remains a key factor in developing better, more useful machine learning models. That creates a unique opportunity to build a moat around data instead.

But what does that mean, and how can companies start the process of building a data moat? We caught up with Parinaz Sobhani, Head of Applied Research at Georgian, to explain why privacy is an important part of building a data moat, and which privacy-preserving techniques can help address customers' privacy concerns.

Customer buy-in comes before the data moat

For B2B companies, before even beginning to build a data moat, it's important to communicate its value to clients. It's easier for B2C companies like Facebook to think about how a data moat creates positive network effects: they provide similar services to all of their users, and their goal is to grow the number of users on the platform.

Enterprise customers are more invested in protecting their data and may be wary of letting others use it. But the discussion can be framed as a win-win: a fair value exchange for their data. If an enterprise customer shares data in a privacy-preserving manner that also protects their own customers' data, they can receive benefits in return through your product.

By using cutting-edge privacy-preserving technologies, B2B businesses can present the benefits of sharing data while providing mathematical guarantees that privacy will be respected. Communication and privacy protection should go beyond contractual obligations to truly foster trust. 

In the next section, we lay out some privacy models that can help businesses guarantee privacy. 

Technologies for data privacy

There are four main approaches to protecting data privacy: de-identification, differential privacy, federated learning, and homomorphic encryption.

Simple de-identification or anonymization is considered the least effective. Removing personally identifiable information from customer data offers no formal privacy guarantee: the remaining records can often be re-identified by linking them with external or auxiliary sources of information. More recent techniques from academic research, such as differential privacy and homomorphic encryption, provide formal guarantees grounded in mathematical foundations.
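As a toy illustration of why stripping direct identifiers is not enough, the sketch below joins a "de-identified" table against an external dataset on quasi-identifiers. The column names, values, and the zip-code-plus-birth-date linkage are all illustrative assumptions, not drawn from any specific customer scenario.

```python
import pandas as pd

# "De-identified" records: the name and SSN have been removed, but
# quasi-identifiers (zip code, birth date) remain.
deidentified = pd.DataFrame({
    "zip": ["94107", "10001"],
    "birth_date": ["1985-03-02", "1990-11-17"],
    "diagnosis": ["diabetes", "asthma"],
})

# An external, publicly available dataset that shares those quasi-identifiers.
public_records = pd.DataFrame({
    "name": ["A. Smith", "B. Jones"],
    "zip": ["94107", "10001"],
    "birth_date": ["1985-03-02", "1990-11-17"],
})

# Linking the two tables on the quasi-identifiers re-identifies the rows.
reidentified = deidentified.merge(public_records, on=["zip", "birth_date"])
print(reidentified[["name", "diagnosis"]])
```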

Differential privacy is a probabilistic framework for measuring how much privacy an algorithm preserves when it computes an answer from data. The main principle is to introduce randomness into the computation so that the final answer does not depend too heavily on any individual data point. Simply put, differential privacy techniques inject carefully calibrated noise into a dataset, or into the output of a machine learning model, without significantly degrading data analysis or model performance. There are several open-source libraries to help build differential privacy solutions, including Georgian's own resources on TensorFlow.
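To make the idea concrete, here is a minimal sketch of the Laplace mechanism, one of the simplest differentially private building blocks. The `laplace_count` helper, the toy records, and the epsilon value are illustrative assumptions and are not taken from Georgian's TensorFlow resources.

```python
import numpy as np

def laplace_count(data, predicate, epsilon=1.0):
    """Answer a counting query with the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one record changes the
    result by at most 1), so noise drawn from Laplace(1 / epsilon) makes
    this query epsilon-differentially private.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: count customers older than 40 without exposing any individual.
records = [{"age": 34}, {"age": 52}, {"age": 47}, {"age": 29}]
private_answer = laplace_count(records, lambda r: r["age"] > 40, epsilon=0.5)
print(private_answer)
```

Smaller epsilon values add more noise and give stronger privacy; the trade-off is a less accurate answer.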

Federated learning allows machine learning models to be trained on datasets held in different locations, rather than moving all the data to a centralized server, as is normally done when training machine learning models. Different institutions can therefore collaborate on training a shared model while keeping their sensitive data on their own servers.
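A rough sketch of federated averaging (FedAvg), the canonical federated learning algorithm, is below. The linear model, the synthetic client data, and the helper names are all illustrative assumptions; production systems typically use a framework such as TensorFlow Federated rather than hand-rolled loops.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local step: a few gradient updates on its own data.

    A plain linear-regression objective keeps the example self-contained;
    only the updated weights leave the client, never the raw data.
    """
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(global_weights, clients):
    """Server step: average the clients' locally trained weights,
    weighted by how much data each client holds (FedAvg)."""
    sizes = np.array([len(y) for _, y in clients])
    updates = [local_update(global_weights, X, y) for X, y in clients]
    return np.average(updates, axis=0, weights=sizes)

# Two "institutions", each keeping its dataset on its own servers.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(2)]

weights = np.zeros(3)
for _ in range(10):  # communication rounds
    weights = federated_average(weights, clients)
```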

Finally, homomorphic encryption, which can be considered a 'holy grail' for machine learning, allows companies to train models on encrypted data without ever decrypting it. It is computationally very expensive, however, so the cost should be weighed against the utility.
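As a small illustration of computing on data the vendor never decrypts, the sketch below assumes the open-source python-paillier (`phe`) package is installed. Paillier is only partially homomorphic (it supports adding ciphertexts and multiplying them by plaintext constants), but that is already enough to score a simple linear model on encrypted features; fully homomorphic schemes extend this to arbitrary computation at far greater cost.

```python
from phe import paillier

# The customer generates the key pair and keeps the private key.
public_key, private_key = paillier.generate_paillier_keypair()

# The customer encrypts its sensitive feature values before sharing them.
encrypted_features = [public_key.encrypt(x) for x in [12.5, 3.0, 7.25]]

# The vendor applies a linear model's weights directly to the ciphertexts,
# without ever seeing the underlying values.
weights = [0.4, -1.2, 0.8]
encrypted_score = sum(w * x for w, x in zip(weights, encrypted_features))

# Only the customer, holding the private key, can decrypt the result.
print(private_key.decrypt(encrypted_score))
```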

For B2B businesses working with different datasets from different customers, getting the right insights from data doesn't end once you've amassed a large dataset. Naively aggregating cross-customer data may not be useful, because the types of users and the data distributions differ across customers. In the end, you may just be adding noise to each customer's data rather than training your models in a meaningful way.

This is why transfer learning is so important for researchers. Transfer learning is the computer science equivalent of taking a skill or knowledge you've picked up in the past and applying it to a new situation. For example, a person who knows how to play tennis can apply the same skills and knowledge to a new racquet sport, like squash.

With transfer learning, the model built for a new customer reuses the most relevant information already learned from past customers' data.
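A minimal Keras sketch of this idea, assuming a tabular binary-classification task, is below. The layer sizes, the `past_model` stand-in for a network already trained on earlier customers' data, and the new customer's dataset are all illustrative assumptions.

```python
import tensorflow as tf

# Stand-in for a model already trained on pooled data from past customers.
past_model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
])
# ...imagine past_model.fit(...) has already run on earlier customers' data.

# Freeze the shared representation so the new customer's (typically small)
# dataset only has to train a lightweight, customer-specific head.
past_model.trainable = False
model = tf.keras.Sequential([
    past_model,
    tf.keras.layers.Dense(1, activation="sigmoid", name="new_customer_head"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X_new_customer, y_new_customer: the new customer's own data.
# model.fit(X_new_customer, y_new_customer, epochs=5, validation_split=0.2)
```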

Don’t get lost in the data moat

While building a data moat, it’s important to be guided by the idea that you’re improving an existing product or service. Some companies might have the misconception that they can monetize the data as a different product. Our perspective is that data moats are only valuable if they have some utility, so they should be about helping downstream applications or customers. 

Want more guidance on machine learning and AI? We have a lot of content dedicated to privacy and machine learning. Check out our resources here.
