In today’s increasingly hyperconnected world, it’s more important than ever to protect our privacy online. But the more information companies collect about us, the harder our privacy becomes to protect, even when that information has been properly anonymized.
To address this critical issue, we’re excited to announce that we’ve worked with Google to make our differential privacy library publicly available through TensorFlow, the industry’s leading open-source machine learning framework.
So what is differential privacy, and why are we teaming up with the world’s top technology firm to make it freely available?
The Anonymity Illusion
When we sign up for online services, they assure us that all of our information is “completely anonymous.” This is completely incorrect.
We usually think of anonymity in binary terms, with information being either anonymous or identifiable. But anonymity is a continuum, with full anonymity as the asymptote — a point that we can get infinitely close to but never reach. The vast majority of “anonymous” data is pseudonymous, with personal information replaced by seemingly random identifiers. That might sound somewhat pedantic, but it makes all the difference.
Researchers have repeatedly demonstrated the ability to de-anonymize large data sets, from AOL search logs to Netflix ratings to U.S. Census data, by looking for correlations and outliers, a practice known as data re-identification. To understand how this works in practice, consider the following thought experiment.
Imagine you have access to a health care database containing each patient’s date of birth, gender and ZIP code. By cross-referencing these three pieces of information with public records, how many of the patients could you identify? According to a 2000 study from Carnegie Mellon University, nearly 90% of Americans can be uniquely identified using just the information above. To demonstrate the technique’s practical applicability, the researcher went on to identify the medical records of William Weld, then the governor of Massachusetts.
And if you think things have gotten better since 2000, think again. A study released in July 2019 by Imperial College London examined the possibility of using modern machine learning techniques to re-identify similar types of data:
“Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”
The Differential Solution
One of the most effective ways to reduce the risk is through the use of cutting-edge techniques that satisfy differential privacy. In essence, satisfying differential privacy means that random noise has been added during the analysis of data such that individual data items cannot be distinguished, but also such that the fundamental statistical properties of the data are still maintained. If that sounds like a bunch of math jargon, let’s consider a practical example.
Suppose you want to find out how many people cheat on their partners. The naive approach of asking 1000 random people whether they’ve cheated probably yields a result close to zero since almost no one wants to risk the information being leaked. So instead, you ask each participant to flip a coin twice secretly. If the first flip lands heads, you ask them to answer the question truthfully. Otherwise, you ask them to give you a forced answer based on the second flip — “yes” for heads and “no” for tails.
The key here is that you’re giving all of the participants plausible deniability — even if someone later finds out they answered “yes”, there’s no way to tell if the answer was real or forced, and the participant can always claim the latter. For each participant, true answers are indistinguishable from random coin-flip answers.
Now suppose you run the survey using the above technique and 30% of participants answer “yes”… what does that mean? Once you remove the 25% of participants that were forced to answer “yes” (leaving us with 5%) and double the number (since only half of the participants told the truth), you’re left with the conclusion that approximately 10% of people have cheated on a partner, a number likely closer to reality.
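To make the arithmetic concrete, here is a minimal Python sketch of the coin-flip survey described above; the 1,000 participants and the hidden 10% rate are simply the numbers from the example.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """The coin-flip mechanism from the example above.

    First flip heads: report the truthful answer.
    First flip tails: report a forced answer, second flip heads = "yes", tails = "no".
    Because P(report yes | truth is yes) / P(report yes | truth is no)
    = 0.75 / 0.25 = 3, this satisfies differential privacy with epsilon = ln(3).
    """
    if random.random() < 0.5:       # first flip is heads: tell the truth
        return truthful_answer
    return random.random() < 0.5    # second flip: heads forces "yes", tails forces "no"

def estimate_true_rate(reported_yes_fraction: float) -> float:
    # Remove the 25% of answers we expect to be forced "yes", then double,
    # since only half of the participants answered truthfully.
    return 2 * (reported_yes_fraction - 0.25)

# Simulate 1,000 participants whose real (hidden) cheating rate is 10%.
random.seed(42)
true_answers = [random.random() < 0.10 for _ in range(1000)]
reports = [randomized_response(answer) for answer in true_answers]
observed = sum(reports) / len(reports)

print(f"observed 'yes' rate: {observed:.1%}")
print(f"estimated true rate: {estimate_true_rate(observed):.1%}")
```

Running the simulation recovers an estimate close to the true 10% rate in aggregate, even though no single reported answer reveals anything definitive about the person who gave it.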
The Software Evolution
At Georgian, we’re building software to help our portfolio companies accelerate the adoption of key technologies directly tied to our applied research areas. We developed our differential privacy software in collaboration with some of our portfolio companies to solve an important business challenge: cross-customer data aggregation.
Imagine you start a business that builds machine learning models to help fast-food restaurants optimize their supply chains. You start with McDonald’s and spend months training your models, and after about a year, your software starts saving McDonald’s millions of dollars. Pretty soon Subway, KFC, Burger King and every other major fast-food restaurant wants your solution, but there’s one massive problem: you don’t have the time and staff to spend months training a brand new model from scratch for each one. Wouldn’t you love to be able to transfer the learnings from McDonald’s supply chain without giving away their secret sauce?
Solving this cold-start problem with aggregated data and machine learning models from existing customers improves onboarding times and reduces time-to-value for new customers.
After exploring different methods, we identified the Bolt-on method as the best approach for our use cases. It relies on stochastic gradient descent (SGD), an optimization technique that works with convex machine learning models such as Logistic Regression and linear Support Vector Machines (SVMs). In our research, we observed that just a small amount of noise is needed to achieve reasonable privacy guarantees, which makes the approach well suited to our needs. When Google released TensorFlow Privacy, we immediately saw an opportunity to collaborate and contribute to the project by adding support for Logistic Regression and Support Vector Machines.
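As a rough illustration of the underlying idea, here is a short Python sketch of output perturbation, the “train normally, then add noise to the learned weights” pattern that the Bolt-on method builds on. It uses scikit-learn rather than TensorFlow Privacy, and the Lipschitz bound, privacy budget and noise calibration below are simplified, illustrative choices rather than the library’s actual implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)

# Toy data; rows are scaled to unit L2 norm, a common assumption when
# bounding how much any single example can move a convex model's weights.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X = Normalizer().fit_transform(X)

# Step 1: ordinary, non-private SGD on a strongly convex objective
# (L2-regularized logistic regression).
n, lam = len(X), 1e-2
model = SGDClassifier(loss="log_loss", penalty="l2", alpha=lam, random_state=0)
model.fit(X, y)

# Step 2: "bolt on" privacy after training by perturbing the learned weights.
# For a 1-Lipschitz, lam-strongly-convex objective, one example can shift the
# optimal weights by at most roughly 2 / (n * lam); epsilon is an illustrative
# privacy budget, not a recommendation.
epsilon = 1.0
sensitivity = 2.0 / (n * lam)

# Noise with density proportional to exp(-epsilon * ||b|| / sensitivity):
# a uniformly random direction with a Gamma-distributed magnitude.
d = model.coef_.size
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
magnitude = rng.gamma(shape=d, scale=sensitivity / epsilon)
model.coef_ = model.coef_ + (magnitude * direction).reshape(model.coef_.shape)

print("accuracy after adding privacy noise:", model.score(X, y))
```

The appeal of this pattern is that the training loop itself stays untouched: privacy is added as a post-processing step, which is what makes it a natural fit for convex models like Logistic Regression and linear SVMs.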
The Privacy Revolution
At Georgian, we stand with industry leaders in our belief that privacy is a fundamental right. In a world where the risks and costs associated with privacy are on the rise, and privacy issues are leading to broader questions around trust and AI, we believe that differential privacy offers a viable solution. Differential privacy allows machine learning teams to create valuable products without risking the privacy of any individual’s data, and we believe that it should be accessible to all companies to help them overcome these challenges.
To get started with TensorFlow Privacy, check out the examples and tutorials in the GitHub repository. To learn more about differential privacy, check out the CEO’s Guide to Differential Privacy. Finally, to learn how privacy is integral to building and leveraging customer trust, check out the CEO’s Guide to Trust.