This is the first blog post in a series about differential privacy. Check out the table of contents to see the next articles!
How can you publish data about people while protecting their privacy? This question is far from new: statistical agencies have grappled with it for decades, and computer scientists have proposed a whole bunch of creative notions to capture the idea. None of these notions was very satisfactory, though: all of them were shown to be broken in some circumstances, and they were also hard to apply without destroying the utility of the data.
This all changed in 2006, when four researchers introduced differential privacy. This new notion took a novel approach to defining privacy leakage, one that would prove much more rigorous and fruitful. So, what makes differential privacy special? How did it get so successful in academic circles? Why did governments and tech companies start adopting it for their data publications?
This first article introducing differential privacy will attempt to answer those questions. First, we'll describe the high-level intuition behind this successful notion. Then, we'll explain why it's so successful: why is it so much more awesome than all the definitions that came before?
The core idea behind differential privacy
Suppose you have a process that takes some database as input, and returns some output.
This process can be anything. For example, it can be:
- a process calculating some statistics ("tell me how many users have red hair")
- a de-identification strategy ("remove names and last three digits of ZIP codes")
- a machine learning training process ("build a model to predict which users like cats")
- … you get the idea.
To make a process differentially private, you usually have to modify it a little bit. Typically, you add some randomness, or noise, in some places. What exactly you do, and how much noise you add, depends on which process you're modifying. I'll abstract that part away and simply say that your process is now doing some unspecified ✨ magic ✨.
Now, remove somebody from your database, and run your new process on it. If the new process is differentially private, then the two outputs are basically the same. This must be true no matter who you remove, and what database you had in the first place.
By "basically the same", I don't mean "it looks a bit similar". Instead, remember that the magic you added to the process was randomized. You don't always get the same output if you run the new process several times. So what does "basically the same" mean in this context? It means that you can get the exact same output from both databases, with similar likelihood.
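To make this a bit more concrete, here is a minimal sketch of one common way to implement that kind of magic for a simple counting query: adding noise drawn from a Laplace distribution (the Laplace mechanism). The database, the predicate, and the parameter value below are made up purely for illustration.

```python
import numpy as np

def noisy_count(database, predicate, epsilon):
    """Count the records matching `predicate`, then add Laplace noise.

    Removing (or adding) one person changes the true count by at most 1,
    so noise with scale 1/epsilon is enough to make the output
    distributions of two neighboring databases close to each other.
    """
    true_count = sum(1 for record in database if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Two neighboring databases: the second one is missing a single person.
full_db = [{"name": "Ada", "red_hair": True}, {"name": "Ben", "red_hair": False}]
without_ada = [r for r in full_db if r["name"] != "Ada"]

# Run the same noisy query on both; any given output is similarly likely
# to come from either database.
print(noisy_count(full_db, lambda r: r["red_hair"], epsilon=1.1))
print(noisy_count(without_ada, lambda r: r["red_hair"], epsilon=1.1))
```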
What does this have to do with privacy? Well, suppose you're a creepy person trying to figure out whether your target is in the original data. By looking at the output, you can't be 100% certain of anything. Sure, it could have come from a database with your target in it. But it could also have come from the exact same database, without your target. Both options have a similar probability, so there's not much you can say.
You might have noticed that this definition doesn't say anything about what the output data looks like. Differential privacy is not a property of the output data. It's very different from, say, k-anonymity, one of the first data privacy definitions. You can't look at the output data and determine whether it satisfies differential privacy. Instead, differential privacy is a property of the process: you have to know how the data was generated to determine whether it's differentially private.
That's about it for the high-level intuition. It's a little abstract, but not very complicated. So, why all the hype? What makes it so awesome compared to older, more straightforward definitions?
What makes differential privacy special
Privacy experts, especially in academia, are enthusiastic about differential privacy. It was first proposed by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in 2006¹. Very soon, almost all researchers working on anonymization started building differentially private algorithms. Tech companies and governments are adopting it fast. So, why all the hype? I can see three main reasons.
You no longer need attack modeling
All definitions that came before needed some assumptions about the attacker. To choose the right notion, you needed to figure out the attacker's capabilities and goals. How much prior knowledge do they have? What auxiliary data are they allowed to use? What kind of information do they want to learn?
Doing this in practice was difficult and very error-prone. Answering these questions is very tricky: in particular, you might not know exactly what the attacker wants or is capable of. Worse, there might be unknown unknowns: attack vectors that you didn't anticipate at all. For that reason, you couldn't make very broad statements with these old-school definitions. You had to make some assumptions, which you couldn't be 100% sure of.
By contrast, when you use differential privacy, you get two awesome guarantees.
- You protect any kind of information about an individual. It doesn't matter what the attacker wants to do. Reidentify their target, know if they're in the dataset, deduce some sensitive attribute… All those things are protected. Thus, you don't have to think about the goals of your attacker.
- It works no matter what the attacker knows about your data. They might already know some people in the database. They might even add some fake users to your system. With differential privacy, it doesn't matter. The users that the attacker doesn't know are still protected.
You can quantify the privacy loss
Differential privacy, like older notions, comes with a numeric parameter that you can tweak. There is a big difference, though, in how meaningful that parameter is. Take k-anonymity, for example. It tells you that each record in the output dataset "looks like" at least k-1 other records. But what does the value of k tell us about the level of protection?
The answer is… not much. There is no clear link between the value of k and how private the dataset is. So choosing k is very handwavy, and can't be justified in a formal way. The problem is even worse with other old-school definitions.
Differential privacy is much better. When you use it, you can quantify the greatest possible information gain by the attacker. The corresponding parameter, named ε, allows you to make formal statements. Suppose ε = 1.1. Then, you can say: "an attacker who thinks their target is in the dataset with probability 50% can increase their level of certainty to at most 75%." Choosing the exact value of ε isn't easy, but at least, it can be interpreted in a formal way.
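For the curious, here is the arithmetic behind that statement, sketched in my own notation (it is not taken verbatim from this article): differential privacy bounds how much the attacker's odds can grow after seeing the output.

```latex
% A prior probability of 50% means prior odds of 1:1. With \varepsilon = 1.1,
% the odds can grow by at most a factor of e^{\varepsilon} \approx 3.
\[
  \frac{\Pr[\text{target in DB} \mid \text{output}]}
       {\Pr[\text{target not in DB} \mid \text{output}]}
  \;\le\;
  e^{\varepsilon} \cdot
  \frac{\Pr[\text{target in DB}]}{\Pr[\text{target not in DB}]}
  = e^{1.1} \cdot \frac{0.5}{0.5} \approx 3
\]
% Posterior odds of at most 3:1 correspond to a posterior probability
% of at most 3 / (3 + 1) = 75%.
```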
And do you remember the previous point about attack modeling? It means you can change this statement in many ways. You can replace "their target is in the dataset" by anything about one individual. And you can add "no matter what the attacker knows" if you want to be extra-precise. Altogether, that makes differential privacy much stronger than all definitions that came before.
You can compose multiple mechanisms
Suppose you have some data. You want to share it with Alex and with Brinn, in some anonymized fashion. You trust Alex and Brinn equally, so you use the same definition of privacy for both of them. They are not interested in the same aspects of the data, so you give them two different versions of your data. Both versions are "anonymous", for the definition you've chosen.
What happens if Alex and Brinn decide to conspire, and compare the data you gave them? Will the union of the two anonymized versions still be anonymous? It turns out that for most definitions of privacy, this is not the case. If you put two k-anonymous versions of the same data together, the result won't be k-anonymous. So if Alex and Brinn collaborate, they might be able to reidentify users on their own… or even reconstruct all the original data! That's not good news.
With differential privacy, you can avoid this failure mode. Suppose that you gave differentially private data to Alex and Brinn. Each time, you used a parameter of ε. Then if they conspire, the resulting data is still protected by differential privacy. The level of privacy is now weaker: the parameter becomes 2ε. So they still gain some information, but you can now quantify how much. This property is called composition.
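Written out as a sketch (again in my own notation, with "DP" short for "differentially private"), this is the basic composition property:

```latex
% Basic sequential composition: releasing the outputs of two
% differentially private mechanisms run on the same data is itself
% differentially private, with the privacy parameters adding up.
\[
  M_1 \text{ is } \varepsilon_1\text{-DP}
  \;\text{ and }\;
  M_2 \text{ is } \varepsilon_2\text{-DP}
  \;\Longrightarrow\;
  (M_1, M_2) \text{ is } (\varepsilon_1 + \varepsilon_2)\text{-DP}
\]
% With \varepsilon_1 = \varepsilon_2 = \varepsilon, Alex and Brinn together
% learn no more than a single 2\varepsilon release would allow.
```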
This scenario sounds a bit far-fetched, but composition is super useful in practice. Organizations often want to do many things with data. Publish statistics, release an anonymized version, train machine learning algorithms… Composition is a way to stay in control of the level of risk as new use cases appear and processes evolve.
Conclusion
I hope the basic intuition behind differential privacy is now clear. If you remember a single thing, let it be this one-line summary: uncertainty in the process means uncertainty for the attacker, which means better privacy.
I also hope that you're now wondering how it actually works! What hides behind this magic that makes everything safe and private? Why does differential privacy have all the awesome properties I've mentioned? This is the exact topic of the next article in this series, which explains this in more detail while still staying clear of heavy math.
Footnotes
1. The idea was first proposed in a scientific paper (pdf) presented at TCC 2006, and can also be found in a patent (pdf) filed by Dwork and McSherry in 2005. The name differential privacy seems to have appeared first in an invited paper (pdf) presented at ICALP 2006 by Dwork.