A New Approach to the Data-Deletion Conundrum

A team of computer scientists devised a way to quickly remove traces of sensitive user information from machine learning models.


Rising consumer concern over data privacy has led to a rush of “right to be forgotten” laws around the world that allow individuals to request that their personal data be expunged from the massive databases that catalog our increasingly online lives. Researchers in artificial intelligence have observed that user data does not exist only in its raw form in a database; it is also implicitly contained in models trained on that data. So far, they have struggled to find methods for deleting these “traces” of users efficiently. The more complex the model, the more challenging it becomes to delete data.

“The exact deletion of data — the ideal — is hard to do in real time,” says James Zou, a professor of biomedical data science at Stanford University and an expert in artificial intelligence. “In training our machine learning models, bits and pieces of data can get embedded in the model in complicated ways. That makes it hard for us to guarantee a user has truly been forgotten without altering our models substantially.”

Zou is senior author of a paper recently presented at the International Conference on Artificial Intelligence and Statistics (AISTATS) that may provide an answer to the data deletion problem, one that works for privacy-concerned individuals and artificial intelligence experts alike. The authors call it approximate deletion.

Read the study: Approximate Data Deletion from Machine Learning Models

“Approximate deletion, as the name suggests, allows us to remove most of the users’ implicit data from the model. They are ‘forgotten,’ but in such a way that we can do the retraining of our models at a later, more opportune time,” says Zach Izzo, a graduate student in mathematics and the first author of the AISTATS paper.

Approximate deletion is especially useful in quickly removing sensitive information or features unique to a given individual that could potentially be used for identification after the fact, while postponing the computationally intensive full model retraining to times of lower computational demand. Under certain assumptions, Zou says, approximate deletion even achieves the holy grail of exact deletion of a user’s implicit data from the trained model.

Driven by Data

Machine learning works by combing databases and applying various predictive weights to features in the data, such as an online shopper’s age, location, and previous purchase history, or a streamer’s past viewing history and personal ratings of movies watched. The models are not confined to commercial applications and are now widely used in radiology, pathology, and other fields of direct human impact.

In theory, information in a database is anonymized, but users concerned about privacy fear that they can still be identified by the bits and pieces of information about them that remain wedged in the models, which has driven the need for “right to be forgotten” laws.

The gold standard in the field, Izzo says, is to find the exact same model that would have resulted if the machine learning algorithm had never seen the deleted data points in the first place. That standard, known as “exact deletion,” is hard if not impossible to achieve, especially with large, complicated models like those that recommend products or movies to online shoppers and streamers. Exact data deletion effectively means retraining a model from scratch, Izzo says.

“Doing that requires taking the algorithm offline for retraining. And that costs real money and real time,” he says.
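
To make the idea concrete, here is a minimal sketch of exact deletion by retraining. It assumes a toy one-feature least-squares line rather than the large recommendation models described above, and the user ids, data values, and the `fit_line` helper are all hypothetical. It shows that removing a record from the database leaves the trained model untouched, and that only a retrain from scratch removes the user's influence.

```python
# Toy illustration only: a one-feature least-squares line, not the models
# discussed in the article. All names and data here are hypothetical.

def fit_line(records):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (feature, label) pairs keyed by a hypothetical user id.
database = {
    "user_a": (1.0, 2.1),
    "user_b": (2.0, 3.9),
    "user_c": (3.0, 6.2),
    "user_d": (4.0, 7.8),
    "user_e": (5.0, 20.0),  # unusual record that pulls the line upward
}

model = fit_line(list(database.values()))

# Deleting the raw record does not touch the already-trained model.
del database["user_e"]
stale_model = model  # still carries user_e's influence

# Exact deletion: retrain from scratch on the remaining records.
retrained_model = fit_line(list(database.values()))

print("trained with user_e:   ", model)
print("retrained without them:", retrained_model)
```

In this toy case the retrain is instant. The cost Izzo describes comes from doing the same thing with models that take hours or days to train.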

What is Approximate Deletion?

In solving the deletion conundrum, Zou and Izzo have come at things slightly differently than their counterparts in the field. In effect, they create synthetic data to replace — or, more accurately, negate — that of the individual who wishes to be forgotten.

This temporary solution satisfies the privacy-minded individual’s immediate desire to not be identified from data in the model — that is, to be forgotten — while reassuring the computer scientists, and the businesses that rely upon them, that their models will work as planned, at least until a more opportune time when the model can be retrained at lower cost.
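
The following sketch illustrates the general idea of negating a user's data with a synthetic replacement. It is not the authors' algorithm, which the article does not spell out. It assumes a one-feature least-squares line, hypothetical data, and hypothetical helpers `fit_line` and `synthetic_label`. The synthetic label is the value the model would predict for that user had it never seen them, obtained from the standard leave-one-out identity for linear regression, so no retraining is needed to compute it.

```python
# Toy sketch, not the authors' algorithm: for a one-feature least-squares
# line, swapping a user's label for a synthetic one reproduces the model
# that retraining without that user would give. Data are hypothetical.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def synthetic_label(xs, ys, i):
    """Leave-one-out prediction at xs[i], derived from the full fit alone."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope, intercept = fit_line(xs, ys)
    fitted = slope * xs[i] + intercept
    # Leverage of point i in simple linear regression.
    leverage = 1.0 / n + (xs[i] - mean_x) ** 2 / sxx
    return (fitted - leverage * ys[i]) / (1.0 - leverage)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 20.0]
forget = 4  # index of the user asking to be forgotten

# Replace the user's real label with the synthetic one and refit.
patched_ys = list(ys)
patched_ys[forget] = synthetic_label(xs, ys, forget)
patched_model = fit_line(xs, patched_ys)

# Reference: exact deletion by retraining without the user.
retrained_model = fit_line(xs[:forget] + xs[forget + 1:],
                           ys[:forget] + ys[forget + 1:])

print("patched with synthetic label:", patched_model)
print("retrained without the user:  ", retrained_model)
```

In this simple linear setting the patched model matches the retrained one up to floating-point error. For the more complicated models the article discusses, a substitution of this kind should be expected to be approximate at best.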

There is a philosophical aspect to the challenge, the authors say. Where privacy, law, and commerce intersect, the discussion begins with a meaningful definition of what it means to “delete” information. Does deletion mean the actual destruction of data? Or is it enough to ensure that no one could ever identify an anonymous person from it? In the end, Izzo says, answering that key question requires balancing the privacy rights of consumers and the needs of science and commerce.

“That’s a pretty difficult, non-trivial question,” Izzo says. “For many of the more complicated models used in practice, even if you delete zero people from a database, retraining alone can result in a completely different model. So even defining the proper target for the retrained model is challenging.”
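
Izzo's point about retraining can be seen even in a tiny example. The sketch below assumes that random initialization is one source of the variation he describes, and it uses a deliberately simple hypothetical model, y = a * b * x, trained by gradient descent. The `train` function, the data, and the hyperparameters are all illustrative. Two runs on identical data, with nobody deleted, end at different parameters.

```python
import random

# Toy sketch with hypothetical data: the same training set and procedure,
# started from two different random seeds, yield different parameters.

def train(seed, steps=2000, lr=0.01):
    """Fit y = a * b * x by gradient descent on mean squared error / 2."""
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]  # consistent with any (a, b) where a * b == 2
    rng = random.Random(seed)
    a = rng.uniform(0.5, 1.5)
    b = rng.uniform(0.5, 1.5)
    for _ in range(steps):
        grad_a = sum((a * b * x - y) * b * x for x, y in zip(xs, ys)) / len(xs)
        grad_b = sum((a * b * x - y) * a * x for x, y in zip(xs, ys)) / len(xs)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


run_one = train(seed=0)
run_two = train(seed=1)

print("run one parameters:", run_one)
print("run two parameters:", run_two)
print("run one product a*b:", run_one[0] * run_one[1])
print("run two product a*b:", run_two[0] * run_two[1])
```

Both runs fit the data equally well, yet their parameters differ, so there is no single "correct" retrained model to compare a deletion method against.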

With their approximate deletion approach in hand, the authors validated the effectiveness of their method empirically, confirming their theoretical analysis. Practical application, the critical next step, is the goal of future work.

“We think approximate deletion is an important initial step toward solving what has been a difficult challenge for AI,” Zou says.
