In a rambling grab bag of an article in the NY Times, Steve Lohr mentions in passing that Netflix has shelved a second contest to improve its recommendation capability because of privacy concerns:
On Friday, Netflix said that it was shelving plans for a second contest — bowing to privacy concerns raised by the F.T.C. and a private litigant. In 2008, a pair of researchers at the University of Texas showed that the customer data released for that first contest, despite being stripped of names and other direct identifying information, could often be “de-anonymized” by statistically analyzing an individual’s distinctive pattern of movie ratings and recommendations.
The movie data was supposedly ‘anonymized’ – stripped of indications of the identity of the people involved – but researchers were able to reconstruct the identities buried in the ‘micro-data’:
Robust De-anonymization of Large Datasets (How to Break Anonymity of the Netflix Prize Dataset), Arvind Narayanan and Vitaly Shmatikov, The University of Texas at Austin, February 5, 2008
We present a new class of statistical de-anonymization attacks against high-dimensional micro-data, such as individual preferences, recommendations, transaction records and so on. Our techniques are robust to perturbation in the data and tolerate some mistakes in the adversary’s background knowledge.
We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix, the world’s largest online movie rental service. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset. Using the Internet Movie Database as the source of background knowledge, we successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.
Datasets containing “micro-data,” that is, information about specific individuals, are increasingly becoming public—both in response to “open government” laws, and to support data mining research. Some datasets include legally protected information such as health histories; others contain individual preferences, purchases, and transactions, which many people may view as private or sensitive.
Privacy risks of publishing micro-data are well-known. Even if identifying information such as names, addresses, and Social Security numbers has been removed, the adversary can use contextual and background knowledge, as well as cross-correlation with publicly available databases, to re-identify individual data records. Famous re-identification attacks include de-anonymization of a Massachusetts hospital discharge database by joining it with a public voter database, de-anonymization of individual DNA sequences, and privacy breaches caused by (ostensibly anonymized) AOL search data.
Micro-data are characterized by high dimensionality and sparsity. Informally, micro-data records contain many attributes, each of which can be viewed as a dimension (an attribute can be thought of as a column in a database schema). Sparsity means that a pair of random records are located far apart in the multi-dimensional space defined by the attributes. This sparsity is empirically well-established [6, 4, 16] and related to the “fat tail” phenomenon: individual transaction and preference records tend to include statistically rare attributes.
[references are provided in the pdf.]
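To make the excerpt concrete, here is a minimal sketch of a linkage attack in Python. It is not Narayanan and Shmatikov's actual algorithm. The records, the `tolerance` and `margin` parameters, and the function names are all made up for illustration. It only shows why sparse records are easy to single out: a handful of approximately known ratings tends to fit one record far better than any other.

```python
# Hypothetical "anonymized" release: record id -> {movie: star rating}.
# Toy data invented for this sketch, not taken from the Netflix Prize dataset.
ANONYMIZED = {
    "r1": {"Brazil": 5, "Heat": 3, "Alien": 4, "Clue": 2},
    "r2": {"Heat": 4, "Alien": 4, "Big": 3},
    "r3": {"Brazil": 2, "Big": 5, "Clue": 4, "Tron": 1},
}


def score(record, known, tolerance=1):
    """Count the known ratings that this record matches within `tolerance` stars."""
    return sum(
        1
        for movie, rating in known.items()
        if movie in record and abs(record[movie] - rating) <= tolerance
    )


def best_match(records, known, margin=2):
    """Return the best-scoring record id, or None if no record clearly stands out.

    `margin` is an assumed threshold: the winner must beat the runner-up by at
    least this many matched ratings, otherwise the match is treated as ambiguous.
    """
    ranked = sorted(records, key=lambda rid: score(records[rid], known), reverse=True)
    top = score(records[ranked[0]], known)
    runner_up = score(records[ranked[1]], known) if len(ranked) > 1 else 0
    return ranked[0] if top - runner_up >= margin else None


# Background knowledge, e.g. gleaned from public reviews. The "Alien" rating is
# off by one star, which the tolerance absorbs.
known = {"Brazil": 5, "Alien": 5, "Clue": 2}
print(best_match(ANONYMIZED, known))           # r1
print(best_match(ANONYMIZED, {"Alien": 4}))    # None: one rating fits two records equally
```

Three imperfectly known ratings are enough to isolate a record in this toy data, while a single common rating is not. A real dataset has hundreds of thousands of records, but the sparsity described above works in the attacker's favor in the same way.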
Netflix had become embroiled in a lawsuit over the contest, and, bowing to pressure from the Federal Trade Commission, the company decided to drop the second contest:
– Neil Hunt, Netflix Chief Product Officer, Netflix Prize Update
We have reached an understanding with the FTC and have settled the lawsuit with plaintiffs. The resolution to both matters involves certain parameters for how we use Netflix data in any future research programs.
In light of all this, we have decided to not pursue the Netflix Prize sequel that we announced on August 6, 2009.
We will continue to explore ways to collaborate with the research community and improve our recommendations system so we can constantly improve the movie recommendations we make for you. So stay tuned.
So the upshot is that companies that gather data about us that is implicitly private – like our movie viewing habits – are probably not going to be able to publish this data in some hypothetically anonymized fashion.
On the other hand, if users opt into a publicy-based system – where their movie viewing habits are shared and published openly – it may seem that Netflix will have no problems. I doubt that Netflix needs all users to do so in order to improve the recommendation service.
But the researchers make a stronger case: beyond revealing users’ identities, the datasets allow other hypothetically private information about users – like sexual preferences and political orientation – to be inferred:
Does privacy of Netflix ratings matter? The privacy question is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?” The answer to the latter question is, undoubtedly, yes. As shown by our experiments with cross-correlating non-anonymous records from the Internet Movie Database with anonymized Netflix records (see below), it is possible to learn sensitive non-public information about a person’s political or even sexual preferences. We assert that even if the vast majority of Netflix subscribers did not care about the privacy of their movie ratings (which is not obvious by any means), our analysis would still indicate serious privacy issues with the Netflix Prize dataset.
Moreover, the linkage between an individual and her movie viewing history has implications for her future privacy. In network security, “forward secrecy” is important: even if the attacker manages to compromise a session key, this should not help him much in compromising the keys of future sessions. Similarly, one may state the “forward privacy” property: if someone’s privacy is breached (e.g., her anonymous online records have been linked to her real identity), future privacy breaches should not become easier. Now consider a Netflix subscriber Alice whose entire movie viewing history has been revealed. Even if in the future Alice creates a brand-new virtual identity (call her Ecila), Ecila will never be able to disclose any non-trivial information about the movies that she had rated within Netflix because any such information can be traced back to her real identity via the Netflix Prize dataset. In general, once any piece of data has been linked to a person’s real identity, any association between this data and a virtual identity breaks anonymity of the latter.
If someone wants to analyze the correlation between my movie choices and my political leanings, go ahead. But, of course, I live and watch movies in the US and not some repressive country that would jail me for enjoying ‘Breaking Away’ or ‘Breakfast Club’, and I live a very public life.
Narayanan and Shmatikov have conclusively demonstrated that there cannot be a separation between personal and non-personal data: algorithms like theirs make this intuitive distinction meaningless. So any software solution – like Netflix – that has reserved the right to share, distribute or publish anonymized data based on large populations of users will likely be blocked in any attempt to actually use that data in that way. Even if the sharing is not done in a fully open way – as Netflix did when it opened its dataset up for the contest – there is the distinct possibility that sensitive inferences can be associated with specific identities in the user base, and this will pierce the veils of privacy and secrecy.
Users have granted the right to Netflix to manage information about their viewing habits, and to use it in specific ways to make recommendations. But if it came to light that a part of their internal algorithm inferred explicitly what users’ sexual preferences are, for example, and stored that data somewhere – even if only for the length of a session – wouldn’t that be problematic, too? The possibility that such information exists would lead to the potential of all sorts of troublesome identity and privacy issues.