On Monday, the FTC hosted a public workshop on the topic of big data and discrimination entitled, “Big Data: A Tool for Inclusion or Exclusion?” The first panel, which explored today’s big-data landscape, featured the following speakers from government, industry, and academia: Kristin Amerling, Chief Investigative Counsel and Director of Oversight at the U.S. Senate Commerce Committee; danah boyd, Principal Researcher at Microsoft Research and Research Assistant Professor at New York University; Mallory Duncan, Senior Vice President and General Counsel of the National Retail Federation; Gene Gsell, Senior Vice President for U.S. Retail & CPG at SAS; David Robinson, Principal at Robinson + Yu; and Joseph Turow, Professor at the University of Pennsylvania Annenberg School for Communication.
An opening presentation by Solon Barocas, a Postdoctoral Research Associate at Princeton University, defined big data by its “three Vs” — (1) volume, (2) velocity, and (3) variety — and noted that big data typically takes three main forms: (1) observational data, such as transactional data; (2) self-reported and user-generated data, like social media; and (3) experimental data, obtained via A/B testing, for example. Building on this characterization, the panel first considered whether the “three Vs” are in fact an accurate description of big data. The overwhelming conclusion on this question, and of the panel generally, was that there is no single way to conceptualize big data. danah boyd advocated thinking of big data not just in technical terms but as a socio-technological phenomenon that is more philosophical than methodological in nature, one that breeds social confusion and chaos precisely because of our inability to grapple with the thing itself. She noted that much of the current dialogue is not actually about formal mechanisms, like data mining and analytics, but is instead rooted in a fundamental uncertainty over how we as a market society can understand, harness, leverage — and perhaps tame — big data. Indeed, framing big data principally as socio-philosophical discourse was a theme that emerged throughout the workshop.
While agreeing that there is a basic difficulty simply in making sense of big data, Gene Gsell took a contextual perspective, emphasizing that “Data has been around for a really long time,” so ‘big data’ just means that “Today, there is more of it.” Moreover, on scrutinizing how big data is currently being used, Mr. Gsell suggested that critics may be giving industry “more credit” than is presently due, because most companies are still “behind the curve” in dealing with all of the information now available. Joseph Turow echoed a similar thought: the retail industry, for example, has expressed feeling “overwhelmed with data.” But Mr. Turow disagreed that big data is merely a larger continuation of something that has always existed. Instead, he concluded that, although big data remains in its infancy, it represents the beginning of a new era, characterized by new frontiers in technological advancement like predictive analytics and laser-targeted advertising. He noted especially that the sheer ability to personalize data, by drawing non-intuitive inferences from hundreds of data points in order to reach conclusions about consumers, is a “terrific change in how companies evaluate customers.”
Mallory Duncan added that, in terms of industry uses and benefits of these tools, the personalization enabled by big data has become critical to the commercial industry’s finding “long, loyal, valuable customers.” Relating back to the theme of the workshop, Mr. Turow nevertheless pointed out that recent research has shown that ‘personalization’ sometimes means unintentional (or even intentional) discrimination against particularly vulnerable groups. Mr. Turow therefore wondered about a conflicting “trajectory of interests”: companies that we can assume do not want to discriminate still seek personalization at such an extraordinary level of detail — such as predicting what a consumer will do upon entering a brick-and-mortar store — that unintended consequences disadvantaging some customers over others may be impossible to avoid. In response, and without addressing the question of intent, Mr. Gsell repeated his earlier suggestion that we assume industry is capable of much more than is actually the case. Seeking to deconstruct our understanding of big data and discrimination in more theoretical terms, danah boyd concluded that if we conceive of discrimination as how one is positioned into a protected class or category, then ‘personalization,’ or market categorization — fitting consumers within a network of actors based on their behaviors over time, rather than on data voluntarily and deliberately given to a company in a limited setting — accomplishes the same thing at a basic level. In other words, personalization is made possible only by positioning a consumer statistically in relation to other actors and networks, through interpolation and through probabilistic correlations drawn from data sets about which the consumer usually has no say or knowledge.
Much of the panel discussion therefore suggested that, within big data’s current framework, unintentional discrimination seems nearly unavoidable. Panelists added, however, that the lack of transparency in private-industry practices not only puts consumers in a vulnerable position but also limits basic understanding: a large part of the problem is that, right now, we simply do not know much about how big data functions. Greater transparency might partially solve the discrimination problem, but danah boyd noted that, on the other hand, the same techniques that can increase fairness and transparency can also intensify complexities, because a company’s sophisticated insights can present multifaceted ethical concerns that perhaps for good reason should remain unknown to the public. As an example, danah cited Microsoft researchers’ ability to detect with great certainty, from a user’s Bing searches, whether that individual will be hospitalized within 48 hours. Such conclusions drawn from big data can create major moral dilemmas: What is Microsoft’s ethical duty to act or intervene, especially when the data conclusions extend beyond the online-search or marketing ecosystem into preventive health and medical domains? Does Microsoft warn the consumer even though such a practice, while probably beneficial to the individual, would likely be viewed as intrusive and “creepy”?
These novel ethical and philosophical issues more or less led the panel back to where it began: big data is, for now, a complex gray area.