This post was originally published on November 2, 2021 on RSM’s former Medium account. Some formatting and content may have been edited for clarity.
The Berkman Klein Center’s Institute for Rebooting Social Media kicked off its fall programming with an event asking:
What would genuine data-driven oversight of social media companies look like?
The Institute is a three-year initiative to convene participants across sectors to accelerate progress in addressing social media’s most urgent problems, including content governance, business models, and infrastructure design. Led by Professors Jonathan Zittrain and James Mickens, the Institute is building a portfolio of programming, such as events, workshops, and a visiting scholar program for faculty in 2022–23.
Platform data that would enable the public to understand the scale and specificity of social media’s harms has been notoriously difficult for those working in the public interest to access; journalists, regulators, academics, and members of civil society encounter persistent barriers from social media companies. As the problems of social media platform oversight have gained renewed attention, most recently through whistleblowers such as Frances Haugen, the challenges and possibilities of accessing, researching, and better understanding private social media data in the public interest are ever more critical.
Moderated by Jonathan Zittrain, the George Bemis Professor of International Law at Harvard Law School, the “Private Social Media Data in the Public Interest: What’s Next?” event featured an interdisciplinary group of expert speakers with experience in academia, journalism, government, and social media companies: Nate Persily, James B. McClatchy Professor of Law at Stanford Law School; Nabiha Syed, President of nonprofit newsroom The Markup; Nicole Wong, former Deputy US Chief Technology Officer and former Vice President and Deputy General Counsel at Google; and Ethan Zuckerman, Associate Professor of Public Policy, Communication and Information at the University of Massachusetts. The group discussed the need for independent research about social media, the problem of restricted access to platform data, concrete proposals for giving independent researchers access to that data, and the tradeoffs inherent in such data access arrangements.
Nate Persily recently published his draft legislative proposal, the Platform Transparency and Accountability Act, which would create a privacy-protective, federally mandated pathway for academic researchers to access platform data. The draft legislation aims to break up platforms’ monopoly on insights from their own data and to compel data access as a form of oversight. Persily stressed platforms’ unprecedented power over online social interaction and communication, which poses unique dangers to democracy and to the information ecosystem.
Persily noted that the lack of access to platform data has left policymakers legislating in the dark: unable to fully understand the problems, they must resort to trusting the platforms themselves.
Obligating platforms to provide data access to accredited academic researchers on a regular basis would increase transparency, similar to the disclosures required in other industries.
Nabiha Syed, who leads The Markup, a nonprofit investigative newsroom, emphasized the importance of this moment, when society is figuring out the right system of checks and balances to hold powerful technology companies accountable. She noted the similarities and differences between journalists and academic researchers with respect to platform data access: journalists have different time horizons, incentives, and at times different methods and questions than academic researchers, yet the two groups’ interests in expanding independent access to platform data are aligned. Currently, some journalistic outlets, including The Markup, conduct public interest investigations using methods like data scraping and browser plug-ins that let users proactively share their data with the newsroom.
Syed argued that methods like scraping and data donation are vital layers in the system of checks and balances needed for accountability, even if transparency became federally mandated.
And yet, scraping, data donation, and similar methods are legally precarious, and journalists using them can be exposed to lawsuits. Syed strongly endorsed creating safe harbors for privacy-protecting adversarial data collection by public interest groups such as journalists and academic researchers.
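The scraping and data donation methods Syed described can be made concrete with a short sketch. The Python snippet below is a minimal, hypothetical illustration of public-page scraping: the URL and the “.post-body” CSS selector are placeholders rather than any real platform’s markup, and a real investigation would also have to handle JavaScript-rendered pages, authentication walls, and rate limits.

```python
# A minimal sketch of public-page scraping of the kind used in
# public interest investigations. The URL and CSS selector below are
# hypothetical placeholders, not any real platform's endpoints.
import time

import requests
from bs4 import BeautifulSoup

PAGES = [
    "https://example.com/public-feed?page=1",  # placeholder public page
    "https://example.com/public-feed?page=2",
]

def scrape_page(url: str) -> list[str]:
    """Fetch one public page and return the visible post texts."""
    resp = requests.get(url, headers={"User-Agent": "public-interest-research"})
    if resp.status_code != 200:
        return []  # placeholder pages won't resolve; skip quietly
    soup = BeautifulSoup(resp.text, "html.parser")
    # ".post-body" stands in for whatever markup the page actually uses.
    return [el.get_text(strip=True) for el in soup.select(".post-body")]

posts = []
for url in PAGES:
    posts.extend(scrape_page(url))
    time.sleep(2)  # throttle requests; adversarial need not mean abusive

print(f"collected {len(posts)} posts")
```

Data donation inverts this flow: instead of a newsroom fetching pages, users install a browser extension that captures what their own feeds show them and forwards it to the newsroom, which sidesteps some scraping restrictions but raises the networked-privacy questions discussed later in the panel.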
Nicole Wong opened by drawing out a number of key questions embedded in the panel’s core question: what does genuine data-driven platform oversight look like?
“Oversight of what? Which platforms? What harms? Which public are we trying to serve? Which public interest is it? Who conducts it? With what tools?”
She emphasized that we are still early in the process of regulating social media, so research has a particularly important role to play in elucidating the diverse problems platforms must address, such as slowing the spread of misinformation, protecting human rights, and ensuring privacy. She also emphasized the need for multiple layers of oversight, which, she argued, means companies need multiple ways of making data accessible. Drawing on her experience in industry and government, she called attention to the problem of standardizing compliance and regulation across platforms: end users need to understand the context of their platform experience and have agency to make the tradeoffs and choices that are right for them, while journalists and consumer groups need a different type of access in order to perform their functions.
Ethan Zuckerman leads the media research platform Media Cloud and has proposed a number of possible paths toward better research access. He opened by reflecting on Social Science One (which Persily previously helped lead before stepping down), a voluntary platform data sharing experiment that he and many others felt culminated in disappointment and inaccurate datasets. Given that experience and his own research, he supports Persily’s proposal, while also emphasizing, as Syed did, the need for laws that protect public interest research based on publicly available data. Whereas Persily’s proposal matters most for platform decisions and insights where not all data is publicly accessible, such as internal content moderation decisions, Zuckerman’s approach centers on research that can be conducted in the public domain. For example, one of his projects aims to create a true random sample of YouTube content by building a searchable index of transcripts of a million videos. Zuckerman argued that it is imperative for researchers to generate datasets like this themselves, which he says can be done safely and ethically regardless of platform cooperation, saying,
“First of all, you need the ability to assert the right to study public data like that [YouTube videos]. Second, there needs to be some sort of right of data donation or data altruism, like what people are doing with allowing users to put plug-ins in their browsers to share their data with NYU, or The Markup, or somewhere else.”
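Zuckerman’s YouTube project hints at why a “true random sample” is technically hard without platform cooperation: video IDs are 11 characters drawn from a 64-character alphabet, so the ID space (64¹¹, roughly 7×10¹⁹) vastly exceeds the number of real videos, and uniformly random guessing essentially never finds one. The Python sketch below shows just one hypothetical building block, an existence check against YouTube’s public oEmbed endpoint; a workable sampler would need a smarter candidate generator, and this is an illustrative assumption rather than a description of his actual pipeline.

```python
# Sketch of an existence check for candidate YouTube video IDs, using the
# public oEmbed endpoint (no API key needed). It illustrates the scale
# problem: with ~64^11 possible IDs, uniformly random candidates almost
# never hit a real video, so random sampling of YouTube is nontrivial.
import random
import string

import requests

ID_ALPHABET = string.ascii_letters + string.digits + "-_"  # 64 characters

def random_candidate_id() -> str:
    """Draw a uniformly random 11-character string from the ID alphabet."""
    return "".join(random.choices(ID_ALPHABET, k=11))

def is_public_video(video_id: str) -> bool:
    """True if oEmbed resolves the ID to a public, embeddable video."""
    resp = requests.get(
        "https://www.youtube.com/oembed",
        params={"url": f"https://www.youtube.com/watch?v={video_id}",
                "format": "json"},
        timeout=10,
    )
    return resp.status_code == 200  # non-200 means no such public video

candidates = [random_candidate_id() for _ in range(20)]
hits = [vid for vid in candidates if is_public_video(vid)]
print(f"{len(hits)} of {len(candidates)} random candidates were real videos")
# Expect 0: the point is that sampling needs cleverer candidate generation.
```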
The conversation then shifted to a critical question: how should researchers working with potentially sensitive data be held to the highest standards? As Wong noted, when someone donates their social media data, it is not just their data; it is networked data. How can we learn from other contexts, like medical data donation, to ensure data privacy? Zuckerman suggested that researchers might be able to successfully regulate themselves, creating an industry body that performs a kind of peer review for research code and datasets. Asked how to ensure researchers genuinely access data in the public interest, Syed pointed to the potential need for standards that serve a gatekeeping function (as distinct from appointing specific gatekeepers) and for multilayered accountability. Standards for research and journalistic data access should aim to be maximally inclusive while explicitly excluding uses that do not serve the public, such as scraping social media to build facial recognition databases, as Clearview AI has done. She noted that standards often arise from mistakes,
“In this universe, what do mistakes look like? What’s the level of mistake that we’re comfortable with as we stumble towards the right balance [of checks and balances] in this moment?”
The panelists uniformly emphasized that social media is propagating harms and that there is a need for increased data access. As Nicole Wong noted, “We’re at the start” of addressing and developing platform governance overall. Enacting the proposals discussed during the event would contribute to that start. In parallel, a tough and important question emerges:
What do we want social media of the future to look like?
This recap was written by Madeleine Matsui and Hilary Ross.