Critical Data Studies

Big Data and the illusion of choice: Comparing the evolution of India’s Aadhaar and China’s Social Credit System as technosocial discourses
Saif Shahin and Pei Zheng. Social Science Computer Review. 2018.

India and China have launched enormous projects aimed at collecting vital personal information regarding their billion-plus populations and building the world’s biggest data sets in the process. However, both Aadhaar in India and the Social Credit System in China are controversial and raise a plethora of political and ethical concerns. The governments claim that participation in these projects is voluntary, even as they link vital services to citizens registering with these projects. In this study, we analyze how the news media in India and China, crucial data intermediaries that shape public perceptions of data and technological practices, have framed these projects since their inception. Topic modeling suggests that news coverage in both nations disregards the public interest and focuses largely on how businesses can benefit from the projects. The media, institutionally and ideologically linked with governments and corporations, show little concern about the privacy violations and mass surveillance that these projects could lead to. We argue that this renders citizens structurally incapable of making a meaningful “choice” about whether or not to participate in such projects. Implications for various stakeholders are discussed.

Analysis of messy data
Saif Shahin. International Encyclopedia of Communication Research Methods. 2017.

Raw data collected through surveys, experiments, coding of textual artifacts, or other quantitative means may not meet the assumptions upon which statistical analyses rely. The presence of univariate or multivariate outliers, skewness or kurtosis in a distribution, and heteroscedasticity or multicollinearity among variables may compromise data analysis. Scholars have devised a variety of techniques to discern and address such problems.
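As a rough illustration of the kind of diagnostics this entry describes, the sketch below computes skewness and excess kurtosis and flags univariate outliers. It is not taken from the encyclopedia entry: the function names, the 1.5 x IQR fence, and the sample scores are all assumptions chosen for illustration.

```python
import statistics


def describe_shape(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


# Hypothetical survey scores with one extreme response.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 40]
skewness, excess_kurtosis = describe_shape(scores)
outliers = iqr_outliers(scores)
```

A strongly positive skewness together with a flagged value would suggest checking whether the extreme response is a data-entry error or a genuine observation before running analyses that assume approximate normality.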

A critical axiology for Big Data studies
Saif Shahin. Palabra Clave. 2016.

Big Data is having a huge impact on journalism and communication studies. At the same time, it has raised a plethora of social concerns ranging from mass surveillance to the legitimization of prejudices such as racism. This article develops an agenda for critical Big Data research. It discusses what the purpose of such research should be, what pitfalls it should guard against, and the possibility of adapting Big Data methods to conduct empirical research from a critical standpoint. Such a research program will not only enable critical scholarship to meaningfully challenge Big Data as a hegemonic tool, but will also make it possible for scholars to draw upon Big Data resources to address a range of social issues in previously impossible ways. The article calls for methodological innovation in combining emerging Big Data techniques with critical/qualitative methods of research, such as ethnography and discourse analysis, in ways that allow them to complement each other.

Right to be forgotten: How national identity, political orientation, and capitalist ideology structured a trans-Atlantic debate on information access and control
Saif Shahin. Journalism & Mass Communication Quarterly. 2016.

This study examines U.S. and British media coverage of the “right to be forgotten” in light of the two countries’ legal approaches and public attitudes toward privacy. Algorithmic and qualitative textual analysis techniques are combined to uncover the ideologies and interests that structure the discourse and shape its outcome. The analysis reveals that U.S. media, irrespective of their perceived “liberal” or “conservative” orientation, treat users’ online privacy as subservient to the business interests of technology companies, in line with the country’s lax legal approach. The coverage is more diverse in Britain, where the legal concept of privacy is also more stringent.

When scale meets depth: Integrating natural language processing and textual analysis for studying digital corpora
Saif Shahin. Communication Methods and Measures. 2016.

As computer-assisted research of voluminous datasets becomes more pervasive, so does the criticism of its epistemological, methodological, and ethical/normative inadequacies. This article proposes a hybrid approach that combines the scale of computational methods with the depth of qualitative analysis. It uses simple natural language processing algorithms to extract purposive samples from large textual corpora, which can then be analyzed using interpretive techniques. This approach helps research become more theoretically grounded and contextually sensitive, addressing two major failings of typical “Big Data” studies. Simultaneously, it allows qualitative scholars to examine datasets that are otherwise too large to study manually and also bring more rigor to the process of sampling. The method is illustrated with two case studies, one looking at the inaugural addresses of U.S. presidents and the other investigating the news coverage of two shootings at an army camp in Texas.
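The abstract does not say which algorithms the article uses, so the following is only a minimal sketch of the general idea, assuming keyword density as the selection criterion: score every document in a corpus, then keep the top few for close qualitative reading. The function names, keywords, and three short documents are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_density(text, keywords):
    """Share of tokens in the text that match any of the keywords."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[word] for word in keywords)
    return hits / len(tokens)


def purposive_sample(corpus, keywords, k):
    """Return the ids of the k documents with the highest keyword density."""
    ranked = sorted(
        corpus.items(),
        key=lambda item: keyword_density(item[1], keywords),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Invented miniature corpus; a real study would load thousands of articles.
corpus = {
    "doc_a": "The shooting at the army base raised questions about security.",
    "doc_b": "Local markets reopened after the holiday weekend.",
    "doc_c": "Officials said the army would review base security after the shooting.",
}
keywords = {"shooting", "army", "security"}
sample = purposive_sample(corpus, keywords, k=2)
```

The computational step here only narrows the corpus; the interpretive analysis of the selected documents remains a manual, qualitative task, which is the division of labor the abstract describes.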