With the wild success of micro-blogging platforms such as Twitter, and their supposed propensity as tools for social revolution and the dissemination of alternative discourse, it was only a matter of time before researchers, journalists, and historians would want to use the vast amounts of data these platforms collect to better understand how people interact with social networks. In her talk with CGCS earlier this semester, Sandra Gonzalez Bailon discussed organization in times of protest, studying network effects, user types, and the heavy reliance on social networking platforms as tools for mobilization and revolution. In this article, she addresses many of the issues researchers encounter when using social media platforms as sources of qualitative and quantitative data.
A Snapshot of the Internet
In 2010 the Library of Congress announced its plans to archive every public tweet since Twitter’s inception in March 2006. Justifying the move was the claim that tweets are “part of significant events around the world” and therefore have historical value worthy of preservation. In line with the Library’s mission, the main purpose of this digital archive was to facilitate research and enable future generations to reconstruct a part of history through the digital trails left behind. Years later, however, those plans have yet to materialize. Technical bottlenecks and the sheer difficulty of storing the information (hundreds of millions of messages daily, and rising) have jeopardized the delivery and accuracy of access tools, and the feasibility of the project itself.
This means that those wanting to do research with Twitter still need to rely on the platform’s APIs (application programming interfaces) to retrieve information that can help make sense of this significant dimension of social life. Because of the way the APIs operate, most research will consequently be based on samples of the full stream of communication: that is, an imperfect representation of everything that happens in the online network.
Pulling on Social Science Traditions to Answer New Media Questions
Social scientists are used to thinking about sampling strategies and the representativeness of their samples: since they cannot ask every citizen what they think about important issues, they run questionnaires with a subset and infer general trends about the unobserved population as a whole. This exercise often requires controlling for the inherent bias of the sample, that is, taking into account the imperfect correspondence between a small sample and the overall population.
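The logic of correcting for a biased sample can be illustrated with a short simulation (a hypothetical example with invented numbers, not drawn from the article’s data): suppose more active users are more likely to end up in our sample; the naive sample mean of activity then overestimates the population mean, while inverse-probability weighting recovers it.

```python
import random

random.seed(42)

# Hypothetical population: activity levels of 10,000 users (invented data).
population = [random.choice([1, 2, 3, 4, 5]) for _ in range(10_000)]

# Biased sampling: a user's chance of being observed grows with activity.
def inclusion_prob(activity):
    return 0.02 * activity  # more active users are more likely to be sampled

sample = [a for a in population if random.random() < inclusion_prob(a)]

pop_mean = sum(population) / len(population)
naive_mean = sum(sample) / len(sample)

# Inverse-probability weighting corrects the over-representation.
weights = [1 / inclusion_prob(a) for a in sample]
weighted_mean = sum(a * w for a, w in zip(sample, weights)) / sum(weights)

print(f"population mean: {pop_mean:.2f}")
print(f"naive sample mean: {naive_mean:.2f}")    # biased upward
print(f"reweighted mean:   {weighted_mean:.2f}") # close to the population mean
```

The reweighting step is exactly the kind of correction that becomes impossible when the sampling mechanism (here, `inclusion_prob`) is unknown, as it is for most platform APIs.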
When it comes to research with digital trails, however, the bias of the data is rarely considered: we are so overwhelmed by the richness of the information that the question of how missing data might bias the observed trends slips under the radar. Why, then, should we care about the missing messages when we have collected hundreds of thousands (if not millions) of them? Moreover, when researching Twitter communication we usually do not care about the overall population, only about those who are tweeting: why should a discussion about inference and bias matter?
The answer is that it does matter: there are good reasons to care about the bias of the data, especially when online communication is used to reconstruct networks of information exchange between users, and when those interactions have spill-over effects in the offline world.
Using an API to Recreate History Yields an Imperfect but Telling Picture
The reconstruction of online communication networks often starts with a set of messages that relate to a given information domain (e.g. political protests or a new technology product). From those messages, researchers can snowball the users who sent them and reconstruct their interactions with other users. If the original set of messages is incomplete, researchers might underestimate the size of the network that is (or was) actively discussing a given topic, as well as the number of interactions between users and the bandwidth (or intensity) of their communication. This creates noisy measurements that can affect conclusions about how online networks mediate social behavior.
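The effect of missing messages on a reconstructed network can be sketched with a toy example (the message data below is invented for illustration): treat each message as a (sender, mentioned user) pair; dropping part of the message set shrinks both the node count and the edge count of the resulting interaction network.

```python
# Toy message set: (sender, mentioned_user) pairs, invented for illustration.
messages = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
    ("dave", "alice"), ("erin", "dave"), ("frank", "erin"),
    ("alice", "carol"), ("bob", "alice"),
]

def build_network(msgs):
    """Reconstruct a directed interaction network from a set of messages."""
    nodes, edges = set(), set()
    for sender, target in msgs:
        nodes.update((sender, target))
        edges.add((sender, target))
    return nodes, edges

full_nodes, full_edges = build_network(messages)

# An incomplete retrieval (e.g. what a rate-limited API returns) misses
# messages, and with them some users and interactions.
partial_nodes, partial_edges = build_network(messages[:5])

print(len(full_nodes), len(full_edges))        # 6 nodes, 8 edges
print(len(partial_nodes), len(partial_edges))  # fewer nodes and edges
```

The point of the sketch is that the missing messages do not just shrink the data set: they remove users and ties, so any network statistic computed downstream (size, density, centrality) inherits the bias.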
One way to assess the extent of this bias is to compare samples of the same underlying population of messages. Since the full stream of communication is out of reach for most researchers (access requires a commercial agreement with Twitter), we can compare the results retrieved through the APIs that are publicly available. The Twitter API policies (which determine the number of queries that can be run when downloading information, and therefore the size of the samples) have changed over the last few years, becoming increasingly restrictive. Currently, there are two main ways of accessing information: the search API and the streaming API. The former can collect messages published during the previous week but applies an unspecified rate limit to the number of queries that can be run; the latter allows requests to remain open and pushes data as it becomes available but, depending on volume, still captures just a portion of all activity (about 1% of the full stream). The question is: does the picture that these two windows offer of Twitter activity differ significantly?
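The two access modes can be mimicked with synthetic data (the sampling rules below are deliberate simplifications, not Twitter’s actual implementation): treat the streaming-style sample as a random draw from the whole stream, and the search-style sample as a capped draw from only the most recent messages, then measure how much the two retrievals overlap.

```python
import random

random.seed(7)

# Synthetic stream of message ids, ordered by time (hypothetical data).
stream = list(range(100_000))

# "Streaming"-style sample: a random draw of roughly 10% of everything.
streaming_sample = {m for m in stream if random.random() < 0.10}

# "Search"-style sample: only recent messages, capped by a rate limit.
RECENT_WINDOW = 20_000   # stands in for the one-week window (assumed value)
RATE_CAP = 5_000         # stands in for the unspecified query cap (assumed)
recent = stream[-RECENT_WINDOW:]
search_sample = set(random.sample(recent, RATE_CAP))

overlap = streaming_sample & search_sample
print(f"streaming: {len(streaming_sample)}, search: {len(search_sample)}")
print(f"overlap: {len(overlap)} messages")
```

Even in this idealized setting the two windows disagree: each sample contains many messages the other misses, which is the situation the comparison in the next section examines with real data.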
Search vs. Streaming API
This question has been considered in a recent paper, in the context of communication around political mobilizations. The comparison of the networks built from the search and streaming samples reveals that although the overlap between the two data sets is large, it is not complete. The smaller sample (search) is biased towards the most central users: the networks of communication reconstructed from it prune the large number of peripheral users, those who are less engaged and interact more loosely with other users. This periphery is only captured by the larger sample (streaming), which is better at reconstructing the fringes of online networks. The bias towards the core of networks is greater for the network of mentions than for the network of re-tweets (the two conventions users employ to engage in direct communication with each other, and which offer a better proxy for actual information exchange than the underlying follower-following structure).
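The core–periphery bias described above can be illustrated with a small simulation (a synthetic network, not the paper’s data): when a restrictive API returns a uniform sample of messages, users who send many messages (the core) are far more likely to appear in the sample than users who send only one (the periphery).

```python
import random

random.seed(1)

# Synthetic activity: 20 "core" users send 50 messages each,
# 2,000 "peripheral" users send a single message each (invented data).
messages = [f"core_{i}" for i in range(20) for _ in range(50)]
messages += [f"peri_{i}" for i in range(2_000)]

# A small uniform sample of messages, as a restrictive API might return.
sample = random.sample(messages, 300)
observed = set(sample)

# Fraction of each group that shows up in the sampled messages.
core_seen = sum(1 for i in range(20) if f"core_{i}" in observed) / 20
peri_seen = sum(1 for i in range(2_000) if f"peri_{i}" in observed) / 2_000

print(f"core users captured:       {core_seen:.0%}")
print(f"peripheral users captured: {peri_seen:.0%}")
```

No targeting of central users is built into the sampling here: the pruning of the periphery falls out of uniform message sampling alone, because prolific users have many chances to be captured and one-off users have only one.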
Researchers have been very imaginative in finding ways to analyze online communication, especially when it is as constrained as Twitter’s 140-character limit. However, the pace at which research on digital technologies moves prevents the consolidation of standards and procedures for data collection. Creating those standards will be necessary if we are to integrate research outputs and engage in cumulative research and theory building. To the extent that digital technologies are creating more channels for communication and increasingly mediating exposure to information, this is an important item on the research agenda of communication scholars, at least until the Library of Congress finds a way to make the full stream of online communication accessible and researchable. That would eliminate the sampling problem, but create other challenges, such as how to mine the vast amount of information to identify the patterns of interest.