The Promise—and Pitfalls—of Researching Extremism Online

Platforms Are Dynamic and Opaque
Extremism researchers suffer from the problems of too much and too little data simultaneously. Social media platforms have mushroomed, and researchers can now access previously unfathomable quantities of data: Every day, 500 million tweets are sent. Every minute, 500 hours of video are uploaded to YouTube. Brandwatch, a popular social media analytics firm, sells access to 1.4 trillion historical posts captured across multiple platforms. This much information makes it difficult to distinguish the signal from the noise.

Staying abreast of trends in online communities isn’t easy either. Researchers need to be able to predict where “deplatformed” users—those banned for violating a platform’s user agreement—will migrate, and under what names. Research on a specific set of platforms can quickly become outdated as accounts open, close, reopen under new names, or shift to new platforms. This fluidity makes it extremely difficult to study behavior over time and identify trends. The internet itself is also evolving: today’s internet differs from yesterday’s, and radically from the internet of a decade ago.

Further, it is nearly impossible to account for the influence of platforms’ proprietary algorithms, which determine what content users see, because social media companies are not transparent about how these algorithms work. Algorithms can distort users’ online behavior and steer them toward content they might not otherwise have sought out.

Once the right platforms have been identified, researchers still need to figure out how to access the requisite data. Counterintuitively, this can be most difficult for mainstream platforms. Large social media companies such as Facebook and TikTok are not obligated to share data with researchers, and several have been known to cut off or reduce independent researchers’ access to their platforms. When they do share data, it often represents only a fraction of the content hosted on their platforms.

As a result, research on extremist use of social media has been skewed toward platforms like Twitter, which historically granted researchers the greatest access. Admittedly, such popular platforms host plenty of extremist content despite their content moderation policies. In our recent research for the State Department, we identified a community of 300,000 individual Twitter users who employed language consistent with a set of personality traits—known as the Dark Triad—that correlates with violent behaviors. Those 300,000 users represented more people than the combined user bases of the extremist-leaning platforms Gab and Stormfront. But if we only study Twitter, we cannot answer basic questions like: Is someone more likely to encounter extremist material on Facebook than on YouTube or Twitter? Are there more extremists on Facebook than on Twitter?

Inconsistent Ideas of What’s Extremist
This brings us to an even thornier question: What qualifies as extremist content? The definition is hotly debated by policymakers and researchers alike. Without a common standard, researchers often apply their own definitions, making it difficult to knit together existing analysis.

Capturing extremist content, once defined, introduces more difficulties. Researchers must contend with the constant flood of new material—and the likelihood that existing content may disappear at any moment. Although it is difficult to permanently delete content from the internet, the visible surface is ephemeral. This is particularly true of extremist material, which may be removed by platform moderators or, on most platforms, by users themselves.

Research in this area also raises important ethical concerns. Scraping content from some platforms violates their terms of use, for instance. Researchers must also decide whether to use data that has been illicitly acquired and published by hackers, particularly when it may compromise users’ personal information.

The unique nature of speech on social media platforms can also be a problem. Slang runs rampant online, and the internet has its own set of acronyms (IYKYK). Language-processing software can sometimes account for platform-specific language, but such programs are often baselined on a large platform like Twitter, so they can be ineffective on niche platforms with their own coded jargon. Extremist discourse compounds the problem by relying heavily on coded language. Researchers cannot go through millions of social media posts manually, and yet pre-trained language-processing tools aren’t tailored to this type of content. In fact, language models like ChatGPT have sometimes been deliberately trained to exclude extremist, violent, and other offensive content. That raises its own ethical questions, as these models are often refined by human reviewers who can be traumatized by exposure to extreme content in the process.
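
To make that limitation concrete, the following is a minimal, purely illustrative Python sketch (not a tool used in our research) of why a scorer baselined on a mainstream platform’s vocabulary can miss a niche platform’s coded jargon, and why researchers often supplement such tools with hand-curated lexicons. Every term, name, and threshold below is an invented placeholder.

    # A toy illustration, not a research pipeline: a scorer "trained" on
    # mainstream-platform vocabulary misses coded jargon, while a hand-curated
    # lexicon catches it. All terms are invented placeholders.

    # Offensive vocabulary a generic, mainstream-baselined model might know.
    MAINSTREAM_VOCABULARY = {"slur_a", "slur_b", "threat_phrase"}

    # Coded jargon specific to a niche platform, curated by subject-matter
    # experts (placeholders here; real lexicons require careful annotation).
    NICHE_CODED_LEXICON = {"coded_term_1", "coded_term_2"}

    def mainstream_score(post: str) -> float:
        """Return the fraction of tokens flagged by the mainstream vocabulary."""
        tokens = post.lower().split()
        hits = sum(token in MAINSTREAM_VOCABULARY for token in tokens)
        return hits / max(len(tokens), 1)

    def flag_post(post: str, threshold: float = 0.1) -> bool:
        """Flag a post if either the generic scorer or the curated lexicon fires."""
        lexicon_hit = any(term in post.lower() for term in NICHE_CODED_LEXICON)
        return mainstream_score(post) >= threshold or lexicon_hit

    # A post written entirely in coded jargon scores 0.0 with the mainstream
    # scorer but is still flagged once the curated lexicon is consulted.
    post = "totally normal discussion of coded_term_1 and coded_term_2"
    print(mainstream_score(post), flag_post(post))  # prints: 0.0 True

In practice, such lexicons go stale quickly as coded vocabulary evolves, which is one reason the problem persists even when researchers invest in manual curation.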

In the end, researchers are left trying to parse messy content, messy user data, and messy platform inputs. While society hungers for effective policy solutions to—or even clear understanding of—extremism online, this messy combination produces suggestive, rather than strong, conclusions.

Heather Williams, a senior policy researcher at the nonprofit, nonpartisan RAND Corporation, is associate director of the International Security and Defense Policy Program. Alexandra T. Evans is a policy researcher at RAND. Luke Matthews is a behavioral and social scientist at RAND. This article is published courtesy of RAND.