
Needles in a Haystack

Project Group: Analysis of Opinion-Forming on the Internet

Source: Stieglitz, S. & Dang-Xuan, L. (2012): Social Media and Political Communication - A Social Media Analytics Framework. Social Network Analysis and Mining (SNAM)

Social media have changed the world of public communication. Applications such as Twitter, Facebook, and blogs have expanded the opportunities for public communication, and individual and collective actors of all types now participate as speakers in public discourse. On the Internet, journalistic “gatekeepers” are no longer the mediators of topics and opinions. This development has an impact: the processes and structures of current discourses are changing, and topics and opinions take different routes than in traditional media. Scientists from Potsdam are taking part in a project group that will examine this phenomenon. They want to develop and evaluate automated procedures for analysing large amounts of digital text from online discourses. This might answer important questions in communication science.

According to which patterns do topics spread on the Internet? How does opinion-forming take place? Science has not yet answered these questions. This is why scientists at the universities of Potsdam, Münster, Munich, and Stuttgart-Hohenheim are examining the course of political communication on the Internet. The Federal Ministry of Education and Research (BMBF) is funding the project with a total of 800,000 euros until summer 2015.

Facebook, Twitter, and blogs will be examined in more detail. The participating teams are confronted with a lot of questions: Do citizens really gain influence in the democratic process with these media? Or does the power over public opinion remain in the hands of a few? How does the media environment influence the quality of the discussions? They want to find the answers to these questions with the help of new interdisciplinary methods.

The research group “Analysis of Discourses in Social Media” wants to develop exactly these methods. The prospects of success are very good, and the conditions are excellent: scientists from various disciplines are collaborating in the project, including leading experts in information systems from Münster, computational linguists from Potsdam, and communication scientists from Munich and Stuttgart-Hohenheim. Once the methods have been developed, it will be possible to analyse and evaluate large amounts of text from the Internet semi-automatically. The idea is also to record the networks between postings, i.e. the connections resulting from hyperlinks or from sending short text messages, so-called tweets. This would help considerably in understanding the channels through which information spreads and the influence of individual nodes. “We want to analyse the different types of social media texts with prototype software and structure them on the macro level. To do this, we first have to analyse the sentiments and attitudes, as well as the quality of a discourse. This will provide us with an insight into the type of the respective utterance and the existing dynamics on the micro level of individual tweets. The analysis will be extended to entire discourses by combining automated and manual procedures,” says Professor Manfred Stede, explaining the approach. The computational linguist heads Potsdam’s part of the project. The project may provide new insights and ideas for scholars in the humanities and social sciences in particular.

Much will depend on how well Stede’s team can do its job: the customized application of computational methods to Internet texts. Only if this works out will they uncover the “secret” of emerging networks of contributions on a specific topic. Stede and two of his PhD students are looking for the proverbial needle in the haystack. Their aim is to create a set of instruments that will signal whether social media texts are assessments of events or persons. This tool has to recognize that the sentence “I watched the news on BBC.” does not contain any assessment. On the other hand, it has to detect the doubly negative assessment in the sentence “Gauck is even worse than Wulff”. This presents a real challenge to the scientists. They intend to classify these expressions of subjectivity in a way that makes it possible to quantify overt or merely encoded attitudes and, possibly, the reference to other postings. Furthermore, they want to prepare methods for determining the quality of social media statements and comments. Both of these methods serve to support the human analyst with automated means.

Manfred Stede is aware of the great expectations, but the pressure does not put him on edge. On the contrary, the project work fascinates him. It is exciting, he says, to adapt the existing tools, i.e. the systems for language identification and for determining parts of speech or syntax, to the new types of texts. The computer scientist is looking forward to contributing to the fully automatic identification of irony and sarcasm. In computational linguistics, preliminary approaches to the automatic identification of irony in Internet communication already exist for several languages. “They draw on features like certain emoticons and abbreviations, the excessive use of punctuation marks and lexical exaggerations. It seems that articles also play a role in some languages,” Stede explains. Now it is important to find out to what extent these and other features exist in German tweets. “Here we will break new ground, at least when it comes to the automatic analysis.”
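The kind of surface features Stede mentions can be illustrated with a minimal Python sketch. It merely counts a few cues in a tweet; the patterns and the function name irony_cues are assumptions made for this illustration and are not the project's feature set. In particular, treating lengthened words such as "sooo" as a "lexical exaggeration" is one possible reading of that term.

```python
import re

# Hypothetical cue patterns, chosen for illustration only.
EMOTICON = re.compile(r"[:;=]-?[)(DPp]")        # e.g. :-)  ;)  :D
REPEATED_PUNCT = re.compile(r"[!?]{2,}|\.{3,}")  # e.g. !!!  ?!  ...
ELONGATED = re.compile(r"(\w)\1{2,}")            # same character 3+ times in a row


def irony_cues(tweet: str) -> dict:
    """Count simple surface cues that may (or may not) hint at irony."""
    return {
        "emoticons": len(EMOTICON.findall(tweet)),
        "repeated_punctuation": len(REPEATED_PUNCT.findall(tweet)),
        "elongated_words": len(ELONGATED.findall(tweet)),
    }


if __name__ == "__main__":
    print(irony_cues("Sooo toll!!! ;-)"))
```

Counts like these would only be candidate inputs for a classifier; whether they actually signal irony in German tweets is exactly the open question described above.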

During the first project phase, Stede’s group carried out a kind of trial run for the actual analysis. The group received a large set of data about the “case” of the former Federal President Christian Wulff, which they used to test the suitability of the existing tools. These were all Twitter data: 253,172 tweets comprising almost four million words. However, the number of contributions to be analysed became somewhat smaller: foreign-language texts and tweets referring to other people named Wulff were omitted, as were URLs and duplicates. More than one million words remained.
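Two of the cleaning steps mentioned here, removing URLs and dropping duplicates, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the group's actual pipeline: the function name clean_tweets and the URL pattern are hypothetical, duplicates are detected by a plain case-insensitive comparison, and the language filter and the disambiguation of the name Wulff are left out.

```python
import re

# Assumed pattern: anything that starts with http:// or https://
URL = re.compile(r"https?://\S+")


def clean_tweets(tweets):
    """Strip URLs, collapse whitespace, and drop exact duplicates (ignoring case)."""
    seen = set()
    cleaned = []
    for text in tweets:
        text = URL.sub("", text)       # remove URLs
        text = " ".join(text.split())  # collapse runs of whitespace
        key = text.casefold()          # case-insensitive duplicate key
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    sample = [
        "Wulff tritt zurück http://t.co/abc",
        "wulff tritt zurück",
        "Neues Thema",
    ]
    print(clean_tweets(sample))
```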

“We have made good progress in our research,” Stede states. “First we have classified certain linguistic phenomena to be able to use our tools and to modify them if necessary.” This was unavoidable because Twitter texts have specific characteristics. In the meantime, a whole catalogue of “interference factors” has accumulated. It includes morphological, lexical, syntactic, and semantic problems, but also typical spelling mistakes, smileys, and emoticons. PhD student Uladzimir Sidorenko has mainly taken care of these phenomena. He is responsible for normalizing the text data and for exploring possible “pitfalls” for conventional computer programs. The Belarusian shares his work on the tools with PhD student Andreas Peldszus, who deals with coreference resolution. For computational linguists this means, among other things, automatically identifying the connection between a pronoun and a preceding noun.

After completing the preliminary study, the team began to work on the “actual” data that had been collected in the meantime for the collaborative project: discourses on the topic of the ‘energy turnaround’. Here, too, they are focusing on Twitter first. The researchers try to group the tweets automatically according to subtopics and to identify the opinions expressed. Combining the two, they can extract positive and negative comments on “wind turbines”, for example. Another priority lies in determining the quality of discourse. When Twitter users answer one another, do they really talk to each other? Or against each other? Or do they talk at cross purposes? Does a tweet contribute to the development of a discourse’s content, or does it make no headway? The formal analysis of these and other aspects of “quality” is a completely new task that has only recently been taken up at an international level.
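How subtopic grouping and opinion detection can be combined may be easier to see in a toy Python sketch. Everything in it is an assumption made for illustration: the tiny keyword lists, the names SUBTOPICS, POLARITY, label and tally, and the simple word-counting approach itself. The article does not say which methods the team actually uses.

```python
from collections import Counter

# Hypothetical toy lexicons, invented for this illustration only.
SUBTOPICS = {
    "wind turbines": {"windrad", "windräder", "windkraft"},
    "solar": {"solar", "photovoltaik"},
}
POLARITY = {"gut": 1, "super": 1, "schlecht": -1, "hässlich": -1}


def tokenize(text):
    """Lower-case the words and strip surrounding punctuation and '#'."""
    return [tok.strip(".,!?:;\"'#").casefold() for tok in text.split()]


def label(text):
    """Return a (subtopic, opinion) pair for one tweet."""
    tokens = tokenize(text)
    topic = next(
        (name for name, words in SUBTOPICS.items() if words & set(tokens)),
        None,
    )
    score = sum(POLARITY.get(tok, 0) for tok in tokens)
    opinion = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return topic, opinion


def tally(tweets):
    """Count how often each (subtopic, opinion) combination occurs."""
    return Counter(label(t) for t in tweets)


if __name__ == "__main__":
    print(tally(["Windräder sind hässlich!", "#Windkraft ist super"]))
```

A word list like this would miss irony, negation, and comparisons, which is part of why the task is described as a challenge.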

 

The Project

Analysis of Opinion-Forming on the Internet
(“Analyse von Diskursen in Social Media”)
Project coordinator: Professor Stefan Stieglitz (Westfälische Wilhelms-Universität Münster)
Duration: 2012 to 2015
Financed by: Bundesministerium für Bildung und Forschung (BMBF)
www.social-media-analytics.org./de

Selection of Twitter-specific problems

Hashtags
Hashtags are words preceded by a # sign. The tags mark particularly interesting and recurring Twitter topics. Computer programmes developed for the analysis of conventional texts do not recognize that “#Wulff” and “Wulff” refer to the same word and person.
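One conceivable way to handle this, shown here only as a sketch, is to remove the # sign before the text is passed to conventional tools. The function name strip_hashtag_marks is hypothetical, and a real normalization step might instead keep the information that a word was used as a tag.

```python
import re

# A '#' immediately followed by word characters (letters, digits, underscore).
HASHTAG = re.compile(r"#(\w+)")


def strip_hashtag_marks(text: str) -> str:
    """Replace each '#word' with 'word' so that later tools see the plain token."""
    return HASHTAG.sub(r"\1", text)


if __name__ == "__main__":
    print(strip_hashtag_marks("#Wulff soll im Amt bleiben"))
```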

Colloquial language and slang words
These are common on Twitter because people use it for everyday conversations and to exchange opinions. German example: „#Wulff soll im Amt bleiben und wuppen für die Kohle!“ (#Wulff should stay in office and bust his back for the dough.)

Intersentential irony
This occurs when two or more statements together produce an ironic connotation. German example: „Ich lese immer Frau Merkel stellt sich hinter #Wulff … Er steht am Abgrund, da ist dahinter besser.“ (I keep reading that Mrs Merkel stands behind Wulff … He is standing at the edge of the abyss, so behind him is the better place to be.)

The Scientists

Professor Manfred Stede studied computer science and linguistics at the Technische Universität Berlin and received his PhD in computer science from the University of Toronto in 1996. Since 2001 he has been Professor of Applied Computational Linguistics at the University of Potsdam.
Contact
Universität Potsdam
Department Linguistik
Karl-Liebknecht-Straße 24–25, 14476 Potsdam OT Golm
stede@ling.uni-potsdam.de

Uladzimir Sidorenko studied German language and literature at the Minsk State Linguistic University and graduated with a Master’s degree in computational linguistics in 2007.
Contact: Uladzimir.Sidarenka@uni-potsdam.de

Andreas Peldszus studied computational linguistics and philosophy at the University of Potsdam and graduated with a Master’s degree in 2011.
Contact: peldszus@uni-potsdam.de

Text: Petra Görlich, Online-Editing: Julia Schwaibold, Translation: Susanne Voigt.
