Comparison of Clustering Algorithms for the Identification of Topics on Twitter

Marjori N. M. Klinczak; Celso A. A. Kaestner

Authors

Marjori N. M. Klinczak, University of Technology of Paraná
Celso A. A. Kaestner, University of Technology of Paraná

Keywords:

text processing, clustering algorithms, NMF algorithm, Twitter topics identification

Abstract

Topic identification in social networks has become an important task in event detection, particularly when global communities are affected. Text processing techniques and machine learning algorithms have been used extensively to address this problem. In this paper we compare four clustering algorithms: k-means, k-medoids, DBSCAN, and Non-negative Matrix Factorization (NMF). We use them to detect topics in textual messages obtained from Twitter. The algorithms were applied to a database initially composed of tweets with hashtags related to the recent Nepal earthquake, which served as the initial context. The results suggest that NMF performs best, producing simpler clusters that are also easier to interpret.
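As an illustration of how NMF can act as a clustering method for short messages, the sketch below factors a small term-by-tweet count matrix V into non-negative factors W and H using multiplicative updates, then assigns each tweet to the factor with the largest weight in H. It is not the authors' implementation: the toy tweets, the raw-count weighting, the helper names, and the choice of k = 2 are all assumptions made for illustration, and only the Python standard library is used so the example stays self-contained.

```python
import random

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (terms x docs) as W (terms x k) times H (k x docs)
    using multiplicative updates for the Frobenius objective."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random start so the updates never divide by zero.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[a][j] * WtV[a][j] / (WtWH[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][a] * VHt[i][a] / (WHHt[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H

# Hypothetical toy messages (not taken from the paper's data set).
tweets = [
    "nepal earthquake relief teams arrive",
    "earthquake victims across nepal need relief",
    "donate to nepal earthquake relief",
    "football final match tonight",
    "great goal in the football match",
    "football fans celebrate the final goal",
]

vocab = sorted({w for t in tweets for w in t.split()})
# Term-by-tweet matrix of raw counts (assumption: no stemming or stop-word removal).
V = [[t.split().count(w) for t in tweets] for w in vocab]

k = 2  # assumed number of topics
W, H = nmf(V, k)

# Each tweet joins the topic with the largest coefficient in its column of H.
labels = [max(range(k), key=lambda a: H[a][j]) for j in range(len(tweets))]
# The highest-weighted rows of each column of W describe that topic.
top_terms = [[vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -W[i][a])[:3]]
             for a in range(k)]

print(labels)
print(top_terms)
```

Because W and H are non-negative, each topic can be read directly from its highest-weighted terms, which is one plausible reason NMF clusters are easy to interpret. On a toy input like this the two groups of tweets would typically separate, but the result depends on the random start and is not guaranteed. In practice a library implementation such as scikit-learn's `sklearn.decomposition.NMF` would be the usual choice.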


Author Biography

Marjori N. M. Klinczak, University of Technology of Paraná

Jenny Torres is acting vice-dean of the Faculty of Systems Engineering at the Escuela Politécnica Nacional (EPN). She obtained her PhD in Computer Science from Pierre and Marie Curie University in France. In 2009 she obtained her M.Sc. in Computer Science from Paris-Est Créteil University. Before becoming a SENESCYT scholarship holder, she completed a master's degree in Network and Telecommunications Management at the Escuela Politécnica del Ejército, and in 2006 she graduated as a Systems Engineer from EPN.
Her research focuses on computer security, network management, identity management, wireless networks, and open infrastructures. She was a visiting lecturer for six months at the University of Paraná, Curitiba, Brazil, and is a member of the Phare and NR2 research teams in France and Brazil, respectively.
