Comparison of Clustering Algorithms for the Identification of Topics on Twitter
Abstract
Topic identification in social networks has become an important task in event detection, particularly when global communities are affected. Text processing techniques and machine learning algorithms have been used extensively to address this problem. In this paper we compare four clustering algorithms (k-means, k-medoids, DBSCAN, and non-negative matrix factorization, NMF) for detecting topics in textual messages obtained from Twitter. The algorithms were applied to a dataset initially composed of tweets whose hashtags relate to the recent Nepal earthquake, which served as the initial context. The results suggest that NMF performs best, producing simpler clusters that are also easier to interpret.
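As a rough illustration of how NMF can group short messages into topics, the sketch below factorizes a small term-document count matrix V into non-negative factors W (terms by topics) and H (topics by documents), then assigns each message to the topic with the largest weight in H. Everything specific here is an assumption made for illustration: the toy tweets are invented, the function names are hypothetical, and the multiplicative update rule (commonly attributed to Lee and Seung) is only one of several ways to compute NMF. It is not the authors' implementation, and it omits preprocessing steps such as stop-word removal, stemming, or TF-IDF weighting that a real pipeline would likely include.

```python
import random

# Invented toy messages, for illustration only (not the paper's data).
TWEETS = [
    "nepal earthquake rescue teams arrive",
    "earthquake rescue teams search rubble",
    "rescue teams nepal earthquake victims",
    "donate relief fund nepal",
    "relief fund donate today",
    "donate nepal relief fund now",
]

def build_term_doc_matrix(docs):
    """Return (vocab, V) where V[i][j] counts term i in document j."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    V = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w in d.lower().split():
            V[index[w]][j] += 1.0
    return vocab, V

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (m x n) as W (m x k) times H (k x n), all non-negative."""
    rng = random.Random(seed)
    m, n = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # Multiplicative update for H: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(k)]
        # Multiplicative update for W: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(m)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

def assign_clusters(H):
    """Label each document with the topic that has the largest weight in H."""
    k, n = len(H), len(H[0])
    return [max(range(k), key=lambda a: H[a][j]) for j in range(n)]

def top_terms(W, vocab, topic, count=3):
    """Return the highest-weighted terms of one topic (one column of W)."""
    order = sorted(range(len(vocab)), key=lambda i: -W[i][topic])
    return [vocab[i] for i in order[:count]]

if __name__ == "__main__":
    vocab, V = build_term_doc_matrix(TWEETS)
    W, H, errors = nmf(V, k=2)
    for topic in range(2):
        print("topic", topic, top_terms(W, vocab, topic))
    print("labels", assign_clusters(H))
```

Because each topic is read directly from the highest-weighted terms in a column of W, the factors lend themselves to inspection, which is consistent with the interpretability the abstract attributes to NMF. The number of topics k must be chosen in advance, and results depend on the random initialization.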
This article is published by LAJC under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This means that non-exclusive copyright is transferred to the National Polytechnic School. The author(s) give their consent to the Editorial Committee to publish the article in the issue that best suits the interests of this journal.