Comparison of Clustering Algorithms for the Identification of Topics on Twitter
Keywords:
text processing, clustering algorithms, NMF algorithm, Twitter topic identification
Abstract
Topic identification in social networks has become an important task in event detection, particularly for events that affect global communities. Text processing techniques and machine learning algorithms have been used extensively to address this problem. In this paper we compare four clustering algorithms, k-means, k-medoids, DBSCAN and NMF (Non-negative Matrix Factorization), for detecting topics in textual messages obtained from Twitter. The algorithms were applied to a dataset initially composed of tweets with hashtags related to the recent Nepal earthquake. The results suggest that NMF performs best of the four, producing simpler clusters that are also easier to interpret.
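To make the NMF-based clustering idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hand-made document-term count matrix (the vocabulary and counts are hypothetical, chosen only for illustration) and uses the standard multiplicative update rules for the Frobenius objective, written in plain Python. Each document is assigned to the topic with the largest weight in its row of W, and each topic is described by its highest-weighted terms in H.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=300, seed=0, eps=1e-9):
    """Factor V (docs x terms) as W (docs x rank) times H (rank x terms)
    using multiplicative updates; all entries stay non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n_docs)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(n_terms)] for _ in range(rank)]
    initial_error = frobenius_error(v, w, h)
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_docs)]
    return w, h, initial_error

# Hypothetical toy data: rows are tweets, columns are term counts.
VOCAB = ["earthquake", "nepal", "rescue", "donate", "relief", "fund"]
V = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
]

w, h, initial_error = nmf(V, rank=2)

# Cluster label of each tweet: the topic with the largest weight in W.
labels = [max(range(2), key=lambda k: row[k]) for row in w]

# Describe each topic by its three highest-weighted terms in H.
for k, topic in enumerate(h):
    top = sorted(range(len(VOCAB)), key=lambda j: topic[j], reverse=True)[:3]
    print(f"topic {k}:", [VOCAB[j] for j in top])
print("labels:", labels)
```

For real tweet collections, a library implementation operating on a TF-IDF matrix would be the more practical choice. The plain-Python version here is only meant to show why NMF clusters tend to be easy to read: each topic is a non-negative combination of terms.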
License
Copyright notice
Authors who publish in this journal accept the following conditions:
- Authors retain copyright and grant the journal the right of first publication, with the work licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license, which allows third parties to use the published work provided they credit the work's authorship and its first publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the version of the article published in this journal (e.g., including it in an institutional repository or publishing it in a book), provided they clearly state that the work was first published in this journal.
- Authors are permitted and encouraged to share their work online (for example, in institutional repositories or on personal web pages) before and during the manuscript submission process, as this can lead to productive exchanges and to greater and faster citation of the published work.
Disclaimer
LAJC shall in no case be liable for any direct, indirect, incidental, punitive, or consequential copyright infringement claim related to articles that have been submitted for evaluation or published in any issue of this journal. More information is available in our Disclaimer Notice.