Comparison of Clustering Algorithms for Knowledge Discovery in Social Media Publications: A Case Study of Mental Health Analysis

  1. Couto, Manuel
  2. Parapar, Javier
  3. Losada, David E.
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 69-81

Type: Article

Abstract

In the era of social media, user-generated content is key to detecting early signs of mental disorders. In this study, we cluster posts by topic to analyse content from the Reddit platform. Our primary goal is to use clustering techniques to uncover core themes, with a focus on identifying themes shared among groups of users suffering from mental health conditions such as depression, anorexia, gambling addiction, and self-harm. Our findings show that certain clusters are more cohesive, for example containing a higher proportion of texts written by people with depression. In addition, we identified subreddits that are strongly linked to texts written by depressed users. These findings shed light on how online interactions and the topics discussed in subreddits reflect aspects of mental health, paving the way for future research and interventions aimed at disorder prevention.
