Comparison of Clustering Algorithms for Knowledge Discovery in Social Media Publications: A Case Study of Mental Health Analysis
- Couto, Manuel
- Parapar, Javier
- Losada, David E.
ISSN: 1135-5948
Year of publication: 2024
Issue: 73
Pages: 69-81
Type: Article
More publications in: Procesamiento del lenguaje natural
Abstract
In the age of social media, user-generated content is critical for detecting early signs of mental disorders. In this study, we use thematic clustering to analyze the content of the social media platform Reddit. Our primary goal is to use clustering techniques for comprehensive topic discovery, with a focus on identifying common themes among user groups suffering from mental illnesses such as depression, anorexia, gambling addiction, and self-harm. Our findings show that certain clusters are more cohesive, e.g., with a higher proportion of texts indicating depression. Furthermore, we discovered subreddits that are strongly linked to texts from the depressed user group. These findings shed light on how online interactions and subreddit themes may impact users’ mental health, paving the way for future research and more targeted interventions in the field of online mental health.
Bibliographic References
- Ankerst, M., M. M. Breunig, H.-P. Kriegel, and J. Sander. 1999. Optics: Ordering points to identify the clustering structure. ACM Sigmod record, 28(2):49–60.
- Aragon, M. E., A. P. Lopez-Monroy, L.-C. G. Gonzalez-Gurrola, and M. Montes. 2021. Detecting mental disorders in social media through emotional patterns-the case of anorexia and depression. IEEE Transactions on Affective Computing.
- Aragón, M. E., A. P. López-Monroy, and M. Montes-y-Gómez. 2019. Inaoe-cimat at erisk 2019: Detecting signs of anorexia using fine-grained emotions. In CLEF (Working Notes).
- Arthur, D. and S. Vassilvitskii. 2006. k-means++: The advantages of careful seeding. Technical report, Stanford.
- Bird, S., E. Klein, and E. Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- Bolla, M. 2013. Spectral clustering and biclustering: Learning large graphs and contingency tables. John Wiley & Sons.
- Calinski, T. and J. Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods, 3(1):1–27.
- Chancellor, S. and M. De Choudhury. 2020. Methods in predictive techniques for mental health status on social media: a critical review. NPJ digital medicine, 3(1):43.
- Clatworthy, J., D. Buick, M. Hankins, J. Weinman, and R. Horne. 2005. The use and reporting of cluster analysis in health psychology: A review. British journal of health psychology, 10(3):329–358.
- Couto, M., A. Pérez, and J. Parapar. 2022. Temporal word embeddings for early detection of signs of depression. In Proceedings of the CIRCLE (Joint Conference of The Information Retrieval Communities in Europe).
- Crestani, F., D. E. Losada, and J. Parapar. 2022. Early Detection of Mental Health Disorders by Social Media Monitoring: The First Five Years of the ERisk Project, volume 1018. Springer Nature.
- Croft, W. B., D. Metzler, and T. Strohman. 2010. Search engines: Information retrieval in practice, volume 520. Addison-Wesley Reading.
- Davies, D. L. and D. W. Bouldin. 1979. A cluster separation measure. IEEE transactions on pattern analysis and machine intelligence, (2):224–227.
- Day, W. H. and H. Edelsbrunner. 1984. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of classification, 1(1):7–24.
- Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1–22.
- Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ding, C. and X. He. 2004. K-means clustering via principal component analysis. In Proceedings of the twenty-first international conference on Machine learning, page 29.
- Emmons, S., S. Kobourov, M. Gallant, and K. Börner. 2016. Analysis of network clustering algorithms and cluster quality metrics at scale. PloS one, 11(7):e0159161.
- Ester, M., H.-P. Kriegel, J. Sander, X. Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, volume 96, pages 226–231.
- Ezugwu, A. E., A. M. Ikotun, O. O. Oyelade, L. Abualigah, J. O. Agushaka, C. I. Eke, and A. A. Akinyelu. 2022. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110:104743.
- Fahad, A., N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras. 2014. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE transactions on emerging topics in computing, 2(3):267–279.
- Frey, B. J. and D. Dueck. 2007. Clustering by passing messages between data points. science, 315(5814):972–976.
- Gao, C. X., D. Dwyer, Y. Zhu, C. L. Smith, L. Du, K. M. Filia, J. Bayer, J. M. Menssink, T. Wang, C. Bergmeir, et al. 2023. An overview of clustering methods with guidelines for application in mental health research. Psychiatry Research, page 115265.
- Ghaharian, K., B. Abarbanel, D. Phung, P. Puranik, S. Kraus, A. Feldman, and B. Bernhard. 2022. Applications of data science for responsible gambling: a scoping review. International Gambling Studies, pages 1–24.
- Hubert, L. and P. Arabie. 1985. Comparing partitions. Journal of classification, 2:193–218.
- Ikeda, K., G. Hattori, C. Ono, H. Asoh, and T. Higashino. 2013. Twitter user profiling based on text and community mining for market analysis. Knowledge-Based Systems, 51:35–47.
- Kadhim, A. I., Y.-N. Cheah, and N. H. Ahamed. 2014. Text document preprocessing and dimension reduction techniques for text document clustering. In 2014 4th international conference on artificial intelligence with applications in engineering and technology, pages 69–73. IEEE.
- Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- Losada, D. E., F. Crestani, and J. Parapar. 2017. erisk 2017: Clef lab on early risk prediction on the internet: experimental foundations. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11–14, 2017, Proceedings 8, pages 346–360. Springer.
- MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability, pages 281–297. University of California Los Angeles LA USA.
- Mahdi, M. A., K. M. Hosny, and I. Elhenawy. 2021. Scalable clustering algorithms for big data: A review. IEEE Access, 9:80015–80027.
- Marutho, D., S. H. Handaka, E. Wijaya, et al. 2018. The determination of cluster number at k-mean using elbow method and purity evaluation on headline news. In 2018 international seminar on application for technology of information and communication, pages 533–538. IEEE.
- Murtagh, F. and P. Contreras. 2012. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97.
- Nguyen, T., A. Yates, A. Zirikly, B. Desmet, and A. Cohan. 2022. Improving the generalizability of depression detection by leveraging clinical questionnaires. arXiv preprint arXiv:2204.10432.
- Nielsen, F. 2016. Hierarchical clustering. Introduction to HPC with MPI for Data Science, pages 195–211.
- Palacio-Niño, J.-O. and F. Berzal. 2019. Evaluation metrics for unsupervised learning algorithms. arXiv preprint arXiv:1905.05667.
- Parapar, J., P. Martín-Rodilla, D. E. Losada, and F. Crestani. 2022. erisk 2022: pathological gambling, depression, and eating disorder challenges. In Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part II, pages 436–442. Springer.
- Peres, F., E. Fallacara, L. Manzoni, M. Castelli, A. Popovic, M. Rodrigues, and P. Estevens. 2021. Time series clustering of online gambling activities for addicted users’ detection. Applied Sciences, 11(5):2397.
- Radford, A., J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Rand, W. M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850.
- Reynolds, D. A. 2009. Gaussian mixture models. Encyclopedia of biometrics, 741:659–663.
- Ríssola, E. A., D. E. Losada, and F. Crestani. 2021. A survey of computational methods for online mental state assessment on social media. ACM Trans. Comput. Healthcare, 2(2), mar.
- Rousseeuw, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.
- Shensa, A., J. E. Sidani, M. A. Dew, C. G. Escobar-Viera, and B. A. Primack. 2018. Social media use and depression and anxiety symptoms: A cluster analysis. American journal of health behavior, 42(2):116–128.
- Strehl, A. and J. Ghosh. 2002. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of machine learning research, 3(Dec):583–617.
- Völske, M., M. Potthast, S. Syed, and B. Stein. 2017. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63.
- Yazdavar, A. H., H. S. Al-Olimat, M. Ebrahimi, G. Bajaj, T. Banerjee, K. Thirunarayan, J. Pathak, and A. Sheth. 2017. Semisupervised approach to monitoring clinical depressive symptoms in social media. In Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, pages 1191–1198.
- Zhang, T., R. Ramakrishnan, and M. Livny. 1996. Birch: an efficient data clustering method for very large databases. ACM sigmod record, 25(2):103–114.