BERTopic vs. LDA: A Comparative Analysis for Identifying Depression Topics in Reddit Messages

Thanesorn Khiewboriboon, Sorawit Taochoo, Arisara Yokyorkhun, Nantapong Keandoungchun

Abstract

Depression and other mental health problems are an increasingly serious concern that affects quality of life, particularly among adolescents and working-age adults. Many people express their feelings and symptoms on social media, which makes these platforms a useful source of data for computational analysis. This study compares the performance of two topic modeling algorithms, Latent Dirichlet Allocation (LDA) and BERTopic, on a dataset of 6,397 depression-related posts from Reddit. Evaluation used three metrics: Purity Score, Entropy Score, and Rand Index (RI). BERTopic outperformed LDA on all three, with a higher Purity Score (39.06% vs. 34.38%), lower Entropy (1.93% vs. 2.11%), and higher RI (66.84% vs. 65.47%). These results suggest that BERTopic produces topic clusters that are more accurate and align more closely with the ground-truth labels. The study is limited by its use of only 10% of the dataset for testing, which may reduce how comprehensive the evaluation is. Future work should therefore enlarge the test set and consider Thai-language text, to broaden the applicability of these methods in mental health.
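The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes each post is assigned to a single cluster (for example, its dominant topic) and has one ground-truth label. The abstract does not state the exact definitions used in the study, so the log base, the size weighting, and the function names below are assumptions that follow common textbook formulations. They are not the authors' implementation, and the toy data is purely illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log2


def _label_counts(clusters, labels):
    """Group ground-truth label counts by cluster id."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of items carrying the majority label of their cluster."""
    counts = _label_counts(clusters, labels)
    return sum(max(cnt.values()) for cnt in counts.values()) / len(labels)


def entropy(clusters, labels):
    """Size-weighted mean label entropy per cluster (assumed base 2)."""
    counts = _label_counts(clusters, labels)
    n = len(labels)
    total = 0.0
    for cnt in counts.values():
        size = sum(cnt.values())
        h = -sum((k / size) * log2(k / size) for k in cnt.values())
        total += (size / n) * h
    return total


def rand_index(clusters, labels):
    """Share of item pairs where clustering and labels agree
    (both together, or both apart)."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (labels[i] == labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy data: six posts, two clusters, three true labels.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "c"]

print(purity(toy_clusters, toy_labels))
print(entropy(toy_clusters, toy_labels))
print(rand_index(toy_clusters, toy_labels))
```

Higher purity and Rand Index, and lower entropy, indicate clusters that match the ground-truth labels more closely, which is the direction of the comparison reported above.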

DOI: 10.14416/j.kmutnb.2026.04.001

ISSN: 2985-2145

The Journal of King Mongkut's University of Technology North Bangkok