Text Classification Using Machine Learning for Thai Official Letters
Abstract
This article aims to identify the most suitable model for multi-class text classification in the Thai official letter domain. In an experimental study, text classifiers were built using WangchanBERTa, a pre-trained Thai language model, alongside popular traditional models, and their performance was compared. All classifiers were tuned and trained on an organizational dataset and evaluated with four metrics: accuracy, precision, recall, and F1-score. The results showed that the WangchanBERTa model outperformed the baseline models, achieving the highest accuracy at 76%. The model can therefore be applied by Thai government organizations to classify types of Thai official letters.
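The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not state how per-class scores are combined, so macro averaging is an assumption here. The function name `multiclass_metrics` and the class labels "A", "B", and "C" are hypothetical and are not taken from the article's dataset.

```python
from collections import Counter


def multiclass_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the article's abstract does not
    say which averaging scheme was used.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for six letters; not data from the article.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
scores = multiclass_metrics(y_true, y_pred)
print(scores)
```

With imbalanced classes, accuracy and macro-averaged F1 can diverge noticeably, which is one reason to report all four metrics rather than accuracy alone.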
Keywords
DOI: 10.14416/j.kmutnb.2024.05.03
ISSN: 2985-2145