Hybrid Subword–Character Representation for Robust Sentiment Classification on Multilingual and Code-Mixed Indonesian Text

Danang Danang; Toni Wijanarko Adi Putra

doi:10.55606/jcsr-politama.v2i6.6153

Authors

Danang Danang Universitas Sains dan Teknologi Komputer
Toni Wijanarko Adi Putra Universitas Sains dan Teknologi Komputer

DOI:

https://doi.org/10.55606/jcsr-politama.v2i6.6153

Keywords:

sentiment analysis, code-mixing, multilingual NLP, robustness, character-level modeling, XLM-R

Abstract

User-generated Indonesian text frequently exhibits code-mixing with English (“Indonglish”), informal spelling, elongation, and keyboard typos. These phenomena break subword tokenization assumptions and may degrade multilingual Transformer performance in deployment. This paper studies a hybrid representation that fuses XLM-R sentence features with a character-level CharCNN branch designed to capture orthographic patterns and mitigate character noise. We evaluate (i) a standard XLM-R fine-tuning baseline, (ii) an ablation that removes the character branch (NusaX only), and (iii) the proposed hybrid model on two datasets: NusaX-Senti (12 regional languages) and Indonglish (Indonesian–English code-mixed sentiment). Beyond clean test performance, we introduce a controlled robustness protocol by injecting character-level perturbations with probability p=0.18 and measuring performance drop. Results show that the XLM-R baseline achieves the best clean Macro-F1 on both datasets, while the hybrid model substantially improves robustness on Indonglish by reducing Macro-F1 drop from 0.030 to 0.007 under noise. We analyze common error confusions and discuss when character-aware features help or harm across languages.
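
The robustness protocol summarized above can be illustrated with a minimal Python sketch. Only the perturbation probability p=0.18 and the use of Macro-F1 drop come from the abstract. The specific edit operations (delete, substitute, duplicate, swap), the function names `perturb`, `macro_f1`, and `robustness_drop`, and the `predict` callable are assumptions made for illustration and may differ from the authors' implementation.

```python
import random
import string


def perturb(text, p=0.18, rng=None):
    """Edit each alphabetic character with probability p.

    The four edit types below are assumed for illustration; the abstract
    only states that character-level noise is injected with p=0.18.
    """
    rng = rng or random.Random(0)
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch.isalpha() and rng.random() < p:
            op = rng.choice(("delete", "substitute", "duplicate", "swap"))
            if op == "delete":
                pass  # drop the character
            elif op == "substitute":
                out.append(rng.choice(string.ascii_lowercase))
            elif op == "duplicate":
                out.extend((ch, ch))  # mimics elongation
            elif op == "swap" and i + 1 < len(chars):
                out.extend((chars[i + 1], ch))  # transpose with next char
                i += 1
            else:
                out.append(ch)  # swap requested on the last character
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and q == label for t, q in zip(y_true, y_pred))
        fp = sum(t != label and q == label for t, q in zip(y_true, y_pred))
        fn = sum(t == label and q != label for t, q in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def robustness_drop(predict, texts, labels, p=0.18, seed=0):
    """Return (clean Macro-F1, noisy Macro-F1, drop = clean - noisy).

    `predict` is a hypothetical callable mapping one string to one label.
    """
    rng = random.Random(seed)
    clean = macro_f1(labels, [predict(t) for t in texts])
    noisy = macro_f1(labels, [predict(perturb(t, p, rng)) for t in texts])
    return clean, noisy, clean - noisy
```

Under this reading, the reported reduction from 0.030 to 0.007 on Indonglish corresponds to the third value returned by `robustness_drop` for the baseline and for the hybrid model, respectively. Fixing the seed keeps the perturbed test set identical across the models being compared.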
