Development of a Bilingual Hate Speech Detection System Using Filipino Reddit Texts

Date of Award

7-1-2023

Document Type

Thesis

Degree Name

Master of Science in Computer Science

First Advisor

Maria Regina Justina E. Estuar, PhD

Abstract

Hate speech, a deliberate attack on a group based on its identity, often proliferates through social media. However, existing automatic hate speech detection tools focus primarily on high-resource languages like English, which makes detecting hate speech in low-resource languages like Filipino difficult. This study addresses this limitation by developing a bilingual hate speech detection system and dataset using Filipino texts from Reddit. The system leverages bilingual and psycho-linguistic features, including part-of-speech tags, code-switching frequency, and language dominance. Both machine learning and deep learning techniques are applied to develop the system. The results indicate that both model types performed competitively in hate speech detection. Integrating psycho-linguistic features improved the performance of the machine learning models, highlighting the value of incorporating linguistic information. The study's outcomes include the bilingual hate speech detection system, a usable and shareable annotated dataset, the use of various feature extraction techniques, the effectiveness of transformer-based models, and the system's accuracy in detecting hate speech in Filipino. The resulting system and annotated dataset are deployed and made publicly available for future research, helping to address the scarcity of hate speech detection resources for low-resource languages.
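
To illustrate two of the bilingual features named in the abstract, the following is a minimal sketch of how code-switching frequency and language dominance might be computed from per-token language tags. The abstract does not give the thesis's actual definitions, so the function name, the "fil"/"eng" tag labels, and both formulas (switches per adjacent token pair, and the share of the majority language) are assumptions made for illustration only.

```python
from typing import List, Tuple


def code_switch_features(lang_tags: List[str]) -> Tuple[float, float]:
    """Return (switch_frequency, dominance) for per-token language tags.

    Hypothetical definitions, not taken from the thesis:
      switch_frequency = language switches / adjacent token pairs
      dominance        = share of tokens in the majority language
    """
    # Keep only tokens tagged as one of the two languages (assumed labels).
    tags = [t for t in lang_tags if t in ("fil", "eng")]
    if not tags:
        return 0.0, 0.0

    # Count positions where the language changes between adjacent tokens.
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    switch_frequency = switches / (len(tags) - 1) if len(tags) > 1 else 0.0

    # Dominance is the larger of the two language shares.
    fil_share = tags.count("fil") / len(tags)
    dominance = max(fil_share, 1.0 - fil_share)
    return switch_frequency, dominance


if __name__ == "__main__":
    # Tags for a hypothetical four-token comment: Filipino, Filipino, English, Filipino.
    print(code_switch_features(["fil", "fil", "eng", "fil"]))
```

Values like these could be appended to a text-derived feature vector before training a classifier. The per-token language tags are assumed to come from a separate language identification step.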
