Hansel: Italian hate speech detection through ensemble learning and deep neural networks


Polignano, Marco;Basile, Pierpaolo

2018-01-01

Abstract

Detecting hate speech on social media and online forums is a relevant task in natural language processing. Interest in it is motivated by the complexity of the task and by the social impact of its use in real scenarios. The solution proposed in this work is an ensemble of three classifiers, combined by a majority-vote algorithm: a Support Vector Machine (Hearst et al., 1998) with an RBF kernel (SVM), a Random Forest (Breiman, 2001), and a Deep Multilayer Perceptron (Kolmogorov, 1992) (MLP). Each classifier has been tuned using a greedy hyper-parameter optimization strategy over the F1 score, calculated on a 5-fold random subdivision of the training set. Each sentence has been pre-processed to transform it into word embeddings and a TF-IDF bag of words. Cross-validation over the training sets yielded an F1 value of 0.8034 for Facebook sentences and 0.7102 for Twitter. The code of the proposed system can be downloaded from GitHub: https://github.com/marcopoli/haspeede_hate_detect.
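The majority-vote step and the F1 measure mentioned in the abstract can be illustrated with a minimal, standard-library sketch. It is not taken from the authors' repository: the function names `majority_vote` and `f1_score`, the toy label lists, and the choice of computing F1 for the positive class only are assumptions made for illustration (the abstract does not say which F1 averaging is used).

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-classifier label lists into one label per sentence.

    predictions[i][j] is the label classifier i assigns to sentence j.
    With three binary classifiers a strict majority always exists.
    """
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all classifiers must label the same sentences")
    # zip(*predictions) groups the votes cast for each sentence.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 of the positive class (assumption: no macro averaging)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predictions of the three classifiers on five sentences
# (1 = hate speech, 0 = not hate speech).
svm_pred = [1, 0, 1, 0, 1]
rf_pred = [1, 1, 0, 0, 1]
mlp_pred = [0, 0, 1, 0, 1]
gold = [1, 0, 0, 0, 1]

ensemble_pred = majority_vote([svm_pred, rf_pred, mlp_pred])
ensemble_f1 = f1_score(gold, ensemble_pred)
```

Using an odd number of classifiers on a binary task avoids ties, which is one plausible reason for combining exactly three models.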




Use this identifier to cite or link to this document: https://hdl.handle.net/11586/225387
