
Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients

Polignano M.; Lops P.; Semeraro G.
2020-01-01

Abstract

The ability to understand people through spoken language is a skill that many human beings take for granted. The same task is not as easy for machines, because of the large number of variables that shape the speech sound wave while people talk to each other. A sub-task of speech understanding is the detection of the emotions conveyed by the speaker while talking, and this is the main focus of our contribution. In particular, we present a model for classifying the emotions elicited by speech, based on deep convolutional neural networks (CNNs). For this purpose, we focused on the audio recordings available in the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The model has been trained to classify eight different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprise), which correspond to the ones proposed by Ekman plus the neutral and calm ones. We adopted the F1 score as evaluation metric, obtaining a weighted average of 0.91 on the test set, with the best performance on the "angry" class (0.95). The worst result was observed for the "sad" class, with a score of 0.87 that is nevertheless better than the state of the art. To support future development and the replicability of the results, the source code of the proposed model is available in the following GitHub repository: https://github.com/marcogdepinto/Emotion-Classification-Ravdess
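Note: the record above gives no implementation details beyond the abstract; the authors' actual code is in the linked GitHub repository. As a purely illustrative sketch of the general approach described (MFCC features from RAVDESS audio fed to a CNN with eight output classes), the snippet below uses librosa for feature extraction and Keras for the classifier. The file paths, number of coefficients, label encoding, and network architecture are assumptions, not the paper's configuration.

# Minimal sketch (not the authors' implementation): extract MFCC features from a
# RAVDESS-style .wav file with librosa and classify them with a small 1D CNN in Keras.
import numpy as np
import librosa
from tensorflow.keras import layers, models

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprise"]  # the 8 classes named in the abstract
N_MFCC = 40  # assumed number of cepstral coefficients

def extract_mfcc(path, n_mfcc=N_MFCC):
    """Load an audio file and return its time-averaged MFCC vector."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)  # collapse the time axis -> (n_mfcc,)

def build_model(n_features=N_MFCC, n_classes=len(EMOTIONS)):
    """Small 1D CNN over the MFCC vector; the architecture is a stand-in, not the paper's."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu", padding="same"),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage: X stacks one MFCC vector per clip, y holds integer emotion labels.
# X = np.stack([extract_mfcc(p) for p in wav_paths])[..., np.newaxis]
# model = build_model()
# model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32)

The weighted F1 reported in the abstract could then be computed on held-out predictions with sklearn.metrics.f1_score(y_true, y_pred, average="weighted").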
ISBN: 978-1-7281-4384-2
Files in this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11586/309238
Warning: the displayed data have not been validated by the university.

Citations
  • PubMed Central: N/A
  • Scopus: 31
  • Web of Science: 17