Publication Date

2009-12-21

Availability

Open access

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (Ph.D.)

Department

Electrical and Computer Engineering (Engineering)

Date of Defense

2009-11-19

First Committee Member

Michael Scordilis - Committee Chair

Second Committee Member

Mei-Ling Shyu - Committee Member

Third Committee Member

Miroslav Kubat - Committee Member

Fourth Committee Member

Mohamed Abdel-Mottaleb - Committee Member

Fifth Committee Member

Subramanian Ramakrishnan - Outside Committee Member

Abstract

Emotion conveys the psychological state of a person. It is expressed through a variety of physiological changes, such as changes in blood pressure, heart rate, and degree of sweating, and can be manifested in shaking, changes in skin coloration, facial expression, and the acoustics of speech. This research focuses on the recognition of emotion conveyed in speech. The study had three main objectives. The first was to examine the role played by the glottal source signal in the expression of emotional speech. The second was to investigate whether the glottal signal can provide improved robustness in real-world situations and in noisy environments; this was tested in clean and in various noisy conditions. The third was to compare the performance of glottal features against diverse existing and newly introduced emotional feature domains.

A novel glottal symmetry feature is proposed and automatically extracted from speech, and the effectiveness of several inverse filtering methods in extracting the glottal signal from speech is examined. In addition to glottal symmetry, two further feature classes were tested for emotion recognition: the Tones and Break Indices (ToBI) of American English intonation, and Mel-frequency cepstral coefficients (MFCCs) of the glottal signal. Three corpora were designed specifically for the task. The first two covered four emotions: Happy, Angry, Sad, and Neutral; the third added Fear and Surprise for a six-emotion recognition task.

This work shows that the glottal signal carries valuable emotional information and that using it for emotion recognition has many advantages over conventional methods. For clean speech in the four-emotion task, classical prosodic features achieved 89.67% recognition, ToBI combined with classical features reached 84.75%, while glottal symmetry alone achieved 98.74%. For the six-emotion task these three methods achieved 79.62%, 90.39%, and 85.37% recognition rates, respectively. The glottal signal also provided greater classifier robustness under noisy conditions and under distortion caused by low-pass filtering. Specifically, for additive white Gaussian noise at SNR = 10 dB in the six-emotion task, the classical features and the classical features combined with ToBI both failed to produce successful results, while speech MFCCs achieved a recognition rate of 41.43% and glottal symmetry reached 59.29%. This work has shown that the glottal signal, and glottal symmetry in particular, provides high class separation in both the four- and six-emotion cases, decisively surpassing all other features in this investigation under noisy conditions and in most clean-signal conditions.
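The glottal-source extraction mentioned above is commonly approximated by linear-prediction (LPC) inverse filtering: the vocal tract is modeled as an all-pole filter estimated from the speech, and passing the speech through the inverse of that filter leaves a residual that approximates the glottal excitation. The sketch below illustrates that general idea in Python; the function names (`lpc`, `glottal_residual`, `glottal_symmetry`), the LPC order, and the peak-based symmetry ratio are illustrative assumptions for this sketch, not the dissertation's actual algorithm or its definition of glottal symmetry.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    """LPC coefficients via the autocorrelation method and the
    Levinson-Durbin recursion. Returns [1, a1, ..., a_order]."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e                      # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k                  # prediction-error update
    return a

def glottal_residual(speech, order=12):
    """Inverse-filter speech with its LPC polynomial A(z); the residual
    approximates the glottal source (vocal-tract resonances removed)."""
    a = lpc(speech, order)
    return lfilter(a, [1.0], speech), a

def glottal_symmetry(cycle):
    """Hypothetical per-cycle symmetry measure: ratio of the portion of
    the glottal cycle before its main excitation peak to the portion
    after it. Illustrative only; the dissertation's definition differs
    or may differ in detail."""
    peak = int(np.argmax(np.abs(cycle)))
    closing = len(cycle) - peak
    return peak / closing if closing > 0 else float("inf")
```

As a sanity check, inverse-filtering a synthetic all-pole (AR) signal with its estimated LPC polynomial should recover a low-energy, nearly white residual, which is the property that makes the residual a usable stand-in for the glottal source.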

Keywords

Speech Processing; Speech Emotion Recognition; Emotion Recognition; Speech Recognition
