
EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model

Deng Li, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland, deng.li@lut.fi; Xin Liu, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland, xin.liu@lut.fi; Bohao Xing, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland, bohao.xing@lut.fi; Baiqiang Xia, Silo AI, Helsinki, Finland, baiqiang.xia@silo.ai; Yuan Zong, Southeast University, Nanjing, China, xhzongyuan@seu.edu.cn; Bihan Wen, Nanyang Technological University, Singapore, bihan.wen@ntu.edu.sg; and Heikki Kälviäinen, Lappeenranta-Lahti University of Technology LUT, Lappeenranta, Finland, heikki.kalviainen@lut.fi
(2024)
Abstract.

Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have focused on short sequential video emotion analysis while overlooking long sequential videos. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions. 2) Previous studies commonly utilize various signals such as facial, speech, and even very sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, it is becoming important to develop Emotion AI without relying on sensitive signals. To address these limitations, in this paper we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing athletes' post-match interviews. In addition to annotations of the overall emotional state of each video, we also provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we propose a Multimodal Large Language Model (MLLM) based solution for long-sequential emotion analysis, called EALD-MLLM, and evaluate it with de-identified signals (e.g., visual, speech, and NFBL). Our experimental results demonstrate that: 1) MLLMs can achieve comparable, or even better, performance than supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long-sequential emotion analysis. EALD will be available on an open-source platform.

Multi-modal Large Language Model, Emotion Understanding, Affective Computing, Social Signal Processing, Human-Computer Interaction, Identity-free

1. Introduction

Emotion analysis (Koelstra et al., 2011; Hakak et al., 2017) is one of the most fundamental yet challenging tasks. For humans, emotions are present at all times, such as when people talk, think, and make decisions. Therefore, emotion analysis plays an important role in human-machine communication, such as in e-learning, social robotics, and healthcare (Khare et al., 2023; Kołakowska et al., 2014; Cavallo et al., 2018). In the past decades, emotion analysis has attracted a lot of attention from the research community (Nandwani and Verma, 2021; Li and Deng, 2020; El Ayadi et al., 2011; Ezzameli and Mahersia, 2023; Rahdari et al., 2019), and various modalities of signals have been explored. Biological signals such as the electrocardiogram (ECG) (Hsu et al., 2017), electroencephalogram (EEG) (Li et al., 2022), and galvanic skin response (Liu et al., 2016) have been used for emotion analysis. Another research direction is emotion analysis from static facial images. For example, the EmotionNet (Fabian Benitez-Quiroz et al., 2016) and AffectNet (Mollahosseini et al., 2017) datasets were proposed to identify emotions from facial images. Nowadays, video and multimodal (video + audio) emotion analysis has become a dominant research direction. AFEW (Dhall et al., 2012), AFEW-VA (Dhall et al., 2015), and LIRIS-ACCEDE (Baveye et al., 2015) provided emotional annotations for movie clips. VAM (Grimm et al., 2008) collected visual and speech data from speeches and interviews. Similarly, MOSEI (Zadeh et al., 2018) comprised 23,453 annotated video segments featuring 1,000 distinct speakers discussing 250 topics. The YouTube dataset (Morency et al., 2011) included product reviews and opinion videos sourced from YouTube. SEWA (Kossaifi et al., 2019) recorded video and audio of subjects watching adverts and provided detailed facial annotations. A summary of existing video and multimodal datasets for emotion analysis is presented in Table 1.

Table 1. Comparison of the different video and multimodal emotion understanding datasets.
Dataset Modality Duration in total (hours) Duration in avg. (mins) NFBL annota.? Identity free?
HUMAINE (Douglas-Cowie et al., 2007) V+A+L 4.18 5.02 ✗
VAM (Grimm et al., 2008) V+A 12.00 1.449 ✗ ✗
IEMOCAP (Busso et al., 2008) V+A+L 11.46 <1 ✗ ✗
Youtube (Morency et al., 2011) V+A+L 0.50 <1 ✗ ✗
AFEW (Dhall et al., 2012) V+A 2.46 <1 ✗ ✗
AM-FED (McDuff et al., 2013) V 3.33 <1 ✗ ✗
AFEW-VA (Dhall et al., 2015) V+A 0.66 <1 ✗ ✗
LIRIS-ACCEDE (Baveye et al., 2015) V+L - <1 ✗
EMILYA (Fourati and Pelachaud, 2014) V+L - <1 ✗
SEWA (Kossaifi et al., 2019) V+A 4.65 <1 ✗ ✗
CMU-MOSEI (Zadeh et al., 2018) V+A+L 65.88 1.68 ✗ ✗
iMiGUE (Liu et al., 2021) V+L 34.78 5.81 ✓ ✓
EALD (Ours) V+A+L 32.15 7.02 ✓ ✓
Note: V, A, and L denote video, audio, and language, respectively.

Previous datasets have successfully advanced research in emotion analysis from various aspects. However, two major limitations remain to be addressed: 1) Regardless of the modality used, all of the above studies are highly correlated with sensitive biometric data. Biometric data (e.g., facial, speech, and biological signals) plays a critical role in various applications, such as phone unlocking and mobile payment. However, biometric information is so sensitive that it is particularly prone to being stolen, misused, and used for unauthorized tracking. As the risk of hacking and privacy violations increases, the protection of personal biometric data is receiving growing attention. 2) Existing datasets focus on short sequential video emotion analysis while overlooking long sequential video emotion analysis. As shown in Table 1, most datasets only provide videos that are shorter than one minute. The emotions depicted in short sequential videos merely capture fleeting moments, which could be intentionally directed or concealed. In contrast, long sequential videos have the capacity to unveil genuine emotions. For example, when an athlete who lost a game is interviewed afterwards, although he or she may express positive emotions in the middle of the interview, his or her overall emotion may still be negative. While iMiGUE (Liu et al., 2021) is the closest dataset to ours, it lacks audio data. Moreover, as depicted in Fig. 2, the majority of videos in iMiGUE are shorter than 5 minutes, some lasting only around 3 minutes. Therefore, iMiGUE may not be suitable for long-sequential emotion analysis.

Figure 1. Selected samples of non-facial body language with masked faces from the proposed EALD dataset. Shown categories: N3 Touching or scratching head, N8 Touching ears, N11 Touching or scratching neck, N12 Playing or adjusting hair, N14 Touching suprasternal notch, N16 Folding arms, N17 Dustoff clothes, N19 Moving torso, N20 Sit straightly, N24 Minaret gesture, N30 Shake double shoulders, N34 Touching nose.

To address the aforementioned limitations of existing datasets, we construct and propose a dataset for Emotion Analysis in Long-sequential and De-identity videos called EALD. More precisely, we collect videos of "post-match press" scenarios in which a professional athlete goes through several rounds of Q&A with reporters after a tough match. The outcome of the match, winning or losing, is a natural emotion trigger, leading to a positive or negative emotional state of the interviewed player. In addition, we annotate the Non-Facial Body Language (NFBL) of the athletes during the interviews, because NFBL has been shown in psychological studies to be a crucial clue for understanding hidden human emotions (Loi et al., 2013; Abramson et al., 2021; Aviezer et al., 2012). We then use the McAdams coefficient (Patino et al., 2020) and face detection (http://dlib.net/python/) for speech and facial de-identification. Finally, we evaluate various single-modal models (e.g., video and audio) and a Multimodal Large Language Model (MLLM), Video-LLaMA (Zhang et al., 2023), on the proposed EALD dataset. The experimental results demonstrate that the MLLM surpasses the supervised single-modal models, even in the zero-shot scenario. In summary, the contributions of this paper can be summarized as follows:

  • We construct and propose the EALD dataset, which bridges the gap in existing datasets regarding de-identified long sequential video emotion analysis.

  • We provide a benchmark evaluation on the proposed dataset for future research.

  • We validate that NFBL is an effective and identity-free clue for emotion analysis.

Refer to caption
Figure 2. Comparison of the video durations in iMiGUE and EALD. The X-axis denotes the length of the videos, and the Y-axis denotes the number of videos.

2. EALD dataset

In this section, we present the construction and detailed information of the proposed EALD dataset.

2.1. Motivation

As shown in Table 1, there is a gap between existing datasets and real-life applications: 1) they involve sensitive biometric signals, such as facial and speech data, making it hard to meet real-life privacy-protection requirements; 2) they pay more attention to short sequential video emotion analysis but overlook long sequential videos, which are in fact crucial to emotion analysis. Next, we describe how the proposed EALD dataset is constructed, and later we demonstrate how EALD effectively bridges this gap.

2.2. Dataset Construction

Data collection

Videos of athletes participating in post-game interviews are a suitable data source. First, the outcome of the game, whether a victory or a loss, serves as a natural catalyst for emotions, eliciting either positive or negative states in the interviewed player. Second, the players have no (or very little) time to prepare because the press conference is held immediately after the game, and they need to respond to questions rapidly. Unlike acting in movies or series, the athletes' NFBL is therefore natural. Third, the duration of post-game interview videos is commonly relatively long, which meets the needs of our target, namely long-sequential emotion analysis. Therefore, we collect a total of 275 post-game interview videos from the Australian Open (https://www.youtube.com/@australianopen) and BNP Paribas Open (https://www.youtube.com/@bnpparibasopen) channels on YouTube.

Data de-identification

As aforementioned, one of the aims of the proposed dataset is to provide identity-free data for emotion analysis. Most identity information resides in the facial and speech signals. Therefore, we perform video and audio de-identification to remove the identity information. Specifically, for video de-identification, we utilize a pre-trained face detection Convolutional Neural Network (CNN) model $m_{face\_detection}$ from Dlib (http://dlib.net/) to implement facial masking. Formally, given a video $\mathbf{V}=\{v_1, v_2, ..., v_{N_{frame}}\}$, where $v_i$ denotes the $i$-th frame of $\mathbf{V}$, we obtain the face coordinates $C = m_{face\_detection}(v_i)$ for each frame and then mask the face with a Gaussian blur according to $C$. Examples are presented in Fig. 1.
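For illustration, a minimal sketch of this masking step is given below, assuming OpenCV and dlib's publicly available CNN face detector weights (mmod_human_face_detector.dat); the file names, blur kernel size, and output codec are illustrative rather than the exact settings used to build EALD.

```python
import cv2
import dlib

# Illustrative paths; the dataset pipeline may differ in detail.
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def blur_faces(frame, kernel=(51, 51)):
    """Mask every detected face in a frame with a Gaussian blur."""
    detections = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB), 1)
    for det in detections:
        r = det.rect
        x1, y1 = max(r.left(), 0), max(r.top(), 0)
        x2 = min(r.right(), frame.shape[1])
        y2 = min(r.bottom(), frame.shape[0])
        if x2 > x1 and y2 > y1:
            frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], kernel, 0)
    return frame

cap = cv2.VideoCapture("interview.mp4")
writer = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if writer is None:
        h, w = frame.shape[:2]
        writer = cv2.VideoWriter("interview_deid.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"), 32, (w, h))
    writer.write(blur_faces(frame))
cap.release()
writer.release()
```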

Regarding audio de-identification, our goal is to remove the identity information in the audio while preserving the emotions. We therefore follow the method described in (Tomashenko et al., 2024) and utilize the McAdams coefficient-based approach (McAdams, 1984) to ensure broad applicability and ease of implementation. The anonymization process is as follows. For an input audio signal $\mathbf{A}=\{a_1, a_2, ..., a_{N_{sample}}\}$ with a sampling frequency $f_s$, we divide it into overlapping frames of length $T_{win}$ milliseconds with a step size of $T_{shift}$ milliseconds, denoted as $\hat{A}=\{\hat{a}_1, \hat{a}_2, ..., \hat{a}_{N_{frame}}\}$. These frames are then processed with a Hanning window, resulting in windowed frame signals $\hat{A}^{w}=\{\hat{a}^{w}_1, \hat{a}^{w}_2, ..., \hat{a}^{w}_{N_{frame}}\}$. We then apply Linear Predictive Coding (LPC) analysis and the Transfer Function to Zeros, Poles, and Gain (tf2zpk) conversion to obtain the corresponding poles $P=\{p_1, p_2, ..., p_{N_{frame}}\}$:

(1) $c_i, r_i = \operatorname{LPC}(\hat{a}^{w}_{i}, lp\_order),$
(2) $z_i, p_i, k_i = \operatorname{tf2zpk}(c_i),$

where $lp\_order$ is the order of the LPC analysis, and $c_i$ and $r_i$ denote the LPC coefficients and the residual, respectively. We then adjust the angle $\theta_i$ of each pole $p_i$ using the McAdams coefficient $\lambda$ to obtain the new angle $\theta_i^{new} = \theta_i^{\lambda}$. Based on $\theta_i^{new}$ and the original pole magnitudes, we compute the new poles $p_i^{new}$, and the anonymized audio $\mathbf{A}^{new}$ is reconstructed by reversing the process. A waveform example of de-identified audio is presented in Fig. 3.
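As a reference, the listing below sketches the pole-warping step for a single windowed frame, assuming librosa and SciPy (np.roots plays the role of tf2zpk here, and the residual is obtained by inverse filtering); the official VoicePrivacy baseline additionally performs Hanning windowing, overlap-add resynthesis, and more careful handling of real poles, so this is a simplified illustration with assumed parameter values.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def mcadams_frame(frame, lp_order=20, mcadams=0.8):
    """Anonymize one windowed frame by warping LPC pole angles: theta_new = theta ** lambda."""
    frame = np.asarray(frame, dtype=float)
    a = librosa.lpc(frame, order=lp_order)      # LPC coefficients c_i (a[0] == 1)
    residual = lfilter(a, [1.0], frame)         # excitation / residual r_i
    poles = np.roots(a)                         # poles p_i of the all-pole filter 1/A(z)
    new_poles = poles.copy()
    is_complex = np.abs(poles.imag) > 1e-12     # leave real poles untouched
    ang = np.angle(poles[is_complex])
    new_poles[is_complex] = np.abs(poles[is_complex]) * np.exp(
        1j * np.sign(ang) * np.abs(ang) ** mcadams)
    a_new = np.real(np.poly(new_poles))         # rebuild the filter coefficients
    return lfilter([1.0], a_new, residual)      # re-synthesize the anonymized frame
```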

Figure 3. Comparison of a sample (sample id 275) of EALD before and after audio de-identification: (a) waveform of the original audio; (b) waveform of the de-identified audio; (c) spectrogram of the original audio; (d) spectrogram of the de-identified audio.

Data annotation

After data collection and de-identification, each video sequence needs to be annotated with NFBL. According to a study on human behavior (Navarro and Karlins, 2008), NFBL includes self-manipulations (e.g., scratching the nose and touching the ear), manipulation of objects (e.g., playing with rings, pens, and papers), and self-protection behaviors (e.g., rubbing the eyes, folding arms, or rhythmically moving legs). The defined NFBL classes are presented in Fig. 4. The annotation is carried out in the following stages: first, trained annotators identify each NFBL instance in the collected videos, including its start and end times as well as its NFBL type (clip level). We consider "positive" and "negative" as the two emotional categories to start with, and the labels are based on the objective fact that the game was won (positive emotion) or lost (negative emotion). After the initial annotation, we perform a review to ensure the accuracy of the labels.

Table 2. Properties of the proposed EALD.
Properties EALD
Number of videos 275
Number of annotated NFBL clips 16,180
Resolution 1280 × 720
Frame rate 32 FPS
Subjects 40 Male, 42 Female
Nationality 30
Total duration 32.15 hours
Average duration 7.02 mins

2.3. Dataset Statistics and Properties

Refer to caption
Figure 4. The distribution of non-facial body language categories in EALD. The y-axis represents the frequency, and the x-axis represents the NFBL category. The orange, green, and blue colors represent the three types of NFBL: self-manipulations, manipulation of (touching) objects, and self-protection behaviors, respectively. Best viewed digitally with zoom.

For the proposed EALD dataset, we collect 275 (74 Lost, 201 Won) post-game interview videos, as shown in Table 2. The differences between the proposed dataset and existing datasets can be summarized as follows: 1) Long sequential data: to meet our needs for long-sequential emotion analysis, the average duration of the collected videos is 7.02 minutes, and each video has a resolution of 1280 × 720 at 32 frames per second (FPS). 2) Identity-free: we remove the identity information from the video and audio as described in Sec. 2.2. 3) Diversity: emotional expression may differ between people because of cultural background, particularly in body language. It is worth emphasizing that the proposed EALD is diverse in terms of gender and nationality: the athletes in the collected videos come from different countries around the world (e.g., Australia, China, Japan, Slovakia, South Africa, and the USA), and their gender is well balanced (40 male, 42 female).

Regarding the annotation of NFBL, we provide 16,180 clips of various NFBL with timestamps in the proposed dataset. The clips range from 0.08 seconds to 184.50 seconds. The distribution of the different NFBL categories is shown in Fig. 4. As one may observe, the proportions of N9 (Biting nails) and N5 (Covering face) are significantly higher than those of the other categories.

3. EALD-MLLM

Refer to caption
Figure 5. The pipeline of the proposed EALD-MLLM.
Table 3. Ablation study of EALD-MLLM on the full EALD dataset.
Video Audio NFBL Accuracy (%)↑ F-score (%)↑ Precision (%)↑ Confidence score↑
✓ 49.09 63.91 66.31 6.08
✓ ✓ 53.45 67.51 68.91 6.83
✓ ✓ ✓ 58.54 70.77 68.65 7.29
Table 4. Comparative experiment on part of EALD.
Method Pretraining dataset Modality Zero-shot? Accuracy (%)↑ F-score (%)↑ Precision (%)↑
SlowFast (Feichtenhofer et al., 2019) K400 Video ✗ 52.70 42.62 54.10
TSM (Lin et al., 2019) ImageNet+K400 Video ✗ 51.35 60.87 50.91
TimeSformer (Bertasius et al., 2021) ImageNet+K400 Video ✗ 48.65 61.22 49.18
Video-Swin (Liu et al., 2022) ImageNet+K400 Video ✗ 51.35 63.27 50.82
BEATs (Chen et al., 2022) AudioSet Audio ✗ 50.00 66.67 50.00
EALD-MLLM (ours) Webvid+VideoChat Video + Audio + NFBL ✓ 58.10 66.67 55.35

Humans can identify the emotions of others through the visual system, the auditory system, or by combining both. Recently, Multimodal Large Language Models (MLLMs) have shown robust performance on many downstream tasks. An MLLM combines the capabilities of large language models with the ability to understand content across multiple modalities, such as text, images, audio, and video. Next, we introduce an MLLM-based solution for emotion analysis.

3.1. Framework

The framework of the proposed EALD-MLLM solution for emotion analysis is illustrated in Fig. 5. The solution consists of three stages. Formally, we are given a video $\mathbf{V}$, its corresponding audio $\mathbf{A}$, and the corresponding NFBL annotations $N$. In stage one, we perform de-identification to remove the identity information, obtaining the de-identified video $\hat{\mathbf{V}}$ and audio $\hat{\mathbf{A}}$. More details of the de-identification are given in Sec. 2.2.

MLLM inference

In stage two, we employ Video-LLaMA (Zhang et al., 2023) as our MLLM for emotion analysis inference, utilizing the de-identified video $\hat{\mathbf{V}}$, audio $\hat{\mathbf{A}}$, and NFBL $N$ as inputs. We chose Video-LLaMA because it aligns data from different modalities (e.g., video, audio, and text), which meets our requirements. Specifically, Video-LLaMA incorporates a pre-trained image encoder into a video encoder known as the Video Q-Former, enabling it to learn visual queries, and utilizes ImageBind (Girdhar et al., 2023) as the pre-trained audio encoder, introducing an Audio Q-Former to learn auditory queries. Finally, it feeds the visual and auditory queries to the pre-trained large language model LLaMA (Touvron et al., 2023) to generate responses.

Since the MLLM is fixed, we simply input the de-identified video $\hat{\mathbf{V}}$ and audio $\hat{\mathbf{A}}$ into it. Specifically, we uniformly sample $M$ frames from $\hat{\mathbf{V}}$ to form the visual input $\bar{\mathbf{V}}=\{\hat{v}_1, \hat{v}_2, ..., \hat{v}_M\}$, where $\hat{v}_i$ denotes a frame of $\hat{\mathbf{V}}$. For audio, we sample $S$ segments of 2-second audio clips and convert each clip into a spectrogram $\bar{a}_i$ using 128 mel-spectrogram bins, forming the audio input $\bar{\mathbf{A}}=\{\bar{a}_1, \bar{a}_2, ..., \bar{a}_S\}$. The NFBL annotations $N$ are used as the text input. Finally, the response is $R = \mathbf{MLLM}(\bar{\mathbf{V}}, \bar{\mathbf{A}}, N)$. It is noted that proposing yet another MLLM is not our goal; instead, we aim to utilize an existing MLLM for long-sequential emotion analysis. Therefore, we use Video-LLaMA instead of proposing a new model. Additionally, although the current choice of MLLM may not be optimal, it still achieves satisfactory results, as shown in Table 4, so we do not optimize it further. Readers interested in MLLMs may refer to (Zhang et al., 2024).
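The input preparation described above can be sketched as follows, assuming OpenCV and librosa for frame sampling and mel-spectrogram extraction; `video_llama` in the commented call is a placeholder for the actual Video-LLaMA inference interface, whose real processors differ in detail.

```python
import cv2
import numpy as np
import librosa

def sample_frames(video_path, m=32):
    """Uniformly sample M frames from the de-identified video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, total - 1, m).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)

def audio_to_mel_clips(audio_path, clip_sec=2.0, n_mels=128, sr=16000):
    """Cut the de-identified audio into 2-second clips and convert each to a 128-bin mel spectrogram."""
    wav, _ = librosa.load(audio_path, sr=sr)
    clip_len = int(clip_sec * sr)
    clips = [wav[i:i + clip_len] for i in range(0, len(wav) - clip_len + 1, clip_len)]
    return [librosa.feature.melspectrogram(y=c, sr=sr, n_mels=n_mels) for c in clips]

def build_nfbl_prompt(nfbl_labels):
    """Serialize the NFBL annotations as a text segment for the MLLM prompt."""
    return "Observed non-facial body language: " + ", ".join(nfbl_labels) + "."

# response = video_llama(frames=sample_frames("deid.mp4"),
#                        audio=audio_to_mel_clips("deid.wav"),
#                        text=build_nfbl_prompt(["touching nose", "folding arms"]))
```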

LLM judgement

Although we can obtain a reasonable response $R = \mathbf{MLLM}(\bar{\mathbf{V}}, \bar{\mathbf{A}}, N)$, the MLLM's response is more descriptive content than emotion analysis. We attribute this to the instruction data: Video-LLaMA is instruction-tuned on Visual Question Answering (VQA) datasets but not on emotion analysis datasets. However, the generated response $R$ is still helpful for emotion analysis because it contains descriptive information about the visual content, audio, and body language. Therefore, we opt for ChatGPT (Ouyang et al., 2022) for emotion estimation based on the given response $R$. More precisely, we utilize gpt-3.5-turbo-0125 to generate a response that includes the estimated emotion and a confidence score. The prompt used for ChatGPT is presented in Fig. 5 (Stage 3).
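A hedged sketch of this judgment step is given below, assuming the OpenAI Python client; the prompt text is a paraphrase of the idea in Fig. 5 (Stage 3) rather than the exact wording.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_emotion(mllm_response: str) -> str:
    """Ask gpt-3.5-turbo-0125 to map the MLLM's descriptive response to an emotion label."""
    prompt = (
        "The following is a description of an athlete's post-match interview, "
        "covering visual appearance, audio, and non-facial body language:\n\n"
        f"{mllm_response}\n\n"
        "Based on this description, is the athlete's overall emotional state "
        "positive or negative? Answer with the label and a confidence score from 0 to 10."
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content
```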

Figure 6. Example case with non-facial body language.
Figure 7. Example case without non-facial body language.

4. Experimental Result

In this section, we present the ablation study and the comparative study of the proposed EALD-MLLM solution.

4.1. Experimental settings

Metrics

As emotion analysis on EALD can be regarded as a binary classification problem, we employ accuracy, precision, and F-score as metrics to assess the performance of the different models:

(3) $\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN},$
(4) $\text{Recall} = \frac{TP}{TP+FN},$
(5) $\text{Precision} = \frac{TP}{TP+FP},$
(6) $\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}},$

where True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) denote instances correctly predicted as positive, correctly predicted as negative, incorrectly predicted as positive, and incorrectly predicted as negative, respectively.
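These metrics correspond directly to standard library implementations; the snippet below, with purely illustrative labels, computes Eqs. (3)-(6) using scikit-learn.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: 1 = positive emotion (won), 0 = negative emotion (lost); illustrative only
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```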

Inference

As described in Sec. 3.1, the proposed EALD-MLLM solution for emotion analysis utilizes Video-LLaMA (Zhang et al., 2023) and ChatGPT (Ouyang et al., 2022) without training. For the inputs of Video-LLaMA, we uniformly sample 32 frames from each video and sample 2-second audio clips from each audio track. As for NFBL, we directly use the NFBL annotations and feed them to Video-LLaMA in text form.

4.2. Ablation Study

Before comparing EALD-MLLM with other approaches, three key questions need to be addressed: 1) Is multimodal data needed for long-sequential video emotion analysis? 2) Is NFBL helpful for long-sequential video emotion analysis? 3) Does long-sequential emotion analysis reflect genuine emotions more accurately? We conduct an ablation study to investigate these three questions. Since all experiments in the ablation study are conducted under the zero-shot scenario, we use the full data of the proposed EALD.

Is multimodal data beneficial for emotion analysis?

We begin with video input alone as the base model, then gradually integrate audio and NFBL. As shown in Table 3, relying solely on visual cues for detecting emotions proves challenging, with an accuracy below 50%. In contrast, incorporating audio data significantly improves the model’s accuracy by about 4%. Therefore, we believe that both video and audio are crucial modalities as they can provide rich information for long-sequential emotion analysis. Utilizing multimodal techniques that combine the features of video and audio is advantageous for emotion analysis.

Is non-facial body language helpful?

As shown in Table 3, adding NFBL significantly improves performance (by 5% in accuracy). It also increases the confidence score, indicating that the descriptive content generated by Video-LLaMA becomes more informative for emotion analysis.

Does emotion analysis over a long sequential period more accurately reflect people’s genuine emotions?

As shown in the supplementary materials, when we compare instantaneous (short sequence) and long sequential video emotion analysis, long-term analysis is better able to reflect people’s actual emotional states comprehensively. Short-term recognition may be influenced by specific moments, whereas long-term observation can better capture emotional fluctuations and trends, thus providing more accurate emotional analysis. Long-term emotion recognition also allows for a deeper understanding of the context and underlying factors influencing individuals’ emotional states. This contextual richness further enhances the accuracy and depth of the emotional analysis over time. Due to limited space, please refer to the details in the supplementary materials.

4.3. Benchmark Evaluation

To validate the effectiveness of the proposed EALD-MLLM solution, we selected several single-modal models for comparison: 1) for video-based models, we chose Video-Swin-Transformer (Video-Swin) (Liu et al., 2022), TSM (Lin et al., 2019), TimeSformer (Bertasius et al., 2021), and SlowFast (Feichtenhofer et al., 2019); 2) for audio-based models, we selected BEATs (Chen et al., 2022). These methods require training before they can be tested on the proposed dataset. Thus, we randomly selected data from the dataset for the comparative experiment to reduce the impact of data imbalance. Specifically, we used 72 videos (36 Negative, 36 Positive) for training and 74 videos (37 Negative, 37 Positive) for testing. The video IDs of the selected subset can be found in the Appendix. Considering that the training set is relatively small, we opted for linear probing to train these models: we only trained the last classification layer and froze the remaining pre-trained layers. Video-Swin (Liu et al., 2022), TSM (Lin et al., 2019), and TimeSformer (Bertasius et al., 2021) are pre-trained on Kinetics-400 (Carreira and Zisserman, 2017) and ImageNet (Deng et al., 2009), while SlowFast (Feichtenhofer et al., 2019) is pre-trained on Kinetics-400 (Carreira and Zisserman, 2017). We employ an AdamW optimizer for 30 epochs with a cosine decay learning rate scheduler and 2.5 epochs of linear warm-up. The audio recognition model BEATs (Chen et al., 2022) is pre-trained on AudioSet (Gemmeke et al., 2017) and trained with an AdamW optimizer for 30 epochs with early stopping. It is noted that our audio files typically have a duration of about 10 minutes, which is impractical for conventional audio classification models to handle. To make training feasible, we resample the original audio from a sampling rate of 16,000 Hz to 320 Hz.
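A minimal sketch of this linear-probing setup is given below, assuming PyTorch; the backbone, classification head, and data loader are placeholders, and the hyperparameters mirror the ones stated above (AdamW, 30 epochs, cosine decay with 2.5 epochs of linear warm-up).

```python
import math
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def linear_probe(backbone: nn.Module, head: nn.Module, train_loader,
                 epochs=30, warmup_epochs=2.5, lr=1e-3, device="cuda"):
    """Freeze the pre-trained backbone and train only the classification head."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval().to(device)
    head.train().to(device)

    optimizer = AdamW(head.parameters(), lr=lr)
    steps_per_epoch = len(train_loader)
    total_steps = epochs * steps_per_epoch
    warmup_steps = int(warmup_epochs * steps_per_epoch)

    def lr_lambda(step):
        if step < warmup_steps:                                 # linear warm-up
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay

    scheduler = LambdaLR(optimizer, lr_lambda)
    criterion = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in train_loader:                      # labels: 0 negative, 1 positive
            with torch.no_grad():
                feats = backbone(clips.to(device))              # frozen features
            loss = criterion(head(feats), labels.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
```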

The results in Table 4 indicate that classic video or audio recognition models exhibit relatively poor performance, achieving an accuracy of approximately 50%. This also shows that using only one modality is not sufficient for long-sequential emotion analysis. In contrast, our proposed solution, EALD-MLLM, achieves superior performance with a 5% increase in accuracy. Qualitative results are presented in Fig. 6 and Fig. 7.

5. Limitation

Although the proposed EALD-MLLM outperforms the other models, its performance is not very high, as shown in Table 4. This also shows that the proposed EALD dataset, designed for long-sequential emotion analysis, is challenging. In Table 3, we validate that Non-Facial Body Language (NFBL) is an important clue for long-sequential emotion analysis. However, we use the NFBL annotations directly; to our knowledge, detecting these NFBLs in real applications is not a simple task, as they often occur quickly and are difficult to recognize. Therefore, we plan to study the detection of NFBLs in the future. Furthermore, the EALD-MLLM approach follows a two-stage methodology because the employed MLLM (Video-LLaMA) is designed for general purposes and is not optimized for emotion analysis. As shown in Fig. 8, in some cases the response generated by the MLLM does not provide useful information for emotion analysis and instead focuses on environmental information. Therefore, we have to use another LLM (ChatGPT) to refine the response and estimate the final emotion. We plan to propose an end-to-end MLLM for long-sequential emotion analysis in the future.

Figure 8. Example of a failed MLLM response. The response lacks the information needed for emotion analysis.

6. Conclusion

In this paper, we introduce a novel dataset for Emotion Analysis in Long-sequential and De-identity videos, termed EALD, to address the limitations of existing datasets: 1) the lack of long sequential videos for emotion analysis, and 2) the involvement of sensitive data (e.g., facial, speech, and ECG signals). The EALD dataset comprises 275 videos, each with an average duration of about 7 minutes. It is worth emphasizing that we perform de-identification processing on all videos and audio. In addition to annotations of the overall emotional state of each video, we also provide Non-Facial Body Language (NFBL) annotations for each player. Furthermore, we propose a solution utilizing a Multimodal Large Language Model (MLLM) called EALD-MLLM. Experimental results demonstrate that: 1) MLLMs can achieve performance comparable to supervised single-modal models, even in zero-shot scenarios; 2) NFBL serves as an important identity-free cue in long-sequential emotion analysis.

References

  • (1)
  • Abramson et al. (2021) Lior Abramson, Rotem Petranker, Inbal Marom, and Hillel Aviezer. 2021. Social interaction context shapes emotion recognition through body language, not facial expressions. Emotion 21, 3 (2021), 557.
  • Aviezer et al. (2012) Hillel Aviezer, Yaacov Trope, and Alexander Todorov. 2012. Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science 338, 6111 (2012), 1225–1229.
  • Baveye et al. (2015) Yoann Baveye, Emmanuel Dellandrea, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
  • Bertasius et al. (2021) Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is space-time attention all you need for video understanding?. In ICML, Vol. 2. 4.
  • Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (2008), 335–359.
  • Carreira and Zisserman (2017) Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
  • Cavallo et al. (2018) Filippo Cavallo, Francesco Semeraro, Laura Fiorini, Gergely Magyar, Peter Sinčák, and Paolo Dario. 2018. Emotion modelling for social robotics applications: a review. Journal of Bionic Engineering 15 (2018), 185–203.
  • Chen et al. (2022) Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei. 2022. Beats: Audio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058 (2022).
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition. Ieee, 248–255.
  • Dhall et al. (2012) Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. 2012. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. IEEE MultiMedia 19, 3 (2012), 34–41. https://doi.org/10.1109/MMUL.2012.26
  • Dhall et al. (2015) Abhinav Dhall, OV Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. 2015. Video and image based emotion recognition challenges in the wild: Emotiw 2015. In ACM on International Conference on Multimodal Interaction. 423–426.
  • Douglas-Cowie et al. (2007) Ellen Douglas-Cowie, Roddy Cowie, Ian Sneddon, Cate Cox, Orla Lowry, Margaret Mcrorie, Jean-Claude Martin, Laurence Devillers, Sarkis Abrilian, Anton Batliner, et al. 2007. The HUMAINE database: Addressing the collection and annotation of naturalistic and induced emotional data. In International Conference on Affective Computing and Intelligent Interaction. Springer, 488–500.
  • El Ayadi et al. (2011) Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. 2011. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern recognition 44, 3 (2011), 572–587.
  • Ezzameli and Mahersia (2023) K Ezzameli and H Mahersia. 2023. Emotion recognition from unimodal to multimodal analysis: A review. Information Fusion (2023), 101847.
  • Fabian Benitez-Quiroz et al. (2016) C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. 2016. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In IEEE Conference on Computer Vision and Pattern Recognition. 5562–5570.
  • Feichtenhofer et al. (2019) Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision. 6202–6211.
  • Fourati and Pelachaud (2014) Nesrine Fourati and Catherine Pelachaud. 2014. Emilya: Emotional body expression in daily actions database.. In LREC. 3486–3493.
  • Gemmeke et al. (2017) Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and Signal Processing. IEEE, 776–780.
  • Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15180–15190.
  • Grimm et al. (2008) Michael Grimm, Kristian Kroschel, and Shrikanth Narayanan. 2008. The Vera am Mittag German audio-visual emotional speech database. In IEEE International Conference on Multimedia and Expo. IEEE, 865–868.
  • Hakak et al. (2017) Nida Manzoor Hakak, Mohsin Mohd, Mahira Kirmani, and Mudasir Mohd. 2017. Emotion analysis: A survey. In 2017 international conference on computer, communications and electronics (COMPTELIX). IEEE, 397–402.
  • Hsu et al. (2017) Yu-Liang Hsu, Jeen-Shing Wang, Wei-Chun Chiang, and Chien-Han Hung. 2017. Automatic ECG-based emotion recognition in music listening. IEEE Transactions on Affective Computing 11, 1 (2017), 85–99.
  • Khare et al. (2023) Smith K Khare, Victoria Blanes-Vidal, Esmaeil S Nadimi, and U Rajendra Acharya. 2023. Emotion recognition and artificial intelligence: A systematic review (2014–2023) and research recommendations. Information Fusion (2023), 102019.
  • Koelstra et al. (2011) Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2011. Deap: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing 3, 1 (2011), 18–31.
  • Kołakowska et al. (2014) Agata Kołakowska, Agnieszka Landowska, Mariusz Szwoch, Wioleta Szwoch, and Michal R Wrobel. 2014. Emotion recognition and its applications. Human-computer Systems Interaction: Backgrounds and Applications 3 (2014), 51–62.
  • Kossaifi et al. (2019) Jean Kossaifi, Robert Walecki, Yannis Panagakis, Jie Shen, Maximilian Schmitt, Fabien Ringeval, Jing Han, Vedhas Pandit, Antoine Toisoul, Björn Schuller, et al. 2019. Sewa db: A rich database for audio-visual emotion and sentiment research in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 3 (2019), 1022–1040.
  • Li and Deng (2020) Shan Li and Weihong Deng. 2020. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing 13, 3 (2020), 1195–1215.
  • Li et al. (2022) Xiang Li, Yazhou Zhang, Prayag Tiwari, Dawei Song, Bin Hu, Meihong Yang, Zhigang Zhao, Neeraj Kumar, and Pekka Marttinen. 2022. EEG based emotion recognition: A tutorial and review. Comput. Surveys 55, 4 (2022), 1–57.
  • Lin et al. (2019) Ji Lin, Chuang Gan, and Song Han. 2019. Tsm: Temporal shift module for efficient video understanding. In IEEE/CVF International Conference on Computer Vision. 7083–7093.
  • Liu et al. (2016) Mingyang Liu, Di Fan, Xiaohan Zhang, and Xiaopeng Gong. 2016. Retracted: Human emotion recognition based on galvanic skin response signal feature selection and svm. In International Conference on Smart City and Systems Engineering. IEEE, 157–160.
  • Liu et al. (2021) Xin Liu, Henglin Shi, Haoyu Chen, Zitong Yu, Xiaobai Li, and Guoying Zhao. 2021. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10631–10642.
  • Liu et al. (2022) Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. 2022. Video swin transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3202–3211.
  • Loi et al. (2013) Felice Loi, Jatin G Vaidya, and Sergio Paradiso. 2013. Recognition of emotion from body language among patients with unipolar depression. Psychiatry research 209, 1 (2013), 40–49.
  • McAdams (1984) Stephen Edward McAdams. 1984. Spectral fusion, spectral parsing and the formation of auditory images. Stanford university.
  • McDuff et al. (2013) Daniel McDuff, Rana Kaliouby, Thibaud Senechal, May Amr, Jeffrey Cohn, and Rosalind Picard. 2013. Affectiva-mit facial expression dataset (am-fed): Naturalistic and spontaneous facial expressions collected. In IEEE Conference on Computer Vision and Pattern Recognition Workshops. 881–888.
  • Mollahosseini et al. (2017) Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. 2017. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10, 1 (2017), 18–31.
  • Morency et al. (2011) Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In International Conference on Multimodal Interfaces. 169–176.
  • Nandwani and Verma (2021) Pansy Nandwani and Rupali Verma. 2021. A review on sentiment analysis and emotion detection from text. Social network analysis and mining 11, 1 (2021), 81.
  • Navarro and Karlins (2008) Joe Navarro and Marvin Karlins. 2008. What every body is saying. HarperCollins Publishers, New York, NY, USA.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems 35 (2022), 27730–27744.
  • Patino et al. (2020) Jose Patino, Natalia Tomashenko, Massimiliano Todisco, Andreas Nautsch, and Nicholas Evans. 2020. Speaker anonymisation using the McAdams coefficient. arXiv preprint arXiv:2011.01130 (2020).
  • Rahdari et al. (2019) Farhad Rahdari, Esmat Rashedi, and Mahdi Eftekhari. 2019. A multimodal emotion recognition system using facial landmark analysis. Iranian Journal of Science and Technology, Transactions of Electrical Engineering 43 (2019), 171–189.
  • Tomashenko et al. (2024) Natalia Tomashenko, Xiaoxiao Miao, Pierre Champion, Sarina Meyer, Xin Wang, Emmanuel Vincent, Michele Panariello, Nicholas Evans, Junichi Yamagishi, and Massimiliano Todisco. 2024. The VoicePrivacy 2024 Challenge Evaluation Plan. (2024). arXiv:2404.02677 [eess.AS]
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Zadeh et al. (2018) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Annual Meeting of the Association for Computational Linguistics. 2236–2246.
  • Zhang et al. (2023) Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858 (2023). https://arxiv.org/abs/2306.02858
  • Zhang et al. (2024) Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).