1 Introduction

Heart rate (HR), heart rate variability (HRV) and respiration frequency (RF) serve as important indicators of an individual's physical health. Two kinds of skin-contact medical sensors, i.e., electrocardiography (ECG) and photoplethysmography (PPG) sensors, are widely adopted to measure these physiological signals. However, skin-contact sensors inevitably cause inconvenience and discomfort to participants, particularly in long-term or daily health monitoring (McDuff, 2023; Niu et al., 2020a; Verkruysse et al., 2008). Moreover, they are unsuitable for newborns and burn patients (Chen et al., 2019; Lu et al., 2021). To cope with these limitations, facial video-based remote physiological measurement has emerged and garnered increasing research attention in recent years (Yu et al., 2022b; Du et al., 2023; Lu et al., 2023). It exhibits promising performance in various applications such as face anti-spoofing (Liu et al., 2018), atrial fibrillation screening (Yan et al., 2020), and vital signs monitoring for ICU patients (Jorge et al., 2022).

The core principle of facial video-based remote physiological measurement lies in the remote photoplethysmography (rPPG) technique (Wang et al., 2017): the optical absorption of skin changes periodically in sync with the cardiac cycle. Ideally, the temporal variation of skin color reflects the periodic rPPG signal, enabling further measurement of physiological signals such as HR, HRV and RF. However, the rPPG signal is inherently intertwined with assorted non-periodic noises arising from illumination variations, subject movements and camera quantization noise (Lu et al., 2021; Yue et al., 2022). In early studies, researchers developed blind signal separation (Wang et al., 2017) and color space transformation (De Haan & Jeanne, 2013) approaches to disentangle rPPG signals from noises. However, these methods require manually defining the frequency bandpasses of interest to filter out unexpected noises. They also rely on assumptions about the scene background to diminish the effect of illumination variations on skin tones. As a result, these techniques are primarily suited to well-controlled laboratory environments and often show severe performance degradation in real-world environments.

In recent years, many deep learning-based approaches have been proposed, showcasing state-of-the-art performance (Yu et al., 2022b; Lu et al., 2021; Qiu et al., 2019; Lee et al., 2020a; Yu et al., 2019; Li et al., 2023b; Niu et al., 2020b). They leverage diverse vision encoders to extract visual features from facial videos for rPPG estimation. Nevertheless, collecting facial videos and corresponding ground truth signals (i.e., ECG or PPG signals) for training remains costly, as it requires subjects to wear skin-contact sensors so that videos and signals can be recorded synchronously. To overcome this challenge, a number of self-supervised approaches have been introduced. They train models on unlabeled facial videos using contrastive learning (Yang et al., 2023; Yue et al., 2023; Gideon & Stent, 2021; Sun & Li, 2024) or masked auto-encoding (Liu et al., 2024) mechanisms. However, their performance is inevitably inferior to that of supervised methods.

Fig. 1

Existing vision-language models (VLMs) are pre-trained on diverse vision-text pairs, including those that describe the temporal variations of certain object attributes. For the first time, we adapt VLMs to digest the frequency-related knowledge of skin color temporal variation in the vision and text modalities for self-supervised remote physiological measurement

Recently, vision-language models (VLMs), such as contrastive language-image pre-training (CLIP) (Radford et al., 2021) and video and language understanding (VindLU) (Cheng et al., 2023), have drawn increasing research attention due to their remarkable ability in many vision and language tasks (Zhou et al., 2024b; Zang et al., 2024; Shi et al., 2024). Particularly, in video processing, given vision-text pairs, visual and textual embeddings are extracted by vision and text encoders, respectively, and are then aligned via vision-text contrastive learning. As shown in Fig. 1, the text prompts can be versatile, e.g. "the color of sky changes from gold to orange to blue", covering rich temporal details. This type of text cue thus facilitates the model's understanding of temporal visual patterns in videos. Therefore, such VLMs have shown particular strength in temporal tasks like action recognition (Wang et al., 2023), video summarization (Pramanick et al., 2023) and video object segmentation (Yuan et al., 2024). Inspired by their success, we aim to leverage the powerful temporal feature modeling ability of pre-trained VLMs to capture the periodic variation of skin color in facial videos. To the best of our knowledge, VLMs have not yet been used to capture frequency-related attributes in images or videos. Intuitively, we posit two potential solutions, specified below.

(1) Zero-shot learning: previous works have leveraged pre-trained VLMs for zero-shot depth estimation (Zhang et al., 2022), object counting (Jiang et al., 2023), etc. For example, Zhang et al. (2022) utilize the pre-trained knowledge from CLIP to project the semantic response of each image patch into a certain depth bin, and linearly combine the depth values in the bin to obtain the final depth prediction. Similarly, we can define a frequency bin (e.g., [0.5, 1.0, 1.5, 2.0, 2.5]) and use every value in the bin to construct a text prompt describing the frequency of the skin color temporal variation, such as "the frequency of the skin color variation is 2.0 hertz in the video." Next, we calculate the similarity between the visual embedding of the given video and the textual embedding of each text prompt via the VLM. Finally, we compute the similarity-weighted sum of the values in the frequency bin and regard it as the main frequency of the rPPG signal. However, pre-trained VLMs excel at extracting temporal features but struggle to understand frequency-related attributes. This is because the pre-training datasets for VLMs rarely contain vision-text pairs that describe periodic events. We therefore decide not to proceed along this line; for illustration, a minimal sketch of this zero-shot idea is given after the next paragraph.

(2) Fine-tuning: some studies focus on fine-tuning VLMs for specific downstream tasks. Due to the lack of text annotations in their datasets, they typically begin with a vision-text pair generation method and subsequently optimize the models through multimodal representation learning. For example, Liang et al. (2023) manually construct text prompts to describe the crowd counts within patches of different sizes in a given image. A multimodal ranking loss is then devised to help the vision encoder of the VLM capture the crowd semantics in the image for crowd counting. This research line is feasible if we have the actual frequency of the rPPG signal to describe the facial video, which is however unavailable in our self-supervised remote physiological measurement task.
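Returning to option (1), the sketch below illustrates how such a similarity-weighted, zero-shot frequency estimate could be computed with a CLIP-style model. The text-encoding callable, the softmax weighting and its temperature are our assumptions for illustration, not a prescribed recipe.

```python
# A hypothetical sketch of the zero-shot idea in option (1): score a video
# embedding against prompts built from a frequency bin, then take the
# similarity-weighted sum of the bin values as the predicted main frequency.
import torch
import torch.nn.functional as F

FREQ_BIN = [0.5, 1.0, 1.5, 2.0, 2.5]  # candidate frequencies in hertz

def zero_shot_heart_rate(video_embed, encode_text, tau=0.07):
    """video_embed: (D,) visual embedding of the video from a pre-trained VLM.
    encode_text: callable mapping a prompt string to a (D,) textual embedding.
    Both the callable and the softmax weighting are assumptions."""
    prompts = [
        f"the frequency of the skin color variation is {f} hertz in the video."
        for f in FREQ_BIN
    ]
    text_embeds = torch.stack([encode_text(p) for p in prompts])      # (K, D)
    sims = F.cosine_similarity(video_embed.unsqueeze(0), text_embeds)  # (K,)
    weights = F.softmax(sims / tau, dim=0)
    freq_hz = (weights * torch.tensor(FREQ_BIN)).sum()
    return 60.0 * freq_hz  # convert hertz to beats per minute
```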

In this paper, we introduce a pioneering approach to bootstrap VLMs for self-supervised remote physiological measurement. Our proposed method, namely VL-phys, marks the first attempt to adapt VLMs to digest frequency-related knowledge in the vision and text modalities. As shown in Fig. 1, we introduce a novel frequency-oriented generation mechanism to create vision-text pairs reflecting the frequencies of skin color temporal variations. Given an input facial video, we first augment it into multiple positive and negative samples whose rPPG frequencies conform to certain relations; next, in the absence of their ground truth signal frequencies, we create contrastive spatio-temporal maps between positive and negative samples and craft text prompts to describe the frequency relations across samples; this forms vision-text pairs. Afterwards, we fine-tune the pre-trained VLM with these vision-text pairs through well-designed frequency-related representation learning in both generative and contrastive manners. The former features a text-guided visual reconstruction task in which we reconstruct masked patches of contrastive spatio-temporal maps under the guidance of text prompts; the latter consists of a series of tasks, such as the multimodal vision-text contrastive learning task to align the embeddings of vision-text pairs, and the unimodal frequency contrastive and ranking tasks to optimize the rPPG signals estimated from visual embeddings of different video samples.

Overall, the contributions of this work are four-fold:

  • We propose VL-phys, a novel frequency-centric self-supervised framework for remote physiological measurement based on a pre-trained VLM. It innovatively adapts the pre-trained VLM to digest frequency-related knowledge in the vision and text modalities via representation learning in both contrastive and generative manners.

  • We introduce a frequency-oriented vision-text pair generation mechanism to enable the representation learning. Without relying on the ground truth PPG signals, we carefully augment facial videos and create contrastive spatio-temporal maps to reflect the frequency ratios of skin color temporal variations across different samples; an appropriate text prompt is then designed to describe such relative frequency relations in these visual maps.

  • We design a multitask learning pipeline that incorporates the multimodal generative and contrastive tasks (i.e., the text-guided visual reconstruction task and the vision-text contrastive learning task) to help optimize the self-supervised rPPG estimation. Besides the multimodal learning, we also introduce the unimodal frequency contrastive and ranking tasks to constrain rPPG estimation among multiple augmented video samples.

  • We conduct extensive experiments on four publicly available datasets, UBFC-rPPG (Bobbia et al., 2019), PURE (Stricker et al., 2014), VIPL-HR (Niu et al., 2020a) and MMVS (Yue et al., 2021); our results demonstrate that VL-phys significantly outperforms state-of-the-art self-supervised methods, and performs even on par with state-of-the-art supervised methods.

Our method, beyond its merits in the remote physiological measurement task, should also benefit many other tasks that require VLMs to understand periodic and frequency-related attributes in their modalities, such as gait recognition (Liu et al., 2022), weather change prediction (Sønderby et al., 2020), and Parkinson's disease diagnosis (Yang et al., 2022).

2 Related Work

2.1 Remote Physiological Measurement

Traditional facial video-based remote physiological measurement methods normally estimate rPPG using blind signal separation techniques (Wang et al., 2017; Lam & Kuno, 2015) or skin reflection models (De Haan & Jeanne, 2013; Wang et al., 2016). First, based on the assumption that the temporal variation of skin color is a linear combination of the target rPPG signal and other irregular noises, blind signal separation-based approaches leverage independent component analysis (ICA) or principal component analysis (PCA) to disentangle rPPG signals from noises. For example, Lam and Kuno (2015) applied ICA to several non-overlapping video patches for independent rPPG estimation and proposed a frequency-based majority voting strategy to determine the most accurate estimate. Second, skin reflection model-based approaches aim to project the original images into different color spaces that better highlight rPPG components. For instance, De Haan and Jeanne (2013) proposed CHROM, which projects from the RGB space to the chrominance space to reduce motion noise and improve rPPG estimation. However, these two kinds of traditional approaches are often designed under prior assumptions, e.g., that different individuals have uniform skin tones under white light (De Haan & Jeanne, 2013), which may not hold in complex situations, hence potentially impeding rPPG estimation.

Recently, deep learning approaches have shown remarkable performance in rPPG estimation (Yu et al., 2022b; Lu et al., 2021; Qiu et al., 2019; Lee et al., 2020a; Yu et al., 2019; Niu et al., 2020b; Li et al., 2023b; Wang et al., 2022a; Li & Yin, 2023). For example, Niu et al. (2020b) proposed a feature disentangling mechanism that effectively separates physiological features from noises by employing an encoder-decoder network. Yu et al. (2022b) utilized self-attention layers to capture long-range spatio-temporal relationships among multiple video clips for rPPG estimation. Gupta et al. (2023) built on the vision transformer to regress rPPG waveforms. Despite their remarkable achievements, these supervised learning approaches demand extensive facial videos and synchronously recorded PPG signals for model training, and the collection of PPG signals is both costly and time-consuming. Consequently, there has been a recent surge of interest in optimizing models via contrastive self-supervised learning (Yang et al., 2023; Yue et al., 2023; Gideon & Stent, 2021; Sun & Li, 2024). Specifically, these methods first apply spatial/frequency augmentations to create positive/negative samples for a given video. Next, they pull together the rPPG signals predicted from positive samples and push apart the predictions of positive and negative samples. For instance, Yue et al. (2023) designed a frequency augmentation module to modify the frequency of the rPPG signal of a given video and applied contrastive learning between the transformed and original videos. Similarly, Sun and Li (2024) presented Contrast-Phys+, which contrasts rPPG similarity among different spatial crops and temporal clips within a given video. Due to the absence of supervision from PPG signals, learning the periodic temporal information remains very challenging for self-supervised learning.

To tackle the above issues, we exploit text prompts to help capture frequency-related visual attributes in videos for accurate rPPG estimation. We are the first to introduce VLMs into the remote physiological measurement task.

Fig. 2

Overall architecture of our VL-phys. Given an input video, we respectively apply spatial augmentation and learnable frequency augmentation (LFA) to obtain its positive and negative video samples. We generate their spatio-temporal maps (STMaps) and then create contrastive spatio-temporal maps (C-STMaps) to reflect the frequency ratios of skin color temporal variations between positive and negative samples; meanwhile, we carefully craft text prompts to describe such relations. Afterwards, we fine-tune the pre-trained vision and text encoders of VLM with these formed vision-text pairs via frequency-related multimodal generative and contrastive tasks, i.e. the text-guided visual reconstruction task and vision-text contrastive learning task. Moreover, we introduce the unimodal frequency contrastive loss and the frequency ranking loss to optimize the rPPG signals estimated from different video samples

2.2 Vision-Language Model

Vision-language models (VLMs) such as CLIP (Radford et al., 2021), BLIP (Li et al., 2022), VideoCLIP (Xu et al., 2021), and MiniGPT4 (Zhu et al., 2023) have garnered significant attention in recent years due to their impressive performance in both multimodal and unimodal downstream tasks. They learn cross-modal representations by aligning the vision and text modalities through extensive training on a vast corpus of vision-text pairs. Building upon CLIP, Ye (2023) proposed a cross-modal moment exploration task and a multi-modal temporal relation exploration task to enhance the temporal feature modelling ability of the encoders, achieving state-of-the-art results on downstream tasks like videoQA, video captioning and video-text retrieval. Varma et al. (2023) decomposed image-text samples into multiple region-attribute pairs, and proposed ViLLA to capture fine-grained relationships between image regions and textual attributes. Moreover, for specific visual tasks where off-the-shelf text descriptions are unavailable, researchers typically synthesize text prompts to enable vision-text contrastive learning. For instance, Chatterjee et al. (2024) created text prompts to help the vision encoder understand the spatial relationships (in front of, above, far away, etc.) among objects in images for monocular depth estimation. Similarly, in the domain of surgical instrument segmentation, Zhou et al. (2024a) generated text prompts for instruments using outputs from the large language model GPT-4. In this paper, we construct novel text prompts that describe the frequency ratios of skin color temporal variations over different video samples, and align them with the vision modality in the embedding space to solve the rPPG estimation task.

3 Method

3.1 Overview

The method overview is illustrated in Fig. 2. Our goal is to bootstrap and fine-tune a VLM to learn rPPG signals from unlabelled facial videos using frequency-related generative and contrastive learning mechanisms. The framework contains five phases: data augmentation, vision-text pair generation, feature extraction, visual reconstruction and network optimization.

Specifically, given a facial video \(x^a\), we first perform spatial/frequency augmentation on it to create a number of positive/negative samples. The positive samples \(X^p=\left\{ x^p \right\} \) maintain the same rPPG signal frequency (\(f^a\)) as \(x^a\), while the rPPG signal frequencies of the negative samples are transformed to \(r \times f^a\) by feeding a set of frequency ratios \(R=\left\{ r \right\} \) into a learnable frequency augmentation (LFA) module. Then we generate the spatio-temporal maps (STMaps) \(M^p=\left\{ m^p \right\} \) and \(M^n=\left\{ m^n \right\} \) for both positive and negative samples. These STMaps represent the temporal variations of skin colors in facial ROIs. Next, we introduce a novel frequency-oriented vision-text pair generation method: we create a set of contrastive spatio-temporal maps (C-STMaps) \(M^c=\left\{ m^c \right\} \) by concatenating each \(m^p\) with each \(m^n\) horizontally. Since the frequency ratio of rPPG signals between \(m^p\) and \(m^n\) is a fixed value \(r\), we construct the text prompt \(d\in D\) that depicts the relative frequency ratio of the color variation from left to right in \(m^c\). By this means, \(m^c\) and \(d\) form a vision-text pair. Meanwhile, several non-overlapping patches of \(m^c\) are randomly masked to generate the masked contrastive spatio-temporal map (M-STMap) \(m^o \in M^o\). Having \(M^p\), \(M^n\), \(M^c\) and \(M^o\), in the third stage, we pass them into the vision encoder to obtain visual embeddings \(V^p\), \(V^n\), \(V^c\) and \(V^o\); meanwhile, the text prompts in D are input into the text encoder to obtain textual embeddings. We leverage the pre-trained vision and text encoders from the popular VLM, VindLU (Cheng et al., 2023). After that, an rPPG estimation head is applied to project \(V^p\) and \(V^n\) into rPPG signals \(Y^p=\left\{ y^p \right\} \) and \(Y^n=\left\{ y^n \right\} \). In the visual reconstruction stage, we propose a text-guided visual reconstruction (TVR) module to reconstruct the masked patches in \(m^o\) under the guidance of text prompts. Last, in the network optimization stage, the vision-text contrastive loss (Radford et al., 2021) is applied to align the embeddings of every vision-text pair \(m^c\) and \(d\). The frequency contrastive loss (Yue et al., 2023) is applied to the power spectral densities (PSD) of signals \(Y^p\) and \(Y^n\). We also introduce a novel frequency ranking loss over \(Y^p\) and \(Y^n\) to let the vision encoder accurately predict their frequency ranks.

3.2 Data Augmentation

Given the input facial video \(x^a\), we apply spatial/frequency augmentation strategies on it to generate positive/negative samples, respectively.

The spatial augmentations (i.e. rotations, horizontal and vertical flips) change the spatial layouts of the video frames without altering their colors. Consequently, the frequency of the rPPG signal inside \(x^a\) remains unaffected. In practice, we randomly choose two augmentation operations to create two positive samples \(X^p = \{x^p_1, x^p_2\}\) for \(x^a\). The main frequency of their rPPG signals is the same as that (\(f^a\)) of \(x^a\).

To obtain negative samples, we apply the learnable frequency augmentation (LFA) module (Yue et al., 2023) to \(x^a\) to manipulate its rPPG signal frequency with target ratios. Specifically, we randomly sample three frequency ratios \(R = \{r_1, r_2, r_3\} \) from a ratio bin [0.25, 0.5, 0.75, 1.25, 1.5, 1.75, 2] and feed them into the LFA module along with \(x^a\) to create three negative samples \(X^n = \{x^n_1, x^n_2, x^n_3\}\). Although the samples in \(X^n\) retain the general visual appearance of \(x^a\), the main frequencies of their rPPG signals are modulated to \(\{r_1 \times f^a, r_2 \times f^a, r_3 \times f^a\}\).

Next, we leverage a face detector (3DDFAv2 Guo et al. (2020)) to detect pulsation-sensitive facial ROIs (e.g., forehead, nose) while masking out other areas (e.g., backgrounds) in these samples. Following Niu et al. (2020b, a) and Lu et al. (2021), we transform the positive samples \(X^p\) and negative samples \(X^n\) into spatio-temporal maps (STMaps) \(M^p = \{m^p \in \mathbb {R}^{A \times F \times C}\}\) and \(M^n = \{m^n \in \mathbb {R}^{A \times F \times C}\}\). Each row of an STMap describes the temporal variation of the skin color within a certain facial ROI. A denotes the number of facial ROIs, F denotes the number of frames in the video clip, and C denotes the number of channels of the color space (\(C=3\) for the RGB space). STMaps collapse the original 4-D video features into a 3-D format, which not only highlights the skin color temporal variations in facial videos but also accelerates model training and inference. Afterwards, we resize each m into \(\mathbb {R}^{F \times F \times C}\) to ensure compliance with the input size requirement of the VLM.
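For illustration, the sketch below shows one plausible way to assemble an STMap from per-ROI mean colors and resize it for the ViT-based encoder. The boolean ROI masks, the min-max normalization and the bilinear resizing are our assumptions; the actual construction follows Niu et al. (2020a).

```python
import torch
import torch.nn.functional as F

def build_stmap(video, roi_masks, out_size=224):
    """video: (T, H, W, 3) float tensor of frames (T = F frames).
    roi_masks: (A, H, W) boolean masks of the A facial ROIs from the detector.
    Returns an STMap resized to (out_size, out_size, 3): one row per ROI,
    one column per frame, as described above."""
    T, H, W, C = video.shape
    A = roi_masks.shape[0]
    stmap = torch.zeros(A, T, C)
    for a in range(A):
        region = video[:, roi_masks[a]]          # (T, #ROI pixels, 3)
        stmap[a] = region.mean(dim=1)            # mean skin color per frame
    # min-max normalize each row/channel (a common choice; assumption)
    mn = stmap.amin(dim=1, keepdim=True)
    mx = stmap.amax(dim=1, keepdim=True)
    stmap = 255.0 * (stmap - mn) / (mx - mn + 1e-6)
    # resize the A x T map to out_size x out_size for the ViT-based encoder
    stmap = stmap.permute(2, 0, 1).unsqueeze(0)  # (1, 3, A, T)
    stmap = F.interpolate(stmap, size=(out_size, out_size),
                          mode="bilinear", align_corners=False)
    return stmap[0].permute(1, 2, 0)             # (out_size, out_size, 3)
```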

3.3 Vision-Text Pair Generation

We aim to fine-tune the VLM to capture frequency attributes of \(M^p\) and \(M^n\) for rPPG estimation via representation learning. However, neither ground truth signals nor off-the-shelf text prompts are available in the self-supervised setting. Fortunately, the signal frequencies of negative and positive samples follow fixed ratios, i.e., the frequency of the horizontal color variation in the negative STMap \(m^n\) is r times that in the positive STMap \(m^p\). We can therefore design text prompts to describe their frequency relations. Below we specify the details.

3.3.1 Contrastive Spatio-Temporal Map Generation

Given two positive STMaps \(M^p = \{ m^p_1, m^p_2 \} \) and three negative STMaps \(M^n = \{ m^n_1, m^n_2, m^n_3 \} \), we perform horizontal concatenation between every \(m^p\) and every \(m^n\) to create a set of contrastive spatio-temporal maps (C-STMaps) \(M^c = \{ m^c_j \in \mathbb {R}^{F \times F \times C}| j \in \left[ 1,6 \right] \}\). For example, as depicted in Fig. 3, suppose \(m^n\) contains an rPPG signal with a frequency of \(0.5 \times f^a \) and \( m^p \) maintains an rPPG signal with a frequency of \(f^a\). We horizontally concatenate them and execute a central crop operation to produce the C-STMap \(m^c\). The left and right sides of \(m^c\) respectively reflect the temporal variation of skin color in the positive and negative STMaps, \(m^p\) and \(m^n\). Therefore, the frequency of the horizontal color variation on its right side is \(r=\frac{1}{2}\) times that on its left side.

Note that we use horizontal concatenation to ensure that the rows corresponding to the same ROI in the positive and negative STMaps are properly aligned. This yields a C-STMap in which the frequency of the horizontal color variation displays a noticeable change at the boundary between the two sides. This arrangement allows us to easily design text prompts describing the relative ratio of signal frequencies between the left and right sides of the C-STMap. In contrast, other concatenation methods, such as vertical or channel concatenation, would alter the spatial alignment between the positive and negative STMaps, and the signal frequency difference would not manifest to the same extent as along the single horizontal axis of the C-STMap. This would make it much harder to generate corresponding text prompts for the frequency-related representation learning.
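A minimal sketch of the C-STMap construction is given below, assuming the positive STMap is placed on the left as described above; the central crop simply keeps the middle \(F\) columns of the concatenated map.

```python
import torch

def make_cstmap(m_pos, m_neg):
    """m_pos, m_neg: (F, F, 3) positive and negative STMaps with identical ROI
    ordering. Concatenate them horizontally and centrally crop back to F x F,
    so the left half of the result comes from m_pos and the right half from
    m_neg."""
    F_ = m_pos.shape[0]
    wide = torch.cat([m_pos, m_neg], dim=1)     # (F, 2F, 3)
    start = (wide.shape[1] - F_) // 2           # central crop along the width
    return wide[:, start:start + F_, :]         # (F, F, 3)
```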

Fig. 3

The process of the generation of frequency-oriented vision-text pair and masked contrastive spatio-temporal map (M-STMap)

3.3.2 Frequency-Oriented Vision-Text Pair Generation

Next, we generate a text prompt \(d \in D\) to describe the relative ratio of signal frequencies between the left and right sides of each \(m^c\). Specifically, d is defined as: “the frequency of the horizontal color variation on the left/right side is [ratio] times of that on the right/left side of the image”. Here, the [ratio] token is replaced with the specific frequency ratio r between the negative and positive samples. For the example shown in Fig. 3, the [ratio] token is replaced with the value “1/2”. By this means, the C-STMap \(m^c_j\) and its corresponding text prompt \(d_j\) form a positive vision-text pair. In contrast, \(m^c_j\) combined with the text prompts of other C-STMaps, \(\{d_l| l \in \left[ 1,6 \right] , l \ne j\}\), forms negative vision-text pairs. These pairs are used for the subsequent vision-text contrastive learning and text-guided visual reconstruction. The experimental analysis of other text templates is presented in Sect. 6.4.
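The prompt construction reduces to a simple template fill. The helper below is hypothetical; it instantiates the right/left ordering of the template (the negative STMap sits on the right of the C-STMap) with the ratio rendered as a fraction.

```python
from fractions import Fraction

# The positive STMap occupies the left side of the C-STMap and the negative
# one the right side, so the right-side frequency is r times the left-side
# frequency; the template below instantiates the right/left ordering.
TEMPLATE = ("the frequency of the horizontal color variation on the right side "
            "is {ratio} times of that on the left side of the image")

def make_prompt(r):
    """r: frequency ratio of the negative sample relative to the positive one,
    sampled from the ratio bin [0.25, 0.5, 0.75, 1.25, 1.5, 1.75, 2]."""
    ratio = Fraction(r).limit_denominator(4)    # e.g. 0.5 -> 1/2, 1.25 -> 5/4
    return TEMPLATE.format(ratio=ratio)

# make_prompt(0.5) -> "... on the right side is 1/2 times of that on the left
# side of the image", matching the example in Fig. 3.
```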

3.4 Feature Extraction

In this stage, we extract features from the generated vision-text pairs. We encode C-STMaps \(M^c = \{ m^c_j \}\) and their associated text prompts \(D = \{ d_j \}\) into visual embeddings \(V^c = \{ v^c_j \}\) and textual embeddings \(E = \{ e_j \}\) using the pre-trained vision and text encoders from VindLU (Cheng et al., 2023), respectively.

Vision encoder: The vision encoder is a vision transformer that processes the input \(m^c\) by dividing it into patches and encoding them as a sequence of embeddings. It includes an additional [CLS] token that represents the global feature of \(m^c\). Positional embeddings are added to the patch embeddings to retain positional information, which is crucial for the encoder to understand the spatial layout of \(m^c\). We use \(v^c\) to denote the obtained visual embedding, and specifically, \(\hat{v}^c\) the embedding of the [CLS] token.

Text encoder: The text encoder has the same structure as BERT (Devlin et al., 2018). We feed the tokenized text prompt d into it for feature extraction. We prepend a [CLS] token to d as a global summary of the prompt. The embedding of this [CLS] token is denoted as \(\hat{d}\) and is treated as the global textual embedding of d.

After obtaining \(\hat{v}^c\) and \(\hat{d}\) for each vision-text pair, we fine-tune the encoders by encouraging the feature similarity between \(\hat{v}^c\) and \(\hat{d}\) via the vision-text contrastive loss (Sect. 3.6, \(L_{vtc}\)).

Moreover, we employ the vision encoder to extract the embeddings of the global [CLS] tokens from the two positive STMaps \(M^p=\{m^p_1, m^p_2\}\) and the three negative STMaps \(M^n=\{m^n_1, m^n_2, m^n_3\}\), i.e., \(\hat{v}^p_1\), \(\hat{v}^p_2\), \(\hat{v}^n_1\), \(\hat{v}^n_2\), \(\hat{v}^n_3\). They are then fed into the rPPG estimation head (i.e. a linear layer) to project them into positive signals \(Y^p = \{ y^p_1,y^p_2 \}\) and negative signals \(Y^n = \{ y^n_1,y^n_2,y^n_3 \}\). These signals have various frequencies and are used for unimodal frequency contrastive learning (Sect. 3.6, \(L_{fc}\)).

Subsequently, we concatenate these predicted signals \(Y = \{ y^p_1,y^p_2,y^n_1,y^n_2,y^n_3 \}\) and apply a linear layer to predict their frequency ranks. We optimize the predicted ranks against the ground truth ranks (Sect. 3.6, \(L_{fr}\)).
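A minimal sketch of the two heads described above follows; the embedding dimension, the signal length and the single linear layer for the rank scores are assumptions consistent with the text, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RPPGHeads(nn.Module):
    """Sketch of the rPPG estimation head and the frequency-rank head."""
    def __init__(self, dim=768, signal_len=224, num_samples=5):
        super().__init__()
        self.rppg_head = nn.Linear(dim, signal_len)          # [CLS] -> rPPG signal
        self.rank_head = nn.Linear(num_samples * signal_len, num_samples)

    def forward(self, cls_embeds):
        # cls_embeds: (B, 5, dim) [CLS] embeddings of 2 positive + 3 negative STMaps
        signals = self.rppg_head(cls_embeds)                 # (B, 5, signal_len)
        rank_scores = self.rank_head(signals.flatten(1))     # (B, 5) rank scores
        return signals, rank_scores
```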

3.5 Visual Reconstruction

In this stage, we enable the encoders to develop a deeper understanding of frequency-related attributes via a frequency-related generative learning mechanism. We develop a text-guided visual reconstruction task to reconstruct masked patches of C-STMaps under the guidance of the text prompts. This task implicitly helps our vision encoder learn the periodicity of rPPG signals. Below we specify the details.

3.5.1 Masked Contrastive Spatio-Temporal Map Generation

As shown in Fig. 3, we first divide each C-STMap \(m^c\) into non-overlapping patches and randomly mask a fraction b of them. The visible patches of the resulting masked contrastive spatio-temporal map (M-STMap) \(m^o\) are then fed into the vision encoder to obtain the visual embedding \(v^o\).
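The masking step can be sketched as follows; the patch size of 16 (matching a ViT-B/16 backbone) is an assumption.

```python
import torch

def mask_cstmap(m_c, patch=16, mask_ratio=0.6):
    """m_c: (F, F, 3) C-STMap. Randomly zero out a fraction `mask_ratio` of its
    non-overlapping patch x patch blocks and return the masked map together
    with the boolean patch mask."""
    F_ = m_c.shape[0]
    n = F_ // patch                              # patches per side
    num_patches = n * n
    masked_ids = torch.randperm(num_patches)[: int(mask_ratio * num_patches)]
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[masked_ids] = True
    m_o = m_c.clone()
    for idx in masked_ids.tolist():
        r, c = divmod(idx, n)
        m_o[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch, :] = 0.0
    return m_o, mask
```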

3.5.2 Text-Guided Visual Reconstruction

We further design a text-guided visual reconstruction (TVR) module to recover the masked patches in \(m^o\). Specifically, the TVR module contains three interaction blocks. Its structure is illustrated in Fig. 4. The input for the TVR module, \(z^o\), is the concatenated sequence of encoded visible patch features \(v^o\) and the learnable masked tokens (with positional embeddings). Each interaction block begins by calculating the self-attention (SA) within \(z^o\). It is expressed as: \(v_{sa}^o= SA(LN(z^o))+z^o\), where LN(.) represents the layer normalization. Then we apply the cross-attention (CA) between \(v_{sa}^o\) and its associated textual embedding \(\hat{d}\) to facilitate their interaction and obtain joint representation \(v_{ca}^o\); \(v_{sa}^o\) is treated as the query and \(\hat{d}\) is treated as both the key and value for CA. This process is written as: \(v_{ca}^o =CA(LN(v_{sa}^o),\hat{d})+v_{sa}^o\). Subsequently, a feed-forward layer is exploited to further refine \(v_{ca}^o\). After executing three such blocks, we reconstruct the patches and map the output back to the original RGB space via a \(1\times 1\) convolution layer and an upsampling layer.
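The sketch below transcribes one interaction block from the equations above; the hidden dimension, head count, and the normalization and residual connection around the feed-forward layer are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class TVRBlock(nn.Module):
    """One interaction block of the TVR module (a sketch)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z_o, d_hat):
        # z_o: (B, N, dim) visible-patch features plus learnable mask tokens
        # d_hat: (B, 1, dim) global textual embedding of the prompt
        q = self.norm1(z_o)
        v_sa = self.self_attn(q, q, q)[0] + z_o                    # SA(LN(z_o)) + z_o
        v_ca = self.cross_attn(self.norm2(v_sa), d_hat, d_hat)[0] + v_sa
        return self.ffn(self.norm3(v_ca)) + v_ca                   # feed-forward refinement
```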

3.5.3 Reconstruction Loss

We leverage the reconstruction loss \(L_r\) to minimize the distance between the reconstructed and original C-STMaps. It is formulated as:

$$\begin{aligned} L_r= \frac{1}{6}\sum _{j=1}^{6}MSE(m^c_j,m^q_j) \end{aligned}$$
(1)

where MSE(.) denotes the mean squared error and \(m^q\) represents the reconstructed C-STMap.

Fig. 4

The structure of text-guided visual reconstruction (TVR) module

The text prompt d specifies the frequency relation of the color variation between the left and right parts of \(m^o\). It serves as guidance for the vision encoder to capture the periodic color variation patterns of visible patches in \(m^o\), facilitating the correct reconstruction of the remaining masked patches.

3.6 Network Optimization

Besides the reconstruction loss \( L_{r} \), we introduce the vision-text contrastive loss, the frequency contrastive loss, and the frequency ranking loss for network optimization.

Vision-text contrastive loss (\(L_{vtc}\)): This loss aims to align text with vision in the embedding space. This is achieved by encouraging the similarity between visual and textual embeddings from positive vision-text pairs (i.e. \(m^c_j\) and \(d_j\)) while distinguishing them from negative pairs (i.e. \(m^c_j\) and \(\{d_l\}\)) (see Sect. 3.3.2). We define it as:

$$\begin{aligned} \begin{aligned} L_{vtc}&= -\frac{1}{6}\sum _{j=1}^{6}\left( \log \left[ \frac{\exp (\left\langle \hat{v}^c_j, \hat{d}_j\right\rangle / \tau _1)}{\sum _{l=1,l\ne j}^{6}\exp (\left\langle \hat{v}^c_j, \hat{d}_l\right\rangle / \tau _1)}\right] \right. \\&\left. \quad + \log \left[ \frac{\exp (\left\langle \hat{v}^c_j, \hat{d}_j\right\rangle / \tau _1)}{\sum _{l=1,l\ne j}^{6}\exp (\left\langle \hat{v}^c_l, \hat{d}_j\right\rangle / \tau _1)}\right] \right) \end{aligned} \end{aligned}$$
(2)

where \(\left\langle \hat{v}^c_j, \hat{d}_j\right\rangle \) refers to the cosine similarity between the embedding of the visual [CLS] token and the embedding of the textual [CLS] token. \(\tau _1\) is the temperature to scale logits.
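Equation (2) can be transcribed directly as follows; note that, as written, the positive pair is excluded from the denominators, and the L2 normalization of embeddings before the dot product (yielding the cosine similarity) is an implementation assumption.

```python
import torch
import torch.nn.functional as F

def vision_text_contrastive_loss(v_cls, d_cls, tau1=0.07):
    """v_cls, d_cls: (6, D) [CLS] embeddings of the six C-STMaps and of their
    text prompts. A direct transcription of Eq. (2)."""
    v = F.normalize(v_cls, dim=-1)
    d = F.normalize(d_cls, dim=-1)
    sim = v @ d.t() / tau1                        # (6, 6) scaled cosine similarities
    pos = sim.diag()                              # <v_j, d_j> / tau1
    neg = sim.masked_fill(torch.eye(6, dtype=torch.bool), float('-inf'))
    i2t = pos - torch.logsumexp(neg, dim=1)       # denominator over l != j (rows)
    t2i = pos - torch.logsumexp(neg, dim=0)       # denominator over l != j (columns)
    return -(i2t + t2i).mean()
```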

Frequency contrastive loss (\( L_{fc}\)): This loss aims to pull close rPPG signals (\( Y^p=\{y^p_1,y^p_2\} \)) from positive samples while pushing them away from rPPG signals (\( Y^n=\{y^n_1,y^n_2,y^n_3\}\)) from negative samples in the feature space. We write out \( L_{fc}\) as:

$$\begin{aligned} L_{fc}=\log \left( \frac{\exp (e(y^p_1,y^p_2)/\tau _2 )}{ {\textstyle \sum _{i=1}^{3}} (\exp (e(y^p_1,y^n_i)/\tau _2 )+\exp (e(y^p_2,y^n_i)/\tau _2 ))}+1\right) \end{aligned}$$
(3)

where \(e(y^p_1,y^p_2)\) represents the mean squared error between the power spectral densities (PSD) of signals \(y^p_1\) and \(y^p_2\).
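A direct transcription of Eq. (3) is given below; the FFT-based PSD estimator is an assumption, as the exact estimator is not specified here.

```python
import torch

def psd(y):
    """Power spectral density of a 1-D rPPG signal (FFT-based; an assumption)."""
    return torch.fft.rfft(y).abs() ** 2 / y.numel()

def frequency_contrastive_loss(y_pos, y_neg, tau2=0.08):
    """y_pos: list of 2 positive signals; y_neg: list of 3 negative signals.
    Transcription of Eq. (3), with e(.,.) the MSE between PSDs."""
    e = lambda a, b: torch.mean((psd(a) - psd(b)) ** 2)
    num = torch.exp(e(y_pos[0], y_pos[1]) / tau2)
    den = sum(torch.exp(e(p, n) / tau2) for p in y_pos for n in y_neg)
    return torch.log(num / den + 1.0)
```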

Frequency ranking loss (\(L_{fr}\)): Given that the negative samples \(X^n\) are augmented from the given video based on a series of frequency ratios \(R = \{r_1, r_2, r_3\}\), their predicted negative signals \(Y^n\), along with the positive signals \(Y^p\) from the positive samples \(X^p\), should conform to specific frequency ranks. Therefore, we propose \(L_{fr}\) to optimize the ranks according to the main frequencies of signals \(Y^p\) and \(Y^n\). We achieve this by constraining the relative ranks of signal frequencies, offering a flexible way to avoid overfitting.

Specifically, the ground truth frequency ranks for \(Y = \{y^p_1,y^p_2,y^n_1,y^n_2,y^n_3\}\) are given as multi-level ratings \(L = \{l(1), \ldots , l(5)\}\), where \(l(q) \in \{1, \ldots , 4\}\) is determined by the ratios R. Note that \(l(1) = l(2)\) because \( y^p_1\) and \(y^p_2\) share the same main frequency and should thus be assigned the same frequency rank. If \(l(q) > l(c)\), the former signal has a higher main frequency than the latter. For example, suppose we leverage the sampled ratios \(R = \{r_1=0.5, r_2=1.5, r_3=0.75\}\) to augment the negative samples \(X^n\), which then contain rPPG signals with frequencies of \(\{0.5 \times f^a, 1.5 \times f^a, 0.75 \times f^a\}\); the ground truth frequency ranks for Y should be assigned as \(L = \{3,3,1,4,2\}\). We denote the predicted frequency ranks for Y as \(S = \{s(1), \ldots , s(5)\}\) and define \(L_{fr}\) as:

$$\begin{aligned} L_{fr} = \textstyle \frac{1}{9} \sum _{q=1}^{5}\sum _{c=1,l(c)<l(q)}^{5} \phi (s(q) - s(c)) \end{aligned}$$
(4)

where \(\phi \) is the logistic function \(\phi (u) = \log (1+e^{-u})\).

This loss strengthens the model’s capability to effectively differentiate diverse frequencies of color variations among STMaps.
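Equation (4) can be transcribed as below; the code normalizes by the actual number of ordered pairs with \(l(c) < l(q)\), which equals 9 for the example ranks above.

```python
import torch
import torch.nn.functional as F

def frequency_ranking_loss(scores, gt_ranks):
    """scores: (5,) predicted rank scores s(q) for Y; gt_ranks: ground truth
    frequency ranks l(q), e.g. [3, 3, 1, 4, 2]. Transcription of Eq. (4) with
    phi(u) = log(1 + exp(-u))."""
    loss, pairs = scores.new_zeros(()), 0
    for q in range(len(scores)):
        for c in range(len(scores)):
            if gt_ranks[c] < gt_ranks[q]:
                loss = loss + F.softplus(-(scores[q] - scores[c]))  # phi(s(q) - s(c))
                pairs += 1
    return loss / max(pairs, 1)
```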

The overall loss function is the combination of the above three losses as well as the reconstruction loss in Sect. 3.5.3:

$$\begin{aligned} L= L_{r} + L_{vtc} + L_{fc} + L_{fr} \end{aligned}$$
(5)

4 Experiments

4.1 Datasets

We conduct experiments using four benchmark datasets: UBFC-rPPG (Bobbia et al., 2019), PURE (Stricker et al., 2014), VIPL-HR (Niu et al., 2020a) and MMVS (Yue et al., 2021).

UBFC-rPPG consists of 42 facial videos recorded from 42 subjects using a Logitech C920 HD Pro webcam at a resolution of \(640 \times 480\) and 30 FPS. We follow the protocol outlined in Lu et al. (2021) to split the training and test sets.

PURE consists of 60 one-minute facial videos from 10 subjects with six kinds of activities, including stable, talking, slow translation, fast translation, small head rotation, and medium head rotation. The videos were captured at a frame rate of 30 FPS and a resolution of \(640 \times 480\). The ground truth PPG signals were recorded using a finger clip pulse oximeter. To ensure consistency, we follow the same experimental protocol used in Lu et al. (2021) to split the training and test sets.

VIPL-HR contains 2,378 facial videos from 107 subjects. These videos are collected by three different devices (i.e., Logitech C310, HUAWEI P9, and RealSense F200) in nine less-constrained scenarios (i.e., stable, motion, talking, dark, bright, long distance, exercise, phone stable, and phone motion scenarios). We follow the protocol in Yu et al. (2022b) to split the training and test sets.

MMVS contains 745 facial videos from 129 subjects, with 560 videos recorded in a laboratory setting and 185 videos collected in a hospital environment. An Intel RealSense D435i depth camera was used to record facial videos at a resolution of \(1920 \times 1080\) and a frame rate of 25 FPS. A Contec CMS50E finger clip pulse oximeter was used to record the ground truth PPG signals at a sampling rate of 60 Hz. We follow Yue et al. (2023) for the train-test split.

4.2 Implementation Details

Vision-Language model. We employ the vision and text encoders from the widely-used VindLU (Cheng et al., 2023), specifically ViT-B/16 and BERT pre-trained on large-scale video-text and image-text datasets, i.e. CC3M (Sharma et al., 2018), CC12M (Sharma et al., 2018), Visual Genome (Krishna et al., 2017), SBU Captions (Ordonez et al., 2011), and WebVid10M (Bain et al., 2021). Other choices of VLMs are explored in our ablation study. We initialize the weights of the vision and text encoders with the pre-trained parameters from VindLU, and further fine-tune them in the context of our proposed VL-phys.

Hyper-parameters. For the facial videos in the four datasets, we randomly sample \(F=224\) consecutive frames from each video to build the spatio-temporal maps (STMaps) \(M^p=\left\{ m^p \right\} \) and \(M^n=\left\{ m^n \right\} \) for positive and negative samples, respectively. We follow Niu et al. (2020a) to set the number of detected ROIs in facial videos to \(A=25\). The dimension of each STMap is \(224\times 224\times 3\), ensuring compatibility with the input size requirement of the ViT-based vision encoder. The temperature \(\tau _1\) in Eq. (2) is set to 0.07 following Radford et al. (2021), and \(\tau _2\) in Eq. (3) is set to 0.08 following Yue et al. (2023). The masking ratio \(b=60\%\) follows Kwon et al. (2023). We freeze the weights of the pre-trained LFA module.

Training. The number of training epochs is set to 100 for the UBFC-rPPG, PURE and MMVS datasets, and to 200 for the large-scale VIPL-HR dataset. Our model is trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with a learning rate of \(5\times 10^{-5}\) and a batch size of 16. Training is conducted on four NVIDIA A40 GPUs.

4.3 Evaluation Protocol

Previous works calculate the heart rate (HR), heart rate variability (HRV) and respiration frequency (RF) from the estimated rPPG signals and compare them to those derived from the ground truth PPG signals for performance evaluation (Yu et al., 2022b; Sun & Li, 2024; Liu et al., 2024). We follow them to conduct intra-dataset HR evaluation on the four datasets, and HRV and RF evaluation on the UBFC-rPPG dataset. Moreover, we perform cross-dataset HR evaluation among UBFC-rPPG, PURE and MMVS. HR, HRV and RF are calculated via the toolkit HeartPy (Van Gent et al., 2019).

In line with prior research (Yu et al., 2022b; Sun & Li, 2024; Liu et al., 2024), we use three metrics to assess HR estimation accuracy: the mean absolute error (MAE), root mean squared error (RMSE), and Pearson correlation coefficient (\(\rho \)). For the evaluation of HRV features, which encompass low-frequency power (LF) in normalized units (n.u.), high-frequency power (HF) in normalized units (n.u.), and the LF/HF power ratio, we report the standard deviation (Std) of estimation errors, RMSE, and \(\rho \) as evaluation metrics. Finally, for RF, we also report Std, RMSE and \(\rho \), as in most comparable methods (Yu et al., 2022b; Sun & Li, 2024; Liu et al., 2024).

Table 1 Comparison to state of the art on HR estimation. The results are reported on UBFC-rPPG, PURE, VIPL-HR, and MMVS datasets
Fig. 5

Six examples for the visual comparison between estimated rPPG signals (red curves) and their corresponding ground truth PPG signals (blue curves) (Color figure online)

Fig. 6

The Bland-Altman plots and scatter plots show the difference between estimated HR and ground truth HR on the VIPL-HR and MMVS datasets

4.4 Results

4.4.1 HR Evaluation

We first conduct intra-dataset HR estimation on the four datasets. As shown in Table 1, we compare the proposed VL-phys with three traditional methods (POS (Wang et al., 2017), CHROM (De Haan & Jeanne, 2013) and Green (Verkruysse et al., 2008)); eight DNN-based supervised methods (SynRhythm (Niu et al., 2018), Meta-rppg (Lee et al., 2020a), PulseGan (Song et al., 2021), Dual-Gan (Lu et al., 2021), Physformer (Yu et al., 2022b), Du et al. (2023), Dual-TL (Qian et al., 2024), and ND-DeeprPPG (Liu & Yuen, 2024)); and four DNN-based self-supervised methods (Gideon and Stent (2021), Contrast-Phys+ (Sun & Li, 2024), Yue et al. (2023), and SiNC (Speth et al., 2023)).

First, it is evident that the traditional methods exhibit poor performance across the four datasets. For example, the metric \(\rho \) for POS (Wang et al., 2017) and CHROM (De Haan & Jeanne, 2013) on the UBFC-rPPG dataset is lower than 0.3. Traditional methods typically rely on the assumption that target rPPG signals can be separated from the RGB space by linear projection. However, this assumption proves overly idealistic, as periodic rPPG signals are easily intertwined with non-periodic noises in facial videos, making them hard to disentangle.

Second, the performance gap between VL-phys and the DNN-based supervised methods is narrow. For instance, on the UBFC-rPPG dataset, it is only inferior to two advanced models, Du et al. (2023) and Dual-TL (Qian et al., 2024). On the most challenging dataset, VIPL-HR, we achieve an MAE of 6.04 and an RMSE of 8.78, which are comparable to the best-performing method Dual-Gan (Lu et al., 2021). Notably, on the MMVS dataset, it even decreases the MAE/RMSE of Dual-Gan by 0.67/0.51.

Third, our VL-phys achieves the best performance among self-supervised methods across all metrics. For example, on UBFC-rPPG, it decreases MAE by 0.30 and RMSE by 0.34 compared to Yue et al. (2023). On PURE, it decreases MAE/RMSE by 0.48/0.45 compared to the very recent Contrast-Phys+ (Sun & Li, 2024).

Moreover, it is worth noting that the general performance of comparable methods on the VIPL-HR dataset is clearly inferior to that on the remaining datasets. This discrepancy arises because the facial videos in VIPL-HR were collected in less-constrained scenarios, including diverse ambient lighting conditions and varying camera angles. These challenging factors introduce noise and hinder the capture of physiological clues for accurate rPPG estimation. In contrast, the other datasets were collected under controlled conditions, with consistent lighting and stable camera positioning.

Besides the quantitative results, we also visualize the rPPG predictions in Fig. 5. We observe that the predicted rPPG signals are close to the ground truth PPG signals in both frequency and amplitude. In particular, for videos with special cases, such as subjects with head movements (Fig. 5(c) and (d)), bright illumination (Fig. 5(e)) and dim illumination (Fig. 5(f)), the periodic patterns of peaks and troughs of the predicted rPPG signals remain consistent with the ground truth. This demonstrates the generalizability of the proposed method.

Next, we show the scatter plots and Bland-Altman plots for the VIPL-HR and MMVS datasets in Fig. 6. \(HR_{gt}\) and \(HR_{et}\) represent the HR calculated from the ground truth PPG signal and the estimated rPPG signal, respectively. The top and bottom dashed lines indicate the 95% limits of agreement. We observe that, for the VIPL-HR dataset (Fig. 6(a)), the confidence interval ranges from \(-\)14.30 to 14.83 BPM (beats per minute). For the MMVS dataset (Fig. 6(c)), the interval is from \(-\)6.98 to 7.29 BPM. These intervals confirm that our results have a strong correlation with the ground truth. In addition, from Fig. 6(b) and (d) we observe that the estimated HR closely approximates a linear relationship with the ground truth HR, indicating the effectiveness and robustness of the proposed method.

4.4.2 RF and HRV Evaluation

We further conduct intra-dataset RF and HRV estimation on UBFC-rPPG. We compare our approach with the state of the art and present the results in Table 2. LF, HF, and LF/HF are three attributes for HRV estimation. We observe that the proposed method significantly outperforms traditional methods and most DNN-based methods. These findings underscore the capability of our method to yield high-quality rPPG signals with accurate systolic peaks, resulting in superior HRV estimations.

5 Discussion

5.1 Cross-Dataset HR Evaluation

To evaluate the generalization ability of our method, we perform cross-dataset HR evaluation among the UBFC-rPPG, PURE and MMVS datasets. Specifically, we train VL-phys as well as recent supervised and self-supervised methods on one dataset and test them on another. For example, MMVS\(\rightarrow \)UBFC-rPPG means training on MMVS and testing on UBFC-rPPG. The results are presented in Table 3. It is evident that VL-phys consistently achieves more robust results than the other methods. For example, in the MMVS\(\rightarrow \)UBFC-rPPG setting, VL-phys significantly reduces MAE from 2.18 to 1.68 compared to the latest self-supervised method, Contrast-Phys+ (Sun & Li, 2024). These results show the strong generalization ability of our method in unknown scenarios.

5.2 Semi-Supervised and Fully-Supervised HR Evaluation

We then evaluate the performance of VL-phys on the four datasets under the semi-supervised and fully-supervised settings, as shown in Table 4. For instance, VL-phys (90% + 10%) indicates that 90% of the samples in the training set are utilized for self-supervised learning, while the remaining 10% are used for supervised learning. The two types of data are passed through the same backbone, except that the latter also incorporates ground truth supervision into the training. We use the Pearson correlation coefficient-based loss (Liu et al., 2024) to constrain the estimated rPPG to have the same rhythm periodicity as the ground truth. VL-phys (0% + 100%) indicates the fully-supervised setting, in which all samples in the training set are used for supervised learning. First, we observe that, in the semi-supervised setting, when a portion of the ground truth signals is available, VL-phys shows a significant performance improvement, even surpassing the best-performing supervised methods presented in Table 1. For instance, with only 30% of the samples provided with ground truth signals, VL-phys achieves MAEs of 0.13, 0.16, 4.86 and 1.93 across the four datasets, respectively, significantly outperforming the best supervised methods. Second, increasing the labeled data from VL-phys to VL-phys (50% + 50%) yields superior outcomes, suggesting that more ground truth can effectively enhance the learning of the periodicity of rPPG signals. Third, the fully-supervised variant VL-phys (0% + 100%) achieves the best performance. These results demonstrate that our method can be effectively adapted to a semi-supervised or fully-supervised framework and exhibits superior performance in scenarios where some labeled videos are available.

Table 2 Comparison to state of the art on RF and HRV estimation
Table 3 Comparison to state of the art on cross-dataset HR estimation
Table 4 HR estimation of VL-phys on four datasets under semi-supervised and fully-supervised settings
Table 5 Quantitative performance on subjects with or without makeup
Fig. 7

Qualitative performance on subjects with makeup

5.3 Effect of Makeup

We further assess the impact of makeup on our method. Following Yue et al. (2023), we divide the MMVS dataset into two subsets: MMVS-w/ makeup and MMVS-w/o makeup. Each subset comprises a training set and a test set, which are used for model training and evaluation, respectively. The HR estimation results are summarized in Table 5. We observe that the performance on MMVS-w/o makeup surpasses that on MMVS-w/ makeup: e.g. the MAE is 2.15 for the former and 2.47 for the latter. Additionally, compared to Gideon and Stent (2021), which exhibits an MAE difference of 0.51 between the two subsets, the influence of makeup on our method is smaller (i.e. an MAE difference of 0.32). Fig. 7 shows two estimated rPPG signals and their ground truth signals from subjects with makeup. We can see that some signal peaks are not generated (red arrows). This is because makeup can obscure the skin color temporal variations, thereby hindering the extraction of physiological clues for accurate rPPG estimation.

6 Ablation Study

We conduct an ablation study for intra-dataset HR estimation on the UBFC-rPPG and MMVS datasets.

6.1 Text Modality

In our method, we for the first time leverage text prompts to enhance the vision encoder's understanding of frequency-related visual attributes for rPPG estimation. We now investigate the effectiveness of leveraging the text modality. In Table 6, VL-phys w/o text denotes a variant in which the text modality is removed from the framework. In this case, the STMaps for positive and negative samples are directly fed into the vision encoder for rPPG estimation without constructing vision-text pairs. The proposed TVR module, vision-text contrastive loss \(L_{vtc}\), and reconstruction loss \(L_{r}\) are removed due to the absence of the text modality. The results show a significant increase in MAE, rising to 0.57 on the UBFC-rPPG dataset and 2.88 on the MMVS dataset, attesting to the usefulness of the text modality in this task.

Text-guided visual reconstruction task. The text modality is leveraged in the frequency-related generative learning task, i.e., the text-guided visual reconstruction task, to optimize the vision encoder to reconstruct the C-STMaps. Specifically, a TVR module and a reconstruction loss \(L_{r}\) are designed to recover the masked patches of C-STMaps based on the linguistic clues provided by their associated text prompts. If we remove this task from our framework, the MAE and RMSE increase for HR estimation (see Table 6: VL-phys w/o TVR). This observation suggests that the task effectively encourages the interaction between the vision and text modalities, enabling the vision encoder to capture accurate periodic patterns in the temporal variations of skin colors.

Vision-text contrastive loss. We further validate the effectiveness of the vision-text contrastive loss \(L_{vtc}\) by ablating it in our framework, see Table 6. A clear performance drop can be observed for VL-phys w/o \(L_{vtc}\). This justifies our primary motivation of performing vision-text contrastive learning to boost performance.

Table 6 Ablation study on the text modality
Table 7 Ablation study on the vision and text encoders

6.2 Vision and Text Encoders

We employ the pre-trained vision and text encoders from VindLU (Cheng et al., 2023) to extract features from the generated vision-text pairs. Table 7 details the performance of variants using encoders from other widely-used VLMs, i.e., CLIP (Radford et al., 2021), COCA (Yu et al., 2022a), ALBEF (Li et al., 2021), BLIP (Li et al., 2022), BLIP2 (Li et al., 2023a), All-in-one (Wang et al., 2023), CLIP-ViP (Xue et al., 2023). For example, VL-phys(CLIP) indicates a variant in which we use the pre-trained vision and text encoders from CLIP.

First, it is evident that all of these variants outperform the state-of-the-art self-supervised rPPG estimation methods in Table 1. This validates that the VLMs fine-tuned with our VL-phys possess a stronger ability to capture the periodic variation of skin color for rPPG estimation.

Second, our default setting with VindLU exhibits comparable (slightly superior) performance to VL-phys(All-in-one) and VL-phys(CLIP-ViP), while the other variants perform worse. We attribute this to the comprehensive training of VindLU, All-in-one, and CLIP-ViP across large-scale video-text datasets. In contrast, other VLMs such as BLIP and COCA are pre-trained on image-text datasets, which are not optimal for our video-based task. The extensive coverage of dynamic scenes and contexts allows VindLU, All-in-one, and CLIP-ViP to develop richer spatio-temporal feature encoding capabilities.

Pre-trained model weights. We present another variant that does not use the pre-trained weights from VindLU, but instead initializes the weights of the encoders randomly. This variant is denoted as VL-phys w/o pw in Table 7. The MAE worsens to 0.55 and 2.39 on the two datasets, respectively. This indicates that pre-training is important for understanding text prompts that describe variational visual details.

Textual embedding extraction. We verify the benefit of using the text encoder BERT in our framework. Given that BERT has approximately 110 million parameters, we introduce an alternative light-weight variant in which the text encoder is only an MLP with five linear layers. This encoder exclusively processes the [ratio] token from the text, and its output is directly employed for vision-text contrastive learning and text-guided visual reconstruction. This variant is denoted as VL-phys-MLP in Table 7. We observe severe performance degradation for this variant, attesting to the importance of using an established text encoder. Moreover, compared to using the single [ratio] token, our designed text prompt provides richer semantics describing the skin color temporal variation.

Image-guided masked language modeling task. Image-guided masked language modeling is another kind of generative learning task that could potentially be integrated into our VL-phys. Given a vision-text pair, it aims to predict several randomly masked tokens in the text prompt conditioned on the visual tokens and the remaining unmasked text tokens. This task differs from our designed text-guided visual reconstruction task, which focuses on recovering visual patches. We incorporate this auxiliary task into our framework to evaluate it; the variant is denoted as VL-phys-MLM in Table 7. Note that popular VLMs typically leverage a multimodal fusion encoder to fuse multimodal cues and reconstruct the masked tokens based on vision-grounded textual embeddings; we employ the pre-trained multimodal fusion encoder from VindLU for this purpose. We observe degraded performance for this variant. Similar findings are also noted from VL-phys(BLIP) to VL-phys(BLIP)-MLM. The issue primarily arises from the uninformative tokens within the text prompts, such as "the", "of", "is" and "as". When these tokens are masked instead of more informative ones like the [ratio] token, the model tends to predict them using solely linguistic cues, rather than seeking the answer in the visual modality.

Furthermore, we improve VL-phys-MLM by consistently setting the masked tokens in the text prompts to the most important [ratio] token, denoted as VL-phys-MLM(r) in Table 7. This variant outperforms VL-phys-MLM and shows a negligible improvement over VL-phys. However, considering the additional computation incurred by the multimodal fusion encoder, we recommend not using this variant by default.

Fine-tuning methods. During the training stage of VL-phys, we fine-tune the pre-trained vision and text encoders of the VLM using the full fine-tuning strategy. We present three variants that instead use parameter-efficient fine-tuning techniques for network optimization, including prompt-tuning (i.e. VPT (Jia et al., 2022) and CoOp (Zhou et al., 2022)) and adapter-tuning (Chen et al., 2022). These variants are denoted as VL-phys (VPT), VL-phys (CoOp), and VL-phys (Adapter) in Table 7. They achieve inferior performance compared to the original VL-phys. This can be attributed to the difficulty of fine-tuning the encoders to effectively understand frequency-related attributes: unlike parameter-efficient fine-tuning, full fine-tuning enables the encoders to comprehensively digest the frequency-related knowledge of skin color temporal variation.

Table 8 Ablation study on loss functions
Table 9 Ablation study on different text prompts

6.3 Losses

Frequency contrastive loss. The frequency contrastive loss \(L_{fc}\) is the most important loss in existing self-supervised remote physiological measurement methods (Yang et al., 2023; Yue et al., 2023; Gideon & Stent, 2021; Sun & Li, 2024). It enforces signal frequency similarity among positive samples and dissimilarity between positive and negative samples in the feature space. In Table 8, VL-phys w/o \(L_{fc}\) denotes a variant that does not use \(L_{fc}\) in our framework. The MAE and RMSE are substantially increased while \(\rho \) is substantially decreased compared to the original framework, e.g., +2.42, +2.83 and \(-\)0.09 on MMVS, which proves the benefit of adding \(L_{fc}\).

Frequency ranking loss. We verify the effectiveness of the proposed frequency ranking loss \(L_{fr}\) (Sect. 3.6). First, as shown in Table 8, omitting \(L_{fr}\) leads to deteriorated results on both datasets. Second, \(L_{fr}\) optimizes the relative frequency ranks of the predicted positive and negative signals rather than their absolute frequency differences, which avoids overfitting. If we instead constrain the signal frequency ratios between these signals to match the ground truth ratios R, denoted by \(L_{fr}\rightarrow L_{afr}\) in Table 8, the performance degrades, e.g., +0.05 and +0.09 in MAE on the two datasets, respectively.
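A sketch of the ranking idea is given below, assuming a differentiable soft estimate of the dominant frequency and a margin ranking formulation; both are illustrative choices rather than the exact definition of \(L_{fr}\) in Sect. 3.6.

```python
import torch
import torch.nn.functional as F

def soft_dominant_frequency(x: torch.Tensor, fs: float = 30.0) -> torch.Tensor:
    """Differentiable surrogate for the dominant frequency: the expectation of
    frequency under the normalized PSD of each predicted signal, (N, T) -> (N,)."""
    p = torch.abs(torch.fft.rfft(x, dim=-1)) ** 2
    p = p / (p.sum(dim=-1, keepdim=True) + 1e-8)
    freqs = torch.fft.rfftfreq(x.shape[-1], d=1.0 / fs)
    return (p * freqs).sum(dim=-1)

def frequency_ranking_loss(f_pos, f_neg, ratio, margin: float = 0.0):
    """Only the *order* of predicted frequencies must agree with the augmentation
    ratio r (r > 1 means the negative should be faster), not the exact ratio."""
    target = torch.sign(ratio - 1.0)   # +1 if the negative was sped up, -1 if slowed down
    return F.margin_ranking_loss(f_neg, f_pos, target, margin=margin)

f_p = soft_dominant_frequency(torch.randn(3, 300))
f_n = soft_dominant_frequency(torch.randn(3, 300))
l_fr = frequency_ranking_loss(f_p, f_n, ratio=torch.tensor([0.5, 2.0, 1.5]))
```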

6.4 Text Prompt Design

We analyze the influence of the text prompt design in Table 9. We denote by VL-phys-T1, VL-phys-T2 and VL-phys-T3 three variants, each using an alternative prompt. These variants show performance comparable to the original VL-phys, indicating that our approach remains effective as long as the text prompt is grammatically correct and accurately depicts the relative frequency ratios of skin color temporal variations in C-STMaps. This robustness highlights our method’s ability to adapt to various text prompts while consistently capturing frequency-related attributes for rPPG estimation.

Additionally, we introduce VL-phys-Video as another variant that does not generate spatio-temporal maps (STMaps) from the positive and negative video samples. In this variant, we concatenate the positive and negative videos along the temporal dimension to obtain a cross-sample video and define its corresponding text prompt using the template shown in Table 9. The MAE increases to 0.57 and 3.01 on the two datasets, respectively. This can be attributed to the potential noises and distractions introduced when directly using the original videos. In contrast, STMaps retain only pixels from detected facial ROIs while masking other irrelevant pixels, effectively highlighting the temporal variation of skin color.

6.5 Zero-Shot rPPG Estimation Using Pretrained VLMs

We investigate the zero-shot rPPG estimation performance of off-the-shelf pre-trained VLMs (CLIP, BLIP, ALBEF), corresponding to the discussion in Sect. 1. We first define a frequency bin [0.5, 1.0, 1.5, 2.0, 2.5] and create text prompts describing the frequency of skin color temporal variation, using the template "the frequency of skin color temporal variation is [ratio] hertz in the video." We replace the [ratio] token with each frequency value in the bin and compute the similarity between the embeddings of the resulting text prompt and the given video. After obtaining the similarity score for each frequency value, we linearly combine the frequency values weighted by their scores to obtain the final prediction, and multiply the result by 60 (converting hertz to beats per minute) to compute the HR for comparison with our VL-phys. As shown in Table 10, there is a substantial performance gap between VL-phys and the off-the-shelf pre-trained VLMs, highlighting their difficulty in capturing periodic visual patterns. These findings also reveal that VLMs need to be re-trained or fine-tuned on specific vision-text pairs; otherwise they struggle to understand frequency-related attributes.
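A minimal sketch of this zero-shot protocol with CLIP is shown below; averaging per-frame image embeddings as the video representation and the particular CLIP checkpoint are our assumptions for illustration.

```python
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

freq_bin = [0.5, 1.0, 1.5, 2.0, 2.5]  # candidate frequencies in hertz
prompts = [f"the frequency of skin color temporal variation is {f} hertz in the video."
           for f in freq_bin]
text_tokens = clip.tokenize(prompts).to(device)

@torch.no_grad()
def zero_shot_hr(frames):
    """frames: list of preprocessed frame tensors of shape (3, 224, 224).
    CLIP has no native video encoder, so the video embedding is approximated
    here by averaging per-frame image embeddings (our assumption)."""
    v = model.encode_image(torch.stack(frames).to(device)).mean(dim=0, keepdim=True)
    t = model.encode_text(text_tokens)
    v, t = v / v.norm(dim=-1, keepdim=True), t / t.norm(dim=-1, keepdim=True)
    weights = (v @ t.T).softmax(dim=-1).squeeze(0)        # similarity score per frequency
    freq_hz = (weights * torch.tensor(freq_bin, device=device)).sum()
    return 60.0 * freq_hz                                 # hertz -> beats per minute
```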

6.6 Visual Embeddings for rPPG Estimation

We adopt the embeddings of the [CLS] tokens from STMaps to predict rPPG signals. Alternatively, we can aggregate information from local patches of STMaps for the same purpose. In Table 11, VL-phys(g\(\rightarrow \)l) denotes a variant in which the rPPG signals are predicted from the average of the patch embeddings of STMaps, and VL-phys(g+l) denotes a variant that concatenates the global [CLS] token embedding with the average of the patch embeddings. VL-phys(g+l) shows performance similar to the original VL-phys. Considering the increase in computation, we opt to use the [CLS] tokens only.
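The three aggregation choices can be summarized as below, assuming a standard ViT-style encoder output with the [CLS] token at index 0.

```python
import torch

def rppg_features(vit_tokens: torch.Tensor, mode: str = "cls") -> torch.Tensor:
    """vit_tokens: (batch, 1 + num_patches, dim) output of the vision encoder
    for an STMap, with the [CLS] token at index 0."""
    cls_tok, patches = vit_tokens[:, 0], vit_tokens[:, 1:]
    if mode == "cls":        # default VL-phys: global [CLS] embedding only
        return cls_tok
    if mode == "g->l":       # VL-phys(g->l): average of local patch embeddings
        return patches.mean(dim=1)
    if mode == "g+l":        # VL-phys(g+l): concatenate global and local cues
        return torch.cat([cls_tok, patches.mean(dim=1)], dim=-1)
    raise ValueError(f"unknown mode: {mode}")
```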

Table 10 Zero-shot HR estimation performance of pre-trained VLMs
Table 11 Ablation study on visual embeddings for rPPG estimation
Table 12 Ablation study on the negative sample generation method
Fig. 8 Parameter variation on the number of negative samples

6.7 Generation of Negative Samples

In the framework of VL-phys, we use the LFA module (Yue et al., 2023) to generate negative samples. We further verify the model performance using an alternative negative sampling method (Wang et al., 2022b), in which negative samples are generated by altering the video speed according to the Nyquist sampling theorem. The results of this variant (denoted as LFA\(\rightarrow \)NST) are presented in Table 12. It achieves performance comparable to the original VL-phys, demonstrating the robustness and flexibility of the proposed framework in adapting to different negative sample generation methods.

Moreover, we evaluate the effectiveness of the predefined frequency ratio bin used in the LFA module for negative sample generation. According to Yu et al. (2021), the frequency of skin color temporal variation in real facial videos typically lies within the range of [0.75, 4] Hz. Based on this observation, we predefine a frequency ratio bin with a narrow value range to avoid large frequency changes between the generated positive and negative samples. This design enables the subsequent vision encoder to focus on distinguishing samples by their subtle frequency differences, which is essential for accurately extracting the underlying rPPG signals. If the frequency ratio r is instead randomly sampled from a broader range (0, 10), as in the variant VL-phys (br) in Table 12, the MAE and RMSE increase significantly, e.g., +1.32 and +1.05 on the MMVS dataset. These results underscore the effectiveness of our predefined ratio bin.
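The contrast between the two sampling strategies is sketched below; the concrete values in the narrow bin are illustrative placeholders, since this section only specifies that the bin keeps the positive/negative frequency gap small.

```python
import random

# Illustrative narrow ratio bin for the LFA module; the exact values used in
# our implementation may differ, the point is that they stay close to 1.
RATIO_BIN = [0.5, 0.75, 1.25, 1.5, 2.0]

def sample_frequency_ratio(narrow: bool = True) -> float:
    """narrow=True follows the predefined bin; narrow=False reproduces the
    VL-phys (br) ablation, which draws r uniformly from (0, 10)."""
    return random.choice(RATIO_BIN) if narrow else random.uniform(0.0, 10.0)
```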

We further vary the number of negative video samples. In Fig. 8, we vary this number k from 1 to 6 and observe that the performance remains stable when \(k\ge 3\). Considering that a larger k increases the computation cost, we select \(k=3\) as our default setting.

Table 13 Ablation study on the spatio-temporal map generation methods
Fig. 9 Parameter variation on the masking ratio in C-STMaps

Fig. 10 Parameter variation on the number of negative STMaps when generating C-STMaps

6.8 Generation of STMaps

In our framework, we use the spatio-temporal map generation method proposed in Niu et al. (2020a) and Lu et al. (2021) to generate the STMaps for positive and negative samples. However, different kinds of STMaps can be generated using the alternative methods discussed in Liu et al. (2024). Table 13 presents two variants, VL-phys (POS) and VL-phys (CHROM), where our generated STMaps are replaced by POS-STMaps and CHROM-STMaps, respectively. A POS-STMap consists of the U and V channels derived from the original facial videos, along with a POS channel generated by the POS algorithm (Wang et al., 2017). Similarly, a CHROM-STMap also incorporates the U and V channels, but the POS channel is replaced by a CHROM channel generated by the CHROM algorithm (De Haan & Jeanne, 2013). These variants perform similarly to the original VL-phys, indicating that the proposed framework is robust and generalizable across different spatio-temporal map generation methods.
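For reference, the extra channel of these alternative maps could be computed roughly as follows, using the published POS and CHROM projections on temporally normalized RGB traces; the sliding-window processing of the original algorithms is omitted for brevity.

```python
import numpy as np

def pos_channel(rgb: np.ndarray) -> np.ndarray:
    """POS projection (Wang et al., 2017) of mean skin color traces, rgb: (T, 3).
    The windowed overlap-add processing of the original algorithm is omitted."""
    c = rgb / (rgb.mean(axis=0, keepdims=True) + 1e-8)   # temporal normalization
    s1 = c[:, 1] - c[:, 2]                               # G - B
    s2 = -2 * c[:, 0] + c[:, 1] + c[:, 2]                # -2R + G + B
    return s1 + (s1.std() / (s2.std() + 1e-8)) * s2

def chrom_channel(rgb: np.ndarray) -> np.ndarray:
    """CHROM projection (De Haan & Jeanne, 2013) used for the CHROM-STMap variant."""
    c = rgb / (rgb.mean(axis=0, keepdims=True) + 1e-8)
    x = 3 * c[:, 0] - 2 * c[:, 1]
    y = 1.5 * c[:, 0] + c[:, 1] - 1.5 * c[:, 2]
    return x - (x.std() / (y.std() + 1e-8)) * y
```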

6.9 Masking Ratio in C-STMaps

We follow Kwon et al. (2023) to mask non-overlapping patches from contrastive spatio-temporal maps (C-STMaps) with a masking ratio of \(b=60\%\). To explore the impact of this ratio, we also experiment with masking ratios of 40%, 50%, 70%, and 80%, and report the performance in Fig. 9. The results reveal that lower masking ratios yield inferior results: with few patches masked, the TVR module reconstructs them mainly from cues in the unmasked patches rather than from the text prompts. On the other hand, relying excessively on the textual embeddings (i.e., very high masking ratios) is also not advisable. Therefore, we set \(b=60\%\) by default.
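A sketch of sampling non-overlapping masked patches at ratio \(b\) is given below; the 14 x 14 patch grid in the usage example is an assumption about the vision encoder.

```python
import torch

def random_patch_mask(num_patches: int, mask_ratio: float = 0.6, batch: int = 1) -> torch.Tensor:
    """Boolean mask over the non-overlapping patches of a C-STMap; True marks a
    patch that is masked and must be recovered by the TVR module."""
    num_masked = int(num_patches * mask_ratio)
    idx = torch.rand(batch, num_patches).argsort(dim=1)[:, :num_masked]  # random patch indices
    mask = torch.zeros(batch, num_patches, dtype=torch.bool)
    mask.scatter_(1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask

mask = random_patch_mask(num_patches=14 * 14, mask_ratio=0.6)
print(mask.float().mean())  # ~0.6
```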

6.10 Generation of C-STMaps

We horizontally concatenate every positive STMap with every negative STMap to create a set of C-STMaps. In this configuration, the frequency of the horizontal color variation within each C-STMap changes only once. More complex C-STMaps can be constructed by concatenating each positive STMap with multiple negative STMaps. For example, assume we have a positive STMap and two negative STMaps containing rPPG signals with frequencies of \(f^a\), \(0.5 \times f^a\) and \(2\times f^a\), respectively. Following the process outlined in Sect. 3.3.1, we crop and concatenate them into a C-STMap and generate a text prompt describing the relative ratios of signal frequencies across its different parts, i.e., "the frequency of the horizontal color variation on the middle and right side is respectively 1/2 times and 2 times of that on the left side of the image". We denote the number of negative STMaps used to create C-STMaps by q. In Fig. 10, we vary q from 1 (our default setting) to 3. A larger q results in poorer performance because the more complex frequency changes in the C-STMaps confuse the encoder, making it harder to accurately determine the frequency ratios between different parts.
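The construction and its accompanying prompt can be sketched as follows; the equal-width cropping and the toy STMap shapes are simplifying assumptions, and only the q = 1 and q = 2 prompt templates from the text are handled.

```python
import numpy as np

def build_cstmap(pos_stmap, neg_stmaps, ratios):
    """Horizontally concatenate one positive STMap with q negative STMaps and
    generate the matching prompt (cropping to equal widths is a simplification
    of the procedure in Sect. 3.3.1)."""
    q = len(neg_stmaps)
    w = pos_stmap.shape[1] // (q + 1)
    cstmap = np.concatenate([pos_stmap[:, :w]] + [m[:, :w] for m in neg_stmaps], axis=1)
    if q == 1:
        prompt = ("the frequency of the horizontal color variation on the right side "
                  f"is {ratios[0]} times of that on the left side of the image")
    else:  # q = 2, matching the example above
        prompt = ("the frequency of the horizontal color variation on the middle and "
                  f"right side is respectively {ratios[0]} times and {ratios[1]} times "
                  "of that on the left side of the image")
    return cstmap, prompt

pos = np.random.rand(25, 300, 3)                       # toy STMap: 25 ROIs x 300 frames x 3 channels
negs = [np.random.rand(25, 300, 3) for _ in range(2)]  # frequencies 0.5 x f and 2 x f
cstmap, prompt = build_cstmap(pos, negs, ratios=["1/2", "2"])
```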

7 Conclusion

This paper presents a novel frequency-centric self-supervised learning framework that bootstraps vision-language models (VLMs) for remote physiological measurement. It develops both generative and contrastive learning mechanisms to enhance the VLM’s ability to capture the frequency of skin color temporal variation for rPPG estimation. For a given facial video, our VL-phys begins by creating positive and negative spatio-temporal maps (STMaps) with varying rPPG signal frequencies. We then horizontally concatenate them to create contrastive spatio-temporal maps (C-STMaps) and construct text prompts describing the relative ratios of signal frequencies across different parts of the C-STMaps, forming frequency-oriented vision-text pairs. Next, we fine-tune the pre-trained vision and text encoders of the VLM on these vision-text pairs via frequency-related multimodal generative and contrastive tasks: the text-guided visual reconstruction task, which recovers masked image patches of C-STMaps under the guidance of text prompts, and the vision-text contrastive learning task, which aligns the vision and text modalities. Moreover, we propose the unimodal frequency contrastive and ranking task to optimize the estimated rPPG signals across multiple video samples. Extensive experiments on four datasets demonstrate that our method not only outperforms existing self-supervised methods but also rivals state of the art supervised methods in HR, HRV, and RF estimation.