Abstract
Recognizing the properties of elastic tissue can facilitate surgical navigation, e.g., when localizing lesions by palpation. However, palpation is very subjective and often unavailable in minimally invasive surgery. High-speed optical coherence elastography (OCE) adapted for intraoperative use could enable elasticity estimation by measuring the propagation of mechanically stimulated waves. However, robust estimation of wave velocity can be challenging, and reconstruction of the elastic modulus is highly dependent on the correct modeling of wave propagation. We therefore consider deep learning for the end-to-end estimation of elasticity from OCE phase data. Since optical coherence tomography inherently produces a temporal sequence of one-dimensional axial scans (A-scans), we consider transformer-based deep learning models to directly process A-scan sequences. For homogeneous tissue phantoms with known elastic properties, we obtain a mean error of 1.64 kPa, which significantly improves elasticity reconstruction compared to conventional processing and the best CNN-based approach with 7.80 kPa and 5.55 kPa, respectively. Furthermore, we demonstrate generalization to heterogeneous phantoms with inclusions and assess the elasticity of soft tissue samples, including heart, kidney, and liver. The results show that transformer architectures are well suited for reconstructing elasticity from A-scan sequences in OCE.
© 2025 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
The elasticity of the soft tissue is an important indicator for differentiating between healthy and pathological conditions [1,2]. During palpation, local differences in elastic properties are used to detect and localize lesions. However, palpation is very subjective and unavailable in laparoscopic and robot-assisted surgery. Therefore, additional sensors are needed to obtain quantitative feedback on elastic properties during minimally invasive surgery. One approach is elastography. In addition to ultrasound and magnetic resonance imaging, optical coherence tomography (OCT) has been considered for such biomechanical tissue characterization [3]. OCT is based on the principle of low-coherence interferometry and uses near-infrared light to acquire one-dimensional, high-resolution depth scans, so-called A-scans. A-scans are recorded sequentially while the beam is moved laterally across the sample. Multi-dimensional images, e.g. B-scans (2D) or C-scans (3D), are generated by restructuring the A-scan sequence along the lateral dimensions. Compared to ultrasound or magnetic resonance imaging, OCT offers superior spatial and temporal resolution [3,4]. With phase-sensitive OCT, sub-micrometer displacements can be detected at high temporal resolution, enabling quantitative assessment of elastic properties in wave-based elastography [4,5]. Fiber-based OCT probes enable small form factors suitable for in-vivo use, e.g. miniaturized imaging probes [6–9], intravascular OCT [10] or sensors integrated into medical instruments [11,12] and needles [13,14]. OCT is therefore particularly interesting when it comes to adapting elastography for minimally invasive surgery.
Optical coherence elastography (OCE) has been demonstrated mainly with tabletop systems in numerous applications, e.g. for analysing cornea or breast tissue [4,15]. Obtaining quantitative OCE intraoperatively is considerably more difficult, as the measurement of mechanical stress and tissue deformation must both be performed locally at the tissue surface [14]. Therefore, intraoperative, quantitative elastography is currently unavailable, especially for robot-assisted surgery. As an alternative to wave-based OCE, handheld scanners for quantitative compression-based OCE have been proposed [16–18] and were successfully demonstrated for intraoperative breast cancer detection in the surgical cavity [18]. For wave-based OCE, however, we additionally require suitable in-vivo wave excitation in the intraoperative environment, beyond the laparoscopic OCT imaging previously considered in [6,19]. Acoustic radiation force excitation has been successfully demonstrated in intravascular and intrabronchial applications by external ultrasound-based shear wave excitation [20] or with integrated piezoelectric elements [21–23]. However, acoustic wave excitation requires an acoustic medium to couple the transducer and tissue, e.g. water or gel. Since surgical instruments are already in contact with tissue during normal tool-tissue interaction, we consider displacement-based piezoelectric excitation for wave-based OCE instead, which provides high bandwidth, load force and precise frequency tuning [4]. Piezoelectric excitation has also recently been coupled with a fiber scanning endoscope for wave-based OCE [24].
We therefore consider a surgical instrument with integrated piezo actuators [25] and excite the instrument itself for OCE imaging (Fig. 1). However, robust detection of wave propagation for accurate elasticity estimates can be challenging due to the optical setup, e.g. imperfect scanning, noise and phase wrapping, and the underlying tissue mechanics, e.g. non-linearity, inhomogeneities and reverberations [4]. Consequently, several approaches for phase velocity estimation similar to ultrasound elastography have been considered, e.g. time-of-flight (TOF) [26,27] or Fourier estimator (FE) [28,29]. Furthermore, the reconstruction of the Young’s modulus is highly dependent on the choice of the mechanical model [30], e.g. the shear wave equation or the Rayleigh–Lamb frequency equation. It is often assumed that only surface waves propagating through a thin layer on top of the sample are observed during OCE, and the Rayleigh surface wave equation is therefore chosen for the reconstruction of the elastic modulus [31–33]. However, even in phantom experiments, the reconstructed elasticities based on OCE and the Rayleigh–Lamb frequency equation differ from the values determined by gold standard uniaxial compression testing [30]. So while the assumptions on wave propagation may be justified depending on the experimental setup, a more versatile and accurate model is desirable.
Fig. 1. In wave-based OCE, a wave is excited at the sample surface, e.g. with a modified surgical tool [25], and propagates through the tissue. Simultaneously, A-scans are acquired at a constant rate while the scanning mirrors direct the sample beam back and forth at high speed. The resulting data is therefore a temporal sequence of A-scans that includes both the oscillating scanning motion and the wave propagation. In conventional processing, the A-scans are rearranged according to their spatial position to represent two-dimensional images over time (B-M-scan) and facilitate the analysis of wave propagation.
We consequently consider deep learning to calibrate our OCE setup for end-to-end elasticity reconstruction. Applications of deep learning in OCT have long focused on convolutional neural networks (CNNs) [34,35] and similarly CNNs have been considered for wave-based OCE [24,25,36–38]. Spatio-temporal 3D and 4D DenseNets were previously used for processing temporal sequences of B- and C-scans to estimate the elasticity of gelatin phantoms [25,36,37]. Recently, VP-NET has been proposed as an end-to-end approach for estimating wave velocity from a single B-scan [38]. VP-NET combines depth-wise separable convolutions that effectively reduce the model size, e.g. as shown in MobileNets [39], with squeeze-and-excitation blocks presented in [40]. However, these approaches consider the OCE data exclusively as multidimensional images and not as purely temporal sequences. OCT scans are the sequential accumulation of one-dimensional A-scans at different spatial or temporal positions. In the case of OCE, the excited wave propagates through the tissue while the scanning mirror continuously deflects the sample beam back and forth (Fig. 1). In conventional processing and CNNs, the A-scans are rearranged into cross-sectional images (B-scan) over time (B-M-scan). But the obtained measurement is still a sequence of A-scans containing spatio-temporal information about the wave propagation. Therefore, we consider transformers that are characterised by processing sequential data and directly consider the A-scan sequence as our input representation.
Transformers [41] have proven successful in processing long input sequences in natural language processing. Vision transformers (VITs) [42] have transferred the approach to the image domain by treating the input images as a sequence of patches. VITs have recently been considered for OCT processing, either by directly using VIT [43] or by combining CNN and transformers in hybrid models [44] when available data is limited. The applications of transformers in morphological OCT imaging range from disease classification [45] or retinal layer segmentation [46] to noise and artefact reduction [47]. However, earlier approaches of transformers in OCT exclusively followed patch-based sequencing for the self-attention mechanism [43–47], although the data obtained from OCT is already sequential in nature. Thus, the inherently temporal sequence of A-scans is reconstructed into 2D images and then resampled into patches to obtain sequence inputs.
In this work, we directly consider A-scan sequences for elasticity estimation. Compared to B-scan sequences or image patches, this approach decouples temporal and spatial dependencies while retaining the raw input representation typical for OCT. By leveraging transformers, which excel at processing sequential data, we enable the model to learn spatial features of wave propagation and accurately reconstruct sample elasticity. We consider an experimental setup with high-speed OCT imaging and a modified surgical tool (Fig. 2) that enables data acquisition for training transformer encoders without transfer learning. We use tissue-mimicking phantoms with known properties determined by mechanical testing to calibrate our setup for end-to-end Young’s modulus reconstruction. We consider transformers for processing the spatio-temporal OCE data and optimize data sequencing and model architecture. We compare our method with conventional wave velocity estimation and subsequent elasticity reconstruction similar to [31,33] and the previously considered learning-based approaches DenseNets [25,36,37] and VP-NET [38]. We train our models on homogeneous phantoms and validate our approach on heterogeneous phantoms with stiff inclusions simulating lesions and ex-vivo tissues.
Fig. 2. Experimental setup for OCE with a modified surgical tool and a robotic setup for data acquisition. (a-b) We excite waves in the sample via a modified surgical instrument with integrated piezoelectric elements at the proximal end of the tool. High voltage components are therefore located outside the patient and the waves excited at the proximal end propagate along the instrument into the tissue. (c) Data acquisition in a laparoscopic trainer is demonstrated, illustrating wave excitation via the surgical instrument and simultaneous OCE B-M-mode scanning.
2. Methods
In the following, we introduce our OCE setup and describe the data acquisition. We then present model architectures and illustrate data processing.
2.1. OCE with modified surgical tool
Figure 2 illustrates the configuration used for the experiment. We employ a high-speed swept-source system (SS-OCT, OMES, Optores, Germany) operating at a temporal scan rate of 1.5 MHz. The central wavelength is 1315 nm. The system has an axial resolution of 15 µm in air and the lateral resolution is specified as 50 µm at a focal length of 100 mm. Note that in our setup the working distance is 300 mm. We use B-M-mode scanning to capture the wave propagation both spatially and temporally. We use oscillating resonant mirrors to record B-scans with a temporal resolution of 11.4 kHz. We disregard A-scans during the pivot points and flip the lateral axis of every other B-scan. The field of view (FOV) is 3.5 mm and each B-scan is resolved over 118 sequential A-scans. For each OCE measurement, we acquire 30×10³ A-scans that correspond to 208 cross-sectional images. For elasticity estimation, we then consider the phase between A-scans at identical spatial locations over time. B-M-mode scanning results in a spatio-temporal 3D representation of the propagating shear waves (Fig. 3). We crop each A-scan at a depth of 256 pixels to obtain a size of (208 × 118) × 256 pixels for the temporal and axial dimension, respectively. Simultaneously with OCT data acquisition, we excite shear waves on the sample surface with a modified surgical tool (Fig. 2), as described in [25]. The modified tool is mounted on the end effector of a six-axis serial robot (IRB120, ABB, Switzerland) for automatic data acquisition. We excite the piezoelectric actuators in our tool at 1000 Hz.
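The rearrangement of the raw A-scan stream into a B-M-scan and the evaluation of the phase between A-scans at identical lateral positions can be sketched as follows; this is a simplified NumPy illustration with a hypothetical helper name, ignoring details such as the disregarded pivot A-scans:

```python
import numpy as np

def phase_between_bscans(ascans, n_lat=118):
    """Rearrange a raw temporal A-scan sequence (n_total, depth) into a
    B-M-scan and compute the phase difference between A-scans at
    identical lateral positions in consecutive B-scans."""
    n_b = ascans.shape[0] // n_lat
    bm = ascans[: n_b * n_lat].reshape(n_b, n_lat, -1).copy()  # (time, lateral, depth)
    bm[1::2] = bm[1::2, ::-1]  # flip the lateral axis of every other B-scan
    # phase of the complex product with the conjugate of the previous frame
    return np.angle(bm[1:] * np.conj(bm[:-1]))

# toy example: 3 B-scans of 118 complex A-scans with 256 depth pixels
rng = np.random.default_rng(0)
data = rng.standard_normal((3 * 118, 256)) + 1j * rng.standard_normal((3 * 118, 256))
print(phase_between_bscans(data).shape)  # (2, 118, 256)
```

The axial displacement between frames then follows from this phase difference via the central wavelength of the phase-sensitive OCT system.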
Fig. 3. Overview of the data representations used for processing OCE data. An example recorded with our experimental setup is shown as a raw sequence of A-scans containing phase data (a). A-scans are rearranged according to their spatial position to represent two-dimensional B-scans (b). The B-scans recorded over time (B-M-scan) are partially visualized in 3D (c). VP-NET [38] was proposed for direct processing of individual B-scans, while conventional processing and DenseNets [36] process spatio-temporal 3D OCE data. In contrast, by directly processing the sequence of individual A-scans ($\in \mathbb {R}^{1 \times h}$) in our attention-based approach (d), we retain the purely temporal input sequence and learn spatial dependencies during training. The transformer encoder with N layers and a fully connected regression head is trained for end-to-end elasticity reconstruction from A-scan sequences.
2.2. Transformers for sequential OCE data
Transformers were originally proposed in [41] and later transferred to the natural image domain with VIT [42]. Compared to recurrent neural networks and CNNs, transformers allow capturing long-term dependencies without an inductive bias that limits attention to local neighbourhoods. This capability enables flexible feature representation and provides a global receptive field, even in shallow layers. Moreover, given sufficient data, transformers can learn advantageous properties similar to those of CNNs, such as local information aggregation in lower layers [48]. The basic concept of transformers is the scaled dot-product attention

$$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_{k}}}\right)V \tag{1}$$

with $Q,K,V$ as the query, key and value matrices and $d_{k}$ as the dimension of the keys. The sequences $Q,K,V$ are obtained from the input sequence $x$ via learnable weight matrices $W_Q,W_K,W_V$. OCT and thus OCE data is originally a sequence of A-scans $\mathbb {R}^{(w t)\times d}$, with $w$ and $d$ denoting lateral and axial dimension. We therefore consider the direct use of the sequence of A-scans ($\mathbb {R}^{1 \times d}$) as input $x$ (Fig. 3). This corresponds to a purely temporal sequence of inputs and the spatial information must be learned by the model during training. For a comprehensive investigation, we additionally consider transformers for OCE with patch-based sequencing similar to VITs [42]. Here, the sequence $x$ is sampled from flattened patches (size $p \times p \times p$) of the reconstructed image $\in \mathbb {R}^{w \times d \times t}$. In both cases, the sequence passed through the multi-headed self-attention layers undergoes average-pooling over the encoder output sequence before a fully connected layer with a single output as our regression head.

2.2.1. Position encoding
For our transformer-based approach, position encoding (PE) is required to retain the information of the sequence order during self-attention. However, in the case of OCE, we obtain input data that is different from other imaging modalities due to its spatio-temporal nature. Here, we want to reconstruct the elasticity from the observed waves propagating through the tissue and thus through the sequence of A-scans. We therefore specifically consider the relative PE to provide the model with relative distances between the A-scans in the sequence. Overall, we investigate three different approaches for encoding the position of the spatio-temporal OCE data.
Relative PE enables effective training even for long token sequences and is well suited to capture the spatio-temporal nature of the OCE data. We therefore consider rotary position embeddings (RoPE) [49], which have recently been proposed as a powerful and efficient implementation of relative PE. In contrast to the learned absolute embeddings used for VIT [42], relative embeddings encode the distance between tokens rather than their absolute positions. Features of any query or key are paired and considered as 2D coordinates. They are then rotated by an angle $\theta$ specific to that pair and depending on the position within the sequence, e.g. for two features $x^{1}_m$ and $x^{2}_m$ at position $m$

$$\begin{pmatrix} \hat{x}^{1}_m \\ \hat{x}^{2}_m \end{pmatrix} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x^{1}_m \\ x^{2}_m \end{pmatrix}.$$

Applied to the dot-product for a pair of features at positions $m$ and $n$, the two rotations combine into a single rotation by the angle difference,

$$\hat{x}^{\top}_m \hat{x}_n = x^{\top}_m \begin{pmatrix} \cos (m-n)\theta & \sin (m-n)\theta \\ -\sin (m-n)\theta & \cos (m-n)\theta \end{pmatrix} x_n.$$

RoPE therefore offers PE relative to the distance $m-n$. RoPE could provide better performance than the simple relative PE investigated for VIT [42], which did not lead to performance improvements over learned embeddings.
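That the rotated dot product depends only on the relative distance $m-n$ can be checked numerically; a minimal 2D sketch with a single rotation angle (the full implementation uses a different $\theta$ per feature pair):

```python
import numpy as np

def rope(x, m, theta):
    """Rotate a 2-D feature pair of a token at sequence position m."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([[c, -s], [s, c]]) @ x

theta = 0.1
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# the rotated dot product depends only on the relative distance m - n
d1 = rope(q, 5, theta) @ rope(k, 2, theta)   # m - n = 3
d2 = rope(q, 10, theta) @ rope(k, 7, theta)  # m - n = 3
print(np.isclose(d1, d2))  # True
```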
Sinusoidal PE were originally considered for the transformer architecture [41]. The positions correspond to sinusoids of increasing frequency that are added directly to the embedded features, according to

$$PE_{(m,2i)} = \sin\left(\frac{m}{10000^{2i/d_{model}}}\right), \qquad PE_{(m,2i+1)} = \cos\left(\frac{m}{10000^{2i/d_{model}}}\right),$$

where $m$ and $i$ denote the position and the dimension of the token, respectively.

Learnable PE were used for VIT [42]. The embedding is a linear 1D mapping for each input token that is learned during backpropagation. The PE is added to each token after the initial embedding layer and the dimension of the trainable vector is given by the maximum sequence length and the dimension of the embedded tokens.
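As one concrete example, the sinusoidal variant can be generated in a few lines; a minimal sketch assuming a sequence of 1880 tokens embedded to 768 dimensions:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """Sinusoidal position encoding as in the original transformer."""
    m = np.arange(seq_len)[:, None]          # token positions
    i = np.arange(d_model // 2)[None, :]     # feature-pair index
    angles = m / 10000 ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(seq_len=1880, d_model=768)
print(pe.shape)  # (1880, 768)
```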
2.3. Conventional OCE processing
In addition to the CNN baselines, we consider the conventional estimation of the wave velocity and the subsequent reconstruction of the elasticity similar to [31,33]. To do this, we first reconstruct the wave velocity using an FE [4,50]. We reconstruct space-time maps for each OCE measurement by averaging the B-M-scan along the depth axis and applying the Fourier transform to obtain a $k$-space representation. Compared to the learning-based approaches, this additionally requires the evaluation of OCT intensity data for surface smoothing. Surface points are detected based on thresholds and the depth is subsequently clipped to compensate for unevenness. Next, we determine the phase velocity

$$c_{ph}(\omega) = \frac{\omega}{k_{peak}(\omega)},$$

where $\omega$ is the angular frequency and $k_{peak}(\omega )$ is the wave number with the highest amplitude. Assuming that the measured signal corresponds to Rayleigh surface waves, the bulk shear wave velocity can be calculated according to

$$c_{S} = c_{ph}\,\frac{1+\nu}{0.87 + 1.12\,\nu}$$

for a homogeneous, isotropic, linear-elastic and nearly incompressible sample [4,31]. Once we have obtained the bulk wave velocity, we can reconstruct Young’s modulus with

$$E = 2\,\rho\,(1+\nu)\,c_{S}^{2}.$$

The Poisson’s ratio for all gelatin gels and soft tissue samples is assumed to be $\nu =0.5$. We assume a density of $\rho = {1020}\;\textrm{kg m}^{-3}$ for all gelatin samples, similar to [51]. Estimated velocities for $c_{S}$ below ${1}\;\textrm{m s}^{-1}$ and above ${10}\;\textrm{m s}^{-1}$ are disregarded as they do not correspond to the range of typically observed velocities for soft tissue [52]. We adapt the processing to the data acquisition protocol and reduce the image noise with a bandpass filter around the known excitation frequency. We obtain an improved velocity reconstruction if we disregard the 10th percentile of pixels with the lowest amplitude.
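The final steps of this conventional reconstruction can be sketched as follows; a minimal illustration assuming the Rayleigh wave relation $c_{S} = c_{ph}(1+\nu)/(0.87+1.12\,\nu)$ and the parameter values given in the text (the function name is hypothetical):

```python
RHO = 1020.0  # assumed sample density in kg/m^3
NU = 0.5      # Poisson's ratio for nearly incompressible samples

def youngs_modulus_from_phase_velocity(c_ph):
    """Convert a Rayleigh-wave phase velocity (m/s) into Young's
    modulus (Pa); returns None outside the plausible velocity range."""
    c_s = c_ph * (1 + NU) / (0.87 + 1.12 * NU)  # bulk shear wave velocity
    if not 1.0 < c_s < 10.0:  # typical range for soft tissue
        return None
    return 2 * RHO * (1 + NU) * c_s**2  # E = 2*rho*(1+nu)*c_s^2

E = youngs_modulus_from_phase_velocity(c_ph=2.5)
print(round(E * 1e-3, 1), "kPa")  # prints "21.0 kPa"
```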
2.4. Data sets
To generate sufficient training data with known elastic properties, we consider tissue-mimicking phantoms. We prepare gelatin phantoms with a weight ratio of 5 %, 10 %, 15 % and 20 % of gelatin to water. We obtain Young’s modulus as a ground truth for training by uniaxial compression tests on cylindrical gelatin samples. We obtain 17 kPa, 56 kPa, 97 kPa and 139 kPa for the four different gelatin concentrations, respectively. We prepare five phantoms per elasticity and perform OCE measurements at 25 different positions on each phantom. We select one independent phantom from each gelatin concentration as a test set and perform cross-validation based on the remaining phantoms during training.
In addition to the main data set used in training, we obtain data for further validation of our models on unseen samples (Fig. 4). First, we consider the generalization from homogeneous to heterogeneous samples with three phantoms that simulate stiff lesions in soft surroundings. The three phantoms are generated by embedding stiff cylindrical inclusions in 5 % gelatin, corresponding to a Young’s modulus of 17 kPa. We consider one inclusion with an elastic modulus of 56 kPa and a diameter of 29 mm and two inclusions with 97 kPa and 18 mm for their Young’s modulus and diameter, respectively. We systematically acquire measurements at different locations in a grid and visualize elastography maps of the heterogeneous phantoms. Each location is spaced 4 mm apart in both lateral directions, but we only partially image the larger inclusion. We therefore obtain 54 data points over a range of 20 mm × 32 mm for the 29 mm inclusion phantom and 81 measurements over a range of 32 mm × 32 mm for the two 18 mm inclusions.
Fig. 4. In addition to the homogeneous training data, we test the generalization of our models to heterogeneous phantoms with stiff inclusions (a). Finally, we evaluate our models on ex-vivo human soft tissue. Samples from one body donor are shown for heart, kidney and liver tissue (b-d). Data acquisition for inclusion phantoms and tissue samples is conducted analogous to homogeneous phantoms (Fig. 2).
Second, we consider fresh post-mortem heart, kidney and liver tissue. One sample of each organ is obtained from two body donors as examples of solid organs and we acquire four measurements per sample (Fig. 4). We are unable to obtain gold standard compression measurements for the organs and instead observe the palpation force during indentation with the tool. We use the robot to drive the tool 4 mm into the sample and record force data with a high-resolution force sensor (Nano 43, ATI, USA). We then qualitatively compare the maximum observed palpation force with the reconstruction of the elastic modulus by OCE.
For both additional data sets, the OCE data acquisition follows the same methodology as for the training data and the predictions on the additional test sets are based on the ensemble of cross-validation models. Note that we only train on homogeneous gelatin data.
2.4.1. Surface data augmentation
The homogeneous training data show predominantly flat surfaces, which leads to a purely horizontal wave propagation. To account for more complex surface topographies during testing and to generalize better, e.g. for the heterogeneous phantoms and tissue samples, we therefore employ data augmentation during training to simulate different tissue surfaces. We randomly deform the homogeneous sample surface based on a logistic function

$$f(x) = \frac{L}{1 + e^{-k\,(x - L_{0})}},$$

with which we shift the depth of the A-scans over the lateral dimension $x$. $L$ and $L_0$ are randomly chosen so that the maximum shift is between 0 and 50 pixels and the slope $k$ is randomly sampled between 0 and 4. Examples of the resulting surface augmentation can be found in Fig. 5. Additionally, we employ spatial and temporal flipping and the addition of shot and speckle noise during data augmentation. The same data pre-processing and augmentation steps are used for all learning-based approaches.

Fig. 5. Example OCE sequence corresponding to a single B-scan from the homogeneous training data (a) and two random augmentations of the same measurement (b-c). The intensity data is also shown on top for better visualization, in addition to the phase data (below) used for wave-based OCE. OCE data is standardized during pre-processing.
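A possible implementation of this surface augmentation is sketched below; the interpretation of $L_0$ as the lateral midpoint of the logistic slope is our assumption, and the helper names are hypothetical:

```python
import numpy as np

def random_surface_shift(n_lateral=118, rng=np.random.default_rng()):
    """Random logistic depth shift per lateral position (parameter
    ranges follow the text; L0 as lateral midpoint is an assumption)."""
    L = rng.uniform(0, 50)          # maximum depth shift in pixels
    L0 = rng.uniform(0, n_lateral)  # lateral midpoint of the slope
    k = rng.uniform(0, 4)           # slope of the logistic function
    x = np.arange(n_lateral)
    z = np.clip(k * (x - L0), -60, 60)  # avoid overflow in exp
    return np.round(L / (1 + np.exp(-z))).astype(int)

def augment_surface(bscan, rng=np.random.default_rng()):
    """Shift each A-scan of a (lateral, depth) B-scan along depth."""
    shifts = random_surface_shift(bscan.shape[0], rng)
    out = np.zeros_like(bscan)
    for i, s in enumerate(shifts):
        out[i, s:] = bscan[i, : bscan.shape[1] - s]
    return out

aug = augment_surface(np.random.default_rng(0).standard_normal((118, 256)))
print(aug.shape)  # (118, 256)
```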
2.5. Implementation details
We train all models for 100 epochs using the mean squared error (MSE) loss and the Adam optimizer [53]. We tune hyperparameters based on the validation set. We train the 2D VP-NETs with a batch size of 128. We train the DenseNet variants and our approach with a batch size of 16. We achieve the best performance with a learning rate of 2×10⁻⁴ for the DenseNet variants and VP-NET. For both CNN-based approaches, we follow a continuous reduction of the learning rate when the validation error reaches a plateau. For our transformer-based approach, we achieve the best results with a learning rate of 1×10⁻⁴, but with a linear warm-up of the learning rate over 10 % of the training steps, followed by cosine annealing. We also adjust the beta values to 0.9 and 0.95 for Adam and employ gradient clipping. We optimize embedding dimension, depth and width of the encoder and obtain the best results with a reasonable parameter size for an embedding dimension of 768, 12 layers and 12 heads. For patch sequencing, we consider cubic patches of size $p \times p \times p$ with $p=8$ and reduce the lateral dimension to the closest dimension divisible by $p$. The architecture of the VP-NET model is implemented according to [38]. Here, however, VP-NET is also trained to predict the elastic modulus directly, analogous to all other approaches considered, instead of only the intermediate phase velocity obtained by TOF processing [38]. VP-NET was originally proposed for slightly larger images ($320 \times 320$ pixels) and we observed better performance for VP-NET without the max-pooling layers designed for decreasing image dimensions. 3D CNNs for spatio-temporal processing are implemented according to [36]. Since the number of parameters in CNN and transformer models is not directly comparable, we additionally consider two larger DenseNet variants to show that the CNN performance is not limited by the model size.
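The learning rate schedule for the transformer training (linear warm-up over 10 % of the steps, then cosine annealing) can be sketched as follows; the total step count is a hypothetical example:

```python
import math

def lr_schedule(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    """Linear warm-up over the first warmup_frac of steps,
    followed by cosine annealing of the learning rate."""
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# learning rate at the end of warm-up reaches the base value
print(lr_schedule(100, total_steps=1000))
```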
We refer to the approach in [36] with 4 DenseNet blocks per layer, 32 initial features and a growth rate of 5 as Dense-S. For Dense-M and Dense-L, the growth rate is increased to 6 and the blocks per layer to [3,6,9,4] and [4,8,12,6], respectively. Dense-L also receives 48 features in the first layer. Similarly, we additionally consider VP-NET-L as proposed in [38] to investigate the model performance with increased model capacity. To keep the input sizes for the 3D data manageable, we do not process each measurement corresponding to 208 consecutive B-scans as a single input. Instead, we divide the inputs into shorter sequences with a sliding window corresponding to $t$ frames. We examine values between $t=1$ and $t=32$, which corresponds to a sequence length of up to 3776 A-scans. Based on these experiments, we choose a compromise between computational effort and model performance. For the 3D DenseNet variants, we use the same optimized sliding window size, while for 2D VP-NET $t=1$. For model tests, we take the median of the predictions obtained over the sliding window for each measurement. To account for the long token sequences in our approach, we use flash attention [54], which computes the $O(n^2)$ self-attention (Eq. (1)) exactly with a fast and memory-efficient tiled implementation. All models are implemented in PyTorch v2.2 and trained on an NVIDIA RTX 4090 graphics card. Our implementation is based on the x-transformer library (available at https://github.com/lucidrains/x-transformers). To evaluate model performance, we report the root mean square error (RMSE) and the mean absolute error (MAE). In addition, we consider the mean absolute percentage error (MAPE) and the $R^2$ value. For each learning-based approach, we additionally report the parameter count, required computational effort and inference time per sample.
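The sliding-window inference with a median over per-window predictions can be sketched as follows; the window size of 1880 A-scans follows the text, while the non-overlapping stride and the toy model standing in for a trained network are assumptions:

```python
import torch

def predict_measurement(model, ascans, window=1880, stride=1880):
    """Median over sliding-window predictions for one OCE measurement;
    `model` maps an A-scan sequence to a single elasticity value."""
    preds = []
    with torch.no_grad():
        for start in range(0, ascans.shape[0] - window + 1, stride):
            preds.append(model(ascans[start : start + window]))
    return torch.median(torch.stack(preds))

# toy model: the mean over all inputs stands in for a trained network
toy = lambda x: x.mean()
# one measurement: 208 B-scans of 118 A-scans with 256 depth pixels
seq = torch.arange(208 * 118, dtype=torch.float32).unsqueeze(-1).repeat(1, 256)
print(predict_measurement(toy, seq).shape)  # torch.Size([])
```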
We employ the MLXtend permutation test [55] with 10000 permutations and a significance level of $\alpha =0.05$ to test for statistically significant differences. When testing generalization to heterogeneous phantoms, we additionally consider the detection of stiff inclusions as a classification task and look at model performance in terms of the area under the receiver operating characteristic (AUROC) and the area under the precision-recall curve (AUPRC).
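The permutation test can be illustrated with a minimal NumPy stand-in for the MLXtend implementation used here; the error values below are synthetic examples, not our measured results:

```python
import numpy as np

def permutation_test(x, y, num_rounds=10000, seed=0):
    """Two-sided permutation test on the difference of means
    (minimal NumPy stand-in for the MLXtend test)."""
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(num_rounds):
        perm = rng.permutation(pooled)
        diff = abs(perm[: len(x)].mean() - perm[len(x):].mean())
        count += diff >= observed
    return (count + 1) / (num_rounds + 1)

rng = np.random.default_rng(0)
errs_a = rng.normal(1.6, 0.5, 100)  # synthetic model A errors (kPa)
errs_b = rng.normal(5.5, 1.0, 100)  # synthetic model B errors (kPa)
print(permutation_test(errs_a, errs_b) < 0.05)  # True
```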
3. Results
We first optimize and evaluate our approach using our main training data, which contains homogeneous phantoms. Then, we additionally test the model generalization with heterogeneous phantoms containing stiff inclusions and report the estimation of the elastic modulus for human tissue samples.
3.1. Homogeneous phantoms
The error metrics for the test set are shown in Table 1 for all baselines and our transformer-based approach. The predictions per material are shown in Fig. 6 together with the reference measurements. Dense-M, Dense-L and VP-NET-L are omitted and only a single variant of each approach is shown to ensure good visibility. Our transformer-based approach drastically outperforms the baselines in terms of all evaluation metrics and provides a more accurate and robust elasticity reconstruction.
Fig. 6. Results for test set of homogeneous gelatin data for all baselines and our proposed approach. Predicted elasticity is plotted over the different phantoms with reference ground truth measurements. For clarity, only one variant of DenseNet [36] and VP-NET [38] are shown. Differences in predictions for different variants of each approach are also not statistically significant.
Table 1. Error metrics with deviations over cross-validation folds for CNN and transformer architectures. * Conventional FE processing fails in 15.4 % of measurements resulting in unrealistically high values that are excluded here. Best results are marked in bold.
Our approach is optimized with respect to A-scan sequence length and PE. Performance for different temporal window sizes with increasing A-scan sequence length up to the equivalent of $t=32$ consecutive B-scans is shown in Fig. 7. We observe performance improvements with increasing sequence length. However, the required computational load per sample also increases with increasing sequence length, even if the more efficient flash attention is used. We therefore choose a sequence length of 1880 A-scans as a reasonable compromise between model performance and required floating point operations (FLOPs) for our approach. The resulting configuration corresponds to a computational load of 162×10⁹ FLOPs per sample. In comparison, the required computations per sample for the DenseNet variants are between 32×10⁹ and 47×10⁹ FLOPs for Dense-S and Dense-L, respectively. VP-NET considers predictions based on a single 2D B-scan and therefore only corresponds to 55×10⁶ and 153×10⁶ FLOPs for VP-NET and VP-NET-L, respectively. Regarding the inference time per sample, we observe 6.71 ms to 10.61 ms, 0.71 ms to 0.75 ms and 16.5 ms for DenseNet variants, VP-NET variants and our approach, respectively.
Fig. 7. RMSE and MAE plotted over A-scan sequence length where 118 A-scans are equivalent to a single B-scan. Model performance increases with sequence length but computational demand and memory requirements simultaneously increase.
Besides optimizing data processing, we also investigate our approach with different PE schemes. The resulting high errors for sinusoidal or learnable embeddings (Table 2) show that the attention-based approach is only able to effectively process the long A-scan sequences when RoPE is used. Training our approach with RoPE but without the surface data augmentation leads to an increased RMSE of 7.78(261) kPa and MAE of 2.48(130) kPa.
Table 2. Error metrics for alternatives we consider during optimization of transformer-based approach. We report errors for training with learnable and sinusoidal encodings instead of RoPE, with RoPE but without surface augmentation, and lastly with RoPE but for VIT-based patch sequencing instead of directly sequencing A-scans.
Finally, we also investigate patch sequencing, where 2D+t volumetric data are processed similar to the 3D DenseNets, but the patches are flattened into embeddings, as originally proposed for VIT [42]. Patch sequencing leads to errors of 3.34(20) kPa and 2.10(4) kPa for RMSE and MAE, respectively. Our approach, which directly processes A-scan sequences, achieves the overall lowest RMSE of 2.49(137) kPa and MAE of 1.64(89) kPa. The model predictions for our approach based on direct sequencing of A-scans and for patch-based sequencing are statistically different ($p \ll \alpha$). The differences in predictions between smaller and larger variants were not statistically significant for Dense-S, Dense-M and Dense-L ($p>0.1$) and also not between VP-NET and VP-NET-L. For the remaining analyses, we therefore only report the model performance for one variant. The differences between our approach and all baselines are statistically significant ($p \ll \alpha$).
3.2. Inclusion phantoms
Next, we compare the generalization of our approach and all baselines on heterogeneous phantoms containing stiff inclusions. The predictions of the different approaches (trained on homogeneous phantom data, except for FE) are shown in Fig. 8. The MAE for the simulated lesions is 13.27(2016) kPa, 24.76(1159) kPa, 28.97(2046) kPa and 7.44(1443) kPa for FE, Dense-S, VP-NET and our approach, respectively. FE leads to unrealistically high values for two measurements. The visualized predictions show that the two CNN-based baselines have difficulty generalizing to the heterogeneous samples and highlighting the simulated lesions. VP-NET overestimates the elastic modulus near the edges of the inclusions. Overall, the CNN-based approaches yield poor contrast between the soft surroundings and the stiff inclusions. In comparison, our approach handles the heterogeneous phantom data more reliably and can effectively discriminate the simulated lesions. This becomes even clearer when we consider the AUROC and AUPRC for inclusion detection (Fig. 9). Our approach outperforms all baselines with the highest AUROC and AUPRC values of 0.997 and 0.991, respectively. While FE and Dense-S lead to similar performance, VP-NET shows the worst performance with values of 0.840 and 0.652 for AUROC and AUPRC, respectively.
Fig. 8. Model predictions for the three heterogeneous phantoms with stiff inclusions for FE (a), Dense-S (b), VP-NET (c) and our approach (d). The columns correspond to a 56 kPa inclusion with a diameter of 29 mm (left) and two 97 kPa inclusions with diameters of 18 mm (middle and right). Failed conventional FE processing is depicted in white.
Fig. 9. AUROC and AUPRC curves for the different approaches when discriminating simulated lesions from the soft surroundings in the heterogeneous phantoms.
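The two detection metrics used above can be computed without any plotting machinery; the following is a minimal numpy-only sketch (the toy scores and labels are invented for illustration, not data from the study).

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random inclusion score outranks a random
    background score (the Mann-Whitney U formulation of AUROC)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def auprc(scores, labels):
    """Average precision: precision averaged at the ranks of the true
    positives, a standard estimator of the area under the PR curve."""
    order = np.argsort(-scores, kind="stable")
    y = labels[order]
    precision = np.cumsum(y) / np.arange(1, y.size + 1)
    return float(precision[y == 1].mean())

# Toy imbalanced example: inclusion measurements score higher.
scores = np.array([0.9, 0.8, 0.75, 0.3, 0.2, 0.15, 0.1])
labels = np.array([1, 1, 0, 0, 0, 0, 0])
```

Under class imbalance, as in the inclusion data set, AUPRC penalizes false positives far more strongly than AUROC, which is why the gap between the approaches is most visible there.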
3.3. Body donor tissue samples
Finally, we evaluate the performance of our approach on ex-vivo tissue from two body donors. Elasticity estimates for heart, kidney and liver tissue samples are shown in Fig. 10, top. In addition, we qualitatively compare the maximum force observed during tissue indentation over a fixed distance with the Young’s modulus reconstruction of each approach. The comparison with this surrogate reference shows the highest correlation for our proposed approach (Fig. 10, bottom). Our method yields elasticity estimates of 54.96(3645) kPa, 29.87(530) kPa and 38.67(792) kPa for heart, kidney and liver tissue, respectively. The conventional FE-based reconstruction of the elastic modulus is highly unreliable and provides unrealistic values for 21 out of 24 measurements (87.5 %).
Fig. 10. Young’s modulus predictions for the tissue samples for all approaches (a-d). As we do not have reference measurements of elasticity for the organs, we palpate the tissue with the tool itself and measure the maximum contact force reached after indenting the tissue by 4 mm. This method does not provide ground-truth elasticity, but we investigate the correlation between predicted elasticity and indentation force (e-h).
4. Discussion
In this work, we consider transformers for OCE in conjunction with a wave-inducing surgical tool to enable robust and efficient quantitative palpation in a minimally invasive setting. We investigate end-to-end reconstruction of Young’s modulus from OCE phase data to address two challenges of conventional elasticity reconstruction. First, robust and reliable estimation of wave velocity with conventional estimators can be difficult, e.g., due to imperfect scanning, noise, or phase wrapping. This is emphasized in our experiments as we observe unrealistic estimates of over 50 m s⁻¹ for FE, particularly for noisy measurements of soft gelatin phantoms and tissue samples. We therefore exclude 15.4 % of the measurements for homogeneous phantoms. For our soft tissue samples, only 3 out of 24 measurements result in realistic estimates. The second challenge is to correctly model the observed surface wave in order to derive accurate biomechanical properties from the estimated wave velocity. The reconstructed elasticity is highly dependent on the appropriate model selection [30] and is also influenced by assumptions about material properties that may change along with the elastic properties, e.g., density or Poisson’s ratio (Eq. (8)). Here, we show that transformer architectures are better suited to handle the spatio-temporal A-scan sequences than previously considered learning-based approaches based on CNNs [25,36–38]. Our transformer-based approach achieves the best overall performance in our experiments.
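The model dependence of the conventional reconstruction can be made concrete with a short sketch. Eq. (8) is not reproduced in this excerpt; the Rayleigh-wave relation and the assumed density and Poisson's ratio below are common choices for near-incompressible soft tissue, used here purely for illustration.

```python
# Illustrative conversion of a measured surface wave speed to Young's
# modulus. Density rho and Poisson's ratio nu are assumed constants.
def youngs_modulus_kpa(c_surface_m_s, rho=1000.0, nu=0.5):
    """Rayleigh model: c_R ~= c_s * (0.87 + 1.12*nu) / (1 + nu) relates
    surface and shear wave speed; E = 2 * rho * (1 + nu) * c_s**2 then
    gives Young's modulus (returned in kPa)."""
    c_shear = c_surface_m_s * (1 + nu) / (0.87 + 1.12 * nu)
    return 2 * rho * (1 + nu) * c_shear**2 / 1000.0

# E scales with the square of the wave speed, so modest errors in the
# velocity estimate or in the wave model propagate strongly into the
# reconstructed modulus.
```

For example, a ~3 m/s surface wave lands in the tens-of-kPa range under these assumptions; treating the surface speed directly as the shear speed would bias the result, which is one reason model selection matters so much [30].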
Direct processing of A-scan sequences also outperforms patch-based sequencing (Table 2). In contrast to patch inputs for static, morphological OCT imaging [43,44,46], patch sequencing in functional OCE imaging combines spatial and temporal information, resulting in less accurate elasticity reconstruction. Therefore, it seems advantageous to leverage the inherently sequential OCT data and keep the relationship between the tokens purely temporal. The spatial dependency is fully learned, which could be particularly advantageous considering the non-uniform lateral sampling caused by the sinusoidal oscillation of the resonant mirrors [56].
In optimizing our approach, we find that RoPE is essential for effective processing of long A-scan sequences. We therefore see significant performance differences, in contrast to the PE comparisons performed for ViT [42]. Transformers trained with sinusoidal or learned PE lead to significantly larger errors even for homogeneous phantoms and are not able to outperform the previous CNN-based approaches. In contrast, RoPE allows the model to effectively learn spatial dependencies during training. Notably, RoPE is applied in each attention layer, which injects positional information beyond the initial input layer and enables faster and more robust convergence [49]. In addition to PE, we also optimize our approach with respect to the length of the A-scan sequence and observe performance improvements with longer sequences and thus a longer temporal window. However, this also increases memory and computational cost, and we therefore select the analyzed length of 1880 A-scans as a compromise between performance and memory requirements. The DenseNets contain more than an order of magnitude fewer parameters, but the required computational effort and time per sample are only slightly higher for our approach, e.g., inference times of 10.61 ms and 16.5 ms for Dense-L and ours, respectively. In contrast, VP-NET was specifically tailored for efficiency and only considers individual B-scans, which results in significantly lower inference times and required FLOPs. We observe better performance for VP-NET without max-pooling layers and therefore obtain larger models than in [38]. However, we find that only our attention-based approach scales effectively in terms of computational load and model capacity. This is particularly evident as the larger variants of the 3D DenseNet and 2D VP-NET do not significantly outperform their smaller counterparts.
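A minimal sketch of rotary position embedding (RoPE) illustrates why attention scores become relative-position aware: each even/odd feature pair of a query or key token is rotated by a position-dependent angle. The dimensions below are illustrative, not those of our model.

```python
import numpy as np

# Minimal RoPE sketch: rotate each feature pair of every token by an
# angle proportional to the token's position, with a per-pair frequency.
def rope(x, base=10000.0):
    """x: (seq_len, d) token features with even d; returns rotated copy."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    pos = np.arange(n)[:, None]                   # token index (time)
    freq = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    theta = pos * freq[None, :]                   # (n, d // 2) angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin            # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because RoPE only rotates, token norms are preserved, and for identical tokens the inner product after rotation depends only on the positional offset, i.e., the relative time between A-scans; applying it in every attention layer re-injects this information at each depth.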
Furthermore, model performance improves significantly with our surface augmentation scheme, which allows the model to learn the reconstruction of the elastic modulus regardless of surface structure and shape.
Even though the same data augmentation is used in training the VP-NETs and DenseNets, our approach shows superior performance when evaluating generalization from homogeneous to heterogeneous phantoms. It effectively discriminates the stiff inclusions from the soft surroundings both qualitatively (Fig. 8) and quantitatively (Fig. 9). We also achieve better delineation of the inclusion boundaries than the other learning-based approaches, where a B-scan may partially contain both the inclusion and the surrounding material, especially for VP-NET [38]. It is worth noting that the inclusion data are limited to 216 measurements, of which only 64 (29.6 %) contain simulated lesions. As the inclusion data set is therefore imbalanced, the differences between the approaches are particularly notable in the AUPRC curve. The noisy estimates of DenseNet and VP-NET lead to high false-positive rates even at low thresholds. As expected, the conventional FE processing generalizes relatively well to the heterogeneous setting, as it does not depend on the distribution of the training data. Like all baselines, our approach also shows decreased performance when transitioning from homogeneous to the more complex heterogeneous samples. Nevertheless, we achieve lower absolute errors and higher AUPRC and AUROC values with our proposed method.
Finally, we qualitatively evaluate the model predictions using soft tissue samples. Since we do not have gold-standard reference measurements of elasticity for the organs, we palpate the tissue with the tool itself and measure the maximum contact force as a surrogate for the elastic properties. With this method, we do not obtain ground-truth elasticity and can only examine the correlation between predicted elasticity and indentation force. Although our approach shows a higher correlation than the baselines, this comparison is limited because the indentation force strongly depends on the contact surface between tool and tissue as well as on the shape and stiffness of the tissue sample. Additionally, we note that our linear correlation analysis can only provide a rough approximation of the non-linear elastic behavior in tool-tissue interactions. Nevertheless, in contrast to the baselines, the elastic modulus values reconstructed with our approach lie within the reported ranges of 8 kPa to 55 kPa [57], 25 kPa to 40 kPa [58] and 8 kPa to 48 kPa [59] for heart, kidney and liver, respectively. Future comparisons could consider alternative elastography approaches for samples with unknown elastic properties, e.g., compression-based OCE.
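The surrogate analysis above reduces to a Pearson correlation between two short vectors; a minimal sketch follows, where all values are invented for illustration and are not measurements from the study.

```python
import numpy as np

# Sketch of the surrogate comparison: with no ground-truth modulus, only
# the linear (Pearson) correlation between predicted elasticity and peak
# indentation force is inspected. Values below are invented.
predicted_kpa = np.array([54.9, 29.8, 38.6, 61.2, 27.4, 41.0])
peak_force_n = np.array([1.9, 1.1, 1.4, 2.2, 1.0, 1.5])

r = np.corrcoef(predicted_kpa, peak_force_n)[0, 1]  # Pearson r in [-1, 1]
```

A rank correlation (e.g., Spearman) would relax the linearity assumption criticized above, since it only requires a monotonic relation between modulus and indentation force.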
The conducted experiments highlight the potential of our approach. Our model accurately learns both the spatial and temporal components of the wave field to estimate sample elasticity. In contrast, the FE approach shows less consistent predictions, particularly for stiffer samples (Fig. 6), suggesting spatial wave distortion effects at higher wave velocities. Modifications to the scanning setup could improve conventional processing, e.g., rotating polygon mirrors that enable faster switching and reduce pauses between successive B-scans. Our data-driven approach, however, seems to inherently account for such effects present in the training data, including noisy measurements, wave distortion, and asynchronous wave excitation and scanning. On the other hand, our learning-based approach offers limited explainability compared to conventional methods. Physics-based models typically facilitate a more transparent analysis of potential error sources and systematic biases, and explicit models may offer insights into limitations and uncertainties related to system and scanning requirements. It will be interesting to study how the explainability of our approach can be improved to better address the trade-off between the higher accuracy we observed for transformers and the better interpretability of conventional methods. The phantom experiments allow a quantitative comparison in a laboratory setting, but they are limited in the complexity of wave propagation and exhibit less wave distortion than anisotropic and heterogeneous soft tissue. The limited sample size and high standard deviation indicate that further validation with soft tissue samples will be required. For validation of intraoperative applicability, e.g., for the assessment of tumor margins based on elasticity [60], extensive evaluation and further fine-tuning with tissue data are required.
In contrast to the method proposed for VP-NET [38], where labels obtained by conventional TOF velocity estimation were used, we reconstruct the elastic modulus directly from the OCE phase data. We therefore train our model on reference measurements of the elastic modulus rather than on velocity labels obtained by conventional processing. This allows accurate elasticity reconstruction independent of the limitations of the conventional Fourier or TOF estimators. However, it also limits the acquisition of training data, as the ground-truth reference measurements must be performed together with the OCE. Further experiments should therefore attempt to combine the uncertain but easily obtainable conventional velocity estimates with the accurate ground-truth reference measurements, e.g., via soft labels or curriculum learning. We currently derive the Young’s modulus from the OCE phase data, as this is the most commonly used elastic property for comparing tissue stiffness. However, our learning-based procedure is not limited to this linearization, and further research should include the prediction of non-linear elastic properties.
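The suggested combination of noisy conventional estimates and sparse reference measurements could, for instance, take the form of a weighted loss. The weighting scheme and the NaN convention for missing references below are hypothetical choices for illustration, not part of the presented method.

```python
import numpy as np

# Hypothetical soft-label sketch: blend sparse indentation ground truth
# with abundant but noisy conventional velocity-derived labels.
def soft_label_loss(pred, gt, conventional, w_gt=0.8):
    """Weighted MSE: trust the reference modulus where available (gt may
    contain NaN for missing references) and otherwise lean on the
    conventional estimate."""
    has_gt = ~np.isnan(gt)
    loss_gt = float(np.mean((pred[has_gt] - gt[has_gt]) ** 2)) if has_gt.any() else 0.0
    loss_conv = float(np.mean((pred - conventional) ** 2))
    return w_gt * loss_gt + (1 - w_gt) * loss_conv
```

In a curriculum-learning variant, the weight on the noisy conventional labels could be annealed toward zero as training progresses.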
5. Conclusion and outlook
To summarize, the experiments performed show how our transformer-based approach can exploit the inherently sequential nature of OCE data by directly processing A-scan sequences. The approach allows us to decouple spatial and temporal information to better capture the dynamics of wave propagation. We demonstrate the potential of our approach for robust reconstruction of the elastic modulus in homogeneous and heterogeneous phantoms as well as in soft tissue samples. In combination with a miniaturized endoscopic OCT probe [6,61], our approach could be used for optical palpation in surgery or autopsy. Accurate intraoperative OCE could then give physicians back the ability to feel for changes in elastic properties during minimally invasive procedures. In addition to localizing pathological tissue, knowledge of biomechanical properties also enables better monitoring of tool-tissue interactions, e.g., in vision-based force estimation [62]. Finally, any OCT system inherently captures temporal sequences of A-scans. Further evaluation of our transformer-based approach should particularly explore tasks where conventional alternatives are limited, e.g., segmentation. Additionally, our methodology is not limited to a specific OCT system or scanning regime, although high-speed scanning is essential for visualizing wave propagation. It will be interesting to investigate whether the learning-based approach could also enable slower scanning speeds than the considered 1.5 MHz system. However, changes to any system parameters would require additional training data and fine-tuning of the model weights. Our approach of directly processing A-scan sequences can also be applied to any sequence length or scan protocol, e.g., M-mode and C-mode, or different scan rates.
Funding
Deutsche Forschungsgemeinschaft (Grant SCHL 1844/6-1); Technische Universität Hamburg (i^3 initiative); HORIZON EUROPE Framework Programme (grant agreement No. 101059903, EU Funds Investments 2021-2027); Technische Universität Hamburg (ICCIR); Universitätsklinikum Hamburg-Eppendorf (ICCIR); Technische Universität Hamburg (Funding Programme Open Access Publishing).
Acknowledgments
The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board of the Hamburg Chamber of Physicians (No.: 2020-10353-BO-ff). Informed consent was obtained from all subjects involved in the study by their legal representatives and next of kin.
Disclosures
The authors declare that there are no conflicts of interest related to this article.
Data availability
Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.
References
1. K. Hoyt, B. Castaneda, M. Zhang, et al., “Tissue elasticity properties as biomarkers for prostate cancer,” Cancer Biomarkers 4(4-5), 213–225 (2008). [CrossRef]
2. T. A. Krouskop, T. M. Wheeler, F. Kallel, et al., “Elastic moduli of breast and prostate tissues under compression,” Ultrason. Imaging 20(4), 260–274 (1998). [CrossRef]
3. B. F. Kennedy, K. M. Kennedy, and D. D. Sampson, “A review of optical coherence elastography: Fundamentals, techniques and prospects,” IEEE J. Sel. Top. Quantum Electron. 20(2), 272–288 (2014). [CrossRef]
4. F. Zvietcovich and K. V. Larin, “Wave-based optical coherence elastography: The 10-year perspective,” Prog. Biomed. Eng. 4(1), 012007 (2022). [CrossRef]
5. M. A. Kirby, I. Pelivanov, S. Song, et al., “Optical coherence elastography in ophthalmology,” J. Biomed. Opt. 22(12), 1–28 (2017). [CrossRef]
6. M. Lee, H. Bang, E. Lee, et al., “Imaging peritoneal blood vessels through optical coherence tomography angiography for laparoscopic surgery,” J. Biophotonics 17(1), e202300221 (2024). [CrossRef]
7. J. Walther, J. Golde, M. Albrecht, et al., “A handheld fiber-optic probe to enable optical coherence tomography of oral soft tissue,” IEEE Trans. Biomed. Eng. 69(7), 2276–2282 (2022). [CrossRef]
8. H. Pahlevaninezhad, M. Khorasaninejad, Y.-W. Huang, et al., “Nano-optic endoscope for high-resolution optical coherence tomography in vivo,” Nat. Photonics 12(9), 540–547 (2018). [CrossRef]
9. T. Zhang, S. Yuan, C. Xu, et al., “Pneumaoct: Pneumatic optical coherence tomography endoscopy for targeted distortion-free imaging in tortuous and narrow internal lumens,” Sci. Adv. 10(35), 1 (2024). [CrossRef]
10. J. Li, S. Thiele, B. C. Quirk, et al., “Ultrathin monolithic 3d printed optical coherence tomography endoscopy for preclinical and clinical use,” Light: Sci. Appl. 9(1), 124 (2020). [CrossRef]
11. A. M. D. Lee, L. Cahill, K. Liu, et al., “Wide-field in vivo oral oct imaging,” Biomed. Opt. Express 6(7), 2664–2674 (2015). [CrossRef]
12. M. Ourak, J. Smits, L. Esteveny, et al., “Combined oct distance and fbg force sensing cannulation needle for retinal vein cannulation: in vivo animal validation,” Int. J. Comput. Assist. Radiol. Surg. 14(2), 301–309 (2019). [CrossRef]
13. Q. Tang, C.-P. Liang, K. Wu, et al., “Real-time epidural anesthesia guidance using optical coherence tomography needle probe,” Quant. Imaging Med. Surg. 5(1), 118–124 (2015). [CrossRef]
14. R. Mieling, S. Latus, M. Fischer, et al., “Optical coherence elastography needle for biomechanical characterization of deep tissue,” in Med Image Comput Comput Assist Interv, (Springer, 2023), pp. 607–617.
15. J. Ormachea and K. J. Parker, “Elastography imaging: the 30 year perspective,” Phys. Med. Biol. 65(24), 24TR06 (2020). [CrossRef]
16. Q. Fang, B. Krajancich, L. Chin, et al., “Handheld probe for quantitative micro-elastography,” Biomed. Opt. Express 10(8), 4034–4049 (2019). [CrossRef]
17. X. Wang, Q. Wu, J. Chen, et al., “Development of a handheld compression optical coherence elastography probe with a disposable stress sensor,” Opt. Lett. 46(15), 3669 (2021). [CrossRef]
18. P. Gong, S. L. Chin, W. M. Allen, et al., “Quantitative micro-elastography enables in vivo detection of residual cancer in the surgical cavity during breast-conserving surgery,” Cancer Res. 82(21), 4093–4104 (2022). [CrossRef]
19. L. P. Hariri, G. T. Bonnema, K. Schmidt, et al., “Laparoscopic optical coherence tomography imaging of human ovarian cancer,” Gynecol. Oncol. 114(2), 188–194 (2009). [CrossRef]
20. S. Latus, S. Grube, T. Eixmann, et al., “A miniature dual-fiber probe for quantitative optical coherence elastography,” IEEE Trans. Biomed. Eng. 70(11), 3064–3072 (2023). [CrossRef]
21. Y. Qu, T. Ma, Y. He, et al., “Miniature probe for mapping mechanical properties of vascular lesions using acoustic radiation force optical coherence elastography,” Sci. Rep. 7(1), 4731 (2017). [CrossRef]
22. A. B. Karpiouk, D. J. VanderLaan, K. V. Larin, et al., “Integrated optical coherence tomography and multielement ultrasound transducer probe for shear wave elasticity imaging of moving tissues,” J. Biomed. Opt. 23(10), 1–7 (2018). [CrossRef]
23. H. Xu, Q. Xia, C. Shu, et al., “In vivo endoscopic optical coherence elastography based on a miniature probe,” Biomed. Opt. Express 15(7), 4237 (2024). [CrossRef]
24. M. Neidhardt, S. Latus, T. Eixmann, et al., “Deep learning for high speed optical coherence elastography with a fiber scanning endoscope,” IEEE Trans. Med. Imaging 44(3), 1445–1453 (2025). [CrossRef]
25. M. Neidhardt, R. Mieling, S. Latus, et al., “A modified da vinci surgical instrument for oce based elasticity estimation with deep learning,”.
26. S. Wang and K. V. Larin, “Noncontact depth-resolved micro-scale optical coherence elastography of the cornea,” Biomed. Opt. Express 5(11), 3807–3821 (2014). [CrossRef]
27. S. Song, Z. Huang, T.-M. Nguyen, et al., “Shear modulus imaging by direct visualization of propagating shear waves with phase-sensitive optical coherence tomography,” J. Biomed. Opt. 18(12), 1 (2013). [CrossRef]
28. Z. Han, M. Singh, S. R. Aglyamov, et al., “Quantifying tissue viscoelasticity using optical coherence elastography and the rayleigh wave model,” J. Biomed. Opt. 21(9), 090504 (2016). [CrossRef]
29. A. Ramier, B. Tavakol, and S.-H. Yun, “Measuring mechanical wave speed, dispersion, and viscoelastic modulus of the cornea using optical coherence elastography,” Opt. Express 27(12), 16635–16649 (2019). [CrossRef]
30. Z. Han, J. Li, M. Singh, et al., “Quantitative methods for reconstructing tissue biomechanical properties in optical coherence elastography: a comparison study,” Phys. Med. Biol. 60(9), 3531–3547 (2015). [CrossRef]
31. X. Feng, G.-Y. Li, and S.-H. Yun, “Ultra-wideband optical coherence elastography from acoustic to ultrasonic frequencies,” Nat. Commun. 14(1), 4949 (2023). [CrossRef]
32. G. Shi, Y. Zhang, Y. Wang, et al., “Quantitative evaluation of human lens and lens capsule elasticity by optical coherence elastography based on a rayleigh wave model,” J. Biophotonics 17(12), e202400322 (2024). [CrossRef]
33. A. Ramier, A. M. Eltony, Y. Chen, et al., “In vivo measurement of shear modulus of the human cornea using optical coherence elastography,” Sci. Rep. 10(1), 17366 (2020). [CrossRef]
34. C. S. Lee, A. J. Tyring, N. P. Deruyter, et al., “Deep-learning based, automated segmentation of macular edema in optical coherence tomography,” Biomed. Opt. Express 8(7), 3440–3448 (2017). [CrossRef]
35. A. P. Sunij, K. Saikat, S. Gayathri, et al., “Octnet: A lightweight cnn for retinal disease classification from optical coherence tomography images,” Comput. Methods Programs Biomed. 200, 105877 (2021). [CrossRef]
36. M. Neidhardt, M. Bengs, S. Latus, et al., “Deep learning for high speed optical coherence elastography,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), (IEEE, 2020), pp. 1583–1586.
37. M. Neidhardt, M. Bengs, S. Latus, et al., “4d deep learning for real-time volumetric optical coherence elastography,” Int. J. Comput. Assist. Radiol. Surg. 16(1), 23–27 (2021). [CrossRef]
38. Y. Zhang, J. Liao, Z. Feng, et al., “Vp-net: an end-to-end deep learning network for elastic wave velocity prediction in human skin in vivo using optical coherence elastography,” Front. Bioeng. Biotechnol. 12, 1 (2024). [CrossRef]
39. A. G. Howard, M. Zhu, B. Chen, et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv (2017).
40. J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 7132–7141.
41. A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in Neural Information Processing Systems 30 (2017).
42. A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” arXiv (2020). [CrossRef]
43. C. Playout, R. Duval, M. C. Boucher, et al., “Focused attention in transformers for interpretable classification of retinal images,” Med. Image Anal. 82, 102608 (2022). [CrossRef]
44. Z. Tan, F. Shi, Y. Zhou, et al., “A multi-scale fusion and transformer based registration guided speckle noise reduction for oct images,” IEEE Trans. Med. Imaging 43(1), 473–488 (2024). [CrossRef]
45. B. Ait Hammou, F. Antaki, M.-C. Boucher, et al., “Mbt: Model-based transformer for retinal optical coherence tomography image and video multi-classification,” Int. J. Med. Informatics 178, 105178 (2023). [CrossRef]
46. D. Philippi, K. Rothaus, and M. Castelli, “A vision transformer architecture for the automated segmentation of retinal lesions in spectral domain optical coherence tomography images,” Sci. Rep. 13(1), 517 (2023). [CrossRef]
47. G. Li, K. Wang, Y. Dai, et al., “Physics-based optical coherence tomography angiography (octa) image correction for shadow compensation,” IEEE Trans. Biomed. Eng. 72(3), 891–898 (2025). [CrossRef]
48. M. Raghu, T. Unterthiner, S. Kornblith, et al., “Do vision transformers see like convolutional neural networks?” Advances in Neural Information Processing Systems 34, 12116–12128 (2021).
49. J. Su, M. Ahmed, Y. Lu, et al., “Roformer: Enhanced transformer with rotary position embedding,” Neurocomputing 568, 127063 (2024). [CrossRef]
50. S. Beuve, L. Kritly, S. Callé, et al., “Diffuse shear wave spectroscopy for soft tissue viscoelastic characterization,” Ultrasonics 110, 106239 (2021). [CrossRef]
51. S. Pansino and B. Taisne, “Shear wave measurements of a gelatin’s young’s modulus,” Front. Earth Sci. 8, 1 (2020). [CrossRef]
52. M. Fink and M. Tanter, “Multiwave imaging and super resolution,” Phys. Today 63(2), 28–33 (2010). [CrossRef]
53. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv (2014). [CrossRef]
54. T. Dao, D. Fu, S. Ermon, et al., “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Adv. Neural Inf. Process. Syst. 35(1189), 16344–16359 (2022). [CrossRef]
55. S. Raschka, “Mlxtend: Providing machine learning and data science utilities and extensions to python’s scientific computing stack,” J. Open Source Softw. 3(24), 638 (2018). [CrossRef]
56. B. Kowalski, V. Akondi, and A. Dubra, “Correction of non-uniform angular velocity and sub-pixel jitter in optical scanning,” Opt. Express 30(1), 112–124 (2022). [CrossRef]
57. R. Emig, C. M. Zgierski-Johnston, V. Timmermann, et al., “Passive myocardial mechanical properties: meaning, measurement, models,” Biophys. Rev. 13(5), 587–610 (2021). [CrossRef]
58. D. Radulescu, I. Peride, L. C. Petcu, et al., “Supersonic shear wave ultrasonography for assessing tissue stiffness in native kidney,” Ultrasound Med. & Biol. 44(12), 2556–2568 (2018). [CrossRef]
59. A. Nava, E. Mazza, M. Furrer, et al., “In vivo mechanical characterization of human liver,” Med. Image Anal. 12(2), 203–216 (2008). [CrossRef]
60. D. W. Good, G. D. Stewart, S. Hammer, et al., “Elasticity as a biomarker for prostate cancer: A systematic review,” BJU Int. 113(4), 523–534 (2014). [CrossRef]
61. M. Neidhardt, S. Latus, T. Eixmann, et al., “Deep learning for high speed optical coherence elastography with a fiber scanning endoscope,” IEEE Trans. Med. Imaging 44(3), 1445–1453 (2025). [CrossRef]
62. M. Neidhardt, R. Mieling, M. Bengs, et al., “Optical force estimation for interactions between tool and soft tissues,” Sci. Rep. 13(1), 506 (2023). [CrossRef]