1 Introduction

Fig. 1 Compared to existing methods, our approach produces more realistic synthetic handwritten paragraphs in a specific style.

Advancements in handwritten text generation and imitation hold significant promise for preserving the personal qualities of handwriting, which health conditions or injuries may compromise (Bisio et al., 2016). These techniques function as a digital preservation mechanism, ensuring continuity of expression for individuals facing physical constraints or other kinds of restrictions. However, as with other deep learning paradigms, their effectiveness depends on the variety and size of suitable training data. Notably, current datasets present challenges, including biases towards certain writing styles, under-representation of languages, and limitations of common data augmentation for tasks such as Handwritten Text Recognition (HTR), Writer Identification (WI), and Visual Question Answering (VQA) in document contexts. Research to date has primarily concentrated on text generation at the word or line level due to the inherent complexities of processing larger coherent textual and visual entities. This focus, however, has led to shortcomings in producing consistent and realistic outputs at the paragraph level – a prerequisite for practical applicability in many real-world applications, such as personalized text messages or rendering writings in different languages.

Our study introduces a novel method for paragraph-level handwriting imitation that employs an adapted version of Latent Diffusion Models (LDMs) (Rombach et al., 2022). Diffusion models typically require large amounts of data and significant computational resources, especially when handling high-resolution inputs such as the \(768 \times 768\) images in our case. To address this, the LDM uses a Variational Autoencoder (VAE) to condense essential information into a more compact latent space, reducing computational demands while preserving crucial details. We enhance the encoder-decoder framework with style and content preservation loss terms, improving the fidelity and compression of the latent representation. Furthermore, we incorporate global positional information and cross-attention mechanisms within the Denoising U-Net architecture in latent space. These enhancements lead to more realistic paragraph generations. Evaluated as a zero-shot algorithm, our method demonstrates robustness and generalizability across previously unseen handwriting styles and writers, significantly outperforming existing methods in synthetic paragraph matching. The method achieves a top-1 score of over \(54\,\%\) when matching synthetic paragraphs with genuine data, nearly double the score of the second-best approach (Fig. 1).

In summary, our contributions to the field of handwriting generation and imitation include: (1) An end-to-end framework for imitating entire paragraphs of handwritten text. Our method preserves the individual’s unique writing style and maintains the original layout, representing a significant step forward in the fidelity of imitated generative handwriting. (2) A refined encoder-decoder stage incorporating specialized loss terms that target content and style preservation. We show that these auxiliary losses enhance the generation quality and the latent compression ratio. (3) An improved conditioning process that integrates the writing style with the target text and employs cross-attention to incorporate this combined information into the Denoising U-Net. (4) Ranked sampling: based on the variance within the sampling process, we introduce a ranking scheme that simultaneously considers content and style preservation. (5) Qualitative and quantitative analyses showing that our method surpasses current state-of-the-art imitation methods by a large margin, considering the combination of image generation, style preservation, and content preservation.

Fig. 2 Method overview. We transfer the handwritten paragraphs in and out of latent space via encoder \(\mathcal {E}\) and decoder \(\mathcal {D}\). The Denoising U-Net \(\epsilon _{\Theta }\) is trained in latent space and conditioned with cross-attention. As conditioning information, we have two inputs: (1) a style image \(x_{\text {style}}\), which we encode with a shallow CNN \(\mathcal {E}_{\text {style}}\), and (2) a target text \(x_{\text {text}}\), which we embed into feature space. We fuse both modalities with a transformer and forward them as a stylized embedding into the Denoising U-Net via cross-attention.

2 Related Work

In this work, we generate handwritten text solely from images, without relying on additional modalities such as the online trajectories used by other methods (Graves, 2014; Mayr et al., 2020; Aksan et al., 2018; Chang et al., 2022; Luhman & Luhman, 2020). Unlike online handwriting data, handwritten text images are widely available, offering broader application possibilities. Various strategies are employed at different levels of detail. Techniques commonly applied to Chinese handwriting or character-specific methods are infrequently used for cursive handwriting in Western scripts (Dai et al., 2023; Tang et al., 2022; Huang et al., 2022). GANwriting (Kang et al., 2020) generated images at the word level based on a few style samples; the authors later extended their approach to also work on full lines (Kang et al., 2021). As the name suggests, they use a Generative Adversarial Network (GAN). Like most approaches, the style samples and target texts are encoded initially. An upsampling generator produces the output image based on the concatenated style and text information, while AdaIN (Huang & Belongie, 2017) is used for guidance. The two default GAN losses (discriminator and generator loss) are extended by domain-specific writer and recognition losses. Similarly, ScrabbleGAN (Fogel et al., 2020) and TS-GAN (Davis et al., 2020) applied a GAN to generate text lines, but the former only used HTR feedback as an extra loss term, while the latter added a space predictor to space the text for the generator. SmartPatch (Mattick et al., 2021) and SLOGAN (Luo et al., 2023) added character feedback to improve the results on the stroke level. By contrast, HiGAN+ (Gan et al., 2022) applies a patch discriminator with a fixed grid of extracted patches but additionally regularizes the style by reconstructing the style vector, which is uniformly sampled, as in JokerGAN (Zdenek & Nakayama, 2021). With JokerGAN++ (Zdenek & Nakayama, 2023), they exchanged their style encoder for a ViT (Dosovitskiy et al., 2021). To further increase realism in the outputs, transformer models (Bhunia et al., 2021) as generators and visual archetypes (Pippi et al., 2023a) are applied.

Recent advancements in Diffusion Models (DMs) in the field of computer vision (Sohl-Dickstein et al., 2015; Song et al., 2021; Ho et al., 2020; Yang et al., 2023) have also influenced research in handwriting generation. This progress has rendered various customized loss terms obsolete (Ding et al., 2023; Zhu et al., 2023; Nikolaidou et al., 2023). Most of these diffusion methods are constrained to pre-existing writing styles since they incorporate the writer ID as a style input in their designs; hence, they cannot generalize to unseen styles. However, CTIG-DM (Zhu et al., 2023) differentiates between the style of the writer and the style of the image, with the latter primarily focusing on texture and colour. Interestingly, GC-DDPM (Ding et al., 2023) also incorporates visual archetypes into their approach for more stability, similar to VATr (Pippi et al., 2023a). Moreover, Riaz et al. (2023) applied diffusion models to German text data. While all of these diffusion-based methods operate at the word level, our approach directly generates entire paragraphs. This is achieved via an adapted LDM, as described in Section 3. Additionally, our ranked resampling method, detailed in Section 3.4.2, accounts for both content and style, whereas Ding et al.’s approach solely ranks based on character correctness.

3 Methodology

DMs (Sohl-Dickstein et al., 2015; Song et al., 2021; Ho et al., 2020; Yang et al., 2023) are ubiquitous for image generation, but their application to high-resolution images is data- and resource-intensive. To mitigate this, LDMs train a diffusion model in a compressed latent space, accessible from the pixel space with an encoder-decoder pair (Ramesh et al., 2021; Rombach et al., 2022). Further, despite impressive results on natural images, DMs often lack the capability to produce realistic-looking text. Only TextDiffuser (Chen et al., 2023) produces realistic scene text images, and it is mainly limited to fonts. Therefore, we apply several modifications, described below, to generate handwritten paragraphs.

Given a style image \(x_{\text {style}}\) and a target text \(x_{\text {text}}\), the task of handwritten text imitation can be described as producing an output image \(\tilde{x}\) that mimics the style while rendering the given content. For training, the target image x and \(x_{\text {style}}\) are from the same writer but, if possible, from different paragraphs. Figure 2 visualizes the building blocks used to solve this task, which are elaborated in the upcoming subsections.

3.1 Encoder-Decoder Stage

First, the translation into the latent representation is applied, i.e., \(z = \mathcal {E}(x)\), where \(\mathcal {E}\) denotes the encoder. The images \(\tilde{x} = \mathcal {D}(z)\) are reconstructed with decoder \(\mathcal {D}\). This step’s important property is condensing the image information into a compressed representation, typically accomplished by reducing the spatial dimensions. Since we are working with rather high resolutions (\(768\times 768\)) in combination with a small amount of non-synthetic training data (747 samples), this compression has to be very strong to obtain a well-behaving diffusion process. Wordstylist (Nikolaidou et al., 2023) applied a pre-trained model from Stable Diffusion for this task. Preliminary results showed that this does not scale to paragraphs (see Fig. 5 and Table 6). For a \(768\times 768\) input image, their compression method results in a feature matrix of shape \((4 \times 96 \times 96)\), where 4 is the feature dimension and \(96\times 96\) is the spatial dimension. We retrained the Kullback-Leibler (KL) regularized VAE (Kingma & Welling, 2014; Rezende et al., 2014) from LDM but with a smaller feature dimension in the latent space. To facilitate this increased compression rate, we extend the feedback of the paragraph-level reconstruction with pre-trained text recognition and writer-style models, leading to the latent shape \((1 \times 96 \times 96)\). This updates the overall loss term to:

$$\begin{aligned} \mathcal {L}_{\text {EDS}} = \mathcal {L}_{\text {rec}} + w_{\text {KL}} \cdot \mathcal {L}_{\text {KL}} + w_{\text {HTR}} \cdot \mathcal {L}_{\text {HTR}} + w_{\text {WI}} \cdot \mathcal {L}_{\text {WI}}, \end{aligned}$$
(1)

where \(\mathcal {L}_{\text {HTR}}\) denotes the loss term for text recognition and \(\mathcal {L}_{\text {WI}}\) the loss term for the style task. To balance and better align the different loss terms given their varying value ranges, we introduce the weightings \(w_{\text {KL}}\), \(w_{\text {HTR}}\), and \(w_{\text {WI}}\). \(\mathcal {L}_{\text {rec}}\) is the applied \(L_1\) reconstruction loss, and for regularization, the KL divergence is applied, denoted as \(\mathcal {L}_{\text {KL}}\), with the weighting \(w_{\text {KL}}\) set to the default value \(1 \cdot 10^{-6}\). Note that we removed the discriminator loss due to unpredictable training behaviors.
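
To make Eq. (1) concrete, the following minimal PyTorch-style sketch assembles the four terms. The vae, htr_model, and wi_model interfaces, as well as the weight values other than \(w_{\text {KL}}\), are illustrative assumptions and not the released implementation.

```python
import torch

def encoder_decoder_loss(x, text, writer_id, vae, htr_model, wi_model,
                         w_kl=1e-6, w_htr=1.0, w_wi=1.0):
    """Sketch of Eq. (1): L_EDS = L_rec + w_KL*L_KL + w_HTR*L_HTR + w_WI*L_WI."""
    posterior = vae.encode(x)          # assumed to return a diagonal Gaussian q(z|x)
    z = posterior.sample()
    x_rec = vae.decode(z)

    l_rec = torch.abs(x - x_rec).mean()            # L1 reconstruction loss
    l_kl = posterior.kl().mean()                   # KL regularization
    l_htr = htr_model.loss(x_rec, text)            # pre-trained text-recognition feedback
    l_wi = wi_model.loss(x_rec, writer_id)         # pre-trained writer-style feedback

    return l_rec + w_kl * l_kl + w_htr * l_htr + w_wi * l_wi
```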

The text recognizer, which is based on the approach by Kang et al. (2022), is applied to full paragraphs to ensure readable and unmodified content. The model combines a feature extractor, similar to \(\mathcal {E}\) but without shared weights, and a transformer model for encoding the features and producing the output predictions. Further, to apply the transformer model to the extracted features, we add adaptive two-dimensional positional encoding (Lee et al., 2020) to the feature encodings, similar to Kang et al. (2022). \(\mathcal {L}_{\text {HTR}}\) is computed using cross-entropy, following sequence-to-sequence HTR approaches (Kang et al., 2022; Wick et al., 2021). The writer ID is used for correctly matching the writing style. Due to the sparsity of writing styles, we use a Convolutional Neural Network (CNN) to predict the writer ID for increased data efficiency and reduced overfitting.

3.2 Diffusion Model

Diffusion Models are generative models that learn a data distribution p(x) by reversing a diffusion process. In this process, noise is gradually added to the image until only normally distributed noise remains. It is parametrized by a Markov chain of length T, the total number of time steps; at time steps close to T, the inputs are almost completely noised. Reversing that process with the Denoising U-Net from LDM (Rombach et al., 2022), \(\epsilon _{\Theta }(z_t,t,c)\), is done by gradually removing the noise, where \(z_t\) is the noised latent at time step t and c stands for the conditioning described in Section 3.3. We added adaptive two-dimensional positional encoding (Lee et al., 2020) after the first projection layer of the spatial transformers for more robust training runs. This idea is inspired by HTR methods (Kang et al., 2022), which apply positional encoding to input images. We hypothesise that, while sharing the same style, different regions and words in the image are somewhat independent of each other. Preliminary experiments showed that without this extension, the model struggled to reconstruct entire paragraphs, instead producing a jigsaw-like arrangement of strokes and lines. The objective of training the model parameters \(\Theta \) is defined as \(\mathcal {L}_{\text {LDM}} = ||\epsilon - \epsilon _\Theta (z_t, t, c)||^2_{2}\).
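
A minimal sketch of this training objective, assuming a standard DDPM noising schedule and a hypothetical unet with the signature \(\epsilon _{\Theta }(z_t, t, c)\), could look as follows:

```python
import torch
import torch.nn.functional as F

def ldm_training_step(z0, c, unet, alphas_cumprod, T=1000):
    """Sketch of L_LDM: predict the noise that was added to a clean latent z0."""
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)          # random time step per sample
    noise = torch.randn_like(z0)

    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)                # cumulative noise schedule
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise    # forward diffusion q(z_t | z0)

    noise_pred = unet(z_t, t, c)                              # conditioned Denoising U-Net
    return F.mse_loss(noise_pred, noise)                      # ||eps - eps_theta(z_t, t, c)||^2
```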

3.3 Conditioning

We use conditioning as part of the architecture to prime the model with a specific style and a defined target text. Typically, just one modality is used as side input for the diffusion model (Yang et al., 2023). In contrast, we fuse style and content with a transformer decoder for handwriting imitation. The style encoder \(\mathcal {E}_{\text {style}}\) consists of an initial convolutional layer, followed by 4 residual blocks with a total spatial downscaling factor of 128, followed by a final convolutional layer resulting in a spatial shape of (\(6\times 6\)). A small multi-layer perceptron head is used for pre-training \(\mathcal {E}_{\text {style}}\) on writer classification.
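
The sketch below illustrates such a style encoder. Only the overall structure (initial convolution, four residual blocks, final convolution, a total downscaling factor of 128 from 768 to 6, and a writer-classification head for pre-training) follows the description above; the per-layer strides, channel widths, input channel count, and number of writers are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Illustrative residual block with stride-2 downsampling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.GroupNorm(8, c_out), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.GroupNorm(8, c_out),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return nn.functional.silu(self.body(x) + self.skip(x))

class StyleEncoder(nn.Module):
    """Sketch of E_style: stem conv + 4 residual blocks + final conv,
    total spatial downscaling of 128 (768x768 -> 6x6). Strides and widths
    are assumptions; a grayscale (1-channel) style image is assumed."""
    def __init__(self, d_style=256, n_writers=500):     # n_writers is hypothetical
        super().__init__()
        self.stem = nn.Conv2d(1, 32, 3, stride=2, padding=1)        # /2
        self.blocks = nn.Sequential(                                 # /2 each -> /32
            ResBlock(32, 64), ResBlock(64, 128),
            ResBlock(128, 256), ResBlock(256, 256),
        )
        self.head = nn.Conv2d(256, d_style, 3, stride=4, padding=1)  # /4 -> /128 in total
        # Small MLP head, used only for pre-training on writer classification.
        self.writer_head = nn.Sequential(nn.Flatten(),
                                         nn.Linear(d_style * 6 * 6, n_writers))

    def forward(self, x_style):
        return self.head(self.blocks(self.stem(x_style)))            # (B, d_style, 6, 6)
```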

The encoded \(x_{\text {text}}\) with added 1D sinusoidal positional encoding (Vaswani et al., 2017) and the embedded \(x_{\text {style}}\) with added adaptive 2D positional encoding (Lee et al., 2020) are incorporated in a transformer model using cross-attention layers. Cross-attention (Vaswani et al., 2017) employs regular multi-head attention, i.e., \( \text {Att}(Q,K,V) = \text {softmax}(\frac{QK^T}{\sqrt{d}}) \cdot V \,, \) but with different inputs: \(Q=W_Q\cdot \mathcal {E}_{\text {embed}}(x_{\text {text}})\), \(K=W_K\cdot \mathcal {E}_{\text {style}}(x_{\text {style}})\), and \(V=W_V\cdot \mathcal {E}_{\text {style}}(x_{\text {style}})\). Note that, for clarity, the multi-head notation is omitted in the equations.
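
A single-head sketch of this fusion step is given below; the tensor shapes of the text and style inputs are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn

class StyleContentCrossAttention(nn.Module):
    """Sketch of the fusion: queries come from the embedded target text,
    keys/values from the encoded style image (single head for clarity)."""
    def __init__(self, d_model=256):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, text_emb, style_feats):
        # text_emb:    (B, N_chars, d) -- embedded target text + 1D positional encoding
        # style_feats: (B, 36, d)      -- flattened 6x6 style features + 2D positional encoding
        Q, K, V = self.W_q(text_emb), self.W_k(style_feats), self.W_v(style_feats)
        att = torch.softmax(Q @ K.transpose(1, 2) / math.sqrt(Q.shape[-1]), dim=-1)
        return att @ V   # stylized text embedding forwarded to the Denoising U-Net
```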

3.4 Sampling

3.4.1 Classifier-free Guidance

Another important part of diffusion models is the sampling for generating new latent representations, which, in our case, is based on additional conditioning. Recent developments have shifted towards adopting classifier-free guidance (Ho & Salimans, 2021) over its predecessor, classifier guidance (Dhariwal & Nichol, 2021). This shift is not merely a matter of preference but is substantiated by empirical evidence suggesting enhanced performance in generating conditioned outputs that closely mimic the desired attributes. Our preliminary experiments in handwriting imitation verify this trend, indicating a superior fidelity in reproducing handwriting styles when utilizing classifier-free guidance.

The basis for applying classifier-free guidance is a diffusion model which learns a conditional distribution p(x|c) and an unconditional distribution p(x) at the same time. We achieve this by replacing the conditioning information with an empty style image \(x_\text {empty}\) and an empty string with a set probability \(p=0.2\) during training. This allows us to strengthen the conditioning information during sampling by leveraging the scaled difference between the conditional and unconditional distributions. Mathematically, classifier-free guidance equates to:

$$\begin{aligned} \epsilon _{t,c} = \epsilon _\Theta (x_t, t, c_{\text {empty}})+s\cdot (\epsilon _\Theta (x_t,t,c)-\epsilon _\Theta (x_t, t,c_{\text {empty}})), \end{aligned}$$
(2)

where s is the scaling parameter and \(c_{\text {empty}}\) is modeled as a blank page for style input and an empty string as target text. Here, s controls the strength of the conditioning signal, where \(s=0\) removes conditioning, \(s=1\) applies it as given, and \(s>1\) amplifies it, making the model adhere more strongly to the provided guidance.
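
Both ingredients, dropping the conditioning during training and combining the two predictions during sampling (Eq. 2), can be sketched as follows; the default value of s here is only a placeholder, not the scale used in our experiments.

```python
import torch

def maybe_drop_conditioning(c, c_empty, p_drop=0.2):
    """During training, replace the conditioning (style image + target text)
    with the empty condition with probability p_drop, so the model also
    learns the unconditional distribution p(x)."""
    return c_empty if torch.rand(1).item() < p_drop else c

def cfg_noise_prediction(unet, z_t, t, c, c_empty, s=2.0):
    """Sketch of Eq. (2): extrapolate from the unconditional towards the
    conditional prediction. s=0 removes conditioning, s=1 applies it as
    given, s>1 amplifies it."""
    eps_uncond = unet(z_t, t, c_empty)
    eps_cond = unet(z_t, t, c)
    return eps_uncond + s * (eps_cond - eps_uncond)
```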

3.4.2 Ranked Resampling

Ding et al. (2023) improved their results by applying a progressive data filtering strategy. However, this technique only focuses on the character outputs and not on the style. They achieved the filtering by iteratively removing bad synthetic images below a certain confidence threshold and fine-tuning a pre-trained HTR model to decide which samples to keep for the next round. In contrast, our ranked resampling considers not only legibility but also style similarity. Specifically, we generate K samples from the same data point. To compute style vectors for evaluating style similarity, we employ a traditional writer retrieval pipeline that involves local feature extraction followed by the computation of a global feature representation (Christlein et al., 2015, 2017; Christlein & Maier, 2018). RootSIFT descriptors (Arandjelović & Zisserman, 2012) are extracted at SIFT keypoints (Lowe, 2004) and subsequently jointly whitened and dimensionality-reduced through PCA (Christlein et al., 2017). The global feature representation is computed using multi-VLAD, where multiple VLAD encodings are once more PCA-whitened (Christlein et al., 2015). Style similarity is measured using the cosine similarity between the style vector of the generated sample \(\tilde{x}\) and the style vector of the target style image \(x_{\text {style}}\). To measure readability, we use the Character Error Rate (CER) obtained from an HTR system trained exclusively on the training set paragraphs and additionally created synthetic images. The architecture of this system matches that of the HTR system in the encoder-decoder stage described in Section 3.1. Each sample is ranked based on these measures, allowing us to identify and select the samples that best balance stylistic fidelity with readability. In the following, we denote the ranking of the samples by \(\text {rank}_{\text {WI}}\) for the style property and \(\text {rank}_{\text {HTR}}\) for readability.
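
A compact sketch of this selection step is shown below; style_vector_fn and htr_cer_fn are placeholders for the writer retrieval pipeline and the HTR system described above, both assumed to return NumPy-compatible outputs.

```python
import numpy as np

def ranked_resampling(candidates, target_style_vec, target_text,
                      style_vector_fn, htr_cer_fn):
    """Rank K generated paragraphs by style similarity (cosine vs. the target
    style vector) and readability (CER), then pick the best combined rank."""
    sims, cers = [], []
    for img in candidates:
        v = style_vector_fn(img)
        sims.append(float(v @ target_style_vec /
                          (np.linalg.norm(v) * np.linalg.norm(target_style_vec))))
        cers.append(htr_cer_fn(img, target_text))

    rank_wi = np.argsort(np.argsort(-np.asarray(sims)))   # 0 = most similar style
    rank_htr = np.argsort(np.argsort(np.asarray(cers)))   # 0 = lowest CER
    best = int(np.argmin(rank_wi + rank_htr))             # rank_{HTR+WI}
    return candidates[best]
```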

Since our ranked resampling approach generates K samples per data point, the computational cost primarily arises from running the LDM K times, followed by inference using the HTR and WI models. Given that K remains relatively small (typically 1–10), the \(K\log K\) cost of sorting the samples is negligible; the method therefore scales linearly with the number of data points N, ensuring practical feasibility.

4 Empirical Evaluation

4.1 Dataset

4.1.1 IAM Handwriting Database

The IAM database (Marti & Bunke, 2002) is used at the paragraph level. For fine-tuning, we employ the 747 samples of the train split and the 116 samples of the validation split. Due to this low amount of training data, we created 50,000 additional paragraphs with 365 TrueType fonts from the internet and text from text generators. We select a portion of the 336 IAM test paragraphs for testing to guarantee a writer-disjoint and, thus, zero-shot setting. Therefore, we only use test samples of writers that do not appear in the train and validation sets and for which at least two samples are available. This criterion ensures that the priming information must stem from a different document. Consequently, we have assembled a collection of 247 documents authored by 72 writers.

4.2 CVL Database

For the out-of-distribution evaluation, we employ the CVL dataset (Kleber et al., 2013) at the paragraph level. It contains 1604 handwritten paragraphs across 310 unique writers in German and English. However, we had to remove any paragraphs containing an umlaut, as the IAM training alphabet does not contain these special characters. Thus, we ended up with 984 paragraphs, which we split into 108 for training, 31 for validation, and 845 for testing. From the original 310 writers, we assigned 22 for training, 282 for testing, and the remaining ones for validation. This dataset is mainly used for WI. In contrast, it is rarely applied for HTR because the training set is smaller than the test set.

4.3 Metrics

4.3.1 Image Generation Quality via FID, KID, HWD

For natural images, the performance of generative models is commonly evaluated using Fréchet Inception Distance (FID) (Heusel et al., 2017) and Kernel Inception Distance (KID) (Bińkowski et al., 2018). These metrics measure the similarity between real and generated images in a feature space extracted from a pre-trained deep neural network. Lower values indicate a closer resemblance between the generated and real samples, meaning improved realism and quality of the generated handwriting. We evaluate them on paragraph and line levels. Both metrics are tailored towards natural images, with the underlying Inception model trained on ImageNet (Deng et al., 2009). However, the distribution of handwritten data is different. Therefore, Pippi et al. (2023b) introduced a handwriting-specific line-based metric denoted as Handwriting Distance (HWD), where a VGG16 backbone is trained on 100M rendered text lines and words to classify the calligraphic fonts. Similar to FID and KID, feature representations are finally used for comparing the distributions of different datasets. Lower HWD values indicate that the generated samples better match the structural and stylistic properties of real handwriting.
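
For illustration, the Fréchet distance underlying FID can be computed from two sets of backbone features as follows; feature extraction with the Inception network is omitted and assumed to have happened already.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Sketch of FID: fit Gaussians to real and generated feature sets
    (shape (N, D)) and compute the Frechet distance between them."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real                                # drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum()
                 + np.trace(cov_r + cov_g - 2 * covmean))
```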

4.3.2 Style Assessment via Writer Identification

For assessing the stylistic accuracy, we rely on a learning-free Writer Identification (WI) method. The efficacy is then determined in a zero-shot setting, i.e., evaluating the test dataset in a leave-one-sample-out cross-validation where each sample is picked as the query and the remaining samples are ranked according to their similarity to the query. From the ranks, Mean Average Precision (mAP) and top-1 accuracy are computed. Higher mAP and top-1 values indicate that generated handwriting is more distinguishable as belonging to a specific writer, meaning better style preservation. A well-performing system should achieve high retrieval scores when querying generated samples against real samples of the same writer. As the WI method, we follow the approach by Nikolaidou et al. (2023) and rely on the same writer retrieval pipeline as outlined in Section 3.4.2.
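
A sketch of this leave-one-sample-out evaluation, operating on precomputed global style vectors, is given below:

```python
import numpy as np

def retrieval_scores(features, writer_ids):
    """Every sample acts once as the query, the remaining samples are ranked
    by cosine similarity, and top-1 accuracy and mAP are computed."""
    writer_ids = np.asarray(writer_ids)
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                      # exclude the query itself

    top1_hits, aps = [], []
    for i, row in enumerate(sims):
        order = np.argsort(-row)[:-1]                    # drop the -inf self-match
        rel = (writer_ids[order] == writer_ids[i]).astype(float)
        top1_hits.append(rel[0])
        if rel.sum() > 0:                                # average precision for this query
            precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
            aps.append(float((precision_at_k * rel).sum() / rel.sum()))
    return float(np.mean(top1_hits)), float(np.mean(aps))
```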

4.3.3 Content Quality via HTR

Content preservation is commonly measured in terms of Character Error Rate (CER) with an HTR model comparing the target text with the generated text. The HTR model is trained on the genuine IAM training and test set.
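
For reference, the CER is the Levenshtein edit distance between the recognized text and the target text, normalized by the target length; a minimal sketch:

```python
def character_error_rate(predicted, target):
    """Edit distance between the HTR transcription and the target text,
    normalized by the target length (rolling-row dynamic programming)."""
    d = list(range(len(target) + 1))
    for i, p in enumerate(predicted, 1):
        prev, d[0] = d[0], i
        for j, t in enumerate(target, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (p != t))   # substitution
    return d[len(target)] / max(len(target), 1)
```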

4.4 Implementation Details

The empirical evaluation focuses on the line and paragraph levels because these are the granularities used in real-world scenarios. We compare against three state-of-the-art methods and use their implementations and pre-trained models for an unbiased evaluation: HiGAN+, VATr, and TS-GAN. Note that VATr and HiGAN+ were mainly built for word-level handwriting generation and thus produce unrealistic text lines due to the stitching process. For priming the style, we avoid using information from the same document. HiGAN+ needs just one word as style information, the lowest amount of all approaches; therefore, a word image from another document of the same writer is sampled and used as style input. For VATr, 15 word images are sampled from the other document, while TS-GAN gets a random line image as style information. Our approach works on paragraphs, so our model uses a paragraph from another document of the same writer as style input. Please refer to the appendix for a detailed overview of the parameters and settings.

4.5 Results

Fig. 3 Comparison of text generation and style imitation performances based on a style (top) and target text of a genuine sample (bottom). Images were sampled at random and cropped after three lines.

In this section, we analyse style and content preservation at the paragraph level. Additionally, we apply line segmentation (Kodym et al., 2021) to assess our method at a more granular line level, addressing concerns about layout patterns versus intended content and style nuances. Finally, we conclude with ablation studies analysing the performance using synthetically generated data to fine-tune an HTR model, the generalisation capabilities on out-of-distribution data, and different parts of the framework.

4.5.1 Qualitative Results

In a qualitative analysis, we let the models write the same text of a given paragraph in a specific writing style. Figure 3 showcases two samples picked at random. In contrast to state-of-the-art methods, our method shows a consistent writing style, which is closer to the given style input and thus also closer to the original genuine sample (see bottom of Fig. 3). A general problem among all approaches seems to be the wrong selection of glyphs, which are still not close enough to the style sample. Note that background artifacts around the paragraph and around words stem from the IAM dataset; the model only reproduces this pattern from the genuine data.

Table 1 Paragraph-level style evaluation using Writer Identification (WI) performance. Q and K stand for query and key, respectively. The top rows show the WI performance on the IAM dataset and a stitched version. The stitching post-processing used for word-based imitation methods does not alter the WI performance. We evaluate two modalities: using purely synthetic, i.e., generated samples, and a mix of synthetic and genuine samples. All results are given in [%].
Fig. 4 UMap visualization of the five most present writers in the IAM test set, colour-coded in the plot. It shows that our generated samples (\(\times \)) are much closer to the genuine samples (\(\bullet \)) than those generated by the other methods (\(\blacksquare \), \(\blacktriangle \), \(\blacklozenge \)).

4.5.2 Style Preservation

Table 1 assesses style preservation on the paragraph level. First, we evaluate the writer identification task exclusively on the imitated images (Query: Synth & Key: Synth) to examine the consistency within the styles of the generated paragraphs. Second, to verify the authenticity of the preserved genuine style, we treated the generated images as queries and calculated their top-1 and mAP scores against the pool of real samples (Query: Synth & Key: Genuine). The top row shows the results on genuine IAM test data to validate the writer identification task as an evaluation metric. Additionally, we report the stitched IAM test data (IAM stitched) results to justify our stitching protocol, which was applied as post-processing for the comparison approaches. The similarly high IAM and IAM stitched results highlight that (1) our applied WI method is effective and (2) our stitching process does not influence the writer identification performance.

Our method significantly outperforms current state-of-the-art methods in both experiments (Q: Synth & K: Synth and Q: Synth & K: Genuine). Higher top-1 and mAP scores indicate that our generated handwriting is more stylistically consistent and more closely matches the intended writer’s style. VATr (Pippi et al., 2023a) performs well for the synthetically generated images but cannot preserve the style of the given input style image. At the same time, HiGAN+ (Gan et al., 2022) performs similarly well in both experiments. In addition to the baseline method, we evaluate the effect of different ranked resampling strategies. In particular, we rank the samples according to their performance in WI (\(\text {rank}_{\text {WI}}\)), HTR (\(\text {rank}_{\text {HTR}}\)), or both (\(\text {rank}_{\text {HTR+WI}}\)). As expected, the results show that ranked sampling using WI is especially beneficial for preserving the input style (see Synth+Genuine). The combination of WI and HTR is slightly worse.

Figure 4 offers an intuitive visualization of our results, showcasing the distribution of documents from the five most prolific writers in our dataset. We applied UMap dimensionality reduction (McInnes et al., 2020) to the L2-normalized global feature vectors obtained from the writer identification task. In this plot, each writer is distinguished by a unique colour, and the cluster centers are depicted as large, transparent circles. Surrounding these central points, the genuine test data samples, represented by smaller dots, tend to cluster closely. However, the representations generated by VATr (Pippi et al., 2023a) are mostly situated between the clusters of genuine writers, suggesting a less distinct association with any specific writer’s style. A similar observation can be made for TS-GAN (Davis et al., 2020), for which the low-dimensional representations tend to mix across the clusters of genuine writers. HiGAN+ (Gan et al., 2022) exhibits a somewhat better alignment in certain cases but struggles to accurately associate with the styles of the blue and red writers, indicating a partial success in style emulation. In contrast, the style vectors generated for our model’s paragraphs demonstrate a notably closer affiliation with the intended writers’ clusters, although with minor inaccuracies. For instance, a few blue samples are closer to the yellow cluster than their target blue centre.

Table 2 Assessment of line-level and paragraph-level text generation. HWD, \(\text {CER}_{\text {L}}\), \(\text {FID}_{\text {L}}\), and \(\text {KID}_{\text {L}}\) are computed on the line level, while \(\text {FID}_{\text {P}}\) and \(\text {KID}_{\text {P}}\) are calculated on the paragraph level. \(\text {CER}_{\text {L}}\) is in [%].

Style preservation on a line level draws different results for the comparison approaches, as seen in Table 2. For FID and KID, VATr (Pippi et al., 2023a) outperforms the others, whereas TS-GAN (Davis et al., 2020) drastically drops in performance. Conversely, for the HWD metric, HiGAN+ (Gan et al., 2022) and TS-GAN (Davis et al., 2020) exceed VATr’s performance. Our approach in combination with reranking for style and content achieves by far the best scores on both the line level (\(\text {FID}_{\text {L}}\), \(\text {KID}_{\text {L}}\), HWD) and the paragraph level (\(\text {FID}_{\text {P}}\), \(\text {KID}_{\text {P}}\)). Note that for HWD, we used the provided line-level outputs from the other methods, while for our approach, we first applied line segmentation (Kodym et al., 2021) to extract individual lines from the generated paragraphs before computing HWD.

4.5.3 Content Preservation

Another crucial aspect of handwriting imitation is text preservation, which we assess using the CER. Table 2 shows CER results at the line level, with lower CER values indicating fewer transcription errors. While TS-GAN (Davis et al., 2020) achieves the lowest CER, it does so at the expense of poor style preservation, illustrating the trade-off between content accuracy and visual fidelity. In contrast, our approach maintains a low CER while significantly outperforming other methods in style preservation, demonstrating its ability to balance both aspects effectively.

We also evaluated our approach on the paragraph level, resulting in a \(\text {CER}_{\text {P}}\) of \(4.77\,\%\). However, problems arise when dealing with lines longer than 75 characters, where the \(\text {CER}_{\text {P}}\) rises to about \(30\,\%\). We argue that the HTR model cannot cope with the extreme downscaling of line images.

4.6 Ablation Studies

Table 3 CER results [%] evaluated on real CVL test data for HTR models fine-tuned on synthetically recreated CVL train data.
Table 4 Out-of-distribution style evaluation on \(\text {CVL}_{\text {train}}\). Q and K stand for query and key, respectively. All results are given in [%].
Table 5 Out-of-distribution style evaluation on \(\text {CVL}_{\text {test}}\). Q and K stand for query and key, respectively. All results are given in [%].

4.6.1 Synthetic Data for Handwritten Text Recognition

One of the primary purposes of generative models is to use them for downstream tasks. Here, we evaluated the usefulness of different handwriting imitation approaches for creating synthetic data for training HTR systems. We used a pre-trained HTR model (synthetic fonts + real IAM train), fine-tuned it on synthetically generated CVL (Kleber et al., 2013) training data, and evaluated it on real \(\text {CVL}_{\text {test}}\). The CVL dataset was chosen because it is challenging for HTR. Table 3 demonstrates that our approach surpasses current methods. However, there is still a gap between genuine and synthetic data, suggesting that handwriting imitation must improve further before synthetic samples can replace or meaningfully augment real data.

4.6.2 Style Generalisation Capabilities with Out-of-Distribution Data

We analysed how well the style is preserved on out-of-distribution data. We applied our IAM-trained models to CVL data, following the same evaluation protocol as for IAM. Tables 4 and 5 show worse results on CVL than on IAM, but our method still performs significantly better than the other approaches. Additionally, top-1 and mAP are considerably worse on \(\text {CVL}_{\text {test}}\) than on \(\text {CVL}_{\text {train}}\). We hypothesize that the matching is much harder due to the larger test set and fewer samples per writer.

4.6.3 Encoder-Decoder Capabilities

Wordstylist (Nikolaidou et al., 2023) demonstrated great success when applying LDMs for word-level handwritten text generation. They utilized pre-trained weights from Stable Diffusion (Rombach et al., 2022) for their encoder-decoder stage. We evaluate this approach and compare it to ours in Table 6. The reconstruction quality is first assessed using the standard metrics Mean Absolute Error (MAE) and Mean Squared Error (MSE). While these metrics show satisfactory numerical results, their practical significance for handwritten text reconstruction is limited. In traditional image processing tasks, small MAE and MSE differences indicate better pixel-wise similarity. However, in the context of handwriting, even minor pixel variations can significantly impact the legibility and stylistic fidelity of the text. To further investigate this, we apply both paragraph-based and line-based HTR models to the reconstructed samples. The models were trained on IAM’s train and test data to decipher the different writing styles. The results reveal a significant increase in \(\text {CER}_{\text {P}}\) (paragraph-based) and \(\text {CER}_{\text {L}}\) (line-based) when using Stable Diffusion. When using handwritten paragraphs for direct training of a VAE from scratch with default loss terms, we observe even higher CER values. By contrast, the performance is improved when integrating the proposed writer and handwritten text recognition losses into training. This is supported by a qualitative analysis of Fig. 5, where 5(a) shows the original input to the encoder-decoder stage. Among the reconstructions without latent space modifications, 5(d) shows the closest resemblance to the original image despite a slight blurriness. The reconstruction from Stable Diffusion 5(b) alters certain characters, such as transforming the “cou” in “couple” into characters that more closely resemble “au”, making them challenging to read. This observation aligns with the quantitative findings, where the default VAE exhibits reconstructions with significantly reduced readability, mirroring the high CER values.

Fig. 5 Qualitative comparison of the paragraph reconstructions showing that the additional HTR and WI losses are beneficial.

Table 6 Assessment of the encoder-decoder stage’s reconstruction performance. The HTR results are produced by paragraph-based (\(\text {CER}_{\text {P}}\)) and line-based (\(\text {CER}_{\text {L}}\)) HTR models. All results are in [%].
Table 7 Comparison of different variations. Q and K stand for query and key, respectively. In “Ours + Cosine” a cosine scheduler is used instead of a linear scheduler while in “Ours + No \('\backslash n'\)” new line tokens are removed in the target text. All results are computed with the best ranked samples based on \(\text {rank}_{\text {WI}}\) and \(\text {rank}_{\text {HTR}}\). Results are given in [%].

4.6.4 Cosine Scheduler and New-Line-Token-Free Variants

We investigated two alternative versions of our approach: one employing a cosine scheduler (Nichol & Dhariwal, 2021) to prioritize the general layout of text, and another omitting new-line tokens, leaving the model to determine line initiations autonomously. As shown in Table 7, both modifications exhibit a similar intrinsic synthetic style but outperform our main approach in reflecting the handwriting styles in the genuine data. However, these approaches come with a trade-off in terms of legibility. Employing the cosine scheduler increases the \(\text {CER}_{\text {L}}\) to just over \(7\,\%\), and removing new-line tokens leads to a \(\text {CER}_{\text {L}}\) nearing \(35\,\%\). Additionally, qualitative assessments of the new-line-token-free variant revealed tendencies of the model to duplicate or omit words.

Fig. 6 Evaluating the effect of different ranking methods on the style consistency (mAP) and content preservation (\(\text {CER}_{\text {P}}\)). Ranking simultaneously by writer identification (WI) and by handwritten text recognition (HTR) provides a good compromise.

4.6.5 Ranking Effect

In Fig. 6, we analyse the impact of different ranking strategies on the performance of our baseline method on writer identification (mAP) and (paragraph-based) HTR (\(\text {CER}_{\text {P}}\)). The genuine writer style (Fig. 6, left) is best preserved when using only WI feedback for ranking (\(\text {rank}_{\text {WI}}\)), achieving an mAP of above \(60\,\%\) for the top-ranked samples (rank 1). It is unaffected by HTR feedback (\(\text {rank}_{\text {HTR}}\)), which remains stable at a mean performance level of approximately \(55\,\%\) mAP. The combined ranking (\(\text {rank}_{\text {HTR+WI}}\)) positively impacts style outputs, though less effectively than \(\text {rank}_{\text {WI}}\), reaching an mAP of nearly \(60\,\%\) for the top rank.

Regarding content preservation (Fig. 6, right), applying \(\text {rank}_{\text {HTR}}\) notably improves the outcomes, reducing the CER to approximately \(4\,\%\) for the top rank. \(\text {rank}_{\text {HTR+WI}}\) also lowers the CER, though not as effectively as \(\text {rank}_{\text {HTR}}\), yet still achieving comparable results. In contrast, \(\text {rank}_{\text {WI}}\) maintains a CER between \(6\,\%\) and \(7\,\%\) across the different ranks. It is important to highlight that implementing a ranking strategy utilizing both HTR and WI feedback results in a significant improvement in mAP, approximately five percentage points above the mean, and a concurrent enhancement in \(\text {CER}_{\text {P}}\), approximately two percentage points better than the mean. Thus, this strategy strikes a meaningful balance between style and content preservation.

5 Discussion

Table 1 shows that the writer style characteristics are well preserved, especially for synthetic samples. But even when imitating real handwriting captured on genuine images, the model shows realistic results. This is further demonstrated in the UMap plot (Fig. 4), where our method produces samples much closer to the original ones. Although our method achieves excellent replication of the desired style, the target text occasionally contains duplicate or incorrectly swapped characters, a flaw not seen with alternative methods. However, the \(\text {CER}_{\text {L}}\) reported in Table 2 might not accurately represent the true CER. We computed a \(2.29\,\%\) \(\text {CER}_{\text {L}}\) on paragraphs reconstructed from the latent representations derived from the original images, ultimately setting a rather high baseline. In the reconstruction quality results (Table 6), readability is lower compared to genuine data processed with an omniscient HTR model. This sets the lower boundary for readability.

5.1 Limitations with Out-of-Distribution Data

Challenges with out-of-distribution data primarily arise from two sources: the target text and the style image. For the target text, there is a small bias towards known words, which stems from the limited diversity of the paragraph training data. While large diffusion models are typically trained on millions of unique images, our training involved only 747 real images and generated \(\approx \!4000\) unique lines, which we permuted and stitched into a total of 50,000 synthetic paragraphs, where every image contained 3 to 13 lines and 5 to 101 characters per line. Additionally, the results degrade when the target text takes on uncommon paragraph forms, such as paragraphs containing only a single word or paragraphs with long lines exceeding 101 characters per line. We believe this issue arises because the current KL-regularized latent space representation does not adequately capture the overall rigid structure of handwritten paragraphs. As a result, the network struggles to generalize to rare out-of-distribution samples. This could be addressed by incorporating prior knowledge about the semantic structure of handwritten paragraphs into the regularization of the latent space. Alternatively, a simpler solution could involve incorporating additional synthetic and stitched real training data, with a particular focus on these edge cases. However, stitching real data can introduce new artifacts, such as overly regular layouts. For the style image, the distribution of out-of-distribution styles must align closely with the training data, particularly for real-world applications. The results in Section 4.6.2 show how the different methods adapt to this case. Here, we can see that the results moderately decrease for \(\text {CVL}_{\text {train}}\). We hypothesize that this could be due to the cleaner nature of the CVL data compared to the IAM data, which contains some artifacts, such as background gradients. The biggest drop occurs for \(\text {CVL}_{\text {test}}\), which could additionally stem from the fact that this dataset split has a large pool of unseen writers (283) across 845 paragraphs, making good top-1 and mAP results more challenging. This hypothesis is supported by the fact that the HWD stayed mostly consistent for our approach between \(\text {CVL}_{\text {train}}\) and \(\text {CVL}_{\text {test}}\).

5.1.1 Compute Time Comparison

Computational efficiency is a key factor when applying generative models in real-world scenarios. To evaluate inference speed, we measured the time required to generate CVL paragraphs on an NVIDIA A40 GPU. Our results confirm that GAN-based approaches significantly outperform diffusion-based models in speed. Among the tested methods, HiGAN+ is the fastest, generating a paragraph in 0.13 seconds, followed by TS-GAN (0.27 seconds) and VATr (0.28 seconds). However, when comparing our method to another diffusion-based approach, WordStylist (Nikolaidou et al., 2023), we observe a substantial efficiency gain. WordStylist relies solely on the writer ID and does not need to extract the style from the image, requiring even fewer computations. Despite this, it still requires 13.44 minutes per paragraph due to its word-by-word generation process and a high number of sampling steps (600). In contrast, our approach reduces inference time to just 9.06 seconds per paragraph, achieving an average speed-up factor of 91 and making it a more viable option for practical applications.

5.1.2 Possible Negative Implications

A more appealing and realistic imitation of handwritten text poses several risks, particularly the forgery of sensitive documents such as wills, contracts, or historical records. Beyond document fraud, such a model could be exploited for identity theft or the falsification of handwritten evidence. These risks underscore the importance of robust forensic tools to detect AI-generated handwriting.

To counteract this, we make our approach and code publicly available to enable building countermeasures for these types of forgeries. There are already some initial works in this direction (Carriére et al., 2023). Further efforts in watermarking, authentication protocols, and forensic handwriting analysis could enhance security.

6 Conclusion and Future Work

In this study, we introduce a method that is capable of producing realistic-looking and style-consistent handwritten paragraphs in unseen writing styles. The approach is based on a refined latent diffusion model. By incorporating additional loss terms during the encoder-decoder phase, we achieved notable enhancements in both reconstruction quality and compression efficiency. Furthermore, the integration of style features with text embeddings proved to be effective for conditioning the denoising U-Net, demonstrating a successful application of our approach. Additionally, by imitating handwriting at the paragraph level rather than word by word, we significantly improved generation speed, making our method more efficient for practical applications. Overall, our contributions not only advance the field of handwriting imitation but also hold the potential to benefit other document analysis tasks, particularly in scenarios characterized by limited data availability.

Looking ahead, several opportunities exist to further enhance the model’s performance and applicability. Future work should focus on improving the encoder-decoder stage. We hypothesise that a more compressed and structured latent space could enhance the generalisability and sampling speed of the diffusion model. Additionally, optimising this stage may help reduce artefacts, such as low-frequency gradients in the background. Another key direction is increasing the amount of training data, which is crucial for fully leveraging diffusion models and mitigating issues related to out-of-distribution data, as discussed in Section 5. To bridge the computational gap between diffusion models and GANs, future research should explore reducing the number of sampling steps in combination with a more compressed latent space while maintaining high-quality output.