Abstract
The imitation of cursive handwriting is mainly limited to generating handwritten words or lines. Multiple synthetic outputs must be stitched together to create paragraphs or whole pages, whereby consistency and layout information are lost. To close this gap, we propose a method for imitating handwriting at the paragraph level that also works for unseen writing styles. Therefore, we introduce a modified latent diffusion model that enriches the encoder-decoder mechanism with specialized loss functions that explicitly preserve the style and content. We enhance the attention mechanism of the diffusion model with adaptive 2D positional encoding and the conditioning mechanism to work with two modalities simultaneously: a style image and the target text. This significantly improves the realism of the generated handwriting. We set a new benchmark in our comprehensive evaluation, achieving 61 % mAP and 56 % top-1 accuracy in style preservation, significantly outperforming the previous best method (37 % mAP, 30 % top-1). We are making our code publicly available for reproducibility, supporting research in this area and research into potential countermeasures: https://github.com/M4rt1nM4yr/paragraph_handwriting_imitation_ldm
1 Introduction
Advancements in handwritten text generation and imitation hold significant promise for preserving the personal qualities of handwriting, which health conditions or injuries may compromise (Bisio et al., 2016). These techniques function as a digital preservation mechanism, ensuring continuity of expression for individuals facing physical constraints or other kinds of restrictions. However, as with other deep learning paradigms, their effectiveness depends on the variety and size of suitable training data. Notably, current datasets present challenges, including biases towards certain writing styles, under-representation of languages, and limitations of common data augmentation for tasks such as Handwritten Text Recognition (HTR), Writer Identification (WI), and Visual Question Answering (VQA) in document contexts. Research to date has primarily concentrated on text generation at the word or line level due to the inherent complexities of processing larger coherent textual and visual entities. This focus, however, has led to shortcomings in producing consistent and realistic outputs at the paragraph level – a prerequisite for practical applicability in many real-world applications, such as personalized text messages or rendering writings in different languages.
Our study introduces a novel method for paragraph-level handwriting imitation that employs an adapted version of Latent Diffusion Models (LDMs) (Rombach et al., 2022). Diffusion models typically require large amounts of data and significant computational resources, especially when handling high-resolution inputs such as 768 \(\times \) 768 images in our case. To address this, the LDM uses a Variational Autoencoder (VAE) to condense essential information into a more compact latent space, reducing computational demands while preserving crucial details. We enhance the encoder-decoder framework with style and content preservation loss terms, improving the fidelity and compression of the latent representation. Furthermore, we incorporate global positional information and cross-attention mechanisms within the Denoising U-Net architecture in latent space. These enhancements lead to more realistic paragraph generations. Evaluated as a zero-shot algorithm, our method demonstrates robustness and generalizability across previously unseen handwriting styles and writers, significantly outperforming existing methods in synthetic paragraph matching. The method achieves a top-1 score of over 54 % when matching the synthetic paragraphs with genuine data, nearly double that of the second-best approach (Fig. 1).
In summary, our contributions to the field of handwriting generation and imitation include: (1) End-to-end framework for imitating entire paragraphs of handwritten text. Our method preserves the individual’s unique writing style and maintains the original layout, representing a significant step forward in the fidelity of imitated generative handwriting. (2) Refined encoder-decoder stage by incorporating specialized loss terms that target content and style preservation. We show that these auxiliary losses enhance the generation quality and the latent compression ratio. (3) Improved conditioning process by integrating the writing style with the target text and employing cross-attention to incorporate this combined information into the Denoising U-Net. (4) Ranked sampling: Based on the variance within the sampling process, we introduce a ranking scheme that simultaneously considers content and style preservation. (5) Qualitative and quantitative analyses show that our method surpasses current state-of-the-art imitation methods by a large margin, considering the combination of image generation, style preservation, and content preservation.
Fig. 2 Method overview. We transfer the handwritten paragraphs in and out of latent space via encoder \(\mathcal {E}\) and decoder \(\mathcal {D}\). The Denoising U-Net \(\epsilon _{\Theta }\) is trained in latent space and conditioned with cross-attention. As conditioning information, we have two inputs: (1) a style image \(x_{\text {style}}\), which we encode with a shallow CNN \(\mathcal {E}_{\text {style}}\), and (2) a target text \(x_{\text {text}}\), which we embed into feature space. We fuse both modalities with a transformer and forward them as a stylized embedding into the Denoising U-Net via cross-attention.
2 Related Work
In this work, we generate handwritten text solely from images, without relying on additional modalities such as the online trajectories used by other methods (Graves, 2014; Mayr et al., 2020; Aksan et al., 2018; Chang et al., 2022; Luhman & Luhman, 2020). Unlike online handwriting, handwritten text images are widely available, offering broader application possibilities. Various strategies are employed at different levels of detail. Techniques commonly applied to Chinese handwriting or character-specific methods are infrequently used for cursive handwriting in Western scripts (Dai et al., 2023; Tang et al., 2022; Huang et al., 2022). GANwriting (Kang et al., 2020) generated images on a word level based on a few style samples; the approach was later extended to full lines (Kang et al., 2021). As the name suggests, it uses a Generative Adversarial Network (GAN). Like most approaches, the style samples and target texts are encoded initially. An upsampling generator produces the output image based on the concatenated style and text information, while AdaIN (Huang & Belongie, 2017) is used for guidance. The two default GAN losses (discriminator and generator loss) are extended by the domain-specific feedback of writer and recognition losses. Similarly, ScrabbleGAN (Fogel et al., 2020) and TS-GAN (Davis et al., 2020) applied a GAN to generate text lines, but the former only used HTR feedback as an extra loss term, while the latter added a space predictor to space the text for the generator. SmartPatch (Mattick et al., 2021) and SLOGAN (Luo et al., 2023) added character feedback to improve the results on the stroke level. By contrast, HiGAN+ (Gan et al., 2022) applies a patch discriminator with a fixed grid of extracted patches but additionally regularizes the style by reconstructing the style vector, which is uniformly sampled, as in JokerGAN (Zdenek & Nakayama, 2021). With JokerGAN++ (Zdenek & Nakayama, 2023), the authors exchanged their style encoder for a ViT (Dosovitskiy et al., 2021). To further increase realism in the outputs, transformer models (Bhunia et al., 2021) as generators and visual archetypes (Pippi et al., 2023a) are applied.
Recent advancements in Diffusion Models (DMs) in the field of computer vision (Sohl-Dickstein et al., 2015; Song et al., 2021; Ho et al., 2020; Yang et al., 2023) have also influenced research in handwriting generation. This progress has rendered various customized loss terms obsolete (Ding et al., 2023; Zhu et al., 2023; Nikolaidou et al., 2023). Most of these diffusion methods are constrained to pre-existing writing styles since they incorporate the writer ID as a style input in their designs. Hence, they cannot generalize to unseen styles. However, CTIG-DM (Zhu et al., 2023) differentiates between the style of the writer and the style of the image, with the latter primarily focusing on texture and colour. Interestingly, GC-DDPM (Ding et al., 2023) also incorporates visual archetypes into its approach for more stability, similar to VATr (Pippi et al., 2023a). Moreover, StylusAI (Riaz et al., 2024) applied diffusion models to German text data. While all of these diffusion-based methods operate at the word level, our approach directly generates entire paragraphs. This is achieved via an adapted LDM, as described in Section 3. Additionally, our ranked resampling method, detailed in Section 3.4.2, accounts for both content and style, whereas Ding et al.'s approach ranks solely based on character correctness.
3 Methodology
DMs (Sohl-Dickstein et al., 2015; Song et al., 2021; Ho et al., 2020; Yang et al., 2023) are ubiquitous for image generation, but their application to high-resolution images is data- and resource-intensive. To mitigate this, LDMs train a diffusion model in a compressed latent space, accessible from the pixel space with an encoder-decoder pair (Ramesh et al., 2021; Rombach et al., 2022). Further, despite impressive results on natural images, DMs often lack the capabilities to produce realistic-looking text. Only TextDiffuser (Chen et al., 2023) produces realistic scene text images, mainly limited to fonts. Therefore, we applied several modifications as described below to be able to generate handwritten paragraphs.
Given a style image \(x_{\text {style}}\) and a target text \(x_{\text {text}}\), the task of handwritten text imitation can be described as producing an output image \(\tilde{x}\) that renders the given content in the given style. For training, the target image x and \(x_{\text {style}}\) stem from the same writer but, where possible, from different paragraphs. Figure 2 visualizes the building blocks used to solve this task, which are elaborated in the upcoming subsections.
3.1 Encoder-Decoder Stage
First, the translation into the latent representation is applied, i.e., \(z = \mathcal {E}(x)\), where \(\mathcal {E}\) denotes the encoder. The images \(\tilde{x} = \mathcal {D}(z)\) are reconstructed with decoder \(\mathcal {D}\). The key property of this step is condensing the image information into a compressed representation, typically accomplished by reducing the spatial dimensions. Since we are working with rather high resolutions (\(768\times 768\)) in combination with a small amount of non-synthetic training data (747 samples), this compression has to be very strong to obtain a well-behaving diffusion process. Wordstylist (Nikolaidou et al., 2023) applied a pre-trained model from Stable Diffusion for this task. Preliminary results showed that this does not scale to paragraphs (see Fig. 5 and Table 6). For a \(768\times 768\) input image, their compression method results in a feature matrix with shape \((4 \times 96 \times 96)\), where 4 is the feature dimension and \(96\times 96\) is the spatial dimension. We retrained the Kullback-Leibler (KL) regularized VAE (Kingma & Welling, 2014; Rezende et al., 2014) from LDM but with a smaller feature dimension in the latent space. To facilitate this increased compression rate, we extend the feedback of the paragraph-level reconstruction with pre-trained text recognition and writer-style models, leading to a latent shape of \((1 \times 96 \times 96)\). This updates the overall loss term to:
$$\mathcal {L} = \mathcal {L}_{\text {rec}} + w_{\text {KL}} \cdot \mathcal {L}_{\text {KL}} + w_{\text {HTR}} \cdot \mathcal {L}_{\text {HTR}} + w_{\text {WI}} \cdot \mathcal {L}_{\text {WI}} \,,$$
where \(\mathcal {L}_{\text {HTR}}\) denotes the loss term for text recognition and \(\mathcal {L}_{\text {WI}}\) incorporates the style task. To balance and better align the different loss terms given their varying value ranges, we introduce the weightings \(w_{\text {KL}}\), \(w_{\text {HTR}}\), and \(w_{\text {WI}}\). \(\mathcal {L}_{\text {rec}}\) is the applied \(L_1\) reconstruction loss, and for regularization, the KL divergence is applied, denoted as \(\mathcal {L}_{\text {KL}}\), with the weighting \(w_{\text {KL}}\) set to the default value \(1 \cdot 10^{-6}\). Note that we removed the discriminator loss due to unpredictable training behaviors.
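For illustration, the following sketch shows one way the extended first-stage objective could be assembled in PyTorch. The module and tensor names (`htr_model`, `wi_model`, `posterior_kl`, `text_targets`, `writer_ids`) are assumptions made for this example, not the released implementation; the weights correspond to the values reported in the appendix.

```python
import torch.nn.functional as F

# Sketch of L = L_rec + w_KL*L_KL + w_HTR*L_HTR + w_WI*L_WI (names are illustrative)
w_kl, w_htr, w_wi = 1e-6, 0.3, 0.005   # weightings as reported in the appendix

def first_stage_loss(x, x_rec, posterior_kl, htr_model, wi_model, text_targets, writer_ids):
    l_rec = F.l1_loss(x_rec, x)                        # L1 reconstruction on the paragraph image
    l_kl = posterior_kl.mean()                         # KL regularization of the VAE posterior
    logits = htr_model(x_rec)                          # (B, seq_len, num_chars) from the pre-trained HTR
    l_htr = F.cross_entropy(logits.flatten(0, 1), text_targets.flatten())
    l_wi = F.cross_entropy(wi_model(x_rec), writer_ids)   # writer classification on the reconstruction
    return l_rec + w_kl * l_kl + w_htr * l_htr + w_wi * l_wi
```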
The text recognizer, which is based on the approach by Kang et al. (2022), is applied to full paragraphs to ensure readable and unmodified content. The model combines a feature extractor, similar to \(\mathcal {E}\) but not with shared weights, and a transformer model for encoding the features and producing the output predictions. Further, to apply the transformer model to the extracted features, we add adaptive two-dimensional positional encoding (Lee et al., 2020) to the feature encodings, similar to (Kang et al., 2022). \(\mathcal {L}_{\text {HTR}}\) is computed using cross-entropy, following sequence-to-sequence HTR approaches (Kang et al., 2022; Wick et al., 2021). The writer ID is used for correctly matching the writing style. We use a Convolutional Neural Network (CNN) to predict the writer ID for increased data efficiency and reduced overfitting due to the sparsity of writing styles.Footnote 1
3.2 Diffusion Model
Diffusion Models are generative models that learn a data distribution p(x) by reversing a diffusion process. In the forward process, noise is gradually added to the image until it is indistinguishable from normally distributed noise. The process is parametrized by a Markov chain of length T, the total number of time steps; at time steps close to T, the inputs are almost completely noised. Reversing that process with the Denoising U-Net from LDM (Rombach et al., 2022), \(\epsilon _{\Theta }(z_t,t,c)\), is done by gradually removing the noise, where \(z_t\) is the noised image at time step t and c stands for the conditioning described in Section 3.3. We add adaptive two-dimensional positional encoding (Lee et al., 2020) after the first projection layer of the spatial transformers for more robust training runs. This idea is inspired by HTR methods (Kang et al., 2022), which apply positional encoding to input images. We hypothesise that, while sharing the same style, different regions and words in the image are somewhat independent of each other. Preliminary experiments showed that without this extension, the model struggled to reconstruct entire paragraphs, instead producing a jigsaw-like arrangement of strokes and lines. The objective of training the model parameters \(\Theta \) is defined as \(\mathcal {L}_{\text {LDM}} = ||\epsilon - \epsilon _\Theta (z_t, t, c)||^2_{2}\).
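A minimal sketch of this training objective in latent space is given below, assuming a DDPM-style noise schedule supplied as the cumulative products \(\bar{\alpha }_t\); `eps_model` stands in for the conditioned Denoising U-Net.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(eps_model, z0, cond, alphas_cumprod):
    """One epsilon-prediction step on a clean latent z0 (illustrative sketch)."""
    B = z0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    # forward diffusion: z_t = sqrt(a_bar) * z_0 + sqrt(1 - a_bar) * eps
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    eps_pred = eps_model(z_t, t, cond)          # U-Net predicts the added noise
    return F.mse_loss(eps_pred, noise)          # || eps - eps_theta(z_t, t, c) ||^2
```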
3.3 Conditioning
We use conditioning as part of the architecture to prime the model with a specific style and a defined target text. Typically, just one modality is used as side input for the diffusion model (Yang et al., 2023). In contrast, we fuse style and content with a transformer decoder for handwriting imitation. The style encoder \(\mathcal {E}_{\text {style}}\) consists of an initial convolutional layer, followed by 4 residual blocks with a total spatial downscaling factor of 128, followed by a final convolutional layer resulting in a spatial shape of (\(6\times 6\)). A small multi-layer perceptron head is used for pre-training \(\mathcal {E}_{\text {style}}\) on writer classification.
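A possible layout for \(\mathcal {E}_{\text {style}}\) that matches the stated total downscaling factor of 128 and the \(6\times 6\) output is sketched below; the channel widths and per-block strides are assumptions, and the classification head shown is only used during pre-training.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with strided downsampling (channels/strides are assumptions)."""
    def __init__(self, c_in, c_out, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1), nn.GroupNorm(8, c_out), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, 1, 1), nn.GroupNorm(8, c_out),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride)

    def forward(self, x):
        return torch.relu(self.body(x) + self.skip(x))

class StyleEncoder(nn.Module):
    """Initial conv, 4 residual blocks (total downscaling 4*4*4*2 = 128), final conv -> 6x6 map."""
    def __init__(self, d=256, n_writers=500):
        super().__init__()
        self.stem = nn.Conv2d(1, 32, 3, padding=1)
        self.blocks = nn.Sequential(
            ResBlock(32, 64, 4), ResBlock(64, 128, 4), ResBlock(128, 256, 4), ResBlock(256, d, 2),
        )
        self.head = nn.Conv2d(d, d, 3, padding=1)
        # small MLP head used only for pre-training on writer classification
        self.cls = nn.Sequential(nn.Flatten(), nn.Linear(d * 6 * 6, d), nn.ReLU(), nn.Linear(d, n_writers))

    def forward(self, x_style):                            # x_style: (B, 1, 768, 768)
        return self.head(self.blocks(self.stem(x_style)))  # (B, d, 6, 6) style feature map
```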
The encoded \(x_{\text {text}}\), with added 1D sinusoidal positional encoding (Vaswani et al., 2017), and the embedded \(x_{\text {style}}\), with added adaptive 2D positional encoding (Lee et al., 2020), are incorporated into a transformer model using cross-attention layers. Cross-attention (Vaswani et al., 2017) employs regular multi-head attention, i.e., \( \text {Att}(Q,K,V) = \text {softmax}(\frac{QK^T}{\sqrt{d}}) \cdot V \,, \) but with different inputs: \(Q=W_Q\cdot \mathcal {E}_{\text {embed}}(x_{\text {text}})\), \(K=W_K\cdot \mathcal {E}_{\text {style}}(x_{\text {style}})\), and \(V=W_V\cdot \mathcal {E}_{\text {style}}(x_{\text {style}})\). Note that, for clarity, the multi-head notation is omitted in the equations.
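The following single-head sketch illustrates this fusion step; residual connections, layer normalization, and the multi-head split of the actual transformer are omitted.

```python
import torch
import torch.nn as nn

class StyleTextCrossAttention(nn.Module):
    """Queries from the embedded target text, keys/values from the style features."""
    def __init__(self, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)

    def forward(self, text_emb, style_feat):
        # text_emb:   (B, L_text, d)  embedded characters + 1D sinusoidal positional encoding
        # style_feat: (B, 36, d)      flattened 6x6 style map + adaptive 2D positional encoding
        q, k, v = self.W_q(text_emb), self.W_k(style_feat), self.W_v(style_feat)
        att = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return att @ v   # stylized text embedding fed to the Denoising U-Net via cross-attention
```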
3.4 Sampling
3.4.1 Classifier-free Guidance
Another important part of diffusion models is the sampling for generating new latent representations, which, in our case, is based on additional conditioning. Recent developments have shifted towards adopting classifier-free guidance (Ho & Salimans, 2021) over its predecessor, classifier guidance (Dhariwal & Nichol, 2021). This shift is not merely a matter of preference but is substantiated by empirical evidence suggesting enhanced performance in generating conditioned outputs that closely mimic the desired attributes. Our preliminary experiments in handwriting imitation verify this trend, indicating a superior fidelity in reproducing handwriting styles when utilizing classifier-free guidance.
The basis for applying classifier-free guidance is a diffusion model that learns a conditional distribution p(x|c) and an unconditional distribution p(x) at the same time. We achieve this by replacing the conditioning information with an empty style image \(x_\text {empty}\) and an empty string with a set probability \(p=0.2\) during training. That allows us to strengthen the conditioning information during sampling by leveraging the scaled difference between the conditional and unconditional distributions. Mathematically, classifier-free guidance equates to:
$$\tilde{\epsilon }_{\Theta }(z_t, t, c) = \epsilon _{\Theta }(z_t, t, c_{\text {empty}}) + s \cdot \left( \epsilon _{\Theta }(z_t, t, c) - \epsilon _{\Theta }(z_t, t, c_{\text {empty}}) \right) \,,$$
where s is the scaling parameter and \(c_{\text {empty}}\) is modeled as a blank page for the style input and an empty string as the target text. Here, s controls the strength of the conditioning signal, where \(s=0\) removes conditioning, \(s=1\) applies it as given, and \(s>1\) amplifies it, making the model adhere more strongly to the provided guidance.
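A minimal sketch of both ingredients, the conditioning dropout during training and the guided noise prediction at sampling time, is given below; `eps_model` again stands in for the conditioned Denoising U-Net.

```python
import torch

def maybe_drop_condition(cond, cond_empty, p=0.2):
    """During training, replace the conditioning with the empty condition with probability p."""
    return cond_empty if torch.rand(()) < p else cond

def cfg_epsilon(eps_model, z_t, t, cond, cond_empty, s):
    """eps = eps(z_t,t,c_empty) + s * (eps(z_t,t,c) - eps(z_t,t,c_empty))"""
    eps_uncond = eps_model(z_t, t, cond_empty)   # blank style page and empty target text
    eps_cond = eps_model(z_t, t, cond)
    return eps_uncond + s * (eps_cond - eps_uncond)
```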
3.4.2 Ranked Resampling
Ding et al. (2023) improved their results by applying a progressive data filtering strategy. However, this technique only focuses on the character outputs and not on the style. They achieved the filtering by iteratively removing bad synthetic images below a certain confidence threshold and fine-tuning a pre-trained HTR model to decide which samples to keep for the next round. In contrast, our ranked resampling considers not only legibility but also style similarity. Specifically, we generate K samples from the same data point. To compute style vectors for evaluating style similarity, we employ a traditional writer retrieval pipeline that involves local feature extraction followed by the computation of a global feature representation (Christlein et al., 2015, 2017; Christlein & Maier, 2018). RootSIFT descriptors (Arandjelović & Zisserman, 2012) are extracted at SIFT keypoints (Lowe, 2004) and subsequently jointly whitened and dimensionality-reduced through PCA (Christlein et al., 2017). The global feature representation is computed using multi-VLAD, where multiple VLAD encodings are once more PCA-whitened (Christlein et al., 2015). Style similarity is measured using the cosine similarity between the style vector of the generated sample \(\tilde{x}\) and that of the target style image \(x_{\text {style}}\). To measure readability, we use the Character Error Rate (CER) obtained from an HTR system trained exclusively on the training-set paragraphs and additionally created synthetic images. The architecture of this system matches that of the HTR system in the encoder-decoder stage described in Section 3.1. Each sample is ranked based on these measures, allowing us to identify and select the samples that best balance stylistic fidelity with readability. In the following, we denote the ranking of the samples by \(\text {rank}_{\text {WI}}\) for the style property and \(\text {rank}_{\text {HTR}}\) for readability.
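The sketch below shows one way such a ranking could be implemented for the K candidates of a single paragraph; combining the two criteria by summing their ranks is an assumption made for this example.

```python
import numpy as np

def ranked_resampling(style_sims, cers, mode="HTR+WI"):
    """Pick the best of K candidates given style similarities (higher = better)
    and character error rates (lower = better)."""
    rank_wi = np.argsort(np.argsort(-np.asarray(style_sims)))   # rank 0 = most style-similar
    rank_htr = np.argsort(np.argsort(np.asarray(cers)))         # rank 0 = most readable
    if mode == "WI":
        combined = rank_wi
    elif mode == "HTR":
        combined = rank_htr
    else:                                                        # joint ranking rank_{HTR+WI}
        combined = rank_wi + rank_htr
    return int(np.argmin(combined))
```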
Since our ranked resampling approach generates K samples per data point, the computational cost primarily arises from running the LDM K times, followed by inference with the HTR and WI models. Given that K remains relatively small (typically 1-10), the \(K \log K\) cost of sorting the candidates is negligible; the method therefore scales linearly with the number of data points N, ensuring practical feasibility.
4 Empirical Evaluation
4.1 Dataset
4.1.1 IAM Handwriting Database
The IAM database (Marti & Bunke, 2002) is used at the paragraph level. For fine-tuning, we employ the 747 samples of the train split and the 116 samples of the validation split. Due to this low amount of training data, we created 50,000 additional paragraphs with 365 TrueType fonts from the internet and text from text generators. We select a portion of the 336 IAM test paragraphs for testing to guarantee a writer-disjoint and, thus, zero-shot setting. Therefore, we only use test samples of writers that do not appear in the train and validation sets and for which at least two samples are available. This criterion ensures that the priming information must stem from a different document. Consequently, we have assembled a collection of 247 documents authored by 72 writers.
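For illustration, a paragraph of font-rendered text could be produced along the following lines; the margins, font sizes, and line spacing here are arbitrary assumptions and not the exact rendering procedure used for the 50,000 synthetic paragraphs.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def render_synthetic_paragraph(lines, font_paths, size=768):
    """Render a list of text lines with a randomly chosen TrueType font."""
    font = ImageFont.truetype(random.choice(font_paths), size=random.randint(28, 48))
    img = Image.new("L", (size, size), color=255)      # white canvas
    draw = ImageDraw.Draw(img)
    y = 40
    for line in lines:
        draw.text((40, y), line, font=font, fill=0)    # black text
        y += int(font.size * 1.6)                      # simple fixed line spacing
    return img
```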
4.2 CVL Database
For the out-of-distribution evaluation, we employ the CVL dataset (Kleber et al., 2013) at the paragraph level. It contains 1604 handwritten paragraphs across 310 unique writers in German and English. However, we had to remove any paragraphs containing an umlaut, as the IAM training alphabet does not contain these special characters. Thus, we ended up with 984 paragraphs, which we split into 108 for training, 31 for validation, and 845 for testing. From the original 310 writers, we assigned 22 for training, 282 for testing, and the remaining ones for validation. This dataset is mainly used for WI; in contrast, it is rarely applied for HTR because the training set is smaller than the test set.
4.3 Metrics
4.3.1 Image Generation Quality via FID, KID, HWD
For natural images, the performance of generative models is commonly evaluated using Fréchet Inception Distance (FID) (Heusel et al., 2017) and Kernel Inception Distance (KID) (Bińkowski et al., 2018). These metrics measure the similarity between real and generated images in a feature space extracted from a pre-trained deep neural network. Lower values indicate a closer resemblance between the generated and real samples, meaning improved realism and quality of the generated handwriting. We evaluate them on paragraph and line levels. Both metrics are tailored towards natural images with the underlying Inception model trained on ImageNet (Deng et al., 2009). However, the distribution of handwritten data is different. Therefore, Pippi et al. (2023b) introduced a handwriting-specific line-based metric denoted as Handwriting Distance (HWD),Footnote 2 where a VGG16 backbone is trained on 100M rendered text lines and words to classify the calligraphic fonts. Similar to FID and KID, feature representations are finally used for comparing the distributions of different datasets. Lower HWD values indicate that the generated samples better match the structural and stylistic properties of real handwriting.
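As an illustration, FID and KID can be computed with the torchmetrics implementations as sketched below; note that, as stated in the notes, the actual evaluation splits each input into non-overlapping patches rather than feeding whole images, which this sketch does not reproduce.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)

def update_metrics(real_batch, fake_batch):
    # real_batch / fake_batch: (B, 3, H, W) uint8 tensors of line or paragraph crops
    fid.update(real_batch, real=True)
    fid.update(fake_batch, real=False)
    kid.update(real_batch, real=True)
    kid.update(fake_batch, real=False)

# after all batches have been seen:
# fid_value = fid.compute()
# kid_mean, kid_std = kid.compute()
```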
4.3.2 Style Assessment via Writer Identification
For assessing the stylistic accuracy, we rely on a learning-free Writer Identification (WI) method. The efficacy is then determined in a zero-shot setting, i.e., evaluating the test dataset in a leave-one-sample-out cross-validation where each sample is picked as the query and the remaining samples are ranked according to their similarity to it. From these rankings, Mean Average Precision (mAP) and top-1 accuracy are computed. Higher mAP and top-1 values indicate that generated handwriting is more distinguishable as belonging to a specific writer, meaning better style preservation. A well-performing system should achieve high retrieval scores when querying generated samples against real samples of the same writer. As the WI method, we follow the approach by Nikolaidou et al. (2023) and rely on the same writer retrieval pipeline as outlined in Section 3.4.2.
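A condensed sketch of such a learning-free retrieval pipeline is given below. For brevity, it uses a single VLAD encoding instead of multi-VLAD, assumes precomputed cluster centroids and an already fitted whitening PCA, and omits the joint descriptor whitening; it is meant only to illustrate the RootSIFT, VLAD, and cosine-similarity chain.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def rootsift(img):
    """RootSIFT descriptors at SIFT keypoints: L1-normalize, then take the square root."""
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    if desc is None:
        return np.zeros((0, 128), dtype=np.float32)
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + 1e-7)
    return np.sqrt(desc)

def vlad(desc, centroids):
    """VLAD encoding of local descriptors against k precomputed centroids (k, 128)."""
    assign = np.argmin(((desc[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    enc = np.zeros_like(centroids)
    for k in range(len(centroids)):
        if (assign == k).any():
            enc[k] = (desc[assign == k] - centroids[k]).sum(axis=0)
    enc = np.sign(enc) * np.sqrt(np.abs(enc))            # power normalization
    enc = enc.flatten()
    return enc / (np.linalg.norm(enc) + 1e-7)

def style_similarity(img_a, img_b, centroids, pca: PCA):
    """Cosine similarity between PCA-whitened VLAD encodings of two paragraph images."""
    va = pca.transform(vlad(rootsift(img_a), centroids)[None])[0]
    vb = pca.transform(vlad(rootsift(img_b), centroids)[None])[0]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-7))
```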
4.3.3 Content Quality via HTR
Content preservation is commonly measured in terms of Character Error Rate (CER) with an HTR model comparing the target text with the generated text. The HTR model is trained on the genuine IAM training and test set.
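Concretely, the CER is the total edit distance between predicted and target transcriptions divided by the total number of target characters; a minimal sketch using the editdistance package:

```python
import editdistance

def cer(predictions, targets):
    """Character Error Rate over lists of predicted and target strings."""
    errors = sum(editdistance.eval(p, t) for p, t in zip(predictions, targets))
    chars = sum(len(t) for t in targets)
    return errors / max(chars, 1)
```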
4.4 Implementation Details
The experiments focus on the line and paragraph levels in the empirical evaluation because these are used in real-world scenarios. We compare against three state-of-the-art methods and use their implementations and pre-trained models for an unbiased evaluation: HiGAN+,Footnote 3 VATr,Footnote 4 and TS-GAN.Footnote 5 Note that VATr and HiGAN+ were mainly built for word-level handwriting generation and thus produce unrealistic text lines due to the stitching process. For priming the style, we avoid using information from the same document. HiGAN+ needs just one word as style information, which is the lowest amount of all approaches. Therefore, a word image from another document of the same writer is sampled and used as style input. For VATr, 15 word images are sampled from the other document, while TS-GAN gets a random line image as style information. Our approach works on paragraphs, so our model uses a paragraph from another document of the same writer as style input. Please refer to the appendix for a detailed overview of the parameters and settings.
4.5 Results
In this section, we analyse style and content preservation at the paragraph level. Additionally, we apply line segmentation (Kodym et al., 2021) to assess our method at a more granular line level, addressing concerns about layout patterns versus intended content and style nuances. Finally, we conclude with ablation studies analysing the performance using synthetically generated data to fine-tune an HTR model, the generalisation capabilities on out-of-distribution data, and different parts of the framework.
4.5.1 Qualitative Results
In a qualitative analysis, we let the models write the same text of a given paragraph in a specific writing style. Figure 3 showcases two samples picked at random. In contrast to state-of-the-art methods, our method shows a consistent writing style, which is closer to the given style input and thus also closer to the original genuine sample (see bottom of Fig. 3). A general problem among all approaches seems to be the wrong selection of glyphs, which are still not close enough to the style sample. Note that background artifacts around the paragraph and also around words stem from the IAM dataset; the model only reproduces this style pattern from the genuine data.
Fig. 4 UMap visualization of the five most present writers in the IAM test set, colour-coded in the plot. It shows that our generated samples (\(\times \)) are much closer to the genuine samples (\(\bullet \)) than those generated by the other methods (\(\blacksquare \), \(\blacktriangle \), \(\blacklozenge \)).
4.5.2 Style Preservation
Table 1 assesses style preservation on the paragraph level. First, we evaluate the writer identification task exclusively on the imitated images (Query: Synth & Key: Synth) to examine the consistency within the styles of the generated paragraphs. Second, to verify the authenticity of the preserved genuine style, we treated the generated images as queries and calculated their top-1 and mAP scores against the pool of real samples (Query: Synth & Key: Genuine). The top row shows the results on genuine IAM test data to validate the writer identification task as an evaluation metric. Additionally, we report the stitched IAM test data (IAM stitched) results to justify our stitching protocol, which was applied as post-processing for the comparison approaches. The similarly high IAM and IAM stitched results highlight that (1) our applied WI method is effective and (2) our stitching process does not influence the writer identification performance.
Our method significantly outperforms current state-of-the-art methods in both experiments (Q: Synth & K: Synth and Q: Synth & K: Genuine). Higher top-1 and mAP scores indicate that our generated handwriting is more stylistically consistent and closely matches the intended writer's style. VATr (Pippi et al., 2023a) performs well for the synthetically generated images but cannot preserve the style of the given input style image. At the same time, HiGAN+ (Gan et al., 2022) performs similarly well in both experiments. In addition to the baseline method, we evaluate the effect of different ranked resampling strategies. In particular, we rank the samples according to their performance in WI (\(\text {rank}_{\text {WI}}\)), HTR (\(\text {rank}_{\text {HTR}}\)), or both (\(\text {rank}_{\text {HTR+WI}}\)). As expected, the results show that ranked sampling using WI is especially beneficial for preserving the input style (see Query: Synth & Key: Genuine). The combination of WI and HTR is slightly worse.
Figure 4 offers an intuitive visualization of our results, showcasing the distribution of documents from the five most prolific writers in our dataset. We applied UMap dimensionality reduction (McInnes et al., 2020) to the L2-normalized global feature vectors obtained from the writer identification task. In this plot, each writer is distinguished by a unique colour, and the cluster centers are depicted as large, transparent circles. Surrounding these central points, the genuine test data samples, represented by smaller dots, tend to cluster closely. However, the representations generated by VATr (Pippi et al., 2023a) are mostly situated between the clusters of genuine writers, suggesting a less distinct association with any specific writer’s style. A similar observation can be made for TS-GAN (Davis et al., 2020), for which the low-dimensional representations tend to mix across the clusters of genuine writers. HiGAN+ (Gan et al., 2022) exhibits a somewhat better alignment in certain cases but struggles to accurately associate with the styles of the blue and red writers, indicating a partial success in style emulation. In contrast, the style vectors generated for our model’s paragraphs demonstrate a notably closer affiliation with the intended writers’ clusters, although with minor inaccuracies. For instance, a few blue samples are closer to the yellow cluster than their target blue centre.
Style preservation on the line level yields different results for the comparison approaches, as seen in Table 2. For FID and KID, VATr (Pippi et al., 2023a) outperforms the others, whereas TS-GAN (Davis et al., 2020) drops drastically in performance. Conversely, for the HWD metric, HiGAN+ (Gan et al., 2022) and TS-GAN (Davis et al., 2020) exceed VATr's performance. Our approach in combination with reranking for style and content achieves by far the best scores for FID, KID, and HWD on both the line level (\(\text {FID}_{\text {L}}\), \(\text {KID}_{\text {L}}\), HWD) and the paragraph level (\(\text {FID}_{\text {P}}\), \(\text {KID}_{\text {P}}\)). Note that for HWD, we used the provided line-level outputs from the other methods, while for our approach, we first applied line segmentation (Kodym et al., 2021) to extract individual lines from the generated paragraphs before computing HWD.
4.5.3 Content Preservation
Another crucial aspect of handwriting imitation is text preservation, which we assess using the CER. Table 2 shows CER results at the line level, with lower CER values indicating fewer transcription errors. While TS-GAN (Davis et al., 2020) achieves the lowest CER, it does so at the expense of poor style preservation, illustrating the trade-off between content accuracy and visual fidelity. In contrast, our approach maintains a low CER while significantly outperforming other methods in style preservation, demonstrating its ability to balance both aspects effectively.
We also evaluated our approach at the paragraph level, resulting in a \(\text {CER}_{\text {P}}\) of \(4.77\,\%\). However, problems arise when dealing with lines longer than 75 characters, where the \(\text {CER}_{\text {P}}\) rises to about \(30\,\%\). We argue that the HTR model cannot cope with the extreme downscaling of such line images.
4.6 Ablation Studies
4.6.1 Synthetic Data for Handwritten Text Recognition
One of the primary purposes of generative models is to use them for downstream tasks. Here, we evaluated the usefulness of different handwriting imitation approaches for creating synthetic data for training HTR systems. We used a pre-trained HTR model (synthetic fonts + real IAM train), fine-tuned it on synthetically generated CVL (Kleber et al., 2013) training data, and evaluated it on real \(\text {CVL}_{\text {test}}\). The CVL dataset was chosen because it is challenging for HTR. Table 3 demonstrates that our approach surpasses current methods. However, there is still a gap between genuine and synthetic data, suggesting that handwriting imitation must improve further before synthetic samples can replace real data or meaningfully extend it.
4.6.2 Style Generalisation Capabilities with Out-of-Distribution Data
We analysed how well the style is preserved on out-of-distribution data. We applied our IAM-trained models on CVL data, following the same evaluation protocol as for IAM. Table 4 and Table 5 show worse results on CVL than on IAM but still significantly better than other approaches. Additionally, top-1 and mAP are considerably worse on \(\text {CVL}_{\text {test}}\) than on \(\text {CVL}_{\text {train}}\). We hypothesize that the matching is much harder due to the larger test set and fewer samples per writer.
4.6.3 Encoder-Decoder Capabilities
Wordstylist (Nikolaidou et al., 2023) demonstrated great success when applying LDMs for word-level handwritten text generation. They utilized pre-trained weights from Stable Diffusion (Rombach et al., 2022)Footnote 6 for their encoder-decoder stage. We evaluate this approach and compare it to ours in Table 6. The reconstruction quality is first assessed using the standard metrics Mean Absolute Error (MAE) and Mean Squared Error (MSE). While these metrics show satisfactory numerical results, their practical significance for handwritten text reconstruction is limited. In traditional image processing tasks, small MAE and MSE differences indicate better pixel-wise similarity. However, in the context of handwriting, even minor pixel variations can significantly impact the legibility and stylistic fidelity of the text. To further investigate this, we apply both paragraph-based and line-based HTR models to the reconstructed samples. The models were trained on IAM's train and test data to decipher the different writing styles. The results reveal a significant increase in \(\text {CER}_{\text {P}}\) (paragraph-based) and \(\text {CER}_{\text {L}}\) (line-based) when using Stable Diffusion. When training a VAE from scratch directly on handwritten paragraphs with the default loss terms, we observe even higher CER values. By contrast, the performance improves when integrating the proposed writer and handwritten text recognition losses into training. This is supported by the qualitative analysis in Fig. 5, where 5(a) shows the original input to the encoder-decoder stage. Among the reconstructions without latent space modifications, 5(d) shows the closest resemblance to the original image despite a slight blurriness. The reconstruction from Stable Diffusion 5(b) alters certain characters, such as transforming the “cou” in “couple” into characters that more closely resemble “au”, making them challenging to read. This observation aligns with the quantitative findings, where the default VAE exhibits reconstructions with significantly reduced readability, mirroring the high CER values.
4.6.4 Cosine Scheduler and New-Line-Token-Free Variants
We investigated two alternative versions of our approach: one employing a cosine scheduler (Nichol & Dhariwal, 2021) to prioritize the general layout of text, and another omitting new-line tokens, leaving the model to determine line initiations autonomously. As shown in Table 7, both modifications exhibit a similar intrinsic synthetic style but outperform our main approach in reflecting the handwriting styles in the genuine data. However, these approaches come with a trade-off in terms of legibility. Employing the cosine scheduler increases the \(\text {CER}_{\text {L}}\) to just over \(7\,\%\), and removing new-line tokens leads to a \(\text {CER}_{\text {L}}\) nearing \(35\,\%\). Additionally, qualitative assessments of the new-line-token-free variant revealed tendencies of the model to duplicate or omit words.
4.6.5 Ranking Effect
In Fig. 6, we analyse the impact of different ranking strategies on the performance of our baseline method on writer identification (mAP) and (paragraph-based) HTR (\(\text {CER}_{\text {P}}\)). The genuine writer style (Fig. 6 left) is best preserved when using only WI feedback for ranking (\(\text {rank}_{\text {WI}}\)), achieving an mAP of above \(60\,\%\) for the top-ranked samples (rank 1). It is unaffected by HTR feedback (\(\text {rank}_{\text {HTR}}\)), which remains stable at a mean performance level of approximately \(55\,\%\) mAP. The combined ranking (\(\text {rank}_{\text {HTR+WI}}\)) positively impacts style outputs, though less effectively than \(\text {rank}_{\text {WI}}\), reaching an mAP of nearly \(60\,\%\) for the top rank.
Regarding content preservation (Fig. 6 right), applying \(\text {rank}_{\text {HTR}}\) notably improves the outcomes, reducing the CER to approximately \(4\,\%\) for the top rank. \(\text {rank}_{\text {HTR+WI}}\) also lowers the CER, though not as effectively as \(\text {rank}_{\text {HTR}}\), yet still achieving comparable results. In contrast, \(\text {rank}_{\text {WI}}\) maintains a CER between \(6\,\%\) and \(7\,\%\) across the different ranks. It is important to highlight that implementing a ranking strategy utilizing both HTR and WI feedback results in a significant improvement in mAP, approximately five percentage points above the mean, and a concurrent enhancement in \(\text {CER}_{\text {P}}\), approximately two percentage points better than the mean. Thus, this strategy strikes a meaningful balance between style and content preservation.
5 Discussion
Table 1 shows that the writer style characteristics are well preserved, especially for synthetic samples. But even when imitating real handwriting captured on genuine images, the model shows realistic results. This is further demonstrated in the UMap plot (Fig. 4), where our method produces samples much closer to the original ones. Although our method achieves excellent replication of the desired style, the target text occasionally contains duplicate or incorrectly swapped characters, a flaw not seen with alternative methods. However, the \(\text {CER}_{\text {L}}\) reported in Table 2 might not accurately represent the true CER: we computed a \(2.29\,\%\) \(\text {CER}_{\text {L}}\) on paragraphs reconstructed from the latent representations of the original images, which already sets a rather high baseline. Similarly, in the reconstruction quality results (Table 6), readability is lower than for genuine data processed with an omniscient HTR model; this sets the lower boundary for achievable readability.
5.1 Limitations with Out-of-Distribution Data
Challenges with out-of-distribution data primarily arise from two sources: the target text and the style image. For the target text, there is a small bias towards known words, which stems from a limited diversity in the paragraph training data. While large diffusion models are typically trained on millions of unique images, our training involved only 747 real images and \(\approx \!4000\) generated unique lines, which we permuted and stitched into a total of 50,000 synthetic paragraphs, where every image contained 3 to 13 lines and 5 to 101 characters per line. Additionally, the results degrade when the target text takes on uncommon paragraph forms, such as paragraphs containing only a single word or paragraphs with long lines exceeding 101 characters per line. We believe this issue arises because the current KL-regularized latent space representation does not adequately capture the overall rigid structure of handwritten paragraphs. As a result, the network struggles to generalize to rare out-of-distribution samples. This could be addressed by incorporating prior knowledge about the semantic structure of handwritten paragraphs into the regularization of the latent space. Alternatively, a simpler solution could involve incorporating additional synthetic and stitched real training data, with a particular focus on these edge cases. However, stitching real data can introduce new artifacts, such as overly regular layouts. For the style image, the distribution of out-of-distribution styles must align closely with the training data, particularly for real-world applications. The results in Section 4.6.2 show how the different methods adapt to this case. Here, we can see that the results decrease moderately for \(\text {CVL}_{\text {train}}\). We hypothesize that this could be due to the cleaner nature of the CVL data compared to the IAM data, which contains some artifacts, such as background gradients. The biggest drop occurs on \(\text {CVL}_{\text {test}}\), which could additionally stem from the fact that this dataset split has a large pool of unseen writers (283) across 845 paragraphs, making good top-1 and mAP results more challenging. This hypothesis is supported by the fact that the HWD stayed mostly consistent for our approach between \(\text {CVL}_{\text {train}}\) and \(\text {CVL}_{\text {test}}\).
5.1.1 Compute Time Comparison
Computational efficiency is a key factor when applying generative models in real-world scenarios. To evaluate inference speed, we measured the time required to generate CVL paragraphs on an NVIDIA A40 GPU. Our results confirm that GAN-based approaches significantly outperform diffusion-based models in speed. Among the tested methods, HiGAN+ is the fastest, generating a paragraph in 0.13 seconds, followed by TS-GAN (0.27 seconds) and VATr (0.28 seconds). However, when comparing our method to another diffusion-based approach, WordStylist (Nikolaidou et al., 2023),Footnote 7 we observe a substantial efficiency gain. WordStylist relies solely on the writer ID and does not need to extract the style from the image, requiring even fewer computations. Despite this, it still requires 13.44 minutes per paragraph due to its word-by-word generation process and a high number of sampling steps (600). In contrast, our approach reduces inference time to just 9.06 seconds per paragraph, achieving an average speed-up of roughly 91\(\times \), making it a more viable option for practical applications.
5.1.2 Possible Negative Implications
A more appealing and realistic imitation of handwritten text poses several risks, particularly the forgery of sensitive documents such as wills, contracts, or historical records. Beyond document fraud, such a model could be exploited for identity theft or the falsification of handwritten evidence. These risks underscore the importance of robust forensic tools to detect AI-generated handwriting.
To counteract this, we make our approach and code publicly available to enable building countermeasures for these types of forgeries. There are already some initial works in this direction (Carriére et al., 2023). Further efforts in watermarking, authentication protocols, and forensic handwriting analysis could enhance security.
6 Conclusion and Future Work
In this study, we introduce a method that is capable of producing realistic-looking and style-consistent handwritten paragraphs in unseen writing styles. The approach is based on a refined latent diffusion model. By incorporating additional loss terms during the encoder-decoder phase, we achieved notable enhancements in both reconstruction quality and compression efficiency. Furthermore, the integration of style features with text embeddings proved to be effective for conditioning the denoising U-Net, demonstrating a successful application of our approach. Additionally, by imitating handwriting at the paragraph level rather than word by word, we significantly improved generation speed, making our method more efficient for practical applications. Overall, our contributions not only advance the field of handwriting imitation but also hold the potential to benefit other document analysis tasks, particularly in scenarios characterized by limited data availability.
Looking ahead, several opportunities exist to further enhance the model’s performance and applicability. Future work should focus on improving the encoder-decoder stage. We hypothesise that a more compressed and structured latent space could enhance the generalisability and sampling speed of the diffusion model. Additionally, optimising this stage may help reduce artefacts, such as low-frequency gradients in the background. Another key direction is increasing the amount of training data, which is crucial for fully leveraging diffusion models and mitigating issues related to out-of-distribution data, as discussed in Section 5. To bridge the computational gap between diffusion models and GANs, future research should explore reducing the number of sampling steps in combination with a more compressed latent space while maintaining high-quality output.
Data Availability
This work does not propose new data.
Notes
The appendix and code give a detailed view of the architectures of the different models.
HWD: https://github.com/aimagelab/HWD. Note: We slightly adapted the computation of the FID and KID so that the input is split into non-overlapping patches instead of using just the first square patch of the input.
References
Aksan, E., Pece, F., & Hilliges, O. (2018). Deepwriting: Making digital ink editable via deep generative modeling. Conference on human factors in computing systems (p. 1-14). New York, NY, USA: Association for Computing Machinery.
Arandjelović, R., & Zisserman, A. (2012). Three things everyone should know to improve object retrieval. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 2911-2918). Providence.
Bhunia, A.K., Khan, S., Cholakkal, H., Anwer, R.M., Khan, F.S., & Shah, M. (2021). Handwriting transformers. Ieee/cvf international conference on computer vision (iccv) (pp. 1086-1094).
Bisio, A., Pedullà, L., Bonzano, L., Ruggeri, P., Brichetto, G., & Bove, M. (2016). Evaluation of Handwriting Movement Kinematics: From an Ecological to a Magnetic Resonance Environment. Front. Hum. Neurosci., 10, 1662–5161. https://doi.org/10.3389/fnhum.2016.00488
Bińkowski, M., Sutherland, D.J., Arbel, M., & Gretton, A. (2018). Demystifying MMD GANs. International conference on learning representations (iclr). Retrieved from https://openreview.net/forum?id=r1lUOzWCW
Carriére, G., Nikolaidou, K., Kordon, F., Mayr, M., Seuret, M., & Christlein, V. (2023). Beyond human forgeries: An investigation into detecting diffusion-generated handwriting. M. Coustaty & A. Fornés (Eds.), International conference on document analysis and recognition (icdar) workshops (pp. 5-19). Cham: Springer Nature Switzerland.
Chang, J.-H.R., Shrivastava, A., Koppula, H., Zhang, X., & Tuzel, O. (2022). Style equalization: Unsupervised learning of controllable generative sequence models. K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), International conference on machine learning (icml) (Vol. 162, pp. 2917-2937). PMLR. Retrieved from https://proceedings.mlr.press/v162/chang22a.html
Chen, J., Huang, Y., Lv, T., Cui, L., Chen, Q., & Wei, F. (2023). Textdiffuser: Diffusion models as text painters. A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in neural information processing systems (neurips) (Vol. 36, pp. 9353-9387). Curran Associates, Inc.
Christlein, V., Bernecker, D., & Angelopoulou, E. (2015). Writer identification using VLAD encoded contour-Zernike moments. International conference on document analysis and recognition (icdar) (pp. 906-910). Nancy.
Christlein, V., Bernecker, D., Hönig, F., Maier, A., & Angelopoulou, E. (2017). Writer Identification Using GMM Supervectors and Exemplar-SVMs. Pattern Recognit., 63, 258–267. https://doi.org/10.1016/j.patcog.2016.10.005
Christlein, V., & Maier, A. (2018). Encoding CNN activations for writer recognition. Iapr international workshop on document analysis systems (pp. 169-174). Vienna.
Dai, G., Zhang, Y., Wang, Q., Du, Q., Yu, Z., Liu, Z., & Huang, S. (2023). Disentangling writer and character styles for handwriting generation. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 5977-5986).
Davis, B., Tensmeyer, C., Price, B., Wigington, C., Morse, B., & Jain, R. (2020). Text and style conditioned GAN for generation of offline handwriting lines. British machine vision conference (bmvc). Retrieved from https://www.bmvc2020-conference.com/assets/papers/0815.pdf
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 248-255).
Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, & J.W. Vaughan (Eds.), Advances in neural information processing systems (neurips) (Vol. 34, pp. 8780-8794). Curran Associates, Inc.
Ding, H., Luan, B., Gui, D., Chen, K., & Huo, Q. (2023). Improving handwritten ocr with training samples generated by glyph conditional denoising diffusion probabilistic model. Retrieved from https://arxiv.org/abs/2305.19543
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16\(\times \)16 words: Transformers for image recognition at scale. International conference on learning representations (iclr). Retrieved from https://openreview.net/forum?id=YicbFdNTTy
Fogel, S., Averbuch-Elor, H., Cohen, S., Mazor, S., & Litman, R. (2020). Scrabblegan: Semi-supervised varying length handwritten text generation. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (p. 4323-4332).
Gan, J., Wang, W., Leng, J., & Gao, X. (2022). HiGAN+: Handwriting imitation GAN with disentangled representations. ACM Trans. Graph., 42(1), 1–17. Retrieved from https://doi.org/10.1145/3550070
Graves, A. (2014). Generating sequences with recurrent neural networks. Retrieved from https://arxiv.org/abs/1308.0850
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local Nash equilibrium. I. Guyon et al. (Eds.), Advances in neural information processing systems (neurips) (Vol. 30). Curran Associates, Inc.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (neurips) (Vol. 33, pp. 6840-6851). Curran Associates, Inc.
Ho, J., & Salimans, T. (2021). Classifier-free diffusion guidance. Advances in neural information processing systems (neurips) workshop on deep generative models and downstream applications. Retrieved from https://openreview.net/forum?id=qw8AKxfYbI
Huang, H., Yang, D., Dai, G., Han, Z., Wang, Y., Lam, K.-M., Yang, F., Huang, S., Liu, Y., & He, M. (2022). Agtgan: Unpaired image translation for photographic ancient character generation. Acm international conference on multimedia (p. 5456-5467). New York, NY, USA: Association for Computing Machinery. Retrieved from https://doi.org/10.1145/3503161.3548338
Huang, X., & Belongie, S. (2017). Arbitrary style transfer in real-time with adaptive instance normalization. Ieee international conference on computer vision (iccv).
Kang, L., Riba, P., Rusinol, M., Fornes, A., & Villegas, M. (2021). Content and style aware generation of text-line images for handwriting recognition. IEEE PAMI, 1–1. https://doi.org/10.1109/TPAMI.2021.3122572
Kang, L., Riba, P., Rusiñol, M., Fornés, A., & Villegas, M. (2022). Pay Attention to What You Read: Non-recurrent Handwritten Text-line Recognition. Pattern Recognit., 129, Article 108766. https://doi.org/10.1016/j.patcog.2022.108766
Kang, L., Riba, P., Wang, Y., Rusiñol, M., Fornés, A., Villegas, M. (2020). Ganwriting: Content-conditioned generation of styled handwritten word images. A. Vedaldi, H. Bischof, T. Brox, & J.-M. Frahm (Eds.), European conference on computer vision (eccv) (pp. 273-289). Cham: Springer International Publishing.
Kingma, D.P., & Welling, M. (2014). Auto-encoding variational bayes. International conference on learning representations (iclr).
Kleber, F., Fiel, S., Diem, M., & Sablatnig, R. (2013). CVL-Database: An Off-Line Database for Writer Retrieval, Writer Identification and Word Spotting. International conference on document analysis and recognition (icdar).
Kodym, O., & Hradiš, M. (2021). Page layout analysis system for unconstrained historic documents. J. Lladós, D. Lopresti, & S. Uchida (Eds.), Document analysis and recognition-icdar 2021 (pp. 492-506). Cham: Springer International Publishing.
Lee, J., Park, S., Baek, J., Oh, S.J., Kim, S., & Lee, H. (2020). On recognizing texts of arbitrary shapes with 2D self-attention. Ieee/cvf conference on computer vision and pattern recognition (cvpr) workshops.
Lowe, D.G. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60 (2), 91-110, https://doi.org/10.1023/B:VISI.0000029664.99615.94
Luhman, T., & Luhman, E. (2020). Diffusion models for handwriting generation.
Luo, C., Zhu, Y., Jin, L., Li, Z., & Peng, D. (2023). Slogan: Handwriting Style Synthesis for Arbitrary-length and Out-of-vocabulary Text. IEEE Trans. Neural Netw. Learn. Syst., 34(11), 8503–8515. https://doi.org/10.1109/TNNLS.2022.3151477
Marti, U.-V., & Bunke, H. (2002). The Iam-database: an English Sentence Database for Offline Handwriting Recognition. Int. J. Doc. Anal. Recog., 5(1), 39–46. https://doi.org/10.1007/s100320200071
Mattick, A., Mayr, M., Seuret, M., Maier, A., & Christlein, V. (2021). Smartpatch: Improving handwritten word imitation with patch discriminators. J. Lladós, D. Lopresti, & S. Uchida (Eds.), Document analysis and recognition (icdar) (pp. 268-283). Cham: Springer International Publishing.
Mayr, M., Stumpf, M., Nicolaou, A., Seuret, M., Maier, A., & Christlein, V. (2020). Spatio-temporal handwriting imitation. A. Bartoli & A. Fusiello (Eds.), European conference on computer vision (eccv) workshops (pp. 528-543). Cham: Springer International Publishing.
McInnes, L., Healy, J., & Melville, J. (2020). Umap: Uniform manifold approximation and projection for dimension reduction.
Nichol, A.Q., & Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. M. Meila & T. Zhang (Eds.), International conference on machine learning (icml) (Vol. 139, pp. 8162-8171). PMLR. Retrieved from https://proceedings.mlr.press/v139/nichol21a.html
Nikolaidou, K., Retsinas, G., Christlein, V., Seuret, M., Sfikas, G., Smith, E.B., Mokayed, H., & Liwicki, M. (2023). Wordstylist: Styled verbatim handwritten text generation with latent diffusion models. G.A. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Document analysis and recognition (icdar) (pp. 384-401). Cham: Springer Nature Switzerland.
Pippi, V., Cascianelli, S., & Cucchiara, R. (2023). Handwritten text generation from visual archetypes. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 22458-22467).
Pippi, V., Quattrini, F., Cascianelli, S., & Cucchiara, R. (2023). Hwd: A novel evaluation score for styled handwritten text generation. British machine vision conference (bmvc). BMVA. Retrieved from https://papers.bmvc2023.org/0007.pdf
Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A.,Chen, M., & Sutskever, I. (2021). Zero-shot text-to-image generation. M. Meila & T. Zhang (Eds.), International conference on machine learning (icml) (Vol. 139, pp. 8821-8831). PMLR. Retrieved from https://proceedings.mlr.press/v139/ramesh21a.html
Rezende, D.J., Mohamed, S., & Wierstra, D. (2014). Stochastic backprop agation and approximate inference in deep generative models. E.P. Xing & T. Jebara (Eds.), International conference on machine learning (icml) (Vol. 32, pp. 1278-1286). Bejing, China: PMLR. Retrieved from https://proceedings.mlr.press/v32/rezende14.html
Riaz, N., Saifullah, S., Agne, S., Dengel, A., & Ahmed, S. (2024). Stylusai: Stylistic adaptation for robust german handwritten text generation. E.H. Barney Smith, M. Liwicki, & L. Peng (Eds.), Document analysis and recognition-icdar 2024 (pp. 429-444). Cham: Springer Nature Switzerland.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (p. 10674-10685).
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. F. Bach & D. Blei (Eds.), International conference on machine learning (icml) (Vol. 37, pp. 2256-2265). Lille, France: PMLR. Retrieved from https://proceedings.mlr.press/v37/sohl-dickstein15.html
Song, J., Meng, C., & Ermon, S. (2021). Denoising diffusion implicit models. International conference on learning representations (iclr). Retrieved from https://openreview.net/forum?id=St1giarCHLP
Tang, L., Cai, Y., Liu, J., Hong, Z., Gong, M., Fan, M.,Han, J.,Liu, J., Ding, E., & Wang, J. (2022). Few shot font generation by learning fine-grained local styles. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 7895-7904).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polo sukhin, I. (2017). Attention is all you need. I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Wick, C., Zöllner, J., & Grüning, T. (2021). Transformer for handwritten text recognition using bidirectional post-decoding. J. Lladós, D. Lopresti, & S. Uchida (Eds.), Document analysis and recognition-icdar 2021 (pp. 112-126). Cham: Springer International Publishing.
Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W.,Cui, B., & Yang, M.-H. (2023). Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv., 56 (4),1-39 https://doi.org/10.1145/3626235
Zdenek, J., & Nakayama, H. (2021). Jokergan: Memory-efficient model for handwrit ten text generation with text line awareness. Acm international conference on multimedia (p. 5655-5663). New York, NY, USA: Association for Computing Machinery.
Zdenek, J., & Nakayama, H. (2023). Handwritten text generation with character specific encoding for style imitation. G.A. Fink, R. Jain, K. Kise, & R. Zanibbi (Eds.), Document analysis and recognition (icdar) (pp. 313-329). Cham: Springer Nature Switzerland.
Zhu, Y., Li, Z., Wang, T., He, M., & Yao, C. (2023). Conditional text image generation with diffusion models. Ieee/cvf conference on computer vision and pattern recognition (cvpr) (pp. 14235-14245).
Acknowledgements
The authors gratefully acknowledge the scientific support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU). The hardware is funded by the German Research Foundation (DFG).
Funding
Open Access funding enabled and organized by Projekt DEAL. Funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 416910787.
Author information
Contributions
All authors contributed to the study conception and design. Martin Mayr and Marcel Dreier performed the study analysis. Martin Mayr wrote the first draft of the manuscript, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
Code availability
The code can be accessed via the GitHub link provided above.
Additional information
Communicated by Faisal Shafait.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Implementation Details
1.1 A.1 Hyperparameters
Hyperparameters for the first-stage model are given in Table 8. The default VAE, referred to as "Vanilla VAE", is the one used in LDMs (Rombach et al., 2022). The initial feature dimension comprises 32 channels. The "channel multiplier" denotes the feature scaling from the outer to the inner blocks, each consisting of two ResnetBlocks; the encoder and decoder are mirrored. The latent space has shape (1, 96, 96), where 1 is the feature dimension and (96, 96) is the spatial dimension. We used dilation, erosion, and distortion in combination with noise, each with a probability of 0.3; note that erosion cannot be applied when dilation is applied and vice versa. "VAE w/ extra losses" is the VAE model that additionally employs a Handwritten Text Recognition loss and a Writer Identification loss. In preliminary experiments, \(w_{\text {HTR}}=0.3\) and \(w_{\text {WI}}=0.005\) yielded the best results.
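As an illustration of this augmentation policy, the following minimal Python sketch shows how the per-augmentation probability of 0.3 and the mutually exclusive dilation/erosion choice could be realized; the kernel size, the simple rotation used as a stand-in for the actual distortion, and the noise level are assumptions for illustration only, not the exact values from our configuration.

import random
import cv2
import numpy as np

P_AUG = 0.3  # per-augmentation probability (see text above)

def augment_paragraph(img: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of the first-stage augmentation policy."""
    kernel = np.ones((2, 2), np.uint8)   # kernel size is an assumption
    r = random.random()
    if r < P_AUG:                        # morphological dilation
        img = cv2.dilate(img, kernel)
    elif r < 2 * P_AUG:                  # morphological erosion, never combined with dilation
        img = cv2.erode(img, kernel)
    if random.random() < P_AUG:          # mild geometric distortion (stand-in for the actual warp)
        h, w = img.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), 1.0, 1.0)
        img = cv2.warpAffine(img, m, (w, h), borderValue=255)
    if random.random() < P_AUG:          # additive Gaussian noise
        noisy = img.astype(np.float32) + np.random.normal(0, 5, img.shape)
        img = np.clip(noisy, 0, 255).astype(np.uint8)
    return img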
Table 9 displays the hyperparameters of the HTR system for paragraphs. Similar to the VAE encoder, the feature extractor consists of ResnetBlocks. The initial channel size of 16 scales to 128 in the final block; 128 is also the model dimension (hidden size) of the transformer module, which consists of two encoder and four decoder layers. We added 3,000 synthetic samples to the training process and applied label smoothing of 0.4 to prevent overfitting. In addition, heavy augmentations are applied.
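A minimal sketch of the transformer part of this HTR system with the stated hyperparameters (model dimension 128, two encoder and four decoder layers, label smoothing of 0.4) is given below; the number of attention heads, the vocabulary size, and the omission of the causal decoder mask are simplifying assumptions.

import torch
import torch.nn as nn

D_MODEL, VOCAB_SIZE = 128, 90            # vocabulary size is an assumption

transformer = nn.Transformer(
    d_model=D_MODEL,
    nhead=4,                             # assumption, not stated in Table 9
    num_encoder_layers=2,
    num_decoder_layers=4,
    batch_first=True,
)
char_embedding = nn.Embedding(VOCAB_SIZE, D_MODEL)
classifier = nn.Linear(D_MODEL, VOCAB_SIZE)
criterion = nn.CrossEntropyLoss(label_smoothing=0.4)

# features: flattened (B, N, 128) feature map of the ResnetBlock extractor;
# targets: (B, L) character indices (teacher forcing, decoder mask omitted for brevity).
features = torch.randn(2, 576, D_MODEL)
targets = torch.randint(0, VOCAB_SIZE, (2, 40))
logits = classifier(transformer(features, char_embedding(targets)))
loss = criterion(logits.flatten(0, 1), targets.flatten())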
Hyperparameters of the VAE's writer identification model are described in Table 9. Like the VAE and HTR models, it uses ResnetBlocks, but with increased downsampling, resulting in a spatial shape of (6, 6). We also used 3,000 synthetic samples and label smoothing in combination with heavy augmentations. When evaluating this model, we achieved a top-1 accuracy of \(90\%\).
To speed up the training of the VAE, HTR, and WI models, we first trained them for 400 epochs on one- and two-line paragraphs, which provided good initial values for the main training on full paragraphs.
Table 10 shows the hyperparameters used for our default diffusion model, the variant without new-line tokens, and the variant with a cosine scheduler (Nichol & Dhariwal, 2021) instead of a linear one. We set the number of diffusion steps \(T\) to 1000. The models are pre-trained for 70k iterations on synthetic and real data; afterwards, the model is fine-tuned for 8k iterations on the real samples only. For the denoising U-Net, we applied similar values as in the LDM (Rombach et al., 2022). Gaussian noise, contrast, and brightness augmentations are used for improved results.
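For reference, the two noise schedules can be computed as follows. The linear endpoints shown are common DDPM defaults and serve only as placeholders, whereas the cosine schedule follows Nichol and Dhariwal (2021).

import torch

T = 1000  # number of diffusion steps (see text above)

def linear_betas(t_max: int = T, beta_start: float = 1e-4, beta_end: float = 2e-2) -> torch.Tensor:
    # Linear schedule; the endpoints are placeholder defaults, not our exact configuration.
    return torch.linspace(beta_start, beta_end, t_max)

def cosine_betas(t_max: int = T, s: float = 0.008) -> torch.Tensor:
    # Cosine schedule (Nichol & Dhariwal, 2021): betas derived from a squared-cosine
    # alpha-bar curve and clipped for numerical stability.
    steps = torch.arange(t_max + 1, dtype=torch.float64)
    alpha_bar = torch.cos(((steps / t_max) + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=0.999).float()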
Image ’f07-084a’ from IAM (Marti & Bunke, 2002): the style input (a) and the outputs of our different imitation approaches (b–d).
Image ’d04-032’ from IAM (Marti & Bunke, 2002): the style input (a) and the outputs of our different imitation approaches (b–d).
1.2 A.2 Model Architectures
In this section, we show the architectural changes we made.
Figure 7a shows the structure of the HTR model. Its architecture is similar to the HTR model of Kang et al. (2022); the main difference is the encoder, which is built from ResBlocks.
Furthermore, we extended the transformer blocks in the denoising U-Net with adaptive 2D positional encoding, see Fig. 7b. This is necessary to give the model a global understanding of the paragraph layout.
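The following is a minimal sketch of such an adaptive 2D positional encoding in the spirit of Lee et al. (2020): sinusoidal encodings along the height and width axes are scaled by input-dependent gates before being added to the feature map. The exact gating network is an assumption.

import math
import torch
import torch.nn as nn

def sinusoidal_1d(length: int, dim: int) -> torch.Tensor:
    # Standard sinusoidal encoding of shape (length, dim); dim must be even.
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class Adaptive2DPositionalEncoding(nn.Module):
    # Height/width sinusoidal encodings, each scaled by a learned, input-dependent gate.
    def __init__(self, dim: int):
        super().__init__()
        self.gates = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
            nn.Linear(dim, 2 * dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map inside the denoising U-Net
        b, c, h, w = x.shape
        pe_h = sinusoidal_1d(h, c).to(x).t().reshape(1, c, h, 1)  # (1, C, H, 1)
        pe_w = sinusoidal_1d(w, c).to(x).t().reshape(1, c, 1, w)  # (1, C, 1, W)
        alpha, beta = self.gates(x).view(b, 2, c, 1, 1).unbind(1)
        return x + alpha * pe_h + beta * pe_w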
Figure 7c displays the conditioning module, which consists of a Writer CNN and a Text Encoder. The former is pre-trained on writer labels. We use its latent representation after the last ResBlock, combined with adaptive 2D positional encoding (Lee et al., 2020), as keys and values for the cross-attention block in the conditioning stage; the queries are the positionally encoded text embeddings.
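A minimal sketch of this conditioning stage is given below: the text-token queries attend to the style-image features, which serve as keys and values. The embedding dimension, vocabulary size, and number of attention heads are assumptions, and the Writer-CNN features are assumed to be flattened and already augmented with the adaptive 2D positional encoding.

import torch
import torch.nn as nn

class ConditioningStage(nn.Module):
    # Text queries attend to style-image keys/values; the output conditions the U-Net.
    def __init__(self, dim: int = 256, vocab_size: int = 90, n_heads: int = 4):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, dim)
        self.cross_attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_ids: torch.Tensor, style_features: torch.Tensor) -> torch.Tensor:
        # text_ids: (B, L) target characters; style_features: (B, N, dim) flattened
        # Writer-CNN features with adaptive 2D positional encoding already added.
        queries = self.text_embedding(text_ids)  # plus 1D positional encoding in the full model
        cond, _ = self.cross_attention(queries, style_features, style_features)
        return cond  # (B, L, dim) conditioning sequence for the denoising U-Net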
Appendix B Additional Results
Figures 8, 9, 10 and 11 show qualitative results for the samples ’d04-032’, ’f07-084a’, ’f07-013’, and ’d06-041’ from the IAM database (Marti & Bunke, 2002). We show the style input, the genuine paragraph, and the synthetically generated paragraphs. ’d04-032’ has a very distinctive writing style. HiGAN+ (Gan et al., 2022) has problems replicating this style. VATr (Pippi et al., 2023a) and TS-GAN (Davis et al., 2020) generate better results but still do not preserve the writer’s style. Our approach comes closest but introduces a light background artefact.
’f07-084a’ is quite an unusual sample because the slant leans to the left rather than to the right. TS-GAN does not adapt to this style and stays closer to the previous style than to this one. VATr also struggles with this style. HiGAN+ is the best of the comparison approaches but makes some errors, e.g., a "," is often rendered as ",,". Moreover, the strokes of VATr and HiGAN+ are not smooth. By contrast, our model replicates the style quite well but omits an "l" in "Bouilla-baisse".
Our three diffusion variants in Figs. 12 and 13 reflect the trends shown in the main paper. The style is preserved similarly well by all variants, but the content often deviates from the target text for "Ours - Cosine" and especially for "Ours - no NL".
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Mayr, M., Dreier, M., Kordon, F. et al. Zero-Shot Paragraph-level Handwriting Imitation with Latent Diffusion Models. Int J Comput Vis 133, 7054–7075 (2025). https://doi.org/10.1007/s11263-025-02525-0