
Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

Yejin Lee    Alicia Golden    Anna Sun    Basil Hosmer    Bilge Acun    Can Balioglu    Changhan Wang    Charles David Hernandez    Christian Puhrsch    Daniel Haziza    Driss Guessous    Francisco Massa    Jacob Kahn    Jeffrey Wan    Jeremy Reizenstein    Jiaqi Zhai    Joe Isaacson    Joel Schlosser    Juan Pino    Kaushik Ram Sadagopan    Leonid Shamis    Linjian Ma    Min-Jae Hwang    Mingda Chen    Mostafa Elhoushi    Pedro Rodriguez    Ram Pasunuru    Samuel Hsia    Scott Yih    Sravya Popuri    Xing Liu    Carole-Jean Wu AI Research at Meta yejinlee@meta.com carolejeanwu@meta.com
(May 9, 2025)
Abstract

Generative artificial intelligence (AI) technology is revolutionizing the computing industry, posing new system design and optimization opportunities. In particular, AI’s ability to understand and respond in multiple modalities comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute a significant portion of inference latency due to the feed-forward networks in Transformer-based models. We demonstrate that state-of-the-art optimization levers, spanning from applications to system software and hardware, set a 3.88× better baseline.

Correspondence: Yejin Lee, Carole-Jean Wu

1 Introduction

Generative AI technologies are driving an unprecedented growth for the computing industry, introducing a new paradigm shift for AI. This technology redefines the interaction between humans and AI by enabling the creation of highly realistic images Sheynin et al. (2023), videos Singer et al. (2022); Girdhar et al. (2023), texts, and speech Communication et al. (2023), as well as intricate textual patterns or even new materials. Large language models (LLMs), such as ChatGPT OpenAI (2024), Llama Touvron et al. (2023a, b), or Gemini Team (2024b), demonstrate remarkable capabilities. LLMs not only enhance user experience by providing contextually relevant interactions but also play a critical role in automating complex tasks. It has already germinated a wide variety of applications, leading to higher productivity.

Beyond LLMs, multi-lingual speech translation and transcription models, such as Seamless Communication et al. (2023), Whisper Radford et al. (2022), or Translatotron Jia et al. (2019), are pivotal in breaking down language barriers and enhancing communication on a global scale. These speech models provide accurate, real-time translation and transcription across different languages by processing speech and text modalities together, supporting tasks such as Speech to Speech and Text (S-ST), Text to Speech and Text (T-ST), and Automatic Speech Recognition (ASR).

In addition to text and speech modalities, state-of-the-art AI technologies can take inputs of multiple modalities to serve multi-modal use cases. Taking Chameleon Team (2024a) as an example, this multi-modal foundation model can take images and text as input and generate outputs in either modality. Such models are the foundation of image editing and visual question-answering (VQA) use cases. These multi-modal models are also capable of generating images from text prompts and of supporting chatbot-style conversations.

Figure 1: Multi-modal generation tasks exhibit distinct system requirements across end-to-end inference latency, GPU utilization, memory capacity and computation requirement.

Beyond learning from text, language, speech, images, or videos, generative AI technologies are also adopted in deep learning recommendation systems. Leveraging the ability of Attention-based Transformers to automatically extract and learn features from datasets, recent deep learning recommendation models, such as HSTU Zhai et al. (2025) and TIGER Rajput et al. (2023), introduce a new feature generation paradigm by adopting sequential generative models. This new model architecture uses generative models to accurately predict items of interest. Generative recommendation models overcome the model quality saturation problem faced by existing deep learning recommendation models (DLRMs) Naumov et al. (2019), exceeding the prediction quality of prior recommender system technologies.

While investment is currently disproportionately focused on LLMs, generative AI technologies that are capable of processing multi-modal inputs and outputs are on the horizon. Depending on the distribution of input prompt lengths and use cases (Section 3.1) and the characteristics of model architectures (Section 3.2), the system design space for efficiency presents unique optimization opportunities. For example, recent work shows that training a state-of-the-art text-to-image model can use 14x more GPUs per model parameter than an industry-scale LLM Golden et al. (2024a). To efficiently accelerate multi-modal generation model inference, this paper provides an in-depth system performance characterization for important industry-scale generative AI tasks: language (Code Llama Rozière et al. (2024)), speech translation (Seamless Communication et al. (2023)), text and image generation (Chameleon Team (2024a)), and generative deep learning recommender systems (gDLRM Zhai et al. (2025)).

These models serve important roles in Meta’s workloads, with Llama functioning as our foundational large language model powering Meta AI’s core capabilities. Seamless delivers high-quality translation services across Meta platforms, enabling multilingual access to Instagram and Facebook content. Chameleon serves as a foundation model for multimodal generation, handling various combinations of text and image inputs/outputs, while HSTU helps drive recommendation systems, processing billions of recommendations daily with strict latency constraints.

Figure 2: Model architectures of (a) Code Llama Rozière et al. (2024), (b) Chameleon Team (2024a), (c) Seamless Communication et al. (2023), and (d) HSTU Zhai et al. (2025).

To sustainably scale generative AI technologies for a large, diverse variety of applications Wu et al. (2024), we must understand and enable AI deployment in a resource-efficient manner Wu et al. (2022). Figure 1 illustrates the system requirements for four multi-modal generation models at a single batch size. The chart highlights the latency requirement, overall memory capacity, communication requirement, and GPU utilization for different tasks across these models. It is evident that, depending on the input modalities and model architecture of a specific task, system resource utilization characteristics are distinct. For example, Chameleon can perform image-to-text (I-T), text-to-image (T-I), and image-text-to-text (IT-T) tasks without requiring fine-tuning for each task. However, the T-I task demands significantly higher resources across all four axes.

To scale advanced generative AI capabilities to billions of users in the world, inference must complete on the order of milliseconds and do so efficiently. The in-depth real-system performance characterization results in Section 3.2 guide the focus of inference performance and efficiency optimization. We take a step further to enable state-of-the-art inference performance optimization techniques — torch.compile and CUDA Graph for memory efficiency optimization Ansel et al. (2024), Scaled Dot Product Attention (SDPA) / Flash Attention to speed up the Transformer's key performance bottleneck Dao et al. (2022), and quantization to further improve compute density and memory bandwidth utilization. When these state-of-the-art optimization levers are enabled properly, inference performance over the important generative AI tasks can be improved by 3.8×, setting a new, more rigorous baseline. Beyond efficiently accelerating inference performance horizontally across the key generative AI tasks, in Section 4.3, we present ways to further improve inference efficiency with application-specific, algorithmic optimization. We enable LayerSkip Elhoushi et al. (2024), a self-speculative decoding approach, for our workloads to speed up generation and show that inference performance improves by 1.58×. The key contributions of this paper are as follows:

  • System Performance Characterization for Emerging Multi-Modal Generative AI Tasks This paper delivers an in-depth examination of system performance across four pivotal generative AI models: LLM (Code Llama), Speech Translation (Seamless), Generative Text and Image Models (Chameleon), and Generative Deep Learning Recommendation Models (gDLRM). Our analysis covers critical aspects, such as computational and memory bandwidth requirements, variations in input distributions and roofline analysis — key to inference performance efficiency optimization.

  • Optimized Baseline for Generative AI Inference Acceleration We demonstrate the importance of enabling state-of-the-art optimization methods — torch.compile, CUDA Graph, SDPA/Flash Attention, and quantization — that accelerate inference performance across the generative AI tasks by up to 28×. Algorithmic optimization — LayerSkip — improves inference performance by 1.58×. Altogether, cross-stack solutions, spanning algorithms and systems, improve inference performance by an average of 3.88×. We also highlight the performance impact of using a newer generation of GPUs by comparing the performance analysis across GPU generations.

  • Design Implications and New Directions for Future Systems We distill the implications of our findings for future research and development: 1) new solutions must improve upon stronger baselines; 2) with a proper understanding of the distinct characteristics and end-to-end inference pipeline of a given model, a 3.88× speedup can be achieved with state-of-the-art optimization levers; 3) enhancing the baseline with software optimization methods unlocks new possibilities for current and future hardware architectures.

2 Background and Motivation

2.1 Understanding the Lay of the Land for Multi-modal Generative AI Tasks

We provide an overview of key generative AI technologies. Figure 2 illustrates the model architectures for four generative AI models — LLM (Code Llama), Speech Translation (Seamless), Generative Text and Image Models (Chameleon), and Generative Deep Learning Recommendation Models (gDLRM). Table 1 summarizes the input/output modalities and sequence length distributions for the different workloads.

Category | Model | Auto-regressive | Notation | Tasks | Input Modality | Output Modality
Text-based LLM | Llama | Yes | T-T | Code Completion, Infilling, Instruction | Text | Text
Image & Text Generation | Chameleon | Yes | I-T | Image Captioning | Image | Text
 | | | T-I | Image Generation | Text | Image
 | | | IT-T | Visual Question Ans. | Image & Text | Text
Speech & Text Translation | Seamless | Only the text decoder | S-S | Speech-to-Speech Trans. | Speech | Speech
 | | | S-T | Speech-to-Text Trans. | Speech | Text
 | | | T-T | Text-to-Text Trans. | Text | Text
 | | | T-S | Text-to-Speech Trans. | Text | Speech
Generative DLRM | HSTU | No | H-A | Ranking and Retrieval | User History | Engagement Type (ranking) / Recommended Item (retrieval)
Table 1: The input and output modality of each task performed by the four multimodal generative models: LLM (Llama), speech & text translation (Seamless), text & image generation (Chameleon), and generative DLRM (HSTU).

2.1.1 Llama for Language Generation

Code Llama is a large language model for coding tasks. Code Llama models are trained on a wide range of input sequence lengths to be able to handle varying sizes of code snippets. For example, Code Llama supports sequence lengths of up to 100,000 tokens, which is enough to capture a reasonably sized code snippet while keeping the computation feasible.

Code Llama has a standard Transformer architecture Vaswani et al. (2023), as shown in Figure 2(a). In this paper, we take Code Llama as the representative language generation model and refer to it as Llama from now on for convenience. The model consists of an embedding layer followed by a stack of Transformer decoder blocks, each containing an attention layer and a feed-forward layer. Specifically, the Llama 34B model has 48 Transformer decoder blocks.

Llama is an autoregressive generation model whose inference pipeline is broken down into two phases: prefill (Prefill) and incremental decoding (Decoding). Prefill processes the entire input prompt of length $N$ at once, computing attention across all input tokens ($O(N^2)$ complexity), whereas Decoding generates output tokens one by one based on previously generated tokens. These phases present different computational characteristics: Prefill has high computational intensity because it processes the full input sequence of length $N$, while Decoding is computationally lighter thanks to the KV cache optimization that stores and reuses key-value pairs, though frequent cache access makes Decoding memory-bound.
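To make these two phases concrete, the following is a minimal sketch of a prefill-plus-incremental-decoding loop with a KV cache in PyTorch; the model interface (a forward pass that accepts and returns per-layer key-value states) and the greedy sampling are illustrative stand-ins, not the actual Llama implementation.

import torch

@torch.inference_mode()
def generate(model, prompt_ids, max_new_tokens):
    # Prefill: attend over the full prompt of length N once (O(N^2))
    # and populate the per-layer key/value cache.
    logits, kv_cache = model(prompt_ids, past_kv=None)
    next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy for simplicity
    generated = [next_token]

    # Decoding: each step feeds only the newly generated token and reuses the
    # KV cache, so the per-step compute is small and the phase is memory-bound.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model(next_token, past_kv=kv_cache)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

    return torch.cat([prompt_ids] + generated, dim=-1)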

2.1.2 Chameleon for Text and Image Generation

Chameleon is a foundational model in the family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text. It can perform a broad range of tasks, including visual question answering, image captioning, text and image generation, and long-form mixed-modal generation, in a single model. The model architecture of Chameleon largely follows Llama-2 Touvron et al. (2023b), as shown in Figure 2(b); thus, Chameleon is also an auto-regressive generation model. For normalization, Chameleon uses RMSNorm Zhang and Sennrich (2019); it also uses the SwiGLU Shazeer (2020) activation function and rotary positional embeddings (RoPE) Su et al. (2023).

Chameleon represents images, text, and code modalities as discrete tokens and uses a uniform Transformer-based architecture that is trained from scratch in an end-to-end fashion on around 10T tokens of interleaved mixed-modal data. Chameleon can take any combination of image and text and utilizes an image tokenizer Gafni et al. (2022) and a text tokenizer Sennrich et al. (2016), respectively, to generate the tokens fed to the model. For text generation, the generated tokens are decoded by the text tokenizer to produce readable text. For image generation, Chameleon generates 1024 image tokens and then detokenizes them using an image detokenizer to produce images in a human-interpretable format such as JPEG. Chameleon also uses a contrastive decoding method specifically for the T-I task, which aims to maximize the differences between a weak and a strong model: logits from conditioned outputs are treated as the strong model, while unconditional logits are treated as the weak model. As a result, Chameleon decodes twice at each time step for the T-I task.
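The contrastive decoding step for T-I can be summarized by the sketch below, where the conditional (prompt-conditioned) logits act as the strong model and the unconditional logits as the weak model; the blending rule and the guidance weight alpha are illustrative assumptions rather than Chameleon's exact formulation.

import torch

def contrastive_next_token(model, cond_ids, uncond_ids, kv_cond, kv_uncond, alpha=3.0):
    # Two forward passes per decoding step: one conditioned on the text prompt
    # (strong model) and one unconditional (weak model).
    strong_logits, kv_cond = model(cond_ids, past_kv=kv_cond)
    weak_logits, kv_uncond = model(uncond_ids, past_kv=kv_uncond)

    # Push the sampling distribution toward tokens the strong model prefers
    # over the weak model.
    scores = weak_logits[:, -1] + alpha * (strong_logits[:, -1] - weak_logits[:, -1])
    probs = torch.softmax(scores, dim=-1)
    return probs.multinomial(num_samples=1), kv_cond, kv_uncond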

2.1.3 Seamless for Speech Translation

Seamless Communication et al. (2023) is a family of speech translation models that enable more natural and authentic communication across languages. SeamlessM4T is the foundation model for multilingual multimodal machine translation supporting around 100 languages. SeamlessM4T achieves state-of-the-art semantic accuracy, supports a wide range of languages, and provides multitasking capabilities from and into text or speech.

SeamlessM4T, which we refer to as Seamless in this paper, consists of multiple pretrained blocks that are finetuned as a unified model. The four main building blocks are shown in Figure 2(c):

  • Conformer Speech Encoder (Blue) A speech representation learning model that leverages unlabeled speech audio data.

  • Text-to-Text Translator (T2TT) (Pink) A text-to-text translation model pre-trained on NLLB data in nearly 100 languages. It is the only autoregressive module in Seamless.

  • Non-autoregressive (NAR) T2U (Green) NAR T2U is a text-to-unit sequence-to-sequence module.

  • Vocoder (Orange) A HiFi-GAN unit vocoder that converts generated units into a waveform output, where a unit represents speech by combining different aspects such as phonemes and syllables.

Seamless utilizes a different set of modules according to the task it is performing. For text generation tasks, such as S-T and T-T, the conformer speech encoder and T2TT modules are utilized. For speech generation tasks, such as T-S and S-S, NAR T2U and the Vocoder are additionally activated, and the translated text output from T2TT is fed as input to NAR T2U.

Model | Dataset | Input Modality | Input Min | Input Max | Input Avg | Output Modality | Output Min | Output Max | Output Avg | Decode Step Count | Avg. Time (ms)
Llama | HumanEval | Text | 44 | 430 | 154 | Text | 55 | 10000 | 692 | 538 | 4494
Llama | MBPP | Text | 29 | 1748 | 59 | Text | 38 | 10000 | 1076 | 1016 | 5567
Seamless | Fleurs Eng-Spa | Speech | 179 | 1464 | 493 | Speech | 129 | 1029 | 385 | 35 | 1578
Seamless | Fleurs Eng-Spa | Speech | 179 | 1464 | 493 | Text | 15 | 98 | 36 | 35 | 1321
Seamless | Fleurs Eng-Spa | Text | 12 | 80 | 31 | Speech | 145 | 1030 | 393 | 34 | 1432
Seamless | Fleurs Eng-Spa | Text | 12 | 80 | 31 | Text | 14 | 95 | 35 | 34 | 1187
Chameleon | MSCOCO | Image | 1030 | 1030 | 1030 | Text | 30 | 30 | 30 | 30 | 2913
Chameleon | Vizwiz | Image & Text | 1033 | 1095 | 1040 | Text | 10 | 10 | 10 | 10 | 1253
Chameleon | MSCOCO | Text | 10 | 22 | 13.9 | Image | O(1025) | O(1025) | O(1025) | 1024 | 159702
HSTU | Synthetic | User History | 4507 | 5121 | 4814 | Action | 4507.0 | 5121.0 | 4813.9 | N/A | 50
Table 2: Sequence length distribution of the four generative AI models. We use 5 samples for each workload.

2.1.4 gDLRM for Generative Recommendations

Generative recommenders approach information retrieval and recommendation problems by modeling the underlying joint distribution of user-item interactions and adopting homogeneous, large-scale sequential backbones to replace the traditional heterogeneous modules in DLRMs. gDLRMs enable the main tasks in recommendations, namely retrieval (predict the next item to recommend) and ranking (predict the engagement type given the retrieved item), to be formulated as a next-token prediction problem. We refer to both types of outputs, for the retrieval and ranking tasks, as "Action". Compared to prior DLRMs Naumov et al. (2019); Gupta et al. (2020); Wu et al. (2020); Hsia et al. (2023), gDLRMs have demonstrated superior accuracy Zhai et al. (2025) and further enable a unified feature space to be used across different domains.

One key sequential architecture used in the generative recommender system — HSTU — can be viewed as a variant of self-attention or Transformers specialized for sequence-to-sequence (sequential transduction) tasks. HSTU is composed of a stack of identical layers connected by residual connections He et al. (2016), as shown in Figure 2(d). Each layer consists of three main sub-layers: Point-wise Projection, Spatial Aggregation, and Pointwise Transformation. Spatial Aggregation replaces the sequence-level normalized Softmax with pointwise normalized attention and relative attention bias, while Point-wise Projection together with Pointwise Transformation performs efficient token-level transformation augmented by element-wise gating. This reduces the number of matrix multiplication operations relative to standard Transformers. In general, training throughput can be significantly improved through feature deduplication optimization Zhao et al. (2023). Note that HSTU is the only non-autoregressive model among the generation tasks studied in this paper.
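The sketch below illustrates the flavor of a softmax-free, pointwise-normalized attention block with element-wise gating; it is a simplification (causal masking and relative attention bias are omitted, and the SiLU nonlinearity and normalization constant are assumptions), not the actual HSTU layer.

import torch
import torch.nn.functional as F

def pointwise_attention_layer(x, wq, wk, wv, wu):
    # x: (N, d) token embeddings for one user-history sequence.
    n = x.shape[0]
    q, k, v, u = x @ wq, x @ wk, x @ wv, x @ wu   # point-wise projections

    # Spatial aggregation with pointwise-normalized attention:
    # an elementwise nonlinearity divided by sequence length replaces softmax.
    scores = F.silu(q @ k.t()) / n                 # (N, N)
    aggregated = scores @ v

    # Pointwise transformation with element-wise gating.
    return F.silu(u) * aggregated

# Example usage with random weights (hypothetical dimensions).
d = 64
x = torch.randn(128, d)
out = pointwise_attention_layer(x, *(torch.randn(d, d) for _ in range(4)))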

3 System Performance Characterization on Multi-Modal Model Inference

We present real-system performance characterization results for the key generative AI tasks in this section. A deeper understanding of application-level characteristics and performance bottlenecks on real systems helps guide our performance and efficiency optimization focus systematically. The data-driven analysis also underpins key system design and optimization opportunities, as we show later in Section 3.2.

Figure 3: Latency Distribution of each workload.

3.1 Sequence Length and Latency Distribution

Sequence length is a key task-specific dimension that determines where the most important performance acceleration opportunities come from. The sequence length distribution also affects the computational efficiency of generative models. For instance, models with shorter sequence lengths require less computation time to generate samples than models with longer sequence lengths. Transformer-based models, in particular, are highly sensitive to sequence length distributions due to their attention operation, whose computational cost increases quadratically (i.e., $O(N^2 d)$, where $N$ and $d$ denote sequence length and embedding dimension, respectively). In Table 2, we delve into the sequence length distribution for the four different generative AI models.

In Figure 3, we show the end-to-end inference latency distribution to illustrate the correlation between sequence length and latency. We measure the inference latency of each sample with a batch size of 1 on an NVIDIA A100 GPU to obtain the latency distribution. Based on our analysis, the latency distribution is highly correlated with the sequence length distribution; we discuss this correlation in detail below. By understanding the sequence length and latency distributions and their correlation, our goal is to better understand what determines the differing system performance of the four generative AI models and to guide their optimization by reasoning about the trade-off between sample length and computational efficiency.

Llama: For the Llama-based coding tasks (Code Llama), we focus on the coding capabilities of AI using the HumanEval Chen et al. (2021) and MBPP Austin et al. (2021) datasets. The input prompts describe programming problems in text, such as "Write a python function to find the first repeated character in a given string.". We define the input sequence length of Code Llama as the number of text tokens fed into the model, whereas the output sequence length is the number of text tokens generated by the model. In general, the input sequence length for MBPP is on the order of tens of tokens, while the input sequence length for HumanEval is in the hundreds. This is because the HumanEval dataset gives more detailed problem constraints with simple examples in the input prompts. The output sequence lengths for HumanEval are on the order of hundreds, since the solutions for these datasets are quite simple and can typically be written in a few lines of code (around 10 lines in general).

In Figure 3, we report the latency distributions for the HumanEval and MBPP datasets. Overall, MBPP has longer end-to-end latency than HumanEval, as the number of decoding steps is the key factor determining end-to-end latency, which is discussed in more detail in Section 3.2, Observation #1. T-T tasks have the widest latency distribution among all tasks, since end-to-end latency is highly correlated with the sequence length and decoding step distributions. The standard deviation is a representative metric of how broadly the values are distributed, and T-T tasks have the largest standard deviation for both input sequence lengths and decoding steps.

Seamless: For sequence length analysis of Seamless, we focus on the Fleurs Conneau et al. (2022) dataset which contains the speech version of the FLoRes Goyal et al. (2021) machine translation benchmark in 102 different languages. This dataset is used for a variety of speech tasks, including automatic speech recognition (ASR), speech language identification, translation and retrieval.

The input sequence for the Seamless M4T model is generated by extracting 80-dimensional filterbank features from the raw audio waveform at a 100 Hz frame rate and stacking every 2 frames to obtain the final 160-dimensional, 50 Hz features. These extracted features (dimension 160 for Seamless M4T) become the model input, and the number of feature frames becomes the sequence length of the model. The input sequence length statistics are for the speech encoder in the case of the S-T and S-S tasks and for the text encoder in the case of the T-T and T-S tasks. For the output sequence length statistics, we report the output sequence lengths of the text decoder module for text generation tasks (S-T, T-T) and of the NAR T2U module for speech generation tasks (S-S, T-S).
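A sketch of this kind of front-end is shown below using torchaudio's Kaldi-compatible filterbank routine; the exact feature pipeline used by Seamless may differ, and the waveform here is a random placeholder.

import torch
import torchaudio

# Placeholder 10-second mono waveform at 16 kHz (in practice, load a Fleurs audio file).
waveform = torch.randn(1, 16000 * 10)

# 80-dimensional log-mel filterbank features at a 10 ms frame shift (~100 Hz frame rate).
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=16000.0,
)  # shape: (num_frames, 80)

# Stack every 2 consecutive frames -> 160-dimensional features at ~50 Hz.
num_frames = fbank.shape[0] - fbank.shape[0] % 2
stacked = fbank[:num_frames].reshape(-1, 160)

# The number of stacked frames is the input sequence length seen by the speech encoder.
print(stacked.shape)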

The output sequence of Seamless is specific to the corresponding task. For text generation tasks (S-T, T-T), we define the output sequence as the text tokens generated by the T2TT module. For speech generation tasks (S-S, T-S), we define the output sequence as the units generated by the NAR T2U module. Furthermore, we take English-to-Spanish translation as our analysis use case since it is one of the most frequently used combinations for the translation task. We use the en_us and es_419 subsets of the Fleurs dataset for the English source language and the Spanish target language, respectively. The average duration of the en_us input speech files is around 9.88 sec, resulting in an average input sequence length of 986 for the speech modality and 30 for the text modality.

In Seamless, text generation tasks utilize only the conformer speech encoder and T2TT modules, while speech generation tasks additionally run NAR T2U and the vocoder. Thus, speech generation tasks generally take longer than text generation tasks. In our analysis, S-S tasks are 24% slower than S-T tasks and T-S tasks are 20% slower than T-T tasks on average in terms of inference latency.

Chameleon: For the Chameleon-based multi-modal tasks, we focus on the widely used MSCOCO Lin et al. (2015) and Vizwiz Gurari et al. (2018) datasets for the I-T/T-I and IT-T tasks, respectively.

  • Image to Text (I-T) tasks: Chameleon uses newly trained image tokenizers Gafni et al. (2022) and the BPE tokenizer Sennrich et al. (2016) for the image and text input modality, respectively. The I-T task uses 1030 tokens, combining 1024 image tokens with 6 prompt tokens (e.g., "Describe the figure") for caption generation.

  • Image/Text to Text (IT-T) tasks: For the IT-T generation task, a representative use case is Visual Question Answering (VQA), which generates a response given an image and a question about the image, such as "Can you tell me what this image is about?". In this case, the input sequence combines 1024 image tokens with additional tokens for the question text. Taking the Vizwiz dataset as an example, the number of text tokens for the questions ranges from 3 to 65.

    Note that I-T and IT-T tasks maintain fixed decoding steps by using maximum output lengths and task-specific templates for prediction extraction.

    In Figure 3, I-T tasks have a longer output length (30) than IT-T (10), since image captioning requires more words than VQA (visual question answering) answers, which are typically brief responses like "Q: Which one is the blue one? A: Right" or "Q: What color is this? A: White".

  • Text to Image (T-I) tasks: For the T-I generation task, an instruction to generate an image, such as "An upstairs living room is decorated nicely and holds a sewing machine.", is given as the text input prompt. Thus, the input sequence length is determined by the number of text tokens produced by the text tokenizer. We use the MSCOCO Lin et al. (2015) dataset, for which the average input sequence length is 13.9.

    In Figure 3, T-I tasks have the longest latency among all tasks. Even though T-I tasks have shorter input sequence lengths, the number of decoding steps is the highest (1024), resulting in the longest latencies. Also, as mentioned in Section 2.1.2, Chameleon uses a contrastive decoding method for the T-I task, so it runs the model twice at each incremental decoding step.

gDLRM: For recommendation tasks with feature generation (HSTU), we focus on a synthetically generated dataset, where a sequence of user history is randomly generated. We generated 16,384 inference samples, where each sample's sequence consists of random integer indices ranging from 0 to 6,000. The synthetically generated sequence lengths are configured to represent the distribution observed in the production environment, as mentioned in Zhai et al. (2025).

Also, as mentioned in Section 2.1.4, HSTU is composed of a stack of 14 identical layers, but the maximum input sequence length for the last 11 layers is limited to 1024 to improve speed.

Figure 4: Operator Time Breakdown of Llama Rozière et al. (2024), Seamless Communication et al. (2023), Chameleon Team (2024a), HSTU Zhai et al. (2025). P stands for the Prefill stage and D stands for the Decoding stage.

Based on the fact that sequence length distribution is unique to each generative AI task, in the next section, we delve into understanding where inference latency comes from.

3.2 Operator Time Breakdown

System performance bottlenecks are distinct across the key generative AI tasks: Llama (Code Llama), Chameleon (CM3), Seamless, and generative DLRM (HSTU). Depending on the input and output modality types and the corresponding sequence lengths, system performance optimization opportunities vary.

Figure 4 presents the end-to-end model inference time breakdown for Llama, Seamless, Chameleon and gDLRM. We characterize the inference time by maximizing the batch size for each workload to fit in the HBM memory capacity (i.e., 80GB) of a single NVIDIA A100 GPU NVIDIA (2020). We report the averaged breakdown result for 5 samples after 15 iterations of warmup for Code Llama, Seamless, Chameleon and HSTU (3 samples for T-I task of Chameleon model). For detailed description of the codebase and environment setup, please refer to Section 4.

The four generative AI models consist of different sets of operators. "Idle" indicates time during inference when the GPU sits idle because of GPU kernel launch overhead on the CPU side. We separate the prefill and decoding stages for Llama and Chameleon to better understand the distinct characteristics of each stage, and we show the inference time normalized to the end-to-end prefill time of Llama on top of each bar. Note that we exclude the embedding table lookup time of HSTU, given that DLRM serving disaggregates embeddings from the main model itself.

Observation #1 The auto-regressive nature of token generation in Llama and CM3 makes token generation (decoding) a performance-critical phase that is primarily determined by the number of decoding steps, whereas HSTU inference is much faster and does not depend on token generation.

For the autoregressive generative models (Llama, Seamless, and Chameleon), the number of decoding steps matters the most to end-to-end latency. As these models generate tokens sequentially, a larger number of decoding steps prolongs the generation process. For example, the I-T and IT-T tasks have similar average input sequence lengths, while the I-T task has 3 times as many decoding steps according to Table 2; this results in longer end-to-end inference latency, as shown in Figure 4. Also, the T-I task in the Chameleon model takes the longest latency per inference sample because the image generation process involves 1024 decoding steps to produce a single image, significantly more than the number of steps required by other tasks. This results in the longest latency per inference sample among the four models. In addition, Llama has longer latency than the I-T and IT-T tasks of Chameleon even though the input sequence lengths for Llama are much smaller (up to 13×). A primary cause is that Llama has a higher number of decoding steps, which increases end-to-end latency.

Considering that the prefill stage is only performed once while the incremental decoding stage is repeated multiple times, the number of decoding steps has a more significant impact on end-to-end inference latency than the input sequence length of the prefill stage when the number of decoding steps is non-trivial.

On the other hand, non-autoregressive models generate all tokens simultaneously rather than sequentially, so they can be significantly faster than autoregressive models. This is particularly beneficial for long sequences or when real-time performance is crucial, and it can lead to a better user experience. HSTU Zhai et al. (2025) demonstrates the potential benefits of non-autoregressive models.

Observation #2 The inference time of autoregressive models is often dominated by the GPU idle time, indicating that these models depend heavily on CPU-bound modules.

We observed a significant gap incurred by CPU overhead that delayed the launch of GPU kernels, resulting in GPU underutilization and a substantial increase in the execution time especially for Llama and Chameleon.

Seamless and HSTU have relatively higher GPU utilization than Llama and Chameleon. For Seamless, the Speech/Text Encoder and Text Decoder are always activated, while NAR T2U and the Vocoder are selectively activated depending on the task. Among the four modules, only the text decoder is autoregressive, meaning that only this module operates on matrices of sequence length 1 (except during Prefill), while the remaining modules operate on matrices sized by the full input sequence length. Thus, the overall GPU utilization of Seamless is higher than that of Llama and Chameleon, since it has only one autoregressive module. For HSTU, the input sequence length is much larger (4813.9 × batch size) than for the other models according to Table 2, so the GPU spends much more time on computation, resulting in high GPU utilization.

To address the CPU-bound issue, optimization techniques such as torch.compile and CUDA Graph can significantly reduce the GPU kernel launch overhead. The latency improvement results of torch.compile and CUDA Graph are provided in Section 4.1.2.

Observation #3 Across all workloads, linear operations constitute a portion of the overall model inference latency comparable to the attention operations, due to the Feed Forward Networks (FFNs) in Transformer-based models.

For Llama and Chameleon, the Linear operation dominates the end-to-end inference time. For Seamless, the linear operation takes a portion of the inference time comparable to the attention operation. For HSTU, the attention operation dominates the inference time, unlike the other models. This is because the computation cost of the attention operation grows quadratically ($O(N^2)$) with the input sequence length, and the input sequence length of HSTU is much larger than that of the other generative models, as shown in Table 2.

Generally, the linear operations take a significant amount of the inference time; thus, accelerating linear layer operations can bring larger improvements to end-to-end latency than accelerating attention operations. In Section 4.2, we delve more deeply into inference acceleration using different numeric precision levels, examining their effect on linear operation performance and output quality.

Observation #4 The KV cache reordering operation, which is necessary for the beam-search-based decoding strategy, dominates Seamless inference time.

Autoregressive models perform incremental decoding steps based on the decoding strategy that the model adopts. A decoding strategy is a sampling method used to choose the next token based on the output probability distribution over the vocabulary. Popular decoding strategies include deterministic methods, such as greedy and beam search, and stochastic (sampling) methods, such as top-p, top-k, and random sampling. Llama and Chameleon use the top-p decoding strategy, and Seamless uses beam search. Beam search is widely used for closed-form generation tasks such as translation, because sampling-based decoding strategies are too stochastic, which often leads to a worse semantic match between the predicted and reference sequences.

Beam search maintains a beam of the K best sequences so far and considers the probabilities of the combination of all preceding words along with the word in the current position. It maintains a separate copy of the KV cache for each sequence and needs to reorder the KV caches of all attention layers according to the sequences selected in the previous decoding step, so that each selected beam operates on the corresponding KV cache. This step requires copying the entire KV cache into a new memory space, accounting for a significant portion of the inference runtime. This can be further optimized with torch.compile, and we discuss the torch.compile case study for Seamless in Section 4.1.2.
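To illustrate why this step is costly, the sketch below reorders a hypothetical per-layer KV cache with index_select after each beam-search step, allocating and copying new tensors every time; it mirrors the kv_cache.index_select(new_beams) pattern discussed in Section 4.1.2.

import torch

def reorder_kv_cache(kv_cache, beam_indices):
    # kv_cache: list of (key, value) pairs, one per attention layer,
    # each of shape (num_beams, num_heads, seq_len, head_dim).
    # beam_indices: (num_beams,) indices of the beams selected at this step.
    reordered = []
    for key, value in kv_cache:
        # index_select allocates a fresh tensor and copies every layer's cache,
        # which is why KV_Cache_Reorder is prominent in the runtime breakdown.
        reordered.append((key.index_select(0, beam_indices),
                          value.index_select(0, beam_indices)))
    return reordered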

Model & Codebase | Task | Dataset | Max. Batch Size | # of Samples
Llama Ben Allal et al. (2022) | T-T | HumanEval | 4 | 164
Chameleon Team (2024a) | I-T | MSCOCO | 16 | 5000
Chameleon Team (2024a) | IT-T | Vizwiz | 16 | 4319
Chameleon Team (2024a) | T-I | Coco Img. | 16 | 500
Seamless Meta (2023) | S-S & S-T | Fleurs | 128 | 643
Seamless Meta (2023) | T-T & T-S | Fleurs | 384 | 643
HSTU Zhai et al. (2025) | H-A | Synthetic | 32 | 16384
Table 3: Datasets, codebase and batch size configuration.

4 Accelerating Multi-Modal Model Inference via Cross-Stack Optimization

In this section, we highlight the importance of enhancing inference performance by taking into account state-of-the-art system optimization techniques as well as algorithmic advancement. There are (1) horizontal system-level optimizations and (2) vertical workload-specific optimization techniques.

  • System-level techniques optimize inference time performance horizontally across the generative AI tasks while being agnostic to specific algorithms. We consider Scaled Dot Product Attention (SDPA), torch.compile, and CUDA graph optimization (CUDA Graph). SDPA leverages highly optimized, fused implementations to reduce the number of kernel launches and intermediate data transfers, which contributes to lower latency and memory usage. torch.compile and CUDA Graph streamline GPU task scheduling and execution, optimizing parallelism and resource utilization on the given system hardware. We also deploy quantization using the PyTorch AutoQuant framework PyTorch (2024) in Section 4.2, which automates the tuning process by determining the most efficient quantization method for each layer.

  • Workload-specific techniques optimize design objectives by tailoring to algorithm- or neural network (NN)-specific characteristics. Taking a recent NN optimization technique — LayerSkip Elhoushi et al. (2024) — tailored to Transformer-based large language models, we evaluate the impact of LayerSkip across the generative AI tasks in Section 4.3.

Methodology Detail: Table 3 presents the datasets and the corresponding codebase used for each task, as well as the maximum batch size that fits on a single NVIDIA A100 GPU NVIDIA (2020) used in our study. For the MSCOCO image dataset, we sub-sampled 500 out of 2000 data samples so that the experiment time is more manageable, and we used the full dataset for the remaining tasks. For HSTU, we synthetically generated a dataset with 16,384 samples, where the sequence of user history for each sample is randomly generated, as explained in Section 3.1. We validated and ensured the dataset is representative of production use cases.

4.1 Baseline is All You Need

4.1.1 Accelerating Attention

The Attention operation in Transformer-based model architectures is an Amdahl’s law bottleneck. Based on the real-system performance characterization in Figure 4, Attention contributes to 3.4% of the end-to-end inference time in the decoding phase for Code Llama whereas, for HSTU, over 90% of the inference time comes from the Attention operation.

To accelerate Attention, we enable PyTorch SDPA (Scaled Dot Product Attention) PyTorch (2023) designed specifically to accelerate the fundamental building block — Attention — in Transformer-based model architectures. PyTorch provides torch.nn.functional.scaled_dot_product_attention as a function to optimize the inference time performance by accelerating the dot product computation between the Query, Key, and Value matrices using SDPA Vaswani et al. (2023).
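As an illustration, the snippet below invokes the SDPA API directly on example Query/Key/Value tensors; PyTorch dispatches to a fused backend (e.g., Flash Attention or memory-efficient attention) when the input shapes and dtypes allow it.

import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 1, 32, 2048, 128
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)

# One fused kernel replaces the softmax(q @ k^T / sqrt(d)) @ v sequence of kernels
# and avoids materializing the (seq_len x seq_len) attention matrix.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)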

Instead of relying on the PyTorch SDPA API directly, for HSTU we manually implemented memory-efficient attention Rabe and Staats (2022) and Flash Attention Dao et al. (2022), as in PyTorch, directly in HSTU's internal code base. The memory-efficient attention implementation divides the input into blocks and avoids materializing the large $h\times N\times N$ intermediate attention tensors for the backward pass. This recasts the attention computation as a group of back-to-back GEMMs with different shapes, which enables the sparsity of input sequences to be exploited. The construction of the relative attention bias is also a bottleneck due to memory accesses. To address this issue, we fused the relative bias construction and the grouped GEMMs into a single GPU kernel and accumulate gradients using the GPU's fast shared memory in the backward pass.
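A much-simplified version of the memory-saving idea is sketched below: queries are processed in blocks so that only a (block x N) slice of the attention matrix is materialized at a time. The actual implementations (Rabe and Staats; Flash Attention; the fused HSTU kernel) additionally tile the key/value dimension with an online softmax and fuse the relative-bias construction, which this sketch omits.

import math
import torch

def blockwise_attention(q, k, v, block_size=1024):
    # q, k, v: (batch, heads, N, head_dim)
    scale = 1.0 / math.sqrt(q.shape[-1])
    outputs = []
    for start in range(0, q.shape[2], block_size):
        q_blk = q[:, :, start:start + block_size]
        # Only a (block_size x N) score slice is alive at once, instead of the
        # full h x N x N attention tensor.
        scores = torch.softmax((q_blk @ k.transpose(-2, -1)) * scale, dim=-1)
        outputs.append(scores @ v)
    return torch.cat(outputs, dim=2)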

Figure 5: End-to-end inference time speedup with SDPA and torch.compile for Llama and Chameleon on A100 GPU.

Results – SDPA. Figure 5 presents the end-to-end latency speedup across the family of multi-modal generation tasks for batch size = 1 and for the maximum batch size (the largest batch size that each model can support on a single NVIDIA A100 GPU, as configured in Table 3). PyTorch SDPA accelerates the inference time of the generation tasks by an average of 1.07× and 1.43× for the single-batch and maximum-batch settings, respectively. In particular for HSTU, using the same fundamental principle, we observe 2.11× and 9.87× inference time improvements for the single-batch and maximum-batch settings, respectively. The significant speedup stems from the proportionally larger amount of time spent on the Attention operation in HSTU than in the other generation tasks. We also observed that the optimized HSTU implementation achieves up to 15× speedup on 8K sequences.

In general, PyTorch SDPA establishes a more competitive baseline for inference performance across all tested scenarios. However, it is important to note that performance gains may be negligible in cases where the attention operation constitutes a significantly smaller proportion of the overall inference runtime. For instance, we observed no performance improvement when applying SDPA to Seamless, as it allocates the smallest portion of runtime to attention operations among the four generative AI models examined — less than 7% across all tasks according to Figure 4.

Figure 6: End-to-end inference time speedup with SDPA and torch.compile for (a) Seamless and (b) HSTU on NVIDIA A100 GPU.

4.1.2 Improving GPU Utilization

During inference, for the single-batch setting, the workloads are typically not compute bound, which raises two issues. First, each kernel that runs on the GPU becomes so fast that the overhead of launching kernels starts to dominate the overall inference time. We reduce the number of kernels with PyTorch's compiler: torch.compile Ansel et al. (2024) accelerates PyTorch models by capturing and optimizing their computation graph, which includes fusing multiple operations into a single kernel. The second and more important issue is that the GPU computations can be faster than the time it takes to execute the corresponding Python code on the CPU. The consequence is that the GPU is inactive most of the time, waiting for instructions from the CPU. We address this with CUDA Graphs NVIDIA (2018). A CUDA graph is a succession of GPU operations that can be executed as a whole, without having to execute CPU code to schedule kernels one by one. In particular, this ensures that the GPU is always active during graph execution. In practice, the graph is captured once when running the PyTorch model and can then be replayed whenever we have a new input.

One key limitation is that the operations must run on exactly the same static tensor shapes at the same memory addresses. This is incompatible with inference workloads, because the KV cache grows with each iteration as tokens get appended (cache=torch.cat((cache, new_value), dim=0)). To enable CUDA Graph under this limitation, we deploy a static buffer for the KV cache sized to the maximum sequence length supported by the model prior to inference. As new keys and values are added to the cache, we increment the current token position on a GPU tensor. This counter is used by the kernel that copies the new tokens into the KV cache, and also by the attention kernel, to skip the part of the KV cache that is not filled yet. This change enables CUDA Graphs, since the KV cache and the counter now have a static shape with a static GPU memory address. Note that the baseline we compare against adopts the optimized implementation with a dynamic KV cache.
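A minimal sketch of such a static KV cache is shown below: buffers are pre-allocated at the maximum sequence length and new tokens are written in place at a GPU-resident position counter, so shapes and addresses stay fixed and the decoding step becomes capturable by CUDA Graphs (for example via torch.compile with mode="reduce-overhead"). The buffer layout and method names are illustrative, not the production implementation.

import torch

class StaticKVCache:
    def __init__(self, num_heads, max_seq_len, head_dim, device="cuda", dtype=torch.float16):
        # Pre-allocated buffers: shapes and GPU addresses never change during decoding.
        self.k = torch.zeros(1, num_heads, max_seq_len, head_dim, device=device, dtype=dtype)
        self.v = torch.zeros(1, num_heads, max_seq_len, head_dim, device=device, dtype=dtype)
        # Current fill position, kept on the GPU so the update can be graph-captured.
        self.pos = torch.zeros(1, dtype=torch.long, device=device)

    def append(self, new_k, new_v):
        # new_k, new_v: (1, num_heads, 1, head_dim) for the newly generated token.
        # In-place writes replace torch.cat, which would allocate a new, larger tensor.
        self.k.index_copy_(2, self.pos, new_k)
        self.v.index_copy_(2, self.pos, new_v)
        self.pos += 1

    def view(self):
        # Attention kernels can use self.pos to skip the unfilled tail of the cache.
        return self.k, self.v, self.pos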

Results – torch.compile/CUDA Graph. Figure 6 presents the additional inference performance speedup with torch.compile and CUDA Graph. Overall, the end-to-end inference performance sees an average of 2.14× and 2.16× additional speedup on top of SDPA for the two batch settings, respectively. This results in a total of 2.28× and 3.09× speedup over the baseline without any optimization.

An exception is Seamless, where we observed performance degradation at the maximum batch size due to CUDA Graph's requirement for a static KV cache. As mentioned earlier, substantial speedups are typically expected when a static KV cache is used with CUDA Graph despite the increased computational demands. However, this is an example where the computational cost increase from the static KV cache surpasses the benefits provided by CUDA Graph.

Figure 7: End-to-end inference speedup of applying torch.compile incrementally on NVIDIA A100 GPU.

A Deeper Dive with Seamless: Seamless is an emerging speech translation technology that is important to many product surfaces but has not received a similar amount of attention as LLMs or deep learning recommendation models. We focus significant performance acceleration efforts on enabling real-time speech translation built on Seamless and present our key findings in a deeper dive.

There are four primary modules in Seamless (Figure 2(c)). Enabling torch.compile (mode="max-autotune") and CUDA Graph for the T2TT decoder and vocoder (the most time-consuming modules, at 61% and 23% of inference time) achieves 2× speedup for the text decoder and 30× speedup for the vocoder. This leads to 2.65× faster end-to-end inference latency. It turns out that, in particular for single-batch inference, GPU kernel launch time is hardly amortized, leading to substantial GPU idle time. Enabling torch.compile without CUDA Graph limits the gains to 1.17× and 18.4× for the text decoder and the vocoder, respectively. While still significant, this shows the important role of CUDA Graph.

Our operator time breakdown in Figure 4 illustrates that Seamless also spends a significant amount of time on KV cache management (KV_Cache_Reorder). This is because Seamless adopts beam search as its text decoding strategy. In each incremental decoding step, beam search picks the 'N' beams containing the best sequences so far based on the probabilities of the combination of all preceding words plus the current word. For each incremental decoding step, KV cache reordering is needed in all Attention layers to ensure that newly selected beams operate on the corresponding KV caches from the previous decoding step — kv_cache = kv_cache.index_select(new_beams). This code allocates new memory space and overwrites the memory pointer for kv_cache. To enable torch.compile for KV_Cache_Reorder, we modified the KV cache reordering to keep the memory pointer of each cache unchanged by using the torch.Tensor.copy_ operator. By enabling torch.compile, all GPU kernels related to reordering are fused and compiled, resulting in the final speedup for Seamless.
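The change can be sketched as follows, using the same hypothetical cache layout as before: the reorder gathers the surviving beams and then writes the result back into the existing buffer with torch.Tensor.copy_, so the cache's memory address is preserved across decoding steps and the operation can be captured by torch.compile.

import torch

def reorder_kv_cache_inplace(kv_cache, beam_indices):
    # kv_cache: list of (key, value) buffers of shape (num_beams, heads, seq_len, head_dim).
    for key, value in kv_cache:
        # index_select produces a temporary, then copy_ writes it back in place;
        # the buffers keep their original memory pointers, unlike rebinding the name
        # to index_select's freshly allocated output.
        key.copy_(key.index_select(0, beam_indices))
        value.copy_(value.index_select(0, beam_indices))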

Figure 7 presents the overall inference speedup we achieve for Seamless step by step, and Table 4 describes each label used in the figure. While application-specific performance optimization, such as incremental decoding and KV cache reordering, is important, significant additional inference acceleration can be achieved by torch.compile and CUDA Graph optimization. For Seamless M4T, an end-to-end inference speedup of 2.7× can be achieved in the challenging single-batch setting. This is key to efficiently enabling low-latency, real-time speech translation tasks.

Label | Description
[Text Dec.] Compile | Apply torch.compile to the text decoder
[Text Dec.] Compile + CUDA Graph | Apply torch.compile + CUDA Graph to the text decoder, on top of the above row
+[KV Cache Reorder] Compile | Apply torch.compile to KV cache reordering, on top of the above row
+[Vocoder] Compile | Apply torch.compile to the vocoder, on top of the above row
+[Vocoder] Compile + CUDA Graph | Apply torch.compile + CUDA Graph to the vocoder, on top of the above row
Table 4: Description of the labels used in Figure 7.

4.2 Data Type Optimization

Quantization is an important optimization before models are deployed for downstream inference. To understand the potential of quantization, we assess data type optimization by applying AutoQuant (Auto-Quantization) PyTorch (2024). AutoQuant is a recently developed quantization implementation within the PyTorch torchao library PyTorch (2024) designed to integrate high-performance custom data types, layouts, and kernels into PyTorch workflows. AutoQuant optimizes the quantization process by determining the most efficient quantization for each model layer. It supports two quantization types: int8 dynamic quantization and int8 weight-only quantization.

Depending on downstream tasks, models of different input modalities, architectures and layer specifications can be quantized in distinct ways. For compute-intensive models, dynamic quantization tends to be most effective as it replaces expensive floating-point matrix multiplication operations with faster integer versions. In contrast, weight-only quantization is more beneficial for memory-bound scenarios, where the primary advantage is reduced weight data loading rather than decreased computational demand.

We enable AutoQuant as follows. First, in the model preparation step, linear layers within the model are identified as candidates for quantization. Then, in the shape calibration step, the model is profiled with one or more inputs, and the shapes and data types of activations are recorded for subsequent use. Finally, the timing of each quantization option is measured for the recorded shapes and data types, and the fastest setting is applied to speed up model inference.

AutoQuant is designed to work in conjunction with torch.compile, utilizing the max-autotune setting to optimize quantization and achieve maximum performance gains. The quantization kernels within AutoQuant rely on torch.compile to generate high-performance kernels; therefore, models must first be adapted to use a static KV cache and static memory, as highlighted in Section 4.1.2.
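In practice, the enablement can look like the sketch below; the API follows the torchao documentation at the time of writing and may differ across versions, and load_model and example_inputs are placeholders for one of the Linear-heavy workloads above.

import torch
import torchao

model = load_model().to("cuda").eval()   # placeholder for a Llama/Chameleon-style model

# AutoQuant wraps the compiled model, observes the linear-layer shapes at runtime,
# and picks int8 dynamic vs. int8 weight-only quantization per layer.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))

# The first call doubles as shape calibration; the fastest measured option is then
# cached and reused for all subsequent inference.
with torch.inference_mode():
    out = model(example_inputs)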

Results – AutoQuant. Figure 6 presents the inference time speedup for AutoQuant. AutoQuant provides an additional 1.20× and 1.57× performance improvement for the single-batch setting on top of torch.compile (Section 4.1.2). Compared to the baseline without any optimization, we observe an average of 2.13× and 4.38× latency improvement for the single- and maximum-batch settings, respectively.

For the other generation tasks using the model architectures of Seamless and HSTU, we do not expect performance improvement based on the characterization results in Figure 4 — linear operations do not contribute significant runtime to end-to-end model inference. Furthermore, quantization needs careful tuning, especially for production use cases of recommendation models Deng et al. (2021); thus, we opt HSTU out of AutoQuant enablement.

4.3 Algorithm and NN Specific Optimizations

To meet the low inference latency requirement with resource efficiency, we prioritize enabling system optimization levers that come with minimal accuracy impact — SDPA and Flash Attention in Section 4.1.1 Golden et al. (2024b), torch.compile and CUDA Graph in Section 4.1.2, and AutoQuant in Section 4.2. To further accelerate inference efficiently, algorithm- and neural-network-specific optimization levers can be exploited. Here, we focus on a state-of-the-art inference optimization technique, LayerSkip Elhoushi et al. (2024). This technique was originally designed to reduce Llama inference time, and we show how it can be utilized to accelerate other multi-modal generative models.

Figure 8: End-to-end inference time speedup with LayerSkip with batch size = 1 on NVIDIA A100 GPU.

LayerSkip Elhoushi et al. (2024) is specialized to minimize the single-batch inference latency of LLMs. It speeds up inference by generating draft tokens sequentially with fewer layers while verifying them in parallel with the remaining layers. Like speculative decoding Leviathan et al. (2023), parallel token verification amortizes per-token layer weight loading costs, resulting in end-to-end speedup. The accuracy loss from exiting early is recovered by finetuning the model with the training recipe of Elhoushi et al. (2024).
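The self-speculative loop can be summarized by the sketch below: a few early layers draft tokens autoregressively, and the remaining layers verify the whole draft in one parallel pass. The forward_early/forward_remainder interface and the acceptance rule are simplified, hypothetical stand-ins for the actual LayerSkip implementation.

import torch

@torch.inference_mode()
def self_speculative_step(model, ids, kv_cache, num_draft=4, exit_layer=8):
    # Draft: generate a few tokens cheaply using only the first `exit_layer` layers.
    draft = []
    for _ in range(num_draft):
        logits, kv_cache = model.forward_early(ids, kv_cache, exit_layer=exit_layer)
        ids = logits[:, -1].argmax(dim=-1, keepdim=True)
        draft.append(ids)
    draft = torch.cat(draft, dim=-1)

    # Verify: run the remaining layers over all drafted tokens in one parallel pass,
    # amortizing weight loading across the draft.
    full_logits, kv_cache = model.forward_remainder(draft, kv_cache, exit_layer=exit_layer)
    target = full_logits.argmax(dim=-1)

    # Accept the longest prefix on which the draft agrees with the full model.
    matches = (draft == target).long().cumprod(dim=-1)
    num_accepted = int(matches.sum())
    return draft[:, :num_accepted], kv_cache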

However, LayerSkip requires continued model pretraining with early exit loss and layer dropout to improve early-layer accuracy. This required 64 GPUs for 50K iterations for Code Llama and Chameleon, and moreover required access to the same or a similar pretraining corpus to the one the models were trained on. Also, LayerSkip is beneficial only for autoregressive decoder models such as Llama and Chameleon, not for non-autoregressive models like HSTU. Moreover, LayerSkip relies on a speculative decoding mechanism, so a custom implementation is required.

Results – LayerSkip. Figure 8 shows the inference time performance gain from workload-specific optimizations; we choose Llama and Chameleon as our target models. We focus on batch size 1 because efficient speculative decoding for larger batch sizes requires significant modification to the attention mechanism Qian et al. (2024); Daniel (2024). Note that LayerSkip achieves significant inference time speedup at the cost of potential accuracy loss. We achieve 1.59× and 1.53× speedup with +2.5% and -1.2% accuracy impact for the CodeLlama 7B and 34B models, respectively. For the Chameleon 7B model, LayerSkip achieves 1.43× and 1.83× speedup with -3.2 and -6.36 CIDEr score loss for the I-T and IT-T tasks, respectively. Overall, we observe a geomean 1.58× speedup with LayerSkip alone.

Results – Putting It Altogether. We further explored performance gains by enabling all cross-stack optimization techniques: system-level optimizations (SDPA, torch.compile, AutoQuant) and workload-specific optimization (LayerSkip). This enhanced the speedup from 1.58× to 3.88×, demonstrating the significant potential of combining techniques for optimal performance gains.

4.4 Roofline Analysis

Figure 9 illustrates the effects of various optimization techniques on performance, as evaluated through the roofline analysis (data collected from NSight Compute NVIDIA (2024) profiling tool from NVIDIA). For each workload, Baseline is indicated with a circle marker where none of the optimization techniques is applied whereas Sys-Opt is indicated with a star marker where all the optimization levers are enabled. For Llama and Chameleon, SDPA+torch.compile+AutoQuant are enabled while SDPA+torch.compile is enabled for Seamless and SDPA is enabled for HSTU.

For each workload, enabling the system-level optimization techniques increases arithmetic intensity (i.e., FLOP/memory_traffic) and performance (i.e., FLOP per second), moving workload characteristics toward the upper right part of the roofline. In the A100 deployment case, workloads that were already memory bandwidth-bound in the baseline setup are able to reduce memory traffic and improve overall system performance. SDPA minimizes memory accesses during the attention mechanism by breaking the input sequence into smaller tiles and performing computation within each tile independently. torch.compile fuses operations, eliminating intermediate memory allocations and accesses. AutoQuant decreases the memory usage and traffic of each weight parameter by lowering its numerical precision.
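The roofline quantities above can be computed as in the short sketch below; the FLOP and byte counts come from a profiler such as NSight Compute, and the peak numbers are approximate public A100 specifications used only for illustration.

# Approximate A100 peaks used for illustration (FP16 tensor-core FLOPS, HBM bandwidth).
PEAK_FLOPS = 312e12   # FLOP/s
PEAK_BW = 2.0e12      # bytes/s

def roofline_point(total_flops, total_bytes, elapsed_s):
    intensity = total_flops / total_bytes                # FLOP per byte moved
    achieved = total_flops / elapsed_s                   # achieved FLOP/s
    attainable = min(PEAK_FLOPS, PEAK_BW * intensity)    # roofline ceiling at this intensity
    bound = "memory-bound" if PEAK_BW * intensity < PEAK_FLOPS else "compute-bound"
    return intensity, achieved, attainable, bound

# Example: a hypothetical decoding phase measured by the profiler.
print(roofline_point(total_flops=4.0e12, total_bytes=1.6e12, elapsed_s=1.2))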

The effect of each optimization also depends on workload properties such as the underlying model architecture and input sequence length. For instance, workloads with textual inputs (e.g., T-T, T-I) that were previously the most memory-bandwidth-bound are the biggest beneficiaries. In contrast, Seamless shows at most a 10% difference (5% on average) between the circle and star markers across its four workloads (S-S, S-T, T-S, T-T). As discussed in Section 4.1.2, applying torch.compile only to the text decoder among Seamless's four primary modules has a trivial impact on overall arithmetic intensity and performance.

Figure 9: Roofline Analysis for Generative AI Workloads.

Beyond the Roofline Analysis. We take Llama as an example to further understand how each optimization technique moves the workload on the roofline. Applying SDPA has two key effects: first, the FLOPs count increases by 8%, because efficient attention techniques require some recomputation; second, memory traffic decreases by 14% due to the optimized algorithm. As a result, arithmetic intensity increases. Counterintuitively, applying torch.compile on top of SDPA increases both the FLOPs count and the memory traffic due to static KV cache adoption. Overall arithmetic intensity still increases because the FLOPs count grows faster than the memory traffic (attention has $O(N^2)$ complexity). Memory traffic rises only slightly, by 1%: the static KV cache's additional traffic roughly cancels out the reductions from operation fusion. By reducing the per-weight memory footprint, AutoQuant reduces memory traffic by 3.1× on top of SDPA and torch.compile. Applying LayerSkip on top of all system-level optimizations reduces the FLOPs count by 2.3× and the memory traffic by 2.2×.
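Plugging the reported deltas into the definition of arithmetic intensity (AI = FLOPs / memory traffic) gives a quick consistency check. The torch.compile step is omitted below because its FLOPs change is not quantified above; everything else uses only the numbers from the preceding paragraph, normalized to the baseline.

```python
# Back-of-the-envelope check of how the reported deltas move Llama's
# arithmetic intensity, with the baseline normalized to FLOPs = traffic = 1.0.
flops, mem = 1.0, 1.0

flops, mem = flops * 1.08, mem * (1 - 0.14)   # SDPA: +8% FLOPs, -14% traffic
print(f"after SDPA:      AI x{flops / mem:.2f}")   # ~1.26x

mem /= 3.1                                    # AutoQuant: 3.1x less traffic
print(f"after AutoQuant: AI x{flops / mem:.2f}")   # ~3.89x

flops, mem = flops / 2.3, mem / 2.2           # LayerSkip: fewer layers run/loaded
print(f"after LayerSkip: AI x{flops / mem:.2f}")   # ~3.72x
```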

Figure 10: Operator Time Breakdown of Code Llama Rozière et al. (2024), Seamless Communication et al. (2023), Chameleon Team (2024a), HSTU Zhai et al. (2025) on H100 GPU.

4.5 Result Analysis over GPU Generations

In this section, we extend our analysis to the NVIDIA H100 GPU NVIDIA (2023) to demonstrate how our insights and optimizations generalize across hardware generations. H100 (Hopper) is the generation after the NVIDIA A100 (Ampere), improving both compute capability and the memory subsystem, with about 3× higher theoretical peak FLOPS and 1.5× higher HBM bandwidth than A100.
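One way to interpret this shift is through the roofline ridge point (the arithmetic intensity at which a kernel becomes compute-bound), which equals peak FLOPS divided by peak bandwidth. The small calculation below uses only the relative factors quoted above, with values normalized to A100, to show that the ridge point roughly doubles on H100.

```python
# Ridge point = peak_FLOPS / peak_bandwidth; values normalized to A100 = 1.0.
a100_ridge = 1.0
h100_ridge = 3.0 / 1.5          # ~3x FLOPS, ~1.5x HBM bandwidth (quoted above)
print(h100_ridge / a100_ridge)  # -> 2.0: the ridge moves right, so more
                                #    kernels sit under the memory roof on H100
```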

Examining the same workloads on both platforms shows how a new hardware generation changes the performance characteristics: which existing bottlenecks are resolved by the architectural improvements, and which new optimization opportunities emerge. These cross-platform insights are crucial both for hardware architects designing future accelerators and for system engineers optimizing the software stack.

Figure 10 shows the H100 operator time breakdown compared to A100 (Figure 4). We observe two changes in performance characteristics. First, H100 delivers substantial improvements in computational efficiency: Linear operations show the most dramatic speedup of 6.82×, while Attention operations achieve a 1.44× improvement, resulting in a 1.68× speedup in end-to-end baseline runtime at batch size 1. Second, the significant acceleration of Linear operations shifts the performance bottlenecks: models previously bound by Linear operations now show Misc or Attention operations as their primary bottlenecks.

Figure 11: End-to-end inference time speedup with SDPA and SDPA+torch.compile for Llama and Chameleon on H100.

Figure 11 shows the speedup from the system-level optimizations. When all applicable system-level optimization techniques are enabled, we obtain 2.21×, 3.1×, 1.5×, and 2.7× speedup for Llama 34B, Chameleon 34B, Seamless (S-S), and HSTU at batch size 1, respectively. Adding LayerSkip on top of the system-level optimizations yields final speedups of 2.21×, 4.13×, 3.22×, and 4.53× for the T-T task of Llama 34B and 7B and the IT-T and I-T tasks of Chameleon 7B, respectively.

The smaller relative gains from these optimizations on H100 compared to A100 can be attributed to H100's stronger baseline. With architectural improvements such as hardware-optimized attention and higher memory bandwidth, software optimization techniques show diminishing returns as baseline hardware performance improves, even though H100's absolute performance remains superior.

5 Key Lessons and Concluding Remarks

Generative AI technologies are reshaping the computing landscape by offering new capabilities. This paper characterizes the system performance of key multimodal models from Meta: Llama, Seamless, Chameleon, and gDLRM, and highlights their distinct resource needs and performance patterns, which require tailored optimizations. Enabling state-of-the-art system-level optimizations, such as Flash Attention/SDPA, torch.compile, CUDA Graph, and quantization, strengthens baseline performance. Workload-specific optimizations unlock further gains by exploiting workload characteristics. We present the following key insights for the computer architecture community:

  • Multi-modal models show distinct workload patterns compared to traditional AI models. Our quantitative results demonstrate differences in latency, compute, and memory requirements across modalities and tasks. For instance, the T-I and IT-T tasks of Chameleon demand 1.7× more compute than HSTU, while the arithmetic intensity of HSTU is 1.25× higher.

  • Optimization solutions must consider the whole end-to-end inference pipeline. Our research shows that focusing on isolated components may lead to suboptimal performance gains. For example, optimizing only the attention operation with SDPA yields a 1.43× improvement, and additionally optimizing the linear operations with AutoQuant yields a further 3.06× speedup, for a total 4.38× inference speedup.

  • While new hardware accelerators are exciting prospects, our results emphasize the importance of first exhausting state-of-the-art software optimizations. We demonstrated that enabling state-of-the-art optimizations, SDPA, torch.compile, and AutoQuant, led to a 4.38× performance improvement across the models, highlighting the untapped potential of existing hardware.

  • The diversity in model architectures and modalities necessitates flexible and adaptable optimization strategies. The effectiveness of PyTorch SDPA varies with the attention operation's share of total runtime, showing that a one-size-fits-all approach is insufficient.

  • Hardware design for generative AI should prioritize flexibility and adaptability to accommodate the diverse computational patterns and requirements across models, tasks, and optimization knobs. Reconfigurable hardware designs are essential to handle these variations efficiently. Addressing growing memory and network demands through increased on-chip memory or enhanced inter-/intra-host communication is crucial for large-scale generative models.

We hope this work provides a deeper understanding of the landscape of generative AI technologies and cross-stack system optimization solutions. Optimizing the fundamental components and accounting for the unique input modalities of key generative AI technologies are the keys to efficiently accelerating model inference. The findings and methodologies in this paper enhance our understanding of generative AI system performance and set the stage for future innovations toward more efficient and scalable AI systems. As the field of generative AI continues to evolve rapidly, we believe the computer architecture community has a crucial role to play in shaping the next generation of efficient, high-performance AI systems.

6 Acknowledgment

This work is an outcome of extensive collaborations with many teams: Chameleon, Seamless, and HSTU. We are thankful for their valuable insights, numerous discussions, and help refining the multimodal models. We would also like to thank the PyTorch and xFormers teams, especially for their input on ML system optimization.

References

  • Ansel et al. (2024) Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh, Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang, Jason Liang, Yinghai Lu, C. K. Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark Saroufim, Marcos Yukio Siraichi, Helen Suk, Shunting Zhang, Michael Suo, Phil Tillet, Xu Zhao, Eikan Wang, Keren Zhou, Richard Zou, Xiaodong Wang, Ajit Mathews, William Wen, Gregory Chanan, Peng Wu, and Soumith Chintala. Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, page 929–947, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703850. 10.1145/3620665.3640366. https://doi.org/10.1145/3620665.3640366.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021.
  • Ben Allal et al. (2022) Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness, 2022.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021. https://arxiv.org/abs/2107.03374.
  • Communication et al. (2023) Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne, Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang, Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang, Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti, Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez, Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu, Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain, Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, and Mary Williamson. Seamless: Multilingual expressive and streaming speech translation, 2023.
  • Conneau et al. (2022) Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations of speech, 2022.
  • Daniel (2024) Cade Daniel. Optimizing attention for spec decode can reduce latency / increase throughput. https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA, 2024. [Accessed 16-09-2024].
  • Dao et al. (2022) Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022. https://arxiv.org/abs/2205.14135.
  • Deng et al. (2021) Zhaoxia Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, and Mikhail Smelyanskiy. Low-precision hardware architectures meet recommendation model inference at scale. IEEE Micro, 41(5):93–100, 2021. 10.1109/MM.2021.3081981.
  • Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, and Carole-Jean Wu. Layerskip: Enabling early exit inference and self-speculative decoding, 2024. https://arxiv.org/abs/2404.16710.
  • Gafni et al. (2022) Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors, 2022.
  • Girdhar et al. (2023) Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning, 2023.
  • Golden et al. (2024a) Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. Generative AI Beyond LLMs: System Implications of Multi-Modal Generation . In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 257–267, Los Alamitos, CA, USA, May 2024a. IEEE Computer Society. 10.1109/ISPASS61541.2024.00032. https://doi.ieeecomputersociety.org/10.1109/ISPASS61541.2024.00032.
  • Golden et al. (2024b) Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. Is flash attention stable?, 2024b. https://arxiv.org/abs/2405.02803.
  • Goyal et al. (2021) Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzman, and Angela Fan. The flores-101 evaluation benchmark for low-resource and multilingual machine translation, 2021.
  • Gupta et al. (2020) Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, and Xuan Zhang. The Architectural Implications of Facebook’s DNN-Based Personalized Recommendation . In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 488–501, Los Alamitos, CA, USA, February 2020. IEEE Computer Society. 10.1109/HPCA47549.2020.00047. https://doi.ieeecomputersociety.org/10.1109/HPCA47549.2020.00047.
  • Gurari et al. (2018) Danna Gurari, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people, 2018.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, New York, NY, USA, 2016. IEEE. 10.1109/CVPR.2016.90.
  • Hsia et al. (2023) Samuel Hsia, Udit Gupta, Bilge Acun, Newsha Ardalani, Pan Zhong, Gu-Yeon Wei, David Brooks, and Carole-Jean Wu. Mp-rec: Hardware-software co-design to enable multi-path recommendation. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2023, page 449–465, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450399180. 10.1145/3582016.3582068. https://doi.org/10.1145/3582016.3582068.
  • Jia et al. (2019) Ye Jia, Ron J. Weiss, Fadi Biadsy, Wolfgang Macherey, Melvin Johnson, Zhifeng Chen, and Yonghui Wu. Direct speech-to-speech translation with a sequence-to-sequence model, 2019. https://arxiv.org/abs/1904.06037.
  • Leviathan et al. (2023) Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding, 2023. https://arxiv.org/abs/2211.17192.
  • Lin et al. (2015) Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
  • Meta (2023) Meta. Seamless. https://github.com/facebookresearch/seamless_communication, 2023.
  • Naumov et al. (2019) Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, and Misha Smelyanskiy. Deep learning recommendation model for personalization and recommendation systems, 2019. https://arxiv.org/abs/1906.00091.
  • NVIDIA (2018) NVIDIA. CUDA Graph. https://developer.nvidia.com/blog/cuda-10-features-revealed/, 2018.
  • NVIDIA (2020) NVIDIA. NVIDIA A100 Tensor Core GPU. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2020.
  • NVIDIA (2023) NVIDIA. NVIDIA H100 GPU. https://resources.nvidia.com/en-us-tensor-core?ncid=no-ncid, 2023.
  • NVIDIA (2024) NVIDIA. Nsight compute. https://docs.nvidia.com/nsight-compute/NsightCompute/index.html, 2024.
  • OpenAI (2024) OpenAI. Chatgpt. https://openai.com/gpt, 2024.
  • PyTorch (2023) PyTorch. SDPA. https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html, 2023.
  • PyTorch (2024) PyTorch. torchao. https://github.com/pytorch/ao/tree/main/torchao/quantization, 2024.
  • Qian et al. (2024) Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, and Anoop Deoras. Bass: Batched attention-optimized speculative sampling, 2024. https://arxiv.org/abs/2404.15778.
  • Rabe and Staats (2022) Markus N. Rabe and Charles Staats. Self-attention does not need $O(n^2)$ memory, 2022. https://arxiv.org/abs/2112.05682.
  • Radford et al. (2022) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022.
  • Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. Recommender systems with generative retrieval, 2023.
  • Rozière et al. (2024) Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code, 2024.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics. 10.18653/v1/P16-1162. https://aclanthology.org/P16-1162/.
  • Shazeer (2020) Noam Shazeer. Glu variants improve transformer, 2020. https://arxiv.org/abs/2002.05202.
  • Sheynin et al. (2023) Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks, 2023.
  • Singer et al. (2022) Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data, 2022.
  • Su et al. (2023) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. https://arxiv.org/abs/2104.09864.
  • Team (2024a) Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2024a. https://github.com/facebookresearch/chameleon.
  • Team (2024b) Gemini Team. Gemini: A family of highly capable multimodal models, 2024b.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.
  • Wu et al. (2020) Carole-Jean Wu, Robin Burke, Ed H. Chi, Joseph Konstan, Julian McAuley, Yves Raimond, and Hao Zhang. Developing a recommendation benchmark for mlperf training and inference, 2020. https://arxiv.org/abs/2003.07336.
  • Wu et al. (2022) Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang, Fiona Aga, Jinshi Huang, Charles Bai, Michael Gschwind, Anurag Gupta, Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee, Hsien-Hsin Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat, and Kim Hazelwood. Sustainable ai: Environmental implications, challenges and opportunities. In D. Marculescu, Y. Chi, and C. Wu, editors, Proceedings of Machine Learning and Systems, volume 4, pages 795–813, Santa Clara, CA, USA, 2022. mlsys.org. https://proceedings.mlsys.org/paper_files/paper/2022/file/462211f67c7d858f663355eff93b745e-Paper.pdf.
  • Wu et al. (2024) Carole-Jean Wu, Bilge Acun, Ramya Raghavendra, and Kim Hazelwood. Beyond efficiency: Scaling ai sustainably, 2024. https://arxiv.org/abs/2406.05303.
  • Zhai et al. (2025) Jiaqi Zhai, Lucy Liao, Xing Liu, Yueming Wang, Rui Li, Xuan Cao, Leon Gao, Zhaojie Gong, Fangda Gu, Jiayuan He, Yinghai Lu, and Yu Shi. Actions speak louder than words: trillion-parameter sequential transducers for generative recommendations. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, New York, NY, USA, 2025. JMLR.org.
  • Zhang and Sennrich (2019) Biao Zhang and Rico Sennrich. Root mean square layer normalization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
  • Zhao et al. (2023) Mark Zhao, Dhruv Choudhary, Devashish Tyagi, Ajay Somani, Max Kaplan, Sung-Han Lin, Sarunya Pumma, Jongsoo Park, Aarti Basant, Niket Agarwal, Carole-Jean Wu, and Christos Kozyrakis. Recd: Deduplication for end-to-end deep learning recommendation model training infrastructure. In D. Song, M. Carbin, and T. Chen, editors, Proceedings of Machine Learning and Systems, volume 5, pages 754–767, Santa Clara, CA, USA, 2023. Curan. https://proceedings.mlsys.org/paper_files/paper/2023/file/f9b15fec25182f2d70af68a39546d60e-Paper-mlsys2023.pdf.