Gemini models process input and output in units called tokens.
Tokens can be single characters like z
or whole words like cat
. Long words
are broken up into several tokens. The set of all tokens used by the model is
called the vocabulary, and the process of splitting text into tokens is called
tokenization.
For Gemini models, a token is equivalent to about 4 characters. 100 tokens is equal to about 60-80 English words.
Each model has a maximum number of tokens that it can handle in a prompt and response. Knowing the token count of your prompt lets you know if you've exceeded this limit. Additionally, the cost of a request is determined in part by the number of input and output tokens, so knowing how to count tokens can be helpful.
Note that Gemini 1.0 and 1.5 models also supported a "billable characters" count and pricing, but since those models are all either retired or soon-to-be-retired, this page does not describe anything about billable characters.
Supported models
gemini-2.5-pro
gemini-2.5-flash
gemini-2.5-flash-lite-preview-06-17
gemini-2.0-flash-001
(and its auto-updated aliasgemini-2.0-flash
)gemini-2.0-flash-lite-001
(and its auto-updated aliasgemini-2.0-flash-lite
)gemini-2.0-flash-preview-image-generation
Options for counting tokens
All input and output for the Gemini API is tokenized, including text, image files, and other non-text modalities. Here are the options for counting tokens:
- Check the token count for your requests only (before sending them to the model).
- Call
countTokens
with the input of the request before sending it to the model. This returns:total_tokens
: token count of the input only
- Check the token count for both your requests and responses.
- Use the
usageMetadata
attribute on the response object. This includes:prompt_token_count
: token count of the input onlycandidates_token_count
: token count of the output only (does not include thinking tokens)thoughts_token_count
: token count of any thinking tokens used to generate the responsetotal_token_count
: total count of tokens for both the input and the output (includes any thinking tokens)
When streaming output, the
usageMetadata
attribute only appears on the last chunk of the stream. It'snil
for intermediate chunks.
Note the following points about the options above:
- They will not count the number of input images or the number of seconds in video or audio input files. However, the token count for each of these modalities will correlate with these values.
- The input token count includes the prompt (text and any input files) as well as any system instructions and tools.
- The output token count does not include any thinking tokens; those are provided in a separate field.
- Review the additional information specific to each type of request later on this page.
Pricing for these options
Calling
countTokens
: There's no charge for callingcountTokens
(the Count Tokens API). The maximum quota for the Count Tokens API is 3000 requests per minute (RPM).Using the
usageMetadata
attribute: This attribute is always returned as part of the response and doesn't incur any tokens or charge itself.
Additional information
Here's some additional information when working with specific types of requests.
Count text input tokens
No additional information.
Count multi-turn (chat) tokens
Note the following for calling countTokens
when using chat:
- If you call
countTokens
with the chat history, it returns the total token count from both roles in the chat (total_tokens
). - To understand how big your next conversational turn will be, you need to
append it to the history when you call
countTokens
.
Count multimodal input tokens
Note the following points about counting tokens with multimodal input:
- You can optionally call
countTokens
on the text and the file separately. - For both token counting options, you'll get the same token count whether you provide the file as inline data or using its URL.
Image input files
Image input files are converted to tokens based on their dimensions:
- Image inputs with both dimensions less than or equal to 384 pixels: each image is counted as 258 tokens.
- Image inputs that are larger in one or both dimensions: each image is cropped and scaled as needed into tiles of 768x768 pixels, and then each tile is counted as 258 tokens.
Video and audio input files
Video and audio input files are converted to tokens at the following fixed rates:
- Video: 263 tokens per second
- Audio: 32 tokens per second
Document (like PDFs) input files
PDF input files are treated as images, so each page of a PDF is tokenized in the same way as an image.