AI inference is the "doing" part of artificial intelligence. It's the moment a trained model stops learning and starts working, turning its knowledge into real-world results.
Think of it this way: if training is like teaching an AI a new skill, inference is that AI actually using the skill to do a job. It takes in new data (like a photo or a piece of text) and produces an instant output, such as a prediction, a generated image, or a decision. This is where AI delivers business value. For anyone building with AI, understanding how to make inference fast, scalable, and cost-effective is the key to creating successful solutions. For example, an enterprise developer could use AI inference on Google Kubernetes Engine (GKE) to build a system that analyzes customer purchases in real time and offers personalized discounts at checkout, boosting sales and customer satisfaction.
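As a rough illustration of that checkout scenario, the application side might look like the sketch below. The endpoint URL, request format, feature names, and discount threshold are all assumptions for illustration, not a specific Google Cloud API.

```python
# Hypothetical sketch: score a shopping cart against a deployed propensity
# model and decide whether to offer a discount at checkout.
import requests

INFERENCE_URL = "https://example.com/v1/models/discount-propensity:predict"  # placeholder endpoint

def maybe_offer_discount(cart: dict) -> bool:
    # Send the live cart features to the inference endpoint.
    response = requests.post(INFERENCE_URL, json={"instances": [cart]}, timeout=1.0)
    response.raise_for_status()
    score = response.json()["predictions"][0]  # assumed response shape: purchase-propensity score

    # Business rule: offer a discount only when the model thinks the
    # customer is unlikely to complete the purchase on their own.
    return score < 0.4

# Example call (hypothetical feature names):
# maybe_offer_discount({"cart_value": 87.50, "items": 3, "is_returning_customer": True})
```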
While the complete AI lifecycle involves everything from data collection to long-term monitoring, a model's central journey from creation to execution has three key stages. The first two are about learning, while the last one is about putting that learning to work.
This table summarizes the key differences:
| | AI training | AI fine-tuning | AI inference | AI serving |
| --- | --- | --- | --- | --- |
| Objective | Build a new model from scratch. | Adapt a pre-trained model for a specific task. | Use a trained model to make predictions. | Deploy and manage the model to handle inference requests. |
| Process | Iteratively learns from a large dataset. | Refines an existing model with a smaller dataset. | A single, fast "forward pass" of new data. | Packages the model and exposes it as an API. |
| Data | Large, historical, labeled datasets. | Smaller, task-specific datasets. | Live, real-world, unlabeled data. | N/A |
| Business focus | Model accuracy and capability. | Efficiency and customization. | Speed (latency), scale, and cost-efficiency. | Reliability, scalability, and manageability of the inference endpoint. |
At its core, AI inference involves three steps that turn new data into a useful output.
Let's walk through it with a simple example: an AI model built to identify objects in photos.
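In code, those three steps might look like the minimal sketch below, which uses a pretrained Keras image classifier as a stand-in for "a model built to identify objects in photos." The model choice and the file path are assumptions for illustration, not a prescribed implementation.

```python
# Minimal sketch of the three inference steps for image classification.
import numpy as np
import tensorflow as tf

# The trained model is loaded once, then reused for every inference request.
model = tf.keras.applications.MobileNetV2(weights="imagenet")

def identify_object(image_path: str):
    # Step 1: prepare the input. Load the photo and convert it to the numeric
    # format the model expects (a batch of 224x224 pixel arrays).
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)

    # Step 2: run the forward pass. This single, fast pass of new data through
    # the trained network is the inference itself.
    preds = model.predict(x, verbose=0)

    # Step 3: turn the raw scores into a useful output, here the top label.
    _, name, score = tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=1)[0][0]
    return name, float(score)

# Example (hypothetical file): identify_object("photo_of_a_dog.jpg")
```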
While a single inference is quick, serving millions of users in real time drives up latency and cost and requires optimized hardware. AI-specialized Graphics Processing Units (GPUs) and Google's Tensor Processing Units (TPUs) are designed to handle these workloads efficiently, and orchestrating them with Google Kubernetes Engine helps increase throughput and lower latency.
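Serving is what makes this scale possible: the trained model is packaged and exposed as an API that many clients can call, as in the "AI serving" column of the table above. Below is a minimal sketch of that idea using Flask; a production deployment would add containerization, autoscaling, and accelerator-backed nodes (for example on GKE). The route name and payload shape are assumptions.

```python
# Minimal serving sketch: expose the trained model as an HTTP API.
from flask import Flask, jsonify, request
import tensorflow as tf

app = Flask(__name__)
model = tf.keras.applications.MobileNetV2(weights="imagenet")  # loaded once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # Assumed request body: {"instances": [[...224x224x3 pixel values...]]}
    instances = tf.constant(request.get_json()["instances"], dtype=tf.float32)
    instances = tf.keras.applications.mobilenet_v2.preprocess_input(instances)
    preds = model(instances).numpy()  # the forward pass (the inference itself)

    results = []
    for row in tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=1):
        _, name, score = row[0]
        results.append({"label": name, "score": float(score)})
    return jsonify({"predictions": results})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```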
This is the most common approach, where inference runs on powerful remote servers in a data center. The cloud offers immense scalability and computational resources, making it ideal for handling massive datasets and complex models. Within the cloud, there are typically two primary modes of inference: batch inference, which processes large datasets on a schedule, and real-time (online) inference, which answers individual requests as they arrive.
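A batch inference job typically reads a large dataset, runs the model over it in chunks, and writes all the predictions back out, for example on a nightly schedule. Here is a minimal sketch; the file names, feature columns, chunk size, and model file are assumptions for illustration.

```python
# Minimal batch-inference sketch: score a large CSV of records in chunks
# and write the predictions out for later use.
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model("churn_model.keras")          # hypothetical trained model
FEATURES = ["tenure_months", "monthly_spend", "support_tickets"]  # assumed feature columns

with open("predictions.csv", "w") as out:
    out.write("customer_id,churn_score\n")
    # Stream the dataset in chunks so the job scales to very large files.
    for chunk in pd.read_csv("customers.csv", chunksize=10_000):
        scores = model.predict(chunk[FEATURES].to_numpy(), verbose=0)
        for customer_id, score in zip(chunk["customer_id"], scores.ravel()):
            out.write(f"{customer_id},{score:.4f}\n")
```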
This approach performs inference directly on the device where the data is generated, such as a smartphone or an industrial sensor. By avoiding a round trip to the cloud, edge inference offers unique advantages: minimal latency, enhanced privacy, offline capability, and reduced bandwidth costs.
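As a sketch of what on-device inference can look like, the snippet below runs a converted model with the TensorFlow Lite interpreter, which is designed for phones and embedded hardware. The model file and its input are placeholders.

```python
# On-device inference sketch with TensorFlow Lite.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder: a model converted for edge use
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input matching the model's expected shape, standing in for live
# camera or sensor data that never leaves the device.
input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()  # the forward pass runs locally, with no network hop
result = interpreter.get_tensor(output_details[0]["index"])
print(result)
```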
To help you choose the best approach for your specific needs, here’s a quick comparison of the key characteristics and use cases for each type of AI inference:
| Feature | Batch inference | Real-time inference | Edge inference |
| --- | --- | --- | --- |
| Primary location | Cloud (data centers) | Cloud (data centers) | Local device (such as phone, IoT sensor, robot) |
| Latency/responsiveness | High (predictions returned after processing the batch) | Very low (milliseconds to seconds per request) | Extremely low (near-instantaneous, no network hop) |
| Data volume | Large datasets (such as terabytes) | Individual events/requests | Individual events/requests (on-device) |
| Data flow | Data sent to cloud, processed, results returned | Each request sent to cloud, processed, returned | Data processed on device, results used on device |
| Typical use cases | Large-scale document categorization, overnight financial analysis, periodic predictive maintenance | Product recommendations, chatbots, live translation, real-time fraud alerts | Autonomous driving, smart cameras, offline voice assistants, industrial quality control |
| Key benefits | Cost-effective for large, non-urgent tasks | Immediate responsiveness for user-facing apps | Minimal latency, enhanced privacy, offline capability, reduced bandwidth costs |
AI inference is transforming industries by enabling new levels of automation, smarter decision-making, and innovative applications. For enterprise developers, here are some critical areas where inference delivers tangible business value:
Google Cloud offers a comprehensive suite of tools and services that help developers and organizations build, deploy, and manage AI inference workloads efficiently and at scale. Inference capabilities are deeply integrated across many offerings:
| Google Cloud product | Inference approach supported | Ideal when you need to | Example inference use case |
| --- | --- | --- | --- |
| Google Kubernetes Engine (GKE) | All inference types (cloud and hybrid) | Gain ultimate control and flexibility to deploy, manage, and scale custom containerized inference services, often with specialized hardware, across cloud or hybrid environments. | Deploy and scale a bespoke AI model for real-time anomaly detection in a complex industrial system. |
| Cloud Run | Real-time cloud inference (serverless) | Deploy containerized models with auto-scaling to zero and pay-per-request pricing, ideal for highly variable, intermittent workloads or simple web services. | Serve a small-to-medium-sized model for a web application where traffic fluctuates widely, ensuring cost-efficiency. |
| Cloud GPUs | Real-time and batch cloud inference | Get flexible, high-performance acceleration for a wide range of AI models and frameworks. | Rapidly process high-resolution images for medical diagnosis or accelerate complex financial modeling. |
| BigQuery ML | Batch cloud inference (data warehouse) | Perform inference directly on data already in your data warehouse using SQL, eliminating data movement. | Predict customer churn directly on your CRM data within BigQuery. |
| Pre-trained AI APIs | Real-time cloud inference (specific tasks) | Easily embed advanced AI capabilities (like vision, language, speech) into applications without building or training any models. | Automatically translate customer chat messages in real time or understand sentiment from social media posts. |
| Cloud TPUs | Real-time and batch cloud inference (large models) | Achieve maximum performance and cost-efficiency when serving very large, complex deep learning models, especially large language models (LLMs). | Power the real-time responses of a cutting-edge generative AI chatbot. |
| Edge solutions (such as Coral, GDC Edge) | Edge inference | Enable ultra-low latency, enhanced privacy, or offline functionality by running models directly on devices. | Perform instant object recognition on a smart camera without sending video to the cloud. |
| Dataflow | Data preparation for batch cloud inference | Efficiently process and prepare vast amounts of data for large-scale batch inference jobs. | Pre-process petabytes of sensor data before feeding it into a predictive maintenance model. |
Vertex AI stands as Google Cloud's unified AI platform. It provides comprehensive tools for building, deploying, and managing ML models, making it the go-to service for most cloud-based inference needs.
| Vertex AI feature | Inference approach | Ideal when you need to | Example inference use case |
| --- | --- | --- | --- |
| Online prediction endpoints | Real-time cloud inference | Deploy custom models and get real-time, low-latency predictions from a managed endpoint. | Instantly recommend products to a user browsing a website. |
| Batch prediction | Batch cloud inference | Process large datasets cost-effectively without needing real-time results. | Analyze all customer transactions from yesterday to detect fraud patterns. |
| Model Garden and pre-trained foundation models | Real-time and batch cloud inference (generative AI) | Quickly leverage powerful pre-trained models for common or generative AI tasks without training from scratch. | Generate marketing copy, summarize long documents, or create code snippets. |
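As an illustration, once a custom model is deployed to a Vertex AI endpoint, getting a real-time prediction from application code can look like the sketch below, using the Vertex AI SDK for Python. The project, region, endpoint ID, and instance schema are placeholders, not values from this article.

```python
# Sketch: call a model deployed to a Vertex AI endpoint for an online prediction.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholder project and region

endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"  # placeholder endpoint ID
)

# One "instance" per prediction request; the schema depends on how the model was built.
response = endpoint.predict(instances=[{"tenure_months": 18, "monthly_spend": 42.0}])
print(response.predictions)
```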
Ready to take your AI inference skills to the next level? Here are some valuable resources to help you learn more and get started:
Start building on Google Cloud with $300 in free credits and 20+ always free products.