According to an ESG report, Google Cloud can deliver 18-60% cost savings versus other cloud-based Apache Spark alternatives.

Apache Spark on Google Cloud

Develop and run Apache Spark where you need it across all use cases, including data science, ETL, and exploration. Google Cloud offers the industry’s first autoscaling serverless Spark, integrated with the best of Google-native and open source tools.


Dun & Bradstreet cuts data workflows to minutes; boosts product response times by 60%

Benefits

Increase developer productivity and get faster data insights

Seamless Spark for all data users

Spark is integrated with BigQuery, Vertex AI, and your own IDEs, so you can write and run Spark code from those interfaces in two clicks. This eliminates the need for custom integrations and streamlines ETL, data exploration, analysis, and ML.

Operational simplicity with serverless Spark

Write Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning. A universal catalog unifies business, technical, and runtime metadata for all of your data.
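
To make the "no provisioning" idea concrete, here is a minimal sketch of how a serverless Spark batch might be submitted from a script. It only assembles the argument list for the `gcloud dataproc batches submit pyspark` command and does not run it. The helper name `build_batch_command`, the bucket path, the region, and the executor cap are illustrative assumptions, not values from this page.

```python
import shlex


def build_batch_command(main_file, region, max_executors=None, job_args=()):
    """Assemble argv for `gcloud dataproc batches submit pyspark`.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = [
        "gcloud", "dataproc", "batches", "submit", "pyspark",
        main_file,
        f"--region={region}",
    ]
    if max_executors is not None:
        # Optional upper bound for Spark dynamic allocation; without it the
        # service applies its own default when scaling the batch.
        argv.append(
            f"--properties=spark.dynamicAllocation.maxExecutors={max_executors}"
        )
    if job_args:
        # Everything after "--" is passed through to the PySpark script.
        argv.append("--")
        argv.extend(job_args)
    return argv


if __name__ == "__main__":
    # Assumed example values: replace with your own bucket, script, and region.
    cmd = build_batch_command(
        "gs://example-bucket/jobs/etl.py",
        "us-central1",
        max_executors=50,
        job_args=["--date=2024-01-01"],
    )
    print(shlex.join(cmd))
```

Note that no cluster name, machine type, or worker count appears anywhere in the command: the batch describes only the job, which is the point of the serverless model.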

Run Spark your preferred way

One size does not fit all. Google Cloud gives you the flexibility to choose between serverless, managed clusters, and compute clusters for your Spark workloads.

Key features

Run Spark jobs that autoscale, from the interface of your choice, in two clicks

Serverless Spark in BigQuery

Serverless Spark in BigQuery (preview), powered by Dataproc Serverless, provides an integrated experience to run Apache Spark and SQL workloads from BigQuery, with unified security, runtime metadata, and governance. Improve collaboration and maximize productivity with integrated CI/CD and tooling without the need to deploy and manage Apache Spark clusters.

Managed Spark clusters with Dataproc

Dataproc is a fully managed and highly scalable service that simplifies the complexities of deploying and operating Spark, along with a vast ecosystem of other open-source tools. Its integration with the broader Google Cloud platform, coupled with a cost-effective pricing model, makes it ideal for tackling data lake modernization, efficient ETL pipelines, and secure, large-scale data science initiatives. Dataproc empowers you to focus on data insights rather than infrastructure management.

Data science with serverless Spark

Serverless Spark accelerates data science by automating infrastructure. Focus on your code, not cluster management. Automatic scaling and seamless integration with BigQuery and Vertex AI streamline workflows, enabling faster iteration and model development. The latest libraries for serverless Spark enable more use cases with less configuration on your part. The latest code samples for data scientists include building a pipeline for predicting customer churn using Apache Spark, XGBoost, and the Hugging Face Transformers library.

Spark through Vertex AI

Spark for data science in one click: Data scientists can use Spark for development from Vertex AI Workbench seamlessly, with built-in security. Spark is integrated with Vertex AI's MLOps features, where users can execute Spark code through notebook executors that are integrated with Vertex AI Pipelines.

Open source table format support

Dataproc now offers compatibility with open source table formats like Apache Iceberg and Delta Lake. You can use Iceberg and Delta Lake tables with Spark and Hive on Dataproc, unlocking a powerful combination for managing and analyzing large datasets. To add Iceberg support, select it through the Optional components feature when you create a Dataproc cluster.
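
As a rough illustration of what this setup can look like, the sketch below builds two things: the argument list for creating a cluster with an optional component, and the Spark properties that register an Iceberg catalog. The Spark property keys and class names come from the Apache Iceberg Spark integration. The component identifier `ICEBERG`, the helper names, the catalog name, and the bucket path are assumptions for illustration; check the Dataproc documentation for the identifiers your image version supports.

```python
def build_cluster_command(name, region, components=("ICEBERG",)):
    """Assemble argv for `gcloud dataproc clusters create` (hypothetical helper).

    The default component name "ICEBERG" is an assumption; verify it
    against the Dataproc optional components documentation.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",
        "--optional-components=" + ",".join(components),
    ]


def iceberg_catalog_properties(catalog, warehouse_uri):
    """Return Spark properties that register a Hadoop-type Iceberg catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO and CALL.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" keeps table metadata directly in the warehouse path.
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
    }


if __name__ == "__main__":
    # Assumed example values: cluster name, region, and bucket are placeholders.
    print(" ".join(build_cluster_command("example-cluster", "us-central1")))
    props = iceberg_catalog_properties("lake", "gs://example-bucket/warehouse")
    for key, value in props.items():
        print(f"{key}={value}")
```

With a catalog registered this way, Spark SQL would address tables as `lake.<namespace>.<table>`. A Hadoop-type catalog is the simplest option to show here; a metastore-backed catalog is another common choice.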


Partners