Develop and run Apache Spark where you need it across all use cases, including data science, ETL, and exploration. Google Cloud offers the industry’s first autoscaling serverless Spark, integrated with the best of Google-native and open source tools.
Benefits
Operational simplicity with serverless Spark
Write Spark applications and pipelines that autoscale without any manual infrastructure provisioning or tuning. A universal catalog unifies business, technical, and runtime metadata for all of your data.
Run Spark your preferred way
One size does not fit all. Google Cloud gives you the flexibility to choose between serverless, managed clusters, and compute clusters for your Spark workloads.
Key features
Serverless Spark in BigQuery (preview), powered by Dataproc Serverless, provides an integrated experience to run Apache Spark and SQL workloads from BigQuery, with unified security, runtime metadata, and governance. Improve collaboration and maximize productivity with integrated CI/CD and tooling without the need to deploy and manage Apache Spark clusters.
Dataproc is a fully managed and highly scalable service that simplifies the complexities of deploying and operating Spark, along with a vast ecosystem of other open-source tools. Its integration with the broader Google Cloud platform, coupled with a cost-effective pricing model, makes it ideal for tackling data lake modernization, efficient ETL pipelines, and secure, large-scale data science initiatives. Dataproc empowers you to focus on data insights rather than infrastructure management.
Serverless Spark accelerates data science by automating infrastructure, so you can focus on your code rather than cluster management. Automatic scaling and integration with BigQuery and Vertex AI streamline workflows, enabling faster iteration and model development. The latest libraries for serverless Spark enable more use cases with less configuration, and the latest code samples for data scientists include a pipeline for predicting customer churn using Apache Spark, XGBoost, and the Hugging Face Transformers library.
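As a rough illustration of what "no cluster management" can look like in practice, the sketch below assembles a `gcloud dataproc batches submit pyspark` command for a serverless batch and caps autoscaling through Spark's standard dynamic allocation properties. The helper name `serverless_batch_command`, the script path, and the region are hypothetical placeholders, and the property values are assumptions to adapt rather than recommended settings.

```python
from typing import List


def serverless_batch_command(script: str, region: str, max_executors: int) -> List[str]:
    """Build an argv list for submitting a PySpark script as a serverless batch.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if max_executors < 1:
        raise ValueError("max_executors must be at least 1")

    # Standard Spark dynamic allocation properties; the cap is an assumption
    # chosen by the caller, not a service default.
    properties = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    joined = ",".join(f"{key}={value}" for key, value in properties.items())

    return [
        "gcloud", "dataproc", "batches", "submit", "pyspark", script,
        f"--region={region}",
        f"--properties={joined}",
    ]


if __name__ == "__main__":
    # Placeholder script path and region, for illustration only.
    print(" ".join(serverless_batch_command("churn_pipeline.py", "us-central1", 20)))
```

Passing the resulting list to `subprocess.run` would submit the batch, assuming the gcloud CLI is installed and authenticated for your project.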
Spark for data science in one click: Data scientists can use Spark for development from Vertex AI Workbench seamlessly, with built-in security. Spark is integrated with Vertex AI's MLOps features, where users can execute Spark code through notebook executors that are integrated with Vertex AI Pipelines.
Dataproc now offers compatibility with open source table formats like Apache Iceberg and Delta Lake. You can use the Iceberg and Delta Lake components with Spark and Hive on Dataproc to manage and analyze large datasets. To install these components, select them through the Optional components feature when you create a Dataproc cluster.
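Once the Iceberg libraries are available to Spark, a session still needs to be told where the Iceberg catalog lives. The sketch below collects the Spark properties that Apache Iceberg documents for registering a catalog. The catalog name `demo`, the bucket path, and the helper names `iceberg_catalog_conf` and `apply_conf` are assumptions for illustration, and a Hadoop-type catalog is only one of several catalog types you might choose.

```python
from typing import Dict


def iceberg_catalog_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Return Spark properties that register an Iceberg catalog named `catalog`.

    Hypothetical helper; `warehouse` is a storage path such as a gs:// URI.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions in the Spark session.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Registers the catalog implementation under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumption: a Hadoop-type catalog that keeps metadata in the warehouse path.
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Apply each property to an object exposing a `config(key, value)` method,
    such as PySpark's `SparkSession.builder`."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


if __name__ == "__main__":
    # Placeholder bucket name, for illustration only.
    for key, value in iceberg_catalog_conf("demo", "gs://example-bucket/warehouse").items():
        print(f"{key}={value}")
```

On a cluster where PySpark is installed, `apply_conf(SparkSession.builder, conf).getOrCreate()` would yield a session in which tables can be addressed as `demo.<namespace>.<table>`.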
Apache Spark is a trademark of The Apache Software Foundation.