Learn how to use Serverless for Apache Spark to submit a batch workload on a Dataproc-managed compute infrastructure that scales resources as needed.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the Dataproc API.
Submit a Spark batch workload
You can use the Google Cloud console, the Google Cloud CLI, or the Serverless for Apache Spark API to create and submit a Serverless for Apache Spark batch workload.
Console
In the Google Cloud console, go to Dataproc Batches.
Click Create.
Submit a Spark batch workload that computes the approximate value of pi by selecting and filling in the following fields:
- Batch Info:
  - Batch ID: Specify an ID for your batch workload. This value must be 4-63 lowercase characters. Valid characters are /[a-z][0-9]-/.
  - Region: Select a region where your workload will run.
- Container:
  - Batch type: Spark.
  - Runtime version: The default runtime version is selected. You can optionally specify a non-default Serverless for Apache Spark runtime version.
  - Main class: org.apache.spark.examples.SparkPi
  - Jar files: file:///usr/lib/spark/examples/jars/spark-examples.jar (this file is pre-installed in the Serverless for Apache Spark execution environment).
  - Arguments: 1000.
- Execution Configuration: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
- Network configuration: Select a subnetwork in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the specified subnet. For network connectivity requirements, see Google Cloud Serverless for Apache Spark network configuration.
- Properties: Enter the Key (property name) and Value of supported Spark properties to set on your Spark batch workload. Note: Unlike Dataproc on Compute Engine cluster properties, Serverless for Apache Spark workload properties don't include a spark: prefix.
- Other options:
  - You can configure the batch workload to use an external self-managed Hive Metastore.
  - You can use a Persistent History Server (PHS). The PHS must be located in the region where you run batch workloads.
Click Submit to run the Spark batch workload.
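The Batch ID rule described in the Console steps can be checked before you submit. The following is a minimal sketch, not an official validator: the function name is hypothetical, and the pattern encodes only the rule stated on this page (4-63 characters drawn from lowercase letters, digits, and hyphens). The service may apply further constraints that this sketch does not cover.

```python
import re

# Assumption: only the rule stated on this page is encoded here
# (4-63 characters; lowercase letters, digits, and hyphens).
BATCH_ID_PATTERN = re.compile(r"[a-z0-9-]{4,63}")


def is_valid_batch_id(batch_id: str) -> bool:
    """Return True if batch_id satisfies the documented length and character rule."""
    return BATCH_ID_PATTERN.fullmatch(batch_id) is not None
```

For example, is_valid_batch_id("pi-batch-01") returns True, while an ID containing uppercase letters or fewer than four characters returns False.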
gcloud
To submit a Spark batch workload to compute the approximate value of pi, run the following gcloud dataproc batches submit spark command locally in a terminal window or in Cloud Shell.
gcloud dataproc batches submit spark \
    --region=REGION \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    --class=org.apache.spark.examples.SparkPi \
    -- 1000
Replace the following:
- REGION: Specify the region where your workload will run.
- Other options: You can add gcloud dataproc batches submit spark flags to specify other workload options and Spark properties.
  - --version: You can specify a non-default Serverless for Apache Spark runtime version.
  - --jars: The example JAR file is pre-installed in the Spark execution environment. The 1000 command argument passed to the SparkPi workload specifies 1000 iterations of the pi estimation logic (workload input arguments are included after the "-- ").
  - --subnet: You can add this flag to specify the name of a subnet in the session region. If you don't specify a subnet, Serverless for Apache Spark selects the default subnet in the session region. Serverless for Apache Spark enables Private Google Access (PGA) on the subnet. For network connectivity requirements, see Google Cloud Serverless for Apache Spark network configuration.
  - --properties: You can add this flag to enter supported Spark properties for your Spark batch workload to use.
  - --deps-bucket: You can add this flag to specify a Cloud Storage bucket where Serverless for Apache Spark will upload workload dependencies. The gs:// URI prefix of the bucket is not required; you can specify the bucket path or bucket name. Serverless for Apache Spark uploads the local file(s) to a /dependencies folder in the bucket before running the batch workload. Note: This flag is required if your batch workload references files on your local machine.
  - --ttl: You can add the --ttl flag to specify the duration of the batch lifetime. When the workload exceeds this duration, it is unconditionally terminated without waiting for ongoing work to finish. Specify the duration using an s, m, h, or d (seconds, minutes, hours, or days) suffix. The minimum value is 10 minutes (10m), and the maximum value is 14 days (14d).
    - 1.1 or 2.0 runtime batches: If --ttl is not specified for a 1.1 or 2.0 runtime batch workload, the workload is allowed to run until it exits naturally (or run forever if it does not exit).
    - 2.1+ runtime batches: If --ttl is not specified for a 2.1 or later runtime batch workload, it defaults to 4h.
  - --service-account: You can specify a service account to use to run your workload. If you don't specify a service account, the workload runs under the Compute Engine default service account. Your service account must have the Dataproc Worker role.
  - Hive Metastore: The following command configures a batch workload to use an external self-managed Hive Metastore using a standard Spark configuration.
    gcloud dataproc batches submit spark \
        --properties=spark.sql.catalogImplementation=hive,spark.hive.metastore.uris=METASTORE_URI,spark.hive.metastore.warehouse.dir=WAREHOUSE_DIR \
        other args ...
  - Persistent History Server:
    - The following command creates a PHS on a single-node Dataproc cluster. The PHS must be located in the region where you run batch workloads, and the Cloud Storage bucket-name must exist.
      gcloud dataproc clusters create PHS_CLUSTER_NAME \
          --region=REGION \
          --single-node \
          --enable-component-gateway \
          --properties=spark:spark.history.fs.logDirectory=gs://bucket-name/phs/*/spark-job-history
    - Submit a batch workload, specifying your running Persistent History Server.
      gcloud dataproc batches submit spark \
          --region=REGION \
          --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
          --class=org.apache.spark.examples.SparkPi \
          --history-server-cluster=projects/project-id/regions/region/clusters/PHS-cluster-name \
          -- 1000
  - Runtime version: Use the --version flag to specify the Serverless for Apache Spark runtime version for the workload.
    gcloud dataproc batches submit spark \
        --region=REGION \
        --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
        --class=org.apache.spark.examples.SparkPi \
        --version=VERSION \
        -- 1000
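If you script submissions, the optional flags above can be assembled programmatically. The following is a sketch under stated assumptions, not part of the gcloud CLI: the helper names ttl_to_seconds and build_submit_command are hypothetical, and the TTL check handles only a single integer followed by one s, m, h, or d suffix, applying the 10-minute minimum and 14-day maximum described above. It only builds the argument list and does not run gcloud.

```python
import re
from typing import Dict, List, Optional, Sequence

# Seconds per unit for the s, m, h, and d suffixes described above.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
MIN_TTL_SECONDS = 10 * 60       # 10m, the documented minimum
MAX_TTL_SECONDS = 14 * 86400    # 14d, the documented maximum


def ttl_to_seconds(ttl: str) -> int:
    """Convert a value such as '4h' to seconds, enforcing the documented bounds.

    Assumption: only '<integer><suffix>' values are handled in this sketch.
    """
    match = re.fullmatch(r"(\d+)([smhd])", ttl)
    if match is None:
        raise ValueError(f"unsupported TTL format: {ttl!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    if not MIN_TTL_SECONDS <= seconds <= MAX_TTL_SECONDS:
        raise ValueError(f"TTL must be between 10m and 14d, got {ttl!r}")
    return seconds


def build_submit_command(
    region: str,
    version: Optional[str] = None,
    ttl: Optional[str] = None,
    properties: Optional[Dict[str, str]] = None,
    args: Sequence[str] = ("1000",),
) -> List[str]:
    """Build the argument list for the SparkPi example shown above."""
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "spark",
        f"--region={region}",
        "--jars=file:///usr/lib/spark/examples/jars/spark-examples.jar",
        "--class=org.apache.spark.examples.SparkPi",
    ]
    if version:
        cmd.append(f"--version={version}")
    if ttl:
        ttl_to_seconds(ttl)  # validate only; the flag keeps its original text
        cmd.append(f"--ttl={ttl}")
    if properties:
        joined = ",".join(f"{key}={value}" for key, value in properties.items())
        cmd.append(f"--properties={joined}")
    # Workload input arguments follow the "--" separator.
    cmd.append("--")
    cmd.extend(args)
    return cmd
```

The returned list can be passed to subprocess.run in an environment where the gcloud CLI is installed and authenticated. Validating the TTL locally surfaces an out-of-range value before the request reaches the service.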
API
This section shows how to create a batch workload to compute the approximate value of pi using the Serverless for Apache Spark batches.create method.
Before using any of the request data, make the following replacements:
- project-id: Your Google Cloud project ID. Project IDs are listed in the Project info section on the Google Cloud console Dashboard.
- region: A Compute Engine region where Google Cloud Serverless for Apache Spark will run the workload.
HTTP method and URL:
POST https://dataproc.googleapis.com/v1/projects/project-id/locations/region/batches
Request JSON body:
{
  "sparkBatch": {
    "args": [
      "1000"
    ],
    "jarFileUris": [
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ],
    "mainClass": "org.apache.spark.examples.SparkPi"
  }
}
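As one way to issue this request from a script, the sketch below assembles the POST request with Python's standard library. It is illustrative only: the function name build_create_batch_request is hypothetical, and it assumes the caller already holds an OAuth 2.0 access token (for example, the output of gcloud auth print-access-token). The function builds the request object without sending it.

```python
import json
import urllib.request


def build_create_batch_request(
    project_id: str, region: str, access_token: str
) -> urllib.request.Request:
    """Build (but do not send) the batches.create request for the SparkPi example."""
    url = (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/locations/{region}/batches"
    )
    body = {
        "sparkBatch": {
            "args": ["1000"],
            "jarFileUris": [
                "file:///usr/lib/spark/examples/jars/spark-examples.jar"
            ],
            "mainClass": "org.apache.spark.examples.SparkPi",
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the caller supplies a valid OAuth 2.0 access token.
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Keeping construction separate from sending makes the URL and body easy to inspect first.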
You should receive a JSON response similar to the following:
{
  "name": "projects/project-id/locations/region/batches/batch-id",
  "uuid": "uuid",
  "createTime": "2021-07-22T17:03:46.393957Z",
  "sparkBatch": {
    "mainClass": "org.apache.spark.examples.SparkPi",
    "args": [
      "1000"
    ],
    "jarFileUris": [
      "file:///usr/lib/spark/examples/jars/spark-examples.jar"
    ]
  },
  "runtimeInfo": {
    "outputUri": "gs://dataproc-.../driveroutput"
  },
  "state": "SUCCEEDED",
  "stateTime": "2021-07-22T17:06:30.301789Z",
  "creator": "account-email-address",
  "runtimeConfig": {
    "version": "2.3",
    "properties": {
      "spark:spark.executor.instances": "2",
      "spark:spark.driver.cores": "2",
      "spark:spark.executor.cores": "2",
      "spark:spark.app.name": "projects/project-id/locations/region/batches/batch-id"
    }
  },
  "environmentConfig": {
    "peripheralsConfig": {
      "sparkHistoryServerConfig": {}
    }
  },
  "operation": "projects/project-id/regions/region/operation-id"
}
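A script that consumes this response typically needs only a few fields. The following sketch pulls out the batch ID, state, driver output location, and runtime version. The function name summarize_batch and the keys of the returned dictionary are hypothetical; the JSON field names are the ones shown in the sample response above.

```python
import json
from typing import Any, Dict


def summarize_batch(response_text: str) -> Dict[str, Any]:
    """Extract a few commonly needed fields from a batch resource JSON string."""
    batch = json.loads(response_text)
    return {
        # The batch ID is the last path segment of the resource name.
        "batch_id": batch["name"].rsplit("/", 1)[-1],
        "state": batch.get("state"),
        # Optional sections are read defensively in case they are absent.
        "output_uri": batch.get("runtimeInfo", {}).get("outputUri"),
        "runtime_version": batch.get("runtimeConfig", {}).get("version"),
    }
```

Fields other than name are read with get, so a response that lacks runtimeInfo or runtimeConfig yields None for those entries instead of raising an error.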
Estimate workload costs
Serverless for Apache Spark workloads consume Data Compute Unit (DCU) and shuffle storage resources. For an example that outputs Dataproc UsageMetrics to estimate workload resource consumption and costs, see Serverless for Apache Spark pricing.