Autotuning Spark batch workloads

This document provides information about autotuning Serverless for Apache Spark batch workloads. Optimizing a Spark workload for performance and resiliency can be challenging due to the number of Spark configuration options and the difficulty of assessing how those options impact a workload. Serverless for Apache Spark autotuning provides an alternative to manual workload configuration by automatically applying Spark configuration settings to a recurring Spark workload based on Spark optimization best practices and an analysis of workload runs (called "cohorts").

Sign up for Serverless for Apache Spark autotuning

To sign up for access to the Serverless for Apache Spark autotuning preview release described on this page, complete and submit the Dataproc Preview access request signup form. After the form is approved, projects listed in the form have access to preview features.

Benefits

Serverless for Apache Spark autotuning can provide the following benefits:

  • Auto-optimization: Automatically tune inefficient Serverless for Apache Spark batch and Spark configuration settings, which can reduce job runtimes.
  • Historical learning: Learn from recurring runs to apply recommendations tailored to your workload.

Autotuning cohorts

Autotuning is applied to recurring executions (cohorts) of a batch workload.

The cohort name that you specify when you submit a batch workload identifies it as one of the successive runs of the recurring workload.

Autotuning is applied to batch workload cohorts as follows:

  • Autotuning is calculated and applied to the second and subsequent cohorts of a workload. Autotuning is not applied to the first run of a recurring workload because Serverless for Apache Spark autotuning uses workload history for optimization.

  • Autotuning is not applied retroactively to running workloads; it is applied only to newly submitted workloads.

  • Autotuning learns and improves over time by analyzing the cohort statistics. To allow the system to gather sufficient data, we recommend keeping autotuning enabled for at least five runs.

Cohort names: A recommended practice is to use cohort names that help to identify the recurring workload type. For example, you might use daily_sales_aggregation as the cohort name for a scheduled workload that runs a daily sales aggregation task.
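
For example, submitting the same workload on successive days with the same cohort name marks the runs as one cohort. The following sketch uses the gcloud command described later in this document; the script URI and region are placeholders:

# First run: establishes the cohort's workload history; no autotuning is applied yet.
gcloud dataproc batches submit pyspark gs://BUCKET/daily_sales_aggregation.py \
    --region=us-central1 \
    --cohort=daily_sales_aggregation \
    --autotuning-scenarios=auto

# Second and later runs with the same cohort name: autotuning settings are
# calculated from the earlier runs and applied automatically.
gcloud dataproc batches submit pyspark gs://BUCKET/daily_sales_aggregation.py \
    --region=us-central1 \
    --cohort=daily_sales_aggregation \
    --autotuning-scenarios=auto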

Autotuning scenarios

When applicable, autotuning automatically selects and applies the following optimization scenarios to a batch workload (a configuration sketch follows this list):

  • Scaling: Spark autoscaling configuration settings.
  • Join optimization: Spark configuration settings to optimize SQL broadcast join performance.
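
The gcloud and API examples later in this document enable every applicable scenario by specifying auto or AUTO. Because the scenarios field in the API is a list, you can in principle name individual scenarios instead. The following sketch assumes enum values named SCALING and BROADCAST_HASH_JOIN, which are inferred from the scenario descriptions above and are not confirmed by this guide:

...
runtimeConfig:
  cohort: COHORT_NAME
  autotuningConfig:
    scenarios:
    - SCALING               # assumed name for the scaling scenario
    - BROADCAST_HASH_JOIN   # assumed name for the join optimization scenario
...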

Use Serverless for Apache Spark autotuning

You can enable Serverless for Apache Spark autotuning on a batch workload by using the Google Cloud console, the Google Cloud CLI, the Dataproc API, or the Cloud Client Libraries.

Console

To enable Serverless for Apache Spark autotuning on each submission of a recurring batch workload, perform the following steps:

  1. In the Google Cloud console, go to the Dataproc Batches page.

    Go to Dataproc Batches

  2. To create a batch workload, click Create.

  3. In the Autotuning section:

    • Toggle the Enable button to enable autotuning for the Spark workload.

    • Cohort: Fill in the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.

  4. Fill in other sections of the Create batch page as needed, then click Submit. For more information about these fields, see Submit a batch workload.

gcloud

To enable Serverless for Apache Spark autotuning on each submission of a recurring batch workload, run the following gcloud dataproc batches submit command locally in a terminal window or in Cloud Shell.

gcloud dataproc batches submit COMMAND \
    --region=REGION \
    --cohort=COHORT \
    --autotuning-scenarios=auto  \
    other arguments ...

Replace the following:

  • COMMAND: the Spark workload type: spark, pyspark, spark-r, or spark-sql.
  • REGION: the region where your batch workload will run.
  • COHORT: the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads that are submitted with this cohort name. For example, specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.
  • --autotuning-scenarios=auto: Enable autotuning.

API

To enable Serverless for Apache Spark autotuning on each submission of a recurring batch workload, submit a batches.create request that includes the following fields:

  • RuntimeConfig.cohort: the cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.
  • AutotuningConfig.scenarios: Specify AUTO to enable autotuning on the Spark batch workload.

Example:

...
runtimeConfig:
  cohort: COHORT_NAME
  autotuningConfig:
    scenarios:
    - AUTO
...
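
As a sketch, a minimal batches.create REST request that sets these fields might look like the following; the project, region, batch ID, and script URI are placeholders, and the request body uses the JSON field names that correspond to RuntimeConfig.cohort and AutotuningConfig.scenarios:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
        "pysparkBatch": {
          "mainPythonFileUri": "gs://BUCKET/daily_sales_aggregation.py"
        },
        "runtimeConfig": {
          "cohort": "daily_sales_aggregation",
          "autotuningConfig": {
            "scenarios": ["AUTO"]
          }
        }
      }' \
  "https://dataproc.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/batches?batchId=BATCH_ID"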

Java

Before trying this sample, follow the Java setup instructions in the Serverless for Apache Spark quickstart using client libraries. For more information, see the Serverless for Apache Spark Java API reference documentation.

To authenticate to Serverless for Apache Spark, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

To enable Serverless for Apache Spark autotuning on each submission of a recurring batch workload, call BatchControllerClient.createBatch with a CreateBatchRequest that includes the following fields:

  • Batch.RuntimeConfig.cohort: The cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, you might specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.
  • Batch.RuntimeConfig.AutotuningConfig.scenarios: Specify AUTO to enable autotuning on the Spark batch workload.

Example:

...
Batch batch =
  Batch.newBuilder()
    .setRuntimeConfig(
      RuntimeConfig.newBuilder()
        .setCohort("daily_sales_aggregation")
        .setAutotuningConfig(
          AutotuningConfig.newBuilder()
            .addScenarios(Scenario.AUTO))
    ...
  .build();

batchControllerClient.createBatch(
    CreateBatchRequest.newBuilder()
        .setParent(parent)
        .setBatchId(batchId)
        .setBatch(batch)
        .build());
...

To use the API, you must use google-cloud-dataproc client library version 4.43.0 or later. You can use one of the following configurations to add the library to your project.

Maven

<dependencies>
 <dependency>
   <groupId>com.google.cloud</groupId>
   <artifactId>google-cloud-dataproc</artifactId>
   <version>4.43.0</version>
 </dependency>
</dependencies>

Gradle

implementation 'com.google.cloud:google-cloud-dataproc:4.43.0'

SBT

libraryDependencies += "com.google.cloud" % "google-cloud-dataproc" % "4.43.0"

Python

Before trying this sample, follow the Python setup instructions in the Serverless for Apache Spark quickstart using client libraries. For more information, see the Serverless for Apache Spark Python API reference documentation.

To authenticate to Serverless for Apache Spark, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

To enable Serverless for Apache Spark autotuning on each submission of a recurring batch workload, call BatchControllerClient.create_batch with a Batch that includes the following fields:

  • batch.runtime_config.cohort: The cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, you might specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.
  • batch.runtime_config.autotuning_config.scenarios: Specify AUTO to enable autotuning on the Spark batch workload.

Example:

from google.cloud import dataproc_v1

# Create a client
client = dataproc_v1.BatchControllerClient()

# Initialize request argument(s)
batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://bucket/run_tpcds.py"
batch.runtime_config.cohort = "daily_sales_aggregation"
# Scenario is nested in the AutotuningConfig message.
batch.runtime_config.autotuning_config.scenarios = [
    dataproc_v1.AutotuningConfig.Scenario.AUTO
]

request = dataproc_v1.CreateBatchRequest(
    parent="parent_value",
    batch=batch,
)

# Make the request
operation = client.create_batch(request=request)

To use the API, you must use google-cloud-dataproc client library version 5.10.1 or later. To add it to your project, you can use the following requirement:

google-cloud-dataproc>=5.10.1
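
For example, you can install or upgrade the library with pip:

pip install --upgrade "google-cloud-dataproc>=5.10.1"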

Airflow

Instead of submitting each autotuned batch cohort manually, you can use Airflow to schedule the submission of each recurring batch workload. To do this, configure a DataprocCreateBatchOperator with a batch that includes the following fields:

  • batch.runtime_config.cohort: The cohort name, which identifies the batch as one of a series of recurring workloads. Autotuning is applied to the second and subsequent workloads submitted with this cohort name. For example, you might specify daily_sales_aggregation as the cohort name for a scheduled batch workload that runs a daily sales aggregation task.
  • batch.runtime_config.autotuning_config.scenarios: Specify AUTO to enable autotuning on the Spark batch workload.

Example:

create_batch = DataprocCreateBatchOperator(
    task_id="batch_create",
    batch={
        "pyspark_batch": {
            "main_python_file_uri": PYTHON_FILE_LOCATION,
        },
        "environment_config": {
            "peripherals_config": {
                "spark_history_server_config": {
                    "dataproc_cluster": PHS_CLUSTER_PATH,
                },
            },
        },
        "runtime_config": {
            "cohort": "daily_sales_aggregation",
            "autotuning_config": {
                "scenarios": [
                    Scenario.AUTO,
                ]
            }
        },
    },
    batch_id="BATCH_ID",
)
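
The operator runs inside a DAG that controls when each cohort run is submitted. The following is a minimal scheduling sketch; the DAG ID, schedule, project, region, and script URI are placeholder assumptions for illustration, not values from this guide:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator
from google.cloud import dataproc_v1

with DAG(
    dag_id="daily_sales_aggregation_dag",  # assumed DAG name
    schedule="@daily",  # submit one cohort run per day (schedule_interval on older Airflow versions)
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    create_batch = DataprocCreateBatchOperator(
        task_id="batch_create",
        project_id="PROJECT_ID",  # placeholder
        region="REGION",  # placeholder
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://BUCKET/daily_sales_aggregation.py",  # placeholder
            },
            "runtime_config": {
                "cohort": "daily_sales_aggregation",
                "autotuning_config": {
                    # Scenario is nested in the AutotuningConfig message.
                    "scenarios": [dataproc_v1.AutotuningConfig.Scenario.AUTO],
                },
            },
        },
        batch_id="BATCH_ID",
    )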

To use the API, you must use google-cloud-dataproc client library version 5.10.1 or later. You can use the following Airflow environment requirement:

google-cloud-dataproc>=5.10.1

To update the package in Cloud Composer, see Install Python dependencies for Cloud Composer.

View autotuning changes

To view Serverless for Apache Spark autotuning changes to a batch workload, run the gcloud dataproc batches describe command.
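
For example (the batch ID and region are placeholders):

gcloud dataproc batches describe BATCH_ID --region=REGION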

The gcloud dataproc batches describe output is similar to the following:

...
runtimeInfo:
  propertiesInfo:
    # Properties set by autotuning.
    autotuningProperties:
      spark.dataproc.sql.broadcastJoin.hints:
        annotation: Converted 1 Sort-Merge Joins to Broadcast Hash Join
        value: v2;Inner,<hint>
      spark.dynamicAllocation.initialExecutors:
        annotation: Adjusted Initial executors based on stages submitted in first
          2 minutes to 9
        overriddenValue: '2'
        value: '9'
      spark.dynamicAllocation.maxExecutors:
        annotation: Tuned Max executors to 11
        overriddenValue: '5'
        value: '11'
      spark.dynamicAllocation.minExecutors:
        annotation: Changed Min executors to 9
        overriddenValue: '2'
        value: '9'
...

You can view the latest autotuning changes that were applied to a running, completed, or failed workload on the Batch details page in the Google Cloud console, under the Summary tab.

Autotuning summary panel.

Pricing

Serverless for Apache Spark autotuning is offered during private preview without additional charge. Standard Serverless for Apache Spark pricing applies.