Jobs Will Get Stuck in the RECEIVED JobStatus When Job Count Increases #86

@braden808

Description

Important note: All Aggregation Service jobs use the same dataset and run in debug mode.

Problem

I am working on a system that uses the Aggregation Service with up to 5000 jobs invoked at once. Some Aggregation Service job instances are getting stuck in the RECEIVED state. I first saw this when invoking 400 Aggregation Service jobs at once: around 390 of them complete, and 10+ instances remain stuck in RECEIVED. Because a stuck job is in RECEIVED rather than in a processing state, its EC2 instance is never released, and the job never reaches the FAILED state, since the worker did not pull in the SQS message and never started work. There are no useful logs. I looked into one EC2 instance and did not see any CPU or network resources being pinned. The ASG works properly: it scales up and creates 400 EC2 instances. In SQS, the messages stay in flight for hours with no change to the JobStatus.
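A minimal sketch of how the stuck jobs could be flagged automatically, assuming each job record carries `job_request_id`, `job_status`, and `request_received_at` fields shaped like a getJob response. The field names, the one-hour cutoff, the sample IDs, and the timestamp format are assumptions for illustration, not values taken from the runs above.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_jobs(jobs, now, max_received_age=timedelta(hours=1)):
    """Return the IDs of jobs that are still RECEIVED after max_received_age.

    `jobs` is a list of dicts with the assumed keys `job_request_id`,
    `job_status`, and `request_received_at` (ISO 8601 UTC, no fractional
    seconds; other timestamp formats would need their own parsing).
    """
    stuck = []
    for job in jobs:
        if job["job_status"] != "RECEIVED":
            continue
        # Normalize a trailing "Z" so datetime.fromisoformat accepts it.
        received_at = datetime.fromisoformat(
            job["request_received_at"].replace("Z", "+00:00")
        )
        if now - received_at > max_received_age:
            stuck.append(job["job_request_id"])
    return stuck


# Hypothetical records; real ones would come from polling the job status API.
sample_jobs = [
    {"job_request_id": "job-a", "job_status": "RECEIVED",
     "request_received_at": "2024-05-01T10:00:00Z"},
    {"job_request_id": "job-b", "job_status": "FINISHED",
     "request_received_at": "2024-05-01T10:00:00Z"},
    {"job_request_id": "job-c", "job_status": "RECEIVED",
     "request_received_at": "2024-05-01T12:30:00Z"},
]
check_time = datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)
stuck_ids = find_stuck_jobs(sample_jobs, check_time)
print(stuck_ids)
```

Only `job-a` is reported here: `job-b` has finished, and `job-c` has been in RECEIVED for less than the cutoff.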

These jobs all use the same data, run in debug mode, and have a 100% error threshold.

When running 800 jobs at once, only 29 of the SQS messages appear to be in flight at a time, even with 800 EC2 instances up and running. The last few messages then end up stuck in the RECEIVED state.

I also ran another experiment with 2000 jobs invoked at once. All 2000 EC2 instances were created, but none of the SQS messages went in flight, and the JobStatus for every job stayed RECEIVED.
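For anyone reproducing this, a rough sketch of how the in-flight counts can be read from SQS. It uses the boto3 `get_queue_attributes` call with the `ApproximateNumberOfMessages` (visible) and `ApproximateNumberOfMessagesNotVisible` (in flight) attributes. The queue URL is a placeholder, the helper names are hypothetical, and the sample numbers are illustrative rather than captured output.

```python
def summarize_queue(attributes):
    """Turn SQS queue attributes (string values) into integer counts."""
    visible = int(attributes.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", 0))
    return {"visible": visible, "in_flight": in_flight, "total": visible + in_flight}


def fetch_queue_attributes(queue_url, region="us-east-1"):
    """Read the approximate message counts for one queue (needs AWS credentials)."""
    import boto3  # imported lazily so summarize_queue works without boto3 installed

    sqs = boto3.client("sqs", region_name=region)
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    return response["Attributes"]


# Illustrative attributes in the shape SQS returns them (values are strings).
# A live check would call fetch_queue_attributes("<job queue URL>") instead.
example = summarize_queue({
    "ApproximateNumberOfMessages": "771",
    "ApproximateNumberOfMessagesNotVisible": "29",
})
print(example)
```

Both counts are approximate by design in SQS, so a single reading is only a snapshot; sampling them periodically during a run gives a better picture of how many workers are actually picking up messages.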

I will repeat this experiment with an earlier release of the Aggregation Service. I also filed a ticket with the Aggregation Service support email as mentioned in issue #53, but wanted to add this information here as well. I greatly appreciate the support.

Specifics

Aggregation Service Version: 2.10.0
Cloud Provider: AWS
Execution Information:
A sample of Execution IDs that got stuck in RECEIVED state:
JobKey: 2434549e-d9ed-4dfa-aba3-75ed12cc25dd, ServerJobId: 02ede574-b861-4661-90c1-c6d30ad7500a
JobKey: b21a2800-60b3-4436-a4d2-78f4264cc350, ServerJobId: 0de54602-459e-4da8-9a45-4b9d9420f221
JobKey: bb2f691a-ba02-4203-877f-de7b0ea0efd9, ServerJobId: 52cd9c3d-6b1c-41c8-8fc2-33550350763c

Terraform Config:

region      = "us-east-1"
environment = "prod-aggregation-server"

# Total resources available affected by instance_type -- actual resources used
# is affected by enclave_cpu_count / enclave_memory_mib. All 3 values should be
# updated at the same time.
instance_type      = "m5.12xlarge" # 48 cores, 192 GiB
enclave_cpu_count  = 44            # Leave 4 vCPUs to host OS (minimum required).
enclave_memory_mib = 184320        # 180 GiB. ~192GiB are available on m5.12xlarge, leave 12 GiB for host OS.

max_job_num_attempts_parameter      = "1"
max_job_processing_time_parameter   = "21600" # 6 hours time out
frontend_api_max_latency_ms         = "10000"
coordinator_a_assume_role_parameter = "REMOVED FOR PRIVACY"
coordinator_b_assume_role_parameter = "REMOVED FOR PRIVACY"

min_capacity_ec2_instances = "0"
max_capacity_ec2_instances = "5000"

alarm_notification_email = "REMOVED FOR PRIVACY"
allowed_otel_metrics = ["cpu_usage", "memory", "total_execution_time"] #Doc: https://github.com/privacysandbox/aggregation-service/blob/main/docs/telemetry.md#how-to-enable-metricstraces-collection
