Jobs Will Get Stuck in the RECEIVED JobStatus When Job Count Increases #86

#### Important note: I am using the same dataset for all Aggregation Service jobs, and all jobs run in debug mode

### Problem

I am working on a system that uses the Aggregation Service with up to 5000 jobs invoked at once. Some Aggregation Service jobs are getting stuck in the RECEIVED state. I first saw this when invoking 400 jobs at once: around 390 of them complete, and 10 or more remain stuck in RECEIVED.

Because a stuck job stays in RECEIVED and never moves to a processing state, its EC2 instance is never released, and the job never reaches the FAILED state, since the worker did not pull the SQS message and never started work. There are no useful logs. I inspected one EC2 instance and did not see CPU or network resources pinned. The ASG appears to work properly: it scales up and creates 400 EC2 instances. In SQS, the messages stay in flight for hours with no change to the JobStatus.

These jobs all use the same data, run in debug mode, and have a 100% error threshold.
 
When running 800 jobs at once, it seems that only 29 SQS messages are in flight at a time, even with 800 EC2 instances up and running. The last few messages then end up stuck in the RECEIVED state.
 
I also ran another experiment invoking 2000 jobs. All 2000 EC2 instances were created, but none of the SQS messages went in flight, and the JobStatus for every job stayed RECEIVED.
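
To make the in-flight numbers above easier to reproduce, here is a minimal sketch of how the queue could be sampled. It assumes a boto3-style SQS client is passed in; `summarize_queue` is a hypothetical helper name and the queue URL is whatever the deployment created. `ApproximateNumberOfMessages` and `ApproximateNumberOfMessagesNotVisible` are the standard SQS attributes for visible and in-flight messages.

```python
from typing import Any, Dict


def summarize_queue(sqs_client: Any, queue_url: str) -> Dict[str, int]:
    """Return approximate visible and in-flight message counts for one queue.

    sqs_client is assumed to behave like boto3.client("sqs").
    """
    response = sqs_client.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",            # waiting to be received
            "ApproximateNumberOfMessagesNotVisible",  # received but not yet deleted
        ],
    )
    attrs = response["Attributes"]  # SQS returns the counts as strings
    return {
        "visible": int(attrs["ApproximateNumberOfMessages"]),
        "in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"]),
    }
```

Calling this every minute or so with `boto3.client("sqs", region_name="us-east-1")` and the job queue URL would show whether the in-flight count plateaus (as with the 29 messages in the 800-job run) or stays at zero (as in the 2000-job run). The counts are approximate by design, so small fluctuations are expected.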
 
I will repeat this experiment with an earlier release of the Aggregation Service. I also filed a ticket with the Aggregation Service support email, as mentioned in issue #53, but wanted to add this information here as well. I greatly appreciate the support.
 
### Specifics
**Aggregation Service Version**: 2.10.0
**Cloud Provider**: AWS
**Execution Information**:
A sample of Execution IDs that got stuck in RECEIVED state:
JobKey: `2434549e-d9ed-4dfa-aba3-75ed12cc25dd`, ServerJobId: `02ede574-b861-4661-90c1-c6d30ad7500a`
JobKey: `b21a2800-60b3-4436-a4d2-78f4264cc350`, ServerJobId: `0de54602-459e-4da8-9a45-4b9d9420f221`
JobKey: `bb2f691a-ba02-4203-877f-de7b0ea0efd9`, ServerJobId: `52cd9c3d-6b1c-41c8-8fc2-33550350763c`
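
For anyone trying to collect a similar sample, the following is a rough sketch of how stuck jobs could be picked out of a batch of job records. The record shape is an assumption for illustration only: each record is a dict with hypothetical keys `job_request_id`, `job_status`, and `received_at` (a timezone-aware `datetime`), which you would populate from whatever the job status lookup returns in your setup.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List


def find_stuck_jobs(
    jobs: Iterable[Dict[str, Any]],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return IDs of jobs still in RECEIVED after more than max_age.

    The dict keys used here are hypothetical, not a documented schema.
    """
    stuck = []
    for job in jobs:
        if job["job_status"] != "RECEIVED":
            continue  # finished or in-progress jobs are not of interest
        if now - job["received_at"] > max_age:
            stuck.append(job["job_request_id"])
    return stuck


# Example with made-up records: one old RECEIVED job, one fresh, one finished.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
sample = [
    {"job_request_id": "job-a", "job_status": "RECEIVED",
     "received_at": now - timedelta(hours=3)},
    {"job_request_id": "job-b", "job_status": "RECEIVED",
     "received_at": now - timedelta(minutes=5)},
    {"job_request_id": "job-c", "job_status": "FINISHED",
     "received_at": now - timedelta(hours=3)},
]
stuck_ids = find_stuck_jobs(sample, now, max_age=timedelta(hours=1))
```

The age threshold matters because every job is briefly in RECEIVED right after submission; only jobs that stay there well past normal pickup time count as stuck.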
 
Terraform Config:
```
region      = "us-east-1"
environment = "prod-aggregation-server"

# Total resources available affected by instance_type -- actual resources used
# is affected by enclave_cpu_count / enclave_memory_mib. All 3 values should be
# updated at the same time.
instance_type      = "m5.12xlarge" # 48 cores, 192 GiB
enclave_cpu_count  = 44            # Leave 4 vCPUs to host OS (minimum required).
enclave_memory_mib = 184320        # 180 GiB. ~192GiB are available on m5.12xlarge, leave 12 GiB for host OS.

max_job_num_attempts_parameter      = "1"
max_job_processing_time_parameter   = "21600" # 6 hours time out
frontend_api_max_latency_ms         = "10000"
coordinator_a_assume_role_parameter = "REMOVED FOR PRIVACY"
coordinator_b_assume_role_parameter = "REMOVED FOR PRIVACY"

min_capacity_ec2_instances = "0"
max_capacity_ec2_instances = "5000"

alarm_notification_email = "REMOVED FOR PRIVACY"
allowed_otel_metrics = ["cpu_usage", "memory", "total_execution_time"] #Doc: https://github.com/privacysandbox/aggregation-service/blob/main/docs/telemetry.md#how-to-enable-metricstraces-collection
```
