SIGSEGV in PerfEvents::walk when profiling Renaissance akka-uct benchmark #1319

@WadeWalker

Describe the bug

Hi guys,

I love async-profiler, so thank you all sincerely for the work you do!

Anyway, on to the bug report :) When I profile the Renaissance sub-benchmark akka-uct on Amazon EC2 m6a.metal instances, I fairly reliably get a HotSpot SIGSEGV (error file at hs_err_pid4147.log):

C [libasyncProfiler.so+0x18109] PerfEvents::walk(int, void*, void const**, int, StackContext*) [clone .constprop.1221]+0x29

It happens on other instance types too (m6i.metal, m7a.metal, m7g.metal, m7i.metal), so it affects both x86 and arm64. But it's most common on the m6a.metal instances, which are AMD EPYC 7R13 machines.

The error is somewhat intermittent. On most instance types a run might fail once or twice, but it eventually succeeds. On m6a.metal, it has never run to completion for me.

I've tried to give a very complete description below, but please let me know if you need any more info.

Thanks!

Expected vs. actual behavior

The expected behavior is for the profiled run to complete. The actual behavior is a SIGSEGV partway through the run. The unsuccessful output of the benchmark looks like this:

====== akka-uct (concurrency) [default], iteration 0 started ======
GC before operation: completed in 24.973 ms, heap usage 62.562 MB -> 11.223 MB.
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000079116ac9d109, pid=20286, tid=22946
#
# JRE version: OpenJDK Runtime Environment Temurin-24.0.1+9 (24.0.1+9) (build 24.0.1+9)
# Java VM: OpenJDK 64-Bit Server VM Temurin-24.0.1+9 (24.0.1+9, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libasyncProfiler.so+0x18109]  PerfEvents::walk(int, void*, void const**, int, StackContext*) [clone .constprop.1221]+0x29
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/ubuntu/core.20286)
#
# An error report file with more information is saved as:
# /home/ubuntu/hs_err_pid20286.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues

Successful output looks like this:

====== akka-uct (concurrency) [default], iteration 0 started ======
GC before operation: completed in 16.653 ms, heap usage 56.787 MB -> 8.097 MB.
final size= 199991
final height= 10
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
====== akka-uct (concurrency) [default], iteration 0 completed (12863.032 ms) ======

< successful iterations deleted >

====== akka-uct (concurrency) [default], iteration 11 started ======
GC before operation: completed in 321.469 ms, heap usage 4.100 GB -> 103.590 MB.
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
final size= 199991
final height= 9
====== akka-uct (concurrency) [default], iteration 11 completed (10860.299 ms) ======
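For comparing crashes across runs or instance types, it may help to pull the problematic frame out of each hs_err file. Below is a minimal sketch, assuming the log layout shown in the failing output above (a "# Problematic frame:" line followed by the frame itself). The function names problematic_frame and frame_offset are hypothetical helpers, not part of async-profiler or the JDK.

```shell
# Hypothetical helper: print the frame line that follows "# Problematic frame:"
# in a HotSpot fatal-error log, with the leading "# " stripped.
problematic_frame() {
    awk '/^# Problematic frame:/ { getline; sub(/^# +/, ""); print; exit }' "$1"
}

# Hypothetical helper: print the module-relative offset of that frame
# (the hex value inside the first "[module+0x...]" bracket).
frame_offset() {
    problematic_frame "$1" | sed -n 's/.*\[[^]+]*+\(0x[0-9a-fA-F]*\)].*/\1/p'
}
```

For example, frame_offset /home/ubuntu/hs_err_pid20286.log would print the offset from the frame line. That offset could then be handed to a symbolizer such as addr2line -f -C -e <path to libasyncProfiler.so> <offset>, assuming binutils is installed and the library carries enough symbol information.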

Reproduction Steps

Here are the reproduction instructions, starting from a totally empty new EC2 instance. It may also work on an already-set-up Ubuntu machine, but I haven't tried that.

  1. Provision a fresh Amazon EC2 m6a.metal instance with the standard Ubuntu 24.04 AMI. Give it an EBS volume of 20 GB or so, to hold the JDK and benchmark we're going to install on it.
  2. Copy this setup script to the machine: set-up-instance.txt
  3. Log into the machine, rename the script with mv set-up-instance.txt set-up-instance.sh, then run source set-up-instance.sh. This installs Java and async-profiler.
  4. Run source ~/.profile to put the JDK on your path (the setup script adds a line to .profile).
  5. Download the benchmark with curl -L -O "https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.16.0/renaissance-gpl-0.16.0.jar"
  6. Run the benchmark with java -agentpath:/home/ubuntu/async-profiler-4.0-linux-x64/lib/libasyncProfiler.so=start,event=cpu,file=profile.html -jar renaissance-gpl-0.16.0.jar akka-uct -t 120 --json results.json
  7. Observe the failure :) If you don't see it at first, try running again, or increasing -t 120 to a larger value to run longer.
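Since the crash is intermittent, step 7 can be automated by re-running the command until a new hs_err file shows up. This is only a sketch: run_until_crash and count_hs_err are hypothetical helper names, and it assumes the JVM writes hs_err_pid*.log into the current working directory, as it did in the run above.

```shell
# Hypothetical helper: count hs_err_pid*.log files in the current directory.
count_hs_err() {
    local n=0 f
    for f in hs_err_pid*.log; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Hypothetical helper: run a command up to MAX times, saving each attempt's
# output to run-<attempt>.out. Stops at the first attempt that leaves a new
# hs_err file behind and prints that attempt number; returns 1 if none did.
run_until_crash() {
    local max=$1 attempt=1 before after
    shift
    while [ "$attempt" -le "$max" ]; do
        before=$(count_hs_err)
        "$@" > "run-$attempt.out" 2>&1
        after=$(count_hs_err)
        if [ "$after" -gt "$before" ]; then
            echo "$attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

With these defined, the command from step 6 can be passed through unchanged, for example: run_until_crash 5 java -agentpath:/home/ubuntu/async-profiler-4.0-linux-x64/lib/libasyncProfiler.so=start,event=cpu,file=profile.html -jar renaissance-gpl-0.16.0.jar akka-uct -t 120 --json results.json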

Additional Information/Context

No response

Async-profiler version

4.0

Environment details

OS: Ubuntu 24.04, using Amazon's standard image at the path /aws/service/canonical/ubuntu/server/24.04/stable/current/

JDK: OpenJDK24U-jdk_x64_linux_hotspot_24.0.1_9

CPU: AMD EPYC 7R13 48-Core Processor (so amd64)

The application is running on an AWS EC2 metal instance, not in a container.
