
Fix bug in evaluation code #66


Open
wants to merge 1 commit into base: main

Conversation

xjtupanda

Thank you for open-sourcing your work and making it easy to reproduce results.

This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:

(1) In the V* Bench evaluation, the options should be shuffled (as in the annotation file provided by the official repo) instead of always placing the correct choice at 'A'. Here's the difference (a sketch of the shuffling step follows the numbers below):

No shuffle:
Run 1:
"direct_attributes": 91.30434782608695,
"relative_position": 85.52631578947368,
"overall": 89.00523560209425

Run 2:
"direct_attributes": 92.17391304347827,
"relative_position": 90.78947368421053,
"overall": 91.62303664921467

Shuffle options:
Run 1:
"direct_attributes": 86.08695652173914,
"relative_position": 81.57894736842105,
"overall": 84.29319371727748

Run 2:
"direct_attributes": 86.08695652173914,
"relative_position": 85.52631578947368,
"overall": 85.86387434554975

The variance might be attributed to a small sample size (only 191 samples in total for this bench) and the model itself.
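For reference, here is a minimal, text-only sketch of the shuffling step. The function name, the `options` list, and the `correct_idx` / letter-remapping convention are illustrative assumptions, not the exact schema of the official V* Bench annotation file:

```python
import random

def shuffle_mcq_options(options, correct_idx, seed=None):
    """Shuffle MCQ options and return the shuffled list plus the new answer letter.

    `options` is a list of option strings; `correct_idx` indexes the ground-truth
    option before shuffling. Names are illustrative, not the official schema.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    answer_letter = chr(ord("A") + order.index(correct_idx))
    return shuffled, answer_letter

# The correct choice no longer always ends up at "A":
options = ["The color of the flag is white.", "The color of the flag is red."]
shuffled, answer = shuffle_mcq_options(options, correct_idx=1, seed=0)
prompt = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(shuffled))
```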

(2) In the HRBench evaluation, the rule-based check should test whether the option string appears in the prediction result, rather than the option letter itself (in DeepEyes/eval/judge_result_hrbench.py):

# Before: matching the bare option letter, which can fire on an unrelated prediction
# elif answer in pred_ans:
# After: match the full option string instead
elif answer_str in pred_ans:
    acc_reward = 1.0

Here's an example of a False Positive:

hr_bench_8k
No.30:
"question": "Which side of the car is the person sitting on?", 
"answer": "B", 
"answer_str": "Front (hood)", 
"pred_ans": "A. Back (trunk)"

The model predicts a wrong answer ("A. Back (trunk)") that happens to contain the correct option letter ('B', inside "Back"), so it is falsely scored as correct.
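To make the failure mode concrete, here is a minimal, self-contained sketch of the check. The function name and surrounding structure are simplified assumptions; only the `answer_str in pred_ans` comparison mirrors the fix above:

```python
def rule_based_check(answer, answer_str, pred_ans):
    """Simplified sketch of the HRBench rule-based judge, not the full logic.

    `answer` is the option letter (e.g. "B"), `answer_str` the option text,
    and `pred_ans` the model prediction string.
    """
    # Buggy: a bare letter such as "B" can appear inside a wrong prediction
    # like "A. Back (trunk)" and be scored as correct.
    # if answer in pred_ans: return 1.0
    # Fixed: match the full option string instead.
    if answer_str in pred_ans:
        return 1.0
    return 0.0

# The false positive from the example above:
assert "B" in "A. Back (trunk)"                   # buggy check fires
assert "Front (hood)" not in "A. Back (trunk)"    # fixed check does not
assert rule_based_check("B", "Front (hood)", "A. Back (trunk)") == 0.0
```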

Here's the difference made by the fix:

Before fixing:
"hr_bench_4k": {
    "single": 0.915,
    "cross": 0.585,
    "overall": 0.75
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.565,
    "overall": 0.70625
}

After fixing:
"hr_bench_4k": {
    "single": 0.91,
    "cross": 0.565,
    "overall": 0.7375
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.54,
    "overall": 0.6937500000000001
}

@JaaackHongggg
Contributor

Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice at 'A'.

Thanks for pointing out the bug in evaluating HRBench. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

@xjtupanda
Author

> Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice at 'A'.
>
> Thanks for pointing out the bug in evaluating HRBench. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

Thanks for your prompt reply, but I'm afraid that's not the case. The official setting (https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py) adopts a likelihood-based evaluation method: it evaluates the likelihood of each option, P(option | question), and chooses the one with the largest likelihood. The answer is therefore invariant to the position of the option in the prompt (i.e., independent of the letters "A", "B", "C", "D"). For example,

"question": "Is the flag red or white?",
"options": [
"The color of the flag is white.",
"The color of the flag is red."
]

If they evaluate the likelihood and find that:
P("The color of the flag is white." | "Is the flag red or white?") > P("The color of the flag is red." |"Is the flag red or white?")

Then they would consider that the model picked the option "The color of the flag is white." This is different from placing the choices in the prompt and asking the model to pick one. Again, the final annotation file provided by the V* Bench curators casts the prompt into a common MCQ format and shuffles the options.
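For clarity, here is a rough, text-only sketch of what such a likelihood-based protocol looks like, written against Hugging Face transformers purely for illustration; the official vstar_bench_eval.py is multimodal and differs in its details:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_option_by_likelihood(model, tokenizer, question, options):
    """Score each option by its log-likelihood given the question and return the best.

    Illustrative sketch only: no option letters are ever shown to the model,
    so the result cannot depend on where the correct choice would be placed.
    """
    scores = []
    for opt in options:
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + opt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Shift so position i predicts token i+1, then sum log-probs of the option tokens.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        opt_start = prompt_ids.shape[1]  # approximate question/option boundary
        scores.append(token_lls[:, opt_start - 1:].sum().item())
    return options[scores.index(max(scores))]

# Usage sketch (model choice is arbitrary here):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# pick_option_by_likelihood(model, tokenizer,
#     "Is the flag red or white?",
#     ["The color of the flag is white.", "The color of the flag is red."])
```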
