
Fix bug in evaluation code #66


Open
wants to merge 1 commit into base: main

Conversation

xjtupanda

Thank you for open-sourcing your work and making it easy to reproduce results.

This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:

(1) In the V* Bench evaluation, the options should be shuffled (as in the annotation file provided by the official repo) instead of always placing the correct choice at 'A'. Here's the difference (a sketch of the shuffling step follows the numbers below):

No shuffle:
Run 1:
"direct_attributes": 91.30434782608695,
"relative_position": 85.52631578947368,
"overall": 89.00523560209425

Run 2:
"direct_attributes": 92.17391304347827,
"relative_position": 90.78947368421053,
"overall": 91.62303664921467

Shuffle options:
Run 1:
"direct_attributes": 86.08695652173914,
"relative_position": 81.57894736842105,
"overall": 84.29319371727748

Run 2:
"direct_attributes": 86.08695652173914,
"relative_position": 85.52631578947368,
"overall": 85.86387434554975

The variance might be attributed to a small sample size (only 191 samples in total for this bench) and the model itself.
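For reference, here is a minimal, text-only sketch of the shuffling step. The function name, the `options` list, and the `correct_idx` / letter-remapping convention are illustrative assumptions, not the exact schema of the official V* Bench annotation file:

```python
import random

def shuffle_mcq_options(options, correct_idx, seed=None):
    """Shuffle MCQ options and return the shuffled list plus the new answer letter.

    `options` is a list of option strings; `correct_idx` indexes the ground-truth
    option before shuffling. Names are illustrative, not the official schema.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    answer_letter = chr(ord("A") + order.index(correct_idx))
    return shuffled, answer_letter

# The correct choice no longer always ends up at "A":
options = ["The color of the flag is white.", "The color of the flag is red."]
shuffled, answer = shuffle_mcq_options(options, correct_idx=1, seed=0)
prompt = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(shuffled))
```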

(2) In the HRBench evaluation, the rule-based check should test whether the option string appears in the prediction result, rather than the option letter itself (in DeepEyes/eval/judge_result_hrbench.py):

# Before: matching the bare option letter, which can fire on an unrelated prediction
# elif answer in pred_ans:
# After: match the full option string instead
elif answer_str in pred_ans:
    acc_reward = 1.0

Here's an example of a False Positive:

hr_bench_8k
No.30:
"question": "Which side of the car is the person sitting on?", 
"answer": "B", 
"answer_str": "Front (hood)", 
"pred_ans": "A. Back (trunk)"

The model predicts a wrong answer ("A. Back (trunk)") that happens to contain the correct option letter ('B', inside "Back"), so it is falsely scored as correct.
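To make the failure mode concrete, here is a minimal, self-contained sketch of the check. The function name and surrounding structure are simplified assumptions; only the `answer_str in pred_ans` comparison mirrors the fix above:

```python
def rule_based_check(answer, answer_str, pred_ans):
    """Simplified sketch of the HRBench rule-based judge, not the full logic.

    `answer` is the option letter (e.g. "B"), `answer_str` the option text,
    and `pred_ans` the model prediction string.
    """
    # Buggy: a bare letter such as "B" can appear inside a wrong prediction
    # like "A. Back (trunk)" and be scored as correct.
    # if answer in pred_ans: return 1.0
    # Fixed: match the full option string instead.
    if answer_str in pred_ans:
        return 1.0
    return 0.0

# The false positive from the example above:
assert "B" in "A. Back (trunk)"                   # buggy check fires
assert "Front (hood)" not in "A. Back (trunk)"    # fixed check does not
assert rule_based_check("B", "Front (hood)", "A. Back (trunk)") == 0.0
```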

Here's the difference made by the fix:

Before fixing:
"hr_bench_4k": {
    "single": 0.915,
    "cross": 0.585,
    "overall": 0.75
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.565,
    "overall": 0.70625
}

After fixing:
"hr_bench_4k": {
    "single": 0.91,
    "cross": 0.565,
    "overall": 0.7375
},
"hr_bench_8k": {
    "single": 0.8475,
    "cross": 0.54,
    "overall": 0.6937500000000001
}

@JaaackHongggg
Contributor

Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice at 'A'.

Thanks for pointing out the bug in evaluating HRBench. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

@xjtupanda
Author

> Hi, for the evaluation of Vstar, we follow the official setting in https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py. We also noticed that existing works assess Vstar by putting the correct choice at 'A'.
>
> Thanks for pointing out the bug in evaluating HRBench. I think the performance difference caused by this bug is not particularly large. We will fix it. Thanks a lot.

Thanks for your prompt reply, but I'm afraid that's not the case. The official setting (https://github.com/penghao-wu/vstar/blob/main/vstar_bench_eval.py) adopts a likelihood-based evaluation method: it evaluates the likelihood of each option, P(option | question), and chooses the one with the largest likelihood. The answer is therefore invariant to the position of the option in the prompt (i.e., independent of the letters "A", "B", "C", "D"). For example,

"question": "Is the flag red or white?",
"options": [
"The color of the flag is white.",
"The color of the flag is red."
]

If they evaluate the likelihood and find that:
P("The color of the flag is white." | "Is the flag red or white?") > P("The color of the flag is red." |"Is the flag red or white?")

Then they would consider that the model picked the option "The color of the flag is white." This is different from placing the choices in the prompt and asking the model to pick one. Again, the final annotation file provided by the V* Bench curators casts the prompt into a common MCQ format and shuffles the options.
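For clarity, here is a rough, text-only sketch of what such a likelihood-based protocol looks like, written against Hugging Face transformers purely for illustration; the official vstar_bench_eval.py is multimodal and differs in its details:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_option_by_likelihood(model, tokenizer, question, options):
    """Score each option by its log-likelihood given the question and return the best.

    Illustrative sketch only: no option letters are ever shown to the model,
    so the result cannot depend on where the correct choice would be placed.
    """
    scores = []
    for opt in options:
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids
        full_ids = tokenizer(question + " " + opt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # Shift so position i predicts token i+1, then sum log-probs of the option tokens.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = full_ids[:, 1:]
        token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        opt_start = prompt_ids.shape[1]  # approximate question/option boundary
        scores.append(token_lls[:, opt_start - 1:].sum().item())
    return options[scores.index(max(scores))]

# Usage sketch (model choice is arbitrary here):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# pick_option_by_likelihood(model, tokenizer,
#     "Is the flag red or white?",
#     ["The color of the flag is white.", "The color of the flag is red."])
```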
