Fix bug in evaluation code #66
Thank you for open-sourcing your work and making it easy to reproduce results.
This PR mainly fixes some bugs in the code and inappropriate evaluation settings that may cause overestimated results:
(1) In the V Star Benchmark evaluation, the options should be shuffled (as in the annotation file provided by the official repo) instead of always placing the correct choice at 'A'. Here's the difference:
The variance might be attributed to the small sample size (only 191 samples in total for this benchmark) and to the model itself.
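For reference, here is a minimal sketch of the kind of shuffling this fix applies; the `shuffle_options` helper and the field names (`options`, `answer_idx`) are illustrative, not the repo's actual code:

```python
import random
from typing import Optional

def shuffle_options(question: dict, seed: Optional[int] = None) -> dict:
    """Shuffle the multiple-choice options so the answer is not always 'A'.

    `question` is assumed to hold an `options` list and the index of the
    correct option under `answer_idx` (illustrative field names).
    """
    rng = random.Random(seed)
    options = list(question["options"])
    correct = options[question["answer_idx"]]

    rng.shuffle(options)

    shuffled = dict(question)
    shuffled["options"] = options
    shuffled["answer_idx"] = options.index(correct)
    # Ground-truth letter after shuffling, e.g. 'C' instead of always 'A'.
    shuffled["answer_letter"] = "ABCD"[shuffled["answer_idx"]]
    return shuffled
```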
(2) In the HRBench evaluation, the rule-based check in `DeepEyes/eval/judge_result_hrbench.py` should check whether the full option string appears in the prediction result, instead of only the option letter. Here's an example of a false positive:
The model predicts a wrong result ("A. Back (trunk)") that accidentally contains the correct option letter ('B', inside the word "Back"), so it is falsely counted as correct.
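A minimal sketch of a stricter check along these lines; the `is_correct` helper and the exact matching rules are illustrative, not the actual diff in `judge_result_hrbench.py`:

```python
import re

def is_correct(prediction: str, answer_letter: str, answer_text: str) -> bool:
    """Stricter rule-based check (illustrative, not the repo's exact logic).

    The buggy check was roughly `answer_letter in prediction`, which marks
    "A. Back (trunk)" as correct when the answer is 'B', because the letter
    'B' appears inside the word "Back".
    """
    pred = prediction.strip()

    # Accept the prediction if it contains the full option text.
    if answer_text and answer_text.lower() in pred.lower():
        return True

    # Otherwise require the letter as a standalone choice token, e.g.
    # "B", "B.", "(B)" -- not a letter buried inside another word.
    pattern = rf"(?<![A-Za-z])\(?{re.escape(answer_letter)}\)?(?![A-Za-z])"
    return re.search(pattern, pred) is not None
```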
Here's the difference made by the fix: