
Performance of Baseline Qwen 2.5 VL-7B on V* Benchmark #91

@zwyang6

Description


Hi, thanks for this brilliant open-source project!

When I tried to evaluate Qwen 2.5 VL-7B on the V* Benchmark, I found that my results are considerably lower than the ones reported in Table 1.

  • Mine are
    "direct_attributes": 62.60869565217392,
    "relative_position": 64.47368421052632,
    "overall": 63.35078534031413,

  • While yours are 73.9 (direct_attributes), 67.1 (relative_position), and 71.2 (overall).
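
For reference, my overall number is just the sample-weighted mean of the two category scores; a quick check below, assuming the usual V* split of 115 direct_attributes and 76 relative_position questions:

    # Consistency check for the overall score (assumes the 115/76 V* category split).
    n_direct, n_relative = 115, 76
    acc_direct, acc_relative = 62.60869565217392, 64.47368421052632
    overall = (acc_direct * n_direct + acc_relative * n_relative) / (n_direct + n_relative)
    print(overall)  # ~63.3508, matching my reported overall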

Do you know of any reasons that could lead to this discrepancy?

Here are my evaluation scripts:

evaluate models

  1. deploy Qwen (a sanity-check sketch follows after this list)

    CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
        --port 18901 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 32768 \
        --tensor-parallel-size 4 \
        --served-model-name "baseline_qwen" \
        --trust-remote-code \
        --disable-log-requests

  2. evaluate Qwen

    MODEL_NAME=Qwen2.5-VL-7B-Instruct
    API_KEY="EMPTY"
    API_URL="http://xxxxx:18901/v1"
    PATH_TO_SAVE_DIR="/codebase/DeepEyes/eval_results"
    MODEL_NAME_VLLM=baseline_qwen
    PATH_TO_VSTAR="/data_public/Vstar"
    CUDA_VISIBLE_DEVICES=4,5,6,7 python /codebase/DeepEyes/eval/eval_vstar.py \
        --model_name $MODEL_NAME \
        --api_key $API_KEY \
        --api_url $API_URL \
        --vstar_bench_path $PATH_TO_VSTAR \
        --save_path $PATH_TO_SAVE_DIR \
        --eval_model_name $MODEL_NAME_VLLM \
        --num_workers 4
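
After step 1 (referenced above), I run a quick sanity check against the OpenAI-compatible endpoint that vLLM exposes before kicking off the full evaluation. This is only a minimal sketch with a placeholder image path and question, not part of the repo's eval code:

    import base64
    from openai import OpenAI

    # vLLM exposes an OpenAI-compatible API on the port used in step 1.
    client = OpenAI(base_url="http://localhost:18901/v1", api_key="EMPTY")

    # Placeholder image; substitute any V* benchmark image.
    with open("sample.jpg", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="baseline_qwen",  # must match --served-model-name from step 1
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "What color is the umbrella in the image?"},
            ],
        }],
        temperature=0.0,
    )
    print(resp.choices[0].message.content)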

calculate scores

  1. deploy Qwen-72B as judge (a sanity-check sketch follows after this list)

    CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /codebase/DeepEyes/pretrained_models/Qwen2.5-72B-Instruct \
        --port 18901 \
        --gpu-memory-utilization 0.8 \
        --max-model-len 32768 \
        --tensor-parallel-size 4 \
        --served-model-name "judge" \
        --trust-remote-code \
        --disable-log-requests

  2. calculate

    MODEL_NAME=Qwen2.5-VL-7B-Instruct
    API_KEY="EMPTY"
    API_URL="http://xxxxx:18901/v1"
    PATH_TO_SAVE_DIR="/codebase/DeepEyes/eval_results"
    MODEL_NAME_VLLM=judge
    PATH_TO_VSTAR="/data_public/Vstar"
    CUDA_VISIBLE_DEVICES=4,5,6,7 python judge_result.py \
        --model_name $MODEL_NAME \
        --api_key $API_KEY \
        --api_url $API_URL \
        --vstar_bench_path $PATH_TO_VSTAR \
        --save_path $PATH_TO_SAVE_DIR \
        --eval_model_name $MODEL_NAME_VLLM \
        --num_workers 4
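
Before running step 2, I also confirm the 72B judge responds under its served name. This is just a minimal text-only sketch against the same OpenAI-compatible vLLM endpoint, not code from the repo:

    from openai import OpenAI

    # Assumes the judge from step 1 is being served locally on port 18901.
    client = OpenAI(base_url="http://localhost:18901/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="judge",  # must match --served-model-name from step 1
        messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
        temperature=0.0,
    )
    print(resp.choices[0].message.content)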
