Leaderboard on WildVideo (Single-Turn)
Accuracy (%) on the single-turn subset (13,704 QA pairs) of WildVideo. Object, Action, and Visual Loc. are Perception categories; Consistency, Causality, Multimodal Ref., and World Knowledge are Cognition categories.
| Model | Object | Action | Visual Loc. | Consistency | Causality | Multimodal Ref. | World Knowledge | Overall |
|---|---|---|---|---|---|---|---|---|
| *Open-Source LMMs* | | | | | | | | |
| Mono-InternVL-2B | 36.2 | 15.6 | 22.2 | 26.9 | 20.7 | 11.0 | 21.3 | 22.0 |
| Video-LLaVA-7B | 43.9 | 25.8 | 23.8 | 40.1 | 36.9 | 27.1 | 45.6 | 34.7 |
| InternVL2-8B | 42.0 | 27.9 | 33.2 | 40.5 | 37.4 | 28.4 | 37.3 | 35.2 |
| VideoLLaMA2-7B | 52.6 | 34.3 | 33.0 | 46.7 | 39.9 | 41.9 | 48.9 | 42.5 |
| Ovis1.6-Gemma-2-9B | 58.5 | 27.6 | 28.6 | 41.6 | 42.5 | 44.0 | 57.7 | 42.9 |
| Qwen2-VL-7B | 52.5 | 30.3 | 36.1 | 47.8 | 47.6 | 36.2 | 50.9 | 43.1 |
| MiniCPM-V2.6-8B | 57.4 | 39.2 | 41.5 | 57.3 | 51.5 | 36.4 | 41.8 | 46.4 |
| LLaVA-Video-7B-Qwen2 | 63.0 | <u>50.7</u> | <u>47.3</u> | 58.9 | **60.6** | 47.6 | 46.0 | 53.4 |
| *Commercial LMMs* | | | | | | | | |
| Claude 3.5 Sonnet | 50.0 | 43.5 | 39.7 | 50.5 | 36.1 | 53.2 | 49.8 | 46.1 |
| Gemini 1.5 Flash | <u>67.7</u> | 38.2 | 42.4 | 50.6 | 33.6 | 49.3 | 66.7 | 49.8 |
| Gemini 1.5 Pro | 67.1 | 50.4 | 43.6 | <u>59.3</u> | 42.6 | <u>54.7</u> | 57.8 | <u>53.7</u> |
| GPT-4V | 57.0 | 39.7 | 34.2 | 54.9 | 39.2 | 54.0 | 69.7 | 49.8 |
| GPT-4o mini | 61.4 | 45.3 | 41.9 | 52.9 | 52.0 | 47.9 | <u>71.8</u> | 53.3 |
| GPT-4o | **68.2** | **54.2** | **51.5** | **66.5** | <u>59.2</u> | **61.4** | **73.6** | **62.1** |
The best and second-best results are shown in bold and underlined, respectively. All numbers are accuracies in %, with a full score of 100.
Leaderboard on WildVideo (Multi-Turn)
Accuracy (%) on the multi-turn subset (1,585 dialogues, up to 5 turns) of WildVideo. Object, Action, and Visual Loc. are Perception categories; Consistency, Causality, Multimodal Ref., and World Knowledge are Cognition categories; Contextual Ellipsis and Cross-turn Retrieval are Contextual Comprehension categories.
| Model | Object | Action | Visual Loc. | Consistency | Causality | Multimodal Ref. | World Knowledge | Contextual Ellipsis | Cross-turn Retrieval | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| *Open-Source LMMs* | | | | | | | | | | |
| Mono-InternVL-2B | 32.2 | 23.4 | 17.3 | 24.0 | 10.8 | 14.5 | 11.4 | 12.1 | 6.9 | 17.0 |
| InternVL2-8B | 31.5 | 30.4 | 34.8 | 48.7 | 10.2 | 22.0 | 25.3 | 13.9 | 15.2 | 25.8 |
| Qwen2-VL-7B | 50.7 | 34.8 | 36.9 | 55.5 | <u>44.5</u> | 41.9 | 38.8 | 25.3 | 23.4 | 39.1 |
| MiniCPM-V2.6-8B | 50.9 | **49.6** | 42.8 | 44.4 | 41.3 | 42.7 | 42.4 | 23.1 | 35.7 | 41.4 |
| *Commercial LMMs* | | | | | | | | | | |
| Claude 3.5 Sonnet | 33.1 | 39.3 | 24.2 | 39.9 | 35.3 | 35.0 | 68.6 | 32.6 | 30.9 | 37.6 |
| Gemini 1.5 Flash | 51.7 | 40.0 | 40.0 | 49.9 | 34.4 | 48.2 | 56.0 | **47.4** | <u>46.6</u> | 46.0 |
| Gemini 1.5 Pro | <u>56.2</u> | <u>47.8</u> | <u>43.2</u> | <u>64.5</u> | 37.5 | 44.8 | 73.4 | 43.1 | 37.2 | <u>49.8</u> |
| GPT-4V | 55.3 | 41.4 | 33.3 | 31.8 | **45.2** | <u>48.6</u> | <u>77.9</u> | 37.1 | **51.4** | 46.9 |
| GPT-4o mini | 55.0 | 38.5 | 38.3 | 45.2 | 40.1 | 46.3 | 63.1 | 30.8 | 41.8 | 44.3 |
| GPT-4o | **60.1** | 45.2 | **46.2** | **65.7** | 39.7 | **50.7** | **78.7** | <u>44.5</u> | 43.1 | **52.7** |
The best and second-best results are shown in bold and underlined, respectively. All numbers are accuracies in %, with a full score of 100.
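For reference, the Overall column in both tables is consistent with an unweighted mean of the per-category accuracies (e.g., GPT-4o single-turn: (68.2 + 54.2 + 51.5 + 66.5 + 59.2 + 61.4 + 73.6) / 7 ≈ 62.1). Below is a minimal scoring sketch under that assumption; the record fields (`category`, `prediction`, `answer`) are illustrative stand-ins, not WildVideo's actual schema:

```python
from collections import defaultdict

def score(records):
    """Per-category accuracy (%) plus an Overall macro-average.

    `records` is a list of dicts with illustrative fields:
    {"category": str, "prediction": str, "answer": str}.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
    acc = {c: 100.0 * correct[c] / total[c] for c in total}
    # Overall as the unweighted mean of the category accuracies, which
    # matches the tables above (an assumption about how the benchmark
    # aggregates, not a confirmed detail).
    acc["Overall"] = sum(acc.values()) / len(acc)
    return acc
```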
🚨 To submit your results to the leaderboard, please send your result JSON files to yangsongyuan@nudt.edu.cn.
🚨 For more submission details, please refer to this link.
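The required JSON schema is specified at the submission link above; as a rough sketch only, a result file might be assembled like this (the `question_id`/`prediction` fields and the file name are hypothetical, for illustration):

```python
import json

# Hypothetical layout: one entry per QA pair (or per dialogue turn).
# Check the submission link above for the actual required schema
# before sending anything.
results = [
    {"question_id": "single_000001", "prediction": "B"},
    {"question_id": "multi_000001_turn3", "prediction": "The man in the red jacket."},
]

with open("wildvideo_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```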