WildVideo

Benchmarking LMMs for Understanding Video-Language Interaction

IEEE Transactions on Pattern Analysis and Machine Intelligence

Introduction

We introduce WildVideo, an open-world benchmark designed to assess hallucination in Large Multi-modal Models (LMMs) when they interpret video-language interaction in the wild. WildVideo comprehensively tests LMM hallucination in perception, cognition, and contextual comprehension through both single-turn and multi-turn open-ended question-answering (QA) tasks on videos captured from two human perspectives (i.e., first-person and third-person views). We define 9 distinct tasks that challenge LMMs across multi-level perceptual tasks (e.g., static and dynamic perception), multi-aspect cognitive tasks (e.g., commonsense, world knowledge), and multi-faceted contextual comprehension tasks (e.g., contextual ellipsis, cross-turn retrieval). The benchmark consists of 1,318 meticulously curated videos, supplemented with 13,704 single-turn QA pairs and 1,585 multi-turn dialogues (up to 5 turns). We evaluate 14 commonly used LMMs on WildVideo, revealing significant hallucination issues and highlighting substantial gaps in their current capabilities.

Leaderboard on WildVideo (Single-Turn)

Accuracy (%) on the single-turn subset (13,704 single-turn QA pairs) of WildVideo.

Model | Perception: Object, Action, Visual Loc., Consistency | Cognition: Causality, Multimodal Ref., World Knowledge | Overall
Open-Source LMMs
Mono-InternVL-2B 36.2 15.6 22.2 26.9 20.7 11.0 21.3 22.0
Video-LLaVA-7B 43.9 25.8 23.8 40.1 36.9 27.1 45.6 34.7
InternVL2-8B 42.0 27.9 33.2 40.5 37.4 28.4 37.3 35.2
VideoLLaMA2-7B 52.6 34.3 33.0 46.7 39.9 41.9 48.9 42.5
Ovis1.6-Gemma-2-9B 58.5 27.6 28.6 41.6 42.5 44.0 57.7 42.9
Qwen2-VL-7B 52.5 30.3 36.1 47.8 47.6 36.2 50.9 43.1
MiniCPM-V2.6-8B 57.4 39.2 41.5 57.3 51.5 36.4 41.8 46.4
LLaVA-Video-7B-Qwen2 63.0 50.7 47.3 58.9 60.6 47.6 46.0 53.4
Commercial LMMs
Claude 3.5 Sonnet 50.0 43.5 39.7 50.5 36.1 53.2 49.8 46.1
Gemini 1.5 Flash 67.7 38.2 42.4 50.6 33.6 49.3 66.7 49.8
Gemini 1.5 Pro 67.1 50.4 43.6 59.3 42.6 54.7 57.8 53.7
GPT-4V 57.0 39.7 34.2 54.9 39.2 54.0 69.7 49.8
GPT-4o mini 61.4 45.3 41.9 52.9 52.0 47.9 71.8 53.3
GPT-4o 68.2 54.2 51.5 66.5 59.2 61.4 73.6 62.1

The best and second-best LMM results are marked in bold and underlined, respectively. All numbers are accuracies in %, with 100% as the full score.

Leaderboard on WildVideo (Multi-Turn)

Accuracy (%) on the multi-turn subset (1,585 dialogues, up to 5 turns) of WildVideo.

Model | Perception: Object, Action, Visual Loc., Consistency | Cognition: Causality, Multimodal Ref., World Knowledge | Contextual Comprehension: Contextual Ellipsis, Cross-turn Retrieval | Overall
Open-Source LMMs
Mono-InternVL-2B 32.2 23.4 17.3 24.0 10.8 14.5 11.4 12.1 6.9 17.0
InternVL2-8B 31.5 30.4 34.8 48.7 10.2 22.0 25.3 13.9 15.2 25.8
Qwen2-VL-7B 50.7 34.8 36.9 55.5 44.5 41.9 38.8 25.3 23.4 39.1
MiniCPM-V2.6-8B 50.9 49.6 42.8 44.4 41.3 42.7 42.4 23.1 35.7 41.4
Commercial LMMs
Claude 3.5 Sonnet 33.1 39.3 24.2 39.9 35.3 35.0 68.6 32.6 30.9 37.6
Gemini 1.5 Flash 51.7 40.0 40.0 49.9 34.4 48.2 56.0 47.4 46.6 46.0
Gemini 1.5 Pro 56.2 47.8 43.2 64.5 37.5 44.8 73.4 43.1 37.2 49.8
GPT-4V 55.3 41.4 33.3 31.8 45.2 48.6 77.9 37.1 51.4 46.9
GPT-4o mini 55.0 38.5 38.3 45.2 40.1 46.3 63.1 30.8 41.8 44.3
GPT-4o 60.1 45.2 46.2 65.7 39.7 50.7 78.7 44.5 43.1 52.7

The best and second-best LMM results are marked in bold and underlined, respectively. All numbers are accuracies in %, with 100% as the full score.
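
All scores above are plain accuracies: each answer is judged correct or incorrect, scores are averaged within a task, and the Overall column aggregates across tasks. Below is a minimal Python sketch of this kind of aggregation; the record schema and the choice of a micro-average over all QA pairs for Overall are our assumptions for illustration, not the official WildVideo scoring script.

from collections import defaultdict

def aggregate_accuracy(records):
    # Each record is assumed to look like {"task": "Object", "correct": True};
    # this schema is illustrative, not WildVideo's official result format.
    per_task = defaultdict(lambda: [0, 0])  # task -> [num_correct, num_total]
    for r in records:
        per_task[r["task"]][0] += int(bool(r["correct"]))
        per_task[r["task"]][1] += 1
    task_acc = {t: 100.0 * c / n for t, (c, n) in per_task.items()}
    total_correct = sum(c for c, _ in per_task.values())
    total = sum(n for _, n in per_task.values())
    overall = 100.0 * total_correct / total if total else 0.0
    return task_acc, overall

# Toy usage:
records = [
    {"task": "Object", "correct": True},
    {"task": "Object", "correct": False},
    {"task": "Causality", "correct": True},
]
print(aggregate_accuracy(records))
# ({'Object': 50.0, 'Causality': 100.0}, 66.66666666666667)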



🚨 To submit your results to the leaderboard, please send your result JSON files to yangsongyuan@nudt.edu.cn.

🚨 For more submission details, please refer to this link.

WildVideo Dataset

Overview

We introduce WildVideo, a bilingual benchmark for evaluating hallucinations in video-based LMMs, fully aligned with real-world application settings. First, WildVideo incorporates multi-turn dialogues, enabling the evaluation of contextual understanding and dynamic conversational flows. Second, to mitigate the limitations of unimodal references, WildVideo emphasizes the integration of multimodal references, such as visual cues paired with textual input, which reflect the natural referencing mechanisms in human conversations. Furthermore, it incorporates deep visual contextual understanding by addressing challenges like contextual ellipsis and cross-turn retrieval, which require models to maintain and utilize information across multiple dialogue turns.
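
To make these properties concrete, here is a purely illustrative multi-turn example written as a Python dictionary; the video name, field names, and structure are hypothetical and do not reflect WildVideo's released annotation schema. The second turn relies on contextual ellipsis (the subject is left implicit and must be resolved from the first turn), and the third turn requires cross-turn retrieval of information established earlier in the dialogue.

example_dialogue = {
    "video": "kitchen_scene.mp4",  # hypothetical video file name
    "turns": [
        {"question": "What is the man in the red apron doing?",
         "answer": "He is chopping vegetables on a cutting board."},
        # Contextual ellipsis: "he" and the ongoing activity are omitted and
        # must be resolved from the previous turn.
        {"question": "What does he pick up next?",
         "answer": "A frying pan from the shelf above the stove."},
        # Cross-turn retrieval: the answer depends on information from turn 1.
        {"question": "What color was his apron while he was chopping?",
         "answer": "Red."},
    ],
}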

All data examples are divided into two subsets: single and multi.

  • single: 1,318 meticulously curated videos and 13,704 single-turn QA pairs
  • multi: 1,318 meticulously curated videos and 1,585 multi-turn dialogues (up to 5 turns)
You can download the dataset from Hugging Face Datasets.
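
Assuming the release is a standard Hugging Face dataset, it can typically be loaded with the datasets library as sketched below; the repository id, configuration names ("single" and "multi", taken from the subset names above), and the split name are assumptions, so please check the dataset card for the exact identifiers.

from datasets import load_dataset

# Hypothetical repository id and config/split names -- replace with the values
# listed on the Hugging Face dataset card.
single = load_dataset("WildVideo/WildVideo", name="single", split="test")
multi = load_dataset("WildVideo/WildVideo", name="multi", split="test")

print(single[0])  # one single-turn QA pair
print(multi[0])   # one multi-turn dialogue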

Key statistics of WildVideo.

Source dataset distribution of WildVideo.

Task statistics in WildVideo

The number of single-turn and multi-turn question-answer pairs for each task.

Experiment Results

Comparison of the capabilities of different LMMs on WildVideo.

BibTeX

@article{yang2025wildvideo,
  author  = {Yang, Songyuan and Yu, Weijiang and Yang, Wenjing and Liu, Xinwang and Tan, Huibin and Lan, Long and Xiao, Nong},
  title   = {WildVideo: Benchmarking LMMs for Understanding Video-Language Interaction},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)},
  year    = {2025},
  note    = {We thank Jilin Ma for developing the project website and GitHub infrastructure.}
}