FineBench

Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

CVPR'26

Workshop on Video Large Language Models (VidLLMs)

¹National Taiwan University, ²Google, ³NVIDIA

[Figure: teaser image for FineBench]

FineBench provides a benchmark and evaluation suite for fine-grained, human-centric assessment of vision-language models.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions.

While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding.

FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (about 15 minutes each), focusing on detailed person movement, person interaction, and object manipulation, including compositional actions.

Our extensive evaluation reveals that while proprietary models like GPT-5 achieve respectable performance, current open-source VLMs significantly underperform, struggling particularly with spatial reasoning in multi-person scenes and distinguishing subtle differences in human movements and interactions.

To address these identified weaknesses, we propose FineAgent, a modular framework that enhances VLMs by leveraging a Localizer and a Descriptor. Experiments show that FineAgent consistently improves the performance of various open VLMs on FineBench.

FineBench provides a rigorous testbed for future research into fine-grained human-centric video understanding, while FineAgent offers a practical approach to enhance such reasoning in current VLMs.

Key statistics

  • 199,420 multiple-choice QA pairs across 64 long-form videos (~15 minutes each)
  • Average ~3,116 questions per video
  • Dense spatial and temporal grounding; categories: person movement, person interaction, object manipulation
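
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section and treats the per-video duration as exactly 15 minutes, which is an approximation, so the derived hours and per-minute density are rough estimates rather than reported results.

```python
# Figures quoted in the key statistics above.
TOTAL_QA_PAIRS = 199_420
NUM_VIDEOS = 64
APPROX_MINUTES_PER_VIDEO = 15  # assumption: the page says "~15 minutes each"

# Average number of questions per video.
questions_per_video = TOTAL_QA_PAIRS / NUM_VIDEOS

# Rough total footage and annotation density, given the assumed duration.
total_minutes = NUM_VIDEOS * APPROX_MINUTES_PER_VIDEO
total_hours = total_minutes / 60
questions_per_minute = TOTAL_QA_PAIRS / total_minutes

print(f"questions per video:  {questions_per_video:.1f}")
print(f"total footage:        {total_hours:.1f} hours (approx.)")
print(f"questions per minute: {questions_per_minute:.1f} (approx.)")
```

Under that assumption the benchmark covers roughly 16 hours of video, which works out to a little over 200 questions per minute of footage.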

Quantitative results

Model performance (radar)

Accuracy vs. frames

Qualitative results: VLM failures

Qualitative examples

FineAgent Improves Existing VLMs

Comparison

Leaderboard & Evaluation

The evaluation code and scripts to reproduce our results are available at github.com/joslefaure/FineBench_eval. Below is the main results table from the paper (subset and full-dataset evaluations).

Performance of 15 Vision-Language Models (VLMs) on FineBench. Proprietary models evaluated on a representative subset (7 videos, 20,143 questions) are shown first. Best open-model full-dataset score is bolded and second-best is underlined.
Model                    Size  P. Movement  P. Interaction  Obj. Manipulation  Avg.
Random Choice            --    25.0         25.0            25.0               25.0
Subset Evaluation
GPT-4o (2024/08/26)      --    70.9         73.9            84.4               74.3
GPT-5-mini (2025/08/07)  --    75.9         75.3            85.3               77.4
Gemini-1.5-Flash         --    71.2         66.8            81.9               71.6
Gemini-2.0-Flash         --    75.9         68.7            86.3               75.2
SmolVLM                  2B    48.5         48.0            80.0               53.9
MiniCPM-2.6              8B    49.5         57.4            84.8               58.4
mPlugOwl-3               7B    47.9         55.8            84.0               56.6
Full Dataset Evaluation
InternVL-2.5             1B    33.8         40.2            79.6               44.1
SmolVLM                  2B    47.9         50.5            71.0               52.9
Qwen2.5-VL               3B    58.0         57.5            73.2               60.5
BLIP-3                   4B    34.3         58.6            64.9               48.2
InternVL-2.5             4B    61.4         58.6            78.1               63.3
mPlugOwl-2               7B    57.6         49.2            78.5               58.3
mPlugOwl-3               7B    48.9         54.8            75.2               55.6
MiniCPM-2.6              8B    56.2         56.5            72.8               59.2
LLaVA-OV                 7B    53.3         60.4            69.6               58.6
InternVL-2.5             8B    66.8         62.1            78.1               67.1
Qwen2.5-VL               7B    70.7         63.8            73.9               68.8
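
Note that the Avg. column is not the unweighted mean of the three category columns (for Qwen2.5-VL 7B, the three category scores average to about 69.5, while the table reports 68.8). One plausible reading, which is an assumption here and not confirmed by this page, is that Avg. is computed over all questions, so larger categories weigh more. The sketch below illustrates the difference between the two averages on toy data. The record fields (`category`, `prediction`, `answer`) and the `score` function are hypothetical and are not taken from the FineBench_eval repository.

```python
from collections import defaultdict


def score(records):
    """Return (per-category accuracy, question-weighted accuracy), in percent.

    Each record is a dict with hypothetical keys:
    'category', 'prediction', and 'answer'.
    """
    if not records:
        raise ValueError("records must not be empty")
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    # Question-weighted (micro) average: every question counts equally.
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_category, overall


# Toy multiple-choice records, for illustration only.
records = [
    {"category": "person_movement", "prediction": "A", "answer": "A"},
    {"category": "person_movement", "prediction": "B", "answer": "C"},
    {"category": "person_movement", "prediction": "D", "answer": "D"},
    {"category": "person_interaction", "prediction": "A", "answer": "B"},
    {"category": "object_manipulation", "prediction": "C", "answer": "C"},
]

per_category, overall = score(records)
# Unweighted (macro) average: every category counts equally.
macro = sum(per_category.values()) / len(per_category)

print(per_category)
print(f"question-weighted: {overall:.1f}  unweighted: {macro:.1f}")
```

When categories differ in size, the two averages diverge, which would be consistent with the gap between the category columns and the Avg. column in the table.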

BibTeX

@misc{faure2026finebench,
  title={FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding},
  author={Gueter Josmy Faure and Min-Hung Chen and Jia-Fong Yeh and Hung-Ting Su and Winston H. Hsu},
  year={2026},
  note={VidLLMs'2026 (CVPR Workshop)},
  url={https://joslefaure.github.io/assets/html/finebench.html},
}