This project page hosts our ECCV'24 EVAL-FoMo Workshop paper "BREASE" and its improved and expanded version "HERMES".

HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

1National Taiwan University, 2NVIDIA, 3Mobile Drive Technology, 4National Tsing Hua University
Teaser image.

HERMES simulates episodic memory accumulation to capture action sequences from long videos and reinforces them with semantic knowledge dispersed throughout the video.

Abstract

While existing research often treats long-form videos as extended short videos, we propose a novel approach that more accurately reflects human cognition.

This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) module that efficiently aggregates crucial representations from micro to semi-macro levels. Second, we propose a Semantics reTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
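To make the episodic-compression idea concrete, here is a minimal numpy sketch of an ECO-style online compressor. This is an illustrative toy, not the authors' implementation: the fixed memory capacity and the merge-most-similar-neighbors rule are assumptions chosen only to show how redundant frame features can be folded into a bounded set of episode representations as a video streams in.

```python
import numpy as np

def episodic_compress(memory, window, capacity):
    """Toy ECO-style step: append the new window's features to the
    episodic memory, then, while the memory exceeds its capacity,
    merge the two most similar consecutive entries (cosine
    similarity) by averaging them. Temporal order is preserved."""
    mem = list(memory) + list(window)
    while len(mem) > capacity:
        sims = []
        for i in range(len(mem) - 1):
            a, b = mem[i], mem[i + 1]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        j = int(np.argmax(sims))           # most redundant adjacent pair
        merged = (mem[j] + mem[j + 1]) / 2.0
        mem = mem[:j] + [merged] + mem[j + 2:]
    return mem

# Stream a video window-by-window through the compressor.
rng = np.random.default_rng(0)
memory = []
for _ in range(5):                          # 5 windows of 4 frame features each
    window = rng.normal(size=(4, 8))        # (frames, feature_dim)
    memory = episodic_compress(memory, window, capacity=6)
```

Because the merge is applied online per window, memory stays bounded at `capacity` entries regardless of video length, which is the property the paper's micro-to-semi-macro aggregation relies on.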

Method Overview

method image.
We stream through a video window-by-window and extract features using a frozen ViT. Each window's features are processed by ECO (illustrated in the lower-left part of the figure) in an online fashion, discarding redundancies along the way and retaining video episodes, which are passed to an episodic Q-Former. The video token bank stores features for every window, and SeTR selects only the high-level information to pass to a hierarchical frame-to-sequence Q-Former. The episodic and high-level representations are then concatenated before being fed to the frozen LLM, which outputs text following the instruction.
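The SeTR selection step above can be sketched in a few lines of numpy. This is a hedged illustration, not the paper's actual retriever: it stands in for "selecting high-level information from the token bank" with greedy farthest-point sampling under cosine similarity, an assumed selection rule that keeps a small set of mutually dissimilar tokens spanning the video's broad content.

```python
import numpy as np

def setr_select(token_bank, k):
    """Toy SeTR-style selection: greedily pick k mutually dissimilar
    tokens from the bank (farthest-point sampling on cosine
    similarity), so the kept tokens cover distinct high-level
    content rather than near-duplicate frames."""
    # L2-normalize so dot products are cosine similarities
    X = token_bank / (np.linalg.norm(token_bank, axis=1, keepdims=True) + 1e-8)
    selected = [0]                           # seed with the first token
    while len(selected) < k:
        sims = X @ X[selected].T             # (N, |selected|) similarities
        # pick the token least similar to everything already selected
        cand = int(np.argmin(sims.max(axis=1)))
        selected.append(cand)
    return token_bank[selected]

rng = np.random.default_rng(1)
bank = rng.normal(size=(20, 8))              # 20 window tokens, dim 8
high_level = setr_select(bank, k=5)          # compressed to 5 tokens
```

The point of the sketch is the shape change: a long token bank is reduced to a short, diverse set before the hierarchical Q-Former, which is what keeps the LLM's input budget manageable for arbitrarily long videos.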

Long Movie (MovieChat)

HERMES achieves SOTA results on MovieChat, surpassing its closest competitor by a staggering 14.9%. The deemphasized results are for the fully supervised model.

result mvchat.

Long Procedural Video

HERMES achieves SOTA results on Breakfast and COIN, two long-form procedural video datasets.

result coin.

Long Movie (LVU)

HERMES achieves a far greater top-1 accuracy than the closest method on the challenging LVU dataset.

result lvu.

Qualitative Results

HERMES excels at fine-grained understanding of arbitrarily long videos. Furthermore, it has the rare quality of knowing when it doesn't know.

qualitative 1

HERMES can identify animal species and accurately count them. It can also determine people's relationships by watching them interact across thousands of frames.

qualitative 1

BibTeX

@misc{faure2024bridgingepisodessemanticsnovel,
      title={Bridging Episodes and Semantics: A Novel Framework for Long-Form Video Understanding},
      author={Gueter Josmy Faure and Jia-Fong Yeh and Min-Hung Chen and Hung-Ting Su and Winston H. Hsu and Shang-Hong Lai},
      year={2024},
      eprint={2408.17443},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.17443},
}