arxiv:2409.07239

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

Published on Sep 11 · Submitted by huangsiteng on Sep 13
Abstract

Fueled by the Large Language Models (LLMs) wave, Large Visual-Language Models (LVLMs) have emerged as a pivotal advancement, bridging the gap between image and text. However, video remains challenging for LVLMs due to the complex relationship between language and spatio-temporal data structures. Recent Large Video-Language Models (LVidLMs) align features of static visual data, such as images, with the latent space of language features through general multi-modal tasks in order to leverage the abilities of LLMs sufficiently. In this paper, we explore a fine-grained alignment approach via object trajectories that aligns different modalities across both spatial and temporal dimensions simultaneously. Thus, we propose a novel LVidLM based on trajectory-guided Pixel-Temporal Alignment, dubbed PiTe, that exhibits promising applicability. To achieve fine-grained video-language alignment, we curate a multi-modal pre-training dataset, PiTe-143k, which provides pixel-level moving trajectories for all individual objects that appear and are mentioned in both the video and the caption, produced by our automatic annotation pipeline. Meanwhile, PiTe demonstrates astounding capabilities on myriad video-related multi-modal tasks, beating the state-of-the-art methods by a large margin.
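
To make the dataset description concrete, here is a minimal sketch of what a single PiTe-143k annotation entry might look like: a caption plus one pixel-level trajectory per object mentioned in both the video and the text. The field names, coordinate convention, and structure are assumptions for illustration, not the released schema.

```python
# Hypothetical sketch of a PiTe-143k-style annotation entry.
# Field names and structure are assumptions; consult the official release for the real schema.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectTrajectory:
    phrase: str                              # noun phrase from the caption, e.g. "a brown dog"
    points: List[Tuple[int, float, float]]   # (frame_index, x, y) pixel coordinates over time


@dataclass
class PiTeSample:
    video_id: str
    caption: str                             # caption describing the video
    trajectories: List[ObjectTrajectory]     # one trajectory per object in both video and caption


sample = PiTeSample(
    video_id="example_0001",
    caption="A brown dog runs across the yard and jumps over a fence.",
    trajectories=[
        ObjectTrajectory(
            phrase="a brown dog",
            points=[(0, 112.0, 240.5), (8, 198.3, 232.1), (16, 287.9, 210.4)],
        ),
    ],
)
```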

Community

Paper author · Paper submitter

ECCV 2024 Oral. We present PiTe, a novel Large Video-Language Model (LVidLM) that achieves state-of-the-art performance in video understanding tasks through a trajectory-guided Pixel-Temporal Alignment approach. PiTe aligns visual and textual data across spatial and temporal dimensions by leveraging a curated multi-modal pre-training dataset, PiTe-143k, which provides moving trajectories at the pixel level for individual objects in videos. This approach enables PiTe to comprehend videos with greater detail and accuracy, outperforming existing methods in question-answering, temporal grounding, and dense captioning tasks.
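
As a rough illustration of how trajectory-guided alignment could be posed during pre-training, the toy sketch below regresses per-frame pixel coordinates for a mentioned object from fused video-language features and penalizes the distance to the annotated trajectory. This is an assumption-laden simplification for intuition, not the paper's actual architecture or loss; module names, dimensions, and the L1 objective are placeholders.

```python
# Toy sketch (not PiTe's implementation) of trajectory regression as an alignment signal.
# Shapes, module design, and the loss choice are illustrative assumptions.
import torch
import torch.nn as nn


class TrajectoryHead(nn.Module):
    """Maps per-frame fused video-language features to predicted (x, y) pixel coordinates."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 2),  # (x, y) per frame
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # fused_features: (num_frames, hidden_dim) for one object/phrase
        return self.mlp(fused_features)


def trajectory_loss(pred_xy: torch.Tensor, gt_xy: torch.Tensor) -> torch.Tensor:
    # Simple L1 regression between predicted and annotated pixel trajectories.
    return nn.functional.l1_loss(pred_xy, gt_xy)


# Random tensors stand in for real fused features and trajectory annotations.
head = TrajectoryHead()
fused = torch.randn(16, 768)        # 16 sampled frames
gt = torch.rand(16, 2) * 224        # ground-truth pixel coordinates
loss = trajectory_loss(head(fused), gt)
```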


