arxiv:2407.14177

EVLM: An Efficient Vision-Language Model for Visual Understanding

Published on Jul 19
· Submitted by akhaliq on Jul 22
#1 Paper of the day
Authors:
Di Xu, et al.

Abstract

In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use single-layer ViT features as a visual prompt, feeding them directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals or inputs such as videos, the self-attention mechanism of language models can incur significant computational overhead. Additionally, using single-layer ViT features makes it challenging for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. Our method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. Our model achieves competitive scores on public multi-modal benchmarks and performs well in tasks such as image and video captioning.
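
The three ingredients named in the abstract can be pictured with a minimal PyTorch sketch: a Flamingo-style gated cross-attention block in which text tokens attend to visual tokens, a simple fusion of features from several ViT layers, and a soft mixture-of-experts feed-forward block. All module names, dimensions, and the routing scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    """Flamingo-style gated cross-attention: text queries attend to visual tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, starts closed

    def forward(self, text_tokens, visual_tokens):
        attended, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.gate) * attended


class MoEFeedForward(nn.Module):
    """Soft mixture-of-experts FFN: every expert runs, outputs weighted by a router."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):
        weights = self.router(x).softmax(dim=-1)                        # (B, T, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * weights.unsqueeze(-2)).sum(dim=-1)         # (B, T, D)


def fuse_hierarchical_vit_features(layer_features):
    """Average features from several ViT layers instead of using only the last one."""
    return torch.stack(layer_features, dim=0).mean(dim=0)


# Usage sketch: fuse multi-layer visual features, let text attend to them, apply MoE.
B, T_TEXT, T_VIS, D = 2, 16, 64, 512
text = torch.randn(B, T_TEXT, D)
vit_layers = [torch.randn(B, T_VIS, D) for _ in range(3)]  # e.g. three ViT layers

visual = fuse_hierarchical_vit_features(vit_layers)
out = MoEFeedForward(D)(GatedCrossAttention(D)(text, visual))
print(out.shape)  # torch.Size([2, 16, 512])
```

A real implementation would interleave such blocks inside a pretrained language model and would likely use sparse top-k routing for efficiency; the sketch only shows how the pieces fit together.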

Community

Paper submitter

[Screenshot attached]

Hi @akhaliq , I am not the Yifei Hu who co-authored this paper. Could you please unlink my account from the author list? Thank you!


Thanks for your feedback. Authorship removed! ✅

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

@zhonghanwen @zhonghuasong Congratulations on the release! Are you planning to open-source the model? 👀


Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2407.14177 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2407.14177 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2407.14177 in a Space README.md to link it from this page.

Collections including this paper 9