arxiv:2408.13233

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Published on Aug 23
· Submitted by JamesSand on Aug 26
Abstract

The quadratic computational complexity of the self-attention mechanism in popular transformer architectures poses significant challenges for training and inference, particularly in terms of efficiency and memory requirements. To address these challenges, this paper introduces a novel fast computation method for gradient calculation in multi-layer transformer models. Our approach enables the computation of gradients for the entire multi-layer transformer model in almost linear time n^{1+o(1)}, where n is the input sequence length. This significantly reduces the computational bottleneck associated with the traditional quadratic time complexity. Our theory holds for any loss function and maintains a bounded approximation error across the entire model. Furthermore, our analysis holds when the multi-layer transformer model contains practical sub-modules such as residual connections, causal masking, and multi-head attention. By improving the efficiency of gradient computation in large language models, we hope our theoretical results will facilitate the more effective training and deployment of long-context language models.
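
The abstract does not spell out the construction, so the sketch below is only a minimal, hypothetical NumPy illustration of the general polynomial-kernel idea the result builds on: replace exp(q·k) by an inner product of low-dimensional feature maps so that attention becomes a chain of n×r and r×d products, never forming the n×n attention matrix. The names `poly_features` and `linear_time_attention`, the truncated-Taylor feature map, and the test sizes are assumptions for illustration; the paper's actual algorithm comes with explicit approximation guarantees and covers the full multi-layer gradient, not just a single attention layer.

```python
import math
import numpy as np

def poly_features(X, degree=2):
    """Explicit feature map phi with phi(q) @ phi(k) ~= exp(q @ k),
    built from the truncated Taylor series exp(t) ~ sum_{j<=degree} t^j / j!."""
    n, d = X.shape
    feats = [np.ones((n, 1))]              # j = 0 term
    cur = np.ones((n, 1))
    for j in range(1, degree + 1):
        # cur holds the flattened tensor power x^{(x)j} for each row x of X
        cur = np.einsum('nr,nd->nrd', cur, X).reshape(n, -1)
        feats.append(cur / math.sqrt(math.factorial(j)))
    return np.concatenate(feats, axis=1)   # shape (n, r) with r = O(d^degree)

def linear_time_attention(Q, K, V, degree=2):
    """Approximate softmax(Q K^T) V without forming the n x n matrix:
    total cost is O(n * r * d), i.e. linear in the sequence length n."""
    pq, pk = poly_features(Q, degree), poly_features(K, degree)
    num = pq @ (pk.T @ V)                  # (n, r) @ (r, d)
    den = pq @ pk.sum(axis=0)              # row-wise softmax normalizer
    return num / den[:, None]

# Sanity check against exact (quadratic-time) softmax attention.
rng = np.random.default_rng(0)
n, d = 256, 4
Q, K, V = (0.3 * rng.standard_normal((n, d)) for _ in range(3))
A = np.exp(Q @ K.T)
exact = (A / A.sum(axis=1, keepdims=True)) @ V
approx = linear_time_attention(Q, K, V, degree=3)
print(np.abs(exact - approx).max())        # small when Q K^T entries are bounded
```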

Community

Paper author · Paper submitter

Really excited to introduce this new work. It applies polynomial kernel approximation [AS23, AS24a] to compute the forward and backward passes of multi-layer transformers in almost linear time $n^{1+o(1)}$.
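
As a rough illustration of the "backward in almost linear time" point (not the paper's explicit gradient algorithm), the sketch below assumes the same hypothetical truncated-Taylor feature map as above, writes the approximate attention in PyTorch, and lets autodiff differentiate through it for an arbitrary scalar loss; since the forward pass only involves n×r and r×d products, the resulting backward pass also scales roughly linearly in n.

```python
import math
import torch

def phi(X, degree=2):
    # truncated-Taylor feature map (illustrative), torch version
    n, _ = X.shape
    feats, cur = [X.new_ones(n, 1)], X.new_ones(n, 1)
    for j in range(1, degree + 1):
        cur = torch.einsum('nr,nd->nrd', cur, X).reshape(n, -1)
        feats.append(cur / math.sqrt(math.factorial(j)))
    return torch.cat(feats, dim=1)

def approx_attention(Q, K, V, degree=2):
    pq, pk = phi(Q, degree), phi(K, degree)
    num = pq @ (pk.t() @ V)                # (n, r) @ (r, d): linear in n
    den = pq @ pk.sum(dim=0)               # (n,) normalizer
    return num / den.unsqueeze(-1)

n, d = 512, 8
Q = (0.2 * torch.randn(n, d)).requires_grad_()
K = (0.2 * torch.randn(n, d)).requires_grad_()
V = torch.randn(n, d, requires_grad=True)

loss = approx_attention(Q, K, V, degree=3).pow(2).mean()   # any scalar loss
loss.backward()                                            # backprop reuses only
print(Q.grad.shape, K.grad.shape, V.grad.shape)            # n x r / r x d products
```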

Thank you for the excellent work! I find the current structure of the paper a bit challenging to follow. It would be helpful to see more concise equations that better capture the core of the approximation method. I look forward to a tutorial on this in the future.

Paper author

Thank you for your interest in our work and for your kind suggestions. We will try to provide more visualizations of the polynomial approximation method in the future to make it easier to understand.


