Tuesday Oct 03, 2023
arXiv Preprint - Efficient Streaming Language Models with Attention Sinks
In this episode we discuss Efficient Streaming Language Models with Attention Sinks by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. The paper proposes StreamingLLM, a framework that lets Large Language Models (LLMs) trained with a finite attention window generalize to infinite sequence lengths without fine-tuning. The authors observe a phenomenon they call the attention sink: the initial tokens of a sequence receive strong attention scores even when they are not semantically important, so evicting them from the cache degrades performance. Keeping the Key and Value states of these initial tokens alongside a sliding window of recent tokens restores the stability of window attention while preserving its efficiency. In streaming applications, the authors report that StreamingLLM outperforms the sliding window recomputation baseline with a speedup of up to 22.2x.
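For listeners who want a concrete picture of the caching idea, here is a minimal Python sketch of an eviction policy that keeps the first few entries as attention sinks plus a rolling window of recent entries. It is an illustration under stated assumptions, not the authors' implementation: the class name `SinkKVCache`, its parameters, and the use of plain Python objects in place of real Key/Value tensors are all hypothetical.

```python
from collections import deque


class SinkKVCache:
    """Toy cache: keep the first `num_sink` entries forever, plus a
    rolling window of the most recent `window` entries.

    Hypothetical illustration only; real KV caches hold per-layer tensors.
    """

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sink = []                       # initial tokens, never evicted
        self.recent = deque(maxlen=window)   # oldest entry dropped when full

    def append(self, kv):
        # The first `num_sink` entries become permanent sinks;
        # every later entry goes into the sliding window.
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        # Entries the model would attend over at the current step.
        return self.sink + list(self.recent)


if __name__ == "__main__":
    cache = SinkKVCache(num_sink=2, window=3)
    for token_id in range(10):
        cache.append(token_id)
    print(cache.entries())
```

The cache size stays bounded at `num_sink + window` no matter how long the stream runs, which is the property that makes this style of caching suitable for streaming. The sketch leaves out everything else a real system needs, such as how positions are assigned to the cached entries.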