Wednesday Dec 13, 2023
arxiv preprint - Block-State Transformers
In this episode we discuss Block-State Transformers by Mahan Fathi, Jonathan Pilault, Orhan Firat, Christopher Pal, Pierre-Luc Bacon, Ross Goroshin. The paper introduces the Block-State Transformer (BST) architecture that merges state space models and block-wise attention to effectively capture long-range dependencies and improve performance on language modeling tasks. The BST incorporates an SSM sublayer for long-range context and a Block Transformer sublayer for local sequence processing, enhancing parallellization and combining the strengths of both model types. Experiments demonstrate the BST's superior performance over traditional Transformers in terms of perplexity, generalization to longer sequences, and a significant acceleration in processing speed due to model parallelization.
Comments (0)
To leave or reply to comments, please download free Podbean or
No Comments
To leave or reply to comments,
please download free Podbean App.