Friday Aug 04, 2023
arXiv Preprint - Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
In this episode we discuss Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding by Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, Yu Wang. The paper proposes a method called "Skeleton-of-Thought" (SoT) to reduce the generation latency of large language models (LLMs). Current LLMs decode sequentially, which contributes to high latency. SoT instead guides the LLM to first generate a skeleton of the answer and then complete the content of each skeleton point in parallel, using either parallel API calls or batched decoding.
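The two-stage flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `fake_llm`, `parse_skeleton`, `skeleton_of_thought`, and the `SKELETON:` / `POINT:` prompt prefixes are hypothetical names made up for illustration, and the model is a canned stub rather than a real API. The parallel step uses a standard-library thread pool, standing in for the parallel API calls mentioned in the description.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns canned text."""
    if prompt.startswith("SKELETON:"):
        # Stage 1 reply: a short numbered outline of the answer.
        return "1. Define the problem\n2. Describe the method\n3. Summarize results"
    # Stage 2 reply: "expand" the single point named in the prompt.
    return "Expanded: " + prompt.split("POINT:", 1)[1].strip()


def parse_skeleton(text: str) -> List[str]:
    """Turn a numbered outline into a list of point strings."""
    points = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        head, sep, rest = line.partition(".")
        # Drop a leading "1." style number if present.
        points.append(rest.strip() if sep and head.isdigit() else line)
    return points


def skeleton_of_thought(
    question: str,
    llm: Callable[[str], str] = fake_llm,
    max_workers: int = 4,
) -> str:
    # Stage 1: one sequential call that produces the skeleton.
    points = parse_skeleton(llm(f"SKELETON: {question}"))

    # Stage 2: one independent call per point, issued in parallel.
    prompts = [f"QUESTION: {question}\nPOINT: {p}" for p in points]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        expansions = list(pool.map(llm, prompts))  # map preserves input order

    # Stitch the expanded points back together in skeleton order.
    return "\n".join(f"{i}. {e}" for i, e in enumerate(expansions, 1))


if __name__ == "__main__":
    print(skeleton_of_thought("What is Skeleton-of-Thought?"))
```

In this sketch, any latency benefit would come from the expansion calls overlapping in time instead of running one after another; with a real model, `fake_llm` would be replaced by an actual client call.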