SpeechGPT2: End-to-End Human-Like Spoken Chatbot

Authors: Dong Zhang, Qian Tu*, Ruifan Deng*, Jun Zhan*, Zixin Wang*, Xingjian Zhao,
Ke Chen, Xin Zhang, Pengyu Wang, Zhaowei Li, Shimin Li,
Yaqian Zhou, Xipeng Qiu

(*Equal Contribution)

School of Computer Science, Fudan University

dongzhang22@m.fudan.edu.cn

Overview

SpeechGPT2 is an end-to-end speech dialogue language model, similar to GPT-4o. It can perceive and express emotions, and it provides appropriate voice responses in various styles (such as rap, drama, robot, funny, and whisper) based on context and human instructions. To cope with the length of raw speech sequences, SpeechGPT2 employs an ultra-low-bitrate speech codec (750 bps) that models both semantic and acoustic information, together with a Multi-Input-Multi-Output Language Model (MIMO-LM). Currently, SpeechGPT2 is still a turn-based dialogue system. We are developing a full-duplex, real-time version and have already made some promising progress.

SpeechGPT2 is a technological exploration under limited resources. Constrained by computing and data resources, it still has some shortcomings, such as limited noise robustness in speech understanding and unstable sound quality in speech generation. We plan to open-source the technical report, code, and model weights in the future.

Demo Samples

  • Drama
  • Rap 1
  • Rap 2
  • Whisper
  • Robot
  • Funny
  • Emotional 1
  • Emotional 2
  • Shout
  • Conversational

SpeechGPT2 Details

Speech Codec

  • Models both semantic and acoustic information
  • Ultra-low bitrate (750 bps: 25 Hz frame rate × 3 RVQ codebooks)
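The stated bitrate follows from the codec settings, assuming 1024-entry (10-bit) codebooks, a common RVQ choice that the overview does not confirm. The sketch below checks that arithmetic and illustrates residual vector quantization, where each stage quantizes the residual left by the previous stage; the codebooks here are random stand-ins, not the real trained ones.

```python
import numpy as np

# Bitrate check: 25 frames/s x 3 codebooks x 10 bits/code = 750 bps.
# The 10-bit (1024-entry) codebook size is an assumption.
FRAME_RATE = 25        # codec frames per second
NUM_QUANTIZERS = 3     # RVQ depth ("RVQ3")
BITS_PER_CODE = 10     # log2(1024)
bitrate = FRAME_RATE * NUM_QUANTIZERS * BITS_PER_CODE

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stage missed."""
    residual = x.astype(float)
    codes = []
    for cb in codebooks:  # cb: (num_entries, dim)
        idx = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

rng = np.random.default_rng(0)
frame = rng.normal(size=8)                                # toy frame embedding
books = [rng.normal(size=(1024, 8)) for _ in range(NUM_QUANTIZERS)]
codes = rvq_encode(frame, books)  # 3 integer codes per 40 ms frame
```

With this layout, one second of speech is fully described by 25 frames of 3 codes each.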

Model Architecture

  • MIMO-LM (Multi-Input-Multi-Output Language Model)
  • Initialized from a 7B text LLM
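A minimal sketch of the multi-input-multi-output idea, under the assumption that each RVQ codebook is one token stream: per-stream token embeddings are fused into a single backbone input, and per-stream heads predict all streams in the same step. The stream count, fusion by summation, and the `backbone` stand-in are illustrative assumptions, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
VOCAB_SIZES = [1024, 1024, 1024]  # one vocabulary per codec stream (assumed)

# One embedding table and one output head per stream; the backbone is shared.
embeds = [rng.normal(size=(v, DIM)) for v in VOCAB_SIZES]
heads = [rng.normal(size=(DIM, v)) for v in VOCAB_SIZES]

def mimo_step(tokens, backbone):
    """One decoding step: fuse the per-stream tokens into one input vector,
    run the shared backbone, then predict every stream at once."""
    x = sum(emb[t] for emb, t in zip(embeds, tokens))  # multi-input fusion
    h = backbone(x)                                    # stand-in for the 7B LM
    return [int(np.argmax(h @ W)) for W in heads]      # multi-output heads

next_tokens = mimo_step([1, 2, 3], backbone=np.tanh)   # one token per stream
```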

Inference

  • Generating one second of speech requires 25 autoregressive decoding steps
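The step count follows from the 25 Hz codec frame rate: with a MIMO model, every frame's three codes are emitted in one step. For contrast, the sketch below also counts steps for a hypothetical flattened baseline that decodes one code per step (that baseline is an illustrative assumption, not something the overview describes).

```python
FRAME_RATE = 25   # codec frames per second
NUM_STREAMS = 3   # RVQ codebooks per frame

def steps_mimo(seconds):
    """MIMO decoding: all three codebook streams advance in one step per frame."""
    return FRAME_RATE * seconds

def steps_flattened(seconds):
    """Hypothetical flattened decoding: one autoregressive step per code."""
    return FRAME_RATE * NUM_STREAMS * seconds

print(steps_mimo(1), steps_flattened(1))  # 25 vs 75 steps per second of audio
```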

Pre-trained Data

  • Over 100k hours of academic and in-the-wild speech data
  • Fine-grained style descriptions for each speech-text pair

Dialogue Data

  • 100k data points
  • High-quality multi-turn conversational spoken dialogues
  • Multi-turn emotional spoken dialogues
  • Multi-turn speech-style-control spoken dialogues

Next Steps

  • Full-duplex real-time large language model (in progress)
  • Streaming pipeline (codec + LLM)
  • Scaling up data and model size