SpeechGPT2: End-to-End Human-Like Spoken Chatbot

Authors: Dong Zhang, Qian Tu*, Ruifan Deng*, Jun Zhan*, Zixin Wang*, Xingjian Zhao,
Ke Chen, Xin Zhang, Pengyu Wang, Zhaowei Li, Shimin Li,
Yaqian Zhou, Xipeng Qiu

(*Equal Contribution)

School of Computer Science, Fudan University

dongzhang22@m.fudan.edu.cn

Overview

SpeechGPT2 is an end-to-end speech dialogue language model, similar to GPT-4o. It can perceive and express emotions, and it provides appropriate voice responses in various styles (such as rap, drama, robot, funny, and whisper) based on context and human instructions. To cope with the length of raw speech sequences, SpeechGPT2 employs an ultra-low-bitrate speech codec (750 bps) that models both semantic and acoustic information, together with a Multi-Input-Multi-Output Language Model (MIMO-LM). Currently, SpeechGPT2 is still a turn-based dialogue system. We are developing a full-duplex, real-time version and have already made some promising progress.

SpeechGPT2 is a technological exploration under limited resources. Constrained by computing and data resources, it still has some shortcomings, such as limited noise robustness in speech understanding and unstable sound quality in speech generation. We plan to open-source the technical report, code, and model weights in the future.

Demo Samples

  • Drama
  • Rap 1
  • Rap 2
  • Whisper
  • Robot
  • Funny
  • Emotional 1
  • Emotional 2
  • Shout
  • Conversational

SpeechGPT2 Details

Speech Codec

  • Models both semantic and acoustic information
  • Ultra-low bitrate (750 bps: 25 Hz frame rate × 3 RVQ codebooks)
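The stated bitrate follows from the codec settings, assuming 1024-entry (10-bit) codebooks, a common RVQ choice that the overview does not confirm. The sketch below checks that arithmetic and illustrates residual vector quantization, where each stage quantizes the residual left by the previous stage; the codebooks here are random stand-ins, not the real trained ones.

```python
import numpy as np

# Bitrate check: 25 frames/s x 3 codebooks x 10 bits/code = 750 bps.
# The 10-bit (1024-entry) codebook size is an assumption.
FRAME_RATE = 25        # codec frames per second
NUM_QUANTIZERS = 3     # RVQ depth ("RVQ3")
BITS_PER_CODE = 10     # log2(1024)
bitrate = FRAME_RATE * NUM_QUANTIZERS * BITS_PER_CODE

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stage missed."""
    residual = x.astype(float)
    codes = []
    for cb in codebooks:  # cb: (num_entries, dim)
        idx = int(np.argmin(np.linalg.norm(residual - cb, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

rng = np.random.default_rng(0)
frame = rng.normal(size=8)                                # toy frame embedding
books = [rng.normal(size=(1024, 8)) for _ in range(NUM_QUANTIZERS)]
codes = rvq_encode(frame, books)  # 3 integer codes per 40 ms frame
```

With this layout, one second of speech is fully described by 25 frames of 3 codes each.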

Model Architecture

  • MIMO-LM (Multi-Input-Multi-Output Language Model)
  • Initialized from a 7B text LLM
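A minimal sketch of the multi-input-multi-output idea, under the assumption that each RVQ codebook is one token stream: per-stream token embeddings are fused into a single backbone input, and per-stream heads predict all streams in the same step. The stream count, fusion by summation, and the `backbone` stand-in are illustrative assumptions, not the published architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
VOCAB_SIZES = [1024, 1024, 1024]  # one vocabulary per codec stream (assumed)

# One embedding table and one output head per stream; the backbone is shared.
embeds = [rng.normal(size=(v, DIM)) for v in VOCAB_SIZES]
heads = [rng.normal(size=(DIM, v)) for v in VOCAB_SIZES]

def mimo_step(tokens, backbone):
    """One decoding step: fuse the per-stream tokens into one input vector,
    run the shared backbone, then predict every stream at once."""
    x = sum(emb[t] for emb, t in zip(embeds, tokens))  # multi-input fusion
    h = backbone(x)                                    # stand-in for the 7B LM
    return [int(np.argmax(h @ W)) for W in heads]      # multi-output heads

next_tokens = mimo_step([1, 2, 3], backbone=np.tanh)   # one token per stream
```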

Inference

  • Generating one second of speech requires 25 autoregressive decoding steps
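The step count follows from the 25 Hz codec frame rate: with a MIMO model, every frame's three codes are emitted in one step. For contrast, the sketch below also counts steps for a hypothetical flattened baseline that decodes one code per step (that baseline is an illustrative assumption, not something the overview describes).

```python
FRAME_RATE = 25   # codec frames per second
NUM_STREAMS = 3   # RVQ codebooks per frame

def steps_mimo(seconds):
    """MIMO decoding: all three codebook streams advance in one step per frame."""
    return FRAME_RATE * seconds

def steps_flattened(seconds):
    """Hypothetical flattened decoding: one autoregressive step per code."""
    return FRAME_RATE * NUM_STREAMS * seconds

print(steps_mimo(1), steps_flattened(1))  # 25 vs 75 steps per second of audio
```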

Pre-trained Data

  • Over 100k hours of academic and in-the-wild speech data
  • Fine-grained style descriptions for each speech-text pair

Dialogue Data

  • 100k data points
  • High-quality multi-turn conversational spoken dialogues
  • Multi-turn emotional spoken dialogues
  • Multi-turn speech-style-control spoken dialogues

Next Steps

  • Full-duplex real-time large language model (in progress)
  • Streaming pipeline (codec + LLM)
  • Scaling up data and model size