Overview
SpeechGPT2 is an end-to-end spoken-dialogue language model, similar in spirit to GPT-4o. It can perceive and express emotions and produce appropriate voice responses in a range of styles, such as rap, drama, robot, funny, and whisper, according to context and human instructions. To keep speech sequences short, SpeechGPT2 uses an ultra-low-bitrate speech codec (750 bps) that models both semantic and acoustic information, together with a Multi-Input-Multi-Output Language Model (MIMO-LM). SpeechGPT2 is currently a turn-based dialogue system; we are developing a full-duplex, real-time version and have already made promising progress.
SpeechGPT2 is a technical exploration conducted under limited resources. Constrained by compute and data, it still has some shortcomings, such as limited noise robustness in speech understanding and unstable sound quality in speech generation. We plan to open-source the technical report, code, and model weights in the future.
Demos
Audio samples in the following styles: Drama, Rap 1, Rap 2, Whisper, Robot, Funny, Emotional 1, Emotional 2, Shout, Conversational.
SpeechGPT2 Details
Speech Codec
- Semantic and acoustic information modeling
- Ultra-low bitrate: 750 bps (25 Hz frame rate, 3 RVQ codebooks)
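The 750 bps figure follows from the frame rate and quantizer depth. As a sketch, assuming 1024-entry codebooks (10 bits per index; the page only states 25 Hz, 3 RVQ levels, and the 750 bps total):

```python
import math

frame_rate_hz = 25          # speech frames per second
rvq_levels = 3              # residual vector-quantizer depth (RVQ3)
codebook_size = 1024        # assumed entries per codebook, not stated on the page
bits_per_index = int(math.log2(codebook_size))  # 10 bits per codebook index

# bitrate = frames/sec * codebooks per frame * bits per codebook index
bitrate_bps = frame_rate_hz * rvq_levels * bits_per_index
print(bitrate_bps)  # 750
```

At this rate, one second of speech is represented by only 75 discrete tokens, which is what makes long dialogue contexts tractable for the language model.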
Model Architecture
- MIMO-LM (Multi-Input-Multi-Output Language Model)
- Initialized from a 7B text LLM
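One way to realize a multi-input-multi-output LM is to embed each input stream (text plus one stream per RVQ level) separately, sum the embeddings into a single hidden sequence for the shared transformer backbone, and predict each stream with its own output head. The sketch below illustrates the input side only; the stream-combination scheme, vocabulary sizes, and tiny model width are assumptions, not SpeechGPT2's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                          # tiny width for illustration; a 7B LLM uses ~4096
text_vocab, speech_vocab, rvq_levels = 100, 1024, 3

# One embedding table per input stream: text tokens plus each RVQ level.
text_emb = rng.standard_normal((text_vocab, d_model))
speech_embs = [rng.standard_normal((speech_vocab, d_model)) for _ in range(rvq_levels)]

def embed_step(text_id, speech_ids):
    """Sum the per-stream embeddings into one hidden vector per time step."""
    h = text_emb[text_id].copy()
    for table, tok in zip(speech_embs, speech_ids):
        h += table[tok]
    return h  # fed to the shared transformer backbone

h = embed_step(5, [10, 20, 30])
print(h.shape)  # (8,)
```

Summing keeps the backbone's sequence length equal to the codec frame rate rather than multiplying it by the number of streams.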
Inference
- Generating one second of speech requires 25 autoregressive decoding steps
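The decoding cost scales linearly with audio length: one step per 25 Hz codec frame, assuming all RVQ levels for a frame are emitted in a single step (the page only gives the 25-steps-per-second figure):

```python
codec_frame_rate_hz = 25  # one autoregressive step per codec frame

def decoding_steps(audio_seconds):
    """Autoregressive steps needed to generate a response of given length."""
    return codec_frame_rate_hz * audio_seconds

print(decoding_steps(10))  # 250 steps for a 10-second response
```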
Pre-trained Data
- Over 100k hours of academic and in-the-wild speech data
- Fine-grained style descriptions for each speech-text pair
Dialogue Data
- 100k dialogue examples
- High-quality multi-turn conversational spoken dialogues
- Multi-turn emotional spoken dialogues
- Multi-turn style-controlled spoken dialogues
Next Steps
- Full-duplex real-time large language model (in progress)
- Streaming pipeline (codec + LLM)
- Scaling up data and model size