Current speech large language models build upon discrete speech representations,
which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. We
establish SLMTokBench, a benchmark designed to assess the suitability of speech
tokens for building speech language models. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. We propose SpeechTokenizer,
a unified speech tokenizer for speech large language models. SpeechTokenizer adopts an encoder-decoder architecture with residual vector quantization (RVQ).
Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across its RVQ layers. Leveraging
SpeechTokenizer, we construct a Unified Speech Language Model (USLM). Experimental results show that SpeechTokenizer performs comparably to EnCodec in
speech reconstruction and demonstrates strong performance on the SLMTokBench
benchmark. The USLM outperforms VALL-E in zero-shot Text-to-Speech tasks.
Speech reconstruction demos (audio omitted): raw utterances compared against EnCodec and SpeechTokenizer reconstructions decoded from the RVQ-1, RVQ-2:8, and RVQ-1:8 token streams.
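Continuing the sketch above (same toy codebooks and codes), the three reconstruction conditions correspond to decoding different slices of the token stack; RVQ-2:8 is simply the full reconstruction minus the RVQ-1 contribution:

```python
def rvq_decode_range(codes, start, end):
    """Sum codebook embeddings for layers start..end (1-indexed, inclusive)."""
    return sum(codebooks[i][codes[i]] for i in range(start - 1, end))

semantic_part = rvq_decode_range(codes, 1, 1)   # RVQ-1: content stream
acoustic_part = rvq_decode_range(codes, 2, 8)   # RVQ-2:8: remaining detail
assert np.allclose(semantic_part + acoustic_part, rvq_decode(codes, 8))
```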
Zero-shot Text-to-Speech demos (audio omitted): given a text and a speaker prompt, ground-truth recordings compared against VALL-E, USLM, and USLM (SoundStorm) outputs, on sentences such as "She also defended the Lord Chancellor's existing powers." and "To the Hebrews it was a token that there would be no more universal floods."
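The excerpt does not describe USLM's internals. As a rough, hypothetical sketch, assuming a VALL-E-style two-stage design over SpeechTokenizer tokens (an autoregressive stage for the semantic RVQ-1 stream, a non-autoregressive stage for the acoustic RVQ-2:8 streams), zero-shot TTS could be organized as follows; the model stubs below only sample random tokens, and every name and interface is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
CODEBOOK_SIZE, NUM_LAYERS, T_OUT = 1024, 8, 150

# Stub "models": random token generators standing in for trained networks.
def ar_semantic_model(text, prompt_rvq1):
    # An AR stage would emit RVQ-1 tokens frame by frame, conditioned on
    # the text and the prompt's semantic tokens; here we just sample.
    return rng.integers(CODEBOOK_SIZE, size=T_OUT)

def nar_acoustic_model(rvq1, prompt_codes):
    # A NAR stage would fill in layers 2..8 in parallel, copying the
    # speaker's acoustic characteristics from the prompt; here we sample.
    return rng.integers(CODEBOOK_SIZE, size=(NUM_LAYERS - 1, T_OUT))

def uslm_tts(text, prompt_codes):
    """Zero-shot TTS: semantic tokens from text, acoustic detail from prompt."""
    rvq1 = ar_semantic_model(text, prompt_codes[0])
    rvq2_8 = nar_acoustic_model(rvq1, prompt_codes)
    return np.vstack([rvq1[None, :], rvq2_8])   # (8, T) token stack

prompt_codes = rng.integers(CODEBOOK_SIZE, size=(NUM_LAYERS, 80))
codes = uslm_tts("She also defended the Lord Chancellor's existing powers.",
                 prompt_codes)
print(codes.shape)   # (8, 150): fed to the tokenizer's decoder for waveform
```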