SpeechAlign: Aligning Speech Generation to Human Preferences


Authors: Dong Zhang*, Zhaowei Li*, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu

School of Computer Science, Fudan University

Overview

Speech language models have advanced significantly in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs with human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affect performance. We then explore learning from human feedback as a way to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models with human preferences. SpeechAlign involves constructing a preference codec dataset that contrasts golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models into strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models.
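The preference optimization step can be illustrated with a minimal sketch. It assumes the standard DPO objective (the page names a "SpeechAlign-DPO" system but does not spell out the loss), treats the golden codec tokens as the preferred sequence and the synthetic tokens as the rejected one, and uses hypothetical function names and an assumed `beta` value that do not come from the source.

```python
import math


def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into one sequence log-probability."""
    return sum(token_log_probs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair (illustrative sketch).

    policy_* / ref_*: sequence log-probabilities under the model being
    trained and under a frozen reference model. "chosen" stands for the
    golden codec tokens, "rejected" for the synthetic codec tokens.
    beta=0.1 is an assumed value, not one taken from the source.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical per-token log-probabilities for one pair of codec sequences.
golden = sequence_log_prob([-1.0, -2.0, -1.0])
synthetic = sequence_log_prob([-2.0, -3.0, -3.0])
loss = dpo_loss(golden, synthetic, ref_chosen=-5.0, ref_rejected=-7.0)
```

In an iterative setup like the one described, the model obtained after one round would generate the synthetic tokens for the next round; that outer loop is omitted here.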







Zero-shot Text-to-Speech:

The first 3 seconds of each voice prompt are used as the prompt.
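As a small illustration of that truncation, the sketch below keeps the first 3 seconds of a waveform held as a sequence of samples. The function name, the 16 kHz sample rate, and the in-memory representation are assumptions made for the example, not details from this page.

```python
def take_prompt(samples, sample_rate, prompt_seconds=3.0):
    """Return the leading `prompt_seconds` of audio from `samples`.

    samples: a sequence of audio samples (list, tuple, or array).
    sample_rate: samples per second (assumed known by the caller).
    Clips shorter than the requested duration are returned whole.
    """
    num_samples = int(round(prompt_seconds * sample_rate))
    return samples[:num_samples]


# Hypothetical 5-second clip of silence at an assumed 16 kHz sample rate.
clip = [0.0] * (16000 * 5)
prompt = take_prompt(clip, sample_rate=16000)
```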
Text | Voice Prompt | SpeechAlign-sft | SpeechAlign-RLHF-PPO | SpeechAlign-DPO-Iter3 | Groundtruth
Marie's face fell under his brooding gaze.
Burn fire burn flicker flicker flame.
Thou gentle maid of silent valleys and of modest brooks for thou shall be clothed in light and fed with morning manna till summers heat melts thee beside the fountains and the springs to flourish in eternal vales they why should thel complain.
He summoned half a dozen citizens to join his posse who followed obeyed and assisted him.
Soon the whole bridge was trembling and resounding.
Well that may be true agreed margolotte but on the contrary a servant with too much brains is sure to become independent and high and mighty and feel above her work.
Distrusting his own judgment his appeals to the opinion of chingachgook were frequent and earnest.
Everything he has done has been aimed at the conservation of energy the contraction of space the intensification of culture.
In the present day we are well aware that an ancient philosopher is to be interpreted from himself and by the contemporary history of thought.
My eyes fill with tears when i contrast the bliss of such a state brightened by hopes of the future with the melancholy state i now live in uncertain that i ever felt true contrition wandering in thought and deed longing for holiness which i shall never never obtain smitten at times to the heart with the conviction that ghastly calvinistic doctrines are true darkened in short by the very shadows of spiritual death.
Their distinctive characters however display one broad and unfailing difference.
Forthwith all ran to the opening of the tent to see what might be amiss but master will who peeped out first needed no more than one glance.