Dong Zhang (张栋)

Master's student @ Fudan University [Resume]

dongzhang22@m.fudan.edu.cn


# About Me

Hi! I am a second-year M.S. student in FudanNLPLab at Fudan University, supervised by Prof. Yaqian Zhou and Prof. Xipeng Qiu. I obtained my B.S. degree at Fudan University in 2022, advised by Prof. Fuliang Weng. Previously, I interned at Bytedance AI Lab, mentored by Rong Ye.

My research interests focus on Multi-Modal Large Language Models, Speech Foundation Models, and Multi-Modal Agents. I have developed several foundation models for speech, including SpeechGPT, SpeechTokenizer, and SpeechAgents.

I am expected to graduate in June 2025 and am seeking Ph.D. and job opportunities worldwide. I'm also open to academic collaboration. Please feel free to contact me at dongzhang22@m.fudan.edu.cn if you are interested!

# News

  • [2024.4] We released SpeechAlign, the first work to apply RLHF to align speech language models with human preferences!

  • [2024.2] I gave a talk about the SpeechGPT series of works at AGI Leap Summit 2024, hosted by SuperAGI.

  • [2024.2] We released AnyGPT, a unified multi-modal LLM for text, image, speech, and music!

  • [2024.1] We released SpeechGPT-Gen, an 8B speech LLM that efficiently models semantic and perceptual information.

  • [2024.1] We proposed InferAligner, an effective training-free LLM alignment method.

  • [2024.1] Our SpeechTokenizer was accepted to ICLR 2024! See you in Vienna!

  • [2024.1] We released SpeechAgents, the first multi-modal multi-agent system.

  • [2023.10] Two papers accepted to EMNLP 2023!

  • [2023.8] We released SpeechTokenizer, a speech tokenizer designed for speech language models.

  • [2023.5] We released SpeechGPT, a conversational speech large language model.

  • [2023.5] One first-author paper accepted to ACL 2023 (Findings)!

  • [2022.9] I joined FudanNLPLab as a master's student.

# Research

(*: Equal contribution)

An overview of my research on building multi-modal large language models.


SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu

[EMNLP 2023 Findings] [code] [demo]

This work is a GitHub Trending project and has been promoted by various media and forums, such as Heart of Machine, Twitter, and YouTube.

SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models

Dong Zhang*, Xin Zhang* (order is random), Shimin Li, Yaqian Zhou, Xipeng Qiu

[ICLR 2024] [code] [demo]

SpeechTokenizer unifies semantic tokens and acoustic tokens, and we build USLM (Unified Speech Language Model) on top of it.

SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation

Dong Zhang*, Xin Zhang*, Jun Zhan, Shimin Li, Yaqian Zhou, Xipeng Qiu

[Preprint] [code] [demo]

We propose the Chain-of-Information speech generation method and scale the model up to 8B to build SpeechGPT-Gen, which can perform speech-to-speech dialogue in any voice you want.

SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems

Dong Zhang, Zhaowei Li, Pengyu Wang, Xin Zhang, Yaqian Zhou, Xipeng Qiu

[Preprint] [code] [demo]

SpeechAgents is the first multi-modal multi-agent system.

SpeechAlign: Aligning Speech Generation to Human Preferences

Dong Zhang*, Zhaowei Li*, Shimin Li, Xin Zhang, Pengyu Wang, Yaqian Zhou, Xipeng Qiu

[Preprint] [code] [demo]

SpeechAlign is the first work to apply RLHF to align speech language models with human preferences, and it proposes an effective iterative self-improvement strategy that converts weak speech language models into stronger ones.

DUB: Discrete Unit Back-translation for Speech Translation

Dong Zhang, Rong Ye, Tom Ko, Mingxuan Wang, Yaqian Zhou

[ACL 2023 Findings] [code] [video]

DUB is the first work to use discrete speech representations as input for speech translation and to explore NLP techniques like mBART pretraining and back-translation on top of them.

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

Jun Zhan*, Junqi Dai*, Jiasheng Ye*, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu

[Preprint] [code] [demo]

AnyGPT is our new exploration of discrete-representation-based multimodal LLMs after SpeechGPT. AnyGPT unifies text, image, speech, and music in one model and can perform any-to-any multimodal conversation.

# Full Publications

# 2024

# 2023

# Invited Talks

# Education

  • Fudan University Sept 2022 - Jun 2025
    M.S. in Computer Science

  • Fudan University Sept 2018 - Jun 2022
    B.S. in Electronic Engineering

# Internship

  • Bytedance AI Lab Apr 2022 - Jun 2023
    Research on speech translation

# Service

  • Reviewer:
    EMNLP (2023), ARR (Dec 2023, Feb 2024)