Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Bibliographic Details
Title:	Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Authors:	Wang, Xinsheng; Jiang, Mingqi; Ma, Ziyang; Zhang, Ziyu; Liu, Songxiang; Li, Linqin; Liang, Zheng; Zheng, Qixi; Wang, Rui; Feng, Xiaoqin; Bian, Weizhen; Ye, Zhen; Cheng, Sitong; Yuan, Ruibin; Zhao, Zhixian; Zhu, Xinfa; Pan, Jiahao; Xue, Liumeng; Zhu, Pengcheng; Chen, Yunlin; Li, Zhifei; Chen, Xie; Xie, Lei; Guo, Yike; Xue, Wei
Publication Year:	2025
Collection:	Computer Science
Subject Terms:	Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
More Details:	Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
Comment:	Submitted to ACL 2025
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2503.01710
Accession Number:	edsarx.2503.01710
Database:	arXiv
