DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

👉 Code and weights | Paper link 👈

Abstract

Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.

Fig 1: The structure and two-stage training of DisCodec.

Fig 2: The overview of DisCo-Speech.

DisCo-Speech for Audiobook Dubbing

Because DisCo-Speech controls speaker timbre and prosody independently, any target speaker can adopt a desired prosodic style. In these demonstrations, the narration and each character voice, covering both timbre and emotional style, are synthesized from distinct acoustic references drawn from various speakers. The timbre and style references are selected automatically from our library by an understanding model that analyzes the novel's context.

A Case Study on a Classic Scene from Demi-Gods and Semi-Devils

Contents

1 DisCo-Speech: Zero-shot Controllable Speech Generation

1.1 Voice Cloning Comparison With Other Methods
1.2 Zero-Shot Controllable Generation Comparison With Other Methods

2 DisCodec: Disentangled Speech Codec

2.1 DisCodec Reconstruction Performance
2.2 Zero-Shot Voice Conversion of DisCodec
2.3 Disentanglement Visual Analysis

3 Additional DisCo-Speech Zero-Shot Demos

3.1 Expressive Voice Cloning
3.2 Cross-Lingual Voice Cloning
3.3 Zero-Shot Controllable Generation Demos

1 DisCo-Speech: Zero-shot Controllable Speech Generation

This section showcases DisCo-Speech's prosodic continuation capability in voice cloning: the model captures and reproduces not only the speaking style but also nuanced accents from the audio prompt.

1.1 Voice Cloning Comparison With Other Methods

Audio-Prompt	Text	DisCo-Speech	IndexTTS2	CosyVoice2	SparkTTS	Vevo
male speaker with accent	为了确保数据的安全与一致性，用户在访问数据库之前，都必须通过三层身份验证。
male speaker with disgust emotion	别拿这种微不足道的成就来炫耀，我所拥有的资源是你这辈子都无法想象的。
female speaker with sad emotion	面对这样残酷的现实，我感到一种深深的无力感，仿佛所有的努力都瞬间化为泡影。
female speaker with disgust emotion	请拿走这些粗制滥造的东西，这种低劣的品质简直是对我们审美能力的侮辱。
female speaker with alluring style	从前有一个勇敢的小裁缝，他凭着自己的智慧和勇气，战胜了强大的巨人，赢得了公主的芳心。
female speaker with whisper style	不管发生什么事情，你都必须保持绝对的安静，千万不能让门外的人察觉到我们的存在。
female speaker with accent	那个娃儿做事总是毛毛躁躁的，喊他买个酱油都能把瓶子打烂，真的是让人脑壳痛。

1.2 Zero-Shot Controllable Generation Comparison With Other Methods

We use separate audio prompts as the timbre reference and the prosody reference, so that speaker timbre and speaking style (e.g., emotion, whisper, storytelling) are derived from different audio sources.

Timbre-Audio-Prompt	Prosody-Audio-Prompt	Text
male speaker	female speaker with storytelling style	在这个古老的城堡里，传说每到月圆之夜，就会听到优美的钢琴声，却从来没有人见过弹琴的人。
female speaker	female speaker with tsundere style	我只是不想看到你因为这种小事而丢人现眼，才不是特意过来帮你的，你可千万不要想多了。
male speaker	male speaker with whisper style	气象局的最新报告指出，未来48小时内，本地区的降雨概率将维持在20%以下。
female speaker	female speaker with alluring style	在一片茂密的原始森林深处，居住着一群拥有魔法的精灵，他们世世代代守护着一颗能够实现愿望的宝石。
female speaker	male speaker with accent	本次列车的前进方向是上海虹桥站，预计中途将停靠苏州北站和昆山南站。
male speaker	male speaker with sad emotion	看着那张泛黄的旧照片，我才意识到那些美好的时光已经永远成为了过去，再也无法触及。
male speaker	male speaker with storytelling style	选手在空中的姿态非常优美，落地也是纹丝不动，这绝对是一个满分的动作。
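
The division of labor in this comparison (prosody from one prompt, timbre from another) can be sketched as a two-step data flow. The Python below is a minimal illustration under stated assumptions, not the released DisCo-Speech API: `controllable_tts`, `toy_lm_continue`, and `toy_decode` are hypothetical names, tokens are plain integers, and the timbre reference is a short list of floats standing in for whatever representation the real decoder consumes.

```python
from typing import Callable, Dict, List

Tokens = List[int]
Timbre = List[float]


def controllable_tts(
    text: str,
    prosody_prompt_tokens: Tokens,
    timbre_reference: Timbre,
    lm_continue: Callable[[str, Tokens], Tokens],
    decode: Callable[[Tokens, Timbre], Dict[str, list]],
) -> Dict[str, list]:
    """Hypothetical pipeline: the LM handles prosody, the decoder handles timbre."""
    # Step 1: the LM continues the timbre-agnostic token sequence of the
    # prosody prompt, so the style of that prompt carries into the new tokens.
    generated = lm_continue(text, prosody_prompt_tokens)
    # Step 2: the decoder renders those tokens using a timbre reference that
    # comes from a different audio prompt.
    return decode(generated, timbre_reference)


def toy_lm_continue(text: str, prompt_tokens: Tokens) -> Tokens:
    # Toy stand-in: one token per character, shifted by the last prompt token
    # so that the "style" of the prompt visibly influences the output.
    style = prompt_tokens[-1] if prompt_tokens else 0
    return [(ord(ch) + style) % 1024 for ch in text]


def toy_decode(tokens: Tokens, timbre: Timbre) -> Dict[str, list]:
    # Toy stand-in: a real decoder would return a waveform.
    return {"tokens": list(tokens), "timbre": list(timbre)}


result = controllable_tts(
    text="hi",
    prosody_prompt_tokens=[3, 7],   # from the prosody audio prompt
    timbre_reference=[0.1, 0.2],    # from the timbre audio prompt
    lm_continue=toy_lm_continue,
    decode=toy_decode,
)
```

In this sketch the timbre reference never reaches the LM, which mirrors the claim that the tokens the LM predicts are timbre-agnostic.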

2 DisCodec: Disentangled Speech Codec

DisCodec is the key component that enables independent zero-shot control. It follows a two-stage learning paradigm: 1) Tri-factor disentanglement: motivated by the characteristics of speech attributes, DisCodec factorizes speech into content, prosody, and timbre via three parallel encoders, with hybrid decoupling constraints enforcing the separation; 2) Fusion and reconstruction: the disentangled factors are integrated by a powerful decoder, with content and prosody fused into unified, timbre-agnostic content-prosody tokens suitable for the LM, while reconstruction quality is optimized jointly to directly mitigate the disentanglement-reconstruction conflict.
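
To make the idea of "unified content-prosody tokens" concrete, here is a toy sketch of turning per-frame content and prosody features into discrete tokens while keeping timbre outside the token stream. The fusion rule (element-wise addition) and the tokenizer (nearest-neighbor lookup in a fixed codebook) are assumptions chosen for brevity; the page does not state how DisCodec fuses or quantizes. All names (`fuse`, `quantize`, `encode_utterance`, `CODEBOOK`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fuse(content: Vector, prosody: Vector) -> List[float]:
    # Assumption: fuse the two factors by element-wise addition.
    return [c + p for c, p in zip(content, prosody)]


def quantize(vec: Vector, codebook: Sequence[Vector]) -> int:
    # Assumption: pick the codebook entry with the smallest squared distance.
    def sq_dist(entry: Vector) -> float:
        return sum((v - e) ** 2 for v, e in zip(vec, entry))

    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode_utterance(
    content_frames: Sequence[Vector],
    prosody_frames: Sequence[Vector],
    timbre: Vector,
    codebook: Sequence[Vector],
) -> Tuple[List[int], List[float]]:
    # One token per frame from fused content + prosody; timbre is returned
    # separately and never enters the token sequence.
    tokens = [
        quantize(fuse(c, p), codebook)
        for c, p in zip(content_frames, prosody_frames)
    ]
    return tokens, list(timbre)


CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
tokens, timbre = encode_utterance(
    content_frames=[[0.1, 0.0], [1.0, 0.9]],
    prosody_frames=[[0.0, 0.1], [0.9, 1.0]],
    timbre=[0.3, -0.2],
    codebook=CODEBOOK,
)
```

Keeping timbre as a separate return value is the design point of the sketch: anything that consumes only `tokens`, such as an LM, has no access to speaker identity.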

2.1 DisCodec Reconstruction Performance

	GroundTruth	DisCodec		GroundTruth	DisCodec
Accent			Sad
Neutral			Fear
Disgust			Happy
Surprise			Calm
Accent			Whisper
English			Style
Style			Accent

2.2 Zero-Shot Voice Conversion of DisCodec

DisCodec disentangles speech into timbre-agnostic content-prosody representations and prosody-agnostic timbre representations. We evaluate it on the zero-shot voice conversion (VC) task, which requires a model to convert the speaker timbre while preserving the prosody and content of the source speech.

Target-Speaker-Prompt	Source-Speaker-Prompt	DisCodec Result
male speaker	female speaker with disgust emotion
female speaker	female speaker with cry style
male speaker	female speaker with happy emotion
female speaker	male speaker with storytelling style
male speaker	female speaker with accent
female speaker	male speaker with angry emotion
male speaker	female speaker with cry style
male speaker	female speaker with surprise emotion
female speaker	male speaker with whisper style
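
Under the disentangled view described in this section, voice conversion amounts to recombining two utterances: keep the content-prosody tokens of the source and take the timbre of the target. The sketch below only illustrates that recombination; `Encoded` and `convert_voice` are hypothetical names, and the decoding step that would turn the result back into audio is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Encoded:
    """Hypothetical container for one encoded utterance."""
    content_prosody_tokens: List[int]  # timbre-agnostic
    timbre: List[float]                # prosody-agnostic


def convert_voice(source: Encoded, target: Encoded) -> Encoded:
    # Content and prosody come from the source speech; timbre comes from the
    # target speaker prompt. A real system would pass the result to a decoder.
    return Encoded(
        content_prosody_tokens=list(source.content_prosody_tokens),
        timbre=list(target.timbre),
    )


source = Encoded(content_prosody_tokens=[5, 9, 2], timbre=[0.1, 0.2])
target = Encoded(content_prosody_tokens=[7], timbre=[0.8, 0.9])
converted = convert_voice(source, target)
```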

2.3 Disentanglement Visual Analysis

DisCodec explicitly decouples speech into content, prosody, and timbre under the guidance of hybrid decoupling constraints, ensuring the integrity of each attribute.

Prompt-Audio	Text	Disentangled Content	Content Visual	Disentangled Prosody	Prosody Visual	Reconstruction
	居然是我先和你提的分手！
	今夜的月光如此清亮，不做些什么真是浪费，随我一起去月下漫步吧，不许拒绝
	皇上请三思，皇后娘娘都是为了万岁爷着想，请万岁爷不要辜负了娘娘的一片苦心.
	A chance to leave him alone, but... No. She just wanted to see him again. Anna... You don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything.

3 Additional DisCo-Speech Zero-Shot Demos

In this part, we synthesize speech in arbitrary styles and emotions from a large set of texts whose content is unrelated to the style or emotion. This indicates that DisCo-Speech infers the style and emotion of the output audio primarily from the prosodic acoustic reference rather than from the text content.

3.1 Expressive Voice Cloning

This section showcases DisCo-Speech's prosodic continuation capability in voice cloning: the model captures and reproduces not only the speaking style but also nuanced accents from the audio prompt.

Audio-Prompt	Target Text	Audio-Prompt	Target Text
male speaker with storytelling style	只见那好汉纵身一跃，跳上房梁，身轻如燕，眨眼间便消失在茫茫夜色之中，无影无踪	female speaker with sad emotion	每当深夜回想起那些往事，心中就会涌起一阵难以言喻的酸楚，久久无法平息
male speaker with accent	那个娃儿做事总是毛毛躁躁的，喊他买个酱油都能把瓶子打烂，真的是让人脑壳痛	female speaker with accent	这事儿你就放一百个心，只要我答应了你，就算是天上下刀子我也给你办得妥妥的
male speaker with storytelling style	欲知这后事如何发展，且听我喝口茶润润嗓子，咱们下回分解再细细道来	female speaker with tsundere style	我只是不想看到你因为这种小事而丢人现眼，才不是特意过来帮你的，你可千万不要想多了
female speaker with angry emotion	这种极不负责任的态度彻底激怒了我，我要求你立刻给出一个合理的解释	male speaker with cry style	对不起，都是我的错，如果当初我能再小心一点，就不会发生这样的悲剧了
female speaker with acting cute style	今天的包包真的好重哦，你能不能帮人家提一下嘛，你最好了	male speaker with accent	能够在这个项目中与如此优秀的团队并肩作战，并取得这样的成绩，我感到由衷的荣幸
female speaker with storytelling style	从前有一个勇敢的小裁缝，他凭着自己的智慧和勇气，战胜了强大的巨人，赢得了公主的芳心	male speaker with storytelling style	比赛已经进入了最后的伤停补时阶段，留给红队的时间已经不多了，他们必须发起最后的总攻才能挽回败局
male speaker with storytelling style	那将军手持丈八蛇矛，胯下乌骓马，一声断喝如平地惊雷，吓得曹军百万雄师竟无一人敢上前应战	female speaker with alluring style	今晚的月色如此迷人，你难道不想放下所有的戒备，和我一起探索这个世界未知的快乐吗？
male speaker with whisper style	趁着守卫换班的间隙，我们必须快速通过这条通道，记住，脚步要轻	male speaker with angry emotion	为了确保数据的安全与一致性，所有用户在访问数据库之前，都必须通过三层身份验证。
Mandarin, male speaker with sad emotion	I have a really bad feeling about this. Let's just go	male speaker with surprise style	我简直无法相信自己的眼睛，在这么短的时间内，你竟然完成了这个不可能的任务
female speaker with surprise style	这个消息来得太突然了，我完全没有做好心理准备，需要一点时间来消化	Mandarin, male speaker with storytelling style	Are you kidding me? This is completely unacceptable! That's it! You've crossed the line this time
male speaker with Tang Shi style	怒发冲冠，凭栏处、潇潇雨歇。抬望眼，仰天长啸，壮怀激烈	female speaker with fear emotion	在这漆黑一片的走廊里，那种未知的压迫感让我浑身颤抖
female speaker with alluring style	我的文硕哥哥，你是在犹豫，还是在享受这种，被我步步紧逼的，心痒难耐的感觉	male speaker with storytelling style	What's that over there? I've never seen anything like it

3.2 Cross-Lingual Voice Cloning

This section showcases DisCo-Speech's prosodic continuation capability in cross-lingual voice cloning: the model reproduces not only the speaking style but also nuanced accents from the audio prompt.

Audio-Prompt	Target Text	Audio-Prompt	Target Text
English male speaker	该药品需在二十五摄氏度以下的干燥环境中避光保存。	Mandarin male speaker with angry emotion	Participants in the clinical trial will be monitored for a period of six months after treatment.
English male speaker	本次软件更新的重点在于修复已知漏洞并提升系统运行稳定性。	Mandarin female speaker with storytelling style	According to the protocol, each sample must be analyzed three separate times to ensure result accuracy.
English female speaker	记得祖母总把中药包收在檀木柜最阴凉的格子里，说这样药性才能保持完整。	Mandarin male speaker with accent	The research team has published their findings in the latest issue of the prestigious scientific journal.
English male speaker	参观污水处理厂时，讲解员指着第四道净水池说这里的水已经可以养锦鲤了。	Mandarin female speaker	The architectural design incorporates passive cooling techniques to reduce energy consumption.
English female speaker	国际研讨会茶歇期间，那位白发学者用十二种语言说了同一句谢谢。	Mandarin female speaker	The catalyst increases the reaction rate by several orders of magnitude.

3.3 Zero-Shot Controllable Generation Demos

We use separate audio prompts as the timbre reference and the prosody reference, so that speaker timbre and speaking prosody are derived from different audio prompts. These demos show DisCo-Speech's control over content, prosody, and timbre in a zero-shot setting.

Timbre-Audio-Prompt	Prosody-Audio-Prompt	Target Text	Timbre-Audio-Prompt	Prosody-Audio-Prompt	Target Text
male speaker	female speaker with accent	这家火锅店的味道确实霸道，特别是那个鸭肠，稍微烫一下就脆得很，口感简直没得说。	English, male speaker	Mandarin, male speaker with angry emotion	该实验项目的第二阶段测试将在所有必要条件均得到满足后，按照既定流程开始执行。
female speaker	female speaker with accent	气象局的最新报告指出，未来四十八小时内，本地区的降雨概率将维持在百分之二十以下。	female speaker	female speaker with angry emotion	你的行为严重违反了我们的约定，这种破坏信任的做法是绝对无法被原谅的
female speaker	female speaker with acting cute style	谁稀罕你的关心啊，我一个人过得好好的，才不需要你在旁边指手画脚	male speaker	male speaker with arrogant style	你们所谓的努力在我看来不过是徒劳的挣扎，差距从一开始就已经注定了
female speaker	male speaker with acting cute style	人家真的不是故意的，你就不要再生气了嘛，笑一个给我看看好不好	female speaker	male speaker with whisper style	小心一点，那个角落里好像安装了窃听器，我们说话的声音必须再小一点。
female speaker	female speaker with alluring style	你在我们屋子里走路的时候，发现路程遥远，这是不足为怪的	English, female speaker	Mandarin, female speaker with alluring style	为了确保数据的安全与一致性，所有用户在访问数据库之前都必须通过三层身份验证。
male speaker	female speaker with storytelling style	在一片茂密的原始森林深处，居住着一群拥有魔法的精灵，他们世世代代守护着一颗能够实现愿望的宝石	female speaker	male speaker with alluring style	很久很久以前，天上有十个太阳，大地被烤得焦黑，直到一位名叫后羿的英雄拉开了神弓
female speaker	male speaker with angry emotion	这种极不负责任的态度彻底激怒了我，我要求你立刻给出一个合理的解释	female speaker	male speaker with angry emotion	我无论如何都不能接受这种毫无底线的背叛行为，你必须为你的所作所为承担全部责任
female speaker	male speaker with fear emotion	脚步声越来越近了，我躲在角落里大气都不敢出，生怕一点动静就会暴露自己的位置	female speaker	male speaker with surprise emotion	完全出乎我的意料，这个看起来毫不起眼的装置，竟然能爆发出如此巨大的能量
male speaker	female speaker with cry style	汤姆，我真愿意信你的话，这样可以一肥遮百丑	male speaker	female speaker with storytelling style	小红帽提着篮子走进了森林，她要去探望生病的奶奶，却不知道一只狡猾的大灰狼已经盯上了她
