DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

Abstract

Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, contains two core stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.

DisCo-Speech Overall Architecture

Fig 1: The structure and two-stage training of DisCodec.

DisCodec Detailed Structure

Fig 2: The overview of DisCo-Speech.


1 DisCo-Speech: Zero-shot Controllable Speech Generation

This section showcases DisCo-Speech's prosodic continuation capability in voice cloning: it captures and reproduces not only the speaking style but also the nuanced accent of the audio prompt.

1.1 Voice Cloning Comparison With Other Methods

Audio-Prompt Text DisCo-Speech IndexTTS2 CosyVoice2 SparkTTS Vevo

male speaker with accent

为了确保数据的安全与一致性,用户在访问数据库之前,都必须通过三层身份验证。

male speaker with disgust emotion

别拿这种微不足道的成就来炫耀,我所拥有的资源是你这辈子都无法想象的。

female speaker with sad emotion

面对这样残酷的现实,我感到一种深深的无力感,仿佛所有的努力都瞬间化为泡影。

female speaker with disgust emotion

请拿走这些粗制滥造的东西,这种低劣的品质简直是对我们审美能力的侮辱。

female speaker with alluring style

从前有一个勇敢的小裁缝,他凭着自己的智慧和勇气,战胜了强大的巨人,赢得了公主的芳心。

female speaker with whisper style

不管发生什么事情,你都必须保持绝对的安静,千万不能让门外的人察觉到我们的存在。

female speaker with accent

那个娃儿做事总是毛毛躁躁的,喊他买个酱油都能把瓶子打烂,真的是让人脑壳痛。

1.2 Zero-Shot Controllable Generation Comparison With Other Methods

We use separate audio prompts as references for timbre and for prosody, so that the speaker timbre and the speaking style (e.g., accent, emotion, whisper, storytelling) come from different audio sources.
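The two-reference setup can be read as a two-step data flow: the LM continues the content-prosody tokens of the prosody prompt for the new text, and the decoder then renders those tokens with the timbre of the other prompt. The following is a minimal sketch of that flow only, not the authors' implementation. All names (`generate`, `continue_tokens`, `decode`, `fake_lm`, `fake_decoder`) are hypothetical, and the stand-in functions merely make the example runnable; we also assume the prompt tokens and the timbre embedding were already extracted from the two audio prompts.

```python
from typing import Callable, List, Sequence

Tokens = List[int]


def generate(
    text: str,
    prosody_prompt_tokens: Sequence[int],
    timbre_embedding: Sequence[float],
    continue_tokens: Callable[[str, Sequence[int]], Tokens],
    decode: Callable[[Sequence[int], Sequence[float]], List[float]],
) -> List[float]:
    """Two-step zero-shot generation with separate prosody and timbre references.

    continue_tokens: hypothetical LM call; continues the prosody prompt's
        content-prosody tokens for the new text and returns only the new tokens.
    decode: hypothetical codec decoder; renders tokens with the given timbre.
    """
    new_tokens = continue_tokens(text, prosody_prompt_tokens)
    return decode(new_tokens, timbre_embedding)


def fake_lm(text: str, prompt: Sequence[int]) -> Tokens:
    # Stand-in for the LM: one token per character, offset by the last prompt token.
    base = prompt[-1] if prompt else 0
    return [base + i + 1 for i in range(len(text))]


def fake_decoder(tokens: Sequence[int], timbre: Sequence[float]) -> List[float]:
    # Stand-in for the decoder: one "sample" per token, scaled by the first timbre value.
    return [t * timbre[0] for t in tokens]


waveform = generate("abc", [7, 8, 9], [0.5], fake_lm, fake_decoder)
print(waveform)
```

Note that the timbre embedding never reaches the LM step in this sketch; it is only consumed by the decoder, which mirrors the division of labor described in the abstract.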

Timbre-Audio-Prompt Prosody-Audio-Prompt Text DisCo-Speech Vevo IndexTTS2

male speaker

female speaker with storytelling style

在这个古老的城堡里,传说每到月圆之夜,就会听到优美的钢琴声,却从来没有人见过弹琴的人。

female speaker

female speaker with tsundere style

我只是不想看到你因为这种小事而丢人现眼,才不是特意过来帮你的,你可千万不要想多了。

male speaker

male speaker with whisper style

气象局的最新报告指出,未来48小时内,本地区的降雨概率将维持在20%以下。

female speaker

female speaker with alluring style

在一片茂密的原始森林深处,居住着一群拥有魔法的精灵,他们世世代代守护着一颗能够实现愿望的宝石。

female speaker

male speaker with accent

本次列车的前进方向是上海虹桥站,预计中途将停靠苏州北站和昆山南站。

male speaker

male speaker with sad emotion

看着那张泛黄的旧照片,我才意识到那些美好的时光已经永远成为了过去,再也无法触及。

male speaker

male speaker with storytelling style

选手在空中的姿态非常优美,落地也是纹丝不动,这绝对是一个满分的动作。

2 DisCodec: Disentangled Speech Codec

DisCodec is the key component that enables independent zero-shot control. It is trained in two stages: 1) Tri-factor disentanglement: DisCodec factorizes speech into content, prosody, and timbre via three parallel encoders, with hybrid decoupling constraints to enforce the separation; 2) Fusion and reconstruction: building on the disentangled factors, a decoder fuses content and prosody into unified, timbre-agnostic content-prosody tokens suitable for LM use, while reconstruction quality is optimized to directly mitigate the disentanglement-reconstruction conflict.
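To make the shape of the data concrete, the sketch below models the three factors as a small container and fuses content and prosody into discrete tokens. It is an illustration under stated assumptions, not DisCodec itself: the `Factors` class, `fuse_to_tokens`, per-frame concatenation as the fusion step, and nearest-neighbour codebook lookup as the quantizer are all hypothetical stand-ins for the learned components.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Factors:
    """Hypothetical container for the three disentangled factors."""
    content: List[Vector]  # one vector per frame
    prosody: List[Vector]  # one vector per frame
    timbre: Vector         # one global vector per utterance


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fuse_to_tokens(factors: Factors, codebook: List[Vector]) -> List[int]:
    """Fuse content and prosody per frame, then quantize to token ids.

    Fusion is plain concatenation and quantization is a nearest-neighbour
    codebook lookup; both are assumptions made for illustration.
    The timbre vector is never read, so the tokens are timbre-agnostic
    by construction.
    """
    if len(factors.content) != len(factors.prosody):
        raise ValueError("content and prosody must have the same frame count")
    tokens = []
    for c, p in zip(factors.content, factors.prosody):
        fused = list(c) + list(p)
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(fused, codebook[i]))
        tokens.append(nearest)
    return tokens


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
utt = Factors(content=[[0.9], [0.1]], prosody=[[0.2], [0.8]], timbre=[0.5, -0.3])
print(fuse_to_tokens(utt, codebook))  # prints [1, 2]
```

Changing `utt.timbre` leaves the token sequence unchanged, which is the property the LM relies on when it predicts content-prosody tokens without seeing speaker identity.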

2.1 DisCodec Reconstruction Performance

GroundTruth DisCodec GroundTruth DisCodec
Accent
Sad
Neutral
Fear
Disgust
Happy
Surprise
Calm
Accent
Whisper
English
Style
Style
Accent

2.2 Zero-Shot Voice Conversion of DisCodec

DisCodec disentangles speech into timbre-agnostic content-prosody representations and prosody-agnostic timbre representations. We evaluate it on the zero-shot voice conversion (VC) task, which requires a model to convert the speaker timbre while preserving the prosody and content of the source speech.
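Given such a split, voice conversion reduces to recombination: keep the source's content-prosody tokens and pair them with the target speaker's timbre before decoding. The sketch below shows only that recombination step, with hypothetical names (`Encoded`, `convert_voice`); in a real system the resulting pair would be passed to the neural decoder to produce audio.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Encoded:
    """Hypothetical codec output for one utterance."""
    content_prosody_tokens: Tuple[int, ...]  # timbre-agnostic, frame-level
    timbre: Tuple[float, ...]                # prosody-agnostic, utterance-level


def convert_voice(source: Encoded, target: Encoded) -> Encoded:
    """Keep what is said and how (source tokens); swap who says it (target timbre)."""
    return Encoded(source.content_prosody_tokens, target.timbre)


source = Encoded(content_prosody_tokens=(3, 1, 4), timbre=(0.1, 0.2))
target = Encoded(content_prosody_tokens=(2, 7), timbre=(0.9, -0.4))
converted = convert_voice(source, target)
print(converted)
```

The target's own tokens are discarded entirely, so the length and rhythm of the converted utterance follow the source, as the VC task requires.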

Target-Speaker-Prompt Source-Speaker-Prompt DisCodec Result

male speaker

female speaker with disgust emotion

female speaker

female speaker with cry style

male speaker

female speaker with happy emotion

female speaker

male speaker with storytelling style

male speaker

female speaker with accent

female speaker

male speaker with angry emotion

male speaker

female speaker with cry style

male speaker

female speaker with surprise emotion

female speaker

male speaker with whisper style

2.3 Disentanglement Visual Analysis

DisCodec explicitly decouples speech into content, prosody, and timbre under the guidance of hybrid decoupling constraints, ensuring the integrity of each attribute.

Prompt-Audio Text Disentangled Content Content Visual Disentangled Prosody Prosody Visual Reconstruction
居然是我先和你提的分手!
Content Visualization
Prosody Visualization
今夜的月光如此清亮,不做些什么真是浪费,随我一起去月下漫步吧,不许拒绝
Content Visualization
Prosody Visualization
皇上请三思,皇后娘娘都是万岁爷着想,请万岁爷不要辜负了娘娘的一片苦心.
Content Visualization
Prosody Visualization
A chance to leave him alone, but... No. She just wanted to see him again. Anna... You don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything.
Content Visualization
Prosody Visualization

3 Additional DisCo-Speech Zero-Shot Demos

In this part, we synthesize speech in arbitrary styles and emotions from a large set of texts whose content is unrelated to the target style or emotion. This indicates that DisCo-Speech infers the style and emotion of the output audio primarily from the prosodic acoustic reference rather than from the text content.

3.1 Expressive Voice Cloning

This section showcases DisCo-Speech's prosodic continuation capability in voice cloning: it captures and reproduces not only the speaking style but also the nuanced accent of the audio prompt.

Audio-Prompt Target Text DisCo-Speech Audio-Prompt Target Text DisCo-Speech

male speaker with storytelling style

只见那好汉纵身一跃,跳上房梁,身轻如燕,眨眼间便消失在茫茫夜色之中,无影无踪

female speaker with sad emotion

每当深夜回想起那些往事,心中就会涌起一阵难以言喻的酸楚,久久无法平息

male speaker with accent style

那个娃儿做事总是毛毛躁躁的,喊他买个酱油都能把瓶子打烂,真的是让人脑壳痛

female speaker with accent

这事儿你就放一百个心,只要我答应了你,就算是天上下刀子我也给你办得妥妥的

male speaker with storytelling style

欲知这后事如何发展,且听我喝口茶润润嗓子,咱们下回分解再细细道来

female speaker with tsundere style

我只是不想看到你因为这种小事而丢人现眼,才不是特意过来帮你的,你可千万不要想多了

female speaker with angry emotion

这种极不负责任的态度彻底激怒了我,我要求你立刻给出一个合理的解释

male speaker with cry style

对不起,都是我的错,如果当初我能再小心一点,就不会发生这样的悲剧了

female speaker with acting cute style

今天的包包真的好重哦,你能不能帮人家提一下嘛,你最好了

male speaker with accent

能够在这个项目中与如此优秀的团队并肩作战,并取得这样的成绩,我感到由衷的荣幸

female speaker with storytelling style

从前有一个勇敢的小裁缝,他凭着自己的智慧和勇气,战胜了强大的巨人,赢得了公主的芳心

male speaker with storytelling style

比赛已经进入了最后的伤停补时阶段,留给红队的时间已经不多了,他们必须发起最后的总攻才能挽回败局

male speaker with storytelling style

那将军手持丈八蛇矛,胯下乌骓马,一声断喝如平地惊雷,吓得曹军百万雄师竟无一人敢上前应战

female speaker with alluring style

今晚的月色如此迷人,你难道不想放下所有的戒备,和我一起探索这个世界未知的快乐吗?

male speaker with whisper style

趁着守卫换班的间隙,我们必须快速通过这条通道,记住,脚步要轻

male speaker with angry emotion

为了确保数据的安全与一致性,所有用户在访问数据库之前,都必须通过三层身份验证。

Mandarin, male speaker with sad emotion

I have a really bad feeling about this. Let's just go

male speaker with surprise style

我简直无法相信自己的眼睛,在这么短的时间内,你竟然完成了这个不可能的任务

female speaker with surprise style

这个消息来得太突然了,我完全没有做好心理准备,需要一点时间来消化

Mandarin, male speaker with storytelling style

Are you kidding me? This is completely unacceptable! That's it! You've crossed the line this time

male speaker with Tang Shi (Tang poetry) style

怒发冲冠,凭栏处、潇潇雨歇。抬望眼,仰天长啸,壮怀激烈

female speaker with fear emotion

在这漆黑一片的走廊里,那种未知的压迫感让我浑身颤抖

female speaker with alluring style

我的文硕哥哥,你是在犹豫,还是在享受这种,被我步步紧逼的,心痒难耐的感觉

male speaker with storytelling style

What's that over there? I've never seen anything like it

3.2 Zero-Shot Controllable Generation Demos

We use separate audio prompts as references for timbre and for prosody, so that the speaker timbre and the speaking prosody come from different audio prompts. These demos show DisCo-Speech's control over content, prosody, and timbre in a zero-shot setting.

Timbre-Audio-Prompt Prosody-Audio-Prompt Target Text DisCo-Speech Timbre-Audio-Prompt Prosody-Audio-Prompt Target Text DisCo-Speech

male speaker

female speaker with accent

这家火锅店的味道确实霸道,特别是那个鸭肠,稍微烫一下就脆得很,口感简直没得说。

English, male speaker

Mandarin, male speaker with angry emotion

该实验项目的第二阶段测试将在所有必要条件均得到满足后,按照既定流程开始执行。

female speaker

female speaker with accent

气象局的最新报告指出,未来四十八小时内,本地区的降雨概率将维持在百分之二十以下。

female speaker

female speaker with angry emotion

你的行为严重违反了我们的约定,这种破坏信任的做法是绝对无法被原谅的

female speaker

female speaker with acting cute style

谁稀罕你的关心啊,我一个人过得好好的,才不需要你在旁边指手画脚

male speaker

male speaker with arrogant style

你们所谓的努力在我看来不过是徒劳的挣扎,差距从一开始就已经注定了

female speaker

male speaker with acting cute style

人家真的不是故意的,你就不要再生气了嘛,笑一个给我看看好不好

female speaker

male speaker with whisper style

小心一点,那个角落里好像安装了窃听器,我们说话的声音必须再小一点。

female speaker

female speaker with alluring style

你在我们屋子里走路的时候,发现路程遥远,这是不足为怪的

English, female speaker

Mandarin, female speaker with alluring style

为了确保数据的安全与一致性,所有用户在访问数据库之前都必须通过三层身份验证。

male speaker

female speaker with storytelling style

在一片茂密的原始森林深处,居住着一群拥有魔法的精灵,他们世世代代守护着一颗能够实现愿望的宝石

female speaker

male speaker with alluring style

很久很久以前,天上有十个太阳,大地被烤得焦黑,直到一位名叫后羿的英雄拉开了神弓

female speaker

male speaker with angry emotion

这种极不负责任的态度彻底激怒了我,我要求你立刻给出一个合理的解释

female speaker

male speaker with angry emotion

我无论如何都不能接受这种毫无底线的背叛行为,你必须为你的所作所为承担全部责任

female speaker

male speaker with fear emotion

脚步声越来越近了,我躲在角落里大气都不敢出,生怕一点动静就会暴露自己的位置

female speaker

male speaker with surprise emotion

完全出乎我的意料,这个看起来毫不起眼的装置,竟然能爆发出如此巨大的能量

male speaker

female speaker with cry style

汤姆,我真愿意信你的话,这样可以一肥遮百丑

male speaker

female speaker with storytelling style

小红帽提着篮子走进了森林,她要去探望生病的奶奶,却不知道一只狡猾的大灰狼已经盯上了她