Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. DisCodec, the core component, is built in two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control.
Fig 1: The structure and two-stage training of DisCodec.
Fig 2: The overview of DisCo-Speech.
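The inference data flow described in the abstract can be sketched as below. This is only a toy illustration under stated assumptions: every name here (`Waveform`, `encode_content_prosody`, `encode_timbre`, `lm_continue`, `decode`, `synthesize`) is hypothetical and not the released DisCo-Speech API, and string labels stand in for real tokens and embeddings. It shows only which prompt feeds which component: the prosody prompt conditions the LM, and the timbre prompt conditions the decoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Waveform:
    samples: List[float]
    speaker: str  # toy label standing in for timbre
    style: str    # toy label standing in for prosody


def encode_content_prosody(wav: Waveform) -> List[str]:
    """Hypothetical stand-in for the fused content-prosody tokenizer.

    The speaker label is deliberately dropped: the tokens carry no timbre.
    """
    return [f"{wav.style}:{i}" for i in range(len(wav.samples))]


def encode_timbre(wav: Waveform) -> str:
    """Hypothetical stand-in for the timbre encoder."""
    return wav.speaker


def lm_continue(prompt_tokens: List[str], text: str) -> List[str]:
    """Hypothetical stand-in for the LM: continue in the prompt's style.

    Emits one toy token per character of the target text.
    """
    style = prompt_tokens[0].split(":")[0] if prompt_tokens else "neutral"
    return [f"{style}:{ch}" for ch in text]


def decode(tokens: List[str], timbre: str) -> Waveform:
    """Hypothetical stand-in for the decoder: inject timbre at synthesis."""
    style = tokens[0].split(":")[0] if tokens else "neutral"
    return Waveform(samples=[0.0] * len(tokens), speaker=timbre, style=style)


def synthesize(text: str, prosody_prompt: Waveform, timbre_prompt: Waveform) -> Waveform:
    style_tokens = encode_content_prosody(prosody_prompt)    # prosody comes from the style prompt
    new_tokens = lm_continue(style_tokens, text)             # LM performs prosodic continuation
    return decode(new_tokens, encode_timbre(timbre_prompt))  # decoder adds the target timbre


out = synthesize(
    "hello",
    prosody_prompt=Waveform([0.0] * 3, speaker="A", style="whisper"),
    timbre_prompt=Waveform([0.0] * 2, speaker="B", style="neutral"),
)
print(out.speaker, out.style)
```

In this toy setup the output takes its style label from the prosody prompt and its speaker label from the timbre prompt, which mirrors the separation the demo tables below are meant to show.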
| Audio-Prompt | Text | DisCo-Speech | IndexTTS2 | CosyVoice2 | SparkTTS | Vevo |
|---|---|---|---|---|---|---|
| male speaker with accent | 为了确保数据的安全与一致性,用户在访问数据库之前,都必须通过三层身份验证。 | | | | | |
| male speaker with disgust emotion | 别拿这种微不足道的成就来炫耀,我所拥有的资源是你这辈子都无法想象的。 | | | | | |
| female speaker with sad emotion | 面对这样残酷的现实,我感到一种深深的无力感,仿佛所有的努力都瞬间化为泡影。 | | | | | |
| female speaker with disgust emotion | 请拿走这些粗制滥造的东西,这种低劣的品质简直是对我们审美能力的侮辱。 | | | | | |
| female speaker with alluring style | 从前有一个勇敢的小裁缝,他凭着自己的智慧和勇气,战胜了强大的巨人,赢得了公主的芳心。 | | | | | |
| female speaker with whisper style | 不管发生什么事情,你都必须保持绝对的安静,千万不能让门外的人察觉到我们的存在。 | | | | | |
| female speaker with accent | 那个娃儿做事总是毛毛躁躁的,喊他买个酱油都能把瓶子打烂,真的是让人脑壳痛。 | | | | | |
| Timbre-Audio-Prompt | Prosody-Audio-Prompt | Text | DisCo-Speech | Vevo | IndexTTS2 |
|---|---|---|---|---|---|
| male speaker | female speaker with storytelling style | 在这个古老的城堡里,传说每到月圆之夜,就会听到优美的钢琴声,却从来没有人见过弹琴的人。 | | | |
| female speaker | female speaker with tsundere style | 我只是不想看到你因为这种小事而丢人现眼,才不是特意过来帮你的,你可千万不要想多了。 | | | |
| male speaker | male speaker with whisper style | 气象局的最新报告指出,未来48小时内,本地区的降雨概率将维持在20%以下。 | | | |
| female speaker | female speaker with alluring style | 在一片茂密的原始森林深处,居住着一群拥有魔法的精灵,他们世世代代守护着一颗能够实现愿望的宝石。 | | | |
| female speaker | male speaker with accent | 本次列车的前进方向是上海虹桥站,预计中途将停靠苏州北站和昆山南站。 | | | |
| male speaker | male speaker with sad emotion | 看着那张泛黄的旧照片,我才意识到那些美好的时光已经永远成为了过去,再也无法触及。 | | | |
| male speaker | male speaker with storytelling style | 选手在空中的姿态非常优美,落地也是纹丝不动,这绝对是一个满分的动作。 | | | |
| GroundTruth | DisCodec | GroundTruth | DisCodec | ||
|---|---|---|---|---|---|
| Accent | Sad | ||||
| Neutral | Fear | ||||
| Disgust | Happy | ||||
| Surprise | Calm | ||||
| Accent | Whisper | ||||
| English | Style | ||||
| Style | Accent |
| Target-Speaker-Prompt | Source-Speaker-Prompt | DisCodec Result |
|---|---|---|
| male speaker | female speaker with disgust emotion | |
| female speaker | female speaker with cry style | |
| male speaker | female speaker with happy emotion | |
| female speaker | male speaker with storytelling style | |
| male speaker | female speaker with accent | |
| female speaker | male speaker with angry emotion | |
| male speaker | female speaker with cry style | |
| male speaker | female speaker with surprise emotion | |
| female speaker | male speaker with whisper style | |
| Prompt-Audio | Text | Disentangled Content | Content Visual | Disentangled Prosody | Prosody Visual | Reconstruction |
|---|---|---|---|---|---|---|
| | 居然是我先和你提的分手! | | | | | |
| | 今夜的月光如此清亮,不做些什么真是浪费,随我一起去月下漫步吧,不许拒绝 | | | | | |
| | 皇上请三思,皇后娘娘都是万岁爷着想,请万岁爷不要辜负了娘娘的一片苦心. | | | | | |
| | A chance to leave him alone, but...No.She just wanted to see him again.Anna...You don't know how it feels to lose a sister.Anna, I'm sorry, but your father asked me not to tell you anything. | | | | | |
| Audio-Prompt | Target Text | DisCo-Speech | Audio-Prompt | Target Text | DisCo-Speech |
|---|---|---|---|---|---|
| male speaker with storytelling style | 只见那好汉纵身一跃,跳上房梁,身轻如燕,眨眼间便消失在茫茫夜色之中,无影无踪 | | female speaker with sad emotion | 每当深夜回想起那些往事,心中就会涌起一阵难以言喻的酸楚,久久无法平息 | |
| male speaker with accent style | 那个娃儿做事总是毛毛躁躁的,喊他买个酱油都能把瓶子打烂,真的是让人脑壳痛 | | female speaker with accent | 这事儿你就放一百个心,只要我答应了你,就算是天上下刀子我也给你办得妥妥的 | |
| male speaker with storytelling style | 欲知这后事如何发展,且听我喝口茶润润嗓子,咱们下回分解再细细道来 | | female speaker with tsundere style | 我只是不想看到你因为这种小事而丢人现眼,才不是特意过来帮你的,你可千万不要想多了 | |
| female speaker with angry emotion | 这种极不负责任的态度彻底激怒了我,我要求你立刻给出一个合理的解释 | | male speaker with cry style | 对不起,都是我的错,如果当初我能再小心一点,就不会发生这样的悲剧了 | |
| female speaker with acting cute style | 今天的包包真的好重哦,你能不能帮人家提一下嘛,你最好了 | | male speaker with accent | 能够在这个项目中与如此优秀的团队并肩作战,并取得这样的成绩,我感到由衷的荣幸 | |
| female speaker with storytelling style | 从前有一个勇敢的小裁缝,他凭着自己的智慧和勇气,战胜了强大的巨人,赢得了公主的芳心 | | male speaker with storytelling style | 比赛已经进入了最后的伤停补时阶段,留给红队的时间已经不多了,他们必须发起最后的总攻才能挽回败局 | |
| male speaker with storytelling style | 那将军手持丈八蛇矛,胯下乌骓马,一声断喝如平地惊雷,吓得曹军百万雄师竟无一人敢上前应战 | | female speaker with alluring style | 今晚的月色如此迷人,你难道不想放下所有的戒备,和我一起探索这个世界未知的快乐吗? | |
| male speaker with whisper style | 趁着守卫换班的间隙,我们必须快速通过这条通道,记住,脚步要轻 | | male speaker with angry emotion | 为了确保数据的安全与一致性,所有用户在访问数据库之前,都必须通过三层身份验证。 | |
| Mandarin, male speaker with sad emotion | I have a really bad feeling about this. Let's just go | | male speaker with surprise style | 我简直无法相信自己的眼睛,在这么短的时间内,你竟然完成了这个不可能的任务 | |
| female speaker with surprise style | 这个消息来得太突然了,我完全没有做好心理准备,需要一点时间来消化 | | Mandarin, male speaker with storytelling style | Are you kidding me? This is completely unacceptable! That's it! You've crossed the line this time | |
| male speaker with Tang Shi style | 怒发冲冠,凭栏处、潇潇雨歇。抬望眼,仰天长啸,壮怀激烈 | | female speaker with fear emotion | 在这漆黑一片的走廊里,那种未知的压迫感让我浑身颤抖 | |
| female speaker with alluring style | 我的文硕哥哥,你是在犹豫,还是在享受这种,被我步步紧逼的,心痒难耐的感觉 | | male speaker with storytelling style | What's that over there? I've never seen anything like it | |
| Timbre-Audio-Prompt | Prosody-Audio-Prompt | Target Text | DisCo-Speech | Timbre-Audio-Prompt | Prosody-Audio-Prompt | Target Text | DisCo-Speech |
|---|---|---|---|---|---|---|---|
| male speaker | female speaker with accent | 这家火锅店的味道确实霸道,特别是那个鸭肠,稍微烫一下就脆得很,口感简直没得说。 | | English, male speaker | Mandarin, male speaker with angry emotion | 该实验项目的第二阶段测试将在所有必要条件均得到满足后,按照既定流程开始执行。 | |
| female speaker | female speaker with accent | 气象局的最新报告指出,未来四十八小时内,本地区的降雨概率将维持在百分之二十以下。 | | female speaker | female speaker with angry emotion | 你的行为严重违反了我们的约定,这种破坏信任的做法是绝对无法被原谅的 | |
| female speaker | female speaker with acting cute style | 谁稀罕你的关心啊,我一个人过得好好的,才不需要你在旁边指手画脚 | | male speaker | male speaker with arrogant style | 你们所谓的努力在我看来不过是徒劳的挣扎,差距从一开始就已经注定了 | |
| female speaker | male speaker with acting cute style | 人家真的不是故意的,你就不要再生气了嘛,笑一个给我看看好不好 | | female speaker | male speaker with whisper style | 小心一点,那个角落里好像安装了窃听器,我们说话的声音必须再小一点。 | |
| female speaker | female speaker with alluring style | 你在我们屋子里走路的时候,发现路程遥远,这是不足为怪的 | | English, female speaker | Mandarin, female speaker with alluring style | 为了确保数据的安全与一致性,所有用户在访问数据库之前都必须通过三层身份验证。 | |
| male speaker | female speaker with storytelling style | 在一片茂密的原始森林深处,居住着一群拥有魔法的精灵,他们世世代代守护着一颗能够实现愿望的宝石 | | female speaker | male speaker with alluring style | 很久很久以前,天上有十个太阳,大地被烤得焦黑,直到一位名叫后羿的英雄拉开了神弓 | |
| female speaker | male speaker with angry emotion | 这种极不负责任的态度彻底激怒了我,我要求你立刻给出一个合理的解释 | | female speaker | male speaker with angry emotion | 我无论如何都不能接受这种毫无底线的背叛行为,你必须为你的所作所为承担全部责任 | |
| female speaker | male speaker with fear emotion | 脚步声越来越近了,我躲在角落里大气都不敢出,生怕一点动静就会暴露自己的位置 | | female speaker | male speaker with surprise emotion | 完全出乎我的意料,这个看起来毫不起眼的装置,竟然能爆发出如此巨大的能量 | |
| male speaker | female speaker with cry style | 汤姆,我真愿意信你的话,这样可以一肥遮百丑 | | male speaker | female speaker with storytelling style | 小红帽提着篮子走进了森林,她要去探望生病的奶奶,却不知道一只狡猾的大灰狼已经盯上了她 | |