Audio samples from "Exploring Timbre Disentanglement in Non-Autoregressive Cross-Lingual Text-to-Speech"

Abstract: In this paper, we study the disentanglement of speaker and language representations in non-autoregressive cross-lingual TTS models from several angles. We propose a phoneme length regulator that resolves the length mismatch between the IPA input sequence and monolingual alignment results. Using the phoneme length regulator, we present a FastPitch-based cross-lingual model that takes IPA symbols as input representations. Our experiments show that language-independent input representations (e.g., IPA symbols), an increasing number of training speakers, and explicit modeling of speech variance information all encourage non-autoregressive cross-lingual TTS models to disentangle speaker and language representations. Subjective evaluation shows that our proposed model achieves decent naturalness and speaker similarity in cross-language voice cloning.
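
The abstract does not spell out how the phoneme length regulator works. As a rough illustration of the length mismatch only, the sketch below assumes that one monolingual phoneme may map to several IPA symbols, that IPA-level vectors are averaged back to one vector per phoneme, and that phoneme-level durations from the aligner then expand those vectors to frames. The function names, the averaging step, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def pool_ipa_to_phonemes(
    ipa_vectors: Sequence[Sequence[float]],
    ipa_per_phoneme: Sequence[int],
) -> List[List[float]]:
    """Average consecutive IPA-level vectors so one vector remains per phoneme.

    Assumption: ipa_per_phoneme[i] is the number of IPA symbols that the
    i-th monolingual phoneme was mapped to.
    """
    if sum(ipa_per_phoneme) != len(ipa_vectors):
        raise ValueError("ipa_per_phoneme must account for every IPA vector")
    pooled: List[List[float]] = []
    start = 0
    for count in ipa_per_phoneme:
        if count <= 0:
            raise ValueError("each phoneme needs at least one IPA symbol")
        group = ipa_vectors[start:start + count]
        dim = len(group[0])
        # Element-wise mean over the IPA symbols belonging to this phoneme.
        pooled.append([sum(vec[d] for vec in group) / count for d in range(dim)])
        start += count
    return pooled


def expand_by_duration(
    phoneme_vectors: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme-level vector once per aligned frame."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("one duration is required per phoneme")
    frames: List[List[float]] = []
    for vec, dur in zip(phoneme_vectors, durations):
        frames.extend(list(vec) for _ in range(dur))
    return frames


# Toy example: three IPA symbols, where the first two belong to one phoneme.
ipa_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]]
phoneme_vectors = pool_ipa_to_phonemes(ipa_vectors, ipa_per_phoneme=[2, 1])
frames = expand_by_duration(phoneme_vectors, durations=[2, 3])
```

In this toy case the three IPA-level vectors collapse to two phoneme-level vectors, which matches the two durations a monolingual aligner would supply, and the expansion yields five frames.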
 

Contents

1. Reference Audio

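Text(CN): 城郊外的某个别墅区。 (English: "A villa district on the outskirts of the city.")
Text(CN): 是站在门往外看,不是往里看。 (English: "It is standing at the door looking out, not looking in.")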
d1-CN-M Speaker
Text(EN): Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.
Text(EN): in being comparatively modern.
d1-EN-F Speaker

2. Effect of the number of training speakers

2.1 d1-CN-M Speaker Utterances

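Text(CN): 君不见,黄河之水天上来,奔流到海不复回。 (English: "Do you not see the waters of the Yellow River coming down from heaven, rushing to the sea, never to return?")
Text(CN&EN): 朋友们站在不远的地方 looking at her with serious faces. (English: "The friends stood not far away, looking at her with serious faces.")
Text(CN&EN): We have the techniques and the resources 来发展这个地方。 (English: "We have the techniques and the resources to develop this place.")
Text(EN): This no longer appears to be the case.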
FastPitch-IPA(d1)
FastPitch-IPA(d1+d2)
FastPitch-IPA(d1+d2+d3)

2.2 d1-EN-F Speaker Utterances

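Text(CN): 君不见,黄河之水天上来,奔流到海不复回。 (English: "Do you not see the waters of the Yellow River coming down from heaven, rushing to the sea, never to return?")
Text(CN&EN): 朋友们站在不远的地方 looking at her with serious faces. (English: "The friends stood not far away, looking at her with serious faces.")
Text(CN&EN): We have the techniques and the resources 来发展这个地方。 (English: "We have the techniques and the resources to develop this place.")
Text(EN): This no longer appears to be the case.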
FastPitch-IPA(d1)
FastPitch-IPA(d1+d2)
FastPitch-IPA(d1+d2+d3)

3. Comparison with baseline models & Ablation study

3.1 d1-CN-M Speaker Utterances

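Text(CN): 早在一百万年前,蓝田古人类就在这里建造了聚落。 (English: "As early as one million years ago, the ancient Lantian people built settlements here.")
Text(CN&EN): 如果没有意见, the agenda is adopted. (English: "If there are no objections, the agenda is adopted.")
Text(CN&EN): The test records 全部的过程 (English: "The test records the entire process.")
Text(EN): Author of the danger trail, Philip Steels, etc.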
FastPitch-IPA(d1+d2+d3)
FastPitch-IPA with GRL(d1+d2+d3)
FastSpeech-IPA(d1+d2+d3)
FastSpeech-LDP(d1+d2+d3)
Tacotron-based(d1+d2+d3)

3.2 d1-EN-F Speaker Utterances

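Text(CN): 早在一百万年前,蓝田古人类就在这里建造了聚落。 (English: "As early as one million years ago, the ancient Lantian people built settlements here.")
Text(CN&EN): 如果没有意见, the agenda is adopted. (English: "If there are no objections, the agenda is adopted.")
Text(CN&EN): The test records 全部的过程 (English: "The test records the entire process.")
Text(EN): Author of the danger trail, Philip Steels, etc.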
FastPitch-IPA(d1+d2+d3)
FastPitch-IPA with GRL(d1+d2+d3)
FastSpeech-IPA(d1+d2+d3)
FastSpeech-LDP(d1+d2+d3)
Tacotron-based(d1+d2+d3)

4. Controllability exploration for Mandarin-dominated mixed-lingual utterances

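Text: 字母表前14个字母是:A,B,C,D,E,F,G,H,I,J,K,L,M,N (English: "The first 14 letters of the alphabet are: A, B, C, D, E, F, G, H, I, J, K, L, M, N.")
Text: 字母表后12个字母是:O,P,Q,R,S,T,U,V,W,X,Y,Z (English: "The last 12 letters of the alphabet are: O, P, Q, R, S, T, U, V, W, X, Y, Z.")
Text: 快要开会了,你的PPT做好了吗? (English: "The meeting is about to start; is your PPT ready?")
Text: 这里WIFI密码是多少? (English: "What is the WIFI password here?")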
d1-CN-M Speaker
d1-EN-F Speaker
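Text: 我P了一张十连出了4个SSR、3个SP、2个SR和一个R卡式神的图。 (English: "I photoshopped a picture of a ten-pull that gave 4 SSR, 3 SP, 2 SR, and one R shikigami cards.")
Text: 你喜欢PVP还是PVE? (English: "Do you prefer PVP or PVE?")
Text: 这是一个AOE技能。 (English: "This is an AOE skill.")
Text: 你打ADC还是打辅助? (English: "Do you play ADC or support?")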
d1-CN-M Speaker
d1-EN-F Speaker