HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

Abstract

High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz, compared with 16kHz or 24kHz for speaking voices) with a large frequency range to convey rich expression and emotion. However, a higher sampling rate results in a wider frequency band and a longer waveform sequence with more fine-grained details, which presents challenges for singing voice synthesis (SVS) in both the frequency and time domains. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice using a 48kHz sampling rate. HiFiSinger consists of a FastSpeech-based neural acoustic model and a Parallel WaveGAN-based neural vocoder to ensure fast training and inference as well as high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder. Specifically: 1) To handle the larger range of frequencies caused by the higher sampling rate (e.g., 48kHz vs. 24kHz), we introduce a novel sub-frequency GAN (SF-GAN) on mel-spectrogram generation, which splits the full 80-dimensional mel-spectrogram into multiple sub-bands (e.g., low, middle, and high frequency bands) and models each sub-band with a separate discriminator. 2) To model the longer waveform sequences caused by the higher sampling rate, we introduce a multi-length GAN (ML-GAN) for waveform generation, which models waveform sequences of different lengths with separate discriminators. 3) We also introduce several additional designs in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window and hop size for the mel-spectrogram, and increasing the receptive field of the vocoder for long vowel modeling in singing voices.
Experimental results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: a 0.32/0.44 MOS gain over the 48kHz/24kHz baselines and a 0.83 MOS gain over previous SVS systems.

1. Audio Samples
1.1 Audio Quality
2. Ablation Studies
2.1 SF-GAN
2.2 ML-GAN
2.3 Pitch and Voiced/Unvoiced Input
2.4 Window/Hop Size
2.5 Pitch Control
2.6 Duration Control

Audio Samples

A 48kHz sampling rate is used unless otherwise stated.

Audio Quality

Recording: the original singing recordings.
Recording (24kHz): the original singing recordings downsampled to 24kHz.
XiaoiceSing (48kHz): a previous SVS system that also adopts a high sampling rate of 48kHz but uses the WORLD vocoder.
Baseline (24kHz): a baseline SVS system that uses the basic model backbone of HiFiSinger without any of our improvements and uses a 24kHz sampling rate.
Baseline (48kHz): the same baseline system as Baseline (24kHz), but with a 48kHz sampling rate for training and inference.
HiFiSinger (24kHz): our proposed HiFiSinger system, but with a 24kHz sampling rate.
HiFiSinger (48kHz): our final HiFiSinger system with a 48kHz sampling rate.

这么说来很不单纯，你陪我看海 (zhè me shuō lái hěn bú dān chún, nǐ péi wǒ kàn hǎi)

Recording (48kHz)	Baseline (24kHz up)	XiaoiceSing (48kHz)	HiFiSinger (48kHz)

Recording (24kHz)	Baseline (24kHz)	Baseline (48kHz)	HiFiSinger (24kHz)

宁静的夏天，天空中繁星点点 (níng jìng de xià tiān, tiān kōng zhōng fán xīng diǎn diǎn)

Recording (48kHz)	Baseline (24kHz up)	XiaoiceSing (48kHz)	HiFiSinger (48kHz)

Recording (24kHz)	Baseline (24kHz)	Baseline (48kHz)	HiFiSinger (24kHz)

坏的是我发现不知不觉不见到你不是很习惯 (huài de shì wǒ fā xiàn bú zhī bú jué bú jiàn dào nǐ bú shì hěn xí guàn)

Recording (48kHz)	Baseline (24kHz up)	XiaoiceSing (48kHz)	HiFiSinger (48kHz)

Recording (24kHz)	Baseline (24kHz)	Baseline (48kHz)	HiFiSinger (24kHz)

Ablation Studies

SF-GAN

这么说来很不单纯，你陪我看海 (zhè me shuō lái hěn bú dān chún, nǐ péi wǒ kàn hǎi)

HiFiSinger	HiFiSinger with 0 SF-GAN	HiFiSinger with 1 SF-GAN	HiFiSinger with 5 SF-GAN

遇见一个人然后生命全改变，原来不是恋爱才有的情节 (yù jiàn yī gè rén rán hòu shēng mìng quán gǎi biàn, yuán lái bú shì liàn ài cái yǒu de qíng jié)

HiFiSinger	HiFiSinger with 0 SF-GAN	HiFiSinger with 1 SF-GAN	HiFiSinger with 5 SF-GAN

我的小鬼小鬼，逗逗你的眉眼，让你喜欢这世界 (wǒ de xiǎo guǐ xiǎo guǐ, dòu dòu nǐ de méi yǎn, ràng nǐ xǐ huān zhè shì jiè)

HiFiSinger	HiFiSinger with 0 SF-GAN	HiFiSinger with 1 SF-GAN	HiFiSinger with 5 SF-GAN
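
To make the sub-band idea behind these samples concrete, here is a minimal sketch of the splitting step only. The abstract says SF-GAN splits the 80-dimensional mel-spectrogram into sub-bands and gives each its own discriminator; the function name `split_mel_subbands` and the equal-width, non-overlapping band boundaries are assumptions made for illustration, not the configuration reported by the authors.

```python
def split_mel_subbands(mel_frames, num_bands):
    """Split each mel frame into `num_bands` contiguous frequency sub-bands.

    mel_frames: list of frames, each a list of mel-bin values (e.g. 80 bins).
    Returns `num_bands` sub-spectrograms, ordered from low to high frequency.
    Equal-width, non-overlapping bands are an assumption of this sketch.
    """
    if not mel_frames:
        raise ValueError("mel_frames must contain at least one frame")
    num_bins = len(mel_frames[0])
    if not 1 <= num_bands <= num_bins:
        raise ValueError("num_bands must be between 1 and the number of bins")
    # Band edges spread as evenly as possible over the mel bins.
    edges = [round(i * num_bins / num_bands) for i in range(num_bands + 1)]
    return [
        [frame[lo:hi] for frame in mel_frames]
        for lo, hi in zip(edges[:-1], edges[1:])
    ]


# Toy input: 4 frames of an 80-bin mel-spectrogram (values are placeholders).
mel = [[float(b) for b in range(80)] for _ in range(4)]
bands = split_mel_subbands(mel, 3)  # low, middle, high
band_widths = [len(band[0]) for band in bands]
print(band_widths)
```

In a full system, each sub-spectrogram would be passed to its own discriminator network; that part is omitted here.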

ML-GAN

谁在最需要的时候轻轻拍着我肩膀 (shuí zài zuì xū yào de shí hòu qīng qīng pāi zhe wǒ jiān bǎng)

HiFiSinger	HiFiSinger without ML-GAN

见证你成长让我感到充满力量 (jiàn zhèng nǐ chéng zhǎng ràng wǒ gǎn dào chōng mǎn lì liàng)

HiFiSinger	HiFiSinger without ML-GAN

你知道它的花语签上名，我继续一个人远行 (nǐ zhī dào tā de huā yǔ qiān shàng míng, wǒ jì xù yī gè rén yuǎn xíng)

HiFiSinger	HiFiSinger without ML-GAN
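
As a rough illustration of the multi-length idea, the sketch below takes one random crop per target length from a waveform, so that each crop could be scored by a separate discriminator. The function name `crop_multi_length` and the crop lengths (0.25 s to 1 s) are assumptions chosen for illustration; this page does not state the lengths used in HiFiSinger.

```python
import random


def crop_multi_length(waveform, lengths, rng=None):
    """Return one random contiguous crop of `waveform` per requested length."""
    rng = rng or random.Random()
    crops = []
    for length in lengths:
        if length > len(waveform):
            raise ValueError("crop length exceeds waveform length")
        # Pick a start index so the crop fits entirely inside the waveform.
        start = rng.randrange(len(waveform) - length + 1)
        crops.append(waveform[start:start + length])
    return crops


SAMPLE_RATE_HZ = 48000
waveform = [0.0] * SAMPLE_RATE_HZ  # one second of silence as a stand-in signal
# Crop lengths in seconds are illustrative assumptions, not reported settings.
lengths = [int(SAMPLE_RATE_HZ * seconds) for seconds in (0.25, 0.5, 0.75, 1.0)]
crops = crop_multi_length(waveform, lengths, rng=random.Random(0))
print([len(crop) for crop in crops])
```

Passing a seeded `random.Random` makes the crops reproducible, which is convenient when comparing discriminators across runs.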

Pitch and Voiced/Unvoiced Input

这么说来很不单纯，你陪我看海 (zhè me shuō lái hěn bú dān chún, nǐ péi wǒ kàn hǎi)

HiFiSinger	HiFiSinger without F0 and V/UV input

遇见一个人然后生命全改变，原来不是恋爱才有的情节 (yù jiàn yī gè rén rán hòu shēng mìng quán gǎi biàn, yuán lái bú shì liàn ài cái yǒu de qíng jié)

HiFiSinger	HiFiSinger without F0 and V/UV input

我的小鬼小鬼，逗逗你的眉眼，让你喜欢这世界 (wǒ de xiǎo guǐ xiǎo guǐ, dòu dòu nǐ de méi yǎn, ràng nǐ xǐ huān zhè shì jiè)

HiFiSinger	HiFiSinger without F0 and V/UV input

Window/Hop Size

这么说来很不单纯，你陪我看海 (zhè me shuō lái hěn bú dān chún, nǐ péi wǒ kàn hǎi)

HiFiSinger with 20ms/5ms window/hop size	HiFiSinger with 12ms/3ms window/hop size	HiFiSinger with 50ms/12.5ms window/hop size

遇见一个人然后生命全改变，原来不是恋爱才有的情节 (yù jiàn yī gè rén rán hòu shēng mìng quán gǎi biàn, yuán lái bú shì liàn ài cái yǒu de qíng jié)

HiFiSinger with 20ms/5ms window/hop size	HiFiSinger with 12ms/3ms window/hop size	HiFiSinger with 50ms/12.5ms window/hop size

我的小鬼小鬼，逗逗你的眉眼，让你喜欢这世界 (wǒ de xiǎo guǐ xiǎo guǐ, dòu dòu nǐ de méi yǎn, ràng nǐ xǐ huān zhè shì jiè)

HiFiSinger with 20ms/5ms window/hop size	HiFiSinger with 12ms/3ms window/hop size	HiFiSinger with 50ms/12.5ms window/hop size
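
For reference, the three window/hop settings compared above can be converted from milliseconds to samples at the 48kHz sampling rate. The sketch below is plain arithmetic; the helper names `ms_to_samples` and `frames_per_second` are introduced here for illustration only.

```python
def ms_to_samples(duration_ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(duration_ms * sample_rate_hz / 1000)


def frames_per_second(hop_ms):
    """Number of mel-spectrogram frames produced per second of audio."""
    return 1000 / hop_ms


SAMPLE_RATE_HZ = 48000
# (window, hop) pairs in milliseconds, taken from the comparison above.
settings_ms = [(12, 3), (20, 5), (50, 12.5)]

sizes_48k = []
for window_ms, hop_ms in settings_ms:
    window = ms_to_samples(window_ms, SAMPLE_RATE_HZ)
    hop = ms_to_samples(hop_ms, SAMPLE_RATE_HZ)
    sizes_48k.append((window, hop))
    print(f"{window_ms}ms/{hop_ms}ms: window={window} samples, "
          f"hop={hop} samples, {frames_per_second(hop_ms):.1f} frames/s")
```

A shorter hop yields more frames per second for the acoustic model to predict, while a longer window gives finer frequency resolution at the cost of time resolution; the samples above let you hear that trade-off.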

Pitch Control

左心房，暖暖的好饱满 (zuǒ xīn fáng, nuǎn nuǎn de hǎo bǎo mǎn)

shifting -4 semitones	shifting +0 semitones	shifting +4 semitones

我想说其实你很好，你自己却不知道 (wǒ xiǎng shuō qí shí nǐ hěn hǎo, nǐ zì jǐ què bú zhī dào)

shifting -4 semitones	shifting +0 semitones	shifting +4 semitones

在朋友里面就数你最特别，总让我觉得很亲很铁 (zài péng yǒu lǐ miàn jiù shù nǐ zuì tè bié, zǒng ràng wǒ jué dé hěn qīn hěn tiě)

shifting -4 semitones	shifting +0 semitones	shifting +4 semitones
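
This page does not describe how the pitch shift is applied inside HiFiSinger. As a hedged illustration of what a shift of n semitones means in frequency terms, the sketch below scales an F0 contour by 2^(n/12), the standard equal-temperament ratio. The function name `shift_f0` and the convention that unvoiced frames are stored as 0 Hz are assumptions of this sketch.

```python
def shift_f0(f0_hz, semitones):
    """Shift an F0 contour by a number of semitones.

    Voiced frames are scaled by 2 ** (semitones / 12). Frames with F0 <= 0
    are treated as unvoiced (an assumed convention) and returned as 0.0.
    """
    ratio = 2.0 ** (semitones / 12.0)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


contour = [220.0, 0.0, 440.0]  # toy contour: voiced, unvoiced, voiced
shifted_down = shift_f0(contour, -4)
shifted_up = shift_f0(contour, 4)
print([round(f, 2) for f in shifted_down])
print([round(f, 2) for f in shifted_up])
```

If pitch is represented as note numbers in the score rather than as frequencies, the same shift is simply an addition of n to each note number.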

Duration Control

因为我，完全信任你 (yīn wéi wǒ, wán quán xìn rèn nǐ)

0.75x Speed	1.00x Speed	1.25x Speed

我想说其实你很好，你自己却不知道 (wǒ xiǎng shuō qí shí nǐ hěn hǎo, nǐ zì jǐ què bú zhī dào)

0.75x Speed	1.00x Speed	1.25x Speed

杜鹃啼血声，芙蓉花蜀国尽缤纷 (dù juān tí xuè shēng, fú róng huā shǔ guó jìn bīn fēn)

0.75x Speed	1.00x Speed	1.25x Speed
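
The page does not say how the speed factor is applied. One plausible reading, sketched below under that assumption, is that per-phoneme durations (in frames) are divided by the speed factor before the acoustic model expands them, so 0.75x speed lengthens each phoneme and 1.25x shortens it. The function name `scale_durations` and the one-frame minimum are illustrative choices.

```python
def scale_durations(frame_counts, speed):
    """Rescale per-phoneme durations (in frames) for a given speed factor.

    A speed below 1.0 lengthens durations; above 1.0 shortens them.
    Each phoneme keeps at least one frame (an assumption of this sketch).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(n / speed)) for n in frame_counts]


durations = [30, 60, 90]  # toy per-phoneme durations in frames
slow = scale_durations(durations, 0.75)
normal = scale_durations(durations, 1.0)
fast = scale_durations(durations, 1.25)
print(slow, normal, fast)
```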
