Controllable Singing Voice Synthesis
using Phoneme-Level Energy Sequence


Abstract

Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control—temporal loudness variation essential for musical expressiveness—and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.

Controllable Singing Voice Synthesis, Dynamics Control, Energy Sequence

1 Introduction↩︎

Singing Voice Synthesis (SVS) is the task of generating singing voice waveforms from symbolic musical representations such as scores. While recent SVS systems have achieved substantial improvements in synthesis quality, most existing approaches generate expressive singing voices in a probabilistic manner conditioned only on the provided score. These systems therefore give users no means to precisely control the output or convey their intent and musical expression. Consequently, there has been increasing interest in developing controllable SVS frameworks that can explicitly reflect user-specified expressive attributes. For example, SinTechSVS[1] demonstrated the synthesis of singing voices with eight distinct singing techniques. More recently, Prompt-Singer[2] proposed a system for controlling singing audio via natural language prompts, and TechSinger[3] further extended controllability by enabling user manipulation of various singing techniques through natural language instructions.

In this work, we focus on the controllable modeling of dynamics, a key expressive attribute in SVS. In the context of singing, as illustrated in Fig. 1, dynamics refers to the temporal variation of loudness, which is essential for conveying musical expressiveness[4]. The significance of dynamics has been widely acknowledged in the fields of music generation[5] and audio synthesis[6]. To the best of our knowledge, however, most existing SVS approaches[7]–[12] have not explicitly incorporated dynamics as a controllable condition.

a

b

Figure 1: The energy plot in (a) represents the temporal variation of energy across time, calculated as the sum of mel-spectrogram amplitudes for each frame. When the energy is low, the corresponding regions in the mel-spectrogram (b) exhibit lower amplitude across frequency bins and appear darker. This indicates reduced loudness or silence in the audio signal during those time intervals.

According to [13], dynamics is typically defined as a measure of the energy of an acoustic signal, computed based on the amplitude of the waveform. Since this energy can be extracted from the ground-truth waveform or spectrogram in most singing voice datasets, the need for manual annotation is significantly reduced.

In this work, we present experimental results demonstrating that conditioning the model on an energy sequence enables effective control over the dynamics of the synthesized singing voice. Furthermore, we propose a method for calculating a phoneme-level energy sequence to facilitate user-friendly control, and validate its effectiveness for dynamic manipulation through experimental results.

The main contributions of this paper are summarized as follows:

  • We show that dynamic control can be effectively achieved in a controllable SVS system by simply summing the energy sequence embedding with other input embeddings.

  • We introduce a phoneme-level energy sequence representation to enable intuitive and user-friendly control.

  • We empirically analyze the causal relationship between the input energy sequence and the dynamics of the synthesized singing voice.

Audio examples are available at , including the baseline system, ground-truth waveforms reconstructed with a vocoder, and our proposed SVS systems conditioned on energy at the frame and phoneme levels.

2 Related Works↩︎

2.1 Singing Voice Synthesis↩︎

SVS has seen significant improvements in quality enabled by advances in deep learning methodologies and large-scale singing datasets. Among the most influential studies in this field are DiffSinger[14] and VISinger[15].

DiffSinger employs a two-stage architecture featuring an encoder-decoder framework paired with a vocoder. The encoder generates a musical score embedding from the input lyric, note, and duration sequences, while a diffusion-based decoder synthesizes mel-spectrograms conditioned on this embedding. A pre-trained vocoder converts the generated mel-spectrogram into a waveform but is excluded from evaluation to focus solely on mel-spectrogram quality assessment.

VISinger, in contrast, is an end-to-end SVS system built upon Variational Inference with adversarial learning for end-to-end text-to-speech (VITS)[16]. In this architecture, the input sequences are encoded into a latent representation, from which the decoder directly generates the waveform. Additionally, an adversarial discriminator is applied to the latent space to enhance output quality. Building upon these two foundational studies as baselines, diffusion-based[7]–[10] and VITS-based SVS models[11], [12] have continued to evolve, further advancing the quality of synthesized singing voices.

Inspired by DiffSinger, our work adopts a diffusion-based framework for SVS, utilizing a Denoising Diffusion Probabilistic Model (DDPM) [17] as the mel-spectrogram decoder.

2.2 Controllable Singing Voice Synthesis↩︎

The audio quality of SVS has advanced significantly, and recent research increasingly emphasizes not only sound quality but also expressiveness and user-centric control. Prompt-Singer[2] pioneered the first controllable SVS model, enabling users to adjust attributes such as singer gender, vocal range, and volume through natural language prompts. However, the range of controllable attributes remains limited, and challenges persist in controlling singing style.

SinTechSVS[1] and TechSinger[3] address these challenges by developing systems that manipulate singing style through vocal techniques. SinTechSVS introduced four timbre-related and four pitch-related technique sequences, successfully synthesizing singing voices with expert-annotated technique data. TechSinger further advanced this approach by enabling technique control via natural language input, utilizing a dataset annotated with singing techniques. As shown by these studies, controlling singing style through vocal techniques represents a highly promising research direction.

While prior work has relied on annotated datasets, we propose a method for controlling singing style using the energy sequence, an expressive attribute that does not require manual annotation.

2.3 Energy Predictor↩︎

The energy predictor is a module that implicitly incorporates energy information to achieve more natural speech synthesis. During training, this module is optimized by minimizing the Mean Squared Error (MSE) between the predicted energy and the ground-truth energy, thereby encouraging the model to reflect energy in the generated output. The first study to introduce an energy predictor was FastSpeech 2[18], and since then, speech synthesis and SVS studies[19]–[21] have adopted this approach to achieve more natural synthesis.
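For reference, a minimal sketch of such a module is shown below, assuming a FastSpeech 2-style convolutional variance predictor; the hidden size, kernel size, and dropout rate are illustrative assumptions rather than the exact configurations of the cited systems.

```python
import torch
import torch.nn as nn

class EnergyPredictor(nn.Module):
    """Sketch of a FastSpeech 2-style energy (variance) predictor.

    Input:  h  [B, T, H]  frame-level hidden states
    Output: e  [B, T]     predicted energy per frame
    """

    def __init__(self, hidden=256, kernel=3, dropout=0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, h):                                   # h: [B, T, H]
        x = h.transpose(1, 2)                               # [B, H, T] for Conv1d
        x = self.drop(torch.relu(self.conv1(x)))
        x = self.drop(torch.relu(self.conv2(x)))
        return self.proj(x.transpose(1, 2)).squeeze(-1)     # [B, T]

# Training signal: MSE between predicted and ground-truth energy (implicit modeling).
# loss_energy = nn.functional.mse_loss(predictor(hidden_states), energy_gt)
```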

However, implicitly modeling musical dynamics with an energy predictor limits both the diversity and controllability of dynamics. Therefore, we propose a system that removes the energy predictor and instead uses the energy sequence as an explicit input.

3 Method↩︎

3.1 Model Architecture↩︎

a

Figure 2: Example of input sequences used in the SVS system. The first row shows the phoneme-level musical note sequence, the second row represents the corresponding phoneme-level lyric sequence, and the third row indicates the phoneme-level duration sequence. The duration sequence is annotated in seconds, and <AP> denotes a pause (rest) in the lyrics.

Figure 3: Model architecture overview: (a) the overall architecture, illustrating how the input sequences of lyrics, notes, and phoneme-level energy are processed through the length regulator and FFT block, followed by the Denoising Diffusion Probabilistic Model (DDPM) mel-spectrogram decoder and vocoder; (b) the FFT block, which sums the embeddings of all input sequences and applies a 2-layer 1D convolutional network to capture adjacent-frame features; (c) the length regulator, which expands phoneme-level sequences to frame-level sequences by repeating tokens according to duration annotations. Here, B denotes batch size, L the input sequence length, T the expanded frame-level sequence length, and H the hidden size.
  • Input Sequences

In the SVS task, the input sequences consist of lyrics, musical notes, and duration. As shown in Fig. 2, all three sequences are annotated at the phoneme-level, and the length of each sequence is \(L\). In this study, we additionally use a phoneme-level energy sequence of length \(L\) as an input. The methodology is described in Subsection B.

  • Length Regulator

In speech synthesis, a duration predictor[18] is used to estimate phoneme durations and thereby determine the natural length of the generated speech. Unlike speech synthesis, SVS datasets provide explicit duration annotations in seconds, enabling precise control over the target mel-spectrogram length. Given subtle individual variations in singing tempo, phoneme-level duration modeling better captures expressive nuances.

In this study, as shown in Fig. 3(c), the duration annotations in seconds are converted to frame lengths, and each phoneme token is repeated according to its frame length via the length regulator[18], producing a sequence of length \(T\) for the target mel-spectrogram. The sequence is then converted into embedding vectors through the embedding layer.
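A minimal sketch of this expansion is given below; the 48 kHz sampling rate and hop size of 256 follow the setup in Section 4, while the rounding scheme and the example durations are assumptions for illustration.

```python
import torch

def seconds_to_frames(durations_sec, sample_rate=48000, hop_size=256):
    """Convert phoneme durations in seconds to mel-frame counts."""
    return torch.round(torch.tensor(durations_sec) * sample_rate / hop_size).long()

def length_regulate(phoneme_tokens, frame_lens):
    """Repeat each phoneme-level token by its frame count: length L -> length T."""
    return torch.repeat_interleave(phoneme_tokens, frame_lens, dim=0)

# Example: three phonemes with durations 0.10 s, 0.25 s, and 0.05 s.
tokens = torch.tensor([12, 7, 3])                  # phoneme indices, L = 3
frames = seconds_to_frames([0.10, 0.25, 0.05])     # tensor([19, 47, 9])
expanded = length_regulate(tokens, frames)         # length T = frames.sum() = 75
```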

  • FFT Block

As shown in Fig. 3(b), the Feed-Forward Transformer (FFT) block [18] constructs the final hidden representation \(H_c\), which is used as the decoder input. A 2-layer 1D convolutional network is used to capture the features between adjacent frames.

  • Mel-Spectrogram Decoder

The mel-spectrogram decoder generates the mel-spectrogram conditioned on \(H_c\). In this study, we employ a DDPM.
The model architecture can be expressed by the following equations: \[\mathbf{h}^m = \mathbf{h}^l + \mathbf{h}^n + \mathbf{h}^e, \label{eq:arch1}\tag{1}\] \[H_c = \text{FFT}(\mathbf{h}^m), \label{eq:arch2}\tag{2}\] \[D_{out} = \text{DDPM}(H_c), \label{eq:arch3}\tag{3}\] \[\hat{Y}= \text{Vocoder}(D_{out}), \label{eq:arch4}\tag{4}\]

where \(\mathbf{h}^l\), \(\mathbf{h}^n\), and \(\mathbf{h}^e\) are the embedding vectors of the lyrics, note, and energy sequences, respectively, after passing through each embedding layer. By summing all the embedding sequences in this manner, \(\mathbf{h}^m\) serves as a musical score embedding that comprehensively reflects the integrated input sequences. A pre-trained vocoder is used and excluded from the training stage. We fine-tuned HiFi-GAN[22] and used it for inference and evaluation.
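As a rough sketch of Eqs. (1) and (2), the module below sums the lyric, note, and energy embeddings into a musical score embedding and applies the 2-layer 1D convolution of the FFT block. The hidden size of 256 follows Section 4, whereas the vocabulary sizes, the quantization of energy into integer bins, the kernel size, and the omission of self-attention and positional encoding are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ScoreEncoder(nn.Module):
    """Sketch of Eqs. (1)-(2): h_m = h_l + h_n + h_e, then H_c = FFT(h_m)."""

    def __init__(self, n_lyric=64, n_note=128, n_energy=256, hidden=256):
        super().__init__()
        self.emb_lyric = nn.Embedding(n_lyric, hidden)
        self.emb_note = nn.Embedding(n_note, hidden)
        self.emb_energy = nn.Embedding(n_energy, hidden)    # quantized energy bins (assumed)
        self.conv = nn.Sequential(                          # 2-layer 1D conv over frames
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
        )

    def forward(self, lyric, note, energy):                 # each: [B, T] integer indices
        h_m = self.emb_lyric(lyric) + self.emb_note(note) + self.emb_energy(energy)  # Eq. (1)
        h_c = self.conv(h_m.transpose(1, 2)).transpose(1, 2)                          # Eq. (2)
        return h_c                                          # [B, T, H], decoder condition H_c
```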

3.2 Input Energy Sequence↩︎

In this section, we describe the phoneme-level energy sequence used as an input.

3.2.1 Frame-Level Energy Sequence↩︎

To improve stability, we define the energy as the root mean square of the linear mel-spectrogram amplitudes across all mel channels rather than their direct sum. We refer to this quantity simply as energy throughout the paper.

Energy can be extracted from the ground-truth mel-spectrogram without the need for human annotation. The length of the extracted energy sequence is equal to the number of frames \(T\) in the mel-spectrogram, and we refer to this as frame-level energy:

\[\mathbf{E}[t] = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} \left( \exp(S_{\text{mel}}[t, n]) \right)^2 }. \label{eq:energy}\tag{5}\]

The log-mel spectrogram amplitude \(S_{\text{mel}}[t, n]\) is derived from the magnitude-domain Short-Time Fourier Transform (STFT) as follows: \[S_{\text{mel}}[t, n] = \log \left( \sum_{k=0}^{K-1} M[n, k] \cdot |S_{\text{stft}}[t, k]| \right),\] where \(S_{\text{stft}}[t, k]\) denotes the complex STFT coefficient at frame \(t\) and frequency bin \(k\), \(M[n, k]\) is the mel filterbank matrix, \(|\cdot|\) denotes the magnitude operator, \(t\) is the frame index, \(n\) is the mel frequency bin index, \(N\) is the total number of mel frequency bins, and \(K\) is the number of STFT frequency bins.
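A possible implementation of Eq. (5) is sketched below, using the STFT and mel settings reported in Section 4 (48 kHz, FFT/window size 1024, hop size 256, 80 mel bins); the small floor inside the logarithm is an assumption added for numerical stability.

```python
import numpy as np
import librosa

def frame_level_energy(wav, sr=48000, n_fft=1024, hop=256, n_mels=80, eps=1e-5):
    """Frame-level energy E[t]: RMS of the linear mel amplitudes per frame (Eq. 5)."""
    stft_mag = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop, win_length=n_fft))
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # M[n, k]
    s_mel = np.log(np.maximum(mel_fb @ stft_mag, eps))                # log-mel, shape [N, T]
    return np.sqrt(np.mean(np.exp(s_mel) ** 2, axis=0))               # E[t], shape [T]
```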

3.2.2 Phoneme-Level Energy Sequence↩︎

We experimentally confirmed that when frame-level energy is added as an input embedding, the energy of the generated mel-spectrogram almost exactly follows the frame-level energy. However, using a frame-level energy sequence as input requires the user to specify and precisely align a very long sequence (on average, more than 1000 values for a waveform shorter than 10 seconds) with each token during inference, which is inconvenient for user control. Therefore, we chose to use the phoneme-level energy sequence as the input.

The phoneme-level energy sequence is defined as \[\mathbf{E} = [e_1, e_2, ..., e_L] \label{eq:phoneme95energy95seq},\tag{6}\] where \(L\) is the sequence length, corresponding to the number of phonemes (identical to the lengths of the lyric and note sequences).

The mean energy of the \(i\)-th phoneme, \(e_i\), is computed as \[e_i = \frac{1}{T_i} \sum_{t = t_{\text{start}_i}}^{t_{\text{end}_i}} \mathbf{E}[t], \label{eq:phoneme95energy}\tag{7}\] where \(T_i\) is the number of frames aligned to the \(i\)-th phoneme, \(t_{\text{start}_i}\) and \(t_{\text{end}_i}\) denote the start and end frame indices of the \(i\)-th phoneme, respectively, and \(\mathbf{E}[t]\) is the frame-level energy at frame \(t\).
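The averaging in Eq. (7) can be sketched as follows, given the per-phoneme frame counts from the duration alignment; handling of zero-length phonemes is omitted.

```python
import numpy as np

def phoneme_level_energy(frame_energy, frame_lens):
    """Mean frame-level energy over each phoneme's frame span (Eq. 7).

    frame_energy: [T] frame-level energy E[t]
    frame_lens:   [L] number of frames aligned to each phoneme
    returns:      [L] mean energy per phoneme
    """
    bounds = np.concatenate([[0], np.cumsum(frame_lens)])
    return np.array([frame_energy[bounds[i]:bounds[i + 1]].mean()
                     for i in range(len(frame_lens))])

# Example, reusing the frame counts from the length-regulator sketch:
# e = phoneme_level_energy(frame_level_energy(wav), [19, 47, 9])
```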

3.3 Objective Function↩︎

We adopt the simplified objective function of the DDPM as our mel-spectrogram reconstruction loss. This objective minimizes the L1 loss between the noise predicted by the model at each timestep and the actual noise added to the data:

\[\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[ \left| \epsilon - \epsilon_\theta(\mathbf{x}_t, t, H_c) \right| \right], \label{eq:simple}\tag{8}\]

where \(\mathbf{x}_0\) is the original data sample (ground-truth mel-spectrogram), \(t\) is the diffusion timestep sampled uniformly from \(\{1, \dots, T\}\), \(\epsilon\) is a noise vector sampled from a standard normal distribution \(\mathcal{N}(0, I)\), \(\mathbf{x}_t\) is the noisy version of \(\mathbf{x}_0\) at timestep \(t\), \(\epsilon_\theta(\cdot)\) is the noise predicted by the neural network parameterized by \(\theta\), and \(H_c\) is the conditioning input.

The noisy data \(\mathbf{x}_t\) is constructed as follows: \[\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon,\] where \(\bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s)\) is the cumulative product of the noise schedule, and \(\beta_t\) is the variance schedule at timestep \(t\) (we use linear scheduling).

The final objective function, including noise scheduling, is given by: \[\mathcal{L} = \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[ \left| \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, t, H_c\right) \right| \right]. \label{eq:noise}\tag{9}\]
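For concreteness, a single training step implied by Eqs. (8) and (9) can be sketched as follows. The noise-prediction network is left abstract; the 100 diffusion steps, linear schedule, and maximum beta of 0.06 follow Section 4, while the minimum beta value is an assumption.

```python
import torch

def ddpm_l1_loss(eps_model, x0, h_c, n_steps=100, beta_min=1e-4, beta_max=0.06):
    """L1 loss between true and predicted noise for one minibatch (Eqs. 8-9).

    eps_model(x_t, t, h_c): noise-prediction network epsilon_theta (abstract here)
    x0: ground-truth mel-spectrogram batch, e.g. [B, n_mels, T_frames]
    """
    betas = torch.linspace(beta_min, beta_max, n_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)                  # cumulative product

    b = x0.size(0)
    t = torch.randint(0, n_steps, (b,))                            # uniform timestep
    eps = torch.randn_like(x0)                                     # noise ~ N(0, I)
    a = alpha_bar[t].view(b, *([1] * (x0.dim() - 1)))              # broadcast over mel dims
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps           # forward diffusion

    return torch.mean(torch.abs(eps - eps_model(x_t, t, h_c)))     # simplified L1 objective
```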

Figure 4: Mel-spectrogram segments corresponding to the time region highlighted by the blue box in Fig. 5: (a) the ground-truth mel-spectrogram; (b) the mel-spectrogram generated by the baseline model; (c) the mel-spectrogram generated by the phoneme-level model; (d) the mel-spectrogram generated by the frame-level model. The yellow boxes indicate regions where (b) exhibits noticeably lower energy (darker areas) compared to (c) and (d). This visual difference demonstrates that incorporating the energy sequence as an input leads to increased energy in the synthesized output, confirming the effectiveness of explicit energy conditioning for dynamic control.

a

b

c

Figure 5: Energy plots comparing input energy sequences (blue) and generated mel-spectrogram energies (red) from: (a) the baseline model, (b) the phoneme-level model, and (c) the frame-level model. The red curves in (b) and (c) closely follow the blue reference, demonstrating that explicit energy conditioning enables precise dynamic control in singing voice synthesis.

4 EXPERIMENTAL SETUPS↩︎

We used the GTSinger[23] dataset for our experiments. GTSinger is a high-quality singing voice dataset recorded in nine languages and six singing styles. Since our model is not designed for multilingual SVS, we trained it using only the Chinese subset, which contains approximately 16 hours of recordings from two speakers. The dataset was split into 7,082 training samples, 78 validation samples, and 57 test samples. Audio was sampled at 48 kHz, and for mel-spectrogram extraction, the hop size was set to 256, with both the window size and FFT size set to 1024. The number of mel-spectrogram bins was set to 80.

The model’s sequence embedding dimension was set to 256, and relative positional encoding was applied before the FFT block. The embedding layer is a linear embedding layer for integer indices, which maps them into continuous vector representations. The mel-spectrogram decoder is a U-Net-based DDPM, configured with 100 forward steps, linear scheduling, and a maximum beta value of 0.06. All experiments were conducted on an RTX A5000 GPU, taking approximately 16 hours for 113,600 training steps. The vocoder was a pre-trained HiFi-GAN, employed during inference and excluded from the evaluation metrics.

For objective evaluation, we used mel-spectrogram-based metrics. F0 and energy were extracted from both the generated and ground-truth mel-spectrograms, and the Mean Absolute Error (MAE) was calculated for each. Objective evaluation metrics included F0 MAE, energy MAE, and Mel Cepstral Distortion (MCD). Subjective evaluation was conducted using the Mean Opinion Score (MOS).
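For reference, a minimal sketch of the energy MAE computation is given below, reusing the energy definition of Eq. (5) on time-aligned generated and ground-truth log-mel spectrograms; F0 extraction and MCD are omitted here since they rely on additional tooling.

```python
import numpy as np

def energy_mae(mel_gen, mel_gt):
    """MAE between frame-level energies (Eq. 5) of generated and ground-truth
    log-mel spectrograms, both shaped [n_mels, T]; the two sequences are assumed
    time-aligned, which holds here because durations are given."""
    e_gen = np.sqrt(np.mean(np.exp(mel_gen) ** 2, axis=0))
    e_gt = np.sqrt(np.mean(np.exp(mel_gt) ** 2, axis=0))
    return float(np.mean(np.abs(e_gen - e_gt)))
```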

5 EXPERIMENTAL RESULTS↩︎

The objective of our experiments is not to evaluate audio quality itself, but rather to assess attribute controllability and the effect of the energy sequence input on model performance. As a result, the generated mel-spectrograms may not match State-Of-The-Art (SOTA) models in terms of MCD; however, we confirmed that our method consistently outperforms the baseline in all experiments conducted in this study.

The baseline model generates mel-spectrograms using DDPM with only the lyric and note sequences as inputs. The frame-level model adds a frame-level energy sequence of length \(T\) as an additional input to the baseline, while the phoneme-level model adds a phoneme-level energy sequence of length \(L\) as an additional input to the baseline.

5.1 Main Result↩︎

Table 1: Objective evaluation results for baseline, phoneme-level, and frame-level models. Both phoneme-level and frame-level energy conditioning significantly improve controllability and synthesis quality over the baseline.
Approach/Method Energy MAE \(\downarrow\) F0 MAE \(\downarrow\) MCD \(\downarrow\)
Baseline 0.33 10.67 12.89
Phoneme-Level 0.14 9.73 12.07
Frame-Level 0.03 6.56 11.64

5.1.1 Energy MAE↩︎

Energy MAE quantifies the MAE between the energy (amplitude) sequences extracted from the generated and ground-truth mel-spectrograms. A lower value indicates superior fidelity in replicating energy patterns. As shown in Table 1, the baseline model has the highest energy MAE (0.33), while the phoneme-level model achieves a significant improvement (0.14). The frame-level model demonstrates the lowest error (0.03), establishing its superiority in precise energy control.

From the perspective of dynamic controllability, a higher energy MAE, indicating greater deviation from exact replication of the energy pattern, can also reflect the fact that professional singers exhibit inherently diverse and inconsistent dynamics. In this sense, the metric captures not only how closely the model follows the input energy sequence but also the diversity of its expressiveness.

The frame-level model’s energy curve, as visualized in Fig. 5(c), nearly perfectly mirrors the ground-truth energy profile. This visual evidence confirms its ability to not only replicate vocal dynamics with high precision but also reliably adhere to user-specified energy patterns.

Notably, while the phoneme-level model shows marginally higher errors than its frame-level counterpart, Fig. 5(b) reveals its robust capability to preserve the intended dynamic flow. This underscores the phoneme-level approach as a user-friendly control that maintains expressive quality without requiring fine-grained frame alignment.

Furthermore, the marginal improvements in F0 MAE and MCD suggest that incorporating the energy sequence as an additional input provides slight benefits to mel-spectrogram construction.

5.1.2 MOS↩︎

Table 2: The Mean Opinion Score (MOS) with 95% confidence intervals of test set. Both the phoneme-level and frame-level models achieved higher listening test scores than the baseline, demonstrating improved perceived audio quality.
MOS \(\uparrow\)
Ground-Truth + HiFi-GAN 4.02 \(\pm\) 0.2
Baseline + HiFi-GAN 3.43 \(\pm\) 0.17
Phoneme-Level + HiFi-GAN 3.78 \(\pm\) 0.19
Frame-Level + HiFi-GAN 3.57 \(\pm\) 0.18

The MOS results in Table 2 demonstrate that the ground-truth audio with HiFi-GAN vocoder achieves the highest score of 4.02 with a 95% confidence interval of ±0.2. The baseline model with HiFi-GAN scores 3.43 ± 0.17, while the phoneme-level and frame-level models achieve MOS scores of 3.78 ± 0.19 and 3.57 ± 0.18, respectively. These results indicate that the phoneme-level model demonstrates promising performance in audio quality.

5.2 Ablation Study↩︎

Table 3: Ablation study results comparing the baseline, baseline with energy predictor, and phoneme-level energy input models. Explicit energy input provides significantly better control over dynamics and improves synthesis quality compared to implicit energy prediction.
Energy MAE \(\downarrow\) F0 MAE \(\downarrow\) MCD \(\downarrow\)
Baseline 0.33 10.67 12.89
Baseline + Energy Predictor 0.30 9.43 12.97
Phoneme-Level 0.14 9.73 12.07

Table 3 shows that simply adding an energy predictor to the baseline model yields only marginal improvements in energy MAE (0.33 → 0.30). This suggests that the energy predictor, which models energy implicitly, has limited impact on controllability.

In contrast, the phoneme-level energy input model achieves a substantial reduction in energy MAE (0.14). These results clearly indicate that providing energy as an explicit input is significantly more effective for controlling the dynamics of the synthesized singing voice than relying on an implicit energy predictor.

6 Conclusion↩︎

In this work, we have demonstrated that explicit energy sequence conditioning enables effective and intuitive dynamic control in singing voice synthesis. By leveraging phoneme-level energy representations, our approach achieves significant improvements in controllability while maintaining synthesis quality. Although the baseline model employed in our experiments does not reflect SOTA performance, our results suggest that the proposed energy sequence input method can be readily integrated with more advanced SVS architectures to further enhance both controllability and expressive power. This paves the way for more natural, expressive, and user-controllable singing voice synthesis in future research.

Despite these results, several limitations remain. First, the MOS evaluation was conducted with only 10 participants, which may constrain the statistical reliability and generalizability of the subjective listening test results. The MOS analysis should therefore be read as evidence that conditioning on the energy sequence does not degrade the quality of the generated mel-spectrogram. As mentioned in Section 1, audio samples are available at . Second, our current framework primarily addresses dynamic control through energy modulation, leaving other expressive attributes, such as timbre, vibrato, and advanced singing techniques, relatively unexplored. Finally, while phoneme-level energy representations offer a user-friendly interface, they may not fully capture the fine-grained temporal nuances inherent in natural singing performances. Future work will focus on combining energy with additional expressive features to achieve richer and more nuanced singing voice synthesis.

Acknowledgment↩︎

This work was supported by the IITP (Institute of Information & Communications Technology Planning & Evaluation) - ITRC (Information Technology Research Center) grant funded by the Korea government (Ministry of Science and ICT) (IITP2025-RS-2024-00436857), the IITP grant funded by the Korea government (MSIT) (No. RS-2019-II190079, Artificial Intelligence Graduate School Program (Korea University)), and the IITP artificial intelligence star fellowship support program to nurture the best talents (IITP2025-RS-2025-02304828) grant funded by the Korea government (MSIT).

References↩︎

[1]
J. Zhao, L. Q. H. Chetwin and Y. Wang, "SinTechSVS: A Singing Technique Controllable Singing Voice Synthesis System," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2641-2653, 2024.
[2]
Y. Wang et al., “Prompt-Singer: Controllable singing-voice-synthesis with natural language prompt,” in Proc. 2024 Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., K. Duh, H. Gómez-Adorno, and S. Bethard, Eds., Mexico City, Mexico, 2024, pp. 4780–4794. [Online]. Available: https://doi.org/10.18653/v1/2024.naacl-long.268.
[3]
W. Guo et al., “TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 22, pp. 23978–23986, Apr. 2025.
[4]
J. Ryu, S. Rhyu, H.-G. Yoon, E. Kim, J. Y. Yang, and T. Kim, “MID-FiLD: MIDI Dataset for Fine-Level Dynamics,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 1, pp. 222–230, Mar. 2024.
[5]
S. -L. Wu, C. Donahue, S. Watanabe and N. J. Bryan, "Music ControlNet: Multiple Time-Varying Controls for Music Generation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2692-2703, 2024.
[6]
Y. Wu et al., “MIDI-DDSP: Detailed control of musical performance via hierarchical modeling,” in Proc. Int. Conf. Learn. Representations, Apr. 2022.
[7]
J. He et al., “RMSSinger: Realistic-Music-Score based Singing Voice Synthesis,” [Online]. Available: https://arxiv.org/abs/2305.10686.
[8]
D.-M. Byun, S.-H. Lee, J.-S. Hwang, and S.-W. Lee, “Midi-voice: Expressive zero-shot singing voice synthesis via midi-driven priors,” in Proc. 2024 IEEE Int. Conf. Acoust., Speech Signal Process., IEEE, 2024, pp. 12622–12626.
[9]
S. Kim, M. Jeong, H. Lee, M. Kim, B. J. Choi, and N. S. Kim, “MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance,” Interspeech 2024, pp. 1865–1869, Sep. 2024.
[10]
S. Dai, M.-Y. Liu, R. Valle, and S. Gururani, “ExpressiveSinger: Multilingual and multi-style score-based singing voice synthesis with expressive performance control,” in Proc. 32nd ACM Int. Conf. Multimedia, 2024, pp. 3229–3238.
[11]
J. Cui, Y. Gu, C. Weng, J. Zhang, L. Chen, and L. Dai, “SiFiSinger: A high-fidelity end-to-end singing voice synthesizer based on source-filter model,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11126–11130.
[12]
T. Kim, C. Cho, and Y. H. Lee, “Period Singer: Integrating Periodic and Aperiodic Variational Autoencoders for Natural-Sounding End-to-End Singing Voice Synthesis,” Interspeech 2024, pp. 1875–1879, Sep. 2024.
[13]
M. Umbert, J. Bonada, M. Goto, T. Nakano, and J. Sundberg, “Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges,” IEEE Signal Process. Mag., vol. 32, no. 6, pp. 55–73, Nov. 2015.
[14]
J. Liu, C. Li, Y. Ren, F. Chen, and Z. Zhao, “Diffsinger: Singing voice synthesis via shallow diffusion mechanism,” in Proceedings of the AAAI conference on artificial intelligence, 2022, vol. 36, pp. 11020–11028.
[15]
Y. Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, and M. Bi, “Visinger: Variational inference with adversarial learning for end-to-end singing voice synthesis,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7237–7241.
[16]
J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540.
[17]
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, eds.), vol. 33, pp. 6840–6851, Curran Associates, Inc., 2020.
[18]
Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. Int. Conf. Learn. Representations, 2021.
[19]
S. Zhou et al., “Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information,” in Proc. Interspeech 2022, 2022, pp. 4292–4296.
[20]
X. Zhuang, T. Jiang, S. -Y. Chou, B. Wu, P. Hu and S. Lui, "Litesing: Towards Fast, Lightweight and Expressive Singing Voice Synthesis," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 7078-7082.
[21]
Y. Song et al., “Singing voice synthesis with vibrato modeling and latent energy representation,” in Proc. IEEE 24th Int. Workshop Multimedia Signal Process., 2022, pp. 1–6.
[22]
J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[23]
Y. Zhang et al., “GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks,”Advances in Neural Information Processing Systems, vol. 37, pp. 1117–1140, Dec. 2024.