Abstract
Prosodic structure prediction is an indispensable step in text-to-speech systems, and its results directly influence the naturalness and intelligibility of synthesized speech. In this study, a prosodic structure prediction method based on a pretrained language representation model was proposed. On top of the pretrained language representation model, a separate output layer was set for each prosody level, with the character as the modeling unit. The model was then fine-tuned on prosody-labeled data. To predict the different prosodic levels of the input text simultaneously, a word segmentation task was additionally introduced, and multitask learning was used to model the relationship between the multilevel prosody and lexical words. The experimental results support the multi-output structure, show the effectiveness of using a pretrained language representation model, and verify that adding the word segmentation task further improves model performance. Compared with the baseline conditional random field model, the best result improved the F1 scores of prosodic word prediction and prosodic phrase prediction by 2.48% and 4.50%, respectively. Compared with the baseline bidirectional long short-term memory model, the improvements were larger: 6.2% and 5.4% for the F1 scores of prosodic word prediction and prosodic phrase prediction, respectively. Finally, the experiments show that the proposed method considerably reduces the demand for training data while maintaining excellent prediction performance.

© 2020, Editorial Board of Journal of Tianjin University (Science and Technology). All rights reserved.
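The multi-output, multitask structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random vectors stand in for the pretrained encoder's per-character outputs, and the task names, the hidden size, the binary boundary labels, and the equally weighted sum of losses are all assumptions made for illustration.

```python
import math
import random

# Hypothetical task names: the two prosody levels named in the abstract
# plus the auxiliary word segmentation task.
TASKS = ("prosodic_word", "prosodic_phrase", "word_segmentation")
HIDDEN = 8       # assumed size of the per-character encoder output
NUM_LABELS = 2   # assumed binary tag: boundary after this character or not


def make_head(rng):
    """One independent linear output layer: NUM_LABELS rows of HIDDEN weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
            for _ in range(NUM_LABELS)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_probs(head, char_vec):
    """Label distribution for one character from one task head."""
    scores = [sum(w * x for w, x in zip(row, char_vec)) for row in head]
    return softmax(scores)


def multitask_loss(heads, char_vecs, labels):
    """Sum over tasks of the mean per-character cross-entropy (equal weights assumed)."""
    total = 0.0
    for task in TASKS:
        nll = 0.0
        for vec, gold in zip(char_vecs, labels[task]):
            nll -= math.log(head_probs(heads[task], vec)[gold])
        total += nll / len(char_vecs)
    return total


rng = random.Random(0)
heads = {task: make_head(rng) for task in TASKS}

# Stand-in for the pretrained encoder: one random vector per character.
char_vecs = [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(5)]

# Made-up gold labels for a five-character input, one sequence per task.
labels = {
    "prosodic_word": [0, 1, 0, 0, 1],
    "prosodic_phrase": [0, 0, 0, 0, 1],
    "word_segmentation": [0, 1, 0, 1, 1],
}

loss = multitask_loss(heads, char_vecs, labels)
print(f"joint loss over {len(TASKS)} tasks: {loss:.4f}")
```

Because every head reads the same per-character representation, gradients from the word segmentation loss would also update the shared encoder during fine-tuning, which is one plausible way the auxiliary task could help the prosody heads.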
- Affiliation