This repository is an extended PyTorch implementation of Microsoft's FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. It was initially based on xcmyz's implementation, and its core code structure is derived from ming024's original FastSpeech2 implementation.
We introduce several modifications that enable training and inference with phonological features instead of phoneme IDs, to support cross-lingual and low-resource speech synthesis. Because phonemes from different languages are described in one shared feature space, the input is more linguistically informed and is intended to generalize better across languages. Using this version, we trained a German baseline TTS model and then applied transfer learning with a small amount of English data to train an English model.
Our method is inspired by the concept of using cross-lingual phonological information as described in the paper:
"Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis"
SSW11 Paper PDF
We also refer to the PHOIBLE database for phonological feature definitions and mappings.
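To illustrate the idea, here is a minimal sketch of how phones can be mapped to feature vectors instead of integer IDs. The feature subset, the `+1`/`-1`/`0` encoding, the table values, and the names `FEATURES`, `PHONE_FEATURES`, and `phones_to_features` are all assumptions made for this example; they are not taken from this repository's `text/` code, and the actual feature values should be looked up in PHOIBLE.

```python
# Hypothetical, heavily reduced feature set; PHOIBLE defines many more features.
FEATURES = ("syllabic", "consonantal", "sonorant", "continuant",
            "nasal", "voice", "strident")

# Illustrative values only: +1 = "+", -1 = "-", 0 = not applicable.
PHONE_FEATURES = {
    "a": (+1, -1, +1, +1, -1, +1, 0),
    "m": (-1, +1, +1, -1, +1, +1, 0),
    "s": (-1, +1, -1, +1, -1, -1, +1),
    "t": (-1, +1, -1, -1, -1, -1, -1),
    # A phone that a German-only phoneme-ID table would not contain.
    "θ": (-1, +1, -1, +1, -1, -1, -1),
}


def phones_to_features(phones):
    """Turn a phone sequence into a list of feature vectors (one per phone)."""
    try:
        return [list(PHONE_FEATURES[p]) for p in phones]
    except KeyError as exc:
        raise ValueError(f"no feature entry for phone {exc.args[0]!r}") from exc


# With phoneme IDs, "θ" would need a new, untrained embedding row.
# With features, it reuses dimensions already seen during German training
# and differs from "s" in a single feature in this toy table.
differing = [name
             for name, a, b in zip(FEATURES, PHONE_FEATURES["s"], PHONE_FEATURES["θ"])
             if a != b]
print(differing)
```

Each phone becomes a fixed-length vector, so a phone that is new to the target language still shares most of its input dimensions with phones seen in the source language.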
The overall training and synthesis pipeline still follows the structure of ming024's original FastSpeech2 implementation. However, we have made the following key modifications to support phonological feature-based modeling:
- `text/` folder: contains several modified files to support phonological feature data preparation.
- `transformer/models.py`: updated to allow model input as phonological feature vectors instead of phoneme IDs.
- `synthesis.py`: modified to support inference using phonological features as input.
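The change to the model input described in the list above can be sketched as follows. This is a rough illustration, not the code in `transformer/models.py`: the class name `FeatureInputLayer`, the use of a single linear projection, and the sizes (50 phoneme IDs, 7 features, hidden size 256) are assumptions chosen for the example.

```python
import torch
from torch import nn


class FeatureInputLayer(nn.Module):
    """Hypothetical input layer: projects feature vectors to the encoder size."""

    def __init__(self, n_features: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: float tensor of shape (batch, seq_len, n_features)
        return self.proj(feats)


# Phoneme-ID input: an embedding table indexed by integer IDs.
id_layer = nn.Embedding(num_embeddings=50, embedding_dim=256, padding_idx=0)
ids = torch.tensor([[3, 7, 1, 0]])           # (batch=1, seq_len=4)

# Feature input: one float vector per phone, no lookup table.
feat_layer = FeatureInputLayer(n_features=7, d_model=256)
feats = torch.zeros(1, 4, 7)                 # (batch=1, seq_len=4, n_features=7)

id_out = id_layer(ids)
feat_out = feat_layer(feats)
print(id_out.shape, feat_out.shape)
```

Both layers produce a tensor of shape `(batch, seq_len, d_model)`, so under these assumptions the rest of the encoder can stay unchanged while the input representation is swapped.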
- FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Y. Ren, et al.
- xcmyz's FastSpeech implementation
- TensorSpeech's FastSpeech 2 implementation
- rishikksh20's FastSpeech 2 implementation
- PHOIBLE: Phonological Segment Inventory Database
- Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis (SSW11)