Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification (2024)
We share a PyTorch implementation of our experiments here.
The figure below gives an overview of the proposed framework for pooling speaker embeddings from the multi-layer features of pre-trained models.
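For intuition, here is a minimal sketch of the general idea, assuming a frozen HuggingFace facebook/wav2vec2-base backbone: a learnable weight per hidden layer fuses the multi-layer features before temporal pooling. It is an illustration only, not the exact UniPool module from the paper; the class, dimension, and variable names below are assumptions.

```python
# Minimal sketch: learnable weighted pooling over all hidden layers of a frozen
# pre-trained backbone (illustration only; not the exact UniPool module).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class LayerWeightedPooling(nn.Module):
    def __init__(self, backbone_name="facebook/wav2vec2-base", emb_dim=192):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone_name)
        self.backbone.requires_grad_(False)                    # frozen pre-trained model
        n_layers = self.backbone.config.num_hidden_layers + 1  # transformer layers + CNN output
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.proj = nn.Linear(self.backbone.config.hidden_size, emb_dim)

    def forward(self, wav):                                    # wav: (batch, samples) at 16 kHz
        out = self.backbone(wav, output_hidden_states=True)
        hidden = torch.stack(out.hidden_states)                # (layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (weights * hidden).sum(dim=0)                  # layer-wise weighted sum
        return self.proj(fused.mean(dim=1))                    # temporal mean -> (batch, emb_dim)

embeddings = LayerWeightedPooling()(torch.randn(2, 16000))     # two 1-second utterances
```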
- We recommend visiting PyTorch's Previous Versions page (v1.12.0) for installation, including torchaudio==0.12.0.
Use the requirements.txt to install the rest of the Python dependencies.
The Ubuntu python3-soundfile and conda-forge ffmpeg packages are required for downloading and preprocessing the data; you can install them as below.
$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg
The datasets can be downloaded from here:
- VoxCeleb 1 & 2
We use clovaai/voxceleb_trainer, released under the MIT license, to download the VoxCeleb datasets.
Follow the data preparation script until you have converted the VoxCeleb 2 audio format from aac (.m4a) to .wav files.
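For reference, the conversion performed by that step can be sketched as below; the source path is an assumption, and the clovaai/voxceleb_trainer data preparation script remains the authoritative version.

```python
# Hedged sketch of the .m4a -> .wav conversion step (paths are assumptions).
import subprocess
from pathlib import Path

src_root = Path("./data/VoxCeleb/voxceleb2")            # assumed download location
for m4a in src_root.rglob("*.m4a"):
    wav = m4a.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(m4a), "-ac", "1", "-ar", "16000", str(wav)],
        check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
```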
The following scripts preprocess the audio data and build evaluation trials for each dataset.
- However, you can skip the "# set split" part, since we have uploaded the ready-made splits and trials used for our experiments in the file tree.
Please check the contents of data/VCTK-Corpus/, data/LibriSpeech/, and data/VoxCeleb/ [;speakers/, ;trials/] first.
# preprocessing
$ python ./src/preprocess/process-VCTK.py --read_path SRC_PATH
Remove speakers [p280, p315] due to known technical issues.
Drop samples no.000~no.024, where every speaker reads the same transcript for each number.
Resample audio sources to a common sample rate (48 kHz → 16 kHz); a sketch of these steps follows.
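A minimal sketch of the three steps above, assuming the original wav48/&lt;speaker&gt;/&lt;speaker&gt;_&lt;no&gt;.wav layout; the repository's process-VCTK.py is the authoritative implementation.

```python
# Sketch of the VCTK preprocessing steps (paths and layout are assumptions).
import torchaudio
from pathlib import Path

EXCLUDED_SPEAKERS = {"p280", "p315"}                     # known technical issues
read_path = Path("SRC_PATH/wav48")                       # assumed source layout
out_path = Path("./data/VCTK-Corpus/wav16")

for wav_file in sorted(read_path.rglob("*.wav")):
    speaker, utt_no = wav_file.stem.split("_")[:2]       # e.g. "p225_003" -> p225, 003
    if speaker in EXCLUDED_SPEAKERS or int(utt_no) <= 24:
        continue                                         # drop shared-transcript samples
    wav, sr = torchaudio.load(str(wav_file))
    wav = torchaudio.functional.resample(wav, sr, 16000) # 48 kHz -> 16 kHz
    target = out_path / speaker / wav_file.name
    target.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(target), wav, 16000)
```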
# set split
$ python ./src/preprocess/split-VCTK-0-speakers.py
$ python ./src/preprocess/split-VCTK-1-rawtrials.py
$ python ./src/preprocess/split-VCTK-2-balancedtrials.py
Split the total speaker pool into train, validation, and test speaker subsets.
Check whether speaker meta-info (Gender | Age | Accents | Region | Label) matches across the total combination of pairs.
Sample trials balanced over the label distribution and meta-info matches; a sketch follows below.
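The trial-sampling step can be sketched as below; the function and argument names are assumptions, and the repository's split-VCTK-2-balancedtrials.py is the authoritative implementation.

```python
# Illustrative balanced-trial sampling (names are assumptions).
import random

def build_balanced_trials(utts_by_spk, meta_by_spk, n_trials=10000, seed=42):
    """utts_by_spk: {speaker: [utterances]}, meta_by_spk: {speaker: meta tuple}."""
    rng = random.Random(seed)
    speakers = [s for s, utts in utts_by_spk.items() if len(utts) >= 2]
    trials = []
    for _ in range(n_trials // 2):
        # target (same-speaker) trial
        s = rng.choice(speakers)
        enroll, test = rng.sample(utts_by_spk[s], 2)
        trials.append((1, enroll, test, True))           # meta-info trivially matches
        # non-target trial, recording whether the speaker meta-info matches
        s1, s2 = rng.sample(speakers, 2)
        enroll, test = rng.choice(utts_by_spk[s1]), rng.choice(utts_by_spk[s2])
        trials.append((0, enroll, test, meta_by_spk[s1] == meta_by_spk[s2]))
    rng.shuffle(trials)
    return trials                                        # (label, enroll, test, meta_match)
```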
# preprocessing
$ python ./src/preprocess/process-LibriSpeech.py --read_path SRC_PATH
Convert the audio format from .flac to .wav files.
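A minimal sketch of this conversion, assuming the standard LibriSpeech directory layout; the repository's process-LibriSpeech.py is the reference implementation.

```python
# Sketch of the .flac -> .wav conversion (source location is an assumption).
import soundfile as sf
from pathlib import Path

src_root = Path("SRC_PATH/LibriSpeech")                  # assumed source location
for flac in src_root.rglob("*.flac"):
    audio, sr = sf.read(str(flac))                       # LibriSpeech is already 16 kHz
    sf.write(str(flac.with_suffix(".wav")), audio, sr)
```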
# set split
$ python ./src/preprocess/split-LibriSpeech-1-rawtrials.py
$ python ./src/preprocess/split-LibriSpeech-2-balancedtrials.py
Check whether speaker meta-info (Gender (SEX) | Label) matches across the total combination of samples.
Sample trials balanced over the label distribution and meta-info matches (the trial-sampling sketch above applies here as well, with gender as the meta-info).
$ mv ./data/VoxCeleb/*_wav/ ./data/VoxCeleb/preprocess/
No special data preprocessing is required beyond moving the converted .wav directories as above.
# set split
$ python ./src/preprocess/split-VoxCeleb-0-speakers.py
$ python ./src/preprocess/split-VoxCeleb-2-balancedtrials.py
List the speakers in each subset, and convert the 'Vox1-O' evaluation trial list from .txt to .csv format (see the sketch after this list).
Sample trials balanced over the label distribution.
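A minimal sketch of the .txt-to-.csv conversion mentioned above; the trial file name and CSV column names are assumptions, and the repository's split-VoxCeleb-*.py scripts are authoritative.

```python
# Sketch of the Vox1-O .txt -> .csv conversion (file and column names assumed).
import csv

with open("veri_test2.txt") as fin, open("Vox1-O.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    writer.writerow(["label", "enrollment", "test"])
    for line in fin:
        label, enroll, test = line.split()               # e.g. "1 id10270/x.wav id10270/y.wav"
        writer.writerow([label, enroll, test])
```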
Logs, weights, and training configurations will be saved under the res/ directory.
The result folder is named in the local-YYYYMMDD-HHmmss format by default.
To use neptune.ai logging, set your configuration in src/configs/neptune/neptune-logger-config.yaml and add --neptune to the command line.
The experiment ID created at your neptune.ai [project] will be the name of the output directory.
- General usage examples:
# Running directly through the command line
$ CUDA_VISIBLE_DEVICES=0 python ./src/main.py train VCTK UniPool --use_pretrain --frz_pretrain --batch_size 128 --seed 9973 --backbone_cfg facebook/wav2vec2-base --nb_total_step 10000 --nb_steps_eval 1000;
# Or you can use a shell script to run multiple commands.
$ ./src/run.sh
- Adjust hyperparameters directly from the command line.
$ python ./src/main.py -h
usage: main.py [1-action] [2-data] [3-model] [-h]
positional arguments (required):
[1] action: {train,eval}
[2] data : {VCTK,LibriSpeech,Vox1,Vox2}
[3] model : {X-vector,ECAPA-TDNN,SincNet,ExploreWV2,FinetuneWV2,UniPool}
optional arguments in general:
-h, --help show this help message and exit
--quickrun quick check for the running experiment on the modification, set as True if given
--skiptest skip evaluation for the testset during training, set as True if given
--neptune log experiment with neptune logger, set as True if given
--workers WORKERS the number of cpu-cores to use from dataloader (per each device), defaults to: 4
--device [DEVICE0,] specify the list of index numbers to use the cuda devices, defaults to: [0]
--seed SEED integer value of random number initiation, defaults to: 42
--eval_path EVAL_PATH result path to load model on the "action" given as {eval}, defaults to: None
--description DESCRIPTION user parameter for specifying certain version, Defaults to "Untitled".
keyword arguments:
--kwarg KWARG dynamically modifies any of the hyperparameters declared in ../configs/.../...*.yaml or ./benchmarks/...*.yaml
(e.g.) --lr 0.001 --batch_size 64 --nb_total_step 25000 ...
- The following command runs the test evaluation with the best-validated model parameters from the configuration saved in DIR_NAME.
CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval _ _ --eval_path DIR_NAME;
- You can also run a cross-dataset evaluation by modifying the command as follows.
CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval Vox1 _ --eval_path DIR_NAME;
@article{kim2024universal,
title={Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification},
author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
journal={arXiv preprint arXiv:2409.07770},
year={2024}
}
This repository is released under the MIT license.
As noted in the comments for each comparison model under src/benchmarks/, the projects below were referenced to reproduce the implementations.
"Official" means that the author of the paper released the project.
- X-vector/ : cvqluu/TDNN (open to the public, released under the MIT license)
- SincNet/ : mravanelli/SincNet (Official)
- ECAPA-TDNN/ : TaoRuijie/ECAPA-TDNN
- FinetuneWV2/ : nikvaessen/w2v2-speaker (Official)
Some of the modules in src/utils/ are adapted from the sources below.
- metrics.py : speechbrain/utils/metric_stats.py
- sampler.py : SeungjunNah/DeepDeblur-PyTorch/src/data/sampler.py