
Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification (2024)

This repository provides the PyTorch implementation of our experiments.

The figure below gives an overview of the proposed framework, which pools a speaker embedding from the multi-layer features of pre-trained models.
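For intuition only, and not as the exact architecture of the paper, the sketch below shows one common way to pool a speaker embedding from the stacked hidden states of a frozen pre-trained backbone (e.g. wav2vec 2.0): learnable per-layer weights combine the layers, and an attentive statistics-style pooling collapses the time axis. Module and variable names are placeholders, not the identifiers used in src/.

import torch
import torch.nn as nn

class MultiLayerPoolingSketch(nn.Module):
    """Hedged sketch: layer-weighted sum + attentive statistics pooling.

    `hidden_states` is a list/tuple of (B, T, D) tensors, one per backbone layer,
    e.g. the 13 hidden states of size 768 returned by wav2vec2-base with
    output_hidden_states=True.
    """
    def __init__(self, nb_layers: int, feat_dim: int, emb_dim: int = 192):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(nb_layers))   # softmax-normalized below
        self.attention = nn.Sequential(nn.Linear(feat_dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.projection = nn.Linear(feat_dim * 2, emb_dim)

    def forward(self, hidden_states):
        stacked = torch.stack(hidden_states, dim=0)                  # (L, B, T, D)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        frames = (w * stacked).sum(dim=0)                            # (B, T, D)

        alpha = torch.softmax(self.attention(frames), dim=1)         # (B, T, 1) time weights
        mean = (alpha * frames).sum(dim=1)                           # (B, D)
        std = ((alpha * (frames - mean.unsqueeze(1)) ** 2).sum(dim=1) + 1e-6).sqrt()
        return self.projection(torch.cat([mean, std], dim=-1))       # (B, emb_dim)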

Environment Support & Python Requirements

Ubuntu · Python · PyTorch

Use requirements.txt to install the Python dependencies.
The system soundfile package and the conda-forge ffmpeg package are also required for downloading and preprocessing the data; you can install them as below.

$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg
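To quickly confirm that these non-pip dependencies are visible before preprocessing, a minimal check such as the following can be run (this snippet is not part of the repository):

import shutil
import soundfile as sf

# soundfile wraps libsndfile; ffmpeg must be discoverable on PATH.
print("libsndfile version:", sf.__libsndfile_version__)
print("ffmpeg found at:", shutil.which("ffmpeg") or "NOT FOUND")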

1. Dataset Preparation

The datasets (CSTR VCTK Corpus, LibriSpeech, and VoxCeleb 1 & 2) can be downloaded from their official distribution pages.

2. Data Preprocessing & Evaluation Split

The following scripts preprocess the audio data and build evaluation trials for each dataset.

  • You can skip the "# set split" part, since the ready-made splits and trials used in our experiments are already included in the file tree.
    Please first check the contents of data/VCTK-Corpus/, data/LibriSpeech/, data/VoxCeleb/ [;speakers/, ;trials/], e.g. with the listing sketch below.
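A quick way to see which ready-made split and trial files ship with the repository is a short listing such as this (the exact file names inside the speakers/ and trials/ subdirectories may differ):

from pathlib import Path

# list the ready-made split and trial files included in the file tree
for dataset in ["VCTK-Corpus", "LibriSpeech", "VoxCeleb"]:
    for sub in ["speakers", "trials"]:
        for path in sorted((Path("data") / dataset / sub).glob("*")):
            print(path)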

CSTR VCTK Corpus

# preprocessing
$ python ./src/preprocess/process-VCTK.py --read_path SRC_PATH

Removes speakers [p280, p315] due to technical issues with their recordings.
Drops samples no.000~no.024, where every speaker reads the same transcript for a given number.
Resamples the audio to a common sample rate (48 kHz → 16 kHz).
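The resampling step could look roughly like the sketch below, assuming torchaudio is available; the actual logic lives in ./src/preprocess/process-VCTK.py, and the function here is illustrative only.

import torchaudio

def resample_to_16k(src_path: str, dst_path: str) -> None:
    """Hedged sketch: load a 48 kHz VCTK file and rewrite it at 16 kHz."""
    waveform, sr = torchaudio.load(src_path)                     # (channels, samples)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
    torchaudio.save(dst_path, waveform, sample_rate=16000)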

# set split
$ python ./src/preprocess/split-VCTK-0-speakers.py
$ python ./src/preprocess/split-VCTK-1-rawtrials.py
$ python ./src/preprocess/split-VCTK-2-balancedtrials.py

Splits the total speaker pool into train, validation, and test speaker subsets.
Checks the speaker meta-information matches (Gender | Age | Accents | Region | Label) over the full set of combinations.
Samples the trials so that the label distribution and meta-information matches are balanced.
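For intuition, a toy version of the trial sampling might look like the sketch below: it forms same-speaker (label 1) and different-speaker (label 0) pairs and keeps the two labels balanced. The actual scripts additionally balance over the speaker meta-information; all names here are hypothetical.

import random
from itertools import combinations

def sample_balanced_trials(utts_by_speaker: dict, nb_trials: int, seed: int = 42):
    """utts_by_speaker: {speaker_id: [utterance_path, ...]} -> list of (label, utt_a, utt_b)."""
    rng = random.Random(seed)
    speakers = list(utts_by_speaker)

    # enumerate candidate pairs (fine for a toy example; the real scripts sample more carefully)
    positives = [(1, a, b) for spk in speakers
                 for a, b in combinations(utts_by_speaker[spk], 2)]
    negatives = [(0, a, b) for spk_a, spk_b in combinations(speakers, 2)
                 for a in utts_by_speaker[spk_a] for b in utts_by_speaker[spk_b]]

    half = nb_trials // 2
    trials = rng.sample(positives, min(half, len(positives))) \
           + rng.sample(negatives, min(half, len(negatives)))
    rng.shuffle(trials)
    return trials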

LibriSpeech

# preprocessing
$ python ./src/preprocess/process-LibriSpeech.py --read_path SRC_PATH

Converts the audio from .flac to .wav format.
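The conversion itself can be done with the soundfile package installed above; a minimal sketch (not the repository script) is:

import soundfile as sf

def flac_to_wav(flac_path: str, wav_path: str) -> None:
    """Read a LibriSpeech .flac file and rewrite it as 16-bit PCM .wav."""
    audio, sr = sf.read(flac_path)
    sf.write(wav_path, audio, sr, subtype="PCM_16")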

# set split
$ python ./src/preprocess/split-LibriSpeech-1-rawtrials.py
$ python ./src/preprocess/split-LibriSpeech-2-balancedtrials.py

Checks the speaker meta-information matches (Gender (SEX) | Label) over the full set of sample combinations.
Samples the trials so that the label distribution and meta-information matches are balanced.

VoxCeleb 1 & 2

$ mv ./data/VoxCeleb/*_wav/ ./data/VoxCeleb/preprocess/

No special data preprocessing is required.

# set split
$ python ./src/preprocess/split-VoxCeleb-0-speakers.py
$ python ./src/preprocess/split-VoxCeleb-2-balancedtrials.py

Lists the speakers in each subset, and converts the 'Vox1-O' evaluation trial file from .txt to .csv format.
Samples the trials so that the label distribution is balanced.
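For illustration, Vox1-O trial lists are commonly distributed as `label enroll_path test_path` lines; a hedged sketch of the .txt → .csv conversion (column names are arbitrary, not necessarily those written by the split script) is:

import csv

def vox1_trials_to_csv(txt_path: str, csv_path: str) -> None:
    """Convert 'label enroll test' lines into a CSV file with a header row."""
    with open(txt_path) as fin, open(csv_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["label", "enrollment", "test"])
        for line in fin:
            label, enroll, test = line.split()
            writer.writerow([label, enroll, test])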

3. Run Experiments

Logs, weights, and training configurations are saved under the res/ directory.
By default, the result folder is named in the local-YYYYMMDD-HHmmss format.

To use neptune.ai logging, set your configuration in src/configs/neptune/neptune-logger-config.yaml and add --neptune to the command line.
The experiment ID created in your neptune.ai project will then be used as the name of the output directory.
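For reference, the default local-YYYYMMDD-HHmmss directory name corresponds to a timestamp like the one produced below (a sketch of the naming convention, not the repository code):

from datetime import datetime
from pathlib import Path

run_name = datetime.now().strftime("local-%Y%m%d-%H%M%S")   # e.g. local-20240915-173042
out_dir = Path("res") / run_name
out_dir.mkdir(parents=True, exist_ok=True)
print(out_dir)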

  • General usage examples:
# Running directly through the command line
$ CUDA_VISIBLE_DEVICES=0 python ./src/main.py train VCTK UniPool --use_pretrain --frz_pretrain --batch_size 128 --seed 9973 --backbone_cfg facebook/wav2vec2-base --nb_total_step 10000 --nb_steps_eval 1000;

# Or use a shell script to run multiple commands.
$ ./src/run.sh
  • Adjusting hyperparameters directly from the command line:
$ python ./src/main.py -h
usage: main.py [1-action] [2-data] [3-model] [-h]

positional arguments (required):
  [1] action:  {train,eval}
  [2] data  :  {VCTK,LibriSpeech,Vox1,Vox2}
  [3] model :  {X-vector,ECAPA-TDNN,SincNet,ExploreWV2,FinetuneWV2,UniPool}

optional arguments in general:
  -h, --help                 show this help message and exit
  --quickrun                 quick sanity-check run of the experiment, set to True if given
  --skiptest                 skip evaluation on the test set during training, set to True if given
  --neptune                  log the experiment with the Neptune logger, set to True if given
  --workers     WORKERS      number of CPU workers for the dataloader (per device), defaults to: 4
  --device     [DEVICE0,]    list of CUDA device indices to use, defaults to: [0]
  --seed        SEED         integer seed for random-number initialization, defaults to: 42
  --eval_path   EVAL_PATH    result path to load the model from when the action is {eval}, defaults to: None
  --description DESCRIPTION  user note for tagging a specific run, defaults to: "Untitled"

keyword arguments:
  --kwarg KWARG              dynamically modifies any of the hyperparameters declared in ../configs/.../...*.yaml or ./benchmarks/...*.yaml
  (e.g.) --lr 0.001 --batch_size 64 --nb_total_step 25000 ...
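Conceptually, the keyword-argument mechanism amounts to loading the YAML configuration and overriding any matching keys with the extra --key value pairs from the command line. A hedged sketch of that pattern (not the actual parser in src/main.py):

import sys
import yaml

def load_config_with_overrides(yaml_path: str, argv=None) -> dict:
    """Load a YAML config, then override keys given as '--key value' CLI pairs."""
    with open(yaml_path) as f:
        cfg = yaml.safe_load(f)

    args = sys.argv[1:] if argv is None else argv
    for flag, value in zip(args[::2], args[1::2]):   # pair each flag with its value
        key = flag.lstrip("-")
        if key in cfg:
            cfg[key] = type(cfg[key])(value)         # cast to the type declared in the YAML
    return cfg

# e.g. load_config_with_overrides("path/to/config.yaml", ["--lr", "0.001", "--batch_size", "64"])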

4. Evaluate

  • The following command runs the test evaluation with the best-validated model parameters from the configuration saved in DIR_NAME.
CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval _ _ --eval_path DIR_NAME;
  • You can also run a cross-dataset evaluation by modifying the command as follows; a rough sketch of how trials are scored is given after these commands.
CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval Vox1 _ --eval_path DIR_NAME;
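Internally, a verification trial is typically scored by comparing the two speaker embeddings; the sketch below shows a common cosine-scoring setup and an equal error rate (EER) computation, given only as an illustration of the evaluation idea rather than the repository's metric code.

import numpy as np
import torch.nn.functional as F

def score_trial(emb_enroll, emb_test) -> float:
    """Cosine similarity between two 1-D speaker-embedding tensors."""
    return F.cosine_similarity(emb_enroll.unsqueeze(0), emb_test.unsqueeze(0)).item()

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-accept and false-reject rates meet."""
    order = np.argsort(scores)[::-1]                             # high score = same speaker
    labels = labels[order]
    far = np.cumsum(1 - labels) / max((1 - labels).sum(), 1)     # false accepts among negatives
    frr = 1 - np.cumsum(labels) / max(labels.sum(), 1)           # false rejects among positives
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2)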

Citation

@article{kim2024universal,
  title={Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification},
  author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
  journal={arXiv preprint arXiv:2409.07770},
  year={2024}
}

License

This repository is released under the MIT license.

As noted in the comments of each comparison model under src/benchmarks/, the implementations were reproduced with reference to the corresponding original projects; "Official" means the project was released by the authors of the paper. The referenced projects are either open to the public or released under the MIT license.

Some of the code in src/utils/ is also adapted from those referenced projects.
