This repository contains the notebooks used to produce the data, figures, and table-data included in the bachelor's thesis "Evaluation of automatic bias detection and pre-processing mitigation techniques" by Max Kleinegger.
The notebooks contain Python code, which requires an environment with the necessary dependencies. The environment is created with pip, and we provide a script that handles the setup. Just call
./setup.sh
The repository is organized into several main directories, each serving a distinct purpose related to data synthesis, bias detection, and evaluation. Below is an overview of the structure:
- README.md: Provides an overview of the project, its purpose, and usage instructions.
- data/: Contains datasets used for experiments, including synthetic data generated by different models.
  - DataSynthesizer/: Holds JSON descriptions for different data synthesis modes (correlated, independent, random).
  - SDV/: Includes pre-trained models for synthetic data generation using various techniques.
  - Synthetic Data Files:
    - Various .json and .csv files containing generated datasets and metadata.
    - Includes synthetic data from different synthesizers (CTGAN, CopulaGAN, GaussianCopula, TVAE, DataSynthesizer).
    - Contains trainset.json and testset.json for model evaluation.
- notebooks/: Contains Jupyter notebooks used for bias detection, mitigation, and preprocessing.
  - bias_detection_pre_synthesized.ipynb: Analyzes bias in pre-synthesized datasets.
  - bias_detection_synthesized.ipynb: Evaluates bias in synthetic datasets.
  - bias_mitigation_synthesized.ipynb: Implements bias mitigation techniques on synthetic data.
  - bias_mitigation_synthesized_subsampling.ipynb: Applies subsampling-based bias mitigation.
  - preprocessing.ipynb: Prepares data for further analysis.
- requirements.txt: Lists dependencies required for running the project.
- results/: Stores experimental results and analysis outputs.
  - data/: Contains JSON files with fairness and utility metrics from different approaches.
    - Includes metrics for original and synthetic datasets, processed through binning and PDF-based techniques.
  - plots.ipynb: Notebook to visualize and analyze fairness results.
- setup.sh: Shell script for setting up the project environment.
- src/: Contains the source code for processing data, computing fairness metrics, and generating synthetic datasets.
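
To inspect the stored metrics outside of plots.ipynb, the JSON files in results/data/ can be loaded with the Python standard library. The following is a minimal sketch, not code from this repository: the helper name `load_metric_files` is hypothetical, and it assumes only that the directory contains files ending in .json. The internal schema of those files is not described here, so the sketch returns the parsed content unchanged.

```python
import json
from pathlib import Path


def load_metric_files(results_dir="results/data"):
    """Load every .json file in results_dir into a dict keyed by file name (without extension).

    Hypothetical helper for illustration; the default path assumes the
    function is called from the repository root.
    """
    metrics = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            # No schema is assumed: the parsed JSON is stored as-is.
            metrics[path.stem] = json.load(fh)
    return metrics
```

Calling `load_metric_files()` from the repository root would return one entry per JSON file, which can then be explored interactively before deciding how to plot or tabulate the values.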