Commit b406199

Merge pull request #20 from BioinfoMachineLearning/refactor
Version 1.2 additions
2 parents 33f0d0b + 534bc1e commit b406199

18 files changed · +2854 −82 lines changed

.gitignore

Lines changed: 12 additions & 1 deletion
@@ -111,6 +111,12 @@ venv.tar.gz
 .idea
 .vscode
 
+# TensorBoard
+tb_logs/
+
+# Feature Processing
+*work_filenames*.csv
+
 # DIPS
 project/datasets/DIPS/complexes/**
 project/datasets/DIPS/interim/**
@@ -119,13 +125,15 @@ project/datasets/DIPS/parsed/**
 project/datasets/DIPS/raw/**
 project/datasets/DIPS/final/raw/**
 project/datasets/DIPS/final/final_raw_dips.tar.gz*
+project/datasets/DIPS/final/processed/**
 
 # DB5
 project/datasets/DB5/processed/**
 project/datasets/DB5/raw/**
 project/datasets/DB5/interim/**
 project/datasets/DB5/final/raw/**
 project/datasets/DB5/final/final_raw_db5.tar.gz*
+project/datasets/DB5/final/processed/**
 
 # EVCoupling
 project/datasets/EVCoupling/raw/**
@@ -137,4 +145,7 @@ project/datasets/EVCoupling/final/processed/**
 project/datasets/CASP-CAPRI/raw/**
 project/datasets/CASP-CAPRI/interim/**
 project/datasets/CASP-CAPRI/final/raw/**
-project/datasets/CASP-CAPRI/final/processed/**
+project/datasets/CASP-CAPRI/final/processed/**
+
+# Input
+project/datasets/Input/**
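As a quick sanity check of the new ignore rules, `git check-ignore -v` reports which `.gitignore` line matches a given path; a minimal sketch against a checkout of this commit, with illustrative (hypothetical) file names:

```bash
# Each matched path is printed alongside the .gitignore pattern that caught it;
# the file names below are placeholders, not files shipped with the repository:
git check-ignore -v \
    tb_logs/version_0/events.out.tfevents \
    project/datasets/DIPS/final/processed/example.dill \
    project/datasets/Input/example.pdb
```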

README.md

Lines changed: 140 additions & 60 deletions
@@ -4,7 +4,7 @@
 
 The Enhanced Database of Interacting Protein Structures for Interface Prediction
 
-[![Paper](http://img.shields.io/badge/paper-arxiv.2106.04362-B31B1B.svg)](https://arxiv.org/abs/2106.04362) [![CC BY 4.0][cc-by-shield]][cc-by] [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5134732.svg)](https://doi.org/10.5281/zenodo.5134732)
+[![Paper](http://img.shields.io/badge/paper-arxiv.2106.04362-B31B1B.svg)](https://arxiv.org/abs/2106.04362) [![CC BY 4.0][cc-by-shield]][cc-by] [![Primary Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5134732.svg)](https://doi.org/10.5281/zenodo.5134732) [![Supplementary Data DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8071136.svg)](https://doi.org/10.5281/zenodo.8071136)
 
 [cc-by]: http://creativecommons.org/licenses/by/4.0/
 [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
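The newly added badge points at a second Zenodo record holding the version 1.2 supplementary data. Assuming the shape of Zenodo's public records API (with `curl` and `jq` as extra dependencies of this sketch), the record's attached files can be listed without downloading anything:

```bash
# Query the supplementary Zenodo record (10.5281/zenodo.8071136) referenced by
# the new badge and print the names of its attached files:
curl -s https://zenodo.org/api/records/8071136 | jq -r '.files[].key'
```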
@@ -25,8 +25,9 @@ The Enhanced Database of Interacting Protein Structures for Interface Prediction
 * DB5-Plus' final 'raw' tar archive now also includes a corrected (i.e. de-duplicated) list of filenames for its 55 test complexes
 * Benchmark results included in our paper were run after this issue was resolved
 * However, if you ran experiments using DB5-Plus' filename list for its test complexes, please re-run them using the latest list
+* Version 1.2.0: Minor additions to DIPS-Plus tar archives, including new residue-level intrinsic disorder region annotations and raw Jackhmmer-small BFD MSAs (Supplementary Data DOI: 10.5281/zenodo.8071136)
 
-## How to run creation tools
+## How to set up
 
 First, download Mamba (if not already downloaded):
 ```bash
@@ -51,66 +52,135 @@ conda activate DIPS-Plus # Note: One still needs to use `conda` to (de)activate
 pip3 install -e .
 ```
 
-## Default DIPS-Plus directory structure
+To install PSAIA for feature generation, install GCC 10 for PSAIA:
+
+```bash
+# Install GCC 10 for Ubuntu 20.04:
+sudo apt install software-properties-common
+sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa
+sudo apt update
+sudo apt install gcc-10 g++-10
+
+# Or install GCC 10 for Arch Linux/Manjaro:
+yay -S gcc10
+```
+
+Then install QT4 for PSAIA:
+
+```bash
+# Install QT4 for Ubuntu 20.04:
+sudo add-apt-repository ppa:rock-core/qt4
+sudo apt update
+sudo apt install libqt4* libqtcore4 libqtgui4 libqtwebkit4 qt4* libxext-dev
+
+# Or install QT4 for Arch Linux/Manjaro:
+yay -S qt4
+```
+
+Conclude by compiling PSAIA from source:
+
+```bash
+# Select the location to install the software:
+MY_LOCAL=~/Programs
+
+# Download and extract PSAIA's source code:
+mkdir "$MY_LOCAL"
+cd "$MY_LOCAL"
+wget http://complex.zesoi.fer.hr/data/PSAIA-1.0-source.tar.gz
+tar -xvzf PSAIA-1.0-source.tar.gz
+
+# Compile PSAIA (i.e., a GUI for PSA):
+cd PSAIA_1.0_source/make/linux/psaia/
+qmake-qt4 psaia.pro
+make
+
+# Compile PSA (i.e., the protein structure analysis (PSA) program):
+cd ../psa/
+qmake-qt4 psa.pro
+make
+
+# Compile PIA (i.e., the protein interaction analysis (PIA) program):
+cd ../pia/
+qmake-qt4 pia.pro
+make
+
+# Test run any of the above-compiled programs:
+cd "$MY_LOCAL"/PSAIA_1.0_source/bin/linux
+# Test run PSA inside a GUI:
+./psaia/psaia
+# Test run PIA through a terminal:
+./pia/pia
+# Test run PSA through a terminal:
+./psa/psa
+```
+
+Lastly, install Docker following the instructions from https://docs.docker.com/engine/install/
+
+## How to generate protein feature inputs
+In our [feature generation notebook](notebooks/feature_generation.ipynb), we provide examples of how users can generate the protein features described in our [accompanying manuscript](https://arxiv.org/abs/2106.04362) for individual protein inputs.
+
+## How to use data
+In our [data usage notebook](notebooks/data_usage.ipynb), we provide examples of how users might use DIPS-Plus (or DB5-Plus) for downstream analysis or prediction tasks. For example, to train a new NeiA model with DB5-Plus as its cross-validation dataset, first download DB5-Plus' raw files and process them via the `data_usage` notebook:
+
+```bash
+mkdir -p project/datasets/DB5/final
+wget https://zenodo.org/record/5134732/files/final_raw_db5.tar.gz -O project/datasets/DB5/final/final_raw_db5.tar.gz
+tar -xzf project/datasets/DB5/final/final_raw_db5.tar.gz -C project/datasets/DB5/final/
+
+# To process these raw files for training and subsequently train a model:
+python3 notebooks/data_usage.py
+```
+
+## Standard DIPS-Plus directory structure
 
 ```
 DIPS-Plus
 │
 └───project
-│   │
-│   └───datasets
-│   │   │
-│   │   └───builder
-│   │   │
-│   │   └───DB5
-│   │   │   │
-│   │   │   └───final
-│   │   │   │   │
-│   │   │   │   └───raw
-│   │   │   │
-│   │   │   └───interim
-│   │   │   │   │
-│   │   │   │   └───complexes
-│   │   │   │   │
-│   │   │   │   └───external_feats
-│   │   │   │   │
-│   │   │   │   └───pairs
-│   │   │   │
-│   │   │   └───raw
-│   │   │   │
-│   │   │   README
-│   │   │
-│   │   └───DIPS
-│   │   │
-│   │   └───filters
-│   │   │
-│   │   └───final
-│   │   │   │
-│   │   │   └───raw
-│   │   │
-│   │   └───interim
-│   │   │   │
-│   │   │   └───complexes
-│   │   │   │
-│   │   │   └───external_feats
-│   │   │   │
-│   │   │   └───pairs-pruned
-│   │   │
-│   │   └───raw
-│   │   │
-│   │   └───pdb
-│   │
-│   └───utils
-│       constants.py
-│       utils.py
-
-.gitignore
-environment.yml
-LICENSE
-README.md
-requirements.txt
-setup.cfg
-setup.py
+│
+└───datasets
+│
+└───DB5
+│   │
+│   └───final
+│   │   │
+│   │   └───processed  # task-ready features for each dataset example
+│   │   │
+│   │   └───raw  # generic features for each dataset example
+│   │
+│   └───interim
+│   │   │
+│   │   └───complexes  # metadata for each dataset example
+│   │   │
+│   │   └───external_feats  # features curated for each dataset example using external tools
+│   │   │
+│   │   └───pairs  # pair-wise features for each dataset example
+│   │
+│   └───raw  # raw PDB data downloads for each dataset example
+│
+└───DIPS
+│
+└───filters  # filters to apply to each (un-pruned) dataset example
+│
+└───final
+│   │
+│   └───processed  # task-ready features for each dataset example
+│   │
+│   └───raw  # generic features for each dataset example
+│
+└───interim
+│   │
+│   └───complexes  # metadata for each dataset example
+│   │
+│   └───external_feats  # features curated for each dataset example using external tools
+│   │
+│   └───pairs-pruned  # filtered pair-wise features for each dataset example
+│   │
+│   └───parsed  # pair-wise features for each dataset example after initial parsing
+│
+└───raw
+│
+└───pdb  # raw PDB data downloads for each dataset example
 ```
 
 ## How to compile DIPS-Plus from scratch
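The build commands in the following hunks reference shell variables such as `$PROJDIR`, `$PSAIADIR`, `$HHSUITE_DB`, and `$HHSUITE_DB_DIR` without defining them; a sketch of how one might export them, where every path is an assumption rather than a value fixed by the README:

```bash
# Example environment for the commands below; adjust all paths to your machine:
export PROJDIR=~/Repositories/DIPS-Plus                  # local checkout of this repository (assumed location)
export PSAIADIR=~/Programs/PSAIA_1.0_source/bin/linux    # PSAIA binaries compiled in the setup section above
export HHSUITE_DB_DIR=~/Databases                        # parent directory for sequence databases (assumed)
export HHSUITE_DB="$HHSUITE_DB_DIR"/bfd                  # HH-suite-formatted database, e.g., the BFD (assumed)
```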
@@ -122,7 +192,7 @@ Retrieve protein complexes from the RCSB PDB and build out directory structure:
 rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt
 
 # Create data directories (if not already created):
-mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
+mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/pairs-pruned project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
 
 # Download the raw PDB files:
 rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
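The updated `mkdir` line now also scaffolds `interim/pairs-pruned`; an equivalent one-liner using `mkdir -p` with Bash brace expansion (a sketch, not the README's own command):

```bash
# Creates the same eight directories, including any missing parents, in one pass:
mkdir -p project/datasets/DIPS/{raw/pdb,interim/{pairs-pruned,external_feats},final/{raw,processed}}
```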
@@ -139,7 +209,17 @@ python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pa
 
 # Generate externally-sourced features:
 python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
-python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
+
+# Generate multiple sequence alignments (MSAs) using a smaller sequence database (if not already created using the standard BFD):
+DOWNLOAD_DIR="$HHSUITE_DB_DIR" && ROOT_DIR="${DOWNLOAD_DIR}/small_bfd" && SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz" && BASENAME=$(basename "${SOURCE_URL}") && mkdir --parents "${ROOT_DIR}" && aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" && pushd "${ROOT_DIR}" && gunzip "${ROOT_DIR}/${BASENAME}" && popd # e.g., Download the small BFD
+python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB_DIR"/small_bfd "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --generate_msa_only --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
+
+# Identify interfaces within intrinsically disordered regions (IDRs) #
+# (1) Pull down the Docker image for `flDPnn`
+docker pull docker.io/sinaghadermarzi/fldpnn
+# (2) For all sequences in the dataset, predict which interface residues reside within IDRs
+python3 project/datasets/builder/annotate_idr_interfaces.py "$PROJDIR"/project/datasets/DIPS/final/raw
 
 # Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
 python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"
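The small-BFD download added above is a single `&&` chain; unrolled into discrete steps under the same `$HHSUITE_DB_DIR` assumption (and still requiring `aria2c`), it reads:

```bash
# Fetch the reduced BFD used for the Jackhmmer MSAs, step by step:
DOWNLOAD_DIR="$HHSUITE_DB_DIR"
ROOT_DIR="${DOWNLOAD_DIR}/small_bfd"
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")

mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"  # segmented download of the FASTA archive
gunzip "${ROOT_DIR}/${BASENAME}"            # leaves the decompressed .fasta in place
```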
@@ -198,7 +278,7 @@ python3 project/datasets/builder/convert_complexes_to_graphs.py "$PROJDIR"/proje
 
 We split the (tar.gz) archive into eight separate parts with
 'split -b 4096M interim_external_feats_dips.tar.gz "interim_external_feats_dips.tar.gz.part"'
-to upload it to Zenodo, so to recover the original archive:
+to upload it to the dataset's primary Zenodo record, so to recover the original archive:
 
 ```bash
 # Reassemble external features archive with 'cat'
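The bash block in this last hunk is truncated by the diff view right after its first comment. Given the `split` invocation quoted above, the recovery step is presumably a single `cat` over the part files; a sketch, with the extraction destination left to the reader:

```bash
# Concatenate the eight parts back into the original archive; the .part*
# suffixes follow from the `split` prefix shown above:
cat interim_external_feats_dips.tar.gz.part* > interim_external_feats_dips.tar.gz
tar -xzf interim_external_feats_dips.tar.gz  # then extract wherever appropriate
```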
