fragSMILES4Reactions is a scientific project focused on the analysis and modeling of chemical reactions using the fragment-based SMILES (fragSMILES) representation alongside other notations such as SMILES, SELFIES, and SAFE. This repository contains all the code, data, and scripts needed to reproduce the experiments and results described in the associated research work.
- `rawdata_reactions/` – Raw reaction dataset already split into training, validation, and test sets.
- `data_reactions/` – Processed reaction data used as input for the experiments.
- `experiments_reactions/` – Outputs from model training and prediction. Folder names follow the convention `{key}={value}-{key}={value}-...`.
- `floats/` – Figures (in PDF format) and tables (in LaTeX format) generated during analysis.
- `notebooks/` – Jupyter notebooks for data exploration and post-processing of prediction results.
- `bestof_setup/` – CSV files reporting the best configuration found for each model.
- `scripts/` – Python scripts for preprocessing, training, prediction, and SMILES conversion tasks.
- `src/` – Main source code of the project.
- `extra/` – Contains an example reaction used to create the introductory figure/chart.
- `shell/` – Includes `run.sh`, a script to launch experiments using the best configurations for each model.
- `requirements.txt` – List of required Python dependencies for setting up the environment.
- `chemicalgof/`, `safe/`, and `datamol/` – External static repositories adopted for this project. The `SAFE` package has been modified to detect and report reasons for invalid sampled sequences; `chemicalgof/` is the latest version supporting the fragSMILES notation.
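Experiment folder names follow the `{key}={value}-{key}={value}-...` convention, which can be parsed back into a configuration mapping with a small helper. This is a sketch only; it assumes that neither keys nor values contain `-` or `=`:

```python
def parse_experiment_name(name: str) -> dict:
    """Split a '{key}={value}-{key}={value}-...' folder name into a dict."""
    config = {}
    for pair in name.split("-"):
        key, _, value = pair.partition("=")
        config[key] = value
    return config

print(parse_experiment_name("task=forward-notation=fragsmiles-lr=0.001"))
# {'task': 'forward', 'notation': 'fragsmiles', 'lr': '0.001'}
```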
The output of the experiments is already included in `experiments_reactions/`, except for the model checkpoints (`.ckpt` files), which will be made available soon. For this reason, the prediction phase (see the Scripts section) cannot be executed directly until trained models are obtained, but analyses of the predictions are already provided in this repository.
However, to reproduce our experiments:

1. Clone the repository:

   ```bash
   git clone https://github.com/molML/fragSMILES4reaction.git
   cd fragSMILES4Reactions
   ```

2. Set up the Python environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Run the experiments using the best configuration for each model via the shell script:

   ```bash
   bash shell/run.sh
   ```

   NOTE: Set the Python environment path to be activated on line 3 of `shell/run.sh`.

   ⚠️ These experiments were conducted using 4 GPUs in parallel. Running on fewer or lower-memory devices may result in out-of-memory errors.

4. Explore the Jupyter notebooks (see the Notebooks section) to analyze datasets and prediction results.
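`run.sh` launches experiments using the best configurations stored in `bestof_setup/`. As a minimal sketch of how such a CSV could be turned into a training command, assuming hypothetical column names that mirror the parameter table below (the real file layout may differ):

```python
import csv
import io

# Hypothetical contents of a bestof_setup CSV; the actual file names and
# column layout in bestof_setup/ are assumptions.
csv_text = """task,notation,model_dim,num_heads,num_layers,batch_size,lr,dropout
forward,fragsmiles,512,8,4,128,0.0005,0.3
"""

best = next(csv.DictReader(io.StringIO(csv_text)))
# Turn the row into the command-line flags a training script would expect.
args = " ".join(f"--{key} {value}" for key, value in best.items())
print(f"python scripts/train.py {args}")
```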
| Parameter | Description |
|---|---|
| `task` | Task to perform: either `forward` (i.e., synthesis) or `backward` (i.e., retrosynthesis). |
| `notation` | Molecular representation format used as input/output: `smiles`, `selfies`, `safe`, or `fragsmiles`. |
| `model_dim` | Dimensionality of the model's hidden layers (e.g., transformer embedding size). |
| `num_heads` | Number of attention heads in the multi-head attention mechanisms. |
| `num_layers` | Number of layers (e.g., encoder or decoder blocks) in the model architecture. |
| `batch_size` | Number of training samples processed simultaneously during one training step. |
| `lr` | Learning rate used by the optimizer to update model weights. |
| `dropout` | Dropout rate for regularization to prevent overfitting (only the value 0.3 was adopted in this work). |
These parameters are used as command-line arguments for the training and prediction Python scripts (see the Scripts section).
We recommend running scripts from the root directory. Example:

```bash
python scripts/script_file.py --argument1 value1 --argument2 value2
```
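The parameters above map naturally onto `argparse` options. The following is a minimal sketch of how such a command line could be parsed; the defaults and exact argument definitions here are assumptions, not the scripts' actual interface:

```python
import argparse

# Hypothetical argument definitions mirroring the parameter table;
# the repository's scripts may declare them differently.
parser = argparse.ArgumentParser()
parser.add_argument("--task", choices=["forward", "backward"])
parser.add_argument("--notation", choices=["smiles", "selfies", "safe", "fragsmiles"])
parser.add_argument("--model_dim", type=int, default=512)
parser.add_argument("--num_heads", type=int, default=8)
parser.add_argument("--num_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--dropout", type=float, default=0.3)

ns = parser.parse_args(["--task", "forward", "--notation", "fragsmiles"])
print(ns.task, ns.notation, ns.model_dim)
```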
- `convert_dataset.py` – Prepares the dataset used for the experiments, starting from the raw data. Explore the arguments (`python scripts/convert_dataset.py --help`) to see what must be provided when the command is called; the most important are `notation`, `split`, and `ncpus` (for multiprocessing). When a notation-based dataset is produced, a CSV file is written to track sequence lengths.
- `train.py` – Trains a model using the selected configuration (see the dedicated section). The model checkpoint (`.ckpt` file) is stored in the corresponding experiment folder; the vocabulary file (`vocab.pt`) is stored in the respective notation folder.
- `predict.py` – Predicts the test set with a trained model using the selected configuration (see the dedicated section). The output consists of encoded predicted sequences stored in the respective experiment folder, with filenames containing the substring `tokens`.
- `convert_prediction_strict.py` – Converts the encoded predicted sequences of a model specified by its parameters. Decoded sequences with an erroneous chirality label assigned to an atom are counted as invalid.
- `convert_prediction_strict_from_path.py` – Same as above, but only requires the path to the experiment folder.
- `convert_prediction.py` and `convert_prediction_from_path.py` – Similar to the strict versions, but sequences with an erroneous chirality label assigned to an atom are not counted as invalid.
- `fragment_dataset.py` – Fragments the SMILES of the data to obtain their scaffolds, cycles, and acyclic chains. Used only on the test set, as demonstrated in `05_struggle.ipynb`.
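`convert_dataset.py` writes a CSV file tracking sequence lengths. A minimal sketch of producing such a file, assuming hypothetical column names and a toy whitespace tokenization (the project's real tokenization depends on the chosen notation):

```python
import csv
import io

# Toy whitespace-tokenized sequences; real tokenization is notation-specific.
sequences = ["C C O", "c 1 c c c c c 1"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sequence", "length"])  # assumed column names
for seq in sequences:
    writer.writerow([seq, len(seq.split())])

print(buffer.getvalue())
```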
The Jupyter notebooks in `notebooks/` provide an interactive way to explore datasets and experiment outputs.

NOTE: The IPython package is required to work with the notebooks.

- `data_analysis.ipynb` – Can be explored before running the experiments. It reports sequence lengths and dataset size per split.
- `bestof_selection.ipynb` – Visualizes and compares loss curves for different hyperparameter settings to identify optimal configurations.
- `accuracy.ipynb` – Computes performance metrics for the best models and outputs tables ready for publication.
- `similarity.ipynb` – Analyzes similarity distributions between incorrect but valid predictions and their target molecules (forward task only).
- `struggle.ipynb` – Investigates failure cases in prediction, including reasons for invalid samples and substructure matching in erroneous predictions.
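As an illustration of the kind of metric `accuracy.ipynb` reports, a top-1 exact-match accuracy between predicted and target sequences can be sketched as follows (the notebook's actual metrics may differ):

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that exactly match their target sequence."""
    assert len(predictions) == len(targets)
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

# One of two toy predictions matches its target.
print(exact_match_accuracy(["CCO", "CCN"], ["CCO", "CCC"]))  # 0.5
```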
(No published paper)
This project is licensed under the MIT License. See the LICENSE file for details.