
fragSMILES4Reactions

License: MIT · Python 3.12 · Jupyter Notebooks

fragSMILES4Reactions is a scientific project focused on the analysis and modeling of chemical reactions using the fragment-based fragSMILES representation alongside other notations such as SMILES, SELFIES, and SAFE. This repository contains all the code, data, and scripts needed to reproduce the experiments and results described in the associated research work.

📁 Project Structure

  • rawdata_reactions/ – Raw reaction dataset already split into training, validation, and test sets.
  • data_reactions/ – Processed reaction data used as input for experiments.
  • experiments_reactions/ – Outputs from model training and prediction. Folder names follow the convention {key}={value}-{key}={value}-....
  • floats/ – Figures (in PDF format) and tables (in LaTeX format) generated during analysis.
  • notebooks/ – Jupyter notebooks for data exploration and post-processing of prediction results.
  • bestof_setup/ – CSV files reporting the best configuration found for each model.
  • scripts/ – Python scripts for preprocessing, training, prediction, and SMILES conversion tasks.
  • src/ – Main source code of the project.
  • extra/ – Contains an example reaction used for creating the introductory figure/chart.
  • shell/ – Includes run.sh, a script to launch experiments using the best configurations for each model.
  • requirements.txt – List of required Python dependencies for setting up the environment.
  • chemicalgof/ and safe/datamol/ – External repositories included statically in this project. The SAFE package has been modified to detect and report the reasons why sampled sequences are invalid. chemicalgof/ is the latest version supporting the fragSMILES notation.
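The {key}={value}-... convention used for folder names in experiments_reactions/ can be parsed back into a configuration dictionary. A minimal sketch (the folder name below is hypothetical, for illustration only):

```python
# Parse an experiments_reactions/ folder name of the form
# "{key}={value}-{key}={value}-..." back into a configuration dict.
def parse_experiment_name(name: str) -> dict:
    config = {}
    for pair in name.split("-"):
        key, _, value = pair.partition("=")
        config[key] = value
    return config

# Hypothetical folder name, for illustration only:
print(parse_experiment_name("task=forward-notation=fragsmiles-lr=0.001"))
# {'task': 'forward', 'notation': 'fragsmiles', 'lr': '0.001'}
```

Note that this simple split assumes neither keys nor values contain a hyphen.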

🧪 Reproducibility

The outputs of the experiments are already included in experiments_reactions/, except for the model checkpoints (.ckpt files), which will be made available soon. For this reason, the prediction phase (see the Scripts section) cannot be executed directly until the trained models are obtained; however, the analyses of their predictions are already provided in this repository. To reproduce our experiments:

  1. Clone the repository:

     git clone https://github.com/molML/fragSMILES4reaction.git
     cd fragSMILES4reaction
  2. Set up the Python environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -r requirements.txt
  3. Experiments using the best configurations for each model can be run through a shell script.

    NOTE: Set the path of the Python environment to activate at line 3 of shell/run.sh.

    bash shell/run.sh

    ⚠️ These experiments were conducted using 4 GPUs in parallel. Running on fewer or lower-memory devices may result in out-of-memory errors.

  4. Explore the Jupyter notebooks (see the Notebooks section) to analyze datasets and prediction results.
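The note in step 3 refers to the environment-activation line of shell/run.sh. A hypothetical sketch of the opening of that script (the path is a placeholder to be replaced with your own):

```shell
#!/bin/bash
# Hypothetical opening of shell/run.sh; the path below is a placeholder.
VENV="/path/to/.venv"   # line 3: set your environment path here
if [ -f "$VENV/bin/activate" ]; then source "$VENV/bin/activate"; fi
```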

Model parameters

  • task – Task to perform: either forward (i.e., synthesis) or backward (i.e., retrosynthesis).
  • notation – Molecular representation used as input/output: smiles, selfies, safe, or fragsmiles.
  • model_dim – Dimensionality of the model's hidden layers (e.g., transformer embedding size).
  • num_heads – Number of attention heads in the multi-head attention mechanisms.
  • num_layers – Number of layers (e.g., encoder or decoder blocks) in the model architecture.
  • batch_size – Number of training samples processed simultaneously during one training step.
  • lr – Learning rate used by the optimizer to update model weights.
  • dropout – Dropout rate for regularization to prevent overfitting (only the value 0.3 was adopted in this work).

These parameters are used as arguments in the Python scripts (see the Scripts section) for training and prediction.
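As a rough illustration of how the parameters above map to command-line arguments, here is a minimal argparse sketch. This mirrors the table conceptually; the real scripts' interfaces may differ in names and defaults:

```python
import argparse

# Sketch of a CLI exposing the parameters listed above.
# Defaults are illustrative, not the project's actual values.
parser = argparse.ArgumentParser()
parser.add_argument("--task", choices=["forward", "backward"], required=True)
parser.add_argument("--notation", choices=["smiles", "selfies", "safe", "fragsmiles"], required=True)
parser.add_argument("--model_dim", type=int, default=256)
parser.add_argument("--num_heads", type=int, default=8)
parser.add_argument("--num_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--dropout", type=float, default=0.3)  # only 0.3 was used in this work

# Parse a sample command line instead of sys.argv, for demonstration:
args = parser.parse_args(["--task", "forward", "--notation", "fragsmiles"])
print(args.task, args.notation, args.dropout)
```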

Scripts

We recommend running scripts from the root directory. Example:

python scripts/script_file.py --argument1 value1 --argument2 value2
  • convert_dataset.py Prepares the datasets used for the experiments starting from the raw data. Please explore the arguments to provide when calling the command (python scripts/convert_dataset.py --help). The most important are "notation", "split", and "ncpus" (for multiprocessing). When a notation-based dataset is produced, a CSV file is also written to track sequence lengths.

  • train.py Trains a model using the selected configuration (see dedicated section). Model checkpoint (.ckpt file format) will be stored in the corresponding experiment folder. Vocabulary file (vocab.pt) will be stored in the respective notation folder.

  • predict.py Predicts the test set with a trained model using the selected configuration (see dedicated section). The output consists of encoded predicted sequences stored in the respective experiment folder, with filenames containing the substring tokens.

  • convert_prediction_strict.py Converts the encoded predicted sequences of a model specified by its parameters. Decoded sequences with an erroneous chirality label assigned to atoms are counted as invalid.

  • convert_prediction_strict_from_path.py Same as above, but only requires the path to the experiment folder.

  • convert_prediction.py and convert_prediction_from_path.py: Similar to the strict versions, but sequences with an erroneous chirality label assigned to atoms are not counted as invalid.

  • fragment_dataset.py Fragments the SMILES of the data to obtain their scaffold, cycles, and acyclic chains. Used only on the test set, as demonstrated in 05_struggle.ipynb.
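The sequence-length bookkeeping mentioned for convert_dataset.py can be sketched as follows. Tokenization by whitespace and the CSV layout are assumptions made for illustration, not the project's actual implementation:

```python
import csv
import io

# Toy pre-tokenized sequences; whitespace tokenization is an assumption.
sequences = ["C C O", "c 1 c c c c c 1"]

# Write one row per sequence with its token count, as a CSV.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sequence", "length"])
for seq in sequences:
    writer.writerow([seq, len(seq.split())])

print(buffer.getvalue())
```

Such a file makes it easy to inspect length distributions per notation before choosing model and batch sizes.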

📓 Notebooks

The Jupyter notebooks in notebooks/ provide an interactive way to explore datasets and experiment outputs.

NOTE: The IPython package is required to run the notebooks.

  1. data_analysis.ipynb Can be explored before running the experiments. It reports sequence lengths and dataset sizes per split.
  2. bestof_selection.ipynb Visualizes and compares loss curves for different hyperparameter settings to identify optimal configurations.
  3. accuracy.ipynb Computes performance metrics for the best models and outputs tables ready for publication.
  4. similarity.ipynb Analyzes similarity distributions between incorrect but valid predictions and their target molecules (forward task only).
  5. struggle.ipynb Investigates failure cases in prediction, including reasons for invalid samples and substructure matching in erroneous predictions.

💬 Citation

(No published paper)

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
