fragSMILES4Reactions is a scientific project focused on the analysis and modeling of chemical reactions using the fragment-based SMILES (fragSMILES) representation alongside other notations such as SMILES, SELFIES, and SAFE. This repository contains all the code, data, and scripts needed to reproduce the experiments and results described in the associated research work.
- `rawdata_reactions/` – Raw reaction dataset already split into training, validation, and test sets.
- `data_reactions/` – Processed reaction data used as input for the experiments.
- `experiments_reactions/` – Outputs from model training and prediction. Folder names follow the convention `{key}={value}-{key}={value}-...`.
- `floats/` – Figures (in PDF format) and tables (in LaTeX format) generated during analysis.
- `notebooks/` – Jupyter notebooks for data exploration and post-processing of prediction results.
- `bestof_setup/` – CSV files reporting the best configuration found for each model.
- `scripts/` – Python scripts for preprocessing, training, prediction, and SMILES conversion tasks.
- `src/` – Main source code of the project.
- `extra/` – Contains an example reaction used to create the introductory figure/chart.
- `shell/` – Includes `run.sh`, a script to launch experiments using the best configurations for each model.
- `requirements.txt` – List of required Python dependencies for setting up the environment.
- `chemicalgof/`, `safe/`, and `datamol/` – External static repositories adopted for this project. The `SAFE` package has been modified to detect and report reasons for invalid sampled sequences; `chemicalgof/` is the latest version supporting the fragSMILES notation.
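Experiment folder names follow the `{key}={value}-{key}={value}-...` convention, which can be parsed back into a configuration mapping with a small helper. This is a sketch only; it assumes that neither keys nor values contain `-` or `=`:

```python
def parse_experiment_name(name: str) -> dict:
    """Split a '{key}={value}-{key}={value}-...' folder name into a dict."""
    config = {}
    for pair in name.split("-"):
        key, _, value = pair.partition("=")
        config[key] = value
    return config

print(parse_experiment_name("task=forward-notation=fragsmiles-lr=0.001"))
# {'task': 'forward', 'notation': 'fragsmiles', 'lr': '0.001'}
```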
The output of the experiments is already included in `experiments_reactions/`, except for the model checkpoints (`.ckpt` files), which will be made available soon. For this reason, the prediction phase (see the Scripts section) cannot be executed directly until trained models are obtained, but analyses of the predictions are already provided in this repository.
However, to reproduce our experiments:

1. Clone the repository:

   ```bash
   git clone https://github.com/molML/fragSMILES4reaction.git
   cd fragSMILES4Reactions
   ```

2. Set up the Python environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Run the experiments using the best configuration for each model via the shell script:

   ```bash
   bash shell/run.sh
   ```

   NOTE: Set the Python environment path to be activated on line 3 of `shell/run.sh`.

   ⚠️ These experiments were conducted using 4 GPUs in parallel. Running on fewer or lower-memory devices may result in out-of-memory errors.

4. Explore the Jupyter notebooks (see the Notebooks section) to analyze datasets and prediction results.
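`run.sh` launches experiments using the best configurations stored in `bestof_setup/`. As a minimal sketch of how such a CSV could be turned into a training command, assuming hypothetical column names that mirror the parameter table below (the real file layout may differ):

```python
import csv
import io

# Hypothetical contents of a bestof_setup CSV; the actual file names and
# column layout in bestof_setup/ are assumptions.
csv_text = """task,notation,model_dim,num_heads,num_layers,batch_size,lr,dropout
forward,fragsmiles,512,8,4,128,0.0005,0.3
"""

best = next(csv.DictReader(io.StringIO(csv_text)))
# Turn the row into the command-line flags a training script would expect.
args = " ".join(f"--{key} {value}" for key, value in best.items())
print(f"python scripts/train.py {args}")
```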
| Parameter | Description |
|---|---|
| `task` | Task to perform: either `forward` (i.e., synthesis) or `backward` (i.e., retrosynthesis). |
| `notation` | Molecular representation format used as input/output: `smiles`, `selfies`, `safe`, or `fragsmiles`. |
| `model_dim` | Dimensionality of the model's hidden layers (e.g., transformer embedding size). |
| `num_heads` | Number of attention heads in the multi-head attention mechanisms. |
| `num_layers` | Number of layers (e.g., encoder or decoder blocks) in the model architecture. |
| `batch_size` | Number of training samples processed simultaneously during one training step. |
| `lr` | Learning rate used by the optimizer to update model weights. |
| `dropout` | Dropout rate for regularization to prevent overfitting (only the value 0.3 was adopted in this work). |
These parameters are used as command-line arguments for the training and prediction Python scripts (see the Scripts section).
We recommend running scripts from the root directory. Example:

```bash
python scripts/script_file.py --argument1 value1 --argument2 value2
```
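The parameters above map naturally onto `argparse` options. The following is a minimal sketch of how such a command line could be parsed; the defaults and exact argument definitions here are assumptions, not the scripts' actual interface:

```python
import argparse

# Hypothetical argument definitions mirroring the parameter table;
# the repository's scripts may declare them differently.
parser = argparse.ArgumentParser()
parser.add_argument("--task", choices=["forward", "backward"])
parser.add_argument("--notation", choices=["smiles", "selfies", "safe", "fragsmiles"])
parser.add_argument("--model_dim", type=int, default=512)
parser.add_argument("--num_heads", type=int, default=8)
parser.add_argument("--num_layers", type=int, default=4)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--dropout", type=float, default=0.3)

ns = parser.parse_args(["--task", "forward", "--notation", "fragsmiles"])
print(ns.task, ns.notation, ns.model_dim)
```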
- `convert_dataset.py` – Prepares the dataset used for the experiments, starting from the raw data. Explore the arguments (`python scripts/convert_dataset.py --help`) to see what must be provided when the command is called; the most important are `notation`, `split`, and `ncpus` (for multiprocessing). When a notation-based dataset is produced, a CSV file is written to track sequence lengths.
- `train.py` – Trains a model using the selected configuration (see the dedicated section). The model checkpoint (`.ckpt` file) is stored in the corresponding experiment folder; the vocabulary file (`vocab.pt`) is stored in the respective notation folder.
- `predict.py` – Predicts the test set with a trained model using the selected configuration (see the dedicated section). The output consists of encoded predicted sequences stored in the respective experiment folder, with filenames containing the substring `tokens`.
- `convert_prediction_strict.py` – Converts the encoded predicted sequences of a model specified by its parameters. Decoded sequences with an erroneous chirality label assigned to an atom are counted as invalid.
- `convert_prediction_strict_from_path.py` – Same as above, but only requires the path to the experiment folder.
- `convert_prediction.py` and `convert_prediction_from_path.py` – Similar to the strict versions, but sequences with an erroneous chirality label assigned to an atom are not counted as invalid.
- `fragment_dataset.py` – Fragments the SMILES of the data to obtain their scaffolds, cycles, and acyclic chains. Used only on the test set, as demonstrated in `05_struggle.ipynb`.
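`convert_dataset.py` writes a CSV file tracking sequence lengths. A minimal sketch of producing such a file, assuming hypothetical column names and a toy whitespace tokenization (the project's real tokenization depends on the chosen notation):

```python
import csv
import io

# Toy whitespace-tokenized sequences; real tokenization is notation-specific.
sequences = ["C C O", "c 1 c c c c c 1"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sequence", "length"])  # assumed column names
for seq in sequences:
    writer.writerow([seq, len(seq.split())])

print(buffer.getvalue())
```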
The Jupyter notebooks in `notebooks/` provide an interactive way to explore datasets and experiment outputs.

NOTE: The IPython package is required to work with the notebooks.

- `data_analysis.ipynb` – Can be explored before running the experiments. It reports sequence lengths and dataset size per split.
- `bestof_selection.ipynb` – Visualizes and compares loss curves for different hyperparameter settings to identify optimal configurations.
- `accuracy.ipynb` – Computes performance metrics for the best models and outputs tables ready for publication.
- `similarity.ipynb` – Analyzes similarity distributions between incorrect but valid predictions and their target molecules (forward task only).
- `struggle.ipynb` – Investigates failure cases in prediction, including reasons for invalid samples and substructure matching in erroneous predictions.
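As an illustration of the kind of metric `accuracy.ipynb` reports, a top-1 exact-match accuracy between predicted and target sequences can be sketched as follows (the notebook's actual metrics may differ):

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that exactly match their target sequence."""
    assert len(predictions) == len(targets)
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)

# One of two toy predictions matches its target.
print(exact_match_accuracy(["CCO", "CCN"], ["CCO", "CCC"]))  # 0.5
```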
(No published paper)
This project is licensed under the MIT License. See the LICENSE file for details.