The goal of this repository is to provide a simple, modular, and extendable toolkit for IndicTrans2 and be compatible with the HuggingFace models released. Please refer to the CHANGELOG.md
for latest developments.
Python 3.10+
- Other requirements as listed in
requirements.txt
- Editable installation (Note, this may take a while):
git clone https://github.com/VarunGumma/IndicTransToolkit
cd IndicTransToolkit
pip install --editable . --use-pep517 # required for pip >= 25.0
# in case it fails, try:
# pip install --editable . --use-pep517 --config-settings editable_mode=compat
- Common Installation Failures:
- Version incompatibilty: We highly recommend using the latest versions of
numpy>=2.1
,torch>=2.5
andtransformers>=4.51
for using this toolkit. We cannot guarantee the stability of the module below these requirements. We try our best to uphold backward compatibility, but prioritize any major releases of the dependencies.
- Version incompatibilty: We highly recommend using the latest versions of
For the training usecase, please refer here.
import torch
from IndicTransToolkit.processor import IndicProcessor # NOW IMPLEMENTED IN CYTHON !!
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
device = "cuda" is torch.cuda.is_available() else "cpu"
ip = IndicProcessor(inference=True)
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True).to(device)
sentences = [
"This is a test sentence.",
"This is another longer different test sentence.",
"Please send an SMS to 9876543210 and an email on [email protected] by 15th October, 2023.",
]
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva", visualize=False) # set it to visualize=True to print a progress bar
batch = tokenizer(batch, padding="longest", truncation=True, max_length=256, return_tensors="pt").to(device)
with torch.inference_mode():
outputs = model.generate(**batch, num_beams=5, num_return_sequences=1, max_length=256)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=True)
outputs = ip.postprocess_batch(outputs, lang="hin_Deva")
print(outputs)
>>> ['यह एक परीक्षण वाक्य है।', 'यह एक और लंबा अलग परीक्षण वाक्य है।', 'कृपया 9876543210 पर एक एस. एम. एस. भेजें और 15 अक्टूबर, 2023 तक [email protected] पर एक ईमेल भेजें।']
IndicEvaluator
is a python implementation of compute_metrics.sh.- We have found that this python implementation gives slightly lower scores than the original
compute_metrics.sh
. So, please use this function cautiously, and feel free to raise a PR if you have found the bug/fix.
from IndicTransToolkit import IndicEvaluator
# this method returns a dictionary with BLEU and ChrF2++ scores with appropriate signatures
evaluator = IndicEvaluator()
scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=pred_file, refs=ref_file)
# alternatively, you can pass the list of predictions and references instead of files
# scores = evaluator.evaluate(tgt_lang=tgt_lang, preds=preds, refs=refs)
- Varun Gumma ([email protected])
- Jay Gala ([email protected])
- Pranjal Agadh Chitale ([email protected])
- Raj Dabre ([email protected])
Since this a bleeding-edge module, you may encounter broken stuff and import issues once in a while. In case you encounter any bugs or want additional functionalities, please feel free to raise Issues
/Pull Requests
or contact the authors.
If you use our codebase, or models, please do cite the following paper:
@article{
gala2023indictrans,
title={IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
author={Jay Gala and Pranjal A Chitale and A K Raghavan and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar M and Janki Atul Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M Khapra and Raj Dabre and Anoop Kunchukuttan},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2023},
url={https://openreview.net/forum?id=vfT4YuzAYA},
note={}
}