This repository accompanies our ACL 2025 Findings paper: "Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval"
✨ We provide a reproducible benchmark suite for Amharic information retrieval, including:
- BM25 sparse baseline
- Dense embedding models (RoBERTa / BERT variants fine-tuned for Amharic)
- ColBERT-AM (late interaction retriever)
- Pretrained Amharic retrieval models: RoBERTa-Base-Amharic-Embd, RoBERTa-Medium-Amharic-Embd, BERT-Medium-Amharic-Embd, and ColBERT-AM
- Hugging Face model & dataset links for easy access
- Training, evaluation, and inference scripts for reproducibility
- Benchmarks for BM25 (sparse retrieval), bi-encoder dense retrieval, and ColBERT (late interaction retrieval) on Amharic
- MS MARCO-style dataset conversion script & direct dataset links
amharic-ir-benchmarks/
├── baselines/ # BM25, ColBERT, and dense Amharic retrievers
│ ├── bm25_retriever/
│ ├── ColBERT_AM/
│ ├── colbert-amharic-pylate/
│ └── embedding_models/
├── data/ # Scripts to download, preprocess, and prepare datasets
├── scripts/ # Shell scripts for training, indexing, evaluation
├── utils/ # Utility functions
├── amharic_environment.yml # Conda environment
├── requirements.txt
└── README.md
conda env create -f amharic_environment.yml
conda activate amharic_ir
Or using pip:
pip install -r requirements.txt
We use two publicly available Amharic datasets:
Dataset | Description | Link |
---|---|---|
2AIRTC | Ad-hoc IR test collection | 2AIRTC Website |
Amharic News | Headline–body classification corpus | Hugging Face |
Scripts for downloading and preprocessing can be found in the data/ folder.
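The feature list above mentions an MS MARCO-style conversion script; the sketch below illustrates the kind of transformation involved, turning headline–body pairs from the Amharic News corpus into query/passage TSV files. The Hugging Face dataset ID and column names here are placeholders, not the repository's actual choices; see the data/ scripts for the exact ones.

```python
from datasets import load_dataset  # pip install datasets
import csv

# Placeholder dataset ID and column names; check data/ for the real ones.
news = load_dataset("your-org/amharic-news", split="train")

# MS MARCO-style layout: headlines act as queries, article bodies as passages,
# and each (query, passage) pair sharing a row index is a positive example.
with open("collection.tsv", "w", newline="", encoding="utf-8") as col_f, \
     open("queries.tsv", "w", newline="", encoding="utf-8") as qry_f:
    col_writer = csv.writer(col_f, delimiter="\t")
    qry_writer = csv.writer(qry_f, delimiter="\t")
    for pid, row in enumerate(news):
        col_writer.writerow([pid, row["article"].replace("\n", " ")])
        qry_writer.writerow([pid, row["headline"].replace("\n", " ")])
```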
Our Amharic text embedding and ColBERT models are available in our Hugging Face Collection.
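As a quick-start sketch (not the repository's official inference script), the bi-encoder models can be loaded with sentence-transformers once you know the exact model ID from the collection; the ID below is a placeholder.

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder model ID; replace with the actual ID from the Hugging Face collection.
model = SentenceTransformer("your-org/RoBERTa-Base-Amharic-Embd")

queries = ["የኢትዮጵያ ኢኮኖሚ ዕድገት"]                     # example Amharic query
passages = ["የኢትዮጵያ ኢኮኖሚ በዚህ ዓመት አድጓል።", "ሌላ ጽሑፍ።"]  # example passages

q_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
p_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity ranks passages for each query.
scores = util.cos_sim(q_emb, p_emb)
print(scores)
```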
# Train ColBERT-AM
bash scripts/train_colbert.sh

# Run the BM25 baseline notebook
jupyter nbconvert --to notebook --execute baselines/bm25_retriever/run_bm25.ipynb

# Index the collection and run ColBERT retrieval
bash scripts/index_colbert.sh
bash scripts/retrieve_colbert.sh
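For reference, the sketch below shows the core of a BM25 retrieval step using the rank_bm25 package with plain whitespace tokenization. This is only an illustration; the actual notebook may use a different BM25 implementation and an Amharic-specific normalizer/tokenizer.

```python
from rank_bm25 import BM25Okapi  # pip install rank_bm25

# Toy Amharic passage collection; whitespace tokenization is a simplification,
# real Amharic pipelines typically normalize characters and strip punctuation.
passages = [
    "የአዲስ አበባ ከተማ አስተዳደር አዲስ መመሪያ አወጣ",
    "የኢትዮጵያ ብሔራዊ ቡድን ጨዋታውን አሸነፈ",
]
tokenized_passages = [p.split() for p in passages]

bm25 = BM25Okapi(tokenized_passages)

query = "የኢትዮጵያ ብሔራዊ ቡድን"
scores = bm25.get_scores(query.split())  # one BM25 score per passage
ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
print([(passages[i], float(scores[i])) for i in ranking])
```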
This table presents the performance of Amharic-optimized vs. multilingual dense retrieval models on the Amharic News dataset, using a bi-encoder architecture. We report MRR@10, NDCG@10, and Recall@10/50/100. Best scores are in bold, and ↑ indicates statistically significant improvements (p < 0.05) over the strongest multilingual baseline.
Model | Params | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
---|---|---|---|---|---|---|
Multilingual models | ||||||
gte-modernbert-base | 149M | 0.019 | 0.023 | 0.033 | 0.051 | 0.067 |
gte-multilingual-base | 305M | 0.600 | 0.638 | 0.760 | 0.851 | 0.882 |
multilingual-e5-large-instruct | 560M | 0.672 | 0.709 | 0.825 | 0.911 | 0.931 |
snowflake-arctic-embed-l-v2.0 | 568M | 0.659 | 0.701 | 0.831 | 0.922 | 0.942 |
Ours (Amharic-optimized models) | ||||||
BERT-Medium-Amharic-embed | 40M | 0.682 | 0.720 | 0.843 | 0.931 | 0.954 |
RoBERTa-Medium-Amharic-embed | 42M | 0.735 | 0.771 | 0.884 | 0.955 | 0.971 |
RoBERTa-Base-Amharic-embed | 110M | **0.775**↑ | **0.808**↑ | **0.913**↑ | **0.964**↑ | **0.979**↑ |
📖 For further details on the baselines, see: Wang et al., 2024 (Multilingual-E5) and Yu et al., 2024 (Snowflake Arctic Embed).
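To make the reported metrics concrete, here is a small, self-contained sketch (not the repository's evaluation code) of how MRR@10 and Recall@k can be computed from a ranked result list and a set of relevance judgments.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant passages retrieved within the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Toy example: one query whose single relevant passage is ranked 3rd.
ranked = ["p7", "p2", "p5", "p9"]
relevant = {"p5"}
print(mrr_at_k(ranked, relevant, k=10))     # 0.333...
print(recall_at_k(ranked, relevant, k=10))  # 1.0
```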
The table below compares sparse and dense retrieval approaches on the Amharic passage retrieval dataset. While BM25 relies on traditional lexical matching, the RoBERTa and ColBERT models leverage semantic embeddings optimized for Amharic. All models were trained and evaluated on the same data splits.
ColBERT-RoBERTa-Base-Amharic, which combines late interaction with a RoBERTa backbone, delivers the highest retrieval quality across most metrics. Statistically significant gains are marked with ↑ (p < 0.05).
Type | Model | MRR@10 | NDCG@10 | Recall@10 | Recall@50 | Recall@100 |
---|---|---|---|---|---|---|
Sparse retrieval | BM25-AM | 0.657 | 0.682 | 0.774 | 0.847 | 0.871 |
Dense retrieval | RoBERTa-Base-Amharic-embed | 0.755 | 0.808 | 0.913 | 0.964 | 0.979 |
Dense retrieval | ColBERT-RoBERTa-Base-Amharic | 0.843↑ | 0.866↑ | 0.939↑ | 0.973↑ | 0.979 |
📌 Note
- ColBERT-RoBERTa-Base-Amharic significantly outperforms RoBERTa-Base-Amharic-embed on all ranking metrics except Recall@100, where the two models converge. Significance was assessed with a paired t-test.
- For additional experiments on the 2AIRTC dataset, refer to the Appendix section of our ACL 2025 Findings paper.
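The ColBERT results above come from late interaction scoring. As a minimal illustration (independent of the actual ColBERT-AM implementation), the MaxSim operator takes, for each query token, its best-matching document-token similarity, and sums these maxima over the query tokens.

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Late interaction (MaxSim): max over document tokens, summed over query tokens."""
    # query_embs: [num_query_tokens, dim], doc_embs: [num_doc_tokens, dim]
    sim = query_embs @ doc_embs.T            # token-level similarity matrix
    return sim.max(dim=1).values.sum()       # best doc token per query token, then sum

# Toy example with L2-normalized random token embeddings.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(30, 128), dim=-1)
print(float(maxsim_score(q, d)))
```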
If you use this repository, please cite our ACL 2025 Findings paper:
@inproceedings{mekonnen2025amharic,
title={Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval},
  author={Kidist Amde Mekonnen and Yosef Worku Alemneh and Maarten de Rijke},
booktitle={Findings of ACL},
year={2025}
}
Please open an issue for questions, feedback, or suggestions.
This project is licensed under the Apache 2.0 License.
This project builds on the ColBERT repository by Stanford FutureData Lab. We sincerely thank the authors for open-sourcing their work, which served as a strong foundation for our Amharic ColBERT implementation and experiments.