Cecilia is a family of language models continually pretrained specifically on Cuban written text, capturing the linguistic, cultural, and social nuances of Cuban Spanish. These models are designed to support natural language processing tasks with a focus on Cuban language varieties and cultural context.
This repository contains Cecilia 2B v0.1, a 2-billion-parameter model continually pretrained from Salamandra 2B on Cuban written text.
The model was developed by the Artificial Intelligence Research Group (GIA-UH) at the University of Havana, in collaboration with the Language Processing and Information Systems Group (GPLSI) at the University of Alicante, and with the support of Syalia SRL and Epistemial.
Cecilia Tiny was continually pretrained for two full epochs on a private corpus of approximately 600 million tokens of Cuban written text, including:
- Ten years of articles from the most relevant Cuban newspapers.
- The Cuban Encyclopedia (ecured.cu).
- The complete collection of Cuban laws.
- Over 400 important Cuban literary works.
- Several local encyclopedias documenting Cubanisms and cultural elements.
- Hundreds of song lyrics from popular Cuban singers.
This diverse dataset ensures that Cecilia captures a rich spectrum of Cuban language, culture, and history.
- Based on the Salamandra 2B architecture.
- Adapted through continual pretraining on Cuban text (not instruction-tuned in this release).
- Optimized for Cuban Spanish linguistic features and cultural context.
Cecilia can be used for various NLP tasks involving Cuban Spanish, such as:
- Text generation and completion.
- Sentiment analysis on Cuban social media or literature.
- Named entity recognition with Cuban-specific entities.
- Machine translation and language understanding tailored to Cuban Spanish.
- Research on Cuban linguistic phenomena and cultural studies.
You can load and use Cecilia Tiny (2B) v0.1 with the Hugging Face Transformers library. Here is a simple example in Python:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "gia-uh/cecilia-2b-v0.1"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# Example usage
input_text = "¿Cómo están las guaguas en La Habana?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
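Alternatively, the same checkpoint can be driven through the Transformers `pipeline` helper. The sketch below is a minimal illustration; the prompt and the sampling settings (`max_new_tokens`, `do_sample`, `temperature`) are arbitrary examples and should be tuned for your use case.

```python
from transformers import pipeline

# Build a text-generation pipeline backed by the Cecilia checkpoint
generator = pipeline("text-generation", model="gia-uh/cecilia-2b-v0.1")

# Illustrative prompt and sampling settings; adjust as needed
result = generator(
    "La música cubana se caracteriza por",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
```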
- vLLM: Cecilia Tiny can be used with vLLM for efficient inference and serving (see the sketch after this list).
- LM Studio: The model is compatible with LM Studio, enabling easy local deployment and experimentation.
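As a minimal sketch of how this might look with vLLM's offline Python API (assuming a working vLLM installation that supports this model's architecture; the sampling parameters are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the model for efficient batched inference
llm = LLM(model="gia-uh/cecilia-2b-v0.1")

# Illustrative sampling parameters; adjust as needed
params = SamplingParams(temperature=0.7, max_tokens=50)

outputs = llm.generate(["¿Cómo están las guaguas en La Habana?"], params)
print(outputs[0].outputs[0].text)
```

For serving, recent vLLM versions also expose an OpenAI-compatible server (e.g. `vllm serve gia-uh/cecilia-2b-v0.1`); the exact CLI depends on your vLLM version.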
- The model is currently not quantized. Quantized versions will be released shortly to improve efficiency and reduce resource requirements; an interim on-the-fly quantization sketch is shown after this list.
- Cecilia Tiny is adapted via continual pretraining on Cuban text but is not yet instruction-tuned. It is optimized for language modeling rather than instruction-following tasks. Instruction-tuned versions will be released soon.
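Until the official quantized checkpoints are published, one possible interim approach is on-the-fly 4-bit loading through Transformers with bitsandbytes. This is a sketch under the assumption that `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available; quality and memory trade-offs have not been validated for this model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "gia-uh/cecilia-2b-v0.1"

# 4-bit quantization config (requires the bitsandbytes package and a CUDA GPU)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```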
Cecilia Tiny (2B) v0.1 is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. This allows both research and commercial use, provided that appropriate credit is given and any derivative works are shared under the same license.
Important: Access to download the model requires a manual review of requests to ensure fair and responsible use aligned with the spirit of the license and the cultural sensitivity of the data. Please submit your request with a brief description of your intended use.
Cecilia is a language model adapted to Cuban written text through continual pretraining, but it is important to recognize its limitations and use it responsibly:
- Potential for Errors and Hallucinations: Like all large language models, Cecilia can generate incorrect, misleading, or biased information. It may "hallucinate" facts or produce outputs that do not perfectly reflect reality or the nuances of Cuban culture.
- Not a Substitute for Professional Advice: Cecilia should not be used for medical, legal, financial, or other professional advice. Outputs should be carefully reviewed by qualified experts before any decision-making.
- Bias and Fairness: Despite efforts to curate training data, the model may still reflect biases present in source texts. Users should be aware of potential cultural, social, or linguistic biases and interpret results accordingly.
- Privacy and Data Use: The model was trained on publicly available and licensed Cuban texts. Users should respect privacy and copyright laws when applying the model.
- Responsible Use: We encourage users to apply Cecilia in ways that respect Cuban culture and society, avoid harm, and promote fairness and inclusivity.
- Transparency: Users should clearly communicate when content is generated by Cecilia to avoid confusion or misattribution.
By using Cecilia, you agree to apply it ethically and responsibly, understanding its limitations and the cultural sensitivity embedded in its design.
If you use Cecilia 2B v0.1 - The Cuban Language Model in your research, please cite it as:
Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Alejandro Piad-Morffis, and Yudivian Almeida-Cruz. (2025). cecilia-2b-v0.1 (Revision 1921f36). Hugging Face. https://huggingface.co/gia-uh/cecilia-2b-v0.1. DOI: 10.57967/hf/5667
If using LaTeX, please use the following BibTeX entry:
@misc{cecilia2b,
author = { Ernesto L. Estevanell-Valladares and Suilan Estevez-Velarde and Alejandro Piad-Morffis and Yudivian Almeida-Cruz },
title = { Cecilia 2B v0.1 - The Cuban Language Model },
year = 2025,
url = { https://huggingface.co/gia-uh/cecilia-2b-v0.1 },
doi = { 10.57967/hf/5667 },
publisher = { Hugging Face }
}
The model could not have been created without the commitment and work of the members of the GIA-UH and GPLSI groups.
GIA-UH - Ernesto L. Estevanell, Daniel A. Valdés, Roberto Marti, Deborah Famadas, Roberto García, Gabriel Hernández, Elena Rodríguez, Niley González, Alejandro Beltrán, Juan Pablo Consuegra, Suilan Estévez, Alejandro Piad, Yudivián Almeida.
GPLSI - Robiert Sepúlveda, Yoan Gutiérrez, Rafael Muñoz, Andrés Montoyo, Manuel Palomar.
We thank all contributors and data providers who made this work possible.
This work was partially funded by the ILENIA-VIVES project (2022/TL22/00215334) and by private funding from Syalia SRL and Epistemial.