unstable‑baselines is an experimental, asynchronous, online reinforcement‑learning framework for rapid prototyping of multi‑turn / multi‑agent algorithms on TextArena environments.
The main focus of unstable‑baselines is to enable fast prototyping and research. For something a bit more production-ready, we recommend oat or verifiers.
Work in progress — interfaces will change.
- Asynchronous collection & learning – actors generate data while learners train.
- Multi‑agent, multi‑turn focus with self‑play or fixed opponents.
- LoRA‑first fine‑tuning workflow for fast, lightweight updates.
- Composable reward transforms at step, final, and sampling stages.
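To make the first bullet concrete, here is a minimal sketch of the asynchronous actor/learner pattern using only the Python standard library. It is not the unstable‑baselines API: the names `actor`, `learner`, and `run`, the trajectory dictionary, and the random rewards are all hypothetical stand-ins, and the "update" is just a counter where a real learner would take a gradient step.

```python
import queue
import random
import threading


def actor(buffer, n_episodes, seed=0):
    """Hypothetical actor: produces fake trajectories and pushes them to the buffer."""
    rng = random.Random(seed)
    for episode in range(n_episodes):
        # Stand-in for playing one multi-turn episode in an environment.
        trajectory = {"episode": episode, "reward": rng.choice([-1.0, 0.0, 1.0])}
        buffer.put(trajectory)  # blocks only if the buffer is full
    buffer.put(None)  # sentinel: tells the learner that collection is finished


def learner(buffer, batch_size):
    """Hypothetical learner: consumes trajectories while the actor is still running."""
    batch, updates, seen = [], 0, 0
    while True:
        item = buffer.get()  # blocks until the actor has produced something
        if item is None:
            break
        batch.append(item)
        seen += 1
        if len(batch) == batch_size:
            updates += 1  # stand-in for one training step on the batch
            batch.clear()
    return updates, seen


def run(n_episodes=8, batch_size=4):
    """Run one actor thread and one learner concurrently; return (updates, seen)."""
    buffer = queue.Queue(maxsize=16)
    actor_thread = threading.Thread(target=actor, args=(buffer, n_episodes))
    actor_thread.start()
    updates, seen = learner(buffer, batch_size)  # learner runs in the main thread
    actor_thread.join()
    return updates, seen


if __name__ == "__main__":
    print(run())
```

The bounded queue is the only coupling between the two sides: the actor never waits for a training step to finish, and the learner starts training as soon as a full batch has arrived.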
Developed in partnership with PlasticLabs.
# clone the repo
git clone https://github.com/LeonGuertler/unstable-baselines.git
cd unstable-baselines
# install Python dependencies
pip install -r requirements.txt
# build TextArena v0.6.9 (until it’s on PyPI)
git clone https://github.com/LeonGuertler/TextArena.git
cd TextArena
git checkout v0.6.9
pip install -e .
cd ..
# run the example script from the repo root
python3 example.py