KUMO is a novel benchmark for systematically evaluating complex reasoning capabilities in Large Language Models (LLMs) through procedurally generated reasoning games. This repository contains the official implementation of our research paper.
Each procedurally generated game in KUMO is structured around:
- 🔍 Truth Set ($T$): Possible truths.
- 🎯 Action Set ($A$): Available actions.
- 🌟 Outcomes ($\mathcal{O}$): Action-based outcomes.
- 📚 Knowledge Book ($K$): Detailed guidelines linking truths, actions, and outcomes.
- A valid truth ($t^*$) is secretly chosen.
- Players take actions and observe outcomes.
- Deduce the truth efficiently using logic and reasoning (see the sketch below).
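To make the setup concrete, here is a minimal, purely illustrative Python sketch of such a game. The truths, actions, and outcome table are hypothetical and this is not the repository's actual API:

```python
# Illustrative only: a toy KUMO-style game with truths T, actions A, an outcome
# table O, and a player that eliminates candidate truths until only t* remains.
import random

truths = ["flu", "cold", "allergy", "migraine"]                 # T
actions = ["temperature_test", "pollen_test", "headache_scan"]  # A
# O: the outcome of each action under each truth (the Knowledge Book K states these rules in prose).
outcomes = {
    ("temperature_test", "flu"): "fever",        ("temperature_test", "cold"): "no fever",
    ("temperature_test", "allergy"): "no fever", ("temperature_test", "migraine"): "no fever",
    ("pollen_test", "flu"): "negative",          ("pollen_test", "cold"): "negative",
    ("pollen_test", "allergy"): "positive",      ("pollen_test", "migraine"): "negative",
    ("headache_scan", "flu"): "mild",            ("headache_scan", "cold"): "mild",
    ("headache_scan", "allergy"): "mild",        ("headache_scan", "migraine"): "severe",
}

secret_truth = random.choice(truths)  # t*, hidden from the player
candidates = set(truths)

for action in actions:                           # the player takes actions...
    observed = outcomes[(action, secret_truth)]  # ...and observes outcomes
    candidates = {t for t in candidates if outcomes[(action, t)] == observed}
    print(f"{action}: {observed} -> remaining candidates {sorted(candidates)}")
    if len(candidates) == 1:
        break

print("deduced:", candidates.pop(), "| actual:", secret_truth)
```

A strong player would choose the action that best splits the remaining candidates rather than iterating in a fixed order, which is where the "efficiently" in the gameplay description comes in.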
🧑‍⚕️ Example Scenario: Diagnosing diseases using medical tests.
📌 Provided Domains:
- 100 autogenerated exemplar domains
- Categories: Computer Science, Biology, Art, and more
- Typical domain: ~50 truths, ~30 actions
The KUMO dataset is provided in JSON format, simplifying integration and customization. The data is available under `kumo/env/`:
```
kumo/
└── env/
    ├── data/
    │   └── [DomainName]_data.py
    ├── [DomainName]/
    │   ├── knowledge_book/
    │   │   └── truth_num=4+action_num=6+valid_truth_num=1/
    │   │       ├── seed=0.txt
    │   │       └── ...
    │   └── truth_num=4+action_num=6+valid_truth_num=1.jsonl
    └── [DomainName].py
```
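As a hedged sketch, assuming the layout above and the `MedicalEnv` domain used later in this README (the JSONL schema itself is left unspecified here), one task setting and its knowledge book can be loaded like this:

```python
# A hedged sketch of loading one generated setting and its knowledge book.
# Paths follow the layout above; the JSONL field names are not documented in this
# section, so the snippet only prints each record's keys rather than assuming a schema.
import json
from pathlib import Path

domain = "MedicalEnv"  # any generated domain name
setting = "truth_num=4+action_num=6+valid_truth_num=1"
env_dir = Path("kumo/env") / domain

# Each line of the .jsonl file is one task configuration for this setting.
with open(env_dir / f"{setting}.jsonl") as f:
    tasks = [json.loads(line) for line in f]
print(f"{len(tasks)} tasks; fields of the first record: {sorted(tasks[0])}")

# The matching knowledge book is plain text, one file per seed.
knowledge_book = (env_dir / "knowledge_book" / setting / "seed=0.txt").read_text()
print(knowledge_book[:200])
```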
⚙️ Customize parameters (`truth_num`, `action_num`, etc.) easily for tailored benchmarking.
```bash
git clone https://github.com/linhaowei1/kumo.git
cd kumo
```
Recommended: a Conda environment with Python 3.10 to 3.12:
```bash
conda create -n kumo python=3.12
conda activate kumo
pip install -r requirements.txt
```
Hardware requirements: none. A CPU is sufficient for inference.
We recommend calling LLMs through the OpenAI API. Edit `examples/main.sh` to add your own API key and model name; results are written to `results/`.
Expected runtime: depends on API latency (GPT-4o takes roughly 3 hours to run all 100 easy-setting tasks with this script).
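As a hedged illustration of that setup (not the prompting or game loop used by `examples/main.sh`), the snippet below only shows the bare OpenAI client call pattern with a generated knowledge book as context; the model name, path, and prompt are placeholders:

```python
# A minimal sketch of the recommended OpenAI-API setup; prompts, paths, and model
# name are placeholders, not the repository's actual evaluation pipeline.
from openai import OpenAI

# For a self-hosted OpenAI-compatible server, pass base_url="http://localhost:8001/v1" instead.
client = OpenAI(api_key="YOUR_API_KEY")

knowledge_book = open(
    "kumo/env/MedicalEnv/knowledge_book/truth_num=4+action_num=6+valid_truth_num=1/seed=0.txt"
).read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are playing a deduction game. Use the knowledge book to pick actions and identify the hidden truth."},
        {"role": "user", "content": knowledge_book + "\n\nWhich action do you take first, and why?"},
    ],
)
print(response.choices[0].message.content)
```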
Create customized domains and scenarios easily:
Generate scenarios via LLM:
```bash
python generate/config_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_path ./templates/config_generation.jsonl
```
Generate specific tasks:
```bash
python SAT_sampling.py \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --data_num 50 \
    --domain MedicalEnv
```
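For intuition only: a SAT-based sampler has to satisfy cardinality constraints such as "exactly `valid_truth_num` of the candidate truths are valid". The sketch below encodes just that constraint for the `valid_truth_num=1` case with `python-sat` (an assumed extra dependency); it is not the encoding used by `SAT_sampling.py`:

```python
# Illustrative only: encode "exactly one of truth_num candidate truths is valid" as CNF
# and let a SAT solver pick an assignment. Requires `pip install python-sat`; the real
# SAT_sampling.py encoding and constraints may differ.
from itertools import combinations
from pysat.solvers import Glucose3

truth_num = 4
truth_vars = list(range(1, truth_num + 1))  # variable i <=> "truth i is valid"

solver = Glucose3()
solver.add_clause(truth_vars)              # at least one valid truth
for i, j in combinations(truth_vars, 2):   # at most one valid truth (pairwise)
    solver.add_clause([-i, -j])

if solver.solve():
    model = solver.get_model()
    print("sampled valid truth variable:", [v for v in truth_vars if v in model])
solver.delete()
```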
Automatically build detailed knowledge books:
```bash
python knowledge_book_generation.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --data_num 50 \
    --truth_num 4 \
    --action_num 6 \
    --valid_truth_num 1 \
    --domain MedicalEnv
```
Improve generated knowledge books:
```bash
python generate/knowledge_book_revision.py \
    --load_type OPENAI \
    --api_base http://localhost:8001/v1 \
    --api_key EMPTY \
    --domain MedicalEnv \
    --revision_template_path ./templates/revision_template.md
```
💬 Support & Questions
For support, feedback, or inquiries, please:
- Open an issue on GitHub
- Contact the repository maintainers directly