
🌩️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models


KUMO is a novel benchmark for systematically evaluating complex reasoning capabilities in Large Language Models (LLMs) through procedurally generated reasoning games. This repository contains the official implementation of our research paper.


🚀 Quick Links

  • arXiv Paper
  • HuggingFace Datasets
  • Open In Colab
  • Project Homepage


📂 Benchmark Dataset

The KUMO benchmark introduces procedurally generated reasoning games structured around:

  • 🔍 Truth Set ($T$): the set of candidate truths.
  • 🎯 Action Set ($A$): the actions available to the player.
  • 🌟 Outcomes ($\mathcal{O}$): the outcomes that each action can produce.
  • 📚 Knowledge Book ($K$): detailed guidelines linking truths, actions, and outcomes.

Gameplay Mechanics:

  • A valid truth ($t^*$) is secretly chosen.
  • The player takes actions and observes their outcomes.
  • The goal is to deduce $t^*$ using as few actions as possible (see the sketch below).

🧑‍⚕️ Example Scenario: Diagnosing diseases using medical tests.
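
To make the interaction concrete, here is a minimal, self-contained sketch of such a game in Python. All names (truths, tests, outcomes) are illustrative only; the actual environments live under kumo/env.

import random

# Toy truth set T and action set A for a diagnosis game. Each action maps
# every truth to the outcome it would produce (values are illustrative).
TRUTHS = ["Flu", "Cold", "Allergy", "Strep"]
ACTIONS = {
    "ThroatSwab":   {"Flu": "neg", "Cold": "neg", "Allergy": "neg", "Strep": "pos"},
    "AllergyPanel": {"Flu": "neg", "Cold": "neg", "Allergy": "pos", "Strep": "neg"},
    "RapidFluTest": {"Flu": "pos", "Cold": "neg", "Allergy": "neg", "Strep": "neg"},
}

hidden_truth = random.choice(TRUTHS)      # a valid truth t* is secretly chosen
candidates = set(TRUTHS)
for action, outcome_of in ACTIONS.items():
    outcome = outcome_of[hidden_truth]    # take an action, observe its outcome
    # Keep only the truths consistent with everything observed so far.
    candidates = {t for t in candidates if outcome_of[t] == outcome}
    print(f"{action}: {outcome} -> candidates {sorted(candidates)}")
print("Deduced:", sorted(candidates), "| actual:", hidden_truth)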

📌 Provided Domains:

  • 100 autogenerated exemplar domains
  • Categories: Computer Science, Biology, Art, and more
  • Typical domain: ~50 truths, ~30 actions

[Figure: KUMO example game]


📑 Benchmark Format

The KUMO dataset is provided in JSON format, making it easy to integrate and customize. The data lives under kumo/env:

kumo/
└── env/
    ├── data/
    │   └── [DomainName]_data.py
    ├── [DomainName]/
    │   ├── knowledge_book/
    │   │   └── truth_num=4+action_num=6+valid_truth_num=1/
    │   │       ├── seed=0.txt
    │   │       └── ...
    │   └── truth_num=4+action_num=6+valid_truth_num=1.jsonl
    └── [DomainName].py

⚙️ Customize parameters (truth_num, action_num, etc.) to tailor the benchmark to your needs.
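
For instance, loading one parameter setting of a domain takes a few lines of Python. The domain name below is illustrative, and the exact record schema should be checked against the files themselves:

import json
from pathlib import Path

env_dir = Path("kumo/env/MedicalEnv")     # substitute any [DomainName] here
setting = "truth_num=4+action_num=6+valid_truth_num=1"

# Each line of the .jsonl file is one task instance.
tasks = [json.loads(line)
         for line in (env_dir / f"{setting}.jsonl").read_text().splitlines()
         if line.strip()]
print(f"loaded {len(tasks)} task instances")

# Each instance pairs with a generated knowledge book.
print((env_dir / "knowledge_book" / setting / "seed=0.txt").read_text()[:300])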


⚙️ Environment Setup

🔽 Clone the Repository

git clone https://github.com/linhaowei1/kumo.git
cd kumo

📦 Install Dependencies

Recommended: Conda with Python (3.10 to 3.12):

conda create -n kumo python=3.12
conda activate kumo
pip install -r requirements.txt

Hardware requirements: none beyond a CPU, since models are called through an API; no local GPU is needed for inference.


📈 Evaluation

We recommend calling LLMs through the OpenAI API. Edit examples/main.sh to add your own API key and model name; results are written to results/.

Expected runtime: depends on API latency and throughput (GPT-4o takes about 3 hours to run all 100 easy-setting tasks with this script).
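
Conceptually, each evaluation turn boils down to a chat-completion call. The standalone sketch below shows the shape of such a call with the official openai Python client; the model name, file path, and prompt are placeholders, and the actual prompting logic lives in the evaluation scripts:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder knowledge book; any generated seed=*.txt file works here.
knowledge_book = open(
    "kumo/env/MedicalEnv/knowledge_book/"
    "truth_num=4+action_num=6+valid_truth_num=1/seed=0.txt"
).read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are playing a deduction game."},
        {"role": "user", "content": knowledge_book + "\n\nChoose your first action."},
    ],
)
print(response.choices[0].message.content)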


🛠️ Dataset Generation

Create customized domains and scenarios in four steps:

1️⃣ Seed Configuration

Generate scenarios via LLM:

python generate/config_generation.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --data_path ./templates/config_generation.jsonl

🔗 Detailed Instructions
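
The --data_path file is JSONL with one generation request per line. The snippet below is a purely hypothetical illustration of the format; the authoritative schema is templates/config_generation.jsonl itself:

import json

# Hypothetical field names; the real template may use different keys.
request = {
    "domain": "MedicalEnv",
    "category": "Biology",
    "instruction": "Generate truths and actions for a medical-diagnosis domain.",
}
print(json.dumps(request))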

2️⃣ Task Instances via SAT Sampling

Generate specific tasks:

python SAT_sampling.py \
  --truth_num 4 \
  --action_num 6 \
  --valid_truth_num 1 \
  --data_num 50 \
  --domain MedicalEnv

🔗 Example Script
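
As the name suggests, SAT_sampling.py treats task construction as a constraint-satisfaction problem. The rejection-sampling sketch below is only a conceptual stand-in that enforces the same invariant, i.e. that exactly valid_truth_num truths remain consistent with the sampled rules; all structure here is illustrative:

import random

def sample_task(truth_num=4, action_num=6, valid_truth_num=1, seed=0):
    """Redraw random rule sets until exactly `valid_truth_num` truths survive."""
    rng = random.Random(seed)
    truths = [f"truth_{i}" for i in range(truth_num)]
    while True:
        # Each action's rule eliminates a random nonempty subset of truths.
        rules = [set(rng.sample(truths, rng.randint(1, truth_num - 1)))
                 for _ in range(action_num)]
        valid = [t for t in truths if all(t not in rule for rule in rules)]
        if len(valid) == valid_truth_num:
            return truths, rules, valid

truths, rules, valid = sample_task()
print("valid truths:", valid)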

3️⃣ Knowledge Book Generation

Automatically build detailed knowledge bases:

python knowledge_book_generation.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --data_num 50 \
  --truth_num 4 \
  --action_num 6 \
  --valid_truth_num 1 \
  --domain MedicalEnv

🔗 Example

4️⃣ Optional Knowledge Book Refinement

Improve generated knowledge books:

python generate/knowledge_book_revision.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --domain MedicalEnv \
  --revision_template_path ./templates/revision_template.md

🔗 Revision Details


💬 Support & Questions

For support, feedback, or inquiries, please:

  • Open an issue on GitHub
  • Contact the repository maintainers directly
