
🌩️ KUMO: Generative Evaluation of Complex Reasoning in Large Language Models


KUMO is a novel benchmark for systematically evaluating complex reasoning capabilities in Large Language Models (LLMs) through procedurally generated reasoning games. This repository contains the official implementation of our research paper.


🚀 Quick Links

  • arXiv Paper
  • HuggingFace Datasets
  • Open In Colab
  • Project Homepage


📂 Benchmark Dataset

The KUMO benchmark introduces procedurally generated reasoning games structured around:

  • 🔍 Truth Set ($T$): the set of candidate truths.
  • 🎯 Action Set ($A$): the actions available to the player.
  • 🌟 Outcomes ($\mathcal{O}$): the outcomes that each action can produce.
  • 📚 Knowledge Book ($K$): detailed guidelines linking truths, actions, and outcomes.

Gameplay Mechanics:

  • A valid truth ($t^*$) is secretly chosen.
  • The player takes actions and observes their outcomes.
  • The goal is to deduce $t^*$ using as few actions as possible (see the sketch below).

🧑‍⚕️ Example Scenario: Diagnosing diseases using medical tests.
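
To make the interaction concrete, here is a minimal, self-contained sketch of such a game in Python. All names (truths, tests, outcomes) are illustrative only; the actual environments live under kumo/env.

import random

# Toy truth set T and action set A for a diagnosis game. Each action maps
# every truth to the outcome it would produce (values are illustrative).
TRUTHS = ["Flu", "Cold", "Allergy", "Strep"]
ACTIONS = {
    "ThroatSwab":   {"Flu": "neg", "Cold": "neg", "Allergy": "neg", "Strep": "pos"},
    "AllergyPanel": {"Flu": "neg", "Cold": "neg", "Allergy": "pos", "Strep": "neg"},
    "RapidFluTest": {"Flu": "pos", "Cold": "neg", "Allergy": "neg", "Strep": "neg"},
}

hidden_truth = random.choice(TRUTHS)      # a valid truth t* is secretly chosen
candidates = set(TRUTHS)
for action, outcome_of in ACTIONS.items():
    outcome = outcome_of[hidden_truth]    # take an action, observe its outcome
    # Keep only the truths consistent with everything observed so far.
    candidates = {t for t in candidates if outcome_of[t] == outcome}
    print(f"{action}: {outcome} -> candidates {sorted(candidates)}")
print("Deduced:", sorted(candidates), "| actual:", hidden_truth)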

📌 Provided Domains:

  • 100 autogenerated exemplar domains
  • Categories: Computer Science, Biology, Art, and more
  • Typical domain: ~50 truths, ~30 actions

[Figure: KUMO example game]


📑 Benchmark Format

The KUMO dataset is provided in JSON format, making it easy to integrate and customize. The data lives under kumo/env:

kumo/
└── env/
    ├── data/
    │   └── [DomainName]_data.py
    ├── [DomainName]/
    │   ├── knowledge_book/
    │   │   └── truth_num=4+action_num=6+valid_truth_num=1/
    │   │       ├── seed=0.txt
    │   │       └── ...
    │   └── truth_num=4+action_num=6+valid_truth_num=1.jsonl
    └── [DomainName].py

⚙️ Customize parameters (truth_num, action_num, etc.) to tailor the benchmark to your needs.
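
For instance, loading one parameter setting of a domain takes a few lines of Python. The domain name below is illustrative, and the exact record schema should be checked against the files themselves:

import json
from pathlib import Path

env_dir = Path("kumo/env/MedicalEnv")     # substitute any [DomainName] here
setting = "truth_num=4+action_num=6+valid_truth_num=1"

# Each line of the .jsonl file is one task instance.
tasks = [json.loads(line)
         for line in (env_dir / f"{setting}.jsonl").read_text().splitlines()
         if line.strip()]
print(f"loaded {len(tasks)} task instances")

# Each instance pairs with a generated knowledge book.
print((env_dir / "knowledge_book" / setting / "seed=0.txt").read_text()[:300])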


⚙️ Environment Setup

🔽 Clone the Repository

git clone https://github.com/linhaowei1/kumo.git
cd kumo

📦 Install Dependencies

Recommended: Conda with Python (3.10 to 3.12):

conda create -n kumo python=3.12
conda activate kumo
pip install -r requirements.txt

Hardware requirements: none beyond a CPU, since models are called through an API; no local GPU is needed for inference.


📈 Evaluation

We recommend calling LLMs through the OpenAI API. Edit examples/main.sh to add your own API key and model name; results are written to results/.

Expected runtime: depends on API latency and throughput (GPT-4o takes about 3 hours to run all 100 easy-setting tasks with this script).
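
Conceptually, each evaluation turn boils down to a chat-completion call. The standalone sketch below shows the shape of such a call with the official openai Python client; the model name, file path, and prompt are placeholders, and the actual prompting logic lives in the evaluation scripts:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder knowledge book; any generated seed=*.txt file works here.
knowledge_book = open(
    "kumo/env/MedicalEnv/knowledge_book/"
    "truth_num=4+action_num=6+valid_truth_num=1/seed=0.txt"
).read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are playing a deduction game."},
        {"role": "user", "content": knowledge_book + "\n\nChoose your first action."},
    ],
)
print(response.choices[0].message.content)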


🛠️ Dataset Generation

Create customized domains and scenarios in four steps:

1️⃣ Seed Configuration

Generate scenarios via LLM:

python generate/config_generation.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --data_path ./templates/config_generation.jsonl

🔗 Detailed Instructions
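
The --data_path file is JSONL with one generation request per line. The snippet below is a purely hypothetical illustration of the format; the authoritative schema is templates/config_generation.jsonl itself:

import json

# Hypothetical field names; the real template may use different keys.
request = {
    "domain": "MedicalEnv",
    "category": "Biology",
    "instruction": "Generate truths and actions for a medical-diagnosis domain.",
}
print(json.dumps(request))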

2️⃣ Task Instances via SAT Sampling

Generate specific tasks:

python SAT_sampling.py \
  --truth_num 4 \
  --action_num 6 \
  --valid_truth_num 1 \
  --data_num 50 \
  --domain MedicalEnv

🔗 Example Script
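
As the name suggests, SAT_sampling.py treats task construction as a constraint-satisfaction problem. The rejection-sampling sketch below is only a conceptual stand-in that enforces the same invariant, i.e. that exactly valid_truth_num truths remain consistent with the sampled rules; all structure here is illustrative:

import random

def sample_task(truth_num=4, action_num=6, valid_truth_num=1, seed=0):
    """Redraw random rule sets until exactly `valid_truth_num` truths survive."""
    rng = random.Random(seed)
    truths = [f"truth_{i}" for i in range(truth_num)]
    while True:
        # Each action's rule eliminates a random nonempty subset of truths.
        rules = [set(rng.sample(truths, rng.randint(1, truth_num - 1)))
                 for _ in range(action_num)]
        valid = [t for t in truths if all(t not in rule for rule in rules)]
        if len(valid) == valid_truth_num:
            return truths, rules, valid

truths, rules, valid = sample_task()
print("valid truths:", valid)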

3️⃣ Knowledge Book Generation

Automatically build detailed knowledge bases:

python knowledge_book_generation.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --data_num 50 \
  --truth_num 4 \
  --action_num 6 \
  --valid_truth_num 1 \
  --domain MedicalEnv

🔗 Example

4️⃣ Optional Knowledge Book Refinement

Improve generated knowledge books:

python generate/knowledge_book_revision.py \
  --load_type OPENAI \
  --api_base http://localhost:8001/v1 \
  --api_key EMPTY \
  --domain MedicalEnv \
  --revision_template_path ./templates/revision_template.md

🔗 Revision Details


💬 Support & Questions

For support, feedback, or inquiries, please:

  • Open an issue on GitHub
  • Contact the repository maintainers directly
