WorldGUI-Agent has highly adaptive self-reflection capabilities in dynamic GUI environments.
No Docker or virtual machine required for deployment.
Visit the WorldGUI project page. 🌐
Demo: WorldGUI-Agent handling the user query "Disable the 'Battery saver' notifications".
What's new in WorldGUI-Agent?
WorldGUI-Agent is a newly developed GUI agent based on a self-reflection mechanism. We systematically investigate GUI automation and establish the following workflow, incorporating three key self-reflection modules:
- Planner-Critic (Post-Planning Critique): Self-corrects the initial plan to ensure its accuracy.
- Step-Check (Pre-Execution Validation): Removes redundant steps, or modifies them when necessary.
- Actor-Critic (Post-Action Evaluation): Reviews the task-completion status and applies any necessary corrections.
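To make the workflow concrete, here is a minimal sketch of how the three modules could compose into one execution loop. Every name below is hypothetical and each function merely stands in for an LMM call; see the repository for the actual interfaces.

```python
# A minimal sketch, assuming hypothetical names throughout: each function
# below stands in for an LMM call; the real module interfaces live in
# the repository.

from dataclasses import dataclass

@dataclass
class Step:
    description: str  # e.g., "Open Settings and toggle 'Battery saver'"

def state_aware_plan(query: str, screenshot: bytes) -> list[Step]:
    """State-Aware Planner: draft a plan from the query and current screen."""
    return [Step(f"Carry out: {query}")]

def planner_critic(plan: list[Step]) -> list[Step]:
    """Planner-Critic: self-correct the initial plan before execution."""
    return plan  # e.g., re-prompt the LMM to revise inaccurate steps

def step_check(step: Step, screenshot: bytes) -> str:
    """Step-Check: return <Pass>, <Modify>, <Continue>, or <Finished>."""
    return "<Continue>"

def act(step: Step, screenshot: bytes) -> bytes:
    """Actor: execute the step, then return a fresh screenshot."""
    return screenshot

def actor_critic(before: bytes, after: bytes) -> bool:
    """Actor-Critic: judge success from the before/after screenshots."""
    return True

def run(query: str, screenshot: bytes) -> None:
    plan = planner_critic(state_aware_plan(query, screenshot))
    for step in plan:
        status = step_check(step, screenshot)
        if status == "<Pass>":        # redundant step: skip it
            continue
        if status == "<Finished>":    # task already complete
            break
        # "<Modify>" would rewrite the step before execution (omitted here)
        after = act(step, screenshot)
        if not actor_critic(screenshot, after):
            after = act(step, screenshot)  # iterative action correction
        screenshot = after

run("Disable the 'Battery saver' notifications", b"")
```

Running Step-Check before each action is what lets the agent pick a task up from any intermediate GUI state, which is exactly the setting the WorldGUI benchmark evaluates.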
Overall framework of WorldGUI-Agent.
State-Aware Planner and Planner-Critic modules.
Step-Check module.
Actor-Critic module.
Comparison of various agents on the WorldGUI Benchmark (meta task).
- [2025.06.10] We are excited to introduce an improved version of WorldGUI. We increased the number of tasks from 315 to 611 and revised the technical report for better readability.
- [2025.03.11] ⚡ We are excited to introduce a fast version of WorldGUI-Agent powered by the base models Claude-3.5-Sonnet and Claude-3.7-Sonnet. In this release, the Claude models serve as the Actor without relying on the GUI Parser, which delivers impressive speed. Try it with `test_guithinker_fast.py`.
- [2025.03.08] We made a demo showcasing WorldGUI-Agent.
- [2025.03.05] ⚡ Our WorldGUI-Agent now supports both instructional-video and non-video inputs. Enjoy!
- [2025.03.05] 😊 We release the code of WorldGUI-Agent. You can now run our GUI agent locally on your Windows computer; see Getting Started. WorldGUI-Agent supports various base LMMs through API calls, including GPT-4o, Gemini-2.0, and Claude-3.5-Sonnet. Local model support will be available soon.
- [2025.02.13] We release WorldGUI on arXiv.
- 🏆 High Performance: Our WorldGUI-Agent surpasses Claude-3.5 Computer Use by 14.9% on our WorldGUI Benchmark.
- 🌐 Universal LMM Support: Seamlessly integrates with a wide range of LMMs (e.g., OpenAI, Anthropic, Gemini).
- 🔀 Flexible Interaction: Supports input both with and without an instructional video.
- 🚀 Easy Deployment: Get started instantly with a simple `.\shells\start_server.bat` command followed by `python test_guithinker_custom.py`, with no need for Docker or a virtual machine.
Our codebase includes:
- GUI Parser: Utilizes Google OCR and PyAutoGUI to extract element-grounding information (see the sketch after this list).
- State-Aware Planner: Accepts screenshots and instructional videos to generate plans.
- Planner-Critic: Refines the initial plan generated by the planner.
- Step-Check: Verifies task completion and redundancy using various output statuses (e.g., `<Modify>`, `<Pass>`, `<Continue>`, `<Finished>`). It also implements an LLM-driven region search module to locate target elements.
- Actor: Translates action descriptions into executable code (e.g., `click(100, 200)`); it can be any API model or locally running model. A sketch of this translation appears after this list.
- Actor-Critic: Checks task-completion status by comparing before-and-after screenshots and uses an iterative action-correction algorithm to gradually verify and correct actions (also sketched after this list).
- Input with Instructional Video: Supports execution guided by an instructional video.
- Input without Instructional Video: Supports direct execution from a user query.
- Frontend-backend communication system: Separates the frontend and backend so that locally running models and user interfaces can be deployed flexibly.
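As referenced in the GUI Parser item above, here is a hedged sketch of the grounding idea: OCR the current screenshot with Google Cloud Vision and return screen coordinates for a piece of on-screen text. The `locate_text` helper, its matching rule, and the center-of-box heuristic are illustrative assumptions, not the repository's actual parser.

```python
# A hedged sketch of element grounding: OCR the screen with Google Cloud
# Vision, then return the center of the first word matching the target text
# so the Actor can click it with PyAutoGUI. Function name and matching rule
# are assumptions; error handling and caching are omitted.

import io
import pyautogui
from google.cloud import vision

def locate_text(target: str) -> tuple[int, int] | None:
    """Return screen coordinates of `target` text, or None if not found."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    client = vision.ImageAnnotatorClient()
    response = client.text_detection(image=vision.Image(content=buf.getvalue()))
    for annotation in response.text_annotations[1:]:  # [0] is the full text block
        if target.lower() in annotation.description.lower():
            xs = [v.x for v in annotation.bounding_poly.vertices]
            ys = [v.y for v in annotation.bounding_poly.vertices]
            return (sum(xs) // len(xs), sum(ys) // len(ys))
    return None

coords = locate_text("Battery saver")
if coords:
    pyautogui.click(*coords)  # ground the element, then act on it
```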
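The Actor item above mentions translating action descriptions such as `click(100, 200)` into executable code. Below is a hypothetical dispatcher, assuming a simple regex over the action string and a fixed whitelist of PyAutoGUI calls; the repository's actual dispatch logic may differ.

```python
# A hypothetical Actor dispatcher: parse a model-emitted action string of
# the form name(arg, arg, ...) and map it onto PyAutoGUI. The regex and
# the supported-action whitelist are assumptions for illustration.

import re
import pyautogui

def execute_action(action: str) -> None:
    """Run an action string like 'click(100, 200)'."""
    match = re.fullmatch(r"(\w+)\(([^)]*)\)", action.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    name, raw_args = match.groups()
    args = [int(a) for a in raw_args.split(",") if a.strip()]
    handlers = {
        "click": pyautogui.click,            # click(100, 200)
        "doubleClick": pyautogui.doubleClick,
        "moveTo": pyautogui.moveTo,
    }
    if name not in handlers:
        raise ValueError(f"Unsupported action: {name}")
    handlers[name](*args)

execute_action("click(100, 200)")
```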
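And a minimal sketch of the Actor-Critic's iterative action correction: retry an action until a critic accepts the before/after screenshot pair. The pixel-difference test below is a deliberate stand-in for the LMM-based judgment the module actually performs, and the retry limit is an assumption.

```python
# A minimal sketch of iterative action correction. critic_accepts() only
# checks that the screen changed at all; the real Actor-Critic asks an LMM
# whether the intended effect is visible in the after screenshot.

import pyautogui
from PIL import ImageChops

def critic_accepts(before, after) -> bool:
    # Placeholder critic: any visible change counts as success here.
    return ImageChops.difference(before, after).getbbox() is not None

def act_with_correction(action, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        before = pyautogui.screenshot()
        action()  # e.g., lambda: pyautogui.click(100, 200)
        after = pyautogui.screenshot()
        if critic_accepts(before, after):
            return True   # step verified; move on
    return False          # escalate: re-plan or modify the step

act_with_correction(lambda: pyautogui.click(100, 200))
```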
See our paper for details. WorldGUI-Agent is released alongside WorldGUI, a newly curated desktop GUI benchmark.
Demo video (sped up):
demovideo1.mp4
See the 1080p version at https://www.youtube.com/watch?v=RoJ-cbjfZmg
See Get Started for running on your local computer.
- Special thanks to Difei Gao for his hard work on developing the codebase.
- We express our gratitude to Kaiming Yang, Mingyi Yan, and Wendi Yu for their hard work on data annotation and baseline testing.
- OOTB (Computer Use): Computer Use OOTB is an out-of-the-box (OOTB) solution for desktop GUI agents, supporting both API-based models (Claude 3.5 Computer Use) and locally running models (ShowUI, UI-TARS).
- ShowUI: An open-source, end-to-end, lightweight vision-language-action model for GUI agents and computer use.
- AssistGUI: The first work to focus on desktop productivity-software usage, with over 100 realistic GUI tasks.
- VideoGUI: A benchmark for GUI automation from instructional videos. Can a GUI agent behave like a human when given an image-style effect and a user query?
- SWE-bench Multimodal: A dataset for evaluating AI systems on visual software-engineering tasks.
If you find WorldGUI useful, please cite using this BibTeX:
@misc{zhao2025worldguiinteractivebenchmarkdesktop,
title={WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point},
author={Henry Hengyuan Zhao and Kaiming Yang and Wendi Yu and Difei Gao and Mike Zheng Shou},
year={2025},
eprint={2502.08047},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.08047},
}
If you have any questions or suggestions, please don't hesitate to let us know. You can email Henry Hengyuan Zhao at NUS ([email protected]) or open an issue on this repository. We welcome contributions; feel free to submit a pull request if you have suggestions for improvement.