WorldGUI-Agent has highly adaptive self-reflection capabilities in dynamic GUI environments.
No Docker or virtual machine required for deployment.
Visit the WorldGUI project page. 🌐
Demo: WorldGUI-Agent handling the user query "Disable the 'Battery saver' notifications".
What's new in WorldGUI-Agent?
WorldGUI-Agent is a newly developed GUI agent based on a self-reflection mechanism. We systematically investigate GUI automation and establish the following workflow, incorporating three key self-reflection modules:
- Planner-Critic (Post-Planning Critique): Self-corrects the initial plan to ensure its accuracy.
- Step-Check (Pre-Execution Validation): Removes redundant steps, or modifies them when necessary.
- Actor-Critic (Post-Action Evaluation): Reviews the task-completion status and applies any necessary corrections.
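To make the workflow concrete, here is a minimal sketch of how the three modules could compose into one execution loop. Every name below is hypothetical and each function merely stands in for an LMM call; see the repository for the actual interfaces.

```python
# A minimal sketch, assuming hypothetical names throughout: each function
# below stands in for an LMM call; the real module interfaces live in
# the repository.

from dataclasses import dataclass

@dataclass
class Step:
    description: str  # e.g., "Open Settings and toggle 'Battery saver'"

def state_aware_plan(query: str, screenshot: bytes) -> list[Step]:
    """State-Aware Planner: draft a plan from the query and current screen."""
    return [Step(f"Carry out: {query}")]

def planner_critic(plan: list[Step]) -> list[Step]:
    """Planner-Critic: self-correct the initial plan before execution."""
    return plan  # e.g., re-prompt the LMM to revise inaccurate steps

def step_check(step: Step, screenshot: bytes) -> str:
    """Step-Check: return <Pass>, <Modify>, <Continue>, or <Finished>."""
    return "<Continue>"

def act(step: Step, screenshot: bytes) -> bytes:
    """Actor: execute the step, then return a fresh screenshot."""
    return screenshot

def actor_critic(before: bytes, after: bytes) -> bool:
    """Actor-Critic: judge success from the before/after screenshots."""
    return True

def run(query: str, screenshot: bytes) -> None:
    plan = planner_critic(state_aware_plan(query, screenshot))
    for step in plan:
        status = step_check(step, screenshot)
        if status == "<Pass>":        # redundant step: skip it
            continue
        if status == "<Finished>":    # task already complete
            break
        # "<Modify>" would rewrite the step before execution (omitted here)
        after = act(step, screenshot)
        if not actor_critic(screenshot, after):
            after = act(step, screenshot)  # iterative action correction
        screenshot = after

run("Disable the 'Battery saver' notifications", b"")
```

Running Step-Check before each action is what lets the agent pick a task up from any intermediate GUI state, which is exactly the setting the WorldGUI benchmark evaluates.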
Overall framework of WorldGUI-Agent.
State-Aware Planner and Planner-Critic modules.
Step-Check module.
Actor-Critic module.
Comparison of various agents on the WorldGUI Benchmark (meta task).
- [2025.06.10] We are excited to introduce an improved version of WorldGUI. We increased the number of tasks from 315 to 611 and revised the technical report for better readability.
- [2025.03.11] ⚡ We are excited to introduce a fast version of WorldGUI-Agent powered by the base models Claude-3.5-Sonnet and Claude-3.7-Sonnet. In this release, the Claude models serve as the Actor without relying on the GUI Parser, which delivers impressive speed. Try it with `test_guithinker_fast.py`.
- [2025.03.08] We made a demo showcasing WorldGUI-Agent.
- [2025.03.05] ⚡ Our WorldGUI-Agent now supports both instructional-video and non-video inputs. Enjoy!
- [2025.03.05] 😊 We release the code of WorldGUI-Agent. You can now run our GUI agent locally on your Windows computer; see Getting Started. WorldGUI-Agent supports various base LMMs through API calls, including GPT-4o, Gemini-2.0, and Claude-3.5-Sonnet. Local model support will be available soon.
- [2025.02.13] We release WorldGUI on arXiv.
- 🏆 High Performance: Our WorldGUI-Agent surpasses Claude-3.5 Computer Use by 14.9% on our WorldGUI Benchmark.
- 🌐 Universal LMM Support: Seamlessly integrates with a wide range of LMMs (e.g., OpenAI, Anthropic, Gemini).
- 🔀 Flexible Interaction: Supports input both with and without an instructional video.
- 🚀 Easy Deployment: Get started instantly with a simple `.\shells\start_server.bat` command followed by `python test_guithinker_custom.py`, with no need for Docker or a virtual machine.
Our codebase includes:
- GUI Parser: Utilizes Google OCR and PyAutoGUI to extract element-grounding information (see the sketch after this list).
- State-Aware Planner: Accepts screenshots and instructional videos to generate plans.
- Planner-Critic: Refines the initial plan generated by the planner.
- Step-Check: Verifies task completion and redundancy using various output statuses (e.g., `<Modify>`, `<Pass>`, `<Continue>`, `<Finished>`). It also implements an LLM-driven region search module to locate target elements.
- Actor: Translates action descriptions into executable code (e.g., `click(100, 200)`); it can be any API model or locally running model. A sketch of this translation appears after this list.
- Actor-Critic: Checks task-completion status by comparing before-and-after screenshots and uses an iterative action-correction algorithm to gradually verify and correct actions (also sketched after this list).
- Input with Instructional Video: Supports execution guided by an instructional video.
- Input without Instructional Video: Supports direct execution from a user query.
- Frontend-backend communication system: Separates the frontend and backend so that locally running models and user interfaces can be deployed flexibly.
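As referenced in the GUI Parser item above, here is a hedged sketch of the grounding idea: OCR the current screenshot with Google Cloud Vision and return screen coordinates for a piece of on-screen text. The `locate_text` helper, its matching rule, and the center-of-box heuristic are illustrative assumptions, not the repository's actual parser.

```python
# A hedged sketch of element grounding: OCR the screen with Google Cloud
# Vision, then return the center of the first word matching the target text
# so the Actor can click it with PyAutoGUI. Function name and matching rule
# are assumptions; error handling and caching are omitted.

import io
import pyautogui
from google.cloud import vision

def locate_text(target: str) -> tuple[int, int] | None:
    """Return screen coordinates of `target` text, or None if not found."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    client = vision.ImageAnnotatorClient()
    response = client.text_detection(image=vision.Image(content=buf.getvalue()))
    for annotation in response.text_annotations[1:]:  # [0] is the full text block
        if target.lower() in annotation.description.lower():
            xs = [v.x for v in annotation.bounding_poly.vertices]
            ys = [v.y for v in annotation.bounding_poly.vertices]
            return (sum(xs) // len(xs), sum(ys) // len(ys))
    return None

coords = locate_text("Battery saver")
if coords:
    pyautogui.click(*coords)  # ground the element, then act on it
```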
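The Actor item above mentions translating action descriptions such as `click(100, 200)` into executable code. Below is a hypothetical dispatcher, assuming a simple regex over the action string and a fixed whitelist of PyAutoGUI calls; the repository's actual dispatch logic may differ.

```python
# A hypothetical Actor dispatcher: parse a model-emitted action string of
# the form name(arg, arg, ...) and map it onto PyAutoGUI. The regex and
# the supported-action whitelist are assumptions for illustration.

import re
import pyautogui

def execute_action(action: str) -> None:
    """Run an action string like 'click(100, 200)'."""
    match = re.fullmatch(r"(\w+)\(([^)]*)\)", action.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {action!r}")
    name, raw_args = match.groups()
    args = [int(a) for a in raw_args.split(",") if a.strip()]
    handlers = {
        "click": pyautogui.click,            # click(100, 200)
        "doubleClick": pyautogui.doubleClick,
        "moveTo": pyautogui.moveTo,
    }
    if name not in handlers:
        raise ValueError(f"Unsupported action: {name}")
    handlers[name](*args)

execute_action("click(100, 200)")
```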
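And a minimal sketch of the Actor-Critic's iterative action correction: retry an action until a critic accepts the before/after screenshot pair. The pixel-difference test below is a deliberate stand-in for the LMM-based judgment the module actually performs, and the retry limit is an assumption.

```python
# A minimal sketch of iterative action correction. critic_accepts() only
# checks that the screen changed at all; the real Actor-Critic asks an LMM
# whether the intended effect is visible in the after screenshot.

import pyautogui
from PIL import ImageChops

def critic_accepts(before, after) -> bool:
    # Placeholder critic: any visible change counts as success here.
    return ImageChops.difference(before, after).getbbox() is not None

def act_with_correction(action, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        before = pyautogui.screenshot()
        action()  # e.g., lambda: pyautogui.click(100, 200)
        after = pyautogui.screenshot()
        if critic_accepts(before, after):
            return True   # step verified; move on
    return False          # escalate: re-plan or modify the step

act_with_correction(lambda: pyautogui.click(100, 200))
```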
See our paper for details. WorldGUI-Agent is released alongside WorldGUI, a newly curated desktop GUI benchmark.
Demo video (sped up):
demovideo1.mp4
See the 1080p version at https://www.youtube.com/watch?v=RoJ-cbjfZmg
See Get Started for running on your local computer.
- Special thanks to Difei Gao for his hard work on developing the codebase.
- We express our gratitude to Kaiming Yang, Mingyi Yan, and Wendi Yu for their hard work on data annotation and baseline testing.
- OOTB (Computer Use): Computer Use OOTB is an out-of-the-box (OOTB) solution for desktop GUI agents, supporting both API-based models (Claude 3.5 Computer Use) and locally running models (ShowUI, UI-TARS).
- ShowUI: An open-source, end-to-end, lightweight vision-language-action model for GUI agents and computer use.
- AssistGUI: The first work to focus on desktop productivity-software usage, with over 100 realistic GUI tasks.
- VideoGUI: A benchmark for GUI automation from instructional videos. Can a GUI agent behave like a human when given an image-style effect and a user query?
- SWE-bench Multimodal: A dataset for evaluating AI systems on visual software-engineering tasks.
If you find WorldGUI useful, please cite using this BibTeX:
@misc{zhao2025worldguiinteractivebenchmarkdesktop,
title={WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point},
author={Henry Hengyuan Zhao and Kaiming Yang and Wendi Yu and Difei Gao and Mike Zheng Shou},
year={2025},
eprint={2502.08047},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.08047},
}
If you have any questions or suggestions, please don't hesitate to let us know. You can email Henry Hengyuan Zhao at NUS ([email protected]) or open an issue on this repository. We welcome contributions; feel free to submit a pull request if you have suggestions for improvement.