Imagine telling a new assistant, “Click that big blue button,” only for them to hesitate, or worse, click the wrong big blue button. Annoying, right? Now, scale that challenge up to AI agents trying to navigate complex digital interfaces, and you start to grasp the monumental task of “grounding” – teaching AI to reliably find and interact with the precise on-screen element we intend. For years, this has been a significant bottleneck in developing truly autonomous and intelligent computer-use agents.
But what if there was a breakthrough? What if AI could not only understand your instruction but pinpoint the exact pixel location you meant, every single time? This isn’t just a hypothetical anymore. A team of researchers from ML Foundations has introduced Gelato-30B-A3B, a state-of-the-art grounding model that’s set to redefine how AI interacts with graphical user interfaces (GUIs). It’s designed to be the sharp eye and steady hand for AI agents, translating our natural language instructions into reliable, pixel-perfect clicks.
The Precision Problem in AI Automation
At its core, the challenge for AI agents navigating computers comes down to precision. A human can quickly distinguish between two visually similar buttons based on context or a subtle nuance in an instruction. For an AI, especially one dealing with the ever-changing layouts of different operating systems and applications, this is incredibly difficult.
Previous models, while impressive, often struggled with this nuance. They might get close, but “close enough” isn’t good enough when you need to click a specific cell in a spreadsheet or a particular checkbox in a dense form. This is where Gelato-30B-A3B steps in, addressing what’s known as the GUI grounding problem with unprecedented accuracy.
Modular Design: The Agent’s Sharp Eye
Gelato-30B-A3B isn’t an entire AI agent; rather, it’s a highly specialized and powerful component within a larger AI stack. Think of it as the ultimate “eyes and pointer” for an agent. It’s a roughly 31-billion-parameter mixture-of-experts model built by fine-tuning Qwen3-VL-30B-A3B-Instruct; the “A3B” in the name signals that only about 3 billion parameters are active per token, which keeps inference efficient despite the large total size.
The input is simple: a screenshot of the computer screen and a textual instruction, like “click on the Save button.” The output? A single, precise click coordinate. This modularity is key. A separate planner model, say a powerful large language model like GPT-5 (as used in Gelato’s experiments), decides the high-level action, then hands off the precise execution to Gelato. This separation allows the agent to operate fluidly across diverse operating systems and applications, without needing to relearn the visual landscape for every new scenario.
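To make that division of labor concrete, here is a minimal sketch of what calling a grounder like Gelato through the Hugging Face transformers library could look like. The model ID, prompt wording, and output format are assumptions for illustration; the model card on Hugging Face defines the actual interface.

```python
# Minimal sketch of a grounder call: screenshot + instruction -> (x, y).
# The repo name, prompt, and coordinate format below are assumptions;
# consult the model card on Hugging Face for the real interface.
import re
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "mlfoundations/Gelato-30B-A3B"  # assumed repository name

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, device_map="auto")

def ground(screenshot: Image.Image, instruction: str) -> tuple[int, int]:
    """Return a predicted click coordinate for the given instruction."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": screenshot},
            {"type": "text", "text": f"Click target: {instruction}"},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=32)
    text = processor.decode(out[0], skip_special_tokens=True)
    # Assume the model emits a coordinate like "(x, y)"; parse the last pair.
    x, y = map(int, re.findall(r"\((\d+),\s*(\d+)\)", text)[-1])
    return x, y
```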
Click 100k: The Data That Makes the Difference
Behind every great AI model is a great dataset, and for Gelato-30B-A3B, that’s Click 100k. This isn’t just any collection of images; it’s a meticulously curated dataset specifically designed for GUI grounding. It pairs real computer screen images with natural language instructions, target element bounding boxes, and precise coordinates.
What makes Click 100k stand out is its comprehensive approach. It’s built by filtering and unifying a wide array of public sources, including ShowUI, AutoGUI, PC Agent E, and many more. Each source contributes unique scenarios, from everyday clicks to complex spreadsheet manipulations.
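To picture what such a sample contains, here is a hypothetical record; the field names are illustrative, not the actual Click 100k schema.

```python
# Illustrative shape of one grounding sample. Field names and values are
# hypothetical, not the released Click 100k schema.
sample = {
    "image": "screenshots/spreadsheet_0042.png",  # full-screen capture
    "instruction": "Click the cell containing the Q3 revenue total",
    "bbox": [812, 344, 876, 362],                 # x1, y1, x2, y2 in pixels
    "point": [844, 353],                          # target click coordinate
    "source": "ShowUI",                           # originating public dataset
}
```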
Beyond Quantity: The Art of Filtering
Quantity is good, but quality is paramount. The research team didn’t just dump all available data into the mix. They ran an aggressive filtering pipeline to ensure every sample in Click 100k was meaningful and challenging. OmniParser, for instance, is used to discard clicks that don’t land on actual interface elements. More importantly, models like Qwen2.5-VL-7B and SE-GUI-3B were used to remove trivial examples, because an AI doesn’t learn much from effortlessly clicking an obvious hyperlink.
Further checks with GTA1-7B-2507 and UI-Venus-7B ensured that instructions genuinely matched the click regions, eliminating misleading data. This rigorous filtering paid off: a baseline trained on a balanced, filtered subset gained +9 percentage points of accuracy on ScreenSpot-Pro. It’s a testament to the idea that sometimes less (but higher-quality) data is more effective for training robust AI.
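Putting the stages together, the pipeline reads roughly like the sketch below. The check functions are hypothetical stand-ins for the model-based filters; none of these names come from the released pipeline.

```python
# Hypothetical sketch of the Click 100k filtering stages described above.
# The three check functions are injected stand-ins for the model-based
# filters; they are illustrative, not names from the released code.
def keep_sample(sample, on_ui_element, is_trivial, instruction_matches) -> bool:
    # 1. OmniParser check: the click must land on a detected UI element.
    if not on_ui_element(sample["image"], sample["point"]):
        return False
    # 2. Difficulty filter: drop samples that small grounders
    #    (Qwen2.5-VL-7B, SE-GUI-3B) already solve trivially.
    if is_trivial(sample):
        return False
    # 3. Agreement filter: stronger grounders (GTA1-7B-2507, UI-Venus-7B)
    #    must confirm the instruction matches the labeled region.
    if not instruction_matches(sample):
        return False
    return True
```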
A specific focus was also placed on professional-application coverage. By integrating data from sources like UI-Vision and a JEDI subset for spreadsheet tasks, and by annotating over 80 professional-application tutorial videos with Claude 4 Sonnet, Click 100k ensures Gelato learns to operate in the real, often complex, environments users encounter daily.
From Training Floors to Real-World Impact
Training Gelato-30B-A3B was no small feat. The researchers used GRPO (Group Relative Policy Optimization), a reinforcement learning (RL) algorithm introduced with DeepSeekMath. This isn’t your typical supervised learning: with GRPO, Gelato learns from sparse rewards, meaning it only gets a “pat on the back” when its predicted click lands precisely inside the target bounding box.
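The reward itself is easy to state: 1 if the predicted click falls inside the ground-truth box, 0 otherwise. GRPO then turns those sparse rewards into a learning signal by normalizing them within a group of rollouts for the same prompt. A minimal sketch, assuming pixel coordinates:

```python
import numpy as np

# Sparse grounding reward for GRPO-style RL: 1.0 only when the predicted
# click lands inside the ground-truth bounding box.
def click_reward(pred_x: float, pred_y: float,
                 bbox: tuple[float, float, float, float]) -> float:
    x1, y1, x2, y2 = bbox
    return 1.0 if (x1 <= pred_x <= x2 and y1 <= pred_y <= y2) else 0.0

# GRPO computes advantages group-relatively: rewards for several sampled
# clicks on the same prompt are normalized against each other.
def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```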
This approach, similar to the GTA1 recipe, significantly boosts grounding accuracy over traditional baselines. Starting from Qwen3-VL-30B-A3B-Instruct, the model underwent 100 RL steps on a powerful setup of 32 A100 GPUs. The results speak for themselves: Gelato-30B-A3B achieved an impressive 63.88% accuracy on ScreenSpot-Pro and 69.15% on OSWorld-G, rising to 74.65% on OSWorld-G Refined with a simple refusal-prompting strategy.
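The refusal-prompting idea is to let the grounder decline when no matching element is visible instead of forcing a guess. A purely hypothetical wrapper, with illustrative wording that is not Gelato’s actual prompt:

```python
import re

# Hypothetical refusal-style prompt: the grounder may answer that the
# target is absent rather than being forced to output a coordinate.
# The wording and "NOT FOUND" token are illustrative assumptions.
REFUSAL_TEMPLATE = (
    "Locate the element described below and answer with its click "
    "coordinate as (x, y). If no matching element is visible, answer "
    "exactly: NOT FOUND.\n\nDescription: {instruction}"
)

def parse_grounding(answer: str):
    """Return (x, y) on success, or None if the model refused."""
    if "NOT FOUND" in answer:
        return None
    match = re.search(r"\((\d+),\s*(\d+)\)", answer)
    return (int(match.group(1)), int(match.group(2))) if match else None
```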
Real-World Performance: A Clear Edge
What truly matters, of course, is how a model performs in a complete agent framework. The team integrated Gelato-30B-A3B into the GTA1.5 agent framework, where GPT-5 served as the planner, guiding the high-level strategy. This setup allowed them to test Gelato’s capabilities on real computer-use tasks within the challenging OSWorld environment.
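In outline, such a harness alternates between the planner and the grounder. The sketch below shows the shape of that loop; all object and method names are invented for illustration.

```python
# Hypothetical planner-grounder loop in the spirit of the GTA1.5 harness.
# The env/planner/grounder objects and their methods are illustrative.
def run_episode(task: str, env, planner, grounder, max_steps: int = 30):
    for _ in range(max_steps):
        screenshot = env.screenshot()
        step = planner.next_action(task, screenshot)  # e.g. "click the Save button"
        if step.done:                                 # planner declares the task finished
            break
        x, y = grounder.ground(screenshot, step.instruction)
        env.click(x, y)                               # execute the grounded click
    return env.evaluate(task)                         # harness scores the episode
```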
The results were compelling. Gelato-30B-A3B achieved a 58.71% automated success rate, comfortably surpassing GTA1-32B, which managed 56.97% in the same harness. Even more telling, human evaluation on problematic tasks confirmed Gelato’s superiority, reaching 61.85% success compared to GTA1-32B’s 59.47%. This clearly demonstrates that better grounding directly translates into stronger, more reliable end-to-end agent performance.
A New Benchmark for GUI Grounding
Gelato-30B-A3B isn’t just another incremental improvement; it represents a significant leap forward for grounded computer use. By leveraging a Qwen3-VL-based mixture-of-experts model and training it on the carefully curated Click 100k dataset with sophisticated reinforcement learning, Gelato-30B-A3B has set a new state of the art for GUI grounding.
It has convincingly outperformed both its predecessor, GTA1-32B, and even much larger vision language models like Qwen3-VL-235B-A22B-Instruct, all while remaining accessible through Hugging Face. This means the capabilities of Gelato aren’t confined to research labs but can be adopted and built upon by the wider AI community.
The future of AI agents navigating our digital world just got a whole lot clearer and more precise. With Gelato-30B-A3B, we’re one step closer to truly intelligent digital assistants that understand our intent, not just our words, and execute tasks with unparalleled accuracy. It’s an exciting time to be watching AI evolve, bringing us closer to a future where our computers work seamlessly alongside us, anticipating our needs and executing our commands with precision and grace.