Technology

Google AI Introduces Gemini 2.5 ‘Computer Use’ (Preview): A Browser-Control Model to Power AI Agents to Interact with User Interfaces

Estimated reading time: 6 minutes

  • Revolutionary Browser Control: Google AI’s Gemini 2.5 ‘Computer Use’ is a specialized model enabling AI agents to interact with web interfaces, mimicking human actions with precision.
  • Extensive UI Automation: It provides 13 predefined UI actions (e.g., click_at, type_text_at) and allows for custom function extensions, empowering complex workflow automation across browser and mobile surfaces.
  • Prioritized Safety: The model includes a robust safety monitor to block prohibited actions and requires explicit human confirmation for high-stakes operations like payments or sensitive data access.
  • Impressive Performance: Achieves high pass rates on industry benchmarks (e.g., 69.0% on Online-Mind2Web, 79.9% on WebVoyager), demonstrating superior accuracy and efficiency compared to competing APIs.
  • Practical Impact: Early applications show significant improvements in automated UI test reliability (recovery of over 60% of previously failing tests) and operational speed (workflows approximately 50% faster).

In the rapidly evolving landscape of artificial intelligence, the ability for AI agents to not just understand but interact with our digital world has been a significant frontier. Google AI is pushing the boundaries of this interaction with its latest innovation: Gemini 2.5 ‘Computer Use’. This specialized model enables AI agents to navigate and operate within web interfaces, closely mimicking human interaction. The public preview of this technology signals a future where routine browser tasks can be intelligently delegated, freeing up human creativity and productivity for more complex challenges. It’s not just about automating clicks; it’s about empowering AI to understand context and execute sophisticated workflows across the web.

Unlocking AI Agent Capabilities with Gemini 2.5 ‘Computer Use’

The vision of AI agents taking over mundane, repetitive digital tasks is steadily becoming a reality. Google AI’s Gemini 2.5 ‘Computer Use’ model is at the forefront of this transformation. This innovative offering redefines how AI can engage with the intricate world of user interfaces. As Google AI articulates,

“Which of your browser workflows would you delegate today if an agent could plan and execute predefined UI actions? Google AI introduces Gemini 2.5 Computer Use, a specialized variant of Gemini 2.5 that plans and executes real UI actions in a live browser via a constrained action API. It’s available in public preview through Google AI Studio and Vertex AI. The model targets web automation and UI testing, with documented, human-judged gains on standard web/mobile control benchmarks and a safety layer that can require human confirmation for risky steps.”

This powerful statement encapsulates the core promise of the technology: a future where intelligent agents can shoulder the burden of repetitive digital work.

At its core, Gemini 2.5 ‘Computer Use’ is engineered to facilitate direct interaction with web browsers. Developers gain access to this capability by invoking a new computer_use tool within their applications. This tool, in turn, generates a series of function calls that correspond to specific user interface actions. Imagine commands like click_at, type_text_at, or drag_and_drop – these are the building blocks of an agent’s interaction. Client-side code, often leveraging robust automation frameworks such as Playwright or Browserbase, then executes these generated actions. Following each execution, a fresh screenshot and URL are captured, providing the model with real-time feedback and allowing it to adapt and continue the task. This iterative loop persists until the designated task is completed or a pre-defined safety rule intervenes, ensuring controlled and secure operation.
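The iterative loop described above can be sketched in Python. This is a minimal illustration under stated assumptions: `FakeModel` stands in for the Gemini 2.5 ‘Computer Use’ model and `execute_action` for the client-side executor (which in production would drive Playwright or Browserbase); neither reflects Google’s actual SDK surface.

```python
def execute_action(call, trace):
    """Client-side executor stub: records the action, then returns a
    simulated screenshot + URL observation for the model's next step."""
    trace.append((call["name"], call.get("args", {})))
    return {"screenshot": b"<png bytes>", "url": "https://example.com"}

class FakeModel:
    """Stand-in for the model: emits a fixed plan of UI actions,
    then signals completion by returning None."""
    def __init__(self, plan):
        self.plan = iter(plan)

    def next_action(self, observation):
        return next(self.plan, None)  # None means the task is done

def run_agent(model, task):
    trace = []
    observation = {"task": task}
    # Act, re-observe, repeat — until the model declares the task done.
    while (call := model.next_action(observation)) is not None:
        observation = execute_action(call, trace)
    return trace

plan = [
    {"name": "navigate", "args": {"url": "https://example.com/login"}},
    {"name": "type_text_at", "args": {"x": 120, "y": 200, "text": "alice"}},
    {"name": "click_at", "args": {"x": 300, "y": 260}},
]
steps = run_agent(FakeModel(plan), "log in as alice")
print([name for name, _ in steps])
```

In the real system the model, not a fixed plan, decides each next action from the fresh screenshot and URL, and a safety rule can halt the loop at any point.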

The model’s action space is thoughtfully designed yet flexible. It encompasses 13 predefined UI actions that cover a broad spectrum of browser interactions:

  • open_web_browser
  • wait_5_seconds
  • go_back
  • go_forward
  • search
  • navigate
  • click_at
  • hover_at
  • type_text_at
  • key_combination
  • scroll_document
  • scroll_at
  • drag_and_drop

This comprehensive set allows for complex workflow automation straight out of the box. Furthermore, the system is not limited to these predefined actions. Developers possess the capability to extend this action space with custom functions, such as open_app, long_press_at, or go_home, thereby enabling interaction with non-browser surfaces and tailoring the agent’s capabilities to specific application environments, even for mobile scenarios. This extensibility ensures that Gemini 2.5 ‘Computer Use’ can evolve alongside emerging digital interfaces and user requirements.
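One way the extended action space could be represented client-side is a registry that layers custom actions over the 13 predefined ones. The action names (`open_app`, `long_press_at`) come from the article; the registry pattern itself is an illustrative assumption, not Google’s SDK design.

```python
# The 13 predefined UI actions listed above.
PREDEFINED_ACTIONS = {
    "open_web_browser", "wait_5_seconds", "go_back", "go_forward",
    "search", "navigate", "click_at", "hover_at", "type_text_at",
    "key_combination", "scroll_document", "scroll_at", "drag_and_drop",
}

class ActionSpace:
    """Dispatches predefined actions and any custom extensions."""
    def __init__(self):
        self.custom = {}

    def register(self, name, handler):
        """Add a custom action, e.g. for mobile surfaces."""
        self.custom[name] = handler

    def supports(self, name):
        return name in PREDEFINED_ACTIONS or name in self.custom

    def dispatch(self, call):
        name = call["name"]
        if name in self.custom:
            return self.custom[name](**call.get("args", {}))
        if name in PREDEFINED_ACTIONS:
            return f"executed predefined action {name}"
        raise ValueError(f"unknown action: {name}")

space = ActionSpace()
space.register("open_app", lambda app: f"opened {app}")  # mobile-only action
space.register("long_press_at", lambda x, y: f"long press at ({x}, {y})")

print(space.dispatch({"name": "open_app", "args": {"app": "Maps"}}))
print(space.dispatch({"name": "click_at", "args": {"x": 10, "y": 20}}))
```

The same loop then serves browser and mobile agents alike; only the registered handlers change.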

Scope, Safety, and Performance Metrics

Understanding the capabilities and limitations of any advanced AI model is crucial for effective deployment. Gemini 2.5 ‘Computer Use’ is primarily optimized for web browsers, making it an ideal candidate for a vast array of web-based automation tasks. While its initial focus is on browser environments, Google notes that it is not yet optimized for direct desktop operating system-level control. However, its flexible architecture allows for seamless adaptation to mobile scenarios. By swapping in custom actions while maintaining the same underlying API loop, developers can leverage the model’s intelligence for mobile application control, expanding its utility across diverse platforms.

Safety is paramount when AI agents interact with user interfaces that often handle sensitive information or critical operations. Google AI has meticulously integrated a robust, built-in safety monitor into Gemini 2.5 ‘Computer Use’. This monitor is designed to proactively identify and block prohibited actions, preventing unintended or malicious operations. More critically, for “high-stakes” operations—such as processing payments, sending messages, or accessing sensitive records—the model is configured to require explicit user confirmation. This crucial safety layer ensures human oversight at critical junctures, striking a balance between automation efficiency and user security.
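The confirmation requirement can be sketched as a gate in the client loop: high-stakes actions are surfaced to a human instead of executing. The action names and flagging logic below are illustrative assumptions; in the real system the model’s built-in safety monitor does the flagging.

```python
# Hypothetical set of actions the safety monitor flags as high-stakes.
HIGH_STAKES = {"submit_payment", "send_message", "access_records"}

class ConfirmationRequired(Exception):
    """Raised when an action must wait for explicit human approval."""

def execute_with_safety(call, confirmed=False):
    """Execute an action only if it is low-risk or explicitly confirmed."""
    if call["name"] in HIGH_STAKES and not confirmed:
        raise ConfirmationRequired(f"human approval needed for {call['name']}")
    return f"executed {call['name']}"

# Low-stakes actions run straight through...
print(execute_with_safety({"name": "click_at"}))
# ...high-stakes ones pause for an explicit opt-in.
try:
    execute_with_safety({"name": "submit_payment"})
except ConfirmationRequired as e:
    print(e)
print(execute_with_safety({"name": "submit_payment"}, confirmed=True))
```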

The performance of Gemini 2.5 ‘Computer Use’ has been rigorously measured against industry benchmarks, showcasing impressive results. On the challenging Online-Mind2Web benchmark, the model achieved a notable 69.0% pass@1 (evaluated through majority-vote human judgments), a score validated by the benchmark organizers themselves. Further independent testing conducted by Browserbase, using a matched harness, demonstrated that Gemini 2.5 ‘Computer Use’ leads competing computer-use APIs in both accuracy and latency. Across Online-Mind2Web and WebVoyager benchmarks, under identical time, step, and environment constraints, Google’s model card reports 65.7% (OM2W) and 79.9% (WebVoyager) in Browserbase runs.

These figures underscore the model’s ability to reliably execute complex web tasks. An important practical consideration is the trade-off between latency and quality: Google’s own figures indicate roughly 70%+ accuracy at a median latency of around 225 seconds on the Browserbase Online-Mind2Web harness. While that latency may be significant for certain real-time applications, these are Google-reported numbers based on human evaluations, and they reflect the complexity of the tasks being automated. The model’s generalization also extends to mobile environments: on the AndroidWorld benchmark it achieved 69.7% (as measured by Google), accomplished by integrating custom mobile actions and excluding browser-specific actions within the same API loop.

Early production signals further validate the practical utility and impact of Gemini 2.5 ‘Computer Use’. Google’s payments platform team has reported a significant breakthrough: the model has been instrumental in rehabilitating over 60% of previously failing automated UI test executions. This remarkable improvement highlights the model’s ability to intelligently adapt and recover from unexpected UI changes, greatly enhancing the reliability of automated testing pipelines. Beyond testing, early external testers like Poke.com have observed substantial operational speed improvements, with workflows often running approximately 50% faster compared to their next-best alternative. These real-world applications underscore the profound efficiency gains that Gemini 2.5 ‘Computer Use’ can deliver across various industries.

Practical Applications and Getting Started

The introduction of Gemini 2.5 ‘Computer Use’ opens up a plethora of practical applications, especially in areas demanding repetitive or complex interactions with web interfaces. One immediate and impactful area is automated UI testing and repair. Imagine a large-scale e-commerce platform with thousands of user flows that need constant validation. Manual testing is prohibitively slow and prone to human error, and traditional automated tests often break with minor UI changes. With Gemini 2.5 ‘Computer Use’, an AI agent can intelligently navigate these flows, performing actions like adding items to a cart, proceeding to checkout, and even handling dynamic elements. If a UI element shifts, the agent, powered by the model, can often adapt and find the new location, repairing test failures automatically – a feat demonstrated by Google’s own payments platform team. This significantly reduces maintenance overhead and accelerates development cycles.

Another compelling real-world example lies in web-based data extraction and entry. Consider a business that needs to regularly collect information from various vendor portals, input data into multiple legacy web forms, or compare prices across numerous online retailers. These tasks are typically time-consuming and tedious for human employees. An AI agent driven by Gemini 2.5 ‘Computer Use’ could be programmed to log into specific portals, navigate through complex menus, extract predefined data points (e.g., pricing, inventory levels, competitor information), and then input that data into an internal system or another web application, all autonomously. The type_text_at and click_at functions, combined with the model’s contextual understanding, allow it to execute these steps with precision and speed, freeing human workers to focus on analysis and strategic decision-making.
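A data-entry workflow like the one above could be expressed as a sequence of the model’s UI actions. The `build_entry_plan` helper and the field coordinates below are illustrative assumptions for the sketch; in practice the model itself plans these actions from live screenshots rather than from hard-coded coordinates.

```python
def build_entry_plan(record, field_coords, submit_coords):
    """Turn one extracted record into type_text_at / click_at actions."""
    plan = []
    for field, value in record.items():
        x, y = field_coords[field]
        plan.append({"name": "type_text_at",
                     "args": {"x": x, "y": y, "text": str(value)}})
    # Finish each record by clicking the form's submit button.
    plan.append({"name": "click_at",
                 "args": {"x": submit_coords[0], "y": submit_coords[1]}})
    return plan

# Records extracted from a vendor portal, to be entered into a legacy form.
records = [
    {"sku": "A-100", "price": 19.99},
    {"sku": "B-200", "price": 4.50},
]
coords = {"sku": (150, 220), "price": (150, 280)}
plans = [build_entry_plan(r, coords, (400, 360)) for r in records]
print(len(plans), "plans;", len(plans[0]), "actions each")
```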

Here are three actionable steps to begin exploring the potential of Gemini 2.5 ‘Computer Use’:

  1. Dive into the Public Preview: The most direct way to experience this innovative model is to access its public preview. Developers can find Gemini 2.5 ‘Computer Use’ available through Google AI Studio and Vertex AI. This provides an immediate entry point to experiment with its capabilities and integrate it into preliminary projects.
  2. Familiarize Yourself with the computer_use Tool and its Actions: Before building complex agents, spend time understanding the core computer_use tool and the 13 predefined UI actions it supports. Experiment with simple scripts that use click_at, type_text_at, and navigate to get a feel for how the model translates high-level instructions into concrete browser interactions. Consider the possibilities of extending this action space with custom functions for unique application needs.
  3. Identify Potential Use Cases for UI Testing or Web Automation: Look within your organization or personal projects for areas ripe for automation. Think about repetitive browser-based tasks, tedious data entry, or fragile UI testing routines. Gemini 2.5 ‘Computer Use’ is particularly well-suited for these challenges, offering a robust and intelligent alternative to traditional scripting methods. Start with a small, contained problem to measure its effectiveness and build confidence.

Conclusion

Google AI’s introduction of Gemini 2.5 ‘Computer Use’ marks a pivotal moment in the evolution of AI agents. By providing a sophisticated browser-control model, Google is empowering developers to build agents that can not only understand but also intelligently interact with the dynamic and complex world of user interfaces. With its robust set of predefined actions, extendable API, crucial safety mechanisms, and impressive performance on industry benchmarks, Gemini 2.5 ‘Computer Use’ stands poised to revolutionize web automation and UI testing. From automating intricate test repair to significantly accelerating operational workflows, the efficiency gains and enhanced reliability offered by this model are substantial. As this technology matures, we can anticipate a future where AI agents become indispensable partners in our digital lives, handling routine interactions with unparalleled autonomy and precision.

Ready to delve deeper into the capabilities of Gemini 2.5 ‘Computer Use’?

Check out the Technical details and our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter, join our 100k+ ML SubReddit, and Subscribe to our Newsletter.

Frequently Asked Questions (FAQ)

What is Gemini 2.5 ‘Computer Use’?

Gemini 2.5 ‘Computer Use’ is a specialized AI model developed by Google AI. It is designed to empower AI agents to interact directly with user interfaces, primarily web browsers, by planning and executing predefined UI actions, effectively mimicking human interaction for automation tasks.

How does Gemini 2.5 ‘Computer Use’ ensure safety during operations?

Google AI has integrated a robust, built-in safety monitor. This monitor blocks prohibited actions and, critically, requires explicit human confirmation for “high-stakes” operations like processing payments, sending messages, or accessing sensitive personal records. This ensures a balance between automation efficiency and user security.

What are the primary applications of this model?

The model is ideally suited for web automation and UI testing. Key applications include automating complex UI test repair, web-based data extraction and entry, and accelerating repetitive digital workflows across various industries. It aims to offload mundane tasks from human workers.

Can Gemini 2.5 ‘Computer Use’ interact with mobile applications?

While primarily optimized for web browsers, its flexible architecture allows for adaptation to mobile scenarios. Developers can integrate custom mobile actions within the same API loop, extending its capabilities to control mobile applications effectively, as demonstrated on the AndroidWorld benchmark.

Where can developers access the public preview?

The public preview of Gemini 2.5 ‘Computer Use’ is available to developers through Google AI Studio and Vertex AI. This provides an immediate entry point for experimentation and integration into preliminary projects.
