Technology

Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

Author6 days ago

0 8 minutes read

Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results

Estimated Reading Time: 6 minutes

Claude Sonnet 4.5 achieves state-of-the-art results in software engineering, with a remarkable 77.2% accuracy on the SWE-bench Verified dataset, demonstrating superior code understanding, generation, and debugging.
The model significantly enhances autonomous AI agents, leading with a 61.4% score on OSWorld-Verified for general computer use, reflecting stronger tool control and UI manipulation for complex, multi-step tasks.
Anthropic introduces the Claude Agent SDK, providing robust scaffolding for developing long-horizon, reliable agents with enhanced memory management, planning, and tool orchestration capabilities.
Sonnet 4.5 is widely accessible through the Anthropic API, AWS Bedrock, Google Cloud Vertex AI, and GitHub Copilot, with pricing consistent with Sonnet 4, making advanced AI practical for developers and enterprises.
Its focus on extended autonomy (over 30 hours of uninterrupted focus) and hardened safety posture (ASL-3 with prompt-injection defenses) positions it for production-ready, high-impact AI solutions.

Setting a New Benchmark for Software Engineering Excellence
Empowering Next-Generation AI Agents with Unprecedented Autonomy
Accessibility and Integration: Where to Experience Sonnet 4.5
Actionable Steps for Leveraging Claude Sonnet 4.5
Conclusion
Frequently Asked Questions (FAQ)

The landscape of artificial intelligence is relentlessly evolving, pushing the boundaries of what machines can achieve. In a significant stride forward, Anthropic has unveiled Claude Sonnet 4.5, a powerful new iteration of its large language model family. This release isn’t merely an incremental update; it heralds a new era for AI in software engineering and general computer use, positioning itself as a formidable tool for developers and enterprises alike.

Sonnet 4.5 arrives with a clear mission: to tackle the complex, multi-step challenges of real-world computing and autonomous agentic workflows. With enhancements spanning coding proficiency, agent reliability, and broad accessibility, Anthropic aims to redefine the benchmarks for AI assistance in intricate, long-horizon tasks. The model’s debut promises to unlock unprecedented levels of automation and intelligent problem-solving across various applications.

Setting a New Benchmark for Software Engineering Excellence

At the heart of Claude Sonnet 4.5’s capabilities lies its groundbreaking performance in software development. The model demonstrates an unparalleled ability to understand, generate, and debug code, setting a new industry standard. This advancement is particularly significant for tasks requiring deep logical reasoning and meticulous attention to detail.

“Anthropic released Claude Sonnet 4.5 and sets a new benchmark for end-to-end software engineering and real-world computer use. The update also ships concrete product surface changes (Claude Code checkpoints, a native VS Code extension, API memory/context tools) and an Agent SDK that exposes the same scaffolding Anthropic uses internally. Pricing remains unchanged from Sonnet 4 ($3 input / $15 output per million tokens).

What’s actually new?

SWE-bench Verified record. Anthropic reports 77.2% accuracy on the 500-problem SWE-bench Verified dataset using a simple two-tool scaffold (bash + file edit), averaged over 10 runs, no test-time compute, 200K “thinking” budget. A 1M-context setting reaches 78.2%, and a higher-compute setting with parallel sampling and rejection raises this to 82.0%.

Computer-use SOTA. On OSWorld-Verified, Sonnet 4.5 leads at 61.4%, up from Sonnet 4’s 42.2%, reflecting stronger tool control and UI manipulation for browser/desktop tasks.

Long-horizon autonomy. The team observed >30 hours of uninterrupted focus on multi-step coding tasks — a practical jump over earlier limits and directly relevant to agent reliability.

Reasoning/math. The release notes “substantial gains” across common reasoning and math evals; exact per-bench numbers (e.g., AIME config). Safety posture is ASL-3 with strengthened defenses against prompt-injection.

What’s there for agents?

Sonnet 4.5 targets the brittle parts of real agents: extended planning, memory, and reliable tool orchestration. Anthropic’s Claude Agent SDK exposes their production patterns (memory management for long-running tasks, permissioning, sub-agent coordination) rather than just a bare LLM endpoint. That means teams can reproduce the same scaffolding used by Claude Code (now with checkpoints, a refreshed terminal, and VS Code integration) to keep multi-hour jobs coherent and reversible.

On measured tasks that simulate “using a computer,” the 19-point jump on OSWorld-Verified is notable; it tracks with the model’s ability to navigate, fill spreadsheets, and complete web flows in Anthropic’s browser demo. For enterprises experimenting with agentic RPA-style work, higher OSWorld scores usually correlate with lower intervention rates during execution.

Where you can run it?

Anthropic API & apps. Model ID claude-sonnet-4-5; price parity with Sonnet 4. File creation and code execution are now available directly in Claude apps for paid tiers.

AWS Bedrock. Available via Bedrock with integration paths to AgentCore; AWS highlights long-horizon agent sessions, memory/context features, and operational controls (observability, session isolation).

Google Cloud Vertex AI. GA on Vertex AI with support for multi-agent orchestration via ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and prompt caching.

GitHub Copilot. Public preview rollout across Copilot Chat (VS Code, web, mobile) and Copilot CLI; organizations can enable via policy, and BYO key is supported in VS Code.

Summary

With a documented 77.2% SWE-bench Verified score under transparent constraints, a 61.4% OSWorld-Verified computer-use lead, and practical updates (checkpoints, SDK, Copilot/Bedrock/Vertex availability), Claude Sonnet 4.5 is developed for long-running, tool-heavy agent workloads rather than short demo prompts. Independent replication will determine how durable the “best for coding” claim is, but the design targets (autonomy, scaffolding, and computer control) are aligned with real production pain points today.”

— Anthropic Official News

The headline achievement is Sonnet 4.5’s astounding 77.2% accuracy on the 500-problem SWE-bench Verified dataset. This score, achieved with a minimalist two-tool scaffold (bash + file edit) and a 200K “thinking” budget, underscores its robust problem-solving capabilities in realistic software engineering scenarios. With higher computational resources, this figure climbs to an impressive 82.0%, suggesting a scalable potential for even more complex tasks.

Beyond raw accuracy, Sonnet 4.5 exhibits remarkable long-horizon autonomy. The ability to maintain uninterrupted focus on multi-step coding tasks for over 30 hours represents a significant leap from previous models. This extended coherence is critical for large-scale projects, where maintaining context and managing dependencies are paramount. Furthermore, Anthropic reports substantial gains in reasoning and mathematical evaluations, bolstering the model’s ability to handle complex logical operations. Its ASL-3 safety posture, reinforced with advanced prompt-injection defenses, ensures responsible and secure deployment.

Empowering Next-Generation AI Agents with Unprecedented Autonomy

Sonnet 4.5 isn’t just about better coding; it’s a game-changer for autonomous AI agents. Anthropic has specifically engineered this model to address the historically “brittle parts” of real-world agents, such as extended planning, robust memory management, and reliable tool orchestration. The introduction of the Claude Agent SDK is pivotal here, providing developers with the same internal scaffolding Anthropic uses for its production systems.

This SDK empowers teams to build agents that can manage long-running tasks, handle permissions, and coordinate sub-agents effectively. By reproducing these battle-tested patterns, developers can create multi-hour jobs that are both coherent and reversible—a vital feature for debugging and ensuring reliability. The integration of Claude Code checkpoints, a refreshed terminal, and a native VS Code extension further streamlines the development and deployment of sophisticated agents.

The model’s proficiency in general computer use is highlighted by its leading 61.4% score on OSWorld-Verified, a substantial jump from Sonnet 4’s 42.2%. This 19-point improvement signifies Sonnet 4.5’s superior ability to control tools, manipulate UIs, navigate browsers, fill spreadsheets, and complete complex web flows. For enterprises, this translates directly into the potential for highly efficient, agentic RPA-style automation, where higher OSWorld scores correlate with significantly reduced human intervention rates during execution.

Real-World Example: Automated Customer Onboarding

Imagine a financial institution needing to onboard new customers. This typically involves navigating multiple web portals, extracting data from various documents, filling out forms, verifying identities, and updating internal systems. A Sonnet 4.5-powered agent could autonomously manage this entire multi-step process. Leveraging its OSWorld capabilities, it could log into banking platforms, cross-reference data, upload KYC documents, trigger API calls, and even interact with legacy desktop applications—all with minimal human oversight and high reliability over extended periods. This drastically reduces manual effort, speeds up onboarding, and minimizes errors.

Accessibility and Integration: Where to Experience Sonnet 4.5

Anthropic has ensured that Claude Sonnet 4.5 is not only powerful but also widely accessible across leading platforms. This broad availability means developers and organizations can integrate its advanced capabilities into their existing workflows with ease, accelerating innovation and deployment.

Anthropic API & Apps: Directly available via the Anthropic API using the model ID claude-sonnet-4-5. Crucially, pricing remains consistent with Sonnet 4 ($3 input / $15 output per million tokens), making the upgrade highly cost-effective. Paid tiers of Claude apps now also support file creation and code execution directly.
AWS Bedrock: Integration via AWS Bedrock provides seamless access, particularly highlighting its paths to AgentCore for building long-horizon agent sessions, leveraging memory/context features, and utilizing operational controls like observability and session isolation.
Google Cloud Vertex AI: Sonnet 4.5 is generally available (GA) on Vertex AI, offering robust support for multi-agent orchestration through ADK/Agent Engine, provisioned throughput, 1M-token analysis jobs, and efficient prompt caching.
GitHub Copilot: A public preview rollout is underway across Copilot Chat (within VS Code, web, and mobile environments) and Copilot CLI. Organizations can enable Sonnet 4.5 via policy, with Bring Your Own Key (BYO key) support available in VS Code for enhanced security and control.

Actionable Steps for Leveraging Claude Sonnet 4.5

Ready to put Sonnet 4.5 to work? Here are three concrete steps to get started:

For Developers & Engineers: Dive into the Claude Agent SDK. Experiment with building multi-step coding agents or leverage Sonnet 4.5’s enhanced capabilities directly through GitHub Copilot Chat and CLI for everyday development tasks. Focus on long-running processes that previously required significant manual oversight.
For Enterprises & Innovators: Explore Sonnet 4.5 for advanced agentic automation, especially in areas like Robotic Process Automation (RPA) or complex data processing workflows. Utilize its superior computer-use capabilities (as demonstrated by OSWorld-Verified scores) to reduce human intervention and improve efficiency across operational tasks.
For AI Researchers & Enthusiasts: Take advantage of its public availability on Bedrock, Vertex AI, and the Anthropic API. Conduct independent evaluations of its SWE-bench and OSWorld performance to contribute to the understanding of its practical efficacy and to identify new frontiers for AI autonomy.

Conclusion

Anthropic’s Claude Sonnet 4.5 marks a pivotal moment in the evolution of AI. By achieving state-of-the-art results in both software engineering and general computer interaction, it positions itself as an indispensable tool for developing highly autonomous, reliable, and intelligent agents. Its focus on long-running, tool-heavy workloads directly addresses some of the most pressing pain points in current AI applications, moving beyond mere demo prompts to truly production-ready capabilities.

While independent replication will undoubtedly further validate its claims, the architectural design and documented performance metrics strongly suggest that Sonnet 4.5 is poised to lead the charge in practical, high-impact AI solutions. Its widespread availability across key platforms ensures that this powerful model is within reach for a vast ecosystem of developers and organizations eager to innovate.

Ready to transform your coding and agentic workflows? Explore the official Anthropic documentation and API today, or integrate Claude Sonnet 4.5 through AWS Bedrock, Google Cloud Vertex AI, or GitHub Copilot to experience the future of AI-powered development.

Frequently Asked Questions (FAQ)

Q1: What is Claude Sonnet 4.5?

Claude Sonnet 4.5 is a new, advanced iteration of Anthropic’s large language model, designed to excel in complex, multi-step tasks such as software engineering and autonomous agentic workflows. It sets new benchmarks for coding proficiency and general computer use, offering enhanced reliability and extended autonomy.

Q2: How does Claude Sonnet 4.5 improve software engineering?

It achieves state-of-the-art results in software development, notably scoring 77.2% accuracy on the SWE-bench Verified dataset. This signifies its superior ability to understand, generate, and debug code, tackling deep logical reasoning and meticulous detail required in software engineering tasks. It also features long-horizon autonomy, maintaining focus on coding tasks for over 30 hours.

Q3: What is the Claude Agent SDK?

The Claude Agent SDK is a developer kit that exposes Anthropic’s internal production patterns for building robust, autonomous AI agents. It helps manage extended planning, memory, reliable tool orchestration, and sub-agent coordination for long-running, multi-step tasks, making agents more coherent and reversible.

Q4: Where can I access Claude Sonnet 4.5?

Claude Sonnet 4.5 is widely available through the Anthropic API (model ID claude-sonnet-4-5), AWS Bedrock, Google Cloud Vertex AI, and in public preview on GitHub Copilot (including Chat and CLI). Its pricing is consistent with Sonnet 4.

Q5: What are some practical applications of Claude Sonnet 4.5’s agentic capabilities?

Its enhanced computer-use capabilities (61.4% on OSWorld-Verified) make it ideal for tasks like automated customer onboarding, complex Robotic Process Automation (RPA), data extraction and processing across multiple platforms, and intelligent problem-solving requiring navigation of UIs and interaction with various applications.

The post Anthropic Launches Claude Sonnet 4.5 with New Coding and Agentic State-of-the-Art Results appeared first on MarkTechPost.

Author6 days ago

0 8 minutes read