The Role of Model Context Protocol (MCP) in Generative AI Security and Red Teaming

Estimated Reading Time: 8 minutes
- MCP as a Security Framework: The Model Context Protocol (MCP) provides a standardized, JSON-RPC-based framework for securing AI client-to-server interactions by formalizing tools, resources, and prompts, making agent/tool interactions explicit and auditable.
- Robust Authorization: MCP enforces strict OAuth 2.1-based authorization controls, including “no token passthrough” and mandatory audience binding, preventing confused-deputy attacks and ensuring servers act as first-class principals with auditable credentials.
- Clear Trust Boundaries & Containment: The protocol establishes explicit client-server trust boundaries, enabling granular consent UIs, precise scoping of AI agent capabilities (e.g., for secrets brokering), and enforcement of least-privilege principles.
- Deterministic Red Teaming: MCP’s typed tool schemas and replayable transports create deterministic attack surfaces, allowing security teams to develop reproducible tests for prompt injection, insecure output handling, and supply-chain vulnerabilities, mapping directly to taxonomies like OWASP LLM Top 10.
- Supply Chain Vigilance: Real-world incidents, such as the malicious postmark-mcp npm package, highlight the critical need to treat MCP servers as privileged connectors, necessitating strict allowlisting, code provenance, version pinning, and continuous monitoring for anomalous behavior.
- Unpacking Model Context Protocol: A Framework for Secure AI Interactions
- Enforcing Trust: MCP’s Normative Authorization Controls
- MCP in Action: Real-World Security Engineering and Red Teaming
- Conclusion
- Elevate Your AI Security Today
- Original Source Material
- Frequently Asked Questions
As Generative AI rapidly integrates into enterprise workflows, the imperative for robust security measures grows exponentially. Large Language Models (LLMs) and their agents interact with an expanding ecosystem of tools, data, and users, introducing complex attack surfaces. Traditional security protocols often fall short in this dynamic environment, necessitating a new approach to manage trust, control, and auditability. This is where the Model Context Protocol (MCP) emerges as a critical standard, offering a structured framework to secure AI client-to-server interactions and providing a fertile ground for proactive red-teaming exercises.
Unpacking Model Context Protocol: A Framework for Secure AI Interactions
The Model Context Protocol (MCP) is an open, JSON-RPC–based standard designed to formalize how AI clients—such as intelligent assistants, integrated development environments (IDEs), and web applications—connect with servers. These servers expose three fundamental primitives (tools, resources, and prompts) over defined transports: stdio for local interactions and Streamable HTTP for remote deployments.
MCP’s inherent value for security engineering lies in its ability to render agent/tool interactions explicit and auditable. By clearly defining these primitives, MCP clarifies control at each interaction point: tools are model-driven (actions callable by the model), resources are application-driven (readable data objects), and prompts are user-driven (reusable message templates). This distinction is vital for threat modeling; for instance, prompt injection often targets model-controlled paths, while insecure output handling frequently occurs at application-controlled joins.
The protocol also standardizes transports and the client/server lifecycle. Whether using local stdio to minimize network exposure or Streamable HTTP for multi-client web deployments, the choice of transport itself acts as a security control. This uniformity in discovering server capabilities, negotiating sessions, and exchanging messages allows security teams to easily instrument call flows, capture structured logs, and enforce pre/postconditions without the need for bespoke adapters for every integration. This explicit design enables tight blast-radius control for tool use, repeatable red-team scenarios at clear trust boundaries, and measurable policy enforcement.
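The uniformity of these message flows is easiest to see in code. The sketch below builds an MCP tools/call request as a JSON-RPC 2.0 message; the `fetch_secret` tool and its `policy_label` argument are hypothetical, but the envelope (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follows the protocol's tool-invocation shape.

```python
import json

def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# A hypothetical "fetch_secret" tool exposed by a secrets-broker server.
msg = make_tool_call(1, "fetch_secret", {"policy_label": "ci-deploy"})
```

Because every call crosses the wire in this one shape, a logging proxy or test harness can intercept, record, and replay it without per-integration adapters.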
Enforcing Trust: MCP’s Normative Authorization Controls
One of MCP’s most distinctive features is its unusually prescriptive approach to authorization, providing concrete requirements that significantly enhance security. This moves beyond typical integration protocols, insisting on robust identity and access management for AI agent interactions.
A cornerstone of MCP’s security model is its “no token passthrough” rule: “The MCP server MUST NOT pass through the token it received from the MCP client.” Instead, MCP servers operate as OAuth 2.1 resource servers. Clients obtain audience-bound tokens from an authorization server using RFC 8707 resource indicators, ensuring tokens are explicitly tied to the intended server. This crucial measure prevents confused-deputy paths, where a server might unknowingly act on behalf of a client with broader permissions than intended, and preserves upstream audit and limit controls.
Complementing this, servers are normatively required to perform audience binding and validation. This means servers MUST validate that the access token’s audience precisely matches themselves before processing any request. Operationally, this thwarts attempts to replay a client-minted token intended for “Service A” to an entirely different “Service B.” Red teams are specifically advised to include explicit probes for this failure mode. Together, these controls ensure that model-side capabilities, however powerful, are always mediated by servers that act as first-class principals, each with their own credentials, scopes, and auditable logs, rather than opaque conduits for a user’s global token.
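Once a token's claims are in hand, the audience-binding check itself is small. The sketch below assumes the token has already been cryptographically verified (signature, issuer, expiry) and shows only the audience comparison; the service URLs are illustrative.

```python
def validate_audience(token_claims: dict, server_resource: str) -> bool:
    """Reject any token whose audience does not name this server.

    Simplified sketch: a production resource server would first verify the
    token's signature, issuer, and expiry. Here we check only the `aud`
    claim, which RFC 8707 resource indicators bind to the intended server.
    """
    aud = token_claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return server_resource in audiences

# A token minted for Service A must not be accepted by Service B.
claims = {"aud": "https://service-a.example.com/mcp", "sub": "client-123"}
assert validate_audience(claims, "https://service-a.example.com/mcp")
assert not validate_audience(claims, "https://service-b.example.com/mcp")
```

A red-team probe for this failure mode is exactly the second assertion: replay a Service A token against Service B and treat any acceptance as a finding.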
MCP in Action: Real-World Security Engineering and Red Teaming
MCP isn’t just a theoretical construct; its explicit design principles translate directly into practical advantages for security engineering and red teaming, providing concrete levers for securing generative AI systems.
The client-server edge, formalized by MCP, establishes clear and inspectable trust boundaries. At this boundary, organizations can implement consent user interfaces, meticulously scope prompts, and deploy structured logging. Many client implementations, though not explicitly mandated by the standard, already present permission prompts that enumerate a server’s tools and resources before activation. This is invaluable for enforcing least-privilege principles and maintaining comprehensive audit trails. Furthermore, because an MCP server is a separate principal, it enables precise containment: a secrets-broker server, for example, can mint short-lived credentials and expose only highly constrained tools (e.g., “fetch secret by policy label”) rather than granting broad vault tokens directly to an AI model.
For red teaming, MCP offers a significant advantage: deterministic attack surfaces. With typed tool schemas and replayable transports, security teams can construct fixtures that simulate adversarial inputs at tool boundaries and rigorously verify post-conditions across models and clients. This capability yields reproducible tests for common classes of failures, including prompt injection, insecure output handling, and supply-chain abuse. Such tests can be directly mapped to recognized security taxonomies like OWASP’s LLM Top 10, ensuring comprehensive coverage.
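As an illustration of such a fixture, the sketch below replays a recorded tool result against an egress post-condition: every URL the tool reports fetching must resolve to an allowlisted host. The result schema and allowlist are assumptions for illustration, not part of the MCP specification.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.internal.example.com"}  # hypothetical egress policy

def check_egress_postcondition(tool_result: dict) -> bool:
    """Post-condition: every URL a tool reports touching is allowlisted."""
    return all(urlparse(u).hostname in ALLOWED_HOSTS
               for u in tool_result.get("fetched_urls", []))

# Replayable fixtures: injected text coerced the tool into fetching an
# attacker-controlled URL, so the post-condition must fail on `attack`.
benign = {"fetched_urls": ["https://api.internal.example.com/v1/data"]}
attack = {"fetched_urls": ["https://attacker.example.net/exfil"]}
```

Because the transport is replayable, the same fixture pair can be run unchanged against different models and clients, turning a one-off finding into a regression test.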
Real-World Example: The Malicious Postmark-MCP Server Incident
The theoretical risks MCP aims to mitigate became starkly real in late September 2025. Researchers disclosed a trojanized postmark-mcp npm package, which impersonated a legitimate Postmark email MCP server. Beginning with version 1.0.16, the malicious build silently BCC-exfiltrated every email sent through it to an attacker-controlled address/domain. This incident, believed to be the first publicly documented malicious MCP server in the wild, underscored a critical lesson: MCP servers often operate with high trust levels and must be vetted, allowlisted, and version-pinned with the same rigor applied to any other privileged connector in the supply chain. Operational takeaways included maintaining allowlists, requiring code provenance, monitoring for anomalous egress, and practicing credential rotation, illustrating that incident impact flowed directly from over-trusted server code in a routine developer workflow.
Actionable Steps for Hardening MCP Deployments:
- Implement Strict Client-Side Allowlisting and Consent: Require explicit user consent for starting local MCP servers, enumerate the tools/resources being enabled, and maintain an allowlist of approved servers with pinned versions and checksums, denying unknown servers by default.
- Enforce Robust Server-Side OAuth 2.1 Compliance: Ensure all MCP servers strictly implement OAuth 2.1 resource-server behavior, validating tokens and audiences, and critically, never forwarding client-issued tokens upstream. Prioritize minimal scopes and short-lived credentials.
- Develop Rapid Detection & Response Playbooks: Prepare break-glass automation to quickly revoke client approvals and rotate upstream secrets when a server is flagged. Alert on anomalous server egress patterns (e.g., unexpected destinations, BCC exfiltration) and sudden capability changes between versions.
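The allowlisting step above can be sketched as a deny-by-default approval check. The allowlist format (server name mapped to a pinned version and SHA-256 checksum) is an assumption for illustration.

```python
import hashlib

def approve_server(name: str, version: str, artifact: bytes,
                   allowlist: dict[str, tuple[str, str]]) -> bool:
    """Deny-by-default: approve a server only if its name, pinned version,
    and artifact checksum all match the allowlist entry."""
    entry = allowlist.get(name)
    if entry is None:
        return False  # unknown servers are rejected outright
    pinned_version, pinned_sha256 = entry
    return (version == pinned_version
            and hashlib.sha256(artifact).hexdigest() == pinned_sha256)

# Hypothetical vetted build of an email MCP server, pinned at 1.0.15.
artifact = b"vetted server build"
allowlist = {"postmark-mcp": ("1.0.15", hashlib.sha256(artifact).hexdigest())}
```

Pinning both version and checksum means a silently republished build, like the 1.0.16 release in the incident above, fails approval even if the package name is trusted.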
Conclusion
The Model Context Protocol is not a panacea but a foundational protocol that provides security and red-team practitioners with stable, enforceable levers. These include audience-bound tokens, explicit client-server boundaries, typed tool schemas, and instrumentable transports. By leveraging these capabilities, organizations can effectively constrain what AI agents can do, meticulously observe their actual actions, and reliably replay adversarial scenarios. Treating MCP servers as privileged connectors—vetting, pinning, and monitoring them—is paramount, as adversaries are already targeting this vector. With these practices firmly in place, MCP establishes itself as a practical cornerstone for building secure agentic systems and an indispensable substrate for rigorous red-team evaluation in the evolving landscape of generative AI.
Elevate Your AI Security Today
Explore the Model Context Protocol further to understand how its design principles can fortify your generative AI deployments. Review the official specifications, dive into current implementations by industry leaders, and begin integrating its security best practices into your development and red-teaming strategies. Secure your AI agent interactions and build resilient, trustworthy systems for the future.
Original Source Material
Overview
Model Context Protocol (MCP) is an open, JSON-RPC–based standard that formalizes how AI clients (assistants, IDEs, web apps) connect to servers exposing three primitives—tools, resources, and prompts—over defined transports (primarily stdio for local and Streamable HTTP for remote). MCP’s value for security work is that it renders agent/tool interactions explicit and auditable, with normative requirements around authorization that teams can verify in code and in tests. In practice, this enables tight blast-radius control for tool use, repeatable red-team scenarios at clear trust boundaries, and measurable policy enforcement—provided organizations treat MCP servers as privileged connectors subject to supply-chain scrutiny.
What MCP standardizes
An MCP server publishes: (1) tools (schema-typed actions callable by the model), (2) resources (readable data objects the client can fetch and inject as context), and (3) prompts (reusable, parameterized message templates, typically user-initiated). Distinguishing these surfaces clarifies who is “in control” at each edge: model-driven for tools, application-driven for resources, and user-driven for prompts. Those roles matter in threat modeling, e.g., prompt injection often targets model-controlled paths, while unsafe output handling often occurs at application-controlled joins.
Transports
The spec defines two standard transports—stdio (Standard Input/Output) and Streamable HTTP—and leaves room for pluggable alternatives. Local stdio reduces network exposure; Streamable HTTP fits multi-client or web deployments and supports resumable streams. Treat the transport choice as a security control: constrain network egress for local servers, and apply standard web authN/Z and logging for remote ones.
Client/server lifecycle and discovery
MCP formalizes how clients discover server capabilities (tools/resources/prompts), negotiate sessions, and exchange messages. That uniformity is what lets security teams instrument call flows, capture structured logs, and assert pre/postconditions without bespoke adapters per integration.
Normative authorization controls
The Authorization approach is unusually prescriptive for an integration protocol and should be enforced as follows:
No token passthrough
“The MCP server MUST NOT pass through the token it received from the MCP client.” Servers are OAuth 2.1 resource servers; clients obtain tokens from an authorization server using RFC 8707 resource indicators so tokens are audience-bound to the intended server. This prevents confused-deputy paths and preserves upstream audit/limit controls.
Audience binding and validation
Servers MUST validate that the access token’s audience matches themselves (resource binding) before serving a request. Operationally, this stops a client-minted token for “Service A” from being replayed to “Service B.” Red teams should include explicit probes for this failure mode.
This is the core of MCP’s security structure: model-side capabilities are powerful, but the protocol insists that servers be first-class principals with their own credentials, scopes, and logs—rather than opaque pass-throughs for a user’s global token.
Where MCP supports security engineering in practice
Clear trust boundaries
The client-server edge is an explicit, inspectable boundary. You can attach consent UIs, scope prompts, and structured logging at that edge. Many client implementations present permission prompts that enumerate a server’s tools/resources before enabling them—useful for least-privilege and audit—even though UX is not specified by the standard.
Containment and least privilege
Because a server is a separate principal, you can enforce minimal upstream scopes. For example, a secrets-broker server can mint short-lived credentials and expose only constrained tools (e.g., “fetch secret by policy label”), rather than handing broad vault tokens to the model. Public MCP servers from security vendors illustrate this model.
Deterministic attack surfaces for red teaming
With typed tool schemas and replayable transports, red teams can build fixtures that simulate adversarial inputs at tool boundaries and verify post-conditions across models/clients. This yields reproducible tests for classes of failures like prompt injection, insecure output handling, and supply-chain abuse. Pair those tests with recognized taxonomies.
Case study: the first malicious MCP server
In late September 2025, researchers disclosed a trojanized postmark-mcp npm package that impersonated a Postmark email MCP server. Beginning with v1.0.16, the malicious build silently BCC-exfiltrated every email sent through it to an attacker-controlled address/domain. The package was subsequently removed, but guidance urged uninstalling the affected version and rotating credentials. This appears to be the first publicly documented malicious MCP server in the wild, and it underscores that MCP servers often run with high trust and should be vetted and version-pinned like any privileged connector.
Operational takeaways:
- Maintain an allowlist of approved servers and pin versions/hashes.
- Require code provenance (signed releases, SBOMs) for production servers.
- Monitor for anomalous egress patterns consistent with BCC exfiltration.
- Practice credential rotation and “bulk disconnect” drills for MCP integrations.
These are not theoretical controls; the incident impact flowed directly from over-trusted server code in a routine developer workflow.
Using MCP to structure red-team exercises
- 1) Prompt-injection and unsafe-output drills at the tool boundary. Build adversarial corpora that enter via resources (application-controlled context) and attempt to coerce calls to dangerous tools. Assert that the client sanitizes injected outputs and that server post-conditions (e.g., allowed hostnames, file paths) hold. Map findings to LLM01 (Prompt Injection) and LLM02 (Insecure Output Handling).
- 2) Confused-deputy probes for token misuse. Craft tasks that try to induce a server to use a client-issued token or to call an unintended upstream audience. A compliant server must reject foreign-audience tokens per the authorization spec; clients must request audience-correct tokens with RFC 8707 resource. Treat any success here as a P1.
- 3) Session/stream resilience. For remote transports, exercise reconnection/resumption flows and multi-client concurrency for session fixation/hijack risks. Validate non-deterministic session IDs and rapid expiry/rotation in load-balanced deployments. (Streamable HTTP supports resumable connections; use it to stress your session model.)
- 4) Supply-chain kill-chain drills. In a lab, insert a trojaned server (with benign markers) and verify whether your allowlists, signature checks, and egress detection catch it—mirroring the Postmark incident TTPs. Measure time to detection and credential rotation MTTR.
- 5) Baseline with trusted public servers. Use vetted servers to construct deterministic tasks. Two practical examples: Google’s Data Commons MCP exposes public datasets under a stable schema (good for fact-based tasks/replays), and Delinea’s MCP demonstrates least-privilege secrets brokering for agent workflows. These are ideal substrates for repeatable jailbreak and policy-enforcement tests.
Implementation-Focused Security Hardening Checklist
Client side
- Display the exact command or configuration used to start local servers; gate startup behind explicit user consent and enumerate the tools/resources being enabled. Persist approvals with scope granularity. (This is common practice in clients such as Claude Desktop.)
- Maintain an allowlist of servers with pinned versions and checksums; deny unknown servers by default.
- Log every tool call (name, arguments metadata, principal, decision) and resource fetch with identifiers so you can reconstruct attack paths post-hoc.
Server side
- Implement OAuth 2.1 resource-server behavior; validate tokens and audiences; never forward client-issued tokens upstream.
- Minimize scopes; prefer short-lived credentials and capabilities that encode policy (e.g., “fetch secret by label” instead of free-form read).
- For local deployments, prefer stdio inside a container/sandbox and restrict filesystem/network capabilities; for remote, use Streamable HTTP with TLS, rate limits, and structured audit logs.
Detection & response
- Alert on anomalous server egress (unexpected destinations, email BCC patterns) and sudden capability changes between versions.
- Prepare break-glass automation to revoke client approvals and rotate upstream secrets quickly when a server is flagged (your “disconnect & rotate” runbook). The Postmark incident showed why time matters.
Governance alignment
MCP’s separation of concerns—clients as orchestrators, servers as scoped principals with typed capabilities—aligns directly with NIST’s AI RMF guidance for access control, logging, and red-team evaluation of generative systems, and with OWASP’s LLM Top-10 emphasis on mitigating prompt injection, unsafe output handling, and supply-chain vulnerabilities. Use those frameworks to justify controls in security reviews and to anchor acceptance criteria for MCP integrations.
Current adoption you can test against
- Anthropic/Claude: product docs and ecosystem material position MCP as the way Claude connects to external tools and data; many community tutorials closely follow the spec’s three-primitive model. This provides ready-made client surfaces for permissioning and logging.
- Google’s Data Commons MCP: released Sept 24, 2025, it standardizes access to public datasets; its announcement and follow-up posts include production usage notes (e.g., the ONE Data Agent). Useful as a stable “truth source” in red-team tasks.
- Delinea MCP: open-source server integrating with Secret Server and Delinea Platform, emphasizing policy-mediated secret access and OAuth alignment with the MCP authorization spec. A practical example of least-privilege tool exposure.
Summary
MCP is not a silver-bullet “security product.” It is a protocol that gives security and red-team practitioners stable, enforceable levers: audience-bound tokens, explicit client-server boundaries, typed tool schemas, and transports you can instrument. Use those levers to (1) constrain what agents can do, (2) observe what they actually did, and (3) replay adversarial scenarios reliably. Treat MCP servers as privileged connectors—vet, pin, and monitor them—because adversaries already do. With those practices in place, MCP becomes a practical foundation for secure agentic systems and a reliable substrate for red-team evaluation.
Resources used in the article
- MCP specification & concepts
- MCP ecosystem (official)
- https://www.anthropic.com/news/model-context-protocol
- https://docs.claude.com/en/docs/mcp
- https://docs.claude.com/en/docs/claude-code/mcp
- https://modelcontextprotocol.io/quickstart/server
- https://modelcontextprotocol.io/docs/develop/connect-local-servers
- https://modelcontextprotocol.io/docs/develop/connect-remote-servers
- Security frameworks
- Incident: malicious postmark-mcp server
- https://www.koi.security/blog/postmark-mcp-npm-malicious-backdoor-email-theft
- https://thehackernews.com/2025/09/first-malicious-mcp-server-found.html
- https://www.itpro.com/security/a-malicious-mcp-server-is-silently-stealing-user-emails
- https://threatprotect.qualys.com/2025/09/30/malicious-mcp-server-on-npm-postmark-mcp-exploited-in-attack/
- Example MCP servers referenced
- https://developers.googleblog.com/en/datacommonsmcp/
- https://blog.google/technology/developers/ai-agents-datacommons/
- https://github.com/DelineaXPM/delinea-mcp
- https://delinea.com/news/delinea-mcp-server-to-provide-secure-credential-access-for-ai-agents?hs_amp=true
- https://delinea.com/blog/unlocking-ai-agents-mcp
Frequently Asked Questions
What is Model Context Protocol (MCP)?
The Model Context Protocol (MCP) is an open, JSON-RPC–based standard that formalizes how AI clients connect with servers. It standardizes the exposure of three fundamental primitives: tools, resources, and prompts, facilitating secure and auditable interactions between AI agents and their environment.
How does MCP enhance AI security?
MCP enhances AI security through several mechanisms: it makes agent/tool interactions explicit and auditable, enforces robust OAuth 2.1-based authorization controls with audience binding and “no token passthrough” rules, establishes clear client-server trust boundaries for least-privilege enforcement, and enables deterministic attack surfaces for red-teaming exercises.
What is “no token passthrough” in MCP?
“No token passthrough” is a core security rule in MCP that mandates servers must not forward the access tokens they receive from clients upstream. Instead, MCP servers act as OAuth 2.1 resource servers, requiring clients to obtain audience-bound tokens specifically for the intended server. This prevents confused-deputy attacks and preserves auditability.
How does MCP aid in red-teaming?
MCP provides deterministic attack surfaces for red-teaming. With typed tool schemas and replayable transports, security teams can create repeatable fixtures to simulate adversarial inputs at tool boundaries, verify post-conditions, and test for vulnerabilities like prompt injection and insecure output handling. This allows for rigorous, reproducible testing mapped to security taxonomies.
What was the significance of the postmark-mcp incident?
The malicious postmark-mcp npm package incident was the first publicly documented case of a malicious MCP server in the wild. It highlighted that MCP servers often operate with high trust and must be subjected to stringent supply-chain security practices, including vetting, allowlisting, version-pinning, and continuous monitoring, similar to any other privileged connector.