Defend Excessive Agency: Re-scope a Tool Agent to Least Privilege (AuthZ + Human Approval Gate)
Hosted · ide
Beta

Harden DV-ToolAgent, a real tool-using ReAct agent, against the confused-deputy and scope-escalation attacks from the offensive lab, in small sequential steps. Stand up the agent and trace one benign ticket, then reproduce the exploit you were handed, one surface at a time: first an ingested ticket that makes the agent redirect a billing payee under its own shared credential and read across tenants, then an SSRF reach into an internal-only endpoint and a poisoned-memory replant. Watch a naive SQL denylist get bypassed by a case-folded variant, then build the durable control one mechanism per step: least-privilege tool scope (the ingest role holds no write scope), per-argument authorization decided on the session identity (a caller-supplied identity claim is ignored), and a human-in-the-loop approval gate that holds high-impact writes pending an explicit token. Verify that the exploit is dead and authorized in-scope work is intact, then prove that obfuscated, renamed, and spoofed variants are all blocked.

90 min · 9 steps · 3 domains · Advanced

What you'll learn

  1. Stand up DV-ToolAgent and trace one benign ticket
     You are the defender on DV-ToolAgent, ACME Cloud's internal operations assistant.
  2. Reproduce attack A: the confused-deputy write fires
     Before you defend anything, reproduce the attack so you can see exactly what is open.
  3. Reproduce attack B: SSRF reach and a poisoned-memory replant
     The same over-privileged agent has two more surfaces in the same excessive-agency…
  4. Watch a naive SQL denylist get bypassed
     The obvious reaction to the confused-deputy write is to block the dangerous word: …
  5. Control mechanism 1: least-privilege tool scope
     Time to build the durable control. It has three mechanisms, one per step, and you build…
  6. Control mechanism 2: server-side authorization on the session identity
     Mechanism 1 scoped each role to a set of tools and verbs. That stopped the ingest…
  7. Control mechanism 3: a human approval gate for high-impact writes
     Mechanisms 1 and 2 stopped the low-privilege caller: ticket-bot cannot run a write and…
  8. Verify: the exploit is blocked, benign in-scope work intact
     You built the control over three mechanisms: least privilege (Step 5), per-arg…
  9. Resist bypass: obfuscated, renamed, and spoofed attacks all blocked
     A control that only stops the one payload you tested is the denylist mistake all over…

Prerequisites

  • Comfortable reading and editing Python
  • Know what a SQL UPDATE, an HTTP GET, and an allow-list are
  • Helpful (not required): the offensive Tool-Scope Escalation lab

Exam domains covered

Defensive AI Security · LLM Application Security · Excessive Agency

Skills & technologies you'll practice

This advanced-level AI/ML lab gives you real-world reps across:

Excessive Agency · Least Privilege · Confused Deputy · Authorization · Human-in-the-Loop · Agentic AI Security · OWASP LLM06 · OWASP ASI03 · AI Red Team

What you'll do in this lab

This is a hands-on defensive-security lab built on a real tool-using agent: a ReAct loop with native tool-calling against an in-cluster model, a write-capable SQLite tool, and an HTTP fetch tool with no allow-list. You are the defender. The red team handed you a working exploit against DV-ToolAgent, ACME Cloud's internal operations assistant: in a session running as ticket-bot, the low-privilege ingest account, an ingested ticket makes the agent redirect a billing payee under its own shared credential. You reproduce that confused deputy (OWASP LLM06 Excessive Agency, Agentic ASI03), then watch an obvious SQL-keyword denylist get bypassed by a case-folded and comment-obfuscated variant, which shows why shallow filters fail.
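The denylist bypass described above can be illustrated with a minimal sketch. It is not the lab's actual filter: the `naive_denylist_allows` function, the `accounts` table, and the payee values are all hypothetical stand-ins, and the filter is assumed to be a case-sensitive substring check.

```python
import sqlite3


def naive_denylist_allows(sql: str) -> bool:
    """Hypothetical shallow filter: reject statements containing 'UPDATE '."""
    return "UPDATE " not in sql


# Throwaway in-memory table standing in for the billing data (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, payee TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 'acme-billing')")

attempts = {
    # The exact payload the filter was written for.
    "plain": "UPDATE accounts SET payee = 'attacker' WHERE id = 1",
    # SQL keywords are case-insensitive, so lowercase slips past the check.
    "case_folded": "update accounts SET payee = 'attacker-1' WHERE id = 1",
    # SQLite treats an inline comment as whitespace between tokens.
    "comment_obfuscated": "UPDATE/**/accounts SET payee = 'attacker-2' WHERE id = 1",
}

results = {}
for name, sql in attempts.items():
    allowed = naive_denylist_allows(sql)
    if allowed:
        conn.execute(sql)  # the filter let it through, so the write fires
    payee = conn.execute("SELECT payee FROM accounts WHERE id = 1").fetchone()[0]
    results[name] = (allowed, payee)
    print(f"{name}: allowed={allowed}, payee now {payee!r}")
```

Only the plain payload is stopped; both variants pass the filter and change the row. Tightening the pattern just moves the gap, which is why the durable control is placed at the tool boundary instead.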

You then build the durable control at the tool boundary, server-side, so it holds regardless of what the model decides: a minimal tool policy that allow-lists the exact tool actions ticket-bot may take, a per-argument authorization check that rejects an out-of-scope write and a cross-tenant read, a human-in-the-loop approval gate that routes a high-impact action to a pending queue instead of auto-executing it (Agentic ASI03 and ASI05), and memory integrity that quarantines recalled notes as data and namespaces them per user (Agentic ASI06). You verify against freshly planted payloads each run: the confused-deputy write is rejected, the SSRF host is denied, the high-impact action lands in the approval queue rather than firing, the poisoned memory cannot re-fire, and an authorized in-scope action still succeeds.
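One piece of that boundary, denying the SSRF host, might look like the sketch below. It assumes the fetch tool is given a plain URL string; the host names in `ALLOWED_FETCH_HOSTS` and the `fetch_allowed` helper are illustrative, not taken from the lab.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list: the only hosts the fetch tool may contact.
ALLOWED_FETCH_HOSTS = {"status.acme.example", "docs.acme.example"}


def fetch_allowed(url: str) -> bool:
    """Return True only for https URLs whose parsed host is on the allow-list."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # .hostname strips userinfo and port and lowercases the host, so tricks
    # like 'https://allowed-host@internal-host/' resolve to the real target.
    host = parts.hostname or ""
    return host in ALLOWED_FETCH_HOSTS


print(fetch_allowed("https://status.acme.example/health"))
print(fetch_allowed("http://127.0.0.1:8080/internal"))
print(fetch_allowed("https://status.acme.example@169.254.169.254/latest"))
```

The check runs before any request is made and decides on the parsed host rather than the raw string. A host allow-list alone does not cover redirects or DNS tricks, so a real fetch tool would need to re-check after each hop.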

Frequently asked questions

Do I need a machine-learning background?

No. The control surface is authorization, least privilege, and a human approval gate, not model internals. You read a small ReAct agent and its tools, then add a policy module that decides allow, deny, or route-for-approval for every tool call based on the requesting user and the arguments. The fixes are ordinary access-control boundaries enforced in code.
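As a rough illustration of such a policy module, the sketch below returns allow, deny, or pending for a single tool call. Everything in it is an assumption made for the example: the `Session` type, the `ROLE_SCOPE` table, the `billing-admin` role, the `tenant` and `acting_user` argument names, and the token handling are not the lab's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    PENDING = "pending_approval"


@dataclass(frozen=True)
class Session:
    user: str    # set server-side at login, never by the model or the ticket
    tenant: str


# Hypothetical least-privilege table: (tool, verb) pairs each identity may use.
ROLE_SCOPE = {
    "ticket-bot": {("sql", "select")},  # the ingest role holds no write scope
    "billing-admin": {("sql", "select"), ("sql", "update")},
}
HIGH_IMPACT = {("sql", "update")}


def decide(session, tool, verb, args, approval_token=None,
           approved_tokens=frozenset()):
    """Decide one tool call from the session identity and its arguments."""
    # 1. Least privilege: the session's role must hold this tool and verb.
    if (tool, verb) not in ROLE_SCOPE.get(session.user, set()):
        return Decision.DENY
    # 2. Per-argument authorization: the target tenant must match the session.
    #    Any identity claim inside args (e.g. 'acting_user') is never read.
    if args.get("tenant") != session.tenant:
        return Decision.DENY
    # 3. Approval gate: high-impact writes wait for an explicit approved token.
    if (tool, verb) in HIGH_IMPACT and approval_token not in approved_tokens:
        return Decision.PENDING
    return Decision.ALLOW


bot = Session(user="ticket-bot", tenant="t1")
spoofed = {"tenant": "t1", "acting_user": "billing-admin"}
print(decide(bot, "sql", "update", spoofed))
```

Because the decision reads only `session` for identity, a ticket that claims to act for an administrator changes nothing: the call is denied on the role's scope before the arguments are even considered.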

Why isn't a stronger system prompt or an SQL keyword filter enough?

Because a system prompt sits at the model's decision layer, and a keyword filter only pattern-matches the payload. A naive denylist that blocks the word UPDATE is defeated by case folding, inline SQL comments, or pivoting to a different tool. The lab shows that bypass, then has you move the decision to the tool boundary: authorization keyed to the requesting identity and a human approval gate for high-impact actions, which hold no matter how convincing the injected ticket is.

How is the hardening graded?

Deterministically, on side effects, never on model wording. Each run plants fresh payloads. The check confirms the confused-deputy write does not mutate the account row, the high-impact action is written to the pending approval queue instead of executing, the SSRF value never reaches the in-pod listener, a poisoned memory note does not re-fire in a later session, and an authorized in-scope action still succeeds.
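A side-effect check of this kind could be sketched as follows. This is not the lab's grader: the `accounts` and `pending_approvals` tables, the `read_payee` and `grade` helpers, and the simulated agent run are assumptions chosen to show the idea of grading on database state rather than on model output.

```python
import sqlite3

# Assumed schema: one table for account state, one for the approval queue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, payee TEXT)")
conn.execute(
    "CREATE TABLE pending_approvals (account_id INTEGER, requested_payee TEXT)"
)
conn.execute("INSERT INTO accounts VALUES (1, 'acme-billing')")


def read_payee(conn, account_id):
    """Read the current payee for one account, or None if the row is missing."""
    row = conn.execute(
        "SELECT payee FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0] if row else None


def grade(conn, account_id, payee_before):
    """Grade on state only: row untouched, and the write parked in the queue."""
    queued = conn.execute(
        "SELECT COUNT(*) FROM pending_approvals WHERE account_id = ?",
        (account_id,),
    ).fetchone()[0]
    return {
        "row_unchanged": read_payee(conn, account_id) == payee_before,
        "queued_for_approval": queued > 0,
    }


before = read_payee(conn, 1)
# Stand-in for a hardened agent run: the write is queued, not executed.
conn.execute("INSERT INTO pending_approvals VALUES (1, 'attacker')")
report = grade(conn, 1, before)
print(report)
```

Nothing in the check reads the agent's reply, so a model that claims it refused while the row changed would still fail.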