About This Catalog

This is the third volume in a three-volume catalog of the working vocabulary of agentic AI. The first volume cataloged the patterns by which LLM calls, tools, and sub-agents are composed in time. The second cataloged the skills repositories that distribute model instructions in packaged form. This third volume catalogs the tools themselves --- the typed function-calling primitives that an LLM invokes during a turn.

“Tool” has a precise meaning here. A tool is a function defined by a name, a JSON input schema, and a runtime handler. The model emits a tool_use block; the runtime executes the handler with the supplied input; a tool_result is returned to the model on the next turn. Everything an agent does in the world --- reading a file, searching the web, running code, sending a message, deploying a service --- happens through this single mechanism.

The catalog covers about three dozen tools across nine sections: Anthropic-native primitives, search and retrieval, code execution sandboxes, filesystem and version control, browser and computer control, collaboration, cloud infrastructure, databases, and memory. Each tool gets a Fowler-style entry: classification, intent, motivating problem, how it works, when to use it (including alternatives), sources, an example, and example artifacts --- a tool schema or invocation sample or setup snippet, depending on what the tool needs.

Scope

Coverage:

Foundational Anthropic-native tools: web_search, web_fetch, code_execution, bash_tool, text_editor (str_replace family), computer use, memory tool, tool search, advisor tool.
Major MCP servers --- both the still-active Anthropic reference servers (Everything, Fetch, Filesystem, Git, Memory, Sequential Thinking, Time) and the consequential vendor-maintained ones.
Code-execution sandboxes that agents use to run untrusted code: E2B, Modal, and the dev-environment platforms (Daytona, Blaxel).
The principal collaboration tools (Slack, Gmail, Linear, Calendar) and cloud platforms (Cloudflare, AWS, Kubernetes).
Memory tools that give agents state across sessions: Anthropic’s memory tool, Mem0, Letta.

Out of scope:

Niche or industry-specific MCP servers (legal, medical, scientific) --- the patterns transfer but the inventory is too large to enumerate.
Framework-specific tool wrappers (LangChain tools, OpenAI Agents SDK tools) when an equivalent MCP server exists. Build once with MCP; use everywhere.
Closed/proprietary tools that aren’t addressable from a documented schema.
Skill-as-tool wrappers around long-running workflows --- those are covered in the Skills Catalog.

How to read this catalog

Part 1 (“The Narratives”) is conceptual orientation: what a tool is mechanically, how tools cleave into four families by reads/mutates and internal/external, where tools sit relative to skills and MCP servers, how to think about permission tiers, and how MCP’s fan-out architecture lets one server serve many agents. Five diagrams sit in Part 1; everything in Part 2 is text and code.

Part 2 (“The Tools”) is reference material organized by section. Each section opens with a short essay on what its tools have in common and how they relate to alternatives. Representative tools follow as individual Fowler-style entries. The entries are not meant to be read front-to-back; jump in via the table of contents to whatever matches the task at hand.

Part 1 — The Narratives

Five short essays frame the design space for agent tools. The reference entries in Part 2 assume the vocabulary established here.

Chapter 1. What a Tool Is

A tool, in the precise sense used throughout this catalog, is a function declaration with three parts: a unique name, a JSON input schema that the model must satisfy, and a runtime handler that executes when the model emits a matching tool_use block. The function-calling lifecycle is one of the small number of mechanisms that fully describe how every modern agent acts on the world.

Anatomy of a tool call — Function-calling lifecycle: the user prompts; the LLM emits a tool_use block; the runtime executes the handler; tool_result returns; the LLM continues.

The lifecycle has five recognizable steps. The user sends a message that includes a list of available tools (their names and schemas). The model receives the message and, if appropriate, emits one or more tool_use blocks, each naming a tool and supplying inputs that match the schema. The runtime --- not the model --- executes the handler for each tool_use and produces a tool_result block. The tool_result returns to the model on the next turn. The model can either emit more tool_use blocks (multi-step agent loop) or produce its final response to the user.
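The steps above can be sketched as a small loop. This is an illustration only: fake_model stands in for a real LLM call, and the add tool, the HANDLERS registry, and run_turn are hypothetical names rather than part of any SDK. Only the block vocabulary (tool_use, tool_result) comes from this chapter.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> runtime handler (schemas omitted for brevity).
HANDLERS: dict[str, Callable[[dict], Any]] = {
    "add": lambda inp: inp["a"] + inp["b"],
}

def fake_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call: requests one tool call, then answers."""
    last = messages[-1]
    if isinstance(last["content"], str):
        # First pass: the model nominates a tool and supplies inputs.
        return {"type": "tool_use", "id": "call_1", "name": "add",
                "input": {"a": 2, "b": 3}}
    # Second pass: the tool_result is in context, so the model can answer.
    result = last["content"][0]["content"]
    return {"type": "text", "text": f"The sum is {result}."}

def run_turn(user_text: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if reply["type"] == "text":
            return reply["text"]
        # The runtime, not the model, executes the handler.
        output = HANDLERS[reply["name"]](reply["input"])
        messages.append({"role": "assistant", "content": [reply]})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": reply["id"], "content": output},
        ]})
    raise RuntimeError("step budget exhausted")
```

Calling run_turn("What is 2 + 3?") goes around the loop twice: once to execute the nominated tool, once to let the stub model produce its final text.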

Three properties follow from this design. First, the model never executes the tool itself; it only nominates one. The runtime is always free to refuse, queue, gate, log, or transform the call --- this is the substrate on which approval policies and audit trails are built. Second, the schema is contractual: any input that doesn’t match is rejected before the handler runs, which narrows the model’s job to choosing a tool and filling in typed fields rather than free-form generation. Third, tool_result is just another content block; it flows back into the same context the model already had, which is what makes multi-step tool use composable.
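The second property, rejection before the handler runs, can be illustrated with a deliberately tiny validator. This is a sketch under stated assumptions: it checks only required fields and primitive types, where a production runtime would more likely rely on a full JSON Schema library. The validate and dispatch helpers and the tool dictionary layout are hypothetical.

```python
from typing import Any

# Deliberately tiny validator: required fields and primitive types only.
_PRIMITIVES = {"string": str, "integer": int, "number": (int, float)}

def _matches(value: Any, type_name: str) -> bool:
    if type_name == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return False
    return isinstance(value, _PRIMITIVES[type_name])

def validate(schema: dict, data: dict) -> list:
    """Return a list of human-readable problems; empty means the input is valid."""
    errors = []
    properties = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    for field, value in data.items():
        if field not in properties:
            errors.append(f"unexpected field: {field}")
        elif not _matches(value, properties[field]["type"]):
            errors.append(f"{field} must be of type {properties[field]['type']}")
    return errors

def dispatch(tool: dict, data: dict) -> dict:
    """Gate the call: the handler only runs when the input satisfies the schema."""
    errors = validate(tool["input_schema"], data)
    if errors:
        return {"is_error": True, "content": "; ".join(errors)}
    return {"is_error": False, "content": tool["handler"](data)}
```

Returning the validation errors as an error result, rather than raising, gives the model something concrete to correct on its next attempt.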

Chapter 2. The Four Tool Families

Tools differ in two dimensions that matter for design: whether they read or mutate state, and whether that state is internal (sandboxed, inside the agent’s context) or external (outside, in the real world). The 2x2 produces four families with sharply different cost profiles.

Retrieve (internal, reads): file_read, glob, grep, sql_select, vector_search, get_memory. No state change. Failures are recoverable by retrying. Cheap to call freely.
Compute (internal, mutates): code_execution with side effects, bash_in_sandbox, str_replace in a scratch directory, set_memory. Changes are bounded by the sandbox; the worst case is a discarded session.
Observe (external, reads): web_search, web_fetch, http_get, screenshot, monitoring_query. Reads real-world state. The world doesn’t change, but the calls cost money and can be rate-limited.
Act (external, mutates): send_email, git_push, deploy, payment, post_message, browser_click, file_delete. Changes the world. The worst case is unbounded --- sent emails, charged cards, deleted production data.

Designing an agent’s tool surface means choosing which families it has access to and matching the approval policy to the family. A research agent might have only Retrieve and Observe --- it can read anything but change nothing. A coding agent gets all four but with explicit human approval gates between Compute and Act. A trading agent’s Act tools each carry hard caps on cost per call.
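The 2x2 can be written down directly. In this sketch the Family enum, the classify function, and the small TOOLS table are illustrative names; the tool placements follow the family lists above.

```python
from enum import Enum

class Family(Enum):
    RETRIEVE = "retrieve"  # internal, reads
    COMPUTE = "compute"    # internal, mutates
    OBSERVE = "observe"    # external, reads
    ACT = "act"            # external, mutates

def classify(external: bool, mutates: bool) -> Family:
    if external:
        return Family.ACT if mutates else Family.OBSERVE
    return Family.COMPUTE if mutates else Family.RETRIEVE

# (external, mutates) flags for a few tools named in this chapter.
TOOLS = {
    "file_read": (False, False),
    "set_memory": (False, True),
    "web_search": (True, False),
    "send_email": (True, True),
}

def allowed_tools(families: set) -> list:
    """Tool surface for an agent that is granted only the given families."""
    return sorted(
        name for name, (external, mutates) in TOOLS.items()
        if classify(external, mutates) in families
    )
```

A research agent limited to Retrieve and Observe would be configured with allowed_tools({Family.RETRIEVE, Family.OBSERVE}), which leaves out every mutating tool.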

Chapter 3. Tools vs. Skills vs. MCP

Tools are the lowest layer in a three-layer system. Above them sit MCP servers (the packaging and delivery mechanism) and skills (the model instructions that direct when and how to use tools). The three layers complement each other; none of them substitutes for another.

Tools are primitives: a name, a schema, a handler. They are what gets called. MCP servers are the standard packaging for a set of tools --- a process or HTTP endpoint that exposes tools/list and tools/call, so the same set of tools is reachable from any MCP-compatible agent. Skills are model instructions stored as SKILL.md files; they direct the model on when a particular tool is appropriate, what inputs to construct, and how to interpret the tool_result. A single agent stack typically uses all three: native tools for what the platform built in, MCP for what vendors and the community publish, skills to tell the model when to reach for which.
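To make the packaging layer concrete, the two MCP methods named above travel as JSON-RPC 2.0 messages. The sketch below only builds the request payloads; transport, initialization, and response handling are left out, and the helper names and the echo tool in the usage note are hypothetical.

```python
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a client-chosen id

def list_tools_request() -> dict:
    """Ask an MCP server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": next(_request_ids), "method": "tools/list"}

def call_tool_request(name: str, arguments: dict) -> dict:
    """Invoke one tool by name with schema-conforming arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
```

For example, call_tool_request("echo", {"text": "hi"}) produces the payload an MCP client would send for a single call. Because every server speaks the same two methods, the same client code can reach any server.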

The mapping isn’t one-to-one. A skill can compose several tools into a workflow. A single MCP server might expose dozens of tools or just one. A tool can exist without any skill (the model figures out when to use it from the description alone) or without any MCP server (the runtime built it in directly, like Claude’s native web_search). The three layers exist because each answers a different question --- what, how shipped, when --- and answering them separately makes the system maintainable.

Chapter 4. ACI: Agent-Computer Interface Design

Anthropic calls the surface between a model and its tools the Agent-Computer Interface, or ACI. ACI design is to agents what API design is to web services: the contract that determines whether the system is usable. A well-designed ACI presents tools whose names and descriptions disambiguate clearly, whose inputs require the minimum specification needed to act, and whose outputs are structured for the model’s next decision rather than for human consumption.

The recurring failure modes are predictable. Tools with overlapping descriptions (“search_database” and “query_database”) confuse the model. Tools that return giant unstructured blobs burn context. Tools that require the model to guess at internal IDs, hashes, or magic numbers fail intermittently. Tools that take dozens of optional parameters get used wrong by all but the most capable models. Tools that don’t handle the empty case gracefully (“no results”) cause the model to fabricate.

Two scaling phenomena emerge once tool surfaces grow beyond a handful of entries. Context bloat: tool definitions consume context budget before any actual work; a typical multi-server stack can spend tens of thousands of tokens on definitions alone. Selection accuracy: a model’s ability to pick the right tool degrades meaningfully past 30–50 available tools, and falls off sharply past 100. The Tool Search Tool (Section A) exists to address both --- a deferred-loading mechanism that surfaces the few relevant tools on demand.
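The idea of deferred loading can be shown with a toy keyword search over a catalog. This is not how the Tool Search Tool is implemented; it is a minimal sketch in which CATALOG, STOPWORDS, and search_tools are hypothetical, and scoring is plain token overlap.

```python
import re

# Hypothetical catalog; in practice this could hold hundreds of definitions.
CATALOG = [
    {"name": "send_email", "description": "Send an email message to a recipient."},
    {"name": "sql_select", "description": "Run a read-only SQL query against the warehouse."},
    {"name": "create_issue", "description": "Create a ticket in the issue tracker."},
]

STOPWORDS = {"a", "an", "the", "to", "in"}

def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def search_tools(query: str, k: int = 2) -> list:
    """Return the names of up to k tools whose text overlaps the query."""
    wanted = _tokens(query)
    scored = []
    for tool in CATALOG:
        overlap = len(wanted & _tokens(tool["name"] + " " + tool["description"]))
        if overlap:
            scored.append((overlap, tool["name"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

Only the definitions returned by the search need to enter the context, so the cost of a large catalog is paid per query rather than per turn.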

The remedies for the per-tool failures are correspondingly predictable. Each tool covers a single well-defined task. Names disambiguate at a glance. Descriptions enumerate what the tool does, what it does NOT do, and the inputs that should and shouldn’t trigger it (the same “pushy” description discipline from the Skills Catalog applies here). Inputs are minimal; optional parameters have sensible defaults. Outputs are structured (JSON over prose where the next step is computation; prose where the next step is reasoning). Error paths are explicit --- “no results found” is a result, not an exception.
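The last remedy, treating “no results found” as a result, is easy to show. In this sketch the search_orders handler and its in-memory ORDERS table are hypothetical; the point is the shape of the return value.

```python
# Hypothetical data store standing in for a real backend.
ORDERS = {
    "cust_1": [{"order_id": "o-100", "total": 42.0}],
}

def search_orders(customer_id: str) -> dict:
    """Return a structured result in every case, including the empty one."""
    matches = ORDERS.get(customer_id, [])
    if not matches:
        # An explicit, structured empty result gives the model nothing to guess at.
        return {
            "status": "no_results",
            "count": 0,
            "results": [],
            "hint": "No orders exist for this customer_id.",
        }
    return {"status": "ok", "count": len(matches), "results": matches}
```

Both branches share the status, count, and results keys, so the model's next step can branch on status instead of parsing prose.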

The operational tests are simple. Drop the tool into a typical agent stack. Watch what happens when the model uses it. If the model frequently calls it with wrong inputs, the schema or description is the problem. If it fails to call when it should, the description is wrong. If it succeeds at the call but fails at the next step, the output is the problem. Iterate on the surface, not on the model.

Chapter 5. Permission Tiers and Safety

Different tools have different blast radii when called wrongly. A web_search call that returns the wrong document costs a few hundred tokens. A git_push call to the wrong branch can break the production deployment. A payment call with the wrong amount costs real money. Operational safety in an agent system is mostly about matching the approval gate to the tool’s blast radius.

Tier 1 (read-only) tools have no state change at all. They can be invoked freely; the only cost of being wrong is wasted tokens. web_search, web_fetch, file_read, sql_select, and vector_search live here. Most well-designed agents call these aggressively as the cheapest path to grounded information.

Tier 2 (sandboxed mutation) tools change state, but inside a sandbox the user trusts. code_execution in a Python sandbox, bash inside a container, str_replace in a scratch directory. The worst case is a corrupted session that gets discarded. These can also be invoked freely, with the runtime responsible for sandbox containment.

Tier 3 (persistent mutation) tools change durable state but the change is reversible. git commit (local), file_write to the project, set_memory, kubectl apply with --dry-run. Approval is recommended but not always required; the operational discipline is to provide an undo path.

Tier 4 (external side-effects) tools change the world outside the system. send_email, git_push, payment, post_message, deploy, browser_click on real sites. These should require explicit user approval, with every call logged for audit. The classic agent failure mode is a Tier 4 call that the model misclassified as Tier 2 --- thinking it was “just running some code,” the model triggered a production deployment.

Permission tiers are an operational property of the runtime, not the tool. The same tool (file_delete) is Tier 2 in a sandbox and Tier 4 against a real filesystem. The agent’s job is to nominate; the runtime’s job is to gate.
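A runtime-side gate might look like the following sketch. The Tier enum mirrors the four tiers above; HOST_TIER, SANDBOXABLE, the gate function, and its return strings are hypothetical names chosen for illustration.

```python
from enum import IntEnum

class Tier(IntEnum):
    READ_ONLY = 1
    SANDBOXED = 2
    PERSISTENT = 3
    EXTERNAL = 4

# Tier each tool would have when run against real systems (illustrative values).
HOST_TIER = {
    "web_search": Tier.READ_ONLY,
    "git_commit": Tier.PERSISTENT,
    "file_delete": Tier.EXTERNAL,
    "send_email": Tier.EXTERNAL,
}
# Tools whose effects a sandbox can fully contain.
SANDBOXABLE = {"git_commit", "file_delete"}

AUDIT_LOG: list = []

def effective_tier(tool: str, sandboxed: bool) -> Tier:
    tier = HOST_TIER[tool]
    if sandboxed and tool in SANDBOXABLE:
        return min(tier, Tier.SANDBOXED)
    return tier

def gate(tool: str, sandboxed: bool, approved: bool = False) -> str:
    """Decide what the runtime does with a call the model has nominated."""
    tier = effective_tier(tool, sandboxed)
    if tier <= Tier.SANDBOXED:
        return "run"
    if tier == Tier.PERSISTENT:
        return "run_with_undo"
    AUDIT_LOG.append((tool, approved))  # every Tier 4 decision is logged
    return "run" if approved else "needs_approval"
```

The same file_delete nomination passes straight through inside a sandbox and stops for approval on a real filesystem, while send_email needs approval in either setting because a sandbox cannot contain its effect.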

Part 2 — The Tools

Nine sections follow. Each opens with a short essay on what the section’s tools have in common and how they relate to alternatives. Representative tools are presented in the same Fowler-style template used by the prior two catalogs.

Sections at a glance

Section A --- Foundational primitives (Anthropic-native)
Section B --- Search and retrieval
Section C --- Code execution sandboxes
Section D --- Filesystem and version control
Section E --- Browser and computer control
Section F --- Collaboration and communication
Section G --- Cloud and infrastructure
Section H --- Databases
Section I --- Memory and state
Section J --- Designing tools for agents

The MCP fan-out diagram from Chapter 5 of Part 1 is reproduced once here to anchor the section discussions. Many of the entries that follow are MCP servers; the canonical hub is the official MCP Registry at registry.modelcontextprotocol.io, with the modelcontextprotocol/servers repository (80k+ GitHub stars as of May 2026) as the home for the seven still-active reference implementations and the redirection point to vendor-maintained alternatives for everything else.

MCP fan-out: one server, many agents — Most non-Anthropic-native tools in this catalog are MCP servers. One server reaches many agents.

Section A — Foundational primitives (Anthropic-native)

Tools built into the Anthropic platform --- no MCP server, no external dependency

Nine tools form the foundation of every Anthropic-hosted agent: web_search, web_fetch, code_execution, bash_tool, the text_editor family (view, str_replace, create), computer use, the memory tool, the tool search tool, and the advisor tool. They are built into the Claude API and exposed as first-class types rather than as MCP servers; the runtime that handles them is Anthropic’s own. The advantage is that they are highly tuned for the model that uses them --- schemas, descriptions, output formats, and error handling are all co-designed with the model’s training. The disadvantage is that they are only available where Anthropic’s runtime runs (the Claude API, Managed Agents, Claude Code, Claude.ai, Claude Platform on AWS), not across the entire agent ecosystem.

Seven of the nine tools cover the four families directly. web_search and web_fetch are Observe. code_execution and bash_tool are Compute (with sandbox isolation). The text_editor tools span Retrieve (view) and Compute (str_replace, create). Computer use is Act. The memory tool spans Retrieve (get) and Compute (set). The remaining two --- tool search and advisor --- are meta-tools: tool search is infrastructure for managing tool catalogs that grow past a few dozen entries, and advisor pairs a faster executor model with a stronger reviewer model for difficult agentic loops.

Where the rest of the catalog presents alternatives --- Tavily for search, E2B for sandboxes, Playwright for browser --- these are the defaults. Reach for an alternative only when an explicit feature gap (latency, scale, multi-LLM portability) warrants it.

web_search

Source: Anthropic platform tool, type web_search_20250305

Classification Observe (external + reads). Tier 1.

Intent

Search the live web from inside a Claude turn, returning ranked results with snippets and source attributions.

Motivating Problem

The model’s training has a cutoff; anything that happened after it, or anything that changes by the day (prices, leadership, software versions, news), can’t be answered from weights alone. web_search closes the gap by giving Claude live access to a search index from inside the tool-use loop.

How It Works

The tool is declared in the API call by type rather than by schema; Anthropic’s runtime supplies the underlying search backend. When activated, Claude can issue queries during its turn; results come back as structured snippets with URLs, titles, and citation indices. The model is trained to cite results inline using <cite> markers, which become first-class citation links in the API response.

Two design choices matter operationally. First, the tool is iterative: Claude can issue multiple queries in a single turn, refining as it reads results. Second, the runtime tracks tokens spent across queries; budget-conscious deployments can cap the number of searches per turn.
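A per-request search cap can be attached to the tool declaration itself. The sketch below assumes the optional max_uses and allowed_domains fields described in the web search tool documentation; verify the field names against the current reference before relying on them. The web_search_tool helper is hypothetical.

```python
import json
from typing import Optional

def web_search_tool(max_uses: Optional[int] = None,
                    allowed_domains: Optional[list] = None) -> dict:
    """Build a web_search declaration for the tools array of a messages call."""
    tool = {"type": "web_search_20250305", "name": "web_search"}
    if max_uses is not None:
        tool["max_uses"] = max_uses  # assumed field: cap on searches per request
    if allowed_domains:
        tool["allowed_domains"] = allowed_domains  # assumed field: restrict sources
    return tool

# Serialized form, as it would appear in the request body.
print(json.dumps([web_search_tool(max_uses=3)], indent=2))
```

Leaving both options unset yields the bare declaration shown under Example artifacts.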

When to Use It

Any task that depends on present-day facts: who currently holds a position, what something costs now, the current version of a library, breaking news, recent regulatory changes. Pair with web_fetch when the snippet isn’t enough and you need the full page content.

Alternatives --- Tavily, Exa, Perplexity, Brave Search MCPs (Section B). Use those when the deployment is non-Anthropic, when you need a specific backend’s characteristics, or when you want explicit control over query volume and cost.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool

Example

Researching a competitor: the agent issues web_search queries for “competitor company funding”, “competitor product launches 2026”, and “competitor pricing tiers”, then synthesizes the results with inline citations to source URLs.

Example artifacts

Tool schema.

# Declared by type, not by JSON schema, in the messages call:
tools: [
  { "type": "web_search_20250305", "name": "web_search" }
]

Invocation.

# Typical agent turn:
# 1. user: "Who is the current CEO of Acme Corp?"
# 2. Claude emits tool_use { name: web_search, input: { query: "Acme Corp CEO 2026" } }
# 3. runtime returns tool_result with top results
# 4. Claude reads the top result, may issue follow-up query
# 5. Claude responds with cited answer

web_fetch

Source: Anthropic platform tool, type web_fetch_20250910

Classification Observe (external + reads). Tier 1.

Intent

Fetch the full contents of a specific URL, returning the text content (markdown-extracted by default) to the model.

Motivating Problem

web_search returns snippets, which are deliberately short. For tasks that require reading the source --- long articles, documentation, GitHub READMEs, blog posts --- snippets aren’t enough. web_fetch is the obvious follow-on: search to find the URL, fetch to read it.

How It Works

Pass a URL; receive the page’s text content. The runtime handles HTML extraction (markdown is the default; raw HTML is available via parameter), token-limit truncation, and bot-blocking gracefully. For PDFs, web_fetch can return extracted text or base64 bytes. For pages that require auth, web_fetch fails predictably rather than silently --- the runtime doesn’t carry session cookies or credentials.

Operationally, the tool can only fetch URLs that the model has “seen”: URLs the user typed, URLs returned from a prior web_search, or URLs returned from another tool. The model can’t fabricate a URL and fetch it; this is a deliberate safety constraint.
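The “seen URL” constraint can be approximated in a few lines. This is a sketch of the idea, not Anthropic's implementation: seen_urls and may_fetch are hypothetical helpers, and the regular expression is a rough URL matcher rather than a full parser.

```python
import re

# Rough URL matcher: stops at whitespace, quotes, and closing brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")

def seen_urls(context_texts: list) -> set:
    """Collect every URL that has appeared in user input or tool results."""
    seen = set()
    for text in context_texts:
        for match in URL_RE.findall(text):
            seen.add(match.rstrip(".,;"))  # drop trailing sentence punctuation
    return seen

def may_fetch(url: str, context_texts: list) -> bool:
    """Permit a fetch only for a URL that is already present in the context."""
    return url in seen_urls(context_texts)
```

A URL the model composes itself, for instance one that encodes private context into a query string, fails the check because it never appeared verbatim in the conversation.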

When to Use It

After a web_search when snippets are insufficient. When the user provides a specific URL to read. When following links from one page to another during deep research. Skip web_fetch for sites that obviously require login --- the request will fail.

Alternatives --- Firecrawl MCP for industrial-strength scraping with markdown conversion at scale; the official Anthropic Fetch reference MCP server when running outside the Anthropic platform.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/web-fetch-tool

Example

User asks for a summary of a specific blog post. web_search isn’t needed because the URL is already provided. web_fetch returns the full post; Claude summarizes from the actual content rather than from training-time knowledge of the blog.

Example artifacts

Tool schema.

tools: [
  { "type": "web_fetch_20250910", "name": "web_fetch" }
]

Invocation.

# Claude emits:
{
  "name": "web_fetch",
  "input": { "url": "https://www.anthropic.com/news/..." }
}
# Returns markdown of the page text.

code_execution

Source: Anthropic platform tool, type code_execution_20250825

Classification Compute (internal + mutates). Tier 2.

Intent

Execute Python code in a sandboxed environment with file persistence, network access, and pre-installed scientific libraries.

Motivating Problem

Many tasks the user asks an agent to do are computational: parse this CSV, plot this dataset, solve this equation, run this regex against this corpus. The model can write the code but can’t run it; without a runtime, the user has to copy-paste. code_execution closes the gap by giving Claude its own Python interpreter for the duration of the turn.

How It Works

The runtime provides a per-session container with Python 3, scientific libraries (pandas, numpy, scipy, matplotlib, scikit-learn, and many others pre-installed), filesystem access scoped to /mnt/, and a working directory at /home/claude. Files persist for the session’s duration; the container is wiped between sessions. The model emits code; the runtime executes it and returns stdout, stderr, and any inline images produced.

The code_execution tool composes with the Files API: the user uploads a file, the model reads it from /mnt/user-data/uploads/, processes it, writes outputs to /mnt/user-data/outputs/, and shares the result via the present_files tool. This is the same mechanism behind Claude’s built-in document creation skills.

Critically, the sandbox has network access (subject to allowlists configurable by the runtime) but no access to the user’s real filesystem. Permission tier 2 by construction: a wrong call costs a discarded session, not user data.

When to Use It

Any computational task: data analysis, plotting, file format conversion, web scraping, small ML experiments. Generating downloadable files (CSVs, images, Excel) for the user. Validating code before suggesting it to the user.

Alternatives --- E2B (Section C) for the same functionality outside the Anthropic platform, with finer-grained control over the sandbox lifecycle. Modal sandboxes for compute-heavy workloads (GPUs, large memory).

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/code-execution-tool

Example

User uploads a CSV and asks “what’s the average revenue by region?” Claude calls code_execution with pandas to load the file, group by region, and compute the mean. The result is a small markdown table; if the user asked for a chart, matplotlib produces an inline PNG.

Example artifacts

Tool schema.

tools: [
  { "type": "code_execution_20250825", "name": "code_execution" }
]
# Beta header on the messages call:
# anthropic-beta: code-execution-2025-08-25

Invocation.

# Claude emits:
{
  "name": "code_execution",
  "input": {
    "code": "import pandas as pd\ndf = pd.read_csv('/mnt/user-data/uploads/sales.csv')\nprint(df.groupby('region')['revenue'].mean())"
  }
}

bash_tool

Source: Anthropic platform tool, type bash_20250124

Classification Compute (internal + mutates). Tier 2 in sandbox; Tier 3–4 on host.

Intent

Execute shell commands in a Linux environment. Used heavily by Claude Code and by Anthropic-hosted agent containers.

Motivating Problem

Many development tasks need the shell, not just Python: running tests, building artifacts, running CLI tools, managing packages with pip or npm. Giving an agent bash access turns it from a code-suggester into a code-runner. The classic risk is well-known --- a misfired rm -rf or a wrong git push has bigger blast radius than a Python typo.

How It Works

The runtime exposes a single bash tool with a command parameter. Each call executes in a stateful shell session; the working directory persists across calls. stdout, stderr, and exit codes return as structured fields in the tool_result.
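One way a runtime could keep the working directory across calls is sketched below. It is an assumption-laden illustration, not the platform's handler: ShellSession is a hypothetical class that starts a fresh shell per call and carries only the directory forward (not environment variables or shell functions), and it falls back to sh when bash is not installed.

```python
import os
import shutil
import subprocess
import tempfile

class ShellSession:
    """Run commands one at a time while remembering the working directory."""

    MARKER = "__SESSION_CWD__"

    def __init__(self, cwd: str = ""):
        self.cwd = os.path.realpath(cwd or tempfile.mkdtemp())
        self.shell = shutil.which("bash") or "sh"

    def run(self, command: str) -> dict:
        # After the command, print a marker plus $PWD so the session can
        # pick up any directory change, then exit with the command's status.
        script = (
            f"{command}\n"
            "rc=$?\n"
            f"printf '\\n{self.MARKER}%s' \"$PWD\"\n"
            "exit $rc\n"
        )
        proc = subprocess.run([self.shell, "-c", script], cwd=self.cwd,
                              capture_output=True, text=True)
        stdout = proc.stdout
        if self.MARKER in stdout:  # absent if the command itself called exit
            stdout, _, new_cwd = stdout.rpartition("\n" + self.MARKER)
            self.cwd = new_cwd
        return {"stdout": stdout, "stderr": proc.stderr,
                "return_code": proc.returncode}
```

A long-lived shell process would preserve more state; the per-call approach trades that for simpler isolation and timeout handling.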

The blast radius depends on the deployment. In Anthropic’s hosted sandboxes (Claude Code in the cloud, the code-execution containers), bash runs in an isolated environment with no access to user systems; mistakes are Tier 2. When Claude Code runs locally, bash runs against the developer’s actual filesystem; mistakes are Tier 3–4, which is why Claude Code uses approval prompts for destructive commands and why skills like Matt Pocock’s git-guardrails-claude-code exist to add hooks.

When to Use It

Coding tasks that need the build, test, or run loop. Anywhere the natural tool is a CLI rather than a Python library. Pair with text_editor tools for an end-to-end coding agent: view the file with text_editor, modify with str_replace, run tests with bash.

Alternatives --- code_execution when the task is Python-shaped and a shell escape isn’t needed. MCP-based shell servers when running outside the Anthropic platform.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/bash-tool

Example

Running the test suite for a Python project: the bash tool executes pytest and returns the failures; the agent reads the test output, modifies the failing code with str_replace, re-runs pytest, and iterates until green.

Example artifacts

Tool schema.

tools: [
  { "type": "bash_20250124", "name": "bash" }
]

Invocation.

{
  "name": "bash",
  "input": {
    "command": "pytest tests/ -x"
  }
}

# Returns: { stdout, stderr, return_code, ... }

text_editor (view / str_replace / create)

Source: Anthropic platform tools, type text_editor_20250728

Classification Retrieve (view) and Compute (str_replace, create). Tier 2–3.

Intent

Read, modify, and create files via a focused editing API that maps naturally to how the model reasons about code changes.

Motivating Problem

An agent editing files needs three primitives: see the current state, make a small change, or create a new file. A blunt write-the-whole-file API is wasteful (the model regenerates content that didn’t change) and unsafe (silently overwrites). text_editor splits the operation into view, str_replace (single string find-and-replace within a file), and create (new file with content) --- each focused and auditable.

How It Works

view returns the file’s contents with line numbers prepended. The line numbers are display-only; they don’t need to be included when calling str_replace.

str_replace takes old_str and new_str. The runtime locates old_str in the file (must be uniquely matching) and replaces it with new_str. Match failures (because the string isn’t present or appears multiple times) return an error rather than guessing. This is what makes the tool reliable: the model can’t accidentally do a partial overwrite.
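The unique-match rule is simple to express. This sketch operates on an in-memory string instead of a file, and the error messages are illustrative rather than the tool's actual wording.

```python
def str_replace(text: str, old_str: str, new_str: str) -> str:
    """Replace old_str with new_str only if old_str occurs exactly once."""
    occurrences = text.count(old_str)
    if occurrences == 0:
        raise ValueError("old_str not found; view the file and copy the exact text")
    if occurrences > 1:
        raise ValueError(
            f"old_str matches {occurrences} times; include more surrounding context"
        )
    return text.replace(old_str, new_str, 1)
```

Refusing ambiguous matches pushes the model to quote enough context to pin down a single location, which is what keeps the edit auditable.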

create writes a new file with given content. If the file exists, the call fails --- use str_replace to edit existing files. The trio composes naturally: view, identify the exact string to change, str_replace; or create for new files.

When to Use It

Any coding agent. Any task that edits text files. The pattern view-then-edit is the dominant agent shape in Claude Code, Cursor (which uses similar primitives under different names), and the Anthropic-hosted document agents.

Alternatives --- a write-whole-file tool when the file is small and the model knows the full content. Filesystem MCP (Section D) when running outside the Anthropic platform.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/text-editor-tool

Example

Fixing a typo in a config file: view the file, find the line containing the typo, str_replace the old version with the corrected version. The change is precise; the rest of the file is untouched.

Example artifacts

Tool schema.

tools: [
  { "type": "text_editor_20250728", "name": "str_replace_based_edit_tool" }
]

Invocation.

# view
{
  "name": "str_replace_based_edit_tool",
  "input": { "command": "view", "path": "/repo/config.yaml" }
}

# str_replace
{
  "name": "str_replace_based_edit_tool",
  "input": {
    "command": "str_replace",
    "path": "/repo/config.yaml",
    "old_str": "port: 8080",
    "new_str": "port: 9090"
  }
}

computer use

Source: Anthropic platform tool, type computer_20250124

Classification Act (external + mutates). Tier 4 against real systems.

Intent

Control a computer’s screen, keyboard, and mouse as a human would --- move the cursor, click, type, take screenshots, scroll.

Motivating Problem

Many systems an agent might need to use don’t have APIs: legacy enterprise software, web apps without exposed automation, desktop applications, browsers without a Playwright integration. Computer use is the universal fallback: give the model a screen, a mouse, and a keyboard, and let it operate the thing as a human would.

How It Works

The model sees the screen via screenshots (returned as image inputs in tool_result blocks). It emits actions --- mouse_move with coordinates, left_click, type with a string, key with a keyboard shortcut, scroll. The runtime executes the action against the target system (typically a virtual desktop) and returns a fresh screenshot.

The loop is screenshot → reason → act → screenshot. It is slow (each turn is a screenshot interpretation), expensive (image tokens add up), and fragile (small UI changes invalidate the model’s plan). But it works for any visible system.

Anthropic’s implementation runs against an Ubuntu desktop in a sandbox by default. Production deployments use E2B Desktop, Anthropic’s own browser sandbox, or local VMs with the appropriate hooks.

When to Use It

Last-resort automation: legacy software with no API, web apps that resist scripting, desktop applications. Demo and prototype work where building a real integration is overkill. NOT for production workflows that have a real API or MCP server available --- computer use is the slowest and most expensive option for any task that has an alternative.

Alternatives --- Playwright MCP for web automation that does have programmatic access; browser-use as a focused alternative built specifically for browser tasks; direct API calls when the target has one.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
Anthropic computer-use-demo repository

Example

An internal sales tool that’s a 15-year-old Windows desktop app with no API. Computer use opens it in a virtual desktop, navigates to the customer record screen, reads the data on screen, and produces a structured summary for the agent’s next step.

Example artifacts

Tool schema.

tools: [
  {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1024,
    "display_height_px": 768
  }
]

Invocation.

# Take a screenshot:
{ "name": "computer", "input": { "action": "screenshot" } }

# Click somewhere:
{
  "name": "computer",
  "input": { "action": "left_click", "coordinate": [512, 384] }
}

# Type a string:
{
  "name": "computer",
  "input": { "action": "type", "text": "hello world" }
}

memory tool

Source: Anthropic platform tool (beta), type memory_20250818

Classification Retrieve (get) and Compute (set). Tier 3 (persistent).

Intent

Give an agent durable cross-conversation memory --- write notes, recall them in future sessions, and apply context editing to keep long-running conversations manageable.

Motivating Problem

Each model call has a fixed context window; an agent that learns nothing across calls relearns the user’s preferences every session. Stuffing previous conversations back into context works for short histories but doesn’t scale. The memory tool gives Claude a key-value notebook it can write to and read from across sessions, plus context-editing strategies that prune long conversations intelligently.

How It Works

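The memory tool exposes a small set of operations that amount to get, set, delete, and list. In the memory_20250818 tool these are expressed as file-style commands against a memory directory (view, create, str_replace, insert, delete, rename): view covers get and list, while create and str_replace cover set. Behind the scenes, the runtime persists the memory in a per-user store (configurable; the default in Claude.ai is per-account). The model writes free-form notes; the descriptions in the schema steer it toward durable facts (“user prefers tea over coffee”) rather than ephemeral state (“user is currently asking about coffee”).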

Context editing is the partner mechanism. Two strategies are configurable: clear_tool_uses_20250919 prunes old tool_use blocks when context grows large; clear_thinking_20251015 prunes extended thinking blocks similarly. Both are configurable by trigger thresholds and retention policies; both keep long conversations within the context window without the agent losing track of recent steps.
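The pruning behavior described above can be pictured with a toy model. The sketch below is a local simulation, not the API: the function names, the four-characters-per-token estimate, the "[cleared]" placeholder, and the threshold values are all assumptions made for illustration.

```python
from typing import Dict, List


def estimate_tokens(messages: List[Dict]) -> int:
    """Very rough size estimate: about 4 characters per token (assumption)."""
    return sum(len(str(m["content"])) for m in messages) // 4


def clear_old_tool_results(messages: List[Dict], trigger_tokens: int,
                           keep_last: int) -> List[Dict]:
    """Blank out all but the newest `keep_last` tool results.

    Nothing is touched until the estimated size passes `trigger_tokens`,
    mirroring the trigger-threshold plus retention-policy idea.
    """
    if estimate_tokens(messages) <= trigger_tokens:
        return list(messages)
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool_result"]
    to_clear = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        {**m, "content": "[cleared]"} if i in to_clear else m
        for i, m in enumerate(messages)
    ]


history = [
    {"role": "user", "content": "Audit the repo."},
    {"role": "tool_result", "content": "x" * 4000},
    {"role": "tool_result", "content": "y" * 4000},
    {"role": "tool_result", "content": "z" * 4000},
]
pruned = clear_old_tool_results(history, trigger_tokens=2000, keep_last=1)
print(estimate_tokens(history), "->", estimate_tokens(pruned))
```

The most recent tool result survives intact, which is what lets the agent keep track of its latest step while older bulk output is dropped.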

When to Use It

Long-running agents that need to learn user preferences across sessions. Customer support agents that build up product knowledge over time. Research assistants that accumulate findings across multiple sessions. Avoid for tasks that don’t cross session boundaries; the overhead isn’t justified.

Alternatives --- Mem0 and Letta (Section I) for richer memory architectures with vector retrieval, knowledge graphs, and explicit memory categories. Anthropic’s memory tool is simpler but tightly integrated with the model.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool
Anthropic cookbook: tool_use/memory_cookbook.ipynb

Example

Session 1: user mentions they’re vegan and a tire-fitter by trade. Claude writes both facts to memory. Session 2 (weeks later, new conversation): user asks for restaurant recommendations near their work. Claude reads memory, sees “vegan,” filters recommendations accordingly without asking again.

Example artifacts

Tool schema.

tools: [

{ "type": "memory_20250818", "name": "memory" }

]

# Beta header:

# anthropic-beta: context-management-2025-06-27

Invocation.

# Write to memory:

{ "name": "memory", "input": {

"command": "create",

"path": "/memories/preferences.md",

"file_text": "Diet: vegan since 2018\n"

} }

# Read from memory:

{ "name": "memory", "input": {

"command": "view",

"path": "/memories/preferences.md"

} }

tool search

Source: Anthropic platform tool, types tool_search_tool_regex_20251119 and tool_search_tool_bm25_20251119

Classification Retrieve (read) — over the agent's own tool catalog. Tier 1. Infrastructure.

Intent

Let Claude work with hundreds or thousands of tools by dynamically discovering and loading them on demand, instead of loading every tool definition into context upfront.

Motivating Problem

Two scaling problems compound quickly as tool libraries grow. First, context bloat: a typical multi-server stack (GitHub, Slack, Sentry, Grafana, Splunk) can consume around 55,000 tokens in tool definitions before the model does any actual work --- burning budget on tools that go uncalled in 19 turns out of 20. Second, selection accuracy: Claude’s ability to pick the right tool degrades meaningfully once you exceed 30–50 available tools, and falls off sharply past 100.

Tool search addresses both: definitions stay in a catalog and the model searches for them when it needs them. Anthropic reports the mechanism typically reduces tool-definition context usage by over 85% while keeping selection accuracy high across catalogs of thousands of tools.

How It Works

Two variants are available: the regex variant (tool_search_tool_regex_20251119), where Claude constructs Python re.search() patterns to search tool names, descriptions, argument names, and argument descriptions; and the BM25 variant (tool_search_tool_bm25_20251119), where Claude uses natural-language queries against a BM25 index over the same fields. Maximum query length is 200 characters. Both return the 3–5 most relevant tool references per call.

Tool definitions are marked deferred via the defer_loading: true flag on the tool definition. Tools without that flag stay loaded into context immediately; deferred tools live in the catalog and load only when the model discovers them through search. The tool search tool itself must never be deferred.

Operationally: Claude searches; the runtime returns tool_reference blocks pointing at matching tool names; the runtime then expands each reference into the full tool definition before passing it back to the model. References persist across conversation turns, so discovered tools can be reused without re-searching. Prompt caching is preserved because deferred tools are not part of the system-prompt prefix.

Limits: up to 10,000 tools in a catalog; model support on Sonnet 4.0+, Opus 4.0+, Haiku 4.5+, and Claude Mythos Preview. ZDR-eligible.
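The search-then-expand flow for the regex variant can be sketched locally. This is an illustration under stated assumptions, not the platform's implementation: the three catalog entries, the helper names (searchable_text, search_tools, expand), and the choice to return matches in catalog order are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical deferred-tool catalog; names and descriptions are made up.
CATALOG: Dict[str, Dict] = {
    "github_create_issue": {
        "description": "Create an issue in a GitHub repository",
        "input_schema": {"properties": {"title": {"description": "Issue title"}}},
        "defer_loading": True,
    },
    "pagerduty_list_incidents": {
        "description": "List incidents that paged the on-call engineer",
        "input_schema": {"properties": {"since": {"description": "Only incidents after this time"}}},
        "defer_loading": True,
    },
    "slack_post_message": {
        "description": "Post a message to a Slack channel",
        "input_schema": {"properties": {"channel": {"description": "Channel name"}}},
        "defer_loading": True,
    },
}


def searchable_text(name: str, tool: Dict) -> str:
    """Concatenate the four searched fields: name, description, arg names, arg descriptions."""
    parts = [name, tool["description"]]
    for arg, spec in tool["input_schema"].get("properties", {}).items():
        parts.extend([arg, spec.get("description", "")])
    return " ".join(parts)


def search_tools(pattern: str, catalog: Dict[str, Dict], limit: int = 5) -> List[Dict]:
    """Return up to `limit` tool references whose text matches the regex."""
    if len(pattern) > 200:
        raise ValueError("query exceeds 200 characters")
    regex = re.compile(pattern)
    hits = [
        name for name, tool in catalog.items()
        if tool.get("defer_loading") and regex.search(searchable_text(name, tool))
    ]
    return [{"type": "tool_reference", "tool_name": name} for name in hits[:limit]]


def expand(refs: List[Dict], catalog: Dict[str, Dict]) -> List[Dict]:
    """Turn references into full definitions, as the runtime does before the next turn."""
    return [
        {"name": r["tool_name"],
         "description": catalog[r["tool_name"]]["description"],
         "input_schema": catalog[r["tool_name"]]["input_schema"]}
        for r in refs
    ]


refs = search_tools("incident", CATALOG)
print(refs)
print(expand(refs, CATALOG)[0]["name"])
```

Only the matching definition is expanded; the other catalog entries never reach the model's context, which is the source of the token savings.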

When to Use It

Any agent with more than ~10 tools, especially when tool definitions consume more than 10K tokens. MCP-powered systems with multiple servers (200+ tools). Tool libraries expected to grow over time. Tool sets where selection accuracy issues have been observed.

Operational tips: keep the 3–5 most-frequently-used tools as non-deferred for low-latency baseline operation. Use consistent namespacing (e.g. github_, slack_) so that search queries naturally surface the right group. Add a system-prompt section describing available tool categories so the model knows when to search.

When traditional tool calling is fine: fewer than 10 tools total; every tool used in nearly every request; very small definitions.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool
anthropic.com/engineering/advanced-tool-use
anthropic.com/engineering/effective-context-engineering-for-ai-agents

Example

An ops agent connected to GitHub, Slack, Sentry, Grafana, Splunk, Linear, and PagerDuty MCP servers --- around 250 tools total. The full definitions would consume 60K+ tokens. Tool search loads only the search tool and a small starter set up front; when the user asks “which incidents were paged last night?”, Claude searches for “incident”, gets 3–5 relevant references from the PagerDuty MCP, and proceeds. The other 240+ tool definitions never enter context.

Example artifacts

Tool schema.

tools = [

{ "type": "tool_search_tool_regex_20251119",

"name": "tool_search_tool_regex" },

{ "name": "get_weather",

"description": "Get the weather at a specific location",

"input_schema": { ... },

"defer_loading": True },

{ "name": "search_files",

"description": "Search through files in the workspace",

"input_schema": { ... },

"defer_loading": True },

# ... up to 10,000 deferred tools

]

Invocation.

# Claude internally:

# 1. user asks about weather

# 2. emits server_tool_use { name: tool_search_tool_regex, input: { query: "weather" } }

# 3. runtime returns tool_references: [ { tool_name: "get_weather" } ]

# 4. runtime expands the reference into the full get_weather definition

# 5. Claude calls get_weather normally

advisor

Source: Anthropic platform tool, type advisor_20260301 (beta header advisor-tool-2026-03-01)

Classification Compute (sub-inference). Tier 1. Infrastructure.

Intent

Let a faster, lower-cost executor model consult a higher-intelligence advisor model mid-generation for strategic guidance, without breaking out of the single request boundary.

Motivating Problem

Long-horizon agentic workloads --- coding agents, computer-use agents, multi-step research pipelines --- have a recurring shape: most turns are mechanical (running a command, reading a result, applying a small edit), but a few moments require the kind of high-quality planning that only a top-tier model produces. Running the entire task on the top-tier model wastes money on the mechanical turns; running it all on a smaller model produces brittle plans. The advisor pattern --- a strong reviewer that the smaller executor consults at critical junctures --- captures most of the strong-model quality at close to the small-model cost.

How It Works

The advisor tool runs inside a single /v1/messages request. The executor model (the top-level model field) decides when to call the advisor; the advisor model (the model field inside the tool definition) is a separate sub-inference that runs server-side with the full transcript. The advisor sees the system prompt, all prior turns, all tool results, and all prior advice. It produces a 400–700-token plan or course-correction (1,400–1,800 tokens including its thinking) and returns it to the executor as an advisor_tool_result block.

Model pairing must be compatible: the advisor must be at least as capable as the executor. Currently supported pairings include Haiku 4.5 / Sonnet 4.6 / Opus 4.6 / Opus 4.7 as executors with Opus 4.7 as the advisor.

The executor decides timing; the server supplies the context. The server_tool_use block the executor emits has an empty input --- nothing the executor writes is forwarded; the server reconstructs the advisor’s view from the transcript automatically.

There are two independent caching layers: executor-side caching of the advisor_tool_result, like any other content block; and advisor-side caching of the advisor’s own transcript across calls within the same conversation (enabled via caching: { type: "ephemeral", ttl: "5m" }). Advisor-side caching breaks even at roughly three advisor calls per conversation and improves from there.

When to Use It

Coding agents and computer-use agents where the executor model is Sonnet or Haiku and you want occasional access to Opus-level planning. Multi-step research and analysis pipelines with both mechanical and strategic phases. Anywhere the bulk of tokens are generated by the executor doing routine work and a few critical decisions justify the stronger model.

Best-practice timing for coding tasks: call the advisor early (after orientation reads but before substantive writing) and at the end (after file writes and test outputs are in the transcript). Two to three calls per task tend to produce the best quality-cost trade-off; aggressive limiting via max_uses caps cost.

Weaker fit: single-turn Q&A, pass-through model pickers where the user already chose model and cost trade-off explicitly, workloads where every turn genuinely needs the advisor model’s full capability.
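The cost argument behind a small number of advisor calls can be made concrete with back-of-envelope arithmetic. The sketch below is a simplification under loud assumptions: the two prices are placeholders rather than published rates, the 200,000-token executor workload is invented, and input-token and caching costs are ignored. Only the 1,800-token upper bound per advisor call comes from the figures above.

```python
def blended_cost(executor_tokens: int, advisor_calls: int,
                 advisor_tokens_per_call: int,
                 executor_price: float, advisor_price: float) -> float:
    """Output-token cost of an executor run plus its advisor consultations.

    Prices are per million output tokens. Input-token and caching costs
    are deliberately left out to keep the arithmetic visible.
    """
    per_million = 1_000_000
    executor_cost = executor_tokens * executor_price / per_million
    advisor_cost = advisor_calls * advisor_tokens_per_call * advisor_price / per_million
    return executor_cost + advisor_cost


# Placeholder prices (NOT published rates): the advisor costs 5x the executor.
EXECUTOR_PRICE = 1.0
ADVISOR_PRICE = 5.0

# Hypothetical task: 200,000 executor output tokens, three advisor calls.
mixed = blended_cost(200_000, advisor_calls=3, advisor_tokens_per_call=1_800,
                     executor_price=EXECUTOR_PRICE, advisor_price=ADVISOR_PRICE)

# Same task run entirely on the advisor-class model.
advisor_only = blended_cost(200_000, advisor_calls=0, advisor_tokens_per_call=0,
                            executor_price=ADVISOR_PRICE, advisor_price=ADVISOR_PRICE)

print(f"executor plus 3 advisor calls: {mixed:.3f}")
print(f"advisor model for everything:  {advisor_only:.3f}")
```

Under these placeholder numbers the advisor calls add a small fraction to the executor-only bill, while running everything on the stronger model multiplies it. Real ratios depend on actual prices and on how much transcript the advisor has to read on each call.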

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/advisor-tool

Example

A coding agent running on Sonnet 4.6 building a concurrent worker pool in Go. Early in the task, after a few exploratory reads, the executor calls the advisor. Opus 4.7 produces a concise plan: channel-based coordination, close the input channel first then wait on a WaitGroup, watch for writer starvation. The executor writes code following the plan, runs tests, and calls the advisor a second time after a failure. Total cost: roughly Sonnet-class with one or two Opus sub-inferences; quality: closer to Opus-solo than to Sonnet-solo.

Example artifacts

Tool schema.

tools: [

{

"type": "advisor_20260301",

"name": "advisor",

"model": "claude-opus-4-7",

"max_uses": 5,

"caching": { "type": "ephemeral", "ttl": "5m" }

}

]

# Beta header:

# anthropic-beta: advisor-tool-2026-03-01

Invocation.

# Executor emits with empty input:

{ "type": "server_tool_use",

"id": "srvtoolu_abc123",

"name": "advisor",

"input": {} }

# Server runs advisor sub-inference, then returns:

{ "type": "advisor_tool_result",

"tool_use_id": "srvtoolu_abc123",

"content": {

"type": "advisor_result",

"text": "Use a channel-based coordination pattern. The tricky part is draining in-flight work during shutdown: close the input channel first, then wait on a WaitGroup..."

}

}

Section B — Search and retrieval

Non-Anthropic search backends, scraping engines, documentation lookup, and vector retrieval

Anthropic’s native web_search and web_fetch cover most search needs inside Claude’s API. Beyond that, four categories of search-and-retrieval tools matter: third-party search backends (Tavily, Exa, Perplexity, Brave) for non-Anthropic deployments or specific backend characteristics; web-scraping engines (Firecrawl) for large-scale or JavaScript-heavy sites; documentation lookup tools (Context7) that pull version-pinned package docs into context; and vector retrieval tools that search the agent’s own knowledge base.

These tools all sit at Tier 1 (read-only). The dominant cost concern is not safety but money --- each call hits a paid API and tokens add up quickly. A common operational pattern is to put the cheapest backend (often Brave) in front of a more expensive one (Tavily, or Exa for semantically similar pages), with the expensive backend invoked only when the cheap one fails.
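The cheap-first pattern can be sketched as a small wrapper. This is a minimal illustration, assuming that "fails" means either an exception or too few results; the function names and the two stub backends are hypothetical stand-ins for real search clients.

```python
from typing import Callable, List, Tuple

SearchFn = Callable[[str], List[str]]


def tiered_search(query: str, cheap: SearchFn, expensive: SearchFn,
                  min_results: int = 3) -> Tuple[str, List[str]]:
    """Try the cheap backend first; fall back only if it errors or returns too little."""
    try:
        results = cheap(query)
    except Exception:
        results = []
    if len(results) >= min_results:
        return "cheap", results
    return "expensive", expensive(query)


# Stub backends standing in for real API clients (hypothetical data).
def cheap_backend(query: str) -> List[str]:
    index = {"python tutorial": ["url-a", "url-b", "url-c"]}
    return index.get(query, [])


def expensive_backend(query: str) -> List[str]:
    return [f"semantic-hit-{i}-for-{query}" for i in range(3)]


print(tiered_search("python tutorial", cheap_backend, expensive_backend))
print(tiered_search("obscure niche topic", cheap_backend, expensive_backend))
```

Returning the tier label alongside the results makes it easy to log how often the expensive backend is actually needed, which is the number that drives the bill.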

Tavily / Exa / Perplexity / Brave — the search backends

Source: MCP servers: tavily, exa, perplexity-mcp, brave-search-mcp

Classification Observe (external + reads). Tier 1.

Intent

Live web search via four distinctly different backends, each surfacing different parts of the web with different trade-offs.

Motivating Problem

Different agent workloads need different search behavior. Tavily is built for AI agents and returns clean, ranked, deduped results suitable for direct ingestion. Exa indexes by embedding similarity rather than keyword, so it surfaces semantically-related pages that a keyword index misses. Perplexity returns AI-summarized answers with citations rather than raw search results. Brave maintains an independent index that doesn’t share Google’s biases and offers a permissive free tier.

How It Works

All four are MCP servers (or have official MCP servers). Setup is similar: install the package, set the API key as an environment variable, register the server with the agent. The tools they expose vary: search is universal; Tavily adds extract and crawl; Exa adds find_similar; Perplexity exposes perplexity_ask and perplexity_reason (a reasoning-model variant) as two distinct tools; Brave includes brave_local_search for local-business queries on top of brave_web_search.

A useful pattern is to expose more than one of these and let the model select. Perplexity for “summarize the recent literature on X”, Exa for “find me pages similar to this one”, Tavily for “give me the top sources on Y for direct reading”. The model picks based on its description-level understanding of each tool.

When to Use It

Agents running outside the Anthropic platform that need a search tool. Agents inside the Anthropic platform that need a specific backend’s characteristics (semantic similarity from Exa; AI-summarized answers from Perplexity). Cost-sensitive deployments that want fine-grained control over query volume.

Pitfall: don’t install every search MCP simultaneously “just in case.” The model is more likely to be confused by overlapping descriptions than to pick the best one. Install only the backends that fill a distinct role, each with a documented justification.

Sources

tavily.com, exa.ai, perplexity.ai, brave.com/search/api
registry.modelcontextprotocol.io

Example

Building a research assistant for academic literature. Perplexity-reason as the primary tool (returns synthesized answers with citations); Exa as the secondary for finding related papers; web_fetch to read the actual papers once URLs are known.

Example artifacts

Invocation.

// Tavily search call:

{ "name": "tavily_search",

"input": { "query": "GraphQL vs REST 2026", "max_results": 5 } }

// Exa find_similar:

{ "name": "exa_find_similar",

"input": { "url": "https://martinfowler.com/articles/..." } }

// Perplexity reasoning:

{ "name": "perplexity_reason",

"input": { "messages": [{ "role": "user",

"content": "What changed in EU AI Act compliance between 2024 and 2026?" }] } }

Setup.

// Example Claude Desktop / Cursor config for Tavily:

{
  "mcpServers": {
    "tavily": {
      "command": "npx",
      "args": [
        "-y",
        "tavily-mcp"
      ],
      "env": {
        "TAVILY_API_KEY": "tvly-..."
      }
    }
  }
}

Firecrawl

Source: MCP server: firecrawl-mcp (Mendable)

Classification Observe (external + reads). Tier 1.

Intent

Industrial-strength web scraping --- search, scrape, crawl, and extract structured data from JavaScript-heavy sites, returning clean markdown.

Motivating Problem

web_fetch handles simple page fetches. It does not handle JavaScript-rendered single-page apps, sites with anti-bot defenses, multi-page crawls, or structured-data extraction across a site. Firecrawl exists for those cases: a headless-browser-backed scraping engine with markdown conversion baked in.

How It Works

Firecrawl runs a headless Chromium pool with anti-bot evasion, JavaScript rendering, and proxy rotation. The MCP server exposes four primary tools: firecrawl_search (search + fetch in one call), firecrawl_scrape (single URL with markdown output), firecrawl_crawl (multi-page following links with configurable depth), and firecrawl_extract (LLM-assisted structured extraction matching a schema).

The structured-extraction tool is particularly useful: rather than the agent reading raw markdown and parsing fields, Firecrawl applies its own LLM pass with a user-supplied schema and returns typed JSON. For product-catalog scraping or contact-info extraction across many pages, this dramatically reduces the agent’s token usage.

When to Use It

Scraping at scale, especially with JavaScript-rendered content. Extracting structured data from many similar pages. When web_fetch fails because the site blocks simple HTTP requests.

Alternatives --- ScrapingBee MCP, Apify MCP for similar capabilities with different pricing models. Direct Playwright MCP control (Section E) when you need fine-grained interaction with the page beyond “fetch the rendered HTML”.

Sources

firecrawl.dev
github.com/mendableai/firecrawl-mcp-server

Example

An agent tasked with monitoring competitors’ pricing pages. firecrawl_crawl across each competitor’s pricing path; firecrawl_extract pulls tier names, prices, and feature lists into a structured schema. Output ready for direct comparison without parsing.

Example artifacts

Invocation.

// Structured extraction:

{
  "name": "firecrawl_extract",
  "input": {
    "urls": ["https://competitor.com/pricing"],
    "schema": {
      "type": "object",
      "properties": {
        "tiers": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" },
              "features": { "type": "array", "items": { "type": "string" } }
            }
          }
        }
      }
    }
  }
}

Setup.

{
  "mcpServers": {
    "firecrawl": {
      "command": "npx",
      "args": [
        "-y",
        "firecrawl-mcp"
      ],
      "env": {
        "FIRECRAWL_API_KEY": "fc-..."
      }
    }
  }
}

Context7

Source: MCP server: @upstash/context7-mcp

Classification Retrieve (reads). Tier 1.

Intent

Pull current, version-pinned documentation for 9,000+ libraries directly into the agent’s context, so the model writes code against real APIs rather than from training-cutoff memory.

Motivating Problem

An agent writing code against a library it learned about during training is one of the most common hallucination paths: the API has changed; the method name is wrong; the imports are stale. Context7 fixes this by maintaining a daily-refreshed documentation index across thousands of popular libraries and exposing it through MCP, so the model can fetch the current docs for whichever library it’s about to write code against.

How It Works

Two tools: resolve-library-id (turn a library name like “docx” into a Context7 ID), and get-library-docs (fetch the relevant doc sections for that ID, with token budgeting). The agent typically pairs them: resolve first, then fetch the specific topics it needs rather than the full docs.

The agent is typically prompted (through its system prompt or a skill) to call Context7 before writing code that imports a library. The token cost per call is small; the savings from avoiding hallucinated APIs are large. Context7 is one of the three tools in the widely-cited “starter pack” recommendation (GitHub MCP + Context7 + Playwright MCP).
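The resolve-then-fetch pairing can be sketched as a small helper around whatever function the agent runtime uses to call MCP tools. Everything apart from the two tool names and their input fields is an assumption: call_tool is a hypothetical callable, the "text" key on the docs response is invented, and fake_call_tool is a stub standing in for a live Context7 server.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def fetch_library_docs(call_tool: ToolCaller, library_name: str,
                       topic: str, tokens: int = 5000) -> str:
    """Resolve a library name to an ID, then fetch docs for one topic."""
    resolved = call_tool("resolve-library-id", {"libraryName": library_name})
    docs = call_tool("get-library-docs", {
        "context7CompatibleLibraryID": resolved["libraryId"],
        "topic": topic,
        "tokens": tokens,  # token budget for the returned excerpt
    })
    return docs["text"]  # response shape is assumed for this sketch


def fake_call_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Stub MCP caller so the sketch runs without a server."""
    if name == "resolve-library-id":
        return {"libraryId": "/dolanmiu/" + args["libraryName"]}
    if name == "get-library-docs":
        return {"text": f"docs for {args['context7CompatibleLibraryID']} on {args['topic']}"}
    raise ValueError(f"unknown tool: {name}")


print(fetch_library_docs(fake_call_tool, "docx", "tables"))
```

Fetching one topic at a time, rather than the full documentation, is what keeps the per-call token cost small.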

When to Use It

Any coding agent that produces code against external libraries. Particularly valuable for libraries that have evolved since the model’s training cutoff. Skip for standard-library code in mainstream languages.

Alternatives --- web_search + web_fetch to read docs ad-hoc, but Context7 is faster and more structured. Docfork is a competing service with similar positioning.

Sources

context7.com
github.com/upstash/context7

Example

Agent asked to generate a Word document with the docx library. Before writing code, agent calls Context7 for “docx”, then specifically for the numbering and table sections. Code that follows uses the actual current API rather than what the model remembers from training data.

Example artifacts

Invocation.

// 1. Resolve

{ "name": "resolve-library-id",

"input": { "libraryName": "docx" } }

// Returns: { libraryId: "/dolanmiu/docx" }

// 2. Fetch docs for a specific topic

{ "name": "get-library-docs",

"input": {

"context7CompatibleLibraryID": "/dolanmiu/docx",

"topic": "tables",

"tokens": 5000

}

}

Setup.

{
  "mcpServers": {
    "context7": {
      "command": "npx",
      "args": [
        "-y",
        "@upstash/context7-mcp"
      ]
    }
  }
}

Vector retrieval (Pinecone, Weaviate, embedded stores)

Source: MCP servers per vendor; also LangChain/LlamaIndex retrieval tools

Classification Retrieve (internal + reads). Tier 1.

Intent

Semantic search over a private corpus --- the agent’s knowledge base, internal documentation, customer support archives, codebases.

Motivating Problem

An agent often needs to recall from a corpus that doesn’t exist anywhere on the public web: internal company docs, the codebase of the current project, past customer-support tickets, the user’s personal notes. Vector retrieval is the standard primitive for this: documents are chunked, embedded, and stored in a vector index; the agent queries by embedding similarity rather than keyword.

How It Works

A retrieval tool typically exposes a single search (or query, or retrieve) function with a query string and optional filters (top-K, metadata constraints). The runtime computes the query embedding, looks up nearest neighbors, and returns matching chunks with their source documents.

The implementation varies enormously --- Pinecone and Weaviate are hosted services with their own MCP servers; LanceDB and ChromaDB embed locally; LlamaIndex and LangChain are framework-shaped wrappers that abstract the backend. From the agent’s perspective, the tool surface is mostly the same.

Two non-obvious operational properties matter. First, chunking strategy affects retrieval quality more than embedding choice does --- documents chunked at section boundaries with 10–20% overlap outperform fixed-size chunks. Second, retrieval is rarely sufficient on its own; pair with web_fetch-style follow-up tools that can read the full source document when a chunk hit looks relevant.
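The chunk, embed, and nearest-neighbor steps can be shown end to end with a toy pipeline. This is a sketch under stated assumptions: blank lines stand in for section boundaries, a bag-of-words count stands in for a real embedding model, and the function names and the 15% overlap are illustrative choices, not any vendor's API.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_by_section(text: str, overlap_ratio: float = 0.15) -> List[str]:
    """Split on blank lines; prefix each chunk with the tail of the previous section."""
    sections = [s.strip() for s in text.split("\n\n") if s.strip()]
    chunks: List[str] = []
    prev_words: List[str] = []
    for section in sections:
        words = section.split()
        n_overlap = int(len(prev_words) * overlap_ratio)
        tail = prev_words[-n_overlap:] if n_overlap else []
        chunks.append(" ".join(tail + words))
        prev_words = words
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would call a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    """Return up to top_k chunks ranked by similarity, dropping zero-score chunks."""
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]


DOC = """To reset your password, open Settings and choose Security. Click Reset password and follow the emailed link.

Billing runs on the first of each month. Invoices are available under Settings, in the Billing tab.

To delete your account, contact support from the Help menu."""

chunks = chunk_by_section(DOC)
results = search("how do I reset my password", chunks)
print(results[0])
```

Swapping embed for a real embedding call and the sorted scan for a vector index changes the scale, not the shape, of the pipeline.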

When to Use It

Any agent grounded in private content. Customer support agents grounding in past tickets. Coding agents grounding in the project’s own code (often via a code-specific retrieval tool that understands AST structure). Internal-docs agents.

Alternatives --- keyword search (BM25, Elasticsearch) when queries are exact-match shaped; full-document context when the corpus is small enough to fit. Hybrid retrieval (BM25 + vectors) outperforms either alone on most real corpora.
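One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion. The text above does not name a fusion method, so treat this as one possible technique rather than the recommended one; the function name, the document IDs, and the conventional constant k=60 are assumptions for illustration.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]    # hypothetical keyword results
vector_ranking = ["c", "a", "d"]  # hypothetical embedding results

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists rise to the top even when neither retriever ranked them first, which is the behavior that lets hybrid retrieval cover for the blind spots of each method.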

Sources

pinecone.io, weaviate.io, lancedb.com, chromadb.com
LangChain and LlamaIndex retrievers

Example

A customer-support agent grounded in the company’s help-center articles. When a user asks “how do I reset my password?” the agent queries the vector store, retrieves the top three most-relevant help articles by semantic similarity, and answers from their content with citations.

Example artifacts

Tool schema.

// Typical retrieval tool schema:

{
  "name": "search_knowledge_base",
  "description": "Semantic search over the company knowledge base.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "top_k": { "type": "integer", "default": 5 },
      "filter": {
        "type": "object",
        "properties": { "category": { "type": "string" } }
      }
    },
    "required": ["query"]
  }
}

Section C — Code execution sandboxes

Run AI-generated code safely --- isolated, fast-booting, network-controlled

Anthropic’s code_execution covers Python sandbox needs inside the Claude API. Beyond that, four products dominate the sandbox-for-AI category: E2B (the open-source incumbent), Modal (general-purpose serverless that adapted to AI workloads), Daytona (dev environments first, sandboxes second), and Blaxel (the newest entrant, optimizing for resume latency).

The selection criteria are operational, not technical: does the workload need persistent state across many tool calls (favor Blaxel’s perpetual standby); does it need GPUs (Modal); is the team already using one of these for non-AI workloads (use that one); is the budget tight enough to want the open-source self-hostable option (E2B)?

All sandboxes share a Tier 2 permission classification: misfires are contained by the sandbox boundary. They can also be configured to enable Tier 4 capabilities (the sandbox calls real APIs with real credentials), in which case the tier classification escalates accordingly.

E2B

Source: PyPI: e2b, e2b-code-interpreter. npm: e2b, @e2b/code-interpreter

Classification Compute (internal + mutates). Tier 2 by default.

Intent

Open-source secure sandboxes for AI-generated code, running on Firecracker microVMs with sub-second boot times.

Motivating Problem

An agent that writes code needs a place to run it. Running it on the host system is dangerous; running it in a generic Docker container is slower than the agent’s decision-making and lacks the file-system + network primitives that make a sandbox useful for real work. E2B builds a sandbox layer specifically for AI agents: microVM-isolated, fast-booting, with Python and JavaScript SDKs that expose the loop the agent naturally needs.

How It Works

The core abstraction is a Sandbox object created via Sandbox.create(). Once created, the sandbox supports two main operations: commands.run for shell commands and (with the code-interpreter SDK) run_code for stateful Python or JavaScript execution. State persists across calls within a sandbox; sandboxes are reused as long as they’re kept alive. Boot is around 200 milliseconds when sandbox and client are co-located.

The code-interpreter SDK adds a Jupyter-kernel layer: variables persist between run_code calls so the model can build up state across multiple steps. Standard libraries (pandas, numpy, matplotlib) are pre-installed; pip install works inside the sandbox.

Two flavors run alongside the core: E2B Desktop provides a graphical Linux desktop reachable from Anthropic’s computer-use tool or OpenAI’s computer-use agent (Surf); Fragments is the open-source template for building Claude-Artifacts-style or v0-style code-generation apps. Open-source core (Apache 2.0); BYOC deployments on AWS and GCP for enterprise.

When to Use It

Any non-Anthropic agent stack that needs a code sandbox. Anthropic-stack agents that need state persistence across many turns (Anthropic’s code_execution sandbox lifecycle is shorter). Building a computer-use agent on a virtual desktop (E2B Desktop).

Alternatives --- Modal for compute-heavy workloads with GPU access; Blaxel for sub-25ms resume on a perpetually-warm sandbox; Daytona for full dev-environment use cases.

Sources

e2b.dev
github.com/e2b-dev/E2B
github.com/e2b-dev/code-interpreter

Example

Data-analysis agent: user uploads a CSV; agent creates an E2B sandbox; loads the CSV with pandas; runs several exploratory queries, each in a new run_code call (with state preserved); generates a matplotlib chart; downloads the chart as a PNG and shows the user.

Example artifacts

Setup.

pip install e2b-code-interpreter

# or

npm install @e2b/code-interpreter

export E2B_API_KEY=e2b_...

Code.

# Python

from e2b_code_interpreter import Sandbox

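with Sandbox.create() as sandbox:
    # Assumes /tmp/data.csv has already been uploaded into the sandbox.
    # State persists between run_code calls, so df is still defined below.
    sandbox.run_code("import pandas as pd; df = pd.read_csv('/tmp/data.csv')")
    result = sandbox.run_code("df.groupby('region')['revenue'].sum()")
    print(result.text)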

# TypeScript / JavaScript

import { Sandbox } from '@e2b/code-interpreter'

const sandbox = await Sandbox.create()

await sandbox.runCode("x = 1")

const execution = await sandbox.runCode("x += 1; x")

console.log(execution.text) // outputs 2

Modal

Source: modal.com (Python SDK: modal)

Classification Compute (internal + mutates). Tier 2.

Intent

General-purpose serverless compute that scales to AI sandbox workloads: GPUs, large memory, custom Docker images.

Motivating Problem

Some AI sandbox workloads outgrow lightweight microVMs: training a small model, running a CUDA kernel, processing a multi-GB dataset, deploying a fine-tuned model. Modal’s value proposition for AI agents is its full serverless surface --- GPUs, custom images, scheduled functions, persistent volumes --- with a Python-first SDK that makes it usable from inside an agent loop.

How It Works

The Modal SDK lets the agent define functions that run in cloud sandboxes with configurable resources: CPU, memory, GPU class (T4 through H100), Python version, custom Docker image, persistent volumes. The agent calls these functions like local Python functions; Modal handles the container lifecycle.

For AI-agent use, Modal added explicit sandbox primitives in late 2025: ephemeral sandboxes with file-system snapshots, configurable network policies, and the same sub-second boot times as E2B (for cold starts on the smallest tier). Sandboxes can be kept warm for up to 7 days in the current alpha; longer lifecycles via Modal Volumes.

When to Use It

Compute-heavy workloads that need GPUs or large memory. Agents running pipelines that mix scratch computation with production deployment. Teams already using Modal for non-AI workloads who want to share infrastructure.

Alternatives --- E2B for lighter-weight stateful sandboxes without GPU need. Direct cloud-provider sandboxes (AWS Lambda, GCP Cloud Run) when the workload is more script-like than agent-like.

Sources

modal.com
modal.com/docs/guide/sandbox

Example

An agent fine-tuning a small classifier on user-provided data. Modal sandbox with a single A10G GPU, mount the dataset, run the training loop, save the model to a Modal Volume, return the inference endpoint URL to the user.

Example artifacts

Setup.

pip install modal

modal token new

Code.

import modal

app = modal.App("agent-sandbox")

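image = modal.Image.debian_slim().pip_install("pandas", "scikit-learn")

with app.run():
    # Passing app= associates the sandbox with the running app.
    sb = modal.Sandbox.create(app=app, image=image, gpu="T4")
    # Run a one-off command inside the sandbox and read its output.
    p = sb.exec("python", "-c", "import sklearn; print(sklearn.__version__)")
    print(p.stdout.read())
    sb.terminate()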

Daytona / Blaxel — dev-environment sandboxes

Source: daytona.io, blaxel.ai

Classification Compute (internal + mutates). Tier 2.

Intent

Sandbox products that double as full development environments, optimizing for long-lived state and fast resume rather than for ephemeral microVMs.

Motivating Problem

Some agent workloads aren’t ephemeral: an agent reviewing a PR needs the full repo cloned and dependencies installed; an agent migrating a codebase needs a long-running workspace where build state survives across many tool calls; an agent doing iterative dev needs to re-enter the same workspace tomorrow without re-bootstrapping. Daytona and Blaxel optimize for these long-lived shapes rather than for the fire-and-forget microVM.

How It Works

Daytona provisions a full dev environment per sandbox: filesystem, network, ports, persistent storage, and editor compatibility (VS Code, JetBrains). Sandboxes archive after 30 days of inactivity. The MCP server exposes tools for creating workspaces, attaching to existing ones, running commands, and reading files.

Blaxel takes the opposite extreme on lifecycle: sandboxes can stay in standby indefinitely at zero compute cost, with resume times under 25ms (filesystem and memory state intact). It pairs sandboxes with co-located agent hosting so the agent’s tool calls don’t cross network boundaries.

Both compete with E2B and Modal at the edges of the sandbox-for-AI category. Choose Daytona when the workload is dev-environment shaped (PR reviews, multi-day refactors); choose Blaxel when fast resume matters (perpetually-warm agents, latency-sensitive workflows).

When to Use It

Long-running agent workloads where state persistence is the bottleneck, not boot speed. Cases where the agent and a human collaborate in the same workspace (Daytona’s editor compatibility). Production agents with strict resume-latency requirements (Blaxel).

Alternatives --- E2B for shorter-lived stateful sandboxes; Modal for compute-heavy work; GitHub Codespaces for the human-in-the-loop case.

Sources

daytona.io
blaxel.ai

Example

A code-review agent that runs across hundreds of PRs per week. Daytona provisions a sandbox per PR, clones the repo at the PR’s commit, installs dependencies, runs tests, generates a review comment, and archives the sandbox.

Example artifacts

Setup.

# Daytona

daytona create my-workspace

daytona run my-workspace -- pytest

# Blaxel

blaxel sandbox create --image python:3.11

blaxel sandbox exec sandbox-id -- python script.py

Section D — Filesystem and version control

The tools every coding agent needs: read/write/edit files, manipulate Git, interact with the GitHub API

Every non-trivial coding agent has three tool categories: filesystem operations (the agent reads, writes, and edits files in the project), local Git operations (commit, diff, log, branch), and remote-Git platform operations (open PRs, comment on issues, search repos). Anthropic’s text_editor and bash tools cover the first two when running locally inside Claude Code; MCP servers provide the cross-agent versions and add the GitHub API surface.

Two design notes matter operationally. First, the filesystem MCP server is one of the still-active Anthropic reference servers, but it’s redundant in Claude Code (which already has filesystem access via its native tools); install it only in agents that don’t have native filesystem access. Second, the canonical GitHub MCP server is the vendor-maintained one (github/github-mcp-server), which replaced the archived Anthropic reference implementation in late 2025; many tutorials still point at the deprecated repository, so check before installing.

Filesystem MCP

Source: modelcontextprotocol/servers (filesystem)

Classification Retrieve (read) + Compute (write). Tier 2 in sandbox; Tier 3 against real filesystems.

Intent

Secure file operations --- read, write, list, search --- against a configurable directory tree, with access controls.

Motivating Problem

Agents that don’t have native filesystem access (most non-Claude-Code agents) still need to operate on files: read a CSV, write a generated report, search across a directory tree for relevant content. The filesystem MCP server exposes the standard file operations with configurable access controls so the agent can’t escape outside the allowed root.

How It Works

The server takes a list of allowed directories as a startup argument; all operations are confined to those directories. Tools include read_file, write_file, list_directory, search_files (recursive glob), get_file_info, create_directory, and move_file. Each has a minimal input schema: a path plus the operation-specific arguments.

Access controls are deliberately simple: the agent can read or write anything within the allowed directories; it cannot read or write outside them. There’s no per-file ACL; the model is trusted to operate on the right files.
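The confinement rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the server's actual implementation: the helper name is_within_allowed and the example paths are hypothetical. The candidate path is resolved first, so ".." segments and symlinks cannot be used to step outside an allowed root.

```python
from pathlib import Path


def is_within_allowed(path: str, allowed_roots: list[str]) -> bool:
    """Return True if `path` resolves to a location inside one of the roots.

    Hypothetical helper for illustration; not the filesystem server's code.
    """
    # resolve() normalizes ".." segments and follows symlinks where they exist.
    candidate = Path(path).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        # Compare whole path components, so "/srv/agent/notes-private" does
        # not count as being inside "/srv/agent/notes".
        if candidate == root_path or root_path in candidate.parents:
            return True
    return False
```

Comparing path components rather than string prefixes is the design choice that matters here: a plain startswith check would wrongly accept sibling directories that share a prefix with an allowed root.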

One of the seven still-active Anthropic reference servers, retained because it remains useful in non-Claude-Code stacks (Cursor, Codex, Claude Desktop).

When to Use It

Non-Claude-Code agents that need file access. Sandbox-style deployments where the allowed directory is the only legitimate scope. Skip in Claude Code, which already exposes equivalent native tools.

Alternatives --- Anthropic native text_editor (within Claude Code or the Claude API). Bash MCP tools that wrap shell commands; less safe but more flexible.

Sources

github.com/modelcontextprotocol/servers/tree/main/src/filesystem

Example

A research assistant agent given a directory of research notes. filesystem MCP exposes the notes directory; the agent uses search_files to find notes containing a topic, read_file to load matching ones, and writes a synthesis as a new file in the same directory.

Example artifacts

Invocation.

{
  "name": "read_file",
  "input": {
    "path": "/Users/me/Projects/notes.md"
  }
}

{
  "name": "search_files",
  "input": {
    "path": "/Users/me/Documents",
    "pattern": "*.pdf"
  }
}

Setup.

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/me/Documents",
        "/Users/me/Projects"
      ]
    }
  }
}

Git MCP

Source: modelcontextprotocol/servers (git)

Classification Retrieve + Compute (Tier 2 reading; Tier 3 mutating).

Intent

Read, search, and manipulate local Git repositories --- status, diff, log, branch, commit, blame.

Motivating Problem

Coding agents need to understand and manipulate Git state: what changed since main, what does this commit do, which files have I touched in this session, when should I commit. Calling git via bash works but loses structure; the Git MCP server exposes the same operations as typed tools with structured output.

How It Works

Tools cover the standard Git operations: git_status, git_diff, git_log, git_show, git_branch, git_checkout, git_add, git_commit, git_blame. Each returns structured output (parsed diff hunks, log entries with structured fields rather than raw text). The server is one of the seven still-active Anthropic reference servers.
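To make the "structured output" point concrete, the sketch below turns unified-diff hunk headers into dictionaries. It is an assumption about what a parsed hunk could look like, not the Git MCP server's actual output format; parse_hunk_headers and its field names are hypothetical.

```python
import re

# Matches unified-diff hunk headers such as "@@ -12,4 +12,6 @@ def handler():"
_HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_headers(diff_text: str) -> list[dict]:
    """Extract the line ranges of every hunk in a unified diff."""
    hunks = []
    for line in diff_text.splitlines():
        match = _HUNK_RE.match(line)
        if match is None:
            continue
        old_start, old_len, new_start, new_len = match.groups()
        hunks.append({
            "old_start": int(old_start),
            # A hunk header that omits the length describes a single line.
            "old_lines": int(old_len) if old_len is not None else 1,
            "new_start": int(new_start),
            "new_lines": int(new_len) if new_len is not None else 1,
        })
    return hunks
```

With fields like these, an agent can reason about which line ranges changed without re-parsing raw diff text on every turn.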

Notably, the Git MCP does not include push by default --- push is a Tier 4 operation that should require explicit approval. Most production setups either disable push entirely (and require the developer to push manually after reviewing the agent’s commits) or add it behind explicit configuration.

When to Use It

Any coding agent that operates on Git repositories. Pair with the filesystem MCP and a GitHub MCP for a full Git workflow. Skip if Claude Code’s native bash tool is already in use and the structure of the MCP output isn’t needed.

Alternatives --- bash MCP wrapping the git CLI directly. The mattpocock git-guardrails-claude-code skill (from the Skills Catalog) for layering safety hooks on top of bash-based Git operations.

Sources

github.com/modelcontextprotocol/servers/tree/main/src/git

Example

An agent finishing a feature branch: git_status to confirm the changes, git_diff to review them, git_add for selected files, git_commit with a generated message. push is left to the human.

Example artifacts

Invocation.

{
  "name": "git_status",
  "input": {}
}

{
  "name": "git_diff",
  "input": {
    "target": "HEAD~1"
  }
}

{
  "name": "git_commit",
  "input": {
    "message": "Add user-profile endpoint\n\nAdds GET /api/users/:id returning profile fields."
  }
}

Setup.

{
  "mcpServers": {
    "git": {
      "command": "uvx",
      "args": [
        "mcp-server-git",
        "--repository",
        "/path/to/repo"
      ]
    }
  }
}

GitHub MCP (vendor-maintained)

Source: github/github-mcp-server

Classification Retrieve + Observe + Act (Tier 1–4 depending on operation).

Intent

Access the GitHub API as tools --- search repos and code, read PRs and issues, create issues and PRs, comment on threads, get commit info.

Motivating Problem

Many agent workflows aren’t local-Git: triage open issues on a repo, summarize a PR, search across an org’s code for a pattern, file a bug report. These need the GitHub API. The vendor-maintained GitHub MCP server (one of the canonical “starter pack” recommendations) wraps the API surface as tools.

How It Works

The server exposes dozens of tools across read and write operations: search_repositories, search_code, search_issues, get_pull_request, get_pr_diff, create_issue, create_pull_request, add_issue_comment, get_file_contents (read a file at a ref without cloning). Authentication is via a GitHub Personal Access Token or GitHub App credentials.

Mixed permission tiers: search and read operations are Tier 1; commenting and creating issues are Tier 4 (real-world side-effects, observable to other humans). Production deployments typically scope the token to read-only or require explicit approval for write operations.
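One way to enforce this split is a small gate in front of the tool handler. The sketch below is illustrative only: the tier table follows the classification in this entry, while TOOL_TIERS, requires_approval, dispatch, and the approve callback are hypothetical names, not part of the GitHub MCP server.

```python
from typing import Callable

# Illustrative tier table based on the classification above. Tools that are
# not listed are treated as Tier 4, so unknown tools fail closed.
TOOL_TIERS = {
    "search_repositories": 1,
    "search_code": 1,
    "search_issues": 1,
    "get_pull_request": 1,
    "create_issue": 4,
    "add_issue_comment": 4,
}


def requires_approval(tool_name: str, threshold: int = 4) -> bool:
    """Return True when the tool's tier is at or above the approval threshold."""
    return TOOL_TIERS.get(tool_name, 4) >= threshold


def dispatch(tool_name: str, tool_input: dict,
             handler: Callable[[dict], dict],
             approve: Callable[[str, dict], bool]) -> dict:
    """Run `handler` only if the call is low-tier or a human approves it."""
    if requires_approval(tool_name) and not approve(tool_name, tool_input):
        return {"is_error": True, "content": f"{tool_name} was not approved"}
    return handler(tool_input)
```

Returning an error result instead of raising keeps the refusal visible to the model, which can then explain to the user why the write did not happen.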

Operationally, the GitHub MCP is one of the three tools in the widely-recommended “starter pack” (GitHub + Context7 + Playwright). Across the agent ecosystem, it’s the single most-installed MCP server outside the Anthropic reference set.

When to Use It

Any agent that does work on GitHub-hosted code at the platform level (not just locally): triage, PR review, code search across an organization, repository-management automation, issue summarization and routing. Pair with Git MCP for the local Git half.

Alternatives --- GitLab MCP server for GitLab-hosted code (zereight/mcp-gitlab). Direct REST API calls via a generic HTTP MCP when only one or two operations are needed.

Sources

github.com/github/github-mcp-server

Example

An open-source-maintenance agent run weekly: search_issues for issues opened in the last 7 days; for each, fetch the linked PR if any; classify into bug/feature/question with the model; add a label via add_label; post a triage comment via add_issue_comment. Runs end-to-end in a few minutes against repositories where this used to take a human half a day.

Example artifacts

Invocation.

{
  "name": "search_issues",
  "input": {
    "query": "repo:my-org/my-repo is:open is:issue created:>=2026-05-01"
  }
}

{
  "name": "get_pull_request",
  "input": {
    "owner": "my-org",
    "repo": "my-repo",
    "pullNumber": 1234
  }
}

{
  "name": "add_issue_comment",
  "input": {
    "owner": "my-org",
    "repo": "my-repo",
    "issue_number": 1234,
    "body": "Triaged as a duplicate of #1199."
  }
}

Setup.

{
  "mcpServers": {
    "github": {
      "command": "docker",
      "args": [
        "run",
        "-i",
        "--rm",
        "-e",
        "GITHUB_PERSONAL_ACCESS_TOKEN",
        "ghcr.io/github/github-mcp-server"
      ],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_..."
      }
    }
  }
}

Section E — Browser and computer control

Tools that let agents drive a browser or a full desktop --- the universal fallback when an API doesn’t exist

When a target system doesn’t have an MCP server, a REST API, or any other programmatic surface, the agent’s last resort is to operate the system as a human would: render the screen, recognize what’s on it, take action via mouse and keyboard. Three tools dominate this category: Anthropic Computer Use (full desktop control), Playwright MCP (web browser control via structured DOM), and browser-use (browser control with an opinionated agent loop). They differ in scope, latency, and reliability; the choice depends on what the target system permits.

All three are Tier 4 when operating against real systems: a wrong click can post to social media, send an email, or click a Buy button. Production deployments use them against virtual desktops or scoped browser sessions whenever possible. The latency is significant: each action-screenshot-reason cycle takes 1–3 seconds even with fast networks, so these tools are 10–100x slower than direct API calls when an API is available.
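A back-of-the-envelope calculation shows how the per-cycle latency adds up. The 40-step workflow and the 2-second mid-range figure are assumptions chosen for illustration; the per-cycle range and the slowdown factors are the ones quoted above.

```python
def ui_workflow_seconds(steps: int, seconds_per_cycle: float) -> float:
    """Wall-clock time when each step is one action-screenshot-reason cycle."""
    return steps * seconds_per_cycle


def equivalent_api_seconds(ui_seconds: float, slowdown: float) -> float:
    """Time the same work would take through an API `slowdown` times faster."""
    return ui_seconds / slowdown


# Assumed workload: 40 UI steps at 2 seconds per cycle.
ui_time = ui_workflow_seconds(40, 2.0)            # 80.0 seconds
api_slow = equivalent_api_seconds(ui_time, 10)    # 8.0 seconds
api_fast = equivalent_api_seconds(ui_time, 100)   # 0.8 seconds
```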

Anthropic Computer Use (cross-reference)

Source: Anthropic platform tool, type computer_20250124

Classification Act (external + mutates). Tier 4.

Intent

Control a full desktop --- mouse, keyboard, screenshots --- as the universal automation fallback when no API exists.

Motivating Problem

Some targets aren’t web pages and can’t be Playwright-driven: legacy enterprise desktop applications, virtual machines, OS-level interactions, applications that detect and refuse browser-automation. Computer Use handles these by operating the entire desktop rather than just the browser.

How It Works

Covered in detail in Section A. The relevant point in this section: where Playwright reads the DOM and acts on element handles, Computer Use reads screenshots and acts on pixel coordinates. This is what makes it both more powerful (works on anything visible) and less reliable (small UI changes break the model’s plan).

The dominant deployment pattern is to pair Computer Use with E2B Desktop or Anthropic’s own browser sandbox, so the desktop being controlled is a sandbox rather than the user’s real machine. The sandbox is the boundary that makes Computer Use’s Tier 4 nature operationally safe.

When to Use It

When the target has no API and no Playwright affordance. When the workflow involves multiple desktop applications, not just a browser. As an explicit demo of “an agent that can use any computer” for new use cases.

Strongly prefer Playwright MCP, browser-use, or direct API integration when those options exist. Computer Use is the heaviest hammer in the toolbox.

Sources

platform.claude.com/docs/en/agents-and-tools/tool-use/computer-use-tool
github.com/anthropics/anthropic-quickstarts (computer-use-demo)

Example

An agent migrating data out of a legacy desktop CRM whose only export is screen-scraping. Computer Use opens the app, navigates to each customer record, reads the visible fields, and emits a structured row. Slow (minutes per record) but the alternative is hand work.

Example artifacts

Invocation.

// See Section A for the full tool schema and invocation examples.

Playwright MCP

Source: microsoft/playwright-mcp

Classification Act + Observe (external). Tier 4 against real sites.

Intent

Drive a browser through structured DOM operations --- navigate, click elements, fill forms, take accessibility snapshots --- rather than via screen coordinates.

Motivating Problem

Computer Use’s screenshot-and-click loop is slow and fragile for web pages because the web already has a structured DOM that’s machine-readable. Playwright MCP exposes the structured-DOM interface to the agent: accessibility snapshots (semantic tree of the page) and element-handle operations (click an element by role and name, not by coordinates).

How It Works

The server runs a Playwright-managed browser (Chromium by default; Firefox and WebKit available). Tools include browser_navigate, browser_snapshot (accessibility tree), browser_click (by element role + accessible name), browser_type (fill an input), browser_wait_for (waits for a network state or element to appear), and browser_screenshot (visual debugging).

The accessibility snapshot is the key innovation. Instead of the model reading a screenshot and computing coordinates, it reads a structured tree like {role: “button”, name: “Sign in”, id: 42} and acts on element handles by ID. Faster, cheaper (no image tokens), and more robust to visual changes that don’t affect the DOM.
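The lookup this enables can be sketched as a plain search over the snapshot. The snapshot shape below mirrors the simplified structure used in this entry and is an assumption, not the exact wire format of any Playwright MCP release; find_element_id is a hypothetical helper.

```python
from typing import Optional


def find_element_id(snapshot: dict, role: str, name: str) -> Optional[int]:
    """Return the id of the first element matching role and accessible name."""
    for element in snapshot.get("elements", []):
        if element.get("role") == role and element.get("name") == name:
            return element.get("id")
    return None


# Simplified snapshot (assumed shape) for a sign-up page.
snapshot = {
    "url": "https://example.com/signup",
    "title": "Sign up",
    "elements": [
        {"id": 12, "role": "textbox", "name": "Email"},
        {"id": 14, "role": "button", "name": "Sign up"},
    ],
}
```

Because the match is on role and name, restyling or moving the button does not change the result, which is the robustness the snapshot approach is meant to buy.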

Like the other tools in this section, the security model depends on the deployment: against a real production site with real credentials, every browser_click is Tier 4. Against a sandboxed test environment, it’s Tier 2.

When to Use It

Any browser automation: web scraping with login, end-to-end testing, QA verification of agent-built UI changes. Pair with the webapp-testing skill (Skills Catalog Section A) for AI-driven test runs.

Alternatives --- Computer Use when the target isn’t a browser. browser-use when you want a more agent-loop-shaped API. Firecrawl when the task is content extraction rather than interactive automation.

Sources

github.com/microsoft/playwright-mcp
playwright.dev

Example

An end-to-end test for a new sign-up flow. Playwright MCP navigates to the sign-up page, fills the form (browser_type), submits (browser_click), waits for the post-signup landing (browser_wait_for), and asserts the expected text via the accessibility snapshot. The same agent loop replicates across browsers (Chromium, Firefox, WebKit) by changing one parameter.

Example artifacts

Invocation.

{
  "name": "browser_navigate",
  "input": {
    "url": "https://example.com/signup"
  }
}

{
  "name": "browser_snapshot",
  "input": {}
}

// returns: { url, title, elements: [
// { id: 12, role: "textbox", name: "Email" },
// { id: 14, role: "button", name: "Sign up" }, ...
// ] }

{
  "name": "browser_type",
  "input": {
    "elementId": 12,
    "text": "test@example.com"
  }
}

{
  "name": "browser_click",
  "input": {
    "elementId": 14
  }
}

Setup.

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": [
        "-y",
        "@playwright/mcp@latest"
      ]
    }
  }
}

browser-use

Source: browser-use.com (Python)

Classification Act + Observe (external). Tier 4.

Intent

Browser automation built specifically for AI agents, with an agent-loop API rather than a low-level browser-control API.

Motivating Problem

Playwright MCP gives the agent low-level browser primitives; the agent itself decides the action sequence. For many web tasks (“book me a flight”, “fill out this form”, “find and download this report”), the agent-loop structure repeats: read the page, decide what to click or fill, observe the result, repeat. browser-use packages that loop as the API: pass a high-level goal and a starting URL, get back a result.

How It Works

The library wraps Playwright with a model-driven agent loop. The agent reads the page (structured DOM extraction with vision fallback), reasons about the next action, executes it, and observes. The user-facing API is high-level: agent.run(task) where task is a natural-language description. The internal loop is configurable --- you can swap the model, adjust the planning step, or intercept actions for approval.
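The loop described above has a simple generic shape. The sketch below is not browser-use's internal code: run_agent_loop and its three callables (observe, decide, act) are hypothetical stand-ins for page extraction, the model call, and the browser driver.

```python
from typing import Callable, Optional


def run_agent_loop(
    observe: Callable[[], str],
    decide: Callable[[str, str], Optional[dict]],
    act: Callable[[dict], None],
    task: str,
    max_steps: int = 20,
) -> int:
    """Repeat observe -> decide -> act until `decide` returns None.

    Returns the number of actions executed.
    """
    steps = 0
    while steps < max_steps:
        page_state = observe()
        action = decide(task, page_state)
        if action is None:  # the model judges the task complete
            break
        act(action)
        steps += 1
    return steps
```

The max_steps cap is the one safeguard worth keeping in any variant of this loop: without it, a model that never declares the task complete keeps clicking indefinitely.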

Compared to Playwright MCP, browser-use trades off granularity for ergonomics: it’s the right choice when you want to delegate the whole task to the agent rather than orchestrating individual browser primitives. Compared to Computer Use, it’s browser-only but much faster and more reliable for browser-shaped tasks.

When to Use It

Browser tasks that can be expressed as a high-level goal. Web research agents that need to navigate complex sites. Form-filling and data-entry workflows where the form structure varies per site.

Alternatives --- Playwright MCP for fine-grained control. Computer Use for non-browser tasks. Direct API integration when one exists.

Sources

browser-use.com
github.com/browser-use/browser-use

Example

An agent that monitors government-procurement portals for relevant tenders. browser-use receives the task “find tenders matching {keywords} posted in the last 24 hours on portal {url}”; the library navigates the portal’s search UI, filters results, paginates through, and returns structured matches.

Example artifacts

Setup.

pip install browser-use

playwright install chromium

Code.

import asyncio

from browser_use import Agent
from langchain_anthropic import ChatAnthropic


async def main():
    llm = ChatAnthropic(model="claude-sonnet-4-6")
    agent = Agent(
        task="Find new tenders on procurement.example.gov related to 'cybersecurity'.",
        llm=llm,
    )
    # agent.run() is a coroutine, so it has to be awaited inside an event loop.
    result = await agent.run()
    print(result)


asyncio.run(main())

Section F — Collaboration and communication

Slack, email, issue trackers, calendar --- the tools that let agents participate in human work streams

Most knowledge work happens across a small set of communication tools: Slack (or another team chat), email (mostly Gmail in startup/SMB, Outlook in enterprise), an issue tracker (Linear, Jira, or GitHub Issues), and a calendar (Google Calendar, Microsoft Calendar). An agent that can read and write to these tools is suddenly useful as a teammate rather than just as a code generator.

The shared design challenge across this section is permission: reading these tools is mostly Tier 1 (low risk), but writing is Tier 4 (the agent’s message goes to real humans, the calendar invite reaches real recipients, the issue gets created on the real backlog). Most production setups split the surface explicitly: read tools available without approval, write tools requiring explicit confirmation.

Slack

Source: Vendor-maintained official servers (the original Anthropic reference is archived; now maintained by Zencoder)

Classification Observe (read) + Act (write). Tier 1–4.

Intent

Read and post to Slack channels, search message history, list channels and users, retrieve thread context.

Motivating Problem

Slack is where many companies’ institutional knowledge lives: decisions in channels, debates in threads, documents shared in DMs. An agent that can search and summarize Slack becomes a knowledge-base over the team’s actual history. Conversely, agents that can post to Slack become part of the team’s notification flow --- “deploy succeeded”, “PR ready for review”, “customer escalation incoming”.

How It Works

Read tools: slack_list_channels, slack_get_channel_history, slack_search, slack_get_thread_replies, slack_get_users. Write tools: slack_post_message, slack_reply_to_thread, slack_add_reaction. Authentication is via a Slack Bot Token; the bot needs to be added to each channel it should read or post to.

Operationally important: Slack search returns a token-budgeted set of matches, not the whole history, so the agent must issue well-formed, specific queries. The notable failure mode is the agent posting to the wrong channel by mistake, which is why write operations are classified Tier 4.
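The effect of a token budget on search results can be sketched as follows. This is an illustration of the idea only, not the Slack server's logic: within_token_budget is a hypothetical helper, and the four-characters-per-token estimate is a rough assumption.

```python
def within_token_budget(messages: list[str], budget_tokens: int,
                        chars_per_token: int = 4) -> list[str]:
    """Keep messages in ranked order until the estimated budget is used up."""
    kept, used = [], 0
    for text in messages:
        # Crude size estimate; a real system would use the model's tokenizer.
        cost = max(1, len(text) // chars_per_token)
        if used + cost > budget_tokens:
            break
        kept.append(text)
        used += cost
    return kept
```

Everything past the cutoff is invisible to the agent, so a vague query that ranks the relevant thread tenth may never surface it.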

When to Use It

Anywhere agents need to participate in or learn from team chat. Cross-functional automation (status updates, escalation routing). Building Q&A bots that ground in past discussions. Note: the original Anthropic Slack reference MCP server was archived; use the vendor-maintained Zencoder server from the official MCP registry.

Alternatives --- Microsoft Teams MCP for Teams-using organizations; Discord MCP for Discord-using communities; direct Slack Events API when the agent is the consumer of inbound events rather than the initiator.

Sources

api.slack.com
github.com/zencoderai/slack-mcp-server
registry.modelcontextprotocol.io

Example

An agent triaging customer escalations: slack_search for messages mentioning the customer’s name in the last week, slack_get_thread_replies on the relevant thread, slack_post_message in #customer-success summarizing the escalation history and proposed next step.

Example artifacts

Invocation.

{
  "name": "slack_search",
  "input": {
    "query": "customer_x escalation",
    "sort": "timestamp"
  }
}

{
  "name": "slack_post_message",
  "input": {
    "channel_id": "C0123456",
    "text": "Escalation summary for customer_x: ..."
  }
}

Setup.

{
  "mcpServers": {
    "slack": {
      "command": "npx",
      "args": [
        "-y",
        "@zencoderai/slack-mcp-server"
      ],
      "env": {
        "SLACK_BOT_TOKEN": "xoxb-..."
      }
    }
  }
}

Gmail

Source: Vendor-maintained MCP servers; Google Workspace gws-shared family

Classification Observe (read) + Act (write). Tier 1–4.

Intent

Read and search Gmail messages, draft and send messages, manage labels, search across history.

Motivating Problem

Email is the most asynchronous channel that people still reliably answer. Agents that can triage inboxes, draft replies, and surface follow-up reminders cut into what is, for many knowledge workers, the largest recurring cost of the day: inbox processing.

How It Works

Tools cover the standard Gmail operations: search_threads, get_thread, create_draft, list_drafts, update_label, label_message, send_message. Authentication via Google OAuth; permissions are scoped (read-only is a different scope from send).

The discipline for write operations is to draft, not send. Most production deployments give the agent create_draft (Tier 3 --- reversible, the draft sits in the user’s drafts folder for review) but withhold send_message (Tier 4). The user reviews the draft and clicks send themselves.
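A draft-only deployment can be expressed as a filter over the tool list the model is shown. The sketch below assumes tool definitions are plain dictionaries with a name field; exposed_tools is a hypothetical helper, and the tool names are the ones listed in this entry.

```python
def exposed_tools(all_tools: list[dict], withheld: set[str]) -> list[dict]:
    """Return the tool definitions to advertise, minus the withheld names."""
    return [tool for tool in all_tools if tool["name"] not in withheld]


# Tool names taken from this entry; input schemas omitted for brevity.
GMAIL_TOOLS = [
    {"name": "search_threads"},
    {"name": "get_thread"},
    {"name": "create_draft"},
    {"name": "send_message"},
]

draft_only = exposed_tools(GMAIL_TOOLS, withheld={"send_message"})
```

Filtering the advertised list is a first layer only. Granting an OAuth scope that excludes sending gives a second layer that holds even if the filter is misconfigured.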

When to Use It

Inbox triage and summarization. Drafting replies that the user then reviews. Searching for context across email history (“what did we agree about the launch date?”). Calendar-meeting follow-ups (combined with the Calendar tool).

Alternatives --- Microsoft Outlook MCP for Outlook-using organizations. The Google Workspace gws-shared family for organizations using the broader Workspace surface (Drive, Sheets, Calendar) together.

Sources

developers.google.com/gmail/api
registry.modelcontextprotocol.io

Example

End-of-day inbox triage: search_threads for unread messages, classify each by topic and urgency, create_draft for the routine acknowledgments, surface the genuinely-important threads to the user with a one-line summary each.

Example artifacts

Invocation.

{
  "name": "search_threads",
  "input": {
    "query": "is:unread newer_than:1d"
  }
}

{
  "name": "create_draft",
  "input": {
    "to": [
      "alice@example.com"
    ],
    "subject": "Re: Q3 numbers",
    "body": "Acknowledging --- will respond by EOD tomorrow.\n\nBest, ..."
  }
}

Setup.

# OAuth flow during install; tokens persist in ~/.config/gmail-mcp/

npx -y gmail-mcp configure

Linear / Jira — issue trackers

Source: Linear MCP (official), Atlassian MCP (official)

Classification Observe (read) + Act (write). Tier 1–4.

Intent

Manage issues, projects, sprints, and comments in Linear or Jira via natural-language agent interactions.

Motivating Problem

An agent that can read and update the issue tracker becomes a working teammate on engineering and product work. The tasks are the same shape everywhere: triage incoming issues, query “what’s open in this sprint,” update issue status, comment on threads, link issues to PRs. Both Linear and Jira ship official MCP servers; tools differ by platform but the shapes are similar.

How It Works

Linear MCP exposes tools like linear_search_issues, linear_get_issue, linear_create_issue, linear_update_issue, linear_create_comment, linear_list_projects, linear_list_teams. Atlassian MCP covers Jira and Confluence: jira_search, jira_get_issue, jira_create_issue, jira_add_comment, confluence_search, confluence_get_page.

Authentication: Linear via API key (one per user); Jira/Confluence via OAuth or API token. Both servers provide structured output (JSON with fields like assignee, status, labels) rather than the raw API responses.

When to Use It

Standup automation (“summarize what each person has open this sprint”). Issue triage from incoming bug reports. Cross-system automation that connects code (GitHub) to product (Linear/Jira). PR-review agents that close the loop by updating issue status when the PR merges.

Alternatives --- GitHub Issues for repositories that use GitHub Issues as the primary tracker (GitHub MCP from Section D handles it).

Sources

linear.app/developers
developer.atlassian.com

Example

An agent generating the weekly engineering report: linear_search_issues filtered by team and timeframe, group by status and project, render as a structured summary, post to Slack via slack_post_message. Runs in under a minute and replaces a Friday-afternoon hand-roll.

Example artifacts

Invocation.

{
  "name": "linear_search_issues",
  "input": {
    "team": "ENG",
    "state": "InProgress",
    "limit": 50
  }
}

{
  "name": "linear_create_comment",
  "input": {
    "issueId": "ENG-1234",
    "body": "Reproduced on staging. Root cause is in the queue consumer."
  }
}

Setup.

{
  "mcpServers": {
    "linear": {
      "command": "npx",
      "args": [
        "-y",
        "@linear/mcp-server"
      ],
      "env": {
        "LINEAR_API_KEY": "lin_api_..."
      }
    }
  }
}

Google Calendar

Source: Google Calendar MCP (multiple implementations)

Classification Observe (read) + Act (write). Tier 1–4.

Intent

Read, search, create, and modify calendar events; suggest meeting times; respond to invitations.

Motivating Problem

Scheduling is one of the few knowledge-work tasks that’s genuinely hard to delegate --- it requires reading multiple calendars, applying preferences, and negotiating timezones. An agent that can do all three replaces a meaningful percentage of administrative time. Calendar tools handle the read/write half; the human still mostly handles the negotiation.

How It Works

Tools cover the standard calendar operations: list_calendars, list_events (with time range filters), get_event, create_event, update_event, delete_event, respond_to_event, and the higher-level suggest_time (which finds free windows across one or more calendars matching constraints).

Notably, suggest_time is where the real value is. The agent gathers constraints (“30 minutes with alice and bob next week”), queries the calendars for free/busy data, and returns ranked time options. The model handles timezone arithmetic and meeting-length math better than humans typically do.
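The free-window search at the heart of a tool like suggest_time can be sketched as a sweep over sorted busy intervals. This is a minimal version under stated assumptions, not any calendar server's implementation: free_windows is a hypothetical helper, all datetimes are assumed to share one timezone, and ranking of the results is left out.

```python
from datetime import datetime, timedelta

Interval = tuple[datetime, datetime]


def free_windows(busy: list[Interval], earliest: datetime, latest: datetime,
                 duration: timedelta) -> list[Interval]:
    """Return gaps of at least `duration` between `earliest` and `latest`.

    `busy` holds the combined busy intervals of every attendee, in any order.
    """
    windows = []
    cursor = earliest  # start of the gap currently being measured
    for start, end in sorted(busy):
        if end <= cursor:
            continue  # already covered by an earlier busy interval
        if start >= latest:
            break
        if start - cursor >= duration:
            windows.append((cursor, start))
        cursor = max(cursor, end)
    if latest - cursor >= duration:
        windows.append((cursor, latest))
    return windows
```

Pooling every attendee's busy intervals into one list before the sweep is what makes the result a shared free window rather than a per-person one.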

When to Use It

Anywhere an agent needs to participate in scheduling. Daily digest agents (“what’s on the calendar today”). Auto-responding to invites with policy (decline lunch hour, accept dev syncs). Pair with Gmail for full meeting-coordination flows.

Alternatives --- Microsoft Calendar (Outlook) MCP for organizations on the Microsoft stack. Calendly-style purpose-built scheduling tools when negotiation with external parties is the dominant case.

Sources

developers.google.com/calendar
registry.modelcontextprotocol.io

Example

An agent helping schedule a four-person meeting next week. suggest_time across the four calendars finds three windows; the agent picks the one with the longest buffer on each side, then calls create_event with the right attendees and an agenda generated from the meeting topic.

Example artifacts

Invocation.

{
  "name": "suggest_time",
  "input": {
    "calendars": [
      "alice@x.com",
      "bob@x.com"
    ],
    "duration_minutes": 30,
    "earliest": "2026-05-20T09:00:00Z",
    "latest": "2026-05-24T17:00:00Z",
    "working_hours_only": true
  }
}

{
  "name": "create_event",
  "input": {
    "calendar": "primary",
    "summary": "Architecture review",
    "start": "2026-05-21T15:00:00Z",
    "end": "2026-05-21T15:30:00Z",
    "attendees": [
      "alice@x.com",
      "bob@x.com"
    ]
  }
}

Section G — Cloud and infrastructure

Cloudflare, AWS, Kubernetes --- the tools that let agents operate production infrastructure

The cloud-and-infrastructure category is the highest-stakes part of the catalog. A misfired tool call here can break a production deployment, leak credentials, run up a large cloud bill, or expose user data. The MCP servers in this section are operationally Tier 4 by default, with careful gating recommended even for read-only operations.

Three tools illustrate the shape: Cloudflare MCP (vendor-maintained, broad surface across Workers, KV, R2, D1, DNS), AWS MCPs (the AWS Labs project at github.com/awslabs/mcp covers many AWS services, with Bedrock KB retrieval as the most established example), and the Kubernetes/kubectl tools that let agents inspect and modify cluster state. All three have legitimate use cases for agentic workflows; all three should be deployed with explicit approval gates for any write operation.

Cloudflare MCP

Source: Cloudflare official MCP server (github.com/cloudflare/mcp-server-cloudflare)

Classification Observe + Act. Tier 4 for writes; Tier 1–2 for reads.

Intent

Operate Cloudflare resources --- Workers, KV, R2, D1, DNS, Pages, Analytics --- from inside an agent.

Motivating Problem

Cloudflare is one of the most-used edge platforms. Agents that can deploy Workers, query KV namespaces, manage DNS records, or pull analytics from inside a session collapse a number of operational tasks (“why is this domain slow”, “deploy this Worker”, “what’s the cache hit rate”) into single-prompt workflows.

How It Works

The server exposes tools across the Cloudflare product surface: workers_list, workers_deploy, kv_get, kv_put, r2_list_objects, r2_upload, d1_query, dns_list_records, dns_create_record, analytics_query. Authentication is via Cloudflare API Token; the token can be scoped to specific resources to limit blast radius.

The product breadth is the point: most cloud MCPs cover a narrow slice; Cloudflare’s covers the entire edge-platform surface as one server with consistent auth.

When to Use It

Cloudflare-hosted infrastructure: deploying or debugging Workers, managing edge data (KV, R2), reading analytics. Pair with Git MCP and GitHub MCP for code-to-deploy workflows.

Alternatives --- Cloudflare’s wrangler CLI invoked via bash for ad-hoc operations; direct REST API calls for one-off uses. The MCP wins when an agent is doing repeated Cloudflare operations as part of a larger workflow.

Sources

developers.cloudflare.com
github.com/cloudflare/mcp-server-cloudflare

Example

An agent debugging a slow site. analytics_query pulls performance metrics by route; the agent identifies a slow Worker, inspects it via workers_list, and, once the write is approved, deploys an optimized version with workers_deploy. The whole loop runs end-to-end inside a single chat.

Example artifacts

Invocation.

{
  "name": "workers_list",
  "input": {}
}

{
  "name": "kv_get",
  "input": {
    "namespace_id": "abc...",
    "key": "feature_flags:beta"
  }
}

Setup.

{
  "mcpServers": {
    "cloudflare": {
      "command": "npx",
      "args": [
        "-y",
        "@cloudflare/mcp-server-cloudflare"
      ],
      "env": {
        "CLOUDFLARE_API_TOKEN": "cf_..."
      }
    }
  }
}

AWS Knowledge Base / AWS MCPs

Source: github.com/awslabs/mcp (broad family); aws-kb-retrieval (Bedrock)

Classification Observe + Compute + Act. Tier varies by service.

Intent

Retrieve from AWS Bedrock Knowledge Bases via MCP; access broader AWS services through the AWS Labs MCP project.

Motivating Problem

AWS Bedrock Knowledge Bases give organizations a managed RAG layer: documents are ingested, indexed, and made queryable through vector retrieval. The AWS KB MCP exposes that as a tool any agent (not just Bedrock-hosted agents) can call. The broader AWS MCP surface is less consolidated than Cloudflare’s but growing rapidly through the AWS Labs MCP servers project.

How It Works

The AWS KB MCP wraps the Bedrock Agent Runtime Retrieve API: pass a query and a knowledge-base ID, get back ranked passages. Authentication is via standard AWS credentials (IAM role, access keys, or SSO).
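As a rough sketch of what such a wrapper does, the helpers below build the keyword arguments for the boto3 bedrock-agent-runtime client's retrieve call and flatten a response into ranked passages. The helper names (build_retrieve_request, extract_passages) and the sample response are hypothetical, and the boto3 call itself appears only in a comment so the block runs without AWS credentials; this is not the MCP server's actual implementation.

```python
from typing import Any


def build_retrieve_request(knowledge_base_id: str, query: str, n: int = 5) -> dict[str, Any]:
    """Build keyword arguments for bedrock-agent-runtime's retrieve()."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "knowledgeBaseId": knowledge_base_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": n}
        },
    }


def extract_passages(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Reduce a Retrieve response to text and score, highest score first."""
    results = response.get("retrievalResults", [])
    passages = [
        {"text": r["content"]["text"], "score": r.get("score", 0.0)}
        for r in results
    ]
    return sorted(passages, key=lambda p: p["score"], reverse=True)


# With credentials configured, the real call would look like:
#   import boto3
#   client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
#   response = client.retrieve(**build_retrieve_request(kb_id, question))

# Hand-written stand-in for a response, for illustration only.
sample_response = {
    "retrievalResults": [
        {"content": {"text": "Page the secondary after 15 minutes."}, "score": 0.71},
        {"content": {"text": "Primary on-call acknowledges within 5 minutes."}, "score": 0.88},
    ]
}
passages = extract_passages(sample_response)
```

Returning only text and score keeps the tool result small; a fuller wrapper would probably also pass through the source location of each passage so the agent can cite it.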

Beyond Bedrock KB, the AWS Labs MCP project (github.com/awslabs/mcp) provides specialized servers for: aws-documentation (searching AWS docs), aws-cdk (infrastructure-as-code operations), S3, DynamoDB, and many others. The ecosystem is more fragmented than Cloudflare’s because AWS’s surface is much broader.

When to Use It

Agents that need to retrieve from organizationally curated knowledge stored in Bedrock KB. AWS infrastructure operations where a service-specific MCP is available. Documentation-heavy queries (aws-documentation MCP).

Alternatives --- boto3 (the AWS Python SDK) invoked via code_execution for ad-hoc operations. The AWS CLI via bash for command-line-shaped operations. The MCP layer pays off when the agent is doing repeated structured operations.

Sources

docs.aws.amazon.com/bedrock
github.com/awslabs/mcp

Example

An internal-support agent grounded in company SOPs ingested into Bedrock KB. User asks a question; agent calls retrieve_from_aws_kb to get the top relevant passages; answers from those passages with citations.

Example artifacts

Invocation.

{
  "name": "retrieve_from_aws_kb",
  "input": {
    "knowledgeBaseId": "KB123...",
    "query": "What's the on-call escalation policy?",
    "n": 5
  }
}

Setup.

{
  "mcpServers": {
    "aws-kb": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-aws-kb-retrieval"
      ],
      "env": {
        "AWS_REGION": "us-east-1",
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "..."
      }
    }
  }
}

Kubernetes / kubectl MCP

Source: Community kubectl MCP servers

Classification Observe + Act. Tier 1–4.

Intent

Inspect cluster state, read pod logs, apply manifests, manage deployments --- the standard kubectl operations as agent tools.

Motivating Problem

Kubernetes operations are a high-volume, high-skill domain: reading pod logs, applying manifests, scaling deployments, diagnosing pod failures. An agent with kubectl access turns natural-language debugging requests (“why is the api-service pod restarting?”) into the right kubectl commands without the user needing to remember kubectl idioms.

How It Works

MCP servers wrap kubectl operations: kubectl_get (pods, deployments, services, namespaces, custom resources), kubectl_logs, kubectl_describe, kubectl_apply, kubectl_delete, kubectl_scale, kubectl_rollout. Authentication uses standard kubeconfig; the agent runs with whatever permissions the kubeconfig grants.

The discipline is to deploy with a carefully scoped kubeconfig: read-only access for diagnostic agents, and write access only behind explicit approval gates for any apply, delete, or scale operation. The wshobson/agents skills (Section C of the Skills Catalog) include Kubernetes-specific patterns for templating manifests safely.
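One way to sketch such a gate in code, assuming the kubectl_* tool names listed above and a hypothetical approve callback supplied by the host application: read tools pass straight through, while every other tool runs only after the callback returns True. The read-tool set is illustrative, and unknown tools are deliberately treated as writes.

```python
from typing import Any, Callable

# Illustrative classification; anything not listed here is treated as a write.
READ_TOOLS = {"kubectl_get", "kubectl_logs", "kubectl_describe"}


def gated_call(
    name: str,
    tool_input: dict[str, Any],
    handler: Callable[[str, dict[str, Any]], Any],
    approve: Callable[[str, dict[str, Any]], bool],
) -> dict[str, Any]:
    """Run read tools directly; run everything else only after approval."""
    if name not in READ_TOOLS and not approve(name, tool_input):
        return {
            "status": "denied",
            "tool": name,
            "reason": "approval required and not granted",
        }
    return {"status": "ok", "tool": name, "result": handler(name, tool_input)}


# Stand-ins for demonstration: a real handler would invoke the MCP tool,
# and a real approver would prompt a human operator.
def fake_handler(name: str, tool_input: dict[str, Any]) -> str:
    return f"ran {name}"


def deny_all(name: str, tool_input: dict[str, Any]) -> bool:
    return False


read_result = gated_call("kubectl_get", {"resource": "pods"}, fake_handler, deny_all)
write_result = gated_call(
    "kubectl_rollout",
    {"action": "undo", "resource": "deployment/api-service"},
    fake_handler,
    deny_all,
)
```

Defaulting unknown tools to the gated path means a newly added write tool is blocked until someone classifies it, which is the safer failure direction.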

When to Use It

Cluster diagnostics agents that answer “what’s wrong with X?” Live-incident response agents that produce situation summaries from cluster state. Deployment-automation agents (with strict guardrails; kubectl apply is a Tier 4 operation).

Alternatives --- direct kubectl via bash for ad-hoc operations. Helm-specific MCPs for Helm-based deployments. The official Kubernetes Python client wrapped via code_execution for richer queries that don’t fit kubectl’s shape.

Sources

kubernetes.io
Community kubectl MCP servers

Example

On-call response: alert fires that the api-service is failing. Agent runs kubectl_get on the deployment, kubectl_describe to see events, kubectl_logs on the most-recently-restarted pod. Reports back: image pull error, the new tag doesn’t exist in the registry. Suggests rollback with kubectl_rollout undo --- but doesn’t execute it without explicit confirmation.

Example artifacts

Invocation.

{
  "name": "kubectl_get",
  "input": {
    "resource": "pods",
    "namespace": "production",
    "selector": "app=api-service"
  }
}

{
  "name": "kubectl_logs",
  "input": {
    "pod": "api-service-abc123",
    "namespace": "production",
    "tail": 100
  }
}

// Tier 4 --- requires approval:
{
  "name": "kubectl_rollout",
  "input": {
    "action": "undo",
    "resource": "deployment/api-service",
    "namespace": "production"
  }
}

Section H — Databases

Query and (carefully) mutate database state through standardized tools

Database access from agents is a category where the MCP pattern shines: standardized read tools (SELECT-shaped operations) are Tier 1; mutating tools (INSERT, UPDATE, DELETE, schema changes) are Tier 3–4; the tool surface is small and well-defined. Two servers cover most needs: PostgreSQL MCP for production OLTP queries against a Postgres database, and DuckDB MCP for analytical queries against files and remote data sources. Other database-specific MCPs (MySQL, MSSQL, Redis, MongoDB) follow the same pattern with similar shapes.

The recurring design discipline is to expose read tools liberally and gate write tools strictly. Many production deployments expose only a query tool that runs SELECTs against a read-replica; mutations happen out of band, typically through human-reviewed migrations.

PostgreSQL MCP

Source: Vendor-maintained Postgres MCP servers (Supabase, others)

Classification Retrieve (read). Tier 1–3.

Intent

Query PostgreSQL databases from inside an agent --- schema introspection, table queries, and (with care) limited mutations.

Motivating Problem

Most internal data lives in Postgres. An agent that can query Postgres --- “how many users signed up last week?”, “show me the orders for customer X”, “what’s the schema of the events table?” --- becomes a working business-analyst layer. Without an MCP, the agent has to write SQL and ask a human to run it, which defeats the purpose.

How It Works

Read-oriented Postgres MCPs expose: query (parameterized SELECT execution), list_tables, describe_table, list_schemas, get_table_constraints. Some add list_functions and describe_view. The most-used vendor-maintained server is Supabase’s, which adds project- and migration-management tools on top of the SQL surface.

Authentication is via standard Postgres connection strings or service tokens. The discipline is connection-string-as-credential: production agents should be connected via a read-only role on a read replica; write access is a deliberate, separate configuration.
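As defense in depth on top of the read-only role, a query tool's handler can refuse anything that is not a single SELECT before it reaches the database. The check below is a minimal sketch with a hypothetical function name. It uses deliberately conservative keyword matching rather than a SQL parser, so it may reject some valid reads, and it does not replace database-level permissions.

```python
import re

# Keywords that indicate mutation or DDL, matched as whole words.
# INTO is included because SELECT ... INTO creates a table in Postgres.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|copy|into)\b",
    re.IGNORECASE,
)


def is_read_only(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT or WITH
    and contains no mutating keyword."""
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False  # empty input, or more than one statement
    first_word = statement.split(None, 1)[0].lower()
    if first_word not in {"select", "with"}:
        return False
    return _FORBIDDEN.search(statement) is None
```

Because the match is on whole words, identifiers such as created_at or updates_log pass, while a data-modifying CTE (WITH ... DELETE ...) is refused. Side-effecting functions called from a SELECT are not caught, which is why the read-only role remains the primary control.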

When to Use It

Internal-analytics agents grounded in Postgres data. Customer-support agents that look up account state. Database-introspection tools (“what’s the foreign key structure?”, “what indexes exist on this table?”). Pair with the wshobson/agents sql-optimization-patterns skill (Skills Catalog Section C) for query-tuning workflows.

Alternatives --- ORM-specific MCPs (Prisma, SQLAlchemy) for richer abstraction; direct psycopg via code_execution for one-off queries; Supabase MCP specifically when the deployment is on Supabase.

Sources

supabase.com/docs (MCP integration)
registry.modelcontextprotocol.io

Example

A product-analytics agent asked “which features did users with churn risk use most?”. PostgreSQL MCP lists the relevant tables, describes their schemas, and runs a parameterized SELECT joining user_features and churn_predictions. The agent gets structured rows; it summarizes the result for the user.

Example artifacts

Invocation.

{
  "name": "list_tables",
  "input": {
    "schema": "public"
  }
}

{
  "name": "describe_table",
  "input": {
    "schema": "public",
    "table": "events"
  }
}

{
  "name": "query",
  "input": {
    "sql": "SELECT count(*) FROM users WHERE created_at >= $1",
    "params": [
      "2026-05-01"
    ]
  }
}

Setup.

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "postgres-mcp-server",
        "postgresql://reader:***@db.internal/analytics"
      ]
    }
  }
}

DuckDB MCP

Source: Official DuckDB MCP skills (duckdb/attach-db, duckdb/query, duckdb/read-file)

Classification Retrieve + Compute. Tier 1–2.

Intent

Analytical SQL over local files (CSV, Parquet, JSON, Excel) or remote sources, with friendly SQL dialect and on-demand DuckDB CLI installation.

Motivating Problem

Many analytical questions are over files rather than databases: a CSV the user uploaded, a Parquet file on S3, an Excel workbook with multiple sheets. DuckDB is the standard tool for this; its SQL dialect handles all the relevant formats natively and runs in-process. DuckDB MCP wraps it as tools.

How It Works

The DuckDB MCP skills include attach-db (open a .duckdb file for interactive querying with auto schema exploration), query (run SQL against attached databases or ad-hoc against files), read-file (read CSV/JSON/Parquet/Excel/spatial files from local or remote sources), duckdb-docs (search DuckDB docs via full-text search), and install-duckdb (CLI installation and version management).

The friendly-SQL dialect matters: DuckDB supports operations like SELECT * FROM 'data.csv' or SELECT * FROM read_parquet('s3://bucket/file.parquet') directly, without explicit table creation. This is what makes DuckDB MCP unusually well-suited to ad-hoc analytical work.

When to Use It

Ad-hoc analysis of uploaded files. Querying Parquet or CSV data on cloud storage. Cross-file analysis (joining multiple CSVs into one query). Particularly useful for data-science workflows where the data isn’t yet in a managed database.

Alternatives --- pandas via code_execution for richer Python-shaped analysis. ClickHouse MCP for column-store analytical workloads at scale. PostgreSQL MCP when the data is already in a managed Postgres.

Sources

duckdb.org
github.com/duckdb/duckdb-mcp

Example

A user uploads three months of CSV exports from a SaaS billing system. DuckDB MCP runs SELECT month, SUM(amount) FROM 'invoices_*.csv' GROUP BY month --- the wildcard syntax loads all three files; the aggregation runs in milliseconds; the result is a small table the agent renders inline.

Example artifacts

Invocation.

{
  "name": "duckdb_query",
  "input": {
    "sql": "SELECT region, SUM(revenue) FROM 'sales_2026_q*.parquet' GROUP BY region"
  }
}

{
  "name": "duckdb_read_file",
  "input": {
    "path": "s3://my-bucket/events/2026/05/*.parquet",
    "format": "parquet"
  }
}

Section I — Memory and state

Tools that give agents memory beyond a single session: Anthropic’s memory tool, Mem0, Letta

Section A introduced Anthropic’s native memory tool as one of the foundational primitives. This section completes the picture by covering the two principal alternatives: Mem0 (a memory layer designed as a drop-in service across LLM providers) and Letta (formerly MemGPT, a memory-first agent framework that includes sophisticated memory architecture as a core abstraction). Each has a different design philosophy; choose based on the agent’s shape.

Anthropic’s memory tool is the simplest: a key-value store with the model writing free-form notes. Mem0 adds vector retrieval, automatic categorization, and explicit memory categories (preferences, facts, history). Letta adds a multi-tier memory architecture (core memory always in context, recall memory queried on demand, archival memory for the long tail) and the agent abstraction that uses them. The progression is from primitive to opinionated framework.

Anthropic memory tool (cross-reference)

Source: Anthropic platform tool, beta

Classification Retrieve + Compute. Tier 3.

Intent

Simple key-value memory across sessions, paired with context-editing strategies for long conversations.

Motivating Problem

Covered in Section A. The relevant comparative point here: the Anthropic memory tool optimizes for simplicity and tight integration with the model. The agent decides what to write; the runtime persists key-value pairs; retrieval is by key lookup. There is no automatic categorization, no vector search, no memory hierarchy. This is intentional --- the design trusts the model to write and recall well, with model-side discipline rather than framework structure.

How It Works

See Section A for the mechanics. The choice between Anthropic’s native memory and a framework like Mem0 or Letta is fundamentally about how much memory architecture the application needs. If the model writing free-form notes to a key-value store is sufficient, use Anthropic’s. If the application needs structured retrieval, categorization, or multi-tier memory, look at the alternatives below.

When to Use It

Most cases where the application is on the Claude API and the memory requirements are simple. Particularly good for personal-assistant-shaped agents where the model’s judgment about what to remember is the dominant axis of quality.

Reach for Mem0 or Letta when the memory architecture itself needs to be structured --- when retrieval needs to span thousands of memories, when categorization matters, or when the application is multi-provider and the memory must be portable across LLMs.

Sources

See Section A

Mem0

Source: mem0.ai (Python and Node SDKs; also available as MCP server)

Classification Retrieve + Compute. Tier 3.

Intent

A persistent memory layer for AI agents that automatically extracts, stores, and retrieves user preferences, facts, and conversation history with vector retrieval.

Motivating Problem

Anthropic’s memory tool requires the model to make explicit set calls. For applications where the user’s preferences and facts should be extracted automatically from natural conversation --- without the agent having to remember to remember --- Mem0 provides automatic memory extraction: pass conversations through Mem0; Mem0 identifies facts worth retaining; future conversations retrieve relevant memories automatically.

How It Works

Mem0’s SDK exposes two primary operations: add (pass in a conversation or a fact; Mem0 decides what to extract and how to categorize it) and search (query against the memory store; returns relevant memories ranked by similarity and recency). Internally, Mem0 uses an LLM-driven extraction step that parses the conversation and identifies durable facts, plus a vector index for retrieval and a categorization layer that buckets memories into preferences, facts, and history.

The MCP server makes Mem0 available to any MCP-compatible agent without writing SDK code. The hosted service handles the vector store and the extraction LLM; self-hosting is available for organizations that need data residency.

When to Use It

Multi-LLM agents that need memory portability across providers. Customer-support and conversational agents where memories should accumulate automatically from chat history. Applications that need explicit memory categories (preferences vs. facts vs. history) for governance or transparency.

Alternatives --- Anthropic memory tool when the application is Claude-only and the memories are simple. Letta when the memory architecture itself should be the framework’s opinionated abstraction.

Sources

mem0.ai
github.com/mem0ai/mem0

Example

A conversational customer-success agent. Across many sessions with the same user, Mem0 automatically accumulates: “user is on Enterprise plan,” “user’s primary integration is Salesforce,” “user had a bad experience with v3.2,” “user prefers email replies over phone calls.” Every new session’s agent gets the relevant subset retrieved automatically based on the current topic.

Example artifacts

Setup.

pip install mem0ai

export MEM0_API_KEY=...

Code.

import os

from mem0 import MemoryClient

# Read the key exported in the setup step rather than hard-coding it.
mem = MemoryClient(api_key=os.environ["MEM0_API_KEY"])

# Add a conversation; Mem0 extracts facts automatically
mem.add(
    messages=[
        {"role": "user", "content": "I'm allergic to peanuts and dairy."},
        {"role": "assistant", "content": "Noted --- I'll keep that in mind for restaurant suggestions."},
    ],
    user_id="user_123",
)

# Retrieve memories relevant to a new query
results = mem.search("Where should I eat tonight?", user_id="user_123")

# Returns memories about allergies, food preferences, location, etc.

Letta

Source: letta.com (formerly MemGPT; Python SDK and hosted service)

Classification Retrieve + Compute. Tier 3.

Intent

A memory-first agent framework with multi-tier memory architecture: core memory always in context, recall memory queried on demand, archival memory for the long tail.

Motivating Problem

Some agents need memory that goes beyond a flat key-value store --- they need a memory architecture. Letta’s thesis is that the agent’s memory should be split into tiers with different access patterns: a small “core memory” always in context (who the user is, the agent’s persona), a queryable “recall memory” for recent conversation history, and an “archival memory” for the long tail. The agent itself manages the tiers via tool calls.

How It Works

Letta provides an agent abstraction (not just a memory tool) with memory built in. The agent’s system prompt always includes the core memory; recall and archival memory are accessed through agent-callable tools (search_recall, search_archival, add_to_archival). The agent decides when to elevate a fact from recall to archival, when to compress old conversation history, when to update core memory.

The architecture descends from the MemGPT paper (Packer et al., 2023), which framed memory in LLM agents as analogous to virtual memory in operating systems: a small fast tier always in context, larger slower tiers accessible via paging. Letta operationalizes that into a hosted service with a Python SDK and OpenAI-compatible / Anthropic-compatible LLM interfaces.
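The paging analogy can be made concrete with a toy model. The class below is a hypothetical illustration in plain Python, not Letta's implementation or API: core memory is always rendered into the prompt, recall memory holds a bounded window of recent messages, and messages evicted from that window move into archival memory, where they are found by naive substring search. In Letta the agent itself decides what to move between tiers; here eviction is automatic to keep the sketch short.

```python
from collections import deque


class TieredMemory:
    """Toy three-tier memory: core (always in context), recall (recent
    messages, bounded), archival (everything evicted from recall)."""

    def __init__(self, recall_limit: int = 3) -> None:
        self.core: dict[str, str] = {}
        self.recall: deque[str] = deque()
        self.archival: list[str] = []
        self.recall_limit = recall_limit

    def add_message(self, text: str) -> None:
        """Append to recall; page the oldest message out to archival when full."""
        self.recall.append(text)
        while len(self.recall) > self.recall_limit:
            self.archival.append(self.recall.popleft())

    def search_recall(self, term: str) -> list[str]:
        """Case-insensitive substring search over recent messages."""
        return [m for m in self.recall if term.lower() in m.lower()]

    def search_archival(self, term: str) -> list[str]:
        """Case-insensitive substring search over evicted messages."""
        return [m for m in self.archival if term.lower() in m.lower()]

    def context(self) -> str:
        """Only core memory is unconditionally placed in the prompt."""
        return "\n".join(f"{k}: {v}" for k, v in self.core.items())


memory = TieredMemory(recall_limit=2)
memory.core["human"] = "Prefers short answers."
for message in ["Decided on the Q3 roadmap.", "Booked travel.", "Reviewed budget."]:
    memory.add_message(message)
```

After the three messages, the roadmap decision has been paged out of recall, so only an archival search finds it. A real system would use embedding search rather than substring matching for the archival tier.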

When to Use It

Agents where memory architecture is itself a first-class concern --- long-running assistants, AI companions, customer-support agents that need explicit persona persistence. Research applications that want to study memory-architecture trade-offs. Cases where you want the memory layer to be the framework, not just a tool.

Alternatives --- Mem0 for the lighter-weight automatic-extraction approach. Anthropic memory tool for the simplest possible case. LangGraph’s memory primitives for agents that already use LangGraph as the orchestration framework.

Sources

letta.com
github.com/letta-ai/letta
Packer et al., MemGPT: Towards LLMs as Operating Systems (2023)

Example

A long-running personal-assistant agent that the user interacts with daily for years. Letta’s core memory holds the user’s name, role, and current top-level goals; recall memory holds the last few weeks of conversation; archival memory holds everything older. When the user references something from six months ago, the agent calls search_archival; when it references yesterday’s conversation, search_recall.

Example artifacts

Setup.

pip install letta

letta server start # local; or use the hosted service

Code.

# Written against the create_client interface; the SDK surface has changed
# across Letta releases, so check the current docs before copying.
from letta import create_client

client = create_client()

agent = client.create_agent(
    name="daily-assistant",
    persona="A helpful long-term assistant for productivity.",
    human="User is Roman, a search consultant in San Francisco.",
    llm_config=client.list_llm_configs()[0],
)

# Conversation; Letta manages core/recall/archival memory tiers automatically.
response = client.send_message(
    agent_id=agent.id,
    message="Remind me what we decided about the Q3 search roadmap.",
    role="user",
)

Section J — Designing tools for agents

Writing the name, schema, and results --- the ACI craft of Chapter 4 as templated patterns

Sections A through I catalog tools that already exist. This section is about the tools you write yourself. A tool is a name, a JSON schema, and a runtime handler (Chapter 1); the agent never sees the handler, only the name, the description, and the schema, and it decides entirely from those whether to call the tool and how. Designing that surface --- the Agent-Computer Interface --- is a distinct discipline from building the integration behind it, and it is where custom tools most often fail. The three patterns here cover the surface the model reads: the description it routes on, the input schema it fills, and the result it reasons about next.

Tool descriptions and model selection

Source: Anthropic tool-use documentation (docs.claude.com); “Building Effective Agents” (Anthropic, 2024)

Classification Writing the tool name and description the model routes on. ACI design.

Intent

Write a tool’s name and description so the model reliably selects the right tool for a request --- and declines the wrong ones --- given only the definitions it sees, never the implementation behind them.

Motivating Problem

The model chooses among tools by reading their descriptions; nothing else routes the call. A description that says what a tool does but not when to use it, or that overlaps a neighbor, produces the failure modes custom tools are plagued by: the model calls the wrong tool, calls nothing when it should, or picks unpredictably between two plausible matches. As the catalog grows, these routing errors compound. The description is a prompt --- the most-read prompt in the whole agent --- and it is too often written as an afterthought on top of an API doc.

How It Works

What, when, and when-not: the reliable shape for a description gives three things --- what the tool does, when to use it, and when not to. “Look up one order by its ID. Use when the user names a specific order. Do not use to search orders by customer or date; use search_orders for that.” The when-not clause is what separates two overlapping tools.

Disambiguation: when two tools could match a request, distinguish them explicitly. Anchor each by domain (get_invoice for billing, get_order for fulfilment), name the trigger that should fire it, and exclude the other’s territory by name. Distinct verbs help the model more than distinct nouns do.

Names prime, descriptions decide: the model weighs the description far more heavily than the name, but the name still primes selection. Prefer specific, verb-led names (search_orders over query, get_customer over lookup), and avoid generic names that collide as the catalog grows.

Framing the return in the description: telling the model what the tool returns --- a single object, a list, a boolean, a status --- shapes how it plans the next step and whether it calls the tool at all.

Selection is testable: routing reliability is a property you measure, not assume. Run the tool set against varied phrasings of the same intent, and against edge and adversarial phrasings, watching for misroutes and for selection drift as new tools are added. When two tools keep getting confused, the fix is usually a sharper description or a merged tool, not a third tool.
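A routing check of this kind can be as simple as a table of phrasings and expected tools. In the sketch below, select_tool is a hypothetical stand-in for a real model call (a keyword rule, so the block runs offline), and the phrasings are invented for illustration. The harness itself only needs a function from request text to tool name.

```python
from typing import Callable, Optional

# (user phrasing, expected tool name, or None when no tool should fire)
CASES: list[tuple[str, Optional[str]]] = [
    ("Where is order ORD-4821?", "get_order_status"),
    ("Has ORD-4821 shipped yet?", "get_order_status"),
    ("Show me all orders from last week", "search_orders"),
    ("What's your refund policy?", None),
]


def select_tool(request: str) -> Optional[str]:
    """Stand-in router. Replace with a call to the model under test."""
    text = request.lower()
    if "ord-" in text:
        return "get_order_status"
    if "orders" in text:
        return "search_orders"
    return None


def misroutes(
    router: Callable[[str], Optional[str]],
    cases: list[tuple[str, Optional[str]]],
) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Return (phrasing, expected, actual) for every case the router gets wrong."""
    failures = []
    for phrasing, expected in cases:
        actual = router(phrasing)  # call once per case; model calls are costly
        if actual != expected:
            failures.append((phrasing, expected, actual))
    return failures


failures = misroutes(select_tool, CASES)
accuracy = 1 - len(failures) / len(CASES)
```

Rerunning the same case table each time a tool is added gives a simple measure of selection drift. The None cases matter as much as the positive ones, since they check that the model declines when no tool fits.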

When to Use It

Apply this to every custom tool, and re-examine the descriptions whenever a tool is added to an agent that already has several. The discipline matters most where tools overlap in domain (multiple search tools, multiple write tools) and where a wrong call is expensive or irreversible. It pairs with tool scoping: the fewer tools in scope, the easier the routing (Volume 9, subagents).

Alternatives --- system-prompt routing rules (Volume 15) when descriptions alone cannot disambiguate; consolidating two overlapping tools into one with an enum parameter rather than sharpening two descriptions.

Sources

docs.claude.com/en/docs/build-with-claude/tool-use
anthropic.com/research/building-effective-agents

Example artifacts

Code.

{
  "name": "get_order_status",
  "description": "Get the current status of one order by its order ID. Use when the user names a specific order. Do NOT use to search orders by customer, product, or date; use search_orders for that. Returns a status object, or an order_not_found error.",
  "input_schema": {
    "type": "object",
    "properties": {
      "order_id": { "type": "string", "description": "The order ID, e.g. ORD-4821" }
    },
    "required": ["order_id"]
  }
}

Tool input schemas as contracts

Source: JSON Schema specification; Anthropic tool-use and OpenAI function-calling documentation

Classification Designing tool input schemas the model fills correctly. ACI design.

Intent

Design a tool’s JSON input schema --- its parameter names, types, required set, and constraints --- as a contract that guides the model to generate valid arguments and rejects invalid ones, rather than as a loose shape validated only after the call.

Motivating Problem

The input schema is the second half of the interface: the description tells the model whether to call the tool, the schema tells it how. A schema with vague names, no constraints, and everything optional lets the model supply an email where an ID belongs, invent an enum value, or omit the one field the handler needs. Each parameter’s name and description is a micro-prompt the model reads while composing arguments; a sloppy schema produces sloppy arguments, and the handler pays for it downstream.

How It Works

Name parameters unambiguously: customer_id beats id; start_date beats when. Prefer snake_case, and let the name carry the meaning. Avoid overloaded names (a bare id when several entities are in play) and cryptic abbreviations.

Descriptions as micro-prompts: each parameter’s description guides the value the model generates. State the format (an ISO 8601 date, a Stripe customer ID like cus_1234, the ID and not the email), and note boundary cases. This is where extraction reliability is won.

Constrain the value space: use enum for fields with a fixed set of valid values, so the model cannot invent one; use type plus format (date-time, email) and bounds (minimum, maximum, minLength, pattern) to reject malformed input before the handler runs. Set additionalProperties to false so unexpected keys are caught rather than silently passed.

Design the required set deliberately: mark a field required only if the tool genuinely cannot function without it. A lean required set is more reliable --- fewer forced fields means fewer hallucinated ones. For useful-but-not-critical fields, make them optional with a sensible default and handle their absence gracefully.

Test the schema three ways: a positive case (valid input parses and runs), a negative case (invalid input is rejected with a useful error), and a disambiguation case (an ambiguous input maps to the intended field). Schema quality is measured, like description quality.
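The positive and negative cases can be sketched against a small refund-style schema. The validator below is a hand-rolled subset written for this illustration (required, additionalProperties, type, enum, and minimum only), not a JSON Schema implementation; in practice a full validator library would take its place. The schema and field names are illustrative. The disambiguation case needs the model in the loop and is not shown.

```python
from typing import Any

SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "reason": {
            "type": "string",
            "enum": ["duplicate", "fraudulent", "requested_by_customer"],
        },
        "amount_cents": {"type": "integer", "minimum": 1},
    },
    "required": ["customer_id", "reason"],
    "additionalProperties": False,
}

_TYPES = {"string": str, "integer": int}


def validate(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    props = schema["properties"]
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for key, value in args.items():
        if key not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {key}")
            continue
        rule = props[key]
        expected = _TYPES[rule["type"]]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            errors.append(f"{key}: expected {rule['type']}")
            continue
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{key}: must be one of {rule['enum']}")
        if "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{key}: must be >= {rule['minimum']}")
    return errors


# Positive case: valid input passes.
ok = validate({"customer_id": "cus_1234", "reason": "duplicate"}, SCHEMA)
# Negative case: an invented enum value is rejected with a usable message.
bad = validate({"customer_id": "cus_1234", "reason": "changed_mind"}, SCHEMA)
# Unexpected key: caught because additionalProperties is false.
extra = validate(
    {"customer_id": "cus_1234", "reason": "duplicate", "email": "a@b.c"}, SCHEMA
)
```

Returning messages that name the offending field gives the model something to correct on the next turn, which a bare rejection does not.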

When to Use It

Every custom tool, and especially tools used for structured extraction, where the schema is doing the heavy lifting (Volume 15, structured output). The discipline is highest-value for tools that mutate state or spend money, where a wrong argument is costly, and for enums and identifiers, where a hallucinated value fails silently.

Alternatives --- accepting a free-form string and parsing inside the handler when the input genuinely does not fit a schema (at the cost of moving validation past the interface); vendor structured-output features (Volume 15) when the goal is a typed model output rather than a tool input.

Sources

json-schema.org
docs.claude.com/en/docs/build-with-claude/tool-use

Example artifacts

Code.

{
  "type": "object",
  "properties": {
    "customer_id": {
      "type": "string",
      "description": "Stripe customer ID, e.g. cus_1234. Not the email address."
    },
    "reason": {
      "type": "string",
      "enum": ["duplicate", "fraudulent", "requested_by_customer"],
      "description": "Why the refund is issued. Must be one of the three allowed values."
    },
    "amount_cents": {
      "type": "integer",
      "minimum": 1,
      "description": "Amount to refund in cents. Omit to refund the full charge."
    }
  },
  "required": ["customer_id", "reason"],
  "additionalProperties": false
}

Tool results and error responses for recovery

Source: Anthropic tool-use documentation; practitioner conventions for agent error handling

Classification Shaping tool results and errors so the model can act and recover. ACI design.

Intent

Shape what a tool returns --- both success results and errors --- so the model can decide what to do next: trim results to the signal the model needs, and return errors as structured, actionable responses rather than raw stack traces.

Motivating Problem

A tool’s return value is context the model must reason over, and it is the third face of the interface. Two failures are common. First, tools return everything --- an eighty-field API response when the agent needs three --- bloating context, burying the signal, and pushing key facts into the lost-in-the-middle zone. Second, tools surface errors as raw exceptions or opaque strings, so the model cannot tell a retryable timeout from a permanent not-found, and either gives up or retries forever. The result contract determines whether the agent recovers.

How It Works

Trim the result to the signal: return the fields the model needs to act, not the full upstream payload. For a stable response shape, keep a fixed field list; for a variable one, summarize; for a genuinely large payload, store it and return a reference the agent can fetch on demand. Trim before the result enters the model’s context, not only when it is displayed to the user.

Errors as structured responses: return an error the model can branch on --- a stable error_code (ORDER_NOT_FOUND, RATE_LIMITED), a human-readable message that includes the triggering value, a context object, and, where useful, a suggested_action naming what to do next. Stable codes over prose; no raw stack traces.

Separate user errors from system errors: a bad ID the user supplied should be surfaced for the user to correct; a transient dependency failure should be handled or retried without bothering the user. The error’s shape tells the model which path to take.

Signal empty versus failed: distinguish “the query ran and found nothing” from “the query could not run.” A search that returns an empty list because the backend was unreachable is not the same as a search that found nothing, and the model must not report one as the other. A typed status (success, empty_result, access_failure) lets the model retry a failure but not a genuine empty.

Design for safe retry: for tools that mutate state, make retries idempotent --- accept a client-supplied idempotency key, store-and-return on repeat --- so a retried call does not double-charge or double-send. Surface partial success (which items succeeded, which failed and why) rather than an all-or-nothing result.

When to Use It

Every custom tool. Result trimming matters most for verbose upstream APIs and for multi-call workflows where bloat stacks. The error contract matters most for tools that fail in more than one way and for anything the agent is expected to recover from autonomously (Volume 1, Exception Handling and Recovery).

Alternatives --- returning raw output when the model has proven able to parse it and context budget is not a concern; handling all retries and idempotency in an orchestration layer (Volume 4, durable execution) rather than in the tool, when that layer already exists.

Sources

docs.claude.com/en/docs/build-with-claude/tool-use
anthropic.com/research/building-effective-agents

Example artifacts

Code.

{
  "error_code": "ORDER_NOT_FOUND",
  "message": "No order with ID ORD-4821 exists.",
  "context": { "order_id": "ORD-4821" },
  "suggested_action": "Ask the user to confirm the order number, or call search_orders."
}

Appendix A --- Tier Reference Table

Cross-reference between the four families (from Chapter 2) and the four permission tiers (from Chapter 5), with representative tools from across the catalog:

Family	Tier	Examples
Retrieve (read-only)	Tier 1	web_search, web_fetch, file_read, sql_select, vector_search, get_memory, kubectl get, github search_issues, tool_search
Compute (sandboxed)	Tier 2	code_execution, bash in sandbox, str_replace in scratch dir, set_memory, E2B run_code, kubectl apply --dry-run, advisor sub-inference
Persistent mutation	Tier 3	git commit (local), file_write to project, create_draft (email), linear_create_comment, kv_put
External side-effects	Tier 4	send_email, git_push, payment, post_message (Slack), workers_deploy, kubectl apply, computer use against real systems
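In a runtime, the table can be carried as a simple lookup that gates calls by tier. The sketch below uses a handful of tool-to-tier pairs taken from the table; the max_tier ceiling and the deny-unknown-tools default are assumptions made for illustration, not a policy this catalog prescribes.

```python
# A few tool-to-tier pairs copied from the table above.
TOOL_TIERS: dict[str, int] = {
    "web_search": 1,
    "web_fetch": 1,
    "code_execution": 2,
    "set_memory": 2,
    "create_draft": 3,
    "kv_put": 3,
    "send_email": 4,
    "git_push": 4,
}


def is_allowed(tool_name: str, max_tier: int) -> bool:
    """Allow a call only if the tool's tier is at or below the session's ceiling."""
    tier = TOOL_TIERS.get(tool_name)
    if tier is None:
        return False  # assumption: tools with no assigned tier are denied
    return tier <= max_tier
```

For example, a session capped at Tier 3 could draft an email with create_draft but could not call send_email.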

Appendix B --- The “Starter Pack” Recommendation

Across the agent community as of mid-2026, three MCP servers are consistently recommended as the starter pack for any new agentic coding setup, regardless of which LLM provider or which IDE/CLI host is in use:

GitHub MCP --- collapses “find that PR, read that file, open that issue” into a single agent ask. The vendor-maintained server (github/github-mcp-server) is the canonical implementation.
Context7 --- pulls version-pinned library docs into context so the model writes code against real APIs rather than from training-cutoff memory. Particularly valuable for libraries that have evolved fast.
Playwright MCP --- handles real browser actions for QA, scraping, and verification of generated UI changes.

With those three installed, an agent has reasonable coverage of code (GitHub), docs (Context7), and the web (Playwright). The other 50+ MCP servers in this catalog are additions to the starter pack, justified by specific workflows.

Appendix C --- Composition with Skills and Patterns

This catalog is the third volume in a trilogy. The other two volumes give the vocabulary for talking about how tools fit into larger systems:

Patterns of AI Agent Workflows catalogs how LLM calls and tool calls compose in time. A tool by itself does nothing; it’s invoked inside a pattern (prompt-chaining, orchestrator-workers, evaluator-optimizer, autonomous-agent loop). Choose the pattern first; the tool surface follows.
The Claude Skills Catalog catalogs the SKILL.md-format instruction packs that tell the model when and how to use tools. A skill is the bridge between a workflow (a pattern) and the tools that workflow needs. Many of the entries in this Tools Catalog have a corresponding skill in the Skills Catalog --- critical-code-reviewer uses git and filesystem tools; webapp-testing uses Playwright MCP; describe-design composes filesystem reads with Mermaid output.

Read together, the three catalogs describe agentic AI at three levels of abstraction: the pattern (timing), the skill (instruction), and the tool (primitive). Designing a new agent system involves picking from all three.

Appendix D --- MCP Discovery

Finding MCP servers in mid-2026 is a problem of filtering, not discovery. Community directories list 5,000 to 15,000+ servers; the practical universe is much smaller. The hubs that matter:

Official MCP Registry (registry.modelcontextprotocol.io) --- the canonical hub maintained by the protocol community. Indexes verified, production-grade servers and tools.
modelcontextprotocol/servers on GitHub --- the reference repository (80k+ stars, 9.8k forks). Home to the seven still-active reference implementations (Everything, Fetch, Filesystem, Git, Memory, Sequential Thinking, Time); the README now redirects users to the MCP Registry for browsing third-party servers.
Smithery AI --- a popular third-party registry designed for one-click discovery, configuration, and install of MCP tools.
MCPMarket --- aggregator tracking community extensions that mimic Claude Code behaviors, structured memory extensions, and reasoning environments. Has an Anthropic-specific hub at mcpmarket.com/businesses/anthropic.
Glama (glama.ai/mcp/servers) and PulseMCP (pulsemcp.com) --- daily-updated server directories with usage metrics and trust signals.
Community “Awesome MCP Servers” lists --- the appcypher and punkpeye GitHub lists are the most actively maintained, with category-by-category indexes.

Three pragmatic rules help when navigating the directories. First, prefer the vendor-maintained official server over community forks where one exists (GitHub’s own server beats third-party GitHub wrappers; Sentry’s own beats community forks). Second, half the “official Anthropic MCP server” tutorials still online point at the 13 archived servers --- verify before installing. Third, MCP isn’t free --- each tool definition consumes context, and every call costs tokens --- so install fewer servers than you think you need.

Appendix E --- The Catalog’s Omissions

This catalog covers about 34 tools across 9 sections. The wider ecosystem is much larger; a non-exhaustive list of what isn’t here:

Niche or industry-specific MCP servers (legal, medical, scientific, financial, geospatial). The patterns transfer; the inventory is too large to enumerate.
Framework-specific tool surfaces (LangChain tools, LlamaIndex tools, OpenAI Agents SDK tools) when an equivalent MCP server exists. The recommendation is to build once with MCP and reach the same tool from any framework.
Closed or proprietary tools that aren’t addressable from a documented schema.
Tools for content generation that more naturally belong in the Skills Catalog --- image generation, music generation, design tools that ship as skills rather than as bare tool primitives.
The full matrix of chat (Slack, Microsoft Teams, Discord), email (Outlook, Gmail), issue-tracking (Jira, Linear, Asana, ClickUp), and calendar tools. The catalog covers one of each shape; the others follow the same template.
Cloud-vendor MCPs for cloud platforms other than AWS and Cloudflare (Vercel, Netlify, Fly, Railway, GCP, Azure).
Anthropic’s tool-infrastructure features beyond Tool Search and Advisor --- Tool Runner (SDK), Programmatic Tool Calling, Fine-grained Tool Streaming, Strict Tool Use, Tool Combinations, Parallel Tool Use --- which are runtime features rather than tools per se. See the platform.claude.com tool infrastructure docs for details.

Appendix F --- A Note on the Moving Target

Anthropic published MCP in November 2024 and the Linux Foundation’s Agentic AI Foundation (AAIF) took over governance in December 2025. Anthropic published the Skills feature in October 2025. The Tool Search tool went live in late 2025; the Advisor tool entered beta in March 2026. The function-calling primitive itself --- tool_use blocks and JSON schemas --- is older than all of these. The catalog’s structure is stable; the specific tool inventory is not. Star counts, version numbers, and the exact set of vendor-maintained MCP servers reflect snapshots from May 2026. Treat the catalog as a map of an ecosystem still under rapid construction --- the shape of the landscape will hold even as individual landmarks shift.

The deepest structural fact to internalize: a tool is a name, a JSON schema, and a runtime handler. Everything else --- MCP, skills, patterns, frameworks, products --- is an arrangement of these primitives. Catalog the primitives well, and the larger system becomes navigable.

--- End of The AI Agent Tools Catalog v0.1 ---