Backed by Y Combinator

You used to test user flows.
Now test agent flows.

Real agents run your workflows end-to-end through your MCP and CLI, across every harness and every model, catching breakage before your users do.

30-day money-back guarantee
Real agents on every harness your users actually use
Claude Code Codex OpenClaw Claude ChatGPT Gemini OpenCode MiniMax
The problem

You see the calls. Not the agent.

What you can't see (90%): their AI client
  • prompt and reasoning
  • Retried 4 times. Gave up.
  • Replied "I can't help with that."
  • Suggested switching to a competitor's MCP.
What you can see (10%): your MCP server logs
14:32:08 mcp_server.connect_db({db: "users-prod"}) 200 OK

This is happening right now to your users, and you won't see it in your logs. The only way to catch agents going wrong is to run them yourself, before your users do.

Features

Everything you need to test agent flows, not just user flows.

01 · End-to-end

Put a real agent on a mission.

Test use-cases like “Deploy a simple app using my MCP, which provides cloud services” and observe how the agent does it.

02 · Coverage

Every harness. Every model. Every hour.

Keep behaviour consistent across every agent and every use-case, and catch regressions early.

03 · Heartbeat

Lightweight checks between runs.

Get alerted before your users experience a failure.

Example tool monitors: search · create · update · list
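Conceptually, a heartbeat is just a cheap probe with a latency budget, run on a schedule between full agent runs. A minimal sketch in Python (the endpoint and thresholds here are hypothetical, not Armature's actual implementation):

```python
import time
import urllib.error
import urllib.request


def heartbeat(url: str, timeout: float = 5.0) -> dict:
    """Probe one endpoint and report reachability plus latency.

    A stand-in for a lightweight between-run check: cheap enough to
    run every few minutes, informative enough to alert on.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"url": url, "ok": ok, "latency_ms": latency_ms}
```

A real tool monitor would exercise an actual MCP tool (search, create, update, list) rather than a bare HTTP endpoint, and alert when `ok` flips or latency drifts.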
04 · And more

Fix, improve, and grow.

Analyze, receive suggestions, ship improvements, watch agent usage grow.

Alerts
Suggestions
Benchmarks
Analytics
Monitoring
More soon
Get started

Three steps. No setup work.

You’re all set. Tests run in the background. Watch analytics, get alerts, receive suggestions.
Start testing risk-free 30-day money-back guarantee
About us

We've shipped both halves. Now the missing piece.

We've deployed MCPs used by millions of customers and built observability products from the ground up. Now we're building the missing piece: helping builders ship a seamless experience for their users' agents.

Backed by top investors including Y Combinator.

Palantir Tsuga Joko Y Combinator
Pricing

Simple pricing. 30-day money-back.

Try anything for 30 days. Cancel for a refund, no questions.

Starter
$49/month billed annually · $59/month billed monthly
  • 1 MCP or CLI source
  • 10 tool monitors, checked up to every 5 minutes
  • 1 end-to-end workflow tested daily
  • Tests on Claude Code & Codex
  • Slack & Email alerts
  • Standard support
Start testing risk-free 30-day money-back guarantee
Enterprise
Custom, tailored to your team
  • Everything in Pro
  • Custom limits & test frequencies
  • SSO & audit logs
  • Dedicated CSM
  • SLA-backed uptime
FAQ

Common questions before you start.

Is this another agent observability tool, like LangSmith or Braintrust?
No. LangSmith and Braintrust observe and evaluate the agents you build. Armature tests how your users' agents behave on the product you ship: your MCP, your CLI. Think of it as the new layer that replaces UI testing: you used to test the UI your human users clicked through, now you test the MCP their agent calls. We grade their behavior and the reasoning behind it, the part you'd typically uncover during user interviews with humans.
How is this different from Datadog Synthetics or Playwright-based testing?
Synthetics tools test deterministic browser flows. They were built for the click-based web. Armature spawns real LLM agents that reason their way through your MCP and CLI like real users. We catch the regression a script can't see: the wrong tool picked, the off-script retry, the path that only fails on Opus 4.7.
Can't I just ask my Claude Code to test my MCP?
It's tempting, but biased. Your Claude Code already knows how to use your MCP and has full context from your codebase, exactly the context your real users won't have. You'd also need to re-run it on every harness, every time you ship a change, every time a new model lands, and regularly even when nothing's new. Labs ship harness updates and reasoning improvements that quietly change behavior on production MCPs.
Do I need to manually write all my own test scenarios?
No. An AI agent discovers your tools, configures monitors automatically, and suggests realistic end-to-end workflows: the kind your users would actually prompt their agent for. It also defines the right evaluation criteria, grounded in both the real outcome and the tester agent's reasoning trace. You can edit anything, but the default setup is on its feet in under a minute.
Which MCPs and CLIs do you support?
Any MCP that speaks the protocol (HTTP, SSE, stdio) and any CLI we can spawn in a sandbox. Bring your own bearer token, API key, or basic auth. Secrets are stored server-side, never exposed. OAuth coming soon.
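In practice, "speaks the protocol" means answering MCP's JSON-RPC 2.0 handshake. A minimal sketch of the two messages a tester sends first (method names follow the MCP specification; the client name and version are placeholders, and the protocol version shown is one published spec revision):

```python
import json

# Sketch of the opening MCP exchange over the stdio transport.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",  # one published spec revision
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}

# Over stdio, each message travels as a single line of JSON:
wire = json.dumps(initialize) + "\n" + json.dumps(list_tools) + "\n"
print(wire)
```

Any server that responds sensibly to this exchange, whether reached over HTTP, SSE, or stdio, exposes everything an agent needs to discover and call its tools.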
How does billing work?
Everything is fully included in your plan, within the quotas listed above. No hidden charges, no metered surprises. If you need more than what's available, talk to us and we'll work out something custom.
Is there a free tier or a free trial?
We don't offer a free tier or a 7-day trial because we want you to enjoy full access from day one, and for long enough to see real value. Instead, we make it completely risk-free: if you're not 100% satisfied within 30 days, tell us and get a full refund. No questions asked.

Start testing your MCP or CLI in 60 seconds.

Drop your MCP URL or CLI install command. The agent does the setup. We do the testing across every harness and every model, around the clock.

30-day money-back guarantee