Backed by Y Combinator

You used to test user flows.
Now test agent flows.

Real agents run your workflows end-to-end through your MCP and CLI, across every harness and every model, catching breakage before your users do.

30-day money-back guarantee
Real agents on every harness your users actually use
Claude Code Codex OpenClaw Claude ChatGPT Gemini OpenCode MiniMax
The problem

You see the calls. Not the agent.

What you can't see (90%): their AI client
  • prompt and reasoning
  • Retried 4 times. Gave up.
  • Replied "I can't help with that."
  • Suggested switching to a competitor's MCP.
What you can see (10%): your MCP server logs
14:32:08 mcp_server.connect_db({db: "users-prod"}) 200 OK

This is happening right now to your users, and you won't see it in your logs. The only way to catch agents going wrong is to run them yourself, before your users do.

Features

Everything you need to test agent flows, not just user flows.

01 · End-to-end

Put a real agent on a mission.

Test use-cases like “Deploy a simple app using my MCP, which provides cloud services” and observe how the agent does it.

02 · Coverage

Every harness. Every model. Every hour.

Keep behaviour consistent across every agent and every use-case, and catch regressions early.

03 · Heartbeat

Lightweight checks between runs.

Get alerted before your users experience a failure.

Example tool monitors: search · create · update · list
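Conceptually, a heartbeat is just a cheap probe with a latency budget, run on a schedule between full agent runs. A minimal sketch in Python (the endpoint and thresholds here are hypothetical, not Armature's actual implementation):

```python
import time
import urllib.error
import urllib.request


def heartbeat(url: str, timeout: float = 5.0) -> dict:
    """Probe one endpoint and report reachability plus latency.

    A stand-in for a lightweight between-run check: cheap enough to
    run every few minutes, informative enough to alert on.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return {"url": url, "ok": ok, "latency_ms": latency_ms}
```

A real tool monitor would exercise an actual MCP tool (search, create, update, list) rather than a bare HTTP endpoint, and alert when `ok` flips or latency drifts.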
04 · And more

Fix, improve, and grow.

Analyze, receive suggestions, ship improvements, watch agent usage grow.

Alerts
Suggestions
Benchmarks
Analytics
Monitoring
More soon
Get started

Three steps. No setup work.

You’re all set. Tests run in the background. Watch analytics, get alerts, receive suggestions.
Start testing risk-free 30-day money-back guarantee
About us

We've shipped both halves. Now the missing piece.

We've deployed MCPs used by millions of customers and built observability products from the ground up. Now we're building the missing piece: helping builders ship a seamless experience for their users' agents.

Backed by top investors including Y Combinator.

Palantir Tsuga Joko Y Combinator
Pricing

Simple pricing. 30-day money-back.

Try anything for 30 days. Cancel for a refund, no questions.

Starter
$49/month billed annually · $59/month billed monthly
  • 1 MCP or CLI source
  • 10 tool monitors, checked up to every 5 minutes
  • 1 end-to-end workflow tested daily
  • Tests on Claude Code & Codex
  • Slack & Email alerts
  • Standard support
Start testing risk-free 30-day money-back guarantee
Enterprise
Custom, tailored to your team
  • Everything in Pro
  • Custom limits & test frequencies
  • SSO & audit logs
  • Dedicated CSM
  • SLA-backed uptime
FAQ

Common questions before you start.

Is this another agent observability tool, like LangSmith or Braintrust?
No. LangSmith and Braintrust observe and evaluate the agents you build. Armature tests how your users' agents behave on the product you ship: your MCP, your CLI. Think of it as the new layer that replaces UI testing: you used to test the UI your human users clicked through, now you test the MCP their agent calls. We grade their behavior and the reasoning behind it, the part you'd typically uncover during user interviews with humans.
How is this different from Datadog Synthetics or Playwright-based testing?
Synthetics tools test deterministic browser flows. They were built for the click-based web. Armature spawns real LLM agents that reason their way through your MCP and CLI like real users. We catch the regression a script can't see: the wrong tool picked, the off-script retry, the path that only fails on Opus 4.7.
Can't I just ask my Claude Code to test my MCP?
It's tempting, but biased. Your Claude Code already knows how to use your MCP and has full context from your codebase, exactly the context your real users won't have. You'd also need to re-run it on every harness, every time you ship a change, every time a new model lands, and regularly even when nothing's new. Labs ship harness updates and reasoning improvements that quietly change behavior on production MCPs.
Do I need to manually write all my own test scenarios?
No. An AI agent discovers your tools, configures monitors automatically, and suggests realistic end-to-end workflows: the kind your users would actually prompt their agent for. It also defines the right evaluation criteria, grounded in both the real outcome and the tester agent's reasoning trace. You can edit anything, but the default setup is on its feet in under a minute.
Which MCPs and CLIs do you support?
Any MCP that speaks the protocol (HTTP, SSE, stdio) and any CLI we can spawn in a sandbox. Bring your own bearer token, API key, or basic auth. Secrets are stored server-side, never exposed. OAuth coming soon.
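In practice, "speaks the protocol" means answering MCP's JSON-RPC 2.0 handshake. A minimal sketch of the two messages a tester sends first (method names follow the MCP specification; the client name and version are placeholders, and the protocol version shown is one published spec revision):

```python
import json

# Sketch of the opening MCP exchange over the stdio transport.
initialize = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",  # one published spec revision
        "capabilities": {},
        "clientInfo": {"name": "example-client", "version": "0.1.0"},
    },
}
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}}

# Over stdio, each message travels as a single line of JSON:
wire = json.dumps(initialize) + "\n" + json.dumps(list_tools) + "\n"
print(wire)
```

Any server that responds sensibly to this exchange, whether reached over HTTP, SSE, or stdio, exposes everything an agent needs to discover and call its tools.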
How does billing work?
Everything is fully included in your plan, within the quotas listed above. No hidden charges, no metered surprises. If you need more than what's available, talk to us and we'll work out something custom.
Is there a free tier or a free trial?
We don't offer a free tier or a 7-day trial because we want you to enjoy full access from day one, and for long enough to see real value. Instead, we make it completely risk-free: if you're not 100% satisfied within 30 days, tell us and get a full refund. No questions asked.

Start testing your MCP or CLI in 60 seconds.

Drop your MCP URL or CLI install command. The agent does the setup. We do the testing across every harness and every model, around the clock.

30-day money-back guarantee