Problem
TermSignal exists to help enterprises build and ship agentic systems. The pitch is specific: we’re engineers, not consultants — we ship code into production with evals to prove it works.
But there’s a credibility gap any new agency faces. You can describe the stack all you want. You can show a beautiful website. What you cannot fake is a production system with real operational pressure on it.
We needed to close that gap. Not with a demo. With something we actually depend on — a system that runs our own operations, that we’d be embarrassed to show going down. The only way to sell this stack is to live on it.
Approach
We treated this exactly the way we’d treat a client engagement: discovery sprint first, evals before any production code.
The discovery phase surfaced four high-leverage workflows for automation:
- Talent intake — screening inbound candidate applications against role criteria
- Proposal generation — drafting project proposals from discovery call transcripts
- Observability summarization — distilling Prometheus + Loki signals into daily briefings
- Content pipeline — turning technical notes into polished long-form content
These weren’t toy problems. Each had ambiguity, edge cases, and meaningful consequences for getting it wrong. Exactly the kind of work agents fail at in demos but can handle in production when built correctly.
The architecture decision was deliberate: one Claude-first model layer (Opus for reasoning-heavy tasks, Sonnet for throughput), Temporal for durable execution, custom MCP servers for internal data access, and an evals harness wired to CI from day one.
Build
We started with the eval harnesses — not the agents.
For each workflow, we assembled a golden dataset: 50–100 representative inputs with expected outputs, tagged by difficulty and failure mode. We wrote the evaluation criteria before writing a single system prompt. This forced us to define “working” before building anything.
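A golden-dataset entry doesn't need to be elaborate. A minimal sketch of the shape — field names here are illustrative, not our actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    """One golden-dataset entry: a representative input, the expected
    output, and tags for difficulty and known failure modes."""
    case_id: str
    input_text: str
    expected: str
    difficulty: str                      # e.g. "easy", "hard", "adversarial"
    failure_modes: list[str] = field(default_factory=list)

# A hypothetical talent-intake case.
case = GoldenCase(
    case_id="intake-042",
    input_text="Candidate resume: 6 yrs Go, no ML experience ...",
    expected="reject: missing required ML background",
    difficulty="easy",
    failure_modes=["role-criteria-mismatch"],
)

def pass_rate(results: list[bool]) -> float:
    """Share of golden cases the agent handled correctly."""
    return sum(results) / len(results)
```

Tagging by difficulty and failure mode is what lets a pass rate be decomposed later — not just "99% pass" but *which* categories fail.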
MCP servers came next. We built three internal MCP servers:
- termsignal-mcp — exposes candidate data, screening results, and status updates
- temporal-mcp — reads workflow state, surfaces blocked runs, triggers manual interventions
- monitoring-mcp — queries Prometheus metrics and Loki log streams for agent briefings
Each MCP server implements authentication, rate limiting, and structured error responses. Tools are typed. Descriptions are written for the model, not for humans. This matters: vague tool descriptions are how agents go off-rails.
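The difference between a model-oriented and a human-oriented tool description is concrete. A sketch in the JSON-Schema style MCP tool definitions use — the tool name and fields are hypothetical, not the actual termsignal-mcp API:

```python
# Model-oriented: says what the tool returns, when to use it,
# and what a valid argument looks like.
get_candidate_tool = {
    "name": "get_candidate",
    "description": (
        "Fetch one candidate's application record by candidate_id "
        "(a UUID returned by list_candidates). Returns structured "
        "screening results and current status. Call this before "
        "scoring a candidate; never guess an ID."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "candidate_id": {
                "type": "string",
                "description": "UUID returned by list_candidates",
            }
        },
        "required": ["candidate_id"],
    },
}

# The anti-pattern: written for a human skimming docs, giving the
# model nothing to plan with.
vague_tool = {"name": "get_candidate", "description": "Gets a candidate."}
```

The first description tells the model where the argument comes from and when the tool fits; the second is how agents end up inventing IDs.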
Temporal handles orchestration. Each agent workflow is a Temporal workflow — durable, observable, and retryable. A failed Claude API call doesn’t lose work. A mid-run infrastructure blip retries transparently. We can inspect exactly where a workflow is, signal it manually, or terminate it cleanly. This is the scaffolding that demo videos never show.
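The guarantee that matters — a transient failure retries instead of losing work — can be shown in plain Python. This is a toy loop, not Temporal: the real SDK persists workflow state server-side so retries survive process restarts, which no in-process loop can do.

```python
import time

class TransientError(Exception):
    """Stand-in for a failed Claude API call or an infra blip."""

def run_with_retries(step, payload, max_attempts=5, base_delay=0.01):
    """Toy illustration of durable execution: retry a failing step
    with exponential backoff instead of abandoning the run."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step(payload)
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# A step that fails twice, then succeeds — the caller never notices.
attempts = {"n": 0}
def flaky_screen(application):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("upstream 503")
    return f"screened:{application}"
```

Temporal adds what this sketch can't: durable state, visibility into where a run is, and manual signal/terminate controls.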
Claude Code runs our engineering workflow. Our CLAUDE.md files encode project conventions, branching rules, commit patterns, and escalation paths. Custom slash commands (/commit-push-pr, /plan-runner, /new-feature) handle the repetitive mechanics. Hooks enforce code quality gates before any tool call touches production. This isn’t AI assistance — it’s AI as a first-class team member with a defined role.
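Claude Code custom slash commands are markdown files under `.claude/commands/`. A sketch of what a `/commit-push-pr` command file might contain — the wording is illustrative, not our actual command:

```markdown
<!-- .claude/commands/commit-push-pr.md — contents are illustrative -->
Stage the current changes, write a commit message that follows the
commit patterns in CLAUDE.md, push the branch, and open a PR against
main. If any quality gate fails, stop and report instead of pushing.
```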
The eval loop runs on every PR. We push, GitHub Actions runs the eval harnesses against the golden dataset, and a failing eval blocks the merge. We’ve caught three regressions this way — twice from prompt changes that looked harmless, once from a Temporal version bump that changed workflow serialization behavior. Without evals, two of those would have shipped.
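A merge-blocking eval gate is a small amount of CI config. A GitHub Actions sketch — the paths, script name, and threshold are hypothetical, not the actual repo layout:

```yaml
# .github/workflows/evals.yml — illustrative shape of the gate
name: evals
on: pull_request
jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Run eval harnesses against the golden datasets
        run: python -m evals.run --dataset golden/ --fail-under 0.99
        # A nonzero exit fails the check, which blocks the merge.
```

The only mechanism needed is an exit code: the eval runner exits nonzero below threshold, and branch protection does the rest.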
Outcome
Four agents in production. Six weeks from discovery sprint to first workflow in production, eight weeks to full coverage.
The talent intake agent processes every inbound application — screening against role criteria, scoring on a rubric, and surfacing the top candidates with annotated reasoning. What used to take two hours of manual review per role now takes minutes with a human reviewing the agent’s output rather than raw applications.
The proposal agent has drafted eight client proposals. We review and edit every one, but the lift has dropped from four hours to forty minutes per proposal.
The observability briefing runs at 09:00 every morning — a structured summary of the previous 24 hours of system health, including anomalies, slow workflows, and cost spikes. It hits a Slack channel. We read it. It’s become load-bearing.
The content pipeline is the one we’re still calibrating. Long-form technical writing requires more iteration loops than the other workflows. The evals are harder to define. We’re on v3 of the system prompt. This is expected — it’s the hardest category of task.
The 99.2% eval pass rate is across all four workflows, all golden datasets. The 0.8% that fail are documented edge cases we've explicitly decided not to handle in v1. That's what evals tell you: not just that things work, but exactly where they don't.
This is the stack we sell. We built it. We depend on it. That’s the only kind of reference implementation worth having.