BombadilLabs

The workshop

What we're building.

Three agentic systems, each a working answer to the same question: what becomes possible when an agent can prove what it did? These are active lab projects — the briefs below are the real ones.

in development · software engineering · autonomous maintenance · supply-chain security

Warden

The codebase custodian

The problem

Every engineering team carries an invisible second job: the custody of its own codebase. Issue triage, bug reproduction, dependency patching, flaky-test forensics, the long tail of "known, shallow, and nobody's priority." It is the work that makes everything else possible, and it loses the prioritization argument every single sprint.

Coding agents made it cheap to generate changes. They did not make it cheap to trust them. A pull request without a reproduction, a failing-then-passing test, and a scoped diff is not help — it's review burden wearing a helpful expression.

The bet

Warden is built on a simple inversion: the agent's job is not to write code, it's to produce evidence. Code is a byproduct.

For every piece of work it takes on, Warden assembles a case file:

  1. Reproduce. The bug is confirmed in a disposable sandbox, with a minimal failing test checked in before any fix is attempted.
  2. Fix narrowly. The smallest diff that turns the test green. No drive-by refactors, no opportunistic cleanup.
  3. Prove. Full suite, lints, type checks, and a risk note: what was touched, what could regress, what wasn't verified.
  4. Stand for review. The PR opens with the case file attached. A reviewer reads evidence, not vibes.

The same loop covers dependency custody — supply-chain-aware updates that quarantine new releases, read changelogs and diffs, and ship upgrades with proof that nothing observable changed.
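
As a minimal sketch of the four-step loop above, a case file can be modeled as a record that refuses to stand for review until each step has left evidence behind. The `CaseFile` class and its field names are hypothetical illustrations, not Warden's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseFile:
    """Hypothetical record of the evidence gathered for one piece of work."""
    issue: str
    failing_test: Optional[str] = None   # step 1: test that reproduces the bug
    diff_lines: Optional[int] = None     # step 2: size of the narrow fix
    suite_passed: bool = False           # step 3: full suite, lints, type checks
    risk_note: str = ""                  # step 3: touched / could regress / unverified

    def missing_evidence(self) -> List[str]:
        """Return the steps that still lack evidence, in loop order."""
        missing = []
        if not self.failing_test:
            missing.append("reproduce")
        if self.diff_lines is None:
            missing.append("fix")
        if not (self.suite_passed and self.risk_note):
            missing.append("prove")
        return missing

    def ready_for_review(self) -> bool:
        """Step 4: the PR opens only when nothing is missing."""
        return not self.missing_evidence()
```

The design choice worth noting is that review readiness is derived from the evidence rather than set by the agent, so a PR cannot claim to be ready without the reproduction and the risk note.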

What makes it different

Most autonomous-coding products chase breadth: more languages, bigger tasks, flashier demos. Warden chases custody — the narrow, deep set of maintenance work where correctness is checkable and autonomy is therefore provable. Its autonomy budget is explicit: actions are tiered, every tier is gated on its own track record, and the gate moves only when the eval record says it has earned the move.
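
One way to picture an explicit autonomy budget is a gate that each tier must clear on its own eval record. The sketch below is an assumption-laden illustration: the `TierRecord` type, the tier names, and the thresholds (`min_runs`, `min_rate`) are all hypothetical, and the text does not say how Warden actually scores a track record.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TierRecord:
    """Hypothetical eval record for one action tier."""
    name: str
    runs: int = 0
    successes: int = 0

    def success_rate(self) -> float:
        # A tier with no runs has earned nothing yet.
        return self.successes / self.runs if self.runs else 0.0


def autonomous_tiers(records: List[TierRecord],
                     min_runs: int = 50,
                     min_rate: float = 0.98) -> List[str]:
    """Names of tiers whose own record clears the (assumed) gate."""
    return [
        r.name
        for r in records
        if r.runs >= min_runs and r.success_rate() >= min_rate
    ]


records = [
    TierRecord("label-issue", runs=200, successes=199),
    TierRecord("open-pr", runs=60, successes=55),
    TierRecord("merge", runs=10, successes=10),
]
print(autonomous_tiers(records))  # ['label-issue']
```

In this example a perfect record on ten merges still does not open the merge tier, because the gate asks for volume as well as rate.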

Where it stands

Warden runs against the lab's own repositories today — this site included — and is in development toward a private beta with a small set of teams who maintain serious production codebases.

research preview · research synthesis · knowledge systems · provenance

Loremaster

A living review of everything you cannot afford to miss

The problem

The reading is unbounded and the readers are not. In any fast-moving field — AI itself being the worst offender — the half-life of a literature review is measured in weeks. Teams make consequential decisions on top of an understanding that quietly expired months ago, and nobody can say precisely which belief expired, because beliefs were never written down with their sources attached.

Retrieval-augmented chat doesn't solve this. Asking a model "what does the literature say?" produces an answer, but not an accountable one: no claim-level citations, no record of what was searched, no notification when the answer changes.

The bet

Loremaster treats a body of knowledge as a living artifact with a maintenance contract, not a one-shot summary.

  • Continuous reading. Agents monitor sources you designate — preprint servers, journals, standards bodies, internal docs — and triage what's new against what you already believe.
  • Claim-level provenance. Syntheses are built from atomic claims, each pinned to its sources, extraction date, and confidence. Every sentence in a brief can answer the question "says who?"
  • Contradiction detection. When new evidence undercuts a standing claim, the claim is flagged, the dependent conclusions are traced, and the humans who relied on it are told — proactively, with the receipts.
  • Auditable briefs. Output is a document a careful skeptic can verify: claims, sources, the search trail, and an explicit list of what was not covered.
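
To make claim-level provenance and contradiction tracing concrete, here is a minimal sketch under stated assumptions. The `Claim` type, its field names, and `flag_contradicted` are hypothetical; the text names the provenance fields (sources, extraction date, confidence) but not the data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class Claim:
    """Hypothetical atomic claim carrying the provenance fields named above."""
    claim_id: str
    text: str
    sources: list[str]
    extracted_on: date
    confidence: float
    depends_on: list[str] = field(default_factory=list)  # ids this claim rests on
    flagged: bool = False


def flag_contradicted(claims: dict[str, Claim], contradicted_id: str) -> list[str]:
    """Flag a claim and every conclusion that depends on it, directly or not.

    Returns the flagged ids in the order they were reached.
    """
    flagged: list[str] = []
    queue = [contradicted_id]
    while queue:
        current = queue.pop(0)
        if claims[current].flagged:
            continue  # already visited via another path
        claims[current].flagged = True
        flagged.append(current)
        # Enqueue every claim that lists the current one as a dependency.
        queue.extend(cid for cid, c in claims.items() if current in c.depends_on)
    return flagged
```

The returned list is the set of conclusions whose readers would need to be notified; who relied on which claim is left out of this sketch.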

What makes it different

Synthesis tools optimize for the moment of asking. Loremaster optimizes for the months after — the slow drift between what the field knows and what your team believes. It is the difference between a search engine and a librarian who has read everything, remembers what you've read, and taps you on the shoulder at exactly the right moment.

Where it stands

Loremaster is in research preview inside the lab, where it maintains our own agent-reliability literature base. The provenance model and contradiction-detection evals are the current focus before we open it to outside research teams.

research preview · Ethereum · agent economy · verification

Goldberry

Proof of conduct for economic agents

The problem

The agent economy arrived onchain faster than anyone scheduled it. ERC-8004 went live on Ethereum mainnet in January 2026 and crossed two hundred thousand registered agents across twenty-plus networks in its first quarter; x402 settled its hundred-millionth agentic payment on Base around the same time. The registries answered "who is this agent?" and the payment rail answered "how does it get paid?"

Nobody shipped the answer to "should you let it touch your money?" The standard's own Validation Registry, the part that would make "trustless agents" mean something, remains under revision, and the spec authors say so plainly. Meanwhile audits of the registries find the overwhelming majority of registered agents are dead endpoints, and reputation, where it exists, is a feedback field anyone can farm. The result is a market where identity is cheap, claims are free, and verification is nowhere: exactly the conditions under which capital correctly refuses to delegate.

The bet

Goldberry makes an agent's conduct a first-class, slashable, onchain fact. Three mechanisms, composed:

  • Policy-scoped execution. Every Goldberry engagement runs under an EIP-7702 session key whose policy — venues, sizes, counterparties, loss bounds — is enforced at the account layer, not promised in a system prompt. The agent physically cannot exceed its mandate; the mandate is readable by anyone before they delegate.
  • Staked re-execution. Sampled jobs are replayed by independent validators in deterministic sandboxes against the same inputs, with results posted through ERC-8004's validation hooks. Validators stake against their attestations; catching a deviation pays, signing a false pass slashes. Verification becomes a market with the correct sign on every incentive.
  • Conduct-gated settlement. Payment flows over x402 with a holdback that releases when validation clears. Intent-based order flow (ERC-7683) keeps execution venue-agnostic; the conduct record travels with the agent's identity, not with any one chain.
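
The three mechanisms above can be modeled off-chain in a few lines. This is a plain Python sketch of the incentive logic, not contract code and not the Goldberry implementation: `Policy`, `within_policy`, `validator_payoff`, and `settle` are hypothetical names, and the basis-point holdback, the refund to the client when validation fails, and the zero payoff for attestations on correct jobs are assumptions the text does not specify.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Policy:
    """Hypothetical mandate: allowed venues, per-order size cap, loss bound."""
    venues: FrozenSet[str]
    max_order_size: int
    max_loss: int


def within_policy(policy: Policy, venue: str, size: int, loss_so_far: int) -> bool:
    """True if the proposed order stays inside the mandate."""
    return (
        venue in policy.venues
        and size <= policy.max_order_size
        and loss_so_far <= policy.max_loss
    )


def validator_payoff(attested_pass: bool, job_was_correct: bool,
                     stake: int, bounty: int) -> int:
    """Net balance change for a validator on one sampled job."""
    if not job_was_correct and not attested_pass:
        return bounty      # caught a deviation: paid
    if not job_was_correct and attested_pass:
        return -stake      # signed a false pass: slashed
    return 0               # assumption: no payoff on correct jobs


def settle(fee: int, holdback_bps: int, validation_passed: bool) -> Tuple[int, int]:
    """Split a fee into (paid to agent, returned to client)."""
    holdback = fee * holdback_bps // 10_000
    upfront = fee - holdback
    if validation_passed:
        return upfront + holdback, 0   # holdback released to the agent
    return upfront, holdback           # assumption: holdback goes back to the client
```

For example, with a fee of 1000 and a 2000 basis-point holdback, `settle` pays the agent 1000 when validation clears and 800 when it does not.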

An agent operating under Goldberry doesn't ask to be trusted. It accumulates a public record of policy-bound, re-executed, stake-backed work — and "trust me" becomes "slash me."

What makes it different

This is the lab's reliability research with money on the line. Field Note 001 derives why verification, not actor quality, is the binding constraint on how much autonomy a system can be given: the verifier's dividend, 1/(1-c). Goldberry is that same arithmetic applied to economic delegation: raise the catch rate on misconduct and you raise, mechanically, the amount of capital it is rational to leave in an agent's hands. Everyone else in the agent economy is scaling the actor. We're scaling the verifier, because that's where the dividend is.
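
A quick worked reading of the 1/(1-c) expression, assuming c is the misconduct catch rate with 0 <= c < 1. The paragraph implies that reading, but the derivation lives in Field Note 001 and is not reproduced here. `verifier_dividend` is a hypothetical helper; exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction


def verifier_dividend(c: Fraction) -> Fraction:
    """Evaluate 1/(1-c) for an assumed catch rate c in [0, 1)."""
    if not 0 <= c < 1:
        raise ValueError("c must satisfy 0 <= c < 1")
    return 1 / (1 - c)


for c in (Fraction(1, 2), Fraction(9, 10), Fraction(99, 100)):
    print(c, "->", verifier_dividend(c))
```

Under that reading, moving the catch rate from 1/2 to 9/10 raises the multiplier from 2 to 10, and 99/100 gives 100: the gains come fastest at the high end of c.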

It's also deliberately unglamorous about scope. Goldberry doesn't trade memecoins, predict markets, or promise yield. It is plumbing — the conduct layer the standards left open — built to be the boring, load-bearing thing every serious economic agent ends up standing on.

Where it stands

Research preview. The policy-enforcement layer and the deterministic re-execution harness are running against testnet deployments of the ERC-8004 registries; the validator economics are being red-teamed on paper before they're red-teamed by the internet. We're talking with teams operating real agent fleets who want their conduct on the record — the strongest possible signal in a market of dead endpoints.