An applied AI research & product lab

Agentic systems that earn their autonomy.

Bombadil Labs is an independent lab researching how autonomous AI systems fail — and building ones that don't. Research, R&D, and consulting for teams who'd rather measure than hope.

Work with us | See the work

Trusted by teams at

AWS, Microsoft

What we do

Three practices, one obsession: autonomy you can audit.

Research

Failure modes, evaluation harnesses, and the unglamorous science of making agents reliable. We study how autonomous systems break so yours don't have to.

Agentic R&D

We design, prototype, and harden autonomous systems end to end — from the first tool call to the production incident review. Our own products are the proving ground.

Consulting

Embedded with your team. Architecture reviews, eval pipelines, deployment guardrails — and a straight answer about what not to automate yet.

In the workshop

What we're building.

Three agentic systems in active development at the lab. Each one is a bet that autonomy done carefully beats automation done carelessly.

in development

Warden

The codebase custodian

An autonomous custodian for production repositories. Warden triages incoming issues, reproduces bugs in sandboxed environments, writes the regression test first, then opens evidence-backed pull requests — while your team sleeps.

Read the brief

research preview

Loremaster

A living review of everything you cannot afford to miss

An agentic research engine that reads the literature continuously, maintains living syntheses with claim-level provenance, and flags contradictions the moment they appear — so your team's understanding never goes stale.

Read the brief

design partners

Nightwatch

Self-healing infrastructure, with receipts

A fleet of operations agents that watches your telemetry, drafts ranked incident hypotheses, and executes guarded runbooks behind explicit approval gates — turning 3 a.m. pages into morning reading.

Read the brief

How we work

House rules, written in ink.

01. Autonomy is earned.
Capability ships behind evidence, never ahead of it. An agent gets exactly as much rope as its eval record justifies.
02. Evals before vibes.
A demo is an anecdote. If it isn't measured, it isn't done; if it can't be measured, it isn't ready.
03. Small, sharp tools.
The most trustworthy agent is the one with the fewest ways to go wrong. We subtract before we add.
04. Boring is a feature.
Reliability compounds; novelty depreciates. We optimize for the systems still quietly working in five years.
05. Leave things better.
Codebases, datasets, teams. We tend whatever garden we pass through, and we hand back the keys in better shape than we found them.

Field notes

Dispatches from the lab notebook.

The notebook is open; the ink is drying.

We write up what we learn — eval design, agent failure taxonomies, the occasional detour into the weeds. First notes land soon. Subscribe via RSS to read them the moment they do.

Have something worth building carefully?

Tell us what you're trying to make autonomous — and what keeps you up at night about it. We reply to every serious note.

support@bombadillabs.io