Memory · Evals · Replay · Distillation

The improvement loop for coding agents.

Capture · Replay · Distill · Evaluate — the layer that plugs into any harness you already use and makes it get better over time. The durable value is the loop, not the orchestration.

Developers → a provenance-first registry · Teams → a managed loop

Why this exists

Frontier-only tooling prices out individual developers, and for many teams the code simply can’t leave its own environment. The improvement should run against the tools you already have, with the option to keep everything local.

How it works

A loop, not a pipeline.

Every successful coding episode becomes fuel for the next one. Capture what happened, replay what worked, distill it into reusable skills and small models, then evaluate the result — and feed it back in.

01 · Capture · real coding episodes, as they happen
02 · Replay · the trajectories that actually worked
03 · Distill · reusable skills + small models
04 · Evaluate · under a controlled protocol

↺ the output of evaluate feeds the next capture — it is a loop, not a pipeline
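As a rough illustration of the four steps above, here is a minimal Python sketch of one cycle. Everything in it is hypothetical: `Episode`, `LoopState`, `run_cycle`, the toy "distillation" (collecting actions from successful episodes) and the keep-if-not-worse rule are assumptions made for the example, not the project's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Episode:
    task: str
    actions: List[str]
    succeeded: bool


@dataclass
class LoopState:
    skills: List[str] = field(default_factory=list)    # distilled, reusable skills
    scores: List[float] = field(default_factory=list)  # one evaluation score per cycle


def run_cycle(
    state: LoopState,
    capture: Callable[[List[str]], List[Episode]],
    evaluate: Callable[[List[str]], float],
) -> LoopState:
    # 01 Capture: run the harness with the current skills and record episodes.
    episodes = capture(state.skills)
    # 02 Replay: keep only the trajectories that actually worked.
    worked = [e for e in episodes if e.succeeded]
    # 03 Distill (toy stand-in): treat each new action seen in a successful
    # episode as a candidate skill.
    new_skills = sorted({a for e in worked for a in e.actions} - set(state.skills))
    candidate = state.skills + new_skills
    # 04 Evaluate: score the candidate skill set.
    score = evaluate(candidate)
    # Feed back (assumed rule): adopt the candidate only if it scores no worse
    # than the previous cycle, so the next capture starts from it.
    if not state.scores or score >= state.scores[-1]:
        state.skills = candidate
    state.scores.append(score)
    return state
```

Calling `run_cycle` repeatedly with the same `LoopState` is what makes this a loop rather than a pipeline: each cycle's accepted skills become the input to the next capture.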

Plugs into the harnesses you already use

Codex · Aider · SWE-agent · OpenHands · …and more via adapters

Architecture

The shape of it

Five conceptual layers.

From the harnesses you run, up through the loop, evaluation and provenance, the registry, and — specified but not yet built — a managed cloud control plane.

5 · Cloud control plane

hosted memory · execution · metering · gateway

spec'd, not built

4 · Registry & leaderboard

publish · verify · rank — only verifiable results listed

provenance-first

3 · Evaluation & provenance

multi-mode runner · CL metrics · promotion gate · manifests

2 · The loop

capture → replay → distill

1 · Harness & adapters

thin connectors into the coding harnesses you already run

a conceptual map — each layer is a responsibility, not a repository

Two paths

One loop, two front doors.

For developers

Registry & leaderboard

Publish small fine-tuned coding agents and reusable skills. Rankings are provenance-first: a result is listed only if its manifest and digests verify. No verifiable manifest, no entry.

▪ publish · verify · rank
▪ weights via Hugging Face Hub pointer
▪ every number carries a manifest
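To make "listed only if its manifest and digests verify" concrete, here is a small sketch of a digest check. The manifest layout (a JSON file with an `artifacts` map from relative path to SHA-256 hex digest) and the function names are assumptions for illustration; the registry's real manifest format is not described on this page.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest_path: Path) -> bool:
    """Return True only if every listed artifact exists and matches its digest.

    Assumed (hypothetical) manifest shape:
        {"artifacts": {"results.json": "<sha256 hex>", ...}}
    """
    manifest = json.loads(manifest_path.read_text())
    artifacts = manifest.get("artifacts", {})
    if not artifacts:
        # No verifiable manifest, no entry.
        return False
    root = manifest_path.parent
    for rel_path, expected in artifacts.items():
        artifact = root / rel_path
        if not artifact.is_file() or sha256_of(artifact) != expected:
            return False
    return True
```

An empty or missing artifact list is treated as a failure here, so a result cannot be listed by publishing a manifest that attests to nothing.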

For teams

Managed Ferrum Cloud

The loop, hosted: managed memory, execution, metering and a gateway — with code-can’t-leave treated as a hard constraint, not a tier. Specified in detail; built once demand is proven.

▪ private code never routed through shared compute
▪ metering + gateway + secure execution profiles
▪ bootstrapped, pay-per-use cost posture

Proof — by protocol

Continual learning is measured, not asserted.

We treat continual learning as a claim to be proven under a controlled-exposure protocol — not a result we are announcing. The metric definitions live in code as the source of truth; what follows are the methodology and the success thresholds, not numbers we have already hit.

FWT · Forward transfer

Does learning earlier tasks help on later, unseen ones?

Success threshold: FWT > 0

BWT · Backward transfer

Does new learning erode what the agent already knew?

Success threshold: BWT ≥ −ε (bounded forgetting)

PI · Plasticity Index

Can the agent still absorb genuinely new skills over time?

Success threshold: Plasticity ≥ 0.5
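For readers who want the arithmetic, the sketch below uses the forward- and backward-transfer definitions common in the continual-learning literature, where `R[i][j]` is the score on task `j` after learning tasks `0..i` and `b[j]` is the baseline score on task `j`. This is an assumption: the project's own metric code is the source of truth and may define these differently. The Plasticity Index has no definition on this page, so it is taken as an already-computed input, and `eps` has no stated value, so it is a required argument.

```python
from typing import List, Sequence


def forward_transfer(R: Sequence[Sequence[float]], b: Sequence[float]) -> float:
    """Mean gain on task j, measured just before learning it, over the baseline."""
    T = len(R)
    return sum(R[j - 1][j] - b[j] for j in range(1, T)) / (T - 1)


def backward_transfer(R: Sequence[Sequence[float]]) -> float:
    """Mean change on earlier tasks between when they were learned and the end."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def unmet_thresholds(fwt: float, bwt: float, plasticity: float, eps: float) -> List[str]:
    """Return the thresholds that fail; an empty list means all three hold."""
    failures = []
    if not fwt > 0:
        failures.append("FWT > 0")
    if not bwt >= -eps:
        failures.append("BWT >= -eps")
    if not plasticity >= 0.5:
        failures.append("Plasticity >= 0.5")
    return failures
```

Both transfer functions need at least two tasks, since they average over `T - 1` terms. A negative `backward_transfer` indicates forgetting, which is why its threshold is a bound rather than a strict sign test.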

The protocol, in brief

Same student model and command budget across arms; a pinned benchmark and dataset hash; a frozen zero-shot baseline that runs ephemerally, with no memory leakage; a fixed or repeated-randomized task order, reported with variance. Numbers reach this page only once a manifest-backed entry exists; until then, it lists thresholds and method only.
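One way to picture the protocol is as a pre-flight check that refuses to compare two arms unless the controlled variables match. The sketch below is an assumption-laden illustration: `ArmConfig`, its field names, `protocol_violations` and `summarize` are hypothetical and do not come from the project's code.

```python
from dataclasses import dataclass
from statistics import mean, variance
from typing import Dict, List


@dataclass(frozen=True)
class ArmConfig:
    name: str
    student_model: str
    command_budget: int
    benchmark: str
    dataset_sha256: str
    uses_memory: bool


def protocol_violations(baseline: ArmConfig, treatment: ArmConfig) -> List[str]:
    """List the ways two arms break the controlled-exposure protocol."""
    problems = []
    # Controlled variables must be identical across arms.
    for attr in ("student_model", "command_budget", "benchmark", "dataset_sha256"):
        if getattr(baseline, attr) != getattr(treatment, attr):
            problems.append(f"{attr} differs between arms")
    # The zero-shot baseline must run without any memory.
    if baseline.uses_memory:
        problems.append("baseline must run without memory")
    return problems


def summarize(scores_per_order: List[float]) -> Dict[str, float]:
    """Mean and sample variance across repeated randomized task orders."""
    return {
        "runs": len(scores_per_order),
        "mean": mean(scores_per_order),
        "variance": variance(scores_per_order),  # needs at least two runs
    }
```

An empty list from `protocol_violations` means the comparison is admissible under these assumed checks. `summarize` raises `statistics.StatisticsError` on a single run, which matches the requirement that randomized orders be repeated before a variance is reported.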

First receipts

14/24 → 16/24 · completion, frozen 24-task suite
+0.026/cycle · reward slope
1 candidate · rejected by the promotion gate

Frozen local 7B model (qwen2.5-coder) — only the harness learns. Four eval-gated cycles; full data released.

Read the lab report ↗ · Raw data + video (SHA-pinned release) ↗

Join the waitlist.

Early and bootstrapped. Tell us whether you’re a developer or a team and we’ll reach out as the loop opens up.