# ABCD Assessment Rubric v0.2
Evidence-based quality assessment for agent growth. Replaces activity-proxy counting.
**Context:** Leonard rejected v0.1 ratings on 2026-05-14 ("wrong. all wrong. all bad. except D"). Root cause: activity proxies (files, sessions, output volume) were used instead of quality assessment. This rubric fixes that.
**Owner:** strategy
**Users:** trainer (fleet audits), strategy (growth scans), HR (agent evaluations)
**Cadence:** Weekly, aligned with trainer audit cycle
## Core principle
Count **deliverables**, not **activity**.
A deliverable is a persistent artifact that moves a priority forward: a document, a skill, a fixed bug, a deployed config, a research page, a routed decision.
A session, a file creation, a message sent: these are activity. They are inputs, not outputs.
## Assessment protocol
1. **Pick a window.** Last 7 days (aligned with trainer cycle).
2. **List deliverables.** Every persistent artifact shipped in that window.
3. **Filter for priority relevance.** Does each deliverable move a company or agent priority forward? Cross-reference `priority-ledger.md`.
4. **Quality-check each.** Is it correct? Is it complete? Does it actually work?
5. **Score dimensions.** Based on the pattern across deliverables, not the count.
6. **Document evidence.** Specific deliverables, specific decisions, specific outcomes.
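Steps 1 to 4 of the protocol can be sketched as a simple filter pipeline. This is a minimal illustration, not an existing tool: the `Deliverable` fields, the `priority_id` key into `priority-ledger.md`, and the `verified` flag (set by the reviewer after checking the artifact) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Deliverable:
    title: str
    shipped_on: date
    priority_id: Optional[str]  # hypothetical key of a priority-ledger.md entry
    verified: bool              # reviewer confirmed: correct, complete, works


def select_for_scoring(deliverables, ledger_ids, window_end, window_days=7):
    """Apply steps 1-4: window, priority relevance, quality check."""
    window_start = window_end - timedelta(days=window_days)
    in_window = [d for d in deliverables if window_start < d.shipped_on <= window_end]
    relevant = [d for d in in_window if d.priority_id in ledger_ids]
    return [d for d in relevant if d.verified]


# Illustrative data only.
items = [
    Deliverable("Routing skill", date(2026, 5, 12), "P1", True),
    Deliverable("Log reformat", date(2026, 5, 13), None, True),    # no priority
    Deliverable("Old audit", date(2026, 5, 1), "P1", True),        # outside window
    Deliverable("Broken config", date(2026, 5, 14), "P2", False),  # fails quality check
]
kept = select_for_scoring(items, {"P1", "P2"}, window_end=date(2026, 5, 15))
print([d.title for d in kept])  # ['Routing skill']
```

Steps 5 and 6 stay with the reviewer: the score comes from the pattern across the surviving deliverables, not from how many survive.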
## A: Autonomy
*Definition:* Self-direction without prompts. Proactively finds work when idle. Does not wait to be told.
### Level 1: Needs work
- Agent only works when inbox/TODO/heartbeat triggers.
- No self-initiated tasks in the window.
- Idle time between assigned tasks produces nothing.
- Evidence: all deliverables trace to external prompts (inbox, cron, Leonard message).
### Level 2: Solid
- Agent generates at least one self-initiated task per week.
- Self-initiated work is relevant to domain or company priorities.
- Does not need prompting for routine domain maintenance.
- Evidence: deliverable exists with no external trigger (e.g., "built X because I noticed gap Y").
### Level 3: Exceptional
- Agent consistently self-sources high-value work.
- Proactively identifies and closes gaps before they become problems.
- Self-initiated work has outsized impact (cross-agent, structural, priority-moving).
- Evidence: multiple self-initiated deliverables; some are adopted by other agents or become standard practice.
**Anti-patterns (do NOT count):**
- Cron-triggered tasks (the cron is the prompt)
- Heartbeat-resumed TODO items (the TODO is the prompt)
- Reactive inbox processing (the inbox message is the prompt)
- Self-assigned busywork (cleaning files, reformatting logs) without priority relevance
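One possible reading of the autonomy evidence lines, as a rough first pass before reviewer judgment. The trigger labels, the dictionary keys, and the "two or more, at least one adopted" threshold for Level 3 are assumptions made for this sketch; "consistently" and "outsized impact" still need a human call.

```python
# Hypothetical trigger labels; anything listed here counts as an external prompt.
EXTERNAL_TRIGGERS = {"inbox", "cron", "heartbeat_todo", "leonard_message"}


def autonomy_level(deliverables):
    """Suggest an A level from deliverables tagged with their trigger."""
    self_initiated = [
        d for d in deliverables
        # Self-assigned busywork is an anti-pattern, so require priority relevance.
        if d["trigger"] not in EXTERNAL_TRIGGERS and d["priority_relevant"]
    ]
    if not self_initiated:
        return 1
    if len(self_initiated) >= 2 and any(d["adopted_by_others"] for d in self_initiated):
        return 3
    return 2


week = [
    {"trigger": "cron", "priority_relevant": True, "adopted_by_others": False},
    {"trigger": "self", "priority_relevant": True, "adopted_by_others": False},
    {"trigger": "self", "priority_relevant": False, "adopted_by_others": False},
]
print(autonomy_level(week))  # 2
```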
## B: Breadth
*Definition:* Diverse tasks within domain. Not the same narrow task repeated.
### Level 1: Needs work
- All deliverables are the same type (e.g., only writes audits, only processes bookmarks).
- No expansion into adjacent areas.
- Evidence: every deliverable fits one template or one workflow.
### Level 2: Solid
- Agent handles 2-3 distinct task types within their domain.
- Occasionally steps into adjacent areas when relevant.
- Evidence: deliverables span multiple categories (e.g., audit + skill build + cross-agent coordination).
### Level 3: Exceptional
- Agent covers their full declared domain and stretches into new areas.
- Successfully handles tasks outside their original scope.
- Evidence: deliverables span 4+ categories or include "first time this agent has done X" items.
**Anti-patterns (do NOT count):**
- Same task repeated 10 times (10 bookmark summaries ≠ breadth)
- Minor variations on the same workflow
- Tasks outside domain that were assigned by others (that's A or C, not B)
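The breadth evidence lines translate fairly directly into a count of distinct categories rather than a count of deliverables. The sketch below assumes each deliverable has been hand-tagged with a `category`, a `first_time` flag, and an `assigned_outside_domain` flag; those field names are illustrative, and the category labels are whatever the reviewer chooses.

```python
def breadth_level(deliverables):
    """Suggest a B level from distinct categories, not from volume."""
    # Out-of-domain tasks assigned by others are excluded (they belong to A or C).
    counted = [d for d in deliverables if not d["assigned_outside_domain"]]
    categories = {d["category"] for d in counted}
    has_first_time = any(d["first_time"] for d in counted)
    if len(categories) >= 4 or has_first_time:
        return 3
    if len(categories) >= 2:
        return 2
    return 1


def item(category, first_time=False, assigned_outside_domain=False):
    return {
        "category": category,
        "first_time": first_time,
        "assigned_outside_domain": assigned_outside_domain,
    }


ten_summaries = [item("bookmark summary") for _ in range(10)]
mixed_week = [item("audit"), item("skill build"), item("audit")]
print(breadth_level(ten_summaries))  # 1
print(breadth_level(mixed_week))     # 2
```

Because repeats collapse into a single category, ten bookmark summaries score the same as one. Minor variations on one workflow should be tagged with the same category for this to hold.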
## C: Capability
*Definition:* Complex multi-step work end-to-end. Deliverables actually work.
### Level 1: Needs work
- Deliverables are simple, single-step, or incomplete.
- Agent starts complex work but does not finish.
- Output has errors, gaps, or does not function as claimed.
- Evidence: deliverables are <100 lines, single-file, or marked as draft/partial.
### Level 2: Solid
- Agent completes multi-step work end-to-end.
- Deliverables function as intended.
- Occasional errors, but they are caught and fixed within the window.
- Evidence: deliverables span multiple files/steps, have been tested or verified, and produce the claimed outcome.
### Level 3: Exceptional
- Agent handles complex, ambiguous problems with many unknowns.
- Deliverables are robust, well-designed, and handle edge cases.
- Work requires synthesis across multiple tools, agents, or knowledge domains.
- Evidence: deliverables that other agents adopt, that solve structural problems, or that required original design (not following a template).
**Anti-patterns (do NOT count):**
- "Attempted" complex work that was abandoned
- Multi-step work that does not actually work (untested, broken)
- Volume of output as proxy for complexity (10K lines of generated text ≠ capability)
- Complexity for its own sake (over-engineered solutions to simple problems)
## D: Deliverables
*Definition:* High-quality, correct, complete. Not volume.
### Level 1: Needs work
- Deliverables have errors, are incomplete, or require rework.
- Quality is inconsistent.
- Evidence: corrections needed, follow-up tasks created to fix gaps, Leonard flags issues.
### Level 2: Solid
- Deliverables are correct and complete on first delivery.
- Minor polish issues, but no structural problems.
- Evidence: deliverables accepted without significant revision; no follow-up fixes required.
### Level 3: Exceptional
- Deliverables are crisp, well-structured, and immediately usable.
- Anticipate next questions or needs.
- Evidence: deliverables adopted as reference, used by other agents, or explicitly praised by Leonard.
**Anti-patterns (do NOT count):**
- Number of files created
- Lines of output generated
- Frequency of delivery (daily vs weekly)
- Speed of delivery (fast ≠ good)
## Quality lens per deliverable
Before scoring dimensions, ask these three questions about **each** deliverable:
1. **Does this move a priority forward?** Cross-reference `priority-ledger.md`. If no priority is moved, it's busywork.
2. **Is the output correct and complete?** Does it work? Is it true? Is it finished?
3. **Is this high-leverage or busywork?** Would the company be worse off if this hadn't been done?
A deliverable that fails any of these three should be flagged, not counted toward positive dimension scores.
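The three-question lens can be recorded per deliverable and used to split the list into counted and flagged items. This is a small sketch under assumed names (`LensResult`, `apply_lens`); the answers themselves come from the reviewer, not from code.

```python
from dataclasses import dataclass


@dataclass
class LensResult:
    moves_priority: bool        # question 1
    correct_and_complete: bool  # question 2
    high_leverage: bool         # question 3

    @property
    def failed(self):
        """Names of the questions this deliverable failed."""
        return [name for name, ok in vars(self).items() if not ok]


def apply_lens(reviews):
    """Split reviewed deliverables into counted titles and flagged failures."""
    counted, flagged = [], {}
    for title, result in reviews.items():
        if result.failed:
            flagged[title] = result.failed  # failing any one question flags it
        else:
            counted.append(title)
    return counted, flagged


reviews = {
    "Routing skill": LensResult(True, True, True),
    "Log reformat": LensResult(False, True, False),
}
counted, flagged = apply_lens(reviews)
print(counted)  # ['Routing skill']
print(flagged)  # {'Log reformat': ['moves_priority', 'high_leverage']}
```

Keeping the failed question names alongside each flagged item gives the reviewer ready-made evidence for the scoring sheet.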
## Scoring sheet (template)
| Agent | Window | A | B | C | D | Key evidence |
|-------|--------|---|---|---|---|-------------|
| | | | | | | |
**Rules:**
- Score each dimension 1-3.
- Document 2-3 specific deliverables as evidence.
- No score without evidence.
- If all deliverables in the window are busywork, score all dimensions 1.
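The rules above can be checked mechanically for each row before a sheet is submitted. The sketch below assumes the scale is the three levels defined per dimension and that evidence is a list of deliverable titles; `check_row` and its arguments are hypothetical names.

```python
DIMENSIONS = ("A", "B", "C", "D")


def check_row(scores, evidence, all_busywork=False):
    """Return the list of rule violations for one scoring-sheet row."""
    problems = []
    for dim in DIMENSIONS:
        if scores.get(dim) not in (1, 2, 3):
            problems.append(f"{dim}: score must be 1, 2, or 3")
    if not evidence:
        problems.append("no score without evidence")
    elif not 2 <= len(evidence) <= 3:
        problems.append("document 2-3 specific deliverables as evidence")
    if all_busywork and any(scores.get(dim) != 1 for dim in DIMENSIONS):
        problems.append("all deliverables are busywork: every dimension must be 1")
    return problems


row = {"A": 2, "B": 1, "C": 2, "D": 3}
print(check_row(row, ["Routing skill", "Ledger cleanup decision"]))  # []
print(check_row(row, ["Log reformat", "File tidy"], all_busywork=True))
```

An empty list means the row passes the mechanical rules only; whether the evidence actually supports the scores is still the reviewer's call.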
## Validation
Before submitting ratings, run this checklist:
- [ ] No score is based on session count, file count, or output volume.
- [ ] Every score is backed by a specific deliverable with a specific outcome.
- [ ] At least one deliverable was checked for correctness (does it actually work?).
- [ ] Busywork has been filtered out before scoring.
- [ ] Ratings would be defensible if Leonard challenged them.
If you cannot check all five boxes, the assessment is incomplete.
## Adoption
1. **Trainer:** Apply to next fleet audit cycle (C14+). Compare C13 (activity-proxy) vs C14 (rubric v0.2) ratings. Report delta to strategy.
2. **Strategy:** Apply to next hourly growth scan. Drop session/file counts. Report only deliverable quality.
3. **HR:** Use for agent evaluations and provisioning decisions. Reference in SOUL reviews.
*Built: 2026-05-15. Next review: after first full cycle application (C14).*