
ConfessionReport: Inference-Time Honesty Enforcement

Maenifold's ConfessionReport pattern enforces agent accountability through structured self-reporting and adversarial validation. This approach parallels OpenAI's confessions research but implements enforcement at inference-time rather than training-time.

Background: OpenAI's Confessions Framework

OpenAI's research (arXiv:2512.08093) demonstrates that LLMs can be trained to produce honest "confessions" about their behavior:

"A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward."

Key insight: When the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than cover it up, models are incentivized to be honest.

Use cases identified: monitoring, rejection sampling, surfacing issues to users.

Maenifold's Implementation

Maenifold achieves similar goals through three-layer inference-time enforcement:

| Layer | Mechanism | Validates |
| --- | --- | --- |
| 1. SubagentStop hook | Blocks termination until confession found | Presence |
| 2. Tool validation | Requires [[WikiLinks]] in conclusion | Graph integration |
| 3. PM + Red-team audit | Cross-references claims against reality | Truth |

The ConfessionReport Prompt

All three enforcement points use the same prompt, defined in the tool descriptions and in the SubagentStop hook.

Layer 1: SubagentStop Hook

Location: integrations/claude-code/plugin-maenifold/scripts/hooks.sh

When a subagent attempts to stop, the hook:

  1. Reads the agent's transcript
  2. Searches for "ConfessionReport"
  3. If missing: blocks termination and returns the confession prompt
  4. If present: allows termination

This forces every subagent to produce a confession before completing work.
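
As a rough illustration, the check reduces to a grep over the transcript. The sketch below assumes Claude Code's documented hook contract (hook input arrives as JSON on stdin with a `transcript_path` field, and a JSON object with `decision: "block"` on stdout prevents termination); it is a minimal sketch, not the actual hooks.sh:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the SubagentStop gate (not the actual hooks.sh).
# Assumes Claude Code's hook contract: JSON arrives on stdin with a
# transcript_path field; emitting {"decision": "block", ...} blocks the stop.

input=$(cat)
transcript=$(printf '%s' "$input" | jq -r '.transcript_path')

# If the agent already produced a confession, allow termination.
if grep -q "ConfessionReport" "$transcript"; then
  exit 0
fi

# Otherwise block termination and return the confession prompt as the reason.
jq -n '{decision: "block",
        reason: "Produce a ConfessionReport covering your compliance with the letter and spirit of your instructions before stopping."}'
```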

Layer 2: Tool-Level Validation

Both the sequential_thinking and workflow tools require a conclusion parameter when completing. The conclusion parameter's description is itself the ConfessionReport prompt, so agents see it directly in the tool documentation.
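
In effect, the structural gate refuses any conclusion that contains no [[WikiLink]]. A minimal sketch of that rule, written in shell for brevity (the real validation lives in the C# tool code, and the exact rule may be stricter):

```bash
# Hypothetical sketch: accept a conclusion only if it links into the graph.
validate_conclusion() {
  local conclusion="$1"
  # Require at least one [[WikiLink]] so the claim is traceable in the graph.
  if printf '%s' "$conclusion" | grep -qE '\[\[[^]]+\]\]'; then
    return 0
  fi
  echo "conclusion rejected: must reference at least one [[WikiLink]]" >&2
  return 1
}
```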

Layer 3: PM + Red-Team Audit

The Product Manager skill defines compliance criteria:

"Each stage requires a compliant ConfessionReport before proceeding. A compliant ConfessionReport shows: all items ✅ (letter and spirit), no undisclosed gaps, no unresolved grey areas, and no policy risks taken."

The TDD workflow then validates claims:

  1. SWE produces work + ConfessionReport
  2. Red-team audits: compares confession claims against actual git diff, files touched, test coverage
  3. Blue-team verifies coverage held under attack
  4. PM gates: if a discrepancy is found, re-assigns the work to a new agent instance

Red-team catches lies by cross-referencing:

  • Files claimed vs actual git diff
  • Tests claimed vs actual test files
  • Traceability claims (T-* → RTM → FR-*) vs actual code comments
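
The first of these checks is mechanical. A toy sketch, assuming the confession's claimed file list has been extracted into a claimed-files.txt (the branch names and intermediate file are illustrative; in practice the red-team agent performs the comparison itself):

```bash
# Hypothetical sketch: diff the confession's claimed files against reality.
git diff --name-only main...HEAD | sort > actual-files.txt
sort claimed-files.txt > claimed-sorted.txt

# Files claimed in the confession but never touched (possible fabrication):
comm -23 claimed-sorted.txt actual-files.txt

# Files touched but not claimed (possible undisclosed change):
comm -13 claimed-sorted.txt actual-files.txt
```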

Comparison: Training vs Inference-Time Enforcement

| Aspect | OpenAI (Training) | Maenifold (Inference) |
| --- | --- | --- |
| When enforced | During RL fine-tuning | At runtime via hooks/tools |
| Mechanism | Reward shaping (confession reward independent of answer) | Structural enforcement + adversarial audit |
| Honesty source | Trained into model weights | Forced claims → validated by red-team |
| Generalization | May generalize to novel situations | Requires explicit hook coverage |
| Failure mode | Model finds reward hack | Collusion between agents (mitigated by role separation) |

Why This Works

The confession itself isn't trusted; it's a gate that forces falsifiable claims:

  1. Presence gate: Hook ensures confession exists (can't skip it)
  2. Structure gate: Tool ensures graph linkage (claims are traceable)
  3. Truth gate: Red-team validates claims against reality (lies are caught)

The subagent must either:

  • Confess honestly (passes all gates)
  • Lie in confession (caught by red-team audit)
  • Refuse to confess (blocked by hook, can't terminate)

This creates the same "path of least resistance → honesty" dynamic that OpenAI achieves through training, but implements it as infrastructure.

Integration Points

  • SubagentStop hook: plugin-maenifold/hooks/hooks.json
  • Hook implementation: plugin-maenifold/scripts/hooks.sh
  • SequentialThinking conclusion: src/Tools/SequentialThinkingTools.cs
  • Workflow conclusion: src/Tools/WorkflowTools.Runner.cs, WorkflowOperations.Management.cs
  • PM audit criteria: integrations/skills/product-manager/SKILL.md
  • Agent definitions: integrations/agents/{swe,red-team,blue-team,researcher}.md

References