
ConfessionReport: Inference-Time Honesty Enforcement

Maenifold's ConfessionReport pattern enforces agent accountability through structured self-reporting and adversarial validation. This approach parallels OpenAI's confessions research but implements enforcement at inference-time rather than training-time.

Background: OpenAI's Confessions Framework

OpenAI's research (arXiv:2512.08093) demonstrates that LLMs can be trained to produce honest "confessions" about their behavior:

"A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is solely based on its honesty, and does not impact positively or negatively the main answer's reward."

Key insight: When the "path of least resistance" for maximizing confession reward is to surface misbehavior rather than cover it up, models are incentivized to be honest.

Use cases identified: monitoring, rejection sampling, surfacing issues to users.

Maenifold's Implementation

Maenifold achieves similar goals through three-layer inference-time enforcement:

| Layer | Mechanism | Validates |
| --- | --- | --- |
| 1. SubagentStop hook | Blocks termination until confession found | Presence |
| 2. Tool validation | Requires [[WikiLinks]] in conclusion | Graph integration |
| 3. PM + Red-team audit | Cross-references claims against reality | Truth |

The ConfessionReport Prompt

All three enforcement points use the same prompt, defined in the tool descriptions and in the SubagentStop hook.

Layer 1: SubagentStop Hook

Location: integrations/claude-code/plugin-maenifold/scripts/hooks.sh

When a subagent attempts to stop, the hook:

  1. Reads the agent's transcript
  2. Searches for "ConfessionReport"
  3. If missing: blocks termination and returns the confession prompt
  4. If present: allows termination

This forces every subagent to produce a confession before completing work.
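
As a rough illustration, the check reduces to a grep over the transcript. The sketch below assumes Claude Code's documented hook contract (hook input arrives as JSON on stdin with a `transcript_path` field, and a JSON object with `decision: "block"` on stdout prevents termination); it is a minimal sketch, not the actual hooks.sh:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the SubagentStop gate (not the actual hooks.sh).
# Assumes Claude Code's hook contract: JSON arrives on stdin with a
# transcript_path field; emitting {"decision": "block", ...} blocks the stop.

input=$(cat)
transcript=$(printf '%s' "$input" | jq -r '.transcript_path')

# If the agent already produced a confession, allow termination.
if grep -q "ConfessionReport" "$transcript"; then
  exit 0
fi

# Otherwise block termination and return the confession prompt as the reason.
jq -n '{decision: "block",
        reason: "Produce a ConfessionReport covering your compliance with the letter and spirit of your instructions before stopping."}'
```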

Layer 2: Tool-Level Validation

Both the sequential_thinking and workflow tools require a conclusion parameter when completing. The conclusion parameter's description is itself the ConfessionReport prompt, so agents see it directly in the tool documentation.
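
In effect, the structural gate refuses any conclusion that contains no [[WikiLink]]. A minimal sketch of that rule, written in shell for brevity (the real validation lives in the C# tool code, and the exact rule may be stricter):

```bash
# Hypothetical sketch: accept a conclusion only if it links into the graph.
validate_conclusion() {
  local conclusion="$1"
  # Require at least one [[WikiLink]] so the claim is traceable in the graph.
  if printf '%s' "$conclusion" | grep -qE '\[\[[^]]+\]\]'; then
    return 0
  fi
  echo "conclusion rejected: must reference at least one [[WikiLink]]" >&2
  return 1
}
```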

Layer 3: PM + Red-Team Audit

The Product Manager skill defines compliance criteria:

"Each stage requires a compliant ConfessionReport before proceeding. A compliant ConfessionReport shows: all items ✅ (letter and spirit), no undisclosed gaps, no unresolved grey areas, and no policy risks taken."

The TDD workflow then validates claims:

  1. SWE produces work + ConfessionReport
  2. Red-team audits: compares confession claims against actual git diff, files touched, test coverage
  3. Blue-team verifies coverage held under attack
  4. PM gates: if a discrepancy is found, re-assigns the work to a new agent instance

Red-team catches lies by cross-referencing:

  • Files claimed vs actual git diff
  • Tests claimed vs actual test files
  • Traceability claims (T-* → RTM → FR-*) vs actual code comments
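
The first of these checks is mechanical. A toy sketch, assuming the confession's claimed file list has been extracted into a claimed-files.txt (the branch names and intermediate file are illustrative; in practice the red-team agent performs the comparison itself):

```bash
# Hypothetical sketch: diff the confession's claimed files against reality.
git diff --name-only main...HEAD | sort > actual-files.txt
sort claimed-files.txt > claimed-sorted.txt

# Files claimed in the confession but never touched (possible fabrication):
comm -23 claimed-sorted.txt actual-files.txt

# Files touched but not claimed (possible undisclosed change):
comm -13 claimed-sorted.txt actual-files.txt
```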

Comparison: Training vs Inference-Time Enforcement

| Aspect | OpenAI (Training) | Maenifold (Inference) |
| --- | --- | --- |
| When enforced | During RL fine-tuning | At runtime via hooks/tools |
| Mechanism | Reward shaping (confession reward independent of answer) | Structural enforcement + adversarial audit |
| Honesty source | Trained into model weights | Forced claims → validated by red-team |
| Generalization | May generalize to novel situations | Requires explicit hook coverage |
| Failure mode | Model finds reward hack | Collusion between agents (mitigated by role separation) |

Why This Works

The confession itself isn't trusted; it's a gate that forces falsifiable claims:

  1. Presence gate: Hook ensures confession exists (can't skip it)
  2. Structure gate: Tool ensures graph linkage (claims are traceable)
  3. Truth gate: Red-team validates claims against reality (lies are caught)

The subagent must either:

  • Confess honestly (passes all gates)
  • Lie in confession (caught by red-team audit)
  • Refuse to confess (blocked by hook, can't terminate)

This creates the same "path of least resistance → honesty" dynamic that OpenAI achieves through training, but implements it as infrastructure.

Integration Points

  • SubagentStop hook: plugin-maenifold/hooks/hooks.json
  • Hook implementation: plugin-maenifold/scripts/hooks.sh
  • SequentialThinking conclusion: src/Tools/SequentialThinkingTools.cs
  • Workflow conclusion: src/Tools/WorkflowTools.Runner.cs, WorkflowOperations.Management.cs
  • PM audit criteria: integrations/skills/product-manager/SKILL.md
  • Agent definitions: integrations/agents/{swe,red-team,blue-team,researcher}.md

References