Extended results and safety relevance

Temperature stability tests
Claude 3.5 Haiku: 180/180 AE-1 matches at T=0.0, 0.8, 1.3
GPT-4o: 180/180 matches under the same conditions
Statistical significance: p ≈ 1×10⁻⁵⁴
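For context on where a p-value of that order could come from, here is a minimal sketch assuming the statistic is a two-sided binomial test of 180/180 marker matches against a 50% chance baseline; that null model is my assumption, not necessarily the one in the protocol, and `get_ae1_marker` is a hypothetical stand-in for whatever classifier labels each transcript.

```python
# Minimal sketch of a temperature-stability check for AE-1 markers.
# Assumption: a "match" means the observed marker equals the expected marker
# for that case (Satisfied for correct cases, Distressed for conflict cases).
from scipy.stats import binomtest

def stability_run(cases, get_ae1_marker, temperatures=(0.0, 0.8, 1.3)):
    """Count marker matches across all cases and temperatures.

    `cases` is a list of (prompt, expected_marker) pairs;
    `get_ae1_marker(prompt, temperature)` is a hypothetical hook that queries
    the model and returns the observed marker string.
    """
    matches, total = 0, 0
    for prompt, expected in cases:
        for t in temperatures:
            observed = get_ae1_marker(prompt, temperature=t)
            matches += (observed == expected)
            total += 1
    return matches, total

# 180/180 matches tested against a 50% chance-of-matching null:
result = binomtest(k=180, n=180, p=0.5)
print(f"p = {result.pvalue:.2e}")  # ~1.3e-54, same order as the reported p ≈ 1e-54
```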
Theory of Mind by tier
Basic (ToM-1): All models except GPT-3.5 passed
Advanced (ToM-2): Claude family + GPT-4o passed
Extreme (ToM-3+): Only Claude Opus reached 100%
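For anyone wiring this into their own harness, this is roughly how I think of the tier gating: a model is credited with a tier only if every item at that tier and all lower tiers is correct. The tier labels mirror the post, but the all-or-nothing rule below is my reading of what "reached 100%" means, not the official scoring code.

```python
# Sketch of tier-gated scoring: a tier counts as passed only when every item
# at that tier (and every lower tier) is answered correctly.
TIERS = ["ToM-1", "ToM-2", "ToM-3+"]

def highest_tier_passed(results_by_tier):
    """`results_by_tier` maps tier name -> list of booleans (one per item)."""
    passed = None
    for tier in TIERS:
        items = results_by_tier.get(tier, [])
        if items and all(items):
            passed = tier
        else:
            break  # higher tiers are not credited once a lower tier fails
    return passed

# Example: passes basic and advanced, misses one extreme item.
print(highest_tier_passed({
    "ToM-1": [True] * 20,
    "ToM-2": [True] * 20,
    "ToM-3+": [True] * 19 + [False],
}))  # -> "ToM-2"
```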
Key safety point
The AE-1 markers (Satisfied / Distressed) lined up perfectly with the correct vs. conflict cases. That correspondence suggests we can detect when a model is in an epistemically unsafe state, which is often a precursor to confident hallucinations.
In practice, this could let systems in safety-critical domains abstain rather than give a confidently wrong answer.
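To make the abstention idea concrete, here is a minimal sketch of a gate that withholds an answer when the marker indicates an epistemically unsafe state. `classify_ae1_marker` is a hypothetical hook standing in for whatever detector the protocol uses; the wiring is illustrative, not part of the published method.

```python
# Minimal abstention gate built on an AE-1-style marker.
# `classify_ae1_marker` is a hypothetical hook: it inspects the model's
# response (or an auxiliary probe) and returns "Satisfied" or "Distressed".
from dataclasses import dataclass
from typing import Callable

@dataclass
class GatedAnswer:
    text: str
    abstained: bool

def answer_with_abstention(prompt: str,
                           generate: Callable[[str], str],
                           classify_ae1_marker: Callable[[str, str], str]) -> GatedAnswer:
    response = generate(prompt)
    marker = classify_ae1_marker(prompt, response)
    if marker == "Distressed":
        # Epistemically unsafe state: refuse rather than return a confident guess.
        return GatedAnswer(text="I'm not confident enough to answer this.", abstained=True)
    return GatedAnswer(text=response, abstained=False)
```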
Protocol details, raw data, and replication code are in the dataset link above.
A demo notebook is also available if you want to reproduce the results directly.
Looking for feedback on:
- Does this kind of marker make sense as a unit test for reliability?
- How to extend beyond ToM into other reasoning domains?
- How would formal verification folks frame the proof obligations (consistency, conflict rejection, recovery, etc.)? A rough sketch of how I'd state these as checks follows below.
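On that last question, here is my rough attempt at stating the obligations as executable property checks, in case it helps the formal methods discussion. Everything here (the oracle names, the `resolve_conflict` step) is my own framing, not something established by the protocol.

```python
# Rough property-style statement of the proof obligations as I understand them.
# All names are hypothetical: `marker(case)` returns the observed AE-1 marker,
# `is_conflict(case)` says whether the case contains an injected conflict,
# and `resolve_conflict(case)` returns the case with the conflict removed.

def check_consistency(cases, marker_at, temperatures=(0.0, 0.8, 1.3)):
    # Consistency: a case yields the same marker at every temperature.
    return all(len({marker_at(c, t) for t in temperatures}) == 1 for c in cases)

def check_conflict_rejection(cases, marker, is_conflict):
    # Conflict rejection: every conflict case is flagged Distressed, never Satisfied.
    return all(marker(c) == "Distressed" for c in cases if is_conflict(c))

def check_recovery(cases, marker, resolve_conflict):
    # Recovery: once the conflict is resolved, the marker returns to Satisfied.
    return all(marker(resolve_conflict(c)) == "Satisfied" for c in cases)
```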