We screened 20 leading models. Four produced confidence scores that carried no information about whether they were right. Nothing on the outside tells you which four.
THE VALIDITY SCREEN · arXiv:2604.17707
The Task Reliability Index (TRI) grades an AI model at the task you actually use it for, on your data and in the configuration you deploy. Every grade is written so an accountable official can sign it. We don’t build or sell AI, and we have nothing to sell but what we find.
An example score card — the seven TRI indicators.
The index
Seven indicators, graded A to E. Grades expire when the model changes, and anyone can re-run the tests.
SPECIFIC TO YOUR TASK · GRADED, NOT PASS/FAIL · EXPIRES HONESTLY · INDEPENDENT BY DESIGN
What we do
An engagement ends with a graded TRI for your use case. A fifth stage tunes open-weight models that fall short.
01 SCREEN · 02 TEST · 03 GRADE · 04 RE-TEST · 05 TUNE
The evidence
Human-in-the-loop plans rely on the AI’s confidence. In a substantial share of leading models, that confidence means nothing.
Models that passed the screen are more confident when they’re right. Models that failed are more confident when they’re wrong. A signal that points the wrong way is worse than no signal.
r = .18 vs −.20, d = 2.17 · arXiv:2604.17716
Rank the same 20 models by accuracy, then rank them by how well they know their own limits. The two lists come out nearly opposite. The model that tops the benchmark is rarely the one that knows when it’s wrong.
ACCURACY VS SELF-MONITORING RANK · arXiv:2604.15702
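One common way to quantify how far two rankings disagree is Spearman's rank correlation, where +1 means identical order and −1 means exactly reversed. The sketch below is an assumption about how such a comparison could be run, not the method of the cited paper; the five scores per list are hypothetical and the helper assumes no tied scores.

```python
def ranks(scores):
    """Return the rank of each score, 1 = highest. Assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for five models, NOT data from the cited paper.
accuracy = [0.91, 0.88, 0.84, 0.79, 0.75]
self_monitoring = [0.22, 0.10, 0.35, 0.41, 0.58]

rho = spearman_rho(accuracy, self_monitoring)
print(f"rank agreement (Spearman rho): {rho:+.2f}")
```

A value close to −1 is what "nearly opposite" lists look like numerically: the model ranked first on accuracy sits near the bottom on self-monitoring.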
For government
In December 2025, the Digital Transformation Agency made impact assessments mandatory for every in-scope Commonwealth AI use case. A TRI answers the assessment’s evidential sections, one indicator to one section, written for the officers who sign it.
Policy v2.0 takes effect across non-corporate Commonwealth entities.
First new mandatory requirement begins.
Impact assessments required for in-scope use cases, before deployment.
Use cases already in production must be brought into compliance.
Research
Clinical psychology spent a century measuring minds it couldn’t open. We bring those methods to AI to answer one question: does the system know what it knows?
Independence
Six rules make our independence enforceable, from no success fees to a signed independence declaration on every report.
In a technical briefing we run the validity screen live, walk through an evidence pack, and tell you honestly whether your use case needs one.
Request a technical briefing
Adjunct Professor of Emerging Technology at Monash University; a decade leading human-centred research for Australian public-sector agencies.
Registered clinical psychologist (DPsych); author of the research programme behind the lab’s instruments.
Occasional findings, no marketing. Unsubscribe any time.