Benchmarks test general capability.
The index
The Task Reliability Index: one score card per use case, every grade a plain sentence.
A TRI grades one AI system at one task. Automated tests answer seven questions, and each answer is a plain-language grade on a scale from “may act alone” to “a human checks every case”. Grades expire when the model changes. The methods are public, and the result carries our signature.
Why it’s built this way
The index answers seven problems with seven design choices.
Pass/fail can’t tell you how to deploy.
We grade from “may act alone” to “a human checks every case”.
Test reports are written for engineers.
Every grade is a plain sentence an official can sign.
New government AI rules demand evidence agencies can’t produce.
Every indicator maps to a section of the required impact assessment.
Most assurance asks you to trust the assessor.
Every grade links to automated tests anyone can re-run.
The model you tested isn’t the model you’re running.
Grades expire when the model changes. We re-test and re-grade.
Vendors mark their own homework.
We don’t sell AI, and anyone can re-run our tests.
The seven indicators
Seven questions, each answered by automated tests.
Grades & expiry
Every indicator is graded A to E, and every grade expires when the model changes.
A grade is a sentence, not just a letter: it states what the system is allowed to do.
Grade thresholds are set per use case in the analysis plan, agreed in writing before any data is collected.
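As a rough illustration of how per-use-case thresholds and expiry could fit together, here is a minimal Python sketch. It is not the index's actual implementation. The threshold values, the use of a pass rate as the measured quantity, the indicator name, and the names `grade_for`, `Grade` and `is_valid_for` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds for one use case: the minimum pass rate for each grade.
# Real thresholds would come from the analysis plan agreed in writing.
EXAMPLE_THRESHOLDS = {"A": 0.99, "B": 0.95, "C": 0.90, "D": 0.80}

GRADE_ORDER = ["A", "B", "C", "D", "E"]


def grade_for(pass_rate: float, thresholds: dict[str, float]) -> str:
    """Return the best grade whose threshold the pass rate meets, else E."""
    for letter in GRADE_ORDER[:-1]:
        if pass_rate >= thresholds[letter]:
            return letter
    return "E"


@dataclass(frozen=True)
class Grade:
    indicator: str  # hypothetical indicator name
    letter: str     # "A" to "E"
    model_id: str   # identifier of the model version that was tested

    def is_valid_for(self, running_model_id: str) -> bool:
        """A grade holds only for the exact model version that was tested."""
        return self.model_id == running_model_id


grade = Grade("example-indicator", grade_for(0.96, EXAMPLE_THRESHOLDS), "model-v1")
print(grade.letter)                    # B
print(grade.is_valid_for("model-v1"))  # True
print(grade.is_valid_for("model-v2"))  # False: the grade has expired
```

Tying validity to an exact model identifier, rather than to a date, mirrors the rule that a grade expires when the model changes.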
Reading a score card
Built to be read twice: once by a minister’s adviser, once by an engineer.
The top layer of every card is plain language: the question, the grade, the sentence. The bottom layer is the instrument: the tests behind the grade, the statistics, the uncertainty, and links that let anyone re-run every test. An example card is on the home page.
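One way to picture the two layers is as a small data structure. This is a sketch only. The class and field names, the example question, the test names and the numbers are hypothetical, and pairing grade E with “a human checks every case” is an assumption based on the scale described earlier.

```python
from dataclasses import dataclass, field


@dataclass
class InstrumentLayer:
    """Bottom layer: what an engineer reads."""
    tests: list[str]               # names of the automated tests behind the grade
    estimate: float                # point estimate of the measured statistic
    interval: tuple[float, float]  # uncertainty interval around the estimate
    rerun_links: list[str] = field(default_factory=list)  # where to re-run the tests


@dataclass
class IndicatorCard:
    """One indicator on a score card, with both layers."""
    question: str
    letter: str
    sentence: str
    instrument: InstrumentLayer

    def plain_layer(self) -> str:
        """Top layer: the question, the grade, the sentence."""
        return f"{self.question} Grade {self.letter}: {self.sentence}"


card = IndicatorCard(
    question="Example question?",  # placeholder, not a real indicator
    letter="E",
    sentence="A human checks every case.",
    instrument=InstrumentLayer(
        tests=["example-test-1", "example-test-2"],  # placeholder test names
        estimate=0.72,                               # made-up number
        interval=(0.68, 0.76),                       # made-up interval
        rerun_links=["placeholder-link"],
    ),
)
print(card.plain_layer())
# Example question? Grade E: A human checks every case.
```

Keeping the instrument as a separate object means the plain-language layer can be shown on its own without losing the evidence behind it.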