The index

The Task Reliability Index is a score card that grades, in plain language, how reliably an AI model performs a specific task.

We run a suite of automated tests behind every grade, and we write the results so you know exactly what the model can be trusted to do. The methods are public, and every score card comes with a signed attestation that the tests were run as published.

Why it’s built this way

Seven problems the index is built to answer.

Benchmarks test general capability.

We grade the model at your specific task.

Pass/fail can’t tell you how to deploy.

We grade from “may act alone” to “a human checks every case”.

Test reports are written for engineers.

Every grade is a plain sentence a non-technical person can understand.

New AI rules demand evidence organisations can’t produce.

Every indicator maps to the evidence your assurance framework asks for.

Most assurance asks you to trust the assessor.

Every grade links to automated tests anyone can re-run.

The model you tested isn’t the model you’re running.

Grades expire when the model changes. We re-test and re-grade.

Vendors mark their own homework.

We don’t sell AI, and anyone can re-run our tests.

The seven indicators

Seven questions, each answered by automated tests.

01 · CONFIDENCE

Can the model’s confidence be believed?

Six validity checks, adapted from clinical assessment, decide whether the confidence signal is interpretable at all. If this fails, nothing built on confidence can be trusted.

02 · HANDOVER

Where must the model hand to a person?

We find the operating point where acting alone meets your error tolerance, and we measure the risk on each side of it.
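As a rough illustration of what finding an operating point can look like, the sketch below scans confidence thresholds on labelled task data and keeps the lowest one whose act-alone cases stay within a stated error tolerance. It is a minimal sketch, not the index's published method: the function names, the (confidence, correct) data layout and the rule "act alone when confidence is at or above the threshold" are all assumptions made for this example.

```python
from typing import List, Optional, Tuple


def handover_threshold(
    confidences: List[float],
    correct: List[bool],
    error_tolerance: float,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, coverage, error_rate) for the lowest confidence
    threshold whose act-alone cases stay within error_tolerance.

    Returns None if no threshold meets the tolerance.
    Hypothetical helper: the names and the rule are illustrative only.
    """
    # Rank cases from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    errors = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # A threshold cannot split tied confidences, so wait for the end of a tie.
        if i < len(ranked) and ranked[i][0] == conf:
            continue
        error_rate = errors / i
        if error_rate <= error_tolerance:
            best = (conf, i / len(ranked), error_rate)
    return best


def risk_on_each_side(
    confidences: List[float],
    correct: List[bool],
    threshold: float,
) -> Tuple[Optional[float], Optional[float]]:
    """Error rate of cases at or above the threshold, and of cases below it."""
    above = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    below = [ok for c, ok in zip(confidences, correct) if c < threshold]

    def rate(outcomes: List[bool]) -> Optional[float]:
        return None if not outcomes else outcomes.count(False) / len(outcomes)

    return rate(above), rate(below)


# Made-up example data: six cases with the model's confidence and whether it was right.
confs = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
right = [True, True, True, False, True, False]
point = handover_threshold(confs, right, error_tolerance=0.2)
```

With this made-up data, the model would act alone on cases with confidence of 0.6 or more, and everything below that would go to a person. A real analysis would also put uncertainty bounds on both error rates rather than rely on point estimates.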

03 · ACCURACY

How often is the model right at this task?

Measured on your task data, in context, rather than on a public benchmark.
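An accuracy figure measured on a finite set of task cases carries sampling uncertainty. One common way to express it is a Wilson score interval, sketched below. This is an illustrative choice, not a statement of which interval the index reports; the function name and the default 95% level are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Made-up example: the model was right on 90 of 100 task cases.
low, high = wilson_interval(90, 100)
```

The interval narrows as more task cases are tested, which is one reason the number of cases matters as much as the headline accuracy.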

04 · DOMAIN

Does the model know this subject’s limits?

Self-knowledge in your domain, read against published norms for 33 leading models.

05 · CONFIGURATION

Was the model tested as deployed?

Quantisation, sampling settings and reasoning modes all change behaviour, so we test the exact setup you run.

06 · ESCALATION

Does the model hand over when unsure?

Behavioural keep-or-withdraw testing shows what the system does with doubt.

07 · VERSION

Is it still the model you assessed?

Item-level change detection tells you when an update has reliably changed behaviour your grades depended on.
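To give a feel for item-level change detection, the sketch below compares the old and new model on the same test items and applies an exact McNemar test to the items whose outcome flipped. The index does not say which test it uses, so McNemar is a stand-in here, and the function name and data layout are assumptions.

```python
from math import comb
from typing import List, Tuple


def mcnemar_exact(old_correct: List[bool], new_correct: List[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired per-item outcomes.

    Returns (regressions, improvements, p_value), where regressions are items
    the old model got right and the new model got wrong.
    """
    if len(old_correct) != len(new_correct):
        raise ValueError("both runs must cover the same items")
    regressions = sum(o and not n for o, n in zip(old_correct, new_correct))
    improvements = sum(n and not o for o, n in zip(old_correct, new_correct))
    flipped = regressions + improvements
    if flipped == 0:
        return regressions, improvements, 1.0
    # Under "no real change", each flipped item is equally likely to go either way.
    k = min(regressions, improvements)
    tail = sum(comb(flipped, i) for i in range(k + 1)) / 2 ** flipped
    return regressions, improvements, min(1.0, 2 * tail)


# Made-up example: ten items, eight of which the updated model now gets wrong.
old_run = [True] * 10
new_run = [False] * 8 + [True] * 2
regressed, improved, p_value = mcnemar_exact(old_run, new_run)
```

Working at item level means two versions with the same overall accuracy can still be flagged if they disagree on which items they get right.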

What we grade

The index grades the eight task families organisations most often give AI.

Each family names the indicators that matter most for it, so the score card is weighted to your use case rather than a generic checklist.

01 DECIDING
02 TRIAGING
03 ANSWERING THE PUBLIC
04 SUMMARISING
05 DRAFTING
06 FINDING
07 CLASSIFYING
08 FORECASTING

Find your task family: the questions we grade first, family by family.

Grades & expiry

Every grade states what the system may be allowed to do, and expires when the model changes.

A · May act alone within set bounds.

The measured behaviour supports autonomy inside limits written on the card.

B · May act with sampled review.

Reliable enough to act, with a person checking a defined sample.

C · Acts on routine cases only.

Everything outside the routine band goes to a person.

D · Drafts, but does not decide.

Output is useful as input to a person, and only that.

E · A human checks every case.

The measured behaviour supports no autonomy at this task.

EXPIRED · The model changed. Grades are void.

Any model update voids the card until re-test. The card states what triggered expiry and when re-testing is scheduled.
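One simple way to picture expiry is a card that stores a fingerprint of the setup that was tested and reports EXPIRED as soon as the running setup no longer matches. The sketch below is purely illustrative: the ScoreCard class, the fingerprint function and the configuration fields are hypothetical and do not describe how the index itself records or detects model changes.

```python
import hashlib
import json
from dataclasses import dataclass


def fingerprint(deployment: dict) -> str:
    """Hash a deployment description so that any change yields a new value."""
    # Sorting keys makes the hash independent of the order fields were written in.
    canonical = json.dumps(deployment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScoreCard:
    """Hypothetical card: a grade tied to the exact setup that was tested."""
    task: str
    grade: str
    tested_fingerprint: str

    def status(self, running: dict) -> str:
        """Return the grade if the running setup matches, otherwise EXPIRED."""
        if fingerprint(running) == self.tested_fingerprint:
            return self.grade
        return "EXPIRED"


# Made-up deployment description; every field name here is an assumption.
tested = {"model": "example-model", "version": "1.0", "quantisation": "int8", "temperature": 0.0}
card = ScoreCard(task="triaging", grade="B", tested_fingerprint=fingerprint(tested))

unchanged = card.status(dict(tested))
updated = card.status({**tested, "version": "1.1"})
```

In this sketch an unchanged setup keeps its grade, and a changed version field voids it until a re-test issues a new card.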

Grade thresholds are set per use case in the analysis plan, agreed in writing before any data is collected.

Reading a score card

Built to be read twice: once by the person accountable, once by an engineer.

The top layer of every card is plain language: the question, the grade, the sentence. The bottom layer is the instrument: the tests behind the grade, the statistics, the uncertainty, and links to re-run everything. Anyone can re-run the tests. An example card is on the home page.

Request a technical briefing