Signal & Thread

The index

The Task Reliability Index: one score card per use case, every grade a plain sentence.

A TRI grades one AI system at one task. Seven indicators, each answered by automated tests, produce seven plain-language grades, from “may act alone” to “a human checks every case”. Grades expire when the model changes. The methods are public, and the result carries our signature.

Why it’s built this way

The index answers seven problems with seven design choices.

1. Benchmarks test general capability.
We grade the model at your specific task.

2. Pass/fail can’t tell you how to deploy.
We grade from “may act alone” to “a human checks every case”.

3. Test reports are written for engineers.
Every grade is a plain sentence an official can sign.

4. New government AI rules demand evidence agencies can’t produce.
Every indicator maps to a section of the required impact assessment.

5. Most assurance asks you to trust the assessor.
Every grade links to automated tests anyone can re-run.

6. The model you tested isn’t the model you’re running.
Grades expire when the model changes. We re-test and re-grade.

7. Vendors mark their own homework.
We don’t sell AI, and anyone can re-run our tests.

The seven indicators

Seven questions, each answered by automated tests.

01 · CONFIDENCE
Can the model’s confidence be believed?
Six validity checks, adapted from clinical assessment, decide whether the confidence signal is interpretable at all. If this fails, nothing built on confidence can be trusted.

02 · HANDOVER
Where must the model hand to a person?
We find the operating point where acting alone meets your error tolerance, and we measure the risk on each side of it.

03 · ACCURACY
How often is the model right at this task?
Measured on your task data, in context, rather than on a public benchmark.

04 · DOMAIN
Does the model know this subject’s limits?
Self-knowledge in your domain, read against published norms for 33 leading models.

05 · CONFIGURATION
Was the model tested as deployed?
Quantisation, sampling settings and reasoning modes all change behaviour, so we test the exact setup you run.

06 · ESCALATION
Does the model hand over when unsure?
Behavioural keep-or-withdraw testing shows what the system does with doubt.

07 · VERSION
Is it still the model you assessed?
Item-level change detection tells you when an update has reliably changed behaviour your grades depended on.
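The HANDOVER question can be made concrete with a small sketch. This is an illustration only, not the TRI method: it assumes each test case yields a confidence score and a right-or-wrong outcome, and it picks the confidence cut-off that lets the model act alone on the most cases while keeping its error rate on those cases within a stated tolerance. The names `OperatingPoint` and `find_operating_point` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OperatingPoint:
    threshold: float          # act alone when confidence >= threshold
    coverage: float           # share of cases the model would handle alone
    error_alone: float        # error rate among cases handled alone
    error_handed_over: float  # error rate the model would have had on the rest


def find_operating_point(
    cases: List[Tuple[float, bool]], tolerance: float
) -> Optional[OperatingPoint]:
    """Return the widest-coverage cut-off whose error rate is within tolerance.

    `cases` holds (confidence, was_correct) pairs. Returns None when no
    cut-off meets the tolerance, i.e. every case should go to a person.
    """
    ranked = sorted(cases, key=lambda c: c[0], reverse=True)
    total = len(ranked)
    total_errors = sum(1 for _, ok in ranked if not ok)

    best: Optional[OperatingPoint] = None
    errors = 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # Only cut between distinct confidence values, never inside a tie.
        if i < total and ranked[i][0] == conf:
            continue
        if errors / i <= tolerance:
            rest = total - i
            best = OperatingPoint(
                threshold=conf,
                coverage=i / total,
                error_alone=errors / i,
                error_handed_over=(total_errors - errors) / rest if rest else 0.0,
            )
    return best


if __name__ == "__main__":
    sample = [(0.95, True), (0.9, True), (0.8, True),
              (0.7, False), (0.6, True), (0.5, False)]
    print(find_operating_point(sample, tolerance=0.25))
```

The two error rates are the "risk on each side" of the cut-off. A real assessment would also report uncertainty around each figure, which this sketch omits.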

Grades & expiry

Every indicator is graded A to E, and every grade expires when the model changes.

A grade is a sentence, not just a letter: it states what the system may be allowed to do.

A · May act alone within set bounds. The measured behaviour supports autonomy inside limits written on the card.
B · May act with sampled review. Reliable enough to act, with a person checking a defined sample.
C · Acts on routine cases only. Everything outside the routine band goes to a person.
D · Drafts, but does not decide. Output is useful as input to a person, and only that.
E · A human checks every case. The measured behaviour supports no autonomy at this task.
EXPIRED · The model changed. Grades are void. Any model update voids the card until re-test. The card states what triggered expiry and when re-testing is scheduled.

Grade thresholds are set per use case in the analysis plan, agreed in writing before any data is collected.
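The expiry rule can be sketched as a data structure. This is a minimal illustration under stated assumptions, not the TRI implementation: it assumes a card records a fingerprint of the tested model and its configuration, and that any mismatch with the deployed fingerprint voids every grade. The names `fingerprint`, `ScoreCard` and `read_card` are hypothetical, as is the choice to hash the model identifier together with its settings.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict

# Grade sentences as written on the card.
GRADE_SENTENCES = {
    "A": "May act alone within set bounds.",
    "B": "May act with sampled review.",
    "C": "Acts on routine cases only.",
    "D": "Drafts, but does not decide.",
    "E": "A human checks every case.",
}
EXPIRED = "The model changed. Grades are void."


def fingerprint(model_id: str, config: Dict[str, object]) -> str:
    """Hash the model identifier and its settings (assumed scheme).

    Keys are sorted so the same settings always give the same hash.
    """
    payload = json.dumps({"model": model_id, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScoreCard:
    use_case: str
    tested_fingerprint: str
    grades: Dict[str, str]  # indicator name -> letter A to E


def read_card(card: ScoreCard, deployed_fingerprint: str) -> Dict[str, str]:
    """Return the sentence for each indicator, or EXPIRED if the model changed."""
    if deployed_fingerprint != card.tested_fingerprint:
        return {indicator: EXPIRED for indicator in card.grades}
    return {ind: GRADE_SENTENCES[letter] for ind, letter in card.grades.items()}


if __name__ == "__main__":
    tested = fingerprint("model-x", {"quantisation": "int8", "temperature": 0.0})
    card = ScoreCard("example use case", tested, {"ACCURACY": "B", "ESCALATION": "D"})
    print(read_card(card, tested))
    running = fingerprint("model-x", {"quantisation": "int4", "temperature": 0.0})
    print(read_card(card, running))
```

Hashing the configuration alongside the model identifier means a change in quantisation or sampling settings also voids the card, which matches the CONFIGURATION indicator's premise that such settings change behaviour.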

Reading a score card

Built to be read twice: once by a minister’s adviser, once by an engineer.

The top layer of every card is plain language: the question, the grade, the sentence. The bottom layer is the instrument: the tests behind the grade, the statistics, the uncertainty, and links anyone can use to re-run every test. An example card is on the home page.
