Independently test whether your AI is reliable at the task you’ve given it.

The Task Reliability Index (TRI) tells you what an AI model can be trusted to do at your specific task, and gives you the graded evidence to deploy it with confidence, or the grounds to say no.

Task Reliability Index · How a TRI is produced →

An example score card: the seven TRI indicators.

The index

The Task Reliability Index grades an AI model at one task, in sentences a non-technical person can understand.

We grade seven indicators of reliability, each from A to E, so one card shows you where the model can act alone and where it can’t. Grades expire when the model changes, and anyone can re-run the tests.

SPECIFIC TO YOUR TASK · GRADED, NOT PASS/FAIL · EXPIRES HONESTLY · INDEPENDENT BY DESIGN

The full index · All seven indicators →

What we do

We screen the model, test it at your task, grade what we find, and keep grading it as the model changes.

01 SCREEN · 02 TEST · 03 GRADE · 04 RE-TEST · 05 TUNE

Each stage, in full · For vendors →

The evidence

Benchmarks measure how often a model is right; we measure whether it knows when it’s wrong.

Human-in-the-loop plans rely on the AI’s confidence. In a fifth of the leading models we screened, that confidence means nothing.

4 of 20

We screened 20 leading models. Four produced confidence scores that carried no information about whether they were right. Nothing on the outside tells you which four.

THE VALIDITY SCREEN · arXiv:2604.17707

backwards

Models that passed the screen are more confident when they’re right. Models that failed are more confident when they’re wrong. A signal that points the wrong way is worse than no signal.

r = .18 vs −.20, d = 2.17 · arXiv:2604.17716
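
As a rough illustration of what it means for confidence to carry information, the sketch below correlates a model's stated confidence with whether each answer was right. It is not the validity screen itself, whose procedure is described in the cited paper. The function name `confidence_validity`, the toy data, and the choice to return 0.0 when the correlation is undefined are all assumptions made for this example.

```python
from math import sqrt


def confidence_validity(confidences, correct):
    """Pearson correlation between stated confidence and 0/1 correctness.

    Positive: the model is more confident when it is right.
    Negative: the model is more confident when it is wrong.
    Near zero: confidence says nothing about correctness.
    (Hypothetical helper for illustration only.)
    """
    n = len(confidences)
    if n != len(correct) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    outcomes = [1.0 if c else 0.0 for c in correct]
    mean_conf = sum(confidences) / n
    mean_out = sum(outcomes) / n

    cov = sum((x - mean_conf) * (y - mean_out)
              for x, y in zip(confidences, outcomes))
    var_conf = sum((x - mean_conf) ** 2 for x in confidences)
    var_out = sum((y - mean_out) ** 2 for y in outcomes)

    # Assumed convention: if confidence (or correctness) never varies,
    # the signal carries no information, so report 0.0.
    if var_conf == 0 or var_out == 0:
        return 0.0
    return cov / sqrt(var_conf * var_out)


# Toy data, not real measurements.
right_way = confidence_validity([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
backwards = confidence_validity([0.2, 0.3, 0.8, 0.9], [True, True, False, False])
flat = confidence_validity([0.5, 0.5, 0.5, 0.5], [True, False, True, False])
```

Here `right_way` is positive, `backwards` is negative, and `flat` is zero. A human reviewer who only checks low-confidence answers is helped by the first model, gets nothing from the third, and is actively misdirected by the second.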

opposite

Rank the same 20 models by accuracy, then rank them by how well they know their own limits. The two lists come out nearly opposite. The model that tops the benchmark is rarely the one that knows when it’s wrong.

ACCURACY VS SELF-MONITORING RANK · arXiv:2604.15702
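
One way to make "the two lists come out nearly opposite" concrete is a rank correlation between the two orderings. The sketch below computes Spearman's rank correlation on made-up scores. The helper names `ranks` and `spearman` and the example numbers are assumptions for illustration, and the cited paper may use a different statistic.

```python
from math import sqrt


def ranks(values):
    """Rank values from 1 (highest) downward, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up scores for four hypothetical models, not real results.
accuracy = [0.9, 0.8, 0.7, 0.6]
self_monitoring = [0.1, 0.2, 0.3, 0.4]
rho = spearman(accuracy, self_monitoring)
```

A value of `rho` near +1 means the two rankings agree, near 0 means they are unrelated, and near −1 means they are reversed. In this toy case the orderings are exact mirror images, so `rho` is −1.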

The research behind these findings→

What we grade

The index grades eight task families, covering most of what organisations give AI to do.

01 DECIDING · 02 TRIAGING · 03 ANSWERING THE PUBLIC · 04 SUMMARISING · 05 DRAFTING · 06 FINDING · 07 CLASSIFYING · 08 FORECASTING

Find your use case, and the question we grade first→

For government

We produce the evidence the Commonwealth’s AI rules now ask of every agency.

In December 2025, the Digital Transformation Agency made impact assessments mandatory for every in-scope Commonwealth AI use case. A TRI maps onto the assessment’s evidential sections one to one, with one indicator per section, so the assessment you must sign is backed by evidence.

15 Dec 2025
Policy v2.0 takes effect across non-corporate Commonwealth entities.
15 Jun 2026
First new mandatory requirement begins.
15 Dec 2026
Impact assessments required for in-scope use cases, before deployment.
30 Apr 2027
Use cases already in production must be brought into compliance.

For government, in full · How a TRI fills in the impact assessment →

Research

The science behind every grade is published, with code and data anyone can run.

Clinical psychology spent a century measuring minds it couldn’t open. We bring those methods to AI to answer one question: does the system know what it knows?

All publications (24) · The four instruments →

Independence

We don’t build or sell AI systems. We measure them.

Six rules keep our findings beyond influence, from no success fees to a signed declaration on every report, so the evidence you rely on cannot be bought.

The charter · Why we’re called Signal & Thread →

Watch the screen run on a live model.

In a technical briefing we run the validity screen live, walk through an evidence pack, and tell you honestly whether your use case needs one.

Request a technical briefing

CO-FOUNDER · ENGAGEMENTS & EVIDENCE

Dr Chris Marmo

Adjunct Professor of Emerging Technology at Monash University; a decade leading human-centred research for Australian public-sector agencies.

Full biography→

CO-FOUNDER · INSTRUMENTS & ANALYSIS

Dr Jon-Paul Cacioli

Registered clinical psychologist (DPsych); author of the research programme behind the lab’s instruments.