Signal & Thread

Independently test whether your AI is reliable at the task you’ve given it.

The Task Reliability Index (TRI) grades an AI model at the task you actually use it for, on your data and in the configuration you deploy. Every grade is written so an accountable official can sign it. We don’t build or sell AI, and we have nothing to sell but what we find.

An example score card — the seven TRI indicators.

The index

The Task Reliability Index grades an AI model at one task, in sentences an official can sign.

Seven indicators, graded A to E. Grades expire when the model changes, and anyone can re-run the tests.

SPECIFIC TO YOUR TASK · GRADED, NOT PASS/FAIL · EXPIRES HONESTLY · INDEPENDENT BY DESIGN

What we do

We screen the model, test it at your task, grade what we find, and keep grading it as the model changes.

An engagement ends with a graded TRI for your use case. A fifth stage tunes open-weight models that fall short.

01 SCREEN · 02 TEST · 03 GRADE · 04 RE-TEST · 05 TUNE

The evidence

Benchmarks measure how often a model is right; we measure whether it knows when it’s wrong.

Human-in-the-loop plans rely on the AI’s confidence. In one in five of the leading models we screened, that confidence means nothing.

4 of 20

We screened 20 leading models. Four produced confidence scores that carried no information about whether they were right. Nothing on the outside tells you which four.

THE VALIDITY SCREEN · arXiv:2604.17707

backwards

Models that passed the screen are more confident when they’re right. Models that failed are more confident when they’re wrong. A signal that points the wrong way is worse than no signal.

r = .18 vs −.20, d = 2.17 · arXiv:2604.17716
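
What it means for confidence to "carry information" can be sketched in a few lines. The Python below is an illustration only, not the published validity screen. It assumes you have a confidence score and a right/wrong label for each test item, and it computes their Pearson correlation (point-biserial, because correctness is 0 or 1). The function name `confidence_correctness_r` and every number in it are made up for the example.

```python
from statistics import mean, pstdev


def confidence_correctness_r(confidences, correct):
    """Pearson correlation between confidence and 0/1 correctness.

    Positive: the model is more confident when it is right.
    Near zero: confidence carries no information.
    Negative: the model is more confident when it is wrong.
    """
    if len(confidences) != len(correct) or len(confidences) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mc, mk = mean(confidences), mean(correct)
    sc, sk = pstdev(confidences), pstdev(correct)
    if sc == 0 or sk == 0:
        return 0.0  # no variation on one side, so nothing to correlate
    cov = mean([(c - mc) * (k - mk) for c, k in zip(confidences, correct)])
    return cov / (sc * sk)


# Toy data, invented for illustration (1 = right, 0 = wrong).
correct = [1, 1, 1, 0, 0, 1]
informative = [0.90, 0.80, 0.95, 0.40, 0.30, 0.85]  # high when right
backwards = [0.30, 0.40, 0.20, 0.90, 0.95, 0.35]    # high when wrong

r_informative = confidence_correctness_r(informative, correct)
r_backwards = confidence_correctness_r(backwards, correct)
print(f"informative: r = {r_informative:+.2f}")
print(f"backwards:   r = {r_backwards:+.2f}")
```

A reviewer who trusts high-confidence answers is helped by the first model and actively misled by the second, which is why a negative sign is worse than a zero.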

opposite

Rank the same 20 models by accuracy, then rank them by how well they know their own limits. The two lists come out nearly opposite. The model that tops the benchmark is rarely the one that knows when it’s wrong.

ACCURACY VS SELF-MONITORING RANK · arXiv:2604.15702
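
One way to put a number on "nearly opposite" is a rank correlation. The sketch below is a hedged illustration, not the analysis from the paper: it assumes one accuracy score and one self-monitoring score per model, with no ties, and computes Spearman's rho from the rank differences. The helper names and all scores are hypothetical.

```python
def ranks_descending(scores):
    """Rank 1 goes to the highest score. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(a, b):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    ra, rb = ranks_descending(a), ranks_descending(b)
    d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented scores for five hypothetical models, in the same model order.
accuracy = [0.91, 0.88, 0.84, 0.80, 0.75]
self_monitoring = [0.05, 0.10, 0.12, 0.20, 0.18]

rho = spearman_rho(accuracy, self_monitoring)
print(f"rho = {rho:+.2f}")  # +1 means identical rankings, -1 means fully reversed
```

A rho close to −1 means the leaderboard order is almost the reverse of the self-monitoring order, so picking the top benchmark scorer tells you little about which model knows its limits.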

For government

We produce the evidence the Commonwealth’s AI rules now ask of every agency.

In December 2025, the Digital Transformation Agency made impact assessments mandatory for every in-scope Commonwealth AI use case. A TRI answers the assessment’s evidential sections, one indicator to one section, written for the officers who sign it.

  1. 15 Dec 2025

    Policy v2.0 takes effect across non-corporate Commonwealth entities.

  2. 15 Jun 2026

    First new mandatory requirement begins.

  3. 15 Dec 2026

    Impact assessments required for in-scope use cases, before deployment.

  4. 30 Apr 2027

    Use cases already in production must be brought into compliance.

Research

The science behind every grade is published, with code and data anyone can run.

Clinical psychology spent a century measuring minds it couldn’t open. We bring those methods to AI to answer one question: does the system know what it knows?

Independence

We don’t build or sell AI systems. We measure them.

Six rules make that enforceable, from no success fees to a signed independence declaration on every report.

Watch the screen run on a live model.

In a technical briefing we run the validity screen live, walk through an evidence pack, and tell you honestly whether your use case needs one.

Request a technical briefing

CO-FOUNDER · ENGAGEMENTS & EVIDENCE

Dr Chris Marmo

Adjunct Professor of Emerging Technology at Monash University; a decade leading human-centred research for Australian public-sector agencies.

CO-FOUNDER · INSTRUMENTS & ANALYSIS

Dr Jon-Paul Cacioli

Registered clinical psychologist (DPsych); author of the research programme behind the lab’s instruments.

Lab notes, by email.

Occasional findings, no marketing. Unsubscribe any time.