Signal & Thread

What we do

We screen the model, test it at your task, grade what we find, and keep grading it as the model changes.

An engagement runs in four stages and ends with a graded TRI for your use case. When an open-weight model falls short, there is a fifth stage: we tune it.

01 Screen: We screen the model first. Six checks establish whether its confidence can be believed at all. Four of the twenty leading models we screened failed. A failed screen reshapes everything downstream, so it comes first and stands alone as a fixed-scope product.
02 Test: We test the system at your task and find the point where it should stop and hand to a person. We measure what happens on both sides of that point, and we put a number on the risk that remains. The battery runs in your environment, on your data.
03 Grade: We grade the score card and write the evidence pack. The grades are plain sentences, the findings are mapped to your AI impact assessment, and the figures state their uncertainty. Your accountable official signs on top of our name.
04 Re-test: We re-test each quarter and after every model update, because a TRI expires when the system changes. If an update moves a grade your sign-off depended on, you hear it from us first. Quarterly statements are written to drop into your register cycle.
05 Tune: When an open-weight model falls short of the grade your use case needs, we fine-tune it for your task and grade it again. The re-grade uses the same tests, and the tuning is declared on the score card. Our published method restores usable confidence in open-weight models at practical sizes.

For vendors

We verify vendor claims at the tasks agencies will actually give them.

An independent verification statement answers procurement’s harder questions before they are asked, and travels with your bid.

CLAIMS: You nominate the claims. We design the tests. You do not see the items before administration.
CONDITIONS: Tested as government will run it. Australian task data, deployable configuration, the same instruments agencies see.
STATEMENT: A verification statement, not a seal. What was claimed, what was tested, what was found, and the limits of scope, written to attach to a tender.
FINDINGS: Findings are findings. Reported whether favourable or not. You may keep a statement confidential. You may not edit one.

How engagements run

Every engagement begins with a written analysis plan.

The plan fixes the use case, the task data, the deployed configuration, the grade thresholds, and what would count as failure. It is agreed before any data is collected, so the result cannot be moved after the fact. Signal & Thread is an early-stage lab: we won’t show you invented case studies, and we’ll tell you exactly what our instruments can and can’t yet support.

