What we do

We screen the model, test it at your task, grade what we find, and keep grading it as the model changes.

An engagement ends with a graded TRI for your use case; a fifth stage tunes open-weight models that fall short.

01 Screen
We screen the model first. Six checks establish whether its confidence can be believed at all. Four of the twenty leading models we screened failed. A failed screen reshapes everything downstream, so it comes first, and you can commission it on its own.

02 Test
We test the system at your task and find the point where it should stop and hand to a person. We measure what happens on both sides of that point, and we put a number on the risk that remains. The battery runs in your environment, on your data.

03 Grade
We grade the score card and write the evidence pack. The grades are plain sentences, the findings are mapped to your AI impact assessment, and the figures state their uncertainty. Your accountable official signs on top of our name.

04 Re-test
We re-test each quarter and after every model update, because a TRI expires when the system changes. If an update moves a grade your sign-off depended on, you hear it from us first. Quarterly statements are written to drop into your reporting cycle.

05 Tune
When an open-weight model falls short of the grade your use case needs, we fine-tune it for your task and grade it again. The re-grade uses the same tests, and the tuning is declared on the score card. Our published method restores usable confidence in open-weight models at practical sizes.
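As an illustration of the hand-off point described in the Test stage, here is a minimal sketch, not the lab's actual instrument. It assumes each test item comes with a model confidence score and a right/wrong mark, and it looks for the lowest confidence threshold at which the error rate among answered items stays within a target. The names `HandoffPoint`, `find_handoff_point`, and `max_risk` are hypothetical, and the sketch reports a point estimate only, without the uncertainty interval a real score card would state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffPoint:
    threshold: float      # system answers only when confidence >= threshold
    coverage: float       # share of items the system answers itself
    residual_risk: float  # error rate among the items it answers


def find_handoff_point(confidences, correct, max_risk):
    """Return the widest-coverage threshold whose answered-item error
    rate is at most max_risk, or None if no threshold qualifies.

    Hypothetical helper: inputs are parallel sequences of confidence
    scores and booleans marking whether each answer was right.
    """
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    best = None
    errors = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Items with equal confidence fall on the same side of any
        # threshold, so only evaluate after the last item of a tie.
        if i < len(pairs) and pairs[i][0] == conf:
            continue
        risk = errors / i
        if risk <= max_risk:
            best = HandoffPoint(threshold=conf,
                                coverage=i / len(pairs),
                                residual_risk=risk)
    return best


if __name__ == "__main__":
    scores = [0.95, 0.90, 0.80, 0.60, 0.40]
    marks = [True, True, True, False, False]
    print(find_handoff_point(scores, marks, max_risk=0.10))
```

Items below the returned threshold are the ones that would go to a person; the residual risk is what remains on the items the system keeps.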

For vendors

We verify vendor claims at the tasks your customers will actually give them.

CLAIMS
You nominate the claims. We design the tests. You do not see the items before administration.

CONDITIONS
Tested as your customers will run it. Local task data, the deployable configuration, and the same instruments their evaluators see.

STATEMENT
A verification statement, not a seal. What was claimed, what was tested, what was found, and the limits of scope, written to attach to a tender.

FINDINGS
Findings are findings. Reported whether favourable or not. You may keep a statement confidential. You may not edit one.

How engagements run

Every engagement begins with a written analysis plan.

The plan fixes the use case, the task data, the deployed configuration, the grade thresholds, and what would count as failure. It is agreed before any data is collected, so the result cannot be moved after the fact. Signal & Thread is an early-stage lab: we won’t show you invented case studies, and we’ll tell you exactly what our instruments can and can’t yet support.
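One way to picture a plan that cannot be moved after the fact is a frozen record with a fingerprint taken at agreement time. The sketch below is an assumption for illustration, not a description of Signal & Thread's tooling: the `AnalysisPlan` class, its field names, the `fingerprint` helper, and the example values are all hypothetical, chosen to mirror the five items the plan fixes.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned once agreed
class AnalysisPlan:
    use_case: str
    task_data: str
    deployed_configuration: str
    grade_thresholds: tuple   # (grade label, minimum score) pairs
    failure_criteria: tuple   # plain-language statements of failure


def fingerprint(plan):
    """Hash a canonical JSON form of the plan, so any later change to
    the agreed terms produces a different digest."""
    canonical = json.dumps(asdict(plan), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder values only; real plans would carry the agreed terms.
    plan = AnalysisPlan(
        use_case="example use case",
        task_data="example task data set",
        deployed_configuration="example configuration",
        grade_thresholds=(("pass", 0.9),),
        failure_criteria=("example failure condition",),
    )
    print(fingerprint(plan))
```

Recording the digest alongside the signed plan, before data collection starts, lets either party later check that the thresholds and failure criteria being applied are the ones originally agreed.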

Request a technical briefing

