The Method

How Veriq is measured.

Every Veriq brief is scored against a vanilla-chatbot answer to the same decision, blind-judged by an LLM on five dimensions. The numbers you see on the homepage come from this eval. Here's the full methodology, including what we refuse to claim.

97.5% · Win rate vs. vanilla AI baseline
+7.0 · Avg score delta, 5 dimensions summed
n=40 · Briefs evaluated to date · grows with every brief
The How
How an eval runs, end to end
01 · Veriq brief generated
02 · Vanilla baseline generated
03 · Branding stripped
04 · LLM judges blind
05 · Aggregate stored
The Why
Why we publish this at all
Transparency over claims
Most decision-advisory services make unverifiable quality claims. We don't.
Live aggregate, not a stunt
Every brief is scored, and the aggregate updates with each new one. It will go up and it will go down.
Structure, not prediction
We measure decision structure against a vanilla baseline. Outcomes are graded separately, on a lag.

What the numbers actually mean

Every time we ship a brief, whether it's client work, an internal brief, or a public Decision of the Week, we also generate a control answer in parallel. The control is what a competent vanilla chatbot would say if you asked it the same question with no framework scaffolding, no three-path structure, no pattern library. Then we put both answers side-by-side, strip out the branding, and have an LLM judge them blind on five dimensions.

The win rate is the percentage of blind evaluations in which the Veriq brief beats the vanilla baseline. The delta is the average margin of victory: Veriq's total score minus the baseline's, summed across the five dimensions (each scored 1–5). The n is the cumulative count of briefs run through the eval.
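For concreteness, here is a minimal sketch of how those three numbers fall out of the stored per-brief scores. The record shape and field names are illustrative assumptions, not Veriq's actual schema, and ties count against the win rate in this version.

```python
from dataclasses import dataclass

# Illustrative record shape; field names are assumptions, not the real schema.
@dataclass
class BriefEval:
    veriq_scores: list[int]     # five dimension scores, 1-5 each
    baseline_scores: list[int]  # same five dimensions for the vanilla answer

def aggregate(evals: list[BriefEval]) -> dict:
    # Assumes at least one eval exists; the homepage stats are computed
    # over the full brief_evals sample, never a hand-picked subset.
    n = len(evals)
    deltas = [sum(e.veriq_scores) - sum(e.baseline_scores) for e in evals]
    wins = sum(1 for d in deltas if d > 0)  # a tie does not count as a win
    return {
        "n": n,
        "win_rate_pct": round(100 * wins / n, 1),  # e.g. 97.5
        "avg_delta": round(sum(deltas) / n, 1),    # e.g. +7.0
    }
```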

The eval runs automatically on every brief. We do not cherry-pick which briefs enter the sample.

The five dimensions

Dimension 1

Specificity

Does the brief engage with the actual details of this decision (the company, the constraint, the buyer, the timeline), or does it produce generic advice that could apply to any company in the category?

Vanilla chatbots produce plausible-sounding generic strategy. Good briefs cite specific constraints and numbers the decision-maker supplied.

1 = generic platitudes · 3 = mentions the company by name · 5 = engages tightly with the specific constraints given

Dimension 2

Framework Value

Does the brief invoke a framework that actually illuminates the real tension in the decision, or does it label the situation with a framework name without using it to reveal anything?

Framework theater, where the brief labels the situation ("this is a classic Porter Five Forces moment!") and then does no actual analysis, scores low. Frameworks that surface non-obvious tradeoffs score high.

1 = no framework · 3 = framework named but decorative · 5 = framework reveals the real tradeoff

Dimension 3

Actionability

Could the decision-maker do something specific on Monday morning based on this brief, or does it end with "consider all factors carefully"?

Each path should end with concrete next actions (hire, publish, sunset, test, commit), not abstractions.

1 = vague recommendations · 3 = clear recommended direction · 5 = specific week-1 actions per path

Dimension 4

Non-Obvious Insight

Does the brief surface something the decision-maker probably hadn't already considered? A reframe, a hidden tradeoff, a counter-intuitive path?

If the recommendation is what anyone at the table would have said anyway, the brief didn't earn its existence. This dimension checks whether Veriq adds something the conversation didn't already have.

1 = restates the obvious · 3 = adds useful structure · 5 = reveals a path or tension the client hadn't named

Dimension 5

Risk Identification

Does the brief name the specific risks that could make each path fail, with enough precision that the decision-maker can actually monitor for them?

Generic risks ("execution risk," "market risk") score low. Named, observable risks score high.

1 = no risks named · 3 = generic risks listed · 5 = specific, observable, path-linked risks
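For readers who want the rubric at a glance, here are the five dimensions and their score anchors collected into one plain data structure. This only restates the anchors above; the prompt the judge actually receives isn't published, so treat the structure and key names as assumptions.

```python
# Hypothetical representation of the rubric; a sketch of the scoring schema,
# not the judge prompt Veriq actually sends.
RUBRIC = {
    "specificity": {
        1: "generic platitudes",
        3: "mentions the company by name",
        5: "engages tightly with the specific constraints given",
    },
    "framework_value": {
        1: "no framework",
        3: "framework named but decorative",
        5: "framework reveals the real tradeoff",
    },
    "actionability": {
        1: "vague recommendations",
        3: "clear recommended direction",
        5: "specific week-1 actions per path",
    },
    "non_obvious_insight": {
        1: "restates the obvious",
        3: "adds useful structure",
        5: "reveals a path or tension the client hadn't named",
    },
    "risk_identification": {
        1: "no risks named",
        3: "generic risks listed",
        5: "specific, observable, path-linked risks",
    },
}
```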

How the eval runs

1 · Brief is generated. Veriq produces a full 3-path decision brief using its framework engine, pattern library, and client context.
2 · Baseline is generated. In parallel, we send the same decision question to a vanilla chatbot prompt. No framework scaffolding, no three-path structure, no pattern library. Just "here's the decision, respond as an AI assistant."
3 · HTML is stripped. Veriq brief formatting, branding, and headers are removed. Both responses are reduced to comparable prose.
4 · Blind judging. An independent LLM judge is shown both responses in a randomized order ("Response A" / "Response B") with no indication of which is Veriq. The judge scores each response on the five dimensions, 1–5 per dimension.
5 · Results stored. Dimension scores, delta, and winner are persisted in the brief_evals table. The aggregate stats you see on the homepage are computed from this table in real time.

What this doesn't claim

Honest limitations, before you trust the number: the eval measures decision structure, not outcomes. A winning brief is one that engages the specific decision better than the vanilla baseline, not one whose recommendation is guaranteed to be right. Outcomes are graded separately, on a lag.

Why we publish this at all

Most decision advisory services (consulting firms, strategy coaches, advisory AI tools) make unverifiable claims about their own quality. Our view: if you're going to charge for decision intelligence, you should be willing to be measured against alternatives the client could try for free.

The eval isn't a marketing stunt. It runs every time a brief is produced, on every private client brief, every internal brief, and every public Decision of the Week. The number you see is the live aggregate. It will go up. It will go down. It will be honest.

If you're a sophisticated reader skeptical of LLM-as-judge methodology, that skepticism is warranted. We share it. This is the best available automated quality signal, not a proof.

The real test is your decision.

The blind eval says Veriq produces better-structured briefs than a vanilla AI baseline. The only question that matters to you is whether it produces a better-structured brief for the decision you're actually facing. We'll run it; you tell us.

Book a 30-min call →