What the numbers actually mean
Every time we ship a brief, whether it's client work, an internal brief, or a public Decision of the Week, we also generate a control answer in parallel. The control is what a competent vanilla chatbot would say if you asked it the same question with no framework scaffolding, no three-path structure, no pattern library. Then we put both answers side-by-side, strip out the branding, and have an LLM judge them blind on five dimensions.
The win rate is the percentage of blind evaluations in which the Veriq brief beats the vanilla baseline. The delta is how much better it is in aggregate: the score gap summed across the five dimensions, each judged on a 5-point scale. The n is the cumulative count of briefs run through the eval.
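To make those definitions concrete, here is a minimal sketch of the arithmetic with invented scores. The variable names are ours, and reading the delta as the average summed-score margin across the sample is our assumption, not Veriq's published formula.

```python
# Each blind eval yields five 1-5 scores per answer (illustrative data only).
evals = [
    # (veriq_scores, control_scores)
    ([5, 4, 5, 4, 4], [3, 2, 3, 2, 2]),
    ([4, 4, 3, 3, 4], [3, 3, 3, 3, 3]),
    ([3, 3, 3, 2, 3], [4, 3, 3, 3, 3]),  # a loss: the control scores higher
]

veriq_totals = [sum(v) for v, _ in evals]    # summed across the five dimensions
control_totals = [sum(c) for _, c in evals]

n = len(evals)
wins = sum(v > c for v, c in zip(veriq_totals, control_totals))
win_rate = wins / n                          # share of blind evals the Veriq brief wins
delta = sum(v - c for v, c in zip(veriq_totals, control_totals)) / n  # avg aggregate margin

print(f"n={n}  win_rate={win_rate:.1%}  delta={delta:+.1f} points out of 25")
# n=3  win_rate=66.7%  delta=+3.7 points out of 25
```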
The eval runs automatically on every brief. We do not cherry-pick which briefs enter the sample.
The five dimensions
Dimension 1: Specificity
Does the brief engage with the actual details of this decision (the company, the constraint, the buyer, the timeline), or does it produce generic advice that could apply to any company in the category?
Vanilla chatbots produce plausible-sounding generic strategy. Good briefs cite specific constraints and numbers the decision-maker supplied.
Dimension 2: Framework fit
Does the brief invoke a framework that actually illuminates the real tension in the decision, or does it label the situation with a framework name without using it to reveal anything?
Framework theater, where the brief labels the situation ("this is a classic Porter Five Forces moment!") and then does no actual analysis, scores low. Frameworks that surface non-obvious tradeoffs score high.
Dimension 3: Actionability
Could the decision-maker do something specific on Monday morning based on this brief, or does it end with "consider all factors carefully"?
Each path should end with concrete next actions (hire, publish, sunset, test, commit), not abstractions.
Dimension 4: Novel insight
Does the brief surface something the decision-maker probably hadn't already considered? A reframe, a hidden tradeoff, a counter-intuitive path?
If the recommendation is what anyone at the table would have said anyway, the brief didn't earn its existence. This dimension checks whether Veriq adds something the conversation didn't already have.
Dimension 5: Risk precision
Does the brief name the specific risks that could make each path fail, with enough precision that the decision-maker can actually monitor for them?
Generic risks ("execution risk," "market risk") score low. Named, observable risks score high.
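Taken together, the five dimensions form a small rubric. The encoding below is purely illustrative: the short names and wording are our paraphrase of the descriptions above, not Veriq's internal judge prompt.

```python
# Illustrative rubric: five dimensions, each judged on a 1-5 scale (totals out of 25).
RUBRIC = {
    "specificity":    "Does the brief engage with this company's actual constraints, "
                      "buyer, and timeline, or is it generic category advice?",
    "framework_fit":  "Does the invoked framework illuminate the real tension, "
                      "or is it framework theater (a label with no analysis)?",
    "actionability":  "Could the decision-maker do something specific on Monday morning?",
    "novel_insight":  "Does the brief surface a reframe, hidden tradeoff, or path "
                      "the decision-maker probably hadn't already considered?",
    "risk_precision": "Are the risks named and observable enough to monitor for, "
                      "rather than generic 'execution risk'?",
}
SCALE = (1, 5)
```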
How the eval runs
Every brief that completes the pipeline gets a control answer generated in parallel, a blind five-dimension judgment, and one row written to the brief_evals table. Aggregate stats you see on the homepage are computed from this table in real time.
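In storage terms, that is the same arithmetic as the sketch above, expressed as a query over the stored rows. The schema here is an assumption for illustration; the real brief_evals columns aren't published.

```python
import sqlite3

# Assumed two-column schema for illustration; the totals are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brief_evals (veriq_total INTEGER, control_total INTEGER)")
conn.executemany(
    "INSERT INTO brief_evals VALUES (?, ?)",
    [(22, 12), (18, 15), (14, 16)],
)

# The homepage numbers are just this aggregate, recomputed over every row.
n, win_rate, delta = conn.execute(
    """
    SELECT COUNT(*),
           AVG(CASE WHEN veriq_total > control_total THEN 1.0 ELSE 0.0 END),
           AVG(veriq_total - control_total)
    FROM brief_evals
    """
).fetchone()
print(f"n={n}  win_rate={win_rate:.1%}  delta={delta:+.2f}")
```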
What this doesn't claim
Anti-claims
- We do not claim Veriq briefs predict business outcomes. The eval measures brief quality, not forecasting accuracy.
- We do not claim Veriq recommendations are "right." The eval judges whether the brief is better-structured than a vanilla baseline, not whether the path chosen will succeed.
- We do not claim LLM judges are unbiased. Judge models have known preferences (length, confidence, structure). We've audited for these and cross-checked results on a subsample with human judges, but the bias is real and we disclose it.
- We do not cherry-pick briefs to include in the sample. Every brief that completes the pipeline is evaluated.
- We do not claim Veriq "beats consultants" or "replaces strategy advisors." The baseline is a vanilla chatbot, not a human expert.
Honest limitations
Things to know before trusting the number
- Judge bias: LLM-as-judge methodology has documented limitations. Judges favor longer, more structured, more confident responses. Veriq's format is all three. We check results against a human-judged subsample, but the structural advantage is real.
- Sample composition: Briefs skew toward Veriq's ICP (early-stage SaaS, B2B, mid-market strategy). Win rate on decisions outside that ICP may be different.
- The baseline is a moving target: Vanilla chatbot quality is improving. Today's 97.5% win rate is not a permanent number; it's a snapshot. We republish it live, so if baselines catch up, the number will reflect that.
- This measures brief quality, not business value: A better-structured brief doesn't automatically produce a better business outcome. That's what the DOTW Track Record is designed to measure over time. Complementary, not redundant.
Why we publish this at all
Most decision advisory services (consulting firms, strategy coaches, advisory AI tools) make unverifiable claims about their own quality. Our view: if you're going to charge for decision intelligence, you should be willing to be measured against alternatives the client could try for free.
The eval isn't a marketing stunt. It runs every time a brief is produced, on every private client brief, every internal brief, and every public Decision of the Week. The number you see is the live aggregate. It will go up. It will go down. It will be honest.
If you're a sophisticated reader skeptical of LLM-as-judge methodology, that skepticism is warranted. We share it. This is the best available automated quality signal, not a proof.