What the numbers actually mean
Every time we ship a brief, whether it's client work, an internal brief, or a public Decision of the Week, we also generate a control answer in parallel. The control is what a competent vanilla chatbot would say if you asked it the same question with no framework scaffolding, no three-path structure, no pattern library. Then we put both answers side-by-side, strip out the branding, and have an LLM judge them blind on five dimensions.
The win rate is the percentage of blind evaluations in which the Veriq brief beats the vanilla baseline. The delta is how much better it is in aggregate: the score gap summed across the five dimensions, each judged on a 5-point scale. The n is the cumulative count of briefs run through the eval.
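To make those definitions concrete, here is a minimal sketch of the arithmetic with invented scores. The variable names are ours, and reading the delta as the average summed-score margin across the sample is our assumption, not Veriq's published formula.

```python
# Each blind eval yields five 1-5 scores per answer (illustrative data only).
evals = [
    # (veriq_scores, control_scores)
    ([5, 4, 5, 4, 4], [3, 2, 3, 2, 2]),
    ([4, 4, 3, 3, 4], [3, 3, 3, 3, 3]),
    ([3, 3, 3, 2, 3], [4, 3, 3, 3, 3]),  # a loss: the control scores higher
]

veriq_totals = [sum(v) for v, _ in evals]    # summed across the five dimensions
control_totals = [sum(c) for _, c in evals]

n = len(evals)
wins = sum(v > c for v, c in zip(veriq_totals, control_totals))
win_rate = wins / n                          # share of blind evals the Veriq brief wins
delta = sum(v - c for v, c in zip(veriq_totals, control_totals)) / n  # avg aggregate margin

print(f"n={n}  win_rate={win_rate:.1%}  delta={delta:+.1f} points out of 25")
# n=3  win_rate=66.7%  delta=+3.7 points out of 25
```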
The eval runs automatically on every brief. We do not cherry-pick which briefs enter the sample.
The five dimensions
Dimension 1: Specificity
Does the brief engage with the actual details of this decision (the company, the constraint, the buyer, the timeline), or does it produce generic advice that could apply to any company in the category?
Vanilla chatbots produce plausible-sounding generic strategy. Good briefs cite specific constraints and numbers the decision-maker supplied.
Dimension 2: Framework fit
Does the brief invoke a framework that actually illuminates the real tension in the decision, or does it label the situation with a framework name without using it to reveal anything?
Framework theater, where the brief labels the situation ("this is a classic Porter Five Forces moment!") and then does no actual analysis, scores low. Frameworks that surface non-obvious tradeoffs score high.
Dimension 3: Actionability
Could the decision-maker do something specific on Monday morning based on this brief, or does it end with "consider all factors carefully"?
Each path should end with concrete next actions (hire, publish, sunset, test, commit), not abstractions.
Dimension 4: Novel insight
Does the brief surface something the decision-maker probably hadn't already considered? A reframe, a hidden tradeoff, a counter-intuitive path?
If the recommendation is what anyone at the table would have said anyway, the brief didn't earn its existence. This dimension checks whether Veriq adds something the conversation didn't already have.
Dimension 5: Risk precision
Does the brief name the specific risks that could make each path fail, with enough precision that the decision-maker can actually monitor for them?
Generic risks ("execution risk," "market risk") score low. Named, observable risks score high.
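Taken together, the five dimensions form a small rubric. The encoding below is purely illustrative: the short names and wording are our paraphrase of the descriptions above, not Veriq's internal judge prompt.

```python
# Illustrative rubric: five dimensions, each judged on a 1-5 scale (totals out of 25).
RUBRIC = {
    "specificity":    "Does the brief engage with this company's actual constraints, "
                      "buyer, and timeline, or is it generic category advice?",
    "framework_fit":  "Does the invoked framework illuminate the real tension, "
                      "or is it framework theater (a label with no analysis)?",
    "actionability":  "Could the decision-maker do something specific on Monday morning?",
    "novel_insight":  "Does the brief surface a reframe, hidden tradeoff, or path "
                      "the decision-maker probably hadn't already considered?",
    "risk_precision": "Are the risks named and observable enough to monitor for, "
                      "rather than generic 'execution risk'?",
}
SCALE = (1, 5)
```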
How the eval runs
Every brief that completes the pipeline gets a control answer generated in parallel, a blind five-dimension judgment, and one row written to the brief_evals table. Aggregate stats you see on the homepage are computed from this table in real time.
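In storage terms, that is the same arithmetic as the sketch above, expressed as a query over the stored rows. The schema here is an assumption for illustration; the real brief_evals columns aren't published.

```python
import sqlite3

# Assumed two-column schema for illustration; the totals are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brief_evals (veriq_total INTEGER, control_total INTEGER)")
conn.executemany(
    "INSERT INTO brief_evals VALUES (?, ?)",
    [(22, 12), (18, 15), (14, 16)],
)

# The homepage numbers are just this aggregate, recomputed over every row.
n, win_rate, delta = conn.execute(
    """
    SELECT COUNT(*),
           AVG(CASE WHEN veriq_total > control_total THEN 1.0 ELSE 0.0 END),
           AVG(veriq_total - control_total)
    FROM brief_evals
    """
).fetchone()
print(f"n={n}  win_rate={win_rate:.1%}  delta={delta:+.2f}")
```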
What this doesn't claim
Anti-claims
- We do not claim Veriq briefs predict business outcomes. The eval measures brief quality, not forecasting accuracy.
- We do not claim Veriq recommendations are "right." The eval judges whether the brief is better-structured than a vanilla baseline, not whether the path chosen will succeed.
- We do not claim LLM judges are unbiased. Judge models have known preferences (length, confidence, structure). We've audited for these and cross-checked results on a subsample with human judges, but the bias is real and we disclose it.
- We do not cherry-pick briefs to include in the sample. Every brief that completes the pipeline is evaluated.
- We do not claim Veriq "beats consultants" or "replaces strategy advisors." The baseline is a vanilla chatbot, not a human expert.
Honest limitations
Things to know before trusting the number
- Judge bias: LLM-as-judge methodology has documented limitations. Judges favor longer, more structured, more confident responses. Veriq's format is all three. We check results against a human-judged subsample, but the structural advantage is real.
- Sample composition: Briefs skew toward Veriq's ICP (early-stage SaaS, B2B, mid-market strategy). Win rate on decisions outside that ICP may be different.
- The baseline is a moving target: Vanilla chatbot quality is improving. Today's 97.5% win rate is not a permanent number; it's a snapshot. We republish it live, so if baselines catch up, the number will reflect that.
- This measures brief quality, not business value: A better-structured brief doesn't automatically produce a better business outcome. That's what the DOTW Track Record is designed to measure over time. Complementary, not redundant.
Why we publish this at all
Most decision advisory services (consulting firms, strategy coaches, advisory AI tools) make unverifiable claims about their own quality. Our view: if you're going to charge for decision intelligence, you should be willing to be measured against alternatives the client could try for free.
The eval isn't a marketing stunt. It runs every time a brief is produced, on every private client brief, every internal brief, and every public Decision of the Week. The number you see is the live aggregate. It will go up. It will go down. It will be honest.
If you're a sophisticated reader skeptical of LLM-as-judge methodology, that skepticism is warranted. We share it. This is the best available automated quality signal, not a proof.