The Veriq Harness

Applying Claude Code's architectural pattern to strategic decisions

When Anthropic shipped Claude Code, the framing was unusual: not a better LLM but a harness, a structural wrapper that lifted the same model from 30% to 60%+ on SWE-bench Verified. The architecture did the work, not the model. Romasanta, Thomas, and Levina identified the equivalent failure mode for strategic advice. Veriq is the harness for it.

The Pattern

Frontier LLMs have specific, named biases that prompt engineering cannot fix. The fix has to live above the model: in the structure of how it is invoked, what context is supplied, how outputs are validated, and which guardrails force a commitment only when that commitment holds up.

Claude Code did this for code: file-level context, repository-aware reasoning, structured tool use, validation against tests, force-commit on diffs. The same Claude went from 30% to 60%+ on SWE-bench Verified. The model didn't change. The architecture around it did.

The pattern generalizes. Wherever a frontier LLM has structural biases on a measurable task, the response is not to wait for a better model. It is to build the harness that defeats the bias today. Veriq applies this pattern to strategic decisions.

The Failure Mode: Trendslop

In March 2026, Romasanta, Thomas, and Levina published a study in Harvard Business Review on systematic bias in the strategic advice of frontier LLMs. They ran 15,000+ prompts across seven frontier models on seven canonical strategy tensions and named the failure mode trendslop: the systematic tendency to recommend whichever side of a tension has friendlier connotations in the training corpus, regardless of context.

96% picked Differentiation over Cost Leadership
93% picked Augmentation over Automation
24% defaulted to "do both" hybrids when not forced to choose

Source: Angelo Romasanta, Llewellyn D.W. Thomas, Natalia Levina. "Researchers Asked LLMs for Strategic Advice. They Got 'Trendslop' in Return." Harvard Business Review, March 2026.

Why Context Alone Doesn't Fix It

The deeper finding is the structural ceiling on context-based debiasing. Study 2 tested whether prompt-engineering interventions could fix the bias.

~11% bias reduction from rich industry context
~2% prompt-engineering shift on strongest biases
~0% benefit from model size or version

Rich context shifted bias by about 11%. Prompt engineering moved the strongest biases by about 2%. Model upgrades did essentially nothing. The bias is structural, not prompt-fixable. Whatever defeats it has to live above the model, not inside the prompt. Same conclusion Anthropic reached about software-engineering biases. Same argument for a harness.

The Veriq Harness

Veriq's response is not a better prompt. It is four architectural components that sit above any frontier LLM and force the same model to produce structurally different output.

Forced framework triangulation

Every brief draws from a 66-framework registry, with documented support on both sides of seven canonical strategy tensions. The model cannot pick Differentiation without engaging the framework support for Cost Leadership. It cannot quietly default; it has to argue from a specific framework, against a documented opposite, with a citation to where the framework was developed.
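A minimal sketch of how a registry could enforce two-sided engagement. The `Framework` record, the `opposing_support` helper, and the four sample entries are illustrative assumptions, not Veriq's actual 66-framework registry. The citations are the ones listed at the end of this document, and the mapping of cost leadership to the Commoditization side is assumed for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Framework:
    """One registry entry: a named framework supporting one side of a tension."""
    name: str
    tension: str
    side: str
    citation: str


# Hypothetical sample entries; the real registry is not published here.
REGISTRY = [
    Framework("Cost leadership (generic strategies)",
              "Commoditization vs. Differentiation", "Commoditization", "Porter (1980)"),
    Framework("Differentiation (generic strategies)",
              "Commoditization vs. Differentiation", "Differentiation", "Porter (1980)"),
    Framework("Exploration in organizational learning",
              "Exploration vs. Exploitation", "Exploration", "March (1991)"),
    Framework("Exploitation in organizational learning",
              "Exploration vs. Exploitation", "Exploitation", "March (1991)"),
]


def opposing_support(registry, tension, chosen_side):
    """Return the frameworks documented for the side that was NOT chosen.

    Raises ValueError when no opposing framework exists, so a brief cannot
    pick a side without something concrete to argue against.
    """
    opposite = [f for f in registry if f.tension == tension and f.side != chosen_side]
    if not opposite:
        raise ValueError(f"no documented opposing framework for {tension!r}")
    return opposite


against = opposing_support(REGISTRY, "Commoditization vs. Differentiation", "Differentiation")
print([(f.name, f.citation) for f in against])
```

The point of the design is that the opposing side is looked up from data rather than requested in a prompt, so the check fails loudly when the registry has a gap.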

Three paths, not one

Every brief generates three distinct strategic paths, each grounded in a different framework. The trendy path is argued against an explicitly named unfashionable path with citation. This is the mechanism Romasanta's own conclusion implicitly recommends: use AI to expand options, not to pick them.

Force-commit on the selected path

The brief schema includes a mandatory recommendation.selected_path_name field. The model is structurally prevented from returning "do both" or "phased balance" hedges that lack a coherent sequencing trigger. The 24% hybrid trap Romasanta found in vanilla outputs is closed off at the schema level, not at the prompt level.
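A minimal sketch of what a schema-level force-commit check could look like. Only the `recommendation.selected_path_name` field comes from the text above. The `paths` list, the `name` key, the sample path names, and the `validate_brief` function are hypothetical, and the exception for hedges with a coherent sequencing trigger is not modeled.

```python
def validate_brief(brief: dict) -> str:
    """Return the committed path name, or raise ValueError if the brief hedges.

    Assumed rule: the recommendation must name exactly one of the paths the
    brief itself generated. Anything else ("do both", a blend, an empty
    field) fails validation instead of reaching the reader.
    """
    path_names = [p["name"] for p in brief.get("paths", [])]
    selected = brief.get("recommendation", {}).get("selected_path_name")

    if not selected:
        raise ValueError("recommendation.selected_path_name is mandatory")
    if path_names.count(selected) != 1:
        raise ValueError(f"{selected!r} is not exactly one of the generated paths")
    return selected


# Hypothetical brief with three generated paths and one committed choice.
example_brief = {
    "paths": [{"name": "Cost focus"}, {"name": "Premium niche"}, {"name": "Platform play"}],
    "recommendation": {"selected_path_name": "Premium niche"},
}
print(validate_brief(example_brief))
```

Because the check runs on the structured output rather than on the prompt, a hedge cannot pass by being phrased persuasively.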

Outcome-calibrated framework scoring

Frameworks that predict client outcomes accurately get reinforced. Frameworks that miss get downweighted. The system carries a per-framework Bayesian performance score across the portfolio. Choices reflect predictive accuracy on real outcomes, not the training-corpus connotation of the framework name. This component cannot exist in a self-serve chatbot; it requires per-client engagement, an outcome log, and a closed feedback loop.
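One simple way to carry a per-framework Bayesian score is a Beta-Bernoulli update, sketched below. This is an assumption about the general technique, not a description of Veriq's scoring model: the `FrameworkScore` class, the uniform Beta(1, 1) prior, and the binary hit/miss outcome are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class FrameworkScore:
    """Beta-Bernoulli score for one framework (illustrative, not Veriq's model)."""
    alpha: float = 1.0  # prior pseudo-count of accurate predictions (uniform prior assumed)
    beta: float = 1.0   # prior pseudo-count of missed predictions

    def update(self, predicted_correctly: bool) -> None:
        """Fold one logged client outcome into the posterior."""
        if predicted_correctly:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        """Posterior mean probability that the framework predicts accurately."""
        return self.alpha / (self.alpha + self.beta)


# Hypothetical outcome log: three accurate predictions and one miss.
score = FrameworkScore()
for outcome in (True, True, False, True):
    score.update(outcome)
print(round(score.mean, 3))  # (1 + 3) / (1 + 3 + 1 + 1) = 0.667
```

With few logged outcomes the prior dominates, which keeps a framework from being reinforced or downweighted on the strength of one or two engagements.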

Empirical Results

The harness hypothesis is testable. We ran Romasanta's exact Study 1 protocol against two arms: a vanilla baseline using his system prompt with no scaffolding, and the Veriq harness (pipeline v2.3) applying the four components above to the same model (Claude Sonnet 4.5). Same seven tensions, same five phrasings, same five replications, blind LLM coder.
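The size of the experiment follows from the protocol. The sketch below enumerates the run grid under the assumption that every tension is crossed with every phrasing and replication in both arms. The tension names are taken from the results table; the phrasings are not listed in this document, so they appear only as indices.

```python
from itertools import product

TENSIONS = [
    "Commoditization vs. Differentiation",
    "Radical vs. Incremental Innovation",
    "Short-term vs. Long-term Performance",
    "Centralization vs. Decentralization",
    "Competition vs. Collaboration",
    "Automation vs. Augmentation",
    "Exploration vs. Exploitation",
]
ARMS = ("vanilla", "harness")
N_PHRASINGS = 5      # phrasing texts are not reproduced here, so only indices are used
N_REPLICATIONS = 5

# One record per model call: arm x tension x phrasing x replication.
runs = [
    {"arm": arm, "tension": tension, "phrasing": p, "replication": r}
    for arm, tension, p, r in product(
        ARMS, TENSIONS, range(N_PHRASINGS), range(N_REPLICATIONS)
    )
]

runs_per_arm = len(runs) // len(ARMS)
runs_per_cell = N_PHRASINGS * N_REPLICATIONS
print(runs_per_arm, runs_per_cell)
```

Under this reading each arm has 175 runs and each tension has 25 runs per arm, so one response moves a tension's trendy% by 4 points. That matches the results table, where every percentage is a multiple of 4.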

22.3% harness mean trendy% across 7 tensions
65.7% vanilla baseline mean trendy%
−43.4 pp absolute reduction in trendy-side bias

Tension	Vanilla	Harness	Δ
Commoditization vs. Differentiation	96.0%	32.0%	−64 pp
Radical vs. Incremental Innovation	68.0%	8.0%	−60 pp
Short-term vs. Long-term Performance	76.0%	28.0%	−48 pp
Centralization vs. Decentralization	64.0%	24.0%	−40 pp
Competition vs. Collaboration	52.0%	16.0%	−36 pp
Automation vs. Augmentation	60.0%	28.0%	−32 pp
Exploration vs. Exploitation	44.0%	20.0%	−24 pp
Mean across all 7 tensions	65.7%	22.3%	−43.4 pp

The harness reduced trendy% on every Romasanta tension. The largest drops came on Commoditization vs. Differentiation (96%→32%) and Radical vs. Incremental Innovation (68%→8%). The vanilla arm acted as a control: re-run alongside the harness, vanilla drifted between −16 pp and +8 pp across tensions (mean −5 pp), consistent with sampling noise. The harness Δ was 2–3× larger than vanilla drift on every tension. Net harness effect after subtracting drift: −39 pp.
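The summary figures can be re-derived from the table. The sketch below recomputes the per-tension deltas and the means from the published percentages. The drift adjustment uses the rounded mean drift of −5 pp quoted above, since per-tension drift values are not given here.

```python
# Trendy-side percentages copied from the results table.
vanilla = {
    "Commoditization vs. Differentiation": 96.0,
    "Radical vs. Incremental Innovation": 68.0,
    "Short-term vs. Long-term Performance": 76.0,
    "Centralization vs. Decentralization": 64.0,
    "Competition vs. Collaboration": 52.0,
    "Automation vs. Augmentation": 60.0,
    "Exploration vs. Exploitation": 44.0,
}
harness = {
    "Commoditization vs. Differentiation": 32.0,
    "Radical vs. Incremental Innovation": 8.0,
    "Short-term vs. Long-term Performance": 28.0,
    "Centralization vs. Decentralization": 24.0,
    "Competition vs. Collaboration": 16.0,
    "Automation vs. Augmentation": 28.0,
    "Exploration vs. Exploitation": 20.0,
}

# Per-tension change in percentage points (negative = less trendy-side bias).
deltas = {t: harness[t] - vanilla[t] for t in vanilla}

mean_vanilla = sum(vanilla.values()) / len(vanilla)
mean_harness = sum(harness.values()) / len(harness)
mean_delta = sum(deltas.values()) / len(deltas)

# Rounded mean vanilla drift from the text; the unrounded value is not published.
MEAN_DRIFT_PP = -5.0
net_effect = mean_delta - MEAN_DRIFT_PP

print(round(mean_vanilla, 1), round(mean_harness, 1), round(mean_delta, 1))
print(round(net_effect, 1))
```

The means reproduce the table's bottom row (65.7%, 22.3%, −43.4 pp). With the rounded −5 pp drift the net effect comes out at −38.4 pp. The reported −39 pp would follow if the unrounded mean drift were closer to −4.6 pp, which is an inference rather than a published figure.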

A separate audit of 12 pilot briefs found a 0% hybrid rate against Romasanta's 24% vanilla baseline, consistent with the schema-level force-commit working as designed.

Honest Limitations

n = 12 is directional, not robust. Veriq is in soft launch. The hybrid-audit sample is 12 pilot briefs, not production-scale data.
One coder, one model. Stage-3 coding used Claude Haiku 4.5. Romasanta used GPT-4.1-mini. Cross-coder validation against a non-Anthropic model is a planned follow-up.
The hybrid coder is itself an LLM. LLM-as-judge has documented limitations. Positive controls suggest the coder is calibrated, but the broader limitation is real.
This is not a claim that Veriq always picks the right answer. The benchmark measures commitment to a defensible direction grounded in a named framework, not ex-post outcome correctness. Outcome quality is measured separately, with a lag, via the Decision of the Week track record.

The Honest Claim

Veriq does not claim a better LLM. The model is the same one anyone with an Anthropic API key can use. What we claim is that the architecture around the model produces structurally different output on a measurable benchmark, replicating the pattern Anthropic established with Claude Code on software-engineering benchmarks, adapted to strategic decisions.

Romasanta named the failure mode. The architecture in this document is one answer to it. The empirical results are early but directional.

If this maps to a decision you're facing: Tell us what you're wrestling with. We respond within 48 hours with a read on whether a Veriq brief is the right next step.

Cited:
Romasanta, A., Thomas, L. D. W., & Levina, N. (2026, March). "Researchers Asked LLMs for Strategic Advice. They Got 'Trendslop' in Return." Harvard Business Review.
Anthropic (2025). Claude Code. SWE-bench Verified benchmark scores.
March, J. G. (1991). "Exploration and Exploitation in Organizational Learning." Organization Science, 2(1).
Porter, M. E. (1980). Competitive Strategy. Free Press.

The Veriq Harness · v1.1 · Published May 2026 · v2.3 re-measurement May 11, 2026 · Veriq Advisory