Restricted AI Models and Opaque Benchmarks Threaten the Emerging AI Insurance Market
A Gallagher Re report warns that Anthropic’s restricted release of its Mythos large language model raises a fundamental underwriting question: how can insurers assess and price AI risk when the most advanced models are unavailable for independent scrutiny?
The release of Mythos under a restricted access arrangement establishes a fourth category of frontier AI model alongside open-source, open-weight and fully proprietary approaches, one accessible only to a select group of vetted partners. The implications extend to coverage availability, premium pricing and the long-term insurability of AI-related risks.
Restricted Access Revives the Dynamics Behind the Last Cyber Hard Market
Anthropic’s stated rationale for limiting distribution of Mythos is the model’s reported capability at detecting software vulnerabilities, creating exploits and chaining attacks across operating systems and browsers.
The UK AI Security Institute evaluated the model and found it performed strongly on offensive cyber tasks. Gallagher Re draws a direct parallel to the conditions that preceded the 2017 WannaCry and NotPetya attacks, when leaked NSA-developed exploits lowered barriers to entry for sophisticated attacks, increased the automation and scalability of destructive campaigns and reduced friction in facilitating ransomware payments.
The report identifies AI-enabled attacks as replicating all three ingredients, with one additional factor: speed. Data cited in the report shows mean time-to-exploitation for disclosed vulnerabilities has compressed from more than two years in 2018 to roughly 10 hours in 2026.
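For a sense of scale, the compression can be worked out directly. The sketch below is illustrative only: it treats "more than two years" as exactly two 365-day years (an assumption, so the result is a lower bound) and takes the 10-hour figure at face value. The function and variable names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def compression_factor(before_hours: float, after_hours: float) -> float:
    """Return how many times shorter the later window is than the earlier one."""
    return before_hours / after_hours


# "More than two years" is treated as exactly two years, so this is a floor.
tte_2018_hours = 2 * DAYS_PER_YEAR * HOURS_PER_DAY
tte_2026_hours = 10

factor = compression_factor(tte_2018_hours, tte_2026_hours)
print(f"Time-to-exploitation shrank by at least {factor:,.0f}x")
```

On these assumptions the window for patching before exploitation has shrunk by more than three orders of magnitude.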
Gallagher Re also notes that restriction may only delay, not eliminate, the threat. The UK AI Security Institute’s own testing found that OpenAI’s GPT-5.5 performed comparably to Mythos on the same offensive cyber evaluations, meaning models with similar capabilities are already publicly available. Gallagher Re describes DeepSeek v4 as positioning Chinese labs as roughly six months behind their U.S. counterparts, suggesting the window of any protection afforded by Anthropic’s access controls is narrow.
Benchmark-Based Evaluation Falls Short of What Insurers Need
A separate but related problem identified by Gallagher Re is that standard AI model evaluation methods are poorly suited to insurance underwriting.
Most models are assessed using static benchmarks that measure accuracy on controlled datasets, but the report argues these tools are “not designed to predict real-world loss.” Several widely used benchmarks are now considered saturated, with top scores clustering near the ceiling: Gemini 3.1 Pro scores approximately 95% on GPQA Diamond, which measures advanced scientific reasoning and knowledge, and Claude 4.5 Sonnet scores approximately 98% on HumanEval, which measures programming capability, leaving little ability to differentiate risk quality between insureds.
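One way to see why saturated scores differentiate poorly is to look at sampling noise. The sketch below is not from the report. It assumes each benchmark item is an independent pass/fail trial and uses 164 as the item count (the published size of HumanEval, treated here as an input). The helper names are hypothetical.

```python
import math


def binomial_standard_error(score: float, n_items: int) -> float:
    """Standard error of a pass rate, assuming independent pass/fail items."""
    return math.sqrt(score * (1.0 - score) / n_items)


def distinguishable(score_a: float, score_b: float, n_items: int, z: float = 1.96) -> bool:
    """True if the gap between two scores exceeds z standard errors of the difference."""
    se_diff = math.sqrt(
        binomial_standard_error(score_a, n_items) ** 2
        + binomial_standard_error(score_b, n_items) ** 2
    )
    return abs(score_a - score_b) > z * se_diff


N_ITEMS = 164  # assumption: HumanEval-sized benchmark

se = binomial_standard_error(0.98, N_ITEMS)
print(f"Standard error at a 98% score: {se * 100:.1f} percentage points")
print("98% vs 96% distinguishable:", distinguishable(0.98, 0.96, N_ITEMS))
print("98% vs 80% distinguishable:", distinguishable(0.98, 0.80, N_ITEMS))
```

Under these simplifying assumptions, a two-point gap near the ceiling sits within sampling noise, while a gap of the size seen before saturation does not.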
The report identifies five specific gaps in current evaluation practice:
- Benchmarks measure performance rather than behavior, meaning high-scoring models can still hallucinate, make inconsistent decisions or misinterpret instructions in ways that generate legal and regulatory liability.
- Models are increasingly trained on benchmark material itself, inflating scores without improving real-world reliability.
- Narrow evaluation encourages behavioral homogenization across models, which raises concentration risk for insurers whose portfolios may be heavily exposed to a small number of shared foundation models.
- Current methods do not measure whether failures could be correlated across multiple insureds simultaneously.
- The practical input space for deployed AI is effectively infinite, and static guardrails cannot cover it.
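The concentration and correlation points in the list above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a method from the report: every insured has the same per-period failure probability, and in the concentrated portfolio a fixed share of insureds fail together whenever their shared foundation model fails. All names and parameter values are hypothetical.

```python
import random


def simulate_claims(n_insureds, p_fail, shared_fraction, n_trials, seed=0):
    """Return the number of claims in each simulated period.

    Insureds on the shared model all fail together on a single draw;
    the remaining insureds fail independently with the same probability.
    """
    rng = random.Random(seed)
    n_shared = int(n_insureds * shared_fraction)
    n_indep = n_insureds - n_shared
    results = []
    for _ in range(n_trials):
        # One draw decides the fate of every insured on the shared model.
        shared_claims = n_shared if (n_shared and rng.random() < p_fail) else 0
        indep_claims = sum(rng.random() < p_fail for _ in range(n_indep))
        results.append(shared_claims + indep_claims)
    return results


def percentile(values, q):
    """Simple empirical percentile: the value at rank q in the sorted sample."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]


independent = simulate_claims(100, 0.02, 0.0, 5000)
concentrated = simulate_claims(100, 0.02, 0.8, 5000)

for label, claims in [("independent", independent), ("concentrated", concentrated)]:
    mean = sum(claims) / len(claims)
    print(f"{label}: mean claims {mean:.2f}, 99th percentile {percentile(claims, 0.99)}")
```

In this toy setup the two portfolios have roughly the same average claim count, but the concentrated one has a far heavier tail, which is the exposure that per-model benchmark scores do not reveal.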
Early Signs of Progress, but Structural Risks Remain
Gallagher Re points to emerging evaluation approaches that begin to address some of these gaps. Epoch AI combines internal and external benchmarks to reduce contamination risk. Artificial Analysis’s Omniscience Index scores hallucination and knowledge calibration as well as correctness.
The report notes that running Artificial Analysis’s Intelligence Index on Claude Opus 4.6 cost just under $5,000, signaling that comprehensive behavioral testing carries real costs but also real value for underwriters.
If restricted-access models become the norm and independent evaluators are excluded from testing them alongside proprietary models, the report concludes, insurers will be left pricing uncertainty rather than risk. That outcome, Gallagher Re argues, “almost always leads to higher premiums, narrower coverage or both.”
Obtain the full report here.
