State of the Article

A blend of models beats Fable 5

TL;DR: Blending several models into one answer can beat the single best model, and a cheap blend can rival a pricey one.

OpenRouter released Fusion, which runs a prompt across multiple models at once and synthesizes their outputs into one response, a technique known as ensembling. On 100 deep research tasks from the DRACO benchmark, a panel of Fable 5 plus GPT-5.5 scored 69 percent, beating Fable 5 alone at 65.3 percent.
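The fan-out-then-synthesize pattern can be sketched in a few lines. This is a minimal illustration, not OpenRouter's implementation: the `Model` type, the `fuse` function, the synthesis prompt wording, and the stub models are all hypothetical stand-ins for real model calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical interface: a model is any callable that maps a prompt to text.
Model = Callable[[str], str]


def fuse(prompt: str, panel: Dict[str, Model], synthesizer: Model) -> str:
    """Send the prompt to every panel model in parallel, then merge the drafts."""
    if not panel:
        raise ValueError("panel must contain at least one model")

    # Fan out: each panel model answers the same prompt concurrently.
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(model, prompt) for name, model in panel.items()}
        drafts = {name: future.result() for name, future in futures.items()}

    # Synthesize: hand all drafts to one model and ask for a single answer.
    labelled = "\n\n".join(f"[{name}]\n{text}" for name, text in drafts.items())
    merged_prompt = (
        f"Question: {prompt}\n\n{labelled}\n\n"
        "Write one answer that keeps what the drafts agree on "
        "and resolves any conflicts between them."
    )
    return synthesizer(merged_prompt)


# Stub models so the sketch runs without any network access (assumptions).
panel = {
    "model_a": lambda p: "Draft A for: " + p,
    "model_b": lambda p: "Draft B for: " + p,
}
synthesizer = lambda merged: f"SYNTHESIS of {merged.count('[model_')} drafts"

print(fuse("What changed in the benchmark?", panel, synthesizer))
# prints "SYNTHESIS of 2 drafts"
```

Passing the same model twice under two names gives the self-fusion variant, where one model's parallel attempts are merged.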

The cost angle is the surprise. A budget panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro came within a point of Fable 5 (64.7 versus 65.3) at roughly half the token cost, and beat both GPT-5.5 and Opus 4.8 on their own. Even running Opus 4.8 twice in parallel and fusing the two answers lifted its score by nearly seven points. This suggests that the synthesis step itself, rather than model diversity alone, drives most of the gain, much as an LLM-as-judge improves a result by weighing several attempts before settling on one.
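The gaps quoted above can be checked with quick arithmetic. The sketch below uses only the three scores reported in this article; the dictionary keys and the `gap` helper are illustrative names.

```python
# Scores (percent) as reported in the article for the DRACO tasks.
scores = {
    "Fable 5 + GPT-5.5 panel": 69.0,
    "Fable 5 alone": 65.3,
    "Budget panel": 64.7,
}

baseline = scores["Fable 5 alone"]


def gap(name: str) -> float:
    """Difference from the Fable 5 baseline, in percentage points."""
    return round(scores[name] - baseline, 1)


print(gap("Fable 5 + GPT-5.5 panel"))  # 3.7
print(gap("Budget panel"))             # -0.6
```

So the top panel leads the single best model by 3.7 points, while the budget panel trails it by 0.6.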

Fusion runs server-side and can be called like a normal API, either as your default model or as a tool a base model invokes only when a question is hard enough to justify the extra compute. If you want to get more out of whatever model you pick, the Prompting Guide is the fastest lever.
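The "tool a base model invokes only when needed" mode can be sketched as follows. This is an assumption-laden illustration, not the actual Fusion API: the `ESCALATE` sentinel, the `answer` function, and the toy models (including the word-count rule standing in for the base model's own judgment of difficulty) are all hypothetical.

```python
from typing import Callable

# Hypothetical interface: a model is any callable that maps a prompt to text.
Model = Callable[[str], str]

# Hypothetical sentinel the base model returns to request the fusion tool.
ESCALATE = "CALL_FUSION"


def answer(prompt: str, base_model: Model, fusion: Model) -> str:
    """Let the base model answer directly, or hand off to fusion on request."""
    reply = base_model(prompt)
    if reply.strip() == ESCALATE:
        # The base model judged the question hard enough for extra compute.
        return fusion(prompt)
    return reply


# Toy stand-ins (assumptions): long prompts are treated as "hard".
def toy_base(prompt: str) -> str:
    return ESCALATE if len(prompt.split()) > 8 else "base: " + prompt


def toy_fusion(prompt: str) -> str:
    return "fused: " + prompt


print(answer("What is 2+2?", toy_base, toy_fusion))
# prints "base: What is 2+2?"
```

The design point is that the multi-model cost is paid only on the prompts the base model escalates, while easy prompts get a single cheap call.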

Source: OpenRouter
