Remote OpenClaw Blog

Sakana Fugu Ultra Benchmarks: Claimed vs Verified

8 min read · 24 June 2026

Sakana Fugu Ultra benchmark claims, as of late June 2026, are vendor-reported and selectively echoed by the press; none has been independently verified by a neutral third party. Sakana AI positions Fugu Ultra as competitive with the frontier tier (Claude Fable 5, Mythos Preview, Gemini 3.1 Pro, Opus 4.8, and GPT-5.5), and some outlets report it edging past GPT-5.5 and Opus 4.8 on SWE-Bench Pro and TerminalBench. Treat every one of those as a claim to test, not a confirmed result.

Key Takeaways

Sakana AI reports Fugu Ultra "stands shoulder-to-shoulder" with frontier models like Fable 5 and Mythos Preview, but publishes no single headline number in its public text.
Press reports claiming wins over GPT-5.5 and Opus 4.8 on SWE-Bench Pro and TerminalBench are attributed claims, not independently reproduced as of late June 2026.
Fugu Ultra is a multi-agent orchestration system, which makes apples-to-apples benchmarking genuinely hard: agent counts, tool access, and runtime cost can all vary from one run to the next.
Fugu Ultra was announced June 22, 2026, and no neutral leaderboard had reproduced the claimed results within the launch window, so the safe stance is "unverified."
The only benchmark that matters for your decision is your own: run Fugu Ultra against your real tasks before trusting any launch-window scores.

What benchmark claims exist

The Sakana Fugu Ultra benchmark claims center on a single positioning statement: Sakana AI reports that Fugu Ultra "stands shoulder-to-shoulder with leading models like Fable 5 and Mythos Preview" across engineering, scientific, and reasoning evaluations. That is the company's own characterization, and it is deliberately framed as parity rather than a specific score.

According to Sakana AI's materials, the baseline comparison set spans the current frontier tier: Claude Fable 5, Mythos Preview, Gemini 3.1 Pro, Opus 4.8, and GPT-5.5. The company describes Fugu Ultra as competitive with that group rather than publishing one headline benchmark figure in its public-facing text. You can read the company's framing on the Sakana Fugu product overview.

Some press reports go further than Sakana's own copy. Those reports claim Fugu Ultra surpasses GPT-5.5 and Opus 4.8 on SWE-Bench Pro (a software-engineering benchmark) and TerminalBench (an agentic command-line benchmark). Those are reported claims attributed to outlets, not numbers Sakana stamps as official, and not figures any independent tester had reproduced as of late June 2026.

Why these claims are unverified

None of the Sakana Fugu Ultra benchmark claims had been independently reproduced as of late June 2026 because the model was only days old: it was announced June 22, 2026, and the verification ecosystem moves slower than a launch cycle. Neutral leaderboards, academic reproductions, and community evals all take weeks to set up, run, and publish.

There is also a structural problem with self-reported benchmarks. A vendor controls the prompt, the scaffold, the tool access, the number of attempts, and which results get reported. Two labs running "SWE-Bench Pro" can use different harnesses, retry budgets, and pass@k definitions and arrive at wildly different numbers for the same underlying model. That is why the field treats vendor numbers as a starting hypothesis, not a settled fact.
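
To make the pass@k point concrete, here is a minimal Python sketch of the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts and c the number that passed. The attempt counts are made up for illustration and are not Fugu Ultra results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k samples, drawn without replacement
    from n attempts of which c passed, is a passing attempt."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer failing attempts than samples: a pass is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 3 of them pass.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The same ten attempts read as 30% at pass@1 and about 92% at pass@5, which is why a score reported without its k and attempt budget cannot be compared with another lab's number.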

For context on how the canonical software-engineering benchmark defines tasks and scoring, see the official SWE-Bench site. Comparing any Fugu Ultra claim against that methodology is the only way to know whether two reported numbers are even measuring the same thing.

The practical takeaway: until reproducible results appear on a neutral leaderboard, treat "Fugu Ultra beats Opus 4.8 on SWE-Bench Pro" as a marketing-grade claim. It might be true. It is just not yet evidence.

Claimed results vs status

The table below separates what is claimed from what has actually been verified as of late June 2026. Every entry in the "Status" column reaches the same verdict for a reason: independent confirmation simply does not exist yet.

Claim	Source of claim	Status (as of late June 2026)
Fugu Ultra "stands shoulder-to-shoulder" with Fable 5 and Mythos Preview	Sakana AI (vendor positioning)	Vendor claim — not independently verified
Competitive with frontier tier (Gemini 3.1 Pro, Opus 4.8, GPT-5.5)	Sakana AI (vendor positioning)	Vendor claim — no headline score published; unverified
Surpasses GPT-5.5 and Opus 4.8 on SWE-Bench Pro	Press reports (attributed)	Reported claim — not reproduced by neutral testers
Surpasses GPT-5.5 and Opus 4.8 on TerminalBench	Press reports (attributed)	Reported claim — not reproduced by neutral testers
Trialed by ~500 beta users on demanding real-world workloads	Sakana AI (vendor)	Vendor claim — usage signal, not a benchmark result

Why orchestration complicates benchmarking

Multi-agent orchestration breaks the assumptions that make standard benchmarks comparable, which is the core reason Fugu Ultra numbers are hard to trust at face value. Fugu Ultra is not a single fixed model answering a prompt — it is a coordinator that can recruit several expert models, hand them subtasks, verify their work, and synthesize a result, as described in the Sakana Fugu Technical Report (arXiv:2606.21228).

That introduces three variables a normal benchmark assumes are fixed. First, the compute budget: a single hard SWE-Bench Pro task might trigger one agent or a dozen, so "the model's score" depends on how much orchestration it was allowed to do. Second, the cost and latency per task swing widely, which means a parity score can hide a 5x cost gap. Third, the system can use tools and retries that a baseline single-model comparison may not, making "same benchmark" an illusion.

Apples-to-apples comparison requires fixing all of that — equal tool access, equal attempt budgets, equal cost ceilings — and vendor reports rarely disclose those constraints. Without them, a headline "wins SWE-Bench Pro" tells you almost nothing about how the system behaves under your budget and your guardrails.
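
One way to see how a parity score can hide a cost gap is to report cost per solved task next to accuracy. The sketch below uses invented numbers (they are not measurements of Fugu Ultra or any other model), and the RunSummary class is a hypothetical helper, not part of any vendor tooling.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    name: str
    tasks: int
    solved: int
    total_cost_usd: float

    @property
    def accuracy(self) -> float:
        return self.solved / self.tasks

    @property
    def cost_per_solved(self) -> float:
        # Infinite cost if nothing was solved, so the ratio stays meaningful.
        return self.total_cost_usd / self.solved if self.solved else float("inf")


# Illustrative figures only: identical accuracy, very different spend.
single = RunSummary("single-model baseline", tasks=50, solved=30, total_cost_usd=15.0)
orchestrated = RunSummary("orchestrated system", tasks=50, solved=30, total_cost_usd=75.0)

for run in (single, orchestrated):
    print(f"{run.name}: accuracy={run.accuracy:.0%}, "
          f"cost per solved task=${run.cost_per_solved:.2f}")
```

Both runs score 60%, yet the second spends five times as much per solved task. An accuracy-only leaderboard would show them as equal.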

How to run your own eval

The most reliable Fugu Ultra benchmark is the one you build from your own tasks, because it controls for cost, tooling, and the work you actually care about. Public leaderboards optimize for general tasks; your production workload does not look like SWE-Bench Pro.

A defensible mini-eval has five steps. Collect 20–50 real tasks from your backlog with known-good answers. Define a clear pass/fail rubric before you run anything. Run Fugu Ultra and your current model on the identical task set, same prompts, same tool access. Record three numbers per task: correctness, cost, and latency. Then compare on the metric that matters to your product, not the metric Sakana chose to highlight.
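
The five steps above can be sketched as a small Python harness. Everything here is an assumption for illustration: a "system" is any function that takes a prompt and returns an answer plus its cost in USD, the exact-match rubric stands in for whatever pass/fail rule you define, and the two stub systems replace real API calls.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str
    expected: str  # known-good answer


@dataclass
class Result:
    correct: bool
    cost_usd: float
    latency_s: float


# Hypothetical interface: prompt in, (answer, cost in USD) out.
System = Callable[[str], tuple[str, float]]


def run_eval(system: System, tasks: list[Task]) -> list[Result]:
    results = []
    for task in tasks:
        start = time.perf_counter()
        answer, cost = system(task.prompt)
        latency = time.perf_counter() - start
        # Rubric fixed before the run: exact match after trimming whitespace.
        results.append(Result(answer.strip() == task.expected, cost, latency))
    return results


def summarize(results: list[Result]) -> dict[str, float]:
    n = len(results)
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
    }


# Stub systems standing in for your current model and the candidate.
def stub_current_model(prompt: str) -> tuple[str, float]:
    return ("4" if prompt == "2+2" else "?", 0.01)


def stub_candidate(prompt: str) -> tuple[str, float]:
    return ({"2+2": "4", "3*3": "9"}.get(prompt, "?"), 0.05)


tasks = [Task("2+2", "4"), Task("3*3", "9")]
for name, system in [("current", stub_current_model), ("candidate", stub_candidate)]:
    print(name, summarize(run_eval(system, tasks)))
```

Swapping the stubs for real client calls keeps the task set, rubric, and recorded metrics identical across both systems, which is the point of the exercise.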

Cap the cost and attempt budget so the comparison is fair. Because Fugu Ultra can spend more compute per task, an unconstrained "best of many attempts" run will flatter it; a fixed budget tells you the honest tradeoff. The Sakana developer console exposes an OpenAI-compatible API, so you can point existing eval harnesses at it with minimal changes — see the Sakana Fugu product page for access details.
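
A fixed budget can be enforced in the harness itself. The sketch below is one possible policy, not Sakana's: it stops when the attempt limit is reached or cumulative spend passes a ceiling, and an answer that only arrives by exceeding the ceiling counts as a failure. The run_with_budget and make_flaky names, the limits, and the per-call cost are all hypothetical.

```python
from typing import Callable

# Hypothetical interface: prompt in, (answer, cost in USD) out.
Attempt = Callable[[str], tuple[str, float]]


def run_with_budget(
    attempt: Attempt,
    prompt: str,
    check: Callable[[str], bool],
    max_attempts: int = 3,
    max_cost_usd: float = 0.50,
) -> dict:
    """Retry until the check passes, attempts run out, or spend passes the ceiling."""
    spent = 0.0
    for n in range(1, max_attempts + 1):
        answer, cost = attempt(prompt)
        spent += cost
        if spent > max_cost_usd:
            # Over budget: the result does not count, even if it was correct.
            return {"passed": False, "attempts": n, "cost_usd": spent, "stopped": "cost"}
        if check(answer):
            return {"passed": True, "attempts": n, "cost_usd": spent, "stopped": "passed"}
    return {"passed": False, "attempts": max_attempts, "cost_usd": spent, "stopped": "attempts"}


def make_flaky(succeed_on: int, cost_per_call: float) -> Attempt:
    """Stub system that returns a correct answer only from the given call onward."""
    calls = 0

    def attempt(prompt: str) -> tuple[str, float]:
        nonlocal calls
        calls += 1
        return ("ok" if calls >= succeed_on else "wrong", cost_per_call)

    return attempt


def is_ok(answer: str) -> bool:
    return answer == "ok"


capped = run_with_budget(make_flaky(3, 0.20), "task", is_ok, max_cost_usd=0.50)
uncapped = run_with_budget(make_flaky(3, 0.20), "task", is_ok, max_cost_usd=1.00)
print("capped:  ", capped)
print("uncapped:", uncapped)
```

The stub solves the task on its third call. Under the tighter ceiling that third call is over budget and the task is scored as a failure; under the looser one it passes. Reporting the budget alongside the score makes that difference visible.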

If you want to anchor your harness to a credible methodology, model your scoring on the published SWE-Bench task format and adapt it to your domain. Borrowing a vetted rubric beats inventing one, and it makes your results legible to anyone who already understands that benchmark.
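
As a rough illustration of borrowing that rubric: SWE-Bench instances list tests that must go from failing to passing and tests that must keep passing, and an instance counts as resolved only when both groups pass. The sketch below adapts that idea under stated assumptions. The TaskSpec class, its field names, and the test names are hypothetical, and it is not the official SWE-Bench harness.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    task_id: str
    fail_to_pass: list[str]  # checks that fail before the change and must pass after
    pass_to_pass: list[str]  # checks that already pass and must not regress


def is_resolved(spec: TaskSpec, test_results: dict[str, bool]) -> bool:
    """A task is resolved only if every required check passed.
    A check missing from the results counts as a failure."""
    if not spec.fail_to_pass:
        raise ValueError("a task needs at least one check that the change must fix")
    required = spec.fail_to_pass + spec.pass_to_pass
    return all(test_results.get(name, False) for name in required)


# Hypothetical task and results.
spec = TaskSpec("demo-1", fail_to_pass=["test_bug_fixed"], pass_to_pass=["test_existing_a"])
print(is_resolved(spec, {"test_bug_fixed": True, "test_existing_a": True}))
print(is_resolved(spec, {"test_bug_fixed": True, "test_existing_a": False}))
```

Requiring the regression checks to keep passing stops a system from scoring a win by fixing the target behavior while breaking something else.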

Limitations and Tradeoffs

The biggest limitation of any Fugu Ultra benchmark discussion right now is recency: the model launched June 22, 2026, so independent data simply has not had time to accumulate. Any conclusion you draw from launch-window numbers is provisional by definition.

Multi-agent systems also make cost and latency unpredictable. A benchmark that reports only accuracy can mask the fact that Fugu Ultra spent several model calls — and several times the cost — to match a single-model baseline. For interactive products with strict response budgets, that variance is often the deciding factor, not the accuracy delta.

There is a verification asymmetry, too. Vendor and press numbers are easy to publish and hard to reproduce, so the burden of proof falls on you. The reported wins over GPT-5.5 and Opus 4.8 may well hold up — but until a neutral party reproduces them with disclosed budgets, treating them as fact is a risk, not a shortcut.

When should you wait? If you need audited benchmark numbers, deterministic latency, or a long track record before a platform commitment, Fugu Ultra is too new to bet on in mid-2026. If you can A/B test outputs on your own tasks and absorb some cost variance, it is worth a controlled, budget-capped trial.

FAQ

Are Sakana Fugu Ultra's benchmark scores verified?

No. As of late June 2026, all Sakana Fugu Ultra benchmark claims are vendor-reported or attributed to press, and none have been independently reproduced on a neutral leaderboard. The model launched June 22, 2026, so the verification ecosystem has not caught up yet.

Does Fugu Ultra really beat GPT-5.5 and Opus 4.8 on SWE-Bench Pro?

Some press reports claim Fugu Ultra surpasses GPT-5.5 and Opus 4.8 on SWE-Bench Pro and TerminalBench, but those are reported claims, not reproduced results. Sakana AI's own public text frames Fugu Ultra as competitive with the frontier tier rather than publishing a specific winning score, so treat the press figures as unverified.

Why is it hard to benchmark a multi-agent model like Fugu Ultra?

Fugu Ultra can recruit and coordinate multiple expert models per task, so its score depends on how much compute, tool access, and how many attempts it was allowed. A fair, apples-to-apples comparison must fix those variables across all models, and vendor reports rarely disclose them, which makes headline numbers hard to trust.

How should I evaluate Fugu Ultra for my own use case?

Build a mini-eval from 20–50 of your real tasks with known-good answers, run Fugu Ultra and your current model on the identical set with equal tool access and a fixed cost budget, and record correctness, cost, and latency per task. Compare on the metric your product depends on, not the one Sakana chose to highlight.

What benchmarks does Sakana reference for Fugu Ultra?

Sakana AI describes Fugu Ultra as competitive with frontier models like Fable 5, Mythos Preview, Gemini 3.1 Pro, Opus 4.8, and GPT-5.5 across engineering, scientific, and reasoning evaluations. Press reports add SWE-Bench Pro and TerminalBench specifically, but those benchmark wins remain unverified by independent testers.
