
AI Benchmarks

Agent Version: 1.0.0

Benchmark Conditions

The benchmarks are recorded against three chained real-world tests that perform the same tasks demonstrated in our tutorial videos.

The agent is configured for maximum token efficiency and low error tolerance:

MAX_SNAPSHOTS_HISTORY=0
CONTEXT_COMPRESSION=true
MAXIMUM_RESTRICTED_TOOL_USAGE=3
RATE_LIMIT_RETRY=2
AI_MAX_TOKENS=8192
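
As a rough illustration, the settings above could be read into a typed object as follows. This is a sketch, not the agent's actual loader: the `BenchmarkConfig` class and `load_config` helper are hypothetical names, and the value types (integers and a boolean) are inferred from how the values are written.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BenchmarkConfig:
    """Hypothetical container for the benchmark settings listed above."""
    max_snapshots_history: int
    context_compression: bool
    maximum_restricted_tool_usage: int
    rate_limit_retry: int
    ai_max_tokens: int


def load_config(env: Mapping[str, str] = os.environ) -> BenchmarkConfig:
    """Parse the settings from an environment-like mapping of strings."""
    return BenchmarkConfig(
        max_snapshots_history=int(env["MAX_SNAPSHOTS_HISTORY"]),
        # Assumption: the flag is spelled "true" or "false".
        context_compression=env["CONTEXT_COMPRESSION"].strip().lower() == "true",
        maximum_restricted_tool_usage=int(env["MAXIMUM_RESTRICTED_TOOL_USAGE"]),
        rate_limit_retry=int(env["RATE_LIMIT_RETRY"]),
        ai_max_tokens=int(env["AI_MAX_TOKENS"]),
    )


# The exact values used for this benchmark run.
benchmark_env = {
    "MAX_SNAPSHOTS_HISTORY": "0",
    "CONTEXT_COMPRESSION": "true",
    "MAXIMUM_RESTRICTED_TOOL_USAGE": "3",
    "RATE_LIMIT_RETRY": "2",
    "AI_MAX_TOKENS": "8192",
}
config = load_config(benchmark_env)
print(config)
```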

Models compared: claude-sonnet-4-6 vs deepseek-v4-pro


Results Summary

| | Claude | DeepSeek |
|---|---|---|
| Result | ✓ success | ✓ success |
| Total API costs | $1.66 | $0.04* |
* DeepSeek V4 Pro API tokens were offered at a 90% discount at the time of benchmarking.
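
To compare costs without the promotion, the discounted figure can be scaled back to an estimated list price. The sketch below assumes the 90% discount applied uniformly to every token category, which the benchmark does not state. The helper name `undiscounted_cost` is hypothetical. Because $0.04 is itself a rounded figure, the result is only a rough estimate.

```python
def undiscounted_cost(billed_usd: float, discount: float) -> float:
    """Back out an estimated list-price cost from a discounted bill."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return billed_usd / (1 - discount)


claude_cost = 1.66                                 # reported total, USD
deepseek_billed = 0.04                             # reported total with discount, USD
deepseek_list = undiscounted_cost(deepseek_billed, 0.90)

print(f"DeepSeek estimated list-price cost: ${deepseek_list:.2f}")
print(f"Claude cost relative to that estimate: {claude_cost / deepseek_list:.1f}x")
```

Under that assumption, DeepSeek's run would cost roughly $0.40 at list price, which is still several times cheaper than Claude's $1.66.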

Verdict

Claude Sonnet 4.6 is the stronger choice for speed and reliability. It completed the full suite faster, maintained lower output verbosity per call, and showed more consistent behaviour across longer tests. Its cache warmed up more gradually.

DeepSeek V4 Pro's main strength is cost efficiency. It achieved a higher cache hit rate quickly and used fewer API calls in longer tests due to less granular verification. Its weaknesses are higher per-call latency, more frequent instruction violations, and slightly higher output verbosity per call.

Recommendations

Use DeepSeek V4 Pro for development testing or high-volume runs. Use Claude Sonnet 4.6 for regression testing or websites requiring strict compliance.

Execution Duration

Duration per test

Claude completed the full suite in 483s vs DeepSeek's 733s, taking 34% less time overall. The gap is most pronounced in Test 1 (126s vs 323s), where DeepSeek's higher per-call latency compounded across 44 API calls.

| | Claude T1 | DS4 T1 | Claude T2 | DS4 T2 | Claude T3 | DS4 T3 | Total |
|---|---|---|---|---|---|---|---|
| Duration | 126s | 323s | 200s | 225s | 156s | 183s | C: 483s · DS: 733s |
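
The 34% figure and the per-test gaps can be reproduced from the reported durations. This is a small sketch using only the numbers above. The helper name `pct_less_time` is hypothetical.

```python
# Reported per-test durations in seconds (Tests 1-3).
durations = {
    "claude": [126, 200, 156],
    "deepseek": [323, 225, 183],
}
# Reported suite totals in seconds.
reported_total = {"claude": 483, "deepseek": 733}


def pct_less_time(fast: float, slow: float) -> float:
    """Percentage by which `fast` is shorter than `slow`."""
    return (slow - fast) / slow * 100


overall_pct = pct_less_time(reported_total["claude"], reported_total["deepseek"])
per_test_pct = [
    pct_less_time(c, d) for c, d in zip(durations["claude"], durations["deepseek"])
]
summed = {model: sum(values) for model, values in durations.items()}

print(f"Overall: Claude took {overall_pct:.0f}% less time")
for test_number, pct in enumerate(per_test_pct, start=1):
    print(f"Test {test_number}: Claude took {pct:.0f}% less time")
print(f"Sum of per-test durations: {summed}")
```

The per-test durations sum to 482s and 731s, slightly below the reported totals of 483s and 733s. The difference is presumably due to rounding of the per-test values.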

API Calls

API calls per test

DeepSeek uses fewer calls in Tests 2 and 3 due to a less granular verification pattern. Claude's higher count reflects more thorough step-by-step auditing.

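| | Claude T1 | DS4 T1 | Claude T2 | DS4 T2 | Claude T3 | DS4 T3 | Total |
|---|---|---|---|---|---|---|---|
| API calls | 39 | 44 | 62 | 39 | 50 | 35 | C: 151 · DS: 118 |
| Step log entries | 61 | 70 | 83 | 58 | 68 | 49 | C: 212 · DS: 177 |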
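
Combining the call counts with the suite durations gives a rough view of per-call latency. In this sketch, wall-clock seconds per API call is used as a proxy for latency. That is an assumption: the reported durations presumably also include browser actions and other non-API time.

```python
# Reported API calls per test (Tests 1-3) and suite durations in seconds.
api_calls = {"claude": [39, 62, 50], "deepseek": [44, 39, 35]}
total_seconds = {"claude": 483, "deepseek": 733}

total_calls = {model: sum(calls) for model, calls in api_calls.items()}

# How many fewer calls DeepSeek made, relative to Claude.
fewer_calls_pct = (
    (total_calls["claude"] - total_calls["deepseek"]) / total_calls["claude"] * 100
)

# Wall-clock seconds per API call (rough latency proxy, see assumption above).
seconds_per_call = {
    model: total_seconds[model] / total_calls[model] for model in total_calls
}

print(total_calls)
print(f"DeepSeek made {fewer_calls_pct:.0f}% fewer calls")
for model, seconds in seconds_per_call.items():
    print(f"{model}: {seconds:.1f}s of wall-clock time per API call")
```

By this measure DeepSeek spends roughly twice as long per call, which is consistent with it finishing later despite making fewer calls.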

Cache Performance

Both models have active caching — the mechanisms differ but both achieve high cache hit rates.

Cache hit rate per test

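| | Claude T1 | DS4 T1 | Claude T2 | DS4 T2 | Claude T3 | DS4 T3 |
|---|---|---|---|---|---|---|
| Cache hit tokens | 425,494 | 595,584 | 678,932 | 487,552 | 510,033 | 398,848 |
| Full-rate tokens | 138,105 | 28,333 | 80,750 | 31,253 | 58,676 | 20,362 |
| Cache hit rate | 75.5% | 95.5% | 89.4% | 94.0% | 89.7% | 95.1% |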

DeepSeek achieves a higher cache hit rate (94.0–95.5%) than Claude (75.5–89.7%). This is because DeepSeek's cache covers the full accumulated conversation history, while Claude's cacheReadTokens grows from scratch each session and takes several turns to warm up, which is why Claude's Test 1 rate is lower than its Test 2 and Test 3 rates.
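
The reported hit rates are consistent with dividing cache hit tokens by the sum of cache hit and full-rate tokens. That formula is inferred from the figures rather than stated by the benchmark. The sketch below applies it to the token counts in the table. The function name `cache_hit_rate` is hypothetical.

```python
def cache_hit_rate(cache_hit_tokens: int, full_rate_tokens: int) -> float:
    """Share of input tokens served from cache, as a percentage."""
    total_input = cache_hit_tokens + full_rate_tokens
    if total_input == 0:
        return 0.0
    return cache_hit_tokens / total_input * 100


# (cache hit tokens, full-rate tokens) per run, taken from the table.
runs = {
    "Claude T1": (425_494, 138_105),
    "DS4 T1": (595_584, 28_333),
    "Claude T2": (678_932, 80_750),
    "DS4 T2": (487_552, 31_253),
    "Claude T3": (510_033, 58_676),
    "DS4 T3": (398_848, 20_362),
}

rates = {name: round(cache_hit_rate(hit, full), 1) for name, (hit, full) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```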

Cache Hit Tokens per Step

Cache read per step

Claude's cache reads grow steadily through the session. DeepSeek's cached tokens start high (system prompt + tools are cached from step 2) and grow linearly as history accumulates — explaining the higher overall hit rate.

Full-Rate Billed Tokens per Step

Full-rate tokens per step

Token economics at full rate are similar for both models. After warm-up, Claude sends only new content at full rate (44–2,800 tokens per step), and DeepSeek's cacheMissTokens are similarly small (40–2,200 per step). Both models bill surprisingly little at full rate per call once the cache is warm.

The large spike in Claude T1 steps 21–22 (~30k) is caused by a large ARIA tree snapshot from the registration form being absorbed into cache in that turn.


Token Breakdown Summary

Token breakdown

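| | Claude T1 | DS4 T1 | Claude T2 | DS4 T2 | Claude T3 | DS4 T3 |
|---|---|---|---|---|---|---|
| Full-rate billed | 138,105 | 28,333 | 80,750 | 31,253 | 58,676 | 20,362 |
| Cache hit (discounted) | 425,494 | 595,584 | 678,932 | 487,552 | 510,033 | 398,848 |
| Output | 4,840 | 7,301 | 7,720 | 5,408 | 5,867 | 4,466 |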

Claude's full-rate figure = inputTokens + cacheCreationTokens
DS4's full-rate figure = cacheMissTokens
Claude's cache hit figure = cacheReadTokens
DS4's cache hit figure = cachedTokens
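
The four definitions above can be expressed as a single normalisation step, so both providers report the same two figures. This is a sketch only. The field names come from the definitions above, while the `normalise_usage` function, the provider labels, and the dictionary shape are hypothetical, and the sample numbers are made up for illustration.

```python
def normalise_usage(provider: str, usage: dict) -> dict:
    """Map provider-specific usage fields to full-rate and cache-hit token counts."""
    if provider == "claude":
        return {
            "full_rate": usage["inputTokens"] + usage["cacheCreationTokens"],
            "cache_hit": usage["cacheReadTokens"],
        }
    if provider == "deepseek":
        return {
            "full_rate": usage["cacheMissTokens"],
            "cache_hit": usage["cachedTokens"],
        }
    raise ValueError(f"unknown provider: {provider}")


# Made-up usage records for a single step, for illustration only.
claude_step = {"inputTokens": 100, "cacheCreationTokens": 900, "cacheReadTokens": 4000}
deepseek_step = {"cacheMissTokens": 800, "cachedTokens": 4200}

claude_norm = normalise_usage("claude", claude_step)
deepseek_norm = normalise_usage("deepseek", deepseek_step)
print(claude_norm)
print(deepseek_norm)
```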


Output Tokens

Average output tokens per call

Output tokens per step

DeepSeek is more verbose per call in every test despite the same "1–2 sentence max" system prompt instruction. Claude's total output is nevertheless higher because it makes more calls:

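| | Claude T1 | DS4 T1 | Claude T2 | DS4 T2 | Claude T3 | DS4 T3 | Overall avg / total |
|---|---|---|---|---|---|---|---|
| Avg output/call | 124 | 166 (+34%) | 125 | 139 (+11%) | 117 | 128 (+9%) | C: 122 · DS: 144 |
| Total output | 4,840 | 7,301 | 7,720 | 5,408 | 5,867 | 4,466 | C: 18,427 · DS: 17,175 |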
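
The per-call averages follow from dividing total output tokens by the API call counts reported earlier. The sketch below recomputes them two ways. The overall averages of 122 and 144 appear to match the unweighted mean of the three per-test averages, which is an inference from the numbers rather than something the benchmark states. Weighting by call count instead gives a slightly higher DeepSeek figure.

```python
# Reported totals per test (Tests 1-3).
output_tokens = {"claude": [4_840, 7_720, 5_867], "deepseek": [7_301, 5_408, 4_466]}
api_calls = {"claude": [39, 62, 50], "deepseek": [44, 39, 35]}


def per_test_averages(model: str) -> list:
    """Average output tokens per call for each test."""
    return [tokens / calls for tokens, calls in zip(output_tokens[model], api_calls[model])]


# Unweighted: mean of the three per-test averages.
unweighted = {
    model: sum(per_test_averages(model)) / len(output_tokens[model])
    for model in output_tokens
}
# Call-weighted: all output tokens divided by all calls.
weighted = {
    model: sum(output_tokens[model]) / sum(api_calls[model]) for model in output_tokens
}

for model in output_tokens:
    rounded = [round(avg) for avg in per_test_averages(model)]
    print(f"{model}: per-test {rounded}, "
          f"unweighted {unweighted[model]:.0f}, call-weighted {weighted[model]:.0f}")
```

Either way, DeepSeek's per-call output is higher than Claude's, so the choice of averaging method does not change the conclusion.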

Errors and Issues

Error counts per test


Key Takeaways

Both models have effective caching — DeepSeek's hit rate is higher.
DeepSeek achieves 94–95% cache hit rates from the second call onward (system prompt + tools cached immediately). Claude warms up more gradually, reaching 89–90% by Tests 2 and 3.

Full-rate billed tokens are small for both once cache is warm.
After the first call, both models bill surprisingly little at full rate per step — typically 40–2,000 tokens. The cost difference comes primarily from cache creation and model pricing, not from per-step miss volume.

Speed: Claude wins clearly.
34% faster overall (483s vs 733s), driven by lower per-call latency. DeepSeek's latency is higher regardless of token volume.

API calls: DeepSeek more efficient in Tests 2–3.
22% fewer total calls. Less granular verification pattern means fewer round trips but also less step-level auditability.

Output verbosity: Claude is more concise.
122 tokens/call on average vs 144 for DeepSeek, so Claude emits about 15% fewer tokens per call (equivalently, DeepSeek emits about 18% more). Both use the same system prompt brevity instruction.

Error patterns are complementary and both fixable via prompting.
Claude has a recurring CSS selector error on form submit buttons. DeepSeek has a worsening snapshot lifecycle issue on longer tests.