I benchmarked 40 open-source LLMs for coding in April 2026 — here's what surprised me #4
desiorac started this conversation in Show and tell
After running 2,400+ coding tasks across 40 open-source models, the results aren't what the leaderboards show.
Surprise Rankings
Speed vs. accuracy tradeoffs are brutal. DeepSeek-Coder-V3 wins on accuracy but runs 3-4x slower than Qwen2.5-Coder-32B on consumer hardware.
Context window actually matters. Models with 128k+ context windows outperformed equivalent architectures with 32k windows by 12-18% on real-world refactoring tasks.
Instruction-tuning quality beats base model size. A well-tuned 13B model consistently outperformed a poorly tuned 34B one.
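One way to think about the speed/accuracy tradeoff above: a model that is 3-4x slower needs a large accuracy edge just to break even on throughput-sensitive workloads. Here's a minimal sketch of a throughput-adjusted score; the function name, the baseline of 40 tokens/sec, and the example numbers are all illustrative assumptions, not the methodology from the report.

```python
def throughput_adjusted_score(accuracy: float, tokens_per_sec: float,
                              baseline_tps: float = 40.0) -> float:
    """Scale accuracy by generation speed relative to a baseline, capped at 1.0.

    A model at or above baseline speed keeps its full accuracy;
    slower models are penalized proportionally.
    """
    speed_factor = min(tokens_per_sec / baseline_tps, 1.0)
    return accuracy * speed_factor

# Hypothetical numbers: a fast model at baseline speed vs. a more
# accurate model running at roughly a third of that speed.
fast = throughput_adjusted_score(accuracy=0.62, tokens_per_sec=40.0)
slow = throughput_adjusted_score(accuracy=0.70, tokens_per_sec=12.0)
```

Under this (deliberately crude) weighting, the faster model wins despite lower raw accuracy; whether that matches your workload depends entirely on how latency-sensitive it is.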
The Models Nobody's Talking About
Browse the free interactive benchmark summary: https://ark-forge.github.io/genesis/
Full Gist writeup: https://gist.github.com/desiorac/a2579ec6d02c282453f27802446e44e9
Get the complete 28-page report (€9): https://buy.stripe.com/9B628n1Yl1UM0PC5kk