I benchmarked 40 open-source LLMs for coding in April 2026 — here's what surprised me #4
desiorac started this conversation in Show and tell
After running 2,400+ coding tasks across 40 open-source models, the results aren't what the leaderboards show.
Surprise Rankings
Speed vs. accuracy tradeoffs are brutal. DeepSeek-Coder-V3 wins on accuracy but runs 3-4x slower than Qwen2.5-Coder-32B on consumer hardware.
Context window actually matters. Models with 128k+ context windows outperformed equivalent architectures with 32k windows by 12-18% on real-world refactoring tasks.
Instruction-tuning quality beats base model size. A well-tuned 13B model consistently outperformed a poorly tuned 34B one.
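One way to think about the speed/accuracy tradeoff above: a model that is 3-4x slower needs a large accuracy edge just to break even on throughput-sensitive workloads. Here's a minimal sketch of a throughput-adjusted score; the function name, the baseline of 40 tokens/sec, and the example numbers are all illustrative assumptions, not the methodology from the report.

```python
def throughput_adjusted_score(accuracy: float, tokens_per_sec: float,
                              baseline_tps: float = 40.0) -> float:
    """Scale accuracy by generation speed relative to a baseline, capped at 1.0.

    A model at or above baseline speed keeps its full accuracy;
    slower models are penalized proportionally.
    """
    speed_factor = min(tokens_per_sec / baseline_tps, 1.0)
    return accuracy * speed_factor

# Hypothetical numbers: a fast model at baseline speed vs. a more
# accurate model running at roughly a third of that speed.
fast = throughput_adjusted_score(accuracy=0.62, tokens_per_sec=40.0)
slow = throughput_adjusted_score(accuracy=0.70, tokens_per_sec=12.0)
```

Under this (deliberately crude) weighting, the faster model wins despite lower raw accuracy; whether that matches your workload depends entirely on how latency-sensitive it is.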
The Models Nobody's Talking About
Browse the free interactive benchmark summary: https://ark-forge.github.io/genesis/
Full Gist writeup: https://gist.github.com/desiorac/a2579ec6d02c282453f27802446e44e9
Get the complete 28-page report (€9): https://buy.stripe.com/9B628n1Yl1UM0PC5kk