diff --git a/pages/insights/20251105_human_ai.md b/pages/insights/20251105_human_ai.md
index 859685b..fb1da2b 100644
--- a/pages/insights/20251105_human_ai.md
+++ b/pages/insights/20251105_human_ai.md
@@ -1,6 +1,6 @@
 title: Humans & AI [Ep. 1] - Gigachad Strikes
 date: 2025-11-05
-description: Where does AI rank among human programmers on RobotRumble?
+description: AI vs. Human showdown - who comes out on top?
 authors: John Yang
 
 We're particularly excited about using CodeClash as a platform to explore collaboration and competition among human and AI programmers.
diff --git a/pages/insights/20260115_human_ai_ladder.md b/pages/insights/20260115_human_ai_ladder.md
new file mode 100644
index 0000000..e1004c2
--- /dev/null
+++ b/pages/insights/20260115_human_ai_ladder.md
@@ -0,0 +1,477 @@
+title: Humans & AI [Ep. 2] - Introducing CC:Ladder
+date: 2026-01-15
+description: Where does AI rank among public solutions by human programmers?
+authors: John Yang
+
+**tl;dr** We introduce boss battles as a new format for evaluating LMs' coding + reasoning capabilities.
+
+We pitted [Claude 4.5 Sonnet against GigaChad](/insights/20251105_human_ai/) in [RobotRumble](/arenas/robotrumble) and found that today's best coding models still struggle heavily to develop suboptimal codebases into ones that rival the best human-written solutions.
+
+Inspired by this finding, we introduce **CC:Ladder**, a twist that makes evaluating LMs as competitive, long-horizon software developers **hill-climbable** and **cheaper**.
+
+## How it works
+
+In **CC:Ladder**, models begin against the weakest human solution and must win a majority of `n` rounds to advance to increasingly stronger opponents; evaluation is determined by the highest-ranked opponent defeated.
+
+
+Some key details:
+
+- Models start with a codebase containing the weakest opponent's solution.
+- Models play `n` rounds against an opponent, where **`n >= 3`** and **`n` is odd**.
+- A model "advances" to the next opponent if it **wins at least `(n+1)/2` rounds** *and* it **wins the last round**.
+- If a model advances, **its codebase carries over**. In other words, a model's codebase at the start of round 0 against opponent rank 60 is the same as the codebase at the end of round 5 against opponent rank 61. The model's codebase does *not* get reset to the initial state.
+
+**CC:Ladder** has several advantages over the default Elo leaderboard.
+
+- **Hill-climbable**: See how far up the rankings a model can go. Better models achieve higher rankings.
+- **Cheaper**: The model competes against static human solutions. No need to spend $$ to run another LM as an opponent.
+- **Less noise**: Again, because the opponent is a static human solution.
+- **Long Horizon**: To beat the ladder, models must play `m opponents * n rounds per opponent`, where `m=58` for RobotRumble and `m=264` for Core War.
+
+## Building CC:Ladder
+
+Putting together a ladder for a CodeClash arena depends entirely on how many open-source, human-written solutions are available on the web.
+
+- For RobotRumble, we found 58 open-source implementations on the [public leaderboard](https://robotrumble.org/boards/2/robots).
+- For Core War, we found 264 open-source implementations by manually crawling the Core War online [directory](http://www.koth.org/planar/by-name/complete.htm).
+
+Given a solution, we (1) check that the solution compiles and runs properly, then (2) push the solution as a branch (named `human/` or `human//`) to the corresponding repository (branches for [Core War](https://github.com/CodeClash-ai/CoreWar/branches), [RobotRumble](https://github.com/CodeClash-ai/RobotRumble/branches)).
+
+We currently execute this workflow manually.
+Ping us in [Slack](https://join.slack.com/t/swe-bench/shared_invite/zt-36pj9bu5s-o3_yXPZbaH2wVnxnss1EkQ) if you'd be interested in automating this process or putting together a new ladder for a different arena!
+
+## Initial Findings
+
+### Part 1: Ranking human-written solutions
+
+Given `n` solutions, we make every unique pair of solutions compete `t` times.
+
+- `t=250` for RobotRumble
+- `t=4000` for Core War
+
+`t` varies solely due to compute constraints.
+Core War simulations run more quickly than RobotRumble simulations.
+
+Then, we compute each solution's Elo and determine the rankings.
+Elo ratings are computed by fitting a Bradley-Terry model to the pairwise win matrix via maximum likelihood estimation with L2 regularization.
+We set the regularization strength to 0.01 and use a base Elo of 1200 with a slope of 400 to convert log-odds strengths to interpretable ratings.
+
+For **Core War**, the top ten:
+
+1. human/toxic: **1408.7**
+2. human/forjohn: **1401.9**
+3. human/maelstrom: **1396.0**
+4. human/silkworm: **1392.2**
+5. human/returnofthefugitive: **1386.1**
+6. human/unheardof: **1385.3**
+7. human/devilstick: **1384.7**
+8. human/mascafe: **1379.6**
+9. human/cloudburst: **1376.9**
+10. human/decoysignal: **1372.2**
+
+Show full Core War rankings
+
+1. human/toxic: 1408.7
+2. human/forjohn: 1401.9
+3. human/maelstrom: 1396.0
+4. human/silkworm: 1392.2
+5. human/returnofthefugitive: 1386.1
+6. human/unheardof: 1385.3
+7. human/devilstick: 1384.7
+8. human/mascafe: 1379.6
+9. human/cloudburst: 1376.9
+10. human/decoysignal: 1372.2
+11. human/chainlockv02a: 1370.0
+12. human/burningmetal: 1367.7
+13. human/defensive: 1365.0
+14. human/firestorm: 1364.8
+15. human/dawn2: 1362.2
+16. human/mercenary: 1361.5
+17. human/pdqscan: 1358.1
+18. human/lastjudgement: 1351.7
+19. human/rust: 1350.8
+20. human/snowscan: 1350.6
+21. human/frothandfizzle: 1346.6
+22. human/thefugitive: 1346.3
+23. human/blackknight: 1342.6
+24. human/sonofvain: 1340.3
+25. human/dawn: 1339.8
+26. human/goldeneye: 1335.4
+27. human/silking: 1332.1
+28. human/artofcorewar: 1331.9
+29. human/blowrag: 1329.2
+30. human/returnofthejedimp: 1326.9
+31. human/danceoffallenangels: 1324.6
+32. human/azathoth: 1320.9
+33. human/kosmos: 1319.4
+34. human/simplicity: 1314.0
+35. human/armadillo: 1313.3
+36. human/combatra: 1313.2
+37. human/cinammon: 1309.9
+38. human/returnofthependragon: 1306.9
+39. human/numb: 1305.0
+40. human/neith: 1304.3
+41. human/halcyon: 1303.2
+42. human/olivia: 1303.2
+43. human/reepicheep: 1301.3
+44. human/hullab3loo: 1301.0
+45. human/npaperii: 1300.7
+46. human/elvenking: 1298.3
+47. human/gargantuan: 1297.8
+48. human/mandragora: 1296.4
+49. human/safetyinnumbers: 1295.4
+50. human/hullabaloo: 1290.9
+51. human/eccentric: 1290.0
+52. human/thunderstrike: 1289.6
+53. human/impishv02: 1289.2
+54. human/ziggy: 1289.0
+55. human/stylizedeuphoria: 1288.7
+56. human/ironicimps: 1287.6
+57. human/gigolo: 1286.8
+58. human/gremlin: 1285.1
+59. human/borgir: 1283.6
+60. human/unrequitedlove: 1279.4
+61. human/themystery: 1278.0
+62. human/spiritualblackdimension: 1276.2
+63. human/recycledbits: 1273.1
+64. human/jade: 1272.7
+65. human/luca: 1268.9
+66. human/vain: 1268.8
+67. human/bitethebullet: 1268.3
+68. human/disharmonious: 1267.6
+69. human/uninvited: 1267.6
+70. human/revengeofthepapers: 1267.4
+71. human/bulldozed: 1265.7
+72. human/diehard: 1264.2
+73. human/nighttrain: 1263.0
+74. human/blacken: 1262.7
+75. human/sunset: 1261.6
+76. human/devilish202: 1261.4
+77. human/retroq: 1259.8
+78. human/evolcap66: 1259.3
+79. human/fixed: 1258.7
+80. human/nemesis: 1258.5
+81. human/ompega: 1258.2
+82. human/stormkeeper: 1256.1
+83. human/quicksilver: 1255.7
+84. human/slimetest: 1255.3
+85. human/rosebud: 1255.2
+86. human/bluecandle: 1253.0
+87. human/riseofthedragon: 1252.6
+88. human/kryptonite: 1250.0
+89. human/digitalis2003: 1245.4
+90. human/freighttrain: 1245.4
+91. human/electricrazor: 1244.8
+92. human/forgottenlore2: 1244.3
+93. human/timescape10: 1243.4
+94. human/revivalfire: 1240.3
+95. human/hellfire: 1239.7
+96. human/nightterrors: 1238.1
+97. human/thehistorian: 1236.9
+98. human/borg: 1236.7
+99. human/falconv03: 1236.2
+100. human/torment: 1234.1
+101. human/impfinityv4g1: 1232.7
+102. human/behemot: 1230.5
+103. human/returnofvanquisher: 1229.9
+104. human/forgottenlore: 1228.4
+105. human/sputnik: 1228.3
+106. human/unpitq: 1227.8
+107. human/vanquisher: 1227.7
+108. human/blade: 1227.2
+109. human/arrow: 1225.5
+110. human/electrichead: 1225.2
+111. human/lithobolia: 1224.1
+112. human/enigma: 1223.8
+113. human/valkyrie: 1223.5
+114. human/hazylazy: 1223.3
+115. human/shottonothing: 1222.1
+116. human/bigitalshot: 1221.9
+117. human/hazylazyc11: 1221.5
+118. human/alladinscave: 1220.8
+119. human/dust07: 1220.6
+120. human/unpit: 1219.5
+121. human/herbalavenger: 1219.3
+122. human/grendelsrevenge: 1218.8
+123. human/fireandice: 1218.5
+124. human/whitemist: 1218.3
+125. human/macromagic: 1218.0
+126. human/xenosmilus: 1217.3
+127. human/hector2: 1215.3
+128. human/oblivion: 1214.1
+129. human/bpanamax: 1213.9
+130. human/carmilla: 1213.4
+131. human/excalibur: 1213.3
+132. human/simple88v2: 1212.9
+133. human/kusanagi: 1212.8
+134. human/perseus: 1211.7
+135. human/barrage: 1211.1
+136. human/jackinthebox: 1210.4
+137. human/discord: 1209.7
+138. human/boysarebackintown: 1208.8
+139. human/nosferatu: 1208.1
+140. human/pendulum: 1207.4
+141. human/jinx: 1207.0
+142. human/vampsareback02: 1205.1
+143. human/zooom: 1204.8
+144. human/sprawlingchaos: 1204.7
+145. human/eternalexile: 1204.5
+146. human/bloodlust: 1204.1
+147. human/curseoftheundead: 1203.9
+148. human/recon2: 1201.0
+149. human/jackintheboxii: 1200.5
+150. human/blizzard: 1199.8
+151. human/hazyshadeii: 1199.0
+152. human/sneakyb2: 1198.8
+153. human/labomba: 1198.8
+154. human/bluefunk3: 1198.3
+155. human/lithium: 1197.8
+156. human/damageincorporated: 1197.6
+157. human/torcht18: 1197.0
+158. human/probe: 1196.3
+159. human/intotheunknown: 1195.6
+160. human/grilledoctopus05: 1194.4
+161. human/yogibear: 1193.5
+162. human/infiltrator: 1193.1
+163. human/myvamp54: 1192.5
+164. human/claw: 1192.4
+165. human/stoninc: 1192.2
+166. human/chameleon: 1191.7
+167. human/thenextstep88: 1191.3
+168. human/julietandpaper: 1190.4
+169. human/stalker: 1189.8
+170. human/zygote: 1189.7
+171. human/tnt: 1189.1
+172. human/bayonet: 1188.4
+173. human/mason20: 1185.1
+174. human/tornado30: 1184.8
+175. human/bluefunk: 1184.6
+176. human/myvamp37: 1184.3
+177. human/onebite: 1183.8
+178. human/icedragon: 1182.6
+179. human/win: 1181.2
+180. human/soldieroffortune: 1179.0
+181. human/mirage15: 1178.8
+182. human/mirage2: 1178.7
+183. human/nightofthelivingdead: 1178.7
+184. human/flurry: 1177.2
+185. human/blur2: 1176.4
+186. human/blur: 1175.3
+187. human/thermiteii: 1175.2
+188. human/gemoftheocean: 1173.9
+189. human/replicant: 1172.5
+190. human/vamp02b: 1171.2
+191. human/aeka: 1170.6
+192. human/quiz: 1167.8
+193. human/gothik: 1164.0
+194. human/evoltmp88: 1162.1
+195. human/twister: 1161.1
+196. human/agonyii: 1158.8
+197. human/steppingstone: 1157.2
+198. human/abomination: 1155.6
+199. human/phq: 1155.3
+200. human/beholderseye17: 1150.3
+201. human/armorya5: 1149.9
+202. human/foggyswamp: 1149.9
+203. human/elementaldust2: 1149.5
+204. human/heremscimitar: 1149.2
+205. human/pacman: 1148.8
+206. human/leviathan: 1146.3
+207. human/chimerav35: 1146.0
+208. human/leapfrog: 1144.4
+209. human/snake: 1143.9
+210. human/irongate: 1141.6
+211. human/fatexpansionv: 1138.7
+212. human/seventyfive: 1137.6
+213. human/kitchensinkii: 1136.9
+214. human/cannonade: 1133.5
+215. human/lucky3: 1133.3
+216. human/winterwerewolf3: 1133.0
+217. human/blur88: 1132.1
+218. human/leprechaunonspeed: 1130.5
+219. human/stasis: 1130.1
+220. human/agony51: 1128.4
+221. human/ttti: 1127.0
+222. human/thermite10: 1124.5
+223. human/capskeyisstuck: 1124.2
+224. human/sj4a: 1123.4
+225. human/medusasv7x: 1122.7
+226. human/ncdecoy: 1122.2
+227. human/agony31: 1122.2
+228. human/hordesofmicrowarriors: 1121.1
+229. human/sphinxv28: 1118.6
+230. human/rave: 1115.5
+231. human/keystonet13: 1113.6
+232. human/charonv81: 1113.2
+233. human/leprechaun1b: 1106.0
+234. human/nomuckingabout: 1096.6
+235. human/charonv70: 1095.4
+236. human/bscannersliveinvain: 1094.9
+237. human/crimp2: 1092.1
+238. human/crimp: 1090.7
+239. human/killerinstinct: 1088.4
+240. human/imprimis6: 1084.4
+241. human/griffin2: 1083.7
+242. human/requestv20: 1076.7
+243. human/impurge: 1067.2
+244. human/backstabber: 1066.2
+245. human/0stormbringer: 1065.0
+246. human/twilightpitsv60: 1060.2
+247. human/fastfoodv21: 1056.8
+248. human/flashpaper: 1046.7
+249. human/flashpaper37: 1045.9
+250. human/gammapaper30: 1045.4
+251. human/flypaper30: 1040.7
+252. human/hydra: 1026.4
+253. human/precipice: 1025.0
+254. human/trinity: 1022.7
+255. human/paratroopsv21: 1017.9
+256. human/genocide: 1015.6
+257. human/vagabond: 1001.0
+258. human/notepaper: 967.6
+259. human/returnofthelivingdead: 955.5
+260. human/smoothnoodlemap6: 909.9
+261. human/smoothnoodlemap: 887.8
+262. human/dwarf: 864.3
+263. human/validate: 344.1
+264. human/pspace: -889.5
+ +
+
+For **RobotRumble**, the top ten:
+
+1. human/entropicdrifter/gigachad: **3219.0**
+2. human/entropicdrifter/seven-of-nine: **2627.3**
+3. human/entropicdrifter/we-are-borg: **2560.0**
+4. human/entropicdrifter/glommerv2: **2456.8**
+5. human/mousetail/coward-bot: **2326.5**
+6. human/entropicdrifter/glommer: **2250.2**
+7. human/mitch84/crw_preempt: **2109.9**
+8. human/mitch84/retreat_walk2: **2040.6**
+9. human/devchris/black_magic: **2001.7**
+10. human/tabaxi3k/black-magic-1: **1994.3**
+
+Show full RobotRumble rankings
+
+1. human/entropicdrifter/gigachad: 3219.0
+2. human/entropicdrifter/seven-of-nine: 2627.3
+3. human/entropicdrifter/we-are-borg: 2560.0
+4. human/entropicdrifter/glommerv2: 2456.8
+5. human/mousetail/coward-bot: 2326.5
+6. human/entropicdrifter/glommer: 2250.2
+7. human/mitch84/crw_preempt: 2109.9
+8. human/mitch84/retreat_walk2: 2040.6
+9. human/devchris/black_magic: 2001.7
+10. human/tabaxi3k/black-magic-1: 1994.3
+11. human/mitch84/walk_retreat: 1968.8
+12. human/jammyliu/sixty-nine-line: 1889.7
+13. human/atl15/centerrr: 1838.2
+14. human/clay/diag-lattice: 1719.0
+15. human/gerenuk/gere-ape: 1712.4
+16. human/wolfsleuth/simple: 1656.1
+17. human/essickmango/pickle-up: 1655.9
+18. human/mkap/test: 1638.9
+19. human/ketza/arthur: 1624.4
+20. human/mountain/neuralbot4-3h: 1622.5
+21. human/aaoutkine/silo34: 1618.6
+22. human/anton/om-om: 1594.2
+23. human/mee42/follow-bot: 1594.1
+24. human/lanity/sivuy: 1593.7
+25. human/underscore/bot1: 1589.8
+26. human/mario31313/alpha_13: 1588.9
+27. human/thesmilingturtl/naivefaa: 1587.8
+28. human/aaoutkine/school-bot: 1570.6
+29. human/suddenlyseals/control-center: 1551.4
+30. human/ketza/bob: 1543.2
+31. human/mjburgess/rule99: 1499.7
+32. human/kalkin/maxad: 1498.1
+33. human/mousetail/genetic-robot: 1493.7
+34. human/edward/flail: 1477.2
+35. human/aayyad/testbot: 1427.0
+36. human/anton/anton4000: 1397.8
+37. human/luisa/baselinegere: 1226.0
+38. human/luisa/luisasrobot: 1223.1
+39. human/jay0jayjay/naivestarter: 1168.3
+40. human/aaa/jippty5: 1032.3
+41. human/devchris/first_test: 940.9
+42. human/tabaxi3k/charles: 936.3
+43. human/essickmango/fruity-test: 935.9
+44. human/sbasu3/meek-bot: 499.4
+45. human/jiricodes/jiricodes-bot: 400.0
+46. human/navster8/maginot-line: 397.3
+47. human/kalkin/artemis2: 390.0
+48. human/kalkin/artemis: 340.7
+49. human/mountain/neuralbot2-6h: 331.4
+50. human/sivecano/clouded-mind: 75.9
+51. human/mountain/neuralbot1-1h: 23.5
+52. human/aaoutkine/dark-knight: -55.6
+53. human/navster8/bash-brothers: -496.0
+54. human/ldang/nemo: -496.7
+55. human/ldang/nessy: -538.5
+56. human/anton/wallifier: -911.3
+57. human/happysquid/test: -1624.4
+58. human/anton/anton3000: -1736.7
+ +
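The rating procedure described in Part 1 (a Bradley-Terry fit to the pairwise win matrix with L2 regularization, then a conversion to Elo) can be illustrated with a minimal pure-Python sketch. This is not the CodeClash implementation: the function name `bradley_terry_elo`, the scaled gradient-ascent optimizer, the exact penalty form `reg * theta^2`, and the `slope / ln(10)` conversion are all assumptions made for illustration. Only the values 0.01, 1200, and 400 come from the post.

```python
import math


def bradley_terry_elo(wins, reg=0.01, base=1200.0, slope=400.0, steps=2000):
    """Fit Bradley-Terry strengths by L2-regularized maximum likelihood.

    wins[i][j] is the number of games solution i won against solution j.
    Returns one Elo-style rating per solution (hypothetical helper).
    """
    n = len(wins)
    theta = [0.0] * n  # log-odds strength of each solution
    # Total games per solution, used only to scale the gradient step.
    games = [sum(wins[i][j] + wins[j][i] for j in range(n) if j != i)
             for i in range(n)]
    for _ in range(steps):
        grad = []
        for i in range(n):
            g = -2.0 * reg * theta[i]  # gradient of the assumed penalty -reg * theta_i^2
            for j in range(n):
                if i == j:
                    continue
                played = wins[i][j] + wins[j][i]
                p_win = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                g += wins[i][j] - played * p_win  # observed minus expected wins
            grad.append(g)
        theta = [t + g / max(1, m) for t, g, m in zip(theta, grad, games)]
    # Assumed conversion: a 400-point gap corresponds to 10:1 odds.
    scale = slope / math.log(10)
    return [base + scale * t for t in theta]


# Toy usage: solution 0 beats solution 1 in 75 of 100 games.
ratings = bradley_terry_elo([[0, 75], [25, 0]])
print([round(r, 1) for r in ratings])
```

Under these assumptions, the regularizer keeps ratings finite even when one solution never wins a game, which may be one reason the post's ladders can include extreme outliers without the fit diverging.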
+
+### Part 2: How high do current models climb?
+
+On Core War
+
+* Claude Opus 4.5 reaches *[coming soon]*
+* GPT 5.2 (medium thinking) reaches *[coming soon]*
+* Gemini 3 Pro reaches *[coming soon]*
+
+On RobotRumble
+
+* Claude Opus 4.5 reaches *[coming soon]*
+* GPT 5.2 (medium thinking) reaches *[coming soon]*
+* Gemini 3 Pro reaches *[coming soon]*
+
+## How to run?
+
+Run your model against **CC:Ladder** today.
+[Set up CodeClash](https://docs.codeclash.ai/quickstart/#installation) and run `uv run python ladder.py configs/ladder/.yaml`, where `.yaml` specifies (using Core War as the example arena):
+
+tournament:
+  rounds: 5 # Number of rounds the model plays against each opponent
+game:
+  name: CoreWar
+  sims_per_round: 1000
+  args: {}
+player:
+  agent: mini
+  name: claude-sonnet-4-5-20250929
+  config:
+    agent: !include mini/default.yaml
+    model:
+      model_name: '@anthropic/claude-sonnet-4-5-20250929'
+      model_kwargs:
+        temperature: 0.2
+        max_tokens: 4096
+
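To make the `rounds` setting concrete, here is a minimal sketch of the advancement rule from "How it works": a model advances only if it wins a majority of the `n` rounds and also wins the last one. The names `advances`, `climb`, and the `play_round` callback are hypothetical and not part of the CodeClash API; the sketch also ignores how the model edits its codebase between rounds.

```python
def advances(round_wins):
    """round_wins: one boolean per round played against a single opponent."""
    n = len(round_wins)
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be odd and at least 3")
    # Majority of rounds won, and the final round won.
    return sum(round_wins) >= (n + 1) // 2 and round_wins[-1]


def climb(ladder, play_round, n=5):
    """Return the strongest opponent defeated, or None if the first one holds.

    ladder: opponents ordered from weakest to strongest.
    play_round(opponent, round_idx) -> bool is a hypothetical callback that
    returns True when the model wins that round.
    """
    best = None
    for opponent in ladder:
        results = [play_round(opponent, r) for r in range(n)]
        if not advances(results):
            break  # the climb stops at the first opponent not defeated
        best = opponent
    return best


# Toy usage: the model beats every opponent except the strongest one.
print(climb(["weak", "mid", "strong"], lambda opp, r: opp != "strong"))
```

With `rounds: 5`, clearing the whole ladder would take `264 * 5 = 1320` rounds on Core War and `58 * 5 = 290` rounds on RobotRumble, which is where the long-horizon framing comes from.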
+
+## Relationship between CC:Ladder & CodeClash
+
+For Pokémon fans, **CC:Ladder** is the equivalent of the [Elite 4](https://pokemon.fandom.com/wiki/Elite_Four) battles (and for the real aficionados, **CC:Ladder** is inspired heavily by the [Trainer Tower](https://bulbapedia.bulbagarden.net/wiki/Trainer_Tower)).
+CodeClash is the real-world [Video Game Championships](https://en.wikipedia.org/wiki/Pok%C3%A9mon_World_Championships), where individuals compete against other humans (*not* a static bot).
+
+
+As with the Elite Four, CC:Ladder tests progression against fixed opponents, whereas CodeClash reflects real competition by measuring performance against intelligent competitors. +
+
+We recommend CC:Ladder be treated as a proper evaluation as well.
+Similar to how SWE-bench Lite and Verified were created as easier subsets of SWE-bench, we think
+
+CodeClash remains the north-star evaluation.
+Competing against dynamic, intelligent opponents is more challenging than competing against static solutions.
+However, given the rather dismal current state of models' ability to code against smart rivals across a long horizon, we introduce **CC:Ladder** as a stepping stone towards such capabilities.
\ No newline at end of file
diff --git a/static/css/layout.css b/static/css/layout.css
index a27e8af..6872e06 100644
--- a/static/css/layout.css
+++ b/static/css/layout.css
@@ -600,6 +600,7 @@ summary {
     padding: 0.25rem 0.5rem;
     border-left: 3px solid #e1e4e8;
     margin-bottom: 0.5rem;
+    background-color: rgba(128, 128, 128, 0.1);
     transition: border-color 0.2s;
 }
 
diff --git a/static/images/insights/20260116_human_ai_ladder/cc_ladder.png b/static/images/insights/20260116_human_ai_ladder/cc_ladder.png
new file mode 100644
index 0000000..afd88dd
Binary files /dev/null and b/static/images/insights/20260116_human_ai_ladder/cc_ladder.png differ
diff --git a/static/images/insights/20260116_human_ai_ladder/elite4firered.png b/static/images/insights/20260116_human_ai_ladder/elite4firered.png
new file mode 100644
index 0000000..4cbed63
Binary files /dev/null and b/static/images/insights/20260116_human_ai_ladder/elite4firered.png differ