Claude Code Benchmark: Dynamic Languages Faster and Cheaper

April 6, 2026 · Lisa Park · Tech
Original source: infoq.com

A quantitative benchmark conducted by Ruby committer Yusuke Endoh has found that dynamic programming languages are significantly more efficient for AI coding agents when generating working implementations. The study, which tested Claude Code across 13 different languages, revealed that Ruby, Python, and JavaScript were the fastest, cheapest, and most stable options for the AI model.

The experiment aimed to provide empirical data to the ongoing debate regarding whether static typing prevents AI hallucination bugs or if dynamic typing is preferable because it saves tokens. To test this, Endoh tasked Claude Code (Opus 4.6) with implementing a simplified version of Git, referred to as a mini-git, across various language categories.

Benchmark Methodology

The benchmark consisted of over 600 runs, with each language being tested 20 times. The implementation was divided into two distinct phases. In the first phase, the AI was asked to implement basic Git functions including init, add, commit, and log from an empty directory. The second phase required the AI to extend the project by adding status, diff, checkout, and reset functions.
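
The article does not reproduce Endoh's SPEC-v1.txt, so the sketch below is only an illustration of what the phase-one command surface might look like, here in Python, one of the benchmark's top performers; the on-disk layout and function names are assumptions, not the benchmark's actual spec:

    import os
    import sys
    import time

    # Illustrative mini-git skeleton. Only the command names (init, add,
    # commit, log) come from the article; everything else is assumed.

    def cmd_init():
        os.makedirs(".minigit/objects", exist_ok=True)
        open(".minigit/log", "a").close()
        print("initialized empty repository")

    def cmd_add(paths):
        # Stage files for the next commit via a simple index file.
        with open(".minigit/index", "a") as index:
            for path in paths:
                index.write(path + "\n")

    def cmd_commit(message):
        # A real implementation would snapshot the staged files into
        # .minigit/objects first; here we only append a log entry.
        with open(".minigit/log", "a") as log:
            log.write("%d\t%s\n" % (time.time(), message))

    def cmd_log():
        with open(".minigit/log") as log:
            print(log.read(), end="")

    def main():
        command, *args = sys.argv[1:]
        if command == "init":
            cmd_init()
        elif command == "add":
            cmd_add(args)
        elif command == "commit":
            cmd_commit(args[0])
        elif command == "log":
            cmd_log()
        # Phase two would extend this dispatcher with status, diff,
        # checkout, and reset.

    if __name__ == "__main__":
        main()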

To ensure the results were not skewed by differences in library dependencies across languages, Endoh used a custom hash algorithm instead of SHA-256. The prompt provided to the AI was straightforward: read SPEC-v1.txt, implement it, and make sure test-v1.sh passes, with a similar instruction for the second phase.
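
The article does not name the custom algorithm, so as a hypothetical stand-in, a dependency-free non-cryptographic hash such as FNV-1a illustrates why this choice levels the field: every tested language can implement it in a few lines without pulling in a crypto library:

    def toy_hash(data):
        # 64-bit FNV-1a, shown purely as an example of a dependency-free
        # hash; the benchmark's actual custom algorithm is not described.
        h = 0xcbf29ce484222325
        for byte in data:
            h ^= byte
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
        return "%016x" % h

    print(toy_hash(b"hello"))  # the same few lines port to any of the 13 languages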

The languages tested were categorized into several groups to isolate the impact of type-checking:

  • Dynamic: Python, Ruby, JavaScript, Perl, and Lua.
  • Dynamic with type checkers: Python/mypy (using fully type-annotated Python verified with mypy --strict) and Ruby/Steep (using RBS type signatures verified with steep check).
  • Static: TypeScript, Go, Rust, C, and Java.
  • Functional: Scheme (dynamic), OCaml (static), and Haskell (static).

Performance and Cost Results

The results indicated that dynamic languages dominated in terms of speed and cost. Ruby emerged as the most efficient, averaging $0.36 per run with a completion time of 73.1 seconds. Python followed closely at $0.38 per run and 74.6 seconds, while JavaScript averaged $0.39 per run and 81.1 seconds.

All three dynamic leaders passed all 40 tests across their 20 runs and exhibited low variance in their performance. In contrast, statically typed languages were found to be 1.4 to 2.6 times slower and more expensive.

As the rankings moved beyond the top three, both the cost and the variance in performance increased sharply. Go averaged $0.50 per run and 101.6 seconds, though with a standard deviation of 37 seconds. Java also averaged $0.50 per run but took 115.4 seconds. Rust averaged $0.54 per run at 113.7 seconds and displayed the widest spread of results, with a standard deviation of 54.8 seconds.

The benchmark also highlighted stability issues with some static languages. Rust was one of only two languages that experienced test failures, passing 38 out of 40 tests.

The Impact of Type Checking

By including versions of dynamic languages with type checkers, the benchmark allowed for a direct comparison of the overhead introduced by type annotations. Python/mypy, for example, saw its average cost rise to $0.57 and its average time increase to 125.3 seconds, compared to the standard Python results of $0.38 and 74.6 seconds.
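
As a rough, hypothetical illustration of where that overhead comes from (this snippet is not code from the study), the same function in plain Python and in a form that passes mypy --strict differs mainly in annotation ceremony, and the agent must generate those extra tokens for every function it writes:

    from typing import Optional, TypeVar

    K = TypeVar("K")
    V = TypeVar("V")

    # Plain dynamic Python, as in the benchmark's fastest configurations.
    def lookup(index, key, default=None):
        return index.get(key, default)

    # The same function spelled out for mypy --strict: parameter types,
    # the return type, and the type variables above all add tokens.
    def lookup_typed(index: dict[K, V], key: K, default: Optional[V] = None) -> Optional[V]:
        return index.get(key, default)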

The data suggests that for prototyping-scale tasks, the ability to skip type annotations saves tokens and reduces the complexity of the code the AI must generate. This results in faster generation times and lower financial costs per implementation.

While the benchmark favors dynamic languages for the generation phase, the researcher noted that if the final application requires high runtime speed, there is still a reason to choose a static language despite the higher initial cost of AI generation.
