Prism::CodeUnitsCache #3173

Merged
merged 1 commit into from
Oct 10, 2024


kddnewton

Calculating code unit offsets for a source can be very expensive, especially when the source is large. This commit introduces a new class that wraps the source and desired encoding into a cache that reuses pre-computed offsets. In the benchmarks below, repeated lookups through the cache are roughly 3x to 19x faster than the uncached calls.

There are still some problems with this approach, namely character boundaries and the fact that the cache is unbounded, but both of these may be addressed in subsequent commits.
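To make the idea concrete, here is a minimal pure-Ruby sketch of an offset cache of this kind. It is not the implementation in this PR: the class name `SketchCodeUnitsCache`, its internals, and the assumption that every requested byte offset falls on a character boundary of a UTF-8 source are all hypothetical. It only mirrors the `cache[byte_offset]` lookup shape used in the benchmark script.

```ruby
# frozen_string_literal: true

# Hypothetical sketch, not Prism's actual code. It remembers previously
# computed (byte offset, code unit offset) pairs so that a later lookup only
# has to re-encode the bytes between the nearest cached offset and the
# requested one, instead of starting from the beginning of the source.
class SketchCodeUnitsCache
  def initialize(source, encoding)
    @source = source
    @encoding = encoding
    # Width of one code unit in bytes for the target encoding.
    @unit_size =
      case encoding
      when Encoding::UTF_16LE, Encoding::UTF_16BE then 2
      when Encoding::UTF_32LE, Encoding::UTF_32BE then 4
      else 1
      end
    # Sorted by byte offset. Like the cache described above, it is unbounded.
    @entries = [[0, 0]]
  end

  # Assumes byte_offset lies on a character boundary of the source.
  def [](byte_offset)
    # Find the last cached entry whose byte offset is <= byte_offset.
    after = @entries.bsearch_index { |entry| entry[0] > byte_offset }
    index = (after || @entries.length) - 1
    cached_offset, cached_units = @entries[index]
    return cached_units if cached_offset == byte_offset

    # Only the gap between the cached offset and the target is re-encoded.
    slice = @source.byteslice(cached_offset, byte_offset - cached_offset)
    units = cached_units + slice.encode(@encoding).bytesize / @unit_size
    @entries.insert(index + 1, [byte_offset, units])
    units
  end
end

cache = SketchCodeUnitsCache.new("a😀b", Encoding::UTF_16LE)
puts cache[5] # byte offset just past the emoji
puts cache[6] # reuses the entry cached for byte offset 5
```

The sorted array plus binary search is just one possible design; it keeps lookups cheap but does nothing about the two open problems noted above (mid-character offsets and unbounded growth).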

Some benchmarks, using the following script:

# frozen_string_literal: true

require "bundler/setup"
require "prism"
require "benchmark"

code = "😀😀😀😀😀😀😀😀" * Integer(ARGV.first)
result = Prism.parse(code)

source = result.source
bytesize = code.bytesize

Benchmark.bm do |x|
  x.report("old") do
    1000.times { source.code_units_offset(rand(bytesize), Encoding::UTF_16LE) }
  end

  x.report("new") do
    cache = source.code_units_cache(Encoding::UTF_16LE)
    1000.times { cache[rand(bytesize)] }
  end
end

resulted in:

$ be ruby test.rb 10 
       user     system      total        real
old  0.002221   0.000374   0.002595 (  0.002597)
new  0.000789   0.000016   0.000805 (  0.000807)
$ be ruby test.rb 100
       user     system      total        real
old  0.008739   0.000677   0.009416 (  0.009421)
new  0.003202   0.000081   0.003283 (  0.003286)
$ be ruby test.rb 1000
       user     system      total        real
old  0.078277   0.003391   0.081668 (  0.081750)
new  0.016299   0.000608   0.016907 (  0.016929)
$ be ruby test.rb 10000
       user     system      total        real
old  0.749045   0.036684   0.785729 (  0.786660)
new  0.037629   0.003045   0.040674 (  0.040730)
$ be ruby test.rb 100000
       user     system      total        real
old  7.299773   0.319311   7.619084 (  7.624173)
new  0.521168   0.019792   0.540960 (  0.541081)
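For a quick read of these numbers, the wall-clock ("real") column can be turned into speedup ratios. The snippet below is only a convenience sketch: the constant names are made up for illustration, and the timings are copied from the output above rather than re-measured.

```ruby
# frozen_string_literal: true

# Wall-clock ("real") seconds copied from the benchmark output above, keyed
# by the repetition count passed to test.rb. Constant names are illustrative.
REAL_TIMES = {
  10 => { old: 0.002597, new: 0.000807 },
  100 => { old: 0.009421, new: 0.003286 },
  1_000 => { old: 0.081750, new: 0.016929 },
  10_000 => { old: 0.786660, new: 0.040730 },
  100_000 => { old: 7.624173, new: 0.541081 }
}.freeze

# Speedup of the cached lookups relative to the uncached calls.
SPEEDUPS = REAL_TIMES.transform_values { |times| times[:old] / times[:new] }

SPEEDUPS.each do |count, ratio|
  puts format("%6d repetitions: %.1fx faster", count, ratio)
end
```

The ratio stays around 3x for the two smallest inputs and grows to roughly 14x to 19x for the two largest, which is consistent with the cache paying off most when the source is large.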

@kddnewton kddnewton force-pushed the code-units-cache branch 8 times, most recently from c59b804 to f2268a4 Compare October 9, 2024 19:40
@kddnewton kddnewton changed the title Prism::Source::CodeUnitsCache Prism::CodeUnitsCache Oct 10, 2024
@kddnewton kddnewton merged commit ba89182 into main Oct 10, 2024
54 checks passed
@kddnewton kddnewton deleted the code-units-cache branch October 10, 2024 18:02