I’m trying out a new scanner/encoder concept for CodeRay that would (hopefully) make highlighting even faster while improving the code at the same time.
Basically, it’s about bypassing the Tokens representation altogether.
The tests can be seen at Odd-eyed code.
Here are my results so far:
$ ruby current.rb
Scanning: 2.98s -- 370 kTok/s
Encoding: 3.14s -- 350 kTok/s
Together: 6.11s -- 180 kTok/s
$ ruby direct-stream.rb
Scanning: 2.76s -- 399 kTok/s
Encoding: 3.16s -- 348 kTok/s
Together: 5.92s -- 186 kTok/s
Scanning + Encoding: 5.04s -- 218 kTok/s

$ ruby187 current.rb
Scanning: 3.27s -- 336 kTok/s
Encoding: 2.96s -- 371 kTok/s
Together: 6.23s -- 176 kTok/s
$ ruby187 direct-stream.rb
Scanning: 2.97s -- 371 kTok/s
Encoding: 2.85s -- 386 kTok/s
Together: 5.82s -- 189 kTok/s
Scanning + Encoding: 4.98s -- 221 kTok/s

$ ruby19 current.rb
Scanning: 2.32s -- 474 kTok/s
Encoding: 3.47s -- 317 kTok/s
Together: 5.79s -- 190 kTok/s
$ ruby19 direct-stream.rb
Scanning: 2.24s -- 491 kTok/s
Encoding: 2.70s -- 408 kTok/s
Together: 4.94s -- 223 kTok/s
Scanning + Encoding: 4.08s -- 270 kTok/s

$ jruby current.rb
Scanning: 3.13s -- 351 kTok/s
Encoding: 1.59s -- 693 kTok/s
Together: 4.72s -- 233 kTok/s
$ jruby direct-stream.rb
Scanning: 2.35s -- 469 kTok/s
Encoding: 1.68s -- 655 kTok/s
Together: 4.03s -- 273 kTok/s
Scanning + Encoding: 2.18s -- 504 kTok/s
“Scanning + Encoding” bypasses the Tokens intermediate representation. The scanner calls the encoder directly.
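To make the idea concrete, here is a minimal sketch of a scanner that calls the encoder directly. It is not CodeRay's actual code: the class names, the `scan_into` method, and the `text` method are all made up for illustration, and the toy scanner only knows about double-quoted strings. Only the `open(:string)` call and the `[:open, :string]` token shape come from this post.

```ruby
require 'strscan'

# Hypothetical encoder: builds HTML while events arrive, no Tokens array.
class SketchHTMLEncoder
  ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

  attr_reader :output

  def initialize
    @output = String.new
  end

  def open(kind)
    @output << %(<span class="#{kind}">)
  end

  def close(_kind)
    @output << '</span>'
  end

  # Hypothetical method name for plain text events.
  def text(string)
    @output << string.gsub(/[&<>]/, ESCAPES)
  end
end

# Hypothetical "encoder" that just records events, which gives back
# the old intermediate representation when somebody needs it.
class SketchTokenCollector
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def open(kind)
    @tokens << [:open, kind]
  end

  def close(kind)
    @tokens << [:close, kind]
  end

  def text(string)
    @tokens << [string, :text]
  end
end

# Toy scanner: double-quoted strings become a :string group,
# everything else is passed through as text.
class SketchScanner
  def initialize(code)
    @code = code
  end

  # Works with any object that responds to open, close and text.
  def scan_into(encoder)
    s = StringScanner.new(@code)
    until s.eos?
      if (match = s.scan(/"[^"]*"/))
        encoder.open(:string)
        encoder.text(match)
        encoder.close(:string)
      else
        # Either a run of non-quote characters or a lone, unclosed quote.
        encoder.text(s.scan(/[^"]+|"/))
      end
    end
    encoder
  end
end

html = SketchScanner.new('puts "a < b"').scan_into(SketchHTMLEncoder.new).output
# html == 'puts <span class="string">"a &lt; b"</span>'
```

The scanner doesn't care what it is talking to, so the same `scan_into` call can feed the HTML encoder directly or fill a token list via `SketchTokenCollector`.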
The awesome speedup in JRuby is strange; I don’t think it’s the final word on this. But some things are certain:
- It’s possible to highlight tokens 20% to 50% faster on MRI and YARV (comparing “Together” with “Scanning + Encoding”) without sacrificing the speed of a separate scanning or encoding phase.
- MRI 1.8 < MRI 1.8.7 < YARV 1.9.2 < JRuby here when it comes to performance.
- I like the new API. encoder.open(:string) looks nicer than tokens << [:open, :string].
I think it’s also possible to support the current interface (CodeRay.scan(:ruby, code).html) using a Tokens proxy like ActiveRecord’s find(:all).
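Here is a rough sketch of what such a proxy could look like. Everything in it is hypothetical (the class names, `encode_with`, and the stand-in scanner that just splits on whitespace). The point is only that the object returned by a `scan`-style call can stream straight into an encoder, and build the token list lazily if somebody actually iterates over it.

```ruby
# Hypothetical stand-in scanner: streams words and whitespace runs
# to whatever encoder it is given, then returns that encoder.
SKETCH_SCAN = lambda do |code, encoder|
  code.scan(/\S+|\s+/) do |piece|
    encoder.text(piece, piece =~ /\A\s/ ? :space : :word)
  end
  encoder
end

# Records events so the old array-style interface keeps working.
class SketchCollector
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def text(string, kind)
    @tokens << [string, kind]
  end
end

# Trivial encoder used for the demo: upcases every word.
class SketchUpcaseEncoder
  attr_reader :output

  def initialize
    @output = String.new
  end

  def text(string, kind)
    @output << (kind == :word ? string.upcase : string)
  end
end

# Looks like a token list from the outside, but scans on demand.
class SketchTokensProxy
  include Enumerable

  def initialize(code, scan = SKETCH_SCAN)
    @code = code
    @scan = scan
  end

  # Fast path: scanner talks to the encoder, no array in between.
  def encode_with(encoder)
    @scan.call(@code, encoder).output
  end

  # Compatibility path: tokens are only materialized when iterated.
  def each(&block)
    tokens.each(&block)
  end

  private

  def tokens
    @tokens ||= @scan.call(@code, SketchCollector.new).tokens
  end
end

proxy = SketchTokensProxy.new('a b')
streamed = proxy.encode_with(SketchUpcaseEncoder.new) # "A B"
listed = proxy.to_a # [["a", :word], [" ", :space], ["b", :word]]
```

Old code that iterates over the result still works through `each`, while a call like `.html` could take the streaming path and never build the array.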
There’s a good chance that this finds its way into 1.0, but I’d have to rewrite all the scanners, and possibly even break older plugin scanners. But hey, what’s a 1.0 release without some API deprecations?
Update: It’s committed, working, and faster :) It’ll be interesting to benchmark it against Licenser’s new clj-highlight library.