Wed 2010-01-06 ( En pr )

I’m trying out a new scanner/encoder concept for CodeRay that would (hopefully) make highlighting even faster while improving the code at the same time.

Basically, it’s about bypassing the Tokens representation altogether.

The tests can be seen at Odd-eyed code.

Here are my results so far:

$ ruby current.rb 
Scanning: 2.98s -- 370 kTok/s
Encoding: 3.14s -- 350 kTok/s
Together: 6.11s -- 180 kTok/s
$ ruby direct-stream.rb 
Scanning: 2.76s -- 399 kTok/s
Encoding: 3.16s -- 348 kTok/s
Together: 5.92s -- 186 kTok/s
Scanning + Encoding: 5.04s -- 218 kTok/s

$ ruby187 current.rb 
Scanning: 3.27s -- 336 kTok/s
Encoding: 2.96s -- 371 kTok/s
Together: 6.23s -- 176 kTok/s
$ ruby187 direct-stream.rb 
Scanning: 2.97s -- 371 kTok/s
Encoding: 2.85s -- 386 kTok/s
Together: 5.82s -- 189 kTok/s
Scanning + Encoding: 4.98s -- 221 kTok/s

$ ruby19 current.rb 
Scanning: 2.32s -- 474 kTok/s
Encoding: 3.47s -- 317 kTok/s
Together: 5.79s -- 190 kTok/s
$ ruby19 direct-stream.rb 
Scanning: 2.24s -- 491 kTok/s
Encoding: 2.70s -- 408 kTok/s
Together: 4.94s -- 223 kTok/s
Scanning + Encoding: 4.08s -- 270 kTok/s

$ jruby current.rb 
Scanning: 3.13s -- 351 kTok/s
Encoding: 1.59s -- 693 kTok/s
Together: 4.72s -- 233 kTok/s
$ jruby direct-stream.rb 
Scanning: 2.35s -- 469 kTok/s
Encoding: 1.68s -- 655 kTok/s
Together: 4.03s -- 273 kTok/s
Scanning + Encoding: 2.18s -- 504 kTok/s

“Scanning + Encoding” bypasses the Tokens intermediate representation. The scanner calls the encoder directly.
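To make the difference concrete, here is a minimal sketch of both pipelines. It is not CodeRay's actual code: `SketchEncoder`, `scan_to_tokens`, `encode_tokens`, `scan_direct`, and the toy pattern are all hypothetical names invented for illustration. The only things taken from this post are the `[:open, :string]` token shape and the `encoder.open(:string)` call style.

```ruby
# Hypothetical sketch, not CodeRay's real implementation.
# Toy "language": double-quoted strings, whitespace runs, everything else.
PATTERN = /"[^"]*"|\s+|[^\s"]+/

# Toy encoder: wraps opened groups in angle-bracket tags, passes text through.
class SketchEncoder
  attr_reader :output

  def initialize
    @output = String.new
  end

  def open(kind)
    @output << "<#{kind}>"
  end

  def close(kind)
    @output << "</#{kind}>"
  end

  def text_token(text, _kind)
    @output << text
  end
end

# Two-phase style, step 1: the scanner fills an intermediate token list.
def scan_to_tokens(code)
  tokens = []
  code.scan(PATTERN) do |match|
    if match.start_with?('"')
      tokens << [:open, :string]
      tokens << [match, :content]
      tokens << [:close, :string]
    else
      tokens << [match, match.strip.empty? ? :space : :ident]
    end
  end
  tokens
end

# Two-phase style, step 2: the encoder walks the token list afterwards.
def encode_tokens(tokens, encoder)
  tokens.each do |text, kind|
    case text
    when :open  then encoder.open(kind)
    when :close then encoder.close(kind)
    else             encoder.text_token(text, kind)
    end
  end
  encoder.output
end

# Direct style: the scanner calls the encoder as it goes, no token list.
def scan_direct(code, encoder)
  code.scan(PATTERN) do |match|
    if match.start_with?('"')
      encoder.open(:string)
      encoder.text_token(match, :content)
      encoder.close(:string)
    else
      encoder.text_token(match, match.strip.empty? ? :space : :ident)
    end
  end
  encoder.output
end
```

Both paths produce the same output for the same input; the direct one simply never allocates the array of token pairs, which is presumably where the saved time comes from.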

The awesome speedup in JRuby is strange; I don't think it's the final word on this. But some things are certain:

  • It’s possible to highlight tokens 20% to 50% faster without sacrificing the speed of a separate scanning or encoding phase.
  • MRI 1.8 < MRI 1.8.7 < YARV 1.9.2 < JRuby here when it comes to performance.
  • I like the new API. encoder.open(:string) looks nicer than tokens << [:open, :string].
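The speedup figures can be rechecked from the timings above. This sketch assumes the fair comparison is the "Together" time of current.rb against the "Scanning + Encoding" time of direct-stream.rb on the same interpreter; `RESULTS` and `speedup_percent` are names made up here.

```ruby
# Times in seconds, copied from the benchmark output in this post.
# :together comes from current.rb, :direct from the
# "Scanning + Encoding" line of direct-stream.rb (an assumed pairing).
RESULTS = {
  'ruby'    => { together: 6.11, direct: 5.04 },
  'ruby187' => { together: 6.23, direct: 4.98 },
  'ruby19'  => { together: 5.79, direct: 4.08 },
  'jruby'   => { together: 4.72, direct: 2.18 }
}

# Throughput gain in percent: the same token count in less time.
def speedup_percent(together, direct)
  ((together / direct - 1) * 100).round
end

RESULTS.each do |name, t|
  puts format('%-8s %3d%% faster', name, speedup_percent(t[:together], t[:direct]))
end
```

Under that assumption the three MRI/YARV runs land inside the 20% to 50% range, while the JRuby run comes out far above it, which fits the suspicion that the JRuby number is not the final word.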

I think it’s also possible to support the current interface (CodeRay.scan(:ruby, code).html) using a Tokens proxy like ActiveRecord’s find(:all).
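One way such a proxy could work, sketched under heavy assumptions: the object returned by the scan call remembers the scanner and the code, streams straight into an encoder when asked to encode, and only builds a real token list if somebody treats it like one. `LazyTokens`, `Recorder`, `BracketEncoder`, `WORD_SCANNER`, and `encode_with` are all hypothetical; none of them is CodeRay's API.

```ruby
# Hypothetical sketch of a lazy Tokens proxy, not CodeRay's real classes.

# Stand-in scanner: sends each whitespace run or word to the given sink.
WORD_SCANNER = lambda do |code, sink|
  code.scan(/\s+|\S+/) do |match|
    sink.text_token(match, match.strip.empty? ? :space : :ident)
  end
end

# Sink that records tokens, for callers that want the old list form.
class Recorder
  attr_reader :tokens

  def initialize
    @tokens = []
  end

  def text_token(text, kind)
    @tokens << [text, kind]
  end
end

# Toy encoder: puts brackets around everything that is not whitespace.
class BracketEncoder
  attr_reader :output

  def initialize
    @output = String.new
  end

  def text_token(text, kind)
    @output << (kind == :space ? text : "[#{text}]")
  end
end

class LazyTokens
  include Enumerable

  def initialize(scanner, code)
    @scanner = scanner
    @code = code
  end

  # Fast path: scanner talks to the encoder directly, no list is built.
  def encode_with(encoder)
    @scanner.call(@code, encoder)
    encoder.output
  end

  # Slow path: materialize the token list once, on first use.
  def to_a
    @tokens ||= begin
      recorder = Recorder.new
      @scanner.call(@code, recorder)
      recorder.tokens
    end
  end

  def each(&block)
    to_a.each(&block)
  end
end
```

With this shape, `LazyTokens.new(WORD_SCANNER, code).encode_with(BracketEncoder.new)` never touches an array, while `each`, `map`, or `count` on the same object still behave like the old token list.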

There’s a good chance that this finds its way into 1.0, but I have to re-write all the scanners, and possibly even break older plugin scanners. But hey, what’s a 1.0 release without some API deprecations?

Update: It’s committed, working, and faster :) It’ll be interesting to benchmark it against Licenser’s new clj-highlight library.

Say something!

remember: "5 scanner a year". if you are now rewriting all scanners that could be a hard thing to reach in 2010 (-;
bovi @ 10:29 on Thursday, 2010-01-07
Update. I meant I have to update the scanners ;)
murphy @ 13:29 on Thursday, 2010-01-07
I think the incredible increase in speed with JRuby comes from the warm-up time of the VM. Otherwise, scan+encode would be faster than scanning alone?
Licenser @ 20:18 on Friday, 2010-01-08
Yeah, definitely. But it's still spooky :)
murphy @ 02:52 on Saturday, 2010-01-09
