Sat 2008-09-20 ( pr En )

(German translation on Ruby-Mine.)

Now, I’m about to write the Java Scanner for CodeRay (ticket #42 btw), and I thought it would be nice to have a list of built-in types highlighted – like String or IllegalStateException. I knew that TextMate had quite good highlighting for Java, so I checked the bundle for something to use.

Indeed, some smart guy included a very long regular expression into the token definitions, it looks like this:

support-type-built-ins-java = {
  name = 'support.type.built-ins.java';
  match = '\b(R(GBImageFilter|MI(S(ocketFactory|e(curity(Manager|Exception)|
    rver(SocketFactory|Impl(_Stub)?)?))|C(onnect(ion(Impl(_Stub)?)?|
    or(Server)?)|l(ientSocketFactory|assLoader(Spi)?))|IIOPServerImpl|
    JRMPServerImpl|FailureHandler)|SA(MultiPrimePrivateCrtKey(Spec)?|
    ...this goes on for several screens...

Apparently, they converted a long list of types into a minimal regexp, surely using a script for this. But that wasn’t exactly what I had searched for. How could I convert this back into the original plain list?1

Recursive Descent Parser

What I needed was something like this:

srs = SimpleRegexpScanner.new('(A)?(B|C)')
srs.list  #=> ['AB', 'AC', 'B', 'C']

Something I remembered from my lecture about compiler construction three years ago were recursive descent parsers: It is possible to parse simple grammars (LL to be precise) with that approach, and the grammar of the regexp was simple enough.

So I wrote SimpleRegexpScanner in recursive form, of course in Ruby. After some refinement, the result became quite short:

class SimpleRegexpScanner < StringScanner
  1. Returns an Array of all possible strings that would fit the given regexp.
    def list
    scan_union.uniq
    end

protected
def scan_group # :nodoc:
scan(/\(/) or return
options = scan_union
scan(/\)/) or raise ‘) expected at end of group’
options << ’’ if scan(/\?/)
options
end

def scan_union # :nodoc: options = scan_concatenation options += scan_union if scan(/\|/) options.uniq end def scan_concatenation # :nodoc: options = scan_group || [scan(/[^(|)?]*/)] if check(/[^|)]/) suffixes = scan_concatenation options.map! do |option| suffixes.map { |suffix| option + suffix } end.flatten! end options end

end

It solved my problem: I can now restore the original word list from any regexp of this type in a matter of milliseconds :)

Download

I added tests and a bit of documentation, and you can play with it under LGPL as much as you want.

SimpleRegexpScanner

1 I need a list because CodeRay uses a Hash-based approach for word lists, which should be much faster than using even optimized regexps. In addition, Ruby 1.8 refused the regexp as “too big”, and TextMate has problems with long one-line strings anyway.

Say something! / Sag was!

No markup, just plain monospace text. / Kein Markup, nur Normschrift-Klartext.