Sat 2008-09-20 ( pr En )

(German translation on Ruby-Mine.)

Now, I’m about to write the Java Scanner for CodeRay (ticket #42 btw), and I thought it would be nice to have a list of built-in types highlighted – like String or IllegalStateException. I knew that TextMate had quite good highlighting for Java, so I checked the bundle for something to use.

Indeed, some smart guy included a very long regular expression into the token definitions, it looks like this:

support-type-built-ins-java = {
  name = 'support.type.built-ins.java';
  match = '\b(R(GBImageFilter|MI(S(ocketFactory|e(curity(Manager|Exception)|
    rver(SocketFactory|Impl(_Stub)?)?))|C(onnect(ion(Impl(_Stub)?)?|
    or(Server)?)|l(ientSocketFactory|assLoader(Spi)?))|IIOPServerImpl|
    JRMPServerImpl|FailureHandler)|SA(MultiPrimePrivateCrtKey(Spec)?|
    ...this goes on for several screens...

Apparently, they converted a long list of types into a minimal regexp, surely using a script for this. But that wasn’t exactly what I had searched for. How could I convert this back into the original plain list?1

Recursive Descent Parser

What I needed was something like this:

srs = SimpleRegexpScanner.new('(A)?(B|C)')
srs.list  #=> ['AB', 'AC', 'B', 'C']

Something I remembered from my lecture about compiler construction three years ago were recursive descent parsers: It is possible to parse simple grammars (LL(k) to be precise) with that approach, and the grammar of the regexp was simple enough.

So I wrote SimpleRegexpScanner in recursive form, of course in Ruby. After some refinement, the result became quite short:

class SimpleRegexpScanner < StringScanner
  
  # Returns an Array of all possible strings that would fit the given regexp.
  def list
    scan_union.uniq
  end
  
protected
  def scan_group  # :nodoc:
    scan(/\(/) or return
    options = scan_union
    scan(/\)/) or raise ') expected at end of group'
    options << '' if scan(/\?/)
    options
  end
  
  def scan_union  # :nodoc:
    options = scan_concatenation
    options += scan_union if scan(/\|/)
    options.uniq
  end
  
  def scan_concatenation  # :nodoc:
    options = scan_group || [scan(/[^(|)?]*/)]
    if check(/[^|)]/)
      suffixes = scan_concatenation
      options.map! do |option|
        suffixes.map { |suffix| option + suffix }
      end.flatten!
    end
    options
  end
  
end

It solved my problem: I can now restore the original word list from any regexp of this type in a matter of milliseconds :)

Download

I added tests and a bit of documentation, and you can play with it under LGPL as much as you want.

SimpleRegexpScanner

1 I need a list because CodeRay uses a Hash-based approach for word lists, which should be much faster than using even optimized regexps. In addition, Ruby 1.8 refused the regexp as “too big”, and TextMate has problems with long one-line strings anyway.

Say something! / Sag was!

No markup, just plain monospace text. / Kein Markup, nur Normschrift-Klartext.