Enumerating Regular Languages

A problem proposed by Jayadev Misra is allegedly a good test for qualities of a language, methodology, individual technique, etc. We can look at it as a good LanguageTestCase. Of course, being heavily algorithmic in nature most OO languages do not fare well with regards to EconomyOfExpression, and to my chagrin it's not even clear how SchemeLanguage can compete with HaskellLanguage and MlLanguage.

The problem is stated as follows: you have a regular language described by a given RegularExpression. Given a natural number N, enumerate the first N strings in that language in a modified lexicographical order: the strings are compared first on their length and two strings of the same length are considered in lexicographical order, with some ordering imposed on the characters of the alphabet (assume ASCII ordering, to keep the problem simpler).

The simplest idea is to generate a (optimized if possible) NondeterministicFiniteAutomaton? that recognizes the language, and emulate its functioning for an unknown input source (do it in reverse, the automaton unrolls the imaginary input source) while when reaching final states outputting the string that led to that final state. A nice Haskell solution can be picked up from DougMcIlroy page. It's an outstanding demonstration of the EconomyOfExpression qualities of HaskellLanguage.

Another brute force solution would be to enumerate all strings in lexicographical order (at least limit the alphabet to the characters present in the RE) while testing for matching and counting up to N. Of course that won't give you much elegance points (and don't try this at home for large expression or big N, you may be disappointed).

Sounds like a job involving lots of CoRoutines...


I've translated the version in the paper published at http://www.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf from HaskellLanguage to RubyLanguage. I've used the lazy lists library for Ruby to handle the lists (http://lazylist.rubyforge.org/). The code base length is just a little bit longer than the original and it's almost copied from there. -- AurelianoCalvo.

Usage is like this:

((("A".atom | "B".atom).repeat ) + "C!".atom).enum.take( 10 ) #enumerates the 10 first elements of the RegularExpression language (a|b)*C!.

The result is:

["C!", "AC!", "BC!", "AAC!", "ABC!", "BAC!", "BBC!", "AAAC!", "AABC!", "ABAC!"]

Below is the implementation:

 require 'lazylist'

class LazyList def + (other) self.merge( other ) do |a,b| a.length != b.length ? a.length < b.length : a < b end end def * (other) return list if self.empty? or other.empty? return list( self.head + other.head ) { (self.tail * LazyList[[other.head]]) + (self * other.tail) } end def closure return LazyList[[ "" ]] if self.empty? return closure( self.tail ) if self.head == "" return list( "" ) { self * closure } end end

class String def atom EnumRegLang::Atom.new(self) end end

module EnumRegLang

class RegLang def | (other) Option.new( self, other ) end def + (other) Concat.new( self, other ) end def repeat Repeat.new( self ) end end

VOID = RegLang.new

def VOID.enum list end

class Atom < RegLang def initialize( text ) @text = text end

def enum LazyList.from_enum [ @text ] end end

class Option < RegLang def initialize( lang1, lang2 ) @lang1 = lang1 @lang2 = lang2 end def enum @lang1.enum + @lang2.enum end end

class Concat def initialize( lang1, lang2 ) @lang1 = lang1 @lang2 = lang2 end def enum @lang1.enum * @lang2.enum end end

class Repeat < RegLang def initialize( lang ) @lang = lang end def enum @lang.enum.closure end end end


EditText of this page (last edited April 30, 2006) or FindPage with title or text search