Enumerating Regular Languages

A problem proposed by Jayadev Misra is allegedly a good test for qualities of a language, methodology, individual technique, etc. We can look at it as a good LanguageTestCase. Of course, being heavily algorithmic in nature most OO languages do not fare well with regards to EconomyOfExpression, and to my chagrin it's not even clear how SchemeLanguage can compete with HaskellLanguage and MlLanguage.

The problem is stated as follows: you have a regular language described by a given RegularExpression. Given a natural number N, enumerate the first N strings in that language in a modified lexicographical order: the strings are compared first on their length and two strings of the same length are considered in lexicographical order, with some ordering imposed on the characters of the alphabet (assume ASCII ordering, to keep the problem simpler).

The simplest idea is to generate a (optimized if possible) NondeterministicFiniteAutomaton? that recognizes the language, and emulate its functioning for an unknown input source (do it in reverse, the automaton unrolls the imaginary input source) while when reaching final states outputting the string that led to that final state. A nice Haskell solution can be picked up from DougMcIlroy page. It's an outstanding demonstration of the EconomyOfExpression qualities of HaskellLanguage.

Another brute force solution would be to enumerate all strings in lexicographical order (at least limit the alphabet to the characters present in the RE) while testing for matching and counting up to N. Of course that won't give you much elegance points (and don't try this at home for large expression or big N, you may be disappointed).

Sounds like a job involving lots of CoRoutines...

I've translated the version in the paper published at http://www.cs.utexas.edu/users/misra/Notes.dir/RegExp.pdf from HaskellLanguage to RubyLanguage. I've used the lazy lists library for Ruby to handle the lists (http://lazylist.rubyforge.org/). The code base length is just a little bit longer than the original and it's almost copied from there. -- AurelianoCalvo.

Usage is like this:

((("A".atom | "B".atom).repeat ) + "C!".atom).enum.take( 10 ) #enumerates the 10 first elements of the RegularExpression language (a|b)*C!.

The result is:

["C!", "AC!", "BC!", "AAC!", "ABC!", "BAC!", "BBC!", "AAAC!", "AABC!", "ABAC!"]

Below is the implementation:

 require 'lazylist'

 class LazyList
def + (other)
self.merge( other ) do 
|a,b| 
a.length != b.length ? a.length < b.length : a < b
end
end
def * (other)
return list if self.empty? or other.empty?
return list( self.head + other.head ) {
(self.tail * LazyList[[other.head]]) + (self * other.tail) }
end
def closure
return LazyList[[ "" ]] if self.empty?
return closure( self.tail ) if self.head == ""
return list( "" ) {
self * closure
}
end
 end

 class String
def atom
EnumRegLang::Atom.new(self)
end
 end

 module EnumRegLang

class RegLang
def | (other)
Option.new( self, other )
end
def + (other)
Concat.new( self, other )
end
def repeat
Repeat.new( self )
end
end

VOID = RegLang.new

def VOID.enum
list
end

class Atom < RegLang
def initialize( text )
@text = text
end

def enum
LazyList.from_enum [ @text ]
end
end

class Option < RegLang
def initialize( lang1, lang2 )
@lang1 = lang1
@lang2 = lang2
end
def enum
@lang1.enum + @lang2.enum
end
end

class Concat
def initialize( lang1, lang2 )
@lang1 = lang1
@lang2 = lang2
end
def enum
@lang1.enum * @lang2.enum
end
end

class Repeat < RegLang
def initialize( lang )
@lang = lang
end
def enum
@lang.enum.closure
end
end
 end