Regular Expression Set Arithmetic

A modest proposal ... or a request for pointers and info if someone's heard of this before ...

RegularExpressions are sets. Their contents are strings - also, in principle, decomposable to sets. Some regexes have a finite membership, many (most?) have an infinite membership. With me so far?

Well then why not implement the obvious set of operators for sets - intersection, union, subtraction, and most importantly subset? Why just membership (=~)?

Let's pick RubyLanguage for this because, hey, ruby's cool. Ruby regex syntax would overload, respectively, * + - and < operators [see OperatorOverloading]. I have a notion to employ < especially for a somewhat unusual form of MultiCaster ... but before reinventing wheels I thought it'd be worth finding out if anyone's heard of any such critter. -- PeterMerel.

Also worth noting that Ruby's Composable really should be based on < rather than <=> for the sake of poset representation. Yeah, I know the lineage, but still that's a sucky decision ... -- Pete.

Determining if one regex is the superset of another (i.e., determining if one regex accepts all strings accepted by another) is computationally very difficult, and may be undecideable. There's a good reason that RubyLanguage/PerlLanguage/etc. don't support those operations.

No, that is simply wrong. All the operations on regexes (or rather finite-automata) are theoretically very well understood, see e.g.,

The latter gives tight bounds for the complexity of the relevant constructions (e.g., n log n states for complement of an n state finite automaton). The algorithms for constructing, e.g., union and complement are standard in undergraduate studies of ComputerScience (at least at the university where I got my degree I had to do these manually as part of homework exercises). And all other operations (e.g., inclusion) can be trivially constructed from these two alone.

The algorithms for negation and dissection are definitely not easy to implement, but that's really no excuse, that they are missing from most standard packages. But I bet there are libraries, that provide them. I'd really really like to use regex libraries, that support negation and intersection, because that would make some lexer much much easier to specify (example: string literals with embedded escapes). -- GunnarZarncke

But remember: RegularExpressionsArent---in other words, the "regex" feature in most languages trivially allows the constructions of automata which are more powerful than FiniteStateMachines. Many regex packages are (I think) TuringComplete, in which case the problem being discussed is most definitely undecideable.

I agree that TuringComplete regexes will kill the arithmetic suggested. But classical ones will do fine, and that's sufficient. OTOH ExtendedSetTheory seems more appropriate for the use contemplated by SymphonicArchitecture, so no worries. Thanks all for the helpful discussion. -- Pete


I'm familiar with a technique (applied to tables, actually) where the resolution of set inclusion/exclusion/intersection is a matter of run-time semantics. The set selection for term (B) is performed on the results of term (A), rather than on the entire set. I'm sure you know the drill.

It seems to me that something similar could be done with regex, so that the syntax is supported which means "resolve the first regex, then use that result as input to the next regex." The notation in abstract would look pretty normal (whatever that is), but rather than engage in trying to pre-compute what xY * xZ might mean, you'd resolve it by applying xZ(xY).

That is, if I grasp what you're asking.

If you're looking for a means to pre-compute the evaluation domain from two (or more) input regexes that seems really ambitious.

-- GarryHamilton


See Also: AlternativesToRegularExpressions


EditText of this page (last edited July 11, 2006) or FindPage with title or text search