RegularExpressions as implemented by many regex packages often describe languages that are not regular, but are instead as powerful as any context-free language (even if that power is not necessarily easy to use).
This means that RegularExpressionSetArithmetic is ironically not implementable in these more-powerful packages (or at least not efficiently so).
The following discussion originally stated this, but seems to focus on PerlLanguage and mixes up NFA and context-free.
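To make the set-arithmetic point concrete, here is a minimal Python sketch, assuming a hand-rolled DFA representation (the DFA class, the two example automata, and the function names below are made up for this page, not taken from any regex package). Intersection of two regular languages is just the textbook product construction on their DFAs; once a package adds backreferences and similar extensions, there is no comparable construction to fall back on.

  from dataclasses import dataclass

  @dataclass
  class DFA:
      states: frozenset
      alphabet: frozenset
      delta: dict        # maps (state, symbol) -> state
      start: object
      accepting: frozenset

      def accepts(self, text):
          state = self.start
          for ch in text:
              state = self.delta[(state, ch)]   # one table lookup per character: O(n)
          return state in self.accepting

  def intersect(a, b):
      # Product construction: the states of the new DFA are pairs of states.
      states = frozenset((p, q) for p in a.states for q in b.states)
      delta = {((p, q), ch): (a.delta[(p, ch)], b.delta[(q, ch)])
               for (p, q) in states for ch in a.alphabet}
      accepting = frozenset(pq for pq in states
                            if pq[0] in a.accepting and pq[1] in b.accepting)
      return DFA(states, a.alphabet, delta, (a.start, b.start), accepting)

  # "even number of a's" intersected with "ends in b", over the alphabet {a, b}
  even_a = DFA(frozenset({0, 1}), frozenset("ab"),
               {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
               0, frozenset({0}))
  ends_b = DFA(frozenset({"i", "f"}), frozenset("ab"),
               {("i", "a"): "i", ("i", "b"): "f", ("f", "a"): "i", ("f", "b"): "f"},
               "i", frozenset({"f"}))
  both = intersect(even_a, ends_b)
  print(both.accepts("aab"))   # True: two a's, ends in b
  print(both.accepts("ab"))    # False: odd number of a's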
Regular Expressions Are Not:
- Regular
- Pedantic note: some recent applications using RegularExpressions (e.g., PerlLanguage 5+) have added features that make them non-regular. Not that anyone actually using them cares.
- Context-Free
- Actually, some regular expressions aren't even context-free. And for some applications, people should care.
- ProperlyDfaBased?
- A proper DFA-based RegularExpression search will run in O(n) time, where n is the length of the string being examined. So-called NFA engines (which aren't non-deterministic finite automata at all) can often take exponential time or worse (a timing sketch follows this exchange). In most cases, this doesn't matter...
- This is a self-contradiction. Notice it can be paraphrased "NFA engines aren't NFAs". So what do you really mean? What is it that is not an NFA? Implementing REs as first-match doesn't preclude them from being NFAs, for instance.
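For concreteness, here is a small Python sketch of that performance point, assuming Python's re module as a representative backtracking engine (the pattern, the input sizes, and the doubling claim in the comment are illustrative, not taken from any particular application). The pattern (a+)+$ forces a backtracking engine to try exponentially many ways of splitting a run of a's before it can report failure, while a table-driven DFA would reject the same input in one left-to-right pass.

  import re
  import time

  # Nested repetition: on a failed match, the backtracking engine retries
  # every way the a's can be split between the inner and outer + loops.
  pattern = re.compile(r"^(a+)+$")

  for n in range(16, 23, 2):
      text = "a" * n + "b"          # the trailing b guarantees the match fails
      start = time.perf_counter()
      assert pattern.match(text) is None
      elapsed = time.perf_counter() - start
      # Each extra 'a' roughly doubles the running time on a backtracking engine.
      print(f"n={n:2d}  {elapsed:.3f}s")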
What can and cannot be handled
What it means is this: so-called "NFA engines" do not use non-deterministic finite automata in their implementation (indeed, as we cannot build an NFA). They use computational models that approach the power of TuringMachines, and as such they can handle things that a true regex cannot handle.
The following grammars cannot be handled by a true regular expression:
EssExpression ::= Atom | ( EssExpression* )
- It's a ContextFreeGrammar?, and requires a pushdown automaton. (You can cheat and add a counter to a DFA to count nesting of parens, but if you do that you no longer have a true DFA; a sketch of this counter trick appears after the grammar examples below.)
EqualLengthAbc ::= A^n B^n C^n
In other words, some number of A followed by an equal number of B and an equal number of C. That grammar is context-sensitive, and requires a linearly-bounded TuringMachine to parse.
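As a concrete (if informal) illustration, here is a Python sketch of recognizers for both grammars (the function names are made up for this page). The first is the counter cheat mentioned above: a DFA plus one unbounded counter, which is no longer a finite automaton. The second has to compare three unbounded counts, which no finite set of states can do for arbitrary n.

  def balanced_parens(text):
      # EssExpression: atoms and balanced parentheses.  A single unbounded
      # counter tracks nesting depth; a true DFA has only finitely many
      # states, so it cannot count arbitrarily deep nesting.
      depth = 0
      for ch in text:
          if ch == "(":
              depth += 1
          elif ch == ")":
              depth -= 1
              if depth < 0:          # closing paren with nothing open
                  return False
      return depth == 0

  def equal_length_abc(text):
      # EqualLengthAbc: a^n b^n c^n.  Checking that the three runs have equal
      # length means remembering n, which again exceeds any finite state set.
      n = len(text) // 3
      return n > 0 and text == "a" * n + "b" * n + "c" * n

  print(balanced_parens("(x(y)z)"))   # True
  print(balanced_parens("(()"))       # False
  print(equal_length_abc("aabbcc"))   # True
  print(equal_length_abc("aabbc"))    # False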
What PerlLanguage handles easily
PerlLanguage, and the similar regex syntax found in many other languages, can handle these easily. The engines that they use are full backtracking TuringMachines.
- This is factually incorrect. Apparently you are speculating about what they do.
- Hmm. Are you saying that Perl doesn't do backtracking? I thought it did (when appropriate).
- It backtracks sometimes, but only to implement NFAs, not in a way that is more powerful. The "TuringMachine" part is what is incorrect.
- Basically, when people say "NFA", they are commonly being a bit mathematically sloppy. In theory, there is no difference in functionality between an NFA and a DFA (only in their time/space complexities). In practice, people have often implemented NFAs as first-match, whereas a true NFA attempts all possible matches in parallel. But although that's different, it's not more powerful.
- Perl allows selecting either DFA or that pseudo-NFA behavior. It also has some mild context-free extensions to allow reference to labels, but in practice this has nowhere near the expressive power of e.g. Yacc, even though it's theoretically context-free.
- This part of Perl is in a heavy state of flux, and my comments do not apply to whatever the latest version of Perl 6 specifies will someday be implemented; they're cleaning it up and making it more powerful.
Plus, arbitrary Perl code can be specified for the actions of a Perl regex; and Perl is obviously TuringComplete.
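Here is a small Python sketch of the two behaviors discussed above, assuming Python's Perl-derived re module as a stand-in (the patterns and strings are only illustrative). Leftmost-first alternation shows the "pseudo-NFA" ordering, and a backreference pattern recognizes the copy language {ww}, which is not even context-free; so whatever these engines are, they are not finite automata.

  import re

  # Leftmost-first ("pseudo-NFA") alternation: the first alternative that can
  # match wins, even though a longest-match (parallel / POSIX-style) engine
  # would return "ab" here.
  print(re.match(r"a|ab", "ab").group())           # prints "a"

  # A backreference makes the pattern match the copy language: some string
  # followed by an exact repeat of itself.  That language is not context-free,
  # let alone regular.
  copy_language = re.compile(r"^(.+)\1$")
  print(bool(copy_language.match("hohoho" * 2)))   # True: "hohohohohoho" is ww
  print(bool(copy_language.match("hohohoho!")))    # False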
More on Context-Free
Your explanation is kind of wandering all over the map. If you add labelled rules to regular expressions, and the ability to reference them within rules, then you get extended BNF, which of course is context-free.
- Agreed. Part of the problem (which this page hopes to clarify) is that the term "regex" is overloaded. To a mathematician or a theoretical computer scientist, it refers to a grammar that can be accepted by a finite automaton (deterministic or otherwise; they are equivalent in computational power). Virtually none of the regex packages available limit you to finite-state-machine-type regular expressions. You are absolutely correct that extended BNF gives you context-free grammars, not regular grammars.
- Virtually all of those packages allow reference to labels for context-free power? I don't think so. You probably just mean that it's common for them to offer "NFA"s that do first match rather than parallel match, which is still not context-free, as I just mentioned above.
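To make the EBNF point concrete, here is a minimal Python sketch: a hand-written recursive-descent recognizer for the EssExpression grammar given earlier on this page (the function names and the definition of Atom as a single non-paren character are assumptions made for this sketch). Once rules can name and call one another, recursion gives you a stack, and a stack is exactly the extra machinery that separates context-free languages from regular ones.

  # Recognizer for:  EssExpression ::= Atom | "(" EssExpression* ")"
  # Each labelled rule becomes a function; the call stack plays the role of
  # the pushdown automaton's stack.

  def parse_expression(text, pos=0):
      """Return the position just past one EssExpression, or None on failure."""
      if pos < len(text) and text[pos] not in "()" and not text[pos].isspace():
          return pos + 1                        # Atom: a single non-paren character
      if pos < len(text) and text[pos] == "(":
          pos += 1
          while pos < len(text) and text[pos] != ")":
              nxt = parse_expression(text, pos) # recursive call = push on the stack
              if nxt is None:
                  return None
              pos = nxt
          if pos < len(text) and text[pos] == ")":
              return pos + 1
      return None                               # neither an Atom nor a paren list

  def is_ess_expression(text):
      return parse_expression(text) == len(text)

  print(is_ess_expression("(a(bc)d)"))   # True
  print(is_ess_expression("(a(b)"))      # False: unbalanced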
MuddlingAround?
Chomsky's 4 levels of power are not this complicated. Saying "regular expressions aren't" just muddles the subject.
Also, your terminology is incorrect when you say that "NFA engines" do not use non-deterministic finite automata in their implementation because "we cannot build an NFA". No - you are mixing up the term "non-deterministic" with its use in other areas of math. In the mathematical theory of grammars, it is completely standard to talk about "non-deterministic finite automata", and NFA is simply an abbreviation for that!
The comment you responded to was a complaint I was making some months back about wording. I understand the topic. The wording at the top of the page is muddled.
See also: AlternativesToRegularExpressions
CategoryRegularExpressions