Take a regex AST and produce an NFA. Except where noted, the Thompson-McNaughton-Yamada algorithm is used. Reference: http://stackoverflow.com/questions/11819185/steps-to-creating-an-nfa-from-a-regular-expression
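The construction can be sketched as follows. This is a minimal illustration, not the module's actual code: the tuple-based AST shape ("lit", "cat", "alt", "star") and the names thompson and accepts are assumptions made for this example only.

```python
from itertools import count

EPS = None  # label used for epsilon transitions (assumption of this sketch)


def thompson(ast, ids=None, trans=None):
    """Build an NFA fragment for a hypothetical tuple-based regex AST.

    Returns (start, accept, transitions), where transitions is a list of
    (source, label, destination) triples shared by all fragments.
    """
    if ids is None:
        ids, trans = count(), []
    start, accept = next(ids), next(ids)
    kind = ast[0]
    if kind == "lit":  # single character
        trans.append((start, ast[1], accept))
    elif kind == "cat":  # juxtaposition of two sub-expressions
        s1, a1, _ = thompson(ast[1], ids, trans)
        s2, a2, _ = thompson(ast[2], ids, trans)
        trans += [(start, EPS, s1), (a1, EPS, s2), (a2, EPS, accept)]
    elif kind == "alt":  # disjunction of two sub-expressions
        s1, a1, _ = thompson(ast[1], ids, trans)
        s2, a2, _ = thompson(ast[2], ids, trans)
        trans += [(start, EPS, s1), (start, EPS, s2),
                  (a1, EPS, accept), (a2, EPS, accept)]
    elif kind == "star":  # zero or more repetitions
        s1, a1, _ = thompson(ast[1], ids, trans)
        trans += [(start, EPS, s1), (start, EPS, accept),
                  (a1, EPS, s1), (a1, EPS, accept)]
    else:
        raise ValueError(f"unknown node kind: {kind!r}")
    return start, accept, trans


def accepts(nfa, text):
    """Simulate the NFA on text by tracking the set of reachable states."""
    start, accept, trans = nfa

    def closure(states):
        # Follow epsilon transitions until no new state is found.
        stack, seen = list(states), set(states)
        while stack:
            state = stack.pop()
            for src, label, dst in trans:
                if src == state and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({start})
    for ch in text:
        moved = {dst for src, label, dst in trans
                 if src in current and label == ch}
        current = closure(moved)
    return accept in current


# a(b|c)* written in the hypothetical AST form
nfa = thompson(("cat", ("lit", "a"),
                ("star", ("alt", ("lit", "b"), ("lit", "c")))))
```

Each node gets its own start and accept state, joined to its children by epsilon transitions, which keeps every case independent of the others at the cost of a few extra states.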
Regular expressions can have character classes and wildcards. To produce an NFA, these must be expanded into disjunctions. For wildcards and negated character classes, the complete alphabet must also be known to produce the expansion:
Example transformations with alphabet: abcdefgh
[abc] -> a|b|c
[^abc] -> d|e|f|g|h
def[^abc] -> def(d|e|f|g|h)
. -> a|b|c|d|e|f|g|h
abc. -> abc(a|b|c|d|e|f|g|h)
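A minimal sketch of this expansion over an explicit alphabet. The item encoding ("class", "negclass", "any") and the function name expand are hypothetical, chosen only for this illustration.

```python
def expand(item, alphabet):
    """Expand one class-like item into a disjunction over a known alphabet.

    item is a hypothetical encoding: ("class", chars) for [chars],
    ("negclass", chars) for a negated class, or ("any",) for the wildcard.
    """
    kind = item[0]
    if kind == "class":
        chars = [c for c in alphabet if c in item[1]]
    elif kind == "negclass":
        chars = [c for c in alphabet if c not in item[1]]
    elif kind == "any":
        chars = list(alphabet)
    else:
        raise ValueError(f"unknown item kind: {kind!r}")
    return "|".join(chars)


ALPHABET = "abcdefgh"
positive = expand(("class", "abc"), ALPHABET)
negated = expand(("negclass", "abc"), ALPHABET)
wildcard = expand(("any",), ALPHABET)
```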
As the alphabet can be huge (Unicode, for example), something must be done to reduce the number of disjunctions:
[abc] -> a|b|c
[^abc] -> <other_char>
def[^abc] -> def(d|e|f|<other_char>)
. -> <other_char>
abc. -> abc(a|b|c|<other_char>)
Where <other_char> is a special metacharacter that matches any of the characters of the alphabet not present in the regex. Note that with this technique knowing the whole alphabet explicitly is not needed.
Care must be taken when the regex is meant to be used for an operation with another regex (such as intersection or difference). In this case, <other_char> must match only the characters present in neither regex. Example:
Regex space: [abc] and [^cd]
Characters present in any regex: abcd
[abc] -> a|b|c
[^cd] -> a|b|<other_char>
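The same idea can be sketched with the set of characters present in the regex space passed in explicitly. The item encoding and the name expand_with_other are hypothetical; only the <other_char> metacharacter comes from the description above.

```python
OTHER = "<other_char>"


def expand_with_other(item, present):
    """Expand an item using only the characters present in the regex space.

    present holds every character that appears literally in any regex taking
    part in the operation; everything else is covered by OTHER.
    item uses a hypothetical encoding: ("class", chars), ("negclass", chars)
    or ("any",).
    """
    ordered = sorted(set(present))
    kind = item[0]
    if kind == "class":
        chars = [c for c in ordered if c in item[1]]
    elif kind == "negclass":
        chars = [c for c in ordered if c not in item[1]] + [OTHER]
    elif kind == "any":
        chars = ordered + [OTHER]
    else:
        raise ValueError(f"unknown item kind: {kind!r}")
    return "|".join(chars)


# Regex space: [abc] and a class negating c and d; present characters: abcd
first = expand_with_other(("class", "abc"), "abcd")
second = expand_with_other(("negclass", "cd"), "abcd")
```

Because both expansions draw on the same set of present characters, <other_char> stands for the same set of characters in both automata, which is what makes a later intersection or difference meaningful.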
A meta regular expression is the intersection or subtraction of two other (meta or simple) regular expressions. Lookaround constructions are transformed into equivalent meta regular expressions for processing.
A(?=B)C is transformed into AC ∩ AB.*
A(?!B)C is transformed into AC - AB.*
In the case of more than one lookaround, the transformation is applied recursively.
This works only if A has a known, fixed length, so that both operands split the input at the same position.
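The equivalence can be checked by brute force with Python's re module, by enumerating every short string over a small alphabet. This is only a sanity check under assumed example patterns (the concrete choices for A, B and C and the helper name language are illustrative), not part of the transformation itself.

```python
import re
from itertools import product


def language(pattern, alphabet="abc", max_len=4):
    """Return all strings up to max_len over alphabet fully matched by pattern."""
    compiled = re.compile(pattern)
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    }


# Fixed-length prefix: the transformation agrees with the lookahead.
A, B, C = "a", "b", "[bc]+"
assert language(f"{A}(?={B}){C}") == language(A + C) & language(f"{A}{B}.*")
assert language(f"{A}(?!{B}){C}") == language(A + C) - language(f"{A}{B}.*")

# Variable-length prefix: "abc" matches AC as a+bc and AB.* as ab+c,
# but no single split of "abc" satisfies the lookahead version.
A2, B2, C2 = "(?:a|ab)", "c", "bc"
with_lookahead = language(f"{A2}(?={B2}){C2}")
transformed = language(A2 + C2) & language(f"{A2}{B2}.*")
```

In the second case the two operands of the intersection are free to end A at different positions, which is exactly what a fixed-length prefix rules out.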
Only top-level lookarounds that are part of a juxtaposition are permitted, i.e. they are not allowed inside parentheses, nested in another lookaround, or as members of a disjunction. Examples:
Allowed:
A(?!B)C
(?!B)C

Not allowed:
(?!B)|B         part of a disjunction
(?!(?!B))       lookaround inside lookaround
(A(?!B))B       lookaround inside parentheses
A+(?!B)C        lookaround with variable-length prefix
NOTE: Only lookahead is currently implemented.