Regular Expression Implementation

Written by Adrian Stoll. This notebook is a first draft, please contact me at adrs at umich.edu if you find any mistakes. It contains working code for a regular expression matcher along with explantions of basic concepts relating to regular expressions, languages, parsers, and finite automaton. Hope you enjoy!

Regular expressions are a language for describing patterns in strings. They are useful for all sorts of pattern matching problems such as text search and input validation.

Basic Concepts

Alphabets

An alphabet is a set of characters from which strings are created. Typically an alphabet is denoted $\Sigma$. Examples include binary ($\Sigma = \{0, 1\}$), ASCII ($\Sigma = \{0x00, 0x01, ... , a, b, c, .. 0x7F\}$, and letters ($\Sigma = \{a, b, c, ..., x, y, z\}$).

Sentences

A sentence is a string made of characters from an alphabet. For example 01011 is a sentence from the binary alphabet. The emply string is traditionally denoted by $\epsilon$.

Languages

$\Sigma*$ is the set of possible sequences of characters from alphabet $\Sigma$. A languge is a subset of all the possible strings which can be created from an alphabet. Formally, any set $L \subset \Sigma$ is a language. For example $\{0, 1, 00, 11, 000, 010, 101, 111, ...\} = \{$ all binary palindromes $\} \subset \{0, 1\}*$ is a language over the binary alphabet.

Regular Expressions

Regular expressions are patterns that represent certain strings. Strings matching the pattern of regular expression r form a language denoted L(r).

Rules

The rules for what strings match a regular expression are recursively defined below.

The bases definitions are:

  • If $r = \epsilon$ then $L(r) = \{\}$
  • If $r = a$ for $a \in \Sigma$, then $L(r) = \{a\}$

The recursive defintions are:

  • If $r$ is a regular expression then $s = (r)$ is a regular expression and $L(s) = L(r)$
  • If $r$ and $s$ are regular expressions then $t = (r)(s)$ is a regular expression. $L(t) = \{x : x = yz \land y \in L(r) \land z \in L(s)$
  • If $r$ and $s$ are regular expressions then $t = (r)|(s)$ is a regular expression with $L(t) = L(r) \cup L(s)$
  • If $r$ is a regular expression then $s = (r)*$ is a regular expression defining $L(s) = \{x_1x_2...x_n : x_i \in L(r)\}$
In [225]:
import re

# I will restrict the alphabet to the following characters to keep the implementation simple
# To handle arbitrary characters one would have to implment escape sequences for '|', '(', ')', and '*'
alphabet = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

examples = [
    # Empty regular expression only matches the empty string
    ('', '', True),
    ('', 'a', False),
    # Character literals match themselves
    ('a', 'a', True),
    ('a', 'b', False),
    # . - matches any character
    # For simplicity I will won't implement this
    # "." can is shorthand for "a|b|c|d|e|f..."
    # ('.', 'a', True),
    # ('.', 'b', True),
    # Concatentation
    # <regex 1><regex 2> - matches strings XY where X matches <regex 1> and Y matches <regex 2>
    ('abc', 'abc', True),
    ('abc', 'cab', False),
    ('abc', 'aba', False),
    # Kleene star - matches any number of occurances
    # <regex>* - matches strings A_1A_2A_3...A_n where each A_i matches <regex>
    ('a*', '', True),
    ('a*', 'a', True),
    ('a*', 'aaaaaa', True),
    ('a*', 'bbb', False),
    # OR
    # <regex 1>|<regex 2> - matches strings X where X matches <regex 1> or <regex 2>
    ('a|b', 'a', True),
    ('a|b', 'b', True),
    ('a|b', 'c', False),
    # Parenthesis
    ('(a|b)*', 'aabbabab', True),
    ('()', '', True),
    ('()', 'a', False),
    ('(a|b)*', 'aabbcbab', False)
]

def standardFullMatch(pattern, s):
    'Implementation from the Python standard library'
    return bool(re.fullmatch(pattern, s))

def checkMatcher(examples, matcher):
    'Tests a matching function'
    for pattern, s, matches in examples:
        m = matcher(pattern, s)
        assert(m == matches)
        print('matches("{}", "{}") -> {}'.format(pattern, s, m))

checkMatcher(examples, standardFullMatch)
matches("", "") -> True
matches("", "a") -> False
matches("a", "a") -> True
matches("a", "b") -> False
matches("abc", "abc") -> True
matches("abc", "cab") -> False
matches("abc", "aba") -> False
matches("a*", "") -> True
matches("a*", "a") -> True
matches("a*", "aaaaaa") -> True
matches("a*", "bbb") -> False
matches("a|b", "a") -> True
matches("a|b", "b") -> True
matches("a|b", "c") -> False
matches("(a|b)*", "aabbabab") -> True
matches("()", "") -> True
matches("()", "a") -> False
matches("(a|b)*", "aabbcbab") -> False

Operator Precedence

What is operator precedence? Operator precedence is the answer to the question does a @ b # c = a @ (b # c) or does a @ b # c = (a @ b) # c? Note: @ and # are arbitrary binary operators here.

The order of precedence from highest to lowest is: () anything in parenthesis, * Kleen star, concatenation, then | or. The examples below illistrates how operator precedence manifests.

In [226]:
precedenceExamples = [
    ### Kleen star
    # Kleen star has higher precedence than OR
    ('a|b*', 'bbb', True),
    # If OR had higher precedence then this would match
    ('a|b*', 'aba', False),

    # Kleen star has higher precedence than concatenation
    ('ab*', 'abbb', True),
    ('ab*', 'a', True),
    # If concatenation had higher precedence the following results would be matches
    ('ab*', 'abababab', False),
    ('ab*', '', False),
    
    ### Concatenation
    # Concatenation has higher precedence than OR
    ('abc|def', 'abc', True),
    # If OR had higher precedence the following would match
    ('abc|def', 'abcef', False),
    
    # Why parenthesis are needed
    ('abc*', 'abcabcabc', False),
    ('(abc)*', 'abcabcabc', True),

    ('abc*', '', False),
    ('(abc)*', '', True),

    ('abc*', 'abccc', True),
    ('(abc)*', 'abccc', False),

    ('abc*', 'ab', True),    
    ('a(bc)*', 'abcbc', True),
    ('a(bc)*', 'a', True)
]

checkMatcher(precedenceExamples, standardFullMatch)
matches("a|b*", "bbb") -> True
matches("a|b*", "aba") -> False
matches("ab*", "abbb") -> True
matches("ab*", "a") -> True
matches("ab*", "abababab") -> False
matches("ab*", "") -> False
matches("abc|def", "abc") -> True
matches("abc|def", "abcef") -> False
matches("abc*", "abcabcabc") -> False
matches("(abc)*", "abcabcabc") -> True
matches("abc*", "") -> False
matches("(abc)*", "") -> True
matches("abc*", "abccc") -> True
matches("(abc)*", "abccc") -> False
matches("abc*", "ab") -> True
matches("a(bc)*", "abcbc") -> True
matches("a(bc)*", "a") -> True

Associativity

What is associativity? Associativity is the answer to the question does a @ b @ c = (a @ b) @ c or a @ b @ c = a @ (b @ c)? If a value has the operator @ on both sides of it in an expression, associativity specifies which side "claims" the value.

Subtraction is an example of a left associative operation since a - b - c = (a - b) - c. Exponentiation is an example of a right associative operation since a^b^c = a^(b^c). For fully associative operations such as addition (a + b) + c = a + b + c = a + (b + c) and the operation is both left and right associative.

All the regular expression operators introduced above are left associative.

In [227]:
associativityExamples = [
    ('a*b*c', 'c', True),
    ('a*b*c', 'aaac', True),
    ('a*b*c', 'bc', True),
    ('a*b*c', 'aabbbc', True),
    # If '*' applied to the right side the following would match
    ('a*b*c', 'a', False),
    ('a*b*c', 'accc', False),
    ('a*b*c', 'abbbb', False),
    ('a*b*c', 'abbbcc', False)
]
checkMatcher(associativityExamples, standardFullMatch)
matches("a*b*c", "c") -> True
matches("a*b*c", "aaac") -> True
matches("a*b*c", "bc") -> True
matches("a*b*c", "aabbbc") -> True
matches("a*b*c", "a") -> False
matches("a*b*c", "accc") -> False
matches("a*b*c", "abbbb") -> False
matches("a*b*c", "abbbcc") -> False

Regular Expression Parse Trees

In [228]:
class Node:
    'Represents a node in the parse tree'
    # All nodes have the following field and method:
    # level - stores prededence information to inform us of when to add parenthesis when pretty pritting
    # __repr__ - prints the regex represented by the parse tree with with any uneccearry parens removed
    pass

class Epsilon(Node):
    'Represents an empty regular expression'
    def __init__(self):
        self.level = 0
    def __repr__(self):
        return ''

class Character(Node):
    def __init__(self, c):
        self.c = c
        self.level = 0
    def __repr__(self):
        return self.c

class KleenStar(Node):
    def __init__(self, r):
        self.r = r
        self.level = 1
    def __repr__(self):
        if self.r.level > self.level:
            return '({})*'.format(self.r)
        else:
            return '{}*'.format(self.r)

class Concatenate(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.level = 2
    def __repr__(self):
        def formatSide(side):
            if side.level > self.level:
                return '({})'.format(side)
            else:
                return '{}'.format(side)
        return formatSide(self.a) + formatSide(self.b)

class Or(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b
        self.level = 3
    def __repr__(self):
        def formatSide(side):
            if side.level > self.level:
                return '({})'.format(side)
            else:
                return '{}'.format(side)
        return formatSide(self.a) + '|' + formatSide(self.b)

# Char and (), *, concat, |
In [229]:
representationTests = [
    (Epsilon(), ''),
    (Character('a'), 'a'),
    (KleenStar(Concatenate(Character('a'), Character('b'))), '(ab)*'),
    (Concatenate(KleenStar(Character('a')), Character('b')), 'a*b'),
    (Concatenate(Character('a'), KleenStar(Character('b'))), 'ab*'),
    (Concatenate(Concatenate(Character('a'), Character('b')), Character('c')), 'abc'),
    (Concatenate(Concatenate(Character('a'), KleenStar(Character('b'))), Character('c')), 'ab*c'),
    (Or(Character('a'), Character('b')), 'a|b'),
    (Or(KleenStar(Character('a')), Character('b')), 'a*|b'),
    (KleenStar(Or(Character('a'), Character('b'))), '(a|b)*'),
    (Or(Character('a'), KleenStar(Character('b'))), 'a|b*'),
    (Or(Character('a'), Epsilon()), 'a|'),
    (Concatenate(Or(Character('a'), Epsilon()), KleenStar(Character('b'))), '(a|)b*')
]

for tree, representation in representationTests:
    assert(tree.__repr__() == representation)

Regular Expression Grammar

Grammers

TODO: rewrite

A grammar consists of terminals and non-terminals. A terminal is an element of the grammer which is a token/character. A non-terminal is an element of the grammer made of a sequence of other non-terminals and terminals.

For our grammer the terminals will be characters in the alphabet, '(', ')', '|', '.', and '*'.

Regular Expression Grammar With Left Recusion

regex -> regex '|' term | term
term -> term factor | factor
factor -> atom | atom '*'
atom -> '(' regex ')' | character | epsilon
character -> 'a' | 'b' | 'c' | ... | 'z'

This grammar is unsuitable for recursive descent parsing because it has left recursion. Left recursion is when a grammar has productions of the form A -> Ax which would cause infinite loops in a recursive descent parser. The productions term -> term factor and term -> term factorare the problems with this grammar.

Luckly there is a procedure for removing left recursion from grammars.

Regular Expression Grammar Without Left Recusion

regex -> term regexRest
regexRest ->  '|' term regexRest | espilon
term -> factor termRest
termRest -> factor termRest | epsilon
factor -> atom | atom '*'
atom -> '(' regex ')' | character | epsilon
character -> 'a' | 'b' | 'c' | ... | 'z'
In [230]:
class Parser:
    def __init__(self, s):
        # Input string
        self.s = s
        # Position of parsing in input string
        self.i = 0
        # Current character parser is looking at
        self.lookahead = None
        self.get()

    def get(self):
        'Move to next input character'
        if self.i >= len(self.s):
            self.lookahead = None
        else:
            self.lookahead = self.s[self.i]
        self.i += 1

    def match(self, c):
        'Check value of current character and advance'
        if self.lookahead != c:
            raise ValueError('index %d: expected %r not %r ' % (self.i, c, self.lookahead))
        self.get()

    def regex(self):
        r = self.term()
        while self.lookahead == '|':
            self.match('|')
            r = Or(r, self.term())
        return r

    def term(self):
        t = self.factor()
        while self.lookahead == '(' or (self.lookahead and self.lookahead in alphabet):
            t = Concatenate(t, self.factor())
        return t

    def factor(self):
        f = self.atom()
        if self.lookahead == '*':
            self.match('*')
            f = KleenStar(f)
        return f

    def atom(self):
        if self.lookahead == '(':
            self.match('(')
            a = self.regex()
            self.match(')')
            return a
        elif self.lookahead and self.lookahead in alphabet:
            return self.character()
        elif not self.lookahead or self.lookahead == ')' or self.lookahead == '|':
            return Epsilon()
        else:
            raise ValueError('index %d: invalid character %r, expected atom' % (self.i, self.lookahead))

    def character(self):
        c = Character(self.lookahead)
        self.match(self.lookahead)
        return c
    
    def parse(self):
        r = self.regex()
        # Check that there are no characters left
        self.match(None)
        return r

def parseRegex(s):
    return Parser(s).parse()
In [231]:
# Input regular expression, regular expression with unecessary parens removed
parseExamples = [
    ('a', 'a'),
    ('(a)', 'a'),
    ('((a))', 'a'),
    ('a*', 'a*'),
    ('(a)*', 'a*'),
    ('aa', 'aa'), 
    ('abc', 'abc'),
    ('a*bc','a*bc'),
    ('(ab)*', '(ab)*'),
    ('ab*(c*d)*', 'ab*(c*d)*'),
    ('(a|b)*abb','(a|b)*abb'),
    ('(a|b)*(d*(e*|f))', '(a|b)*d*(e*|f)'),
    ('',''),
    ('a|', 'a|'),
    ('(a|)b*','(a|)b*'),
    ('|a', '|a'),
    ('(|a)bc*', '(|a)bc*'),
    ('()',''),
    ('()()',''),
    #('()()|',''),
]
# TODO: add simplification rules
# "r**" -> "r*"
# "()" -> ""
# "|" -> ""
# "*" -> ""
# TODO: write SimplifyNode(n) function which applies the following transformations:
# Concatenate(r, Epsilon()) -> r
# Concatenate(Epsilon(), r) -> r
# KleenStar(Epsilon()) -> Epsilon()
# KleenStar(KleenStar(r)) -> KleenStar(r)
# Or(Epsilon(), Epsilon()) -> Epsilon()

print('Origional, Simplified')
print('-'*40)
for pattern, simplified in parseExamples:
    tree = parseRegex(pattern)
    assert(tree.__repr__() == simplified)
    print('"{}" -> "{}"'.format(pattern, simplified))

invalidRegexs = [
    ('(', 'no close paren'),
    ('(a', 'no close paren'),
    (')', 'no open paren'),
    ('a)', 'no open paren'),
    ('\x01', 'character not in alphabet'),
]

print('\nInvalid regular expressions')
print('-'*40)
for s, description in invalidRegexs:
    try:
        parseRegex(s)
        assert(False)
    except ValueError:
        print('"{}" - {}'.format(s, description))
Origional, Simplified
----------------------------------------
"a" -> "a"
"(a)" -> "a"
"((a))" -> "a"
"a*" -> "a*"
"(a)*" -> "a*"
"aa" -> "aa"
"abc" -> "abc"
"a*bc" -> "a*bc"
"(ab)*" -> "(ab)*"
"ab*(c*d)*" -> "ab*(c*d)*"
"(a|b)*abb" -> "(a|b)*abb"
"(a|b)*(d*(e*|f))" -> "(a|b)*d*(e*|f)"
"" -> ""
"a|" -> "a|"
"(a|)b*" -> "(a|)b*"
"|a" -> "|a"
"(|a)bc*" -> "(|a)bc*"
"()" -> ""
"()()" -> ""

Invalid regular expressions
----------------------------------------
"(" - no close paren
"(a" - no close paren
")" - no open paren
"a)" - no open paren
"" - character not in alphabet

Compiling Regular Expressions into Finate Automaton

Finte Automaton

A finite automaton is a state machine that reads an input string and either accepts or rejects it. Every finite automaton defines a language - the language of strings the automaton accepts. There are two types of automaton: deterministic and non-deterministic. Deterministic automatons (DFAs) allow at most one state transition for the current input character, while non-deterministic automatons (NFAs) allow multiple transitions. NFAs also have "epsilon transitions" which are state transitions on the empty string.

NFAs and DFAs can recognize exactly the same languages. These languages are called regular languages. A DFA is also an NFA, so any language reconizable by a DFA automatically is recongizable by an NFA. To model a NFA with a DFA the DFA's states are sets of all the possible states an NFA could be in at some particular time. DFAs are simpler to apply because they have at most one transition per input character, but may have exponentially more states than an equivalent NFA. In the following descriptions and implementation I will only consider NFAs.

Formally an NFA consists of [1]:

  • a finite set of states Q
  • a finite alphabet $\Sigma$
  • a transition function $\delta \colon Q \times \Sigma_{\epsilon} \to P(Q)$ where $\Sigma_{\epsilon} = \Sigma \cup \{\epsilon\}$ and $P(Q)$ is the power set (set of all subsets of Q).
  • a start state $q_0 \in Q$
  • an accept state $a \in Q$

Compiling

Kleen's Theorem says the languages expressable by regular expressions and finite automatons are the same [2]. This means we can implement regular expression matching with NFAs. We will be implementing Thompson's construction of regular expressions.

TODO: add images of NFAs

In [241]:
# States are reprexented by indexes into an array of state infromation.
# The states array will store the transitions from each state.
# The transitions for state i are represented by a dictionary mapping
# the current charracter c to the set
# {j : i->j is a valid transition on input c}
class RegexNFA:
    epsilon = ''
    def __init__(self, tree):
        self.states = [{}]
        self.start = 0
        self.accept = self.compileNode(tree, 0)

    def addTransition(self, character, start, end):
        assert(end < len(self.states))
        transitions = self.states[start].get(character, set())
        transitions.add(end)
        self.states[start][character] = transitions

    def newState(self):
        state = len(self.states)
        self.states.append({})
        return state

    def compileNode(self, node, start):
        # Only construction no requiring auxillary states
        if type(node) == Concatenate:
            aAccept = self.compileNode(node.a, start)
            bAccept = self.compileNode(node.b, aAccept)
            return bAccept

        # Create new accept state
        f = self.newState()
        if type(node) == Epsilon:
            self.addTransition(NFA.epsilon, start, f)
        elif type(node) == Character:
            self.addTransition(node.c, start, f)
        elif type(node) == Or:
            aStart = self.newState()
            bStart = self.newState()
            self.addTransition(NFA.epsilon, start, aStart)
            self.addTransition(NFA.epsilon, start, bStart)
            aAccept = self.compileNode(node.a, aStart)
            bAccept = self.compileNode(node.b, bStart)
            self.addTransition(NFA.epsilon, aAccept, f)
            self.addTransition(NFA.epsilon, bAccept, f)
        elif type(node) == KleenStar:
            rStart = self.newState()
            self.addTransition(NFA.epsilon, start, rStart)
            rAccept = self.compileNode(node.r, rStart)
            
            # Transition back to start to allow matching multiple rs
            self.addTransition(NFA.epsilon, rAccept, rStart)
            
            # Transition for final match
            self.addTransition(NFA.epsilon, rAccept, f)

            # Epsilon transition to allow zero matches
            self.addTransition(NFA.epsilon, start, f)
        else:
            raise ValueError('unknown node type %r', type(node))
        return f
    
    def move(self, currentStates, c):
        '''
        Input:
        current set of possible states the NFA could be in and next
        input character
        
        Output:
        set of possible states the NFA could reach by following
        a transition on character c from on of the input states
        '''
        endStates = set()
        for state in currentStates:
            endStates.update(self.states[state].get(c, set()))
        return endStates
    
    def epsilonClosure(self, currentStates):
        'augments input to include states reachible by epsilon transitions'
        # This is plain old depth first search
        pending = list(currentStates)
        closure = currentStates
        while pending:
            t = pending.pop()
            # Look at all possible epsilon transitions from t
            for s in self.states[t].get(NFA.epsilon, set()):
                if s not in closure:
                    closure.add(s)
                    pending.append(s)
        return closure

    def accepts(self, s):
        'Runs NFA on s and determines if the NFA accepts input s'
        current = self.epsilonClosure({self.start})
        for c in s:
            current = self.epsilonClosure(self.move(current, c))
            if not current:
                return False
        return self.accept in current
In [242]:
def fullMatch(pattern, s):
    r = RegexNFA(parseRegex(pattern))
    return r.accepts(s)
In [248]:
print('Test cases')
print('-'*40)
checkMatcher(examples, fullMatch)
checkMatcher(precedenceExamples, fullMatch)
checkMatcher(associativityExamples, fullMatch)
print('It works!')
Test cases
----------------------------------------
matches("", "") -> True
matches("", "a") -> False
matches("a", "a") -> True
matches("a", "b") -> False
matches("abc", "abc") -> True
matches("abc", "cab") -> False
matches("abc", "aba") -> False
matches("a*", "") -> True
matches("a*", "a") -> True
matches("a*", "aaaaaa") -> True
matches("a*", "bbb") -> False
matches("a|b", "a") -> True
matches("a|b", "b") -> True
matches("a|b", "c") -> False
matches("(a|b)*", "aabbabab") -> True
matches("()", "") -> True
matches("()", "a") -> False
matches("(a|b)*", "aabbcbab") -> False
matches("a|b*", "bbb") -> True
matches("a|b*", "aba") -> False
matches("ab*", "abbb") -> True
matches("ab*", "a") -> True
matches("ab*", "abababab") -> False
matches("ab*", "") -> False
matches("abc|def", "abc") -> True
matches("abc|def", "abcef") -> False
matches("abc*", "abcabcabc") -> False
matches("(abc)*", "abcabcabc") -> True
matches("abc*", "") -> False
matches("(abc)*", "") -> True
matches("abc*", "abccc") -> True
matches("(abc)*", "abccc") -> False
matches("abc*", "ab") -> True
matches("a(bc)*", "abcbc") -> True
matches("a(bc)*", "a") -> True
matches("a*b*c", "c") -> True
matches("a*b*c", "aaac") -> True
matches("a*b*c", "bc") -> True
matches("a*b*c", "aabbbc") -> True
matches("a*b*c", "a") -> False
matches("a*b*c", "accc") -> False
matches("a*b*c", "abbbb") -> False
matches("a*b*c", "abbbcc") -> False
It works!

References