Parsing and Recombining Inputs

In the chapter on Grammars, we discussed how grammars can be used to represent various languages. We also saw how grammars can be used to generate strings of the corresponding language. Grammars can also be used in reverse: given a string, one can decompose it into the constituent parts that correspond to the rules of the grammar used to generate it – the derivation tree of that string. These parts (and parts from other similar strings) can later be recombined using the same grammar to produce new strings.

In this chapter, we use grammars to parse and decompose a given set of valid seed inputs into their corresponding derivation trees. This structural representation allows us to mutate, crossover, and recombine their parts in order to generate new, valid, slightly changed inputs (i.e., fuzz).

Prerequisites

Fuzzing a Simple Program

Here is a simple program that accepts a CSV file of vehicle details and processes this information.

def process_inventory(inventory):
    res = []
    for vehicle in inventory.split('\n'):
        ret = process_vehicle(vehicle)
        res.extend(ret)
    return '\n'.join(res)

The CSV file contains details of one vehicle per line. Each row is processed in process_vehicle().

def process_vehicle(vehicle):
    year, kind, company, model, *_ = vehicle.split(',')
    if kind == 'van':
        return process_van(year, company, model)

    elif kind == 'car':
        return process_car(year, company, model)

    else:
        raise Exception('Invalid entry')

Depending on the kind of vehicle, the processing changes.

def process_van(year, company, model):
    res = ["We have a %s %s van from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2010:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res

def process_car(year, company, model):
    res = ["We have a %s %s car from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2016:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res

Here is a sample input that process_inventory() accepts.

mystring = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar\
"""
print(process_inventory(mystring))
We have a Ford E350 van from 1997 vintage.
It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage.
It is an old but reliable model!

Let us try to fuzz this program. Given that process_inventory() takes a CSV file, we can write a simple grammar for generating comma-separated values, and generate the required CSV rows. For convenience, we fuzz process_vehicle() directly.

import string
CSV_GRAMMAR = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \t\n')
}

We need some infrastructure first for viewing the grammar.

from Grammars import EXPR_GRAMMAR, START_SYMBOL, RE_NONTERMINAL, is_valid_grammar, syntax_diagram
from Fuzzer import Fuzzer
from GrammarFuzzer import GrammarFuzzer, FasterGrammarFuzzer, display_tree, tree_to_string, dot_escape

from ExpectError import ExpectError
from Coverage import Coverage
from Timer import Timer
syntax_diagram(CSV_GRAMMAR)
(Syntax diagram of CSV_GRAMMAR, showing the expansions of <start>, <csvline>, <items>, <item>, <letters>, and <letter>.)

We generate 1000 values and evaluate process_vehicle() with each.

gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 1000
valid = []
time = 0
for i in range(trials):
    with Timer() as t:
        vehicle_info = gf.fuzz()
        try:
            process_vehicle(vehicle_info)
            valid.append(vehicle_info)
        except:
            pass
        time += t.elapsed_time()
print("%d valid strings, that is GrammarFuzzer generated %f%% valid entries from %d inputs" %
      (len(valid), len(valid) * 100.0 / trials, trials))
print("Total time of %f seconds" % time)
0 valid strings, that is GrammarFuzzer generated 0.000000% valid entries from 1000 inputs
Total time of 6.447689 seconds

This is obviously not working. But why?

gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 10
valid = []
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()
'9w9J\'/,LU<"l,|,Y,Zv)Amvx,c\n'	 Invalid entry
'(n8].H7,qolS'	 not enough values to unpack (expected at least 4, got 2)
'\nQoLWQ,jSa'	 not enough values to unpack (expected at least 4, got 2)
'K1,\n,RE,fq,%,,sT+aAb'	 Invalid entry
"m,d,,8j4'),-yQ,B7"	 Invalid entry
'g4,s1\t[}{.,M,<,\nzd,.am'	 Invalid entry
',Z[,z,c,#x1,gc.F'	 Invalid entry
'pWs,rT`,R'	 not enough values to unpack (expected at least 4, got 3)
'iN,br%,Q,R'	 Invalid entry
'ol,\nH<\tn,^#,=A'	 Invalid entry

None of the entries will get through unless the fuzzer can produce either van or car. Indeed, the reason is that the grammar itself does not capture the complete information about the format.
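One fix would be to encode more of the format into the grammar itself. Here is a sketch of such a refined grammar (the rule names are our own, and this grammar is not used further in the chapter):

import string
# Sketch only: encodes the four-field format and the two valid kinds directly.
BETTER_CSV_GRAMMAR = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<year>,<kind>,<company>,<model>'],
    '<year>': ['<digit><digit><digit><digit>'],
    '<digit>': list(string.digits),
    '<kind>': ['van', 'car'],
    '<company>': ['<letters>'],
    '<model>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters)
}
assert is_valid_grammar(BETTER_CSV_GRAMMAR)

However, hand-refining grammars this way quickly becomes tedious for realistic formats. So here is another idea: we modify the GrammarFuzzer to know a bit about our format.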

import copy
import random
class PooledGrammarFuzzer(GrammarFuzzer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._node_cache = {}

    def update_cache(self, key, values):
        self._node_cache[key] = values

    def expand_node_randomly(self, node):
        (symbol, children) = node
        assert children is None
        if symbol in self._node_cache:
            if random.randint(0, 1) == 1:
                return super().expand_node_randomly(node)
            return copy.deepcopy(random.choice(self._node_cache[symbol]))
        return super().expand_node_randomly(node)

Let us try again!

gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
gf.update_cache('<item>', [
    ('<item>', [('car', [])]),
    ('<item>', [('van', [])]),
])
trials = 10
valid = []
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()
',h,van,|'	 Invalid entry
'M,w:K,car,car,van'	 Invalid entry
'J,?Y,van,van,car,J,~D+'	 Invalid entry
'S4,car,car,o'	 invalid literal for int() with base 10: 'S4'
'2*-,van'	 not enough values to unpack (expected at least 4, got 2)
'van,%,5,]'	 Invalid entry
'van,G3{y,j,h:'	 Invalid entry
'$0;o,M,car,car'	 Invalid entry
'2d,f,e'	 not enough values to unpack (expected at least 4, got 3)
'/~NE,car,car'	 not enough values to unpack (expected at least 4, got 3)

At least we are getting somewhere! It would be really nice if we could incorporate what we know about the sample data in our fuzzer. In fact, it would be nice if we could extract the template and valid values from samples, and use them in our fuzzing. How do we do that?

An Ad Hoc Parser

As we saw in the previous section, programmers often have to extract parts of data that obey certain rules. For example, in CSV files, each element in a row is separated by commas, and multiple rows are used to store the data.

To extract the information, we write an ad hoc parser parse_csv().

def parse_csv(mystring):
    children = []
    tree = (START_SYMBOL, children)
    for i, line in enumerate(mystring.split('\n')):
        children.append(("record %d" % i, [(cell, [])
                                           for cell in line.split(',')]))
    return tree

We also change the default orientation of the graph to left to right rather than top to bottom for easier viewing using lr_graph().

def lr_graph(dot):
    dot.attr('node', shape='plain')
    dot.graph_attr['rankdir'] = 'LR'

The display_tree() shows the structure of our CSV file after parsing.

tree = parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
(Derivation tree: <start> with record 0 (1997, van, Ford, E350) and record 1 (2000, car, Mercury, Cougar).)

This is of course simple. What if we encounter slightly more complexity? Here is another example from Wikipedia.

mystring = '''\
1997,Ford,E350,"ac, abs, moon",3000.00\
'''
print(mystring)
1997,Ford,E350,"ac, abs, moon",3000.00

We define a new annotation method highlight_node() to mark the nodes that are interesting.

def highlight_node(predicate):
    def hl_node(dot, nid, symbol, ann):
        if predicate(dot, nid, symbol, ann):
            dot.node(repr(nid), dot_escape(symbol), fontcolor='red')
        else:
            dot.node(repr(nid), dot_escape(symbol))
    return hl_node

Using highlight_node(), we can highlight the particular nodes that were wrongly parsed.

tree = parse_csv(mystring)
bad_nodes = {5, 6, 7, 12, 13, 20, 22, 23, 24, 25}
def hl_predicate(_d, nid, _s, _a): return nid in bad_nodes
highlight_err_node = highlight_node(hl_predicate)
display_tree(tree, log=False, node_attr=highlight_err_node,
             graph_attr=lr_graph)
(Derivation tree of the parsed row, with the pieces of the quoted field "ac, abs, moon" highlighted as wrongly parsed.)

The marked nodes indicate where our parsing went wrong. We can of course extend our parser to understand quotes. First, we define a few helper functions: parse_quote(), find_comma(), and comma_split().

def parse_quote(string, i):
    v = string[i + 1:].find('"')
    return v + i + 1 if v >= 0 else -1

def find_comma(string, i):
    slen = len(string)
    while i < slen:
        if string[i] == '"':
            i = parse_quote(string, i)
            if i == -1:
                return -1
        if string[i] == ',':
            return i
        i += 1
    return -1

def comma_split(string):
    slen = len(string)
    i = 0
    while i < slen:
        c = find_comma(string, i)
        if c == -1:
            yield string[i:]
            return
        else:
            yield string[i:c]
        i = c + 1

We can update our parse_csv() procedure to use our advanced quote parser.

def parse_csv(mystring):
    children = []
    tree = (START_SYMBOL, children)
    for i, line in enumerate(mystring.split('\n')):
        children.append(("record %d" % i, [(cell, [])
                                           for cell in comma_split(line)]))
    return tree

Our new parse_csv() can now handle quotes correctly.

tree = parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
(Derivation tree of the parsed row; the quoted field "ac, abs, moon" is now a single node.)

That of course does not survive long:

mystring = '''\
1999,Chevy,"Venture \\"Extended Edition, Very Large\\"",,5000.00\
'''
print(mystring)
1999,Chevy,"Venture \"Extended Edition, Very Large\"",,5000.00

A few embedded quotes are sufficient to confuse our parser again.

tree = parse_csv(mystring)
bad_nodes = {4, 5}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
(Derivation tree; the embedded quotes split the quoted field into two highlighted nodes.)

Here is another record from that CSV file:

mystring = '''\
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''
print(mystring)
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
tree = parse_csv(mystring)
bad_nodes = {5, 6, 7, 8, 9, 10}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
(Derivation tree; the multi-line quoted field is wrongly split across three records.)

Fixing this would require modifying both the inner parse_quote() and the outer parse_csv() procedures. We note that each of these features is actually documented in the CSV specification, RFC 4180.

Indeed, each additional improvement falls apart with just a little extra complexity. The problem becomes severe when one encounters recursive expressions. For example, JSON is a common alternative to CSV files for saving data. Similarly, one may have to parse data from an HTML table instead of a CSV file if one is getting the data from the web.

One might be tempted to fix it with a little more ad hoc parsing, with a bit of regular expressions thrown in. However, that is the path to insanity.

It is here that formal parsers shine. The main idea is that any given set of strings belongs to a language, and these languages can be specified by their grammars (as we saw in the chapter on grammars). The great thing about grammars is that they can be composed. That is, one can introduce finer and finer details into an internal structure without affecting the external structure, and similarly, one can change the external structure without much impact on the internal structure (see the sketch below). We briefly describe grammars in the next section.
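As an illustration of such composition (our own sketch, reusing CSV_GRAMMAR from above), we can refine the internal structure of <item> so that a field may also contain a small arithmetic expression, without touching the external CSV structure:

# Sketch only: compose an expression grammar into the CSV grammar by
# refining <item>; the surrounding CSV structure is unchanged.
composed_grammar = dict(CSV_GRAMMAR)
composed_grammar['<item>'] = ['<letters>', '<aexpr>']
composed_grammar['<aexpr>'] = ['<anum>+<anum>', '<anum>']
composed_grammar['<anum>'] = ['<adigit><anum>', '<adigit>']
composed_grammar['<adigit>'] = list('0123456789')
assert is_valid_grammar(composed_grammar)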

Grammars

A grammar, as you have read in the chapter on grammars, is a set of rules that explain how the start symbol can be expanded. Each rule has a name, also called a nonterminal, and a set of alternative choices for how the nonterminal can be expanded.

A1_GRAMMAR = {
    "<start>": ["<expr>"],
    "<expr>": ["<expr>+<expr>", "<expr>-<expr>", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
syntax_diagram(A1_GRAMMAR)
(Syntax diagram of A1_GRAMMAR, showing the expansions of <start>, <expr>, <integer>, and <digit>.)

In the above grammar, the rule <expr> : [<expr>+<expr>,<expr>-<expr>,<integer>] corresponds to how the nonterminal <expr> might be expanded. The expression <expr>+<expr> corresponds to one of the alternative choices. We call this an alternative expansion for the nonterminal <expr>. Finally, in an expression <expr>+<expr>, each of <expr>, +, and <expr> is a symbol in that expansion. A symbol is either a nonterminal or a terminal, depending on whether its expansion is available in the grammar.
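To make this distinction concrete, here is a small check of our own (the helper name is ours; by the convention used throughout, a symbol is a nonterminal exactly when the grammar defines an expansion for it):

# Sketch: a symbol is a nonterminal iff the grammar has an expansion for it.
def is_nonterminal_in(symbol, grammar):
    return symbol in grammar

assert is_nonterminal_in('<expr>', A1_GRAMMAR)
assert not is_nonterminal_in('+', A1_GRAMMAR)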

Here is a string that represents an arithmetic expression that we would like to parse, which is specified by the grammar above:

mystring = '1+2'

The derivation tree for our expression from this grammar is given by:

tree = ('<start>', [('<expr>',
                     [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
                      ('+', []),
                      ('<expr>', [('<integer>', [('<digit>', [('2',
                                                               [])])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
(Derivation tree of 1+2 under A1_GRAMMAR.)

While a grammar can be used to specify a given language, there could be multiple grammars that correspond to the same language. For example, here is another grammar to describe the same addition expression.

A2_GRAMMAR = {
    "<start>": ["<expr>"],
    "<expr>": ["<integer><expr_>"],
    "<expr_>": ["+<expr>", "-<expr>", ""],
    "<integer>": ["<digit><integer_>"],
    "<integer_>": ["<integer>", ""],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
syntax_diagram(A2_GRAMMAR)
(Syntax diagram of A2_GRAMMAR, showing the expansions of <start>, <expr>, <expr_>, <integer>, <integer_>, and <digit>.)

The corresponding derivation tree is given by:

tree = ('<start>', [('<expr>', [('<integer>', [('<digit>', [('1', [])]),
                                               ('<integer_>', [])]),
                                ('<expr_>', [('+', []),
                                             ('<expr>',
                                              [('<integer>',
                                                [('<digit>', [('2', [])]),
                                                 ('<integer_>', [])]),
                                               ('<expr_>', [])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
(Derivation tree of 1+2 under A2_GRAMMAR.)

Indeed, there could be different classes of grammars that describe the same language. For example, the first grammar A1_GRAMMAR sports both right and left recursion, while the second grammar A2_GRAMMAR has no left recursion in any of its productions, but contains epsilon productions. (An epsilon production is a production that has the empty string as its right-hand side; in A2_GRAMMAR, the empty alternatives of <expr_> and <integer_> are epsilon productions.)

You may have noticed that we reuse the term <expr> in its own definition. Using the same nonterminal in its own definition is called recursion. There are two specific kinds of recursion one should be aware of in parsing, as we see in the next section.

Recursion

A grammar is left recursive if any of its nonterminals are left recursive, and a nonterminal is directly left-recursive if the left-most symbol of any of its productions is itself.

LR_GRAMMAR = {
    '<start>': ['<A>'],
    '<A>': ['<A>a', ''],
}
syntax_diagram(LR_GRAMMAR)
(Syntax diagram of LR_GRAMMAR.)
mystring = 'aaaaaa'
display_tree(
    ('<start>', (('<A>', (('<A>', (('<A>', []), ('a', []))), ('a', []))), ('a', []))))
(Derivation tree from LR_GRAMMAR; it leans to the left.)

A grammar is indirectly left-recursive if one of its left-most symbols can be expanded using its definition to produce the nonterminal as the left-most symbol of the expansion. The left recursion is called hidden left recursion if, during the series of expansions of a nonterminal, one reaches a rule that contains the same nonterminal after a prefix of other symbols, and these symbols can derive the empty string. For example, in A1_GRAMMAR, <integer> would be considered hidden-left-recursive if <digit> could derive an empty string.
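For illustration, here is a small grammar of our own that is indirectly left-recursive: expanding the left-most symbol of <A> (namely <B>) again yields <A> in the left-most position.

# Our own example: <A> => <B>a => <A>ba, so <A> is indirectly left-recursive.
INDIRECT_LR_GRAMMAR = {
    '<start>': ['<A>'],
    '<A>': ['<B>a', 'a'],
    '<B>': ['<A>b'],
}
assert is_valid_grammar(INDIRECT_LR_GRAMMAR)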

Right-recursive grammars are defined similarly. Below is the derivation tree for a right-recursive grammar that represents the same language as LR_GRAMMAR.

RR_GRAMMAR = {
    '<start>': ['<A>'],
    '<A>': ['a<A>', ''],
}
syntax_diagram(RR_GRAMMAR)
(Syntax diagram of RR_GRAMMAR.)
display_tree(('<start>', ((
    '<A>', (('a', []), ('<A>', (('a', []), ('<A>', (('a', []), ('<A>', []))))))),)))
(Derivation tree from RR_GRAMMAR; it leans to the right.)

Ambiguity

To complicate matters further, there could be multiple derivation trees – also called parses – corresponding to the same string from the same grammar. For example, the string 1+2+3 can be parsed in two ways using A1_GRAMMAR, as we see below.

mystring = '1+2+3'
tree = ('<start>',
        [('<expr>',
          [('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
                       ('+', []),
                       ('<expr>', [('<integer>',
                                    [('<digit>', [('2', [])])])])]), ('+', []),
           ('<expr>', [('<integer>', [('<digit>', [('3', [])])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
(Derivation tree of 1+2+3 parsed as (1+2)+3.)
tree = ('<start>',
        [('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
                     ('+', []),
                     ('<expr>',
                      [('<expr>', [('<integer>', [('<digit>', [('2', [])])])]),
                       ('+', []),
                       ('<expr>', [('<integer>', [('<digit>', [('3',
                                                                [])])])])])])])
assert tree_to_string(tree) == mystring
display_tree(tree)
(Derivation tree of 1+2+3 parsed as 1+(2+3).)

There are many ways to resolve ambiguities. One approach, taken by Parsing Expression Grammars explained in the next section, is to specify a particular order of resolution and choose the first parse. Another approach is to simply return all possible derivation trees, which is the approach taken by the Earley parser we develop later.

Next, we develop different parsers. To do that, we define a minimal interface for parsing that is obeyed by all parsers. There are two approaches to parsing a string using a grammar.

  1. The traditional approach is to use a lexer (also called a tokenizer or a scanner) to first tokenize the incoming string, and feed the grammar one token at a time. The lexer is typically a smaller parser that accepts a regular language. The advantage of this approach is that the grammar used by the parser can eschew the details of tokenization. Further, one gets a shallow derivation tree at the end of the parsing, which can be directly used for generating the Abstract Syntax Tree. (A lexer sketch follows this list.)
  2. The second approach is to use a tree pruner after the complete parse. With this approach, one uses a grammar that incorporates complete details of the syntax. Next, the nodes corresponding to tokens are pruned and replaced with their corresponding strings as leaf nodes. The utility of this approach is that the parser is more powerful, and further there is no artificial distinction between lexing and parsing.
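To make the first approach concrete, here is a minimal lexer sketch (entirely our own; the function name and token classes are made up for illustration):

import re

# Sketch of approach 1: split the input into tokens (numbers, identifiers,
# single-character operators) before any grammar rule is applied.
def simple_lexer(text):
    return re.findall(r'\d+|\w+|.', text)

assert simple_lexer('x1=13+2') == ['x1', '=', '13', '+', '2']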

In this chapter, we use the second approach. This approach is implemented in the prune_tree method.

The Parser class we define below provides the minimal interface. The main methods that need to be implemented by the classes implementing this interface are parse_prefix and parse. The parse_prefix returns a tuple, which contains the index until which parsing was completed successfully, and the parse forest until that index. The method parse returns a list of derivation trees if the parse was successful.

class Parser(object):
    def __init__(self, grammar, **kwargs):
        self._grammar = grammar
        self._start_symbol = kwargs.get('start_symbol', START_SYMBOL)
        self.log = kwargs.get('log', False)
        self.coalesce_tokens = kwargs.get('coalesce', True)
        self.tokens = kwargs.get('tokens', set())

    def grammar(self):
        return self._grammar

    def start_symbol(self):
        return self._start_symbol

    def parse_prefix(self, text):
        """Return pair (cursor, forest) for longest prefix of text"""
        raise NotImplementedError()

    def parse(self, text):
        cursor, forest = self.parse_prefix(text)
        if cursor < len(text):
            raise SyntaxError("at " + repr(text[cursor:]))
        return [self.prune_tree(tree) for tree in forest]

    def coalesce(self, children):
        last = ''
        new_lst = []
        for cn, cc in children:
            if cn not in self._grammar:
                last += cn
            else:
                if last:
                    new_lst.append((last, []))
                    last = ''
                new_lst.append((cn, cc))
        if last:
            new_lst.append((last, []))
        return new_lst

    def prune_tree(self, tree):
        name, children = tree
        if self.coalesce_tokens:
            children = self.coalesce(children)
        if name in self.tokens:
            return (name, [(tree_to_string(tree), [])])
        else:
            return (name, [self.prune_tree(c) for c in children])
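As a quick illustration of coalesce() (our own example): adjacent children whose names are not keys of the grammar are merged into a single leaf.

# Our own illustration: '1' and '+' are not grammar keys, so they are
# coalesced into the single leaf '1+'; '<expr>' is kept as a subtree.
p = Parser({'<expr>': ['<expr>+<expr>']})
assert p.coalesce([('1', []), ('+', []), ('<expr>', [])]) == [('1+', []), ('<expr>', [])]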

Parsing Expression Grammars

A Parsing Expression Grammar (PEG) [Ford et al, 2004.] is a type of recognition-based formal grammar that specifies the sequence of steps to take to parse a given string. A parsing expression grammar is very similar to a context-free grammar (CFG), such as the ones we saw in the chapter on grammars. As in a CFG, a parsing expression grammar is represented by a set of nonterminals and corresponding alternatives representing how to match each. For example, here is a PEG that matches a or b.

PEG1 = {
    '<start>': ['a', 'b']
}

However, unlike in a CFG, the alternatives represent ordered choice. That is, rather than choosing all rules that can potentially match, we stop at the first match that succeeds. For example, the PEG below can match ab but not abc, unlike a CFG, which will match both. (We call the sequence of ordered-choice expressions choice expressions rather than alternatives to make the distinction from CFGs clear.)

PEG2 = {
    '<start>': ['ab', 'abc']
}

Each choice in a choice expression represents a rule on how to satisfy that particular choice. The choice is a sequence of symbols (terminals and nonterminals) that are matched against a given text as in a CFG.

Beyond the syntax of grammar definitions we have seen so far, a PEG can also contain a few additional elements. See the exercises at the end of the chapter for additional information.

PEGs model the typical practice in handwritten recursive descent parsers, and hence may be considered more intuitive to understand. We look at parsers for PEGs next.

The Packrat Parser for Parsing Expression Grammars

Short of hand rolling a parser, Packrat parsing is one of the simplest parsing techniques, and is one of the techniques for parsing PEGs. The Packrat parser is so named because it tries to cache all results from simpler problems in the hope that these solutions can be used to avoid re-computation later. We develop a minimal Packrat parser next.

But before that, we need to implement a few supporting tools.

The EXPR_GRAMMAR we import from the chapter on grammars is oriented towards generation. In particular, the production rules are stored as strings. We need to massage this representation a little to conform to a canonical representation in which each symbol of an expansion is represented as a separate token.

import re
def canonical(grammar, letters=False):
    def split(expansion):
        if isinstance(expansion, tuple):
            expansion = expansion[0]

        return [token for token in re.split(
            RE_NONTERMINAL, expansion) if token]

    def tokenize(word):
        return list(word) if letters else [word]

    def canonical_expr(expression):
        return [
            token for word in split(expression)
            for token in ([word] if word in grammar else tokenize(word))
        ]

    return {
        k: [canonical_expr(expression) for expression in alternatives]
        for k, alternatives in grammar.items()
    }
canonical(EXPR_GRAMMAR)
{'<start>': [['<expr>']],
 '<expr>': [['<term>', ' + ', '<expr>'],
  ['<term>', ' - ', '<expr>'],
  ['<term>']],
 '<term>': [['<factor>', ' * ', '<term>'],
  ['<factor>', ' / ', '<term>'],
  ['<factor>']],
 '<factor>': [['+', '<factor>'],
  ['-', '<factor>'],
  ['(', '<expr>', ')'],
  ['<integer>', '.', '<integer>'],
  ['<integer>']],
 '<integer>': [['<digit>', '<integer>'], ['<digit>']],
 '<digit>': [['0'],
  ['1'],
  ['2'],
  ['3'],
  ['4'],
  ['5'],
  ['6'],
  ['7'],
  ['8'],
  ['9']]}

It is easier to work with the canonical representation during parsing. Hence, we update our parser class to store the canonical representation also.

class Parser(Parser):
    def __init__(self, grammar, **kwargs):
        super().__init__(grammar, **kwargs)
        self.cgrammar = canonical(grammar)

The Parser

We derive from the Parser base class first, and we accept the text to be parsed in the parse() method, which in turn calls unify_key() with the start_symbol.

Note. While our PEG parser can produce only a single unambiguous parse tree, other parsers can produce multiple parses for ambiguous grammars. Hence, we return a list of trees (in this case with a single element).

class PEGParser(Parser):
    def parse_prefix(self, text):
        cursor, tree = self.unify_key(self.start_symbol(), text, 0)
        return cursor, [tree]

Unify Key

The unify_key() algorithm is simple. If given a terminal symbol, it tries to match the symbol with the text at the current position at. If the symbol and text match, it returns successfully with the new parse index; otherwise, it reports a failure.

If on the other hand, it was given a nonterminal, it retrieves the choice expression corresponding to the key, and tries to match each choice in order using unify_rule(). If any of the rules succeed in being unified with the given text, the parse is considered a success, and we return with the new parse index returned by unify_rule().

class PEGParser(PEGParser):
    def unify_key(self, key, text, at=0):
        if self.log:
            print("unify_key: %s with %s" % (repr(key), repr(text[at:])))
        if key not in self.cgrammar:
            if text[at:].startswith(key):
                return at + len(key), (key, [])
            else:
                return at, None
        for rule in self.cgrammar[key]:
            to, res = self.unify_rule(rule, text, at)
            if res:
                return (to, (key, res))
        return 0, None
mystring = "1"
peg = PEGParser(EXPR_GRAMMAR, log=True)
peg.unify_key('1', mystring)
unify_key: '1' with '1'
(1, ('1', []))
mystring = "2"
peg.unify_key('1', mystring)
unify_key: '1' with '2'
(0, None)

Unify Rule

The unify_rule() method is similar. It retrieves the tokens corresponding to the rule that it needs to unify with the text, and calls unify_key() on them in sequence. If all tokens are successfully unified with the text, the parse is a success.

class PEGParser(PEGParser):
    def unify_rule(self, rule, text, at):
        if self.log:
            print('unify_rule: %s with %s' % (repr(rule), repr(text[at:])))
        results = []
        for token in rule:
            at, res = self.unify_key(token, text, at)
            if res is None:
                return at, None
            results.append(res)
        return at, results
mystring = "0"
peg = PEGParser(EXPR_GRAMMAR, log=True)
peg.unify_rule(peg.cgrammar['<digit>'][0], mystring, 0)
unify_rule: ['0'] with '0'
unify_key: '0' with '0'
(1, [('0', [])])
mystring = "12"
peg.unify_rule(peg.cgrammar['<integer>'][0], mystring, 0)
unify_rule: ['<digit>', '<integer>'] with '12'
unify_key: '<digit>' with '12'
unify_rule: ['0'] with '12'
unify_key: '0' with '12'
unify_rule: ['1'] with '12'
unify_key: '1' with '12'
unify_key: '<integer>' with '2'
unify_rule: ['<digit>', '<integer>'] with '2'
unify_key: '<digit>' with '2'
unify_rule: ['0'] with '2'
unify_key: '0' with '2'
unify_rule: ['1'] with '2'
unify_key: '1' with '2'
unify_rule: ['2'] with '2'
unify_key: '2' with '2'
unify_key: '<integer>' with ''
unify_rule: ['<digit>', '<integer>'] with ''
unify_key: '<digit>' with ''
unify_rule: ['0'] with ''
unify_key: '0' with ''
unify_rule: ['1'] with ''
unify_key: '1' with ''
unify_rule: ['2'] with ''
unify_key: '2' with ''
unify_rule: ['3'] with ''
unify_key: '3' with ''
unify_rule: ['4'] with ''
unify_key: '4' with ''
unify_rule: ['5'] with ''
unify_key: '5' with ''
unify_rule: ['6'] with ''
unify_key: '6' with ''
unify_rule: ['7'] with ''
unify_key: '7' with ''
unify_rule: ['8'] with ''
unify_key: '8' with ''
unify_rule: ['9'] with ''
unify_key: '9' with ''
unify_rule: ['<digit>'] with ''
unify_key: '<digit>' with ''
unify_rule: ['0'] with ''
unify_key: '0' with ''
unify_rule: ['1'] with ''
unify_key: '1' with ''
unify_rule: ['2'] with ''
unify_key: '2' with ''
unify_rule: ['3'] with ''
unify_key: '3' with ''
unify_rule: ['4'] with ''
unify_key: '4' with ''
unify_rule: ['5'] with ''
unify_key: '5' with ''
unify_rule: ['6'] with ''
unify_key: '6' with ''
unify_rule: ['7'] with ''
unify_key: '7' with ''
unify_rule: ['8'] with ''
unify_key: '8' with ''
unify_rule: ['9'] with ''
unify_key: '9' with ''
unify_rule: ['<digit>'] with '2'
unify_key: '<digit>' with '2'
unify_rule: ['0'] with '2'
unify_key: '0' with '2'
unify_rule: ['1'] with '2'
unify_key: '1' with '2'
unify_rule: ['2'] with '2'
unify_key: '2' with '2'
(2, [('<digit>', [('1', [])]), ('<integer>', [('<digit>', [('2', [])])])])
mystring = "1 + 2"
peg = PEGParser(EXPR_GRAMMAR, log=False)
peg.parse(mystring)
[('<start>',
  [('<expr>',
    [('<term>', [('<factor>', [('<integer>', [('<digit>', [('1', [])])])])]),
     (' + ', []),
     ('<expr>',
      [('<term>',
        [('<factor>', [('<integer>', [('<digit>', [('2', [])])])])])])])])]

The two methods are mutually recursive, and given that unify_key() tries each alternative until it succeeds, unify_key can be called multiple times with the same arguments. Hence, it is important to memoize the results of unify_key. Python provides a simple decorator lru_cache for memoizing any function call that has hashable arguments. We add that to our implementation so that repeated calls to unify_key() with the same argument get cached results.

This memoization gives the algorithm its name – Packrat.

from functools import lru_cache
class PEGParser(PEGParser):
    @lru_cache(maxsize=None)
    def unify_key(self, key, text, at=0):
        if key not in self.cgrammar:
            if text[at:].startswith(key):
                return at + len(key), (key, [])
            else:
                return at, None
        for rule in self.cgrammar[key]:
            to, res = self.unify_rule(rule, text, at)
            if res:
                return (to, (key, res))
        return 0, None

We wrap initialization and invocation as follows: the PEGParser is constructed with the grammar, and the parse() method (already implemented in the Parser base class) accepts the text to be parsed.

Here are a few examples of our parser in action.

mystring = "1 + (2 * 3)"
peg = PEGParser(EXPR_GRAMMAR)
for tree in peg.parse(mystring):
    assert tree_to_string(tree) == mystring
    display_tree(tree)
(Derivation tree of 1 + (2 * 3).)
mystring = "1 * (2 + 3.35)"
for tree in peg.parse(mystring):
    assert tree_to_string(tree) == mystring
    display_tree(tree)
(Derivation tree of 1 * (2 + 3.35).)

One should be aware that while the grammar looks like a CFG, the language described by a PEG may be different. Indeed, only LL(1) grammars are guaranteed to represent the same language for both PEGs and other parsers. Behavior of PEGs for other classes of grammars could be surprising [Redziejowski et al, 2008.].
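We can also now verify the ordered-choice behavior claimed for PEG2 earlier (a small check of our own): ab parses, but abc fails, because the first alternative ab already succeeds and the trailing c remains unconsumed.

peg = PEGParser(PEG2)
assert tree_to_string(peg.parse('ab')[0]) == 'ab'
with ExpectError():
    peg.parse('abc')  # SyntaxError: at 'c'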

We previously showed how using fragments of existing data can help quite a bit with fuzzing. We now explore this idea in more detail.

Recombining Parsed Inputs

Recombining parsed inputs was pioneered by LangFuzz [Holler et al, 2012.]. The main challenge is that program inputs often carry additional constraints beyond what is described by the syntax. For example, in Java, one needs to declare a variable (using a specific format for declaration) before it can be used in an expression. This restriction is not captured in the Java CFG. Checking for type correctness is another example of additional restrictions carried by program definitions.

When fuzzing compilers and interpreters, naive generation of programs using the language CFG often fails to achieve significantly deeper coverage due to these kinds of checks external to the grammar. Holler et al. suggest using pre-existing valid code fragments to get around these restrictions. The idea is that the pre-existing valid code fragments already conform to the restrictions external to the grammar, and can often provide a means to evade validity checks.

A Grammar-based Mutational Fuzzer

The idea is that one can treat the derivation tree of a preexisting program as the scaffolding, poke holes in it, and patch it with generated inputs from our grammar. Given below is a grammar for a language that allows assignment of variables.

import string
from Grammars import crange
VAR_GRAMMAR = {
    '<start>': ['<statements>'],
    '<statements>': ['<statement>;<statements>', '<statement>'],
    '<statement>': ['<assignment>'],
    '<assignment>': ['<identifier>=<expr>'],
    '<identifier>': ['<word>'],
    '<word>': ['<alpha><word>', '<alpha>'],
    '<alpha>': list(string.ascii_letters),
    '<expr>': ['<term>+<expr>', '<term>-<expr>', '<term>'],
    '<term>': ['<factor>*<term>', '<factor>/<term>', '<factor>'],
    '<factor>':
    ['+<factor>', '-<factor>', '(<expr>)', '<identifier>', '<number>'],
    '<number>': ['<integer>.<integer>', '<integer>'],
    '<integer>': ['<digit><integer>', '<digit>'],
    '<digit>': crange('0', '9')
}
syntax_diagram(VAR_GRAMMAR)
(Syntax diagram of VAR_GRAMMAR, showing the expansions of <start>, <statements>, <statement>, <assignment>, <identifier>, <word>, <alpha>, <expr>, <term>, <factor>, <number>, <integer>, and <digit>.)

Let us use our new grammar to parse a program.

mystring = 'va=10;vb=20'
def hl_predicate(_d, _n, symbol, _a): return symbol in {
    '<number>', '<identifier>'}
parser = PEGParser(VAR_GRAMMAR)
for tree in parser.parse(mystring):
    display_tree(tree, node_attr=highlight_node(hl_predicate))
(Derivation tree of va=10;vb=20, with the <number> and <identifier> nodes highlighted in red.)

As can be seen from the above example, our grammar is rather detailed. So we need to define our token nodes, which correspond to the red nodes above.

VAR_TOKENS = {'<number>', '<identifier>'}

These tokens are pruned using the prune_tree method that we mentioned previously. Here is a slightly more complex program to parse, but with the tree pruned using tokens:

mystring = 'avar=1.3;bvar=avar-3*(4+300)'
parser = PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS)
for tree in parser.parse(mystring):
    display_tree(tree, node_attr=highlight_node(hl_predicate))
(Pruned derivation tree; token nodes such as avar and 1.3 are now leaves.)

We develop a LangFuzzer class that generates recombined inputs. To apply the LangFuzz approach, we need a few parsed strings.

mystrings = [
    'abc=12+(3+3.3)',
    'a=1;b=2;c=a+b',
    'avar=1.3;bvar=avar-3*(4+300)',
    'a=1.3;b=a-1*(4+3+(2/a))',
    'a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)',
    'x=10;y=20;z=(x+y)*(x-y)',
    'x=23;y=51;z=x*x-y*y',
]

We recurse through any given tree, collecting parsed fragments corresponding to each nonterminal. Further, we also name each node so that we can address each node separately.

class LangFuzzer(Fuzzer):
    def __init__(self, parser):
        self.parser = parser
        self.fragments = {k: [] for k in self.parser.cgrammar}

    def traverse_tree(self, node):
        counter = 1
        nodes = {}

        def helper(node, id):
            nonlocal counter
            name, children = node
            new_children = []
            nodes[id] = node
            for child in children:
                counter += 1
                new_children.append(helper(child, counter))
            return name, new_children, id

        return helper(node, counter), nodes

    def fragment(self, strings):
        self.trees = []
        for string in strings:
            for tree in self.parser.parse(string):
                tree, nodes = self.traverse_tree(tree)
                self.trees.append((tree, nodes))
                for node in nodes:
                    symbol = nodes[node][0]
                    if symbol in self.fragments:
                        self.fragments[symbol].append(nodes[node])
        return self.fragments

We thus obtain all valid fragments from our parsed strings.

lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS))
fragments = lf.fragment(mystrings)
for key in fragments:
    print("%s: %d" % (key, len(fragments[key])))
<start>: 7
<statements>: 18
<statement>: 18
<assignment>: 18
<identifier>: 37
<word>: 0
<alpha>: 0
<expr>: 39
<term>: 50
<factor>: 51
<number>: 23
<integer>: 0
<digit>: 0

All that remains is to actually find a place to poke a hole using candidate(), and patch that hole using generate_new_tree(). We will explain how to do this next.

But before that, we update our initialization method with a call to fragment().

import random
class LangFuzzer(LangFuzzer):
    def __init__(self, parser, strings):
        self.parser = parser
        self.fragments = {k: [] for k in self.parser.cgrammar}
        self.fragment(strings)

Candidate

LangFuzzer accepts a list of strings, which are stored as derivation trees in the object.

The method candidate() chooses one of the derivation trees randomly as the template, and identifies a node that can be replaced by another node different from itself. That is, it chooses a node such that, if the nonterminal name of the node is node_type, there is at least one other entry in fragments[node_type].

class LangFuzzer(LangFuzzer):
    def candidate(self):
        tree, nodes = random.choice(self.trees)
        interesting_nodes = [
            n for n in nodes if nodes[n][0] in self.fragments
            and len(self.fragments[nodes[n][0]]) > 1
        ]
        node = random.choice(interesting_nodes)
        return tree, node

Here is how it is used -- the red node is the node chosen.

random.seed(1)
lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)
tree, node = lf.candidate()
def hl_predicate(_d, nid, _s, _a): return nid in {node}
display_tree(tree, node_attr=highlight_node(hl_predicate))
(Derivation tree of a=1;b=2;c=a+b, with the chosen node highlighted in red.)

Generate New Tree

Once we have identified the node, one can generate a new tree by replacing that node with another node of similar type from our fragment pool.

class LangFuzzer(LangFuzzer):
    def generate_new_tree(self, node, choice):
        name, children, id = node
        if id == choice:
            return random.choice(self.fragments[name])
        else:
            return (name, [self.generate_new_tree(c, choice)
                           for c in children])

Again, the red node indicates where the replacement has occurred.

random.seed(1)
lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)
tree, node = lf.candidate()
def hl_predicate(_d, nid, _s, _a): return nid in {node}
new_tree = lf.generate_new_tree(tree, node)
for s in [tree_to_string(i) for i in [tree, new_tree]]:
    print(s)
display_tree(new_tree, node_attr=highlight_node(hl_predicate))
a=1;b=2;c=a+b
a=1;b=2;b=2
(New derivation tree of a=1;b=2;b=2, with the replaced node highlighted in red.)

Fuzz

The fuzz() method simply calls the procedures defined before in order.

class LangFuzzer(LangFuzzer):
    def fuzz(self):
        tree, node = self.candidate()
        modified = self.generate_new_tree(tree, node)
        return tree_to_string(modified)

Here is our fuzzer in action.

lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)
for i in range(10):
    print(lf.fuzz())
x=23;bvar=avar-3*(4+300)
x=23;y=51;z=x*x-(x+y)*(x-y)
x=10;y=20;z=(1.3)*(x-y)
abc=12+(12+3.3)
x=23;y=51;z=y*x-y*y
a=10;b=20;c=34;d=-b+(b*b-4*y)/(2*a)
abc=12+((4+3+(2/a))+3.3)
x=10;y=20;z=(x+y)*(x-y)
abc=12+(3+3.3)
x=10;y=20;z=(x+y)*(x-y)

How effective was our fuzzing? Let us find out!

trials = 100

lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)
valid = []
time = 0
for i in range(trials):
    with Timer() as t:
        s = lf.fuzz()
        try:
            exec(s, {}, {})
            valid.append((s, t.elapsed_time()))
        except:
            pass
        time += t.elapsed_time()
print("%d valid strings, that is LangFuzzer generated %f%% valid entries" %
      (len(valid), len(valid) * 100.0 / trials))
print("Total time of %f seconds" % time)
61 valid strings, that is LangFuzzer generated 61.000000% valid entries
Total time of 0.012519 seconds
gf = GrammarFuzzer(VAR_GRAMMAR)
valid = []
time = 0
for i in range(trials):
    with Timer() as t:
        s = gf.fuzz()
        try:
            exec(s, {}, {})
            valid.append(s)
        except:
            pass
        time += t.elapsed_time()
print("%d valid strings, that is GrammarFuzzer generated %f%% valid entries" %
      (len(valid), len(valid) * 100.0 / trials))
print("Total time of %f seconds" % time)
4 valid strings, that is GrammarFuzzer generated 4.000000% valid entries
Total time of 1.572683 seconds

That is, our LangFuzzer is rather effective at generating valid entries when compared to the GrammarFuzzer.

Parsing Context-Free Grammars

Problems with PEG

While PEGs are simple at first sight, their behavior in some cases might be a bit unintuitive. Here is an example [Redziejowski et al, 2008.]:

PEG_SURPRISE = {
    "<A>": ["a<A>a", "aa"]
}

When interpreted as a CFG and used as a string generator, it will produce strings of the form aa, aaaa, aaaaaa, and so on; that is, it produces strings where the number of a's is $2n$ for $n > 0$.

strings = []
for e in range(4):
    f = GrammarFuzzer(PEG_SURPRISE, start_symbol='<A>')
    tree = ('<A>', None)
    for _ in range(e):
        tree = f.expand_tree_once(tree)
    tree = f.expand_tree_with_strategy(tree, f.expand_node_min_cost)
    strings.append(tree_to_string(tree))
    display_tree(tree)
strings
(Four derivation trees, one per expansion depth.)
['aa', 'aaaa', 'aaaaaa', 'aaaaaaaa']

However, the PEG parser can only recognize strings of the form $a^{2^n}$, that is, strings of a's whose length is a power of two:

peg = PEGParser(PEG_SURPRISE, start_symbol='<A>')
for s in strings:
    with ExpectError():
        for tree in peg.parse(s):
            display_tree(tree)
        print(s)
(Derivation tree for 'aa'.)
aa
(Derivation tree for 'aaaa'.)
aaaa
Traceback (most recent call last):
  File "<ipython-input-96-dec55ebf796e>", line 4, in <module>
    for tree in peg.parse(s):
  File "<ipython-input-49-247e45c602ef>", line 22, in parse
    raise SyntaxError("at " + repr(text[cursor:]))
  File "<string>", line None
SyntaxError: at 'aa' (expected)
(Derivation tree for 'aaaaaaaa'.)
aaaaaaaa

This is not the only problem with Parsing Expression Grammars. While PEGs are expressive and the packrat parser for parsing them is simple and intuitive, PEGs suffer from a major deficiency for our purposes. PEGs are oriented towards language recognition, and it is not clear how to translate an arbitrary PEG to a CFG. As we mentioned earlier, a naive re-interpretation of a PEG as a CFG does not work very well. Further, it is not clear what the exact relation is between the class of languages represented by PEGs and the class of languages represented by CFGs. Since our primary focus is fuzzing, that is, the generation of strings, we next look at parsers that can accept context-free grammars.

The general idea of a CFG parser is the following: peek at the input text for the allowed number of characters, and use these, together with the parser state, to determine which rules can be applied to complete the parsing. We next look at a typical CFG parsing algorithm, the Earley parser.

The Earley Parser

The Earley parser is a general parser that is able to parse any arbitrary CFG. It was invented by Jay Earley [Earley et al, 1970.] for use in computational linguistics. While its computational complexity is $O(n^3)$ for parsing strings with arbitrary grammars, it can parse strings with unambiguous grammars in $O(n^2)$ time, and all LR(k) grammars in linear time ($O(n)$ [Joop M.I.M. Leo, 1991.]). Further improvements – notably handling epsilon rules – were invented by Aycock et al. [John Aycock et al, 2002.].

Note that one restriction of our implementation is that the start symbol can have only one expansion alternative. This is not a restriction in practice, because any grammar with multiple alternatives for its start symbol can be extended with a new start symbol that has the original start symbol as its only choice. That is, given a grammar as below,

grammar = {
    '<start>': ['<A>', '<B>'],
    ...
}

one may rewrite it as below to conform to the single-alternative rule.

grammar = {
    '<start>': ['<start_>'],
    '<start_>': ['<A>', '<B>'],
    ...
}

We first implement a simpler parser that works for nearly all CFGs, but not quite all. In particular, our parser does not understand epsilon rules – rules that derive the empty string. We show later how the parser can be extended to handle these.

We use the following grammar in our examples below.

SAMPLE_GRAMMAR = {
    '<start>': ['<A><B>'],
    '<A>': ['a<B>c', 'a<A>'],
    '<B>': ['b<C>', '<D>'],
    '<C>': ['c'],
    '<D>': ['d']
}
C_SAMPLE_GRAMMAR = canonical(SAMPLE_GRAMMAR)
syntax_diagram(SAMPLE_GRAMMAR)
(Syntax diagram of SAMPLE_GRAMMAR, showing the expansions of <start>, <A>, <B>, <C>, and <D>.)

The basic idea of Earley parsing is the following:

  • Start with the alternative expressions corresponding to the START_SYMBOL. These represent the possible ways to parse the string from a high level; essentially, each expression represents a parsing path. Queue each expression in our set of possible parses of the string. The parsed index of an expression marks the part of the expression that has already been recognized. At the beginning of the parse, the parsed index of all expressions is at the beginning. Further, each letter gets a queue of expressions that recognize that letter at that point in our parse.
  • Examine our queue of possible parses and check if any of them start with a nonterminal. If it does, then that nonterminal needs to be recognized from the input before the given rule can be parsed. Hence, add the alternative expressions corresponding to the nonterminal to the queue. Do this recursively.
  • At this point, we are ready to advance. Examine the current letter in the input, and select all expressions that have that particular letter at the parsed index. These expressions can now advance one step. Advance these selected expressions by incrementing their parsed index and add them to the queue of expressions in line for recognizing the next input letter.
  • If while doing these things, we find that any of the expressions have finished parsing, we fetch its corresponding nonterminal, and advance all expressions that have that nonterminal at their parsed index.
  • Continue this procedure recursively until all expressions that we have queued for the current letter have been processed. Then start processing the queue for the next letter.

We explain each step in detail with examples in the coming sections.

The parser uses dynamic programming to generate a table containing a forest of possible parses at each letter index – the table contains one column for each letter of the input, plus one for the start, and each column contains different parsing rules at various stages of the parse.

For example, given an input adcd, Column 0 would contain the following:

<start> : ● <A> <B>

which is the starting rule, indicating that we are currently parsing the rule <start>, and that the parsing state is just before identifying the symbol <A>. It would also contain the following, which are two alternative paths it could take to complete the parsing.

<A> : ● a <B> c
<A> : ● a <A>

Column 1 would contain the following, which represents the possible completion after reading a.

<A> : a ● <B> c
<A> : a ● <A>
<B> : ● b <C>
<B> : ● <D>
<A> : ● a <B> c
<A> : ● a <A>
<D> : ● d

Column 2 would contain the following after reading d

<D> : d ●
<B> : <D> ●
<A> : a <B> ● c

Similarly, Column 3 would contain the following after reading c

<A> : a <B> c ●
<start> : <A> ● <B>
<B> : ● b <C>
<B> : ● <D>
<D> : ● d

Finally, Column 4 would contain the following after reading d, with the ● at the end of the <start> rule indicating that the parse was successful.

<D> : d ●
<B> : <D> ●
<start> : <A> <B> ●

As you can see from above, we are essentially filling a table (also called a chart) of entries based on each letter we read and the grammar rules that can be applied. This chart gives the parser its other name – chart parsing.

Columns

We define the Column first. A Column is initialized with its own index in the input string and the letter at that index. Internally, we also keep track of the states that are added to the column as the parsing progresses.

class Column(object):
    def __init__(self, index, letter):
        self.index, self.letter = index, letter
        self.states, self._unique = [], {}

    def __str__(self):
        return "%s chart[%d]\n%s" % (self.letter, self.index, "\n".join(
            str(state) for state in self.states if state.finished()))

The Column only stores unique states. Hence, when a new state is added to our Column, we check whether it is already known.

class Column(Column):
    def add(self, state):
        if state in self._unique:
            return self._unique[state]
        self._unique[state] = state
        self.states.append(state)
        state.e_col = self
        return self._unique[state]

Items

An item represents a parse in progress for a specific rule. Hence the item contains the name of the nonterminal, and the corresponding alternative expression (expr) which together form the rule, and the current position of parsing in this expression -- dot.

Note. If you are familiar with LR parsing, you will notice that an item is simply an LR(0) item.

class Item(object):
    def __init__(self, name, expr, dot):
        self.name, self.expr, self.dot = name, expr, dot

We also provide a few convenience methods. The method finished() checks if the dot has moved beyond the last element in expr. The method advance() produces a new Item with the dot advanced one token, and represents an advance of the parsing. The method at_dot() returns the current symbol being parsed.

class Item(Item):
    def finished(self):
        return self.dot >= len(self.expr)

    def advance(self):
        return Item(self.name, self.expr, self.dot + 1)

    def at_dot(self):
        return self.expr[self.dot] if self.dot < len(self.expr) else None

Here is how an item could be used. We first define our item:

item_name = '<B>'
item_expr = C_SAMPLE_GRAMMAR[item_name][1]
an_item = Item(item_name, tuple(item_expr), 0)

To determine the current status of parsing, we use at_dot().

an_item.at_dot()
'<D>'

That is, the next symbol to be parsed is <D>.

If we advance the item, we get another item that represents a finished parse of the rule <B>.

another_item = an_item.advance()
another_item.finished()
True
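
Note that advance() returns a fresh Item and leaves the original untouched; this immutability allows several alternative continuations to be explored from the same parsing point. A quick check:

an_item.finished(), another_item.finished()
(False, True)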

States

For Earley parsing, the state of the parsing is simply one Item along with some meta information, such as the starting column s_col and the ending column e_col for each state. Hence we inherit from Item to create a State. Since we are interested in comparing states, we define __hash__() and __eq__() accordingly.

class State(Item):
    def __init__(self, name, expr, dot, s_col, e_col=None):
        super().__init__(name, expr, dot)
        self.s_col, self.e_col = s_col, e_col

    def __str__(self):
        def idx(var):
            return var.index if var else -1

        return self.name + ':= ' + ' '.join([
            str(p)
            for p in [*self.expr[:self.dot], '|', *self.expr[self.dot:]]
        ]) + "(%d,%d)" % (idx(self.s_col), idx(self.e_col))

    def copy(self):
        return State(self.name, self.expr, self.dot, self.s_col, self.e_col)

    def _t(self):
        return (self.name, self.expr, self.dot, self.s_col.index)

    def __hash__(self):
        return hash(self._t())

    def __eq__(self, other):
        return self._t() == other._t()

    def advance(self):
        return State(self.name, self.expr, self.dot + 1, self.s_col)

The usage of State is similar to that of Item. The only difference is that it is used along with the Column to track the parsing state. For example, we initialize the first column as follows:

col_0 = Column(0, None)
item_expr = tuple(*C_SAMPLE_GRAMMAR[START_SYMBOL])
start_state = State(START_SYMBOL, item_expr, 0, col_0)
col_0.add(start_state)
start_state.at_dot()
'<A>'

The first column is then updated by using the add() method of Column.

sym = start_state.at_dot()
for alt in C_SAMPLE_GRAMMAR[sym]:
    col_0.add(State(sym, tuple(alt), 0, col_0))
for s in col_0.states:
    print(s)
<start>:= | <A> <B>(0,0)
<A>:= | a <B> c(0,0)
<A>:= | a <A>(0,0)
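
Since State defines __hash__() and __eq__(), the Column recognizes states it has already seen: adding an equivalent state a second time leaves the column unchanged. A quick check, reusing the column and states constructed above:

col_0.add(State(START_SYMBOL, item_expr, 0, col_0))
assert len(col_0.states) == 3  # the duplicate start state was not added again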

The Parsing Algorithm

The Earley algorithm starts by initializing the chart with columns – one for each letter in the input, plus an initial column. We also seed the first column with a state representing the expression corresponding to the start symbol. In our case, the state corresponding to the start symbol, with the dot at position 0, is represented as below. The ● symbol marks the current parsing position; in this case, we have not parsed anything yet.

<start>: ● <A> <B>

We pass this partial chart to a method for filling the rest of the parse chart.

class EarleyParser(Parser):
    def __init__(self, grammar, **kwargs):
        super().__init__(grammar, **kwargs)
        self.cgrammar = canonical(grammar, letters=True)

Before starting to parse, we seed the chart with the state representing the ongoing parse of the start symbol.

class EarleyParser(EarleyParser):
    def chart_parse(self, words, start):
        alt = tuple(*self.cgrammar[start])
        chart = [Column(i, tok) for i, tok in enumerate([None, *words])]
        chart[0].add(State(start, alt, 0, chart[0]))
        return self.fill_chart(chart)

The main parsing loop in fill_chart() has three fundamental operations: predict(), scan(), and complete(). We discuss predict() next.

Predicting States

We have already seeded chart[0] with a state [<A>,<B>] with dot at 0. Next, given that <A> is a nonterminal, we predict the possible parse continuations of this state. That is, it could be either a <B> c or a <A>.

The general idea of predict() is as follows: Say you have a state with name <A> from the above grammar, and expression containing [a,<B>,c]. Imagine that you have seen a already, which means that the dot will be on <B>. Below is a representation of our parse status. The left hand side of ● represents the portion already parsed (a), and the right hand side represents the portion yet to be parsed (<B> c).

<A>: a  ●  <B> c

To recognize <B>, we look at the definition of <B>, which has different alternative expressions. The predict() step adds each of these alternatives to the set of states, with dot at 0.

<A>: a ● <B> c
<B>: ● b <C>
<B>: ● <D>

In essence, the predict() method, when called with the current nonterminal, fetches the alternative expressions corresponding to this nonterminal, and adds these as predicted child states to the current column.

class EarleyParser(EarleyParser):
    def predict(self, col, sym, state):
        for alt in self.cgrammar[sym]:
            col.add(State(sym, tuple(alt), 0, col))

To see how to use predict(), we first construct the 0th column as before, and assign the constructed column to an instance of EarleyParser.

col_0 = Column(0, None)
col_0.add(start_state)
ep = EarleyParser(SAMPLE_GRAMMAR)
ep.chart = [col_0]

It should contain a single state -- <start> at 0.

for s in ep.chart[0].states:
    print(s)
<start>:= | <A> <B>(0,0)

We apply predict to fill out the 0th column, and the column should contain the possible parse paths.

ep.predict(col_0, '<A>', s)
for s in ep.chart[0].states:
    print(s)
<start>:= | <A> <B>(0,0)
<A>:= | a <B> c(0,0)
<A>:= | a <A>(0,0)

Scanning Tokens

What if, rather than a nonterminal, the state contains a terminal symbol such as a letter? In that case, we are ready to make some progress. For example, consider the second state:

<B>: ● b <C>

We scan the next column's letter. Say the next letter is b. If the letter matches what we expect, then we create a new state by advancing the current state by one position.

<B>: b ● <C>

This new state is added to the next column (i.e., the column that has the matched letter).

class EarleyParser(EarleyParser):
    def scan(self, col, state, letter):
        if letter == col.letter:
            col.add(state.advance())

As before, we construct the partial parse first, this time adding a new column so that we can observe the effects of scan().

ep = EarleyParser(SAMPLE_GRAMMAR)
col_1 = Column(1, 'a')
ep.chart = [col_0, col_1]
new_state = ep.chart[0].states[1]
print(new_state)
<A>:= | a <B> c(0,0)
ep.scan(col_1, new_state, 'a')
for s in ep.chart[1].states:
    print(s)
<A>:= a | <B> c(0,1)

Completing Processing

When we advance, what if we actually complete() the processing of the current rule? If so, we want to update not just this state, but also all the parent states from which this state was derived. For example, say we have states as below.

<A>: a ● <B> c
<B>: b <C> ●

The state <B>: b <C> ● is now complete. So, we need to advance <A>: a ● <B> c one step forward.

How do we determine the parent states? Note from predict() that we added the predicted child states to the same column as the inspected state. Hence, we look in the starting column of the current state for states whose symbol at_dot matches the name of the completed state.

For each such parent found, we advance that parent (since we have just finished parsing the nonterminal at its dot) and add the new states to the current column.

class EarleyParser(EarleyParser):
    def complete(self, col, state):
        return self.earley_complete(col, state)

    def earley_complete(self, col, state):
        parent_states = [
            st for st in state.s_col.states if st.at_dot() == state.name
        ]
        for st in parent_states:
            col.add(st.advance())

Here is an example of complete() in action. First we fill Column 0 using predict().

ep = EarleyParser(SAMPLE_GRAMMAR)
col_1 = Column(1, 'a')
col_2 = Column(2, 'd')
ep.chart = [col_0, col_1, col_2]
ep.predict(col_0, '<A>', s)
for s in ep.chart[0].states:
    print(s)
<start>:= | <A> <B>(0,0)
<A>:= | a <B> c(0,0)
<A>:= | a <A>(0,0)

Then we use scan() to populate Column 1.

for state in ep.chart[0].states:
    if state.at_dot() not in SAMPLE_GRAMMAR:
        ep.scan(col_1, state, 'a')
for s in ep.chart[1].states:
    print(s)
<A>:= a | <B> c(0,1)
<A>:= a | <A>(0,1)
for state in ep.chart[1].states:
    if state.at_dot() in SAMPLE_GRAMMAR:
        ep.predict(col_1, state.at_dot(), state)
for s in ep.chart[1].states:
    print(s)
<A>:= a | <B> c(0,1)
<A>:= a | <A>(0,1)
<B>:= | b <C>(1,1)
<B>:= | <D>(1,1)
<A>:= | a <B> c(1,1)
<A>:= | a <A>(1,1)
<D>:= | d(1,1)

Then we use scan() again to populate Column 2.

for state in ep.chart[1].states:
    if state.at_dot() not in SAMPLE_GRAMMAR:
        ep.scan(col_2, state, state.at_dot())

for s in ep.chart[2].states:
    print(s)
<D>:= d |(1,2)

Now, we can use complete():

for state in ep.chart[2].states:
    if state.finished():
        ep.complete(col_2, state)

for s in ep.chart[2].states:
    print(s)
<D>:= d |(1,2)
<B>:= <D> |(1,2)
<A>:= a <B> | c(0,2)

Filling the Chart

The main driving loop in fill_chart() essentially calls these three operations in sequence, processing each column in turn.

  • For each column, fetch one state in the column at a time, and check if the state is finished.
    • If it is, then we complete() all the parent states depending on this state.
  • If the state was not finished, we check to see if the state's current symbol at_dot is a nonterminal.
    • If it is a nonterminal, we predict() possible continuations, and update the current column with these states.
    • If it was not, we scan() the next column and advance the current state if it matches the next letter.
class EarleyParser(EarleyParser):
    def fill_chart(self, chart):
        for i, col in enumerate(chart):
            for state in col.states:
                if state.finished():
                    self.complete(col, state)
                else:
                    sym = state.at_dot()
                    if sym in self.cgrammar:
                        self.predict(col, sym, state)
                    else:
                        if i + 1 >= len(chart):
                            continue
                        self.scan(chart[i + 1], state, sym)
            if self.log:
                print(col, '\n')
        return chart

We can now recognize whether a given string belongs to the language represented by a grammar.

ep = EarleyParser(SAMPLE_GRAMMAR, log=True)
columns = ep.chart_parse('adcd', START_SYMBOL)
None chart[0]

a chart[1]

d chart[2]
<D>:= d |(1,2)
<B>:= <D> |(1,2) 

c chart[3]
<A>:= a <B> c |(0,3) 

d chart[4]
<D>:= d |(3,4)
<B>:= <D> |(3,4)
<start>:= <A> <B> |(0,4) 

The chart we printed above only shows completed entries at each index. The parenthesized pair indicates the starting column (the column just before the state's first character was recognized) and the ending column.

Notice how the <start> nonterminal shows fully parsed status.

last_col = columns[-1]
for s in last_col.states:
    if s.name == '<start>':
        print(s)
<start>:= <A> <B> |(0,4)

Since chart_parse() returns the completed table, we now need to extract the derivation trees.

The Parse Method

To determine how far we have managed to parse, we simply look for the last column from chart_parse() in which the start symbol occurs.

class EarleyParser(EarleyParser):
    def parse_prefix(self, text):
        self.table = self.chart_parse(text, self.start_symbol())
        for col in reversed(self.table):
            states = [
                st for st in col.states if st.name == self.start_symbol()
            ]
            if states:
                return col.index, states
        return -1, []

Here is the parse_prefix() in action.

ep = EarleyParser(SAMPLE_GRAMMAR)
cursor, last_states = ep.parse_prefix('adcd')
print(cursor, [str(s) for s in last_states])
4 ['<start>:= <A> <B> |(0,4)']
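
If the input cannot be parsed completely, parse_prefix() still reports how far the parser got. A quick check with an input whose last letter cannot be matched; we expect the cursor to stop at 3, right after adc:

cursor, _ = ep.parse_prefix('adcz')
assert cursor == 3  # 'adc' was recognized; 'z' was not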

The following is adapted from the excellent reference on Earley parsing by Loup Vaillant.

Our parse() method is as follows. It depends on two methods parse_forest() and extract_trees() that will be defined next.

class EarleyParser(EarleyParser):
    def parse(self, text):
        cursor, states = self.parse_prefix(text)
        start = next((s for s in states if s.finished()), None)
        if cursor < len(text) or not start:
            raise SyntaxError("at " + repr(text[cursor:]))

        forest = self.extract_trees(self.parse_forest(self.table, start))
        return [self.prune_tree(tree) for tree in forest]

Parsing Paths

The parse_paths() method tries to unify the given expression in named_expr with the parsed string. For that, it extracts the last symbol in named_expr and checks whether it is a terminal symbol. If it is, it checks whether the chart at index til holds that very letter. If it does, the start index for the remaining expression is obtained by subtracting the length of the symbol from til.

If the symbol is a nonterminal, then we retrieve the finished states at the current end column index (til) whose name matches the nonterminal, and collect their start indexes. These start indexes become the end column indexes for the remaining expression.

Given our list of start indexes, we obtain the parse paths from the remaining expression. If we can obtain any, then we return the parse paths. If not, we return an empty list.

class EarleyParser(EarleyParser):
    def parse_paths(self, named_expr, chart, frm, til):
        def paths(state, start, k, e):
            if not e:
                return [[(state, k)]] if start == frm else []
            else:
                return [[(state, k)] + r
                        for r in self.parse_paths(e, chart, frm, start)]

        *expr, var = named_expr
        starts = None
        if var not in self.cgrammar:
            starts = ([(var, til - len(var),
                        't')] if til > 0 and chart[til].letter == var else [])
        else:
            starts = [(s, s.s_col.index, 'n') for s in chart[til].states
                      if s.finished() and s.name == var]

        return [p for s, start, k in starts for p in paths(s, start, k, expr)]

Here is the parse_paths() in action.

print(SAMPLE_GRAMMAR['<start>'])
ep = EarleyParser(SAMPLE_GRAMMAR)
completed_start = last_states[0]
paths = ep.parse_paths(completed_start.expr, columns, 0, 4)
for path in paths:
    print([list(str(s_) for s_ in s) for s in path])
['<A><B>']
[['<B>:= <D> |(3,4)', 'n'], ['<A>:= a <B> c |(0,3)', 'n']]

That is, the parse path for <start> given the input adcd included recognizing the expression <A><B>. This was recognized by the two states: <A> spanning columns 0 to 3, which further involved recognizing the rule a<B>c, and the next state <B> spanning columns 3 to 4, which involved recognizing the rule <D>.

Parsing Forests

The parse_forest() method takes the state which represents the completed parse, and determines the possible ways that its expressions corresponded to the parsed expression. For example, say we are parsing 1+2+3, and the state has [<expr>,+,<expr>] in expr. It could have been parsed as either [{<expr>:1+2},+,{<expr>:3}] or [{<expr>:1},+,{<expr>:2+3}].

class EarleyParser(EarleyParser):
    def parse_forest(self, chart, state):
        def forest(s, kind):
            return self.parse_forest(chart, s) if kind == 'n' else (s, [])

        pathexprs = self.parse_paths(state.expr, chart, state.s_col.index,
                                     state.e_col.index) if state.expr else []
        return state.name, [[forest(v, k) for v, k in reversed(pathexpr)]
                            for pathexpr in pathexprs]
ep = EarleyParser(SAMPLE_GRAMMAR)
result = ep.parse_forest(columns, last_states[0])
result
('<start>',
 [[('<A>', [[('a', []), ('<B>', [[('<D>', [[('d', [])]])]]), ('c', [])]]),
   ('<B>', [[('<D>', [[('d', [])]])]])]])

Extracting Trees

What we have from parse_forest() is a forest of trees. We need to extract a single tree from that forest. That is accomplished as follows.

(For now, we return the first available derivation tree. To do that, we need to extract the parse forest from the state corresponding to start.)

class EarleyParser(EarleyParser):
    def extract_a_tree(self, forest_node):
        name, paths = forest_node
        if not paths:
            return (name, [])
        return (name, [self.extract_a_tree(p) for p in paths[0]])

    def extract_trees(self, forest):
        return [self.extract_a_tree(forest)]

We now verify that our parser can parse a given expression.

A3_GRAMMAR = {
    "<start>": ["<expr>"],
    "<expr>": ["<expr>+<expr>", "<expr>-<expr>", "(<expr>)", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
syntax_diagram(A3_GRAMMAR)
start
expr
expr
expr - expr expr + expr ( expr ) integer
integer
digit integer digit
digit
0 1 2 3 4 5 6 7 8 9
mystring = '(1+24)-33'
parser = EarleyParser(A3_GRAMMAR)
for tree in parser.parse(mystring):
    assert tree_to_string(tree) == mystring
    display_tree(tree)
(derivation tree diagram for (1+24)-33)

We now have a complete parser that can parse almost arbitrary CFGs. There remains a small corner case to fix -- the handling of epsilon rules, as we will see later.

Ambiguous Parsing

Ambiguous grammars are grammars that can produce multiple derivation trees for some given string. For example, the A3_GRAMMAR can parse 1+2+3 in two different ways – [1+2]+3 and 1+[2+3].

Extracting a single tree might be reasonable for unambiguous parses. However, what if the given grammar produces ambiguity when given a string? We need to extract all derivation trees in that case. We enhance our extract_trees() method to extract multiple derivation trees.

import itertools

class EarleyParser(EarleyParser):
    def extract_trees(self, forest_node):
        name, paths = forest_node
        if not paths:
            return [(name, [])]
        results = []
        for path in paths:
            # Enumerate every combination of subtrees of the path
            # elements, so that no derivation is lost even when the
            # elements differ in their number of possible subtrees.
            ptrees = itertools.product(*[self.extract_trees(p) for p in path])
            results.extend([(name, p) for p in ptrees])
        return results
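
With the enhanced extract_trees(), the ambiguous 1+2+3 example from above should yield both of its derivation trees. A quick check using the A3_GRAMMAR defined earlier:

mystring = '1+2+3'
parser = EarleyParser(A3_GRAMMAR)
trees = parser.parse(mystring)
assert len(trees) == 2  # [1+2]+3 and 1+[2+3]
for tree in trees:
    assert tree_to_string(tree) == mystring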

As before, we verify that everything works.

mystring = '12+23-34'
parser = EarleyParser(A1_GRAMMAR)
for tree in parser.parse(mystring):
    assert mystring == tree_to_string(tree)
    display_tree(tree)
(derivation tree diagram for 12+23-34, grouped as [12+23]-34)
(derivation tree diagram for 12+23-34, grouped as 12+[23-34])

One can also use a GrammarFuzzer to verify that everything works.

gf = GrammarFuzzer(A1_GRAMMAR)
for i in range(5):
    s = gf.fuzz()
    print(i, s)
    for tree in parser.parse(s):
        assert tree_to_string(tree) == s
0 3
1 2-2-2-8-8-8+3-6-9+5
2 9-48+0-5-3+9+90+5-6+8-847
3 2-7-62-8-7+7+5-3-6+1
4 8-1+2-5+4+9-0-4+6-1

The Aycock Epsilon Fix

While parsing, one often needs to know whether a given nonterminal can derive an empty string. Nonterminals that can derive an empty string are called nullable nonterminals. For example, in the grammar E_GRAMMAR_1 below, <A> is nullable, and since <A> is one of the alternatives of <start>, <start> is also nullable. <B>, however, is not nullable.

E_GRAMMAR_1 = {
    '<start>': ['<A>', '<B>'],
    '<A>': ['a', ''],
    '<B>': ['b']
}

One of the problems with the original Earley implementation is that it does not handle rules that can derive empty strings very well. For example, the following grammar should match the string a:

EPSILON = ''
E_GRAMMAR = {
    '<start>': ['<S>'],
    '<S>': ['<A><A><A><A>'],
    '<A>': ['a', '<E>'],
    '<E>': [EPSILON]
}
syntax_diagram(E_GRAMMAR)
start
S
S
A A A A
A
a E
E
mystring = 'a'
parser = EarleyParser(E_GRAMMAR)
with ExpectError():
    trees = parser.parse(mystring)
Traceback (most recent call last):
  File "<ipython-input-146-61c19f40fd3a>", line 4, in <module>
    trees = parser.parse(mystring)
  File "<ipython-input-131-1b14a4332f3d>", line 6, in parse
    raise SyntaxError("at " + repr(text[cursor:]))
  File "<string>", line None
SyntaxError: at 'a' (expected)

Aycock et al. [John Aycock et al, 2002.] suggest a simple fix. Their idea is to pre-compute the nullable set and use it to advance the nullable states. Before we can do that, we need to compute the nullable set, which consists of all nonterminals that can derive an empty string.

Computing the nullable set requires iteratively expanding each production rule in the grammar and inspecting whether a given rule can derive the empty string. Each iteration needs to take into account new nonterminals that have been found to be nullable. The procedure stops when we obtain a stable result. This procedure can be abstracted into a more general method fixpoint.

Fixpoint

A fixpoint of a function is an element in the function's domain that is mapped to itself. For example, 1 is a fixpoint of square root because squareroot(1) == 1.

(We use str rather than hash to check for equality in fixpoint because the data structure set, which we would like to use as an argument, has a good string representation but is not hashable.)

def fixpoint(f):
    def helper(arg):
        while True:
            sarg = str(arg)
            arg_ = f(arg)
            if str(arg_) == sarg:
                return arg
            arg = arg_

    return helper

Remember my_sqrt() from the first chapter? We can define my_sqrt() using fixpoint.

def my_sqrt(x):
    @fixpoint
    def _my_sqrt(approx):
        return (approx + x / approx) / 2

    return _my_sqrt(1)
my_sqrt(2)
1.414213562373095

Nullable

Similarly, we can define nullable using fixpoint. We essentially provide the definition of a single intermediate step. That is, assuming that nullables contains the current nullable nonterminals, we iterate over the grammar looking for productions that are nullable -- that is, productions whose entire sequence can yield an empty string on some expansion.

We need to iterate over the different alternative expressions and their corresponding nonterminals. Hence we define a rules() method that converts our dictionary representation to this pair format.

def rules(grammar):
    return [(key, choice)
            for key, choices in grammar.items()
            for choice in choices]
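
To see the pair format, we can apply rules() to the canonical form of E_GRAMMAR_1 from above; each alternative becomes its own (nonterminal, expansion) pair:

for key, choice in rules(canonical(E_GRAMMAR_1)):
    print(key, choice)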

The terminals() method extracts all terminal symbols from a canonical grammar representation.

def terminals(grammar):
    return set(token
               for key, choice in rules(grammar)
               for token in choice if token not in grammar)
def nullable_expr(expr, nullables):
    return all(token in nullables for token in expr)
def nullable(grammar):
    productions = rules(grammar)

    @fixpoint
    def nullable_(nullables):
        for A, expr in productions:
            if nullable_expr(expr, nullables):
                nullables |= {A}
        return nullables

    return nullable_({EPSILON})
for key, grammar in {
        'E_GRAMMAR': E_GRAMMAR,
        'E_GRAMMAR_1': E_GRAMMAR_1
}.items():
    print(key, nullable(canonical(grammar)))
E_GRAMMAR {'', '<A>', '<start>', '<E>', '<S>'}
E_GRAMMAR_1 {'', '<A>', '<start>'}
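
Nullability propagates transitively: a nonterminal that derives the empty string only via other nullable nonterminals is found in a later iteration -- exactly what the fixpoint computation provides. Here is a quick check with a small chain grammar defined just for this illustration (E_GRAMMAR_2 is our own example, not used elsewhere):

E_GRAMMAR_2 = {
    '<start>': ['<A>'],
    '<A>': ['<B><B>'],
    '<B>': ['b', EPSILON]
}
assert nullable(canonical(E_GRAMMAR_2)) >= {'<B>', '<A>', '<start>'}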

Once we have the nullable set, the fix is simple: after predict() has added the alternatives for a nonterminal, we check whether that nonterminal is nullable. If it is, we advance the current state and add the result to the current column.

class EarleyParser(EarleyParser):
    def __init__(self, grammar, **kwargs):
        super().__init__(grammar, **kwargs)
        self.cgrammar = canonical(grammar, letters=True)
        self.epsilon = nullable(self.cgrammar)

    def predict(self, col, sym, state):
        for alt in self.cgrammar[sym]:
            col.add(State(sym, tuple(alt), 0, col))
        if sym in self.epsilon:
            col.add(state.advance())
mystring = 'a'
parser = EarleyParser(E_GRAMMAR)
for tree in parser.parse(mystring):
    display_tree(tree)
(four derivation tree diagrams for a, one for each of the four <A> symbols expanding to a while the other three expand to the empty <E>)

More Earley Parsing

A number of other optimizations exist for Earley parsers. A fast industrial strength Earley parser implementation is the Marpa parser. Further, Earley parsing need not be restricted to character data. One may also parse streams (audio and video streams) [Qi et al, 2018.] using a generalized Earley parser.

Testing the Parsers

While we have defined two parser variants, it would be nice to have some confirmation that our parsers work well. While it is possible to formally prove that they work, it is much more satisfying to generate random grammars along with corresponding strings, and check that parsing these strings with the same grammar succeeds.

def prod_line_grammar(nonterminals, terminals):
    g = {
        '<start>': ['<symbols>'],
        '<symbols>': ['<symbol><symbols>', '<symbol>'],
        '<symbol>': ['<nonterminals>', '<terminals>'],
        '<nonterminals>': ['<lt><alpha><gt>'],
        '<lt>': ['<'],
        '<gt>': ['>'],
        '<alpha>': nonterminals,
        '<terminals>': terminals
    }

    if not nonterminals:
        g['<nonterminals>'] = ['']
        del g['<lt>']
        del g['<alpha>']
        del g['<gt>']

    return g
syntax_diagram(prod_line_grammar(["A", "B", "C"], ["1", "2", "3"]))
start
symbols
symbols
symbol symbols symbol
symbol
nonterminals terminals
nonterminals
lt alpha gt
lt
<
gt
>
alpha
A B C
terminals
1 2 3
import random

def make_rule(nonterminals, terminals, num_alts):
    prod_grammar = prod_line_grammar(nonterminals, terminals)

    gf = GrammarFuzzer(prod_grammar, min_nonterminals=3, max_nonterminals=5)
    name = "<%s>" % ''.join(random.choices(string.ascii_uppercase, k=3))

    return (name, [gf.fuzz() for _ in range(num_alts)])
make_rule(["A", "B", "C"], ["1", "2", "3"], 3)
('<HDG>', ['<C>1', '<A>31', '13333'])
def make_grammar(num_symbols=3, num_alts=3):
    terminals = list(string.ascii_lowercase)
    grammar = {}
    name = None
    for _ in range(num_symbols):
        nonterminals = [k[1:-1] for k in grammar.keys()]
        name, expansions = \
            make_rule(nonterminals, terminals, num_alts)
        grammar[name] = expansions

    grammar[START_SYMBOL] = [name]

    assert is_valid_grammar(grammar)
    return grammar
make_grammar()
{'<LLV>': ['gz', 'eg', 's'],
 '<GDO>': ['<LLV><LLV>', 'eobqe', '<LLV>e<LLV>'],
 '<VES>': ['<GDO>ws', '<GDO><LLV>', '<LLV><LLV>s'],
 '<start>': ['<VES>']}

Now we verify whether our arbitrary grammars can be used by the Earley parser.

for i in range(5):
    my_grammar = make_grammar()
    print(my_grammar)
    parser = EarleyParser(my_grammar)
    mygf = GrammarFuzzer(my_grammar)
    s = mygf.fuzz()
    print(s)
    for tree in parser.parse(s):
        assert tree_to_string(tree) == s
        display_tree(tree)
{'<EIZ>': ['q', 'ti', 'moa'], '<ISQ>': ['<EIZ>rrqy', '<EIZ>ss', 'hhb'], '<KWH>': ['v<ISQ>pd', '<ISQ>wung', '<EIZ>fo'], '<start>': ['<KWH>']}
qfo
(derivation tree diagram for qfo)
{'<OUZ>': ['cr', 'eyha', 'kp'], '<BTM>': ['<OUZ>tq', 'mk<OUZ>d', 'gi<OUZ>'], '<YIO>': ['itiih', 'c<OUZ><BTM>y', '<OUZ>zx'], '<start>': ['<YIO>']}
ceyhaeyhatqy
(derivation tree diagram for ceyhaeyhatqy)
{'<HKL>': ['jy', 'oigm', ''], '<KBP>': ['<HKL>a<HKL>', '<HKL>d', '<HKL>gr'], '<UHE>': ['<KBP>kk', '<KBP>jy', 'lqy'], '<start>': ['<UHE>']}
lqy
(derivation tree diagram for lqy)
{'<TOR>': ['r', 'znz', 's'], '<DOX>': ['<TOR>rqri', '<TOR><TOR>', '<TOR>oa'], '<VGG>': ['<TOR>k', '<DOX>uy', '<DOX>cn'], '<start>': ['<VGG>']}
roauy
(derivation tree diagram for roauy)
{'<IFU>': ['ifx', 'te', 'ver'], '<NIY>': ['<IFU>hl', '<IFU>mkt', '<IFU><IFU>'], '<JKU>': ['<IFU>i', '<IFU>xj', '<IFU>p<NIY>'], '<start>': ['<JKU>']}
verxj
(derivation tree diagram for verxj)

With this, we have completed both the implementation and the testing of a parser for arbitrary context-free grammars. It can now be used along with LangFuzzer to generate better fuzzing inputs.

Background

Numerous parsing techniques exist that can parse a given string using a given grammar and produce the corresponding derivation tree (or trees). However, some of these techniques work only on specific classes of grammars. These grammar classes are named after the kind of parser that can accept them. That is, the upper bound of a parser's capabilities defines the grammar class named after that parser.

LL and LR parsing are the main traditions in parsing. Here, LL means left-to-right, leftmost derivation, and it represents a top-down approach. LR (left-to-right, rightmost derivation), on the other hand, represents a bottom-up approach. Another way to look at it is that LL parsers compute the derivation tree incrementally in pre-order, while LR parsers compute the derivation tree in post-order [Pingali et al, 2015.].

Different classes of grammars differ in the features that are available to the user for writing a grammar of that class. That is, the corresponding kind of parser will be unable to parse a grammar that makes use of more features than allowed. For example, the A2_GRAMMAR is an LL grammar because it lacks left recursion, w