Parsing Inputs¶

In the chapter on Grammars, we discussed how grammars can be used to represent various languages, and how they can be used to generate strings of the corresponding language. Grammars can also be used in reverse: given a string, one can decompose it into the constituent parts that correspond to the parts of the grammar used to generate it – the derivation tree of that string. These parts (and parts from other similar strings) can later be recombined using the same grammar to produce new strings.

In this chapter, we use grammars to parse and decompose a given set of valid seed inputs into their corresponding derivation trees. This structural representation allows us to mutate, crossover, and recombine their parts in order to generate new valid, slightly changed inputs (i.e., fuzz).

Why Parsing for Fuzzing?¶

Why would one want to parse existing inputs in order to fuzz? Let us illustrate the problem with an example. Here is a simple program that accepts a CSV file of vehicle details and processes this information.

In [10]:
def process_inventory(inventory):
    res = []
    for vehicle in inventory.split('\n'):
        ret = process_vehicle(vehicle)
        res.extend(ret)
    return '\n'.join(res)

The CSV file contains details of one vehicle per line. Each row is processed in process_vehicle().

In [11]:
def process_vehicle(vehicle):
    year, kind, company, model, *_ = vehicle.split(',')
    if kind == 'van':
        return process_van(year, company, model)

    elif kind == 'car':
        return process_car(year, company, model)

    else:
        raise Exception('Invalid entry')

Depending on the kind of vehicle, the processing changes.

In [12]:
def process_van(year, company, model):
    res = ["We have a %s %s van from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2010:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res
In [13]:
def process_car(year, company, model):
    res = ["We have a %s %s car from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2016:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res

Here is a sample of inputs that process_inventory() accepts.

In [14]:
mystring = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar\
"""
print(process_inventory(mystring))
We have a Ford E350 van from 1997 vintage.
It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage.
It is an old but reliable model!

Let us try to fuzz this program. Given that process_inventory() takes a CSV file, we can write a simple grammar for generating comma-separated values, and generate the required CSV rows. For convenience, we fuzz process_vehicle() directly.

In [16]:
CSV_GRAMMAR: Grammar = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \t\n')
}

We need some infrastructure first for viewing the grammar.

In [17]:
syntax_diagram(CSV_GRAMMAR)
(Railroad diagrams for CSV_GRAMMAR:
start → csvline
csvline → items
items → item "," items | item
item → letters
letters → letter letters | letter
letter → any ASCII letter, digit, punctuation character, space, tab, or newline)

We generate 1000 values and evaluate process_vehicle() with each.

In [18]:
gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 1000
valid: List[str] = []
time = 0
for i in range(trials):
    with Timer() as t:
        vehicle_info = gf.fuzz()
        try:
            process_vehicle(vehicle_info)
            valid.append(vehicle_info)
        except:
            pass
        time += t.elapsed_time()
print("%d valid strings, that is GrammarFuzzer generated %f%% valid entries from %d inputs" %
      (len(valid), len(valid) * 100.0 / trials, trials))
print("Total time of %f seconds" % time)
0 valid strings, that is GrammarFuzzer generated 0.000000% valid entries from 1000 inputs
Total time of 2.478398 seconds

This is obviously not working. But why?

In [19]:
gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 10
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()
'9w9J\'/,LU<"l,|,Y,Zv)Amvx,c\n'	 Invalid entry
'(n8].H7,qolS'	 not enough values to unpack (expected at least 4, got 2)
'\nQoLWQ,jSa'	 not enough values to unpack (expected at least 4, got 2)
'K1,\n,RE,fq,%,,sT+aAb'	 Invalid entry
"m,d,,8j4'),-yQ,B7"	 Invalid entry
'g4,s1\t[}{.,M,<,\nzd,.am'	 Invalid entry
',Z[,z,c,#x1,gc.F'	 Invalid entry
'pWs,rT`,R'	 not enough values to unpack (expected at least 4, got 3)
'iN,br%,Q,R'	 Invalid entry
'ol,\nH<\tn,^#,=A'	 Invalid entry

None of the entries will get through unless the fuzzer can produce either van or car. Indeed, the reason is that the grammar itself does not capture the complete information about the format. So here is another idea: we modify the GrammarFuzzer to know a bit about our format.
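
The PooledGrammarFuzzer used below extends GrammarFuzzer with a cache of pre-built subtrees for chosen nonterminals; its full implementation is in the chapter on efficient grammar fuzzing. As a standalone sketch of the idea – pooled_fuzz() and nonterminals() are hypothetical helpers, not the book's API – one can occasionally substitute a pooled value for a nonterminal instead of expanding it:

```python
import random
import re
import string

def nonterminals(expansion):
    # Nonterminals are written as <symbol> in an expansion string
    return re.findall(r'<[^<> ]+>', expansion)

def pooled_fuzz(grammar, symbol='<start>', pool=None, depth=0, max_depth=10):
    """Hypothetical sketch: expand `symbol` using `grammar`, but for
       nonterminals listed in `pool`, sometimes emit a pooled value
       instead of expanding the grammar."""
    if pool and symbol in pool and random.random() < 0.5:
        return random.choice(pool[symbol])
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Past the depth budget, force the cheapest alternative to terminate
        alternatives = [min(alternatives, key=lambda a: len(nonterminals(a)))]
    expansion = random.choice(alternatives)
    return ''.join(
        pooled_fuzz(grammar, token, pool, depth + 1, max_depth)
        if token in grammar else token
        for token in re.split(r'(<[^<> ]+>)', expansion))

# A reduced version of CSV_GRAMMAR (alphabet restricted for readability)
SIMPLE_CSV_GRAMMAR = {
    '<start>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits),
}

random.seed(0)
samples = [pooled_fuzz(SIMPLE_CSV_GRAMMAR, pool={'<item>': ['car', 'van']})
           for _ in range(50)]
```

With the pool in place, a sizable fraction of the generated items are the literal strings car and van, so generated rows now frequently contain the keywords that process_vehicle() expects.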

Let us try again!

In [23]:
gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
gf.update_cache('<item>', [
    ('<item>', [('car', [])]),
    ('<item>', [('van', [])]),
])
trials = 10
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()
',h,van,|'	 Invalid entry
'M,w:K,car,car,van'	 Invalid entry
'J,?Y,van,van,car,J,~D+'	 Invalid entry
'S4,car,car,o'	 invalid literal for int() with base 10: 'S4'
'2*-,van'	 not enough values to unpack (expected at least 4, got 2)
'van,%,5,]'	 Invalid entry
'van,G3{y,j,h:'	 Invalid entry
'$0;o,M,car,car'	 Invalid entry
'2d,f,e'	 not enough values to unpack (expected at least 4, got 3)
'/~NE,car,car'	 not enough values to unpack (expected at least 4, got 3)

At least we are getting somewhere! It would be really nice if we could incorporate what we know about the sample data into our fuzzer. In fact, it would be even nicer if we could extract the template and valid values from samples, and use them in our fuzzing. How do we do that? The quick answer: use a parser.

Using a Parser¶

Generally speaking, a parser is the part of a program that processes (structured) input. The parsers we discuss in this chapter transform an input string into a derivation tree (discussed in the chapter on efficient grammar fuzzing). From a user's perspective, all it takes to parse an input is two steps:

  1. Initialize the parser with a grammar, as in
parser = Parser(grammar)
  2. Use the parser to retrieve a list of derivation trees:
trees = parser.parse(input)

Once we have parsed a tree, we can use it just as the derivation trees produced from grammar fuzzing.

We discuss a number of such parsers, in particular

  • parsing expression grammar parsers (PEGParser), which are very efficient, but limited to specific grammar structures; and
  • Earley parsers (EarleyParser), which accept any kind of context-free grammar.

If you just want to use parsers (say, because your main focus is testing), you can just stop here and move on to the next chapter, where we learn how to make use of parsed inputs to mutate and recombine them. If you want to understand how parsers work, though, this chapter is right for you.

An Ad Hoc Parser¶

As we saw in the previous section, programmers often have to extract parts of data that obey certain rules. For example, in CSV files, each element in a row is separated by commas, and multiple rows are used to store the data.

To extract the information, we write an ad hoc parser simple_parse_csv().

In [24]:
def simple_parse_csv(mystring: str) -> DerivationTree:
    children: List[DerivationTree] = []
    tree = (START_SYMBOL, children)
    for i, line in enumerate(mystring.split('\n')):
        children.append(("record %d" % i, [(cell, [])
                                           for cell in line.split(',')]))
    return tree

We also change the default orientation of the graph from top-to-bottom to left-to-right for easier viewing, using lr_graph().

In [25]:
def lr_graph(dot):
    dot.attr('node', shape='plain')
    dot.graph_attr['rankdir'] = 'LR'

display_tree() shows the structure of our CSV file after parsing.

In [26]:
tree = simple_parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
Out[26]:
(Derivation tree: <start> branches into record 0 with cells 1997, van, Ford, E350, and record 1 with cells 2000, car, Mercury, Cougar.)

This is of course simple. What if we encounter slightly more complexity? Here is another example from Wikipedia.

In [27]:
mystring = '''\
1997,Ford,E350,"ac, abs, moon",3000.00\
'''
print(mystring)
1997,Ford,E350,"ac, abs, moon",3000.00

We define a new annotation method highlight_node() to mark the nodes that are interesting.

In [28]:
def highlight_node(predicate):
    def hl_node(dot, nid, symbol, ann):
        if predicate(dot, nid, symbol, ann):
            dot.node(repr(nid), dot_escape(symbol), fontcolor='red')
        else:
            dot.node(repr(nid), dot_escape(symbol))
    return hl_node

Using highlight_node(), we can highlight the particular nodes that were wrongly parsed.

In [29]:
tree = simple_parse_csv(mystring)
bad_nodes = {5, 6, 7, 12, 13, 20, 22, 23, 24, 25}
In [30]:
def hl_predicate(_d, nid, _s, _a): return nid in bad_nodes
In [31]:
highlight_err_node = highlight_node(hl_predicate)
display_tree(tree, log=False, node_attr=highlight_err_node,
             graph_attr=lr_graph)
Out[31]:
(Derivation tree: <start> → record 0 with cells 1997, Ford, E350, "ac, abs, moon", 3000.00; the quoted field is wrongly split into the three highlighted cells "ac, abs, and moon".)

The marked nodes indicate where our parsing went wrong. We can of course extend our parser to understand quotes. First, we define the helper functions parse_quote(), find_comma(), and comma_split().

In [32]:
def parse_quote(string, i):
    v = string[i + 1:].find('"')
    return v + i + 1 if v >= 0 else -1
In [33]:
def find_comma(string, i):
    slen = len(string)
    while i < slen:
        if string[i] == '"':
            i = parse_quote(string, i)
            if i == -1:
                return -1
        if string[i] == ',':
            return i
        i += 1
    return -1
In [34]:
def comma_split(string):
    slen = len(string)
    i = 0
    while i < slen:
        c = find_comma(string, i)
        if c == -1:
            yield string[i:]
            return
        else:
            yield string[i:c]
        i = c + 1
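
To check the quote-aware splitting, here is a quick self-contained run; the three helpers above are reproduced verbatim so the snippet stands alone:

```python
def parse_quote(string, i):
    # Return the index of the quote closing the one at position i, or -1
    v = string[i + 1:].find('"')
    return v + i + 1 if v >= 0 else -1

def find_comma(string, i):
    # Return the index of the next comma at or after i, skipping quoted parts
    slen = len(string)
    while i < slen:
        if string[i] == '"':
            i = parse_quote(string, i)
            if i == -1:
                return -1
        if string[i] == ',':
            return i
        i += 1
    return -1

def comma_split(string):
    # Yield the comma-separated fields, keeping quoted commas intact
    slen = len(string)
    i = 0
    while i < slen:
        c = find_comma(string, i)
        if c == -1:
            yield string[i:]
            return
        else:
            yield string[i:c]
        i = c + 1

list(comma_split('1997,Ford,E350,"ac, abs, moon",3000.00'))
# → ['1997', 'Ford', 'E350', '"ac, abs, moon"', '3000.00']
```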

We can update our parse_csv() procedure to use our advanced quote parser.

In [35]:
def parse_csv(mystring):
    children = []
    tree = (START_SYMBOL, children)
    for i, line in enumerate(mystring.split('\n')):
        children.append(("record %d" % i, [(cell, [])
                                           for cell in comma_split(line)]))
    return tree

Our new parse_csv() can now handle quotes correctly.

In [36]:
tree = parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
Out[36]:
(Derivation tree: <start> → record 0 with cells 1997, Ford, E350, "ac, abs, moon", 3000.00; the quoted field is now a single cell.)

That of course does not survive long:

In [37]:
mystring = '''\
1999,Chevy,"Venture \\"Extended Edition, Very Large\\"",,5000.00\
'''
print(mystring)
1999,Chevy,"Venture \"Extended Edition, Very Large\"",,5000.00

A few embedded quotes are sufficient to confuse our parser again.

In [38]:
tree = parse_csv(mystring)
bad_nodes = {4, 5}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
Out[38]:
(Derivation tree: <start> → record 0 with cells 1999, Chevy, an empty cell, 5000.00, and the quoted field wrongly split into the two highlighted cells "Venture \"Extended Edition and Very Large\"".)

Here is another record from that CSV file:

In [39]:
mystring = '''\
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''
print(mystring)
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

In [40]:
tree = parse_csv(mystring)
bad_nodes = {5, 6, 7, 8, 9, 10}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
Out[40]:
(Derivation tree: <start> → record 0 with cells 1996, Jeep, Grand Cherokee, "MUST SELL!; record 1 with cells air, moon roof, loaded",4799.00; and an empty record 2. The cells resulting from the split quoted field are highlighted.)

Fixing this would require modifying both the inner parse_quote() and the outer parse_csv() procedures. We note that each of these features is actually documented in the CSV specification, RFC 4180.

Indeed, each additional improvement falls apart with just a little extra complexity. The problem becomes severe when one encounters recursive expressions. For example, JSON is a common alternative to CSV for saving data. Similarly, one may have to parse data from an HTML table instead of a CSV file if one is getting the data from the web.

One might be tempted to fix it with a little more ad hoc parsing, with a bit of regular expressions thrown in. However, that is the path to insanity.

It is here that formal parsers shine. The main idea is that any given set of strings belongs to a language, and these languages can be specified by their grammars (as we saw in the chapter on grammars). The great thing about grammars is that they can be composed: one can introduce finer and finer details into an internal structure without affecting the external structure, and similarly, one can change the external structure without much impact on the internal structure.

Grammars in Parsing¶

We briefly describe grammars in the context of parsing.

A Parser Class¶

Next, we develop different parsers. To do that, we define a minimal interface for parsing that is obeyed by all parsers. There are two approaches to parsing a string using a grammar.

  1. The traditional approach is to use a lexer (also called a tokenizer or a scanner) to first tokenize the incoming string, and feed the parser one token at a time. The lexer is typically a smaller parser that accepts a regular language. The advantage of this approach is that the grammar used by the parser can eschew the details of tokenization. Further, one gets a shallow derivation tree at the end of the parsing, which can be directly used for generating the abstract syntax tree.
  2. The second approach is to use a tree pruner after the complete parse. With this approach, one uses a grammar that incorporates complete details of the syntax. Next, the nodes corresponding to tokens are pruned and replaced with their corresponding strings as leaf nodes. The utility of this approach is that the parser is more powerful, and further there is no artificial distinction between lexing and parsing.
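
As an illustration of the first approach, a tiny regex-based lexer for our CSV format could look as follows (TOKEN_RE and tokenize() are hypothetical names, not part of this chapter's code):

```python
import re

# Quoted fields, bare fields, commas, and newlines each become one token
TOKEN_RE = re.compile(r'"[^"]*"|[^,\n]+|,|\n')

def tokenize(text):
    return TOKEN_RE.findall(text)

tokenize('1997,Ford,"ac, abs",3000.00')
# → ['1997', ',', 'Ford', ',', '"ac, abs"', ',', '3000.00']
```

A parser would then consume these tokens instead of raw characters; the derivation tree stays shallow because quoted fields arrive as single tokens.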

In this chapter, we use the second approach. It is implemented in the prune_tree() method.

The Parser class we define below provides the minimal interface. The main methods that need to be implemented by the classes implementing this interface are parse_prefix and parse. The parse_prefix returns a tuple, which contains the index until which parsing was completed successfully, and the parse forest until that index. The method parse returns a list of derivation trees if the parse was successful.

In [56]:
class Parser:
    """Base class for parsing."""

    def __init__(self, grammar: Grammar, *,
                 start_symbol: str = START_SYMBOL,
                 log: bool = False,
                 coalesce: bool = True,
                 tokens: Set[str] = set()) -> None:
        """Constructor.
           `grammar` is the grammar to be used for parsing.
           Keyword arguments:
           `start_symbol` is the start symbol (default: '<start>').
           `log` enables logging (default: False).
           `coalesce` defines if tokens should be coalesced (default: True).
           `tokens`, if set, is a set of tokens to be used."""
        self._grammar = grammar
        self._start_symbol = start_symbol
        self.log = log
        self.coalesce_tokens = coalesce
        self.tokens = tokens

    def grammar(self) -> Grammar:
        """Return the grammar of this parser."""
        return self._grammar

    def start_symbol(self) -> str:
        """Return the start symbol of this parser."""
        return self._start_symbol

    def parse_prefix(self, text: str) -> Tuple[int, Iterable[DerivationTree]]:
        """Return pair (cursor, forest) for longest prefix of text. 
           To be defined in subclasses."""
        raise NotImplementedError

    def parse(self, text: str) -> Iterable[DerivationTree]:
        """Parse `text` using the grammar. 
           Return an iterable of parse trees."""
        cursor, forest = self.parse_prefix(text)
        if cursor < len(text):
            raise SyntaxError("at " + repr(text[cursor:]))
        return [self.prune_tree(tree) for tree in forest]

    def parse_on(self, text: str, start_symbol: str) -> Generator:
        old_start = self._start_symbol
        try:
            self._start_symbol = start_symbol
            yield from self.parse(text)
        finally:
            self._start_symbol = old_start

    def coalesce(self, children: List[DerivationTree]) -> List[DerivationTree]:
        last = ''
        new_lst: List[DerivationTree] = []
        for cn, cc in children:
            if cn not in self._grammar:
                last += cn
            else:
                if last:
                    new_lst.append((last, []))
                    last = ''
                new_lst.append((cn, cc))
        if last:
            new_lst.append((last, []))
        return new_lst

    def prune_tree(self, tree: DerivationTree) -> DerivationTree:
        name, children = tree
        assert isinstance(children, list)

        if self.coalesce_tokens:
            children = self.coalesce(cast(List[DerivationTree], children))
        if name in self.tokens:
            return (name, [(tree_to_string(tree), [])])
        else:
            return (name, [self.prune_tree(c) for c in children])
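
For instance, coalesce() merges runs of adjacent terminal children (symbols that are not in the grammar) into a single leaf. A standalone copy of its logic illustrates the effect:

```python
def coalesce(grammar, children):
    # Merge adjacent terminal children into one leaf; keep nonterminals as-is
    last = ''
    new_lst = []
    for cn, cc in children:
        if cn not in grammar:
            last += cn            # accumulate terminal text
        else:
            if last:
                new_lst.append((last, []))
                last = ''
            new_lst.append((cn, cc))
    if last:
        new_lst.append((last, []))
    return new_lst

grammar = {'<digit>': ['0', '1', '2']}
children = [('1', []), ('2', []), ('<digit>', [('3', [])]), ('+', [])]
coalesce(grammar, children)
# → [('12', []), ('<digit>', [('3', [])]), ('+', [])]
```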

Parsing Expression Grammars¶

A Parsing Expression Grammar (PEG) \cite{Ford2004} is a type of recognition-based formal grammar that specifies the sequence of steps to take to parse a given string. A parsing expression grammar is very similar to a context-free grammar (CFG) such as the ones we saw in the chapter on grammars. As in a CFG, a parsing expression grammar is represented by a set of nonterminals and corresponding alternatives representing how to match each. For example, here is a PEG that matches a or b.

In [69]:
PEG1 = {
    '<start>': ['a', 'b']
}

However, unlike in a CFG, the alternatives represent ordered choice. That is, rather than choosing all rules that can potentially match, we stop at the first match that succeeds. For example, the PEG below can match ab but not abc, unlike a CFG, which will match both. (We call the sequence of ordered-choice expressions choice expressions rather than alternatives to make the distinction from CFGs clear.)

In [70]:
PEG2 = {
    '<start>': ['ab', 'abc']
}

Each choice in a choice expression represents a rule on how to satisfy that particular choice. The choice is a sequence of symbols (terminals and nonterminals) that are matched against a given text as in a CFG.

The Packrat Parser for Parsing Expression Grammars¶

Short of hand-rolling a parser, Packrat parsing is one of the simplest parsing techniques, and one of the techniques for parsing PEGs. The Packrat parser is so named because it caches all results from simpler subproblems in the hope that these solutions can be used to avoid re-computation later. We develop a minimal Packrat parser next.

We derive from the Parser base class first, and we accept the text to be parsed in the parse() method, which in turn calls unify_key() with the start_symbol.

Note. While our PEG parser can produce only a single unambiguous parse tree, other parsers can produce multiple parses for ambiguous grammars. Hence, we return a list of trees (in this case with a single element).

In [71]:
class PEGParser(Parser):
    def parse_prefix(self, text):
        cursor, tree = self.unify_key(self.start_symbol(), text, 0)
        return cursor, [tree]
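
The unify_key() and unify_rule() methods are developed in the full chapter. As a self-contained sketch of the same ordered-choice idea, here is a minimal packrat-style matcher (peg_parse() is a hypothetical helper, not the chapter's API; it assumes expansions are strings containing <...> nonterminals):

```python
import re

def peg_parse(grammar, key, text, at=0, cache=None):
    """Minimal packrat-style PEG matcher (hypothetical sketch):
       returns (position, tree), where tree is None on failure.
       Alternatives are tried in order; the first match wins."""
    if cache is None:
        cache = {}
    if (key, at) in cache:                  # packrat memoization
        return cache[(key, at)]
    if key not in grammar:                  # terminal: literal match
        if text.startswith(key, at):
            result = (at + len(key), (key, []))
        else:
            result = (at, None)
    else:
        result = (at, None)
        for alternative in grammar[key]:
            pos, children = at, []
            for token in re.split(r'(<[^<> ]+>)', alternative):
                if token == '':
                    continue
                pos, tree = peg_parse(grammar, token, text, pos, cache)
                if tree is None:
                    break
                children.append(tree)
            else:                           # every token matched
                result = (pos, (key, children))
                break
    cache[(key, at)] = result
    return result
```

The cache keyed on (symbol, position) is what makes the parser "packrat": each subproblem is solved at most once, giving linear-time parsing at the cost of memory.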

Here are a few examples of our parser in action.

In [81]:
mystring = "1 + (2 * 3)"
peg = PEGParser(EXPR_GRAMMAR)
for tree in peg.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))
(Derivation tree for "1 + (2 * 3)", rendered graphically.)
In [82]:
mystring = "1 * (2 + 3.35)"
for tree in peg.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))
(Derivation tree for "1 * (2 + 3.35)", rendered graphically.)

One should be aware that while a PEG looks like a CFG, the language described by the PEG may be different. Indeed, only LL(1) grammars are guaranteed to represent the same language when interpreted both as a PEG and as a CFG. The behavior of PEGs for other classes of grammars can be surprising \cite{redziejowski2008}.

Parsing Context-Free Grammars¶

Problems with PEG¶

While PEGs are simple at first sight, their behavior in some cases might be a bit unintuitive. Here is an example \cite{redziejowski2008}:

In [83]:
PEG_SURPRISE: Grammar = {
    "<A>": ["a<A>a", "aa"]
}

When interpreted as a CFG and used as a string generator, it will produce strings of the form aa, aaaa, aaaaaa, and so on – that is, strings in which the number of a's is $2n$ for some $n > 0$.

In [84]:
strings = []
for nn in range(4):
    f = GrammarFuzzer(PEG_SURPRISE, start_symbol='<A>')
    tree = ('<A>', None)
    for _ in range(nn):
        tree = f.expand_tree_once(tree)
    tree = f.expand_tree_with_strategy(tree, f.expand_node_min_cost)
    strings.append(tree_to_string(tree))
    display_tree(tree)
strings
Out[84]:
['aa', 'aaaa', 'aaaaaa', 'aaaaaaaa']

However, the PEG parser can only recognize strings of the form $a^{2^n}$ – that is, strings whose length is a power of two:

In [85]:
peg = PEGParser(PEG_SURPRISE, start_symbol='<A>')
for s in strings:
    with ExpectError():
        for tree in peg.parse(s):
            display_tree(tree)
        print(s)
aa
aaaa
aaaaaaaa
Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_10601/3226632005.py", line 4, in <module>
    for tree in peg.parse(s):
                ^^^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_10601/2022555909.py", line 40, in parse
    raise SyntaxError("at " + repr(text[cursor:]))
SyntaxError: at 'aa' (expected)

This is not the only problem with parsing expression grammars. While PEGs are expressive and the Packrat parser for parsing them is simple and intuitive, PEGs suffer from a major deficiency for our purposes. PEGs are oriented towards language recognition, and it is not clear how to translate an arbitrary PEG to a CFG. As we mentioned earlier, a naive re-interpretation of a PEG as a CFG does not work very well. Further, it is not clear what the exact relation is between the class of languages represented by PEGs and the class of languages represented by CFGs. Since our primary focus is fuzzing – that is, the generation of strings – we next look at parsers that can accept context-free grammars.

The general idea of a CFG parser is the following: peek at the input text for the allowed number of characters, and use these characters together with the parser state to determine which rules can be applied to complete parsing. We next look at a typical CFG parsing algorithm, the Earley parser.

The Earley Parser¶

The Earley parser is a general parser that is able to parse any arbitrary CFG. It was invented by Jay Earley \cite{Earley1970} for use in computational linguistics. While its computational complexity is $O(n^3)$ for parsing strings with arbitrary grammars, it can parse strings with unambiguous grammars in $O(n^2)$ time, and all LR(k) grammars in linear time ($O(n)$ \cite{Leo1991}). Further improvements – notably handling epsilon rules – were invented by Aycock et al. \cite{Aycock2002}.

Note that one restriction of our implementation is that the start symbol can have only one alternative in its alternative expressions. This is not a restriction in practice because any grammar with multiple alternatives for its start symbol can be extended with a new start symbol that has the original start symbol as its only choice. That is, given a grammar as below,

grammar = {
    '<start>': ['<A>', '<B>'],
    ...
}

one may rewrite it as below to conform to the single-alternative rule.

grammar = {
    '<start>': ['<start_>'],
    '<start_>': ['<A>', '<B>'],
    ...
}
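
This rewriting is mechanical; a small helper (ensure_unique_start() is a hypothetical name, not part of the chapter) can perform it:

```python
def ensure_unique_start(grammar, start='<start>'):
    """Return a copy of `grammar` in which `start` has exactly one
       alternative, introducing a fresh symbol if necessary."""
    if len(grammar[start]) == 1:
        return dict(grammar)
    new_start = start[:-1] + '_>'      # '<start>' becomes '<start_>'
    assert new_start not in grammar
    new_grammar = dict(grammar)
    new_grammar[new_start] = grammar[start]
    new_grammar[start] = [new_start]
    return new_grammar

grammar = {'<start>': ['<A>', '<B>'], '<A>': ['a'], '<B>': ['b']}
ensure_unique_start(grammar)
# → {'<start>': ['<start_>'], '<A>': ['a'], '<B>': ['b'], '<start_>': ['<A>', '<B>']}
```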

Let us now implement an EarleyParser class, again derived from Parser.
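
The full implementation is developed in the complete chapter. As a compact standalone sketch of the underlying algorithm, the following recognizer (earley_recognize() is a hypothetical helper; it builds no trees, and assumes single-character terminals and no epsilon rules, with rules given as lists of symbols) shows the predict/scan/complete cycle over chart columns:

```python
def earley_recognize(grammar, text, start='<start>'):
    """Minimal Earley recognizer: True iff `text` is in the grammar's
       language. A state is (name, alternative, dot, origin column)."""
    columns = [set() for _ in range(len(text) + 1)]
    for alt in grammar[start]:
        columns[0].add((start, tuple(alt), 0, 0))
    for i in range(len(text) + 1):
        worklist = list(columns[i])
        while worklist:
            name, alt, dot, origin = worklist.pop()
            if dot < len(alt):
                sym = alt[dot]
                if sym in grammar:                        # predict
                    for a in grammar[sym]:
                        new = (sym, tuple(a), 0, i)
                        if new not in columns[i]:
                            columns[i].add(new)
                            worklist.append(new)
                elif i < len(text) and text[i] == sym:    # scan
                    columns[i + 1].add((name, alt, dot + 1, origin))
            else:                                         # complete
                for n2, a2, d2, o2 in list(columns[origin]):
                    if d2 < len(a2) and a2[d2] == name:
                        new = (n2, a2, d2 + 1, o2)
                        if new not in columns[i]:
                            columns[i].add(new)
                            worklist.append(new)
    return any(s == start and dot == len(alt) and origin == 0
               for s, alt, dot, origin in columns[len(text)])

CFG_SURPRISE = {'<A>': [['a', '<A>', 'a'], ['a', 'a']]}
earley_recognize(CFG_SURPRISE, 'aaaaaa', '<A>')
# → True (as a CFG, this grammar accepts every even-length string of a's)
```

Note how, in contrast to the PEG parser above, the same grammar interpreted as a CFG accepts all even-length strings, not just those of power-of-two length.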

Here are a few examples of the Earley parser in action.

In [162]:
mystring = "1 + (2 * 3)"
earley = EarleyParser(EXPR_GRAMMAR)
for tree in earley.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))
(Derivation tree for "1 + (2 * 3)", rendered graphically.)
In [163]:
mystring = "1 * (2 + 3.35)"
for tree in earley.parse(mystring):
    assert tree_to_string(tree) == mystring
    display(display_tree(tree))
(Derivation tree for "1 * (2 + 3.35)", rendered graphically.)

In contrast to the PEGParser, above, the EarleyParser can handle arbitrary context-free grammars.

Background¶

Numerous parsing techniques exist that can parse a given string using a given grammar and produce a corresponding derivation tree (or trees). However, some of these techniques work only on specific classes of grammars. These classes are named after the specific kind of parser that can accept grammars of that category; that is, the upper bound of the parser's capabilities defines the grammar class named after it.

LL and LR parsing are the two main traditions in parsing. Here, LL means left-to-right, leftmost derivation, and represents a top-down approach, while LR (left-to-right, rightmost derivation) represents a bottom-up approach. Another way to look at it is that LL parsers compute the derivation tree incrementally in pre-order, while LR parsers compute the derivation tree in post-order \cite{pingali2015graphical}.

Different classes of grammars differ in the features that are available to the user for writing a grammar of that class. That is, the corresponding kind of parser will be unable to parse a grammar that makes use of more features than allowed. For example, the A2_GRAMMAR is an LL grammar because it lacks left recursion, while A1_GRAMMAR is not an LL grammar. This is because an LL parser parses its input from left to right, and constructs the leftmost derivation of its input by expanding the nonterminals it encounters. If there is a left recursion in one of these rules, an LL parser will enter an infinite loop.

Similarly, a grammar is LL(k) if it can be parsed by an LL parser with k lookahead tokens, and LR(k) if it can be parsed by an LR parser with k lookahead tokens. These grammars are interesting because both LL(k) and LR(k) grammars have $O(n)$ parsers and can be used with a relatively restricted computational budget compared to other grammars.

The languages for which one can provide an LL(k) grammar are called LL(k) languages (where k is the minimum lookahead required); similarly, LR(k) is defined as the set of languages that have an LR(k) grammar. In terms of languages, LL(k) $\subset$ LL(k+1), LL(k) $\subset$ LR(k), and LR(k) $=$ LR(1). All deterministic CFLs have an LR(1) grammar. However, there exist CFLs that are inherently ambiguous \cite{ogden1968helpful}, and for these one cannot provide an LR(1) grammar.

The other main parsing algorithms for CFGs are GLL \cite{scott2010gll}, GLR \cite{tomita1987efficient,tomita2012generalized}, and CYK \cite{grune2008parsing}. ALL(*) (used by ANTLR), on the other hand, is a grammar representation that uses regular-expression-like predicates (similar to advanced PEGs – see the Exercises) rather than a fixed lookahead. Hence, ALL(*) can accept a larger class of grammars than CFGs.

In terms of computational limits of parsing, the main CFG parsers have a complexity of $O(n^3)$ for arbitrary grammars. However, parsing with arbitrary CFGs is reducible to boolean matrix multiplication \cite{Valiant1975} (and the reverse \cite{Lee2002}), which is at present bounded by $O(n^{2.3728639})$ \cite{LeGall2014}. Hence, the worst-case complexity for parsing arbitrary CFGs is likely to remain close to cubic.

Regarding PEGs, the actual class of languages expressible in PEG is currently unknown. In particular, we know that PEGs can express certain non-context-free languages such as $a^n b^n c^n$. However, we do not know if there exist CFLs that are not expressible with PEGs. In Section 2.3, we provided an instance of a counter-intuitive PEG grammar. While important for our purposes (we use grammars for generating inputs), this is not a criticism of parsing with PEGs. PEG focuses on writing grammars for recognizing a given language, and not necessarily on interpreting what language an arbitrary PEG might yield. Given a context-free language to parse, it is almost always possible to write a grammar for it in PEG. Given that (1) a PEG can parse any string in $O(n)$ time, (2) at present we know of no CFL that cannot be expressed as a PEG, and (3) compared with LR grammars, a PEG is often more intuitive because it allows top-down interpretation, PEGs should be under serious consideration when writing a parser for a language.
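The ordered-choice behavior that makes PEGs counter-intuitive can be demonstrated in a few lines (a sketch; `peg_parse()` below is a hypothetical stand-in, not the book's `PEGParser`):

```python
def peg_parse(key, text, grammar):
    """Ordered choice: return the result of the FIRST alternative that matches."""
    if key not in grammar:  # terminal
        return (text[len(key):], key) if text.startswith(key) else (text, None)
    for rule in grammar[key]:
        remaining, children = text, []
        for token in rule:
            remaining, res = peg_parse(token, remaining, grammar)
            if res is None:
                break
            children.append(res)
        else:
            return remaining, (key, children)
    return text, None

# Ordered choice is not commutative: with ["a", "ab"], the "a" alternative
# wins first, so parsing "ab" leaves a "b" unconsumed; reordering fixes it.
g1 = {'<start>': [['a'], ['ab']]}
g2 = {'<start>': [['ab'], ['a']]}
print(peg_parse('<start>', 'ab', g1)[0])  # leftover: 'b'
print(peg_parse('<start>', 'ab', g2)[0])  # leftover: ''
```

In a CFG, both orderings describe the same language; in a PEG, they do not.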

Synopsis¶

This chapter introduces Parser classes, parsing a string into a derivation tree as introduced in the chapter on efficient grammar fuzzing. Two important parser classes are provided:

  • Parsing Expression Grammar parsers (PEGParser). These are very efficient, but limited to specific grammar structures. Notably, the alternatives represent ordered choice. That is, rather than choosing all rules that can potentially match, we stop at the first match that succeeds.
  • Earley parsers (EarleyParser). These accept any kind of context-free grammar, and explore all parsing alternatives (if any).

Using any of these is fairly easy, though. First, instantiate them with a grammar:

In [173]:
us_phone_parser = EarleyParser(US_PHONE_GRAMMAR)

Then, use the parse() method to retrieve a list of possible derivation trees:

In [174]:
trees = us_phone_parser.parse("(555)987-6543")
tree = list(trees)[0]
display_tree(tree)
Out[174]:
(Derivation tree rendered as a graph: <start> expands to <phone-number>, which expands into "(" <area> ")" <exchange> "-" <line>; the digit groups 555, 987, and 6543 appear as <lead-digit> and <digit> leaves.)

These derivation trees can then be used for test generation, notably for mutating and recombining existing inputs.
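For instance, once inputs are represented as derivation trees, recombination is plain tree surgery (a sketch with hand-built trees; `tree_to_str()` is a local helper that may differ from the book's `tree_to_string()` in details):

```python
# Derivation trees are (symbol, children) pairs.
def tree_to_str(node):
    symbol, children = node
    return ''.join(tree_to_str(c) for c in children) if children else symbol

tree1 = ('<expr>', [('<digit>', [('1', [])]),
                    ('+', []),
                    ('<digit>', [('2', [])])])
tree2 = ('<expr>', [('<digit>', [('7', [])]),
                    ('*', []),
                    ('<digit>', [('8', [])])])

# Crossover: graft the last <digit> subtree of tree2 onto tree1.
mutated = ('<expr>', [tree1[1][0], tree1[1][1], tree2[1][2]])
print(tree_to_str(mutated))  # -> 1+8
```

Because both subtrees derive from the same `<digit>` nonterminal, the recombined string is again valid with respect to the grammar.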

In [175]:
# ignore
from ClassDiagram import display_class_hierarchy
In [176]:
# ignore
display_class_hierarchy([PEGParser, EarleyParser],
                        public_methods=[
                            Parser.parse,
                            Parser.__init__,
                            Parser.grammar,
                            Parser.start_symbol
                        ],
                        types={
                            'DerivationTree': DerivationTree,
                            'Grammar': Grammar
                        },
                        project='fuzzingbook')
Out[176]:
(Class hierarchy diagram: PEGParser and EarleyParser both inherit from Parser, whose public methods are __init__(), grammar(), parse(), and start_symbol(); PEGParser adds parse_prefix(), unify_rule(), and unify_key(), while EarleyParser adds chart_parse(), fill_chart(), parse_forest(), and related methods.)

Lessons Learned¶

  • Grammars can be used to generate derivation trees for a given string.
  • Parsing Expression Grammars are intuitive, and easy to implement, but require care to write.
  • Earley Parsers can parse arbitrary Context Free Grammars.

Next Steps¶

  • Use parsed inputs to recombine existing inputs

Solution. Here is a possible solution:

In [177]:
class PackratParser(Parser):
    def parse_prefix(self, text):
        txt, res = self.unify_key(self.start_symbol(), text)
        return len(txt), [res]

    def parse(self, text):
        remain, res = self.parse_prefix(text)
        if remain:
            # remain is the length of the unconsumed suffix
            raise SyntaxError("at " + repr(text[len(text) - remain:]))
        return res

    def unify_rule(self, rule, text):
        results = []
        for token in rule:
            text, res = self.unify_key(token, text)
            if res is None:
                return text, None
            results.append(res)
        return text, results

    def unify_key(self, key, text):
        if key not in self.cgrammar:
            if text.startswith(key):
                return text[len(key):], (key, [])
            else:
                return text, None
        for rule in self.cgrammar[key]:
            text_, res = self.unify_rule(rule, text)
            if res:
                return (text_, (key, res))
        return text, None
In [178]:
mystring = "1 + (2 * 3)"
for tree in PackratParser(EXPR_GRAMMAR).parse(mystring):
    assert tree_to_string(tree) == mystring
    display_tree(tree)

Solution. Python allows us to append to a list while iterating over it; a dict, even though it is ordered, does not allow its set of keys to change during iteration.

That is, the following will work, in the sense that the loop body may append to the list it is iterating over (though without a termination condition, this particular loop would run forever)

values = [1]
for v in values:
    values.append(v*2)

However, the following will result in an error (a RuntimeError, since the dictionary changed size during iteration)

values = {1:1}
for v in values:
    values[v*2] = v*2

In fill_chart(), we make use of this facility to modify the set of states we are iterating over, on the fly.
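The difference can be checked directly (a sketch; the list loop below adds a bound so that it actually terminates):

```python
# Appending to a list while iterating over it is allowed; the iterator
# picks up the new elements as they are added:
values = [1]
for v in values:
    if v < 8:
        values.append(v * 2)
print(values)  # -> [1, 2, 4, 8]

# Changing the set of keys of a dict while iterating raises RuntimeError:
d = {1: 1}
try:
    for v in d:
        d[v * 2] = v * 2
except RuntimeError as e:
    print("error:", e)  # dictionary changed size during iteration
```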

In [179]:
mystring = 'aaaaaa'

Compare that to the parsing of RR_GRAMMAR as seen below:

Finding a deterministic reduction path works as follows:

Given a complete state represented by <A> : seq_1 ● (s, e), where s is the starting column for this rule and e the current column, there is a deterministic reduction path above it if two constraints are satisfied:

  1. There exists a single item of the form <B> : seq_2 ● <A> (k, s) in column s.
  2. It is the only item in s with a dot in front of <A>.

The resulting item is of the form <B> : seq_2 <A> ● (k, e), which is simply the item from (1) advanced, and is considered above <A> : .. (s, e) in the deterministic reduction path. Here, seq_1 and seq_2 are arbitrary symbol sequences.

This forms the following chain of links, with <A>:.. (s_1, e) being the child of <B>:.. (s_2, e) etc.

Here is one way to visualize the chain:

<C> : seq_3 <B> ● (s_3, e)  
             |  constraints satisfied by <C> : seq_3 ● <B> (s_3, s_2)
            <B> : seq_2 <A> ● (s_2, e)  
                         | constraints satisfied by <B> : seq_2 ● <A> (s_2, s_1)
                        <A> : seq_1 ● (s_1, e)

Essentially, what we want to do is identify potential deterministic right-recursion candidates, perform completion on them, and throw away the intermediate results. We do this until we reach the top. See Grune et al.~\cite{grune2008parsing} for further information.

Note that the completions are all in the same column (e), with the constraints of each candidate satisfied in progressively earlier columns (as shown below):

<C> : seq_3 ● <B> (s_3, s_2)  -->              <C> : seq_3 <B> ● (s_3, e)
               |
              <B> : seq_2 ● <A> (s_2, s_1) --> <B> : seq_2 <A> ● (s_2, e)  
                             |
                            <A> : seq_1 ●                        (s_1, e)

Following this chain, the topmost item is the item <C> : .. (s_3, e) that does not have a parent. The topmost item, which needs to be saved, is called a transitive item by Leo, and it is associated with the nonterminal symbol that started the lookup. The transitive item needs to be added to each column we inspect.

Here is the skeleton for the parser LeoParser.

Solution. Here is a possible solution:

In [186]:
class LeoParser(LeoParser):
    def get_top(self, state_A):
        st_B_inc = self.uniq_postdot(state_A)
        if not st_B_inc:
            return None
        
        t_name = st_B_inc.name
        if t_name in st_B_inc.e_col.transitives:
            return st_B_inc.e_col.transitives[t_name]

        st_B = st_B_inc.advance()

        top = self.get_top(st_B) or st_B
        return st_B_inc.e_col.add_transitive(t_name, top)

We verify the Leo parser with a few more right recursive grammars.

In [195]:
result = LeoParser(RR_GRAMMAR4, log=True).parse(mystring4)
for _ in result: pass
None chart[0]
<A>:= |(0,0) 

a chart[1]
 

b chart[2]
<A>:= |(2,2)
<A>:= a b <A> |(0,2) 

a chart[3]
 

b chart[4]
<A>:= |(4,4)
<A>:= a b <A> |(2,4)
<A>:= a b <A> |(0,4) 

a chart[5]
 

b chart[6]
<A>:= |(6,6)
<A>:= a b <A> |(4,6)
<A>:= a b <A> |(0,6) 

a chart[7]
 

b chart[8]
<A>:= |(8,8)
<A>:= a b <A> |(6,8)
<A>:= a b <A> |(0,8) 

c chart[9]
<start>:= <A> c |(0,9) 

In [202]:
result = LeoParser(LR_GRAMMAR, log=True).parse(mystring)
for _ in result: pass
None chart[0]
<A>:= |(0,0)
<start>:= <A> |(0,0) 

a chart[1]
<A>:= <A> a |(0,1)
<start>:= <A> |(0,1) 

a chart[2]
<A>:= <A> a |(0,2)
<start>:= <A> |(0,2) 

a chart[3]
<A>:= <A> a |(0,3)
<start>:= <A> |(0,3) 

a chart[4]
<A>:= <A> a |(0,4)
<start>:= <A> |(0,4) 

a chart[5]
<A>:= <A> a |(0,5)
<start>:= <A> |(0,5) 

a chart[6]
<A>:= <A> a |(0,6)
<start>:= <A> |(0,6) 

We define a rearrange() method to generate a reversed table where each column contains states that start at that column.

In [209]:
class LeoParser(LeoParser):
    def rearrange(self, table):
        f_table = [Column(c.index, c.letter) for c in table]
        for col in table:
            for s in col.states:
                f_table[s.s_col.index].states.append(s)
        return f_table
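The effect of rearrange() can be illustrated independently of the parser (a sketch with minimal stand-ins for the Column and state objects; only the fields used by rearrange() are modeled):

```python
# Minimal stand-ins (not the book's classes): states are recorded in the
# column where they END; rearrange() regroups them by where they START.
class Column:
    def __init__(self, index, letter):
        self.index, self.letter, self.states = index, letter, []

class State:
    def __init__(self, name, s_col):
        self.name, self.s_col = name, s_col  # s_col: starting column

def rearrange(table):
    f_table = [Column(c.index, c.letter) for c in table]
    for col in table:
        for s in col.states:
            f_table[s.s_col.index].states.append(s)
    return f_table

cols = [Column(0, None), Column(1, 'a'), Column(2, 'b')]
s = State('<A>', cols[0])
cols[2].states.append(s)   # <A> spans columns 0..2, stored in column 2
r = rearrange(cols)
print([st.name for st in r[0].states])  # -> ['<A>']
```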
In [211]:
class LeoParser(LeoParser):
    def parse(self, text):
        cursor, states = self.parse_prefix(text)
        start = next((s for s in states if s.finished()), None)
        if cursor < len(text) or not start:
            raise SyntaxError("at " + repr(text[cursor:]))

        self.r_table = self.rearrange(self.table)
        forest = self.extract_trees(self.parse_forest(self.table, start))
        for tree in forest:
            yield self.prune_tree(tree)
In [212]:
class LeoParser(LeoParser):
    def parse_forest(self, chart, state):
        if isinstance(state, TState):
            self.expand_tstate(state.back(), state.e_col)
        
        return super().parse_forest(chart, state)

Exercise 6: Filtered Earley Parser¶

One of the problems with our Earley and Leo parsers is that they can get stuck in infinite loops when parsing with grammars that contain token repetitions in alternatives. For example, consider the grammar below.

In [225]:
RECURSION_GRAMMAR: Grammar = {
    "<start>": ["<A>"],
    "<A>": ["<A>", "<A>aa", "AA", "<B>"],
    "<B>": ["<C>", "<C>cc", "CC"],
    "<C>": ["<B>", "<B>bb", "BB"]
}

With this grammar, one can produce an infinite chain of derivations of <A> (direct recursion) or an infinite chain of derivations of <B> -> <C> -> <B> ... (indirect recursion). The problem is that our implementation can get stuck trying to derive one of these infinite chains. One possibility is to use the LazyExtractor. Another is to simply avoid generating such chains.

In [227]:
with ExpectTimeout(1, print_traceback=False):
    mystring = 'AA'
    parser = LeoParser(RECURSION_GRAMMAR)
    tree, *_ = parser.parse(mystring)
    assert tree_to_string(tree) == mystring
    display_tree(tree)
RecursionError: maximum recursion depth exceeded (expected)

Can you implement a solution such that any tree that contains such a chain is discarded?

Exercise 7: Iterative Earley Parser¶

Recursive algorithms are quite handy in some cases, but sometimes we might want to have iteration instead of recursion due to memory or speed problems.

Can you implement an iterative version of the EarleyParser?

Hint: In general, you can use a stack to replace a recursive algorithm with an iterative one. An easy way to do this is pushing the parameters onto a stack instead of passing them to the recursive function.
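As a warm-up for this exercise, here is the stack transformation applied to a simple tree traversal (a generic sketch, not the EarleyParser itself): each stack entry holds the parameters that would otherwise be passed to a recursive call.

```python
def tree_depth_recursive(node):
    """Depth of a (symbol, children) derivation tree, recursively."""
    symbol, children = node
    return 1 + max((tree_depth_recursive(c) for c in children), default=0)

def tree_depth_iterative(root):
    """Same computation with an explicit stack instead of recursion."""
    max_depth = 0
    stack = [(root, 1)]                 # (node, depth) replaces call arguments
    while stack:
        (symbol, children), depth = stack.pop()
        max_depth = max(max_depth, depth)
        stack.extend((c, depth + 1) for c in children)
    return max_depth

t = ('<start>', [('<expr>', [('1', []), ('+', []), ('2', [])])])
print(tree_depth_recursive(t), tree_depth_iterative(t))  # -> 3 3
```

The iterative version is immune to Python's recursion limit, which matters for deeply nested derivation trees.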

Solution. Here is a possible solution.

Let's see if it works with some of the grammars we have seen so far.

Solution. The first set of a terminal is the set containing just that terminal itself, so we initialize those first. For each nonterminal, we then initialize its first set with {EPSILON} if it is nullable, and with the empty set otherwise.

In [240]:
def firstset(grammar, nullable):
    first = {i: {i} for i in terminals(grammar)}
    for k in grammar:
        first[k] = {EPSILON} if k in nullable else set()
    return firstset_((rules(grammar), first, nullable))[1]

Finally, we rely on a fixpoint computation to repeatedly update the first sets with the contents of the current first sets until they stop changing.
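The fixpoint decorator used here is defined earlier in the chapter; a minimal sketch of such a decorator might look as follows (details may differ from the actual implementation; since our functions mutate their argument in place, we compare printed snapshots rather than return values):

```python
def fixpoint(f):
    """Apply f to its argument repeatedly until the result stops changing."""
    def helper(arg):
        while True:
            before = repr(arg)   # snapshot, since f may mutate arg in place
            arg = f(arg)
            if repr(arg) == before:
                return arg
    return helper

# Hypothetical example: saturate a set under n -> n // 2.
@fixpoint
def halves(state):
    state |= {n // 2 for n in state}
    return state

print(sorted(halves({9})))  # -> [0, 1, 2, 4, 9]
```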

In [241]:
def first_expr(expr, first, nullable):
    tokens = set()
    for token in expr:
        tokens |= first[token]
        if token not in nullable:
            break
    return tokens
In [242]:
@fixpoint
def firstset_(arg):
    (rules, first, epsilon) = arg
    for A, expression in rules:
        first[A] |= first_expr(expression, first, epsilon)
    return (rules, first, epsilon)
In [243]:
firstset(canonical(A1_GRAMMAR), EPSILON)
Out[243]:
{'1': {'1'},
 '4': {'4'},
 '0': {'0'},
 '6': {'6'},
 '2': {'2'},
 '8': {'8'},
 '7': {'7'},
 '9': {'9'},
 '5': {'5'},
 '3': {'3'},
 '-': {'-'},
 '+': {'+'},
 '<start>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
 '<expr>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
 '<integer>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
 '<digit>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}}

Solution. The implementation of followset() is similar to firstset(). We first initialize the follow set with EOF, get the epsilon and first sets, and use the fixpoint() decorator to iteratively compute the follow set until nothing changes.

In [244]:
EOF = '\0'
In [245]:
def followset(grammar, start):
    follow = {i: set() for i in grammar}
    follow[start] = {EOF}

    epsilon = nullable(grammar)
    first = firstset(grammar, epsilon)
    return followset_((grammar, epsilon, first, follow))[-1]

Given the current follow set, one can update the follow set as follows:

In [246]:
@fixpoint
def followset_(arg):
    grammar, epsilon, first, follow = arg
    for A, expression in rules(grammar):
        f_B = follow[A]
        for t in reversed(expression):
            if t in grammar:
                follow[t] |= f_B
            f_B = f_B | first[t] if t in epsilon else (first[t] - {EPSILON})

    return (grammar, epsilon, first, follow)
In [247]:
followset(canonical(A1_GRAMMAR), START_SYMBOL)
Out[247]:
{'<start>': {'\x00'},
 '<expr>': {'\x00', '+', '-'},
 '<integer>': {'\x00', '+', '-'},
 '<digit>': {'\x00',
  '+',
  '-',
  '0',
  '1',
  '2',
  '3',
  '4',
  '5',
  '6',
  '7',
  '8',
  '9'}}
Rule Name	| + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
start    	|   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
expr     	|   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
expr_    	| 2 | 3 |   |   |   |   |   |   |   |   |   |  
integer  	|   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
integer_ 	| 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
digit    	|   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17

Solution. We define predict() as we explained before. Then we use the predicted rules to populate the parse table.

In [250]:
class LL1Parser(LL1Parser):
    def predict(self, rulepair, first, follow, epsilon):
        A, rule = rulepair
        rf = first_expr(rule, first, epsilon)
        if nullable_expr(rule, epsilon):
            rf |= follow[A]
        return rf

    def parse_table(self):
        self.my_rules = rules(self.cgrammar)
        epsilon = nullable(self.cgrammar)
        first = firstset(self.cgrammar, epsilon)
        # inefficient, can combine the three.
        follow = followset(self.cgrammar, self.start_symbol())

        ptable = [(i, self.predict(rule, first, follow, epsilon))
                  for i, rule in enumerate(self.my_rules)]

        parse_tbl = {k: {} for k in self.cgrammar}

        for i, pvals in ptable:
            (k, expr) = self.my_rules[i]
            parse_tbl[k].update({v: i for v in pvals})

        self.table = parse_tbl
In [251]:
ll1parser = LL1Parser(A2_GRAMMAR)
ll1parser.parse_table()
ll1parser.show_table()
Rule Name	| + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<start>  	|   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
<expr>  	|   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
<expr_>  	| 2 | 3 |   |   |   |   |   |   |   |   |   |  
<integer>  	|   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
<integer_>  	| 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
<digit>  	|   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17

Solution. Here is the complete parser:

In [252]:
class LL1Parser(LL1Parser):
    def parse_helper(self, stack, inplst):
        inp, *inplst = inplst
        exprs = []
        while stack:
            val, *stack = stack
            if isinstance(val, tuple):
                exprs.append(val)
            elif val not in self.cgrammar:  # terminal
                assert val == inp
                exprs.append(val)
                inp, *inplst = inplst or [None]
            else:
                if inp is not None:
                    i = self.table[val][inp]
                    _, rhs = self.my_rules[i]
                    stack = rhs + [(val, len(rhs))] + stack
        return self.linear_to_tree(exprs)

    def parse(self, inp):
        self.parse_table()
        k, _ = self.my_rules[0]
        stack = [k]
        return self.parse_helper(stack, inp)

    def linear_to_tree(self, arr):
        stack = []
        while arr:
            elt = arr.pop(0)
            if not isinstance(elt, tuple):
                stack.append((elt, []))
            else:
                # get the last n
                sym, n = elt
                elts = stack[-n:] if n > 0 else []
                stack = stack[0:len(stack) - n]
                stack.append((sym, elts))
        assert len(stack) == 1
        return stack[0]
In [253]:
ll1parser = LL1Parser(A2_GRAMMAR)
tree = ll1parser.parse('1+2')
display_tree(tree)
Out[253]:
0 <start> 1 <expr> 0->1 2 <expr_> 1->2 3 <expr> 2->3 4 <integer> 3->4 8 <integer> 3->8 5 <digit> 4->5 7 <integer_> 4->7 6 1 (49) 5->6 9 + (43) 8->9 10 <digit> 8->10 11 2 (50) 10->11