Here is a simple program that accepts a CSV file of vehicle details and processes this information.
def process_inventory(inventory):
res = []
for vehicle in inventory.split('\n'):
ret = process_vehicle(vehicle)
res.extend(ret)
return '\n'.join(res)
The CSV file contains details of one vehicle per line. Each row is processed in process_vehicle().
def process_vehicle(vehicle):
year, kind, company, model, *_ = vehicle.split(',')
if kind == 'van':
return process_van(year, company, model)
elif kind == 'car':
return process_car(year, company, model)
else:
raise Exception('Invalid entry')
Depending on the kind of vehicle, the processing changes.
def process_van(year, company, model):
res = ["We have a %s %s van from %s vintage." % (company, model, year)]
iyear = int(year)
if iyear > 2010:
res.append("It is a recent model!")
else:
res.append("It is an old but reliable model!")
return res
def process_car(year, company, model):
res = ["We have a %s %s car from %s vintage." % (company, model, year)]
iyear = int(year)
if iyear > 2016:
res.append("It is a recent model!")
else:
res.append("It is an old but reliable model!")
return res
Here is a sample of the inputs that process_inventory() accepts.
mystring = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar\
"""
print(process_inventory(mystring))
We have a Ford E350 van from 1997 vintage.
It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage.
It is an old but reliable model!
Let us try to fuzz this program. Given that process_inventory() takes a CSV file, we can write a simple grammar for generating comma-separated values, and generate the required CSV rows. For convenience, we fuzz process_vehicle() directly.
CSV_GRAMMAR = {
'<start>': ['<csvline>'],
'<csvline>': ['<items>'],
'<items>': ['<item>,<items>', '<item>'],
'<item>': ['<letters>'],
'<letters>': ['<letter><letters>', '<letter>'],
'<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \t\n')
}
We need some infrastructure first for viewing the grammar.
syntax_diagram(CSV_GRAMMAR)
[Syntax diagram for the nonterminals start, csvline, items, item, letters, and letter]
We generate 1000 values, and evaluate process_vehicle() with each.
gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 1000
valid = []
time = 0
for i in range(trials):
    with Timer() as t:
        vehicle_info = gf.fuzz()
        try:
            process_vehicle(vehicle_info)
            valid.append(vehicle_info)
        except Exception:  # invalid input: skip it, only valid entries are counted
            pass
        time += t.elapsed_time()
print("%d valid strings, that is GrammarFuzzer generated %f%% valid entries from %d inputs" %
      (len(valid), len(valid) * 100.0 / trials, trials))
print("Total time of %f seconds" % time)
0 valid strings, that is GrammarFuzzer generated 0.000000% valid entries from 1000 inputs
Total time of 5.665059 seconds
This is obviously not working. But why?
gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 10
valid = []
time = 0
for i in range(trials):
vehicle_info = gf.fuzz()
try:
print(repr(vehicle_info), end="")
process_vehicle(vehicle_info)
except Exception as e:
print("\t", e)
else:
print()
'9w9J\'/,LU<"l,,Y,Zv)Amvx,c\n' Invalid entry
'(n8].H7,qolS' not enough values to unpack (expected at least 4, got 2)
'\nQoLWQ,jSa' not enough values to unpack (expected at least 4, got 2)
'K1,\n,RE,fq,%,,sT+aAb' Invalid entry
"m,d,,8j4'),yQ,B7" Invalid entry
'g4,s1\t[}{.,M,<,\nzd,.am' Invalid entry
',Z[,z,c,#x1,gc.F' Invalid entry
'pWs,rT`,R' not enough values to unpack (expected at least 4, got 3)
'iN,br%,Q,R' Invalid entry
'ol,\nH<\tn,^#,=A' Invalid entry
None of the entries will get through unless the fuzzer can produce either van or car.
Indeed, the reason is that the grammar itself does not capture the complete information about the format. So here is another idea. We modify the GrammarFuzzer to know a bit about our format.
Let us try again!
gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
gf.update_cache('<item>', [
('<item>', [('car', [])]),
('<item>', [('van', [])]),
])
trials = 10
valid = []
time = 0
for i in range(trials):
vehicle_info = gf.fuzz()
try:
print(repr(vehicle_info), end="")
process_vehicle(vehicle_info)
except Exception as e:
print("\t", e)
else:
print()
',h,van,' Invalid entry
'M,w:K,car,car,van' Invalid entry
'J,?Y,van,van,car,J,~D+' Invalid entry
'S4,car,car,o' invalid literal for int() with base 10: 'S4'
'2*,van' not enough values to unpack (expected at least 4, got 2)
'van,%,5,]' Invalid entry
'van,G3{y,j,h:' Invalid entry
'$0;o,M,car,car' Invalid entry
'2d,f,e' not enough values to unpack (expected at least 4, got 3)
'/~NE,car,car' not enough values to unpack (expected at least 4, got 3)
At least we are getting somewhere! It would be really nice if we could incorporate what we know about the sample data in our fuzzer. In fact, it would be nice if we could extract the template and valid values from samples, and use them in our fuzzing. How do we do that? The quick answer to this question is: Use a parser.
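Before turning to parsers, the idea of reusing values from samples can be tried in a very naive form. The sketch below is not part of the fuzzing infrastructure used in this chapter: it assumes that every sample row has the same number of comma-separated columns, and the names SAMPLES, column_pools() and recombine() are made up for this illustration.

```python
import random

# Sample rows taken from the inventory example above.
SAMPLES = [
    "1997,van,Ford,E350",
    "2000,car,Mercury,Cougar",
]


def column_pools(rows):
    """Collect the distinct values seen in each column of the sample rows."""
    split_rows = [row.split(',') for row in rows]
    return [sorted(set(column)) for column in zip(*split_rows)]


def recombine(pools, rng):
    """Build a new row by picking one known value per column."""
    return ','.join(rng.choice(pool) for pool in pools)


rng = random.Random(0)  # fixed seed, only to make runs repeatable
pools = column_pools(SAMPLES)
new_rows = [recombine(pools, rng) for _ in range(5)]
for row in new_rows:
    print(row)
```

Every generated row has a valid year and a valid kind, but only because the format is trivially splittable on commas. As soon as the format has nested or quoted structure, splitting by hand no longer works, which is where a parser comes in.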
Generally speaking, a parser is the part of a program that processes (structured) input. The parsers we discuss in this chapter transform an input string into a derivation tree (discussed in the chapter on efficient grammar fuzzing). From a user's perspective, all it takes to parse an input is two steps:
Initialize the parser with a grammar, as in
parser = Parser(grammar)
Use the parser to retrieve a list of derivation trees:
trees = parser.parse(input)
Once we have parsed a tree, we can use it just like the derivation trees produced from grammar fuzzing.
We discuss a number of such parsers, in particular
PEG parsers (PEGParser), which are very efficient, but limited to specific grammar structures; and
Earley parsers (EarleyParser), which accept any kind of context-free grammar.
If you just want to use parsers (say, because your main focus is testing), you can just stop here and move on to the next chapter, where we learn how to make use of parsed inputs to mutate and recombine them. If you want to understand how parsers work, though, this chapter is right for you.
As we saw in the previous section, programmers often have to extract parts of data that obey certain rules. For example, for CSV files, each element in a row is separated by commas, and multiple rows are used to store the data.
To extract the information, we write an ad hoc parser parse_csv().
def parse_csv(mystring):
children = []
tree = (START_SYMBOL, children)
for i, line in enumerate(mystring.split('\n')):
children.append(("record %d" % i, [(cell, [])
for cell in line.split(',')]))
return tree
We also change the default orientation of the graph to left to right rather than top to bottom for easier viewing using lr_graph().
def lr_graph(dot):
dot.attr('node', shape='plain')
dot.graph_attr['rankdir'] = 'LR'
The display_tree() function shows the structure of our CSV file after parsing.
tree = parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
This is of course simple. What if we encounter slightly more complexity? Again, another example from the Wikipedia.
mystring = '''\
1997,Ford,E350,"ac, abs, moon",3000.00\
'''
print(mystring)
1997,Ford,E350,"ac, abs, moon",3000.00
We define a new annotation method highlight_node() to mark the nodes that are interesting.
def highlight_node(predicate):
def hl_node(dot, nid, symbol, ann):
if predicate(dot, nid, symbol, ann):
dot.node(repr(nid), dot_escape(symbol), fontcolor='red')
else:
dot.node(repr(nid), dot_escape(symbol))
return hl_node
Using highlight_node(), we can highlight particular nodes that were wrongly parsed.
tree = parse_csv(mystring)
bad_nodes = {5, 6, 7, 12, 13, 20, 22, 23, 24, 25}
def hl_predicate(_d, nid, _s, _a): return nid in bad_nodes
highlight_err_node = highlight_node(hl_predicate)
display_tree(tree, log=False, node_attr=highlight_err_node,
graph_attr=lr_graph)
The marked nodes indicate where our parsing went wrong. We can of course extend our parser to understand quotes. First we define the helper functions parse_quote(), find_comma() and comma_split().
def parse_quote(string, i):
    # Return the index of the closing quote, or -1 if there is none
    v = string[i + 1:].find('"')
    return v + i + 1 if v >= 0 else -1
def find_comma(string, i):
    # Return the index of the next comma outside quotes, or -1 if there is none
    slen = len(string)
    while i < slen:
        if string[i] == '"':
            i = parse_quote(string, i)
            if i == -1:
                return -1
        if string[i] == ',':
            return i
        i += 1
    return -1
def comma_split(string):
    # Yield the cells of one line, splitting only at unquoted commas
    slen = len(string)
    i = 0
    while i < slen:
        c = find_comma(string, i)
        if c == -1:
            yield string[i:]
            return
        else:
            yield string[i:c]
        i = c + 1
We can update our parse_csv() procedure to use our advanced quote parser.
def parse_csv(mystring):
children = []
tree = (START_SYMBOL, children)
for i, line in enumerate(mystring.split('\n')):
children.append(("record %d" % i, [(cell, [])
for cell in comma_split(line)]))
return tree
Our new parse_csv() can now handle quotes correctly.
tree = parse_csv(mystring)
display_tree(tree, graph_attr=lr_graph)
That of course does not survive long:
mystring = '''\
1999,Chevy,"Venture \\"Extended Edition, Very Large\\"",,5000.00\
'''
print(mystring)
1999,Chevy,"Venture \"Extended Edition, Very Large\"",,5000.00
A few embedded quotes are sufficient to confuse our parser again.
tree = parse_csv(mystring)
bad_nodes = {4, 5}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
Here is another record from that CSV file:
mystring = '''\
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''
print(mystring)
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
tree = parse_csv(mystring)
bad_nodes = {5, 6, 7, 8, 9, 10}
display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)
Fixing this would require modifying both the inner parse_quote() and the outer parse_csv() procedures. We note that each of these features is actually documented in the CSV RFC 4180.
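For comparison, Python's standard csv module already handles quoted commas and quoted line breaks. The following sketch is only meant as a reference point and is not used elsewhere in this chapter; read_records() is a helper name introduced here.

```python
import csv
import io


def read_records(text):
    """Parse CSV text with the standard library reader."""
    return list(csv.reader(io.StringIO(text)))


# A quoted cell containing commas.
quoted = '1997,Ford,E350,"ac, abs, moon",3000.00\n'
# A quoted cell containing a line break.
multiline = ('1996,Jeep,Grand Cherokee,"MUST SELL!\n'
             'air, moon roof, loaded",4799.00\n')

print(read_records(quoted))
print(read_records(multiline))
```

Note that RFC 4180 escapes a quote inside a quoted cell by doubling it, so the backslash-escaped quotes in the Chevy record above would need additional reader settings.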
Indeed, each additional improvement falls apart even with a little extra complexity. The problem becomes severe when one encounters recursive expressions. For example, JSON is a common alternative to CSV files for saving data. Similarly, one may have to parse data from an HTML table instead of a CSV file if one is getting the data from the web.
One might be tempted to fix it with a little more ad hoc parsing, with a bit of regular expressions thrown in. However, that is the path to insanity.
It is here that formal parsers shine. The main idea is that any given set of strings belongs to a language, and these languages can be specified by their grammars (as we saw in the chapter on grammars). The great thing about grammars is that they can be composed. That is, one can introduce finer and finer details into an internal structure without affecting the external structure, and similarly, one can change the external structure without much impact on the internal structure. We briefly describe grammars in the next section.
A grammar, as you have read in the chapter on grammars, is a set of rules that explain how the start symbol can be expanded. Each rule has a name, also called a nonterminal, and a set of alternative choices for how the nonterminal can be expanded.
A1_GRAMMAR = {
"<start>": ["<expr>"],
"<expr>": ["<expr>+<expr>", "<expr><expr>", "<integer>"],
"<integer>": ["<digit><integer>", "<digit>"],
"<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
syntax_diagram(A1_GRAMMAR)
[Syntax diagram for the nonterminals start, expr, integer, and digit]
Here is a string that represents an arithmetic expression that we would like to parse, which is specified by the grammar above:
mystring = '1+2'
The derivation tree for our expression from this grammar is given by:
tree = ('<start>', [('<expr>',
[('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
('+', []),
('<expr>', [('<integer>', [('<digit>', [('2',
[])])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
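The tree representation used here is a nested pair (symbol, children). As a minimal sketch of how such a tree maps back to a string, the function below concatenates the leaves from left to right. It is a stand-in written for this illustration (named tree_to_text() to avoid confusion with the tree_to_string() used in the chapter), and it assumes that any symbol of the form <...> is a nonterminal.

```python
def tree_to_text(tree):
    """Concatenate the terminal leaves of a (symbol, children) tree."""
    symbol, children = tree
    if not children:
        # A childless nonterminal stands for an empty expansion.
        is_nonterminal = symbol.startswith('<') and symbol.endswith('>')
        return '' if is_nonterminal else symbol
    return ''.join(tree_to_text(child) for child in children)


tree = ('<start>', [('<expr>',
                     [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
                      ('+', []),
                      ('<expr>', [('<integer>', [('<digit>', [('2', [])])])])])])

print(tree_to_text(tree))
```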
While a grammar can be used to specify a given language, there could be multiple grammars that correspond to the same language. For example, here is another grammar to describe the same addition expression.
A2_GRAMMAR = {
"<start>": ["<expr>"],
"<expr>": ["<integer><expr_>"],
"<expr_>": ["+<expr>", "<expr>", ""],
"<integer>": ["<digit><integer_>"],
"<integer_>": ["<integer>", ""],
"<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}
syntax_diagram(A2_GRAMMAR)
[Syntax diagram for the nonterminals start, expr, expr_, integer, integer_, and digit]
The corresponding derivation tree is given by:
tree = ('<start>', [('<expr>', [('<integer>', [('<digit>', [('1', [])]),
('<integer_>', [])]),
('<expr_>', [('+', []),
('<expr>',
[('<integer>',
[('<digit>', [('2', [])]),
('<integer_>', [])]),
('<expr_>', [])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
LR_GRAMMAR = {
'<start>': ['<A>'],
'<A>': ['<A>a', ''],
}
syntax_diagram(LR_GRAMMAR)
[Syntax diagram for the nonterminals start and A]
mystring = 'aaaaaa'
display_tree(
('<start>', (('<A>', (('<A>', (('<A>', []), ('a', []))), ('a', []))), ('a', []))))
A grammar is indirectly left-recursive if any of the leftmost symbols can be expanded using their definitions to produce the nonterminal as the leftmost symbol of the expansion. The left recursion is called a hidden-left-recursion if, during the series of expansions of a nonterminal, one reaches a rule where the same nonterminal appears after a prefix of other symbols, and these symbols can derive the empty string. For example, in A1_GRAMMAR, <integer> would be considered hidden-left-recursive if <digit> could derive an empty string.
Right recursive grammars are defined similarly.
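The simplest case, direct left recursion, can be checked mechanically: a nonterminal is directly left-recursive if one of its alternatives starts with the nonterminal itself. The sketch below covers only this direct case (not the indirect or hidden variants described above); directly_left_recursive() is a helper name introduced here, and the grammars are repeated to keep the example self-contained.

```python
def directly_left_recursive(grammar):
    """Return the nonterminals that have an alternative starting with themselves."""
    found = set()
    for nonterminal, alternatives in grammar.items():
        for alternative in alternatives:
            # '<A>a' starts with '<A>'; the closing '>' prevents a false
            # match between, say, '<expr>' and '<expr_>'.
            if alternative.startswith(nonterminal):
                found.add(nonterminal)
    return found


LR_GRAMMAR = {
    '<start>': ['<A>'],
    '<A>': ['<A>a', ''],
}

RR_GRAMMAR = {
    '<start>': ['<A>'],
    '<A>': ['a<A>', ''],
}

print(directly_left_recursive(LR_GRAMMAR))
print(directly_left_recursive(RR_GRAMMAR))
```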
Below is the derivation tree for the right recursive grammar that represents the same language as that of LR_GRAMMAR.
RR_GRAMMAR = {
'<start>': ['<A>'],
'<A>': ['a<A>', ''],
}
syntax_diagram(RR_GRAMMAR)
[Syntax diagram for the nonterminals start and A]
display_tree(('<start>', ((
'<A>', (('a', []), ('<A>', (('a', []), ('<A>', (('a', []), ('<A>', []))))))),)))
To complicate matters further, there could be multiple derivation trees (also called parses) corresponding to the same string from the same grammar. For example, the string 1+2+3 can be parsed in two ways using A1_GRAMMAR, as we see below.
mystring = '1+2+3'
tree = ('<start>',
[('<expr>',
[('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
('+', []),
('<expr>', [('<integer>',
[('<digit>', [('2', [])])])])]), ('+', []),
('<expr>', [('<integer>', [('<digit>', [('3', [])])])])])])
assert mystring == tree_to_string(tree)
display_tree(tree)
tree = ('<start>',
[('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),
('+', []),
('<expr>',
[('<expr>', [('<integer>', [('<digit>', [('2', [])])])]),
('+', []),
('<expr>', [('<integer>', [('<digit>', [('3',
[])])])])])])])
assert tree_to_string(tree) == mystring
display_tree(tree)
There are many ways to resolve ambiguities. One approach, taken by Parsing Expression Grammars (explained in the next section), is to specify a particular order of resolution, and choose the first one. Another approach is to simply return all possible derivation trees, which is the approach taken by the Earley parser we develop later.
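To get a feeling for ambiguity, one can count how many derivation trees a grammar admits for a given string. The sketch below is not one of the parsers developed in this chapter. It is a simple memoized counter over substrings, and it assumes a grammar without empty expansions, written with each alternative as a list of symbols. The names A1_CANONICAL and count_parses() are introduced for this illustration.

```python
from functools import lru_cache

# A1_GRAMMAR with every alternative split into a list of symbols.
A1_CANONICAL = {
    '<start>': [['<expr>']],
    '<expr>': [['<expr>', '+', '<expr>'], ['<expr>', '<expr>'], ['<integer>']],
    '<integer>': [['<digit>', '<integer>'], ['<digit>']],
    '<digit>': [[d] for d in '0123456789'],
}


def count_parses(grammar, start, text):
    """Count derivation trees of `text` (assumes no empty expansions)."""

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways `symbol` derives exactly text[i:j].
        if symbol not in grammar:
            return 1 if text[i:j] == symbol else 0
        return sum(count_sequence(tuple(alt), i, j) for alt in grammar[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(symbols, i, j):
        # Number of ways the symbol sequence derives exactly text[i:j].
        if len(symbols) == 1:
            return count_symbol(symbols[0], i, j)
        total = 0
        for k in range(i + 1, j):  # every symbol consumes at least one character
            left = count_symbol(symbols[0], i, k)
            if left:
                total += left * count_sequence(symbols[1:], k, j)
        return total

    return count_symbol(start, 0, len(text))


print(count_parses(A1_CANONICAL, '<start>', '1+2'))
print(count_parses(A1_CANONICAL, '<start>', '1+2+3'))
```

For 1+2 there is a single tree, while 1+2+3 yields two, corresponding to the two trees shown above.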
Next, we develop different parsers. To do that, we define a minimal interface for parsing that is obeyed by all parsers. There are two approaches to parsing a string using a grammar: one can first split the input into tokens with a lexer and feed these tokens to the parser, or one can parse with a grammar that covers the complete syntax and prune the resulting tree afterwards.
In this chapter, we use the second approach. This approach is implemented in the prune_tree() method.
The Parser class we define below provides the minimal interface. The main methods that need to be implemented by the classes implementing this interface are parse_prefix() and parse(). The parse_prefix() method returns a tuple, which contains the index until which parsing was completed successfully, and the parse forest until that index. The parse() method returns a list of derivation trees if the parse was successful.
class Parser(object):
    def __init__(self, grammar, **kwargs):
        self._grammar = grammar
        self._start_symbol = kwargs.get('start_symbol', START_SYMBOL)
        self.log = kwargs.get('log', False)
        self.coalesce_tokens = kwargs.get('coalesce', True)
        self.tokens = kwargs.get('tokens', set())
    def grammar(self):
        return self._grammar
    def start_symbol(self):
        return self._start_symbol
    def parse_prefix(self, text):
        """Return pair (cursor, forest) for longest prefix of text"""
        # Subclasses must override this method
        raise NotImplementedError()
    def parse(self, text):
        cursor, forest = self.parse_prefix(text)
        if cursor < len(text):
            raise SyntaxError("at " + repr(text[cursor:]))
        return [self.prune_tree(tree) for tree in forest]
    def coalesce(self, children):
        # Merge runs of adjacent terminal children into a single leaf
        last = ''
        new_lst = []
        for cn, cc in children:
            if cn not in self._grammar:
                last += cn
            else:
                if last:
                    new_lst.append((last, []))
                    last = ''
                new_lst.append((cn, cc))
        if last:
            new_lst.append((last, []))
        return new_lst
    def prune_tree(self, tree):
        # Replace the subtree below each token nonterminal by its string
        name, children = tree
        if self.coalesce_tokens:
            children = self.coalesce(children)
        if name in self.tokens:
            return (name, [(tree_to_string(tree), [])])
        else:
            return (name, [self.prune_tree(c) for c in children])
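The effect of coalescing is easiest to see in isolation. The following standalone sketch mirrors the logic of the coalesce() method above as a plain function that takes the grammar as a parameter; the function signature and the sample data are assumptions made for this illustration.

```python
def coalesce(children, grammar):
    """Merge runs of adjacent terminal children into single leaves."""
    merged = []
    pending = ''  # terminal text collected since the last nonterminal
    for symbol, subtrees in children:
        if symbol not in grammar:
            pending += symbol
        else:
            if pending:
                merged.append((pending, []))
                pending = ''
            merged.append((symbol, subtrees))
    if pending:
        merged.append((pending, []))
    return merged


# Only the keys of the grammar matter for deciding what a nonterminal is.
grammar = {'<x>': ['3']}
children = [('1', []), ('2', []), ('<x>', [('3', [])]), ('4', [])]
print(coalesce(children, grammar))
```

The two leading terminals 1 and 2 end up in a single leaf 12, while the nonterminal <x> and its subtree are left untouched.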
A Parsing Expression Grammar (PEG) \cite{Ford2004} is a type of recognition-based formal grammar that specifies the sequence of steps to take to parse a given string.
A parsing expression grammar is very similar to a context-free grammar (CFG) such as the ones we saw in the chapter on grammars. As in a CFG, a parsing expression grammar is represented by a set of nonterminals and corresponding alternatives representing how to match each. For example, here is a PEG that matches a or b.
PEG1 = {
'<start>': ['a', 'b']
}
PEG2 = {
'<start>': ['ab', 'abc']
}
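In a PEG, the alternatives of a rule are tried in order and the first one that matches is taken, which is why the order of 'ab' and 'abc' in PEG2 matters. The sketch below illustrates this ordered choice for alternatives that consist of plain strings only; first_match() is a helper name introduced here and not part of the parser developed in this chapter.

```python
def first_match(alternatives, text, at=0):
    """Return the end index of the first alternative matching at `at`, else None."""
    for alternative in alternatives:
        if text.startswith(alternative, at):
            return at + len(alternative)
    return None


PEG2 = {
    '<start>': ['ab', 'abc']
}

# 'ab' is tried first and succeeds, so 'abc' is never considered.
print(first_match(PEG2['<start>'], 'abc'))
# With the longer alternative first, the whole input is consumed.
print(first_match(['abc', 'ab'], 'abc'))
```

With the ordering in PEG2, only the prefix ab of the input abc is consumed, so a parser that requires the whole input to be matched rejects abc.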
Short of hand rolling a parser, Packrat parsing is one of the simplest parsing techniques, and is one of the techniques for parsing PEGs. The Packrat parser is so named because it tries to cache all results from simpler problems in the hope that these solutions can be used to avoid recomputation later. We develop a minimal Packrat parser next.
But before that, we need to implement a few supporting tools.
class Parser(Parser):
def __init__(self, grammar, **kwargs):
self._grammar = grammar
self._start_symbol = kwargs.get('start_symbol', START_SYMBOL)
self.log = kwargs.get('log', False)
self.tokens = kwargs.get('tokens', set())
self.coalesce_tokens = kwargs.get('coalesce', True)
self.cgrammar = canonical(grammar)
We derive from the Parser base class first, and we accept the text to be parsed in the parse() method, which in turn calls unify_key() with the start_symbol.
Note. While our PEG parser can produce only a single unambiguous parse tree, other parsers can produce multiple parses for ambiguous grammars. Hence, we return a list of trees (in this case with a single element).
class PEGParser(Parser):
def parse_prefix(self, text):
cursor, tree = self.unify_key(self.start_symbol(), text, 0)
return cursor, [tree]
class PEGParser(PEGParser):
def unify_key(self, key, text, at=0):
if self.log:
print("unify_key: %s with %s" % (repr(key), repr(text[at:])))
if key not in self.cgrammar:
if text[at:].startswith(key):
return at + len(key), (key, [])
else:
return at, None
for rule in self.cgrammar[key]:
to, res = self.unify_rule(rule, text, at)
if res:
return (to, (key, res))
return 0, None
mystring = "1"
peg = PEGParser(EXPR_GRAMMAR, log=True)
peg.unify_key('1', mystring)
unify_key: '1' with '1'
(1, ('1', []))
mystring = "2"
peg.unify_key('1', mystring)
unify_key: '1' with '2'
(0, None)
class PEGParser(PEGParser):
def unify_rule(self, rule, text, at):
if self.log:
print('unify_rule: %s with %s' % (repr(rule), repr(text[at:])))
results = []
for token in rule:
at, res = self.unify_key(token, text, at)
if res is None:
return at, None
results.append(res)
return at, results
mystring = "0"
peg = PEGParser(EXPR_GRAMMAR, log=True)
peg.unify_rule(peg.cgrammar['<digit>'][0], mystring, 0)
unify_rule: ['0'] with '0'
unify_key: '0' with '0'
(1, [('0', [])])
mystring = "12"
peg.unify_rule(peg.cgrammar['<integer>'][0], mystring, 0)
unify_rule: ['<digit>', '<integer>'] with '12' unify_key: '<digit>' with '12' unify_rule: ['0'] with '12' unify_key: '0' with '12' unify_rule: ['1'] with '12' unify_key: '1' with '12' unify_key: '<integer>' with '2' unify_rule: ['<digit>', '<integer>'] with '2' unify_key: '<digit>' with '2' unify_rule: ['0'] with '2' unify_key: '0' with '2' unify_rule: ['1'] with '2' unify_key: '1' with '2' unify_rule: ['2'] with '2' unify_key: '2' with '2' unify_key: '<integer>' with '' unify_rule: ['<digit>', '<integer>'] with '' unify_key: '<digit>' with '' unify_rule: ['0'] with '' unify_key: '0' with '' unify_rule: ['1'] with '' unify_key: '1' with '' unify_rule: ['2'] with '' unify_key: '2' with '' unify_rule: ['3'] with '' unify_key: '3' with '' unify_rule: ['4'] with '' unify_key: '4' with '' unify_rule: ['5'] with '' unify_key: '5' with '' unify_rule: ['6'] with '' unify_key: '6' with '' unify_rule: ['7'] with '' unify_key: '7' with '' unify_rule: ['8'] with '' unify_key: '8' with '' unify_rule: ['9'] with '' unify_key: '9' with '' unify_rule: ['<digit>'] with '' unify_key: '<digit>' with '' unify_rule: ['0'] with '' unify_key: '0' with '' unify_rule: ['1'] with '' unify_key: '1' with '' unify_rule: ['2'] with '' unify_key: '2' with '' unify_rule: ['3'] with '' unify_key: '3' with '' unify_rule: ['4'] with '' unify_key: '4' with '' unify_rule: ['5'] with '' unify_key: '5' with '' unify_rule: ['6'] with '' unify_key: '6' with '' unify_rule: ['7'] with '' unify_key: '7' with '' unify_rule: ['8'] with '' unify_key: '8' with '' unify_rule: ['9'] with '' unify_key: '9' with '' unify_rule: ['<digit>'] with '2' unify_key: '<digit>' with '2' unify_rule: ['0'] with '2' unify_key: '0' with '2' unify_rule: ['1'] with '2' unify_key: '1' with '2' unify_rule: ['2'] with '2' unify_key: '2' with '2'
(2, [('<digit>', [('1', [])]), ('<integer>', [('<digit>', [('2', [])])])])
mystring = "1 + 2"
peg = PEGParser(EXPR_GRAMMAR, log=False)
peg.parse(mystring)
[('<start>', [('<expr>', [('<term>', [('<factor>', [('<integer>', [('<digit>', [('1', [])])])])]), (' + ', []), ('<expr>', [('<term>', [('<factor>', [('<integer>', [('<digit>', [('2', [])])])])])])])])]
class PEGParser(PEGParser):
@lru_cache(maxsize=None)
def unify_key(self, key, text, at=0):
if key not in self.cgrammar:
if text[at:].startswith(key):
return at + len(key), (key, [])
else:
return at, None
for rule in self.cgrammar[key]:
to, res = self.unify_rule(rule, text, at)
if res:
return (to, (key, res))
return 0, None
mystring = "1 + (2 * 3)"
peg = PEGParser(EXPR_GRAMMAR)
for tree in peg.parse(mystring):
assert tree_to_string(tree) == mystring
display_tree(tree)
mystring = "1 * (2 + 3.35)"
for tree in peg.parse(mystring):
assert tree_to_string(tree) == mystring
display_tree(tree)
While PEGs are simple at first sight, their behavior in some cases might be a bit unintuitive. Here is an example \cite{redziejowski}:
PEG_SURPRISE = {
"<A>": ["a<A>a", "aa"]
}
strings = []
for e in range(4):
f = GrammarFuzzer(PEG_SURPRISE, start_symbol='<A>')
tree = ('<A>', None)
for _ in range(e):
tree = f.expand_tree_once(tree)
tree = f.expand_tree_with_strategy(tree, f.expand_node_min_cost)
strings.append(tree_to_string(tree))
display_tree(tree)
strings
['aa', 'aaaa', 'aaaaaa', 'aaaaaaaa']
However, the PEG parser can only recognize those strings whose length is of the form $2^n$:
peg = PEGParser(PEG_SURPRISE, start_symbol='<A>')
for s in strings:
with ExpectError():
for tree in peg.parse(s):
display_tree(tree)
print(s)
aa
aaaa
Traceback (most recent call last):
  File "<ipython-input-70-dec55ebf796e>", line 4, in <module>
    for tree in peg.parse(s):
  File "<ipython-input-49-abe75b43d33f>", line 22, in parse
    raise SyntaxError("at " + repr(text[cursor:]))
  File "<string>", line None
SyntaxError: at 'aa' (expected)
aaaaaaaa
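This behavior can be reproduced with a few lines of code. The sketch below is a minimal packrat-style recognizer written for this illustration: it only returns the end position of a match rather than a tree, it caches results per (nonterminal, position) in a dictionary, and it assumes a grammar without left recursion whose alternatives are lists of symbols. The name peg_match() is not part of the PEGParser class above.

```python
def peg_match(grammar, key, text, at=0, memo=None):
    """Return the end index of the match of `key` at position `at`, or None."""
    if memo is None:
        memo = {}
    if key not in grammar:                       # terminal symbol
        return at + len(key) if text.startswith(key, at) else None
    if (key, at) in memo:                        # packrat cache hit
        return memo[(key, at)]
    result = None
    for rule in grammar[key]:                    # ordered choice
        pos = at
        for token in rule:
            pos = peg_match(grammar, token, text, pos, memo)
            if pos is None:                      # this alternative failed
                break
        if pos is not None:                      # first success wins
            result = pos
            break
    memo[(key, at)] = result
    return result


# PEG_SURPRISE with each alternative split into symbols.
PEG_SURPRISE_C = {'<A>': [['a', '<A>', 'a'], ['a', 'a']]}

for n in (2, 4, 6, 8):
    s = 'a' * n
    end = peg_match(PEG_SURPRISE_C, '<A>', s)
    print(s, 'fully matched' if end == n else 'matched only up to %d' % end)
```

For six letters, the inner <A> commits to the first alternative that succeeds locally, which leaves the outer rule matching only the first four letters. Since a PEG never revisits a choice that succeeded, the remaining aa is reported as a syntax error.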
The general idea of a CFG parser is the following: peek at the input text for the allowed number of characters, and use these together with the parser state to determine which rules can be applied to complete parsing. We next look at a typical CFG parsing algorithm, the Earley parser.
We use the following grammar in our examples below.
SAMPLE_GRAMMAR = {
'<start>': ['<A><B>'],
'<A>': ['a<B>c', 'a<A>'],
'<B>': ['b<C>', '<D>'],
'<C>': ['c'],
'<D>': ['d']
}
C_SAMPLE_GRAMMAR = canonical(SAMPLE_GRAMMAR)
syntax_diagram(SAMPLE_GRAMMAR)
[Syntax diagram for the nonterminals start, A, B, C, and D]
class Column(object):
def __init__(self, index, letter):
self.index, self.letter = index, letter
self.states, self._unique = [], {}
def __str__(self):
return "%s chart[%d]\n%s" % (self.letter, self.index, "\n".join(
str(state) for state in self.states if state.finished()))
class Column(Column):
def add(self, state):
if state in self._unique:
return self._unique[state]
self._unique[state] = state
self.states.append(state)
state.e_col = self
return self._unique[state]
class Item(object):
def __init__(self, name, expr, dot):
self.name, self.expr, self.dot = name, expr, dot
class Item(Item):
def finished(self):
return self.dot >= len(self.expr)
def advance(self):
return Item(self.name, self.expr, self.dot + 1)
def at_dot(self):
return self.expr[self.dot] if self.dot < len(self.expr) else None
Here is how an item could be used. We first define our item
item_name = '<B>'
item_expr = C_SAMPLE_GRAMMAR[item_name][1]
an_item = Item(item_name, tuple(item_expr), 0)
To determine the status of parsing, we use at_dot():
an_item.at_dot()
'<D>'
That is, the next symbol to be parsed is <D>.
If we advance the item, we get another item that represents the finished parsing rule <B>.
another_item = an_item.advance()
another_item.finished()
True
class State(Item):
    def __init__(self, name, expr, dot, s_col, e_col=None):
        super().__init__(name, expr, dot)
        self.s_col, self.e_col = s_col, e_col
    def __str__(self):
        def idx(var):
            # -1 marks a column that has not been assigned yet
            return var.index if var else -1
        # The '' entry sits at the dot position (it shows up as an extra space)
        return self.name + ':= ' + ' '.join([
            str(p)
            for p in [*self.expr[:self.dot], '', *self.expr[self.dot:]]
        ]) + "(%d,%d)" % (idx(self.s_col), idx(self.e_col))
    def copy(self):
        return State(self.name, self.expr, self.dot, self.s_col, self.e_col)
    def _t(self):
        return (self.name, self.expr, self.dot, self.s_col.index)
    def __hash__(self):
        return hash(self._t())
    def __eq__(self, other):
        return self._t() == other._t()
    def advance(self):
        return State(self.name, self.expr, self.dot + 1, self.s_col)
The usage of State is similar to that of Item. The only difference is that it is used along with the Column to track the parsing state. For example, we initialize the first column as follows:
col_0 = Column(0, None)
item_expr = tuple(*C_SAMPLE_GRAMMAR[START_SYMBOL])
start_state = State(START_SYMBOL, item_expr, 0, col_0)
col_0.add(start_state)
start_state.at_dot()
'<A>'
The first column is then updated by using the add() method of Column:
sym = start_state.at_dot()
for alt in C_SAMPLE_GRAMMAR[sym]:
col_0.add(State(sym, tuple(alt), 0, col_0))
for s in col_0.states:
print(s)
<start>:=  <A> <B>(0,0)
<A>:=  a <B> c(0,0)
<A>:=  a <A>(0,0)
class EarleyParser(Parser):
def __init__(self, grammar, **kwargs):
super().__init__(grammar, **kwargs)
self.cgrammar = canonical(grammar, letters=True)
class EarleyParser(EarleyParser):
def chart_parse(self, words, start):
alt = tuple(*self.cgrammar[start])
chart = [Column(i, tok) for i, tok in enumerate([None, *words])]
chart[0].add(State(start, alt, 0, chart[0]))
return self.fill_chart(chart)
class EarleyParser(EarleyParser):
def predict(self, col, sym, state):
for alt in self.cgrammar[sym]:
col.add(State(sym, tuple(alt), 0, col))
To see how to use predict(), we first construct the 0th column as before, and we assign the constructed column to an instance of the EarleyParser.
col_0 = Column(0, None)
col_0.add(start_state)
ep = EarleyParser(SAMPLE_GRAMMAR)
ep.chart = [col_0]
It should contain a single state, <start> at 0:
for s in ep.chart[0].states:
print(s)
<start>:=  <A> <B>(0,0)
We apply predict to fill out the 0th column, and the column should contain the possible parse paths.
ep.predict(col_0, '<A>', s)
for s in ep.chart[0].states:
print(s)
<start>:=  <A> <B>(0,0)
<A>:=  a <B> c(0,0)
<A>:=  a <A>(0,0)
class EarleyParser(EarleyParser):
def scan(self, col, state, letter):
if letter == col.letter:
col.add(state.advance())
As before, we construct the partial parse first, this time adding a new column so that we can observe the effects of scan():
ep = EarleyParser(SAMPLE_GRAMMAR)
col_1 = Column(1, 'a')
ep.chart = [col_0, col_1]
new_state = ep.chart[0].states[1]
print(new_state)
<A>:=  a <B> c(0,0)
ep.scan(col_1, new_state, 'a')
for s in ep.chart[1].states:
print(s)
<A>:= a  <B> c(0,1)
class EarleyParser(EarleyParser):
def complete(self, col, state):
return self.earley_complete(col, state)
def earley_complete(self, col, state):
parent_states = [
st for st in state.s_col.states if st.at_dot() == state.name
]
for st in parent_states:
col.add(st.advance())
Here is an example of completed processing. First we complete Column 0:
ep = EarleyParser(SAMPLE_GRAMMAR)
col_1 = Column(1, 'a')
col_2 = Column(2, 'd')
ep.chart = [col_0, col_1, col_2]
ep.predict(col_0, '<A>', s)
for s in ep.chart[0].states:
print(s)
<start>:=  <A> <B>(0,0)
<A>:=  a <B> c(0,0)
<A>:=  a <A>(0,0)
Then we use scan() to populate Column 1:
for state in ep.chart[0].states:
if state.at_dot() not in SAMPLE_GRAMMAR:
ep.scan(col_1, state, 'a')
for s in ep.chart[1].states:
print(s)
<A>:= a  <B> c(0,1)
<A>:= a  <A>(0,1)
for state in ep.chart[1].states:
if state.at_dot() in SAMPLE_GRAMMAR:
ep.predict(col_1, state.at_dot(), state)
for s in ep.chart[1].states:
print(s)
<A>:= a  <B> c(0,1)
<A>:= a  <A>(0,1)
<B>:=  b <C>(1,1)
<B>:=  <D>(1,1)
<A>:=  a <B> c(1,1)
<A>:=  a <A>(1,1)
<D>:=  d(1,1)
Then we use scan() again to populate Column 2:
for state in ep.chart[1].states:
if state.at_dot() not in SAMPLE_GRAMMAR:
ep.scan(col_2, state, state.at_dot())
for s in ep.chart[2].states:
print(s)
<D>:= d (1,2)
Now, we can use complete():
for state in ep.chart[2].states:
if state.finished():
ep.complete(col_2, state)
for s in ep.chart[2].states:
print(s)
<D>:= d (1,2)
<B>:= <D> (1,2)
<A>:= a <B>  c(0,2)
class EarleyParser(EarleyParser):
def fill_chart(self, chart):
for i, col in enumerate(chart):
for state in col.states:
if state.finished():
self.complete(col, state)
else:
sym = state.at_dot()
if sym in self.cgrammar:
self.predict(col, sym, state)
else:
if i + 1 >= len(chart):
continue
self.scan(chart[i + 1], state, sym)
if self.log:
print(col, '\n')
return chart
We now can recognize a given string as belonging to a language represented by a grammar.
ep = EarleyParser(SAMPLE_GRAMMAR, log=True)
columns = ep.chart_parse('adcd', START_SYMBOL)
None chart[0]

a chart[1]

d chart[2]
<D>:= d (1,2)
<B>:= <D> (1,2)

c chart[3]
<A>:= a <B> c (0,3)

d chart[4]
<D>:= d (3,4)
<B>:= <D> (3,4)
<start>:= <A> <B> (0,4)
The chart we printed above only shows completed entries at each index. The parenthesized expression indicates the column just before the first character was recognized, and the ending column.
Notice how the <start> nonterminal shows fully parsed status.
last_col = columns[-1]
for s in last_col.states:
if s.name == '<start>':
print(s)
<start>:= <A> <B> (0,4)
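The interplay of predict, scan and complete can also be seen in a condensed, self-contained form. The sketch below is not the EarleyParser class of this chapter. It is a minimal recognizer that represents each state as a tuple (name, expr, dot, origin), answers only whether the whole input is accepted, and assumes a grammar without empty expansions, given with each alternative as a list of symbols. The names SAMPLE_CANONICAL and earley_recognize() are introduced for this illustration.

```python
# SAMPLE_GRAMMAR with every alternative split into a list of symbols.
SAMPLE_CANONICAL = {
    '<start>': [['<A>', '<B>']],
    '<A>': [['a', '<B>', 'c'], ['a', '<A>']],
    '<B>': [['b', '<C>'], ['<D>']],
    '<C>': [['c']],
    '<D>': [['d']],
}


def earley_recognize(grammar, start, text):
    """Return True if `text` is derivable from `start` (no empty expansions)."""
    # chart[i] holds the states (name, expr, dot, origin) that end at position i.
    chart = [[] for _ in range(len(text) + 1)]

    def add(i, state):
        if state not in chart[i]:
            chart[i].append(state)

    for alt in grammar[start]:
        add(0, (start, tuple(alt), 0, 0))

    for i in range(len(text) + 1):
        # chart[i] may grow while we iterate; new states are picked up too.
        for name, expr, dot, origin in chart[i]:
            if dot == len(expr):
                # complete: advance every state waiting for `name` at `origin`
                for pname, pexpr, pdot, porigin in chart[origin]:
                    if pdot < len(pexpr) and pexpr[pdot] == name:
                        add(i, (pname, pexpr, pdot + 1, porigin))
            elif expr[dot] in grammar:
                # predict: start all alternatives of the expected nonterminal
                for alt in grammar[expr[dot]]:
                    add(i, (expr[dot], tuple(alt), 0, i))
            elif i < len(text) and text[i] == expr[dot]:
                # scan: the expected terminal matches the next input letter
                add(i + 1, (name, expr, dot + 1, origin))

    return any(name == start and dot == len(expr) and origin == 0
               for name, expr, dot, origin in chart[-1])


print(earley_recognize(SAMPLE_CANONICAL, '<start>', 'adcd'))
print(earley_recognize(SAMPLE_CANONICAL, '<start>', 'adc'))
```

The input adcd is accepted because the last column contains a finished <start> state that began at position 0, which is exactly the state printed above. The prefix adc is rejected since <B> is still expected after <A>.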
class EarleyParser(EarleyParser):
    def parse_prefix(self, text):
        self.table = self.chart_parse(text, self.start_symbol())
        # Search from the last column backwards for a <start> state
        for col in reversed(self.table):
            states = [
                st for st in col.states if st.name == self.start_symbol()
            ]
            if states:
                return col.index, states
        return -1, []
Here is parse_prefix() in action.
ep = EarleyParser(SAMPLE_GRAMMAR)
cursor, last_states = ep.parse_prefix('adcd')
print(cursor, [str(s) for s in last_states])
4 ['<start>:= <A> <B> (0,4)']
The following is adapted from the excellent reference on Earley parsing by Loup Vaillant.
Our parse() method is as follows. It depends on two methods, parse_forest() and extract_trees(), that will be defined next.
class EarleyParser(EarleyParser):
def parse(self, text):
cursor, states = self.parse_prefix(text)
start = next((s for s in states if s.finished()), None)
if cursor < len(text) or not start:
raise SyntaxError("at " + repr(text[cursor:]))
forest = self.parse_forest(self.table, start)
for tree in self.extract_trees(forest):
yield self.prune_tree(tree)
class EarleyParser(EarleyParser):
    def parse_paths(self, named_expr, chart, frm, til):
        def paths(state, start, k, e):
            if not e:
                return [[(state, k)]] if start == frm else []
            else:
                return [[(state, k)] + r
                        for r in self.parse_paths(e, chart, frm, start)]
        *expr, var = named_expr
        starts = None
        if var not in self.cgrammar:
            # Terminal: it must be the letter of column `til`
            starts = ([(var, til - len(var),
                        't')] if til > 0 and chart[til].letter == var else [])
        else:
            # Nonterminal: every finished state for `var` ending at `til`
            starts = [(s, s.s_col.index, 'n') for s in chart[til].states
                      if s.finished() and s.name == var]
        return [p for s, start, k in starts for p in paths(s, start, k, expr)]
Here is parse_paths() in action:
print(SAMPLE_GRAMMAR['<start>'])
ep = EarleyParser(SAMPLE_GRAMMAR)
completed_start = last_states[0]
paths = ep.parse_paths(completed_start.expr, columns, 0, 4)
for path in paths:
print([list(str(s_) for s_ in s) for s in path])
['<A><B>'] [['<B>:= <D> (3,4)', 'n'], ['<A>:= a <B> c (0,3)', 'n']]
That is, the parse path for <start>
given the input adcd
included recognizing the expression <A><B>
. This was recognized by the two states: <A>
from input(0) to input(2) which further involved recognizing the rule a<B>c
, and the next state <B>
from input(3) which involved recognizing the rule <D>
.
class EarleyParser(EarleyParser):
    def forest(self, s, kind, chart):
        return self.parse_forest(chart, s) if kind == 'n' else (s, [])

    def parse_forest(self, chart, state):
        pathexprs = self.parse_paths(state.expr, chart, state.s_col.index,
                                     state.e_col.index) if state.expr else []
        return state.name, [[(v, k, chart) for v, k in reversed(pathexpr)]
                            for pathexpr in pathexprs]
ep = EarleyParser(SAMPLE_GRAMMAR)
result = ep.parse_forest(columns, last_states[0])
result
('<start>', [[(<__main__.State at 0x11043ff28>, 'n', [<__main__.Column at 0x1103ef748>, <__main__.Column at 0x1103ef8d0>, <__main__.Column at 0x1103ef080>, <__main__.Column at 0x1103ef438>, <__main__.Column at 0x1103ef240>]), (<__main__.State at 0x1103d2eb8>, 'n', [<__main__.Column at 0x1103ef748>, <__main__.Column at 0x1103ef8d0>, <__main__.Column at 0x1103ef080>, <__main__.Column at 0x1103ef438>, <__main__.Column at 0x1103ef240>])]])
class EarleyParser(EarleyParser):
    def extract_a_tree(self, forest_node):
        name, paths = forest_node
        if not paths:
            return (name, [])
        return (name, [self.extract_a_tree(self.forest(*p)) for p in paths[0]])

    def extract_trees(self, forest):
        yield self.extract_a_tree(forest)
A3_GRAMMAR = {
    "<start>": ["<bexpr>"],
    "<bexpr>": [
        "<aexpr><gt><aexpr>", "<aexpr><lt><aexpr>", "<aexpr>=<aexpr>",
        "<bexpr>=<bexpr>", "<bexpr>&<bexpr>", "<bexpr>|<bexpr>", "(<bexpr>)"
    ],
    "<aexpr>":
    ["<aexpr>+<aexpr>", "<aexpr>-<aexpr>", "(<aexpr>)", "<integer>"],
    "<integer>": ["<digit><integer>", "<digit>"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    "<lt>": ['<'],
    "<gt>": ['>']
}
syntax_diagram(A3_GRAMMAR)
start
bexpr
aexpr
integer
digit
lt
gt
mystring = '(1+24)=33'
parser = EarleyParser(A3_GRAMMAR)
for tree in parser.parse(mystring):
assert tree_to_string(tree) == mystring
display_tree(tree)
We now have a complete parser that can parse almost arbitrary CFGs. There remains a small corner to fix: the case of epsilon rules, as we will see later.
class EarleyParser(EarleyParser):
    def extract_trees(self, forest_node):
        name, paths = forest_node
        if not paths:
            yield (name, [])
        for path in paths:
            ptrees = [self.extract_trees(self.forest(*p)) for p in path]
            for p in zip(*ptrees):
                yield (name, p)
One can also use a GrammarFuzzer
to verify that everything works.
gf = GrammarFuzzer(A1_GRAMMAR)
for i in range(5):
s = gf.fuzz()
print(i, s)
for tree in parser.parse(s):
assert tree_to_string(tree) == s
0 045+3+29+7751449 1 0+9+52+18+43+7+2 2 76413 3 9339 4 62
EPSILON = ''
E_GRAMMAR = {
'<start>': ['<S>'],
'<S>': ['<A><A><A><A>'],
'<A>': ['a', '<E>'],
'<E>': [EPSILON]
}
syntax_diagram(E_GRAMMAR)
start
S
A
E
mystring = 'a'
parser = EarleyParser(E_GRAMMAR)
with ExpectError():
trees = parser.parse(mystring)
A fixpoint
of a function is an element in the function's domain such that it is mapped to itself. For example, 1 is a fixpoint
of square root because squareroot(1) == 1
.
(We use str
rather than hash
to check for equality in fixpoint
because the data structure set
, which we would like to use as an argument has a good string representation but is not hashable).
def fixpoint(f):
    def helper(arg):
        while True:
            sarg = str(arg)
            arg_ = f(arg)
            if str(arg_) == sarg:
                return arg
            arg = arg_
    return helper

def my_sqrt(x):
    @fixpoint
    def _my_sqrt(approx):
        return (approx + x / approx) / 2
    return _my_sqrt(1)
my_sqrt(2)
1.414213562373095
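To see that the same idea is not tied to numeric approximation, here is a minimal sketch that uses a fixpoint to collect the nodes reachable in a small graph. The graph EDGES and the function grow() are made up for this illustration. The state is kept as a sorted tuple so that str() yields a canonical form, which is what the string comparison in fixpoint() relies on.

```python
def fixpoint(f):
    """Apply f repeatedly until the string form of the result stops changing."""
    def helper(arg):
        while True:
            sarg = str(arg)
            arg_ = f(arg)
            if str(arg_) == sarg:
                return arg
            arg = arg_
    return helper

# Hypothetical example graph: node -> list of successor nodes.
EDGES = {'a': ['b'], 'b': ['c'], 'c': ['a'], 'd': ['a']}

@fixpoint
def grow(reached):
    # One intermediate step: add every direct successor of the nodes seen so far.
    nxt = set(reached)
    for node in reached:
        nxt.update(EDGES.get(node, []))
    return tuple(sorted(nxt))  # canonical order, so str() is stable

reachable = grow(('a',))
print(reachable)  # ('a', 'b', 'c')
```

As with my_sqrt(), we only describe a single step; the decorator takes care of repeating it until nothing changes.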
Similarly, we can define nullable
using fixpoint
. We essentially provide the definition of a single intermediate step. That is, assuming that nullables
contain the current nullable
nonterminals, we iterate over the grammar looking for productions which are nullable, that is, productions where the entire sequence can yield an empty string on some expansion.
def nullable_expr(expr, nullables):
    return all(token in nullables for token in expr)

def nullable(grammar):
    productions = rules(grammar)

    @fixpoint
    def nullable_(nullables):
        for A, expr in productions:
            if nullable_expr(expr, nullables):
                # accumulate; plain assignment would forget earlier nullables
                nullables |= {A}
        return (nullables)

    return nullable_({EPSILON})
for key, grammar in {
'E_GRAMMAR': E_GRAMMAR,
'E_GRAMMAR_1': E_GRAMMAR_1
}.items():
print(key, nullable(canonical(grammar)))
E_GRAMMAR {'', '<start>', '<S>', '<A>', '<E>'} E_GRAMMAR_1 {'', '<start>', '<A>'}
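The same computation can also be written without the helpers used above. The following is a minimal, self-contained sketch that assumes a grammar in canonical form (each alternative is a list of tokens, and an epsilon alternative is the empty list). The name nullable_nonterminals() and the two sample grammars are introduced here only for illustration.

```python
def nullable_nonterminals(cgrammar):
    """Return the nonterminals that can derive the empty string."""
    nullables = set()
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for name, alternatives in cgrammar.items():
            if name in nullables:
                continue
            # An alternative is nullable if all its tokens are nullable;
            # the empty alternative is trivially nullable.
            if any(all(tok in nullables for tok in alt) for alt in alternatives):
                nullables.add(name)
                changed = True
    return nullables

# Assumed canonical form of E_GRAMMAR.
E_CANONICAL = {
    '<start>': [['<S>']],
    '<S>': [['<A>', '<A>', '<A>', '<A>']],
    '<A>': [['a'], ['<E>']],
    '<E>': [[]],
}

# A made-up grammar where only <A> is nullable.
MIXED = {
    '<start>': [['<S>']],
    '<S>': [['<A>', '<B>']],
    '<A>': [['a'], []],
    '<B>': [['b']],
}

print(sorted(nullable_nonterminals(E_CANONICAL)))
print(sorted(nullable_nonterminals(MIXED)))  # ['<A>']
```

Unlike nullable() above, this variant does not seed the set with the empty string, so its result contains nonterminals only.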
class EarleyParser(EarleyParser):
    def __init__(self, grammar, **kwargs):
        super().__init__(grammar, **kwargs)
        self.cgrammar = canonical(grammar, letters=True)
        self.epsilon = nullable(self.cgrammar)

    def predict(self, col, sym, state):
        for alt in self.cgrammar[sym]:
            col.add(State(sym, tuple(alt), 0, col))
        if sym in self.epsilon:
            col.add(state.advance())
mystring = 'a'
parser = EarleyParser(E_GRAMMAR)
for tree in parser.parse(mystring):
display_tree(tree)
To ensure that our parser does parse all kinds of grammars, let us try two more test cases.
DIRECTLY_SELF_REFERRING = {
'<start>': ['<query>'],
'<query>': ['select <expr> from a'],
"<expr>": [ "<expr>", "a"],
}
INDIRECTLY_SELF_REFERRING = {
'<start>': ['<query>'],
'<query>': ['select <expr> from a'],
"<expr>": [ "<aexpr>", "a"],
"<aexpr>": [ "<expr>"],
}
mystring = 'select a from a'
for grammar in [DIRECTLY_SELF_REFERRING, INDIRECTLY_SELF_REFERRING]:
trees = EarleyParser(grammar).parse(mystring)
for tree in trees:
assert mystring == tree_to_string(tree)
display_tree(tree)
A number of other optimizations exist for Earley parsers. A fast industrial strength Earley parser implementation is the Marpa parser. Further, Earley parsing need not be restricted to character data. One may also parse streams (audio and video streams) \cite{qi2018generalized} using a generalized Earley parser.
While we have defined two parser variants, it would be nice to have some confirmation that our parsers work well. While it is possible to formally prove that they work, it is much more satisfying to generate random grammars and their corresponding strings, and parse these strings using the same grammar.
def prod_line_grammar(nonterminals, terminals):
g = {
'<start>': ['<symbols>'],
'<symbols>': ['<symbol><symbols>', '<symbol>'],
'<symbol>': ['<nonterminals>', '<terminals>'],
'<nonterminals>': ['<lt><alpha><gt>'],
'<lt>': ['<'],
'<gt>': ['>'],
'<alpha>': nonterminals,
'<terminals>': terminals
}
if not nonterminals:
g['<nonterminals>'] = ['']
del g['<lt>']
del g['<alpha>']
del g['<gt>']
return g
syntax_diagram(prod_line_grammar(["A", "B", "C"], ["1", "2", "3"]))
start
symbols
symbol
nonterminals
lt
gt
alpha
terminals
def make_rule(nonterminals, terminals, num_alts):
prod_grammar = prod_line_grammar(nonterminals, terminals)
gf = GrammarFuzzer(prod_grammar, min_nonterminals=3, max_nonterminals=5)
name = "<%s>" % ''.join(random.choices(string.ascii_uppercase, k=3))
return (name, [gf.fuzz() for _ in range(num_alts)])
make_rule(["A", "B", "C"], ["1", "2", "3"], 3)
('<KWR>', ['<A><A>', '<B>12', '<A>11'])
def make_grammar(num_symbols=3, num_alts=3):
    terminals = list(string.ascii_lowercase)
    grammar = {}
    name = None
    for _ in range(num_symbols):
        # strip the enclosing '<' and '>' from the names defined so far
        nonterminals = [k[1:-1] for k in grammar.keys()]
        name, expansions = \
            make_rule(nonterminals, terminals, num_alts)
        grammar[name] = expansions

    grammar[START_SYMBOL] = [name]

    # Remove unused parts
    for nonterminal in unreachable_nonterminals(grammar):
        del grammar[nonterminal]

    assert is_valid_grammar(grammar)

    return grammar
make_grammar()
{'<YNK>': ['oxz', 'gh', 'm'], '<AGT>': ['m<YNK>fy', 'f<YNK>uv', '<YNK>aj'], '<XSJ>': ['<AGT>oy', 'y<YNK>gd', '<YNK>dk'], '<start>': ['<XSJ>']}
Now we verify if our arbitrary grammars can be used by the Earley parser.
for i in range(5):
my_grammar = make_grammar()
print(my_grammar)
parser = EarleyParser(my_grammar)
mygf = GrammarFuzzer(my_grammar)
s = mygf.fuzz()
print(s)
for tree in parser.parse(s):
assert tree_to_string(tree) == s
display_tree(tree)
{'<HIT>': ['kk', '', 'c'], '<ROZ>': ['<HIT>gy', 'nyz', '<HIT>md'], '<start>': ['<ROZ>']} gy
{'<YYM>': ['t', 'dw', 'gki'], '<TZU>': ['<YYM>rx', '<YYM>uix', '<YYM>j'], '<XUG>': ['<TZU>jvj', '<YYM>s', '<YYM>d'], '<start>': ['<XUG>']} ts
{'<OAK>': ['d', 't', 'g'], '<VFU>': ['<OAK>ta', '<OAK>qy', '<OAK>m<OAK>b'], '<PIQ>': ['<VFU>ic', 'p<VFU>np', '<OAK>v'], '<start>': ['<PIQ>']} gv
{'<DYN>': ['c', 'ff', 'w'], '<DNQ>': ['<DYN>eoa', '<DYN><DYN>', 'beu<DYN><DYN>'], '<XYW>': ['<DNQ>o', '<DYN>rx', '<DYN>k'], '<start>': ['<XYW>']} wrx
{'<JPV>': ['rxx', 'knkbb', 'aq'], '<JSE>': ['c<JPV>i', 'oh<JPV><JPV><JPV>h', '<JPV><JPV>l'], '<JLL>': ['<JSE>qo', '<JSE>wx', '<JSE>hm'], '<start>': ['<JLL>']} rxxaqlqo
With this, we have completed both the implementation and the testing of a parser for arbitrary CFGs, which can now be used along with LangFuzzer
to generate better fuzzing inputs.
Numerous parsing techniques exist that can parse a given string using a given grammar, and produce corresponding derivation tree or trees. However, some of these techniques work only on specific classes of grammars. These classes of grammars are named after the specific kind of parser that can accept grammars of that category. That is, the upper bound for the capabilities of the parser defines the grammar class named after that parser.
LL and LR parsing are the two main traditions in parsing. Here, LL means left-to-right, leftmost derivation, and it represents a top-down approach. LR (left-to-right, rightmost derivation), on the other hand, represents a bottom-up approach. Another way to look at it is that LL parsers compute the derivation tree incrementally in pre-order while LR parsers compute the derivation tree in post-order \cite{pingali2015graphical}.
Different classes of grammars differ in the features that are available to
the user for writing a grammar of that class. That is, the corresponding
kind of parser will be unable to parse a grammar that makes use of more
features than allowed. For example, the A2_GRAMMAR
is an LL
grammar because it lacks left recursion, while A1_GRAMMAR
is not an
LL grammar. This is because an LL parser parses
its input from left to right, and constructs the leftmost derivation of its
input by expanding the nonterminals it encounters. If there is a left
recursion in one of these rules, an LL parser will enter an infinite loop.
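To make the problem with left recursion concrete, here is a small sketch of a naive top-down recognizer. The function parse_nt() and the two tiny grammars are made up for this illustration and are not the LL parser used elsewhere in this chapter. With the right-recursive grammar the recognizer terminates; with the left-recursive one it keeps expanding <expr> without consuming any input, which Python reports as a RecursionError.

```python
def parse_nt(grammar, symbol, text, pos):
    """Return the position after matching `symbol` at `pos`, or None."""
    if symbol not in grammar:  # terminal symbol
        return pos + len(symbol) if text.startswith(symbol, pos) else None
    for alternative in grammar[symbol]:  # try the alternatives in order
        cur = pos
        for token in alternative:
            cur = parse_nt(grammar, token, text, cur)
            if cur is None:
                break
        else:
            return cur  # every token of this alternative matched
    return None

# Hypothetical grammars in canonical form (alternatives as token lists).
LEFT_RECURSIVE = {
    '<expr>': [['<expr>', '+', '<digit>'], ['<digit>']],
    '<digit>': [['1'], ['2']],
}
RIGHT_RECURSIVE = {
    '<expr>': [['<digit>', '+', '<expr>'], ['<digit>']],
    '<digit>': [['1'], ['2']],
}

print(parse_nt(RIGHT_RECURSIVE, '<expr>', '1+2', 0))  # 3

try:
    parse_nt(LEFT_RECURSIVE, '<expr>', '1+2', 0)
    outcome = 'terminated'
except RecursionError:
    # <expr> was expanded again and again at position 0
    outcome = 'RecursionError'
print(outcome)
```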
Similarly, a grammar is LL(k) if it can be parsed by an LL parser with k lookahead tokens, and an LR(k) grammar can only be parsed by an LR parser with at least k lookahead tokens. These grammars are interesting because both LL(k) and LR(k) grammars have $O(n)$ parsers, and can be used with a relatively restricted computational budget compared to other grammars.
The languages for which one can provide an LL(k) grammar are called LL(k) languages (where k is the minimum lookahead required). Similarly, LR(k) is defined as the set of languages that have an LR(k) grammar. In terms of languages, LL(k) $\subset$ LL(k+1) and LL(k) $\subset$ LR(k), and LR(k) $=$ LR(1). All deterministic CFLs have an LR(1) grammar. However, there exist CFLs that are inherently ambiguous \cite{ogden1968helpful}, and for these, one can't provide an LR(1) grammar.
The other main parsing algorithms for CFGs are GLL \cite{scott2010gll}, GLR \cite{tomita1987efficient,tomita2012generalized}, and CYK \cite{grune2008parsing}. The ALL(*) (used by ANTLR) on the other hand is a grammar representation that uses Regular Expression like predicates (similar to advanced PEGs – see Exercise) rather than a fixed lookahead. Hence, ALL(*) can accept a larger class of grammars than CFGs.
In terms of computational limits of parsing, the main CFG parsers have a complexity of $O(n^3)$ for arbitrary grammars. However, parsing with an arbitrary CFG is reducible to boolean matrix multiplication \cite{Valiant1975} (and the reverse \cite{Lee2002}). This is at present bounded by $O(n^{2.3728639})$ \cite{LeGall2014}. Hence, the worst-case complexity for parsing an arbitrary CFG is likely to remain close to cubic.
Regarding PEGs, the actual class of languages that is expressible in PEG is currently unknown. In particular, we know that PEGs can express certain languages such as $a^n b^n c^n$. However, we do not know if there exist CFLs that are not expressible with PEGs. In Section 2.3, we provided an instance of a counter-intuitive PEG grammar. While important for our purposes (we use grammars for generation of inputs), this is not a criticism of parsing with PEGs. PEG focuses on writing grammars for recognizing a given language, and not necessarily on interpreting what language an arbitrary PEG might yield. Given a context-free language to parse, it is almost always possible to write a grammar for it in PEG. Given that 1) a PEG can parse any string in $O(n)$ time, 2) at present we know of no CFL that can't be expressed as a PEG, and 3) compared with LR grammars, a PEG is often more intuitive because it allows top-down interpretation, PEGs should be under serious consideration when writing a parser for a language.
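As a reminder of why the language of a PEG can be counter-intuitive, consider the following sketch of ordered choice. The matcher peg_match() and both grammars are made up for this illustration; it is not the PEG parser from this chapter. Read as CFGs, the two grammars describe the same language {ac, abc}. Read as PEGs, the first alternative that succeeds wins, so the order of alternatives changes which strings are accepted.

```python
def peg_match(grammar, symbol, text, pos=0):
    """Return the position after matching `symbol` at `pos`, or None."""
    if symbol not in grammar:  # terminal symbol
        return pos + len(symbol) if text.startswith(symbol, pos) else None
    for alternative in grammar[symbol]:
        cur = pos
        for token in alternative:
            cur = peg_match(grammar, token, text, cur)
            if cur is None:
                break
        else:
            return cur  # commit to the first alternative that succeeds
    return None

# Hypothetical grammars that differ only in the order of the <S> alternatives.
SHORT_FIRST = {'<start>': [['<S>', 'c']], '<S>': [['a'], ['a', 'b']]}
LONG_FIRST = {'<start>': [['<S>', 'c']], '<S>': [['a', 'b'], ['a']]}

print(peg_match(LONG_FIRST, '<start>', 'abc'))   # 3
print(peg_match(SHORT_FIRST, '<start>', 'abc'))  # None
```

With SHORT_FIRST, <S> commits to 'a' and the following 'c' then fails on 'b'; the longer alternative is never revisited.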
Solution. Here is a possible solution:
class PackratParser(Parser):
    def parse_prefix(self, text):
        txt, res = self.unify_key(self.start_symbol(), text)
        return len(txt), [res]

    def parse(self, text):
        remain, res = self.parse_prefix(text)
        if remain:
            # remain is the number of unparsed characters at the end of text
            raise SyntaxError("at " + repr(text[len(text) - remain:]))
        return res

    def unify_rule(self, rule, text):
        results = []
        for token in rule:
            text, res = self.unify_key(token, text)
            if res is None:
                return text, None
            results.append(res)
        return text, results

    def unify_key(self, key, text):
        if key not in self.cgrammar:
            if text.startswith(key):
                return text[len(key):], (key, [])
            else:
                return text, None
        for rule in self.cgrammar[key]:
            text_, res = self.unify_rule(rule, text)
            # an empty result list (epsilon rule) is still a successful match
            if res is not None:
                return (text_, (key, res))
        return text, None
mystring = "1 + (2 * 3)"
for tree in PackratParser(EXPR_GRAMMAR).parse(mystring):
assert tree_to_string(tree) == mystring
display_tree(tree)
Solution. Python allows us to append to a list in flight, while a dict, even though it is ordered, does not allow that facility.
That is, the following will work
values = [1]
for v in values:
values.append(v*2)
However, the following will result in an error
values = {1:1}
for v in values:
values[v*2] = v*2
In the fill_chart
, we make use of this facility to modify the set of states we are iterating on, on the fly.
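The difference can be checked directly with a bounded variant of the two snippets above. This is only a sketch; the names worklist and table are made up, and the bound is added so that the list loop terminates.

```python
# Appending to a list while iterating over it: the loop also visits
# the newly appended elements.
worklist = [1]
for v in worklist:
    if v < 16:  # bound the growth so that the loop ends
        worklist.append(v * 2)
print(worklist)  # [1, 2, 4, 8, 16]

# Adding keys to a dict while iterating over it is rejected by Python.
table = {1: 1}
try:
    for k in table:
        table[k * 2] = k * 2
    outcome = 'no error'
except RuntimeError:
    outcome = 'RuntimeError'
print(outcome)  # RuntimeError
```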
mystring = 'aaaaaa'
Compare that to the parsing of RR_GRAMMAR
as seen below:
Finding a deterministic reduction path is as follows:
Given a complete state, represented by <A> : seq_1 ● (s, e)
where s
is the starting column for this rule, and e
the current column, there is a deterministic reduction path above it if two constraints are satisfied.
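1. There exists a single item of the form <B> : seq_2 ● <A> (k, s) in column s.
2. That item is the only item in column s with the dot in front of <A>.
The resulting item is of the form <B> : seq_2 <A> ● (k, e), which is simply the item from (1) advanced, and is considered above <A>:.. (s, e) in the deterministic reduction path.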
The seq_1
and seq_2
are arbitrary symbol sequences.
This forms the following chain of links, with <A>:.. (s_1, e)
being the child of <B>:.. (s_2, e)
etc.
Here is one way to visualize the chain:
<C> : seq_3 <B> ● (s_3, e)
             |  constraints satisfied by <C> : seq_3 ● <B> (s_3, s_2)
            <B> : seq_2 <A> ● (s_2, e)
                         | constraints satisfied by <B> : seq_2 ● <A> (s_2, s_1)
                        <A> : seq_1 ● (s_1, e)
Essentially, what we want to do is to identify potential deterministic right recursion candidates, perform completion on them, and throw away the result. We do this until we reach the top. See Grune et al.~\cite{grune2008parsing} for further information.
Note that the completions are all in the same column (e), while the constraints of each candidate are satisfied in progressively earlier columns (as shown below):
<C> : seq_3 ● <B> (s_3, s_2) --> <C> : seq_3 <B> ● (s_3, e)
               |
<B> : seq_2 ● <A> (s_2, s_1) --> <B> : seq_2 <A> ● (s_2, e)
               |
<A> : seq_1 ● (s_1, e)
Following this chain, the topmost item is the item <C>:.. (s_3, e)
that does not have a parent. The topmost item, which needs to be saved, is called a transitive item by Leo, and it is associated with the nonterminal symbol that started the lookup. The transitive item needs to be added to each column we inspect.
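The two constraints on a deterministic reduction path can be checked with a few lines of code. The following is a simplified sketch that does not use the State and Column classes of this chapter: Item is a hypothetical stand-in for a state, and a column is just a list of items.

```python
from collections import namedtuple

# Hypothetical, simplified item: rule name, right-hand side, dot position,
# and the index of the column in which the item started.
Item = namedtuple('Item', 'name expr dot start')

def uniq_postdot(column, symbol):
    """Return the single item `<B> : seq ● symbol` in `column`, else None."""
    # all items in this column that have the dot directly before `symbol`
    candidates = [it for it in column
                  if it.dot < len(it.expr) and it.expr[it.dot] == symbol]
    if len(candidates) != 1:
        return None  # no parent, or the path is not deterministic
    item = candidates[0]
    # the dot must be in front of the last symbol of the rule
    return item if item.dot == len(item.expr) - 1 else None

deterministic = [Item('<start>', ('<A>',), 0, 0),
                 Item('<A>', ('a', '<A>'), 0, 0),
                 Item('<A>', (), 0, 0)]
ambiguous = deterministic + [Item('<B>', ('<A>',), 0, 0)]
not_last = [Item('<S>', ('<A>', 'b'), 0, 0)]

print(uniq_postdot(deterministic, '<A>'))  # the <start> item
print(uniq_postdot(ambiguous, '<A>'))      # None
print(uniq_postdot(not_last, '<A>'))       # None
```

Advancing the returned item by one position yields the item that sits directly above the completed <A> in the chain.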
Here is the skeleton for the parser LeoParser
.
Solution. Here is a possible solution:
class LeoParser(LeoParser):
    def get_top(self, state_A):
        st_B_inc = self.uniq_postdot(state_A)
        if not st_B_inc:
            return None

        t_name = st_B_inc.name
        if t_name in st_B_inc.e_col.transitives:
            return st_B_inc.e_col.transitives[t_name]

        st_B = st_B_inc.advance()

        top = self.get_top(st_B) or st_B
        return st_B_inc.e_col.add_transitive(t_name, top)
We verify the Leo parser with a few more right recursive grammars.
result = LeoParser(RR_GRAMMAR4, log=True).parse(mystring4)
for _ in result: pass
None chart[0] <A>:= (0,0) a chart[1] b chart[2] <A>:= (2,2) <A>:= a b <A> (0,2) a chart[3] b chart[4] <A>:= (4,4) <A>:= a b <A> (2,4) <A>:= a b <A> (0,4) a chart[5] b chart[6] <A>:= (6,6) <A>:= a b <A> (4,6) <A>:= a b <A> (0,6) a chart[7] b chart[8] <A>:= (8,8) <A>:= a b <A> (6,8) <A>:= a b <A> (0,8) c chart[9] <start>:= <A> c (0,9)
result = LeoParser(LR_GRAMMAR, log=True).parse(mystring)
for _ in result: pass
None chart[0] <A>:= (0,0) <start>:= <A> (0,0) a chart[1] <A>:= <A> a (0,1) <start>:= <A> (0,1) a chart[2] <A>:= <A> a (0,2) <start>:= <A> (0,2) a chart[3] <A>:= <A> a (0,3) <start>:= <A> (0,3) a chart[4] <A>:= <A> a (0,4) <start>:= <A> (0,4) a chart[5] <A>:= <A> a (0,5) <start>:= <A> (0,5) a chart[6] <A>:= <A> a (0,6) <start>:= <A> (0,6)
We define a rearrange()
method to generate a reversed table where each column contains states that start at that column.
class LeoParser(LeoParser):
    def rearrange(self, table):
        f_table = [Column(c.index, c.letter) for c in table]
        for col in table:
            for s in col.states:
                f_table[s.s_col.index].states.append(s)
        return f_table

class LeoParser(LeoParser):
    def parse(self, text):
        cursor, states = self.parse_prefix(text)
        start = next((s for s in states if s.finished()), None)
        if cursor < len(text) or not start:
            raise SyntaxError("at " + repr(text[cursor:]))

        self.r_table = self.rearrange(self.table)
        forest = self.extract_trees(self.parse_forest(self.table, start))
        for tree in forest:
            yield self.prune_tree(tree)

class LeoParser(LeoParser):
    def parse_forest(self, chart, state):
        if isinstance(state, TState):
            self.expand_tstate(state.back(), state.e_col)

        return super().parse_forest(chart, state)
One of the problems with our Earley and Leo parsers is that they can get stuck in infinite loops when parsing with grammars that contain token repetitions in alternatives. For example, consider the grammar below.
RECURSION_GRAMMAR = {
"<start>": ["<A>"],
"<A>": ["<A>", "<A>aa", "AA", "<B>"],
"<B>": ["<C>", "<C>cc" ,"CC"],
"<C>": ["<B>", "<B>bb", "BB"]
}
With this grammar, one can produce an infinite chain of derivations of <A>
(direct recursion) or an infinite chain of derivations of <B> -> <C> -> <B> ... (indirect recursion). The problem is that our implementation can get stuck trying to derive one of these infinite chains.
with ExpectTimeout(1, print_traceback=False):
mystring = 'AA'
parser = LeoParser(RECURSION_GRAMMAR)
tree, *_ = parser.parse(mystring)
assert tree_to_string(tree) == mystring
display_tree(tree)
TimeoutError (expected)
Can you implement a solution such that any tree that contains such a chain is discarded?
Recursive algorithms are quite handy in some cases but sometimes we might want to have iteration instead of recursion due to memory or speed problems.
Can you implement an iterative version of the EarleyParser
?
Hint: In general, you can use a stack to replace a recursive algorithm with an iterative one. An easy way to do this is pushing the parameters onto a stack instead of passing them to the recursive function.
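To illustrate the hint on something smaller than the parser itself, here is a sketch that collects the leaves of a derivation tree, once recursively and once with an explicit stack. Trees are (symbol, children) pairs as elsewhere in this chapter; the two function names are made up for this example.

```python
def leaves_recursive(tree):
    """Collect leaf symbols left to right, using the call stack."""
    symbol, children = tree
    if not children:
        return [symbol]
    result = []
    for child in children:
        result.extend(leaves_recursive(child))
    return result

def leaves_iterative(tree):
    """Same result, but with an explicit stack instead of recursion."""
    result = []
    stack = [tree]  # the "parameters" of the pending calls
    while stack:
        symbol, children = stack.pop()
        if not children:
            result.append(symbol)
        else:
            # push in reverse so that the leftmost child is handled first
            stack.extend(reversed(children))
    return result

tree = ('<start>', [('<expr>', [('<integer>', [('1', [])]),
                                ('+', []),
                                ('<expr>', [('<integer>', [('2', [])])])])])
print(leaves_recursive(tree))  # ['1', '+', '2']
print(leaves_iterative(tree))  # ['1', '+', '2']

# A very deep tree is no problem for the iterative version.
deep = ('x', [])
for _ in range(5000):
    deep = ('<n>', [deep])
print(leaves_iterative(deep))  # ['x']
```

The recursive version would exceed Python's default recursion limit on the deep tree, while the iterative version is limited only by available memory.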
Solution. Here is a possible solution.
Let's see if it works with some of the grammars we have seen so far.
Solution. The first set of all terminals is the set containing just themselves. So we initialize that first. Then we update the first set with rules that derive empty strings.
def firstset(grammar, nullable):
first = {i: {i} for i in terminals(grammar)}
for k in grammar:
first[k] = {EPSILON} if k in nullable else set()
return firstset_((rules(grammar), first, nullable))[1]
Finally, we rely on the fixpoint
to update the first set with the contents of the current first set until the first set stops changing.
def first_expr(expr, first, nullable):
    tokens = set()
    for token in expr:
        # accumulate the first sets of the leading tokens
        tokens |= first[token]
        if token not in nullable:
            break
    return tokens

@fixpoint
def firstset_(arg):
    (rules, first, epsilon) = arg
    for A, expression in rules:
        first[A] |= first_expr(expression, first, epsilon)
    return (rules, first, epsilon)
firstset(canonical(A1_GRAMMAR), EPSILON)
{'8': {'8'}, '4': {'4'}, '7': {'7'}, '-': {'-'}, '5': {'5'}, '2': {'2'}, '+': {'+'}, '6': {'6'}, '3': {'3'}, '1': {'1'}, '0': {'0'}, '9': {'9'}, '<start>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}, '<expr>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}, '<integer>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}, '<digit>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}}
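For comparison, here is a self-contained sketch of the same first set computation that also handles epsilon alternatives. It assumes a grammar in canonical form, with an epsilon alternative written as the empty list; the function first_sets() and the sample grammar are made up for this illustration and do not reuse firstset() from above.

```python
EPSILON = ''

def first_sets(cgrammar):
    """Compute the first set of every nonterminal by repeated passes."""
    first = {nt: set() for nt in cgrammar}

    def first_of(token):
        # a terminal starts only with itself
        return first[token] if token in cgrammar else {token}

    changed = True
    while changed:
        changed = False
        for nt, alternatives in cgrammar.items():
            for alt in alternatives:
                acc = set()
                for token in alt:
                    f = first_of(token)
                    acc |= f - {EPSILON}
                    if EPSILON not in f:
                        break  # this token cannot vanish; stop here
                else:
                    acc.add(EPSILON)  # every token (or none) can vanish
                if not acc <= first[nt]:
                    first[nt] |= acc
                    changed = True
    return first

# Hypothetical right-recursive expression grammar in canonical form.
SAMPLE = {
    '<start>': [['<expr>']],
    '<expr>': [['<integer>', '<expr_>']],
    '<expr_>': [['+', '<expr>'], []],
    '<integer>': [['<digit>', '<integer_>']],
    '<integer_>': [['<integer>'], []],
    '<digit>': [['0'], ['1']],
}

fs = first_sets(SAMPLE)
print(sorted(fs['<expr>']))   # ['0', '1']
print(sorted(fs['<expr_>']))  # ['', '+']
```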
Solution. The implementation of followset()
is similar to firstset()
. We first initialize the follow set with EOF
, get the epsilon and first sets, and use the fixpoint()
decorator to iteratively compute the follow set until nothing changes.
EOF = '\0'
def followset(grammar, start):
follow = {i: set() for i in grammar}
follow[start] = {EOF}
epsilon = nullable(grammar)
first = firstset(grammar, epsilon)
return followset_((grammar, epsilon, first, follow))[1]
Given the current follow set, one can update the follow set as follows:
@fixpoint
def followset_(arg):
    grammar, epsilon, first, follow = arg
    for A, expression in rules(grammar):
        f_B = follow[A]
        for t in reversed(expression):
            if t in grammar:
                follow[t] |= f_B
            # what can follow the token to the left of t
            f_B = f_B | first[t] if t in epsilon else (first[t] - {EPSILON})

    return (grammar, epsilon, first, follow)
followset(canonical(A1_GRAMMAR), START_SYMBOL)
{'<start>': {'\x00'}, '<expr>': {'\x00', '+', '-'}, '<integer>': {'\x00', '+', '-'}, '<digit>': {'\x00', '+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}}
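| Rule Name | + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----------|---|---|---|---|---|---|---|---|---|---|---|---|
| start     |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| expr      |   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| expr_     | 2 | 3 |   |   |   |   |   |   |   |   |   |   |
| integer   |   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 |
| integer_  | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| digit     |   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |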
Solution. We define predict()
as we explained before. Then we use the predicted rules to populate the parse table.
class LL1Parser(LL1Parser):
    def predict(self, rulepair, first, follow, epsilon):
        A, rule = rulepair
        rf = first_expr(rule, first, epsilon)
        if nullable_expr(rule, epsilon):
            # a nullable rule is also predicted by whatever may follow A
            rf |= follow[A]
        return rf

    def parse_table(self):
        self.my_rules = rules(self.cgrammar)
        epsilon = nullable(self.cgrammar)
        first = firstset(self.cgrammar, epsilon)
        # inefficient, can combine the three.
        follow = followset(self.cgrammar, self.start_symbol())

        ptable = [(i, self.predict(rule, first, follow, epsilon))
                  for i, rule in enumerate(self.my_rules)]

        parse_tbl = {k: {} for k in self.cgrammar}

        for i, pvals in ptable:
            (k, expr) = self.my_rules[i]
            parse_tbl[k].update({v: i for v in pvals})

        self.table = parse_tbl
ll1parser = LL1Parser(A2_GRAMMAR)
ll1parser.parse_table()
ll1parser.show_table()
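Rule Name  | + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<start>    |   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
<expr>     |   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
<expr_>    | 2 | 3 |   |   |   |   |   |   |   |   |   |
<integer>  |   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
<integer_> | 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
<digit>    |   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17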
Solution. Here is the complete parser:
class LL1Parser(LL1Parser):
    def parse_helper(self, stack, inplst):
        inp, *inplst = inplst
        exprs = []
        while stack:
            val, *stack = stack
            if isinstance(val, tuple):
                exprs.append(val)
            elif val not in self.cgrammar:  # terminal
                assert val == inp
                exprs.append(val)
                inp, *inplst = inplst or [None]
            else:
                if inp is not None:
                    i = self.table[val][inp]
                    _, rhs = self.my_rules[i]
                    stack = rhs + [(val, len(rhs))] + stack
        return self.linear_to_tree(exprs)

    def parse(self, inp):
        self.parse_table()
        k, _ = self.my_rules[0]
        stack = [k]
        return self.parse_helper(stack, inp)

    def linear_to_tree(self, arr):
        stack = []
        while arr:
            elt = arr.pop(0)
            if not isinstance(elt, tuple):
                stack.append((elt, []))
            else:
                # get the last n
                sym, n = elt
                elts = stack[-n:] if n > 0 else []
                stack = stack[0:len(stack) - n]
                stack.append((sym, elts))
        assert len(stack) == 1
        return stack[0]
ll1parser = LL1Parser(A2_GRAMMAR)
tree = ll1parser.parse('1+2')
display_tree(tree)