Fuzzing a Simple Program

Here is a simple program that accepts a CSV file of vehicle details and processes this information.

def process_inventory(inventory):
    res = []
    for vehicle in inventory.split('\n'):
        ret = process_vehicle(vehicle)
        res.extend(ret)
    return '\n'.join(res)


The CSV file contains details of one vehicle per line. Each row is processed in process_vehicle().

def process_vehicle(vehicle):
    year, kind, company, model, *_ = vehicle.split(',')
    if kind == 'van':
        return process_van(year, company, model)

    elif kind == 'car':
        return process_car(year, company, model)

    else:
        raise Exception('Invalid entry')


Depending on the kind of vehicle, the processing changes.

def process_van(year, company, model):
    res = ["We have a %s %s van from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2010:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res

def process_car(year, company, model):
    res = ["We have a %s %s car from %s vintage." % (company, model, year)]
    iyear = int(year)
    if iyear > 2016:
        res.append("It is a recent model!")
    else:
        res.append("It is an old but reliable model!")
    return res


Here is a sample input that process_inventory() accepts.

mystring = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar\
"""
print(process_inventory(mystring))

We have a Ford E350 van from 1997 vintage.
It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage.
It is an old but reliable model!


Let us try to fuzz this program. Since process_inventory() takes a CSV file, we can write a simple grammar for comma-separated values and use it to generate CSV rows. For convenience, we fuzz process_vehicle() directly.

import string

CSV_GRAMMAR = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \t\n')
}
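To make the generation step concrete without relying on the GrammarFuzzer class, here is a minimal sketch of random grammar expansion. The names `CSV_GRAMMAR_SKETCH`, `expand`, and `max_depth` are hypothetical and introduced only for this illustration, and the letter alphabet is restricted to letters and digits so that the naive nonterminal regex cannot be confused by literal `<` or `>` characters.

```python
import random
import re
import string

# Same shape as CSV_GRAMMAR, restated so the sketch is self-contained.
# Assumption: only letters and digits, to keep the nonterminal regex simple.
CSV_GRAMMAR_SKETCH = {
    '<start>': ['<csvline>'],
    '<csvline>': ['<items>'],
    '<items>': ['<item>,<items>', '<item>'],
    '<item>': ['<letters>'],
    '<letters>': ['<letter><letters>', '<letter>'],
    '<letter>': list(string.ascii_letters + string.digits),
}

NONTERMINAL = re.compile(r'(<[^<> ]+>)')

def expand(grammar, symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` by picking random alternatives (hypothetical helper)."""
    if symbol not in grammar:
        return symbol  # terminal text is emitted as-is
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Past the depth budget, avoid directly recursive alternatives
        # so that the expansion terminates.
        alternatives = [a for a in alternatives if symbol not in a] or alternatives
    choice = rng.choice(alternatives)
    parts = [p for p in NONTERMINAL.split(choice) if p]
    return ''.join(expand(grammar, p, rng, depth + 1, max_depth) for p in parts)

rng = random.Random(0)
lines = [expand(CSV_GRAMMAR_SKETCH, '<start>', rng) for _ in range(5)]
```

Every generated line is a comma-separated list of non-empty items, which is all the grammar promises. Nothing in it knows that the second field should be a vehicle kind or that the first should be a year.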


We need some infrastructure first for viewing the grammar.

syntax_diagram(CSV_GRAMMAR)

[Syntax diagrams for the nonterminals start, csvline, items, item, letters, and letter]


We generate 1000 values, and evaluate process_vehicle() with each.

gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 1000
valid = []
time = 0
for i in range(trials):
    with Timer() as t:
        vehicle_info = gf.fuzz()
        try:
            process_vehicle(vehicle_info)
            valid.append(vehicle_info)
        except Exception:  # any failure counts as an invalid input
            pass
        time += t.elapsed_time()
print("%d valid strings, that is GrammarFuzzer generated %f%% valid entries from %d inputs" %
      (len(valid), len(valid) * 100.0 / trials, trials))
print("Total time of %f seconds" % time)

0 valid strings, that is GrammarFuzzer generated 0.000000% valid entries from 1000 inputs
Total time of 5.665059 seconds


This is obviously not working. But why?

gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
trials = 10
valid = []
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()

'9w9J\'/,LU<"l,|,Y,Zv)Amvx,c\n'	 Invalid entry
'(n8].H7,qolS'	 not enough values to unpack (expected at least 4, got 2)
'\nQoLWQ,jSa'	 not enough values to unpack (expected at least 4, got 2)
'K1,\n,RE,fq,%,,sT+aAb'	 Invalid entry
"m,d,,8j4'),-yQ,B7"	 Invalid entry
'g4,s1\t[}{.,M,<,\nzd,.am'	 Invalid entry
',Z[,z,c,#x1,gc.F'	 Invalid entry
'pWs,rT,R'	 not enough values to unpack (expected at least 4, got 3)
'iN,br%,Q,R'	 Invalid entry
'ol,\nH<\tn,^#,=A'	 Invalid entry


None of the entries will get through unless the fuzzer produces either van or car as the second field. The underlying reason is that the grammar does not capture the complete information about the format. So here is another idea: we modify the GrammarFuzzer to know a bit about our format.

Let us try again!

gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)
gf.update_cache('<item>', [
    ('<item>', [('car', [])]),
    ('<item>', [('van', [])]),
])
trials = 10
valid = []
time = 0
for i in range(trials):
    vehicle_info = gf.fuzz()
    try:
        print(repr(vehicle_info), end="")
        process_vehicle(vehicle_info)
    except Exception as e:
        print("\t", e)
    else:
        print()

',h,van,|'	 Invalid entry
'M,w:K,car,car,van'	 Invalid entry
'J,?Y,van,van,car,J,~D+'	 Invalid entry
'S4,car,car,o'	 invalid literal for int() with base 10: 'S4'
'2*-,van'	 not enough values to unpack (expected at least 4, got 2)
'van,%,5,]'	 Invalid entry
'van,G3{y,j,h:'	 Invalid entry
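The effect of seeding the fuzzer with known tokens can be sketched independently of PooledGrammarFuzzer. The helpers `fuzz_csv_line` and `has_known_kind`, and the parameters `pool` and `pool_prob`, are hypothetical names for this illustration only; the sketch merely counts how often a generated line has at least four fields with van or car in second position, which is the dispatch condition in process_vehicle().

```python
import random
import string

def fuzz_csv_line(rng, pool=(), pool_prob=0.5, max_items=6):
    """Build one CSV line; each item is drawn from `pool` with probability `pool_prob`."""
    alphabet = string.ascii_letters + string.digits
    items = []
    for _ in range(rng.randint(1, max_items)):
        if pool and rng.random() < pool_prob:
            items.append(rng.choice(pool))
        else:
            length = rng.randint(1, 4)
            items.append(''.join(rng.choice(alphabet) for _ in range(length)))
    return ','.join(items)

def has_known_kind(line):
    """True if the line has at least four fields and the second one is 'van' or 'car'."""
    fields = line.split(',')
    return len(fields) >= 4 and fields[1] in ('van', 'car')

rng = random.Random(1)
plain = [fuzz_csv_line(rng) for _ in range(1000)]
pooled = [fuzz_csv_line(rng, pool=('car', 'van')) for _ in range(1000)]
plain_hits = sum(has_known_kind(line) for line in plain)
pooled_hits = sum(has_known_kind(line) for line in pooled)
```

Pooling makes it far more likely that a line reaches process_van() or process_car(), but, as the output above shows, such a line can still fail there because the first field must also be a valid integer year.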

Lessons Learned

• Grammars can be used to generate derivation trees for a given string.
• Parsing Expression Grammars are intuitive, and easy to implement, but require care to write.
• Earley Parsers can parse arbitrary Context Free Grammars.

Next Steps

Solution. Here is a possible solution:

class PackratParser(Parser):
    def parse_prefix(self, text):
        txt, res = self.unify_key(self.start_symbol(), text)
        return len(txt), [res]

    def parse(self, text):
        # parse_prefix() returns the number of characters left unparsed
        remain, res = self.parse_prefix(text)
        if remain:
            raise SyntaxError("at " + repr(text[len(text) - remain:]))
        return res

    def unify_rule(self, rule, text):
        results = []
        for token in rule:
            text, res = self.unify_key(token, text)
            if res is None:
                return text, None
            results.append(res)
        return text, results

    def unify_key(self, key, text):
        if key not in self.cgrammar:
            # terminal symbol: it must match the text literally
            if text.startswith(key):
                return text[len(key):], (key, [])
            else:
                return text, None
        for rule in self.cgrammar[key]:
            text_, res = self.unify_rule(rule, text)
            if res is not None:  # an empty rule yields [], which is still a match
                return (text_, (key, res))
        return text, None

mystring = "1 + (2 * 3)"
for tree in PackratParser(EXPR_GRAMMAR).parse(mystring):
    assert tree_to_string(tree) == mystring
    display_tree(tree)


Solution. Python allows us to append to a list while iterating over it, whereas a dict, even though it is ordered, does not allow that.

That is, the following will work (the bound keeps the loop from running forever):

values = [1]
for v in values:
    if v < 8:  # stop growing the list eventually
        values.append(v*2)


However, the following will result in an error (RuntimeError: dictionary changed size during iteration):

values = {1:1}
for v in values:
    values[v*2] = v*2


In fill_chart(), we make use of this facility to modify the set of states we are iterating over on the fly.
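The same idiom is useful for any worklist computation. As a small sketch that is unrelated to the actual fill_chart() code, the hypothetical function `reachable_nonterminals` below walks a list that grows while it is being iterated; the grammar representation (each rule as a list of tokens) is an assumption of this example.

```python
def reachable_nonterminals(grammar, start):
    """Collect nonterminals reachable from `start`, growing the worklist while iterating."""
    worklist = [start]
    seen = {start}
    for symbol in worklist:              # the list may grow during this loop
        for alternative in grammar.get(symbol, []):
            for token in alternative:
                if token in grammar and token not in seen:
                    seen.add(token)
                    worklist.append(token)   # picked up by the same loop later
    return worklist

toy = {
    '<start>': [['<A>', 'c']],
    '<A>': [['a', '<B>'], []],
    '<B>': [['b', '<A>']],
    '<unused>': [['x']],
}
order = reachable_nonterminals(toy, '<start>')
```

The `seen` set is what guarantees termination here: every symbol is appended at most once, so the loop ends once no new symbols are discovered.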

Compare that to the parsing of RR_GRAMMAR as seen below:

Finding a deterministic reduction path is as follows:

Given a complete state, represented by <A> : seq_1 ● (s, e) where s is the starting column for this rule, and e the current column, there is a deterministic reduction path above it if two constraints are satisfied.

1. There exists a single item of the form <B> : seq_2 ● <A> (k, s) in column s.
2. That item is the only item in column s with the dot in front of <A>.

The resulting item is of the form <B> : seq_2 <A> ● (k, e), which is simply item from (1) advanced, and is considered above <A>:.. (s, e) in the deterministic reduction path. The seq_1 and seq_2 are arbitrary symbol sequences.
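The two constraints can be checked with a few lines of code. The following is a minimal sketch on a simplified item representation; `Item`, `at_dot`, and the column layout are assumptions made for this illustration and are not the chapter's State and Column classes.

```python
from collections import namedtuple

# Hypothetical, simplified Earley item: rule `name : expr` with the dot at
# position `dot`, started in column `s_col`.
Item = namedtuple('Item', 'name expr dot s_col')

def at_dot(item):
    """Symbol right after the dot, or None if the item is complete."""
    return item.expr[item.dot] if item.dot < len(item.expr) else None

def uniq_postdot(column_items, symbol):
    """Return the item <B> : seq_2 ● symbol if both constraints hold, else None."""
    waiting = [it for it in column_items if at_dot(it) == symbol]
    if len(waiting) != 1:
        return None                      # constraint 2 fails: not the only one
    item = waiting[0]
    if item.dot != len(item.expr) - 1:
        return None                      # constraint 1 fails: symbol is not last
    return item

column_2 = [
    Item('<A>', ('a', 'b', '<A>'), 2, 0),   # <A> : a b ● <A>
    Item('<A>', ('a', 'b', '<A>'), 0, 2),   # <A> : ● a b <A>
    Item('<A>', (), 0, 2),                  # <A> : ●
]
parent = uniq_postdot(column_2, '<A>')
ambiguous = uniq_postdot(
    column_2 + [Item('<start>', ('<A>', 'c'), 0, 2)], '<A>')
```

In the first call exactly one item waits for `<A>` with the dot before its last symbol, so it is returned. Adding a second item that also waits for `<A>` breaks the uniqueness requirement, and the lookup yields None.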

This forms the following chain of links, with <A>:.. (s_1, e) being the child of <B>:.. (s_2, e) etc.

Here is one way to visualize the chain:

<C> : seq_3 <B> ● (s_3, e)
|  constraints satisfied by <C> : seq_3 ● <B> (s_3, s_2)
<B> : seq_2 <A> ● (s_2, e)
| constraints satisfied by <B> : seq_2 ● <A> (s_2, s_1)
<A> : seq_1 ● (s_1, e)

Essentially, what we want to do is to identify potential deterministic right recursion candidates, perform completion on them, and throw away the result. We do this until we reach the top. See Grune et al. (2008) for further information.

Note that the completions are all in the same column (e), while the constraints of each candidate are satisfied in progressively earlier columns (as shown below):

<C> : seq_3 ● <B> (s_3, s_2)  -->              <C> : seq_3 <B> ● (s_3, e)
|
<B> : seq_2 ● <A> (s_2, s_1) --> <B> : seq_2 <A> ● (s_2, e)
|
<A> : seq_1 ●                        (s_1, e)

Following this chain, the topmost item is the item <C>:.. (s_3, e) that does not have a parent. The topmost item, which needs to be saved, is called a transitive item by Leo, and it is associated with the non-terminal symbol that started the lookup. The transitive item needs to be added to each column we inspect.

Here is the skeleton for the parser LeoParser.

Solution. Here is a possible solution:

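class LeoParser(LeoParser):
    def get_top(self, state_A):
        st_B_inc = self.uniq_postdot(state_A)
        if not st_B_inc:
            return None

        t_name = st_B_inc.name
        if t_name in st_B_inc.e_col.transitives:
            return st_B_inc.e_col.transitives[t_name]

        # Advance the item from (1) past <A>; advance() is assumed to be
        # the state's usual method for moving the dot one symbol ahead.
        st_B = st_B_inc.advance()

        top = self.get_top(st_B) or st_B
        # Remember the topmost item as the transitive item of this column
        # (assumed caching step; the listing above was cut off here).
        st_B_inc.e_col.transitives[t_name] = top
        return top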


We verify the Leo parser with a few more right recursive grammars.

result = LeoParser(RR_GRAMMAR4, log=True).parse(mystring4)
for _ in result: pass

None chart[0]
<A>:= |(0,0)

a chart[1]

b chart[2]
<A>:= |(2,2)
<A>:= a b <A> |(0,2)

a chart[3]

b chart[4]
<A>:= |(4,4)
<A>:= a b <A> |(2,4)
<A>:= a b <A> |(0,4)

a chart[5]

b chart[6]
<A>:= |(6,6)
<A>:= a b <A> |(4,6)
<A>:= a b <A> |(0,6)

a chart[7]

b chart[8]
<A>:= |(8,8)
<A>:= a b <A> |(6,8)
<A>:= a b <A> |(0,8)

c chart[9]
<start>:= <A> c |(0,9)


result = LeoParser(LR_GRAMMAR, log=True).parse(mystring)
for _ in result: pass

None chart[0]
<A>:= |(0,0)
<start>:= <A> |(0,0)

a chart[1]
<A>:= <A> a |(0,1)
<start>:= <A> |(0,1)

a chart[2]
<A>:= <A> a |(0,2)
<start>:= <A> |(0,2)

a chart[3]
<A>:= <A> a |(0,3)
<start>:= <A> |(0,3)

a chart[4]
<A>:= <A> a |(0,4)
<start>:= <A> |(0,4)

a chart[5]
<A>:= <A> a |(0,5)
<start>:= <A> |(0,5)

a chart[6]
<A>:= <A> a |(0,6)
<start>:= <A> |(0,6)



We define a rearrange() method to generate a reversed table where each column contains states that start at that column.

class LeoParser(LeoParser):
    def rearrange(self, table):
        f_table = [Column(c.index, c.letter) for c in table]
        for col in table:
            for s in col.states:
                f_table[s.s_col.index].states.append(s)
        return f_table
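To see what this regrouping does, here is a small sketch on plain lists. `Span` and `rearrange_by_start` are hypothetical stand-ins for the chapter's State and rearrange(); the only assumption is that each state records the index of the column where it started.

```python
from collections import namedtuple

# Hypothetical stand-in for a chart state spanning columns s_col..e_col.
Span = namedtuple('Span', 'name s_col e_col')

def rearrange_by_start(table):
    """table[i] lists states ending in column i; regroup them by starting column."""
    by_start = [[] for _ in table]
    for column in table:
        for state in column:
            by_start[state.s_col].append(state)
    return by_start

table = [
    [Span('<A>', 0, 0)],
    [],
    [Span('<A>', 2, 2), Span('<A>', 0, 2)],
]
by_start = rearrange_by_start(table)
```

After the call, `by_start[0]` holds both states that begin in column 0, regardless of where they end, which is the lookup direction that the reversed table is meant to support.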

class LeoParser(LeoParser):
    def parse(self, text):
        cursor, states = self.parse_prefix(text)
        start = next((s for s in states if s.finished()), None)
        if cursor < len(text) or not start:
            raise SyntaxError("at " + repr(text[cursor:]))

        self.r_table = self.rearrange(self.table)
        forest = self.extract_trees(self.parse_forest(self.table, start))
        for tree in forest:
            yield self.prune_tree(tree)

class LeoParser(LeoParser):
    def parse_forest(self, chart, state):
        # Transitive states are expanded back into regular states first
        if isinstance(state, TState):
            self.expand_tstate(state.back(), state.e_col)

        return super().parse_forest(chart, state)


Exercise 6: Filtered Earley Parser

One of the problems with our Earley and Leo parsers is that they can get stuck in infinite loops when parsing with grammars that contain token repetitions in alternatives. For example, consider the grammar below.

RECURSION_GRAMMAR = {
    "<start>": ["<A>"],
    "<A>": ["<A>", "<A>aa", "AA", "<B>"],
    "<B>": ["<C>", "<C>cc", "CC"],
    "<C>": ["<B>", "<B>bb", "BB"]
}


With this grammar, one can produce an infinite chain of derivations of <A> (direct recursion) or an infinite chain of derivations <B> -> <C> -> <B> ... (indirect recursion). The problem is that our implementation can get stuck trying to derive one of these infinite chains.

mystring = 'AA'
parser = LeoParser(RECURSION_GRAMMAR)
tree, *_ = parser.parse(mystring)
assert tree_to_string(tree) == mystring
display_tree(tree)

TimeoutError (expected)


Can you implement a solution such that any tree that contains such a chain is discarded?
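As a starting point, the offending chains can be recognized on finished derivation trees. The sketch below assumes the (symbol, children) tree representation used throughout the chapter; the function name `has_unit_cycle` is hypothetical. It only detects a chain in an already built tree, so a full solution would still have to apply such a check while trees are being extracted, before the parser gets stuck.

```python
def has_unit_cycle(tree, chain=()):
    """True if some nonterminal reappears along a path of single-child nodes."""
    symbol, children = tree
    if symbol in chain:
        return True
    # The chain continues only while a node has exactly one child;
    # any node with siblings consumes input and resets the chain.
    next_chain = chain + (symbol,) if len(children) == 1 else ()
    return any(has_unit_cycle(child, next_chain) for child in children)

direct = ('<A>', [('<A>', [('AA', [])])])                  # <A> -> <A> -> AA
indirect = ('<B>', [('<C>', [('<B>', [('CC', [])])])])     # <B> -> <C> -> <B>
fine = ('<start>', [('<A>', [('AA', [])])])
progress = ('<A>', [('<A>', [('AA', [])]), ('aa', [])])    # <A> -> <A>aa

forest = [direct, indirect, fine, progress]
kept = [t for t in forest if not has_unit_cycle(t)]
```

Note that `<A> -> <A>aa` is kept: the recursion there is accompanied by the terminal `aa`, so each step consumes input and the derivation cannot go on forever.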

Exercise 7: Iterative Earley Parser

Recursive algorithms are quite handy in some cases, but sometimes we might want iteration instead of recursion because of memory or speed problems.

Can you implement an iterative version of the EarleyParser?

Hint: In general, you can use a stack to replace a recursive algorithm with an iterative one. An easy way to do this is pushing the parameters onto a stack instead of passing them to the recursive function.
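The hint can be illustrated on a much smaller problem than the parser itself. The sketch below converts a recursive tree-to-string traversal into an iterative one; both function names are hypothetical, and the example assumes that every leaf of the (symbol, children) tree is a terminal.

```python
def to_string_recursive(tree):
    """Concatenate the leaves of a (symbol, children) tree recursively."""
    symbol, children = tree
    if not children:
        return symbol
    return ''.join(to_string_recursive(c) for c in children)

def to_string_iterative(tree):
    """Same traversal, with the pending arguments kept on an explicit stack."""
    pieces = []
    stack = [tree]
    while stack:
        symbol, children = stack.pop()
        if not children:
            pieces.append(symbol)
        else:
            # Push children reversed so the leftmost child is popped first.
            stack.extend(reversed(children))
    return ''.join(pieces)

tree = ('<expr>', [('<digit>', [('1', [])]),
                   ('+', []),
                   ('<digit>', [('2', [])])])
```

Each entry on the stack plays the role of one pending recursive call. The iterative version is not limited by Python's recursion depth, which matters for very deep derivation trees.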

Solution. Here is a possible solution.

Let's see if it works with some of the grammars we have seen so far.

Solution. The first set of each terminal is the set containing just that terminal, so we initialize those first. Then we initialize the first set of each nullable nonterminal, that is, one that can derive the empty string, with EPSILON.

def firstset(grammar, nullable):
    first = {i: {i} for i in terminals(grammar)}
    for k in grammar:
        first[k] = {EPSILON} if k in nullable else set()
    return firstset_((rules(grammar), first, nullable))[1]


Finally, we rely on the fixpoint to update the first set with the contents of the current first set until the first set stops changing.

def first_expr(expr, first, nullable):
    tokens = set()
    for token in expr:
        tokens |= first[token]
        if token not in nullable:
            break
    return tokens

@fixpoint
def firstset_(arg):
    (rules, first, epsilon) = arg
    for A, expression in rules:
        first[A] |= first_expr(expression, first, epsilon)
    return (rules, first, epsilon)

{'8': {'8'},
'4': {'4'},
'7': {'7'},
'-': {'-'},
'5': {'5'},
'2': {'2'},
'+': {'+'},
'6': {'6'},
'3': {'3'},
'1': {'1'},
'0': {'0'},
'9': {'9'},
'<start>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
'<expr>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
'<integer>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'},
'<digit>': {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'}}
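For readers who want to experiment without the chapter's helpers (terminals(), rules(), nullable(), fixpoint()), here is a self-contained sketch of the same fixpoint idea. It is a variant, not the solution above: it tracks nullability through an epsilon marker inside the first sets instead of a separate nullable set, and `EPSILON_SKETCH`, `first_sets`, and the list-of-tokens rule format are assumptions of this example.

```python
EPSILON_SKETCH = ''   # assumption: the empty string stands for epsilon

def first_sets(grammar):
    """Compute FIRST for every symbol of {nonterminal: [[token, ...], ...]}."""
    first = {nt: set() for nt in grammar}
    for alternatives in grammar.values():
        for rule in alternatives:
            for token in rule:
                if token not in grammar:
                    first[token] = {token}    # a terminal starts with itself
    changed = True
    while changed:                            # iterate until nothing changes
        changed = False
        for nt, alternatives in grammar.items():
            for rule in alternatives:
                tokens = set()
                for token in rule:
                    tokens |= first[token] - {EPSILON_SKETCH}
                    if EPSILON_SKETCH not in first[token]:
                        break
                else:
                    # no break: every token was nullable (or the rule is empty)
                    tokens.add(EPSILON_SKETCH)
                if not tokens <= first[nt]:
                    first[nt] |= tokens
                    changed = True
    return first

toy = {
    '<start>': [['<expr>']],
    '<expr>': [['<digit>', '<rest>']],
    '<rest>': [['+', '<expr>'], []],
    '<digit>': [['0'], ['1']],
}
fs = first_sets(toy)
```

In this toy grammar only `<rest>` is nullable, so its first set contains the epsilon marker next to `+`, while `<expr>` and `<start>` inherit the digits from `<digit>`.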

Solution. The implementation of followset() is similar to firstset(). We first initialize the follow set with EOF, get the epsilon and first sets, and use the fixpoint() decorator to iteratively compute the follow set until nothing changes.

EOF = '\0'

def followset(grammar, start):
    follow = {i: set() for i in grammar}
    follow[start] = {EOF}

    epsilon = nullable(grammar)
    first = firstset(grammar, epsilon)
    return followset_((grammar, epsilon, first, follow))[-1]


Given the current follow set, one can update the follow set as follows:

@fixpoint
def followset_(arg):
    grammar, epsilon, first, follow = arg
    for A, expression in rules(grammar):
        f_B = follow[A]
        for t in reversed(expression):
            if t in grammar:
                follow[t] |= f_B
            # a nullable token lets the symbols after it show through
            f_B = (f_B | first[t]) if t in epsilon else (first[t] - {EPSILON})

    return (grammar, epsilon, first, follow)
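The right-to-left sweep can also be tried out in isolation. The following sketch takes the first sets and the nullable set as given inputs rather than computing them; `EOF_SKETCH`, `follow_sets`, and the toy grammar are assumptions of this example, and the first sets passed in are expected to be free of epsilon.

```python
EOF_SKETCH = '\0'   # same end-of-input marker convention as EOF above

def follow_sets(grammar, start, first, nullable):
    """Compute FOLLOW for each nonterminal of {nonterminal: [[token, ...], ...]}."""
    follow = {nt: set() for nt in grammar}
    follow[start] = {EOF_SKETCH}
    changed = True
    while changed:                               # iterate until nothing changes
        changed = False
        for nt, alternatives in grammar.items():
            for rule in alternatives:
                trailer = set(follow[nt])        # what may come after the rule
                for token in reversed(rule):
                    if token in grammar:
                        if not trailer <= follow[token]:
                            follow[token] |= trailer
                            changed = True
                        if token in nullable:
                            trailer = trailer | first[token]
                        else:
                            trailer = set(first[token])
                    else:
                        trailer = {token}        # a terminal follows literally
    return follow

toy = {
    '<start>': [['<expr>']],
    '<expr>': [['<digit>', '<rest>']],
    '<rest>': [['+', '<expr>'], []],
    '<digit>': [['0'], ['1']],
}
toy_first = {'<start>': {'0', '1'}, '<expr>': {'0', '1'},
             '<rest>': {'+'}, '<digit>': {'0', '1'}}
fo = follow_sets(toy, '<start>', toy_first, nullable={'<rest>'})
```

Because `<rest>` is nullable, whatever follows `<expr>` can also follow `<digit>`, in addition to the `+` that starts `<rest>`.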

{'<start>': {'\x00'},
'<expr>': {'\x00', '+', '-'},
'<integer>': {'\x00', '+', '-'},
'<digit>': {'\x00',
'+',
'-',
'0',
'1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9'}}
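
Rule Name	| + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
start    	|   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
expr     	|   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
expr_    	| 2 | 3 |   |   |   |   |   |   |   |   |   |
integer  	|   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
integer_ 	| 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
digit    	|   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17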

Solution. We define predict() as we explained before. Then we use the predicted rules to populate the parse table.

class LL1Parser(LL1Parser):
    def predict(self, rulepair, first, follow, epsilon):
        A, rule = rulepair
        rf = first_expr(rule, first, epsilon)
        if nullable_expr(rule, epsilon):
            rf |= follow[A]
        return rf

    def parse_table(self):
        self.my_rules = rules(self.cgrammar)
        epsilon = nullable(self.cgrammar)
        first = firstset(self.cgrammar, epsilon)
        # inefficient, can combine the three.
        follow = followset(self.cgrammar, self.start_symbol())

        ptable = [(i, self.predict(rule, first, follow, epsilon))
                  for i, rule in enumerate(self.my_rules)]

        parse_tbl = {k: {} for k in self.cgrammar}

        for i, pvals in ptable:
            (k, expr) = self.my_rules[i]
            parse_tbl[k].update({v: i for v in pvals})

        self.table = parse_tbl

ll1parser.parse_table()
ll1parser.show_table()

Rule Name	| + | - | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<start>  	|   |   | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
<expr>  	|   |   | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
<expr_>  	| 2 | 3 |   |   |   |   |   |   |   |   |   |
<integer>  	|   |   | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
<integer_>  	| 7 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6 | 6
<digit>  	|   |   | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17


Solution. Here is the complete parser:

class LL1Parser(LL1Parser):
    def parse_helper(self, stack, inplst):
        inp, *inplst = inplst
        exprs = []
        while stack:
            val, *stack = stack
            if isinstance(val, tuple):
                # marker (symbol, number of children) left by an expansion
                exprs.append(val)
            elif val not in self.cgrammar:  # terminal
                assert val == inp
                exprs.append(val)
                inp, *inplst = inplst or [None]
            else:
                if inp is not None:
                    i = self.table[val][inp]
                    _, rhs = self.my_rules[i]
                    stack = rhs + [(val, len(rhs))] + stack
        return self.linear_to_tree(exprs)

    def parse(self, inp):
        self.parse_table()
        k, _ = self.my_rules[0]
        stack = [k]
        return self.parse_helper(stack, inp)

    def linear_to_tree(self, arr):
        stack = []
        while arr:
            elt = arr.pop(0)
            if not isinstance(elt, tuple):
                stack.append((elt, []))
            else:
                # get the last n
                sym, n = elt
                elts = stack[-n:] if n > 0 else []
                stack = stack[0:len(stack) - n]
                stack.append((sym, elts))
        assert len(stack) == 1
        return stack[0]

tree = ll1parser.parse('1+2')
display_tree(tree)
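The core of table-driven LL(1) parsing fits in a short, self-contained recognizer. The sketch below is not the LL1Parser above: it only answers whether a string is accepted, builds no tree, and uses a hand-written table for a toy grammar. The names `ll1_accepts`, `toy_rules`, and `toy_table` are hypothetical, and the NUL character as end marker is an assumption that mirrors the EOF convention used earlier.

```python
def ll1_accepts(table, rules, start, text):
    """Table-driven LL(1) recognition; True if `text` is derivable from `start`."""
    END = '\0'                            # assumption: NUL marks end of input
    tokens = list(text) + [END]
    stack = [start]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]
        if top in table:                  # nonterminal: consult the table
            if look not in table[top]:
                return False
            _, rhs = rules[table[top][look]]
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
        elif top == look:                 # terminal: must match the input
            pos += 1
        else:
            return False
    return tokens[pos] == END

toy_rules = [
    ('<expr>', ['<digit>', '<rest>']),    # rule 0
    ('<rest>', ['+', '<expr>']),          # rule 1
    ('<rest>', []),                       # rule 2
    ('<digit>', ['0']),                   # rule 3
    ('<digit>', ['1']),                   # rule 4
]
toy_table = {
    '<expr>': {'0': 0, '1': 0},
    '<rest>': {'+': 1, '\0': 2},
    '<digit>': {'0': 3, '1': 4},
}
accepted = ll1_accepts(toy_table, toy_rules, '<expr>', '1+0')
```

The entry for `'\0'` in the `<rest>` row comes from the follow set: the empty rule is chosen when the lookahead is something that may follow `<rest>`, here only the end of input.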