So far, the grammars we have seen have mostly been specified manually – that is, you (or the person who knows the input format) had to design and write a grammar in the first place. While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve considerable effort. In this chapter, we therefore introduce techniques that automatically mine grammars from programs – by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to generate valid inputs for the very same programs.
Prerequisites
Consider the process_inventory() function from the chapter on parsers. It takes inputs of the following form.
INVENTORY = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture\
"""
print(process_inventory(INVENTORY))
We have a Ford E350 van from 1997 vintage. It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage. It is an old but reliable model!
We have a Chevy Venture car from 1999 vintage. It is an old but reliable model!
We saw in the chapter on parsers that coarse grammars do not work well for fuzzing when the input format includes details that are expressed only in code. That is, even though we have a formal specification of CSV files (RFC 4180), the inventory system adds further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the given input specification.
One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a style such that specific methods are responsible for handling specific parts of the input, one can recover the parse tree by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees.
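As a preview of the machinery developed in this chapter, observing a parser at work can be as simple as installing a trace function via sys.settrace() and recording local variables. The snippet below is a minimal standalone sketch; parse_pair() is a made-up ad hoc parser used only for illustration, not a function from this chapter.

```python
import sys

def parse_pair(s):
    # hypothetical ad hoc parser, for illustration only
    key, value = s.split('=')
    return key, value

seen = []  # snapshots of local variables observed while parsing

def observer(frame, event, arg):
    if frame.f_code.co_name == 'parse_pair':
        seen.append(dict(frame.f_locals))
    return observer

sys.settrace(observer)
parse_pair('a=1')
sys.settrace(None)
# seen now contains snapshots such as {'s': 'a=1'} and,
# after the split, one including {'key': 'a', 'value': '1'}
```

The recorded snapshots reveal which input fragments end up in which variables – exactly the information a grammar miner needs.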
We start with the assumption (1) that the program is written in such a fashion that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers.
The idea is as follows:
Say we want to obtain the input grammar for the function process_vehicle(). We first collect sample inputs for this function.
VEHICLES = INVENTORY.split('\n')
The set of methods responsible for processing inventory are the following.
INVENTORY_METHODS = {
'process_inventory',
'process_vehicle',
'process_van',
'process_car'}
We have seen from the chapter on configuration fuzzing that one can hook into the Python runtime to observe the arguments to a function and any local variables created. We have also seen that one can obtain the context of execution by inspecting the frame
argument. Here is a simple tracer that can return the local variables and other contextual information in a traced function. We reuse the Coverage
tracing class.
class Tracer(Coverage):
    def traceit(self, frame, event, arg):
        method_name = inspect.getframeinfo(frame).function
        if method_name not in INVENTORY_METHODS:
            return
        file_name = inspect.getframeinfo(frame).filename
        param_names = inspect.getargvalues(frame).args
        lineno = inspect.getframeinfo(frame).lineno
        local_vars = inspect.getargvalues(frame).locals
        print(event, file_name, lineno, method_name, param_names, local_vars)
        return self.traceit
We run the code under trace context.
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])
call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 29 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 30 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 31 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}
call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 40 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 41 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 42 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.']}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 43 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 46 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}
return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}
return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}
The main thing that we want out of tracing is a list of assignments of input fragments to different variables. We can use the tracing facility settrace() to get that, as shown above.

However, the settrace() function hooks into the Python debugging facility. While it is in operation, no debugger can hook into the program. That is, if there is a problem with our grammar miner, we will not be able to attach a debugger to it to understand what is happening. This is not ideal. Hence, we limit the tracer to the simplest implementation possible, and implement the core of grammar mining in later stages.
The traceit() function relies on information from the frame variable, which exposes Python internals. We define a Context class that encapsulates the information that we need from the frame.
The Context class provides easy access to information such as the current module and parameter names.
class Context:
    def __init__(self, frame, track_caller=True):
        self.method = inspect.getframeinfo(frame).function
        self.parameter_names = inspect.getargvalues(frame).args
        self.file_name = inspect.getframeinfo(frame).filename
        self.line_no = inspect.getframeinfo(frame).lineno

    def _t(self):
        return (self.file_name, self.line_no, self.method,
                ','.join(self.parameter_names))

    def __repr__(self):
        return "%s:%d:%s(%s)" % self._t()
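The inspect calls used in Context can be tried on their own; demo() below is a throwaway function introduced here only for illustration.

```python
import inspect

def demo():
    # inspect the currently executing frame, just as Context
    # does with the frames handed to the tracer
    frame = inspect.currentframe()
    function = inspect.getframeinfo(frame).function
    args = inspect.getargvalues(frame).args
    return function, args

demo()  # returns ('demo', []) since demo() takes no parameters
```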
Here we add to Context a few convenience methods that operate on the frame.
class Context(Context):
    def extract_vars(self, frame):
        return inspect.getargvalues(frame).locals

    def parameters(self, all_vars):
        return {k: v for k, v in all_vars.items() if k in self.parameter_names}

    def qualified(self, all_vars):
        return {"%s:%s" % (self.method, k): v for k, v in all_vars.items()}
We hook printing the context into our traceit() to see it in action. First, we define a log_event() function for displaying events.
def log_event(event, var):
    print({'call': '->', 'return': '<-'}.get(event, ' '), var)
And use the log_event()
in the traceit()
function.
class Tracer(Tracer):
    def traceit(self, frame, event, arg):
        log_event(event, Context(frame))
        return self.traceit
Running process_vehicle()
under trace prints the contexts encountered.
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:102:__exit__(self,exc_type,exc_value,tb)
   /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:105:__exit__(self,exc_type,exc_value,tb)
The trace produced by executing any function can get overwhelmingly large. Hence, we need to restrict our attention to specific modules. Further, we also restrict our attention exclusively to str variables, since these are more likely to contain input fragments. (We will show how to deal with complex objects later, in the exercises.)

The Context class we developed earlier is used to decide which modules to monitor and which variables to trace.

We store the current input string so that it can be used to determine whether any particular string fragment came from the current input string. Any optional arguments are processed separately.
class Tracer(Tracer):
    def __init__(self, my_input, **kwargs):
        self.options(kwargs)
        self.my_input, self.trace = my_input, []
We use an optional argument files to indicate the specific source files we are interested in, and methods to indicate which methods are of interest. Further, we use log to specify whether verbose logging should be enabled during the trace, using the log_event() method defined earlier.

The options processing is as below.
class Tracer(Tracer):
    def options(self, kwargs):
        self.files = kwargs.get('files', [])
        self.methods = kwargs.get('methods', [])
        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None
The files and methods are checked to determine whether a particular event should be traced or not.
class Tracer(Tracer):
    def tracing_context(self, cxt, event, arg):
        fres = not self.files or any(
            cxt.file_name.endswith(f) for f in self.files)
        mres = not self.methods or any(cxt.method == m for m in self.methods)
        return fres and mres
Similar to the context of events, we also want to restrict our attention to specific variables. For now, we want to focus only on strings. (See the Exercises at the end of the chapter on how to extend it to other kinds of objects).
class Tracer(Tracer):
    def tracing_var(self, k, v):
        return isinstance(v, str)
We modify the traceit()
to call an on_event()
function with the context information only on the specific events we are interested in.
class Tracer(Tracer):
    def on_event(self, event, arg, cxt, my_vars):
        self.trace.append((event, arg, cxt, my_vars))

    def create_context(self, frame):
        return Context(frame)

    def traceit(self, frame, event, arg):
        cxt = self.create_context(frame)
        if not self.tracing_context(cxt, event, arg):
            return self.traceit
        self.log(event, cxt)

        my_vars = {
            k: v
            for k, v in cxt.extract_vars(frame).items()
            if self.tracing_var(k, v)
        }
        self.on_event(event, arg, cxt, my_vars)
        return self.traceit
The Tracer class can now focus on specific kinds of events in specific files. Further, it provides a first-level filter for variables that we find interesting. For example, we want to focus specifically on variables from process_* methods that contain input fragments. Here is how our updated Tracer can be used.
with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:
    process_vehicle(VEHICLES[0])
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
The execution produced the following trace.
for t in tracer.trace:
    print(t[0], t[2].method, dict(t[3]))
call process_vehicle {'vehicle': '1997,van,Ford,E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}
call process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
return process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
return process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}
Since we are already saving the input in the Tracer, it is redundant to specify it again as an argument.
with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:
    process_vehicle(tracer.my_input)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
We define a DefineTracker class that processes the trace from the Tracer. The idea is to store the different variable definitions that are input fragments.

The tracker identifies string fragments that are part of the input string, and stores them in a dictionary my_assignments. It saves the trace and the corresponding input for processing. Finally, it calls process() to process the trace it was given. We will start with a simple tracker that relies on certain assumptions, and later see how these assumptions can be relaxed.
class DefineTracker:
    def __init__(self, my_input, trace, **kwargs):
        self.options(kwargs)
        self.my_input = my_input
        self.trace = trace
        self.my_assignments = {}
        self.process()
One of the problems with substring search is that short strings tend to be included in longer strings even if they did not come from the original string. That is, if the input fragment is v, it could equally have come from van or Chevy. We rely on being able to predict the exact place in the input where a given fragment occurred. Hence, we define a constant FRAGMENT_LEN and ignore strings shorter than that length. We also incorporate a logging facility as before.
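To see the ambiguity concretely, here is a standalone check on one of the sample inputs: the single character 'c' occurs both in 'car' and inside 'Mercury', so a one-character fragment cannot be located unambiguously.

```python
s = "2000,car,Mercury,Cougar"
# all positions where the one-character fragment 'c' occurs
positions = [i for i in range(len(s)) if s.startswith("c", i)]
positions  # [5, 12] -- once in 'car', once inside 'Mercury'
```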
FRAGMENT_LEN = 3
class DefineTracker(DefineTracker):
    def options(self, kwargs):
        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None
        self.fragment_len = kwargs.get('fragment_len', FRAGMENT_LEN)
Our tracer simply records the variable values as they occur. We next need to check whether the variables contain values from the input string. Common ways to do this are to rely on symbolic execution or at least dynamic tainting, which are powerful but also complex. However, one can obtain a reasonable approximation by simply relying on substring search. That is, we consider any value produced that is a substring of the original input string to have come from the original input.
We define an is_input_fragment()
method that relies on string inclusion to detect if the string came from the input.
class DefineTracker(DefineTracker):
    def is_input_fragment(self, var, value):
        return len(value) >= self.fragment_len and value in self.my_input
We can use is_input_fragment()
to select only a subset of variables defined, as implemented below in fragments()
.
class DefineTracker(DefineTracker):
    def fragments(self, variables):
        return {k: v for k, v in variables.items()
                if self.is_input_fragment(k, v)}
The tracker processes each event, and at each event, it updates the dictionary my_assignments with the current local variables that contain strings that are part of the input. Note that there is a choice here with respect to what happens during reassignment. We can either discard all reassignments (keeping only the first assignment), or keep only the last assignment. Here, we choose the latter. If you want the former behavior, check whether the variable already exists in my_assignments before storing a fragment.
class DefineTracker(DefineTracker):
    def track_event(self, event, arg, cxt, my_vars):
        self.log(event, (cxt.method, my_vars))
        self.my_assignments.update(self.fragments(my_vars))

    def process(self):
        for event, arg, cxt, my_vars in self.trace:
            self.track_event(event, arg, cxt, my_vars)
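If you prefer the keep-first behavior, the merge step can be written as below. Note that update_keep_first() is a hypothetical helper sketched for illustration, not part of DefineTracker.

```python
def update_keep_first(assignments, fragments):
    # keep the first-seen value of each variable, ignoring reassignments
    for k, v in fragments.items():
        if k not in assignments:
            assignments[k] = v
    return assignments

d = {'year': '1997'}
update_keep_first(d, {'year': '2000', 'kind': 'van'})
# d is now {'year': '1997', 'kind': 'van'}: 'year' keeps its first value
```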
Using the tracker, we can obtain the input fragments. For example, say we are only interested in strings that are at least 5
characters long.
tracker = DefineTracker(tracer.my_input, tracer.trace, fragment_len=5)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))
vehicle = '1997,van,Ford,E350'
Or strings that are at least 3 characters long (the default).
tracker = DefineTracker(tracer.my_input, tracer.trace)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))
vehicle = '1997,van,Ford,E350'
year = '1997'
kind = 'van'
company = 'Ford'
model = 'E350'
class DefineTracker(DefineTracker):
    def assignments(self):
        return self.my_assignments.items()
The input fragments from the DefineTracker tell only half the story. The fragments may be created at different stages of parsing. Hence, we need to assemble the fragments into a derivation tree of the input. The basic idea is as follows:
Our input from the previous step was:
"1997,van,Ford,E350"
We start a derivation tree, and associate it with the start symbol in the grammar.
derivation_tree: DerivationTree = (START_SYMBOL, [("1997,van,Ford,E350", [])])
display_tree(derivation_tree)
The next input was:
vehicle = "1997,van,Ford,E350"
Since vehicle covers the <start> node's value completely, we replace the value with the <vehicle> node.
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [("1997,van,Ford,E350", [])],
                                     [])])
display_tree(derivation_tree)
The next input was:
year = '1997'
Traversing the derivation tree from <start>, we see that it replaces a portion of the <vehicle> node's value. Hence, we split the <vehicle> node's value into two children, where one corresponds to the value "1997" and the other to ",van,Ford,E350", and replace the first one with the node <year>.
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",van,Ford,E350", [])], [])])
display_tree(derivation_tree)
We perform similar operations for
company = 'Ford'
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",van,", []),
                                                   ('<company>', [('Ford', [])]),
                                                   (",E350", [])], [])])
display_tree(derivation_tree)
Similarly for
kind = 'van'
and
model = 'E350'
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",", []),
                                                   ("<kind>", [('van', [])]),
                                                   (",", []),
                                                   ('<company>', [('Ford', [])]),
                                                   (",", []),
                                                   ("<model>", [('E350', [])])],
                                     [])])
display_tree(derivation_tree)
We now develop the complete algorithm using the steps described above. The TreeMiner is initialized with the input string and the variable assignments, and it converts the assignments to the corresponding derivation tree.
class TreeMiner:
    def __init__(self, my_input, my_assignments, **kwargs):
        self.options(kwargs)
        self.my_input = my_input
        self.my_assignments = my_assignments
        self.tree = self.get_derivation_tree()

    def options(self, kwargs):
        self.log = log_call if kwargs.get('log') else lambda _i, _v: None

    def get_derivation_tree(self):
        return (START_SYMBOL, [])
The log_call()
is as follows.
def log_call(indent, var):
    print('\t' * indent, var)
The basic idea is as follows:

- For each pair (var, val) in my_assignments:
  - We search for occurrences of val in the derivation tree recursively.
  - If an occurrence was found as a value V1 of a node P1, we partition the value of the node P1 into three parts, with the central part matching the value val, and the first and last parts being the corresponding prefix and suffix in V1.
  - We then replace the node P1 with three children, where prefix and suffix mentioned earlier are string values, and the matching value val is replaced by a node var with a single value val.

First, we define a wrapper to generate a nonterminal from a variable name.
def to_nonterminal(var):
    return "<" + var.lower() + ">"
The string_part_of_value() method checks whether the given part occurs within the given value.
class TreeMiner(TreeMiner):
    def string_part_of_value(self, part, value):
        return (part in value)
The partition_by_part() method splits the value by the given part if it matches, and returns a list containing the first part, the part that was replaced, and the last part. This is a format that can be used as part of the list of children.
class TreeMiner(TreeMiner):
    def partition(self, part, value):
        return value.partition(part)
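Note that partition() simply delegates to Python's built-in str.partition(), which returns the (prefix, match, suffix) triple directly; if the part is not found, the match and suffix come back empty.

```python
"1997,van,Ford,E350".partition("van")   # ('1997,', 'van', ',Ford,E350')
"1997,van,Ford,E350".partition("bus")   # ('1997,van,Ford,E350', '', '')
```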
class TreeMiner(TreeMiner):
    def partition_by_part(self, pair, value):
        k, part = pair
        prefix_k_suffix = [
            (k, [[part, []]]) if i == 1 else (e, [])
            for i, e in enumerate(self.partition(part, value))
            if e]
        return prefix_k_suffix
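To illustrate what partition_by_part() produces, here is a standalone copy of the same list comprehension applied to a sample value; empty prefixes or suffixes are dropped by the `if e` filter.

```python
def partition_by_part(pair, value):
    # standalone copy of the method above, for illustration
    k, part = pair
    return [(k, [[part, []]]) if i == 1 else (e, [])
            for i, e in enumerate(value.partition(part))
            if e]

partition_by_part(('<kind>', 'van'), '1997,van,Ford,E350')
# [('1997,', []), ('<kind>', [['van', []]]), (',Ford,E350', [])]

partition_by_part(('<year>', '1997'), '1997,')
# [('<year>', [['1997', []]]), (',', [])] -- the empty prefix is dropped
```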
The insert_into_tree() method accepts a tree my_tree and a (k, v) pair. It recursively checks whether the given pair can be applied; if it can, it applies the pair and returns True.
class TreeMiner(TreeMiner):
    def insert_into_tree(self, my_tree, pair):
        var, values = my_tree
        k, v = pair
        self.log(1, "- Node: %s\t\t? (%s:%s)" % (var, k, repr(v)))
        applied = False
        for i, value_ in enumerate(values):
            value, arr = value_
            self.log(2, "-> [%d] %s" % (i, repr(value)))
            if is_nonterminal(value):
                applied = self.insert_into_tree(value_, pair)
                if applied:
                    break
            elif self.string_part_of_value(v, value):
                prefix_k_suffix = self.partition_by_part(pair, value)
                del values[i]
                for j, rep in enumerate(prefix_k_suffix):
                    values.insert(j + i, rep)
                applied = True
                self.log(2, " > %s" % (repr([i[0] for i in prefix_k_suffix])))
                break
            else:
                continue
        return applied
Here is how insert_into_tree()
is used.
tree: DerivationTree = (START_SYMBOL, [("1997,van,Ford,E350", [])])
m = TreeMiner('', {}, log=True)
First, we have our input string as the only node.
display_tree(tree)
Inserting the <vehicle>
node.
v = m.insert_into_tree(tree, ('<vehicle>', "1997,van,Ford,E350"))
- Node: <start>  ? (<vehicle>:'1997,van,Ford,E350')
  -> [0] '1997,van,Ford,E350'
   > ['<vehicle>']
display_tree(tree)
Inserting <model>
node.
v = m.insert_into_tree(tree, ('<model>', 'E350'))
- Node: <start>  ? (<model>:'E350')
  -> [0] '<vehicle>'
- Node: <vehicle>  ? (<model>:'E350')
  -> [0] '1997,van,Ford,E350'
   > ['1997,van,Ford,', '<model>']
display_tree(tree)
Inserting <company>
.
v = m.insert_into_tree(tree, ('<company>', 'Ford'))
- Node: <start>  ? (<company>:'Ford')
  -> [0] '<vehicle>'
- Node: <vehicle>  ? (<company>:'Ford')
  -> [0] '1997,van,Ford,'
   > ['1997,van,', '<company>', ',']
display_tree(tree)
Inserting <kind>
.
v = m.insert_into_tree(tree, ('<kind>', 'van'))
- Node: <start>  ? (<kind>:'van')
  -> [0] '<vehicle>'
- Node: <vehicle>  ? (<kind>:'van')
  -> [0] '1997,van,'
   > ['1997,', '<kind>', ',']
display_tree(tree)
Inserting <year>
.
v = m.insert_into_tree(tree, ('<year>', '1997'))
- Node: <start>  ? (<year>:'1997')
  -> [0] '<vehicle>'
- Node: <vehicle>  ? (<year>:'1997')
  -> [0] '1997,'
   > ['<year>', ',']
display_tree(tree)
To make life simple, we define a helper method nt_var() that converts a variable name to its corresponding nonterminal symbol.
class TreeMiner(TreeMiner):
    def nt_var(self, var):
        return var if is_nonterminal(var) else to_nonterminal(var)
Now, we need to apply each new definition to the derivation tree.
class TreeMiner(TreeMiner):
    def apply_new_definition(self, tree, var, value):
        nt_var = self.nt_var(var)
        return self.insert_into_tree(tree, (nt_var, value))
This algorithm is implemented as get_derivation_tree()
.
class TreeMiner(TreeMiner):
    def get_derivation_tree(self):
        tree = (START_SYMBOL, [(self.my_input, [])])
        for var, value in self.my_assignments:
            self.log(0, "%s=%s" % (var, repr(value)))
            self.apply_new_definition(tree, var, value)
        return tree
The TreeMiner
is used as follows:
with Tracer(VEHICLES[0]) as tracer:
    process_vehicle(tracer.my_input)
assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
dt = TreeMiner(tracer.my_input, assignments, log=True)
dt.tree
vehicle='1997,van,Ford,E350'
 - Node: <start>  ? (<vehicle>:'1997,van,Ford,E350')
   -> [0] '1997,van,Ford,E350'
    > ['<vehicle>']
year='1997'
 - Node: <start>  ? (<year>:'1997')
   -> [0] '<vehicle>'
 - Node: <vehicle>  ? (<year>:'1997')
   -> [0] '1997,van,Ford,E350'
    > ['<year>', ',van,Ford,E350']
kind='van'
 - Node: <start>  ? (<kind>:'van')
   -> [0] '<vehicle>'
 - Node: <vehicle>  ? (<kind>:'van')
   -> [0] '<year>'
 - Node: <year>  ? (<kind>:'van')
   -> [0] '1997'
   -> [1] ',van,Ford,E350'
    > [',', '<kind>', ',Ford,E350']
company='Ford'
 - Node: <start>  ? (<company>:'Ford')
   -> [0] '<vehicle>'
 - Node: <vehicle>  ? (<company>:'Ford')
   -> [0] '<year>'
 - Node: <year>  ? (<company>:'Ford')
   -> [0] '1997'
   -> [1] ','
   -> [2] '<kind>'
 - Node: <kind>  ? (<company>:'Ford')
   -> [0] 'van'
   -> [3] ',Ford,E350'
    > [',', '<company>', ',E350']
model='E350'
 - Node: <start>  ? (<model>:'E350')
   -> [0] '<vehicle>'
 - Node: <vehicle>  ? (<model>:'E350')
   -> [0] '<year>'
 - Node: <year>  ? (<model>:'E350')
   -> [0] '1997'
   -> [1] ','
   -> [2] '<kind>'
 - Node: <kind>  ? (<model>:'E350')
   -> [0] 'van'
   -> [3] ','
   -> [4] '<company>'
 - Node: <company>  ? (<model>:'E350')
   -> [0] 'Ford'
   -> [5] ',E350'
    > [',', '<model>']
('<start>', [('<vehicle>', [('<year>', [['1997', []]]), (',', []), ('<kind>', [['van', []]]), (',', []), ('<company>', [['Ford', []]]), (',', []), ('<model>', [['E350', []]])])])
The obtained derivation tree is as below.
display_tree(TreeMiner(tracer.my_input, assignments).tree)
Combining all the pieces:
trees = []
for vehicle in VEHICLES:
    print(vehicle)
    with Tracer(vehicle) as tracer:
        process_vehicle(tracer.my_input)
    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
    trees.append((tracer.my_input, assignments))
    for var, val in assignments:
        print(var + " = " + repr(val))
    print()
1997,van,Ford,E350
vehicle = '1997,van,Ford,E350'
year = '1997'
kind = 'van'
company = 'Ford'
model = 'E350'

2000,car,Mercury,Cougar
vehicle = '2000,car,Mercury,Cougar'
year = '2000'
kind = 'car'
company = 'Mercury'
model = 'Cougar'

1999,car,Chevy,Venture
vehicle = '1999,car,Chevy,Venture'
year = '1999'
kind = 'car'
company = 'Chevy'
model = 'Venture'
The corresponding derivation trees are below.
csv_dt = []
for inputstr, assignments in trees:
    print(inputstr)
    dt = TreeMiner(inputstr, assignments)
    csv_dt.append(dt)
display_tree(dt.tree)
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture
We define a class GrammarMiner that can combine multiple derivation trees to produce a grammar. The initial grammar is empty.
class GrammarMiner:
    def __init__(self):
        self.grammar = {}
The tree_to_grammar()
method converts our derivation tree to a grammar by picking one node at a time, and adding it to the grammar. The node name becomes the key, and any list of children it has becomes another alternative for that key.
class GrammarMiner(GrammarMiner):
    def tree_to_grammar(self, tree):
        node, children = tree
        one_alt = [ck for ck, gc in children]
        hsh = {node: [one_alt] if one_alt else []}
        for child in children:
            if not is_nonterminal(child[0]):
                continue
            chsh = self.tree_to_grammar(child)
            for k in chsh:
                if k not in hsh:
                    hsh[k] = chsh[k]
                else:
                    hsh[k].extend(chsh[k])
        return hsh
gm = GrammarMiner()
gm.tree_to_grammar(csv_dt[0].tree)
{'<start>': [['<vehicle>']], '<vehicle>': [['<year>', ',', '<kind>', ',', '<company>', ',', '<model>']], '<year>': [['1997']], '<kind>': [['van']], '<company>': [['Ford']], '<model>': [['E350']]}
The grammar being generated here is canonical. We define a function readable() that takes a canonical grammar and returns it in a readable form.
def readable(grammar):
    def readable_rule(rule):
        return ''.join(rule)

    return {k: list(set(readable_rule(a) for a in grammar[k]))
            for k in grammar}
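For instance, applied to a small canonical grammar (restated inline so the snippet is self-contained), readable() joins each token list of an alternative into a single string:

```python
def readable_demo(grammar):
    # same logic as readable() above, for illustration
    def readable_rule(rule):
        return ''.join(rule)

    return {k: list(set(readable_rule(a) for a in grammar[k]))
            for k in grammar}

readable_demo({'<start>': [['<vehicle>']],
               '<vehicle>': [['<year>', ',', '<model>']]})
# {'<start>': ['<vehicle>'], '<vehicle>': ['<year>,<model>']}
```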
syntax_diagram(readable(gm.tree_to_grammar(csv_dt[0].tree)))
(Syntax diagrams for <start>, <vehicle>, <year>, <kind>, <company>, and <model>.)
The add_tree() method combines the nonterminals of the current grammar with those of the tree to be added, and updates the definitions of each nonterminal.
class GrammarMiner(GrammarMiner):
    def add_tree(self, t):
        t_grammar = self.tree_to_grammar(t.tree)
        self.grammar = {
            key: self.grammar.get(key, []) + t_grammar.get(key, [])
            for key in itertools.chain(self.grammar.keys(), t_grammar.keys())
        }
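The merging scheme can be seen in isolation below. Note that alternatives are simply concatenated: duplicates (such as the repeated <start> alternative) survive the merge and are only collapsed later by the set() in readable().

```python
import itertools

g1 = {'<start>': [['<vehicle>']], '<kind>': [['van']]}
g2 = {'<start>': [['<vehicle>']], '<kind>': [['car']]}

merged = {key: g1.get(key, []) + g2.get(key, [])
          for key in itertools.chain(g1.keys(), g2.keys())}
# merged['<kind>'] is [['van'], ['car']];
# merged['<start>'] is [['<vehicle>'], ['<vehicle>']]
```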
The add_tree()
is used as follows:
inventory_grammar_miner = GrammarMiner()
for dt in csv_dt:
    inventory_grammar_miner.add_tree(dt)
syntax_diagram(readable(inventory_grammar_miner.grammar))
(Syntax diagrams for <start>, <vehicle>, <year>, <kind>, <company>, and <model>.)
Given execution traces from various inputs, one can define update_grammar()
to obtain the complete grammar from the traces.
class GrammarMiner(GrammarMiner):
    def update_grammar(self, inputstr, trace):
        at = self.create_tracker(inputstr, trace)
        dt = self.create_tree_miner(inputstr, at.assignments())
        self.add_tree(dt)
        return self.grammar

    def create_tracker(self, *args):
        return DefineTracker(*args)

    def create_tree_miner(self, *args):
        return TreeMiner(*args)
The complete grammar recovery is implemented in recover_grammar()
.
def recover_grammar(fn: Callable, inputs: Iterable[str],
                    **kwargs: Any) -> Grammar:
    miner = GrammarMiner()
    for inputstr in inputs:
        with Tracer(inputstr, **kwargs) as tracer:
            fn(tracer.my_input)
        miner.update_grammar(tracer.my_input, tracer.trace)
    return readable(miner.grammar)
Note that the grammar could have been retrieved directly from the tracker, without the intermediate derivation tree stage. However, going through the derivation tree allows one to inspect the inputs being fragmented and verify that it happens correctly.
inventory_grammar = recover_grammar(process_vehicle, VEHICLES)
inventory_grammar
{'<start>': ['<vehicle>'], '<vehicle>': ['<year>,<kind>,<company>,<model>'], '<year>': ['1997', '2000', '1999'], '<kind>': ['car', 'van'], '<company>': ['Ford', 'Chevy', 'Mercury'], '<model>': ['Cougar', 'E350', 'Venture']}
Our algorithm is robust enough to recover grammars from real-world programs. For example, the urlparse
function in the Python urllib
module accepts the following sample URLs.
URLS = [
'http://user:pass@www.google.com:80/?q=path#ref',
'https://www.cispa.saarland:80/',
'http://www.fuzzingbook.org/#News',
]
The urlparse
function caches its intermediate parsing results for faster access. Hence, we define a new method url_parse()
that disables the cache using clear_cache()
before each call. We use the sample URLs to recover the grammar as follows.
def url_parse(url):
clear_cache()
urlparse(url)
trees = []
for url in URLS:
print(url)
with Tracer(url) as tracer:
url_parse(tracer.my_input)
assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
trees.append((tracer.my_input, assignments))
for var, val in assignments:
print(var + " = " + repr(val))
print()
url_dt = []
for inputstr, assignments in trees:
print(inputstr)
dt = TreeMiner(inputstr, assignments)
url_dt.append(dt)
display_tree(dt.tree)
http://user:pass@www.google.com:80/?q=path#ref url = 'http://user:pass@www.google.com:80/?q=path#ref' scheme = 'http' netloc = 'user:pass@www.google.com:80' fragment = 'ref' query = 'q=path' https://www.cispa.saarland:80/ url = 'https://www.cispa.saarland:80/' scheme = 'https' netloc = 'www.cispa.saarland:80' http://www.fuzzingbook.org/#News url = 'http://www.fuzzingbook.org/#News' scheme = 'http' netloc = 'www.fuzzingbook.org' fragment = 'News' http://user:pass@www.google.com:80/?q=path#ref https://www.cispa.saarland:80/ http://www.fuzzingbook.org/#News
Let us use url_parse()
to recover the grammar:
url_grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])
syntax_diagram(url_grammar)
(syntax diagram showing rules for start, url, scheme, netloc, query, and fragment)
The recovered grammar describes the URL format reasonably well.
We can now use our recovered grammar for fuzzing as follows.
First, the inventory grammar.
f = GrammarFuzzer(inventory_grammar)
for _ in range(10):
print(f.fuzz())
1999,car,Ford,Cougar 2000,car,Chevy,E350 1999,van,Ford,Venture 1997,car,Mercury,Venture 2000,car,Ford,Cougar 2000,car,Ford,E350 1999,car,Chevy,Cougar 1999,car,Chevy,Cougar 1999,car,Ford,E350 1997,car,Chevy,Cougar
Next, the URL grammar.
f = GrammarFuzzer(url_grammar)
for _ in range(10):
print(f.fuzz())
https://www.cispa.saarland:80/?q=path#ref https://user:pass@www.google.com:80/ http://www.fuzzingbook.org/?q=path#News https://www.fuzzingbook.org/#ref https://www.cispa.saarland:80/#News http://www.fuzzingbook.org/ https://www.fuzzingbook.org/?q=path#News https://user:pass@www.google.com:80/#News https://www.cispa.saarland:80/ https://user:pass@www.google.com:80/#News
What this means is that we can now take a program and a few samples, extract its grammar, and then use this very grammar for fuzzing. Now that's quite an opportunity!
One of the problems with our simple grammar miner is the assumption that the values assigned to variables are stable. Unfortunately, that may not hold true in all cases. For example, here is a URL with a slightly different format.
URLS_X = URLS + ['ftp://freebsd.org/releases/5.8']
The grammar generated from this set of samples is not as nice as the one we got earlier.
url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])
syntax_diagram(url_grammar)
(syntax diagram showing rules for start, url, scheme, netloc, query, and fragment)
Clearly, something has gone wrong.
To investigate why the url
definition has gone wrong, let us inspect the trace for the URL.
clear_cache()
with Tracer(URLS_X[0]) as tracer:
urlparse(tracer.my_input)
for i, t in enumerate(tracer.trace):
if t[0] in {'call', 'line'} and 'parse.py' in str(t[2]) and t[3]:
print(i, t[2]._t()[1], t[3:])
0 372 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 1 392 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 5 124 ({'arg': ''},) 6 121 ({'arg': ''},) 7 126 ({'arg': ''},) 8 127 ({'arg': ''},) 10 393 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 11 437 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 12 458 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 16 124 ({'arg': ''},) 17 121 ({'arg': ''},) 18 126 ({'arg': ''},) 19 127 ({'arg': ''},) 21 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 22 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},) 23 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},) 24 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},) 25 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},) 26 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},) 27 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},) 28 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 29 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 30 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 31 464 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 32 465 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 33 466 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 34 467 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 35 469 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 36 471 ({'url': 
'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},) 37 472 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},) 38 473 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},) 39 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},) 40 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},) 41 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},) 42 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},) 43 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},) 44 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},) 45 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},) 46 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},) 47 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},) 48 478 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},) 49 480 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},) 50 481 ({'url': 
'//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},) 51 411 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},) 52 412 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},) 53 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},) 54 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},) 55 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},) 56 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},) 57 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},) 58 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},) 59 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},) 60 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},) 61 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},) 62 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},) 63 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},) 64 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},) 65 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},) 66 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},) 68 482 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},) 69 483 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},) 70 482 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},) 71 485 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},) 72 486 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},) 73 487 ({'url': 
'/?q=path', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},) 74 488 ({'url': '/?q=path', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},) 75 489 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},) 76 419 ({'netloc': 'user:pass@www.google.com:80'},) 77 420 ({'netloc': 'user:pass@www.google.com:80'},) 78 421 ({'netloc': 'user:pass@www.google.com:80'},) 80 490 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},) 84 491 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},) 85 492 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},) 90 394 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},) 91 395 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},) 92 398 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},) 93 399 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},) 97 400 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)
Notice how the value of url
changes as the parsing progresses? This violates our assumption that the value assigned to a variable is stable. We next look at how this limitation can be removed.
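The effect can be reproduced without urllib. In the following hypothetical mini-parser (not part of urllib), the variable url is reassigned to the unconsumed remainder of the input, so a miner that assumes one stable value per variable would see two conflicting "definitions" for <url>:

```python
def naive_scheme_split(url):
    # split off the scheme, mimicking how urlparse consumes its input
    i = url.find(':')
    scheme = url[:i]
    url = url[i + 1:]  # url is reassigned: it no longer holds the full input
    return scheme, url

print(naive_scheme_split('ftp://freebsd.org/releases/5.8'))
# → ('ftp', '//freebsd.org/releases/5.8')
```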
One way to uniquely identify different variables is to annotate them with line numbers, both when they are defined and when their value changes. Consider the code fragment below:
def C(cp_1):
c_2 = cp_1 + '@2'
c_3 = c_2 + '@3'
return c_3
def B(bp_7):
b_8 = bp_7 + '@8'
return C(b_8)
def A(ap_12):
a_13 = ap_12 + '@13'
a_14 = B(a_13) + '@14'
a_14 = a_14 + '@15'
a_13 = a_14 + '@16'
a_14 = B(a_13) + '@17'
a_14 = B(a_13) + '@18'
Notice how each variable is named after the line where it is defined, and how each assigned value is annotated with the line number at which the assignment takes place.
Let us run this under the trace.
with Tracer('____') as tracer:
A(tracer.my_input)
for t in tracer.trace:
print(t[0], "%d:%s" % (t[2].line_no, t[2].method), t[3])
call 1:A {'ap_12': '____'} line 2:A {'ap_12': '____'} line 3:A {'ap_12': '____', 'a_13': '____@13'} call 1:B {'bp_7': '____@13'} line 2:B {'bp_7': '____@13'} line 3:B {'bp_7': '____@13', 'b_8': '____@13@8'} call 1:C {'cp_1': '____@13@8'} line 2:C {'cp_1': '____@13@8'} line 3:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2'} line 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'} return 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'} return 3:B {'bp_7': '____@13', 'b_8': '____@13@8'} line 4:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14'} line 5:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14@15'} line 6:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15'} call 1:B {'bp_7': '____@13@8@2@3@14@15@16'} line 2:B {'bp_7': '____@13@8@2@3@14@15@16'} line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'} call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'} line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'} line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'} line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'} return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'} return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'} line 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@17'} call 1:B {'bp_7': '____@13@8@2@3@14@15@16'} line 2:B {'bp_7': '____@13@8@2@3@14@15@16'} line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'} call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'} line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'} line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'} line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': 
'____@13@8@2@3@14@15@16@8@2@3'} return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'} return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'} return 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@18'} call 102:__exit__ {} line 105:__exit__ {}
Each variable was referenced first as follows:
cp_1 -- call 1:C
c_2 -- line 3:C (but the previous event was line 2:C)
c_3 -- line 4:C (but the previous event was line 3:C)
bp_7 -- call 7:B
b_8 -- line 9:B (but the previous event was line 8:B)
ap_12 -- call 12:A
a_13 -- line 14:A (but the previous event was line 13:A)
a_14 -- line 15:A (the previous event was return 9:B; however, the previous event in A() was line 14:A)
a_14 at 15 -- line 16:A (the previous event was line 15:A)
a_13 at 16 -- line 17:A (the previous event was line 16:A)
a_14 at 17 -- return 17:A (the previous event in A() was line 17:A)
a_14 at 18 -- return 18:A (the previous event in A() was line 18:A)
So, our observations are: if the event is a call, the current location is the right one for any new variables being defined. On the other hand, if a variable is being referenced for the first time (or is being reassigned a new value), then the right location to consider is the previous location in the same method invocation. Next, let us see how we can incorporate this information into variable naming.
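These observations can be condensed into a small helper function. This is only an illustrative sketch, assuming locations is the list of line numbers visited so far in the current method invocation:

```python
def definition_location(event, locations):
    # on a 'call', new variables (the parameters) are defined at the
    # current location; on a 'line' or 'return', the defining statement
    # is the previously visited location in the same invocation
    if event == 'call':
        return locations[-1]
    elif event in ('line', 'return'):
        return locations[-2]
    raise ValueError("unexpected event: %s" % event)

print(definition_location('call', [12]))      # → 12
print(definition_location('line', [13, 14]))  # → 13
```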
Next, we need a way to track the individual method calls as they are being made. For this we define the class CallStack
. Each method invocation gets a separate identifier, and when the method call is over, the identifier is reset.
class CallStack:
def __init__(self, **kwargs):
self.options(kwargs)
self.method_id = (START_SYMBOL, 0)
self.method_register = 0
self.mstack = [self.method_id]
def enter(self, method):
self.method_register += 1
self.method_id = (method, self.method_register)
self.log('call', "%s%s" % (self.indent(), str(self)))
self.mstack.append(self.method_id)
def leave(self):
self.mstack.pop()
self.log('return', "%s%s" % (self.indent(), str(self)))
self.method_id = self.mstack[-1]
A few extra functions to make life simpler.
class CallStack(CallStack):
def options(self, kwargs):
self.log = log_event if kwargs.get('log') else lambda _evt, _var: None
def indent(self):
return len(self.mstack) * "\t"
def at(self, n):
return self.mstack[n]
def __len__(self):
return len(self.mstack) - 1
def __str__(self):
return "%s:%d" % self.method_id
def __repr__(self):
return repr(self.method_id)
We also define a convenience method to display a given stack.
def display_stack(istack):
def stack_to_tree(stack):
current, *rest = stack
if not rest:
return (repr(current), [])
return (repr(current), [stack_to_tree(rest)])
display_tree(stack_to_tree(istack.mstack), graph_attr=lr_graph)
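The inner stack_to_tree() function simply nests each frame under its caller. Run standalone on a hypothetical stack, it produces:

```python
def stack_to_tree(stack):
    # turn [caller, ..., callee] into a chain of one-child tree nodes
    current, *rest = stack
    if not rest:
        return (repr(current), [])
    return (repr(current), [stack_to_tree(rest)])

print(stack_to_tree([('<start>', 0), ('hello', 1), ('world', 2)]))
# → ("('<start>', 0)", [("('hello', 1)", [("('world', 2)", [])])])
```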
Here is how we can use the CallStack
.
cs = CallStack()
display_stack(cs)
cs
('<start>', 0)
cs.enter('hello')
display_stack(cs)
cs
('hello', 1)
cs.enter('world')
display_stack(cs)
cs
('world', 2)
cs.leave()
display_stack(cs)
cs
('hello', 1)
cs.enter('world')
display_stack(cs)
cs
('world', 3)
cs.leave()
display_stack(cs)
cs
('hello', 1)
In order to account for variable reassignments, we need a more intelligent data structure than a plain dictionary for storing variables. We first define a simple interface Vars
. It acts as a container for variables, and is later instantiated as my_assignments
.
The Vars
stores references to variables as they occur during parsing in its internal dictionary defs
. We initialize the dictionary with the original string.
class Vars:
def __init__(self, original):
self.defs = {}
self.my_input = original
The dictionary needs two methods: update()
that takes a set of key-value pairs to update itself, and _set_kv()
that updates a particular key-value pair.
class Vars(Vars):
def _set_kv(self, k, v):
self.defs[k] = v
def __setitem__(self, k, v):
self._set_kv(k, v)
def update(self, v):
for k, v in v.items():
self._set_kv(k, v)
Vars
thus acts as a proxy for its internal dictionary. For example, here is how one can use it.
v = Vars('')
v.defs
{}
v['x'] = 'X'
v.defs
{'x': 'X'}
v.update({'x': 'x', 'y': 'y'})
v.defs
{'x': 'x', 'y': 'y'}
We now extend the simple Vars
to account for variable reassignments. For this, we define AssignmentVars
.
The idea for detecting reassignments and renaming variables is as follows: we keep track of previous reassignments to each variable using accessed_seq_var
, which stores the latest rename count of each variable as its value. The set new_vars
contains all new variables that were added during the current event.
class AssignmentVars(Vars):
def __init__(self, original):
super().__init__(original)
self.accessed_seq_var = {}
self.var_def_lines = {}
self.current_event = None
self.new_vars = set()
self.method_init()
The method_init()
method takes care of keeping track of method invocations using records saved in the call_stack
. event_locations
is for keeping track of the locations accessed within this method. This is used for line number tracking of variable definitions.
class AssignmentVars(AssignmentVars):
def method_init(self):
self.call_stack = CallStack()
self.event_locations = {self.call_stack.method_id: []}
The update()
method is now modified to track the changed line numbers, if any, using var_location_register()
. After use, we reinitialize new_vars
for the next event.
class AssignmentVars(AssignmentVars):
def update(self, v):
for k, v in v.items():
self._set_kv(k, v)
self.var_location_register(self.new_vars)
self.new_vars = set()
The variable name now incorporates an index of how many reassignments it has gone through, effectively making each reassignment a unique variable.
class AssignmentVars(AssignmentVars):
def var_name(self, var):
return (var, self.accessed_seq_var[var])
While storing a variable, we first need to check whether it was previously known. If it was not, we initialize its rename count. This is accomplished by var_access()
.
class AssignmentVars(AssignmentVars):
def var_access(self, var):
if var not in self.accessed_seq_var:
self.accessed_seq_var[var] = 0
return self.var_name(var)
During a variable reassignment, we update the accessed_seq_var
to reflect the new count.
class AssignmentVars(AssignmentVars):
def var_assign(self, var):
self.accessed_seq_var[var] += 1
self.new_vars.add(self.var_name(var))
return self.var_name(var)
These methods can be used as follows:
sav = AssignmentVars('')
sav.defs
{}
sav.var_access('v1')
('v1', 0)
sav.var_assign('v1')
('v1', 1)
Assigning to it again increments the counter.
sav.var_assign('v1')
('v1', 2)
The core of the logic is in _set_kv()
. When a variable is being assigned, we get the sequenced variable name s_var
. If the sequenced variable name was previously unknown in defs
, then we have no further concerns: we simply add the sequenced variable to defs
.
If the variable is previously known, then it is an indication of a possible reassignment. In this case, we look at the value the variable is holding and check whether it has changed. If it has not, then it is not a reassignment, and nothing needs to be done.
If the value has changed, it is a reassignment. We first increment the variable usage sequence using var_assign()
, retrieve the new name, and update defs
under the new name.
class AssignmentVars(AssignmentVars):
def _set_kv(self, var, val):
s_var = self.var_access(var)
if s_var in self.defs and self.defs[s_var] == val:
return
self.defs[self.var_assign(var)] = val
Here is how it can be used. Assigning a variable the first time initializes its counter.
sav = AssignmentVars('')
sav['x'] = 'X'
sav.defs
{('x', 1): 'X'}
If the variable is assigned again with the same value, it is probably not a reassignment.
sav['x'] = 'X'
sav.defs
{('x', 1): 'X'}
However, if the value changed, it is a reassignment.
sav['x'] = 'Y'
sav.defs
{('x', 1): 'X', ('x', 2): 'Y'}
There is a subtlety here. It is possible for a child method to be called from the middle of a parent method, with both using the same variable name for different values. In this case, when the child returns, the parent again has the old variable with the old value in context. With our implementation, we consider this a reassignment. However, this is OK, because adding a spurious reassignment is harmless, while missing one is not. Further, we will discuss later how this can be avoided.
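A hypothetical illustration of this subtlety: both parent() and child() below use the name x for different values, and when child() returns, a tracer again sees the parent's old x, which our implementation would record as one more (harmless) reassignment:

```python
def child(x):
    x = x[1:]        # child reuses the name x with a different value
    return x

def parent(x):
    y = child(x)     # after this call, a tracer sees the parent's old x again
    return x + y

print(parent('ab'))
# → abb
```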
We also define the bookkeeping methods register_event()
, method_enter()
, and method_exit()
, which are responsible for keeping track of the method stack. The basic idea is that each call to method_enter()
represents a new method invocation. Hence, it merits a new method id, which is generated from method_register
and saved in method_id
. Since this is a new method, the method stack is extended by one element with this id. In the case of method_exit()
, we pop the method stack and reset the current method_id
to what was below the current one.
class AssignmentVars(AssignmentVars):
def method_enter(self, cxt, my_vars):
self.current_event = 'call'
self.call_stack.enter(cxt.method)
self.event_locations[self.call_stack.method_id] = []
self.register_event(cxt)
self.update(my_vars)
def method_exit(self, cxt, my_vars):
self.current_event = 'return'
self.register_event(cxt)
self.update(my_vars)
self.call_stack.leave()
def method_statement(self, cxt, my_vars):
self.current_event = 'line'
self.register_event(cxt)
self.update(my_vars)
For each of the method events, we also register the event using register_event()
which keeps track of the line numbers that were referenced in this method.
class AssignmentVars(AssignmentVars):
def register_event(self, cxt):
self.event_locations[self.call_stack.method_id].append(cxt.line_no)
The var_location_register()
method keeps the locations of newly added variables. The definition location of variables in a call
is the current location. However, for a line
(or a return
), it is the previous location visited in the current method.
class AssignmentVars(AssignmentVars):
def var_location_register(self, my_vars):
def loc(mid):
if self.current_event == 'call':
return self.event_locations[mid][-1]
elif self.current_event == 'line':
return self.event_locations[mid][-2]
elif self.current_event == 'return':
return self.event_locations[mid][-2]
else:
assert False
my_loc = loc(self.call_stack.method_id)
for var in my_vars:
self.var_def_lines[var] = my_loc
We define defined_vars()
, which returns the names of variables annotated with their definition line numbers, as below.
class AssignmentVars(AssignmentVars):
def defined_vars(self, formatted=True):
def fmt(k):
v = (k[0], self.var_def_lines[k])
return "%s@%s" % v if formatted else v
return [(fmt(k), v) for k, v in self.defs.items()]
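To see the formatting in isolation, here is the same fmt() logic run on hypothetical data (the (name, sequence) keys, values, and line numbers are made up for illustration):

```python
# defs is keyed by (name, sequence) pairs; var_def_lines records where
# each sequenced variable was defined
var_def_lines = {('url', 1): 372, ('url', 2): 478}
defs = {('url', 1): 'http://x/', ('url', 2): '//x/'}

def fmt(k, formatted=True):
    # drop the sequence number and append the definition line number
    v = (k[0], var_def_lines[k])
    return "%s@%s" % v if formatted else v

print([(fmt(k), v) for k, v in defs.items()])
# → [('url@372', 'http://x/'), ('url@478', '//x/')]
```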
Similar to defined_vars()
, we define seq_vars()
, which annotates each variable with its definition line number and the number of times it was assigned.
class AssignmentVars(AssignmentVars):
def seq_vars(self, formatted=True):
def fmt(k):
v = (k[0], self.var_def_lines[k], k[1])
return "%s@%s:%s" % v if formatted else v
return {fmt(k): v for k, v in self.defs.items()}
The AssignmentTracker
keeps the assignment definitions using the AssignmentVars
we defined previously.
class AssignmentTracker(DefineTracker):
def __init__(self, my_input, trace, **kwargs):
self.options(kwargs)
self.my_input = my_input
self.my_assignments = self.create_assignments(my_input)
self.trace = trace
self.process()
def create_assignments(self, *args):
return AssignmentVars(*args)
To fine-tune the process, we define an optional parameter called track_return
. While tracing a method return, Python produces a virtual variable that contains the returned value. If track_return
is set, we capture this value as a variable:
track_return -- if true, add a virtual variable to the Vars representing the return value
class AssignmentTracker(AssignmentTracker):
def options(self, kwargs):
self.track_return = kwargs.get('track_return', False)
super().options(kwargs)
There can be different kinds of events during a trace: call
when a function is entered, return
when the function returns, exception
when an exception is thrown, and line
when a statement is executed.
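These event names come straight from Python's sys.settrace() tracing machinery; a minimal, self-contained probe makes them visible:

```python
import sys

events = []

def tracer(frame, event, arg):
    events.append(event)  # record every event name the interpreter reports
    return tracer         # returning the tracer keeps 'line' events coming

def probe():
    x = 1
    return x

sys.settrace(tracer)
probe()
sys.settrace(None)
print(events)
# → ['call', 'line', 'line', 'return']
```

Returning the trace function from itself is what enables per-line tracing inside the called frame.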
The previous Tracker
was too simplistic in that it did not distinguish between the different events. We rectify that, and define on_call()
, on_return()
, and on_line()
, which get called on their corresponding events.
Note that on_line()
is also called from on_return()
. The reason is that Python invokes the trace function before the corresponding line is executed. Hence, effectively, on_return()
is called with the bindings produced by the execution of the previous statement in the environment. Our processing, in effect, is done on values that were bound by the previous statement. Hence, calling on_line()
here is appropriate, as it gives the event handler a chance to work on the previous binding.
class AssignmentTracker(AssignmentTracker):
def on_call(self, arg, cxt, my_vars):
my_vars = cxt.parameters(my_vars)
self.my_assignments.method_enter(cxt, self.fragments(my_vars))
def on_line(self, arg, cxt, my_vars):
self.my_assignments.method_statement(cxt, self.fragments(my_vars))
def on_return(self, arg, cxt, my_vars):
self.on_line(arg, cxt, my_vars)
my_vars = {'<-%s' % cxt.method: arg} if self.track_return else {}
self.my_assignments.method_exit(cxt, my_vars)
def on_exception(self, arg, cxt, my_vars):
return
def track_event(self, event, arg, cxt, my_vars):
self.current_event = event
dispatch = {
'call': self.on_call,
'return': self.on_return,
'line': self.on_line,
'exception': self.on_exception
}
dispatch[event](arg, cxt, my_vars)
We can now use AssignmentTracker
to track the different variables. To verify that our variable line number inference works, we recover definitions from the functions A()
, B()
and C()
(with data annotations removed so that the input fragments are correctly identified).
def C(cp_1): # type: ignore
c_2 = cp_1
c_3 = c_2
return c_3
def B(bp_7): # type: ignore
b_8 = bp_7
return C(b_8)
def A(ap_12): # type: ignore
a_13 = ap_12
a_14 = B(a_13)
a_14 = a_14
a_13 = a_14
a_14 = B(a_13)
a_14 = B(a_14)[3:]
Running A()
with sufficient input.
with Tracer('---xxx') as tracer:
A(tracer.my_input)
tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
for k, v in tracker.my_assignments.seq_vars().items():
print(k, '=', repr(v))
print()
for k, v in tracker.my_assignments.defined_vars(formatted=True):
print(k, '=', repr(v))
ap_12@1:1 = '---xxx' a_13@2:1 = '---xxx' bp_7@1:1 = '---xxx' b_8@2:1 = '---xxx' cp_1@1:1 = '---xxx' c_2@2:1 = '---xxx' c_3@3:1 = '---xxx' a_14@3:1 = '---xxx' a_14@7:2 = 'xxx' ap_12@1 = '---xxx' a_13@2 = '---xxx' bp_7@1 = '---xxx' b_8@2 = '---xxx' cp_1@1 = '---xxx' c_2@2 = '---xxx' c_3@3 = '---xxx' a_14@3 = '---xxx' a_14@7 = 'xxx'
As can be seen, the line numbers are now correctly identified for each variable.
Let us try retrieving the assignments for a real world example.
traces = []
for inputstr in URLS_X:
clear_cache()
with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
urlparse(tracer.my_input)
traces.append((tracer.my_input, tracer.trace))
tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
for k, v in tracker.my_assignments.defined_vars():
print(k, '=', repr(v))
print()
url@372 = 'http://user:pass@www.google.com:80/?q=path#ref' url@478 = '//user:pass@www.google.com:80/?q=path#ref' scheme@478 = 'http' url@481 = '/?q=path#ref' netloc@481 = 'user:pass@www.google.com:80' url@486 = '/?q=path' fragment@486 = 'ref' query@488 = 'q=path' url@393 = 'http://user:pass@www.google.com:80/?q=path#ref' url@372 = 'https://www.cispa.saarland:80/' url@478 = '//www.cispa.saarland:80/' scheme@478 = 'https' netloc@481 = 'www.cispa.saarland:80' url@393 = 'https://www.cispa.saarland:80/' url@372 = 'http://www.fuzzingbook.org/#News' url@478 = '//www.fuzzingbook.org/#News' scheme@478 = 'http' url@481 = '/#News' netloc@481 = 'www.fuzzingbook.org' fragment@486 = 'News' url@393 = 'http://www.fuzzingbook.org/#News' url@372 = 'ftp://freebsd.org/releases/5.8' url@478 = '//freebsd.org/releases/5.8' scheme@478 = 'ftp' url@481 = '/releases/5.8' netloc@481 = 'freebsd.org' url@393 = 'ftp://freebsd.org/releases/5.8' url@394 = '/releases/5.8'
The line numbers of variables can be verified from the source code of urllib/parse.py.
Does handling variable reassignments help with our URL examples? We look at these next.
class TreeMiner(TreeMiner):
def get_derivation_tree(self):
tree = (START_SYMBOL, [(self.my_input, [])])
for var, value in self.my_assignments:
self.log(0, "%s=%s" % (var, repr(value)))
self.apply_new_definition(tree, var, value)
return tree
First, we obtain the derivation tree of the first URL:
clear_cache()
with Tracer(URLS_X[0], files=['urllib/parse.py']) as tracer:
urlparse(tracer.my_input)
sm = AssignmentTracker(tracer.my_input, tracer.trace)
dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())
display_tree(dt.tree)