Mining Input Grammars¶

So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place. While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort. In this chapter, we therefore introduce techniques that automatically mine grammars from programs – by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to

  1. take a program,
  2. extract its input grammar, and
  3. fuzz it with high efficiency and effectiveness, using the concepts in this book.

Prerequisites

  • You should have read the chapter on grammars.
  • The chapter on configuration fuzzing introduces grammar mining for configuration options, as well as observing variables and values during execution.
  • We use the tracer from the chapter on coverage.
  • The concept of parsing from the chapter on parsers is also useful.

A Grammar Challenge¶

Consider the process_inventory() method from the chapter on parsers. It takes inputs of the following form.

In [5]:
INVENTORY = """\
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture\
"""
In [6]:
print(process_inventory(INVENTORY))
We have a Ford E350 van from 1997 vintage.
It is an old but reliable model!
We have a Mercury Cougar car from 2000 vintage.
It is an old but reliable model!
We have a Chevy Venture car from 1999 vintage.
It is an old but reliable model!

We found from the chapter on parsers that coarse grammars do not work well for fuzzing when the input format includes details expressed only in code. That is, even though we have the formal specification of CSV files (RFC 4180), the inventory system includes further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the input specification given.

One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a style such that specific methods are responsible for handling specific parts of the input, one can recover the parse tree by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees.

We start with the assumption (1) that the program is written in such a fashion that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers.

The idea is as follows:

  • Hook into the Python execution and observe the fragments of the input string as they are produced and named in different methods.
  • Stitch the input fragments together in a tree structure to retrieve the parse tree.
  • Abstract common elements from multiple parse trees to produce the context-free grammar of the input.
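The first of these steps can be sketched in a few lines with sys.settrace(). This is only a minimal illustration under stated assumptions, not the miner developed in this chapter: parse_pair() is a hypothetical toy parser, and observe_strings() is an assumed helper name that records every string-valued local variable whose value occurs in the input.

```python
import sys

def parse_pair(text):
    # Hypothetical toy parser: splits "key=value" into its two parts.
    key, value = text.split('=')
    return key, value

def observe_strings(func, my_input):
    """Run func(my_input); record string locals whose value occurs in the input."""
    seen = {}

    def traceit(frame, event, arg):
        # Only look at frames that belong to the function under observation.
        if frame.f_code.co_name == func.__name__:
            for name, value in frame.f_locals.items():
                if isinstance(value, str) and value in my_input:
                    seen[name] = value
        return traceit

    previous = sys.gettrace()       # remember any tracer already installed
    sys.settrace(traceit)
    try:
        func(my_input)
    finally:
        sys.settrace(previous)      # always restore the earlier tracer
    return seen

fragments = observe_strings(parse_pair, "color=red")
print(fragments)
```

The recorded names (text, key, value) are the raw material for the tree-building and grammar-abstraction steps.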

A Simple Grammar Miner¶

Say we want to obtain the input grammar for the function process_vehicle(). We first collect the sample inputs for this function.

In [7]:
VEHICLES = INVENTORY.split('\n')

The set of methods responsible for processing inventory are the following.

In [8]:
INVENTORY_METHODS = {
    'process_inventory',
    'process_vehicle',
    'process_van',
    'process_car'}

We have seen from the chapter on configuration fuzzing that one can hook into the Python runtime to observe the arguments to a function and any local variables created. We have also seen that one can obtain the context of execution by inspecting the frame argument. Here is a simple tracer that can return the local variables and other contextual information in a traced function. We reuse the Coverage tracing class.

Tracer¶

In [11]:
class Tracer(Coverage):
    def traceit(self, frame, event, arg):
        method_name = inspect.getframeinfo(frame).function
        if method_name not in INVENTORY_METHODS:
            return
        file_name = inspect.getframeinfo(frame).filename

        param_names = inspect.getargvalues(frame).args
        lineno = inspect.getframeinfo(frame).lineno
        local_vars = inspect.getargvalues(frame).locals
        print(event, file_name, lineno, method_name, param_names, local_vars)
        return self.traceit

We run the code under trace context.

In [12]:
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])
call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 29 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 30 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 31 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}
call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 40 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 41 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 42 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.']}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 43 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 46 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}
line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}
return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}
return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}

The main thing that we want out of tracing is a list of assignments of input fragments to different variables. We can use the tracing facility settrace() to get that as we showed above.

However, the settrace() function hooks into the Python debugging facility. When it is in operation, no debugger can hook into the program. That is, if there is a problem with our grammar miner, we will not be able to attach a debugger to it to understand what is happening. This is not ideal. Hence, we limit the tracer to the simplest implementation possible, and implement the core of grammar mining in later stages.
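Because only one trace function is active at a time, it can help to save and restore whatever was installed before. The following is a sketch of that idea, assuming a hypothetical context manager named SavedTrace. It is not the Coverage class reused in this chapter.

```python
import sys

class SavedTrace:
    """Install a trace function and restore the previous one on exit."""

    def __init__(self, trace_function):
        self.trace_function = trace_function

    def __enter__(self):
        self.previous = sys.gettrace()       # may be None or a debugger hook
        sys.settrace(self.trace_function)
        return self

    def __exit__(self, exc_type, exc_value, tb):
        sys.settrace(self.previous)          # hand the hook back
        return False                         # do not swallow exceptions

events = []

def record(frame, event, arg):
    events.append((event, frame.f_code.co_name))
    return record

def greet():
    return "hello"

before = sys.gettrace()
with SavedTrace(record):
    greet()
after = sys.gettrace()
```

After the with block, sys.gettrace() returns the same object as before, so a previously installed hook is back in place. While the block runs, however, that hook sees nothing.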

The traceit() function relies on information from the frame variable which exposes Python internals. We define a context class that encapsulates the information that we need from the frame.

Context¶

The Context class provides easy access to information such as the current method, its source file and line number, and its parameter names.

In [13]:
class Context:
    def __init__(self, frame, track_caller=True):
        self.method = inspect.getframeinfo(frame).function
        self.parameter_names = inspect.getargvalues(frame).args
        self.file_name = inspect.getframeinfo(frame).filename
        self.line_no = inspect.getframeinfo(frame).lineno

    def _t(self):
        return (self.file_name, self.line_no, self.method,
                ','.join(self.parameter_names))

    def __repr__(self):
        return "%s:%d:%s(%s)" % self._t()

Here we add a few convenience methods that operate on the frame to Context.

In [14]:
class Context(Context):
    def extract_vars(self, frame):
        return inspect.getargvalues(frame).locals

    def parameters(self, all_vars):
        return {k: v for k, v in all_vars.items() if k in self.parameter_names}

    def qualified(self, all_vars):
        return {"%s:%s" % (self.method, k): v for k, v in all_vars.items()}

We now print the context from within traceit() to see it in action. First, we define log_event() for displaying events.

In [15]:
def log_event(event, var):
    print({'call': '->', 'return': '<-'}.get(event, '  '), var)

We then use log_event() in the traceit() function.

In [16]:
class Tracer(Tracer):
    def traceit(self, frame, event, arg):
        log_event(event, Context(frame))
        return self.traceit

Running process_vehicle() under trace prints the contexts encountered.

In [17]:
with Tracer() as tracer:
    process_vehicle(VEHICLES[0])
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:102:__exit__(self,exc_type,exc_value,tb)
   /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:105:__exit__(self,exc_type,exc_value,tb)

The trace produced by executing any function can get overwhelmingly large. Hence, we need to restrict our attention to specific modules. Further, we also restrict our attention exclusively to str variables since these variables are more likely to contain input fragments. (We will show how to deal with complex objects later in exercises.)

The Context class we developed earlier is used to decide which modules to monitor, and which variables to trace.

We store the current input string so that it can be used to determine if any particular string fragments came from the current input string. Any optional arguments are processed separately.

In [18]:
class Tracer(Tracer):
    def __init__(self, my_input, **kwargs):
        self.options(kwargs)
        self.my_input, self.trace = my_input, []

We use an optional argument files to indicate the specific source files we are interested in, and methods to indicate which specific methods are of interest. Further, we also use log to specify whether verbose logging should be enabled during trace. We use the log_event() method we defined earlier for logging.

The options processing is as below.

In [19]:
class Tracer(Tracer):
    def options(self, kwargs):
        self.files = kwargs.get('files', [])
        self.methods = kwargs.get('methods', [])
        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None

The files and methods are checked to determine whether a particular event should be traced.

In [20]:
class Tracer(Tracer):
    def tracing_context(self, cxt, event, arg):
        fres = not self.files or any(
            cxt.file_name.endswith(f) for f in self.files)
        mres = not self.methods or any(cxt.method == m for m in self.methods)
        return fres and mres
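The filtering rule is easier to check in isolation. The sketch below restates it as a plain function. The name should_trace() and the sample paths are assumptions made for illustration only.

```python
def should_trace(file_name, method, files=(), methods=()):
    """An empty filter accepts everything; otherwise one entry must match."""
    file_ok = not files or any(file_name.endswith(f) for f in files)
    method_ok = not methods or method in methods
    return file_ok and method_ok

# No filters: every event is traced.
print(should_trace('/tmp/Parser.ipynb', 'process_van'))
# A method filter excludes everything else.
print(should_trace('/tmp/Parser.ipynb', '__exit__', methods={'process_van'}))
# File names are matched by suffix, so absolute paths need not be spelled out.
print(should_trace('/tmp/Parser.ipynb', 'process_van', files=['Parser.ipynb']))
```

Matching files by suffix means the filter does not depend on where the notebook happens to be checked out.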

Similar to the context of events, we also want to restrict our attention to specific variables. For now, we want to focus only on strings. (See the exercises at the end of the chapter on how to extend this to other kinds of objects.)

In [21]:
class Tracer(Tracer):
    def tracing_var(self, k, v):
        return isinstance(v, str)

We modify the traceit() to call an on_event() function with the context information only on the specific events we are interested in.

In [22]:
class Tracer(Tracer):
    def on_event(self, event, arg, cxt, my_vars):
        self.trace.append((event, arg, cxt, my_vars))
        
    def create_context(self, frame):
        return Context(frame)

    def traceit(self, frame, event, arg):
        cxt = self.create_context(frame)
        if not self.tracing_context(cxt, event, arg):
            return self.traceit
        self.log(event, cxt)

        my_vars = {
            k: v
            for k, v in cxt.extract_vars(frame).items()
            if self.tracing_var(k, v)
        }
        self.on_event(event, arg, cxt, my_vars)
        return self.traceit

The Tracer class can now focus on specific kinds of events on specific files. Further, it provides a first level filter for variables that we find interesting. For example, we want to focus specifically on variables from process_* methods that contain input fragments. Here is how our updated Tracer can be used.

In [23]:
with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:
    process_vehicle(VEHICLES[0])
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)

The execution produced the following trace.

In [24]:
for t in tracer.trace:
    print(t[0], t[2].method, dict(t[3]))
call process_vehicle {'vehicle': '1997,van,Ford,E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}
line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}
call process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
return process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}
return process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}

Since we are saving the input already in Tracer, it is redundant to specify it separately again as an argument.

In [25]:
with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:
    process_vehicle(tracer.my_input)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)
-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)
   /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)
<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)

DefineTracker¶

We define a DefineTracker class that processes the trace from the Tracer. The idea is to store different variable definitions which are input fragments.

The tracker identifies string fragments that are part of the input string, and stores them in a dictionary my_assignments. It saves the trace, and the corresponding input for processing. Finally, it calls process() to process the trace it was given. We will start with a simple tracker that relies on certain assumptions, and later see how these assumptions can be relaxed.

In [26]:
class DefineTracker:
    def __init__(self, my_input, trace, **kwargs):
        self.options(kwargs)
        self.my_input = my_input
        self.trace = trace
        self.my_assignments = {}
        self.process()

One of the problems of using substring search is that short string sequences tend to be included in other string sequences even though they may not have come from the original string. For example, if the input fragment is v, it could equally have come from either van or Chevy. We rely on being able to predict the exact place in the input where a given fragment occurred. Hence, we define a constant FRAGMENT_LEN and ignore all strings shorter than that length. We also incorporate a logging facility as before.

In [27]:
FRAGMENT_LEN = 3
In [28]:
class DefineTracker(DefineTracker):
    def options(self, kwargs):
        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None
        self.fragment_len = kwargs.get('fragment_len', FRAGMENT_LEN)

Our tracer simply records the variable values as they occur. We next need to check if the variables contain values from the input string. Common ways to do this are symbolic execution or at least dynamic tainting, which are powerful, but also complex. However, one can obtain a reasonable approximation by simply relying on substring search. That is, we consider any value produced that is a substring of the original input string to have come from the original input.

We define an is_input_fragment() method that relies on string inclusion to detect if the string came from the input.

In [29]:
class DefineTracker(DefineTracker):
    def is_input_fragment(self, var, value):
        return len(value) >= self.fragment_len and value in self.my_input
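The following standalone sketch shows what the substring heuristic accepts and where it can go wrong. The variable names and values in candidates are made up for illustration. Only the check itself mirrors is_input_fragment().

```python
def is_input_fragment(value, my_input, fragment_len=3):
    # Same test as above, written as a free function for experimentation.
    return len(value) >= fragment_len and value in my_input

my_input = '1999,car,Chevy,Venture'

# Hypothetical local variables observed during a run.
candidates = {'kind': 'car', 'sep': ',', 'flag': 'v', 'status': 'old'}
kept = {k: v for k, v in candidates.items() if is_input_fragment(v, my_input)}
print(kept)

# A false positive: 'ent' is long enough and occurs inside 'Venture',
# even if the program produced it for unrelated reasons.
print(is_input_fragment('ent', my_input))
```

Raising fragment_len reduces such accidental matches, at the price of missing genuinely short fields.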

We can use is_input_fragment() to select only a subset of variables defined, as implemented below in fragments().

In [30]:
class DefineTracker(DefineTracker):
    def fragments(self, variables):
        return {k: v for k, v in variables.items(
        ) if self.is_input_fragment(k, v)}

The tracker processes each event, and at each event, it updates the dictionary my_assignments with the current local variables that contain strings that are part of the input. Note that there is a choice here with respect to what happens during reassignment. We can either discard all the reassignments, or keep only the last assignment. Here, we choose the latter. If you want the former behavior, check whether the variable already exists in my_assignments before storing a fragment.

In [31]:
class DefineTracker(DefineTracker):
    def track_event(self, event, arg, cxt, my_vars):
        self.log(event, (cxt.method, my_vars))
        self.my_assignments.update(self.fragments(my_vars))

    def process(self):
        for event, arg, cxt, my_vars in self.trace:
            self.track_event(event, arg, cxt, my_vars)
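The two reassignment policies differ only in how the dictionary is updated. Here is a minimal sketch using a made-up sequence of variable snapshots. The events list is hypothetical and not taken from an actual trace.

```python
# Hypothetical snapshots of input fragments at three successive events.
events = [
    {'year': '1997'},
    {'year': '1997', 'kind': 'van'},
    {'year': '2000'},                 # 'year' is reassigned here
]

# Policy used by track_event(): later values overwrite earlier ones.
keep_last = {}
for my_vars in events:
    keep_last.update(my_vars)

# Alternative policy: the first value wins, reassignments are discarded.
keep_first = {}
for my_vars in events:
    for name, value in my_vars.items():
        keep_first.setdefault(name, value)

print(keep_last)
print(keep_first)
```

Either way, only one value per variable name survives.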

Using the tracker, we can obtain the input fragments. For example, say we are only interested in strings that are at least 5 characters long.

In [32]:
tracker = DefineTracker(tracer.my_input, tracer.trace, fragment_len=5)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))
vehicle = '1997,van,Ford,E350'

Or strings that are at least 3 characters long (the default).

In [33]:
tracker = DefineTracker(tracer.my_input, tracer.trace)
for k, v in tracker.my_assignments.items():
    print(k, '=', repr(v))
vehicle = '1997,van,Ford,E350'
year = '1997'
kind = 'van'
company = 'Ford'
model = 'E350'
In [34]:
class DefineTracker(DefineTracker):
    def assignments(self):
        return self.my_assignments.items()

Assembling a Derivation Tree¶

The input fragments from the DefineTracker only tell half the story. The fragments may be created at different stages of parsing. Hence, we need to assemble the fragments into a derivation tree of the input. The basic idea is as follows:

Our input from the previous step was:

"1997,van,Ford,E350"

We start a derivation tree, and associate it with the start symbol in the grammar.

In [37]:
derivation_tree: DerivationTree = (START_SYMBOL, [("1997,van,Ford,E350", [])])
In [38]:
display_tree(derivation_tree)
Out[38]:
<start>
└── 1997,van,Ford,E350

The next input was:

vehicle = "1997,van,Ford,E350"

Since vehicle covers the <start> node's value completely, we replace the value with the vehicle node.

In [39]:
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [("1997,van,Ford,E350", [])])])
In [40]:
display_tree(derivation_tree)
Out[40]:
<start>
└── <vehicle>
    └── 1997,van,Ford,E350

The next input was:

year = '1997'

Traversing the derivation tree from <start>, we see that it replaces a portion of the <vehicle> node's value. Hence we split the <vehicle> node's value into two children, where one corresponds to the value "1997" and the other to ",van,Ford,E350", and replace the first one with the node <year>.

In [41]:
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",van,Ford,E350", [])])])
In [42]:
display_tree(derivation_tree)
Out[42]:
<start>
└── <vehicle>
    ├── <year>
    │   └── 1997
    └── ,van,Ford,E350

We perform similar operations for

company = 'Ford'
In [43]:
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",van,", []),
                                                   ('<company>', [('Ford', [])]),
                                                   (",E350", [])])])
In [44]:
display_tree(derivation_tree)
Out[44]:
<start>
└── <vehicle>
    ├── <year>
    │   └── 1997
    ├── ,van,
    ├── <company>
    │   └── Ford
    └── ,E350

Similarly for

kind = 'van'

and

model = 'E350'
In [45]:
derivation_tree: DerivationTree = (START_SYMBOL,
                                   [('<vehicle>', [('<year>', [('1997', [])]),
                                                   (",", []),
                                                   ("<kind>", [('van', [])]),
                                                   (",", []),
                                                   ('<company>', [('Ford', [])]),
                                                   (",", []),
                                                   ("<model>", [('E350', [])])
                                                   ])])
In [46]:
display_tree(derivation_tree)
Out[46]:
<start>
└── <vehicle>
    ├── <year>
    │   └── 1997
    ├── , (44)
    ├── <kind>
    │   └── van
    ├── , (44)
    ├── <company>
    │   └── Ford
    ├── , (44)
    └── <model>
        └── E350

We now develop the complete algorithm with the steps described above. The TreeMiner is initialized with the input string and the variable assignments, and it converts the assignments to the corresponding derivation tree.

In [47]:
class TreeMiner:
    def __init__(self, my_input, my_assignments, **kwargs):
        self.options(kwargs)
        self.my_input = my_input
        self.my_assignments = my_assignments
        self.tree = self.get_derivation_tree()

    def options(self, kwargs):
        self.log = log_call if kwargs.get('log') else lambda _i, _v: None

    def get_derivation_tree(self):
        return (START_SYMBOL, [])

The log_call() is as follows.

In [48]:
def log_call(indent, var):
    print('\t' * indent, var)

The basic idea is as follows:

  • For now, we assume that the value assigned to a variable is stable. That is, it is never reassigned. In particular, there are no recursive calls, or multiple calls to the same function from different parts. (We will show how to overcome this limitation later.)
  • For each pair (var, value) found in my_assignments:
    1. We search for occurrences of value in the derivation tree recursively.
    2. If an occurrence was found as part of a value V1 of a node P1, we partition V1 into three parts, with the central part matching value, and the first and last parts being the corresponding prefix and suffix in V1.
    3. We reconstitute the node P1 with three children, where the prefix and suffix mentioned earlier are string values, and the matching value is replaced by a node var with the single child value.
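Steps 2 and 3 can be sketched with Python's built-in str.partition(), which splits a string around the first occurrence of a separator. The helper name split_around() is an assumption for illustration. It is a simplified stand-in, not the TreeMiner method.

```python
def split_around(value, var, part):
    """Split a terminal value into prefix, a new <var> node, and suffix."""
    prefix, match, suffix = value.partition(part)   # first occurrence only
    if not match:
        return [(value, [])]                        # part not found: leave value as is
    children = []
    if prefix:
        children.append((prefix, []))               # terminal before the match
    children.append(('<' + var + '>', [(part, [])]))
    if suffix:
        children.append((suffix, []))               # terminal after the match
    return children

children = split_around('1997,van,Ford,E350', 'kind', 'van')
print(children)
```

Empty prefixes and suffixes are dropped, so a value that matches completely is replaced by a single node.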

First, we define a wrapper to generate a nonterminal from a variable name.

In [49]:
def to_nonterminal(var):
    return "<" + var.lower() + ">"

The string_part_of_value() method checks whether the given part value was part of the whole.

In [50]:
class TreeMiner(TreeMiner):
    def string_part_of_value(self, part, value):
        return (part in value)

The partition_by_part() method splits the value at the first occurrence of the given part, and returns a list containing the prefix, a node for the part that was replaced, and the suffix, omitting any empty prefix or suffix. This is a format that can be used as a part of the list of children.

In [51]:
class TreeMiner(TreeMiner):
    def partition(self, part, value):
        return value.partition(part)
In [52]:
class TreeMiner(TreeMiner):
    def partition_by_part(self, pair, value):
        k, part = pair
        prefix_k_suffix = [
                    (k, [[part, []]]) if i == 1 else (e, [])
                    for i, e in enumerate(self.partition(part, value))
                    if e]
        return prefix_k_suffix

The insert_into_tree() method accepts a tree my_tree and a (k, v) pair. It recursively checks whether the given pair can be applied. If the pair can be applied, it applies the pair and returns True; otherwise, it returns False.

In [53]:
class TreeMiner(TreeMiner):
    def insert_into_tree(self, my_tree, pair):
        var, values = my_tree
        k, v = pair
        self.log(1, "- Node: %s\t\t? (%s:%s)" % (var, k, repr(v)))
        applied = False
        for i, value_ in enumerate(values):
            value, arr = value_
            self.log(2, "-> [%d] %s" % (i, repr(value)))
            if is_nonterminal(value):
                applied = self.insert_into_tree(value_, pair)
                if applied:
                    break
            elif self.string_part_of_value(v, value):
                prefix_k_suffix = self.partition_by_part(pair, value)
                del values[i]
                for j, rep in enumerate(prefix_k_suffix):
                    values.insert(j + i, rep)
                applied = True

                self.log(2, " > %s" % (repr([i[0] for i in prefix_k_suffix])))
                break
            else:
                continue
        return applied

Here is how insert_into_tree() is used.

In [54]:
tree: DerivationTree = (START_SYMBOL, [("1997,van,Ford,E350", [])])
m = TreeMiner('', {}, log=True)

First, we have our input string as the only node.

In [55]:
display_tree(tree)
Out[55]:
<start>
└── 1997,van,Ford,E350

Inserting the <vehicle> node.

In [56]:
v = m.insert_into_tree(tree, ('<vehicle>', "1997,van,Ford,E350"))
	 - Node: <start>		? (<vehicle>:'1997,van,Ford,E350')
		 -> [0] '1997,van,Ford,E350'
		  > ['<vehicle>']
In [57]:
display_tree(tree)
Out[57]:
<start>
└── <vehicle>
    └── 1997,van,Ford,E350

Inserting the <model> node.

In [58]:
v = m.insert_into_tree(tree, ('<model>', 'E350'))
	 - Node: <start>		? (<model>:'E350')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<model>:'E350')
		 -> [0] '1997,van,Ford,E350'
		  > ['1997,van,Ford,', '<model>']
In [59]:
display_tree((tree))
Out[59]:
<start>
└── <vehicle>
    ├── 1997,van,Ford,
    └── <model>
        └── E350

Inserting <company>.

In [60]:
v = m.insert_into_tree(tree, ('<company>', 'Ford'))
	 - Node: <start>		? (<company>:'Ford')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<company>:'Ford')
		 -> [0] '1997,van,Ford,'
		  > ['1997,van,', '<company>', ',']
In [61]:
display_tree(tree)
Out[61]:
<start>
└── <vehicle>
    ├── 1997,van,
    ├── <company>
    │   └── Ford
    ├── , (44)
    └── <model>
        └── E350

Inserting <kind>.

In [62]:
v = m.insert_into_tree(tree, ('<kind>', 'van'))
	 - Node: <start>		? (<kind>:'van')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<kind>:'van')
		 -> [0] '1997,van,'
		  > ['1997,', '<kind>', ',']
In [63]:
display_tree(tree)
Out[63]:
<start>
└── <vehicle>
    ├── 1997,
    ├── <kind>
    │   └── van
    ├── , (44)
    ├── <company>
    │   └── Ford
    ├── , (44)
    └── <model>
        └── E350

Inserting <year>.

In [64]:
v = m.insert_into_tree(tree, ('<year>', '1997'))
	 - Node: <start>		? (<year>:'1997')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<year>:'1997')
		 -> [0] '1997,'
		  > ['<year>', ',']
In [65]:
display_tree(tree)
Out[65]:
<start>
└── <vehicle>
    ├── <year>
    │   └── 1997
    ├── , (44)
    ├── <kind>
    │   └── van
    ├── , (44)
    ├── <company>
    │   └── Ford
    ├── , (44)
    └── <model>
        └── E350

To make life simple, we define a wrapper function nt_var() that will convert a token to its corresponding nonterminal symbol.

In [66]:
class TreeMiner(TreeMiner):
    def nt_var(self, var):
        return var if is_nonterminal(var) else to_nonterminal(var)

Now, we need to apply a new definition to an entire derivation tree.

In [67]:
class TreeMiner(TreeMiner):
    def apply_new_definition(self, tree, var, value):
        nt_var = self.nt_var(var)
        return self.insert_into_tree(tree, (nt_var, value))

This algorithm is implemented as get_derivation_tree().

In [68]:
class TreeMiner(TreeMiner):
    def get_derivation_tree(self):
        tree = (START_SYMBOL, [(self.my_input, [])])

        for var, value in self.my_assignments:
            self.log(0, "%s=%s" % (var, repr(value)))
            self.apply_new_definition(tree, var, value)
        return tree

The TreeMiner is used as follows:

In [69]:
with Tracer(VEHICLES[0]) as tracer:
    process_vehicle(tracer.my_input)
assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
dt = TreeMiner(tracer.my_input, assignments, log=True)
dt.tree
 vehicle='1997,van,Ford,E350'
	 - Node: <start>		? (<vehicle>:'1997,van,Ford,E350')
		 -> [0] '1997,van,Ford,E350'
		  > ['<vehicle>']
 year='1997'
	 - Node: <start>		? (<year>:'1997')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<year>:'1997')
		 -> [0] '1997,van,Ford,E350'
		  > ['<year>', ',van,Ford,E350']
 kind='van'
	 - Node: <start>		? (<kind>:'van')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<kind>:'van')
		 -> [0] '<year>'
	 - Node: <year>		? (<kind>:'van')
		 -> [0] '1997'
		 -> [1] ',van,Ford,E350'
		  > [',', '<kind>', ',Ford,E350']
 company='Ford'
	 - Node: <start>		? (<company>:'Ford')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<company>:'Ford')
		 -> [0] '<year>'
	 - Node: <year>		? (<company>:'Ford')
		 -> [0] '1997'
		 -> [1] ','
		 -> [2] '<kind>'
	 - Node: <kind>		? (<company>:'Ford')
		 -> [0] 'van'
		 -> [3] ',Ford,E350'
		  > [',', '<company>', ',E350']
 model='E350'
	 - Node: <start>		? (<model>:'E350')
		 -> [0] '<vehicle>'
	 - Node: <vehicle>		? (<model>:'E350')
		 -> [0] '<year>'
	 - Node: <year>		? (<model>:'E350')
		 -> [0] '1997'
		 -> [1] ','
		 -> [2] '<kind>'
	 - Node: <kind>		? (<model>:'E350')
		 -> [0] 'van'
		 -> [3] ','
		 -> [4] '<company>'
	 - Node: <company>		? (<model>:'E350')
		 -> [0] 'Ford'
		 -> [5] ',E350'
		  > [',', '<model>']
Out[69]:
('<start>',
 [('<vehicle>',
   [('<year>', [['1997', []]]),
    (',', []),
    ('<kind>', [['van', []]]),
    (',', []),
    ('<company>', [['Ford', []]]),
    (',', []),
    ('<model>', [['E350', []]])])])

The resulting derivation tree is shown below.

In [70]:
display_tree(TreeMiner(tracer.my_input, assignments).tree)
Out[70]:
[Derivation tree: <start> → <vehicle> → <year> ("1997"), ",", <kind> ("van"), ",", <company> ("Ford"), ",", <model> ("E350")]

Combining all the pieces:

In [71]:
trees = []
for vehicle in VEHICLES:
    print(vehicle)
    with Tracer(vehicle) as tracer:
        process_vehicle(tracer.my_input)
    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
    trees.append((tracer.my_input, assignments))
    for var, val in assignments:
        print(var + " = " + repr(val))
    print()
1997,van,Ford,E350
vehicle = '1997,van,Ford,E350'
year = '1997'
kind = 'van'
company = 'Ford'
model = 'E350'

2000,car,Mercury,Cougar
vehicle = '2000,car,Mercury,Cougar'
year = '2000'
kind = 'car'
company = 'Mercury'
model = 'Cougar'

1999,car,Chevy,Venture
vehicle = '1999,car,Chevy,Venture'
year = '1999'
kind = 'car'
company = 'Chevy'
model = 'Venture'

The corresponding derivation trees are below.

In [72]:
csv_dt = []
for inputstr, assignments in trees:
    print(inputstr)
    dt = TreeMiner(inputstr, assignments)
    csv_dt.append(dt)
    display_tree(dt.tree)
1997,van,Ford,E350
2000,car,Mercury,Cougar
1999,car,Chevy,Venture

Recovering Grammars from Derivation Trees¶

We define a class GrammarMiner that can combine multiple derivation trees to produce the grammar. The initial grammar is empty.

In [73]:
class GrammarMiner:
    def __init__(self):
        self.grammar = {}

The tree_to_grammar() method converts our derivation tree to a grammar by picking one node at a time, and adding it to the grammar. The node name becomes the key, and any list of children it has becomes another alternative for that key.

In [74]:
class GrammarMiner(GrammarMiner):
    def tree_to_grammar(self, tree):
        node, children = tree
        one_alt = [ck for ck, gc in children]
        hsh = {node: [one_alt] if one_alt else []}
        for child in children:
            if not is_nonterminal(child[0]):
                continue
            chsh = self.tree_to_grammar(child)
            for k in chsh:
                if k not in hsh:
                    hsh[k] = chsh[k]
                else:
                    hsh[k].extend(chsh[k])
        return hsh
In [75]:
gm = GrammarMiner()
gm.tree_to_grammar(csv_dt[0].tree)
Out[75]:
{'<start>': [['<vehicle>']],
 '<vehicle>': [['<year>', ',', '<kind>', ',', '<company>', ',', '<model>']],
 '<year>': [['1997']],
 '<kind>': [['van']],
 '<company>': [['Ford']],
 '<model>': [['E350']]}

The grammar being generated here is canonical: each alternative is a list of symbols rather than a single string. We define a function readable() that takes in a canonical grammar and returns it in a readable form.

In [76]:
def readable(grammar):
    def readable_rule(rule):
        return ''.join(rule)

    return {k: list(set(readable_rule(a) for a in grammar[k]))
            for k in grammar}
In [77]:
syntax_diagram(readable(gm.tree_to_grammar(csv_dt[0].tree)))
[Syntax diagrams]
start: vehicle
vehicle: year , kind , company , model
year: 1997
kind: van
company: Ford
model: E350

The add_tree() method converts the given tree to a grammar and merges it into the current grammar: for every nonterminal that occurs in either of the two grammars, the alternatives from both are concatenated.

In [79]:
import itertools

class GrammarMiner(GrammarMiner):
    def add_tree(self, t):
        t_grammar = self.tree_to_grammar(t.tree)
        self.grammar = {
            key: self.grammar.get(key, []) + t_grammar.get(key, [])
            for key in itertools.chain(self.grammar.keys(), t_grammar.keys())
        }
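
The merge step can also be tried on plain dictionaries. The following is a minimal sketch under the assumption that both grammars are in canonical form; merge_grammars() and readable_ordered() are illustrative names, not part of the chapter's code, and readable_ordered() deliberately keeps first-seen order where readable() uses a set.

```python
def merge_grammars(current, new):
    """Return a grammar in which every key has the alternatives of both inputs."""
    merged = {}
    for key in list(current) + list(new):
        if key not in merged:
            merged[key] = current.get(key, []) + new.get(key, [])
    return merged


def readable_ordered(grammar):
    # Join the symbols of each alternative and drop duplicates,
    # keeping first-seen order.
    return {key: list(dict.fromkeys(''.join(alt) for alt in alts))
            for key, alts in grammar.items()}


first = {
    '<start>': [['<vehicle>']],
    '<vehicle>': [['<year>', ',', '<kind>']],
    '<year>': [['1997']],
    '<kind>': [['van']],
}
second = {
    '<start>': [['<vehicle>']],
    '<vehicle>': [['<year>', ',', '<kind>']],
    '<year>': [['2000']],
    '<kind>': [['car']],
}
merged = merge_grammars(first, second)
```

Note that merged still contains the <vehicle> alternative twice, once per tree. The duplicates only disappear when the grammar is converted to its readable form.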

The add_tree() method is used as follows:

In [80]:
inventory_grammar_miner = GrammarMiner()
for dt in csv_dt:
    inventory_grammar_miner.add_tree(dt)
In [81]:
syntax_diagram(readable(inventory_grammar_miner.grammar))
[Syntax diagrams; alternatives are separated by |]
start: vehicle
vehicle: year , kind , company , model
year: 1997 | 2000 | 1999
kind: car | van
company: Ford | Chevy | Mercury
model: Cougar | E350 | Venture

Given execution traces from various inputs, one can define update_grammar() to obtain the complete grammar from the traces.

In [82]:
class GrammarMiner(GrammarMiner):
    def update_grammar(self, inputstr, trace):
        at = self.create_tracker(inputstr, trace)
        dt = self.create_tree_miner(inputstr, at.assignments())
        self.add_tree(dt)
        return self.grammar

    def create_tracker(self, *args):
        return DefineTracker(*args)

    def create_tree_miner(self, *args):
        return TreeMiner(*args)

The complete grammar recovery is implemented in recover_grammar().

In [83]:
def recover_grammar(fn: Callable, inputs: Iterable[str], 
                    **kwargs: Any) -> Grammar:
    miner = GrammarMiner()

    for inputstr in inputs:
        with Tracer(inputstr, **kwargs) as tracer:
            fn(tracer.my_input)
        miner.update_grammar(tracer.my_input, tracer.trace)

    return readable(miner.grammar)

Note that the grammar could have been retrieved directly from the tracker, without the intermediate derivation tree stage. However, going through the derivation tree allows one to inspect the inputs being fragmented and verify that it happens correctly.
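
As an illustration of the direct route, here is a naive sketch that builds a readable grammar from the assignments by textual replacement alone. The function name grammar_from_assignments() is hypothetical, and the approach is deliberately simplistic: it replaces the first occurrence of each value in the first rule that contains it, and it may also match text inside already inserted nonterminal names.

```python
def grammar_from_assignments(my_input, assignments):
    """Build a readable grammar by replacing values with nonterminals."""
    grammar = {'<start>': [my_input]}
    for var, value in assignments:
        nonterminal = '<' + var + '>'
        for key in list(grammar):
            new_alternatives = [alt.replace(value, nonterminal, 1)
                                for alt in grammar[key]]
            if new_alternatives != grammar[key]:
                # Only the first rule containing the value is rewritten.
                grammar[key] = new_alternatives
                break
        grammar[nonterminal] = [value]
    return grammar


direct_grammar = grammar_from_assignments(
    '1997,van,Ford,E350',
    [('vehicle', '1997,van,Ford,E350'), ('year', '1997'), ('kind', 'van'),
     ('company', 'Ford'), ('model', 'E350')])
```

With this sketch, a wrong replacement only shows up in the final grammar strings, whereas the derivation tree exposes each fragmentation step separately.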

Example 1. Recovering the Inventory Grammar¶

In [84]:
inventory_grammar = recover_grammar(process_vehicle, VEHICLES)
In [85]:
inventory_grammar
Out[85]:
{'<start>': ['<vehicle>'],
 '<vehicle>': ['<year>,<kind>,<company>,<model>'],
 '<year>': ['1997', '2000', '1999'],
 '<kind>': ['car', 'van'],
 '<company>': ['Ford', 'Chevy', 'Mercury'],
 '<model>': ['Cougar', 'E350', 'Venture']}

Example 2. Recovering URL Grammar¶

Our algorithm is robust enough to recover grammars from real-world programs. For example, the urlparse() function in the Python urllib.parse module accepts the following sample URLs.

In [86]:
URLS = [
    'http://user:pass@www.google.com:80/?q=path#ref',
    'https://www.cispa.saarland:80/',
    'http://www.fuzzingbook.org/#News',
]

The urlparse() function caches its intermediate results for faster access. Since a cached result would skip the parsing code that we want to trace, we define a new function url_parse() that clears the cache using clear_cache() before each call. We then use the sample URLs to recover the grammar as follows.

In [88]:
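from urllib.parse import urlparse, clear_cache

def url_parse(url):
    clear_cache()  # drop cached results so that the parsing code runs again
    urlparse(url)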
In [89]:
trees = []
for url in URLS:
    print(url)
    with Tracer(url) as tracer:
        url_parse(tracer.my_input)
    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()
    trees.append((tracer.my_input, assignments))
    for var, val in assignments:
        print(var + " = " + repr(val))
    print()


url_dt = []
for inputstr, assignments in trees:
    print(inputstr)
    dt = TreeMiner(inputstr, assignments)
    url_dt.append(dt)
    display_tree(dt.tree)
http://user:pass@www.google.com:80/?q=path#ref
url = 'http://user:pass@www.google.com:80/?q=path#ref'
scheme = 'http'
netloc = 'user:pass@www.google.com:80'
fragment = 'ref'
query = 'q=path'

https://www.cispa.saarland:80/
url = 'https://www.cispa.saarland:80/'
scheme = 'https'
netloc = 'www.cispa.saarland:80'

http://www.fuzzingbook.org/#News
url = 'http://www.fuzzingbook.org/#News'
scheme = 'http'
netloc = 'www.fuzzingbook.org'
fragment = 'News'

http://user:pass@www.google.com:80/?q=path#ref
https://www.cispa.saarland:80/
http://www.fuzzingbook.org/#News

Let us use url_parse() to recover the grammar:

In [90]:
url_grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])
In [91]:
syntax_diagram(url_grammar)
[Syntax diagrams; alternatives are separated by |]
start: url
url: scheme :// netloc / | scheme :// netloc /? query # fragment | scheme :// netloc /# fragment
scheme: https | http
netloc: user:pass@www.google.com:80 | www.fuzzingbook.org | www.cispa.saarland:80
query: q=path
fragment: ref | News

The recovered grammar describes the URL format reasonably well.
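
For comparison, the components that urlparse() itself reports for the first sample URL can be inspected directly. The snippet below only uses the standard library; the variable names are chosen for illustration.

```python
from urllib.parse import urlparse

url = 'http://user:pass@www.google.com:80/?q=path#ref'
parts = urlparse(url)

# The components as reported by the parser itself.
components = {
    'scheme': parts.scheme,
    'netloc': parts.netloc,
    'path': parts.path,
    'query': parts.query,
    'fragment': parts.fragment,
}

# Reassemble the URL from its components, using the same separators
# that show up as terminals in the recovered grammar.
rebuilt = (parts.scheme + '://' + parts.netloc + parts.path
           + '?' + parts.query + '#' + parts.fragment)
```

The scheme, netloc, query, and fragment components correspond to the mined nonterminals of the same names, while the separators ://, ?, and # remain terminals in the url rule.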

Fuzzing¶

We can now use our recovered grammar for fuzzing as follows.

First, the inventory grammar.

In [92]:
f = GrammarFuzzer(inventory_grammar)
for _ in range(10):
    print(f.fuzz())
1999,car,Ford,Cougar
2000,car,Chevy,E350
1999,van,Ford,Venture
1997,car,Mercury,Venture
2000,car,Ford,Cougar
2000,car,Ford,E350
1999,car,Chevy,Cougar
1999,car,Chevy,Cougar
1999,car,Ford,E350
1997,car,Chevy,Cougar

Next, the URL grammar.

In [93]:
f = GrammarFuzzer(url_grammar)
for _ in range(10):
    print(f.fuzz())
https://www.cispa.saarland:80/?q=path#ref
https://user:pass@www.google.com:80/
http://www.fuzzingbook.org/?q=path#News
https://www.fuzzingbook.org/#ref
https://www.cispa.saarland:80/#News
http://www.fuzzingbook.org/
https://www.fuzzingbook.org/?q=path#News
https://user:pass@www.google.com:80/#News
https://www.cispa.saarland:80/
https://user:pass@www.google.com:80/#News

What this means is that we can now take a program and a few samples, extract its grammar, and then use this very grammar for fuzzing. Now that's quite an opportunity!
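
To make the fuzzing step concrete without relying on GrammarFuzzer, here is a minimal sketch of random expansion for a readable grammar. The function expand() is an illustrative stand-in, not the fuzzer used above; it has no depth or cost control, so it is only suitable for non-recursive grammars such as the inventory grammar.

```python
import random
import re

# Assumed convention: nonterminals look like <name>.
RE_NONTERMINAL = re.compile(r'(<[^<> ]+>)')


def expand(grammar, symbol='<start>', rng=random):
    """Expand `symbol` by choosing a random alternative at each step."""
    alternative = rng.choice(grammar[symbol])
    # Splitting on a capturing group keeps the nonterminals in the result.
    parts = RE_NONTERMINAL.split(alternative)
    return ''.join(expand(grammar, part, rng) if part in grammar else part
                   for part in parts)


mined_inventory_grammar = {
    '<start>': ['<vehicle>'],
    '<vehicle>': ['<year>,<kind>,<company>,<model>'],
    '<year>': ['1997', '2000', '1999'],
    '<kind>': ['car', 'van'],
    '<company>': ['Ford', 'Chevy', 'Mercury'],
    '<model>': ['Cougar', 'E350', 'Venture'],
}

rng = random.Random(0)  # fixed seed for repeatable runs
samples = [expand(mined_inventory_grammar, rng=rng) for _ in range(20)]
```

Every generated string recombines observed values, so it has the right shape for process_vehicle() even though most combinations never occurred in the original samples.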

Problems with the Simple Miner¶

One of the problems with our simple grammar miner is the assumption that the values assigned to variables are stable. Unfortunately, that may not hold true in all cases. For example, here is a URL with a slightly different format.

In [94]:
URLS_X = URLS + ['ftp://freebsd.org/releases/5.8']

The grammar generated from this set of samples is not as nice as what we got earlier:

In [95]:
url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])
In [96]:
syntax_diagram(url_grammar)
[Syntax diagrams; alternatives are separated by |]
start: url | scheme :// netloc url
url: /releases/5.8 | scheme :// netloc / | scheme :// netloc /? query # fragment | scheme :// netloc /# fragment
scheme: https | ftp | http
netloc: freebsd.org | user:pass@www.google.com:80 | www.fuzzingbook.org | www.cispa.saarland:80
query: q=path
fragment: ref | News

Clearly, something has gone wrong.

To investigate why the url definition has gone wrong, let us inspect the trace for the URL.

In [97]:
clear_cache()
with Tracer(URLS_X[0]) as tracer:
    urlparse(tracer.my_input)
for i, t in enumerate(tracer.trace):
    if t[0] in {'call', 'line'} and 'parse.py' in str(t[2]) and t[3]:
        print(i, t[2]._t()[1], t[3:])
0 372 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
1 392 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
5 124 ({'arg': ''},)
6 121 ({'arg': ''},)
7 126 ({'arg': ''},)
8 127 ({'arg': ''},)
10 393 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
11 437 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
12 458 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
16 124 ({'arg': ''},)
17 121 ({'arg': ''},)
18 126 ({'arg': ''},)
19 127 ({'arg': ''},)
21 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
22 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},)
23 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},)
24 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\t'},)
25 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},)
26 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},)
27 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\r'},)
28 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
29 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
30 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
31 464 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
32 465 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
33 466 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
34 467 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
35 469 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
36 471 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n'},)
37 472 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},)
38 473 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},)
39 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': ''},)
40 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)
41 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)
42 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)
43 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)
44 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)
45 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)
46 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)
47 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)
48 478 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)
49 480 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)
50 481 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)
51 411 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)
52 412 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)
53 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)
54 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)
55 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)
56 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)
57 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)
58 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)
59 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)
60 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)
61 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)
62 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)
63 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)
64 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)
65 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)
66 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)
68 482 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)
69 483 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)
70 482 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)
71 485 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)
72 486 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)
73 487 ({'url': '/?q=path', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)
74 488 ({'url': '/?q=path', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)
75 489 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)
76 419 ({'netloc': 'user:pass@www.google.com:80'},)
77 420 ({'netloc': 'user:pass@www.google.com:80'},)
78 421 ({'netloc': 'user:pass@www.google.com:80'},)
80 490 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)
84 491 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)
85 492 ({'url': '/', 'scheme': 'http', 'b': '\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)
90 394 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)
91 395 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},)
92 398 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},)
93 399 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)
97 400 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)

Notice how the value of url changes as the parsing progresses? This violates our assumption that the value assigned to a variable is stable. We next look at how this limitation can be removed.
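
Such unstable variables can be detected mechanically. The sketch below takes a few abbreviated snapshots of the local variables from the trace above and reports every variable that holds more than one distinct non-empty value; unstable_variables() is an illustrative helper, not part of the miner.

```python
def unstable_variables(snapshots):
    """Map each variable with several distinct non-empty values to those values."""
    values = {}
    for snapshot in snapshots:
        for name, value in snapshot.items():
            if value == '':
                continue  # ignore initial empty values
            seen = values.setdefault(name, [])
            if value not in seen:
                seen.append(value)
    return {name: seen for name, seen in values.items() if len(seen) > 1}


# Abbreviated from the trace above (loop variables omitted).
NETLOC = 'user:pass@www.google.com:80'
snapshots = [
    {'url': 'http://' + NETLOC + '/?q=path#ref', 'scheme': ''},
    {'url': '//' + NETLOC + '/?q=path#ref', 'scheme': 'http'},
    {'url': '/?q=path#ref', 'scheme': 'http', 'netloc': NETLOC},
    {'url': '/?q=path', 'scheme': 'http', 'netloc': NETLOC,
     'fragment': 'ref'},
    {'url': '/', 'scheme': 'http', 'netloc': NETLOC,
     'query': 'q=path', 'fragment': 'ref'},
]
unstable = unstable_variables(snapshots)
```

In these snapshots, only url is reported: it takes five different values while the input is being consumed, whereas scheme, netloc, query, and fragment each settle on a single value.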

Grammar Miner with Reassignment¶

One way to uniquely identify different variables is to annotate them with line numbers, both when they are defined and when their value changes. Consider the code fragment below.

Tracking variable assignment locations¶

In [98]:
def C(cp_1):
    c_2 = cp_1 + '@2'
    c_3 = c_2 + '@3'
    return c_3
In [99]:
def B(bp_7):
    b_8 = bp_7 + '@8'
    return C(b_8)
In [100]:
def A(ap_12):
    a_13 = ap_12 + '@13'
    a_14 = B(a_13) + '@14'
    a_14 = a_14 + '@15'
    a_13 = a_14 + '@16'
    a_14 = B(a_13) + '@17'
    a_14 = B(a_13) + '@18'

Notice how each variable name carries the line number at which the variable is defined, and how each value is annotated with a suffix (such as '@15') that indicates the line at which it was changed.

Let us run this under the trace.

In [101]:
with Tracer('____') as tracer:
    A(tracer.my_input)

for t in tracer.trace:
    print(t[0], "%d:%s" % (t[2].line_no, t[2].method), t[3])
call 1:A {'ap_12': '____'}
line 2:A {'ap_12': '____'}
line 3:A {'ap_12': '____', 'a_13': '____@13'}
call 1:B {'bp_7': '____@13'}
line 2:B {'bp_7': '____@13'}
line 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}
call 1:C {'cp_1': '____@13@8'}
line 2:C {'cp_1': '____@13@8'}
line 3:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2'}
line 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}
return 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}
return 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}
line 4:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14'}
line 5:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14@15'}
line 6:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15'}
call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}
line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}
line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}
call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}
line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}
line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}
line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}
return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}
return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}
line 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@17'}
call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}
line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}
line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}
call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}
line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}
line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}
line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}
return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}
return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}
return 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@18'}
call 102:__exit__ {}
line 105:__exit__ {}

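Each variable was referenced first as follows. (The trace above numbers lines relative to the start of each function; the list below uses the numbering encoded in the variable names, with C(), B(), and A() starting at lines 1, 7, and 12.)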

  • cp_1 -- call 1:C
  • c_2 -- line 3:C (but the previous event was line 2:C)
  • c_3 -- line 4:C (but the previous event was line 3:C)
  • bp_7 -- call 7:B
  • b_8 -- line 9:B (but the previous event was line 8:B)
  • ap_12 -- call 12:A
  • a_13 -- line 14:A (but the previous event was line 13:A)
  • a_14 -- line 15:A (the previous event was return 9:B. However, the previous event in A() was line 14:A)
  • reassign a_14 at 15 -- line 16:A (the previous event was line 15:A)
  • reassign a_13 at 16 -- line 17:A (the previous event was line 16:A)
  • reassign a_14 at 17 -- return 17:A (the previous event in A() was line 17:A)
  • reassign a_14 at 18 -- return 18:A (the previous event in A() was line 18:A)

So, our observations are as follows. If the event is a call, the current location is the right one for any new variables being defined. If, on the other hand, a variable is referenced for the first time (or reassigned a new value) in a line or return event, the right location is the previous location in the same method invocation. Next, let us see how we can incorporate this information into variable naming.
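
The rule can be condensed into a small helper. The following is only a standalone sketch: `definition_line()` and its arguments are hypothetical names, not part of the classes defined in this chapter. It assumes that `locations` holds the line numbers of all events seen so far in the current method invocation, with the current event last.

```python
def definition_line(event, locations):
    """Return the line to which a newly seen variable should be attributed.

    `locations` lists the line numbers of the events observed so far in the
    current method invocation; the last entry is the current event.
    """
    if event == 'call':
        # Parameters are bound at the call itself.
        return locations[-1]
    if event in ('line', 'return'):
        # The binding was produced by the statement executed just before.
        return locations[-2]
    raise ValueError("unexpected event: %r" % event)


# Events in C() from the trace above: call 1, line 2, line 3, line 4
print(definition_line('call', [1]))           # cp_1 is attributed to line 1
print(definition_line('line', [1, 2, 3]))     # c_2 is attributed to line 2
print(definition_line('line', [1, 2, 3, 4]))  # c_3 is attributed to line 3
```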

First, we need a way to track the individual method calls as they are made. For this, we define the class CallStack. Each method invocation gets a separate identifier; when the method call is over, the identifier of the calling method becomes current again.

CallStack¶

In [102]:
class CallStack:
    def __init__(self, **kwargs):
        self.options(kwargs)
        self.method_id = (START_SYMBOL, 0)
        self.method_register = 0
        self.mstack = [self.method_id]

    def enter(self, method):
        self.method_register += 1
        self.method_id = (method, self.method_register)
        self.log('call', "%s%s" % (self.indent(), str(self)))
        self.mstack.append(self.method_id)

    def leave(self):
        self.mstack.pop()
        self.log('return', "%s%s" % (self.indent(), str(self)))
        self.method_id = self.mstack[-1]

A few extra functions to make life simpler.

In [103]:
class CallStack(CallStack):
    def options(self, kwargs):
        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None

    def indent(self):
        return len(self.mstack) * "\t"

    def at(self, n):
        return self.mstack[n]

    def __len__(self):
        return len(self.mstack) - 1

    def __str__(self):
        return "%s:%d" % self.method_id

    def __repr__(self):
        return repr(self.method_id)

We also define a convenience method to display a given stack.

In [104]:
def display_stack(istack):
    def stack_to_tree(stack):
        current, *rest = stack
        if not rest:
            return (repr(current), [])
        return (repr(current), [stack_to_tree(rest)])
    display_tree(stack_to_tree(istack.mstack), graph_attr=lr_graph)

Here is how we can use the CallStack.

In [105]:
cs = CallStack()
display_stack(cs)
cs
Out[105]:
('<start>', 0)
In [106]:
cs.enter('hello')
display_stack(cs)
cs
Out[106]:
('hello', 1)
In [107]:
cs.enter('world')
display_stack(cs)
cs
Out[107]:
('world', 2)
In [108]:
cs.leave()
display_stack(cs)
cs
Out[108]:
('hello', 1)
In [109]:
cs.enter('world')
display_stack(cs)
cs
Out[109]:
('world', 3)
In [110]:
cs.leave()
display_stack(cs)
cs
Out[110]:
('hello', 1)

In order to account for variable reassignments, we need to have a more intelligent data structure than a dictionary for storing variables. We first define a simple interface Vars. It acts as a container for variables, and is instantiated at my_assignments.

Vars¶

The Vars stores references to variables as they occur during parsing in its internal dictionary defs. We initialize the dictionary with the original string.

In [111]:
class Vars:
    def __init__(self, original):
        self.defs = {}
        self.my_input = original

The dictionary needs two methods: update() that takes a set of key-value pairs to update itself, and _set_kv() that updates a particular key-value pair.

In [112]:
class Vars(Vars):
    def _set_kv(self, k, v):
        self.defs[k] = v

    def __setitem__(self, k, v):
        self._set_kv(k, v)

    def update(self, v):
        for key, value in v.items():
            self._set_kv(key, value)

The Vars is a proxy for the internal dictionary. For example, here is how one can use it.

In [113]:
v = Vars('')
v.defs
Out[113]:
{}
In [114]:
v['x'] = 'X'
v.defs
Out[114]:
{'x': 'X'}
In [115]:
v.update({'x': 'x', 'y': 'y'})
v.defs
Out[115]:
{'x': 'x', 'y': 'y'}

AssignmentVars¶

We now extend the simple Vars to account for variable reassignments. For this, we define AssignmentVars.

The idea for detecting reassignments and renaming variables is as follows: We keep track of the previous reassignments to particular variables using accessed_seq_var. It contains the last rename of any particular variable as its corresponding value. The new_vars contains a list of all new variables that were added on this iteration.

In [116]:
class AssignmentVars(Vars):
    def __init__(self, original):
        super().__init__(original)
        self.accessed_seq_var = {}
        self.var_def_lines = {}
        self.current_event = None
        self.new_vars = set()
        self.method_init()

The method_init() method takes care of keeping track of method invocations using records saved in the call_stack. event_locations is for keeping track of the locations accessed within this method. This is used for line number tracking of variable definitions.

In [117]:
class AssignmentVars(AssignmentVars):
    def method_init(self):
        self.call_stack = CallStack()
        self.event_locations = {self.call_stack.method_id: []}

The update() is now modified to track the changed line numbers if any, using var_location_register(). We reinitialize the new_vars after use for the next event.

In [118]:
class AssignmentVars(AssignmentVars):
    def update(self, v):
        for key, value in v.items():
            self._set_kv(key, value)
        self.var_location_register(self.new_vars)
        self.new_vars = set()

The variable name now incorporates an index of how many reassignments it has gone through, effectively making each reassignment a unique variable.

In [119]:
class AssignmentVars(AssignmentVars):
    def var_name(self, var):
        return (var, self.accessed_seq_var[var])

While storing variables, we need to first check whether it was previously known. If it is not, we need to initialize the rename count. This is accomplished by var_access.

In [120]:
class AssignmentVars(AssignmentVars):
    def var_access(self, var):
        if var not in self.accessed_seq_var:
            self.accessed_seq_var[var] = 0
        return self.var_name(var)

During a variable reassignment, we update the accessed_seq_var to reflect the new count.

In [121]:
class AssignmentVars(AssignmentVars):
    def var_assign(self, var):
        self.accessed_seq_var[var] += 1
        self.new_vars.add(self.var_name(var))
        return self.var_name(var)

These methods can be used as follows:

In [122]:
sav = AssignmentVars('')
sav.defs
Out[122]:
{}
In [123]:
sav.var_access('v1')
Out[123]:
('v1', 0)
In [124]:
sav.var_assign('v1')
Out[124]:
('v1', 1)

Assigning to it again increments the counter.

In [125]:
sav.var_assign('v1')
Out[125]:
('v1', 2)

The core of the logic is in _set_kv(). When a variable is being assigned, we get the sequenced variable name s_var. If the sequenced variable name was previously unknown in defs, then we have no further concerns. We add the sequenced variable to defs.

If the variable is previously known, this may indicate a reassignment. In this case, we compare the new value with the last value recorded for the variable. If the value has not changed, it is not a reassignment, and we are done.

If the value has changed, it is a reassignment. We first increment the variable's sequence number using var_assign(), retrieve the new name, and store the value under the new name in defs.

In [126]:
class AssignmentVars(AssignmentVars):
    def _set_kv(self, var, val):
        s_var = self.var_access(var)
        if s_var in self.defs and self.defs[s_var] == val:
            return
        self.defs[self.var_assign(var)] = val

Here is how it can be used. Assigning a variable the first time initializes its counter.

In [127]:
sav = AssignmentVars('')
sav['x'] = 'X'
sav.defs
Out[127]:
{('x', 1): 'X'}

If the variable is assigned again with the same value, it is probably not a reassignment.

In [128]:
sav['x'] = 'X'
sav.defs
Out[128]:
{('x', 1): 'X'}

However, if the value changed, it is a reassignment.

In [129]:
sav['x'] = 'Y'
sav.defs
Out[129]:
{('x', 1): 'X', ('x', 2): 'Y'}

There is a subtlety here. It is possible for a child method to be called from the middle of a parent method, and for both to use the same variable name with different values. In this case, when the child returns, the parent still has the old variable with its old value in context. Our implementation treats this as a reassignment. This is acceptable, because recording an extra reassignment is harmless, whereas missing one is not. We will discuss later how this can be avoided.
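
To make the subtlety concrete, here is a minimal standalone sketch. The class `SeqVars` is a hypothetical stand-in that only mimics the renaming logic of `_set_kv()`; it has no notion of methods or line numbers. We feed it the values a tracer would see if a parent and its child both used a variable named `s`.

```python
class SeqVars:
    """Hypothetical stand-in for the renaming logic (illustration only)."""

    def __init__(self):
        self.seq = {}   # variable name -> number of recorded assignments
        self.defs = {}  # (name, sequence number) -> value

    def __setitem__(self, var, val):
        count = self.seq.get(var, 0)
        if count and self.defs[(var, count)] == val:
            return  # same value as last recorded: not a reassignment
        self.seq[var] = count + 1
        self.defs[(var, count + 1)] = val


sv = SeqVars()
sv['s'] = 'abc'  # the parent binds s
sv['s'] = 'bc'   # the child, called from the parent, binds its own s
sv['s'] = 'abc'  # back in the parent: the old value is seen again
print(sv.defs)
```

The third assignment is recorded as `('s', 3)` even though the parent never reassigned `s`.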

We also define the bookkeeping methods register_event(), method_enter(), and method_exit(), which are responsible for keeping track of the method stack. The basic idea is that each method_enter() represents a new method invocation. Hence, it merits a new method id, which is generated from the method_register and saved in method_id. Since this is a new method, the method stack is extended by one element with this id. In the case of method_exit(), we pop the method stack and reset the current method_id to the one below it.

In [130]:
class AssignmentVars(AssignmentVars):
    def method_enter(self, cxt, my_vars):
        self.current_event = 'call'
        self.call_stack.enter(cxt.method)
        self.event_locations[self.call_stack.method_id] = []
        self.register_event(cxt)
        self.update(my_vars)

    def method_exit(self, cxt, my_vars):
        self.current_event = 'return'
        self.register_event(cxt)
        self.update(my_vars)
        self.call_stack.leave()

    def method_statement(self, cxt, my_vars):
        self.current_event = 'line'
        self.register_event(cxt)
        self.update(my_vars)

For each of the method events, we also register the event using register_event() which keeps track of the line numbers that were referenced in this method.

In [131]:
class AssignmentVars(AssignmentVars):
    def register_event(self, cxt):
        self.event_locations[self.call_stack.method_id].append(cxt.line_no)

The var_location_register() keeps the locations of newly added variables. The definition location of variables in a call is the current location. For a line or return event, however, it is the location of the previous event in the current method invocation.

In [132]:
class AssignmentVars(AssignmentVars):
    def var_location_register(self, my_vars):
        def loc(mid):
            if self.current_event == 'call':
                return self.event_locations[mid][-1]
            elif self.current_event == 'line':
                return self.event_locations[mid][-2]
            elif self.current_event == 'return':
                return self.event_locations[mid][-2]
            else:
                assert False

        my_loc = loc(self.call_stack.method_id)
        for var in my_vars:
            self.var_def_lines[var] = my_loc

We define defined_vars() which returns the names of variables annotated with the line numbers as below.

In [133]:
class AssignmentVars(AssignmentVars):
    def defined_vars(self, formatted=True):
        def fmt(k):
            v = (k[0], self.var_def_lines[k])
            return "%s@%s" % v if formatted else v

        return [(fmt(k), v) for k, v in self.defs.items()]

Similar to defined_vars(), we define seq_vars(), which annotates each variable with both its definition line and its reassignment sequence number.

In [134]:
class AssignmentVars(AssignmentVars):
    def seq_vars(self, formatted=True):
        def fmt(k):
            v = (k[0], self.var_def_lines[k], k[1])
            return "%s@%s:%s" % v if formatted else v

        return {fmt(k): v for k, v in self.defs.items()}

AssignmentTracker¶

The AssignmentTracker keeps the assignment definitions using the AssignmentVars we defined previously.

In [135]:
class AssignmentTracker(DefineTracker):
    def __init__(self, my_input, trace, **kwargs):
        self.options(kwargs)
        self.my_input = my_input

        self.my_assignments = self.create_assignments(my_input)

        self.trace = trace
        self.process()

    def create_assignments(self, *args):
        return AssignmentVars(*args)

To fine-tune the process, we define an optional parameter called track_return. During tracing a method return, Python produces a virtual variable that contains the result of the returned value. If the track_return is set, we capture this value as a variable.

  • track_return -- if true, add a virtual variable to the Vars representing the return value
In [136]:
class AssignmentTracker(AssignmentTracker):
    def options(self, kwargs):
        self.track_return = kwargs.get('track_return', False)
        super().options(kwargs)
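
To see where this return value comes from: with Python's `sys.settrace()`, a `return` event passes the returned value to the trace function as its `arg` argument. The following standalone sketch shows this; `collect_returns()` and `double()` are hypothetical helpers that do not use the Tracer class from this chapter.

```python
import sys


def collect_returns(func, *args):
    """Run func(*args) and collect (function name, return value) pairs."""
    returns = []

    def tracer(frame, event, arg):
        if event == 'return':
            # For 'return' events, `arg` is the value being returned.
            returns.append((frame.f_code.co_name, arg))
        return tracer

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(old)  # always restore the previous trace function
    return returns


def double(s):
    return s + s


print(collect_returns(double, 'ab'))
```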

Different kinds of events can occur during a trace: call when a function is entered, return when the function returns, exception when an exception is thrown, and line when a statement is executed.

The previous Tracker was too simplistic in that it did not distinguish between the different events. We rectify that and define on_call(), on_return(), and on_line() respectively, which get called on their corresponding events.

Note that on_return() also calls on_line(). The reason is that Python invokes the trace function before the corresponding line is executed. Hence, on_return() is effectively called with the bindings produced by the previous statement, and our processing operates on those values. Calling on_line() here is therefore appropriate, as it gives the event handler a chance to work on the previous binding.

In [137]:
class AssignmentTracker(AssignmentTracker):
    def on_call(self, arg, cxt, my_vars):
        my_vars = cxt.parameters(my_vars)
        self.my_assignments.method_enter(cxt, self.fragments(my_vars))

    def on_line(self, arg, cxt, my_vars):
        self.my_assignments.method_statement(cxt, self.fragments(my_vars))

    def on_return(self, arg, cxt, my_vars):
        self.on_line(arg, cxt, my_vars)
        my_vars = {'<-%s' % cxt.method: arg} if self.track_return else {}
        self.my_assignments.method_exit(cxt, my_vars)

    def on_exception(self, arg, cxt, my_vars):
        return

    def track_event(self, event, arg, cxt, my_vars):
        self.current_event = event
        dispatch = {
            'call': self.on_call,
            'return': self.on_return,
            'line': self.on_line,
            'exception': self.on_exception
        }
        dispatch[event](arg, cxt, my_vars)
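
The handlers above rely on the trace function being invoked before a line is executed. The following standalone sketch illustrates this timing by recording which local names are bound at each line and return event. `trace_locals()` and `split_first()` are hypothetical helpers using `sys.settrace()` directly rather than the Tracer class.

```python
import sys


def trace_locals(func, *args):
    """Run func(*args); record (event, sorted local names) for its events."""
    seen = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is code and event in ('line', 'return'):
            # Snapshot the names bound at the moment the event fires.
            seen.append((event, sorted(frame.f_locals)))
        return tracer

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(old)  # always restore the previous trace function
    return seen


def split_first(s):
    head = s[:1]
    tail = s[1:]
    return head + tail


for event, names in trace_locals(split_first, 'abc'):
    print(event, names)
```

When the line event for `head = s[:1]` fires, `head` is not yet bound; it shows up only at the following event. The return event sees the same bindings as the last line event.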

We can now use AssignmentTracker to track the different variables. To verify that our variable line number inference works, we recover definitions from the functions A(), B() and C() (with data annotations removed so that the input fragments are correctly identified).

In [138]:
def C(cp_1):  # type: ignore
    c_2 = cp_1
    c_3 = c_2
    return c_3
In [139]:
def B(bp_7):  # type: ignore
    b_8 = bp_7
    return C(b_8)
In [140]:
def A(ap_12):  # type: ignore
    a_13 = ap_12
    a_14 = B(a_13)
    a_14 = a_14
    a_13 = a_14
    a_14 = B(a_13)
    a_14 = B(a_14)[3:]

Let us run A() with a sufficiently long input.

In [141]:
with Tracer('---xxx') as tracer:
    A(tracer.my_input)
tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
for k, v in tracker.my_assignments.seq_vars().items():
    print(k, '=', repr(v))
print()
for k, v in tracker.my_assignments.defined_vars(formatted=True):
    print(k, '=', repr(v))
ap_12@1:1 = '---xxx'
a_13@2:1 = '---xxx'
bp_7@1:1 = '---xxx'
b_8@2:1 = '---xxx'
cp_1@1:1 = '---xxx'
c_2@2:1 = '---xxx'
c_3@3:1 = '---xxx'
a_14@3:1 = '---xxx'
a_14@7:2 = 'xxx'

ap_12@1 = '---xxx'
a_13@2 = '---xxx'
bp_7@1 = '---xxx'
b_8@2 = '---xxx'
cp_1@1 = '---xxx'
c_2@2 = '---xxx'
c_3@3 = '---xxx'
a_14@3 = '---xxx'
a_14@7 = 'xxx'

As can be seen, the line numbers are now correctly identified for each variable.

Let us try retrieving the assignments for a real world example.

In [142]:
traces = []
for inputstr in URLS_X:
    clear_cache()
    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:
        urlparse(tracer.my_input)
    traces.append((tracer.my_input, tracer.trace))

    tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)
    for k, v in tracker.my_assignments.defined_vars():
        print(k, '=', repr(v))
    print()
url@372 = 'http://user:pass@www.google.com:80/?q=path#ref'
url@478 = '//user:pass@www.google.com:80/?q=path#ref'
scheme@478 = 'http'
url@481 = '/?q=path#ref'
netloc@481 = 'user:pass@www.google.com:80'
url@486 = '/?q=path'
fragment@486 = 'ref'
query@488 = 'q=path'
url@393 = 'http://user:pass@www.google.com:80/?q=path#ref'

url@372 = 'https://www.cispa.saarland:80/'
url@478 = '//www.cispa.saarland:80/'
scheme@478 = 'https'
netloc@481 = 'www.cispa.saarland:80'
url@393 = 'https://www.cispa.saarland:80/'

url@372 = 'http://www.fuzzingbook.org/#News'
url@478 = '//www.fuzzingbook.org/#News'
scheme@478 = 'http'
url@481 = '/#News'
netloc@481 = 'www.fuzzingbook.org'
fragment@486 = 'News'
url@393 = 'http://www.fuzzingbook.org/#News'

url@372 = 'ftp://freebsd.org/releases/5.8'
url@478 = '//freebsd.org/releases/5.8'
scheme@478 = 'ftp'
url@481 = '/releases/5.8'
netloc@481 = 'freebsd.org'
url@393 = 'ftp://freebsd.org/releases/5.8'
url@394 = '/releases/5.8'

The line numbers of variables can be verified from the source code of urllib/parse.py.
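
One way to do this check without opening the file is sketched below. `numbered_source()` is a hypothetical helper built on `inspect.getsourcelines()`; we assume here that the assignments of interest are in `urlsplit()`. The line numbers depend on the installed Python version, so they may differ from those shown above.

```python
import inspect
import urllib.parse


def numbered_source(func):
    """Return (line number, source line) pairs for the source of func."""
    lines, start = inspect.getsourcelines(func)
    return [(start + offset, line.rstrip())
            for offset, line in enumerate(lines)]


# Show the lines of urlsplit() that mention netloc, with their line numbers.
for line_no, text in numbered_source(urllib.parse.urlsplit):
    if 'netloc' in text:
        print(line_no, text)
```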

Recovering a Derivation Tree¶

Does handling variable reassignments help with our URL examples? We look at these next.

In [143]:
class TreeMiner(TreeMiner):
    def get_derivation_tree(self):
        tree = (START_SYMBOL, [(self.my_input, [])])
        for var, value in self.my_assignments:
            self.log(0, "%s=%s" % (var, repr(value)))
            self.apply_new_definition(tree, var, value)
        return tree
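
To give an intuition for what happens to the tree in each iteration, here is a deliberately simplified standalone sketch. `insert_definition()` is a hypothetical helper and not the `apply_new_definition()` used above: it merely splits the first leaf that contains the value into a prefix, a new nonterminal node, and a suffix. Trees are (symbol, children) pairs as elsewhere in this book.

```python
def insert_definition(tree, var, value):
    """Split the first leaf containing `value` into prefix, <var>, suffix.

    Modifies the tree in place; returns True if a leaf was split.
    """
    if not value:
        return False
    _symbol, children = tree
    for i, (text, grandchildren) in enumerate(children):
        if grandchildren:
            # Nonterminal: search its subtree.
            if insert_definition(children[i], var, value):
                return True
        elif value in text:
            start = text.index(value)
            prefix, suffix = text[:start], text[start + len(value):]
            replacement = [('<%s>' % var, [(value, [])])]
            if prefix:
                replacement.insert(0, (prefix, []))
            if suffix:
                replacement.append((suffix, []))
            children[i:i + 1] = replacement
            return True
    return False


def tree_to_string(tree):
    """Concatenate all leaves of the tree."""
    symbol, children = tree
    if not children:
        return symbol
    return ''.join(tree_to_string(child) for child in children)


my_input = 'http://host/#ref'
tree = ('<start>', [(my_input, [])])
for var, value in [('url', my_input), ('scheme', 'http'),
                   ('netloc', 'host'), ('fragment', 'ref')]:
    insert_definition(tree, var, value)
print(tree)
```

Each definition refines the tree without changing the string it derives.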

Example 1: Recovering URL Derivation Tree¶

First, we obtain the derivation tree of URL 1.

URL 1 derivation tree¶
In [144]:
clear_cache()
with Tracer(URLS_X[0], files=['urllib/parse.py']) as tracer:
    urlparse(tracer.my_input)
sm = AssignmentTracker(tracer.my_input, tracer.trace)
dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())
display_tree(dt.tree)
Out[144]:
<start>
  <url@372>
    <scheme@478>
      http
    :
    <url@478>
      //
      <netloc@481>
        user:pass@www.google.com:80
      <url@481>
        <url@486>
          /?
          <query@488>
            q=path
        #
        <fragment@486>
          ref