Symbolic Fuzzing

Symbolic Fuzzing#

One of the problems with traditional methods of fuzzing is that they fail to exercise all the possible behaviors that a system can have, especially when the input space is large. Quite often the execution of a specific branch of execution may happen only with very specific inputs, which could represent a minimal fraction of the input space. The traditional fuzzing methods relies on chance to produce inputs they need. However, relying on randomness to generate values that we want is a bad idea when the space to be explored is huge. For example, a function that accepts a string, even if one only considers the first \(10\) characters, already has \(2^{80}\) possible inputs. If one is looking for a specific string, random generation of values will take a few thousand years even in one of the super computers.

In the chapter on concolic testing, we have seen how concolic tracing can offer a way out. We saw how concolic tracing can be implemented using direct information flows using the Python interpreter. However, there are two problems with this approach.

The first is that concolic tracing relies on the existence of sample inputs. What if one has no sample inputs?
Second, direct information flows could be unreliable if the program has indirect information flows such as those based on control flow.

In both cases, static code analysis can bridge the gap. However, that raises the question: Can we determine the complete behavior of the program by examining it statically, and check if it behaves unexpectedly under some (unknown) input or result in an unexpected output?

Symbolic execution is one of the ways that we can reason about the behavior of a program without executing it. A program is a computation that can be treated as a system of equations that obtains the output values from the given inputs. Executing the program symbolically – that is, solving these mathematically – along with any specified objective such as covering a particular branch or obtaining a particular output will get us inputs that can accomplish this task.

In this chapter, we investigate how symbolic execution can be implemented, and how it can be used to obtain interesting values for fuzzing.

from bookutils import YouTubeVideo
YouTubeVideo('AJfRBF8NEWU')

Prerequisites

You should understand how to use type annotations in Python.
A working knowledge of SMT solvers, especially Z3 is useful.
You should have read the chapter on coverage.
A familiarity with chapter on concolic fuzzing would be helpful.

# ignore
import sys

# ignore
if sys.version_info >= (3, 12):
    print("This code requires Python 3.11 or earlier")
    sys.exit(0)

Synopsis#

To use the code provided in this chapter, write

>>> from fuzzingbook.SymbolicFuzzer import <identifier>

and then make use of the following features.

This chapter provides an implementation of a symbolic fuzzing engine SymbolicFuzzer. The fuzzer uses symbolic execution to exhaustively explore paths in the program to a limited depth, and generate inputs that will reach these paths.

As an example, consider the function gcd(), computing the greatest common divisor of a and b:

def gcd(a: int, b: int) -> int:
    if a < b:
        c: int = a  # type: ignore
        a = b
        b = c

    while b != 0:
        c: int = a  # type: ignore
        a = b
        b = c % b

    return a

To explore gcd(), the fuzzer can be used as follows, producing values for arguments that cover different paths in gcd() (including multiple times of loop iterations):

>>> gcd_fuzzer = SymbolicFuzzer(gcd, max_tries=10, max_iter=10, max_depth=10)
>>> for i in range(10):
>>>     args = gcd_fuzzer.fuzz()
>>>     print(args)
{'a': 8, 'b': 3}
{'a': 1, 'b': 2}
{'a': 2, 'b': 5}
{'a': 7, 'b': 6}
{'a': 9, 'b': 10}
{'a': 4, 'b': 4}
{'a': 10, 'b': 9}
{'a': 2, 'b': 10}
{'a': 14, 'b': 7}
{'a': 3, 'b': 2}

Note that the variable values returned by fuzz() are Z3 symbolic values; to convert them to Python numbers, use their method as_long():

>>> for i in range(10):
>>>     args = gcd_fuzzer.fuzz()
>>>     a = args['a'].as_long()
>>>     b = args['b'].as_long()
>>>     d = gcd(a, b)
>>>     print(f"gcd({a}, {b}) = {d}")
gcd(0, 8) = 8
gcd(-1, 10) = -1
gcd(13, 2) = 1
gcd(0, 10) = 10
gcd(6, 7) = 1
gcd(14, 2) = 2
gcd(-1, 11) = -1
gcd(15, 0) = 15
gcd(0, -1) = -1
gcd(-3, -2) = -1

The symbolic fuzzer is subject to a number of constraints. First, it requires that the function to be fuzzed has correct type annotations, including all local variables. Second, it solves loops by unrolling them, but only for a fixed amount.

For programs without loops and variable reassignments, the SimpleSymbolicFuzzer is a faster, but more limited alternative.

Obtaining Path Conditions for Coverage#

In the chapter on parsing and recombining inputs, we saw how difficult it was to generate inputs for process_vehicle() – a simple function that accepts a string. The solution given there was to rely on preexisting sample inputs. However, this solution is inadequate as it assumes the existence of sample inputs. What if there are no sample inputs at hand?

For a simpler example, let us consider the following triangle function (which we already have seen in the chapter on concolic fuzzing). Can we generate inputs to cover all the paths?

Note. We use type annotations to denote the argument types of programs. The chapter on discovering dynamic invariants will discuss how these types can be inferred automatically.

def check_triangle(a: int, b: int, c: int) -> str:
    if a == b:
        if a == c:
            if b == c:
                return "Equilateral"
            else:
                return "Isosceles"
        else:
            return "Isosceles"
    else:
        if b != c:
            if a == c:
                return "Isosceles"
            else:
                return "Scalene"
        else:
            return "Isosceles"

The Control Flow Graph#

The control flow graph of this function can be represented as follows:

import bookutils.setup

import inspect

from ControlFlow import PyCFG, to_graph, gen_cfg

def show_cfg(fn, **kwargs):
    return to_graph(gen_cfg(inspect.getsource(fn)), **kwargs)

show_cfg(check_triangle)

_images/706ffc25addafbde6efb0647c287721ac6ed3b261e4b3dda6ea309275ed68706.svg

The possible execution paths traced by the program can be represented as follows, with the numbers indicating the specific line numbers executed.

paths = {
    '<path 1>': ([1, 2, 3, 4, 5], 'Equilateral'),
    '<path 2>': ([1, 2, 3, 4, 7], 'Isosceles'),
    '<path 3>': ([1, 2, 3, 9], 'Isosceles'),
    '<path 4>': ([1, 2, 11, 12, 13], 'Isosceles'),
    '<path 5>': ([1, 2, 11, 12, 15], 'Scalene'),
    '<path 6>': ([1, 2, 11, 17], 'Isosceles'),
}

Consider the <path 1>. To trace this path, we need to execute the following statements in order.

check_triangle(a, b, c)
if (a == b) -> True
if (a == c) -> True
if (b == c) -> True
return 'Equilateral'

That is, any execution that traces this path has to start with values for a, b, and c that obeys the constraints in line numbers 2: (a == b) evaluates to True, 3: (a == c) evaluates to True, and 4: (b == c) evaluates to True. Can we generate inputs such that these constraints are satisfied?

We have seen from the chapter on concolic fuzzing how one can use an SMT solver such as Z3 to obtain a solution.

import z3  # type: ignore

z3_ver = z3.get_version()
print(z3_ver)

(4, 11, 2, 0)

assert z3_ver >= (4, 8, 6, 0), "Please check z3 version"

What kind of symbolic variables do we need? We can obtain that information from the type annotations of the function.

def get_annotations(fn):
    sig = inspect.signature(fn)
    return ([(i.name, i.annotation)
             for i in sig.parameters.values()], sig.return_annotation)

params, ret = get_annotations(check_triangle)
params, ret

([('a', int), ('b', int), ('c', int)], str)

We create symbolic variables to represent each of the parameters

SYM_VARS = {
    int: (
        z3.Int, z3.IntVal), float: (
            z3.Real, z3.RealVal), str: (
                z3.String, z3.StringVal)}

def get_symbolicparams(fn):
    params, ret = get_annotations(fn)
    return [SYM_VARS[typ][0](name)
            for name, typ in params], SYM_VARS[ret][0]('__return__')

(a, b, c), r = get_symbolicparams(check_triangle)
a, b, c, r

(a, b, c, __return__)

We can now ask z3 to solve the set of equations for us as follows.

z3.solve(a == b, a == c, b == c)

[a = 0, b = 0, c = 0]

Here we find the first problem in our program. Our program seems to not check whether the sides are greater than zero. (Real-world triangles all have sides with a positive length.) Assume for now that we do not have that restriction. Does our program correctly follow the path described?

We can use the ArcCoverage from the chapter on concolic fuzzing as a tracer to visualize that information as below.

from ConcolicFuzzer import ArcCoverage  # minor dependency

First, we recover the trace.

with ArcCoverage() as cov:
    assert check_triangle(0, 0, 0) == 'Equilateral'
cov._trace, cov.arcs()

([('check_triangle', 1),
  ('check_triangle', 2),
  ('check_triangle', 3),
  ('check_triangle', 4),
  ('check_triangle', 5),
  ('__exit__', 102),
  ('__exit__', 105)],
 [(1, 2), (2, 3), (3, 4), (4, 5), (5, 102), (102, 105)])

We can now determine the path taken.

The CFG with Path Taken#

show_cfg(check_triangle, arcs=cov.arcs())

_images/78e6a6fac9aebaf7faaf495434e4c4195022a9c5a3f490861dfaefd1b9966538.svg

As you can see, the path taken is <path 1>.

Similarly, for solving <path 2> we need to simply invert the condition at <line 2>:

z3.solve(a == b, a == c, z3.Not(b == c))

no solution

The symbolic execution suggests that there is no solution. A moment’s reflection will convince us that it is indeed true. Let us proceed with the other paths. The <path 3> can be obtained by inverting the condition at <line 4>.

z3.solve(a == b, z3.Not(a == c))

[b = 1, c = 0, a = 1]

with ArcCoverage() as cov:
    assert check_triangle(1, 1, 0) == 'Isosceles'
[i for fn, i in cov._trace if fn == 'check_triangle']

[1, 2, 3, 9]

paths['<path 3>']

([1, 2, 3, 9], 'Isosceles')

How about path <4>?

z3.solve(z3.Not(a == b), b != c, a == c)

[b = 0, c = 1, a = 1]

As we mentioned earlier, our program does not account for sides with zero or negative length. We can modify our program to check for zero and negative input. However, do we always have to make sure that every function has to account for all possible inputs? It is possible that the check_triangle is not directly exposed to the user, and it is called from another function that already guarantees that the inputs would be positive. In the chapter on dynamic invariants, we will show how to discover such preconditions and post conditions.

We can easily add such a precondition here.

pre_condition = z3.And(a > 0, b > 0, c > 0)

z3.solve(pre_condition, z3.Not(a == b), b != c, a == c)

[c = 2, b = 1, a = 2]

with ArcCoverage() as cov:
    assert check_triangle(1, 2, 1) == 'Isosceles'
[i for fn, i in cov._trace if fn == 'check_triangle']

[1, 2, 11, 12, 13]

paths['<path 4>']

([1, 2, 11, 12, 13], 'Isosceles')

Continuing to path <5>:

z3.solve(pre_condition, z3.Not(a == b), b != c, z3.Not(a == c))

[a = 1, c = 3, b = 2]

And indeed it is a Scalene triangle.

with ArcCoverage() as cov:
    assert check_triangle(3, 1, 2) == 'Scalene'

paths['<path 5>']

([1, 2, 11, 12, 15], 'Scalene')

Finally, for <path 6> the procedure is similar.

z3.solve(pre_condition, z3.Not(a == b), z3.Not(b != c))

[c = 2, a = 1, b = 2]

with ArcCoverage() as cov:
    assert check_triangle(2, 1, 1) == 'Isosceles'
[i for fn, i in cov._trace if fn == 'check_triangle']

[1, 2, 11, 17]

paths['<path 6>']

([1, 2, 11, 17], 'Isosceles')

What if we wanted another solution? We can simply ask the solver to solve again, and not give us the same values.

seen = [z3.And(a == 2, b == 1, c == 1)]

z3.solve(pre_condition, z3.Not(z3.Or(seen)), z3.Not(a == b), z3.Not(b != c))

[c = 2, a = 1, b = 2]

seen.append(z3.And(a == 1, b == 2, c == 2))

z3.solve(pre_condition, z3.Not(z3.Or(seen)), z3.Not(a == b), z3.Not(b != c))

[c = 1, a = 3, b = 1]

That is, using simple symbolic computation, we were able to easily see that (1) some paths are not reachable, and (2) some conditions were insufficient – we needed preconditions. What about the total coverage obtained?

Visualizing the Coverage#

Visualizing the statement coverage can be accomplished as below.

class VisualizedArcCoverage(ArcCoverage):
    def show_coverage(self, fn):
        src = fn if isinstance(fn, str) else inspect.getsource(fn)
        covered = set([lineno for method, lineno in self._trace])
        for i, s in enumerate(src.split('\n')):
            print('%s %2d: %s' % ('#' if i + 1 in covered else ' ', i + 1, s))

We run all the inputs obtained under the coverage tracer.

with VisualizedArcCoverage() as cov:
    assert check_triangle(0, 0, 0) == 'Equilateral'
    assert check_triangle(1, 1, 0) == 'Isosceles'
    assert check_triangle(1, 2, 1) == 'Isosceles'
    assert check_triangle(3, 1, 2) == 'Scalene'
    assert check_triangle(2, 1, 1) == 'Isosceles'

cov.show_coverage(check_triangle)

#  1: def check_triangle(a: int, b: int, c: int) -> str:
#  2:     if a == b:
#  3:         if a == c:
#  4:             if b == c:
#  5:                 return "Equilateral"
   6:             else:
   7:                 return "Isosceles"
   8:         else:
#  9:             return "Isosceles"
  10:     else:
# 11:         if b != c:
# 12:             if a == c:
# 13:                 return "Isosceles"
  14:             else:
# 15:                 return "Scalene"
  16:         else:
# 17:             return "Isosceles"
  18: 

The coverage is as expected. The generated values do seem to cover all code that can be covered.

We have seen how to reason about each path through the program. Can we combine them together to produce a single expression that represents the program behavior? This is what we will discuss next.

Function Summaries#

Consider this equation for determining absolute value.

def abs_value(x: float) -> float:
    if x < 0:
        v: float = -x  # type: ignore
    else:
        v: float = x  # type: ignore
    return v

show_cfg(abs_value)

_images/a068195018df547e48a892ef91bc3d54fd1767dde392a22c655981c51eec5071.svg

What can we say about the value of v at line: 5? Let us trace and see. First, we have variable x at line: 1.

(x,), r = get_symbolicparams(abs_value)

At line: 2, we face a bifurcation in the possible paths. Hence, we produce two paths with corresponding constraints.

l2_T = x < 0
l2_F = z3.Not(x < 0)

For line: 3, we only need to consider the If path. However, we have an assignment. So we use a new variable here. The type float is indicated in the source, and its equivalent z3 type is Real.

v_0 = z3.Real('v_0')
l3 = z3.And(l2_T, v_0 == -x)

Similarly, for line: 5, we have an assignment. (Can we reuse the variable v_0 from before?)

v_1 = z3.Real('v_1')
l5 = z3.And(l2_F, v_1 == x)

When we come to line: 6, we see that we have two input streams. We have a choice. We can either keep each path separate as we did previously.

v = z3.Real('v')
for s in [z3.And(l3, v == v_0), z3.And(l5, v == v_1)]:
    z3.solve(x != 0, s)

[x = -1/2, v_0 = 1/2, v = 1/2]
[v_1 = 1, x = 1, v = 1]

Or, we can combine them together and produce a single predicate at line: 6.

v = z3.Real('v')
l6 = z3.Or(z3.And(l3, v == v_0), z3.And(l5, v == v_1))
z3.solve(l6)

[v_1 = 1/2, x = -1/4, v_0 = 1/4, v = 1/4]

Note. Merging two incoming streams of execution can be non-trivial, especially when the execution paths are traversed multiple times (E.g. loops and recursion). For those interested, lookup inferring loop invariants.

We can get this to produce any number of solutions for abs() as below.

s = z3.Solver()
s.add(l6)
for i in range(5):
    if s.check() == z3.sat:
        m = s.model()
        x_val = m[x]
        print(m)
    else:
        print('no solution')
        break
    s.add(z3.Not(x == x_val))
s

[v_1 = 1/2, x = 1/2, v_0 = 0, v = 1/2]
[v_1 = 0, x = 0, v = 0]
[v_1 = 1/4, x = 1/4, v = 1/4]
[v_1 = 1/8, x = 1/8, v = 1/8]
[v_1 = 1/16, x = 1/16, v = 1/16]

[x < 0 ∧ v_0 = -x ∧ v = v_0 ∨ ¬(x < 0) ∧ v_1 = x ∧ v = v_1, ¬(1/2 = x), ¬(0 = x), ¬(1/4 = x), ¬(1/8 = x), ¬(1/16 = x)]

The solver is not particularly random. So we need to help it a bit to produce values on the negative range.

s.add(x < 0)
for i in range(5):
    if s.check() == z3.sat:
        m = s.model()
        x_val = m[x]
        print(m)
    else:
        print('no solution')
        break
    s.add(z3.Not(x == x_val))

[x = -1/32, v_0 = 1/32, v = 1/32]
[x = -33/32, v_0 = 33/32, v = 33/32]
[x = -65/32, v_0 = 65/32, v = 65/32]
[x = -97/32, v_0 = 97/32, v = 97/32]
[x = -129/32, v_0 = 129/32, v = 129/32]

[x < 0 ∧ v_0 = -x ∧ v = v_0 ∨ ¬(x < 0) ∧ v_1 = x ∧ v = v_1, ¬(1/2 = x), ¬(0 = x), ¬(1/4 = x), ¬(1/8 = x), ¬(1/16 = x), x < 0, ¬(-1/32 = x), ¬(-33/32 = x), ¬(-65/32 = x), ¬(-97/32 = x), ¬(-129/32 = x)]

Note that the single expression produced at line: 6 is essentially a summary for abs_value().

abs_value_summary = l6
abs_value_summary

x < 0 ∧ v_0 = -x ∧ v = v_0 ∨ ¬(x < 0) ∧ v_1 = x ∧ v = v_1

The z3 solver can be used to simplify the predicates where possible.

z3.simplify(l6)

¬(0 ≤ x) ∧ v_0 = -1·x ∧ v = v_0 ∨ 0 ≤ x ∧ v_1 = x ∧ v = v_1

One can use this summary rather than trace into abs_value() when abs_value() is used elsewhere. However, that presents us with a problem. It is possible that the same function may be called multiple times. In this case, using the same variables will lead to collision. One way to avoid that is to prefix some call specific value to the variables.

Note: The SMT 2.0 standard allows one to define functions (macros in SMT parlance) directly. For example, the abs-value will be defined as follows:

(define-fun abs-value ((x Int)) Int
       (if (> x 0)
           x
           (* -1 x)))

Or equivalently, (especially if abs-value is recursively defined)

(declare-fun abs-value (Int) Int)
(assert (forall ((x Int))
                (= (abs-value x)
                   (if (> x 0)
                       x
                       (* -1 x)))))

One can then say

(> (abs-value x) (abs-value y))

Unfortunately, the z3py project does not expose this facility in Python. Hence, we have to use the prefix_vars() hack.

import ast

The method prefix_vars() modifies the variables in an expression such that the variables are prefixed with a given value.

def prefix_vars(astnode, prefix):
    if isinstance(astnode, ast.BoolOp):
        return ast.BoolOp(astnode.op,
                          [prefix_vars(i, prefix) for i in astnode.values], [])
    elif isinstance(astnode, ast.BinOp):
        return ast.BinOp(
            prefix_vars(astnode.left, prefix), astnode.op,
            prefix_vars(astnode.right, prefix))
    elif isinstance(astnode, ast.UnaryOp):
        return ast.UnaryOp(astnode.op, prefix_vars(astnode.operand, prefix))
    elif isinstance(astnode, ast.Call):
        return ast.Call(prefix_vars(astnode.func, prefix),
                        [prefix_vars(i, prefix) for i in astnode.args],
                        astnode.keywords)
    elif isinstance(astnode, ast.Compare):
        return ast.Compare(
            prefix_vars(astnode.left, prefix), astnode.ops,
            [prefix_vars(i, prefix) for i in astnode.comparators])
    elif isinstance(astnode, ast.Name):
        if astnode.id in {'And', 'Or', 'Not'}:
            return ast.Name('z3.%s' % (astnode.id), astnode.ctx)
        else:
            return ast.Name('%s%s' % (prefix, astnode.id), astnode.ctx)
    elif isinstance(astnode, ast.Return):
        return ast.Return(prefix_vars(astnode.value, env))
    else:
        return astnode

For applying prefix_vars() one needs the abstract syntax tree (AST) of the Python expression involved. We obtain this by invoking ast.parse():

xy_ast = ast.parse('x+y')

We can visualize the resulting tree as follows:

from bookutils import rich_output

if rich_output():
    # Normally, this will do
    from showast import show_ast
else:
    def show_ast(tree):
        ast.dump(tree, indent=4)

show_ast(xy_ast)

_images/85988f14d711f2d92bdc535263f97871e8fc8207075f87ee2c49cd27b29f5b4f.svg

What the visualization does not show, though, is that when parsing Python source code, the resulting AST comes wrapped in a Module by default:

xy_ast

<ast.Module at 0x116dfa6e0>

And to access the expression (Expr), we need to access the first child of that “module”:

xy_ast.body[0]

<ast.Expr at 0x116dfa4a0>

The actual expression is within that Expr object:

xy_ast.body[0].value  # type: ignore

<ast.BinOp at 0x116dfa2c0>

Hence, for easier manipulation of an expression AST, we define a function get_expression() which unwraps it and returns the AST representation of the expression inside.

def get_expression(src):
    return ast.parse(src).body[0].value

It is used as follows:

e = get_expression('x+y')
e

<ast.BinOp at 0x116df8a00>

The function to_src() allows us to unparse an expression.

def to_src(astnode):
    return ast.unparse(astnode).strip()

It is used as follows:

to_src(e)

'x + y'

We can combine both pieces to produce a prefixed expression. Let us prefix all variables with x1_:

abs_value_summary_ast = get_expression(str(abs_value_summary))
print(to_src(prefix_vars(abs_value_summary_ast, 'x1_')))

z3.Or(z3.And(z3.And(x1_x < 0, x1_v_0 == -x1_x), x1_v == x1_v_0), z3.And(z3.And(z3.Not(x1_x < 0), x1_v_1 == x1_x), x1_v == x1_v_1))

Get Names and Types of Variables Used#

What about the declarations used? Given that we have all equations in Z3, we can retrieve this information directly. We define z3_names_and_types() that takes in a Z3 expression, and extracts the variable definitions required.

def z3_names_and_types(z3_ast):
    hm = {}
    children = z3_ast.children()
    if children:
        for c in children:
            hm.update(z3_names_and_types(c))
    else:
        # HACK.. How else to distinguish literals and vars?
        if (str(z3_ast.decl()) != str(z3_ast.sort())):
            hm["%s" % str(z3_ast.decl())] = 'z3.%s' % str(z3_ast.sort())
        else:
            pass
    return hm

abs_value_declarations = z3_names_and_types(abs_value_summary)
abs_value_declarations

{'x': 'z3.Real', 'v_0': 'z3.Real', 'v': 'z3.Real', 'v_1': 'z3.Real'}

However, z3_names_and_types() is limited in that it requires the Z3 AST to operate. Hence, we also define used_identifiers() that can extract identifiers directly from the string representation of any Python expression, (including Z3 constraints). One trade-off here is that we lose track of the type information. But we will see how to recover that later.

def used_identifiers(src):
    def names(astnode):
        lst = []
        if isinstance(astnode, ast.BoolOp):
            for i in astnode.values:
                lst.extend(names(i))
        elif isinstance(astnode, ast.BinOp):
            lst.extend(names(astnode.left))
            lst.extend(names(astnode.right))
        elif isinstance(astnode, ast.UnaryOp):
            lst.extend(names(astnode.operand))
        elif isinstance(astnode, ast.Call):
            for i in astnode.args:
                lst.extend(names(i))
        elif isinstance(astnode, ast.Compare):
            lst.extend(names(astnode.left))
            for i in astnode.comparators:
                lst.extend(names(i))
        elif isinstance(astnode, ast.Name):
            lst.append(astnode.id)
        elif isinstance(astnode, ast.Expr):
            lst.extend(names(astnode.value))
        elif isinstance(astnode, (ast.Num, ast.Str, ast.Tuple, ast.NameConstant)):
            pass
        elif isinstance(astnode, ast.Assign):
            for t in astnode.targets:
                lst.extend(names(t))
            lst.extend(names(astnode.value))
        elif isinstance(astnode, ast.Module):
            for b in astnode.body:
                lst.extend(names(b))
        else:
            raise Exception(str(astnode))
        return list(set(lst))
    return names(ast.parse(src))

used_identifiers(str(abs_value_summary))

['v_0', 'x', 'v', 'v_1']

We can now register the function summary abs_value for later use.

function_summaries = {}
function_summaries['abs_value'] = {
    'predicate': str(abs_value_summary),
    'vars': abs_value_declarations}

As we mentioned previously, we do not want to rely on Z3 to extract the type information. A better alternative is to let the user specify the type information as annotations, and extract this information from the program. We will see next how this can be achieved.

First, we convert the Python type to Z3 type map to its string equivalent.

SYM_VARS_STR = {
    k.__name__: ("z3.%s" % v1.__name__, "z3.%s" % v2.__name__)
    for k, (v1, v2) in SYM_VARS.items()
}
SYM_VARS_STR

{'int': ('z3.Int', 'z3.IntVal'),
 'float': ('z3.Real', 'z3.RealVal'),
 'str': ('z3.String', 'z3.StringVal')}

We also define a convenience method translate_to_z3_name() for accessing the Z3 type for symbolic variables.

def translate_to_z3_name(v):
    return SYM_VARS_STR[v][0]

We now define the method declarations() that extracts variables used in Python statements. The idea is to look for augmented assignments that contain annotated type information. These are collected and returned.

If there are call nodes, they represent function calls. The used variables in these function calls are recovered from the corresponding function summaries.

def declarations(astnode, hm=None):
    if hm is None:
        hm = {}
    if isinstance(astnode, ast.Module):
        for b in astnode.body:
            declarations(b, hm)
    elif isinstance(astnode, ast.FunctionDef):
        # hm[astnode.name + '__return__'] = \
        # translate_to_z3_name(astnode.returns.id)
        for a in astnode.args.args:
            hm[a.arg] = translate_to_z3_name(a.annotation.id)
        for b in astnode.body:
            declarations(b, hm)
    elif isinstance(astnode, ast.Call):
        # get declarations from the function summary.
        n = astnode.function
        assert isinstance(n, ast.Name)  # for now.
        name = n.id
        hm.update(dict(function_summaries[name]['vars']))
    elif isinstance(astnode, ast.AnnAssign):
        assert isinstance(astnode.target, ast.Name)
        hm[astnode.target.id] = translate_to_z3_name(astnode.annotation.id)
    elif isinstance(astnode, ast.Assign):
        # verify it is already defined
        for t in astnode.targets:
            assert isinstance(t, ast.Name)
            assert t.id in hm
    elif isinstance(astnode, ast.AugAssign):
        assert isinstance(astnode.target, ast.Name)
        assert astnode.target.id in hm
    elif isinstance(astnode, (ast.If, ast.For, ast.While)):
        for b in astnode.body:
            declarations(b, hm)
        for b in astnode.orelse:
            declarations(b, hm)
    elif isinstance(astnode, ast.Return):
        pass
    else:
        raise Exception(str(astnode))
    return hm

With this, we can now extract the variables used in an expression.

declarations(ast.parse('s: int = 3\np: float = 4.0\ns += 1'))

{'s': 'z3.Int', 'p': 'z3.Real'}

We wrap declarations() in the method used_vars() that operates directly on function objects.

def used_vars(fn):
    return declarations(ast.parse(inspect.getsource(fn)))

Here is how it can be used:

used_vars(check_triangle)

{'a': 'z3.Int', 'b': 'z3.Int', 'c': 'z3.Int'}

used_vars(abs_value)

{'x': 'z3.Real', 'v': 'z3.Real'}

Given the extracted variables and their Z3 types, we need a way to re-instantiate them when needed. We define define_symbolic_vars() that translates these descriptions to a form that can be directly exec()ed.

def define_symbolic_vars(fn_vars, prefix):
    sym_var_dec = ', '.join([prefix + n for n in fn_vars])
    sym_var_def = ', '.join(["%s('%s%s')" % (t, prefix, n)
                             for n, t in fn_vars.items()])
    return "%s = %s" % (sym_var_dec, sym_var_def)

Here is how it can be used:

define_symbolic_vars(abs_value_declarations, '')

"x, v_0, v, v_1 = z3.Real('x'), z3.Real('v_0'), z3.Real('v'), z3.Real('v_1')"

We next define gen_fn_summary() that returns a function summary in instantiable form using Z3.

def gen_fn_summary(prefix, fn):
    summary = function_summaries[fn.__name__]['predicate']
    fn_vars = function_summaries[fn.__name__]['vars']
    decl = define_symbolic_vars(fn_vars, prefix)
    summary_ast = get_expression(summary)
    return decl, to_src(prefix_vars(summary_ast, prefix))

Here is how it can be used:

gen_fn_summary('a_', abs_value)

("a_x, a_v_0, a_v, a_v_1 = z3.Real('a_x'), z3.Real('a_v_0'), z3.Real('a_v'), z3.Real('a_v_1')",
 'z3.Or(z3.And(z3.And(a_x < 0, a_v_0 == -a_x), a_v == a_v_0), z3.And(z3.And(z3.Not(a_x < 0), a_v_1 == a_x), a_v == a_v_1))')

gen_fn_summary('b_', abs_value)

("b_x, b_v_0, b_v, b_v_1 = z3.Real('b_x'), z3.Real('b_v_0'), z3.Real('b_v'), z3.Real('b_v_1')",
 'z3.Or(z3.And(z3.And(b_x < 0, b_v_0 == -b_x), b_v == b_v_0), z3.And(z3.And(z3.Not(b_x < 0), b_v_1 == b_x), b_v == b_v_1))')

How do we use our function summaries? Here is a function abs_max() that uses abs_value().

def abs_max(a: float, b: float):
    a1: float = abs_value(a)
    b1: float = abs_value(b)
    if a1 > b1:
        c: float = a1  # type: ignore
    else:
        c: float = b1  # type: ignore
    return c

To trace this function symbolically, we first define the two variables a and b.

a = z3.Real('a')
b = z3.Real('b')

The line: 2 contains definition for a1, which we define as a symbolic variable.

a1 = z3.Real('a1')

We also need to call abs_value(), which is accomplished as follows. Since this is the first call to abs_value(), we use abs1 as the prefix.

d, v = gen_fn_summary('abs1_', abs_value)
d, v

("abs1_x, abs1_v_0, abs1_v, abs1_v_1 = z3.Real('abs1_x'), z3.Real('abs1_v_0'), z3.Real('abs1_v'), z3.Real('abs1_v_1')",
 'z3.Or(z3.And(z3.And(abs1_x < 0, abs1_v_0 == -abs1_x), abs1_v == abs1_v_0), z3.And(z3.And(z3.Not(abs1_x < 0), abs1_v_1 == abs1_x), abs1_v == abs1_v_1))')

We also need to equate the resulting value (<prefix>_v) to the symbolic variable a1 we defined earlier.

l2_src = "l2 = z3.And(a == abs1_x, a1 == abs1_v, %s)" % v
l2_src

'l2 = z3.And(a == abs1_x, a1 == abs1_v, z3.Or(z3.And(z3.And(abs1_x < 0, abs1_v_0 == -abs1_x), abs1_v == abs1_v_0), z3.And(z3.And(z3.Not(abs1_x < 0), abs1_v_1 == abs1_x), abs1_v == abs1_v_1)))'

Applying both declaration and the assignment.

exec(d)
exec(l2_src)

l2  # type: ignore

a = abs1_x ∧ a1 = abs1_v ∧ (abs1_x < 0 ∧ abs1_v_0 = -abs1_x ∧ abs1_v = abs1_v_0 ∨ ¬(abs1_x < 0) ∧ abs1_v_1 = abs1_x ∧ abs1_v = abs1_v_1)

We need to do the same for line: 3, but with abs2 as the prefix.

b1 = z3.Real('b1')
d, v = gen_fn_summary('abs2_', abs_value)
l3_src = "l3_ = z3.And(b == abs2_x, b1 == abs2_v, %s)" % v
exec(d)
exec(l3_src)

l3_  # type: ignore

b = abs2_x ∧ b1 = abs2_v ∧ (abs2_x < 0 ∧ abs2_v_0 = -abs2_x ∧ abs2_v = abs2_v_0 ∨ ¬(abs2_x < 0) ∧ abs2_v_1 = abs2_x ∧ abs2_v = abs2_v_1)

To get the true set of predicates at line: 3, we need to add the predicates from line: 2.

l3 = z3.And(l2, l3_)  # type: ignore

l3

a = abs1_x ∧ a1 = abs1_v ∧ (abs1_x < 0 ∧ abs1_v_0 = -abs1_x ∧ abs1_v = abs1_v_0 ∨ ¬(abs1_x < 0) ∧ abs1_v_1 = abs1_x ∧ abs1_v = abs1_v_1) ∧ b = abs2_x ∧ b1 = abs2_v ∧ (abs2_x < 0 ∧ abs2_v_0 = -abs2_x ∧ abs2_v = abs2_v_0 ∨ ¬(abs2_x < 0) ∧ abs2_v_1 = abs2_x ∧ abs2_v = abs2_v_1)

This equation can be simplified a bit using z3.

z3.simplify(l3)

a = abs1_x ∧ a1 = abs1_v ∧ (¬(0 ≤ abs1_x) ∧ abs1_v_0 = -1·abs1_x ∧ abs1_v = abs1_v_0 ∨ 0 ≤ abs1_x ∧ abs1_v_1 = abs1_x ∧ abs1_v = abs1_v_1) ∧ b = abs2_x ∧ b1 = abs2_v ∧ (¬(0 ≤ abs2_x) ∧ abs2_v_0 = -1·abs2_x ∧ abs2_v = abs2_v_0 ∨ 0 ≤ abs2_x ∧ abs2_v_1 = abs2_x ∧ abs2_v = abs2_v_1)

Coming to line: 4, we have a condition.

l4_cond = a1 > b1
l4 = z3.And(l3, l4_cond)

For line: 5, we define the symbolic variable c_0 assuming we took the IF branch.

c_0 = z3.Real('c_0')
l5 = z3.And(l4, c_0 == a1)

For line: 6, the ELSE branch was taken. So we invert that condition.

l6 = z3.And(l3, z3.Not(l4_cond))

For line: 7, we define c_1.

c_1 = z3.Real('c_1')
l7 = z3.And(l6, c_1 == b1)

s1 = z3.Solver()
s1.add(l5)
s1.check()

sat

m1 = s1.model()
sorted([(d, m1[d]) for d in m1.decls() if not d.name(
).startswith('abs')], key=lambda x: x[0].name())

[(a, 1/2), (a1, 1/2), (b, -1/4), (b1, 1/4), (c_0, 1/2)]

s2 = z3.Solver()
s2.add(l7)
s2.check()

sat

m2 = s2.model()
sorted([(d, m2[d]) for d in m2.decls() if not d.name(
).startswith('abs')], key=lambda x: x[0].name())

[(a, -1/2), (a1, 1/2), (b, 1/2), (b1, 1/2), (c_1, 1/2)]

What we really want to do is to automate this process, because doing this by hand is tedious and error-prone. Essentially, we want the ability to extract all paths in the program, and symbolically execute each path, which will generate the inputs required to cover all reachable portions of the program.

Simple Symbolic Fuzzing#

We define a simple symbolic fuzzer that can generate input values symbolically with the following assumptions:

There are no loops in the program
The function is self-contained.
No recursion.
No reassignments for variables.

The key idea is as follows: We traverse through the control flow graph from the entry point, and generate all possible paths to a given depth. Then we collect constraints that we encountered along the path, and generate inputs that will traverse the program up to that point.

We build our fuzzer based on the class Fuzzer.

from Fuzzer import Fuzzer

We start by extracting the control flow graph of the function passed. We also provide a hook for child classes to do their processing.

class SimpleSymbolicFuzzer(Fuzzer):
    """Simple symbolic fuzzer"""

    def __init__(self, fn, **kwargs):
        """Constructor.
        `fn` is the function to be fuzzed.
        Possible keyword parameters:
        * `max_depth` - the depth to which one should attempt
          to trace the execution (default 100) 
        * `max_tries` - the maximum number of attempts
          we will try to produce a value before giving up (default 100)
        * `max_iter` - the number of iterations we will attempt (default 100).
        """
        self.fn_name = fn.__name__
        py_cfg = PyCFG()
        py_cfg.gen_cfg(inspect.getsource(fn))
        self.fnenter, self.fnexit = py_cfg.functions[self.fn_name]
        self.used_variables = used_vars(fn)
        self.fn_args = list(inspect.signature(fn).parameters)
        self.z3 = z3.Solver()

        self.paths = None
        self.last_path = None

        self.options(kwargs)
        self.process()

    def process(self):
        ...  # to be defined later

We need a few variables to control how much we are willing to traverse.

MAX_DEPTH is the depth to which one should attempt to trace the execution.

MAX_DEPTH = 100

MAX_TRIES is the maximum number of attempts we will try to produce a value before giving up.

MAX_TRIES = 100

MAX_ITER is the number of iterations we will attempt.

MAX_ITER = 100

The options() method sets these parameters in the fuzzing class.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def options(self, kwargs):
        self.max_depth = kwargs.get('max_depth', MAX_DEPTH)
        self.max_tries = kwargs.get('max_tries', MAX_TRIES)
        self.max_iter = kwargs.get('max_iter', MAX_ITER)
        self._options = kwargs

The initialization generates a control flow graph and hooks it to fnenter and fnexit.

symfz_ct = SimpleSymbolicFuzzer(check_triangle)

symfz_ct.fnenter, symfz_ct.fnexit

(id:9 line[1] parents: [] : enter: check_triangle(a, b, c),
 id:10 line[1] parents: [14, 15, 16, 19, 20, 21] : exit: check_triangle(a, b, c))

Generating All Possible Paths#

We can use the procedure get_all_paths() starting from fnenter to recursively retrieve all paths in the function.

The idea is as follows: Start with the function entry point fnenter, and recursively follow the children using the CFG. At any node there is a branching, there would be multiple children. On other nodes there would be only one child. Let us say a node had \(n\) children. Such a node would result in \(n\) paths. We attach the current node to the head of each path, and return all paths thus generated.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def get_all_paths(self, fenter, depth=0):
        if depth > self.max_depth:
            raise Exception('Maximum depth exceeded')
        if not fenter.children:
            return [[(0, fenter)]]

        fnpaths = []
        for idx, child in enumerate(fenter.children):
            child_paths = self.get_all_paths(child, depth + 1)
            for path in child_paths:
                # In a conditional branch, idx is 0 for IF, and 1 for Else
                fnpaths.append([(idx, fenter)] + path)
        return fnpaths

This can be used as follows.

symfz_ct = SimpleSymbolicFuzzer(check_triangle)
all_paths = symfz_ct.get_all_paths(symfz_ct.fnenter)

len(all_paths)

all_paths[1]

[(0, id:24 line[1] parents: [] : enter: check_triangle(a, b, c)),
 (0, id:26 line[2] parents: [24] : _if: a == b),
 (0, id:27 line[3] parents: [26] : _if: a == c),
 (1, id:28 line[4] parents: [27] : _if: b == c),
 (0, id:30 line[7] parents: [28] : return 'Isosceles'),
 (0,
  id:25 line[1] parents: [29, 30, 31, 34, 35, 36] : exit: check_triangle(a, b, c))]

We hook get_all_paths() to initialization as below.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def process(self):
        self.paths = self.get_all_paths(self.fnenter)
        self.last_path = len(self.paths)

Extracting All Constraints#

For any given path, we define a function extract_constraints() to extract the constraints such that they are executable directly with Z3. The idx represents the particular branch that was taken. Hence, if the False branch was taken in a conditional, we attach a negation of the conditional.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def extract_constraints(self, path):
        predicates = []
        for (idx, elt) in path:
            if isinstance(elt.ast_node, ast.AnnAssign):
                if elt.ast_node.target.id in {'_if', '_while'}:
                    s = to_src(elt.ast_node.annotation)
                    predicates.append(("%s" if idx == 0 else "z3.Not(%s)") % s)
                elif isinstance(elt.ast_node.annotation, ast.Call):
                    assert elt.ast_node.annotation.func.id == self.fn_name
                else:
                    node = elt.ast_node
                    t = ast.Compare(node.target, [ast.Eq()], [node.value])
                    predicates.append(to_src(t))
            elif isinstance(elt.ast_node, ast.Assign):
                node = elt.ast_node
                t = ast.Compare(node.targets[0], [ast.Eq()], [node.value])
                predicates.append(to_src(t))
            else:
                pass
        return predicates

symfz_ct = SimpleSymbolicFuzzer(check_triangle)
all_paths = symfz_ct.get_all_paths(symfz_ct.fnenter)
symfz_ct.extract_constraints(all_paths[0])

['a == b', 'a == c', 'b == c']

constraints = symfz_ct.extract_constraints(all_paths[1])
constraints

['a == b', 'a == c', 'z3.Not(b == c)']

Fuzzing with Simple Symbolic Fuzzer#

To actually generate solutions, we define fuzz(). For that, we need to first extract all paths. Then choose a particular path, and extract the constraints in that path, which is then solved using z3.

from contextlib import contextmanager

First we create a checkpoint for our current solver so that we can check a predicate, and rollback if necessary.

@contextmanager
def checkpoint(z3solver):
    z3solver.push()
    yield z3solver
    z3solver.pop()

The use_path() function extracts constraints for a single function, applies it to our current solver (under a checkpoint), and returns the results if some solutions can be found. If solutions were found, we also make sure that we never reuse those solutions.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def solve_path_constraint(self, path):
        # re-initializing does not seem problematic.
        # a = z3.Int('a').get_id() remains the same.
        constraints = self.extract_constraints(path)
        decl = define_symbolic_vars(self.used_variables, '')
        exec(decl)

        solutions = {}
        with checkpoint(self.z3):
            st = 'self.z3.add(%s)' % ', '.join(constraints)
            eval(st)
            if self.z3.check() != z3.sat:
                return {}
            m = self.z3.model()
            solutions = {d.name(): m[d] for d in m.decls()}
            my_args = {k: solutions.get(k, None) for k in self.fn_args}
        predicate = 'z3.And(%s)' % ','.join(
            ["%s == %s" % (k, v) for k, v in my_args.items()])
        eval('self.z3.add(z3.Not(%s))' % predicate)
        return my_args

We define get_path() that retrieves the current path and updates the path used.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def get_next_path(self):
        self.last_path -= 1
        if self.last_path == -1:
            self.last_path = len(self.paths) - 1
        return self.paths[self.last_path]

The fuzz() method simply solves each path in order.

class SimpleSymbolicFuzzer(SimpleSymbolicFuzzer):
    def fuzz(self):
        """Produce one solution for each path.
        Returns a mapping of variable names to (symbolic) Z3 values."""
        for i in range(self.max_tries):
            res = self.solve_path_constraint(self.get_next_path())
            if res:
                return res

        return {}

The fuzzer can be used as follows. Note that we need to convert the symbolic variables returned to Python numbers, using as_long():

a, b, c = None, None, None
symfz_ct = SimpleSymbolicFuzzer(check_triangle)
for i in range(1, 10):
    args = symfz_ct.fuzz()
    res = check_triangle(args['a'].as_long(),
                         args['b'].as_long(),
                         args['c'].as_long())
    print(args, "result:", res)

{'a': 2, 'b': 3, 'c': 3} result: Isosceles
{'a': 6, 'b': 4, 'c': 5} result: Scalene
{'a': 8, 'b': 7, 'c': 8} result: Isosceles
{'a': 9, 'b': 9, 'c': 10} result: Isosceles
{'a': 11, 'b': 11, 'c': 11} result: Equilateral
{'a': 13, 'b': 12, 'c': 12} result: Isosceles
{'a': 16, 'b': 14, 'c': 15} result: Scalene
{'a': 18, 'b': 17, 'c': 18} result: Isosceles
{'a': 19, 'b': 19, 'c': 20} result: Isosceles

For symbolic fractions, we access their numerators and denominators:

symfz_av = SimpleSymbolicFuzzer(abs_value)
for i in range(1, 10):
    args = symfz_av.fuzz()
    abs_res = abs_value(args['x'].numerator_as_long() /
                        args['x'].denominator_as_long())
    print(args, "result:", abs_res)

{'x': 0} result: 0.0
{'x': -1/2} result: 0.5
{'x': 1/2} result: 0.5
{'x': -1/4} result: 0.25
{'x': 1/4} result: 0.25
{'x': -3/8} result: 0.375
{'x': 1/8} result: 0.125
{'x': -3/2} result: 1.5
{'x': 1/16} result: 0.0625

The SimpleSymbolicFuzzer seems to work well for the simple programs we checked above.

Problems with the Simple Fuzzer#

As we mentioned earlier, the SimpleSymbolicFuzzer cannot yet deal with variable reassignments. Further, it also fails to account for any loops. For example, consider the following program.

def gcd(a: int, b: int) -> int:
    if a < b:
        c: int = a  # type: ignore
        a = b
        b = c

    while b != 0:
        c: int = a  # type: ignore
        a = b
        b = c % b

    return a

show_cfg(gcd)

_images/3e9e9a23e905ac9f726d13ee1f2b22aa1cff4f4027d660efc93a063332b43cdf.svg

from ExpectError import ExpectError

with ExpectError():
    symfz_gcd = SimpleSymbolicFuzzer(gcd, max_depth=1000, max_iter=10)
    for i in range(1, 100):
        r = symfz_gcd.fuzz()
        v = gcd(r['a'].as_long(), r['b'].as_long())
        print(r, v)

Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/3731434224.py", line 2, in <module>
    symfz_gcd = SimpleSymbolicFuzzer(gcd, max_depth=1000, max_iter=10)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/2089833100.py", line 26, in __init__
    self.process()
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/4234366425.py", line 3, in process
    self.paths = self.get_all_paths(self.fnenter)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/228300930.py", line 10, in get_all_paths
    child_paths = self.get_all_paths(child, depth + 1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/228300930.py", line 10, in get_all_paths
    child_paths = self.get_all_paths(child, depth + 1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/228300930.py", line 10, in get_all_paths
    child_paths = self.get_all_paths(child, depth + 1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 998 more times]
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/228300930.py", line 4, in get_all_paths
    raise Exception('Maximum depth exceeded')
Exception: Maximum depth exceeded (expected)

The problem here is that our SimpleSymbolicFuzzer has no concept of loops and variable reassignments. We will see how to fix this shortcoming next.

Advanced Symbolic Fuzzing#

We next define SymbolicFuzzer that can deal with reassignments and unrolling of loops.

class SymbolicFuzzer(SimpleSymbolicFuzzer):
    """Symbolic fuzzing with reassignments and loop unrolling"""

    def options(self, kwargs):
        super().options(kwargs)

Once we allow reassignments and loop unrolling, we have to deal with what to call the new variables generated. This is what we will tackle next.

Dealing with Reassignments#

We want to rename all variables present in an expression such that the variables are annotated with their usage count. This makes it possible to determine variable reassignments. To do that, we define the rename_variables() function that, when given an env that contains the current usage index of different variables, renames the variables in the passed AST node with the annotations, and returns a copy with the modifications. Note that we can’t use NodeTransformer here as it would modify the AST.

That is, if the expression is env[v] == 1, v is renamed to _v_1

def rename_variables(astnode, env):
    if isinstance(astnode, ast.BoolOp):
        fn = 'z3.And' if isinstance(astnode.op, ast.And) else 'z3.Or'
        return ast.Call(
            ast.Name(fn, None),
            [rename_variables(i, env) for i in astnode.values], [])
    elif isinstance(astnode, ast.BinOp):
        return ast.BinOp(
            rename_variables(astnode.left, env), astnode.op,
            rename_variables(astnode.right, env))
    elif isinstance(astnode, ast.UnaryOp):
        if isinstance(astnode.op, ast.Not):
            return ast.Call(
                ast.Name('z3.Not', None),
                [rename_variables(astnode.operand, env)], [])
        else:
            return ast.UnaryOp(astnode.op,
                               rename_variables(astnode.operand, env))
    elif isinstance(astnode, ast.Call):
        return ast.Call(astnode.func,
                        [rename_variables(i, env) for i in astnode.args],
                        astnode.keywords)
    elif isinstance(astnode, ast.Compare):
        return ast.Compare(
            rename_variables(astnode.left, env), astnode.ops,
            [rename_variables(i, env) for i in astnode.comparators])
    elif isinstance(astnode, ast.Name):
        if astnode.id not in env:
            env[astnode.id] = 0
        num = env[astnode.id]
        return ast.Name('_%s_%d' % (astnode.id, num), astnode.ctx)
    elif isinstance(astnode, ast.Return):
        return ast.Return(rename_variables(astnode.value, env))
    else:
        return astnode

To verify that it works as intended, we start with an environment.

env = {'x': 1}

ba = get_expression('x == 1 and y == 2')
type(ba)

ast.BoolOp

assert to_src(rename_variables(ba, env)) == 'z3.And(_x_1 == 1, _y_0 == 2)'

bo = get_expression('x == 1 or y == 2')
type(bo.op)

ast.Or

assert to_src(rename_variables(bo, env)) == 'z3.Or(_x_1 == 1, _y_0 == 2)'

b = get_expression('x + y')
type(b)

ast.BinOp

assert to_src(rename_variables(b, env)) == '_x_1 + _y_0'

u = get_expression('-y')
type(u)

ast.UnaryOp

assert to_src(rename_variables(u, env)) == '-_y_0'

un = get_expression('not y')
type(un.op)

ast.Not

assert to_src(rename_variables(un, env)) == 'z3.Not(_y_0)'

c = get_expression('x == y')
type(c)

ast.Compare

assert to_src(rename_variables(c, env)) == '_x_1 == _y_0'

f = get_expression('fn(x,y)')
type(f)

ast.Call

assert to_src(rename_variables(f, env)) == 'fn(_x_1, _y_0)'

env

{'x': 1, 'y': 0}

Next, we want to process the CFG, and correctly transform the paths.

Tracking Assignments#

For keeping track of assignments in the CFG, We define a data structure PNode that stores the current CFG node.

class PNode:
    def __init__(self, idx, cfgnode, parent=None, order=0, seen=None):
        self.seen = {} if seen is None else seen
        self.max_iter = MAX_ITER
        self.idx, self.cfgnode, self.parent, self.order = idx, cfgnode, parent, order

    def __repr__(self):
        return "PNode:%d[%s order:%d]" % (self.idx, str(self.cfgnode),
                                          self.order)

Defining a new PNode is done as follows.

cfg = PyCFG()
cfg.gen_cfg(inspect.getsource(gcd))
gcd_fnenter, _ = cfg.functions['gcd']

PNode(0, gcd_fnenter)

PNode:0[id:27 line[1] parents: [] : enter: gcd(a, b) order:0]

The copy() method generates a copy for the child’s keep, indicating which path was taken (with order of the child).

class PNode(PNode):
    def copy(self, order):
        p = PNode(self.idx, self.cfgnode, self.parent, order, self.seen)
        assert p.order == order
        return p

Using the copy operation.

PNode(0, gcd_fnenter).copy(1)

PNode:0[id:27 line[1] parents: [] : enter: gcd(a, b) order:1]

Stepwise Exploration of Paths#

A problem we had with our SimpleSymbolicFuzzer is that it explored a path to completion before attempting another. However, this is non-optimal. One may want to explore the graph in a more step-wise manner, expanding every possible execution one step at a time.

Hence, we define explore() which explores the children of a node if any, one step at a time. If done exhaustively, this will generate all paths from a starting node until no more children are left. We made PNode to a container class so that this iteration can be driven from outside, and stopped if say a maximum iteration is complete, or certain paths need to be prioritized.

class PNode(PNode):
    def explore(self):
        ret = []
        for (i, n) in enumerate(self.cfgnode.children):
            key = "[%d]%s" % (self.idx + 1, n)
            ccount = self.seen.get(key, 0)
            if ccount > self.max_iter:
                continue  # drop this child
            self.seen[key] = ccount + 1
            pn = PNode(self.idx + 1, n, self.copy(i), seen=self.seen)
            ret.append(pn)
        return ret

We can use explore() as follows.

PNode(0, gcd_fnenter).explore()

[PNode:1[id:29 line[2] parents: [27] : _if: a < b order:0]]

PNode(0, gcd_fnenter).explore()[0].explore()

[PNode:2[id:30 line[3] parents: [29] : c: int = a order:0],
 PNode:2[id:33 line[7] parents: [32, 29, 36] : _while: b != 0 order:0]]

The method get_path_to_root() recursively goes up through child->parent chain retrieving the complete chain to the topmost parent.

class PNode(PNode):
    def get_path_to_root(self):
        path = []
        n = self
        while n:
            path.append(n)
            n = n.parent
        return list(reversed(path))

p = PNode(0, gcd_fnenter)
[s.get_path_to_root() for s in p.explore()[0].explore()[0].explore()[0].explore()]

[[PNode:0[id:27 line[1] parents: [] : enter: gcd(a, b) order:0],
  PNode:1[id:29 line[2] parents: [27] : _if: a < b order:0],
  PNode:2[id:30 line[3] parents: [29] : c: int = a order:0],
  PNode:3[id:31 line[4] parents: [30] : a = b order:0],
  PNode:4[id:32 line[5] parents: [31] : b = c order:0]]]

The string representation of the node is in z3 solvable form.

class PNode(PNode):
    def __str__(self):
        path = self.get_path_to_root()
        ssa_path = to_single_assignment_predicates(path)
        return ', '.join([to_src(p) for p in ssa_path])

However, before using it, we need to take care of variable renaming so that reassignments can work.

Renaming Used Variables#

We need to rename used variables. Any variable v = xxx should be renamed to _v_0 and any later assignment such as v = v + 1 should be transformed to _v_1 = _v_0 + 1 and later conditionals such as v == x should be transformed to (_v_1 == _x_0). The method to_single_assignment_predicates() does this for a given path.

def to_single_assignment_predicates(path):
    env = {}
    new_path = []
    for i, node in enumerate(path):
        ast_node = node.cfgnode.ast_node
        new_node = None
        if isinstance(ast_node, ast.AnnAssign) and ast_node.target.id in {
                'exit'}:
            new_node = None
        elif isinstance(ast_node, ast.AnnAssign) and ast_node.target.id in {'enter'}:
            args = [
                ast.parse(
                    "%s == _%s_0" %
                    (a.id, a.id)).body[0].value for a in ast_node.annotation.args]
            new_node = ast.Call(ast.Name('z3.And', None), args, [])
        elif isinstance(ast_node, ast.AnnAssign) and ast_node.target.id in {'_if', '_while'}:
            new_node = rename_variables(ast_node.annotation, env)
            if node.order != 0:
                assert node.order == 1
                new_node = ast.Call(ast.Name('z3.Not', None), [new_node], [])
        elif isinstance(ast_node, ast.AnnAssign):
            assigned = ast_node.target.id
            val = [rename_variables(ast_node.value, env)]
            env[assigned] = 0 if assigned not in env else env[assigned] + 1
            target = ast.Name('_%s_%d' %
                              (ast_node.target.id, env[assigned]), None)
            new_node = ast.Expr(ast.Compare(target, [ast.Eq()], val))
        elif isinstance(ast_node, ast.Assign):
            assigned = ast_node.targets[0].id
            val = [rename_variables(ast_node.value, env)]
            env[assigned] = 0 if assigned not in env else env[assigned] + 1
            target = ast.Name('_%s_%d' %
                              (ast_node.targets[0].id, env[assigned]), None)
            new_node = ast.Expr(ast.Compare(target, [ast.Eq()], val))
        elif isinstance(ast_node, (ast.Return, ast.Pass)):
            new_node = None
        else:
            s = "NI %s %s" % (type(ast_node), ast_node.target.id)
            raise Exception(s)
        new_path.append(new_node)
    return new_path

Here is how it can be used:

p = PNode(0, gcd_fnenter)
path = p.explore()[0].explore()[0].explore()[0].get_path_to_root()
spath = to_single_assignment_predicates(path)

[to_src(s) for s in spath]

['z3.And(a == _a_0, b == _b_0)', '_a_0 < _b_0', '_c_0 == _a_0', '_a_1 == _b_0']

Check Before You Loop#

One of the ways in which the concolic execution simplifies symbolic execution is in the treatment of loops. Rather than trying to determine an invariant for a loop, we simply unroll the loops a number of times until we hit the MAX_DEPTH limit. However, not all loops will need to be unrolled until MAX_DEPTH is reached. Some of them may exit before. Hence, it is necessary to check whether the given set of constraints can be satisfied before continuing to explore further.

def identifiers_with_types(identifiers, defined):
    with_types = dict(defined)
    for i in identifiers:
        if i[0] == '_':
            nxt = i[1:].find('_', 1)
            name = i[1:nxt + 1]
            assert name in defined
            typ = defined[name]
            with_types[i] = typ
    return with_types

The extract_constraints() generates the z3 constraints from a path. The main work is done by to_single_assignment_predicates(). The extract_constraints() then converts the AST to source.

class SymbolicFuzzer(SymbolicFuzzer):
    def extract_constraints(self, path):
        return [to_src(p) for p in to_single_assignment_predicates(path) if p]

Solving Path Constraints#

We now update our solve_path_constraint() method to take into account the new identifiers created during reassignments.

class SymbolicFuzzer(SymbolicFuzzer):
    def solve_path_constraint(self, path):
        # re-initializing does not seem problematic.
        # a = z3.Int('a').get_id() remains the same.
        constraints = self.extract_constraints(path)
        identifiers = [
            c for i in constraints for c in used_identifiers(i)]  # <- changes
        with_types = identifiers_with_types(
            identifiers, self.used_variables)  # <- changes
        decl = define_symbolic_vars(with_types, '')
        exec(decl)

        solutions = {}
        with checkpoint(self.z3):
            st = 'self.z3.add(%s)' % ', '.join(constraints)
            eval(st)
            if self.z3.check() != z3.sat:
                return {}
            m = self.z3.model()
            solutions = {d.name(): m[d] for d in m.decls()}
            my_args = {k: solutions.get(k, None) for k in self.fn_args}

        predicate = 'z3.And(%s)' % ','.join(
            ["%s == %s" % (k, v) for k, v in my_args.items()])
        eval('self.z3.add(z3.Not(%s))' % predicate)

        return my_args

Generating All Paths#

The get_all_paths() is now similarly updated so that we unroll loops only to a specified height. It is also converted to an iterative exploration style so that we explore the CFG in a breadth first manner.

class SymbolicFuzzer(SymbolicFuzzer):
    def get_all_paths(self, fenter):
        path_lst = [PNode(0, fenter)]
        completed = []
        for i in range(self.max_iter):
            new_paths = [PNode(0, fenter)]
            for path in path_lst:
                # explore each path once
                if path.cfgnode.children:
                    np = path.explore()
                    for p in np:
                        if path.idx > self.max_depth:
                            break
                        new_paths.append(p)
                else:
                    completed.append(path)
            path_lst = new_paths
        return completed + path_lst

We can now obtain all paths using our advanced symbolic fuzzer as follows.

asymfz_gcd = SymbolicFuzzer(
    gcd, max_iter=10, max_tries=10, max_depth=10)
all_paths = asymfz_gcd.get_all_paths(asymfz_gcd.fnenter)

len(all_paths)

all_paths[37].get_path_to_root()

[PNode:0[id:40 line[1] parents: [] : enter: gcd(a, b) order:0],
 PNode:1[id:42 line[2] parents: [40] : _if: a < b order:1],
 PNode:2[id:46 line[7] parents: [45, 42, 49] : _while: b != 0 order:0],
 PNode:3[id:47 line[8] parents: [46] : c: int = a order:0],
 PNode:4[id:48 line[9] parents: [47] : a = b order:0],
 PNode:5[id:49 line[10] parents: [48] : b = c % b order:0],
 PNode:6[id:46 line[7] parents: [45, 42, 49] : _while: b != 0 order:0],
 PNode:7[id:47 line[8] parents: [46] : c: int = a order:0],
 PNode:8[id:48 line[9] parents: [47] : a = b order:0],
 PNode:9[id:49 line[10] parents: [48] : b = c % b order:0],
 PNode:10[id:46 line[7] parents: [45, 42, 49] : _while: b != 0 order:0]]

We can also list the predicates in each path.

for s in to_single_assignment_predicates(all_paths[37].get_path_to_root()):
    if s is not None:
        print(to_src(s))

z3.And(a == _a_0, b == _b_0)
z3.Not(_a_0 < _b_0)
_b_0 != 0
_c_0 == _a_0
_a_1 == _b_0
_b_1 == _c_0 % _b_0
_b_1 != 0
_c_1 == _a_1
_a_2 == _b_1
_b_2 == _c_1 % _b_1
_b_2 != 0

constraints = asymfz_gcd.extract_constraints(all_paths[37].get_path_to_root())

constraints

['z3.And(a == _a_0, b == _b_0)',
 'z3.Not(_a_0 < _b_0)',
 '_b_0 != 0',
 '_c_0 == _a_0',
 '_a_1 == _b_0',
 '_b_1 == _c_0 % _b_0',
 '_b_1 != 0',
 '_c_1 == _a_1',
 '_a_2 == _b_1',
 '_b_2 == _c_1 % _b_1',
 '_b_2 != 0']

The constraints printed out demonstrates that our approach for renaming variables was successful. We need only one more piece to complete the puzzle. Our path is still a PNode. We need to modify get_next_path() so that we return the corresponding predicate chain.

class SymbolicFuzzer(SymbolicFuzzer):
    def get_next_path(self):
        self.last_path -= 1
        if self.last_path == -1:
            self.last_path = len(self.paths) - 1
        return self.paths[self.last_path].get_path_to_root()

We will see next how to use our fuzzer for fuzzing.

Fuzzing with Advanced Symbolic Fuzzer#

We use our advanced symbolic fuzzer on gcd to generate plausible inputs.

asymfz_gcd = SymbolicFuzzer(
    gcd, max_tries=10, max_iter=10, max_depth=10)
data = []
for i in range(10):
    r = asymfz_gcd.fuzz()
    data.append((r['a'].as_long(), r['b'].as_long()))
    v = gcd(*data[-1])
    print(r, "result:", repr(v))

{'a': 8, 'b': 3} result: 1
{'a': 1, 'b': 2} result: 1
{'a': 2, 'b': 5} result: 1
{'a': 5, 'b': 2} result: 1
{'a': 3, 'b': 4} result: 1
{'a': 9, 'b': 9} result: 9
{'a': 7, 'b': 6} result: 1
{'a': 5, 'b': 10} result: 5
{'a': 3, 'b': 1} result: 1
{'a': 10, 'b': 7} result: 1

The outputs look reasonable. However, what is the coverage obtained?

with VisualizedArcCoverage() as cov:
    for a, b in data:
        gcd(a, b)

cov.show_coverage(gcd)

#  1: def gcd(a: int, b: int) -> int:
#  2:     if a < b:
#  3:         c: int = a  # type: ignore
#  4:         a = b
#  5:         b = c
   6: 
#  7:     while b != 0:
#  8:         c: int = a  # type: ignore
#  9:         a = b
# 10:         b = c % b
  11: 
# 12:     return a
  13: 

show_cfg(gcd, arcs=cov.arcs())

_images/b7b0916a60ec16da5e2f94ef40be344fd218cd67ead8a274c9bc3915905a88d8.svg

Indeed, both branch and statement coverage visualization seems to indicate that we achieved complete coverage. How do we make use of our fuzzer in practice? We explore a small case study of a program to solve the roots of a quadratic equation.

Example: Roots of a Quadratic Equation#

Here is the famous equation for finding the roots of a quadratic equation.

from typing import Tuple

def roots(a: float, b: float, c: float) -> Tuple[float, float]:
    d: float = b * b - 4 * a * c
    ax: float = 0.5 * d
    bx: float = 0
    while (ax - bx) > 0.1:
        bx = 0.5 * (ax + d / ax)
        ax = bx

    s: float = bx
    a2: float = 2 * a
    ba2: float = b / a2

    return -ba2 + s / a2, -ba2 - s / a2

Does the program look correct? Let us investigate if the program is reasonable. But before that, we need a helper function sym_to_float() to convert symbolic values to floating point.

def sym_to_float(v):
    if v is None:
        return math.inf
    elif isinstance(v, z3.IntNumRef):
        return v.as_long()
    return v.numerator_as_long() / v.denominator_as_long()

Now we are ready to fuzz.

asymfz_roots = SymbolicFuzzer(
    roots,
    max_tries=10,
    max_iter=10,
    max_depth=10)

with ExpectError():
    for i in range(100):
        r = asymfz_roots.fuzz()
        print(r)
        d = [sym_to_float(r[i]) for i in ['a', 'b', 'c']]
        v = roots(*d)
        print(d, v)

{'a': 11/10, 'b': 0, 'c': -1/2}
[1.1, 0.0, -0.5] (0.7045454545454545, -0.7045454545454545)
{'a': 0, 'b': 58617/131072, 'c': -1/2}

Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/94617992.py", line 6, in <module>
    v = roots(*d)
        ^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/2299859991.py", line 11, in roots
    ba2: float = b / a2
                 ~~^~~~
ZeroDivisionError: float division by zero (expected)

We have a ZeroDivisionError. Can we eliminate it?

Roots - Check Before Divide#

def roots2(a: float, b: float, c: float) -> Tuple[float, float]:
    d: float = b * b - 4 * a * c

    xa: float = 0.5 * d
    xb: float = 0
    while (xa - xb) > 0.1:
        xb = 0.5 * (xa + d / xa)
        xa = xb

    s: float = xb

    if a == 0:
        return -c / b, -c / b  # only one solution

    a2: float = 2 * a
    ba2: float = b / a2
    return -ba2 + s / a2, -ba2 - s / a2

asymfz_roots = SymbolicFuzzer(
    roots2,
    max_tries=10,
    max_iter=10,
    max_depth=10)

with ExpectError():
    for i in range(1000):
        r = asymfz_roots.fuzz()
        d = [sym_to_float(r[i]) for i in ['a', 'b', 'c']]
        v = roots2(*d)
        #print(d, v)

Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/567876003.py", line 5, in <module>
    v = roots2(*d)
        ^^^^^^^^^^
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_16239/714795544.py", line 13, in roots2
    return -c / b, -c / b  # only one solution
           ~~~^~~
ZeroDivisionError: float division by zero (expected)

Apparently, our fix was incomplete. Let us try again.

Roots - Eliminating the Zero Division Error#

import math

def roots3(a: float, b: float, c: float) -> Tuple[float, float]:
    d: float = b * b - 4 * a * c

    xa: float = 0.5 * d
    xb: float = 0
    while (xa - xb) > 0.1:
        xb = 0.5 * (xa + d / xa)
        xa = xb
    s: float = xb

    if a == 0:
        if b == 0:
            return math.inf, math.inf
        return -c / b, -c / b  # only one solution

    a2: float = 2 * a
    ba2: float = b / a2
    return -ba2 + s / a2, -ba2 - s / a2

asymfz_roots = SymbolicFuzzer(
    roots3,
    max_tries=10,
    max_iter=10,
    max_depth=10)

for i in range(10):
    r = asymfz_roots.fuzz()
    print(r)
    d = [sym_to_float(r[i]) for i in ['a', 'b', 'c']]
    v = roots3(*d)
    print(d, v)

{'a': -1, 'b': 0, 'c': 0}
[-1.0, 0.0, 0.0] (0.0, 0.0)
{'a': -11/20, 'b': 0, 'c': 1}
[-0.55, 0.0, 1.0] (-1.409090909090909, 1.409090909090909)
{'a': -1/40, 'b': 0, 'c': 2}
[-0.025, 0.0, 2.0] (0.0, 0.0)
{'a': 0, 'b': -1/4, 'c': 3}
[0.0, -0.25, 3.0] (12.0, 12.0)
{'a': 0, 'b': 0, 'c': 4}
[0.0, 0.0, 4.0] (inf, inf)
{'a': 0, 'b': -777645/524288, 'c': 5}
[0.0, -1.4832401275634766, 5.0] (3.370998334715712, 3.370998334715712)
{'a': -2/91, 'b': 0, 'c': 91/40}
[-0.02197802197802198, 0.0, 2.275] (0.0, 0.0)
{'a': 0, 'b': 4/9, 'c': 6}
[0.0, 0.4444444444444444, 6.0] (-13.5, -13.5)
{'a': 0, 'b': 0, 'c': 7}
[0.0, 0.0, 7.0] (inf, inf)
{'a': -7/6, 'b': -1, 'c': 3/2}
[-1.1666666666666667, -1.0, 1.5] (-1.7142857142857142, 0.857142857142857)

With this, we have demonstrated that we can use our SymbolicFuzzer to fuzz programs, and it can aid in identifying problems in code.

Limitations#

There is an evident error in the roots3() function. We are not checking for negative roots. However, the symbolic execution does not seem to have detected it. Why are we not able to detect the problem of negative roots? Because we stop execution at a predetermined depth without throwing an error. That is, our symbolic execution is wide but shallow. One of the ways this limitation can be overcome is by relying on concolic execution, that allows one to go deeper than pure symbolic execution.

A second problem is that symbolic execution is necessarily computation intensive. This means that specification based fuzzers are often able to generate a much larger set of inputs, and consecutively more coverage on programs that do not check for magic bytes, such that they provide a reasonable gradient of exploration.

Synopsis#

This chapter provides an implementation of a symbolic fuzzing engine SymbolicFuzzer. The fuzzer uses symbolic execution to exhaustively explore paths in the program to a limited depth, and generate inputs that will reach these paths.

As an example, consider the function gcd(), computing the greatest common divisor of a and b:

# ignore
from bookutils import print_content

# ignore
print_content(inspect.getsource(gcd), '.py')

def gcd(a: int, b: int) -> int:
    if a < b:
        c: int = a  # type: ignore
        a = b
        b = c

    while b != 0:
        c: int = a  # type: ignore
        a = b
        b = c % b

    return a

To explore gcd(), the fuzzer can be used as follows, producing values for arguments that cover different paths in gcd() (including multiple times of loop iterations):

gcd_fuzzer = SymbolicFuzzer(gcd, max_tries=10, max_iter=10, max_depth=10)
for i in range(10):
    args = gcd_fuzzer.fuzz()
    print(args)

{'a': 8, 'b': 3}
{'a': 1, 'b': 2}
{'a': 2, 'b': 5}
{'a': 7, 'b': 6}
{'a': 9, 'b': 10}
{'a': 4, 'b': 4}
{'a': 10, 'b': 9}
{'a': 2, 'b': 10}
{'a': 14, 'b': 7}
{'a': 3, 'b': 2}

Note that the variable values returned by fuzz() are Z3 symbolic values; to convert them to Python numbers, use their method as_long():

for i in range(10):
    args = gcd_fuzzer.fuzz()
    a = args['a'].as_long()
    b = args['b'].as_long()
    d = gcd(a, b)
    print(f"gcd({a}, {b}) = {d}")

gcd(0, 8) = 8
gcd(-1, 10) = -1
gcd(13, 2) = 1
gcd(0, 10) = 10
gcd(6, 7) = 1
gcd(14, 2) = 2
gcd(-1, 11) = -1
gcd(15, 0) = 15
gcd(0, -1) = -1
gcd(-3, -2) = -1

The symbolic fuzzer is subject to a number of constraints. First, it requires that the function to be fuzzed has correct type annotations, including all local variables. Second, it solves loops by unrolling them, but only for a fixed amount.

For programs without loops and variable reassignments, the SimpleSymbolicFuzzer is a faster, but more limited alternative.

# ignore
from ClassDiagram import display_class_hierarchy
display_class_hierarchy(SymbolicFuzzer)

Lessons Learned#

One can use symbolic execution to augment the inputs that explore all characteristics of a program.
Symbolic execution can be broad but shallow.
Symbolic execution is well suited for programs that rely on specific values to be present in the input, however, its utility decreases when such values are not present, and the input space represents a gradient in terms of coverage.

Next Steps#

Search based fuzzing can often be an acceptable middle ground when random fuzzing does not provide sufficient results, but symbolic fuzzing is too heavyweight.

Background#

Symbolic execution of programs was originally described by King \cite{king1976symbolic} in 1976. It is used extensively in vulnerability analysis of software, especially binary programs. Some well known symbolic execution tools include KLEE \cite{KLEE}, angr \cite{wang2017angr}, Driller \cite{stephens2016driller}, and SAGE \cite{godefroid2012sage}. The best known symbolic execution environment for Python is CHEF \cite{bucur2014prototyping} which does symbolic execution by modifying the interpreter.

The Z3 solver we use in this chapter was developed at Microsoft Research under the lead of Leonardo de Moura and Nikolaj Bjørner \cite{z3}. It is one of the most popular solvers.

Exercises#

Exercise 1: Extending Symbolic Fuzzer to use function summaries#

We showed in the first section how function summaries may be produced. Can you extend the SymbolicFuzzer to use function summaries when needed?

Solution. None yet available.

Exercise 2: Statically checking if a loop should be unrolled further#

We examined how loops would be unrolled during exploration to a fixed depth. However, not all loops need to be unrolled completely. Some loops may contain only a constant number of iterations. For example, consider the loop below.

i = 0
while i < 10:
    i += 1

This loop needs to be unrolled exactly \(10\) times. For such cases, can you implement a method can_be_satisfied() which is invoked as below, to only unroll further if the path condition can be satisfied.

class SymbolicFuzzer(SymbolicFuzzer):
    def get_all_paths(self, fenter):
        path_lst = [PNode(0, fenter)]
        completed = []
        for i in range(self.max_iter):
            new_paths = [PNode(0, fenter)]
            for path in path_lst:
                # explore each path once
                if path.cfgnode.children:
                    np = path.explore()
                    for p in np:
                        if path.idx > self.max_depth:
                            break
                        if self.can_be_satisfied(p):
                            new_paths.append(p)
                        else:
                            break
                else:
                    completed.append(path)
            path_lst = new_paths
        return completed + path_lst

Solution. Here is a solution.

class SymbolicFuzzer(SymbolicFuzzer):
    def can_be_satisfied(self, p):
        s2 = self.extract_constraints(p.get_path_to_root())
        s = z3.Solver()
        identifiers = [c for i in s2 for c in used_identifiers(i)]
        with_types = identifiers_with_types(identifiers, self.used_variables)
        decl = define_symbolic_vars(with_types, '')
        exec(decl)
        exec("s.add(z3.And(%s))" % ','.join(s2), globals(), locals())
        return s.check() == z3.sat

With this implementation, new conditions are appended to paths if and only if the paths are still satisfiable after incorporating the condition.

Exercise 3: Implementing a Concolic Fuzzer#

We have seen in the chapter on concolic fuzzing how to trace a function concolically using information flow. However, this is somewhat suboptimal as the constraints can get dropped when the information flow is indirect (as in control flow based information flow). Can you implement concolic tracing using the infrastructure we built for symbolic execution?

Solution. Here is a possible solution.

In concolic execution, we rely on a seed input to guide our symbolic execution. We collect the line numbers that our seed input traces, and feed it to the symbolic execution such that in the explore step, only the child node that correspond to the seed input execution path is chosen. This allows us to collect the complete set of constraints along a representative path. Once we have it, we can choose any particular predicate and invert it to explore the program execution paths near the representative path.

We modify our original ArcCoverage to provide all line numbers that the program traversed.

class TrackingArcCoverage(ArcCoverage):
    def offsets_from_entry(self, fn):
        zero = self._trace[0][1] - 1
        return [l - zero for (f, l) in self._trace if f == fn]

with TrackingArcCoverage() as cov:
    roots3(1, 1, 1)

cov.offsets_from_entry('roots3')

[1, 2, 4, 5, 6, 9, 11, 16, 17, 18]

The ConcolicTracer first extracts the program trace on a seed input.

class ConcolicTracer(SymbolicFuzzer):
    def __init__(self, fn, fnargs, **kwargs):
        with TrackingArcCoverage() as cov:
            fn(*fnargs)
        self.lines = cov.offsets_from_entry(fn.__name__)
        self.current_line = 0
        super().__init__(fn, **kwargs)

The method get_all_paths() now tries to follow the seed execution path.

class ConcolicTracer(ConcolicTracer):
    def get_all_paths(self, fenter):
        assert fenter.ast_node.lineno == self.lines[self.current_line]
        self.current_line += 1
        last_node = PNode(0, fenter)
        while last_node and self.current_line < len(self.lines):
            if last_node.cfgnode.children:
                np = last_node.explore()
                for p in np:
                    if self.lines[self.current_line] == p.cfgnode.ast_node.lineno:
                        self.current_line += 1
                        last_node = p
                        break
                else:
                    last_node = None
                    break
            else:
                break
        assert len(self.lines) == self.current_line
        return [last_node]

We are now ready to concolically trace our execution.

Tracing the Execution Concolically#

acfz_roots = ConcolicTracer(
    roots3,
    fnargs=[1, 1, 1],
    max_tries=10,
    max_iter=10,
    max_depth=10)

acfz_roots.paths[0].get_path_to_root()

[PNode:0[id:75 line[1] parents: [] : enter: roots3(a, b, c) order:0],
 PNode:1[id:77 line[2] parents: [75] : d: float = b * b - 4 * a * c order:0],
 PNode:2[id:78 line[4] parents: [77] : xa: float = 0.5 * d order:0],
 PNode:3[id:79 line[5] parents: [78] : xb: float = 0 order:0],
 PNode:4[id:80 line[6] parents: [79, 82] : _while: xa - xb > 0.1 order:1],
 PNode:5[id:83 line[9] parents: [80] : s: float = xb order:0],
 PNode:6[id:84 line[11] parents: [83] : _if: a == 0 order:1],
 PNode:7[id:88 line[16] parents: [84] : a2: float = 2 * a order:0],
 PNode:8[id:89 line[17] parents: [88] : ba2: float = b / a2 order:0],
 PNode:9[id:90 line[18] parents: [89] : return (-ba2 + s / a2, -ba2 - s / a2) order:0]]

print(cov.offsets_from_entry('roots3'))
print([i.cfgnode.ast_node.lineno for i in acfz_roots.paths[0].get_path_to_root()])
print(acfz_roots.lines)

[1, 2, 4, 5, 6, 9, 11, 16, 17, 18]
[1, 2, 4, 5, 6, 9, 11, 16, 17, 18]
[1, 2, 4, 5, 6, 9, 11, 16, 17, 18]

As can be seen above, we recovered the trace information correctly. Next, we extract the constraints as usual.

constraints = acfz_roots.extract_constraints(
    acfz_roots.paths[0].get_path_to_root())

constraints

['z3.And(a == _a_0, b == _b_0, c == _c_0)',
 '_d_0 == _b_0 * _b_0 - 4 * _a_0 * _c_0',
 '_xa_0 == 0.5 * _d_0',
 '_xb_0 == 0',
 'z3.Not(_xa_0 - _xb_0 > 0.1)',
 '_s_0 == _xb_0',
 'z3.Not(_a_0 == 0)',
 '_a2_0 == 2 * _a_0',
 '_ba2_0 == _b_0 / _a2_0']

Next, we change our constraints to symbolic variables and solve them.

identifiers = [c for i in constraints for c in used_identifiers(i)]
with_types = identifiers_with_types(identifiers, acfz_roots.used_variables)
decl = define_symbolic_vars(with_types, '')
exec(decl)

We are ready to solve our constraints. However, before that, here is a question for you. Should it result in exactly the same arguments?

eval('z3.solve(%s)' % ','.join(constraints))

[a = 1,
 _a_0 = 1,
 _b_0 = 1/8,
 _c_0 = 1,
 _ba2_0 = 1/16,
 _a2_0 = 2,
 _s_0 = 0,
 _xb_0 = 0,
 _xa_0 = -255/128,
 _d_0 = -255/64,
 c = 1,
 b = 1/8,
 /0 = [(1/8, 2) -> 1/16, else -> 0]]

acfz_roots.fuzz()

{'a': -1, 'b': 0, 'c': 0}

Did they take the same path?

with ArcCoverage() as cov:
    roots(1, 1, 1)
show_cfg(roots, arcs=cov.arcs())

_images/e2207ea21ccf92101bfce7f3fea6fdf21f7001372720f793e9b947be5c059758.svg

with ArcCoverage() as cov:
    roots(-1, 0, 0)
show_cfg(roots, arcs=cov.arcs())

Indeed, even though the arguments were different, the path traced is exactly the same.

As we saw in the chapter on concolic fuzzing, concolic tracing has another use, namely that we can use it to explore nearby paths. We will see how to do that next.

Exploring Nearby Paths#

We collected the following constraints.

constraints

['z3.And(a == _a_0, b == _b_0, c == _c_0)',
 '_d_0 == _b_0 * _b_0 - 4 * _a_0 * _c_0',
 '_xa_0 == 0.5 * _d_0',
 '_xb_0 == 0',
 'z3.Not(_xa_0 - _xb_0 > 0.1)',
 '_s_0 == _xb_0',
 'z3.Not(_a_0 == 0)',
 '_a2_0 == 2 * _a_0',
 '_ba2_0 == _b_0 / _a2_0']

We can explore nearby paths by negating the conditionals starting from the very last. (A question for the student: Why do we want to start negating from the very last?)

new_constraints = constraints[0:4] + ['z3.Not(%s)' % constraints[4]]

new_constraints

['z3.And(a == _a_0, b == _b_0, c == _c_0)',
 '_d_0 == _b_0 * _b_0 - 4 * _a_0 * _c_0',
 '_xa_0 == 0.5 * _d_0',
 '_xb_0 == 0',
 'z3.Not(z3.Not(_xa_0 - _xb_0 > 0.1))']

eval('z3.solve(%s)' % ','.join(new_constraints))

[_a_0 = -11/20,
 _b_0 = 0,
 _c_0 = 1,
 _xb_0 = 0,
 _xa_0 = 11/10,
 _d_0 = 11/5,
 c = 1,
 b = 0,
 a = -11/20]

with ArcCoverage() as cov:
    roots3(1, 0, -11 / 20)
show_cfg(roots3, arcs=cov.arcs())

_images/e9a101fff60e67afae0ecd394c0c1834c9ca7e3cf3e899c39f0e66ca5a57e3f5.svg

Indeed, the path traced is now different. One can continue this procedure to the necessary number of times to explore all nearby paths to the execution.

Can you incorporate this exploration into the concolic fuzzer?

Solution. None yet available.

Compatibility#

Earlier versions of this chapter used the name AdvancedSymbolicFuzzer for SymbolicFuzzer.

AdvancedSymbolicFuzzer = SymbolicFuzzer

Symbolic Fuzzing

Contents

Symbolic Fuzzing#

Synopsis#

Obtaining Path Conditions for Coverage#

The Control Flow Graph#

The CFG with Path Taken#

Visualizing the Coverage#

Function Summaries#

Get Names and Types of Variables Used#

Simple Symbolic Fuzzing#

Generating All Possible Paths#

Extracting All Constraints#

Fuzzing with Simple Symbolic Fuzzer#

Problems with the Simple Fuzzer#

Advanced Symbolic Fuzzing#

Dealing with Reassignments#

Tracking Assignments#

Stepwise Exploration of Paths#

Renaming Used Variables#

Check Before You Loop#

Solving Path Constraints#

Generating All Paths#

Fuzzing with Advanced Symbolic Fuzzer#

Example: Roots of a Quadratic Equation#

Roots - Check Before Divide#

Roots - Eliminating the Zero Division Error#

Limitations#

Synopsis#

Lessons Learned#

Next Steps#

Background#

Exercises#

Exercise 1: Extending Symbolic Fuzzer to use function summaries#

Exercise 2: Statically checking if a loop should be unrolled further#

Exercise 3: Implementing a Concolic Fuzzer#

Tracing the Execution Concolically#

Exploring Nearby Paths#

Compatibility#