Grammar Coverage

Producing inputs from grammars gives all possible expansions of a rule the same likelihood. For producing a comprehensive test suite, however, it makes more sense to maximize variety – for instance, by not repeating the same expansions over and over again. In this chapter, we explore how to systematically cover elements of a grammar such that we maximize variety and do not miss out individual elements.

Covering Grammar Elements

The aim of test generation is to cover all functionality of a program – hopefully including the failing functionality, of course. This functionality, however, is tied to the structure of the input: If we fail to produce certain input elements, then the associated code and functionality will not be triggered either, nixing our chances to find a bug in there.

As an example, consider our expression grammar EXPR_GRAMMAR from the chapter on grammars.. If we do not produce negative numbers, then negative numbers will not be tested. If we do not produce floating-point numbers, then floating-point numbers will not be tested. Our aim must thus be to cover all possible expansions.

One way to maximize such variety is to track the expansions that occur during grammar production: If we already have seen some expansion, we can prefer other possible expansion candidates out of the set of possible expansions. Consider the following rule in our expression grammar:

In [3]:
['+<factor>', '-<factor>', '(<expr>)', '<integer>.<integer>', '<integer>']

Let us assume we have already produced an <integer> in the first expansion of <factor>. As it comes to expand the next factor, we would mark the <integer> expansion as already covered, and choose one of the yet uncovered alternatives such as -<factor> (a negative number) or <integer>.<integer> (a floating-point number). Only when we have covered all alternatives would we go back and reconsider expansions covered before.

Tracking Grammar Coverage

This concept of grammar coverage is very easy to implement. We introduce a class GrammarCoverageFuzzer that keeps track of the current grammar coverage achieved:

In [6]:
class TrackingGrammarCoverageFuzzer(GrammarFuzzer):
    def __init__(self, *args, **kwargs):
        # invoke superclass __init__(), passing all arguments
        super().__init__(*args, **kwargs)

    def reset_coverage(self):
        self.covered_expansions = set()

    def expansion_coverage(self):
        return self.covered_expansions

In this set covered_expansions, we store individual expansions seen as pairs of (symbol, expansion), using the function expansion_key() to generate a string representation for the pair.

In [7]:
def expansion_key(symbol, expansion):
    """Convert (symbol, children) into a key.  `children` can be an expansion string or a derivation tree."""
    if isinstance(expansion, tuple):
        expansion = expansion[0]
    if not isinstance(expansion, str):
        children = expansion
        expansion = all_terminals((symbol, children))
    return symbol + " -> " + expansion
In [8]:
'<start> -> <expr>'

Instead of expansion, we can also pass a list of children as argument, which will then automatically be converted into a string.

In [9]:
children = [("<expr>", None), (" + ", []), ("<term>", None)]
expansion_key("<expr>", children)
'<expr> -> <expr> + <term>'

We can compute the set of possible expansions in a grammar by enumerating all expansions. The method max_expansion_coverage() traverses the grammar recursively starting from the given symbol (by default: the grammar start symbol) and accumulates all expansions in the set expansions. With the max_depth parameter (default: $\infty$), we can control how deep the grammar exploration should go; we will need this later in the chapter.

In [10]:
class TrackingGrammarCoverageFuzzer(TrackingGrammarCoverageFuzzer):
    def _max_expansion_coverage(self, symbol, max_depth):
        if max_depth <= 0:
            return set()


        expansions = set()
        for expansion in self.grammar[symbol]:
            expansions.add(expansion_key(symbol, expansion))
            for nonterminal in nonterminals(expansion):
                if nonterminal not in self._symbols_seen:
                    expansions |= self._max_expansion_coverage(
                        nonterminal, max_depth - 1)

        return expansions

    def max_expansion_coverage(self, symbol=None, max_depth=float('inf')):
        """Return set of all expansions in a grammar starting with `symbol`"""
        if symbol is None:
            symbol = self.start_symbol

        self._symbols_seen = set()
        cov = self._max_expansion_coverage(symbol, max_depth)

        if symbol == START_SYMBOL:
            assert len(self._symbols_seen) == len(self.grammar)

        return cov

We can use max_expansion_coverage() to compute all the expansions within the expression grammar:

In [11]:
expr_fuzzer = TrackingGrammarCoverageFuzzer(EXPR_GRAMMAR)
{'<digit> -> 0',
 '<digit> -> 1',
 '<digit> -> 2',
 '<digit> -> 3',
 '<digit> -> 4',
 '<digit> -> 5',
 '<digit> -> 6',
 '<digit> -> 7',
 '<digit> -> 8',
 '<digit> -> 9',
 '<expr> -> <term>',
 '<expr> -> <term> + <expr>',
 '<expr> -> <term> - <expr>',
 '<factor> -> (<expr>)',
 '<factor> -> +<factor>',
 '<factor> -> -<factor>',
 '<factor> -> <integer>',
 '<factor> -> <integer>.<integer>',
 '<integer> -> <digit>',
 '<integer> -> <digit><integer>',
 '<start> -> <expr>',
 '<term> -> <factor>',
 '<term> -> <factor> * <term>',
 '<term> -> <factor> / <term>'}

During expansion, we can keep track of expansions seen. To do so, we hook into the method choose_node_expansion(), expanding a single node in our Grammar fuzzer.

In [12]:
class TrackingGrammarCoverageFuzzer(TrackingGrammarCoverageFuzzer):
    def add_coverage(self, symbol, new_children):
        key = expansion_key(symbol, new_children)

        if self.log and key not in self.covered_expansions:
            print("Now covered:", key)

    def choose_node_expansion(self, node, possible_children):
        (symbol, children) = node
        index = super().choose_node_expansion(node, possible_children)
        self.add_coverage(symbol, possible_children[index])
        return index

The method missing_expansion_coverage() is a helper method that returns the expansions that still have to be covered:

In [13]:
class TrackingGrammarCoverageFuzzer(TrackingGrammarCoverageFuzzer):
    def missing_expansion_coverage(self):
        return self.max_expansion_coverage() - self.expansion_coverage()

Let us show how tracking works. To keep things simple, let us focus on <digit> expansions only.

In [14]:
digit_fuzzer = TrackingGrammarCoverageFuzzer(
    EXPR_GRAMMAR, start_symbol="<digit>", log=True)
Tree: <digit>
Expanding <digit> randomly
Now covered: <digit> -> 9
Tree: 9
In [15]:
Tree: <digit>
Expanding <digit> randomly
Now covered: <digit> -> 0
Tree: 0
In [16]:
Tree: <digit>
Expanding <digit> randomly
Now covered: <digit> -> 5
Tree: 5

Here's the set of covered expansions so far:

In [17]:
{'<digit> -> 0', '<digit> -> 5', '<digit> -> 9'}

This is the set of all expansions we can cover:

In [18]:
{'<digit> -> 0',
 '<digit> -> 1',
 '<digit> -> 2',
 '<digit> -> 3',
 '<digit> -> 4',
 '<digit> -> 5',
 '<digit> -> 6',
 '<digit> -> 7',
 '<digit> -> 8',
 '<digit> -> 9'}

This is the missing coverage:

In [19]:
{'<digit> -> 1',
 '<digit> -> 2',
 '<digit> -> 3',
 '<digit> -> 4',
 '<digit> -> 6',
 '<digit> -> 7',
 '<digit> -> 8'}

On average, how many characters do we have to produce until all expansions are covered?

In [20]:
def average_length_until_full_coverage(fuzzer):
    trials = 50

    sum = 0
    for trial in range(trials):
        # print(trial, end=" ")
        while len(fuzzer.missing_expansion_coverage()) > 0:
            s = fuzzer.fuzz()
            sum += len(s)

    return sum / trials
In [21]:
digit_fuzzer.log = False

For full expressions, this takes a bit longer:

In [22]:
expr_fuzzer = TrackingGrammarCoverageFuzzer(EXPR_GRAMMAR)

Covering Grammar Expansions

Let us now not only track coverage, but actually produce coverage. The idea is as follows:

  1. We determine children yet uncovered (in uncovered_children)
  2. If all children are covered, we fall back to the original method (i.e., choosing one expansion randomly)
  3. Otherwise, we select a child from the uncovered children and mark it as covered.

To this end, we introduce a new fuzzer SimpleGrammarCoverageFuzzer that implements this strategy in the choose_node_expansion() method.

In [23]:
class SimpleGrammarCoverageFuzzer(TrackingGrammarCoverageFuzzer):
    def choose_node_expansion(self, node, possible_children):
        # Prefer uncovered expansions
        (symbol, children) = node
        uncovered_children = [c for (i, c) in enumerate(possible_children)
                              if expansion_key(symbol, c) not in self.covered_expansions]
        index_map = [i for (i, c) in enumerate(possible_children)
                     if c in uncovered_children]

        if len(uncovered_children) == 0:
            # All expansions covered - use superclass method
            return self.choose_covered_node_expansion(node, possible_children)

        # Select from uncovered nodes
        index = self.choose_uncovered_node_expansion(node, uncovered_children)

        return index_map[index]

The two methods choose_covered_node_expansion() and choose_uncovered_node_expansion() are provided for subclasses to hook in:

In [24]:
class SimpleGrammarCoverageFuzzer(SimpleGrammarCoverageFuzzer):
    def choose_uncovered_node_expansion(self, node, possible_children):
        return TrackingGrammarCoverageFuzzer.choose_node_expansion(
            self, node, possible_children)

    def choose_covered_node_expansion(self, node, possible_children):
        return TrackingGrammarCoverageFuzzer.choose_node_expansion(
            self, node, possible_children)

By returning the set of expansions covered so far, we can invoke the fuzzer multiple times, each time adding to the grammar coverage. Using the EXPR_GRAMMAR grammar to produce digits, for instance, the fuzzer produces one digit after the other:

In [25]:
f = SimpleGrammarCoverageFuzzer(EXPR_GRAMMAR, start_symbol="<digit>")
In [26]:
In [27]:

Here's the set of covered expansions so far:

In [28]:
{'<digit> -> 1', '<digit> -> 2', '<digit> -> 5'}

Let us fuzz some more. We see that with each iteration, we cover another expansion:

In [29]:
for i in range(7):
    print(f.fuzz(), end=" ")
0 9 7 4 8 3 6 

At the end, all expansions are covered:

In [30]:

Let us apply this on a more complex grammar – e.g., the full expression grammar. We see that after a few iterations, we cover each and every digit, operator, and expansion:

In [31]:
f = SimpleGrammarCoverageFuzzer(EXPR_GRAMMAR)
for i in range(10):
+(0.31 / (5) / 9 + 4 * 6 / 3 - 8 - 7) * -2
+++2 / 87360
((4) * 0 - 1) / -9.6 + 7 / 6 + 1 * 8 + 7 * 8
++++26 / -64.45
(8 / 1 / 6 + 9 + 7 + 8) * 1.1 / 0 * 1
++(3.5 / 3) - (-4 + 3) / (8 / 0) / -4 * 2 / 1
+(90 / --(28 * 8 / 5 + 5 / (5 / 8))) - +9.36 / 2.5 * (5 * (7 * 6 * 5) / 8)
9.11 / 7.28
1 / (9 - 5 * 6) / 6 / 7 / 7 + 1 + 1 - 7 * -3

Again, all expansions are covered:

In [32]:

We see that our strategy is much more effective in achieving coverage than the random approach:

In [33]:

Deep Foresight

Selecting expansions for individual rules is a good start; however, it is not sufficient, as the following example shows. We apply our coverage fuzzer on the CGI grammar from the chapter on coverage:

In [34]:
{'<start>': ['<string>'],
 '<string>': ['<letter>', '<letter><string>'],
 '<letter>': ['<plus>', '<percent>', '<other>'],
 '<plus>': ['+'],
 '<percent>': ['%<hexdigit><hexdigit>'],
 '<hexdigit>': ['0',
 '<other>': ['0', '1', '2', '3', '4', '5', 'a', 'b', 'c', 'd', 'e', '-', '_']}
In [35]:
f = SimpleGrammarCoverageFuzzer(CGI_GRAMMAR)
for i in range(10):

After 10 iterations, we still have a number of expansions uncovered:

In [36]:
{'<hexdigit> -> 2',
 '<hexdigit> -> 4',
 '<hexdigit> -> 5',
 '<hexdigit> -> 9',
 '<hexdigit> -> c',
 '<hexdigit> -> f',
 '<other> -> 0',
 '<other> -> 1',
 '<other> -> 3',
 '<other> -> 4',
 '<other> -> 5',
 '<other> -> a',
 '<other> -> b'}

Why is that so? The problem is that in the CGI grammar, the largest number of variations to be covered occurs in the hexdigit rule. However, we first need to reach this expansion. When expanding a <letter> symbol, we have the choice between three possible expansions:

In [37]:
['<plus>', '<percent>', '<other>']

If all three expansions are covered already, then choose_node_expansion() above will choose one randomly – even if there may be more expansions to cover when choosing <percent>.

What we need is a better strategy that will pick <percent> if there are more uncovered expansions following – even if <percent> is covered. Such a strategy was first discussed by W. Burkhardt \cite{Burkhardt1967} under the name of "Shortest Path Selection":

This version selects, from several alternatives for development, that syntactic unit under which there is still an unused unit available, starting with the shortest path.

This is what we will implement in the next steps.

Determining Maximum per-Symbol Coverage

To address this problem, we introduce a new class GrammarCoverageFuzzer that builds on SimpleGrammarCoverageFuzzer, but with a better strategy. First, we need to compute the maximum set of expansions that can be reached from a particular symbol, as we already have implemented in max_expansion_coverage(). The idea is to later compute the intersection of this set and the expansions already covered, such that we can favor those expansions with a non-empty intersection.

The first step – computing the maximum set of expansions that can be reached from a symbol – is already implemented. By passing a symbol parameter to max_expansion_coverage(), we can compute the possible expansions for every symbol:

In [38]:
f = SimpleGrammarCoverageFuzzer(EXPR_GRAMMAR)
{'<digit> -> 0',
 '<digit> -> 1',
 '<digit> -> 2',
 '<digit> -> 3',
 '<digit> -> 4',
 '<digit> -> 5',
 '<digit> -> 6',
 '<digit> -> 7',
 '<digit> -> 8',
 '<digit> -> 9',
 '<integer> -> <digit>',
 '<integer> -> <digit><integer>'}
In [39]:
{'<digit> -> 0',
 '<digit> -> 1',
 '<digit> -> 2',
 '<digit> -> 3',
 '<digit> -> 4',
 '<digit> -> 5',
 '<digit> -> 6',
 '<digit> -> 7',
 '<digit> -> 8',
 '<digit> -> 9'}

Determining yet Uncovered Children

We can now start to implement GrammarCoverageFuzzer. Computing max_expansion_coverage() allows us to determine the missing coverage for each child. To this end, we subtract the coverage already seen (expansion_coverage()) from the coverage that could be obtained.

In [40]:
class GrammarCoverageFuzzer(SimpleGrammarCoverageFuzzer):
    def _new_child_coverage(self, children, max_depth):
        new_cov = set()
        for (c_symbol, _) in children:
            if c_symbol in self.grammar:
                new_cov |= self.max_expansion_coverage(
                    c_symbol, max_depth)
        return new_cov

    def new_child_coverage(self, symbol, children, max_depth=float('inf')):
        """Return new coverage that would be obtained by expanding (symbol, children)"""
        new_cov = self._new_child_coverage(children, max_depth)
        new_cov.add(expansion_key(symbol, children))
        new_cov -= self.expansion_coverage()   # -= is set subtraction
        return new_cov

Let us illustrate new_child_coverage(). We again start fuzzing, choosing expansions randomly.

In [41]:
f = GrammarCoverageFuzzer(EXPR_GRAMMAR, start_symbol="<digit>", log=True)
Tree: <digit>
Expanding <digit> randomly
Now covered: <digit> -> 2
Tree: 2

This is our current coverage:

In [42]:
{'<digit> -> 2'}

When we go through the individual expansion possibilities for <digit>, we see that all expansions offer additional coverage, except for the one we have just seen.

In [43]:
for expansion in EXPR_GRAMMAR["<digit>"]:
    children = f.expansion_to_children(expansion)
    print(expansion, f.new_child_coverage("<digit>", children))
0 {'<digit> -> 0'}
1 {'<digit> -> 1'}
2 set()
3 {'<digit> -> 3'}
4 {'<digit> -> 4'}
5 {'<digit> -> 5'}
6 {'<digit> -> 6'}
7 {'<digit> -> 7'}
8 {'<digit> -> 8'}
9 {'<digit> -> 9'}

This means that whenever choosing an expansion, we can make use of new_child_coverage() and choose among the expansions that offer the greatest new (unseen) coverage.

Adaptive Lookahead

When choosing a child, we do not look out for the maximum overall coverage to be obtained, as this would have expansions with many uncovered possibilities totally dominate other expansions. Instead, we aim for a breadth-first strategy, first covering all expansions up to a given depth, and only then looking for a greater depth. The method new_coverages() is at the heart of this strategy: Starting with a maximum depth (max_depth) of zero, it increases the depth until it finds at least one uncovered expansion.

In [44]:
class GrammarCoverageFuzzer(GrammarCoverageFuzzer):
    def new_coverages(self, node, possible_children):
        """Return coverage to be obtained for each child at minimum depth"""
        (symbol, children) = node
        for max_depth in range(len(self.grammar)):
            new_coverages = [
                    symbol, c, max_depth) for c in possible_children]
            max_new_coverage = max(len(new_coverage)
                                   for new_coverage in new_coverages)
            if max_new_coverage > 0:
                # Uncovered node found
                return new_coverages

        # All covered
        return None

All Together

We can now define choose_node_expansion() to make use of this strategy: First, we determine the possible coverages to be obtained (using new_coverages()); then we (randomly) select among the children which sport the maximum coverage.

In [45]:
class GrammarCoverageFuzzer(GrammarCoverageFuzzer):
    def choose_node_expansion(self, node, possible_children):
        (symbol, children) = node
        new_coverages = self.new_coverages(node, possible_children)

        if new_coverages is None:
            # All expansions covered - use superclass method
            return self.choose_covered_node_expansion(node, possible_children)

        max_new_coverage = max(len(cov) for cov in new_coverages)

        children_with_max_new_coverage = [c for (i, c) in enumerate(possible_children)
                                          if len(new_coverages[i]) == max_new_coverage]
        index_map = [i for (i, c) in enumerate(possible_children)
                     if len(new_coverages[i]) == max_new_coverage]

        # Select a random expansion
        new_children_index = self.choose_uncovered_node_expansion(
            node, children_with_max_new_coverage)
        new_children = children_with_max_new_coverage[new_children_index]

        # Save the expansion as covered
        key = expansion_key(symbol, new_children)

        if self.log:
            print("Now covered:", key)

        return index_map[new_children_index]

Our fuzzer is now complete. Let us apply it on a series of examples. On expressions, it quickly covers all digits and operators:

In [46]:
f = GrammarCoverageFuzzer(EXPR_GRAMMAR, min_nonterminals=3)
'-4.02 / (1) * +3 + 5.9 / 7 * 8 - 6'
In [47]:
f.max_expansion_coverage() - f.expansion_coverage()

On average, it is again faster than the simple strategy:

In [48]:

On the CGI grammar, it takes but a few iterations to cover all letters and digits:

In [49]:
f = GrammarCoverageFuzzer(CGI_GRAMMAR, min_nonterminals=5)
while len(f.max_expansion_coverage() - f.expansion_coverage()) > 0:

This improvement can also be seen in comparing the random, expansion-only, and deep foresight strategies on the CGI grammar:

In [50]:
In [51]:
In [52]:

Coverage in Context

Sometimes, grammar elements are used in more than just one place. In our expression grammar, for instance, the <integer> symbol is used for integer numbers as well as for floating point numbers:

In [53]:
['+<factor>', '-<factor>', '(<expr>)', '<integer>.<integer>', '<integer>']

Our coverage production, as defined above, will ensure that all <integer> expansions (i.e., all <digit> expansions) are covered. However, the individual digits would be distributed across all occurrences of <integer> in the grammar. If our coverage-based fuzzer produces, say, 1234.56 and 7890, we would have full coverage of all digit expansions. However, <integer>.<integer> and <integer> in the <factor> expansions above would individually cover only a fraction of the digits. If floating-point numbers and whole numbers have different functions that read them in, we would like each of these functions to be tested with all digits; maybe we would also like the whole and fractional part of a floating-point number to be tested with all digits each.

Ignoring the context in which a symbol is used (in our case, the various uses of <integer> and <digit> in the <factor> context) can be useful if we can assume that all occurrences of this symbol are treated alike anyway. If not, though, one way to ensure that an occurrence of a symbol is systematically covered independently of other occurrences is to assign the occurrence to a new symbol which is a duplicate of the old symbol. We will first show how to manually create such duplicates, and then a dedicated function which does it automatically.

Extending Grammars for Context Coverage Manually

As stated above, one simple way to achieve coverage in context is by duplicating symbols as well as the rules they reference to. For instance, we could replace <integer>.<integer> by <integer-1>.<integer-2> and give <integer-1> and <integer-2> the same definitions as the original <integer>. This would mean that not only all expansions of <integer>, but also all expansions of <integer-1> and <integer-2> would be covered.

Let us illustrate this with actual code:

In [54]:
dup_expr_grammar = extend_grammar(EXPR_GRAMMAR,
                                      "<factor>": ["+<factor>", "-<factor>", "(<expr>)", "<integer-1>.<integer-2>", "<integer>"],
                                      "<integer-1>": ["<digit-1><integer-1>", "<digit-1>"],
                                      "<integer-2>": ["<digit-2><integer-2>", "<digit-2>"],
                                      ["0", "1", "2", "3", "4",
                                          "5", "6", "7", "8", "9"],
                                      ["0", "1", "2", "3", "4",
                                          "5", "6", "7", "8", "9"]

If we now run our coverage-based fuzzer on the extended grammar, we will cover all digits both of regular integers, as well as all digits in the whole and fraction part of floating-point numbers:

In [56]:
f = GrammarCoverageFuzzer(dup_expr_grammar, start_symbol="<factor>")
for i in range(10):
-(43.76 / 8.0 * 5.5 / 6.9 * 6 / 4 + +03)
(90.1 - 1 * 7.3 * 9 + 5 / 8 / 7)

We see how our "foresighted" coverage fuzzer specifically generates floating-point numbers that cover all digits both in the whole and fractional parts.

Extending Grammars for Context Coverage Programmatically

If we want to enhance coverage in context, manually adapting our grammars may not be the perfect choice, since any change to the grammar will have to be replicated in all duplicates. Instead, we introduce a function that will do the duplication for us.

The function duplicate_context() takes a grammar, a symbol in the grammar, and an expansion of this symbol (None or not given: all expansions of symbol), and it changes the expansion to refer to a duplicate of all originally referenced rules. The idea is that we invoke it as

dup_expr_grammar = extend_grammar(EXPR_GRAMMAR)
duplicate_context(dup_expr_grammar, "<factor>", "<integer>.<integer>")

and get a similar result as with our manual changes, above.

Here is the code:

In [58]:
def duplicate_context(grammar, symbol, expansion=None, depth=float('inf')):
    """Duplicate an expansion within a grammar.

    In the given grammar, take the given expansion of the given symbol
    (if expansion is omitted: all symbols), and replace it with a
    new expansion referring to a duplicate of all originally referenced rules.

    If depth is given, limit duplication to `depth` references (default: unlimited)
    orig_grammar = extend_grammar(grammar)
    _duplicate_context(grammar, orig_grammar, symbol,
                       expansion, depth, seen={})

    # After duplication, we may have unreachable rules; delete them
    for nonterminal in unreachable_nonterminals(grammar):
        del grammar[nonterminal]

The bulk of the work takes place in this helper function. The additional parameter seen keeps track of symbols already expanded and avoids infinite recursion.

In [60]:
def _duplicate_context(grammar, orig_grammar, symbol, expansion, depth, seen):
    for i in range(len(grammar[symbol])):
        if expansion is None or grammar[symbol][i] == expansion:
            new_expansion = ""
            for (s, c) in expansion_to_children(grammar[symbol][i]):
                if s in seen:                 # Duplicated already
                    new_expansion += seen[s]
                elif c == [] or depth == 0:   # Terminal symbol or end of recursion
                    new_expansion += s
                else:                         # Nonterminal symbol - duplicate
                    # Add new symbol with copy of rule
                    new_s = new_symbol(grammar, s)
                    grammar[new_s] = copy.deepcopy(orig_grammar[s])

                    # Duplicate its expansions recursively
                    # {**seen, **{s: new_s}} is seen + {s: new_s}
                    _duplicate_context(grammar, orig_grammar, new_s, expansion=None,
                                       depth=depth - 1, seen={**seen, **{s: new_s}})
                    new_expansion += new_s

            grammar[symbol][i] = new_expansion

Here's our above example of how duplicate_context() works, now with results. We let it duplicate the <integer>.<integer> expansion in our expression grammar, and obtain a new grammar with an <integer-1>.<integer-2> expansion where both <integer-1> and <integer-2> refer to copies of the original rules:

In [61]:
dup_expr_grammar = extend_grammar(EXPR_GRAMMAR)
duplicate_context(dup_expr_grammar, "<factor>", "<integer>.<integer>")
{'<start>': ['<expr>'],
 '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
 '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
 '<factor>': ['+<factor>',
 '<integer>': ['<digit><integer>', '<digit>'],
 '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<integer-1>': ['<digit-1><integer-1>', '<digit-2>'],
 '<digit-1>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<digit-2>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<integer-2>': ['<digit-3><integer-2>', '<digit-4>'],
 '<digit-3>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<digit-4>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}

Just like above, using such a grammar for coverage fuzzing will now cover digits in a number of contexts. To be precise, there are five contexts: Regular integers, as well as single-digit and multi-digit whole and fractional parts of floating-point numbers.

In [62]:
f = GrammarCoverageFuzzer(dup_expr_grammar, start_symbol="<factor>")
for i in range(10):
+-(1 / 3 + 6 / 0 - 7 * 59 * 3 + 8 * 4)

The depth parameter controls how deep the duplication should go. Setting depth to 1 will duplicate only the next rule:

In [63]:
dup_expr_grammar = extend_grammar(EXPR_GRAMMAR)
duplicate_context(dup_expr_grammar, "<factor>", "<integer>.<integer>", depth=1)
{'<start>': ['<expr>'],
 '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
 '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
 '<factor>': ['+<factor>',
 '<integer>': ['<digit><integer>', '<digit>'],
 '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<integer-1>': ['<digit><integer-1>', '<digit>'],
 '<integer-2>': ['<digit><integer-2>', '<digit>']}

By default, depth is set to $\infty$, indicating unlimited duplication. True unbounded duplication could lead to problems for a recursive grammar such as EXPR_GRAMMAR, so duplicate_context() is set to no longer duplicate symbols once duplicated. Still, if we apply it to duplicate all <expr> expansions, we obtain a grammar with no less than 296 rules:

In [65]:
dup_expr_grammar = extend_grammar(EXPR_GRAMMAR)
duplicate_context(dup_expr_grammar, "<expr>")
In [66]:
assert is_valid_grammar(dup_expr_grammar)

This gives us almost 2000 expansions to cover:

In [67]:
f = GrammarCoverageFuzzer(dup_expr_grammar)

Duplicating one more time keeps on both growing the grammar and the coverage requirements:

In [68]:
dup_expr_grammar = extend_grammar(EXPR_GRAMMAR)
duplicate_context(dup_expr_grammar, "<expr>")
duplicate_context(dup_expr_grammar, "<expr-1>")
In [69]:
f = GrammarCoverageFuzzer(dup_expr_grammar)

At this point, plenty of contexts can be covered individually – for instance, multiplications of elements within additions:

In [70]:
['<term-1> + <expr-4>', '<term-5> - <expr-8>', '<term-9>']
In [71]:
['<factor-1-1> * <term-1-1>', '<factor-2-1> / <term-1-1>', '<factor-3-1>']
In [72]:

The resulting grammars may no longer be useful for human maintenance; but running a coverage-driven fuzzer such as GrammarCoverageFuzzer() will then go and cover all these expansions in all contexts. If you want to cover elements in a large number of contexts, then duplicate_context() followed by a coverage-driven fuzzer is your friend.

Covering Code by Covering Grammars

With or without context: By systematically covering all input elements, we get a larger variety in our inputs – but does this translate into a wider variety of program behaviors? After all, these behaviors are what we want to cover, including the unexpected behaviors.

In a grammar, there are elements that directly correspond to program features. A program handling arithmetic expressions will have functionality that is directly triggered by individual elements - say, an addition feature triggered by the presence of +, subtraction triggered by the presence of -, and floating-point arithmetic triggered by the presence of floating-point numbers in the input.

Such a connection between input structure and functionality leads to a strong correlation between grammar coverage and code coverage. In other words: If we can achieve a high grammar coverage, this also leads to a high code coverage.

CGI Grammars

Let us explore this relationship on one of our grammars – say, the CGI decoder from the chapter on coverage. We compute a mapping coverages where in coverages[x] = {y_1, y_2, ...}, x is the grammar coverage obtained, and y_n is the code coverage obtained for the n-th run.

We first compute the maximum coverage, as in the the chapter on coverage:

In [74]:
with Coverage() as cov_max:

Now, we run our experiment:

In [75]:
f = GrammarCoverageFuzzer(CGI_GRAMMAR, max_nonterminals=2)
coverages = {}

trials = 100
for trial in range(trials):
    overall_cov = set()
    max_cov = 30

    for i in range(10):
        s = f.fuzz()
        with Coverage() as cov:
        overall_cov |= cov.coverage()

        x = len(f.expansion_coverage()) * 100 / len(f.max_expansion_coverage())
        y = len(overall_cov) * 100 / len(cov_max.coverage())
        if x not in coverages:
            coverages[x] = []

We compute the averages for the y-values:

In [76]:
xs = list(coverages.keys())
ys = [sum(coverages[x]) / len(coverages[x]) for x in coverages]

and create a scatter plot:

In [79]:
ax = plt.axes(label="coverage")

plt.xlim(0, max(xs))
plt.ylim(0, max(ys))

plt.title('Coverage of cgi_decode() vs. grammar coverage')
plt.xlabel('grammar coverage (expansions)')
plt.ylabel('code coverage (lines)')
plt.scatter(xs, ys);

We see that the higher the grammar coverage, the higher the code coverage.

This also translates into a correlation coefficient of about 0.9, indicating a strong correlation:

In [81]:
np.corrcoef(xs, ys)
array([[1.        , 0.81663071],
       [0.81663071, 1.        ]])

This is also confirmed by the Spearman rank correlation:

In [83]:
spearmanr(xs, ys)
SpearmanrResult(correlation=0.937547293248041, pvalue=1.0928720949027369e-09)

URL Grammars

Let us repeat this experiment on URL grammars. We use the same code as above, except for exchanging the grammars and the function in place:

Again, we first compute the maximum coverage, making an educated guess as in the the chapter on coverage:

In [85]:
with Coverage() as cov_max:

Here comes the actual experiment:

In [86]:
f = GrammarCoverageFuzzer(URL_GRAMMAR, max_nonterminals=2)
coverages = {}

trials = 100
for trial in range(trials):
    overall_cov = set()

    for i in range(20):
        s = f.fuzz()
        with Coverage() as cov:
        overall_cov |= cov.coverage()

        x = len(f.expansion_coverage()) * 100 / len(f.max_expansion_coverage())
        y = len(overall_cov) * 100 / len(cov_max.coverage())
        if x not in coverages:
            coverages[x] = []
In [87]:
xs = list(coverages.keys())
ys = [sum(coverages[x]) / len(coverages[x]) for x in coverages]
In [88]:
ax = plt.axes(label="coverage")

plt.xlim(0, max(xs))
plt.ylim(0, max(ys))

plt.title('Coverage of urlparse() vs. grammar coverage')
plt.xlabel('grammar coverage (expansions)')
plt.ylabel('code coverage (lines)')
plt.scatter(xs, ys);

Here, we have an even stronger correlation of more than .95:

In [89]:
np.corrcoef(xs, ys)
array([[1.        , 0.94110679],
       [0.94110679, 1.        ]])

This is also confirmed by the Spearman rank correlation:

In [90]:
spearmanr(xs, ys)
SpearmanrResult(correlation=1.0, pvalue=0.0)

We conclude: If one wants to obtain high code coverage, it is a good idea to strive for high grammar coverage first.

Will this always work?

The correlation observed for the CGI and URL examples will not hold for every program and every structure.

Equivalent Elements

First, some grammar elements are treated uniformly by a program even though the grammar sees them as different symbols. In the host name of a URL, for instance, we can have many different characters, although a URL-handling program treats them all the same. Likewise, individual digits, once composed into a number, make less of a difference than the value of the number itself. Hence, achieving variety in digits or characters will not necessarily yield a large difference in functionality.

This problem can be addressed by differentiating elements dependent on their context, and covering alternatives for each context, as discussed above. The key is to identify the contexts in which variety is required, and those where it is not.

Deep Data Processing

Second, the way the data is processed can make a large difference. Consider the input to a media player, consisting of compressed media data. While processing the media data, the media player will show differences in behavior (notably in its output), but these differences cannot be directly triggered through individual elements of the media data. Likewise, a machine learner that is trained on a large set of inputs typically will not have its behavior controlled by a single syntactic element of the input. (Well, it could, but then, we would not need a machine learner.) In these cases of "deep" data processing, achieving structural coverage in the grammar will not necessarily induce code coverage.

One way to address this problem is to achieve not only syntactic, but actually semantic variety. In the chapter on fuzzing with constraints, we will see how to specifically generate and filter input values, especially numerical values. Such generators can also be applied in context, such that each and every facet of the input can be controlled individually. Also, in the above examples, some parts of the input can still be covered structurally: Metadata (such as author name or composer for the media player) or configuration data (such as settings for the machine learner) can and should be covered systematically; we will see how this is done in the chapter on "Configuration fuzzing".

Lessons Learned

  • Achieving grammar coverage quickly results in a large variety of inputs.
  • Duplicating grammar rules allows to cover elements in specific contexts.
  • Achieving grammar coverage can help in obtaining code coverage.


The idea of ensuring that each expansion in the grammar is used at least once goes back to Burkhardt \cite{Burkhardt1967}, to be later rediscovered by Paul Purdom \cite{Purdom1972}. The relation between grammar coverage and code coverage was discovered by Nikolas Havrikov, who explores it in his PhD thesis.


Exercise 1: Testing ls

Consider the Unix ls program, used to list the contents of a directory. Create a grammar for invoking ls:

In [91]:
    '<start>': ['-<options>'],
    '<options>': ['<option>*'],
    '<option>': ['1', 'A', '@',
                 # many more

Use GrammarCoverageFuzzer to test all options. Be sure to invoke ls with each option set.

Exercise 2: Caching

The value of max_expansion_coverage() depends on the grammar only. Change the implementation such that the values are precomputed for each symbol and depth upon initialization (__init__()); this way, max_expansion_coverage() can simply lookup the value in the table.