Probabilistic Grammar Fuzzing¶

Let us give grammars even more power by assigning probabilities to individual expansions. This allows us to control how many of each element should be produced, and thus allows us to target our generated tests towards specific functionality. We also show how to learn such probabilities from given sample inputs, and specifically direct our tests towards input features that are uncommon in these samples.

from bookutils import YouTubeVideo
YouTubeVideo('9htOliNwopc')

Prerequisites

You should have read the chapter on grammars.
Our implementation hooks into the grammar-based fuzzer introduced in "Efficient Grammar Fuzzing"
For learning probabilities from samples, we make use of parsers.

Synopsis¶

To use the code provided in this chapter, write

>>> from fuzzingbook.ProbabilisticGrammarFuzzer import <identifier>

and then make use of the following features.

A probabilistic grammar allows attaching individual probabilities to production rules. To set the probability of an individual expansion S to the value X (between 0 and 1), replace it with a pair

(S, opts(prob=X))

If we want to ensure that 90% of phone numbers generated have an area code starting with 9, we can write:

>>> from Grammars import US_PHONE_GRAMMAR, extend_grammar, opts
>>> PROBABILISTIC_US_PHONE_GRAMMAR: Grammar = extend_grammar(US_PHONE_GRAMMAR,
>>> {
>>>       "<lead-digit>": [
>>>                           "2", "3", "4", "5", "6", "7", "8",
>>>                           ("9", opts(prob=0.9))
>>>                       ],
>>> })

A ProbabilisticGrammarFuzzer will extract and interpret these options. Here is an example:

>>> probabilistic_us_phone_fuzzer = ProbabilisticGrammarFuzzer(PROBABILISTIC_US_PHONE_GRAMMAR)
>>> [probabilistic_us_phone_fuzzer.fuzz() for i in range(5)]
['(918)925-2501',
 '(981)925-0792',
 '(934)995-5029',
 '(955)999-7801',
 '(964)927-0877']

As you can see, the large majority of area codes now starts with 9.

The Law of Leading Digits¶

In all our examples so far, you may have noted that inputs generated by a program differ quite a bit from "natural" inputs as they occur in real life. This is true even for innocuous elements such as numbers – yes, the numbers we have generated so far actually differ from numbers in the real world. This is because in real-life sets of numerical data, the leading significant digit is likely to be small: Actually, on average, the leading digit 1 occurs more than six times as often as the leading digit 8 or 9. It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, physical and mathematical constants (Wikipedia).

This law of leading digits was first observed by Newcomb [Simon Newcomb, 1881] and later formalized by Benford in [Frank Benford, 1938]. Let us take a look at the conditions that determine the first digit of a number. We can easily compute the first digit by converting the number into a string and take the first character:

def first_digit_via_string(x: int) -> int:
    return ord(repr(x)[0]) - ord('0')

first_digit_via_string(2001)

To do this mathematically, though, we have to take the fractional part of their logarithm, or formally

$$ d = 10^{\{\log_{10}(x)\}} $$

where $\{x\}$ is the fractional part of $x$ (i.e. $\{1.234\} = 0.234$).

import math

def first_digit_via_log(x: int) -> int:
    frac, whole = math.modf(math.log10(x))
    return int(10 ** frac)

first_digit_via_log(2001)

Most sets of "naturally" occurring numbers should not have any bias in the fractional parts of their logarithms, and hence, the fractional part $\{\log_{10}(x)\}$ is typically uniformly distributed. However, the fractional parts for the individual digits are not evenly distributed.

For a number to start with a digit $d$, the condition $d < 10^{\{\log_{10}(x)\}} < d + 1$ must hold. To start with the digit 1, the fractional part $\{\log_{10}(x)\}$ must thus be in the range

(math.log10(1), math.log10(2))

To start with the digit 2, though, it must be in the range

(math.log10(2), math.log10(3))

which is much smaller. Formally, the probability $P(d)$ for a leading digit $d$ (again, assuming uniformly distributed fractional parts) is known as Benford's law: $$ P(d) = \log_{10}(d + 1) - \log_{10}(d) $$ which gives us:

def prob_leading_digit(d: int) -> float:
    return math.log10(d + 1) - math.log10(d)

Let us compute these probabilities for all digits:

digit_probs = [prob_leading_digit(d) for d in range(1, 10)]
[(d, "%.2f" % digit_probs[d - 1]) for d in range(1, 10)]

We see that a leading 1 is indeed six times as probable as a leading 9.

Benford's law has a number of applications. Most notably, it can be used to detect "non-natural" numbers, i.e. numbers that apparently were created randomly rather than coming from a "natural" source. if you write a scientific paper and fake data by putting in random numbers (for instance, using our grammar fuzzer on integers), you will likely violate Benford's law, and this can indeed be spotted. On the other hand, how would we proceed if we wanted to create numbers that adhere to Benford's law? To this end, we need to be able to encode probabilities such as the above in our grammar, such that we can ensure that a leading digit is indeed a 1 in 30% of all cases.

Specifying Probabilities¶

The goal of this chapter is to assign probabilities to individual expansions in the grammar, such that we can express that some expansion alternatives should be favored over others. This is not only useful to generate "natural"-looking numbers, but even more so to direct test generation towards a specific goal. If you recently have changed some code in your program, you would probably like to generate inputs that exercise precisely this code. By raising the probabilities on the input elements associated with the changed code, you will get more tests that exercise the changed code.

Our concept for expressing probabilities is to annotate individual expansions with attributes such as probabilities, using the annotation mechanism introduced in the chapter on grammars. To this end, we allow that an expansion cannot only be a string, but also a pair of a string and a set of attributes, as in

"<expr>":
        [("<term> + <expr>", opts(prob=0.1)),
         ("<term> - <expr>", opts(prob=0.2)),
         "<term>"]

Here, the opts() function would allow us to express probabilities for choosing the individual expansions. The addition would have a probability of 10%, the subtraction of 20%. The remaining probability (in this case 70%) is equally distributed over the non-attributed expansions (in this case the single last one).

We can now use pairs with opts() to assign probabilities to our expression grammar:

import bookutils.setup

from Fuzzer import Fuzzer

from GrammarFuzzer import GrammarFuzzer, all_terminals, display_tree, DerivationTree

from Grammars import is_valid_grammar, EXPR_GRAMMAR, START_SYMBOL, crange
from Grammars import opts, exp_string, exp_opt, set_opts
from Grammars import Grammar, Expansion

from typing import List, Dict, Set, Optional, cast, Any, Tuple

PROBABILISTIC_EXPR_GRAMMAR: Grammar = {
    "<start>":
        ["<expr>"],

    "<expr>":
        [("<term> + <expr>", opts(prob=0.1)),
         ("<term> - <expr>", opts(prob=0.2)),
         "<term>"],

    "<term>":
        [("<factor> * <term>", opts(prob=0.1)),
         ("<factor> / <term>", opts(prob=0.1)),
         "<factor>"
         ],

    "<factor>":
        ["+<factor>", "-<factor>", "(<expr>)",
            "<leadinteger>", "<leadinteger>.<integer>"],

    "<leadinteger>":
        ["<leaddigit><integer>", "<leaddigit>"],

    # Benford's law: frequency distribution of leading digits
    "<leaddigit>":
        [("1", opts(prob=0.301)),
         ("2", opts(prob=0.176)),
         ("3", opts(prob=0.125)),
         ("4", opts(prob=0.097)),
         ("5", opts(prob=0.079)),
         ("6", opts(prob=0.067)),
         ("7", opts(prob=0.058)),
         ("8", opts(prob=0.051)),
         ("9", opts(prob=0.046)),
         ],

    # Remaining digits are equally distributed
    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
}

assert is_valid_grammar(PROBABILISTIC_EXPR_GRAMMAR, supported_opts={'prob'})

This is how the grammar expansions are represented internally:

leaddigits: List[Expansion] = PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"]
leaddigits

However, we typically access the expansion string and the associated probability via designated helper functions, exp_string() (from the chapter on Grammars) and exp_prob():

leaddigit_expansion = leaddigits[0]
leaddigit_expansion

exp_string(leaddigit_expansion)

def exp_prob(expansion: Expansion) -> float:
    """Return the options of an expansion"""
    return exp_opt(expansion, 'prob')

exp_prob(leaddigit_expansion)

Our existing fuzzers are all set up to work with grammars annotated this way. They simply ignore all annotations.

f = GrammarFuzzer(PROBABILISTIC_EXPR_GRAMMAR)
f.fuzz()

from GrammarCoverageFuzzer import GrammarCoverageFuzzer  # minor dependency

f = GrammarCoverageFuzzer(PROBABILISTIC_EXPR_GRAMMAR)
f.fuzz()

Computing Probabilities¶

Let us define functions that access probabilities for given expansions. While doing so, they also check for inconsistencies.

Distributing Probabilities¶

Here is how we distribute probabilities for expansions without specified probabilities. Given an expansion rule

$$S ::= a_1\:|\: a_2 \:|\: \dots \:|\: a_n \:|\: u_1 \:|\: u_2 \:|\: \dots u_m$$

with $n \ge 0$ alternatives $a_i$ for which the probability $p(a_i)$ is specified and $m \ge 0$ alternatives $u_j$ for which the probability $p(u_j)$ is unspecified, the "remaining" probability is distributed equally over all $u_j$; in other words,

$$p(u_j) = \frac{1 - \sum_{i = 1}^{n}p(a_i)}{m}$$

If no probabilities are specified ($n = 0$), then all expansions have the same probability.

The overall sum of probabilities must be 1:

$$\sum_{i = 1}^{n} p(a_i) + \sum_{j = 1}^{m} p(u_i) = 1$$

We check these properties while distributing probabilities.

The function exp_probabilities() returns a mapping of all expansions in a rule to their respective probabilities.

def exp_probabilities(expansions: List[Expansion],
                      nonterminal: str ="<symbol>") \
        -> Dict[Expansion, float]:
    probabilities = [exp_prob(expansion) for expansion in expansions]
    prob_dist = prob_distribution(probabilities, nonterminal)

    prob_mapping: Dict[Expansion, float] = {}
    for i in range(len(expansions)):
        expansion = exp_string(expansions[i])
        prob_mapping[expansion] = prob_dist[i]

    return prob_mapping

The gist of exp_probabilities() is handled in prob_distribution(), which does the actual checking and computation.

def prob_distribution(probabilities: List[Optional[float]],
                      nonterminal: str = "<symbol>"):
    epsilon = 0.00001

    number_of_unspecified_probabilities = probabilities.count(None)
    if number_of_unspecified_probabilities == 0:
        sum_probabilities = cast(float, sum(probabilities))
        assert abs(sum_probabilities - 1.0) < epsilon, \
            nonterminal + ": sum of probabilities must be 1.0"
        return probabilities

    sum_of_specified_probabilities = 0.0
    for p in probabilities:
        if p is not None:
            sum_of_specified_probabilities += p
    assert 0 <= sum_of_specified_probabilities <= 1.0, \
        nonterminal + ": sum of specified probabilities must be between 0.0 and 1.0"

    default_probability = ((1.0 - sum_of_specified_probabilities)
                           / number_of_unspecified_probabilities)
    all_probabilities = []
    for p in probabilities:
        if p is None:
            p = default_probability
        all_probabilities.append(p)

    assert abs(sum(all_probabilities) - 1.0) < epsilon
    return all_probabilities

Here's the mapping exp_probabilities() returns for the annotated <leaddigit> element:

print(exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"]))

If no expansion is annotated, all expansions have the same likelihood of being selected, as in our previous grammar fuzzers.

print(exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<digit>"]))

Here's how exp_probabilities() distributes any remaining probability across non-annotated expansions:

exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<expr>"])

Checking Probabilities¶

We can use the checking capabilities of exp_probabilities() to check a probabilistic grammar for consistency:

def is_valid_probabilistic_grammar(grammar: Grammar,
                                   start_symbol: str = START_SYMBOL) -> bool:
    if not is_valid_grammar(grammar, start_symbol):
        return False

    for nonterminal in grammar:
        expansions = grammar[nonterminal]
        _ = exp_probabilities(expansions, nonterminal)

    return True

assert is_valid_probabilistic_grammar(PROBABILISTIC_EXPR_GRAMMAR)

assert is_valid_probabilistic_grammar(EXPR_GRAMMAR)

from ExpectError import ExpectError

with ExpectError():
    assert is_valid_probabilistic_grammar({"<start>": [("1", opts(prob=0.5))]})

with ExpectError():
    assert is_valid_probabilistic_grammar(
        {"<start>": [("1", opts(prob=1.5)), "2"]})

Expanding by Probability¶

Now that we have seen how to specify probabilities for a grammar, we can actually implement probabilistic expansion. In our ProbabilisticGrammarFuzzer, it suffices to overload one method, namely choose_node_expansion(). For each of the children we can choose from (typically all expansions of a symbol), we determine their probability (using exp_probabilities() defined above), and make a weighted choice using random.choices() with a weight argument.

import random

class ProbabilisticGrammarFuzzer(GrammarFuzzer):
    """A grammar-based fuzzer respecting probabilities in grammars."""

    def check_grammar(self) -> None:
        super().check_grammar()
        assert is_valid_probabilistic_grammar(self.grammar, self.start_symbol)

    def supported_opts(self) -> Set[str]:
        return super().supported_opts() | {'prob'}

class ProbabilisticGrammarFuzzer(ProbabilisticGrammarFuzzer):
    def choose_node_expansion(self, node: DerivationTree,
                              children_alternatives: List[Any]) -> int:
        (symbol, tree) = node
        expansions = self.grammar[symbol]
        probabilities = exp_probabilities(expansions)

        weights: List[float] = []
        for children in children_alternatives:
            expansion = all_terminals((symbol, children))
            children_weight = probabilities[expansion]
            if self.log:
                print(repr(expansion), "p =", children_weight)
            weights.append(children_weight)

        if sum(weights) == 0:
            # No alternative (probably expanding at minimum cost)
            return random.choices(
                range(len(children_alternatives)))[0]
        else:
            return random.choices(
                range(len(children_alternatives)), weights=weights)[0]

Our probabilistic grammar fuzzer works just like the non-probabilistic grammar fuzzer, except that it actually respects probability annotations. Let us generate a couple of "natural" numbers that respect Benford's law:

natural_fuzzer = ProbabilisticGrammarFuzzer(
    PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leadinteger>")
print([natural_fuzzer.fuzz() for i in range(20)])

In contrast, these numbers are pure random:

integer_fuzzer = GrammarFuzzer(
    PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leadinteger>")
print([integer_fuzzer.fuzz() for i in range(20)])

Are the "natural" numbers really more "natural" than the random ones? To show that ProbabilisticGrammarFuzzer indeed respects the probabilistic annotations, let us create a specific fuzzer for the lead digit:

leaddigit_fuzzer = ProbabilisticGrammarFuzzer(
    PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leaddigit>")
leaddigit_fuzzer.fuzz()

If we generate thousands of lead digits, their distribution should again follow Benford's law:

trials = 10000

count = {}
for c in crange('0', '9'):
    count[c] = 0

for i in range(trials):
    count[leaddigit_fuzzer.fuzz()] += 1

print([(digit, count[digit] / trials) for digit in count])

Quod erat demonstrandum! The distribution is pretty much exactly as originally specified. We now have a fuzzer where we can exercise control by specifying probabilities.

Directed Fuzzing¶

Assigning probabilities to individual expansions gives us great control over which inputs should be generated. By choosing probabilities wisely, we can direct fuzzing towards specific functions and features – for instance, towards functions that are particularly critical, prone to failures, or that have been recently changed.

As an example, consider the URL grammar from the chapter on grammars. Let us assume we have just made a change to our implementation of the secure FTP protocol. By assigning a higher probability to the ftps scheme, we can generate more URLs that will specifically test this functionality.

First, let us define a helper function that sets a particular option:

Here's a specialization just for probabilities:

def set_prob(grammar: Grammar, symbol: str, 
             expansion: Expansion, prob: Optional[float]) -> None:
    """Set the probability of the given expansion of grammar[symbol]"""
    set_opts(grammar, symbol, expansion, opts(prob=prob))

Let us use set_prob() to give the ftps expansion a probability of 80%:

from Grammars import URL_GRAMMAR, extend_grammar

probabilistic_url_grammar = extend_grammar(URL_GRAMMAR)
set_prob(probabilistic_url_grammar, "<scheme>", "ftps", 0.8)
assert is_valid_probabilistic_grammar(probabilistic_url_grammar)

probabilistic_url_grammar["<scheme>"]

If we use this grammar for fuzzing, we will get plenty of ftps: prefixes:

prob_url_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
for i in range(10):
    print(prob_url_fuzzer.fuzz())

In a similar vein, we can direct URL generation towards specific hosts or ports; we can favor URLs with queries, fragments, or logins – or URLs without these. All it takes is to set appropriate probabilities.

By setting the probability of an expansion to zero, we can effectively disable specific expansions:

set_prob(probabilistic_url_grammar, "<scheme>", "ftps", 0.0)
assert is_valid_probabilistic_grammar(probabilistic_url_grammar)

prob_url_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
for i in range(10):
    print(prob_url_fuzzer.fuzz())

Note that even if we set the probability of an expansion to zero, we may still see the expansion taken. This can happen during the "closing" phase of our grammar fuzzer, when the expansion is closed at minimum cost. At this stage, even expansions with "zero" probability will be taken if this is necessary for closing the expansion.

Let us illustrate this feature using the <expr> rule from our expression grammar:

from Grammars import EXPR_GRAMMAR

probabilistic_expr_grammar = extend_grammar(EXPR_GRAMMAR)
probabilistic_expr_grammar["<expr>"]

If we set the probability of the <term> expansion to zero, the string should expand again and again.

set_prob(probabilistic_expr_grammar, "<expr>", "<term>", 0.0)
assert is_valid_probabilistic_grammar(probabilistic_expr_grammar)

Still, in the "closing" phase, subexpressions will eventually expand into <term>, as it is the only way to close the expansion. Tracking choose_node_expansion() shows that it is invoked with only one possible expansion <term>, which has to be taken even though its specified probability is zero.

prob_expr_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_expr_grammar)
prob_expr_fuzzer.fuzz()

Probabilities in Context¶

While specified probabilities give us a means to control which expansions are taken how often, this control by itself may not be enough. As an example, consider the following grammar for IPv4 addresses:

def decrange(start: int, end: int) -> List[Expansion]:
    """Return a list with string representations of numbers in the range [start, end)"""
    return [repr(n) for n in range(start, end)]

IP_ADDRESS_GRAMMAR: Grammar = {
    "<start>": ["<address>"],
    "<address>": ["<octet>.<octet>.<octet>.<octet>"],
    # ["0", "1", "2", ..., "255"]
    "<octet>": decrange(0, 256)
}

print(IP_ADDRESS_GRAMMAR["<octet>"][:20])

assert is_valid_grammar(IP_ADDRESS_GRAMMAR)

We can easily use this grammar to create IP addresses:

ip_fuzzer = ProbabilisticGrammarFuzzer(IP_ADDRESS_GRAMMAR)
ip_fuzzer.fuzz()

However, if we want to assign a specific probability to one of the four octets, we are out of luck. All we can do is to assign the same probability distribution for all four octets:

probabilistic_ip_address_grammar = extend_grammar(IP_ADDRESS_GRAMMAR)
set_prob(probabilistic_ip_address_grammar, "<octet>", "127", 0.8)

probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
    probabilistic_ip_address_grammar)
probabilistic_ip_fuzzer.fuzz()

If we want to assign different probabilities to each of the four octets, what do we do?

The answer lies in the concept of context, which we already have seen while discussing coverage-driven fuzzers. As with coverage-driven fuzzing, the idea is to duplicate the element whose probability we want to set dependent on its context. In our case, this means to duplicate the <octet> element to four individual ones, each of which can then get an individual probability distribution. We can do this programmatically, using the duplicate_context() method:

from GrammarCoverageFuzzer import duplicate_context  # minor dependency

probabilistic_ip_address_grammar = extend_grammar(IP_ADDRESS_GRAMMAR)
duplicate_context(probabilistic_ip_address_grammar, "<address>")

probabilistic_ip_address_grammar["<address>"]

We can now assign different probabilities to each of the <octet> symbols. For instance, we can force specific expansions by setting their probability to 100%:

set_prob(probabilistic_ip_address_grammar, "<octet-1>", "127", 1.0)
set_prob(probabilistic_ip_address_grammar, "<octet-2>", "0", 1.0)

assert is_valid_probabilistic_grammar(probabilistic_ip_address_grammar)

The remaining two octets <octet-3> and <octet-4> have no specific probabilities set. During fuzzing, all their expansions (all octets) are thus still available:

probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
    probabilistic_ip_address_grammar)
[probabilistic_ip_fuzzer.fuzz() for i in range(5)]

Just as with coverage, we can duplicate grammar rules arbitrarily often to get more and more finer-grained control over probabilities. However, this finer-grained control also comes at the cost of having to maintain these probabilities. In the next section, we will therefore discuss means to assign and tune such probabilities automatically.

Learning Probabilities from Samples¶

Probabilities need not be set manually all the time. They can also be learned from other sources, notably by counting how frequently individual expansions occur in a given set of inputs. This is useful in a number of situations, including:

Test common features. The idea is that during testing, one may want to focus on frequently occurring (or frequently used) features first, to ensure correct functionality for the most common usages.
Test uncommon features. Here, the idea is to have test generation focus on features that are rarely seen (or not seen at all) in inputs. This is the same motivation as with grammar coverage, but from a probabilistic standpoint.
Focus on specific slices. One may have a set of inputs that is of particular interest (for instance, because they exercise a critical functionality, or recently have discovered bugs). Using this learned distribution for fuzzing allows us to focus on precisely these functionalities of interest.

Let us first introduce counting expansions and learning probabilities, and then detail these scenarios.

Counting Expansions¶

We start with implementing a means to take a set of inputs and determine the number of expansions in that set. To this end, we need the parsers introduced in the previous chapter to transform a string input into a derivation tree. For our IP address grammar, this is how this works:

from Parser import Parser, EarleyParser

IP_ADDRESS_TOKENS = {"<octet>"}  # EarleyParser needs explicit tokens

parser = EarleyParser(IP_ADDRESS_GRAMMAR)

tree, *_ = parser.parse("127.0.0.1")
display_tree(tree)

In a tree such as this one, we can now count individual expansions. In the above tree, for instance, we have two expansions of <octet> into 0, one into 1, and one into 127. The expansion <octet> into 0 makes up 50% of all expansions seen; the expansions into 127 and 1 make up 25% each, and the other ones 0%. These are the probabilities we'd like to assign to our "learned" grammar.

We introduce a class ExpansionCountMiner which allows us to count how frequently individual expansions take place. Its initialization method takes a parser (say, an EarleyParser) that would be initialized with the appropriate grammar.

from GrammarCoverageFuzzer import expansion_key  # minor dependency

from Grammars import is_nonterminal

class ExpansionCountMiner:
    def __init__(self, parser: Parser, log: bool = False) -> None:
        assert isinstance(parser, Parser)
        self.grammar = extend_grammar(parser.grammar())
        self.parser = parser
        self.log = log
        self.reset()

The attribute expansion_counts holds the expansions seen; adding a tree with add_tree() traverses the given tree and adds all expansions seen.

class ExpansionCountMiner(ExpansionCountMiner):
    def reset(self) -> None:
        self.expansion_counts: Dict[str, int] = {}

    def add_coverage(self, symbol: str, children: List[DerivationTree]) -> None:
        key = expansion_key(symbol, children)

        if self.log:
            print("Found", key)

        if key not in self.expansion_counts:
            self.expansion_counts[key] = 0
        self.expansion_counts[key] += 1

    def add_tree(self, tree: DerivationTree) -> None:
        (symbol, children) = tree
        if not is_nonterminal(symbol):
            return
        assert children is not None

        direct_children: List[DerivationTree] = [
            (symbol, None) if is_nonterminal(symbol) 
            else (symbol, []) for symbol, c in children]
        self.add_coverage(symbol, direct_children)

        for c in children:
            self.add_tree(c)

The method count_expansions() is the one facing the public; it takes a list of inputs, parses them, and processes the resulting trees. The method counts() returns the counts found.

class ExpansionCountMiner(ExpansionCountMiner):
    def count_expansions(self, inputs: List[str]) -> None:
        for inp in inputs:
            tree, *_ = self.parser.parse(inp)
            self.add_tree(tree)

    def counts(self) -> Dict[str, int]:
        return self.expansion_counts

Let us try this out on our IP address grammar. We create an ExpansionCountMiner for our IP address grammar:

expansion_count_miner = ExpansionCountMiner(EarleyParser(IP_ADDRESS_GRAMMAR))

We parse a (small) set of IP addresses and count the expansions occurring:

expansion_count_miner.count_expansions(["127.0.0.1", "1.2.3.4"])
expansion_count_miner.counts()

You see that we have one expansion into 127, and two into 0. These are the counts we can use to assign probabilities.

Assigning Probabilities¶

The distribution of counts, as determined by ExpansionCountMiner is what we can use to assign probabilities to our grammar. To this end, we introduce a subclass ProbabilisticGrammarMiner whose method set_expansion_probabilities() processes all expansions of a given symbol, checks whether it occurs in a given count distribution, and assigns probabilities using the following formula.

Given a set $T$ of derivation trees (as mined from samples), we determine the probabilities $p_i$ for each alternative $a_i$ of a symbol $S \rightarrow a_1 | \dots | a_n$ as

$$p_i = \frac{\text{Expansions of $S \rightarrow a_i$ in $T$}}{\text{Expansions of $S$ in $T$}}$$

Should $S$ not occur at all in $T$, then $p_i$ is unspecified.

Here is the implementation of set_expansion_probabilities(), implementing the above formula:

class ProbabilisticGrammarMiner(ExpansionCountMiner):
    def set_probabilities(self, counts: Dict[str, int]):
        for symbol in self.grammar:
            self.set_expansion_probabilities(symbol, counts)

    def set_expansion_probabilities(self, symbol: str, counts: Dict[str, int]):
        expansions = self.grammar[symbol]
        if len(expansions) == 1:
            set_prob(self.grammar, symbol, expansions[0], None)
            return

        expansion_counts = [
            counts.get(
                expansion_key(
                    symbol,
                    expansion),
                0) for expansion in expansions]
        total = sum(expansion_counts)
        for i, expansion in enumerate(expansions):
            p = expansion_counts[i] / total if total > 0 else None
            # if self.log:
            #     print("Setting", expansion_key(symbol, expansion), p)
            set_prob(self.grammar, symbol, expansion, p)

The typical use of ProbabilisticGrammarMiner is through mine_probabilistic_grammar(), which first determines a distribution from a set of inputs, and then sets the probabilities accordingly.

class ProbabilisticGrammarMiner(ProbabilisticGrammarMiner):
    def mine_probabilistic_grammar(self, inputs: List[str]) -> Grammar:
        self.count_expansions(inputs)
        self.set_probabilities(self.counts())
        return self.grammar

Let us put this to use. We create a grammar miner for IP addresses:

probabilistic_grammar_miner = ProbabilisticGrammarMiner(
    EarleyParser(IP_ADDRESS_GRAMMAR))

We now use mine_probabilistic_grammar() to mine the grammar:

probabilistic_ip_address_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar([
                                                                                          "127.0.0.1", "1.2.3.4"])

assert is_valid_probabilistic_grammar(probabilistic_ip_address_grammar)

Here's the resulting distribution of octets in our grammar:

[expansion for expansion in probabilistic_ip_address_grammar['<octet>']
    if exp_prob(expansion) > 0]

If we use these probabilities for fuzzing, we will get the same distribution of octets as in our sample:

probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
    probabilistic_ip_address_grammar)
[probabilistic_ip_fuzzer.fuzz() for i in range(10)]

By learning from a sample, we can thus adjust our fuzzing towards the (syntactic) properties of this very sample.

Testing Common Features¶

Let us now get to our three usage scenarios. The first scenario is to create probability distributions right out of a sample, and to use these very distributions during test generation. This helps to focus test generation on those features that are most commonly used, which thus minimizes the risk of customers encountering failures.

To illustrate testing of common features, we choose the URL domain. Let us assume that we are running some Web-related service, and this is a sample of the URLs our customers access most:

URL_SAMPLE: List[str] = [
    "https://user:password@cispa.saarland:80/",
    "https://fuzzingbook.com?def=56&x89=3&x46=48&def=def",
    "https://cispa.saarland:80/def?def=7&x23=abc",
    "https://fuzzingbook.com:80/",
    "https://fuzzingbook.com:80/abc?def=abc&abc=x14&def=abc&abc=2&def=38",
    "ftps://fuzzingbook.com/x87",
    "https://user:password@fuzzingbook.com:6?def=54&x44=abc",
    "http://fuzzingbook.com:80?x33=25&def=8",
    "http://fuzzingbook.com:8080/def",
]

Using the Earley parser from the chapter on parsers, we can parse any of these inputs into a parse tree; we have to specify a token set, though.

URL_TOKENS: Set[str] = {"<scheme>", "<userinfo>", "<host>", "<port>", "<id>"}

url_parser = EarleyParser(URL_GRAMMAR, tokens=URL_TOKENS)
url_input = URL_SAMPLE[2]
print(url_input)
tree, *_ = url_parser.parse(url_input)
display_tree(tree)

Let us apply our ProbabilisticGrammarMiner class on these inputs, using the above url_parser parser, and obtain a probabilistic URL grammar:

probabilistic_grammar_miner = ProbabilisticGrammarMiner(url_parser)
probabilistic_url_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
    URL_SAMPLE)

These are the counts we obtained during parsing:

print(probabilistic_grammar_miner.counts())

These counts translate into individual probabilities. We see that in our sample, most URLs use the https: scheme, whereas there is no input using the ftp: scheme.

probabilistic_url_grammar['<scheme>']

Likewise, we see that most given URLs have multiple parameters:

probabilistic_url_grammar['<params>']

When we use this probabilistic grammar for fuzzing, these distributions are reflected in our generated inputs – no ftp: schemes either, and most inputs have multiple parameters.

g = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
[g.fuzz() for i in range(10)]

Being able to replicate a probability distribution learned from a sample is not only important for focusing on commonly used features. It can also help in achieving valid inputs, in particular if one learns probabilities in context, as discussed above: If within a given context, some elements are more likely than others (because they depend on each other), a learned probability distribution will reflect this; and hence, inputs generated from this learned probability distribution will have a higher chance to be valid, too. We will explore this further in the exercises, below.

Testing Uncommon Features¶

So far, we have focused on common features; but from a testing perspective, one may just as well test uncommon features – that is, features that rarely occur in our usage samples and therefore would be less exercised in practice. This is a common scenario in security testing, where one focuses on uncommon (and possibly lesser-known) features, as fewer users means fewer bugs reported, and thus more bugs left to be found and exploited.

To have our probabilistic grammar fuzzer focus on uncommon features, we change the learned probabilities such that commonly occurring features (i.e., those with a high learned probability) get a low probability, and vice versa: The last shall be first, and the first last. A particularly simple way to achieve such an inversion of probabilities is to swap them: The alternatives with the highest and lowest probability swaps their probabilities, as so the alternatives with the second-highest and second-lowest probability, the alternatives with the third highest and lowest, and so on.

The function invert_expansion() takes an expansion (a list of alternatives) from a grammar and returns a new inverted expansion in which the probabilities have been swapped according to the rule above. It creates a list of indexes, sorts it by increasing probability, and then for each $n$-th element, assigns it the probability of the $n$-th last element in the indexes.

import copy

def invert_expansion(expansion: List[Expansion]) -> List[Expansion]:
    def sort_by_prob(x: Tuple[int, float]) -> float:
        index, prob = x
        return prob if prob is not None else 0.0

    inverted_expansion: List[Expansion] = copy.deepcopy(expansion)
    indexes_and_probs = [(index, exp_prob(alternative))
                         for index, alternative in enumerate(expansion)]
    indexes_and_probs.sort(key=sort_by_prob)
    indexes = [i for (i, _) in indexes_and_probs]

    for j in range(len(indexes)):
        k = len(indexes) - 1 - j
        # print(indexes[j], "gets", indexes[k])
        inverted_expansion[indexes[j]][1]['prob'] = expansion[indexes[k]][1]['prob']

    return inverted_expansion

Here's invert_expansion() in action. This is our original probability distribution for URL schemes:

probabilistic_url_grammar['<scheme>']

And this is the "inverted" distribution. We see that the ftp: scheme, which previously had a probability of zero, now has the highest probability, whereas the most common scheme, https:, now has the previous zero probability of the ftp: scheme.

invert_expansion(probabilistic_url_grammar['<scheme>'])

One nice feature of this swapping of probabilities is that the sum of probabilities stays unchanged; no normalization is needed. Another nice feature is that the inversion of the inversion returns the original distribution:

invert_expansion(invert_expansion(probabilistic_url_grammar['<scheme>']))

Note that our implementation does not universally satisfy this property: If two alternatives $a_1$ and $a_2$ in the expansion share the same probability, then the second inversion may assign different probabilities to $a_1$ and $a_2$.

We can apply this inversion of expansions across the entire grammar:

def invert_probs(grammar: Grammar) -> Grammar:
    inverted_grammar = extend_grammar(grammar)
    for symbol in grammar:
        inverted_grammar[symbol] = invert_expansion(grammar[symbol])
    return inverted_grammar

This means that probabilities would be swapped for each and every expansion:

probabilistic_url_grammar["<digit>"]

inverted_probabilistic_url_grammar = invert_probs(probabilistic_url_grammar)
inverted_probabilistic_url_grammar["<digit>"]

If we now use this "inverted" grammar for fuzzing, the generated inputs will focus on the complement of the input samples. We will get plenty of tests of user/password features, as well as ftp: schemes – in essence, all the features present in our language, but rarely used (if at all) in our input samples.

g = ProbabilisticGrammarFuzzer(inverted_probabilistic_url_grammar)
[g.fuzz() for i in range(10)]

Besides having only common or only uncommon features, one can also create mixed forms – for instance, testing uncommon features in a common context. This can be helpful for security testing, where one may want an innocuous (common) "envelope" combined with an (uncommon) "payload". It all depends on where and how we tune the probabilities.

Learning Probabilities from Input Slices¶

In our previous examples, we have learned from all inputs to generate common or uncommon inputs. However, we can also learn from a subset of inputs to focus on the features present in that subset (or, conversely, to avoid its features). If we know, for instance, that there is some subset of inputs that covers a functionality of interest (say, because it is particularly critical or because it has been recently changed), we can learn from this very subset and focus our test generation on its features.

To illustrate this approach, let us use the CGI grammar introduced in the chapter on coverage. We have a special interest in Line 25 in our CGI decoder – that is, the line that processes a % character followed by two valid hexadecimal digits:

...
        elif c == '%':
            digit_high, digit_low = s[i + 1], s[i + 2]
            i += 2
            if digit_high in hex_values and digit_low in hex_values:
                v = hex_values[digit_high] * 16 + hex_values[digit_low] ### Line 25
                t += chr(v)
        ...

Let us assume that we do not know precisely under which conditions Line 25 is executed – but still, we'd like to test it thoroughly. With our probability learning tools, we can learn these conditions, though. We start with a set of random inputs and consider the subset that covers Line 25.

from Coverage import Coverage, cgi_decode
from Grammars import CGI_GRAMMAR

cgi_fuzzer = GrammarFuzzer(CGI_GRAMMAR)

trials = 100
coverage = {}

for i in range(trials):
    cgi_input = cgi_fuzzer.fuzz()
    with Coverage() as cov:
        cgi_decode(cgi_input)
    coverage[cgi_input] = cov.coverage()

These are all the random inputs that cover Line 25:

coverage_slice = [cgi_input for cgi_input in coverage
                  if ('cgi_decode', 25) in coverage[cgi_input]]

print(coverage_slice)

Actually, about half of the inputs cover Line 25:

len(coverage_slice) / trials

Let us now learn a probabilistic grammar from this slice of inputs:

probabilistic_grammar_miner = ProbabilisticGrammarMiner(
    EarleyParser(CGI_GRAMMAR))
probabilistic_cgi_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
    coverage_slice)

assert is_valid_probabilistic_grammar(probabilistic_cgi_grammar)

We see that percentage signs are very likely to occur:

probabilistic_cgi_grammar['<letter>']

Using this grammar, we can now generate tests that specifically target Line 25:

probabilistic_cgi_fuzzer = ProbabilisticGrammarFuzzer(
    probabilistic_cgi_grammar)
print([probabilistic_cgi_fuzzer.fuzz() for i in range(20)])

trials = 100
coverage = {}

for i in range(trials):
    cgi_input = probabilistic_cgi_fuzzer.fuzz()
    with Coverage() as cov:
        cgi_decode(cgi_input)
    coverage[cgi_input] = cov.coverage()

We see that the fraction of inputs that cover Line 25 is much higher already, showing that our focusing works:

coverage_slice: List[str] = [cgi_input for cgi_input in coverage
                             if ('cgi_decode', 25) in coverage[cgi_input]]

len(coverage_slice) / trials

Repeating this one more time yields an even higher focusing:

for run in range(3):
    probabilistic_cgi_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
        coverage_slice)
    probabilistic_cgi_fuzzer = ProbabilisticGrammarFuzzer(
        probabilistic_cgi_grammar)

    trials = 100
    coverage = {}

    for i in range(trials):
        cgi_input = probabilistic_cgi_fuzzer.fuzz()
        with Coverage() as cov:
            cgi_decode(cgi_input)
        coverage[cgi_input] = cov.coverage()

    coverage_slice = [cgi_input for cgi_input in coverage
                      if ('cgi_decode', 25) in coverage[cgi_input]]

len(coverage_slice) / trials

By learning (and re-learning) probabilities from a subset of sample inputs, we can specialize fuzzers towards the properties of that subset – in our case, inputs that contain percentage signs and valid hexadecimal letters. The degree to which we can specialize things is induced by the number of variables we can control – in our case, the probabilities for the individual rules. Adding more context to the grammar, as discussed above, will increase the number of variables, and thus the amount of specialization.

A high degree of specialization, however, limits our possibilities to explore combinations that fall outside the selected scope, and limit our possibilities to find bugs induced by these combinations. This trade-off is known as exploration vs. exploitation in machine learning – shall one try to explore as many (possibly shallow) combinations as possible, or focus (exploit) specific areas? In the end, it all depends on where the bugs are, and where we are most likely to find them. Assigning and learning probabilities allows us to control the search strategies – from the common to the uncommon to specific subsets.

Detecting Unnatural Numbers¶

Let us close this chapter by getting back to our introductory example. We said that Benford's law allows us not only to produce, but also to detect "unnatural" lead digit distributions such as the ones produced by simple random choices.

If we use the regular GrammarFuzzer class (which ignores probabilities) to generate (random) lead digits, this is the distribution we get for each leading digit:

sample_size = 1000
random_integer_fuzzer = GrammarFuzzer(
    PROBABILISTIC_EXPR_GRAMMAR,
    start_symbol="<leaddigit>")
random_integers = [random_integer_fuzzer.fuzz() for i in range(sample_size)]

random_counts = [random_integers.count(str(c)) for c in crange('1', '9')]
random_counts

(For simplicity, we use the simple list count() method here rather than deploying the full-fledged ProbabilisticGrammarMiner.)

If we had a natural distribution of lead digits, this is what we would expect:

expected_prob_counts = [
    exp_prob(
        PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"][i]) *
    sample_size for i in range(9)]
print(expected_prob_counts)

And if we had a random distribution, we would expect an equal distribution:

expected_random_counts = [sample_size / 9 for i in range(9)]
print(expected_random_counts)

Which distribution better matches our random_counts lead digits? To this end, we run a $\chi^2$-test to compare the distribution we found (random_counts) against the "natural" lead digit distribution expected_prob_counts and the random distribution expected_random_counts.

from scipy.stats import chisquare

It turns out that there is a zero chance (pvalue = 0.0) that the observed distribution follows a "natural" distribution:

chisquare(random_counts, expected_prob_counts)

However, there is a 97% chance that the observed behavior follows a random distribution:

chisquare(random_counts, expected_random_counts)

Hence, if you find some numbers published and doubt their validity, you can run the above test to check whether they are likely to be natural. Better yet, insist that authors use Jupyter notebooks to produce their results, such that you can check every step of the calculation :-)

Lessons Learned¶

By specifying probabilities, one can steer fuzzing towards input features of interest.
Learning probabilities from samples allows one to focus on features that are common or uncommon in input samples.
Learning probabilities from a subset of samples allows one to produce more similar inputs.

Next Steps¶

Now that we have brought together probabilities and grammars (and revisited parsers and grammars), we have created a foundation for many applications. Our next chapters will focus on

how to reduce failing inputs to a minimum
how to carve and produce tests at the function level
how to automatically test (Web) user interfaces

Enjoy!

Background¶

The idea of mining probabilities by parsing a corpus of data was first covered in "Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data" [Patra et al, 2016] which also learns and applies probabilistic rules for derivation trees. Applying this idea on probabilistic grammars as well as inverting probabilities or learning from slices was first executed in the work "Inputs from Hell: Generating Uncommon Inputs from Common Samples" [Esteban Pavese et al, 2018].

Our exposition of Benford's law follows this article.

Exercises¶

Exercise 1: Probabilistic Fuzzing with Coverage¶

Create a class ProbabilisticGrammarCoverageFuzzer that extends GrammarCoverageFuzzer with probabilistic capabilities. The idea is to first cover all uncovered expansions (like GrammarCoverageFuzzer) and once all expansions are covered, to proceed by probabilities (like ProbabilisticGrammarFuzzer).

To this end, define new instances of the choose_covered_node_expansion() and choose_uncovered_node_expansion() methods that choose an expansion based on the given weights.

Use the notebook to work on the exercises and see solutions.

If you are an advanced programmer, realize the class via multiple inheritance from GrammarCoverageFuzzer and ProbabilisticGrammarFuzzer to achieve this.

Multiple inheritance is a tricky thing. If you have two classes $A'$ and $A''$ which both inherit from $A$, the same method $m()$ of $A$ may be overloaded in both $A'$ and $A''$. If one now inherits from both $A'$ and $A''$, and calls $m()$, which of the $m()$ implementations should be called? Python "resolves" this conflict by simply invoking the one $m()$ method in the class one inherits from first.

To avoid such conflicts, one can check whether the order in which one inherits makes a difference. The method inheritance_conflicts() compares the attributes with each other; if they refer to different code, you have to resolve the conflict.

Use the notebook to work on the exercises and see solutions.

from bookutils import inheritance_conflicts

inheritance_conflicts(GrammarCoverageFuzzer, ProbabilisticGrammarFuzzer)

This is a method you have to implement for multiple inheritance besides choose_covered_node_expansion() and choose_uncovered_node_expansion().

Use the notebook to work on the exercises and see solutions.

Exercise 2: Learning from Past Bugs¶

Learning from a set of inputs can be extremely valuable if one learns from inputs that are known to have caused failures before. In this exercise, you will go and learn distributions from past vulnerabilities.

Download js-vuln-db, a set of JavaScript engine vulnerabilities. Each vulnerability comes with code that exercises it.
Extract all number literals from the code, using re.findall() with appropriate regular expressions.
Convert these literals to (decimal) numeric values and count their respective occurrences.
Create a grammar RISKY_NUMBERS that produces these numbers with probabilities reflecting the above counts.

Of course, there is more to vulnerabilities than just a specific numbers, but some numbers are more likely to induce errors than others. The next time you fuzz a system, do not generate numbers randomly; instead, pick one from RISKY_NUMBERS :-)

Use the notebook to work on the exercises and see solutions.

Probabilistic Grammar Fuzzing¶

Synopsis¶

The Law of Leading Digits¶

Specifying Probabilities¶

Computing Probabilities¶

Distributing Probabilities¶

Checking Probabilities¶

Expanding by Probability¶

Directed Fuzzing¶

Probabilities in Context¶

Learning Probabilities from Samples¶

Counting Expansions¶

Assigning Probabilities¶

Testing Common Features¶

Testing Uncommon Features¶

Learning Probabilities from Input Slices¶

Detecting Unnatural Numbers¶

Lessons Learned¶

Next Steps¶

Background¶

Exercises¶

Exercise 1: Probabilistic Fuzzing with Coverage¶

Exercise 2: Learning from Past Bugs¶

How to Cite this Work