Let us give grammars even more power by assigning *probabilities* to individual expansions. This allows us to control how many of each element should be produced, and thus allows us to *target* our generated tests towards specific functionality. We also show how to learn such probabilities from given sample inputs, and specifically direct our tests towards input features that are uncommon in these samples.

**Prerequisites**

- You should have read the chapter on grammars.
- Our implementation hooks into the grammar-based fuzzer introduced in "Efficient Grammar Fuzzing"
- For learning probabilities from samples, we make use of parsers.

*differ* from numbers in the real world. This is because in real-life sets of numerical data, the *leading significant digit* is likely to be small: Actually, on average, the leading digit `1`

occurs more than *six times* as often as the leading digit `8`

or `9`

. It has been shown that this result applies to a wide variety of data sets, including electricity bills, street addresses, stock prices, house prices, population numbers, death rates, lengths of rivers, physical and mathematical constants (Wikipedia).

This law of leading digits was first observed by Newcomb \cite{Newcomb1881} and later formalized by Benford in \cite{Benford1938}. Let us take a look at the conditions that determine the first digit of a number. We can easily compute the first digit by converting the number into a string and take the first character:

In [1]:

```
def first_digit_via_string(x):
return ord(repr(x)[0]) - ord('0')
```

In [2]:

```
first_digit_via_string(2001)
```

Out[2]:

2

To do this mathematically, though, we have to take the fractional part of their logarithm, or formally

$$ d = 10^{\{\log_{10}(x)\}} $$where $\{x\}$ is the fractional part of $x$ (i.e. $\{1.234\} = 0.234$).

In [4]:

```
def first_digit_via_log(x):
frac, whole = math.modf(math.log10(x))
return int(10 ** frac)
```

In [5]:

```
first_digit_via_log(2001)
```

Out[5]:

2

Most sets of "naturally" occurring numbers should not have any bias in the fractional parts of their logarithms, and hence, the fractional part $\{\log_{10}(x)\}$ is typically uniformly distributed. However, the fractional parts for the individual digits are *not* evenly distributed.

In [6]:

```
(math.log10(1), math.log10(2))
```

Out[6]:

(0.0, 0.3010299956639812)

To start with the digit 2, though, it must be in the range

In [7]:

```
(math.log10(2), math.log10(3))
```

Out[7]:

(0.3010299956639812, 0.47712125471966244)

which is much smaller. Formally, the probability $P(d)$ for a leading digit $d$ (again, assuming uniformly distributed fractional parts) is known as Benford's law: $$ P(d) = \log_{10}(d + 1) - \log_{10}(d) $$ which gives us:

In [8]:

```
def prob_leading_digit(d):
return math.log10(d + 1) - math.log10(d)
```

Let us compute these probabilities for all digits:

In [9]:

```
digit_probs = [prob_leading_digit(d) for d in range(1, 10)]
[(d, "%.2f" % digit_probs[d - 1]) for d in range(1, 10)]
```

Out[9]:

[(1, '0.30'), (2, '0.18'), (3, '0.12'), (4, '0.10'), (5, '0.08'), (6, '0.07'), (7, '0.06'), (8, '0.05'), (9, '0.05')]

In [11]:

```
labels = range(1, 10)
fig1, ax1 = plt.subplots()
ax1.pie(digit_probs, labels=labels, shadow=True, autopct='%1.1f%%',
counterclock=False, startangle=90)
ax1.axis('equal');
```

We see that a leading 1 is indeed six times a probable than a leading 9.

Benford's law has a number of applications. Most notably, it can be used to detect "non-natural" numbers, i.e. numbers that apparently were created randomly rather than coming from a "natural" source. if you write a scientific paper and fake data by putting in random numbers (for instance, using our grammar fuzzer on integers), you will likely violate Benford's law, and this can indeed be spotted. On the other hand, how would we proceed if we *wanted* to create numbers that adhere to Benford's law? To this end, we need to be able to *encode* probabilities such as the above in our grammar, such that we can ensure that a leading digit is indeed a `1`

in 30% of all cases.

The goal of this chapter is to assign *probabilities* to individual expansions in the grammar, such that we can express that some expansion alternatives should be favored over others. This is not only useful to generate "natural"-looking numbers, but even more so to *direct* test generation towards a specific goal. If you recently have changed some code in your program, you would probably like to generate inputs that exercise precisely this code. By raising the probabilities on the input elements associated with the changed code, you will get more tests that exercise the changed code.

Our concept for expressing probabilities is to *annotate* individual expansions with attributes such as probabilities, using the annotation mechanism introduced in the chapter on grammars. To this end, we allow that an expansion cannot only be a string, but also a *pair* of a string and a set of attributes, as in

```
"<expr>":
[("<term> + <expr>", opts(prob=0.1)),
("<term> - <expr>", opts(prob=0.2)),
"<term>"]
```

Here, the `opts()`

function would allow us to express probabilities for choosing the individual expansions. The addition would have a probability of 10%, the subtraction of 20%. The remaining probability (in this case 70%) is equally distributed over the non-attributed expansions (in this case the single last one).

We can now use pairs with `opts()`

to assign probabilities to our expression grammar:

In [13]:

```
from GrammarFuzzer import GrammarFuzzer, all_terminals, display_tree
```

In [14]:

```
from Grammars import is_valid_grammar, EXPR_GRAMMAR, START_SYMBOL, crange, syntax_diagram
from Grammars import opts, exp_string, exp_opt, set_opts
```

This is how the grammar expansions are represented internally:

In [17]:

```
PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"]
```

Out[17]:

[('1', {'prob': 0.301}), ('2', {'prob': 0.176}), ('3', {'prob': 0.125}), ('4', {'prob': 0.097}), ('5', {'prob': 0.079}), ('6', {'prob': 0.067}), ('7', {'prob': 0.058}), ('8', {'prob': 0.051}), ('9', {'prob': 0.046})]

`exp_string()`

(from the chapter on Grammars) and `exp_prob()`

:

In [18]:

```
leaddigit_expansion = PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"][0]
leaddigit_expansion
```

Out[18]:

('1', {'prob': 0.301})

In [19]:

```
exp_string(leaddigit_expansion)
```

Out[19]:

'1'

In [20]:

```
def exp_prob(expansion):
"""Return the options of an expansion"""
return exp_opt(expansion, 'prob')
```

In [21]:

```
exp_prob(leaddigit_expansion)
```

Out[21]:

0.301

In [22]:

```
f = GrammarFuzzer(PROBABILISTIC_EXPR_GRAMMAR)
f.fuzz()
```

warning: option 'prob' is not supported

Out[22]:

'4 + ++--7.0 - -7 - +++7.3 * (1 * 3 + 5 / 3 / 5 + 2) * 38 * (2 + 8)'

In [24]:

```
f = GrammarCoverageFuzzer(PROBABILISTIC_EXPR_GRAMMAR)
f.fuzz()
```

warning: option 'prob' is not supported

Out[24]:

'1.30'

Let us define functions that access probabilities for given expansions. While doing so, they also check for inconsistencies.

Here is how we distribute probabilities for expansions without specified probabilities. Given an expansion rule

$$S ::= a_1\:|\: a_2 \:|\: \dots \:|\: a_n \:|\: u_1 \:|\: u_2 \:|\: \dots u_m$$with $n \ge 0$ alternatives $a_i$ for which the probability $p(a_i)$ is *specified* and
$m \ge 0$ alternatives $u_j$ for which the probability $p(u_j)$ is *unspecified*,
the "remaining" probability is distributed equally over all $u_j$; in other words,

If no probabilities are specified ($n = 0$), then all expansions have the same probability.

The overall sum of probabilities must be 1:

$$\sum_{i = 0}^{n} p(a_i) + \sum_{j = 0}^{m} p(u_i) = 1$$We check these properties while distributing probabilities.

The function `exp_probabilities()`

returns a mapping of all expansions in a rule to their respective probabilities.

In [25]:

```
def exp_probabilities(expansions, nonterminal="<symbol>"):
probabilities = [exp_prob(expansion) for expansion in expansions]
prob_dist = prob_distribution(probabilities, nonterminal)
prob_mapping = {}
for i in range(len(expansions)):
expansion = exp_string(expansions[i])
prob_mapping[expansion] = prob_dist[i]
return prob_mapping
```

`exp_probabilities()`

is handled in `prob_distribution()`

, which does the actual checking and computation.

In [26]:

```
def prob_distribution(probabilities, nonterminal="<symbol>"):
epsilon = 0.00001
number_of_unspecified_probabilities = probabilities.count(None)
if number_of_unspecified_probabilities == 0:
assert abs(sum(probabilities) - 1.0) < epsilon, \
nonterminal + ": sum of probabilities must be 1.0"
return probabilities
sum_of_specified_probabilities = 0.0
for p in probabilities:
if p is not None:
sum_of_specified_probabilities += p
assert 0 <= sum_of_specified_probabilities <= 1.0, \
nonterminal + ": sum of specified probabilities must be between 0.0 and 1.0"
default_probability = ((1.0 - sum_of_specified_probabilities)
/ number_of_unspecified_probabilities)
all_probabilities = []
for p in probabilities:
if p is None:
p = default_probability
all_probabilities.append(p)
assert abs(sum(all_probabilities) - 1.0) < epsilon
return all_probabilities
```

Here's the mapping `exp_probabilities()`

returns for the annotated `<leaddigit>`

element:

In [27]:

```
print(exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"]))
```

In [28]:

```
print(exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<digit>"]))
```

{'0': 0.1, '1': 0.1, '2': 0.1, '3': 0.1, '4': 0.1, '5': 0.1, '6': 0.1, '7': 0.1, '8': 0.1, '9': 0.1}

`exp_probabilities()`

distributes any remaining probability across non-annotated expansions:

In [29]:

```
exp_probabilities(PROBABILISTIC_EXPR_GRAMMAR["<expr>"])
```

Out[29]:

{'<term> + <expr>': 0.1, '<term> - <expr>': 0.2, '<term>': 0.7}

`exp_probabilities()`

to check a probabilistic grammar for consistency:

In [30]:

```
def is_valid_probabilistic_grammar(grammar, start_symbol=START_SYMBOL):
if not is_valid_grammar(grammar, start_symbol):
return False
for nonterminal in grammar:
expansions = grammar[nonterminal]
prob_dist = exp_probabilities(expansions, nonterminal)
return True
```

In [31]:

```
assert is_valid_probabilistic_grammar(PROBABILISTIC_EXPR_GRAMMAR)
```

In [32]:

```
assert is_valid_probabilistic_grammar(EXPR_GRAMMAR)
```

In [34]:

```
with ExpectError():
assert is_valid_probabilistic_grammar({"<start>": [("1", opts(prob=0.5))]})
```

In [35]:

```
with ExpectError():
assert is_valid_probabilistic_grammar(
{"<start>": [("1", opts(prob=1.5)), "2"]})
```

Now that we have seen how to specify probabilities for a grammar, we can actually implement probabilistic expansion. In our `ProbabilisticGrammarFuzzer`

, it suffices to overload one method, namely `choose_node_expansion()`

. For each of the children we can choose from (typically all expansions of a symbol), we determine their probability (using `exp_probabilities()`

defined above), and make a weighted choice using `random.choices()`

with a `weight`

argument.

In [37]:

```
class ProbabilisticGrammarFuzzer(GrammarFuzzer):
def check_grammar(self):
super().check_grammar()
assert is_valid_probabilistic_grammar(self.grammar)
def supported_opts(self):
return super().supported_opts() | {'prob'}
```

In [38]:

```
class ProbabilisticGrammarFuzzer(ProbabilisticGrammarFuzzer):
def choose_node_expansion(self, node, possible_children):
(symbol, tree) = node
expansions = self.grammar[symbol]
probabilities = exp_probabilities(expansions)
weights = []
for child in possible_children:
expansion = all_terminals((node, child))
child_weight = probabilities[expansion]
if self.log:
print(repr(expansion), "p =", child_weight)
weights.append(child_weight)
if sum(weights) == 0:
# No alternative (probably expanding at minimum cost)
weights = None
return random.choices(
range(len(possible_children)), weights=weights)[0]
```

Our probabilistic grammar fuzzer works just like the non-probabilistic grammar fuzzer, except that it actually respects probability annotations. Let us generate a couple of "natural" numbers that respect Benford's law:

In [39]:

```
natural_fuzzer = ProbabilisticGrammarFuzzer(
PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leadinteger>")
print([natural_fuzzer.fuzz() for i in range(20)])
```

In contrast, these numbers are pure random:

In [40]:

```
integer_fuzzer = GrammarFuzzer(
PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leadinteger>")
print([integer_fuzzer.fuzz() for i in range(20)])
```

['3', '1', '56', '5', '408251024', '7', '27', '2', '9', '6456', '7', '32', '1', '4', '7', '19', '2', '6', '2', '5']

warning: option 'prob' is not supported

`ProbabilisticGrammarFuzzer`

indeed respects the probabilistic annotations, let us create a specific fuzzer for the lead digit:

In [41]:

```
leaddigit_fuzzer = ProbabilisticGrammarFuzzer(
PROBABILISTIC_EXPR_GRAMMAR, start_symbol="<leaddigit>")
leaddigit_fuzzer.fuzz()
```

Out[41]:

'8'

If we generate thousands of lead digits, their distribution should again follow Benford's law:

In [42]:

```
trials = 10000
count = {}
for c in crange('0', '9'):
count[c] = 0
for i in range(trials):
count[leaddigit_fuzzer.fuzz()] += 1
print([(digit, count[digit] / trials) for digit in count])
```

Quod erat demonstrandum! The distribution is pretty much exactly as originally specified. We now have a fuzzer where we can exercise control by specifying probabilities.

Assigning probabilities to individual expansions gives us great control over which inputs should be generated. By choosing probabilities wisely, we can *direct* fuzzing towards specific functions and features – for instance, towards functions that are particularly critical, prone to failures, or that have been recently changed.

`ftps`

scheme, we can generate more URLs that will specifically test this functionality.

First, let us define a helper function that sets a particular option:

Here's a specialization just for probabilities:

In [43]:

```
def set_prob(grammar, symbol, expansion, prob):
"""Set the probability of the given expansion of grammar[symbol]"""
set_opts(grammar, symbol, expansion, opts(prob=prob))
```

Let us use `set_prob()`

to give the `ftps`

expansion a probability of 80%:

In [45]:

```
probabilistic_url_grammar = extend_grammar(URL_GRAMMAR)
set_prob(probabilistic_url_grammar, "<scheme>", "ftps", 0.8)
assert is_valid_probabilistic_grammar(probabilistic_url_grammar)
```

In [46]:

```
probabilistic_url_grammar["<scheme>"]
```

Out[46]:

['http', 'https', 'ftp', ('ftps', {'prob': 0.8})]

If we use this grammar for fuzzing, we will get plenty of `ftps:`

prefixes:

In [47]:

```
prob_url_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
for i in range(10):
print(prob_url_fuzzer.fuzz())
```

In a similar vein, we can direct URL generation towards specific hosts or ports; we can favor URLs with queries, fragments, or logins – or URLs without these. All it takes is to set appropriate probabilities.

By setting the probability of an expansion to zero, we can effectively disable specific expansions:

In [48]:

```
set_prob(probabilistic_url_grammar, "<scheme>", "ftps", 0.0)
assert is_valid_probabilistic_grammar(probabilistic_url_grammar)
```

In [49]:

```
prob_url_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
for i in range(10):
print(prob_url_fuzzer.fuzz())
```

Note that even if we set the probability of an expansion to zero, we may still see the expansion taken. This can happen during the "closing" phase of our grammar fuzzer, when the expansion is closed at minimum cost. At this stage, even expansions with "zero" probability will be taken if this is necessary for closing the expansion.

Let us illustrate this feature using the `<expr>`

rule from our expression grammar:

In [51]:

```
probabilistic_expr_grammar = extend_grammar(EXPR_GRAMMAR)
probabilistic_expr_grammar["<expr>"]
```

Out[51]:

['<term> + <expr>', '<term> - <expr>', '<term>']

If we set the probability of the `<term>`

expansion to zero, the string should expand again and again.

In [52]:

```
set_prob(probabilistic_expr_grammar, "<expr>", "<term>", 0.0)
assert is_valid_probabilistic_grammar(probabilistic_expr_grammar)
```

Still, in the "closing" phase, subexpressions will eventually expand into `<term>`

, as it is the only way to close the expansion. Tracking `choose_node_expansion()`

shows that is is invoked with only one possible expansion `<term>`

, which has to be taken even though its specified probability is zero.

In [53]:

```
prob_expr_fuzzer = ProbabilisticGrammarFuzzer(probabilistic_expr_grammar)
prob_expr_fuzzer.fuzz()
```

Out[53]:

'44 / 7 / 1 * 3 / 6 - +1.63 + 3 * 7 + 1 - 2'

While specified probabilities give us a means to control which expansions are taken how often, this control by itself may not be enough. As an example, consider the following grammar for IPv4 addresses:

In [54]:

```
def decrange(start, end):
"""Return a list with string representations of numbers in the range [start, end)"""
return [repr(n) for n in range(start, end)]
```

In [55]:

```
IP_ADDRESS_GRAMMAR = {
"<start>": ["<address>"],
"<address>": ["<octet>.<octet>.<octet>.<octet>"],
# ["0", "1", "2", ..., "255"]
"<octet>": decrange(0, 256)
}
```

In [56]:

```
print(IP_ADDRESS_GRAMMAR["<octet>"][:20])
```

In [57]:

```
assert is_valid_grammar(IP_ADDRESS_GRAMMAR)
```

We can easily use this grammar to create IP addresses:

In [58]:

```
ip_fuzzer = ProbabilisticGrammarFuzzer(IP_ADDRESS_GRAMMAR)
ip_fuzzer.fuzz()
```

Out[58]:

'34.155.77.136'

In [59]:

```
probabilistic_ip_address_grammar = extend_grammar(IP_ADDRESS_GRAMMAR)
set_prob(probabilistic_ip_address_grammar, "<octet>", "127", 0.8)
```

In [60]:

```
probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
probabilistic_ip_address_grammar)
probabilistic_ip_fuzzer.fuzz()
```

Out[60]:

'127.127.127.127'

If we want to assign *different* probabilities to each of the four octets, what do we do?

*context*, which we already have seen while discussing coverage-driven fuzzers. As with coverage-driven fuzzing, the idea is to *duplicate* the element whose probability we want to set dependent on its context. In our case, this means to duplicate the `<octet>`

element to four individual ones, each of which can then get an individual probability distribution. We can do this programmatically, using the `duplicate_context()`

method:

In [62]:

```
probabilistic_ip_address_grammar = extend_grammar(IP_ADDRESS_GRAMMAR)
duplicate_context(probabilistic_ip_address_grammar, "<address>")
```

In [63]:

```
probabilistic_ip_address_grammar["<address>"]
```

Out[63]:

['<octet-1>.<octet-2>.<octet-3>.<octet-4>']

`<octet>`

symbols. For instance, we can force specific expansions by setting their probability to 100%:

In [64]:

```
set_prob(probabilistic_ip_address_grammar, "<octet-1>", "127", 1.0)
set_prob(probabilistic_ip_address_grammar, "<octet-2>", "0", 1.0)
```

In [65]:

```
assert is_valid_probabilistic_grammar(probabilistic_ip_address_grammar)
```

`<octet-3>`

and `<octet-4>`

have no specific probabilities set. During fuzzing, all their expansions (all octets) are thus still available:

In [66]:

```
probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
probabilistic_ip_address_grammar)
[probabilistic_ip_fuzzer.fuzz() for i in range(5)]
```

Out[66]:

['127.0.201.77', '127.0.98.36', '127.0.12.232', '127.0.146.161', '127.0.245.151']

Probabilities need not be set manually all the time. They can also be *learned* from other sources, notably by counting *how frequently individual expansions occur in a given set of inputs*. This is useful in a number of situations, including:

- Test
*common*features. The idea is that during testing, one may want to focus on frequently occurring (or frequently used) features first, to ensure correct functionality for the most common usages. - Test
*uncommon*features. Here, the idea is to have test generation focus on features that are rarely seen (or not seen at all) in inputs. This is the same motivation as with grammar coverage, but from a probabilistic standpoint. - Focus on specific
*slices*. One may have a set of inputs that is of particular interest (for instance, because they exercise a critical functionality, or recently have discovered bugs). Using this learned distribution for fuzzing allows us to*focus*on precisely these functionalities of interest.

Let us first introduce counting expansions and learning probabilities, and then detail these scenarios.

We start with implementing a means to take a set of inputs and determine the number of expansions in that set. To this end, we need the *parsers* introduced in the previous chapter to transform a string input into a derivation tree. For our IP address grammar, this is how this works:

In [68]:

```
IP_ADDRESS_TOKENS = {"<octet>"} # EarleyParser needs explicit tokens
```

In [69]:

```
parser = EarleyParser(IP_ADDRESS_GRAMMAR)
```

In [70]:

```
tree = parser.parse("127.0.0.1")[0]
display_tree(tree)
```

In a tree such as this one, we can now *count* individual expansions. In the above tree, for instance, we have two expansions of `<octet>`

into `0`

, one into `1`

, and one into `127`

. In other words, the expansion `<octet>`

into `0`

makes up 50% of all expansions seen; the expansions into `127`

and `1`

make up 25% each, and the other ones 0%. These are the probabilities we'd like to assign to our "learned" grammar.

`ExpansionCountMiner`

which allows us to count how frequently individual expansions take place. Its initialization method takes a parser (say, an `EarleyParser`

) that would be initialized with the appropriate grammar.

In [71]:

```
class ExpansionCountMiner(object):
def __init__(self, parser, log=False):
assert isinstance(parser, Parser)
self.grammar = extend_grammar(parser.grammar())
self.parser = parser
self.log = log
self.reset()
```

`expansion_counts`

holds the expansions seen; adding a tree with `add_tree()`

traverses the given tree and adds all expansions seen.

In [74]:

```
class ExpansionCountMiner(ExpansionCountMiner):
def reset(self):
self.expansion_counts = {}
def add_coverage(self, symbol, children):
key = expansion_key(symbol, children)
if self.log:
print("Found", key)
if key not in self.expansion_counts:
self.expansion_counts[key] = 0
self.expansion_counts[key] += 1
def add_tree(self, tree):
(symbol, children) = tree
if not is_nonterminal(symbol):
return
direct_children = [
(symbol, None) if is_nonterminal(symbol) else (
symbol, []) for symbol, c in children]
self.add_coverage(symbol, direct_children)
for c in children:
self.add_tree(c)
```

The method `count_expansions()`

is the one facing the public; it takes a list of inputs, parses them, and processes the resulting trees. The method `counts()`

returns the counts found.

In [75]:

```
class ExpansionCountMiner(ExpansionCountMiner):
def count_expansions(self, inputs):
for inp in inputs:
tree = self.parser.parse(inp)[0]
self.add_tree(tree)
def counts(self):
return self.expansion_counts
```

`ExpansionCountMiner`

for our IP address grammar:

In [76]:

```
expansion_count_miner = ExpansionCountMiner(EarleyParser(IP_ADDRESS_GRAMMAR))
```

We parse a (small) set of IP addresses and count the expansions occurring:

In [77]:

```
expansion_count_miner.count_expansions(["127.0.0.1", "1.2.3.4"])
expansion_count_miner.counts()
```

Out[77]:

{'<start> -> <address>': 2, '<address> -> <octet>.<octet>.<octet>.<octet>': 2, '<octet> -> 127': 1, '<octet> -> 0': 2, '<octet> -> 1': 2, '<octet> -> 2': 1, '<octet> -> 3': 1, '<octet> -> 4': 1}

`127`

, and two into `0`

. These are the counts we can use to assign probabilities.

The distribution of counts, as determined by `ExpansionCountMiner`

is what we can use to assign probabilities to our grammar. To this end, we introduce a subclass `ProbabilisticGrammarMiner`

whose method `set_expansion_probabilities()`

processes all expansions of a given symbol, checks whether it occurs in a given count distribution, and assigns probabilities using the following formula.

Given a set $T$ of derivation trees (as mined from samples), we determine the probabilities $p_i$ for each alternative $a_i$ of a symbol $S \rightarrow a_1 | \dots | a_n$ as

$$p_i = \frac{\text{Expansions of $S \rightarrow a_i$ in $T$}}{\text{Expansions of $S$ in $T$}}$$Should $S$ not occur at all in $T$, then $p_i$ is *unspecified*.

Here is the implementation of `set_expansion_probabilities()`

, implementing the above formula:

In [78]:

```
class ProbabilisticGrammarMiner(ExpansionCountMiner):
def set_probabilities(self, counts):
for symbol in self.grammar:
self.set_expansion_probabilities(symbol, counts)
def set_expansion_probabilities(self, symbol, counts):
expansions = self.grammar[symbol]
if len(expansions) == 1:
set_prob(self.grammar, symbol, expansions[0], None)
return
expansion_counts = [
counts.get(
expansion_key(
symbol,
expansion),
0) for expansion in expansions]
total = sum(expansion_counts)
for i, expansion in enumerate(expansions):
p = expansion_counts[i] / total if total > 0 else None
# if self.log:
# print("Setting", expansion_key(symbol, expansion), p)
set_prob(self.grammar, symbol, expansion, p)
```

The typical use of `ProbabilisticGrammarMiner`

is through `mine_probabilistic_grammar()`

, which first determines a distribution from a set of inputs, and then sets the probabilities accordingly.

In [79]:

```
class ProbabilisticGrammarMiner(ProbabilisticGrammarMiner):
def mine_probabilistic_grammar(self, inputs):
self.count_expansions(inputs)
self.set_probabilities(self.counts())
return self.grammar
```

Let us put this to use. We create a grammar miner for IP addresses:

In [80]:

```
probabilistic_grammar_miner = ProbabilisticGrammarMiner(
EarleyParser(IP_ADDRESS_GRAMMAR))
```

We now use `mine_probabilistic_grammar()`

to mine the grammar:

In [81]:

```
probabilistic_ip_address_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar([
"127.0.0.1", "1.2.3.4"])
```

In [82]:

```
assert is_valid_probabilistic_grammar(probabilistic_ip_address_grammar)
```

Here's the resulting distribution of octets in our grammar:

In [83]:

```
[expansion for expansion in probabilistic_ip_address_grammar['<octet>']
if exp_prob(expansion) > 0]
```

Out[83]:

[('0', {'prob': 0.25}), ('1', {'prob': 0.25}), ('2', {'prob': 0.125}), ('3', {'prob': 0.125}), ('4', {'prob': 0.125}), ('127', {'prob': 0.125})]

In [84]:

```
probabilistic_ip_fuzzer = ProbabilisticGrammarFuzzer(
probabilistic_ip_address_grammar)
[probabilistic_ip_fuzzer.fuzz() for i in range(10)]
```

Out[84]:

['4.2.2.0', '2.1.4.0', '0.1.3.127', '0.3.0.127', '4.0.2.1', '3.127.0.0', '2.2.1.1', '4.0.1.0', '2.4.0.1', '0.0.3.127']

Let us now get to our three usage scenarios. The first scenario is to create probability distributions right out of a sample, and to use these very distributions during test generation. This helps focusing test generation on those features that are *most commonly used*, which thus minimizes the risk of customers encountering failures.

In [85]:

```
URL_SAMPLE = [
"https://user:password@cispa.saarland:80/",
"https://fuzzingbook.com?def=56&x89=3&x46=48&def=def",
"https://cispa.saarland:80/def?def=7&x23=abc",
"https://fuzzingbook.com:80/",
"https://fuzzingbook.com:80/abc?def=abc&abc=x14&def=abc&abc=2&def=38",
"ftps://fuzzingbook.com/x87",
"https://user:password@fuzzingbook.com:6?def=54&x44=abc",
"http://fuzzingbook.com:80?x33=25&def=8",
"http://fuzzingbook.com:8080/def",
]
```

In [86]:

```
URL_TOKENS = {"<scheme>", "<userinfo>", "<host>", "<port>", "<id>"}
```

In [87]:

```
url_parser = EarleyParser(URL_GRAMMAR, tokens=URL_TOKENS)
url_input = URL_SAMPLE[2]
print(url_input)
display_tree(url_parser.parse(url_input)[0])
```

https://cispa.saarland:80/def?def=7&x23=abc

`ProbabilisticGrammarMiner`

class on these inputs, using the above `url_parser`

parser, and obtain a probabilistic URL grammar:

In [88]:

```
probabilistic_grammar_miner = ProbabilisticGrammarMiner(url_parser)
probabilistic_url_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
URL_SAMPLE)
```

These are the counts we obtained during parsing:

In [89]:

```
print(probabilistic_grammar_miner.counts())
```

`https:`

scheme, whereas there is no input using the `ftp:`

scheme.

In [90]:

```
probabilistic_url_grammar['<scheme>']
```

Out[90]:

[('http', {'prob': 0.2222222222222222}), ('https', {'prob': 0.6666666666666666}), ('ftp', {'prob': 0.0}), ('ftps', {'prob': 0.1111111111111111})]

Likewise, we see that most given URLs have multiple parameters:

In [91]:

```
probabilistic_url_grammar['<params>']
```

Out[91]:

[('<param>', {'prob': 0.3333333333333333}), ('<param>&<params>', {'prob': 0.6666666666666666})]

When we use this probabilistic grammar for fuzzing, these distributions are reflected in our generated inputs – no `ftp:`

schemes either, and most inputs have multiple parameters.

In [92]:

```
g = ProbabilisticGrammarFuzzer(probabilistic_url_grammar)
[g.fuzz() for i in range(10)]
```

Out[92]:

['https://fuzzingbook.com/def?def=abc&def=abc&def=def&def=abc&abc=def&def=def&abc=def', 'http://fuzzingbook.com:80/def?def=7&def=abc&abc=88', 'https://cispa.saarland/def?def=2', 'http://user:password@fuzzingbook.com:80/def?abc=abc&def=78', 'http://cispa.saarland:80/?def=54&def=abc', 'https://fuzzingbook.com:80/def?def=def', 'http://fuzzingbook.com:80/abc?abc=abc&abc=def&def=85&abc=7&def=6&abc=2&def=abc', 'https://fuzzingbook.com/abc?def=32&def=3&abc=4', 'http://fuzzingbook.com/def?abc=24&def=def&def=48', 'https://cispa.saarland:80/?abc=abc']

Being able to replicate a probability distribution learned from a sample is not only important for focusing on commonly used features. It can also help in achieving *valid inputs*, in particular if one learns probabilities *in context*, as discussed above: If within a given context, some elements are more likely than others (because they depend on each other), a learned probability distribution will reflect this; and hence, inputs generated from this learned probability distribution will have a higher chance to be valid, too. We will explore this further in the exercises, below.

So far, we have focused on *common* features; but from a testing perspective, one may just as well test *uncommon* features – that is, features that rarely occur in our usage samples and therefore would be less exercised in practice. This is a common scenario in security testing, where one focuses on uncommon (and possibly lesser-known) features, as fewer users means fewer bugs reported, and thus more bugs left to be found and exploited.

To have our probabilistic grammar fuzzer focus on *uncommon* features, we *change the learned probabilities* such that commonly occuring features (i.e., those with a high learned probability) get a low probability, and vice versa: The last shall be first, and the first last. A particularly simple way to achieve such an *inversion* of probabilities is to *swap* them: The alternatives with the highest and lowest probability swaps their probabilities, as so the alternatives with the second highest and second lowest probability, the alternatives with the third highest and lowest, and so on.

`invert_expansion()`

takes an expansion (a list of alternatives) from a grammar and returns a new inverted expansion in which the probabilities have been swapped according to the rule above. It creates a list of indexes, sorts it by increasing probability, and then for each $n$-th element, assigns it the probability of the $n$-th last element in the indexes.

In [94]:

```
def invert_expansion(expansion):
def sort_by_prob(x):
index, prob = x
return prob if prob is not None else 0
inverted_expansion = copy.deepcopy(expansion)
indexes = [(index, exp_prob(alternative))
for index, alternative in enumerate(expansion)]
indexes.sort(key=sort_by_prob)
indexes = [i for (i, _) in indexes]
for j in range(len(indexes)):
k = len(indexes) - 1 - j
# print(indexes[j], "gets", indexes[k])
inverted_expansion[indexes[j]
][1]['prob'] = expansion[indexes[k]][1]['prob']
return inverted_expansion
```

Here's `invert_expansion()`

in action. This is out original probability distribution for URL schemes:

In [95]:

```
probabilistic_url_grammar['<scheme>']
```

Out[95]:

`ftp:`

scheme, which previously had a probability of zero, now has the highest probability, whereas the most common scheme, `https:`

, now has the previous zero probability of the `ftp:`

scheme.

In [96]:

```
invert_expansion(probabilistic_url_grammar['<scheme>'])
```

Out[96]:

[('http', {'prob': 0.1111111111111111}), ('https', {'prob': 0.0}), ('ftp', {'prob': 0.6666666666666666}), ('ftps', {'prob': 0.2222222222222222})]

In [97]:

```
invert_expansion(invert_expansion(probabilistic_url_grammar['<scheme>']))
```

Out[97]:

Note that our implementation does not universally satisfy this property: If two alternatives $a_1$ and $a_2$ in the expansion share the same probability, then the second inversion may assign different probabilities to $a_1$ and $a_2$.

We can apply this inversion of expansions across the entire grammar:

In [98]:

```
def invert_probs(grammar):
inverted_grammar = extend_grammar(grammar)
for symbol in grammar:
inverted_grammar[symbol] = invert_expansion(grammar[symbol])
return inverted_grammar
```

This means that probabilities would be swapped for each and every expansion:

In [99]:

```
probabilistic_url_grammar["<digit>"]
```

Out[99]:

[('0', {'prob': 0.0}), ('1', {'prob': 0.0}), ('2', {'prob': 0.14285714285714285}), ('3', {'prob': 0.14285714285714285}), ('4', {'prob': 0.14285714285714285}), ('5', {'prob': 0.21428571428571427}), ('6', {'prob': 0.07142857142857142}), ('7', {'prob': 0.07142857142857142}), ('8', {'prob': 0.21428571428571427}), ('9', {'prob': 0.0})]

In [100]:

```
inverted_probabilistic_url_grammar = invert_probs(probabilistic_url_grammar)
inverted_probabilistic_url_grammar["<digit>"]
```

Out[100]:

[('0', {'prob': 0.21428571428571427}), ('1', {'prob': 0.21428571428571427}), ('2', {'prob': 0.07142857142857142}), ('3', {'prob': 0.07142857142857142}), ('4', {'prob': 0.0}), ('5', {'prob': 0.0}), ('6', {'prob': 0.14285714285714285}), ('7', {'prob': 0.14285714285714285}), ('8', {'prob': 0.0}), ('9', {'prob': 0.14285714285714285})]

If we now use this "inverted" grammar for fuzzing, the generated inputs will focus on the *complement of the input samples*. We will get plenty of tests of user/password features, as well as `ftp:`

schemes – in essence, all the features present in our language, but rarely used (if at all) in our input samples.

In [101]:

```
g = ProbabilisticGrammarFuzzer(inverted_probabilistic_url_grammar)
[g.fuzz() for i in range(10)]
```

Out[101]:

['ftp://www.google.com', 'ftp://www.google.com/', 'ftp://www.google.com/', 'ftp://user:password@cispa.saarland', 'ftps://www.google.com/', 'ftp://user:password@www.google.com/', 'ftp://user:password@www.google.com', 'ftp://www.google.com/', 'ftp://user:password@www.google.com', 'ftp://user:password@www.google.com/']

Besides having *only* common or *only* uncommon features, one can also create mixed forms – for instance, testing uncommon features in a common context. This can be helpful for security testing, where one may want an innocuous (common) "envelope" combined with an (uncommon) "payload". It all depends on where and how we tune the probabilities.

In our previous examples, we have learned from *all* inputs to generate common or uncommon inputs. However, we can also learn from a *subset* of inputs to focus on the features present in that subset (or, conversely, to *avoid* its features). If we know, for instance, that there is some subset of inputs that covers a functionality of interest (say, because it is particularly critical or because it has been recently changed), we can learn from this very subset and focus our test generation on its features.

To illustrate this approach, let us use the CGI grammar introduced in the chapter on coverage. We have a special interest in Line 25 in our CGI decoder – that is, the line that processes a `%`

character followed by two valid hexadecimal digits:

```
...
elif c == '%':
digit_high, digit_low = s[i + 1], s[i + 2]
i += 2
if digit_high in hex_values and digit_low in hex_values:
v = hex_values[digit_high] * 16 + hex_values[digit_low] ### Line 25
t += chr(v)
...
```

Let us assume that we do not know precisely under which conditions Line 25 is executed – but still, we'd like to test it thoroughly. With our probability learning tools, we can learn these conditions, though. We start with a set of random inputs and consider the subset that covers Line 25.

In [103]:

```
cgi_fuzzer = GrammarFuzzer(CGI_GRAMMAR)
```

In [104]:

```
trials = 100
coverage = {}
for i in range(trials):
cgi_input = cgi_fuzzer.fuzz()
with Coverage() as cov:
cgi_decode(cgi_input)
coverage[cgi_input] = cov.coverage()
```

These are all the random inputs that cover Line 25:

In [105]:

```
coverage_slice = [cgi_input for cgi_input in coverage
if ('cgi_decode', 25) in coverage[cgi_input]]
```

In [106]:

```
print(coverage_slice)
```

Actually, about half of the inputs cover Line 25:

In [107]:

```
len(coverage_slice) / trials
```

Out[107]:

0.51

Let us now learn a probabilistic grammar from this slice of inputs:

In [108]:

```
probabilistic_grammar_miner = ProbabilisticGrammarMiner(
EarleyParser(CGI_GRAMMAR))
probabilistic_cgi_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
coverage_slice)
```

In [109]:

```
assert is_valid_probabilistic_grammar(probabilistic_cgi_grammar)
```

We see that percentage signs are very likely to occur:

In [110]:

```
probabilistic_cgi_grammar['<letter>']
```

Out[110]:

[('<plus>', {'prob': 0.2318840579710145}), ('<percent>', {'prob': 0.5362318840579711}), ('<other>', {'prob': 0.2318840579710145})]

Using this grammar, we can now generate tests that specifically target Line 25:

In [111]:

```
probabilistic_cgi_fuzzer = ProbabilisticGrammarFuzzer(
probabilistic_cgi_grammar)
print([probabilistic_cgi_fuzzer.fuzz() for i in range(20)])
```

In [112]:

```
trials = 100
coverage = {}
for i in range(trials):
cgi_input = probabilistic_cgi_fuzzer.fuzz()
with Coverage() as cov:
cgi_decode(cgi_input)
coverage[cgi_input] = cov.coverage()
```

We see that the fraction of inputs that cover Line 25 is much higher already, showing that our focusing works:

In [113]:

```
coverage_slice = [cgi_input for cgi_input in coverage
if ('cgi_decode', 25) in coverage[cgi_input]]
```

In [114]:

```
len(coverage_slice) / trials
```

Out[114]:

0.76

Repeating this one more time yields an even higher focusing:

In [115]:

```
for run in range(3):
probabilistic_cgi_grammar = probabilistic_grammar_miner.mine_probabilistic_grammar(
coverage_slice)
probabilistic_cgi_fuzzer = ProbabilisticGrammarFuzzer(
probabilistic_cgi_grammar)
trials = 100
coverage = {}
for i in range(trials):
cgi_input = probabilistic_cgi_fuzzer.fuzz()
with Coverage() as cov:
cgi_decode(cgi_input)
coverage[cgi_input] = cov.coverage()
coverage_slice = [cgi_input for cgi_input in coverage
if ('cgi_decode', 25) in coverage[cgi_input]]
```

In [116]:

```
len(coverage_slice) / trials
```

Out[116]:

0.85

*specialize* fuzzers towards the properties of that subset – in our case, inputs that contain percentage signs and valid hexadecimal letters. The degree to which we can specialize things is induced by the number of variables we can control – in our case, the probabilities for the individual rules. Adding more context to the grammar, as discussed above, will increase the number of variables, and thus the amount of specialization.

A high degree of specialization, however, limits our possibilities to explore combinations that fall *outside* of the selected scope, and limit our possibilities to find bugs induced by these combinations. This tradeoff is known as *exploration vs. exploitation* in machine learning – shall one try to explore as many (possibly shallow) combinations as possible, or focus (exploit) specific areas? In the end, it all depends on where the bugs are, and where we are most likely to find them. Assigning and learning probabilities allows us to control the search strategies – from the common to the uncommon to specific subsets.

Let us close this chapter by getting back to our introductory example. We said that Benford's law allows us not only to produce, but also to detect "unnatural" lead digit distributions such as the ones produced by simple random choices.

If we use the regular `GrammarFuzzer`

class (which ignores probabilities) to generate (random) lead digits, this is the distribution we get for each leading digit:

In [117]:

```
sample_size = 1000
random_integer_fuzzer = GrammarFuzzer(
PROBABILISTIC_EXPR_GRAMMAR,
start_symbol="<leaddigit>")
random_integers = [random_integer_fuzzer.fuzz() for i in range(sample_size)]
```

warning: option 'prob' is not supported

In [118]:

```
random_counts = [random_integers.count(c) for c in crange('1', '9')]
random_counts
```

Out[118]:

[122, 123, 116, 98, 117, 105, 99, 114, 106]

`count()`

method here rather than deploying the full-fledged `ProbabilisticGrammarMiner`

.)

If we had a natural distribution of lead digits, this is what we would expect:

In [119]:

```
expected_prob_counts = [
exp_prob(
PROBABILISTIC_EXPR_GRAMMAR["<leaddigit>"][i]) *
sample_size for i in range(9)]
print(expected_prob_counts)
```

[301.0, 176.0, 125.0, 97.0, 79.0, 67.0, 58.0, 51.0, 46.0]

And if we had a random distribution, we would expect an equal distribution:

In [120]:

```
expected_random_counts = [sample_size / 9 for i in range(9)]
print(expected_random_counts)
```

Which distribution better matches our `random_counts`

lead digits? To this end, we run a $\chi^2$-test to compare the distribution we found (`random_counts`

) against the "natural" lead digit distribution `expected_prob_counts`

and the random distribution `expected_random_counts`

.

`pvalue`

= 0.0) that the observed distribution follows a "natural" distribution:

In [122]:

```
chisquare(random_counts, expected_prob_counts)
```

Out[122]:

Power_divergenceResult(statistic=347.96491895076656, pvalue=2.4618008235897177e-70)

However, there is a 97% chance that the observed behavior follows a random distribution:

In [123]:

```
chisquare(random_counts, expected_random_counts)
```

Out[123]:

Power_divergenceResult(statistic=6.38, pvalue=0.6047465691667375)

Hence, if you find some numbers published and doubt their validity, you can run the above test to check whether they are likely to be natural. Better yet, insist that authors use Jupyter notebooks to produce their results, such that you can check every step of the calculation :-)

- By specifying probabilities, one can steer fuzzing towards input features of interest.
- Learning probabilities from samples allows one to focus on features that are common or uncommon in input samples.
- Learning probabilities from a subset of samples allows one to produce more similar inputs.

Now that we have brought together probabilities and grammars (and revisited parsers and grammars), we have created a foundation for many applications. Our next chapters will focus on

- how to
*reduce*failing inputs to a minimum - how to carve and produce tests at the function level
- how to automatically test (Web) user interfaces

Enjoy!

The idea of mining probabilities by parsing a corpus of data was first covered in "Learning to Fuzz: Application-Independent Fuzz Testing with Probabilistic, Generative Models of Input Data" \cite{Patra2016} which also learns and applies probabilistic rules for derivation trees. Applying this idea on probabilistic grammars as well as inverting probabilities or learning from slices was first executed in the work "Inputs from Hell: Generating Uncommon Inputs from Common Samples" \cite{Pavese2018}.

Our exposition of Benford's law follows this article.

Create a class `ProbabilisticGrammarCoverageFuzzer`

that extends `GrammarCoverageFuzzer`

with probabilistic capabilities. The idea is to first cover all uncovered expansions (like `GrammarCoverageFuzzer`

) and once all expansions are covered, to proceed by probabilities (like `ProbabilisticGrammarFuzzer`

).

To this end, define new instances of the `choose_covered_node_expansion()`

and `choose_uncovered_node_expansion()`

methods that choose an expansion based on the given weights.

If you are an advanced programmer, realize the class via *multiple inheritance* from `GrammarCoverageFuzzer`

and `ProbabilisticGrammarFuzzer`

to achieve this.

Multiple inheritance is a tricky thing. If you have two classes $A'$ and $A''$ which both inherit from $A$, the same method $m()$ of $A$ may be overloaded in both $A'$ and $A''$. If one now inherits from *both* $A'$ and $A''$, and calls $m()$, which of the $m()$ implementations should be called? Python "resolves" this conflict by simply invoking the one $m()$ method in the class one inherits from first.

To avoid such conflicts, one can check whether the order in which one inherits makes a difference. The method `inheritance_conflicts()`

compares the attributes with each other; if they refer to different code, you have to resolve the conflict.

In [125]:

```
inheritance_conflicts(GrammarCoverageFuzzer, ProbabilisticGrammarFuzzer)
```

Out[125]:

['choose_node_expansion']

*have* to implement for multiple inheritance besides `choose_covered_node_expansion()`

and `choose_uncovered_node_expansion()`

.

Learning from a set of inputs can be extremely valuable if one learns from *inputs that are known to have caused failures before.* In this exercise, you will go and learn distributions from past vulnerabilities.

- Download
`js-vuln-db`

, a set of JavaScript engine vulnerabilities. Each vulnerability comes with code that exercises it. - Extract all
*number literals*from the code, using`re.findall()`

with appropriate regular expressions. - Convert these literals to (decimal)
*numeric values*and count their respective occurrences. - Create a grammar
`RISKY_NUMBERS`

that produces these numbers with probabilities reflecting the above counts.

Of course, there is more to vulnerabilities than just a specific numbers, but some numbers are more likely to induce errors than others. The next time you fuzz a system, do not generate numbers randomly; instead, pick one from `RISKY_NUMBERS`

:-)