Fuzzing with Grammars¶

In the chapter on "Mutation-Based Fuzzing", we have seen how to use extra hints – such as sample input files – to speed up test generation. In this chapter, we take this idea one step further, by providing a specification of the legal inputs to a program. Specifying inputs via a grammar allows for very systematic and efficient test generation, in particular for complex input formats. Grammars also serve as the base for configuration fuzzing, API fuzzing, GUI fuzzing, and many more.

Input Languages¶

All possible behaviors of a program can be triggered by its input. "Input" here can be a wide range of possible sources: We are talking about data that is read from files, from the environment, or over the network, data input by the user, or data acquired from interaction with other resources. The set of all these inputs determines how the program will behave – including its failures. When testing, it is thus very helpful to think about possible input sources, how to get them under control, and how to systematically test them.

For the sake of simplicity, we will assume for now that the program has only one source of inputs; this is the same assumption we have been using in the previous chapters, too. The set of valid inputs to a program is called a language. Languages range from the simple to the complex: the CSV language denotes the set of valid comma-separated inputs, whereas the Python language denotes the set of valid Python programs. We commonly separate data languages and programming languages, although any program can also be treated as input data (say, to a compiler). The Wikipedia page on file formats lists more than 1,000 different file formats, each of which is its own language.

To formally describe languages, the field of formal languages has devised a number of language specifications that describe a language. Regular expressions represent the simplest class of these languages to denote sets of strings: The regular expression [a-z]*, for instance, denotes a (possibly empty) sequence of lowercase letters. Automata theory connects these languages to automata that accept these inputs; finite state machines, for instance, can be used to specify the language of regular expressions.

Regular expressions are great for not-too-complex input formats, and the associated finite state machines have many properties that make them great for reasoning. To specify more complex inputs, though, they quickly encounter limitations. At the other end of the language spectrum, we have universal grammars that denote the language accepted by Turing machines. A Turing machine can compute anything that can be computed; and with Python being Turing-complete, this means that we can also use a Python program $p$ to specify or even enumerate legal inputs. But then, computer science theory also tells us that each such testing program has to be written specifically for the program to be tested, which is not the level of automation we want.

Grammars¶

The middle ground between regular expressions and Turing machines is covered by grammars. Grammars are among the most popular (and best understood) formalisms to formally specify input languages. Using a grammar, one can express a wide range of the properties of an input language. Grammars are particularly great for expressing the syntactical structure of an input, and are the formalism of choice to express nested or recursive inputs. The grammars we use are so-called context-free grammars, one of the easiest and most popular grammar formalisms.

Rules and Expansions¶

A grammar consists of a start symbol and a set of expansion rules (or simply rules) which indicate how the start symbol (and other symbols) can be expanded. As an example, consider the following grammar, denoting a sequence of two digits:

<start> ::= <digit><digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

To read such a grammar, start with the start symbol (<start>). An expansion rule <A> ::= <B> means that the symbol on the left side (<A>) can be replaced by the string on the right side (<B>). In the above grammar, <start> would be replaced by <digit><digit>.

In this string again, <digit> would be replaced by the string on the right side of the <digit> rule. The special operator | denotes expansion alternatives (or simply alternatives), meaning that any of the digits can be chosen for an expansion. Each <digit> thus would be expanded into one of the given digits, eventually yielding a string between 00 and 99. There are no further expansions for 0 to 9, so we are all set.

The interesting thing about grammars is that they can be recursive. That is, expansions can make use of symbols expanded earlier – which would then be expanded again. As an example, consider a grammar that describes integers:

<start>  ::= <integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Here, a <integer> is either a single digit, or a digit followed by another integer. The number 1234 thus would be represented as a single digit 1, followed by the integer 234, which in turn is a digit 2, followed by the integer 34.

If we wanted to express that an integer can be preceded by a sign (+ or -), we would write the grammar as

<start>   ::= <number>
<number>  ::= <integer> | +<integer> | -<integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

These rules formally define the language: Anything that can be derived from the start symbol is part of the language; anything that cannot is not.

In [6]:
quiz("Which of these strings cannot be produced "
     "from the above `<start>` symbol?",
     [
         "`007`",
         "`-42`",
         "`++1`",
         "`3.14`"
     ], "[27 ** (1/3), 256 ** (1/4)]")
Out[6]:

Quiz

Which of these strings cannot be produced from the above <start> symbol?





Arithmetic Expressions¶

Let us expand our grammar to cover full arithmetic expressions – a poster child example for a grammar. We see that an expression (<expr>) is either a sum, or a difference, or a term; a term is either a product or a division, or a factor; and a factor is either a number or a parenthesized expression. Almost all rules can have recursion, and thus allow arbitrary complex expressions such as (1 + 2) * (3.4 / 5.6 - 789).

<start>   ::= <expr>
<expr>    ::= <term> + <expr> | <term> - <expr> | <term>
<term>    ::= <term> * <factor> | <term> / <factor> | <factor>
<factor>  ::= +<factor> | -<factor> | (<expr>) | <integer> | <integer>.<integer>
<integer> ::= <digit><integer> | <digit>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In such a grammar, if we start with <start> and then expand one symbol after another, randomly choosing alternatives, we can quickly produce one valid arithmetic expression after another. Such grammar fuzzing is highly effective as it comes to produce complex inputs, and this is what we will implement in this chapter.

In [7]:
quiz("Which of these strings cannot be produced "
     "from the above `<start>` symbol?",
     [
         "`1 + 1`",
         "`1+1`",
         "`+1`",
         "`+(1)`",
     ], "4 ** 0.5")
Out[7]:

Quiz

Which of these strings cannot be produced from the above <start> symbol?





Representing Grammars in Python¶

Our first step in building a grammar fuzzer is to find an appropriate format for grammars. To make the writing of grammars as simple as possible, we use a format that is based on strings and lists. Our grammars in Python take the format of a mapping between symbol names and expansions, where expansions are lists of alternatives. A one-rule grammar for digits thus takes the form

In [8]:
DIGIT_GRAMMAR = {
    "<start>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

We can capture the grammar structure in a Grammar type, in which each symbol (a string) is mapped to a list of expansions (strings):

In [12]:
Grammar = Dict[str, List[Expansion]]

With this Grammar type, the full grammar for arithmetic expressions looks like this:

In [13]:
EXPR_GRAMMAR: Grammar = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<term> + <expr>", "<term> - <expr>", "<term>"],

    "<term>":
        ["<factor> * <term>", "<factor> / <term>", "<factor>"],

    "<factor>":
        ["+<factor>",
         "-<factor>",
         "(<expr>)",
         "<integer>.<integer>",
         "<integer>"],

    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

In the grammar, every symbol can be defined exactly once. We can access any rule by its symbol...

In [14]:
EXPR_GRAMMAR["<digit>"]
Out[14]:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

....and we can check whether a symbol is in the grammar:

In [15]:
"<identifier>" in EXPR_GRAMMAR
Out[15]:
False

Note that we assume that on the left-hand side of a rule (i.e., the key in the mapping) is always a single symbol. This is the property that gives our grammars the characterization of context-free.

Some Definitions¶

We assume that the canonical start symbol is <start>:

In [16]:
START_SYMBOL = "<start>"

The handy nonterminals() function extracts the list of nonterminal symbols (i.e., anything between < and >, except spaces) from an expansion.

In [18]:
RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')
In [19]:
def nonterminals(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    return RE_NONTERMINAL.findall(expansion)
In [20]:
assert nonterminals("<term> * <factor>") == ["<term>", "<factor>"]
assert nonterminals("<digit><integer>") == ["<digit>", "<integer>"]
assert nonterminals("1 < 3 > 2") == []
assert nonterminals("1 <3> 2") == ["<3>"]
assert nonterminals("1 + 2") == []
assert nonterminals(("<1>", {'option': 'value'})) == ["<1>"]

Likewise, is_nonterminal() checks whether some symbol is a nonterminal:

In [21]:
def is_nonterminal(s):
    return RE_NONTERMINAL.match(s)
In [22]:
assert is_nonterminal("<abc>")
assert is_nonterminal("<symbol-1>")
assert not is_nonterminal("+")

A Simple Grammar Fuzzer¶

Let us now put the above grammars to use. We will build a very simple grammar fuzzer that starts with a start symbol (<start>) and then keeps on expanding it. To avoid expansion to infinite inputs, we place a limit (max_nonterminals) on the number of nonterminals. Furthermore, to avoid being stuck in a situation where we cannot reduce the number of symbols any further, we also limit the total number of expansion steps.

In [24]:
class ExpansionError(Exception):
    pass
In [25]:
def simple_grammar_fuzzer(grammar: Grammar, 
                          start_symbol: str = START_SYMBOL,
                          max_nonterminals: int = 10,
                          max_expansion_trials: int = 100,
                          log: bool = False) -> str:
    """Produce a string from `grammar`.
       `start_symbol`: use a start symbol other than `<start>` (default).
       `max_nonterminals`: the maximum number of nonterminals 
         still left for expansion
       `max_expansion_trials`: maximum # of attempts to produce a string
       `log`: print expansion progress if True"""

    term = start_symbol
    expansion_trials = 0

    while len(nonterminals(term)) > 0:
        symbol_to_expand = random.choice(nonterminals(term))
        expansions = grammar[symbol_to_expand]
        expansion = random.choice(expansions)
        # In later chapters, we allow expansions to be tuples,
        # with the expansion being the first element
        if isinstance(expansion, tuple):
            expansion = expansion[0]

        new_term = term.replace(symbol_to_expand, expansion, 1)

        if len(nonterminals(new_term)) < max_nonterminals:
            term = new_term
            if log:
                print("%-40s" % (symbol_to_expand + " -> " + expansion), term)
            expansion_trials = 0
        else:
            expansion_trials += 1
            if expansion_trials >= max_expansion_trials:
                raise ExpansionError("Cannot expand " + repr(term))

    return term

Let us see how this simple grammar fuzzer obtains an arithmetic expression from the start symbol:

In [26]:
simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=3, log=True)
<start> -> <expr>                        <expr>
<expr> -> <term> + <expr>                <term> + <expr>
<term> -> <factor>                       <factor> + <expr>
<factor> -> <integer>                    <integer> + <expr>
<integer> -> <digit>                     <digit> + <expr>
<digit> -> 6                             6 + <expr>
<expr> -> <term> - <expr>                6 + <term> - <expr>
<expr> -> <term>                         6 + <term> - <term>
<term> -> <factor>                       6 + <factor> - <term>
<factor> -> -<factor>                    6 + -<factor> - <term>
<term> -> <factor>                       6 + -<factor> - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<term>)
<term> -> <factor>                       6 + -(<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(+<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(++<factor>) - (<term>)
<term> -> <factor>                       6 + -(++<factor>) - (<factor>)
<factor> -> (<expr>)                     6 + -(++(<expr>)) - (<factor>)
<factor> -> <integer>                    6 + -(++(<expr>)) - (<integer>)
<expr> -> <term>                         6 + -(++(<term>)) - (<integer>)
<integer> -> <digit>                     6 + -(++(<term>)) - (<digit>)
<digit> -> 9                             6 + -(++(<term>)) - (9)
<term> -> <factor> * <term>              6 + -(++(<factor> * <term>)) - (9)
<term> -> <factor>                       6 + -(++(<factor> * <factor>)) - (9)
<factor> -> <integer>                    6 + -(++(<integer> * <factor>)) - (9)
<integer> -> <digit>                     6 + -(++(<digit> * <factor>)) - (9)
<digit> -> 2                             6 + -(++(2 * <factor>)) - (9)
<factor> -> +<factor>                    6 + -(++(2 * +<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +-<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +--<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +---<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +----<factor>)) - (9)
<factor> -> <integer>.<integer>          6 + -(++(2 * +----<integer>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<digit>)) - (9)
<digit> -> 1                             6 + -(++(2 * +----1.<digit>)) - (9)
<digit> -> 7                             6 + -(++(2 * +----1.7)) - (9)
Out[26]:
'6 + -(++(2 * +----1.7)) - (9)'

By increasing the limit of nonterminals, we can quickly get much longer productions:

In [27]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=5))
7 / +48.5
-5.9 / 9 - 4 * +-(-+++((1 + (+7 - (-1 * (++-+7.7 - -+-4.0))))) * +--4 - -(6) + 64)
8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 - 5 - 6)) * (-((-+(((+(4))))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090
1 - -3 * 7 - 28 / 9
(+9) * +-5 * ++-926.2 - (+9.03 / -+(-(-6) / 2 * +(-+--(8) / -(+1.0) - 5 + 4)) * 3.5)
8 + -(9.6 - 3 - -+-4 * +77)
-(((((++((((+((++++-((+-37))))))))))))) / ++(-(+++(+6)) * -++-(+(++(---6 * (((7)) * (1) / (-7.6 * 535338) + +256) * 0) * 0))) - 4 + +1
5.43
(9 / -405 / -23 - +-((+-(2 * (13))))) + +6 - +8 - 934
-++2 - (--+715769550) / 8 / (1)

Note that while our fuzzer does the job in most cases, it has a number of drawbacks.

In [28]:
quiz("What drawbacks does `simple_grammar_fuzzer()` have?",
     [
         "It has a large number of string search and replace operations",
         "It may fail to produce a string (`ExpansionError`)",
         "It often picks some symbol to expand "
         "that does not even occur in the string",
         "All of the above"
     ], "1 << 2")
Out[28]:

Quiz

What drawbacks does simple_grammar_fuzzer() have?





Indeed, simple_grammar_fuzzer() is rather inefficient due to the large number of search and replace operations, and it may even fail to produce a string. On the other hand, the implementation is straightforward and does the job in most cases. For this chapter, we'll stick to it; in the next chapter, we'll show how to build a more efficient one.

Visualizing Grammars as Railroad Diagrams¶

With grammars, we can easily specify the format for several of the examples we discussed earlier. The above arithmetic expressions, for instance, can be directly sent into bc (or any other program that takes arithmetic expressions). Before we introduce a few additional grammars, let us give a means to visualize them, giving an alternate view to aid their understanding.

Railroad diagrams, also called syntax diagrams, are a graphical representation of context-free grammars. They are read left to right, following possible "rail" tracks; the sequence of symbols encountered on the track defines the language. To produce railroad diagrams, we implement a function syntax_diagram().

Let us use syntax_diagram() to produce a railroad diagram of our expression grammar:

In [39]:
syntax_diagram(EXPR_GRAMMAR)
start
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> expr
expr
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> term + expr term - expr term
term
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> factor * term factor / term factor
factor
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> - factor + factor ( expr ) integer . integer integer
integer
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> digit integer digit
digit
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> 0 1 2 3 4 5 6 7 8 9

This railroad representation will come in handy as it comes to visualizing the structure of grammars – especially for more complex grammars.

Some Grammars¶

Let us create (and visualize) some more grammars and use them for fuzzing.

A CGI Grammar¶

Here's a grammar for cgi_decode() introduced in the chapter on coverage.

In [40]:
CGI_GRAMMAR: Grammar = {
    "<start>":
        ["<string>"],

    "<string>":
        ["<letter>", "<letter><string>"],

    "<letter>":
        ["<plus>", "<percent>", "<other>"],

    "<plus>":
        ["+"],

    "<percent>":
        ["%<hexdigit><hexdigit>"],

    "<hexdigit>":
        ["0", "1", "2", "3", "4", "5", "6", "7",
            "8", "9", "a", "b", "c", "d", "e", "f"],

    "<other>":  # Actually, could be _all_ letters
        ["0", "1", "2", "3", "4", "5", "a", "b", "c", "d", "e", "-", "_"],
}
In [41]:
syntax_diagram(CGI_GRAMMAR)
start
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> string
string
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> letter letter string
letter
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> plus percent other
plus
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> +
percent
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> % hexdigit hexdigit
hexdigit
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> 0 1 2 3 4 5 6 7 8 9 a b c d e f
other
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> 0 1 2 3 4 5 a b c d e - _

In contrast to basic fuzzing or mutation-based fuzzing, the grammar quickly produces all sorts of combinations:

In [42]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=CGI_GRAMMAR, max_nonterminals=10))
+%9a
+++%ce+
+_
+%c6c
++
+%cd+5
1%ee
%b9%d5
%96
%57d%42

A URL Grammar¶

The same properties we have seen for CGI input also hold for more complex inputs. Let us use a grammar to produce numerous valid URLs:

In [43]:
URL_GRAMMAR: Grammar = {
    "<start>":
        ["<url>"],
    "<url>":
        ["<scheme>://<authority><path><query>"],
    "<scheme>":
        ["http", "https", "ftp", "ftps"],
    "<authority>":
        ["<host>", "<host>:<port>", "<userinfo>@<host>", "<userinfo>@<host>:<port>"],
    "<host>":  # Just a few
        ["cispa.saarland", "www.google.com", "fuzzingbook.com"],
    "<port>":
        ["80", "8080", "<nat>"],
    "<nat>":
        ["<digit>", "<digit><digit>"],
    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    "<userinfo>":  # Just one
        ["user:password"],
    "<path>":  # Just a few
        ["", "/", "/<id>"],
    "<id>":  # Just a few
        ["abc", "def", "x<digit><digit>"],
    "<query>":
        ["", "?<params>"],
    "<params>":
        ["<param>", "<param>&<params>"],
    "<param>":  # Just a few
        ["<id>=<id>", "<id>=<nat>"],
}
In [44]:
syntax_diagram(URL_GRAMMAR)
start
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> url
url
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> scheme :// authority path query
scheme
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> https http ftp ftps
authority
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> host : port host userinfo @ host userinfo @ host : port
host
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> cispa.saarland www.google.com fuzzingbook.com
port
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> 80 8080 nat
nat
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> digit digit digit
digit
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> 0 1 2 3 4 5 6 7 8 9
userinfo
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> user:password
path
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> / / id
id
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> abc def x digit digit
query
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> ? params
params
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> param param & params
param
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> id = id id = nat

Again, within milliseconds, we can produce plenty of valid inputs.

In [45]:
for i in range(10):
    print(simple_grammar_fuzzer(grammar=URL_GRAMMAR, max_nonterminals=10))
https://user:password@cispa.saarland:80/
http://fuzzingbook.com?def=56&x89=3&x46=48&def=def
ftp://cispa.saarland/?x71=5&x35=90&def=abc
https://cispa.saarland:80/def?def=7&x23=abc
https://fuzzingbook.com:80/
https://fuzzingbook.com:80/abc?def=abc&abc=x14&def=abc&abc=2&def=38
ftps://fuzzingbook.com/x87
https://user:password@fuzzingbook.com:6?def=54&x44=abc
http://fuzzingbook.com:80?x33=25&def=8
http://fuzzingbook.com:8080/def

A Natural Language Grammar¶

Finally, grammars are not limited to formal languages such as computer inputs, but can also be used to produce natural language. This is the grammar we used to pick a title for this book:

In [46]:
TITLE_GRAMMAR: Grammar = {
    "<start>": ["<title>"],
    "<title>": ["<topic>: <subtopic>"],
    "<topic>": ["Generating Software Tests", "<fuzzing-prefix>Fuzzing", "The Fuzzing Book"],
    "<fuzzing-prefix>": ["", "The Art of ", "The Joy of "],
    "<subtopic>": ["<subtopic-main>",
                   "<subtopic-prefix><subtopic-main>",
                   "<subtopic-main><subtopic-suffix>"],
    "<subtopic-main>": ["Breaking Software",
                        "Generating Software Tests",
                        "Principles, Techniques and Tools"],
    "<subtopic-prefix>": ["", "Tools and Techniques for "],
    "<subtopic-suffix>": [" for <reader-property> and <reader-property>",
                          " for <software-property> and <software-property>"],
    "<reader-property>": ["Fun", "Profit"],
    "<software-property>": ["Robustness", "Reliability", "Security"],
}
In [47]:
syntax_diagram(TITLE_GRAMMAR)
start
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> title
title
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> topic : subtopic
topic
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> Generating Software Tests fuzzing-prefix Fuzzing The Fuzzing Book
fuzzing-prefix
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> The Art of The Joy of
subtopic
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> subtopic-main subtopic-prefix subtopic-main subtopic-main subtopic-suffix
subtopic-main
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> Breaking Software Generating Software Tests Principles, Techniques and Tools
subtopic-prefix
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> Tools and Techniques for
subtopic-suffix
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> for reader-property and reader-property for software-property and software-property
reader-property
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> Fun Profit
software-property
<style>/* */ svg.railroad-diagram { } svg.railroad-diagram path { stroke-width:3; stroke:black; fill:white; } svg.railroad-diagram text { font:14px "Fira Mono", monospace; text-anchor:middle; } svg.railroad-diagram text.label{ text-anchor:start; } svg.railroad-diagram text.comment{ font:italic 12px "Fira Mono", monospace; } svg.railroad-diagram rect{ stroke-width:2; stroke:black; fill:mistyrose; } /* */ </style> Robustness Reliability Security
In [49]:
titles: Set[str] = set()
while len(titles) < 10:
    titles.add(simple_grammar_fuzzer(
        grammar=TITLE_GRAMMAR, max_nonterminals=10))
titles
Out[49]:
{'Fuzzing: Generating Software Tests',
 'Fuzzing: Principles, Techniques and Tools',
 'Generating Software Tests: Breaking Software',
 'Generating Software Tests: Breaking Software for Robustness and Robustness',
 'Generating Software Tests: Principles, Techniques and Tools',
 'Generating Software Tests: Principles, Techniques and Tools for Profit and Fun',
 'Generating Software Tests: Tools and Techniques for Principles, Techniques and Tools',
 'The Fuzzing Book: Breaking Software',
 'The Fuzzing Book: Generating Software Tests for Profit and Profit',
 'The Fuzzing Book: Generating Software Tests for Robustness and Robustness'}

(If you find that there is redundancy ("Robustness and Robustness") in here: In our chapter on coverage-based fuzzing, we will show how to cover each expansion only once. And if you like some alternatives more than others, probabilistic grammar fuzzing will be there for you.)

Grammars as Mutation Seeds¶

One very useful property of grammars is that they produce mostly valid inputs. From a syntactical standpoint, the inputs are actually always valid, as they satisfy the constraints of the given grammar. (Of course, one needs a valid grammar in the first place.) However, there are also semantic properties that cannot be easily expressed in a grammar. If, say, for a URL, the port range is supposed to be between 1024 and 2048, this is hard to write in a grammar. If one has to satisfy more complex constraints, one quickly reaches the limits of what a grammar can express.

One way around this is to attach constraints to grammars, as we will discuss later in this book. Another possibility is to put together the strengths of grammar-based fuzzing and mutation-based fuzzing. The idea is to use the grammar-generated inputs as seeds for further mutation-based fuzzing. This way, we can explore not only valid inputs, but also check out the boundaries between valid and invalid inputs. This is particularly interesting as slightly invalid inputs allow finding parser errors (which are often abundant). As with fuzzing in general, it is the unexpected which reveals errors in programs.

To use our generated inputs as seeds, we can feed them directly into the mutation fuzzers introduced earlier:

In [51]:
number_of_seeds = 10
seeds = [
    simple_grammar_fuzzer(
        grammar=URL_GRAMMAR,
        max_nonterminals=10) for i in range(number_of_seeds)]
seeds
Out[51]:
['ftps://user:password@www.google.com:80',
 'http://cispa.saarland/',
 'ftp://www.google.com:42/',
 'ftps://user:password@fuzzingbook.com:39?abc=abc',
 'https://www.google.com?x33=1&x06=1',
 'http://www.google.com:02/',
 'https://user:password@www.google.com/',
 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5',
 'http://www.google.com:80/def?def=abc',
 'http://user:password@cispa.saarland/']
In [52]:
m = MutationFuzzer(seeds)
In [53]:
[m.fuzz() for i in range(20)]
Out[53]:
['ftps://user:password@www.google.com:80',
 'http://cispa.saarland/',
 'ftp://www.google.com:42/',
 'ftps://user:password@fuzzingbook.com:39?abc=abc',
 'https://www.google.com?x33=1&x06=1',
 'http://www.google.com:02/',
 'https://user:password@www.google.com/',
 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5',
 'http://www.google.com:80/def?def=abc',
 'http://user:password@cispa.saarland/',
 'Eh4tp:www.coogle.com:80/def?d%f=abc',
 'ftps://}ser:passwod@fuzzingbook.com:9?abc=abc',
 'uftp//cispa.sRaarland:808&0?abc=abc&def=defabc=5',
 'http://user:paswor9d@cispar.saarland/v',
 'ftp://Www.g\x7fogle.cAom:42/',
 'hht://userC:qassMword@cispy.csaarland/',
 'httx://ww.googlecom:80defde`f=ac',
 'htt://cispq.waarlnd/',
 'htFtp\t://cmspa./saarna(md/',
 'ft:/www.google.com:42\x0f']

While the first 10 fuzz() calls return the seeded inputs (as designed), the later ones again create arbitrary mutations. Using MutationCoverageFuzzer instead of MutationFuzzer, we could again have our search guided by coverage – and thus bring together the best of multiple worlds.

A Grammar Toolbox¶

Let us now introduce a few techniques that help us writing grammars.

Escapes¶

With < and > delimiting nonterminals in our grammars, how can we actually express that some input should contain < and >? The answer is simple: Just introduce a symbol for them.

In [54]:
simple_nonterminal_grammar: Grammar = {
    "<start>": ["<nonterminal>"],
    "<nonterminal>": ["<left-angle><identifier><right-angle>"],
    "<left-angle>": ["<"],
    "<right-angle>": [">"],
    "<identifier>": ["id"]  # for now
}

In simple_nonterminal_grammar, neither the expansion for <left-angle> nor the expansion for <right-angle> can be mistaken for a nonterminal. Hence, we can produce as many as we want.

(Note that this does not work with simple_grammar_fuzzer(), but rather with the GrammarFuzzer class we'll introduce in the next chapter.)

Extending Grammars¶

In the course of this book, we frequently run into the issue of creating a grammar by extending an existing grammar with new features. Such an extension is very much like subclassing in object-oriented programming.

To create a new grammar $g'$ from an existing grammar $g$, we first copy $g$ into $g'$, and then go and extend existing rules with new alternatives and/or add new symbols. Here's an example, extending the above nonterminal grammar with a better rule for identifiers:

In [56]:
nonterminal_grammar = copy.deepcopy(simple_nonterminal_grammar)
nonterminal_grammar["<identifier>"] = ["<idchar>", "<identifier><idchar>"]
nonterminal_grammar["<idchar>"] = ['a', 'b', 'c', 'd']  # for now
In [57]:
nonterminal_grammar
Out[57]:
{'<start>': ['<nonterminal>'],
 '<nonterminal>': ['<left-angle><identifier><right-angle>'],
 '<left-angle>': ['<'],
 '<right-angle>': ['>'],
 '<identifier>': ['<idchar>', '<identifier><idchar>'],
 '<idchar>': ['a', 'b', 'c', 'd']}

Since such an extension of grammars is a common operation, we introduce a custom function extend_grammar() which first copies the given grammar and then updates it from a dictionary, using the Python dictionary update() method:

In [58]:
def extend_grammar(grammar: Grammar, extension: Grammar = {}) -> Grammar:
    """Create a copy of `grammar`, updated with `extension`."""
    new_grammar = copy.deepcopy(grammar)
    new_grammar.update(extension)
    return new_grammar

This call to extend_grammar() extends simple_nonterminal_grammar to nonterminal_grammar just like the "manual" example above:

In [59]:
nonterminal_grammar = extend_grammar(simple_nonterminal_grammar,
                                     {
                                         "<identifier>": ["<idchar>", "<identifier><idchar>"],
                                         # for now
                                         "<idchar>": ['a', 'b', 'c', 'd']
                                     }
                                     )

Character Classes¶

In the above nonterminal_grammar, we have enumerated only the first few letters; indeed, enumerating all letters or digits in a grammar manually, as in <idchar> ::= 'a' | 'b' | 'c' ... is a bit painful.

However, remember that grammars are part of a program, and can thus also be constructed programmatically. We introduce a function srange() which constructs a list of characters in a string:

In [60]:
import string
In [61]:
def srange(characters: str) -> List[Expansion]:
    """Construct a list with all characters in the string"""
    return [c for c in characters]

If we pass it the constant string.ascii_letters, which holds all ASCII letters, srange() returns a list of all ASCII letters:

In [62]:
string.ascii_letters
Out[62]:
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
In [63]:
srange(string.ascii_letters)[:10]
Out[63]:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

We can use such constants in our grammar to quickly define identifiers:

In [64]:
nonterminal_grammar = extend_grammar(nonterminal_grammar,
                                     {
                                         "<idchar>": (srange(string.ascii_letters) + 
                                                      srange(string.digits) + 
                                                      srange("-_"))
                                     }
                                     )
In [65]:
[simple_grammar_fuzzer(nonterminal_grammar, "<identifier>") for i in range(10)]
Out[65]:
['b', 'd', 'V9', 'x4c', 'YdiEWj', 'c', 'xd', '7', 'vIU', 'QhKD']

The shortcut crange(start, end) returns a list of all characters in the ASCII range of start to (including) end:

In [66]:
def crange(character_start: str, character_end: str) -> List[Expansion]:
    return [chr(i)
            for i in range(ord(character_start), ord(character_end) + 1)]

We can use this to express ranges of characters:

In [67]:
crange('0', '9')
Out[67]:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
In [68]:
assert crange('a', 'z') == srange(string.ascii_lowercase)

Grammar Shortcuts¶

In the above nonterminal_grammar, as in other grammars, we have to express repetitions of characters using recursion, that is, by referring to the original definition:

In [69]:
nonterminal_grammar["<identifier>"]
Out[69]:
['<idchar>', '<identifier><idchar>']

It could be a bit easier if we simply could state that a nonterminal should be a non-empty sequence of letters – for instance, as in

<identifier> = <idchar>+

where + denotes a non-empty repetition of the symbol it follows.

Operators such as + are frequently introduced as handy shortcuts in grammars. Formally, our grammars come in the so-called Backus-Naur form, or BNF for short. Operators extend BNF to so-called _extended BNF, or EBNF* for short:

  • The form <symbol>? indicates that <symbol> is optional – that is, it can occur 0 or 1 times.
  • The form <symbol>+ indicates that <symbol> can occur 1 or more times repeatedly.
  • The form <symbol>* indicates that <symbol> can occur 0 or more times. (In other words, it is an optional repetition.)

To make matters even more interesting, we would like to use parentheses with the above shortcuts. Thus, (<foo><bar>)? indicates that the sequence of <foo> and <bar> is optional.

Using such operators, we can define the identifier rule in a simpler way. To this end, let us create a copy of the original grammar and modify the <identifier> rule:

In [70]:
nonterminal_ebnf_grammar = extend_grammar(nonterminal_grammar,
                                          {
                                              "<identifier>": ["<idchar>+"]
                                          }
                                          )

Likewise, we can simplify the expression grammar. Consider how signs are optional, and how integers can be expressed as sequences of digits.

In [71]:
EXPR_EBNF_GRAMMAR: Grammar = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<term> + <expr>", "<term> - <expr>", "<term>"],

    "<term>":
        ["<factor> * <term>", "<factor> / <term>", "<factor>"],

    "<factor>":
        ["<sign>?<factor>", "(<expr>)", "<integer>(.<integer>)?"],

    "<sign>":
        ["+", "-"],

    "<integer>":
        ["<digit>+"],

    "<digit>":
        srange(string.digits)
}

Let us implement a function convert_ebnf_grammar() that takes such an EBNF grammar and automatically translates it into a BNF grammar.

Here's an example of using convert_ebnf_grammar():

In [86]:
convert_ebnf_grammar({"<authority>": ["(<userinfo>@)?<host>(:<port>)?"]})
Out[86]:
{'<authority>': ['<symbol-2><host><symbol-1-1>'],
 '<symbol>': ['<userinfo>@'],
 '<symbol-1>': [':<port>'],
 '<symbol-2>': ['', '<symbol>'],
 '<symbol-1-1>': ['', '<symbol-1>']}
In [87]:
expr_grammar = convert_ebnf_grammar(EXPR_EBNF_GRAMMAR)
expr_grammar
Out[87]:
{'<start>': ['<expr>'],
 '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
 '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
 '<factor>': ['<sign-1><factor>', '(<expr>)', '<integer><symbol-1>'],
 '<sign>': ['+', '-'],
 '<integer>': ['<digit-1>'],
 '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<symbol>': ['.<integer>'],
 '<sign-1>': ['', '<sign>'],
 '<symbol-1>': ['', '<symbol>'],
 '<digit-1>': ['<digit>', '<digit><digit-1>']}

Success! We have nicely converted the EBNF grammar into BNF.

With character classes and EBNF grammar conversion, we have two powerful tools that make the writing of grammars easier. We will use these again and again as it comes to working with grammars.

Grammar Extensions¶

During the course of this book, we frequently want to specify additional information for grammars, such as probabilities or constraints. To support these extensions, as well as possibly others, we define an annotation mechanism.

Our concept for annotating grammars is to add annotations to individual expansions. To this end, we allow that an expansion cannot only be a string, but also a pair of a string and a set of attributes, as in

"<expr>":
        [("<term> + <expr>", opts(min_depth=10)),
         ("<term> - <expr>", opts(max_depth=2)),
         "<term>"]

Here, the opts() function would allow us to express annotations that apply to the individual expansions; in this case, the addition would be annotated with a min_depth value of 10, and the subtraction with a max_depth value of 2. The meaning of these annotations is left to the individual algorithms dealing with the grammars; the general idea, though, is that they can be ignored.

Checking Grammars¶

Since grammars are represented as strings, it is fairly easy to introduce errors. So let us introduce a helper function that checks a grammar for consistency.

The helper function is_valid_grammar() iterates over a grammar to check whether all used symbols are defined, and vice versa, which is very useful for debugging; it also checks whether all symbols are reachable from the start symbol. You don't have to delve into details here, but as always, it is important to get the input data straight before we make use of it.

Let us make use of is_valid_grammar(). Our grammars defined above pass the test:

In [104]:
assert is_valid_grammar(EXPR_GRAMMAR)
assert is_valid_grammar(CGI_GRAMMAR)
assert is_valid_grammar(URL_GRAMMAR)

The check can also be applied to EBNF grammars:

In [105]:
assert is_valid_grammar(EXPR_EBNF_GRAMMAR)

These do not pass the test, though:

In [106]:
assert not is_valid_grammar({"<start>": ["<x>"], "<y>": ["1"]})  # type: ignore
'<y>': defined, but not used. Consider applying trim_grammar() on the grammar
'<x>': used, but not defined
'<y>': unreachable from <start>. Consider applying trim_grammar() on the grammar
In [107]:
assert not is_valid_grammar({"<start>": "123"})  # type: ignore
'<start>': expansion is not a list
In [108]:
assert not is_valid_grammar({"<start>": []})  # type: ignore
'<start>': expansion list empty
In [109]:
assert not is_valid_grammar({"<start>": [1, 2, 3]})  # type: ignore
'<start>': 1: not a string

(The #type: ignore annotations avoid static checkers flagging the above as errors).

From here on, we will always use is_valid_grammar() when defining a grammar.

Synopsis¶

This chapter introduces grammars as a simple means to specify input languages, and to use them for testing programs with syntactically valid inputs. A grammar is defined as a mapping of nonterminal symbols to lists of alternative expansions, as in the following example:

In [110]:
US_PHONE_GRAMMAR: Grammar = {
    "<start>": ["<phone-number>"],
    "<phone-number>": ["(<area>)<exchange>-<line>"],
    "<area>": ["<lead-digit><digit><digit>"],
    "<exchange>": ["<lead-digit><digit><digit>"],
    "<line>": ["<digit><digit><digit><digit>"],
    "<lead-digit>": ["2", "3", "4", "5", "6", "7", "8", "9"],
    "<digit>": ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

assert is_valid_grammar(US_PHONE_GRAMMAR)

Nonterminal symbols are enclosed in angle brackets (say, <digit>). To generate an input string from a grammar, a producer starts with the start symbol (<start>) and randomly chooses a random expansion for this symbol. It continues the process until all nonterminal symbols are expanded. The function simple_grammar_fuzzer() does just that:

In [111]:
[simple_grammar_fuzzer(US_PHONE_GRAMMAR) for i in range(5)]
Out[111]:
['(692)449-5179',
 '(519)230-7422',
 '(613)761-0853',
 '(979)881-3858',
 '(810)914-5475']

In practice, though, instead of simple_grammar_fuzzer(), you should use the GrammarFuzzer class or one of its coverage-based, probabilistic-based, or generator-based derivatives; these are more efficient, protect against infinite growth, and provide several additional features.

This chapter also introduces a grammar toolbox with several helper functions that ease the writing of grammars, such as using shortcut notations for character classes and repetitions, or extending grammars

Lessons Learned¶

  • Grammars are powerful tools to express and produce syntactically valid inputs.
  • Inputs produced from grammars can be used as is, or used as seeds for mutation-based fuzzing.
  • Grammars can be extended with character classes and operators to make writing easier.

Next Steps¶

As they make a great foundation for generating software tests, we use grammars again and again in this book. As a sneak preview, we can use grammars to fuzz configurations:

<options> ::= <option>*
<option> ::= -h | --version | -v | -d | -i | --global-config <filename>

We can use grammars for fuzzing functions and APIs and fuzzing graphical user interfaces:

<call-sequence> ::= <call>*
<call> ::= urlparse(<url>) | urlsplit(<url>)

We can assign probabilities and constraints to individual expansions:

<term>: 50% <factor> * <term> |  30% <factor> / <term> | 20% <factor>
<integer>: <digit>+ { <integer> >= 100 }

All these extras become especially valuable as we can

  1. infer grammars automatically, dropping the need to specify them manually, and
  2. guide them towards specific goals such as coverage or critical functions;

which we also discuss for all techniques in this book.

To get there, however, we still have a bit of homework to do. In particular, we first have to learn how to

  • create an efficient grammar fuzzer

Background¶

As one of the foundations of human language, grammars have been around as long as human language existed. The first formalization of generative grammars was by Dakṣiputra Pāṇini in 350 BC \cite{Panini350bce}. As a general means to express formal languages for both data and programs, their role in computer science cannot be overstated. The seminal work by Chomsky \cite{Chomsky1956} introduced the central models of regular languages, context-free grammars, context-sensitive grammars, and universal grammars as they are used (and taught) in computer science as a means to specify input and programming languages ever since.

The use of grammars for producing test inputs goes back to Burkhardt \cite{Burkhardt1967}, to be later rediscovered and applied by Hanford \cite{Hanford1970} and Purdom \cite{Purdom1972}. The most important use of grammar testing since then has been compiler testing. Actually, grammar-based testing is one important reason why compilers and Web browsers work as they should:

  • The CSmith tool \cite{Yang2011} specifically targets C programs, starting with a C grammar and then applying additional steps, such as referring to variables and functions defined earlier or ensuring integer and type safety. Their authors have used it "to find and report more than 400 previously unknown compiler bugs."

  • The LangFuzz work \cite{Holler2012}, which shares two authors with this book, uses a generic grammar to produce outputs, and is used day and night to generate JavaScript programs and test their interpreters; as of today, it has found more than 2,600 bugs in browsers such as Mozilla Firefox, Google Chrome, and Microsoft Edge.

  • The EMI Project \cite{Le2014} uses grammars to stress-test C compilers, transforming known tests into alternative programs that should be semantically equivalent over all inputs. Again, this has led to more than 100 bugs in C compilers being fixed.

  • Grammarinator \cite{Hodovan2018} is an open-source grammar fuzzer (written in Python!), using the popular ANTLR format as grammar specification. Like LangFuzz, it uses the grammar for both parsing and producing, and has found more than 100 issues in the JerryScript lightweight JavaScript engine and an associated platform.

  • Domato is a generic grammar generation engine that is specifically used for fuzzing DOM input. It has revealed a number of security issues in popular Web browsers.

Compilers and Web browsers, of course, are not only domains where grammars are needed for testing, but also domains where grammars are well-known. Our claim in this book is that grammars can be used to generate almost any input, and our aim is to empower you to do precisely that.

Exercises¶

Exercise 1: A JSON Grammar¶

Take a look at the JSON specification and derive a grammar from it:

  • Use character classes to express valid characters
  • Use EBNF to express repetitions and optional parts
  • Assume that
    • a string is a sequence of digits, ASCII letters, punctuation and space characters without quotes or escapes
    • whitespace is just a single space.
  • Use is_valid_grammar() to ensure the grammar is valid.

Feed the grammar into simple_grammar_fuzzer(). Do you encounter any errors, and why?

Exercise 2: Finding Bugs¶

The name simple_grammar_fuzzer() does not come by accident: The way it expands grammars is limited in several ways. What happens if you apply simple_grammar_fuzzer() on nonterminal_grammar and expr_grammar, as defined above, and why?

Exercise 3: Grammars with Regular Expressions¶

In a grammar extended with regular expressions, we can use the special form

/regex/

to include regular expressions in expansions. For instance, we can have a rule

<integer> ::= /[+-]?[0-9]+/

to quickly express that an integer is an optional sign, followed by a sequence of digits.

Part 1: Convert regular expressions¶

Write a converter convert_regex(r) that takes a regular expression r and creates an equivalent grammar. Support the following regular expression constructs:

  • *, +, ?, () should work just in EBNFs, above.
  • a|b should translate into a list of alternatives [a, b].
  • . should match any character except newline.
  • [abc] should translate into srange("abc")
  • [^abc] should translate into the set of ASCII characters except srange("abc").
  • [a-b] should translate into crange(a, b)
  • [^a-b] should translate into the set of ASCII characters except crange(a, b).

Example: convert_regex(r"[0-9]+") should yield a grammar such as

{
    "<start>": ["<s1>"],
    "<s1>": [ "<s2>", "<s1><s2>" ],
    "<s2>": crange('0', '9')
}

Part 2: Identify and expand regular expressions¶

Write a converter convert_regex_grammar(g) that takes a EBNF grammar g containing regular expressions in the form /.../ and creates an equivalent BNF grammar. Support the regular expression constructs as above.

Example: convert_regex_grammar({ "<integer>" : "/[+-]?[0-9]+/" }) should yield a grammar such as

{
    "<integer>": ["<s1><s3>"],
    "<s1>": [ "", "<s2>" ],
    "<s2>": srange("+-"),
    "<s3>": [ "<s4>", "<s4><s3>" ],
    "<s4>": crange('0', '9')
}

Optional: Support escapes in regular expressions: \c translates to the literal character c; \/ translates to / (and thus does not end the regular expression); \\ translates to \.

Exercise 4: Defining Grammars as Functions (Advanced)¶

To obtain a nicer syntax for specifying grammars, one can make use of Python constructs which then will be parsed by an additional function. For instance, we can imagine a grammar definition which uses | as a means to separate alternatives:

In [120]:
def expression_grammar_fn():
    start = "<expr>"
    expr = "<term> + <expr>" | "<term> - <expr>"
    term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"
    factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"
    integer = "<digit><integer>" | "<digit>"
    digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

If we execute expression_grammar_fn(), this will yield an error. Yet, the purpose of expression_grammar_fn() is not to be executed, but to be used as data from which the grammar will be constructed.

In [121]:
with ExpectError():
    expression_grammar_fn()
Traceback (most recent call last):
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_8783/1271268731.py", line 2, in <module>
    expression_grammar_fn()
  File "/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_8783/3029408019.py", line 3, in expression_grammar_fn
    expr = "<term> + <expr>" | "<term> - <expr>"
           ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for |: 'str' and 'str' (expected)

To this end, we make use of the ast (abstract syntax tree) and inspect (code inspection) modules.

First, we obtain the source code of expression_grammar_fn()...

In [123]:
source = inspect.getsource(expression_grammar_fn)
source
Out[123]:
'def expression_grammar_fn():\n    start = "<expr>"\n    expr = "<term> + <expr>" | "<term> - <expr>"\n    term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"\n    factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"\n    integer = "<digit><integer>" | "<digit>"\n    digit = \'0\' | \'1\' | \'2\' | \'3\' | \'4\' | \'5\' | \'6\' | \'7\' | \'8\' | \'9\'\n'

... which we then parse into an abstract syntax tree:

In [124]:
tree = ast.parse(source)

We can now parse the tree to find operators and alternatives. get_alternatives() iterates over all nodes op of the tree; If the node looks like a binary or (|) operation, we drill deeper and recurse. If not, we have reached a single production, and we try to get the expression from the production. We define the to_expr parameter depending on how we want to represent the production. In this case, we represent a single production by a single string.

In [125]:
def get_alternatives(op, to_expr=lambda o: o.s):
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.BitOr):
        return get_alternatives(op.left, to_expr) + [to_expr(op.right)]
    return [to_expr(op)]

funct_parser() takes the abstract syntax tree of a function (say, expression_grammar_fn()) and iterates over all assignments:

In [126]:
def funct_parser(tree, to_expr=lambda o: o.s):
    return {assign.targets[0].id: get_alternatives(assign.value, to_expr)
            for assign in tree.body[0].body}

The result is a grammar in our regular format:

In [127]:
grammar = funct_parser(tree)
for symbol in grammar:
    print(symbol, "::=", grammar[symbol])
start ::= ['<expr>']
expr ::= ['<term> + <expr>', '<term> - <expr>']
term ::= ['<factor> * <term>', '<factor> / <term>', '<factor>']
factor ::= ['+<factor>', '-<factor>', '(<expr>)', '<integer>.<integer>', '<integer>']
integer ::= ['<digit><integer>', '<digit>']
digit ::= ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
/var/folders/n2/xd9445p97rb3xh7m1dfx8_4h0006ts/T/ipykernel_8783/1939255217.py:1: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  def funct_parser(tree, to_expr=lambda o: o.s):

Part 1 (a): One Single Function¶

Write a single function define_grammar(fn) that takes a grammar defined as function (such as expression_grammar_fn()) and returns a regular grammar.

Solution. This is straightforward:

Part 1 (b): Alternative representations¶

We note that the grammar representation we designed previously does not allow simple generation of alternatives such as srange() and crange(). Further, one may find the string representation of expressions limiting. It turns out that it is simple to extend our grammar definition to support grammars such as below:

In [130]:
def define_name(o):
    return o.id if isinstance(o, ast.Name) else o.s
In [131]:
def define_expr(op):
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.Add):
        return (*define_expr(op.left), define_name(op.right))
    return (define_name(op),)
In [132]:
def define_ex_grammar(fn):
    return define_grammar(fn, define_expr)

The grammar:

@define_ex_grammar
def expression_grammar():
    start   = expr
    expr    = (term + '+' + expr
            |  term + '-' + expr)
    term    = (factor + '*' + term
            |  factor + '/' + term
            |  factor)
    factor  = ('+' + factor
            |  '-' + factor
            |  '(' + expr + ')'
            |  integer + '.' + integer
            |  integer)
    integer = (digit + integer
            |  digit)
    digit   = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

for symbol in expression_grammar:
    print(symbol, "::=", expression_grammar[symbol])

Note. The grammar data structure thus obtained is a little more detailed than the standard data structure. It represents each production as a tuple.

We note that we have not enabled srange() or crange() in the above grammar. How would you go about adding these? (Hint: wrap define_expr() to look for ast.Call)

Part 2: Extended Grammars¶

Introduce an operator * that takes a pair (min, max) where min and max are the minimum and maximum number of repetitions, respectively. A missing value min stands for zero; a missing value max for infinity.

In [133]:
def identifier_grammar_fn():
    identifier = idchar * (1,)

With the * operator, we can generalize the EBNF operators – ? becomes (0,1), * becomes (0,), and + becomes (1,). Write a converter that takes an extended grammar defined using *, parse it, and convert it into BNF.