Fuzzing with Grammars

In the chapter on "Mutation-Based Fuzzing", we have seen how to use extra hints – such as sample input files – to speed up test generation. In this chapter, we take this idea one step further, by providing a specification of the legal inputs to a program. Specifying inputs via a grammar allows for very systematic and efficient test generation, in particular for complex input formats. Grammars also serve as the base for configuration fuzzing, API fuzzing, GUI fuzzing, and many more.

Prerequisites

Input Languages

All possible behaviors of a program can be triggered by its input. "Input" here can be a wide range of possible sources: We are talking about data that is read from files, from the environment, or over the network, data input by the user, or data acquired from interaction with other resources. The set of all these inputs determines how the program will behave – including its failures. When testing, it is thus very helpful to think about possible input sources, how to get them under control, and how to systematically test them.

For the sake of simplicity, we will assume for now that the program has only one source of inputs; this is the same assumption we have been using in the previous chapters, too. The set of valid inputs to a program is called a language. Languages range from the simple to the complex: the CSV language denotes the set of valid comma-separated inputs, whereas the Python language denotes the set of valid Python programs. We commonly separate data languages and programming languages, although any program can also be treated as input data (say, to a compiler). The Wikipedia page on file formats lists more than 1,000 different file formats, each of which is its own language.

To formally describe languages, the field of formal languages has devised a number of language specifications that describe a language. Regular expressions represent the simplest class of these languages to denote sets of strings: The regular expression [a-z]*, for instance, denotes a (possibly empty) sequence of lowercase letters. Automata theory connects these languages to automata that accept these inputs; finite state machines, for instance, can be used to specify the language of regular expressions.

Regular expressions are great for not-too-complex input formats, and the associated finite state machines have many properties that make them great for reasoning. To specify more complex inputs, though, they quickly encounter limitations. At the other end of the language spectrum, we have universal grammars that denote the language accepted by Turing machines. A Turing machine can compute anything that can be computed; and with Python being Turing-complete, this means that we can also use a Python program $p$ to specify or even enumerate legal inputs. But then, computer science theory also tells us that each such testing program has to be written specifically for the program to be tested, which is not the level of automation we want.

Grammars

The middle ground between regular expressions and Turing machines is covered by grammars. Grammars are among the most popular (and best understood) formalisms to formally specify input languages. Using a grammar, one can express a wide range of the properties of an input language. Grammars are particularly great for expressing the syntactical structure of an input, and are the formalism of choice to express nested or recursive inputs. The grammars we use are so-called context-free grammars, one of the easiest and most popular grammar formalisms.

Rules and Expansions

A grammar consists of a start symbol and a set of expansion rules (or simply rules) which indicate how the start symbol (and other symbols) can be expanded. As an example, consider the following grammar, denoting a sequence of two digits:

<start> ::= <digit><digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

To read such a grammar, start with the start symbol (<start>). An expansion rule <A> ::= <B> means that the symbol on the left side (<A>) can be replaced by the string on the right side (<B>). In the above grammar, <start> would be replaced by <digit><digit>.

In this string again, <digit> would be replaced by the string on the right side of the <digit> rule. The special operator | denotes expansion alternatives (or simply alternatives), meaning that any of the digits can be chosen for an expansion. Each <digit> thus would be expanded into one of the given digits, eventually yielding a string between 00 and 99. There are no further expansions for 0 to 9, so we are all set.

The interesting thing about grammars is that they can be recursive. That is, expansions can make use of symbols expanded earlier – which would then be expanded again. As an example, consider a grammar that describes integers:

<start>  ::= <integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Here, a <integer> is either a single digit, or a digit followed by another integer. The number 1234 thus would be represented as a single digit 1, followed by the integer 234, which in turn is a digit 2, followed by the integer 34.

If we wanted to express that an integer can be preceded by a sign (+ or -), we would write the grammar as

<start>   ::= <number>
<number>  ::= <integer> | +<integer> | -<integer>
<integer> ::= <digit> | <digit><integer>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

These rules formally define the language: Anything that can be derived from the start symbol is part of the language; anything that cannot is not.

Arithmetic Expressions

Let us expand our grammar to cover full arithmetic expressions – a poster child example for a grammar. We see that an expression (<expr>) is either a sum, or a difference, or a term; a term is either a product or a division, or a factor; and a factor is either a number or a parenthesized expression. Almost all rules can have recursion, and thus allow arbitrary complex expressions such as (1 + 2) * (3.4 / 5.6 - 789).

<start>   ::= <expr>
<expr>    ::= <term> + <expr> | <term> - <expr> | <term>
<term>    ::= <term> * <factor> | <term> / <factor> | <factor>
<factor>  ::= +<factor> | -<factor> | (<expr>) | <integer> | <integer>.<integer>
<integer> ::= <digit><integer> | <digit>
<digit>   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

In such a grammar, if we start with <start> and then expand one symbol after another, randomly choosing alternatives, we can quickly produce one valid arithmetic expression after another. Such grammar fuzzing is highly effective as it comes to produce complex inputs, and this is what we will implement in this chapter.

Representing Grammars in Python

Our first step in building a grammar fuzzer is to find an appropriate format for grammars. To make the writing of grammars as simple as possible, we use a format that is based on strings and lists. Our grammars in Python take the format of a mapping between symbol names and expansions, where expansions are lists of alternatives. A one-rule grammar for digits thus takes the form

import fuzzingbook_utils
DIGIT_GRAMMAR = {
    "<start>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

whereas the full grammar for arithmetic expressions looks like this:

EXPR_GRAMMAR = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<term> + <expr>", "<term> - <expr>", "<term>"],

    "<term>":
        ["<factor> * <term>", "<factor> / <term>", "<factor>"],

    "<factor>":
        ["+<factor>",
         "-<factor>",
         "(<expr>)",
         "<integer>.<integer>",
         "<integer>"],

    "<integer>":
        ["<digit><integer>", "<digit>"],

    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
}

In the grammar, every symbol can be defined exactly once. We can access any rule by its symbol...

EXPR_GRAMMAR["<digit>"]
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

....and we can check whether a symbol is in the grammar:

"<identifier>" in EXPR_GRAMMAR
False

Note that we assume that on the left hand side of a rule (i.e., the key in the mapping) is always a single symbol. This is the property that gives our grammars the characterization of context-free.

Some Definitions

We assume that the canonical start symbol is <start>:

START_SYMBOL = "<start>"

The handy nonterminals() function extracts the list of nonterminal symbols (i.e., anything between < and >, except spaces) from an expansion.

import re
RE_NONTERMINAL = re.compile(r'(<[^<> ]*>)')
def nonterminals(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    return re.findall(RE_NONTERMINAL, expansion)
assert nonterminals("<term> * <factor>") == ["<term>", "<factor>"]
assert nonterminals("<digit><integer>") == ["<digit>", "<integer>"]
assert nonterminals("1 < 3 > 2") == []
assert nonterminals("1 <3> 2") == ["<3>"]
assert nonterminals("1 + 2") == []
assert nonterminals(("<1>", {'option': 'value'})) == ["<1>"]

Likewise, is_nonterminal() checks whether some symbol is a nonterminal:

def is_nonterminal(s):
    return re.match(RE_NONTERMINAL, s)
assert is_nonterminal("<abc>")
assert is_nonterminal("<symbol-1>")
assert not is_nonterminal("+")

A Simple Grammar Fuzzer

Let us now put the above grammars to use. We will build a very simple grammar fuzzer that starts with a start symbol (<start>) and then keeps on expanding it. To avoid expansion to infinite inputs, we place a limit (max_symbols) on the number of symbols. Furthermore, to avoid being stuck in a situation where we cannot reduce the number of symbols any further, we also limit the total number of expansion steps.

import random
class ExpansionError(Exception):
    pass
def simple_grammar_fuzzer(grammar, start_symbol=START_SYMBOL,
                          max_nonterminals=10, max_expansion_trials=100,
                          log=False):
    term = start_symbol
    expansion_trials = 0

    while len(nonterminals(term)) > 0:
        symbol_to_expand = random.choice(nonterminals(term))
        expansions = grammar[symbol_to_expand]
        expansion = random.choice(expansions)
        new_term = term.replace(symbol_to_expand, expansion, 1)

        if len(nonterminals(new_term)) < max_nonterminals:
            term = new_term
            if log:
                print("%-40s" % (symbol_to_expand + " -> " + expansion), term)
            expansion_trials = 0
        else:
            expansion_trials += 1
            if expansion_trials >= max_expansion_trials:
                raise ExpansionError("Cannot expand " + repr(term))

    return term

Let us see how this simple grammar fuzzer obtains an arithmetic expression from the start symbol:

simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=3, log=True)
<start> -> <expr>                        <expr>
<expr> -> <term> + <expr>                <term> + <expr>
<term> -> <factor>                       <factor> + <expr>
<factor> -> <integer>                    <integer> + <expr>
<integer> -> <digit>                     <digit> + <expr>
<digit> -> 6                             6 + <expr>
<expr> -> <term> - <expr>                6 + <term> - <expr>
<expr> -> <term>                         6 + <term> - <term>
<term> -> <factor>                       6 + <factor> - <term>
<factor> -> -<factor>                    6 + -<factor> - <term>
<term> -> <factor>                       6 + -<factor> - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - <factor>
<factor> -> (<expr>)                     6 + -(<expr>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<expr>)
<expr> -> <term>                         6 + -(<term>) - (<term>)
<term> -> <factor>                       6 + -(<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(+<factor>) - (<term>)
<factor> -> +<factor>                    6 + -(++<factor>) - (<term>)
<term> -> <factor>                       6 + -(++<factor>) - (<factor>)
<factor> -> (<expr>)                     6 + -(++(<expr>)) - (<factor>)
<factor> -> <integer>                    6 + -(++(<expr>)) - (<integer>)
<expr> -> <term>                         6 + -(++(<term>)) - (<integer>)
<integer> -> <digit>                     6 + -(++(<term>)) - (<digit>)
<digit> -> 9                             6 + -(++(<term>)) - (9)
<term> -> <factor> * <term>              6 + -(++(<factor> * <term>)) - (9)
<term> -> <factor>                       6 + -(++(<factor> * <factor>)) - (9)
<factor> -> <integer>                    6 + -(++(<integer> * <factor>)) - (9)
<integer> -> <digit>                     6 + -(++(<digit> * <factor>)) - (9)
<digit> -> 2                             6 + -(++(2 * <factor>)) - (9)
<factor> -> +<factor>                    6 + -(++(2 * +<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +-<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +--<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +---<factor>)) - (9)
<factor> -> -<factor>                    6 + -(++(2 * +----<factor>)) - (9)
<factor> -> <integer>.<integer>          6 + -(++(2 * +----<integer>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<integer>)) - (9)
<integer> -> <digit>                     6 + -(++(2 * +----<digit>.<digit>)) - (9)
<digit> -> 1                             6 + -(++(2 * +----1.<digit>)) - (9)
<digit> -> 7                             6 + -(++(2 * +----1.7)) - (9)
'6 + -(++(2 * +----1.7)) - (9)'

By increasing the limit of nonterminals, we can quickly get much longer productions:

for i in range(10):
    print(simple_grammar_fuzzer(grammar=EXPR_GRAMMAR, max_nonterminals=5))
7 / +48.5
-5.9 / 9 - 4 * +-(-+++((1 + (+7 - (-1 * (++-+7.7 - -+-4.0))))) * +--4 - -(6) + 64)
8.2 - 27 - -9 / +((+9 * --2 + --+-+-((-1 * +(8 - 5 - 6)) * (-((-+(((+(4))))) - ++4) / +(-+---((5.6 - --(3 * -1.8 * +(6 * +-(((-(-6) * ---+6)) / +--(+-+-7 * (-0 * (+(((((2)) + 8 - 3 - ++9.0 + ---(--+7 / (1 / +++6.37) + (1) / 482) / +++-+0)))) * -+5 + 7.513)))) - (+1 / ++((-84)))))))) * ++5 / +-(--2 - -++-9.0)))) / 5 * --++090
1 - -3 * 7 - 28 / 9
(+9) * +-5 * ++-926.2 - (+9.03 / -+(-(-6) / 2 * +(-+--(8) / -(+1.0) - 5 + 4)) * 3.5)
8 + -(9.6 - 3 - -+-4 * +77)
-(((((++((((+((++++-((+-37))))))))))))) / ++(-(+++(+6)) * -++-(+(++(---6 * (((7)) * (1) / (-7.6 * 535338) + +256) * 0) * 0))) - 4 + +1
5.43
(9 / -405 / -23 - +-((+-(2 * (13))))) + +6 - +8 - 934
-++2 - (--+715769550) / 8 / (1)

Note that this fuzzer is rather inefficient due to the large number of search and replace operations. On the other hand, the implementation is straightforward and does the job in most cases. For this chapter, we'll stick to it; in the next chapter, we'll show how to build a more efficient one.

Visualizing Grammars as Railroad Diagrams

With grammars, we can easily specify the format for several of the examples we discussed earlier. The above arithmetic expressions, for instance, can be directly sent into bc (or any other program that takes arithmetic expressions). Before we introduce a few additional grammars, let us give a means to visualize them, giving an alternate view to aid their understanding.

Railroad diagrams, also called syntax diagrams, are a graphical representation of context-free grammars. They are read left to right, following possible "rail" tracks; the sequence of symbols encountered on the track defines the language.

We use RailroadDiagrams, an external library for visualization.

from RailroadDiagrams import NonTerminal, Terminal, Choice, HorizontalChoice, Sequence, Diagram, show_diagram
from IPython.display import SVG, HTML, display

We first define the method syntax_diagram_symbol() to visualize a given symbol. Nonterminal symbols are denoted as ovals, whereas terminal symbols (such as <term>) are denoted as rectangles.

def syntax_diagram_symbol(symbol):
    if is_nonterminal(symbol):
        return NonTerminal(symbol[1:-1])
    else:
        return Terminal(symbol)
SVG(show_diagram(syntax_diagram_symbol('<term>')))
term

We define syntax_diagram_expr() to visualize expansion alternatives.

def syntax_diagram_expr(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    symbols = [sym for sym in re.split(RE_NONTERMINAL, expansion) if sym != ""]
    if len(symbols) == 0:
        symbols = [""]  # special case: empty expansion

    return Sequence(*[syntax_diagram_symbol(sym) for sym in symbols])
SVG(show_diagram(syntax_diagram_expr(EXPR_GRAMMAR['<term>'][0])))
factor * term

This is the first alternative of <term> – a <factor> followed by * and a <term>.

Next, we define syntax_diagram_alt() for displaying alternate expressions.

from itertools import zip_longest
def syntax_diagram_alt(alt):
    max_len = 5
    l = len(alt)
    if l > max_len:
        iter_len = l // max_len
        alts = list(zip_longest(*[alt[i::iter_len] for i in range(iter_len)]))
        exprs = [[syntax_diagram_expr(expr) for expr in alt
                  if expr is not None] for alt in alts]
        choices = [Choice(len(expr)//2,*expr) for expr in exprs]
        return HorizontalChoice(*choices)
    else:
        return Choice(l//2,*[syntax_diagram_expr(expr) for expr in alt])
SVG(show_diagram(syntax_diagram_alt(EXPR_GRAMMAR['<digit>'])))
0 1 2 3 4 5 6 7 8 9

We see that a <digit> can be any single digit from 0 to 9.

Finally, we define syntax_diagram() which given a grammar, displays the syntax diagram of its rules.

def syntax_diagram(grammar):
    for key in grammar:
        print("%s" % key[1:-1])
        display(SVG(show_diagram(syntax_diagram_alt(grammar[key]))))
syntax_diagram(EXPR_GRAMMAR)
start
expr
expr
term + expr term - expr term
term
factor * term factor / term factor
factor
- factor + factor ( expr ) integer . integer integer
integer
digit integer digit
digit
0 1 2 3 4 5 6 7 8 9

This railroad representation will come in handy as it comes to visualizing the structure of grammars – especially for more complex grammars.

Some Grammars

Let us create (and visualize) some more grammars and use them for fuzzing.

A CGI Grammar

Here's a grammar for cgi_decode() introduced in the chapter on coverage.

CGI_GRAMMAR = {
    "<start>":
        ["<string>"],

    "<string>":
        ["<letter>", "<letter><string>"],

    "<letter>":
        ["<plus>", "<percent>", "<other>"],

    "<plus>":
        ["+"],

    "<percent>":
        ["%<hexdigit><hexdigit>"],

    "<hexdigit>":
        ["0", "1", "2", "3", "4", "5", "6", "7",
            "8", "9", "a", "b", "c", "d", "e", "f"],

    "<other>":  # Actually, could be _all_ letters
        ["0", "1", "2", "3", "4", "5", "a", "b", "c", "d", "e", "-", "_"],
}
syntax_diagram(CGI_GRAMMAR)
start
string
string
letter letter string
letter
plus percent other
plus
+
percent
% hexdigit hexdigit
hexdigit
0 1 2 3 4 5 6 7 8 9 a b c d e f
other
0 1 2 3 4 5 a b c d e - _

In contrast to basic fuzzing or mutation-based fuzzing, the grammar quickly produces all sorts of combinations:

for i in range(10):
    print(simple_grammar_fuzzer(grammar=CGI_GRAMMAR, max_nonterminals=10))
+%9a
+++%ce+
+_
+%c6c
++
+%cd+5
1%ee
%b9%d5
%96
%57d%42

A URL Grammar

The same properties we have seen for CGI input also hold for more complex inputs. Let us use a grammar to produce a large number of valid URLs:

URL_GRAMMAR = {
    "<start>":
        ["<url>"],
    "<url>":
        ["<scheme>://<authority><path><query>"],
    "<scheme>":
        ["http", "https", "ftp", "ftps"],
    "<authority>":
        ["<host>", "<host>:<port>", "<userinfo>@<host>", "<userinfo>@<host>:<port>"],
    "<host>":  # Just a few
        ["cispa.saarland", "www.google.com", "fuzzingbook.com"],
    "<port>":
        ["80", "8080", "<nat>"],
    "<nat>":
        ["<digit>", "<digit><digit>"],
    "<digit>":
        ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"],
    "<userinfo>":  # Just one
        ["user:password"],
    "<path>":  # Just a few
        ["", "/", "/<id>"],
    "<id>":  # Just a few
        ["abc", "def", "x<digit><digit>"],
    "<query>":
        ["", "?<params>"],
    "<params>":
        ["<param>", "<param>&<params>"],
    "<param>":  # Just a few
        ["<id>=<id>", "<id>=<nat>"],
}
syntax_diagram(URL_GRAMMAR)
start
url
url
scheme :// authority path query
scheme
https http ftp ftps
authority
host : port host userinfo @ host userinfo @ host : port
host
cispa.saarland www.google.com fuzzingbook.com
port
80 8080 nat
nat
digit digit digit
digit
0 1 2 3 4 5 6 7 8 9
userinfo
user:password
path
/ / id
id
abc def x digit digit
query
? params
params
param param & params
param
id = id id = nat

Again, within milliseconds, we can produce plenty of valid inputs.

for i in range(10):
    print(simple_grammar_fuzzer(grammar=URL_GRAMMAR, max_nonterminals=10))
https://user:password@cispa.saarland:80/
http://fuzzingbook.com?def=56&x89=3&x46=48&def=def
ftp://cispa.saarland/?x71=5&x35=90&def=abc
https://cispa.saarland:80/def?def=7&x23=abc
https://fuzzingbook.com:80/
https://fuzzingbook.com:80/abc?def=abc&abc=x14&def=abc&abc=2&def=38
ftps://fuzzingbook.com/x87
https://user:password@fuzzingbook.com:6?def=54&x44=abc
http://fuzzingbook.com:80?x33=25&def=8
http://fuzzingbook.com:8080/def

A Natural Language Grammar

Finally, grammars are not limited to formal languages such as computer inputs, but can also be used to produce natural language. This is the grammar we used to pick a title for this book:

TITLE_GRAMMAR = {
    "<start>": ["<title>"],
    "<title>": ["<topic>: <subtopic>"],
    "<topic>": ["Generating Software Tests", "<fuzzing-prefix>Fuzzing", "The Fuzzing Book"],
    "<fuzzing-prefix>": ["", "The Art of ", "The Joy of "],
    "<subtopic>": ["<subtopic-main>",
                   "<subtopic-prefix><subtopic-main>",
                   "<subtopic-main><subtopic-suffix>"],
    "<subtopic-main>": ["Breaking Software",
                        "Generating Software Tests",
                        "Principles, Techniques and Tools"],
    "<subtopic-prefix>": ["", "Tools and Techniques for "],
    "<subtopic-suffix>": [" for <reader-property> and <reader-property>",
                          " for <software-property> and <software-property>"],
    "<reader-property>": ["Fun", "Profit"],
    "<software-property>": ["Robustness", "Reliability", "Security"],
}
syntax_diagram(TITLE_GRAMMAR)
start
title
title
topic : subtopic
topic
Generating Software Tests fuzzing-prefix Fuzzing The Fuzzing Book
fuzzing-prefix
The Art of The Joy of
subtopic
subtopic-main subtopic-prefix subtopic-main subtopic-main subtopic-suffix
subtopic-main
Breaking Software Generating Software Tests Principles, Techniques and Tools
subtopic-prefix
Tools and Techniques for
subtopic-suffix
for reader-property and reader-property for software-property and software-property
reader-property
Fun Profit
software-property
Robustness Reliability Security
titles = set()
while len(titles) < 10:
    titles.add(simple_grammar_fuzzer(
        grammar=TITLE_GRAMMAR, max_nonterminals=10))
titles
{'Fuzzing: Generating Software Tests',
 'Fuzzing: Principles, Techniques and Tools',
 'Generating Software Tests: Breaking Software',
 'Generating Software Tests: Breaking Software for Robustness and Robustness',
 'Generating Software Tests: Principles, Techniques and Tools',
 'Generating Software Tests: Principles, Techniques and Tools for Profit and Fun',
 'Generating Software Tests: Tools and Techniques for Principles, Techniques and Tools',
 'The Fuzzing Book: Breaking Software',
 'The Fuzzing Book: Generating Software Tests for Profit and Profit',
 'The Fuzzing Book: Generating Software Tests for Robustness and Robustness'}

(If you find that there is redundancy ("Robustness and Robustness") in here: In our chapter on coverage-based fuzzing, we will show how to cover each expansion only once. And if you like some alternatives more than others, probabilistic grammar fuzzing will be there for you.)

Grammars as Mutation Seeds

One very useful property of grammars is that they produce mostly valid inputs. From a syntactical standpoint, the inputs are actually always valid, as they satisfy the constraints of the given grammar. (Of course, one needs a valid grammar in the first place.) However, there are also semantical properties that cannot be easily expressed in a grammar. If, say, for a URL, the port range is supposed to be between 1024 and 2048, this is hard to write in a grammar. If one has to satisfy more complex constraints, one quickly reaches the limits of what a grammar can express.

One way around this is to attach constraints to grammars, as we will discuss later in this book. Another possibility is to put together the strengths of grammar-based fuzzing and mutation-based fuzzing. The idea is to use the grammar-generated inputs as seeds for further mutation-based fuzzing. This way, we can explore not only valid inputs, but also check out the boundaries between valid and invalid inputs. This is particularly interesting as slightly invalid inputs allow to find parser errors (which are often abundant). As with fuzzing in general, it is the unexpected which reveals errors in programs.

To use our generated inputs as seeds, we can feed them directly into the mutation fuzzers introduced earlier:

from MutationFuzzer import MutationFuzzer
number_of_seeds = 10
seeds = [
    simple_grammar_fuzzer(
        grammar=URL_GRAMMAR,
        max_nonterminals=10) for i in range(number_of_seeds)]
seeds
['ftps://user:password@www.google.com:80',
 'http://cispa.saarland/',
 'ftp://www.google.com:42/',
 'ftps://user:password@fuzzingbook.com:39?abc=abc',
 'https://www.google.com?x33=1&x06=1',
 'http://www.google.com:02/',
 'https://user:password@www.google.com/',
 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5',
 'http://www.google.com:80/def?def=abc',
 'http://user:password@cispa.saarland/']
m = MutationFuzzer(seeds)
[m.fuzz() for i in range(20)]
['ftps://user:password@www.google.com:80',
 'http://cispa.saarland/',
 'ftp://www.google.com:42/',
 'ftps://user:password@fuzzingbook.com:39?abc=abc',
 'https://www.google.com?x33=1&x06=1',
 'http://www.google.com:02/',
 'https://user:password@www.google.com/',
 'ftp://cispa.saarland:8080/?abc=abc&def=def&abc=5',
 'http://www.google.com:80/def?def=abc',
 'http://user:password@cispa.saarland/',
 'Eh4tp:www.coogle.com:80/def?d%f=abc',
 'ftps://}ser:passwod@fuzzingbook.com:9?abc=abc',
 'uftp//cispa.sRaarland:808&0?abc=abc&def=defabc=5',
 'http://user:paswor9d@cispar.saarland/v',
 'ftp://Www.g\x7fogle.cAom:42/',
 'hht://userC:qassMword@cispy.csaarland/',
 'httx://ww.googlecom:80defde`f=ac',
 'htt://cispq.waarlnd/',
 'htFtp\t://cmspa./saarna(md/',
 'ft:/www.google.com:42\x0f']

While the first 10 fuzz() calls return the seeded inputs (as designed), the later ones again create arbitrary mutations. Using MutationCoverageFuzzer instead of MutationFuzzer, we could again have our search guided by coverage – and thus bring together the best of multiple worlds.

A Grammar Toolbox

Let us now introduce a few techniques that help us writing grammars.

Escapes

With < and > delimiting nonterminals in our grammars, how can we actually express that some input should contain < and >? The answer is simple: Just introduce a symbol for them.

nonterminal_grammar = {
    "<start>": ["<nonterminal>"],
    "<nonterminal>": ["<left-angle><identifier><right-angle>"],
    "<left-angle>": ["<"],
    "<right-angle>": [">"]
}

In nonterminal_grammar, neither the expansion for <left-angle> nor the expansion for <right-angle> can be mistaken as a nonterminal. Hence, we can produce as many as we want.

Character Classes

In the above nonterminal_grammar, we have left out the definition for <identifier>. That is because enumerating all letters or digits in a grammar manually, as in <letter> ::= 'a' | 'b' | 'c' ... is a bit painful.

However, remember that grammars are part of a program, and can thus also be constructed programmatically. In Python, the constant string.ascii_letters contains all letters in the ASCII character set:

import string
string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

From this, we can construct a list of all letters:

def srange(characters):
    """Construct a list with all characters in the string }`characters`"""
    return [c for c in characters]
srange(string.ascii_letters)[:10]
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

We can now use this in our grammar to define identifiers:

nonterminal_grammar["<identifier>"] = ["<idchar>", "<idchar><identifier>"]
nonterminal_grammar["<idchar>"] = srange(
    string.ascii_letters) + srange(string.digits) + srange("-_")
[simple_grammar_fuzzer(nonterminal_grammar, "<identifier>") for i in range(10)]
['b', 'd', 'X3b', 'n0', '7', 'v', 'z', 'X', 'g', 'Pfed']

The shortcut crange(start, end) returns a list of all characters in the ASCII range of start to (including) end:

def crange(character_start, character_end):
    return [chr(i)
            for i in range(ord(character_start), ord(character_end) + 1)]

We can use this to express ranges of characters:

crange('0', '9')
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
assert crange('a', 'z') == srange(string.ascii_lowercase)

Grammar Shortcuts

In the above nonterminal_grammar, as in other grammars, we have to express repetitions of characters using recursion, that is, by referring to the original definition:

nonterminal_grammar["<identifier>"]
['<idchar>', '<idchar><identifier>']

It could be a bit easier if we simply could state that a nonterminal should be a non-empty sequence of letters – for instance, as in

<identifier> = <idchar>+

where + denotes a non-empty repetition of the symbol it follows.

Operators such as + are frequently introduced as handy shortcuts in grammars. Formally, our grammars come in the so-called Backus-Naur form, or BNF for short. Operators extend BNF to so-called _extended BNF, or EBNF* for short:

  • The form <symbol>? indicates that <symbol> is optional – that is, it can occur 0 or 1 times.
  • The form <symbol>+ indicates that <symbol> can occur 1 or more times repeatedly.
  • The form <symbol>* indicates that <symbol> can occur 0 or more times. (In other words, it is an optional repetition.)

To make matters even more interesting, we would like to use parentheses with the above shortcuts. Thus, (<foo><bar>)? indicates that the sequence of <foo> and <bar> is optional.

Using such operators, we can define the identifier rule in a simpler way. To this end, let us create a copy of the original grammar and modify the <identifier> rule:

from copy import deepcopy
nonterminal_ebnf_grammar = deepcopy(nonterminal_grammar)
nonterminal_ebnf_grammar["<identifier>"] = "<idchar>+"

Likewise, we can simplify the expression grammar. Consider how signs are optional, and how integers can be expressed as sequences of digits.

EXPR_EBNF_GRAMMAR = {
    "<start>":
        ["<expr>"],

    "<expr>":
        ["<term> + <expr>", "<term> - <expr>", "<term>"],

    "<term>":
        ["<factor> * <term>", "<factor> / <term>", "<factor>"],

    "<factor>":
        ["<sign>?<factor>", "(<expr>)", "<integer>(.<integer>)?"],

    "<sign>":
        ["+", "-"],

    "<integer>":
        ["<digit>+"],

    "<digit>":
        srange(string.digits)
}

Our aim is to convert EBNF grammars such as the ones above into a regular BNF grammar. This is done by four rules:

  1. An expression (content)op, where op is one of ?, +, *, becomes <new-symbol>op, with a new rule <new-symbol> ::= content.
  2. An expression <symbol>? becomes <new-symbol>, where <new-symbol> ::= <empty> | <symbol>.
  3. An expression <symbol>+ becomes <new-symbol>, where <new-symbol> ::= <symbol> | <symbol><new-symbol>.
  4. An expression <symbol>* becomes <new-symbol>, where <new-symbol> ::= <empty> | <symbol><new-symbol>.

Here, <empty> expands to the empty string, as in <empty> ::=. (This is also called an epsilon expansion.)

If these operators remind you of regular expressions, this is not by accident: Actually, any basic regular expression can be converted into a grammar using the above rules (and character classes with crange(), as defined above).

Applying these rules on the examples above yields the following results:

  • <idchar>+ becomes <idchar><new-symbol> with <new-symbol> ::= <idchar> | <idchar><new-symbol>.
  • <integer>(.<integer>)? becomes <integer><new-symbol> with <new-symbol> ::= <empty> | .<integer>.

Let us implement these rules in three steps.

Creating New Symbols

First, we need a mechanism to create new symbols. This is fairly straightforward.

def new_symbol(grammar, symbol_name="<symbol>"):
    """Return a new symbol for `grammar` based on `symbol_name`"""
    if symbol_name not in grammar:
        return symbol_name

    count = 1
    while True:
        tentative_symbol_name = symbol_name[:-1] + "-" + repr(count) + ">"
        if tentative_symbol_name not in grammar:
            return tentative_symbol_name
        count += 1
assert new_symbol(EXPR_EBNF_GRAMMAR, '<expr>') == '<expr-1>'

Expanding Parenthesized Expressions

Next, we need a means to extract parenthesized expressions from our expansions and expand them according to the rules above. Let's start with extracting expressions:

RE_PARENTHESIZED_EXPR = re.compile(r'\([^()]*\)[?+*]')
def parenthesized_expressions(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    return re.findall(RE_PARENTHESIZED_EXPR, expansion)
assert parenthesized_expressions("(<foo>)* (<foo><bar>)+ (+<foo>)? <integer>(.<integer>)?") == [
    '(<foo>)*', '(<foo><bar>)+', '(+<foo>)?', '(.<integer>)?']

We can now use these to apply rule number 1, above, introducing new symbols for expressions in parentheses.

def convert_ebnf_parentheses(ebnf_grammar):
    """Convert a grammar in extended BNF to BNF"""
    grammar = deepcopy(ebnf_grammar)
    for nonterminal in ebnf_grammar:
        expansions = ebnf_grammar[nonterminal]

        for i in range(len(expansions)):
            expansion = expansions[i]

            while True:
                parenthesized_exprs = parenthesized_expressions(expansion)
                if len(parenthesized_exprs) == 0:
                    break

                for expr in parenthesized_exprs:
                    operator = expr[-1:]
                    contents = expr[1:-2]

                    new_sym = new_symbol(grammar)
                    expansion = grammar[nonterminal][i].replace(
                        expr, new_sym + operator, 1)
                    grammar[nonterminal][i] = expansion
                    grammar[new_sym] = [contents]

    return grammar

This does the conversion as sketched above:

convert_ebnf_parentheses({"<number>": ["<integer>(.<integer>)?"]})
{'<number>': ['<integer><symbol>?'], '<symbol>': ['.<integer>']}

This even works for nested parenthesized expressions:

convert_ebnf_parentheses({"<foo>": ["((<foo>)?)+"]})
{'<foo>': ['<symbol-1>+'], '<symbol>': ['<foo>'], '<symbol-1>': ['<symbol>?']}

Expanding Operators

After expanding parenthesized expressions, we now need to take care of symbols followed by operators (?, *, +). As with convert_ebnf_parentheses(), above, we first extract all symbols followed by an operator.

RE_EXTENDED_NONTERMINAL = re.compile(r'(<[^<> ]*>[?+*])')
def extended_nonterminals(expansion):
    # In later chapters, we allow expansions to be tuples,
    # with the expansion being the first element
    if isinstance(expansion, tuple):
        expansion = expansion[0]

    return re.findall(RE_EXTENDED_NONTERMINAL, expansion)
assert extended_nonterminals(
    "<foo>* <bar>+ <elem>? <none>") == ['<foo>*', '<bar>+', '<elem>?']

Our converter extracts the symbol and the operator, and adds new symbols according to the rules laid out above.

def convert_ebnf_operators(ebnf_grammar):
    """Convert a grammar in extended BNF to BNF"""
    grammar = deepcopy(ebnf_grammar)
    for nonterminal in ebnf_grammar:
        expansions = ebnf_grammar[nonterminal]

        for i in range(len(expansions)):
            expansion = expansions[i]
            extended_symbols = extended_nonterminals(expansion)

            for extended_symbol in extended_symbols:
                operator = extended_symbol[-1:]
                original_symbol = extended_symbol[:-1]

                new_sym = new_symbol(grammar, original_symbol)
                grammar[nonterminal][i] = grammar[nonterminal][i].replace(
                    extended_symbol, new_sym, 1)

                if operator == '?':
                    grammar[new_sym] = ["", original_symbol]
                elif operator == '*':
                    grammar[new_sym] = ["", original_symbol + new_sym]
                elif operator == '+':
                    grammar[new_sym] = [
                        original_symbol, original_symbol + new_sym]

    return grammar
convert_ebnf_operators({"<integer>": ["<digit>+"]})
{'<integer>': ['<digit>'], '<digit>': ['<digit>', '<digit><digit>']}

All Together

We can combine the two, first extending parentheses and then operators:

def convert_ebnf_grammar(ebnf_grammar):
    return convert_ebnf_operators(convert_ebnf_parentheses(ebnf_grammar))
convert_ebnf_grammar({"<authority>": ["(<userinfo>@)?<host>(:<port>)?"]})
{'<authority>': ['<symbol-2><host><symbol-1-1>'],
 '<symbol>': ['<userinfo>@'],
 '<symbol-1>': [':<port>'],
 '<symbol-2>': ['', '<symbol>'],
 '<symbol-1-1>': ['', '<symbol-1>']}
expr_grammar = convert_ebnf_grammar(EXPR_EBNF_GRAMMAR)
expr_grammar
{'<start>': ['<expr>'],
 '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],
 '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],
 '<factor>': ['<sign-1><factor>', '(<expr>)', '<integer><symbol-1>'],
 '<sign>': ['+', '-'],
 '<integer>': ['<digit-1>'],
 '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'],
 '<symbol>': ['.<integer>'],
 '<sign-1>': ['', '<sign>'],
 '<symbol-1>': ['', '<symbol>'],
 '<digit-1>': ['<digit>', '<digit><digit-1>']}

Success! We have nicely converted the EBNF grammar into BNF.

With character classes and EBNF grammar conversion, we have two powerful tools that make the writing of grammars easier. We will use these again and again as it comes to working with grammars.

Grammar Extensions

During the course of this book, we frequently want to specify additional information for grammars, such as probabilities or constraints. To support these extensions, as well as possibly others, we define an annotation mechanism.

Our concept for annotating grammars is to add annotations to individual expansions. To this end, we allow that an expansion cannot only be a string, but also a pair of a string and a set of attributes, as in

"<expr>":
        [("<term> + <expr>", opts(min_depth=10)),
         ("<term> - <expr>", opts(max_depth=2)),
         "<term>"]

Here, the opts() function would allow us to express annotations that apply to the individual expansions; in this case, the addition would be annotated with a min_depth value of 10, and the subtraction with a max_depth value of 2. The meaning of these annotations is left to the individual algorithms dealing with the grammars; the general idea, though, is that they can be ignored.

Our opts() helper function returns a mapping of its arguments to values:

def opts(**kwargs):
    return kwargs
opts(min_depth=10)
{'min_depth': 10}

To deal with both expansion strings and pairs of expansions and annotations, we access the expansion string and the associated annotations via designated helper functions, exp_string() and exp_opts():

def exp_string(expansion):
    """Return the string to be expanded"""
    if isinstance(expansion, str):
        return expansion
    return expansion[0]
exp_string(("<term> + <expr>", opts(min_depth=10)))
'<term> + <expr>'
def exp_opts(expansion):
    """Return the options of an expansion.  If options are not defined, return {}"""
    if isinstance(expansion, str):
        return {}
    return expansion[1]

def exp_opt(expansion, attribute):
    """Return the given attribution of an expansion.  
    If attribute is not defined, return None"""
    return exp_opts(expansion).get(attribute, None)
exp_opts(("<term> + <expr>", opts(min_depth=10)))
{'min_depth': 10}
exp_opt(("<term> - <expr>", opts(max_depth=2)), 'max_depth')
2

Finally, we define a helper function that sets a particular option:

def set_opts(grammar, symbol, expansion, opts=None):
    """Set the options of the given expansion of grammar[symbol] to opts"""
    expansions = grammar[symbol]
    for i, exp in enumerate(expansions):
        if exp_string(exp) != exp_string(expansion):
            continue

        new_opts = exp_opts(exp)
        if opts is None or new_opts == {}:
            new_opts = opts
        else:
            for key in opts:
                new_opts[key] = opts[key]
        if new_opts == {}:
            grammar[symbol][i] = exp_string(exp)
        else:
            grammar[symbol][i] = (exp_string(exp), new_opts)
        return

    raise KeyError("no expansion " + repr(symbol) + " -> " + repr(exp_string(expansion)))

Checking Grammars

Since grammars are represented as strings, it is fairly easy to introduce errors. So let us introduce a helper function that checks a grammar for consistency.

The helper function is_valid_grammar() iterates over a grammar to check whether all used symbols are defined, and vice versa, which is very useful for debugging; it also checks whether all symbols are reachable from the start symbol. You don't have to delve into details here, but as always, it is important to get the input data straight before we make use of it.

import sys
def def_used_nonterminals(grammar, start_symbol=START_SYMBOL):
    defined_nonterminals = set()
    used_nonterminals = {start_symbol}

    for defined_nonterminal in grammar:
        defined_nonterminals.add(defined_nonterminal)
        expansions = grammar[defined_nonterminal]
        if not isinstance(expansions, list):
            print(repr(defined_nonterminal) + ": expansion is not a list",
                  file=sys.stderr)
            return None, None

        if len(expansions) == 0:
            print(repr(defined_nonterminal) + ": expansion list empty",
                  file=sys.stderr)
            return None, None

        for expansion in expansions:
            if isinstance(expansion, tuple):
                expansion = expansion[0]
            if not isinstance(expansion, str):
                print(repr(defined_nonterminal) + ": "
                      + repr(expansion) + ": not a string",
                      file=sys.stderr)
                return None, None

            for used_nonterminal in nonterminals(expansion):
                used_nonterminals.add(used_nonterminal)

    return defined_nonterminals, used_nonterminals
def reachable_nonterminals(grammar, start_symbol=START_SYMBOL):
    reachable = set()

    def _find_reachable_nonterminals(grammar, symbol):
        nonlocal reachable
        reachable.add(symbol)
        for expansion in grammar.get(symbol, []):
            for nonterminal in nonterminals(expansion):
                if nonterminal not in reachable:
                    _find_reachable_nonterminals(grammar, nonterminal)

    _find_reachable_nonterminals(grammar, start_symbol)
    return reachable

def unreachable_nonterminals(grammar, start_symbol=START_SYMBOL):
    return grammar.keys() - reachable_nonterminals(grammar, start_symbol)
def opts_used(grammar):
    used_opts = set()
    for symbol in grammar:
        for expansion in grammar[symbol]:
            used_opts |= set(exp_opts(expansion).keys())
    return used_opts
def is_valid_grammar(grammar, start_symbol=START_SYMBOL, supported_opts=None):
    defined_nonterminals, used_nonterminals = \
        def_used_nonterminals(grammar, start_symbol)
    if defined_nonterminals is None or used_nonterminals is None:
        return False

    # Do not complain about '<start>' being not used,
    # even if start_symbol is different
    if START_SYMBOL in grammar:
        used_nonterminals.add(START_SYMBOL)

    for unused_nonterminal in defined_nonterminals - used_nonterminals:
        print(repr(unused_nonterminal) + ": defined, but not used",
              file=sys.stderr)
    for undefined_nonterminal in used_nonterminals - defined_nonterminals:
        print(repr(undefined_nonterminal) + ": used, but not defined",
              file=sys.stderr)

    # Symbols must be reachable either from <start> or given start symbol
    unreachable = unreachable_nonterminals(grammar, start_symbol)
    msg_start_symbol = start_symbol
    if START_SYMBOL in grammar:
        unreachable = unreachable - reachable_nonterminals(grammar, START_SYMBOL)
        if start_symbol != START_SYMBOL:
            msg_start_symbol += " or " + START_SYMBOL
    for unreachable_nonterminal in unreachable:
        print(repr(unreachable_nonterminal) + ": unreachable from " + msg_start_symbol,
              file=sys.stderr)

    used_but_not_supported_opts = set()
    if supported_opts is not None:
        used_but_not_supported_opts = opts_used(grammar).difference(supported_opts)
        for opt in used_but_not_supported_opts:
            print("warning: option " + repr(opt) + " is not supported", file=sys.stderr)

    return used_nonterminals == defined_nonterminals and len(unreachable) == 0

Our grammars defined above pass the test:

assert is_valid_grammar(EXPR_GRAMMAR)
assert is_valid_grammar(CGI_GRAMMAR)
assert is_valid_grammar(URL_GRAMMAR)

The check can also be applied to EBNF grammars:

assert is_valid_grammar(EXPR_EBNF_GRAMMAR)

These ones do not pass the test, though:

assert not is_valid_grammar({"<start>": ["<x>"], "<y>": ["1"]})
'<y>': defined, but not used
'<x>': used, but not defined
'<y>': unreachable from <start>
assert not is_valid_grammar({"<start>": "123"})
'<start>': expansion is not a list
assert not is_valid_grammar({"<start>": []})
'<start>': expansion list empty
assert not is_valid_grammar({"<start>": [1, 2, 3]})
'<start>': 1: not a string

From here on, we will always use is_valid_grammar() when defining a grammar.

Lessons Learned

  • Grammars are powerful tools to express and produce syntactically valid inputs.
  • Inputs produced from grammars can be used as is, or used as seeds for mutation-based fuzzing.
  • Grammars can be extended with character classes and operators to make writing easier.

Next Steps

As they make a great foundation for generating software tests, we use grammars again and again in this book. As a sneak preview, we can use grammars to fuzz configurations:

<options> ::= <option>*
<option> ::= -h | --version | -v | -d | -i | --global-config <filename>

We can use grammars for fuzzing functions and APIs and fuzzing graphical user interfaces:

<call-sequence> ::= <call>*
<call> ::= urlparse(<url>) | urlsplit(<url>)

We can assign probabilities and constraints to individual expansions:

<term>: 50% <factor> * <term> |  30% <factor> / <term> | 20% <factor>
<integer>: <digit>+ { <integer> >= 100 }

All these extras become especially valuable as we can

  1. infer grammars automatically, dropping the need to specify them manually, and
  2. guide them towards specific goals such as coverage or critical functions;

which we also discuss for all techniques in this book.

To get there, however, we still have bit of homework to do. In particular, we first have to learn how to

Background

As one of the foundations of human language, grammars have been around as long as human language existed. The first formalization of generative grammars was by Dakṣiputra Pāṇini in 350 BC [Dakṣiputra Pāṇini, 350 BCE.]. As a general means to express formal languages for both data and programs, their role in computer science cannot be overstated. The seminal work by Chomsky [Chomsky et al, 1956.] introduced the central models of regular languages, context-free grammars, context-sensitive grammars, and universal grammars as they are used (and taught) in computer science as a means to specify input and programming languages aver since.

The use of grammars for producing test inputs goes back to Burkhardt [Burkhardt et al, 1967.], to be later rediscovered and applied by Hanford [Hanford et al, 1970.] and Purdom [Purdom et al, 1972.]. The most important use of grammar testing since then has been compiler testing. Actually, grammar-based testing is one important reason why compilers and Web browsers work as they should:

  • The CSmith tool [Yang et al, 2011.] specifically targets C programs, starting with a C grammar and then applying additional steps, such as referring to variables and functions defined earlier or ensuring integer and type safety. Their authors have used it "to find and report more than 400 previously unknown compiler bugs."

  • The LangFuzz work [Holler et al, 2012.], which shares two authors with this book, uses a generic grammar to produce outputs, and is used day and night to generate JavaScript programs and test their interpreters; as of today, it has found more than 2,600 bugs in browsers such as Mozilla Firefox, Google Chrome, and Microsoft Edge.

  • The EMI Project [Le et al, 2014.] uses grammars to stress-test C compilers, transforming known tests into alternative programs that should be semantically equivalent over all inputs. Again, this has led to more than 100 bugs in C compilers being fixed.

  • Grammarinator [Hodován et al, 2018.] is an open-source grammar fuzzer (written in Python!), using the popular ANTLR format as grammar specification. Like LangFuzz, it uses the grammar for both parsing and producing, and has found more than 100 issues in the JerryScript lightweight JavaScript engine and an associated platform.

  • Domato is a generic grammar generation engine that is specifically used for fuzzing DOM input. It has revealed a number of security issues in popular Web browsers.

Compilers and Web browsers, of course, are not only domains where grammars are needed for testing, but also domains where grammars are well-known. Our claim in this book is that grammars can be used to generate almost any input, and our aim is to empower you to do precisely that.

Exercises

Exercise 1: A JSON Grammar

Take a look at the JSON specification and derive a grammar from it:

  • Use character classes to express valid characters
  • Use EBNF to express repetitions and optional parts
  • Assume that
    • a string is a sequence of digits, ASCII letters, punctuation and space characters without quotes or escapes
    • whitespace is just a single space.
  • Use is_valid_grammar() to ensure the grammar is valid.

Feed the grammar into simple_grammar_fuzzer(). Do you encounter any errors, and why?

Exercise 2: Finding Bugs

The name simple_grammar_fuzzer() does not come by accident: The way it expands grammars is limited in several ways. What happens if you apply simple_grammar_fuzzer() on nonterminal_grammar and expr_grammar, as defined above, and why?

Exercise 3: Grammars with Regular Expressions

In a grammar extended with regular expressions, we can use the special form

/regex/

to include regular expressions in expansions. For instance, we can have a rule

<integer> ::= /[+-]?[0-9]+/

to quickly express that an integer is an optional sign, followed by a sequence of digits.

Part 1: Convert regular expressions

Write a converter convert_regex(r) that takes a regular expression r and creates an equivalent grammar. Support the following regular expression constructs:

  • *, +, ?, () should work just in EBNFs, above.
  • a|b should translate into a list of alternatives [a, b].
  • . should match any character except newline.
  • [abc] should translate into srange("abc")
  • [^abc] should translate into the set of ASCII characters except srange("abc").
  • [a-b] should translate into crange(a, b)
  • [^a-b] should translate into the set of ASCII characters except crange(a, b).

Example: convert_regex(r"[0-9]+") should yield a grammar such as

{
    "<start>": ["<s1>"],
    "<s1>": [ "<s2>", "<s1><s2>" ],
    "<s2>": crange('0', '9')
}

Part 2: Identify and expand regular expressions

Write a converter convert_regex_grammar(g) that takes a EBNF grammar g containing regular expressions in the form /.../ and creates an equivalent BNF grammar. Support the regular expression constructs as above.

Example: convert_regex_grammar({ "<integer>" : "/[+-]?[0-9]+/" }) should yield a grammar such as

{
    "<integer>": ["<s1><s3>"],
    "<s1>": [ "", "<s2>" ],
    "<s2>": srange("+-"),
    "<s3>": [ "<s4>", "<s4><s3>" ],
    "<s4>": crange('0', '9')
}

Optional: Support escapes in regular expressions: \c translates to the literal character c; \/ translates to / (and thus does not end the regular expression); \\ translates to \.

Exercise 4: Defining Grammars as Functions (Advanced)

To obtain a nicer syntax for specifying grammars, one can make use of Python constructs which then will be parsed by an additional function. For instance, we can imagine a grammar definition which uses | as a means to separate alternatives:

def expression_grammar_fn():
    start = "<expr>"
    expr = "<term> + <expr>" | "<term> - <expr>"
    term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"
    factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"
    integer = "<digit><integer>" | "<digit>"
    digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

If we execute expression_grammar_fn(), this will yield an error. Yet, the purpose of expression_grammar_fn() is not to be executed, but to be used as data from which the grammar will be constructed.

with ExpectError():
    expression_grammar_fn()
Traceback (most recent call last):
  File "<ipython-input-99-612cec5468d3>", line 2, in <module>
    expression_grammar_fn()
  File "<ipython-input-98-f21ab929e5ee>", line 3, in expression_grammar_fn
    expr = "<term> + <expr>" | "<term> - <expr>"
TypeError: unsupported operand type(s) for |: 'str' and 'str' (expected)

To this end, we make use of the ast (abstract syntax tree) and inspect (code inspection) modules.

import ast
import inspect

First, we obtain the source code of expression_grammar_fn()...

source = inspect.getsource(expression_grammar_fn)
source
'def expression_grammar_fn():\n    start = "<expr>"\n    expr = "<term> + <expr>" | "<term> - <expr>"\n    term = "<factor> * <term>" | "<factor> / <term>" | "<factor>"\n    factor = "+<factor>" | "-<factor>" | "(<expr>)" | "<integer>.<integer>" | "<integer>"\n    integer = "<digit><integer>" | "<digit>"\n    digit = \'0\' | \'1\' | \'2\' | \'3\' | \'4\' | \'5\' | \'6\' | \'7\' | \'8\' | \'9\'\n'

... which we then parse into an abstract syntax tree:

tree = ast.parse(source)

We can now parse the tree to find operators and alternatives. get_alternatives() iterates over all nodes op of the tree; If the node looks like a binary or (| ) operation, we drill deeper and recurse. If not, we have reached a single production, and we try to get the expression from the production. We define the to_expr parameter depending on how we want to represent the production. In this case, we represent a single production by a single string.

def get_alternatives(op, to_expr=lambda o: o.s):
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.BitOr):
        return get_alternatives(op.left, to_expr) + [to_expr(op.right)]
    return [to_expr(op)]

funct_parser() takes the abstract syntax tree of a function (say, expression_grammar_fn()) and iterates over all assignments:

def funct_parser(tree, to_expr=lambda o: o.s):
    return {assign.targets[0].id: get_alternatives(assign.value, to_expr)
            for assign in tree.body[0].body}

The result is a grammar in our regular format:

grammar = funct_parser(tree)
for symbol in grammar:
    print(symbol, "::=", grammar[symbol])
start ::= ['<expr>']
expr ::= ['<term> + <expr>', '<term> - <expr>']
term ::= ['<factor> * <term>', '<factor> / <term>', '<factor>']
factor ::= ['+<factor>', '-<factor>', '(<expr>)', '<integer>.<integer>', '<integer>']
integer ::= ['<digit><integer>', '<digit>']
digit ::= ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Part 1 (a): One Single Function

Write a single function define_grammar(fn) that takes a grammar defined as function (such as expression_grammar_fn()) and returns a regular grammar.

Part 1 (b): Alternative representations

We note that the grammar representation we designed previously does not allow simple generation of alternatives such as srange() and crange(). Further, one may find the string representation of expressions limiting. It turns out that it is simple to extend our grammar definition to support grammars such as below:

def define_name(o):
    return o.id if isinstance(o, ast.Name) else o.s
def define_expr(op):
    if isinstance(op, ast.BinOp) and isinstance(op.op, ast.Add):
        return (*define_expr(op.left), define_name(op.right))
    return (define_name(op),)
def define_ex_grammar(fn):
    return define_grammar(fn, define_expr)

The grammar:

@define_ex_grammar
def expression_grammar():
    start   = expr
    expr    = (term + '+' + expr
            |  term + '-' + expr)
    term    = (factor + '*' + term
            |  factor + '/' + term
            |  factor)
    factor  = ('+' + factor
            |  '-' + factor
            |  '(' + expr + ')'
            |  integer + '.' + integer
            |  integer)
    integer = (digit + integer
            |  digit)
    digit   = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

for symbol in expression_grammar:
    print(symbol, "::=", expression_grammar[symbol])

Note. The grammar data structure thus obtained is a little more detailed than the standard data structure. It represents each production as a tuple.

We note that we have not enabled srange() or crange() in the above grammar. How would you go about adding these? (Hint: wrap define_expr() to look for ast.Call)

Part 2: Extended Grammars

Introduce an operator * that takes a pair (min, max) where min and max are the minimum and maximum number of repetitions, respectively. A missing value min stands for zero; a missing value max for infinity.

def identifier_grammar_fn():
    identifier = idchar * (1,)

With the * operator, we can generalize the EBNF operators – ? becomes (0,1), * becomes (0,), and + becomes (1,). Write a converter that takes an extended grammar defined using *, parse it, and convert it into BNF.

Creative Commons License The content of this project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The source code that is part of the content, as well as the source code used to format and display that content is licensed under the MIT License. Last change: 2018-12-11 17:26:09+01:00CiteImprint