Fuzzing with Constraints

In previous chapters, we have seen how Grammar-Based Fuzzing allows us to efficiently generate myriads of syntactically valid inputs. However, there are semantic input features that cannot be expressed in a context-free grammar, such as

  • "$X$ is the length of $Y$";
  • "$X$ is an identifier previously declared"; or
  • "$X$ should be longer than 4,096 bytes".

In this chapter, we show how the ISLa framework allows us to express such features as constraints added to a grammar. By having ISLa solve these constraints automatically, we produce inputs that are not only syntactically valid, but actually semantically valid. Furthermore, such constraints allow us to very precisely shape the inputs we want for testing.

from bookutils import YouTubeVideo
YouTubeVideo("dgaGuwn-1OU")

Prerequisites

Synopsis

To use the code provided in this chapter, write

>>> from fuzzingbook.FuzzingWithConstraints import <identifier>

and then make use of the following features.

This chapter introduces the ISLa framework, consisting of

  • the ISLa specification language, allowing to add constraints to a grammar
  • the ISLa solver, solving these constraints to produce semantically (and syntactically) valid inputs
  • the ISLa checker, checking given inputs for whether they satisfy these constraints.

A typical usage of the ISLa solver is as follows. First, install ISLa, using

$ pip install isla-solver

Then, you can import the solver as

>>> from isla.solver import ISLaSolver

The ISLa solver needs two things. First, a grammar - say, US phone numbers.

>>> from Grammars import US_PHONE_GRAMMAR

Second, you need constraints – a string expressing a condition over one or more grammar elements. Common functions include

  • str.len(), returning the length of a string
  • str.to.int(), converting a string to an integer

Here, we instantiate the ISLa solver with a constraint stating that the area code should be above 900:

>>> solver = ISLaSolver(US_PHONE_GRAMMAR, 
>>>             """
>>>             str.to.int(<area>) > 900
>>>             """)

With that, invoking solver.solve() returns a solution for the constraints.

>>> str(solver.solve())
'(907)203-4265'

solve() returns a derivation tree, which typically is converted into a string using str() as above. The print() function does this implicitly.

Subsequent calls of solve() return more solutions:

>>> for _ in range(10):
>>>     print(solver.solve())
(907)789-2197
(907)994-3910
(907)305-0189
(907)617-4056
(907)401-3406
(907)557-9854
(907)898-8434
(907)430-2840
(907)866-1743
(906)893-5683

We see that the solver produces a number of inputs that all satisfy the constraint - the area code is always more than 900.

The ISLaSolver() constructor provides several additional parameters to configure the solver, as documented below. Additional ISLaSolver methods allow checking inputs against constraints, and provide additional functionality.

ISLaSolver ISLaSolver __init__() check() parse() solve() Legend Legend •  public_method() •  private_method() •  overloaded_method() Hover over names to see doc

The ISLa functionality is also available on the command line:

>>> !isla --help
usage: isla [-h] [-v]
            {solve,fuzz,check,find,parse,repair,mutate,create,config} ...

The ISLa command line interface.

options:
  -h, --help            show this help message and exit
  -v, --version         Print the ISLa version number

Commands:
  {solve,fuzz,check,find,parse,repair,mutate,create,config}
    solve               create solutions to ISLa constraints or check their
                        unsatisfiability
    fuzz                pass solutions to an ISLa constraint to a test subject
    check               check whether an input satisfies an ISLa constraint
    find                filter files satisfying syntactic & semantic
                        constraints
    parse               parse an input into a derivation tree if it satisfies
                        an ISLa constraint
    repair              try to repair an existing input such that it satisfies
                        an ISLa constraint
    mutate              mutate an input such that the result satisfies an ISLa
                        constraint
    create              create grammar and constraint stubs
    config              dumps the default configuration file

Semantic Input Properties

In this book, we have frequently used grammars to systematically generate inputs that cover input structure and more. But while it is relatively easy to express the syntax of an input using a grammar, there are input properties that cannot be expressed using a grammar. Such input properties are called semantic properties.

Let us illustrate semantic properties using a simple example. We want to test some system that is configured by two settings, a page size and a buffer size. Both these come as integer numbers as part of a human-readable configuration file. The syntax of this file is given by the following grammar:

from Grammars import Grammar, is_valid_grammar, syntax_diagram, crange
import string
CONFIG_GRAMMAR: Grammar = {
    "<start>": ["<config>"],
    "<config>": [
        "pagesize=<pagesize>\n"
        "bufsize=<bufsize>"
    ],
    "<pagesize>": ["<int>"],
    "<bufsize>": ["<int>"],
    "<int>": ["<leaddigit><digits>"],
    "<digits>": ["", "<digit><digits>"],
    "<digit>": list("0123456789"),
    "<leaddigit>": list("123456789")
}
assert is_valid_grammar(CONFIG_GRAMMAR)

Here's a visualization of this grammar as a railroad diagram, showing its structure:

start
config
config
pagesize= pagesize bufsize= bufsize
pagesize
int
bufsize
int
int
leaddigit digits
digits
digit digits
digit
0 1 2 3 4 5 6 7 8 9
leaddigit
1 2 3 4 5 6 7 8 9

Using this grammar, we can now use any of our grammar-based fuzzers to generate valid inputs. For instance:

from GrammarFuzzer import GrammarFuzzer, DerivationTree
fuzzer = GrammarFuzzer(CONFIG_GRAMMAR)
for i in range(10):
    print(i)
    print(fuzzer.fuzz())
0
pagesize=4057
bufsize=817
1
pagesize=9
bufsize=8
2
pagesize=5
bufsize=25
3
pagesize=1
bufsize=2
4
pagesize=62
bufsize=57
5
pagesize=2
bufsize=893
6
pagesize=1
bufsize=33
7
pagesize=7537
bufsize=3
8
pagesize=97
bufsize=983
9
pagesize=2
bufsize=2

So far, so good - and indeed, these random values will help us test our (hypothetical) system. But what if we want to control these values further, putting our system to the test?

A grammar gives us some control. If we want to ensure a page size of at least 100,000, for instance, a rule like

"<bufsize>": ["<leaddigit><digit><digit><digit><digit><digit>"]

would do the job. We could also express that the page size should be an odd number, by having it end in an odd digit. But if we want to state that the page size should be, say, a multiple of 8, or larger or less than the buffer size, we are out of luck.

In the chapter on fuzzing with generators, we have seen how to attach program code to individual rules - program code that would either generate individual elements right away or filter only these that satisfy specific conditions.

Attaching code makes things very flexible, but also has several disadvantages:

  • First, it is pretty hard to generate inputs that satisfy multiple constraints at once. In essence, you have to code your own strategy for generating inputs, which at some point negates the advantage of having an abstract representation such as a grammar.
  • Second, your code is not portable. While a grammar can be easily adapted to any grammar-based fuzzer, adding, say, Python code, ties you to the Python environment forever.
  • Third, program code can only be used for producing inputs or checking inputs, but not both. This, again, is a downside compared to a pure grammar representation.

Hence, we are looking to a more general way to express semantic properties - and also a more declarative way to express semantic properties.

Unrestricted Grammars

One very general solution to this problem would be to use unrestricted grammars rather than the context-free grammars we have used so far. In an unrestricted grammar, one can have multiple symbols also on the left-hand side of an expansion rule, making them very flexible. In fact, unrestricted grammars are Turing-universal, meaning that they can express any feature that could also be expressed in program code; and they could thus check and produce arbitrary strings with arbitrary features. (If they finish, that is – unrestricted grammars also suffer from the halting problem.) The downside is that there is literally no programming support for unrestricted grammars – we'd have to implement all arithmetics, strings, and other functionality from scratch in a grammar, which is - well - not fun.

Specifying Constraints

In recent work, Dominic Steinhöfel and Andreas Zeller (one of the authors of this book) have presented an infrastructure that allows producing inputs with arbitrary properties, but without having to go through the trouble of implementing producers or checkers. Instead, they suggest a dedicated language for specifiying inputs, named ISLa (for input specification language). ISLa combines a standard context-free grammar with constraints that express semantic properties of the inputs and their elements. ISLa can be used as a fuzzer (producing inputs that satisfy the constraints) as well as a checker (checking inputs whether they satisfy the given constraints).

Let us illustrate ISLa by example. ISLa comes as a Python package named isla-solver that can be easily installed using pip:

$ pip install isla-solver

This also installs all dependent packages.

The core of ISLa is the ISLa Solver – the component that actually solves constraints to produce satisfying inputs.

import isla
from isla.solver import ISLaSolver

The constructor of an ISLaSolver takes two mandatory arguments.

  • The grammar is the grammar the solver should produce inputs from.
  • The constraint is the constraint the produced inputs should satisfy.

To express a constraint, we have a variety of functions and predicates at our disposition. These can be applied to individual elements of the grammar, notably their nonterminals. The function str.len(), for instance, returns the length of a string. If we want to have inputs in which the page size has at least 6 digits, we can write:

solver = ISLaSolver(CONFIG_GRAMMAR, 'str.len(<pagesize>) >= 6')

The method solve() returns the next produced string from the ISLa solver, as a derivation tree (seen in the Chapter on fuzzing with grammars). To convert these into a string, we can use the str() converter:

str(solver.solve())
'pagesize=84056595974\nbufsize=3'

The print() method converts its arguments to strings implicitly. To get, say, the next 10 solutions, we can thus write

for _ in range(10):
    print(solver.solve())
pagesize=84056595974
bufsize=402493567181
pagesize=84056595974
bufsize=1
pagesize=84056595974
bufsize=96
pagesize=84056595974
bufsize=81
pagesize=84056595974
bufsize=514
pagesize=84056595974
bufsize=2
pagesize=84056595974
bufsize=635
pagesize=84056595974
bufsize=7
pagesize=84056595974
bufsize=3
pagesize=614590983583
bufsize=44

... and we see that, indeed, each page size has exactly six digits.

Derivation Trees

If you inspect a derivation tree as returned from solve() directly, you will get quite a structure:

solution = solver.solve()
solution
DerivationTree('<start>', (DerivationTree('<config>', (DerivationTree('pagesize=', (), id=2), DerivationTree('<pagesize>', (DerivationTree('<int>', (DerivationTree('<leaddigit>', (DerivationTree('6', (), id=562),), id=563), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('1', (), id=572),), id=573), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('4', (), id=582),), id=583), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('5', (), id=592),), id=593), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('9', (), id=602),), id=603), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('0', (), id=612),), id=613), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('9', (), id=622),), id=623), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('8', (), id=632),), id=633), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('3', (), id=642),), id=643), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('5', (), id=652),), id=653), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('8', (), id=662),), id=663), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('3', (), id=672),), id=673), DerivationTree('<digits>', (), id=677)), id=666)), id=656)), id=646)), id=636)), id=626)), id=616)), id=606)), id=596)), id=586)), id=576)), id=566)), id=557),), id=554), DerivationTree('\nbufsize=', (), id=4), DerivationTree('<bufsize>', (DerivationTree('<int>', (DerivationTree('<leaddigit>', (DerivationTree('3', (), id=1614),), id=1621), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('0', (), id=1630),), id=1640), DerivationTree('<digits>', (DerivationTree('<digit>', (DerivationTree('1', (), id=1646),), id=1655), DerivationTree('<digits>', (DerivationTree('', (), id=1641),), id=1644)), id=1629)), id=1625)), id=1611),), id=1608)), id=1),), id=0)

We can easily visualize the tree, revealing its structure:

display_tree(solution)
0 <start> 1 <config> 0->1 2 pagesize= 1->2 3 <pagesize> 1->3 41 \nbufsize= 1->41 42 <bufsize> 1->42 4 <int> 3->4 5 <leaddigit> 4->5 7 <digits> 4->7 6 6 (54) 5->6 8 <digit> 7->8 10 <digits> 7->10 9 1 (49) 8->9 11 <digit> 10->11 13 <digits> 10->13 12 4 (52) 11->12 14 <digit> 13->14 16 <digits> 13->16 15 5 (53) 14->15 17 <digit> 16->17 19 <digits> 16->19 18 9 (57) 17->18 20 <digit> 19->20 22 <digits> 19->22 21 0 (48) 20->21 23 <digit> 22->23 25 <digits> 22->25 24 9 (57) 23->24 26 <digit> 25->26 28 <digits> 25->28 27 8 (56) 26->27 29 <digit> 28->29 31 <digits> 28->31 30 3 (51) 29->30 32 <digit> 31->32 34 <digits> 31->34 33 5 (53) 32->33 35 <digit> 34->35 37 <digits> 34->37 36 8 (56) 35->36 38 <digit> 37->38 40 <digits> 37->40 39 3 (51) 38->39 43 <int> 42->43 44 <leaddigit> 43->44 46 <digits> 43->46 45 3 (51) 44->45 47 <digit> 46->47 49 <digits> 46->49 48 0 (48) 47->48 50 <digit> 49->50 52 <digits> 49->52 51 1 (49) 50->51 53 52->53

By converting the derivation tree into a string, we get the represented string:

str(solution)
'pagesize=614590983583\nbufsize=301'

print() does this implicitly, so printing the solution gives us the string:

print(solution)
pagesize=614590983583
bufsize=301

Unless you want to inspect the derivation tree or access its elements, converting it into a string makes it more manageable.

To express a minimum numeric value, we can use a more elegant way. The function str.to_int(), for instance, converts a string into an integer. To obtain a page size that of at least 100000, we can thus also write

solver = ISLaSolver(CONFIG_GRAMMAR,
                    'str.to.int(<pagesize>) >= 100000')
print(solver.solve())
pagesize=100002
bufsize=25

If we want the page size to be in the range of 100 to 200, we can state this as a logical conjunction (using and)

solver = ISLaSolver(CONFIG_GRAMMAR, 
                    '''
                    str.to.int(<pagesize>) >= 100 and 
                    str.to.int(<pagesize>) <= 200
                    ''')
print(solver.solve())
pagesize=168
bufsize=9

And if we want the page size to be a multiple of seven, we can write

solver = ISLaSolver(CONFIG_GRAMMAR, 
                    '''
                    str.to.int(<pagesize>) mod 7 = 0
                    ''')
print(solver.solve())
pagesize=49
bufsize=7
from bookutils import quiz

Quiz

Which of the following constraints expresses that the page size and the buffer size have to be equal? Try it out!





Indeed, ISLa constraints can also involve multiple elements. Expressing equality between two elements is easy, and uses a single equal sign. (There's no assignment in ISLa the = could be confused with.)

solver = ISLaSolver(CONFIG_GRAMMAR, 
                    '''
                    <pagesize> = <bufsize>
                    ''')
print(solver.solve())
pagesize=8
bufsize=8

We can also use numerical constraints, stating that the buffer size should always be exactly one more than the page size:

solver = ISLaSolver(CONFIG_GRAMMAR, 
                    '''
                    str.to.int(<pagesize>) > 1024 and
                    str.to.int(<bufsize>) = str.to.int(<pagesize>) + 1
                    ''')
print(solver.solve())
pagesize=1025
bufsize=1026

All the above functions (like str.to.int()) actually stem from the SMT-LIB library for satisfiability modulo theories (SMT), a standard for expressing constraints for constraint solvers (like ISLa). The list of all theories defined in SMT-LIB lists dozens of functions and predicates that can be used in ISLa constraints.

Quiz

Which constraints are necessary to ensure that all digits are between 1 and 3?





Using SMT-LIB Syntax

Instead of the above "infix" syntax which is familiar to programmers, ISLa also supports full SMT-LIB syntax. Instead of writing $f(x_1, x_2, x_3, \dots)$ for a function $f$ and its arguments $x_1 \dots x_n$, SMT-LIB uses a "prefix" LISP-like syntax in which all functions and operators are written as $(f x_1 x_2 x_3 \dots)$. The above predicate would thus be written as

solver = ISLaSolver(CONFIG_GRAMMAR, 
                '''
                (and
                  (> (str.to.int <pagesize>) 1024)
                  (= (str.to.int <bufsize>)
                     (+ (str.to.int <pagesize>) 1)))
                ''')
print(solver.solve())
pagesize=1025
bufsize=1026

Note that for boolean operators such as and, we still use the ISLa infix syntax; having ISLa handle these operators is more efficient than passing them on to the constraint solver.

ISLa on the Command Line

When you install isla-solver, you also get an isla command-line tool. This allows you to create inputs from the command line or shell scripts.

Let us first create a grammar file suitable for isla. isla accepts grammars in Fuzzingbook format; they need to define a variable named grammar.

with open('grammar.py', 'w') as f:
    f.write('grammar = ')
    f.write(str(CONFIG_GRAMMAR))
from bookutils import print_file
print_file('grammar.py')
grammar = {'<start>': ['<config>'], '<config>': ['pagesize=<pagesize>\nbufsize=<bufsize>'], '<pagesize>': ['<int>'], '<bufsize>': ['<int>'], '<int>': ['<leaddigit><digits>'], '<digits>': ['', '<digit><digits>'], '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'], '<leaddigit>': ['1', '2', '3', '4', '5', '6', '7', '8', '9']}

With this, we can use isla as a grammar fuzzer, plain and simple By default, isla solve produces one single matching output:

!isla solve grammar.py
pagesize=14
bufsize=621783905620

The true power of isla, however, comes to be as we (again) add constraints to be solved - either in separate constraint files or (easier) directly on the command line:

!isla solve grammar.py --constraint '<pagesize> = <bufsize>'
pagesize=8
bufsize=8
!isla solve grammar.py \
    --constraint '<pagesize> = <bufsize> and str.to.int(<pagesize>) > 10000'
pagesize=99290248
bufsize=99290248

The isla executable provides several options and commands, and is a great alternative on the command line.

!isla --help
usage: isla [-h] [-v]
            {solve,fuzz,check,find,parse,repair,mutate,create,config} ...

The ISLa command line interface.

options:
  -h, --help            show this help message and exit
  -v, --version         Print the ISLa version number

Commands:
  {solve,fuzz,check,find,parse,repair,mutate,create,config}
    solve               create solutions to ISLa constraints or check their
                        unsatisfiability
    fuzz                pass solutions to an ISLa constraint to a test subject
    check               check whether an input satisfies an ISLa constraint
    find                filter files satisfying syntactic & semantic
                        constraints
    parse               parse an input into a derivation tree if it satisfies
                        an ISLa constraint
    repair              try to repair an existing input such that it satisfies
                        an ISLa constraint
    mutate              mutate an input such that the result satisfies an ISLa
                        constraint
    create              create grammar and constraint stubs
    config              dumps the default configuration file

Accessing Elements

So far, we have accessed nonterminals simply by referencing their name, as in <bufsize> or <pagesize>. However, in some cases, this simple method is not sufficient. In our configuration grammar, for instance, we may want to access (or constrain) <int> elements. However, we do not want to constrain all integers at once, but only those in a particular context – say, those that occur as a part of a <pagesize> element, or only those that occur as part of a <config> element.

To this end, ISLa allows referencing parts of a given element, using two special operators.

The expression <a>.<b> refers to the immediate subpart <b> of some element <a>. That is, <b> has to appear in one of the expansion rules of <a>. For instance, <pagesize>.<int> refers to the <int> element of a page size. Here is an example using the dot operator:

solver = ISLaSolver(CONFIG_GRAMMAR, 
                '''
                <pagesize>.<int> = <bufsize>.<int>
                ''')
print(solver.solve())
pagesize=8
bufsize=8

The expression <a>..<b>, however, refers to any subpart <b> of some element <a>. That is, <b> can appear in the expansion of <a>, but also in the expansion of any subelement (and any subelement thereof). Here is an example using the double dot operator, enforcing every digit in a <config> element to be 7:

from ExpectError import ExpectError
solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            <config>..<digit> = "7" and <config>..<leaddigit> = "7"
            ''')
print(solver.solve())
pagesize=7
bufsize=77

To reason about dots and double dots, it helps to visualize the string in question as a derivation tree discussed in the chapter on grammar-based fuzzing. The derivation tree of the input

pagesize=12
bufsize=34

for instance, looks like this:

0 <start> 1 <config> 0->1 2 pagesize= 1->2 3 <pagesize> 1->3 11 \nbufsize= 1->11 12 <bufsize> 1->12 4 <int> 3->4 5 <leaddigit> 4->5 7 <digits> 4->7 6 1 (49) 5->6 8 <digit> 7->8 10 <digits> 7->10 9 2 (50) 8->9 13 <int> 12->13 14 <leaddigit> 13->14 16 <digits> 13->16 15 3 (51) 14->15 17 <digit> 16->17 19 <digits> 16->19 18 4 (52) 17->18

In this tree, the . syntax refers to immediate children. <bufsize>.<int> is the one <int> node that is the immediate descendant of <bufsize> (but not any other <int> node). In contrast, <config>..<digit> refers to all <digit> descendants of the <config> node.

If an element has multiple immediate children of the same type, one can use the special <a>[$n$] syntax to access the $n$-th child of type <a>. To access the first child, $n$ is equal to one, not zero, as in the XPath abbreviated syntax. In our configuration grammar, there is no expansion including the same nonterminal more than once, so we do not need this feature.

For a demonstration of indexed dots, consider the following grammar, which produces lines of three "a" or "b" characters:

LINES_OF_THREE_AS_OR_BS_GRAMMAR: Grammar = {
    '<start>': ['<A>'],
    '<A>': ['<B><B><B>', '<B><B><B>\n<A>'],
    '<B>': ['a', 'b']
}
fuzzer = GrammarFuzzer(LINES_OF_THREE_AS_OR_BS_GRAMMAR)
for i in range(5):
    print(i)
    print(fuzzer.fuzz())
0
aab
bab
1
aaa
aba
2
aba
baa
bba
3
bbb
abb
4
baa

We can force, say, the second character in a line to always be a "b:"

solver = ISLaSolver(LINES_OF_THREE_AS_OR_BS_GRAMMAR, 
            '''
            <A>.<B>[2] = "b"
            ''')

for i in range(5):
    print(i)
    print(solver.solve())
0
abb
1
bbb
2
abb
3
bbb
4
abb

Quantifiers

By default, all nonterminals in ISLa constraints are universally quantified - that is, any constraint applying to, say, some <int> element applies to all <int> elements in the resulting string. If you only want to constrain one element, though, you have to (and can) specify this in ISLa, using an existential quantifier.

To use an existential quantifier in ISLa, use the construct

exists TYPE VARIABLE in CONTEXT:
    (CONSTRAINT)

where VARIABLE is some identifier, TYPE is its type (as a nonterminal), and CONTEXT is the context (again a nonterminal) in which the constraint should hold. CONSTRAINT is again a constraint expression, in which you now can make use of VARIABLE as the element whose existence you assume.

Let us illustrate existential quantification again using a simple example. We want to make sure that at least one integer in our generated string has a value of more than 1000. So we write

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            exists <int> i in <start>:
                str.to.int(i) > 1000
            ''')

for _ in range(10):
    print(solver.solve())
pagesize=1008
bufsize=5
pagesize=1008
bufsize=2280617349521
pagesize=1008
bufsize=8
pagesize=1008
bufsize=7
pagesize=1008
bufsize=3
pagesize=1008
bufsize=93
pagesize=1008
bufsize=630
pagesize=1008
bufsize=4
pagesize=1008
bufsize=14
pagesize=1008
bufsize=5

We note that all generated inputs satisfy the constraint of having at least one integer that satisfies the constraint.

Specifying a variable name is optional; if you omit it, you can use the quantified nonterminal instead. The above constraint can thus also be expressed in a more compact fashion:

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            exists <int> in <start>:
                str.to.int(<int>) > 1000
            ''')

print(solver.solve())
pagesize=1007
bufsize=64

Besides existential quantification, there also is universal quantification, using the forall keyword instead of exists. If we want all elements in some context to satisfy a constraint, this comes in handy.

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            forall <int> in <start>:
                str.to.int(<int>) > 1000
            ''')

for _ in range(10):
    print(solver.solve())
pagesize=1001
bufsize=1001
pagesize=1001
bufsize=1002
pagesize=1001
bufsize=1003
pagesize=1001
bufsize=1004
pagesize=1002
bufsize=1001
pagesize=1002
bufsize=1002
pagesize=1002
bufsize=1003
pagesize=1002
bufsize=1004
pagesize=1003
bufsize=1001
pagesize=1003
bufsize=1002

We see that all <int> elements satisfy the constraint.

By default, all nonterminals that are re-used directly in constraints are universally quantified within the <start> symbol, so the above can actually be simplified to

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            str.to.int(<int>) > 1000
            ''')

for _ in range(10):
    print(solver.solve())
pagesize=1001
bufsize=1001
pagesize=1001
bufsize=1002
pagesize=1001
bufsize=1003
pagesize=1001
bufsize=1004
pagesize=1002
bufsize=1001
pagesize=1002
bufsize=1002
pagesize=1002
bufsize=1003
pagesize=1002
bufsize=1004
pagesize=1003
bufsize=1001
pagesize=1003
bufsize=1002

... and you realize that in all our initial constraints, we always had an implicit universal quantification.

Picking Expansions

Sometimes, we'd like a quantifier to apply only for a specific expansion alternative of a nonterminal. The form

forall TYPE VARIABLE=PATTERN in CONTEXT:
    (CONSTRAINT)

means that the CONSTRAINT only applies to a VARIABLE that actually matches the expansion given in PATTERN. (Again, we can replace forall with exists, and make this an existential quantification rather than a universal quantification.)

Here's an example of using forall:

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            forall <int>="<leaddigit><digits>" in <start>:
                (<leaddigit> = "7")
            ''')

This ensures that when <int> is expanded to a lead digit followed by more digits, the lead digit becomes 7. The effect is that all <int> values now start with a 7 digit:

str(solver.solve())
'pagesize=71\nbufsize=770835426929713'

Likewise, we can constrain <int> as a whole, and thus ensure that all numbers are greater than 100:

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            forall <int> in <start>:
                (str.to.int(<int>) > 100)
            ''')

str(solver.solve())
'pagesize=101\nbufsize=101'

By default, all variables are universally quantified in <start>, so the above can also be expressed as

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            str.to.int(<int>) > 100
            ''')

str(solver.solve())
'pagesize=101\nbufsize=101'

Matching Expansion Elements

In a quantification pattern, we can also name individual nonterminal elements and use them in our constraints. This is done by replacing the nonterminal <ID> with the special form {<ID> VARIABLE} (in curly braces) which then makes the variable VARIABLE a placeholder for the value matched by ID; VARIABLE can then be used in constraints.

Here is an example. In the expansion <leaddigit><int>, we want to ensure that the <leaddigit> is always 9. Using the special brace form, we make lead a variable holding the value of <leaddigit>, and can then use it in a constraint:

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            forall <int> i="{<leaddigit> lead}<digits>" in <start>:
                (lead = "9")
            ''')

This (again) ensures that all lead digits should be 9:

for _ in range(10):
    print(solver.solve())
pagesize=92
bufsize=9168435097796
pagesize=9
bufsize=9188
pagesize=981
bufsize=999
pagesize=9
bufsize=9
pagesize=9
bufsize=9
pagesize=9
bufsize=9
pagesize=91
bufsize=9
pagesize=9
bufsize=9
pagesize=90
bufsize=9
pagesize=9
bufsize=9242

Could we express the above in a simpler fashion? Yes! For one, we can refer to <leaddigit> directly rather than introducing variables like i and lead:

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            forall <int>="<leaddigit><digits>" in <start>:
                (<leaddigit> = "9")
            ''')
print(solver.solve())
pagesize=9
bufsize=941890257631

Furthermore, using implicit universal quantification and the dot notation introduced earlier, we could write, for instance

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            <int>.<leaddigit> = "9"
            ''')
print(solver.solve())
pagesize=99
bufsize=9501387624

or just

solver = ISLaSolver(CONFIG_GRAMMAR, 
            '''
            <leaddigit> = "9"
            ''')
print(solver.solve())
pagesize=98
bufsize=93762401955

and obtain the same result (not necessarily the exact same values, though, due to randomness):

But while universal quantification and dot notation are sufficient for many cases, the pattern matching notation is more general and more flexible – even if it may be harder to read.

Checking Strings

Using an ISLaSolver, we can also check if an string satisfies the constraints. This can be applied to inputs, but also to outputs; ISLa constraints can thus server as oracles – that is, predicates that check a test result.

Let us check if in a given string, <pagesize> and <bufsize> are the same.

constraint = '<pagesize> = <bufsize>'
solver = ISLaSolver(CONFIG_GRAMMAR, constraint)

To check the tree, we can pass it into the evaluate() method of the solver – and find that the given input does not satisfy our constraint.

solver.check('pagesize=12\nbufsize=34')
False

If we repeat the above, however, with an input that satisfies the constraint, we obtain a True result.

solver.check('pagesize=27\nbufsize=27')
True

Checking constraints is much more efficient than solving them, as ISLa does not have to search for possible solutions.

Case Studies

Let us further illustrate ISLa using a few case studies.

Matching Identifiers in XML

The Extensible Markup Language (XML) is a typical example of an input language that cannot be fully expressed using a context-free grammar. The problem is not so much expressing the syntax of XML – the basics are fairly easy:

XML_GRAMMAR: Grammar = {
    "<start>": ["<xml-tree>"],
    "<xml-tree>": ["<open-tag><xml-content><close-tag>"],
    "<open-tag>": ["<<id>>"],
    "<close-tag>": ["</<id>>"],
    "<xml-content>": ["Text", "<xml-tree>"],
    "<id>": ["<letter>", "<id><letter>"],
    "<letter>": crange('a', 'z')
}
assert is_valid_grammar(XML_GRAMMAR)
start
xml-tree
xml-tree
open-tag xml-content close-tag
open-tag
< id >
close-tag
</ id >
xml-content
Text xml-tree
id
letter id letter
letter
b a c d e g f h i j l k m n o q p r s t v u w x y z

The problem becomes evident when we produce inputs from the grammar: The <id> elements in <open-tag> and <close-tag> do not match.

fuzzer = GrammarFuzzer(XML_GRAMMAR)
fuzzer.fuzz()
'<xdps><s><x><f>Text</g></ka></k></hk>'

If we want the tag IDs to match, we need to come up with a finite set of tags (as in, say, HTML); then we can extend the grammar with one rule for each tag - <body>...</body>, <p>...</p>, <strong>...</strong>, and so on. For an infinite set of tags, though, as in our grammar, expressing that the two tag IDs must match is not possible in a context-free grammar.

With ISLa, however, constraining the grammar is easy. All we need is the rule that constrains the <xml-tree>:

solver = ISLaSolver(XML_GRAMMAR, 
            '''
            <xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
            ''', max_number_smt_instantiations=1)

for _ in range(3):
    print(solver.solve())
<p>Text</p>
<p><p>Text</p></p>
<p><p><p>Text</p></p></p>

and we see that the <id> tags now indeed match each other.

Solver Configuration Parameters

The configuration parameter max_number_smt_instantiations we passed to the ISLaSolver object above limits the number of calls to ISLa's underlying SMT solver. Generally, higher numbers lead to more inputs generated per time. Many of those will look structurally similar, though. If we aim for structurally diverse inputs and do not care about, e.g., the names of tags, it can make sense to choose a lower value for this parameter. This is what happens with max_number_smt_instantiations=10, which is the current default:

solver = ISLaSolver(XML_GRAMMAR, 
            '''
            <xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
            ''', max_number_smt_instantiations=10)

for _ in range(3):
    print(solver.solve())
<p>Text</p>
<h>Text</h>
<c>Text</c>

The parameter max_number_free_instantiations serves a similar purpose: ISla randomly instantiates nonterminal symbols whose values are not restricted by a constraint. It chooses—surprise!—at most max_number_free_instantiations such random instantiations.

Other configuration parameters of interest are structural_predicates and semantic_predicates, which let you extend the ISLa language by passing custom structural and semantic predicates to the solver. You can use all the predicates in these sets inside the ISLa constraint to solve. Per default, the semantic predicate count(in_tree, NEEDLE, NUM) and the following structural predicates are available:

  • after(node_1, node_2)
  • before(node_1, node_2)
  • consecutive(node_1, node_2)
  • count(in_tree, NEEDLE, NUM)
  • different_position(node_1, node_2)
  • direct_child(node_1, node_2)
  • inside(node_1, node_2)
  • level(PRED, NONTERMINAL, node_1, node_2)
  • nth(N, node_1, node_2)
  • same_position(node_1, node_2)

In contrast to the "input generator" solution in the chapter on generators, our constraint-based solution is purely declarative - and can also be used to parse and check inputs. Plus, of course, we can easily add more constraints:

solver = ISLaSolver(XML_GRAMMAR, 
            '''
            <xml-tree>.<open-tag>.<id> = <xml-tree>.<close-tag>.<id>
            and
            str.len(<id>) > 10
            ''', max_number_smt_instantiations=1)

for _ in range(3):
    print(solver.solve())
<ppmphklftkp>Text</ppmphklftkp>
<ppmphklftkp><ppkpeyqktdh>Text</ppkpeyqktdh></ppmphklftkp>
<ppmphklftkp><ppkpeyqktdh><ppnpukmnppn>Text</ppnpukmnppn></ppkpeyqktdh></ppmphklftkp>

Definitions and Usages in Programming Languages

When testing compilers with generated program code, one often encounters the problem that before using an identifier, one has to declare it first - specifying its type, some initial value, and more.

This problem is easily illustrated in the following grammar, which produces sequences of assignments. Variable names consist of a single lowercase letter; values can only be digits; assignments are separated by semicolons.

LANG_GRAMMAR: Grammar = {
    "<start>":
        ["<stmt>"],
    "<stmt>":
        ["<assgn>", "<assgn>; <stmt>"],
    "<assgn>":
        ["<lhs> := <rhs>"],
    "<lhs>":
        ["<var>"],
    "<rhs>":
        ["<var>", "<digit>"],
    "<var>": list(string.ascii_lowercase),
    "<digit>": list(string.digits)
}
assert is_valid_grammar(LANG_GRAMMAR)
syntax_diagram(LANG_GRAMMAR)
start
stmt
stmt
assgn assgn ; stmt
assgn
lhs := rhs
lhs
var
rhs
var digit
var
b a c d e g f h i j l k m n o q p r s t v u w x y z
digit
0 1 2 3 4 5 6 7 8 9

Here are some assignment sequences produced by the grammar:

fuzzer = GrammarFuzzer(LANG_GRAMMAR)
for _ in range(10):
    print(fuzzer.fuzz())
w := 7; g := 2
f := m; x := 3
p := 3
h := f; h := u
n := k; k := z; z := 6
m := x; g := h
d := 3
h := 6
s := m
k := q

We see that the assignment syntax is similar to what we have in common programming languages. The semantics, however, are, well, questionable, as we commonly access variables whose values have not been previously defined. Again, this is a semantic property that cannot be expressed in a context-free grammar alone.

What we need here is a constraint specifying that on the right-hand side of an assignment, we can only have variable names that occur on the left-hand side. In ISLa, we achieve this through the following constraint:

solver = ISLaSolver(LANG_GRAMMAR, 
            '''
            forall <rhs> in <assgn>:
                exists <assgn> declaration:
                    <rhs>.<var> = declaration.<lhs>.<var>
            ''',
            max_number_smt_instantiations=1,
            max_number_free_instantiations=1)
for _ in range(10):
    print(solver.solve())
y := 1
a := 0; t := 5
d := d
u := 2; l := 4; p := 7
b := 9; a := 6; v := a
h := 8; o := 3; z := h
w := 4; i := 3; p := p
x := 8; d := d
h := 8; f := h
p := p; m := 5

This is much better already, but not perfect yet - we might still have assignments like a := a or a := b; b := 5. That is because our constraints do not yet take care of ordering – in a <rhs> element, we can only use variables that are defined earlier.

For this purpose, ISLa provides a before() predicate: before(A, B) expresses that the element A must occur before the element B. With before(), we can rewrite our constraint as

solver = ISLaSolver(LANG_GRAMMAR, 
            '''
            forall <rhs> in <assgn>:
                exists <assgn> declaration:
                    (before(declaration, <assgn>) and
                     <rhs>.<var> = declaration.<lhs>.<var>)
            ''',
            max_number_free_instantiations=1,
            max_number_smt_instantiations=1)

... and thus ensure that on the right-hand-side of assignments, we only use identifiers defined earlier.

for _ in range(10):
    print(solver.solve())
b := 6
f := 8; v := 5
h := 7; b := h; p := b; p := p
p := 9; p := p; j := 4; n := p
p := 0; p := p; p := 3; l := p
p := 1; x := p
p := 2; c := p
a := 9; t := a
a := 0; m := a
p := 9; p := 6; z := p

In case you find that the assignment sequences are too short, you can use the ISLa count() predicate. count(VARIABLE, NONTERMINAL, N) ensures that the number of NONTERMINALs in VARIABLE is exactly N. To have statements with exactly 5 assignments, write

solver = ISLaSolver(LANG_GRAMMAR, 
            '''
            forall <rhs> in <assgn>:
                exists <assgn> declaration:
                    (before(declaration, <assgn>) and
                     <rhs>.<var> = declaration.<lhs>.<var>)
            and
            count(start, "<assgn>", "5")
            ''', 
            max_number_smt_instantiations=1,
            max_number_free_instantiations=1)
for _ in range(10):
    print(solver.solve())
a := 8; q := a; e := 4; m := 3; t := 0
b := 6; h := b; l := 1; k := 9; w := h
b := 2; s := b; d := 7; y := 5; r := d
b := 6; x := b; p := 3; p := 7; j := p
p := 5; c := p; z := 7; i := 5; o := p
h := 0; p := h; n := p; u := 6; f := 3
b := 3; g := b; v := b; q := 7; z := 4
a := 6; a := a; o := a; o := 5; e := a
p := 5; a := p; a := p; e := 4; j := a
p := 8; h := p; q := p; p := 1; g := p

Lessons Learned

  • Using ISLa, we can add and solve constraints to grammars, allowing to express semantic properties of our test inputs
  • Declaring constraints (and have a solver solve them) is much more versatile than adding generator code, and language-independent, too
  • Using ISLa is fun :-)

Next Steps

In the next chapters, we will continue to focus on semantics. Among others, we will learn how to

Background

Exercises

Exercise 1: String Encodings

A common way of representing strings is length-prefixed strings, a representation made popular by the PASCAL programming language. A length-prefixed string starts with a few bytes that encode the length $L$ of the string, followed by the $L$ actual characters. For instance, assuming that two bytes are used to encode the length, the string "Hello" could be represented as the sequence

0x00 0x05 'H' 'e' 'l' 'l' 'o'

Part 1: Syntax

Write a grammar that defines the syntax of length-prefixed strings.

Part 2: Semantics

Use ISLa to produce valid length-prefixed strings. Make use of the SMT-LIB string library to find appropriate conversion functions.

Creative Commons License The content of this project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The source code that is part of the content, as well as the source code used to format and display that content is licensed under the MIT License. Last change: 2023-11-11 18:18:05+01:00CiteImprint