# A Survey Of Graph Based Software Watermarking

August 31, 2010 - 23:45

We examine the currently proposed static and dynamic graph watermarking schemes. Graph based watermarking schemes, like other watermarking schemes, can be divided into two groups: static and dynamic. Static graph watermarks are embedding in a control-flow graph within a program whereas dynamic graph watermarks are embedding in a graph data-structure built at run-time. We report previous findings, describe some recent additions and conclude by suggesting a direction for future work.

The full paper is summarised on this page; the full paper contains more details and examples.

## Encoding Watermarks in Graphs

Collberg et al.1 describe several techniques for encoding watermark integers in graph structures. The algorithms in this paper rely on the fact that graph-generating code is difficult to analyse due to aliasing effects 2 which, in general, is known to be un-decidable 3.

An ideal class of watermarking graph should have the following properties 4:

• ability to efficiently encode a watermark integer; and be efficiently decodable to a watermark integer
• a root node from which all other nodes are reachable
• a high data-rate
• a low outdegree to resemble common data structures such as lists and trees
• error correcting properties to allow detection after transformation attacks
• tamper-proofing abilities
• have some computationally feasible algorithms for graph isomorphism, for use during recognition

### Graph Enumeration

The family of graph encoding is based on a branch of graph theory known as graphical enumerations5. An integer watermark $n$ is encoded as the $n^{th}$ enumeration of a given graph.

#### Directed Parent-Pointer Trees

This family of graphs contains a single edge between a vertex and it's parent. For example, there are 4 possible enumerations of DPPTs having four indistinguishable vertices6. We choose the nth enumerated graph to embed the watermark number n. Implementing a parent-pointer tree data-structure is space efficient because each node has just one pointer field referencing it's parent. However, the data-structure is fragile and an adversary could add a single node or edge to distort the watermark.

#### Planted Planar Cubic Trees

Planted planar cubic trees (PPCT) are binary trees where every interior vertex, except the root, has two children. These are easily enumerated by using a Catalan recurrence7 8. The Catalan number c(n) gives the number of unique trees for each n. An integer is encoded as one of the trees.

PPCTs can be made more resilient to attacks by a) marking each leaf with a self-pointer, and b) creating an outer cycle from the root to itself through all the leaves9. This allows single edge and node insertions to be detected; however, multiple changes cannot be detected or corrected. The PPCT graphs have a lower bit-rate than other graph families but are more resilient to attacks.

Radix graphs add an extra pointer field in each vertex of a circular linked list of length k to encode a base-k digit. It is possible to encode watermark digits where a self-pointer represents 0, a pointer to the next node 1, and so on.

These graphs give the highest data-rate for encoding watermarks however they are fragile as the watermark could easily be distorted. We can add redundancy to these watermarks by restricting the indegree to two and outdegree to two and using a permutation graph 1.

Several papers10 11 12 13 14 15 describe a technique which combines radix graphs with the error-correcting properties of PPCTs by converting a radix graph into a PPCT-like structure.

### Permutation Graphs

Permutation based graphs (as defined by Collberg et al.1) use the same basic singly linked circular list structure as the radix graphs but have error-correcting properties. In this encoding scheme a permutation P = {p1, p2, ..., pn} is derived from the watermark integer n; the permutation is then encoded in the graph by adding edges between vertices i and pi.

Collberg and Nagra9 give algorithms for encoding and decoding an integer as a permutation. For example, the integer 9710 is encoded as the permutation {3, 1, 0, 2, 4}.

If an adversary changes one of the non-spine pointers to point to a different vertex we will discover an error because the permutation graph encodes a unique permutation - each vertex must have indegree two. However, if the adversary swaps two pointers we will lose the watermark.

#### Reducible Permutation Graphs

Reducible permutation graphs (RPG)16 17 are very similar to permutation graphs but they closely resemble control-flow graphs18 as they are reducible-flow graphs19 20.

The reducibility of this family of graphs means that they resemble control-flow graphs constructed from programming constructs such as if, while etc.9.

RPGs, like CFGs, contain a unique entry node and a unique exit node, a preamble which contains zero or more nodes from which all other nodes can be reached and a body which encodes a watermarking using a self-inverting permutation21.

## Static Graph Watermarking

Venkatesan et al.16 proposed the first static graph watermarking scheme, Graph Theoretic Watermarking (GTW), which encodes a value in the topology of a program's control-flow graph18. The idea was later patented by Venkatesan et al.17 for Microsoft. The basic concept is to encode a watermark value in a reducible permutation graph and convert it into a control flow graph; it is then merged with the program control flow graph by adding control flow edges between the two.

The algorithm adds bogus control flow edges between random pairs of vertices in the program CFG and watermark CFG in order to protect against static analysis attacks looking for sparse-cuts22 in the control-flow graph. A sparse-cut would indicate a possible joining point of the original program CFG and the watermark CFG where the attacker could split the program with as few edges broken as possible. For example, we could insert a call to the sum, or any other method in the original program, in the watermark method. The GTW algorithm also inserts bogus method calls in between parts of the original program so that the location of bogus calls cannot be used to point out the location of the watermark.

In order to recognise a GTW we must know which basic blocks, from the program's CFG, are part of the watermark graph; we must therefore mark the watermark basic blocks in some way in order to identify them. We could, for example, re-order instructions such that they are in lexicographic order, or insert bogus instructions to identify a watermark basic block9. However, these techniques are not resilient to semantics-preserving transformations and an attacker could remove the marks so that we cannot mark our watermark blocks.

Collberg et al.23 implemented a version GTWSM of GTW in Sandmark24 in order to evaluate the algorithm. They measured the size and time overhead of watermarking and evaluated the algorithm against a variety of attacks. They also introduce two methods (Partial Sum splitting and Generalised Chinese Remainder Theorem25 splitting) for splitting a watermark integer into redundant pieces so that a large integer can be stored in several smaller CFGs. They found that stealth is a big problem; for example, the basic blocks of the generated watermark method consisted of 20% arithmetic instructions compared to just 1% for standard Java methods26. Watermarks of up to 150 bits increased program size by between 40% and 75%, while performance decreased by between 0% and 36%23.

## Dynamic Graph Watermarking

Collberg and Thomborson27 proposed the first dynamic graph based watermarking scheme CT to overcome problems with static watermarking schemes; most notably, static watermarks are highly fragile and therefore susceptible to semantics-preserving transformation attacks28.

Dynamic graph watermarking schemes are similar to static graph watermarking except the graph structures are built at run-time. CT can use any of the previously described graph encoding schemes to store the watermark. We would inspect the Java heap to retrieve our watermark when the program is executed with the secret input.

The first implementation29 of the CT algorithm CTJW, implemented in a system called JavaWizz 30, was for Java bytecode using PPCT graphs to encode the watermark integer. A class is chosen from the program to be watermarked and converted into a node class by adding additional fields which contain references to other nodes.

Palsberg et al.29 discuss a simple implementation which only resists attacks against dead-code removal; they suggest that other software protection techniques are applied after the program has been watermarked. Dependencies are added between the watermark generating code and the original program by replacing a statement S with a statement of the form if(x != y) S where x and y are distinct nodes in the watermark graph.

In order to retrieve the watermark, the watermarked program is executed and the Java heap is accessed by dumping an image using the -Xhprof heap profiling JVM option31.

The following steps are performed to retrieve the watermark:

• Extracting potential node classes
• Exracting potential node objects
• Determining potential edges
• Searching for the watermark graph

In general, searching for the graph would be an NP-complete problem however we know that the graph is a PPCT graph of a certain size and can prune away unnecessary nodes. Palsberg et al.32 evaluated their implementation and found that, although it is possibly to retrieve a watermark, it is time-consuming - taking from 0.6 minutes up to 8.9 minutes on their set of test programs. It was shown that the CT algorithm is a good watermarking scheme because it is stealthy and resilient to semantics-preserving transformation attacks but it must be combined with other software protection techniques such as obfuscation33 and tamper-proofing 34.

Collberg and Thomborson4 implemented a version of the CT algorithm, CTSM, in Sandmark 24 and compared it to JavaWiz29. The CTSMimplementation differs from the JavaWiz in the following ways:

1. CTSMrequires a key to encode and decode the watermark, requiring user annotation of the program to be watermarked
2. CTSMcan encode arbitrary strings whereas CTJW requires an integer watermark
3. CTJW uses PPCTs to encode the watermark, whereas CTSMoffers the choice of Radix, PPCT, Permutation or Reducible Permutation graphs; also offering a cycled graph option to protect against node-splitting attacks.
4. CTJW's graph building code is concentrated in one location whereas CTSMembeds code fragments at sereval user-specified locations
5. CTJW uses an existing class for graph nodes whereas CTSW generates its own
6. CTJW uses the heap profiling option of the JVM whereas CTSM uses the Java Debug Inteferace (JDI) API 35

The CTSW algorithm embedding algorithm consists of the following steps:

• Annotation In this first step the programmer must insert calls to a method mark() which indicates to the watermarking system locations in which graph building code can be inserted. This manual annotation step allows a programmer to carefully choose the best locations for code insertion, maximising the stealthiness of the embedded watermark.
• Tracing After the code has been annotated the program is run with a secret input during which one or more annotation points will be encountered. Some of these encountered locations are used to store the graph building code.
• Embedding The watermark is encoded in a graph structure (strings watermarks are first converted to integers) and is converted into Java bytecode that builds the graph. The graph node class is generated from a similar class already in the program, or if no suitable class is found, classes from the Java library, such as LinkedList could be used. The graph is partitioned into several, equal sized, sub-graphs so that the code is more stealthy and the code is inserted at the marked locations discovered in the tracing step.

During extraction the program is run with the secret input; this will invoke the graph building code at the locations marked by the programmer. The graph structure will be on the Java heap and the Java debugging interface is used to retrieve it. The JDI is used to add breakpoints to every constructor in the program to build a circular linked buffer of the last 1000 objects allocated. The list is then searched, in reverse, for the watermark graph.

Collberg and Thomborson4 evaluate CTSM and conclude that it is efficient and resilient to semantics-preserving transformations. The greatest vulnerability is it's limited stealthiness and it is suggest that future work investigates combining a locally stealthy but not steganographically stealthy technique, such as describe by Nagra and Thomborson36, with the CT algorithm would be helpful.

## Attacks against graph watermarks

Collberg et al.4 describe 4 types of attacks against dynamic graph watermarking schemes:

• adding extra nodes to the graph
• renaming and re-ordering instance variables
• splitting nodes into several linked parts

However, an adversary will probably not know the location of the watermark in a large program and as such would have to apply transformations across the program resulting in a large performance decrease. The use of RPG or PPCT graphs prevent renaming and re-ordering attacks because these graph encodings do not rely on the order of the nodes.

The biggest problem is the addition of bogus nodes to a graph for which tamper-proofing techniques are required. Collberg et al.37 suggest the use of Java reflection may help. However, they did not implement this because they conclude that the code would be unstealthy as this kind of code is unusual in most programs.

Luo et al.38 describe an algorithm based on CT using the Chinese Remainder Theorem25 to split the watermark.

## Tamper-proofing by Constant Encoding

He39, and later Thomborson et al.40, developed a tamper-proofing technique for dynamic graph watermark called constant encoding. Constant encoding encodes some of the constants in a program into a tree structure similar to the watermark and decodes them at run-time. An attacker cannot distinguish between the watermark graph and the constant encoding graphs so they cannot change any graph structure as they may introduce bugs into the program.

An attacker that manipulates the constant graph will change the encoding; so the decode function will return a different number and the program will be semantically incorrect.

A weakness of this system is that an attacker may be able to discover that certain program functions always return the same value, for every execution. He39 suggests introducing dependencies, to make the code harder to analyse, by providing an input variable to the decoding function. A further suggestion, for future work, is to change the constant tree at runtime making the code harder to analyse.

Jian-qi et al.41 propose using the constant encoding functions within the parameters of opaque predicates42 so that if an attacker manipulates a constant graph control will pass to the wrong clause of an if statement using the opaque predicate.

Thomborson et al.40 combine the watermarking and tamper-proofing techniques further by using sub-graphs of the watermark graph to encode constants, rather than separate graphs. Thomborson et al.40 also formally define the problem of tamper-proofing and show that commonly occurring constants in computer programs can be replaced automatically and that a long series of dynamic analyses would be required to remove the watermark. However, it is not clear if their constant encoding technique is resilient against pattern-matching attacks -- a question which is left for future work.

Khiyal et al.43 suggest splitting constants so that large constants can be encoded in small trees. They evaluated the constant encoding technique by comparing watermarked and tamper-proofed programs for efficiency, size and resilience. They found that the size of a program does increase when tamper-proofed and the execution time is slower; however, the tamper-proofing technique should be used with obfuscation to further protect a tamper-proofed program.

## Conclusion

We have presented a survey of software watermarking schemes based on graph encoding using both static and dynamic embedding techniques. Static techniques, as always, are highly susceptible to semantics preserving transformation attacks and are therefore easily removed by an adversary.

Dynamic graph watermarking, however, is resilient to most semantics-preserving transformations if the right graph encoding is chosen. However, a choice between variable degrees of high stealth, resilience and bit-rate has to be made, as no algorithm has been developed which combines the best of all these properties.

Further research should continue into dynamic graph watermarking, especially on improving the CT algorithm from attacks. We intend to investigate the use of program slicing44 to attack dynamic watermarks and improve resilience to subtractive attacks.