The Problems with the Execution Path Watermark Algorithm for Java Bytecode

March 29, 2011 - 11:14

There are two general types of software watermarking: static and dynamic. The latter stores the watermark in the execution or a data structure of a program. Execution path watermarking encodes the watermark in the sequence of branches taken during execution. A version of this algorithm has been implemented in Sandmark. How effective is execution path watermarking? Is it better than static watermarks, which are highly susceptible to semantics-preserving transformation attacks?

The Java bytecode version, implemented in Sandmark, has two ways to encode data: one based on loops, the other based on if conditionals using variables collected from a tracing phase. A loop includes a conditional which encodes 0 or 1 as a branch taken or not taken. The following example is taken from the original paper1:

int bits = 0xA; 
int counter = 5; 
int j = 0; 
for(int i=0; i<counter ; i++,bits >>=1)
    if ( (bits & 1) == 1) j++; 
if (PF) live_var += j;

The example encodes the bits 01012 by first reversing the bit string to obtain 10102 (A in hexadecimal). Note that the least-significant bit is 0. The first time around the loop the condition will fail. Subsequent times around the result is compared to the first result. So the 2nd time around succeed != fail gives 1, then fail == fail gives 0 then succeed != fail gives 1. This gives us the original number 01012. In order to prevent static analysis tools from detecting this code which isn't part of the original program an opaque-predicate may be used to make it look like the watermark code is 'connected' to the original code.

This is just a small example and a real watermark number would be bigger - Collberg et al. claim that this uses approximately 60 bytes of code for 64 bits of data.

The following is an extract (in Jimple) from a program watermarked using the Sandmark implementation of this algorithm:

$l16 = 4284109030105526342L;
$i1 = 62;
goto label2;

label0:
    $i17 = $i1 - 1;
    $l2 = $l16;
    $i20 = (int) $l16;
    $i21 = $i20 & 1;
    if $i21 == 0 goto label1;
         i0 = i0 + 1;

label1:
    $i1 = $i17;
    $l16 = $l2 >>> 1;

label2:
    if $i1 != 0 goto label0;

The code here works in the same way as the example from the paper: bits is $l16, counter is $i1 etc. Notice the huge constant that is assigned to $l16. I would say that this is unstealthy - most constants in real-world Java programs are small (93% are less than 1000) or very close to powers of two2.

When you take into account the redundancy features of the Sandmark implementation the amount of code inserted is a lot more than one loop. The watermark is split into 64-bit blocks using the Chinese Remainder Theorem which gives a good probability of recovering the watermark from a damaged bit-string. However, this also means that we require a larger amount of loops, large constants, variables and conditionals added to methods in the program. The bits need to be stored as a linear sequence of for-loops and/or conditionals - which is highly unstealthy

Although the watermark is encoded as the path taken during execution, the integer which generates this path sequence still has to be stored in the code. We assume that an attacker will not change the semantics of the program but if they made even a small change to this input number it will destroy the watermark. 

Another big problem with this algorithm is that it is not resilient to all semantics-preserving transformations - even though this is a dynamic watermarking algorithm. Applying an obfuscation which affects the branching behaviour of the program will destroy the watermark.

It is assumed that the attacker could not apply obfuscations uniformly throughout the program due to the effect on program performance. However, if the watermark is unstealthy (for example, because of lots of strange looking constants), the attacker could find the watermarked sections and obfuscate only those. Alternatively, the attacker could find the watermarked sections and remove them.

The Sandmark implementation uses a live integer variable, if one is available, to 'connect' the watermark code to the rest of the program (like the example). However, if there is no live integer the code is 'unconnected' - leaving the watermark susceptible to program analysis/slicing attacks. Further work could look into the choice of the live integer - for example, slicing obfuscations3 might help protect the watermarked program against slicing attacks as the slice sizes are greatly increased.

Sandmark uses the opaque predicate \(f(x) = ((x - 1) \times x)\mod 2) = 0\) where the value of \(x\) is not important because \(f(x)\) will always be 0. Again the choice of the variable \(x\) is important - a variable that can be confirmed constant by a constant propogation analysis will allow an attacker to easily execute the opaque predicate, as they will know the input value to the opaque predicate. An opaque predicate like this could also be attacked by abstract interpertation4 or even pattern matching (if you have a library of known opaque predicates).

So, the two big problems with this watermarking algorithm are: it is susceptible to branch-based obfuscation attacks, and it is unstealthy5. Note that this post only deals with the Java bytecode implementation of this algorithm, not native code which is less restrictive than bytecode.