What is software watermarking?

July 08, 2010 - 08:50

Software theft, also known as software piracy, is the act of copying a legitimate application and illegally distributing that software, either free or for profit. Legal methods to protect software producers such as copyright laws, patents and license agreements1 do not always dissuade people from stealing software, especially in emerging markets where the price of software is high and incomes are low. Ethical arguments, such as fair compensation for producers, by software manufacturers, law enforcement agencies and industry lobbyists also do little to counter software piracy. The global revenue loss due to software piracy was estimated to be more than $50 billion in 20082.

Software watermarking involves embedding a unique identifier within a piece of software, to discourage software theft. Watermarking does not prevent theft but instead discourages software thieves by providing a means to identify the owner of a piece of software and/or the origin of the stolen software3. It can then be extracted by an extractor or verified by a recogniser to prove ownership of software. The former extracts the original watermark, while the latter merely confirms the presence of a watermark4. A watermark recogntion or extraction algorithm may also been classified as blind, where the original program and watermark is unavailable, or informed, where the original program and/or watermark is available5.

It is also possible to embed a unique customer identifier in each copy of the software distributed which allows the software company to identify the individual that pirated the software. It is necessary that the watermark is hidden so that it cannot be detected and removed.

In most cases the watermark should be robust - that is, resilient to semantics preserving transformations (such as optimisations or obfuscations). However, in some cases it is desirable that a watermark is fragile in the sense that if semantics preserving transformations are performed on the software the watermark becomes invalid. This is useful in the context of software licensing where any changes to a program could disable it.

Watermarking techniques are used extensively in the entertainment industry to identify multimedia files such as audio and video files, and the concept has extended into the software industry. Watermarking does not aim to make a program hard to steal or indecipherable like obfuscation but it discourages theft as thieves know that they could be identified.

Most literature discusses techniques and problems of automatically watermarking software and there is only a small amount of literature which compares automatic watermarking with manual watermarking6. A manual watermark is inserted by the programmer of the application, rather than a using a third-party automatic tool. Some semi-automatic watermarking systems also exist (e.g. The Collberg-Thomborson algorithm implemented in Sandmark7) - where a programmer inserts markers into a program during development and the finished software is then augmented by a software watermarking tool.

Watermarks can be classified as either visible, where the recongiser is public knowledge or invisible where the recongniser, or some component (such as an encryption key), is not public knowledge. Visible watermarks can act as a deterrent but also show an adversary the location of a watermark making the task of removing the watermark easier.

Difficulties of Software Watermarking

Software watermarks present several implementation problems and many of the current watermarking algorithms are vulnerable to attack. Watermarked software must meet the following conditions:

program size must not be increased significantly.
program efficiency must not be decreased significantly.
robust watermarks must be resilient to semantics preserving transformations (fragile watermarks, by definition, should not be).
watermarks must be sufficiently well hidden, to avoid removal.
watermarks must be easy for the software owner to extract.

Perhaps the most difficult problem to solve is keeping the watermark hidden from attackers while, at the same time, allowing the software owner to efficiently extract the watermark when needed. If the watermark is too easy to extract then an attacker would be able to extract the watermark too. If a watermark is too well hidden then the software owner may not be able to find the watermark, in order to extract it. Some watermark tools (such as Sandmark7) use markers to designate the position of the stored watermark - this is problematic as it poses a risk of exposing the watermark to an adversary.

Watermarks should be resilient to semantics preserving transformations and ideally it should be possible to recognise a watermark from a partial program. Semantics preserving transformations, by definition, result in programs which are syntactically different from the original, but whose behaviour is the same. The attacker can attempt, by performing such transformations, to produce a semantically equivalent program with the watermark removed. Redundancy and recognition with a probability threshold may help with these problems8.

The watermark code must be locally indistinguishable from the rest of the program so that it is hidden from adversaries9. For example, imagine a watermark which consists of a dummy method with 100 variables - this kind of method will probably stand out in a simple analysis of the software (such as using software metrics techniques10 11). It could be difficult to programatically generate code which is indecipherable from the human-generated program code but statisical analysis of the original program could help in generating suitable watermarks8.

Ideally, software watermarks should be resilient to decompilation-recompilation attacks, as decompilation of Java is possible (though not perfect12).

Software watermarks must be efficient in several ways:

cost of embedding time.
cost of runtime.
cost of recognition time.

The cost of embedding a software watermark can be divided into two areas: developer time and embedding cost. The former simply quantifies the time that a developer spends embedding a watermark, while the latter quantifies the execution time of a software watermarking tool. Embedding costs are not a significant problem except in certain cases such as live multimedia streaming.

Developer time is important in use of software watermarks as the developer should not have to spend a large amount of time preparing a software watermark. The complexity of a software watermark is proportional to the resilience of the watermark - that is, the greater amount of time a developer spends embedding a watermark the harder it may be for an adversary to crack. For example, a developer could spend days introducing a subtle semantic property into the program which is unique to the software and very hard to discover. Consider a program which when a certain key combination is entered one pixel in the program's interface changes to a certain colour - this would be hard for an attacker to detect but may take a developer time to implement.

Now consider an automatic watermarker which programatically generates dummy code and injects this into random places within the program - such code will be easier to discover than code that was generated by a developer and carefully inserted.

In the middle of the scale is a semi-automatic watermark which involves a developer preparing a program before a watermarking tool embeds the watermark. The preparations could include inserting markers where watermark code should be inserted, or creating dummy methods which watermarks could use. Monden et al.13 describe a watermarking algorithm which requires the production of a dummy method in a program for the watermark to be stored. A programmer must create this dummy method manually and then execute watermarking software to embed the watermark.

The cost of runtime depends on the effect that the transformations applied by the watermark have had on the size and execution time. For example, Hattanda et al.14 found that the size of a program, watermarked with Davidson/Myhrvold15 algorithm, increased by up to 24% and the performance decreased by up to 24%.

Dummy methods, which are not executed, will have minimal effect on runtime cost but dynamic watermarks may have a high runtime cost as the watermark is built during program execution. The fidelity of watermark, `the extent to which embedding the watermark detoriates the original content' 6, should also be taken into account for the effects caused by watermarking, for example embedding a watermark may introduce unintentional errors.

The ideal recognition time of a watermark will most likely be quick but in some cases it may be important to artificially slow watermark recognition time to prevent oracle attacks6. Such attacks rely on the repetitive execution of a recongiser thus fast recognition time helps an adversary.

Types of Watermark

Nagra et al. define four types of watermark6:

Authorship Mark identifying a software author, or authors. These watermarks are generally visible and robust.
Fingerprinting Mark identifying the channel of distribution, i.e. the person who leaked the software. The watermarks are generally invisible, robust and consist of a unique identifier such as a customer reference number.
Validation Mark to verify that software is genuine and unchanged, for example like digitally signed Java Applets. These watermarks must be visible to the end-user to allow validation and fragile to ensure the software is not tampered with.
Licensing Mark used to authenticate software against a license key. The key should become ineffective if the watermark is damaged therefore licensing marks should be fragile.

Watermarking Techniques

Software watermarks can be broadly divided into two categories: static and dynamic16. The former embeds the watermark in the data and/or code of the program, while the latter embeds the watermark in a data structure built at runtime.

Static Watermarks

Static watermarks are embedded in the code and/or data of a computer program, a trivial example would be embedding a copyright notice in a string. Java class-files could contain software watermarks within their method bodies or their constant pool. The problem with storing a watermark as a string in a computer program is that unused variables could be easily removed with a simple dead-code analysis, and method or variable names are either lost during compilation or obfuscation. Other static watermarking techniques may involve transformations such as re-arranging or replacing instructions or basic blocks.

Some of the first static software watermarking techniques15 17 were describe in patents before academic researchers became interested in the area.

Dynamic Watermarking

Dynamic watermarking techniques store a watermark in a program's execution state, rather than in the program code itself18, therefore they should be resilient to semantics preserving transformations.

"Easter eggs"16 are a type of dynamic watermark and, while they provide little protection, they demonstrate the idea of dynamic watermarking. An easter egg is hidden, by a programmer, within a program which usually performs some immediately perceptible action, such as displaying a message or exposing a hidden game. Easter eggs are usually easy to find and there are websites19 dedicated to sharing easter eggs, in software, video games, music, TV shows and movies. Because easter eggs are usually easy to find and shared easily once found they are impractical to use for identifying software. Once an easter egg has been found standard program analysis and transformation techniques such as slicing20 can be used to remove the watermark.

Program Transformation Attacks

Program transformation attacks on watermarked software can be divided into three categories:

Additive

An additive attack involves inserting another watermark into an already watermarked application, thus over-writing the original watermark. This attack will usually work if a watermark of the same type is embedded but not necessarily if a different type of watermark is embedded3.

Subtractive

A subtractive attack involves removing the section, or sections, of code where the watermark is stored while leaving behind a working program. This could be achieved by dead code elmination, statistical analysis or program slicing.

Distortive

Distortive attacks involve applying semantics preserving transformations to a program, such as obfuscations or optimisations thus removing any watermarks which rely on program syntax. For example, renaming variables, loop transformations, function inlining, etc.

Both static and dynamic watermarks can be susceptible to program transformation attacks. Myles et al.3 conducted an evaluation of dynamic and static versions of the Arboit algorithm by watermarking and obfuscating test files. They found the dynamic version to be only minimally stronger than the static version, and both versions could be defeated by distortive attacks.

Extracts from An Evaluation of Static Java Bytecode Watermarks, accepted for publication at the International Conference on Computer Science and Applications 2010, part of The World Congress on Engineering and Computer Science 2010 (WCECS2010).

1. Cronin, G., A Taxonomy of Methods for Software Piracy Prevention, 2002.
2. Alliance, B. Software, The Economic Benefits of Reducing PC Software Piracy, 2008.
3. a. b. c. Myles, G., "Using Software Watermarking to Discourage Piracy", Crossroads - The ACM Student Magazine, 2004.
4. Zhu, W.F., Concepts and techniques in software watermarking and obfuscation: The University of Auckland, New Zealand, 2007.
5. Zhu, W., and C. Thomborson, Recognition in software watermarking, Santa Barbara, California, USA, ACM, 2006.
6. a. b. c. d. Nagra, J., C. Thomborson, and C. Collberg, "Software watermarking: Protective terminology", Proceedings of the ACSC: Citeseer, 2002.
7. a. b. Collberg, C., G. Myles, and A. Huntwork, "Sandmark - A Tool for Software Protection Research", IEEE Security and Privacy, vol. 1, pp. 40–49, 2003.
8. a. b. Mishra, A., R. Kumar, and P. P. Chakrabarti, A Method-based Whole-Program Watermarking Scheme For Java Class Files, 2008.
9. Venkatesan, R., V. Vazirani, and S. Sinha, "A graph theoretic approach to software watermarking", Information Hiding: Springer, 2001.
10. Halstead, M. H., Elements of software science (Operating and programming systems series): Elsevier, 1977.
11. Kearney, J. K., R. L. Sedlmeyer, W. B. Thompson, M. A. Gray, and M. A. Adler, "Software complexity measurement", Commun. ACM, vol. 29, pp. 1044–1050, 1986.
12. Hamilton, J., and S. Danicic, "An evaluation of current java bytecode decompilers", 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, Edmonton, Alberta, Canada, IEEE, 2009.
13. Monden, A., H. Iida, K-ichi. Matsumoto, K. Torii, and K. Inoue, "A Practical Method for Watermarking Java Programs", Computer Software and Applications Conference, 2000. COMPSAC 2000. The 24th Annual International, Washington, DC, USA, IEEE Computer Society, 2000.
14. HATTANDA, K., and S. ICHIKAWA, "The Evaluation of Davidson’s Digital Signature Scheme", IEICE TRANS. FUNDAMENTALS, vol. E87–A, 2004.
15. a. b. Davidson, R.I., and N. Myhrvold, Method and system for generating and auditing a signature for a computer program, 1996.
16. a. b. Collberg, C., "Software watermarking: Models and dynamic embeddings", Proceedings of the 26th ACM SIGPLAN-SIGACT, 1999.
17. Samson, P.R., Apparatus and method for serializing and validating copies of computer software, 1994.
18. Collberg, C., C. Thomborson, and D. Low, "On the limits of software watermarking", Proc. ACM Symp. on Principles of Programming Languages, 1999.
19. , The Easter Egg Archive, 2009.
20. Weiser, M., "Program slicing", Proceedings of the 5th international conference on Software engineering, Piscataway, NJ, USA, IEEE Press, 1981.