Build A Compiler With The JEP 457 Class-File API

January 21, 2024 - 11:01

This is an updated version of the Java Advent 2023 article "My First Compiler" using the JEP 457 Class-File API instead of ProGuardCORE for class file generation.

As a Java developer you probably spend a lot of time writing Java source code and executing a Java compiler to convert that human readable Java source into machine readable bytecode stored in Java class files.

If you've ever wondered how a compiler works or how a Java compiler creates Java class files, then keep reading! We'll work through writing a compiler for a simple programming language that compiles to Java bytecode.

We'll implement a compiler for the esoteric programming language Brainf*ck, which is simple enough that it doesn’t require much code to create a working compiler.

What is Brainf*ck?

Brainf*ck is an esoteric language that consists of just 8 operators which operate on an array of memory cells, created by Urban Müller. Even though there are only 8 operators, the language is Turing complete and so, in theory, you can write any program you can think of; whether you’d want to is another question!

The 8 operators manipulate the values in the memory pointed at by a data pointer. In implementations of the language the memory size is typically 30,000 cells and the cells are typically 1 byte in size.

These are the 8 operators:

>	Moves the data pointer to the right
<	Moves the data pointer to the left
+	Increment the value in memory pointed to by the data pointer
-	Decrement the value in memory pointed to by the data pointer
[	Jump to the location after the corresponding ] if the current value is zero
]	Jump to the corresponding [ if the current value is not zero
,	Read 1 character (1 byte) from the standard input
.	Print 1 character to standard out

Any character not in this list is considered a comment. At the beginning of a brainf*ck program all the values are zero and the data pointer points to the left-most block:

[0][0][0][0][0]...[0]
 |
 DP

Given the following program:

+>++>++++-

The memory will end up looking like this after running the program:

[1][2][3][0][0]...[0]
       |
       DP

Compiling to Java Bytecode

We’ll write a Java program that will read Brainf*ck input and produce Java class files in a jar as output: then you’ll be able to execute the jar with the java command (java -jar output.jar).

Java class files contain Java bytecode: these are the instructions that are executed by a Java virtual machine. A Java virtual machine is a stack-based machine: many of the instructions deal with pushing and popping from the operand stack. For example, the instruction iconst_0 is used to load a constant 0 onto the stack and the pop instruction will pop the top value from the stack.

So, how do we create a class file? Technically, a class file is just a bunch of bytes so we could just start writing out a stream of bytes but that’ll get hard pretty quickly so we’d benefit from a higher-level API to help us out.

JEP 457: Class-File API

JEP 457 Class-File API is a Java 22 preview API that aims to provide a standard API for parsing, generating, and transforming Java class files.

Unlike libraries such as ProGuardCORE, ASM or ByteBuddy its scope is smaller: only parsing, generating, & transforming are in-scope while code analysis features are explicitly out of scope. The biggest benefit is that it will be a standard Java API that keeps in-sync with the Java class file specification and evolves in-sync with the compiler and virtual machine.

And the API is pretty nice to use! Unlike older libraries like ASM and ProGuardCORE (both 20+ years old), the new API avoids the use of the visitor pattern and instead uses newer Java idioms and features that weren't present when the older libraries were designed. Generating a class which prints HelloWorld requires just a handful of lines of Java code:

ClassFile.of().buildTo(Path.of("HelloWorld.class"), ClassDesc.of("HelloWorld"), classBuilder -> classBuilder 
   .withMethodBody("main", ofDescriptor("([Ljava/lang/String;)V"), ACC_PUBLIC | ACC_STATIC, codeBuilder -> codeBuilder 
     .getstatic(of("java.lang.System"), "out", of("java.io.PrintStream")) 
     .ldc("Hello World") 
     .invokevirtual(of("java.io.PrintStream"), "println", MethodTypeDesc.of(CD_void, of("java.lang.Object"))) 
     .return_()));

Since JEP 457 is currently a preview API, the --enable-preview flag needs to be provided to both the javac compiler and the java command:

$ javac --release 22 --enable-preview HelloWorldGenerator.java
$ java --enable-preview HelloWorld

A Brainf*ck compiler

We’ll start with some boilerplate code for setting up the input and output arguments, creating a class with a main method and writing the class to a jar.

We can use the ClassFile API to create a class, which provides a build method that takes a class name (in the form of a ClassDesc) and a function that provides a ClassBuilder which can be used to build the class. We can use the ClassBuilder to set properties of the class, and add or transform fields & methods.

We can set the version of the Java class with the withVersion builder method: we’ll set it to Java 1.6.

Adding a method using the ClassBuilder is easy with the withMethodBody builder method: we first need to provide the name (main) and the descriptor of the method (void String[] in the form of a MethodTypeDesc) and the access flags (public, static).

The final parameter of withMethodBody takes a function that allows building code with a CodeBuilder. The CodeBuilder provides an API to build the Java bytecode instructions that make up method bodies: this is how we’ll generate specific bytecodes needed to implement Brainf*ck logic. For now, we’ll just add a return instruction.

Finally, a JarOutputStream can be used to write the class file bytes generated by the build method to a jar file.

public static void main(String[] args) throws IOException {
   if (args.length != 2) throw new RuntimeException("Expected input and output arguments");

   var input = Files.readString(Path.of(args[0]));
   var output = args[1];

   var bytes = ClassFile.of()
       .build(ClassDesc.of("BF"), classBuilder -> classBuilder
           .withVersion(JAVA_6_VERSION, 0)
           .withMethodBody("main", MethodTypeDesc.of(CD_void, ClassDesc.of("java.lang.String").arrayType()), ACC_PUBLIC | ACC_STATIC, codeBuilder -> {

               // TODO: Generate code here

               // Generate the return instruction
               codeBuilder.return_();
           }));

   // Write the class file bytes into a jar file.
   var manifest = new Manifest();
   manifest.getMainAttributes().put(MANIFEST_VERSION, "1.0");
   manifest.getMainAttributes().put(MAIN_CLASS, "BF");
   try (var os = new JarOutputStream(new BufferedOutputStream(new FileOutputStream(output)), manifest)) {
       os.putNextEntry(new JarEntry("BF.class"));
       os.write(bytes);
       os.closeEntry();
   }
}

If you run this, it will generate a jar file containing a class called BF, which contains a main method that does nothing! Try it out and see nothing for yourself:

$ javac --release 22 --enable-preview BfCompiler.java input.bf output.jar
$ java -jar output.jar

You can though use the javap command to look at the generated bytecode where you’ll see the main method which contains the return instruction at offset 0.

$ javap -cp out.jar -c -v -p BF
public class BF
  minor version: 0
  major version: 50
  flags: (0x0001) ACC_PUBLIC
  this_class: #2                          // BF
  super_class: #4                         // java/lang/Object
  interfaces: 0, fields: 0, methods: 1, attributes: 0
Constant pool:
  #1 = Utf8               BF
  #2 = Class              #1              // BF
  #3 = Utf8               java/lang/Object
  #4 = Class              #3              // java/lang/Object
  #5 = Utf8               main
  #6 = Utf8               ([Ljava/lang/String;)V
  #7 = Utf8               Code
{
  public static void main(java.lang.String[]);
    descriptor: ([Ljava/lang/String;)V
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=0, locals=1, args_size=1
         0: return
}

Next we’ll need to fill in the main method with code generated based on the input Brainf*ck program.

Memory

A Brainf*ck program operates on a block of memory with a typical implementation size of 30,000. Some implementations automatically expand the memory size if it’s exceeded but for our implementation we’ll use a fixed-size byte array to represent the memory; and to keep it simple we won’t handle out of bounds access.

So the first code we’ll need to generate in our main method is to declare the array, which will require 3 instructions:

codeBuilder
   .sipush(30_000)
   .newarray(TypeKind.ByteType)
   .astore(MEMORY);

The sipush (short integer) instruction pushes the desired size of the array, 30000, onto the stack which the newarray instruction will pop from the stack. The operand for the newarray instruction is the type of array: in this case a byte array.

Then we will store the array that was just pushed onto the stack into a local variable slot with an astore instruction. The astore instruction pops a reference value from the stack and stores it in the specified local variable slot.

MEMORY is a static final field int that contains the local variable slot number, which you should now add to the compiler:

private static final int MEMORY = 1;

Data Pointer

We’ll also need to keep track of the data pointer value: we’ll use another local variable slot for that and generate some code to initialise the value to zero. The iconst_0 instruction pushes the integer 0 onto the stack and then the istore instruction will pop the value from the stack and store it in the specified local variable slot.

codeBuilder
   .iconst_0()
   .istore(DATA_POINTER);

DATA_POINTER is a static final int field that contains the local variable slot number, which you should now add to the compiler:

private static final int DATA_POINTER = 0;

Now we’ve initialised the memory and the data pointer; next we need to generate the code for each of the operators in the input.

Parsing

We don’t need a fancy or complicated parser since the language is so simple. We can simply iterate over the characters in the input string, generating the corresponding code for each operator and then continue to the next one:

input.chars().forEach(c -> {
   switch (c) {
       case '>' -> move(composer,       1); // Move right by 1
       case '<' -> move(composer,      -1); // Move left by 1
       case '+' -> increment(composer,  1); // Increment by 1
       case '-' -> increment(composer, -1); // Decrement by 1
       case ',' -> printChar(composer); 
       case '.' -> readChar(composer);
       case '[' -> loopBegin(composer);
       case ']' -> loopEnd(composer);
       default -> {
           // Ignore other characters.
       }
   }
});

This will call a corresponding method to generate the code for each operator and ignore any non-operator characters. We’ll take a look at each code generation method one by one.

The move operators < >

The < and > operators move the data pointer left or right by one: in our generated Java program this corresponds to incrementing or decrementing the data pointer by 1.

Out of all of the operators, this is the easiest to generate code for as there is a single Java bytecode instruction that does exactly what we need: iinc.

The iinc instruction takes as operands a local variable index and a signed byte value and when executed the local variable will be incremented by the value.

private static void move(CodeBuilder codeBuilder, int amount) {
  codeBuilder.iinc(DATA_POINTER, amount);
}

The increment / decrement operators + -

The increment and decrement operators increment or decrement the value in Brainf*ck memory by 1. In our generated Java bytecode this corresponds to incrementing or decrementing the value in the array at the data pointer index i.e. in Java source code it would look like MEMORY[DATA_POINTER]++.

The baload instruction is used to load a byte value from a byte array. It pops its two operands from the stack: the array reference and the array index. So to load a value from the memory array requires 3 instructions:

aload to push the memory array onto the stack
iload to push the data pointer array index onto the stack
baload to load the value from the memory array at the data pointer index

There is also a corresponding bastore instruction which pops the array reference and array index from the stack as well as the value to store into the array at the specified index.

Since we’re going to need the memory array and data pointer on the stack later for using with bastore we duplicate the top two entries on the stack, using dup2, before the baload instruction pops the original two:

codeBuilder
 .aload(MEMORY)
 .iload(DATA_POINTER)
 .dup2()
 .baload()

After executing these instructions the JVM stack will look something like this:

MEMORY[DATA_POINTER]

DATA_POINTER

MEMORY

We’ll then add 1 or -1 to the value on the top of the stack by pushing the amount value onto the stack with the constantInstruction pseudo-instruction that generates the correct instruction to push a constant onto the stack. For example, passing the integer 1 as a parameter would generate the iconst_1 instruction whereas passing the integer 30,000 would generate the sipush instruction.

We can then use the iadd instruction to add the values together. The iadd instruction will pop two integers from the stack, add them together and then push the result onto the stack.

codeBuilder
 .aload(MEMORY)
 .iload(DATA_POINTER)
 .dup2()
 .baload()
 .constantInstruction(amount)
 .iadd()

The JVM stack will then look something like this and we have all the operands in place to use the bastore instruction to store the incremented value back into the memory array. The bastore instruction will pop the memory array, data pointer array index and the value to store into the memory array.

MEMORY[DATA_POINTER] + (-)1

DATA_POINTER

MEMORY

The full increment code generation method looks like this:

private static void increment(CodeBuilder codeBuilder, int amount) {
 codeBuilder
   .aload(MEMORY)
   .iload(DATA_POINTER)
   .dup2()
   .baload()
   .constantInstruction(amount)
   .iadd()
   .bastore();
}

So far we can move the data pointer and modify the values in memory but we can’t yet observe what’s happening: next we’ll generate code for the print operator.

The print operator .

If you’re a Java or Kotlin developer, you’ll know that to print something to standard out you’ll use the System.out.print method (or println if you want a newline printed). We’ll generate the code to invoke the same print method to print a character from the Brainf*ck memory.

The code to print a character from memory will need the following instructions:

getstatic to push the System.out PrintStream object onto the stack
aload / iload / baload to load the byte from the Brainf*ck memory array
i2c to convert the byte to a char
invokevirtual to invoke the print method on the System.out object

The getstatic instruction takes the fully qualified class name, field name and descriptor as parameters and pushes the field value onto the stack:

codeBuilder
 .getstatic(ClassDesc.of("java.lang.System"), "out", ClassDesc.of("java.io.PrintStream"))

The code to load the byte from the Brainf*ck memory is familiar, since it’s the same 3 instructions as for the increment operator:

codeBuilder
 .getstatic(ClassDesc.of("java.lang.System"), "out", ClassDesc.of("java.io.PrintStream"))
 .aload(MEMORY)
 .iload(DATA_POINTER)
 .baload()

We’ll then add the i2c instruction to convert the byte to a char:

codeBuilder
 .getstatic(ClassDesc.of("java.lang.System"), "out", ClassDesc.of("java.io.PrintStream"))
 .aload(MEMORY)
 .iload(DATA_POINTER)
 .baload()
 .i2c()

After executing these instructions the JVM stack will look something like this:

(char)MEMORY[DATA_POINTER]

System.out

The stack is now set up for executing an invokevirtual instruction for the print method. The object on which to execute the method is popped from the stack after the method parameters, in this case the single character.

The invokevirtual instruction requires the fully qualified class name (java.io.PrintStream), the method name (print) and the method descriptor (void char).

The full print code generation method looks like this:

private static void printChar(CodeBuilder codeBuilder) {
  codeBuilder
   .getstatic(ClassDesc.of("java.lang.System"), "out", ClassDesc.of("java.io.PrintStream"))
   .aload(MEMORY)
   .iload(DATA_POINTER)
   .baload()
   .i2c()
   .invokevirtual(ClassDesc.of("java.io.PrintStream"), "print", MethodTypeDesc.of(CD_void, CD_char));
}

We’ll now be able to see the results of the increment operations on the memory by printing out values. If you run the compiler with the following input and execute the resulting jar, what do you see?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++.

The read operator ,

We can now print characters using System.out.println and next we’ll implement the read operator with System.in.

As with the print operator, we’ll start by pushing System.in onto the stack:

codeBuilder
  .getstatic(ClassDesc.of("java.lang.System"), "in", ClassDesc.of("java.io.InputStream"))

To read a byte from the input we can use the read(byte[] array, int offset, int length) method of the System.in input stream. This method takes as parameters a byte array, a destination offset within the array and the length of bytes to read from the stream which will then be stored in the array. The method returns the number of bytes that were actually read from the stream.

Our Brainf*ck memory is a byte array and the data pointer points to the position in the array where the read byte should be stored; so these can be directly used as parameters to the read method. The read operator only reads a single byte at a time so the length parameter to the read method is a constant integer 1.

This means we need to push the memory array, the data pointer and constant integer 1 onto the stack before the invokevirtual instruction to invoke the read method:

codeBuilder
  .getstatic(ClassDesc.of("java.lang.System"), "in", ClassDesc.of("java.io.InputStream"))
  .aload(MEMORY)
  .iload(DATA_POINTER)
  .iconst_1()
  .invokevirtual(ClassDesc.of("java.io.InputStream"), "read", MethodTypeDesc.of(CD_int, CD_byte.arrayType(), CD_int, CD_int))

Remember that the read method returns the number of bytes read? The invokevirtual instruction will push that return value onto the stack. We don’t care about it so we can discard it with a pop instruction.

The full read code generation method looks like this:

private static void readChar(CodeBuilder codeBuilder) {
  codeBuilder
   .getstatic(ClassDesc.of("java.lang.System"), "in", ClassDesc.of("java.io.InputStream"))
   .aload(MEMORY)
   .iload(DATA_POINTER)
   .iconst_1()
   .invokevirtual(ClassDesc.of("java.io.InputStream"), "read", MethodTypeDesc.of(CD_int, CD_byte.arrayType(), CD_int, CD_int))
   .pop();
}

Now try compiling and running this Brainf*ck program; then type a character, press enter and what do you see?

,+.

We’re almost done but we’re still missing the [ and ] operators.

The loop operators [ ]

So far we’ve implemented 6 out of the 8 operators and have been able to compile and execute some simple Brainf*ck programs. But the two remaining operators are required to allow us to run more complicated programs.

Let’s remind ourselves of what the remaining two operators do:

[	Jump to the location after the corresponding ] if the current value is zero
]	Jump to the corresponding [ if the current value is not zero

For our generated code, this means that the generated code for the [ operator will need to conditionally jump forward to a location in code that we have not yet generated for the ]; and the code for the ] operator will need to conditionally jump back to the location of the corresponding [ operator.

To implement this in Java bytecode we’ll use the ifeq and ifne instructions that jump to a specified offset if the value on the stack is 0 or not 0 respectively. When we’re parsing the [ operator to generate the ifeq we’ll need to already know the location to jump forward to even though we haven’t yet generated it.

The Class-File API solves this problem with labels: the CodeBuilder has a newLabel() method that will create a label that can be used as a parameter for the labelBinding pseudo-instruction like this:

var exampleLabel = codeBuilder.newLabel();
codeBuilder
    .iconst_0()
    .ifeq(exampleLabel)
    ...
    .labelBinding(exampleLabel)
    ....

We’ll need to keep track of the labels that we create because we’ll create the labels when generating code for the [ and then need to use the labels later when generating the code for the corresponding ]. We’ll do this with a stack of pairs of labels: one label for the loop body and another for the loop exit. We use a stack because loops can be nested and this allows us to match the corresponding [ and ].

The following record and field should now be added to our compiler:

private record LoopInfo(Label body, Label exit) {}
private static final Stack<LoopInfo> loops = new Stack<>();

For the beginLoop code generation method, we start by creating labels and pushing them onto the stack:

var loopInfo = loops.push(new LoopInfo(codeBuilder.newLabel(), codeBuilder.newLabel()));

Then we need to generate the code for the [ operator with the following instructions:

aload / iload / baload to load the byte from the Brainf*ck memory array
ifeq to conditionally jump to the label loopInfo.exit
labelBinding to mark the position in the code with the label loopInfo.body

The full beginLoop code generation method then looks like this:

private static void loopBegin(CodeBuilder codeBuilder) {
    var loopInfo = loops.push(new LoopInfo(codeBuilder.newLabel(), codeBuilder.newLabel()));

    codeBuilder
        .aload(MEMORY)
        .iload(DATA_POINTER)
        .baload()
        .ifeq(loopInfo.exit)   // jump to loop exit if zero
        .labelBinding(loopInfo.body); // mark the start of loop body
}

The code generation for the ] operator works in a similar way: it pops the labels from the loops stack, then generates the code for the ] operator with the following instructions:

aload / iload / baload to load the byte from the Brainf*ck memory array
ifne to conditionally jump to the label loopInfo.body
labelBinding to mark the position in the code with the label loopInfo.exit

We expect that the [ and ] operators are correctly balanced when popping from the array but in the case of a syntax error in the input Brainf*ck code this may not be the case. For example, the following input would throw an EmptyStackException if we try to pop labels from the loops stack:

++[--]]

We can provide a nicer error message by checking if the stack is empty before we try to pop the labels from the loops stack.

The full endLoop code generation method then looks like this:

private static void loopBegin(CodeBuilder codeBuilder) {
    var loopInfo = loops.push(new LoopInfo(codeBuilder.newLabel(), codeBuilder.newLabel()));

    codeBuilder
        .aload(MEMORY)
        .iload(DATA_POINTER)
        .baload()
        .ifeq(loopInfo.exit)   // jump to loop exit if zero
        .labelBinding(loopInfo.body); // mark the start of loop body
}

As a final improvement, we can also detect when there are too many [ operators compared with the number of ] operators and provide a nicer error message. If the loops stack is not empty at the end of our code generation we know that we’ve pushed more loop labels onto the stack than we have popped:

// Add just before codeBuilder.return_();
if (!loops.isEmpty()) throw new RuntimeException("Too many '['");

Now our compiler is ready for anything (almost)!

Finally, a working compiler

Congratulations! The compiler can now run almost any Brainf*ck program. Try the following, what does it do?

-[------->+<]>-.-[->+++++<]>++.+++++++..+++.[--->+<]>-----.+++++[->++<]>.>+[--->++<]>.---------.-[-->+<]>-----.[--->+<]>-.

The full compiler code can be found over on GitHub which you can easily build and run with a few commands, providing you have Java 22:

$ javac --release 22 --enable-preview src/BfCompiler.java -d build
$ java --enable-preview -cp build BfCompiler examples/hellojvm.bf build/out.jar
$ java -jar build/out.jar

Why Almost?

There are some limitations in our implementation that prevent us from running every possible Brainf*ck program. For one, the memory is limited to a fixed size of 30,000 cells: some Brainf*ck programs may require more than this.

There are also some JVM limitations: we currently generate code in a single method. A single method in the JVM is limited to 65,535 bytes which we could exceed. In fact, if you compile this Mandelbrot Brainf*ck program we already reach 47,078 bytes.

One improvement we can make to reduce code size could be to generate helper methods for each of our operations rather than generating all the code within a single method.

But a bigger improvement would be to optimise the generation of the code for the move (<>) and add (+-) operators.

Optimizations

Consider the following Brainf*ck program:

>>>>>

Our compiler will call the move code generation method 5 times, generating 5 iinc instructions; when instead we could generate a single iinc instruction with 5 as the increment amount.

The move and increment methods are already set up to pass in increment amounts other than 1 or -1: can you modify the compiler to optimise the generation of consecutive move and increment operations?

Here’s a nice blog post that describes some other optimizations; can you implement them in our compiler?

Next steps

We’ve only just scratched the surface of Java class files, Java bytecode and the new JEP457 Class-File API. The API also provides tools to read and transform class files.

The JEP457 Class-File API is putting a standardised Java class file manipulation API right into the Java standard library but it’s not the only tool that provides such functionality. The de facto standard right now is the ASM library; and other libraries such as ByteBuddy and ProGuardCORE provide more functionality (such as code analysis) beyond the scope of the JEP 457 Class-File API.

A previous version of this article was published using ProGuardCORE to generate the class file: the code building APIs are very similar; which is not too surprising because both APIs stick closely to the Java class file specification. For example, each library has a code builder (CodeBuilder vs CompactCodeAttributeComposer) interface that implements methods to generate instructions whose names are very familiar since they are the same as the instructions themselves.

A bigger difference between the new JEP457 Class-File API and the ProGuardCORE API is the use of more modern Java features instead of using a visitor API for traversing and transforming class files. Brian Goetz has a great talk about the design of the new API over on YouTube.