LLVM Hello World

September 22, 2024 - 16:20

Writing a “Hello World” program is often a rite of passage for a software engineer when learning a new language. In a previous post, I showed how to create a Hello World program for the JVM, by using tooling to write Java bytecode to generate a Java class file.

In this post, we’ll walk through creating a Hello World program in LLVM IR, which the clang compiler can then compile to native code. We’ll create the equivalent of the following C Hello World, but written in LLVM IR:

#include <stdio.h>

int main(void) {
    puts("Hello, World!");
    return 0;
}

By the end of this post, you’ll have taken your first steps into LLVM and LLVM IR: you’ll learn a bit about the structure of LLVM IR, a couple of LLVM IR instructions and how to execute the IR with an interpreter or compile it to an executable.

What is LLVM?

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies that began as a research project at the University of Illinois more than 20 years ago and has since grown to huge importance, underpinning many compilers and related tools, including clang, the C/C++/Objective-C compiler.

The LLVM project consists of multiple sub-projects. The most important for the purposes of this introduction are the LLVM core libraries, the clang compiler, the static compiler llc, and the lli interpreter/dynamic compiler.

The core libraries provide a source & target independent optimizer along with code generation for a multitude of target machines. They are built around the LLVM IR, an intermediate representation, which we’ll use to generate a Hello World program.

Installing LLVM

An automatic installation script is available for installing LLVM on Ubuntu/Debian systems. For example, to install LLVM 19:

$ wget https://apt.llvm.org/llvm.sh
$ chmod +x llvm.sh
$ sudo ./llvm.sh 19

LLVM IR

LLVM IR is a strongly typed, Static Single Assignment (SSA) based representation that abstracts away most details of the target while allowing high-level language constructs to be represented cleanly. There are three equivalent representations of the IR: an in-memory format, a bitcode format for serialisation, and a human-readable assembly language representation. We’ll use the last of these to write a Hello World program in textual form.

LLVM programs are composed of modules that consist of a list of global variables and functions (both of which are global values that are represented by a pointer to a memory location). Multiple modules can be linked with the LLVM linker, which merges function and global variable definitions, resolves forward declarations, and merges symbol table entries. 

For our simple program, we’ll use three types of entities:

  • A global variable containing the string “Hello, World!”
  • A declaration of the external libc puts function
  • A definition of the main function, which will contain an instruction to call puts

We’ll start with an empty file helloworld.ll to which we’ll add textual LLVM IR in the following steps: 

$ touch helloworld.ll

At this moment, if you try to execute this with the LLVM IR interpreter lli, it will complain that there is no main function defined (obviously, the file is empty!):

$ lli helloworld.ll
lli: Symbols not found: [ main ]

Let’s fix that by declaring a main function.

The main function

The entry point to our program is the function named main, which takes no parameters and returns an integer exit code, where zero denotes success and a non-zero value denotes failure.

Function definitions begin with the define keyword which is followed by the return type, the name and parameter list (and potentially other optional attributes and options but we don't care about those). Global identifiers, representing functions and global variables, start with the @ symbol (whereas local identifiers start with the % symbol). The LLVM IR return type for our main function is the i32 type which represents a 32-bit integer. 

The ret instruction takes an optional value that must be one of the first-class types; in this case we return 0 (a literal 32-bit integer).

Putting this together gives us the following:

define i32 @main() {
    ret i32 0
}

Now, though it doesn’t do much, the lli interpreter can execute this short program. You can confirm that it works by observing the exit code after running the interpreter:

$ lli helloworld.ll
$ echo $?
0

Try changing the main function’s return value to some other integer and checking the exit code again.
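If you experiment with larger or negative return values, the number that echo $? prints may not match what you wrote. The following is a minimal Python sketch of why, assuming a POSIX system where the shell only sees the low 8 bits of the exit status. The helper name shell_exit_status is hypothetical and only illustrates the arithmetic.

```python
def shell_exit_status(main_return_value: int) -> int:
    """Exit status a POSIX shell would report for a given return value of main."""
    # Assumption: POSIX semantics, where only the low 8 bits of the status survive.
    return main_return_value & 0xFF


# A few values you might try in `ret i32 <value>`.
for value in (0, 42, 256, -1):
    print(value, "->", shell_exit_status(value))
```

Under that assumption, ret i32 256 would show up as 0 and ret i32 -1 as 255.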

Global variables

A global variable, defined at the top-level in the LLVM IR, defines a region of memory allocated at compilation time rather than run-time. As with functions, global variable identifiers start with the @ character. Globals can be declared as constant if their values will never change and can be initialised at compile time.

In our case, we need to initialise a constant global with the string “Hello, World!”, which we can define as an array of characters terminated by the null character. The LLVM IR syntax for constant character arrays uses a double-quoted string with a c prefix, such as c"Hello, World!\00".

We’ve already seen that the syntax for a 32-bit integer type is i32; the syntax for array types requires an element type and a size: for example [ 10 x i32 ] is the type representing an array of 10 32-bit integer values.

Using this knowledge, we can define a constant global named @str with the array type [ 14 x i8 ] (14 1-byte characters), which is initialised with the constant character array c"Hello, World!\00" as follows:

@str = constant [14 x i8] c"Hello, World!\00"

define i32 @main() {
  ret i32 0
}
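The length of 14 counts the 13 visible characters plus the \00 terminator, and getting it wrong is an easy mistake when writing IR by hand. As a rough illustration, here is a small Python sketch that derives the array type and the c"..." initialiser from a string. The function name llvm_c_string is hypothetical, and escaping every non-printable byte as a two-digit hex \XX sequence is a conservative choice of mine rather than something LLVM requires.

```python
def llvm_c_string(text: str) -> tuple[str, str]:
    """Return (array type, initialiser) for a NUL-terminated LLVM IR string constant."""
    data = text.encode("utf-8") + b"\x00"  # the terminator counts towards the length
    escaped = []
    for byte in data:
        char = chr(byte)
        # Printable ASCII is written as-is, except the quote and the backslash,
        # which would otherwise end the literal or start an escape.
        if 0x20 <= byte <= 0x7E and char not in '"\\':
            escaped.append(char)
        else:
            escaped.append(f"\\{byte:02X}")  # \XX with two hex digits
    return f"[{len(data)} x i8]", 'c"' + "".join(escaped) + '"'


array_type, initialiser = llvm_c_string("Hello, World!")
print(f"@str = constant {array_type} {initialiser}")
```

For the string used in this post, the sketch prints the same global definition as the one written by hand above.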

We don’t yet use the @str constant so if you run the lli interpreter again there will be no difference in the output compared to previously. You can however see what happens to this string when compiled to assembly with llc:

$ llc helloworld.ll -o helloworld.s

We see at the end of the helloworld.s file (which contains x86_64 assembler source code) the global string named str in the rodata section, with the literal constant value of size 14, “Hello, World!”:

$ tail helloworld.s
.type str,@object                     # @str
.section .rodata,"a",@progbits
.globl str
str:
.asciz "Hello, World!"
.size str, 14

Note that the output of llc here matches the host machine by default because we haven’t specified a target triple directive in the LLVM IR source file. You can also override this with a specific target machine, for example to output ARM 64 for Linux:

$ llc helloworld.ll -o helloworld-aarch64.s -mtriple=aarch64-linux-gnu
# Assemble
$ aarch64-linux-gnu-as helloworld-aarch64.s -o helloworld-aarch64.o
# Link
$ aarch64-linux-gnu-gcc helloworld-aarch64.o -o helloworld-aarch64 -static
# Run with QEMU
$ qemu-aarch64 ./helloworld-aarch64
$ echo $?
0

Now we’ve added the string, we need to do something with it.

Calling functions

Before we can call a function, we need to either define or declare it. Since we want to use the existing function puts from the C standard library, we declare the function without a definition and let the linker resolve the definition.

The declaration for puts is shown below. It takes a single parameter of type ptr (pointer) and returns an i32 (a 32-bit integer):

declare i32 @puts(ptr)

We can then call the function with the LLVM IR call instruction with the Hello World string as the parameter (and we ignore the return value). This gives us the final Hello World program looking like this:

declare i32 @puts(ptr)

@str = constant [14 x i8] c"Hello, World!\00"

define i32 @main() {
  call i32 @puts(ptr @str)
  ret i32 0
}

If you now run the program with lli, it should print “Hello, World!”:

$ lli helloworld.ll
Hello, World!

Compiling Hello World

The clang compiler can be used to compile the LLVM IR program into an executable program:

$ clang helloworld.ll -o helloworld
$ ./helloworld
Hello, World!

Congratulations! You’ve taken your first steps into the world of LLVM by writing textual LLVM IR and compiling it with clang to a native executable.

Next steps

We’ve only just scratched the surface of LLVM and LLVM IR: the LLVM project is huge and there are many different aspects to learn.

You can continue your journey by learning more LLVM IR instructions to write more complicated programs. The clang compiler can compile C/C++ code to LLVM textual IR which is very useful for learning how high-level constructs are translated from C/C++ to LLVM IR; for example the following command will compile the helloworld.c program to textual LLVM IR:

$ clang helloworld.c -S -emit-llvm -o helloworld.ll

Compiler Explorer is also a great online tool that can do the same thing: you can paste in some code, compile it and see the LLVM IR & assembly output.

LLVM is written in C++ and has an extensive C++ API for generating and transforming code, which we haven’t covered here. I’ve created a simple C++ program that uses the IRBuilder to create a Hello World program, which you can use to get started.

You might want to follow the "My First Language Frontend with LLVM Tutorial" next to continue learning further. Or start with a simple language like Brainf*ck and continue from there. A nice thing about having a textual IR format is that you can use the IR as a compilation target simply by writing out text, without using the C++ API, in case you are writing a compiler frontend in another language.
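To make the idea of writing out text concrete, here is a minimal sketch in Python that emits the Hello World module from this post as a string. The function name hello_world_module is hypothetical, and the sketch assumes the message is plain printable ASCII so that no escaping is needed.

```python
def hello_world_module(message: str) -> str:
    """Build a textual LLVM IR module that prints `message` with puts."""
    # Assumption: printable ASCII without '"' or '\', so no escaping is needed.
    if not all(0x20 <= ord(c) <= 0x7E and c not in '"\\' for c in message):
        raise ValueError("this sketch only handles plain printable ASCII")
    length = len(message) + 1  # +1 for the \00 terminator
    return "\n".join([
        "declare i32 @puts(ptr)",
        "",
        f'@str = constant [{length} x i8] c"{message}\\00"',
        "",
        "define i32 @main() {",
        "  call i32 @puts(ptr @str)",
        "  ret i32 0",
        "}",
        "",
    ])


# Print the module so it can be redirected into a .ll file.
print(hello_world_module("Hello, World!"), end="")
```

Saved under a hypothetical name such as emit_hello.py, the output can be redirected to a .ll file and then run with lli or compiled with clang exactly like the hand-written version.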