The Java Virtual Machine

Readers already familiar with the Java Virtual Machine and the Java class file format may want to skip this section and proceed with section 3.

Programs written in the Java language are compiled into a portable binary format called byte code. Every class is represented by a single class file containing class-related data and byte code instructions. These files are loaded dynamically into an interpreter (the Java Virtual Machine, or JVM) and executed.

Figure 1 illustrates the procedure of compiling and executing a Java class: The source file (HelloWorld.java) is compiled into a Java class file (HelloWorld.class), which is loaded by the byte code interpreter and executed. In order to implement additional features, researchers may want to transform class files (drawn with bold lines) before they are actually executed. This application area is one of the main topics of this article.


Figure 1: Compilation and execution of Java classes

Note that the general term "Java" in fact has two meanings: on the one hand, Java as a programming language; on the other hand, the Java Virtual Machine, which is not targeted exclusively by the Java language but may be used by other languages as well. We assume the reader to be familiar with the Java language and to have a general understanding of the Virtual Machine.

Giving a full overview of the design issues of the Java class file format and the associated byte code instructions is beyond the scope of this paper. We will just give a brief introduction covering the details that are necessary for understanding the rest of this paper. The format of class files and the byte code instruction set are described in more detail in the Java Virtual Machine Specification. In particular, we will not deal with the security constraints that the Java Virtual Machine has to check at run-time, i.e., the byte code verifier.

Figure 2 shows a simplified example of the contents of a Java class file: It starts with a header containing a "magic number" (0xCAFEBABE) and the version number, followed by the constant pool, which can be roughly thought of as the text segment of an executable, the access rights of the class encoded by a bit mask, a list of interfaces implemented by the class, lists containing the fields and methods of the class, and finally the class attributes, e.g., the SourceFile attribute telling the name of the source file. Attributes are a way of putting additional, user-defined information into class file data structures. For example, a custom class loader may evaluate such attribute data in order to perform its transformations. The JVM specification declares that unknown, i.e., user-defined attributes must be ignored by any Virtual Machine implementation.
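The start of this layout can be checked with a few lines of Java. The following is a minimal sketch, not part of any library: the class name ClassHeader and the method readHeader are hypothetical, and the ten sample bytes in main are a hand-written prefix rather than a real class file. It assumes only the field order of the class file header (a 4-byte magic number, 2-byte minor and major version numbers, then the 2-byte constant pool count), all stored in big-endian byte order.

```java
import java.nio.ByteBuffer;

public class ClassHeader {
    /**
     * Decodes the first ten bytes of a class file.
     * Returns { magic, minorVersion, majorVersion, constantPoolCount }.
     */
    public static int[] readHeader(byte[] classFile) {
        ByteBuffer buf = ByteBuffer.wrap(classFile); // ByteBuffer is big-endian by default
        int magic = buf.getInt();                    // expected to be 0xCAFEBABE
        int minor = buf.getShort() & 0xFFFF;         // unsigned 16-bit value
        int major = buf.getShort() & 0xFFFF;
        int constantPoolCount = buf.getShort() & 0xFFFF;
        return new int[] { magic, minor, major, constantPoolCount };
    }

    public static void main(String[] args) {
        // Hand-written sample prefix (an assumption for illustration):
        // magic number, version 52.0, constant pool count 29.
        byte[] prefix = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE,
            0, 0, 0, 52, 0, 29
        };
        int[] h = readHeader(prefix);
        System.out.printf("magic=0x%08X version=%d.%d constant_pool_count=%d%n",
                          h[0], h[2], h[1], h[3]);
    }
}
```

To inspect a real class, the bytes of any compiled .class file could be passed to readHeader instead of the sample prefix.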


Figure 2: Java class file format

Because all of the information needed to dynamically resolve the symbolic references to classes, fields and methods at run-time is coded with string constants, the constant pool makes up the largest portion of an average class file, approximately 60%. This makes the constant pool an easy target for code manipulation. The byte code instructions themselves make up just 12%.

The right upper box shows a "zoomed" excerpt of the constant pool, while the rounded box below depicts some instructions that are contained within a method of the example class. These instructions represent the straightforward translation of the well-known statement:

System.out.println("Hello, world");

The first instruction loads the contents of the field out of class java.lang.System onto the operand stack. This is an instance of the class java.io.PrintStream. The ldc ("load constant") instruction pushes a reference to the string "Hello, world" onto the stack. The next instruction invokes the instance method println, which takes both values as parameters (instance methods always implicitly take an instance reference as their first argument).

Instructions, other data structures within the class file and constants themselves may refer to constants in the constant pool. Such references are implemented via fixed indexes encoded directly into the instructions. This is illustrated for some items of the figure emphasized with a surrounding box.

For example, the invokevirtual instruction refers to a MethodRef constant that contains information about the name of the called method, the signature (i.e., the encoded argument and return types), and to which class the method belongs. In fact, as emphasized by the boxed value, the MethodRef constant itself just refers to other entries holding the real data, e.g., it refers to a ConstantClass entry containing a symbolic reference to the class java.io.PrintStream. To keep the class file compact, such constants are typically shared by different instructions and other constant pool entries. Similarly, a field is represented by a Fieldref constant that includes information about the name, the type and the containing class of the field.

The constant pool basically holds the following types of constants: References to methods, fields and classes, strings, integers, floats, longs, and doubles.

The JVM is a stack-oriented interpreter that creates a local stack frame of fixed size for every method invocation. The size of the local stack has to be computed by the compiler. Values may also be stored intermediately in a frame area containing local variables, which can be used like a set of registers. These local variables are numbered from 0 to 65535, i.e., there is a maximum of 65536 local variables per method. The stack frames of caller and callee overlap, i.e., the caller pushes arguments onto the operand stack and the called method receives them in local variables.
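The interplay of operand stack and local variables can be illustrated with a toy interpreter. This is a minimal sketch under strong simplifications, not how a real JVM is implemented: the class TinyStackMachine is hypothetical, it handles only a handful of int instructions without branches, and it takes instructions as text of the form "mnemonic operand" (an encoding invented here for readability) rather than as real byte code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TinyStackMachine {
    /** Runs a straight-line instruction sequence on one frame and returns the ireturn value. */
    public static int run(String[] code, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>(); // the operand stack of this frame
        for (String insn : code) {
            String[] parts = insn.split(" ");
            switch (parts[0]) {
                case "iconst": // push a constant encoded in the instruction
                    stack.push(Integer.parseInt(parts[1]));
                    break;
                case "iload":  // push the value of a local variable
                    stack.push(locals[Integer.parseInt(parts[1])]);
                    break;
                case "istore": // pop the top of the stack into a local variable
                    locals[Integer.parseInt(parts[1])] = stack.pop();
                    break;
                case "iadd": { int b = stack.pop(), a = stack.pop(); stack.push(a + b); break; }
                case "isub": { int b = stack.pop(), a = stack.pop(); stack.push(a - b); break; }
                case "imul": { int b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case "ireturn": // the result is whatever is on top of the stack
                    return stack.pop();
                default:
                    throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        throw new IllegalStateException("code ended without ireturn");
    }

    public static void main(String[] args) {
        // Computes n * (n - 1) for n = 5, with n held in local variable 0.
        String[] code = { "iload 0", "iload 0", "iconst 1", "isub", "imul", "ireturn" };
        System.out.println(run(code, new int[] { 5 })); // prints 20
    }
}
```

Note that isub pops its right operand first: the value pushed last is subtracted from the value pushed before it.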

The byte code instruction set currently consists of 212 instructions; 44 opcodes are marked as reserved and may be used for future extensions or intermediate optimizations within the Virtual Machine. The instruction set can be roughly grouped as follows:

Stack operations: Constants can be pushed onto the stack either by loading them from the constant pool with the ldc instruction or with special "short-cut" instructions where the operand is encoded into the instructions, e.g., iconst_0 or bipush (push byte value).

Arithmetic operations: The instruction set of the Java Virtual Machine distinguishes its operand types by using different instructions to operate on values of a specific type. Arithmetic operations starting with i, for example, denote integer operations, e.g., iadd, which adds two integers and pushes the result back onto the stack. The Java types boolean, byte, short, and char are handled as integers by the JVM.

Control flow: There are branch instructions like goto, and if_icmpeq, which compares two integers for equality. There is also a jsr (jump to sub-routine) and ret pair of instructions that is used to implement the finally clause of try-catch blocks. Exceptions may be thrown with the athrow instruction. Branch targets are coded as offsets from the current byte code position, i.e., with an integer number.

Load and store operations: Local variables are accessed with instructions like iload and istore. There are also array operations like iastore, which stores an integer value into an array.

Field access: The value of an instance field may be retrieved with getfield and written with putfield. For static fields, there are getstatic and putstatic counterparts.

Method invocation: Static methods are called via invokestatic, while instance methods are bound virtually with the invokevirtual instruction. Super class methods and private methods are invoked with invokespecial. Interface methods are a special case and are invoked with invokeinterface.

Object allocation: Class instances are allocated with the new instruction, arrays of basic type like int[] with newarray, arrays of references like String[][] with anewarray or multianewarray.

Conversion and type checking: For stack operands of basic type there exist casting operations like f2i which converts a float value into an integer. The validity of a type cast may be checked with checkcast and the instanceof operator can be directly mapped to the equally named instruction.

Most instructions have a fixed length, but there are also some variable-length instructions, in particular the lookupswitch and tableswitch instructions, which are used to implement switch() statements. Since the number of case clauses may vary, these instructions contain a variable number of branch targets.
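How the length of these two instructions varies can be worked out by hand. The sketch below goes beyond the text above and relies on the layout given in the JVM specification: after the opcode, 0 to 3 padding bytes align the following operands to a multiple of four bytes from the start of the method; tableswitch then stores a default offset, a low and a high value, and one 4-byte offset per value in that range, while lookupswitch stores a default offset, a pair count, and one 8-byte match/offset pair per case. The class and method names are hypothetical.

```java
public class SwitchLength {
    /** Number of padding bytes after an opcode located at position pc. */
    public static int padding(int pc) {
        return (4 - ((pc + 1) % 4)) % 4;
    }

    /** Total length in bytes of a tableswitch at pc covering the case values low..high. */
    public static int tableswitchLength(int pc, int low, int high) {
        int entries = high - low + 1;
        // opcode + padding + default, low, high (4 bytes each) + one offset per entry
        return 1 + padding(pc) + 12 + 4 * entries;
    }

    /** Total length in bytes of a lookupswitch at pc with the given number of cases. */
    public static int lookupswitchLength(int pc, int pairs) {
        // opcode + padding + default, npairs (4 bytes each) + one match/offset pair per case
        return 1 + padding(pc) + 8 + 8 * pairs;
    }

    public static void main(String[] args) {
        System.out.println(tableswitchLength(0, 0, 2));  // prints 28
        System.out.println(lookupswitchLength(3, 2));    // prints 25
    }
}
```

The dependence on pc is the notable point: moving a switch instruction by a single byte can change its length.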

We will not list all byte code instructions here, since these are explained in detail in the JVM specification. The opcode names are mostly self-explanatory, so understanding the following code examples should be fairly intuitive.

Non-abstract (and non-native) methods contain an attribute "Code" that holds the following data: The maximum size of the method's stack frame, the number of local variables and an array of byte code instructions. Optionally, it may also contain information about the names of local variables and source file line numbers that can be used by a debugger.

Whenever an exception is raised during execution, the JVM performs exception handling by looking into a table of exception handlers. The table marks handlers, i.e., code chunks, to be responsible for exceptions of certain types that are raised within a given area of the byte code. When there is no appropriate handler the exception is propagated back to the caller of the method. The handler information is itself stored in an attribute contained within the Code attribute.

Targets of branch instructions like goto are encoded as relative offsets in the array of byte codes. Exception handlers and local variables refer to absolute addresses within the byte code. The former contain references to the start and the end of the try block, and to the first instruction of the handler code. The latter mark the range in which a local variable is valid, i.e., its scope. This makes it difficult to insert or delete code areas on this level of abstraction, since one has to recompute the offsets every time and update the referring objects. We will see in section 3.3 how BCEL remedies this restriction.
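The bookkeeping this requires can be sketched in a few lines. The class BranchOffsets and its methods are hypothetical helpers, not BCEL API, and the sketch makes one assumption explicit: new bytes are inserted in front of the instruction that used to sit at insertAt, so that instruction and everything after it moves, and a branch that targeted it keeps pointing at the moved instruction.

```java
public class BranchOffsets {
    /** Absolute target of a branch located at branchPc with the given relative offset. */
    public static int absoluteTarget(int branchPc, int relativeOffset) {
        return branchPc + relativeOffset;
    }

    /** Relative offset the same branch needs after `inserted` bytes are added at insertAt. */
    public static int offsetAfterInsert(int branchPc, int relativeOffset,
                                        int insertAt, int inserted) {
        int target = absoluteTarget(branchPc, relativeOffset);
        // Every position at or behind the insertion point is shifted.
        int newBranchPc = branchPc >= insertAt ? branchPc + inserted : branchPc;
        int newTarget = target >= insertAt ? target + inserted : target;
        return newTarget - newBranchPc;
    }

    public static void main(String[] args) {
        // A branch at offset 1 that jumps to offset 8 has the relative offset 7.
        System.out.println(absoluteTarget(1, 7));           // prints 8
        // Inserting 3 bytes at offset 4 lies between branch and target: offset grows to 10.
        System.out.println(offsetAfterInsert(1, 7, 4, 3));  // prints 10
        // A branch at 5 targeting 16 moves together with its target: offset stays 11.
        System.out.println(offsetAfterInsert(5, 11, 4, 3)); // prints 11
    }
}
```

Absolute addresses, as used by exception handlers and local variable ranges, need the same shift applied to each stored position.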

Java is a type-safe language and the information about the types of fields, local variables, and methods is stored in so-called signatures. These are strings stored in the constant pool and encoded in a special format. For example, the argument and return types of the main method

public static void main(String[] argv)

are represented by the signature

([Ljava/lang/String;)V

Classes are internally represented by strings like "java/lang/String", basic types like float by an integer number. Within signatures, basic types are represented by single characters, e.g., I for int, and class types are written as L followed by the class name and a semicolon, e.g., Ljava/lang/String;. Arrays are denoted with a [ in front of the element type.
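The encoding can be reproduced with the standard library. The following sketch uses java.lang.invoke.MethodType, whose toMethodDescriptorString() method returns the byte code descriptor of a method type; the wrapper class Signatures and its descriptor method are hypothetical names introduced only for this illustration.

```java
import java.lang.invoke.MethodType;

public class Signatures {
    /** Returns the byte code signature for the given return and parameter types. */
    public static String descriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // public static void main(String[] argv)
        System.out.println(descriptor(void.class, String[].class)); // ([Ljava/lang/String;)V
        // public static int fac(int n)
        System.out.println(descriptor(int.class, int.class));       // (I)I
    }
}
```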

The following example program prompts for a number and prints its factorial. The readLine() method, which reads from the standard input, may raise an IOException, and parseInt() throws a NumberFormatException if it is passed a malformed number. Thus, the critical area of code must be enclosed in a try-catch block.

import java.io.*;

public class Factorial {
  private static BufferedReader in =
    new BufferedReader(new InputStreamReader(System.in));

  public static int fac(int n) {
    return (n == 0) ? 1 : n * fac(n - 1);
  }

  public static int readInt() {
    int n = 4711;
    try {
      System.out.print("Please enter a number> ");
      n = Integer.parseInt(in.readLine());
    } catch (IOException e1) {
      System.err.println(e1);
    } catch (NumberFormatException e2) {
      System.err.println(e2);
    }
    return n;
  }

  public static void main(String[] argv) {
    int n = readInt();
    System.out.println("Factorial of " + n + " is " + fac(n));
  }
}

This code example typically compiles to the following chunks of byte code:

0:  iload_0
1:  ifne            #8
4:  iconst_1
5:  goto            #16
8:  iload_0
9:  iload_0
10: iconst_1
11: isub
12: invokestatic    Factorial.fac (I)I (12)
15: imul
16: ireturn

LocalVariable(start_pc = 0, length = 16, index = 0:int n)

fac(): The method fac has only one local variable, the argument n, stored at index 0. This variable's scope ranges from the start of the byte code sequence to the very end. If the value of n (the value fetched with iload_0) is not equal to 0, the ifne instruction branches to the byte code at offset 8, otherwise a 1 is pushed onto the operand stack and the control flow branches to the final return. For ease of reading, the offsets of the branch instructions, which are actually relative, are displayed as absolute addresses in these examples.

If recursion has to continue, the arguments for the multiplication (n and fac(n - 1)) are evaluated and the results pushed onto the operand stack. After the multiplication operation has been performed the function returns the computed value from the top of the stack.

0:  sipush          4711
3:  istore_0
4:  getstatic       java.lang.System.out Ljava/io/PrintStream;
7:  ldc             "Please enter a number> "
9:  invokevirtual   java.io.PrintStream.print (Ljava/lang/String;)V
12: getstatic       Factorial.in Ljava/io/BufferedReader;
15: invokevirtual   java.io.BufferedReader.readLine ()Ljava/lang/String;
18: invokestatic    java.lang.Integer.parseInt (Ljava/lang/String;)I
21: istore_0
22: goto            #44
25: astore_1
26: getstatic       java.lang.System.err Ljava/io/PrintStream;
29: aload_1
30: invokevirtual   java.io.PrintStream.println (Ljava/lang/Object;)V
33: goto            #44
36: astore_1
37: getstatic       java.lang.System.err Ljava/io/PrintStream;
40: aload_1
41: invokevirtual   java.io.PrintStream.println (Ljava/lang/Object;)V
44: iload_0
45: ireturn

Exception handler(s) =
From    To      Handler Type
4       22      25      java.io.IOException(6)
4       22      36      NumberFormatException(10)

readInt(): First the local variable n (at index 0) is initialized to the value 4711. The next instruction, getstatic, loads the reference held by the static System.out field onto the stack. Then a string is loaded and printed, and a number is read from the standard input and assigned to n.

If one of the called methods (readLine() and parseInt()) throws an exception, the Java Virtual Machine calls one of the declared exception handlers, depending on the type of the exception. The try clause itself does not produce any code; it merely defines the range in which the subsequent handlers are active. In the example, the specified source code area maps to a byte code area ranging from offset 4 (inclusive) to 22 (exclusive). If no exception has occurred ("normal" execution flow), the goto instructions branch past the handler code. There the value of n is loaded and returned.

The handler for java.io.IOException starts at offset 25. It simply prints the error and branches back to the normal execution flow, i.e., as if no exception had occurred.
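The handler selection described above can be sketched as a simple table search. The class HandlerLookup is a hypothetical illustration, not JVM or BCEL code; its table holds the two entries listed for readInt(), and the search returns the first entry whose range covers the offset of the throwing instruction and whose type matches the thrown exception. A result of -1 stands for "no handler here, propagate to the caller".

```java
import java.io.IOException;

public class HandlerLookup {
    /** One row of an exception table: [startPc, endPc) is guarded by the handler at handlerPc. */
    public static final class Entry {
        final int startPc, endPc, handlerPc;
        final Class<? extends Throwable> type;

        public Entry(int startPc, int endPc, int handlerPc, Class<? extends Throwable> type) {
            this.startPc = startPc;
            this.endPc = endPc;
            this.handlerPc = handlerPc;
            this.type = type;
        }
    }

    /** The exception table of readInt() as listed above. */
    public static final Entry[] READ_INT_TABLE = {
        new Entry(4, 22, 25, IOException.class),
        new Entry(4, 22, 36, NumberFormatException.class)
    };

    /** Returns the offset of the first matching handler, or -1 if none applies. */
    public static int findHandler(Entry[] table, int pc, Throwable thrown) {
        for (Entry e : table) {
            boolean inRange = e.startPc <= pc && pc < e.endPc; // end offset is exclusive
            if (inRange && e.type.isInstance(thrown)) {
                return e.handlerPc;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // parseInt() is invoked at offset 18, readLine() at offset 15.
        System.out.println(findHandler(READ_INT_TABLE, 18, new NumberFormatException())); // prints 36
        System.out.println(findHandler(READ_INT_TABLE, 15, new IOException()));           // prints 25
        // Offset 30 lies inside a handler, outside the guarded range.
        System.out.println(findHandler(READ_INT_TABLE, 30, new IOException()));           // prints -1
    }
}
```

Using isInstance means that an entry also catches subclasses of its declared type, which matches the behavior of catch clauses in the Java language.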