Oct 1 ’09
Optimizing Your Code Using the GNU Compiler Collection
The June/July 2009 z/Journal article “Compiling and Debugging Applications on Linux for System z” examined how to compile source code on Linux for System z and how the different tools interact. This article focuses on some advanced optimization issues and security checks and provides insight on how the compiler works internally.
The previous article mentioned the two most important optimization options specific to System z, namely specifying the target processor using -march and -mtune. There are, however, many additional optimization options, some of which are valuable for fine-tuning application performance.
The user manual for GNU Compiler Collection (GCC) 4.4.0 (see the GCC home page at www.gnu.org/software/gcc/gcc.html) mentions at least 160 code optimization switches that can be individually turned on and off. Users also can specify more than 100 different values controlling the amount of, or determining the context for, applying a certain optimization.
Handling all these options individually wouldn’t be practical for normal application development. So, GCC provides summary optimization options named -O0, -O1, -O2, -O3, and -Os. Option -O0 turns off almost all optimization and is intended for debugging purposes. Option -Os instructs GCC to minimize the size of the generated executable, even if this increases the program’s run-time. In contrast, options -O1, -O2, and -O3 focus on making the executed code fast and activate an increasing number of individual optimizations that have proved generally advantageous.
See the GCC manual for details on which individual optimization techniques are activated by the summary options. These sets of individual optimization switches may, however, vary in future versions of GCC, as new optimizations are added or later processors have different optimization needs. The following command line causes GCC to report which individual settings are activated when applying the summary option -O<level>:
gcc -Q -O<level> --help=optimizers
Figure 1 shows the effect the different options have on run-time performance. These times were measured using the SPEC performance benchmark. (The setup is described in a paper by D. Edelsohn et al., titled “Contributions to the GNU Compiler Collection,” available at www.research.ibm.com/journal/sj/442/edelsohn.pdf.)
Turning on at least the first level of optimization with option -O1 greatly reduces program execution time compared to the unoptimized case (option -O0). Enabling further optimization steps by specifying options -O2 or -O3, respectively, has only a small additional effect on program execution time. Optimizing for small executables (-Os) slightly increases execution time compared to the most aggressive optimization, -O3. However, the effect of individual optimization steps can strongly depend on the nature of the code being compiled. If performance is an important issue, consider fine-tuning by adjusting optimization options. The major optimization options are part of the regression test that every compiler change must pass.
A good optimization example is a technique named constant folding, which causes operations applied to constant values to be evaluated at compile time. A related technique is constant propagation. If the compiler learns that a variable has a constant value at least during certain parts of program execution, it forwards this information through the program, often enabling further applications of constant folding and constant propagation.
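Both techniques can be illustrated with a small C sketch (the function names are invented for this example):

```c
/* Constant folding: the compiler evaluates 60 * 60 * 24 at compile
   time, so no multiplication happens at run-time. */
int seconds_per_day(void) {
    return 60 * 60 * 24;
}

/* Constant propagation: once the compiler learns that scale always
   holds 10 here, it can substitute the constant and then fold the
   whole expression down to the value 35. */
int scaled(void) {
    int scale = 10;          /* known to be constant */
    return scale * 3 + 5;    /* foldable to 35 after propagation */
}
```

Compiling with -O1 or higher, both functions typically reduce to a single load of the precomputed constant.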
Be aware, however, that many optimization techniques suffer from side effects. Depending on the context, the effect of an “optimization” may turn into the opposite and make things worse.
Several optimization techniques involve trade-offs between conflicting goals. For example, a technique called common subexpression elimination prevents code from doing redundant calculations of the same expression that repeatedly occurs at several places in a program. The value needs to be saved somewhere and either occupies a register that’s no longer available for other purposes or requires an additional memory access.
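A minimal sketch of such a redundant calculation (an invented example, not taken from any real code base):

```c
/* Without common subexpression elimination, a * b would be computed
   twice; with it, the compiler computes the product once, keeps the
   value in a register (or a stack slot), and reuses it. */
int cse_example(int a, int b, int c) {
    int x = a * b + c;
    int y = a * b - c;   /* a * b is the common subexpression */
    return x + y;        /* equals 2 * a * b */
}
```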
You also can fine-tune application performance with loop unrolling. Imagine the program has a counting loop. Counting occurs using some variable, say i, which is incremented after every iteration and tested to determine whether you need another iteration. The loop body somehow reads the value of i. Unrolling such a loop means replacing the loop by several instances of the loop body so there’s one instance for every loop iteration. In the duplicated loop bodies, all references to variable i are replaced by the value that i would have during that particular iteration. Basically, there are three sources for possible performance improvements:
• The overhead of incrementing and testing variable i is removed.
• The register needed to hold i is now free.
• In the loop body instances, the references to a variable are replaced by constant values that may enable constant folding and constant propagation.
On the other hand, the code gets larger. Instead of just one instance of the loop body, we now have many. Depending on the machine’s cache structure and cache load situation, this increased code size may hamper performance. Longer code also may require using less efficient branch instructions. A compromise is partial unrolling, where the loop is retained but several iterations’ worth of the loop body are duplicated inside it, saving at least part of the loop overhead while limiting code growth. Unfortunately, this technique may suffer from more complex loop exit conditions.
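Written out by hand, partial unrolling by a factor of four looks roughly like the following sketch (a simplified picture of what GCC's unroller produces internally):

```c
/* Original counting loop: one increment and one test per element. */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* The same loop partially unrolled by a factor of four: four copies of
   the loop body per iteration, plus a remainder loop for the leftover
   elements -- the more complex exit condition mentioned above. */
int sum_unrolled(const int *a, int n) {
    int s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder iterations */
        s += a[i];
    return s;
}
```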
The GCC manual states that loop unrolling “may or may not make code run faster.” In fact, we observed both cases: a significant decrease of run-time for some programs and an increase of run-time in other cases. The compiler provides several switches and numeric parameters for fine-tuning and applying loop unrolling.
Subprogram inlining is another optimization technique with similar trade-offs. Here, calls to a subprogram are replaced by the body of the subprogram, and formal parameters are replaced by references to the actual parameters provided in the call. Primarily, the improvement is to avoid the overhead of performing the call, saving registers, and possibly passing parameter values via the stack. Also, the inserted function body may offer opportunities for specialized optimizations only possible for the respective set of parameters, in particular if some of the parameters are constant values. On the negative side are the enlargement of the code and possibly increased register pressure.
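A typical inlining candidate looks like this small sketch (invented names):

```c
/* A small accessor-style function: after the call is replaced by the
   body, the call overhead disappears, and if the argument happens to
   be a constant the result can be folded at compile time as well. */
static inline int square(int x) {
    return x * x;
}

int sum_of_squares(int a, int b) {
    /* After inlining, this is a * a + b * b with no call overhead. */
    return square(a) + square(b);
}
```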
GCC’s Internal Structures
The basis for GCC’s advanced code optimization features is a set of internal program representations. Recall from the previous article that GCC is divided into three major phases that communicate through different data structures representing the program being compiled (see Figure 2). The front-end generates a data structure based on a tree-like structure that’s close to the nature of the source code. The tree data structure resulting from the parser step is then turned into a form more appropriate to code optimization: the Static Single Assignment (SSA) form. The optimization phase and the code generator finally communicate through another representation using the so-called Register Transfer Language (RTL).
The SSA data structure, introduced in the late ’80s, includes an efficient representation for all aspects of a program being compiled. It’s efficient with respect to memory consumption and, more important, for execution of algorithms to perform data flow analysis. This is essential for obtaining information useful for determining if and where you should apply a certain optimization to the program.
SSA also is well-suited to actually perform optimizing program transformations. A key property for achieving this is the way multiple assignments to the same variable are represented in SSA. For more on SSA, see The Compiler Design Handbook: Optimizations and Machine Code Generation (CRC Press, Florida 2002). Not surprisingly, an SSA-based program representation was integrated into GCC a few years ago; the initial design was presented at the 2003 GCC Developers’ Summit. The proceedings are available at www.gccsummit.org/2003/.
RTL is the last intermediate representation and has existed since the beginning of GCC. Like the tree and SSA representation, RTL is platform-independent. In contrast to the other intermediate representations, however, the RTL representation is already close to real machine code. The RTL optimization and code generation layer can be considered the platform-dependent optimization step interacting heavily with the so-called back-end.
A GCC back-end consists of the machine description, which is written in a LISP-like language describing the instructions available on the machine. The remaining part of a back-end, written in C, implements hooks that the RTL optimization passes rely on. These hooks describe several machine- and Application Binary Interface (ABI)-specific things you must take into account when rearranging or generating code. Examples include:
• How parameters of a certain type are passed between functions
• Where the return value must reside
• How the stack elements are arranged
• How much stack space must be allocated for a function.
The RTL data structure itself is a per-function, doubly linked list of so-called insns. An insn has a type that indicates either that it describes code to be generated (sequence insn, call insn, and jump insn) or that it holds information important for optimization or code generation (barrier, code label, and note). An insn actually describing code consists of data structures pointing to each other, forming an RTL Expression (RTX). The textual representation of an RTX follows the LISP-like syntax. An arithmetic expression such as a + b is written as (plus a b).
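As a rough sketch of this notation, an insn adding the contents of two pseudo registers into a third might be printed like this in an RTL dump (the exact numbers and fields vary between GCC versions and targets):

```
(insn 10 9 11 2
  (set (reg:SI 60)
       (plus:SI (reg:SI 61)
                (reg:SI 62))))
```

Here the set expression assigns to pseudo register 60 the sum of pseudo registers 61 and 62, each operating in SImode (32-bit integer).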
Human-Readable Intermediate Representations
For GCC development, it’s essential to know what a program being compiled looks like after certain passes in the compiler. Even if you don’t intend to do compiler development, it’s helpful to learn more about what GCC is doing with your favorite source code. GCC provides a family of -fdump-<name> options for printing the intermediate representation after pass <name> into a text file that can be read with an arbitrary text editor. The GCC manual provides details of all available options. Some options have an abbreviated form. For example, specifying -da dumps every possible intermediate RTL-based file, resulting in about 60 files at -O3.
The following examines a few of the important passes working on the RTL representation of the code:
Processors have a complex internal structure that provides different functional units and register sets. To generate fast code, the compiler must take these facilities into account. The purpose of an instruction scheduler is to rearrange the instructions to make ideal use of the machine-specific facilities.
There are two major factors that might delay execution of an instruction. Either there’s a data dependency, which means the result from the last instruction isn’t yet available, or the functional unit of the processor is still busy executing one of the last instructions.
To solve the data dependency problem, GCC by default uses a modified version of a list scheduling algorithm. The algorithm maintains a list of insns whose input values are already available and cleverly picks one of them, preferring a choice that later adds as many new insns as possible to that list.
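The core idea can be sketched in a few lines of C. The data structures below are invented for this illustration and are much simpler than GCC's real ones; in particular, the sketch picks the first ready insn instead of using GCC's more clever selection heuristic:

```c
#define MAX_INSNS 8

/* Each insn records how many of its inputs are still unscheduled;
   insns whose count reaches zero are "ready". */
struct insn {
    int unscheduled_deps;    /* inputs not yet computed */
    int n_succ;              /* number of insns depending on this one */
    int succ[MAX_INSNS];     /* their indices */
};

/* Schedules all n insns of an acyclic dependency graph; writes the
   chosen order into order[] and returns the number of insns emitted. */
int list_schedule(struct insn *insns, int n, int *order) {
    int emitted = 0;
    int done[MAX_INSNS] = {0};
    while (emitted < n) {
        for (int i = 0; i < n; i++) {
            if (!done[i] && insns[i].unscheduled_deps == 0) {
                done[i] = 1;
                order[emitted++] = i;
                /* Scheduling this insn may make its successors ready. */
                for (int s = 0; s < insns[i].n_succ; s++)
                    insns[insns[i].succ[s]].unscheduled_deps--;
                break;
            }
        }
    }
    return emitted;
}

/* Tiny demo DAG: insn 0 feeds insns 1 and 2, which both feed insn 3.
   A valid schedule must start with 0 and end with 3. */
int demo_schedule_valid(void) {
    struct insn g[4] = {
        { 0, 2, {1, 2} },
        { 1, 1, {3} },
        { 1, 1, {3} },
        { 2, 0, {0} },
    };
    int order[4];
    int n = list_schedule(g, 4, order);
    return n == 4 && order[0] == 0 && order[3] == 3;
}
```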
For conflicting functional units, it’s essential for GCC to be able to quickly determine if choosing a specific insn might lead to such a conflict. So the back-end contains a detailed description of the functional units of the processor together with insn attributes indicating which instruction needs which functional unit—the pipeline description. This pipeline description is compiled by GCC into a finite state machine that can answer in linear time if scheduling an insn will lead to a conflict. Based on this, the instruction scheduler can search for an insn without conflicts.
A common way to make a processor appear faster is to enhance the instruction set with instructions covering the functionality of several others at once and leave the real work to the compiler. An obvious example is the multiply and add instruction, which does the work of two instructions in the time of just one. The combiner is GCC’s way of exploiting such macro instructions. The so-called combine step tries to merge up to three insns at once and to find an implementation for the result in the back-end. If the resulting insn can be found in the machine description, the combiner takes the new instruction as the base for even more merge operations. Some instructions provide quite complex functionality; one example is the z10 instruction rotate and insert selected bits (risbg). The combiner is the only chance to exploit such instructions.
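At the source level, a candidate for the combiner is as simple as the following sketch. On a target whose machine description provides a multiply-and-add pattern, the separate multiply and add insns that GCC first generates for this expression can be merged into one:

```c
/* The multiply insn and the add insn that consumes its result are
   candidates for the combine step: if the machine description offers
   a matching multiply-and-add pattern, they become a single insn. */
double madd(double a, double b, double c) {
    return a * b + c;
}
```

Whether the merge actually happens depends on the target and the surrounding code; dumping the combine pass output (for example, with -fdump-rtl-combine) shows the result.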
Register allocation is an essential part of code generation. Until this step, GCC assumes there’s an unlimited number of registers available, called pseudo registers. The difficult task of the register allocation step is to assign a machine register to each of the pseudo registers. GCC uses an enhanced version of a graph-coloring algorithm. If the graph coloring fails to find such a mapping, some pseudo registers are replaced by stack slots and the register allocation is restarted. This continues until either everything resides in memory (undesirable) or the 16 general-purpose registers of the System z CPU are sufficient to cover all the remaining pseudo registers.
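The coloring idea can be sketched as follows. The structures are invented for this illustration, use a toy machine with only two hard registers, and omit the many refinements in GCC's real allocator:

```c
#define N_PSEUDOS 8
#define N_HARD_REGS 2   /* toy machine; System z has 16 GPRs */

/* adj[i][j] is nonzero if pseudo registers i and j are live at the
   same time and therefore must not share a hard register.  Returns
   the hard register chosen for pseudo i, or -1 to signal "spill to
   a stack slot and retry". */
int pick_hard_reg(int n, int adj[N_PSEUDOS][N_PSEUDOS],
                  const int *assigned, int i) {
    for (int reg = 0; reg < N_HARD_REGS; reg++) {
        int conflict = 0;
        for (int j = 0; j < n; j++)
            if (adj[i][j] && assigned[j] == reg)
                conflict = 1;
        if (!conflict)
            return reg;   /* first hard register free of conflicts */
    }
    return -1;            /* no color left: spill */
}

/* Three mutually conflicting pseudos on a two-register machine: the
   third one cannot be colored and must be spilled. */
int demo_spill_needed(void) {
    int adj[N_PSEUDOS][N_PSEUDOS] = {0};
    adj[0][1] = adj[1][0] = 1;
    adj[0][2] = adj[2][0] = 1;
    adj[1][2] = adj[2][1] = 1;
    int assigned[N_PSEUDOS] = { -1, -1, -1 };
    assigned[0] = pick_hard_reg(3, adj, assigned, 0);  /* gets reg 0 */
    assigned[1] = pick_hard_reg(3, adj, assigned, 1);  /* gets reg 1 */
    return pick_hard_reg(3, adj, assigned, 2);         /* -1: spill */
}
```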
Most technical security issues result from actual programming errors. Sloppy memory management especially opens the door to attackers trying to gain access to the machine where the code runs. As the tool that knows most about the programming language, the compiler is a natural choice when trying to add automated techniques to prevent a certain class of attacks. The following introduces a few of the mechanisms GCC provides:
Stack-smashing protector: With the stack-smashing protector, GCC provides a mechanism to protect code against a special class of stack overflow attacks. The simplest stack overflow attacks try to write data beyond the end of a stack variable to manipulate the return address of the function, which also resides on the stack. This way, an attacker can execute code of their own choosing with the privileges of the attacked program. Jumping to code that opens a shell console is a technique widely used by attackers.
The GCC option -fstack-protector places a randomly chosen guard value on the stack between the local variables and the return address. Before the return address from the stack is used to return to the caller, the guard value is validated. If the guard value doesn’t match, the program is immediately terminated. Whenever an attacker writes beyond the local variable area, the guard value is garbled on the stack and the program exits before the attacker’s code can execute.
However, this doesn’t prevent the attacker from overwriting other variables in the local variable area. To minimize the risk of this, GCC moves potentially problematic variables such as char arrays as close as possible to the guard value. That way, the chances are high that even an overwrite not targeting the return address can be detected.
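The guard-value idea can be simulated in plain C. The sketch below is only an illustration of the principle; the real guard is placed by the compiler, chosen randomly at program start, and checked by generated code rather than by the program itself:

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified frame layout: a local char array placed right next to
   the guard value that shields the return address. */
struct frame {
    char buf[8];        /* local variable area */
    uint32_t guard;     /* canary between locals and return address */
};

/* Simulates a write of n bytes starting at the frame and reports
   whether the guard value was garbled (n must not exceed the frame
   size in this sketch). */
int guard_garbled(size_t n) {
    struct frame f;
    f.guard = 0xDEADBEEFu;           /* stand-in for the random guard */
    memset(&f, 'A', n);              /* the (possibly too long) write */
    return f.guard != 0xDEADBEEFu;   /* mismatch means overflow */
}
```

In a real -fstack-protector build, a detected mismatch terminates the program instead of merely reporting it.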
Mudflap: Mudflap is a debugging tool built into GCC. With mudflap enabled, GCC can instrument C code to track memory objects and accesses to them (the C++ support currently is limited). This makes it possible to detect especially annoying kinds of bugs, such as invalid pointer dereferences and out-of-bounds array accesses, which often lead to security holes. The memory objects are tracked in an external library, which is why it’s always necessary to link the mudflap library to your code.
Figure 3 shows how mudflap is able to detect a simple out-of-bounds array access. A character array having five slots indexed from 0 through 4 is written to in a loop using indices from 0 through 5, resulting in an invalid memory access when writing to the last slot. When this example is compiled with mudflap enabled, the error is detected. Mudflap prints the source code line of the invalid access together with information about the underlying memory object. Using this information, the programmer can easily locate the offending code and fix it.
To prepare code to be built with mudflap, the -fmudflap and -lmudflap options must be specified as shown in Figure 3.
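The scenario from Figure 3 can be reconstructed roughly as follows (the variable and function names here are invented, not taken from the original figure):

```c
/* buf has valid indices 0 through 4.  Calling fill(6) also writes
   buf[5], one slot past the end -- exactly the access mudflap reports,
   including the source line, when this file is compiled with -fmudflap
   and linked with -lmudflap. */
int fill(int limit) {
    char buf[5];
    int writes = 0;
    for (int i = 0; i < limit; i++) {
        buf[i] = 'x';     /* out of bounds once i reaches 5 */
        writes++;
    }
    return writes;
}
```

Assuming the file is named overflow.c, a mudflap build would look like: gcc -g -fmudflap overflow.c -lmudflap -o overflow.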
Setting MUDFLAP_OPTIONS=-help prints a list of the other analyses that can be performed. For more information about mudflap, please visit the GCC wiki page at http://gcc.gnu.org/wiki/Mudflap_Pointer_Debugging or read the 2003 GCC Summit paper “Mudflap: Pointer Use Checking for C/C++” at www.gccsummit.org/2003/.
GCC uses many state-of-the-art techniques to make your code run faster and harden it against attacks. Basing these techniques on machine-independent internal code representations allows GCC to provide them for a wide range of platforms and programming languages. Compiling code at the highest optimization level, GCC currently performs about 130 passes on SSA and about 60 passes on the RTL representation, and these numbers tend to grow. However, this article showed that there’s no single one-size-fits-all strategy when it comes to code optimization. Techniques might contradict each other or may not do anything helpful for a particular piece of code. Code optimization is about heuristics and will continue to provide compiler developers like us with a lot of work in the future.