Saturday, December 31, 2005

OpenDos - The Initial Design

THE DYNAMIC OPTIMISATION SYSTEM


The basic idea is that while a compiler must treat every code branch as equally possible, a binary optimizer can let the program itself tell the optimizer which branches are important and likely.
The main aim is to find hot regions in the executable binary, optimize them at run time and execute the optimized code rather than executing the native un-optimized hot code. The hot codes are mostly dominated by loops and the backward branches taken in a program.



THE DESIGN

The overall design of the system is composed of a group of inter-operating modules written as shared objects.


The pre-loader is responsible for loading the system initialization routine as a pre-loadable library and also taking the control over the execution of the binary. The bootstrapper loads and halts the running binary just before the execution of the execution of the mainline (but only after the library linking). Then the control is finally handed over to the interpreter.
The interpreter in this dynamic optimization system uses a technique of “cut and paste” strategy (also known as the basic code cache scheme). It starts the interpretation of the executable binary, instruction by instruction. This allows non-invasive profiling of the binary and also allows the interpreter to gather statistics and profiling information and also to track the behavior without having to modify the source binary at all. Now, in this process of interpretation, whenever a back-edge (a backward jump, which is the essential condition for a loop) is encountered, its information is maintained in a hash table, where it has entries for the most probable candidates for being hot codes along with their number of occurrences are maintained. If the count exceeds the threshold (say 10) then this code is considered to be a hot code.
The interpreter takes this hot code and fills it into the instruction buffer with all its subsequent instruction. These instructions are then translated into some IR (Intermediate
Representation), which helps in the efficient optimization of this code to provide faster execution of the code.
This instruction (IR) buffer along with the process state information is passed to the dispatcher, which is the central communication hub for the overall system. Finally the dispatcher passes this buffer to the optimizer for optimization.
The optimizer module is a program that takes the hot code trace (IR buffer) as an input, safely optimize them into equivalent fragments (optimized equivalent piece of code) and then requests the dispatcher to pass it back to the Add Epilogue & Prologue, in order to assure that all the exits from this piece of code should actually leave passing through the epilogue saving the state before exiting. Finally this fragment is passed to the fragment cache manager which inserts it into the cache. Now this fragment is later executed in place of the straight interpretation, whenever that particular trace (a hot code) is encountered again.
The performance of the overall execution of the code as compared when executed natively without interpretation and optimization and when executed through the optimization system, slows down in the initial period but gains high speed up later when most of the hot codes in the binary are found and fragmented. The theory is that, after enough time, the majority of the executed program text will have been converted into fragments, and so the execution will occur mainly in the fragment cache. Execution within the cache will be faster than the execution of the statically optimized binary and will eventually offset the initial cost of the profiler and optimizer.


THE IMPLEMENTAION


Pre-loader

The primary function of the pre-loader is to bootstrap the system and gain the control over the execution. It is done primarily by compiling the system as a pre-loadable library and then use of the UNIX environment variable LD_PRELOAD, in which we can add paths to any pre-loadable libraries. Now, whenever any executable is executed, these libraries are loaded before any other libraries and their initialization routines are executed then. The function of the pre-loader is to do this task as a wrapper routine and also once this library is loaded, call the Bootstrapper as its initialization routine.


Bootstrapper

The Bootstrapper runs as the initialization routine for the system which is loaded is a pre-loadable library and is basically responsible for the further loading of all the other modules of the system and also it locates the _start symbol from the current executable and overwrites the next two instruction, which are basically call to the main ( ) of a normal program, with call to startup ( ) routines whose primary job is to save the context just at the time of entering the _start symbol and also to transfer control to the interpreter.


Interpreter

The interpreter’s main function is to gather profiling information and code execution traces. The design of the interpreter was typically based on the some of the above factors.
The overall system runs in non-privileged mode, so the interpreter is primarily a non-privileged instruction interpreter; second, the interpreter only needs to be functionally correct, therefore no complicated hardware structure needs to be emulated. Also the interpreter has to fetch and decode the instruction and doesn’t have to emulate since the system’s design is architecture dependent (x86 to x86) meaning this system is designed for x86 binaries and will produce a final code, which can be executed only on x86 machine.


A trace is simply a sequence of instructions gathered after encountering a trace head. There are no restrictions placed on the termination conditions for a trace, however the system can stop gathering instructions if a certain numerical limit is reached. A trace once found, also includes an array of registers that contain the machine state when the trace head was encountered. This information is then passed on to the Dispatcher for subsequent optimization and finally insertion into the fragment cache.


Optimizer


The systems optimizer has a simple interface: it accepts a trace (as a low-level intermediate representation) and returns a trace. In this way optimizers can not only optimize a particular machine code trace, but can be used to convert a trace from one machine to another, or to convert a trace from machine-level to an intermediate form more appropriate for optimization (This part is expected to be one of the advancements to the system).
The type of optimization techniques still depends totally upon the IR that is provided by the Interpreter. No details are being mentioned as the implementation of various modules are still to be done, so the optimization still remains a part to which most of work is to be done.


Translator

The function of the translator is to take the fragment from the optimizer through the dispatcher and convert it back from the intermediate representation to the native binary code, which is required for the execution of the code. This is passed to the prologue and epilogue adder for further instrumentation.


Epilogue & Prologue Adder

The job of the prologue is call a procedure which will load the saved context to the native CPU registers and then transfer control to the fragment for execution. The epilogue serves as the fragment’s only exit point. All possible exit points (such as branches, calls, and jumps) have their target addresses changed so that on exit from the fragment code, they jump to the epilogue. The epilogue’s job is then to clean up after the fragment, capture the processor’s state, and then jump to the fragment cache’s function to return control to interpreter.





Dispatcher

The Dispatcher’s main purpose is to serve as a common interface through which all the system modules can interact. In particular the dispatcher coordinates all the details of trace processing. It takes the trace if form of IR from the Interpreter and then passes it over to the optimizer, which converts it into fragment. This fragment is finally returned to the dispatcher which in turn gives it to the translator for converting it back to the native binary.





REFRENCES


[1] Dino Dai Zovi, Trek Palmer, SIND: SIND is not DYNAMO, March 9, 2001.

[2] Trek Palmer, Darko Stefanovic. Experiences Constructing a Lightweight SPARC Interpreter for a Dynamic Binary Translator, March 14, 2001.

[3] Trek Palmer. Design and implementation of SIND, A Dynamic Binary Translator.

[4] Derek Bruening and Saman Amarasinghe. The DynamoRIO Collaboration.
http://www.cag.lcs.mit.edu/dynamorio/.

[5] Eric Feigin. A Case for Automatic Run-Time Code Optimization. PhD thesis.

[6] David Keppel Robert F. Cmelik. Shade: A fast instruction set simulator for execution
profiling.

[7] Smith Traub, Schechter. Ephemeral instrumentation for lightweight program profiling.

[8] Sanjeev Banerjia Vasanth Bala, Evelyn Duesterwald. Dynamo: A transparent dynamic optimization system.

[9] Sanjeev Banerjia Vasanth Bala, Evelyn Duesterwald. Transparent dynamic optimization: The design and implementation of dynamo.

0 Comments:

Post a Comment

<< Home