Friday, May 12, 2006

Verge of Release

Hi frens,
Finally OpenDos is on its verge to realease. And will available for downloads soon.

Thursday, March 16, 2006

Abstract & Synopsis

OpenDos
An
Open Source
Dynamic Optimization System

ABSTRACT

Dynamic optimization refers to the runtime optimization of a native program binary. Dynamic binary translation is an important area for compiler research, because additional information available at runtime can substantially improve the effectiveness of optimizations. The challenge is to monitor a process during its execution. We are building a system for monitoring and modifying a program’s behavior while it is running. The runtime environment exposes several execution paths that have not been discovered during static optimization. This gives us several opportunities for performing optimizations even though the binary is statically optimized.
As the 90-10 rule 90% of the program execution time is used for 10% of the program region. The main aim of our system is to identify this 10% and optimize it. The lightweight dynamic binary optimizer is a software that optimizes the executables at run-time. This optimizer is aimed to optimize i386 executables of ELF format. It is entirely user-level software, which works on the native binaries.
As the binaries shipped by the vendors are not fully optimized there are various opportunities for optimizing these binaries. Also as optimization is done at runtime no extra code is added to the binary. Also our system exploits the runtime information available to it. This includes information about the taken branches, frequently executed path etc.
Using this information we can apply certain optimizations, which are termed aggressive during static compilation. Also our system is an open source effort towards dynamic optimization.




Project name : OpenDos – Dynamic Optimization System
College :Maharashtra Academy Of Engineering.

Project group

1.Poornima.P.Kamath . email-id: kamath.poornima@gmail.com.
2.Divya Sasidharan email-id: divya.in@gmail.com
3.Sumit Kumar email-id: sumit.pmb@gmail.com.
4.Sandeep Kumar email-id: sandeepksinha@gmail.com.

Project Guides:
Internal guide: Prof Pallavi Talegaonkar.
External guide: Mr Saurabh Verma (Codito technologies)
Sponsored by: Codito Technologies, Pune.




SYNOPSIS


1.1 Problem Statement:
Static optimization techniques do not perform optimization across shared libraries. Also legacy code where code is unavailable cannot be optimized using static techniques .
1.2 Soultion:
A solution to above problem is dynamic optimization. Dynamic optimizers apply code to an executing application. Applying optimizations dynamically provides several benefits over static optimization, particularly for control-intensive programs. For example dynamic optimization span branches and calls including calls to dynamically linked code, which typically hinder static optimizations. In many cases, a dynamic optimizer is able to identify and optimize longer hot execution paths than a static optimization. Also because it naturally and transparently profiles during execution, a dynamic optimizer can adapt to changing program behaviour.

1.3 Functional Description:
1.Region selection.
The optimizer observes the execution and generates a series of traces (code which is executed frequently) that represent the common dynamic sequencing of instructions.
2.Optimization.
It applies optimizations or transformations to the generated code traces.
3. Optimized region storage(fragment cache)
It stores the optimized code traces in a software based code cache.
4.Dispatch of optimized code.
It intercepts execution and directs it to code cache for all future executions of the optimized code traces until program termination.

2.1 OUR SYSTEM DESIGN

The Image Cannot be Displayed.


MAIN COMPONENTS
1. Bootstrapper:
This module intercepts control from the running binary and gives control to our interpreter.It loads and initializes all the modules of our system.

2.Interpreter.
This module does the job of profiling the executable.
It performs two tasks, a).It disassembles the binary till it gets a control transfer instruction.
b).It simulates control transfer instructions and processes and records the frequently executed hot regions.

3.Epilog-Prolog adder.
This module adds the epilog and prolog code along with exit stubs and compensation code.As we execute hot code from fragment cache, there has to be a change in context between executable and fragment code.
So the epilog and prolog code saves the register context

4.Optimizer.
The optimizer optimizes the hot code. The optimizer performs optimizations like constant propagation. Copy propagation, etc.

5.Fragment cache manager.
The hot code that is frequently executed code is stored in fragment cache. This code is in binary and is stored in a software code cache.
This is managed by fragment manager.
2.2 Software Process Life Cycle
The primary tasks involved in this project is to profile the executable as well as optimize it at runtime . We can increase the optimizations in each release Hence we use the Incremental model for the development of the project.

2.3 Platforms
The dynamic optimizer we are developing is a user level software on linux platform. It is for i386-Elf executables.


2.4 Area of work

Main category
Software
Subcategory:
System applications









2.5 Function point analysis

Measurement parameter Count Complex Count*average
Number of user input 1 6 6
Number of user output 1 7 7
Number of files 21 15 315
Count-total 328

FP= 328*(0.65+0.01*sum(Fi)) ,Fi=8
FP=239.44


2.6 Test Suites:

SPEC 2000 benchmarks
1.wc
2.bzip2
3.gzip



2.7 References:

[1] Sanjeev Banerjia Vasanth Bala, Evelyn Duesterwald. Dynamo: A transparent dynamic optimization system.
http://www.hpl.hp.com/techreports/1999/HPL-1999-78.pdf


[2] Trek Palmer. Design and implementation of SIND, A Dynamic Binary Translator.
http://www.cs.unm.edu/~darko/papers/palmer-ms-thesis.pdf

Tuesday, January 03, 2006

SOFTWARE REQUIREMENT SPECIFICATION

OpenDos
The OpenSource
Dynamic Optimization System


Introduction

1.1Purpose.
Dynamic optimization refers to the runtime optimization of a native program binary. Dynamic binary translation is an important area for compiler research, because additional information available at runtime can substantially improve the effectiveness of optimizations.
The challenge is to monitor a process during its execution. We are building a system for monitoring and modifying a program’s behavior while it is running. The runtime environment exposes several execution paths that have not been discovered during static optimization. This gives us several opportunities for performing optimizations even though the binary is statically optimized.
The lightweight dynamic binary optimizer is software that optimizes the executables at run-time. This optimizer is aimed to optimize i386 executables .It is an entirely a user-level software which works on the native binaries.
This SRS describes the functions and performance requirements of the optimizer.
1.2Document Conventions

1.3 Intended audience
This SRS is intended for the developers and users.

1.4 Scope of the Development Project

Context:
The aim of our project is to develop a Dynamic optimization system. The function of this system will be to generate a more optimized version of the executable at the run time and gain performance. We bootstrap into the executable by compiling the system as a pre-loadable library and finally integrate it with the executable.

Information Objective:
Our project doesn’t accept anything as input; it is transparent enough to integrate itself to the binary, which is going to be executed. We actually find the hot codes in the executable and after converting them into an optimized version; we execute it natively on the machine. The output is the performance gain that we achieve by optimizing those fragments, which are optimized.

Function and performance:
Our software will first be loaded into the address space of the executable, which is going to be executed. Then it tries to bootstrap into the executable and take the control from it and then interpret the code that was going to be executed.
No performance gain will be achieved in the initial phase of execution but later on, once most of the hot regions are identified, there will be gain in the performance.

Limitations:

1. Takes away cycles from program.
2. Especially slow at start.
3. Debugging can be difficult:

1.5. Definitions, Acronyms, and Abbreviations

DOS : Dynamic optimizer system .

1.6 Refernces

1.7 Overview of document


Overall description

2.1 Product perspective

The dynamic optimizer is a user level system. It consists of various shared library modules. These are invoked as soon as the executable is loaded. They are loaded before the other DLLs. The system intercepts the control from the executable and control is given to the dynamic optimizer system (DOS).
The DOS executes the executable by monitoring it that is executable executes under the control of the DOS. As the 90-10 rule 90% of the program execution time is used for 10% of the program region. The main aim of the DO is to identify this 10% and optimize it.As the binaries shipped by the vendors are not fully optimized there are various opportunities for optimizing these binaries. Also as optimization is done at runtime no extra code is added to the binary.
Also DOS exploit the runtime information available to them. This includes information about the taken branches, frequently executed path etc., which allows the DOS to perform path specific optimizations. Using this information we can apply certain optimizations, which are termed aggressive during static compilation.

2.2 Product functions

The functions of the DOS are
1.Find the hotspots in the program.
2.Optimize these hotspots using appropriate optimization techniques.
3.Cache these optimized hotspots in a software code cache.
4.On subsequent encounter of the same hot code execute from the code cache.

2.3 User classes.

This optimizer is user level software and intended to make the executable run faster.
It can be used
1. To run application software.
2. To run legacy applications where recompiling is very costly which thus provides ample opportunities for optimizations.

2.4 Operating environment.

The optimizer works in user space.
It is designed for the i386 architecture to run Linux executables.

2.5 Design and Implementation Constraints

The executable should be an i386 executable.

2.6 User Documentation

2.7 Assumptions and Dependencies

2.8 Overview of Data Requirements

2.9 General Constraints, Assumptions, Dependencies, Guidelines

2.10 User View of Product Use

The user will be provided with an option whether he wants to run the application under our DOS (Dynamic Optimizer System). This the user can do by specifying as the command line argument –DOS if he intents to run his application under DOS.

3 External interface requirements

3.1 User interfaces (not applicable)

3.2 Hardware interfaces (not applicable).

3.3 Software interfaces

We use the Binary File Descriptor (BFD) library for handling the executable.
BFD is a library, which provides a single interface to read and write object files, the other DLLs. It makes changes in the ELF header so that control transfers to DOS after all DLLs are loaded and initializations are made.

Inputs.
The input to this preload library is the handle to the executable.

Output
After this function the control will be with the next module.

4.System Features

4.1. Hot path (trace) selection.

Once we get control, we have to execute the executable.
For this we require 2 functions
1.Disassembling
2.Hot path selection.

1. Disassembly
Disassembler disassembles each instruction of the object code selectively.

2.Hot path selection.

The disassembled instruction is examined for control transfer instructions and hot path selector maintains this information. The non-control transfer instructions are copied as they are without disassembling to a page. A block of sequential instructions between two control transfer instructions is executed natively. Whenever there is a back edge the target of the back edge to the origin of the back edge itself is considered as candidate for a hot trace as it forms a loop. Then a count is maintained for this trace, which when exceeds the threshold level is considered as a hot trace and given for further optimization.

Inputs
The input is the start address of the main function of the executable. This address is obtained from the ELF header.

Output
The output is selected hot path.
Also each disassembled instruction is executed.

4.2.Optimization.

Purpose
This function is to optimize the hot path obtained. The trace is analyzed and suitable optimization techniques are used to optimize it. Here a cost-benefit analysis is performed to select the appropriate optimization techniques to be applied.

Input
The input is selected hot trace.

Output
The output is optimized trace, which is emitted into the code cache.

4.3.Caching the trace in code cache.

Purpose:
The optimized trace is assembled and put into software code cache. This is to ensure that the next reference to this trace is executed from the code cache. The code cache is located in the same address space as the executable and is allocated during initialization of DOS.
The main aim is to bring the whole 10% path frequently executing code inside the code cache so that the execution remains in the code cache for most of its time.

Input
The input is optimized hot trace.

Output
The output is the cached trace.
5.6 Special User Requirements
5.6.1 Backup and Recovery
5.6.2 Data Migration
5.6.3 Data retention
5.6.4 User Training
5.6.5 Installation

Data structures and algorithms Part - I (all except Interpreter )

1.Bootstrapper.

Data structures:
1.Buffer to hold elf-header
2. Location for address of start-up code where control has to go .
3. START_ADDR: location for storing address of _start .
4. Pointers for storing handles of shared libraries.


Algorithm
1.Load the modules of DOS using dlopen.
2.Store the handles.
3.mmap the ELF header and mprotect it.
4.Get address of startup code using dlsym on module containing it.
5.Save the address of start in START_ADDR.
6.Overwrite this address with address of startup code.
7.Unmap the header.




2.Startup-code

Data structures.

1.Context save area for the registers (array of structures).
2.Buffer to be used as stack.
3.Buffer to hold complete text segment.
4.PROG_CTR to hold linear virtual address.



Algorithm
1.Save the context into context save area.
2.Initialize PROG_CTR to 0.
3.Open the executable using bfd and get canonical form into buffer.
4.Give control to interpreter.






Interpreter
Data structures


Algorithm
It will get control from the startup code.

It will call disassember to disassemble the instructions from PROG_CTR.
The disassembler will return the no of bytes of sequential code and the next control transfer instruction..
The interpreter calls linear executor and execute those sequential instructions natively.
It then determines the type of control transfer instruction as given by disassembler.

Types
1 unconditional jump
2.conditional jump
3 call
4 ret
5 leave

if unconditional and forward then change PROG_CTR to target of jump.
Else





Prologue and epilogue adder.


Data structures.

1.Buffer for trace- TRACE BUFFER.
This will be dynamic and will be big enough to hold prologue ,epilogue
Trace and compensation code.

2..Stub lookup table.
It will contain 3 fields.

Virtual address of the exit instruction's destination.
Stub address.
Pointer (index) to compensation code buffer.
no of instructions

3.Next-counter NEXT_CTR.
Gives the next free position in trace buffer.

Algorithm

1.Add code in trace buffer to restore the machine context.
2.This is prologue code.
3.Increment the NEXT_CTR by size of above code.
4.Add the code for trace into buffer.
5.Increment the NEXT_CTR by size of above code.
6.Traverse this code and fill the stub address in the stub table with the exit addresses i.e those coming out of trace.
7.The optimizer will have filled some exits which have compensation code.
8.Any remaining exits are to filled.
9.Now in the stub-table fill in all stub address in the following way
10.Save NEXT_CTR .
11.The first stub address is NEXT_CTR.
12.The next address is NEXT_CTR + sizeof(compensation code) + 2 .
13.The 2 is for jump instruction to epilogue and saving the address from which to start interpretation.
In this way the table is filled.

14.Now we traverse the trace and change all the exit lvirtual addresses to stub addresses by stub-table lookup.

15. Also add the stubs and compensation code.
After this add the code for storing the register context in context save area and jumping to dispatch.








Dispatcher

Algorithm
Adding trace
1.Give control to optimizer
2.It will transfer the trace to optimizer and receive the assembled trace from 3.assembler.
4.This assembled trace is given to fragment cache.
5.After inserting fragment cache gives control to dispatch.

Fragment execution
Trace selector or trace collector gives fragment address .
Dispatch checks from where control has come.
If from trace collector it saves the address to be collected next.
It transfers control to fragment cache.
After receiving control back it gives control to either trace collector or trace selector.











Fragment cache.


Data structures

1.List of buffers having the structure
header containing size ,start address
buffer to hold fragment
pointer to next .

2..FRAG_LOOK_TABLE
This will hold all the information related to the
fragments that are present in the fragment cache

It will have the following members

* original_vma /* the actual vma of start_addr before relocation */
* frag_start_addr /* start adress inside the fragement */
* offset_complementary_fragement_page /* start of somplementary page */
* fragment_size /* size of the fragment */
* SOME MORE MEMBERS FOR AGING POLICIES.

2.Free pointer FREE_PTR.
This will point to the next free buffer on list.

Algorithm
For adding a fragment
1.Make an entry in FRAG_LOOK_TABLE.
2.Insert fragment in position pointed by FREE_PTR.
3. Insert the complementry fragment in next position.
4. advance the FREE_PTR BY 2.
5. Fill the headers of above fragments.
6.Return control to dispatch.

For execution.

1.Get the fragment address.
2.Change ip to that address.

Data Structure & Algo - part - II ( Interpreter Only)

Disassembling and trace collection function is divided into 3 parts as follows:

1.Interpreter

a.This will disassemble the instructions and call trace selector for all control transfer instructions to update TLT (trace lookup table).
b.It will also simulate the control transfer instruction by checking the hardware flags thereby taking decisions whether the branch will be taken or not.

2.Trace selector

a.This will update TLT (trace lookup table) for control transfer instructions and return to interpreter the next target address to interprete.
b.If the trace is already present in the fragment cache then it will give control to the fragment cache to execute the fragment directly. In this case the control will go from:
Trace selector--> Fragment cache --> Interpreter.
c.If not in fragment cache and the count exceeds threshold while
After updating TLT then call trace collector.



3. Trace collector


a. It collects the trace till either of the following condition occurs:
Trace size exceeds the MAX LIMIT.
It encounters a return instruction.
It gets a backedge which is outside the hot loop.

b. It works in two modes: Pure and Interrupted.
The idea is that if during trace collection a call instruction is
encountered then the called function is not included as a part of the trace instead it is executed separately and then trace collection is resumed where it was halted.
So initially trace collector is in pure mode where it simply
copies the trace into the trace collection page.
Then if a call or a trace already exising in fragment cache is
encountered then trace collection is halted. And next time when
it resumes, it is in interrupted mode.In this interrupted mode
instead of starting a new trace it simply appends the remaing
trace into the trace collecting page.

d.It also emits the complementary execution page which contains
all the code inside the loop which may or may not be included in the trace collecting page.


Interpreter


Input:

1.address := Address of instruction to start interpretation.
This will be a global variable accessible to all.

Functions:

1. bytes := disassemble_retbytes (address);
This will return the number of bytes for linear code before a control
transfer instruction.

2.insn_buffer := disassemble_ct_insn (address);
This will return the pointer to decoded control transfer instruction.

3.tar_add := take_branch_decision( insn_buffer from above)
It will simulate the control transfer instruction and then returns the taken branch address for conditional control transfer instruction.

It will consist of following steps:
If instruction is “call”
Return( target_add in the instruction).
If instruction is “return”
Store the context
Return( top of dos’s stack)
If insruction is “unconditional jump”
Return( target add in the instruction);
If instruction is ”conditional jump”
Flag = simulate(opcode of instruction)
If(flag) // the branch is taken
Then return (tar add in the instruction)
Else return (add of instruction below CT insn)



4. copy_exec_LEP(bytes);
In this the linear code from (address) to (address + bytes) will be copied
and then executed after adding the prolog and epilog.

5. trace_selector( tar_add)
call to a trace selection to update the TLT.


Algorithm:


1.bytes := disassemble_retbytes (address);
2. copy_exec_LEP(bytes);
3. add_CT_insn = bytes + address;
4. insn_buffer := disassemble_ct_insn (address);
5. tar_add := take_branch_decision( insn_buffer from above)
6. trace_selector( insn_buffer,tar_add)
// trace selector will update address variable also so the loop can be continued.


Trace selector

Input:

1. insn_buffer := which is a pointer to decoded instruction of the control transfer instruction.
2.tar_add := next instruction to interprete / target address in insn

Functions used:

1.update_TLT( add )

It will perform the following functions:
search_in_TLT(add);
if found increment counter
if not found add_entry_TLT(add)
if counter > threshold and already present in fragment cache then start excuting from the fragment cache
If counter > threshold and not present in fragment cache
Call trace_collect(add);
if counter < address =" tar_add;" address =" tar_add;"> curr_prog_counter // forward jump
Address = tar_add;

4.If instruction is “return”
Address= tar_add;


Trace Collector

Input:

1.add := this is the start address for trace collection
2.mode := this indicates mode of working for trace collector.
It can be PURE or INTERRUPTED

Functions used:

1. bytes := disassemble_block(pointer to buffer,add)
This function will return number of bytes of linear code and also disassembled code in the buffer.
2. rawcode := read_from_executable(add,bytes)
This function returns a pointer rawcode to another buffer containing code in executable format.
3. copy_exec_LEP(rawcode)
copy the rawcode into the linear executable page and execute.
4. copy_DA_TCP(pointer to buffer containing DA_code)
It will copy the Disassembled code into the Trace collecting page.
5. insn_buffer:= disassemble_ct_insn(address of CT instruction)
this will decode only the control transfer instruction into insn_buffer
6. action_CT_insn(buffer)
This will perform following functions:
a.if instruction = unconditional jump
* check if the target_add is in fragment cache using function
check_in_FC(target_add);
if not in fragment cache then return target address

* if target_add > start address of trace then return tar_of_jump
else stop collecting trace
// this will ensure that no backedge outside the loop is collected //


b.If instruction is conditional jump

i)

Flag := Simulate(opcode)
If(flag = true) // condition true & branch is taken
{
TBA = (add of target in the instruction)
NTBA = (add of instruction + length of instruction)
}
if(branch is not taken)
{
NTBA = (add of target in the instruction)
TBA = (add of instruction + length of instruction)
}

ii)If NTBA > start_address
Copy NTBA in complementary address table CAT

iii) if TBA > start address
check_in_FC(st_add);
if not in FC then return TBA

iii)if TBA = start_address then stop_collecting_trace()

c.If instruction = call

Copy the instruction in TCP
Check_in_FC(call site address)

If not in fragment cache then we have to execute the call natively and return back to the trace collection in interrupted mode.
Push the address of trace collector user stack
Set flag = INTERRUPTED
Push (return add which is the call_site + length of call)

If in the fragment cache then again the control must be given to FC and then back to trace collector in INTERRUPTED mode.

d.If instruction is “return”
Then stop_collecting_trace()


7. stop_collecting_trace()

In this the lowest address from complementary address table CAT is chosen. And all the code from lowest adress to end address is copied into the complementary executing page (CEP) .
This CEP and TCP is then given further to optimizer.

8. check_in_FC(add)

It will check if the given trace is already present in the fragment cache if present it will push the trace collector address on the user stack and set flag to INTERRUPTED and push the return address on the DOS’s stack and then execute from the fragment cache.

Algorithm: (add)


If mode is PURE TCP_off = 0
Else TCP_off = INTERRUPTED_address

1. bytes := disassemble_block(ptr_DA,add)
2. rawcode := read_from_executable(add,bytes)
3. copy_exec_LEP(rawcode)
4. copy_DA_TCP(pointer to buffer containing DA_code)
It will copy or append the Disassembled code into the Trace collecting page depending upon TCP_off value.
5. insn_buffer:= disassemble_ct_insn(address of CT instruction)
this will decode only the control transfer instruction into insn_buffer
6. add=action_CT_insn(buffer) then loop to 1.

Saturday, December 31, 2005

Dec-16th To-Do list

Minutes of the meet on 16th Dec

Following is the to do list:

1. Deciding upon the data structures
a. Optimization specific
b. Others

2. Algorithms
a. Optimization related
b. Others

3. Module layout

4. Relocation

5. First-cut optimization

6. Makefiles

7. Testing and profiling

8. Coding Work to be completed till Tuesday i.e 20th Dec is as below:
1. Relocations.
2. First-cut optimization (list)
3. Non-optimization specific data structures and Algorithms.

Modules & API's - PART 1

No. OF Modules & API's for Disassembler

1.bootstrapper.
2.dispatch+assembler
3.optimizer+prologue-epilogue adder.
4.fragment cache manager
5. Interpreter.

The api's for disassembler are
1. function : disassemble
input: start address of disassembling
output:no of bytes of linear code before control transfer instruction

2.function:disassemble 1 instruction
input:address of control transfer instruction.
output: pointer to decoded control transfer instruction.

3.function : disassemble and give disassembled code
input: start address of disassembling
output:1. no of bytes of linear code before control transfer instruction
2.pointer to buffer containing disassembled code .

Here the datastructure can be an array of structures.

OpenDos - The Initial Design

THE DYNAMIC OPTIMISATION SYSTEM


The basic idea is that while a compiler must treat every code branch as equally possible, a binary optimizer can let the program itself tell the optimizer which branches are important and likely.
The main aim is to find hot regions in the executable binary, optimize them at run time and execute the optimized code rather than executing the native un-optimized hot code. The hot codes are mostly dominated by loops and the backward branches taken in a program.



THE DESIGN

The overall design of the system is composed of a group of inter-operating modules written as shared objects.


The pre-loader is responsible for loading the system initialization routine as a pre-loadable library and also taking the control over the execution of the binary. The bootstrapper loads and halts the running binary just before the execution of the execution of the mainline (but only after the library linking). Then the control is finally handed over to the interpreter.
The interpreter in this dynamic optimization system uses a technique of “cut and paste” strategy (also known as the basic code cache scheme). It starts the interpretation of the executable binary, instruction by instruction. This allows non-invasive profiling of the binary and also allows the interpreter to gather statistics and profiling information and also to track the behavior without having to modify the source binary at all. Now, in this process of interpretation, whenever a back-edge (a backward jump, which is the essential condition for a loop) is encountered, its information is maintained in a hash table, where it has entries for the most probable candidates for being hot codes along with their number of occurrences are maintained. If the count exceeds the threshold (say 10) then this code is considered to be a hot code.
The interpreter takes this hot code and fills it into the instruction buffer with all its subsequent instruction. These instructions are then translated into some IR (Intermediate
Representation), which helps in the efficient optimization of this code to provide faster execution of the code.
This instruction (IR) buffer along with the process state information is passed to the dispatcher, which is the central communication hub for the overall system. Finally the dispatcher passes this buffer to the optimizer for optimization.
The optimizer module is a program that takes the hot code trace (IR buffer) as an input, safely optimize them into equivalent fragments (optimized equivalent piece of code) and then requests the dispatcher to pass it back to the Add Epilogue & Prologue, in order to assure that all the exits from this piece of code should actually leave passing through the epilogue saving the state before exiting. Finally this fragment is passed to the fragment cache manager which inserts it into the cache. Now this fragment is later executed in place of the straight interpretation, whenever that particular trace (a hot code) is encountered again.
The performance of the overall execution of the code as compared when executed natively without interpretation and optimization and when executed through the optimization system, slows down in the initial period but gains high speed up later when most of the hot codes in the binary are found and fragmented. The theory is that, after enough time, the majority of the executed program text will have been converted into fragments, and so the execution will occur mainly in the fragment cache. Execution within the cache will be faster than the execution of the statically optimized binary and will eventually offset the initial cost of the profiler and optimizer.


THE IMPLEMENTAION


Pre-loader

The primary function of the pre-loader is to bootstrap the system and gain the control over the execution. It is done primarily by compiling the system as a pre-loadable library and then use of the UNIX environment variable LD_PRELOAD, in which we can add paths to any pre-loadable libraries. Now, whenever any executable is executed, these libraries are loaded before any other libraries and their initialization routines are executed then. The function of the pre-loader is to do this task as a wrapper routine and also once this library is loaded, call the Bootstrapper as its initialization routine.


Bootstrapper

The Bootstrapper runs as the initialization routine for the system which is loaded is a pre-loadable library and is basically responsible for the further loading of all the other modules of the system and also it locates the _start symbol from the current executable and overwrites the next two instruction, which are basically call to the main ( ) of a normal program, with call to startup ( ) routines whose primary job is to save the context just at the time of entering the _start symbol and also to transfer control to the interpreter.


Interpreter

The interpreter’s main function is to gather profiling information and code execution traces. The design of the interpreter was typically based on the some of the above factors.
The overall system runs in non-privileged mode, so the interpreter is primarily a non-privileged instruction interpreter; second, the interpreter only needs to be functionally correct, therefore no complicated hardware structure needs to be emulated. Also the interpreter has to fetch and decode the instruction and doesn’t have to emulate since the system’s design is architecture dependent (x86 to x86) meaning this system is designed for x86 binaries and will produce a final code, which can be executed only on x86 machine.


A trace is simply a sequence of instructions gathered after encountering a trace head. There are no restrictions placed on the termination conditions for a trace, however the system can stop gathering instructions if a certain numerical limit is reached. A trace once found, also includes an array of registers that contain the machine state when the trace head was encountered. This information is then passed on to the Dispatcher for subsequent optimization and finally insertion into the fragment cache.


Optimizer


The systems optimizer has a simple interface: it accepts a trace (as a low-level intermediate representation) and returns a trace. In this way optimizers can not only optimize a particular machine code trace, but can be used to convert a trace from one machine to another, or to convert a trace from machine-level to an intermediate form more appropriate for optimization (This part is expected to be one of the advancements to the system).
The type of optimization techniques still depends totally upon the IR that is provided by the Interpreter. No details are being mentioned as the implementation of various modules are still to be done, so the optimization still remains a part to which most of work is to be done.


Translator

The function of the translator is to take the fragment from the optimizer through the dispatcher and convert it back from the intermediate representation to the native binary code, which is required for the execution of the code. This is passed to the prologue and epilogue adder for further instrumentation.


Epilogue & Prologue Adder

The job of the prologue is call a procedure which will load the saved context to the native CPU registers and then transfer control to the fragment for execution. The epilogue serves as the fragment’s only exit point. All possible exit points (such as branches, calls, and jumps) have their target addresses changed so that on exit from the fragment code, they jump to the epilogue. The epilogue’s job is then to clean up after the fragment, capture the processor’s state, and then jump to the fragment cache’s function to return control to interpreter.





Dispatcher

The Dispatcher’s main purpose is to serve as a common interface through which all the system modules can interact. In particular the dispatcher coordinates all the details of trace processing. It takes the trace if form of IR from the Interpreter and then passes it over to the optimizer, which converts it into fragment. This fragment is finally returned to the dispatcher which in turn gives it to the translator for converting it back to the native binary.





REFRENCES


[1] Dino Dai Zovi, Trek Palmer, SIND: SIND is not DYNAMO, March 9, 2001.

[2] Trek Palmer, Darko Stefanovic. Experiences Constructing a Lightweight SPARC Interpreter for a Dynamic Binary Translator, March 14, 2001.

[3] Trek Palmer. Design and implementation of SIND, A Dynamic Binary Translator.

[4] Derek Bruening and Saman Amarasinghe. The DynamoRIO Collaboration.
http://www.cag.lcs.mit.edu/dynamorio/.

[5] Eric Feigin. A Case for Automatic Run-Time Code Optimization. PhD thesis.

[6] David Keppel Robert F. Cmelik. Shade: A fast instruction set simulator for execution
profiling.

[7] Smith Traub, Schechter. Ephemeral instrumentation for lightweight program profiling.

[8] Sanjeev Banerjia Vasanth Bala, Evelyn Duesterwald. Dynamo: A transparent dynamic optimization system.

[9] Sanjeev Banerjia Vasanth Bala, Evelyn Duesterwald. Transparent dynamic optimization: The design and implementation of dynamo.

Monday, November 21, 2005

PROJECT : 21st NOV

Agenda: Deciding upon the Data structures Required



These are some the Data structures that we have thought as per the
initial design of the system:


1. Structure used for saving the register context of the CPU

It will have the following members
all the CPU registers

2. An array of the above declared structured for saving the
registers/contexts.


* The size of this array will be static
* 25 will be an average value
* This value will generally depend upon the
application. If it has lots of system calls and library
routines then this number has to be increased.

3. A buffer which will contain the complete text segment

* This segment of the code will be provided by the
bfd library.
* The size of this buffer can be static or dynamic.
but dynamic is more preferable, as bfd is capable
of providing us with the size of this segment.



4. START_ADDR, which will be taken from the e_entry of the elf header

* This variable will hold the address from where the
actual interpretation of the code will start, once
the control comes to our system.



5. PROG_CTR , it will hold the linear virtual address for the interpreter

* The interpreter always assumes that it has to start
its interpretation from this address only. So whenever
control changes, this value must be updated.


6. LINEAR EXECUTABLE PAGE

* The sizeof this page can be static as well as dynamic.
* Dynamic : The number of bytes to be executed linearly
can be obained. This functionality is provided by our
/binutils/opcodes or say our disassembler.

7. TRACE LOOKUP TABLE

* A sturucture holding the information about the
entries of the traces that have been encountered.
It has the following members,

* flag ; /* indictaing wether a function or a loop */
* validity /* To inform wether its still in the fragement, if
was present */
* head_trace /* head of the trace */
* end_trace /* end of the trace */
* start_frag /* Its only valid if validity is TRUE */
* count /* number of hits of start_trace */

8. HIST_BUFFER

* A strcuture holding information about the various basic blocks
which have been encountered previously. It saves the overhead
re-interpretation of the code that has already been interpreted.
Also, It contains an extra member insn, which contains the
control transfer insn, which follows this basic block.

* The member of this structure are :
* vma_start_addr /* It is the start of the basic block */
* linear_byt_count /* number of linear bytes in this basic block */
* char insn[30] /* holds the next insn */

Example :

1000 : mov eax,abx
1002 : addl eax,#30
1004 : nop
1005 : ....some linear inmstructions
..
..
..
1020 : cmp eax,#400
1023 : jnz 1020

Now, in this case,
vma_start_addr = 1000;
linear_byt_count = 1022
insn = "jnz 1020"

Now when we encounter this start_addr again for interetation, we simply copy
linear_byt_count number of bytes from the text segment starting from
vma_start_addr into a linear executable page and execute it. Now we
have the next control transfer insn also, so removed the overhead of
decoding the control transfer insn.


9. COMPLEMETARY_ADDR_TABLE

/* This would simply be an array, which will store the start address of all
the non taken branches during trace collection . So that after collection
we can select the lowest value from this array and move the code lying
between this selected value and the end of trace.

* The basic motto is to move the complete loop in the fragment.


10. COMPLY_FRAG_PAGE

* This will be just a page that will be allocted by malloc. It will
actually contain the complimentary piece of code that we will get
from the point No . 9.

* This page will also be finally moved into the fraagement cache,
along with the fragement.

* The size of this page cannot be known before hand.

11. IR

* This structure will actually hold the IR that we are gonna get
from the disasembler.

* It will have the following members :
* menmonics /* menmonics of the insn */
* opnd1
* opnd2

12. An array of the above structure will be used to pass to the optimizer for optimization.


13. FRAG_LOOK_TABLE

* This will hold all the information related to the
fragments that are present in the fragment cache

* It will have the following members

* original_vma /* the actual vma of start_addr before relocation */
* frag_start_addr /* start adress inside the fragement */
* offset_complementary_fragement_page /* start of somplementary page */
* fragment_size /* size of the fragment */
* SOME MORE MEMBERS FOR AGING POLICIES.

These were some of the most important data structures that would be
required as per the initial design of the system.

We are still deciding on :

1. Fragement

a. Memory Alloctaion policies
b. Placement policies ( procedure sorting )
c. Relocation Policies
d. Eviction policies ( based on aging, count etc ).

2. Data structures required for the optimizer.

3. Study on how many shared object would make our system
flexible as well as robust. Currently discussing on keeping it two.


a. The first consists of
i> preloader
ii> boostrapper
iii>interpreter
iv>disassembler

b. The second consists of
i> Relocator
ii>optimizer
iii>epilog & prolog adder
iv>fragment

We will have a DISPATCHER, which is responsible for the communication
between these shared objects.

4. The Relocation

Not much work is required for direct jumps as we are removing direct jumps from the trace.

Relocation during the optimization is a mojor issue.
i> When a piece of code is inserted.
ii> When a piece of code is removed.

Friday, November 18, 2005

PROJECT : 9th SEP

Design of various modules of the system

OpenDOS- Open dynamic optimization system


The different modules are
1.Preloader
2.Bootstrapper
3.Interpreter
4.Dispatcher
5.Linear code executer
6.Fragment cache manager
7.Prologue &epilogue adder
8.Translator
9.Optimizer.

1.Preloader
This module will be preloaded using the LD_PRELOAD variable.
This will load the bootstrapper ans call it.

2.Bootstrapper
The bootstrapper will load and initialize all the modules.
It will get the address of _start using dlsym.
It will save the address of main
It will mprotect the text segment.
It will overwrite the next two lines after _start by a call to startup routine.


3.Startup module
It will save the context of the process.
It will allocate a new stack for the dbo(dynamic binary optimizer system).
It will store a pointer to this stack.
It will call the interpreter.


4.Interpreter
The input to this is address of main.
It will fetch instructions from this address and decode them.

If the instruction is a not a control transfer instruction the it will copy it to a page.
Otherwise it will check if it is backward taken branch (if pc>target addr)

If it is it will record the target addr as start of trace and
The current pc -1 as end of trace in variables

There are 3 conditions possible:

1.If no entry for this trace then
Store both these addresses in trace lookup table.
Initialize the counter to 1.

2.If trace is present and counter is less than threshold increment the counter.

3.If trace is present and the counter exceeds the threshold value we convert it into IR.
And give it to dispatcher.


5.Dispatcher
It acts as a communicating module between all the modules.
It will accept the IR from interpreter and give it to optimizer.
it will accept optimized IR and give it to translator.

6.Optimizer
It will perform various optimizations depending on IR.


7.Epilogue and prologue adder
It basically adds stubs at the start and end of each process’s fragment we want to execute natively.
The prologue saves the process context to cpu registers.
The epilogue saves the context from cpu registers to the page where previously we had saved the context.
There is a separate epilogue for each exit.
It transfers control to the interpreter to the next instruction after the end of trace.


8.Fragment cache manager
It inserts the fragment into the software code cache at appropriate location.
It records the address of this location and puts it into trace lookup table.


9.Translator
It will convert the IR to binary and give it to prologue and epilogue adder.


Points to be discussed.
1.Heuristics for cache replacement .
2.Handling aging of traces.
3.using API of libbfd for IR.
4.deciding optimization techniques.

PROJECT : 27th SEP

Agenda : Trace selection & collection
Codito TEchnologies, Pune.
1. Definition of a trace

- When it becomes hot.
- When it's not a hot code.



2. What is a hot code?

- exact definition of a hot code

3. How to handle procedure calls

- in linear sequential code
- in the hot codes

4. How to handle forward jumps

- in linear codes. ( As our earlier definition said : A linear sequential code is, a piece of code till we get a jump. )
- A solution ( if we get a forward jump ): we can just change the fetch address of the interpreter and proceed as usual till we get a call or a backward jump and also remove the forward jump from the buffer where the linear sequential code is collected for execution.

5. Over-lapping of the traces .

- finding out a strategy to decide, how to co-alias two fragments.

6.what optimizations to perform

- what optimization has to be carried out .
- what info is required for them.
- What they give
- what libdisasm provides and libfd doesn't or vice versa

7..how to handle the relocation of the addresses.

- a module, with its properties and behavior has to be decided.

8. find API provided from libopcodes.

9.find where we can piggy back the data structures returned by libopcodes.

- according to the requirements for optimization.
- Also, find out whether it is possible to convert it into an IR using some callback function at the time of disassembling.

PROJECT : 9th OCT

Agenda : Trace selection and Trace collection

Global table : table which hold the info about the traces.

1. Start of Trace.
- destination of backward jump.
- exits from fragments already present.
- all function calls destination

2. End of Trace.

- two possibilities
- trace is a loop
- no of insn > Threshold
- encountering start of trace
- begining of fucntion definition
- no of insn > Threshold
- ret insn is encountered

Now, We just take a look at how the control flow is gonna take place.

Interpretation Starts.

- from begining of code.
- a branch tranfer insn is encountered.

- forward jump
- change the fetch addr for interpreter.

- backward jump
- its a special case.
- exect'n of collected linear code.
- making an entry in global table.
- check count, perform reqd. operation.
- function calls
- exect'n of collected linear code.
- entry made in global table.
- if already present, increment the count.
- and check for threshold.

Once you find trace crossed the threshold :

- start collecting the trace
- start of trace already defined
- end of trace alreay defined

Contents of trace can be :

- forward jumps inside fragments
- backward jumps inside fragments
- jumps to outside the fragment
- instructions - calls to functions

Actions taken for these above mentioned cases :
- forward jump inside fragments
- care taken during relocation.

- backward jump inside fragments
- care taken during relocation.

- jump outside fragments
- control transferred to epilogue.
- an entry is present for desti'n of exits.
as all exits from fragments are probabale
candidates for trace.
- incrementaing the count in the global table.
- control finally transferred to epilogue.

- function calls
- should be replaced by exit stubs.
- if this exit stub crosses the threshold ,
it is optimised and linked into fragment cache.

Some important considerations :
- Every fragment will have a unique fragment id.
- Also in the global table, there will be a field for this _frag_id.
- The use of this fragment id :

- If two or more fragments have same exit destinations
from the fragment, we willtwo separate entries for
that in the gloabal table.

- If sometime later, an exit is made through the
exit stub, the corrseponding entry in the global
table will be bumped.

- This is to help, the linking of fragments, e.g. if
siometimes later the exit path's counter execeeds
the threshold, we can optimise it and link it to the
existing fragments.

PROJECT : 9th SEP

Agenda : Detailed discussion of the Dynamic optimization system
Codito Technologies, Pune.

----------------------------------------------------------------

crt0: overwrite text mprotect
-----------------------------

preloaded+bootstrapper:
-----------------------
- loads all other modules
- mprotect
- changes the text section (_start contents)

startup module:
---------------
- ctrl comes in from crt0
- page allocated for storing reg contents, sind stack
- register state saved
- sind_sp var set to the virtual stack
- j to interpreter

interpreter:
------------
(e.g. valgrind, bochs, plex86, vdebug0
- allocate page for app execution
- finds hot region and give the IR of that
- in: address of main
- find sequential code till the first jump
- copy the seq code into a trace/code region
- detect back-edges for identifying hot regions
- check that the back edges are in the cached address space
- Copy the hot code and reg state to the dispatcher
- no machine emulator

dispatcher
----------

- communicates betn all modules
- input: IR

fragment cache manager
-----------------------------

- put optimized hot regions into the cache
- heuristic for cache and trace replacement

prologues/epilogue instrumenter:
----------------------------------
- add pro/epilog
- xfer saved reg state back into hard regs
- save reg state back, retrun address into the interpreter,
to the pt in app address space
- thread the return stmts to the epilog


Data Structures:
--------------------
1. start of trace, end of trace, count, address in fragment cache,
if already present (for jumping the next time)
2. cache table (need heuristic for replacement here)
3. trace table (need heuristic for replacement here)


Deliverables:
----------------
- Detailed meeting minutes: Monday 12 Sep 2005
- Schedule (first cut) : Wednesday 14 Sep 2005

PROJECT : NOV 17

Dynamic Optimisation System
Codito Technologies, Pune.

Dynamic optmization techniques

constant propagation
--------------------
+ removal of one extra move
+ detection+deletion of dead code
+ one register freed for other optimizations


copy propagation
----------------
+ one register freed for other optimizations
+ one move instruction removed

redundant load/store removal
----------------------------
+ remove one load
- add one/two moves instead

code specialization
-------------------
+ free one register
+ new opportunities for constant propagation
- keep two copies of code
- extra guard code to be generated
* goodness based

procedure sorting
-----------------
* where to put a new entrant into the fragment cache
+ cache replacement policy, in addition/sync with the LRU policy
* i cache line size needs to be seen
* prefetch queue size needs to be seen