ACA Unit 8: Hardware and Software for VLIW and EPIC (Notes). Appendix G: Hardware and Software for VLIW and EPIC. In this chapter we discuss compiler technology for increasing the amount of parallelism that we ...
|Published (Last):||11 April 2005|
Most modern CPUs guess which branch will be taken even before the calculation is complete, so that they can load the instructions for the branch or, in some architectures, even start to compute them speculatively. When the value of the register n is non-zero, the branch is taken. Contemporary VLIWs usually have four to eight main execution units. The automotive benchmarks are dominated by control code, the telecom code is primarily loop-oriented, and the networking algorithms are a mixture of control- and loop-oriented code.
The compiler was named Bulldog, after Yale’s mascot. Loop-oriented code benefited more from eliminating the restrictions on spanning execute packets.
These two would lead computer architecture research at Hewlett-Packard during the 1990s.
A processor that executes every instruction one after the other (i.e., a non-pipelined scalar architecture) may use processor resources inefficiently. If the p-bit of instruction i is 1, then instruction i + 1 is part of the same execute packet as instruction i. The section summarizes, for each generation of C6X processors, the progressive code size reduction and performance impact of software-pipelined loop collapsing, NOP compression, variable length instructions, and the modulo loop buffer.
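The p-bit rule can be illustrated with a minimal sketch. It assumes each instruction is a 32-bit word whose least significant bit (bit 0) is the p-bit; the function name `split_execute_packets` and the sample words are hypothetical and only serve the illustration.

```python
def split_execute_packets(words):
    """Group instruction words into execute packets using the p-bit.

    Assumption: bit 0 of each word is the p-bit. A set p-bit means the
    next instruction belongs to the same execute packet as this one.
    """
    packets = []
    current = []
    for word in words:
        current.append(word)
        if word & 1 == 0:
            # p-bit clear: this instruction closes the execute packet.
            packets.append(current)
            current = []
    if current:
        # Trailing instructions whose chain was not closed by a clear p-bit.
        packets.append(current)
    return packets


# Hypothetical words; only bit 0 matters here (p-bits: 1, 1, 0, 0, 1, 0).
example = [0x11, 0x21, 0x30, 0x40, 0x51, 0x60]
packets = split_execute_packets(example)
```

With these sample words, the first three instructions form one execute packet, the fourth stands alone, and the last two form a third packet.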
For example, immediate fields are smaller, there is a reduced set of available registers, the instructions may operate only on one functional unit per cluster, and some standard arithmetic and logic instructions may have only two operands instead of three (one source register is the same as the destination register). Transmeta addressed this issue by including a binary-to-binary software compiler layer (termed code morphing) in their Crusoe implementation of the x86 architecture.
Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel.
During a compression iteration, there is often a potential 16-bit instruction with no other 16-bit instruction immediately before or after. The compiler can, in many cases, exploit ILP better than hardware, and the saved silicon space can be used to reduce cost, save power, or add more functional units. In the above schedule, very little parallelism has been exploited because ins1, ins2, and ins3 must execute in order within the given loop iteration.
Assuming that other instructions are unavailable for execution during the instruction latency, explicit pipeline NOPs are inserted after the instruction issues to maintain correct program execution.
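A minimal sketch of this NOP insertion follows. It makes the pessimistic assumption stated above (nothing else can be scheduled during the latency) and further assumes that each instruction's result is needed by the very next instruction. The function name `insert_pipeline_nops` and the delay-slot counts in the example are illustrative assumptions, not values taken from this text.

```python
def insert_pipeline_nops(instructions):
    """Insert explicit NOPs after instructions that have delay slots.

    `instructions` is a list of (mnemonic, delay_slots) pairs.
    Assumption: no independent instruction is available to fill the
    delay slots, so every delay slot becomes a NOP cycle.
    """
    scheduled = []
    for mnemonic, delay_slots in instructions:
        scheduled.append(mnemonic)
        if delay_slots > 0:
            # A single multi-cycle NOP covers all delay slots.
            scheduled.append(f"NOP {delay_slots}")
    return scheduled


# Illustrative delay-slot counts (assumed for this sketch).
program = [("LDW", 4), ("MPY", 1), ("ADD", 0)]
schedule = insert_pipeline_nops(program)
```

Every `NOP n` entry in the result is code that occupies program memory without doing useful work, which is why the NOP compression mentioned elsewhere in this text matters for code size.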
Reducing code size improves system performance by allowing space for more code in on-chip memory and program caches. Each instruction in an execute packet must use a different functional unit.
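The functional-unit constraint on execute packets can be checked with a short sketch. The unit names (`.L1`, `.M1`, and so on) and the function name `is_valid_execute_packet` are assumptions made for illustration.

```python
def is_valid_execute_packet(packet):
    """Return True if no two instructions in the packet share a unit.

    `packet` is a list of (mnemonic, functional_unit) pairs. The unit
    names used below are assumed for illustration.
    """
    units = [unit for _, unit in packet]
    return len(units) == len(set(units))


ok_packet = [("ADD", ".L1"), ("MPY", ".M1"), ("LDW", ".D1")]
bad_packet = [("ADD", ".L1"), ("SUB", ".L1")]  # both want the same unit
```

A scheduler would typically run a check like this before placing a candidate instruction into a partially filled execute packet.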
In this case, the compressor may swap instructions within an execute packet to create a pair. Loop-oriented code with high degrees of ILP contains more padding NOP instructions, because execute packets tend to be larger in loop code, thus increasing the likelihood of spanning execute packets.
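The padding effect can be sketched as follows, for a processor on which an execute packet may not span a fetch packet boundary. The fetch packet size of eight instruction slots and the function name `pad_to_avoid_spanning` are assumptions of this sketch.

```python
def pad_to_avoid_spanning(execute_packets, fetch_packet_size=8):
    """Lay out execute packets so that none spans a fetch packet boundary.

    Assumption: a fetch packet holds `fetch_packet_size` instruction
    slots, and an execute packet that would cross the boundary is pushed
    to the next fetch packet by filling the remaining slots with NOPs.
    """
    layout = []
    for packet in execute_packets:
        if len(packet) > fetch_packet_size:
            raise ValueError("execute packet larger than a fetch packet")
        used = len(layout) % fetch_packet_size
        if used + len(packet) > fetch_packet_size:
            # Packet would span the boundary: pad out the fetch packet.
            layout.extend(["NOP"] * (fetch_packet_size - used))
        layout.extend(packet)
    return layout


# A 5-wide packet followed by a 4-wide packet forces 3 padding NOPs.
wide_loop_code = [["a"] * 5, ["b"] * 4]
padded = pad_to_avoid_spanning(wide_loop_code)
```

In this example two wide packets cost three NOP slots, whereas packets of one or two instructions would usually fit without padding. That matches the observation that loop code with high ILP carries more padding.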
It has 32 static general-purpose registers, partitioned into two register files. The compressor has the responsibility for packing instructions into fetch packets. By the time the entire loop body has been inserted into the loop buffer, the loop kernel is present and can execute entirely from there.
The results are averages across all benchmarks.
Generalization of the modulo loop buffer code layout. Because the C6X compiler often produces execute packets with multiple instructions, swapping instructions within an execute packet increases the conversion rate of potential 16-bit instructions.
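Why swapping helps can be seen in a simplified sketch. It assumes that a compact instruction can only be emitted when it sits next to another compact candidate, and that instructions within one execute packet may be reordered freely because they issue in parallel. Real compressors must also respect alignment and encoding constraints that are ignored here; both function names are hypothetical.

```python
def count_adjacent_pairs(packet):
    """Count non-overlapping adjacent pairs of compact candidates.

    `packet` is a list of (mnemonic, is_compact_candidate) pairs.
    """
    pairs = 0
    i = 0
    while i + 1 < len(packet):
        if packet[i][1] and packet[i + 1][1]:
            pairs += 1
            i += 2  # both instructions are consumed by this pair
        else:
            i += 1
    return pairs


def group_compact_candidates(packet):
    """Reorder one execute packet so that compact candidates are adjacent.

    Assumption: order within an execute packet does not affect semantics.
    """
    compact = [ins for ins in packet if ins[1]]
    regular = [ins for ins in packet if not ins[1]]
    return compact + regular


packet = [("ADD", True), ("LDW", False), ("MV", True)]
before = count_adjacent_pairs(packet)
after = count_adjacent_pairs(group_compact_candidates(packet))
```

Here the two candidates are separated by a regular instruction, so no pair exists before the swap and one pair exists afterwards.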
All instructions can be optionally guarded by a static predicate. Further improvement can be achieved by executing instructions in an order different from that in which they occur in a program, termed out-of-order execution. Parallel instructions are bundled together into an execute packet. The loop buffer performs the branch automatically. Prologs are collapsed in the same way, except that it must be safe to over-execute an instruction before the loop rather than afterwards.
In superscalar designs, the number of execution units is invisible to the instruction set. Thus, collapsing becomes a very important optimization when loop trip counts are not available at compile-time.
As fetch packets are read from program memory, the instruction dispatch logic extracts execute packets from the fetch packets. The Instr Fetch column shows the loop body instructions fetched from program memory and stored in the MLB. In this paper we describe the co-design of compiler optimizations and processor architecture features that have progressively reduced code size across three generations of a VLIW processor.
Example of NOP padding to prevent a spanning execute packet. For most superscalar designs, the instruction width is 32 bits or fewer. The advantage of kernel-only code is that there is no code growth. The p-bit (bit 0) controls whether the next instruction executes in parallel. For each new fetch packet, the compressor selects a window of instructions and records for each overlay which instructions may be converted to 16-bit.
Back-end compiler and assembler flow depicting the compression of instructions. It does, however, specialize instructions so that they are likely to become 16-bit instructions.