This is the text submitted to the managers of the Embedded Processor Forum 2001, June 11 - 15, in San Jose.
You can read the first slides if you follow this link.
1. The name of the chip, product or technology

   The F-CPU Instruction Set Architecture and the F-CPU Core #0 ("FC0")

2. General features of the chip, product or technology

   * Fully parameterizable VHDL'93 source code distributed under the GPL and the F-CPU Charter
   * Retargetable to a broad range of technologies
   * SIMD with "undetermined register size" (32, 64, 128, 256 bits ...)
   * 64 registers (64-bit wide in the first implementation)
   * 32-bit instructions with only 4 instruction formats
   * Superpipelined (very short, matched pipeline stages)
   * 1 instruction issued per cycle, in order, with out-of-order completion (for the FC0)
   * Execution pipeline freed from exceptions (no rollback possible)
   * Zero-delay branch instruction (in the ideal case)
   * Smooth Register Backup ("SRB") mechanism lowers context switch time

3. Specific features that will be discussed in detail at Embedded Processor Forum

   * Instruction scheduling
   * Memory interface

4. Whether the chip will be formally announced or disclosed for the first time at Embedded
   Processor Forum (if not, what prior disclosures or announcements are planned)

   It is the first public presentation at a conference in the US. The F-CPU project was
   presented at two conferences in Berlin
   (17C3 : http://www.ccc.de/congress/fahrplan/event/153.en.html
    16C3 : http://www.ccc.de/events/congress99/doku/fcpu.html)
   but those were only short technical introductions, whereas some key details (see 3.)
   will be discussed here.

5. For non-chip announcements, please provide details on the product to be announced, its
   relevance to the event, and particulars that will help evaluate the presentation

   Today, the F-CPU is the only configurable SIMD 64-bit general-purpose RISC core aimed at
   high-performance applications (multimedia or scientific computation). It is a very recent
   design (1999), originally meant to provide a free, unencumbered alternative to the Merced.
   Unlike the LEON, the F-CPU defines a new instruction set and is designed to last much
   longer. It is extensible and scalable by design, in several ways, providing forward and
   backward compatibility for several decades. The first core (FC0) could be used as a
   replacement for the aging MIPS, SPARC, Alpha... Unlike the ARC Cores and similar
   "configurable core" offers, the F-CPU is not a "black box" but a completely transparent
   design that gives the engineer full visibility. The F-CPU is built around a clean ISA and
   a few simple scheduling rules, leaving a lot of freedom to the implementor.

6. Other descriptive information about the presentation

   {not defined yet. maybe a free distribution of CD-ROMs}

7. Company name

   It is a community development on the Internet, not a company "product".

8. Speaker's name (should be the chief architect or lead designer of the product)

   Yann Guidon

9. Speaker's job title

   Software Development Engineer at Mentor Graphics (that's how I earn my living),
   main F-CPU contributor.

10. Speaker's biography (30 words or less)

   Joined the F-CPU Design Team on the Internet in Dec. 1998, influenced the design when the
   F-CPU ISA became RISC and defined the FC0 in mid-1999, Mentor Graphics employee since
   Nov. 2000.

11. Speaker's address, phone, fax, email address and administrator (if any)

   email : whygee@f-cpu.org (personal) and yann_guidon@mentor.com (work)
   mail  : Yann Guidon, 13 rue Francois Couperin, 93110 Rosny-sous-Bois, France
   phone : +33 1 64 86 62 07 (business)

12. Abstract title

   Proposed title : "Instruction scheduling and the memory interface of the F-CPU Core #0"
13. P.R. or Marcom contact (include phone, fax, and email)

   No marketing. See http://www.f-cpu.org or email me; I'll answer quickly, since I'm an
   email freak.

14. Your abstract or outline should not exceed 1000 words.

   This technical presentation will focus on one of the many particular characteristics of
   the F-CPU, a superpipelined SIMD RISC CPU that is being developed by a community on the
   Web and distributed under the GPL. The first implementation of the F-CPU is called FC0,
   and the strong constraints of the F-CPU ISA have led to a redesign from scratch of the
   way instructions are decoded and memory is accessed. Although it almost tastes like a
   classic "MIPS" RISC core, the memory units are completely upside-down compared to other
   usual designs. The instruction path is uncommon and the instruction set is strongly
   influenced by it. However, the number of critical or difficult cases is greatly reduced,
   a strong memory protection policy is enforced, and jumps/calls can be done with no
   overhead. Finally, the instruction set is kept orthogonal and easy to schedule on almost
   any kind of CPU core, even in the presence of faults and interrupts.

SYNOPSIS FOR THE PRESENTATION SLIDES (will be converted to HTML & illustrated)

Slide 0 Logo
-------------

Slide 1 Title
---------------
"Instruction scheduling and the memory interface of the F-CPU Core #0"

Disclaimers :
 * This is a technical (not marketing) overview of some major characteristics of the F-CPU
   design philosophy and of their implementation in the FC0.
 * It presents the background and design choices of an ongoing (unfinished) work.
 * Not all the details are presented here ! RTFM @ http://www.f-cpu.seul.org/manual

Slide 2 Short introduction to the F-CPU
----------------------------------------
F-CPU Project goals (voted 1999) :

" To develop and make freely available an architecture, and all other intellectual property
necessary to fabricate one or more implementations of that architecture, with the following
priorities, in decreasing order of importance:
 1) Versatility and usefulness in as wide a range of applications as possible
 2) Performance, emphasizing user-level parallelism and derived through intelligent
    architecture rather than advanced silicon process
 3) Architecture lifespan and forward compatibility
 4) Cost, including monetary and thermal considerations "

In the end, the goal is not to make the fastest CPU ever, because that is an endless race.
The purpose of this project is rather to design the "coolest/sexiest" CPU possible.
Note the difference between "design/develop" and "build/market/sell/distribute...".

Slide 3 Short introduction to the F-CPU (2)
--------------------------------------------
Some features of the F-CPU Instruction Set Architecture :
 * 32-bit instructions with only 4 instruction formats [illustration of the formats]
 * Instruction set census : ~110 reserved opcodes as of today, with several variations,
   "core" and "optional" features.
 * 64 registers (64-bit wide in the first implementation) with R0=0
 * SIMD flag in most instructions allows parallel operation on registers with
   "undetermined" width (32, 64, 128, 256 bits ...)
 * Support for saturated arithmetic, IEEE floating point, IEEE LNS and fractional
   integers (optional)
 * 2r1w, 3r1w and 2r2w instructions, 3r2w register set
 * Memory : 3 "stream hint bits" + cachability bit (7 "memory streams" can be
   implemented -> more bandwidth because the different accesses are separated)
 * One addressing mode only for code and data : register (post-increment possible for data)

Example of some instructions : [follows the binary format]

   Opcode   Example                    Explicitly
   - add    add r1,r2,r3               r3=r1+r2
   - load   load r1,r2,r3              r3=[r2], r2+=r1
   - jump   if r1=0 jump r2 link r3    if cond, r3=PC, PC=r2
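As a side note on the SIMD flag, here is a minimal behavioral sketch in C of what a
saturated SIMD add on 8-bit chunks (such as the sadd.8 instruction used in the slide 15
examples) computes on one 64-bit register. The helper name sadd8 and the fixed 64-bit
width are illustrative assumptions, not the actual VHDL:

   #include <stdint.h>

   /* Saturated SIMD add on 8-bit chunks of a 64-bit register:
    * each byte lane is added independently and clamped to 255. */
   uint64_t sadd8(uint64_t a, uint64_t b)
   {
       uint64_t r = 0;
       for (int i = 0; i < 64; i += 8) {
           uint32_t s = ((a >> i) & 0xFF) + ((b >> i) & 0xFF);
           if (s > 0xFF) s = 0xFF;            /* saturate instead of wrapping */
           r |= (uint64_t)s << i;
       }
       return r;
   }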
Slide 4 Starting point of the FC0 : clock speed issue
------------------------------------------------------
Technology handicap ---> "carpaccio" CPU
 * Faster clock -> shorter Critical DataPath (CDP) -> 6 four-input gates in the CDP, or
   approximately 10 transistors between two pipeline flip-flops.
   (This goal is here to emphasize the simplicity of the pipeline stages; it is not a
   "golden rule" because a lot of other parameters must also be considered.)
 * A stage is deep enough to perform an 8-bit add/sub or a 64-bit increment.
   When a unit becomes too complex, it must be split.
 * CDP determined from the start :
   - no "hidden CDP" that slows the clock down in the middle of the project
   - very simple and easy design of the units (fortunately, because everything must be
     redesigned)
   - the emphasis on short pipeline stages in every part of the project favours "matched"
     and balanced pipeline stage design
   - the number of pipeline stages is not the issue

Slide 5 Variable latency
-------------------------
Every Execution Unit (EU) of the F-CPU is dedicated to a certain kind of operation :
 * ROP2 : bit-to-bit boolean operations : 1 cycle
 * SHL  : bit/byte shuffling : 1 or 2 cycles
 * ASU  : SIMD integer addition and subtraction with carry and saturation :
          2 cycles in the general case, 1 cycle for 8-bit operations
 * IMU  : SIMD integer multiply and MAC : 6 cycles / 64-bit (MR design)
 * IDU, LSU, POPC, INC ...
EUs are "LEGO bricks" with a particular and deterministic latency. They are all SIMD-capable
and are duplicated when the register width increases. The SIMD size and "chunk" size are
implementation-dependent.

Slide 6 Connection of the EUs around the Xbar
---------------------------------------------
[drawing]

Slide 7 General layout of the FC0
----------------------------------
[drawing]

Slide 8 Scheduling of one instruction
--------------------------------------
[drawing]
   Fetcher         : 1 cycle (transparent) {bypass possible when cache miss}
   Decoder/R7 read : 1 cycle (more if stalled)
   Issue/Xbar read : 1 cycle (unless a bypass is possible)
   EU              : 1+ cycle (latency depends on the unit, see slide #5)
   Xbar            : 1 cycle (here we don't care about the scheduling...)
   Register write  : 1 cycle

Slide 9 Scheduling of one instruction (2)
------------------------------------------
Hazard detection with a scoreboard
[drawing]
 * Every register is associated with a "not ready" bit.
 * The bit is set when the instruction is issued.
 * The bit is reset when the scheduler commands a write to the register.
 * The decode pipeline is stalled while the necessary bits for the current instruction are
   not all cleared.
 * This scheme scales almost in O(n) with the number of concurrently decoded/executed
   instructions.
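A rough behavioral sketch of the slide 9 scoreboard, in C, assuming one "not ready" bit per
register packed into a 64-bit word. The function names are illustrative, and the exact set
of registers checked at decode is defined by the FC0 scheduling rules, not by this sketch:

   #include <stdint.h>
   #include <stdbool.h>

   static uint64_t not_ready;     /* bit i = 1 while register ri has a write in flight */

   /* Stall decode while any register the instruction needs is still pending. */
   bool can_issue(int src1, int src2, int dest)
   {
       uint64_t needed = (1ULL << src1) | (1ULL << src2) | (1ULL << dest);
       return (not_ready & needed) == 0;
   }

   void on_issue(int dest)      { not_ready |=  (1ULL << dest); }  /* set at issue        */
   void on_writeback(int dest)  { not_ready &= ~(1ULL << dest); }  /* cleared when the     */
                                                                   /* scheduler commands   */
                                                                   /* the register write   */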
Slide 10 Scheduling of one instruction (3)
------------------------------------------
Allocation of the Xbar slots with a FIFO
[drawing]
 * The opcode and the flags of the current instruction are the inputs of a hardwired lookup
   table. The output indicates how many slots are required (1 or 2 write buses ?) and how
   many cycles are needed to complete the operation.
 * When the operation is issued (if no hazard is detected on the scoreboard or in the FIFO),
   the selected slot is filled with the number of the register that must be written to.
 * When the FIFO shifts down, the number reaches the bottom and drives the control wires of
   the Xbar and the Register Set.
 * The depth of the FIFO is 8 cycles. It corresponds to the highest latency of the execution
   units (IMU : 6 cycles) plus the 2 Xbar cycles.
 * An additional down-counter extends the FIFO for the long, high-latency divide unit
   (non-pipelined version). When the counter elapses, the behaviour is normal for the rest
   of the FIFO (the register number is injected at the top of the FIFO).
 * Load instructions require special handling when the data is not present in the LSU's
   buffer.

Slide 11 Memory protection
---------------------------
All the pointers are available from the Xbar and checked by a TLB that translates the task's
logical addresses into physical addresses.
[drawing]
 * TLB entries are SW-replaced through Special Registers, with the help of LRU HW.
 * 4 page sizes : 4KB, 32KB, 256KB and 2MB, plus probably a set of "fence registers" for
   larger pages.
 * TLB entries contain cachability information, access rights (R/W/RW/X), an access counter
   and subdomain usage counters, a VMID tag ...

Slide 12 Connection of the address bus
--------------------------------------
[drawing]
 * If there is a TLB miss, the pointer register is marked as invalid, and any future use of
   this register as a pointer will trigger a trap at the decode stage.
 * The address output by the TLB is compared to the address tags of the Fetcher, the L/SU,
   the internal Icache and Dcache.
 * If there is a cache or buffer hit :
   - the register is marked as valid
   - the data is brought to the LSU or the Fetcher (depending on the access type) if it is
     not already available in the required buffer.

Slide 13 LSU/Fetcher/Decoder coupling to detect faulty pointers
----------------------------------------------------------------
The R2 field of the instruction is compared against a set of 8*2 6-bit comparators.
[drawing]
 * Each 256-bit line of the Fetcher and the LSU is tagged with an address and the number of
   the associated register.
 * The association of a line and a register is determined by two rules :
   - LRU
   - double-buffering with a pair of lines (to avoid thrashing when scanning a large linear
     region of memory)
 * When the register number is compared to all the entries, a signal is sent to the decoder,
   saying whether the data is available or not :
   - if the data is present AND the instruction is a load/store or jump (plus all the other
     scheduling, scoreboarding etc. rules), the instruction is issued and can complete in
     2 cycles.
   - if the data is not present in any buffer AND it is a load/store or jump, a "prefetch"
     instruction is simulated in the decoder :
     * the pointer's value is present on the Xbar because it is accessed in parallel with
       the decode
     * the pointer is checked in the TLB and the result goes the usual way.
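A rough C model of the slide 13 register-tag matching, assuming 8 buffer lines in the LSU.
The structure and names are illustrative only; the real hardware uses parallel comparators
on the R2 field, not a loop:

   #include <stdint.h>
   #include <stdbool.h>

   #define LSU_LINES 8                 /* assumed number of 256-bit buffer lines */

   struct lsu_line {
       bool     valid;
       uint8_t  reg;                   /* 6-bit number of the associated pointer register */
       uint64_t addr;                  /* address tag */
       /* ... 256 bits of data ... */
   };

   static struct lsu_line lsu[LSU_LINES];

   /* Decode-time check: is the data pointed to by register r2_field already buffered?
    * If yes, the load/store/jump can issue and complete in 2 cycles; if not, the decoder
    * simulates a prefetch instead.                                                      */
   bool data_ready_for(uint8_t r2_field)
   {
       for (int i = 0; i < LSU_LINES; i++)
           if (lsu[i].valid && lsu[i].reg == r2_field)
               return true;
       return false;
   }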
Slide 14 Pipelining of a load instruction
------------------------------------------
Due to the relatively high latency of the load and store instructions, some particular
coding techniques must be used to benefit from the FC0's performance :
 * loop unrolling and interleaving (2x or 3x depending on the size of the loop, the
   available registers...)
 * explicit and aggressive prefetch

Example :

   usual code :            asm code :
                           loadimm @item1, r4           ; prefetch
                           load [r4],r0                 ; prefetch (2)
   load [item1],r1         loadi item2-item1,[r4], r1
   load [item2],r2         loadi item3-item2,[r4], r2
   load [item3],r3         loadi item4-item3,[r4], r3

 * Pointer duplication : a pointer with post-increment takes approx. 6 cycles to complete,
   so other pointers must be used and interleaved if the same memory block must be accessed
   within this 6-cycle period.
 * Of course, a memory access without prefetch will work, but at the cost of a substantial
   stall (this is necessary because a task switch flushes the "register tags" in the LSU and
   the Fetcher).
 * Similar rules apply to the jump/call/return instructions, but the constraints are looser
   (usually no pointer update).

Slide 15 Code example : vector loop
------------------------------------
These are examples of the typical code that runs best on the F-CPU :

1) Vector style
---------------
Pseudocode :
   #define N 1024
   char A[N], B[N], S[N]
   loop (N) S[i]=A[i]+B[i]

   ; LOOP PREPARATION :
   ; r4 = base address of A
   ; r6 = base address of B
   ; r8 = base address of S
   load.s1 r4,r0             ; prefetch
   load.s2 r6,r0             ; prefetch
   load.s3 r8,r0             ; prefetch
   get SR_MAX_SIZE,r1        ; r1 is loaded with 8, 16, 32... depending on the
                             ; chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2    ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N/2,r3            ; N/2 because we unroll the loop twice
   add r1,r4,r5              ; r5 = r4+max_size (pointer duplication)
   add r1,r6,r7              ; idem
   add r1,r8,r9              ; idem
   shl 1,r1,r1               ; r1 = r1*2
   shr r2,r3,r3              ; r3 = final loop count
   loopentry r2              ; r2 = PC+4

   ; LOOP KERNEL :
   loadf.s1 r1,r4,r10
   loadf.s1 r1,r5,r12
   loadf.s2 r1,r6,r11
   loadf.s2 r1,r7,r13
   sadd.8 r10,r11,r11        ; SIMD add on 8-bit chunks
   sadd.8 r12,r13,r13        ; stall
   storef.s3 r1,r8,r11
   storef.s3 r1,r9,r13
   loop r2,r3                ; decrement r3 // jump to r2 if r3 is not zero

   {remainder of the loop here, in case of an odd loop count}

2) Packed style
---------------
Pseudocode :
   #define N 1024
   struct {char A,B} M[N]
   char S[N]
   loop (N) S[i]=M[i].A + M[i].B

   ; r4 = base address of M
   ; r6 = base address of S
   load.s1 r4,r0             ; prefetch
   load.s2 r6,r0             ; prefetch
   get SR_MAX_SIZE,r1        ; r1 is loaded with 8, 16, 32... depending on the
                             ; chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2    ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N,r7
   add r1,r4,r5              ; r5 = r4+max_size (pointer duplication)
   shr r2,r7,r7              ; loop count
   shl 1,r1,r2               ; r1 = r1*2
   loopentry r3

   loadf.s1 r2,r4,r10
   loadf.s1 r2,r5,r11
   mixhi.8 r10,r11,r12
   mixlo.8 r10,r11,r13       ; stall
   sadd.8 r13,r12,r12        ; stall
   storef.s2 r1,r6,r13
   loop r4,r3                ; decrement r3 // jump to r1 if r3 is not zero
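For reference, here is what the "vector style" example computes, written as plain C. This
only captures the functional intent of the pseudocode; it hides the unrolling, prefetching
and saturation that the hand-written F-CPU code makes explicit (and the plain C '+' wraps
where sadd.8 saturates):

   #include <stddef.h>

   #define N 1024

   /* S[i] = A[i] + B[i] on 8-bit elements; the F-CPU version processes MAX_SIZE
    * bytes per sadd.8 and unrolls the loop twice.                               */
   void vector_add(const unsigned char *A, const unsigned char *B, unsigned char *S)
   {
       for (size_t i = 0; i < N; i++)
           S[i] = (unsigned char)(A[i] + B[i]);
   }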
Slide 16 The future of the FC0
-------------------------------
 * Adding new execution units : LNS add/sub/conversion, IEEE floating-point units,
   popcount/ECC ...
 * Adding other memory interfaces, e.g. DDR SDRAM
 * Extension to 2-issue, then 4-issue
 * Split/pipelining of the decode/issue unit
 * Coding constraint for a superscalar F-CPU (64 registers, roughly 3 register names
   per instruction) :
   2-issue : 64/(2*3) = 10 instructions without dependency
   3-issue : 64/(3*3) =  7  ...
   4-issue : 64/(4*3) =  5  <-- current limit of the FC0

Slide 17 Evolution of the F-CPU
--------------------------------
 * Explicit register renaming -> more usable registers
 * Possible directions :
   - SMT (Simultaneous Multi-Threading)
   - multi-core chips
   - RISC->TTA translator
   - eDRAM (embedded DRAM)
   - F-BUS, ...
 * New instructions : scatter-gather, permute ...
 * F-CPU II ?

Slide 18 Conclusion
--------------------
 * URL of the slides : http://www.f-cpu.de/epf2001/ (not yet created)
 * Coming milestones : complete VHDL and manual v1 on CVS
 * F-CPU official web site : http://www.f-cpu.org
 * F-CPU manual available from http://www.f-cpu.seul.org