This is the text submitted to the managers of the Embedded Processor Forum 2001, June 11 - 15, in San Jose.
1. The name of the chip, product or technology
the F-CPU Instruction Set Architecture and the F-CPU Core #0 ("FC0")
2. General features of the chip, product or technology
* Fully parameterizable VHDL'93 source code distributed under GPL
and F-CPU Charter.
* retargetable to a broad range of technologies
* SIMD, "undetermined register size" (32, 64, 128, 256 ...)
* 64 registers (64-bit wide in the first implementation)
* 32-bit instructions with 4 instruction formats only
* superpipelined (very short matched pipeline stages)
* 1 instruction per cycle, issued in-order
with out-of-order completion (for the FC0)
* Execution pipeline free of exceptions (no rollback possible)
* Zero-delay branch instruction (in the ideal case)
* Smooth Register Backup ("SRB") mechanism lowers context switch time
3. Specific features that will be discussed in detail at
Embedded Processor Forum
* instruction scheduling
* memory interface
4. Whether the chip will be formally announced or
disclosed for the first time at Embedded Processor
Forum (if not, what prior disclosures or announcements
are planned)
It is the first public presentation at a congress in the US.
The F-CPU project was presented at two conferences in Berlin
(17C3 : http://www.ccc.de/congress/fahrplan/event/153.en.html
16C3 : http://www.ccc.de/events/congress99/doku/fcpu.html)
but those were only short technical introductions; this time,
some key details (see 3.) will be discussed.
5. For non-chip announcements, please provide details
on the product to be announced, its relevance to the
event, and particulars that will help evaluate the
presentation
Today, the F-CPU is the only configurable SIMD 64-bit
general-purpose RISC core aimed at high performance
applications (multimedia or scientific computations).
It is a very recent design (1999) originally meant to
provide a free, unencumbered alternative to the Merced.
Unlike the LEON, the F-CPU defines a new instruction
set and is designed to last much longer. It is "by design"
extensible and scalable in several ways, providing forward
and backward compatibility for several decades. The first
core (FC0) could be used as replacement for the aging MIPS,
SPARC, ALPHA...
Unlike the ARC Cores and similar "configurable core"
offerings, the F-CPU is not a "black box" but a completely
transparent design that gives the engineer complete
visibility. The F-CPU is built around a clean ISA and some simple
scheduling rules, leaving a lot of freedom to the implementor.
6. Other descriptive information about the presentation
{not defined yet. maybe a free distribution of CDROMs}
7. Company name
It is a community development on the Internet, not a company "product".
8. Speaker's name (should be the chief architect or lead
designer of the product)
Yann Guidon
9. Speaker's job title
Software Development Engineer at Mentor Graphics (that's how I
earn my living), main F-CPU contributor.
10. Speaker's biography (30 words or less)
Joined the F-CPU Design Team on the Internet in Dec. 1998,
influenced the design when the F-CPU ISA became RISC and
defined the FC0 in mid-1999; Mentor Graphics employee
since Nov. 2000.
11. Speaker's address, phone, fax, email address and
administrator (if any)
email : whygee@f-cpu.org (personal) and yann_guidon@mentor.com (job)
snail mail address :
Yann Guidon
13 rue Francois Couperin
93110 Rosny sous bois
France
phone : +33 1 64 86 62 07 (business)
12. Abstract title
proposed title :
Instruction scheduling and the memory interface of the F-CPU Core #0
13. P.R. or Marcom contact (include phone, fax, and email)
no marketing. See http://www.f-cpu.org or email me. I'll answer quickly,
since I'm an email freak.
14. Your abstract or outline should not exceed 1000 words.
This technical presentation will focus on one of the many particular
characteristics of the F-CPU, a superpipelined SIMD RISC CPU that is
being developed by a community on the Web and distributed under GPL.
The first implementation of the F-CPU is called FC0, and the strong
constraints of the F-CPU ISA have led to redesigning from scratch the
way instructions are decoded and memory is accessed. Although it
almost tastes like a classic "MIPS" RISC core, the memory units
are completely upside-down compared to usual designs. The
instruction path is uncommon and the instruction set is deeply
influenced by it. However, the number of critical or difficult cases
is greatly reduced, a strong memory protection policy is enforced,
and jumps/calls can be done with no overhead. Finally, the instruction
set is kept orthogonal and easy to schedule in almost any kind
of CPU core, even in the presence of faults and interrupts.
SYNOPSIS FOR THE PRESENTATION SLIDES
(will be converted to HTML & illustrated).
Slide 0 Logo
-------------
Slide 1 Title
---------------
"Instruction scheduling and the memory interface of the F-CPU Core #0"
Disclaimers :
* This is a technical (not marketing) overview of some major
characteristics of the F-CPU design philosophy and the
implementation in the FC0.
* It presents the backgrounds and design choices of an ongoing
(unfinished) work.
* Not all the details are presented here !
RTFM @ http://www.f-cpu.seul.org/manual
Slide 2 Short introduction to the F-CPU
----------------------------------------
F-CPU Project goals (voted 1999) :
"
To develop and make freely available an architecture, and all
other intellectual property necessary to fabricate one or more
implementations of that architecture, with the following priorities,
in decreasing order of importance:
1) Versatility and usefulness in as wide a range of applications
as possible
2) Performance, emphasizing user-level parallelism and derived
through intelligent architecture rather than advanced silicon process
3) Architecture lifespan and forward compatibility
4) Cost, including monetary and thermal considerations
"
In the end, the goal is not to make the fastest ever CPU, because it's an
endless race. The purpose of this project is more to design the
"coolest/sexiest" CPU possible.
Note the difference between "design/develop" and "build/market/sell/distribute..."
Slide 3 Short introduction to the F-CPU (2)
--------------------------------------------
Some features of the F-CPU Instruction Set Architecture :
* 32-bit instructions with 4 instruction formats only
[illustration of the formats]
* Instruction set census : ~110 reserved opcodes as of today,
with several variations, "core" and "optional" features.
* 64 registers (64-bit wide in the first implementation) with R0=0
* SIMD flag in most instructions allows parallel operation
on registers with "undetermined" width (32, 64, 128, 256 bits ...)
* Support for saturated arithmetic, IEEE floating point, IEEE LNS
and fractional integers (optional)
* 2r1w, 3r1w and 2r2w instructions, 3r2w register set
* memory : 3 "stream hint bits" + cacheability bit (7 "memory streams"
can be implemented -> more bandwidth because the different accesses
are separated)
* One addressing mode only for code and data : register
(post-increment possible for data)
example of some instructions : [follows the binary format]
Opcode Example Explicitly
- add add r1,r2,r3 r3=r1+r2
- load load r1,r2,r3 r3=[r2], r2+=r1
- jump if r1=0 jump r2 link r3 if cond, r3=PC, PC=r2
Slide 4 Starting point of the FC0 : Clock speed issue
------------------------------------------------------
Technology handicap ---> "carpaccio" CPU
* faster clock -> shorter Critical DataPath -> 6 4-input gates in the CDP,
or approximately 10 transistors between two pipeline flip-flops.
(This goal is here to emphasize the simplicity of the pipeline stages;
it's not a "golden rule" because a lot of other parameters must also be considered.)
* Each stage is deep enough to perform an 8-bit add/sub or a 64-bit increment.
When a unit becomes too complex, it must be split.
* CDP determined from the start :
- no "hidden CDP" that slows the clock down in the middle of the project
- very simple and easy design of the units (fortunately because
everything must be redesigned)
- emphasis on short stage pipelining for every part of the project
favours "matched" and balanced pipeline stage design.
- the number of pipeline stages is not the issue.
Slide 5 Variable latency
-------------------------
Every Execution Unit of the F-CPU is dedicated to
a certain kind of operation :
* ROP2 : bit-to-bit boolean operations : 1 cycle
* SHL : bit/byte shuffling, 1 or 2 cycles
* ASU : SIMD integer addition and subtraction with
carry and saturation : 2 cycles in the general case,
1-cycle for 8-bit operations
* IMU : SIMD integer multiply and MAC : 6 cycles / 64-bit (MR design)
* IDU, LSU, POPC, INC ...
EUs are "LEGO bricks" with a particular and deterministic latency.
They are all SIMD-capable and are duplicated when the register
width increases. The SIMD size and "chunk" size are implementation-dependent.
Slide 6 Connection of the EUs around the Xbar
---------------------------------------------
[drawing]
Slide 7 General layout of the FC0
----------------------------------
[drawing]
Slide 8 Scheduling of one instruction
--------------------------------------
[drawing]
Fetcher : 1 cycle (transparent) {bypass possible when cache miss}
decoder/R7 read : 1 cycle (more if stall)
Issue/Xbar read : 1 cycle (unless bypass possible)
EU : 1+ cycle (latency depends on the unit, see slide #5)
Xbar : 1 cycle (here we don't care about the scheduling...)
Register write : 1 cycle
Slide 9 Scheduling of one instruction (2)
------------------------------------------
hazard detection with a scoreboard
[drawing]
* Every register is associated with a "not ready" bit
* The bit is set when the instruction is issued
* The bit is reset when the scheduler commands a write
to the register.
* The decode pipeline is stalled until all the bits needed
by the current instruction are cleared.
* This scheme is scalable almost in O(n) with the number of
concurrently decoded/executed instructions
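The scoreboard bullets above can be sketched in a few lines of Python. This is a behavioural model only; class and method names are illustrative, not taken from the actual VHDL. Checking the destination register as well guards against write-after-write hazards.

```python
# Minimal model of a "not ready bit" scoreboard, as described on slide 9.
class Scoreboard:
    def __init__(self, nregs=64):
        self.not_ready = [False] * nregs  # one "not ready" bit per register

    def can_issue(self, srcs, dests):
        # Stall while any register needed by the instruction
        # (source or destination) still has its bit set.
        return not any(self.not_ready[r] for r in list(srcs) + list(dests))

    def issue(self, dests):
        for r in dests:
            self.not_ready[r] = True   # set when the instruction is issued

    def write_back(self, dests):
        for r in dests:
            self.not_ready[r] = False  # reset when the write is commanded
```

Each check is a simple wired-OR over a handful of bits, which is why the scheme scales almost linearly with the number of in-flight instructions.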
Slide 10 Scheduling of one instruction (3)
------------------------------------------
allocation of the Xbar slots with a FIFO
[drawing]
* The opcode and the flags of the current instruction are the
inputs of a hardwired lookup table. The output indicates
how many slots are required (1 or 2 write buses ?)
and how many cycles are required for the completion
of the operation.
* When the operation is issued (if no hazard is detected
on the scoreboard or in the FIFO), the selected
slot is filled with the number of the register that must
be written to.
* When the FIFO shifts down, the number reaches the bottom
and commands the control wires of the Xbar and the Register Set.
* The depth of the FIFO is 8 cycles. It corresponds to
the highest latency of the execution units (IMU: 6 cycles)
plus the 2 x Xbar cycles.
* An additional down-counter extends the FIFO for
the long, high-latency divide unit (non-pipelined version).
When the counter elapses, the behaviour is normal for the
rest of the FIFO (the register number is injected at the top
of the FIFO).
* Load instructions require special handling when the data
is not present in the LSU's buffer.
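A minimal model of the slot FIFO described above, assuming a single write bus and ignoring the divide-unit down-counter and load special cases (names are illustrative). Slot k holds the destination register that will need the Xbar/register-write port k cycles from now; an instruction can only issue if its completion slot is still free.

```python
# Sketch of the 8-deep write-slot FIFO from slide 10.
FIFO_DEPTH = 8  # worst EU latency (IMU: 6) plus the 2 Xbar cycles

class WriteFifo:
    def __init__(self):
        self.slots = [None] * FIFO_DEPTH  # slots[k] is used k cycles from now

    def slot_free(self, latency):
        return self.slots[latency] is None

    def issue(self, latency, dest_reg):
        # Latency comes from the hardwired opcode lookup table.
        assert self.slot_free(latency), "structural hazard: stall the decoder"
        self.slots[latency] = dest_reg

    def tick(self):
        # The FIFO shifts down; the bottom slot drives the control
        # wires of the Xbar and the register set write port.
        writeback = self.slots[0]
        self.slots = self.slots[1:] + [None]
        return writeback
```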
Slide 11 Memory protection
---------------------------
All the pointers are available from the Xbar and checked by a TLB
that translates the task's logical address into a physical address.
[drawing]
* TLB entries are SW-replaced through Special Registers with the help of LRU HW
* 4 page sizes : 4KB, 32KB, 256KB and 2MB, plus probably a set of
"fence registers" for larger pages.
* TLB entries contain cacheability information, access rights (R/W/RW/X),
access counter and subdomain usage counters, VMID tag ...
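A toy model of the translation step described above, assuming fully-associative, software-refilled entries. The field names and the rights encoding are assumptions for illustration; the real entries also carry cacheability bits, counters and a VMID tag.

```python
# Toy TLB lookup with the four page sizes listed on slide 11.
PAGE_SIZES = [4 << 10, 32 << 10, 256 << 10, 2 << 20]  # 4KB, 32KB, 256KB, 2MB

class TlbEntry:
    def __init__(self, vbase, pbase, size, rights):
        # size must be one of PAGE_SIZES; rights is a string like "RX".
        self.vbase, self.pbase, self.size, self.rights = vbase, pbase, size, rights

def translate(tlb, vaddr, access):
    """Return the physical address, or None on miss / rights violation."""
    for e in tlb:
        if e.vbase <= vaddr < e.vbase + e.size:
            if access not in e.rights:
                return None  # access-rights violation -> trap
            return e.pbase + (vaddr - e.vbase)
    return None  # TLB miss -> software refill through Special Registers
```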
Slide 12 Connection of the address bus
--------------------------------------
[drawing]
* If there is a TLB miss, the pointer register is marked as invalid
and a future use of this register as pointer will trigger a trap
at the decode stage.
* The address output by the TLB is compared to the address tags of the
Fetcher, the L/SU, the internal Icache and Dcache.
* If there is a cache or buffer hit
- the register is marked as valid
- the data is brought to the LSU or the Fetcher (depending on the
access type) if the data is not available in the required buffer.
Slide 13 LSU/Fetcher/Decoder coupling to detect faulty pointers
----------------------------------------------------------------
The R2 field of the instruction is fed to a set of 8*2 6-bit comparators.
[drawing]
* Each 256-bit line of the Fetcher and the LSU is tagged with an address and
the number of the associated register.
* The association of a line and a register is determined with two rules :
- LRU
- double-buffering (to avoid thrashing when scanning a large linear
region of memory) with a pair of lines
* When the register number is compared to all the entries, a signal is sent
to the decoder, saying whether the data is available or not.
- if the data is present AND the instruction is a load/store or jump
(plus all the other scheduling, scoreboarding etc. rules),
the instruction is issued and can complete in 2 cycles.
- if the data is not present in any buffer AND it is a load/store or jump,
a "prefetch" instruction is simulated in the decoder.
* the pointer's value is present on the Xbar because it is accessed
in parallel with the decode
* the pointer is checked in the TLB and the result goes the usual way.
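The register-tag match above can be sketched as follows. The line count and method names are assumptions; the real design runs the 8*2 6-bit comparators in parallel so the decoder gets the hit/miss signal in the same cycle.

```python
# Sketch of the LSU/Fetcher line buffers tagged with register numbers
# (slide 13): a hit tells the decoder the data is already on chip.
class LineBuffer:
    def __init__(self, nlines=8):
        # Each 256-bit line carries a (register number, address) tag.
        self.tags = [None] * nlines

    def hit(self, reg):
        # Models the parallel comparators on the instruction's R2 field.
        return any(t is not None and t[0] == reg for t in self.tags)

    def fill(self, line, reg, addr):
        # Line chosen by LRU / double-buffering rules in the real design.
        self.tags[line] = (reg, addr)

    def flush(self):
        # A task switch invalidates every register tag.
        self.tags = [None] * len(self.tags)
```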
Slide 14 Pipelining of a load instruction
------------------------------------------
Due to the relatively high latency of the load and store instructions,
some particular coding techniques must be used to get the full
performance out of the FC0.
* loop unrolling and interleaving (2x or 3x depending on the size
of the loop, the available registers...)
* explicit and aggressive prefetch. Example :
usual code :           F-CPU asm code :
                       loadimm @item1,r4          ; prefetch
                       load [r4],r0               ; prefetch (2)
load [item1],r1        loadi item2-item1,[r4],r1
load [item2],r2        loadi item3-item2,[r4],r2
load [item3],r3        loadi item4-item3,[r4],r3
* pointer duplication : a pointer with post-increment takes approx.
6 cycles to complete, other pointers must be used and interleaved
if the same memory block must be accessed within this 6-cycle period.
* Of course, a memory access without prefetch will work, but at the cost
of a substantial stall (this stall is unavoidable because a task switch
flushes the "register tags" in the LSU and the Fetcher)
* Similar rules apply for the jump/call/return instructions but the
constraints are looser (no pointer update usually).
Slide 15 Code example : vector loop
------------------------------------
These are examples of the typical code that runs best on the F-CPU :
1) Vector style
---------------
Pseudocode :
#define N 1024
char A[N], B[N], S[N]
loop (N)
S[i]=A[i]+B[i]
; LOOP PREPARATION :
; r4 = base address of A
; r6 = base address of B
; r8 = base address of S
load.s1 r4,r0 ; prefetch
load.s2 r6,r0 ; prefetch
load.s3 r8,r0 ; prefetch
get SR_MAX_SIZE,r1 ; r1 is loaded with 8, 16, 32...
; depending on the chip version (number of bytes per register)
get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
loadimm N/2,r3 ; N/2 because we unroll the loop twice
add r1,r4,r5 ; r5 = r4+max_size (pointer duplication)
add r1,r6,r7 ; idem
add r1,r8,r9 ; idem
shl 1,r1,r1 ; r1 = r1*2
shr r2,r3,r3 ; r3 = final loop count
loopentry r2 ; r2 = PC+4
; LOOP KERNEL :
loadf.s1 r1,r4,r10
loadf.s1 r1,r5,r12
loadf.s2 r1,r6,r11
loadf.s2 r1,r7,r13
sadd.8 r10,r11,r11 ; SIMD add on 8-bit chunks
sadd.8 r12,r13,r13
; stall
storef.s3 r1,r8,r11
storef.s3 r1,r9,r13
loop r2,r3 ; decrement r3 // jump to r2 if r3 is not zero
{remainder of the loop here, in case of an odd loop count}
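A pure-Python reference model of the sadd.8 operation used in the kernel above, assuming unsigned saturation on 8-bit chunks of a 64-bit register (the chunk width and signedness are implementation choices; this mirrors S[i] = A[i] + B[i] per byte).

```python
# Reference model of sadd.8: SIMD saturated add, byte by byte,
# over a 64-bit register value.
def sadd8(a, b):
    out = 0
    for i in range(8):
        x = (a >> (8 * i)) & 0xFF
        y = (b >> (8 * i)) & 0xFF
        s = min(x + y, 0xFF)   # saturate instead of wrapping around
        out |= s << (8 * i)
    return out
```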
2) Packed style
---------------
Pseudocode :
#define N 1024
struct {char A,B} M[N]
char S[N]
loop (N)
S[i]=M[i].A + M[i].B
; r4 = base address of A
; r6 = base address of S
load.s1 r4,r0 ; prefetch
load.s2 r6,r0 ; prefetch
get SR_MAX_SIZE,r1 ; r1 is loaded with 8, 16, 32...
; depending on the chip version (number of bytes per register)
get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
loadimm N,r7
add r1,r4,r5 ; r5 = r4+max_size (pointer duplication)
shr r2,r7,r7 ; loop count
shl 1,r1,r2 ; r2 = r1*2
loopentry r3
loadf.s1 r2,r4,r10
loadf.s1 r2,r5,r11
mixhi.8 r10,r11,r12
mixlo.8 r10,r11,r13
; stall
sadd.8 r13,r12,r12
; stall
storef.s2 r1,r6,r12
loop r3,r7 ; decrement r7 // jump to r3 if r7 is not zero
Slide 16 The future of the FC0
-------------------------------
* Adding new execution units : LNS Add/Sub/Conversion, IEEE floating point units,
Popcount/ECC ...
* adding other memory interfaces, e.g. DDR SDRAM
* Extension to 2-issue then 4-issue
* Split/pipelining of the decode/issue unit
* Coding constraint for a superscalar F-CPU :
2-issue : 64/(2*3)= 10 instructions without dependency
3-issue : 64/(3*3)= 7 ...
4-issue : 64/(3*4)= 5 <-- current limit of the FC0
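The instruction-window figures above are simply the 64 architectural registers divided by (issue width times the typical 3-cycle result latency), integer part taken; a sketch of that arithmetic (function name is illustrative):

```python
# How many independent instructions the code must provide so that a
# superscalar F-CPU never stalls: the 64 registers are shared among
# issue_width * latency in-flight destinations.
def window(issue_width, latency=3, nregs=64):
    return nregs // (issue_width * latency)
```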
Slide 17 Evolution of the F-CPU
--------------------------------
* explicit register renaming -> more usable registers
* Possible directions :
- SMT (Simultaneous Multi Threading),
- multi-core chips
- RISC->TTA translator
- eDRAM (embedded DRAM)
- F-BUS, ...
* New instructions : Scatter-Gather, Permute ...
* F-CPU II ?
Slide 18 Conclusion
--------------------
* URL of the slides http://www.f-cpu.de/epf2001/ (not yet created)
* Coming milestones : complete VHDL and manual v1 on CVS
* F-CPU official web site : http://www.f-cpu.org
* F-CPU manual available from http://www.f-cpu.seul.org