Slide 1  Title
---------------
 "Instruction scheduling and the memory interface of the F-CPU Core #0"

*  This is a technical (not marketing) overview of some major
characteristics of the F-CPU design philosophy and the
implementation in the FC0.

*  It presents the background and design choices of an ongoing
(unfinished) work.

*  Not all the details are presented here !
    RTFM @ http://www.f-cpu.seul.org/manual


Slide 2  Short introduction to the F-CPU
----------------------------------------
  F-CPU Project goals (voted 1999) :
"
     To develop and make freely available an architecture, and all
other intellectual property necessary to fabricate one or more
implementations of that architecture, with the following priorities,
in decreasing order of importance:
   1)  Versatility and usefulness in as wide a range of applications
       as possible
   2)  Performance, emphasizing user-level parallelism and derived
       through intelligent architecture rather than advanced silicon process
   3)  Architecture lifespan and forward compatibility
   4)  Cost, including monetary and thermal considerations
"

In the end, the goal is not to make the fastest ever CPU, because it's an
endless race. The purpose of this project is more to design the
"coolest/sexiest" CPU possible.


Slide 3  Short introduction to the F-CPU (2)
--------------------------------------------
  Some features of the F-CPU Instruction Set Architecture  :

* 32-bit instructions with 4 instruction formats only
   [illustration of the formats]
* Instruction set census : ~110 reserved opcodes as of today,
  with several variations, "core" and "optional" features.
* 64 registers (64-bit wide in the first implementation) with R0=0
* SIMD flag in most instructions allows parallel operation
  on registers with "undetermined" width (32, 64, 128, 256 bits ...)
* Support for saturated arithmetic, IEEE floating point, IEEE LNS
  and fractional integers
* 2r1w, 3r1w and 2r2w instructions
* memory : 3 "stream hint bits" + cacheability bit
* One addressing mode only for code and data : register
   (post-increment possible for data)

example of some instructions :
  Opcode Example                  Effect
- add    add r1,r2,r3             r3=r1+r2
- load   load r1,r2,r3            r3=[r2], r2+=r1
- jump   if r1=0 jump r2 link r3  if cond, r3=PC, PC=r2


Slide 4  Starting point of the FC0 : Clock speed issue
------------------------------------------------------
  Technology handicap ---> "carpaccio" CPU

* faster clock -> shorter Critical Datapath -> 6 4-input gates in the CDP,
   or approximately 10 transistors between two pipeline flip-flops.
   (I know, Nicolas, it's a heresy but you know the arguments...
    And this rough design rule maps so easily onto FPGAs :-D)

* the stage is deep enough to perform an 8-bit add/sub or a 64-bit increment.

* CDP determined from the start :
   - no "hidden CDP" that slows the clock down in the middle of the project
   - very simple and easy design of the units
   - emphasis on short stage pipelining for every part of the project
     favours "matched" and balanced pipeline stage design.


Slide 5  Variable latency
-------------------------
  Every Execution Unit of the F-CPU is dedicated to
a certain kind of operation :

* ROP2 : bit-to-bit boolean operations : 1 cycle
* SHL  : bit/byte shuffling, 1 or 2 cycles
* ASU  : SIMD integer addition and subtraction with
   carry and saturation : 2 cycles in the general case,
   1 cycle for 8-bit operations
* IMU  : SIMD integer multiply and MAC : 6 cycles (MR design)
* IDU, LSU, POPC, INC ...

EUs are "LEGO bricks" with a particular and deterministic latency.
They are all SIMD-capable and are duplicated when the register
width increases.


Slide 6  Connection of the EUs around the Xbar
----------------------------------------------

[drawing]


Slide 7  General layout of the FC0
----------------------------------

[drawing]



Slide 8  Scheduling of one instruction
--------------------------------------

[drawing]

Fetcher : 1 cycle
{bypass possible}
decoder : 1 cycle (more if stalled)
Xbar read : 1 cycle (unless bypass possible)
EU : 1+ cycles (latency depends on the unit)
Xbar : 1 cycle
Register write : 1 cycle
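As a rough back-of-the-envelope model, the latency of one instruction is the sum of the stage costs listed above. A Python sketch (the function name is illustrative; EU latencies are taken from slide 5; no stall is assumed):

```python
# Sketch: FC0 single-instruction latency as the sum of the stage
# costs above (assumption: no stall; EU latencies from slide 5).
EU_LATENCY = {"ROP2": 1, "SHL": 2, "ASU": 2, "IMU": 6}  # cycles

def instruction_latency(eu, bypass=False):
    """Cycles from fetch to register write for one instruction."""
    fetch = 1
    decode = 1                      # more if the decoder stalls
    xbar_read = 0 if bypass else 1  # skipped when a bypass is possible
    xbar_write = 1
    reg_write = 1
    return fetch + decode + xbar_read + EU_LATENCY[eu] + xbar_write + reg_write

assert instruction_latency("ROP2") == 6   # shortest non-bypassed path
assert instruction_latency("IMU") == 11   # multiply is the longest EU
```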


Slide 9  Scheduling of one instruction (2)
------------------------------------------
 hazard detection with a scoreboard

[drawing]

* every register is associated with a "not ready" bit
* the bit is set when an instruction that writes the register is issued
* the bit is cleared when the scheduler commands a write
  to the register.
* The decode pipeline stalls until all the bits needed by
  the current instruction are cleared.
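A minimal software model of this scoreboard (a Python sketch; class and method names are illustrative, not taken from the VHDL):

```python
class Scoreboard:
    """One 'not ready' bit per register (64 registers, R0 always ready)."""
    def __init__(self, nregs=64):
        self.not_ready = [False] * nregs

    def issue(self, dest):
        """Set the bit of the destination register at issue time."""
        if dest != 0:                 # R0 is hardwired to zero
            self.not_ready[dest] = True

    def write_back(self, dest):
        """Clear the bit when the scheduler commands the write."""
        self.not_ready[dest] = False

    def can_issue(self, *regs):
        """The decoder stalls while any needed register is not ready."""
        return not any(self.not_ready[r] for r in regs)

sb = Scoreboard()
sb.issue(3)                  # e.g. add r1,r2,r3 in flight
assert not sb.can_issue(3)   # a consumer of r3 must stall
sb.write_back(3)
assert sb.can_issue(3)       # the write happened: it can issue now
```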


Slide 10  Scheduling of one instruction (3)
-------------------------------------------
 allocation of the Xbar slots with a FIFO

[drawing]

* the opcode and the flags are the inputs of
  a hardwired lookup table. The output indicates
  how many slots are required (1 or 2 write buses ?)
  and how many cycles are required for the completion
  of the operation.
* When the operation is issued (if no hazard is detected on the
  scoreboard or in the FIFO), the selected slot is filled with
  the number of the register that must be written to.
* When the FIFO shifts down, the number reaches the bottom
  and commands the control wires of the Xbar and the Register Set.
* The depth of the FIFO is 8 cycles. It corresponds to
  the highest latency of the execution units (IMU: 6 cycles)
  plus the 2 Xbar cycles.
* An additional down-counter extends the FIFO for
  the long, high-latency divide unit (non-pipelined version).
  When the counter has elapsed, the register number is injected
  at the top of the FIFO and the behaviour is normal for the
  rest of the FIFO.
* Load instructions require special handling when the data
  is not present in the LSU's buffer.
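The mechanism can be modelled in a few lines. This Python sketch only covers the basic case (the down-counter extension, the two-write-bus case and the load special case are left out, and all names are illustrative):

```python
DEPTH = 8  # highest EU latency (IMU: 6 cycles) + 2 Xbar cycles

class WriteFifo:
    """Each slot holds the destination register number that will be
    written when the slot reaches the bottom (illustrative model)."""
    def __init__(self):
        self.slots = [None] * DEPTH

    def can_allocate(self, latency):
        """Hazard check: is the slot for this completion cycle free?"""
        return self.slots[latency - 1] is None

    def allocate(self, latency, dest_reg):
        """Fill the selected slot with the register to be written."""
        assert self.can_allocate(latency)
        self.slots[latency - 1] = dest_reg

    def tick(self):
        """Shift down; the bottom entry drives the Xbar and
        Register Set control wires."""
        bottom, self.slots = self.slots[0], self.slots[1:] + [None]
        return bottom

f = WriteFifo()
f.allocate(2, 5)          # e.g. an ASU op writing r5 in 2 cycles
assert f.tick() is None   # nothing completes this cycle
assert f.tick() == 5      # write to r5 is commanded now
```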


Slide 11  Memory protection
---------------------------
  All the pointers are available from the Xbar and checked by a TLB
that translates the task's logical address into a physical address.

  [drawing]

* TLB entries are SW-replaced through Special Registers.
* 4 page sizes : 4KB, 32KB, 256KB and 2MB, plus probably a set of
   "fence registers" for larger pages.
* TLB entries contain cacheability information, access rights (R/W/RW/X),
   access counter and subdomain usage counters, VMID tag ...
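A toy model of the lookup (Python; the entry tuple and the linear search are assumptions for illustration only, and real entries also carry the attributes listed above):

```python
# The four F-CPU page sizes, in bytes.
PAGE_SIZES = [4 << 10, 32 << 10, 256 << 10, 2 << 20]  # 4KB..2MB

def translate(tlb, vaddr):
    """tlb: list of (vpage_base, size, ppage_base) entries.
    Returns the physical address, or None on a TLB miss
    (which triggers a software refill through Special Registers)."""
    for base, size, pbase in tlb:
        if base <= vaddr < base + size:
            return pbase + (vaddr - base)
    return None

tlb = [(0x40000, 4 << 10, 0x900000)]        # one 4KB page mapped
assert translate(tlb, 0x40010) == 0x900010  # hit: same offset in page
assert translate(tlb, 0x50000) is None      # miss
```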


Slide 12  Connection of the address bus
---------------------------------------

  [drawing]

* If there is a TLB miss, the pointer register is marked as invalid
  and a future use of this register as pointer will trigger a trap
  at the decode stage.

* The address output by the TLB is compared to the address tags of the
  Fetcher, the L/SU, the internal Icache and Dcache.

* If there is a cache or buffer hit
   - the register is marked as valid
   - the data is brought to the LSU or the Fetcher (depending on the
     access type) if it is not already in the required buffer.


Slide 13  LSU/Fetcher/Decoder coupling to detect faulty pointers
----------------------------------------------------------------
  The R2 field of the instruction is fed to a set of 8*2 6-bit comparators.

[drawing]

* Each 256-bit line of the Fetcher and the LSU is tagged with an address and
the number of the associated register.
* The association of a line and a register is determined by two rules :
   - LRU
   - double-buffering (to avoid thrashing when scanning a large linear
       region of memory) with a pair of lines
* When the register number is compared to all the entries, a signal is sent
  to the decoder, saying whether the data is available or not.
   - if the data is present AND the instruction is a load/store or jump
     (plus all the other scheduling, scoreboarding etc. rules),
     the instruction is issued and can complete in 2 cycles.
   - if the data is not present in any buffer AND it is a load/store or jump,
     a "prefetch" instruction is simulated in the decoder.
       * the pointer's value is present on the Xbar because it is accessed
         in parallel with the decode
       * the pointer is checked in the TLB and the result goes the usual way.
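In software terms, the comparator bank behaves like this (a Python sketch; the class is illustrative and only models the hit/miss signal sent to the decoder):

```python
class LineTags:
    """Register-number tags of the 8 buffer lines (Fetcher or LSU).
    Illustrative model: only the 6-bit register field is compared."""
    def __init__(self, nlines=8):
        self.reg_of_line = [None] * nlines

    def hit(self, r2_field):
        """Parallel compare against all lines: is the data pointed
        to by this register already buffered?"""
        return r2_field in self.reg_of_line

lsu, fetcher = LineTags(), LineTags()
lsu.reg_of_line[0] = 4                    # line 0 holds data pointed by r4
assert lsu.hit(4)                         # a load via r4 can issue
assert not fetcher.hit(4)                 # a jump via r4 would prefetch
```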


Slide 14  Pipelining of a load instruction
------------------------------------------
  Due to the relatively high latency of the load and store instructions,
some particular coding techniques must be used to benefit from the FC0's
performance.

* loop unrolling and interleaving (2x or 3x depending on the size
  of the loop, the available registers...)
* explicit and aggressive prefetch. Example :
   usual code :                asm code :
                                 loadimm @item1, r4   ; prefetch
                                 load [r4],r0         ; prefetch (2)
      load [item1],r1            loadi item2-item1,[r4], r1
      load [item2],r2            loadi item3-item2,[r4], r2
      load [item3],r3            loadi item4-item3,[r4], r3

* pointer duplication : a pointer with post-increment takes approx.
   6 cycles to complete, other pointers must be used and interleaved
   if the same memory block must be accessed within this 6-cycle period.

* Of course, a memory access without prefetch will work, but at the cost
  of a substantial stall (this case is unavoidable because a task switch
  flushes the "register tags" in the LSU and the Fetcher)

* Similar rules apply for the jump/call/return instructions but the
  constraints are looser (no pointer update usually).


Slide 15  Code example : vector loop
------------------------------------
   These are examples of the typical code that runs best on the F-CPU :

   1) Vector style
   ---------------

Pseudocode :
   #define N 1024
   char A[N], B[N], S[N]
   loop (N)
     S[i]=A[i]+B[i]


; LOOP PREPARATION :
   ; r4 = base address of A
   ; r6 = base address of B
   ; r8 = base address of S
   load.s1 r4,r0 ; prefetch
   load.s2 r6,r0 ; prefetch
   load.s3 r8,r0 ; prefetch
   get SR_MAX_SIZE,r1     ; r1 is loaded with 8, 16, 32...
    ; depending on the chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N/2,r3         ; N/2 because we unroll the loop twice
   add r1,r4,r5           ; r5 = r4+max_size (pointer duplication)
   add r1,r6,r7           ; idem
   add r1,r8,r9           ; idem
   shl 1,r1,r1            ; r1 = r1*2
   shr r2,r3,r3           ; r3 = final loop count
   loopentry r2           ; r2 = PC+4
; LOOP KERNEL :

     loadf.s1 r1,r4,r10
     loadf.s1 r1,r5,r12
     loadf.s2 r1,r6,r11
     loadf.s2 r1,r7,r13
     sadd.8  r10,r11,r11 ; SIMD add on 8-bit chunks
     sadd.8  r12,r13,r13
     ; stall
     storef.s3 r1,r8,r11
     storef.s3 r1,r9,r13

   loop r2,r3      ; decrement r3 // jump to r2 if r3 is not zero

   {remainder of the loop here, in case of an odd loop count}
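For reference, the lane-wise behaviour of sadd.8 in the kernel above can be sketched as follows (Python; the unsigned interpretation and the 255 clamp are assumptions for the illustration, saturation itself is what the instruction provides):

```python
def sadd8(a, b):
    """SIMD saturated add on 8-bit chunks: each lane is summed
    independently and clamped at 255 instead of wrapping around."""
    return [min(x + y, 255) for x, y in zip(a, b)]

assert sadd8([1, 250], [2, 10]) == [3, 255]   # second lane saturates
```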


   2) Packed style
   ---------------

Pseudocode :
   #define N 1024
   struct {char A,B} M[N]
   char S[N]
   loop (N)
     S[i]=M[i].A + M[i].B


   ; r4 = base address of A
   ; r6 = base address of S
   load.s1 r4,r0 ; prefetch
   load.s2 r6,r0 ; prefetch
   get SR_MAX_SIZE,r1     ; r1 is loaded with 8, 16, 32...
    ; depending on the chip version (number of bytes per register)
   get SR_LOG_MAX_SIZE,r2 ; ie r2 = 3 if MAX_SIZE=8 (64-bit)
   loadimm N,r7
   add r1,r4,r5           ; r5 = r4+max_size (pointer duplication)
   shr r2,r7,r7           ; loop count
   shl 1,r1,r2            ; r2 = r1*2 (stride over the packed pairs)

   loopentry r3
     loadf.s1 r2,r4,r10
     loadf.s1 r2,r5,r11
     mixhi.8 r10,r11,r12
     mixlo.8 r10,r11,r13
     ; stall
     sadd.8 r13,r12,r12
     ; stall
     storef.s2 r1,r6,r12
   loop r3,r7      ; decrement r7 // jump to r3 if r7 is not zero

Slide 16  The future of the FC0
-------------------------------

* Adding new execution units : LNS Add/Sub/Conversion, IEEE floating point units,
  Popcount/ECC ...
* adding other memory interfaces, e.g. DDR SDRAM
* Extension to 2-issue then 4-issue
* Split/pipelining of the decode/issue unit
* Coding constraint for a superscalar F-CPU :
   2-issue : 64/(2*3) = 10 instructions without dependency
   3-issue : 64/(3*3) = 7 ...
   4-issue : 64/(4*3) = 5   <-- current limit of the FC0
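The rule of thumb above follows directly from the register budget (a Python sketch; the 3-registers-per-instruction figure is the slide's own assumption):

```python
def max_independent(issue_width, regs=64, regs_per_insn=3):
    """How many mutually independent instructions fit in the register
    file: each in-flight instruction ties up ~3 of the 64 registers."""
    return regs // (regs_per_insn * issue_width)

assert max_independent(2) == 10
assert max_independent(3) == 7
assert max_independent(4) == 5   # current limit of the FC0
```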


Slide 17  Evolution of the F-CPU
--------------------------------

* explicit register renaming -> more usable registers

* Possible directions :
   - SMT (Simultaneous Multi Threading),
   - multi-core chips
   - RISC->TTA translator
   - eDRAM (embedded DRAM)
   - F-BUS, ...

* New instructions : Scatter-Gather, Permute ...


Slide 18  Conclusion
--------------------

*  URL of the slides  http://www.f-cpu.de/epf2001/ (not yet created)
*  Coming milestones : complete VHDL and manual v1 on CVS
*  F-CPU official web site : http://www.f-cpu.org
*  F-CPU manual available from http://www.f-cpu.seul.org
