

     WHYGEE's processor (Gamma):
  mail me here: whygee@mime.up8.edu



rev. 2: Thu Feb  4 02:24:09 CET 1999
 changed the instruction coding of the size field
 and the I/D regs address

rev. 3: Wed Feb 10 17:51:20 CET 1999
 added the inspiration paragraph and the address segmentation

rev. 4: Thu Mar  4 03:25:31 CET 1999
 added CLR, SERIALIZE, ENDSTREAM, RDSPR, WRSPR...

rev. 5: Mon Mar 15 04:13:46 CET 1999
 added some cra(z)yisms anyway...



1) Foreword
-----------

This manual gives hardware and software
developers some background on the ideas behind
the Gamma architecture. No specific implementation
is described, but the main architectural
specificities are outlined (not in much detail
though). I don't claim to address
all the ideas, but the most important ones
are here.

It is expected that the reader has some
knowledge of, and insight into, several recent
computers and microprocessors.

This is a personal design. It may become
whatever you want, but I made this design alone
and it will remain personal unless/until
I am explicitly asked otherwise.
I will continue to work on this project,
whether or not it becomes a hardware GNU project,
as I have for years.

Objective thoughts are welcome,
flames are moved to /dev/null.


2) The need for a programming model
-----------------------------------

Where do we start building a processor ?
The programming model (the instruction set,
what it does and how) is my answer.
This ensures that the whole design, the whole
project, the whole family, the whole stuff
in one word, isn't stuck on a silly detail.
That is : an "implementation" is just a
"specific case of the programming model".
It also means that the "implementation
independence" of some features guarantees a
certain degree of freedom to the hardware
designer.
For me, a "programming model" is more complete
than an "ISA" (Instruction Set Architecture)
because it also contains the WAY to use the
entire processor. It is more application-
oriented.

Let's look back :

One purpose of this processor is to last
as long as possible, longer than any other known
architecture. (Why not ? It won't aim high
if we aim at our feet.)

Thus, it will be implemented
in many different processes and core
architectures, will be emulated, and will have
to stand technology's unforeseen evolutions.
In any case, it will have to be as powerful
as possible. If it is a "naughty hack" at some
point, the following generations become huge
kludges (we all know examples).

If we defined an architecture right now,
the processor's ability to scale in the
future would be reduced. That's why I start
the work with a programming model that
defines the mechanisms available to the user,
the instructions, what they do, how...
The big constraint is to let the processor's
performance scale without pushing the fab
process, while keeping the background idea of
analyzing the data and instruction streams.

A specific implementation can be built over this
programming model by taking specific constraints
into account. This implementation will
introduce specific coding rules so as to run at
full speed, but the general rules of the model
must be respected so as to ensure backward AND
forward compatibility between different
implementation families and their members.
This binary compatibility goal can sound
silly in the era of open source software,
but it ensures that a new compiler
won't have to be written for each specific
implementation (amongst other reasons). After all,
binary compatibility is not a nightmare if the
processor is carefully designed.


3) General overview
-------------------

The Gamma, as I call it (don't ask me why,
it's a code name), is a kind of modified
RISC processor. Over the years the main
characteristics have evolved, changed and mutated,
but the main outlines are rather stable:

It is a "prefetch-queue (PFQ) based" processor
with 64 General Purpose Registers (GPR) and
32 PFQs that are shared between instruction and
data fetch. The rest of the registers are
SPRs (Special Purpose Registers) which are
accessed through special instructions.


Thanks to the PFQs, the Gamma is a truly
register-to-register machine: there are
no load and store instructions. Each PFQ has
two of its registers, Data and Index, accessed
as "normal" registers, which allows address
and data manipulations to be performed with common
instructions. Furthermore, the 32 PFQs boost
the (virtual) on-chip memory bandwidth and
reduce the instruction count for regular access
patterns. The PFQs include a 32-word buffer
that is PART of the programming model. These
buffers are a kind of L0 cache, and play the
same role as the "reordering buffer" in
modern processors. They allow several structures
or algorithms to be handled almost transparently
(FIFOs (buffers), LIFOs (stacks), autoincremented
pointers, circular buffers...) with simple dedicated
hardware.

Instructions that define an address can use more
bits of the opcode than in usual load-store
architectures. And since the same instructions
serve for both data and instructions, the scheme
is more flexible.

The choice of a PFQ-oriented architecture implies
that the algorithms must take prefetches into account.
The rule is : "the earlier the better".
As soon as an address is "known by the user"
(that is : the address is predefined or can be computed)
and a PFQ is free, the read orders should be issued
with an explicit prefetch instruction.
This strategy is valid even for scattered memory blocks,
because the autoincrement/autodecrement (or other
"features") modify the index each time the DATA register
is accessed inside a memory block, thus reducing the
data access overhead for large blocks.

What does "explicit prefetch" mean ? It simply pushes
some complexities (such as pipeline hazards) to
software. One of the first intents was to find a solution
to the delayed branch problem, because the slot size
has different impacts in different architectures.
Another important issue is to keep the architecture
"predictable" as it grows (more units, more instructions
per cycle, lower CPI). This "hint" strategy has little
impact on existing algorithms, but rewriting them to take
all the addressing facilities into account can greatly
speed them up, in the same way as when they are rewritten
for a vector processor, for example.


Two "operating modes" are intended : raw addressing
mode and protected mode. Raw addressing mode is for
single-task or OS code that uses the memories
linearly. Protected mode ensures strict checking
of pointers for multi-tasking environments, and
provides a virtual addressing mechanism.


4) Inspiration
--------------

The Gamma is a personal design, not an adaptation
of another processor, even though it is influenced
by several designs (existing or prototypes),
ideas and methodologies. Let's say that I took
the best of existing architectures as a working
basis, and integrated into it my vision of how
it should be.

The first influence is the RISC methodology,
in the light of its present-day developments.
The concurrent development of the compiler, the
simulator and the chip's architecture emphasizes
the fundamental interaction of hardware and software.

I have not been influenced by crayisms because
the way to solve the problem (performance) is
not the same (a VLSI CMOS CPU vs the massive use of
expensive gallium arsenide chips). A VLSI CPU doesn't
allow the same kind of parallelism as a multi-chip
processor, so the methodology is different.
The purpose is also different : the Gamma is a
scalable, general-purpose architecture that
will find a place in a personal computer box,
rather than an ALPHA, POWER or similar chip.

Anyway, even though the purpose and the methodology
are different, the goal is still the same:
             PERFORMANCE
and I sometimes found some strange similarities, for
example (3/1999) with the CDC6600, which used
pairs of data and index registers. At that time (1964),
the main memory was slow magnetic core, and today
we have about the same clock ratio between the CPU
and the memory. This is a sign that "explicit prefetch",
like any other "hinting" method, is not as dangerous as it
seems.
The difference with the CDC6600 is that the PFQs are
homogeneous (not specialized in read or write, nor in
instructions or data) and have more space (32 PFQs * 32
words of 32 bits) since transistors are less expensive now :-)

Another similarity can be seen in the CRAY T3E "stream
buffers" that sit alongside the L2 cache of its EV5 ALPHA
processors. While a normal "cache" memory is OK for
random-order data access (like OS data or linked lists),
the stream buffers prefetch a lot of contiguous memory
blocks in advance when two consecutive L2 cache misses
occur. This is an interesting enhancement, particularly
for vector-based code.
The difference with the T3E is that the PFQs are prepared
in advance, before the data is actually used by the
processor units: we avoid the overhead of the two initial
cache misses. Furthermore, the stream buffers "second-guess"
the behaviour of the program, while a PFQ can be precisely
controlled with dedicated instructions and is really
more "predictable". The drawback is that the compiler
must be rather smart.


One of the most important hidden side effects of
"explicit prefetch" is that it reduces instruction
latency. This is particularly important
in Out-Of-Order (OOO) architectures where each
instruction is buffered and kept until completion.

Let's take for example " load r1,(r2) " on a load-store
processor. It reads r2, fetches some data from memory,
then writes the result to r1. If the data is in the on-chip
cache, let's say it takes two processor cycles (for the data
fetch alone). But if it is in the main memory (and other
flows, like DMA, access this main memory at the same time),
it takes tens of clock cycles before the instruction completes.
Meanwhile, the CPU core "speculatively" executes, to a certain
degree, the instructions it can execute, until the instruction
buffer (or any other buffer, the tiniest anyway) is saturated.
In this case, the "load" instruction delays the other
instructions and keeps them from executing. Later, even if
no page fault occurs, who knows, the "retire unit" will be
saturated too.

Moving the memory fetch backwards helps reduce this bottleneck.
The policy is : we ask the memory interface unit to load a buffer
starting at the specified address, and to take care of the details
(cache miss, page fault or even peripheral emulation).
This is performed with the "prefetch" instruction (like "PF Q1,R1").
Later, the actual instruction that uses the DATA
register checks the specified PFQ's state (a matter of a few
logic levels). If the data is not ready, the instruction is not
issued to the execution units. If there is an error, this
instruction triggers a trap. The instruction buffer can be kept
smaller than in a normal architecture because the memory-bound
instructions suffer less from the memory latencies.
Just another detail : if a page fault occurs during the memory
fetch, a "load" instruction must be "restarted".
PFQ-based code doesn't need this.



Several existing chips and their methodology have
influenced the instruction coding and the overall
architecture: the early MIPS family, the DEC ALPHA
and similar designs. The 3-operand, fixed 32-bit
instruction format, for example, is absolutely necessary.
The necessity for the architecture and the instruction
set to support several concurrent execution units,
thus easing the instruction decoding and avoiding
"complex and blocking resources", led to some choices
like : all registers other than the GPRs and the PFQ
registers are SPRs, and each GPR has its own dedicated
condition code bits.

Several new designs use a VLIW approach so as to execute
several instructions per cycle (IBM, TI TMS320C6x or
HP/INTEL IA64). Several independent instructions
are packed into a large (around 256 bits) word
and fed in parallel to dedicated (slot-dependent)
execution units. That is the outline.
The main problem of this approach, when applied to a
project like the Gamma, is that it doesn't actually
solve the problem of the instruction decoder
and the data/register dependency check. Why ?
Because the Gamma family can have a wide range of
instruction decoding rates ("at least one per cycle"
doesn't say how many more). If a designer has the
technology to decode and execute, say, 5 instructions
per cycle, a design stuck with 4 instruction slots
per word would under-use the chip's capacity. The designer
would thus have to design a special decoder to translate
the 4-instruction packets into wider packets.
We see that the first purpose (simplifying the instruction
decoder) is not flexible enough and can lead in the future
to more complex problems. In that case it is wiser
to use a simple 32-bit instruction format that is
easy to decode in any architecture.


Another kind of processor influenced the design:
the DSPs, special processors designed
around the algorithms they run, like FFTs or filters.
Their architecture allows them to perform at least two
memory fetches, decode one instruction, perform one multiply
and one add, all in one clock cycle with simple, cheap
and power-saving hardware. They have a sophisticated address
generator that computes two addresses every cycle,
independently from the instruction pointer. They are designed
for low power consumption and computation-intensive use
in consumer, industrial or military applications.
They have a hardware loop mechanism for zero-overhead,
nested block loops. The PFQs are a way to perform them.

But the goals of the project needed specific solutions.

The 64 GPRs are shared between floating-point and
fixed-point numbers, so as to have enough registers in
any case (if suddenly we need more int or FP). Likewise,
the PFQs are not (yet) assigned to data or instruction fetch,
in case one needs more data than instructions or vice versa.
The reason :
In general, peak performance can only be reached when all
the available resources are used. If part of the resources
is locked for a certain use that is not part of the peak,
then this "dead hardware" has no reason to be.
The Gamma is a performance-oriented processor; it must be
able to concentrate all its forces towards the goal.
The same kind of thinking is behind the choice of the
execution units. The number of identical units (say, FP
multiply) is not chosen from average use statistics, but from
"how many multiplies can be needed in one cycle
at peak rate". We all know that a lack of resources can
reduce performance in special cases (the crucial cases...);
that's why the Gamma is so "generous".


When needed, the register set could be split into
8 sets of 8 registers, so as to speed up the clock rate,
increase the data locality and allow 8 execution units.

The PFQs are a way to avoid complex cache memories and
have a much higher core speed than the main memory.
Compared to the DSP's Address Generators, the PFQs are
like "Data Generators".

A general guideline is that "every resource must have
a state and must be controllable". "Resource" means
everything, from the cache memory to the PFQs themselves.
This guideline ensures that an algorithm runs in
a predictable fashion, and the CPU remains flexible.



5) sizeof(int)
--------------

The usual word sizes are 8, 16, 32 and 64 bits.
This requires two bits in each instruction
referencing a register's contents.

               BUT

Since it will have to stand many decades of evolution,
there will be a point where any fixed word size is not
enough. As with the PDP8, there comes a time when there
are not enough address lines. The same thing slowly
happens to data. I have chosen to make the word size
completely implementation dependent; the meaning of the
size field is user-definable as a power of two of bytes.
That way it is easy to design algorithms
that run as fast as each implementation allows.

"more-than-64 bit words" may sound like an overkill
for most applications. Anyway,
increasing the data width is one of the ONLY ways to
increase the overall performance, and SIMD data
become more common today : the very long words
contain several data and are treated parallely
with a single instruction. This is what HP, Intel,
Sun, Motorola, ADi etc. already do.
Using large words and SIMD execution also requires that
the algorithms must be adapted or rewritten.
By keeping the word sizes in registers, we allow
programs to execute as fast as the machine allows.

It is a byte addressing processor : a pointer has a
byte granularity. When it is incremented by two,
two bytes are skipped.


In the simulator files, any number is treated
as a byte array, first for scalability, and second
(and more important) for portability,
since many machines have a different sizeof(int)
and endianness. The simulations are thus accurate as
long as no 'int' is used, except for the data references
which need a cast to int so as to access the real memory.
Simulations are not really fast, but that's not the goal.

The assembler, because of BISON's and FLEX's internal
machinery, has no built-in support for large numbers.
Assembling files will be as accurate as the platform is.
Anyway, there are few risks in common cases because
- assembly files are not very large
- the instructions are 32 bits wide and the immediate
   values are less than 32 bits wide.
If there is a need for it, I could add undefined-size
word support to the BISON and FLEX scripts.


Oh ! one last remark : because of this undefined word
size, the Gamma is inherently, purely LITTLE ENDIAN.
This had to be clearly written.


6) the Prefetch Queues
----------------------

The PFQs are a fantastic tool for accessing instructions
and data, and the Gamma is the testbench of this "new"
way of viewing memory and anything else. We almost
always forget that, at the beginning, it was a kind of
enhanced instruction prefetch mechanism.

The PFQs turn any data or instructions into flows
that the hardware handles in batches. These flows
can come from or go to anywhere. The primary intent
was to access the cache memory and the main memory; then
it appeared that they could also point to a hard disk
drive, a network adapter board, another PFQ, a UNIX pipe,
whatever you can imagine that can be handled as a data
or instruction flow. Each PFQ has a dedicated buffer
which can contain about 32 instructions and
can be accessed as fast as a normal cache line,
except that the integrated automatic address generators
actually prepare data in advance, and take care of
data alignment.

PFQs were born from the need to explicitly prefetch
instructions, as a flexible alternative to the
delayed branches of the early RISC CPUs.
Indeed, delayed branches are deeply "implementation
dependent", and the size of the slot changes
when the processor evolves. Now, with superscalar
organizations, no early RISC architecture benefits
from them.

The solution I have found is rather simple :
an instruction prepares the branch target as soon
as the address is known. The address is decoded,
the instructions are fetched, speculatively decoded
and executed, while the main instruction flow goes on
normally. Then, when the branch decision must be taken,
the processor switches to the second instruction flow
with almost no overhead. If the branch is not taken,
the speculative execution can be discarded when the
operands change, or stay in the pipes, waiting for
another branch instruction to take place.

For example, we could write:
PF q2, @label_1
 ....
 ....
IF condition1 JPF q2
 ....
IF condition2 JPF q2

"PF" prefetches the queue q2 with the address pointing
to label_1. The conditional jumps can simply specify
when to jump in the instruction flow, and since the queue
is prepared, any number of jump instructions can use it.
The same approach is also valid for function calls :
once the prefetch queue is setup, it can be used several
times.

The jumps, conditional or not, do not specify the address,
and can therefore include more useful fields. We can tell
the processor how likely the branch is to be taken (branch
"hint"), or whether to copy the current queue to another
for function calls or long loops. PFQs spare us a complex
dynamic branch prediction mechanism, and reduce the cost
of a misprediction (a parallel can be made here with the
AM29000 and its branch target buffers).


Having a prefetch buffer that is part
of the programming model also allows us to make tight
loops or instruction skips in hardware, for example.
Care must be taken because loops can be nested,
interrupted and restarted without the need for a loop
stack (unlike in DSPs), so a slightly different approach
must be used.

When the loop doesn't fit into the PFQ, low-overhead
looping is still possible : we can use a queue to store
the beginning of the loop. Before entering it, an
instruction marks the loop start by copying the current
PFQ to another. At the end of the loop, the conditional
branch specifies the PFQ from which the current PFQ must
be reloaded.

COPY Q2,Q1
(loop entry point :)
 ....
 ....
IF condition JPF_COPY Q2 (copies Q2 into Q1 (CPFQ) and executes)

Since there are at least two different instruction
flows in the CPU core during an explicit prefetch,
it becomes possible to speculatively execute the
instructions that arrive in the other buffer. There has
to be some priority between the different
instruction flows, but when the main flow stalls
the others can execute. And if different flows can
be executed concurrently, why should they belong
to the same thread ? Yes, this was SMT, or "simultaneous
multi-threading", before DEC published its studies.
SMT not only boosts multi-tasking computers,
but also reduces the task switch and interrupt
overhead, by smoothly changing the renamed registers'
allocation. In other words, there are far fewer lost
cycles and no pipeline flush.

With very long latency memories and several different
places that have to be prepared, it became evident that
two prefetch queues were not enough, so they multiplied.
Today, 32 sounds like the most we can do in hardware
(this already needs a big crossbar), but
nothing keeps us from "mapping" a virtual PFQ to a
physical one. Once again, the model and the implementation
are completely decoupled, as long as the principle is
valid and useful, and is correctly used.

With many PFQs available, why not use them for data
memory access ? With a pair of visible registers,
DATA and INDEX, it is easy to manipulate pointers.
The rest of the registers are only additional features
that provide protection, status, properties,
index increment on read, index increment on write,
base address, limit...

PFQs can be viewed and configured as FIFOs or LIFOs, and
could even perform buffer-wise bit reversal. They
provide a 32-word frame buffer when used as a stack.
They unify data and instruction flows while still
providing protection. They provide blocking, though
testable, access to any stream (pipes, files, coprocessors,
other processors...). They make it possible to take
the characteristics of every memory type into account.

Since computers have several different memory types
(L1, L2, main memory bank #X, I/Os, SCSI bus, etc.)
the programmer can select exactly where the data are
accessed by choosing the "memory type" property of the
PFQ. This can for example benefit convolution algorithms,
because each data stream can use a different memory bus
if the data are correctly located, as in DSPs
that use modified HARVARD architectures.
In raw addressing mode, the programmer must manually
set the type with the TYPE register or with the
pointer's 8 MSBs.
In protected mode, the OS manages the lookup table
that translates the pointer's 8 MSBs into address bounds
and a memory type. That way, the OS has tight
control of every resource.


The PFQs give the invaluable ability to "play" with
the memory and the data and instruction flows,
and provide several opportunities to manage
common situations in a flexible way. Even though
they are a rather complex mechanism, they allow the
CPU core clock frequency to be increased to the point
where the main memory is still the bottleneck.
The only solution then is to increase the data
parallelism, namely to increase the word width, and to
implement more independent busses and more memory banks.
Of course, the software must be prepared for these
enhancements.


7) The registers
----------------

There are 128 directly user-visible registers.
Registers 0..63 are the GPRs that hold all the
processed data. Their size is implementation dependent
but at least 32 bits.
They hold both floating-point and integer values.
Each GPR has three dedicated bits that hold the state
of the last operation which used this register
as destination : zero, carry, neg.
They replace the usual (CISC) general Condition Code
Register (CCR) and allow complex architectures to be
designed, with many execution units, out-of-order
execution and renamed registers, while keeping the
complexity low.

The state bits are set according to the size of the last
operation performed on the register. This is why the
neg flag can be set when the MSB of the register is zero,
and the zero flag can be set when the whole register
is not equal to zero.

Please note that there is no "zero" register.
If one is needed, it's as simple as "clear R0"
to make one.


The other 64 registers (64 to 127) are the PFQ registers :
there are 32 pairs of data and index registers.
The data register holds the value of the memory location
pointed to by the index register, in the PFQ's memory type.

When the PFQ is configured for automodification,
reading or writing the DATA register changes the INDEX.
When the data register is read, the index
is incremented by a read_count, multiplied by the size
of the word. When the data register is written,
the index is incremented by write_count*word_size.
This way, it is easy to implement a FIFO, a LIFO,
any number of stacks, a stack frame buffer, or any
regular structure.
A write to the index register, after any operation,
can set the index to any place within the bounds
that the OS associates with the memory type of the PFQ.


Several additional, implementation-dependent registers
can exist but are not mapped into the general register
range : these are the SPRs (Special Purpose Registers).
They can be read and written by special instructions
that can trigger protection checking mechanisms, according
to their function and importance.

The SPRs contain important information for the task
and the OS. They are completely saved and restored
during each task switch, to a place that the OS can
control (L1, L2 or main memory).

The main SPRs are:
-SPR_NUMBER: the number of the highest SPR.
  This is hardwired and implementation dependent.
-CPFQ: "Current PFQ": indicates which PFQ the
 main instruction flow is extracted from.
-MAXSIZE: the maximum number of bytes a GPR can contain.
 This is strictly dependent on each implementation and
 hardwired.
-SIZE0, SIZE1, SIZE2, SIZE3: usually 1, 2, 4 and 8, but
 can be reprogrammed to any power of 2 that is equal to or
 below MAXSIZE. These can also be hardwired if MAXSIZE
 is equal to or below 64.
-CYCLECOUNT: counts the actual number of cycles during
 which a task's instructions have been executed. This can
 be useful for the OS scheduler and for tuning the task.

Several SPRs are dedicated to each PFQ's state, beyond
the directly visible INDEX and DATA registers. These are:
- status : indicates if the PFQ is ready, if the
   index is valid, etc. (this is a "volatile" register)
- base : the value that the index is reloaded with
   when circular addressing is on
- limit : the maximum value of the index
- read_increment,
- write_increment,
- memory_type : which memory is accessed
- granularity : the amount of data that is loaded each time
 the PFQ is refilled (like a HDD sector size or DRAM page
 size...)
- properties : indicates whether data or instructions are
   fetched, enables the autoincrement and the circular
   addressing, and other neat "features".

Other SPRs can exist, task-dependent or not,
to handle the interrupt table, the protection
mechanisms, the individual hardware properties,
or the debugging registers.

There can be 2^14 SPRs with the immediate instruction form,
and virtually any number of SPRs with the register form.


8) "FLAT" access to each memory type
------------------------------------

"What ??! no cache memory ? no paged virtual memory ??!"
Why not ? DSPs and other real-time or high performance
computers don't have Level 1 cache memory or virtual memory.
They implement them in their own way, differently.

The first reason is that it is complex hardware.
Thus it is not really fast, because it requires several
mechanisms to know whether a line is present in the cache,
and to fill and flush it, and when.

The second reason is that DSPs are designed to take the
memory hierarchy into account in their programming model.
The algorithms they run are therefore designed to use
this hierarchy to its fullest. Virtual linear addressing
and address translation make programming easier,
but flat addressing of each memory reduces the hardware
latencies and ensures a certain performance level.

The third reason, which is more philosophical, is that
this hardware is not "predictable". The state of the LRU
tags of each cache line depends on operations that were
performed long ago, by processes that a task can't
necessarily control. For example, the OS can interrupt a
task at any time, and an interrupt can "flush" the
application's data out of the cache.
Paged virtual memory is a complex mechanism that is
influenced by the other tasks as well, by the hard disk
speed and by the kernel speed. Performance enhancements
provided by cache mechanisms are only statistical, not
absolute. Furthermore, a cache replacement order can be
ineffective for certain algorithms (which are sometimes
critical).

I could go on, but this doesn't prevent DSPs and
real-time processors from integrating a small high-speed
on-chip memory. This doesn't prevent them from being able
to access slow and huge memory banks either. Yet they are
not slow, because they do all the important computations
with the fast on-chip memory. At the same time they use
DMA channels to flush or fill the on-chip memory with the
needed data, all in the background.
This is the kind of strategy I want to use with the Gamma.


Another principle that I want to introduce is similar to
the "no caching" principle. Since the memory map of the
processor can change from one implementation to another,
and since each usual "cache level" is accessed
independently, it is not possible to access a certain
memory at a fixed address.
In other words, we can't access the on-chip memory at
address xxx and the off-chip memory at address yyy; we
need to access every memory level independently and
linearly.

That is why, associated with each PFQ index, there is
a "memory type" property field that says whether the index
points to on-chip cache, off-chip cache, main memory, etc.
One good thing about it is that there is no address
decoding to do in hardware : there can be pins on the chip
that directly select the main memory or the off-chip
"cache" with no additional "chipset". We gain yet a few
picoseconds :-)

One drawback is that pointer management and arithmetic
become more complex if we must load this TYPE register
into the PFQ each time. The solution is to use the MSBs
of the pointer as the TYPE field. Since the MSBs of most
addresses are zeroed, we can use bits 0 to 55 for the
address and bits 56 to 63 for the memory type. In
protected mode, the OS is responsible for allocating the
memory type "resource" to a task-specific number, and the
hardware performs the correspondence between the number
and the physical memory type with a specific lookup table.
This correspondence is performed only when the pointer is
changed, not every time data is read.

If protection is not enabled, the "TYPE" field of the
pointer directly selects the actual physical memory type.
Only a bound check is performed against the PFQ's
settings. If protection is enabled, the TYPE field of the
pointer is an index into a lookup table which gives the
actual physical memory type, the physical base and the
logical limit. More bound checks are performed.

The 56:8 pointer format is arbitrary and can change, so the
programmer can't rely on it. He should not modify the MSBs
of a pointer and he must respect the limit that the OS
has set. Paged virtual memory can be implemented
when the TYPE field is shifted towards the lower bits.
The page size is then defined by the size of the address field.
The hardware can still look up the TYPE field, but a
"page fault" trap can be triggered instead of a memory bound error.

The user can ask the OS for a certain
amount of memory in a certain type of memory. That way,
we can organize the data as
we like or (better) depending on their importance.
The hardware doesn't keep us from doing it anymore.
For example, the most important data, for immediate use,
are transferred to the on-chip memory by a DMA command.
The main memory then only serves as a backup for the big data
arrays or the "sleeping" programs. The executable code can
be stored in the "L2 cache", except for the exception and
interrupt table and handlers, the OS kernel and its data.

The OS is responsible for allocating the different
cache levels between the different tasks. The OS becomes
the key for saving and restoring a task's state, which
now includes the caches' contents (this doesn't waste much
room in the main memory).
Because of this complete (software) backup mechanism,
a task's state can be restored at a different physical
address. The pointers that the application has computed
before can become invalid or conflict with another task.
The easiest workaround for this problem is a task-dependent
SPR that holds a base address for a physical memory type.
This base is automatically added to the index each time a
PFQ is loaded. This is what the Phys_Base SPRs do in each
PFQ. It is automatically loaded during a table lookup when
the TYPE field is decoded (in protected mode).
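A minimal sketch of this protected-mode lookup; the table layout and names are my assumptions, not the specification:

```c
#include <stdint.h>

/* Hypothetical per-task lookup table, indexed by the pointer's TYPE
   field: each entry gives the physical memory type, the physical
   base (as held by a Phys_Base SPR) and the logical limit. */
struct type_entry {
    uint8_t  phys_type;   /* actual physical memory type */
    uint64_t phys_base;   /* added to the index on PFQ load */
    uint64_t limit;       /* logical limit for the bound check */
};

/* Returns the physical address; sets *fault on a bound error. */
static uint64_t translate(const struct type_entry *table,
                          uint8_t type, uint64_t index, int *fault)
{
    const struct type_entry *e = &table[type];
    *fault = (index >= e->limit);
    return e->phys_base + index;
}
```

Because the base is added when the PFQ is loaded, the OS can relocate a task's data simply by rewriting the table entry; the task's own indices never change.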

In a sense, we are close to a protected, segmented scheme because
a memory location is defined by an (index, memory type) pair.
We can also detect when an index goes past a PFQ limit, and trigger
an interrupt. SPRs can be write- or read-disabled for the user,
in protected mode. I think that protection is assured.


9) Some architectural considerations
------------------------------------

As written above, the PFQs allow the CPU core clock
to increase while keeping slow memories, provided the data
paths are wide enough to sustain the bandwidth.

From the global point of view, the Gamma processor core
family has three important parts: the PFQs, the "instruction
generator" (IG) and the "execution unit(s)" (EU) . Let's say
that the register file belongs to the execution unit(s).

The PFQs interface the processor core with the outside.
They communicate with the on-chip cache, the bus interface(s)
and any other means on one side, and with the EU and the IG on
the other side. The PFQ unit contains the buffers and the
addressing mechanisms with their bound checks and their
configuration registers.

The "Instruction Generator" decodes the instructions and "emits"
orders to the EU. It performs all the instruction decoding, the
instruction fetch, the hardware loops, and can "latch" orders
for single-instruction looping. It receives the status
bits from the PFQ so as to catch any invalid operation before the
orders are emitted (this simplifies the overall architecture
since no pipeline flush is needed and no instruction is partially
executed). The IG plans the sequences ("scoreboard") of operations
that the EU will perform, delays the emission of orders when the
PFQ or the data are not ready, and switches between the instruction
flows.

The "Execution Unit" performs the actual operations: integer or
floating point, boolean or arithmetic, shift or comparison...
The data moves (i.e. reg to reg) do not involve the EU, but use
the buses that connect them.


10) Coding
----------

Here are some examples to help you grasp how it works
and how the ideas behind the Gamma can be used.

First code: silly block copy

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
char *d, *s;
int i;

for (i=0; i<500; i++) {
  *(d++) = *(s++);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How does that translate to the Gamma ?

- The first thing to remember is that we have a DMA engine
 that queues block transfer orders for external-to-external
 or external-to-internal memory. But let's say the data is already
 in internal memory.

- the inner instructions can be a simple register to register move:
 move.8 Dd, Ds;
Dd and Ds can be one of the 32 PFQs, the number can be allocated
depending on the context. Let's say CPFQ is PFQ0, the Dd (destination)
PFQ is PFQ1, and the source PFQ (Ds) is PFQ2. So we get:
 move.8 D1,D2;

- The granularity of the move can be changed if the data blocks
 don't overlap. We can then do 62 loops at a 64-bit width,
 and one 32-bit move outside of the loop (500 = (62*8) + 4).

- Since the loop only contains one instruction we can use a short loop
 form, with an immediate loop count:

 loop end_loop, 62 times;
 move.64 D1,D2;
end_loop:
 move.32 D1,D2;

- The '++' operator increments the pointers at each iteration,
 we have to set the proper register to autoincrement the INDEX.
 pfq1 r+w+;
 pfq2 r+w+;
  
The C code above is translated into the following code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 pfq1 r+w+;
 pfq2 r+w+; 
 loop end_loop, 62 times;
 move.64 D1,D2;
end_loop:
 move.32 D1,D2;

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
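The same decomposition (62 eight-byte moves plus one four-byte move, since 500 = 62*8 + 4) can be modelled in plain C, assuming non-overlapping blocks as stated above:

```c
#include <stdint.h>
#include <string.h>

/* Model of the unrolled copy: 62 iterations of a 64-bit move,
   then a single 32-bit move, covering 500 bytes in total. */
static void copy500(uint8_t *d, const uint8_t *s)
{
    for (int i = 0; i < 62; i++) {      /* loop end_loop, 62 times */
        memcpy(d, s, 8);                /* move.64 D1,D2 */
        d += 8; s += 8;                 /* autoincrement (r+w+) */
    }
    memcpy(d, s, 4);                    /* move.32 D1,D2 */
}
```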



The instruction set summary
---------------------------

It has been designed to be orthogonal, and I tried
to keep it as coherent and "implementation independent" as possible.
The basic format, which rules the others, is the usual
3-register form, called OP3R. The other formats are
derived or adapted from it, trying to reduce the
instruction decoding cost while avoiding oddities.


*****************
*** REGISTERS ***
*****************

registers 0 to 63 are the GPRs (bit 6 zeroed)
registers 64 to 95 are the DATA windows (bit 6 set and bit 5 zeroed)
registers 96 to 127 are the INDEX registers (bits 5 and 6 set)

Restriction:
 Writing to any register dependent on the current PFQ
 (the number in the SPR CPFQ) has no effect and may trigger
 a trap.

Operations:

 Reading or writing a DATA register triggers a trap
 if the associated INDEX is outside the PFQ's limit.

 Reading a DATA register increments the associated INDEX
 register by the amount specified in the read_increment SPR,
 multiplied by the size of the data width.

 Writing to a DATA register increments the associated INDEX
 register by the amount specified in the write_increment SPR,
 multiplied by the size of the data width.
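A rough software model of these DATA/INDEX semantics (the name read_increment comes from the text; the structure and everything else is my assumption):

```c
#include <stdint.h>

/* Toy model of one PFQ's DATA-read semantics: bound check against
   the PFQ limit, then post-increment of the INDEX register by
   read_increment * access size. */
struct pfq {
    uint64_t index;           /* INDEX register */
    uint64_t limit;           /* PFQ limit */
    int64_t  read_increment;  /* read_increment SPR */
};

/* Returns 0 and reads one element of 'size' bytes, or -1 (trap). */
static int data_read(struct pfq *q, const uint8_t *mem,
                     unsigned size, uint64_t *out)
{
    if (q->index >= q->limit)
        return -1;                        /* bound trap */
    *out = 0;
    for (unsigned i = 0; i < size; i++)   /* little-endian fetch */
        *out |= (uint64_t)mem[q->index + i] << (8 * i);
    q->index += q->read_increment * size; /* autoincrement */
    return 0;
}
```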

 Writing to an INDEX register cancels any previous prefetch.
 The processor first checks the validity of
 the TYPE field. If TYPE is invalid, the PFQ's STATUS
 register indicates it. Then the index is compared with the
 memory type's limit and the PFQ's limit. If no error
 has occurred, the memory type's BASE is added to the index
 and data is fetched from this address. If an error occurs,
 the PFQ's STATUS register indicates which one occurred.
 If instructions have already been fetched, they can be
 speculatively executed.

 A PFQ's TYPE is updated only when the INDEX register is
 written with SIZE==MAXSIZE. Writing with another size only updates
 the LSBs, which are the actual index bits (it's a "partial" write).

 A PFQ's INDEX can be updated only once per instruction;
 the written register takes priority. For example:
   ADD.8 D1,D1,D1
 only updates PFQ1 once, according to how the write influences
 the INDEX.


*******************
****** PFQs *******
*******************

PFQ 0  to 15 are usually used for instruction fetch.
PFQ 16 to 31 are typically for data access.
PFQ0 is typically the starting PFQ for a program,
and PFQ31 is typically the stack.
These rules are not yet mandatory,
but later processors will certainly enforce
them more strictly for hardware reasons.


********************************************
*** FORMAT 1: OPRI (Register, Immediate) ***
********************************************

Most common operations involving an immediate
value use this form.

Fields:
   FORMAT3 : bits 0..2 = 000
   OP6     : bits 3..8
   REG7    : bits 9..15
   SIZE2   : bits 16..17
   IMM14   : bits 18..31

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NOP (Do nothing)
   opcode: 0x00000000

   Well, if you want to bloat your code... or insert a
   zeroed constant.

   example:
     nop;

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

SERIALIZE
   opcode: 0x00000001

   SERIALIZE waits for the execution units to complete
   all the work in the pipeline. Like a NOP, nothing is
   affected: the instruction generator is blocked
   until all planned operations are completed.

   example:
     serialize;


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ENDSTREAM
   opcode: 0x00000002

   ENDSTREAM ends the currently executed instruction stream.
   This is used like an IRET (at the end of an interrupt routine),
   to finish a program, or when concurrent execution of several
   instruction streams is performed.

   Effect: decoding of the current instruction stream
   is finished. All the operations planned before the
   instruction are completed normally. The instruction decoding
   unit is free for another stream.

   example:
     endstream;


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MOVI "move immediate"
   opcode:
      FORMAT3 : bits 0..2   = 000
      OP6     : bits 3..8   = 000001
      REG7    : bits 9..15  : register
      SIZE2   : bits 16..17 : size of the register
      IMM14   : bits 18..31 : Sign-extended, 14-bit immediate value

   MOVI moves an immediate 14 bit integer value to a register.

   example:
     movi.64 r0,1;    // move the value 1 to the 64 LSB of GPR0.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ADDI "add immediate"
   opcode:
      FORMAT3 : bits 0..2   = 000
      OP6     : bits 3..8   = 000010
      REG7    : bits 9..15  : register
      SIZE2   : bits 16..17 : size of the register
      IMM14   : bits 18..31 : Sign-extended, 14-bit immediate value

   ADDI adds an immediate integer value to register.

   Note: there is no SUBI since the imm14 value is sign-extended.

   example:
     addi.64 r0,1;    // add the value 1 to the 64 LSBs of GPR0 (increments R0)
     addi.8  r2,-1;   // decrements R2
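The sign extension of the 14-bit immediate can be sketched as follows (a hypothetical helper, not part of the spec; it assumes arithmetic right shift on signed types, as on common compilers):

```c
#include <stdint.h>

/* Sign-extends a 14-bit immediate (the IMM14 field) to 64 bits by
   shifting it up to the sign position and arithmetic-shifting back. */
static int64_t sext14(uint64_t imm14)
{
    return (int64_t)(imm14 << 50) >> 50;
}
```

This is why a separate SUBI is unnecessary: `addi r,-1` encodes the immediate 0x3FFF, which extends to -1.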

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CLR "clear"
   opcode:
      FORMAT3 : bits 0..2   = 000
      OP6     : bits 3..8   = 000011
      REG7    : bits 9..15  : register
      SIZE2   : bits 16..17 : size of the register
      IMM14   : bits 18..31 : 0 (reserved)

   CLR clears a register. The CARRY flag is not modified, but the
    ZERO and NEG flags are. CLR should be preferred to
    the XOR or SUB idioms for clearing a register because CLR
    doesn't read the register.

   example:
     clr.16 r0;    // set R0 to 0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

WRSPR "write to a SPR"
   opcode:
      FORMAT3 : bits 0..2   = 000
      OP6     : bits 3..8   = 000100
      REG7    : bits 9..15  : register
      SIZE2   : bits 16..17 : size of the register
      IMM14   : bits 18..31 : SPR number

   WRSPR writes a General Purpose Register to a Special Purpose Register.
   If the SPR number is invalid or if accessing the SPR is forbidden
   by the protection mechanism, a trap is triggered.

   example:
     WRSPR.16 r0, PFQ0LIMIT;    // set the limit of PFQ0 to the content of R0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

RDSPR "read from a SPR"
   opcode:
      FORMAT3 : bits 0..2   = 000
      OP6     : bits 3..8   = 000101
      REG7    : bits 9..15  : register
      SIZE2   : bits 16..17 : size of the register
      IMM14   : bits 18..31 : SPR number

   RDSPR writes a Special Purpose Register to a General Purpose Register.
   If the SPR number is invalid or if accessing the SPR is forbidden
   by the protection mechanism, a trap is triggered.

   example:
     RDSPR.16 r0, PFQ0LIMIT;    // read the limit of PFQ0 into R0

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*********************************************
*** FORMAT 1': OPRI (Register, Immediate) ***
*********************************************

MOVP "Partial MOVe"
   opcode:

      FORMAT3 : bits 0..2   = 000
      OP3     : bits 3..5   = 001
      PLACE4  : bits 6..9   : where to deposit the immediate value
      REG6    : bits 10..15 : GPR (only GPRs are allowed, not PFQ registers)
      IMM16   : bits 16..31 : immediate 16-bit value

   MOVP moves an immediate 16-bit integer value to a given
   place in a general purpose register. This makes it possible to load
   a full 256-bit register with an integer value. When possible,
   partial move instructions to the same register should be
   consecutive and address decreasing places (this would allow
   hardware acceleration).

   example:
     movp r0[1],1; // move the value 1 to bits 16..31 of GPR0.
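The PLACE4 field can be modelled as selecting one of the sixteen 16-bit slots of a 256-bit register; this sketch rests on that assumption:

```c
#include <stdint.h>

/* Toy model of MOVP: a 256-bit register seen as sixteen 16-bit
   slots, PLACE4 selecting which slot receives the immediate. */
struct reg256 { uint16_t slot[16]; };

static void movp(struct reg256 *r, unsigned place, uint16_t imm)
{
    r->slot[place & 15] = imm;  /* movp rN[place], imm */
}
```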


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*******************************************
*** FORMAT 2: OPRR (Register, Register) ***
*******************************************

With its extended opcode, many operations can use this form.
move, shr, shl, sar, ror, rol, shri, shli, sari, rori, roli

Fields:
   FORMAT3 : bits 0..2 = 001
   OP6     : bits 3..8   : main operation code
   DEST7   : bits 9..15  : destination register
   SIZE2   : bits 16..17 : operation size
   SOURCE7 : bits 18..24 : source register
   OPX7    : bits 25..31 : opcode extension / 7 bit immediate

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MOVE  "MOVE a register into another"
   opcode:
      FORMAT3 : bits 0..2 = 001
      OP6     : bits 3..8 = 000000  : operation code
      DEST7   : bits 9..15  : destination register
      SIZE2   : bits 16..17 : operation size
      SOURCEA7: bits 18..24 : source register A
      RES7    : bits 25..31 : reserved

   MOVE copies the contents of one register into another.
   The flags are not copied: DEST7's flags are set
   according to the size field of the opcode. The carry
   flag is not changed.

   example:
     move.8 r0,r0; // do nothing, except updating the flags ?  


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

************************************
*** FORMAT 3: OP3R (3 Registers) ***
************************************

This is the most important format.
Bit 2 specifies if the operation is floating point
or integer (good hint for an emulation trap)

add, addc, fadd, sub, subc, fsub, ...

Fields:
   FORMAT2 : bits 0..1 = 01
   OP7     : bits 2..8   : operation code, >=64 if floating point
   DEST7   : bits 9..15  : destination register
   SIZE2   : bits 16..17 : operation size
   SOURCEA7: bits 18..24 : source register A (holds the carry when needed)
   SOURCEB7: bits 25..31 : source register B


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OPLx  "Logic OPeration"
   opcode:
      FORMAT2 : bits 0..1 = 01
      OP7     : bits 2..4 = 000   : operation code (upper bits)
      FUNCT4  : bits 5..8   : boolean function
      DEST7   : bits 9..15  : destination register
      SIZE2   : bits 16..17 : operation size
      SOURCEA7: bits 18..24 : source register A
      SOURCEB7: bits 25..31 : source register B

   OPLx performs all the bitwise boolean operations. It
   uses the logic unit which, for each bit pair An (of SOURCEA7)
   and Bn (of SOURCEB7), gives DEST7's bit Dn according to the
   FUNCT4 field with the following boolean formula:

     Dn= (/An./Bn.F0) + (/An.Bn.F1) + (An./Bn.F2) + (An.Bn.F3)

   In fact, the FUNCT4 field gives the truth table of the operation:

    SOURCEA7n  |   SOURCEB7n   |   DEST7n
   ----------------------------------------
        0      |       0       |   FUNCT0 (bit 5)
        0      |       1       |   FUNCT1 (bit 6)
        1      |       0       |   FUNCT2 (bit 7)
        1      |       1       |   FUNCT3 (bit 8)

   FUNCT4 is read from FUNCT0 to FUNCT3.
   FUNCT4 can be written either in hexadecimal or binary form in the
   assembly source. This provides a straightforward way to perform
   any boolean operation. For example, AND translates directly
   into OPL0001 or OPL1. Similarly, NAND is OPL1110 or OPLE.
   The usual mnemonics can be #define'd, but this opcode provides a
   simple way to create any other needed operation.

   Some common operations:
    AND    OPL0001
    NAND   OPL1110
    OR     OPL0111
    NOR    OPL1000
    XOR    OPL0110
    NXOR   OPL1001

   The other operations are deduced from the above table:
    A AND NOT B is written OPL0010

   Note that 6 opcodes are not "actual" boolean operations:
    OPL0000 sets DEST7 to all 0s.
    OPL1111 sets DEST7 to all 1s.
    OPL0011 sets DEST7 to A.
    OPL1100 sets DEST7 to /A.
    OPL0101 sets DEST7 to B.
    OPL1010 sets DEST7 to /B.

   These opcodes have a more efficient form that doesn't
   need the 3-operand format (they use fewer resources and
   cost fewer clock cycles):
    OPL0000 is replaced by CLR DEST7
    OPL0011 and OPL0101 are replaced by MOVE DEST7, SOURCE7
    OPL1010 and OPL1100 are replaced by NEG DEST7, SOURCE7
   They are kept anyway for simplicity.

   The Carry flag of DEST7 is not modified, ZERO and NEG are.

   example:
     opl1.8 r1,r2,r3 ; // bitwise AND of r2 and r3 into r1.
     opl0100.8 r1,r2,d1 ; // r1 = /r2 AND d1.
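The FUNCT4 truth-table selection can be checked with a small model of the formula above (my own sketch; the funct argument is the hex form of the mnemonic, e.g. OPL1 for AND, OPL7 for OR, so F0 sits in bit 3 and F3 in bit 0):

```c
#include <stdint.h>

/* Logic unit model implementing
   Dn = (/An./Bn.F0) + (/An.Bn.F1) + (An./Bn.F2) + (An.Bn.F3),
   with 'funct' holding F0..F3 as the hex value of the mnemonic
   (so F0 is bit 3 and F3 is bit 0 of 'funct'). */
static uint64_t opl(unsigned funct, uint64_t a, uint64_t b)
{
    uint64_t d = 0;
    for (int n = 0; n < 64; n++) {
        unsigned an = (a >> n) & 1, bn = (b >> n) & 1;
        unsigned sel = 3 - ((an << 1) | bn);  /* (0,0)->F0 ... (1,1)->F3 */
        d |= (uint64_t)((funct >> sel) & 1) << n;
    }
    return d;
}
```

Running it confirms the degenerate cases listed above: funct 0 (OPL0000) always yields all 0s, matching the CLR replacement.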

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
