ROP2.txt
created Sun Sep  2 06:47:11 2001 by whygee@f-cpu.org


INTRODUCTION :
--------------
Here is a little programming "howto" for those who want
to understand what the ROP2 unit is and how to use it.


BACKGROUND :
------------
The ROP2 unit's name comes from the ability to perform
any 2-operand logical operation (see the F-CPU manual).
Because implementing all 16 possible operations would be
redundant (the manual shows how/why), we encode only 8
logical operations : AND/OR/NAND/NOR/ANDN/ORN/XOR/XNOR.
The NOT operation is just a special case.

If you have to program bitmap software, you will love
these operations, especially if you have reached the
limits of MMX. Indeed, the 3-address instructions and
the high number of registers make programming very easy.
If you have to perform operations on bitmaps, or any kind
of bitfields, F-CPU's ROP2 is for you.

During the design, it appeared that there was
some room left in the pipeline (the ROP2 operator
is extremely simple) so we added two other valuable
operations : COMBINE and MUX. When properly used, in
conjunction with operations from the SHL unit, ROP2
can give surprising results. Unless you read this text
and understand how it works :-)


ROP2 FUNCTION :
---------------

In the latest versions of the VHDL sources,
I have exploited a logical simplification that
implements the ROP2 function more efficiently,
whether during simulation or synthesis. Here
is a "no background necessary" explanation.

In a "classical" ALU, where the boolean and arithmetic
functions are combined in a single "black box",
one would implement whatever logic operation is
specified by the instruction set, and multiplex
the results from OR, AND, XOR, ADD, SHL etc. into
a single bit.

ROP2 works differently : the opcode defines the
function table (though in a compressed way), so
if you want to perform a XOR, the decoder will give
"0110" to the ROP2 unit. It then has to
select the bit it wants, depending on the current
position in the word and the corresponding input
values. So if you have A[n]=1 and B[n]=0, the "address"
is binary "10" (or 2) and bit number 2 of the ROP2
mode (counting from 0) is "1", which is output to the
result bus at position [n].

In hardware, it means that "0110" is the input
of a 4-bit multiplexer and the selection is done
with the concatenation of the A and B input bits.

If your CPU is 64-bit wide, there are 64 multiplexers
that perform the ROP2 functions, that's all :
not an 8-input multiplexer that selects among the
8 results of all possible operations.
It takes more time to explain it than to code it :-)
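
The table lookup described above can be modelled in a few lines of
Python. This is only a sketch under stated assumptions, not the VHDL
of the F-CPU : the name rop2, the packing of the function table into
a 4-bit integer, and the address order (A is the high bit, B the low
bit) are chosen here for illustration.

```python
def rop2(table, a, b, width=64):
    """Bitwise 2-operand logic function selected by a 4-bit truth table.

    Assumption: bit k of `table` is the output for address k = 2*A + B.
    """
    result = 0
    for n in range(width):
        a_bit = (a >> n) & 1
        b_bit = (b >> n) & 1
        address = (a_bit << 1) | b_bit      # concatenation of A[n] and B[n]
        out_bit = (table >> address) & 1    # the 4-to-1 multiplexer
        result |= out_bit << n
    return result


# Truth tables under the assumption above (hypothetical encodings).
XOR_TABLE = 0b0110
AND_TABLE = 0b1000
OR_TABLE = 0b1110
NOR_TABLE = 0b0001
```

With this model, rop2(XOR_TABLE, a, b) gives the same result as a ^ b :
the "operation" is nothing more than the table fed to the multiplexers.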

One can remark, however, that this selection
is mostly the same as what was
implemented/written before (an OR of four 3-input ANDs).
Though the operation has the same result, the
implementation is a bit different in hardware,
mostly because the "mux" cells are optimised
and can even be found in FPGAs. A synthesizer
can fail to recognize the old version as a
multiplexer and so fail to infer a mux. Finally,
"discrete" logic cells include some buffers that
are not needed inside a single cell, so the MUX
version is actually both smaller and faster.


COMBINE MODE :
--------------
The COMBINE instructions perform either an OR or
an AND on 8 consecutive bits from the output of
the ROP2 operator. This way you can test if
the bytes are 0x00, 0xFF or something else.
The output is either 0x00 or 0xFF, thus creating
a useful mask for other operations.
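
As an illustration, the per-byte reduction can be sketched in Python
as follows. The function name combine and its "and"/"or" mode argument
are hypothetical, and the input stands for the output of the ROP2
operator.

```python
def combine(value, mode, width=64):
    """Reduce each byte of `value` to 0x00 or 0xFF.

    mode "or"  : the byte becomes 0xFF if at least one bit is set.
    mode "and" : the byte becomes 0xFF only if all 8 bits are set.
    """
    result = 0
    for i in range(width // 8):
        byte = (value >> (8 * i)) & 0xFF
        if mode == "and":
            hit = byte == 0xFF
        elif mode == "or":
            hit = byte != 0x00
        else:
            raise ValueError("mode must be 'and' or 'or'")
        if hit:
            result |= 0xFF << (8 * i)
    return result
```

For the 32-bit value 0x00FF0180, the "or" mode gives 0x00FFFFFF and
the "and" mode gives 0x00FF0000.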

Remark : If you need a NAND of the output
of the ROP2, simply use De Morgan's theorem !
In this case, turn the NAND into an OR and
reverse the output polarity of the preceding
ROP2 operation. For example : the 8-bit NAND
of XORs is written "xnor.or" (xor->xnor and
nand->or).
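
The identity behind this remark can be checked exhaustively over one
byte. The two helper names below are hypothetical and only model the
8-bit reduction, not the F-CPU hardware.

```python
def nand8(byte):
    """8-input NAND: the result is 0 only when all eight bits are set."""
    return 0 if (byte & 0xFF) == 0xFF else 1


def or8_of_inverted(byte):
    """8-input OR applied to the inverted bits of the byte."""
    return 1 if (~byte & 0xFF) != 0 else 0


# De Morgan: NAND(x0..x7) == OR(not x0 .. not x7), for every byte value.
equivalent = all(nand8(x) == or8_of_inverted(x) for x in range(256))
```

Since inverting every bit of an XOR result gives the XNOR result, the
OR of XNORs equals the NAND of XORs.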

COMBINE can be used "alone" if the ROP2 function
is set to a transparent mode, such as "R0 or RN"
  or.and r0, r1, r2; // tests if there are 0xFFs in r1
but it becomes very powerful when the ROP2 function
uses another parameter (usually a mask). The first
(straightforward) example is searching for a specific
character in a string : the mask is set to the repeated
character and the other input is the given string.
Now, a "comparison" of the two patterns is possible
with one operation :
   xnor.and r1, r2, r3
which means : for each byte, if all the bits match
(result is 0xFF), set the corresponding byte to
0xFF, or set to 0x00 otherwise. A "hit" is detected
when the result is written back to the register
set : the data is flagged as non-zero if at least
one character matches. If the instructions
are properly scheduled, we can put a conditional
move or jump after the comparison.
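
The search can be modelled in Python as below. This is a sketch :
sdup_byte and xnor_and are hypothetical helper names, and loading the
string with the first character in the lowest byte is an assumption
made only for this example.

```python
MASK64 = (1 << 64) - 1


def sdup_byte(char):
    """Duplicate one byte over the 8 bytes of a 64-bit word."""
    return (char & 0xFF) * 0x0101010101010101


def xnor_and(a, b):
    """XNOR of a and b, then AND-combine: 0xFF where the bytes are equal."""
    matches = ~(a ^ b) & MASK64
    result = 0
    for i in range(8):
        if (matches >> (8 * i)) & 0xFF == 0xFF:
            result |= 0xFF << (8 * i)
    return result


# Assumption: the first character sits in the lowest byte of the word.
text = int.from_bytes(b"FOOBAR!!", "little")
hit_mask = xnor_and(sdup_byte(ord("B")), text)
found = hit_mask != 0   # the non-zero test made at register write-back
```

Here 'B' is the fourth character, so only byte 3 of hit_mask is 0xFF.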

MUX MODE :
----------
Now, imagine that we want to replace all
'B' characters with a 'b' (or whatever byte
substitution). We already know how to
detect the desired pattern with xnor, and the
ROP2 unit provides a MUX function which selects
one bit from one of the two sources.
Here is some source code :

sdupi 'B', r1; // create the search mask into R1
sdupi 'b', r2; // create the substitution mask into R2

// loop here :
// R3 is the input data register

xnor.and r1, r3, r4; // R4 now contains the "hit" mask
mux r3, r2, r4;      // for every byte, the result (R5 here) contains
                     // 'b' if R4 was 0xFF, the byte of R3 otherwise.
// and now, store the result (R5) to memory
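
The same substitution can be sketched in Python. The helper names are
hypothetical, and the mux semantics used here (take the bit from the
second source where the selector bit is 1, from the first source
otherwise) are an assumption derived from the comments above, not a
statement about the exact F-CPU operand order.

```python
MASK64 = (1 << 64) - 1


def sdup_byte(char):
    """Duplicate one byte over the 8 bytes of a 64-bit word."""
    return (char & 0xFF) * 0x0101010101010101


def xnor_and(a, b):
    """0xFF in every byte position where a and b hold the same byte."""
    matches = ~(a ^ b) & MASK64
    result = 0
    for i in range(8):
        if (matches >> (8 * i)) & 0xFF == 0xFF:
            result |= 0xFF << (8 * i)
    return result


def mux(a, b, select):
    """Bitwise select (assumed semantics): b where select is 1, else a."""
    return ((b & select) | (a & ~select)) & MASK64


def replace_byte(word, old, new):
    """Replace every byte equal to `old` with `new` in a 64-bit word."""
    hits = xnor_and(sdup_byte(old), word)
    return mux(word, sdup_byte(new), hits)


data = int.from_bytes(b"BOBBY B.", "little")
out = replace_byte(data, ord("B"), ord("b")).to_bytes(8, "little")
```

All eight characters are processed at once : out holds the string
with every 'B' turned into 'b' and the other bytes untouched.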

MUX was added because :
 - it is often very useful
 - it would require 3 x ROP2 instructions otherwise (quite a
    large overhead for so little)
 - it is a very, very simple operation and consumes almost
    no resource in itself (mostly control stuff)
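
To illustrate the second point, here is a sketch of how a MUX could
be rebuilt from three 2-operand operations (AND, ANDN, OR). The
function names, and the choice of which ANDN operand is inverted, are
assumptions made for this example.

```python
MASK64 = (1 << 64) - 1


def mux(a, b, select, width=64):
    """Reference model: for each bit, take b if select is 1, else a."""
    result = 0
    for n in range(width):
        source = b if (select >> n) & 1 else a
        result |= ((source >> n) & 1) << n
    return result


def mux_with_three_ops(a, b, select):
    """The same selection built from three ROP2-style operations."""
    t1 = b & select              # AND  : keep b where select is 1
    t2 = a & ~select & MASK64    # ANDN : keep a where select is 0
    return t1 | t2               # OR   : merge the two halves
```

Both functions return the same value for any inputs, but the second
one costs three dependent instructions and two temporary registers.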

PERMUTATIONS :
--------------
But you have not seen the "real power" of this unit
(and it is the purpose of this document) : it can do wonders
at shuffling bits in arbitrary directions, thus removing
the need to implement a complex and costly FPGA in the F-CPU
(as was sometimes asked for). I will show you that a proper
succession of "simple" operations can permute all the bits
of a byte (we are still limited to 64 bits today, but extensions
to wider bit fields are possible). Those who love bit
reversing and other similar functions will have to wait
for the SHL unit because bit reversing is still a "regular"
operation. I am speaking here about _arbitrary_ bit shuffling :
    0 -> 3
    1 -> 2
    2 -> 4
    3 -> 7
    4 -> 5
    5 -> 1
    6 -> 0
    7 -> 6
for a purely random example.

The first step is to prepare the masks :
 mask1 = 0x8040201008040201; // byte N has only bit N set
 mask2 = 0x4001022080100408; // permuted mask : byte N has only
                             // the destination bit of bit N set
(this is usually performed by the user and input in the source code)

The second step is to "expand" the original data :
 sdup.b r1, r2; // The byte in the lower part of r1 is duplicated all over r2

then we use the first mask to create another :
 and.or r2, r3, r4; // assume r3 = mask1,
     // r4's Nth byte now contains 0xFF if the Nth bit
     // of the original byte is set.

This mask is then masked with the permutation :
 and r4, r5, r6; // assume r5 = mask2

Now the problem is to gather all the bits in the lower part of
the register. There is no dedicated operation but it is not
necessary to loop over all the bytes : a Log2 sequence is
far better and faster !

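 shri 32, r6, r7
 or r6, r7, r6   // fold the upper 32 bits onto the lower 32 bits
 shri 16, r6, r7
 or r6, r7, r6   // fold the upper 16 bits onto the lower 16 bits
 shri 8, r6, r7
 or r6, r7, r7   // fold the last two bytes together

now the lower byte of r7 contains the permuted bits
(the upper bytes hold intermediate values and must be ignored).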
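
The whole sequence can be checked with a small Python model. This is
a sketch : the helper names are hypothetical, both masks are derived
from the permutation table given earlier instead of being typed by
hand, and only the lowest byte of the final value is kept as the
result.

```python
# PERM[n] is the destination of source bit n (the table of this section).
PERM = [3, 2, 4, 7, 5, 1, 0, 6]

# mask1: byte n holds only bit n. mask2: byte n holds only bit PERM[n].
MASK1 = sum(1 << (9 * n) for n in range(8))
MASK2 = sum((1 << PERM[n]) << (8 * n) for n in range(8))


def sdup_byte(value):
    """Duplicate the low byte over the 8 bytes of a 64-bit word."""
    return (value & 0xFF) * 0x0101010101010101


def combine_or(value):
    """Per-byte OR reduction: 0xFF for every non-zero byte."""
    result = 0
    for i in range(8):
        if (value >> (8 * i)) & 0xFF:
            result |= 0xFF << (8 * i)
    return result


def permute_byte(byte):
    r2 = sdup_byte(byte)            # sdup.b
    r4 = combine_or(r2 & MASK1)     # and.or with mask1
    r6 = r4 & MASK2                 # and with mask2
    r6 |= r6 >> 32                  # log2 gathering: 3 shift/or steps
    r6 |= r6 >> 16
    r7 = r6 | (r6 >> 8)
    return r7 & 0xFF                # only the lowest byte is meaningful


def permute_reference(byte):
    """Bit-by-bit loop, used only to check the fast version."""
    out = 0
    for n in range(8):
        if (byte >> n) & 1:
            out |= 1 << PERM[n]
    return out
```

The fast version agrees with the bit-by-bit loop for all 256 byte
values, while using a fixed number of word-wide operations.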

WARNING :
---------
In the above example, you have to keep in mind that the
9 instructions will not execute in 9 clock cycles, mainly
because of the additional Xbar cycle for every instruction.
In order to benefit from all the F-CPU power, you will have to "schedule"
the instructions and fill the gaps between two dependent instructions.

Here the case is very simple : all the instructions execute in
one cycle (or so I hope) so we have to "interleave" one cycle
between each instruction. This is good news if you have to
perform the permutation a lot of times (over a whole register
for example) because you can duplicate every instruction once
and rename the registers. This way, the FC0 can sustain
an average of one byte permutation every 9 cycles :-)
[of course it depends on how much you unroll the loop]


Note : combines of more than 8 bits are difficult
if we want to keep the pipeline stage short.
A 16-bit combine is still possible but not done. 32- and
64-bit combines would probably use another stage, which
is not wanted at all. In case it is required, then
use the ASU in saturated mode.




CONCLUSION :
------------
The applications of the ROP2 features are wide and the usefulness
is obvious. Most cases where an FPGA would be required can be
reformulated with a proper sequence of ROP2/SHL operations
and using logarithmic packing/unpacking (thus avoiding silly loops).
This strategy keeps the F-CPU architecture and cores simple
and the code will benefit from the future enhancements (superscalar,
OOO, whatever) to the core, while an "FPGA" would not scale as
well and would handicap the whole F-CPU family. Through correct
instruction scheduling and good programming practices, what
would have taken many small iterations can now be performed with
a few instructions operating on large registers.
With the unconstrained register width model of the F-CPU,
you can hope that bit shuffling, bitblt or string searches
will be faster and easier in the future.

YG.
