ROP2.txt
created Sun Sep  2 06:47:11 2001 by whygee@f-cpu.org


INTRODUCTION :
--------------
Here is a short programming "howto" for those who want
to understand what the ROP2 unit is and how to use it.


BACKGROUND :
------------
The ROP2 unit takes its name from its ability to perform
any 2-operand logical operation (see the F-CPU manual).
Because providing "all" the operations would be redundant
(the manual shows how and why), 8 logical operations are
encoded : AND/OR/NAND/NOR/ANDN/ORN/XOR/XNOR. The NOT
operation is just a special case of these.
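
The eight functions can be modelled in a few lines of
Python. This is only a software sketch, not F-CPU code :
the 64-bit width, the table name ROP2, the helper rop2_not
and the choice of which ANDN/ORN operand is inverted are
assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1  # assumed register width: 64 bits

# Hypothetical model of the eight ROP2 functions.
# Assumption: ANDN and ORN invert their second operand.
ROP2 = {
    "and":  lambda a, b: a & b,
    "or":   lambda a, b: a | b,
    "nand": lambda a, b: ~(a & b) & MASK64,
    "nor":  lambda a, b: ~(a | b) & MASK64,
    "andn": lambda a, b: a & ~b & MASK64,
    "orn":  lambda a, b: (a | ~b) & MASK64,
    "xor":  lambda a, b: a ^ b,
    "xnor": lambda a, b: ~(a ^ b) & MASK64,
}

def rop2_not(a):
    """NOT as a special case: NOR of the value with zero."""
    return ROP2["nor"](a, 0)
```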

If you have to program bitmap software, you will love
these operations, especially if you have reached the
limits of MMX. Indeed, the 3-address instructions and
the high number of registers make programming very easy.
If you have to perform operations on bitmaps, or any kind
of bitfields, F-CPU's ROP2 is for you.

During the design, we also discovered that there was
still some room left in the pipeline (the ROP2 operator
is extremely simple) so we added two other valuable
operations : COMBINE and MUX. When properly used, in
conjunction with operations from the SHL unit, ROP2
can give surprising results.


COMBINE MODE :
--------------
The COMBINE instructions perform either an OR or
an AND on each group of 8 consecutive bits (each byte)
of the output of the ROP2 operator. This way you can test
whether the bytes are 0x00, 0xFF or something else.
Each output byte is either 0x00 or 0xFF, thus creating
a useful mask for other operations.

Remark : If you need a NAND of the output
of the ROP2, simply use De Morgan's theorem :
turn the NAND into an OR and reverse the output
polarity of the preceding ROP2 operation. For example,
the 8-bit NAND of XORs is written "xnor.or"
(xor->xnor and nand->or).
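
The byte-wise reduction and the De Morgan remark can be
checked with a small Python model. It is a sketch under
assumptions : 64-bit registers, and the names combine,
xnor_or and nand_of_xor are invented for this example.

```python
MASK64 = (1 << 64) - 1  # assumed register width: 64 bits

def combine(value, mode):
    """Reduce every byte of value to 0x00 or 0xFF.

    mode "or":  0xFF if any bit of the byte is set.
    mode "and": 0xFF if all bits of the byte are set.
    """
    out = 0
    for i in range(8):
        byte = (value >> (8 * i)) & 0xFF
        hit = (byte != 0x00) if mode == "or" else (byte == 0xFF)
        if hit:
            out |= 0xFF << (8 * i)
    return out

def xnor_or(a, b):
    """Model of "xnor.or": OR-combine of the XNOR."""
    return combine(~(a ^ b) & MASK64, "or")

def nand_of_xor(a, b):
    """Byte-wise NAND of the XOR, written the long way."""
    return ~combine(a ^ b, "and") & MASK64
```

For any pair of inputs, xnor_or(a, b) and nand_of_xor(a, b)
give the same mask, which is what the remark claims.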

COMBINE can be used "alone" if the ROP2 function
is set to a transparent mode, such as "R0 or RN"
(R0 always reads as zero) :
  or.and r0, r1, r2; // tests if there are 0xFFs in r1
It becomes very powerful, however, when the ROP2 function
uses another parameter (usually a mask). The first,
straightforward example is searching for a specific
character in a string : the mask is set to the repeated
character and the other input is the given string.
A "comparison" of the two patterns is then possible
with one operation :
   xnor.and r1, r2, r3
which means : for each byte, if all the bits match
(the XNOR result is 0xFF), set the corresponding byte to
0xFF, otherwise set it to 0x00. A "hit" is detected
when the result is written back to the register
set, where the data is flagged as non-zero
if at least one character matches. If the instructions
are properly scheduled, we can put a conditional
move or jump after the comparison.
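
The character search can be simulated in Python as follows.
This is a sketch, not F-CPU code : it assumes 64-bit
registers and that the first character of the string sits
in the low byte; find_char and xnor_and are names invented
for the example.

```python
MASK64 = (1 << 64) - 1       # assumed register width: 64 bits
ONES = 0x0101010101010101    # multiplying by this repeats a byte

def xnor_and(a, b):
    """0xFF in every byte where a and b match, 0x00 elsewhere."""
    x = ~(a ^ b) & MASK64
    out = 0
    for i in range(8):
        if ((x >> (8 * i)) & 0xFF) == 0xFF:
            out |= 0xFF << (8 * i)
    return out

def find_char(block, char):
    """Return the hit mask for one 8-byte block of a string."""
    pattern = ord(char) * ONES                # repeated character
    data = int.from_bytes(block, "little")    # first char in low byte
    return xnor_and(pattern, data)

hits = find_char(b"F-CPU BB", "B")
found = hits != 0   # the non-zero test drives the conditional jump
```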

MUX MODE :
----------
Now, imagine that we want to replace all
'B' characters with a 'b' (or any other byte
substitution). We already know how to
detect the desired pattern with xnor, and the
ROP2 unit provides a MUX function which selects
each bit from one of the two sources, under the
control of a mask.
Here is some source code :

sdupi 'B', r1; // create the search mask into R1
sdupi 'b', r2; // create the substitution mask into R2

// loop here :
// R3 is the input data register

xnor.and r1, r3, r4; // R4 now contains the "hit" mask
mux r3, r2, r4;      // for every byte, the result contains 'b' if
                     // R4 was 0xFF, the byte of R3 unchanged otherwise.
// and now, store the result to memory
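
The substitution loop body can be simulated in Python to see
what each step produces. This is a sketch under assumptions :
64-bit registers, first character in the low byte, and a mux
that takes the substitution bits where the mask is 1 and the
data bits elsewhere (as the comments above describe). The
function names are invented for the example.

```python
MASK64 = (1 << 64) - 1       # assumed register width: 64 bits
ONES = 0x0101010101010101    # multiplying by this repeats a byte

def xnor_and(a, b):
    """0xFF in every byte where a and b match, 0x00 elsewhere."""
    x = ~(a ^ b) & MASK64
    out = 0
    for i in range(8):
        if ((x >> (8 * i)) & 0xFF) == 0xFF:
            out |= 0xFF << (8 * i)
    return out

def mux(data, subst, mask):
    """Take each bit from subst where mask is 1, from data elsewhere."""
    return (subst & mask) | (data & ~mask & MASK64)

def replace_byte(block, old, new):
    """Replace every byte equal to old by new in an 8-byte block."""
    r1 = old * ONES                         # search mask (sdupi)
    r2 = new * ONES                         # substitution mask (sdupi)
    r3 = int.from_bytes(block, "little")    # input data
    r4 = xnor_and(r1, r3)                   # "hit" mask
    return mux(r3, r2, r4).to_bytes(8, "little")
```

For instance, replace_byte(b"ABBA BoB", ord("B"), ord("b"))
returns b"AbbA bob".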

MUX was added because :
 - it is often very useful
 - it would require 3 x ROP2 instructions otherwise (quite a
    large overhead for so little)
 - it is a very, very simple operation and consumes almost
    no resource in itself (mostly control stuff)
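
The "3 x ROP2" cost can be made concrete with a Python model :
without MUX, the same selection needs an AND, an ANDN and an
OR. This is a sketch assuming 64-bit registers and an ANDN
that inverts the mask operand; both function names are
invented for the example.

```python
MASK64 = (1 << 64) - 1  # assumed register width: 64 bits

def mux_reference(data, subst, mask):
    """Bit-by-bit definition: pick subst where mask is 1, else data."""
    out = 0
    for i in range(64):
        src = subst if (mask >> i) & 1 else data
        out |= ((src >> i) & 1) << i
    return out

def mux_with_rop2(data, subst, mask):
    """The same selection built from three plain ROP2 operations."""
    t1 = subst & mask            # and
    t2 = data & ~mask & MASK64   # andn
    return t1 | t2               # or
```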

PERMUTATIONS :
--------------
But you have not yet seen the "real power" of this unit
(and showing it is the purpose of this document) : it can do wonders
at shuffling bits in arbitrary directions, thus removing
the need to implement a complex and costly FPGA in the F-CPU
(as has sometimes been asked for). I will show you that a proper
succession of "simple" operations can permute all the bits
of a byte (we are still limited to 64 bits today, but extensions
to wider bit fields are possible). Those who love bit
reversing and other similar functions will have to wait
for the SHL unit because bit reversing is still a "regular"
operation. I am speaking here about _arbitrary_ bit shuffling :
    0 -> 3
    1 -> 2
    2 -> 4
    3 -> 7
    4 -> 5
    5 -> 1
    6 -> 0
    7 -> 6
for a purely random example.

The first step is to prepare the masks :
 mask1 = 0x8040201008040201; // byte N selects bit N
 mask2 = 0x4001022080100408; // permuted mask : byte N holds the
                             // destination bit of source bit N
(this is usually performed by the user and input in the source code)
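
Since the masks are computed by hand, a small Python helper
can derive both of them from the permutation table and avoid
transcription slips. This is a sketch : PERM and build_masks
are names invented for the example, and byte N is assumed to
be counted from the low end of the 64-bit value.

```python
# Permutation table from the text: source bit -> destination bit.
PERM = {0: 3, 1: 2, 2: 4, 3: 7, 4: 5, 5: 1, 6: 0, 7: 6}

def build_masks(perm):
    """Return (mask1, mask2) for an 8-bit permutation.

    Byte N of mask1 has only bit N set.
    Byte N of mask2 has only bit perm[N] set.
    """
    mask1 = 0
    mask2 = 0
    for src, dst in perm.items():
        mask1 |= (1 << src) << (8 * src)
        mask2 |= (1 << dst) << (8 * src)
    return mask1, mask2

mask1, mask2 = build_masks(PERM)
print(f"mask1 = 0x{mask1:016X}")
print(f"mask2 = 0x{mask2:016X}")
```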

The second step is to "expand" the original data :
 sdup.b r1, r2; // The byte in the lower part of r1 is duplicated all over r2

Then we use the first mask to create another :
 and.or r2, r3, r4; // assume r3 = mask1,
     // r4's Nth byte now contains 0xFF if r2's Nth bit is set.

This new mask is then ANDed with the permuted mask :
 and r4, r5, r6; // assume r5 = mask2

Now the problem is to gather all the bits in the lower part of
the register. There is no dedicated operation but it is not
necessary to loop over all the bytes : a Log2 sequence is
far better and faster ! Each OR must accumulate into the
register that the next shift reads :

 shri 32, r6, r7
 or r6, r7, r6
 shri 16, r6, r7
 or r6, r7, r6
 shri 8, r6, r7
 or r6, r7, r7

Now the low byte of r7 contains the permuted bits.
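
The whole sequence can be checked against a naive bit-by-bit
permutation with a Python model. This is a sketch, not F-CPU
code : it assumes 64-bit registers and that every OR of the
gathering step accumulates into the value shifted next. PERM,
permute_byte and permute_naive are names invented for the
example, and mask2 is derived from the table in the text.

```python
PERM = {0: 3, 1: 2, 2: 4, 3: 7, 4: 5, 5: 1, 6: 0, 7: 6}
MASK1 = 0x8040201008040201
# Byte N of MASK2 holds the destination bit of source bit N.
MASK2 = sum(1 << (dst + 8 * src) for src, dst in PERM.items())

def combine_or(value):
    """0xFF in every non-zero byte of value, 0x00 elsewhere."""
    out = 0
    for i in range(8):
        if (value >> (8 * i)) & 0xFF:
            out |= 0xFF << (8 * i)
    return out

def permute_byte(x):
    r2 = (x & 0xFF) * 0x0101010101010101   # sdup.b: duplicate the byte
    r4 = combine_or(r2 & MASK1)            # and.or: byte N = 0xFF if bit N set
    r6 = r4 & MASK2                        # and: keep the destination bits
    r6 |= r6 >> 32                         # log2 gathering, 3 steps
    r6 |= r6 >> 16
    r6 |= r6 >> 8
    return r6 & 0xFF                       # only the low byte is meaningful

def permute_naive(x):
    """Reference: move every bit individually."""
    return sum(((x >> src) & 1) << dst for src, dst in PERM.items())
```

Comparing permute_byte(x) with permute_naive(x) for all 256
byte values is a quick way to validate a hand-written mask.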

WARNING :
---------
In the above example, you have to keep in mind that the
9 instructions will not execute in 9 clock cycles, mainly
because of the additional Xbar cycle for every instruction.
In order to benefit from all the F-CPU power, you will have to "schedule"
the instructions and fill the gaps between two dependent instructions.

Here the case is very simple : all the instructions execute in
one cycle (or so I hope) so we have to "interleave" one cycle
between each pair of dependent instructions. This is good news if
you have to perform the permutation a lot of times (over a whole
register for example) because you can duplicate every instruction
once and rename the registers. This way, the FC0 can sustain
an average of one byte permutation every 9 cycles :-)
[of course it depends on how much you unroll the loop]


CONCLUSION :
------------
The applications of the ROP2 features are wide and their usefulness
is obvious. Most cases where an FPGA would be required can be
reformulated with a proper sequence of ROP2/SHL operations
and using logarithmic packing/unpacking (thus avoiding silly loops).
This strategy keeps the F-CPU architecture and cores simple
and the code will benefit from future enhancements (superscalar,
OOO, whatever) to the core, while an "FPGA" would not scale as
well and would handicap the whole F-CPU family. Through correct
instruction scheduling and good programming practices, what
would have taken many small iterations can now be performed with
a few instructions operating on large registers.
With the unconstrained register width model of the F-CPU,
you can hope that bit shuffling, bitblt or string searches
will be faster and easier in the future.

YG.