f-cpu/vhdl/eu_popc/README.txt
created Sat Jun 29 14:47:32 CEST 2002 by whygee@f-cpu.org

Overview :

The POPCOUNT Execution Unit is an optional EU in FC0.
The operation is also known as the "canonical NSA instruction",
"Hamming weight", "sideways addition" or bitcount.

It's not often used, but when it's needed, one is very happy
to have it as an opcode. Software alternatives require tens
of instructions, and that cost is particularly noticeable in FC0
unless you can unroll/interleave the operations in a loop.
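
To give an idea of what the opcode replaces, here is a minimal
sketch of one well-known software alternative (the branch-free
"SWAR" method) in C. The function name popcount64_sw is only
illustrative and is not part of the F-CPU sources. Note that each
step depends on the previous one, which is why it hurts unless
several words are interleaved.

```c
#include <stdint.h>

/* Software popcount of a 64-bit word (classic SWAR method).
 * Illustrative only: the name is hypothetical, not F-CPU code. */
unsigned popcount64_sw(uint64_t x)
{
    /* 2-bit fields: each field holds the count of its two bits */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* 4-bit fields: add neighbouring 2-bit counts */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* 8-bit fields: add neighbouring 4-bit counts */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* sum the eight byte counts into the top byte */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```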

However it has several uses that may be critical in some
applications : telecommunications, scientific computations,
chess games or graphics... Macros for performing this function
can be found in the source code of X11 or Linux, if you
thought it was only a "spook"'s instruction.

For example, it can be used for computing the parity of a word,
or more complex operations on data integrity (error recovery
on networks or storage, etc.). An extension of this unit would
be to compute SEC/ECC or Reed-Solomon codes as used in
radiocommunications.
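
The parity case is simply the lowest bit of the bit count. A
minimal C sketch follows, where popcount64_ref is a hypothetical
stand-in for the POPC opcode and parity64 is an illustrative name:

```c
#include <stdint.h>

/* Stand-in for the POPC opcode (hypothetical helper):
 * clears the lowest set bit until the word is empty. */
static unsigned popcount64_ref(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Parity of a word: 1 if the number of set bits is odd, else 0. */
unsigned parity64(uint64_t x)
{
    return popcount64_ref(x) & 1u;
}
```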

Another potential use is to analyse the output of INC,
in order to determine the position of a bit. It could easily be
done with a bunch of ORs but that would add another pipeline stage
to the INC unit, which is not reasonable. Since this function is
not often used (unless emulating FP with ints), POPC is a better
place to do it.
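
As a rough illustration of the idea, and not a description of the
actual INC datapath: assume the incremented value is XORed with
its input, which gives a mask covering the trailing ones plus the
first zero bit. Counting the bits of that mask yields the position.
All names below are hypothetical.

```c
#include <stdint.h>

/* Stand-in for the POPC opcode (hypothetical helper). */
static unsigned popcount64_ref(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Index of the lowest zero bit of x, or 64 if x has no zero bit.
 * Assumption: the mask x ^ (x + 1) is available from the
 * increment; the real INC unit output may differ. */
unsigned lowest_zero_index(uint64_t x)
{
    if (x == ~0ULL)
        return 64;                    /* no zero bit at all */
    uint64_t mask = x ^ (x + 1);      /* trailing ones + first zero */
    return popcount64_ref(mask) - 1;  /* mask length minus one */
}
```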

Popcount is used in more places than one can imagine
because it is one of the fundamental operators in
information theory. But most importantly, it is used
for signature compaction during BIST.


Structure :

It's mainly a SIMD 64-bit unit (Max chunk size = 64 bits,
replicated as needed). It's a tree of adders, so the SIMD
part is straightforward (as usual, it's done with
rows of MUXes).

The "subtract with saturation" instruction is removed
because it's not needed at all. I may not have realized
how dumb it was when I proposed it. However, since there
is a free unused read port, the Hamming distance between two
operands can be computed at very little extra cost (a single row
of XORs). When normal POPCOUNT is required, just keep one of the
inputs zeroed.
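
In C terms, the two-operand behaviour can be modelled as below.
This is only a behavioural sketch; the function names are
illustrative and popcount64_ref stands in for the adder tree.

```c
#include <stdint.h>

/* Stand-in for the adder tree (hypothetical helper). */
static unsigned popcount64_ref(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* XOR row followed by the bit count: number of bit positions
 * in which a and b differ. With b == 0 this is a plain popcount. */
unsigned popc_xor64(uint64_t a, uint64_t b)
{
    return popcount64_ref(a ^ b);
}
```

For example, popc_xor64(x, 0) gives the ordinary bit count of x,
and popc_xor64(x, x) is always 0.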

The first pipeline stage is the XOR row followed by a first
byte-wide optimised POPCOUNT stage. The other stages are
adders arranged in a tree. A MUX tree overlaps the adder
tree so there is no time loss at the last stage (only a
2-mux overhead for SIMD, since the other MUXes are buried
earlier in the pipeline).
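
The following C model is a behavioural sketch of that structure,
not a translation of the VHDL: the "if" statements play the role
of the MUXes that decide how far up the adder tree each chunk is
summed. The names and the size_log2 encoding (0 = 8-bit chunks,
1 = 16, 2 = 32, 3 = 64) are assumptions made for the example.

```c
#include <stdint.h>

/* Stage 1: XOR row, then an 8-bit popcount in every byte lane. */
static uint64_t popc_bytes(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
}

/* Later stages: one adder level per doubling of the chunk size.
 * Each chunk of the result holds the bit count of that chunk. */
uint64_t popc_simd(uint64_t a, uint64_t b, unsigned size_log2)
{
    uint64_t x = popc_bytes(a, b);
    if (size_log2 >= 1)   /* 16-bit chunks: add byte pairs */
        x = (x + (x >> 8)) & 0x00FF00FF00FF00FFULL;
    if (size_log2 >= 2)   /* 32-bit chunks: add 16-bit pairs */
        x = (x + (x >> 16)) & 0x0000FFFF0000FFFFULL;
    if (size_log2 >= 3)   /* 64-bit chunk: add the two halves */
        x = (x + (x >> 32)) & 0x00000000FFFFFFFFULL;
    return x;
}
```

With all 64 bits set and the second operand zeroed, the model
returns 8 in every byte for size_log2 = 0 and a single 64 for
size_log2 = 3.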

There is no black magic except for creating nice optimised
adders and the 8-bit popcount stage. Here I prefer to let
Michael show and reuse his skills :-)
