Testing a HighSpeed Data Path The Design
of the RSAb Crypto Chip
Wolfgang Mayerwieser
(Graz University of Technology, Austria
wmayer@ iaik.tugraz.ac.at)
Karl C. Posch
(Graz University of Technology, Austria
kposch@iaik.tugraz.ac.at)
Reinhard Posch
(Graz University of Technology, Austria
rposch@iaik.tugraz.ac.at)
Volker Schindler
(Graz University of Technology, Austria
vschindl@iaik.tugraz.ac.at)Abstract: High speed devices for public key cryptography
are of emerging interest. For this reason, the crypto chip
was designed. It is an architecture capable of performing fast RSA
encryption and other cryptographic algorithms based on modulo
multiplication. Besides the modulo multiplication algorithm called
FastMM, the reasons for its high computation speed are the As
Parallel As Possible (APAP) architecture, as well as the high
operation frequency. The crypto chip also contains onchip
RAM and a specialpurpose control logic, enabling special features
like encrypted key loading. However, this control mechanism
influences to some extend testability of the MM data path which is
the heart of the chip. For this reason, the crypto chip has
been designed to be able to evaluate the behaviour of the pure MM data
path. In the following, we describe the strategies used with the
crypto chip for testing the MM data path under realistical
conditions. In this context, analyzing control signal flow turns out
to be the key action. This work has been sponsored as part of the
project Nr. P9384PHY "Sichere Kommunikation bei hohen
Geschwindigkeiten" by the Austrian Science Foundation. Keywords:
high speed multipliers, hardware algorithms, design for testability,
public key cryptography Category:
E.3, B.5.1, B.2.2
1 Introduction
Simulating a CMOS VLSI design and testing the prototype of a newly
designed chip is a sophisticated and time consuming task. This is in
particular true in the special case of the RSA Crypto Chip, which
has as its core a data path of rather high complexity for performing
modulo multiplications (MM), organized as a bitslice SIMD
architecture [Lippitsch et al 1992]. Several reasons are
responsible for the need of detailed testing:  The modulo multiplier
is built in modified domino logic [Weste et al 1993], which results
in smaller area and higher speed, but requires a very careful design process. Page 728
 The necessity for handling very large integers for a secure RSA
implementation causes a width of the MM data path of about 700
bits. Designing CMOS circuits of this size leads to specific problems
not present when designing much smaller systems. Therefore, it is
quite advisable to spend very much effort on simulating the chip's
function, and to provide builtin testing features. Furthermore,
standalone testing of the MM data path without control logic
driving it would be a preferable strategy. We will concentrate on
the last item. Real standalone testing of the MM data path in a
separate package results in feeding its control signals directly via
the chip's pads. At this point, a problem arises: The maximum clock
rate allowed by the pads is lower than the clock rate we would need
at the MM data path to reach the desired processing speed by a factor
of about 5. It makes no sense (and even might be impossible for
dynamic logic) to test the MM data path at a much lower speed than it
was designed for. Therefore we cannot avoid adding some kind of
control logic between the pads and the MM data path. Test results
should not be influenced by that control structure. It should not
cause any changes to the layout of the MM data path, and it must be of
very high reliability. Both requirements call for an "as simple as
possible" control structure. The key for finding such a minimum
version of the control structure is to analyze the control signal
flow, which allows for reducing redundancy, and even leads to the use
of a modified version of the original modulo multiplication algorithm
for optimizing control signal flow. In this paper, we will take a
closer look at the modulo multiplication algorithm (which is the
heart of the RSA implementation), as well as at the MM data path
designed for performing this operation efficiently. After analyzing
the control signal flow, enough information is available to
introduce a simple mechanism for compensating the clock rate
difference without loss of overall performance. Finally, we will show
an outline of a possible hardware implementation.
2 The FastMM Algorithm
The RSA algorithm [Rivest et al 1978] performs both, encryption and
decryption, by calculating
with c, C the cipher
text and plain text or vice versa, and x the encryption or decryption
key. To meet security demands, the modulus N must be of a length of
approximately 700 bits. Exponentiation can be performed by the well
known SquareandMultiply algorithm [Knuth 1981]. Therefore, for
our purpose it is sufficient to consider the RSA implementation as a
sequence of continued modulo multiplications with very long integers,
separated by simple register transfer operations.With the RSA
Crypto Chip, modulo multiplication is done by the FastMM algorithm
[Posch et al 1990]. This algorithm performs which is much easier to compute than the exact modulo multiplication Page 729
as it avoids division and uses multiplication only like the Montgomery
approach [Montgomery 1985]. We will refer to y as relaxed residuum,
since it is possibly not fully reduced. The efficiency of FastMM is based on a fast and good approximation Q of the quotient which is used to calculate the relaxed residuum
Q is calculated as follows:
k is the length of the modulus N in bits. For convenience, we
introduce
Note that both, NegN as well as N1 can be precomputed. Now, FastMM
can be represented using the following symbolic instructions
denotes the i rightmost bits of w, denotes the i leftmost bits of w):
In [Posch et al 1990] it is shown that the results of continued
relaxed modulo multiplications never exceed the limit given by
e. Therefore, only after the last multiplication a final modulus
reduction might be necessary. Obviously, this reduction step does
not decrease throughput, compared to the hundreds of modulo
multiplications which are to be performed for getting the relaxed
result. This final modulus reduction can be done by adding NegN to
the result. Note that NegN and the possibility to perform an addition
are needed by the FastMM algorithm, anyway. Therefore, the final
modulus reduction requires no additional hardware effort or
constants.
3 The MM Data Path
3.1 The Structure of the MM Data Path
The design of the MM data path, shown in [Figure 1], meets the special
requirements of the FastMM algorithm. It consists of several main
registers with length 696 bits (IO, Cph, Mkd, NegN, N1, Low, High),
an APAPstructure [Lippitsch et al 1991] (described in [Section
3.2]) and a few auxiliary registers of length eight bits. Adder1 and
Adder2 are needed for converting the redundant number
representation used by register High and the APAPadder  into
binary numbers. Adder3 performs the addition of multiplication step
(C). The purpose of the transmission gates as well as the overall
function of the MM Page 730
Figure 1: The MM data pathPage 731
data path will become clear in [Section 3.3]. We will use the main
registers in the following way: The IO register is needed for
input/ouput operations, Cph (for cipher text) is commonly used for
storing the last result while the next modulo multiplication is
running. NegN and N1 hold the corresponding constants mentioned in
[Section 2], Mkd must contain the actual multiplicand. Low receives
the lower part of the result of a multiplication in binary number
representation, while High gets the higher part in redundant number
representation. The MM data path requires quite many control signals,
denoted as a control word. We present a complete list of the control
signals which shows the actions in each clock cycle if the
corresponding signal is active: EvlX: Write value of register X to the 696 bit bus, where X can be
one of the following: IO, Cph, NegN, N1, Low.
LdY: Load register
Y with the value of the 696 bit bus, where Y can be one of the
following: IO, Cph, NegN, N1, Low, Mkd.
LdHigh: Load register High with the current value of
the APAPadder, then clear the APAPadder.
ShrHigh: Enable storing the least significant byte of register High to
LSBHigh, then shift High to the right by one byte.
ShrLow: Shift Low to the right by one byte, including MSBLow.
ShrIO: Shift IO to
the right by one byte including MSBIO, thus loading a new byte from
outside via MSBIO. LSBIO buffers the least significant byte of IO
and provides it for output.
ShrCph: Shift right Cph by one bit.
EnAC: Enable storing the least significant byte of the APAPadder to
LSBAdd during operation (A) resp. (C) of FastMM.
EnC: Enable
storing the least significant byte of register Low to LSBLow during
operation (C) of FastMM.
EnCut: Cut off the bit stream on changing
from high to low. : Eight bits, holding the value of .
MultK: Load register Trp with for multiplying with Mkd.
3.2 The APAPStructure
The APAP (As Parallel As Possible)structure represents a mixture of a
serial/parallel multiplier and a full parallel Wallace tree
multiplier [Wallace 1964]. This approach has been chosen because
for long integer arithmetic conventional serial/parallel
multipliers would be far too slow, whereas a full Wallace tree
multiplier cannot be implemented on reasonable chip area with current
technology. The degree of parallelism (i.e., the number p of new
partial products that can be processed each cycle) may vary between
certain boundaries, which allows for scaling the APAP design. To
avoid carry delays, redundant number representation is used. For our
purpose, we hide these implementation details, since we are only
interested in the overall understanding of control signal
handling. Therefore, we consider the APAPstructure together with
Adder2, as well as High together with Adder1 to be "black
boxes", thus hiding the internal function of the APAPadder and the
redundant number representation. (However, for Page 732 understanding the timing of the control signal sequence, the number of
cycles consumed by the APAPadder and Adder1 resp. Adder2 would
have to be known.) Multiplication can now simply be described as a
serial/parallel multiplication, where p bits of the multiplier are
processed each cycle. For the RSA Crypto Chip, p = 8 has been
chosen. The following example [Figure 2] provides such a pure
behavioural description Figure 2: Multiplication scheme
of the multiplication procedure. It uses decimal digits instead of
bytes, showing the principle of calculating the product of the
multiplicand 789 and the multiplier 456. denotes the content
of Reg at the nth cycle: c is a positive constant depending on the
APAPstructure and the adders. The APAPadder gets the first partial
product (PP) by multiplying the multiplicand (held in Mkd) with the
least significant digit of the multiplier. Note that the least
significant digit of this PP represents the least significant digit
of the the final result and can be loaded into LsbAdd. Therefore,
this PP is shifted to the right by one digit, being aligned for the
addition with the PP of the next multiplication cycle as well. The
least significant digit of the result of that addition represents the
next digit of the result, and so forth. After the multiplicand has
been multiplied with the most significant digit of the multiplier,
the APAPadder contains the higher part of the result, which can be
loaded into High. Note that the APAPadder always shifts its content
to the right by one byte Page 733
each cycle, while Low resp. High can be forced by the corresponding
control signals to do so. It is important to understand that the MM
data path is able to perform these actions in parallel: For example,
while the APAPmultiplier is fed from High via Trp with a sequence of
bytes to multiply, former multiplication results may be shiftet into
register Low. There are concurrent data streams, and their flow is
controlled by a few control signals.
3.3 Performing a Modulo Multiplication with the MM Data Path
Table 1: The FastMM scheme with the MM data pathTable 1 shows the overall scheme of the modulo multiplication with the
MM data path. Once registers NegN and N1 are loaded with the
corresponding constants, the multiplier b must get into
High. Figure 1 shows that this can only be done by loading it into
Mkd first, followed by a multiplication by one. We will call this
operation Mult, because we assign the correct value to k (in this
context we consider k = 0) and set MultK to high for one cycle. This
way, the content of Mkd gets multiplied with one (i.e.,), and must
be loaded from the APAPadder into High at the very right moment,
since the APAPadder is shifting its content permanently. After loading Mkd with the multiplicand a, we can start with operation
ModMul, the modulo multiplication using the FastMM algorithm. Figure
3  which shows the control signal waveforms of a 664bit modulo
multiplication simulation  can now be understood: We start
multiplication step (A) with setting high ShrHigh to feed the APAP
structure with the multiplier byte by byte. EnAC, EnCut and ShrLow
are also set to high for shifting the lower part of the intermediate
result into Low: it will be needed in step (C). When the higher part
of the result is ready, it is loaded into High by activating
LdHigh. At the same time, N1 can be loaded into Mkd, starting
multiplication step (B). This step is similar to step (A) except that
shifting the result into Low is omitted. Step (C) starts with loading
NegN into Mkd and is equal to step (A), but additionally sets EnC to
high, thus adding the lower part of the multiplication result with
the previous content of Low. After this procedure, the final result
can be fetched from Low. Page 734 Figure 3: Timing diagram of the control signals during Mult and
ModMul
3.4 Observations about the Control Words
We note that the two operations Mult and ModMul work correctly if
and only if their related sequences of control words appear in a
given order with a given timing: in particular, the control words
within these two operations must not be separated by any additional
control word, not even by a NOP (that is, a control word which
performs no operation onto the MM data path). That is because the
flow of the different data streams will proceed with the internal
clock, and synchronization with external feeding of control words is
to be assumed at given cycles. Obviously, a wrong product would be
calculated otherwise. We will refer to such a noninterruptable
sequence of control words as an atomic sequence. Wef urther note that
throughout the algorithm some control words are repeated several
times. These sections will be referred to as stable sections. With
respect to [Section 5] it is useful to consider an atomic sequence as
a sequence of stable sections, where the number of the stable
sections is called the Page 735 Figure 4: An atomic sequence, consisting of several stable sectionsdepth of the atomic sequence. Figure 4 shows an atomic sequence of a length of 15 cycles and depth 4.
4 A Modification of the FastMM Algorithm: FastMM*
A closer look at the FastMM algorithm shows that truncation of
intermediate results is necessary. For reasons of regularity it would
be easier to implement truncation only at multiples of i bits rather
than at arbitrary positions. With FastMM, values would generally be
cut off at positions . To facilitate implementation, a
solution of this problem is given by a method called
preshifting. With this method, constants NegN and N1 as well as a and
b are preshifted to the left by certain values. Loading the multiplier
into High is done by using a when performing operation
Mult, thus shifting the multiplier to the left by k bits for
aligning at a desired border. After the modulo multiplication is
finished, a final alignment is necessary to get the correct result.
Furthermore, EnCut must be able to truncate at arbitrary bit borders
rather than at a position It is needless to mention that
these additional procedures make the handling of the control signals
more complicated. Therefore, the lengths of stable sections within
the atomic sequence of operation ModMul are shortened, increasing
the depth of the atomic sequence  this is disadvantageous, as we will
see in [Section 5]. For that reason, we would prefer an algorithm
which avoids these difficulties. Page 736
The obvious goal in algorithm design is to minimize depth for a given
algorithm by control word rearranging. A solution is given by a
modified version of FastMM, called FastMM*. FastMM* consists of the following steps:
The basic idea is to apply shifts only to constants which can be
performed in advance. This algorithm has the same structure as
FastMM, but uses a different constant M1 instead of N1. Preshifting
will not be necessary because and can be determined to
meet the conditions (mod i) and (mod
i). Typically, one chooses , where r is the radix on which
the long integer hardware is operating. The goals defined above can
be achieved by calculating M1, and in the following
way:
k is the length of the modulus N in bits. A further benefit of FastMM* is that it allows optimization in terms of
the possibility to change more control signals at the same time
(i.e., a higher degree of parallelism) than with FastMM. This results
in a significant reduction of the depth of the atomic sequence
associated with ModMul. Additionally, for FastMM* is always 0,
therefore control signals associated with may be omitted.
5 Managing the Clock Rate Difference Using a FIFO
In this section, a method for compensating the difference between the
maximum clock rate allowed by the pads and the clock rate of the MM
data path is presented. First, data flow at the control signal
interface will be reduced. This can easily be done because the MM
data path doesn't need to be fed with one control word each cycle if
we use stable sections instead. Each stable section is described by
the corresponding control word and the numer of cycles it
is "stable" [see figure 4]. The control structure between the pads
and the MM data path now consists of a buffer for the control word,
and a down counter to decide when to fetch the control word of the
next stable section. That would be sufficient if each stable section
consisted of enough cycles, ensuring that continuous fetching of
control words never exceeds the maximum clock rate allowed by the
pads. Page 737
Unfortunately, there are a lot of stable sections consisting only of a
few cycles. Therefore, a mechanism has to be added which feeds
internally generated NOPs into the MM data path until the next
control word is available. Otherwise, control words could influence
the MM data path in a wrong way. However, this will still not work
because atomic sequences must not be interrupted by NOPs. (The
problem arises because of the existence of atomic sequences which
contain stable sections of length one, as we will see in [Section
5.1].) The
Figure 5: A FIFO for compensating the clock rate difference solution is to add a FIFO between the pads and the control word
buffer [see figure 5]. Inserting NOPs in between an atomic sequence
can be prevented by preceding each atomic sequence with an externally
inserted NOP as the control word, and a repeat count big enough to
fill the FIFO with the stable sections of the entire atomic
sequence. When the down counter expires, the stages of the FIFO are
shifted, and the counter gets the next repeat count as start value. Internally generated NOPs will only reach the MM data path if the
FIFO is cleared before the next external control word could reach the
control word buffer. Therefore, the depth of the FIFO must not be
lower than the depth of the longest atomic sequence. As will be
shown in [Section 5.2], analyzing atomic sequences of FastMM* allows
for reducing this depth. Page 738
5.1 Atomic Instruction Sequences of FastMM*
For observing the control word flow at the MM data path, a simulation
environment has been developed. This environment uses for
controlling the simulation of an RSA encryption with . The
simulation is based on an functional model of the MM data path. The use of the procedural simulation interface provided by GENIE has several advantages:  Simulation runs of high complexity can be controlled and evaluated
by using a Clike language, thus allowing a higher level of
abstraction than with conventional methods.  Straightforward
observation of the 696 bit wide MM data path becomes enabled.  One
can get the sequence of stable sections (that is, the control words
and the associated number of cycles) for a modulo multiplication
totally automatic.  Flexibility in controlling the simulation
simplifies extracting the atomic sequerrces.  Once the simulation
environment is "debugged", the occurance of errors when analyzing
the interface is minimized. Using this environment, the control signal interface of the MM data
path can easily be analyzed for different algorithms as well as for
different implementations of the same algorithm. Additionally, it
supplies the control words needed for controlling a test chip which
uses the proposed FIFO.
Table 2: Stable sections of Mult In [Table 2] and [Table 3], the atomic sequences for Mult and
ModMul of a full 664 bit modulo multiplication are shown. For each
stable section the first value represents the control word, the
second value the required repeat count. We observe that ModMul calls
for a FIFO of at least depth 9. This result is only valid for the
FastMM* algorithm: as mentioned in [Section 4], the FastMM algorithm
would increase depth significantly. 
(General Interpreted Environment) is a generalpurpose interpreted
language wellsuited for use as a control language attached to
application programs like Lsim.
design automation tool that allows to analyze the behaviour of
electronic designs ranging from highlevel, abstract system models
through fullcustom integrated designs. is a hardware description
language for describing the behaviour of circuits in functional
models, for use with the Lsim simulator. Page 739
Table 3: Stable sections of ModMul
5.2 Reducing Depth of the Instruction FIFO
Having in mind [Table 3], we are able to introduce some
optimizations. Since ModMul starts with a control word which is
repeated many times, this atomic sequence doesn't need to be preceded
by an externally inserted NOP: the FIFO will be filled, anyway. Thus,
waste of execution cycles can be reduced. That is true because we
made the implicit assumption , where is the external
cycle time (the maximum cycle time allowed by the pads), and is
the internal cycle time the MM data path uses. It seems to be
adequate to assume r = 5. A closer look at [Table 3] shows that for r
= 5 a FIFO of depth 3 is sufficient because within ModMul there are
stable sections long enough to prevent clearing out the FIFO. This
depth also satisfies Mult . A second possible method to reduce the
depth of the FIFO is to split the atomic sequence associated with
ModMul into three shorter atomic sequences associated with the
multiplication steps (A), (B) and (C). This method adds extra cycles
to the entire modulo multiplication, but does not lead to a smaller
FIFO than the first method does. Because speed of the modulo
multiplication is the bottleneck of an RSA implementation, that
method is not of prime interest.
6 An Outline of the Hardware Implementation
Using a FIFO for controlling the MM data path meets the goal of having
an "as simple as possible" control structure for testing
purposes. It would be quite consistent to avoid extremely complex
design methods when implementing the FIFO in hardware. For the same
reason, shifting the contents of the FIFO stages is chosen to be done
synchronously rather than asynchronously. The clock of the MM data
path is used, since the FIFO must operate at the same (high)
speed. This high speed seems to be the only substantial difficulty
when designing the FIFO circuit layout. As we have seen in [Section
5], a down counter will be needed. It has to fulfil the given timing
constraints: decrementing, detecting underflow, setting up load
signals for the stages and shifting the stages must work within a
single cycle. Therefore, the down counter is replaced by a shifter,
as shown in [Figure 6]. This Page 740
Figure 6: Replacing the down counter by a shifterway, much of time is saved because shifting a single bit works a lot
faster than decrementing a counter, and no decoder is necessary for
detecting whether the next control word has to be loaded: this can be
done by inspecting the single bit b0. To avoid an excessively long
shifter, its length can be decreased by limiting the repeat count to
a certain power of two, thus splitting long stable sections. On the
increasing FIFO depth resulting thereof and the length of the shifter
a good compromise has to be evaluated. Figure 7 shows a schematic
diagram with a FIFO of depth 4. Precise timing issues are omitted: in
particular, the circuit for the handshake signal timing has to be
designed carefully. Special attention has to be paid to the
asynchronous Strobe input to avoid metastable states and
synchronization failure. The handshake procedure works in the
following way: If the top FIFO stage is empty, Busy is set to low,
indicating a request for the next control word. The positive edge of
Strobe signals that the next control word is available from outside,
and sets Busy to high. For the time Busy is low, internal NOPs with
repeat count 0 are to be inserted. This is done by "anding" the
input signals with Busy. Additionally, the Ext bit is cleared,
indicating an internally generated NOP. If Busy is high and the load
signal for the top stage is set, there must have been a rising edge
of Strobe (because the load signal resets Busy). In that case, the
top stage is loaded from outside. Load signals are generated as
follows: Expiration of the shifter means to load all stages. If any
stage is marked as an internally generated NOP, it should be
overwritten at the next cycle, and all stages above have to move down:
with other words, the decision for shifting an upper stage depends on
the state of all underlying stages. This results in cascaded OR
gates, shown in [Figure 7]. Remember there is only one cycle for
setting up all load signals, therefore serious Page 741
Figure 7: Schematic diagram of the FIFOdelays caused by cascaded gates must be avoided.
7 Conclusion
In order to design a chip for testing the MM data path, we searched a
minimum version of the control structure. After analyzing the data
flow at the control interface, we introduced a FIFO for compensating
the difference of the maximum clock rate allowed by the chip"s pads,
and the clock rate the MM data path Page 742
requires to reach the desired encryption rate. Simulation results for
a complete 664 bit RSA run based on Mmodels of the FIFO and the MM
data path showed that it is sufficient to precede the whole sequence
of multiplications with only a few NOPs, if we use the FastMM*
algorithm and do not reduce the depth of the FIFO as described in
[Section 5.2]. In this case, not even a single additional NOP needs
to be inserted in between that sequence of multiplications. Because
almost all cycles of a complete RSA run are needed for that sequence
of multiplications, overall performance is not significantly affected
by a slow clock rate at the pads. Therefore, if there is no need to
reduce the depth of the FIFO (e.g. for shrinking down its size to
save chip area), an RSA chip using the proposed control stucture
might be appropriate even beyond testing purposes.
References
[Knuth 1981], D.E. Knuth: The Art of Computer Programming.
Second Edition, Volume 2 / Seminumerical Algorithms,
AddisonWesley (1981), pp 441442.
[Lippitsch et al 1991] P. Lippitsch, K.C. Posch,
R. Posch: Multiplication As Parallel As Possible. First
International Conference of the Austrian Center for Parallel
Computation, (Sep. 1991).
[Lippitsch et al 1992] P. Lippitsch,
K.C. Posch, R. Posch, V. Schindler: A scalable design with encryption
rates from 200 kBit/s to 1.5 MBit/s. 32nd International Science Week, Damascus, (Dec. 1992).
[Montgomery 1985] P.L. Montgomery: Modulo
Multiplication without Trial Division. Mathematics of Computation,
Volume 44, Number 170, (Apr. 1985). pp 519521.
[Posch et al 1990]
K.C. Posch, R. Posch: Approaching encryption at ISDN speed using
partialparallel modulus multiplication. Microprocessing and
Microprogramming 29, (1990), pp 177184
[Rivest et al 1978]
R. Rivest, A. Shamir, L. Adleman: A method for obtaining digital
signatures and publickey cryptosystems. Comm. ACM (1978), pp
120126.
[Wallace 1964] C.S. Wallace: A suggestion for a fast
multiplier. IEEE Transactions on Electronic Computers, Vol. EC13,
(Feb. 1964), pp 1417.
[Weste et al 1993] N.H.E. Weste,
K. Eshraghian: Principles of CMOS VLSI Design. A Systems
Perspective. Second Edition, AddisonWesley (1993).
Page 743
