High-radix Division with Approximate Quotient-digit Estimation
Peter Fenwick
Department of Computer Science, The University of
Auckland, Private Bag 92019, Auckland, New Zealand
p_fenwick@cs.auckland.ac.nz

Abstract: High-radix division, developing several quotient bits per
clock, is usually limited by the difficulty of generating accurate
high-radix quotient digits. This paper describes techniques which
allow quotient digits to be inaccurate, but then refine the result. We
thereby obtain dividers with slightly reduced performance, but with
much simplified logic. For example, a nominal radix-64 divider can
generate an average of 4.5 to 5.5 quotient bits per cycle with quite
simple digit estimation logic. The paper investigates the technique
for radices of 8, 16, 64 and 256, including various qualities of digit
estimation, and operation with restricted sets of divisor multiples.

Keywords: Division, high radix, approximate digit estimates

Category: B2
1 Introduction
Division has always been one of the more difficult of the fundamental
operations in computers. Some very early computers omitted it
altogether, even if they included multiplication, relying on one of
the multiplicative algorithms mentioned later. (Many modern
supercomputers also follow this well-established historical
precedent!) Division was often restricted to basic restoring or
non-restoring methods which develop only one bit per cycle, even on
computers where multiplication handled two or more bits per cycle.

There are several methods for achieving binary division which is
faster than the simple non-restoring method.
 The basic techniques of efficient digit-by-digit binary division,
originally using multiples of {-1, 0, +1}, were established by
[Robertson 1958] and [Tocher 1956], summarised by [MacSorley 1961] and
analysed by [Freiman 1961], who introduced the term SRT
division. Robertson discusses Radix-4 SRT division and [Atkins 1970]
extends the analysis to higher radix dividers (radices of 16, 64 and
256). [Wilson and Ledley 1961] describe division with shifting over 0s
and 1s. A good discussion of these early methods is also given by
[Flores 1963]. [Waser and Flynn 1982] describe these as subtractive
algorithms because they are based on subtraction as the iterative
operator.
 The development of very fast combinational multipliers
[Wallace 1964] renewed interest in the older multiplication-intensive
methods based on Newton-Raphson or Taylor series approximations to the
reciprocal of the divisor. In most cases these methods provide
quadratic convergence to the final value, doubling the number of
accurate digits at each iteration. ([Knuth 1969] (p244) presents a
cubically convergent method which triples the accuracy at each step,
but it requires more, less-convenient, arithmetic and is overall no
faster than the simpler quadratic methods.) The repeated
multiplications lead Waser and Flynn to describe these as
multiplicative algorithms.
 Other, very significant, developments are in some sense a melding
of the subtractive and multiplicative methods. The basic iteration is
still subtractive, but combinational multipliers are included as
components to form divisor multiples and, often, quotient digits. For
example, Byte Division, described by Waser and Flynn uses a ROM to
estimate the divisor reciprocal which is then combined with the
residue in a small combinational multiplier to estimate the quotient
digit. With Booth recoding of the quotient digit, a 4-input adder can
form all 256 divisor multiples. Some improved methods of digit
estimation are discussed by [Schwarz and Flynn 1993], and [Wong and
Flynn 1992], who achieve speeds of at least 12 or 14 bits per
iteration, and up to 53 bits per iteration. Following work by [Svoboda
1963], [Ercegovac, Lang and Montuschi 1993] have recently described a
method based on prescaling the divisor and dividend which allows the
development of 12 or more quotient bits per cycle. They again include
multipliers as basic components. Thus while these are still
essentially subtractive dividers, they use combinational multipliers
to accomplish very high radix operation.
The work of this paper is in many ways a return to the traditional
subtractive methods, but emphasises the combination of high-radix
division (to minimise the time for a division) with relatively simple
logic for quotient digit estimation (to minimise complexity and cost)
and a minimal set of divisor multiples (again for complexity and
cost). A unique feature is the trade-off between performance and
complexity; without changing the basic design it is possible to reduce
the available divisor multiples or the accuracy of the quotient digit
estimation (or both) at the cost of only slightly slower division
operation.
2 The Basic Principles of "Subtractive Division"
All of the subtractive methods depend on the relation below, as stated
by Atkins 1961

$$p_{j+1} = r\,p_j - q_{j+1}\,d$$

where $p_j$ = the residue used in the jth cycle,
$p_0$ = the initial dividend,
$p_m$ = the remainder, and
$q_j$ = the jth quotient digit
We also have that $r$ = the radix, e.g. 2, 4, 8, 16, ...,
$d$ = the divisor, and
$m$ = the number of radix-r digits in the quotient
Verbally, we can note that we subtract a multiple ($q_{j+1}$) of the divisor
from the residue and enter the same as the corresponding quotient
digit. By convention the residue is kept bounded in magnitude relative
to the divisor, this ensuring that a properly chosen quotient digit
will eliminate a digit of the residue and leave the new residue in the
range which allows the iteration to proceed. The crux of most division
methods lies in generating the correct value of the quotient digit so
that the residue is properly reduced and the generated digit is
accurate. For high radices this may require considerable logic; so
much so that Waser and Flynn consider that SRT algorithms are
unsuitable for any radix greater than 4.

[Knuth 1969] (p235) shows that for any radix r, estimating the
quotient digit by taking the two most-significant residue digits
divided by one most-significant divisor digit will give an error of at
most 2 in the estimate; he includes a refinement which ensures that
the digit is usually exact, may be in error by 1, and never has an
error of 2. The refinement is, however, relatively expensive to
implement and he does not discuss the necessary hardware. (It does
however have interesting connections with the more recent techniques
for producing very accurate quotient estimates.)

[Atkins 1970] presents an extensive analysis of SRT division and its
extension to higher radices. He discusses redundancy of the quotient
representation (for example, 3 may be represented as either 2+1 or
4-1) and states that "With redundancy, the quotient digit need not be
precise". Then from a detailed analysis of digit-estimation logic, he
derives lower bounds, as functions of the radix r, on the number of
residue bits and divisor bits which must be examined. For some typical
radices, we find that the number of operand bits to be examined is as
shown in [Tab. 1].

Radix    Residue bits    Divisor bits
4        5               7
8        6               8
16       7               9
64       9               11
256      11              13

Table 1. Atkins' estimates of bits to be examined

These results show that quotient estimation for high radix division is
indeed a difficult process; even for radix 16 it is a function of at
least 16 inputs. Atkins also discusses the problems of converting the
quotient from the redundant code in which it is generated into the
external binary form and states that a full-length quotient subtracter
may be necessary. (This matter will be considered later.)

A recent paper [Montuschi and Ciminiera 1994] considers the use of
over-redundant digits, where the quotient digits may equal or exceed
the division radix. In the present context their most important result
is that the quotient-digit estimation logic may be simplified by
allowing a wider range of quotient digits. Many of their comments (and
the over-redundant digits) are germane to the present work, but here
we also allow the quotient digits to be inaccurate.
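
As a hedged illustration of the subtractive recurrence described at the start of this section, the sketch below performs radix-2^k division with an exact digit at every step, which is the idealised case that practical estimation logic tries to approximate. The function name `exact_divide`, the constant `OPERAND_BITS` and the operand layout (divisor aligned against the upper half of a double-length residue) are assumptions made for this example and are not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

#define OPERAND_BITS 24   /* assumed operand precision */

/* One exact quotient digit per step: shift the residue by one digit,
   subtract (digit * divisor), and append the digit to the quotient. */
uint32_t exact_divide(uint64_t dividend, uint32_t divisor,
                      int bits_per_digit, uint32_t *remainder)
{
    assert(divisor != 0 && divisor < (1u << OPERAND_BITS));
    assert(bits_per_digit >= 1 && bits_per_digit <= 8);
    assert(OPERAND_BITS % bits_per_digit == 0);

    uint64_t aligned = (uint64_t)divisor << OPERAND_BITS;
    assert(dividend < aligned);               /* quotient fits in OPERAND_BITS */

    uint64_t residue  = dividend;             /* initial residue is the dividend */
    uint64_t quotient = 0;
    for (int j = 0; j < OPERAND_BITS / bits_per_digit; j++) {
        residue <<= bits_per_digit;           /* radix times the old residue */
        uint64_t digit = residue / aligned;   /* exact next quotient digit */
        residue -= digit * aligned;           /* new residue, below the divisor */
        quotient = (quotient << bits_per_digit) | digit;
    }
    *remainder = (uint32_t)(residue >> OPERAND_BITS);
    return (uint32_t)quotient;
}
```

The exact digit here comes from a full-width division, which is precisely the expensive step that hardware cannot afford; it stands in for the digit-selection logic only to make the recurrence concrete.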
3 The new approach
Underlying most of this paper is the observation that at any stage of
the division with divisor d the 'residue' p and 'partial quotient' q
represent a value $q\,d + p$. The value is unchanged if we put
$q \leftarrow q + k$ and $p \leftarrow p - k\,d$. In other words we can
add any quantity $k$ to the quotient, provided that we also subtract
$k\,d$ from the residue. A quotient digit can be formed as a result of
several operations with the same operand alignment; if the estimation logic
gives a poor estimate of the quotient digit, we can 'hold' that division
cycle and correct or refine the estimate until the residue is within
range for the next reduction.

The new method involves placing a small low-precision adder at the
low-order end of the quotient so that new digits are added into the
quotient rather than jammed in as is usual. We also allow unshifted
arithmetic to refine the quotient digit estimate, effectively holding
the division at a particular stage until its result is
satisfactory. We can use any multiple which we like, or any convenient
combination of multiples, in constructing each quotient digit. In
particular we do not insist that the generated quotient digit will
immediately reduce the residue to the correct range, but are prepared
to accept a poor estimate and then repair the damage from that
estimate. There are two consequences:

- The bits entered into the quotient do not have to be exact. Small
  errors can be corrected by carry propagation within the quotient
  adder, provided that the carry is absorbed within the length of the
  adder. The divisor multiple need be accurate enough only to allow the
  residue to be driven toward zero at each step. The resulting changes
  to the quotient will be referred to as 'quotient adjustment'.
- If the chosen multiple estimate leaves the residue too far from
  zero, it is possible to hold the division at a step and subtract
  another divisor multiple without shifting. Thus if the logic estimated
  a multiple of 5 instead of the correct value of 6, a correction with a
  multiple of 1 will ensure the correct result. Carry propagation within
  the quotient adder will convert the initial estimate to the correct
  value. These will be referred to as 'correction cycles'.

Both aspects allow us to reduce the quality of quotient digit
estimation without affecting the accuracy of the final result.

The second aspect is especially interesting. While it is relatively
easy to provide logic which gives a good quotient estimate most of the
time, it is much more difficult (and expensive) to generate an
accurate value all of the time. (This complexity is evident from the
results of Atkins, especially when compared with the lookup table
sizes used here.) With a correction step available at any stage a bad
estimate need not affect the final answer; it just requires a little
longer to fix up. We can therefore trade off the complexity of the
estimation logic against the overall division speed.
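
The observation underlying this section can be checked numerically. The following minimal sketch applies a divisor multiple without shifting and shows that a poor estimate of 5 followed by a correction of 1 reaches the same state as the correct multiple of 6. The struct `DivState` and the helpers `apply_multiple` and `represented_value` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t quotient;   /* partial quotient */
    int64_t residue;    /* residue */
} DivState;

/* Add k to the quotient and subtract k * divisor from the residue,
   with no shift; the represented value is left unchanged. */
DivState apply_multiple(DivState s, int64_t k, int64_t divisor)
{
    assert(divisor > 0);            /* the paper assumes a positive divisor */
    s.quotient += k;
    s.residue  -= k * divisor;
    return s;
}

/* The value represented by a (quotient, residue) pair. */
int64_t represented_value(DivState s, int64_t divisor)
{
    return s.quotient * divisor + s.residue;
}
```

For example, with a divisor of 13 and a residue of 85, a multiple of 5 leaves a residue of 20, which is still larger than the divisor; a correction cycle with a multiple of 1 then gives quotient 6 and residue 7, exactly as a single multiple of 6 would have done.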
4 Limiting the quotient carry propagation.
An assumption of the present work is that the quotient has a short
adder; a full-length quotient adder requires a considerable increase
in logic complexity and should be avoided if possible. (This is
exactly the situation discussed by Atkins and mentioned above in
converting from redundantly-represented quotient digits.)

For a positive divisor, excessive quotient carry arises when the
residue becomes negative after a subtraction and remains negative for
a while thereafter. With conventional non-restoring division, the
negative value will force a 0 quotient bit to be entered (from an
unsuccessful subtraction). By comparison, the algorithm as described
enters the multiple which was used (and was too large) and relies on a
later negative digit to correct for the overdraw. If there is only a
slight overdraw, the residue stays close to zero for several steps
while zero quotient digits are generated and the carry must eventually
propagate through all of these zero digits.

To minimise the quotient carry propagation, we monitor the sign of the
result and, if it is negative, enter as a quotient digit the
(multiple - 1) and set a Qcarry flag; this is analogous to the action of
simple non-restoring division. Qcarry is shifted in parallel with the
quotient and added in on the next cycle. Thus we subtract 1 from the
quotient, but add it back on the next cycle. The operation is similar
during a correction cycle, except that the 1 is added directly into
the quotient, without any shift.

To illustrate, consider a radix-8 divider where the residue goes just
negative from a multiple of 6 and stays negative with no arithmetic (0
digit) for several cycles before being corrected with a -2 multiple.
Assuming that the simple algorithm generates the quotient digit
sequence { 6 0 0 -2 }, the generated quotient bits are
{ 110 000 000 } before the last digit and become the correct value
{ 101 111 111 110 } after the -2 multiple is added, but only after the
carry propagates through 8 bits of the previous quotient.

With the Qcarry flag, we recognise the overdraw and enter an initial 5
instead of 6; this digit is now correct. On the next two cycles the
generated digit of 0 is converted to -1 because of the negative
residue, but the shifted Qcarry corresponds to an addition of 8, so the
entered digit is 7 or bit pattern 111. We enter the correct quotient
digit at each stage, avoiding lengthy carry propagation.

We may note that the Qcarry is in fact redundant, being identical to
the residue sign. It is however convenient to regard it as a separate
entity connected with the quotient rather than the residue. The carry
from the main adder wraps round into the quotient adder.
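
The radix-8 example of this section can be replayed in a few lines. This is an illustrative sketch only: `enter_digits` and its arguments are hypothetical, and the sign of the residue after each step is supplied by the caller rather than computed from a real residue. The digit sequence used is { 6 0 0 -2 }, with the residue negative after each of the first three steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Accumulate signed quotient digits using the Qcarry rule.
   negative[i] is 1 if the residue is negative after step i.
   If trace is non-NULL, trace[i] receives the quotient register
   contents after step i. */
int64_t enter_digits(const int *digits, const int *negative, int n,
                     int bits_per_digit, int64_t *trace)
{
    assert(n > 0 && bits_per_digit > 0 && bits_per_digit <= 8);
    int64_t radix    = (int64_t)1 << bits_per_digit;
    int64_t quotient = 0;
    int     qcarry   = 0;
    for (int i = 0; i < n; i++) {
        /* add back the previous carry, shift, and enter the new digit */
        quotient = (quotient + qcarry) * radix + digits[i];
        qcarry   = negative[i];       /* Qcarry mirrors the residue sign */
        quotient -= qcarry;           /* enter (multiple - 1) if negative */
        if (trace != NULL)
            trace[i] = quotient;
    }
    return quotient + qcarry;         /* assimilate any pending carry */
}
```

With the digits above the register holds octal 5, 57, 577 and finally 5776, so each entered digit is already correct and no carry has to ripple back through earlier digits.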
5 The complete division algorithm.
The final algorithm is shown in [Fig. 1], written in C but with some
conditions in descriptive rather than explicit form. The function
estDigit produces a suitable quotient digit estimate by some means (in
the tests by a table lookup), including operations such as the
limitation on complex multiples described in [Section 11]. This
program, and indeed all of the work in the paper, assumes a positive
divisor.

while (dividing)
{
    Residue <<= BitsPerDigit;                  /* align residue */
    Qdigit = estDigit (Residue, Divisor);      /* estimate quot. digit */
    Residue -= Qdigit * Divisor;               /* adjust residue */
    Quotient =
        ((Quotient + Qcarry) << BitsPerDigit) + Qdigit;
    Qcarry = (Residue < 0);                    /* to stop long carries */
    Quotient -= Qcarry;                        /* and adjust quotient */
    while (Residue_out_of_range)               /* correction cycles */
    {
        Qdigit = estDigit (Residue, Divisor);  /* est. quot. digit */
        Residue -= Qdigit * Divisor;           /* adjust residue */
        Quotient += Qdigit + Qcarry;           /* adjust quotient */
        Qcarry = (Residue < 0);
        Quotient -= Qcarry;
    }
}                                              /* end main divide loop */
if (Qcarry > 0) Quotient++;                    /* assimilate quotient carry */
while (Residue < 0)                            /* correction if -ve residue */
    { Residue += Divisor; Quotient--; }

Figure 1. The basic division program

The first four lines of the main
loop are essentially a standard high-radix division and are followed
by two lines to control the quotient carry propagation. An inner loop
handles the case of the residue being not reduced correctly, using
code which is very similar to the main division code but without
operand shifts. Finally, after the main loop is complete, we must
assimilate any pending Qcarry and correct for a residue of the wrong
sign.

A point which is not stated is that digit estimation in the inner,
correction, loop must never give a zero digit because this loop must
always change the residue; the digit must be forced to +1 or -1
depending on the sign of the residue.
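
A compilable version of the program of [Fig. 1] might look like the sketch below. Several details are assumptions rather than the paper's design: the estimator `est_digit` divides truncated operands instead of using a lookup table, it takes a single `est_bits` parameter instead of separate residue and divisor widths, the residue is treated as out of range when it lies outside [-divisor, +divisor) at the current alignment, and the names `approx_divide`, `aligned` and `cycles` are introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define OPERAND_BITS 24   /* assumed operand precision */

/* Hypothetical estimator (NOT the paper's lookup table): divide the
   truncated residue magnitude by the leading est_bits divisor bits. */
static int64_t est_digit(int64_t residue, int64_t aligned, int est_bits,
                         int correcting)
{
    int     drop = 2 * OPERAND_BITS - est_bits;   /* low-order bits ignored */
    int64_t mag  = residue < 0 ? -residue : residue;
    int64_t q    = (mag >> drop) / (aligned >> drop);
    if (correcting && q == 0)
        q = 1;                /* a correction cycle must change the residue */
    return residue < 0 ? -q : q;
}

/* Divide a double-length dividend by a normalised divisor, following the
   structure of Fig. 1. *cycles receives the number of add/subtract cycles. */
uint32_t approx_divide(uint64_t dividend, uint32_t divisor, int bits_per_digit,
                       int est_bits, uint32_t *remainder, long *cycles)
{
    assert((divisor >> (OPERAND_BITS - 1)) == 1);      /* normalised divisor */
    assert(bits_per_digit >= 1 && bits_per_digit <= 8);
    assert(OPERAND_BITS % bits_per_digit == 0);
    assert(est_bits >= 3 && est_bits <= OPERAND_BITS);

    int64_t aligned = (int64_t)divisor << OPERAND_BITS;
    int64_t radix   = (int64_t)1 << bits_per_digit;
    assert(dividend < (uint64_t)aligned);   /* quotient fits in OPERAND_BITS */

    int64_t residue = (int64_t)dividend, quotient = 0;
    int     qcarry  = 0;
    *cycles = 0;

    for (int j = 0; j < OPERAND_BITS / bits_per_digit; j++) {
        residue *= radix;                              /* align residue */
        int64_t qdigit = est_digit(residue, aligned, est_bits, 0);
        residue -= qdigit * aligned;                   /* adjust residue */
        quotient = (quotient + qcarry) * radix + qdigit;
        qcarry = residue < 0;                          /* to stop long carries */
        quotient -= qcarry;
        ++*cycles;
        while (residue >= aligned || residue < -aligned) {  /* corrections */
            qdigit = est_digit(residue, aligned, est_bits, 1);
            residue -= qdigit * aligned;
            quotient += qdigit + qcarry;
            qcarry = residue < 0;
            quotient -= qcarry;
            ++*cycles;
        }
    }
    if (qcarry) quotient++;                 /* assimilate quotient carry */
    while (residue < 0) { residue += aligned; quotient--; }
    *remainder = (uint32_t)(residue >> OPERAND_BITS);
    return (uint32_t)quotient;
}
```

Multiplication by the radix is used instead of a left shift because shifting a negative signed value is undefined in C. The sketch maintains (quotient + qcarry) * divisor + residue as an invariant through every main and correction cycle, so a coarse estimator changes only the cycle count and not the result.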
6 The Hardware
The basic divider hardware is shown in [Fig. 2]. The differences from
conventional division hardware are in the presence of the quotient
adder and in the ability to operate in an unshifted mode during
division. The quotient is shown with two paths from the quotient
register, one shifted and one unshifted; the same applies to the
residue register and main adder. Two shifts are wired into the residue
and quotient logic; a shift of 3, 4, etc. bits which determines the
nominal radix of the operation and a zero shift which is used during
correction cycles.

Figure 2. Divider Hardware

In many cases the divisor multiple logic is limited to a 2-input
adder/subtracter with shifters at each input up to the width of a
quotient digit.
7 Simple, radix-8, division
The proposed algorithm was simulated by program using 32-bit integers
and 64-bit long integers to provide a basic operand precision of 24
bits (48-bit dividend). In all cases the high-order bits of the
divisor and residue are used to index a precomputed table which
yields the estimated multiple. The divisor is assumed normalised with
its most-significant bit always 1. The initial tests are with radix-8
division (3 bits per cycle). The test cases were:

- A 7x7 table (7 residue bits and 7 divisor bits), which is similar in
  size to what Atkins predicts is needed for radix-8 division.
- Three smaller sizes (6x6, 5x5, 4x4), the larger two of which are
  'nearly good enough' for conventional division. The last is intended as
  a test of an economical estimation table.
- A table which examines only 3 residue bits and 3 divisor bits (just
  two significant divisor bits). This was tested as a minimal table
  which is easily implemented in combinational logic.

All cases were tested by a sequence of 100,000 divisions (the same
sequence in all cases) counting the total add/subtract operations or
cycles, the number of times that an earlier quotient was adjusted, and
the number of additional correction cycles needed. With 8 octal digits
to be developed for each 24-bit test operand, there are 8 'basic
operations' for each test case, or 800,000 operations within each test.

The results are shown in [Tab. 2]. Each column heading shows first the
radix and then the residue and divisor bits used to estimate the
quotient digit for that test. The same heading convention will be
followed for all of the results tables.
                          8:7x7     8:6x6     8:5x5     8:4x4     8:3x3
basic operations          800,000   800,000   800,000   800,000   800,000
quot. adjustments         0         0         0         1,292     9,412
adjustments (%)           0         0         0         0.158     1.085
correction cycles         0         19        699       16,395    66,862
corrections (%)           0         0.002     0.087     2.049     8.357
bits per cycle            3.000     3.000     2.997     2.940     2.769
performance               1.000     1.000     0.999     0.980     0.923
Quotient carry distance   0         1         2         6         6

Table 2. Radix-8 division, with varying lookup table sizes.

Small errors in the digit estimation show up in the 'quotient
adjustments' which refine the prior quotient, but do not affect the
residue or the speed. Larger errors manifest themselves as 'correction
cycles' which modify the existing quotient and residue, and do slow the
operation. In this table, the 'quotient adjustments' count only the
adjustments which affect more than the least-significant quotient
digit; we may expect every correction to alter this digit but count
only those which spill into more significant digits. In no case does
the quotient carry propagate over more than 2 digits (6 bits).

Whereas the largest, 7x7, table is able to predict a correct digit
every time, the 6x6 table is inadequate by normal standards because it
gives a few estimates which require correction. Even so it delivers a
performance almost identical to the larger table (actually 2.99993
bits per cycle). The 5x5 and 4x4 tables are even less acceptable by
normal criteria, with error rates of 0.1% and 2%, but here they still
give performance within 0.1% and 2% of the optimal 3 bits per
cycle. Even the minimal 3x3 table still yields the correct quotient
digit 92% of the time and needs a correction cycle on only 8% of the
steps, yielding nearly 2.77 bits per cycle.

Comparative results for different radices are given later in [Fig. 3]
(showing bits per cycle) and in [Fig. 4] (showing relative
performance). These figures include all of the significant and useful
cases to be discussed and should be consulted for quick comparisons.
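
The derived rows of the results tables appear to be consistent with the simple arithmetic sketched below. This is an inference from the published figures rather than a formula stated in the paper, and the function names `bits_per_cycle` and `relative_performance` are hypothetical.

```c
#include <assert.h>

/* Average quotient bits per cycle: total quotient bits developed,
   divided by total cycles (basic operations plus correction cycles). */
double bits_per_cycle(int operand_bits, long divisions,
                      long basic_ops, long correction_cycles)
{
    assert(basic_ops + correction_cycles > 0);
    return (double)operand_bits * (double)divisions
         / (double)(basic_ops + correction_cycles);
}

/* Performance relative to the nominal number of bits per digit. */
double relative_performance(double bits, int bits_per_digit)
{
    assert(bits_per_digit > 0);
    return bits / (double)bits_per_digit;
}
```

For the radix-8 4x4 case, 100,000 divisions of 24 bits take 800,000 + 16,395 cycles, which gives about 2.940 bits per cycle and a relative performance of about 0.980 against the nominal 3 bits.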
8 Effect of table aspect ratio
Although Knuth implies that it is better to examine more residue
digits than divisor digits, Atkins shows in his analysis that it is
desirable to examine about the same number of bits from each value, or
perhaps a few more bits from the divisor. We next examine the effect
of trading off residue bits against divisor bits, in all cases keeping
constant the total number of tested bits. The results of [Tab. 3] are
for the '5x5' table of the previous section, but similar results were
obtained for other configurations.

For the present situation, where we are concerned only with obtaining
a good estimate rather than the accurate value, it seems best to
consider about the same number of bits from the two operands. Where
the total number of bits is odd, the extra bit should be allocated to
the residue. Results for the rest of the paper will mostly assume a
'square' table, without further justification.
                          8:6x4     8:5x5     8:4x6
basic operations          800,000   800,000   800,000
quot. adjustments         132       0         0
adjustments (%)           0.016     0         0
correction cycles         6,844     699       5,486
corrections (%)           0.856     0.087     0.685
bits per cycle            2.975     2.997     2.980
performance               0.992     0.999     0.993
Quotient carry distance   6         2         3

Table 3. Radix-8 division, varying table aspect ratio.
9 Radix-16 division (4 bits per cycle)
We repeat the above work for radix-16 division. Again, it is not
difficult to get close to the ideal performance of 4 quotient bits per
cycle. We now show four lookup tables, examining 7, 6, 5 and 4 bits
of the residue and divisor. Atkins requires a 7x9 table for radix 16,
which is larger than even the largest of the tables used here. The
change in radix reduces the number of 'basic operations' to 600,000 (6
digits for a 24-bit operand) compared with 800,000 for radix-8
operation (8 digits for 24 bits).

Once again the larger tables give very nearly the maximum performance,
with the 5x5 table within 1.5% of the ideal and even the 4x4 table
(which considers only a single digit of each operand) only 8% off the
possible speed. Quotient adjustments are again needed for the smaller
tables, but even the smallest table never modifies more than 8
quotient bits (the current digit and its predecessor).
                          16:7x7    16:6x6    16:5x5    16:4x4
basic operations          600,000   600,000   600,000   600,000
quot. adjustments         0         0         250       2,854
adjustments (%)           0         0         0.041     0.442
correction cycles         57        615       8,489     46,507
corrections (%)           0.009     0.102     1.415     7.751
bits per cycle            4.000     3.996     3.944     3.712
performance               1.000     0.999     0.986     0.928
Quotient carry distance   1         3         7         8

Table 4. Radix-16 division, with varying lookup table sizes.
10 Radix-64 and radix-256 dividers
For division with higher radices, we initially assume that all
multiples are available and observe the effect of only the reduced
digit-estimation logic. Actually not much extra hardware is needed to
handle radix-64 and even radix-256 division; we certainly do not need
an adder input for each possible power of two. By using Booth recoding
of the quotient digit we can handle radix-64 with a 3-input adder for
the divisor multiples and radix-256 with a 4-input adder. Later we
reduce the range of multiples to allow simplified divisor-multiple
generation logic. We retain the earlier sizes of lookup table as
covering a reasonable range of practical sizes.

Although we assume the full range of divisor multiples for radix-64,
we retain estimation logic which is quite small compared with Atkins'
predictions (9x11 for radix 64). Particularly in the first two cases,
7x7 and 6x6 tables, we see from [Tab. 5] that the new algorithm
largely absorbs any deficiencies of the digit estimation. The
performance deteriorates markedly for the 5x5 and 4x4 tables, to the
point where these are probably not worth considering, in comparison
with the restricted cases later.
                          64:7x7    64:6x6    64:5x5    64:4x4
basic operations          400,000   400,000   400,000   400,000
quot. adjustments         6         490       1,746     0
adjustments (%)           0         0.114     0.330     0
correction cycles         3,492     29,744    128,667   382,634
corrections (%)           0.873     7.436     32.167    95.659
bits per cycle            5.948     5.585     4.540     3.067
performance               0.991     0.931     0.757     0.511
Quotient carry distance   9         12        11        6

Table 5. Radix-64 division, full multiples.

Repeating the exercise for radix-256, we obtain the results of
[Tab. 6]. The relatively poor quality of the digit estimates is even
more noticeable here, but even so by examining just a single digit (8
bits) of the residue and divisor we achieve nearly 7.5 bits per
cycle.
                          256:8x8   256:8x7   256:7x7   256:6x6   256:5x5
basic operations          300,000   300,000   300,000   300,000   300,000
quot. adjustments         64        558       276       0         0
adjustments (%)           0.019     0.157     0.071     0         0
correction cycles         20,247    56,402    90,815    288,049   882,365
corrections (%)           6.749     18.80     30.27     96.02     294.1
bits per cycle            7.494     6.734     6.141     4.081     2.030
performance               0.937     0.842     0.768     0.510     0.254
Quotient carry distance   13        16        14        8         8

Table 6. Radix-256 division, full multiples.

By comparing this table with the previous one, we see that the
performance is largely determined by the unexamined bits of the
most-significant digit. Thus examining a complete digit (6 or 8 bits
respectively for radix-64 and radix-256) gives about 93% of the ideal
performance, one bit less (5 or 7 bits) gives 75% and 2 bits less
51%. Nevertheless, it is interesting that reasonable performance is
still possible if the estimation logic examines only part of the
most-significant digits of the residue and divisor. As a more general
observation, the algorithm is robust with respect to changes in the
digit prediction logic. A poor prediction does not impair the final
result, but may delay achieving that result.

Practically though, it is clear that operation with a radix of 256 is
not really satisfactory, at least with the size of lookup table which
is used. With a 7x7 table, the performance is very little better than
that of a radix-64 divider (6.14 bits, compared with 5.95). The
benefit of the higher radix barely offsets the penalty of examining
partial digits.
11 Division with few multiples available.
Divisor multiples can be divided into 3 categories:

- 'Shifted values', available by just shifting left the raw value (1, 2,
  4, 8, ...),
- 'Simple multiples', being the sum or difference of pairs of
  shifted values (3, 5, 6, 7, 9, ...),
- 'Complex multiples', which require
  the combination of 3 or more shifted values (11, 13, ...).

In this section
we examine the performance if the only available multiples are the
'shifted values' and the 'simple multiples'. We assume that a 'large' shift
is available (e.g. 6 places for radix-64). The initial operation on the
shifted residue will be followed in many cases by 'corrections' on the
unshifted residue as we simulate the more difficult multiples or
quotient digits.

                          16:7x7    16:6x6    16:5x5    16:4x4
basic operations          600,000   600,000   600,000   600,000
quot. adjustments         0         0         174       4,124
adjustments (%)           0         0         0.027     0.604
correction cycles         22,151    22,725    34,207    82,868
corrections (%)           3.692     3.788     5.701     13.811
bits per cycle            3.858     3.854     3.784     3.515
performance               0.964     0.964     0.946     0.879
Quotient carry distance   2         2         5         8

Table 7. Radix-16 division, with limitation to 'simple' multiples.

Initially we examine radix-16, even though a 2-input adder is adequate
to form all of the multiples with Booth recoding of the quotient
digits. The estimation tables do not use Booth recoding but are just
recoded versions of the previous ones with, for example, 11 being
rounded to 10 or 12.

As expected, there is some performance degradation as compared with
the previous case where all multiples were assumed to be
available. However even the simplest case still delivers over 3.5 bits
per cycle. The speed is better than we would expect from noting that
1/8 of the multiples are unavailable and must be simulated; about half
of these cases are just absorbed into the general operation and do not
require explicit correction cycles.

For higher radices it is especially useful to avoid a complete suite
of divisor multiples. We still restrict ourselves to simple multiples
of the form $2^i \pm 2^j$.
                          64:7x7    64:6x6    64:5x5    64:4x4
basic operations          400,000   400,000   400,000   400,000
quot. adjustments         10        736       1,482     0
adjustments (%)           0.002     0.143     0.250     0
correction cycles         91,476    116,488   193,466   412,402
corrections (%)           22.87     29.12     48.37     103.1
bits per cycle            4.883     4.647     4.044     2.954
performance               0.814     0.775     0.674     0.492
Quotient carry distance   6         12        11        6
single corrections        91,448    111,560   140,974   147,143
double corrections        14        2,464     26,246    85,618
> double corrections      0         0         0         30,116

Table 8. Radix-64 division, with limitation to simple multiples.
                          256:8x8   256:7x7   256:6x6
basic operations          300,000   300,000   300,000
quot. adjustments         74        128       0
adjustments (%)           0.015     0.023     0
correction cycles         187,415   245,222   399,179
corrections (%)           62.47     81.74     133.1
bits per cycle            4.924     4.402     3.433
performance               0.616     0.550     0.429
Quotient carry distance   9         9         8
single corrections        173,975   152,214   112,568
double corrections        6,810     46,360    85,170
> double corrections      0         96        36,544

Table 9. Radix-256 division, with limitation to simple multiples.

The results in [Tab. 8] and [Tab. 9] are extended to show details of
the corrections. Radix-64 division is quite successful, generating an
average of nearly 5 bits per cycle with either of the two larger
tables. However the smallest table (4x4) is actually inferior to
radix-16 with the same table size. The radix-256 results are inferior
to the radix-64 results, showing the effect of having relatively fewer
multiples available and having to rely much more on correction cycles.

With radix-16, restricted multiples cover 7/8 or 87.5% of the total
range of multiples. With radix-64 only 33 multiples are available
(51.6% coverage), but only a single correction cycle is ever needed in
most cases. Radix-256 uses 58 multiples (22.6%) and the sparse
coverage requires many more correction cycles even with the larger
tables. (Both radix-64 and radix-256 actually use occasional multiples
of 66, 68, 258 and 260, which are available without penalty or extra
hardware given that 64 and 256 are provided.) While radix-64 operation
is acceptable, there is clearly no benefit in using this method for a
radix-256 divider.
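
The three categories of divisor multiple used in this section can be made concrete with a small classifier. The sketch below is illustrative only: the names `MultipleClass`, `classify_multiple` and `count_available` are hypothetical, and counting the multiples 1 to radix inclusive is an assumed convention.

```c
#include <assert.h>

typedef enum { SHIFTED, SIMPLE, COMPLEX } MultipleClass;

/* Classify a positive divisor multiple m:
   SHIFTED if m is a power of two,
   SIMPLE  if m is the sum or difference of two powers of two,
   COMPLEX otherwise (3 or more shifted values are needed). */
MultipleClass classify_multiple(unsigned m)
{
    assert(m > 0 && m <= 1024u);
    for (unsigned a = 1; a <= 2048u; a <<= 1) {
        if (a == m)
            return SHIFTED;
    }
    for (unsigned a = 2; a <= 2048u; a <<= 1) {
        for (unsigned b = 1; b < a; b <<= 1) {
            if (a + b == m || a - b == m)
                return SIMPLE;
        }
    }
    return COMPLEX;
}

/* Count the multiples 1..radix which need at most a 2-input adder. */
int count_available(unsigned radix)
{
    assert(radix >= 2 && radix <= 1024u);
    int count = 0;
    for (unsigned m = 1; m <= radix; m++) {
        if (classify_multiple(m) != COMPLEX)
            count++;
    }
    return count;
}
```

Under this convention the radix-16 count is 14 of 16, in agreement with the 7/8 coverage quoted above, with 11 and 13 as the only complex multiples. Totals for other radices depend on whether zero and the radix itself are included in the count.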
12 The effect of using only shifted values as multiples
As an extreme restriction on the available multiples, we can restrict
them to just powers of 2 (those which are available by shifting). We
retain the original style of lookup tables, even though they are
clearly inappropriate in this case; we should be able to achieve
comparable performance with much simpler quotient estimation.
                          64:7x7    64:6x6    64:5x5    256:7x7
basic operations          400,000   400,000   400,000   300,000
quot. adjustments         0         0         0         0
adjustments (%)           0         0         0         0
corrections               407,228   428,667   484,162   554,306
corrections (%)           101.8     107.2     121.0     184.8
bits per cycle            2.973     2.896     2.714     2.809
performance               0.496     0.483     0.452     0.351
Quotient carry distance   5         5         5         9

Table 10. Radix-64 and radix-256 division, limited to power-of-2
multiples.

The division should tend to be equivalent to a variable shift length
algorithm, with multiples of ±1, shifting over strings of 0s and
1s. Variable shift algorithms are known to have an asymptotic
performance of about 3 bits per cycle (2.5 to 3.5, see [Flores 1963]),
a performance which is confirmed here in [Tab. 10]. As for the
previous high radix operations, the performance is better for a radix
of 64 than for 256.
13 Graphical summary of results
In [Fig. 3] and [Fig. 4] we show the performance results for the
various radices and sizes of lookup table. Figure 3 shows the average
quotient bits delivered per cycle, while Figure 4 shows the
performance for each radix, with 100% being N bits/cycle for a radix
of $2^N$. Results are shown for only the more realistic cases of full
multiples and restricted multiples. Results for corrections and
quotient adjustments are not presented; they are essentially internal
or intermediary phenomena whose consequences are apparent in the final
performance.

Figure 3. Average bits/cycle, for different radices and table sizes

Figure 4. Relative performance, for different radices and table sizes

The graphs show quite clearly that a table should work with about 1
digit for each input; a 6x6 table for example gives negligible benefit
on a radix-16 divider, but with radix-256 gives fewer bits per cycle
than with radix-64. This will be mentioned again later, in connection
with some other recent work.
14 Generation of the quotient digits
The present paper has generated quotient digits only from lookup
tables addressed by the high-order digits of the residue and
divisor. This technique might not be the best one and in that regard
the current work should be regarded more as a feasibility study of the
new technique.
The usual problem in division is that the critical path consists of
the generation of the quotient digit, then the generation of the
corresponding divisor multiple and finally the addition/subtraction,
in other words all of the difficult operations! [Montuschi and
Ciminiera 1994] point out indeed that many methods of 'improving'
division really do little more than move the dominant delay around the
critical path! Nevertheless there are several ways of accelerating the
digit estimation.
 The 'divisor prescaling' methods of [Ercegovac, Lang and Montuschi
1993] and [Svoboda 1963] trivialise the generation of the quotient
digit, but at the cost of preliminary arithmetic to get the divisor
into a suitable form. With a normalised divisor, they subtract
appropriately shifted multiples of the divisor from itself to produce
a divisor of the form , where . At the same time they apply similar
transformations to the dividend; effectively they multiply the divisor
and dividend by the same factor, leaving the quotient unchanged. With
the divisor of that form, the high-order digits of the residue are
precisely the desired quotient digit. Without recoding the divisor
digits we may expect iterations to reduce the divisor and the same
number of parallel operations to transform the dividend, giving a
total of preliminary operations. The cost of these operations would
have to be offset against the reduction in the cycle time from the
improvement in the time to generate quotient digits. A problem is that
it does not yield a remainder.
 [Wong and Flynn 1992] present quotient digit logic which operates
in parallel with the other operations and allows much more overlap and
correspondingly higher speed. Their logic does however assume the
correctness of the previous quotient digit, using that digit as part
of the input for the current estimate. The complexities of handling an
approximate previous digit may make their full method unwieldy in this
case. However, the essence of the method, which derives the actual
logical functions of the quotient bits as explicit functions of the
other bits, may still be appropriate.
 The quotient digit may be estimated from the difference of the
logarithms of the divisor and residue (or their high-order
bits). Although this will still require lookup tables or equivalent
logic to produce the logarithms, the table sizes are now , rather
than. Conversion from the difference back to the actual digit
estimate will require logic of similar scale to produce the
antilog. Depending upon the logic family, the complex of 3 small
tables (or equivalent) will certainly be simpler and may be faster
than the single large lookup table. Note that we do not require wholly
accurate digit estimates.
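As a minimal sketch of the logarithmic approach, the C fragment below forms fixed-point base-2 logarithms of the two operands, subtracts them, and recovers a digit from the difference. Several things here are assumptions made for illustration and are not taken from the paper: the names log2_fixed and estimate_digit, the treatment of the operands as small positive integers on a common scale, the computation of the logarithm by repeated squaring, and the recovery of the digit by a search that stands in for the antilog table. In hardware all three steps would be small tables or equivalent logic.

```c
#include <assert.h>

/* Fixed-point base-2 logarithm of v >= 1 with `frac` fractional bits,
   truncated. Computed by repeated squaring so that no maths library is
   needed; it stands in for a small logarithm table. */
int log2_fixed(unsigned v, int frac)
{
    assert(v >= 1 && frac >= 0);
    double m = (double)v;
    int result = 0;
    while (m >= 2.0) {                 /* integral part: reduce m to [1, 2) */
        m /= 2.0;
        result++;
    }
    for (int i = 0; i < frac; i++) {   /* one fractional bit per squaring */
        m *= m;
        result <<= 1;
        if (m >= 2.0) {
            m /= 2.0;
            result |= 1;
        }
    }
    return result;                     /* log2(v) scaled by 2^frac, truncated */
}

/* Estimate the digit radix * res / dvs from the difference of logarithms.
   res and dvs are the high-order bits of the residue and divisor, taken as
   integers on a common scale (an assumption of this sketch). */
int estimate_digit(unsigned res, unsigned dvs, int digit_bits, int frac)
{
    assert(dvs >= 1);
    if (res == 0)
        return 0;
    int diff = (digit_bits << frac)
             + log2_fixed(res, frac) - log2_fixed(dvs, frac);
    int q = 0;
    /* antilog by search: the largest q <= radix whose logarithm is <= diff */
    while (q < (1 << digit_bits) && log2_fixed((unsigned)q + 1, frac) <= diff)
        q++;
    return q;
}
```

With radix 16 and four fractional bits, estimate_digit(5, 7, 4, 4) returns 12 where the exact value is 16 x 5/7, about 11.4. An error of this size is of the kind that the division algorithm is intended to tolerate.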
Operation with logarithms has been tested and verified, with similar
results to the direct table lookup. The logarithm, if in base 2, needs
as many fractional bits as there are bits in the radix, and $\log_2$ of
that number of integral bits. Thus radix-16 requires a logarithm of the
form xx.xxxx, and radix-256 of the form xxx.xxxxxxxx. The detailed
results are slightly
different from those for the direct lookup tables, even though the
formulae are nominally equivalent. In some cases there is a slight
degradation in performance but many cases improve by 1 to 2%. This
demonstrates both the robustness of the algorithm with respect to
slight changes in the digit estimates and the possibility of improving
the performance in particular cases by fine tuning the digit
estimation.
 An interesting possibility, which has not been investigated, might
be to use hybrid analogue/digital techniques in the quotient digit
estimation. Similar methods have been used in a recent adder design
[Kawahito et al 1994]. The essential point is that the algorithms
described here can accommodate poor quotient digit estimates without
undue penalty. If hybrid techniques can achieve useful simplification
while delivering reasonably good estimates, they could be worth
consideration.
It is not easy to predict which is the best method. Most of the
table-lookup methods insert extra delay into the division cycle and
slow down every cycle; they may be better if there is spare time in
each cycle. The prescaling methods however allow faster cycles within
the division proper, but require additional steps for the preliminary
adjustments and may be better where the system clock matches the
critical path delay within the division logic.
15 Other recent work
A very recent paper (published since this paper was submitted) presents
some very similar techniques [Cortadella and Lang, 1994]. A comparison
of the two methods is conveniently presented as a series of points
(with references to 'this paper' and 'their paper' having the obvious
meanings).
 Even though the techniques are similar, the underlying
philosophies are different. This paper works in terms of an
approximation to the exact quotient digit, with possible refinements
to that estimate. Their paper emphasises the speculative nature of the
digit estimation, with options such as withdrawing the estimation
completely or accepting only some of the estimated bits.
 This paper always develops a fixed number of bits, perhaps taking
several cycles to achieve an accurate digit. Their paper however
allows a 'partial advance' to develop fewer bits on some cycles,
accepting only as many quotient bits as are known to be accurate. Thus
both approaches may develop fewer than the nominal bits per cycle, but
in different ways. (Their 'basic scheme', without partial advance, is
similar to the ideas of this paper.)
 This paper places considerable emphasis on developing the quotient
by adding in each new digit. Although perhaps not really apparent from
this paper, this was the original idea which led to the work, but it had
to be supplemented by allowing some digits to require several
cycles. Their paper acknowledges a quotient adjustment by 1, but does
not explore the consequences of possible quotient carry propagation.
 Their paper goes to some trouble to develop a good 'speculation
function' equivalent to the 'estimation tables' used in this paper. This
paper (while assuming lookup tables for quotient-digit estimation)
provides a much more comprehensive treatment of the effects of varying
the table size and therefore the accuracy of the quotient digit.
 One of their optimisation options is to reduce the number of
outputs from the speculation function. This parallels exactly the use
of the 'simple multiples' in this paper.
 Their paper has much more detail concerning the hardware
consequences, costs and physical aspects of their design.
 Their results on there being an optimal complexity of the
digit-estimation logic are supported by the results here which show,
for example, a 6x6 table giving its best performance with radix-64
division.
 Both papers rely very heavily on computer simulations of
algorithms which are largely unpredictable in internal details, even
though the final result is very well determined.
The two papers present different approaches to the same problem, both
illustrating that high-radix division does not require exact
high-precision quotient-digit estimates.
We have shown that it is possible to obtain satisfactory high-radix
division, even when the division process is subject to one of two
significant restrictions:
 Limited precision digit estimation. By allowing the quotient logic
to assimilate corrections from later digits, and by allowing an
occasional 'hold' of the division process, it is possible to estimate
quotient digits to quite low precision. The corrections are needed in
relatively few cases and lead to quite small performance degradation
except for extremely simple estimation logic. For example, a divider
which examines just a single digit of the residue and divisor (i.e. 3
or 4 bits of each) can average over 2.75 bits per cycle for radix-8
division and 3.7 bits per cycle for radix-16 division.
 Limited divisor multiples. High-radix division is expensive in its
logic to estimate quotient digits and its logic to form a complete
suite of divisor multiples. We have demonstrated that a radix-64
divider (generating a nominal 6 bits per cycle) can generate an
average of more than 4.5 bits per cycle using only those multiples
which can be formed with a two-input adder-subtracter and considering
only a single 6-bit digit of each of the residue and divisor.
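The first restriction above can be illustrated with a much simplified C sketch. It is not the divider evaluated in this paper: the names, the record type and the estimator are all hypothetical, and the estimator is arranged so that it never overestimates, so that negative corrections and quotient carries do not arise. It does show the two mechanisms mentioned, namely a low-precision estimate taken from the high-order bits only, and a 'hold' cycle in which a correction is added into the quotient digit already developed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result record for this sketch. */
typedef struct {
    uint64_t quotient;     /* floor(n * 2^(k*digits) / d) */
    uint64_t remainder;    /* final residue, 0 <= remainder < d */
    unsigned basic;        /* digit-producing cycles */
    unsigned corrections;  /* extra 'hold' cycles */
} DivResult;

/* Crude estimate of r / d from the top `keep` bits of a `width`-bit divisor.
   The residue is truncated and the divisor rounded up, so the estimate
   never exceeds the true value (a simplification made for this sketch). */
uint64_t crude_estimate(uint64_t r, uint64_t d, unsigned width, unsigned keep)
{
    unsigned drop = width - keep;
    return (r >> drop) / ((d >> drop) + 1);
}

/* Divide n by a normalised `width`-bit divisor d (n < d), developing
   `digits` digits of k bits each. Keep width + k and k * digits below 64. */
DivResult approx_divide(uint64_t n, uint64_t d, unsigned width,
                        unsigned k, unsigned digits, unsigned keep)
{
    assert(keep >= 1 && keep <= width);
    assert(d >> (width - 1) == 1);      /* most significant divisor bit is 1 */
    assert(n < d);
    DivResult res = {0, n, 0, 0};
    for (unsigned i = 0; i < digits; i++) {
        res.remainder <<= k;            /* advance by one digit position */
        res.quotient <<= k;
        uint64_t q = crude_estimate(res.remainder, d, width, keep);
        res.quotient += q;              /* add in the new digit */
        res.remainder -= q * d;
        res.basic++;
        while (res.remainder >= d) {    /* hold: refine the same digit */
            q = crude_estimate(res.remainder, d, width, keep);
            if (q == 0)
                q = 1;                  /* always make progress */
            res.quotient += q;
            res.remainder -= q * d;
            res.corrections++;
        }
    }
    return res;
}
```

For example, dividing 100 by 192 with an 8-bit divisor, three radix-16 digits and a 4-bit estimate gives the exact quotient 2133 and remainder 64, at the cost of one correction cycle for each of the three digits.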
The tests indicate that the new techniques are appropriate for radices
up to 64, but less satisfactory with a radix of 256, largely because
of restrictions on the digit-estimation logic and the number of
divisor multiples which are available.
If we have a multiplier which is designed for, say, radix-64
multiplication, its hardware can be used as the basis of a divider
with nominal radix-64 operation. This paper shows that, with
relatively simple digit estimation logic, it is easy to achieve, if
not the full 6 bits per cycle, then certainly 4 or 5 bits.
The lookup tables are generated as a simple function of the high-order
residue and divisor bits, assumed to index the rows and columns
respectively of the table. The process is parametrised to facilitate
the generation of tables of differing sizes and shapes, and for
differing radices. The tables are generated for positive residue and
divisor (with the most-significant divisor bit always 1) and the
negative half then generated as the complement of the positive
half. There is evidence that minor improvements might be possible by
fine-tuning some table entries in particular cases but that has not
been attempted. The basic parameters are
 ResBits    the number of residue bits to examine (excluding the sign)
 DvsBits    the number of divisor bits to examine, including the
            normalised bit
 Radix      the radix of the quotient digit
From these are derived several other values
 DigitBits  (Radix = 1 << DigitBits)          the bits to represent a digit
 maxRow     (1 << ResBits)                    the maximum row of the table
 maxCol     (1 << DvsBits)                    the maximum column of the table
 minCol     (maxCol / 2)                      the minimum column of the table
 beta       (2 << DigitBits)*minCol/maxRow    a scaling factor
Logically, the table rows extend from -maxRow to maxRow - 1, and the
columns from minCol to maxCol - 1. We calculate also a scaling factor,
beta, chosen to give an intermediate result of twice the true estimate.
A mapping table then rounds the intermediate value to the correct
estimate, or rounds with suppression of complex multiples.
The essential loops for producing the table are then
    /* positive half only; the negative half is its complement */
    for (col = minCol; col < maxCol; col++)
        for (row = 0; row < maxRow; row++)
            Table[row + maxRow][col - minCol] = (beta * row) / col;
For operation with less than the full set of divisor multiples the
table values are rounded to the nearest available multiple.
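The table generation just described might be sketched in C as follows, for a single cell of the positive half. The function names are illustrative. Three choices here are assumptions rather than details given above: the doubled estimate is evaluated as one fraction so that beta is never truncated, the mapping step rounds to the nearest digit by adding one and halving, and ties between two available multiples go to the smaller.

```c
#include <assert.h>
#include <stdlib.h>

/* Twice the true digit estimate for one cell of the positive half.
   Equivalent to (beta * row) / col with
   beta = (2 << digitBits) * minCol / maxRow, but evaluated as a single
   fraction so that beta itself is never truncated (a choice made here). */
int twice_estimate(int row, int col, int resBits, int dvsBits, int digitBits)
{
    int maxRow = 1 << resBits;
    int maxCol = 1 << dvsBits;
    int minCol = maxCol / 2;
    assert(row >= 0 && row < maxRow);
    assert(col >= minCol && col < maxCol);
    long num = (long)(2 << digitBits) * minCol * row;
    long den = (long)maxRow * col;
    return (int)(num / den);
}

/* Mapping step: round the doubled value to the nearest digit
   (the exact rounding rule is an assumption). */
int round_half(int twice)
{
    return (twice + 1) / 2;
}

/* Restricted multiples: the nearest member of avail[0..n-1], which must be
   in ascending order with n >= 1. Ties go to the smaller multiple. */
int nearest_multiple(int digit, const int *avail, int n)
{
    int best = avail[0];
    for (int i = 1; i < n; i++)
        if (abs(digit - avail[i]) < abs(digit - best))
            best = avail[i];
    return best;
}
```

For a 4x4 table at radix 16, row 5 and column 12 give a doubled value of 6 and hence a digit of 3. With only the multiples 0, 1, 2, 4 and 8 available, that digit would be replaced by 2 under the tie rule assumed here.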
References
[Atkins 1970] D.E. Atkins, "Higher-Radix Division Using Estimates of
the Divisor and Partial Remainders", IEEE Trans. Comp., August 1970, pp
720-733
[Cortadella and Lang, 1994] J. Cortadella and T. Lang, "High-Radix
Division and Square Root with Speculation", IEEE Trans. Comp., August
1994, pp 919-931
[Ercegovac, Lang and Montuschi 1993] M.D. Ercegovac, T. Lang,
P. Montuschi, "Very High Radix Division with Selection by Rounding and
Prescaling", Proc. Eleventh IEEE Symp. Comp. Arithmetic, 1993, pp
112-119
[Freiman 1961] C.V. Freiman, "Statistical Analysis of Certain Binary
Division Algorithms", Proc. IRE, Vol 49, No 1, Jan 1961, pp 91-103
[Hwang 1979] K. Hwang, "Computer Arithmetic: principles, architecture
and design", John Wiley and Sons, New York, 1979
[Flores 1963] Ivan Flores, "The Logic of Computer Arithmetic",
Prentice-Hall, Englewood Cliffs, 1963
[Kawahito et al 1994] S. Kawahito, M. Ishida, T. Nakamura, M. Kameyama,
T. Higuchi, "High-Speed Area-Efficient Multiplier Design Using
Multiple-Valued Current-Mode Circuits", IEEE Trans. Comp., Vol 43, No 1,
Jan 1994, pp 34-42
[Knuth 1969] D.E. Knuth, "The Art of Computer Programming", Vol 2
Seminumerical Algorithms, Addison Wesley 1969
[MacSorley 1961] O.L. MacSorley, "High Speed Arithmetic in Binary
Computers", Proc. IRE, Vol 49, Jan 1961, pp 67-91
[Montuschi and Ciminiera 1994] P. Montuschi and L. Ciminiera,
"Over-Redundant Digit Sets and the Design of Digit-By-Digit Division
Units", IEEE Trans. Comp., Vol 43, No 3, March 1994, pp 269-277
[Robertson 1958] J.E. Robertson, "A New Class of Digital Division
Methods", IEEE Trans. Comp., Vol C-7, No 8, September 1958, pp 218-222
[Svoboda 1963] A. Svoboda, "An Algorithm for Division", Information Proc
Machines, Vol 9, pp 25-32, 1963
[Schwarz and Flynn 1993] E.M. Schwarz, M.J. Flynn, "Parallel High-Radix
Nonrestoring Division", IEEE Trans. Comp., Vol 42, No 10, Oct 1993, pp
1234-1246
[Tocher 1956] K.D. Tocher, "Techniques of Multiplication and Division
for Automatic Binary Computers", Q. J. Mech. Appl. Math. Vol 11 Pt 3 pp
364-384
[Wallace 1964] C.S. Wallace, "A Suggestion for a Fast Multiplier", IEEE
Trans. Elec. Comp., Vol EC-13, Feb 1964, pp 14-17
[Waser and Flynn 1982] S. Waser, M.J. Flynn, "Introduction to
Arithmetic for Digital Systems Designers", Holt, Rinehart and Winston,
New York, 1982
[Wilson and Ledley 1961] J.B. Wilson, R.S. Ledley, "An Algorithm for
Rapid Binary Division", IRE Trans. Elec. Comp. Vol EC-10, Dec 1961, pp
662-670
[Wong and Flynn 1992] D. Wong, M. Flynn, "Fast Division Using Accurate
Quotient Approximations to Reduce the Number of Iterations", IEEE
Trans. Comp., Vol 41, No 8, Aug 1992, pp 981-995
