Hardware Design and Functional Programming: a Perfect
Match
Mary Sheeran
(Chalmers University of Technology
ms@cs.chalmers.se)
Abstract: This paper aims to explain why I am still fascinated
by the use of functional languages in hardware design. I hope that some
readers will be tempted to tackle some of the hard problems that I outline
in the final section. In particular, I believe that programming language
researchers have much to contribute to the field of hardware design.
Key Words: functional programming, hardware description languages,
arithmetic and logic structures, circuit generation
Category: B.2.2, B.6.3
1 Introduction
At the beginning of the 1980s, many researchers independently came upon
the idea of using a functional programming language to design and reason
about hardware. Some of the references that I am familiar with are [Lahti,
1980, Sheeran, 1983, Boute,
1984, Johnson, 1984a, Johnson,
1984b, Patel et al., 1985, Morrison
et al., 1985, Hunt Jr., 1985, O'Donnell,
1988, Verkest et al., 1988] but I have most likely
omitted several more, for which I apologise. I thought then, and I still
think that this was a great idea waiting to be discovered. And many others
have discovered it along the way. In the old days, there was a feeling
of excitement in the community. The interested reader is invited to dip
into the long list of references that I have gathered here, to get a feeling
for the nature and extent of the early work.
We felt sure that we would have an impact, and cheerfully ignored ridicule.
In response to my first paper submission (which later appeared as [Sheeran,
1984]), I received a single referee's report: a very thin strip of
paper which said "Functional correctness is not an issue in VLSI design.
Layout is the problem.". When, as a shy doctoral student, I tried
to explain my ideas to the leader of an industrial research lab, he eventually
jumped red-faced out of his chair and shouted "How dare you tell my
designers what to do!". And in the end, it must be said that we had
little impact on industrial practice. Verilog and VHDL took over the world.
In this invited paper, I would like to explain why I am more convinced
than ever that functional programming and hardware design are a perfect
match. Much has changed in the past 20 years. The problems facing hardware
designers as we move to deep submicron technologies are unfathomably difficult.
And functional programming languages, and our ideas about how to use
them, have changed out of all recognition. My thesis is that our original
idea was a good one. We were just a bit early!
1.1 µFP
My own route into this topic was via Backus' Turing Award paper [Backus,
1978]. I developed a variant of FP that operated on streams, and in
which the combining forms had geometric as well as behavioural meaning.
The following is the very first example that I published [Sheeran,
1984]:
Tal(0) = [1]
Tal(n) = TA ∘ zip ∘ [apndl ∘ [0, Tal(n-1) ∘ mst],
                     [n,n,..,n],
                     apndr ∘ [Tal(n-1) ∘ mst, 0]]
    where mst = reverse ∘ tl ∘ reverse
          TA = 2 → 3; 1
It is the tally circuit from the classic textbook [Mead
and Conway, 1980], see Figure 1. It should take
n inputs and produce n + 1 outputs. If exactly j of the inputs
are low (or zero), then the first (or leftmost) j outputs should also be
low, followed by a single 1, and then n − j low outputs (one
for each high input). The component TA is a multiplexer (or mux)
whose middle input is the address bit. It outputs its third input if that
address bit is high, and its first input if it is low. The operator ∘
is function composition. apndl and apndr stand for append
left and append right respectively, and they add an element
to a sequence, either on the left or on the right. tl returns
all but the first element of a nonempty sequence, and thus mst
returns all but the last element of such a sequence.
The definition of the tally circuit is recursive. The base case is a
circuit that outputs a single high output. For n inputs, the circuit
constructs the two possible (n + 1)-bit outputs, one beginning with
low and one ending with it, and then selects between them using the n^{th}
input (here called n) as the address bit for each of the muxes.
My description contains two recursive calls of the smaller tally with n
− 1 inputs, and this now strikes me as strange. However, the combinator-oriented
style did not permit the naming of intermediate values or wires. I also
showed the transistor diagram of the original circuit, and with the innocence
and arrogance of youth wrote "It is clear that we have described the
circuit exactly.".
It should be remembered that µFP was developed before the
idea of computer hardware description languages (CHDLs) had become widespread.
Neither VHDL nor Verilog existed at this time, though both were about to
come into existence. Many groups had long been experimenting with the design
of CHDLs (see [Chu et al., 1992, Borrione
et al., 1992] for a fascinating retrospective from 1992).
Figure 1: The Mead and Conway tally circuit, Tal(n), showing
the recursive call of Tal(n-1).
Even in industry, there was a great willingness to experiment. A team
at Plessey, Caswell (in the UK), investigated the use of µFP
in the design of regular architectures, as part of a large Alvey Project
(ARCH013). Their first experiences were encouraging, as shown by the following
quotes from their paper about a case study, the design of a video picture
motion estimator [Bhandal et al., 1990]:
Using muFP, the array processing element was described in just one line
of code and the complete array required four lines of muFP description.
muFP enabled the effects of adding or moving data latches within the array
to be assessed quickly. Since the results were in symbolic form it was
clear where and when data within the results was input into the array making
it simple to examine the dataflow within the array and change it as desired.
This was found to be a very useful way to learn about the data dependencies
within the array.
[. . .]
From the experience gained on this design, the most important consideration
when designing array processors is to ensure that the processor input/output
requirements can be met easily and without sacrificing array performance.
The most difficult part of the design task is not the design of the computation
units but the design of the data paths and associated storage devices.
It is essential to have the right design tools to aid and improve the design
process. Early use of tools to explore the flow of data within and around
the array and to understand the data requirements of the array is important.
muFP has been shown to be useful for this purpose.
It is interesting to note that µFP was used in the (important)
design exploration stage and that the use of symbolic and multilevel simulation
was much appreciated. Later, Ella [Morrison et al., 1985]
and Plessey's own cell-based design system were used for the implementation.
It was Geraint Jones and Wayne Luk who worked closely with the Plessey
team, and our views from that time on the tools needed for regular array
design are presented in reference [Luk et al., 1990].
Unfortunately, at just the wrong moment, another of the partners in the
project bought the relevant part of Plessey and closed down the design
team — a major setback for the research. However, my colleagues and I remained
interested in the design of regular architectures, having been much stimulated
by contact with real designers with special competence in this area. We
moved on to experiment with a relational language that again captured common
connection patterns as combinators or higher order functions [Luk
and Jones, 1988, Sheeran, 1988, Jones
and Sheeran, 1990]. My work at this time seemed to be of considerable
interest to academics, but it generated no interest whatsoever from industry.
Eventually, my response to this was to work with Satnam Singh, initially,
and later Koen Claessen, on the development of a more practical design
and verification system for FPGAs, Lava.
1.2 Lava: embedding a simple hardware description language in Haskell
In Lava, in which a µFP or Lustrelike language is embedded
in the functional programming language Haskell [Bjesse
et al., 1998, Claessen et al., 2001], the definition
of the tally is
tal :: [Signal Bool] -> [Signal Bool]
tal [] = [high]
tal (a:as) = bs
where
bs = mux(a,(low:cs,cs++[low]))
cs = tal as
It takes a list of bits as input, and produces a list of bits as output.
Now, there is only one call to the smaller tally circuit. Also, we now
use pattern matching to give a standardlooking recursive definition over
lists. The function mux is the builtin polymorphic multiplexer, which
in this case selects between two input lists. However, all is not quite
as it seems. This is not a circuit description, but a circuit generator.
What actually happens when we want to (say) generate a netlist is that
we run the circuit description with symbolic inputs (of fixed size), to
produce a symbolic representation of the circuit (again of fixed size).
This representation can then be processed in various ways to produce
formats for input to other design and verification tools. The important
point is that we have a two-stage process, not a single stage as you might
expect. So, in Lava, you write a Haskell program to generate a netlist,
rather than writing a description of the netlist itself. This may sound
like a small change of emphasis, but in fact having a full functional language
available for writing generators increases expressiveness and ease of use
in ways that we had not predicted when we started work on Lava.
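As a rough sketch of this two-stage idea (this is not the actual Lava implementation, which uses observable sharing; the `Sig` type and gate names here are hypothetical), running a description on symbolic inputs yields a data structure that the second stage can translate to any netlist format:

```haskell
-- A toy symbolic bit: a named input wire, or a gate applied to sub-signals.
data Sig = Inp String | Gate String [Sig] deriving (Eq, Show)

xor2, and2 :: (Sig, Sig) -> Sig
xor2 (a, b) = Gate "xor2" [a, b]
and2 (a, b) = Gate "and2" [a, b]

halfAdd :: (Sig, Sig) -> (Sig, Sig)
halfAdd (a, b) = (xor2 (a, b), and2 (a, b))

-- Stage one: run the generator on symbolic inputs of a fixed size.
netlist :: (Sig, Sig)
netlist = halfAdd (Inp "a", Inp "b")
-- Stage two would walk this structure to emit a netlist format.
```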
There are pros and cons to having a data structure that represents the
circuit. One advantage is that the application of circuit transformations
becomes possible. VHDL descriptions, which represent the required behaviour,
cannot easily be manipulated in the same way. Also, we perform formal reasoning
only about the simple netlist-like circuit representation (at least at
the moment). The semantics of such a netlist is well-understood, so it
is very clear what should be proved in verification, and we will consider
this in the next section. Separating the semantics of the final circuit
from that of the programming language used to write the generator (which
includes lazy evaluation) is an important simplification.
The rather structural approach that we advocate is, however, most appropriate
in cases where we want to have full control over the properties (both functional
and nonfunctional) of the final result. This is not always the case. Sometimes,
a synthesis tool is just what is needed, and the designer is then relieved
of having to worry about unnecessary detail. However, even in a synthesis
tool, it is necessary to be able to supply library components that are
suitable for different contexts. The production of such components is an
area in which we think our approach can be applied, even in a standard
design flow.
1.3 Verification in Lava
Our approach to the problem of verifying circuit correctness has been
to borrow from the synchronous programming community the notion of synchronous
observer [Halbwachs et al., 1993]. The observer operates
on both the inputs and outputs of the circuit being observed, and produces
a single Boolean output that should never become false. Both the circuit
and the observer can be sequential, and the production of a logical representation
of the combination of the circuit and the observer proceeds in just the
same way as for an ordinary circuit. We can, for instance, produce propositional
logic for input to SAT-solvers, and we have developed associated formal
verification methods [Sheeran et al., 2000, Bjesse
et al., 1998]. The following very small example illustrates the idea.
We have designed a new full-adder circuit, and would like to check that
its behaviour is the same as the built-in full-adder. The observer produces
a single output called ok and we use a SAT-solver (which is called by the
function satzoo) to prove that it can never be false.
Behind the scenes, using symbolic evaluation, a formula in DIMACS format
is generated, but the user need not bother with the details of the format.
We have similar links to other verification tools, including SMV [McMillan,
1999] and VIS [Brayton, R.K. et al., 1996].
fa :: (Signal Bool,(Signal Bool,Signal Bool)) -> (Signal Bool,Signal Bool)
fa (cin, (a,b)) = (sum, cout)
where
part_sum = xor2 (a, b)
sum = xor2 (part_sum, cin)
cout = mux (part_sum, (a, cin))
checkFullAdd :: (Signal Bool,(Signal Bool,Signal Bool)) -> Signal Bool
checkFullAdd ins = ok
where
out1 = fa ins
out2 = fullAdd ins  -- Lava built-in full-adder
ok = out1 <==> out2
Main> satzoo checkFullAdd
Satzoo: ... (t=0.1) Valid.
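For a circuit this small, the same check can also be modelled in plain Haskell by exhaustive enumeration over Bool (a sketch standing in for the SAT-based proof; `refFullAdd` is an arithmetic reference model of my own, not a Lava primitive):

```haskell
-- Boolean model of the full-adder above: mux (sel, (x, y)) yields y
-- when sel is high, and x otherwise.
faB :: (Bool, (Bool, Bool)) -> (Bool, Bool)
faB (cin, (a, b)) = (s, cout)
  where
    partSum = a /= b                       -- xor2
    s       = partSum /= cin               -- xor2
    cout    = if partSum then cin else a   -- mux (part_sum, (a, cin))

-- Reference full-adder, via counting the high inputs.
refFullAdd :: (Bool, (Bool, Bool)) -> (Bool, Bool)
refFullAdd (cin, (a, b)) = (odd n, n > 1)
  where n = length (filter id [cin, a, b])

-- The 'ok' signal holds on all 8 input combinations.
checkFullAddB :: Bool
checkFullAddB = and [ faB i == refFullAdd i
                    | c <- [False, True], a <- [False, True]
                    , b <- [False, True], let i = (c, (a, b)) ]
```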
Note that this approach only allows the verification of safety properties.
A similar approach to verification can be used in standard languages, and
indeed there is growing interest in standard property specification
languages, which can express a greater variety of properties [Fitzpatrick,
2003, Accelera, 2004]. For a brief description
of a course on Hardware Description and Verification that incorporates
both PSL and Lava, see [Axelsson et al., 2005a].
The approach to formal verification currently used in Lava flattens
the entire circuit before invoking the verification tools. It would be
interesting to explore more hierarchical verification methods, which would
entail a more complicated (but still tangible) circuit representation.
1.4 Connection patterns in Lava
In describing larger circuits, we make use of higher order functions
that encode common connection patterns. For example, here are the
definitions of two of the connection patterns that are built in to Lava:
compose [] = id
compose (circ:circs) = circ ->- compose circs
composeN n circ = compose (replicate n circ)
The function compose composes a list of circuits in series using the
series composition pattern ->-, while composeN composes
n copies of the given circuit. In the tally circuit, above, it would actually
have made sense to first define a connection pattern, and then use it with
the mux as parameter, rather than hardwiring that component into the circuit.
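For instance, the tally's recursive wiring can be captured as a pattern with the selecting component as a parameter (a sketch over plain Haskell lists; the names `tallyPat` and `tallyBool` are mine, not Lava's):

```haskell
-- A hypothetical tally connection pattern: 'sel' chooses, per address
-- bit, between the two candidate output lists.
tallyPat :: (a -> ([a], [a]) -> [a]) -> a -> a -> [a] -> [a]
tallyPat _   _  hi []       = [hi]
tallyPat sel lo hi (a : as) = sel a (lo : cs, cs ++ [lo])
  where cs = tallyPat sel lo hi as

-- Instantiating with a Boolean "mux" over lists gives a software model:
tallyBool :: [Bool] -> [Bool]
tallyBool = tallyPat (\a (xs, ys) -> if a then ys else xs) False True
```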
Another typical connection pattern is row, which makes a linear
array of two-input, two-output components. Using it, we can, for instance,
make a ripple-carry binary adder from full-adders:
binAdder fadd as = ss ++ [c]
where (ss,c) = row fadd (zero, as)
The input as is a list of pairs of bits from the two (least significant
bit first) binary numbers being added. The carry-out from the rightmost
full-adder is appended to the end of the list of sum-bit outputs to give
a result whose length is one longer than that of its inputs. Note that
this is a generic definition. When we want to analyse it, for example for
verification, we must instantiate it to a fixed size by giving it appropriate
symbolic inputs. For a case study of the use of a connection pattern oriented
approach to the design and analysis of a sorter core, see reference [Claessen
et al., 2001].
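A plain-list sketch of what `row` does (an assumption about its interface, not Lava's actual definition): the carry is threaded left to right, and a Boolean model of the binary adder falls out directly:

```haskell
-- rowPat f: thread a carry through a list of components, each mapping
-- (carry, input) to (output, new carry).
rowPat :: ((c, a) -> (b, c)) -> (c, [a]) -> ([b], c)
rowPat _ (c, [])     = ([], c)
rowPat f (c, a : as) = (b : bs, c'')
  where (b, c')   = f (c, a)
        (bs, c'') = rowPat f (c', as)

-- A Boolean full-adder over (carry-in, (a, b)), and the adder itself.
fullAddB :: (Bool, (Bool, Bool)) -> (Bool, Bool)
fullAddB (cin, (a, b)) = ((a /= b) /= cin, (a && b) || (cin && (a /= b)))

binAdderB :: [(Bool, Bool)] -> [Bool]
binAdderB as = ss ++ [c]
  where (ss, c) = rowPat fullAddB (False, as)
```

For example, adding 3 and 2 (least significant bit first) gives 5.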
1.5 Nonstandard interpretation for circuit analysis in Lava
We include fadd as a parameter in the definition of the binary adder,
binAdder, not only to allow the use of various full-adder implementations,
but also to allow the inclusion of nonstandard components that assist
in circuit analysis. For instance, we can define a full-adder that operates
not on bits but on delays:
fAddI (a1s, a2s, a3s, a1c, a2c, a3c) (a1,(a2,a3)) = (s,cout)
where
s = max (a1s+a1) (max (a2s+a2) (a3s+a3))
cout = max (a1c+a1) (max (a2c+a2) (a3c+a3))
fI :: (Signal Int,(Signal Int, Signal Int)) -> (Signal Int, Signal Int)
fI as = fAddI (20,20,10,10,10,10) as
Main> simulate (binAdder fI) (replicate 10 (0,0))
[20,30,40,50,60,70,80,90,100,110,100]
The function fAddI has six additional parameters which
model the delay for each of the six possible paths from input to
output. The component can then be used to calculate worstcase
delays. This is useful in design exploration, when different circuits
that use this component can be compared. This is a direct application
of the idea of Non-Standard Interpretation (NSI) [Jones and Nielson, 1994] — again an example of a
direct and fruitful borrowing of a good idea from programming language
research. There was some promising early work on using NSI in circuit
analysis [Singh, 1992, Luk,
1990] but I feel that the idea is still ripe for
exploitation. Significant advances have been made in functional
languages in recent years in the development of more expressive type
systems, and of new approaches to metaprogramming, reflection and
resourceawareness. I cannot attempt a survey here, but the interested
reader is referred to the content and citations in [Kiselyov et al., 2004, Sheard and
Peyton Jones, 2002, Grundy et al., 2003, Mycroft and Sharp, 2001].
These developments will allow more typesafe embeddings or direct implementations
of hardware description languages, and will allow us to do more with NSI.
Later in this paper, I show that NSI can be used not only after but also
during circuit generation.
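The delay simulation shown earlier is easy to reproduce in plain Haskell (a self-contained model of the delay interpretation, with the carry threaded by explicit recursion; the names are mine):

```haskell
-- Delay-interpreted full-adder: the six parameters are path delays,
-- the inputs are arrival times, the outputs worst-case arrival times.
fAddD :: (Int, Int, Int, Int, Int, Int) -> (Int, (Int, Int)) -> (Int, Int)
fAddD (a1s, a2s, a3s, a1c, a2c, a3c) (a1, (a2, a3)) = (s, cout)
  where
    s    = maximum [a1s + a1, a2s + a2, a3s + a3]
    cout = maximum [a1c + a1, a2c + a2, a3c + a3]

-- Ripple the carry delay through the adders, appending the final carry.
binAdderD :: Int -> [(Int, Int)] -> [Int]
binAdderD cin []            = [cin]
binAdderD cin ((a, b) : as) = s : binAdderD cout as
  where (s, cout) = fAddD (20, 20, 10, 10, 10, 10) (cin, (a, b))
```

Running `binAdderD 0 (replicate 10 (0,0))` reproduces the list [20,30,40,50,60,70,80,90,100,110,100] from the simulation above.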
1.6 What more do we need?
The ideas presented in the previous sections are appealing, but still
they have been largely ignored by those doing circuit design in practice.
My conjecture is that this is at least partly because the range of circuits
that can easily be generated and analysed is too narrow. And perhaps our
work has been too concentrated on the functional programming technology,
and not enough on the problems faced by designers. In my recent work, I
have aimed to address these issues.
The problem that I have chosen to address is the need to take account
of nonfunctional circuit properties (such as power consumption) early
in the design. The need for new ideas in this area was eloquently stated
by IBM's Wolfgang Roesner in an invited talk at CHARME 2003 [Roesner,
2003]. I cannot, of course, address all of the issues raised in that
talk, but I have become interested in ways to allow nonfunctional properties
to cross abstraction boundaries. And I have started at the bottom! For
instance, how might we generate a circuit whose topology is adapted to
the delay on its input signals? How might we make generators that produce
circuits that adapt to the context in which they are placed? Finding answers
to these questions results in a broadening of the notion of generic
circuit. Attacking these problems has, as a nice sideeffect, greatly
increased the range of circuits that can easily be described in a functional
style.
The following sections illustrate, by means of examples, my first steps
towards a design style that takes account of nonfunctional properties.
2 Parallel prefix circuits
A modern microprocessor contains many parallel prefix
circuits and it was this that led us to study ways to design and
analyse this class of circuits. The best known use of parallel prefix
circuits is in the computation of the carries in fast binary addition
circuits; another common application is in priority encoders. There
are a variety of well known parallel prefix networks, including [Sklansky, 1960] and [Brent and
Kung, 1982]. And there are also many papers in the field of
circuit design that try to systematically figure out which topology is
best for practical circuits [Knowles, 1999, Chan et al., 1992, Zlatanovici and
Nikolic, 2003]. Those more familiar with functional programming
than with hardware design might know the prefix function as
scan — see, for example, O'Donnell's work on proving the
correctness of a parallel scan algorithm [O'Donnell, 1994].
Figure 2: (a) The serial prefix structure (b) Construction
of a parallel prefix circuit by composition of two smaller such circuits,
p1 and p2.
More recently, Hinze has studied the algebra of scans in a delightful
paper that is also a great tutorial on parallel prefix networks [Hinze,
2004].
Given n inputs, x_{1}, x_{2},
..., x_{n}, the problem is to design a circuit that
takes these inputs and produces the n outputs y_{1} =
x_{1}, y_{2} = x_{1} ∘ x_{2},
y_{3} = x_{1} ∘ x_{2} ∘ x_{3}, ...,
y_{n} = x_{1} ∘ ... ∘ x_{n}, where ∘ is
an arbitrary associative (but not necessarily commutative) binary
operator. One possible solution is the serial prefix circuit
shown schematically in Figure 2(a). Input nodes
are on the top of the circuit, with the least significant input
(x_{1}) being on the left. Data flows from top to
bottom, and we also count the stages or levels of the circuit in this
direction, starting with level zero on the top. An operation node,
represented by a small circle, performs the ∘
operation on its two inputs. One of the inputs comes along the
diagonal line above and to the left of the node, and the other along
the vertical line feeding the node from the top. A node always
produces an output to the bottom along the vertical line. It may also
produce an output along a diagonal line below and to the right of the
node. Here, at level zero, there is a diagonal line leaving a vertical
wire in the absence of a node. This is a fork. The diagram shows a
ripple-carry structure. The circuit shown contains 7 nodes, and so is
said to be of size 7. Its lowest level in the picture is level
7, so the circuit has depth 7. The fanout of a node is its
out-degree. In this example, all but the rightmost node have fanout
2, so the whole circuit is said to have fanout 2. Examining the
serial prefix structure in Figure 2(a), we see
that at each nonzero level only one of the vertical lines contains a
node. We are interested in exploring ways to design and analyse
parallel prefix circuits, in which there can be more than one
node at a given level, so that there is some parallelism in the
resulting computation.
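In functional programming terms the specification of the problem is exactly Haskell's `scanl1`; a minimal executable spec, useful for checking any candidate network, might read (the names are mine):

```haskell
-- The prefix specification: y_i = x_1 ∘ ... ∘ x_i.
prefixSpec :: (a -> a -> a) -> [a] -> [a]
prefixSpec = scanl1

-- Check a candidate network against the spec on sample data; with an
-- associative operator the two must agree.
matchesSpec :: Eq a => ([a] -> [a]) -> (a -> a -> a) -> [a] -> Bool
matchesSpec p op xs = p xs == prefixSpec op xs
```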
Figure 3: (a) A parallel prefix construction using a forwards
and a backwards tree (b) The same construction with the lower tree slid
back by one level, which reduces the depth by one.
Figure 4: The Brent-Kung construction for 32 inputs.
For two inputs, there is only one reasonable way to construct a prefix
circuit, using one copy of the ∘ operator. Parallel prefix circuits can also
be formed by composing two smaller such circuits, as shown in Figure
2(b). Repeated application of this pattern (and the base case) produces
the ripple-carry circuit that we have already seen.
For 2^{n} + 1 inputs, one can use so-called forwards
and backwards trees, as shown in Figure 3(a). We call
a parallel prefix circuit of this form a slice. At the expense of
a little extra fanout at a single level in the middle of the circuit,
one can slide the (lower) backwards tree up one level, as shown in Figure
3(b). Composing slices of increasing size gives the well-known Brent-Kung
construction [Brent and Kung, 1982] shown in Figure
4.
A shallower n-input parallel prefix circuit can be obtained by
using the recursive Sklansky construction, which combines the results of
two separate n/2 input parallel prefix circuits [Sklansky,
1960], as shown in Figure 5. It is written in Lava
as
sklansky :: ([a] -> [a]) -> [a] -> [a]
sklansky op [a] = [a]
sklansky op as = (sklansky op >> join op (sklansky op)) as
  where join op p (a:as) = withEach op (a : p as)
Figure 5: (a) The recursive Sklansky construction (b) Sklansky
for 32 inputs, with fanout 17.
Here, in the second equation, the Sklansky circuit is constructed by
composing two smaller parallel prefix circuits. The connection pattern
called >> composes two parallel prefix circuits as shown in Figure
2(b), but in addition makes sure that the two subcircuits each operate
on half of the inputs.
join op p takes a parallel prefix circuit p with n
inputs, and makes a parallel prefix circuit with n + 1 inputs, by
combining the first input with each of the outputs of p, see Figure
6. The first of the two smaller circuits composed using >>
is simply a recursive call of the Sklansky construction, while the second
is join op applied to such a subcircuit. The depth of this construction
is ⌈log₂ n⌉,
and this is the shallowest possible. The downside of the Sklansky construction
is that it has high fanout. The Kogge-Stone construction, which has fanout
of only 2, also achieves minimum logical depth, but at the expense of a
large number of operators and long wires [Kogge and Stone,
1973].
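A plain-list model of the Sklansky construction (a sketch ignoring Lava signals; the helper name is mine) makes the recursion easy to test in isolation:

```haskell
-- Sklansky over plain lists: compute the prefix of each half
-- recursively, then combine the last output of the left half with
-- every output of the right half.
sklanskyL :: (a -> a -> a) -> [a] -> [a]
sklanskyL _  [a] = [a]
sklanskyL op as  = ls' ++ map (op (last ls')) rs'
  where
    (ls, rs) = splitAt (length as `div` 2) as
    ls'      = sklanskyL op ls
    rs'      = sklanskyL op rs
```

On any sample input it agrees with the `scanl1` specification.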
2.1 Generating depth-size optimal parallel prefix
circuits
An n-input prefix circuit of depth d and size s
is said to be depth-size optimal if d + s = 2n
− 2. Of those parallel prefix constructions discussed above, only the serial
prefix is depth-size optimal.
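Stated as a predicate (a trivial sketch; n is the input count, d the depth and s the number of operators):

```haskell
-- d + s = 2n - 2 characterises depth-size optimality.
depthSizeOptimal :: Int -> Int -> Int -> Bool
depthSizeOptimal n d s = d + s == 2 * n - 2

-- The serial prefix on n inputs has depth n-1 and size n-1, so it
-- satisfies the condition for every n.
serialIsOptimal :: Int -> Bool
serialIsOptimal n = depthSizeOptimal n (n - 1) (n - 1)
```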
Figure 6: The decomposition of the Sklansky construction
corresponding to the Lava code.
We have investigated the generation of depth-size optimal circuits as
a means to find a good compromise between number of operators, depth, wire length
and fanout. This entails a study of the algebra of the prefix (or scan)
combinator, in true functional programming tradition.
Consider the slice shown in Figure 3(a). The first
input line is forked for the first time at level 4, and we say that it
has firstFork 4. Its last output is calculated at level 5, even
though the circuit has depth 9. We call the level at which the last output
is computed lastd. We call the difference between lastd and
firstFork the waist of the circuit. A parallel prefix circuit
with waist w is waist-size optimal if w + s
= 2n − 2. (This property is called size optimality by its originators,
who also present it a little differently [Lin et al.,
2003]). Both of the 17-input slices shown in Figure
3 have waist one and size 31; so both are waist-size optimal. The composition
(as in Figure 2(b)) of two waist-size optimal circuits
gives a new waist-size optimal circuit whose waist is the sum of the waists
of its components. We note also that a waist-size optimal circuit whose
waist is the same as its depth is, in addition, depth-size optimal. This
gives a clue about how to build a depth-size optimal circuit: we should
compose a series of waist-size optimal circuits in such a way that the
resulting circuit's waist is the same as its depth. So, to build a depth
n circuit, we must compose n waist-one slices whose waists
span from level i to level i + 1, for i ranging from 0 to
n − 1. Each slice should have as many inputs as possible, and the
depths of the forwards and backwards trees must be such that the total
circuit depth does not exceed n.
If we restrict fanout to 2, as in the slice shown in Figure
3(a), then the result is like an extended version of BrentKung. It
is made of 9 composed slices, with the size of the slices increasing and
then decreasing, see Figure 7.
Figure 7: Composing waist-one slices to make a depth-size
optimal parallel prefix circuit with fanout 2. It has 47 inputs, depth
9 and 83 operators.
Figure 8: A waist-size optimal circuit (of waist one) that
combines serial prefix with matching fanout.
This circuit has a pleasing symmetry, and the reader might like to
consider the effect of rotating it through 180 degrees and then
sliding all the dots to the other end of their diagonal lines. (If
that was too easy, consider the effect of performing this operation on
any parallel prefix circuit.)
Things become more interesting, though, when we allow fanout to increase
slightly. Allowing the arbitrary use of fanout gives circuits that contain
too many operators to be waist-size optimal. However, using a serial prefix
construction in the forward tree, matched by fanout in the backwards tree,
as shown in Figure 8, gives a waist-size optimal circuit.
Moreover, this construction can be used at any level in the trees, meaning
that the serial prefix construction can here combine the results of 4 subtrees,
while there will be fanout to 4 matching subtrees in the lower tree.
Armed with this knowledge, we can write a function that constructs a slice
with a forward tree of depth m and matching backwards tree of depth
n, for a given fanout, f. The fact that we can use fanout
greater than 2 in the backwards tree allows us to make larger trees for
a given depth. The slice should take in as many inputs as possible for
the required tree depths.
slice m n f op d [a] = [a]
slice m n f op d (a:as) = cs ++ [bsl']
where
bs = ftree m n f op d as
bsm = most bs
bsl = last bs
[a',bsl'] = op [a,bsl]
cs = btree m n f op d (a':bsm)
Depth   Slices(4)   WE4       Z4        H4        M         LYD
  8     47–72       47–64     45–67     42–57     45–62     55–77
  9     73–114      65–102    68–98     58–87     63–87     78–95
 10     115–179     103–138   99–145    88–119    88–122    96–135
 12     282–440     215–286   209–303   180–243   166–233   170–242
 14     692–1082    439–582   431–621   364–491   311–443   309–446
Table 1: The number of inputs that can be processed by some
depth-size optimal circuits for a specific depth; each entry gives the
range of input sizes that the construction can handle at that depth.
The recursive definitions of the forwards and backwards trees are slightly
tricky, but well suited to the functional style of description.
Now, we can make a depth-size optimal parallel prefix circuit with
depth m and fanout f by composing slice i (m-i)
f, for i ranging from 0 to m-1. For
depth 8 and fanout 4, the result is shown in Figure
9. For want of a better name, we call the new construction
Slices(f), where f is the fanout. Here, we have shown
how to construct the circuit with the largest possible number of
inputs for a given depth and fanout. To make circuits of the same
depth, but with fewer inputs, one can remove some lines and their
associated operators. This can be done in a way that preserves
depth-size optimality.
The previous best known depth-size optimal construction for fanout
4 and depth 8 was Z4, which could take 67 inputs [Lin
and Hsiao, 2004]. Table 1 (which is drawn from
[Lin and Hsiao, 2004], with the addition of figures
for our new circuit) compares the new construction with the best previously
known depth-size optimal circuits. Z4, WE4 and H4
have fanout 4, while M and LYD have unrestricted fanout
per level. Note that we have improved considerably on all known depth-size
optimal constructions, even those with unrestricted fanout.
I would also assert that the Slices construction is very much simpler
than its depth-size optimal competitors! A word of caution, though: these
are interesting theoretical results, but to check whether or not they are
useful in practice, we must build circuits using them, and measure
their performance, including power consumption. To this end, three groups
of VLSI design students, supervised by Prof. Per Larsson-Edefors at Chalmers,
are designing fast adders based on the Slices(4) construction (which is
used for computing the carries). The students will implement reference
adders using the construction given in [Han and Carlson,
1987], which is a hybrid of Brent-Kung and Kogge-Stone. One of the
groups will have their project implemented on silicon, and measurements
will be conducted in the autumn of 2005. We await the results with interest.
Figure 9: A new arrangement of composed slices, with 72 inputs,
depth 8 and fanout 4. We call this construction Slices(4).
Figure 10: A depth 9, fanout 5, 128-input parallel prefix
circuit.
It is also interesting to play with varying the fanout. For depth 8,
fanout of 5 gets us to 80 inputs, while fanout of 9 (the maximum the
construction can use) reaches 88 inputs. For depth 9, increasing fanout
to 5 gets us to 128 inputs, which might well be interesting in practice.
The resulting circuit is shown in Figure 10. It has
(as expected) 245 operators. By comparison, the Sklansky construction,
for 128 inputs, has 448 operators (with depth 7 and fanout 65).
Incidentally, the dot diagrams of slices and parallel prefix circuits
shown in this paper have been automatically produced from Lava descriptions.
A nonstandard interpretation is first used to gather information about
the circuit network, and a small Haskell program then produces Xfig input.
This is typical of the kind of lightweight analysis that is easy to do
in a functional language.
3 Clever circuits
The parallel prefix constructions that we have just considered have
a degree of regularity, although that regularity is not always immediately
obvious when one considers the diagrams. In recent years, I have come to
realise that it is important to be able to describe and analyse not only
regular circuits but also circuits that are almost regular. The
idea was born in a discussion with a designer of high performance arithmetic
circuits. He explained how what starts off as a regular adder circuit becomes
less and less regular as one goes down through the levels of abstraction,
taking account of more and more nonfunctional properties that depend on
the exact context in which each cell is placed. But still, the final circuit
has some relationship to its more regular ancestor, even if we don't currently
have good intellectual tools for expressing that relationship. I became
fascinated by the general question of how to generate circuits that adapt
to their contexts, losing some but not all of their regularity in the process.
This led to a new programming idiom, which I call clever circuits.
We have seen in earlier examples how one uses pattern matching in
recursive functions over lists to define parameterised circuits (see
the definition of sklansky for instance). Another idiom used
Haskell integers to control the unrolling that is used to generate
circuit netlists (as in the definition of composeN). In clever
circuits, Haskell-level shadow values are paired with
circuit-level inputs and outputs, and used to control circuit
generation. The shadow values can contain whatever information we
choose. In a first application, I encoded, on the shadow wires, the
degree of sortedness of the values on the circuit wires, and used this
information to generate median circuits from descriptions of sorters
[Sheeran, 2003]. The shadow values controlled
the presence or absence of comparator components. In a development of
that idea, I used shadow values to make choices about circuit
wiring during the generation of multiplier reduction trees [Sheeran, 2004].
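The idiom can be suggested in a few lines (a toy reconstruction, not the code from [Sheeran, 2003]; the names Cell and cleverCmp are invented). A Haskell-level shadow boolean records whether the generator already knows the two values to be in order, and the comparator component is simply not emitted when it would be redundant:

```haskell
-- A record of what hardware the generator actually emitted.
data Cell = Comparator | Pass deriving (Eq, Show)

-- A comparator stage in the clever-circuit style: the shadow value
-- (here a Bool meaning "already known sorted") travels alongside the
-- circuit-level values and controls whether a component is generated.
cleverCmp :: Ord a => Bool -> (a, a) -> ((a, a), Cell)
cleverCmp sorted (x, y)
  | sorted    = ((x, y), Pass)                    -- shadow: omit the cell
  | otherwise = ((min x y, max x y), Comparator)  -- emit a real comparator
```

In a full sorter-to-median derivation the shadow would of course be a richer description of partial sortedness than a single boolean, but the control flow is the same.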
4 Combining clever circuits and NSI to get adaptive circuits
Here, I shall use clever circuits to determine circuit topology
based on information about nonfunctional properties (in this case information
about the delay on circuit inputs). For example, Figure
11 shows a 64-input parallel prefix circuit that is well adapted to
its input delay profile, which is typical of that from a reduction tree.
The example is taken from work at Synopsys on optimised synthesis of sum-of-products
circuits [Zimmermann and Tran, 2003]. This example
uses a series of narrow slices on the left, and a Sklansky-like construction
on the right.
To facilitate such descriptions, it makes sense to build combinators
that work on sets of circuits (or in fact of circuit generators).
For example, the function best chooses among a set of possible
circuits according to an ordering on the output shadow values.
best comp [f] i = f i
best comp (a:as) i = if (comp fs gs) then a i else best comp as i
where (_,fs) = unzip (a i)
(_,gs) = unzip (best comp as i)
Then, one way to build a composite parallel prefix circuit is to define
a set of solutions for the left hand side of the circuit, and then choose
the one that has smallest lastd. Similarly, one can minimise the
difference between depth and firstFork from a set of possible solutions
for the right hand side. Finally, one can choose where to place the division
between these two parts to minimise the overall measure, which is the comp
parameter.
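To make the intention concrete, here is a self-contained transcription of best together with an invented example: generators map an input delay profile to (value, shadow) pairs, and the comparison prefers the candidate whose deepest output shadow is smallest. Only best itself comes from the text; shallower and the toy generators g1 and g2 are hypothetical.

```haskell
-- best, as in the text: choose among a set of circuit generators by
-- comparing the shadow values on their outputs.
best :: ([s] -> [s] -> Bool) -> [i -> [(a, s)]] -> i -> [(a, s)]
best _    [f]    i = f i
best comp (a:as) i = if comp fs gs then a i else best comp as i
  where (_, fs) = unzip (a i)
        (_, gs) = unzip (best comp as i)

-- Hypothetical shadow ordering: prefer the smaller maximum output delay.
shallower :: [Int] -> [Int] -> Bool
shallower fs gs = maximum fs <= maximum gs

-- Two toy "generators" over a delay profile: one adds a level per wire,
-- the other puts every output two levels above the slowest input.
g1, g2 :: [Int] -> [((), Int)]
g1 ds = [((), d + 1) | d <- ds]
g2 ds = [((), maximum ds + 2) | _ <- ds]
```

Here best shallower [g2, g1] [0,1,2] selects g1, whose deepest output shadow (3) beats g2's (4).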
Figure 11: A parallel prefix circuit adapted to a typical
delay profile on the output of a reduction tree.
adapt m n f comp op d as
  = best comp [composeK k (left k) (right k) | k <- [t-m..t+m]] as
  where t = div (length as) 2
        left tr = bestf lastdP (lpart f m n op d ....
        right tl = bestf dminfirstfoP (rpart f m n op d ....
Figure 11 shows what happens when the set corresponding
to rpart contains only the Sklansky construction. We can, as we
did in the nonadaptive case, get a low-fanout solution by using slices.
rpart should then express the set of possible combinations of
slices, and the context, that is the delay on the input shadow values,
determines which is actually chosen. Figure 12 shows
the result for our typical delay profile, while in Figure
13 different choices have been made, in response to a shallower profile.
It would be easy to use a finer degree of modelling in the calculation
of the shadow values that control the generation. Here, the modelling was
crude, and counted only the number of operator delays. It is just a question
of including the required modelling functions in the nonstandard versions
of the operators. Similarly, it would be possible to include the modelling
of wires, as we did in multiplier reduction tree generation [Sheeran,
2004], or to have more than one kind of dot operator, which would be
useful in the generation of adders.
Note that the user is now expressing sets of possible circuits
and how to choose between them. This is just a first attempt at defining
suitable combinators, and it is surprisingly effective. However, we feel
that we can do better. Koen Claessen has pointed out a possible link to
pretty-printing combinators. It should be possible to adapt Hughes' ideas
about the development of libraries of pretty-printing combinators [Hughes,
1995] to the problems of generating circuits that meet constraints
on their nonfunctional properties. This is yet another opportunity for
cross-fertilisation between functional programming and hardware design.
Thinking along these lines will most likely lead to a much broader notion
of shadow value.
Figure 12: A low-fanout adaptive parallel prefix construction.
Figure 13: A parallel prefix circuit adapted to a shallower
delay profile.
Sometimes, it will make sense to pair the shadow values with wires,
as we have done here, but in other cases the shadow values will be linked
to the circuit in more general ways. Many interesting questions remain
to be investigated.
5 Discussion and Conclusion
In section 2.1, I demonstrated the power of the
connection pattern approach to the writing of circuit generators. A study
of the algebra of parallel prefix suggested a new design, and the functional
language made it easy to realise and analyse that design. The compactness
of the descriptions, and the immediate feedback from the generated pictures,
enabled a great deal of design exploration along the way to the new design.
I am convinced that this exploration would have been very difficult, if
not impossible, in a more standard hardware description language such as
VHDL.
Next, I showed how the combination of nonstandard interpretation and
clever circuits gives a method of writing generators for circuits that
adapt to their contexts.
Although I, again, demonstrated the approach using the example of parallel
prefix, the method is a general one. The better the nonstandard models we
create, the better the circuits we will generate. This seems to be a promising
approach to the problem of creating families of circuits with different
nonfunctional properties to act as library components implementing a given
function.
So far, I have been considering a rather low level of abstraction in
circuit design. And what my colleagues and I are doing is going even lower!
We are developing Wired, a system for circuit design that takes
account of the effects of wires by being precise about their sizes and locations
[Axelsson et al., 2005b]. The descriptions are essentially
relational, which means that ideas from my earlier work with Geraint Jones
on Ruby are reappearing [Jones and Sheeran, 1990].
The use of relations allows not only the kind of unidirectional or functional
nonstandard interpretation that has been shown here, but also bidirectional
analyses. An example of such an analysis is RCdelay estimation, in which
resistance values flow in one direction, and capacitance values in the other.
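The flavour of such an analysis can be suggested in a functional setting (a much-simplified sketch, not Wired's relational formulation; elmore is an invented name). For a wire modelled as a chain of (resistance, capacitance) segments, the Elmore delay multiplies each segment's series resistance by the total capacitance it must drive, gathered from the sink back towards the source:

```haskell
-- Simplified Elmore RC-delay for a chain of wire segments, each with a
-- series resistance and a capacitance to ground. Resistance information
-- is consumed in source-to-sink order; the capacitance each segment
-- drives is accumulated sink-to-source with scanr.
elmore :: [(Double, Double)] -> Double
elmore segs = sum (zipWith (*) rs downstreamC)
  where rs          = map fst segs
        -- capacitance at and beyond each segment, flowing backwards
        downstreamC = init (scanr (+) 0 (map snd segs))
```

Two unit segments give 1*2 + 1*1 = 3 units of delay; in the relational setting the two accumulations would be the two directions of one bidirectional analysis.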
Our aim is to make this lower level of design and analysis quicker and
easier, and our first results are promising. Here, again, we feel that
we will be able to incorporate ideas that have been proposed for use in
pretty-printing [Hughes, 1995]. We are particularly
interested in ways to generate circuits with low switching activity, and
we hope to develop new methods of generating low-power circuits.
While developments in low level circuit design are important, we will
not solve the problems facing designers unless we find new design and verification
methods that work earlier in the design process, at higher levels of abstraction.
An important question is "Can the methods described in this paper
be used at higher levels of abstraction?". My tentative answer is
that these are general programming idioms, and so may have application
at other levels of abstraction. Even at the higher levels, we are still
faced with the pressing problem of how to get nonfunctional information
across abstraction levels in ways that assist the designer, rather than
overwhelming her. The use of shadow values is one idea, but I am convinced
that there are many more waiting to be discovered. Programming language
researchers have much to contribute here.
Another big question is "Can we make refinement work?". Early
work in refinement for hardware design resulted in some nice examples, but
had little impact [Jones and Sheeran, 1990, O'Donnell
and Rünger, 2004]. Now, Carl Seger and his colleagues at Intel have
boldly built the IDV system (Integrating Design with Verification) [Spirakis,
2004]. It allows the refinement of a (relatively) high level model all
the way to implementation on the silicon, with every transformation step
being guaranteed to preserve the original specification.
The message is that the current design flow is broken and that we must
replace it by a new flow that puts the designer in charge of guaranteeing
both correctness and nonfunctional properties such as timing and power
consumption, and gives him the necessary analysis tools, and welldesigned
ways of using them. The latest version of IDV is built around the reflective
functional language reFLect [Grundy et al., 2003].
For an academic who has envisaged but never built such a system, IDV
is both exciting and stimulating. It raises many questions. As it starts
with a relatively low level specification, what should be built on top
of it? How can we assist microarchitects in making early decisions based
on estimates of nonfunctional properties? Can the kind of microarchitectural
algebra that was studied in Hawk [Matthews and Launchbury,
1999] be made to work on real designs? Can we make use of Vuillemin's
elegant work linking circuits and numbers [Vuillemin,
1994] in the verification of arithmetic circuits? Can we cope with
Globally Asynchronous Locally Synchronous systems, in order to get a handle
on buses, networks on chip, caches and all the problems that arise above
the hardware level? Can we find good ways to do datapath generation from
Electronic System Level Design languages like SystemC [OSCI,
2005]? What new verification technology is needed if we are to raise
the level of abstraction at which design decisions are made? The current
version of IDV works on fixed size instantiations of circuits. The million
dollar question is "Can we design appropriate generic design
and verification methods?". Hunt's work on design and verification
using the DUAL-EVAL language [Brock and Hunt Jr., 1997]
and his recent work that builds upon it are good starting points. But much
remains to be done. How, for instance, can we make better use of hierarchical
verification? Insights from functional programming, as well as from automated
and interactive theorem proving, will doubtless play a role in developing
practical usable methods and tools. Now, we need to reason about the generators
themselves, rather than about the fixedsize tangible representations of
the generated circuits. My hope is that the fact that the generators are
structured in the same way as the circuits themselves will aid the construction
of the necessary inductive proofs.
Mead and Conway developed simplified design methods aimed at enabling
a new kind of "tall thin person" to get all the way from architecture
to circuit layout [Mead and Conway, 1980]. The aim
was to "deal with complexity through deliberate methodological simplifications"
[Conway, 2003]. It seems to me that it is time to
return to these ideas. And to succeed, I think we will need another kind
of tall thin person, who can span the range from low level hardware design
to high level functional programming!
Acknowledgements
Thanks to Emil Axelsson and Satnam Singh for their stimulating comments
on an earlier draft of this paper. This work is funded, in part, by the
Swedish Research Council, Vetenskapsrådet.
Many thanks to Ricardo Massa and Martin Musicante, who gave me the opportunity
to speak at the Brazilian Symposium on Programming Languages, 2005, and
to write this invited paper.
References
[Accellera, 2004] Accellera (2004). PSL v1.1 Language Reference Manual.
[Axelsson et al., 2005a] Axelsson, E., Björk, M., and Sheeran, M. (2005a). Teaching Hardware Description and Verification. In International Conference on Microelectronic Systems Education, MSE. IEEE.
[Axelsson et al., 2005b] Axelsson, E., Claessen, K., and Sheeran, M. (2005b). Wired: wire-aware circuit design. In Correct Hardware Design and Verification Methods, CHARME, to appear, Lecture Notes in Computer Science. Springer.
[Backus, 1978] Backus, J. (1978). Can programming be liberated from the von Neumann style?: a functional style and its algebra of programs. Communications of the ACM, 21(8):613-641.
[Bhandal et al., 1990] Bhandal, A. S., Considine, V., and Dixon, G. E. (1990). An array processor for video picture motion estimation. In Systolic array processors, pages 369-378. Prentice-Hall, Inc.
[Bjesse et al., 1998] Bjesse, P., Claessen, K., Sheeran, M., and Singh, S. (1998). Lava: Hardware Design in Haskell. In International Conference on Functional Programming, ICFP, pages 174-184. ACM.
[Borrione et al., 1992] Borrione, D., Piloty, R., Hill, D., Lieberherr, K., and Moorby, P. (1992). Three Decades of HDLs, part 2: Conlan through Verilog. Design & Test of Computers, 9(3):54-63.
[Boute, 1984] Boute, R. (1984). Functional languages and their application to the description of digital systems. Journal A, 25(1):27-33.
[Brayton et al., 1996] Brayton, R. K., Hachtel, G. D., Sangiovanni-Vincentelli, A., Somenzi, F., Aziz, A., Cheng, S.-T., Edwards, S., Khatri, S., Kukimoto, Y., Pardo, A., Qadeer, S., Ranjan, R. K., Sarwary, S., Shiple, T. R., Swamy, G., and Villa, T. (1996). VIS: a system for verification and synthesis. In Alur, R. and Henzinger, T. A., editors, Proceedings of the Eighth International Conference on Computer Aided Verification, CAV, volume 1102 of Lecture Notes in Computer Science, pages 428-432. Springer.
[Brent and Kung, 1982] Brent, R. and Kung, H. (1982). A regular layout for parallel adders. IEEE Transactions on Computers, C-31:260-264.
[Brock and Hunt Jr., 1997] Brock, B. and Hunt Jr., W. (1997). The DUAL-EVAL Hardware Description Language and Its Use in the Formal Specification and Verification of the FM9001 Microprocessor. Formal Methods in System Design, 11(1):71-104.
[Chan et al., 1992] Chan, P., Schlag, M., Thomborson, C., and Oklobdzija, V. (1992). Delay Optimization of Carry-Skip Adders and Block Carry-Lookahead Adders using Multidimensional Dynamic Programming. IEEE Transactions on Computers, 41(8):920-930.
[Chu et al., 1992] Chu, Y., Dietmeyer, D., Duley, J., Hill, F., Barbacci, M., Rose, C., Ordy, G., Johnson, B., and Roberts, M. (1992). Three Decades of HDLs, part 1: CDL through TI-HDL. Design & Test of Computers, 9(2):69-81.
[Claessen et al., 2001] Claessen, K., Sheeran, M., and Singh, S. (2001). The Design and Verification of a Sorter Core. In Correct Hardware Design and Verification Methods, CHARME, volume 2144 of Lecture Notes in Computer Science, pages 355-369. Springer.
[Conway, 2003] Conway, L. (2003). Lynn Conway's Retrospective. Part III: Starting Over. http://ai.eecs.umich.edu/people/conway/Retrospective3.html.
[Fitzpatrick, 2003] Fitzpatrick, T. (2003). SystemVerilog Assertions: A Unified Language for More Efficient Verification. Synopsys, Inc.
[Grundy et al., 2003] Grundy, J., Melham, T., and O'Leary, J. (2003). A reflective functional language for hardware design and theorem proving. Research Report PRG-RR-03-16, Programming Research Group, Oxford University Computing Laboratory. http://www.comlab.ox.ac.uk/tom.melham/pub/Grundy2003RFL.pdf.
[Halbwachs et al., 1993] Halbwachs, N., Lagnier, F., and Raymond, P. (1993). Synchronous observers and the verification of reactive systems. In Third Int. Conf. on Algebraic Methodology and Software Technology, AMAST'93, Workshops in Computing, pages 83-96. Springer.
[Han and Carlson, 1987] Han, T. and Carlson, D. (1987). Fast Area-Efficient VLSI Adders. In Int. Symposium on Computer Arithmetic. IEEE.
[Hinze, 2004] Hinze, R. (2004). An algebra of scans. In Mathematics of Program Construction, volume 3125 of Lecture Notes in Computer Science, pages 186-210. Springer.
[Hughes, 1995] Hughes, J. (1995). The Design of a Pretty-printing Library. In Advanced Functional Programming, volume 925 of Lecture Notes in Computer Science, pages 53-96. Springer.
[Hunt Jr., 1985] Hunt Jr., W. (1985). FM8501, A Verified Microprocessor. PhD thesis, University of Texas at Austin.
[Johnson, 1984a] Johnson, S. (1984a). Applicative programming and digital design. In Principles of Programming Languages, pages 218-227. ACM.
[Johnson, 1984b] Johnson, S. (1984b). Synthesis of Digital Designs from Recursive Equations. MIT Press.
[Jones and Sheeran, 1990] Jones, G. and Sheeran, M. (1990). Circuit design in Ruby. In Staunstrup, J., editor, Formal Methods for VLSI Design, pages 13-70. North-Holland.
[Jones and Nielsen, 1994] Jones, N. and Nielsen, F. (1994). Abstract interpretation: A semantics-based tool for program analysis. In Handbook of Logic in Computer Science, pages 527-636. OUP.
[Kiselyov et al., 2004] Kiselyov, O., Swadi, K., and Taha, W. (2004). A methodology for generating verified combinatorial circuits. In Fourth International Conference on Embedded Software (EMSOFT), pages 249-258. ACM Press.
[Knowles, 1999] Knowles, S. (1999). A Family of Adders. In Int. Symp. on Computer Arithmetic, pages 277-284.
[Kogge and Stone, 1973] Kogge, P. and Stone, H. (1973). A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations. IEEE Transactions on Computers, C-22(8):786-793.
[Lahti, 1980] Lahti, D. (1980). Applications of a Functional Programming Language to Hardware Synthesis. Master's thesis, UCLA.
[Lin and Hsiao, 2004] Lin, Y.-C. and Hsiao, J.-W. (2004). A new approach to constructing optimal parallel prefix circuits with small depth. Journal of Parallel and Distributed Computing, 64(1):97-107.
[Lin et al., 2003] Lin, Y.-C., Hsu, Y.-H., and Liu, C.-K. (2003). Constructing H4, a Fast Depth-Size Optimal Parallel Prefix Circuit. The Journal of Supercomputing, 24(3):279-304.
[Luk, 1990] Luk, W. (1990). Analysing parametrised designs by non-standard interpretation. In International Conference on Application-Specific Array Processors, pages 133-144. IEEE Computer Society Press.
[Luk and Jones, 1988] Luk, W. and Jones, G. (1988). From specification to parametrised architectures. In Milne, G., editor, The fusion of hardware design and verification. North-Holland.
[Luk et al., 1990] Luk, W., Jones, G., and Sheeran, M. (1990). Computer-based tools for regular array design. In Systolic array processors, pages 589-598. Prentice-Hall, Inc.
[Matthews and Launchbury, 1999] Matthews, J. and Launchbury, J. (1999). Elementary Microarchitecture Algebra. In International Conference on Computer-Aided Verification, volume 1633 of Lecture Notes in Computer Science, pages 288-300. Springer.
[McMillan, 1999] McMillan, K. (1999). A methodology for hardware verification using compositional model checking. Cadence.
[Mead and Conway, 1980] Mead, C. and Conway, L. (1980). Introduction to VLSI Systems. Addison-Wesley.
[Morrison et al., 1985] Morrison, J., Peeling, N., and Thorp, T. (1985). The Design Rationale of ELLA, a Hardware Design and Description Language. In IFIP 7th International Symposium on Computer Hardware Description Languages and their Applications, pages 303-320. North-Holland.
[Mycroft and Sharp, 2001] Mycroft, A. and Sharp, R. (2001). Hardware synthesis using SAFL and application to processor design, invited talk. In Proceedings of the 11th Advanced Research Working Conference on Correct Hardware Design and Verification Methods, CHARME, volume 2144 of Lecture Notes in Computer Science, pages 13-39. Springer.
[O'Donnell, 1994] O'Donnell, J. (1994). A correctness proof of parallel scan. Parallel Processing Letters, 4(3):329-338.
[O'Donnell and Rünger, 2004] O'Donnell, J. and Rünger, G. (2004). Derivation of a Logarithmic Time Carry Lookahead Addition Circuit. Journal of Functional Programming, 14(6):697-713.
[O'Donnell, 1988] O'Donnell, J. T. (1988). Hydra: Hardware description in a Functional Language using Recursion Equations and High Order Combining Forms. In Milne, G., editor, The fusion of hardware design and verification. North-Holland.
[OSCI, 2005] OSCI (2005). SystemC 2.1 Language Reference Manual. Open SystemC Initiative, www.systemc.org.
[Patel et al., 1985] Patel, D., Schlag, M., and Ercegovac, M. (1985). νFP: An environment for the multi-level specification, analysis, and synthesis of hardware algorithms. In Functional Programming Languages and Computer Architecture, volume 201 of Lecture Notes in Computer Science, pages 238-255. Springer.
[Roesner, 2003] Roesner, W. (2003). What is beyond the RTL Horizon for Microprocessor and System Design?, invited talk. In Correct Hardware Design and Verification Methods, volume 2860 of Lecture Notes in Computer Science, page 1. Springer.
[Sheard and Peyton Jones, 2002] Sheard, T. and Peyton Jones, S. (2002). Template meta-programming for Haskell. In Chakravarty, M. M. T., editor, ACM SIGPLAN Haskell Workshop, pages 1-16. ACM Press.
[Sheeran, 1983] Sheeran, M. (1983). µFP, an algebraic VLSI design language. D.Phil. thesis, Programming Research Group, Oxford University.
[Sheeran, 1984] Sheeran, M. (1984). muFP, A Language for VLSI Design. In LISP and Functional Programming, pages 104-112. ACM.
[Sheeran, 1988] Sheeran, M. (1988). Retiming and Slowdown in Ruby. In Milne, G., editor, The fusion of hardware design and verification. North-Holland.
[Sheeran, 2003] Sheeran, M. (2003). Finding regularity: describing and analysing circuits that are almost regular. In Correct Hardware Design and Verification Methods, CHARME, volume 2860 of Lecture Notes in Computer Science, pages 4-18. Springer.
[Sheeran, 2004] Sheeran, M. (2004). Generating fast multipliers using clever circuits. In Formal Methods in Computer-Aided Design, FMCAD, volume 3312 of Lecture Notes in Computer Science, pages 6-20. Springer.
[Sheeran et al., 2000] Sheeran, M., Singh, S., and Stålmarck, G. (2000). Checking Safety Properties Using Induction and a SAT-Solver. In Formal Methods in Computer-Aided Design, FMCAD, volume 1954 of Lecture Notes in Computer Science, pages 108-125. Springer.
[Singh, 1992] Singh, S. (1992). Circuit Analysis by Non-Standard Interpretation. In Designing Correct Circuits, volume A-5 of IFIP Transactions, pages 119-138. North-Holland.
[Sklansky, 1960] Sklansky, J. (1960). Conditional-sum addition logic. IRE Trans. Electron. Comput., EC-9:226-231.
[Spirakis, 2004] Spirakis, G. (2004). Opportunities and challenges in building silicon products at 65nm and beyond, invited talk. In Design and Test in Europe (DATE), pages 2-3. IEEE.
[Verkest et al., 1988] Verkest, D., Johannes, P., Claesen, L., and De Man, H. (1988). Formal Techniques for Proving Correctness of Parameterized Hardware using Correctness Preserving Transformations. In Milne, G., editor, The fusion of hardware design and verification. North-Holland.
[Vuillemin, 1994] Vuillemin, J. (1994). On circuits and numbers. IEEE Transactions on Computers, 43(8):868-879.
[Zimmermann and Tran, 2003] Zimmermann, R. and Tran, D. (2003). Optimized synthesis of sum-of-products. In Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 867-872. IEEE.
[Zlatanovici and Nikolic, 2003] Zlatanovici, R. and Nikolic, B. (2003). Power-Performance Optimal 64-bit Carry-Lookahead Adders. In 29th European Solid-State Circuits Conference, ESSCIRC, pages 321-324.