Abstract: Numerical validation of computed results in scientific computation is always an essential problem as well on sequential architecture as on parallel architecture. The probabilistic approach is the only one that allows to estimate the round-off error propagation of the floating point arithmetic on computers. We begin by recalling the basics of the CESTAC method (Controle et Estimation Stochastique des Arrondis de Calculs). Then, the use of the CADNA software (Control of Accuracy and Debugging For Numerical Applications) is presented for numerical validation on sequential architecture. On parallel architecture, we present two solutions for the control of round-off errors. The first one is the combination of CADNA and the PVM library. This solution allows to control round-off errors of parallel codes with the same architecture. It does not need more processors than the classical parallel code. The second solution is represented by the RAPP prototype. In this approach, the CESTAC method is directly parallelized. It works both on sequential and parallel programs. The essential difference is that this solution requires more processors than the classical codes. These different approaches are tested on sequential and parallel programs of multiplicat ion of matrices.

Category: B.2

1 Introduction

On computers, using the floating point arithmetic, each elementary operation creates a round-off error because of the limited coding of real numbers. Consequently, every computed result is affected by the round-off error propagation. Then, to validate numerical results on computers we must estimate the number of exact significant digits of every result. Exact significant digits mean the digits in common between the computed result and the mathematical result. The probabilistic approach is the only one which allows to estimate the round-off error of the floating point arithmetic.

On a sequential architecture, the numerical validation of computation can be done by using the CADNA software which is based on the CESTAC method. But, more and more computations are now performed on parallel architectures. A tool for the numerical validation for such kind of computations has become essential.

Page 454

After recalling the basics of the CESTAC method and explaining the use of the CADNA software for sequential architectures, we present in this note two prototypes for the control of round-off error on parallel architectures.

The first one combines CADNA and the message passing library PVM. In this approach each processor validates its own results using CADNA. The passing of results between processors is performed by adding a specific extension to PVM because of the new stochastic types used by CADNA.

The second one is a direct parallelization of the CESTAC method - the RAPP prototype - available both on sequential and parallel architectures.

2 The CESTAC method

In the probabilistic approach, the round-off errors are modeled by independent identically distributed (iid) random variables [Hamming 70][Hull and Swenson 66]. Therefore, a computed result R is also modeled by a random variable. The number of significant digits of R is estimated from notions like the mean value and the standard deviation of R.

Based on the probabilistic approach, the CESTAC method has been developed by J. Vignes and M. La Porte [Vignes 90][Vignes 93][Vignes and La Porte 74] to estimate the round-off error propagation of the floating point arithmetic. At each step of the computation, the chosen rounding of any intermediate result is the upper or the lower rounding with the probability 0.5. Practically this is realized by perturbing the lowest bit of the mantissa after each elementary operation. This technique is called the random arithmetic. Then the code is run N times with this new arithmetic. By this way, N different results Ri are obtained. The computed result is taken as the average of the :

The number of exact significant digits of R is given by the formula:

with the level of confidence of the t distribution for a probability (1 - ).

It has been shown [Chesneaux 90] that a computed result obtained with the random arithmetic may be modeled by

where r is the mathematical result, n is the number of elementary operations, the are coefficients depending only on the data and the algorithm, p is the length of the mantissa (the hidden bit included) and the are iid centered random variables.

Page 455

From the probabilistic point of view, the CESTAC method consists in applying Student,s test on a sample of R (which is the ). Then, an estimation of the mean value of R (in the model, it is the mathematical result) is obtained from a confidence interval.

The theoretical study has proved the validity of the CESTAC method on the model [Chesneaux 90]. Then the practical efficiency of the CESTAC method is based on the physical reality of the hypotheses and the approximations which have been assumed during the theoretical study.

Theses hypotheses are:

i) the signs and the exponents of the intermediate results are independent of the random arithmetic,

ii) the model of round-off errors by iid random variables is correct,

iii) the approximation of the R,s distribution by the first order terms in modeled by Z is correct,

iv) the approximation by the first four terms in the Edgeworth's development of the Z,s distribution during the study of Student's test is correct,

v) the coefficients are regular, which means that none of them is of a greater order than the sum of the others.

In practice, the hypotheses i), ii), iv), v) are of little importance [Chesneaux 95].

The hypotheses iii) is the only one which may really make the CESTAC method fail. If it is not verified, the mathematical expectation of the terms of order greater than is not zero. Then, there could be a bias which is not of a smaller order than the standard deviation of Z. The mean value of R is not yet well modeled by the mathematical expectation of Z which is the mathematical result.

In fact, as the CESTAC method gives an estimation of the mean value of R, it is absolutely necessary that it is closed to the mathematical result according to the standard deviation of R, i.e., that the approximation at the first order is valid. If it is not true, there could be an overestimation of the accuracy of the computed result R.

Only multiplications between two non significant results (called stochastic zeroes) and divisions by non significant results may create preponderant terms of order higher than [Chesneaux 95]. Then, such operatins must be detected at run-time and the user must be advised. It may also be shown that, for the CESTAC method to work efficently, a control of the accuracy of the operands during tests like IF (A > B) THEN must be performed. The answer of the test must take the number of exact significant digits of each operand into account [Chesneaux 95].

All of this may be done very easily by using the synchronous implementation of the CESTAC method which allows to estimate the accuracy of any intermediate result at any time. It consists in performing in a complete parallel manner the N runs of a code using the random arithmetic [Vignes 90][Vignes 93]. This implementation allows to have a sample of N values for any variables at each step of the run and so to estimate the accuracy of any result. Then, it is possible to point out the denominators which are non significant, the unstable multiplications and other numerical unstabilities. It leads to a self-validation of the CESTAC method.

All the efficiency of the CADNA software is based on this self-validation.

Page 456

3 The CADNA software for sequential architectures

CADNA [Chesneaux 92][Vignes 93] means Control of Accuracy and Debugging for Numerical Applications. The first goal of this software is the estimation of the round-off error in a scientific code using the floating point arithmetic. CADNA uses the synchronous implementation of the CESTAC method. It also implements and uses the definitions of the order and equality relations of the stochastic arithmetic [Chesneaux 95]. This enables to control the branching, that is, CADNA points out all the tests for which the computed and the real answer have opposite signs. This is the second goal of the software.

The third goal is the numerical debugging. With CADNA, users may detect numerical unstabilities that appear at run-time. We must emphazise that this kind of debugging does not deal with the logical validation of a code but with the ability of the computers to give correct results when the code is performed using the floating point arithmetic.

CADNA also includes all the control tests pointed out by the theoretical study to have an efficient use the CESTAC method.

CADNA is copyrighted and marketed, it is the property of the Pierre and Marie Curie University. CADNA works on codes written in FORTRAN 77 but requires a FORTRAN 90 compiler to generate executable codes. In practice, CADNA is a library which is used during the link. It implements three new numerical types - the stochastic types - and all the arithmetic operators for these new types. The control of round-off error propagation is only performed on variables of a stochastic type.

These types are:

- type (SINGLE_ST) : stochatic single precision;

- type (DOUBLE_ST) : stochastic double precision;

- type (COMPLEX_ST): stochastic complex single precision.

Declarations of stochastic variables are of the same kind as for the classical numerical types. For instance,

TYPE (DOUBLE_ST) X, Y, Z

The estimation of the number of significant digits is available at any time on any stochastic variable. All the arithmetic operators and order relations have been overloaded for the stochastic types. Intrinsic mathematical functions of the FORTRAN 77 only exist by their generic name. In that way, it is very easy to use CADNA on old FORTRAN 77 codes almost without any modification.

In an arithmetic expression, stochastic types, floating types and integer types may be mixed. The classical rules of prevalence are applied with the prevalence of the stochastic types on the others. In the writing procedures, only the significant digits are printed for the variables of stochastic types. In this way, it is very easy to see the accuracy of the results.

For the numerical debugging, CADNA detects at any time numerical unstabilities that appear at run-time and let a trace in a special file. With the symbolic debugger, the user may point out the operation which is responsible for the unstability. The implementation of CADNA on sequential architecture consists in imbedding the computations in . This simple solution perfectly simulates the synchronous implementation of the CESTAC method. Each operation is performed N times. In that way and at any time, there exists a sample of N values for any

Page 457

stochastic variable for the control of accuracy, the branching and the numerical debugging. The use of CADNA multiplies the run time by a factor 3 or 5.

4 Interfacing of CADNA with PVM

Nowadays, parallel computers are more and more used for scientific softwares usually written in FORTRAN. It seems essential to develop a validation tool for numerical softwares on this kind of machine. As the existing parallel machines are very different in structure and programming techniques, we have intentionnally limited our tool to the parallel machine using the IEEE arithmetic for the floating point number and being programmed in MIMD (Multiple Instruction Multiple Data) mode. Then a program may be considered as a set of sequential processes allocated on the different processors of the computer and interacting together by message passing.

To create the numerical validation tool, we have proceeded in the following way: each sequential process uses the sequential CADNA library to locally validate its computations and exchanges by message passing variables of standard type or of stochastic type (to estimate the round-off error).

For the communications between the processors, the message passing library PVM (Parallel Virtual Machine) was chosen for several reasons. First, the concept of Virtual Machine is available on a large number of parallel computers and allows to create easily a parallel virtual machine from a network of sequential and heterogeneous machines. Secondly PVM has been welcome by the scientific community working on parallel computation. Finally it is a freeware software of easy access.

This section is made up of four paragraphs. First, we recall the principle of the foundation of PVM. Then, the different problems encountered to use FORTRAN 90, CADNA and PVM in a same program and the proposed solution are described. Starting with an example of a parallel program (multiplication of two matrices), we explain the modifications we have to perform on a FORTRAN source file to use CADNA library. Finally, the performance obtained on a Connection Machine 5 are presented and commented.

4.1 Presentation de PVM

P.V.M. (Parallel Virtual Machine) [Geist and al. 93] has been developed by the Heterogeneous Network Project which embodies research from the Oak Ridge National Laboratory and the universities of Tennessee and of Emory. The aim of this project was to do parallel computations on networks of heterogeneous machines which have different architectures and different representations for floating point numbers.

To interconnect machines, which may be of sequential, parallel or vectorial type, PVM is able to use different types of networks (Ethernet, FDDI, Token Ring,...). Despite the global view of the computer and of the network, PVM allows the different processes to take the best advantage of the performance of the target machines.

To develop an application, the programmer may use of the C language, the FORTRAN 77 language and a message passing library of high level (synchronous and asynchronous sending and receiving, synchronisation of processes, broadcast

Page 458

of message passing, and concentration and diffusion of values (reduce)). For each message, the user's interface imposes to describe the data type that may be converted in another format for the target machine. This functionalities allow to exchange data between computers with very different architectures.

To create his own virtual machine, the user lists into a file, which is read in the initial phasis of PVM, the accessible machines on the network. Three modes of allocation are available to place the processes on the processors:

1. the transparent mode: each task is automatically located at the most appropriate site.

2. the architecture-dependent mode: the user may indicate the specific architecture on which particular tasks must be executed.

3. the machine-specific mode: a particular machine may be specified

4.2 Extension of PVM

The extensions brought to on PVM have their origin in the use of FORTRAN 90 and the creation of new types in CADNA. We begin by making a demonstration model of feasibility modifying only the main functions of PVM. The first difficulty we have found is due to the interface between FORTRAN 90 and PVM. For example, the subroutines PVMFPACK(type, data,....) and PVMFUNPACK(type, data,....) respectively allow to gather the data before sending them or to recover them after receiving them. They use 5 parameters including a pointer on the data and a constant that indicates their type. This technique works very well in FORTRAN 77 and in C because there is no type verification for the parameter of the subroutine and the functions. When these subroutines are used in a same program written in FORTRAN 90 to send or to recover data of different types, the compiler generates errors because it detects variables of different types for the second parameter. So it is necessary to overload PVMFPACK and PVMFUNPACK which leads to write as many subroutines (hidden type in the user's interface) as there are potential types. The problem is the same for the subroutine PVMFSEND (creation and sending of a message with one instruction) and PVMFRECV (receipt and recovering of data with one instruction). The subroutine PVMFREDUCE (concentration and diffusion) has not still been adapted for the new type of CADNA because it needs a development too important in the scope of a demonstration model of feasibility.

The second difficulty consists in integrating the new stochastic types SINGLE_ST, DOUBLE_ST, COMPLEX_ST, defined by CADNA in order to preserve the PVM principle of typed data. We must add the stochastic types to the intrisic types (BYTES, INTEGER2, INTEGER4, REAL4, REAL8, COMPLEX) predefined and used by PVMFPACK and PVMFUNPACK.

4.3 Difference of programmming

To explain in detail how the validation software for parallel computations works, we present an example of matrix multiplication intentionally simple and not optimized not to complicate the problem. The program realizes the multiplication on a computer using 4 processors in the following way: a process named master reads 2 matrices a and b. The matrix b is divided into four equal parts. The

Page 459

master processor sends to the four slave processors the matrix a and one quarter of the elements of b. Each one executes the multiplication between the matrix a and its part of b and sends back its result to the process master that builds the matrix solution. The program is written in FORTRAN.

Figure (1) presents in the left column the most interesting part of the source program of the process slave without using CADNA and in the right column the same source with the necessary modifications to use CADNA in bold-face. The printing subroutine of the matrix is not detailed. It is necessary to know that CADNA proposes a function str(var) that takes a stochastic variable in parameter and, associated with a print or a write instruction, only displays the exact significant digits. If the value is no significant, @0 is displayed.

One may note that it is not necessary to do a lot of modifications on the original program to validate the numerical computation.

4.4 Results

To measure the performance of our model, two sequential versions and two parallel versions (each one with and without CADNA) have been written and run. The run times have been measured on a Connection Machine 5 (CM5) manufactured by the society Thinking Machine Corporation on full squared matrices of size 100x100, 200x200, 300x300 and are reported on table (1). For the measure of the run time only 4 processors (or nodes) have been used on a partition of 32. The vector unit associated to each node have been inhibited because it is impossible to change his working mode and therefore to modify their arithmetic. In the sequential versions, the run times represent only the run time of the matrix product subroutine. In the parallel versions, the run times are measured on the master processor. It includes the sending of the matrix a and b to the slave processors, the run time of the slave processors, the receipt of the partial matrix and the rebuilding of solution matrix.

Table 1: Computing time obtained on CM5 using one node for the sequential version and four nodes for the parallel versions.

To estimate the overhead generated by using CADNA, the sequential and parallel times obtained with CADNA are divided by the times obtained in the same conditions but without CADNA (see table (2)). CADNA induces a cost in time near a factor 3 which represents the price of the validation of the numerical results. To conclude the study of the results, the efficiency of the parallelization is calculated. It is defined by the following ratio:

Page 460

Figure 1: Source of the slave program. Left column: standard version using PVM, right column: version using PVM and CADNA.

Page 461

Table 2: Overhead due to the use of CADNA

with:

eff:     efficiency of the parallelism,
NbProc:  number of processors,
TpsPara: run time of the parallel program,
TpsSeq:  run time of the sequential program.

Table 3: efficiency of the parallel program

The efficiency of 100 % is obtained only in the case where all the processors are independent and do not exchange any data. Table (3) presents the obtained result on the matrix multiplication. Two behaviours must be noticed.

First, the efficiency is less for the small size matrix. When the message is short, the time of initialization of a communication between two processes becomes more predominant. It would be better to send large size messages.

Secondly, with an equivalent matrix size, the efficiency is superior when CADNA is used. This increase is simple to explain: the ratio is more important.

5 Parallelization of the CESTAC method

The previous system (CADNA + PVM) validates the result of a parallel program without modifying the architecture used. Here we present a prototype of direct parallelization of the CESTAC method which brings about an increase of the processors number.

Page 462

5.1 Extraction of the parallelism of CESTAC

Let us consider an algebraic procedure PA composed of a finite sequence of arithmetical operations. Practically, the CESTAC method consists in executing N (N = 2 ou 3) copies of the program of the procedure PA with the random arithmetic in order to be able to estimate the accuracy of all intermediate or final computed results. The N copies constitute N independent tasks of computation. At the time of their execution, it is necessary to know the accuracy of the intermediate computed result in order to carry out certain operations (conditional splitting, division). Then, the tasks communicate their results to a control task that returns the mean of the N representative values with the number of exact significant digits.

The implementation of the CESTAC method on a sequential machine multiplies the run time of a program by a factor depending in the average on the number of image programs. These programs being independent, their implementation on a parallel machine should allow to decrease the run time. Figure (2) presents the scheduling of the tasks in the CESTAC method. A prototype of feasibility named RAPP (Random Arithmetic Parallel Prototype) has been developed on a distributed and reconfigurable machine based on transputers with the OCCAM2 language of INMOS [OCCAM 88].

Figure 2: Elementary module of computation for the random arithmetic

5.2 Presentation of RAPP

In the scope of RAPP, we have chosen to use three tasks of computation (i = 1, 2, 3) (image program of PA). The are allocated to the processors 1,2 and 3

Page 463

(see figure (2)) and are run in parallel. The task of accuracy estimation is allocated to a monitor processor which is also used for the communication between the host processor and the rest of the network.

The divisions and the conditionnal splitting synchronize automatically the tasks at the moment of the accuracy estimation. When a division occurs, the values of the denominator are sent to the monitor processor and the trace of a possible unstability is generated into a file and may be consulted after the end of the computation. In the same way, for a conditionnal splitting, a precision control is carried out before the comparison of two variables. The precision estimation is possible for all the other results, but it is not automatically done to preserve the performance of the prototype.

5.3 Communication between the nodes of the RAPP network

The language OCCAM2 gives the possibility to define new protocols of communication between nodes of a network in order to adapt the communication to the type of exchanged data.

The RAPP prototype uses 3 communication protocols to gain in performance and to preserve for the programmer the standard Input/Output, the system call that work according to the standard protocol SP.

The first prototype allows the communications of the tasks (i = 1,2,3) towards the task of precision estimation and limits the communication to a table of real numbers coded on 64 bits. The number of elements of this table is variable.

The second protocol allows the task of precision estimation to send back to the task the response to the precision request. A table of real numbers coded on 64 bits and a table of integer numbers form the response. The two tables are of variable size.

Finally, the protocol of standard communication SP is used by the tasks 1,2 and 3 so that every communication (input/output, system call,...) with the host processor should still be possible for the programmer. It is then necessary to introduce two new processors mux1 and mux2 (see figure (3)) which are only used for the multiplexing of the communication channel. In the case of a parallel program, the addition of these processors concern only the processor directly linked to the host processors.

On figure (3) the processors P1, P2, P3 and monitor realize the computation of RAPP. The processors mux1, mux2 and aux are used to solve the problem due to the little number of communication channels of the transputer.

5.4 Application to a sequential program

Let us consider a computer program P giving a single result r and that is executed with the computer's classical arithmetic. For the program to be run with RAPP, it is sufficient:

- to perturb every arithmetic operation and every assignment. (for example to replace x:= y + z by x:= p(y + z), p being a perturbation function of the tool box of RAPP);

- to use a specific output that takes into account the significant digits of the result r;

Page 464

Figure 3: RAPP network for a sequential program

- to add at the beginning of the program P the statement of declaration of the communication channel with the task of precision estimation.

The RAPP user elaborates only a single program but in the reality three identical programs are executed on the RAPP nodes.

By analogy with the synchronous programmation on sequential architectures of the CESTAC method (imbedding in ), the RAPP prototype defines a structure. The elements of this structure are distribued on different processors.

To test the performance of RAPP on a sequential program, we considered the example of two square matrices multiplication presented in the previous section. Table (4) shows the results obtained on a machine based on transputer T800.

RAPP use 3 processors for the execution of the image programs and a processor for the precision estimation. So it should be normal to obtain a time ratio with and without near of 1. Therefore according to table (4), this ratio is contained between 4.5 and 5. This increase is principally due to the perturbation of the elementary operation in the image program. Table (5) shows the cost of the perturbation in relation to the arithmetic function.

Page 465

Table 4: Run time obtained on transputers, T800.

Table 5: Perturbation cost in relation to the arithmetic operations

Let us recall that this work deals with a study of feasability. With an optimized perturbation (as in CADNA), we may hope that the time ratio whould be in the order of 1.5.

5.5 Application to a parallel program

Let us consider a computer program P composed of k sequential tasks running in parallel on k processors . The application of RAPP to the program P consists in replacing each k processors by a subnetwork of figure (2). Thus, each task is replaced with three images plus a task of precision estimation. The interconnection of all the tasks is done in the following way: In the program P, if a task communicates with a task then, in the program working with RAPP, the image tasks communicate respectively with the tasks (see figure (4)). Here again, we shall use the previous example of matrix multiplication with the same structure of parallel program (see table (6)). The increase of time that RAPP generates in the case of a parallel program is less important than in the case of a sequential program because RAPP increases only the time corresponding to the computation and not the time corresponding to the communication of the classical p arallel program (i.e. without RAPP).

Page 466

Table 6: Run time obtained on transputers T800.

Figure 4: Application of RAPP to the parallel program

(i = 1..4): processors monitor

Page 467

6 Conclusion

This work has shown that it is now possible to study the round-off errors propagation on results provided by sequential or parallel codes written in FORTRAN 77. It is easily done using the CADNA software on sequential architectures and on parallel architectures if the message passing library PVM is used. It was very important to show the feasibility of estimating the accuracy on parallel architectures which play an important role in the scientific computers today. We have seen that the solution CADNA + PVM is very simple and does not modify the number of processors required for running the initial parallel program. But the run time is multiplied in the same way than with the use of CADNA on sequential architecture.

The CESTAC method seems to be well-suited to a direct parallelization. The problem was to find out if the cost of message passing between processors (an absolute necessity) was not too important compared to the running time. The study of the RAPP prototype has shown that it is very satisfactory. Such an approach has a strong future, but there is not a useful FORTRAN tool yet.

We did not mention the numerical validation on vectorial architectures which are very often combined with the parallel architectures. The solution with CADNA or RAPP cut off the vectorisation and the problem is still open.

Acknowledgements:

This work has been supported by the team Arithmetique et precision of the pool PRC-PRS of the CNRS and by the Centre National de Calcul Parallele en Sciences de la Terre. (CNCPST)

References

[Chesneaux 90] J.-M. Chesneaux, J.-M.: "Study of the computing accuracy by using probabilistic approach"; Contribution to Computer Arithmetic and Self-Validating Numerical Methods, ed. C. Ulrich, (J.C. Baltzer) (1990), 19-30.

[Chesneaux 92] Chesneaux, J.-M.: "Descriptif d'utilisation du logiciel CADNA_F"; MASI Report, no 92-32 (1992).

[Chesneaux 95] Chesneaux, J.-M.: "L'arithmetique stochastique et le logiciel CADNA"; Habilitation a diriger des recherches, to appear.

[Hamming 70] R.W. Hamming, R. W.: "On the distribution of numbers"; The Bell System Technical Journal (1970), 1609-1625.

[Hull and Swenson 66] Hull, T.E., Swenson, J. R.: "Test of probabilistic models for propagation of round-off errors"; Communication of A.C.M., vol.9, no 2 (1966), 108-113.

[Vignes 90] Vignes, J.: "Estimation de la precision des resultats de logiciels nu me ri ques"; La Vie des Sciences, Comptes Rendus, serie generale, 7 (1990), 93-145.

[Vignes 93] Vignes, J.: "A stochastic arithmetic for reliable scientific computation"; Math. Comp. Simul., 35 (1993), 233-261.

[Vignes and La Porte 74] Vignes, J., La Porte, M.: "Error analysis in computing"; Information Processing 74, North-Holland (1974).

[Geist and al. 93] Geist, G. A. and al.: "PVM 3 User,s Guide and Reference Manual"; ORNL (Oak Ridge National Laboratory), TM-12187 May (1993)

[OCCAM 88] "OCCAM 2 Reference Manual"; Prentice Hall, International Series in Computer Sciences (1988).

Page 468