J.UCS Special Issue on
Multithreaded Processors and Chip-Multiprocessors
Jörg Keller (FernUniversität Hagen, Germany)
joerg.keller@fernuni-hagen.de
Theo Ungerer (Universität Karlsruhe, Germany)
ungerer@informatik.uni-karlsruhe.de
Abstract: Today's superscalar processors are able to issue up to six
instructions per cycle from a single sequential instruction
stream. VLSI technology will soon allow future microprocessors to
issue and execute eight or more instructions per cycle. However, the
instruction-level parallelism (ILP) found in a conventional
instruction stream is limited. Recent studies show the limits of
processor utilization even in today's superscalar microprocessors,
reporting instructions-per-cycle (IPC) values between 0.14 and
1.9. One solution to increase performance is the additional utilization
of more coarse-grained parallelism, either by integrating two or
more complete processors on a single chip or by using a multithreaded
approach.
A multithreaded processor is able to pursue multiple threads of
control in parallel within the processor pipeline. The functional
units are multiplexed between the thread contexts. Most approaches
store the thread contexts in different register sets on the processor
chip. Latencies are masked by switching to another thread. A
fine-grained multithreaded processor interleaves the execution of
instructions of different threads on a cycle-by-cycle basis,
whereas a block-multithreaded processor executes instructions of a
single thread until a context-switching event, e.g. a cache miss,
occurs. Moreover, a simultaneous multithreaded (SMT) processor issues
instructions of several threads simultaneously. It combines a
wide-issue superscalar processor with multithreading.
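The distinction between fine-grained and block multithreading can be illustrated with a toy software model. The instruction streams, the round-robin policy, and the "MISS" marker below are illustrative assumptions, not a description of any processor discussed in this issue; real hardware multiplexes pipeline stages, not Python lists.

```python
# Toy sketch contrasting two thread-interleaving policies.
# Threads are lists of instruction labels; "MISS" stands for a
# context-switching event such as a cache miss.

def fine_grained(threads, cycles):
    """Issue one instruction from a different thread each cycle (round-robin)."""
    trace = []
    for cycle in range(cycles):
        tid = cycle % len(threads)
        if threads[tid]:
            trace.append((cycle, tid, threads[tid].pop(0)))
    return trace

def block_multithreaded(threads, cycles):
    """Run one thread until a context-switching event ("MISS"), then switch."""
    trace, tid = [], 0
    for cycle in range(cycles):
        if not threads[tid]:                 # current thread finished
            tid = (tid + 1) % len(threads)
        if threads[tid]:
            instr = threads[tid].pop(0)
            trace.append((cycle, tid, instr))
            if instr == "MISS":              # cache miss: switch threads
                tid = (tid + 1) % len(threads)
    return trace

streams = [["a1", "MISS", "a2"], ["b1", "b2", "b3"]]
print(block_multithreaded([list(s) for s in streams], 6))
# [(0, 0, 'a1'), (1, 0, 'MISS'), (2, 1, 'b1'), (3, 1, 'b2'), (4, 1, 'b3'), (5, 0, 'a2')]
```

The fine-grained variant would instead alternate between the two streams every cycle; neither sketch models the miss latency itself.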
The importance of multithreaded execution to both research and the
microprocessor industry is rapidly increasing, as can be seen from the
growing number of research papers and from recent announcements by
the computer industry, in particular IBM's Power4 with two processors
on a die, Compaq's 4-threaded SMT Alpha 21464 processor, and
Sun's MAJC-5200 processor, which features two 4-threaded
processors on a single die.
Chip multiprocessors and multithreaded processors are able to
boost the performance of a multithreaded program mix,
i.e. programmer-visible or compiler-generated instruction
sequences, operating-system threads, or even whole processes. Another,
more recent research trend targets the performance increase of
single-threaded programs by dynamically utilizing speculative
thread-level parallelism. Sequences of instructions are
dynamically extracted from sequential binaries and speculatively
executed by different processing elements or in multiple thread slots
within a single processor. In case of misspeculation, the results of
the speculative thread and of subsequent threads are discarded. Codrescu and Wills
investigate different dynamic partitioning schemes, in particular
thread generation by dynamically parallelizing loop iterations,
procedure calls, or fixed-length instruction blocks. A new, more
flexible algorithm, called the MEM-slicing algorithm, is proposed that
generates a thread starting from a slice instruction up to a maximum
thread length. All approaches are evaluated in the context of the Atlas
chip multiprocessor.
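As a rough illustration of slice-based thread generation, the sketch below cuts a sequential instruction trace into threads at memory instructions, capped at a maximum length. The instruction encoding and the slice predicate are invented for the example; the actual MEM-slicing heuristic is defined in Codrescu and Wills's paper.

```python
# Simplified illustration of partitioning a sequential instruction
# stream into speculative threads: a new thread starts at each memory
# instruction ("slice instruction") or when the maximum length is hit.

def mem_slice(instructions, max_len,
              is_slice_instr=lambda i: i.startswith(("LD", "ST"))):
    threads, current = [], []
    for instr in instructions:
        # Flush the current thread at a slice instruction or length cap.
        if current and (is_slice_instr(instr) or len(current) >= max_len):
            threads.append(current)
            current = []
        current.append(instr)
    if current:
        threads.append(current)
    return threads

trace = ["ADD", "LD r1", "MUL", "SUB", "ST r2", "ADD"]
print(mem_slice(trace, max_len=3))
# [['ADD'], ['LD r1', 'MUL', 'SUB'], ['ST r2', 'ADD']]
```

Each resulting thread would then be dispatched speculatively; misspeculation handling is outside this sketch.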
Gopinath and Narasimhan M.K. investigate, in the context of a
block-multithreaded processor, the performance of switch blocking,
where waiting threads are disabled and signaled at completion of the
wait, versus switch spinning, where waiting threads poll and execute
in round-robin fashion. One root of multithreaded execution is the
coarse-grain dataflow execution model, which relies on
non-blocking threads generated from single-assignment dataflow
programs. Threads start execution as soon as all their operands are
available. Such threads may be generated with the aim of decoupling
memory accesses from execute instructions. Kavi, Arul and
Giorgi present a decoupled scheduled dataflow architecture, where a
(dataflow) program is compiler-partitioned into execution and
memory-access threads and executed on a decoupled dataflow
machine.
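The two waiting policies can be mimicked with ordinary OS threads. This is only a software analogy: in a block-multithreaded processor the waiters are hardware thread contexts, and the costs of the two policies differ from their OS-level counterparts.

```python
# Software analogy for the two waiting policies: a blocked waiter
# sleeps until it is signaled, while a spinning waiter repeatedly polls.
import threading
import time

ready = False
cond = threading.Condition()

def blocking_waiter():
    with cond:
        while not ready:        # switch blocking: sleep until signaled
            cond.wait()

def spinning_waiter():
    while not ready:            # switch spinning: poll in a loop
        time.sleep(0)           # yield to other threads, round-robin style

for waiter in (blocking_waiter, spinning_waiter):
    t = threading.Thread(target=waiter)
    t.start()
    with cond:
        ready = True            # completion of the wait
        cond.notify_all()       # wakes the blocked waiter; no-op for spinner
    t.join()
    ready = False
```

The trade-off mirrors the hardware one: blocking avoids wasted issue slots but pays a wake-up cost, while spinning reacts immediately at the price of busy work.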
Beyls and D'Hollander present a technique to generate, at
compile time, computation threads and data-fetch threads which
ensure that the computation thread does not experience cache
misses. Li and Jenq theoretically investigate the thread scheduling
problem, which deals with the compile-time scheduling of a data
dependency graph on a multithreaded architecture.
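A minimal sketch of this scheduling problem, assuming a greedy list-scheduling policy (not the specific algorithm analyzed by Li and Jenq): nodes of a data dependency graph are assigned to cycles on a fixed number of thread slots, and no node is scheduled before its predecessors.

```python
# Greedy list scheduling of a data dependency graph onto "slots"
# parallel thread slots, one cycle per scheduling step.
from collections import deque

def list_schedule(deps, slots):
    """deps: {node: set of predecessors}; returns {node: (cycle, slot)}."""
    indeg = {n: len(p) for n, p in deps.items()}
    succs = {n: [] for n in deps}
    for n, preds in deps.items():
        for p in preds:
            succs[p].append(n)
    ready = deque(sorted(n for n, d in indeg.items() if d == 0))
    schedule, cycle = {}, 0
    while ready:
        # Issue up to `slots` ready (i.e. dependency-free) nodes this cycle.
        batch = [ready.popleft() for _ in range(min(slots, len(ready)))]
        newly = []
        for slot, node in enumerate(batch):
            schedule[node] = (cycle, slot)
            for s in succs[node]:
                indeg[s] -= 1
                if indeg[s] == 0:   # all predecessors now scheduled
                    newly.append(s)
        ready.extend(sorted(newly))
        cycle += 1
    return schedule

deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}}
print(list_schedule(deps, 2))
# {'a': (0, 0), 'b': (0, 1), 'c': (1, 0), 'd': (2, 0)}
```

With two slots the independent nodes a and b run in the same cycle; with one slot the same graph would need four cycles.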
Evripidou and Kyriacou propose Networks of Workstations as a basis
for multithreaded program execution. The hardware implementation
of a thread synchronization unit to coordinate such workstations
is presented. The overall basis is a decoupled dataflow architecture,
where the thread synchronization unit schedules the
execution threads on the different workstations as processing
elements.
However, multithreading may also be applied to event handling, owing
to its fast context-switching ability. Metzner and
Niehaus propose the use of multithreaded processors for
real-time event handling. Several block-multithreaded MSparc
processors are supervised by an external thread scheduler called
EVENTS, which assigns computation-intensive real-time threads
to the different MSparc processors.
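As a hypothetical illustration of such an external scheduler, the sketch below greedily assigns incoming threads to whichever processor currently has the least pending work. The thread names, cost model, and greedy policy are invented for the example; the actual EVENTS scheduler is described in Metzner and Niehaus's paper.

```python
# Greedy assignment of real-time threads to the least-loaded processor,
# using a heap keyed on each processor's pending work.
import heapq

def assign(threads, num_procs):
    """threads: [(name, cost)]; returns {processor_id: [thread names]}."""
    load = [(0, p) for p in range(num_procs)]   # (pending work, processor)
    heapq.heapify(load)
    placement = {p: [] for p in range(num_procs)}
    for name, cost in threads:
        work, p = heapq.heappop(load)           # least-loaded processor
        placement[p].append(name)
        heapq.heappush(load, (work + cost, p))
    return placement

print(assign([("t1", 5), ("t2", 3), ("t3", 2), ("t4", 4)], 2))
# {0: ['t1', 't4'], 1: ['t2', 't3']}
```

A real-time scheduler would of course consider deadlines rather than raw load; this only shows the supervisory structure of an external scheduler feeding several processors.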
The focus of this special issue thus ranges from execution and
performance models of multithreaded processors and chip
multiprocessors to speculative multithreading, compiler
interaction, and event handling by multithreading. The seven papers in
this issue represent a broad spectrum of activities within the
field. We hope you enjoy the selection.