
WO2002003194A2 - Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program - Google Patents

Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program

Info

Publication number
WO2002003194A2
WO2002003194A2 (PCT/US2001/018614)
Authority
WO
WIPO (PCT)
Prior art keywords
code
source program
sequence
instruction
invocation
Prior art date
Application number
PCT/US2001/018614
Other languages
French (fr)
Other versions
WO2002003194A3 (en)
Inventor
Knud Kirkegaard
Milind Girkar
Paul Grey
Xinmin Tian
Original Assignee
Intel Corporation
Priority date
Filing date
Publication date
Application filed by Intel Corporation
Priority to AU2001266796A1
Priority to DE10196389T1
Priority to GB2381356B
Publication of WO2002003194A2
Publication of WO2002003194A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456 Parallelism detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation



Abstract

A method and apparatus for compiling a source program are described. Multiple predetermined sequences within the source program are located. A start code is inserted in the source program prior to a first instruction of each predetermined sequence. An invocation code is inserted in the source program prior to the start code, the invocation code addressing the start code and transferring each sequence to a system for execution. Finally, a stop code is inserted in the source program after a last instruction of each sequence, the stop code signaling to the system to stop execution of the sequence.

Description

MULTI-ENTRY THREADING METHOD AND APPARATUS FOR AUTOMATIC AND DIRECTIVE-GUIDED PARALLELIZATION OF A
SOURCE PROGRAM
FIELD OF THE INVENTION
The present invention relates generally to compiler optimization techniques and, more specifically, to a multi-entry threading method and
apparatus for automatic and directive-guided parallelization of a source program.
BACKGROUND OF THE INVENTION The ever-increasing complexity of computational problems that require a solution is reflected in the increase in the speed and performance of computers designed to solve such problems. As CPUs become faster and multiprocessor
system configurations multiply, program performance must be improved and
efficient methods for compiling and executing source programs must be used.
Parallel processing is rapidly becoming mainstream technology influencing
architecture and software design from the home computer market to business applications. Parallel applications are executed by a multiple processor computer system which includes a plurality of processors interconnected so as to exchange
data with each other. Figure 1A is a block diagram of a distributed-memory multiple processor
computer system. As illustrated in Figure 1A, a computer system 100 includes
multiple processing modules 120. Each processing module 120 includes a processor 122 and a memory 124. In the computer system 100, any number of processing modules can be interconnected as shown.
Figure 1B is a block diagram of a shared-memory multiple processor
computer system. As illustrated in Figure 1B, a computer system 150 includes multiple processors 160 connected to a shared memory 170. In one embodiment,
memory 170 includes exclusive areas occupied by each processor 160 and a common area accessed by all processors. In the computer system 150, only a limited number of processors 160 may be interconnected, due to the limitations imposed by the shared memory 170.
Parallel processing methods use automatic tools, such as automatic
parallelizing compilers, which compile the source programs and facilitate parallel
processing of the programs. A compiler looks at the entire source program, collects and reorganizes the instructions, and translates the source program into
object code executable by the computer. One compiler technique involves use of outlining technology, which
transforms selected regions of a program into outlined or separate subroutines.
Each outlined subroutine is then sent to one thread in a processor for parallel
execution. Parallelization using outlining technology is described in detail in Jyh-Herng Chow et al., Automatic Parallelization for Symmetric Shared-Memory
Multiprocessors, Proceedings of CASCON '96, Toronto, Canada, 12-14 November, 1996. However, parallelization of a source program using outlining technology
increases the complexity of the compiler's generation of multi-threaded code. Because the original code is split into separate subroutines, a number of scalar optimizations, which were originally applied to one single subroutine, will have to be invoked for several different subroutines, creating a procedure that generates less efficient code and is time-consuming.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention is illustrated by way of example and not limitation
in the figures of the accompanying drawings, in which like references indicate
similar elements and in which:
Figure 1A is a block diagram of one embodiment for a distributed-memory multiple processor computer system.
Figure 1B is a block diagram of one embodiment for a shared-memory multiple processor computer system.
Figure 2 is a block diagram of one embodiment for a computer system.
Figure 3A is a block diagram of one embodiment for a process of obtaining an executable program in a computer system.
Figure 3B is a block diagram of one embodiment for a process of obtaining a parallel executable program in a computer system.
Figure 4 is a flow diagram of one embodiment for a multi-entry threading method for automatic and directive-guided parallelization of a source program.
DETAILED DESCRIPTION
In the following detailed description of embodiments of the invention, reference is made to the accompanying drawings in which like references indicate
similar elements, and in which are shown by way of illustration specific embodiments in which the invention may be practiced.
Numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one
skilled in the art that the present invention may be practiced without these
specific details. In some instances, well-known structures and devices are shown
in block diagram form, rather than in detail, in order to avoid obscuring the present invention. These embodiments are described in sufficient detail to enable
those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical and
other changes may be made without departing from the scope of the present
invention.
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively
convey the substance of their work to others skilled in the art. An algorithm is
here, and generally, conceived to be a self-consistent sequence of acts leading to a
desired result. The acts are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise
as apparent from the following discussion, it is appreciated that throughout the
description, discussions utilizing terms such as "processing" or "computing" or "calculating" or "determining" or "displaying" or the like, refer to the action and
processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities
within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or
registers or other such information storage, transmission or display devices.
The present invention also relates to an apparatus for performing the
operations herein. This apparatus may be specially constructed for the required
purposes, or it may comprise a general purpose computer, selectively activated or reconfigured by a computer program stored in the computer. Such a computer
program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. The algorithms and displays presented herein are not inherently related to
any particular computer or other apparatus. Various general purpose systems
may be used with programs in accordance with the teachings herein, or it may
prove convenient to construct more specialized apparatus to perform the required
method. For example, any of the methods according to the present invention can be implemented in hard-wired circuitry, by programming a general purpose processor or by any combination of hardware and software. One of skill in the art
will immediately appreciate that the invention can be practiced with computer system configurations other than those described below, including hand-held
devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and
the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. The required structure for a variety
of these systems will appear from the description below.
The methods of the invention are described in terms of computer software.
If written in a programming language conforming to a recognized standard,
sequences of instructions designed to implement the methods can be compiled for execution on a variety of hardware platforms and for interface to a variety of
operating systems. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a
variety of programming languages may be used to implement the teachings of the invention as described herein. Furthermore, it is common in the art to speak of
software, in one form or another (e.g., program, procedure, application...), as taking an action or causing a result. Such expressions are merely a shorthand way of saying that execution of the software by a computer causes the processor of the
computer to perform an action or produce a result.
Figure 2 is a block diagram of one embodiment for a computer system 200.
Computer system 200 includes a system bus 201, or other communications module similar to the system bus, for communicating information, and a processing module, such as processor 202, connected to bus 201 for processing
information. Computer system 200 further includes a main memory 204, such as a random access memory (RAM) or other dynamic storage device, connected to
bus 201, for storing information and instructions to be executed by processor 202.
Main memory 204 may also be used for storing temporary variables or other intermediate information during execution of instructions by processor 202.
Computer system 200 also includes a read only memory (ROM) 206, and/or other similar static storage device, connected to bus 201, for storing static information
and instructions for processor 202. An optional data storage device 207, such as a magnetic disk or optical
disk, and its corresponding drive may also be connected to computer system 200 for storing information and instructions. System bus 201 is connected to an
external bus 210, which connects computer system 200 to other devices. Computer system 200 may also be connected via bus 210 to a display device 221,
such as a cathode ray tube (CRT) or a liquid crystal display (LCD), for displaying information to a computer user. For example, graphical or textual information may be presented to the user on display device 221. Typically, an alphanumeric input device 222, such as a keyboard including alphanumeric and other keys, is
connected to bus 210 for communicating information and/or command selections to processor 202. Another type of user input device is cursor control device 223,
such as a conventional mouse, touch mouse, trackball, or other type of cursor
direction keys, for communicating direction information and command selection
to processor 202 and for controlling cursor movement on display 221. A fully loaded computer system may optionally include video, camera, speakers, sound card, and many other similar conventional options.
A communication device 224 is also connected to bus 210 for accessing
remote computers or servers, via the Internet, for example. The communication
device 224 may include a modem, a network interface card, or other well known
interface devices, such as those used for interfacing with Ethernet, Token-ring, or other types of networks. In any event, in this manner, the computer system 200 may be connected to a number of servers via a conventional network infrastructure.
Figure 3A is a block diagram of one embodiment for a process of obtaining
an executable program in a computer system. According to Figure 3A, source file 310 includes source code written by programmers in high-level languages, for
example FORTRAN or C. The source code instructions must be translated into machine language. The translation process involves several processing steps and
includes compilation of the source code instructions.
In one embodiment, the high-level language source code instructions
within source file 310 are passed through a compiler (not shown). The compiler translates the high-level instructions into object code stored within object files 320.
In one embodiment, considering a parallel computer system, the compiler needs to generate the object code in a form suitable for parallel execution. In one
embodiment, the object code includes multiple modules, each module
incorporating one or more subroutines. Some modules may be stored in runtime
library 340.
Finally, the object code is passed through a linker 330. The linker 330 combines the modules and gives real values to addresses within the modules, thereby producing an executable program 350.
Figure 3B is a block diagram of one embodiment for a process of obtaining
a parallel executable program in a computer system. As shown in Figure 3B, a
serial source program 360 and a serial source program with OpenMP directives 365 are compiled by a compiler (not shown), which creates parallel executable
code 370. The parallel executable code 370 is then linked to a parallel runtime library 380. The runtime library 380 creates multiple thread entries 385, which are then sent to a SMP machine 390 for execution. Figure 4 is a flow diagram of one embodiment for a multi-entry threading method for automatic and directive-guided parallelization of a source program.
In one embodiment, a source program to be compiled and executed by a multi-processor computer system needs to be parallelized in order to fully take advantage of the system's resources. Therefore, depending on the number of
processors, multiple threads must be generated to run the source program in
parallel.
The source program includes multiple loops of code, also known as parallel regions. A parallel region or loop is defined as a code block of the program that is to be executed by the multiple threads in parallel. One example of a source program including multiple parallel regions or loops is as follows:
#include <stdio.h>
#define NSIZE 200

main()
{
    int x, i, j;
    float a[NSIZE], b[NSIZE], c[NSIZE];

    /* parallel loop */
    #pragma omp parallel for schedule(static) private(i) shared(a,b)
    for (i = 0; i < NSIZE; i++) {
        b[i] = (float)(i * 2);
        a[i] = b[i] + 100;
    }

    /* parallel region */
    #pragma omp parallel shared(b,c)
    {
        x = 100;
        /* work-sharing loop */
        #pragma omp for schedule(dynamic) firstprivate(x) private(j)
        for (j = 100; j < 200; j++) {
            b[j] = (float)(j * 2);
            c[j] = b[j] + 100 * x;
        }
    }
}
Each thread receives a portion of the loop and executes the portion in
parallel with other threads. Parallel regions or loops are sequences of the code representing the fundamental parallel constructs that indicate code to be executed
in parallel.
Referring to Figure 4, at processing block 410, the source program or source code is received and read by the compiler. At processing block 420, a first parallel construct within the routine to be executed in parallel is located by the compiler.
At processing block 430, a start code is generated by the compiler. In one embodiment, the start code is a new threaded entry code indicating the beginning
of a parallel construct. At processing block 440, an invocation code is generated by the compiler. In one embodiment, the invocation code is an invocation
instruction, which passes the threaded entry identified by the new threaded entry code to the multi-threaded run-time system for parallel execution on multi-processor systems.
At processing block 450, the new threaded entry code is inserted before the
parallel construct in the source program. In one embodiment, the new entry code is inserted prior to a first instruction of the parallel construct. At processing block 460, the invocation instruction is inserted before the new threaded entry code in the source program.
At processing block 470, a stop code is inserted after the parallel construct
in the source program. In one embodiment, the stop code is a threaded return instruction, which is inserted after a last instruction of the parallel construct. The threaded return instruction signals the run-time system to perform the synchronization and return to the main program.
At processing block 480, a new location instruction is generated by the
compiler. In one embodiment, the location instruction is a label instruction
indicating the next instruction to be executed by the multiprocessor system. At processing block 485, the location instruction is inserted after the threaded return
instruction.
At processing block 490, a jump instruction is generated and is inserted
before the new threaded entry to direct the system to continue execution of the source program at the location instruction. In one embodiment, the jump
instruction is inserted subsequent to the invocation instruction and prior to the
new threaded entry code.
At processing block 495, a decision is made whether the routine contains any new parallel constructs. If the routine contains another parallel construct,
then blocks 420 through 495 are processed again with respect to the new parallel
construct. Otherwise, if the routine does not contain any new parallel constructs,
the procedure stops. In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing
from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an
illustrative rather than a restrictive sense.

Claims

CLAIMS
What is claimed is:
1. A method for compiling a source program comprising:
locating a plurality of predetermined sequences within said source program;
inserting a start code in said source program prior to a first instruction of each sequence of said plurality of predetermined sequences;
inserting an invocation code in said source program prior to said start code, said invocation code addressing said start code and transferring said each sequence to a system for execution; and
inserting a stop code in said source program after a last instruction of said each sequence of said plurality of sequences, said stop code signaling to said system to stop execution of said each sequence.
2. The method according to claim 1, further comprising:
inserting a location instruction after said stop code;
generating a jump instruction to start execution of said source program at said location instruction; and
inserting said jump instruction prior to said start code and subsequent to said invocation code.
3. The method according to claim 1, further comprising: receiving said source program; and reading said source program.
4. The method according to claim 1, wherein each sequence of said plurality of predetermined sequences is a parallel construct.
5. The method according to claim 1, wherein said system is a multi-threaded run-time system capable of executing said each sequence in parallel.
6. The method according to claim 1, wherein each sequence of said plurality of predetermined sequences is an OpenMP parallel construct.
7. The method according to claim 1, wherein inserting said start code further comprises generating said start code for insertion.
8. The method according to claim 1, wherein inserting said invocation
code further comprises generating said invocation code for insertion.
9. The method according to claim 2, wherein inserting said location
instruction further comprises generating said location instruction for insertion.
10. A computer readable medium containing executable instructions, which, when executed in a processing system, cause said system to perform a method for compiling a source program, said method comprising:
locating a plurality of predetermined sequences within said source program;
inserting a start code in said source program prior to a first instruction of each sequence of said plurality of predetermined sequences;
inserting an invocation code in said source program prior to said start code, said invocation code addressing said start code and transferring said each sequence to a system for execution; and
inserting a stop code in said source program after a last instruction of said each sequence, said stop code signaling to said system to stop execution of said each sequence.
11. The computer readable medium according to claim 10, wherein said method further comprises:
inserting a location instruction after said stop code;
generating a jump instruction to start execution of said source program at said location instruction; and
inserting said jump instruction prior to said start code and subsequent to said invocation code.
12. The computer readable medium according to claim 10, wherein said method further comprises: receiving said source program; and
reading said source program.
13. The computer readable medium according to claim 10, wherein each
sequence of said plurality of predetermined sequences is a parallel construct.
14. The computer readable medium according to claim 10, wherein said system is a multi-threaded run-time system capable of executing said each sequence in parallel.
15. The computer readable medium according to claim 10, wherein each sequence of said plurality of predetermined sequences is an OpenMP parallel construct.
16. The computer readable medium according to claim 10, wherein inserting said start code further comprises generating said start code for insertion.
17. The computer readable medium according to claim 10, wherein
inserting said invocation code further comprises generating said invocation code
for insertion.
18. The computer readable medium according to claim 11, wherein
inserting said location instruction further comprises generating said location instruction for insertion.
19. An apparatus comprising:
a memory to store a source program; and
a processor coupled to said memory
to locate a plurality of predetermined sequences within said source program;
to insert a start code in said source program prior to a first instruction of each sequence of said plurality of predetermined sequences;
to insert an invocation code in said source program prior to said start code, said invocation code addressing said start code and transferring said each sequence to a system for execution; and
to insert a stop code in said source program after a last instruction of said each sequence of said plurality of sequences, said stop code signaling to said system to stop execution of said each sequence.
20. The apparatus according to claim 19, wherein said processor further:
inserts a location instruction after said stop code;
generates a jump instruction to start execution of said source program at said location instruction; and
inserts said jump instruction prior to said start code and subsequent to said invocation code.
21. The apparatus according to claim 19, wherein said processor further: receives said source program; and reads said source program.
22. The apparatus according to claim 19, wherein each sequence of said plurality of predetermined sequences is a parallel construct.
23. The apparatus according to claim 19, wherein said system is a multi-threaded run-time system capable of executing said each sequence in parallel.
24. The apparatus according to claim 19, wherein each sequence of said plurality of predetermined sequences is an OpenMP parallel construct.
25. The apparatus according to claim 19, wherein prior to inserting said
start code, said processor further generates said start code for insertion.
26. The apparatus according to claim 19, wherein prior to inserting said
invocation code, said processor further generates said invocation code for
insertion.
27. The apparatus according to claim 20, wherein prior to inserting said location instruction, said processor further generates said location instruction for insertion.
PCT/US2001/018614 2000-06-30 2001-06-08 Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program WO2002003194A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
AU2001266796A AU2001266796A1 (en) 2000-06-30 2001-06-08 Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program
DE10196389T DE10196389T1 (en) 2000-06-30 2001-06-08 Multi-entry threading method and device for automatic and directive-directed parallelization of a source program
GB0301568A GB2381356B (en) 2000-06-30 2001-06-08 Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US60808700A 2000-06-30 2000-06-30
US09/608,087 2000-06-30

Publications (2)

Publication Number Publication Date
WO2002003194A2 true WO2002003194A2 (en) 2002-01-10
WO2002003194A3 WO2002003194A3 (en) 2003-01-23

Family

ID=24434971

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/018614 WO2002003194A2 (en) 2000-06-30 2001-06-08 Multi-entry threading method and apparatus for automatic and directive-guided parallelization of a source program

Country Status (6)

Country Link
CN (1) CN1210650C (en)
AU (1) AU2001266796A1 (en)
DE (1) DE10196389T1 (en)
GB (1) GB2381356B (en)
TW (1) TW525090B (en)
WO (1) WO2002003194A2 (en)


Families Citing this family (2)

Publication number Priority date Publication date Assignee Title
US7478376B2 (en) * 2004-12-02 2009-01-13 International Business Machines Corporation Computer program code size partitioning method for multiple memory multi-processing systems
US7487496B2 (en) * 2004-12-02 2009-02-03 International Business Machines Corporation Computer program functional partitioning method for heterogeneous multi-processing systems

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
GB8610658D0 (en) * 1986-05-01 1986-06-04 British Petroleum Co Plc Flow control
US5278986A (en) * 1991-12-13 1994-01-11 Thinking Machines Corporation System and method for compiling a source code supporting data parallel variables
GB9305263D0 (en) * 1993-03-15 1993-05-05 Univ Westminster Parrallel computation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1569104A3 (en) * 2004-01-09 2006-05-03 Interuniversitair Microelektronica Centrum Vzw An automated method for performing parallelization of sequential code and a computerized system adapted therefore
EP2315118A1 (en) * 2009-10-20 2011-04-27 Bull Hn Information Systems Inc. Method and apparatus for enabling parallel processing during execution of a cobol source program using two-stage compilation
US20140189663A1 (en) * 2009-10-20 2014-07-03 Cynthia S. Guenthner Method and apparatus enabling multi threaded program execution for a cobol program including openmp directives by utilizing a two-stage compilation process
US8869126B2 (en) * 2009-10-20 2014-10-21 Bull Hn Information Systems Inc. Method and apparatus enabling multi threaded program execution for a Cobol program including OpenMP directives by utilizing a two-stage compilation process

Also Published As

Publication number Publication date
GB0301568D0 (en) 2003-02-26
WO2002003194A3 (en) 2003-01-23
GB2381356A (en) 2003-04-30
AU2001266796A1 (en) 2002-01-14
DE10196389T1 (en) 2003-06-18
GB2381356B (en) 2004-09-22
CN1446334A (en) 2003-10-01
TW525090B (en) 2003-03-21
CN1210650C (en) 2005-07-13


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

ENP Entry into the national phase

Ref document number: 0301568

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20010608

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: IN/PCT/2002/01701/MU

Country of ref document: IN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWE Wipo information: entry into national phase

Ref document number: 018121241

Country of ref document: CN

RET De translation (de og part 6b)

Ref document number: 10196389

Country of ref document: DE

Date of ref document: 20030618

Kind code of ref document: P

WWE Wipo information: entry into national phase

Ref document number: 10196389

Country of ref document: DE

122 Ep: pct application non-entry in european phase
REG Reference to national code

Ref country code: DE

Ref legal event code: 8607

NENP Non-entry into the national phase

Ref country code: JP
