US20200004585A1 - Techniques for reducing serialization in divergent control flow - Google Patents
Techniques for reducing serialization in divergent control flow
- Publication number
- US20200004585A1 (U.S. application Ser. No. 16/023,897)
- Authority
- US
- United States
- Prior art keywords
- task list
- shader program
- task
- lane
- entry
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3888—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple threads [SIMT] in parallel
- G06F9/38885—Divergence aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/50—Lighting effects
- G06T15/80—Shading
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Graphics (AREA)
- Image Processing (AREA)
Abstract
Description
- On a single-instruction-multiple-data (“SIMD”) or single-instruction-multiple-thread (“SIMT”) processor, individual items of execution (“work-items”) are grouped and executed together as wavefronts to take advantage of the parallel nature of these processors, according to which multiple work-items execute the same instruction in a respective lane of the wavefront with different data in the same clock cycle. Under some situations, it is possible for different work-items to have divergent control flow paths. For instance, if a conditional branch occurs that is conditional on data that differs between work-items, then some work-items may take the branch while others do not. Under such situations, the SIMD or SIMT machine can no longer be efficiently utilized, since it must execute different instructions for different work-items in the wavefront. Since divergent control flow is common in practice, improvements are constantly being made in the area of executing programs with divergent control flow on parallel SIMD and SIMT processors.
- A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
- FIG. 1 is a block diagram of an example device in which one or more disclosed embodiments may be implemented;
- FIG. 2 is a block diagram of the device of FIG. 1, illustrating additional detail, according to an example;
- FIG. 3 presents one technique for executing work-items with divergent control flow, according to an example;
- FIG. 4 presents a chart that includes computer code and lane execution indicators, according to an example;
- FIGS. 5A-5B and 6 illustrate another technique for executing work-items with divergent control flow, according to an example.
- A technique for executing shader programs with divergent control flow on a SIMD or SIMT processor is disclosed. On a single-instruction-multiple-data (“SIMD”) or single-instruction-multiple-thread (“SIMT”) processor, individual items of execution (“work-items”) are grouped and executed together as wavefronts to take advantage of the parallel nature of these processors, according to which multiple work-items execute the same instruction in a respective lane of the wavefront with different data in the same clock cycle. Under some situations, it is possible for different work-items to have divergent control flow paths. For instance, if a conditional branch occurs that is conditional on data that differs between work-items, then some work-items may take the branch while others do not.
- One technique to handle divergent control flow is to serialize each divergent path. Specifically, each possible control flow path is executed, with only the lanes corresponding to the work-items that take any particular control flow path enabled, and the other lanes disabled. This serialization represents a performance penalty, as the duration of the shader program is increased by a factor related to the degree of serialization. In certain situations, such as when each work-item executes a function pointer pointing to a different function, the high degree of serialization results in a large performance penalty.
- A technique to reduce or eliminate this serialization is disclosed herein. This technique includes detecting entry into a divergent section of a shader program and, for the work-items that enter the divergent section, placing a task entry into a task queue associated with the target of each work-item. The target is the destination, in code, of any particular work-item, and is also referred to as a code segment herein. The task queues store task entries for code segments generated by different (or the same) wavefronts. A command processor examines task lists and schedules wavefronts for execution by grouping together tasks in the same task list into wavefronts and launching those wavefronts. By grouping similar tasks from different wavefronts together for execution in the same wavefront, serialization of execution is greatly reduced or eliminated.
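- The disclosure does not give source for these task lists, but as a rough illustration, the following C sketch shows one way such a per-code-segment task list and its entries could be laid out; the names (task_entry, task_list, task_list_push, MAX_TASKS) and the fixed-capacity array are assumptions made for this sketch, not details taken from the patent.

```c
#include <stddef.h>

#define MAX_TASKS 1024  /* illustrative capacity; the disclosure does not specify one */

/* One task entry records a work-item waiting to run a given code segment.
 * Per the description, the entry carries the lane's stack pointer, which
 * uniquely identifies the lane. */
struct task_entry {
    void *stack_pointer;
};

/* One task list per code segment (for example, per target function). */
struct task_list {
    void (*code_segment)(void);           /* target that queued entries will execute */
    struct task_entry entries[MAX_TASKS];
    size_t count;
};

/* Append an entry for a work-item whose target is this list's code segment.
 * Returns 0 on success, -1 if the list is full. */
static int task_list_push(struct task_list *list, void *stack_pointer)
{
    if (list->count >= MAX_TASKS)
        return -1;
    list->entries[list->count++].stack_pointer = stack_pointer;
    return 0;
}
```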
- FIG. 1 is a block diagram of an example device 100 in which one or more aspects of the present disclosure are implemented. The device 100 includes, for example, a computer, a gaming device, a handheld device, a set-top box, a television, a mobile phone, or a tablet computer. The device 100 includes a processor 102, a memory 104, a storage device 106, one or more input devices 108, and one or more output devices 110. The device 100 also optionally includes an input driver 112 and an output driver 114. It is understood that the device 100 may include additional components not shown in FIG. 1.
- The processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 104 is located on the same die as the processor 102, or may be located separately from the processor 102. The memory 104 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
- The storage device 106 includes a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
- The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. The output driver 114 includes an accelerated processing device (APD) 116 which is coupled to a display device 118. The APD is configured to accept compute commands and graphics rendering commands from the processor 102, to process those compute and graphics rendering commands, and to provide pixel output to the display device 118 for display. The techniques described herein could also be performed by an APD 116 that does not have graphics rendering capability.
- FIG. 2 is a block diagram of the device 100, illustrating additional details related to execution of processing tasks on the APD 116, according to an example. The processor 102 maintains, in system memory 104, one or more control logic modules for execution by the processor 102. The control logic modules include an operating system 120, a driver 122, and applications 126, and may optionally include other modules not shown. These control logic modules control various aspects of the operation of the processor 102 and the APD 116. For example, the operating system 120 directly communicates with hardware and provides an interface to the hardware for other software executing on the processor 102. The driver 122 controls operation of the APD 116 by, for example, providing an application programming interface (“API”) to software (e.g., applications 126) executing on the processor 102 to access various functionality of the APD 116. The driver 122 also includes a just-in-time compiler that compiles shader code into shader programs for execution by processing components (such as the SIMD units 138 discussed in further detail below) of the APD 116.
- The APD 116 executes commands and programs for selected functions, such as graphics operations and non-graphics operations, which may be suited for parallel processing. The APD 116 can be used for executing graphics pipeline operations such as pixel operations, geometric computations, and rendering an image to display device 118 based on commands received from the processor 102. The APD 116 also executes compute processing operations that are not directly related to graphics operations, such as operations related to video, physics simulations, computational fluid dynamics, or other tasks, based on commands received from the processor 102 or that are not part of the “normal” information flow of a graphics processing pipeline, or that are completely unrelated to graphics operations (sometimes referred to as “GPGPU” or “general purpose graphics processing unit”).
- The APD 116 includes compute units 132 (which may collectively be referred to herein as “programmable processing units”) that include one or more SIMD units 138 that are configured to perform operations in a parallel manner according to a SIMD paradigm. The SIMD paradigm is one in which multiple processing elements share a single program control flow unit and program counter and thus execute the same program but are able to execute that program with different data. In one example, each SIMD unit 138 includes sixteen lanes, where each lane executes the same instruction at the same time as the other lanes in the SIMD unit 138 but can execute that instruction with different data. Lanes can be switched off with predication if not all lanes need to execute a given instruction. Predication can also be used to execute programs with divergent control flow. More specifically, for programs with conditional branches or other instructions where control flow is based on calculations performed by individual lanes, predication of lanes corresponding to control flow paths not currently being executed, and serial execution of different control flow paths, allows for arbitrary control flow to be followed.
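- As a rough scalar model of the predication behavior just described (purely illustrative; actual SIMD hardware applies the execution mask in the datapath rather than in a software loop, and the sixteen-lane width is taken from the example above):

```c
#include <stdint.h>

#define NUM_LANES 16  /* matches the sixteen-lane SIMD unit 138 in the example above */

/* Scalar model of one predicated SIMD instruction: the operation is performed
 * only for lanes whose bit is set in the execution mask; switched-off lanes
 * keep their previous results. */
static void predicated_add(uint16_t exec_mask,
                           const int src_a[NUM_LANES],
                           const int src_b[NUM_LANES],
                           int dst[NUM_LANES])
{
    for (int lane = 0; lane < NUM_LANES; ++lane) {
        if (exec_mask & (1u << lane))
            dst[lane] = src_a[lane] + src_b[lane];
    }
}
```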
- The basic unit of execution in compute units 132 is a work-item. Each work-item represents a single instantiation of a shader program that is to be executed in parallel in a particular lane of a wavefront. Work-items can be executed simultaneously as a “wavefront” on a single SIMD unit 138. Multiple wavefronts may be included in a “work group,” which includes a collection of work-items designated to execute the same program. A work group can be executed by executing each of the wavefronts that make up the work group. The wavefronts may be executed sequentially on a single SIMD unit 138 or partially or fully in parallel on different SIMD units 138. Wavefronts can be thought of as instances of parallel execution of a shader program, where each wavefront includes multiple work-items that execute simultaneously on a single SIMD unit 138 in line with the SIMD paradigm (e.g., one instruction control unit executing the same stream of instructions with multiple data). A command processor 137 is present in the compute units 132 and launches wavefronts based on work (e.g., execution tasks) that is waiting to be completed. A scheduler 136 is configured to perform operations related to scheduling various wavefronts on different compute units 132 and SIMD units 138.
- The parallelism afforded by the compute units 132 is suitable for graphics related operations such as pixel value calculations, vertex transformations, tessellation, geometry shading operations, and other graphics operations. A graphics processing pipeline 134 which accepts graphics processing commands from the processor 102 thus provides computation tasks to the compute units 132 for execution in parallel.
- The compute units 132 are also used to perform computation tasks not related to graphics or not performed as part of the “normal” operation of a graphics processing pipeline 134 (e.g., custom operations performed to supplement processing performed for operation of the graphics processing pipeline 134). An application 126 or other software executing on the processor 102 transmits programs (often referred to as “compute shader programs,” which may be compiled by the driver 122) that define such computation tasks to the APD 116 for execution. Although the APD 116 is illustrated with a graphics processing pipeline 134, the teachings of the present disclosure are also applicable for an APD 116 without a graphics processing pipeline 134.
- Execution of a shader program may lead to control flow that is divergent. More specifically, the SIMD units 138 of the APD 116 execute a shader program in a SIMD manner, in which instructions of the shader program are executed for multiple work-items simultaneously. A single control flow unit fetches and executes instructions, and the instructions are executed for multiple different items of data associated with the different work-items. Divergent control flow occurs when an instruction that modifies control flow is executed for multiple work-items simultaneously, but the target for the instruction is different for at least two of those work-items. The two work-items would then be redirected to two different locations in the shader, which would mean that the work-items could not be executed simultaneously in a SIMD manner. In one example, a conditional branch is executed that branches if a particular variable is less than 0. Multiple work-items execute this branch simultaneously, according to the SIMD paradigm, but because the conditional branch is based on the value of the variable, which can be different for the different work-items, some of the work-items take the branch and other work-items do not take the branch.
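- Written as per-work-item C in the style of FIG. 3, the conditional-branch example in the preceding paragraph might look like the following hypothetical fragment; the variable name and the work done on each path are invented for illustration:

```c
/* Per-work-item view of the divergent branch described above. Each work-item
 * has its own 'value', so some lanes take the branch (value < 0) while others
 * do not, and the wavefront diverges. */
int shade(int value)
{
    int result;
    if (value < 0)
        result = -value;     /* path executed only by the lanes that take the branch */
    else
        result = value * 2;  /* path executed by the remaining lanes */
    return result;
}
```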
- FIG. 3 presents one technique for executing work-items with divergent control flow, according to an example. Herein, this technique may be referred to as “waterfalling.” A chart 300 illustrates program instructions 302 and lane execution indicators 304 that illustrate the waterfalling technique. The software instructions are illustrated in the C language but it should be understood that a processor would execute machine instructions.
- The software instructions include an if statement that evaluates whether the lane ID is less than 3. If the lane ID is less than 3, then statements A; and B; are performed, and if the lane ID is not less than 3, then statements C and D are performed. As the execution indicators 304 show, all lanes execute the if statement together. Then only the lanes whose lane ID is less than 3 execute statements A; and B;, with the remaining lanes switched off, after which only the lanes whose lane ID is not less than 3 execute statements C and D, with the other lanes switched off.
- The waterfalling technique may be suitable for a simple case such as a single conditional statement. However, other situations have a greater degree of control flow divergence that may highly serialize work-item execution, which does not take advantage of the inherent parallelism of SIMD execution. FIG. 4 illustrates such an example.
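- A scalar sketch of how waterfalling might serialize the FIG. 3 example follows; the complementary-mask handling mirrors the lane execution indicators described above, while the statement bodies standing in for A, B, C, and D are invented for illustration:

```c
#include <stdint.h>

#define NUM_LANES 8  /* illustrative wavefront width */

/* Waterfalling "if (lane_id < 3) { A; B; } else { C; D; }": both paths are
 * executed one after the other by the whole wavefront, with complementary
 * execution masks selecting which lanes are live on each pass. */
static void waterfall_if(int data[NUM_LANES])
{
    uint32_t taken_mask = 0;
    for (int lane = 0; lane < NUM_LANES; ++lane)
        if (lane < 3)
            taken_mask |= 1u << lane;

    /* Pass 1: statements A; and B; with only the "taken" lanes enabled. */
    for (int lane = 0; lane < NUM_LANES; ++lane) {
        if (taken_mask & (1u << lane)) {
            data[lane] += 1;  /* stands in for statement A; */
            data[lane] *= 2;  /* stands in for statement B; */
        }
    }

    /* Pass 2: statements C and D with the complementary lanes enabled. */
    for (int lane = 0; lane < NUM_LANES; ++lane) {
        if (!(taken_mask & (1u << lane))) {
            data[lane] -= 1;  /* stands in for statement C */
            data[lane] *= 3;  /* stands in for statement D */
        }
    }
}
```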
- FIG. 4 presents a chart 400 that includes computer code 402 and lane execution indicators 404, according to an example. According to the computer code 402, a function pointer fptr is set to the value stored in the fptrs[ ] function pointer array, indexed by the lane ID. Thus the value fptr is set to a different value for each lane, since each of slots 1-4 of the fptrs[ ] array points to a different function (function1 through function4). When the fptr( ) function is executed, each lane executes its particular function, and thus a different function is executed for each lane. Because the functions are different for each of the lanes, the SIMD execution unit serializes execution of each of the functions. Thus, function 1 is executed for lane 1 with lanes 2-4 switched off, then function 2 is executed for lane 2 with lanes 1 and 3-4 switched off, then function 3 is executed for lane 3 with lanes 1-2 and 4 switched off, and then function 4 is executed for lane 4 with lanes 1-3 switched off. As can be seen, in highly divergent situations, such as when each lane calls a different function, the benefit of the parallelism inherent in SIMD execution is greatly reduced.
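- The per-lane source that FIG. 4 describes is roughly the following; the fptrs[ ] array and fptr variable come from the description, while the type definitions and the get_lane_id( ) intrinsic are assumptions made so the fragment is self-contained:

```c
/* Per-work-item view of the FIG. 4 example: each lane indexes the function
 * pointer array with its own lane ID, so every lane calls a different function
 * and the SIMD unit serializes the calls, one function per pass. */
typedef void (*shader_fn)(void);

extern shader_fn fptrs[];      /* fptrs[1]..fptrs[4] point to function1..function4 */
extern int get_lane_id(void);  /* assumed intrinsic returning this lane's ID */

void divergent_call(void)
{
    shader_fn fptr = fptrs[get_lane_id()];
    fptr();  /* a different target per lane, hence one serialized pass per target */
}
```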
- FIGS. 5A-5B and 6 illustrate a different technique for handling divergent control flow. In FIG. 5A, program code 502 is illustrated for causing a function to be executed based on a function pointer. This code is substantially similar to the code of FIG. 4, and includes assignment of a function pointer found in an array (fptrs[ ]) indexed by lane ID to a function pointer variable (fptr) and then execution of the function pointed to by the function pointer variable.
- In response to execution of the function, the SIMD unit 138 stores a task queue entry 506 into a task list corresponding to each of the lanes executing the function and exits the shader, the task lists being in a task queue data structure 504. The shader is exited because the subsequent control flow will be handled by the command processor 137 examining task lists and scheduling wavefronts based on those task lists. The purpose of the task list is to allow the command processor 137 to aggregate tasks that can be launched together as a wavefront at a later time, thereby taking advantage of the parallelism of the SIMD architecture. Each task list stores task list entries 506 for execution of a particular task—a portion of code. For example, the function 1 task list stores task list entries 506 for execution of function 1, the function 2 task list stores task list entries 506 for execution of function 2, and so on. When other wavefronts perform similar operations (including entering into a section of divergent control flow and storing task list entries into task lists), the task lists fill up, as illustrated in FIG. 5B. Once a sufficient number of task entries are stored in any particular task list, the command processor 137 launches a wavefront to execute the segment of code associated with that task list. Thus, instead of serializing different function calls over multiple work-items, the divergent paths of the work-items are grouped together and launched as wavefronts.
- Once a particular code segment ends, such as through a call to another function in a divergent manner (e.g., with a function pointer, where the function pointers for multiple different work-items point to different functions), or through a return instruction, the SIMD unit 138 again stores a task list entry in an appropriate task list for later execution and exits the shader. In an example, a wavefront executes with a function pointer call. All of the lanes store a task list entry into an appropriate task list and the wavefront ends execution. At a later time, the command processor 137 causes a wavefront for one of the task lists to execute. That wavefront executes the function and then executes a return instruction. The return instruction causes each of the work-items to store a task list entry into a task list corresponding to the segment of code at the return address of the work-item (note, the return addresses can differ among work-items because the instruction that calls the function could have been executed from different locations in code). At a later time, the command processor 137 examines the task lists associated with the code segment at the return addresses and schedules wavefronts based on those task lists.
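- The lane-side half of this flow might look like the following sketch, reusing the hypothetical task_list and task_list_push from the earlier sketch; the helper name and the idea of passing the target's list directly are simplifications for illustration:

```c
/* Executed by one lane when it reaches a divergent function-pointer call (or a
 * return from one): instead of calling the target directly, the lane records a
 * task entry in the task list for its target and then exits the shader. The
 * command processor 137 later launches a wavefront for that list once enough
 * entries have accumulated. */
static void lane_enter_divergent_segment(struct task_list *target_list,
                                         void *lane_stack_pointer)
{
    task_list_push(target_list, lane_stack_pointer);
    /* Shader exits here; this lane runs nothing further until the command
     * processor schedules the target's task list as a new wavefront. */
}
```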
- It is possible for the APD 116 to handle divergent control flow according to both the waterfall technique of FIG. 3 and the technique illustrated in FIGS. 5A-5B (which may be referred to herein as the “adaptive scheduling technique”). More specifically, any particular SIMD unit 138 executing wavefronts would select one or the other technique when control flow is to diverge. The choice of which technique to use can be made in any technically feasible manner. In one example, shader programs indicate when to use the adaptive scheduling technique explicitly. More specifically, in response to a shader program including an instruction type that explicitly invokes the adaptive scheduling technique, the SIMD unit 138 causes the wavefront to store the task list entries in one or more task lists and to exit. In other examples, SIMD units 138 may determine whether to use the waterfall technique or the adaptive scheduling technique based on runtime characteristics (such as the degree of divergence—the number of work-items of a wavefront that flow down different control flow paths), or in any other technically feasible manner.
- In some examples, the task list entries store stack pointers for the lane. The stack pointer uniquely identifies the lane for which the task list entry is created. More specifically, each lane has a stack at a location in memory. The stack holds data for the current code segment for the lane, such as local variables, function parameters, and return values. The stack pointer thus uniquely identifies a particular lane. In operation, around the time a lane executes a function call to a function pointer, the lane pushes its register values onto the stack, then pushes function arguments onto the stack. The lane stores the task list entry into an appropriate task list for the function to be called and then exits the shader. When the task corresponding to the task list entry is scheduled for execution, that task executes with the stack pointer of the lane that caused the corresponding task list entry to be placed in the task list. More specifically, that task pops the function arguments off of the stack corresponding to the stack pointer, performs its operations, and pushes a return value onto the stack if necessary.
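- A very rough model of that per-lane stack handoff is sketched below; the stack layout, the single 64-bit argument, and the helper names are all assumptions made for illustration rather than the disclosed calling convention:

```c
#include <stdint.h>

/* Each lane owns a region of memory addressed by its stack pointer; the stack
 * grows downward in this model. */
struct lane_stack {
    uint64_t *sp;  /* this lane's current stack pointer */
};

static void push(struct lane_stack *s, uint64_t v) { *--s->sp = v; }
static uint64_t pop(struct lane_stack *s)          { return *s->sp++; }

/* Caller side: before storing the task list entry, the lane saves its register
 * state and the function argument on its own stack. */
static void prepare_divergent_call(struct lane_stack *s,
                                   uint64_t saved_registers, uint64_t arg)
{
    push(s, saved_registers);  /* register values that must survive the call */
    push(s, arg);              /* argument for the target code segment */
    /* ...the lane then stores a task list entry holding s->sp and exits. */
}

/* Callee side: the task later launched for this entry runs with the same stack
 * pointer, pops its argument, does its work, and pushes the return value. */
static void run_queued_task(struct lane_stack *s)
{
    uint64_t arg = pop(s);
    uint64_t result = arg + 1;  /* stands in for the called function's body */
    push(s, result);            /* return value for the resuming code segment */
}
```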
- The task list entry generated for a return instruction also includes the stack pointer and results in a task being executed in a similar manner. The task that is executed for a task list entry generated as the result of a return instruction pops any return value off the stack and resumes execution.
- Each task list is associated with a particular segment of a shader program. This association allows the command processor 137 to group tasks together for execution for the same segment of shader program. Any particular such segment may begin at one of the beginning of a shader program, the beginning of a function, or the return address for a function call. Any particular segment may end at one of the end of a shader program, the end of a function, or the beginning of a function pointer call.
- FIG. 6 is a flow diagram of a method 600 for executing divergent shader programs, according to an example. Although described with respect to the system shown and described with respect to FIGS. 1-5B, it should be understood that any system configured to perform the method, in any technically feasible order, falls within the scope of the present disclosure. The method 600 illustrates some steps performed by a command processor 137, which generally include looping through a task list and scheduling wavefronts for execution, as well as some steps performed by one or more SIMD units 138, which generally include execution of the wavefronts, placement of task list entries into task lists, and exiting the shader programs.
- The method 600 begins at step 602, where the command processor 137 examines a particular task list to determine whether there are tasks available to be scheduled. Each task represents a work-item that is waiting to be scheduled for a particular segment of a shader program. The command processor 137 may use any technically feasible technique to determine how many tasks to wait for in any particular task list before scheduling such tasks for execution. In one example, the command processor 137 schedules a wavefront for execution of tasks from a task list when the number of tasks in the task list is at least equal to the width of the wavefront, where the term “width” refers to the maximum number of work-items that can be in a wavefront. In this example, the command processor 137 schedules this number of tasks, and not fewer, to maximize the parallelism with which the tasks are executed. In another example, the command processor 137 schedules wavefronts with fewer tasks than the width of the wavefront. The command processor 137 may switch between scheduling wavefronts with this width number of tasks and scheduling wavefronts with less than the width number of tasks as the workload of the SIMD unit 138 increases or decreases. In an example, if the SIMD unit 138 would otherwise be idle, and a task list has fewer than the width number of tasks, the command processor 137 schedules a wavefront with that lower number of tasks to advance execution. Although some example ways of determining when and whether to schedule a wavefront with tasks in a task list for execution are described, any technically feasible technique may be used.
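- A simplified sketch of that scheduling decision follows, reusing the hypothetical task_list from the earlier sketch; launch_wavefront( ) and simd_unit_idle( ) are placeholders for hardware mechanisms, and the width-or-idle policy mirrors the examples above rather than prescribing the actual heuristic:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WAVEFRONT_WIDTH 16  /* illustrative maximum number of work-items per wavefront */

extern void launch_wavefront(void (*segment)(void),
                             const struct task_entry *entries, size_t n);
extern bool simd_unit_idle(void);  /* assumed query of the SIMD unit 138 workload */

/* One pass of steps 602-604 for a single task list: normally wait for a full
 * wavefront's worth of tasks; launch a partial wavefront only if the SIMD unit
 * would otherwise sit idle. */
static void schedule_from_task_list(struct task_list *list)
{
    size_t n = 0;
    if (list->count >= WAVEFRONT_WIDTH)
        n = WAVEFRONT_WIDTH;                   /* full wavefront available */
    else if (list->count > 0 && simd_unit_idle())
        n = list->count;                       /* partial wavefront to avoid idling */

    if (n > 0) {
        launch_wavefront(list->code_segment, list->entries, n);
        memmove(list->entries, list->entries + n,
                (list->count - n) * sizeof(struct task_entry));
        list->count -= n;
    }
}
```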
- If the command processor 137 does not determine that there are tasks available for scheduling as a wavefront at step 602, then the method 600 loops back around and performs step 602 again. If the command processor 137 does determine that there are available tasks, then the method 600 proceeds to step 604. At step 604, the command processor 137 schedules the tasks for execution as a wavefront. For the command processor 137, after scheduling the wavefront for execution, the method 600 loops back to step 602. On the SIMD unit 138 side, at step 606, the SIMD unit 138 executes the wavefront. At step 608, the SIMD unit 138 detects entry of the wavefront into a divergent segment. For any particular wavefront, steps 608-612 are optional because not all wavefronts will enter into a divergent segment. Also, steps 606-612 are not performed only for wavefronts that are scheduled for execution per steps 602-604. In other words, wavefronts generated not based on task entries in task queues may include divergent segments, which would cause execution of steps 606-612 without execution of steps 602-604.
- Detecting entry into the divergent segment may occur in any technically feasible manner. Two possibilities include detection of a call to a function using a function pointer and returning from a function that was entered using a function pointer.
- Detecting entry into the divergent segment may occur in any technically feasible manner. Two possibilities include detecting a call to a function through a function pointer and detecting a return from a function that was entered through a function pointer. At step 610, due to entering the divergent segment, each lane stores a task list entry into the task list data structure associated with the target (the code segment) of that lane. For instance, if one lane calls a function pointer that points to function1( ), then that lane stores a task list entry into the task list associated with function1( ). If another lane calls a function pointer that points to function2( ), then that lane stores a task list entry into the task list associated with function2( ), and so on. In some examples, the task list entry includes the stack pointer associated with the lane. At step 612, the SIMD unit 138 exits the shader program for the wavefront that is entering the divergent segment. After this, at some point, the command processor 137 schedules the tasks for execution in their own wavefronts. - It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
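- The per-lane behavior at steps 610 and 612 can be illustrated with the following sketch, which again uses the hypothetical types introduced above; the LaneState structure and the on_divergent_call function are illustrative stand-ins for state the SIMD unit 138 would track in hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Builds on the hypothetical TaskLists/TaskListEntry/TargetId types above.
struct LaneState {
    std::uint32_t lane_id;        // position of the lane within the wavefront
    std::uint64_t stack_pointer;  // lane-private stack pointer at the divergent call
    TargetId      call_target;    // the function-pointer value this lane is calling
    bool          active;         // whether the lane participates in the call
};

// Step 610: each active lane appends an entry (including its stack pointer) to
// the task list keyed by its own target, so lanes calling function1() and lanes
// calling function2() accumulate in different task lists rather than serializing
// within one wavefront.
void on_divergent_call(TaskLists& lists, const LaneState* lanes, std::size_t width) {
    for (std::size_t i = 0; i < width; ++i) {
        if (!lanes[i].active) continue;
        lists[lanes[i].call_target].push_back(
            TaskListEntry{lanes[i].lane_id, lanes[i].stack_pointer});
    }
    // The wavefront exits the shader program here (step 612); the command
    // processor 137 later schedules each task list as its own wavefront.
}
```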
- The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions being capable of being stored on a computer-readable medium). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
- The methods or flow charts provided herein may be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/023,897 US12014208B2 (en) | 2018-06-29 | 2018-06-29 | Techniques for reducing serialization in divergent control flow |
Publications (2)
Publication Number | Publication Date |
---|---|
US20200004585A1 (en) | 2020-01-02 |
US12014208B2 (en) | 2024-06-18 |
Family
ID=69007596
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/023,897 Active 2040-05-30 US12014208B2 (en) | 2018-06-29 | 2018-06-29 | Techniques for reducing serialization in divergent control flow |
Country Status (1)
Country | Link |
---|---|
US (1) | US12014208B2 (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5894576A (en) * | 1996-11-12 | 1999-04-13 | Intel Corporation | Method and apparatus for instruction scheduling to reduce negative effects of compensation code |
US7284246B2 (en) * | 2002-04-23 | 2007-10-16 | Canon Kabushiki Kaisha | Extensible device driver |
US7907610B2 (en) * | 2005-05-23 | 2011-03-15 | Nxp B.V. | Integrated circuit with internal communication network |
US8312254B2 (en) * | 2008-03-24 | 2012-11-13 | Nvidia Corporation | Indirect function call instructions in a synchronous parallel thread processor |
TW201336740A (en) * | 2011-11-28 | 2013-09-16 | Sicpa Holding Sa | Method and system for controlling items on a production/distribution line |
US9122522B2 (en) * | 2011-12-14 | 2015-09-01 | Advanced Micro Devices, Inc. | Software mechanisms for managing task scheduling on an accelerated processing device (APD) |
- 2018-06-29: US US16/023,897, patent US12014208B2 (en), status Active
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050108720A1 (en) * | 2003-11-14 | 2005-05-19 | Stmicroelectronics, Inc. | System and method for efficiently executing single program multiple data (SPMD) programs |
US7477255B1 (en) * | 2004-04-12 | 2009-01-13 | Nvidia Corporation | System and method for synchronizing divergent samples in a programmable graphics processing unit |
US20060230252A1 (en) * | 2005-03-31 | 2006-10-12 | Chris Dombrowski | System and method of improving task switching and page translation performance utilizing a multilevel translation lookaside buffer |
US20070038726A1 (en) * | 2005-08-15 | 2007-02-15 | Microsoft Corporation | Quick deploy of content |
US8787154B1 (en) * | 2011-12-29 | 2014-07-22 | Juniper Networks, Inc. | Multi-topology resource scheduling within a computer network |
US20140149710A1 (en) * | 2012-11-29 | 2014-05-29 | Advanced Micro Devices, Inc. | Creating simd efficient code by transferring register state through common memory |
US20140181467A1 (en) * | 2012-12-21 | 2014-06-26 | Advanced Micro Devices, Inc. | High level software execution mask override |
US20150124608A1 (en) * | 2013-11-05 | 2015-05-07 | International Business Machines Corporation | Adaptive Scheduling of Data Flows in Data Center Networks for Efficient Resource Utilization |
US20160019066A1 (en) * | 2014-07-18 | 2016-01-21 | Nvidia Corporation | Execution of divergent threads using a convergence barrier |
US20160239302A1 (en) * | 2015-02-13 | 2016-08-18 | Advanced Micro Devices, Inc. | Dynamic wavefront creation for processing units using a hybrid compactor |
US20160267621A1 (en) * | 2015-03-09 | 2016-09-15 | Mediatek Inc. | Graphic processing system and method thereof |
US10242419B2 (en) * | 2015-09-02 | 2019-03-26 | Intel Corporation | Compiler optimization to reduce the control flow divergence |
US20190266019A1 (en) * | 2016-10-06 | 2019-08-29 | International Business Machines Corporation | Task Scheduling Using Improved Weighted Round Robin Techniques |
Non-Patent Citations (1)
Title |
---|
Ariovaldo Denis Granja; Improving workflow and resource usage in construction schedules through location-based management system (LBMS); 15 Jan 2018 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200151936A1 (en) * | 2018-11-13 | 2020-05-14 | Intel Corporation | Techniques to manage execution of shaders |
US11107263B2 (en) * | 2018-11-13 | 2021-08-31 | Intel Corporation | Techniques to manage execution of divergent shaders |
US11776195B2 (en) | 2018-11-13 | 2023-10-03 | Intel Corporation | Techniques to manage execution of divergent shaders |
Also Published As
Publication number | Publication date |
---|---|
US12014208B2 (en) | 2024-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102510524B1 (en) | | Combined world-space pipeline shader stage |
KR101660659B1 (en) | | Executing subroutines in a multi-threaded processing system |
US11880715B2 (en) | | Method and system for opportunistic load balancing in neural networks using metadata |
US10649810B2 (en) | | Data driven scheduler on multiple computing cores |
US11900156B2 (en) | | Inter-thread communication in multi-threaded reconfigurable coarse-grain arrays |
US11175922B1 (en) | | Coarse-grain reconfigurable array processor with concurrent handling of multiple graphs on a single grid |
US11243752B2 (en) | | Multi-version shaders |
CN112395010A (en) | | Method and apparatus for out-of-order pipeline execution implementing static mapping of workloads |
CN110308982A (en) | | A shared memory multiplexing method and device |
EP4315046A1 (en) | | Multi-accelerator compute dispatch |
US11996166B2 (en) | | Adaptable allocation of SRAM based on power |
US12014208B2 (en) | | Techniques for reducing serialization in divergent control flow |
US10353591B2 (en) | | Fused shader programs |
US11481953B2 (en) | | Command processor based multi dispatch scheduler |
US20240303113A1 (en) | | Compiler-directed graph-based command dispatch for accelerators |
US20190318229A1 (en) | | Method and system for hardware mapping inference pipelines |
US11113061B2 (en) | | Register saving for function calling |
WO2023129302A1 (en) | | Emulating performance of prior generation platforms |
US10877926B2 (en) | | Method and system for partial wavefront merger |
US11288095B2 (en) | | Enhanced atomics for workgroup synchronization |
US20220206851A1 (en) | | Regenerative work-groups |
KR20230124598A (en) | | Compressed Command Packets for High Throughput and Low Overhead Kernel Initiation |
US20240403056A1 (en) | | Shader launch scheduling optimization |
US20240330045A1 (en) | | Input locality-adaptive kernel co-scheduling |
CN108255587B (en) | | Synchronous multi-thread processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| FEPP | Fee payment procedure | Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| AS | Assignment | Owner name: ADVANCED MICRO DEVICES, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:SALEH, SKYLER JONATHON;KAZAKOV, MAXIM V.;SIGNING DATES FROM 20180725 TO 20180731;REEL/FRAME:046533/0389 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| STPP | Information on status: patent application and granting procedure in general | Free format text: WITHDRAW FROM ISSUE AWAITING ACTION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| STPP | Information on status: patent application and granting procedure in general | Free format text: AWAITING TC RESP, ISSUE FEE PAYMENT VERIFIED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |