WO2013006566A2 - Method and apparatus for scheduling instructions in a multi-strand out-of-order processor - Google Patents
- Publication number
- WO2013006566A2 (PCT/US2012/045286)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- interdependent
- instruction
- instructions
- operand
- dependency
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3854—Instruction completion, e.g. retiring, committing or graduating
- G06F9/3856—Reordering of instructions, e.g. using queues or age tags
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F9/30087—Synchronisation or serialisation instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30145—Instruction analysis, e.g. decoding, instruction word fields
- G06F9/3016—Decoding the operand specifier, e.g. specifier format
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/3017—Runtime instruction translation, e.g. macros
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/34—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
- G06F9/3891—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters
Description
- Embodiments relate generally to the field of computing, and more particularly to methods, systems, and apparatuses for the scheduling of instructions in a multi-strand out- of-order processor.
- Within a computer processor such as a central processing unit (CPU), various operations or stages must be performed for the CPU to perform any beneficial task.
- the concept of an instruction fetch corresponds to the operation of retrieving an instruction from program memory communicatively interfaced with the CPU so that it may undergo further processing (e.g., instruction decode, instruction execute, and write back of the results).
- Each of these operations consumes time or CPU clock cycles and thus inhibits the speed and efficiency of the processor.
- One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing, resulting in a pipeline of instructions available for processing.
- Increased parallelism may also be attained through multiple functional units that simultaneously perform multiple "fetch" operations, the results of which are then placed into a pipeline such that an instruction is always available for an execution cycle. In such a way, an opportunity to execute an instruction is less likely to be wasted due to having to wait for an instruction to be fetched.
- Scoreboarding implements a scheduling mechanism by which dependency violations can be avoided (e.g., via waits, stalls, etc.) which would otherwise result in "hazards" or incorrectly processed data or instructions.
- Figure 1 depicts an exemplary architecture for a prior art fetch operation in a central processor unit's (CPU's) instruction fetch unit which lacks instruction level parallelism;
- Figure 2A depicts an exemplary architecture for the scheduling of instructions in a multi-strand out-of-order processor in accordance with which embodiments may operate;
- Figure 2B depicts an exemplary architecture of a multi-strand out-of-order processor in accordance with which embodiments may operate;
- Figure 3 depicts an exemplary data structure and instruction format of an instruction having synchronization bits in accordance with which embodiments may operate;
- Figure 4 is a flow diagram illustrating a method for the scheduling of instructions in a Multi-Strand Out-Of-Order Processor in accordance with disclosed embodiments
- Figure 5 illustrates a diagrammatic representation of a machine having a multi-strand out-of-order processor in the exemplary form of a computer system, in accordance with one embodiment
- Figure 6 is a block diagram of a computer system according to one embodiment
- Figure 7 is a block diagram of a computer system according to one embodiment.
- Figure 8 is a block diagram of a computer system according to one embodiment.
- Described herein are systems, methods, and apparatuses for the scheduling of instructions in a multi-strand out-of-order processor.
- disclosed mechanisms include interleaving or braiding "strands" (also known as "braids") having instructions therein to form a single program fragment from multiple inter-dependent strands in an out-of-order code fetch mechanism.
- a system for scheduling instructions in a multi-strand out-of-order processor includes a binary translator to generate a multi-strand representation of a sequential program listing, in which the generated multi-strand representation includes a plurality of interdependent strands, each of the plurality of interdependent strands having operand synchronization bits.
- the system further includes an out-of-order instruction fetch unit to retrieve the plurality of interdependent strands for execution and an instruction scheduling unit to schedule the execution of the plurality of interdependent strands based at least in part on the operand synchronization bits.
- Such a system may further include, for example, multiple execution units for executing multiple fetched interdependent strands in parallel, subject to appropriate scheduling to resolve dependencies between any of the plurality of strands.
- an apparatus for scheduling instructions in a multi-strand out-of-order processor includes an out-of-order instruction fetch unit to retrieve a plurality of interdependent instructions for execution from a multi-strand representation of a sequential program listing; an instruction scheduling unit to schedule the execution of the plurality of interdependent instructions based at least in part on operand synchronization bits encoded within each of the plurality of interdependent instructions; and a plurality of execution units to execute at least a subset of the plurality of interdependent instructions in parallel.
- embodiments further include various operations which are described below.
- the operations described in accordance with such embodiments may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the operations.
- the operations may be performed by a combination of hardware and software.
- Embodiments also relate to an apparatus for performing the operations disclosed herein.
- This apparatus may be specially constructed for the required purposes, or it may be a general purpose computer selectively activated or reconfigured by a computer program stored in the computer.
- a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
- Embodiments may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the disclosed embodiments.
- a machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
- a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium (e.g., read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.), a machine (e.g., computer) readable transmission medium (electrical, optical, acoustical), etc.
- any of the disclosed embodiments may be used alone or together with one another in any combination. Although various embodiments may have been partially motivated by deficiencies with conventional techniques and approaches, some of which are described or alluded to within the specification, the embodiments need not necessarily address or solve any of these deficiencies, but rather, may address only some of the deficiencies, address none of the deficiencies, or be directed toward different deficiencies and problems which are not directly discussed.
- Figure 1 depicts an exemplary architecture 100 for a prior art fetch operation in a central processor unit's (CPU's) instruction fetch unit 120 which lacks instruction level parallelism.
- an instruction fetch unit 120 which takes a program counter 115, and presents the program counter to a memory 105 as an address 116 via an interconnecting memory bus 110.
- the presentment triggers/signals a read cycle 117 on the memory 105 and latches the data 118 output from the memory 105 to the instruction register 125.
- the instruction fetch unit 120 further handles an increment of the program counter 115 to get the next instruction (via adder 130), and the addition of a relative jump address (via adder 130) for program counter 115 relative jumps, or the selection 135 and substitution of a branch address for direct branches.
- the program counter 115 will always pull the next instruction in-order. While more sophisticated pipelining buffers may be utilized or even superscalar architecture to provide redundancy of such fetch operations, prior art architecture 100 is nevertheless constrained by an in-order fetch based mechanism insomuch as the program counter 115 will always fetch the "next instruction" on increment.
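- As a concrete illustration of this in-order constraint, the following minimal Python sketch models a single program counter that either increments or is replaced by a branch target; the memory contents and opcode names are hypothetical and serve only to show that each fetch is always determined by the one counter.

```python
# Minimal model of the prior-art in-order fetch of Figure 1: a single program
# counter presents an address to memory, latches the instruction, and is then
# incremented (or replaced by a branch target). Instruction names are invented
# for illustration only.
program_memory = {
    0: ("add", None),
    1: ("jmp", 4),      # direct branch: substitute the branch address
    2: ("sub", None),   # skipped by the branch
    3: ("mul", None),
    4: ("store", None),
}

def in_order_fetch(memory, start_pc=0, max_cycles=5):
    pc = start_pc
    fetched = []
    for _ in range(max_cycles):
        if pc not in memory:
            break
        instruction_register = memory[pc]            # read cycle latches the data
        fetched.append((pc, instruction_register[0]))
        opcode, target = instruction_register
        pc = target if opcode == "jmp" else pc + 1   # adder / branch selection
    return fetched

print(in_order_fetch(program_memory))  # [(0, 'add'), (1, 'jmp'), (4, 'store')]
```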
- Figure 2A depicts an exemplary architecture 200 for the scheduling of instructions in a multi-strand out-of-order processor in accordance with which embodiments may operate.
- an exemplary architecture for data dependencies processing 200 is shown in additional detail, in which the in-order fetch and out-of-order execution capabilities of previously known architectures are overcome in a multi-strand out-of-order processor architecture which improves instruction level parallelism and correspondingly expands the overall instruction scheduling window.
- a combined software/hardware solution for encoding and detecting register dependencies 230 and 225 between instructions in a multi-strand representation 299 generated by a binary translator (BT) from the original sequential program is described.
- the multi-strand representation 299 provides the capability to overcome the abovementioned in-order fetch limitations to provide enhanced instruction level parallelism.
- a strand (e.g., 205, 210, and 215) is a sequence of instructions predominantly data dependent on each other that is arranged by a binary translator at program compilation time.
- strand 205 includes instructions 220, 221, 222, and 223.
- Strand 210 includes instructions 211, 212, 213, and 250.
- Strand 215 includes instructions 224, 227, 226, and 228.
- A Synchronization Bit is appended to register r1, in accordance with the instruction format incorporating such Synchronization Bits as described herein.
- the instruction format having synchronization bits is described in additional detail below in the discussion of Figure 3.
- FIG. 2B depicts an exemplary architecture 201 of a multi-strand out-of-order processor 298 in accordance with which embodiments may operate.
- a multi-strand out-of-order processor 298 is a machine that processes multiple strands 205, 210, 215 (and instruction pointers) in parallel so that instructions (e.g. 220, 211, 224, etc.) from different strands are executed out of program order.
- an out-of-order instruction fetch unit 297 retrieves or fetches interdependent instructions, strands, braids, etc., at least partially out of order.
- interdependent instructions may be stored in a sequential order and the out-of-order instruction fetch unit 297 enables fetch and retrieval of the interdependent instructions for execution in an order which is different from the order in which they are stored.
- results 270 produced in one cluster can be transferred to another cluster (e.g., to either 261 or 262) via a set of wires referred to as inter-cluster data network 285.
- Each cluster 261-262 has an Instruction Scheduler Unit (ISU) 266 that is aimed at correct handling of data dependencies (e.g., 225, 230, 235 from Figure 2A) among instructions of the same strand (e.g., output dependency 225 of strand 215) as well as dependencies amongst the different strands, known as cross-strand data dependencies (e.g., such as dependencies 230 and 235).
- Strand accumulators 271, 272, 273, 274, 275, and 276 operate in conjunction with the common registers 290.
- Each strand accumulator 271-276 is dedicated to one strand only and is addressed by the strand identifier (strand ID).
- the strand 205 within cluster 260 may be uniquely correlated to strand accumulator 271 via the strand ID 205 A for strand 205.
- a synchronization bit is a bit appended to an operand address of an instruction to support correct handling of data anti-dependency among dependent instructions (e.g., anti-dependent instruction 226 of Fig. 2A).
- the synchronization bit cannot be appended to an operand address that is pointing to a strand accumulator 271-276.
- a rule may implement a restriction or hardware logic may enforce such a restriction.
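- The per-strand accumulators and the restriction above may be pictured with the small sketch below; the strand IDs reuse the figure's reference numerals, while the data representation and the check function are illustrative assumptions rather than hardware details.

```python
# Hypothetical model: one accumulator per strand, addressed by strand ID, plus a
# check enforcing that a synchronization bit never targets a strand accumulator.
strand_accumulators = {205: 0, 210: 0, 215: 0}   # strand ID -> accumulator value

def accumulator_for(strand_id):
    # each accumulator is dedicated to exactly one strand and addressed by its ID
    return strand_accumulators[strand_id]

def check_operand(addresses_accumulator, sync_bit):
    # the restriction: an SB may not be appended to an accumulator operand
    if addresses_accumulator and sync_bit:
        raise ValueError("synchronization bit may not address a strand accumulator")

check_operand(addresses_accumulator=True, sync_bit=False)    # allowed
check_operand(addresses_accumulator=False, sync_bit=True)    # allowed
# check_operand(addresses_accumulator=True, sync_bit=True)   # would raise
print(accumulator_for(205))
```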
- An instruction that is data dependent upon another instruction through a register 290 is referred to as a consumer instruction or consumer of that register.
- dependencies 225 and 230 depict dependency through a register 290.
- the instruction that resolves a data dependency through a register 290 thus allowing issuing of a consumer is referred to as a producer instruction or producer of that register 290.
- a consumer is considered to be ready if all data dependencies of its operands are resolved.
- a consumer can be in the same strand (e.g., such as dependency 225) as well as in different strand with respect to the producer (e.g., such as dependency 230).
- a scoreboard 280 is a hardware table containing the instant status of each register in the machine implementing the multi-strand out-of-order processor 298, indicating the availability of each respective register to its consumers.
- scoreboard 280 operates in conjunction with tag comparison logic 281. As depicted, the scoreboard 280 and tag comparison logic 281 reside within each ISU 266 of each cluster 260-262.
- synchronization of strands 205, 210, 215 through registers is performed via the strand-based architecture 200 and consists of both software (SW) and hardware (HW) components operating in accord to implement the disclosed methodologies.
- a software component includes a modified instruction set architecture (ISA) having functionality therein for adding synchronization bits to operands and further having therein functionality for the arrangement of instructions into strands 205, 210, 215 at compilation time.
- the arrangement of instructions into strands 205, 210, 215 at compilation time is performed by a binary translator.
- the out-of-order instruction fetch unit 297 of the multi-strand out-of-order processor 298 expands the available scheduling window size of the processor 298 over previously known mechanisms by, for example, permitting the retrieval (fetch) of a critical instruction which is not accurately predicted by a branch prediction algorithm, without requiring all sequentially preceding instructions to be fetched.
- in-order fetch mechanisms limit the scheduling window size of a CPU because a critical instruction cannot be fetched into the CPU, and therefore cannot be considered for execution, until an entire continuous sequence of previous instructions in the executing program is also fetched and stored into the CPU's buffers or queues.
- In-order fetch therefore requires that all control flow changes in a sequence of instructions for the executing program be correctly predicted by a branch prediction mechanism or face a penalty manifested as inefficiency.
- the ability of CPUs with in-order fetch to exploit instruction level parallelism (ILP) is limited by the branch prediction accuracy, the size of CPU buffers or queues, and the speed of fetching a continuous sequence of instructions. Errors in branch prediction triggered by flow control of an executing program therefore lead to inefficiency bottlenecks.
- the out-of-order fetch unit 297 allows an instruction to be fetched to the multi-strand out-of-order processor 298 and considered for execution earlier than a previous instruction in the program's sequential listing of instructions. It is therefore unnecessary to delay program execution while an entire continuous sequence of previous instructions in the executing program is also fetched and stored into the CPU's buffers or queues leading up to the necessary instruction, as is required with previously known mechanisms implementing in-order instruction fetch. Further still, it is not necessary for the multi-strand out-of-order processor 298 to have buffers large enough to keep all the previous instructions in the sequence, and the branch prediction algorithm need not correctly predict each branch in the sequence.
- the out-of-order fetch architecture of the multi-strand out-of-order processor 298 constitutes a multi-strand architecture in which the compiler splits a program at the instruction level into two or more strands or braids, such that each strand has a corresponding hardware program counter. While each program counter performs fetch sequentially, several program counters operating simultaneously and independently of one another are capable of fetching instructions out of order with regard to a program's sequential listing or the program's provided order of instructions. If the compiler places a critical instruction at the beginning of one of the strands, that instruction will likely be fetched and considered for execution earlier than instructions placed deep in other strands which precede the critical instruction in the original program.
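- The effect of several independently advancing program counters can be sketched as follows; the strand contents, instruction labels, and round-robin fetch policy are assumptions chosen for illustration, not details fixed by the embodiments.

```python
# Illustrative sketch: each strand has its own program counter, so a critical
# instruction placed at the head of one strand is fetched before instructions
# that precede it in the original sequential program order. The instruction
# labels encode hypothetical original program positions.
strands = {
    205: ["insn_7_critical", "insn_8"],   # strand beginning with a critical insn
    210: ["insn_0", "insn_1", "insn_2"],
    215: ["insn_3", "insn_4"],
}

def fetch_round_robin(strands):
    program_counters = {sid: 0 for sid in strands}   # one hardware PC per strand
    fetched = []
    while any(program_counters[sid] < len(code) for sid, code in strands.items()):
        for sid, code in strands.items():
            pc = program_counters[sid]
            if pc < len(code):
                fetched.append((sid, code[pc]))      # in order within a strand,
                program_counters[sid] = pc + 1       # out of program order overall
    return fetched

for strand_id, insn in fetch_round_robin(strands):
    print(strand_id, insn)
```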
- Figure 3 depicts an exemplary data structure and instruction format 300 of an instruction 350 having synchronization bits (315, 325, and 335) in accordance with which embodiments may operate.
- a separate bit, specifically the synchronization bit or "SB," is appended to each source and destination operand in the object code as shown.
- the resultant format thus includes an exemplary instruction 350 within a strand 301 having op-code 305, source operand 1 address 310, a synchronization bit 315 for the source operand 1, source operand 2 address 320, a synchronization bit 325 for the source operand 2, a destination operand address 330, and a synchronization bit 335 for the destination operand.
- multiple instructions 350 ... 359 may be present within the strand 301, each incorporating a similar format as that depicted in detail with regard to instruction 350.
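- One way to picture this format is the bit-packing sketch below; the field widths are assumptions chosen for illustration, since the embodiments do not fix particular widths.

```python
# Hypothetical encoding of the Figure 3 format: op-code, two source operand
# addresses and one destination operand address, each operand followed by a
# one-bit synchronization bit (SB). Field widths are illustrative assumptions.
OPCODE_BITS, ADDR_BITS = 8, 6

def encode(opcode, src1, sb1, src2, sb2, dst, sb_dst):
    word = opcode
    for addr, sb in ((src1, sb1), (src2, sb2), (dst, sb_dst)):
        word = (word << ADDR_BITS) | addr   # operand address field
        word = (word << 1) | sb             # its synchronization bit
    return word

def decode(word):
    fields = []
    for _ in range(3):                      # extracts dst/SB, src2/SB, src1/SB
        sb = word & 1
        word >>= 1
        addr = word & ((1 << ADDR_BITS) - 1)
        word >>= ADDR_BITS
        fields.append((addr, sb))
    fields.reverse()                        # back to src1, src2, dst order
    return word, fields                     # remaining bits are the op-code

word = encode(opcode=0x2A, src1=3, sb1=1, src2=7, sb2=0, dst=9, sb_dst=1)
print(decode(word))  # (42, [(3, 1), (7, 0), (9, 1)])
```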
- a data anti-dependency (e.g., such as anti-dependency 235 at Figure 2A) is explicitly encoded between an instruction using a value in a register 290 and a second instruction updating the register with a new value.
- a binary translator sets a synchronization bit of a producer source operand to indicate that the producer source operand is the last use of the data item causing the anti-dependency.
- the binary translator further sets the synchronization bit of the consumer destination operand to indicate that the instruction must wait until all uses of the previous data item are completed, thus guiding the HW scheduling logic to execute the consumer after the producer.
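- A toy illustration of this encoding follows, assuming a two-instruction fragment in which register r2 is read for the last time by one instruction and then overwritten by another; the instruction representation and helper function are hypothetical stand-ins for the binary translator's internal form.

```python
# Toy sketch of how a binary translator might encode the anti-dependency
# between a last use of r2 and a later redefinition of r2: the producer's
# source operand and the consumer's destination operand each get their
# synchronization bit set. The instruction representation is hypothetical.
producer = {"op": "add", "srcs": [{"reg": "r2", "sb": 0}, {"reg": "r3", "sb": 0}],
            "dst": {"reg": "r1", "sb": 0}}          # last reader of the old r2 value
consumer = {"op": "mul", "srcs": [{"reg": "r4", "sb": 0}, {"reg": "r5", "sb": 0}],
            "dst": {"reg": "r2", "sb": 0}}          # writes a new value into r2

def encode_anti_dependency(producer, consumer, reg):
    for src in producer["srcs"]:
        if src["reg"] == reg:
            src["sb"] = 1           # marks the last use of the old data item
    if consumer["dst"]["reg"] == reg:
        consumer["dst"]["sb"] = 1   # consumer must wait until all uses complete

encode_anti_dependency(producer, consumer, "r2")
print(producer["srcs"][0]["sb"], consumer["dst"]["sb"])  # 1 1
```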
- a first rule prohibits race conditions among instructions belonging to different strands producing the same destination register; and a second rule prohibits race conditions among instructions belonging to different strands reading the same source register with a synchronization bit.
- the binary translator ensures that all such instructions are required to be assigned to the same strand or the execution order for such instructions must be explicitly set through additional data or control dependency. Some situations may or may not be treated as race conditions depending on the program algorithm. For example, two consumers in two different strands having the same source operand address must be prohibited by the binary translator when the program algorithm prescribes that they are dependent on two corresponding producers with the same destination operand address within another strand. If the consumers according to the program algorithm depend on the same producer, then there is no race condition.
- a third rule prohibits an instruction from having the same source and destination operand addresses, each with a synchronization bit.
- the binary translator prohibits the situation of the third rule as it leads to an ambiguous situation that can't be handled by the scheduling hardware.
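- The three rules might be checked by a translator pass along the lines of the sketch below. The instruction representation (strand ID, source operands with synchronization bits, destination operand) is an assumption, and the sketch only flags the patterns the rules target; as noted above, the translator may instead resolve the first two cases by assigning the instructions to one strand or by adding an explicit ordering dependency.

```python
# Sketch of the three binary-translator restrictions described above, applied
# to a hypothetical list of instructions of the form
# (strand_id, [(src_reg, sb), ...], (dst_reg, sb)).
from collections import defaultdict

def check_rules(instructions):
    violations = []
    dst_strands = defaultdict(set)      # destination register -> strands writing it
    sb_src_strands = defaultdict(set)   # source register read with SB -> strands
    for strand, srcs, (dst, dst_sb) in instructions:
        dst_strands[dst].add(strand)
        for reg, sb in srcs:
            if sb:
                sb_src_strands[reg].add(strand)
            if sb and dst_sb and reg == dst:
                violations.append(("rule 3: same src/dst address with SB", strand, reg))
    for reg, strands in dst_strands.items():
        if len(strands) > 1:
            violations.append(("rule 1: same destination in different strands", reg))
    for reg, strands in sb_src_strands.items():
        if len(strands) > 1:
            violations.append(("rule 2: SB source read in different strands", reg))
    return violations

demo = [(205, [("r2", 1)], ("r1", 0)),
        (210, [("r3", 0)], ("r1", 0)),          # flags the rule 1 pattern on r1
        (215, [("r4", 1)], ("r4", 1))]          # flags the rule 3 pattern on r4
print(check_rules(demo))
```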
- a hardware component implements the aforementioned scoreboard 280 of Figure 2B and further implements tag comparison logic 281.
- Scoreboard 280 permits status, check, determination, and assessment of operand readiness for an instruction, thus resolving data dependencies.
- scoreboard 280 and tag comparison logic 281 are configured to allow fetching, issuing and executing instructions from different strands 301 (and 205, 210, 215 of Figure 2A) out-of-order in accordance with the implementation of a multi-strand out-of-order processor 298 as described herein.
- Scoreboard 280 stores status bits for each register 290 and strand accumulator 271-276 in a multi-strand out-of-order processor 298 and every instruction looks up the scoreboard 280 to determine if its requisite operands are ready.
- the strand accumulators 271-276 have only one status bit each, designated as a busy bit.
- the availability bit is pre-initialized ("set" as a default) and, when set, indicates that a register value has been written to a register file (RF) by another instruction and is available for reading.
- the busy bit, if set, indicates that an instruction updating a register value is in the processor pipeline: it has been issued by the instruction scheduler unit 266 but has not yet written the new register value.
- the status bits of the scoreboard are updated after issuing the instruction.
- the instruction scheduler unit 266 sets the busy bit for the destination operand and the source operand with a synchronization bit (315, 325, and 335). If an instruction completes its execution and writes the destination register in the register file, the corresponding availability bit is set and the busy bit is cleared.
- a synchronization bit (315 or 325) appended to a source operand address (310 or 320) of an instruction 350 indicates that both status bits must be cleared after reading the operand value from the register file.
- a synchronization bit 335 appended to the destination operand address 330 of an instruction 350 indicates that the instruction must not be issued until both status bits are cleared.
- data dependencies are resolved thus allowing an instruction to be issued, by checking the status bits of the scoreboard 280 for the operands of instructions 350 residing in an instruction scheduler unit 266 as illustrated by Fig. 2B.
- true dependencies (e.g., 230) are resolved thus allowing an instruction to be issued, by setting the availability bit and clearing the busy bit corresponding to the destination operand of the producer after writing a produced register value into the register file.
- the dependency is resolved if the source operand of a consumer has its availability bit set and its busy bit cleared.
- to resolve anti-dependencies, synchronization bits appended by a binary translator at program compilation time to the source operand of the producer and the destination operand of the consumer are used. After reading the register value from the register file for the source operand with a synchronization bit, the corresponding availability bit and busy bit for that source operand are cleared.
- the busy bit corresponding to the destination operand of the producer is set immediately after issuing the instruction.
- the dependency is resolved if the busy bit corresponding to the destination operand of the consumer is cleared.
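- The status-bit updates described above (issue, write-back, and reads of source operands carrying a synchronization bit) can be tied together in a minimal sketch; the Scoreboard class and register names are software stand-ins for the hardware table, not an implementation of it.

```python
# Illustrative model of the scoreboard status bits: every register has an
# availability bit (pre-initialized to set) and a busy bit. The updates below
# follow the rules described in the surrounding text.
class Scoreboard:
    def __init__(self, registers):
        self.available = {r: True for r in registers}    # value readable in the RF
        self.busy = {r: False for r in registers}        # update in flight

    def on_issue(self, dst_reg, sb_src_regs):
        # the busy bit is set for the destination and for SB-marked sources
        self.busy[dst_reg] = True
        for r in sb_src_regs:
            self.busy[r] = True

    def on_writeback(self, dst_reg):
        # true dependency resolved: availability set, busy cleared
        self.available[dst_reg] = True
        self.busy[dst_reg] = False

    def on_sb_source_read(self, src_reg):
        # anti-dependency handling: both status bits cleared after the read
        self.available[src_reg] = False
        self.busy[src_reg] = False

    def source_ready(self, reg):          # consumer source operand
        return self.available[reg] and not self.busy[reg]

    def sb_destination_ready(self, reg):  # destination operand carrying an SB
        return not self.available[reg] and not self.busy[reg]

sb = Scoreboard(["r1", "r2"])
sb.on_issue("r1", sb_src_regs=["r2"])
print(sb.source_ready("r1"))          # False: r1 is busy until write-back
sb.on_writeback("r1")
print(sb.source_ready("r1"))          # True
sb.on_sb_source_read("r2")
print(sb.sb_destination_ready("r2"))  # True: a new value of r2 may now be produced
```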
- Each instruction reads the scoreboard 280 status to determine the status bits for every operand only once during its allocation into the instruction scheduler unit 266.
- tag comparison logic 281 monitors the register values being generated by instructions and detects the readiness of instructions waiting in the instruction scheduler unit 266. After a consumer has read the scoreboard 280 but its operand has not yet been identified as ready (e.g., a producer hasn't yet been issued or completed thus it hasn't yet updated the corresponding status bits), its readiness will be detected by the tag comparison logic 281 which monitors register values generated by instructions.
- tag comparison logic 281 implements a Content Addressable Memory (CAM) that compares operand addresses of producers being executed with operand addresses of consumers residing in the instruction scheduler unit 266.
- the CAM performs four types of operand address comparison: 1) destination address of the producer with source address of the consumer, 2) source address (310, 320) with
- comparison types 3) and 4) are performed only if both the producer and the consumer belong to the same strand (e.g., are both instructions within one strand, such as instructions 350 and 359 within exemplary strand 301).
- operand addresses of strand accumulators 271-276 are compared if the consumer and the producer (e.g., instructions 350 and 359 by way of example) belong to the same strand 301 as well.
- the CAM implemented by the tag comparison logic 281 is responsible not only for wakeup of dependent consumers that reside in the instruction scheduler unit 266, thus substituting for the functionality of the availability bits, but additionally for stalling consumers in the instruction scheduler unit 266, thus substituting for the functionality of the busy bits.
- Comparison of the source operand address (310 and 320) of the consumer with the source operand address (310 and 320) of another consumer being executed, belonging to the same strand and having a synchronization bit (315, 325), is required in order to identify the relevant producer and to resolve a true dependency (e.g., 230) if the consumers read the source operand value from bypass wires.
- either the CAM performs the comparison or the binary translator must properly arrange a corresponding strand, thus delaying the second consumer in order to prevent such a situation.
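- A simplified sketch of the tag-comparison idea follows: the destination operand address of an executing producer is compared against the pending source operand addresses of consumers waiting in the scheduler, waking any consumer whose operands are all satisfied. The waiting-entry representation is an assumption, and only one of the comparison types is shown.

```python
# Illustrative content-addressable comparison: when a producer executes, its
# destination operand address is broadcast and compared against the source
# operand addresses of consumers waiting in the instruction scheduler unit.
# Matching consumers with all sources satisfied are "woken" for issue.
waiting_consumers = [
    {"id": "i350", "strand": 301, "pending_srcs": {"r1"}},
    {"id": "i359", "strand": 301, "pending_srcs": {"r1", "r7"}},
]

def broadcast_producer_result(dst_reg, consumers):
    woken = []
    for consumer in consumers:
        if dst_reg in consumer["pending_srcs"]:
            consumer["pending_srcs"].remove(dst_reg)    # tag match clears the wait
            if not consumer["pending_srcs"]:
                woken.append(consumer["id"])            # all operands now ready
    return woken

print(broadcast_producer_result("r1", waiting_consumers))  # ['i350']
print(broadcast_producer_result("r7", waiting_consumers))  # ['i359']
```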
- ILP based architectures may benefit from incorporating static instruction scheduling to provide more efficient utilization of the available execution units than with dynamic instruction scheduling based on, for example, Tomasulo's algorithm.
- One approach to providing a larger instruction window is splitting the initial program control flow graph into fragments (e.g., strands or braids as depicted at 205, 210, 215 of Figure 2A) executing on a plurality of processing nodes (e.g., as individual threads of execution in, for example, a Multiscalar architecture) such as the clusters 260-262 depicted at Figure 2B. It is possible for several strands (braids) to occupy the same cluster 260-262.
- each thread is annotated with the list of registers that it may produce. This list is used to reset the scoreboard 280 state of the corresponding registers 290 so that the consumers are caused to wait, stall, or delay for these registers 290 to be produced.
- Another approach implies partial or full delegation of the instruction scheduling function from the hardware dynamic scheduler to software, thus simplifying the scheduling hardware and providing more efficient utilization of multiple execution channels.
- the methods and techniques described herein permit a larger scheduling window by fully adopting an out-of-order instruction fetch unit 297, thus overcoming the prior limitations.
- the mechanisms and techniques described herein maintain program order on the level of single instructions, and not on the basis of entire strands. Because program order is maintained on the level of single instructions, the register synchronization information is fetched in an order different from the program order, thus providing the ability to interleave instructions from a single program fragment in multiple strands. Strands (or "braids") having instructions therein are thus interleaved, interwoven, or braided, to form a single program fragment from multiple inter-dependent strands in an out-of-order code fetch mechanism.
- Previously known mechanisms assume that threads are spawned in the program order, and a newly spawned thread receives the list of registers that need to be provided by the previous threads.
- Figure 4 is a flow diagram illustrating a method for the scheduling of instructions in a multi-strand out-of-order processor in accordance with disclosed embodiments.
- Method 400 may be performed by processing logic that may include hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.) or software (e.g., instructions run on a processing device) to perform the methodologies and operations described herein, such as the scheduling of instructions in a multi-strand out-of-order processor to enhance ILP.
- method 400 is performed by an integrated circuit or a system having an integrated circuit therein, such as the multi-strand out-of-order processor 298 architecture depicted by Figure 2B.
- Some of the blocks and/or operations of method 400 are optional in accordance with certain embodiments. The numbering of the blocks presented is for the sake of clarity and is not intended to prescribe an order of operations in which the various blocks must occur.
- Method 400 begins with processing logic for fetching a plurality of interdependent instructions, strands, or braids for execution, wherein the plurality of interdependent instructions, strands, or braids are fetched out of order (block 405).
- processing logic determines a dependency exists between a first interdependent instruction and a second interdependent instruction.
- processing logic resolves a data dependency by checking status bits in a scoreboard for operands associated with the first and second interdependent instructions.
- processing logic resolves a true dependency by setting the availability bit and clearing the busy bit corresponding to a destination operand of a producer after writing a produced register value.
- processing logic resolves an anti-dependency by reading a register value for a source operand with a synchronization bit and clearing a corresponding availability bit and busy bit for the source operand.
- processing logic resolves an output dependency by setting the busy bit corresponding to the destination operand of the producer immediately after issuing the instruction.
- processing logic monitors register values being generated by instructions.
- processing logic detects the readiness of instructions waiting in an instruction scheduler unit based on a scoreboard status.
- processing logic compares operand addresses of producers being executed with operand addresses of consumers residing in the instruction scheduler unit.
- processing logic schedules the plurality of interdependent instructions for execution subject to detecting the readiness and comparisons of operands.
- processing logic executes at least a subset of the plurality of interdependent instructions in parallel subject to the scheduling.
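- The sequence of operations above can be summarized in the runnable sketch below; every data structure and helper is a hypothetical stand-in that simply mirrors the flow (instructions arrive out of fetch order, dependencies are resolved against the set of produced registers, and ready instructions execute in parallel waves).

```python
# Runnable toy walk-through of method 400's flow; the instruction dictionaries
# and helper logic are hypothetical stand-ins, not the patent's hardware.
def method_400(fetched_out_of_order):
    produced = set()                  # registers whose values have been written back
    waiting = list(fetched_out_of_order)
    executed_waves = []
    while waiting:
        # determine readiness: every source operand dependency must be resolved
        ready = [i for i in waiting if set(i["srcs"]) <= produced]
        if not ready:
            raise RuntimeError("unresolved dependency (hardware would stall)")
        for insn in ready:            # instructions in one wave execute in parallel
            produced.add(insn["dst"])
            waiting.remove(insn)
        executed_waves.append([i["id"] for i in ready])
    return executed_waves

# Instructions as fetched out of order (block 405); 'B' precedes 'A' in the
# original program order but was fetched after it from a different strand.
fetched = [
    {"id": "A", "srcs": ["r1"], "dst": "r2"},   # consumer of r1
    {"id": "B", "srcs": [],     "dst": "r1"},   # producer of r1, fetched after A
    {"id": "C", "srcs": ["r2"], "dst": "r3"},
]
print(method_400(fetched))  # [['B'], ['A'], ['C']]
```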
- FIG. 5 illustrates a diagrammatic representation of a machine 500 having a multi-strand out-of-order processor in the exemplary form of a computer system, in accordance with one embodiment, within which a set of instructions, for causing the machine/computer system 500 to perform any one or more of the methodologies discussed herein, may be executed.
- the machine may be connected (e.g., networked) to other machines in a Local Area Network (LAN), an intranet, an extranet, or the Internet.
- the machine may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or series of servers within an on-demand service environment.
- Certain embodiments of the machine may be in the form of a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, computing system, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
- the term "machine" shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
- the exemplary computer system 500 includes a multi-strand out-of-order processor 502, a main memory 504 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc., static memory such as flash memory, static random access memory (SRAM), etc.), and a secondary memory 518.
- Main memory 504 includes binary translator 524 to provide a multi-strand representation of a sequential program listing.
- the binary translator 524 operates in conjunction with the out-of-order fetch unit 525 and processing logic 526 of the multi-strand out-of-order processor 502 to perform the methodologies discussed herein.
- Multi-strand out-of-order processor 502 incorporates the capabilities of one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. Multi-strand out-of-order processor 502 is configured to fetch instruction strands via out-of-order fetch unit 525 and execute the fetched instruction strands via processing logic 526 to perform the operations and methodologies discussed herein.
- the computer system 500 may further include a network interface card 508.
- the computer system 500 also may include a user interface 510 (such as a video display unit, a liquid crystal display (LCD), or a cathode ray tube (CRT)), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 516 (e.g., an integrated speaker).
- the computer system 500 may further include peripheral device 536 (e.g., wireless or wired communication devices, memory devices, storage devices, audio processing devices, video processing devices, etc.).
- the secondary memory 518 may include a non-transitory machine-readable or computer readable storage medium 531 on which is stored one or more sets of instructions (e.g., software 522) embodying any one or more of the methodologies or functions described herein.
- the software 522 may also reside, completely or at least partially, within the main memory 504 and/or within the multi-strand out-of-order processor 502 during execution thereof by the computer system 500, the main memory 504 and the multi-strand out-of-order processor 502 also constituting machine-readable storage media.
- the software 522 may further be transmitted or received over a network 520 via the network interface card 508.
- Referring now to Figure 6, shown is a block diagram of a system 600 in accordance with one embodiment of the present invention.
- the system 600 may include one or more processors 610, 615, which are coupled to graphics memory controller hub (GMCH) 620.
- The optional nature of additional processors 615 is denoted in Figure 6 with broken lines.
- Each processor 610, 615 may be some version of the processor 500. However, it should be noted that it is unlikely that integrated graphics logic and integrated memory control units would exist in the processors 610, 615.
- Figure 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for example, a dynamic random access memory (DRAM).
- the DRAM may, for at least one embodiment, be associated with a nonvolatile cache.
- the GMCH 620 may be a chipset, or a portion of a chipset.
- the GMCH 620 may communicate with the processor(s) 610, 615 and control interaction between the processor(s) 610, 615 and memory 640.
- the GMCH 620 may also act as an accelerated bus interface between the processor(s) 610, 615 and other elements of the system 600.
- the GMCH 620 communicates with the processor(s) 610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.
- GMCH 620 is coupled to a display 645 (such as a flat panel display).
- GMCH 620 may include an integrated graphics accelerator.
- GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to system 600.
- Shown for example in the embodiment of Figure 6 is an external graphics device 660, which may be a discrete graphics device coupled to ICH 650, along with another peripheral device 670.
- additional processor(s) 615 may include additional processor(s) that are the same as processor 610, additional processor(s) that are heterogeneous or asymmetric to processor 610, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor.
- the various processors 610, 615 may reside in the same die package.
- As shown in Figure 7, multiprocessor system 700 is a point-to-point interconnect system, and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750.
- processors 770 and 780 may be some version of the processor 500, as may be one or more of the processors 610, 615.
- processors 770, 780 While shown with only two processors 770, 780, it is to be understood that the scope of the present invention is not so limited. In other embodiments, one or more additional processors may be present in a given processor.
- Processors 770 and 780 are shown including integrated memory controller units 772 and 782, respectively.
- Processor 770 also includes as part of its bus controller units point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788.
- Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788.
- IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.
- Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point to point interface circuits 776, 794, 786, 798.
- Chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.
- a shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.
- first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the present invention is not so limited.
- various I/O devices 714 may be coupled to first bus 716, along with a bus bridge 718 which couples first bus 716 to a second bus 720.
- second bus 720 may be a low pin count (LPC) bus.
- Various devices may be coupled to second bus 720 including, for example, a keyboard and/or mouse 722 and an audio I/O 724.
- a system may implement a multi-drop bus or other such architecture.
- Referring now to Figure 8, shown is a block diagram of a third system 800 in accordance with an embodiment of the present invention.
- Like elements in Figures 7 and 8 bear like reference numerals, and certain aspects of Figure 7 have been omitted from Figure 8 in order to avoid obscuring other aspects of Figure 8.
- FIG. 8 illustrates that the processors 870, 880 may include integrated memory and I/O control logic ("CL") 872 and 882, respectively.
- the CL 872, 882 may include integrated memory controller units such as that described above in connection with Figure 7.
- CL 872, 882 may also include I/O control logic.
- Figure 8 illustrates that not only are the memories 832, 834 coupled to the CL 872, 882, but also that I/O devices 814 are coupled to the control logic 872, 882.
- Legacy I/O devices 815 are coupled to the chipset 890.
Abstract
According to described embodiments, methods, systems, and apparatuses for scheduling instructions in a multi-strand out-of-order processor are disclosed. For example, an apparatus for scheduling instructions in a multi-strand out-of-order processor includes an out-of-order instruction fetch unit to retrieve a plurality of interdependent instructions for execution from a multi-strand representation of a sequential program listing; an instruction scheduling unit to schedule the execution of the plurality of interdependent instructions based at least in part on synchronization bits appended to operands and encoded within each of the plurality of interdependent instructions; and a plurality of execution units to execute in parallel at least a subset of the plurality of interdependent instructions.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/175,619 US9529596B2 (en) | 2011-07-01 | 2011-07-01 | Method and apparatus for scheduling instructions in a multi-strand out of order processor with instruction synchronization bits and scoreboard bits |
US13/175,619 | 2011-07-01 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2013006566A2 true WO2013006566A2 (fr) | 2013-01-10 |
WO2013006566A3 WO2013006566A3 (fr) | 2013-03-07 |
Family
ID=47391881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2012/045286 WO2013006566A2 (fr) | 2011-07-01 | 2012-07-02 | Procédé et appareil d'ordonnancement d'instructions dans un processeur à exécution non ordonnée à multi-séquences |
Country Status (3)
Country | Link |
---|---|
US (2) | US9529596B2 (fr) |
TW (1) | TWI514267B (fr) |
WO (1) | WO2013006566A2 (fr) |
-
2011
- 2011-07-01 US US13/175,619 patent/US9529596B2/en not_active Expired - Fee Related
-
2012
- 2012-07-02 TW TW101123731A patent/TWI514267B/zh not_active IP Right Cessation
- 2012-07-02 WO PCT/US2012/045286 patent/WO2013006566A2/fr active Application Filing
-
2016
- 2016-12-27 US US15/391,709 patent/US20170235578A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
US20130007415A1 (en) | 2013-01-03 |
US20170235578A1 (en) | 2017-08-17 |
US9529596B2 (en) | 2016-12-27 |
TWI514267B (zh) | 2015-12-21 |
WO2013006566A3 (fr) | 2013-03-07 |
TW201312460A (zh) | 2013-03-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12807777 Country of ref document: EP Kind code of ref document: A2 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 12807777 Country of ref document: EP Kind code of ref document: A2 |