US20160328237A1 - System and method to reduce load-store collision penalty in speculative out of order engine - Google Patents
- Publication number: US20160328237A1 (Application US14/719,320)
- Authority: US (United States)
- Prior art keywords: load, instruction, store, dispatched, logic
- Prior art date
- Legal status: Abandoned (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30043—LOAD or STORE instructions; Clear instruction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3834—Maintaining memory consistency
Description
- 1. Field of the Invention
- The present invention relates in general to processing engines, and more specifically to a system and method of reducing the load-store collision penalty in a speculative out of order processing engine.
- 2. Description of the Related Art
- A processing engine, such as a microprocessor or the like, executes the instructions of an instruction set architecture, such as the x86 instruction set architecture or the like. In many such engines, the instructions of the instruction set architecture, often referred to as macroinstructions, are first translated into microinstructions (or micro-operations or “μops”) that are issued to a reservation stations module that dispatches the instructions to the execution units. The microinstructions are more generally referred to herein simply as “instructions.” The instructions are also issued to a reorder buffer which ensures in-order retirement of the instructions.
- An out-of-order (O-O-O) scheduler is widely used in processor design and provides an important distinction between high-performance processors and others. In an O-O-O scheduler, each instruction is dispatched based on dependency, which arises when one instruction uses as a source a register that another instruction uses as a destination. Yet the dependency of some instructions, such as load and store instructions, is difficult to recognize. This is because the dependency is caused not by the same register but by the same memory address, which is not known by the scheduler at the schedule stage. One common method is therefore to speculatively assume that the load and store instructions do not have any collisions. When a collision is detected afterwards, the result is incorrect, the pipeline is flushed, and the instructions are dispatched again. When the speculative dispatch of an instruction is incorrect, the flushing and re-dispatch of the instruction introduce a significant penalty.
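- For illustration only, the following minimal Python sketch (the record type and field names are invented for the example and are not part of the disclosure) shows why a register dependency is visible to a scheduler at schedule time while a memory dependency is not: the register fields are present in the instruction itself, whereas the addresses only exist after address generation.
```python
# Illustrative only: a toy instruction record contrasting register and memory dependencies.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Inst:
    dest_reg: Optional[str]         # architectural register written, if any
    src_regs: Tuple[str, ...]       # architectural registers read
    mem_addr: Optional[int] = None  # memory address; unknown until the AGU computes it

store = Inst(dest_reg=None, src_regs=("r1",))   # ST [addr], r1 -- address not yet known
load = Inst(dest_reg="r2", src_regs=())         # LD r2, [addr] -- address not yet known
add = Inst(dest_reg="r3", src_regs=("r2",))     # ADD r3, r2, ...

# Register dependency: visible by comparing register fields of the instructions.
print(load.dest_reg in add.src_regs)            # True -> scheduler knows ADD depends on LD

# Memory dependency: cannot be determined yet, because neither address is known.
print(store.mem_addr is not None and store.mem_addr == load.mem_addr)  # False -> unknown
```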
- A microprocessor according to one embodiment includes a load pipeline, a scheduler, an address generation unit, and a load-store queue. The load pipeline includes multiple stages which include at least one operand stage and two or more execution stages. The scheduler dispatches load instructions to the at least one operand stage for execution by the execution stages. The load instructions include a speculatively dispatched load instruction. The address generation unit provides a load instruction virtual address for the speculatively dispatched load instruction before the speculatively dispatched load instruction has progressed to the execution stages. The load-store queue asserts a clear signal to invalidate the speculatively dispatched load instruction when a match occurs between the load instruction virtual address and a store instruction virtual address of at least one previously dispatched store instruction in which corresponding store data has not yet been determined.
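- For illustration only, a minimal behavioral sketch in Python of the check this embodiment performs (the names PendingStore and lostq_clear are invented for the example; the disclosure describes hardware comparator logic, not software): the clear signal is asserted when the load instruction virtual address matches any previously dispatched store whose store data has not yet been determined.
```python
# Behavioral sketch only; the hardware performs this comparison with parallel comparators.
from dataclasses import dataclass
from typing import List

@dataclass
class PendingStore:
    virtual_address: int   # STVA determined by the store-address (STA) operation
    data_ready: bool       # True once the store-data (STD) operation has executed

def lostq_clear(load_virtual_address: int, pending_stores: List[PendingStore]) -> bool:
    """Assert CLR when the speculative load's address matches an older store
    whose store data has not yet been determined."""
    return any(store.virtual_address == load_virtual_address and not store.data_ready
               for store in pending_stores)

stores = [PendingStore(virtual_address=0x1000, data_ready=False)]
print(lostq_clear(0x1000, stores))  # True  -> collision, invalidate the speculative load
print(lostq_clear(0x2000, stores))  # False -> no collision, the load may proceed
```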
- The clear signal invalidates the speculatively dispatched load instruction in the event of a collision, such as when a match occurs between the load instruction virtual address and a store instruction virtual address of at least one previously dispatched store instruction. In this manner, the scheduler, which is otherwise configured to schedule dispatch of instructions that are dependent upon the speculatively dispatched load instruction, may instead not prematurely dispatch the dependent instructions when the clear signal is asserted.
- In one embodiment, the load pipeline is configured to assert a load valid signal when the speculatively dispatched load instruction has progressed to a selected execution stage. Kill logic is provided that prevents the load valid signal from being detected by the scheduler when the clear signal is asserted. Broadcast logic may be provided to receive and broadcast the load valid signal to the scheduler when asserted, except when the clear signal is asserted to invalidate the speculatively dispatched load instruction. The broadcast logic may include kill logic that prevents the broadcast logic from broadcasting the load valid signal when the clear signal is asserted.
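- The gating just described reduces to a single condition; the sketch below (Python, with an invented function name) shows the kill logic suppressing the broadcast of the load valid signal whenever the clear signal is asserted.
```python
# Sketch of the kill/broadcast gating; ld_valid and clr model the LD VALID and CLR signals.
def broadcast_load_valid(ld_valid: bool, clr: bool) -> bool:
    """The scheduler observes LD VALID only when CLR is not asserted."""
    return ld_valid and not clr

print(broadcast_load_valid(True, False))  # True  -> dependent instructions may be dispatched
print(broadcast_load_valid(True, True))   # False -> broadcast killed, dependents held back
```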
- The load-store queue may include multiple entries, each for comparing the load instruction virtual address of the speculatively dispatched load instruction with one or more store instruction virtual addresses. Valid logic and qualify logic may be provided for each entry to ensure that a corresponding store instruction virtual address corresponds to a store instruction that is dispatched earlier than the speculatively dispatched load instruction and whose corresponding store data has not yet been determined. Each entry may assert a preliminary clear signal, and OR logic may be provided to assert the clear signal when any one or more of the preliminary clear signals are asserted.
- A load-store collision detection system is disclosed for a speculative out of order processing engine. The processing engine includes a scheduler that dispatches instructions to multiple instruction pipelines, in which the instruction pipelines include a load pipeline that provides a load valid signal when a speculatively dispatched load instruction is executing. The load-store collision detection system includes comparator logic, broadcast logic, and kill logic. The comparator logic asserts a clear signal when a load instruction virtual address of the speculatively dispatched load instruction matches at least one store instruction virtual address of at least one previously dispatched store instruction whose corresponding store data is not ready yet. The broadcast logic broadcasts the load valid signal to the scheduler to enable dispatch of any instructions dependent upon the speculatively dispatched load instruction. The kill logic invalidates the load valid signal when the clear signal is asserted before or coincident with the load valid signal.
- The kill logic may be incorporated within the broadcast logic or the scheduler or any suitable combination of both. The load-store collision detection system may include a memory for storing one or more store instruction virtual addresses.
- A method of reducing load-store collisions in a speculative out of order processing engine includes providing a store instruction address for each of at least one previously dispatched store instruction whose corresponding data is not ready yet, speculatively dispatching a load instruction to a load pipeline, determining a load instruction address for the speculatively dispatched load instruction before said speculatively dispatched load instruction is executed, comparing the load instruction address with the store instruction address of each of the at least one previously dispatched store instruction, asserting a clear signal when the load instruction address matches the store instruction address of the at least one previously dispatched store instruction, asserting a load valid signal for the speculatively dispatched load instruction while it is being executed, and invalidating the load valid signal when the clear signal is also asserted.
- The method may include broadcasting the load valid signal to each queue of a scheduler, and suppressing the broadcasting of the load valid signal when the clear signal is also asserted. The method may include determining the validity of the store instruction address of each of the at least one previously dispatched store instruction. The method may include validating and qualifying the store instruction address of each of the at least one previously dispatched store instruction. The method may include comparing the load instruction address with multiple store instruction addresses and asserting a corresponding one of multiple preliminary clear signals for each match, and asserting the clear signal when at least one of the preliminary clear signals is asserted. The method may include asserting a corresponding one of the preliminary clear signals only when a corresponding store instruction address is valid and qualified.
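- The method steps listed above can be strung together as a single check; the following sketch (Python, illustrative names only; the claimed hardware performs these steps across pipeline stages rather than in software) shows a load whose valid signal survives only when no pending store address matches.
```python
# Illustrative end-to-end model of the method; not the claimed hardware itself.
def speculative_load_survives(load_address, pending_store_addresses):
    """Compare the load address against each pending store address, assert a
    clear signal on any match, and report whether the load valid signal
    survives (i.e., whether dependent instructions may be dispatched)."""
    clear = any(load_address == store_address
                for store_address in pending_store_addresses)
    load_valid = True                  # asserted while the load is being executed
    return load_valid and not clear    # invalidated when the clear signal is asserted

print(speculative_load_survives(0x40, [0x80, 0x40]))  # False -> collision detected
print(speculative_load_survives(0x40, [0x80]))        # True  -> no collision
```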
- The benefits, features, and advantages of the present invention will become better understood with regard to the following description, and accompanying drawings where:
- FIG. 1 is a simplified block diagram of a superscalar, pipelined microprocessor implemented according to one embodiment of the present invention;
- FIG. 2 is a diagram depicting, in simplified manner, an O-O-O instruction sequence in contrast to an in-order instruction sequence according to a conventional configuration, illustrating a collision and corresponding consequences;
- FIG. 3 is a simplified block diagram of the load pipeline of FIG. 1 receiving instructions dispatched from the LD RS Q within the reservation stations of FIG. 1 and corresponding load execution stages according to one embodiment;
- FIG. 4 is a more detailed block diagram of the LoStQ of FIG. 3 according to one embodiment of the present invention; and
- FIG. 5 is a diagram of an exemplary entry of the LoStQ of FIG. 4 according to an alternative embodiment with memory devices.
- The inventors have recognized the penalty associated with a load-store collision caused by speculative dispatch of a load instruction within a processing engine. They have therefore developed a system and method of detecting load-store collisions before the load is dispatched for execution. The system and method further squashes, or otherwise suppresses, the dispatch valid of the load instruction to prevent the issuance of additional instructions which depend upon the speculatively dispatched load instruction. Since the potentially dependent instructions are not dispatched prematurely, the pipeline need not be flushed and the dependent instructions need not be replayed. In this manner, the penalty associated with the speculatively dispatched load instruction is reduced or otherwise minimized. A load-store (Lo-St) queue (LoStQ) structure is incorporated into the instruction pipeline which detects a collision between the load and any store instruction that is not ready to complete. A store instruction that is not ready to complete means that the address portion (STA) has been determined, but the data portion (STD) has not yet been determined, so that the store instruction is considered temporarily "LoSt." The LoStQ detects this collision and issues a clear signal back to kill broadcast of the dispatch valid of the load instruction to scheduler queues holding instructions for dispatch. The clear signal suppresses the dispatch of additional instructions that depend upon the speculatively dispatched load to improve performance efficiency.
- FIG. 1 is a simplified block diagram of a superscalar, pipelined microprocessor 100 implemented according to one embodiment of the present invention. The microprocessor 100 includes an instruction cache 102 that caches macroinstructions of an instruction set architecture, such as the x86 instruction set architecture or the like. Additional or alternative instruction set architectures are contemplated. The microprocessor 100 includes an instruction translator 104 that receives and translates the macroinstructions into microinstructions. The microinstructions are then provided to a register alias table (RAT) 106, which generates microinstruction dependencies and issues the microinstructions in program order to reservation stations 108 and to a reorder buffer (ROB) 110 via instruction path 107. The microinstructions issued from the RAT 106 (ISSUE INST) may typically be referred to as microinstructions, but are more generally referred to herein simply as "instructions." The ROB 110 stores an entry for every instruction issued from the RAT 106. The reservation stations 108 dispatches the instructions to an appropriate one of multiple execution units 112.
- The execution units 112 may include one or more integer execution units, such as an integer arithmetic/logic unit (ALU) 114 or the like, one or more floating point execution units 116, such as including a single-instruction-multiple-data (SIMD) execution unit such as MMX and SSE units or the like, a memory order buffer (MOB) 118, etc. The MOB 118 generally handles memory type instructions to a system memory 120, such as including a load instruction execution pipe 117 and a similar store instruction execution pipe 119. The system memory 120 may be interfaced with the MOB 118 via a data cache (e.g., L2 data cache, not shown) and a bus interface unit (BIU, not shown). The execution units 112 provide their results to the ROB 110, which ensures in-order retirement of instructions.
- In one embodiment, the reservation stations 108 includes multiple RS queues, in which each RS queue schedules and dispatches corresponding issued instructions to corresponding execution units 112 when the instructions are ready to be executed. In general, a separate RS queue is provided for each execution unit 112. For example, an RS Q1 122 is provided for the integer execution unit 114 and an RS Q2 124 is provided for the floating point execution unit 116. In one embodiment, a LD RS Q 126 provides load instructions to the load pipeline 117, and a separate ST RS Q 128 provides store instructions to the store pipeline 119. Each RS Q of the reservation stations 108 may alternatively be referred to as a scheduler that includes schedule logic or the like (not shown) that schedules dispatch of issued instructions to the corresponding execution unit 112.
- FIG. 2 is a diagram depicting, in simplified manner, an O-O-O instruction sequence 250 in contrast to an in-order instruction sequence 200 according to a conventional configuration, illustrating a collision and corresponding consequences. The in-order instruction sequence 200 begins with a store instruction, which is divided into a store address (STA) micro-operation (μop) 202 followed by a store data (STD) μop 206 for storing data D at a memory location 204 in the system memory 120 with address ADD. The store μops are followed by a load (LD) instruction or LD μop 208 for loading the data D from the memory location 204 at address ADD into a storage location, such as a register or the like. Since the instructions are performed in the proper order, the correct data D is stored at the memory location 204 by the time the LD μop 208 is executed, so that the load result is correctly achieved by loading the correct data D.
- The O-O-O instruction sequence 250 also begins with the STA micro-operation (μop) 202. In this case, however, the LD μop 208 is dispatched out of order (out of program order) and before the STD μop 206, since the data D operand is not yet known. The LD μop 208 loads the data X currently stored at the memory location 204 at address ADD into a selected storage location, in which data X is, for practical purposes, not the same as data D. Since the data D was not yet available, the LD μop 208 loads the incorrect data X. As shown by arrow 210, the speculative execution of the LD μop 208 is reported back to other dependent μops depending on the LD μop 208 at the reservation stations 108, and these dependent μops may be dispatched into the pipeline of a corresponding execution unit to retrieve operands prior to execution. Eventually, as shown by arrow 212, the results of the load μop 208 are reported back to the ROB 110, which includes a "MISS WHY" routine or the like that detects the incorrect load result. In response, the ROB 110 flushes the load pipeline 117 to remove the LD μop 208, and also flushes any other execution pipeline processing any instructions dependent upon the LD μop 208 that have been dispatched. Also, as indicated by arrow 216, the LD μop 208 and the corresponding dependent instructions must ultimately be replayed after the STD μop 206 is executed to retrieve the correct data D. The flushing of the execution pipelines and the replay of the load instruction and any dependent instructions have a negative impact on performance of the microprocessor 100.
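- For illustration only, the following sketch (Python; the address 0xADD and the values "X" and "D" mirror the labels in FIG. 2 but are otherwise invented) contrasts the two orderings: executed in program order, the load returns D, while the speculative load executed before the STD μop returns the stale value X and must be flushed and replayed.
```python
# Illustrative model of the FIG. 2 scenario.
memory = {0xADD: "X"}                 # stale value X at address ADD

# In-order sequence 200: STA, then STD, then LD.
memory[0xADD] = "D"                   # STD uop 206 writes the correct data D
in_order_result = memory[0xADD]       # LD uop 208 reads "D" (correct)

# O-O-O sequence 250: LD dispatched before the STD uop.
memory = {0xADD: "X"}                 # store data not yet written
speculative_result = memory[0xADD]    # LD uop 208 reads "X" (incorrect)
memory[0xADD] = "D"                   # STD uop 206 executes later
replayed_result = memory[0xADD]       # replayed LD reads "D" after the flush

print(in_order_result, speculative_result, replayed_result)  # D X D
```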
- FIG. 3 is a simplified block diagram of the load pipeline 117 receiving instructions dispatched from the LD RS Q 126 within the reservation stations 108 and corresponding load execution stages according to one embodiment. The reservation stations 108 is shown including multiple RS Q's or schedulers as previously described. The load pipeline 117 is divided into multiple sequential stages, shown as a D stage 302, a Q stage 304, an R stage 306, an A(I) stage 308, a B stage 310, a C stage 312, an E1 stage 314 and an E2 stage 316. The stages are separated by corresponding sets of synchronous latches 318 or the like for transferring or propagating data and information through the load pipeline 117 synchronous with a common clock signal (not shown). Vertical dashed lines and corresponding boxes are drawn between sequential stages to depict stage boundaries.
- Stage D 302 is an issue stage which is common to the pipelines of each of the execution units 112, in which instructions are issued from the RAT 106 to the scheduler RS Q's within the reservation stations 108. Stages Q, R and A(I) 304, 306 and 308 are the operand stages, in which a selected load instruction is dispatched for execution and the operands for the selected instruction are determined prior to actual execution. The remaining stages B, C, E1 and E2 are the load execution stages for executing the dispatched load instruction. When a valid load instruction is dispatched from the LD RS Q 126 at stage Q, a dispatch valid signal DV(Q) is asserted. Each stage generates corresponding dispatch tags DT(Q), DT(R), DT(I), DT(B), DT(C) as the load instruction propagates through the load pipeline 117.
- In stage A(I) 308, select logic 320 selects from among possible sources (e.g., a register, one of several types of constants, a memory address, etc.) of the operands for the load instructions to determine both a first source SRCA and a second source SRCB, which are provided to respective inputs of an address generation unit (AGU) 322. As an example of the select logic 320, a first multiplexer (MUX) 319 selects from among the possible sources to provide the first source SRCA, and a second MUX 321 selects from among the possible sources to provide the second source SRCB, both provided to inputs of the AGU 322. The AGU 322 outputs a corresponding load instruction virtual address (LDVA) for accessing the system memory 120. It is understood that the LDVA may be converted to a physical address for accessing the system memory 120. It should be noted that the LDVA may be only part of the virtual address of the load instruction. For example, a 12-bit AGU 322 calculates LDVA[11:0], which is identical to bits [11:0] of the physical address. If LDVA[11:0] is the same as the corresponding bits of a store instruction virtual address (STVA) of a previously dispatched store instruction whose store data has not yet been determined, then the physical addresses of the LD instruction and the ST instruction are probably the same, which means a load-store collision is detected. In other embodiments, more bits of the virtual address may be calculated and compared to improve the accuracy.
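- A short sketch of the partial-address compare described in this example (Python; the mask value reflects the 12-bit case above, and the helper name is invented). Because bits [11:0] of the virtual and physical address are the same for a 4 KB page, a match on those bits indicates a probable collision at the cost of occasional false matches; comparing more bits reduces the false matches.
```python
# Sketch of the low-order-bit compare performed by a 12-bit AGU.
PAGE_OFFSET_MASK = 0xFFF  # bits [11:0], identical in the virtual and physical address

def probable_collision(ldva: int, stva: int, mask: int = PAGE_OFFSET_MASK) -> bool:
    """Report a probable load-store collision when the compared address bits match."""
    return (ldva & mask) == (stva & mask)

print(probable_collision(0x0000_3ABC, 0x0007_9ABC))  # True  -> same [11:0] bits, probable collision
print(probable_collision(0x0000_3ABC, 0x0000_3AB0))  # False -> different [11:0] bits, no collision
```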
- The LDVA is shown provided through the synchronous latches 318 to stage B 310 to initiate execution. In stage B 310, the latched version of the LDVA is provided to an input of a load-store queue (LoStQ) 326, which develops a clear signal CLR to invalidate the speculatively dispatched load instruction when a match occurs between the LDVA and a store instruction virtual address (STVA) of at least one previously dispatched store instruction whose store data has not yet been determined. The detail of the LoStQ 326 will be further described herein. Also in stage B 310, load valid (LV) logic 325 asserts a load valid signal LD VALID, which is provided back to an input of broadcast (BC) logic 324 in stage D 302. The BC logic 324 is operative to forward or broadcast the LD VALID signal to one or more, up to all, of the scheduler RS Q's within the reservation stations 108. In this manner, any instruction that has been issued to the reservation stations 108 and that is dependent upon the LD instruction may be scheduled for dispatch into corresponding execution units. Generally, these dependent instructions are not dispatched until the LD VALID signal is provided, to ensure proper execution. It is worth noting that the AGU 322 provides the LDVA of the load instruction as early as one of the operand stages (e.g., stage A(I) 308), that is, before the load instruction has progressed to the execution stages (e.g., stage B 310) that produce the LD VALID signal.
- The CLR signal from the LoStQ 326 is provided back to an input of KILL logic 328 shown within the BC logic 324. In general, the KILL logic 328 is operative to prevent the LD VALID signal from being broadcasted by the BC logic 324 to the scheduler RS Q's of the reservation stations 108. In one embodiment, the KILL logic 328 is separately added to prevent broadcast of the LD VALID signal. In another embodiment, the BC logic 324 includes disable logic or the like incorporated within the BC logic 324, in which the BC logic 324 passes the LD VALID signal to the reservation stations 108 only if the CLR signal is not asserted to disable broadcast.
- In one alternative embodiment, the KILL logic 328 may be provided external to the BC logic 324, in which the KILL logic 328 either passes or blocks the broadcast of the LD VALID signal based on the state of CLR. In another alternative embodiment, the KILL logic 328 is distributed among the scheduler RS Q's of the reservation stations 108. In this case, the CLR signal may instead be provided to each of the scheduler RS Q's within the reservation stations 108, in which case assertion of the CLR signal prevents the broadcasted LD VALID signal from changing operative signals, bits or values within each RS Q.
-
- FIG. 4 is a more detailed block diagram of the LoStQ 326 according to one embodiment of the present invention. The AGU 322 develops and passes each load virtual address LDVA to stage B 310 (via a corresponding set of latches 318) as previously described. The LoStQ 326 comprises comparator logic. The LoStQ 326 includes multiple entries individually labeled ENTnn, in which "nn" denotes an entry number. In the illustrated embodiment, the LoStQ 326 includes 16 entries ENT00, ENT01, ..., ENT15 (or ENT00-ENT15). The details of the first entry ENT00 are shown, in which it is understood the remaining entries ENT01-ENT15 are each substantially identical to the first for comparing up to 16 virtual addresses at a time.
- The LoStQ 326 receives virtual addresses of one or more previously dispatched store instructions in which the corresponding data values to be stored have not yet been determined or provided. As understood by those skilled in the art, a store instruction is divided into a STA μop for determining the address of the system memory 120 and a corresponding STD μop for determining the corresponding data value. The STA and STD μops are processed within the store pipeline 119 of the MOB 118. In one embodiment, the LoStQ 326 only receives the virtual addresses determined by one or more previously dispatched STA μops in which the corresponding STD μop is yet to be processed by the store pipeline 119. In another embodiment, the LoStQ 326 receives all the virtual addresses from the store pipeline 119. For example, the ST RS Q 128 also includes 16 entries, which hold 16 store instructions that are then dispatched to the store pipeline 119. In such a case, the entries of the LoStQ 326 respectively correspond to the entries of the ST RS Q 128 and receive all the store instruction virtual addresses (STVA0-STVA15) from the store pipeline 119. As described further herein, the LoStQ 326 identifies a collision between the LDVA of a speculatively dispatched load instruction and any of the virtual addresses of store instructions. When a collision is detected, the LoStQ 326 asserts the CLR signal to suppress detection of the LD VALID signal by the reservation stations 108 to prevent premature dispatch of any instructions dependent upon the speculatively dispatched load instruction.
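- One way to picture the bookkeeping implied here (a Python sketch with invented field and method names; the disclosure leaves the exact mechanism to the store pipeline 119): an entry becomes eligible for comparison when its STA μop resolves the store address and ceases to be eligible once the corresponding STD μop determines the store data.
```python
# Illustrative bookkeeping for the 16 LoStQ entries.
class LoStQEntry:
    def __init__(self):
        self.stva = None           # store virtual address from the STA uop
        self.qualified = False     # address known and store data still pending

    def sta_executed(self, stva):
        self.stva, self.qualified = stva, True    # eligible for collision checks

    def std_executed(self):
        self.qualified = False                    # data determined; no longer a hazard

entries = [LoStQEntry() for _ in range(16)]
entries[0].sta_executed(0x2040)                                # older store's address resolves
print(any(e.qualified and e.stva == 0x2040 for e in entries))  # True  -> a load to 0x2040 collides
entries[0].std_executed()                                      # store data becomes available
print(any(e.qualified and e.stva == 0x2040 for e in entries))  # False -> no longer a hazard
```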
- The first entry ENT00 includes a comparator 402 that compares the virtual address LDVA of the speculatively dispatched LD instruction with a store instruction virtual address STVA0 of the STA μop received by the first entry ENT00. If the virtual addresses are the same, then the comparator 402 asserts a match signal M0 to one input of AND logic 404 (in which "&" denotes a logical AND function). The first entry ENT00 also includes AND logic 406 that receives a qualify valid signal QV0 from the store pipeline 119 that indicates whether the virtual address STVA0 is valid. The first entry ENT00 also includes OR logic 408 that receives qualify condition signals QC0A and QC0B from the store pipeline 119, in which either qualify condition signal asserted true indicates validity of STVA0. The qualify condition signals QC0A and QC0B are logically OR'd together and then logically AND'd with QV0 to determine a store valid signal VE0 indicating the validity of STVA0. Validity means that STVA0 corresponds to a store instruction that is dispatched earlier than the speculatively dispatched load instruction and the corresponding store data of the store instruction has not yet been determined. VE0 is provided to the other input of the AND logic 404, which outputs a first clear signal CL0 for the first entry ENT00. Each of the entries ENT00-ENT15 outputs a corresponding one of 16 clear signals CL0-CL15, which are provided to respective inputs of OR logic 410, which outputs the CLR signal.
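- The per-entry logic amounts to a few Boolean equations; the sketch below (Python) mirrors the signal names used above for entry ENT00 and the final OR across all entries.
```python
# Boolean model of one LoStQ entry and the final OR reduction.
def entry_clear(ldva, stva, qv, qca, qcb):
    m = (ldva == stva)         # comparator 402: match signal (M0 for ENT00)
    ve = qv and (qca or qcb)   # AND logic 406 / OR logic 408: store valid signal (VE0)
    return m and ve            # AND logic 404: preliminary clear signal (CL0)

def clr(ldva, entries):
    """OR logic 410: CLR is asserted when any entry's preliminary clear is asserted."""
    return any(entry_clear(ldva, stva, qv, qca, qcb) for stva, qv, qca, qcb in entries)

entries = [(0x10, True, True, False),   # ENT00: valid, one qualify condition set
           (0x20, True, False, False)]  # ENT01: neither qualify condition set
print(clr(0x10, entries))  # True  -> ENT00 matches and is valid/qualified
print(clr(0x20, entries))  # False -> the matching entry is not qualified
```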
- In operation of the LoStQ 326, if the virtual address LDVA matches any valid one of the up to 16 store virtual addresses STVA0-STVA15 provided to the entries ENT00-ENT15, respectively, then a collision is detected and CLR is asserted in the B stage 310. If CLR is asserted, then when the speculatively dispatched load instruction reaches the B stage 310 and the LD VALID signal is asserted, the KILL logic 324 suppresses or blocks the assertion of the LD VALID signal from being detected by the reservation stations 108. In this manner, any instruction located in any one of the RS queues of the reservation stations 108 that is dependent upon the load instruction is not dispatched prematurely for execution. Thus, the execution pipelines to which the dependent instructions would otherwise have been prematurely dispatched need not be flushed, and the corresponding instructions need not be replayed. Eventually, the load instruction is replayed non-speculatively in program order, avoiding the potential for collision. If CLR is not asserted before LD VALID or coincidentally with LD VALID, then a collision is not detected and the speculatively dispatched load instruction is allowed to execute to completion.
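- As an illustrative software analogy (again, not the actual circuit), the effect of the KILL logic 324 on the reservation stations can be expressed as LD VALID gated by the inverse of CLR:

```cpp
// Sketch of the gating performed by the KILL logic 324: when CLR is asserted,
// the reservation stations never observe LD VALID, so instructions dependent
// on the speculative load remain queued rather than being dispatched and
// later flushed and replayed.
bool ld_valid_seen_by_reservation_stations(bool ld_valid, bool clr) {
    return ld_valid && !clr;
}
```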
- In one embodiment, the store virtual addresses STVA0-STVA15 and the corresponding qualify valid signals QV0-QV15 and qualify condition signals QC0A/B-QC15A/B may be provided directly from the store pipeline 119. Alternatively, when timing or routing issues prevent a direct interface, intermediate memory devices, such as registers or latches or the like, may be provided between the store pipeline 119 and the LoStQ 326. Alternatively, as shown by the exemplary entry ENTnn in FIG. 5, each of the entries ENT00-ENT15 of the LoStQ 326 may incorporate memory devices 502 for storing the store instruction virtual addresses and the qualify signals. The store instruction virtual addresses and the qualify signals may still be generated within the store pipeline 119, but are copied over to the memory devices 502. The memory devices 502 may be registers or latches or the like.
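- As a rough software analogy of the FIG. 5 alternative (type and member names are hypothetical), each entry keeps its own latched copy of the store pipeline outputs, updated once per clock, instead of reading those signals directly:

```cpp
#include <cstdint>

// Hypothetical model of an entry with local storage (memory devices 502).
struct LatchedEntry {
    uint64_t stva = 0;
    bool qv = false;
    bool qcA = false;
    bool qcB = false;

    // Called once per clock cycle: capture the current store pipeline outputs.
    void latch(uint64_t new_stva, bool new_qv, bool new_qcA, bool new_qcB) {
        stva = new_stva;
        qv = new_qv;
        qcA = new_qcA;
        qcB = new_qcB;
    }
};
```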
- It is understood that the LoStQ 326 may detect and indicate a collision that does not, in fact, exist. In this case, any benefit of speculatively dispatching the load instruction is lost, and a slight performance drop may result from delaying the dependent instructions. Nonetheless, the statistical occurrence of false detections is relatively small, so that the performance benefits of detecting actual collisions significantly outweigh the slight performance drop caused by false detections. - The foregoing description has been presented to enable one of ordinary skill in the art to make and use the present invention as provided within the context of a particular application and its requirements. Although the present invention has been described in considerable detail with reference to certain preferred versions thereof, other versions and variations are possible and contemplated. Various modifications to the preferred embodiments will be apparent to one skilled in the art, and the general principles defined herein may be applied to other embodiments. For example, the circuits described herein may be implemented in any suitable manner including logic devices or circuitry or the like.
- Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described herein, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed.
Claims (20)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510229378.3A CN104808996B (en) | 2015-05-07 | | The system and method for reducing load-memory contention punishment in processing engine |
CN201510229378.3 | 2015-05-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
US20160328237A1 (en) | 2016-11-10
Family
ID=53693849
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/719,320 Abandoned US20160328237A1 (en) | 2015-05-07 | 2015-05-22 | System and method to reduce load-store collision penalty in speculative out of order engine |
Country Status (2)
Country | Link |
---|---|
US (1) | US20160328237A1 (en) |
EP (1) | EP3091433B1 (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160378496A1 (en) * | 2015-06-26 | 2016-12-29 | Microsoft Technology Licensing, Llc | Explicit Instruction Scheduler State Information for a Processor |
US9946548B2 (en) | 2015-06-26 | 2018-04-17 | Microsoft Technology Licensing, Llc | Age-based management of instruction blocks in a processor instruction window |
US9952867B2 (en) | 2015-06-26 | 2018-04-24 | Microsoft Technology Licensing, Llc | Mapping instruction blocks based on block size |
US20180196754A1 (en) * | 2017-01-12 | 2018-07-12 | International Business Machines Corporation | Temporarily suppressing processing of a restrained storage operand request |
US10169044B2 (en) | 2015-06-26 | 2019-01-01 | Microsoft Technology Licensing, Llc | Processing an encoding format field to interpret header information regarding a group of instructions |
US10191747B2 (en) | 2015-06-26 | 2019-01-29 | Microsoft Technology Licensing, Llc | Locking operand values for groups of instructions executed atomically |
US10346168B2 (en) | 2015-06-26 | 2019-07-09 | Microsoft Technology Licensing, Llc | Decoupled processor instruction window and operand buffer |
US10409599B2 (en) | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Decoding information about a group of instructions including a size of the group of instructions |
US10409606B2 (en) | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Verifying branch targets |
US10621090B2 (en) | 2017-01-12 | 2020-04-14 | International Business Machines Corporation | Facility for extending exclusive hold of a cache line in private cache |
US10705851B2 (en) | 2018-01-30 | 2020-07-07 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Scheduling that determines whether to remove a dependent micro-instruction from a reservation station queue based on determining cache hit/miss status of one ore more load micro-instructions once a count reaches a predetermined value |
US10860327B2 (en) | 2018-01-30 | 2020-12-08 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Methods for scheduling that determine whether to remove a dependent micro-instruction from a reservation station queue based on determining a cache hit/miss status of a load micro-instruction once a count reaches a predetermined value and an apparatus using the same |
US10983801B2 (en) | 2019-09-06 | 2021-04-20 | Apple Inc. | Load/store ordering violation management |
WO2022094964A1 (en) * | 2020-11-06 | 2022-05-12 | 华为技术有限公司 | Instruction processing method and graphflow apparatus |
US11422821B1 (en) | 2018-09-04 | 2022-08-23 | Apple Inc. | Age tracking for independent pipelines |
US20240020012A1 (en) * | 2022-07-13 | 2024-01-18 | SiFive, Inc. | Memory Request Combination Indication |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW202331504A (en) * | 2021-12-21 | 2023-08-01 | 美商賽發馥股份有限公司 | Store-to-load forwarding for processor pipelines |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4658354A (en) * | 1982-05-28 | 1987-04-14 | Nec Corporation | Pipeline processing apparatus having a test function |
US5754812A (en) * | 1995-10-06 | 1998-05-19 | Advanced Micro Devices, Inc. | Out-of-order load/store execution control |
US6484254B1 (en) * | 1999-12-30 | 2002-11-19 | Intel Corporation | Method, apparatus, and system for maintaining processor ordering by checking load addresses of unretired load instructions against snooping store addresses |
US20030017732A1 (en) * | 2001-07-20 | 2003-01-23 | Hsu Hugh Chi | Electrical card connector having polarization mechanism |
US20030208665A1 (en) * | 2002-05-01 | 2003-11-06 | Jih-Kwon Peir | Reducing data speculation penalty with early cache hit/miss prediction |
US20110040955A1 (en) * | 2009-08-12 | 2011-02-17 | Via Technologies, Inc. | Store-to-load forwarding based on load/store address computation source information comparisons |
US20150002668A1 (en) * | 2011-12-21 | 2015-01-01 | Deka Products Limited Partnership | Flow Meter |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100049952A1 (en) * | 2008-08-25 | 2010-02-25 | Via Technologies, Inc. | Microprocessor that performs store forwarding based on comparison of hashed address bits |
- 2015
- 2015-05-22 US US14/719,320 patent/US20160328237A1/en not_active Abandoned
- 2015-11-25 EP EP15196163.8A patent/EP3091433B1/en active Active
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10409606B2 (en) | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Verifying branch targets |
US9946548B2 (en) | 2015-06-26 | 2018-04-17 | Microsoft Technology Licensing, Llc | Age-based management of instruction blocks in a processor instruction window |
US9952867B2 (en) | 2015-06-26 | 2018-04-24 | Microsoft Technology Licensing, Llc | Mapping instruction blocks based on block size |
US20160378496A1 (en) * | 2015-06-26 | 2016-12-29 | Microsoft Technology Licensing, Llc | Explicit Instruction Scheduler State Information for a Processor |
US10169044B2 (en) | 2015-06-26 | 2019-01-01 | Microsoft Technology Licensing, Llc | Processing an encoding format field to interpret header information regarding a group of instructions |
US10175988B2 (en) * | 2015-06-26 | 2019-01-08 | Microsoft Technology Licensing, Llc | Explicit instruction scheduler state information for a processor |
US10191747B2 (en) | 2015-06-26 | 2019-01-29 | Microsoft Technology Licensing, Llc | Locking operand values for groups of instructions executed atomically |
US10346168B2 (en) | 2015-06-26 | 2019-07-09 | Microsoft Technology Licensing, Llc | Decoupled processor instruction window and operand buffer |
US10409599B2 (en) | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Decoding information about a group of instructions including a size of the group of instructions |
US10521351B2 (en) * | 2017-01-12 | 2019-12-31 | International Business Machines Corporation | Temporarily suppressing processing of a restrained storage operand request |
US20180196754A1 (en) * | 2017-01-12 | 2018-07-12 | International Business Machines Corporation | Temporarily suppressing processing of a restrained storage operand request |
US10621090B2 (en) | 2017-01-12 | 2020-04-14 | International Business Machines Corporation | Facility for extending exclusive hold of a cache line in private cache |
US10956337B2 (en) | 2017-01-12 | 2021-03-23 | International Business Machines Corporation | Temporarily suppressing processing of a restrained storage operand request |
US11366759B2 (en) | 2017-01-12 | 2022-06-21 | International Business Machines Corporation | Temporarily suppressing processing of a restrained storage operand request |
US10705851B2 (en) | 2018-01-30 | 2020-07-07 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Scheduling that determines whether to remove a dependent micro-instruction from a reservation station queue based on determining cache hit/miss status of one ore more load micro-instructions once a count reaches a predetermined value |
US10860327B2 (en) | 2018-01-30 | 2020-12-08 | Shanghai Zhaoxin Semiconductor Co., Ltd. | Methods for scheduling that determine whether to remove a dependent micro-instruction from a reservation station queue based on determining a cache hit/miss status of a load micro-instruction once a count reaches a predetermined value and an apparatus using the same |
US11422821B1 (en) | 2018-09-04 | 2022-08-23 | Apple Inc. | Age tracking for independent pipelines |
US10983801B2 (en) | 2019-09-06 | 2021-04-20 | Apple Inc. | Load/store ordering violation management |
WO2022094964A1 (en) * | 2020-11-06 | 2022-05-12 | 华为技术有限公司 | Instruction processing method and graphflow apparatus |
US20240020012A1 (en) * | 2022-07-13 | 2024-01-18 | SiFive, Inc. | Memory Request Combination Indication |
Also Published As
Publication number | Publication date |
---|---|
EP3091433A2 (en) | 2016-11-09 |
EP3091433B1 (en) | 2022-08-24 |
CN104808996A (en) | 2015-07-29 |
EP3091433A3 (en) | 2017-03-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3091433B1 (en) | System and method to reduce load-store collision penalty in speculative out of order engine | |
KR100819232B1 (en) | In order multithreading recycle and dispatch mechanism | |
US7822951B2 (en) | System and method of load-store forwarding | |
US7685410B2 (en) | Redirect recovery cache that receives branch misprediction redirects and caches instructions to be dispatched in response to the redirects | |
US7721071B2 (en) | System and method for propagating operand availability prediction bits with instructions through a pipeline in an out-of-order processor | |
US8464029B2 (en) | Out-of-order execution microprocessor with reduced store collision load replay reduction | |
US8627044B2 (en) | Issuing instructions with unresolved data dependencies | |
US20120023314A1 (en) | Paired execution scheduling of dependent micro-operations | |
JP3577052B2 (en) | Instruction issuing device and instruction issuing method | |
US5872986A (en) | Pre-arbitrated bypassing in a speculative execution microprocessor | |
EP3171264B1 (en) | System and method of speculative parallel execution of cache line unaligned load instructions | |
US10203957B2 (en) | Processor with improved alias queue and store collision detection to reduce memory violations and load replays | |
US11836498B1 (en) | Single cycle predictor | |
CN108196884A (en) | Utilize the computer information processing device for generating renaming | |
US10776123B2 (en) | Faster sparse flush recovery by creating groups that are marked based on an instruction type | |
CN108415730B (en) | Microinstruction scheduling method and device using the same | |
US20240045695A1 (en) | Prediction unit that provides a fetch block descriptor each clock cycle | |
US20080244244A1 (en) | Parallel instruction processing and operand integrity verification | |
US6708267B1 (en) | System and method in a pipelined processor for generating a single cycle pipeline stall | |
US20060184772A1 (en) | Lookahead mode sequencer | |
CN108279928B (en) | Microinstruction scheduling method and device using the same | |
CN104808996B (en) | The system and method for reducing load-memory contention punishment in processing engine | |
US6535973B1 (en) | Method and system for speculatively issuing instructions | |
US20150082006A1 (en) | System and Method for an Asynchronous Processor with Asynchronous Instruction Fetch, Decode, and Issue | |
JP2021168036A (en) | Arithmetic processing unit |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: VIA ALLIANCE SEMICONDUCTOR CO., LTD., CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:DI, QIANLI;WANG, JIANBIN;GAO, XIN YU;REEL/FRAME:035694/0974 Effective date: 20141223 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |