US20020046334A1 - Execution of instructions that lock and unlock computer resources - Google Patents
Execution of instructions that lock and unlock computer resources
- Publication number
- US20020046334A1 (application US09/941,142)
- Authority
- US
- United States
- Prior art keywords
- instruction
- processor
- stage
- cache
- canceled
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3834—Maintaining memory consistency
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30021—Compare instructions, e.g. Greater-Than, Equal-To, MINMAX
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F9/30087—Synchronisation or serialisation instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3861—Recovery, e.g. branch miss-prediction, exception handling
Definitions
- the present invention relates to execution of instructions that lock and unlock computer resources.
- Examples of instructions that lock and unlock computer resources are a test-and-set instruction and a swap instruction, and a cas (compare and swap) instruction.
- a test-and-set instruction reads a memory location (to perform a test) and also writes the memory location (to perform a “set” operation). This instruction is used to implement semaphores and other software synchronization mechanisms.
- a swap instruction swaps the contents of a memory location and a register.
- a cas instruction compares a memory location with a register R1, stores the memory location value in register R2, and if the comparison was successful, the instruction also stores the previous value of the register R2 in the memory location.
- Each of these instructions involves reading and writing a memory location.
- test-and-set and swap instructions are implemented as atomic instructions. These instructions lock the memory location during the reading operation to prevent other processors from writing the location. The location is unlocked when the memory location is written.
- Some embodiments of the present invention allow fast execution of instructions that lock and unlock computer resources.
- an instruction is allowed to lock a computer resource before it becomes known whether the instruction will be executed to completion or canceled. By the time the instruction processing is complete, the resource becomes unlocked whether or not the instruction is canceled.
- An instruction may have to be canceled if, for example, a trap condition occurs while the instruction is being executed. If the instruction is canceled after locking a computer resource but before unlocking the resource, the resource may become permanently locked, which is undesirable.
- an instruction is allowed to lock a resource before it is determined whether the instruction will be executed to completion or canceled. Later in the instruction processing, the resource is unlocked even if the instruction is canceled, and even if the fact that the instruction is canceled is established by the processor before the instruction has unlocked the resource.
- the instruction is allowed to read the memory location before it is known whether the instruction will be canceled. Performing the reading operation early speeds up the instruction execution.
- the determination of whether or not an instruction is to be canceled is made before the pipeline stage or stages in which the instruction results are written to their destinations (e.g., architecture register or memory). If an instruction is canceled, writing to the destination(s) is suppressed. However, the instruction still goes through all the pipeline stages at least up to, and including, the stage in which the resource is unlocked.
- the instruction goes through all the pipeline stages, but writing to the destinations is suppressed.
- the processor shares a cache with one or more other processors.
- the resource being locked is a cache memory location.
- FIG. 1 is a block diagram of a multi-processor system according to the present invention.
- FIG. 2 illustrates an instruction execution pipeline of a processor of FIG. 1.
- FIG. 3 is a block diagram of one embodiment of a processor of FIG. 1.
- FIG. 4 is a block diagram of a load/store unit for one embodiment of the processor of FIG. 3.
- FIG. 5 illustrates entries in load and store buffers of FIG. 4.
- a multiprocessor system 110 includes two processors (CPUs) 120 . 1 , 120 . 2 which share a two-port data cache unit (DCU) 130 .
- Each CPU 120 accesses the DCU through a respective one of the DCU ports.
- DCU 130 caches data from memory 140 .
- Data cache 130 includes a set-associative cache memory 130 M and control logic (not shown) to access the cache memory. Such caches are known in the art. Each cache set 130 L in memory 130 M can store a number of data words W0, W1 . . . (thirty-two 32-bit words in some embodiments). In addition, each cache set 130 L includes a lock bit L which indicates whether the cache set is locked, and a processor bit P which indicates which CPU has locked the cache set. When the cache set is locked, the cache set can be accessed only from the port connected to the CPU that has locked the cache set. The other CPU is not allowed to read or write the cache set or those memory 140 locations whose contents are cached in the cache set.
- In some embodiments, when a cache set 130L is locked by one CPU, the other CPU is allowed to read the cache set but not to write it.
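The lock bit L and processor bit P can be pictured with a short Python sketch. The `CacheSet` class, its field names, and the `readers_allowed` switch are hypothetical names chosen for illustration; the switch selects between the two access policies described above (no access at all for the other CPU, or read-only access).

```python
from dataclasses import dataclass, field


@dataclass
class CacheSet:
    """Toy model of one cache set (names are illustrative, not from the patent)."""
    words: list = field(default_factory=lambda: [0] * 32)  # W0, W1, ...
    lock: bool = False  # lock bit L: is the set locked?
    owner: int = 0      # processor bit P: which CPU locked the set

    def may_access(self, cpu, is_write, readers_allowed=False):
        """Return True if `cpu` may perform the access."""
        if not self.lock or self.owner == cpu:
            return True  # unlocked, or locked by the requesting CPU
        # Locked by the other CPU: optionally allow reads only.
        return readers_allowed and not is_write
```

For example, with `CacheSet(lock=True, owner=0)`, CPU 1 is refused both reads and writes under the default policy, and is allowed reads only when `readers_allowed=True`.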
- FIG. 2 illustrates the instruction execution pipeline for a single CPU 120 in some embodiments.
- the two CPUs are identical, but the pipeline stages do not have to be synchronized between the CPUs.
- a pipeline disruption of one CPU does not affect the other CPU's pipeline.
- F is an instruction fetch stage.
- A is an alignment stage for embodiments in which the CPUs are VLIW (very long instruction word) processors.
- each instruction may include a number of sub-instructions executed in parallel by different execution units.
- In the alignment stage A, the sub-instructions are aligned before the respective execution units.
- each execution unit decodes its respective sub-instruction and reads operands from register file 150 (FIG. 1).
- In execution stages E, C(A1), A2, A3, the sub-instructions are executed.
- In stage T, trap events are handled.
- E stands for effective address calculation, C for cache access, and A1, A2, A3 for annex 1, 2, 3.
- some stages may be unnecessary for instruction execution, but are inserted as padding to delay the trap stage T so that there are always three clock cycles between stages E and T.
- The operations performed in stages E, C(A1), A2, A3 vary from instruction to instruction. For example, some instructions (such as NOP) do not perform effective address calculation.
- In stage WB, the instruction results are written to their destinations, which may include register file 150 (FIG. 1), DCU 130, memory 140 (if the destination is a non-cacheable memory location), or other devices or bus lines.
- In stage T, the processor's pipe control unit (PCU) 160 (FIG. 1) generates a “trap” signal indicating whether the VLIW instruction (and hence all its sub-instructions) has to be canceled due to a trap condition caused by the instruction itself or by an interrupt.
- An instruction (say, instruction “I1”) can also be canceled by a trap condition caused by a previous instruction “I2” if the executions of I1 and I2 overlap.
- In that case, the trap condition caused by “I2” causes the trap signal to be asserted in the T stage of I2, which coincides with an earlier pipeline stage of instruction I1. Trap conditions are listed in Addendum 1 at the end of this description for some embodiments.
- each CPU 120 has a register file and a PCU, but in FIG. 1 the register file and the PCU are shown only for CPU 120 . 1 for simplicity.
- Addendum 2 is a pseudocode listing illustrating execution of an atomic instruction by a CPU 120. We will describe Addendum 2 with reference to CPU 120.1. Execution of an atomic instruction by CPU 120.2 is similar.
- At step 310, CPU 120.1 issues a load-with-lock request to DCU 130.
- This is done as follows.
- Each CPU is connected to its respective DCU port by a bus 170 (FIG. 1). Only the bus 170 for CPU 120 . 1 is shown in detail.
- Each bus 170 includes address lines 170 A, data lines 170 D, read/write line 170 RW, lock line 170 L, unlock line 170 U, and no_store line 170 NS.
- CPU 120 . 1 drives the address lines 170 A of its bus 170 with the address of the data to be loaded (the address in memory 140 ), and drives a read signal on read/write line 170 RW.
- the CPU asserts the lock line 170 L to cause the DCU to lock the cache set being read.
- Step 310 can be performed before the instruction's T stage, that is, before it becomes known whether or not the instruction will be canceled.
- If the data requested at step 310 are in the cache, and the cache set has not been locked by CPU 120.2, the DCU returns the data on lines 170D. Otherwise, the DCU asserts appropriate control signals (not shown) to CPU 120.1 to signal that the cache set is locked or that the data are not in the cache, as the case may be. If the data are not in the cache, CPU 120.1 issues a request to bus interface unit (BIU) 180 to fetch the data from memory 140. BIU 180 fetches the data via bus 190. When the data are fetched, they are cached in a cache set 130L in DCU 130 and are also provided to CPU 120.1. In addition, the lock bit L is set in the cache set, and the processor bit P is made to indicate CPU 120.1.
- At step 320, CPU 120.1 calculates a store condition “COND” which determines whether the memory 140 location read at step 310 has to be written by the instruction. Step 320 is omitted for some instructions, such as swap, for which the memory location is written unconditionally.
- Step 350 is completed in the WB stage (though this step may start before the WB stage in some embodiments).
- This step includes steps 350 A, 350 B.
- CPU 120 . 1 issues a store request to DCU 130 , driving the store address on lines 170 A, the store data on lines 170 D, and the write signal on line 170 RW, as known in the art.
- CPU 120 . 1 asserts the unlock line 170 U to cause the DCU to unlock the cache set 130 L.
- CPU 120 . 1 drives the no_store line 170 NS with a signal indicating whether the store data are to be actually written to the cache set. The data will not be written if, and only if: (1) “trap” was asserted in the T stage or earlier stage of the instruction, or (2) the condition COND is false.
- Whether or not no_store is asserted, DCU 130 will reset the L bit to unlock the cache set.
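The flow just described (load-with-lock at step 310, the optional store condition at step 320, and store-with-unlock at step 350) can be sketched in Python. This is a simplified model, not the patent's Addendum 2: `ToyDCU`, `run_atomic`, and the per-address lock tables are hypothetical, locking is tracked per address rather than per cache set, cache misses are not modeled, and the `trap` flag is passed in up front even though in the pipeline it only becomes known at the T stage.

```python
class ToyDCU:
    """Toy data cache unit (hypothetical): one lock bit and owner per address."""

    def __init__(self, memory):
        self.memory = dict(memory)  # stand-in for cached contents of memory 140
        self.lock_bit = {}          # address -> lock bit L
        self.proc_bit = {}          # address -> processor bit P

    def load_with_lock(self, cpu, addr):
        """Read with the lock line asserted; None if the other CPU holds the lock."""
        if self.lock_bit.get(addr) and self.proc_bit[addr] != cpu:
            return None
        self.lock_bit[addr] = True
        self.proc_bit[addr] = cpu
        return self.memory[addr]

    def store_with_unlock(self, cpu, addr, value, no_store):
        """Write unless no_store is asserted; reset the L bit in either case."""
        if not no_store:
            self.memory[addr] = value
        self.lock_bit[addr] = False


def run_atomic(dcu, cpu, addr, new_value, cond, trap):
    """Steps 310, 320 (skipped when cond is None), and 350 of an atomic instruction."""
    old = dcu.load_with_lock(cpu, addr)            # step 310: may precede stage T
    if old is None:
        return None                                # locked elsewhere; retry not modeled
    cond_ok = True if cond is None else cond(old)  # step 320
    no_store = trap or not cond_ok                 # data suppressed on trap or failed COND
    dcu.store_with_unlock(cpu, addr, new_value, no_store)  # step 350
    return old
```

The point of the sketch is the last call: the unlock happens on every path that acquired the lock, so a canceled instruction cannot leave the location locked.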
- the instruction reads a memory location M[rs2] whose address is stored in register rs2. This location is in memory 140 . (The instruction definition of Addendum 3 does not depend on the presence of a cache.)
- the contents temp_rs2 of the memory location are compared with the contents r[rs1] of register rs1. If the comparison is successful, the memory location M[rs2] is written with the contents r[rd] of register rd.
- the register rd is written with the memory location contents temp_rs2 (step 430 ) fetched at step 410 .
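The architectural effect of cas described above can be written as a few lines of Python. This is a sketch of the semantics only, not the patent's Addendum 3 listing: the dictionaries `M` (memory) and `r` (registers) are illustrative, and a "successful" comparison is assumed to mean equality.

```python
def cas(M, r, rs1, rs2, rd):
    """Compare-and-swap on a dict-based memory M and register file r (illustrative)."""
    temp_rs2 = M[r[rs2]]     # step 410: read M[rs2]; rs2 holds the address
    if temp_rs2 == r[rs1]:   # compare with r[rs1] (equality is an assumption)
        M[r[rs2]] = r[rd]    # on success, write r[rd] to the memory location
    r[rd] = temp_rs2         # step 430: rd receives the old memory contents
```

After the call, `r[rd]` always holds the value that was in memory before the instruction, so software can tell whether the swap took place by comparing it with `r[rs1]`.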
- Addendum 4 illustrates execution of the cas instruction by a CPU 120 .
- the step reference numbers correspond to those of Addendum 2.
- the contents of memory location M[rs2] are fetched from cache 130 and placed into a temporary register temp_rs2.
- the cache set storing M[rs2] is locked.
- Register temp_rs2 is not an “architecture” register, that is, this register is not visible to software and can be modified even if the cas instruction will be canceled.
- register rd is read into another non-architecture register temp_rd.
- Steps 310 , 314 , 320 can be performed before the T stage. These steps can overlap or be performed in an order different from the order shown.
- Step 350, consisting of steps 350A, 350B, 350C, is to be completed after the T stage.
- a store-with-unlock is issued to the DCU to store the contents of temp_rd in the cache location that caches M[rs2].
- Step 350 B is performed as in Addendum 2.
- At step 350C, if “trap” has been deasserted in the T stage and all earlier stages of the cas instruction, the contents of temp_rs2 are written to register rd to implement step 430 of Addendum 3.
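The way a CPU might carry out cas with early locking and late cancellation can be sketched as follows. This is an illustrative model rather than the patent's Addendum 4: the function name, the `cache` dictionary, and the `locks` set are hypothetical, the comparison is assumed to be an equality test, and `trap` stands for the trap signal as known at the T stage.

```python
def execute_cas(cache, locks, r, rs1, rs2, rd, trap):
    """Toy cas execution: lock early, unlock always, suppress writes on trap."""
    addr = r[rs2]

    # Before the T stage: these only touch non-architecture state.
    temp_rs2 = cache[addr]     # step 310: load-with-lock into temp_rs2
    locks.add(addr)            # the cache set holding M[rs2] is now locked
    temp_rd = r[rd]            # copy rd into the temporary register temp_rd
    cond = temp_rs2 == r[rs1]  # step 320: compute COND (equality assumed)

    # After the T stage: store-with-unlock (steps 350A, 350B).
    no_store = trap or not cond
    if not no_store:
        cache[addr] = temp_rd
    locks.discard(addr)        # unlocked whether or not the store happened

    # Step 350C: the architecture register rd changes only if no trap occurred.
    if not trap:
        r[rd] = temp_rs2
```

With `trap=True`, neither the cache nor `r[rd]` changes, yet `locks` still ends up empty, which is the property the description emphasizes.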
- FIG. 3 illustrates one embodiment of a CPU 120 in more detail.
- CPU 120 is a VLIW processor having four execution units 610 . 0 , 610 . 1 , 610 . 2 , 610 . 3 , also labeled GFU, MFU1, MFU2, MFU3 respectively.
- GFU stands for “general functional unit”.
- MFU stands for “media functional unit”.
- the four execution units operate in parallel to execute a single VLIW instruction which may include up to four sub-instructions. Instruction cas is a sub-instruction.
- the GFU is the only execution unit that can perform memory access operations, including cas.
- the CPU fetches instructions from instruction cache 614 into instruction aligner 618 .
- instruction aligner 618 extracts up to four sub-instructions from cache 614 and aligns the sub-instructions before respective execution units 610 .
- the sub-instructions are written into instruction buffer 624 .
- units 610 decode their respective sub-instructions and, if needed, read instruction operands from respective register files RF0, RF1, RF2, RF3 which form the register file 150 .
- Each register file RF0, RF1, RF2, RF3 stores a copy of the same data.
- each execution unit 610 executes its respective sub-instruction.
- In stage WB, execution units 610 write instruction results, as explained above.
- When a VLIW instruction is in its T stage, each execution unit 610.0-610.3 generates a respective signal “trap0” through “trap3” to indicate whether the execution unit detected a trap condition. Signals trap0-trap3 are provided to PCU 160. In the same stage T, the PCU asserts the “trap” signal if, and only if, any one of signals trap0-trap3 is asserted in the T stage.
- the “trap” signal is provided to load/store unit (LSU) 640 .
- LSU 640 executes requests to access cache 130 , BIU 180 , and other devices.
- store buffer 710 (FIG. 4) is a queue of eight entries 0-7. Entry 0 is the front (bottom) of the queue, entry 7 is the back (or top).
- the store instructions are written from GFU 610 . 0 into entry 7 in the E stage. (An entry in store buffer 710 defines a store operation which we will call a “store instruction”. Similarly, an entry in load buffer 720 of LSU 640 defines a load operation which we will call a “load instruction”. These store and load instructions should not be confused with sub-instructions executed by units 610 or with VLIW instructions.)
- a store instruction is not dispatched from the store buffer to the DCU until the stage A3. (Dispatching the instruction involves providing the address, data and control signals on bus 170 of FIG. 1.)
- When a store instruction is dispatched to the DCU, the DCU writes cache memory 130M at least one cycle after the dispatch. If the instruction was dispatched at stage A3 but in stage T the “trap” signal is asserted, the instruction is canceled via a cancellation signal (not shown) sent by the LSU to the DCU in the T stage.
- “datab” field 710 D holds the store data.
- Address field 710 A (“addrb”) holds the store address which is an address in memory 140 .
- State field 710 S indicates the pipeline stage of the instruction.
- the binary encoding of the stage field is as follows:
- stage field is written at the end of the C stage and is thereafter shifted right once per clock cycle.
- Entries 4-7 of the store buffer keep all the three bits of the stage field. Entry 3 has two bits to track whether the instruction is in stage A3 or T or is past T. Entry 2 has one bit to track if the instruction is in stage T or past the T stage. Entries 0 and 1 do not have the stage field.
- the instruction type field 710 T indicates the instruction type. In particular, this field indicates whether the store is part of a cas instruction.
- One-bit load/store field 710 L is used for cas instructions to track if the cas load has been performed, as described below.
- Load buffer 720 in FIG. 4 is a queue of five entries 0-4. Entry 0 is the front of the queue, and entry 4 is the back. Load instructions are written from GFU 610 . 0 to entry 4 in the E stage. They shift through the buffer from top to bottom. Each instruction remains in the load buffer through its lifetime in the LSU, that is, even after the load request has been issued to DCU 130 . After the load data have returned from the DCU, the instruction is logically deleted from the load buffer.
- the load buffer entries can be finished (i.e. respective loads can be performed) out of order. Holes in the buffer from out-of-order completed instructions can be filled from any entry, one per clock cycle.
- a load instruction can be dispatched to the DCU in the E stage without being written to the load buffer first. However, the instruction still gets written into the load buffer.
- “addrb” field 720 A holds the load address. This is an address in memory 140 .
- the address is calculated in the E stage (the address may be equal to the sum of two operands, as known in the art.)
- Destination register specifier field 720 RD holds the address of the load destination register in register file 150 .
- RAW hazard field 720 RAW is an 8-bit vector pointing to store buffer 710 instructions which must be performed before the load instruction to avoid a RAW (read after write) hazard.
- the stores are issued in order with respect to each other.
- the loads are also issued in order with respect to each other. However, the loads are also issued in preference to the stores.
- the store instructions are dispatched only when the first load in load buffer 720 cannot be dispatched due to a RAW hazard, or when the load buffer is empty. Therefore, a RAW (read after write) hazard is a possibility, but RAR, WAR, and WAW hazards are not.
- Each bit in field 720 RAW corresponds to an entry of store buffer 710 .
- the bit is set if the instruction in the corresponding store entry must be executed before the load, and the bit is reset otherwise.
- the RAW fields 720 RAW are shifted to the right.
- a load instruction can be speculatively dispatched to DCU 130 in the E stage even though the corresponding RAW hazards are not calculated until the C stage. If the load is found to have a hazard, the load is canceled (that is, the data returned by cache 130 are discarded), and the load is retried later.
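The RAW hazard vector can be illustrated with a small bit-manipulation sketch in Python. Several details here are assumptions made for the example and are not stated in the text: bit i (least significant bit first) is mapped to store buffer entry i, a hazard is detected by a simple address match, and the vector is shifted right by one position each time the store queue advances toward the front. All function names are hypothetical.

```python
STORE_ENTRIES = 8  # store buffer 710 has entries 0-7


def raw_vector(load_addr, store_addrs):
    """Build the 8-bit RAW vector for a load.

    store_addrs[i] is the address held in store entry i, or None if empty.
    A bit is set when that store must be performed before the load
    (assumed here to mean: same address).
    """
    vec = 0
    for i, addr in enumerate(store_addrs[:STORE_ENTRIES]):
        if addr is not None and addr == load_addr:
            vec |= 1 << i
    return vec


def shift_right(vec):
    """Keep the vector aligned after the store queue moves one entry forward."""
    return vec >> 1


def load_may_dispatch(vec):
    """A load is free of RAW hazards when no bit is set."""
    return vec == 0
```

Under these assumptions, a load that depends only on the store in entry 0 becomes dispatchable after one shift, that is, once that store has left the front of the queue.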
- the load can also be canceled by a “trap” signal generated in the T or earlier stage if the load was dispatched to the DCU before the T stage. In this case, the load is not retried.
- One-bit field 720 T (“trap_taken”) is initially set to zero. This bit is set to 1 in the T or earlier stage in response to the trap signal from PCU 160 being asserted. If the bit is set, the instruction will be removed from the load buffer when the load data return, and the load data will be discarded.
- stage field 720 S has the same meaning as the field 710 S in the store buffer, and the encoding is the same.
- Entry 4 of load buffer 720 includes all the three stage bits 720 S. Entry 3 has two bits to track whether the instruction is in stage A3, T, or past T. Entry 2 has one bit to track whether the instruction is in stage T or past T. Entries 1 and 0 do not have the stage field.
- BIU list 730 is a queue of commands to be dispatched to bus interface unit 180 .
- the BIU list is written when DCU 130 returns a cache miss and when, therefore, data have to be fetched into the cache from memory 140 .
- the BIU list is also written to write the memory 140 .
- When GFU 610.0 issues a cas instruction to LSU 640, the LSU writes one entry into each of buffers 720, 710.
- the entries are shown in FIG. 5.
- the instruction type field 710 T indicates cas.
- Address field 710 A has the contents of register rs2 (Addendum 3) of the cas instruction, i.e. the memory 140 address.
- the data field 710 D has the contents of the destination register rd (Addendum 3) of the cas instruction.
- the bit 710 L is 0 to indicate that the cas load has not been performed yet.
- the address field 720 A receives the contents of register rs1 (the comparison data). See Addendum 3.
- Field 720 RD receives the address of the destination register rd (Addendum 3) of the cas instruction.
- In RAW vector 720RAW, the bit pointing to the store entry for the cas instruction is set even though the cas load is to precede the cas store. In addition, the bits corresponding to other RAW hazards, if any, are set.
- Addendum 5 describes the LSU operation in pseudocode.
- BIU list 730 has the highest priority in some embodiments. If the BIU list 730 is not empty, the LSU dispatches an operation from the BIU list (step 910 ).
- If the BIU list is empty, an operation from load buffer 720 or store buffer 710 is dispatched. If the first entry (i.e., the entry in the front of the queue) in load buffer 720 has no RAW hazard (step 920), the entry is dispatched. More particularly, the LSU dispatches to DCU 130 a load-without-lock request, that is, a read request with lock signal 170L deasserted (step 920A). When the DCU returns data on lines 170D (step 920B), the LSU passes the data to PCU 160 and GFU 610.0 on bus lsu_dc_data (FIG. 3).
- the LSU also passes to the PCU and the GFU on bus lsu_pcu_rd the destination register specifier rd from field 720 RD (FIG. 4).
- the LSU also passes to the PCU the stage bits 720 S and the trap taken bit 720 T.
- If load buffer 720 is empty, or the first entry in the load buffer has a non-zero bit in field 720RAW, the first store in store buffer 710 is dispatched (step 930). If the instruction type field 710T of this entry does not indicate a cas instruction (step 930A), then a store request is issued to DCU 130, with the lock and unlock signals 170L, 170U deasserted. The instruction is dispatched in stage A3 or later. The stage is indicated by the stage field 710S or by the position of the instruction in the store buffer.
- If the field 710T indicates a cas instruction (step 930B), the actions in Table 5-1 are performed.
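The dispatch priority described in steps 910 through 930 can be summarized in a short Python sketch. It is a simplified illustration, not the patent's Addendum 5: buffers are plain lists with index 0 as the front of each queue, entries are dictionaries with hypothetical `raw` and `type` keys, and stage checks (such as waiting for stage A3 before dispatching a store) are left out.

```python
def select_dispatch(biu_list, load_buffer, store_buffer):
    """Pick the next LSU operation; returns (kind, entry) or None if idle."""
    if biu_list:
        return ("biu", biu_list[0])          # step 910: BIU list has top priority
    if load_buffer and load_buffer[0]["raw"] == 0:
        return ("load", load_buffer[0])      # step 920: first load, no RAW hazard
    if store_buffer:
        first = store_buffer[0]              # step 930: first store in the queue
        kind = "cas" if first["type"] == "cas" else "store"
        return (kind, first)                 # 930A: plain store, 930B: cas actions
    return None
```

Because a cas load entry always carries a RAW bit pointing at its own cas store entry, this selection falls through to the store buffer and reaches the cas branch, which matches the description of handling cas through the store entry.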
- the column “CAS STAGE” indicates the pipeline stage of the cas instruction for one example. In that example, both LSU buffers were empty when the cas instruction was issued by GFU 610 . 0 . Therefore, the cas load (step 930 B1) is dispatched to DCU 130 in the E stage.
- LSU PIPE STAGE indicates the LSU pipeline stages.
- LSU 640 is pipelined, and can issue a request to the DCU on every clock cycle.
- a load request is dispatched using the store buffer 710 entry for the cas instruction.
- Using the store buffer entry rather than the load buffer entry allows utilization of the same logic as used for non-cas instructions to select an instruction for dispatch.
- the cas load entry has a RAW hazard bit set (pointing to the cas store entry). Therefore, according to the non-cas rules, the cas store must be dispatched before the cas load.
- the memory address in field 710 A is driven on the DCU address bus 170 A.
- the lock signal 170 L is asserted, and the unlock signal 170 U is deasserted.
- At step 930B2, in stage C, the DCU returns data on lines 170D (assuming a cache hit).
- the LSU sets the bit 710 L to indicate that the cas load has been performed.
- LSU pipeline stages at step 930 B2 are indicated as LC (LSU cache access) and LF (LSU finish). In the embodiment being described, these stages may or may not occur in the same clock cycle. For example, if the DCU returned a cache miss, the stage LF (data return on lines 170 D) would be performed later.
- On a cache miss, the DCU does not lock the cache set, and the LSU does not set the bit 710L.
- Instead, the LSU causes BIU 180 to fetch data from memory 140, and then reissues the load-and-lock request of step 930B1.
- the LSU 640 drives the destination register specifier rd on lines lsu_pcu_rd to PCU 160 and GFU 610 . 0 .
- the register specifier rd is taken from field 720 RD of the cas load entry (see FIG. 5).
- the LSU finds the cas load entry as the first load in the queue of load buffer 720. Indeed, the cas load is the first load in the load buffer because (1) loads have priority over stores, (2) a store is issued before a load only if the load has a RAW bit set, (3) stores are issued in order with respect to each other, and (4) loads are issued in order with respect to each other.
- the cas load entry is found by the LSU as the first load having a RAW vector 720 RAW pointing to the cas store entry.
- the LSU again finds the load buffer entry corresponding to the cas instruction, and drives the comparison data (contents of register rs1) from field 720 A (FIG. 5) to GFU 610 . 0 .
- the data from the DCU are still available on bus lsu_dc_data.
- the GFU performs the comparison step 320 (Addendum 4), and provides the result COND to LSU 640 .
- Stages LL, LI may occur in the same clock cycle or in different cycles. They occur in the same cycle in Table 5-1.
- At step 930B4 (cas stage A3 in Table 5-1, LSU stage LD), the LSU again selects an entry for dispatch to the DCU. Since the first load in the load buffer is a cas load, having a RAW bit set, a store is dispatched. The first store is still the cas store. However, since its bit 710L is set, the LSU dispatches a store to the DCU, asserting the unlock signal 170U.
- At step 930B5 (cas stage T, LSU stage LC), LSU 640 generates the signal no_store on line 170NS (FIG. 1). This signal is asserted if, and only if, the trap taken bit 720T is set (one) or COND is false. See step 350B in Addenda 2 and 4.
- At step 930B6 (cas stage WB in Table 5-1), the store operation is allowed to finish. However, if no_store was asserted at step 930B5, the DCU will not perform a store. Whether or not no_store was asserted, the DCU resets the cache set lock bit L.
- the above embodiments illustrate but do not limit the invention.
- the invention is not limited to the cas instruction. Swap, test-and-set, and other atomic instructions are used in some embodiments.
- the invention is not limited by the number of the CPUs sharing the cache 130 or by the structure of a CPU. In some embodiments, the CPUs are not identical to one another. Further, in some embodiments, non-CPU entities, for example, a DMA or a communication controller, can share the cache with the CPUs. If a cache set is locked, such entities are prevented from writing and possibly reading the cache set.
- the LSU provides an interface to non-memory devices in addition to the memory.
- an LSU is absent from at least one CPU.
- the invention is not limited to dispatching loads in preference to stores, or to any other dispatch policy.
- the invention is not limited by the type of the CPUs.
- one or more of the CPUs are non-VLIW processors.
- one or more CPUs do not have a register file.
- the memory 140 is a random access memory
- the DCU caches data from non-random access memory devices.
- a CPU issues a store-with-unlock request to DCU, to be completed at WB stage
- 930B Else (the first store buffer entry is a cas entry): see Table 5-1.
Abstract
When an atomic instruction executed by a computer processor locks a memory location, the locking is performed before the processor has determined whether the instruction is to be executed to completion or canceled. The memory location is unlocked whether or not the instruction will be canceled. Since the locking operation can occur before it is known whether the instruction will be canceled, the reading of the memory location can also occur early, before it is known whether the instruction will be canceled.
Description
- The present invention relates to execution of instructions that lock and unlock computer resources.
- Examples of instructions that lock and unlock computer resources are a test-and-set instruction, a swap instruction, and a cas (compare-and-swap) instruction. A test-and-set instruction reads a memory location (to perform a test) and also writes the memory location (to perform a “set” operation). This instruction is used to implement semaphores and other software synchronization mechanisms. A swap instruction swaps the contents of a memory location and a register. A cas instruction compares a memory location with a register R1, stores the memory location value in register R2, and, if the comparison was successful, also stores the previous value of register R2 in the memory location. Each of these instructions involves reading and writing a memory location. If, between the reading and writing operations, another instruction executed by a different processor writes the same memory location, the program executing the test-and-set, swap, or cas instruction and/or the program executed by the different processor may produce incorrect results. Therefore, these instructions are implemented as atomic instructions. They lock the memory location during the reading operation to prevent other processors from writing the location. The location is unlocked when the memory location is written.
- It is desirable to enable faster execution of instructions that lock and unlock computer resources.
- Some embodiments of the present invention allow fast execution of instructions that lock and unlock computer resources. In particular, an instruction is allowed to lock a computer resource before it becomes known whether the instruction will be executed to completion or canceled. By the time the instruction processing is complete, the resource becomes unlocked whether or not the instruction is canceled.
- An instruction may have to be canceled if, for example, a trap condition occurs while the instruction is being executed. If the instruction is canceled after locking a computer resource but before unlocking the resource, the resource may become permanently locked, which is undesirable.
- One solution to this problem is not to allow an instruction to lock a resource until it is determined that the instruction will be executed to completion. However, this delays instruction execution.
- Therefore, according to the present invention, an instruction is allowed to lock a resource before it is determined whether the instruction will be executed to completion or canceled. Later in the instruction processing, the resource is unlocked even if the instruction is canceled, and even if the fact that the instruction is canceled is established by the processor before the instruction has unlocked the resource.
- In some atomic instruction embodiments for which the resource is a memory location, the instruction is allowed to read the memory location before it is known whether the instruction will be canceled. Performing the reading operation early speeds up the instruction execution.
- In some pipelined embodiments, the determination of whether or not an instruction is to be canceled is made before the pipeline stage or stages in which the instruction results are written to their destinations (e.g., architecture register or memory). If an instruction is canceled, writing to the destination(s) is suppressed. However, the instruction still goes through all the pipeline stages at least up to, and including, the stage in which the resource is unlocked.
- In some embodiments, the instruction goes through all the pipeline stages, but writing to the destinations is suppressed.
- In some embodiments, the processor shares a cache with one or more other processors. The resource being locked is a cache memory location.
- Other features and advantages of the invention are described below. The invention is defined by the appended claims.
- FIG. 1 is a block diagram of a multi-processor system according to the present invention.
- FIG. 2 illustrates an instruction execution pipeline of a processor of FIG. 1.
- FIG. 3 is a block diagram of one embodiment of a processor of FIG. 1.
- FIG. 4 is a block diagram of a load/store unit for one embodiment of the processor of FIG. 3.
- FIG. 5 illustrates entries in load and store buffers of FIG. 4.
- A multiprocessor system 110 (FIG. 1) includes two processors (CPUs) 120.1, 120.2 which share a two-port data cache unit (DCU) 130. Each CPU 120 accesses the DCU through a respective one of the DCU ports. DCU 130 caches data from memory 140.
- Data cache 130 includes a set-associative cache memory 130M and control logic (not shown) to access the cache memory. Such caches are known in the art. Each cache set 130L in memory 130M can store a number of data words W0, W1 . . . (thirty-two 32-bit words in some embodiments). In addition, each cache set 130L includes a lock bit L which indicates whether the cache set is locked, and a processor bit P which indicates which CPU has locked the cache set. When the cache set is locked, the cache set can be accessed only from the port connected to the CPU that has locked the cache set. The other CPU is not allowed to read or write the cache set or those memory 140 locations whose contents are cached in the cache set.
- In some embodiments, if a cache set 130L is locked by one CPU, the other CPU is allowed to read the cache set but not to write the cache set.
- FIG. 2 illustrates the instruction execution pipeline for a single CPU 120 in some embodiments. In some embodiments, the two CPUs are identical, but the pipeline stages do not have to be synchronized between the CPUs. In particular, a pipeline disruption of one CPU does not affect the other CPU's pipeline.
- In FIG. 2, “F” is an instruction fetch stage. “A” is an alignment stage for embodiments in which the CPUs are VLIW (very long instruction word) processors. In a VLIW processor, each instruction may include a number of sub-instructions executed in parallel by different execution units. In the alignment stage A, the sub-instructions are aligned before the respective execution units.
- In the D/R stage (decode/register file access), each execution unit decodes its respective sub-instruction and reads operands from register file 150 (FIG. 1).
- In execution stages E, C(A1), A2, A3, the sub-instructions are executed. In stage T, trap events are handled. “E” stands for effective address calculation, “C” for cache access, A1, A2, A3 for annex.
- The operations performed during stages E, C(A1), A2, A3 vary from instruction to instruction. For example, some instructions (such as NOP) do not perform effective address calculation.
- In the write back stage WB, the instruction results are written to their destinations, which may include register file 150 (FIG. 1), DCU 130, memory 140 (if the destination is a non-cacheable memory location), or other devices or bus lines.
- In stage T, the processor's pipe control unit (PCU) 160 (FIG. 1) generates a “trap” signal indicating whether the VLIW instruction (and hence all its sub-instructions) has to be canceled due to a trap condition caused by the instruction itself or by an interrupt. An instruction “I1” can also be canceled by a trap condition caused by a previous instruction “I2” if execution of I1 and execution of I2 overlap. The trap condition caused by I2 causes the trap signal to be asserted in the T stage of I2, which is an earlier pipeline stage of instruction I1. Trap conditions are listed in Addendum 1 at the end of this description for some embodiments. If the “trap” signal is asserted in the T stage or an earlier pipeline stage of instruction I1, the I1 results are not written to the destination in the WB stage. However, the instruction I1 is allowed to proceed to the WB stage, and any cache set that has been locked by the instruction is unlocked in the WB stage.
- Additional execution stages are inserted between A3 and WB if needed.
- In some embodiments, each CPU 120 has a register file and a PCU, but in FIG. 1 the register file and the PCU are shown only for CPU 120.1 for simplicity.
- Addendum 2 is a pseudocode listing illustrating execution of an atomic instruction by a CPU 120. We will describe Addendum 2 with reference to CPU 120.1. Execution of an atomic instruction by CPU 120.2 is similar.
- At step 310, CPU 120.1 issues a load-with-lock request to DCU 130. This is done as follows. Each CPU is connected to its respective DCU port by a bus 170 (FIG. 1). Only the bus 170 for CPU 120.1 is shown in detail. Each bus 170 includes address lines 170A, data lines 170D, read/write line 170RW, lock line 170L, unlock line 170U, and no_store line 170NS.
- At step 310, CPU 120.1 drives the address lines 170A of its bus 170 with the address of the data to be loaded (the address in memory 140), and drives a read signal on read/write line 170RW. In addition, the CPU asserts the lock line 170L to cause the DCU to lock the cache set being read.
- Because the cache set can be unlocked in the WB stage even if the instruction has to be canceled, step 310 can be performed before the instruction's T stage, that is, before it becomes known whether or not the instruction will be canceled.
- If the data requested at step 310 are in the cache, and the cache set has not been locked by CPU 120.2, the DCU returns the data on lines 170D. Otherwise, the DCU asserts appropriate control signals (not shown) to CPU 120.1 to signal that the cache set is locked or the data are not in the cache, whichever the case may be. If the data are not in the cache, CPU 120.1 issues a request to bus interface unit (BIU) 180 to fetch the data from memory 140. BIU 180 fetches the data via bus 190. When the data are fetched, they are cached in a cache set 130L in DCU 130 and are also provided to CPU 120.1. In addition, the lock bit L is set in the cache set, and the processor bit P is made to indicate CPU 120.1.
- At step 320, CPU 120.1 calculates a store condition “COND” which determines whether the memory 140 location read at step 310 has to be written by the instruction. Step 320 is omitted for some instructions, such as swap, for which the memory location is written unconditionally.
- Step 350 is completed in the WB stage (though this step may start before the WB stage in some embodiments). This step includes steps 350A, 350B. At step 350A, CPU 120.1 issues a store request to DCU 130, driving the store address on lines 170A, the store data on lines 170D, and the write signal on line 170RW, as known in the art. In addition, CPU 120.1 asserts the unlock line 170U to cause the DCU to unlock the cache set 130L.
- At the same time, at step 350B, CPU 120.1 drives the no_store line 170NS with a signal indicating whether the store data are to be actually written to the cache set. The data will not be written if, and only if: (1) “trap” was asserted in the T stage or an earlier stage of the instruction, or (2) the condition COND is false.
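- The sequence of steps 310, 320, 350A and 350B can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the patent's pseudocode: the class and function names (`CacheSet`, `DCU`, `execute_atomic`) are hypothetical, the cache holds a single cache set, cache misses are not modeled, and the lock is cleared on every store-with-unlock regardless of `no_store`, as the description states for the WB stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CacheSet:
    """One cache set with lock bit L and processor bit P (hypothetical model)."""
    data: Dict[int, int] = field(default_factory=dict)  # address -> value
    locked: bool = False                                 # lock bit L
    owner: Optional[int] = None                          # processor bit P


class DCU:
    """Toy data cache unit with a single cache set (an assumption)."""

    def __init__(self) -> None:
        self.cache_set = CacheSet()

    def _check_access(self, cpu: int) -> None:
        s = self.cache_set
        if s.locked and s.owner != cpu:
            raise RuntimeError("cache set is locked by the other CPU")

    def load_with_lock(self, cpu: int, addr: int) -> int:
        # Read request with the lock line asserted (step 310).
        self._check_access(cpu)
        self.cache_set.locked = True
        self.cache_set.owner = cpu
        return self.cache_set.data[addr]

    def store_with_unlock(self, cpu: int, addr: int, value: int,
                          no_store: bool) -> None:
        # Store request with the unlock line asserted (step 350A).
        self._check_access(cpu)
        if not no_store:
            self.cache_set.data[addr] = value
        # The L bit is reset whether or not no_store is asserted.
        self.cache_set.locked = False
        self.cache_set.owner = None


def execute_atomic(dcu: DCU, cpu: int, addr: int, new_value: int,
                   cond_fn: Callable[[int], bool], trap: bool) -> int:
    """Run one atomic instruction; `trap` models the T-stage trap signal."""
    old = dcu.load_with_lock(cpu, addr)   # step 310, before the T stage
    cond = cond_fn(old)                   # step 320
    no_store = trap or not cond           # step 350B
    dcu.store_with_unlock(cpu, addr, new_value, no_store)  # step 350A
    return old
```

- In this sketch a canceled instruction (`trap=True`) still issues the store-with-unlock request, so the cache set never stays locked; only the data write is suppressed.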
- Whether or not no_store is asserted, DCU 130 will reset the L bit to unlock the cache set.
- Further details of one embodiment will be illustrated using the example of an atomic compare-and-swap instruction cas (Addendum 3). This instruction takes three operands rd, rs1, rs2. In some embodiments, these operands are addresses of registers in register file 150.
- At step 410, the instruction reads a memory location M[rs2] whose address is stored in register rs2. This location is in memory 140. (The instruction definition of Addendum 3 does not depend on the presence of a cache.) At step 420, the contents temp_rs2 of the memory location are compared with the contents r[rs1] of register rs1. If the comparison is successful, the memory location M[rs2] is written with the contents r[rd] of register rd.
- Whether or not the comparison is successful, the register rd is written with the memory location contents temp_rs2 (step 430) fetched at step 410.
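- The architectural effect of cas described in steps 410, 420 and 430 can be written as a short Python function. This is a sketch under assumptions: memory and the register file are modeled as dictionaries, the function name `cas` and the register keys are illustrative, and atomicity (locking) is not modeled here.

```python
from typing import Dict


def cas(mem: Dict[int, int], r: Dict[str, int],
        rd: str, rs1: str, rs2: str) -> None:
    """Compare-and-swap on a dictionary model of memory and registers."""
    addr = r[rs2]
    temp_rs2 = mem[addr]          # step 410: read M[rs2]
    if temp_rs2 == r[rs1]:        # step 420: compare with r[rs1]
        mem[addr] = r[rd]         # comparison succeeded: M[rs2] <- r[rd]
    r[rd] = temp_rs2              # step 430: rd always receives the old value
```

- After the call, software can detect success by comparing the value returned in rd with the comparison value it supplied in rs1.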
- Addendum 4 illustrates execution of the cas instruction by a CPU 120. The step reference numbers correspond to those of Addendum 2. At step 310 in Addendum 4, the contents of memory location M[rs2] are fetched from cache 130 and placed into a temporary register temp_rs2. The cache set storing M[rs2] is locked. Register temp_rs2 is not an “architecture” register, that is, this register is not visible to software, and it can be modified even if the cas instruction will be canceled.
- At step 314, register rd is read into another non-architecture register temp_rd.
- At step 320, another non-architecture register COND is written with a bit indicating whether temp_rs2 = r[rs1].
- Steps 310, 314, 320 can be performed before the T stage. These steps can overlap or be performed in an order different from the order shown.
- Step 350, consisting of steps 350A, 350B, 350C, is to be completed after the T stage. At step 350A, a store-with-unlock is issued to the DCU to store the contents of temp_rd in the cache location that caches M[rs2]. Step 350B is performed as in Addendum 2. At the same time, at step 350C, if “trap” has been deasserted in the T and all earlier stages of the cas instruction, then the contents of temp_rs2 are written to register rd to implement step 430 of Addendum 3.
- FIG. 3 illustrates one embodiment of a CPU 120 in more detail. CPU 120 is a VLIW processor having four execution units 610.0, 610.1, 610.2, 610.3, also labeled GFU, MFU1, MFU2, MFU3 respectively. GFU stands for “general functional unit”. MFU stands for “media functional unit”. The four execution units operate in parallel to execute a single VLIW instruction which may include up to four sub-instructions. Instruction cas is a sub-instruction.
- The GFU is the only execution unit that can perform memory access operations, including cas.
- During the pipeline fetch stage F (FIG. 2), the CPU fetches instructions from instruction cache 614 into instruction aligner 618. During the A stage, instruction aligner 618 extracts up to four sub-instructions from cache 614 and aligns the sub-instructions before respective execution units 610. The sub-instructions are written into instruction buffer 624. During the D stage, units 610 decode their respective sub-instructions and, if needed, read instruction operands from respective register files RF0, RF1, RF2, RF3 which form the register file 150. Each register file RF0, RF1, RF2, RF3 stores a copy of the same data.
- In the execution stages E, C(A1), A2, A3, and possibly other stages after A3 and before WB, each execution unit 610 executes its respective sub-instruction.
- In stage WB, execution units 610 write instruction results, as explained above.
- When a VLIW instruction is in its T stage, each execution unit 610.0-610.3 generates a respective signal “trap0” through “trap3” to indicate whether the execution unit detected a trap condition. Signals trap0-trap3 are provided to PCU 160. In the same stage T, the PCU asserts the “trap” signal if, and only if, any one of signals trap0-trap3 is asserted in the T stage.
- The “trap” signal is provided to load/store unit (LSU) 640.
- LSU 640 executes requests to access cache 130, BIU 180, and other devices. In LSU 640, store buffer 710 (FIG. 4) is a queue of eight entries 0-7. Entry 0 is the front (bottom) of the queue, entry 7 is the back (or top). The store instructions are written from GFU 610.0 into entry 7 in the E stage. (An entry in store buffer 710 defines a store operation which we will call a “store instruction”. Similarly, an entry in load buffer 720 of LSU 640 defines a load operation which we will call a “load instruction”. These store and load instructions should not be confused with sub-instructions executed by units 610 or with VLIW instructions.)
- At the end of the C stage, the instruction in entry 7 of the store buffer 710 is written to the lowest empty entry chosen from entries 4-7.
- A store instruction is not dispatched from the store buffer to the DCU until the stage A3. (Dispatching the instruction involves providing the address, data and control signals on bus 170 of FIG. 1.) When a store instruction is dispatched to the DCU, the DCU writes cache memory 130M at least one cycle after the dispatch. If the instruction was dispatched at stage A3 but in stage T the “trap” signal is asserted, the instruction is canceled via a cancellation signal (not shown) sent by the LSU to the DCU in the T stage.
- In each store buffer entry, “datab” field 710D holds the store data. Address field 710A (“addrb”) holds the store address, which is an address in memory 140.
- State field 710S indicates the pipeline stage of the instruction. The binary encoding of the stage field is as follows:
- 100: instruction is in stage A2;
- 010: instruction is in stage A3;
- 001: instruction is in stage T;
- 000: instruction is past the T stage.
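- The one-hot encoding listed above lends itself to a simple right shift per clock cycle. The following Python sketch illustrates that progression; the names `STAGE_NAMES`, `advance` and `stage_history` are illustrative assumptions, not identifiers from the description.

```python
from typing import List

# Stage field encodings as listed above (3-bit, one-hot or zero).
STAGE_NAMES = {0b100: "A2", 0b010: "A3", 0b001: "T", 0b000: "past T"}


def advance(stage_bits: int) -> int:
    """Shift the stage field right by one position (one clock cycle)."""
    return stage_bits >> 1


def stage_history(start: int = 0b100, cycles: int = 4) -> List[str]:
    """Return the stage names seen over `cycles` consecutive clock cycles."""
    history = []
    bits = start
    for _ in range(cycles):
        history.append(STAGE_NAMES[bits])
        bits = advance(bits)
    return history
```

- Because the set bit simply falls off the low end, an entry that is already past the T stage stays at 000 on further shifts.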
- The stage field is written at the end of the C stage and is thereafter shifted right once per clock cycle. Entries 4-7 of the store buffer keep all three bits of the stage field. Entry 3 has two bits to track whether the instruction is in stage A3 or T or is past T. Entry 2 has one bit to track whether the instruction is in stage T or past the T stage. Entries […]
- The instruction type field 710T indicates the instruction type. In particular, this field indicates whether the store is part of a cas instruction.
- One-bit load/store field 710L is used for cas instructions to track if the cas load has been performed, as described below.
- Load buffer 720 in FIG. 4 is a queue of five entries 0-4. Entry 0 is the front of the queue, and entry 4 is the back. Load instructions are written from GFU 610.0 to entry 4 in the E stage. They shift through the buffer from top to bottom. Each instruction remains in the load buffer through its lifetime in the LSU, that is, even after the load request has been issued to DCU 130. After the load data have returned from the DCU, the instruction is logically deleted from the load buffer.
- The load buffer entries can be finished (i.e. respective loads can be performed) out of order. Holes in the buffer from out-of-order completed instructions can be filled from any entry, one per clock cycle.
- A load instruction can be dispatched to the DCU in the E stage without being written to the load buffer first. However, the instruction still gets written into the load buffer.
- In each load buffer entry, “addrb” field 720A holds the load address. This is an address in memory 140. The address is calculated in the E stage (the address may be equal to the sum of two operands, as known in the art).
- Destination register specifier field 720RD holds the address of the load destination register in register file 150.
- RAW hazard field 720RAW is an 8-bit vector pointing to store buffer 710 instructions which must be performed before the load instruction to avoid a RAW (read after write) hazard. In the embodiment being described, the stores are issued in order with respect to each other. The loads are also issued in order with respect to each other. However, the loads are issued in preference to the stores. The store instructions are dispatched only when the first load in load buffer 720 cannot be dispatched due to a RAW hazard, or when the load buffer is empty. Therefore, a RAW hazard is a possibility, but RAR, WAR, and WAW hazards are not.
- Each bit in field 720RAW corresponds to an entry of store buffer 710. The bit is set if the instruction in the corresponding store entry must be executed before the load, and the bit is reset otherwise. As the store buffer entries are shifted down to fill the free space in the store buffer, the RAW fields 720RAW are shifted to the right.
- A load instruction can be speculatively dispatched to DCU 130 in the E stage even though the corresponding RAW hazards are not calculated until the C stage. If the load is found to have a hazard, the load is canceled (that is, the data returned by cache 130 are discarded), and the load is retried later.
- The load can also be canceled by a “trap” signal generated in the T or earlier stage if the load was dispatched to the DCU before the T stage. In this case, the load is not retried.
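- The load-before-store dispatch preference described above can be sketched as a small selection function. This is a minimal illustration under assumptions: buffers are Python lists with the front of the queue at index 0, each load entry carries its RAW vector as an integer under the hypothetical key `"raw"`, and the function name `select_dispatch` is invented for this example.

```python
from typing import List, Optional, Tuple


def select_dispatch(load_buffer: List[dict],
                    store_buffer: List[dict]) -> Optional[Tuple[str, dict]]:
    """Pick the next entry to dispatch: loads are preferred over stores."""
    # The first load is dispatched if its RAW vector has no bit set.
    if load_buffer and load_buffer[0]["raw"] == 0:
        return ("load", load_buffer[0])
    # A store is dispatched only when the load buffer is empty or the
    # first load is blocked by a RAW hazard.
    if store_buffer:
        return ("store", store_buffer[0])
    return None
```

- Since loads and stores are each taken only from the front of their own queue, this policy preserves load-to-load and store-to-store order, which is why only RAW hazards need tracking.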
- One-bit field 720T (“trap_taken”) is initially set to zero. This bit is set to 1 in the T or earlier stage in response to the trap signal from PCU 160 being asserted. If the bit is set, the instruction will be removed from the load buffer when the load data return, and the load data will be discarded.
- The stage field 720S has the same meaning as the field 710S in the store buffer, and the encoding is the same. When the load data are passed back to GFU 610.0, the stage field final value, shifted right once more, is passed to PCU 160.
- Entry 4 of load buffer 720 includes all three stage bits 720S. Entry 3 has two bits to track whether the instruction is in stage A3, T, or past T. Entry 2 has one bit to track whether the instruction is in stage T or past T. Entries […]
- BIU list 730 is a queue of commands to be dispatched to bus interface unit 180. The BIU list is written when DCU 130 returns a cache miss and when, therefore, data have to be fetched into the cache from memory 140. The BIU list is also written to write the memory 140.
- When GFU 610.0 issues a cas instruction to LSU 640, the LSU writes one entry into each of buffers 710, 720. In the store buffer entry, the instruction type field 710T indicates cas. Address field 710A has the contents of register rs2 (Addendum 3) of the cas instruction, i.e. the memory 140 address. The data field 710D has the contents of the destination register rd (Addendum 3) of the cas instruction. The bit 710L is 0 to indicate that the cas load has not been performed yet.
- In the load buffer entry, the address field 720A receives the contents of register rs1 (the comparison data). See Addendum 3. Field 720RD receives the address of the destination register rd (Addendum 3) of the cas instruction. In RAW vector 720RAW, the bit pointing to the store entry for the cas instruction is set even though the cas load is to precede the cas store. In addition, the bits corresponding to other RAW hazards, if any, are set.
- The remaining fields of the cas load and store entries of FIG. 5 are defined as for other load and store instructions.
- Addendum 5 describes the LSU operation in pseudocode. BIU list 730 has the highest priority in some embodiments. If the BIU list 730 is not empty, the LSU dispatches an operation from the BIU list (step 910).
- If the BIU list is empty, an operation from load buffer 720 or store buffer 710 is dispatched. If the first entry (i.e., the entry in the front of the queue) in load buffer 720 has no RAW hazard (step 920), the entry is dispatched. More particularly, the LSU dispatches to DCU 130 a load-without-lock request, that is, a read request with lock signal 170L deasserted (step 920A). When the DCU returns data on lines 170D (step 920B), the LSU passes the data to PCU 160 and GFU 610.0 on bus lsu_dc_data (FIG. 3). The LSU also passes to the PCU and the GFU on bus lsu_pcu_rd the destination register specifier rd from field 720RD (FIG. 4). The LSU also passes to the PCU the stage bits 720S and the trap_taken bit 720T.
- If load buffer 720 is empty, or the first entry in the load buffer has a non-zero bit in field 720RAW, the first store in store buffer 710 is dispatched (step 930). If the instruction type field 710T of this entry does not indicate a cas instruction (step 930A), then a store request is issued to DCU 130, with the lock and unlock signals 170L, 170U deasserted […] stage field 710S or by the position of the instruction in the store buffer.
- If the field 710T indicates a cas instruction (step 930B), the actions in Table 5-1 are performed.
- In the table, the column “CAS STAGE” indicates the pipeline stage of the cas instruction for one example. In that example, both LSU buffers were empty when the cas instruction was issued by GFU 610.0. Therefore, the cas load (step 930B1) is dispatched to DCU 130 in the E stage.
- The column “LSU PIPE STAGE” indicates the LSU pipeline stages. LSU 640 is pipelined, and can issue a request to the DCU on every clock cycle.
- At step 930B1 (LSU pipeline dispatch stage LD), a load request is dispatched using the store buffer 710 entry for the cas instruction. Using the store buffer entry rather than the load buffer entry allows utilization of the same logic as used for non-cas instructions to select an instruction for dispatch. Indeed, the cas load entry has a RAW hazard bit set (pointing to the cas store entry). Therefore, according to the non-cas rules, the cas store must be dispatched before the cas load.
- When the LSU dispatches the cas store entry with field 710T showing cas and bit 710L reset, the LSU dispatches a load request rather than a store to the DCU.
- In the load request, the memory address in field 710A is driven on the DCU address bus 170A. The lock signal 170L is asserted, and the unlock signal 170U is deasserted.
- At step 930B2, in stage C, the DCU returns data on lines 170D (assuming a cache hit). The LSU sets the bit 710L to indicate that the cas load has been performed.
- The LSU pipeline stages at step 930B2 are indicated as LC (LSU cache access) and LF (LSU finish). In the embodiment being described, these stages may or may not occur in the same clock cycle. For example, if the DCU returned a cache miss, the stage LF (data return on lines 170D) would be performed later.
- Of note, in case of a cache miss, the DCU does not lock the cache set, and the LSU does not set the bit 710L. In this case (not shown in Table 5-1), the LSU causes BIU 180 to fetch data from memory 140, and then reissues the load-with-lock request of step 930B1.
- When the DCU returns data on lines 170D, the LSU 640 drives the destination register specifier rd on lines lsu_pcu_rd to PCU 160 and GFU 610.0. The register specifier rd is taken from field 720RD of the cas load entry (see FIG. 5). In some embodiments, the LSU finds the cas load entry as the first load in the queue of load buffer 720. Indeed, the cas load is the first load in the load buffer because loads have priority over stores, a store is issued before a load only if the load has a RAW bit set, stores are issued in order with respect to each other, and loads are issued in order with respect to each other.
- In other embodiments, the cas load entry is found by the LSU as the first load having a RAW vector 720RAW pointing to the cas store entry.
- At step 930B3, the LSU again finds the load buffer entry corresponding to the cas instruction, and drives the comparison data (contents of register rs1) from field 720A (FIG. 5) to GFU 610.0. At this time, the data from the DCU are still available on bus lsu_dc_data. The GFU performs the comparison step 320 (Addendum 4), and provides the result COND to LSU 640. This occurs in pipeline stage A2 in Table 5-1, when the LSU pipeline for the cas instruction is in stages LL (LSU late cache response) and LI (LSU invalidate, meaning that the instruction can be invalidated in this stage). Stages LL, LI may occur in the same clock cycle or in different cycles. They occur in the same cycle in Table 5-1.
- At step 930B4 (cas stage A3 in Table 5-1, LSU stage LD), the LSU again selects an entry for dispatch to the DCU. Since the first load in the load buffer is a cas load, having a RAW bit set, a store is dispatched. The first store is still the cas store. However, since its bit 710L is set, the LSU dispatches a store to the DCU, asserting the unlock signal 170U.
- At step 930B5 (cas stage T, LSU stage LC), LSU 640 generates the signal no_store on line 170NS (FIG. 1). This signal is asserted if, and only if, the trap_taken bit 720T is set (one) or COND is false. See step 350B in Addenda 2 and 4.
- At step 930B6 (cas stage WB in Table 5-1), the store operation is allowed to finish. However, if no_store was asserted at step 930B5, the DCU will not perform a store. Whether or not no_store was asserted, the DCU resets the cache set lock bit L.
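- Steps 930B1 through 930B6 can be traced with a small Python sketch. It is a simplified illustration, not the LSU's actual logic: the entry is a dictionary with hypothetical keys (`"addr"` for field 710A, `"data"` for field 710D, `"load_done"` for bit 710L), memory is a dictionary, a cache hit is assumed, and pipeline timing is ignored.

```python
from typing import Dict, Tuple


def dispatch_cas_store_entry(entry: dict) -> dict:
    """Decide what the LSU sends to the DCU for a cas store buffer entry."""
    if not entry["load_done"]:
        # Bit 710L reset: dispatch a load-with-lock instead of a store.
        return {"op": "load", "addr": entry["addr"],
                "lock": True, "unlock": False}
    # Bit 710L set: dispatch the store-with-unlock.
    return {"op": "store", "addr": entry["addr"], "data": entry["data"],
            "lock": False, "unlock": True}


def no_store(trap_taken: bool, cond: bool) -> bool:
    """no_store is asserted iff trap_taken is set or COND is false."""
    return trap_taken or not cond


def run_cas(memory: Dict[int, int], entry: dict, rs1_value: int,
            trap_taken: bool) -> Tuple[int, bool]:
    """Walk one cas through the steps; returns (loaded value, lock bit)."""
    req = dispatch_cas_store_entry(entry)       # 930B1: load-with-lock
    locked = req["lock"]
    loaded = memory[req["addr"]]                # 930B2: data returned
    entry["load_done"] = True                   # set bit 710L
    cond = loaded == rs1_value                  # 930B3: GFU comparison
    req = dispatch_cas_store_entry(entry)       # 930B4: store-with-unlock
    if not no_store(trap_taken, cond):          # 930B5
        memory[req["addr"]] = req["data"]       # 930B6: data store
    if req["unlock"]:
        locked = False                          # lock bit L reset regardless
    return loaded, locked
```

- The same store buffer entry is dispatched twice, first as a load and then as a store, with bit 710L deciding which request is formed; this mirrors the reuse of the ordinary dispatch-selection logic noted in the description.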
- The above embodiments illustrate but do not limit the invention. In particular, the invention is not limited to the cas instruction. Swap, test-and-set, and other atomic instructions are used in some embodiments. The invention is not limited by the number of the CPUs sharing the cache 130 or by the structure of a CPU. In some embodiments, the CPUs are not identical to one another. Further, in some embodiments, non-CPU entities, for example, a DMA or a communication controller, can share the cache with the CPUs. If a cache set is locked, such entities are prevented from writing and possibly reading the cache set.
- In some embodiments, the LSU provides an interface to non-memory devices in addition to the memory. In other embodiments, an LSU is absent from at least one CPU.
- The invention is not limited to dispatching loads in preference to stores, or to any other dispatch policy.
- The invention is not limited by the type of the CPUs. In some embodiments, one or more of the CPUs are non-VLIW processors. In some embodiments, one or more CPUs do not have a register file.
- While in some embodiments the memory 140 is a random access memory, in some embodiments the DCU caches data from non-random access memory devices.
- In some embodiments, an atomic instruction locks an entire cache memory, or an individual word, bit, or some other cache portion. Some embodiments do not include a cache, and an atomic instruction locks part or all of a non-cache memory.
- The invention is not limited to any particular interface between a CPU and the cache. For example, in some embodiments, the lock line 170L and the unlock line 170U are combined into a single line since in some embodiments the lock and the unlock commands are never issued to the DCU simultaneously.
- The invention is not limited to caches. In some embodiments, the invention is applied to non-cache resources, for example, disk or communication controllers.
- The invention is not limited to the pipeline of FIG. 2 or to any particular pipeline of LSU 640. Further, in some embodiments, an atomic instruction reads one memory location but writes a different memory location. The location being written, or both locations, are locked in some embodiments from the time the first location is read to the time the second location is written.
- The steps of Addenda 2-4 are not necessarily performed in the order shown. Some steps may overlap or be performed in a different order.
- Other embodiments and variations are within the scope of the invention, as defined by the appended claims.
- A trap may be caused by an exception or an interrupt. An exception is a condition associated with an instruction being executed. Examples include divide by zero, unaligned memory access, stack overflow, an illegal instruction, a breakpoint or a software interrupt instruction, a privileged instruction executed in a non-privileged mode, a memory map error (an attempt to access an unmapped memory address space, or to execute a disallowed opcode for an address space), a memory access error (for example, a parity error), an out-of-bounds instruction address, out-of-bounds data, a null pointer reference, and a software-initiated processor reset.
- An interrupt is a condition caused by an external device. Interrupts are not directly related to an instruction being executed. Examples of interrupts are requests from a network controller, a keyboard, a joystick, or a disk controller. Another example is a timer interrupt. Power-on reset (a processor reset signal being asserted) also causes an interrupt.
- When a trap condition occurs, the processor stops executing the current instruction stream and starts executing a trap handler. Before the trap handler is started, the instructions that were past the T stage when the trap condition occurred are executed to completion. The instructions that have not yet gone past the T stage are canceled.
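The cancellation rule above can be illustrated with a minimal sketch. It assumes the stage order E, C, A2, A3, T, WB (taken from the stage labels listed in Table 5-1) and treats an instruction that is still in the T stage as not yet past it. The function name `resolve_trap` and the dictionary representation of in-flight instructions are hypothetical, introduced only for illustration.

```python
# Assumed stage order, taken from the stage labels that appear in Table 5-1.
STAGES = ["E", "C", "A2", "A3", "T", "WB"]


def resolve_trap(in_flight):
    """Split in-flight instructions when a trap condition occurs.

    in_flight maps an instruction name to its current pipeline stage.
    Instructions already past the T stage run to completion; all others
    (including one still in T, by assumption) are canceled.
    """
    t_index = STAGES.index("T")
    completed = []
    canceled = []
    for name, stage in in_flight.items():
        if STAGES.index(stage) > t_index:
            completed.append(name)
        else:
            canceled.append(name)
    return completed, canceled


if __name__ == "__main__":
    done, dropped = resolve_trap({"i1": "WB", "i2": "T", "i3": "C"})
    print("completed:", done)
    print("canceled:", dropped)
```

A canceled atomic instruction may already hold a lock at this point, which is why the unlocking described in the claims has to happen whether or not the instruction is canceled.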
- cas rd, rs1, [rs2]:
- M[rs2]←r[rd]
- lines 170D,
- field 710T of first store buffer entry does not indicate cas, issue a store request to DCU deasserting the lock signal 170L and the unlock signal 170U
TABLE 5-1 LSU CAS PIPE
STEP | STAGE | STAGE | ACTION |
---|---|---|---|
930B1 | E | LD | Dispatch load-with-lock request to DCU, with address in store entry field addrb (contents of rs2) |
930B2 | C | LC, LF | DCU returns data on lines 170D. Set flag 710L in store buffer entry. Drive load buffer destination register specifier (rd field of load buffer entry) on lines lsu_pcu_rd |
930B3 | A2 | LL, LI | Provide to GFU the comparison data r[rs1] from addrb field of load buffer entry. Get COND from the GFU |
930B4 | A3 | LD | Dispatch store-with-unlock to the DCU |
930B5 | T | LC | no_store ← ((trap_taken bit 720T set) or ~COND) |
930B6 | WB | | Allow the DCU to complete the cas store unlock (and, possibly the data store) |
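Steps 930B1 through 930B6 can be sketched in software as follows. This is an illustrative model only, not the hardware described in the patent: the `DCU` class, its method names, and the `cas` function are hypothetical. It also assumes that rd receives the loaded memory value whenever the instruction is not canceled, which the surviving text does not state explicitly. What the sketch does take from the source is that the lock is acquired before the trap outcome is known, that `no_store` is set when a trap is taken or the comparison fails, and that the unlock is performed in every case.

```python
class DCU:
    """Toy data cache unit with per-address locking (hypothetical model)."""

    def __init__(self):
        self.mem = {}
        self.locked = set()

    def load_with_lock(self, addr):
        if addr in self.locked:
            raise RuntimeError("address already locked")
        self.locked.add(addr)
        return self.mem.get(addr, 0)

    def store_with_unlock(self, addr, value, no_store):
        # The unlock always happens; the data store is suppressed by no_store.
        if not no_store:
            self.mem[addr] = value
        self.locked.discard(addr)


def cas(dcu, regs, rd, rs1, rs2, trap_taken=False):
    """Model of 'cas rd, rs1, [rs2]' following steps 930B1 to 930B6."""
    addr = regs[rs2]
    old = dcu.load_with_lock(addr)            # 930B1, 930B2: load with lock
    cond = old == regs[rs1]                   # 930B3: comparison (COND)
    no_store = trap_taken or not cond         # 930B5: no_store computation
    # 930B4, 930B6: store-with-unlock is dispatched and completes regardless
    dcu.store_with_unlock(addr, regs[rd], no_store)
    if not trap_taken:
        regs[rd] = old                        # assumption: rd gets the old value
    return no_store


if __name__ == "__main__":
    dcu = DCU()
    dcu.mem[100] = 5
    regs = {"r1": 5, "r2": 100, "r3": 9}
    cas(dcu, regs, "r3", "r1", "r2")
    print(dcu.mem[100], regs["r3"], dcu.locked)
```

Passing `trap_taken=True` models a canceled instruction: memory and registers are left unchanged, but the address is still unlocked.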
Claims (19)
1. A computer processor capable to execute a computer instruction which locks and then unlocks a computer resource, the computer processor being operable to lock the resource in the course of execution of the instruction before the processor has determined whether the instruction is to be executed to completion or canceled, the processor unlocking the resource by the time the instruction processing by the processor is terminated, the unlocking being performed whether or not the instruction is canceled.
2. The computer processor of claim 1 wherein the instruction execution is pipelined, and the instruction is canceled if a trap condition occurs after the processor started processing the instruction.
3. The computer processor of claim 1 wherein:
executing the instruction comprises reading a memory location and conditionally or unconditionally writing a memory location; and
the resource comprises the memory location to be written.
4. The computer processor of claim 3 further comprising a cache, wherein the memory location to be written is a memory location in said cache.
5. The computer processor of claim 3 wherein the circuitry is operable to perform the reading before the processor has determined whether the instruction is to be canceled.
6. The processor of claim 1 in combination with another processor having access to the same resource.
7. The processor of claim 1 wherein instruction execution is pipelined, and
if the processor determines before a pipeline stage or stages in which the unlocking is performed that the instruction is to be canceled, the instruction proceeds through all the pipeline stages at least up to, and including, the stage or stages in which the resource is unlocked.
8. The processor of claim 1 wherein:
each instruction is executed in a plurality of pipeline stages, wherein the pipeline for each instruction includes a stage ST1 in which a signal is generated by the processor to indicate whether the instruction is to be canceled due to a trap; and
when executing the instruction which locks and then unlocks the computer resource, the processor is operable to lock the computer resource before the stage ST1.
9. The processor of claim 8 wherein for at least some instructions including the instruction that locks and then unlocks the computer resource, the stage ST1 is followed by a stage ST2 in which at least one instruction result is written to an architecture storage location; and
when the processor executes the instruction that locks and then unlocks the computer resource, and the instruction is to be canceled, the stage ST2 is executed for the instruction to unlock the resource but writing to the architecture storage location is suppressed.
10. A computer processor comprising an interface to a cache, the interface comprising:
address and data terminals; and
one or more control terminals to lock and unlock at least a portion of the cache, the one or more control terminals being operable to indicate that the cache is not to store data but to perform an unlock operation.
11. The processor of claim 10 in combination with said cache, the cache being connected to the address and data terminals and to the one or more control terminals.
12. The combination of claim 11 further comprising a second processor having data and address terminals and one or more control terminals, wherein said terminals of the second processor are connected to the cache.
13. The combination of claim 12 further comprising a memory and a circuit for caching data from the memory in the cache.
14. A method for executing a computer instruction by a computer processor, wherein the instruction locks and then unlocks a computer resource, the method comprising:
locking the resource before the processor has determined whether the instruction is to be executed to completion or canceled; and then
unlocking the resource by the time the instruction processing by the processor is terminated, wherein the unlocking is performed whether or not the instruction is canceled.
15. The method of claim 14 wherein the unlocking is performed after the processor has determined whether the instruction is to be canceled.
16. The method of claim 14 wherein the instruction execution is pipelined, and the instruction is canceled if a trap condition occurs after the instruction processing by the processor has begun.
17. The method of claim 14 wherein the instruction is an atomic instruction which comprises reading a memory location and conditionally or unconditionally writing a memory location; and
the resource comprises the memory location to be written.
18. The method of claim 17 wherein the memory location to be written is a cache memory location.
19. The method of claim 17 wherein the reading operation is performed before the processor has determined whether the instruction is to be canceled.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/941,142 US20020046334A1 (en) | 1998-12-02 | 2001-08-28 | Execution of instructions that lock and unlock computer resources |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/204,760 US6282637B1 (en) | 1998-12-02 | 1998-12-02 | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
US09/941,142 US20020046334A1 (en) | 1998-12-02 | 2001-08-28 | Execution of instructions that lock and unlock computer resources |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,760 Continuation US6282637B1 (en) | 1998-12-02 | 1998-12-02 | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
Publications (1)
Publication Number | Publication Date |
---|---|
US20020046334A1 (en) | 2002-04-18 |
Family
ID=22759325
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,760 Expired - Lifetime US6282637B1 (en) | 1998-12-02 | 1998-12-02 | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
US09/941,142 Abandoned US20020046334A1 (en) | 1998-12-02 | 2001-08-28 | Execution of instructions that lock and unlock computer resources |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/204,760 Expired - Lifetime US6282637B1 (en) | 1998-12-02 | 1998-12-02 | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
Country Status (2)
Country | Link |
---|---|
US (2) | US6282637B1 (en) |
WO (1) | WO2000033162A2 (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040139093A1 (en) * | 2002-10-31 | 2004-07-15 | International Business Machines Corporation | Exclusion control |
US20050154866A1 (en) * | 2004-01-13 | 2005-07-14 | Steely Simon C.Jr. | Systems and methods for executing across at least one memory barrier employing speculative fills |
US20060041724A1 (en) * | 2004-08-17 | 2006-02-23 | Steely Simon C Jr | Locked cache line sharing |
WO2006012103A3 (en) * | 2004-06-30 | 2006-04-13 | Intel Corp | Method and apparatus for speculative execution of uncontended lock instructions |
US20060095668A1 (en) * | 2004-10-28 | 2006-05-04 | International Business Machines Corporation | Method for processor to use locking cache as part of system memory |
US20060095669A1 (en) * | 2004-10-28 | 2006-05-04 | International Business Machines Corporation | Direct deposit using locking cache |
US20060161740A1 (en) * | 2004-12-29 | 2006-07-20 | Sailesh Kottapalli | Transaction based shared data operations in a multiprocessor environment |
US20060282638A1 (en) * | 2005-06-10 | 2006-12-14 | Fujitsu Limited | Storage device, configuration information management method and program |
US20080243468A1 (en) * | 2007-03-30 | 2008-10-02 | International Business Machines Corporation | Providing memory consistency in an emulated processing environment |
US20120059971A1 (en) * | 2010-09-07 | 2012-03-08 | David Kaplan | Method and apparatus for handling critical blocking of store-to-load forwarding |
US20130246713A1 (en) * | 2012-03-19 | 2013-09-19 | International Business Machines Corporation | Conditional write processing for a cache structure of a coupling facility |
US20150242337A1 (en) * | 2014-02-24 | 2015-08-27 | Puneet Aggarwal | System and method for validation of cache memory locking |
US11126474B1 (en) * | 2017-06-14 | 2021-09-21 | Amazon Technologies, Inc. | Reducing resource lock time for a virtual processing unit |
US11537430B1 (en) | 2017-09-28 | 2022-12-27 | Amazon Technologies, Inc. | Wait optimizer for recording an order of first entry into a wait mode by a virtual central processing unit |
US11556485B1 (en) * | 2021-08-31 | 2023-01-17 | Apple Inc. | Processor with reduced interrupt latency |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6282637B1 (en) * | 1998-12-02 | 2001-08-28 | Sun Microsystems, Inc. | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
US6785714B1 (en) * | 2000-09-28 | 2004-08-31 | Microsoft Corporation | System and method for employing slot level locking of a cache |
US7246187B1 (en) * | 2000-11-28 | 2007-07-17 | Emc Corporation | Method and apparatus for controlling exclusive access to a shared resource in a data storage system |
US6772372B2 (en) * | 2001-03-06 | 2004-08-03 | Hewlett-Packard Development Company, L.P. | System and method for monitoring unaligned memory accesses |
US7003543B2 (en) | 2001-06-01 | 2006-02-21 | Microchip Technology Incorporated | Sticky z bit |
US6952711B2 (en) | 2001-06-01 | 2005-10-04 | Microchip Technology Incorporated | Maximally negative signed fractional number multiplication |
US20020184566A1 (en) | 2001-06-01 | 2002-12-05 | Michael Catherwood | Register pointer trap |
US7020788B2 (en) | 2001-06-01 | 2006-03-28 | Microchip Technology Incorporated | Reduced power option |
US6975679B2 (en) | 2001-06-01 | 2005-12-13 | Microchip Technology Incorporated | Configuration fuses for setting PWM options |
US6976158B2 (en) | 2001-06-01 | 2005-12-13 | Microchip Technology Incorporated | Repeat instruction with interrupt |
US20030061464A1 (en) * | 2001-06-01 | 2003-03-27 | Catherwood Michael I. | Digital signal controller instruction set and architecture |
US7007172B2 (en) | 2001-06-01 | 2006-02-28 | Microchip Technology Incorporated | Modified Harvard architecture processor having data memory space mapped to program memory space with erroneous execution protection |
US7467178B2 (en) | 2001-06-01 | 2008-12-16 | Microchip Technology Incorporated | Dual mode arithmetic saturation processing |
US6937084B2 (en) | 2001-06-01 | 2005-08-30 | Microchip Technology Incorporated | Processor with dual-deadtime pulse width modulation generator |
US6985986B2 (en) | 2001-06-01 | 2006-01-10 | Microchip Technology Incorporated | Variable cycle interrupt disabling |
CA2383832A1 (en) * | 2002-04-24 | 2003-10-24 | Ibm Canada Limited-Ibm Canada Limitee | System and method for intelligent trap analysis |
US7036125B2 (en) * | 2002-08-13 | 2006-04-25 | International Business Machines Corporation | Eliminating memory corruption when performing tree functions on multiple threads |
US20040044881A1 (en) * | 2002-08-28 | 2004-03-04 | Sun Microsystems, Inc. | Method and system for early speculative store-load bypass |
US7302553B2 (en) * | 2003-01-23 | 2007-11-27 | International Business Machines Corporation | Apparatus, system and method for quickly determining an oldest instruction in a non-moving instruction queue |
US7921250B2 (en) * | 2004-07-29 | 2011-04-05 | International Business Machines Corporation | Method to switch the lock-bits combination used to lock a page table entry upon receiving system reset exceptions |
US9182993B2 (en) * | 2005-03-18 | 2015-11-10 | Broadcom Corporation | Data and phase locking buffer design in a two-way handshake system |
US20110320781A1 (en) * | 2010-06-29 | 2011-12-29 | Wei Liu | Dynamic data synchronization in thread-level speculation |
US11036501B2 (en) * | 2018-12-23 | 2021-06-15 | Intel Corporation | Apparatus and method for a range comparison, exchange, and add |
US11119767B1 (en) | 2020-06-19 | 2021-09-14 | Apple Inc. | Atomic operation predictor to predict if an atomic operation will successfully complete and a store queue to selectively forward data based on the predictor |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5168564A (en) * | 1990-10-05 | 1992-12-01 | Bull Hn Information Systems Inc. | Cancel mechanism for resilient resource management and control |
US5175829A (en) * | 1988-10-25 | 1992-12-29 | Hewlett-Packard Company | Method and apparatus for bus lock during atomic computer operations |
US5420991A (en) * | 1994-01-04 | 1995-05-30 | Intel Corporation | Apparatus and method for maintaining processing consistency in a computer system having multiple processors |
US5613083A (en) * | 1994-09-30 | 1997-03-18 | Intel Corporation | Translation lookaside buffer that is non-blocking in response to a miss for use within a microprocessor capable of processing speculative instructions |
US6141734A (en) * | 1998-02-03 | 2000-10-31 | Compaq Computer Corporation | Method and apparatus for optimizing the performance of LDxL and STxC interlock instructions in the context of a write invalidate protocol |
US6212622B1 (en) * | 1998-08-24 | 2001-04-03 | Advanced Micro Devices, Inc. | Mechanism for load block on store address generation |
US6282637B1 (en) * | 1998-12-02 | 2001-08-28 | Sun Microsystems, Inc. | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs |
US6529982B2 (en) * | 1997-01-23 | 2003-03-04 | Sun Microsystems, Inc. | Locking of computer resources |
US6862664B2 (en) * | 2003-02-13 | 2005-03-01 | Sun Microsystems, Inc. | Method and apparatus for avoiding locks by speculatively executing critical sections |
US6938130B2 (en) * | 2003-02-13 | 2005-08-30 | Sun Microsystems Inc. | Method and apparatus for delaying interfering accesses from other threads during transactional program execution |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4488217A (en) * | 1979-03-12 | 1984-12-11 | Digital Equipment Corporation | Data processing system with lock-unlock instruction facility |
US5499356A (en) | 1989-12-29 | 1996-03-12 | Cray Research, Inc. | Method and apparatus for a multiprocessor resource lockout instruction |
US5524255A (en) | 1989-12-29 | 1996-06-04 | Cray Research, Inc. | Method and apparatus for accessing global registers in a multiprocessor system |
US5276847A (en) * | 1990-02-14 | 1994-01-04 | Intel Corporation | Method for locking and unlocking a computer address |
US5293613A (en) | 1991-08-29 | 1994-03-08 | International Business Machines Corporation | Recovery control register |
US5574922A (en) | 1994-06-17 | 1996-11-12 | Apple Computer, Inc. | Processor with sequences of processor instructions for locked memory updates |
US5787486A (en) * | 1995-12-15 | 1998-07-28 | International Business Machines Corporation | Bus protocol for locked cycle cache hit |
- 1998-12-02 US US09/204,760 patent/US6282637B1/en not_active Expired - Lifetime
- 1999-12-01 WO PCT/US1999/028596 patent/WO2000033162A2/en active Application Filing
- 2001-08-28 US US09/941,142 patent/US20020046334A1/en not_active Abandoned
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7793023B2 (en) | 2002-10-31 | 2010-09-07 | International Business Machines Corporation | Exclusion control |
US20040139093A1 (en) * | 2002-10-31 | 2004-07-15 | International Business Machines Corporation | Exclusion control |
US20080243887A1 (en) * | 2002-10-31 | 2008-10-02 | International Business Machines Corp. | Exclusion control |
US7415556B2 (en) * | 2002-10-31 | 2008-08-19 | International Business Machines Corporation | Exclusion control |
US7360069B2 (en) * | 2004-01-13 | 2008-04-15 | Hewlett-Packard Development Company, L.P. | Systems and methods for executing across at least one memory barrier employing speculative fills |
US20050154866A1 (en) * | 2004-01-13 | 2005-07-14 | Steely Simon C.Jr. | Systems and methods for executing across at least one memory barrier employing speculative fills |
WO2006012103A3 (en) * | 2004-06-30 | 2006-04-13 | Intel Corp | Method and apparatus for speculative execution of uncontended lock instructions |
JP2015072717A (en) * | 2004-06-30 | 2015-04-16 | インテル コーポレイション | Processor, method, and apparatus for execution of lock instructions |
JP2011175669A (en) * | 2004-06-30 | 2011-09-08 | Intel Corp | Method and apparatus for speculative execution of uncontended lock instruction |
US7529914B2 (en) | 2004-06-30 | 2009-05-05 | Intel Corporation | Method and apparatus for speculative execution of uncontended lock instructions |
JP2008504603A (en) * | 2004-06-30 | 2008-02-14 | インテル コーポレイション | Method and apparatus for speculative execution of non-conflicting lock instructions |
US20060041724A1 (en) * | 2004-08-17 | 2006-02-23 | Steely Simon C Jr | Locked cache line sharing |
US7590802B2 (en) | 2004-10-28 | 2009-09-15 | International Business Machines Corporation | Direct deposit using locking cache |
US20060095668A1 (en) * | 2004-10-28 | 2006-05-04 | International Business Machines Corporation | Method for processor to use locking cache as part of system memory |
US20080040548A1 (en) * | 2004-10-28 | 2008-02-14 | Day Michael N | Method for Processor to Use Locking Cache as Part of System Memory |
US7290106B2 (en) * | 2004-10-28 | 2007-10-30 | International Business Machines Corporation | Method for processor to use locking cache as part of system memory |
US20080040549A1 (en) * | 2004-10-28 | 2008-02-14 | Day Michael N | Direct Deposit Using Locking Cache |
US7290107B2 (en) * | 2004-10-28 | 2007-10-30 | International Business Machines Corporation | Direct deposit using locking cache |
US20060095669A1 (en) * | 2004-10-28 | 2006-05-04 | International Business Machines Corporation | Direct deposit using locking cache |
US7596665B2 (en) | 2004-10-28 | 2009-09-29 | International Business Machines Corporation | Mechanism for a processor to use locking cache as part of system memory |
US20110055493A1 (en) * | 2004-12-29 | 2011-03-03 | Sailesh Kottapalli | Transaction based shared data operations in a multiprocessor environment |
US7984248B2 (en) | 2004-12-29 | 2011-07-19 | Intel Corporation | Transaction based shared data operations in a multiprocessor environment |
US8176266B2 (en) | 2004-12-29 | 2012-05-08 | Intel Corporation | Transaction based shared data operations in a multiprocessor environment |
US8458412B2 (en) | 2004-12-29 | 2013-06-04 | Intel Corporation | Transaction based shared data operations in a multiprocessor environment |
US20060161740A1 (en) * | 2004-12-29 | 2006-07-20 | Sailesh Kottapalli | Transaction based shared data operations in a multiprocessor environment |
US20060282638A1 (en) * | 2005-06-10 | 2006-12-14 | Fujitsu Limited | Storage device, configuration information management method and program |
US20080243468A1 (en) * | 2007-03-30 | 2008-10-02 | International Business Machines Corporation | Providing memory consistency in an emulated processing environment |
US7899663B2 (en) * | 2007-03-30 | 2011-03-01 | International Business Machines Corporation | Providing memory consistency in an emulated processing environment |
US20120059971A1 (en) * | 2010-09-07 | 2012-03-08 | David Kaplan | Method and apparatus for handling critical blocking of store-to-load forwarding |
US20130246713A1 (en) * | 2012-03-19 | 2013-09-19 | International Business Machines Corporation | Conditional write processing for a cache structure of a coupling facility |
US8935471B2 (en) | 2012-03-19 | 2015-01-13 | International Business Machines Corporation | Conditional write processing for a cache structure of a coupling facility |
US8838888B2 (en) * | 2012-03-19 | 2014-09-16 | International Business Machines Corporation | Conditional write processing for a cache structure of a coupling facility |
US20150242337A1 (en) * | 2014-02-24 | 2015-08-27 | Puneet Aggarwal | System and method for validation of cache memory locking |
US9268715B2 (en) * | 2014-02-24 | 2016-02-23 | Freescale Semiconductor, Inc. | System and method for validation of cache memory locking |
US11126474B1 (en) * | 2017-06-14 | 2021-09-21 | Amazon Technologies, Inc. | Reducing resource lock time for a virtual processing unit |
US11537430B1 (en) | 2017-09-28 | 2022-12-27 | Amazon Technologies, Inc. | Wait optimizer for recording an order of first entry into a wait mode by a virtual central processing unit |
US11556485B1 (en) * | 2021-08-31 | 2023-01-17 | Apple Inc. | Processor with reduced interrupt latency |
Also Published As
Publication number | Publication date |
---|---|
WO2000033162A2 (en) | 2000-06-08 |
US6282637B1 (en) | 2001-08-28 |
WO2000033162A3 (en) | 2001-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6282637B1 (en) | Partially executing a pending atomic instruction to unlock resources when cancellation of the instruction occurs | |
WO2000033162A9 (en) | Execution of instructions that lock and unlock computer resources | |
US6542984B1 (en) | Scheduler capable of issuing and reissuing dependency chains | |
US8301849B2 (en) | Transactional memory in out-of-order processors with XABORT having immediate argument | |
JP3055980B2 (en) | Method for ensuring data integrity in a multiprocessor or pipeline processor system | |
US6295600B1 (en) | Thread switch on blocked load or store using instruction thread field | |
US7133968B2 (en) | Method and apparatus for resolving additional load misses in a single pipeline processor under stalls of instructions not accessing memory-mapped I/O regions | |
US6820195B1 (en) | Aligning load/store data with big/little endian determined rotation distance control | |
JP5255614B2 (en) | Transaction-based shared data operations in a multiprocessor environment | |
JP3105960B2 (en) | A method of operating data in a register with a simplified instruction set processor | |
JP5876458B2 (en) | SIMD vector synchronization | |
US5420991A (en) | Apparatus and method for maintaining processing consistency in a computer system having multiple processors | |
US6718440B2 (en) | Memory access latency hiding with hint buffer | |
JP5118652B2 (en) | Transactional memory in out-of-order processors | |
US5109495A (en) | Method and apparatus using a source operand list and a source operand pointer queue between the execution unit and the instruction decoding and operand processing units of a pipelined data processor | |
CA1325288C (en) | Method and apparatus for controlling the conversion of virtual to physical memory addresses in a digital computer system | |
US5276847A (en) | Method for locking and unlocking a computer address | |
EP1244962B1 (en) | Scheduler capable of issuing and reissuing dependency chains | |
US6564315B1 (en) | Scheduler which discovers non-speculative nature of an instruction after issuing and reissues the instruction | |
US6622235B1 (en) | Scheduler which retries load/store hit situations | |
US5835946A (en) | High performance implementation of the load reserve instruction in a superscalar microprocessor that supports multi-level cache organizations | |
JP3678443B2 (en) | Write buffer for super pipelined superscalar microprocessor | |
US5515521A (en) | Circuit and method for reducing delays associated with contention interference between code fetches and operand accesses of a microprocessor | |
US5696939A (en) | Apparatus and method using a semaphore buffer for semaphore instructions | |
EP0933697A2 (en) | A method and system for handling multiple store instruction completions in a processing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |