US20220138107A1 - Cache for storing regions of data
- Publication number: US20220138107A1 (U.S. application Ser. No. 17/575,991)
- Authority: US (United States)
- Prior art keywords: region, address, cache, memory, request
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0284—Multiple user address space allocation, e.g. using different base addresses
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
- G06F12/0846—Cache with multiple tag or data arrays being simultaneously accessible
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
- G06F12/0877—Cache access modes
- G06F12/0897—Caches characterised by their organisation or structure with two or more cache hierarchy levels
- G06F2212/60—Details of cache memory
- G06F2212/6026—Prefetching based on access pattern detection, e.g. stride based prefetch
- G06F2212/6028—Prefetching based on hints or prefetch instructions
- G06N3/084—Backpropagation, e.g. using gradient descent
Definitions
- Three-dimensional integrated circuits (3D ICs) enable 3D packaging, known as System in Package (SiP) or Chip Stack multi-chip module (MCM), which saves space by stacking separate chips in a single package. Components within these layers communicate using on-chip signaling, whether vertically or horizontally. This signaling provides reduced interconnect signal delay over known two-dimensional planar layout circuits.
- the computing system uses the additional on-chip storage as a last-level cache before accessing off-chip memory.
- a reduced miss rate achieved by the additional memory helps hide the latency gap between a processor and its off-chip memory.
- cache access mechanisms for row-based memories are inefficient for this additional integrated memory.
- Large tag data arrays, such as a few hundred megabytes for a multi-gigabyte cache, are expensive to place on the microprocessor die and incur high lookup latency. The lookup and data retrieval consume too much time because the tags and data are read out sequentially.
- Increasing the size of a data cache line for the additional integrated memory, such as growing from a 64-byte line to a 4-kilobyte (KB) line, reduces both the number of cache lines in the integrated memory and the size of a corresponding tag.
- dirty bits and coherency information, however, are still maintained at the granularity of the original cache line size (the 64-byte line). Therefore, the on-package DRAM provides substantial extra data storage, but the cache and DRAM access mechanisms remain inefficient.
- FIG. 1 is a block diagram of one embodiment of data storage.
- FIG. 2 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- FIG. 3 is a block diagram of one embodiment of a computing system.
- FIG. 4 is a block diagram of one embodiment of a system-in-package (SiP).
- FIG. 5 is a block diagram of one embodiment of data storage.
- FIG. 6 is a block diagram of one embodiment of data storage.
- FIG. 7 is a block diagram of one embodiment of data storage.
- FIG. 8 is a block diagram of one embodiment of data storage.
- FIG. 9 is a block diagram of one embodiment of data storage.
- FIG. 10 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- FIG. 11 is a block diagram of one embodiment of data storage.
- FIG. 12 is a block diagram of one embodiment of data storage.
- FIG. 13 is a block diagram of one embodiment of data storage.
- FIG. 14 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- One or more clients in the computing system process applications. Examples of such clients include a general-purpose central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), an input/output (I/O) device, and so forth.
- the computing system also includes multiple link interfaces for transferring data between clients. In addition, each of the one or more clients accesses data from a last-level cache via the communication fabric.
- a cache is implemented with low latency, high bandwidth memory separate from system memory.
- the cache is used as a last-level cache in a cache memory subsystem.
- the cache is another level within the cache memory subsystem.
- the system memory includes one of a variety of off-package dynamic random access memory (DRAM) and main memory such as hard disk drives (HDDs) and solid-state disks (SSDs).
- the computing system implements the cache with integrated DRAM, such as three-dimensional (3D) DRAM, included in a System-in-Package (SiP) with a processing unit of one of the clients.
- the computing system includes one of other memory technologies for implementing the cache, such as static RAM (SRAM), embedded DRAM (eDRAM), flash memory such as solid state disks, and one of a variety of non-volatile memories.
- non-volatile memories are phase-change memory, memristors and spin-transfer torque (STT) magnetoresistive random-access memory (MRAM).
- a cache controller for the cache includes one or more queues. Each queue stores memory access requests of a respective type. For example, in some designs, a first queue stores memory read requests and a second queue stores memory write requests. Logic within the cache controller selects a queue of the one or more queues and selects a memory access request from the selected queue. The logic determines a range of addresses corresponding to a first region of contiguous data stored in system memory with a copy of the contiguous data stored in a second region of the cache. As used herein, “contiguous data” refers to one or more bits of data located next to one another in data storage.
- the size of the contiguous data ranges between a size of a cache line (e.g., 64 bytes) and a size of a page (e.g., 4 kilobytes) in order to provide a size granularity corresponding to a region of predicted upcoming data accesses for a software application being executed. In other embodiments, another size of the contiguous data is used.
- the logic stored a copy of the contiguous data from this region, which is the first region in this example, of the system memory into the second region of the cache.
- the contiguous data in the first region includes data corresponding to the predicted upcoming data accesses.
- the logic also initialized multiple parameters used to characterize the regions. For example, the logic maintains a first start address pointing to a memory location storing data at the beginning of the first region of the system memory. In addition, the logic maintains a second start address pointing to a memory location storing data at the beginning of the second region of the cache. Further, the logic maintains a size of the second region.
- by monitoring received memory access requests and identifying a pattern, the logic predicts a region of the system memory that is going to be accessed with upcoming memory accesses. The logic identifies this region. In response, the logic performs the above steps such as storing a copy of the contiguous data from this region and updating corresponding parameters. In another embodiment, the logic receives one or more hints from software that identify, or are used to identify, the region of predicted upcoming data accesses.
- the logic determines a range of addresses beginning at the first start address and ending at an address that is the sum of the first start address and the new size of the second region.
- the updates to one or more of the first start address and the size of the second region occurs as data is updated in the second region.
- the updates to the second region include one or more of adding data, removing data, and overwriting existing data in the second region.
- the logic compares the request address to this maintained range of addresses, rather than performs a set-associative lookup or a fully-associative lookup of the tag arrays in the cache.
- the comparison is a faster operation than the lookup operations of indexes and tags of the cache.
- the logic determines that the request address of the selected memory access request is not within the range of addresses, the logic sends the selected memory access request to system memory for servicing. However, when the logic determines that the request address is within the range of addresses, the logic services the memory access request by accessing data from the cache. In order to do so, the logic determines an offset based on a difference between the request address and the first start address. Afterward, the logic determines a translated address based on the offset and the second start address. Following, the logic services the memory access request by accessing data from the cache beginning at the translated address.
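To make this flow concrete, the following C fragment is a minimal sketch of the range comparison and address translation described above. The region_t structure and the function names are illustrative assumptions, not the patent's implementation; in the patent this logic is hardware circuitry in the cache controller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of the region parameters: the start of the region
 * in system memory, the start of its copy in the cache, and the common
 * size of both regions. */
typedef struct {
    uint64_t sys_start;   /* first start address (system memory) */
    uint64_t cache_start; /* second start address (cache)        */
    uint64_t size;        /* region size in bytes                */
} region_t;

/* Returns true on a region hit and writes the translated cache address;
 * on a miss, the controller instead forwards the request to system
 * memory. A single range compare replaces the tag-array lookup. */
static bool region_lookup(const region_t *r, uint64_t req_addr,
                          uint64_t *cache_addr)
{
    if (req_addr < r->sys_start || req_addr >= r->sys_start + r->size)
        return false;                               /* cache miss */

    uint64_t offset = req_addr - r->sys_start;      /* offset into region */
    *cache_addr = r->cache_start + offset;          /* translated address */
    return true;                                    /* cache hit */
}
```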
- each of system memory 110 and a last-level cache 130 store data.
- the cache 130 is another level within the cache memory subsystem. Processing units, communication interfaces and so forth are not shown for ease of illustration.
- Data 126 is contiguous data stored in the region 120 of the system memory 110 .
- the last-level cache 130 stores in the region 140 a copy of the contiguous data 126 of the region 120 .
- the region parameters 150 characterize the regions 120 and 140 .
- the system memory 110 includes one or more of off-package DRAM, hard disk drives (HDDs) and solid-state disks (SSDs).
- the last-level cache 130 includes on-package, low latency, high bandwidth memory separate from the system memory 110 .
- the last-level cache 130 includes 3D DRAM.
- the last-level cache 130 includes static RAM (SRAM), embedded DRAM (eDRAM), flash memory such as solid state disks, and one of a variety of non-volatile memories. Examples of the non-volatile memories are phase-change memory, memristors and spin-transfer torque (STT) magnetoresistive random-access memory (MRAM).
- the address 122 which is also referred to as “x”, points to a memory location storing data at the beginning of the region 120 .
- the generic value “x” is any value represented in any manner such as integer, hexadecimal, and so forth.
- the region 120 has a size 124 , which is also referred to as “S bytes.”
- the address 142 which is also referred to as “a”, points to a memory location storing data at the beginning of the region 140 .
- the region 140 has a size 144 , which is also referred to as “S bytes,” and it is equal to the size 124 of region 120 .
- the values “x”, “S” and “a” are positive integers.
- sequential elements in a cache controller for the last-level cache 130 store the region parameters 150 .
- the sequential elements are registers, flip-flop circuits, and latches.
- the region parameters 150 include status information 152 such as a valid bit and metadata.
- the metadata are identifiers of the producer of the data 126 , identifiers of the consumer of the data 126 , cache coherency information for the data 126 , clean/dirty information for the data 126 , and so on.
- the identifiers for the producer and the consumer include one or more of a processing unit identifier, a process identifier, a thread identifier.
- the region parameters 150 do not include the status information 152 , since this information is stored in other queues and sequential elements in the cache controller.
- the region parameters 150 include two addresses.
- the first address 154 is a copy of the address 122 , which points to a memory location storing data at the beginning of the region 120 .
- the second address 156 is a copy of the address 142 , which points to a memory location storing data at the beginning of the region 140 . Therefore, the region parameters 150 include a memory mapping between the beginning of the region 120 and the beginning of the region 140 .
- the region parameters 150 currently stores a memory mapping between the address 122 (“x”) and the address 142 (“a”).
- the region parameters 150 also includes the size 158 of the region 140 .
- logic in the cache controller uses a size value of zero bytes in the size field 158 to indicate no valid region is stored in the last-level cache, rather than a valid bit in the status field 152 .
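Continuing the earlier sketch, the size-as-validity convention can be expressed as follows; again, the representation is an assumption for illustration:

```c
/* Hypothetical check mirroring the convention described above: a size
 * field of zero bytes means no valid region is cached, making a
 * separate valid bit in the status field unnecessary. */
static bool region_is_valid(const region_t *r)
{
    return r->size != 0;   /* zero size: no valid region in the cache */
}
```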
- the logic in the cache controller determines a cache hit or a cache miss for the last-level cache 130 with a comparison operation that is faster than a lookup operation in a large tag array. In one example, the logic determines whether a valid region is stored in the last-level cache 130 . If a status field 152 is used and a valid bit is negated, then there is no valid region stored in the last-level cache 130 . If a status field 152 is not used and the size field 158 stores a value of zero bytes, then there is no valid region stored in the last-level cache 130 . In such cases, the logic in the cache controller determines that there is a cache miss, and sends the memory access request with the request address to the system memory 110 for servicing. Therefore, the logic skips performing set-associative lookup operations into a set of the large tag array selected by an index of the request address, which reduces the latency of handling the memory access request.
- the logic of the cache controller determines that there is a valid region stored in the last-level cache 130 . In such a case, the logic in the cache controller determines a range of addresses when the logic determines a change in one or more of the address 122 (“x”) and the size 158 (“S”) of the region 140 . The logic determines the range of addresses as beginning at the address 122 (“x”) and ending at an address that is the sum of the address 122 and the size 158 (“S”) of the region 140 .
- the range of addresses is “x+S”.
- the logic determines whether the request address is within this range. For example, if the request address is denoted as "y," then the logic determines whether the expression x ≤ y < (x+S) is true. Therefore, to determine whether there is a cache hit or a cache miss within the last-level cache 130 , the logic compares the request address to this range of addresses.
- the comparison operation is a faster operation than the lookup operations of indexes and tags of the last-level cache 130 .
- if the logic determines the access of the last-level cache 130 is a cache miss, then the cache controller sends the memory access request with the request address to the system memory 110 for servicing. However, if the logic determines the access of the last-level cache 130 is a cache hit, then the logic services the memory access request by retrieving data from the last-level cache 130 . In order to do so, the logic determines an offset based on a difference between the request address ("y") and the address 122 ("x"), which is expressed as (y - x). The logic determines a translated address based on the offset (y - x) and the address 142 ("a"), which is the sum of the two values and is expressed as (a + (y - x)).
- the logic services the memory access request by accessing data from the last-level cache 130 beginning at the translated address, or at the address represented by (a + (y - x)).
- the logic skips performing set-associative lookup operations into a set of the large tag array selected by an index of the request address. Rather, after the comparison operation used to determine the cache hit, simple arithmetic operations are used to identify the location storing the requested data in the last-level cache 130 .
- FIG. 2 one embodiment of a method 200 for efficiently performing memory accesses in a computing system is shown.
- the steps in this embodiment (as well as in FIGS. 10 and 14 ) are shown in sequential order.
- one or more of the elements described are performed concurrently, in a different order than shown, or are omitted entirely.
- Other additional elements are also performed as desired. Any of the various systems or apparatuses described herein are configured to implement methods 200 , 1000 and 1400 .
- One or more processing units execute one or more computer programs, or software applications. Examples of a processing unit are a processor core within a general-purpose central processing unit (CPU), a graphics processing unit (GPU), or other.
- a System-in-Package includes the processing unit and an on-package, low latency, high bandwidth memory.
- a memory is a 3D integrated memory, such as a 3D DRAM.
- the processing unit utilizes at least a portion of the 3D DRAM as a cache.
- the cache is a last-level cache.
- the high bandwidth memory is used as a first-level (L1), a second-level (L2), or other level in the cache memory hierarchy other than the last-level.
- the processing unit determines a memory request misses within the cache memory subsystem in levels lower than the last-level cache (block 202 ).
- the processing unit sends a request address corresponding to the memory request to the last-level cache (block 204 ).
- the logic in the cache controller for the last-level cache maintains an identification of a first region of contiguous data in the system memory that has a copy of the contiguous data stored in a second region of the last-level cache.
- the identification includes a first start address that identifies the beginning of the first region. Additionally, the identification includes a size of the second region.
- Logic in the cache controller for the last-level cache determines a range of addresses for this first region, which is a range of addresses within the system memory address space pointing to the memory locations storing the contiguous data stored in the system memory (block 206 ). This contiguous data has a copy stored in the last-level cache.
- the logic uses the expressions described earlier in the description of the data storage 100 (of FIG. 1 ).
- logic sends the memory request including the request address to system memory (block 210 ).
- the access of the last-level cache for the memory request is considered to be a cache miss, and accordingly, the memory request is sent to a lower level of the memory subsystem such as the system memory.
- logic determines an offset based on a difference between the request address and a start address of the range in system memory (block 212 ).
- Logic determines a translated address based on the offset and a start address of the range in the last-level cache (block 214 ). For example, the translated address is a sum of the offset and a start address of the range in the last-level cache.
- Logic services the memory request by accessing data from the last-level cache beginning at the translated address (block 216 ).
- FIG. 3 a generalized block diagram of one embodiment of a computing system 300 utilizing a low-latency, high-bandwidth cache is shown.
- the computing system 300 utilizes three-dimensional (3D) packaging such as the System in Package (SiP) 310 .
- the SiP 310 is connected to a memory 362 and off-package DRAM 370 via a memory bus 350 .
- the computing system 300 is a stand-alone system within a mobile computer, a smart phone, or a tablet; a desktop; a server; or other.
- the SiP 310 uses the processing unit 320 and a low-latency, high-bandwidth cache 330 .
- the in-package low-latency interconnect 348 uses horizontal and/or vertical routes with shorter lengths than the long off-chip interconnects used when a SiP is not used.
- although the SiP 310 utilizes DRAM memory technology, such as 3D DRAM, other memory technologies that use a low latency, high bandwidth and row-based access scheme including one or more row buffers or other equivalent structures are possible and contemplated. Examples of other memory technologies are phase-change memories, spin-torque-transfer resistive memories, memristors, embedded DRAM (eDRAM), and so forth.
- the processing unit 320 is a general-purpose microprocessor, whereas, in other designs, the processing unit 320 is another type of processing unit. Other types of processing units include a graphics processing unit (GPU), a field programmable gate array (FPGA), and an accelerated processing unit (APU), which is a chip that includes additional processing capability.
- an APU includes a general-purpose CPU integrated on a same die with a GPU, a FPGA, or other processing unit, thus improving data transfer rates between these units while reducing power consumption.
- an APU includes video processing and other application-specific accelerators.
- the execution engine 322 uses one or more processor cores based on the type of the processing unit 320 . Additionally, in some designs, the execution engine 322 uses a communication fabric (or “fabric”) for transferring communication messages. Examples of communication messages are coherency probes, interrupts, and read and write access commands and corresponding data. Examples of interconnections in the fabric are bus architectures, crossbar-based architectures, network-on-chip (NoC) communication subsystems, communication channels between dies, silicon interposers, and through silicon vias (TSVs).
- the processing unit 320 incorporates a system bus controller in the interface logic 326 that utilizes one of various protocols to connect the processor cores of the execution engine 322 to memory 362 , DRAM 370 , peripheral input/output (I/O) devices and other processing units.
- the computing system 300 uses the off-package memory 362 as main memory, or system memory.
- the memory 362 is one of hard disk drives (HDDs) and solid-state disks (SSDs).
- the off-package DRAM 370 is one of a variety of types of DRAM.
- the computing system 300 fills the off-chip DRAM 370 with data from the off-chip memory 362 through the I/O controller and bus 360 and the memory bus 350 .
- the interface logic 360 supports communication protocols, address formats and packet formats for each of the off-package memory 362 and the off-package DRAM 370 .
- Each of the processor cores within the execution engine 322 uses one or more levels of a cache memory subsystem for reducing memory latencies for the processor cores. In some designs, the processor cores additionally access a shared cache within the execution engine 322 . When the cache memory subsystem within the execution engine 322 does not include data requested by a processor core, the execution engine 322 sends the memory access request to the in-package cache 330 .
- the interface logic 340 supports communication protocols, address formats and packet formats for transferring information between the in-package cache 330 and the processing unit 320 .
- the in-package cache 330 uses multiple memory arrays 332 that are segmented into multiple banks.
- each one of the banks includes a respective row buffer.
- Each one of the row buffers stores data in an accessed row of the multiple rows within the corresponding memory array bank.
- the functionality of the queues 342 , the region parameters 344 , and the portion of the logic 346 that uses the region parameters 344 is located in the logic 336 .
- this functionality is included in a cache controller for the in-package cache 330 .
- this functionality is located in the interface logic 340 as shown.
- Each of the logic 336 and the logic 346 is implemented by software, hardware such as circuitry used for combinatorial logic and sequential elements, or a combination of software and hardware.
- the logic 346 stores received memory access requests in one of the multiple queues 342 based on access type. For example, a first queue of the queues 342 stores memory read requests and a second queue of the queues 342 stores memory write requests.
- Arbitration logic within the logic 346 selects a queue of the multiple queues 342 and selects a memory access request from the selected queue.
- the logic 346 determines a range of addresses corresponding to a first region of system memory, such as region 372 , with a copy of data stored in a second region of the in-package cache 330 , such as region 338 .
- the system memory is implemented by the combination of the off-package memory 362 and the off-package DRAM 370 .
- the logic 346 sends a selected memory access request to system memory when the logic 346 determines a request address of the memory access request is not within the range of addresses for the region 372 .
- the cache controller services the selected memory request by accessing data from the memory arrays 332 within the in-package cache 330 when the logic 346 determines the request address is within the range of addresses for the region 372 .
- the logic 346 uses the region parameters 344 for the above determinations. In various embodiments, the region parameters 344 are equivalent to the region parameters 150 (of FIG. 1 ).
- the logic 346 uses one of a variety of techniques for determining when to store a copy of the region 372 in the off-package system memory as region 338 in the in-package cache 330 .
- the logic 346 monitors memory accesses from the execution engine 322 to detect a streaming or sequential memory access pattern.
- the logic 346 uses one of a variety of techniques to detect streaming access patterns such as at least the techniques used by hardware prefetchers. When the logic 346 detects a streaming pattern, the logic 346 defines a new region.
- when a memory request accesses an address that is within L bytes of the end of the region 338 , one of the logic 336 and the logic 346 extends the size of the region 338 by P bytes, where L and P are positive, non-zero integers.
- the values of L and P are stored in programmable registers in the control and status registers (CSRs) 347 .
- an initial region size is also stored in a programmable register in CSRs 347 .
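A small C sketch of this growth heuristic, reusing the region_t model from the earlier sketch; the CSR structure and field names are assumptions standing in for the programmable registers described above:

```c
/* Hypothetical tuning values read from programmable CSRs (CSRs 347). */
typedef struct {
    uint64_t near_end_bytes;  /* "L": distance from the end that triggers growth */
    uint64_t grow_bytes;      /* "P": amount to extend the region by             */
} region_csrs_t;

/* On a region hit, grow the region when the request lands within
 * L bytes of the region's end, as described above. */
static void maybe_extend_region(region_t *r, const region_csrs_t *csr,
                                uint64_t req_addr)
{
    uint64_t region_end = r->sys_start + r->size;
    if (req_addr + csr->near_end_bytes >= region_end) {
        r->size += csr->grow_bytes;
        /* A real controller would also copy the newly covered system
         * memory into the cache and respect the cache-capacity limit
         * held in the CSRs. */
    }
}
```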
- the logic 346 uses software hints for determining when to define and create the region 338 in the in-package cache 330 .
- Software uses particular instructions to update certain registers accessed by the application or the operating system.
- the software is capable of updating one or more control and status registers (CSRs) 347 in the interface logic 340 .
- the software application When processing a deep neural network, the software application is aware of when it finishes processing one layer of a multi-layer neural network and when it moves on to the next layer of the multi-layer network. As each layer of the multi-layer network is traversed (whether forward or backward), the software application utilizes software hints to inform one or more of the logic 346 and the CSRs 347 of the current region in the system memory that is being processed.
- the software application provides hints to indicate when to increase or decrease the sizes of the regions 372 and 338 . The hints also indicate to change the sizes from the left direction or the right direction of the regions 372 and 338 .
- the software hints indicate when to change the entire content of the regions 372 and 338 by moving to another region in the system memory.
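As an illustration of such hints, the sketch below assumes a memory-mapped CSR layout; the patent states only that software updates CSRs 347 with particular instructions, so this interface is entirely hypothetical:

```c
#include <stdint.h>

/* Hypothetical memory-mapped layout for region-hint CSRs; the patent
 * does not specify an interface. */
typedef struct {
    volatile uint64_t region_base;  /* start of the region in system memory */
    volatile uint64_t region_size;  /* size of the region in bytes          */
} region_hint_csrs_t;

/* Example hint: when the application moves on to the next layer of a
 * multi-layer neural network, point the region at that layer's weights. */
static void hint_next_layer(region_hint_csrs_t *csrs,
                            uint64_t layer_base, uint64_t layer_bytes)
{
    csrs->region_base = layer_base;
    csrs->region_size = layer_bytes;
}
```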
- Limits stored in the CSRs 347 prevent the region 338 from exceeding the size of the in-package cache 330 .
- the logic 346 selects between supporting the existing region and a new region based on criteria. Examples of the criteria are a size of the region, a priority level of an application accessing the data of the regions, an age of the existing region, and so forth.
- the applications modify the contents of the region 338 . If the logic 346 adjusts the size or the portion of the region 338 such that a modified portion of the region 338 is no longer valid, then one of the logic 336 and the logic 346 sends the modified data to the off-package DRAM 370 .
- one of the logic 336 and the logic 346 controls the in-package cache 330 with a write-through cache policy or a write-back cache policy.
- the write-through cache policy spreads out the write operations to the off-package DRAM 370 over time. In contrast, the write-back cache policy delays the write operations until the size of the region 338 is reduced.
- the logic 336 or the logic 346 sends the write operations for the modified data to the off-package DRAM 370 in a burst of write traffic.
- one of the logic 336 and the logic 346 controls the in-package cache 330 with a combination of a write-through cache policy and a write-back cache policy to trade off the benefits and the costs of the two policies.
- the in-package cache 330 uses low latency, high bandwidth memory technologies such as SRAM, phase-change memories, spin-torque-transfer resistive memories, memristors, embedded DRAM (eDRAM), and so forth. In other designs, the in-package cache 330 uses low latency, high bandwidth 3D DRAM.
- FIG. 4 a generalized block diagram of embodiments of systems-in-package (SiP) 400 and 440 is shown.
- the illustrated SiP includes one or more three-dimensional integrated circuits (3D ICs).
- a 3D IC includes two or more layers of active electronic components integrated both vertically and/or horizontally into a single circuit.
- fabrication techniques use interposer-based integration whereby the fabrication technique places the 3D IC next to the processing unit 420 .
- fabrication technique stacks a 3D IC directly on top of another IC.
- Die-stacking technology is a fabrication process that enables the physical stacking of multiple separate pieces of silicon (integrated chips) together in a same package with high-bandwidth and low-latency interconnects.
- the dies are stacked side by side on a silicon interposer, or vertically directly on top of each other.
- One configuration for the SiP is to stack one or more DRAM chips next to and/or on top of a processing unit.
- the stacked DRAM chips provide a very large cache for the processing unit. In some designs, this large cache has a size on the order of several hundred MB (or more).
- the SiP 400 includes a processing unit 420 and one or more three-dimensional (3D) DRAM 430 and 432 that communicate with the processing unit 420 through horizontal low-latency interconnect 410 .
- the processing unit 420 is one of a general-purpose CPU, a graphics processing unit (GPU), an accelerated processing unit (APU), a field programmable gate array (FPGA), or other data processing device that makes use of a row-based memory, such as a cache.
- the in-package horizontal low-latency interconnect 410 provides reduced lengths of interconnect signals versus long off-chip interconnects when a SiP is not used.
- the in-package horizontal low-latency interconnect 410 uses particular signals and protocols as if the chips, such as the processing unit 420 and the 3D DRAMs 430 and 432 , were mounted in separate packages on a circuit board.
- the SiP 400 additionally includes backside vias or through-bulk silicon vias 412 that reach to package external connections 414 .
- the package external connections 414 are used for input/output (I/O) signals and power signals.
- the SiP 440 includes a 3D DRAM 450 stacked directly on top of the processing unit 420 .
- multiple chips, or device layers are stacked on top of one another with direct vertical interconnects 416 tunneling through them.
- the size and density of the vertical interconnects 416 that can tunnel between the different device layers varies based on the underlying technology used to fabricate the 3D ICs.
- FIG. 5 a generalized block diagram of one embodiment of data storage 500 is shown. Circuitry and logic previously described are numbered identically. As shown, each of system memory 110 and a last-level cache 130 store data. Again, although the description describes the cache 130 as a last-level cache, in other embodiments, the cache 130 is another level within the cache memory subsystem. Data 126 is contiguous data stored in the system memory 110 . Data 526 is contiguous data added to the region in the system memory 110 to create the region 520 . The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 524 .
- the last-level cache 130 stores in the region 540 a copy of the contiguous data 126 and data 526 of the region 520 . Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the size 544 .
- the region parameters 150 characterize the regions 520 and 540 .
- the start addresses 122 and 142 remain the same. Therefore, the fields 154 and 156 remain unchanged in the region parameters 150 .
- logic updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+T bytes where S and T are positive, non-zero integers.
- Data 126 is contiguous data stored in the system memory 110 .
- Data 626 is contiguous data added to the region in the system memory 110 to create the region 620 .
- the size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 624 .
- the last-level cache 130 stores in the region 640 a copy of the contiguous data 126 and data 626 of the region 620 . Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the sum of the size 644 and the size 646 . The contiguous data wrapped around the last-level cache 130 .
- the region parameters 150 characterize the regions 620 and 640 .
- the start addresses 122 and 142 remain the same. Therefore, the fields 154 and 156 remain unchanged in the region parameters 150 .
- logic updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+T+U bytes where S, T and U are positive, non-zero integers.
- the access of a wrap around region alters the calculation of the translated address for the last-level cache 130 .
- the region 620 uses the address space 2,000 to 2,700 where the addresses are expressed as digits.
- the entire last-level cache 130 uses the address space 5,000 to 6,000 where the addresses are also expressed as digits.
- the region 640 uses the address space 5,800 wrapped around to 5,500.
- for a request address of 2,400, logic determines that the offset is (2,400 - 2,000), or 400.
- the logic adds the offset to the region start address of 5,800 to obtain (5,800+400), or 6,200. This value exceeds the limit of the region 640 .
- the logic determines the difference, which is (6,200 - 6,000), or 200.
- the logic adds the difference to the start address to obtain (5,000+200), or 5,200.
- the translated address is 5,200, and the logic uses the translated address 5,200 to access data from the last-level cache 130 in order to service the memory request.
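The wrap-around translation just walked through can be sketched in C as follows; the parameter names are assumptions, and the numeric comments repeat the example above:

```c
#include <stdint.h>

/* Translate a request address for a region that may wrap around the
 * cache: the cache occupies [cache_base, cache_limit), and the region
 * copy starts at region_cache_start. */
static uint64_t translate_wrapped(uint64_t req_addr,
                                  uint64_t region_sys_start,
                                  uint64_t region_cache_start,
                                  uint64_t cache_base,
                                  uint64_t cache_limit)
{
    uint64_t offset = req_addr - region_sys_start;  /* 2,400 - 2,000 = 400  */
    uint64_t addr = region_cache_start + offset;    /* 5,800 + 400 = 6,200  */
    if (addr >= cache_limit)                        /* past the cache end   */
        addr = cache_base + (addr - cache_limit);   /* 5,000 + 200 = 5,200  */
    return addr;
}
```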
- FIG. 7 a generalized block diagram of one embodiment of data storage 700 is shown. Similar to data storage 500 and 600 and upcoming data storage 800 - 900 and 1300 , circuitry and logic previously described are numbered identically. As shown, each of system memory 110 and a last-level cache 130 store data. Although the description describes the cache 130 as a last-level cache, in other embodiments, the cache 130 is another level within the cache memory subsystem. Data 126 is contiguous data stored in the system memory 110 . Data 726 is contiguous data added to the region in the system memory 110 to create the region 720 . The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 724 . The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 720 is address 722 ("x2"), rather than the address 122 ("x1").
- the last-level cache 130 stores in the region 740 a copy of the contiguous data 126 and data 726 of the region 720 . Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the size 744 . The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 740 is address 742 (“a2”), rather than the address 142 (“a1”).
- the region parameters 150 characterize the regions 720 and 740 .
- the start addresses 122 and 142 change, and the fields 154 and 156 indicate the changes in the region parameters 150 .
- Logic also updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+V bytes where S and V are positive, non-zero integers.
- Data 126 is contiguous data stored in the system memory 110 .
- Data 826 is contiguous data added to the region in the system memory 110 to create the region 820 .
- the size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 824 . The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 820 is address 822 (“x2”), rather than the address 122 (“x1”).
- the last-level cache 130 stores in the region 840 a copy of the contiguous data 126 and data 826 of the region 820 . Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the sum of the size 844 and the size 846 . The contiguous data wrapped around the last-level cache 130 . The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 840 is address 842 (“a2”), rather than the address 142 (“a1”).
- the region parameters 150 characterize the regions 820 and 840 .
- the start addresses 122 and 142 change, and the fields 154 and 156 indicate the changes in the region parameters 150 .
- Logic also updates the size field 158 to an increased amount.
- the logic updates the size field 158 from S bytes to S+V+W bytes where S, V and W are positive, non-zero integers.
- the access of a wrap around region alters the calculation of the translated address for the last-level cache 130 .
- Logic uses a computation as described earlier for data storage 600 .
- Data 126 is contiguous data stored in the system memory 110 .
- Data 926 is contiguous data removed from the region in the system memory 110 to create the region 920 .
- the size of contiguous data in the system memory 110 copied as a region reduced from the size 124 to the size 924 . The decrease is in the right direction, rather than the left direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 920 is address 922 (“x2”), rather than the address 122 (“x1”).
- the last-level cache 130 stores in the region 940 a copy of the contiguous data 126 of the region 920 . Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region reduced from the size 144 to the size 944 . The decrease is in the right direction, rather than the left direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 940 is address 942 (“a2”), rather than the address 142 (“a1”).
- the region parameters 150 characterize the regions 920 and 940 .
- the start addresses 122 and 142 change, and the fields 154 and 156 indicate the changes in the region parameters 150 .
- Logic also updates the size field 158 to a decreased amount.
- the logic updates the size field 158 from S bytes to S-U bytes where S and U are positive, non-zero integers. It is noted that if the decrease in sizes of the regions occurred at the end of the regions and in the left direction, rather than the right direction, then the addresses 122 and 142 remain the same while the size field 158 is still updated.
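As a sketch, a shrink that removes data at the start of a region (the FIG. 9 case) could update the parameters as follows, reusing the earlier region_t model:

```c
/* Hypothetical update for removing U bytes from the start of the
 * region: the start addresses in fields 154 and 156 advance while the
 * size field 158 decreases, as described above. */
static void shrink_region_from_start(region_t *r, uint64_t u_bytes)
{
    r->sys_start   += u_bytes;   /* address 122 ("x1") becomes 922 ("x2")   */
    r->cache_start += u_bytes;   /* address 142 ("a1") becomes 942 ("a2");
                                    a real controller also handles wrap-around */
    r->size        -= u_bytes;   /* size field 158: S bytes -> S-U bytes    */
}
```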
- Logic monitors memory access patterns and/or receives software hints of data accesses (block 1002 ).
- software techniques with particular instructions used as hints, hardware techniques such as the techniques used by hardware prefetchers, or a combination of the two are used to determine when to begin defining a region of memory.
- if logic does not predict a region of upcoming data accesses ("no" branch of the conditional block 1004 ), control flow of method 1000 returns to block 1002 where logic monitors memory access patterns and/or receives software hints. If logic predicts a region of upcoming data accesses ("yes" branch of the conditional block 1004 ), then the logic initializes parameters characterizing the region of predicted upcoming data accesses (block 1006 ). For example, the logic stores a start address for the region of upcoming data accesses in the system memory, and stores a start address for this region in the last-level cache. Additionally, the logic stores a region size for this region. In some embodiments, an initial region size is provided in a programmable register of multiple control and status registers. In some designs, the initial size is between the fine granularity of a cache line size (e.g., 64 bytes) and a page size (e.g., 4 kilobytes or more).
- Logic stores a copy of contiguous data from the region of upcoming data accesses in the system memory into the last-level cache. For example, the logic stores a copy of data from a first region of system memory into a second region of a last-level cache (block 1008 ) where each of the first region and the second region corresponds to the region of the predicted upcoming data accesses in the system memory. Logic services memory requests targeting the first region by accessing data from the second region (block 1010 ). If logic determines the second region changes size (“yes” branch of the conditional block 1012 ), then logic updates parameters characterizing the region to indicate the size change (block 1014 ).
- control flow of method 1000 then returns to block 1010 , where logic services memory requests targeting the first region by accessing data from the second region. If the accesses of the second region have completed ("yes" branch of the conditional block 1016 ), then logic updates parameters characterizing the region to indicate there is no region (block 1018 ). Afterward, control flow of method 1000 returns to block 1002 where logic monitors memory access patterns and/or receives software hints.
- FIG. 11 a generalized block diagram of one embodiment of data storage 1100 is shown.
- Each of system memory 1110 and a last-level cache 1130 store data. Processing units, communication interfaces and so forth are not shown for ease of illustration. Although the description describes the cache 1130 as a last-level cache, in other embodiments, the cache 1130 is another level within the cache memory subsystem.
- Data 1120 - 1128 are contiguous data stored in the system memory 1110 .
- the last-level cache 1130 stores a copy of a portion of the contiguous data 1120 - 1128 at different points in time.
- the weights of a large (deep) neural network are stored in system memory 1110 , such as off-package DRAM.
- the weights, such as data 1120 - 1128 are too large to fit in the last-level cache 1130 , such as in-package 3D DRAM.
- the weights are evaluated by a processing unit executing a software application. From points in time t1 to t7, the size and content of the region stored in the last-level cache 1130 change.
- the data 1120 corresponds to a first layer of weights of a multi-layer neural network.
- the data 1122 corresponds to a second layer of weights of the multi-layer neural network, and so on.
- the data 1120 is copied into the last-level cache 1130 at time t0 (not shown).
- the data 1122 is added to the region stored in the last-level cache 1130 .
- the data 1124 and the data 1126 are added to the region stored in the last-level cache 1130 .
- the region in the last-level cache 1130 expands in order to store the weights.
- the accessing of the neural network's weights proceeds in a regular, predictable manner. Therefore, the region in the last-level cache 1130 is increased sufficiently ahead of the evaluation of the weights.
- programmable registers of CSRs 347 (of FIG. 3 ) store the values, such as L and P described earlier, that control when and by how much the region grows.
- the processing unit accesses the weights in the in-package last-level cache, rather than in the off-package DRAM.
- the entire last-level cache 1130 is filled.
- logic in the cache controller or in the processing unit adjusts the size of the region by decreasing the size from the left.
- the logic removes the data 1120 from the region in the last-level cache 1130 .
- the logic updates region parameters accordingly. Due to the nature of the software application performing the training of the weights, once the processing unit evaluates a given layer, the corresponding weights are not needed again for the current inference or forward propagation. Therefore, in some designs, the weights of the given layer are removed from the region of the last-level cache 1130 .
- data 1126 is added to the region of the last-level cache 1130 .
- the region wraps around the last-level cache 1130 .
- the access of a wraparound region alters the calculation of the translated address for the last-level cache 1130 .
- Logic uses a computation as described earlier for data storage 600 .
- the logic removes the data 1122 from the region in the last-level cache 1130 .
- the logic updates region parameters accordingly.
- the logic adds data 1128 to the region of the last-level cache 1130 .
- the processing unit After the processing unit has processed the last layer of the neural network, the processing unit generates a final output. The processing unit typically compares this final output against an expected value to compute an error or loss.
- the training of the neural network then continues with a backward propagation phase. During the backward propagation, the processing unit processes the layers of the multi-layered neural network in reverse order. Logic allocates and deallocates the region of the last-level cache 1130 in a manner to support the reverse order.
- the system memory 1210 stores data. Processing units, communication interfaces and so forth are not shown for ease of illustration.
- the system memory 1210 stores data of multiple regions. Examples of the regions include a first region 1220 pointed to by the address 1212 (“w0”), data of a second region 1222 pointed to by the address 1214 (“x0”), data of a third region 1224 pointed to by the address 1216 (“y0”), and data of a fourth region 1226 pointed to by the address 1218 (“z0”).
- a software application performs a stencil-like calculation where each element in an output vector stored in the region pointed to by the address 1218 ("z0") is a sum of elements in the other vectors pointed to by addresses 1212 - 1216 ("w0" to "y0"). For example, if the output vector is represented as vector "d", and each of the vectors in the other regions are represented as "a" to "c", then the value of the element d[i] of the vector d is a[i-1]+a[i]+a[i+1]+b[i-1]+b[i]+b[i+1]+c[i-1]+c[i]+c[i+1].
- the adder 1230 sums the values of the elements of the input vectors to generate an element in the output vector. In many cases, none of the input vectors fit within an in-package cache. However, each region within the in-package cache is capable of storing an active portion of each respective vector. As the calculation proceeds, each of the regions are updated to maintain an active portion of each vector. An example of such a scheme is shown in the following description.
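For reference, a direct C rendering of this stencil; it covers interior elements only, so the i-1 and i+1 accesses stay in bounds, and the vector names follow the description:

```c
/* Stencil-like calculation from the description: each output element
 * d[i] sums a three-element window from each of the input vectors. */
static void stencil(const double *a, const double *b, const double *c,
                    double *d, int n)
{
    for (int i = 1; i < n - 1; i++) {
        d[i] = a[i-1] + a[i] + a[i+1]
             + b[i-1] + b[i] + b[i+1]
             + c[i-1] + c[i] + c[i+1];
    }
}
```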
- the last-level cache (LLC) 1330 stores a copy of data stored in the system memory 1210 . Although the description describes the cache 1330 as a last-level cache, in other embodiments, the cache 1330 is another level within the cache memory subsystem. Processing units, communication interfaces and so forth are not shown for ease of illustration. The last-level cache 1330 stores data of multiple regions.
- Examples of the regions include a first region pointed to by the address 1332 ("a0") with a size 1334 ("S bytes"), a second region pointed to by the address 1336 ("b0") with a size 1338 ("T bytes"), a third region pointed to by the address 1340 ("c0") with a size 1342 ("U bytes"), and a fourth region pointed to by the address 1344 ("d0") with a size 1346 ("V bytes").
- the table 1350 stores region parameters for the regions stored in the last-level cache 1330 .
- the fields 1352 - 1358 are equivalent to the fields 152 - 158 of the region parameters 150 (of FIG. 1 ).
- the table 1350 supports multiple separate regions, rather than a single region.
- the table 1350 includes four valid rows (entries) for supporting the four regions in the last-level cache 1330 . Although four regions and entries are shown, any number of entries and regions are used in other examples.
- logic maintains the information in the table 1350 to ensure that the multiple regions grow, reduce and wrap around the last-level cache 1330 without overrunning another region.
- logic For each memory access of the last-level cache 1330 , logic compares the requested address against each valid, supported region in the last-level cache 1330 .
- the table 1350 stores information in a fully-associative manner. The requested address now checks against all N sets of region definition registers (analogous to a fully-associative cache structure).
- a processing unit executes one or more computer programs, or software applications. Examples of a processing unit are a processor core within a CPU, a GPU, or other.
- a System-in-Package includes the processing unit and high bandwidth memory.
- the high bandwidth memory is a 3D integrated memory, such as a 3D DRAM.
- the processing unit utilizes at least a portion of the 3D DRAM as a cache.
- the processing unit determines a memory request misses within a cache memory subsystem in levels lower than the last-level cache (block 1402 ).
- the processing unit utilizes at least a portion of the high bandwidth memory as a last-level cache.
- the processing unit sends an address corresponding to the memory request to the last-level cache (block 1404 ).
- Logic selects a range of one or more address ranges within a system memory address space with data stored in the last-level cache (block 1406 ). If logic determines the request address is not within the selected range (“no” branch of the conditional block 1408 ), and the last range is not reached (“no” branch of the conditional block 1410 ), then control flow of method 1400 returns to block 1406 . In block 1406 , logic selects another range of the one or more address ranges. If logic determines the request address is not within the selected range (“no” branch of the conditional block 1408 ), and the last range is reached (“yes” branch of the conditional block 1410 ), then logic sends the memory request including the request address to system memory (block 1412 ).
- logic determines the request address is within the selected range (“yes” branch of the conditional block 1408 ) then logic determines an offset based on a difference between the address and a start address of the range in system memory (block 1414 ). Logic determines a translated address based on the offset and a start address of the range in the last-level cache (block 1416 ). Logic services the memory request by accessing data from the last-level cache beginning at the translated address (block 1418 ).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
Description
- This application is a continuation of U.S. patent application Ser. No. 16/214,363, now U.S. Pat. No. 11,232,039, entitled “CACHE FOR STORING REGIONS OF DATA”, filed Dec. 10, 2018, the entirety of which is incorporated herein by reference.
- As both semiconductor manufacturing processes advance and on-die geometric dimensions reduce, semiconductor chips provide more functionality and performance. However, design issues still arise with modern techniques in processing and integrated circuit design that limit potential benefits. One issue is that interconnect delays continue to increase per unit length in successive generations of two-dimensional planar layout chips. Also, high electrical impedance between individual chips increases latency. In addition, signals that traverse off-chip to another die increase power consumption for these signals due to the increased parasitic capacitance on these longer signal routes.
- Another design issue is that most software applications that access a lot of data are typically memory bound in that computation time is generally determined by memory bandwidth. A memory access latency for an off-chip dynamic random access memory (DRAM) is hundreds to over a thousand clock cycles, and an increased number of cores in a processor design have accentuated the memory bandwidth problem. Recently, there has been progress in memory technologies for implementing in-package memory that provides access to a large, low latency, high bandwidth memory before accessing off-package DRAM and main memory.
- One example of this memory technology is the three-dimensional integrated circuit (3D IC), which includes two or more layers of active electronic components integrated both vertically and horizontally into a single circuit. The 3D packaging, known as System in Package (SiP) or Chip Stack multi-chip module (MCM), saves space by stacking separate chips in a single package. Components within these layers communicate using on-chip signaling, whether vertically or horizontally. This signaling provides reduced interconnect signal delay over known two-dimensional planar layout circuits.
- The manufacturing trends in the above description lead to gigabytes of integrated memory within a single package. In some cases, the computing system uses the additional on-chip storage as a last-level cache before accessing off-chip memory. A reduced miss rate achieved by the additional memory helps hide the latency gap between a processor and its off-chip memory. However, cache access mechanisms for row-based memories are inefficient for this additional integrated memory. Large tag data arrays, such as a few hundred megabytes for a multi-gigabyte cache, are expensive to place on the microprocessor die and provide a high latency for lookups of the large tag arrays. The lookup and data retrieval consume too much time as the tags and data are read out in a sequential manner.
- Increasing the size of a data cache line for the additional integrated memory, such as growing from a 64-byte line to a 4-kilobyte (KB) line, reduces both a number of cache lines in the integrated memory and the size of a corresponding tag. However, dirty bits and coherency information are still maintained on a granularity of the original cache line size (64-byte line). Therefore, the on-package DRAM provides a lot of extra data storage, but cache and DRAM access mechanisms are inefficient.
- In view of the above, efficient methods and systems for performing memory accesses in a computing system are desired.
- The advantages of the methods and mechanisms described herein may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:
- FIG. 1 is a block diagram of one embodiment of data storage.
- FIG. 2 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- FIG. 3 is a block diagram of one embodiment of a computing system.
- FIG. 4 is a block diagram of one embodiment of a system-in-package (SiP).
- FIG. 5 is a block diagram of one embodiment of data storage.
- FIG. 6 is a block diagram of one embodiment of data storage.
- FIG. 7 is a block diagram of one embodiment of data storage.
- FIG. 8 is a block diagram of one embodiment of data storage.
- FIG. 9 is a block diagram of one embodiment of data storage.
- FIG. 10 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- FIG. 11 is a block diagram of one embodiment of data storage.
- FIG. 12 is a block diagram of one embodiment of data storage.
- FIG. 13 is a block diagram of one embodiment of data storage.
- FIG. 14 is a flow diagram of one embodiment of a method for performing efficient memory accesses in a computing system.
- While the invention is susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the invention is to cover all modifications, equivalents and alternatives falling within the scope of the present invention as defined by the appended claims.
- In the following description, numerous specific details are set forth to provide a thorough understanding of the methods and mechanisms presented herein. However, one having ordinary skill in the art should recognize that the various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the approaches described herein. It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
- Various systems, apparatuses, methods, and computer-readable mediums for efficiently performing memory accesses in a computing system are disclosed. One or more clients in the computing system process applications. Examples of such clients include a general-purpose central processing unit (CPU), a graphics processing unit (GPU), an accelerated processing unit (APU), an input/output (I/O) device, and so forth. The computing system also includes multiple link interfaces for transferring data between clients. In addition, each of the one or more clients accesses data from a last-level cache via the communication fabric.
- In various embodiments, a cache is implemented with low latency, high bandwidth memory separate from system memory. In some embodiments, the cache is used as a last-level cache in a cache memory subsystem. In other embodiments, the cache is another level within the cache memory subsystem. The system memory includes one of a variety of off-package dynamic random access memory (DRAM) and main memory such as hard disk drives (HDDs) and solid-state disks (SSDs). In some embodiments, the computing system implements the cache with integrated DRAM, such as three-dimensional (3D) DRAM, included in a System-in-Package (SiP) with a processing unit of one of the clients. In other embodiments, the computing system includes one of other memory technologies for implementing the cache such as static RAM (SRAM), embedded DRAM (eDRAM), flash memory such as solid state disks, and one of a variety of non-volatile memories. Examples of the non-volatile memories are phase-change memory, memristors and spin-transfer torque (STT) magnetoresistive random-access memory (MRAM).
- A cache controller for the cache includes one or more queues. Each queue stores memory access requests of a respective type. For example, in some designs, a first queue stores memory read requests and a second queue stores memory write requests. Logic within the cache controller selects a queue of the one or more queues and selects a memory access request from the selected queue. The logic determines a range of addresses corresponding to a first region of contiguous data stored in system memory with a copy of the contiguous data stored in a second region of the cache. As used herein, “contiguous data” refers to one or more bits of data located next to one another in data storage. In some embodiments, the size of the contiguous data ranges between a size of a cache line (e.g., 64 bytes) and a size of a page (e.g., 4 kilobytes) in order to provide a size granularity corresponding to a region of predicted upcoming data accesses for a software application being executed. In other embodiments, another size of the contiguous data is used.
- At an earlier point in time, when the logic determined that a region of predicted upcoming data accesses was defined, the logic stored a copy of the contiguous data from this region of the system memory, which is the first region in this example, into the second region of the cache. The contiguous data in the first region includes data corresponding to the predicted upcoming data accesses. The logic also initialized multiple parameters used to characterize the regions. For example, the logic maintains a first start address pointing to a memory location storing data at the beginning of the first region in the system memory. In addition, the logic maintains a second start address pointing to a memory location storing data at the beginning of the second region of the cache. Further, the logic maintains a size of the second region.
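- To make this bookkeeping concrete, the following C sketch shows one possible layout for these parameters. This is an illustrative model only, not text from the patent; the type and field names are hypothetical.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical model of the region parameters described above: a start
     * address in system memory, a start address in the cache, and a size.
     * A size of zero can indicate that no valid region is defined. */
    typedef struct {
        bool     valid;        /* optional status bit                     */
        uint64_t sys_start;    /* start of the region in system memory    */
        uint64_t cache_start;  /* start of the region's copy in the cache */
        uint64_t size_bytes;   /* current size of the region              */
    } region_params_t;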
- In one embodiment, by monitoring received memory access requests and identifying a pattern, the logic predicts a region of the system memory that is going to be accessed with upcoming memory accesses. The logic identifies this region. In response, the logic performs the above steps such as storing a copy of the contiguous data from this region and updating corresponding parameters. In another embodiment, the logic receives one or more hints from software that identifies or is used to identify the region of predicted upcoming data accesses.
- When the logic detects a change to the size of the second region, the logic determines a range of addresses beginning at the first start address and ending at an address that is the sum of the first start address and the new size of the second region. In some embodiments, the updates to one or more of the first start address and the size of the second region occur as data is updated in the second region. The updates to the second region include one or more of adding data, removing data, and overwriting existing data in the second region. When the logic selects a memory access request from one of the multiple queues, the logic of the cache controller compares a request address of the selected memory access request to the range of addresses. The logic determines whether the request address is within this range. Therefore, to determine whether there is a cache hit or a cache miss within the last-level cache, the logic compares the request address to this maintained range of addresses, rather than performing a set-associative lookup or a fully-associative lookup of the tag arrays in the cache. The comparison is a faster operation than the lookup operations of indexes and tags of the cache.
- When the logic determines that the request address of the selected memory access request is not within the range of addresses, the logic sends the selected memory access request to system memory for servicing. However, when the logic determines that the request address is within the range of addresses, the logic services the memory access request by accessing data from the cache. In order to do so, the logic determines an offset based on a difference between the request address and the first start address. Afterward, the logic determines a translated address based on the offset and the second start address. Following, the logic services the memory access request by accessing data from the cache beginning at the translated address.
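- A minimal C sketch of this hit test and translation may help. The helper names and the region_params_t type are hypothetical; the arithmetic follows the description above: a request hits if it falls inside the tracked system-memory range, and the translated cache address is the second start address plus the offset into that range.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t sys_start;    /* first start address (system memory) */
        uint64_t cache_start;  /* second start address (cache)        */
        uint64_t size_bytes;   /* size of the cached region           */
    } region_params_t;

    /* Returns true on a cache hit and writes the translated cache address.
     * On a miss the caller forwards the request to system memory. */
    static bool region_lookup(const region_params_t *r, uint64_t req_addr,
                              uint64_t *translated) {
        if (r->size_bytes == 0 ||                      /* no valid region   */
            req_addr <  r->sys_start ||
            req_addr >= r->sys_start + r->size_bytes)  /* outside the range */
            return false;
        uint64_t offset = req_addr - r->sys_start;     /* offset into range */
        *translated = r->cache_start + offset;         /* rebase to cache   */
        return true;
    }

- A single compare and add of this form replaces the index and tag lookups of a conventional set-associative cache, which is the latency advantage the description emphasizes.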
- Referring to FIG. 1, a generalized block diagram of one embodiment of data storage 100 is shown. As shown, each of system memory 110 and a last-level cache 130 store data. Although the description describes the cache 130 as a last-level cache, in other embodiments, the cache 130 is another level within the cache memory subsystem. Processing units, communication interfaces and so forth are not shown for ease of illustration. Data 126 is contiguous data stored in the region 120 of the system memory 110. The last-level cache 130 stores in the region 140 a copy of the contiguous data 126 of the region 120. The region parameters 150 characterize the regions 120 and 140.
- In various designs, the system memory 110 includes one or more of off-package DRAM, hard disk drives (HDDs) and solid-state disks (SSDs). In some designs, the last-level cache 130 includes on-package, low latency, high bandwidth memory separate from the system memory 110. In some designs, the last-level cache 130 includes 3D DRAM. In other designs, the last-level cache 130 includes static RAM (SRAM), embedded DRAM (eDRAM), flash memory such as solid state disks, and one of a variety of non-volatile memories. Examples of the non-volatile memories are phase-change memory, memristors and spin-transfer torque (STT) magnetoresistive random-access memory (MRAM).
address 122, which is also referred to as “x”, points to a memory location storing data at the beginning of theregion 120. Here, the generic value “x” is any value represented in any manner such as integer, hexadecimal, and so forth. Theregion 120 has asize 124, which is also referred to as “S bytes.” In a similar manner, theaddress 142, which is also referred to as “a”, points to a memory location storing data at the beginning of theregion 140. Theregion 140 has asize 144, which is also referred to as “S bytes,” and it is equal to thesize 124 ofregion 120. The values “x”, “S” and “a” are positive integers. - In some embodiments, sequential elements in a cache controller for the last-
- In some embodiments, sequential elements in a cache controller for the last-level cache 130 store the region parameters 150. Examples of the sequential elements are registers, flip-flop circuits, and latches. In an embodiment, the region parameters 150 include status information 152 such as a valid bit and metadata. Examples of the metadata are identifiers of the producer of the data 126, identifiers of the consumer of the data 126, cache coherency information for the data 126, clean/dirty information for the data 126, and so on. The identifiers for the producer and the consumer include one or more of a processing unit identifier, a process identifier, and a thread identifier. In other embodiments, the region parameters 150 do not include the status information 152, since this information is stored in other queues and sequential elements in the cache controller.
- In an embodiment, the region parameters 150 include two addresses. The first address 154 is a copy of the address 122, which points to a memory location storing data at the beginning of the region 120. The second address 156 is a copy of the address 142, which points to a memory location storing data at the beginning of the region 140. Therefore, the region parameters 150 include a memory mapping between the beginning of the region 120 and the beginning of the region 140. For example, the region parameters 150 currently store a memory mapping between the address 122 (“x”) and the address 142 (“a”). In some embodiments, the region parameters 150 also include the size 158 of the region 140. In an embodiment, logic in the cache controller uses a size value of zero bytes in the size field 158 to indicate no valid region is stored in the last-level cache, rather than a valid bit in the status field 152.
region parameters 150, the logic in the cache controller determines a cache hit or a cache miss for the last-level cache 130 with a comparison operation that is faster than a lookup operation in a large tag array. In one example, the logic determines whether a valid region is stored in the last-level cache 130. If astatus field 152 is used and a valid bit is negated, then there is no valid region stored in the last-level cache 130. If astatus field 152 is not used and thesize field 158 stores a value of zero bytes, then there is no valid region stored in the last-level cache 130. In such cases, the logic in the cache controller determines that there is a cache miss, and sends the memory access request with the request address to thesystem memory 110 for servicing. Therefore, the logic skips performing set-associative lookup operations into a set of the large tag array selected by an index of the request address, which reduces the latency of handling the memory access request. - If a
- If a status field 152 is used and a valid bit is asserted, or if a status field 152 is not used but the size field 158 stores a positive, non-zero integer, then the logic of the cache controller determines that there is a valid region stored in the last-level cache 130. In such a case, the logic in the cache controller determines a range of addresses when the logic determines a change in one or more of the address 122 (“x”) and the size 158 (“S”) of the region 140. The logic determines the range of addresses as beginning at the address 122 (“x”) and ending at an address that is the sum of the address 122 and the size 158 (“S”) of the region 140. Using the notation in the illustrated embodiment, the range of addresses runs from “x” up to “x+S”. The logic determines whether the request address is within this range. For example, if the request address is denoted as “y,” then the logic determines whether the expression x≤y<(x+S) is true. Therefore, to determine whether there is a cache hit or a cache miss within the last-level cache 130, the logic compares the request address to this range of addresses. The comparison operation is a faster operation than the lookup operations of indexes and tags of the last-level cache 130.
- If the logic determines the access of the last-level cache 130 is a cache miss, then the cache controller sends the memory access request with the request address to the system memory 110 for servicing. However, if the logic determines the access of the last-level cache 130 is a cache hit, then the logic services the memory access request by retrieving data from the last-level cache 130. In order to do so, the logic determines an offset based on a difference between the request address (“y”) and the address 122 (“x”), which is expressed as (y−x). The logic determines a translated address based on the offset (y−x) and the address 142 (“a”), which is the sum of the two values and is expressed as (a+(y−x)). The logic services the memory access request by accessing data from the last-level cache 130 beginning at the translated address, or at the address represented by (a+(y−x)). The logic skips performing set-associative lookup operations into a set of the large tag array selected by an index of the request address. Rather, after the comparison operation used to determine the cache hit, simple arithmetic operations are used to identify the location storing the requested data in the last-level cache 130.
- Referring now to FIG. 2, one embodiment of a method 200 for efficiently performing memory accesses in a computing system is shown. For purposes of discussion, the steps in this embodiment (as well as in FIGS. 10 and 14) are shown in sequential order. However, it is noted that in various embodiments of the described methods, one or more of the elements described are performed concurrently, in a different order than shown, or are omitted entirely. Other additional elements are also performed as desired. Any of the various systems or apparatuses described herein are configured to implement methods 200, 1000 and 1400.
- One or more processing units execute one or more computer programs, or software applications. Examples of a processing unit are a processor core within a general-purpose central processing unit (CPU), a graphics processing unit (GPU), or other. In some embodiments, a System-in-Package (SiP) includes the processing unit and an on-package, low latency, high bandwidth memory. One example of such a memory is a 3D integrated memory, such as a 3D DRAM. In an embodiment, the processing unit utilizes at least a portion of the 3D DRAM as a cache. In one embodiment, the cache is a last-level cache. Although the following description describes the low latency, high bandwidth memory as being used as a last-level cache, in other embodiments, the high bandwidth memory is used as a first-level (L1), a second-level (L2), or other level in the cache memory hierarchy other than the last-level. The processing unit determines a memory request misses within the cache memory subsystem in levels lower than the last-level cache (block 202).
- The processing unit sends a request address corresponding to the memory request to the last-level cache (block 204). In an embodiment, the logic in the cache controller for the last-level cache maintains an identification of a first region of contiguous data in the system memory that has a copy of the contiguous data stored in a second region of the last-level cache. In some embodiments, the identification includes a first start address that identifies the beginning of the first region. Additionally, the identification includes a size of the second region. Logic in the cache controller for the last-level cache determines a range of addresses for this first region, which is a range of addresses within the system memory address space pointing to the memory locations storing the contiguous data stored in the system memory (block 206). This contiguous data has a copy stored in the last-level cache. In some designs, the logic uses the expressions described earlier in the description of the data storage 100 (of FIG. 1).
- Turning now to
- Turning now to FIG. 3, a generalized block diagram of one embodiment of a computing system 300 utilizing a low-latency, high-bandwidth cache is shown. In various embodiments, the computing system 300 utilizes three-dimensional (3D) packaging such as the System in Package (SiP) 310. The SiP 310 is connected to a memory 362 and off-package DRAM 370 via a memory bus 350. In one embodiment, the computing system 300 is a stand-alone system within a mobile computer, a smart phone, or a tablet; a desktop; a server; or other. The SiP 310 uses the processing unit 320 and a low-latency, high-bandwidth cache 330. The processing unit 320 and the cache 330 communicate through a low-latency interconnect 348. The in-package low-latency interconnect 348 uses one or more of horizontal and/or vertical routes with shorter lengths than long off-chip interconnects when a SiP is not used.
- Although, in some embodiments, the SiP 310 utilizes DRAM memory technology, such as 3D DRAM, other memory technologies that use a low latency, high bandwidth and row-based access scheme, including one or more row buffers or other equivalent structures, are possible and contemplated. Examples of other memory technologies are phase-change memories, spin-torque-transfer resistive memories, memristors, embedded DRAM (eDRAM), and so forth. In some designs, the processing unit 320 is a general-purpose microprocessor, whereas, in other designs, the processing unit 320 is another type of processing unit. Other types of processing units include a graphics processing unit (GPU), a field programmable gate array (FPGA), and an accelerated processing unit (APU), which is a chip that includes additional processing capability. This additional processing capability accelerates one or more types of computations outside of a general-purpose CPU. In one embodiment, an APU includes a general-purpose CPU integrated on a same die with a GPU, an FPGA, or other processing unit, thus improving data transfer rates between these units while reducing power consumption. In other embodiments, an APU includes video processing and other application-specific accelerators.
- The execution engine 322 uses one or more processor cores based on the type of the processing unit 320. Additionally, in some designs, the execution engine 322 uses a communication fabric (or “fabric”) for transferring communication messages. Examples of communication messages are coherency probes, interrupts, and read and write access commands and corresponding data. Examples of interconnections in the fabric are bus architectures, crossbar-based architectures, network-on-chip (NoC) communication subsystems, communication channels between dies, silicon interposers, and through silicon vias (TSVs). In many designs, the processing unit 320 incorporates a system bus controller in the interface logic 326 that utilizes one of various protocols to connect the processor cores of the execution engine 322 to memory 362, DRAM 370, peripheral input/output (I/O) devices and other processing units.
- The computing system 300 uses the off-package memory 362 as main memory, or system memory. The memory 362 is one of hard disk drives (HDDs) and solid-state disks (SSDs). The off-package DRAM 370 is one of a variety of types of DRAM. The computing system 300 fills the off-chip DRAM 370 with data from the off-chip memory 362 through the I/O controller and bus 360 and the memory bus 350. The interface logic 360 supports communication protocols, address formats and packet formats for each of the off-package memory 362 and the off-package DRAM 370.
- Each of the processor cores within the execution engine 322 uses one or more levels of a cache memory subsystem for reducing memory latencies for the processor cores. In some designs, the processor cores additionally access a shared cache within the execution engine 322. When the cache memory subsystem within the execution engine 322 does not include data requested by a processor core, the execution engine 322 sends the memory access request to the in-package cache 330. The interface logic 340 supports communication protocols, address formats and packet formats for transferring information between the in-package cache 330 and the processing unit 320.
- Similar to other DRAM topologies, in some designs, the in-package cache 330 uses multiple memory arrays 332 that are segmented into multiple banks. In such cases, each one of the banks includes a respective row buffer. Each one of the row buffers stores data in an accessed row of the multiple rows within the corresponding memory array bank. In some embodiments, the functionality of the queues 342, the region parameters 344, and the portion of the logic 346 that uses the region parameters 344, are located in the logic 336. For example, this functionality is included in a cache controller for the in-package cache 330. In other embodiments, this functionality is located in the interface logic 340 as shown. Each of the logic 336 and the logic 346 is implemented by software, hardware such as circuitry used for combinatorial logic and sequential elements, or a combination of software and hardware.
- When the interface logic 340 receives a memory access request from the execution engine 322, the logic 346 stores received memory access requests in one of the multiple queues 342 based on access type. For example, a first queue of the queues 342 stores memory read requests and a second queue of the queues 342 stores memory write requests. Arbitration logic within the logic 346 selects a queue of the multiple queues 342 and selects a memory access request from the selected queue. For the selected memory access request, the logic 346 determines a range of addresses corresponding to a first region of system memory, such as region 372, with a copy of data stored in a second region of the in-package cache 330, such as region 338. The system memory is implemented by the combination of the off-package memory 362 and the off-package DRAM 370.
- The logic 346 sends a selected memory access request to system memory when the logic 346 determines a request address of the memory access request is not within the range of addresses for the region 372. The cache controller services the selected memory request by accessing data from the memory arrays 332 within the in-package cache 330 when the logic 346 determines the request address is within the range of addresses for the region 372. The logic 346 uses the region parameters 344 for the above determinations. In various embodiments, the region parameters 344 are equivalent to the region parameters 150 (of FIG. 1).
- The logic 346 uses one of a variety of techniques for determining when to store a copy of the region 372 in the off-package system memory as region 338 in the in-package cache 330. In some embodiments, the logic 346 monitors memory accesses from the execution engine 322 to detect a streaming or sequential memory access pattern. The logic 346 uses one of a variety of techniques to detect streaming access patterns, such as the techniques used by hardware prefetchers. When the logic 346 detects a streaming pattern, the logic 346 defines a new region. In some embodiments, when a memory request accesses an address that is within L bytes of the end of the region 338, one of the logic 336 and the logic 346 extends the size of the region 338 by P bytes, where L and P are positive, non-zero integers. In an embodiment, the values of L and P are stored in programmable registers in the control and status registers (CSRs) 347. In some embodiments, an initial region size is also stored in a programmable register in CSRs 347.
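- The growth heuristic just described can be sketched as follows. This is a hypothetical model under the assumptions above, with CSR-supplied values L and P; none of the names come from the patent.

    #include <stdint.h>

    typedef struct { uint64_t sys_start, cache_start, size_bytes; } region_t;

    /* If a request lands within L bytes of the end of the region, grow the
     * region by P bytes, up to the capacity of the in-package cache. The
     * surrounding logic would also copy the newly covered bytes into the
     * cache and update any status fields. */
    void maybe_extend(region_t *r, uint64_t req_addr,
                      uint64_t L, uint64_t P, uint64_t cache_capacity) {
        uint64_t region_end = r->sys_start + r->size_bytes;
        if (req_addr + L >= region_end && r->size_bytes + P <= cache_capacity)
            r->size_bytes += P;
    }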
- In other embodiments, the logic 346 uses software hints for determining when to define and create the region 338 in the in-package cache 330. Software uses particular instructions to update certain registers accessed by the application or the operating system. In addition, the software is capable of updating one or more control and status registers (CSRs) 347 in the interface logic 340. When processing a deep neural network, the software application is aware of when it finishes processing one layer of a multi-layer neural network and when it moves on to the next layer of the multi-layer network. As each layer of the multi-layer network is traversed (whether forward or backward), the software application utilizes software hints to inform one or more of the logic 346 and the CSRs 347 of the current region in the system memory that is being processed. In some embodiments, the software application provides hints to indicate when to increase or decrease the sizes of the regions 372 and 338.
regions CSRs 347 prevent theregion 338 from exceeding the size of the in-package cache 330. In some embodiments, if thelogic 346 already defines a region, thelogic 346 selects between supporting the existing region and a new region based on criteria. Examples of the criteria are a size of the region, a priority level of an application accessing the data of the regions, an age of the existing region, and so forth. - During the execution of one or more software applications, the applications modify the contents of the
region 338. If thelogic 346 adjusts the size or the portion of theregion 338 such that a modified portion of theregion 338 is no longer valid, then one of thelogic 336 and thelogic 346 sends the modified data to the off-package DRAM 370. In some designs, one of thelogic 336 and thelogic 346 controls the in-package cache 330 with a write-through cache policy or a write-back cache policy. The write-through cache policy spreads out the write operations to the off-package DRAM 370 over time. In contrast, the write-back cache policy delays the write operations until the size of theregion 338 reduces. At such a time, thelogic 336 or thelogic 346 sends the write operations for the modified data to the off-package DRAM 370 in a burst of write traffic. In other designs, one of thelogic 336 and thelogic 346 controls the in-package cache 330 with a combination of a write-through cache policy and a write-back cache policy to trade off the benefits and the costs of the two policies. - As described earlier, in some designs, the in-
- As described earlier, in some designs, the in-package cache 330 uses low latency, high bandwidth memory technologies such as SRAM, phase-change memories, spin-torque-transfer resistive memories, memristors, embedded DRAM (eDRAM), and so forth. In other designs, the in-package cache 330 uses low latency, high bandwidth 3D DRAM. Turning now to FIG. 4, a generalized block diagram of embodiments of systems-in-package (SiP) 400 and 440 is shown. The illustrated SiP includes one or more three-dimensional integrated circuits (3D ICs). A 3D IC includes two or more layers of active electronic components integrated both vertically and/or horizontally into a single circuit. In some designs, fabrication techniques use interposer-based integration whereby the fabrication technique places the 3D IC next to the processing unit 420. Alternatively, the fabrication technique stacks a 3D IC directly on top of another IC.
- As shown, in one embodiment, the
- As shown, in one embodiment, the SiP 400 includes a processing unit 420 and one or more three-dimensional (3D) DRAMs 430 and 432 that communicate with the processing unit 420 through the horizontal low-latency interconnect 410. Again, the processing unit 420 is one of a general-purpose CPU, a graphics processing unit (GPU), an accelerated processing unit (APU), a field programmable gate array (FPGA), or other data processing device that makes use of a row-based memory, such as a cache.
- The in-package horizontal low-latency interconnect 410 provides reduced lengths of interconnect signals versus long off-chip interconnects when a SiP is not used. The in-package horizontal low-latency interconnect 410 uses particular signals and protocols as if the chips, such as the processing unit 420 and the 3D DRAMs 430 and 432, were mounted in separate packages on a circuit board. The SiP 400 additionally includes backside vias or through-bulk silicon vias 412 that reach to package external connections 414. The package external connections 414 are used for input/output (I/O) signals and power signals.
- In another embodiment, the SiP 440 includes a 3D DRAM 450 stacked directly on top of the processing unit 420. Although not shown, for each of the SiP 400 and the SiP 440, multiple chips, or device layers, are stacked on top of one another with direct vertical interconnects 416 tunneling through them. The size and density of the vertical interconnects 416 that can tunnel between the different device layers varies based on the underlying technology used to fabricate the 3D ICs.
- Referring to FIG. 5, a generalized block diagram of one embodiment of data storage 500 is shown. Circuitry and logic previously described are numbered identically. As shown, each of system memory 110 and a last-level cache 130 store data. Again, although the description describes the cache 130 as a last-level cache, in other embodiments, the cache 130 is another level within the cache memory subsystem. Data 126 is contiguous data stored in the system memory 110. Data 526 is contiguous data added to the region in the system memory 110 to create the region 520. The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 524.
- The last-level cache 130 stores in the region 540 a copy of the contiguous data 126 and data 526 of the region 520. Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the size 544. The region parameters 150 characterize the regions 520 and 540. The start addresses 122 and 142 do not change, so the fields 154 and 156 do not change in the region parameters 150. However, logic updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+T bytes, where S and T are positive, non-zero integers.
- Turning now to FIG. 6, a generalized block diagram of one embodiment of data storage 600 is shown. Circuitry and logic previously described are numbered identically. Data 126 is contiguous data stored in the system memory 110. Data 626 is contiguous data added to the region in the system memory 110 to create the region 620. The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 624.
- The last-level cache 130 stores in the region 640 a copy of the contiguous data 126 and data 626 of the region 620. Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the sum of the size 644 and the size 646. The contiguous data wrapped around the last-level cache 130. The region parameters 150 characterize the regions 620 and 640. The start addresses 122 and 142 do not change, so the fields 154 and 156 do not change in the region parameters 150. However, logic updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+T+U bytes, where S, T and U are positive, non-zero integers.
level cache 130. In one example, theregion 620 uses the address space 2,000 to 2,700 where the addresses are expressed as digits. The entire last-level cache 130 uses the address space 5,000 to 6,000 where the addresses are also expressed as digits. Theregion 640 uses the address space 5,800 wrapped around to 5,500. When a received memory request uses the request address 2,400, logic determines that the offset is (2,400−2,000), or 400. The logic adds the offset to the region start address of 5,800 to obtain (5,800+400), or 6,200. This value exceeds the limit of theregion 640. In response, the logic determines the difference, which is (6,200−6,000), or 200. The logic adds the difference to the start address to obtain (5,000+200), or 5,200. The translated address is 5,200, and the logic uses the translated address 5,200 to access data from the last-level cache 130 in order to service the memory request. - Referring to
- Referring to FIG. 7, a generalized block diagram of one embodiment of data storage 700 is shown. Similar to the data storage examples described earlier, each of system memory 110 and a last-level cache 130 store data. Although the description describes the cache 130 as a last-level cache, in other embodiments, the cache 130 is another level within the cache memory subsystem. Data 126 is contiguous data stored in the system memory 110. Data 726 is contiguous data added to the region in the system memory 110 to create the region 720. The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 724. The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 720 is the address 722 (“x2”), rather than the address 122 (“x1”).
- The last-level cache 130 stores in the region 740 a copy of the contiguous data 126 and data 726 of the region 720. Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the size 744. The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 740 is the address 742 (“a2”), rather than the address 142 (“a1”). The region parameters 150 characterize the regions 720 and 740. The start addresses 122 and 142 change, and the fields 154 and 156 are updated in the region parameters 150. Logic also updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+V bytes, where S and V are positive, non-zero integers.
- Referring to FIG. 8, a generalized block diagram of one embodiment of data storage 800 is shown. Data 126 is contiguous data stored in the system memory 110. Data 826 is contiguous data added to the region in the system memory 110 to create the region 820. The size of contiguous data in the system memory 110 copied as a region grew from the size 124 to the size 824. The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 820 is the address 822 (“x2”), rather than the address 122 (“x1”).
- The last-level cache 130 stores in the region 840 a copy of the contiguous data 126 and data 826 of the region 820. Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region grew from the size 144 to the sum of the size 844 and the size 846. The contiguous data wrapped around the last-level cache 130. The increase is in the left direction, rather than the right direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 840 is the address 842 (“a2”), rather than the address 142 (“a1”). The region parameters 150 characterize the regions 820 and 840. The start addresses 122 and 142 change, and the fields 154 and 156 are updated in the region parameters 150. Logic also updates the size field 158 to an increased amount. In the example, the logic updates the size field 158 from S bytes to S+V+W bytes, where S, V and W are positive, non-zero integers. The access of a wraparound region alters the calculation of the translated address for the last-level cache 130. Logic uses a computation as described earlier for data storage 600.
- Referring to FIG. 9, a generalized block diagram of one embodiment of data storage 900 is shown. Data 126 is contiguous data stored in the system memory 110. Data 926 is contiguous data removed from the region in the system memory 110 to create the region 920. The size of contiguous data in the system memory 110 copied as a region reduced from the size 124 to the size 924. The decrease is in the right direction, rather than the left direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 920 is the address 922 (“x2”), rather than the address 122 (“x1”).
- The last-level cache 130 stores in the region 940 a copy of the contiguous data 126 of the region 920. Accordingly, the size of contiguous data in the last-level cache 130 maintained as a region reduced from the size 144 to the size 944. The decrease is in the right direction, rather than the left direction. Accordingly, the address pointing to the memory location storing data at the beginning of the region 940 is the address 942 (“a2”), rather than the address 142 (“a1”). The region parameters 150 characterize the regions 920 and 940. The start addresses 122 and 142 change, and the fields 154 and 156 are updated in the region parameters 150. Logic also updates the size field 158 to a decreased amount. In the example, the logic updates the size field 158 from S bytes to S−U bytes, where S and U are positive, non-zero integers. It is noted that if the decrease in sizes of the regions occurred at the end of the regions, in the left direction rather than the right direction, then the addresses 122 and 142 do not change, but the size field 158 is still updated.
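- One possible way to express this shrink-from-the-front update is sketched below. The names are hypothetical, and a real controller would first write back any modified data being dropped.

    #include <stdint.h>

    typedef struct { uint64_t sys_start, cache_start, size_bytes; } region_t;

    /* Remove U bytes from the beginning of the region, as in FIG. 9: both
     * start addresses advance and the size shrinks from S to S-U. The
     * cache-side pointer wraps if it runs past the end of the cache. */
    void shrink_front(region_t *r, uint64_t U,
                      uint64_t cache_base, uint64_t cache_limit) {
        r->sys_start += U;
        r->cache_start += U;
        if (r->cache_start >= cache_limit)
            r->cache_start = cache_base + (r->cache_start - cache_limit);
        r->size_bytes -= U;
    }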
- Referring now to FIG. 10, one embodiment of a method 1000 for performing memory accesses in a computing system is shown. Logic monitors memory access patterns and/or receives software hints of data accesses (block 1002). As described earlier, software techniques with particular instructions used as hints, hardware techniques such as the techniques used by hardware prefetchers, or a combination are used to determine when to begin defining a region of memory.
method 1000 returns to block 1002 where logic monitors memory access patterns and/or receives software hints. If logic predicts a region of upcoming data accesses, (“yes” branch of the conditional block 1004), then the logic initializes parameters characterizing the region of predicted upcoming data accesses (block 1006). For example, the logic stores a start address for the region of upcoming data accesses in the system memory, and stores a start address for this region in the last-level cache. Additionally, the logic stores a region size for this region. In some embodiments, an initial region size is provided in a programmable register of multiple control and status registers. In some designs, the initial size is between the fine granularity of a cache line size (e.g., 64 bytes) and a page size (e.g., 4 kilobytes or more). - Logic stores a copy of contiguous data from the region of upcoming data accesses in the system memory into the last-level cache. For example, the logic stores a copy of data from a first region of system memory into a second region of a last-level cache (block 1008) where each of the first region and the second region corresponds to the region of the predicted upcoming data accesses in the system memory. Logic services memory requests targeting the first region by accessing data from the second region (block 1010). If logic determines the second region changes size (“yes” branch of the conditional block 1012), then logic updates parameters characterizing the region to indicate the size change (block 1014).
- If logic determines the second region does not change in size (“no” branch of the conditional block 1012), and accesses of the second region have not completed (“no” branch of the conditional block 1016), then control flow of
method 1000 returns to block 1010. Inblock 1010, logic services memory requests targeting the first region by accessing data from the second region. If the accesses of the second region have completed (“yes” branch of the conditional block 1016), then logic updates parameters characterizing the region to indicate there is no region (block 1018). Afterward, control flow ofmethod 1000 returns to block 1002 where logic monitors memory access patterns and/or receives software hints. - Referring to
- Referring to FIG. 11, a generalized block diagram of one embodiment of data storage 1100 is shown. Each of system memory 1110 and a last-level cache 1130 store data. Processing units, communication interfaces and so forth are not shown for ease of illustration. Although the description describes the cache 1130 as a last-level cache, in other embodiments, the cache 1130 is another level within the cache memory subsystem. Data 1120-1128 are contiguous data stored in the system memory 1110. The last-level cache 1130 stores a copy of a portion of the contiguous data 1120-1128 at different points in time.
system memory 1110, such as off-package DRAM. The weights, such as data 1120-1128, are too large to fit in the last-level cache 1130, such as in-package 3D DRAM. During training of the neural network, the weights are evaluated by a processing unit executing a software application. From points in time t1 to 11t7 (or times t1 to t7), the size and content of the region stored in the last-level cache 1130 changes. Thedata 1120 corresponds to a first layer of weights of a multi-layer neural network. Thedata 1122 corresponds to a second layer of weights of the multi-layer neural network, and so on. - Initially, the
- Initially, the data 1120 is copied into the last-level cache 1130 at time t0 (not shown). At the later time t1, the data 1122 is added to the region stored in the last-level cache 1130. Similarly, at times t2 and t3, the data 1124 and the data 1126 are added to the region stored in the last-level cache 1130. As the evaluation of the neural network proceeds by inference or forward propagation, the region in the last-level cache 1130 expands in order to store the weights. The accessing of the neural network's weights proceeds in a regular, predictable manner. Therefore, the region in the last-level cache 1130 is increased sufficiently ahead of the evaluation of the weights. As described earlier, programmable registers of CSRs 347 (of FIG. 3) store the parameters L and P to indicate when and by how much to change the size of the region stored in the last-level cache. Accordingly, the processing unit accesses the weights in the in-package last-level cache, rather than in the off-package DRAM.
- At time t3, the entire last-level cache 1130 is filled. At this time, logic in the cache controller or in the processing unit adjusts the size of the region by decreasing the size from the left. At time t4, the logic removes the data 1120 from the region in the last-level cache 1130. The logic updates region parameters accordingly. Due to the nature of the software application performing the training of the weights, once the processing unit evaluates a given layer, the corresponding weights are not needed again for the current inference or forward propagation. Therefore, in some designs, the given layer of weights is removed from the region of the last-level cache 1130. At time t5, data 1126 is added to the region of the last-level cache 1130. The region wraps around the last-level cache 1130. The access of a wraparound region alters the calculation of the translated address for the last-level cache 1130. Logic uses a computation as described earlier for data storage 600.
- At time t6, the logic removes the data 1122 from the region in the last-level cache 1130. The logic updates region parameters accordingly. At time t7, the logic adds data 1128 to the region of the last-level cache 1130. After the processing unit has processed the last layer of the neural network, the processing unit generates a final output. The processing unit typically compares this final output against an expected value to compute an error or loss. The training of the neural network then continues with a backward propagation phase. During the backward propagation, the processing unit processes the layers of the multi-layered neural network in reverse order. Logic allocates and deallocates the region of the last-level cache 1130 in a manner to support the reverse order.
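- The per-layer pattern at times t1 through t7 can be modeled with two small updates, sketched below under the same hypothetical region_t type: the tail of the region grows to admit the next layer's weights, and the head advances past a layer once it has been evaluated.

    #include <stdint.h>

    typedef struct { uint64_t sys_start, cache_start, size_bytes; } region_t;

    static uint64_t wrap(uint64_t a, uint64_t base, uint64_t limit) {
        return (a >= limit) ? base + (a - limit) : a;
    }

    /* Grow the tail to cover the next layer's weights (copy them in). */
    void append_layer(region_t *r, uint64_t layer_bytes) {
        r->size_bytes += layer_bytes;
    }

    /* Drop an evaluated layer from the head; the cache-side start address
     * may wrap around the cache, as at times t4 through t7 above. */
    void drop_layer(region_t *r, uint64_t layer_bytes,
                    uint64_t cache_base, uint64_t cache_limit) {
        r->sys_start += layer_bytes;
        r->cache_start = wrap(r->cache_start + layer_bytes,
                              cache_base, cache_limit);
        r->size_bytes -= layer_bytes;
    }

- For the backward propagation phase, the same two updates would presumably run in the opposite direction, releasing and admitting layers in reverse order.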
- Referring to FIG. 12, a generalized block diagram of one embodiment of data storage 1200 is shown. The system memory 1210 stores data. Processing units, communication interfaces and so forth are not shown for ease of illustration. The system memory 1210 stores data of multiple regions. Examples of the regions include a first region 1220 pointed to by the address 1212 (“w0”), data of a second region 1222 pointed to by the address 1214 (“x0”), data of a third region 1224 pointed to by the address 1216 (“y0”), and data of a fourth region 1226 pointed to by the address 1218 (“z0”).
adder 1230 sums the values of the elements of the input vectors to generate an element in the output vector. In many cases, none of the input vectors fit within an in-package cache. However, each region within the in-package cache is capable of storing an active portion of each respective vector. As the calculation proceeds, each of the regions are updated to maintain an active portion of each vector. An example of such a scheme is shown in the following description. - Referring to
- Referring to FIG. 13, a generalized block diagram of one embodiment of data storage 1300 is shown. Circuitry and logic previously described are numbered identically. The last-level cache (LLC) 1330 stores a copy of data stored in the system memory 1210. Although the description refers to the cache 1330 as a last-level cache, in other embodiments, the cache 1330 is another level within the cache memory subsystem. Processing units, communication interfaces, and so forth are not shown for ease of illustration. The last-level cache 1330 stores data of multiple regions. Examples of the regions include a first region pointed to by the address 1332 (“a0”) with a size 1334 (“S bytes”), a second region pointed to by the address 1336 (“b0”) with a size 1338 (“T bytes”), a third region pointed to by the address 1340 (“c0”) with a size 1342 (“U bytes”), and a fourth region pointed to by the address 1344 (“d0”) with a size 1346 (“V bytes”).
- The table 1350 stores region parameters for the regions stored in the last-level cache 1330. In many designs, the fields 1352-1358 are equivalent to the fields 152-158 of the region parameters 150 (of FIG. 1). Here, the table 1350 supports multiple separate regions, rather than a single region. In the illustrated embodiment, the table 1350 includes four valid rows (entries) for supporting the four regions in the last-level cache 1330. Although four regions and four entries are shown, any number of entries and regions is used in other examples. To support the multiple regions, logic maintains the information in the table 1350 to ensure that the multiple regions grow, shrink, and wrap around the last-level cache 1330 without overrunning one another. For each memory access of the last-level cache 1330, logic compares the requested address against each valid, supported region in the last-level cache 1330. In various designs, the table 1350 stores information in a fully-associative manner: the requested address is checked against all N sets of region definition registers, analogous to a lookup in a fully-associative cache structure.
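- A software model of the fully-associative check might look like the following C sketch; the struct fields are illustrative stand-ins for the fields 1352-1358, and the sequential loop models comparisons that hardware would perform against all region definition registers in parallel.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_REGIONS 4   /* four valid entries, matching table 1350 in this example */

/* One row of the region table; the field names are hypothetical. */
typedef struct {
    bool     valid;       /* entry holds a live region             */
    uint64_t mem_start;   /* region start address in system memory */
    uint64_t size;        /* region size in bytes                  */
    uint64_t cache_start; /* region start offset within the cache  */
} region_entry_t;

/* Compare the requested address against every valid region. Hardware would
 * perform these comparisons in parallel; the loop is a sequential stand-in.
 * Returns the matching entry index, or -1 when no region covers the address. */
static int find_region(const region_entry_t tbl[NUM_REGIONS], uint64_t addr)
{
    for (int i = 0; i < NUM_REGIONS; i++) {
        if (tbl[i].valid &&
            addr >= tbl[i].mem_start &&
            addr <  tbl[i].mem_start + tbl[i].size)
            return i;
    }
    return -1;
}
```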
- Referring now to FIG. 14, one embodiment of a method 1400 for performing memory accesses in a computing system is shown. One or more processing units execute one or more computer programs, or software applications. Examples of a processing unit are a processor core within a CPU, a GPU, or other processing circuitry. In some embodiments, a System-in-Package (SiP) includes the processing unit and high bandwidth memory. One example of the high bandwidth memory is a 3D integrated memory, such as a 3D DRAM. In an embodiment, the processing unit utilizes at least a portion of the 3D DRAM as a cache. The processing unit determines that a memory request misses within the levels of the cache memory subsystem below the last-level cache (block 1402). In various embodiments, the processing unit utilizes at least a portion of the high bandwidth memory as a last-level cache. The processing unit sends an address corresponding to the memory request to the last-level cache (block 1404).
- Logic selects a range from one or more address ranges within the system memory address space whose data is stored in the last-level cache (block 1406). If logic determines the request address is not within the selected range (“no” branch of the conditional block 1408), and the last range is not reached (“no” branch of the conditional block 1410), then control flow of method 1400 returns to block 1406, where logic selects another range of the one or more address ranges. If logic determines the request address is not within the selected range (“no” branch of the conditional block 1408), and the last range is reached (“yes” branch of the conditional block 1410), then logic sends the memory request, including the request address, to system memory (block 1412).
- If logic determines the request address is within the selected range (“yes” branch of the conditional block 1408), then logic determines an offset based on a difference between the request address and a start address of the range in system memory (block 1414). Logic determines a translated address based on the offset and a start address of the range in the last-level cache (block 1416). Logic services the memory request by accessing data from the last-level cache beginning at the translated address (block 1418).
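- The control flow of blocks 1406 through 1418 can be summarized in the following C sketch; the access functions are stubs added purely for illustration, none of the names come from the claims or figures, and the wraparound handling from the earlier sketch is omitted for brevity.

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     valid;
    uint64_t mem_start;   /* range start in the system memory address space */
    uint64_t size;        /* range length in bytes                          */
    uint64_t cache_start; /* range start in the last-level cache            */
} range_t;

/* Stub back ends standing in for the real access paths. */
static uint64_t read_llc(uint64_t cache_addr)     { return cache_addr; } /* block 1418 */
static uint64_t read_system_memory(uint64_t addr) { return addr; }       /* block 1412 */

/* Blocks 1406-1418: walk the ranges; on a match, translate the address and
 * access the last-level cache, otherwise send the request to system memory. */
static uint64_t service_request(const range_t *ranges, int n, uint64_t addr)
{
    for (int i = 0; i < n; i++) {                            /* blocks 1406, 1410 */
        const range_t *r = &ranges[i];
        if (!r->valid)
            continue;
        if (addr >= r->mem_start && addr < r->mem_start + r->size) { /* block 1408 */
            uint64_t offset = addr - r->mem_start;           /* block 1414 */
            uint64_t xlat   = r->cache_start + offset;       /* block 1416 */
            return read_llc(xlat);                           /* block 1418 */
        }
    }
    return read_system_memory(addr);                         /* block 1412 */
}
```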
- In various embodiments, program instructions of a software application are used to implement the methods and/or mechanisms previously described. The program instructions describe the behavior of hardware in a high-level programming language, such as C. Alternatively, a hardware design language (HDL) is used, such as Verilog. The program instructions are stored on a non-transitory computer readable storage medium. Numerous types of storage media are available. The storage medium is accessible by a computing system during use to provide the program instructions and accompanying data to the computing system for program execution. The computing system includes one or more memories and one or more processors that execute the program instructions.
- It should be emphasized that the above-described embodiments are only non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/575,991 US20220138107A1 (en) | 2018-12-10 | 2022-01-14 | Cache for storing regions of data |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/214,363 US11232039B2 (en) | 2018-12-10 | 2018-12-10 | Cache for storing regions of data |
US17/575,991 US20220138107A1 (en) | 2018-12-10 | 2022-01-14 | Cache for storing regions of data |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/214,363 Continuation US11232039B2 (en) | 2018-12-10 | 2018-12-10 | Cache for storing regions of data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220138107A1 true US20220138107A1 (en) | 2022-05-05 |
Family
ID=69137998
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/214,363 Active US11232039B2 (en) | 2018-12-10 | 2018-12-10 | Cache for storing regions of data |
US17/575,991 Pending US20220138107A1 (en) | 2018-12-10 | 2022-01-14 | Cache for storing regions of data |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/214,363 Active US11232039B2 (en) | 2018-12-10 | 2018-12-10 | Cache for storing regions of data |
Country Status (6)
Country | Link |
---|---|
US (2) | US11232039B2 (en) |
EP (1) | EP3895025B1 (en) |
JP (1) | JP7108141B2 (en) |
KR (2) | KR20210088683A (en) |
CN (1) | CN113168378A (en) |
WO (1) | WO2020123343A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210103820A1 (en) * | 2019-10-03 | 2021-04-08 | Vathys, Inc. | Pipelined backpropagation with minibatch emulation |
KR20210088304A (en) * | 2020-01-06 | 2021-07-14 | 삼성전자주식회사 | Operating method of image processor, image processing apparatus and operating method of image processing apparatus |
KR20210106221A (en) * | 2020-02-20 | 2021-08-30 | 삼성전자주식회사 | System on chip, data processing method thereof and neural network device |
US11507517B2 (en) * | 2020-09-25 | 2022-11-22 | Advanced Micro Devices, Inc. | Scalable region-based directory |
CN113010455B (en) * | 2021-03-18 | 2024-09-03 | 北京金山云网络技术有限公司 | Data processing method and device and electronic equipment |
KR20230154691A (en) * | 2022-05-02 | 2023-11-09 | 이화여자대학교 산학협력단 | A method of executing instruction operation of processor using compiler data dependency information |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5537574A (en) * | 1990-12-14 | 1996-07-16 | International Business Machines Corporation | Sysplex shared data coherency method |
US6021476A (en) * | 1997-04-30 | 2000-02-01 | Arm Limited | Data processing apparatus and method for controlling access to a memory having a plurality of memory locations for storing data values |
US20080082743A1 (en) * | 2006-09-29 | 2008-04-03 | Hanebutte Ulf R | Method and apparatus for caching memory content on a computing system to facilitate instant-on resuming from a hibernation state |
US20110010504A1 (en) * | 2009-07-10 | 2011-01-13 | James Wang | Combined Transparent/Non-Transparent Cache |
US20110066808A1 (en) * | 2009-09-08 | 2011-03-17 | Fusion-Io, Inc. | Apparatus, System, and Method for Caching Data on a Solid-State Storage Device |
US20120221774A1 (en) * | 2011-02-25 | 2012-08-30 | Fusion-Io, Inc. | Apparatus, system, and method for managing contents of a cache |
US20130138892A1 (en) * | 2011-11-30 | 2013-05-30 | Gabriel H. Loh | Dram cache with tags and data jointly stored in physical rows |
US20160224463A1 (en) * | 2015-02-04 | 2016-08-04 | International Business Machines Corporation | Operations interlock under dynamic relocation of storage |
US9710514B1 (en) * | 2013-06-25 | 2017-07-18 | Marvell International Ltd. | Systems and methods for efficient storage access using metadata |
US20170255557A1 (en) * | 2016-03-07 | 2017-09-07 | Qualcomm Incorporated | Self-healing coarse-grained snoop filter |
US20170272209A1 (en) * | 2016-03-15 | 2017-09-21 | Cloud Crowding Corp. | Distributed Storage System Data Management And Security |
US20190087337A1 (en) * | 2017-09-19 | 2019-03-21 | International Business Machines Corporation | Table of contents cache entry having a pointer for a range of addresses |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07146819A (en) * | 1993-11-22 | 1995-06-06 | Nec Corp | Cache system |
US5845308A (en) * | 1995-12-27 | 1998-12-01 | Vlsi Technology, Inc. | Wrapped-line cache for microprocessor system |
US6658532B1 (en) * | 1999-12-15 | 2003-12-02 | Intel Corporation | Cache flushing |
US7334108B1 (en) * | 2004-01-30 | 2008-02-19 | Nvidia Corporation | Multi-client virtual address translation system with translation units of variable-range size |
GB2473850A (en) * | 2009-09-25 | 2011-03-30 | St Microelectronics | Cache configured to operate in cache or trace modes |
US9063663B2 (en) * | 2010-09-21 | 2015-06-23 | Hitachi, Ltd. | Semiconductor storage device and data control method thereof |
US8949544B2 (en) | 2012-11-19 | 2015-02-03 | Advanced Micro Devices, Inc. | Bypassing a cache when handling memory requests |
US9792049B2 (en) * | 2014-02-24 | 2017-10-17 | Cypress Semiconductor Corporation | Memory subsystem with wrapped-to-continuous read |
US9501420B2 (en) * | 2014-10-22 | 2016-11-22 | Netapp, Inc. | Cache optimization technique for large working data sets |
US10387315B2 (en) * | 2016-01-25 | 2019-08-20 | Advanced Micro Devices, Inc. | Region migration cache |
- 2018
  - 2018-12-10 US US16/214,363 patent/US11232039B2/en active Active
- 2019
  - 2019-12-09 EP EP19832499.8A patent/EP3895025B1/en active Active
  - 2019-12-09 WO PCT/US2019/065155 patent/WO2020123343A1/en unknown
  - 2019-12-09 JP JP2021532404A patent/JP7108141B2/en active Active
  - 2019-12-09 KR KR1020217017697A patent/KR20210088683A/en not_active Ceased
  - 2019-12-09 KR KR1020257004460A patent/KR20250025041A/en active Pending
  - 2019-12-09 CN CN201980081940.9A patent/CN113168378A/en active Pending
- 2022
  - 2022-01-14 US US17/575,991 patent/US20220138107A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP3895025A1 (en) | 2021-10-20 |
WO2020123343A1 (en) | 2020-06-18 |
KR20250025041A (en) | 2025-02-20 |
KR20210088683A (en) | 2021-07-14 |
JP2022510715A (en) | 2022-01-27 |
US11232039B2 (en) | 2022-01-25 |
JP7108141B2 (en) | 2022-07-27 |
EP3895025B1 (en) | 2024-06-19 |
CN113168378A (en) | 2021-07-23 |
US20200183848A1 (en) | 2020-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220138107A1 (en) | | Cache for storing regions of data |
EP2786255B1 (en) | | A dram cache with tags and data jointly stored in physical rows |
KR101025398B1 (en) | | Prefetching from DDR to SRM |
TWI545435B (en) | | Coordinated prefetching in hierarchically cached processors |
Loh | | 3D-stacked memory architectures for multi-core processors |
US20130138894A1 (en) | | Hardware filter for tracking block presence in large caches |
JP7036925B2 (en) | | Memory controller considering cache control |
US10310976B2 (en) | | System and method for concurrently checking availability of data in extending memories |
US20120151150A1 (en) | | Cache Line Fetching and Fetch Ahead Control Using Post Modification Information |
WO2023033955A1 (en) | | Dynamic allocation of cache memory as ram |
CN116895305A (en) | | Providing fine-grained access to packaged memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
 | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |