CN119782207B - Data prefetching method, electronic device, apparatus, medium and product - Google Patents
Data prefetching method, electronic device, apparatus, medium and product

Info
- Publication number: CN119782207B
 - Application number: CN202411918083.2A
 - Authority: CN (China)
 - Prior art keywords: entries, prefetch, streaming data, entry, information
 - Legal status: Active
 
Abstract
The disclosure provides a data prefetching method, an electronic device, an apparatus, a medium and a product. The method comprises: receiving a demand address for a cache by using a plurality of entries of a prefetcher, and training based on the demand address to identify streaming data rules and generate streaming data prefetch requests; determining, by a multi-way hit detector of the prefetcher and based on hit information of the plurality of entries, whether there exist N entries among the plurality of entries that will generate duplicate prefetch addresses; determining, by the multi-way hit detector, M selected entries from the N entries based on information of the N entries; and generating, by the multi-way hit detector, a revocation instruction for the unselected entries among the N entries other than the M selected entries, so that the unselected entries are prohibited from generating streaming data prefetch requests. By detecting and revoking part of the entries that would generate duplicate prefetch addresses, duplicate streaming data prefetch requests are avoided, as are the power consumption and entry bandwidth wasted by duplicate prefetching.
    Description
Technical Field
      Embodiments of the present disclosure relate to a data prefetching method, an electronic device, an apparatus, a medium, and a product.
    Background
Modern multi-issue high-performance CPUs (Central Processing Units) include at least one processor core (CPU core), and each core may include multiple execution units to execute instructions. In a conventional CPU architecture, program instructions and data are typically stored in DRAM (Dynamic Random Access Memory). The operating frequency of the CPU core is far higher than that of the DRAM, so fetching data and instructions from memory takes hundreds of CPU-core clock cycles, which inevitably causes the CPU core to idle while waiting for the relevant instructions and data, resulting in performance degradation. In view of this, modern high-performance CPUs store recently accessed data in a multi-level cache architecture and, at the same time, use a prefetcher to discover the rules of the CPU's data accesses so as to prefetch the data and instructions to be accessed into the cache in advance. How to use prefetchers for efficient and accurate data prefetching is particularly important for processors that aim to provide high performance.
    Disclosure of Invention
Embodiments of the present disclosure provide a data prefetching method, an electronic device, an apparatus, a medium, and a product capable of detecting and revoking a portion of the entries that will generate duplicate prefetch addresses, thereby avoiding the power consumption and bandwidth wasted by duplicate prefetch requests and improving the prefetch performance and the overall operation performance of the device.
According to an aspect of the present disclosure, a data prefetching method is provided. The method is applicable to an electronic device that comprises a prefetcher and a cache, the prefetcher comprising a plurality of entries and a multi-way hit detector. The method comprises: receiving a demand address for the cache by the plurality of entries, and training based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each of the plurality of entries is independently trained and generates streaming data prefetch requests; determining, by the multi-way hit detector and based on hit information of the plurality of entries, whether there exist N entries among the plurality of entries that will generate duplicate prefetch addresses, where N is an integer greater than 1; in the case where the N entries are determined to exist, determining, by the multi-way hit detector, M selected entries from the N entries based on information of the N entries, where M is an integer greater than or equal to 1 and less than N; and generating, by the multi-way hit detector, a revocation instruction for the unselected entries among the N entries other than the M selected entries, so that the unselected entries are prohibited from generating streaming data prefetch requests.
According to some embodiments of the present disclosure, the multi-way hit detector is implemented as a pipeline that is independent of the plurality of entries and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries.
      According to some embodiments of the present disclosure, the demand address is a virtual address corresponding to a load instruction or a store instruction.
According to some embodiments of the present disclosure, determining, by the multi-way hit detector and based on hit information of the plurality of entries, whether there exist N entries that will generate duplicate prefetch addresses includes determining that N entries that will generate duplicate prefetch addresses exist in the case where the demand address simultaneously hits N streaming-data entries that have the same direction and are each in the prefetch mode or in the training mode.
According to some embodiments of the present disclosure, M is equal to 1, and determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry with the highest confidence parameter among the N entries as the selected entry based on the information of the N entries.
      According to some embodiments of the present disclosure, M is equal to 1, and determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry allocated earliest in time among the N entries as the selected entry based on the information of the N entries.
      According to some embodiments of the present disclosure, M is equal to 1, and determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry with the smallest or largest entry sequence number among the N entries as the selected entry based on the information of the N entries.
According to some embodiments of the disclosure, the prefetcher further comprises a prefetch request generator. The data prefetching method according to the embodiments of the disclosure further comprises generating, by the prefetch request generator, a streaming data prefetch request based on the M selected entries, the streaming data prefetch request being used for a streaming data prefetch process based on the cache.
According to some embodiments of the present disclosure, the streaming data prefetch request is used to perform a streaming data prefetch process based on one of a first level cache, a second level cache, and a last level cache.
According to another aspect of the present disclosure, an electronic device is provided. The electronic device comprises a prefetcher and a cache, the prefetcher being configured to comprise a plurality of entries and a multi-way hit detector. The plurality of entries are configured to receive a demand address for the cache and to train based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each of the plurality of entries is independently trained and generates streaming data prefetch requests. The multi-way hit detector is configured to: determine, based on hit information of the plurality of entries, whether there exist N entries among the plurality of entries that will generate duplicate prefetch addresses, where N is an integer greater than 1; in the case where the N entries are determined to exist, determine M selected entries from the N entries based on information of the N entries, where M is an integer greater than or equal to 1 and less than N; and generate a revocation instruction for the unselected entries among the N entries other than the M selected entries, so that the unselected entries are prohibited from generating streaming data prefetch requests.
According to some embodiments of the present disclosure, the multi-way hit detector is implemented as a pipeline that is independent of the plurality of entries and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries.
      According to some embodiments of the present disclosure, the demand address is a virtual address corresponding to a load instruction or a store instruction.
According to some embodiments of the present disclosure, the multi-way hit detector determining, based on hit information of the plurality of entries, whether there exist N entries that will generate duplicate prefetch addresses includes determining that N entries that will generate duplicate prefetch addresses exist in the case where the demand address simultaneously hits N streaming-data entries that have the same direction and are each in the prefetch mode or in the training mode.
According to some embodiments of the present disclosure, M is equal to 1, and the multi-way hit detector determining M selected entries from the N entries based on the information of the N entries includes one of: determining the entry with the highest confidence parameter among the N entries as the selected entry; determining the entry allocated earliest in time among the N entries as the selected entry; or determining the entry with the smallest or largest entry sequence number among the N entries as the selected entry.
According to some embodiments of the present disclosure, the prefetcher further comprises a prefetch request generator configured to generate a streaming data prefetch request based on the M selected entries, the streaming data prefetch request being used to perform a streaming data prefetch process based on the cache.
According to some embodiments of the present disclosure, the streaming data prefetch request is used to perform a streaming data prefetch process based on one of a first level cache, a second level cache, and a last level cache.
According to some embodiments of the present disclosure, the electronic device further comprises a read-write unit, and the prefetcher is located within the read-write unit. The read-write unit further comprises a read-write queue configured to generate demand accesses based on information of read-write operations, and an arbiter configured to generate a data access request based on the demand accesses of the read-write queue and the streaming data prefetch requests generated by the prefetcher, the data access request corresponding to a demand address, wherein the prefetcher is trained and prefetches based on the demand address.
      According to yet another aspect of the present disclosure, an electronic device is provided. The electronic device includes a processor and a memory having computer readable code stored therein, which when executed by the processor causes the processor to perform the steps of a data prefetching method according to an embodiment of the present disclosure.
      According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon computer-readable instructions, which when executed by a processor, cause the processor to perform the steps of a data prefetching method according to an embodiment of the present disclosure.
      According to yet another aspect of the present disclosure, there is also provided a computer program product, comprising a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of a data pre-fetching method according to an embodiment of the present disclosure.
According to the data prefetching method, the electronic device, the apparatus, the medium, and the product provided by the embodiments of the disclosure, the plurality of entries of the prefetcher are independently trained and generate streaming data prefetch requests, which reduces the requirement on the pipeline depth of the entries, so that the hardware implementation area of the plurality of entries can be controlled while still ensuring their rapid training. Further, the multi-way hit detector of the prefetcher performs duplicate detection on the plurality of entries in parallel: based on hit information of the plurality of entries, it determines whether there exist entries that will generate duplicate prefetch addresses, and revokes a portion of those entries. This ensures that the subsequent prefetch process does not repeatedly hit multiple entries and that the entries do not issue prefetch requests with duplicate addresses, thereby avoiding the power consumption and entry bandwidth wasted by duplicate prefetching and improving the prefetch performance of the prefetcher and the overall operation performance of the electronic device.
    Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments are briefly described below. It is apparent that the drawings described below relate only to some embodiments of the present disclosure and do not limit the present disclosure.
      FIG. 1 shows a schematic block diagram of a read-write unit and its hardware prefetch module in a processor;
       FIG. 2 shows a schematic block diagram of a streaming data prefetcher; 
       FIG. 3 shows a schematic diagram of a stream prefetch entry payload; 
       FIG. 4A shows a schematic flow diagram of a training process by a streaming data prefetcher; 
       FIG. 4B shows a schematic flow diagram of a prefetching process by a streaming data prefetcher; 
       FIG. 5 shows a schematic flow chart diagram of a data prefetching method according to an embodiment of the present disclosure; 
       FIG. 6 illustrates a flowchart of the execution of a multi-way hit detector according to an embodiment of the present disclosure; 
       FIG. 7A illustrates an example prefetch process diagram in the related art; 
       FIG. 7B illustrates an example prefetch process diagram, implemented in accordance with the present disclosure; 
       FIG. 8A shows a schematic block diagram of an electronic device according to an embodiment of the disclosure; 
       FIG. 8B illustrates an example block diagram of an electronic device according to an embodiment of this disclosure; 
       FIG. 9 shows a schematic block diagram illustrating an electronic device according to an embodiment of the present disclosure; 
       FIG. 10 illustrates an architectural diagram of an exemplary electronic device, according to an embodiment of the present disclosure; 
FIG. 11 shows a schematic block diagram of a storage medium according to an embodiment of the present disclosure.
    Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings. It will be apparent that the described embodiments are merely some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art based on the embodiments of this disclosure without inventive effort are intended to be within the scope of the present disclosure.
Furthermore, as used in the present disclosure and claims, unless the context clearly indicates otherwise, the words "a," "an," and "the" do not denote the singular and may include the plural. The terms "first," "second," and the like used in this disclosure do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the words "comprising" or "comprises" and the like mean that the elements or items preceding the word encompass the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected" or "coupled" and the like are not limited to physical or mechanical connections but may include electrical connections, whether direct or indirect.
A flowchart is used in this disclosure to describe the steps of a method according to an embodiment of the present disclosure. It should be understood that the steps need not be performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously, and other operations may be added to these processes.
      First, terms that may be involved in at least some embodiments of the present disclosure are explained as follows.
An electronic device typically includes one or more processor cores (CPU cores) and a cache, which may be implemented as a multi-level cache architecture. In such an architecture, the first level (L1) cache has the fastest access speed but the smallest capacity and is typically located within the processor core; the last level cache (LLC, typically the third level) has the largest capacity and typically the slowest access speed and is usually shared by multiple processor cores; and the second level (L2) cache has an access speed and capacity between those of the L1 cache and the LLC, and is typically also located within the processor core.
Data prefetching (Data Prefetch): in a CPU architecture, program instructions and data may be stored in DRAM (Dynamic Random Access Memory). The operating frequency of the processor core is much higher than that of the DRAM, so obtaining data and instructions from memory requires hundreds of processor-core clock cycles, causing the processor core to idle while waiting for the relevant instructions and data and resulting in performance loss. Therefore, modern high-performance electronic devices contain a multi-level cache architecture to store recently accessed data and, at the same time, use a prefetcher to discover the data access rules of the CPU so as to prefetch the data and instructions to be accessed into the cache in advance.
Prefetcher: prefetchers may be divided into a primary data prefetcher (i.e., a data prefetcher that prefetches target data into the first level cache (L1 Cache)), a secondary data prefetcher (i.e., a data prefetcher that prefetches target data into the second level cache (L2 Cache)), a last level data prefetcher (i.e., a data prefetcher that prefetches target data into the last level cache (LLC)), and so on.
It will be appreciated that if the prefetcher generates too many inaccurate prefetch requests, this may in turn lead to problems such as longer access latency and increased cache pollution, which can have a significant negative impact on CPU performance and eventually degrade it. This problem is more pronounced for CPUs that support SMT (Simultaneous Multithreading, also known as concurrent multithreading).
Prefetch performance metrics (Prefetch Metrics), the criteria for evaluating the performance of a prefetcher, may include accuracy (Accuracy), coverage (Coverage), and lateness (Lateness). Accuracy is the ratio of the number of correct prefetches to the total number of prefetches; Coverage is the ratio of the number of demand requests (requests generated by access instructions in the program) that hit prefetched data to the total number of demand requests; Lateness is the ratio of the number of prefetches that are correct but arrive too late to the total number of prefetches. It is often difficult to achieve high Accuracy and high Coverage simultaneously.
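Restated symbolically from the definitions above:

```latex
\mathrm{Accuracy} = \frac{N_{\mathrm{correct\ prefetches}}}{N_{\mathrm{total\ prefetches}}}, \qquad
\mathrm{Coverage} = \frac{N_{\mathrm{demand\ hits\ on\ prefetched\ data}}}{N_{\mathrm{total\ demand\ requests}}}, \qquad
\mathrm{Lateness} = \frac{N_{\mathrm{correct\ but\ late\ prefetches}}}{N_{\mathrm{total\ prefetches}}}
```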
A read-write unit (Load-Store Unit, LSU) may be configured inside the CPU core. The LSU is the execution unit within the CPU responsible for processing store and load instructions, and it typically includes the L1 cache.
Read-write queue (Load-Store Queue, LSQ): a queue in the LSU that holds the information needed for read-write (load-store) operations.
Cache line (Cacheline, CL): the minimum granularity of data storage in the data cache.
Miss Status Handling Registers (MSHR) are used to record each outstanding cache miss event; the recorded information typically includes the address, thread identifier, and so on. The occupancy of the MSHRs can represent the usage of the memory system: a higher MSHR occupancy reflects more cache miss events from the prefetcher, that is, reduced prefetch accuracy, and the memory usage at that time is at a relatively high level.
The Translation Lookaside Buffer (TLB), which may also be referred to as a page table cache, bypass cache, etc., is used by the memory management unit as a cache to improve the speed of the CPU's virtual-to-physical address translation. The TLB has a fixed number of slots holding page table entries that map virtual addresses to physical addresses. The search key is a virtual memory address, and the search result is a physical address. If the requested virtual address is present in the TLB, the content-addressable memory (CAM) gives a very fast match, and the memory can then be accessed using the resulting physical address. If the requested virtual address is not in the TLB, the page table is used for virtual-to-physical address translation, and accessing the page table is much slower than accessing the TLB.
In the related art, to improve the operation efficiency of the CPU, a hardware prefetch module (Hardware Prefetcher, HWPF) is usually implemented in the LSU. This module moves the data required by the program stream being executed by a thread into the data cache in advance, so as to improve the speed of data access. To achieve this, the HWPF needs to predict the addresses likely to be accessed in the future based on the address sequence of the current data accesses. For the HWPF, a common address sequence pattern is called "streaming data": data is sequentially read from and written to memory, at CL granularity, in order of increasing or decreasing virtual addresses. After identifying such an access sequence in a training stage, the streaming data prefetch module enters a prefetching state in which it issues subsequent prefetch requests downstream in advance, following the increasing or decreasing direction of the sequence, so as to prefetch the streaming data likely to be accessed into the cache ahead of time.
To identify streaming data with different starting addresses, a plurality of entries are typically provided in the HWPF, each of which may be trained individually; the pipeline depth of the training directly affects the hardware area of the prefetch module. During actual program execution, the same piece of streaming data may be identified as multiple pieces of streaming data and distributed into multiple entries for various reasons, for example because the program flow itself has a memory access sequence that jumps back and forth, or because modern high-performance processors execute instructions out of order. Thus, multiple entries may repeat the "training" and "prefetching", which results in wasted power consumption and wasted downstream storage resources and pipeline bandwidth.
Aiming at the technical problems caused by the repeated training and prefetching of streaming data by the hardware prefetch module HWPF, the present disclosure provides a data prefetching method that detects and revokes a portion of the entries that would generate duplicate prefetch addresses, thereby avoiding the power consumption and bandwidth wasted by duplicate prefetch requests and improving the prefetch performance and overall running performance of the device.
Specifically, with the data prefetching method provided by the embodiments of the disclosure, the multiple entries of the prefetcher (e.g., the HWPF) are independently trained and generate streaming data prefetch requests, which reduces the requirement on the pipeline depth of the entries, so that the hardware implementation area of the multiple entries can be controlled while still ensuring their rapid training. Further, the multi-way hit detector of the prefetcher performs duplicate detection on the multiple entries in parallel: based on hit information of the multiple entries, it determines whether there exist entries that will generate duplicate prefetch addresses, and revokes a portion of those entries. This ensures that the subsequent prefetch process does not repeatedly hit multiple entries and that the entries do not issue prefetch requests with duplicate addresses, thereby avoiding the power consumption and entry bandwidth wasted by duplicate prefetching and improving the prefetch metrics of the prefetcher and the overall operation performance of the electronic device.
To facilitate understanding of the aspects of the present disclosure, before the implementation of the data prefetching method provided by the present disclosure is described in detail, the implementation of the hardware prefetch module HWPF in the LSU and the functions involved therein are briefly described below with reference to the accompanying drawings. It will be understood that when any element appears in more than one drawing, it is identified by the same or a similar reference numeral in each drawing.
By way of example, FIG. 1 shows a schematic block diagram of a read-write unit (LSU) and its hardware prefetch module HWPF in a processor. Referring to FIG. 1, the front end first sends load and store operations and their address information (shown as ld/st op) to the LSU, and demand accesses (Demand Access) are generated via the read-write queue LSQ; these operations then pass through the arbiter (Arbiter) into the data access pipeline (Data Access Pipe) to begin execution.
Load/store operations (ld/st ops) perform address translation on the pipeline by accessing the TLB, and access the tag array (Tag) for comparison to confirm whether the L1 cache is hit. For a load operation, if the L1 cache is hit, the corresponding data is read from the data cache and returned; if the L1 cache is missed, a read request must be sent on to a downstream cache such as the L2 cache and recorded in the MSHR, and when the data returned from downstream is backfilled into the L1 cache, the original load operation is woken up, re-enters the pipeline, and returns the corresponding data. For a store operation, if the hit L1 cache line is in an exclusive state, the corresponding data is written to the data cache; otherwise, a read request to obtain a writable state is likewise sent to the downstream L2 cache and recorded in the MSHR, and when the data returned from downstream is backfilled into the L1 cache, the original store operation is woken up, re-enters the pipeline, and completes the corresponding data write, shown for example as Store Data (STD) in FIG. 1.
It will be appreciated that hitting the cache (Hit Cache) can greatly speed up the completion of load/store operations. Accordingly, a hardware prefetch module HWPF may also be configured in the LSU module. The HWPF attempts to identify regular address sequences by capturing the ld/st op information on the pipeline as input. Specifically, as a prefetcher, the streaming data prefetcher is responsible for identifying streaming data sequences in memory accesses. When the streaming data prefetcher recognizes a streaming data sequence through training, it enters a prefetching state and starts to issue prefetch requests. Assuming the address in a prefetch request is a virtual address, the prefetch request, like a load/store request, also goes through the arbiter and then into the pipeline, accesses the TLB for address translation, and accesses the tag array (Tag) to confirm whether the L1 cache is hit; if the L1 cache is hit, no prefetch is required. If the L1 cache is missed, a corresponding request is sent to the downstream L2 cache and recorded in the MSHR, and the prefetch request completes when the data returned from downstream is backfilled into the L1 cache.
As described above, during actual program execution, for various reasons (for example, the program flow itself has a memory access sequence that jumps back and forth, or the modern high-performance processor executes instructions out of order), the same piece of streaming data may be identified as multiple pieces of streaming data and stored into multiple entries. Multiple entries may therefore repeat the "training" and "prefetching", and once prefetch requests with the same address enter the pipeline, power consumption and downstream storage resources and pipeline bandwidth are wasted.
Next, FIG. 2 shows a schematic block diagram of a streaming data prefetcher. As shown in FIG. 2, the prefetcher may be configured with a plurality of entries for training on multiple streams of data simultaneously, exemplified here by 32 entries (stream entry 0 to stream entry 31). When a memory access address (ld/st op) arrives at the input, it is sent to all 32 entries simultaneously for matching; the entries are mutually independent, so they can train and issue prefetch requests at the same time. Specifically, the streaming data prefetcher first operates in a "training" phase, and after a streaming access sequence is identified it enters a "prefetching" state, issuing subsequent prefetch requests downstream in advance according to the increasing or decreasing direction of the sequence so as to prefetch streaming data likely to be accessed into the cache ahead of time. The prefetch request of each stream entry also passes through a selector (Stream Prefetch Picker) to a stream prefetch generator (Stream Prefetch Generation), which finally issues the streaming data prefetch request (Prefetch Request).
In addition, to implement the training and prefetching processes described above, each entry of the streaming data prefetcher needs to store some information recording the training and prefetching state for streaming data prefetching. Schematically, FIG. 3 shows a diagram of the stream prefetch entry payload.
The meanings of the individual parameters shown in FIG. 3 are as follows (a data-structure sketch follows the list):
 - Valid: the entry valid bit, set to 1 when the entry is allocated;
 - Dir: the direction bit, marking the increasing/decreasing direction of the streaming data; it is established when training of the stream-mode sequence completes;
 - Conf: the confidence (Confidence), which can range from 0 to a set maximum (e.g., 3); when training completes, Conf is set from 0 to 1. A greater confidence indicates that the sequence has been hit more times and its prefetch accuracy is expected to be higher, so the offset of the prefetch address can be set relatively larger;
 - BaseAddr: the base address (Base Address), identifying the start address of the stream-mode sequence; it is updated when training completes and when the entry is refreshed;
 - AccVec: the access vector (Access Vector), a bit vector used to track whether a block has been demand-accessed, at CL granularity (64 B), e.g., 64 bits;
 - DmdPtr: the demand pointer (Demand Pointer), tracking the position of the demand address relative to BaseAddr; it does not change if a new demand address lags behind the current DmdPtr;
 - PfPtr: the current prefetch pointer (Prefetch Pointer). Based on the above parameters, the next L1 prefetch request address may be expressed as BaseAddr + (PfPtr + 1) × 64 B.
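For illustration only, the entry payload of FIG. 3 can be modeled as a structure like the following C sketch; the field widths, the 64 B cache-line size, and the direction encoding are assumptions rather than a definitive hardware layout:

```c
#include <stdint.h>

#define CL_SIZE 64u  /* assumed cache-line (CL) size in bytes */

/* Sketch of one stream prefetch entry payload (FIG. 3); widths are illustrative. */
typedef struct {
    uint8_t  valid;     /* Valid: set to 1 when the entry is allocated    */
    uint8_t  dir;       /* Dir: 0 = increasing, 1 = decreasing (assumed)  */
    uint8_t  conf;      /* Conf: confidence, 0 .. maximum (e.g., 3)       */
    uint64_t base_addr; /* BaseAddr: start address of the stream sequence */
    uint64_t acc_vec;   /* AccVec: 64-bit access vector, one bit per CL   */
    int32_t  dmd_ptr;   /* DmdPtr: demand pointer relative to BaseAddr    */
    int32_t  pf_ptr;    /* PfPtr: current prefetch pointer                */
} stream_entry_t;

/* Next L1 prefetch request address, BaseAddr + (PfPtr + 1) * 64 B,
 * shown here for the increasing direction. */
static uint64_t next_prefetch_addr(const stream_entry_t *e) {
    return e->base_addr + ((uint64_t)(e->pf_ptr + 1)) * CL_SIZE;
}
```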
As a specific implementation example, the present disclosure further provides FIGS. 4A and 4B: FIG. 4A shows a schematic flow chart of the training process of the streaming data prefetcher, and FIG. 4B shows a schematic flow chart of its prefetching process. The training and prefetching processes of the streaming data prefetcher are described next.
As shown in FIG. 4A, when a demand address enters the streaming data prefetcher (S401), it is compared with BaseAddr in each entry to determine whether the entry is hit (S402). An entry is considered hit if the demand address is within a certain range of BaseAddr (the range may differ between the training mode and the prefetch mode). Otherwise, if the demand address misses the entries and a certain condition is met (e.g., an MSHR is allocated), a new entry is allocated for the address (S403) and its state is initialized: for example, the Valid bit of the newly allocated entry is set to 1, BaseAddr is set to the demand address, and Conf is set to 0. At this point the newly allocated entry enters the training mode.
Next, at S404, it is determined whether the current entry is in the training state. If it is, the relevant state information may be updated (S405): for example, if the demand address is within a certain range such as [BaseAddr-n, BaseAddr+n] (e.g., n = 2), the corresponding access information is recorded in the access vector AccVec, where AccVec may be reused for this purpose or a separate hardware bit vector may be used to record it.
Next, at S406, it is determined whether training of the current entry is complete: when BaseAddr-1, BaseAddr-2 or BaseAddr+1, BaseAddr+2 have all been accessed, that is, 3 consecutive accesses with increasing or decreasing addresses are found, the training of the current entry is considered complete, the entry state is updated, and the entry enters the prefetch mode (S407). Updating the entry state here may include setting the confidence Conf to 1, setting Dir according to whether the addresses are increasing or decreasing (e.g., 0 for increasing, 1 for decreasing), setting BaseAddr to BaseAddr ± 2, and setting DmdPtr and PfPtr to 0. The entry thereby enters the prefetch mode, at which point prefetch requests may be sent starting from PfPtr = 0 with a request address of BaseAddr ± 1. If the training of the current entry is not complete, S408 is performed, i.e., the training state is updated based on the address.
For step S404, if the current entry is not in the training state, S409 is performed to determine whether the demand address is out of range (e.g., outside [BaseAddr-n, BaseAddr+n]). If so, the entry is refreshed and the prefetch state is updated according to the address (S410); if not, the relevant state is updated according to the address (S411), where the update may, for example, refer to S407.
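A minimal software sketch of the training-completion transition described in S406/S407, continuing the stream_entry_t sketch above; the window n = 2, the AccVec bit layout during training, and the helper name are illustrative assumptions:

```c
/* Assumed AccVec layout during training: bit (off + 2) of acc_vec tracks
 * address BaseAddr + off * CL_SIZE for off in [-2, +2] (n = 2). */
static int acc_bit(const stream_entry_t *e, int off) {
    return (int)((e->acc_vec >> (off + 2)) & 1u);
}

/* Training completes once 3 consecutive increasing (BaseAddr, +1, +2) or
 * decreasing (BaseAddr, -1, -2) accesses are seen (S406); the entry then
 * enters the prefetch mode with updated state (S407). */
static void try_finish_training(stream_entry_t *e) {
    int up   = acc_bit(e, +1) && acc_bit(e, +2);
    int down = acc_bit(e, -1) && acc_bit(e, -2);
    if (!up && !down)
        return;                 /* S408: keep updating the training state */
    e->conf      = 1;           /* S407: enter the prefetch mode          */
    e->dir       = up ? 0 : 1;  /* 0 = increasing, 1 = decreasing         */
    e->base_addr = up ? e->base_addr + 2 * CL_SIZE
                      : e->base_addr - 2 * CL_SIZE;
    e->dmd_ptr   = 0;
    e->pf_ptr    = 0;           /* first request targets BaseAddr +/- 1 CL */
}
```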
      Referring to FIG. 4B, a schematic flow chart of a prefetching process by a streaming data prefetcher is shown.
First, at S412, the entry enters the prefetch mode. Then, at S413, it may be determined whether the parameter PfPtr has reached the current maximum; if so, the entry waits to be refreshed or for DmdPtr to be updated (S414), and if not, the flow proceeds to the subsequent S415-S417, i.e., a prefetch request is generated, arbitration is awaited, and PfPtr is updated.
When a prefetch request is sent, PfPtr is incremented, and if the new PfPtr is still in range, the next prefetch request is sent. The longest distance of a prefetch request is determined by DmdPtr and Conf: PfPtr < DmdPtr + Dist[Conf], where Conf determines the maximum interval Dist between PfPtr and DmdPtr, and the larger Conf is, the larger Dist is. The Dist values for different Conf may be fixed in the hardware implementation, configurable in software, or dynamically adjusted at run time according to the processing of the current access requests.
After an entry enters the prefetch mode, the bit vector AccVec may continue to record which addresses within a certain observation window of the stream sequence have been requested by demand addresses. The observation window Depth may be implemented differently depending on the specific design: it may be a fixed value, may vary based on the value of Conf, or may be dynamically adjusted at run time according to the processing of the current access requests. When a prefetch request is generated, if the AccVec bit corresponding to the address indicated by PfPtr is 1, meaning that the address has already been accessed, PfPtr can skip that address and prefetch the next one directly; the lookahead depth into AccVec may or may not vary, depending on the specific implementation.
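Combining the two paragraphs above and continuing the entry sketch, the prefetch-issue loop (S413 to S417) with the Dist[Conf] limit and the AccVec skip could be sketched as follows; the Dist table values, the AccVec bit indexing, and the hand-off function are hypothetical:

```c
/* Assumed Dist table indexed by Conf; values are illustrative and could be
 * hardware-fixed, software-configurable, or dynamically adjusted. */
static const int32_t DIST[4] = { 0, 5, 8, 12 };

/* Hypothetical hand-off to the next-stage module (e.g., a prefetch request
 * queue); returns nonzero once the request is accepted. Stubbed here. */
static int send_prefetch(uint64_t addr) { (void)addr; return 1; }

/* Issue prefetch requests while PfPtr < DmdPtr + Dist[Conf], skipping CLs
 * whose AccVec bit shows they were already demand-accessed
 * (assumed: AccVec bit k tracks BaseAddr + k * CL_SIZE). */
static void issue_prefetches(stream_entry_t *e) {
    while (e->pf_ptr < e->dmd_ptr + DIST[e->conf]) {
        if ((e->acc_vec >> (e->pf_ptr + 1)) & 1u) {
            e->pf_ptr++;        /* already accessed: skip this CL           */
            continue;
        }
        if (!send_prefetch(next_prefetch_addr(e)))
            break;              /* not yet accepted: retry after arbitration */
        e->pf_ptr++;            /* update PfPtr after a successful send      */
    }
}
```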
After the entry enters the prefetch mode, DmdPtr records the farthest demand address so far relative to BaseAddr. When DmdPtr exceeds the observation window Depth, the entry needs to be refreshed: the number of 1 bits recorded in AccVec is counted, and if the count exceeds a certain threshold, Conf = Conf + 1; if it is below a certain (lower) threshold, Conf = Conf - 1; otherwise Conf remains unchanged. If the refreshed entry is still in the prefetch mode, the other parameters within the entry are updated:
BaseAddr = BaseAddr ± Depth
      DmdPtr = DmdPtr - Depth
      PfPtr = PfPtr - Depth
      AccVec = AccVec << Depth
The thresholds described above may be implemented differently depending on the specific design: they may be fixed values, may vary based on the value of Conf, or may be dynamically adjusted at run time according to the processing of the current access requests. When Conf drops to 0, the entry falls back to the training mode, goes through the training procedure described previously, and re-enters the prefetch mode after certain conditions are met. While Conf > 0, the entry in the "prefetch" mode keeps generating prefetch requests. When multiple entries are in the prefetch mode and generate prefetch requests at the same time, one entry is chosen by the selector, a corresponding prefetch request is generated from the information in that entry, and the PfPtr of the chosen entry is updated after the prefetch request has been sent. "Sent" here means that the request has been successfully received by the next-stage module; in this implementation the next-stage module may be a prefetch request queue, though other implementations are possible. (A sketch of this refresh flow follows.)
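A sketch of the refresh flow under the same illustrative assumptions; the Depth, the thresholds, and the Conf ceiling are placeholders, and the pointer updates follow the formulas above:

```c
#define DEPTH     16   /* assumed observation-window depth (in CLs)     */
#define THR_UP    12   /* assumed upper threshold on the AccVec 1-count */
#define THR_DOWN   4   /* assumed lower threshold on the AccVec 1-count */
#define CONF_MAX   3   /* assumed maximum confidence                    */

/* Refresh an entry once DmdPtr exceeds the observation window Depth. */
static void refresh_entry(stream_entry_t *e) {
    int ones = __builtin_popcountll(e->acc_vec);  /* 1 bits recorded in AccVec */
    if (ones > THR_UP && e->conf < CONF_MAX)
        e->conf++;
    else if (ones < THR_DOWN && e->conf > 0)
        e->conf--;
    if (e->conf == 0)
        return;                 /* entry falls back to the training mode */
    e->base_addr = (e->dir == 0) ? e->base_addr + (uint64_t)DEPTH * CL_SIZE
                                 : e->base_addr - (uint64_t)DEPTH * CL_SIZE;
    e->dmd_ptr  -= DEPTH;
    e->pf_ptr   -= DEPTH;
    e->acc_vec <<= DEPTH;       /* slide the access-vector window        */
}
```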
It can be seen that each stream-mode prefetch entry has independent training and prefetch flows; when a demand address enters the streaming data prefetcher, it is sent to every entry simultaneously, and all entries train at the same time. In some software scenarios, or because of out-of-order issue and execution by the processor, the same streaming data may train multiple stream entries simultaneously. For multiple entries, this repetition of "training" and "prefetching" wastes power and, once the prefetch requests with the same address enter the pipeline, wastes downstream storage resources and pipeline bandwidth.
The present disclosure provides a data prefetching method capable of detecting and revoking a portion of the entries that will generate duplicate prefetch addresses, thereby avoiding the power consumption and bandwidth wasted by duplicate prefetch requests and improving the prefetch performance and overall operation performance of the device.
Specifically, with the data prefetching method provided by the embodiments of the disclosure, the multiple entries of the prefetcher (e.g., the HWPF) are independently trained and generate streaming data prefetch requests, which reduces the requirement on the pipeline depth of the entries, so that the hardware implementation area of the multiple entries can be controlled while still ensuring their rapid training. Further, the multi-way hit detector of the prefetcher performs duplicate detection on the multiple entries in parallel: based on hit information of the multiple entries, it determines whether there exist entries that will generate duplicate prefetch addresses, and revokes a portion of those entries. This ensures that the subsequent prefetch process does not repeatedly hit multiple entries and that the entries do not issue prefetch requests with duplicate addresses, thereby avoiding the power consumption and entry bandwidth wasted by duplicate prefetching and improving the prefetch performance of the prefetcher and the overall operation performance of the electronic device.
      The data prefetching method of the present disclosure will be described below with reference to the accompanying drawings. Detailed descriptions of known functions and known components may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure.
The data prefetching method according to the embodiments of the disclosure is applicable to an electronic device that may comprise a prefetcher and a cache. The prefetcher here may be a streaming data prefetcher as described above with reference to the drawings, including a plurality of entries for streaming data prefetching. Further, a prefetcher according to the present disclosure may also be configured with a multi-way hit detector which, as one implementation, may be implemented in hardware in a pipelined fashion and performs the duplicate-sequence determination in parallel with the entries of the prefetcher, so as to identify multi-hit situations involving more than one entry.
      Fig. 5 shows a schematic flow chart of a data pre-fetching method according to some embodiments of the present disclosure, the method comprising steps S501-S504.
      Step S501, a demand address for the cache is received by utilizing a plurality of entries, training is performed based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each entry in the plurality of entries is independently trained and generates streaming data prefetch requests.
According to some embodiments of the present disclosure, the demand address is a virtual address corresponding to a load instruction or a store instruction. Specifically, step S501 may be understood with reference to the description of FIG. 2: for example, each of entries 0 to 31 receives the ld/st op information, and the entries train in parallel to identify streaming data rules and generate possible prefetch requests.
Step S502, determining, by the multi-way hit detector and based on hit information of the plurality of entries, whether there exist N entries among the plurality of entries that will generate duplicate prefetch addresses, where N is an integer greater than 1.
In the scheme according to the disclosure, a multi-way hit detector is added to the streaming data prefetcher specifically to remove entries that are hit redundantly. The detection and processing flow for multi-way hits is independent of the training and prefetching process of each stream entry, so the training and prefetch pipelines of the entries are not affected.
According to some embodiments of the present disclosure, determining, by the multi-way hit detector and based on hit information of the plurality of entries, whether there exist N entries that will generate duplicate prefetch addresses includes determining that N entries that will generate duplicate prefetch addresses exist in the case where the demand address simultaneously hits N streaming-data entries that have the same direction (Dir) and are each in the prefetch mode or in the training mode.
It will be appreciated that the criteria for a multi-way hit may be implemented differently in each specific design. As an example, in some implementations, the multi-way hit detector may determine whether the demand address hits multiple entries simultaneously, and if multiple entries are hit, it is determined that a multi-hit situation exists.
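As one illustrative reading of this criterion, the detector could evaluate a predicate like the following over the per-entry hit signals; the hit_mask input, the mode encoding, and the function name are assumptions:

```c
typedef enum { MODE_IDLE, MODE_TRAINING, MODE_PREFETCH } entry_mode_t;

/* Count hit entries with the same stream direction that are in the training
 * or prefetch mode; a result greater than 1 means duplicate prefetch
 * addresses would be generated. hit_mask bit i means entry i was hit. */
static int count_same_direction_hits(const stream_entry_t *entries,
                                     const entry_mode_t *modes,
                                     uint32_t hit_mask, int dir, int n_entries) {
    int n = 0;
    for (int i = 0; i < n_entries; i++) {
        if (!((hit_mask >> i) & 1u)) continue;  /* entry i was not hit           */
        if (modes[i] == MODE_IDLE)   continue;  /* neither training nor prefetch */
        if (entries[i].dir != dir)   continue;  /* different stream direction    */
        n++;
    }
    return n;
}
```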
Next, in step S503, in the case where it is determined that the N entries exist, M selected entries are determined from the N entries based on the information of the N entries using the multi-way hit detector, where M is an integer greater than or equal to 1 and less than N.
As one implementation, M may be equal to 1; that is, a single entry is determined directly from the multiply hit entries as the selected entry, and all of the N entries other than the selected entry are unselected entries. In other implementations, M may be set to a value smaller than N according to the specific situation; for example, when N is 4, two entries may be chosen as selected entries and the other two are unselected entries.
As one implementation, M is equal to 1, in which case determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry with the highest confidence parameter among the N entries as the selected entry. That is, for the N entries that were multiply hit, the one with the highest confidence Conf value is determined as the selected entry, and the remaining entries are the unselected entries.
As another implementation, M is equal to 1, in which case determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry allocated earliest in time among the N entries as the selected entry. That is, for the N entries that were multiply hit, the selected entry is determined based on the allocation times of the entries.
As yet another implementation, M is equal to 1, in which case determining M selected entries from the N entries based on the information of the N entries using the multi-way hit detector includes determining the entry with the smallest or largest entry sequence number among the N entries as the selected entry. For example, referring to FIG. 2, for the N entries that were multiply hit, the selected entry is determined according to the sequence numbers of the entries, e.g., the entry with the smallest sequence number is determined as the selected entry. Assume that in step S502 entry 0 and entry 2 are determined to be 2 entries that will produce duplicate prefetch addresses; in this implementation, entry 0 may be the selected entry and entry 2 the unselected entry.
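The three selection rules above could be sketched as follows for M = 1; the alloc_time array (a hypothetical per-entry allocation timestamp) and the policy enumeration are illustrative additions:

```c
typedef enum { PICK_HIGHEST_CONF, PICK_EARLIEST_ALLOC, PICK_SMALLEST_INDEX } pick_policy_t;

/* Pick one selected entry (M = 1) out of the hit entries; returns its
 * entry sequence number, or -1 if no entry was hit. */
static int pick_selected(const stream_entry_t *entries, const uint64_t *alloc_time,
                         uint32_t hit_mask, int n_entries, pick_policy_t policy) {
    int best = -1;
    for (int i = 0; i < n_entries; i++) {
        if (!((hit_mask >> i) & 1u)) continue;
        if (best < 0) { best = i; continue; }   /* first hit found so far */
        switch (policy) {
        case PICK_HIGHEST_CONF:
            if (entries[i].conf > entries[best].conf) best = i;
            break;
        case PICK_EARLIEST_ALLOC:
            if (alloc_time[i] < alloc_time[best]) best = i;
            break;
        case PICK_SMALLEST_INDEX:
            break;   /* the smallest hit index already wins */
        }
    }
    return best;
}
```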
Referring again to FIG. 5, in step S504, a revocation instruction is generated, using the multi-way hit detector, for the unselected entries among the N entries other than the M selected entries, so that the unselected entries are prohibited from generating streaming data prefetch requests.
According to embodiments of the present disclosure, the multi-way hit detector first detects whether multiple entries are hit; when a multi-entry hit occurs, the selected entry is determined according to a certain rule, and for the unselected entries an entry revocation signal (entry invalidate signal) may be issued, for example, to stop these duplicate entries from continuing to generate subsequent prefetch requests, i.e., to reduce duplicate entries. As an example, the entry resources that received the revocation signal are released for streaming data training and prefetching based on subsequent demand addresses. In addition, because the multi-way hit detector detects and revokes duplicate entries inside the hardware prefetch module, it can intervene before prefetch requests are issued, which avoids duplicate prefetch requests affecting prefetch training, prevents duplicate prefetch requests from occupying resources such as the pipeline and caches, and improves the prefetch performance and the overall performance of the processor.
According to some embodiments of the present disclosure, the prefetcher may further comprise a prefetch request generator. The data prefetching method according to the embodiments of the disclosure further comprises generating, by the prefetch request generator, a streaming data prefetch request based on the M selected entries, the streaming data prefetch request being used for a streaming data prefetch process based on the cache. That is, no additional operation needs to be performed on a selected entry; its subsequent prefetch process may continue as in the original implementation, which is not limited herein.
According to some embodiments of the present disclosure, the streaming data prefetch request is used to perform a streaming data prefetch process based on one of a first level cache, a second level cache, and a last level cache. As described above, prefetchers can be classified into L1-level prefetchers, L2-level prefetchers, LLC-level prefetchers, and so on, according to the cache level into which the target data is prefetched. The data prefetching method according to the embodiments of the present disclosure can be applied to any of the above L1-level, L2-level, and LLC-level prefetchers; that is, its application scope is not restricted.
According to some embodiments of the present disclosure, the multi-way hit detector is implemented as a pipeline that is independent of the plurality of entries and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries. According to the embodiments of the disclosure, the processing flow of the multi-way hit detector may in practice be a pipeline, and this pipeline and the pipelines of the stream entries are independent; the number of pipeline stages of each is determined by the timing of the specific implementation logic.
Because the detection process of the multi-way hit detector according to the present disclosure is independent of the plurality of entries and multi-way hit detection is performed in parallel with their training and prefetching, pipeline stages can be saved in both the detector and the entries, and power consumption can be reduced. If each entry instead added a multi-way hit determination to its training pipeline, additional timing pressure would be created, and the number of training pipeline stages of each entry would need to be increased to implement the relevant logic. This would increase both the hardware resource consumption and the number of pipeline stages, which would also result in greater prefetch latency; more importantly, for scenarios where no multi-way hit occurs, the added logic and stages would cause additional power consumption and performance loss.
FIG. 6 shows a flowchart of the execution of the multi-way hit detector according to an embodiment of the present disclosure. As shown in FIG. 6, in step S601 a demand address enters the streaming data prefetcher, and it is then determined whether the address hits multiple entries (S602). If there is a multi-way hit, the highest-priority entry is determined from among the hit entries (S603), for example by taking the entry with the highest confidence value Conf as the selected entry. Then, at S604, the highest-priority entry is retained and the other, unselected entries are revoked. If there is no multi-way hit, the multi-way hit detector performs no other operation (S605).
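Putting the pieces together, one illustrative model of the detector's per-address step (S601 to S605) is sketched below; it reuses the hypothetical helpers above and models revocation as simply clearing the Valid bit:

```c
/* One multi-way hit detector step per incoming demand address (S601-S605);
 * in hardware this runs in its own pipeline, in parallel with the entries. */
static void multi_hit_detect_step(stream_entry_t *entries, entry_mode_t *modes,
                                  const uint64_t *alloc_time, uint32_t hit_mask,
                                  int dir, int n_entries) {
    /* S602: no multi-way hit means nothing to do (S605). */
    if (count_same_direction_hits(entries, modes, hit_mask, dir, n_entries) <= 1)
        return;
    /* S603: determine the highest-priority (selected) entry. */
    int sel = pick_selected(entries, alloc_time, hit_mask, n_entries,
                            PICK_HIGHEST_CONF);
    /* S604: retain the selected entry and revoke the other hit entries. */
    for (int i = 0; i < n_entries; i++) {
        if (((hit_mask >> i) & 1u) && i != sel) {
            entries[i].valid = 0;   /* revoked entry is freed for reallocation */
            modes[i] = MODE_IDLE;
        }
    }
}
```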
Next, a specific example of multi-way hit detection according to the data prefetching method of the present disclosure is given with reference to FIGS. 7A and 7B. Assume that every entry of the streaming data prefetcher is empty, i.e., in the initial state, and that the incoming demand address sequence is, in order: A, A+5, A+1, A+2, A+6, A+7, A+3, A+4, A+8, A+9. It can be seen that this is an out-of-order streaming sequence.
First, referring to the case without parallel multi-way hit detection shown in FIG. 7A, assume the hit situation of the above demand address sequence is as follows:
 - address A allocates entry 0, and address A+5 allocates entry 1;
 - then, addresses A+1, A+2 hit entry 0, and entry 0 enters the prefetch mode;
 - then, addresses A+6, A+7 hit entry 0, which updates its state (AccVec, DmdPtr); at the same time, addresses A+6, A+7 hit entry 1, and entry 1 enters the prefetch mode;
 - thereafter, addresses A+3, A+4 hit entry 0, which updates AccVec;
 - thereafter, addresses A+8, A+9 hit entry 0, which updates AccVec and DmdPtr, and at the same time hit entry 1, which updates AccVec and DmdPtr.
After the demand address sequence has been input, entry 0 ends with BaseAddr = A+2 and DmdPtr = 7. Assuming entry 0 has confidence Conf = 1 and interval Dist = 5, entry 0 can issue 12 prefetch requests with an address range of A+3 to A+14 (PfPtr < DmdPtr + Dist = 12, so PfPtr runs from 0 to 11 and the request addresses BaseAddr + (PfPtr + 1) × CL run from A+3 to A+14). Entry 1 ends with BaseAddr = A+7 and DmdPtr = 2; assuming entry 1 also has Conf = 1 and Dist = 5, entry 1 can issue 7 prefetch requests with an address range of A+8 to A+14. The prefetch addresses of entry 1 thus fall entirely within the prefetch address range of entry 0, and the prefetch requests issued by the two entries are duplicates. Repeatedly processing prefetch requests necessarily wastes downstream execution-unit bandwidth and power and incurs a performance penalty.
Next, referring to FIG. 7B, which shows the training situation of the streaming data prefetcher after the multi-way hit detector is added, the hit situation of the demand address sequence A, A+5, A+1, A+2, A+6, A+7, A+3, A+4, A+8, A+9 initially proceeds as in FIG. 7A, namely:
       address A allocates entry 0, and address A+5 allocates entry 1; 
       then, addresses A+1 and A+2 hit in entry 0, and entry 0 enters the prefetch mode; 
       then, addresses A+6 and A+7 hit in entry 0, which updates its AccVec and DmdPtr state; at the same time, addresses A+6 and A+7 hit in entry 1, and entry 1 enters the prefetch mode; 
       referring to Fig. 7B, because multi-way detection is now in place, the multi-way hit detector finds the multi-hit on entries of the same kind when address A+7 is input; then, for example, entry 0 may be retained (as the selected entry) according to its AccVec count, and entry 1 (as an unselected entry) may be revoked, so that entry 1 no longer performs any subsequent prefetch processing; 
       thereafter, addresses A+3 and A+4 hit in entry 0, which updates its AccVec state; 
       thereafter, addresses A+8 and A+9 hit in entry 0, which updates its AccVec and DmdPtr state; 
       after the demand address sequence has been input, entry 0 ends with BaseAddr = A+2, DmdPtr = 7, and Dist = 5; if Conf = 1, entry 0 can issue 12 prefetch requests covering addresses A+3 to A+14, while entry 1 is invalid. 
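      Under the same inferred range formula, the Fig. 7B outcome can be replayed to confirm that no duplicate addresses survive; entry 1's contribution is assumed empty here, ignoring the small revocation window discussed next.

```python
A = 0x1000

def prefetch_range(base_addr: int, dmd_ptr: int, dist: int) -> set[int]:
    # Same inferred formula as before: [BaseAddr + 1, BaseAddr + DmdPtr + Dist].
    return set(range(base_addr + 1, base_addr + dmd_ptr + dist + 1))

entry0 = prefetch_range(A + 2, 7, 5)   # survives detection: A+3 .. A+14
entry1: set[int] = set()               # revoked before prefetching

issued = sorted(entry0 | entry1)
assert len(issued) == 12                          # 12 requests, A+3 .. A+14
assert len(issued) == len(entry0) + len(entry1)   # no duplicate addresses
```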
      Furthermore, it will be appreciated that because the pipeline of the multi-way hit detector is independent of the training/prefetch pipelines of the streaming data prefetch entries, there may be a time window between entry 1 entering the prefetch mode and entry 1 being invalidated, during which entry 1 may issue some prefetch requests; in the example of Fig. 7B it issues at most 4 such requests. After that, the subsequent sequence does not continue to generate duplicate prefetch address requests. The revoked entry 1 may then be released for training on other demand addresses; for example, if a subsequent address A+100 is received, it may be allocated to the released entry 1.
      In general, according to the data prefetching method provided by the embodiments of the present disclosure, the plurality of entries of the prefetcher are trained independently and independently generate streaming data prefetch requests, which reduces the required pipeline depth per entry and keeps the hardware area of the entries under control while still allowing them to train quickly. Further, the multi-way hit detector of the prefetcher checks the plurality of entries for duplication in parallel; that is, based on hit information of the plurality of entries, it determines whether any entries will generate duplicate prefetch addresses and disables some of the entries that would do so. This ensures that the subsequent prefetch process does not repeatedly hit multiple entries and that the entries do not issue prefetch requests with duplicate addresses, thereby avoiding the power and entry-bandwidth waste caused by duplicate prefetching and improving both the prefetch performance of the prefetcher and the overall performance of the electronic device.
      According to another aspect of the present disclosure, an electronic device is also provided. Fig. 8A shows a schematic block diagram of an electronic device according to an embodiment of the disclosure. As shown in Fig. 8A, an electronic device 1000 according to an embodiment of the disclosure may include a prefetcher 1010 and a cache 1020. The prefetcher 1010 may be configured to include a plurality of entries 1011 and a multi-way hit detector 1012. According to embodiments of the present disclosure, the prefetcher 1010 may be, for example, a streaming data prefetcher as described above for discovering the streaming data access rules of addresses. The cache 1020 may be the three-level cache architecture described above, namely an L1 cache, an L2 cache, and an LLC cache. Each of the plurality of entries 1011 may have a hardware configuration as shown in Fig. 2 and independently performs the streaming data prefetching and training procedures.
      According to some embodiments of the present disclosure, the plurality of entries 1011 may be configured to receive a demand address for a cache and to train based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each of the plurality of entries trains independently and generates its own streaming data prefetch requests.
      According to some embodiments of the present disclosure, the multi-way hit detector 1012 may be configured to: determine, based on hit information of the plurality of entries, whether there are N entries among the plurality of entries that will generate duplicate prefetch addresses, where N is an integer greater than 1; if it is determined that such N entries exist, determine M selected entries from the N entries based on information of the N entries, where M is an integer greater than or equal to 1 and less than N; and generate a revocation instruction for the unselected entries among the N entries other than the M entries, so that the unselected entries are prohibited from generating streaming data prefetch requests.
      For specific implementations of the plurality of entries 1011 and the multi-way hit detector 1012 provided according to the embodiments of the present disclosure, reference may be made to the above description of the data prefetching method of the present disclosure, which is not repeated here.
      According to some embodiments of the present disclosure, the multi-way hit detector 1012 is implemented as a pipeline that is independent of the plurality of entries 1011 and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries 1011.
      According to some embodiments of the present disclosure, the demand address is a virtual address corresponding to a load instruction or a store instruction. Specifically, entries 0 through 31 in Fig. 2, for example, each receive the ld/st op information and are trained in parallel to identify streaming data rules and generate possible prefetch requests.
      According to some embodiments of the present disclosure, the multi-way hit detector 1012 determining, based on hit information of the plurality of entries, whether there are N entries among the plurality of entries that will generate duplicate prefetch addresses includes: determining that N entries that will generate duplicate prefetch addresses exist when the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode or in the training mode; or determining that N entries that will generate duplicate prefetch addresses exist when the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode.
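      The two alternative hit conditions can be expressed as a single predicate over the simultaneously hit entries. In this sketch, the HitInfo fields and the direction encoding are assumptions; only the same-direction requirement and the two mode combinations come from the text.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Mode(Enum):
    TRAINING = auto()
    PREFETCH = auto()

@dataclass
class HitInfo:
    direction: int   # +1 ascending stream, -1 descending (assumed encoding)
    mode: Mode

def will_duplicate(hits: list[HitInfo], prefetch_only: bool = False) -> bool:
    """True if the simultaneously hit entries would produce duplicate
    prefetch addresses, per the two alternative conditions in the text."""
    if len(hits) < 2:
        return False
    if len({h.direction for h in hits}) != 1:    # must share stream direction
        return False
    if prefetch_only:                            # second, narrower condition
        return all(h.mode is Mode.PREFETCH for h in hits)
    return all(h.mode in (Mode.PREFETCH, Mode.TRAINING) for h in hits)
```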
      According to some embodiments of the present disclosure, M is equal to 1, and the multi-way hit detector 1012 determining M selected entries from the N entries based on the information of the N entries includes one of: determining, as the selected entry, the entry with the highest confidence parameter among the N entries; determining, as the selected entry, the entry allocated earliest in time among the N entries; or determining, as the selected entry, the entry with the smallest or largest entry sequence number among the N entries.
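      The three M = 1 selection policies map directly onto key functions; the field names conf, alloc_time, and index are illustrative assumptions.

```python
def by_confidence(entries):
    # Policy 1: the entry with the highest confidence parameter wins.
    return max(entries, key=lambda e: e.conf)

def by_allocation_time(entries):
    # Policy 2: the entry allocated earliest in time wins.
    return min(entries, key=lambda e: e.alloc_time)

def by_sequence_number(entries, largest: bool = False):
    # Policy 3: the smallest (or largest) entry sequence number wins.
    choose = max if largest else min
    return choose(entries, key=lambda e: e.index)
```

      Any one of these can serve as the tie-break inside the detector's S603 step; which to use is an implementation choice among the listed embodiments.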
      According to some embodiments of the present disclosure, the prefetcher 1010 may further include a prefetch request generator. The prefetch request generator may be configured to generate streaming data prefetch requests based on the M selected entries, where the streaming data prefetch requests are used for the cache-based streaming data prefetch process.
      According to some embodiments of the present disclosure, a streaming data prefetch request is used to perform a streaming data prefetch process based on a first level cache, a second level cache, and a last level cache of a cache.
      According to some embodiments of the present disclosure, the electronic device 1000 may further include a read-write unit (LSU). Wherein the prefetcher 1010 is located within the read-write unit, and the read-write unit further comprises a read-write queue (LSQ) and an arbiter (Arbiter). The prefetcher 1010 may refer to the hardware prefetching module HWPF in fig. 1, and its relationship with the LSU and the LSQ may refer to the description of fig. 1, which is not repeated here.
      According to some embodiments of the present disclosure, the read-write queue LSQ may be configured to generate demand accesses based on information of read-write operations. The arbiter may be configured to generate data access requests based on the demand accesses of the read-write queue and the streaming data prefetch requests generated by the prefetcher, the data access requests corresponding to the demand address, where the prefetcher trains and prefetches based on the demand address.
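      A toy model of that arbitration might look as follows; giving demand accesses priority over prefetch requests is a common policy assumed here for illustration, not a rule stated in the disclosure.

```python
from collections import deque
from typing import Optional

class Arbiter:
    """Toy arbiter: demand accesses from the LSQ compete with streaming
    prefetch requests; demand-first priority is an assumed policy."""

    def __init__(self) -> None:
        self.demand_q: deque = deque()    # demand addresses from the LSQ
        self.prefetch_q: deque = deque()  # prefetcher-generated addresses

    def next_request(self) -> Optional[int]:
        if self.demand_q:                 # demand traffic wins arbitration
            return self.demand_q.popleft()
        if self.prefetch_q:
            return self.prefetch_q.popleft()
        return None
```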
      The electronic device according to the embodiments of the present disclosure can carry out the specific execution procedures of data access, prefetching, and related functions; in addition, it can implement the steps of the data prefetching method according to some embodiments of the present disclosure described above in conjunction with the drawings, which are not repeated here. The electronic device of the embodiments of the present disclosure performs a similar data prefetching process and achieves similar technical effects.
      As one implementation, Fig. 8B illustrates an example block diagram of an electronic device according to an embodiment of the disclosure. Comparing Fig. 2 and Fig. 8B, the prefetcher according to the embodiment of the disclosure is further configured with a multi-way hit detector that is independent of the entries and checks the multiple entries for duplication in parallel; that is, it determines, based on hit information of the multiple entries, whether any entry will generate a duplicate prefetch address and, for example, provides a discard signal to the unselected entries to disable the portion of entries that would generate duplicate prefetch addresses. This ensures that the subsequent prefetch process does not repeatedly hit multiple entries and that the entries do not issue prefetch requests with duplicate addresses, thereby avoiding the power and entry-bandwidth waste caused by duplicate prefetching and improving both the prefetch performance of the prefetcher and the overall performance of the electronic device.
      According to yet another aspect of the present disclosure, an electronic device is provided. Fig. 9 illustrates a schematic block diagram of an electronic device according to some embodiments of the present disclosure.
      As shown in fig. 9, the electronic device 2000 may include a processor 2010 and a memory 2020. According to some embodiments of the present disclosure, memory 2020 has stored therein computer readable code which, when executed by processor 2010, may perform the steps of a data prefetching method provided according to an embodiment of the present disclosure.
      Processor 2010 may perform various actions and processes according to programs stored in memory 2020. In particular, processor 2010 may be an integrated circuit with signal processing capability. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
      Memory 2020 stores computer readable code which, when executed by processor 2010, is capable of causing the processor to implement a data prefetching method according to some embodiments of the present disclosure. The memory 2020 may be volatile memory or nonvolatile memory or may include both volatile and nonvolatile memory. It should be noted that the memory described herein may be any suitable type of memory. By way of example, processor 2010 is capable of implementing the steps of the data pre-fetching method described above in connection with the figures by executing computer readable code in memory 2020.
      Fig. 10 shows an architectural schematic diagram of an exemplary electronic device according to an embodiment of the present disclosure. It will be appreciated that the electronic device shown in fig. 10 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
      For example, as shown in Fig. 10, in some examples the electronic device 3000 includes a processing means (e.g., a central processor, a graphics processor, etc.) 3010, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 3020 or a program loaded from the storage means 3080 into a random access memory (RAM) 3030. The RAM 3030 also stores various programs and data necessary for the operation of the computer system. The processing device 3010, the ROM 3020, and the RAM 3030 are connected via a bus 3040. An input/output (I/O) interface 3050 is also connected to the bus 3040.
      For example, the following components may be connected to the I/O interface 3050: input devices 3060 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; output devices 3070 such as a liquid crystal display (LCD), speaker, or vibrator; storage devices 3080 such as magnetic tape or hard disk; and communication devices 3090 including network interface cards such as LAN cards and modems. The communication device 3090 may allow the electronic device 3000 to perform wireless or wired communication with other devices to exchange data, carrying out communication processing via a network such as the Internet. The drive 3100 is also connected to the I/O interface 3050 as needed. Removable media 3110, such as a magnetic disk, optical disk, magneto-optical disk, or semiconductor memory, is mounted on the drive 3100 as needed, so that a computer program read from it can be installed into the storage device 3080 as needed. While Fig. 10 illustrates an electronic device 3000 comprising various means, it is to be understood that not all illustrated means are required to be implemented or included; more or fewer means may be implemented or included instead.
      For example, the electronic device 3000 may further include a peripheral interface (not shown in the figure), and the like. The peripheral interface may be any of various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 3090 may communicate with a network and other equipment through wireless communication.
      For example, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program loaded on a non-transitory computer readable medium.
      According to another aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium having stored thereon computer-readable instructions, which when executed by a processor, cause the processor to perform the steps of a data prefetching method according to the present disclosure.
      Fig. 11 shows a schematic block diagram of a storage medium according to an embodiment of the present disclosure. As shown in fig. 11, computer-readable storage medium 4000 has stored thereon computer-readable instructions 4010. The steps of the data prefetching method described with reference to the above figures may be performed when the computer readable instructions 4010 are executed by the processor.
      It should be noted that the computer readable storage medium provided by the present disclosure may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable medium may be included in the electronic device or may exist alone without being incorporated into the electronic device.
      In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
      The foregoing description covers only the preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to the specific combinations of the features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
      Further, while the present disclosure makes various references to certain units in an electronic device or electronic apparatus according to embodiments of the present disclosure, any number of different units may be used and run on a client and/or server. The units are merely illustrative, and different aspects of the method, electronic device, and electronic apparatus may use different units.
      Those of ordinary skill in the art will appreciate that all or a portion of the steps of the methods described above may be implemented by a computer program to instruct related hardware, and the program may be stored in a computer readable storage medium, such as a read only memory, a magnetic disk, or an optical disk. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiment may be implemented in the form of hardware, or may be implemented in the form of a software functional module. The present disclosure is not limited to any specific form of combination of hardware and software.
      Unless defined otherwise, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
      The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.
    Claims (20)
1. A method of prefetching data, the method being applicable to an electronic device comprising a prefetcher and a cache, wherein the prefetcher comprises a plurality of entries and a multi-way hit detector, the method comprising:
       receiving a demand address for the cache using the plurality of entries and training based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each of the plurality of entries independently performs the training and generates streaming data prefetch requests;
      determining, with the multi-way hit detector, whether there are N entries among the plurality of entries that will produce duplicate prefetch addresses based on hit information of the plurality of entries, wherein N is an integer greater than 1;
       in the case where it is determined that the N entries exist, determining, with the multi-way hit detector, M selected entries from the N entries based on the information of the N entries, wherein M is an integer greater than or equal to 1 and less than N; and
      generating, with the multi-way hit detector, a revocation instruction for the unselected entries among the N entries other than the M entries, such that the unselected entries are prohibited from generating streaming data prefetch requests.
    2. The method of claim 1, wherein the multi-way hit detector is implemented as a pipeline that is independent of the plurality of entries and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries.
    3. The method of claim 1, wherein the demand address is a virtual address corresponding to a load instruction or a store instruction.
    4. The method of claim 1, wherein determining, with the multi-way hit detector, whether there are N entries of the plurality of entries that will generate duplicate prefetch addresses based on hit information of the plurality of entries comprises:
       determining that there are N entries that will produce duplicate prefetch addresses in the case where the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode or in the training mode; or
      determining that there are N entries that will produce duplicate prefetch addresses in the case where the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode.
    5. The method of claim 1, wherein M is equal to 1, wherein determining, with the multi-way hit detector, M selected entries from the N entries based on the information of the N entries comprises:
       determining, based on the information of the N entries, the entry with the highest confidence parameter among the N entries as the selected entry. 
    6. The method of claim 1, wherein M is equal to 1, wherein determining, with the multi-way hit detector, M selected entries from the N entries based on the information of the N entries comprises:
       determining, based on the information of the N entries, the entry allocated earliest in time among the N entries as the selected entry. 
    7. The method of claim 1, wherein M is equal to 1, wherein determining, with the multi-way hit detector, M selected entries from the N entries based on the information of the N entries comprises:
       determining, based on the information of the N entries, the entry with the smallest or largest entry sequence number among the N entries as the selected entry. 
    8. The method of claim 1, wherein the prefetcher further comprises a prefetch request generator, the method further comprising:
       Generating, with the prefetch request generator, a streaming data prefetch request based on the M selected entries, the streaming data prefetch request being used to perform a streaming data prefetch process based on the cache. 
    9. The method of claim 8, wherein the streaming data prefetch request is for a streaming data prefetch process based on a first level cache, a second level cache, and a last level cache of the cache.
    10. An electronic device comprising a prefetcher and a cache, the prefetcher configured to include a plurality of entries and a multi-way hit detector, wherein,
      the plurality of entries are configured to receive a demand address for the cache and to train based on the demand address to identify streaming data rules and generate streaming data prefetch requests, wherein each of the plurality of entries independently performs the training and generates streaming data prefetch requests; and
      The multi-way hit detector is configured to:
       determine, based on hit information of the plurality of entries, whether there are N entries among the plurality of entries that will generate duplicate prefetch addresses, wherein N is an integer greater than 1;
       in the case where it is determined that the N entries exist, determine M selected entries from the N entries based on the information of the N entries, wherein M is an integer greater than or equal to 1 and less than N; and
      generate a revocation instruction for the unselected entries among the N entries other than the M entries, such that the unselected entries are prohibited from generating streaming data prefetch requests.
    11. The electronic device of claim 10, wherein the multi-way hit detector is implemented as a pipeline that is independent of the plurality of entries and performs multi-way hit detection in parallel with the training and prefetching processes of the plurality of entries.
    12. The electronic device of claim 10, wherein the demand address is a virtual address corresponding to a load instruction or a store instruction.
    13. The electronic device of claim 10, wherein the multi-way hit detector determining whether there are N entries of the plurality of entries that will generate duplicate prefetch addresses based on hit information of the plurality of entries comprises:
       determining that there are N entries that will produce duplicate prefetch addresses in the case where the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode or in the training mode; or
      determining that there are N entries that will produce duplicate prefetch addresses in the case where the demand address simultaneously hits N entries whose streaming data has the same direction and which are in the prefetch mode.
    14. The electronic device of claim 10, wherein M is equal to 1, and wherein the multi-way hit detector determining M selected entries from the N entries based on the information of the N entries comprises one of:
       determining, based on the information of the N entries, the entry with the highest confidence parameter among the N entries as the selected entry; 
       determining, based on the information of the N entries, the entry allocated earliest in time among the N entries as the selected entry; or 
      determining, based on the information of the N entries, the entry with the smallest or largest entry sequence number among the N entries as the selected entry.
    15. The electronic device of claim 10, wherein the prefetcher further comprises a prefetch request generator configured to generate a streaming data prefetch request based on the M selected entries, wherein the streaming data prefetch request is used to conduct a streaming data prefetch process based on the cache.
    16. The electronic device of claim 15, wherein the streaming data prefetch request is to perform a streaming data prefetch process based on a first level cache, a second level cache, and a last level cache of the cache.
    17. The electronic device of claim 10, further comprising a read-write unit, wherein the prefetcher is located within the read-write unit of the electronic device, the read-write unit further comprising a read-write queue and an arbiter, wherein,
      The read-write queue is configured to generate a demand access based on information of read-write operations;
       the arbiter is configured to generate a data access request based on a demand access of the read-write queue and a streaming data prefetch request generated by the prefetcher, the data access request corresponding to the demand address, wherein the prefetcher is trained and performs prefetching based on the demand address. 
    18. An electronic device comprising a processor and a memory having stored therein computer readable code which when executed by the processor causes the processor to perform the steps of the method of any of claims 1-9.
    19. A non-transitory computer readable storage medium having stored thereon computer readable instructions which, when executed by a processor, cause the processor to perform the steps of the method of any of claims 1-9.
    20. A computer program product comprising a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any of claims 1-9.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202411918083.2A CN119782207B (en) | 2024-12-24 | Data prefetching method, electronic device, apparatus, medium and product | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN119782207A CN119782207A (en) | 2025-04-08 | 
| CN119782207B true CN119782207B (en) | 2025-10-17 | 
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN103049398A (en) * | 2012-12-31 | 2013-04-17 | 北京北大众志微系统科技有限责任公司 | Filtering method capable of reducing useless prefetching | 
| CN115309453A (en) * | 2022-07-14 | 2022-11-08 | 复旦大学 | Cache access system supporting out-of-order processor data prefetching | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant |