WO1996037844A1

WO1996037844A1 - A pipelined microprocessor that makes memory requests to a cache memory and an external memory controller during the same clock cycle

Info

Publication number: WO1996037844A1
Application number: PCT/US1996/007091
Authority: WO
Inventors: Robert Divivier; Mario D. Nemirovsky; Robert Bignell
Original assignee: National Semiconductor Corporation
Priority date: 1995-05-26
Filing date: 1996-05-16
Publication date: 1996-11-28
Also published as: EP0772829A1; KR970705086A

Abstract

Memory requests are made to a cache memory and an external memory controller during the same clock cycle when the bus connected to the external memory is available. By making both memory requests during the same clock cycle rather than first accessing the cache memory as is conventionally done, the cycle time lost when the request is not stored in the cache memory can be eliminated. The unneeded external memory requests that result each time the request is stored in the cache memory are eliminated by gating the request to the external memory controller with a logic signal output by the cache memory that indicates whether the request is present.

Description

A PIPELINED MICROPROCESSOR

THAT MAKES MEMORY REQUESTS TO A

CACHE MEMORY AND AN EXTERNAL MEMORY CONTROLLER

DURING THE SAME CLOCK CYCLE BACKGROUND OF THE INVENTION

1. Field of the Invention.

The present invention relates to pipelined microprocessors and, more particularly, to a pipelined microprocessor that makes memory requests to a cache memory and an external memory controller during the same clock cycle when the system bus that connects an external memory to the processor is available.

2. Description of the Related Art.

A pipelined microprocessor is a microprocessor that operates on instructions in stages so that, at each stage of the pipeline, a different function is performed on an instruction. As a result, multiple instructions move through the pipe at the same time much like to-be-assembled products move through a multistage assembly line.

FIG. 1 shows a block diagram that illustrates the flow of an instruction through a conventional pipelined processor. As shown in FIG. 1, the first stage in the pipe is a prefetch stage. In this stage, the to-be-executed instructions are retrieved from either a cache memory or an external memory, and are then sequentially loaded into a prefetch buffer. The purpose of the prefetch stage is to fill the prefetch buffer so that one instruction can be advanced to the decode stage, the next stage in the pipe, with each clock cycle.

In the decode stage, each instruction moving through the pipe is decoded to determine what operation is to be performed. After the decode stage, an operand stage determines if data will be needed to perform the operation and, if needed, retrieves the data from either one of several data registers, a data cache memory, or the external memory. Following this, the operation specified by the instruction is performed in an execution stage, while the results of the operation are stored in a write-back stage.

In the ideal case, each instruction is advanced from one stage to the next with each successive clock cycle. Thus, while it takes five clock cycles for an instruction to propagate through the pipeline, the processor appears to complete the execution of each instruction in only one clock cycle. One instruction which can stall the pipeline, or prevent instructions from advancing from one stage to the next with each clock cycle, is the branch instruction. With the branch instruction, the next instruction to be executed is determined by the outcome of the operation.

If the outcome of the operation requires the next instruction in the pipe, then the pipeline continues to function as described above. However, if the outcome of the operation requires an instruction not currently in the pipe, then the instructions in the prefetch, decode, and operand stages must be flushed from the pipeline and replaced by the alternate instructions required by the branch condition.

To minimize the time consuming process of obtaining the alternate instructions from external memory, conventional pipelined processors utilize an on-chip instruction cache memory to store a limited number of instructions. As is well known, instructions can typically be retrieved from a cache memory in one clock cycle, while two to three clock cycles are typically required to retrieve instructions from an external memory. In addition, significantly more clock cycles can be consumed waiting for the bus if the bus is busy.

Because of the difference in cycle time in accessing cache memory versus external memory, as well as the penalities incurred with tying up the external bus and changing the page hit logic in a dynamic random-access-controller (DRAM), conventional processors first attempt to retrieve instructions from the cache memory when the condition of a branch instruction requires an instruction which is not in the pipeline. If the needed instruction is not in the cache memory, only then does the processor retrieve the instruction from the external memory.

One problem with this approach is that when the needed instruction is not stored in the cache memory, the processor must waste at least one clock cycle to establish this fact. One solution to this problem is to simply increase the size of the cache memory, thereby increasing the likelihood that the needed instruction will be stored in the cache memory.

The primary drawback to this solution is that a larger cache memory increases the size, cost, and complexity of the resulting circuit. Thus, there is a need for technique which will minimize the number of clock cycles that are wasted when the needed instruction is not stored in the cache memory.

SUMMARY OF THE INVENTION

In a conventional pipelined microprocessor, a memory request is first made to a cache memory and then, if the request is not stored in the cache memory, to an external memory. Thus, when the request is not stored in the cache memory, several clock cycles can be lost by first accessing the cache memory rather than the external memory. In the present invention, these lost clock cycles are eliminated by utilizing a dual memory access circuit that makes a memory request to both the cache memory and an external memory controller during the same clock cycle when the bus that connects the external memory to the processor is available.

By making both memory requests during the same clock cycle rather than first accessing the cache memory, the cycle time lost when the request is not stored in the cache memory can be eliminated. The unneeded external memory requests that result each time the request is stored in the cache memory are eliminated by gating the request to the external memory controller with a logic signal output by the cache memory that indicates whether the request is present.

A dual memory access circuit in accordance with the present invention includes a memory access controller that monitors the availability of a system bus, and that outputs a first request in response to needed information when the system bus is unavailable. On the other hand, when the system bus is available, the memory access controller outputs the first request and a second request in response to the needed information. In addition, when both the first and second requests are output, the memory access controller also asserts a dual request signal. The dual memory access circuit further includes a cache controller that determines whether the needed information is stored in the cache memory and, if the needed information is found and the dual request signal is asserted, asserts a terminate request signal. An external memory controller gates the second request and the terminate request signal so that when the terminate request signal is asserted, the second request is gated out, and so that when the terminate request signal is deasserted, the second request signal is latched by the external memory controller.

A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description and accompanying drawings which set forth an illustrative embodiment in which the principals of the invention are utilized.

BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram illustrating the flow of an instruction through a conventional pipelined processor. FIG. 2 is a block diagram illustrating a dual memory access circuit 100 in accordance with the present invention.

FIG. 3 is a block diagram illustrating the operation of cache memory 120 and cache controller 130.

FIG. 4 is a timing diagram illustrating the dual memory request.

FIG. 5 is a block diagram illustrating the operation of DRAM controller 140. DETAILED DESCRIPTION FIG. 2 shows a block diagram of a dual memory access circuit 100 in accordance with the present invention. As shown in FIG. 2, circuit 100 includes a memory access controller 110 that controls memory requests to a cache memory 120 and an external memory to obtain instructions requested by the prefetch and execution stages of a pipelined microprocessor. Both cache memory 120 and the external memory, in turn, supply the requested instructions back to the prefetch stage of the microprocessor. (In the FIG. 2 embodiment, an eight-bit word is output by cache memory 120 whereas a 16-bit word is output by the external memory).

In the prefetch stage, a prefetch buffer stores the next sequence of instructions to be decoded. Memory access controller 110 monitors the status of the prefetch buffer and, when the prefetch buffer is less than full, requests the next instruction in the sequence from memory.

In accordance with the present invention, memory access controller 110 determines which memory the request will be directed to in response to the status of the system bus that connects the external memory to the processor. When the system bus is unavailable, memory access controller 110 requests the next instruction from cache memory 120 by outputting a requested address RADl, as defined by a prefetch instruction pointer, to a cache controller 130. Cache controller 130, in turn, determines whether the requested address RADl is stored in cache memory 120 by comparing the requested address RADl to the address tags stored in cache memory 120. - FIG. 3 shows a block diagram that illustrates the operation of cache memory 120 and cache controller 130. As shown in FIG. 3, cache memory 120, which is configured as a direct-mapped memory, includes a data RAM 122 that stores eight bytes of data in each line of memory (the number of bytes per line is arbitrary), and a tag RAM 124 that stores tag fields which identify the page of memory to which each line of memory corresponds. In addition, a valid bit RAM 126 indicates whether the information stored in data RAM 122 is valid. Although shown as a direct-mapped memory, those skilled in the art will appreciate that fully-associative and set-associative configurations can also be utilized.

As further shown in FIG. 3, the requested address RADl output from memory access controller 110 includes a page field that identifies the page of memory, a line field that identifies the line of the page, and a "don't care" byte position field. In operation, cache controller 130 utilizes the line field of the requested address RADl to identify the corresponding page field stored in tag RAM 124. The page field from tag RAM 124 is then compared to the page field of the requested address RADl to determine if the page fields match. If the page fields match and the corresponding bit in valid bit RAM 125 indicates that the information is valid, cache controller 130 asserts a cache hit signal CHS, which indicates a match, while cache memory 120 outputs the instruction associated with the requested address RADl. If, on the other hand, the page fields do not match, thereby indicating that the requested address RADl is not stored in cache memory 120, then the instruction associated with the requested address RADl must be retrieved from the external memory.

However, when the system bus is available, memory access controller 110 requests the next instruction from both cache memory 120 and the external memory during the same clock cycle. (In the FIG. 2 embodiment, the request is also output to an external bus controller which controls, among other things, the read-only-memory (ROM) and the disk drives). Controller 110 initiates the requests by outputting the requested address RADl to cache controller 130 while also outputting a requested address RAD2 to a dynamic random access memory (DRAM) controller 140. In most cases, the requested addresses RADl and RAD2 will be identical but may differ as required by the addressing scheme being utilized.

FIG. 4 shows a timing diagram that illustrates the dual memory request. As shown in FIG. 4, at substantially the same time that the requested addresses RADl and RAD2 are output, memory access controller 110 also asserts a dual address signal DAS which indicates that requests are going to both memories, and a bus request signal BRS which indicates that the requested address RAD2 is valid.

In the present invention, when cache controller 130 matches the requested address RADl with one of the address tags while the dual address signal DAS is asserted, cache controller 130 asserts a terminate address signal TAS at approximately the same time that the cache hit signal CHS is asserted. (See FIG. 3).

DRAM controller 140, in turn, gates the bus request signal BRS with the terminate address signal TAS to determine whether the external memory request should continue. FIG. 5 shows a block diagram that illustrates the operation of DRAM controller 140. As shown in FIG. 5, when the terminate address signal TAS is asserted, the bus request signal

BRS is gated out, thereby terminating the memory request made to the external memory. On the other hand, when the terminate address signal TAS is deasserted, the second requested address RAD2 is latched by DRAM controller 140 on the falling edge of the same system clock cycle that initiated the requested addresses RADl and RAD2. Cache controller 130 deasserts the cache hit signal CHS and the terminate address signal TAS when cache controller 130 determines that the requested address RADl does not match any of the address tags, i.e., the page fields do not match.

The present invention also follows the same approach with respect to instructions requested by the execution stage. In the execution stage, the outcome of a branch instruction frequently calls for an instruction which is not in the pipeline. In this situation, memory access controller 110 must request the instruction from either cache memory 120 or the external memory.

As above, when the system bus is available, memory access controller 110 asserts the dual address signal DAS and the bus request signal BRS, and outputs the requested addresses RADl and RAD2 to cache controller 130 and DRAM controller 140, respectively. On the other hand, if the system bus is unavailable, memory access controller 1 10 only outputs the requested address RADl to cache controller 130.

The advantage of simultaneously requesting the next instruction from both cache controller 130 and DRAM controller 140 when the bus is available is that each time the next instruction is absent from cache memory 120, the present invention saves the wasted clock cycle that is required to first check cache memory 120. This, in turn, improves the performance of the processor. If, on the other hand, the system bus is unavailable, memory access controller 110 must wait at least one clock cycle to access the system bus in any case. Thus, no cycle time can be saved when the bus is unavailable. As a result, controller 110 only outputs the requested address RADl to cache controller 130 when the bus is unavailable.

Terminating the bus request signal BRS when cache controller 130 determines that the page fields match also has the added advantage of decreasing the bus utilization by the processor. If the bus request signal BRS was not terminated each time the page fields match, the system bus would be tied up servicing an unneeded request, thereby limiting the utilization of the bus. In addition, the logic within the memory access controller 110 would have to be configured to wait for the unneeded instruction to return from the DRAM. Further, terminating the bus request signal BRS when cache controller 130 determines that the page fields match has the additional advantage of maintaining the page hit registers stored in DRAM controller 140. When a DRAM is accessed by DRAM controller 140, the address is divided into a row address which defines a page of memory, and a column address which defines the individual bytes stored in the page. Once the row address or page address is defined, other bytes of data on the same page can be quickly retrieved from the DRAM by changing the column address. If the row address is changed, however, the time required to access a particular byte is increased. Thus, by terminating the bus request signal BRS when controller 130 determines that the page fields match, the row address of the preceding external memory access can be maintained. As a result, by not changing the page hit registers, data can be taken from the same page much more quickly.

The invention embodiments described herein have been implemented in an integrated circuit which includes a number of additional functions and features which are described in the following co-pending, commonly assigned patent applications, the disclosure of each of which is incorporated herein by reference: U.S. patent application Serial No. 08/ , entitled "DISPLAY CONTROLLER

CAPABLE OF ACCESSING AN EXTERNAL MEMORY FOR GRAY SCALE MODULATION DATA" (atty. docket no. NSC 1-62700); U.S. patent application Serial No. 08/ , entitled "SERIAL INTERFACE CAPABLE OF OPERATING IN TWO DIFFERENT SERIAL DATA TRANSFER MODES" (atty. docket no. NSC 1-62800); U.S. patent application Serial No.

08/ , entitled "HIGH PERFORMANCE MULTIFUNCTION DIRECT MEMORY ACCESS

(DMA) CONTROLLER" (atty. docket no. NSC 1-62900); U.S. patent application Serial No. 08/ , entitled "OPEN DRAIN MULTI-SOURCE CLOCK GENERATOR HAVING MINIMUM PULSE WIDTH" (atty. docket no. NSC 1-63000); U.S. patent application Serial No.

08/ , entitled "INTEGRATED CIRCUIT WITH MULTIPLE FUNCTIONS SHARING

MULTIPLE INTERNAL SIGNAL BUSES ACCORDING TO DISTRIBUTED BUS ACCESS AND

CONTROL ARBITRATION" (atty. docket no. NSC 1-63100); U.S. patent application Serial No.

08/ , entitled "EXECUTION UNIT ARCHITECTURE TO SUPPORT x86 INSTRUCTION SET AND x86 SEGMENTED ADDRESSING" (atty. docket no. NSC 1-63300); U.S. patent application

Serial No. 08/ , entitled "BARREL SHIFTER" (atty. docket no. NSC1-63400); U.S. patent application Serial No. 08/ , entitled "BIT SEARCHING THROUGH 8, 16, OR 32-BIT

OPERANDS USING A 32-BIT DATA PATH" (atty. docket no. NSC 1-63500); U.S. patent application Serial No. 08/ , entitled "DOUBLE PRECISION (64-BIT) SHIFT OPERATIONS USING A 32-BIT DATA PATH" (atty. docket no. NSC 1-63600); U.S. patent application Serial No.

08/ , entitled "METHOD FOR PERFORMING SIGNED DIVISION" (atty. docket no.

NSC 1-63700); U.S. patent application Serial No. 08/ , entitled "METHOD FOR

PERFORMING ROTATE THROUGH CARRY USING A 32-BIT BARREL SHIFTER AND COUNTER" (atty. docket no. NSC 1-63800); U.S. patent application Serial No. 08/ , entitled "AREA AND TIME EFFICIENT FIELD EXTRACTION CIRCUIT" (atty. docket no. NSC 1-63900);

U.S. patent application Serial No. 08/ , entitled "NON-ARITHMETICAL CIRCULAR

BUFFER CELL AVAILABILITY STATUS INDICATOR CIRCUIT" (atty. docket no. NSC 1-64000);

U.S. patent application Serial No. 08/ , entitled "TAGGED PREFETCH AND

INSTRUCTION DECODER FOR VARIABLE LENGTH INSTRUCTION SET AND METHOD OF OPERATION" (atty. docket no. NSC 1-64100); U.S. patent application Serial No. 08/ , entitled "PARTITIONED DECODER CIRCUIT FOR LOW POWER OPERATION" (atty. docket no.

NSC 1-64200); U.S. patent application Serial No. 08/ , entitled "CIRCUIT FOR

DESIGNATING INSTRUCTION POINTERS FOR USE BY A PROCESSOR DECODER" (atty. docket no. NSC1-64300); U.S. patent application Serial No. 08/ , entitled "CIRCUIT FOR GENERATING A DEMAND-BASED GATED CLOCK" (atty. docket no. NSC 1-64500); U.S. patent application Serial No. 08/ , entitled "INCREMENTOR/DECREMENTOR" (atty. docket no.

NSC 1-64700); U.S. patent application Serial No. 08/ , entitled "A PIPELINED

MICROPROCESSOR THAT PIPELINES MEMORY REQUESTS TO AN EXTERNAL MEMORY" (atty. docket no. NSC 1-64800); U.S. patent application Serial No. 08/ , entitled "CODE BREAKPOINT DECODER" (atty. docket no. NSC 1-64900); U.S. patent application Serial No.

08/ , entitled "TWO TIER PREFETCH BUFFER STRUCTURE AND METHOD WITH

BYPASS" (atty. docket no. NSC 1-65000); U.S. patent application Serial No. 08/ , entitled

"INSTRUCTION LIMIT CHECK FOR MICROPROCESSOR" (atty. docket no. NSC1-65100); U.S. patent application Serial No. 08/ , entitled "A PIPELINED MICROPROCESSOR THAT

MAKES MEMORY REQUESTS TO A CACHE MEMORY AND AN EXTERNAL MEMORY CONTROLLER DURING THE SAME CLOCK CYCLE" (atty. docket no. NSC 1-65200); U.S. patent application Serial No. 08/ , entitled "APPARATUS AND METHOD FOR EFFICIENT COMPUTATION OF A 486™ MICROPROCESSOR COMPATIBLE POP INSTRUCTION" (atty. docket no. NSCl- 5700); U.S. patent application Serial No. 08/ , entitled "APPARATUS

AND METHOD FOR EFFICIENTLY DETERMINING ADDRESSES FOR MISALIGNED DATA STORED IN MEMORY" (atty. docket no. NSC 1-65800); U.S. patent application Serial No. 08/ , entitled "METHOD OF IMPLEMENTING FAST 486™ MICROPROCESSOR COMPATIBLE STRING OPERATION" (atty. docket no. NSC 1-65900); U.S. patent application Serial

No. 08/ , entitled "A PIPELINED MICROPROCESSOR THAT PREVENTS THE CACHE

FROM BEING READ WHEN THE CONTENTS OF THE CACHE ARE INVALID" (atty. docket no.

NSC 1-66000); U.S. patent application Serial No. 08/ , entitled "DRAM CONTROLLER

THAT REDUCES THE TIME REQUIRED TO PROCESS MEMORY REQUESTS" (atty. docket no. NSC 1-66300); U.S. patent application Serial No. 08/ , entitled "INTEGRATED PRIMARY

BUS AND SECONDARY BUS CONTROLLER WITH REDUCED PIN COUNT" (atty. docket no.

NSC 1-66400); U.S. patent application Serial No. 08/ , entitled "SUPPLY AND INTERFACE

CONFIGURABLE INPUT/OUTPUT BUFFER" (atty. docket no. NSC 1-66500); U.S. patent application Serial No. 08/ , entitled "CLOCK GENERATION CIRCUIT FOR A DISPLAY CONTROLLER HAVING A FINE TUNEABLE FRAME RATE" (atty. docket no. NSC 1-66600); U.S. patent application Serial No. 08/ , entitled "CONFIGURABLE POWER MANAGEMENT

SCHEME" (atty. docket no. NSC 1-66700); U.S. patent application Serial No. 08/ , entitled

"BIDIRECTIONAL PARALLEL SIGNAL INTERFACE" (atty. docket no. NSC 1-67000); U.S. patent application Serial No. 08/ , entitled "LIQUID CRYSTAL DISPLAY (LCD) PROTECTION CIRCUIT" (atty. docket no. NSC 1-67100); U.S. patent application Serial No. 08/ , entitled

"IN-CIRCUIT EMULATOR STATUS INDICATOR CIRCUIT" (atty. docket no. NSC 1-67400); U.S. patent application Serial No. 08/ , entitled "DISPLAY CONTROLLER CAPABLE OF

ACCESSING GRAPHICS DATA FROM A SHARED SYSTEM MEMORY" (atty. docket no.

NSC 1-67500); U.S. patent application Serial No. 08/ , entitled "INTEGRATED CIRCUIT WITH TEST SIGNAL BUSES AND TEST CONTROL CIRCUITS" (atty. docket no. NSC 1-67600);

U.S. patent application Serial no. 08/ , entitled "DECODE BLOCK TEST METHOD AND

APPARATUS" (atty. docket no. NSC 1-68000).

It should be understood that various alternatives to the embodiment of the invention described herein may be employed in practicing the invention. Thus, it is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

WHAT IS CLAIMED IS:

1. A dual memory access circuit of a pipelined processor that requests information from a plurality of memories, the dual access memory circuit being connected to an external memory via a system bus and responding to information requests generated by a prefetch stage and an execution stage of the pipelined processor, the dual access memory circuit comprising: a memory access controller that monitors availability of the system bus, outputs a first request in response to needed information when the system bus is unavailable, outputs the first request and a second request within a predetermined time in response to the needed information when the bus is available, and asserts a dual request signal and a bus request signal within the predetermined time when both the first and second requests are output; a cache controller that asserts a terminate request signal within the predetermined time when the first request is stored in a cache memory and the dual request signal is asserted; and an external memory controller that terminates the bus request signal when the terminate request signal is asserted within the predetermined time, thereby preventing an external memory access, and outputs a bus address in response to the second request when the terminate request signal is deasserted.

2. The processor of claim 1 wherein the first request includes an address that includes a page field.

3. The processor of claim 2 wherein the second request includes an address.

4. The processor of claim 3 wherein the cache controller determines whether the first request is stored in the cache memory by comparing the page field to a tag address stored in the cache memory.

5. The processor of claim 3 wherein the address of the first request and the address of the second request are identical.

6. The processor of claim 1 wherein the predetermined time is equal to or less than a clock period.

7. The processor of claim 1 wherein the predetermined time is one-half a clock period.

8. A method for requesting information from a plurality of memories by a dual memory access circuit of a pipelined processor, the dual access memory circuit being connected to an external memory via a system bus and responding to information requests generated by a prefetch stage and an execution stage of the pipelined processor, the method comprising the steps of: monitoring availability of the system bus; outputting a first request in response to needed information when the system bus is unavailable; outputting the first request and a second request within a predetermined time in response to the needed information when the bus is available; asserting a dual request signal and a bus request signal within the predetermined time when both the first and second requests are output; asserting a terminate request signal within the predetermined time when the first request is stored in a cache memory and the dual request signal is asserted; and terminating the bus request signal when the terminate request signal is asserted within the predetermined time, thereby preventing an external memory access.

9. The method of claim 8 wherein the first request includes an address that includes a page field.

10. The method of claim 9 wherein the second request includes an address.

11. The method of claim 10 wherein the first request is determined to be stored in the cache memory by comparing the page field to a tag address stored in the cache memory.

12. The method of claim 10 wherein the address of the first request and the address of the second request are identical.

13. The method of claim 8 wherein the predetermined time is equal to or less than a clock period.

14. The method of claim 8 wherein the predetermined time is one-half a clock period.