WO1998002797A1

WO1998002797A1 - Method and apparatus for predecoding variable byte-length instructions within a superscalar microprocessor

Info

Publication number: WO1998002797A1
Application number: PCT/US1996/011757
Authority: WO
Inventors: Thang M. Tran
Original assignee: Advanced Micro Devices, Inc.
Priority date: 1996-07-16
Filing date: 1996-07-16
Publication date: 1998-01-22
Also published as: EP0912923A1; JP3732233B2; JP2000515274A

Abstract

A superscalar microprocessor is provided that includes a predecode unit configured to predecode variable byte-length instructions prior to their storage within an instruction cache. The predecode unit is configured to generate a plurality of predecode bits for each instruction byte. The plurality of predecode bits associated with each instruction byte are collectively referred to as a predecode tag. An instruction alignment unit then uses the predecode tags to dispatch the variable byte-length instructions simultaneously to a plurality of decode units which form fixed issue positions within the superscalar microprocessor. With the information conveyed by the functional bits, the decode units can detect the exact locations of the opcode, displacement, immediate, register, and scale-index bytes. Accordingly, no serial scan by the decode units through the instruction bytes is needed. In addition, the functional bits allow the decode units to calculate linear addresses (via adder circuits) expeditiously for use by other subunits within the superscalar microprocessor. Accordingly, relatively fast decoding may be attained, and high performance may be accommodated.

Description

TITLE: METHOD AND APPARATUS FOR PREDECODING VARIABLE BYTE-LENGTH INSTRUCTIONS WITHIN A SUPERSCALAR MICROPROCESSOR

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to superscalar microprocessors and more particularly to the predecoding of variable byte-length computer instructions within high performance and high frequency superscalar microprocessors.

2. Description of the Relevant Art

Superscalar microprocessors are capable of attaining performance characteristics which surpass those of conventional scalar processors by allowing the concurrent execution of multiple instructions. Due to the widespread acceptance of the x86 family of microprocessors, efforts have been undertaken by microprocessor manufacturers to develop superscalar microprocessors which execute x86 instructions. Such superscalar microprocessors achieve relatively high performance characteristics while advantageously maintaining backwards compatibility with the vast amount of existing software developed for previous microprocessor generations such as the 8086, 80286, 80386, and 80486.

The x86 instruction set is relatively complex and is characterized by a plurality of variable byte-length instructions. A generic format illustrative of the x86 instruction set is shown in Figure 1A. As illustrated in the figure, an x86 instruction consists of from one to five optional prefix bytes 102, followed by an operation code (opcode) field 104, an optional addressing mode (Mod R/M) byte 106, an optional scale-index-base (SIB) byte 108, an optional displacement field 110, and an optional immediate data field 112.

The opcode field 104 defines the basic operation for a particular instruction. The default operation of a particular opcode may be modified by one or more prefix bytes. For example, a prefix byte may be used to change the address or operand size for an instruction, to override the default segment used in memory addressing, or to instruct the processor to repeat a string operation a number of times. The opcode field 104 follows the prefix bytes 102, if any, and may be one or two bytes in length. The addressing mode (MODRM) byte 106 specifies the registers used as well as memory addressing modes. The scale-index-base (SIB) byte 108 is used only in 32-bit base-relative addressing using scale and index factors. A base field of the SIB byte specifies which register contains the base value for the address calculation, and an index field specifies which register contains the index value. A scale field specifies the power of two by which the index value will be multiplied before being added, along with any displacement, to the base value. The next instruction field is the optional displacement field 110, which may be from one to four bytes in length. The displacement field 110 contains a constant used in address calculations. The optional immediate field 112, which may also be from one to four bytes in length, contains a constant used as an instruction operand. The 80286 sets a maximum length for an instruction at 10 bytes, while the 80386 and 80486 both allow instruction lengths of up to 15 bytes.
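The SIB address calculation described above can be illustrated with a minimal sketch. The function name and its parameters are hypothetical, and the wraparound to 32 bits is an assumption made here for 32-bit addressing rather than something the text states.

```python
def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """Return base + (index * 2**scale) + displacement, wrapped to 32 bits.

    `scale` is the power of two applied to the index value. The x86 SIB
    scale field is two bits wide, so only 0..3 are accepted here.
    """
    if not 0 <= scale <= 3:
        raise ValueError("scale field is two bits wide: expected 0..3")
    # Assumption of this sketch: addresses wrap at 32 bits.
    return (base + (index << scale) + displacement) & 0xFFFFFFFF


# Example: base 0x1000, index 3, scale 2 (factor of 4), displacement 8.
print(hex(effective_address(0x1000, 3, 2, 8)))  # 0x1014
```

The scale is passed as an exponent rather than a multiplier because that is how the scale field is described: a power of two applied to the index.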

Referring now to Figure 1B, several different variable byte-length x86 instruction formats are shown. The shortest x86 instruction is only one byte long, and comprises a single opcode byte as shown in format (a). For certain instructions, the byte containing the opcode field also contains a register field as shown in formats (b), (c) and (e). Format (j) shows an instruction with two opcode bytes. An optional MODRM byte follows opcode bytes in formats (d), (f), (h), and (j). Immediate data follows opcode bytes in formats (e), (g), (i), and (k), and follows a MODRM byte in formats (f) and (h). Figure 1C illustrates several possible addressing mode formats (a)-(h). Formats (c), (d), (e), (g), and (h) contain MODRM bytes with offset (i.e., displacement) information. An SIB byte is used in formats (f), (g), and (h).

The complexity of the x86 instruction set poses difficulties in implementing high performance x86 compatible superscalar microprocessors. One difficulty arises from the fact that instructions must be aligned with respect to the parallel-coupled instruction decoders of such processors before proper decode can be effectuated. In contrast to most RISC instruction formats, since the x86 instruction set consists of variable byte-length instructions, the start bytes of successive instructions within a line are not necessarily equally spaced, and the number of instructions per line is not fixed. As a result, employment of simple fixed-length shifting logic cannot in itself solve the problem of instruction alignment.

Superscalar microprocessors have been proposed that employ instruction predecoding techniques to help solve the problem of quickly aligning, decoding and executing a plurality of variable byte-length instructions in parallel. In one such superscalar microprocessor, when instructions are written within the instruction cache from an external main memory, a predecoder appends several predecode bits (referred to collectively as a predecode tag) to each byte. These bits indicate whether the byte is the start and/or end byte of an x86 instruction, the number of microinstructions required to implement the x86 instruction, and the location of opcodes and prefixes. After instructions are fetched from the cache, the superscalar microprocessor converts each instruction to one or more microinstructions referred to as ROPS. The ROPS are similar to RISC instructions in that they are associated with a fixed length and with simple, consistent encodings. Since the x86 instructions in the instruction cache are already tagged with predecode bits indicating where instructions start and end and how many ROPS each needs, it is a relatively simple task for the byte queue to locate instruction boundaries, to translate each x86 instruction to one or more ROPS, and to provide a fixed number of ROPS to parallel instruction decoders.

Although the predecoding technique described above has been largely successful, over fifty percent of the available storage space within the instruction cache array must be allocated for the predecode bits. This accordingly limits the amount of storage within the instruction cache for instruction code and/or increases the cost of the processor due to increased die size.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a method for predecoding variable byte-length instructions in accordance with the present invention. In one embodiment, a predecode unit is provided which is capable of predecoding variable byte-length instructions prior to their storage within an instruction cache. The predecode unit is configured to generate a plurality of predecode bits for each instruction byte. The plurality of predecode bits associated with each instruction byte are collectively referred to as a predecode tag. An instruction alignment unit then uses the predecode tags to dispatch the variable byte-length instructions to a plurality of decode units which form fixed issue positions within the superscalar microprocessor.

In one implementation, the predecode unit generates three predecode bits associated with each byte of instruction code: a "start" bit, an "end" bit, and a "functional" bit. The start bit is set if the associated byte is the first byte of the instruction. Similarly, the end bit is set if the byte is the last byte of the instruction.

Rather than associating a dedicated meaning to the functional bit, the predecode unit is configured such that the meaning conveyed by or associated with the functional bit is dependent both upon its state (i.e., whether the functional bit is set or not) and upon the state of the start bit for that byte. The meaning of the functional bit may further be dependent upon the status of the start bit of a previous instruction byte.

For example, in one implementation, if the start bit for a particular byte is set, the functional bit indicates whether the instruction is a directly decodeable "fast path" instruction or is an MROM instruction (i.e., an instruction to be serialized through microcode). On the other hand, if the start bit for a particular byte is cleared and if the byte immediately follows a start byte (i.e., an instruction byte whose start bit is set), the functional bit indicates whether the opcode is the first byte of the instruction or whether a prefix is the first byte of the instruction. If the start bit for the byte is cleared and the byte does not follow a start byte, the functional bit indicates whether the associated byte is either a MODRM or an SIB byte, or is displacement or immediate data.

By utilizing the predecode information from the predecode unit, the instruction alignment unit may be implemented with a relatively small number of cascaded levels of logic gates, thus accommodating very high frequencies of operation. Instruction alignment to decode units may further be accomplished with relatively few pipeline stages. In addition, the plurality of decode units to which the variable byte-length instructions are aligned utilize the predecode tags to attain relatively fast decoding of the instructions. Finally, since the predecode unit is configured such that the meaning of the functional bit of a particular predecode tag is dependent upon the status of the start bit, a relatively large amount of predecode information may be conveyed with a relatively small number of predecode bits. This thereby allows a reduction in the size of the instruction cache without compromising performance.

Furthermore, with the information conveyed by the functional bits, the decode units know the exact locations of the opcode, displacement, immediate, register, and scale-index bytes. Accordingly, no serial scan by the decode units through the instruction bytes is needed. In addition, the functional bits allow the decode units to calculate the 8-bit linear addresses (via adder circuits) expeditiously for use by other subunits within the superscalar microprocessor. Accordingly, relatively fast decoding may be attained, and high performance may be accommodated.

Broadly speaking, the present invention contemplates a method for predecoding variable byte-length instructions within a superscalar microprocessor comprising the steps of generating a start bit indicative of whether a byte of an instruction is a start byte, generating an end bit indicative of whether said byte of said instruction is an end byte, and generating a functional bit that conveys a meaning dependent upon a value of

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

Figure 1A is a diagram which illustrates the generic x86 instruction set format.

Figure 1B is a diagram which illustrates several different variable byte-length x86 instruction formats.

Figure 1C is a diagram which illustrates several possible x86 addressing mode formats.

Figure 2 is a block diagram of a superscalar microprocessor which includes an instruction alignment unit to forward multiple instructions to six decode units.

Figure 3 is a block diagram of the instruction alignment unit and six decode units.

Figures 4A-4C are block diagrams which depict execution of an MROM instruction.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Referring next to Figure 2, a block diagram of a superscalar microprocessor 200 including a predecode unit 202 which operates in accordance with a method of the present invention is shown. As illustrated in the embodiment of Figure 2, superscalar microprocessor 200 includes a predecode unit 202 and a branch prediction unit 220 coupled to an instruction cache 204. A prefetch unit 203 is coupled to predecode unit 202. An instruction alignment unit 206 is coupled between instruction cache 204 and a plurality of decode units 208A-208F (referred to collectively as decode units 208). Each decode unit 208A-208F is coupled to a respective reservation station 210A-210F (referred to collectively as reservation stations 210), and each reservation station 210A-210F is coupled to a respective functional unit 212A-212F (referred to collectively as functional units 212). Decode units 208, reservation stations 210, and functional units 212 are further coupled to a reorder buffer 216, a register file 218 and a load/store unit 222. A data cache 224 is finally shown coupled to load/store unit 222, and an MROM unit 209 is shown coupled to instruction alignment unit 206.

Generally speaking, instruction cache 204 is a high speed cache memory provided to temporarily store instructions prior to their dispatch to decode units 208. In one embodiment, instruction cache 204 is configured to cache up to 32 kilobytes of instruction code organized in lines of 16 bytes each (where each byte consists of 8 bits). During operation, instruction code is provided to instruction cache 204 by prefetching code from a main memory (not shown) through prefetch unit 203. For each byte of instruction code, instruction cache 204 further stores a predecode tag associated therewith. It is noted that instruction cache 204 could be implemented in a set-associative, a fully-associative, or a direct-mapped configuration.

Prefetch unit 203 is provided to prefetch instruction code from the main memory for storage within instruction cache 204. In one embodiment, prefetch unit 203 is configured to burst 64-bit wide code from the main memory into instruction cache 204. It is understood that a variety of specific code prefetching techniques and algorithms may be employed by prefetch unit 203.

As prefetch unit 203 fetches instructions from the main memory, predecode unit 202 generates three predecode bits associated with each byte of instruction code: a "start" bit, an "end" bit, and a "functional" bit. The start bit as well as the end bit of each byte are indicative of the boundaries of an instruction. The functional bit of each byte conveys additional information regarding the byte or the instruction, such as whether the instruction can be decoded directly by decode units 208 or whether the instruction must be executed by invoking a microcode procedure controlled by MROM unit 209 (as will be described in greater detail below), whether the byte is a MODRM or SIB byte, or whether the byte is displacement or immediate data. The functional bit may further be employed to indicate the location of an opcode byte. It will be appreciated from the following that the encoded meaning of the functional bit of a particular instruction byte is dependent upon the associated start bit.

Table 1 indicates one encoding of the predecode tags as implemented by predecode unit 202. As indicated within the table, if a given byte is the first byte of an instruction, the start bit for that byte is set by predecode unit 202 as the byte is fetched from main memory and stored within instruction cache 204. If the byte is the last byte of an instruction, the end bit for that byte is set. If a particular instruction cannot be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is set. On the other hand, if the instruction can be directly decoded by the decode units 208, the functional bit associated with the first byte of the instruction is cleared. The functional bit for the second byte of a particular instruction is cleared if the opcode is the first byte, and is set if the opcode is the second byte. It is noted that in situations where the opcode is the second byte, the first byte is a prefix byte. The functional bit values for instruction byte numbers 3-8 indicate whether the byte is a MODRM or an SIB byte, as well as whether the byte contains displacement or immediate data.

Table 1. Encoding of Start, End and Functional Bits

Instr. Byte   Start Bit   End Bit   Functional
Number        Value       Value     Bit Value    Meaning
-----------   ---------   -------   ----------   -------
1             1           X         0            Fast decode
1             1           X         1            MROM instr.
2             0           X         0            Opcode is first byte
2             0           X         1            Opcode is this byte, first byte is prefix
3-8           0           X         0            MODRM or SIB byte
3-8           0           X         1            Displacement or immediate data; the second functional bit set in bytes 3-8 indicates immediate data
1-8           X           0         X            Not last byte of instruction
1-8           X           1         X            Last byte of instruction

In accordance with Table 1 above, it is noted that the predecode unit 202 of superscalar microprocessor 200 is configured to generate a functional bit for each byte of instruction code. The meaning of the functional bit is dependent upon the value of the start bit associated with that byte. For the encoding scheme illustrated in Table 1, the meaning of the functional bit is further dependent upon the value of the start bit associated with a previous instruction byte.
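The encoding of Table 1 can be sketched in software as follows. This is only an illustration under stated assumptions: the names `Tag` and `predecode` are hypothetical, the caller is assumed to already know the role of each byte, at most one prefix byte is handled, and the distinction between displacement and immediate data in bytes 3-8 is not modelled.

```python
from typing import List, NamedTuple


class Tag(NamedTuple):
    """One predecode tag: start, end and functional bits for one byte."""
    start: int
    end: int
    functional: int


def predecode(kinds: List[str], is_mrom: bool = False) -> List[Tag]:
    """Build Table 1 style tags for one instruction.

    `kinds` names the role of each byte in order, using the hypothetical
    labels "prefix", "opcode", "modrm", "sib", "disp" and "imm".
    """
    if not kinds:
        raise ValueError("an instruction has at least one byte")
    last = len(kinds) - 1
    tags = []
    for i, kind in enumerate(kinds):
        if i == 0:
            # Byte 1: functional bit separates fast path (0) from MROM (1).
            functional = 1 if is_mrom else 0
        elif i == 1:
            # Byte 2: set when the opcode is this byte (byte 1 is a prefix).
            functional = 1 if kinds[0] == "prefix" else 0
        elif kind in ("modrm", "sib"):
            functional = 0
        elif kind in ("disp", "imm"):
            functional = 1
        else:
            raise ValueError(f"byte {i + 1} of kind {kind!r} is not covered by this sketch")
        tags.append(Tag(start=int(i == 0), end=int(i == last), functional=functional))
    return tags


# Opcode, MODRM, SIB and one displacement byte, fast path instruction.
for tag in predecode(["opcode", "modrm", "sib", "disp"]):
    print(tag)
```

A one-byte instruction receives a single tag with both the start and end bits set, which matches the two rows of the table that apply to bytes 1-8.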

For the specific implementation described above, it will be appreciated that the functional bit indicates whether the instruction is a directly decodeable instruction or an MROM instruction (described further below) if the start bit for that byte is set. If the start bit associated with a particular byte of instruction code is cleared and immediately follows a byte of instruction code in which the start bit was set, the functional bit indicates whether the opcode is the first byte or whether a prefix is the first byte. Still further, if the start bit for a byte of instruction code is cleared and the previous byte's start bit was also cleared, the functional bit indicates whether the byte is a MODRM or SIB byte, or whether the byte is displacement or immediate data. For subsequent bytes within a particular instruction, the second functional bit set in bytes 3-8 indicates immediate data.
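The three-way interpretation described above can be written as a small lookup. This is a minimal sketch; the function name `functional_meaning` and the returned strings are hypothetical, and the rule that a second set functional bit in bytes 3-8 marks immediate data is left out.

```python
def functional_meaning(start: int, prev_start: int, functional: int) -> str:
    """Interpret a functional bit from its own start bit and the
    start bit of the immediately preceding byte."""
    if start:
        # First byte of an instruction.
        return "MROM instruction" if functional else "fast path instruction"
    if prev_start:
        # Second byte of an instruction.
        if functional:
            return "opcode is this byte; first byte is a prefix"
        return "opcode is the first byte"
    # Third or later byte of an instruction.
    return "displacement or immediate data" if functional else "MODRM or SIB byte"


# Walk the (start, functional) pairs of a four-byte instruction.
pairs = [(1, 0), (0, 0), (0, 0), (0, 1)]
prev_start = 0
for start, functional in pairs:
    print(functional_meaning(start, prev_start, functional))
    prev_start = start
```

Only two start bits and the functional bit are consulted for each byte, so each byte can be classified without scanning through the bytes before it.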

In accordance with the predecode scheme employed by the superscalar microprocessor 200 as described above, a predecode tag is generated which is associated with each byte of instruction code. Both predecode tags and the instruction code are stored within instruction cache 204 for subsequent processing by the superscalar microprocessor. Since the meaning of the functional bit is dependent upon the start bit of a particular byte and upon the start bits of previous bytes, a relatively large amount of predecode information can be conveyed to the instruction alignment unit 206 and to decode units 208 to attain relatively fast alignment and decode of instructions. Since the number of bits required within the predecode tag is relatively small, the required size of the instruction cache 204 may be reduced without compromising performance.
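As a rough worked example of the storage involved, the following sketch combines the 32 kilobyte, 16-byte-line cache embodiment with the three-bit predecode tag. The helper name `predecode_tag_storage` is hypothetical, and the calculation counts only instruction bytes and predecode bits; address tags and other array overheads are ignored.

```python
def predecode_tag_storage(cache_bytes: int, tag_bits_per_byte: int) -> int:
    """Return the storage, in bytes, taken by predecode tags."""
    return cache_bytes * tag_bits_per_byte // 8


cache_bytes = 32 * 1024          # 32 kilobytes of instruction code
line_bytes = 16                  # bytes per cache line
tag_bits = 3                     # start, end and functional bits

lines = cache_bytes // line_bytes
tag_bytes = predecode_tag_storage(cache_bytes, tag_bits)
share = tag_bits / (8 + tag_bits)  # share of each stored byte-plus-tag

print(lines)            # 2048
print(tag_bytes)        # 12288
print(f"{share:.1%}")   # 27.3%
```

Under these assumptions the tags occupy 12 kilobytes alongside the 32 kilobytes of instruction code.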

Furthermore, with the information conveyed by the functional bits, the decode units know the exact locations of the opcode, displacement, immediate, register, and scale-index bytes. Accordingly, no serial scan by the decode units through the instruction bytes is needed. In addition, the functional bits allow the decode units to calculate the 8-bit linear addresses (via adder circuits) expeditiously for use by other subunits within the superscalar microprocessor. Accordingly, relatively fast decoding may be attained, and high performance may be accommodated.

As stated previously, in one embodiment certain instructions within the x86 instruction set may be directly decoded by decode unit 208. These instructions are referred to as "fast path" instructions. The remaining instructions of the x86 instruction set are referred to as "MROM instructions". MROM instructions are executed by invoking MROM unit 209. When an MROM instruction is encountered, MROM unit 209 parses and serializes the instruction into a subset of defined fast path instructions to effectuate a desired operation. A listing of exemplary x86 instructions categorized as fast path instructions, as well as a description of the manner of handling both fast path and MROM instructions, will be provided further below.

Instruction alignment unit 206 is provided to channel or "funnel" variable byte-length instructions from instruction cache 204 to fixed issue positions formed by decode units 208A-208F. As will be described in conjunction with Figures 3-5, instruction alignment unit 206 is configured to channel instruction code to designated decode units 208A-208F depending upon the locations of the start bytes of instructions within a line as delineated by instruction cache 204. In one embodiment, the particular decode unit 208A-208F to which a given instruction may be dispatched is dependent upon both the location of the start byte of that instruction as well as the location of the previous instruction's start byte, if any. Instructions starting at certain byte locations may further be restricted for issue to only one predetermined issue position. Specific details follow.

Before proceeding with a description of the alignment of instructions from instruction cache 204 to decode units 208, general aspects regarding other subsystems employed within the exemplary superscalar microprocessor 200 of Figure 2 will be described. For the embodiment of Figure 2, each of the decode units 208 includes decoding circuitry for decoding the predetermined fast path instructions referred to above. In addition, each decode unit 208A-208F routes displacement and immediate data to a corresponding reservation station unit 210A-210F. Output signals from the decode units 208 include bit-encoded execution instructions for the functional units 212 as well as operand address information, immediate data and/or displacement data.

The superscalar microprocessor of Figure 2 supports out of order execution, and thus employs reorder buffer 216 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution and branch misprediction recovery, and to facilitate precise exceptions. As will be appreciated by those of skill in the art, a temporary storage location within reorder buffer 216 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. Reorder buffer 216 may be implemented in a first-in-first-out configuration wherein speculative results move to the "bottom" of the buffer as they are validated and written to the register file, thus making room for new entries at the "top" of the buffer. Other specific configurations of reorder buffer 216 are also possible, as will be described further below. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 218.

The bit-encoded execution instructions and immediate data provided at the outputs of decode units 208A-208F are routed directly to respective reservation station units 210A-210F. In one embodiment, each reservation station unit 210A-210F is capable of holding instruction information (i.e., bit-encoded execution bits as well as operand values, operand tags and/or immediate data) for up to three pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of Figure 2, each decode unit 208A-208F is associated with a dedicated reservation station unit 210A-210F, and that each reservation station unit 210A-210F is similarly associated with a dedicated functional unit 212A-212F.

Accordingly, six dedicated "issue positions" are formed by decode units 208, reservation station units 210 and functional units 212. Instructions aligned and dispatched to issue position 0 through decode unit 208A are passed to reservation station unit 210A and subsequently to functional unit 212A for execution. Similarly, instructions aligned and dispatched to decode unit 208B are passed to reservation station unit 210B and into functional unit 212B, and so on.

Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 216 and register file 218 simultaneously. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP), as will be described further below. Reorder buffer 216 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 216 is reserved for each instruction which, upon decode, modifies the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 216 may have one or more locations which contain the speculatively executed contents of a given register. If following decode of a given instruction it is determined that reorder buffer 216 has previous location(s) assigned to a register used as an operand in the given instruction, the reorder buffer 216 forwards to the corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous instruction. If the reorder buffer has a location reserved for a given register, the operand value (or tag) is provided from reorder buffer 216 rather than from register file 218. If there is no location reserved for a required register in reorder buffer 216, the value is taken directly from register file 218. If the operand corresponds to a memory location, the operand value is provided to the reservation station unit through load/store unit 222.
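The register operand lookup described above can be sketched as follows. The names `RobEntry` and `read_operand` are hypothetical, the reorder buffer is modelled as a plain list ordered from oldest to newest allocation, and memory operands (which go through the load/store unit) are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class RobEntry:
    tag: int                     # identifies this reorder buffer location
    dest: str                    # real register the instruction will update
    value: Optional[int] = None  # None until a functional unit produces it


def read_operand(reg: str, rob: List[RobEntry],
                 register_file: Dict[str, int]) -> Tuple[str, int]:
    """Return ("value", v) or ("tag", t) for a register operand."""
    # Search newest-first so the most recently assigned location wins.
    for entry in reversed(rob):
        if entry.dest == reg:
            if entry.value is not None:
                return ("value", entry.value)
            return ("tag", entry.tag)
    # No location reserved in the reorder buffer: use the register file.
    return ("value", register_file[reg])


regs = {"EAX": 1, "EBX": 2}
rob = [RobEntry(tag=0, dest="EAX", value=10), RobEntry(tag=1, dest="EAX")]
print(read_operand("EAX", rob, regs))  # ('tag', 1)
print(read_operand("EBX", rob, regs))  # ('value', 2)
```

In the example, the older speculative value of EAX (10) is skipped because a newer location has been assigned to EAX whose result is still pending, so a tag is returned.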

Details regarding suitable reorder buffer implementations may be found within the publication "Superscalar Microprocessor Design" by Mike Johnson, Prentice-Hall, Englewood Cliffs, New Jersey, 1991, and within the co-pending, commonly assigned patent application entitled "High Performance Superscalar Microprocessor", Serial No. 08/146,382, filed October 29, 1993 by Witt, et al. These documents are incorporated herein by reference in their entirety.

Reservation station units 210A-210F are provided to temporarily store instruction information to be speculatively executed by the corresponding functional units 212A-212F. As stated previously, each reservation station unit 210A-210F may store instruction information for up to three pending instructions. Each of the six reservation stations 210A-210F contains locations to store bit-encoded execution instructions to be speculatively executed by the corresponding functional unit and the values of operands. If a particular operand is not available, a tag for that operand is provided from reorder buffer 216 and is stored within the corresponding reservation station until the result has been generated (i.e., by completion of the execution of a previous instruction). It is noted that when an instruction is executed by one of the functional units 212A-212F, the result of that instruction is passed directly to any reservation station units 210A-210F that are waiting for that result at the same time the result is passed to update reorder buffer 216 (this technique is commonly referred to as "result forwarding"). Instructions are issued to functional units for execution after the values of any required operand(s) are made available. That is, if an operand associated with a pending instruction within one of the reservation station units 210A-210F has been tagged with a location of a previous result value within reorder buffer 216 which corresponds to an instruction which modifies the required operand, the instruction is not issued to the corresponding functional unit 212 until the operand result for the previous instruction has been obtained. Accordingly, the order in which instructions are executed may not be the same as the order of the original program instruction sequence. Reorder buffer 216 ensures that data coherency is maintained in situations where read-after-write dependencies occur.
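The behaviour of a single reservation station with result forwarding can be sketched as follows. The class and field names are hypothetical, and the policy of issuing the oldest ready entry first is an assumption of this sketch rather than something stated in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Pending:
    op: str
    values: Dict[str, int] = field(default_factory=dict)  # operands already known
    tags: Dict[str, int] = field(default_factory=dict)    # operand -> awaited tag


class ReservationStation:
    CAPACITY = 3  # up to three pending instructions, per the embodiment

    def __init__(self) -> None:
        self.entries: List[Pending] = []

    def insert(self, pending: Pending) -> bool:
        """Accept an instruction if a slot is free."""
        if len(self.entries) >= self.CAPACITY:
            return False
        self.entries.append(pending)
        return True

    def forward(self, tag: int, value: int) -> None:
        """Result forwarding: fill every operand waiting on `tag`."""
        for entry in self.entries:
            for name, awaited in list(entry.tags.items()):
                if awaited == tag:
                    entry.values[name] = value
                    del entry.tags[name]

    def issue(self) -> Optional[Pending]:
        """Issue the oldest entry whose operands are all available."""
        for i, entry in enumerate(self.entries):
            if not entry.tags:
                return self.entries.pop(i)
        return None


rs = ReservationStation()
rs.insert(Pending(op="add", values={"a": 5}, tags={"b": 7}))
print(rs.issue())        # None: operand b is still awaited
rs.forward(tag=7, value=9)
print(rs.issue().values)  # {'a': 5, 'b': 9}
```

Because an entry is issued only once its awaited tags are cleared, a later entry that is already ready may be issued ahead of an earlier one, which is how execution order can differ from program order.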

In one embodiment, each of the functional units 212 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations.

Each of the functional units 212 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 220. If a branch prediction was incorrect, branch prediction unit 220 flushes instructions subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes prefetch/predecode unit 202 to fetch the required instructions from instruction cache 204 or main memory. It is noted that in such situations, results of instructions in the original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 222 and reorder buffer 216. Exemplary configurations of suitable branch prediction mechanisms are well known.

Results produced by functional units 212 are sent to the reorder buffer 216 if a register value is being updated, and to the load/store unit 222 if the contents of a memory location are changed. If the result is to be stored in a register, the reorder buffer 216 stores the result in the location reserved for the value of the register when the instruction was decoded. As stated previously, results are also broadcast to reservation station units 210A-210F where pending instructions may be waiting for the results of previous instruction executions to obtain the required operand values.

Generally speaking, load/store unit 222 provides an interface between functional units 212A-212F and data cache 224. In one embodiment, load/store unit 222 is configured with a store buffer with eight storage locations for data and address information for pending loads or stores. Functional units 212 arbitrate for access to the load/store unit 222. When the buffer is full, a functional unit must wait until the load/store unit 222 has room for the pending load or store request information. The load/store unit 222 also performs dependency checking for load instructions against pending store instructions to ensure that data coherency is maintained.
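The eight-entry buffer and the load-against-store dependency check can be sketched as follows. This is a simplified model with hypothetical names: only pending stores are tracked, addresses are compared for exact equality, and returning the data of the youngest matching store is an assumption of this sketch, since the text only states that a dependency check is performed.

```python
from typing import List, Optional, Tuple


class StoreBuffer:
    CAPACITY = 8  # eight storage locations, per the embodiment

    def __init__(self) -> None:
        # (address, data) pairs, oldest first.
        self.pending: List[Tuple[int, int]] = []

    def add_store(self, address: int, data: int) -> bool:
        """Queue a store; False means the requester must wait."""
        if len(self.pending) >= self.CAPACITY:
            return False
        self.pending.append((address, data))
        return True

    def check_load(self, address: int) -> Optional[int]:
        """Return data from the youngest pending store to `address`,
        or None if the load does not depend on any pending store."""
        for pending_address, data in reversed(self.pending):
            if pending_address == address:
                return data
        return None


buf = StoreBuffer()
buf.add_store(0x100, 1)
buf.add_store(0x100, 2)
print(buf.check_load(0x100))  # 2
print(buf.check_load(0x200))  # None
```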

Data cache 224 is a high speed cache memory provided to temporarily store data being transferred between load/store unit 222 and the main memory subsystem. In one embodiment, data cache 224 has a capacity of storing up to eight kilobytes of data. It is understood that data cache 224 may be implemented in a variety of specific memory configurations, including a set associative configuration.

Details regarding the dispatch of instructions from instruction cache 204 through instruction alignment unit 206 to decode units 208 will next be considered. Figure 3 is a block diagram which depicts internal portions of one embodiment of instruction alignment unit 206 as well as internal portions of decode units 208A-208F with respect to a line of instruction code to be provided from instruction cache 204. As stated previously, instruction alignment unit 206 is configured to channel variable byte-length instructions (in this case certain x86 instructions referred to as fast path instructions) to decode units 208A-208F.

As shown in Figure 3, a latching unit 302 is incorporated as a portion of an output buffer section 301 of instruction cache 204. Latching unit 302 is capable of storing a line of instruction code provided from a storage array (not shown in Figure 3) of instruction cache 204 prior to being dispatched to decode units 208.

The instruction alignment unit 206 of Figure 3 includes a plurality of multiplexer circuits referred to as multiplexer channels 304A-304G coupled between latching unit 302 and decode units 208. A multiplexer control circuit 306 is further shown coupled to each multiplexer channel 304A-304G. In this embodiment, each decode unit 208A-208F includes an associated instruction decoder 318A-318F having an input port coupled to a respective multiplexer channel 304A-304F. Each decode unit 208A-208F further includes a respective displacement/immediate data buffer 330A-330F and a respective instruction issue unit 340A-340F.

During operation, a line of instruction code to be executed is provided to latching unit 302 from the storage array of instruction cache 204. Each byte of instruction code within instruction cache 204 is associated with a corresponding predecode tag including a start bit, an end bit, and a functional bit. When a line of instruction code is provided to latching unit 302, the predecode tag associated with each byte is provided to an input of multiplexer control circuit 306. As will be described in further detail below, depending upon the predecode tags corresponding to each line of instruction code within latching unit 302, multiplexer control circuit 306 controls multiplexer channels 304A-304G such that the instruction bytes are selectively routed to designated instruction decoders 318A-318F. Instruction paths formed by decode units 208A-208F are referred to as issue positions. The channeling of instruction code through multiplexer channels 304A-304G is dependent upon the location of the start byte associated with each instruction relative to each line as delineated by latching unit 302. In the embodiment of Figure 3, each of the first six multiplexer channels 304A-304F routes four contiguous bytes of instruction code from latching unit 302 to a respective instruction decoder 318A-318F. Multiplexer channel 304G is capable of channeling up to three contiguous bytes of instruction code to instruction decoder 318A.
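As an illustration of how start and end bits delineate instructions within a line, the following minimal sketch recovers instruction spans from two per-byte bit lists. The function name and data layout are assumptions made for illustration; the functional bit is not modeled, and the sketch is not the logic of multiplexer control circuit 306.

```python
def instruction_spans(start_bits, end_bits):
    """Return ([(first_byte, last_byte), ...], open_start) for one line.

    open_start is the offset of a start byte whose end byte lies beyond
    the line (an instruction that wraps), or None if there is none.
    An end bit with no preceding start bit in the line is ignored, since
    it closes an instruction that began on the previous line.
    """
    if len(start_bits) != len(end_bits):
        raise ValueError("one start bit and one end bit are needed per byte")
    spans = []
    open_start = None
    for offset, (is_start, is_end) in enumerate(zip(start_bits, end_bits)):
        if is_start:
            open_start = offset
        if is_end and open_start is not None:
            spans.append((open_start, offset))
            open_start = None
    return spans, open_start


if __name__ == "__main__":
    # Hypothetical 16-byte line: start bytes at 0, 1, 4 and end bytes at 0, 3.
    starts = [1, 1, 0, 0, 1] + [0] * 11
    ends = [1, 0, 0, 1, 0] + [0] * 11
    print(instruction_spans(starts, ends))
```

Because every byte carries its own tag, the spans fall out of a single pass over the tags and the instruction bytes themselves never have to be examined.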

Table 2 below illustrates the possible multiplexer channels 304A-304G through which start bytes may be channeled. As stated previously, the channeling of instruction code is dependent upon the location(s) of start bytes within a given line. It is noted that each multiplexer channel 304A-304F is configured to route the lowest-order start byte among those allocated to it, provided the start byte has not been selected for routing by a lower order multiplexer channel.

Table 2. Dispatches to Issue Positions Based on Start Byte Locations.

Start Byte     Dispatch To
In Location    Issue Position
0              0
1              0 or 1
2              0 or 1
3              1 or 2
4              1 or 2
5              2
6              2 or 3
7              2 or 3
8              2 or 3
9              3 or 4
10             3 or 4
11             4
12             4 or 5
13             5 or 6
14             5 or 6
15             5 or 6

Referring to Table 2, multiplexer channel 304A is capable of routing start bytes located at byte positions 0-2 to decode unit 318A. Multiplexer channel 304B is capable of routing start bytes at byte positions 1-4 to decode unit 318B. Multiplexer channel 304C is capable of transferring start bytes at byte positions 3-8 to decode unit 208C. Similarly, multiplexer channel 304D is capable of transferring start bytes at byte positions 6-10 to decode unit 208D, and multiplexer channel 304E is capable of transferring start bytes at byte positions 9-12 to decode unit 208E. Finally, multiplexer channel 304F is capable of transferring start bytes at byte positions 12-15 to decode unit 318F. Start bytes located at byte positions 13-15 may alternatively be routed through multiplexer channel 304G to a seventh issue position which is employed to wrap bytes of an incomplete instruction (i.e., an instruction which extends into the next line) to the next cache line for decode. As will be described further below, instruction bytes routed through multiplexer channel 304G are provided to instruction decoder 318A upon the next clock cycle when the remaining bytes of that instruction are available within latching unit 302.
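The byte ranges just listed can be restated as a small lookup, which makes it easy to check them against Table 2. The following is a minimal sketch, assuming a 16-byte line and numbering the wrap-around channel 304G as issue position 6; the names CHANNEL_RANGES and issue_positions are hypothetical.

```python
# Start-byte ranges served by each multiplexer channel, as listed in the text.
CHANNEL_RANGES = {
    0: range(0, 3),    # 304A: byte positions 0-2
    1: range(1, 5),    # 304B: byte positions 1-4
    2: range(3, 9),    # 304C: byte positions 3-8
    3: range(6, 11),   # 304D: byte positions 6-10
    4: range(9, 13),   # 304E: byte positions 9-12
    5: range(12, 16),  # 304F: byte positions 12-15
    6: range(13, 16),  # 304G: byte positions 13-15 (wrap-around channel)
}


def issue_positions(start_byte):
    """Return the issue positions to which a start byte may be dispatched."""
    if not 0 <= start_byte <= 15:
        raise ValueError("start byte must lie within a 16-byte line")
    return [pos for pos, rng in CHANNEL_RANGES.items() if start_byte in rng]


if __name__ == "__main__":
    for byte in range(16):
        print(byte, issue_positions(byte))
```

Inverting the per-channel ranges in this way yields one or two candidate positions for every start byte, in agreement with the rows of Table 2.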

If an instruction wraps around to a subsequent cache line, the dispatch of the instruction to a designated position is dependent upon the nature of the remaining bytes of the instruction that appear on the next line. For situations where solely displacement or immediate data wrap around to the next cache line, that immediate or displacement data is provided to displacement/immediate data buffer 330F through multiplexer channel 304A. It is noted that in this situation, the preceding bytes of that instruction (which appear on the preceding cache line) will have been dispatched to instruction decoder 318F during the preceding clock cycle. For situations in which prefix, opcode, MODRM, and/or SIB bytes wrap around to the next cache line, the instruction information from the previous line is routed through multiplexer channel 304G to instruction decoder 318A, and is merged with the rest of the instruction code during the next clock cycle.

It will be appreciated that by limiting the possible number of issue positions to which a given instruction of a line may be dispatched, the number of cascaded levels of logic required to implement the instruction alignment unit 206 may be advantageously reduced. Furthermore, by restricting the dispatch of an instruction having a start byte which resides at one of a select subset of byte locations within a line to a single issue position (i.e., byte positions 5 and 11), the number of cascaded levels of logic for instruction alignment may be reduced even further. Accordingly, the instruction alignment unit 206 as described above allows the implementation of a superscalar microprocessor having a relatively small number of gates per pipeline stage to thereby accommodate very high frequencies of operation. For relatively long instructions, although issue positions may be skipped, relatively high performance may still be achieved since other issue positions are available for remaining instructions within a cache line.

The defined fast path instructions may be up to eight bytes in length, and may include a single prefix byte. It is noted that by limiting the defined fast path instructions to only a single prefix byte, bytes 4 through 7, if any, of any fast path instruction will contain only displacement and/or immediate data. Therefore, for situations in which the instruction is greater than four bytes, the first four bytes of the instruction are routed through the multiplexer channel allocated to that instruction's start byte. The remaining bytes of the instruction are channeled through the next issue position's multiplexer channel. In such situations, the instruction decoder of the issue position receiving the remaining bytes of the instruction detects the absence of a start bit at its first-byte position, and accordingly passes the data to the displacement/immediate data buffer 330 of the preceding issue position and issues a NOOP instruction. Thus, if a start byte of an instruction is located at byte position 0 of latching unit 302, that byte is provided to decode unit 208A along with the next three contiguous bytes residing at byte positions 1, 2, and 3. If the next start byte resides at position 2 (i.e., the first instruction was two bytes in length), bytes 2-5 are routed through multiplexer channel 304B to decode unit 208B. For the embodiment of Figure 3, each instruction decoder 318A-318F is capable of decoding only one instruction at a time. Accordingly, although the start bytes of more than one instruction may be provided to, for example, instruction decoder 318A, only the first instruction is decoded. Bytes beyond the first end byte, corresponding to additional instructions within a given instruction decoder, are extraneous and are effectively ignored. It is noted that the multiplexer channels 304 of instruction alignment unit 206 could be alternatively configured such that only a single instruction (or portions thereof), in accordance with the instruction's start and end predecode bits, is channeled to a given instruction decoder 318.

In accordance with the above, if a first instruction starts at byte position 0, bytes 0-3 are provided to instruction decoder 318A. If the instruction is longer than four bytes, bytes 4-7 of latching unit 302 are provided through multiplexer channel 304B to instruction decoder 318B, which subsequently passes the data to displacement/immediate data buffer 330A. For this situation, multiplexer channel 304C will route the next start byte appearing in the code to instruction decoder 318C. If, on the other hand, the first instruction starting at byte location 0 is four bytes or less, the next instruction is routed through multiplexer channel 304B beginning with the start byte of the second instruction. If that instruction is greater than four bytes long, the immediate or displacement data corresponding to that instruction is routed through multiplexer channel 304C to displacement/immediate data buffer 330B. The remaining multiplexer channels operate similarly.
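The handling of an instruction longer than four bytes can be sketched as follows. This is an illustrative model only: the function name, the dictionary layout, and the restriction to instructions that fit within one 16-byte line are assumptions, and wrap-around to the next line is deliberately not modeled.

```python
def route_long_instruction(start, length, issue_position):
    """Split one instruction's bytes between its own issue position and,
    if it is longer than four bytes, the next issue position."""
    if not 1 <= length <= 8:
        raise ValueError("fast path instructions are at most eight bytes long")
    if start < 0 or start + length > 16:
        raise ValueError("wrap-around to the next line is not modeled here")
    head = list(range(start, start + min(length, 4)))  # decoded in place
    tail = list(range(start + 4, start + length))      # displacement/immediate
    routes = [{"issue_position": issue_position, "bytes": head,
               "action": "decode"}]
    if tail:
        if issue_position >= 5:
            raise ValueError("no following issue position within this line")
        routes.append({
            "issue_position": issue_position + 1,
            "bytes": tail,
            # The receiving decoder sees no start bit at its first byte.
            "action": "issue NOOP, forward bytes to buffer of position %d"
                      % issue_position,
        })
    return routes


if __name__ == "__main__":
    for route in route_long_instruction(0, 6, 0):
        print(route)
```

For a six-byte instruction starting at byte 0, the sketch assigns bytes 0-3 to issue position 0 and bytes 4-5 to issue position 1, which issues a NOOP and forwards them, mirroring the 318A/318B/330A example in the text.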

It is noted that if immediate or displacement data is wrapped around to a subsequent line from an instruction starting at a previous line, that data is provided to displacement/immediate data buffer 330F through multiplexer channel 304A when the immediate or displacement data is available in latching unit 302. It is further noted that instruction decoding is not affected since no decoding is required for displacement and immediate data. The first instruction of the subsequent line is therefore routed to instruction decoder 318B through multiplexer channel 304B.

It is similarly noted that if prefix, opcode, MODRM, and/or SIB information is wrapped around from an instruction beginning on a previous line, multiplexer channel 304G routes the preceding portions of the instruction to instruction decoder 318A, in which case the next instruction (corresponding to the first start byte within latching unit 302 during the next clock cycle) will be routed through multiplexer channel 304B to instruction decoder 318B.

As will be understood better from the following example, situations may arise wherein none of the possible issue positions to which a given start byte may be provided are available due to occupation of those issue positions by previous instructions. When such a situation arises, that instruction and any instructions following it must be held until the next clock cycle for dispatch.

A sample sequence of x86 instructions is shown in Table 3 below. Instructions 1 through 7 in addition to the first byte of instruction 8 are shown within cache line 1. Cache line 2 begins with the second byte of instruction 8, and further includes instructions 9 through 16.

Table 3. Sample Sequence of Instructions.

Instr.    Address                       Num.     Cache    Line
Number    Offset     Instruction        Bytes    Line     Byte
1         0000       INC ESI            1        1        0
2         0001       CMP BYTE,[ESI]     3        1        1-3
3         0004       JZ DST1            2        1        4-5
4         0006       CMP BYTE,[ESI]     3        1        6-8
5         0009       JZ DST2            2        1        9-10
6         000B       INC [EDX]          2        1        11-12
7         000D       OR ECX,ECX         2        1        13-14
8         000F       JZ DST3            2        1        15
                                                 2        0
9         0011       MOV AL,[ESI]       2        2        1-2
10        0013       MOV [ECX],AL       2        2        3-4
11        0015       INC ECX            1        2        5
12        0016       INC ESI            1        2        6
13        0017       CMP BYTE,[ESI]     3        2        7-9
14        001A       JNZ DST4           2        2        10-11
15        001C       INC [ECX]          2        2        12-13
16        001E       OR ECX,ECX         2        2        14-15

Table 4 below illustrates the manner in which the above sequence of instructions in Table 3 is dispatched to the decode units 208A-208F by instruction alignment unit 206.

Table 4. Instruction Alignment for Sample Sequence of Instructions in Table 3.

         Issue     Issue     Issue     Issue     Issue     Issue
         Pos. 0    Pos. 1    Pos. 2    Pos. 3    Pos. 4    Pos. 5
Clock    (0:2)     (1:4)     (3:8)     (6:10)    (9:12)    (12:15)
1        Ins. 1    Ins. 2    Ins. 3    Ins. 4    Ins. 5    -
2        -         -         -         -         Ins. 6    Ins. 7
3        Ins. 8    Ins. 9    Ins. 10   -         -         -
4        -         -         Ins. 11   Ins. 12   -         -
5        -         -         Ins. 13   Ins. 14   Ins. 15   Ins. 16

Instructions 1-5 are dispatched to issue positions 0-4 corresponding to decode units 318A-318E, respectively, during a first clock cycle. Instruction 6, which begins at byte position 11 of latching unit 302, can only be channeled to issue position 4 corresponding to decode unit 318E. However, since issue position 4 is already occupied by instruction 5, instruction 6 cannot be dispatched during this cycle. Accordingly, multiplexer control circuit 306 causes decode unit 318F to issue a NOOP (no operation) instruction during the decode stage when instructions 1-5 are decoded.

During clock cycle 2, instruction 6 is dispatched to issue position 4, and instruction 7 is dispatched to issue position 5. It is noted that when these instructions are decoded, multiplexer control circuit 306 causes decode units 318A-318D to issue NOOP instructions. Since instruction 8 wraps around to the next cache line, the first byte of the instruction is wrapped around to instruction decoder 318A during the next clock cycle through multiplexer channel 304G.

During clock cycle 3, instruction 8 is dispatched to issue position 0. It is noted that the first byte of instruction 8 is wrapped around from byte position 15 of the previous line. Instructions 9 and 10 are further dispatched to issue positions 1 and 2 through multiplexer channels 304B and 304C, respectively. Upon decode of instructions 8-10, instruction issue units 340D-340F cause NOOP instructions to be issued.

Instructions 11 and 12 are dispatched to issue positions 2 and 3 during clock cycle 4. Instruction 13 begins in byte 7, and cannot be routed to issue position 4. Therefore, the dispatch of instruction 13 must be held until the next clock cycle.

During clock cycle 5, instructions 13 through 16 are dispatched to issue positions 2 through 5, respectively. Similar to the above, during decode of instructions 13-16, instruction issue units 340A and 340B cause NOOP instructions to be issued for issue positions 0 and 1.
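The clock-by-clock walk-through above can be reproduced with a short simulation of the start-byte ranges of Table 2. The following is a minimal sketch under stated assumptions, not the logic of multiplexer control circuit 306: instructions are taken strictly in program order, every instruction is at most four bytes long, an instruction flagged as wrapping is always carried through channel 304G and reappears at issue position 0 of the next line, and the names ISSUE_RANGES, WRAP_RANGE and dispatch are hypothetical.

```python
# Inclusive start-byte ranges accepted by issue positions 0-5 (Table 2).
ISSUE_RANGES = [(0, 2), (1, 4), (3, 8), (6, 10), (9, 12), (12, 15)]
WRAP_RANGE = (13, 15)  # start bytes that channel 304G can carry forward


def dispatch(lines):
    """lines: one list per cache line of (number, start_byte, wraps) tuples
    in program order. Returns one row per clock; each row holds, for issue
    positions 0-5, an instruction number or None (a NOOP)."""
    schedule = []
    carried = None  # instruction wrapped in from the previous line
    for line in lines:
        pending = list(line)
        first_clock = True
        while pending or (first_clock and carried is not None):
            first_clock = False
            row = [None] * 6
            first_free = 0
            progressed = False
            if carried is not None:
                row[0] = carried  # wrapped instruction merges at position 0
                carried = None
                first_free = 1
                progressed = True
            for pos in range(first_free, 6):
                if not pending:
                    break
                number, start, wraps = pending[0]
                low, high = ISSUE_RANGES[pos]
                if not wraps and low <= start <= high:
                    row[pos] = number
                    pending.pop(0)
                    progressed = True
            if pending and pending[0][2]:
                number, start, _ = pending[0]
                if WRAP_RANGE[0] <= start <= WRAP_RANGE[1]:
                    carried = number  # held in 304G until the next line
                    pending.pop(0)
                    progressed = True
            if not progressed:
                raise ValueError("instruction cannot be routed by this sketch")
            schedule.append(row)
    return schedule


if __name__ == "__main__":
    line1 = [(1, 0, False), (2, 1, False), (3, 4, False), (4, 6, False),
             (5, 9, False), (6, 11, False), (7, 13, False), (8, 15, True)]
    line2 = [(9, 1, False), (10, 3, False), (11, 5, False), (12, 6, False),
             (13, 7, False), (14, 10, False), (15, 12, False), (16, 14, False)]
    for clock, row in enumerate(dispatch([line1, line2]), start=1):
        print(clock, row)
```

Fed the start bytes of Table 3, the sketch holds instruction 6 in clock 1, instruction 11 in clock 3 and instruction 13 in clock 4, for the same reason given in the text: no issue position in the instruction's allowed range is still free in that cycle.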

Referring back to Figure 2, instructions which are not included within the subset of x86 instructions designated as fast path instructions are executed under the control of MROM unit 209 using stored microcode. MROM unit 209 parses such instructions into a series of fast path instructions which are dispatched during one or more clock cycles. As stated previously, predecode unit 202 is configured such that when a predesignated MROM instruction is encountered, the functional bit associated with the first byte of the instruction is set. This condition is readily detectable by MROM unit 209 to effectuate serialization of the instruction as will be described further below.

When an MROM instruction within a line of code in latching unit 302 is detected by MROM unit 209, this instruction and any following it are not dispatched during the current cycle. Any instructions preceding it are dispatched in accordance with the above description.

During the following clock cycle(s), MROM unit 209 provides a series of fast path instructions to the decode units 208 through instruction alignment unit 206 in accordance with the microcode for that particular MROM instruction. Once all of the microcoded instructions have been dispatched to decode units 208 through alignment unit 206 to effectuate the desired MROM operation, the instructions which followed the MROM instruction are allowed to be dispatched.

Table 5 below illustrates a sample x86 assembly language code segment containing an MROM instruction (REP MOVSB).

Table 5. x86 Assembly Language Code Segment With MROM Instruction.

MOV CX, S_LEN    ;get string length
CLD              ;increment indices
REP MOVSB        ;move string
POP CX           ;restore registers
POP DI
POP SI

Figures 4A-4C are block diagrams of portions of superscalar processor 200 depicting the dispatch and decode of the instructions of Table 5 during consecutive clock cycles. During the first clock cycle as depicted within Figure 4A, the first two instructions (MOV CX, S_LEN and CLD) are routed through multiplexer channels 304A and 304B to issue positions 0 and 1 (i.e., decode units 318A and 318B). Upon decode, MROM unit 209 further causes decode units 208C-208F to issue NOOP instructions.

Microcoded instructions that effectuate the REP MOVSB instruction are dispatched during cycles 2 through N, as depicted by Figure 4B. During these cycles, a set of fast path instructions in accordance with the microcode stored in MROM unit 209 are dispatched through the instruction alignment unit 206 to decode units 208A-208F. It is noted that this MROM sequence may take several cycles to complete.

Following complete dispatch of the MROM instruction, the remaining instructions of the line following the MROM instruction are allowed to be dispatched to issue positions 3-5 through multiplexer channels 304D-304F. Upon decode of these instructions, MROM unit 209 causes decode units 208A-208C to issue NOOP instructions.
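The serialization around an MROM instruction can be summarized by grouping a line's instructions into dispatch phases. The sketch below is illustrative only: the function name and the (mnemonic, is_mrom) tuple layout are assumptions, the microcode sequence itself is not expanded, and the issue-position limits of Table 2 (which may spread a fast path group over more than one cycle) are not modeled.

```python
def serialize_dispatch(line):
    """Group (mnemonic, is_mrom) pairs into ordered dispatch phases.

    Fast path instructions before an MROM instruction form one phase, the
    MROM instruction forms its own microcode phase, and the instructions
    after it wait for a later phase.
    """
    groups = []
    fast_path = []
    for mnemonic, is_mrom in line:
        if is_mrom:
            if fast_path:
                groups.append(("fast path", fast_path))
                fast_path = []
            groups.append(("microcode", [mnemonic]))
        else:
            fast_path.append(mnemonic)
    if fast_path:
        groups.append(("fast path", fast_path))
    return groups


if __name__ == "__main__":
    table5 = [("MOV CX, S_LEN", False), ("CLD", False), ("REP MOVSB", True),
              ("POP CX", False), ("POP DI", False), ("POP SI", False)]
    for phase, members in serialize_dispatch(table5):
        print(phase, members)
```

Applied to the code segment of Table 5, the grouping yields three phases that correspond to the three situations depicted in Figures 4A-4C.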

It is understood that while the instruction alignment unit 206 as described above in conjunction with Figures 2-4 is configured to selectively route instructions to the specific issue positions indicated by Table 2, other configurations are also possible. That is, the specific issue position or positions to which a given instruction within a line of memory is dispatched may be varied from that described above. It is further specifically contemplated that the number of issue positions provided within a superscalar microprocessor employing a decode unit in accordance with the invention may also vary. Other configurations of an instruction alignment unit for providing instructions to the parallel decode units are also possible, and other configurations of the decode units are possible.

It is noted that the specific predecode scheme employed by predecode unit 202 may vary from that indicated in Table 1. For example, the specific meanings conveyed by a particular combination of the values of the start bit and functional bit of a particular byte of instruction code may be different from the specific meaning indicated within Table 1. Furthermore, while the instruction alignment unit 206 and decode units 208 in the embodiment described above are configured to directly transfer and decode certain raw x86 instructions (i.e., fast path instructions), implementations of a superscalar microprocessor are also possible wherein an instruction alignment unit is configured to translate a raw x86 instruction into one or more fixed length instructions, such as ROPs. In such a configuration, a plurality of decode units would be configured to receive and decode the translated instructions.

Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims

WHAT IS CLAIMED IS:

1. A method for predecoding variable byte length instructions within a superscalar microprocessor comprising the steps of:

generating a start bit indicative of whether a byte of an instruction is a start byte;

generating an end bit indicative of whether said byte of said instruction is an end byte; and

generating a functional bit that conveys a meaning dependent upon a value of said start bit.

2. The method for predecoding variable byte length instructions within a superscalar microprocessor as recited in claim 1 wherein said meaning of said functional bit is further dependent upon a value of a corresponding start bit of a previous instruction byte.

3. The method for predecoding variable byte length instructions within a superscalar microprocessor as recited in claim 1 wherein said functional bit is indicative of whether an opcode is a first byte of said instruction.

4. The method for predecoding variable byte length instructions within a superscalar microprocessor as recited in claim 1 comprising the further step of providing said start bit, said end bit and said functional bit to an instruction decoder.

5. The method for predecoding variable byte length instructions within a superscalar microprocessor as recited in claim 4 comprising the further step of detecting said start bit, said end bit, and said functional bit within said instruction decoder to determine a boundary of said instruction.

6. A superscalar microprocessor comprising:

an instruction cache for storing a plurality of variable byte-length instructions;

a predecode unit coupled to said instruction cache and configured to generate a predecode tag associated with a byte of an instruction, wherein said predecode tag includes a start bit having a value indicative of whether said byte is a starting byte of said instruction and further includes a functional bit that conveys a meaning dependent upon said value of said start bit;

a plurality of decode units for decoding designated instructions which correspond to said plurality of variable byte-length instructions; and

an instruction alignment unit coupled between said instruction cache and said plurality of decode units for providing decodable instructions to said plurality of decode units.

7. The superscalar microprocessor as recited in claim 6 wherein said instruction alignment unit is configured to provide said instruction to one of said plurality of decode units.

8. The superscalar microprocessor as recited in claim 6 wherein each of said plurality of said decode units is configured to decode a predetermined subset of an x86 instruction set.

9. The superscalar microprocessor as recited in claim 6 wherein said plurality of variable byte-length instructions are organized in lines within said instruction cache, wherein a line includes a predetermined number of bytes.

10. The superscalar microprocessor as recited in claim 6 wherein said predecode tag further includes an end bit indicative of whether said byte is an ending byte of said instruction.

11. The superscalar microprocessor as recited in claim 6 wherein said instruction alignment unit is configured to provide said predecode tag to at least one of said plurality of decode units.

12. The superscalar microprocessor as recited in claim 11 wherein said at least one of said plurality of decode units is configured to detect said predecode tag to determine a boundary of said instruction.

13. The superscalar microprocessor as recited in claim 6 further comprising a plurality of functional units configured to receive output signals from said plurality of decode units.

14. The superscalar microprocessor as recited in claim 13 wherein said output signals from said plurality of said decode units include bit-encoded execution instructions.

15. The superscalar microprocessor as recited in claim 13 further comprising a plurality of reservation stations coupled to said plurality of decode units and to said plurality of functional units, wherein said plurality of reservation stations are configured to temporarily store said output signals from said plurality of decode units prior to issuance to said plurality of said functional units.

16. The superscalar microprocessor as recited in claim 13 wherein a dedicated functional unit is associated with each of said plurality of decode units.

17. The superscalar microprocessor as recited in claim 6 further comprising a reorder buffer coupled to said plurality of decode units for storing speculatively-executed instruction results.

18. A method for predecoding variable byte length instructions within a superscalar microprocessor comprising:

generating a boundary bit indicative of whether a byte of an instruction is a boundary byte; and

generating a functional bit that conveys a meaning dependent upon a value of said boundary bit.

19. A superscalar microprocessor comprising:

an instruction cache for storing a plurality of variable byte-length instructions;

a predecode unit coupled to said instruction cache and configured to generate a predecode tag associated with a byte of an instruction, wherein said predecode tag includes a boundary bit having a value indicative of whether said byte is a boundary of said instruction and further includes a functional bit that conveys a meaning dependent upon said value of said boundary bit; and

a plurality of decode units coupled to receive said plurality of variable byte-length instructions.
