ERRATA
Pentium® Pro Processor System Architecture, 1st Edition


Updated on 9/19/97 with errata for pages 233 and 235.

Updated on 8/2/97 with the following changes:

 

Page   Applies to Printing   Description

13 First Under the heading "Starting Up Other Processors," replace all of the text above the bulleted item with the following text: 

"The code executing on the BSP is responsible for detecting the presence of the other processors. This information is then stored in non-volatile memory. An MP (multi-processing) OS uses this information to determine the available processors. An OS that is not MP-aware (e.g., DOS) only makes use of the BSP. The other processors remain dormant (in other words, they're useless).

Assuming that it is an MP OS, it assigns a task to a processor in the following manner:"

14 First Under the heading "Relationship of Processors to Main Memory," after the last bulleted item, insert the following sentence: 

"Lines in the L1 code cache are marked either I (invalid) or S (shared, or valid)."

15 First In table 1-1, 4th body row, 2nd column, replace with the following: 

"After receiving the line from the snoop agent, the requestor places line in S state."

17 First Table 1-1, center column, add the following text at the end of the last sentence: 

"(because initiator immediately stores into it and marks it modified)."

46 First Make the following changes to table 3-3: 

Bit 0 should be reserved.
Bit 1 enables/disables data bus ECC error checking.
Bits 4:1 and 7:6 should all be changed to show 0 = disable and 1 = enable.
Bit 26 should be shown as R/W, not read-only.
Bits 63:27 are reserved, not 31:27.

50 First Make the following changes to Figure 3-3: 

Bit 0 is reserved.
Bit 26 is R/W.
MSB is 63, not 31.

57 First Replace the text under the heading "APIC Arbitration Background" with the following text: 

"During output of any message on the APIC bus, the local APIC’s 4-bit ID is inverted and driven onto APIC data line one (PICD1) serially, msb first. Multiple processors may start driving messages simultaneously. In this case, as each inverted bit of the arbitration ID is driven onto the bus, msb first, an electrical zero beats an electrical one. When a processor driving an electrical one sees an electrical zero on the line, it realizes that it has lost the arbitration and ceases to attempt transmission of its message. It waits until the current message transmission completes and then reattempts transmission of its message. The winner of the arbitration sends its message and then changes its APIC arbitration ID to the lowest value (i.e., 1111b). The losers each upgrade their priority by inverting their current arbitration ID, adding one to it and then reinverting it."
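The two rules in the replacement text above (a zero on the line beats a one, and a loser upgrades by inverting, adding one and reinverting) can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, the 4-bit wraparound is an assumption, and the arbitration function takes the bit patterns exactly as they appear on the line, without deciding how an APIC derives them.

```python
ID_MASK = 0xF             # the text describes a 4-bit arbitration ID
LOWEST_PRIORITY_ID = 0xF  # 1111b, the value the winner takes afterwards


def upgrade_priority(arb_id):
    """Loser's update: invert the ID, add one, then reinvert."""
    inverted = ~arb_id & ID_MASK
    bumped = (inverted + 1) & ID_MASK  # assumption: result is kept to 4 bits
    return ~bumped & ID_MASK


def wired_arbitration(driven):
    """Return the index of the contender that survives bit-serial arbitration.

    `driven` holds the 4-bit pattern each contender puts on the line,
    msb first. A contender driving a 1 drops out when the line reads 0.
    Assumption: the patterns are distinct, so exactly one contender survives.
    """
    alive = list(range(len(driven)))
    for bit in (3, 2, 1, 0):
        line = min((driven[i] >> bit) & 1 for i in alive)  # 0 beats 1
        alive = [i for i in alive if (driven[i] >> bit) & 1 == line]
    return alive[0]
```

For example, upgrade_priority(0xE) gives 0xD, and wired_arbitration([0b0011, 0b0001, 0b0010]) picks index 1, the only contender still driving a zero in the third bit position.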

57 First Replace the text under the heading "Startup APIC Arbitration ID Assignment" with the following text: 

"On the trailing edge of reset, each processor’s local APIC arbitration ID is set to the inverse of its processor’s agent ID. At start-up time, the net result is that the processor with an APIC arbitration priority ID of Fh has the lowest APIC arbitration priority and the processor with the highest numerical agent ID has the highest APIC arbitration priority. The final result is that the processor with the numerically highest agent ID will be the BSP (as described in the next section)."

57 First Replace the text under the heading "BSP Selection Process" with the following text: 

1. After each processor’s BIST completes (if it was started), the local APICs in all processors simultaneously attempt to send their BIPI (Bootstrap Inter-Processor Interrupt) message to all processors (including themselves) over the APIC bus.

2. The processor with the highest APIC arbitration priority (i.e., inverted agent ID) wins the arbitration and sends the first BIPI message to all of the processors (including itself). The processors that lose the arbitration must wait until the winner finishes issuing its BIPI message before they reattempt issuance of their BIPI messages.

3. The arbitration winner’s BIPI message is received by all of the processors and the APIC ID field of the message (lower 4 bits of the 8-bit vector field) is compared to each of the receiving processors’ APIC ID. The processor with a match sets the BSP bit in its APICBASE MSR (model-specific register). This identifies it as the bootstrap processor. All of the losers clear this bit in their respective APICBASE MSRs, thereby identifying themselves as the applications processors, or APs.

4. The winner (i.e., the BSP) changes its rotating APIC priority level to Fh (the lowest APIC priority) and attempts to issue the FIPI (Final Inter-Processor Interrupt) message to all processors (including itself). The losers of the first competition each upgrade their priority by inverting their current arbitration ID, adding one to it and then reinverting it.

5. All of the processors that lost the first competition (the APs) attempt once again to transmit their BIPI messages and the winner of the first arbitration (the BSP) attempts to transmit its FIPI message.

6. Because the BSP set its arbitration priority level to the lowest, it is guaranteed to lose the competition. One of the other processors will win (the one with the highest arbitration ID).

7. As each of the APs is successful in acquiring APIC bus ownership and transmitting its BIPI message, it then sets its arbitration ID to Fh, to make itself the least important.

8. As each of the application processors (APs) in succession wins the bus and finishes broadcasting its BIPI, it then remains in the halt state until it subsequently receives a SIPI (Startup Inter-Processor Interrupt) message from the BSP at a later time.

9. After all of the APs have broadcast their BIPIs, the BSP will be successful in re-acquiring APIC bus ownership and will then broadcast its FIPI. Upon receiving its own FIPI, the BSP then begins fetching the POST code.

Once the BSP selection process has completed, the BSP initiates fetch, decode and execution of the ROM POST code starting at the power-on restart address selected at the trailing edge of reset (see “Power-On Restart Address Selection” on page 40).
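The message ordering described in steps 1 through 9 can be traced with a small Python simulation. It is a sketch under stated assumptions, not a model of the hardware: the function name select_bsp is made up, and it assumes that a numerically lower stored arbitration ID means higher priority, which is one reading consistent with Fh being the lowest priority and with the highest agent ID becoming the BSP.

```python
def select_bsp(agent_ids):
    """Simulate BIPI/FIPI ordering; returns (bsp_agent_id, message_log)."""
    # Arbitration ID starts as the inverted 4-bit agent ID.
    arb = {a: ~a & 0xF for a in agent_ids}
    # Message each processor still has to send.
    pending = {a: "BIPI" for a in agent_ids}
    bsp = None
    log = []
    while pending:
        # Assumption: lowest stored ID wins (Fh is the lowest priority).
        winner = min(pending, key=lambda a: arb[a])
        log.append((pending.pop(winner), winner))
        # Losers upgrade: invert, add one, reinvert (4 bits).
        for loser in pending:
            arb[loser] = ~((~arb[loser] & 0xF) + 1) & 0xF
        # Winner drops to the lowest priority.
        arb[winner] = 0xF
        if bsp is None:
            # First winner is the BSP; it still has to send its FIPI.
            bsp = winner
            pending[bsp] = "FIPI"
    return bsp, log
```

With agent IDs 0 through 3, select_bsp([0, 1, 2, 3]) reports agent 3 as the BSP, with BIPIs logged from agents 3, 2, 1 and 0 in that order, followed by the FIPI from agent 3.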

58 First Replace the text under the heading "Processor's Initial Memory Reads" with the following text: 

When the BSP comes out of reset, caching is disabled (CR0[CD] and CR0[NW] are both set to one). The processor’s 32-byte prefetch streaming buffer is empty. As a result, the processor initiates a 32-byte memory read transaction to fill the streaming buffer, but it indicates that the addressed area of memory is uncacheable. A detailed description of the processor’s initial memory code fetches from the boot ROM can be found on our web site in a technical paper.
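As a small illustration of the CR0 state mentioned above, the following Python sketch tests the two bits. The bit positions (NW = bit 29, CD = bit 30) come from the IA-32 architecture definition of CR0; the function name is made up for this example.

```python
CR0_NW = 1 << 29  # Not Write-through
CR0_CD = 1 << 30  # Cache Disable


def caching_disabled(cr0):
    """True when both CD and NW are set, the state the text describes."""
    return (cr0 & CR0_CD) != 0 and (cr0 & CR0_NW) != 0
```

A value with both bits set, such as CR0_CD | CR0_NW, makes caching_disabled return True; clearing either bit makes it return False.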

59 First Replace the text under the heading "How APs are Started" with the following text: 

"The Intel Multiprocessing specification dictates that the startup code executed by the BSP is responsible for detecting the presence of processors other than the BSP. When the available AP processors have been detected, the startup code stores this information in non-volatile memory (for the OS to consult when it is loaded and takes over)."

59 First Change the heading "SMP OS" to "MP OS" and replace the text under the heading with the following text: 

"If the OS is a Multi-Processing, or MP, OS, it must: 

  • consult the information stored in non-volatile memory by the startup code to determine the presence (or absence) of the other processors (the APs) 
  • place tasks in memory for them to execute and
  • pass the start address of these programs to each of them. "
60 First Under the heading "AP Task Assignment", change the two occurrences of "SMP" to "MP".
66   Replace figure 5-3 with the following picture: 

The first sentence of the second paragraph should end with "because it overflows into the next line."

66,72-74 First, Second MindShare has received a clarification from Intel on the operation of the IFU and DEC1 pipeline stages. 

Using the instruction boundary markers inserted into the 16-byte code block in the IFU2 stage, the IFU3 stage rotates the next three sequential IA instructions to optimize their alignment with the three decoders (decoders 0, 1 and 2). If the three instructions consist of three simple IA instructions, they are submitted to the three decoders in strict program order (in a single clock). If, on the other hand, the three IA instructions consist of two simple instructions and a complex instruction (in any order), the IFU3 stage rotation logic rotates the instructions to align the complex instruction with the complex decoder (i.e., decoder 0) and the two simple instructions with decoders 1 and 2. In the next clock (the DEC1 stage), the three instructions are submitted to the three decoders. In the following clock (the DEC2 stage), the micro-ops produced by the decoders are placed in the ID Queue in strict program order. If the three IA instruction series contains more than one complex instruction, decoder throughput will not be optimal. The following table describes the result:
 

Composition of Next 3 IA Instructions | Instructions Decoded | Description
simple,simple,simple | 3 | Throughput optimized. No rotation necessary. In the DEC1 stage, all three decoders are simultaneously submitted simple IA instructions to decode.
simple,simple,complex | 3 | Throughput optimized. Rotation performed to align the three IA instructions with the three decoders. In the DEC1 stage, all three decoders are simultaneously submitted IA instructions to decode.
simple,complex,simple | 3 | Throughput optimized. Rotation performed to align the three IA instructions with the three decoders. In the DEC1 stage, all three decoders are simultaneously submitted IA instructions to decode.
simple,complex,complex | 2 | The first two IA instructions are rotated and submitted to decoders 0 and 1.
complex,simple,simple | 3 | The three IA instructions are submitted to the three decoders in original program order. No rotation is necessary.
complex,simple,complex | 2 | The first two IA instructions are submitted to decoders 0 and 1 without rotating them.
complex,complex,simple | 1 | Just the first IA instruction is submitted to decoder 0.
complex,complex,complex | 1 | Just the first IA instruction is submitted to decoder 0.
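The eight cases in the table follow one pattern: instructions are accepted in program order until a second complex instruction is reached, because only decoder 0 handles complex instructions. A minimal Python sketch of that pattern follows; the function name is made up, and the rule is inferred from the table rather than stated by Intel.

```python
def instructions_decoded(kinds):
    """Return how many of the next three IA instructions are decoded together.

    kinds: three strings, each "simple" or "complex", in program order.
    """
    complex_seen = 0
    decoded = 0
    for kind in kinds:
        if kind == "complex":
            complex_seen += 1
            if complex_seen == 2:
                break  # a second complex instruction has to wait
        decoded += 1
    return decoded
```

For example, instructions_decoded(("simple", "complex", "complex")) returns 2, matching the fourth row of the table.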

 

113 First, Second Replace the last paragraph under the heading "Return Stack Buffer (RSB)" with the following: 

When a CALL instruction is executed, the return address is pushed into stack memory and is also pushed onto the RSB. The next RET instruction subsequently seen in the IFU2 stage causes the processor to access the top (i.e., most-recent) entry in the RSB. The branch prediction logic predicts a branch to the return address recorded in the selected RSB entry. In the event that the called routine alters the return address entry in stack memory, this will result in a misprediction.
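The RSB behavior described above can be mimicked with two Python lists, one standing in for stack memory and one for the RSB. This is a sketch only: the class and method names are made up, the RSB is modeled with unlimited depth, and the entry is assumed to be consumed when a RET uses it.

```python
class ReturnPredictor:
    """Toy model of CALL/RET prediction with a Return Stack Buffer."""

    def __init__(self):
        self.stack = []  # return addresses held in stack memory
        self.rsb = []    # return addresses recorded in the RSB

    def call(self, return_address):
        # A CALL pushes the return address to both places.
        self.stack.append(return_address)
        self.rsb.append(return_address)

    def overwrite_return_address(self, new_address):
        # The called routine alters the return address in stack memory.
        self.stack[-1] = new_address

    def ret(self):
        # Prediction comes from the most recent RSB entry.
        predicted = self.rsb.pop()  # assumption: entry is consumed on use
        actual = self.stack.pop()
        return predicted, actual, predicted != actual
```

A CALL followed directly by a RET yields a correct prediction. If overwrite_return_address is called in between, the third element of the tuple returned by ret() is True, which corresponds to the misprediction case in the text.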

233 all printings of 1st edition In list item number one, change "within one clock" to read "within one or two clocks".
235 all printings of 1st edition In figure 10-11, in clock 2-3 period, change "deassert BNR# within 1 clk and enter stall state" to read "deassert BNR# within 1 or 2 clks and enter stall state".
237 First Replace figure 10-12 with the following picture: 

268 All First column heading in table 11-6 should be "ASZ" rather than "ASIZ."
288 First Replace figure 13-3 with the following picture: 

320 First,Second In table 14-2, first row, second column, delete the second paragraph (starts with "It should be noted...").
322 First, Second In figure 14-4, the bubble that reads "DID and DEN# (deasserted) latched" should read "DID and DEN# (asserted) latched".
338 First In 4th bulleted item, change "AD[1:0]#" to "AP[1:0]#".
343 First At the bottom of the page in the two major bullet items, "D[31:0]#" and "D[63:32]#" should be swapped.
351 First In the first sentence of the description of the INIT# signal, change "floating-point registers." to "floating-point or MCA registers.".
365 First Table 18-3, EAX, byte 1, "64 entries" should read "32 entries".
409 First Under the heading "CLI/STI Solution," item 2, "EFLAGS[VME]" should read "CR4[VME]".
410 First Item 7, delete ", reenabling interrupt recognition" at end of item.
420 First Replace figure 22-3 with the following picture: 

422 First Under the heading "MCG_CTL Register," in the second sentence, replace "MCG_STATUS" with "MCG_CAP".
431 First In table 22-4, column three of "Bus and Interconnect Errors," replace "BUS{LL}_{PP}_{RRRR}_{II}_ERR" with "BUS{LL}_{PP}_{RRRR}_{II}_{T}_ERR".
447 First Replace figure 24-1 with the following figure: 

453 First Under the heading "When Exiting MMX Routine, Execute EMMS," replace the text of the first bullet item with "the FPU registers are renamed as the MMX registers (MM[7:0])".
475 First,Second Although the 450 PBs do not support transaction deferral, the description on this page of how they handle transactions that must cross onto the PCI bus is incorrect. The following description replaces that found in the book. 

When a processor initiates a transaction that targets a device beyond the host/PCI bridge, the bridge stretches the snoop phase (via snoop stalls) while it tries to acquire PCI bus ownership. If the host/PCI bridge cannot acquire PCI bus ownership within a reasonable amount of time (I don't know the limit the bridge uses, but let's say around 400ns), it ends the snoop phase (indicating a miss) and terminates the transaction with a retry response. It does not memorize the transaction. The bridge will issue a retry to the processor each time that it retries the transaction until it can get PCI bus ownership within the window. It then starts the PCI transaction, stretching the snoop phase until the PCI device is ready to transfer the data (TRDY# asserted). It then indicates the normal data response (assuming that it's a read) and transfers the data across the bridge to the processor.

I asked HP Instrument Division to take some measurements using their Pentium Pro preprocessor. I've included an example below that shows an attempt to read from a CMOS RAM location (IO port 71h). On the first two attempts, the snoop phase was stalled for 352ns and 376ns, respectively, after which the retry response was delivered to the processor (because the PCI bus could not be acquired quickly). On the third attempt, the PCI bus was acquired within the window, the PCI transaction initiated, and the snoop phase stretched for 1.256us while the ISA read took place. The snoop phase was then terminated with a clean snoop and the normal data response was then received and the data transferred.
 

IO Port | Operation | Snoop Phase Duration | Response Type | PCI Transaction Initiated?
71h | IO read | 352ns | Retry | No
71h | IO read | 376ns | Retry | No
71h | IO read | 1.256us | Normal Data | Yes
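The retry behavior described above can be outlined as a short Python loop. Everything numeric here is hypothetical: the 400ns window is the placeholder guess from the text, not a documented limit, and the acquisition delays passed in are invented inputs for illustration, not measurements.

```python
def bridge_responses(acquire_delays_ns, window_ns=400):
    """Return the response the bridge gives on each processor attempt.

    acquire_delays_ns: hypothetical time needed to win the PCI bus per attempt.
    window_ns: placeholder for the bridge's unknown give-up limit.
    """
    responses = []
    for delay in acquire_delays_ns:
        if delay > window_ns:
            # Snoop phase ends with a miss; the transaction is not remembered.
            responses.append("Retry")
        else:
            # PCI bus acquired in time; the transaction runs to completion.
            responses.append("Normal Data")
            break  # the processor has its data and stops retrying
    return responses
```

With made-up delays of 900, 600 and 120 ns, bridge_responses([900, 600, 120]) returns two retries followed by a normal data response, the same shape as the three measured attempts in the table.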