Intel Pentium 4 3.2GHz Prescott
And Pentium 4 3.4GHz Extreme Edition
Significant changes in P4 architecture bring future scalability

By Dave Altavilla
February 1, 2004

For the visual learners among you, this snapshot of the new Pentium 4 Prescott block diagram should help put things into perspective.  In this section we will break things down in layman's terms, rather than ramble on with technical drivel that will have you looking at the back side of your eyelids before too long.  That said, the new P4 architecture has changed dramatically enough that it merits some discussion and analysis.

P4 Prescott Block Diagram

Again, the major physical changes in the new P4 Prescott are its 1MB of L2 cache, 16KB of L1 data cache and deeper 31-stage pipeline.  Let's look at what these hardware enhancements bring to the table in features and performance.

Pentium 4 Architecture Enhancements
Picking up where Northwood left off

P4 Architectural Improvements

Deeper 31 Stage Pipeline:
Prescott's new, more deeply pipelined core has perhaps the most significant impact on the chip's performance and future scalability.  Compared with Northwood's 20-stage design, the extra 11 stages in Prescott's pipeline afford the processor much more headroom for clock speeds in the future.  In fact, Intel has a 4GHz P4 on its roadmap this year, with 3.4 and 3.6GHz flavors right around the corner in Q2.  The downside of a longer pipeline is the increased penalty you take on a missed branch prediction.  With a pipeline that is 50+% deeper than Northwood's, the stalls when the Branch Prediction Unit misses its target can be crippling to performance.  Of course, the trade-off is that if you can scale the core speed high enough, the inefficiencies of a deeper pipeline become less of an issue.
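To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python.  It is a toy model only: the base CPI, branch frequency and miss rate are made-up illustrative values, not Intel figures, and it assumes a missed branch costs about one cycle per pipeline stage, which is a simplification.

```python
def time_per_instruction_ns(clock_ghz, pipeline_stages,
                            base_cpi=1.0, branch_fraction=0.2,
                            mispredict_rate=0.05):
    """Average time per instruction (ns) under a simple stall model.

    All default parameters are hypothetical. The penalty for a
    mispredicted branch is approximated as one cycle per pipeline stage.
    """
    # Extra cycles per instruction lost to pipeline flushes.
    stall_cpi = branch_fraction * mispredict_rate * pipeline_stages
    return (base_cpi + stall_cpi) / clock_ghz


# 20 stages stands in for Northwood, 31 for Prescott.
northwood_3_2 = time_per_instruction_ns(3.2, 20)
prescott_3_2 = time_per_instruction_ns(3.2, 31)
prescott_4_0 = time_per_instruction_ns(4.0, 31)

print(f"20 stages @ 3.2GHz: {northwood_3_2:.3f} ns/instruction")
print(f"31 stages @ 3.2GHz: {prescott_3_2:.3f} ns/instruction")
print(f"31 stages @ 4.0GHz: {prescott_4_0:.3f} ns/instruction")
```

With these invented inputs, the deeper pipeline is slower at the same clock and only pulls ahead once the clock climbs to roughly 3.5GHz or beyond.  Change the miss rate and the break-even point moves accordingly.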

Improved Branch Predictor:
Intel has been buffing up the Pentium 4's BPU ever since the chip's introduction with the Willamette core.  Here's where Prescott hopefully makes up some ground by avoiding branch misses altogether, since flushing the pipeline and coming back through it is like taking the long way home.  In fact, Intel claims to have enhanced both its static and dynamic branch prediction algorithms, such that the number of actual branch misses is significantly lower with Prescott than with Northwood.  In some cases Prescott's branch prediction is more accurate than Northwood's by a factor of two; in other scenarios the benefits are minimal.  Regardless, this is another critical area for Prescott since, clock for clock and with a deeper pipeline, the core could in fact be slower than Northwood without these enhancements.
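Intel has not published the details of Prescott's predictor, so the sketch below is not Intel's algorithm.  It is the textbook two-bit saturating counter, shown only to illustrate what "dynamic" branch prediction means and why some code patterns predict well while others do not.  The function name and the test patterns are our own.

```python
def count_mispredictions(outcomes, initial_state=2):
    """Count misses for a textbook 2-bit saturating counter predictor.

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    `outcomes` is a sequence of booleans (True = branch was taken).
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_pattern = ([True] * 9 + [False]) * 10
# A branch that flips direction every time it is executed.
alternating_pattern = [True, False] * 50

loop_misses = count_mispredictions(loop_pattern)
alternating_misses = count_mispredictions(alternating_pattern)
print(loop_misses, alternating_misses)  # prints "10 50"
```

Out of 100 branches each, the loop pattern misses only on each loop exit, while the alternating pattern misses half the time.  That spread is one way to picture why a better predictor helps a great deal in some workloads and hardly at all in others.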

Larger On Chip Caches And Buffers:
Prescott's larger 1MB L2 cache really needs no explanation, except to say that a larger cache means the core needs to go off chip less often to fetch data from system memory.  The benefits of a larger L2 cache will show up especially with applications that have a larger footprint, which historically had to run from system memory but can now reside more fully in on-chip cache.  In addition, Prescott brings a larger 16KB L1 data cache, but compared to the 64KB currently on AMD's Athlon 64, it still seems a bit smallish.  Finally, Prescott brings additional "WC Buffers" to bear for transfers to the graphics subsystem.  These "Write Combining" buffers assist with the flow management of data across the AGP bus, which in turn makes for more efficient use of front side bus bandwidth.  While this may seem like a solid benefit for current AGP graphics solutions, the real benefit will most likely come down the road, when PCI Express based graphics cards require more system bandwidth.
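As a rough illustration of the L2 cache point above, the standard "average memory access time" formula shows how much a few extra cache hits are worth.  The latencies and hit rates below are hypothetical round numbers chosen for easy arithmetic, not measurements of Prescott or Northwood.

```python
def average_access_time_ns(l2_hit_rate, l2_latency_ns=10.0,
                           memory_latency_ns=100.0):
    """Weighted average latency for accesses that reach the L2 cache.

    Both latency defaults are hypothetical illustrative values.
    """
    miss_rate = 1.0 - l2_hit_rate
    return l2_hit_rate * l2_latency_ns + miss_rate * memory_latency_ns


# Hypothetical hit rates for a smaller and a larger L2 cache.
smaller_cache = average_access_time_ns(0.90)
larger_cache = average_access_time_ns(0.95)
print(f"smaller L2: {smaller_cache:.1f} ns, larger L2: {larger_cache:.1f} ns")
```

Under these assumptions, raising the hit rate from 90% to 95% cuts the average access time from about 19 ns to about 14.5 ns, because each avoided trip to system memory is so expensive.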

Improved Hyperthreading Technology:
Multithreaded applications will benefit from Prescott's enhanced Hyperthreading engine.  Essentially, increased queue sizes and the chip's larger L1 data cache will alleviate some bottlenecks in situations where more than one thread is active.  In addition, specific "context identifier" bits are now available on each of Prescott's logical processors.  These allow both logical processors to share the L1 data cache, reducing instances of cache contention and increasing the cache hit rate during multithreaded processing.

13 New SSE3 Instructions:
The new SSE3 instructions that Prescott brings will have little impact on performance for now, but they give developers of multimedia and gaming products new tools to work with.  Floating point to integer conversion, complex arithmetic, video encoding and thread synchronization are some of the special functions that SSE3 addresses.  As in the early days of SSE and SSE2, adoption will take a while, but expect SSE3 to have a significant impact in the future, just as SSE and SSE2 do on Northwood with today's optimized applications.

90 Nanometer (.09 micron) Manufacturing Process:
So, how do you fit all these new enhancements on your plate and keep power, heat, cost and defect density under control?  It's all about die size, baby.  As we noted earlier, a Prescott core is less than half the size of a .13 micron Pentium 4 Extreme Edition core: 112 square mm versus 237 square mm.  Drop that svelte new core on a 300mm (think pizzeria sized here) wafer and the guys back in finance actually begin to smile at the profit margins.  Not to mention, customers may actually enjoy lower retail prices!  Intel has one of the first fab lines in the world to hit the 90nm (.09 micron) mark in volume production.  The operative word here is "volume".  Other manufacturers, like TSMC, currently have 90nm technology up and running but aren't quite volume ready at this point.  Prescott's launch marks another milestone for Intel and another industry first for the chip giant.
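To see why finance is smiling, here is a quick sketch using a common dies-per-wafer approximation and a simple Poisson yield model.  The die areas come from the figures above.  Everything else is an assumption for illustration: the defect density is invented, both dies are placed on a 300mm wafer purely for comparison, and scribe lines and edge exclusion are ignored.

```python
import math


def gross_dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation: wafer area / die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)


HYPOTHETICAL_DEFECTS_PER_CM2 = 0.5  # invented value, not an Intel figure

prescott_gross = gross_dies_per_wafer(300, 112)
p4ee_gross = gross_dies_per_wafer(300, 237)
prescott_good = prescott_gross * poisson_yield(112, HYPOTHETICAL_DEFECTS_PER_CM2)
p4ee_good = p4ee_gross * poisson_yield(237, HYPOTHETICAL_DEFECTS_PER_CM2)

print(f"Gross dies: {prescott_gross} vs {p4ee_gross}")
print(f"Good dies (hypothetical yield): {prescott_good:.0f} vs {p4ee_good:.0f}")
```

The smaller die wins twice over: more than twice as many candidates fit on the wafer, and each one is less likely to land on a defect, so the gap in good dies is wider still.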
 

Prescott And P4EE Vital Signs, Thermals And Overclocking