The IBM PowerPC 970

"There is much that is misunderstood when attempting to get a grasp on next-generation processor technologies such as the PowerPC 970, AMD's Opteron and Intel's Itanium. There is a good article on 970 technology at ArsTechnica. While this FAQ is not meant (by any means) to be exhaustive, or the end-all of 970 FAQs, it should provide a couple of bits of information that will help in understanding where the 970 is coming from. These were the issues that most confused me, until some very nice IBM folks sat down and set me straight!" --Bruce

A technical review of the IBM 970 CPU.
by Bruce Boettjer, Sr. Engineer for Momentum Computer
Actually, an awful lot! The smart folks at Mercury Computer wrote a white paper just after the G4 processor came out. Essentially, this paper showed the G4 to be quite capable, if one could only keep it well fed with data. In those days, the MPX bus was the fastest in the land (PowerPC land, anyway), but could still only provide a 'paltry' 800MBytes/sec in practice. What this meant for the G4 was that it spent most of its time waiting on data. One of the new things in the 970 is the EIO (Elastic IO) Front-Side bus, which is actually two unidirectional 32-bit buses. Each of these buses can move data at 500MHz DDR (1G Transfers/sec), which works out to 4GBytes/sec in each direction. This yields a much more satisfying 8GBytes/sec of aggregate throughput. To date, the author understands that over 95% bus utilization is achievable.
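The bus arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the text (two 32-bit buses, 1G transfers/sec each, 95% utilization, 800MBytes/sec for MPX); the function and constant names are made up for illustration and do not come from any IBM documentation.

```python
# Peak front-side bus throughput, using the figures quoted in the text.
# All names here are illustrative, not taken from IBM documentation.

BUS_WIDTH_BITS = 32                 # width of each unidirectional bus
TRANSFERS_PER_SEC = 1_000_000_000   # 500 MHz DDR = 1 G transfers/sec
NUM_BUSES = 2                       # one bus in each direction
MPX_BYTES_PER_SEC = 800_000_000     # practical MPX figure quoted for the G4


def peak_bytes_per_sec(width_bits, transfers_per_sec, num_buses):
    """Bytes moved per second if every transfer slot carries data."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * transfers_per_sec * num_buses


peak = peak_bytes_per_sec(BUS_WIDTH_BITS, TRANSFERS_PER_SEC, NUM_BUSES)
usable = peak * 95 // 100           # at the 95% utilization mentioned above
speedup_over_mpx = peak // MPX_BYTES_PER_SEC

print(f"peak:   {peak / 1e9:.1f} GBytes/sec")
print(f"usable: {usable / 1e9:.1f} GBytes/sec at 95% utilization")
print(f"about {speedup_over_mpx}x the practical MPX figure")
```

Note that the 8GBytes/sec figure is the sum of both directions; traffic flowing one way only sees half of it.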

To really understand the 970, one must investigate its breeding. The legacy of the 970 comes from IBM's Power4 architecture (another highly reliable processor that needed another processor to manage it). While Power4 had multiple CPUs on the same die, the 970 has only one, but with a lot of supercharged caveats... While this FAQ could probably inundate you with a lot of impressive numbers for the internals, what remains important is how those numbers are used - ergo, we'll stick to the elements that make the largest impact...
  1. This processor can maintain over 200 instructions 'in-flight' at any given moment. So this processor not only subscribes to Intel's 'do single instructions really fast' paradigm, but introduces a massively parallel architecture that allows the best of both sides of the same coin: "Do a heck of a lot of things, at the same time, really fast".

  2. This processor blurs the line between 32-bit and 64-bit computing applications. One has to be very careful when describing the capabilities of this machine, as specsmanship can confuse the whole picture irreparably.

    The SIMD (AltiVec) unit in the 970 is the same AltiVec unit that appears in all of the G4 family components. This provides unparalleled performance for 32-bit floating point operations (graphics, video processing, High-Performance 32-bit scientific applications, etc.), achieving a max of 8 FLOPS per clock cycle (i.e., a single-processor 1.4GHz machine would provide a peak of 11.2 32-bit SIMD GFLOPS).

    The native 64-bit floating point capability of the 970 is inherited from the Power4 architecture. There are 2 IEEE (64-bit) floating point units. These are in addition to, and separate from, the SIMD AltiVec unit. These floating point units are capable of providing 2 FLOPS per clock cycle (i.e., a single-processor 1.4GHz machine would provide a peak of 2.8 64-bit IEEE GFLOPS).

    Benchmarking programs that help people quantify the capabilities of a given processor are generally written with generic code, on generic compilers. This is a severe disadvantage to the 970, as generic compilers will not optimize generic code to take full advantage of the 970's capabilities. The 970, while very capable, will happily buzz along while a user attempts to perform 32-bit math on the 64-bit floating point units, without ever attempting to funnel this data through the SIMD unit, simply because the compiler didn't know about it. Once compiler technology catches up with this processor (very soon, actually), the benchmarks will better show the true advantages this processor can offer.

  3. The 970 has another nifty feature: native 64-bit operation. What this means is that all programs previously written for PowerPC (and AltiVec) will run on the 970, with very minor modifications. What this also means is that 64-bit programs will be able to run alongside the 32-bit programs, without a hiccup. There is no 32-bit 'mode'. Back in the day, when Intel was designing their new 32-bit machine, IBM and Motorola jointly produced 'Book 4', an architecture that later came to be known as PowerPC. It was designed as a 64-bit machine - only 32 bits have been implemented in 603, 604, G3 & G4 machines, to date. There is no 'special' instruction set for working on 64-bit data vs 32-bit data; only the address space and pointer size are different. In contrast, Opteron comes up in 32-bit mode out of reset and out of each and every interrupt taken. A separate set of GPRs is set aside for 64-bit operation. To achieve 64-bit 'mode', a user must climb out of the 32-bit reset/interrupt service routine, stuff the appropriate 64-bit registers and perform a context-switch for each and every instruction that is not in the same 32-bit or 64-bit mode as the last instruction executed. Messy, very messy. Running mixed-mode programs on Opteron & Itanium can result in spending as much time doing context-switches as the processor spends actually doing something useful. With the 970, there is only one set of GPRs and they work for all operations, all the time.
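The peak floating point figures quoted in item 2 are simply clock rate multiplied by FLOPS per cycle. A minimal sketch of that arithmetic follows, assuming the per-cycle figures given above (8 for the AltiVec unit, 2 for the pair of IEEE units) and the 1.4GHz clock used in the examples; the names are hypothetical. At that clock the results land in the billions, i.e. GFLOPS.

```python
# Theoretical peak = clock rate x floating point operations per cycle.
# Per-cycle figures are the ones quoted in the text; names are illustrative.

CLOCK_HZ = 1_400_000_000         # 1.4 GHz single-processor example
ALTIVEC_FLOPS_PER_CYCLE = 8      # 32-bit SIMD, as quoted above
FPU_FLOPS_PER_CYCLE = 2          # two 64-bit IEEE units, as quoted above


def peak_flops(clock_hz, flops_per_cycle):
    """Upper bound only: assumes every cycle retires the maximum work."""
    return clock_hz * flops_per_cycle


altivec_peak = peak_flops(CLOCK_HZ, ALTIVEC_FLOPS_PER_CYCLE)
fpu_peak = peak_flops(CLOCK_HZ, FPU_FLOPS_PER_CYCLE)

print(f"AltiVec 32-bit peak: {altivec_peak / 1e9:.1f} GFLOPS")
print(f"IEEE 64-bit peak:    {fpu_peak / 1e9:.1f} GFLOPS")
print(f"ratio: {altivec_peak // fpu_peak}x")
```

The 4x gap between the two peaks is why it matters whether a compiler routes 32-bit math through the SIMD unit or leaves it on the 64-bit floating point units.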

What are the differences between a 32-bit machine and a 64-bit machine?
This is a loaded question. There are really two valid answers, depending on which market you are coming from.

The embedded markets will see a 64-bit machine as having a lot more address space than a 32-bit machine. A 32-bit machine can address 2^32 (4GBytes) unique byte addresses and nothing else (i.e., a 32-bit card may allow a user to stuff 4GBytes of memory on the board, but there will be no address space left over for the processor to do anything but talk to the memory... not very useful). Applications that require large amounts of memory (greater than 2GBytes) can very quickly run out of address space when attempting to memory-map more than two processors together. The 970 allows for 42-bit physical addressing, with 64-bit virtual addressing. This amounts to 2^42 (4TBytes, 3 orders of magnitude larger than 32-bit space) unique physical addresses. This becomes more than adequate for folks who want to stuff large amounts of physical memory (greater than 4GBytes) on a local board and then connect this board to a large number of other boards that also have a lot of memory stuffed. The enhanced address space allows these boards to talk to each other, with significant message and data passing capabilities.
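The address space figures above follow directly from the number of address bits. A short sketch, assuming byte addressing and binary units (1 GByte = 2^30 bytes, 1 TByte = 2^40 bytes); the helper name is made up for illustration.

```python
# Addressable bytes for a given number of address bits (byte addressing).

GIB = 1 << 30   # 2^30 bytes
TIB = 1 << 40   # 2^40 bytes


def addressable_bytes(address_bits):
    """Each extra address bit doubles the number of unique addresses."""
    return 1 << address_bits


space_32 = addressable_bytes(32)   # classic 32-bit machine
space_42 = addressable_bytes(42)   # 970 physical addressing, per the text

print(f"32-bit: {space_32 // GIB} GBytes")
print(f"42-bit: {space_42 // TIB} TBytes")
print(f"42-bit space is {space_42 // space_32}x larger")
```

The ratio is 2^10 = 1024, which is where the "3 orders of magnitude" figure comes from.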

The High-Performance Compute (HPC) markets will see a 64-bit machine as a means to an end. Many HPC applications (Protein Folding code, Bioinformatics, extremely high precision simulation) absolutely, positively cannot handle the 'lack of precision' offered by 32-bit machines (a 32-bit multiply yields a 64-bit number, half of which is lost, because of the 32-bit architecture). 64-bit machines with 64-bit floating point units are the only way to achieve the level of exactness required by some of these application programs. The enhanced address space is very nice, but the real meat is in how fast a given 64-bit application can be reliably run on a 64-bit machine, or cluster of 64-bit machines (at supercompute facilities).
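The precision point can be illustrated in a few lines. The sketch below emulates a 32-bit register by masking the product to 32 bits, and uses Python's standard `struct` module to round a value through IEEE 32-bit storage. It models only the arithmetic, not how any particular processor or compiler handles the upper half of a product; the function names are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def mul_keep_low32(a, b):
    """Multiply, keeping only the low 32 bits (the upper half is lost)."""
    return (a * b) & MASK32


def mul_full(a, b):
    """Multiply two 32-bit values, keeping the full 64-bit product."""
    return (a * b) & 0xFFFFFFFFFFFFFFFF


def through_float32(x):
    """Round a Python float (IEEE 64-bit) to IEEE 32-bit and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


a = b = 65536                      # 2^16, fits easily in 32 bits
print(mul_keep_low32(a, b))        # the 2^32 product does not fit: 0
print(mul_full(a, b))              # 4294967296

x = 16777217.0                     # 2^24 + 1
print(through_float32(x))          # 16777216.0, the +1 is rounded away
print(x)                           # 16777217.0 survives in 64-bit
```

A 32-bit float carries a 24-bit significand, so integers above 2^24 can no longer all be represented exactly; a 64-bit float carries 53 bits.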

As a footnote, all mathematical operations can be broken down into a sequence of MACs (Multiply and Accumulate). The depth of a given architecture's pipeline, as well as the number of IEEE 64-bit floating point units (and AltiVec capability that can be used), will determine exactly what saturation point (level of performance) can be expected from a processor. Given the 970's ability to manage over 200 instructions 'in flight' at any given moment, coupled with both the 32-bit AltiVec unit and the two IEEE 64-bit Floating Point units, the potential for this processor to gain a strong foothold in both the embedded and HPC markets is very attractive.
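As a rough illustration of the MAC idea, the sketch below writes two common numerical kernels, a dot product and a polynomial evaluation (Horner's rule), so that every loop step is a single multiply followed by an accumulate. These two kernels are chosen as examples only; they are not taken from the original article, and the function names are made up.

```python
def dot_mac(xs, ys):
    """Dot product: one multiply-accumulate per pair of elements."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = acc + x * y          # the MAC step
    return acc


def horner_mac(coeffs, x):
    """Evaluate a polynomial (highest-order coefficient first).

    Each step multiplies the running value by x and adds the next
    coefficient, which is again one multiply plus one add.
    """
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c          # the MAC step
    return acc


print(dot_mac([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 4 + 10 + 18 = 32.0
print(horner_mac([2.0, 3.0, 1.0], 2.0))            # 2*4 + 3*2 + 1 = 15.0
```

The more such independent multiply-add steps a processor can keep in its pipelines at once, the closer real code gets to the peak figures.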


Why are FLOP and MIP benchmarks so deceiving on the 970?
Standard benchmarks have long stood as a way for humans to compare processors of different make and manufacture, on an 'apples to apples' level. This has been a good way of measuring capability, up to this point. To a large extent, though, benchmarking has become the art of specsmanship. The benchmarks themselves represent tasks that no ordinary application would subject a processor to in real life. In the case of the 970, this is a severe disadvantage. The 970 was designed by some very smart folks who understood the real bottlenecks in compute applications, and provided some really innovative solutions to address these issues. There are no benchmarks available that will measure how effectively the 970's speculative branch execution unit performs. There are no benchmarks that will measure the advantage of having 200 instructions 'in-flight', or be able to effectively measure latency. Certainly, as there has never been a processor with both 32-bit and 64-bit Floating Point units before, there is a great amount of momentum in the belief that there is also no existing benchmark that will properly give a measure of how much better, or worse, the 970 stacks up against other processors.

The only real benchmark that will prove itself for the 970 is the test of real application code. Only the true test of running an OS, with multiple threads running, will prove out this processor against others. It's not a pretty solution, but at least it's honest.

          Copyright ® 1999-2010. Fixstars Corporation. All rights reserved.