Tuesday 3 April 2012

The NVIDIA Blog

No Free Lunch for Intel MIC (or GPUs)

Posted: 03 Apr 2012 09:53 AM PDT

The recent news and industry reaction regarding Intel's forthcoming "Many Integrated Core" (MIC) accelerator has been interesting to watch. It appears Intel, like NVIDIA and AMD, has now concluded that hybrid architectures are the proper response to the growing power constraints of high performance computing.

While I agree with this, some of the discussions around programming the upcoming MIC chips leave me scratching my head – particularly the notion that, because MIC runs the x86 instruction set, there's no need to change your existing code, and your port will come for free.

Power is the Problem

The technology underpinnings responsible for the move toward hybrid computing are pretty compelling, driven by the huge inflection point we experienced in the previous decade. Moore's Law is alive and well, continuing to dish up more and more transistors per square mm. But Dennard Scaling is not.

We can no longer reduce voltage in proportion to transistor size, so the energy per operation is no longer dropping fast enough to compensate for the increased density. The result is that processors are now totally constrained by power. And it's getting exponentially worse with subsequent generations of integrated circuits!

Circuit performance per watt is still improving, but now at closer to 20 percent per year instead of the almost 70 percent per year we used to enjoy. So how can we continue to improve performance anywhere close to historic rates, and achieve exascale computing by the end of this decade? Since the underlying technology is going to fall far short of the improvements we need, our only hope is to dramatically reduce the overhead per operation.
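
To put those two rates side by side, here is a minimal C sketch that compounds each one over a number of years. The function name `compound_gain` and the ten-year horizon in the note below are illustrative assumptions, not figures from any roadmap.

```c
#include <assert.h>

/* Cumulative improvement factor after `years` of compounding at
 * `annual_rate` (0.20 means 20 percent per year). */
double compound_gain(double annual_rate, int years)
{
    assert(annual_rate > -1.0 && years >= 0);
    double gain = 1.0;
    for (int i = 0; i < years; i++)
        gain *= 1.0 + annual_rate;
    return gain;
}
```

Over ten years, `compound_gain(0.20, 10)` comes to roughly 6x, while `compound_gain(0.70, 10)` comes to roughly 200x. That difference is the shortfall that has to be made up by cutting overhead per operation.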

Hybrid is the Answer

This is where hybrid architectures come in. NVIDIA GPUs implement hundreds of simple, power-efficient cores that are optimized for high throughput on parallel workloads. Multicore x86 processors implement a handful of complex cores that are optimized for fast single-thread performance, but take many times more energy per operation.

To improve application performance per watt, we have to shift most of the work to the throughput-optimized cores, and just use the fast (but less efficient) CPU cores for the residual serial work. This is a hybrid architecture. Since you can't optimize a core for both energy-efficiency and fast single-thread performance, the hybrid architecture allows us to concentrate on making the GPU cores more and more energy efficient, while relying on the CPU cores for serial performance.

Intel has announced a similar approach with MIC. They don't really have the equivalent of a throughput-optimized GPU core, but were able to go back to a 15+ year-old Pentium design to get a simpler processor core, and then marry it with a wide vector unit to get higher flops per watt than can be achieved by Xeon processors.

So far, so good. But I'm perplexed when I hear some people say that there's no need to change your existing code to run on MIC because it uses the x86 instruction set. Just recompile with the -mmic flag, and your existing MPI or OpenMP code will run natively on the MIC processor! (In other words, ignore the Xeon, and just use the MIC chip as a big multi-core processor.)

Native Mode Complications

Functionally, a simple recompile may work, but I'm convinced it's not practical for most HPC applications and doesn’t reflect the approach most people will need to take to get good performance on their MIC systems.

The idea of running flat MPI code (one rank per core) on a multi-node MIC system seems quite problematic. Whatever memory sits on the MIC PCIe card will be shared by more than 50 cores, leading to very small memory per core. From what I know of the MPI communication stack, that won't leave much memory for the actual data – certainly far below the traditional 1-2 GB/core most HPC apps want.  And 50+ cores all trying to send messages through the system interconnect NIC seems like a recipe for a network logjam. The other concern is the Amdahl's Law bottleneck resulting from executing all the per-rank serial code on a lower-performance, Pentium-class scalar core.
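
A back-of-the-envelope sketch in C makes both concerns concrete. The function names are made up for this example, and the 8 GB card size, 5 percent serial fraction, and 4x scalar slowdown used in the note below are hypothetical numbers chosen only for illustration. The Amdahl model is deliberately generous: it assumes the parallel part scales perfectly and penalizes only the serial part.

```c
#include <assert.h>

/* Memory available to each MPI rank when one rank runs per core and the
 * card's memory is split evenly (ignores the OS and MPI buffers). */
double memory_per_rank_gb(double card_memory_gb, int ranks)
{
    assert(card_memory_gb >= 0.0 && ranks > 0);
    return card_memory_gb / ranks;
}

/* Amdahl's Law speedup relative to a single fast CPU core.
 * serial_fraction: share of the original run time that cannot be parallelized.
 * cores:           cores sharing the parallel part (perfect scaling assumed).
 * serial_slowdown: how many times slower the simple core runs serial code. */
double amdahl_speedup(double serial_fraction, int cores, double serial_slowdown)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(cores > 0 && serial_slowdown > 0.0);
    double serial_time = serial_fraction * serial_slowdown;
    double parallel_time = (1.0 - serial_fraction) / cores;
    return 1.0 / (serial_time + parallel_time);
}
```

With a hypothetical 8 GB card and 50 ranks, `memory_per_rank_gb(8.0, 50)` is 0.16 GB, well under the 1-2 GB/core figure. With 5 percent serial code on 50 cores, `amdahl_speedup(0.05, 50, 1.0)` is about 14.5x; if the serial part runs 4x slower on the simpler core, `amdahl_speedup(0.05, 50, 4.0)` drops to about 4.6x.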

The OpenMP approach seems only slightly better. You'd still have the very small per-core memory and the Amdahl's Law bottleneck, but at least you'd have fewer threads trying to send messages out the NIC.  Perhaps the biggest issue with this approach is that existing OpenMP codes, written for multi-core CPUs, are unlikely to have enough parallelism exposed to profitably occupy over 50 vector cores.

What About MIC Performance?

So far, the discussions of MIC programming have avoided confronting these issues by excluding any talk of performance.

We've seen scaling charts for MIC that show performance improving as more cores are used, but there is no absolute performance shown. And the "scaling" results are literally for a single chip (not really scaling at all in the HPC sense). It all looks eerily similar to the original Larrabee GPU charts from four years back.

To be fair, Knights Ferry is a pre-production prototype and thus performance is not supposed to be competitive. But, it just doesn’t make sense to talk about ease of programming in the absence of any performance considerations.

The whole point of an accelerator is to accelerate! What programming effort will be necessary on MIC to actually get good performance?

No "Magic" Compiler

The reality is that there is no such thing as a "magic" compiler that will automatically parallelize your code. No future processor or system (from Intel, NVIDIA, or anyone else) is going to relieve today's programmers from the hard work of preparing their applications for the future.

With clock rates stalled, all future performance increases must come from increased parallelism, and power constraints will actually cause us to use simpler processors at lower clock rates for the majority of our work, further exacerbating this issue.

At the high end, an exaflop computer running at about 1 GHz will require approximately one billion-way parallelism, but the same logic will drive up required parallelism at all system scales. This means that all HPC codes will need to be cast as throughput problems with massive numbers of parallel threads. Exploiting locality will also become ever more important as the relative cost of data movement versus computation continues to rise. This will have a significant impact on the algorithms and data layouts used to solve many science problems, and is a fundamental issue not tied to any one processor architecture.
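
The billion-way figure follows from simple division, sketched here in C under the assumption that each thread or vector lane completes one operation per clock cycle. `required_parallelism` is a hypothetical helper name.

```c
#include <assert.h>

/* Operations that must be in flight every cycle to sustain a target rate,
 * assuming each parallel lane completes one operation per clock cycle. */
double required_parallelism(double ops_per_second, double clock_hz)
{
    assert(ops_per_second >= 0.0 && clock_hz > 0.0);
    return ops_per_second / clock_hz;
}
```

`required_parallelism(1e18, 1e9)` returns 1e9: an exaflop (10^18 operations per second) at 1 GHz needs about a billion operations per cycle. The same division for a 1 petaflop system at 1 GHz still gives million-way parallelism.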

Directives: Performance + Portability

Portability across machines is very important for application developers, and directives are a great way to express parallelism in a portable manner. The programmer focuses on exposing the parallelism, while the compiler and runtime focus on mapping it to the underlying hardware (perhaps with the help of auto-tuners). The new OpenACC standard allows users to express locality as well as parallelism, and is particularly well suited to today's emerging hybrid architectures.
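
As a minimal sketch of what this can look like in C (the routine `repeated_saxpy` and its arguments are invented for illustration), the `data` directive expresses locality by keeping the arrays on the accelerator across several kernel launches, and `parallel loop` exposes the parallelism. A compiler without OpenACC support ignores the pragmas and runs the loop serially, so the same source still builds for an ordinary CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Apply y = a*x + y to every element, `steps` times in a row. */
void repeated_saxpy(int n, int steps, float a, const float *x, float *y)
{
    assert(n >= 0 && steps >= 0 && x != NULL && y != NULL);

    /* Locality: copy x in once, and copy y in and out once, for the whole
     * region instead of moving the arrays around every step. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        for (int s = 0; s < steps; s++) {
            /* Parallelism: every iteration of this loop is independent. */
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }
}
```

The split of responsibilities matches the description above: the source says what is parallel and which data should stay put, and leaves the mapping to hardware to the compiler and runtime.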

Existing OpenMP codes can also be taken forward, but will require some additional work.  Unfortunately, most OpenMP codes today apply the parallel directives to inner loops, which is appropriate for exploiting modest parallelism across a small number of cores. In order to run well on future machines with much more on-node parallelism, however, the directives need to be raised up in the call tree to expose much greater amounts of parallelism.
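
The difference can be sketched in C with a toy grid-scaling routine. All function names here are hypothetical, and this is only one way to read "raising" a directive: the first version parallelizes the innermost loop inside a leaf routine, the second moves the directive up to the caller so that one parallel region covers the whole grid.

```c
#include <assert.h>
#include <stddef.h>

/* Inner-loop style: the directive lives in the leaf routine, so a parallel
 * region is opened and closed once per row. */
void scale_row_inner(double *row, int cols, double factor)
{
    #pragma omp parallel for
    for (int j = 0; j < cols; j++)
        row[j] *= factor;
}

void scale_grid_inner(double *grid, int rows, int cols, double factor)
{
    assert(grid != NULL && rows >= 0 && cols >= 0);
    for (int i = 0; i < rows; i++)   /* serial loop over rows */
        scale_row_inner(grid + (size_t)i * cols, cols, factor);
}

/* Raised style: the leaf routine is plain serial code ... */
static void scale_row_serial(double *row, int cols, double factor)
{
    for (int j = 0; j < cols; j++)
        row[j] *= factor;
}

/* ... and the directive sits one level up, over whole rows, so a single
 * parallel region spans all the work. */
void scale_grid_raised(double *grid, int rows, int cols, double factor)
{
    assert(grid != NULL && rows >= 0 && cols >= 0);
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        scale_row_serial(grid + (size_t)i * cols, cols, factor);
}
```

Both versions compute the same result. In the raised version each thread gets a larger chunk of work per fork/join, and the inner loop is left free for the compiler to vectorize.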

No Free Lunch

This will take effort, but it's work that makes the applications inherently better suited for future architectures. Initial experience tuning codes for the new NVIDIA GPU-accelerated Titan supercomputer at Oak Ridge National Laboratory has been very positive, providing significant acceleration on key scientific codes.

When OpenACC was used to express the parallelism and locality, the same code that had been optimized for GPUs also ran significantly faster on vanilla multicore CPU systems. Tuning HPC codes for accelerators is real work, but it is work that will pay off across machine types and especially on future machines with increased levels of parallelism.

Which brings me back to the topic of programming for the upcoming MIC processors.

It's clear to me that hybrid architectures make increasing sense in our power-constrained future, and Intel's MIC effort shows they think so too. The upcoming Knights Corner processor will reportedly look much like today's Fermi GPUs: power-efficient accelerators attached to an x86 CPU via PCIe. Programming the two architectures should be very similar: structure applications to expose parallelism and locality, and express them via directives; use the multi-core CPUs for serial code, and execute the parallel kernels on the accelerator. The hope that unmodified HPC applications will work well on MIC with just a recompile is not really credible, nor is talking about ease of programming without consideration of performance.

There is no free lunch. Programmers will need to put in some effort to structure their applications for hybrid architectures. But that work will pay off handsomely for today's, and especially tomorrow's, HPC systems.

Get Started Today

It remains to be seen how Intel MIC will perform when it eventually arrives. But why wait? Better to get ahead of the game by starting down the hybrid multicore path now.

You can start today with NVIDIA GPUs, and you'll be that much further ahead regardless of which processor architecture you ultimately choose.

If you currently have access to MIC chips and have been testing real applications, I would love to hear from you. Post a comment below sharing your experiences and results. I'm also interested in your thoughts on the move to hybrid multicore architectures and how we'll need to program them.
