Intel@UT: “Hundreds of Cores: Verification Challenges of Tera-scale Computers”

Earlier this week I attended a presentation hosted by the Computer Engineering Research Center at The University of Texas at Austin, given by Brian Moore, Director of Validation Research in Intel's Microprocessor Technology Lab.  In "Hundreds of Cores: Verification Challenges of Tera-scale Computers" (sorry, I haven't been able to find a link to the actual slides yet), Moore discussed recent advances in computer architecture and the challenges validation teams will face as a result.  I was hoping he would delve into the types of tools and techniques Intel uses to validate multi-core processors, and whether he believes those technologies will scale.  Instead, the discussion was more general, perhaps aimed more at motivating the students in attendance.  Below, I'll summarize the talk and point out where I would have liked to learn more.

The Presenter

As director of the Validation Research Lab, Moore is responsible for understanding how changes in technology affect "pre-silicon validation, post-silicon validation, and in-situ runtime validation".  Take a look at his biography for more info.  One thing about Moore that caught my attention was that he managed the design and validation of the interconnect components in the Tera-FLOP machine, AKA "ASCI Red".  (See also this paper from Intel describing the system's architecture in more detail.)

Figure 1: VP Rick Stulen and Intel designer Stephen Wheat look at the innards of an ASCI Red rack. The machine's easy accessibility made it possible to upgrade the processors, assuring it would remain one of the world's fastest computers for nearly a decade. (Courtesy of Sandia National Labs)

I actually saw the Tera-FLOP machine on my first co-op rotation with Intel back in 1997, before it was shipped to Sandia National Labs. To this day I can remember being awestruck by the sight of almost ten thousand Pentium Pro processors and the associated supporting infrastructure.  And, strangely enough, I worked closely with someone at Intel who had been part of the ASCI Red project and whom Moore knew as well.  Anyway, I digress… Let's talk about the presentation.

The Presentation

Introduction

As mentioned above, back in 1997 Intel delivered the Tera-FLOP machine to Sandia.  The machine was built from ~9200 Pentium Pro processors, weighed 44 tons, and consumed 500kW of power.

Recently, Intel announced an 80 core test chip:

  • 80 Floating Point cores
  • 1 TFLOP at 3GHz, 62W
  • 2 TFLOPs at 6GHz

The chip didn't have the massive I/O or memory that the Tera-FLOP machine had, but it is interesting that in 10 years the computing power in raw FLOPS has been miniaturized to fit on a single die.  According to Moore, significant advances in validation must occur in order to get chips of this scale working.  Why?  As devices get smaller, you'll tend to have more cores with greater:

  • Soft error rates
  • Device variation
  • Time-dependent degradation

Also, techniques used to catch bad devices during manufacturing, such as burn-in, may become infeasible.  Burn-in involves thermally stressing a part and watching for failures that occur as a result of the temperature variation.  The problem is that as devices get smaller, they become more and more sensitive to variations of this sort, to the point where even normally functioning cores will start to fail during burn-in testing.

Definitions

Moore defined validation to include pre-silicon, post-silicon, and "in-situ" validation.  For the novice reading this site (or perhaps someone from The Vegan Lunchbox), I'll describe each of these areas.

Pre-silicon validation is used to prevent bugs introduced during the initial architecture and RTL development phases from escaping into silicon.  Techniques include building software simulations of the design in Verilog, VHDL, C++, or a high-level verification language (such as e, SystemVerilog, SystemC, or Vera).  Often, emulation and HW/SW co-simulation are used to increase simulation speed and allow software to be tested in advance of hardware completion.  My focus on this site is usually on topics related to this phase of the verification effort.
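To make the idea concrete, here's a minimal pre-silicon simulation sketch in SystemVerilog.  The DUT (a registered adder) and every name in it are hypothetical examples of mine, not anything from Moore's talk: constrained-random stimulus drives the design, and a concurrent assertion acts as the checker.

    // Hypothetical example: constrained-random stimulus + assertion checker.
    module tb;
      logic       clk = 0;
      logic [7:0] a, b;
      logic [8:0] sum;
      bit         chk_en;   // enable checking once inputs are valid

      always #5 clk = ~clk;

      // Stand-in device under test: a simple registered adder.
      always_ff @(posedge clk) sum <= a + b;

      // Constrained-random transaction, biased toward corner values.
      class add_txn;
        rand bit [7:0] a, b;
        constraint corners { a dist { 8'h00 := 1, 8'hFF := 1, [8'h01:8'hFE] :/ 8 }; }
      endclass

      // Checker: one cycle after inputs are applied, sum must match them.
      assert property (@(posedge clk) chk_en |-> sum == $past(a) + $past(b))
        else $error("sum mismatch: got %0d", sum);

      initial begin
        add_txn t = new();
        repeat (200) begin
          @(negedge clk);              // drive away from the sampling edge
          void'(t.randomize());
          a <= t.a;
          b <= t.b;
          @(posedge clk) chk_en <= 1;  // result is checked on the next posedge
        end
        $finish;
      end
    endmodule

Obviously a production flow layers coverage collection, scoreboards, and regression management on top of this, but the stimulus/checker split is the heart of the approach.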

Post-silicon validation is used to detect failures that have escaped into silicon.  In this phase, live devices are tested under real-life conditions to ensure, for example, that the necessary operating systems boot, that firmware behaves correctly, that devices can transfer large amounts of data and interoperate with other devices running the same protocols, and a host of other things.  Completion of the pre- and post-silicon phases is usually considered a requirement for shipping production silicon, though the associated teams will continue to work with technical support as needed when bugs arise in the field.

In-situ validation is used to recover from failures in real time in production parts, and potentially to monitor and report those failures.  For example, in a multi-core device, any particular core might develop problems over time.  A reasonable response would be to disable that core and reconfigure the system so it can continue to operate.
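As a rough illustration of that reconfiguration idea (my own sketch, not a description of any Intel mechanism), here's a SystemVerilog module that fences off a core once its error count crosses a threshold; a scheduler could then read the enable mask and stop routing work to the failed core.

    // Hypothetical example: fence off cores whose error counts exceed a limit.
    module core_fence #(
      parameter int NUM_CORES = 8,
      parameter int ERR_LIMIT = 16
    ) (
      input  logic                 clk,
      input  logic                 rst_n,
      input  logic [NUM_CORES-1:0] core_err,     // per-core error strobes
      output logic [NUM_CORES-1:0] core_enable   // 1 = core may be scheduled
    );
      logic [7:0] err_cnt [NUM_CORES];

      always_ff @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
          core_enable <= '1;                     // assume all cores healthy at reset
          for (int i = 0; i < NUM_CORES; i++) err_cnt[i] <= '0;
        end else begin
          for (int i = 0; i < NUM_CORES; i++) begin
            if (core_err[i] && core_enable[i])
              err_cnt[i] <= err_cnt[i] + 1;
            if (err_cnt[i] >= ERR_LIMIT)
              core_enable[i] <= 1'b0;            // fence off the failing core
          end
        end
      end
    endmodule

The hard part, of course, is deciding what counts as an "error" in the first place, which is exactly where the pre/post/in-situ boundaries start to blur.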

A Thought Experiment

Moore spent a decent amount of time describing how many cores you might reasonably expect to find on-die in the next 5-10 years, how complex those cores might be (Pentium, Intel Core, or P4), and what the optimal number of cores might be based on different process technologies.  For our purposes though, the main thing to consider is the types of issues that are likely to crop up in a multi-core system.

Moore proposed a simple thought experiment to highlight some of the issues.  What if pre-silicon verification could prove that a design had zero bugs? (This is impossible for anything other than a trivial design, and no, I don't believe even formal verification can close the gap here, since some of the bugs could be architectural.)  Even given a perfect design, real silicon will have bugs due to transient, process, and time-dependent errors.  These bugs will manifest themselves just like actual logic bugs, making it difficult to know which type of issue you're dealing with unless validation techniques improve.  Moore was especially worried about tracking down issues related to degradation over time.  Because it can be difficult to determine which phase a bug belongs to (pre-silicon, post-silicon, or in-situ validation), it is going to be necessary to share information and techniques between the phases.  (Yes, I know, you'd like to know what those techniques are… so would I!  Keep reading, but the details that would have made the talk significantly more interesting were sadly absent.)

Research Areas

According to Moore, the Validation Research Lab at Intel has been around for about nine months, formed to address just the sorts of issues described above.  The lab has come up with a list of research areas that need attention in order to support "hundreds of cores".  A bit on the generic side, but interesting nonetheless.

Research to Enable SoC Optimization and Validation

  1. Raise the level of abstraction – Move design work up from the gate and RTL levels so that the most complicated parts of the design become simpler to express.  I wondered at the time whether this means Intel is looking to tools like SystemC to raise the level of design abstraction.  The same idea obviously holds true in the verification space as well.
  2. Create an equivalence link between the specification and the implementation.  Though he didn't say so specifically, this must be similar to the formal equivalence checking that takes place between RTL and gate-level designs, applied one level up.  Will this ever be possible?  I doubt it, but perhaps Moore knows something I don't.
  3. Simulation acceleration/emulation.
  4. Micro-architectural coverage metrics, mechanisms, and analysis.
  5. Modular validation. 
  6. System level validation.

The biggest problem in verification, according to Moore, is knowing when you're done.  Reuse can help, and may be a big factor in making it possible to build devices with hundreds of cores.
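Item 4 in the list above is one concrete way to make the "when are we done?" question measurable.  As a hypothetical sketch (the events and names below are mine, not Intel's), a micro-architectural coverage model in SystemVerilog might sample pipeline events and cross them, so the hard-to-hit corners become explicit bins to fill rather than a gut feeling:

    // Hypothetical example: micro-architectural coverage model.
    interface uarch_cov_if (input logic clk);
      logic       stall;           // pipeline stall this cycle
      logic       branch_mispred;  // branch mispredict this cycle
      logic [1:0] cache_evt;       // 0: hit, 1: miss, 2: evict, 3: fill

      covergroup cg @(posedge clk);
        cp_stall   : coverpoint stall;
        cp_mispred : coverpoint branch_mispred;
        cp_cache   : coverpoint cache_evt {
          bins hit = {0}; bins miss = {1}; bins evict = {2}; bins fill = {3};
        }
        // The interesting corners are the crosses, e.g. a mispredict that
        // lands while the cache is evicting during a stall.
        x_all : cross cp_stall, cp_mispred, cp_cache;
      endgroup

      cg cov = new();
    endinterface

Whether metrics like these can scale to hundreds of cores is, presumably, exactly the research question.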

Research to Address 1K-100K DPM

Moore also covered the research areas needed to tackle problems in the 1K-100K Defects Per Million (DPM) range.  I'm anything but an expert in this area, so I apologize for the terseness of this section:

  1. Marginal Circuit Analysis
  2. Advanced scan techniques

Conclusion

So, there you have it.  There are big challenges ahead for validation teams everywhere, and we need to address them… :)  Again, the talk was mostly aimed at graduate students, so it's hard to fault Moore for the lack of detail (especially since the presentation only just finished within the allotted time).  Still, I'd like to know a few things related to pre-silicon validation:

  • What tools and techniques does Intel use for validation currently? (I've got a decent idea from my time there but things have likely changed in the last few years).
  • Does Intel believe these tools and techniques can grow to handle the upcoming challenges?  If not, why not (in detail)?
  • How many people is it going to take to validate these types of large devices?  When I was there as part of the Networking group, we always ended up having to do things our own way because we didn't have the vast teams or resources available to the processor teams.  However, I don't see how ever-increasing team sizes will scale, especially given recent cost-cutting efforts at Intel.  What's the solution?
  • Does there need to be more industry collaboration to solve validation problems?  For example, would it make sense to start opening up development of verification IP into a more open-source model to leverage the strengths of the vast sea of verification engineering talent?

I'll try to dig a little deeper and see if I can get additional info on the topic.  As always, stay tuned!
