Nuclear weapon simulations show performance in molecular detail

The research is described in two papers, one presented during the IEEE Supercomputing Conference and the other during the International Symposium on High-Performance Parallel and Distributed Computing.

The researchers have developed automated methods to detect a glitch soon after it occurs. “You want the system to automatically pinpoint when and in what machine the error took place and also the part of the code that was involved,” Bagchi said. “Then, a developer can come in, look at it and fix the problem.”

One bottleneck arises from the fact that data are streaming to a central server. “Streaming data to a central server works fine for a hundred machines, but it can’t keep up when you are streaming data from a thousand machines,” said Purdue doctoral student Ignacio Laguna, who worked with Lawrence Livermore computer scientists. “We’ve eliminated this central brain, so we no longer have that bottleneck.”
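The article does not say what replaced the central server. As a rough illustration only, one common way to avoid a single collection point is to merge per-machine summaries in a tree, so that no node receives data from every machine. The function names and the event counts below are hypothetical.

```python
def central_fan_in(num_machines):
    """Messages a single central server receives if every machine streams to it."""
    return num_machines


def tree_fan_in(summaries, fanout=2):
    """Merge per-machine summaries level by level.

    Returns the merged summary and the largest number of messages
    any single node had to receive in one round.
    """
    level = list(summaries)
    max_received = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fanout):
            group = level[i:i + fanout]
            # The first node of each group receives from the others in it.
            max_received = max(max_received, len(group) - 1)
            next_level.append(sum(group))
        level = next_level
    return level[0], max_received


# Hypothetical input: one monitored event reported by each of 1,000 machines.
event_counts = [1] * 1000
total, busiest = tree_fan_in(event_counts, fanout=2)
print(central_fan_in(1000), total, busiest)
```

In this toy setup the central server would receive 1,000 messages, while in the tree no node receives more than one message per round, at the cost of more rounds.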

The release notes that each machine in the supercomputer cluster contains several cores, or processors, and each core might run one “process” during simulations. The researchers created an automated method for “clustering,” or grouping the large number of processes into a smaller number of “equivalence classes” with similar traits. Grouping the processes into equivalence classes makes it possible to quickly detect and pinpoint problems.

“The recent breakthrough was to be able to scale up the clustering so that it works with a large supercomputer,” Bagchi said.

Lawrence Livermore computer scientist Todd Gamblin came up with the scalable clustering approach.
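The idea of equivalence classes can be sketched with a toy example. This is a simple threshold-based grouping, not the scalable clustering method the article refers to, and the trait vectors, the tolerance value and the function name are all assumptions made for illustration.

```python
from math import dist


def group_processes(traits, tolerance=0.1):
    """Group process ranks whose trait vectors lie within `tolerance`
    of a class representative (the first member placed in each class)."""
    classes = []  # list of (representative_vector, [ranks])
    for rank, vector in sorted(traits.items()):
        for representative, members in classes:
            if dist(representative, vector) <= tolerance:
                members.append(rank)
                break
        else:
            # No existing class is close enough, so start a new one.
            classes.append((vector, [rank]))
    return [members for _, members in classes]


# Hypothetical traits per process:
# (fraction of time computing, fraction of time communicating)
traits = {
    0: (0.80, 0.20),
    1: (0.81, 0.19),
    2: (0.79, 0.21),
    3: (0.30, 0.70),  # behaves differently from the rest
    4: (0.80, 0.20),
}

classes = group_processes(traits)
suspects = min(classes, key=len)  # the smallest class is worth a closer look
print(classes, suspects)
```

With only a handful of classes to inspect instead of every process, a process that lands in a small class of its own stands out as a candidate for closer inspection.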

A lingering bottleneck in using the simulations is related to a procedure called checkpointing, or periodically storing data to prevent its loss in case a machine or application crashes. The information is saved in a file called a checkpoint and stored in a parallel file system distant from the machines on which the application runs.

“The problem is that when you scale up to 10,000 machines, this parallel file system bogs down,” Bagchi said. “It’s about 10 times too much activity for the system to handle, and this mismatch will just become worse because we are continuing to create faster and faster computers.”

Doctoral student Tanzima Zerin and Rudolf Eigenmann, a professor of electrical and computer engineering, along with Bagchi, led work to develop a method for compressing the checkpoints, similar to the compression of data for images.

“We’re beginning to solve the checkpointing problem,” Bagchi said. “It’s not completely solved, but we are getting there.”
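Compressing a checkpoint before it is written can be sketched as follows. This sketch uses general-purpose lossless compression from Python's standard zlib module, which may differ from the researchers' method, and the application state shown is a hypothetical placeholder.

```python
import pickle
import zlib


def save_checkpoint(state):
    """Serialize the state and return both the raw and the compressed bytes."""
    raw = pickle.dumps(state)
    return raw, zlib.compress(raw, 6)  # 6 is a mid-range compression level


def load_checkpoint(blob):
    """Restore the state from compressed checkpoint bytes."""
    return pickle.loads(zlib.decompress(blob))


# Hypothetical application state: a step counter and a large, repetitive grid.
state = {"step": 1200, "grid": [0.0] * 100_000}

raw, blob = save_checkpoint(state)
restored = load_checkpoint(blob)
print(len(raw), len(blob), restored == state)
```

The smaller the compressed checkpoint, the less data each machine has to push to the shared file system. How much is saved depends heavily on how repetitive the data are.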

The checkpointing bottleneck must be solved in order for researchers to create supercomputers capable of “exascale computing,” or 1,000 quadrillion operations per second.

“It’s the Holy Grail of supercomputing,” Bagchi said.

The research has been funded by Lawrence Livermore and the National Science Foundation. The work also involves Lawrence Livermore scientists Greg Bronevetsky, Dong H. Ahn, Martin Schulz and IBM Austin researcher Mootaz Elnozahy.

The conference papers can be accessed online at IEEE Xplore, the ACM Digital Library or the research group’s home page.

The release also notes that Purdue researchers did not work with the actual classified nuclear weapons software code, but instead used generic benchmarks, a set of programs designed to help evaluate the performance of parallel supercomputers.