
New Programming Approach Seeks to Make Large-Scale Computation More Reliable

UChicago News (IL) (10/07/15) Benjamin Recchie

As computer components become smaller and smaller, packing more transistors into less space than ever before, more errors are likely to crop up in the computations the hardware carries out. This effect is likely to be especially pronounced in high-performance computers, and researchers at the University of Chicago’s Computation Institute are developing a new method to correct for these errors. Most computers today use a technique called checkpoint restart, which periodically saves the state of a calculation so that if an error occurs, the computer can revert to a recent checkpoint rather than start over from the beginning. However, even this method is likely to prove insufficient as systems grow more complex and errors become more frequent. Andrew Chien and his colleagues at the Computation Institute are experimenting with a new technique called Global View Resilience, which enables applications not only to save work that is underway, but also to perform flexible error checking and self-repair while running. Chien’s group has tested the method on the Midway supercomputing cluster located on the university’s Hyde Park campus. Chien says the new method has proven very reliable at compensating for errors the researchers introduced deliberately.
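The checkpoint-restart idea described above can be sketched in a few lines. The following is a minimal illustration, not the Global View Resilience API (whose interfaces are not described in the article): a loop periodically snapshots verified state, runs a simple sanity check each step, and on a detected error rolls back to the last good checkpoint instead of restarting from scratch. The function name, the error-injection parameter, and the sanity check are all hypothetical stand-ins for an application's real integrity tests.

```python
import copy

def run_with_checkpoints(steps, checkpoint_every, inject_error_at=None):
    """Sum the integers 1..steps, checkpointing every few iterations
    and rolling back when a (simulated) error corrupts the state."""
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)  # last known-good state
    while state["step"] < steps:
        state["step"] += 1
        state["total"] += state["step"]
        # Simulate a transient hardware fault corrupting state once.
        if state["step"] == inject_error_at:
            state["total"] = -999       # corrupted value
            inject_error_at = None      # the fault fires only once
        # Error check: a trivial sanity test stands in for the
        # application-level checks a resilience layer would run.
        if state["total"] < 0:
            state = copy.deepcopy(checkpoint)  # roll back, redo work
            continue
        if state["step"] % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)  # save verified state
    return state["total"]
```

For example, `run_with_checkpoints(10, 5, inject_error_at=7)` returns 55 (the sum 1+2+…+10) despite the injected fault, because the loop rolls back to the step-5 checkpoint and recomputes. A single rolling checkpoint like this is the simplest form of the idea; the article suggests Global View Resilience goes further, checking and repairing state while the computation is in flight rather than only rolling back.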