Reinforcement learning discovers silent data errors

For high-performance chips in massive data centers, mathematics can be the enemy. At the scale of computation that occurs in hyperscale data centers, running around the clock across millions of nodes and vast amounts of silicon, extremely uncommon errors inevitably appear. It is simply statistics. These rare “silent” data errors do not show up during conventional quality-control screening, even when companies spend hours looking for them.

This month at the IEEE International Reliability Physics Symposium in Monterey, California, Intel engineers described a technique that uses reinforcement learning to discover more silent data errors, faster. The company is using the machine-learning method to assure the quality of its Xeon processors.

When an error occurs in a data center, operators can take a node down and replace it, or relegate the faulty system to lower-risk computing, says Manu Shamsa, an electrical engineer at Intel’s Chandler campus in Arizona. But it would be much better if errors could be detected earlier. Ideally, they would be caught before a chip is ever incorporated into a computer system, when it is still possible to make design or manufacturing corrections that keep the errors from recurring in the future.

“On a laptop, you will never notice an error. In data centers, with really dense nodes, there is a good chance that the stars align and an error occurs.” —Manu Shamsa, Intel

Finding these defects is not easy. Shamsa says engineers have been so bewildered by them that they joked the errors must be due to “spooky action at a distance,” Einstein’s phrase for quantum entanglement. But there is nothing spooky about them, and Shamsa has spent years characterizing them. In a paper presented at the same conference last year, his team offered a catalog of the causes of these errors. Most are due to infinitesimal variations in manufacturing.

Even if every one of the billions of transistors on a chip is functional, they are not completely identical to one another. Subtle differences in how a given transistor responds to changes in temperature, voltage, or frequency, for example, can lead to an error.

These subtleties are far more likely to surface in large data centers because of the pace of computing and the sheer amount of silicon involved. “On a laptop, you will never notice an error. In data centers, with really dense nodes, there is a good chance that the stars align and an error occurs,” says Shamsa.

Some errors may arise only after a chip has been installed in a data center and has been running for months. Small variations in transistors’ properties can cause them to degrade over time. One of the silent errors Shamsa has found is related to electrical resistance. A transistor that works correctly at first, and passes the standard tests that look for shorts, can degrade with use and become more resistive.

“You are thinking everything is fine, but underneath, an error is causing an incorrect decision,” says Shamsa. Over time, thanks to a slight weakness in a single transistor, “one plus one goes to three, silently, until you see the impact,” he says.

The new technique builds on an existing set of methods for detecting silent errors, called Eigen tests. These tests make the chip work through difficult math problems repeatedly, over a period of time, in the hope of making silent errors evident. They involve operations on various matrices filled with random data.
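For flavor, here is a minimal sketch in Python of what such a stress test does in spirit: run the same random matrix multiplication twice and flag any disagreement, since a healthy chip should give identical answers both times. This is not OpenDCDiag code, and the function name, sizes, and repetition counts are illustrative assumptions; real test content exercises the hardware far harder.

```python
import numpy as np

def matrix_stress_test(size=512, repetitions=1000, seed=0):
    """Repeatedly multiply random matrices and re-run each multiplication.

    On healthy silicon the two runs agree bit for bit; a mismatch with
    identical inputs is the signature of a silent data error.
    """
    rng = np.random.default_rng(seed)
    mismatches = 0
    for _ in range(repetitions):
        a = rng.standard_normal((size, size), dtype=np.float32)
        b = rng.standard_normal((size, size), dtype=np.float32)
        result = a @ b      # computation under test
        check = a @ b       # recompute with the same inputs
        if not np.array_equal(result, check):
            mismatches += 1  # silent error: same inputs, different answers
    return mismatches

if __name__ == "__main__":
    print("mismatches:", matrix_stress_test())
```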

There are a great many Eigen tests. Running them all would take an impractical amount of time, so chipmakers use a randomized approach to generate a manageable set of them. This saves time but leaves errors undetected. “There is no principled way to guide the test selection,” says Shamsa. He wanted to find a way to guide the selection so that a relatively small number of tests could surface more errors.
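Today’s randomized selection amounts to drawing a fixed-size subset of tests uniformly at random, roughly as in the toy sketch below. The pool size, names, and budget are illustrative assumptions, not Intel’s actual test plan.

```python
import random

# Illustrative only: a large pool of Eigen-style tests, identified by name.
ALL_TESTS = [f"eigen_test_{i}" for i in range(10_000)]

def random_test_plan(budget=50, seed=None):
    """Pick a manageable subset of tests uniformly at random.

    Fast to assemble, but nothing steers the choice toward the tests
    most likely to expose a defect.
    """
    rng = random.Random(seed)
    return rng.sample(ALL_TESTS, budget)
```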

The Intel team used reinforcement learning to develop tests for the part of its Xeon CPU that performs matrix multiplication using fused multiply-add (FMA) instructions. Shamsa says they chose the FMA region because it occupies a relatively large area of the chip, which makes it more vulnerable to possible silent errors: more silicon, more problems. In addition, defects in this part of a chip can generate electromagnetic fields that affect other parts of the system. And because the FMA unit powers down to save energy when it is not in use, testing it involves repeatedly powering it up and down, potentially activating hidden defects that would not otherwise appear in standard tests.

During each step of its training, the reinforcement-learning program selects different tests to run on a potentially defective chip. Each error it detects counts as a reward, and over time the agent learns to select the tests that maximize the chances of finding errors. After roughly 500 test cycles, the algorithm had learned which set of Eigen tests optimized the error-detection rate for the FMA region.
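The article does not spell out Intel’s agent, but the loop it describes, pick tests, treat each detected error as a reward, update the selection policy, can be sketched as a simple multi-armed bandit. Everything in the snippet (the epsilon-greedy strategy, the function names, the `run_test` callback) is an assumption for illustration, contrasting with the uniform sampling shown earlier.

```python
import random

def learn_test_plan(test_names, run_test, cycles=500, batch=5, epsilon=0.1):
    """Epsilon-greedy bandit over test selection.

    test_names: available test identifiers.
    run_test(name) -> int: errors the test exposed this cycle (the reward).
    Returns per-test average-reward estimates after `cycles` rounds; the
    top-scoring tests form the guided test plan.
    """
    counts = {t: 0 for t in test_names}
    values = {t: 0.0 for t in test_names}  # running average reward per test

    for _ in range(cycles):
        if random.random() < epsilon:
            chosen = random.sample(test_names, batch)  # explore: try random tests
        else:
            chosen = sorted(test_names, key=values.get, reverse=True)[:batch]  # exploit
        for t in chosen:
            reward = run_test(t)  # nonzero if the test surfaced an error
            counts[t] += 1
            values[t] += (reward - values[t]) / counts[t]
    return values
```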

Shamsa says this technique is more likely to detect a defect than randomly chosen tests. The Eigen tests are open source, part of the OpenDCDiag diagnostic suite for data centers. Other users should therefore be able to apply reinforcement learning to tune these tests for their own systems, he says.

To some extent, silent, subtle defects are an inevitable part of the manufacturing process: perfection and uniformity are out of reach. But Shamsa says Intel is trying to use this research to learn to find the precursors of silent data errors faster. The company is investigating whether there are red flags that could provide early warning of future errors, and whether chip recipes or designs can be changed to manage them.
