Silent data corruption (SDC) is on the rise following advancements in semiconductor technology. The explosion in AI for speech, image, video, and text processing has led to growing complexity and diversity in hardware systems, increasing the risk to data integrity.
The SDC rate is much higher than software engineers expect, undermining the hardware reliability they used to take for granted. Recent publications by hyperscalers such as Google and Meta shed some light on the extent of the problem. They report that approximately one in a thousand machines in their fleets is affected by SDC. While these companies provide only estimates, Alibaba recently published exact statistics revealing 361 defective parts per million (DPPM) that caused SDCs in its cloud system.
Why should anyone worry about 361 DPPM, or even several thousand? At small scale, such numbers are considered normal. However, in fleets that can have millions of servers, SDC occurrences are frequent enough to threaten the integrity of vital services. For example, if a generative AI data system runs on thousands of devices simultaneously, a single error in one of them can lead to many system-level failures.
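To make the scale argument concrete, the following is a minimal back-of-the-envelope sketch, not a model taken from the cited publications. It assumes one part per server and that defects occur independently across devices; both are simplifications, and the function names are hypothetical. Only the 361 DPPM figure comes from the text above.

```python
# Illustrative arithmetic only. Assumptions (not from the source):
# one part per server, and defects independent across devices.

DPPM = 361                        # defective parts per million, as cited above
P_DEFECTIVE = DPPM / 1_000_000    # per-part probability of an SDC-causing defect


def expected_defective(fleet_size: int, p: float = P_DEFECTIVE) -> float:
    """Expected number of defective parts in a fleet of the given size."""
    return fleet_size * p


def prob_at_least_one(devices: int, p: float = P_DEFECTIVE) -> float:
    """Probability that a job spanning `devices` parts touches at least one defective part."""
    return 1.0 - (1.0 - p) ** devices


if __name__ == "__main__":
    print(f"Expected defective parts in a 1M fleet: {expected_defective(1_000_000):.0f}")
    print(f"P(at least one defective part among 1,000 devices): {prob_at_least_one(1_000):.1%}")
```

Under these assumptions, a fleet of one million parts would be expected to contain about 361 defective ones, and a workload spread across 1,000 devices would have roughly a 30% chance of including at least one of them. A defect rate that looks negligible per part therefore becomes a routine event at fleet scale.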
Undetected manufacturing defects, accelerated aging, or environmental factors can lead to data corruption during storage, transmission, or processing, resulting in unintended changes to information. Traditional approaches to preventing SDC during silicon manufacturing and data center operation fail to provide adequate reliability.
This paper explores proteanTecs' two-stage detection approach, which offers SDC prevention solutions for different stages of a chip's lifespan: ML-powered Outlier Detection for semiconductor defect detection and Real-Time Health Monitoring for in-field predictive and prescriptive maintenance.
This paper highlights: