Some RAM, or random-access memory, is advertised as ECC memory. ECC stands for error-correcting code, a method of identifying and correcting errors in memory. Errors in RAM can corrupt or alter data, which can result in device crashes and even security vulnerabilities. ECC RAM is typically not compatible with consumer-grade PC hardware.
What are memory errors?
A memory error occurs when a value stored in memory changes unintentionally. Data in RAM is stored in binary, as bits with a value of 1 or 0. If a 1 gets switched to a 0 or vice versa, in a process called “bit-flipping”, the data that is stored in the RAM changes.
For example, the changed bit could be part of a value stored in a spreadsheet. In this case, the value in the spreadsheet could change to a completely different number, which would affect the result of any calculations that use it, for example altering the economic forecasts of a business. In other cases, the changed bit could disable a security feature, or create a typo that alters how a program is run. Errors like these are extremely difficult to detect and resolve without the use of ECC memory. In an extreme scenario, a single flipped bit could cause a catastrophic error that crashes the system.
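To get a feel for how much a single flipped bit can change a stored number, here is a minimal Python sketch. The function name flip_bit and the value 1000 are made up for illustration, and the number is treated as a plain binary integer, which is a simplification of how a real spreadsheet stores its values.

```python
def flip_bit(value: int, position: int) -> int:
    """Return value with one bit inverted (position 0 is the least significant bit)."""
    return value ^ (1 << position)


forecast = 1000                      # hypothetical spreadsheet value
low_flip = flip_bit(forecast, 0)     # flip the least significant bit
high_flip = flip_bit(forecast, 20)   # flip a much more significant bit

print(forecast, low_flip, high_flip)  # prints: 1000 1001 1049576
```

The same physical event, one bit changing, can nudge the value by 1 or inflate it by more than a million, depending on which bit is hit.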
Bit-flipping has many potential causes. The most common is background radiation, primarily neutrons created by cosmic ray events. A cosmic ray is a high-energy particle, typically a proton, that travels at nearly the speed of light. Cosmic rays are emitted by stellar bodies, including the Sun, and by other high-energy astronomical objects. When a cosmic ray hits an atom, a shower of neutrons and other subatomic particles is created. These neutrons then go on to have secondary interactions.
These secondary neutron interactions are believed to be the primary source of bit-flipping errors. Cosmic rays are more common at higher altitudes, with a 3.5x increase at 1.5 km above sea level and a 300x increase at the cruising altitude of airliners. Systems that operate at altitude therefore need extra reliability measures.
How common are memory errors?
Most people don’t see their computers crashing every day, so it’d be easy to think that this is primarily a theoretical risk. Research from hyperscale data centers, however, has been used to measure the rate of bit-flipping incidents. A study performed by Google across its data centers found an error rate of roughly 1 single-bit error per gigabyte of RAM every 1.8 hours.
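As a rough back-of-the-envelope illustration, the figure quoted above can be scaled to a given amount of RAM. This sketch assumes that errors scale linearly with both capacity and time, and that a data center rate carries over to other machines, neither of which is guaranteed. The function name and the 16 GB example are hypothetical.

```python
HOURS_PER_ERROR_PER_GB = 1.8  # quoted figure: one single-bit error per GB every 1.8 hours


def expected_errors(ram_gb: float, hours: float) -> float:
    """Expected single-bit errors, assuming linear scaling with capacity and time."""
    return ram_gb * hours / HOURS_PER_ERROR_PER_GB


# Hypothetical machine with 16 GB of RAM running for one day.
print(round(expected_errors(16, 24), 1))  # prints: 213.3
```

Most of those flips would land in memory that is unused or never read again, which helps explain why they rarely show up as visible crashes.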
NASA’s Cassini-Huygens mission, which launched in 1997 to travel to Saturn, was configured with two identical flight computers, each with 2.5 Gb of RAM. Across the first two and a half years of its journey, the spacecraft observed a consistent 280 single-bit errors a day. During one day, when Cassini-Huygens was in the path of a solar flare, a four-fold increase in bit errors was observed, providing further evidence for the Sun being the cause of most bit-flipping issues.
There were concerns that the continued increase in the density of RAM modules would lead to later versions of RAM being more and more vulnerable to bit-flips. More recent studies have shown that the opposite is in fact the case, as errors have dropped as the process geometry has decreased.
How does ECC memory protect against errors?
ECC memory uses error-correcting codes, such as Hamming codes, to correct single-bit errors in RAM. Double-bit errors can be detected but not corrected. Hamming error-correcting codes work by using an array of parity bits. Together these parity bits can be used to detect if any data bits have changed. If a bit is identified as having flipped, it is changed back automatically.
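The idea can be sketched with Hamming(7,4), a small textbook code that protects 4 data bits with 3 parity bits. This is an illustration only: real ECC memory protects much wider words, and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, error position); position 0 means no error found."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # the failed checks spell out the 1-based position
    if position:
        c[position - 1] ^= 1          # flip the faulty bit back
    return c, position


def hamming74_data(codeword):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1                        # simulate a bit flip at position 5
fixed, position = hamming74_correct(stored)
print(position, hamming74_data(fixed))  # prints: 5 [1, 0, 1, 1]
```

Each parity bit covers a different overlapping subset of positions, so the combination of checks that fail points directly at the bit that flipped.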
Tip: A single-bit error is a bit-flipping incident in which only a single bit is flipped. In a double-bit error, two bits are flipped. The two bits don’t need to be flipped in the same incident; the second bit-flip only needs to happen before the first flipped bit is corrected.
One more parity bit than is strictly required is included in the Hamming error-correcting codes used for ECC memory. This extra parity bit gives the code the ability to detect double-bit errors. These errors can’t be corrected, however.
Error detection and correction is carried out by the memory controller, which in modern systems is built into the CPU. The ECC RAM stick itself supplies the extra memory chips that store the parity bits.
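A minimal sketch of how the extra parity bit helps, using an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits, and 1 overall parity bit. As before, the word size is far smaller than in real ECC memory and the function names are hypothetical.

```python
def secded_encode(data):
    """Encode 4 data bits as [p0, p1, p2, d1, p3, d2, d3, d4]; p0 is the extra parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    body = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in body:
        p0 ^= bit                      # overall parity across the 7-bit Hamming codeword
    return [p0] + body


def secded_check(codeword):
    """Return (status, codeword) where status is 'ok', 'corrected' or 'double'."""
    c = list(codeword)
    # With p0 stored at index 0, list index i is also Hamming position i.
    syndrome = 0
    for position in range(1, 8):
        if c[position]:
            syndrome ^= position
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        return "ok", c
    if overall == 1:
        # Odd overall parity: treat as a single flip (syndrome 0 means p0 itself flipped).
        c[syndrome] ^= 1
        return "corrected", c
    # Hamming checks fail but overall parity is even: two bits flipped, location unknown.
    return "double", c


word = secded_encode([1, 0, 1, 1])
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(secded_check(one_flip)[0], secded_check(two_flips)[0])  # prints: corrected double
```

Without the overall parity bit, a double flip would look like a single flip at some third position and would be "corrected" into yet another wrong value. The extra bit lets the decoder tell the two cases apart and report the double error instead.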
Consumer availability and support
Most consumer-grade PC hardware doesn’t support ECC memory. This is partly a way of artificially distinguishing server hardware from consumer hardware. ECC RAM does, however, cost more and run slightly slower. Additionally, the extra stability it would provide to home consumers is minimal, as bit-flipping errors are not the primary cause of system crashes.
None of Intel’s consumer or enthusiast-grade CPUs support ECC memory; only its server-grade CPUs, such as the Xeon range, do. AMD’s consumer-grade CPUs don’t support ECC either. Its workstation and server-grade CPUs, Threadripper and EPYC respectively, do support ECC memory.