Memory scrubbing is the process of reading computer memory, correcting any errors found, and writing the corrected data back. Modern memory chips pack their cells together at an incredibly high density. Because of this, they are vulnerable to a few things, including alpha particles and cosmic rays. Alpha particles were more of an issue for early RAM chips due to radioactive contaminants; modern technology has fixed that issue but can’t protect against cosmic rays.
Tip: Alpha particles are a potent form of radiation. They are essentially the nucleus of a helium atom and are emitted from nuclei during radioactive decay. They are relatively large and have a high electric charge, which gives them a very short range. However, within that range, they can have a considerable effect. On devices like memory cells that work based on electric charges, the electric charge of an alpha particle can cause a bit to “flip” from a 1 to a 0 or vice-versa.
Cosmic rays are high-energy protons and other atomic nuclei that come from astrophysical sources and bombard the planet constantly. The atmosphere provides reasonable protection from cosmic rays, but this protection drops off at altitude, increasing the risk of a bit flip from a cosmic ray. The actual risk is generally from “secondaries” – the debris – from a cosmic ray collision with the upper atmosphere.
It’s believed that such factors cause roughly one error per year in at least 8% of RAM sticks worldwide. While that may not seem like a lot (and is, in fact, a small minority), it’s still enough to be a problem. Because they are caused by outside factors rather than by a defect in the hardware itself, these corrupted memory cells are considered soft errors – and they can be fixed.
ECC RAM
Generally, computer data isn’t stored in just one location – particularly in so-called ECC memory modules. ECC stands for error correction code. These modules are used where potentially corrupted data could cause expensive or fatal issues. Airplanes, nuclear reactors, spacecraft, scientific data sets, and financial models must be reliable.
ECC RAM stores data with enough redundancy for errors to be corrected. The data isn’t duplicated. Instead, a checksum is stored alongside it and can be compared against the current data. If the checksum doesn’t match, then an error has occurred and can be corrected. If the checksum matches, then no correctable error has occurred.
The phrasing there is essential. “No correctable error has occurred” is not the same as “no error has occurred.” This concept only works if the computer very frequently checks for issues. Soft or bit errors can be fixed as long as they are individual errors – but multiples cannot be fixed.
For example, if a RAM stick stores the calculation 2+2=4, it could detect a bit flip that changes it to 2+2=5. If there were two separate bit flips, though, you might end up with the stored value of 2+3=5. This isn’t the value expected and could cause issues for calculations based on that value, but a code that only handles single-bit errors couldn’t correct this issue, and might not even identify it.
That example is oversimplified. It helps to demonstrate why it’s essential that these rare errors are caught as soon as possible, though. The two errors can be corrected as long as one is fixed before the other occurs. Given the low incidence rate, there’s relatively little urgency. Random events, however, can happen close together.
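To make the idea of a correcting checksum concrete, here is a minimal Python sketch of a Hamming(7,4) code, a classic textbook scheme that protects 4 data bits with 3 check bits. It is an illustration only: nothing above says which code a particular ECC module uses, real modules protect much wider words (commonly with codes that can at least detect a double flip), and the function names are made up for this example. The sketch shows a single bit flip being located and repaired, and a double flip being quietly "repaired" into the wrong value.

```python
# Hamming(7,4) toy example. Positions are 1-based, as in most textbooks.
DATA_POSITIONS = (3, 5, 6, 7)   # these positions hold the 4 data bits
CHECK_POSITIONS = (1, 2, 4)     # the powers of two hold the 3 check bits


def syndrome(codeword):
    """XOR together the positions of all set bits.

    A valid codeword gives 0. After a single bit flip, the result is
    the position of the flipped bit.
    """
    result = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            result ^= position
    return result


def encode(data_bits):
    """Build a 7-bit codeword (list of 0/1) from 4 data bits."""
    codeword = [0] * 7
    for position, bit in zip(DATA_POSITIONS, data_bits):
        codeword[position - 1] = bit
    partial = syndrome(codeword)
    # Choose the check bits so that the overall syndrome becomes 0.
    for position in CHECK_POSITIONS:
        codeword[position - 1] = 1 if partial & position else 0
    return codeword


def correct(codeword):
    """Return (repaired codeword, repaired position or None)."""
    repaired = list(codeword)
    position = syndrome(codeword)
    if position == 0:
        return repaired, None
    repaired[position - 1] ^= 1
    return repaired, position


def decode(codeword):
    """Extract the 4 data bits from a codeword."""
    return [codeword[position - 1] for position in DATA_POSITIONS]


stored = encode([1, 0, 1, 1])

# One flip (position 5): the code finds it and restores the data.
one_flip = list(stored)
one_flip[4] ^= 1
repaired, where = correct(one_flip)
print("single flip:", decode(repaired), "repaired position", where)

# Two flips (positions 5 and 6): the syndrome points at an innocent
# bit, so the "repair" produces data that was never stored.
two_flips = list(stored)
two_flips[4] ^= 1
two_flips[5] ^= 1
wrong, where = correct(two_flips)
print("double flip:", decode(wrong), "repaired position", where)
```

The second case is the 2+3=5 situation from the example above: the code believes it has fixed something, but the result is not the original data. That is why a single flip needs to be cleaned up before a second one lands in the same word.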
Implementation in Practice
Once identified, the errors need to be fixed. The process of finding and fixing memory errors is called memory scrubbing. To avoid a performance impact, a full scrubbing sweep doesn’t compete with the CPU’s requests for data from the RAM. Instead, the memory controller runs the scrubbing process while the RAM is idle.
This does increase the power draw, as the process involves reading data from RAM, and fixing errors involves writing data back to the RAM. When the CPU requests data, the memory controller opportunistically checks for errors. Memory scrubbing, however, ensures that all RAM is checked for errors, even if it’s not being actively requested.
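The difference between checking on request and sweeping the whole memory can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real memory controller is built: the class `ScrubbingMemory` and its methods are hypothetical, each "word" holds only 4 data bits, and the error code is a small Hamming(7,4) code chosen for brevity.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # 1-based bit positions that hold data
CHECK_POSITIONS = (1, 2, 4)     # 1-based bit positions that hold check bits


def syndrome(word):
    """XOR of the positions of all set bits; 0 means the word looks valid."""
    result = 0
    for position in range(1, 8):
        if (word >> (position - 1)) & 1:
            result ^= position
    return result


def encode(nibble):
    """Pack a 4-bit value into a 7-bit Hamming codeword (as an int)."""
    word = 0
    for i, position in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (position - 1)
    partial = syndrome(word)
    for position in CHECK_POSITIONS:
        if partial & position:
            word |= 1 << (position - 1)
    return word


def decode(word):
    """Recover the 4-bit value from a codeword."""
    return sum(((word >> (position - 1)) & 1) << i
               for i, position in enumerate(DATA_POSITIONS))


class ScrubbingMemory:
    """Hypothetical memory model with checks on read and an idle-time sweep."""

    def __init__(self, values):
        self.cells = [encode(value) for value in values]
        self.cursor = 0          # next address the sweep will visit
        self.corrections = 0     # how many single-bit errors were repaired

    def _check_and_repair(self, address):
        position = syndrome(self.cells[address])
        if position:
            self.cells[address] ^= 1 << (position - 1)  # write the fix back
            self.corrections += 1

    def read(self, address):
        """CPU read: the word is checked on its way out."""
        self._check_and_repair(address)
        return decode(self.cells[address])

    def patrol_step(self):
        """Called while idle: check one word, then move on to the next."""
        self._check_and_repair(self.cursor)
        self.cursor = (self.cursor + 1) % len(self.cells)

    def inject_flip(self, address, position):
        """Simulate a soft error at a 1-based bit position."""
        self.cells[address] ^= 1 << (position - 1)


memory = ScrubbingMemory([0b1011, 0b0001, 0b1110, 0b0110])

memory.inject_flip(2, 6)              # first soft error; address 2 is never read
for _ in range(len(memory.cells)):    # one full idle-time sweep
    memory.patrol_step()

memory.inject_flip(2, 3)              # a second, later error in the same word
print(bin(memory.read(2)), "corrections:", memory.corrections)
```

Because the sweep repaired the first flip before the second one arrived, the later read still returns the stored value. Without the sweep, both flips would have been sitting in the same word when it was finally read, which is more than this code can repair.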
Tip: Often, scrub configurations can be edited in the BIOS setup program of your computer if you feel the need to do so. This will require ECC RAM, which most computers do not support.
How often memory ideally needs to be scrubbed depends on the type and size of the memory. CPU caches, which are SRAM-based, can also be scrubbed. However, they are usually far smaller than the main memory and don’t need to be scrubbed often. Main memory is typically DRAM-based. DRAM offers far more storage density and uses more physical space, which means more opportunities for soft errors. Therefore, it also needs to be scrubbed more frequently.
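A rough way to see why size and scrub interval matter is to estimate how often a second flip lands in a word before the first has been scrubbed away. The Python sketch below assumes that flips arrive independently at a constant rate (a Poisson model) and that the code can repair one flip per word. The rate used is a hypothetical placeholder, not a measurement, and the function names are invented for this example.

```python
import math

HOURS_PER_YEAR = 365 * 24


def p_two_or_more(x):
    """Probability of at least two events when x are expected (Poisson).

    Exactly 1 - exp(-x) * (1 + x). For tiny x that expression loses all
    precision in floating point, so the leading terms of its series,
    x**2 / 2 - x**3 / 3, are used instead.
    """
    if x < 1e-4:
        return x * x / 2 * (1 - 2 * x / 3)
    return 1 - math.exp(-x) * (1 + x)


def expected_double_hits_per_year(words, flips_per_word_per_year,
                                  scrub_interval_hours):
    """Expected number of words per year that collect two or more flips
    within a single scrub interval (more than the code can repair)."""
    intervals_per_year = HOURS_PER_YEAR / scrub_interval_hours
    flips_per_interval = flips_per_word_per_year / intervals_per_year
    return words * p_two_or_more(flips_per_interval) * intervals_per_year


# Hypothetical soft-error rate per 64-bit word, for illustration only.
ASSUMED_RATE = 1e-10

for label, words in (("32 MiB cache", 2**22), ("32 GiB main memory", 2**32)):
    for hours in (24, 1):
        risk = expected_double_hits_per_year(words, ASSUMED_RATE, hours)
        print(f"{label}, scrubbed every {hours} h: {risk:.3e} per year")
```

Under these assumptions the estimate grows in proportion to both the number of words and the length of the scrub interval, so a memory a thousand times larger needs to be scrubbed roughly a thousand times as often to keep the same risk. The absolute numbers depend entirely on the assumed rate.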
Conclusion
Memory scrubbing is the process of using error correction codes (ECC) to verify that the data stored in a memory device has not been affected by a bit flip, and to repair it if it has. In modern devices, bit flips are typically caused by cosmic ray secondaries, as evidenced by the increasing incidence rate at altitude. Scrubbing happens in two ways. The first is opportunistic: when the CPU requests data, the memory controller checks it for errors on the way. This is typically referred to as demand scrubbing.
The other method is called patrol scrubbing. This is where the memory controller automatically sweeps across the whole RAM while the RAM is otherwise idle. Regular and efficient scrubbing of memory devices contributes to their reliability. Memory scrubbing, however, requires ECC RAM and is not possible on standard RAM.
Tip: DDR5 memory includes “on-die ECC.” This allows correcting bit flips while the data is at rest. True ECC memory, however, can not only do this but also detect errors during transmission because the parity information is also transmitted simultaneously.
On-die ECC cannot detect errors during transmission and is not as effective as true ECC memory. Nevertheless, it is a relatively cost-effective solution to increase memory reliability as it does not involve adding a redundant RAM chip to a DIMM.