Data access is a critical part of CPU design. CPUs operate at extremely high speeds, processing multiple instructions each clock cycle, and so need access to a lot of data. The vast majority of that data is kept on storage media. Storage devices, however, are extremely slow compared to a CPU. They are also significantly better at sequential reads than at random reads, though SSDs offer a marked improvement over HDDs in this regard (and many others).
System RAM is designed to be loaded with all the data the CPU might need for the currently running software. RAM has significantly lower latency than storage and is specifically tailored for high random read performance. Still, fast as modern RAM is, it is slow next to the CPU: a RAM access takes on the order of 400 clock cycles.
To further reduce the latency, most modern CPUs include tiers of cache memory. Typically, these are referred to as the L1, L2, and L3 caches. L1 is very fast, typically taking on the order of 5 clock cycles to access. L2 is a bit slower, on the order of 20 cycles. L3 is slower still, at around 200 cycles. These figures are rough and vary considerably between architectures. While L1 is incredibly fast, it’s also tiny. Much of its speed comes from the fact that smaller caches take less time to search. L2 is bigger than L1 but smaller than L3, which in turn is far smaller than system RAM. Balancing the size of these caches well is critical to getting a high-performance CPU. Cache hit ratios are important, but the number of hits has to be balanced against how long each hit takes to serve, hence the tiers.
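The trade-off between hit ratio and hit latency can be made concrete with a little arithmetic. The sketch below estimates the average cost of a memory access across the tiers, using the rough cycle counts quoted above. The hit rates in the example are invented purely for illustration, and the function name is hypothetical; this is a simplified model, not a description of how any particular CPU behaves.

```python
def average_access_cycles(levels, memory_cycles):
    """Expected cycles per access for a tiered cache hierarchy.

    levels: list of (hit_rate, latency_cycles), fastest tier first.
    memory_cycles: latency of the final fallback (system RAM).
    """
    expected = 0.0
    reach_probability = 1.0  # chance an access gets as far as this tier
    for hit_rate, latency in levels:
        expected += reach_probability * hit_rate * latency
        reach_probability *= 1.0 - hit_rate
    # Anything that misses every cache tier is served from RAM.
    expected += reach_probability * memory_cycles
    return expected


# Latencies are the article's rough figures; hit rates are assumptions.
tiers = [(0.90, 5), (0.80, 20), (0.50, 200)]
with_caches = average_access_cycles(tiers, memory_cycles=400)
without_caches = average_access_cycles([], memory_cycles=400)
print(f"with caches: {with_caches:.1f} cycles, RAM only: {without_caches:.1f} cycles")
```

With these made-up hit rates the average comes out at roughly 12 cycles against 400 for RAM alone, which is why a small, fast L1 backed by larger, slower tiers pays off.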
Note that scratchpad memory doesn’t fit in the traditional memory hierarchy, because it isn’t used in most consumer CPUs. Scratchpad memory is designed to be used like a scratchpad would be in real life: you note down temporary information that you need to remember but don’t need to actually file away. Much of the time a CPU processes data and then needs the result again straight away. It can write that result out to memory, but to reach it again quickly it should also keep a copy in fast memory such as the cache.
Scratchpad memory essentially fills the same gap as the L1 cache. It’s accessible as fast as possible, often in single-digit cycle counts. To manage this, it is also relatively small. There are two key differences between L1 and scratchpad memory, though. Firstly, scratchpad memory is directly addressable. Secondly, it is shared between all cores and processors.
Differences between cache and scratchpad
The CPU cache is essentially transparent to software. Code can’t deliberately put data there and its contents can’t be programmed. Instead, the program just requests data from RAM and happens to get it back faster, sometimes significantly faster, than it otherwise would. Making the scratchpad addressable means that code can specify exactly what data should be in it. This can be useful, though modern caching algorithms are excellent, with hit rates of 95-97% being expected in standard workloads.
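To illustrate the difference, here is a toy model in Python. It is only a sketch: real caches work on lines and sets in hardware, and the class names, the LRU replacement policy, and the sizes are all assumptions chosen for illustration. The point is that the cache decides for itself what to keep, while the scratchpad holds whatever the code put there.

```python
from collections import OrderedDict


class TransparentCache:
    """Toy LRU cache: software only calls read(); the cache picks its contents."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing      # stands in for system RAM
        self.lines = OrderedDict()  # address -> value, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            self.hits += 1
            self.lines.move_to_end(address)  # mark as most recently used
        else:
            self.misses += 1
            self.lines[address] = self.backing[address]
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return self.lines[address]


class Scratchpad:
    """Toy scratchpad: software stores to and loads from explicit slots."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def store(self, slot, value):
        self.slots[slot] = value

    def load(self, slot):
        return self.slots[slot]


def demo():
    ram = {address: address * 10 for address in range(100)}
    cache = TransparentCache(capacity=4, backing=ram)
    pad = Scratchpad(capacity=4)

    pad.store(0, ram[42])      # code pins the hot value explicitly
    cache.read(42)             # the cache picks it up on first use
    for address in range(8):   # streaming reads push 42 out of the cache
        cache.read(address)
    cache.read(42)             # a miss again: the cache had evicted it
    return cache.hits, cache.misses, pad.load(0)


print(demo())
```

In this contrived access pattern the streaming reads evict the hot value from the cache, so the second read of it misses, while the scratchpad copy is still exactly where the code left it. With a friendlier access pattern the cache would do just as well without any effort from the programmer.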
L1 cache is always locked to an individual processing core. No other processing core can access it. This means that if multiple cores need the same data, they may duplicate it in their respective L1 caches. In some CPU architectures, the L2 is per core, in others it’s shared by a small number or even all of the cores. L3 tends to be shared by all cores. Sharing cache between cores allows two or more cores to access the same data without duplicating it. It also allows one core to utilise more than its fair share when it has the need and the cache has the space.
Scratchpad acts similarly to the L1 in terms of speed and capacity, but it is shared between all cores. This allows very fast access to specific data being acted on in a multithreaded workload. Scratchpad memory can even be shared between distinct CPUs on multi-socket motherboards.
One disadvantage of scratchpad memory is that it may be relied on too heavily. Because software can access it directly, it may come to depend on a certain quantity being present. Such software would then be unable to run on CPUs with less scratchpad memory than that. Cache tiers simply don’t suffer from this problem and so are better suited to general-purpose use.
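One way software can soften this dependency is to treat the scratchpad size as an input rather than a constant. The sketch below is a hypothetical helper, not an API from any real platform: it splits a working set into tiles that fit whatever scratchpad capacity is reported, and falls back to a single pass over ordinary memory when there is none. How the capacity would actually be discovered is platform-specific and is assumed away here.

```python
def plan_tiles(total_bytes, scratchpad_bytes):
    """Split a working set into (offset, length) tiles sized to the scratchpad.

    scratchpad_bytes is whatever capacity the platform reports; 0 means
    no scratchpad is available and the data is processed in one pass
    from ordinary memory.
    """
    if total_bytes < 0 or scratchpad_bytes < 0:
        raise ValueError("sizes must be non-negative")
    if scratchpad_bytes == 0:
        return [(0, total_bytes)] if total_bytes else []
    tiles = []
    offset = 0
    while offset < total_bytes:
        length = min(scratchpad_bytes, total_bytes - offset)
        tiles.append((offset, length))
        offset += length
    return tiles


print(plan_tiles(10, 4))   # three tiles, the last one partial
print(plan_tiles(10, 0))   # no scratchpad: one pass over everything
```

Code written this way still benefits from a large scratchpad but keeps working on a part with a smaller one, at the cost of more tiles.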
Scratchpad memory is most often found in multi-socket server systems designed for HPC (High-Performance Computing). There, its combination of speed and shared access makes it useful for highly parallel workloads.
Scratchpad memory also sees use in much smaller processors: embedded processors, often MPSoCs. An embedded processor is often relatively low power and specialised for a specific task, and this specialisation often shows up as hardware optimisations. In a Multi-Processor System on a Chip (MPSoC) especially, shared high-speed memory may provide significant latency improvements to multiple different processors. These sorts of CPUs are often very fixed in design. Game consoles, for example, already see a lot of hardware-specific optimisation and so can make good use of such features without needing to worry about backwards or forwards compatibility.
Scratchpad memory is similar to L1 cache but has a number of differences that change its use cases. Instead of being a cache, it’s directly addressable, allowing data to be specifically assigned to particularly high-speed memory. It’s also shared between all processor cores and processors, making it particularly useful in heavily multithreaded workloads.