Data access is a critical part of CPU design. CPUs operate at extremely high speeds, processing multiple instructions each clock cycle, and so need access to a lot of data. The vast majority of that data lives on storage media. Storage devices, however, are extremely slow compared to a CPU. They are also significantly better at sequential reads than at random reads, though SSDs offer a marked improvement over HDDs in this regard (and many others).
System RAM is designed to hold all the data the CPU might need for the currently running software. RAM has significantly lower latency than storage and is specifically tailored for high random read performance. Still, as fast as modern RAM is, it is slow next to the CPU: a RAM access has a latency on the order of 400 clock cycles.
Caching to reduce latency
To reduce latency further, most modern CPUs include tiers of cache memory, typically referred to as the L1, L2, and L3 caches. L1 is the fastest, typically taking on the order of 5 clock cycles to access. L2 is a bit slower, on the order of 20 cycles. L3 is slower still, at around 200 cycles. These figures are rough and vary from one CPU design to another.
While L1 is incredibly fast, it’s also tiny. Much of its speed comes from the fact that smaller caches take less time to search. L2 is bigger than L1 but smaller than L3, which is in turn smaller than system RAM. Balancing the size of these caches well is critical to getting a high-performance CPU. Cache hit ratios are important, but the number of hits has to be weighed against how long each hit takes, hence the tiers.
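The trade-off between hit ratio and hit time can be put into numbers. The sketch below is a simplified model, not a description of any real CPU: it uses the rough latencies quoted above, while the hit rates and the function name `average_access_cycles` are assumptions made up for illustration. Each tier's hit rate is taken to mean the fraction of requests reaching that tier which it can satisfy.

```python
def average_access_cycles(tiers, ram_latency):
    """Expected cycles per access for a hierarchy probed fastest tier first.

    tiers: list of (latency_cycles, hit_rate) pairs, fastest first, where
    latency_cycles is the total latency when that tier serves the request.
    """
    expected = 0.0
    reach = 1.0  # probability that a request gets as far as this tier
    for latency, hit_rate in tiers:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    # Whatever misses every cache tier is served from RAM.
    return expected + reach * ram_latency


# Latencies are the article's rough figures; hit rates are hypothetical.
L1 = (5, 0.90)
L2 = (20, 0.80)
L3 = (200, 0.50)
RAM_LATENCY = 400

no_cache = average_access_cycles([], RAM_LATENCY)              # 400 cycles
l1_only = average_access_cycles([L1], RAM_LATENCY)             # about 44.5
three_tiers = average_access_cycles([L1, L2, L3], RAM_LATENCY)  # about 12.1
```

Under these assumed hit rates, a single small cache already cuts the average cost by roughly a factor of nine, and the extra tiers bring it down to about 12 cycles, because most of the L1 misses are caught before they reach RAM.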
Balancing the capacity of each cache tier against its hit rate is tricky enough, but it’s also important to decide how broad the access to that cache is. There are three approaches. The first is to limit a cache to a single core. The second is to allow all cores to access the cache. The third is a middle ground: letting a selection of cores share the cache.
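The three approaches can be sketched as a mapping from a core to the cache instance it uses. Everything here is hypothetical naming for illustration (`Scope`, `cache_instance`, `instances_needed`, and a cluster size of four cores); real CPUs wire this up in hardware.

```python
from enum import Enum


class Scope(Enum):
    PRIVATE = "private"  # one cache per core
    CLUSTER = "cluster"  # one cache per group of cores
    GLOBAL = "global"    # one cache for every core


def cache_instance(core, scope, cluster_size=4):
    """Return the index of the cache instance that serves the given core."""
    if scope is Scope.PRIVATE:
        return core
    if scope is Scope.CLUSTER:
        return core // cluster_size
    return 0  # Scope.GLOBAL: every core maps to the same cache


def instances_needed(cores, scope, cluster_size=4):
    """Count how many separate caches a chip with `cores` cores would need."""
    return len({cache_instance(c, scope, cluster_size) for c in range(cores)})


# For a hypothetical 8-core chip:
private_count = instances_needed(8, Scope.PRIVATE)  # 8 caches
cluster_count = instances_needed(8, Scope.CLUSTER)  # 2 caches
global_count = instances_needed(8, Scope.GLOBAL)    # 1 cache
```

The broader the sharing, the fewer cache instances there are and the more cores each one has to serve.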
Sharing is slow
A cache that is only accessible by a single core is called local memory. Limiting access to the cache means that you don’t need to position it so that multiple cores can reach it, so you can keep it as close to its core as possible. That proximity, combined with the speed that comes from a small capacity, makes for an ideal L1 cache: each core has its own small, close cache.
Shared memory is a cache accessible by multiple cores. The term doesn’t distinguish between caches shared by some cores and caches shared by all of them, though the difference does have a performance impact. Just as it makes sense for a local cache to be small, it makes sense for a shared cache to be large: partly because it needs to serve more cores, and partly because it has to be within physical reach of every one of them, so it cannot sit right next to any single core. This makes the concept more useful for L2 and especially L3 caches.
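A toy simulation can make the difference between local and shared caches concrete. This is a minimal sketch under heavy assumptions: two cores, read-only accesses, least-recently-used eviction, tiny capacities, and no cache coherence. The class names `LRUCache` and `TwoCoreSystem` are invented for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny cache that evicts the least recently used address when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, address):
        if address in self.lines:
            self.lines.move_to_end(address)  # mark as most recently used
            return True
        return False

    def fill(self, address):
        self.lines[address] = True
        self.lines.move_to_end(address)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # drop the least recently used


class TwoCoreSystem:
    """Two cores, each with a local L1, both backed by one shared L2."""

    def __init__(self, l1_capacity=2, l2_capacity=8):
        self.l1 = [LRUCache(l1_capacity), LRUCache(l1_capacity)]
        self.l2 = LRUCache(l2_capacity)

    def read(self, core, address):
        """Return which level served the read: 'L1', 'L2', or 'RAM'."""
        if self.l1[core].lookup(address):
            return "L1"
        if self.l2.lookup(address):
            source = "L2"
        else:
            source = "RAM"
            self.l2.fill(address)
        self.l1[core].fill(address)
        return source


system = TwoCoreSystem()
trace = [
    system.read(0, 0x40),  # "RAM": nothing is cached yet
    system.read(0, 0x40),  # "L1": core 0 now holds it locally
    system.read(1, 0x40),  # "L2": core 1's local L1 misses, the shared L2 hits
    system.read(1, 0x40),  # "L1": core 1 now has its own local copy
]
```

The third read is the interesting one: core 1 cannot see core 0's local cache, but it benefits from the shared tier that core 0 already populated.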
Local cache memory isn’t restricted to CPUs. The concept can also apply to other types of processors. The most well-known secondary processor, however, is the GPU, which essentially doesn’t have any local memory. There are so many processing cores that everything is grouped. Even the smallest group shares the lowest level of cache.
At the RAM level
Some computers, such as cluster computers, can have multiple physical CPUs. Typically, each of these will have its own pool of RAM. In some cases, this RAM is shared across all CPUs; in others, each pool is limited to its own CPU. When each CPU in a multiprocessor system can only access its own pool of RAM, that is also local memory.
At the software level
Software running on the computer is allocated memory space. In some cases, one program may run multiple processes with a shared memory space. Some programs may even deliberately share memory space with another program. Typically, though, a memory space is limited to just one process. Again, this is an example of local memory.
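Process-level isolation can be observed from ordinary code. The sketch below is only an illustration, assuming a Python interpreter is available to launch as a child process; the helper name `mutate_in_child` is hypothetical. The child never touches the parent's list: it receives a serialized copy over standard input, changes that copy in its own memory space, and reports back.

```python
import json
import subprocess
import sys

# Code the child process runs: read a list, change it, print the result.
CHILD_CODE = (
    "import json, sys\n"
    "data = json.load(sys.stdin)\n"
    "data.append(99)\n"
    "print(json.dumps(data))\n"
)


def mutate_in_child(data):
    """Send a copy of `data` to a separate process and return its modified copy."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


parent_data = [1, 2, 3]
child_view = mutate_in_child(parent_data)
# child_view is [1, 2, 3, 99]; parent_data is still [1, 2, 3],
# because the child only ever changed memory local to its own process.
```

Sharing memory between processes is possible, but it has to be requested explicitly through the operating system; by default each process works in its own local space.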
Local memory is a term that identifies a portion of memory that is only available to a single thing. That thing may be a processing core, a processor, or a process. The overall concept is always the same, though, even if the specifics vary. Local memory tends to be more secure. It also tends to be smaller in capacity. Access times are generally faster for local memory than for shared memory. Outside of caching, though, that comparison only holds if you measure the worst-case speed of the shared memory. Local memory is typically very useful. Depending on the workload, however, it is usually most efficient to have a combination of local and shared memory. Caches are the exception: there, it is always better to combine local and shared memory.