CPUs are incredibly complex beasts. There are many interconnecting parts that all have to work in perfect unison to achieve the levels of performance we see. One of the key features of a CPU is the cache. It’s not a flashy feature. It doesn’t advertise as well as the core count or peak boost frequency. It is critical to performance, though.
Modern CPUs are incredibly fast. They can perform more than five billion operations every second. Keeping the CPU fed with data when it operates that fast is difficult. The RAM has enough capacity to supply the CPU with data. It can even transfer a lot of data every second, thanks to very high bandwidth. That’s not the problem, though. The problem is latency.
RAM can respond very quickly. The problem is that “very quickly” is a long time when you do five billion things every second. Even the fastest RAM has a latency above 60 nanoseconds. Again, 60 nanoseconds sounds like no time at all. Consider, though, that a CPU running at 1GHz takes 1ns to complete a cycle. With high-end CPUs hitting 5.7GHz, that’s one cycle roughly every 175 picoseconds. How’re those 60 nanoseconds of latency looking now? That’s 342 cycles of latency.
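The arithmetic above is easy to check for yourself. Here is a minimal Python sketch that converts a clock speed into a cycle time and a latency into a cycle count. The function names are made up for illustration, and the 60ns and 5.7GHz figures are simply the ones quoted above, not measurements.

```python
def cycle_time_ps(clock_ghz: float) -> float:
    """Duration of one clock cycle in picoseconds."""
    # 1 GHz means one cycle per nanosecond, i.e. 1000 ps per cycle.
    return 1000.0 / clock_ghz


def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Number of clock cycles that pass while waiting latency_ns."""
    # A clock speed in GHz is the same as cycles per nanosecond.
    return latency_ns * clock_ghz


# Figures taken from the text: 60 ns of RAM latency at 1 GHz and 5.7 GHz.
for ghz in (1.0, 5.7):
    print(f"{ghz} GHz: {cycle_time_ps(ghz):.0f} ps per cycle, "
          f"{latency_in_cycles(60, ghz):.0f} cycles spent waiting on 60 ns")
```

Swapping in your own CPU’s boost clock and your RAM’s measured latency gives a rough feel for how many cycles a trip to main memory costs.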
That sort of latency would be a killer for CPU performance. To get around that, a cache is used. The cache is placed on the CPU die itself. It’s also much smaller than RAM and uses a different memory type, SRAM rather than DRAM. This makes it much quicker to respond than the main system RAM. The cache is typically tiered, with L1, L2, and L3 used to denote the tiers that get further and further from the CPU cores. Lower-numbered tiers are faster but smaller. L1 can have a latency of four or five clock cycles, much better than 342.
But Some CPUs Mention an L0?
The terminology for L1, L2, and L3 is pretty standard. The general understanding of what they mean and do is fairly consistent, even across CPU vendors. This is because they’re governed by material and electrical physics; not much can change. You can have a fast cache or a big cache, not both. A cache also needs to be bigger if it is shared between multiple cores. To that end, L1 and L2 tend to be core specific. The larger L3 cache tends to be shared between some or all cores on the CPU or chiplet.
As you can probably guess, L0 is related to caching but has been shoved into the naming scheme after the fact. The name alone doesn’t do much to explain what it means. You can probably guess a few things, though. It’s going to be limited to one core, it’s going to be tiny, and it’s going to be fast. The other name it goes by can help a bit; that’s micro-op cache.
Instead of caching data from memory, or full instructions, L0 caches micro-ops. As we recently described, a micro-op is a feature of modern CPUs. Instructions in x86 and other ISAs are big, complex, and challenging to fit efficiently in a pipeline. You can pipeline them much more efficiently if you break them down into constituent micro-ops. In some cases, you can even group multiple micro-ops, even from different instructions, into a single micro-op, achieving both a performance improvement and a power reduction.
CPU Architecture ft Micro-Op Cache
To execute an instruction, a modern CPU decodes it. This involves splitting the instruction into its constituent micro-ops and determining the memory locations that should be referenced. A lot of software uses similar functionality regularly and often reuses the same code in a loop or from a called function. This means that the same instructions can be executed again and again, which in turn means that the same micro-ops are needed again and again. And if the same micro-ops are needed repeatedly, they can be cached. Caching micro-ops can reduce the load on the instruction decoders, reducing the power draw or helping to fill up the pipeline faster.
The cache does need to be kept small, but when carefully managed, it can be accessed with a latency of a single cycle or even zero cycles. This can be enough to avoid the 4-cycle latency of going to the L1 cache and comes with no cache-miss penalty.
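To make the decode-once, reuse-many-times idea concrete, here is a toy Python model of a micro-op cache. It is only a sketch under loud assumptions: the `MicroOpCache` class, the `fake_decode` helper, the least-recently-used eviction, and the two-micro-ops-per-instruction split are all invented for illustration and do not describe how any real CPU organises its micro-op cache.

```python
from collections import OrderedDict


class MicroOpCache:
    """Toy cache mapping an instruction address to its decoded micro-ops.

    Hypothetical model: capacity is counted in entries, and the least
    recently used entry is evicted when the cache is full.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, address, decode):
        """Return micro-ops for address, decoding only on a miss."""
        if address in self.entries:
            self.hits += 1
            self.entries.move_to_end(address)  # mark as most recently used
            return self.entries[address]
        self.misses += 1
        uops = decode(address)  # fall back to the (slower) decoder
        self.entries[address] = uops
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return uops


def fake_decode(address):
    # Hypothetical decoder: pretend every instruction becomes two micro-ops.
    return (f"uop_{address}_a", f"uop_{address}_b")


def run_loop(cache, loop_body, iterations):
    """Fetch every instruction in loop_body, iterations times over."""
    for _ in range(iterations):
        for address in loop_body:
            cache.fetch(address, fake_decode)


cache = MicroOpCache(capacity=8)
run_loop(cache, loop_body=[0, 4, 8, 12], iterations=100)
print(f"decoder used {cache.misses} times, cache hit {cache.hits} times")
```

In this toy run, a four-instruction loop executed 100 times only needs the decoder on the first pass; the other 396 fetches are served from the cache. If the loop body is larger than the cache, the benefit disappears, which is one reason tight loops and small functions gain the most.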
L0 cache is another name for the micro-op cache. It can be a part of modern CPUs that utilize micro-operations. It typically holds a few thousand entries and has capacities listed in numbers of entries rather than bytes. L0 can be accessed faster than L1, typically with a 1- or 0-cycle latency. Caching micro-ops reduces the load on the instruction decoders, especially in code that makes good use of loops or functions.