Historically, CPUs have been strictly sequential machines. This is logical and easy to understand, but it can limit performance. Over the years, there have been many ingenious adjustments to CPU designs to extract as much performance as possible from the silicon. One of the more interesting ones is out-of-order execution. In an out-of-order CPU, instructions don’t necessarily need to be executed in the order in which they appear in the program.
Stalling in Order
The main performance issue an in-order CPU runs into is called a pipeline stall. This happens when an instruction depends on a value that isn’t already available in a register. In this case, the CPU must fetch that value from memory. The CPU caches are checked first, as they are the fastest memory tiers. If the value isn’t there, system RAM is accessed. During this time, the CPU must sit idle, because the memory-dependent instruction has to complete before any of the instructions that follow it.
The performance impact of a single pipeline stall is usually small, but stalls can add up to something relatively severe. For example, the L1 cache can typically return a result in around 5 CPU cycles. The L2 cache may take around 20 cycles, the L3 cache around 50 cycles, and system RAM around 400 cycles. Exact figures vary between CPU designs. Given that a CPU may run at around 5GHz, that is, 5 billion clock cycles per second, even 400 cycles isn’t much on its own: 80 nanoseconds, or 0.000008% of a second. But if many instructions need data from further down the memory hierarchy, the cumulative effect can cause a noticeable slowdown.
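The arithmetic behind these figures can be sketched in a few lines of Python. The 5GHz clock and the 400-cycle RAM latency come from the text; the rate of five million misses per second is purely an illustrative assumption, and the model assumes stalls never overlap with useful work.

```python
CLOCK_HZ = 5_000_000_000  # 5GHz, the clock speed used in the text


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a number of clock cycles to nanoseconds."""
    return cycles * 1e9 / clock_hz


def stalled_fraction(misses_per_second, cycles_per_miss, clock_hz=CLOCK_HZ):
    """Fraction of each second spent waiting, assuming stalls never overlap."""
    return misses_per_second * cycles_per_miss / clock_hz


# One trip to RAM: about 80 nanoseconds.
print(cycles_to_ns(400))

# A single 400-cycle stall as a fraction of one second: 8e-08, or 0.000008%.
print(stalled_fraction(1, 400))

# Hypothetical workload: five million RAM trips per second.
# The core now spends 40% of its time waiting.
print(stalled_fraction(5_000_000, 400))
```

One stall is negligible, but under this assumed workload nearly half of the available cycles are lost.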
Out-of-Order Execution and Register Renaming
Out-of-order execution is a technique that allows the CPU’s hardware scheduler to reorder the instructions in its queue. Through this reordering, it can run instructions whose inputs are ready ahead of older instructions that are still waiting. Instructions with a data dependency that hasn’t yet been met are held back in the queue until the data arrives. This avoids pipeline stalls as much as possible, minimizing idle cycles.
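To make the difference concrete, here is a toy model of a core that starts at most one instruction per cycle. It is only a sketch under simplifying assumptions: the `Instr` type, the `run` function, and the example program are all made up for illustration, every register is written only once, and the 20-cycle load stands in for an L2 cache hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str      # register written by this instruction
    srcs: tuple    # registers read by this instruction
    latency: int   # cycles until the result is available


def is_ready(instr, older, ready_at, cycle):
    """True if every source value is available at the given cycle."""
    for src in instr.srcs:
        # An older instruction that has not started yet still owes us this value.
        if any(o.dest == src for o in older):
            return False
        if ready_at.get(src, 0) > cycle:
            return False
    return True


def run(program, out_of_order):
    """Simulate the toy core.

    Returns (cycle at which the last result is available, start order).
    """
    pending = list(program)
    ready_at = {}  # register -> cycle at which its value is available
    order = []
    cycle = 0
    while pending:
        # An in-order core may only look at the oldest pending instruction.
        window = len(pending) if out_of_order else 1
        for i in range(window):
            instr = pending[i]
            if is_ready(instr, pending[:i], ready_at, cycle):
                ready_at[instr.dest] = cycle + instr.latency
                order.append(instr.name)
                del pending[i]
                break
        cycle += 1
    return max(ready_at.values()), order


# A slow load, an add that needs the loaded value, then 15 independent instructions.
program = [
    Instr("load", "r1", (), 20),
    Instr("add", "r2", ("r1",), 1),
] + [Instr(f"work{i}", f"t{i}", (), 1) for i in range(15)]

in_order_cycles, _ = run(program, out_of_order=False)
ooo_cycles, ooo_order = run(program, out_of_order=True)
print(in_order_cycles, ooo_cycles)  # 36 21
print(ooo_order[-1])                # add
```

The in-order run sits idle while the load is outstanding, whereas the out-of-order run fills those cycles with the independent work and only starts the dependent add once its input has arrived.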
Out-of-order execution relies on a feature called register renaming. The CPU can access data held in registers within a single cycle. Registers are used to store data being read and written. It is essential, however, to ensure that the program sees everything happening in the logical order, not in the out-of-order, cycle-optimized order. To enable this, CPUs have many more physical registers than the CPU architecture defines.
A result that is ready to be written, but that belongs after an “earlier” instruction which hasn’t completed yet, is placed in a spare physical register that acts as a holding register. This data isn’t copied to another register once the order has sorted itself out. Instead, the name of the holding register is changed to that of the register the result belongs in. This is somewhat similar to preparing a dessert before the main course but then keeping it in the fridge until it’s time to serve it.
Software cannot address these physical registers directly. A program can only name the architectural registers, and the CPU maps each name to whichever physical register currently carries it. That said, the CPU tracks the holding registers closely enough that if other reordered instructions rely on the data in one, they can use it rather than the “outdated” data in the architectural register at that moment.
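The renaming idea can be sketched as a small lookup table. This follows the simplified description above rather than any real CPU design, which is considerably more involved; the `RenameTable` class, its method names, and the register names used in the example are all hypothetical.

```python
class RenameTable:
    """Toy register file with more storage slots than architectural names."""

    def __init__(self, arch_regs, num_slots):
        # Initially, each architectural name points at its own slot.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # The remaining slots are spare holding registers.
        self.free = list(range(len(arch_regs), num_slots))
        self.values = [0] * num_slots

    def read(self, name):
        """Read the value currently visible under an architectural name."""
        return self.values[self.mapping[name]]

    def allocate(self, value):
        """Park an early result in a spare holding register."""
        slot = self.free.pop(0)
        self.values[slot] = value
        return slot

    def read_slot(self, slot):
        """Let a reordered instruction use a parked result directly."""
        return self.values[slot]

    def commit(self, name, slot):
        """Rename: the holding register takes over the architectural name."""
        old = self.mapping[name]
        self.mapping[name] = slot  # no data is copied, only the name moves
        self.free.append(old)      # the old slot can be reused later


table = RenameTable(["rax", "rbx"], num_slots=4)
holding = table.allocate(42)     # result computed ahead of its turn
print(table.read("rax"))         # 0: the program still sees the old value
print(table.read_slot(holding))  # 42: available to reordered instructions
table.commit("rax", holding)
print(table.read("rax"))         # 42: now visible under the name rax
```

Note that `commit` only updates the mapping. The value itself never moves, which is the point of renaming rather than copying.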
Memory Barriers
A memory barrier (also referred to as a membar, memory fence, or fence instruction) is an instruction in computer code. It allows a programmer to enforce an ordering constraint on memory operations issued before and after the memory barrier. The memory barrier instructs the CPU to ensure that all memory operations issued before the barrier are completed before any memory operation issued after it. This is done to ensure that important operations are completed in the correct order.
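The ordering constraint can be illustrated with a toy model that lists every order in which a core might carry out a sequence of memory operations. It is a deliberately crude sketch: it assumes the core may reorder any two operations that are not separated by a barrier, which is more permissive than real hardware, and the `FENCE` marker, the `legal_orders` function, and the operation names are invented for the example.

```python
from itertools import permutations, product

FENCE = "FENCE"


def legal_orders(ops):
    """List every order the toy core may use: free reordering, but never across a FENCE."""
    # Split the program into segments separated by fences.
    segments, current = [], []
    for op in ops:
        if op == FENCE:
            segments.append(current)
            current = []
        else:
            current.append(op)
    segments.append(current)

    # Any permutation is allowed inside a segment; segments keep their order.
    per_segment = [list(permutations(seg)) for seg in segments]
    return [sum(combo, ()) for combo in product(*per_segment)]


# Without a barrier, the flag may be written before the data it announces.
print(legal_orders(["store data", "store flag"]))
# [('store data', 'store flag'), ('store flag', 'store data')]

# With a barrier, only the intended order remains.
print(legal_orders(["store data", FENCE, "store flag"]))
# [('store data', 'store flag')]
```

The barrier does not make either store faster or slower by itself. It only removes the orderings in which something issued after it overtakes something issued before it.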
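Generally, within a single thread on a modern computer, this shouldn’t be necessary. Out-of-order execution and register renaming are well-established and mature techniques, and the CPU makes sure each program sees its own operations in the logical order. Nevertheless, a memory barrier can be helpful for older, less sophisticated out-of-order processors, or for critical memory operations whose order is visible to other cores or hardware devices.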
Memory barriers may come with some performance detriment. This is because they actively prevent the CPU scheduler from optimizing specific parts of the instruction flow. This increases the chance of a pipeline stall.
Conclusion
A memory barrier is an instruction that enforces an ordering constraint on memory operations. This is important because out-of-order processors may reorder instructions. While register renaming is well established as a method of keeping results consistent in this environment, it can be helpful to enforce ordering manually.
The memory barrier forces the CPU to ensure that memory operations issued before the barrier are completed before any issued after it. This prevents memory operations from being reordered across the barrier. It also prevents the CPU from optimizing that part of the instruction flow, which can impact performance.