Computers are complex machines, and no part of them is more complex than the CPU. At a basic level, the CPU seems like it should be relatively simple: it takes a series of instructions, processes them, and outputs the results. In reality, that description bears little resemblance to how modern CPUs actually work.
Sub-scalar to super-scalar
Early CPUs were exactly as you’d expect. They took instructions one at a time, in the order they were issued, processed each to completion, then moved on to the next. CPUs of this type were sub-scalar, completing less than one instruction per clock cycle. CPU designers identified that completing an instruction involves several distinct stages, classically fetch, decode, execute, and write-back, and that each stage requires different hardware. This meant that when a single instruction ran through the whole sequence at a time, most of the hardware sat idle. In any sort of processor, idle hardware is useless hardware.
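To make the idle-hardware problem concrete, here is a minimal sketch in Python. The four stage names and the one-cycle-per-stage timing are illustrative assumptions, not a model of any real processor.

```python
# Toy model of a strictly sequential (non-pipelined) CPU.
# Stage names and one-cycle-per-stage timing are illustrative assumptions.
STAGES = ["fetch", "decode", "execute", "writeback"]

def sequential_cycles(num_instructions: int) -> int:
    """Each instruction occupies every stage in turn before the next starts."""
    return num_instructions * len(STAGES)

n = 8
cycles = sequential_cycles(n)
print(f"{n} instructions take {cycles} cycles -> {n / cycles:.2f} IPC")
# In any given cycle, only 1 of the 4 stage units is busy; the other 3 sit idle.
```

Under these assumptions the toy CPU manages 0.25 instructions per cycle (IPC), with three-quarters of the hardware idle at any moment.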
To utilise this idle hardware, CPU designs were updated to use a pipeline approach. This kept the hardware for each stage separate but allowed all of the stages to work at the same time on a stream of instructions. While each individual instruction still took a few cycles to travel through the pipeline, the overall throughput approached one instruction per cycle once the pipeline was full. This made CPUs scalar.
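Extending the same toy model shows why. With a pipeline, a new instruction enters every cycle, so the only extra cost is filling the pipeline once. Again, the four-stage depth is an assumption for illustration.

```python
# Toy model of a 4-stage pipeline: a new instruction enters every cycle,
# so once the pipeline is full, one instruction completes per cycle.
NUM_STAGES = 4  # illustrative depth; real pipelines vary widely

def pipelined_cycles(num_instructions: int) -> int:
    """The first instruction takes NUM_STAGES cycles; each later one adds 1."""
    return NUM_STAGES + (num_instructions - 1)

for n in (8, 100, 10_000):
    cycles = pipelined_cycles(n)
    print(f"{n:>6} instructions: {cycles:>6} cycles -> {n / cycles:.3f} IPC")
# IPC approaches 1.0 as the one-off cost of filling the pipeline is amortised.
```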
To do more than that, processors needed to be made super-scalar, able to complete more than one instruction per clock cycle. To achieve this, multiple parallel pipelines were implemented, so that several instructions could be issued in the same cycle.
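One more tweak to the toy model sketches the effect of parallel pipelines; the issue widths here are arbitrary examples.

```python
import math

# Toy superscalar model: `width` parallel pipelines issue together each cycle.
NUM_STAGES = 4  # illustrative depth, as before

def superscalar_cycles(num_instructions: int, width: int) -> int:
    """With `width` pipelines, up to `width` instructions complete per cycle."""
    return NUM_STAGES + math.ceil(num_instructions / width) - 1

n = 10_000
for width in (1, 2, 4):
    cycles = superscalar_cycles(n, width)
    print(f"width {width}: {cycles} cycles -> {n / cycles:.2f} IPC")
# Peak IPC equals the issue width -- assuming no stalls and no dependencies,
# which is exactly the assumption the next section breaks.
```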
Keeping pipelines fed with data
The main performance issue with computers is typically memory latency. Many instructions operate on data, so that data needs to be available before the instruction can be executed. The question is, what do you do if that data isn’t immediately available? Traditionally, the answer was simply to stall and wait for it. This leaves the whole pipeline empty, potentially for hundreds of CPU cycles. Things get even worse when two instructions in parallel pipelines both need to wait for memory, as the first will hold up even the request for the second’s data. While CPU cache memory helps reduce the problem, it can’t eliminate it. A new paradigm was needed to solve it. That paradigm shift was Out Of Order Execution, or OOO.
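A quick sketch shows how expensive stalling is. The 200-cycle miss penalty below is an illustrative figure, not a measurement of any real memory system.

```python
# Toy model of an in-order pipeline hitting cache misses.
# The 200-cycle miss penalty is an illustrative assumption.
MISS_PENALTY = 200

def in_order_cycles(instructions: list[bool]) -> int:
    """Each entry is True if that instruction misses the cache.
    In-order execution: a miss stalls everything behind it."""
    cycles = 0
    for misses in instructions:
        cycles += 1 + (MISS_PENALTY if misses else 0)
    return cycles

# 100 instructions, only 2 of which miss the cache.
program = [i in (10, 50) for i in range(100)]
cycles = in_order_cycles(program)
print(f"{len(program)} instructions take {cycles} cycles "
      f"-> {len(program) / cycles:.3f} IPC")
# Just two misses drop IPC from 1.0 to 0.2: the pipeline spends most of
# its time empty, waiting on memory.
```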
One of the early stages of the pipeline, after an instruction is fetched, is to decode it: working out what needs to be done and checking whether the data needed for the operation is available. In an OOO CPU, decoded instructions are added to a queue. They are only removed from the queue and actually processed when the data they need is available. Critically, it doesn’t matter in what order the instructions were added to the queue. If an early instruction is waiting for data, a more recent instruction that is ready to go can skip ahead. OOO processors reorder the instructions they execute based on the queue of upcoming instructions and which of those are ready for execution.
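Here is a minimal sketch of that issue policy: each cycle, pick the oldest instruction in the queue whose data is ready. The instruction names and the ready_at timings are made-up assumptions purely to show the reordering.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    ready_at: int  # cycle at which this instruction's data arrives (assumed)

# Decoded instructions in program order. The `add` depends on the load,
# so its data isn't ready until (an assumed) cycle 5.
queue = [
    Instr("load  r1, [mem]", ready_at=0),
    Instr("add   r2, r1, r1", ready_at=5),   # waits on the load
    Instr("mul   r3, r4, r5", ready_at=0),   # independent: can skip ahead
    Instr("sub   r6, r7, r8", ready_at=0),   # independent: can skip ahead
]

cycle = 0
while queue:
    ready = [i for i in queue if i.ready_at <= cycle]
    if ready:
        instr = ready[0]          # oldest ready instruction wins
        queue.remove(instr)
        print(f"cycle {cycle}: issue {instr.name}")
    else:
        print(f"cycle {cycle}: nothing ready, stall")
    cycle += 1
```

Running this, the independent mul and sub issue in cycles 1 and 2, ahead of the stalled add, instead of everything waiting behind it.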
Critical dependencies
This process assumes two things: first, that true dependencies can be reliably identified and handled; second, that false dependencies can be too. What is the difference? A true dependency is one that can’t be mitigated at all in an OOO system. The easiest example is the read-after-write (RAW) hazard. If you have one instruction that is supposed to write some data and another that is supposed to then read that data, there’s no way to reorder those instructions. They must be completed in the order in which they were presented, or you’ll get nonsense data.
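A tiny sketch makes the RAW hazard concrete, with the registers modelled as a plain dictionary and the two instructions as functions:

```python
# Read-after-write (RAW): instruction B reads what instruction A wrote.
# Executing them in the wrong order produces garbage.
def write_r1(r): r["r1"] = 42           # instruction A: write r1
def read_r1(r):  r["r2"] = r["r1"] + 1  # instruction B: read r1 (depends on A)

# Program order: A then B -> correct.
regs = {"r1": 0, "r2": 7}
write_r1(regs); read_r1(regs)
print("in order:  r2 =", regs["r2"])   # 43

# Reordered: B then A -> B reads the stale value.
regs = {"r1": 0, "r2": 7}
read_r1(regs); write_r1(regs)
print("reordered: r2 =", regs["r2"])   # 1 -- nonsense
```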
A false dependency is one that can be hidden with another clever trick. Take the example of write-after-read (WAR). At first glance, you might think you can’t overwrite data before it has been read. Things aren’t that simple, though. What if you have somewhere else to write the new data, and can simply swap the new and old data once the old data has been read? This is the process of register renaming, and it is critical to OOO processing.
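Here is the WAR hazard in the same toy style, with assumed register values chosen only to make the wrong answer obvious:

```python
# Write-after-read (WAR): a later instruction wants to overwrite a register
# that an earlier (stalled) instruction still needs to read.
# Program order:
#   A: r2 = r1 * 2   (reads r1 -- imagine it is stalled, waiting to issue)
#   B: r1 = 99       (writes r1)
regs = {"r1": 10, "r2": 0}

regs["r1"] = 99                # B executed early, before A could read r1
regs["r2"] = regs["r1"] * 2    # A finally runs, but reads the new value
print("r2 =", regs["r2"])      # 198 -- program order says it should be 20

# The fix: give B a *different* physical register to write, so A can still
# read the old r1. That is register renaming, sketched below.
```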
Typically, an instruction set defines a fixed number of architectural registers, and software literally can’t address any others. But what if you overprovision registers in hardware? You can hide these extra physical registers for the most part, use them to store data that shouldn’t be visible yet, and then simply swap the labels of the hidden and architectural registers once the timeline is correct again. At any one time, there is exactly the right number of architectural registers; they’re just not necessarily always in the same place. A real-world analogy would be hot-desking.
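A minimal sketch of that label-swapping follows. All names and sizes are illustrative: a rename table maps each architectural register to one of a larger pool of physical registers, and every write gets fresh storage.

```python
# Minimal register-renaming sketch. Pool size and naming are assumptions.
NUM_PHYSICAL = 8
free_list = [f"p{i}" for i in range(NUM_PHYSICAL)]
rename_table = {}   # architectural name -> current physical register
phys = {}           # physical register -> value

def rename_write(arch_reg: str, value) -> None:
    """A write allocates a fresh physical register and remaps the name."""
    preg = free_list.pop(0)
    phys[preg] = value
    rename_table[arch_reg] = preg

def read(arch_reg: str):
    """A read goes through the table to the current physical register."""
    return phys[rename_table[arch_reg]]

rename_write("r1", 10)          # r1 -> p0
old_r1 = rename_table["r1"]     # a stalled reader captured this mapping
rename_write("r1", 99)          # the WAR write gets a fresh register: r1 -> p1
print("old r1 (seen by the stalled reader):", phys[old_r1])  # 10
print("new r1 (architectural view):        ", read("r1"))    # 99
```

Because the older reader holds the old mapping (p0), the later write to p1 can proceed immediately; the WAR dependency has been hidden. A real CPU also recycles physical registers once no instruction needs them, which this sketch omits.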
Conclusion
Out Of Order execution is a processing paradigm where instructions can be dynamically reordered at execution time by the CPU, on the basis of the earliest-issued instructions that have their data available. This means the instructions being fed into the pipeline are almost always ready to execute, minimising the delays spent waiting for data. Of course, the queue needs to be long enough that it doesn’t fill up with instructions waiting for data, but that’s an implementation challenge. OOO execution relies on register renaming to hide false dependencies. Even though instructions are actually executed out of order, the registers are renamed in such a way as to hide this fact from the rest of the computer.