Historically, CPUs were designed to execute one instruction at a time, in the order in which the instructions were issued. This design is relatively easy to build and to reason about. It’s not very efficient, though: executing an instruction involves several distinct steps, each of which requires different circuitry. In a sequential processor, that circuitry must sit idle while other parts are working. While this is good for power consumption, it’s not an efficient use of the hardware and leaves potential performance on the table.
An instruction pipeline resolves this issue. In a pipelined processor, the discrete sections of circuitry are broken apart further into pipeline stages. This separation lets the circuitry for every stage be used constantly, because the processor starts the next instruction before the current one has finished. As many instructions as there are pipeline stages can be in flight at once.
This doesn’t speed up the processing of any individual instruction. What it does do, however, is dramatically increase instruction throughput. For example, suppose an instruction takes five cycles to complete. An unpipelined CPU finishes one instruction every five cycles; a pipelined CPU finishes one instruction every cycle, because five instructions are in various stages of execution at any given time.
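To put rough numbers on it: once a five-stage pipeline is full, one instruction completes every cycle, so 100 instructions finish in about 104 cycles (4 cycles to fill the pipeline, then one completion per cycle). On an unpipelined CPU, the same 100 instructions would take 500 cycles, nearly a fivefold difference in throughput.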
What Does a Pipeline Look Like?
The pipelines in modern CPUs vary in length between manufacturers and CPU architectures. Some of the longest pipelines in modern consumer CPUs are on the order of 20 stages. A long pipeline means the circuitry for each stage can be simplified, which allows higher clock speeds. The concept is easier to understand, however, with the classic five-stage RISC (Reduced Instruction Set Computer) pipeline.
In the classic RISC pipeline, there are five pipeline stages. Instruction Fetch (IF) is the first stage; it retrieves the instruction to be executed. Instruction Decode (ID) is the second stage; it decodes the retrieved instruction to work out what needs to be done. Execute (EX) is the third stage, where the computation defined by the instruction is performed. Memory Access (MEM) is the fourth stage, where load and store instructions read or write memory; for other instructions it simply passes the result along, keeping every instruction aligned in the pipeline. The final stage is Write Back (WB), where the result of the computation is written to the destination register.
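To make the overlap concrete, here is a minimal sketch in Python, just a visualization with made-up instruction labels rather than a hardware model, that prints which stage each of five independent instructions occupies on each cycle:

# Show which stage each instruction of a classic five-stage pipeline
# occupies on each cycle. Instruction names are hypothetical.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]
instructions = ["i1", "i2", "i3", "i4", "i5"]
for cycle in range(len(instructions) + len(STAGES) - 1):
    slots = []
    for n, instr in enumerate(instructions):
        stage = cycle - n  # each instruction starts one cycle after the previous
        if 0 <= stage < len(STAGES):
            slots.append(f"{instr}:{STAGES[stage]}")
    print(f"cycle {cycle + 1}:", "  ".join(slots))

On cycle 5, the output shows all five stages busy at once (i1 in Write Back, i2 in Memory Access, and so on down to i5 in Instruction Fetch), which is exactly the overlap that pipelining buys.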
Pipeline Stalls
The thing with pipelines is that you start one or more instructions before the current one has finished processing. This is useful but carries some risk. One risk is reading the value of a variable before an earlier instruction has had the chance to update it. For example, consider the following instructions:
Set A = 1
Set A = A + 1
Print A to the screen.
Setting A to 1 is easy; no issues there. The problem comes afterward. If the second instruction reads A during its decode stage, the first instruction may not yet have written the value 1 back, so the addition operates on a stale or undefined value. The same thing happens when printing A to the screen: the update that makes A = 2 may not have been written back yet, so you’d print 1 instead of the expected 2.
Pipeline stalls, also called pipeline bubbles, can solve this issue: a pause is inserted into the pipeline to wait for the earlier instruction that the current one depends on to complete. Unfortunately, this carries a performance penalty, as part of the CPU sits idle for several cycles. The more advanced solution is out-of-order execution, in which other instructions that don’t depend on the earlier ones are moved up in the queue. This avoids the performance impact but is a lot more complex to manage.
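Here is a minimal sketch of the stalling idea in Python. It is a deliberately pessimistic toy model: it assumes a result is only usable once its instruction has passed through all five stages, with no result forwarding (which real hardware uses to shorten these stalls):

# Toy in-order issue model: one instruction issues per cycle unless it
# must wait for a register written by an earlier instruction.
STAGES = 5  # IF, ID, EX, MEM, WB
program = [
    ("Set A = 1",     [],    "A"),   # writes A
    ("Set A = A + 1", ["A"], "A"),   # reads A, writes A
    ("Print A",       ["A"], None),  # reads A
]
ready = {}  # cycle at which each register's latest value becomes usable
cycle = 0
for name, reads, writes in program:
    # Stall (insert bubbles) until every input register has been written back.
    issue = max([cycle] + [ready.get(r, 0) for r in reads])
    if issue > cycle:
        print(f"{name}: {issue - cycle} bubble(s) inserted")
    if writes:
        ready[writes] = issue + STAGES
    cycle = issue + 1

Running it shows the addition stalling for four cycles behind the first instruction, and the print stalling for four more behind the addition: exactly the idle bubbles a real CPU tries hard to avoid.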
Pipeline Flush
Conditional branches cause another, similar issue with pipelines. A branch is an instruction that causes the computer to skip to another section of code. A conditional branch may or may not be taken depending on a condition; for example, a conditional branch could be taken if the variable A is even and not taken if A is odd. The problem is that the condition may not be known by the time the CPU has to decide whether to take the branch, because an earlier instruction that changes the value of A may still be in the pipeline, not yet having written its update.
There are three possible choices. You can stall and do nothing until you know for sure which branch to take. You can start processing both options and, once the correct branch is known, discard the work done for the wrong one. Or you can attempt to predict which branch will be taken. The first option significantly impacts performance, as the CPU just sits idle. Trying both branches means you’re not idle, but your effective throughput is roughly halved until the correct branch is known, as half of the instructions you start working on will be discarded.
Branch prediction is what most modern CPUs do. With a success rate above 50%, you’re at least doing better than trying both branches, and modern algorithms can reach success rates of around 95%. On a correct guess, there’s no performance impact. On a wrong guess, though, the impact can be high: the in-flight instructions from the incorrect branch must be canceled. This is called a pipeline flush. Once the pipeline has been flushed, the instructions for the correct branch can be started.
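To give a flavor of how prediction works, here is a minimal sketch of a two-bit saturating-counter predictor, a classic textbook scheme (real modern predictors are far more sophisticated and track branch history):

class TwoBitPredictor:
    # Counter values 0-1 predict "not taken"; values 2-3 predict "taken".
    def __init__(self):
        self.counter = 0
    def predict(self):
        return self.counter >= 2
    def update(self, taken):
        # Saturate at the ends so one odd outcome doesn't flip the prediction.
        self.counter = min(3, self.counter + 1) if taken else max(0, self.counter - 1)

predictor = TwoBitPredictor()
outcomes = [True] * 8 + [False]  # hypothetical loop branch: taken 8 times, then exits
correct = 0
for taken in outcomes:
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)
print(f"{correct}/{len(outcomes)} predictions correct")

This prints 6/9: the predictor misses twice while warming up and once at the loop exit, and gets everything in between right. On real hardware, each of those three misses would trigger a pipeline flush.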
Out-of-Order Execution
Out-of-order execution can be the saving grace here. As long as the scheduler can see far enough ahead to optimize the flow, out-of-order execution can eliminate pipeline flushes entirely. It does this by ensuring that when a conditional branch is scheduled, the instructions that compute the condition it waits on are processed as soon as possible, and the branch itself is resolved as quickly as possible after that.
Additionally, no instructions that depend on the branch are scheduled until the branch has been fully resolved. This reordering and prioritization means the branch doesn’t have to be guessed at all, because its condition is already known, and the instructions of the correct path are only scheduled once that path has been identified. Other, unrelated instructions fill the gaps in the meantime.
The only catch is when there are many conditional branches. In that case, there may not be enough unrelated instructions to fill the gaps while the branch conditions are computed, forcing a fallback to branch prediction.
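Here is a minimal sketch of that scheduling idea, a toy one-instruction-per-cycle scheduler with made-up instruction names and latencies, nothing like a real out-of-order core:

# Hypothetical program: i1 computes the branch condition (3-cycle latency),
# i2 is the conditional branch that needs i1's result, and i3/i4 are
# unrelated work that can fill the gap.
deps    = {"i1": [], "i2": ["i1"], "i3": [], "i4": []}
latency = {"i1": 3, "i2": 1, "i3": 1, "i4": 1}
finish, issued = {}, []
cycle = 1
while len(issued) < len(deps):
    ready = [i for i in deps if i not in issued
             and all(d in finish and finish[d] <= cycle for d in deps[i])]
    # Prioritize the branch's dependency chain; fill gaps with unrelated work.
    pick = next((i for i in ("i1", "i2") if i in ready),
                ready[0] if ready else None)
    if pick is not None:
        issued.append(pick)
        finish[pick] = cycle + latency[pick]
        print(f"cycle {cycle}: issue {pick}")
    cycle += 1

The output shows i1 issuing first, i3 and i4 filling the two cycles while i1’s result is computed, and the branch i2 issuing the moment its condition is known, with no guessing and no idle cycles.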
Conclusion
A pipeline flush is the process of a pipelined CPU discarding the in-flight instructions associated with an incorrectly predicted conditional branch. When the taken/not-taken outcome of a branch is predicted correctly, the conditional branch incurs no performance penalty at all.
Choosing incorrectly requires the pipeline to be flushed and then refilled, which costs more than simply waiting would have. A good branch prediction algorithm is therefore essential, as you want to minimize the number of pipeline flushes.
Did this help? Let us know!