In almost any computer program, there are parts of the code that branch into separate paths. For example, an if-then-else statement has two possible outcomes. These statements don’t present any issue to sequential processors, as the CPU processes every command in order. Branches present a big problem for pipelined processors, as multiple instructions are being executed at once.
In the case of a program with a branching statement with two possible outcomes, the two possible code paths can’t be in the same place. The processing required to complete either option will be different. Otherwise, the program wouldn’t branch. There will likely be a fair number of branching statements that are only ever taken once, like an if statement.
There are also branching statements that form a loop. While these might not be as numerically common as single-use statements, they are generally statistically repeated. It can be assumed that it’s more likely that a branch will take you back around a loop than not.
Why Is This a Problem?
It doesn’t matter how this problem is phrased in a fully sequential processor. It simply isn’t a problem. Which branch will be taken is known before the first part of the following instruction is processed.
In a pipelined processor, however, the following instructions are queued up. They are already being processed when the processor knows which branch is being taken. So how should the processor handle this scenario? There are a few options. The worst is simply waiting and leaving the pipeline idle while it waits for the answer on which branch to take. This would mean that every time you have a branching statement, you would always lose at least as many cycles of processor time as you have stages in the pipeline.
Alternatively, you could try running both options in the pipeline and discard the wrong choice. This would have half the penalty of not doing anything but still incur a performance penalty on every branching statement. Given that modern CPUs can also run instructions out of order, you could potentially attempt to run every branching instruction as soon as possible. So its outcome is known before it’s needed. This option could be helpful, except it isn’t scalable. Suppose you have a relatively high density of branching statements. In that case, you will simply be unable to run all of them early without having some idle time.
How Is This Issue Really Addressed
In reality, the processor includes a branch predictor. The branch predictor attempts to guess which result of a branching choice will be taken. The processor then assumes that the prediction is correct and schedules instructions. If the branch predictor is accurate, there is no performance penalty.
If the branch predictor makes a mistake, you must flush the pipeline and start processing the actual result. This does result in a slightly worse performance penalty than having done nothing and just waited for the result. To get the best performance, it’s important to ensure that the branch predictor is as accurate as possible. There is a range of different approaches to this.
In machine code, a branch is always a choice between reading the next instruction or jumping to a set of instructions elsewhere. Some early implementations of branch predictors simply assumed that all branches would always or would never be taken. This implementation can have a surprisingly good success rate if compilers know this behavior and are designed to adjust the machine code so that the most likely result aligns with the processor’s general assumption. This requires a lot of tuning and adjusting software development habits.
Another alternative was to learn from the statistic that loops are generally taken and always jump if the branch goes backward in the instruction stream and never jump if the jump is forwards because that would normally be the statistically less likely state of leaving a loop. These are both types of static branch prediction, where the result of a branch is predicted at compile time.
Dynamic Branch Predictors
Modern branch predictors are dynamic, using statistics from the currently running program to achieve better prediction success rates. A saturating counter is a basis for all current branch predictors. The first guess is decided statically or randomly. Once a branch has been taken or not taken, that result is stored in a portion of memory. The next time that same branch comes around, the branch predictor predicts the same result as before.
This naturally results in good prediction rates for loops. There are two versions of this. The early versions simply used a single bit of data to indicate if the branch was taken or not. Later versions use two bits to indicate a weakly or strongly taken or not taken choice. This setup can still predict the result of taking a loop if the program returns to it, generally increasing success rates.
To track patterns, some branch predictors keep track of a history of what choices were taken. For example, suppose a loop is continuously called but only loops around four times before breaking out of the loop. In that case, a two-level adaptive predictor can identify this pattern and predict it happening again. This increases success rates even further over a simple saturated counter. Modern branch predictors build on this further by using a neural network to spot and predict patterns.
Some branch predictors use multiple algorithms and then compare the results to decide which prediction to use. Some systems keep track of each branching instruction separately in what’s called local branch prediction. Others use a global branch prediction system to track all recent branching instructions. Both have their uses and downsides.
A branch predictor is a special part of a pipelined CPU. It attempts to predict the outcome of a branching instruction before the actual instruction is processed. This is a core function of a pipelined CPU as it allows the CPU to keep the pipeline saturated even if it’s unsure what instructions need to be executed. They offer no reduction in performance when they are correct. Modern algorithms can be accurate in relatively predictable workloads as much as 97% of the time.
There is no perfect method of prediction, so implementations vary. The performance impact of making a wrong prediction depends on the length of the pipeline. Specifically, the length of the pipeline before it can be determined if the prediction was wrong. It also depends on how many instructions are in each pipeline stage. Modern high-end desktop CPUs have around 19 pipeline stages and can run at least four instructions simultaneously in each stage.