The purpose of a CPU is to execute instructions. Early CPUs would identify the next instruction in the queue, run through all of the processing needed to complete it, and only then acquire the next one from the queue. This design led to several inefficiencies that modern CPUs have since addressed. One of the biggest inefficiencies was addressed by pipelining.
Classic RISC Pipeline
In any CPU, executing an instruction involves several distinct steps. A basic overview of the concept can be easily understood from the classic RISC (Reduced Instruction Set Computing) pipeline, which has five stages. Instruction Fetch is the first stage; it retrieves the instruction to be executed. Instruction Decode is the second stage; it decodes the retrieved instruction to identify what needs to be done.
Execute is the third stage; it’s where the computation defined by the instruction is performed. Memory Access is the fourth stage, where data memory is read or written. It also acts as a buffer to ensure that one- and two-cycle instructions stay aligned in the pipeline. The final stage is Write Back, where the computation results are written to the destination register.
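For reference, here are the five stages written out as plain Python data. The stage names follow the description above, and treating each stage as taking one cycle is a simplifying assumption:

```python
# The five classic RISC pipeline stages, in order. The comments
# summarize each stage's job; assuming one cycle per stage is a
# simplification used throughout this article.
CLASSIC_RISC_STAGES = [
    "Instruction Fetch",   # retrieve the next instruction
    "Instruction Decode",  # work out what the instruction must do
    "Execute",             # perform the computation itself
    "Memory Access",       # read or write data memory if needed
    "Write Back",          # store the result in the destination register
]
```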
A standard sequential CPU may not separate these functions as cleanly as the classic RISC pipeline does. However, it still needs to perform the same sorts of tasks in the same order. Crucially, the silicon on a CPU that performs each of these functions is separate. So it’s possible to perform all of these functions simultaneously, each on a different instruction, pushing the instructions through the pipeline in order. This is called pipelining.
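To make the overlap concrete, here’s a small Python sketch of an idealized pipeline schedule. It’s a toy model, assuming one cycle per stage and no stalls, rather than any real CPU’s behavior:

```python
def pipeline_schedule(num_instructions: int, num_stages: int = 5) -> None:
    """Print the stage each in-flight instruction occupies per cycle.

    Toy model: every stage takes exactly one cycle and there are no
    stalls, so instruction i is in stage s during cycle i + s.
    """
    for cycle in range(num_instructions + num_stages - 1):
        in_flight = [
            f"instr {cycle - s} in stage {s}"
            for s in range(num_stages)
            if 0 <= cycle - s < num_instructions
        ]
        print(f"cycle {cycle}: " + "; ".join(in_flight))

pipeline_schedule(3)
```

Reading the output cycle by cycle shows several stages busy at once, each working on a different instruction.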
Benefits of Pipelining
The single biggest benefit of pipelining is a massive throughput gain. Assume that each instruction takes one clock cycle to pass through each stage. A sequential CPU could then complete one instruction every five cycles. In a pipelined CPU, each instruction still takes five cycles to be completed, but five instructions are in different stages of processing at the same time, so one instruction is completed every cycle (in a best-case scenario). In this way, pipelining offers a significant performance increase.
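The arithmetic behind that claim is easy to check. A minimal sketch, again assuming one cycle per stage and no stalls:

```python
def cycles_sequential(n_instructions: int, n_stages: int = 5) -> int:
    # Each instruction runs through all stages before the next starts.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions: int, n_stages: int = 5) -> int:
    # The first instruction takes n_stages cycles to fill the pipeline;
    # every instruction after it completes one cycle later.
    return n_stages + (n_instructions - 1)

for n in (1, 5, 100):
    print(f"{n:>3} instructions: sequential {cycles_sequential(n):>3} cycles, "
          f"pipelined {cycles_pipelined(n):>3} cycles")
```

For long instruction streams, the pipelined count approaches one cycle per instruction, a fivefold throughput gain over the sequential design.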
Note: The performance increase from pipelining is only directly comparable against the same architecture running without overlapping its pipeline stages. CPUs capable of pipelining will always use it because of its benefits, so the actual performance increase will vary depending on the specific architectures being compared.
Pipelining is very similar in concept to a production line in a factory. While producing any single item still takes the same amount of time, breaking that work into multiple independent steps and performing them simultaneously significantly increases throughput.
Downsides of Pipelining
The main downside of pipelining is the increased silicon budget that needs to be assigned to data storage such as registers and cache. When only one instruction is being acted on at a time, only the data associated with that instruction needs to be immediately accessible for optimum performance.
If you then add a pipeline capable of handling multiple instructions at once, each of those instructions needs a similar amount of readily available memory. This eats into the silicon budget available for the data-processing parts of the CPU. Increasing the silicon budget also increases the power draw and heat production of a CPU.
Assuming no change of process node, and simplifying away other necessary changes, adding a pipeline to a CPU architecture means that either the CPU die size would need to increase, or the die space assigned to processing cores or other functionality would need to decrease. It does, however, mean that the die area freed by the shrink to a more current node can be assigned to the hardware necessary for the pipeline.
Scaling to Superscalar
The term used to describe a fully pipelined CPU’s ability to complete one instruction every CPU cycle is scalar. Sequential CPUs are always subscalar. Pipelined CPUs can be scalar, though pipeline stalls and incorrect branch predictions can reduce their performance to subscalar. It’s also possible to increase performance to superscalar, completing more than one instruction per cycle. To reach superscalar performance, a CPU needs to double up on hardware, so there are essentially two or more pipelines running side by side.
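All three of these terms boil down to instructions per cycle (IPC). A tiny illustrative sketch, using made-up measurement numbers:

```python
def classify_throughput(instructions_completed: int, cycles: int) -> str:
    """Classify a CPU's measured throughput by instructions per cycle."""
    ipc = instructions_completed / cycles
    if ipc > 1:
        return "superscalar"   # more than one instruction per cycle
    if ipc == 1:
        return "scalar"        # exactly one instruction per cycle
    return "subscalar"         # less than one instruction per cycle

print(classify_throughput(100, 500))  # sequential CPU: subscalar
print(classify_throughput(100, 100))  # ideal pipeline: scalar
print(classify_throughput(200, 100))  # two pipelines, best case: superscalar
```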
To achieve the best performance, you want each stage to take the same number of cycles, ideally just one, to complete. If a stage cannot be split into shorter substages, a similar effect can be achieved by duplicating the hardware for that stage.
For example, consider a three-stage pipeline. The first two stages take one cycle each to complete, but the last stage takes two. This limits overall throughput to one instruction every two cycles. Duplicating the hardware for that final stage and alternating instructions between the two copies restores an overall throughput of one instruction per cycle, as the sketch below illustrates. The concept is identical to adding a parallel workstation on a production line.
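Here’s a back-of-the-envelope model of that example. It’s a toy steady-state calculation, not a real simulator, and it assumes instructions alternate perfectly between duplicated copies of a stage:

```python
def steady_state_throughput(stage_cycles, copies=None):
    """Instructions completed per cycle once the pipeline is full.

    Toy model: throughput is limited by the slowest stage, and
    duplicating a stage's hardware divides its effective cost by
    the number of copies (perfect alternation assumed).
    """
    copies = copies or [1] * len(stage_cycles)
    bottleneck = max(cost / n for cost, n in zip(stage_cycles, copies))
    return 1 / bottleneck

# The three-stage pipeline from the example: 1, 1, and 2 cycles.
print(steady_state_throughput([1, 1, 2]))             # 0.5 instructions/cycle
# Duplicate the two-cycle stage to remove the bottleneck.
print(steady_state_throughput([1, 1, 2], [1, 1, 2]))  # 1.0 instructions/cycle
```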
Conclusion
A CPU pipeline refers to the separate pieces of hardware required to complete an instruction in several stages. Critically, each of these stages is used simultaneously, with a different instruction occupying each one. The concept is analogous to a production line in a factory, with different workstations performing different functions.
There are some extra hardware and complexity requirements but significant performance benefits. Performance can be further increased by having parallel pipelines, though the hardware requirements for this are even higher. All modern CPUs utilize pipelines.