Pipelining and out-of-order instruction execution optimize performance

Pipelines are an essential element of modern processor microarchitecture. Even RISC processors with extremely simple instruction sets can't execute an instruction in a single clock cycle without a pipelined design. The pipeline breaks instruction decode and execution into multiple stages so that, on aggregate, the processor can complete one instruction per clock cycle. But microcontroller (MCU) designers must take care to balance the depth of the pipeline. Moreover, the team must decide how much silicon to dedicate to pipeline processing because die size relates directly to IC cost. The Renesas RX employs a five-stage pipeline and integrates support for out-of-order execution to minimize pipeline stalls.

A pipeline, as the name implies, is a serial set of functional stages, each of which performs a portion of the work necessary to execute an instruction. Each stage of the pipeline passes the results of its work along to the next stage. In its most basic implementation, a pipeline allows a microprocessor to begin executing a subsequent instruction before execution of the prior instruction completes. In a processor with a five-stage pipeline, you could have five instructions being processed simultaneously.

Some general-purpose microprocessors have used upwards of 20 pipeline stages. In such designs, each stage includes a minimal amount of logic, thus enabling faster clock rates. Some MCU architectures implement 10 or more stages. At the other end of the spectrum, some MCUs integrate only two-stage pipelines.

The downsides to long pipelines include the danger of pipeline stalls that waste CPU time, and the time it takes to reload the pipeline on context switches and even on conditional branch operations. Let's consider the latter first. Pipeline efficiency depends on an orderly flow of instructions to enable the overlapped instruction execution. If a conditional jump instruction executes, transferring execution to a new memory address, the subsequent instructions in the pipeline must be flushed, and the process of loading the pipeline restarts at the new instruction address.

To examine pipeline stalls, consider the figure below. The example is based on a five-stage pipeline such as the one used in the RX architecture. The stages are instruction fetch, instruction decode, execution, memory access, and register write back. The upper part of the figure illustrates a pipeline stall. The MOV instruction requires multiple clock cycles in the memory-access stage because of its relative addressing mode, in which R1 points to operands stored in memory. The subsequent ADD instruction stalls waiting for the memory-access stage. In turn, the SUB instruction stalls waiting for the decode stage.

Processor designers get around the stall issue using techniques such as out-of-order execution, which is illustrated in the lower portion of the figure. The RX MCU, for example, can proceed to execute the ADD and SUB instructions before the MOV instruction clears the memory-access stage. Out-of-order execution works so long as the subsequent instructions don't depend on data from the prior instruction. In the example, the ADD and SUB instructions use different registers than the MOV instruction does, so there is no dependency.

Out-of-order execution isn't perfect, but it's a necessary use of silicon to optimize performance in a five-stage pipeline. Longer pipelines require more complex techniques to protect against stalls. For example, some processors use speculative execution and branch prediction. But those techniques greatly increase processor complexity and silicon real estate requirements.

MCUs for embedded systems must balance the performance advantages a pipeline affords against the performance hit from stalls and context switches. Many embedded systems require hard real-time performance and quick response to interrupts. The five-stage architecture minimizes the time it takes to respond to an interrupt and fill the pipeline, while still offering a path to single-cycle instruction execution on aggregate.