Pipelining

A limiting factor in CPU performance is propagation delay: the time it takes for a signal to travel through the transistors and wiring between one storage element and the next. Every such signal must settle before the clock edge that captures it, so reducing the worst-case delay before a signal reaches a storage cell allows the clock speed to be increased.

For example, take the time profile of a naive CPU design which executes a single instruction per cycle, at 0.33 MHz:

             0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5    |--------------------|
ror x2,x4                         |--------------------|
xor x1,x3                                              |--------------------|
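
The 0.33 MHz figure follows directly from the critical path: if a signal needs roughly 3µs to make it from fetch all the way through execute, the clock cannot tick more than once every 3µs. A minimal sketch of that arithmetic (Python; the delay values are simply this example's assumptions):

<source>
def max_clock_hz(critical_path_seconds):
    """The clock can tick at most once per worst-case propagation delay."""
    return 1.0 / critical_path_seconds

print(max_clock_hz(3e-6))  # whole instruction in one cycle: ~333 kHz (0.33 MHz)
print(max_clock_hz(1e-6))  # one 1 µs stage per cycle: 1.00 MHz
</source>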

By splitting instruction fetch, decode, and execute into separate stages, we might be able to increase the clock rate to 1.00 MHz, since each stage has a much shorter critical path than a whole instruction:

             0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5    |--IF--|--ID--|--EX--|
ror x2,x4                         |--IF--|--ID--|--EX--|
xor x1,x3                                              |--IF--|--ID--|--EX--|

On its own this gains nothing: each instruction still needs three 1µs cycles, or 3µs end to end. Note, however, that the fetch, decode, and execute stages use separate hardware and are independent of one another. We can overlap these stages across consecutive instructions...:

             0µs    1µs    2µs    3µs    4µs    5µs    6µs    7µs    8µs    9µs
add x1,x5    |--IF--|--ID--|--EX--|
ror x2,x4           |--IF--|--ID--|--EX--|
xor x1,x3                  |--IF--|--ID--|--EX--|

... reducing total execution time from 9µs to 5µs!
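
In general, assuming an ideal pipeline with no stalls, hazards, or flushes (as this introductory example does), N instructions through S stages of length t take N·S·t without overlap but only (S + N − 1)·t with overlap: S cycles to fill the pipeline, then one instruction completes every cycle. A quick sketch (Python) reproducing the figures above:

<source>
def serial_time_us(n_instr, n_stages, stage_us):
    # Each instruction runs start to finish before the next one begins.
    return n_instr * n_stages * stage_us

def pipelined_time_us(n_instr, n_stages, stage_us):
    # Fill the pipeline once, then retire one instruction every cycle.
    return (n_stages + n_instr - 1) * stage_us

print(serial_time_us(3, 3, 1))     # 9 µs, as in the first diagram
print(pipelined_time_us(3, 3, 1))  # 5 µs, as in the overlapped diagram
</source>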