User:Demindiro/Ramblings/Massively concurrent Stack-Stack-Machine

From CPUdev wiki
Revision as of 12:45, 5 April 2025 by Demindiro (talk | contribs) (Dump of a potentially silly idea)

Out-of-Order Execution and Simultaneous Multi-Threading have been on my mind for a while, along with many ideas on how best to combine them. Below is one of those ideas in text form.

Latency, OoOE, SMT and architectural state

As transistors shrink, processors get faster and faster. For practical purposes, however, it is still necessary to communicate with external components such as DRAM, and processor speed improves much faster than the latency between the processor and those components. Nowadays processors are so large that even the latency to on-chip cache is a significant factor.

Most high-performance processors address these challenges with out-of-order execution: instructions later in the stream can be executed ahead of earlier instructions so long as they do not depend on them. Enabling OoOE requires several large and fast structures, and how far a processor core can execute ahead depends entirely on the size of these structures.

Another technique is simultaneous multi-threading: one processor core can evaluate multiple instruction streams simultaneously. Hence, if one stream is blocked on long-latency instructions, the others can continue being evaluated. SMT allows sharing execution resources but requires duplicating architectural state, in particular registers. Depending on the complexity of the core, this cost may be large or small. In particular, advanced OoOE processors tend to have at least 2-way SMT, with the extra state consuming a relatively small amount of die space.

SMT is not always beneficial: most (all?) implementations try to ensure fair scheduling, interleaving instructions from all streams. This may lead to cache thrashing and severely degrade the performance of some applications. On the other hand, it will perform much better than OoOE alone if threads are often stalled waiting on memory operations to finish.

What if we could mitigate the cache thrashing? An interesting analogy is fair versus unfair mutexes: unfair mutexes tend to achieve higher throughput by allowing threads with data already in cache to continue executing, significantly improving cache utilization. On the other hand, they may cause unpredictable delays for certain tasks, which is why fair mutexes are often preferred, at least by default.

If (user) applications had explicit control over SMT, then perhaps it would be practical to have "unfair" scheduling for SMT too, reducing cache thrashing. But what should the interface look like? Ideally there would be no arbitrary limit on the number of threads an application can schedule at once. The most significant limit here is the amount of architectural state: having too much both bloats a core and makes task switching costlier, as more state needs to be (explicitly) saved to memory.

What if we required only the absolute minimum amount of state?

Stack machines and stack of tasks

A stack machine requires only a tiny amount of state: a program counter and a stack pointer are sufficient. This makes switching tasks very cheap.

On the other hand, it requires very frequent access to memory, specifically the stack. This makes OoOE impractical. However, with massive and cheap concurrency the need for OoOE might be avoided altogether.
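To make the "tiny amount of state" concrete, here is a toy interpreter sketch: a task's entire architectural state is the pair (pc, sp), and all operands live on the task's stack in shared memory. The instruction names and encoding are illustrative assumptions, not a proposed ISA.

```python
MEM = [0] * 256  # shared memory region holding task stacks

def run(program, sp):
    """Interpret `program` for a task whose stack starts at `sp`.

    The only per-task state is (pc, sp); everything else is memory.
    """
    pc = 0
    while True:
        op = program[pc]
        if op[0] == "push":        # push an immediate onto the stack
            MEM[sp] = op[1]
            sp += 1
        elif op[0] == "add":       # pop two values, push their sum
            MEM[sp - 2] += MEM[sp - 1]
            sp -= 1
        elif op[0] == "halt":      # result is the top of the stack
            return MEM[sp - 1]
        pc += 1

# 2 + 3 evaluated with nothing but (pc, sp) as per-task state:
prog = [("push", 2), ("push", 3), ("add",), ("halt",)]
```

Note that every `push` and `add` touches memory, which is exactly the frequent stack traffic described above.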

How can we achieve this massive concurrency? We can use a stack of tasks that are ready to run. The top-most entries are removed from the stack if the processor is able to schedule them. Tasks that are waiting on an operation remain in internal processor state.
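A toy model of that ready-task stack: each entry is only (pc, sp), so pushing and popping tasks costs two memory words. The class name and the `slots` issue-width parameter are assumptions for illustration.

```python
class TaskStack:
    """LIFO stack of ready tasks; each task is just a (pc, sp) pair."""

    def __init__(self):
        self.ready = []

    def push(self, pc, sp):
        self.ready.append((pc, sp))

    def dispatch(self, slots):
        """Pop up to `slots` top-most tasks the core can schedule now."""
        taken = self.ready[-slots:]
        del self.ready[-slots:]
        return list(reversed(taken))  # top of stack issues first
```

Because the stack is LIFO, the most recently pushed tasks, whose data is most likely still in cache, run first: the "unfair" scheduling from the mutex analogy.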

Since tasks have very little architectural state, it is cheap to implement the fork-join paradigm: push a pointer to a stack and an instruction address and off it goes! (TODO: what about joins? And how to cheaply allocate new stack space? Would a bump allocator + copying GC work? Or an arena?)
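A hypothetical sketch of fork as "allocate a stack, push (pc, sp)", using the bump allocator floated in the TODO. Joins and stack reclamation are deliberately left open, as in the text.

```python
READY = []        # ready-task stack of (pc, sp) pairs
_bump = [0]       # bump pointer into a region reserved for task stacks
STACK_SIZE = 64   # fixed per-task stack size (an assumption)

def fork(entry_pc):
    """Spawn a task: bump-allocate its stack and mark it ready to run."""
    sp = _bump[0]
    _bump[0] += STACK_SIZE
    READY.append((entry_pc, sp))
    return sp
```

The whole operation is one addition and one two-word push, which is what makes spawning tasks this cheap.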