44 stable releases
new 2.1.1 | Nov 16, 2024 |
---|---|
2.0.16 | Nov 16, 2024 |
2.0.3 | Jul 20, 2024 |
1.18.26 | Oct 11, 2024 |
0.0.0 | Dec 1, 2023 |
#512 in Magic Beans
10,684 downloads per month
Used in 17 crates
(via solana-unified-scheduler-…)
470KB
9K
SLoC
The task (transaction) scheduling code for the unified scheduler
High-level API and design
The most important type is SchedulingStateMachine
. It takes new tasks (= transactions) and
may return back them if runnable via
::schedule_task()
while maintaining the account
readonly/writable lock rules. Those returned runnable tasks are guaranteed to be safe to
execute in parallel. Lastly, SchedulingStateMachine
should be notified about the completion
of the exeuction via ::deschedule_task()
, so that
conflicting tasks can be returned from
::schedule_next_unblocked_task()
as
newly-unblocked runnable ones.
The design principle of this crate (solana-unified-scheduler-logic
) is simplicity for the
separation of concern. It is interacted only with a few of its public API by
solana-unified-scheduler-pool
. This crate doesn't know about banks, slots, solana-runtime,
threads, crossbeam-channel at all. Becasue of this, it's deterministic, easy-to-unit-test, and
its perf footprint is well understood. It really focuses on its single job: sorting
transactions in executable order.
Algorithm
The algorithm can be said it's based on per-address FIFO queues, which are updated every time both new task is coming (= called scheduling) and runnable (= post-scheduling) task is finished (= called descheduling).
For the non-conflicting scheduling case, the story is very simple; it just remembers that all of accessed addresses are write-locked or read-locked with the number of active (= currently-scheduled-and-not-descheduled-yet) tasks. Correspondingly, descheduling does the opposite book-keeping process, regardless whether a finished task has been conflicted or not.
For the conflicting scheduling case, it remembers that each of non-conflicting addresses like the non-conflicting case above. As for conflicting addresses, each task is recorded to respective FIFO queues attached to the (conflicting) addresses. Importantly, the number of conflicting addresses of the conflicting task is also remembered.
The last missing piece is that the scheduler actually tries to reschedule previously blocked tasks while deschduling, in addition to the above-mentioned book-keeping processing. Namely, when given address is ready for new fresh locking resulted from descheduling a task (i.e. write lock is released or read lock count has reached zero), it pops out the first element of the FIFO blocked-task queue of the address. Then, it immediately marks the address as relocked. It also decrements the number of conflicting addresses of the popped-out task. As the final step, if the number reaches to the zero, it means the task has fully finished locking all of its addresses and is directly routed to be runnable. Lastly, if the next first element of the blocked-task queue is trying to read-lock the address like the popped-out one, this rescheduling is repeated as an optimization to increase parallelism of task execution.
Put differently, this algorithm tries to gradually lock all of addresses of tasks at different timings while not deviating the execution order from the original task ingestion order. This implies there's no locking retries in general, which is the primary source of non-linear perf. degration.
As a ballpark number from a synthesized micro benchmark on usual CPU for mainnet-beta
validators, it takes roughly 100ns to schedule and deschedule a transaction with 10 accounts.
And 1us for a transaction with 100 accounts. Note that this excludes crossbeam communication
overhead at all. That's said, it's not unrealistic to say the whole unified scheduler can
attain 100k-1m tps overall, assuming those transaction executions aren't bottlenecked.
Runtime performance characteristics and data structure arrangement
Its algorithm is very fast for high throughput, real-time for low latency. The whole
unified-scheduler architecture is designed from grounds up to support the fastest execution of
this scheduling code. For that end, unified scheduler pre-loads address-specific locking state
data structures (called UsageQueue
) for all of transaction's accounts, in order to offload
the job to other threads from the scheduler thread. This preloading is done inside
create_task()
. In this way, task scheduling
computational complexity is basically reduced to several word-sized loads and stores in the
schduler thread (i.e. constant; no allocations nor syscalls), while being proportional to the
number of addresses in a given transaction. Note that this statement is held true, regardless
of conflicts. This is because the preloading also pre-allocates some scratch-pad area
(blocked_usages_from_tasks
) to stash blocked
ones. So, a conflict only incurs some additional fixed number of mem stores, within error
margin of the constant complexity. And additional memory allocation for the scratchpad could
said to be amortized, if such an unusual event should occur.
[Arc
] is used to implement this preloading mechanism, because UsageQueue
s are shared across
tasks accessing the same account, and among threads due to the preloading. Also, interior
mutability is needed. However, SchedulingStateMachine
doesn't use conventional locks like
RwLock. Leveraging the fact it's the only state-mutating exclusive thread, it instead uses
UnsafeCell
, which is sugar-coated by a tailored wrapper called TokenCell
. TokenCell
imposes an overly restrictive aliasing rule via rust type system to maintain the memory safety.
By localizing any synchronization to the message passing, the scheduling code itself attains
maximally possible single-threaed execution without stalling cpu pipelines at all, only
constrained to mem access latency, while efficiently utilizing L1-L3 cpu cache with full of
UsageQueue
s.
Buffer bloat insignificance
The scheduler code itself doesn't care about the buffer bloat problem, which can occur in unified scheduler, where a run of heavily linearized and blocked tasks could be severely hampered by very large number of interleaved runnable tasks along side. The reason is again for separation of concerns. This is acceptable because the scheduling code itself isn't susceptible to the buffer bloat problem by itself as explained by the description and validated by the mentioned benchmark above. Thus, this should be solved elsewhere, specifically at the scheduler pool.
Dependencies
~13–21MB
~329K SLoC