3 releases (stable)
2.0.1 | Mar 10, 2021 |
---|---|
2.0.0 | Feb 25, 2021 |
#275 in Date and time
205KB
4K
SLoC
Byztime
Byztime is a Byzantine-fault-tolerant protocol for synchronizing time among a group of peers, without reliance on any external time authority. The time kept by Byztime is simply a counter that advances at a rate of something very close to one unit per second, such that all nodes are in close agreement as to its current value. Byztime timestamps have no well-defined epoch. If all nodes have correctly-set system clocks when first initialized, then Byztime will initially match POSIX time, but will eventually drift away from it since 1. there is no external source keeping it in sync, and 2. Byztime's timescale lacks leap seconds.
Byztime's algorithm is focused on keeping its worst-case error --- the absolute distance between any two correct nodes' estimate of the current time --- as small as possible. It achieves this somewhat at the expense of typical-case error, using only the single highest-quality time sample from each peer rather than combining many samples to smooth out network jitter. In the worst case, the difference between two correct nodes' clocks will asymptotically converge toward 4δ + 4ερ, where δ is the one-way network latency between the two farthest-spaced peers, ε is the (dimensionless) drift rate of correct nodes' hardware clocks, and ρ is the polling interval. If all nodes behave honestly, the bound improves to 2δ + 2ερ and will be reached after a single round of the protocol rather than converging asymptotically.
Byztimed runs completely independently of NTP, and a bad NTP time source will not disrupt Byztime. This comes with a minor caveat: just before the daemon shuts down it records the the current offset between Byztime time and system time, and uses this offset to re-initialize its estimate following a reboot. The only time this particularly matters is if many nodes reboot simultaneously and the network loses quorum. What happens in this case depends somewhat on NTP and what order things start up in at boot time. If Byztime starts before NTP starts and shuts down only after NTP shuts down, then the continuity of the Byztime timescale will be as good as the RTC and the CMOS battery of the restarting nodes, but no better. On the other hand if NTP is allowed to stabilize the system clock before Byztime starts up, then the continuity of the Byztime scales will be as good as its NTP sources — which is probably a lot better than your RTC, but could be arbitrarily bad if the NTP source is faulty. Again, this only becomes an issue if Byztime loses quorum, meaning ⅓ or more of the network reboots at once.
Byztime also currently relies on the system time for determining whether an X.509 certificate is expired. Once Roughtime matures a bit we may consider integrating a Roughtime client into byztimed for certificate validation purposes.
Build
Byztime is built like any standard Rust
crate. The easiest way to install byztimed is to get it from
crates.io via
cargo
:
cargo install byztimed
If you prefer to check out and build this repo, note that byztimed
includes
libbyztime
as a submodule, so be sure to clone with git clone --recurse-submodules
, or run git submodule update --init --recursive
if you have already cloned without the
--recurse-submodules
option.
Byztime is tested against Rust's stable channel, but compilers significantly older than the current stable will probably work. The most recent version known not to work is 1.38 (because we rely on async/await, which stabilized in 1.39).
Byztimed currently runs only on Linux, and is well-tested only on AMD64. Other CPU architectures should work; please file a bug ticket if you encounter any issues. We hope to eventually support more operating systems, but this will be an uphill battle because Byztime depends on timekeeping facilities that currently only Linux provides. The effort to improve portability will likely require contributing some new functionality to other OS kernels.
Usage
Run byztimed as byztimed <path-to-config-file>
. See
CONFIG.md
for configuration file syntax.
libbyztime
is the C client library for consuming time from byztimed
. See its
byztime.h
file for API documentation. The byztime
crate within
this repo provides idiomatic Rust bindings to libbyztime and its
documentation can be read on docs.rs.
Protocol overview
Although it is fundamentally a peer-to-peer protocol, Byztime uses a client-server communication pattern, with each node acting as both a client and a server to each other node. A client-only operation mode, wherein a node synchronizes itself to the consensus but does not vote in it, is also supported.
Byztime uses Network Time Security (NTS) for cryptographic link protection. Communication from each client to each server begins by the client initiating a TLS handshake and then using NTS-KE to negotiate shared keys and obtain NTS cookies. After NTS-KE is complete, the TLS connection closes and the remainder of the protocol runs over UDP. NTS provides message-level authenticated encryption. It provides replay protection for the client, but not for the server. The server never updates any state in response to a message from a client, so processing replays is harmless. For the remainder of this overview, we'll take the abstraction of authenticated encryption for granted and omit NTS-related machinery from our descriptions.
Each node is assumed to be equipped with a local clock which counts the time elapsed since some arbitrary epoch such as when the system last booted. One node's local clock has no a priori known relationship to another's. Rather, this relationship is discovered through the execution of the protocol. The shared time that nodes seek to synchronize to is called the global clock. Nodes maintain an estimate of their global offset, which is the difference between the global clock and their local clock. The local clock never receives any adjustments; only the global offset does.
The protocol proceeds by each node periodically sending a query to each of its peers, and the peer sending a response which includes a snapshot of its local clock and its current estimate of its global offset. Each query/response volley is called a measurement.
The protocol uses the following global parameters:
-
N
: the number of nodes participating in consensus. -
f
: the number of faulty nodes that can be tolerated.f = floor((N-1)/3)
. -
drift
: a dimensionless number giving an upper bound on how fast or slow a correct node's local clock might be. For example ifdrift
is 50e-6 then the clock might drift by up to 50µs per second.
Each node keeps the following state:
-
The
era
of its local clock. This is a randomly-generated identifier which changes if the local clock loses its state, e.g. after a reboot. -
global_offset
: The node's estimate of the offset between the global clock and its local clock:local_clock() + global_offset == estimate of global clock
. -
error
: The maximum absolute difference between the above estimate of the global clock and its true value. -
last_update
: The local clock time at whichglobal_offset
anderror
were last updated.
And then for each of its peers:
-
peer.era
: The peer's clock era as of the last time it communicated. -
peer.local_offset
: The node's estimate of the offset between its local clock and the peer's local clock:local_clock() + peer.local_offset == estimate of peer's local clock
. -
peer.global_offset
: The peer's estimate of the offset between its own local clock and the global clock, as of the last time it communicated. -
peer.rtt
: The round trip time of the current "best" measurement of the peer's clock. This measurement is the one on whichpeer.local_offset
is based. -
peer.origin_time
: The local clock time at which the query which led to the current best measurement was sent. -
peer.inflight_id
: The random unique identifier associated with a query, if any, that is currently awaiting a response. -
peer.inflight_origin_time
: The local clock time at which the current in-flight query (if any) was sent.
There is some additional state related to NTS — a cache of cookies and shared keys — which works basically the same way as it does for NTP and we'll disregard it for the purposes of this explanation.
At first run, nodes initialize global_offset
such that the global
clock matches their real-time clock. They periodically check the
offset between the two and persist this offset to disk. This
persisted value is used to recover a rough value with which to
reinitialize global_offset
after a reboot, but the error bounds
on offsets recovered in this manner are considered infinite.
Once per configured polling interval, clients send a query message to
each of their peers, containing just a randomly-generated unique
identifier. The sender updates peer.inflight_id
and
peer.inflight_origin_time
to reflect the content of the packet and
the time at which it was sent. If there was already another query in
flight, the old query is assumed to have been dropped by the network
and the new inflight_id
and inflight_origin_time
values
overwrite the old ones.
Servers respond immediately to any query they receive. The response contains:
-
response_id
: A copy of the query's unique identifier. -
response_local
: A snapshot of the server's local clock. -
response_era
: The server'sera
. -
response_global_offset
: The server'sglobal_offset
.
When the client receives the response, it processes it as follows:
-
Set
now
to a snapshot of the local clock at the moment the response was received. -
Verify that
peer.inflight_id
is non-null and matchesresponse_id
. If not, discard the response. Otherwise, setpeer.inflight_id
to null and continue. -
Set
peer.global_offset
toresponse_global_offset
. -
Compute
rtt
asnow - peer.inflight_origin_time
. -
If this is the first response seen so far from this peer, or if
peer.era
does not match the era contained in the response, skip to step 8. -
Compute the following lower-is-better quality metric for the current best measurement we have from this peer:
Q = peer.rtt/2 + 2 * drift * (now - peer.origin_time)
. This represents the worst-case error in estimating the offset between this node's local clock and the peer's local clock, taking into account network asymmetry and clock drift. Drift is multiplied by 2 because the two clocks could each be drifting in opposite directions. -
Compute this quality metric for the new measurement:
Q' = rtt/2 + 2 * drift * (now - peer.inflight_origin_time)
. IfQ' > Q
, then the old measurement is better than the new one, so keep it and return without further processing. -
Set
peer.rtt
tortt
,peer.origin_time
topeer.inflight_origin_time
, andpeer.era
toresponse_era
. -
Set
peer.local_offset
toresponse_clock + rtt/2 - now
.
Now with the newly updated clock values from the peer, recompute
global_offset
and error
:
-
For each peer
p
, compute an estimateest = p.local_offset + p.global_offset
and errorerr = p.rtt/2 + 2 * drift * (now - p.origin_time)
, giving an interval(est - err, est + err)
. Create lists of all resulting minima and maxima. -
If we ourselves are a participant in consensus, insert
global_offset
into the list of minima and the list of maxima. -
Sort both lists. Discard the
f
lowest minima and thef
highest maxima. Letmin'
equal the lowest remaining minimum andmax'
equal the highest remaining maximum. Letglobal_offset' = (max' + min')/2
. Leterror' = (max' - min')/2
. This averaging method — discarding thef
highest and lowest and taking the midpoint of the remaining range — is due to "A New Fault-Tolerant Algorithm for Clock Synchronization" (Welch and Lynch 1988) and is crucial for achieving Byzantine fault tolerance. -
Determine whether the new global offset and error are consistent with the old one. Let
age = now - last_update
. Letmin = global_offset - error
and letmax = global_offset + error
. Letdrift_limit = 2 * age * drift
. Now check thatmin' > min - drift_limit
andmax' < max + drift_limit
. If this check fails, return without further processing. (This step is not necessary for ensuring synchronization, but without it, a MitM adversary could cause time throughout the network to advance too quickly or too slowly, by delaying query messages but not response messages or vice versa.) -
Set
last_update
tonow
,global_offset
toglobal_offset'
, anderror
toerror'
.
This completes our description of the protocol. Applications
consuming time from Byztime query the current values of
global_offset
, error
, and last_update
. The global time is
local_time + global_offset
, with error bounds of
±(error + 2*drift*(local_time - last_update))
.
Estimates of global time are not frequency-stable: they jump
discontinuously with each update and can move backward. It's up to
the application how to deal with this. libbyztime
includes support
for clamping the results of successive calls to get_global_time()
to make them consistent with each other.
Caveats
Akamai has been using Byztime in business-critical applications since early 2020 and it has been very stable for us. However, until two specific issues are resolved, this software should be considered beta:
-
We are likely to make some backward-incompatible changes to Byztime's wire protocol. Byztime currently uses NTS-KE codepoints in the Experimental Use range; we plan to obtain and use permanent allocations from IANA. We also will likely change the format of Byztime's time packets, currently based on Protobufs, to a bespoke fixed-field format, in order to make parsing more predictable and make it easier to ensure to that request size matches response size. We plan to have a single flag-day release that makes all these changes at once, and then commit to backward-compatibility thereafter.
-
Some of Byztime's statistics-and-health reporting capabilities have have been removed for this open-source release because they depend on Akamai-internal infrastructure to function. We plan to redesign and reimplement this functionality around open standards.
Dependencies
~22–36MB
~677K SLoC