Write about the chain cover a little. (#13602)
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>pull/13610/head
parent
f7ddfe17a3
commit
a25a37002c
|
@ -0,0 +1 @@
|
|||
Improve the description of the ["chain cover index"](https://matrix-org.github.io/synapse/latest/auth_chain_difference_algorithm.html) used internally by Synapse.
|
|
@ -34,13 +34,45 @@ the process of indexing it).
|
|||
## Chain Cover Index
|
||||
|
||||
Synapse computes auth chain differences by pre-computing a "chain cover" index
|
||||
for the auth chain in a room, allowing efficient reachability queries like "is
|
||||
event A in the auth chain of event B". This is done by assigning every event a
|
||||
*chain ID* and *sequence number* (e.g. `(5,3)`), and having a map of *links*
|
||||
between chains (e.g. `(5,3) -> (2,4)`) such that A is reachable by B (i.e. `A`
|
||||
is in the auth chain of `B`) if and only if either:
|
||||
for the auth chain in a room, allowing us to efficiently make reachability queries
|
||||
like "is event `A` in the auth chain of event `B`?". We could do this with an index
|
||||
that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However, this
|
||||
would be prohibitively large, scaling poorly as the room accumulates more state
|
||||
events.
|
||||
|
||||
1. A and B have the same chain ID and `A`'s sequence number is less than `B`'s
|
||||
Instead, we break down the graph into *chains*. A chain is a subset of a DAG
|
||||
with the following property: for any pair of events `E` and `F` in the chain,
|
||||
the chain contains a path `E -> F` or a path `F -> E`. This forces a chain to be
|
||||
linear (without forks), e.g. `E -> F -> G -> ... -> H`. Each event in the chain
|
||||
is given a *sequence number* local to that chain. The oldest event `E` in the
|
||||
chain has sequence number 1. If `E` has a child `F` in the chain, then `F` has
|
||||
sequence number 2. If `E` has a grandchild `G` in the chain, then `G` has
|
||||
sequence number 3; and so on.
|
||||
|
||||
Synapse ensures that each persisted event belongs to exactly one chain, and
|
||||
tracks how the chains are connected to one another. This allows us to
|
||||
efficiently answer reachability queries. Doing so uses less storage than
|
||||
tracking reachability on an event-by-event basis, particularly when we have
|
||||
fewer and longer chains. See
|
||||
|
||||
> Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944).
|
||||
> *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598.
|
||||
|
||||
for the original idea or
|
||||
|
||||
> Y. Chen, Y. Chen, [An efficient algorithm for answering graph
|
||||
> reachability queries](https://doi.org/10.1109/ICDE.2008.4497498),
|
||||
> in: 2008 IEEE 24th International Conference on Data Engineering, April 2008,
|
||||
> pp. 893–902. (PDF available via [Google Scholar](https://scholar.google.com/scholar?q=Y.%20Chen,%20Y.%20Chen,%20An%20efficient%20algorithm%20for%20answering%20graph%20reachability%20queries,%20in:%202008%20IEEE%2024th%20International%20Conference%20on%20Data%20Engineering,%20April%202008,%20pp.%20893902.).)
|
||||
|
||||
for a more modern take.
|
||||
|
||||
In practical terms, the chain cover assigns every event a
|
||||
*chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links*
|
||||
between events in chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B`
|
||||
(i.e. `A` is in the auth chain of `B`) if and only if either:
|
||||
|
||||
1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s
|
||||
sequence number; or
|
||||
2. there is a link `L` between `B`'s chain ID and `A`'s chain ID such that
|
||||
`L.start_seq_no` <= `B.seq_no` and `A.seq_no` <= `L.end_seq_no`.
|
||||
|
@ -49,8 +81,9 @@ There are actually two potential implementations, one where we store links from
|
|||
each chain to every other reachable chain (the transitive closure of the links
|
||||
graph), and one where we remove redundant links (the transitive reduction of the
|
||||
links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1`
|
||||
would not be stored. Synapse uses the former implementations so that it doesn't
|
||||
need to recurse to test reachability between chains.
|
||||
would not be stored. Synapse uses the former implementation so that it doesn't
|
||||
need to recurse to test reachability between chains. This trades-off extra storage
|
||||
in order to save CPU cycles and DB queries.
|
||||
|
||||
### Example
|
||||
|
||||
|
|
Loading…
Reference in New Issue