Commit Graph

322 Commits (c915b918409a4be9ff27382161fd74694b872dfb)

Author SHA1 Message Date
David Robertson c3627d0f99
Correctly read to-device stream pos on SQLite (#16682) 2023-11-24 13:42:38 +00:00
Erik Johnston 6fec2d035f
Also discard 'caches' and 'backfill' stream POSITIONS (#16655)
Follow on from #16640
2023-11-17 14:14:29 +00:00
Erik Johnston fef08cbee8
Fix sending out of order `POSITION` over replication (#16639)
If a worker reconnects to Redis we send out the current positions of all our streams. However, if we're also trying to send out a backlog of RDATA at the same time then we can end up sending a `POSITION` with the current token *before* we've sent all the RDATA before the current token.

This doesn't cause actual bugs as the receiving servers see the POSITION, fetch the relevant rows from the DB, and then ignore the old RDATA as they come in. However, this is inefficient so it'd be better if we didn't  send out-of-order positions
2023-11-16 13:05:09 +00:00
Erik Johnston 898655fd12
More efficiently handle no-op POSITION (#16640)
We may receive `POSITION` commands where we already know that worker has
advanced past that position, so there is no point in handling it.
2023-11-16 12:32:17 +00:00
Erik Johnston 408c13801a
Add fast path for replication events stream fetch (#16580)
We can bail early if the from token is greater than or equal to the
current token.
2023-10-30 14:47:57 +00:00
Erik Johnston 5413cefe32
Reduce amount of caches POSITIONS we send (#16561)
Follow on from / actually correctly does #16557
2023-10-27 16:07:11 +01:00
Erik Johnston 89dbbd68e1
Reduce spurious replication catchup (#16555) 2023-10-27 13:27:20 +00:00
Erik Johnston 0680d76659
Reduce replication traffic due to reflected cache stream POSITION (#16557) 2023-10-27 12:51:08 +01:00
Erik Johnston ba47fea528
Allow multiple workers to write to receipts stream. (#16432)
Fixes #16417
2023-10-25 16:16:19 +01:00
Jason Little ffbe9b7666
Remove duplicate call to wake a remote destination when using federation sending worker (#16515) 2023-10-24 08:09:59 -04:00
Erik Johnston 8f35f8148e
Fix bug where a new writer advances their token too quickly (#16473)
* Fix bug where a new writer advances their token too quickly

When starting a new writer (for e.g. persisting events), the
`MultiWriterIdGenerator` doesn't have a minimum token for it as there
are no rows matching that new writer in the DB.

This results in the the first stream ID it acquired being announced as
persisted *before* it actually finishes persisting, if another writer
gets and persists a subsequent stream ID. This is due to the logic of
setting the minimum persisted position to the minimum known position of
across all writers, and the new writer starts off not being considered.

* Fix sending out POSITIONs when our token advances without update

Broke in #14820

* For replication HTTP requests, only wait for minimal position
2023-10-23 16:57:30 +01:00
Patrick Cloke 49c9745b45
Avoid sending massive replication updates when purging a room. (#16510) 2023-10-18 12:26:01 -04:00
Patrick Cloke ae5b997cfa
Fix comments related to replication. (#16428) 2023-10-06 07:25:44 -04:00
Patrick Cloke 4e302b30b6
Add __slots__ to replication commands. (#16429)
To slightly reduce the amount of memory each command takes.
2023-10-05 07:38:55 -04:00
Erik Johnston 80ec81dcc5
Some refactors around receipts stream (#16426) 2023-10-04 16:28:40 +01:00
Erik Johnston 20fb08ec80
Downgrade repl stream time out error to warning (#16401)
This is because if a worker reaches ~100% CPU then everything starts
lagging and we hit the log line a lot. When at error we invoke sentry
and that has a lot of overhead, which then puts even more pressure on
the worker.
2023-09-29 11:52:48 +00:00
Patrick Cloke f84da3c32e
Add a cache around server ACL checking (#16360)
* Pre-compiles the server ACLs onto an object per room and
  invalidates them when new events come in.
* Converts the server ACL checking into Rust.
2023-09-26 11:57:50 -04:00
Erik Johnston 329597022e
Some minor performance fixes for task schedular (#16313) 2023-09-14 16:20:47 +01:00
Erik Johnston ab13fb08bf
Improve logging of replication (#16309) 2023-09-13 09:51:50 +00:00
Patrick Cloke 55c20da4a3 Merge remote-tracking branch 'origin/release-v1.91' into release-v1.92 2023-09-06 11:25:28 -04:00
Quentin Gliech 1940d990a3
Revert MSC3861 introspection cache, admin impersonation and account lock (#16258) 2023-09-06 15:19:51 +01:00
Erik Johnston d35bed8369
Don't wake up destination transaction queue if they're not due for retry. (#16223) 2023-09-04 17:14:09 +01:00
Patrick Cloke e9235d92f2
Track currently syncing users by device for presence (#16172)
Refactoring to use both the user ID & the device ID when tracking
the currently syncing users in the presence handler.

This is done both locally and over replication. Note that the device
ID is discarded but will be used in a future change.
2023-08-29 11:44:07 -04:00
Mathieu Velten 501da8ecd8
Task scheduler: add replication notify for new task to launch ASAP (#16184) 2023-08-28 14:03:51 +00:00
Erik Johnston 803f63df1c
Fix perf of `wait_for_stream_positions` (#16148) 2023-08-22 15:11:22 +00:00
Shay 69048f7b48
Add an admin endpoint to allow authorizing server to signal token revocations (#16125) 2023-08-22 14:15:34 +00:00
Patrick Cloke ad3f43be9a
Run pyupgrade for python 3.7 & 3.8. (#16110) 2023-08-15 08:11:20 -04:00
Erik Johnston ae55cc1e6b
Add ability to wait for locks and add locks to purge history / room deletion (#15791)
c.f. #13476
2023-07-31 10:58:03 +01:00
Jason Little c835befd10
Add Unix socket support for Redis connections (#15644)
Adds a new configuration setting to connect to Redis via a Unix
socket instead of over TCP. Disabled by default.
2023-05-26 15:28:39 -04:00
Patrick Cloke 375b0a8a11
Update code to refer to "workers". (#15606)
A bunch of comments and variables are out of date and use
obsolete terms.
2023-05-16 15:56:38 -04:00
Roel ter Maat 2611433b70
Add redis SSL configuration options (#15312)
* Add SSL options to redis config

* fix lint issues

* Add documentation and changelog file

* add missing . at the end of the changelog

* Move client context factory to new file

* Rename ssl to tls and fix typo

* fix lint issues

* Added when redis attributes were added
2023-05-11 13:02:51 +01:00
Mathieu Velten 9228ae633f
Add some clarification to the doc/comments regarding TCP replication (#15354) 2023-03-30 12:51:35 +02:00
Patrick Cloke afb216c202
Remove no-op send_command for Redis replication. (#15274)
With Redis commands do not need to be re-issued by the main
process (they fan-out to all processes at once) and thus it is no
longer necessary to worry about them reflecting recursively forever.
2023-03-16 11:13:30 -04:00
Patrick Cloke 3bf973edc7
Remove unused class: DirectTcpReplicationClientFactory. (#15272) 2023-03-15 15:42:20 -04:00
H. Shay b2fd03d075 Merge branch 'master' into develop 2023-02-28 10:14:20 -08:00
Erik Johnston b2357a898c
Fix bug where 5s delays would occasionally happen. (#15150)
This only affects deployments using workers.
2023-02-24 14:39:50 +00:00
dependabot[bot] 9bb2eac719
Bump black from 22.12.0 to 23.1.0 (#15103) 2023-02-22 15:29:09 -05:00
reivilibre addd12f16d
Tweak logging for when a worker waits for its view of a replication stream to catch up. (#15120)Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Improve logging messages for the 'wait for repl stream' read-after-write consistency feature

* Newsfile

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>

* Update synapse/replication/tcp/client.py

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>

---------

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2023-02-21 12:26:00 +00:00
David Robertson 80d44060c9
Faster joins: omit partial rooms from eager syncs until the resync completes (#14870)
* Allow `AbstractSet` in `StrCollection`

Or else frozensets are excluded. This will be useful in an upcoming
commit where I plan to change a function that accepts `List[str]` to
accept `StrCollection` instead.

* `rooms_to_exclude` -> `rooms_to_exclude_globally`

I am about to make use of this exclusion mechanism to exclude rooms for
a specific user and a specific sync. This rename helps to clarify the
distinction between the global config and the rooms to exclude for a
specific sync.

* Better function names for internal sync methods

* Track a list of excluded rooms on SyncResultBuilder

I plan to feed a list of partially stated rooms for this sync to ignore

* Exclude partial state rooms during eager sync

using the mechanism established in the previous commit

* Track un-partial-state stream in sync tokens

So that we can work out which rooms have become fully-stated during a
given sync period.

* Fix mutation of `@cached` return value

This was fouling up a complement test added alongside this PR.
Excluding a room would mean the set of forgotten rooms in the cache
would be extended. This means that room could be erroneously considered
forgotten in the future.

Introduced in #12310, Synapse 1.57.0. I don't think this had any
user-visible side effects (until now).

* SyncResultBuilder: track rooms to force as newly joined

Similar plan as before. We've omitted rooms from certain sync responses;
now we establish the mechanism to reintroduce them into future syncs.

* Read new field, to present rooms as newly joined

* Force un-partial-stated rooms to be newly-joined

for eager incremental syncs only, provided they're still fully stated

* Notify user stream listeners to wake up long polling syncs

* Changelog

* Typo fix

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>

* Unnecessary list cast

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>

* Rephrase comment

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>

* Another comment

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>

* Fixup merge(?)

* Poke notifier when receiving un-partial-stated msg over replication

* Fixup merge whoops

Thanks MV :)

Co-authored-by: Mathieu Velen <mathieuv@matrix.org>

Co-authored-by: Mathieu Velten <mathieuv@matrix.org>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2023-01-23 15:44:39 +00:00
Sean Quah 2ec9c58496
Faster joins: Update room stats and the user directory on workers when finishing join (#14874)
* Faster joins: Update room stats and user directory on workers when done

When finishing a partial state join to a room, we update the current
state of the room without persisting additional events. Workers receive
notice of the current state update over replication, but neglect to wake
the room stats and user directory updaters, which then get incidentally
triggered the next time an event is persisted or an unrelated event
persister sends out a stream position update.

We wake the room stats and user directory updaters at the appropriate
time in this commit.

Part of #12814 and #12815.

Signed-off-by: Sean Quah <seanq@matrix.org>

* fixup comment

Signed-off-by: Sean Quah <seanq@matrix.org>
2023-01-23 10:31:36 +00:00
reivilibre 22cc93afe3
Enable Faster Remote Room Joins against worker-mode Synapse. (#14752)
* Enable Complement tests for Faster Remote Room Joins on worker-mode

* (dangerous) Add an override to allow Complement to use FRRJ under workers

* Newsfile

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>

* Fix race where we didn't send out replication notification

* MORE HACKS

* Fix get_un_partial_stated_rooms_token to take instance_name

* Fix bad merge

* Remove warning

* Correctly advance un_partial_stated_room_stream

* Fix merge

* Add another notify_replication

* Fixups

* Create a separate ReplicationNotifier

* Fix test

* Fix portdb

* Create a separate ReplicationNotifier

* Fix test

* Fix portdb

* Fix presence test

* Newsfile

* Apply suggestions from code review

* Update changelog.d/14752.misc

Co-authored-by: Erik Johnston <erik@matrix.org>

* lint

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
Co-authored-by: Erik Johnston <erik@matrix.org>
2023-01-22 21:10:11 +00:00
Erik Johnston 0ec12a3753
Reduce max time we wait for stream positions (#14881)
Now that we wait for stream positions whenever we do a HTTP replication
hit, we need to be less brutal in the case where we do timeout (as we
have bugs around this).
2023-01-20 21:04:33 +00:00
Erik Johnston cdf2707678
Fix bug in wait for stream position (#14872)
This caused some requests to fail.

This caused some requests to fail.

This really only started causing issues due to #14856
2023-01-19 22:19:56 +00:00
Erik Johnston 9187fd940e
Wait for streams to catch up when processing HTTP replication. (#14820)
This should hopefully mitigate a class of races where data gets out of
sync due a HTTP replication request racing with the replication streams.
2023-01-18 19:35:29 +00:00
Erik Johnston 316590d1ea
Fix bug in `wait_for_stream_position` (#14856)
We were incorrectly checking if the *local* token had been advanced, rather than the token for the remote instance.

In practice, I don't think this has caused any bugs due to where we use `wait_for_stream_position`, as critically we don't use it on instances that also write to the given streams (and so the local token will lag behind all remote tokens).
2023-01-17 09:58:22 +00:00
Erik Johnston 2b084c5b71
Merge device list replication streams (#14833) 2023-01-17 09:29:58 +00:00
Erik Johnston 73ff493dfb
Merge account data streams (#14826) 2023-01-13 14:57:43 +00:00
Nick Mills-Barrett db1cfe9c80
Update all stream IDs after processing replication rows (#14723)
This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	https://github.com/matrix-org/synapse/issues/14158#issuecomment-1344048703
2023-01-04 11:49:26 +00:00
reivilibre 2888d7ec83
Faster remote room joins: invalidate caches and unblock requests when receiving un-partial-stated event notifications over replication. [rei:frrj/streams/unpsr] (#14546) 2022-12-19 14:57:51 +00:00
reivilibre fb60cb16fe
Faster remote room joins: stream the un-partial-stating of events over replication. [rei:frrj/streams/unpsr] (#14545) 2022-12-14 14:47:11 +00:00