MatrixSynapse

Commit Graph

Author	SHA1	Message	Date
David Robertson	4bb26c95a9	Refactor `filter_events_for_server` (#15240 ) * Tweak docstring and type hint * Flip logic and provide better name * Separate decision from action * Track a set of strings, not EventBases * Require explicit boolean options from callers * Add explicit option for partial state rooms * Changelog * Rename param	2023-03-10 15:31:25 +00:00
Sean Quah	4f4f27e57f	Mitigate a race where /make_join could 403 for restricted rooms (#15080 ) Previously, when creating a join event in /make_join, we would decide whether to include additional fields to satisfy restricted room checks based on the current state of the room. Then, when building the event, we would capture the forward extremities of the room to use as prev events. This is subject to race conditions. For example, when leaving and rejoining a room, the following sequence of events leads to a misleading 403 response: 1. /make_join reads the current state of the room and sees that the user is still in the room. It decides to omit the field required for restricted room joins. 2. The leave event is persisted and the room's forward extremities are updated. 3. /make_join builds the event, using the post-leave forward extremities. The event then fails the restricted room checks. To mitigate the race, we move the read of the forward extremities closer to the read of the current state. Ideally, we would compute the state based off the chosen prev events, but that can involve state resolution, which is expensive. Signed-off-by: Sean Quah <seanq@matrix.org>	2023-02-17 09:40:32 +00:00
David Robertson	06ba71083e	Fix order of partial state tables when purging (#15068 ) * Fix order of partial state tables when purging `partial_state_rooms` has an FK on `events` pointing to the join event we get from `/send_join`, so we must delete from that table before deleting from `events`. NB: It would be nice to cancel any resync processes for the room being purged. We do not do this at present. To do so reliably we'd need an internal HTTP "replication" endpoint, because the worker doing the resync process may be different to that handling the purge request. The first time the resync process tries to write data after the deletion it will fail because we have deleted necessary data e.g. auth events. AFAICS it will not retry the resync, so the only downside to not cancelling the resync is a scary-looking traceback. (This is presumably extremely race-sensitive.) * Changelog * admist(?) -> between * Warn about a race * Fix typo, thanks Sean Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> --------- Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>	2023-02-14 23:42:29 +00:00
Mathieu Velten	6cddf24e36	Faster joins: don't stall when a user joins during a fast join (#14606 ) Fixes #12801. Complement tests are at https://github.com/matrix-org/complement/pull/567. Avoid blocking on full state when handling a subsequent join into a partial state room. Also always perform a remote join into partial state rooms, since we do not know whether the joining user has been banned and want to avoid leaking history to banned users. Signed-off-by: Mathieu Velten <mathieuv@matrix.org> Co-authored-by: Sean Quah <seanq@matrix.org> Co-authored-by: David Robertson <davidr@element.io>	2023-02-10 23:31:05 +00:00
Shay	03bccd542b	Add a class UnpersistedEventContext to allow for the batching up of storing state groups (#14675 ) * add class UnpersistedEventContext * modify create new client event to create unpersistedeventcontexts * persist event contexts after creation * fix tests to persist unpersisted event contexts * cleanup * misc lints + cleanup * changelog + fix comments * lints * fix batch insertion? * reduce redundant calculation * add unpersisted event classes * rework compute_event_context, split into function that returns unpersisted event context and then persists it * use calculate_context_info to create unpersisted event contexts * update typing * $%#^&* * black * fix comments and consolidate classes, use attr.s for class * requested changes * lint * requested changes * requested changes * refactor to be stupidly explicit * clearer renaming and flow * make partial state non-optional * update docstrings --------- Co-authored-by: Erik Johnston <erik@matrix.org>	2023-02-09 13:05:02 -08:00
Sean Quah	0a686d1d13	Faster joins: Refactor handling of servers in room (#14954 ) Ensure that the list of servers in a partial state room always contains the server we joined off. Also refactor `get_partial_state_servers_at_join` to return `None` when the given room is no longer partial stated, to explicitly indicate when the room has partial state. Otherwise it's not clear whether an empty list means that the room has full state, or the room is partial stated, but the server we joined off told us that there are no servers in the room. Signed-off-by: Sean Quah <seanq@matrix.org>	2023-02-03 15:39:59 +00:00
Patrick Cloke	ba79fb4a61	Use StrCollection in place of Collection[str] in (most) handlers code. (#14922 ) Due to the increased safety of StrCollection over Collection[str] and Sequence[str].	2023-01-26 12:31:58 -05:00
David Robertson	80d44060c9	Faster joins: omit partial rooms from eager syncs until the resync completes (#14870 ) * Allow `AbstractSet` in `StrCollection` Or else frozensets are excluded. This will be useful in an upcoming commit where I plan to change a function that accepts `List[str]` to accept `StrCollection` instead. * `rooms_to_exclude` -> `rooms_to_exclude_globally` I am about to make use of this exclusion mechanism to exclude rooms for a specific user and a specific sync. This rename helps to clarify the distinction between the global config and the rooms to exclude for a specific sync. * Better function names for internal sync methods * Track a list of excluded rooms on SyncResultBuilder I plan to feed a list of partially stated rooms for this sync to ignore * Exclude partial state rooms during eager sync using the mechanism established in the previous commit * Track un-partial-state stream in sync tokens So that we can work out which rooms have become fully-stated during a given sync period. * Fix mutation of `@cached` return value This was fouling up a complement test added alongside this PR. Excluding a room would mean the set of forgotten rooms in the cache would be extended. This means that room could be erroneously considered forgotten in the future. Introduced in #12310, Synapse 1.57.0. I don't think this had any user-visible side effects (until now). * SyncResultBuilder: track rooms to force as newly joined Similar plan as before. We've omitted rooms from certain sync responses; now we establish the mechanism to reintroduce them into future syncs. 
* Read new field, to present rooms as newly joined * Force un-partial-stated rooms to be newly-joined for eager incremental syncs only, provided they're still fully stated * Notify user stream listeners to wake up long polling syncs * Changelog * Typo fix Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Unnecessary list cast Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Rephrase comment Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Another comment Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Fixup merge(?) * Poke notifier when receiving un-partial-stated msg over replication * Fixup merge whoops Thanks MV :) Co-authored-by: Mathieu Velen <mathieuv@matrix.org> Co-authored-by: Mathieu Velten <mathieuv@matrix.org> Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>	2023-01-23 15:44:39 +00:00
Sean Quah	2ec9c58496	Faster joins: Update room stats and the user directory on workers when finishing join (#14874 ) * Faster joins: Update room stats and user directory on workers when done When finishing a partial state join to a room, we update the current state of the room without persisting additional events. Workers receive notice of the current state update over replication, but neglect to wake the room stats and user directory updaters, which then get incidentally triggered the next time an event is persisted or an unrelated event persister sends out a stream position update. We wake the room stats and user directory updaters at the appropriate time in this commit. Part of #12814 and #12815. Signed-off-by: Sean Quah <seanq@matrix.org> * fixup comment Signed-off-by: Sean Quah <seanq@matrix.org>	2023-01-23 10:31:36 +00:00
reivilibre	22cc93afe3	Enable Faster Remote Room Joins against worker-mode Synapse. (#14752 ) * Enable Complement tests for Faster Remote Room Joins on worker-mode * (dangerous) Add an override to allow Complement to use FRRJ under workers * Newsfile Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> * Fix race where we didn't send out replication notification * MORE HACKS * Fix get_un_partial_stated_rooms_token to take instance_name * Fix bad merge * Remove warning * Correctly advance un_partial_stated_room_stream * Fix merge * Add another notify_replication * Fixups * Create a separate ReplicationNotifier * Fix test * Fix portdb * Create a separate ReplicationNotifier * Fix test * Fix portdb * Fix presence test * Newsfile * Apply suggestions from code review * Update changelog.d/14752.misc Co-authored-by: Erik Johnston <erik@matrix.org> * lint Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> Co-authored-by: Erik Johnston <erik@matrix.org>	2023-01-22 21:10:11 +00:00
Sean Quah	d329a566df	Faster joins: Fix incompatibility with restricted joins (#14882 ) * Avoid clearing out forward extremities when doing a second remote join When joining a restricted room where the local homeserver does not have a user able to issue invites, we perform a second remote join. We want to avoid clearing out forward extremities in this case because the forward extremities we have are up to date and clearing out forward extremities creates a window in which the room can get bricked if Synapse crashes. Signed-off-by: Sean Quah <seanq@matrix.org> * Do a full join when doing a second remote join into a full state room We cannot persist a partial state join event into a joined full state room, so we perform a full state join for such rooms instead. As a future optimization, we could always perform a partial state join and compute or retrieve the full state ourselves if necessary. Signed-off-by: Sean Quah <seanq@matrix.org> * Add lock around partial state flag for rooms Signed-off-by: Sean Quah <seanq@matrix.org> * Preserve partial state info when doing a second partial state join Signed-off-by: Sean Quah <seanq@matrix.org> * Add newsfile * Add a TODO(faster_joins) marker Signed-off-by: Sean Quah <seanq@matrix.org>	2023-01-22 19:19:31 +00:00
Sean Quah	cdea7c11d0	Faster joins: Avoid starting duplicate partial state syncs (#14844 ) Currently, we will try to start a new partial state sync every time we perform a remote join, which is undesirable if there is already one running for a given room. We intend to perform remote joins whenever additional local users wish to join a partial state room, so let's ensure that we do not start more than one concurrent partial state sync for any given room. ------------------------------------------------------------------------ There is a race condition where the homeserver leaves a room and later rejoins while the partial state sync from the previous membership is still running. There is no guarantee that the previous partial state sync will process the latest join, so we restart it if needed. Signed-off-by: Sean Quah <seanq@matrix.org>	2023-01-20 12:06:19 +00:00
Mathieu Velten	54c012c5a8	Make `handle_new_client_event` throw `PartialStateConflictError` (#14665 ) Then adapt calling code to retry when needed so it doesn't 500 to clients. Signed-off-by: Mathieu Velten <mathieuv@matrix.org> Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>	2022-12-15 16:04:23 +00:00
David Robertson	b5b5f66084	Move `StateFilter` to `synapse.types` (#14668 ) * Move `StateFilter` to `synapse.types` * Changelog	2022-12-12 16:19:30 +00:00
reivilibre	501f62d1a6	Faster remote room joins: stream the un-partial-stating of rooms over replication. [rei:frrj/streams/unpsr] (#14473 )	2022-12-05 13:07:55 +00:00
Mathieu Velten	1526ff389f	Faster joins: filter out non local events when a room doesn't have its full state (#14404 ) Signed-off-by: Mathieu Velten <mathieuv@matrix.org>	2022-11-21 16:46:14 +01:00
Patrick Cloke	d8cc86eff4	Remove redundant types from comments. (#14412 ) Remove type hints from comments which have been added as Python type hints. This helps avoid drift between comments and reality, as well as removing redundant information. Also adds some missing type hints which were simple to fill in.	2022-11-16 15:25:24 +00:00
Eric Eastwood	40fa8294e3	Refactor MSC3030 `/timestamp_to_event` to move away from our snowflake pull from `destination` pattern (#14096 ) 1. `federation_client.timestamp_to_event(...)` now handles all `destination` looping and uses our generic `_try_destination_list(...)` helper. 2. Consistently handling `NotRetryingDestination` and `FederationDeniedError` across `get_pdu` , backfill, and the generic `_try_destination_list` which is used for many places we use this pattern. 3. `get_pdu(...)` now returns `PulledPduInfo` so we know which `destination` we ended up pulling the PDU from	2022-10-26 16:10:55 -05:00
Shay	b7a7ff6ee3	Add initial power level event to batch of bulk persisted events when creating a new room. (#14228 )	2022-10-21 10:46:22 -07:00
Shay	847e2393f3	Preparatory work for adding power level event to batched events (#14214 )	2022-10-18 09:58:47 -07:00
David Robertson	c3a4780080	When restarting a partial join resync, prioritise the server which actioned a partial join (#14126 )	2022-10-18 12:33:18 +01:00
Eric Eastwood	40bb37eb27	Stop getting missing `prev_events` after we already know their signature is invalid (#13816 ) While https://github.com/matrix-org/synapse/pull/13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place. Related to - https://github.com/matrix-org/synapse/issues/13622 - https://github.com/matrix-org/synapse/pull/13635 - https://github.com/matrix-org/synapse/issues/13676 Part of https://github.com/matrix-org/synapse/issues/13356 Follow-up to https://github.com/matrix-org/synapse/pull/13815 which tracks event signature failures. With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now. For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request takes. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761 To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signatures for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid. 
- `backfill` - `outgoing-federation-request` `/backfill` - `_check_sigs_and_hash_and_fetch` - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - `_process_pulled_events` - `_process_pulled_event` for each validated event - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it - `_get_state_ids_after_missing_prev_event` - `outgoing-federation-request` `/state_ids` - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check	2022-10-15 00:36:49 -05:00
Andrew Morgan	9c23442ac9	Correct field name for stripped state events when knocking. `knock_state_events` -> `knock_room_state` (#14102 )	2022-10-12 14:37:20 +01:00
Erik Johnston	3dfc4a08dc	Fix performance regression in `get_users_in_room` (#13972 ) Fixes #13942. Introduced in #13575. Basically, let's only get the ordered set of hosts out of the DB if we need an ordered set of hosts. Since we split the function up the caching won't be as good, but I think it will still be fine as e.g. multiple backfill requests for the same room will hit the cache.	2022-09-30 13:15:32 +01:00
reivilibre	73ecff7e9e	Improve backfill robustness by trying more servers. (#13890 ) Co-authored-by: Eric Eastwood <erice@element.io>	2022-09-29 10:00:02 +00:00
Eric Eastwood	df8b91ed2b	Limit and filter the number of backfill points to get from the database (#13879 ) There is no need to grab thousands of backfill points when we only need 5 to make the `/backfill` request with. We need to grab a few extra in case the first few aren't visible in the history. Previously, we grabbed thousands of backfill points from the database, then sorted and filtered them in the app. Fetching the 4.6k backfill points for `#matrix:matrix.org` from the database takes ~50ms - ~570ms so it's not like this saves a lot of time 🤷. But it might save us more time now that `get_backfill_points_in_room`/`get_insertion_event_backward_extremities_in_room` are more complicated after https://github.com/matrix-org/synapse/pull/13635 This PR moves the filtering and limiting to the SQL query so we just have less data to work with in the first place. Part of https://github.com/matrix-org/synapse/issues/13356	2022-09-28 15:26:16 -05:00
Erik Johnston	4b17a5ace8	Handle remote device list updates during partial join (#13913 ) c.f. #12993 (comment), point 3 This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state. Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.	2022-09-28 13:42:43 +00:00
Kateřina Churanová	6caa303083	fix: Push notifications for invite over federation (#13719 )	2022-09-28 12:31:53 +00:00
David Robertson	f5aaa55e27	Add new columns tracking when we partial-joined (#13892 )	2022-09-27 17:26:35 +01:00
Eric Eastwood	ac1a31740b	Only try to backfill event if we haven't tried before recently (#13635 ) Only try to backfill event if we haven't tried before recently (exponential backoff). No need to keep trying the same backfill point that fails over and over. Fix https://github.com/matrix-org/synapse/issues/13622 Fix https://github.com/matrix-org/synapse/issues/8451 Follow-up to https://github.com/matrix-org/synapse/pull/13589 Part of https://github.com/matrix-org/synapse/issues/13356	2022-09-23 14:01:29 -05:00
reivilibre	c06b2b7142	Faster Remote Room Joins: tell remote homeservers that we are unable to authorise them if they query a room which has partial state on our server. (#13823 )	2022-09-23 11:47:16 +01:00
Eric Eastwood	51d732db3b	Optimize how we calculate `likely_domains` during backfill (#13575 ) Optimize how we calculate `likely_domains` during backfill because I've seen this take 17s in production just to `get_current_state` which is used to `get_domains_from_state` (see case [2. Loading tons of events in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)). There are 3 ways we currently calculate hosts that are in the room: 1. `get_current_state` -> `get_domains_from_state` - Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill` - This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳 1. `get_current_hosts_in_room` - Used for other federation things like sending read receipts and typing indicators 1. `get_hosts_in_room_at_events` - Used when pushing out events over federation to other servers in the `_process_event_queue_loop` Fix https://github.com/matrix-org/synapse/issues/13626 Part of https://github.com/matrix-org/synapse/issues/13356 Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh) ### Query performance #### Before The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s): ``` synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org'; Time: 16035.612 ms (00:16.036) synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org'; Time: 4243.237 ms (00:04.243) ``` But what about `get_current_hosts_in_room`: When there is 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added). 
But takes 930ms after a Postgres restart or 390ms if running back to back to back. ```sh $ psql synapse synapse=# \timing on synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$')) FROM current_state_events WHERE type = 'm.room.member' AND membership = 'join' AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org'; count ------- 4130 (1 row) Time: 13181.598 ms (00:13.182) synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org'; count ------- 80814 synapse=# SELECT COUNT(*) from current_state_events; count --------- 8162847 synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') ); pg_size_pretty ---------------- 4702 MB ``` #### After I'm not sure how long it takes from complete freshness as I only really get that opportunity once (maybe restarting computer but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is so that Postgres caches aren't as exact. Update: The longest I've seen this run for is 6.4s and 4.5s after a computer restart. After a Postgres restart, it takes 330ms and running back to back takes 260ms. ```sh $ psql synapse synapse=# \timing on Timing is on. synapse=# SELECT substring(c.state_key FROM '@[^:]*:(.*)$') as host FROM current_state_events c /* Get the depth of the event from the events table */ INNER JOIN events AS e USING (event_id) WHERE c.type = 'm.room.member' AND c.membership = 'join' AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org' GROUP BY host ORDER BY min(e.depth) ASC; Time: 333.800 ms ``` #### Going further To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we a) probably get an answer or b) we give up. 
Another thing we can do is optimize the query to use an index skip scan: - https://wiki.postgresql.org/wiki/Loose_indexscan - Index Skip Scan, https://commitfest.postgresql.org/37/1741/ - https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/	2022-08-30 01:38:14 -05:00
Eric Eastwood	9385c41ba4	Fix Prometheus metrics being negative (mixed up start/end) (#13584 ) Fix: - https://github.com/matrix-org/synapse/pull/13535#discussion_r949582508 - https://github.com/matrix-org/synapse/pull/13533#discussion_r949577244	2022-08-23 08:47:30 +01:00
Eric Eastwood	088bcb7ecb	Time how long it takes us to do backfill processing (#13535 )	2022-08-17 10:33:19 +01:00
Eric Eastwood	0a4efbc1dd	Instrument the federation/backfill part of `/messages` (#13489 ) Instrument the federation/backfill part of `/messages` so it's easier to follow what's going on in Jaeger when viewing a trace. Split out from https://github.com/matrix-org/synapse/pull/13440 Follow-up from https://github.com/matrix-org/synapse/pull/13368 Part of https://github.com/matrix-org/synapse/issues/13356	2022-08-16 12:39:40 -05:00
Eric Eastwood	344a2f767c	Instrument `FederationStateIdsServlet` - `/state_ids` (#13499 ) Instrument FederationStateIdsServlet - `/state_ids` so it's easier to follow what's going on in Jaeger when viewing a trace.	2022-08-15 19:41:23 +01:00
reivilibre	e9e6aacfbe	Faster Room Joins: prevent Synapse from answering federated join requests for a room which it has not fully joined yet. (#13416 )	2022-08-04 16:27:04 +01:00
Eric Eastwood	92d21faf12	Instrument `/messages` for understandable traces in Jaeger (#13368 ) In Jaeger: - Before: huge list of uncategorized database calls - After: nice and collapsible into units of work	2022-08-03 10:57:38 -05:00
Sean Quah	8d317f6da5	Fix error when out of servers to sync partial state with (#13432 ) so that we raise the intended error instead. Signed-off-by: Sean Quah <seanq@matrix.org>	2022-08-02 12:12:44 +01:00
reivilibre	e17e5c97e0	Faster Room Joins: don't leave a stuck room partial state flag if the join fails. (#13403 )	2022-08-01 16:45:39 +00:00
David Teller	11f811470f	Uniformize spam-checker API, part 5: expand other spam-checker callbacks to return `Tuple[Codes, dict]` (#13044 ) Signed-off-by: David Teller <davidt@element.io> Co-authored-by: Brendan Abolivier <babolivier@matrix.org>	2022-07-11 16:52:10 +00:00
Sean Quah	1391a76cd2	Faster room joins: fix race in recalculation of current room state (#13151 ) Bounce recalculation of current state to the correct event persister and move recalculation of current state into the event persistence queue, to avoid concurrent updates to a room's current state. Also give recalculation of a room's current state a real stream ordering. Signed-off-by: Sean Quah <seanq@matrix.org>	2022-07-07 12:19:31 +00:00
Sean Quah	68db233f0c	Handle race between persisting an event and un-partial stating a room (#13100 ) Whenever we want to persist an event, we first compute an event context, which includes the state at the event and a flag indicating whether the state is partial. After a lot of processing, we finally try to store the event in the database, which can fail for partial state events when the containing room has been un-partial stated in the meantime. We detect the race as a foreign key constraint failure in the data store layer and turn it into a special `PartialStateConflictError` exception, which makes its way up to the method in which we computed the event context. To make things difficult, the exception needs to cross a replication request: `/fed_send_events` for events coming over federation and `/send_event` for events from clients. We transport the `PartialStateConflictError` as a `409 Conflict` over replication and turn `409`s back into `PartialStateConflictError`s on the worker making the request. All client events go through `EventCreationHandler.handle_new_client_event`, which is called in a lot of places. Instead of trying to update all the code which creates client events, we turn the `PartialStateConflictError` into a `429 Too Many Requests` in `EventCreationHandler.handle_new_client_event` and hope that clients take it as a hint to retry their request. On the federation event side, there are 7 places which compute event contexts. 4 of them use outlier event contexts: `FederationEventHandler._auth_and_persist_outliers_inner`, `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and `FederationHandler.do_remotely_reject_invite`. These events won't have the partial state flag, so we do not need to do anything for them. The remaining 3 paths which create events are `FederationEventHandler.process_remote_join`, `FederationEventHandler.on_send_membership_event` and `FederationEventHandler._process_received_pdu`. 
We can't experience the race in `process_remote_join`, unless we're handling an additional join into a partial state room, which currently blocks, so we make no attempt to handle it correctly. `on_send_membership_event` is only called by `FederationServer._on_send_membership_event`, so we catch the `PartialStateConflictError` there and retry just once. `_process_received_pdu` is called by `on_receive_pdu` for incoming events and `_process_pulled_event` for backfill. The latter should never try to persist partial state events, so we ignore it. We catch the `PartialStateConflictError` in `on_receive_pdu` and retry just once. Referring to the graph of code paths in https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648 may make the above make more sense. Signed-off-by: Sean Quah <seanq@matrix.org>	2022-07-05 16:12:52 +01:00
David Teller	a164a46038	Uniformize spam-checker API, part 4: port other spam-checker callbacks to return `Union[Allow, Codes]`. (#12857 ) Co-authored-by: Brendan Abolivier <babolivier@matrix.org>	2022-06-13 18:16:16 +00:00
Richard van der Hoff	f68b5e5773	Merge branch 'rav/simplify_event_auth_interface' into develop	2022-06-13 11:34:59 +01:00
Richard van der Hoff	c1b28b8842	Remove redundant `room_version` param from `check_auth_rules_from_context` It's now implied by the room_version property on the event.	2022-06-12 23:13:10 +01:00
Richard van der Hoff	68be42f6b6	Remove `room_version` param from `validate_event_for_room_version` Instead, use the `room_version` property of the event we're validating. The `room_version` was originally added as a parameter somewhere around #4482, but really it's been redundant since #6875 added a `room_version` field to `EventBase`.	2022-06-12 23:13:09 +01:00
Richard van der Hoff	7c6b2204d1	Faster joins: add issue links to the TODOs (#13004 ) ... to help us keep track of these things	2022-06-09 10:13:03 +00:00
Erik Johnston	e3163e2e11	Reduce the amount of state we pull from the DB (#12811 )	2022-06-06 09:24:12 +01:00
Erik Johnston	888a29f412	Wait for lazy join to complete when getting current state (#12872 )	2022-06-01 16:02:53 +01:00


810 Commits (c071cd5a0ebc2983e5576036ffef3668ba2a30cd)