Commit Graph

609 Commits (f14c63213492f512f451b295331a18dbb28685c6)

Author SHA1 Message Date
Sean Quah 1391a76cd2
Faster room joins: fix race in recalculation of current room state (#13151)
Bounce recalculation of current state to the correct event persister and
move recalculation of current state into the event persistence queue, to
avoid concurrent updates to a room's current state.

Also give recalculation of a room's current state a real stream
ordering.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-07 12:19:31 +00:00
Sean Quah 68db233f0c
Handle race between persisting an event and un-partial stating a room (#13100)
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special `PartialStateConflictError` exception,
which makes its way up to the method in which we computed the event
context.

To make things difficult, the exception needs to cross a replication
request: `/fed_send_events` for events coming over federation and
`/send_event` for events from clients. We transport the
`PartialStateConflictError` as a `409 Conflict` over replication and
turn `409`s back into `PartialStateConflictError`s on the worker making
the request.

All client events go through
`EventCreationHandler.handle_new_client_event`, which is called in
*a lot* of places. Instead of trying to update all the code which
creates client events, we turn the `PartialStateConflictError` into a
`429 Too Many Requests` in
`EventCreationHandler.handle_new_client_event` and hope that clients
take it as a hint to retry their request.

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
`FederationEventHandler._auth_and_persist_outliers_inner`,
`FederationHandler.do_knock`, `FederationHandler.on_invite_request` and
`FederationHandler.do_remotely_reject_invite`. These events won't have
the partial state flag, so we do not need to do anything for then.

The remaining 3 paths which create events are
`FederationEventHandler.process_remote_join`,
`FederationEventHandler.on_send_membership_event` and
`FederationEventHandler._process_received_pdu`.

We can't experience the race in `process_remote_join`, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

`on_send_membership_event` is only called by
`FederationServer._on_send_membership_event`, so we catch the
`PartialStateConflictError` there and retry just once.

`_process_received_pdu` is called by `on_receive_pdu` for incoming
events and `_process_pulled_event` for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
`PartialStateConflictError` in `on_receive_pdu` and retry just once.

Refering to the graph of code paths in
https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648
may make the above make more sense.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-07-05 16:12:52 +01:00
David Robertson 97e9fbe1b2
Type annotations in `synapse.databases.main.devices` (#13025)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-06-15 15:20:04 +00:00
Patrick Cloke cf05258f76
Remove groups replication code. (#12900)
The replication logic for groups is no longer used, so the message
passing infrastructure can be removed.
2022-05-31 13:04:08 -04:00
Erik Johnston 1e453053cb
Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
reivilibre 39dee30f01
Send `USER_IP` commands on a different Redis channel, in order to reduce traffic to workers that do not process these commands. (#12809) 2022-05-20 15:28:23 +01:00
reivilibre 177b884ad7
Lay some foundation work to allow workers to only subscribe to some kinds of messages, reducing replication traffic. (#12672) 2022-05-19 16:29:08 +01:00
Andrew Morgan 83be72d76c
Add `StreamKeyType` class and replace string literals with constants (#12567) 2022-05-16 15:35:31 +00:00
Sean Quah a559c8b0d9
Respect the `@cancellable` flag for `ReplicationEndpoint`s (#12700)
While `ReplicationEndpoint`s register themselves via `JsonResource`,
they pass a method that calls the handler, instead of the handler itself,
to `register_paths`. As a result, `JsonResource` will not correctly pick
up the `@cancellable` flag and we have to apply it ourselves.

Signed-off-by: Sean Quah <seanq@element.io>
2022-05-11 12:25:39 +01:00
Shay d80a7ab151
Update `replication.md` with info on TCP module structure (#12621) 2022-05-09 14:46:43 -07:00
Šimon Brandner ef86cf3d28
Update `_on_new_receipts()` to work with MSC2285 changes. (#12636) 2022-05-05 13:25:51 +00:00
Erik Johnston c0379d6e5b
Reduce log spam when running multiple event persisters (#12610) 2022-05-05 10:20:23 +01:00
Erik Johnston d1cd96ce29
Add opentracing spans to calls to external cache (#12380) 2022-04-07 13:18:29 +01:00
Sean Quah 800ba87cc8
Refactor and convert `Linearizer` to async (#12357)
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.

Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.

Signed-off-by: Sean Quah <seanq@element.io>
2022-04-05 15:43:52 +01:00
Erik Johnston 66053b6bfb
Prefill more stream change caches. (#12372) 2022-04-05 14:26:41 +01:00
Erik Johnston b446c99ac9
Prefill the device_list_stream_cache (#12367)
* Prefill the device_list_stream_cache

* Newsfile

* Newsfile
2022-04-04 20:12:25 +01:00
Erik Johnston 5c9e39e619
Track device list updates per room. (#12321)
This is a first step in dealing with #7721.

The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
2022-04-04 15:25:20 +01:00
reivilibre f871222880
Move `update_client_ip` background job from the main process to the background worker. (#12251) 2022-04-01 13:08:55 +01:00
David Robertson a2b00a4486
Bump `black` and `click` versions (#12320) 2022-03-29 10:41:19 +00:00
reivilibre 4a53f35737
Improve code documentation for the typing stream over replication. (#12211) 2022-03-11 14:00:15 +00:00
Patrick Cloke 3e4af36bc8
Rename get_tcp_replication to get_replication_command_handler. (#12192)
Since the object it returns is a ReplicationCommandHandler.

This is clean-up from adding support to Redis where the command handler
was added as an additional layer of abstraction from the TCP protocol.
2022-03-10 13:01:56 +00:00
Nick Mills-Barrett 180d8ff0d4
Retry some http replication failures (#12182)
This allows for the target process to be down for around a minute
which provides time for restarts during synapse upgrades/config updates.

Closes: #12178

Signed off by Nick Mills-Barrett nick@beeper.com
2022-03-09 14:53:28 +00:00
Patrick Cloke d8bab6793c
Fix incorrect type hints for txredis. (#12042)
Some properties were marked as RedisProtocol instead of ConnectionHandler,
which wraps RedisProtocol instance(s).
2022-03-08 07:26:05 -05:00
Erik Johnston 423cca9efe
Spread out sending device lists to remote hosts (#12132) 2022-03-04 11:48:15 +00:00
Richard van der Hoff e24ff8ebe3
Remove `HomeServer.get_datastore()` (#12031)
The presence of this method was confusing, and mostly present for backwards
compatibility. Let's get rid of it.

Part of #11733
2022-02-23 11:04:02 +00:00
Erik Johnston 6d14b3dabf
Better error message when failing to request from another process (#12060) 2022-02-22 15:52:08 +00:00
Patrick Cloke d0e78af35e
Add missing type hints to synapse.replication. (#11938) 2022-02-08 11:03:08 -05:00
Patrick Cloke 6c0984e3f0
Remove unnecessary ignores due to Twisted upgrade. (#11939)
Twisted 22.1.0 fixed some internal type hints, allowing Synapse
to remove ignore calls for parameters to connectTCP.
2022-02-08 09:15:59 -05:00
Patrick Cloke 63d90f10ec
Add missing type hints to synapse.replication.http. (#11856) 2022-02-08 07:44:39 -05:00
Richard van der Hoff 2277275485
Stop reading from `event_reference_hashes` (#11794)
Preparation for dropping this table altogether. Part of #6574.
2022-01-21 09:18:10 +00:00
Patrick Cloke 10a88ba91c
Use auto_attribs/native type hints for attrs classes. (#11692) 2022-01-13 13:49:28 +00:00
Richard van der Hoff 2359ee3864
Remove redundant `get_current_events_token` (#11643)
* Push `get_room_{min,max_stream_ordering}` into StreamStore

Both implementations of this are identical, so we may as well push it down and
get rid of the abstract base class nonsense.

* Remove redundant `StreamStore` class

This is empty now

* Remove redundant `get_current_events_token`

This was an exact duplicate of `get_room_max_stream_ordering`, so let's get rid
of it.

* newsfile
2022-01-04 16:10:27 +00:00
Patrick Cloke cbd82d0b2d
Convert all namedtuples to attrs. (#11665)
To improve type hints throughout the code.
2021-12-30 18:47:12 +00:00
Sean Quah 5305a5e881
Type hint the constructors of the data store classes (#11555) 2021-12-13 17:05:00 +00:00
Quentin Gliech a15a893df8
Save the OIDC session ID (sid) with the device on login (#11482)
As a step towards allowing back-channel logout for OIDC.
2021-12-06 12:43:06 -05:00
Sean Quah ffd858aa68
Add type hints to `synapse/storage/databases/main/events_worker.py` (#11411)
Also refactor the stream ID trackers/generators a bit and try to
document them better.
2021-11-26 18:41:31 +00:00
Patrick Cloke 5cace20bf1
Add missing type hints to `synapse.app`. (#11287) 2021-11-10 15:06:54 -05:00
Nick Barrett af54167516
Enable passing typing stream writers as a list. (#11237)
This makes the typing stream writer config match the other stream writers
that only currently support a single worker.
2021-11-03 14:25:47 +00:00
Brendan Abolivier c7a5e49664
Implement an `on_new_event` callback (#11126)
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
2021-10-26 15:17:36 +02:00
Sean Quah 2b82ec425f
Add type hints for most `HomeServer` parameters (#11095) 2021-10-22 18:15:41 +01:00
Sean Quah 6a67f3786a
Fix logging context warnings when losing replication connection (#10984)
Instead of triggering `__exit__` manually on the replication handler's
logging context, use it as a context manager so that there is an
`__enter__` call to balance the `__exit__`.
2021-10-15 13:10:58 +01:00
Sean Quah 6b18eb4430
Fix opentracing and Prometheus metrics for replication requests (#10996)
This commit fixes two bugs to do with decorators not instrumenting
`ReplicationEndpoint`'s `send_request` correctly. There are two
decorators on `send_request`: Prometheus' `Gauge.track_inprogress()`
and Synapse's `opentracing.trace`.

`Gauge.track_inprogress()` does not have any support for async
functions when used as a decorator. Since async functions behave like
regular functions that return coroutines, only the creation of the
coroutine was covered by the metric and none of the actual body of
`send_request`.

`Gauge.track_inprogress()` returns a regular, non-async function
wrapping `send_request`, which is the source of the next bug.
The `opentracing.trace` decorator would normally handle async functions
correctly, but since the wrapped `send_request` is a non-async function,
the decorator ends up suffering from the same issue as
`Gauge.track_inprogress()`: the opentracing span only measures the
creation of the coroutine and none of the actual function body.

Using `Gauge.track_inprogress()` as a context manager instead of a
decorator resolves both bugs.
2021-10-12 11:23:46 +01:00
David Robertson 51a5da74cc
Annotate synapse.storage.util (#10892)
Also mark `synapse.streams` as having has no untyped defs

Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2021-10-08 14:25:16 +00:00
Patrick Cloke f4b1a9a527
Require direct references to configuration variables. (#10985)
This removes the magic allowing accessing configurable
variables directly from the config object. It is now required
that a specific configuration class is used (e.g. `config.foo`
must be replaced with `config.server.foo`).
2021-10-06 10:47:41 -04:00
David Robertson 29364145b2
Pass str to twisted's IReactorTCP (#10895)
This follows a correction made in twisted/twisted#1664 and should fix our Twisted Trial CI job.

Until that change is in a twisted release, we'll have to ignore the type
of the `host` argument. I've raised #10899 to remind us to review the
issue in a few months' time.
2021-09-30 12:51:47 +01:00
Patrick Cloke 94b620a5ed
Use direct references for configuration variables (part 6). (#10916) 2021-09-29 06:44:15 -04:00
Patrick Cloke bb7fdd821b
Use direct references for configuration variables (part 5). (#10897) 2021-09-24 07:25:21 -04:00
Patrick Cloke 01c88a09cd
Use direct references for some configuration variables (#10798)
Instead of proxying through the magic getter of the RootConfig
object. This should be more performant (and is more explicit).
2021-09-13 13:07:12 -04:00
Richard van der Hoff 1800aabfc2
Split `FederationHandler` in half (#10692)
The idea here is to take anything to do with incoming events and move it out to a separate handler, as a way of making FederationHandler smaller.
2021-08-26 21:41:44 +01:00
Andrew Morgan 84469bdac7
Remove the unused public_room_list_stream (#10565)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2021-08-17 14:02:50 +01:00