Commit Graph

116 Commits (1e571cd66437ea2455c203dafb94c20ba48cdcc1)

Author SHA1 Message Date
Erik Johnston 33548f37aa
Improve tracing for to device messages (#9686) 2021-04-01 17:08:21 +01:00
Patrick Cloke da75d2ea1f
Add type hints for the federation sender. (#9681)
Includes an abstract base class which both the FederationSender
and the FederationRemoteSendQueue must implement.
2021-03-29 11:43:20 -04:00
Erik Johnston c602ba8336
Fixed undefined variable error in catchup (#9664)
Broke in #9640

Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2021-03-24 16:12:47 +00:00
Erik Johnston dd71eb0f8a
Make federation catchup send last event from any server. (#9640)
Currently federation catchup will send the last *local* event that we
failed to send to the remote. This can cause issues for large rooms
where lots of servers have sent events while the remote server was down,
as when it comes back up again it'll be flooded with events from various
points in the DAG.

Instead, let's make it so that all the servers send the most recent
events, even if its not theirs. The remote should deduplicate the
events, so there shouldn't be much overhead in doing this.
Alternatively, the servers could only send local events if they were
also extremities and hope that the other server will send the event
over, but that is a bit risky.
2021-03-18 15:52:26 +00:00
Erik Johnston 026503fa3b
Don't go into federation catch up mode so easily (#9561)
Federation catch up mode is very inefficient if the number of events
that the remote server has missed is small, since handling gaps can be
very expensive, c.f. #9492.

Instead of going into catch up mode whenever we see an error, we instead
do so only if we've backed off from trying the remote for more than an
hour (the assumption being that in such a case it is more than a
transient failure).
2021-03-15 14:42:40 +00:00
Richard van der Hoff 8a4b3738f3
Replace `last_*_pdu_age` metrics with timestamps (#9540)
Following the advice at
https://prometheus.io/docs/practices/instrumentation/#timestamps-not-time-since,
it's preferable to export unix timestamps, not ages.

There doesn't seem to be any particular naming convention for timestamp
metrics.
2021-03-04 16:40:18 +00:00
Andrew Morgan 8bcfc2eaad
Be smarter about which hosts to send presence to when processing room joins (#9402)
This PR attempts to eliminate unnecessary presence sending work when your local server joins a room, or when a remote server joins a room your server is participating in by processing state deltas in chunks rather than individually.

---

When your server joins a room for the first time, it requests the historical state as well. This chunk of new state is passed to the presence handler which, after filtering that state down to only membership joins, will send presence updates to homeservers for each join processed.

It turns out that we were being a bit naive and processing each event individually, and sending out presence updates for every one of those joins. Even if many different joins were users on the same server (hello IRC bridges), we'd send presence to that same homeserver for every remote user join we saw.

This PR attempts to deduplicate all of that by processing the entire batch of state deltas at once, instead of only doing each join individually. We process the joins and note down which servers need which presence:

* If it was a local user join, send that user's latest presence to all servers in the room
* If it was a remote user join, send the presence for all local users in the room to that homeserver

We deduplicate by inserting all of those pending updates into a dictionary of the form:

```
{
  server_name1: {presence_update1, ...},
  server_name2: {presence_update1, presence_update2, ...}
}
```

Only after building this dict do we then start sending out presence updates.
2021-02-19 11:37:29 +00:00
Eric Eastwood 0a00b7ff14
Update black, and run auto formatting over the codebase (#9381)
- Update black version to the latest
 - Run black auto formatting over the codebase
    - Run autoformatting according to [`docs/code_style.md
`](80d6dc9783/docs/code_style.md)
 - Update `code_style.md` docs around installing black to use the correct version
2021-02-16 22:32:34 +00:00
Erik Johnston dd8da8c5f6
Precompute joined hosts and store in Redis (#9198) 2021-01-26 13:57:31 +00:00
Erik Johnston 921a3f8a59
Fix not sending events over federation when using sharded event persisters (#8536)
* Fix outbound federaion with multiple event persisters.

We incorrectly notified federation senders that the minimum persisted
stream position had advanced when we got an `RDATA` from an event
persister.

Notifying of federation senders already correctly happens in the
notifier, so we just delete the offending line.

* Change some interfaces to use RoomStreamToken.

By enforcing use of `RoomStreamTokens` we make it less likely that
people pass in random ints that they got from somewhere random.
2020-10-14 13:27:51 +01:00
Richard van der Hoff f31f8e6319
Remove stream ordering from Metadata dict (#8452)
There's no need for it to be in the dict as well as the events table. Instead,
we store it in a separate attribute in the EventInternalMetadata object, and
populate that on load.

This means that we can rely on it being correctly populated for any event which
has been persited to the database.
2020-10-05 14:43:14 +01:00
Richard van der Hoff 3bd3707cb9
Fix malformed log line in new federation "catch up" logic (#8442) 2020-10-02 11:05:29 +01:00
Richard van der Hoff c1ef579b63
Add prometheus metrics to track federation delays (#8430)
Add a pair of federation metrics to track the delays in sending PDUs to/from 
particular servers.
2020-10-01 11:09:12 +01:00
reivilibre 36efbcaf51
Catch-up after Federation Outage (bonus): Catch-up on Synapse Startup (#8322)
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>

* Fix _set_destination_retry_timings

This came about because the code assumed that retry_interval
could not be NULL — which has been challenged by catch-up.
2020-09-18 14:59:13 +01:00
reivilibre 576bc37d31
Catch-up after Federation Outage (split, 4): catch-up loop (#8272) 2020-09-15 09:07:19 +01:00
reivilibre 17fa4c7ca7
Catch up after Federation Outage (split, 2): Track last successful stream ordering after transmission (#8247)
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2020-09-04 15:06:51 +01:00
reivilibre 58f61f10f7
Catch-up after Federation Outage (split, 1) (#8230)
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
2020-09-04 12:22:23 +01:00
Patrick Cloke c619253db8
Stop sub-classing object (#8249) 2020-09-04 06:54:56 -04:00
reivilibre 4535e849d7
Remove obsolete order field in `send_new_transaction` (#8245)
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2020-09-03 19:23:07 +01:00
Patrick Cloke 5758dcf30c
Add type hints for state. (#8140) 2020-08-24 14:25:27 -04:00
Patrick Cloke eebf52be06
Be stricter about JSON that is accepted by Synapse (#8106) 2020-08-19 07:26:03 -04:00
Patrick Cloke ad6190c925
Convert stream database to async/await. (#8074) 2020-08-17 07:24:46 -04:00
reivilibre ff0e894656
Drop federation transmission queues during a significant remote outage. (#7864)
* Empty federation transmission queues when we are backing off.

Fixes #7828.

Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>

* Address feedback

Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>

* Reword newsfile
2020-08-13 12:35:04 +01:00
Erik Johnston 9d1e4942ab
Fix typing for notifier (#8064) 2020-08-12 14:03:08 +01:00
Olivier Wilkinson (reivilibre) 3aa36b782c Merge branch 'master' into develop 2020-07-30 15:18:36 +01:00
Patrick Cloke c978f6c451
Convert federation client to async/await. (#7975) 2020-07-30 08:01:33 -04:00
Erik Johnston 2c1b9d6763
Update worker docs with recent enhancements (#7969) 2020-07-29 23:22:13 +01:00
Patrick Cloke b975fa2e99
Convert state resolution to async/await (#7942) 2020-07-24 10:59:51 -04:00
Patrick Cloke fefe9943ef
Convert presence handler helpers to async/await. (#7939) 2020-07-23 16:47:36 -04:00
Erik Johnston 649a7ead5c
Add ability to run multiple pusher instances (#7855)
This reuses the same scheme as federation sender sharding
2020-07-16 14:06:28 +01:00
Olivier Wilkinson (reivilibre) 12528dc42f Remove obsolete comment.
It was correct at the time of our friend Jorik writing it (checking
git blame), but the world has moved now and it is no longer a
generator.

Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
2020-07-16 11:12:48 +01:00
Erik Johnston f299441cc6
Add ability to shard the federation sender (#7798) 2020-07-10 18:26:36 +01:00
Patrick Cloke 38e1fac886
Fix some spelling mistakes / typos. (#7811) 2020-07-09 09:52:58 -04:00
Erik Johnston 1e03513f9a
Fix new metric where we used ms instead of seconds (#7771)
Introduced in #7755, not yet released.
2020-07-01 15:23:58 +01:00
Erik Johnston a99658074d
Add some metrics for inbound and outbound federation processing times (#7755) 2020-06-30 16:58:06 +01:00
Patrick Cloke bd6dc17221
Replace iteritems/itervalues/iterkeys with native versions. (#7692) 2020-06-15 07:03:36 -04:00
Richard van der Hoff 075375bbc9 add a comment 2020-05-21 13:25:41 +01:00
Richard van der Hoff d5aa7d93ed
Fix catchup-on-reconnect for the Federation Stream (#7374)
looks like we managed to break this during the refactorathon.
2020-05-05 14:15:57 +01:00
Erik Johnston 4cff617df1
Move catchup of replication streams to worker. (#7024)
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
2020-03-25 14:54:01 +00:00
Erik Johnston b08b0a22d5
Add typing to synapse.federation.sender (#6871) 2020-02-07 13:56:38 +00:00
Erik Johnston a8a50f5b57
Wake up transaction queue when remote server comes back online (#6706)
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
2020-01-17 10:27:19 +00:00
Erik Johnston d386f2f339
Add StateMap type alias (#6715) 2020-01-16 13:31:22 +00:00
Andrew Morgan 3916e1b97a
Clean up newline quote marks around the codebase (#6362) 2019-11-21 12:00:14 +00:00
Hubert Chathi 6f4bc6d01d Merge branch 'develop' into cross-signing_federation 2019-10-31 22:38:21 -04:00
Amber Brown 020add5099
Update black to 19.10b0 (#6304)
* update version of black and also fix the mypy config being overridden
2019-11-01 02:43:24 +11:00
Andrew Morgan 54fef094b3
Remove usage of deprecated logger.warn method from codebase (#6271)
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
2019-10-31 10:23:24 +00:00
Hubert Chathi bb6cec27a5 rename get_devices_by_remote to get_device_updates_by_remote 2019-10-30 14:57:34 -04:00
Hubert Chathi c40d7244f8 Merge branch 'develop' into cross-signing_federation 2019-10-24 22:31:25 -04:00
Hubert Chathi 8d3542a64e implement federation parts of cross-signing 2019-10-22 19:04:35 -04:00
Erik Johnston c66a06ac6b Move storage classes into a main "data store".
This is in preparation for having multiple data stores that offer
different functionality, e.g. splitting out state or event storage.
2019-10-21 16:05:06 +01:00
Richard van der Hoff 66537e10ce
add some metrics on the federation sender (#6160) 2019-10-03 17:47:20 +01:00
Jorik Schellekens ef20aa52eb
use access methods (duh..)
Co-Authored-By: Erik Johnston <erik@matrix.org>
2019-09-05 15:07:17 +01:00
Jorik Schellekens 1d65292e94 Link the send loop with the edus contexts
The contexts were being filtered too early so  the send loop wasn't
being linked to them unless the destination
was whitelisted.
2019-09-05 14:42:37 +01:00
Jorik Schellekens 8767b63a82
Propagate opentracing contexts through EDUs (#5852)
Propagate opentracing contexts through EDUs
Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2019-08-22 18:21:10 +01:00
Amber Brown 4806651744
Replace returnValue with return (#5736) 2019-07-23 23:00:55 +10:00
Richard van der Hoff a6a776f3d8
remove dead transaction persist code (#5622)
this hasn't done anything for years
2019-07-05 12:59:42 +01:00
Amber Brown 463b072b12
Move logging utilities out of the side drawer of util/ and into logging/ (#5606) 2019-07-04 00:07:04 +10:00
Amber Brown 32e7c9e7f2
Run Black. (#5482) 2019-06-20 19:32:02 +10:00
Erik Johnston b42f90470f Add experimental option to reduce extremities.
Adds new config option `cleanup_extremities_with_dummy_events` which
periodically sends dummy events to rooms with more than 10 extremities.

THIS IS REALLY EXPERIMENTAL.
2019-06-18 15:02:18 +01:00
Richard van der Hoff 5c15039e06
Clean up code for sending federation EDUs. (#5381)
This code confused the hell out of me today. Split _get_new_device_messages
into its two (unrelated) parts.
2019-06-13 13:52:08 +01:00
Andrew Morgan 2d1d7b7e6f Prevent multiple device list updates from breaking a batch send (#5156)
fixes #5153
2019-06-06 23:54:00 +01:00
Richard van der Hoff 130f932cbc Run `black` on per_destination_queue
... mostly to fix pep8 fails
2019-05-09 16:27:02 +01:00
Quentin Dufour 11ea16777f Limit the number of EDUs in transactions to 100 as expected by receiver (#5138)
Fixes #3951.
2019-05-09 11:01:41 +01:00
Erik Johnston 197fae1639 Use event streams to calculate presence
Primarily this fixes a bug in the handling of remote users joining a
room where the server sent out the presence for all local users in the
room to all servers in the room.

We also change to using the state delta stream, rather than the
distributor, as it will make it easier to split processing out of the
master process (as well as being more flexible).

Finally, when sending presence states to newly joined servers we filter
out old presence states to reduce the number sent. Initially we filter
out states that are offline and have a last active more than a week ago,
though this can be changed down the line.

Fixes #3962
2019-03-27 13:41:36 +00:00
Richard van der Hoff a902d13180
Batch up outgoing read-receipts to reduce federation traffic. (#4890)
Rate-limit outgoing read-receipts as per #4730.
2019-03-20 16:02:25 +00:00
Richard van der Hoff 02e23b36bc Rename and move the classes 2019-03-13 20:02:56 +00:00