Commit Graph

4599 Commits (51d732db3b4ab13eb58e937a546abce7968112ef)

Author SHA1 Message Date
Eric Eastwood 51d732db3b
Optimize how we calculate `likely_domains` during backfill (#13575)
Optimize how we calculate `likely_domains` during backfill because I've seen this take 17s in production just to `get_current_state` which is used to `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).

There are 3 ways we currently calculate hosts that are in the room:

 1. `get_current_state` -> `get_domains_from_state`
    - Used in `backfill` to calculate `likely_domains` and `/timestamp_to_event` because it was cargo-culted from `backfill`
    - This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳
 1. `get_current_hosts_in_room`
    - Used for other federation things like sending read receipts and typing indicators
 1. `get_hosts_in_room_at_events`
    - Used when pushing out events over federation to other servers in the `_process_event_queue_loop`

Fix https://github.com/matrix-org/synapse/issues/13626

Part of https://github.com/matrix-org/synapse/issues/13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)


### Query performance

#### Before

The query from `get_current_state` sucks just because we have to get all 80k events. And we see almost the exact same performance locally trying to get all of these events (16s vs 17s):
```
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 16035.612 ms (00:16.036)

synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 4243.237 ms (00:04.243)
```

But what about `get_current_hosts_in_room`: When there is 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added). But takes 930ms after a Postgres restart or 390ms if running back to back to back.

```sh
$ psql synapse
synapse=# \timing on
synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
FROM current_state_events
WHERE
    type = 'm.room.member'
    AND membership = 'join'
    AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
  4130
(1 row)

Time: 13181.598 ms (00:13.182)

synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
 80814

synapse=# SELECT COUNT(*) from current_state_events;
  count
---------
 8162847

synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
 pg_size_pretty
----------------
 4702 MB
```

#### After

I'm not sure how long it takes from complete freshness as I only really get that opportunity once (maybe restarting computer but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is so that Postgres caches aren't as exact. Update: The longest I've seen this run for is 6.4s and 4.5s after a computer restart.

After a Postgres restart, it takes 330ms and running back to back takes 260ms.

```sh
$ psql synapse
synapse=# \timing on
Timing is on.
synapse=# SELECT
    substring(c.state_key FROM '@[^:]*:(.*)$') as host
FROM current_state_events c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE
    c.type = 'm.room.member'
    AND c.membership = 'join'
    AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
GROUP BY host
ORDER BY min(e.depth) ASC;
Time: 333.800 ms
```

#### Going further

To improve things further we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from because there is no way we're going to query that many before we a) probably get an answer or b) we give up. 

Another thing we can do is optimize the query to use a index skip scan:

 - https://wiki.postgresql.org/wiki/Loose_indexscan
 - Index Skip Scan, https://commitfest.postgresql.org/37/1741/
 - https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
2022-08-30 01:38:14 -05:00
Eric Eastwood d58615c82c
Directly lookup local membership instead of getting all members in a room first (`get_users_in_room` mis-use) (#13608)
See https://github.com/matrix-org/synapse/pull/13575#discussion_r953023755
2022-08-24 14:13:12 -05:00
Eric Eastwood b93bd95e8a
When loading current ids, sort by `stream_id` to avoid incorrect overwrite and avoid errors caused by sorting alphabetical instance name which can be `null` (#13585)
When loading current ids, sort by stream ID so that we don't want to overwrite the `current_position` of an instance to a lower stream ID than we're actually at ([discussion](https://github.com/matrix-org/synapse/pull/13585#discussion_r951795379)). Previously, it sorted alphabetically by instance name which can be `null` and throw errors but more importantly, accomplishes nothing.

Fixes the following startup error which is why I started looking into this area:

```
$ poetry run synapse_homeserver --config-path homeserver.yaml
****************************************************************
 Error during initialisation:
    '<' not supported between instances of 'NoneType' and 'str'
 There may be more information in the logs.
****************************************************************
```

Somehow my database ended up looking like the following, notice the `instance_name` is `null` in the db, and we can't sort `NoneType` things. Another question is why do we see the `instance_name` as `null` sometimes instead of `master` in monolith mode?
```
$ psql synapse
synapse=# SELECT * FROM stream_positions;
   stream_name   | instance_name | stream_id
-----------------+---------------+-----------
 account_data    | master        |      1242
 events          | master        |      1787
 to_device       | master        |        58
 presence_stream | master        |    485638
 receipts        | master        |       341
 backfill        | master        |   -139106
(6 rows)
synapse=# SELECT instance_name, stream_id FROM receipts_linearized;
 instance_name | stream_id
---------------+-----------
               |       211
               |         3
               |         4
               |       212
               |       213
               |       224
               |       228
               |       164
               |       313
               |       253
               |        38
               |       321
               |       324
               |       189
               |       192
               |       193
               |       194
               |       195
               |       197
               |       198
               |       275
               |        79
               |       339
               |       340
               |        82
               |       341
               |        84
               |        85
               |        91
               |       119
```
2022-08-24 12:53:46 -05:00
Nick Mills-Barrett b687010f89
Rewrite get push actions queries (#13597) 2022-08-24 10:12:51 +01:00
Erik Johnston 05c9c7363b
Fix regression caused by #13573 (#13600)
Broke in #13573.
2022-08-23 14:14:05 +00:00
Erik Johnston aec87a0f93
Speed up fetching large numbers of push rules (#13592) 2022-08-23 13:15:43 +01:00
Nick Mills-Barrett 5e7847dc92
Cache user IDs instead of profile objects (#13573)
The profile objects are never used and increase cache size significantly.
2022-08-23 09:49:59 +00:00
Quentin Gliech 3dd175b628
`synapse.api.auth.Auth` cleanup: make permission-related methods use `Requester` instead of the `UserID` (#13024)
Part of #13019

This changes all the permission-related methods to rely on the Requester instead of the UserID. This is a first step towards enabling scoped access tokens at some point, since I expect the Requester to have scope-related informations in it.

It also changes methods which figure out the user/device/appservice out of the access token to return a Requester instead of something else. This avoids having store-related objects in the methods signatures.
2022-08-22 14:17:59 +01:00
Sean Quah 84169a82dc
Avoid blocking lazy-loading `/sync`s during partial joins (#13477)
Use a state filter or accept partial state in a few places where we
request state, to avoid blocking.

To make lazy-loading `/sync`s work, we need to provide the memberships
of event senders, which are not guaranteed to be in the room state.
Instead we dig through auth events for memberships to present to
clients. The auth events of an event are guaranteed to contain a
passable membership event, otherwise the event would have been rejected.

Note that this only covers the common code paths encountered during
testing. There has been no exhaustive checking of all sync code paths.

Fixes #13146.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-08-18 11:53:02 +01:00
reivilibre 8bdf2bd31e
Fix a bug in the `/event_reports` Admin API which meant that the total count could be larger than the number of results you can actually query for. (#13525)
Co-authored-by: Brendan Abolivier <babolivier@matrix.org>
2022-08-17 18:08:23 +00:00
Dirk Klimpel d75512d19e
Add forgotten status to Room Details API (#13503) 2022-08-17 09:42:01 +00:00
Eric Eastwood 0a4efbc1dd
Instrument the federation/backfill part of `/messages` (#13489)
Instrument the federation/backfill part of `/messages` so it's easier to follow what's going on in Jaeger when viewing a trace.

Split out from https://github.com/matrix-org/synapse/pull/13440

Follow-up from https://github.com/matrix-org/synapse/pull/13368

Part of https://github.com/matrix-org/synapse/issues/13356
2022-08-16 12:39:40 -05:00
reivilibre c3516e9dec
Faster room joins: make `/joined_members` block whilst the room is partial stated. (#13514) 2022-08-16 13:16:56 +01:00
Erik Johnston 5442891cbc
Make push rules use proper structures. (#13522)
This improves load times for push rules:

| Version              | Time per user | Time for 1k users | 
| -------------------- | ------------- | ----------------- |
| Before               |       138 µs  |             138ms |
| Now (with custom)    |       2.11 µs |            2.11ms |
| Now (without custom) |       49.7 ns |           0.05 ms |

This therefore has a large impact on send times for rooms
with large numbers of local users in the room.
2022-08-16 12:22:17 +01:00
Eric Eastwood 344a2f767c
Instrument `FederationStateIdsServlet` - `/state_ids` (#13499)
Instrument FederationStateIdsServlet - `/state_ids` so it's easier to follow what's going on in Jaeger when viewing a trace.
2022-08-15 19:41:23 +01:00
David Robertson 19e5d44886
Revert "Update locked versions of mypy and mypy-zope (#13521)"
This reverts commit f383b9b3ec. Other PRs
were seeing mypy failures that looked to be related to mypy-zope.
Confusingly, we didn't see this on #13521.

Revert this for now and investigate later.
2022-08-15 14:51:05 +01:00
Patrick Cloke 46bd7f4ed9
Clarifications for event push action processing. (#13485)
* Clarifies comments.
* Fixes an erroneous comment (about return type) added in #13455
  (ec24813220).
* Clarifies the name of a variable.
* Simplifies logic of pulling out the latest join for the requesting user.
2022-08-15 09:33:17 -04:00
David Robertson f383b9b3ec
Update locked versions of mypy and mypy-zope (#13521) 2022-08-15 11:32:30 +01:00
Richard van der Hoff 507c1cb330
Update the rejected state of events during resync (#13459)
Events can be un-rejected or newly-rejected during resync, so ensure we update
the database and caches when that happens.
2022-08-11 10:42:24 +00:00
Šimon Brandner ab18441573
Support stable identifiers for MSC2285: private read receipts. (#13273)
This adds support for the stable identifiers of MSC2285 while
continuing to support the unstable identifiers behind the configuration
flag. These will be removed in a future version.
2022-08-05 11:09:33 -04:00
Erik Johnston b6a6bb4027
Add comments about how event push actions are stored. (#13445) 2022-08-04 19:38:08 +00:00
Patrick Cloke ec24813220
Improve comments (& avoid a duplicate query) in push actions processing. (#13455)
* Adds docstrings and inline comments.
* Formats SQL queries using triple quoted strings.
* Minor formatting changes.
* Avoid fetching `event_push_summary_stream_ordering` multiple times
  in the same transactions.
2022-08-04 19:24:44 +00:00
Richard van der Hoff 96d92156d0
Update type of `EventContext.rejected` (#13460) 2022-08-04 17:45:01 +01:00
Nick Mills-Barrett 41320a0554
Optimise async get event lookups (#13435)
Still maintains local in memory lookup optimisation, but does any external
lookup as part of the deferred that prevents duplicate lookups for the same
event at once. This makes the assumption that fetching from an external
cache is a non-zero load operation.
2022-08-04 15:49:55 +01:00
Eric Eastwood 92d21faf12
Instrument `/messages` for understandable traces in Jaeger (#13368)
In Jaeger:

 - Before: huge list of uncategorized database calls
 - After: nice and collapsible into units of work
2022-08-03 10:57:38 -05:00
Sean Quah 224d792dd7
Refactor `_resolve_state_at_missing_prevs` to return an `EventContext` (#13404)
Previously, `_resolve_state_at_missing_prevs` returned the resolved
state before an event and a partial state flag. These were unwieldy to
carry around would only ever be used to build an event context. Build
the event context directly instead.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-08-01 13:53:56 +01:00
Richard van der Hoff 23768ccb4d
Faster joins: fix rejected events becoming un-rejected during resync (#13413)
Make sure that we re-check the auth rules during state resync, otherwise
rejected events get un-rejected.
2022-08-01 11:20:05 +01:00
Šimon Brandner 583f22780f
Use stable prefixes for MSC3827: filtering of `/publicRooms` by room type (#13370)
Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
2022-07-27 19:46:57 +01:00
Richard van der Hoff ca3db044a3
Fix infinite loop in partial-state resync (#13353)
Make sure that we only pull out events from the db once they have no
prev-events with partial state.
2022-07-26 11:47:31 +00:00
Sean Quah 335ebb21cc
Faster room joins: avoid blocking when pulling events with missing prevs (#13355)
Avoid blocking on full state in `_resolve_state_at_missing_prevs` and
return a new flag indicating whether the resolved state is partial.
Thread that flag around so that it makes it into the event context.

Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
2022-07-26 12:39:23 +01:00
Patrick Cloke 8b603299bf
Remove unused argument for get_relations_for_event. (#13383) 2022-07-26 07:19:20 -04:00
Erik Johnston 43adf2521c
Refactor presence so we can prune user in room caches (#13313)
See #10826 and #10786 for context as to why we had to disable pruning on
those caches.

Now that `get_users_who_share_room_with_user` is called frequently only
for presence, we just need to make calls to it less frequent and then we
can remove the various levels of caching that is going on.
2022-07-25 09:21:06 +00:00
Erik Johnston 0b87eb8e0c
Make DictionaryCache have better expiry properties (#13292) 2022-07-21 17:13:44 +01:00
David Robertson 34949ead1f
Track DB txn times w/ two counters, not histogram (#13342) 2022-07-21 13:23:05 +01:00
Patrick Cloke 50122754c8
Add missing types to opentracing. (#13345)
After this change `synapse.logging` is fully typed.
2022-07-21 12:01:52 +00:00
Nick Mills-Barrett 190f49d8ab
Use cache store remove base slaved (#13329)
This comes from two identical definitions in each of the base stores, and means the base slaved store is now empty and can be removed.
2022-07-21 11:51:30 +01:00
Eric Eastwood 0f971ca68e
Update `get_pdu` to return the original, pristine `EventBase` (#13320)
Update `get_pdu` to return the untouched, pristine `EventBase` as it was originally seen over federation (no metadata added). Previously, we returned the same `event` reference that we stored in the cache which downstream code modified in place and added metadata like setting it as an `outlier`  and essentially poisoned our cache. Now we always return a copy of the `event` so the original can stay pristine in our cache and re-used for the next cache call.

Split out from https://github.com/matrix-org/synapse/pull/13205

As discussed at:

 - https://github.com/matrix-org/synapse/pull/13205#discussion_r918365746
 - https://github.com/matrix-org/synapse/pull/13205#discussion_r918366125

Related to https://github.com/matrix-org/synapse/issues/12584. This PR doesn't fix that issue because it hits [`get_event` which exists from the local database before it tries to `get_pdu`](7864f33e28/synapse/federation/federation_client.py (L581-L594)).
2022-07-20 15:58:51 -05:00
Patrick Cloke a6895dd576
Add type annotations to `trace` decorator. (#13328)
Functions that are decorated with `trace` are now properly typed
and the type hints for them are fixed.
2022-07-19 14:14:30 -04:00
Erik Johnston de70b25e84
Reduce memory usage of state group cache (#13323) 2022-07-19 14:40:37 +01:00
David Robertson b977867358
Rate limit joins per-room (#13276) 2022-07-19 11:45:17 +00:00
Nick Mills-Barrett 2ee0b6ef4b
Safe async event cache (#13308)
Fix race conditions in the async cache invalidation logic, by separating
the async & local invalidation calls and ensuring any async call i
executed first.

Signed off by Nick @ Beeper (@Fizzadar).
2022-07-19 11:25:29 +00:00
Shay 7864f33e28
Increase batch size of `bulk_get_push_rules` and `_get_joined_profiles_from_event_ids`. (#13300) 2022-07-18 13:15:23 -07:00
Shay 15edf23626
Improve performance of query ` _get_subset_users_in_room_with_profiles` (#13299) 2022-07-18 12:35:45 -07:00
Erik Johnston f721f1baba
Revert "Make all `process_replication_rows` methods async (#13304)" (#13312)
This reverts commit 5d4028f217.
2022-07-18 14:28:14 +01:00
Nick Mills-Barrett 6785b0f39d
Use READ COMMITTED isolation level when purging rooms (#12942)
To close: #10294.

Signed off by Nick @ Beeper.
2022-07-18 14:17:24 +01:00
Nick Mills-Barrett 5d4028f217
Make all `process_replication_rows` methods async (#13304)
More prep work for asyncronous caching, also makes all process_replication_rows methods consistent (presence handler already is so).

Signed off by Nick @ Beeper (@Fizzadar)
2022-07-17 22:19:43 +01:00
Erik Johnston 0731e0829c
Don't pull out the full state when storing state (#13274) 2022-07-15 12:59:45 +00:00
Richard van der Hoff b116d3ce00
Bg update to populate new `events` table columns (#13215)
These columns were added back in Synapse 1.52, and have been populated for new
events since then. It's now (beyond) time to back-populate them for existing
events.
2022-07-15 12:47:26 +01:00
Erik Johnston 7be954f59b
Fix a bug which could lead to incorrect state (#13278)
There are two fixes here:
1. A long-standing bug where we incorrectly calculated `delta_ids`; and
2. A bug introduced in #13267 where we got current state incorrect.
2022-07-15 11:06:41 +00:00
Nick Mills-Barrett cc21a431f3
Async get event cache prep (#13242)
Some experimental prep work to enable external event caching based on #9379 & #12955. Doesn't actually move the cache at all, just lays the groundwork for async implemented caches.

Signed off by Nick @ Beeper (@Fizzadar)
2022-07-15 09:30:46 +00:00