MatrixSynapse/synapse/handlers
Eric Eastwood 51d732db3b
Optimize how we calculate `likely_domains` during backfill (#13575)
Optimize how we calculate `likely_domains` during backfill. I've seen this take 17s in production just to run `get_current_state`, whose result is then passed to `get_domains_from_state` (see case [*2. Loading tons of events* in the `/messages` investigation issue](https://github.com/matrix-org/synapse/issues/13356)).

There are 3 ways we currently calculate hosts that are in the room:

 1. `get_current_state` -> `get_domains_from_state`
    - Used in `backfill` to calculate `likely_domains`, and in `/timestamp_to_event`, where it was cargo-culted from `backfill`
    - This one is being eliminated in favor of `get_current_hosts_in_room` in this PR 🕳
 1. `get_current_hosts_in_room`
    - Used for other federation things like sending read receipts and typing indicators
 1. `get_hosts_in_room_at_events`
    - Used when pushing out events over federation to other servers in the `_process_event_queue_loop`
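
To make the cost of the first approach concrete, here is a minimal sketch of deriving hosts from a full state snapshot in application code. It is not Synapse's implementation: `StateEvent` and `domains_from_state` are hypothetical names, and the only detail taken from this PR is the host-extraction pattern used in the SQL further down. The point is that every state event in the room has to be loaded before any of it can be filtered.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Same pattern the SQL in this PR uses to pull the server name out of a user ID.
HOST_RE = re.compile(r"@[^:]*:(.*)$")


@dataclass(frozen=True)
class StateEvent:
    """Hypothetical, trimmed-down stand-in for a state event."""

    type: str
    state_key: str
    membership: str  # only meaningful for m.room.member events
    depth: int


def domains_from_state(state: Iterable[StateEvent]) -> List[Tuple[str, int]]:
    """Return (domain, min depth) pairs for joined members, oldest first."""
    min_depth: Dict[str, int] = {}
    for ev in state:
        # Everything that is not a joined membership is loaded, then thrown away.
        if ev.type != "m.room.member" or ev.membership != "join":
            continue
        match = HOST_RE.search(ev.state_key)
        if match is None:
            continue
        host = match.group(1)
        min_depth[host] = min(ev.depth, min_depth.get(host, ev.depth))
    return sorted(min_depth.items(), key=lambda item: item[1])


state = [
    StateEvent("m.room.create", "", "", 1),
    StateEvent("m.room.member", "@alice:matrix.org", "join", 2),
    StateEvent("m.room.member", "@bob:example.com", "join", 5),
    StateEvent("m.room.member", "@carol:matrix.org", "join", 9),
    StateEvent("m.room.member", "@dave:gone.example", "leave", 3),
]
print(domains_from_state(state))
```

With 80k state events in a room, the loop itself is cheap; the expensive part is fetching all of those rows in the first place, which is what the timings below measure.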

Fix https://github.com/matrix-org/synapse/issues/13626

Part of https://github.com/matrix-org/synapse/issues/13356

Mentioned in [internal doc](https://docs.google.com/document/d/1lvUoVfYUiy6UaHB6Rb4HicjaJAU40-APue9Q4vzuW3c/edit#bookmark=id.2tvwz3yhcafh)


### Query performance

#### Before

The query from `get_current_state` sucks simply because we have to fetch all 80k state events. We see almost exactly the same performance locally when fetching all of these events (16s locally vs 17s in production):
```
synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 16035.612 ms (00:16.036)

synapse=# SELECT type, state_key, event_id FROM current_state_events WHERE room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
Time: 4243.237 ms (00:04.243)
```

But what about `get_current_hosts_in_room`? With 8M rows in the `current_state_events` table, the previous query in `get_current_hosts_in_room` took 13s from complete freshness (when the events were first added), but it takes 930ms after a Postgres restart, or 390ms when run back to back.

```sh
$ psql synapse
synapse=# \timing on
synapse=# SELECT COUNT(DISTINCT substring(state_key FROM '@[^:]*:(.*)$'))
FROM current_state_events
WHERE
    type = 'm.room.member'
    AND membership = 'join'
    AND room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
  4130
(1 row)

Time: 13181.598 ms (00:13.182)

synapse=# SELECT COUNT(*) from current_state_events where room_id = '!OGEhHVWSdvArJzumhm:matrix.org';
 count
-------
 80814

synapse=# SELECT COUNT(*) from current_state_events;
  count
---------
 8162847

synapse=# SELECT pg_size_pretty( pg_total_relation_size('current_state_events') );
 pg_size_pretty
----------------
 4702 MB
```

#### After

I'm not sure how long it takes from complete freshness, as I only really get that opportunity once (maybe after restarting the computer, but that's cumbersome) and it's not really relevant to normal operating times. Maybe you get closer to the fresh times the more access variability there is, so that the Postgres caches aren't as well primed. Update: the longest I've seen this run for is 6.4s, and 4.5s after a computer restart.

After a Postgres restart, it takes 330ms and running back to back takes 260ms.

```sh
$ psql synapse
synapse=# \timing on
Timing is on.
synapse=# SELECT
    substring(c.state_key FROM '@[^:]*:(.*)$') as host
FROM current_state_events c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE
    c.type = 'm.room.member'
    AND c.membership = 'join'
    AND c.room_id = '!OGEhHVWSdvArJzumhm:matrix.org'
GROUP BY host
ORDER BY min(e.depth) ASC;
Time: 333.800 ms
```
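
For anyone who wants to poke at the shape of this query without a Synapse database, here is a small self-contained sketch using an in-memory SQLite database. This is an approximation under stated assumptions: the two tables are cut down to the columns the query touches, and `host_of` is a hypothetical user-defined function standing in for the Postgres `substring(... FROM regex)` call, which SQLite does not have. It says nothing about Postgres timings.

```python
import re
import sqlite3

HOST_RE = re.compile(r"@[^:]*:(.*)$")
ROOM_ID = "!room:example.com"  # made-up room ID for the demo


def host_of(state_key):
    """Stand-in for Postgres `substring(state_key FROM '@[^:]*:(.*)$')`."""
    match = HOST_RE.search(state_key or "")
    return match.group(1) if match else None


def hosts_by_first_join(conn, room_id):
    """Hosts with joined members in the room, ordered by their earliest join depth."""
    rows = conn.execute(
        """
        SELECT host_of(c.state_key) AS host
        FROM current_state_events AS c
        INNER JOIN events AS e USING (event_id)
        WHERE c.type = 'm.room.member'
            AND c.membership = 'join'
            AND c.room_id = ?
        GROUP BY host
        ORDER BY min(e.depth) ASC
        """,
        (room_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.create_function("host_of", 1, host_of)
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT PRIMARY KEY, depth INTEGER NOT NULL);
    CREATE TABLE current_state_events (
        event_id TEXT PRIMARY KEY,
        room_id TEXT NOT NULL,
        type TEXT NOT NULL,
        state_key TEXT NOT NULL,
        membership TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("$a", 2), ("$b", 5), ("$c", 9), ("$d", 3), ("$e", 1)],
)
conn.executemany(
    "INSERT INTO current_state_events VALUES (?, ?, ?, ?, ?)",
    [
        ("$a", ROOM_ID, "m.room.member", "@alice:matrix.org", "join"),
        ("$b", ROOM_ID, "m.room.member", "@bob:example.com", "join"),
        ("$c", ROOM_ID, "m.room.member", "@carol:matrix.org", "join"),
        ("$d", ROOM_ID, "m.room.member", "@dave:gone.example", "leave"),
        ("$e", "!other:example.com", "m.room.member", "@erin:other.example", "join"),
    ],
)
print(hosts_by_first_join(conn, ROOM_ID))
```

In the sample data `matrix.org` sorts ahead of `example.com` because its earliest join has the lower depth, not because of alphabetical order.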

#### Going further

To improve things further, we could add a `limit` parameter to `get_current_hosts_in_room`. Realistically, we don't need 4k domains to choose from, because there is no way we're going to query that many before we either a) get an answer or b) give up.
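
A rough sketch of why a cap is enough: the caller walks the host list in order and stops at the first answer or when it runs out of attempts. Everything here is hypothetical (`first_successful_host`, the `try_backfill` callable, the example limit); it only illustrates the "get an answer or give up" reasoning, not how backfill is actually wired up.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def first_successful_host(
    hosts_oldest_first: Iterable[str],
    try_backfill: Callable[[str], bool],
    limit: int,
) -> Optional[str]:
    """Try at most `limit` hosts in order; return the first one that answers."""
    for host in islice(hosts_oldest_first, limit):
        if try_backfill(host):
            return host
    # Gave up: hosts beyond `limit` are never looked at.
    return None


# Pretend 4k hosts are available but only one of the early ones responds.
all_hosts = [f"host{i}.example" for i in range(4000)]
attempted = []


def fake_try_backfill(host: str) -> bool:
    attempted.append(host)
    return host == "host3.example"


winner = first_successful_host(all_hosts, fake_try_backfill, limit=5)
print(winner, len(attempted))
```

Since hosts past the cap are never consulted, fetching and sorting all of them is work that a `limit` on the query could skip.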

Another thing we can do is optimize the query to use an index skip scan:

 - https://wiki.postgresql.org/wiki/Loose_indexscan
 - Index Skip Scan, https://commitfest.postgresql.org/37/1741/
 - https://www.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
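
For reference, the loose index scan idea from the first link is usually emulated with a recursive CTE that jumps from one distinct value to the next instead of reading every row. The sketch below shows that pattern against SQLite purely so it is runnable; `member_hosts` and its precomputed `host` column are hypothetical, and applying this to `current_state_events` would first need the host to be something an index can be walked on, which this PR does not attempt.

```python
import sqlite3

# Each recursive step asks the index for the smallest host greater than the
# previous one, so the work scales with the number of distinct hosts.
LOOSE_SCAN_SQL = """
WITH RECURSIVE t(host) AS (
    SELECT (SELECT m.host FROM member_hosts AS m ORDER BY m.host LIMIT 1)
    UNION ALL
    SELECT (
        SELECT m.host FROM member_hosts AS m
        WHERE m.host > t.host
        ORDER BY m.host LIMIT 1
    )
    FROM t
    WHERE t.host IS NOT NULL
)
SELECT host FROM t WHERE host IS NOT NULL
"""


def distinct_hosts_loose(conn):
    """Distinct hosts found by hopping through the index one value at a time."""
    return [row[0] for row in conn.execute(LOOSE_SCAN_SQL)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member_hosts (host TEXT NOT NULL)")
conn.execute("CREATE INDEX member_hosts_host ON member_hosts (host)")
sample = ["matrix.org"] * 1000 + ["example.com"] * 500 + ["kde.org"] * 3
conn.executemany("INSERT INTO member_hosts VALUES (?)", [(h,) for h in sample])

print(distinct_hosts_loose(conn))
```

The result matches `SELECT DISTINCT host`, but each step is a single index lookup rather than a pass over all 1,503 rows.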
2022-08-30 01:38:14 -05:00
..
ui_auth Drop support for delegating email validation, round 2 (#13596) 2022-08-23 11:40:00 +00:00
__init__.py Remove redundant "coding: utf-8" lines (#9786) 2021-04-14 15:34:27 +01:00
account.py Optionally include account validity in MSC3720 account status responses (#12266) 2022-03-24 11:19:41 +01:00
account_data.py Add `StreamKeyType` class and replace string literals with constants (#12567) 2022-05-16 15:35:31 +00:00
account_validity.py Implement cancellation support/protection for module callbacks (#12568) 2022-05-09 12:31:14 +01:00
admin.py Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
appservice.py Federation Sender & Appservice Pusher Stream Optimisations (#13251) 2022-07-15 09:36:56 +01:00
auth.py `synapse.api.auth.Auth` cleanup: make permission-related methods use `Requester` instead of the `UserID` (#13024) 2022-08-22 14:17:59 +01:00
cas.py Remove `HomeServer.get_datastore()` (#12031) 2022-02-23 11:04:02 +00:00
deactivate_account.py Add third_party module callbacks to check if a user can delete a room and deactivate a user (#12028) 2022-03-09 18:23:57 +00:00
device.py Update `get_users_in_room` mis-use to get hosts with dedicated `get_current_hosts_in_room` (#13605) 2022-08-24 14:15:37 -05:00
devicemessage.py Additional constants for EDU types. (#12884) 2022-05-27 07:14:36 -04:00
directory.py Update `get_users_in_room` mis-use to get hosts with dedicated `get_current_hosts_in_room` (#13605) 2022-08-24 14:15:37 -05:00
e2e_keys.py Add missing types to opentracing. (#13345) 2022-07-21 12:01:52 +00:00
e2e_room_keys.py Add missing types to opentracing. (#13345) 2022-07-21 12:01:52 +00:00
event_auth.py Use dedicated `get_local_users_in_room` to find local users when calculating `join_authorised_via_users_server` of a `/make_join` request (#13606) 2022-08-24 11:14:28 -05:00
events.py Directly lookup local membership instead of getting all members in a room first (`get_users_in_room` mis-use) (#13608) 2022-08-24 14:13:12 -05:00
federation.py Optimize how we calculate `likely_domains` during backfill (#13575) 2022-08-30 01:38:14 -05:00
federation_event.py Comment about a better future where we can get the state diff between two events (#13586) 2022-08-24 18:59:27 -05:00
identity.py Drop support for delegating email validation, round 2 (#13596) 2022-08-23 11:40:00 +00:00
initial_sync.py `synapse.api.auth.Auth` cleanup: make permission-related methods use `Requester` instead of the `UserID` (#13024) 2022-08-22 14:17:59 +01:00
message.py Directly lookup local membership instead of getting all members in a room first (`get_users_in_room` mis-use) (#13608) 2022-08-24 14:13:12 -05:00
oidc.py Move the "email unsubscribe" resource, refactor the macaroon generator & simplify the access token verification logic. (#12986) 2022-06-14 09:12:08 -04:00
pagination.py Move the execution of the retention purge_jobs to the main worker (#13632) 2022-08-26 08:38:10 +01:00
password_policy.py Use direct references for some configuration variables (part 3) (#10885) 2021-09-23 07:13:34 -04:00
presence.py Update `get_users_in_room` mis-use to get hosts with dedicated `get_current_hosts_in_room` (#13605) 2022-08-24 14:15:37 -05:00
profile.py Use a single query in `ProfileHandler.get_profile` (#13209) 2022-07-07 11:02:09 +00:00
push_rules.py Add a module API to allow modules to edit push rule actions (#12406) 2022-04-27 13:55:33 +00:00
read_marker.py Refactor and convert `Linearizer` to async (#12357) 2022-04-05 15:43:52 +01:00
receipts.py Support stable identifiers for MSC2285: private read receipts. (#13273) 2022-08-05 11:09:33 -04:00
register.py `synapse.api.auth.Auth` cleanup: make permission-related methods use `Requester` instead of the `UserID` (#13024) 2022-08-22 14:17:59 +01:00
relations.py `synapse.api.auth.Auth` cleanup: make permission-related methods use `Requester` instead of the `UserID` (#13024) 2022-08-22 14:17:59 +01:00
room.py Optimize how we calculate `likely_domains` during backfill (#13575) 2022-08-30 01:38:14 -05:00
room_batch.py Rename storage classes (#12913) 2022-05-31 12:17:50 +00:00
room_list.py Use stable prefixes for MSC3827: filtering of `/publicRooms` by room type (#13370) 2022-07-27 19:46:57 +01:00
room_member.py Directly lookup local membership instead of getting all members in a room first (`get_users_in_room` mis-use) (#13608) 2022-08-24 14:13:12 -05:00
room_member_worker.py Implement knock feature (#6739) 2021-06-09 19:39:51 +01:00
room_summary.py Revert 'Remove the unspecced field in the response. (#13365)' to give more time for clients to update. (#13501) 2022-08-11 10:27:48 +00:00
saml.py Remove `HomeServer.get_datastore()` (#12031) 2022-02-23 11:04:02 +00:00
search.py Reduce the amount of state we pull from the DB (#12811) 2022-06-06 09:24:12 +01:00
send_email.py Support Implicit TLS for sending emails (#13317) 2022-07-25 16:27:19 +01:00
set_password.py Remove `HomeServer.get_datastore()` (#12031) 2022-02-23 11:04:02 +00:00
sso.py Use `getClientAddress` instead of `getClientIP`. (#12599) 2022-05-04 14:11:21 -04:00
state_deltas.py Remove `HomeServer.get_datastore()` (#12031) 2022-02-23 11:04:02 +00:00
stats.py Implement MSC3827: Filtering of `/publicRooms` by room type (#13031) 2022-06-29 17:12:45 +00:00
sync.py Cache user IDs instead of profile objects (#13573) 2022-08-23 09:49:59 +00:00
typing.py Update `get_users_in_room` mis-use to get hosts with dedicated `get_current_hosts_in_room` (#13605) 2022-08-24 14:15:37 -05:00
user_directory.py Wait for lazy join to complete when getting current state (#12872) 2022-06-01 16:02:53 +01:00