Commit Graph

96 Commits (a5a799722db0c33dc61fb2c6c7282ff7e82eb2e9)

Author SHA1 Message Date
Sean Quah cf66d712c6
Fix initialization of `_device_list_id_gen` (#14914)
On startup, the `_device_list_id_gen` stream id generator is initialized
using the maximum stream id seen in a list of tables. When we started
populating the `device_list_remote_pending` table in #13913, we forgot
to add it to the aforementioned list of tables, so the stream id
generator can hand out old stream ids after a restart. The end result is
that Synapse can fail to handle device list update EDUs after a restart
when a partial state join is in progress.

Add the `device_list_remote_pending` table to the list of tables to
consider when initializing the `_device_list_id_gen` stream id generator.

Signed-off-by: Sean Quah <seanq@matrix.org>
2023-01-26 10:38:49 +00:00
Erik Johnston 65d0386693
Always notify replication when a stream advances (#14877)
This ensures that all other workers are told about stream updates in a timely manner, without having to remember to manually poke replication.
2023-01-20 18:02:18 +00:00
Erik Johnston 2b084c5b71
Merge device list replication streams (#14833) 2023-01-17 09:29:58 +00:00
reivilibre ba4ea7d13f
Batch up replication requests to request the resyncing of remote users's devices. (#14716) 2023-01-10 11:17:59 +00:00
Nick Mills-Barrett db1cfe9c80
Update all stream IDs after processing replication rows (#14723)
This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.

This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.

See this comment/issue for further discussion:
	https://github.com/matrix-org/synapse/issues/14158#issuecomment-1344048703
2023-01-04 11:49:26 +00:00
reivilibre 74b89c2761
Revert the deletion of stale devices due to performance issues. (#14662) 2022-12-12 13:55:23 +00:00
Erik Johnston 94bc21e69f
Limit the number of devices we delete at once (#14649) 2022-12-09 13:31:32 +00:00
Erik Johnston c2de2ca630
Delete stale non-e2e devices for users, take 2 (#14595)
This should help reduce the number of devices e.g. simple bots the repeatedly login rack up.

We only delete non-e2e devices as they should be safe to delete, whereas if we delete e2e devices for a user we may accidentally break their ability to receive e2e keys for a message.
2022-12-09 09:37:07 +00:00
Erik Johnston cee9445884
Better return type for `get_all_entities_changed` (#14604)
Help callers from using the return value incorrectly by ensuring
that callers explicitly check if there was a cache hit or not.
2022-12-05 15:19:14 -05:00
Patrick Cloke fac8a38525
Properly handle unknown results for the stream change cache. (#14592)
StreamChangeCache.get_all_changed_entities can return None to signify
it does not have information at the given stream position. Two callers (related
to device lists and presence) were treating this response the same as an empty
list (i.e. there being no updates).
2022-12-02 10:28:41 -05:00
David Robertson c29e2c6306
Revert "POC delete stale non-e2e devices for users (#14038)" (#14582) 2022-11-29 17:48:48 +00:00
David Robertson e860316818
Fix `UndefinedColumn: column "key_json" does not exist` errors when handling users with more than 50 non-E2E devices (#14580) 2022-11-29 13:05:07 +00:00
Erik Johnston c7e29ca277
POC delete stale non-e2e devices for users (#14038)
This should help reduce the number of devices e.g. simple bots the repeatedly login rack up.

We only delete non-e2e devices as they should be safe to delete, whereas if we delete e2e devices for a user we may accidentally break their ability to receive e2e keys for a message.

Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
2022-11-29 10:36:41 +00:00
Sean Quah f792dd74e1
Remove option to skip locking of tables during emulated upserts (#14469)
To perform an emulated upsert into a table safely, we must either:
 * lock the table,
 * be the only writer upserting into the table
 * or rely on another unique index being present.

When the 2nd or 3rd cases were applicable, we previously avoided locking
the table as an optimization. However, as seen in #14406, it is easy to
slip up when adding new schema deltas and corrupt the database.

The only time we lock when performing emulated upserts is while waiting
for background updates on postgres. On sqlite, we do no locking at all.

Let's remove the option to skip locking tables, so that we don't shoot
ourselves in the foot again.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-11-28 13:42:06 +00:00
Erik Johnston f38d7d79c8
Add another index to `device_lists_changes_in_room` (#14534)
This helps avoid reading unnecessarily large amounts of data from the
table when querying with a set of room IDs.
2022-11-23 14:09:00 +00:00
Sean Quah 9cae44f49e
Track unconverted device list outbound pokes using a position instead (#14516)
When a local device list change is added to
`device_lists_changes_in_room`, the `converted_to_destinations` flag is
set to `FALSE` and the `_handle_new_device_update_async` background
process is started. This background process looks for unconverted rows
in `device_lists_changes_in_room`, copies them to
`device_lists_outbound_pokes` and updates the flag.

To update the `converted_to_destinations` flag, the database performs a
`DELETE` and `INSERT` internally, which fragments the table. To avoid
this, track unconverted rows using a `(stream ID, room ID)` position
instead of the flag.

From now on, the `converted_to_destinations` column indicates rows that
need converting to outbound pokes, but does not indicate whether the
conversion has already taken place.

Closes #14037.

Signed-off-by: Sean Quah <seanq@matrix.org>
2022-11-22 16:46:52 +00:00
David Robertson 115f0eb233
Reintroduce #14376, with bugfix for monoliths (#14468)
* Add tests for StreamIdGenerator

* Drive-by: annotate all defs

* Revert "Revert "Remove slaved id tracker (#14376)" (#14463)"

This reverts commit d63814fd73, which in
turn reverted 36097e88c4. This restores
the latter.

* Fix StreamIdGenerator not handling unpersisted IDs

Spotted by @erikjohnston.

Closes #14456.

* Changelog

Co-authored-by: Nick Mills-Barrett <nick@fizzadar.com>
Co-authored-by: Erik Johnston <erik@matrix.org>
2022-11-16 22:16:46 +00:00
Patrick Cloke d8cc86eff4
Remove redundant types from comments. (#14412)
Remove type hints from comments which have been added
as Python type hints. This helps avoid drift between comments
and reality, as well as removing redundant information.

Also adds some missing type hints which were simple to fill in.
2022-11-16 15:25:24 +00:00
Erik Johnston d63814fd73
Revert "Remove slaved id tracker (#14376)" (#14463)
This reverts commit 36097e88c4.
2022-11-16 13:50:07 +00:00
Nick Mills-Barrett 36097e88c4
Remove slaved id tracker (#14376)
This matches the multi instance writer ID generator class which can
both handle advancing the current token over replication and by calling
the database.
2022-11-14 17:31:36 +00:00
Nick Mills-Barrett 3a4f80f8c6
Merge/remove `Slaved*` stores into `WorkerStores` (#14375) 2022-11-11 10:51:49 +00:00
Richard van der Hoff 1469fed0e3
Add debugging to help diagnose lost device-list-update (#14268) 2022-10-24 10:45:10 +01:00
Aaron Raimist 2a76a7369f
Fix hiding devices names over federation (#10015)
And don't include blank opentracing stuff in device list updates.

Signed-off-by: Aaron Raimist <aaron@raim.ist>
2022-10-18 20:54:27 +00:00
Erik Johnston 5f659d4a88
Handle local device list updates during partial join (#13934) 2022-09-28 23:22:35 +01:00
Erik Johnston 4b17a5ace8
Handle remote device list updates during partial join (#13913)
c.f. #12993 (comment), point 3

This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state.

Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.
2022-09-28 13:42:43 +00:00
Erik Johnston e8318a4333
Handle the case of remote users leaving a partial join room for device lists (#13885) 2022-09-27 13:01:08 +01:00
reivilibre d3d9ca156e
Cancel the processing of key query requests when they time out. (#13680) 2022-09-07 12:03:32 +01:00
Patrick Cloke 50122754c8
Add missing types to opentracing. (#13345)
After this change `synapse.logging` is fully typed.
2022-07-21 12:01:52 +00:00
Patrick Cloke a6895dd576
Add type annotations to `trace` decorator. (#13328)
Functions that are decorated with `trace` are now properly typed
and the type hints for them are fixed.
2022-07-19 14:14:30 -04:00
reivilibre b26cbe3d45
Fix type error that made its way onto develop (#13098)
* Fix type error introduced accidentally by #13045

* Newsfile

Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
2022-06-17 13:05:27 +01:00
Erik Johnston 5099b5ecc7
Use new `device_list_changes_in_room` table when getting device list changes (#13045) 2022-06-17 11:42:03 +01:00
David Robertson 97e9fbe1b2
Type annotations in `synapse.databases.main.devices` (#13025)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-06-15 15:20:04 +00:00
Patrick Cloke 53b77b203a
Replace noop background updates with DELETE. (#12954)
Removes the `register_noop_background_update` and deletes the background
updates directly in a delta file.
2022-06-13 14:06:27 -04:00
Patrick Cloke 9dc3293e0b
Consolidate the logic of delete_device/delete_devices. (#12970)
By always using delete_devices and sometimes passing a list
with a single device ID.

Previously these methods had gotten out of sync with each
other and it seems there's little benefit to the single-device
variant.
2022-06-07 07:43:35 -04:00
Brendan Abolivier 28989cb301
Add a background job to automatically delete stale devices (#12855)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-05-27 17:47:32 +02:00
Patrick Cloke c52abc1cfd
Additional constants for EDU types. (#12884)
Instead of hard-coding strings in many places.
2022-05-27 07:14:36 -04:00
Dirk Klimpel b76f1a4d5f
Add some type hints to datastore (#12485) 2022-04-27 13:05:00 +01:00
Erik Johnston f59e3f4c90
Mark remote device list updates as already handled (#12557) 2022-04-26 17:07:21 +01:00
Erik Johnston c48ab3734e
Fix sending opentracing contexts to remote servers (#12555) 2022-04-26 14:48:16 +00:00
Erik Johnston 0b014eb25e
Only send out device list updates for our own users (#12465)
Broke in #12365
2022-04-14 13:05:31 +01:00
Erik Johnston aa28110264
Process device list updates asynchronously (#12365) 2022-04-12 16:50:40 +01:00
Erik Johnston 66053b6bfb
Prefill more stream change caches. (#12372) 2022-04-05 14:26:41 +01:00
Erik Johnston 5c9e39e619
Track device list updates per room. (#12321)
This is a first step in dealing with #7721.

The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
2022-04-04 15:25:20 +01:00
Andrew Morgan d8d0271977
Send device list updates to application services (MSC3202) - part 1 (#11881)
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
2022-03-30 14:39:27 +01:00
Erik Johnston 2b5643b3af
Optimise calculating device_list changes in `/sync`. (#11974)
For users with large accounts it is inefficient to calculate the set of
users they share a room with (and takes a lot of space in the cache).
Instead we can look at users whose devices have changed since the last
sync and check if they share a room with the syncing user.
2022-02-15 15:01:00 +00:00
Andrew Morgan 3655585e85
Add a docstring to `add_device_change_to_streams` and fix some nearby types (#11912) 2022-02-08 10:52:22 +00:00
David Robertson f160fe18e3
Debug for device lists updates (#11760)
Debug for #8631.

I'm having a hard time tracking down what's going wrong in that issue.
In the reported example, I could see server A sending federation traffic
to server B and all was well. Yet B reports out-of-sync device updates
from A.

I couldn't see what was _in_ the events being sent from A to B. So I
have added some crude logging to track

- when we have updates to send to a remote HS
- the edus we actually accumulate to send
- when a federation transaction includes a device list update edu
- when such an EDU is received

This is a bit of a sledgehammer.
2022-01-20 13:38:44 +00:00
Olivier Wilkinson (reivilibre) e7da1ced24 Merge branch 'release-v1.50' into develop 2022-01-14 15:25:16 +00:00
Patrick Cloke 3e0536cd2a
Replace uses of simple_insert_many with simple_insert_many_values. (#11742)
This should be (slightly) more efficient and it is simpler
to have a single method for inserting multiple values.
2022-01-13 19:44:18 -05:00
reivilibre b602ba194b
Fix a bug introduced in Synapse v1.50.0rc1 whereby outbound federation could fail because too many EDUs were produced for device updates. (#11730)
Co-authored-by: David Robertson <davidr@element.io>
2022-01-13 18:12:18 +00:00