* Ensure account data stream IDs are unique.
The account data stream is shared between three tables, and the maximum
allocated ID was tracked in a dedicated table. Updating the max ID
happened outside the transaction that allocated the ID, leading to a
race where if the server was restarted then the same ID could be
allocated but the max ID failed to be updated, leading it to be reused.
The ID generators have support for tracking across multiple tables, so
we may as well use that instead of a dedicated table.
* Fix bug in account data replication stream.
If the same stream ID was used in both global and room account data then
the getting updates for the replication stream would fail due to
`heapq.merge(..)` trying to compare a `str` with a `None`. (This is
because you'd have two rows like `(534, '!room')` and `(534, None)` from
the room and global account data tables).
Fix is just to order by stream ID, since we don't rely on the ordering
beyond that. The bug where stream IDs can be reused should be fixed now,
so this case shouldn't happen going forward.
Fixes#7617
While working on https://github.com/matrix-org/synapse/issues/5665 I found myself digging into the `Ratelimiter` class and seeing that it was both:
* Rather undocumented, and
* causing a *lot* of config checks
This PR attempts to refactor and comment the `Ratelimiter` class, as well as encourage config file accesses to only be done at instantiation.
Best to be reviewed commit-by-commit.
* Expose `return_html_error`, and allow it to take a Jinja2 template instead of a raw string
* Clean up exception handling in SAML2ResponseResource
* use the existing code in `return_html_error` instead of re-implementing it
(giving it a jinja2 template rather than inventing a new form of template)
* do the exception-catching in the REST layer rather than in the handler
layer, to make sure we catch all exceptions.
It looks like `user_device_resync` was ignoring cross-signing keys from the results received from the remote server. This patch fixes this, by processing these keys using the same process `_handle_signing_key_updates` does (and effectively factor that part out of that function).
The query keeps showing up in my slow query log.
This changes the plan under the top-level Sort node from
```
WindowAgg (cost=280335.88..292963.15 rows=561212 width=80) (actual time=138.651..160.562 rows=27112 loops=1)
-> Sort (cost=280335.88..281738.91 rows=561212 width=84) (actual time=138.597..140.622 rows=27112 loops=1)
Sort Key: state_groups_state.type, state_groups_state.state_key, state_groups_state.state_group
Sort Method: quicksort Memory: 4581kB
-> Nested Loop (cost=2.83..226745.22 rows=561212 width=84) (actual time=21.548..47.657 rows=27112 loops=1)
-> HashAggregate (cost=2.27..3.28 rows=101 width=8) (actual time=21.526..21.535 rows=20 loops=1)
Group Key: state.state_group
-> CTE Scan on state (cost=0.00..2.02 rows=101 width=8) (actual time=21.280..21.493 rows=20 loops=1)
-> Index Scan using state_groups_state_type_idx on state_groups_state (cost=0.56..2189.40 rows=5557 width=84) (actual time=0.005..0.991 rows=1356 loops=20)
Index Cond: (state_group = state.state_group)
```
to
```
Nested Loop (cost=2.83..226745.22 rows=561212 width=84) (actual time=24.194..52.834 rows=27112 loops=1)
-> HashAggregate (cost=2.27..3.28 rows=101 width=8) (actual time=24.130..24.138 rows=20 loops=1)
Group Key: state.state_group
-> CTE Scan on state (cost=0.00..2.02 rows=101 width=8) (actual time=23.887..24.113 rows=20 loops=1)
-> Index Scan using state_groups_state_type_idx on state_groups_state (cost=0.56..2189.40 rows=5557 width=84) (actual time=0.016..1.159 rows=1356 loops=20)
Index Cond: (state_group = state.state_group)
```
This cuts the execution time from ~190ms to ~130ms, i.e. a reduction
of ~30%.
The full plans are visualised at https://explain.depesz.com/s/WpbT and
https://explain.depesz.com/s/KlEk
Signed-off-by: Dagfinn Ilmari Mannsåker <ilmari@ilmari.org>
Without this patch, if an error happens which isn't caught by `user_device_resync`, then `_maybe_retry_device_resync` would fail, without retrying the next users in the iteration. This patch fixes this so that it now only logs an error in this case.
==============================
Bugfixes
--------
- Fix cache config to not apply cache factor to event cache. Regression in v1.14.0rc1. ([\#7578](https://github.com/matrix-org/synapse/issues/7578))
- Fix bug where `ReplicationStreamer` was not always started when replication was enabled. Bug introduced in v1.14.0rc1. ([\#7579](https://github.com/matrix-org/synapse/issues/7579))
- Fix specifying individual cache factors for caches with special characters in their name. Regression in v1.14.0rc1. ([\#7580](https://github.com/matrix-org/synapse/issues/7580))
Improved Documentation
----------------------
- Fix the OIDC `client_auth_method` value in the sample config. ([\#7581](https://github.com/matrix-org/synapse/issues/7581))
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEdVkXOgzrGzds0jtrHgFcFF8ZFs0FAl7Oh9sACgkQHgFcFF8Z
Fs2OmQ//SlksllrJ7egG0JeprfCCz9TJt7cYJJhCprqfBVNs8lIc+UT/NX0/duGe
g2KC4kPGEyusRmGSV43Yt/m/qqhcdM4tXq4iErgHG/xOTqNN/GVvrE3RaUvo9ydS
9IdeIkDON5ylEe8sSigbBGUnpCS20Stch1z2edoUSQHNnMMSgmTNpFaZnCEXc+Us
EYv+HfCeShdLXbzMioYj6B5qNiRnG+hJmz+h40Bmp/HAuZRyp4kx5EawAgMXIDfy
DGWn3H7TksqC+7zETAlQgMwFzggwQx64Dpa9w2RFAchUCG6bi8pM2U3doVDGlu1i
4HcltfBE+fE5Sy2tT2zz8qpaGGFwAp6K/c+4h29PwVBKSHRN4nepSqNHHb3fA1Ea
GI8DWiHGyZWzlMmKI4x85lMFyAGvGJiO1Jo8icietJHZ3+7tr6SUlnIPwlLFdkUv
xCKmlrYzwDO5enzoF6HuddnMLbtE304Ckr+vj5gWpBwYf3FIXpUS101YBV8zXTsB
NJBMu2fjYEkPQ/d9YjWRZwL312Cb68Kytlp2ETiTuuyfLCA5Df2wdA6yReemyg//
ZFfN1Z/Dc6p6MVRF3sf38jcWKX3r9ErNQ0p5q//uo4JbpnMW7FLXiSv/xonj4JTv
VLSFy6F0YIIVTol4H9Fj7f3iq8/zsJe3kBE/Kycd4uhplmfvK2w=
=BVTL
-----END PGP SIGNATURE-----
Merge tag 'v1.14.0rc2' into develop
Synapse 1.14.0rc2 (2020-05-27)
==============================
Bugfixes
--------
- Fix cache config to not apply cache factor to event cache. Regression in v1.14.0rc1. ([\#7578](https://github.com/matrix-org/synapse/issues/7578))
- Fix bug where `ReplicationStreamer` was not always started when replication was enabled. Bug introduced in v1.14.0rc1. ([\#7579](https://github.com/matrix-org/synapse/issues/7579))
- Fix specifying individual cache factors for caches with special characters in their name. Regression in v1.14.0rc1. ([\#7580](https://github.com/matrix-org/synapse/issues/7580))
Improved Documentation
----------------------
- Fix the OIDC `client_auth_method` value in the sample config. ([\#7581](https://github.com/matrix-org/synapse/issues/7581))
'client_auth_method' commented out value was erronously 'client_auth_basic',
when code and docstring says it should be 'client_secret_basic'.
Signed-off-by: Jason Robinson <jasonr@matrix.org>
The bg update never managed to complete, because it kept being interrupted by
transactions which want to take a lock.
Just doing it in the foreground isn't that bad, and is a good deal simpler.
A couple of changes of significance:
* remove the `_last_ack < federation_position` condition, so that
updates will still be correctly processed after restart
* Correctly wire up send_federation_ack to the right class.
we can use `make_in_list_sql_clause` rather than doing our own half-baked
equivalent, which has the benefit of working just fine with empty lists.
(This has quite a lot of tests, so I think it's pretty safe)
The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room).
Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on.
People probably want to look at this commit by commit.
Instead of doing a complicated dance of deleting and moving aliases one
by one, which sends a canonical alias update into the old room for each
one, lets do it all in one go.
This also changes the function to move *all* local alias events to the new
room, however that happens later on anyway.
PyPy's gc.get_stats() returns an object containing detailed allocator statistics
which could be beneficial to collect as metrics.
Signed-off-by: Ivan Shapovalov <intelfx@intelfx.name>
When a call to `user_device_resync` fails, we don't currently mark the remote user's device list as out of sync, nor do we retry to sync it.
https://github.com/matrix-org/synapse/pull/6776 introduced some code infrastructure to mark device lists as stale/out of sync.
This commit uses that code infrastructure to mark device lists as out of sync if processing an incoming device list update makes the device handler realise that the device list is out of sync, but we can't resync right now.
It also adds a looping call to retry all failed resync every 30s. This shouldn't cause too much spam in the logs as this commit also removes the "Failed to handle device list update for..." warning logs when catching `NotRetryingDestination`.
Fixes#7418
We don't really make any promises about returning accurate presence data when
presence is disabled, so we may as well just return a static response, rather
than making the master handle a request.
`_is_server_still_joined` will throw if it is given state updates with non-user ID state keys with local user leaves. This is actually rarely a problem since local leaves almost always get persisted by themselves.
(I discovered this on a branch that was otherwise broken, so I haven't seen this in the wild)
Bugfixes:
- Hash passwords as early as possible during registration. #7523
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEF3tZXk38tRDFVnUIM/xY9qcRMEgFAl7CpGYACgkQM/xY9qcR
MEhVixAAk2hDWVXxbGzUk2LmfiIsFA2eV55sw+VqEw0eRfe1d/mP6aH75VmTt3pw
IymZUVxDXdbTnPNPw+ldyGhzu9C6JJjXnNRBZnIkR5vcSbWsV0mPl/qHFu/4FnZI
m4Nj1Sx3sG0CyNDpWjVrzTW6SbDX9J68DXbLwnNTSX3KPa7gNn6TUmFfKzlrNI23
pPmD+EITYMn/H9HOhxhTzq//Ja7UOViAKQ0q4N2I4GxmLP6ufx9P3s5FG/oJqA+H
Pka2+9JnfHq2Ze22CoDcg8q5f5MgVkQzGeir0ZsGJwJqOYjeTmbCvD3T/RYWO5g+
ZghON3tsMQmdzUQqGRxcn/YLOZY9ZqrX2kBf5E6Wapwj9MfKg2ToLZM4yrWN0+RX
KDuWaKXYtkSQCo1nDS2KooVMWjGNZautWWnHzZ0KNQCIkxVpGC234JYI685grKXb
dg7R41kdXI7NJzqS4iM1fxXoLx64fpoREa/pbLF6VeLaYXBlzMjfhiIx2pQBN3L/
q/y3ftev9VCp+2wPxiKUkiC4Sh7dgWUzNuyHU+4lsPUbI1H/MN5dN2ryObdEGWc/
5YU3tv2MTQJ7jECHR+/fastnG+5d2kVm/FK+zVhG17JvA2VmDaLnSde+mzGbsO1N
gIUx5VrTEP7y0tC8C/VgbS3c2KqCSOopqd3j2slLLrtQlXM71VE=
=lpDI
-----END PGP SIGNATURE-----
Merge tag 'v1.13.0rc3' into develop
Synapse 1.13.0rc3 (2020-05-18)
Bugfixes:
- Hash passwords as early as possible during registration. #7523
Make sure that the AccountDataStream presents complete updates, in the right
order.
This is much the same fix as #7337 and #7358, but applied to a different stream.
This is required as both event persistence and the background update needs access to this function. It should be perfectly safe for two workers to write to that table at the same time.
This allows us to have the logic on both master and workers, which is necessary to move event persistence off master.
We also combine the instantiation of ID generators from DataStore and slave stores to the base worker stores. This allows us to select which process writes events independently of the master/worker splits.