In order to circumvent the number of duplicate foo:count metrics increasing
without bounds, it's time for a rearrangement.
The following are all deprecated, and replaced with synapse_util_metrics_block_count:
synapse_util_metrics_block_timer:count
synapse_util_metrics_block_ru_utime:count
synapse_util_metrics_block_ru_stime:count
synapse_util_metrics_block_db_txn_count:count
synapse_util_metrics_block_db_txn_duration:count
The following are all deprecated, and replaced with synapse_http_server_response_count:
synapse_http_server_requests
synapse_http_server_response_time:count
synapse_http_server_response_ru_utime:count
synapse_http_server_response_ru_stime:count
synapse_http_server_response_db_txn_count:count
synapse_http_server_response_db_txn_duration:count
The following are renamed (the old metrics are kept for now, but deprecated):
synapse_util_metrics_block_timer:total ->
synapse_util_metrics_block_time_seconds
synapse_util_metrics_block_ru_utime:total ->
synapse_util_metrics_block_ru_utime_seconds
synapse_util_metrics_block_ru_stime:total ->
synapse_util_metrics_block_ru_stime_seconds
synapse_util_metrics_block_db_txn_count:total ->
synapse_util_metrics_block_db_txn_count
synapse_util_metrics_block_db_txn_duration:total ->
synapse_util_metrics_block_db_txn_duration_seconds
synapse_http_server_response_time:total ->
synapse_http_server_response_time_seconds
synapse_http_server_response_ru_utime:total ->
synapse_http_server_response_ru_utime_seconds
synapse_http_server_response_ru_stime:total ->
synapse_http_server_response_ru_stime_seconds
synapse_http_server_response_db_txn_count:total ->
synapse_http_server_response_db_txn_count
synapse_http_server_response_db_txn_duration:total
synapse_http_server_response_db_txn_duration_seconds
Add some logging to the Limiter in a similar spirit to the Linearizer, to help
debug issues.
Also fix a logcontext leak.
Also refactor slightly to avoid throwing exceptions.
This is a bit of an experimental change at this point; the idea is to see if it
helps us track down where our stack overflows are coming from by logging the
stack when the exception was caught and turned into a Failure. (We'll also need
edf2704420).
If we deploy this, we'll be able to enable it via the log config yaml.
Occaisonally has_any_entity_changed would throw the error: "Set changed
size during iteration" when taking the max of the `sorteddict`. While
its uncertain how that happens, its quite inefficient to iterate over
the entire dict anyway so we change to using the more traditional
`bisect_*` functions.
We update the normal cache descriptors to handle caches with a single
argument specially so that the key wasn't a 1-tuple. We need to update
the cache list to be aware of this.
Most of the time was spent copying a dict to filter out sentinel values
that indicated that keys did not exist in the dict. The sentinel values
were added to ensure that we cached the non-existence of keys.
By updating DictionaryCache to keep track of which keys were known to
not exist itself we can remove a dictionary copy.
The _get_joined_users_from_context cache stores a mapping from user_id
to avatar_url and display_name. Instead of storing those in a dict,
store them in a namedtuple as that uses much less memory.
We also try converting the string to ascii to further reduce the size.
Currently the cache descriptors store deferreds rather than raw values,
this is a simple way of triggering only one database hit and sharing the
result if two callers attempt to get the same value.
However, there are a few caches that simply store a mapping from string
to string (or int). These caches can have a large number of entries,
under the assumption that each entry is small. However, the size of a
deferred (specifically the size of ObservableDeferred) is signigicantly
larger than that of the raw value, 2kb vs 32b.
This PR therefore changes the cache descriptors to store the raw values
rather than the deferreds.
As a side effect cached storage function now either return a deferred or
the actual value, as the cached list decriptor already does. This is
fine as we always end up just yield'ing on the returned value
eventually, which handles that case correctly.
The cache wrappers had a habit of leaking the logcontext into the reactor while
the lookup function was running, and then not restoring it correctly when the
lookup function had completed. It's all the fault of
`preserve_context_over_{fn,deferred}` which are basically a bit broken.
Due to a failure to instantiate DeferredTimedOutError, time_bound_deferred
would throw a CancelledError when the deferred timed out, which was rather
confusing.
The `@cached` decorator on `KeyStore._get_server_verify_key` was missing
its `num_args` parameter, which meant that it was returning the wrong key for
any server which had more than one recorded key.
By way of a fix, change the default for `num_args` to be *all* arguments. To
implement that, factor out a common base class for `CacheDescriptor` and `CacheListDescriptor`.
Fix a bug in ``logcontext.preserve_fn`` which made it leak context into the
reactor, and add a test for it.
Also, get rid of ``logcontext.reset_context_after_deferred``, which tried to do
the same thing but had its own, different, set of bugs.