mirror of https://github.com/CIRCL/lookyloo
fix: Quit BG indexer when shutdown is requested. Improve exceptions handling in archiver
parent
11a3b6b2f9
commit
efe2124753
68
README.md
@@ -65,6 +65,74 @@ pip install pylookyloo

For more details on `pylookyloo`, read the overview [docs](https://www.lookyloo.eu/docs/main/pylookyloo-overview.html), the [documentation](https://pylookyloo.readthedocs.io/en/latest/) of the module itself, or the code in this [GitHub repository](https://github.com/Lookyloo/PyLookyloo).

# Notes regarding using S3FS for storage

## Directory listing

TL;DR: it is slow.

If you have many captures (say more than 1000/day) and store them in an S3 bucket mounted with s3fs-fuse,
doing a directory listing in bash (`ls`) will most probably lock the I/O for every process
trying to access any file in the whole bucket. The same is true if you access the
filesystem using Python methods (`iterdir`, `scandir`...).

A workaround is to use the Python s3fs module, as it does not go through the mounted filesystem to list directories.
You can configure the s3fs credentials in `config/generic.json`, under the `s3fs` key.
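
As a rough illustration of the workaround above, the sketch below lists one day's directory through the Python `s3fs` module instead of the FUSE mount. It reads the same `get_config('generic', 's3fs')` entry used in the Python snippet of the Versioning notes. The helper names (`day_prefix`, `list_day`, `make_client`) and the zero-padded `Year/Month/Day` layout inside the bucket are assumptions made for this example, not something Lookyloo provides.

```python
from datetime import date


def day_prefix(bucket: str, day: date) -> str:
    # Assumed layout inside the bucket: <bucket>/Year/Month/Day (zero-padded).
    return f'{bucket}/{day.year}/{day.month:02}/{day.day:02}'


def list_day(client, bucket: str, day: date) -> list[str]:
    # `client` is expected to be an s3fs.S3FileSystem (or anything exposing ls()).
    # detail=False makes ls() return plain path strings instead of dicts.
    return sorted(client.ls(day_prefix(bucket, day), detail=False))


def make_client():
    # Imported here so the helpers above stay usable without s3fs or Lookyloo installed.
    import s3fs
    from lookyloo.default import get_config

    cfg = get_config('generic', 's3fs')['config']
    client = s3fs.S3FileSystem(key=cfg['key'], secret=cfg['secret'],
                               endpoint_url=cfg['endpoint_url'])
    return client, cfg['bucket_name']
```

With a configured instance, `client, bucket = make_client()` followed by `list_day(client, bucket, date(2023, 8, 7))` would return the entries for that day without touching the mounted filesystem.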

## Versioning

By default, a MinIO bucket (backend for s3fs) will have versioning enabled, which means it
keeps a copy of every version of every file you're storing. It becomes a problem if you have a lot of captures,
as the index files are updated on every change and the maximum number of versions is 10,000.
So by the time you have more than 10,000 captures in a directory, you'll get I/O errors when you try
to update the index file. Lookyloo has no use for this versioning.

To check if versioning is enabled (can be either enabled or suspended):

```bash
mc version info <alias_in_config>/<bucket>
```

The command below will suspend versioning:

```bash
mc version suspend <alias_in_config>/<bucket>
```

And if you're already stuck with an index that was updated 10,000 times and you cannot do anything about it:

```bash
mc rm --non-current --versions --recursive --force <alias_in_config>/<bucket>/path/to/index
```

Error message from bash (unhelpful):

```bash
$ (git::main) rm /path/to/lookyloo/archived_captures/Year/Month/Day/index
rm: cannot remove '/path/to/lookyloo/archived_captures/Year/Month/Day/index': Input/output error
```

Python code:

```python
from lookyloo.default import get_config
import s3fs

s3fs_config = get_config('generic', 's3fs')
s3fs_client = s3fs.S3FileSystem(key=s3fs_config['config']['key'],
                                secret=s3fs_config['config']['secret'],
                                endpoint_url=s3fs_config['config']['endpoint_url'])

s3fs_bucket = s3fs_config['config']['bucket_name']
s3fs_client.rm_file(s3fs_bucket + '/Year/Month/Day/index')
```

Error from python (somewhat more helpful):
```
OSError: [Errno 5] An error occurred (MaxVersionsExceeded) when calling the DeleteObject operation: You've exceeded the limit on the number of versions you can create on this object
```
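
If you want to tell this situation apart from other I/O failures in your own scripts, one option is to inspect the `OSError` raised through the s3fs module. This is a minimal sketch, assuming the exception carries errno 5 and the `MaxVersionsExceeded` code in its message exactly as in the output above. `is_max_versions_error` is a hypothetical helper, not part of Lookyloo or s3fs.

```python
import errno


def is_max_versions_error(exc: OSError) -> bool:
    # errno 5 is EIO, the same "Input/output error" that bash reports.
    # The S3 error code is only available as text inside the message.
    return exc.errno == errno.EIO and 'MaxVersionsExceeded' in str(exc)


# Example with an exception built to mirror the message shown above.
example = OSError(errno.EIO, 'An error occurred (MaxVersionsExceeded) when calling the DeleteObject operation')
print(is_max_versions_error(example))  # True
```

Matching on the message text is fragile by nature, so treat it as a diagnostic aid and not as a stable contract.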

# Contributing to Lookyloo

To learn more about contributing to Lookyloo, see our [contributor guide](https://www.lookyloo.eu/docs/main/contributing.html).

@@ -324,9 +324,13 @@ class Archiver(AbstractManager):
 try:
     new_capture_path = self.__archive_single_capture(capture_path)
     capture_breakpoint -= 1
-except OSError as e:
-    self.logger.warning(f'Unable to archive capture: {e}')
-finally:
+except OSError:
+    self.logger.exception(f'Unable to archive capture {capture_path}')
+    (capture_path / 'lock').unlink(missing_ok=True)
+except Exception:
+    self.logger.exception(f'Critical exception while archiving {capture_path}')
+    (capture_path / 'lock').unlink(missing_ok=True)
+else:
     (new_capture_path / 'lock').unlink(missing_ok=True)
 
 if archiving_done:

@@ -45,7 +45,12 @@ class BackgroundIndexer(AbstractManager):
 archive_interval = timedelta(days=get_config('generic', 'archive'))
 cut_time = (datetime.now() - archive_interval)
 for month_dir in make_dirs_list(self.lookyloo.capture_dir):
+    __counter_shutdown = 0
     for capture_time, path in sorted(get_sorted_captures_from_disk(month_dir, cut_time=cut_time, keep_more_recent=True), reverse=True):
+        __counter_shutdown += 1
+        if __counter_shutdown % 10 and self.shutdown_requested():
+            self.logger.warning('Shutdown requested, breaking.')
+            return False
         if ((path / 'tree.pickle.gz').exists() or (path / 'tree.pickle').exists()):
             # We already have a pickle file
             self.logger.debug(f'{path} has a pickle.')
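
One detail of the condition `__counter_shutdown % 10 and self.shutdown_requested()` is worth spelling out: `counter % 10` is truthy whenever the counter is not a multiple of 10, so as written the shutdown check runs on nine iterations out of ten. If the intent was to poll only on every tenth capture, the comparison would need `== 0`. The sketch below only illustrates the difference between the two conditions. `count_checks` and its parameters are hypothetical names introduced for this example, not part of Lookyloo.

```python
from typing import Callable


def count_checks(iterations: int, every_tenth: bool,
                 shutdown_requested: Callable[[], bool]) -> int:
    """Return how many times shutdown_requested() is called over the loop."""
    calls = 0

    def probe() -> bool:
        nonlocal calls
        calls += 1
        return shutdown_requested()

    counter = 0
    for _ in range(iterations):
        counter += 1
        if every_tenth:
            # Poll only when the counter is a multiple of 10.
            if counter % 10 == 0 and probe():
                break
        else:
            # Same shape as the condition in the hunk: truthy when the
            # counter is NOT a multiple of 10.
            if counter % 10 and probe():
                break
    return calls


def never() -> bool:
    return False


print(count_checks(100, every_tenth=False, shutdown_requested=never))  # 90
print(count_checks(100, every_tenth=True, shutdown_requested=never))   # 10
```

Either form still honours a shutdown request within a few iterations. The difference is only in how often the check is made.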