286 Commits (0718c6cb39096845671f4501d553c97e52a4ff0a)

Author SHA1 Message Date
Raphaël Vinot 14df52a623 chg: Many improvements in archiver 2023-08-05 13:36:56 +02:00
Raphaël Vinot e9dad5de61 chg: Attempt to reduce disk use 2023-08-04 15:03:58 +02:00
Raphaël Vinot c203aa91b9 chg: Avoid directory listing as much as possible in archiver, allow shutdown 2023-08-04 14:02:45 +02:00
Raphaël Vinot 5fca6b13ea chg: Show stacktrace when we cannot build the pickle 2023-08-04 13:15:39 +02:00
Raphaël Vinot 959b7ca96d fix: use glob with path instead of rglob (faster) 2023-08-04 13:15:03 +02:00
Raphaël Vinot 4be8186cc6 chg: Improve readability of the background indexer 2023-07-30 16:59:41 +02:00
Raphaël Vinot ea2ded9beb fix: properly handle missing title in cache 2023-07-27 15:21:06 +02:00
Raphaël Vinot ebfc2f00a5 fix: Exception when a formerly broken capture is re-processed and works 2023-07-27 14:56:39 +02:00
Raphaël Vinot 855485984f fix: handle gracefully empty lists in hset, and duplicates UUIDs 2023-07-26 22:16:00 +02:00
Raphaël Vinot fd9325bb0d chg: Improve logging, add lock on indexer. 2023-07-26 12:37:12 +02:00
Raphaël Vinot f60457a484 fix: Put the max captures counter at the right place... 2023-07-26 11:45:22 +02:00
Raphaël Vinot fc5850e147 chg: Avoid building old pickles forever 2023-07-26 11:38:40 +02:00
Raphaël Vinot a18f8f9675 chg: do not discard capture without HAR files
They are often just captures with an error file.
2023-07-25 20:29:30 +02:00
Raphaël Vinot ef3432cbed fix: Few more improvements on lockfile and broken captures. 2023-07-25 20:16:48 +02:00
Raphaël Vinot 484aec5ddd fix: Properly handle lock file. 2023-07-25 19:29:53 +02:00
Raphaël Vinot 345a2f3f45 fix: Import method from the right file 2023-07-25 17:16:59 +02:00
Raphaël Vinot 3c50474ce4 fix: check if a tree.pickle.gz exists in the background indexer 2023-07-25 17:13:28 +02:00
Raphaël Vinot 0c7b3d9106 fix: indexer getting stuck when we had more than one at a time 2023-07-25 17:08:00 +02:00
Raphaël Vinot 177474e874 new: Basic support for HHHash 2023-07-21 15:48:20 +02:00
Raphaël Vinot fec61d42ee fix: Re-submit captures cleaned up too early in lacus 2023-06-27 11:33:56 +02:00
Raphaël Vinot 582b5956e9 new: Store capture settings, use TypedDict whenever possible. 2023-05-15 16:08:19 +02:00
Raphaël Vinot 6a9bcc0050 new: Automatic reporting via API
Related to #678
2023-04-28 17:19:53 +02:00
Raphaël Vinot 4ceae60db7 chg: Avoid stopping the captures before they're done 2023-04-09 13:58:34 +02:00
Raphaël Vinot 2ceda75eab chg: Fairly big refactoring/cleanup to support LacusCore 1.4.0 2023-04-08 13:49:18 +02:00
Raphaël Vinot 9995371916 chg: Normalize logging on the config file settings 2023-04-05 16:23:46 +02:00
Raphaël Vinot e410b7631e fix: no decoding in archiver, catch exception when requesting hashes on broken capture 2023-03-16 14:47:24 +01:00
Raphaël Vinot 9497060028 fix: Cleanup prints, improve archiver. 2023-03-16 12:28:28 +01:00
Raphaël Vinot 96f1b2bd53 fix: Avoid exception if microsec is missing. 2023-03-12 19:25:16 +01:00
Raphaël Vinot 36d39f6076 new: Add PID in lock file, allows to check if the locking process is still there 2023-02-26 17:20:17 +01:00
Raphaël Vinot f0615fc54f chg: Bump deps, prepare for v1.17.0 release 2022-12-28 16:25:44 +01:00
Raphaël Vinot b7302d09b5 fix: pass the browser to the browser key. 2022-12-02 14:26:08 +01:00
Raphaël Vinot 00370291ac new: Logging config in file 2022-11-23 15:54:22 +01:00
Raphaël Vinot 3c1cbd6ece new: Very basic page to submit an existing capture via a HAR file 2022-11-19 01:32:17 +01:00
Raphaël Vinot 9677c4d120 new: Support lacus unreachable by caching locally
+ initialize lacus globally for consistency.
2022-11-01 18:10:25 +01:00
Raphaël Vinot a48c6e0bd6 new: SIGTERM handling (PyLacus and LacusCore) 2022-10-28 12:40:28 +02:00
Raphaël Vinot e3075060cd chg: Properly type the response from LacusCore/PyLacus 2022-10-26 14:25:23 +02:00
Raphaël Vinot 93c3ea8d39 fix: Catch exception when Lacus is unreachable 2022-09-29 15:42:05 +02:00
Raphaël Vinot a27683f090 fix: Match compressed HAR as valid for rebuild 2022-09-28 11:23:44 +02:00
Raphaël Vinot d500872943 fix: Wait for capture to be done before processing it 2022-09-27 15:44:17 +02:00
Raphaël Vinot 5cd8169735 chg: Avoid captures without url(s) or document 2022-09-27 11:33:36 +02:00
Raphaël Vinot f886b8676b fix: More exceptions catching for the new caching method 2022-09-26 20:55:16 +02:00
Raphaël Vinot edd8d786d3 chg: Do not try to build a tree if there are no HAR files 2022-09-26 15:59:04 +02:00
Raphaël Vinot 31261e84c2 fix: Better handling of half broken captures without HAR files 2022-09-26 14:58:30 +02:00
Raphaël Vinot 52b68fccdc fix: make mypy happy, simplify code 2022-09-23 21:45:50 +02:00
Raphaël Vinot 2ec55be573 fix: Properly unset async capture when the queue is empty 2022-09-23 21:40:56 +02:00
Raphaël Vinot 354841b005 chg: Improve status reporting when a capture is ongoing 2022-09-23 21:33:38 +02:00
Raphaël Vinot da33a7f5b3 chg: Avoid stacktrace when trying to generate broken capture 2022-09-23 14:46:19 +02:00
Raphaël Vinot 18b0b6e3cd chg: Improve logging for archiver. 2022-09-23 14:32:42 +02:00
Raphaël Vinot c7ca251e7a chg: make to_capture key a ranked set again 2022-09-23 14:25:01 +02:00
Raphaël Vinot 19799c19af chg: Re-enable start script 2022-09-23 13:13:09 +02:00
Raphaël Vinot da502ee3d6 chg: Implement support for LacusCore *or* PyLacus 2022-09-23 13:13:09 +02:00
Raphaël Vinot d38b612c37 chg: Bump lacuscore 2022-09-23 13:13:09 +02:00
Raphaël Vinot 9189888a0d chg: Properly handle missing HAR 2022-09-23 13:13:09 +02:00
Raphaël Vinot 623813167e chg: Add missing bits 2022-09-23 13:13:09 +02:00
Raphaël Vinot 318f554db3 chg: move to lacus, WiP 2022-09-23 13:13:09 +02:00
Raphaël Vinot 812c63b0f2 fix: error in UAs, typing 2022-09-05 18:58:45 +02:00
Raphaël Vinot c6464936fc chg: Bump to poetry v1.2, remove dep on setuptools 2022-08-31 16:33:13 +02:00
Raphaël Vinot f232eba662 chg: Improve UA rendering 2022-08-23 17:44:48 +02:00
Raphaël Vinot ebbe6e3ce9 new: Pick mobile devices on capture page 2022-08-22 17:34:00 +02:00
Raphaël Vinot 35789f549b fix: Exception on invalid capture 2022-08-20 23:33:32 +02:00
Raphaël Vinot d63ea473f5 new: Autoselect browser engine based on the UA 2022-08-19 14:26:22 +02:00
Raphaël Vinot 998ef12b06 new: Add support for playwright devices and browser name (API only) 2022-08-18 11:19:32 +02:00
Raphaël Vinot e89e9a20cb fix: Force BG processor to index all the recent captures 2022-08-12 01:08:28 +02:00
Raphaël Vinot be2e1ddc33 fix: properly handle listing configuration, clear None from queries before passing to redis 2022-08-10 18:53:14 +02:00
Raphaël Vinot 49f335405e fix: Avoid exceptions on invalid requests 2022-08-05 11:28:44 +02:00
Raphaël Vinot 4280a4e11f fix: Support for document on public instances. 2022-08-04 21:28:47 +02:00
Raphaël Vinot 94bae7c5e3 chg: Avoid exception on broken captures 2022-08-04 21:11:58 +02:00
Raphaël Vinot 4f72d64735 new: Upload a file instead of submitting a URL. 2022-08-04 16:58:07 +02:00
Raphaël Vinot 3170038db7 new: dropdown to pass DoNotTrack HTTP header
Improvements on the capture page.
2022-08-03 12:07:45 +02:00
Raphaël Vinot bcfaaec941 chg: Improve logging in archiver 2022-07-27 14:33:28 +02:00
Raphaël Vinot c9381873c7 chg: cleanup the file download feature 2022-07-19 17:54:45 +02:00
Arhamyss 47ecc7a4fa new: Download file 2022-07-19 10:07:36 -04:00
Raphaël Vinot 5f329e4d7b new: compress HAR files in archived captures. 2022-07-12 18:44:33 +02:00
Raphaël Vinot 6ba019ec83 chg: Improve somewhat the useragents available for capturing
Fix #416
2022-06-09 18:58:17 +02:00
Raphaël Vinot 1817a3e13b chg: sunday cleanup 2022-05-23 00:15:52 +02:00
Raphaël Vinot d222ae04aa new: Keep capture even if we have a network error 2022-05-03 12:23:16 +02:00
Raphaël Vinot 463d1d2d1a new: autosubmit to FOX, bump deps 2022-05-02 13:04:55 +02:00
Raphaël Vinot ef1094a331 chg: Bump deps, fix cookie issue
Fix #404
2022-04-29 00:44:03 +02:00
Raphaël Vinot 1679ccf90f chg: Improve capture, ignore ssl issues. 2022-04-26 13:49:24 +02:00
Raphaël Vinot 77fbf47e73 fix: capture cleanup 2022-04-26 10:25:11 +02:00
Raphaël Vinot 147bc65992 fix: Mypy, docker 2022-04-26 00:59:57 +02:00
Raphaël Vinot 41c7e87458 fix: docker, improve error catching 2022-04-26 00:33:50 +02:00
Raphaël Vinot 5af278f84d fix: issue in playwrightcapture module 2022-04-25 15:20:05 +02:00
Raphaël Vinot 4ad898a375 chg: Use packaged playwright capture module 2022-04-25 13:34:01 +02:00
Raphaël Vinot c93a6c307d chg: properly set cookies 2022-04-24 20:17:54 +03:00
Raphaël Vinot 680eb1b309 fix: better handling if capture fails. 2022-04-21 15:48:28 +03:00
Raphaël Vinot 8d159ffba0 new: Switch away from splash to use playwright 2022-04-21 14:55:07 +03:00
Raphaël Vinot 83fc0bd8f4 fix: shutil.move wants str (not Path) for python<3.9 2022-04-10 12:43:56 +02:00
Kimmo Linnavuo a80b6a31e4 Use shutil.move instead of path rename when moving discarded captures 2022-04-08 15:28:06 +03:00
Raphaël Vinot cf46dde1ed chg: Add basic pre-hook config 2022-03-31 11:30:53 +02:00
Raphaël Vinot ae9cb3e81c chg: Bump deps 2022-03-29 21:13:02 +02:00
Raphaël Vinot c9307b5159 chg: Improve start/stop for DBs 2021-12-02 14:39:32 +01:00
Raphaël Vinot a55fb5380a chg: Sync stop script with template 2021-11-26 14:16:22 -05:00
Raphaël Vinot d7c9892957 fix: Wait for DBs to be down before returning in stop script 2021-11-26 13:48:46 -05:00
Raphaël Vinot daca988f3f chg: better handling of broken indexes in archiver 2021-11-26 12:36:35 -05:00
Raphaël Vinot cef1088984 chg: programmatically shutdown DBs 2021-11-26 12:35:15 -05:00
Raphaël Vinot 58b50f2b24 new: Pass optional arbitrary HTTP headers to capture 2021-11-23 12:59:56 -08:00
Raphaël Vinot bfb1e6b181 fix: Use default_public for all capture, including if submitted via the API 2021-11-02 14:58:31 -07:00
Raphaël Vinot 1f998b457f chg: use template 2021-10-18 13:06:43 +02:00
Raphaël Vinot 6e9e3990c4 fix: Indexes not updated on tree rebuild, better handling of tree cache 2021-09-24 16:16:41 +02:00
Raphaël Vinot 48fc807e7d new: Add monitoring for pickle cache status 2021-09-24 12:02:28 +02:00
Raphaël Vinot 32ee474be2 chg: Improve tree creation and cache 2021-09-22 17:09:04 +02:00
Raphaël Vinot d1f673f3a7 chg: Cleanup passing listing key to and from bool in redis 2021-09-10 14:20:58 +02:00
Raphaël Vinot 9c7929569e fix: The captures are visible on the index by default. 2021-09-08 20:43:56 +02:00
Raphaël Vinot 48b632aa1e fix: Incorrect matching for listing key in capture (always false) 2021-09-08 10:53:31 +02:00
Raphaël Vinot 902c8f81b6 chg: Improve error message if the capture fails
Fix #257
2021-09-07 18:16:01 +02:00
Raphaël Vinot dfbe40a52e chg: reorder imports 2021-09-07 16:00:07 +02:00
Raphaël Vinot c09adec333 chg: Improve logging. 2021-09-01 14:08:25 +02:00
Raphaël Vinot 797de9ddb3 fix: remove datefmt from logging.basicConfig, it was a bad idea. 2021-09-01 10:40:59 +02:00
Raphaël Vinot 2e5a5f3aff fix: unlink indexes pointing to unknown directories 2021-08-30 14:45:44 +02:00
Raphaël Vinot e56c70d1a1 chg: out of safety, do not remove a capture dir. 2021-08-30 12:54:17 +02:00
Raphaël Vinot 117500b777 chg: Make archiver an index generator 2021-08-30 12:48:13 +02:00
Raphaël Vinot 324736f62c fix: Use proper exception on redis start 2021-08-27 18:08:34 +02:00
Raphaël Vinot ae76cb77be fix: Uncomment website start 2021-08-27 17:49:27 +02:00
Raphaël Vinot 8a51383d7a chg: Move the process management methods to the proper class 2021-08-27 17:28:26 +02:00
Raphaël Vinot 85e43fc677 chg: Make the website start a normal start script 2021-08-27 16:45:16 +02:00
Raphaël Vinot d41b7735dd chg: Improve storage, support both modes. 2021-08-26 15:49:19 +02:00
Raphaël Vinot 407e78ae7f chg: More cleanup, support clean shutdown of multiple async captures 2021-08-25 16:40:51 +02:00
Raphaël Vinot bf700e7a7b chg: Major refactoring, move capture code to external script. 2021-08-25 13:36:48 +02:00
Raphaël Vinot c732e38395 chg: Add logging in BG processing 2021-08-24 18:44:00 +02:00
Raphaël Vinot 81390d5ea0 chg: cleanup in the mail lookyloo class 2021-08-24 18:32:54 +02:00
Raphaël Vinot 8433cbcc1b chg: Cleanup archiver, initialize index captures in start 2021-08-24 17:10:14 +02:00
Raphaël Vinot ece30a33eb chg: Fix typo in archiver 2021-08-23 16:56:17 +02:00
Raphaël Vinot fb1685cedc add: reset recent captures in archiving process 2021-08-23 16:19:50 +02:00
Raphaël Vinot 8f28335010 fix: properly match cut time 2021-08-23 15:51:06 +02:00
Raphaël Vinot 2c1971311a chg: Make the cut-off date for archiving the 1st of the month 2021-08-23 15:36:59 +02:00
Raphaël Vinot 5c9b88a3ca fix: Make sure all the archived UUIDs are removed 2021-08-23 15:29:21 +02:00
Raphaël Vinot 67e6571145 chg: Force init the archived indexes 2021-08-23 15:14:08 +02:00
Raphaël Vinot 53ceb9c329 chg: Cleanup when dir is moved, digit months on 2 values 2021-08-23 14:53:19 +02:00
Raphaël Vinot d359bc7521 chg: Better use of cache, sanity checks 2021-08-23 12:17:44 +02:00
Raphaël Vinot 58b837cb6c new: Archiver, refactoring. 2021-08-20 17:46:22 +02:00
Raphaël Vinot 6be9b69d95 chg: Use connection pool whenever possible 2021-08-18 18:01:04 +02:00
Raphaël Vinot 59f2a510c0 fix: properly catch broken capture, bump deps 2021-07-14 11:34:10 +02:00
Raphaël Vinot 1117ab6371 chg: add stats, avoid building big trees twice, bump deps 2021-05-26 18:25:06 -07:00
Raphaël Vinot 335ab662cf new: Auto trigger modules in the bg process 2021-05-19 15:12:35 -07:00
Raphaël Vinot f865ec912a fix: Move set/unset running to abstract
Avoid issues when a script fails unexpectedly.
2021-04-09 14:33:42 +02:00
Raphaël Vinot 7707d638cf new: Use async capture for the UI.
Add a method to make sure splash is up before trying to capture.
2021-04-08 19:15:53 +02:00
Raphaël Vinot 4847fdb670 fix: Windows path in update 2021-04-06 17:43:45 +02:00
Raphaël Vinot c38ec90bb1 fix: Make update script windows compatible 2021-04-06 17:27:59 +02:00
Raphaël Vinot fa6b4701c0 chg: update the cache at the right place. 2021-03-20 21:54:46 +01:00
Raphaël Vinot 13d34421dc chg: Improve BG indexer 2021-03-20 01:13:37 +01:00
Raphaël Vinot 648d4d5b5b chg: Add background ingester to the start script 2021-03-18 01:00:27 +01:00
Raphaël Vinot b3541e0e78 new: background indexer 2021-03-12 16:53:00 +01:00
Raphaël Vinot 6059cb5219 chg: Remove useless code 2021-03-12 16:49:04 +01:00
Raphaël Vinot 82d9cc7b2f fix: Properly rebuild indexed captures 2021-03-07 13:25:27 +01:00
Raphaël Vinot 3ec8015e14 chg: Better messages if website does not start 2021-02-21 23:40:47 +01:00
Raphaël Vinot 6149df06eb chg: Make the cache entries a dataclass
Fix #99
2021-01-14 17:12:23 +01:00
Raphaël Vinot 354f269218 new: Integrate categorization in indexing 2020-11-09 16:02:54 +01:00
Raphaël Vinot ea052c7c12 fix: Rename scrape -> capture in async 2020-11-05 14:14:33 +01:00
Raphaël Vinot 8b1e3585ea chg: Improve initial caching. 2020-10-29 23:25:20 +01:00