Lookyloo is a web interface that scrapes a website and then displays a tree of the domains calling each other. https://lookyloo.circl.lu/
README.md

Lookyloo is a web interface that captures a webpage and then displays a tree of the domains that call each other.

What's in a name?!

Lookyloo ...

Same as Looky Lou; often spelled as Looky-loo (hyphen) or lookylou

1. A person who just comes to look.
2. A person who goes out of the way to look at people or something, often causing crowds and disruption.
3. A person who enjoys watching other people's misfortune. Oftentimes onlookers who stare at car accidents.

In L.A., usually the lookyloos cause more accidents by not paying full attention to what is ahead of them.

Source: Urban Dictionary

No, really, what is Lookyloo?

Lookyloo is a web interface that allows you to capture and map the journey of a website page.

Find all you need to know about Lookyloo on our documentation website.

Here's an example of a Lookyloo capture of the site github.com:

[Screenshot of Lookyloo capturing GitHub]

REST API

The API is self-documented with Swagger. You can play with it on the demo instance.

Installation

Please refer to the install guide.

Python client

pylookyloo is the recommended client to interact with a Lookyloo instance.

It is available on PyPI, so you can install it using the following command:

pip install pylookyloo

For more details on pylookyloo, read the overview docs, the documentation of the module itself, or the code in this GitHub repository.
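
As a rough illustration of what a client session can look like, the sketch below submits a URL for capture. It assumes the `Lookyloo` class exposes an `is_up` property and a `submit` method, as recent pylookyloo releases appear to. Treat those names, the `quiet` parameter, and the helper `submit_capture` (hypothetical, not part of pylookyloo) as assumptions, and check them against the module documentation.

```python
from typing import Any, Optional


def submit_capture(client: Any, url: str) -> Optional[str]:
    """Ask a Lookyloo instance to capture `url`.

    Returns the capture UUID, or None if the instance is not reachable.
    `client` is expected to behave like pylookyloo.Lookyloo (assumed API:
    an `is_up` property and a `submit(url=..., quiet=True)` method).
    """
    if not client.is_up:
        return None
    # quiet=True is assumed to return the bare UUID rather than the capture URL.
    return client.submit(url=url, quiet=True)


# Usage against a real instance (needs `pip install pylookyloo` and network access):
#   from pylookyloo import Lookyloo
#   uuid = submit_capture(Lookyloo("https://lookyloo.circl.lu"), "https://www.circl.lu")
```

Passing the client in as an argument keeps the helper independent of any particular instance URL and makes it easy to try out with a stand-in object.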

Notes regarding using S3FS for storage

Directory listing

TL;DR: it is slow.

If you have many captures (say more than 1000/day) and store them in an s3fs bucket mounted with s3fs-fuse, doing a directory listing in bash (ls) will most probably lock the I/O for every process trying to access any file in the whole bucket. The same is true if you access the filesystem using Python methods (iterdir, scandir, ...).

A workaround is to use the Python s3fs module, as it does not go through the mounted filesystem to list directories. You can configure the s3fs credentials under the s3fs key in config/generic.json.

Warning: this will not save you if you run ls on a directory that contains a lot of captures.
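
The workaround above can be sketched as follows: list one day's directory through an s3fs client instead of the FUSE mount. The helper names `day_prefix` and `list_day` are hypothetical, and the zero-padded Year/Month/Day layout is an assumption based on the index path used later in this README. `ls(path, detail=False)` is the standard fsspec listing call that s3fs provides.

```python
from datetime import date
from typing import Any, List


def day_prefix(bucket: str, day: date) -> str:
    """Build '<bucket>/Year/Month/Day'; the zero-padded layout is an assumption."""
    return f"{bucket}/{day.year}/{day.month:02d}/{day.day:02d}"


def list_day(client: Any, bucket: str, day: date) -> List[str]:
    """List one day's entries through the s3fs client instead of the FUSE mount."""
    # detail=False asks for plain paths rather than per-entry metadata dicts.
    return client.ls(day_prefix(bucket, day), detail=False)


# Usage, building the client the same way as the snippet in the I/O errors section:
#   import s3fs
#   from lookyloo.default import get_config
#   cfg = get_config('generic', 's3fs')['config']
#   client = s3fs.S3FileSystem(key=cfg['key'], secret=cfg['secret'],
#                              endpoint_url=cfg['endpoint_url'])
#   print(list_day(client, cfg['bucket_name'], date.today()))
```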

Versioning

By default, a MinIO bucket (backend for s3fs) will have versioning enabled, which means it keeps a copy of every version of every file you're storing. This becomes a problem if you have a lot of captures, as the index files are updated on every change and the maximum number of versions is 10,000. So by the time you have more than 10,000 captures in a directory, you'll get I/O errors when you try to update the index file. And you absolutely do not care about that versioning in Lookyloo.

To check if versioning is enabled (can be either enabled or suspended):

mc version info <alias_in_config>/<bucket>

The command below will suspend versioning:

mc version suspend <alias_in_config>/<bucket>

I'm stuck, my file is raising I/O errors

This happens when your index has been updated 10,000 times while versioning was enabled.

This is how to check whether you're in this situation:

  • Error message from bash (unhelpful):
$ (git::main) rm /path/to/lookyloo/archived_captures/Year/Month/Day/index
rm: cannot remove '/path/to/lookyloo/archived_captures/Year/Month/Day/index': Input/output error
  • Check with Python:
from lookyloo.default import get_config
import s3fs

# Load the s3fs section of config/generic.json
s3fs_config = get_config('generic', 's3fs')
s3fs_client = s3fs.S3FileSystem(key=s3fs_config['config']['key'],
                                secret=s3fs_config['config']['secret'],
                                endpoint_url=s3fs_config['config']['endpoint_url'])

s3fs_bucket = s3fs_config['config']['bucket_name']
# Deleting the index raises an OSError if the version limit was reached
s3fs_client.rm_file(s3fs_bucket + '/Year/Month/Day/index')
  • Error from Python (somewhat more helpful):
OSError: [Errno 5] An error occurred (MaxVersionsExceeded) when calling the DeleteObject operation: You've exceeded the limit on the number of versions you can create on this object
  • Solution: run this command to remove all older versions of the file
mc rm --non-current --versions --recursive --force <alias_in_config>/<bucket>/Year/Month/Day/index
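
The check and the fix above can be tied together in a small sketch. It recognises the error by the errno and the `MaxVersionsExceeded` marker seen in the message above, and builds the same `mc rm` command as a list of arguments. Both helper names are hypothetical, and matching on the message text is an assumption about how stable that wording is.

```python
import errno
from typing import List


def is_max_versions_error(exc: OSError) -> bool:
    """True if `exc` looks like the MaxVersionsExceeded I/O error shown above."""
    return exc.errno == errno.EIO and "MaxVersionsExceeded" in str(exc)


def cleanup_command(alias: str, bucket: str, index_path: str) -> List[str]:
    """Build the `mc rm` invocation from this README that drops non-current versions."""
    return ["mc", "rm", "--non-current", "--versions", "--recursive", "--force",
            f"{alias}/{bucket}/{index_path}"]
```

One way to use it is to wrap the `rm_file` call in `try`/`except OSError`, and when `is_max_versions_error` returns true, pass the result of `cleanup_command` to `subprocess.run(..., check=True)`.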

Contributing to Lookyloo

To learn more about contributing to Lookyloo, see our contributor guide.

Code of Conduct

At Lookyloo, we pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community. You can access our Code of Conduct here or on the Lookyloo docs site.

Support

  • To engage with the Lookyloo community contact us on Gitter.
  • Let us know how we can improve Lookyloo by opening an issue.
  • Follow us on Twitter.

Security

To report vulnerabilities, see our Security Policy.

Credits

Thank you very much Tech Blog @ willshouse.com for the up-to-date list of UserAgents.

License

See our LICENSE.