Lookyloo is a web interface that scrapes a website and then displays a tree of the domains calling each other. https://lookyloo.circl.lu/

Latest commit: e5e4e4972e by Raphaël Vinot, 2020-07-10 18:57:16 +02:00 (new: Add visualisation for embedded resources.)

Thank you very much Tech Blog @ willshouse.com for the up-to-date list of UserAgents.

What is that name?!

1. People who just come to look.
2. People who go out of their way to look at people or something often causing crowds and more disruption.
3. People who enjoy staring at watching other peoples misfortune. Oftentimes car onlookers to car accidents.
Same as Looky Lou; often spelled as Looky-loo (hyphen) or lookylou
In L.A. usually the lookyloo's cause more accidents by not paying full attention to what is ahead of them.

Source: Urban Dictionary

Screenshot

Implementation details

This code is heavily inspired by webplugin and adapted to use Flask as the backend.

The two core dependencies of this project are the following:

ETE Toolkit: A Python framework for the analysis and visualization of trees.
Splash: Lightweight, scriptable browser as a service with an HTTP API

Cookies

If you want to scrape a website as if you were logged in, you need to pass your session cookies. You can do it the following way:

Install Cookie Quick Manager
Click on the icon in the top right of your browser > Manage all cookies
Search for a domain, tick the Sub-domain box if needed
Right click on the domain you want to export > save to file > $LOOKYLOO_HOME/cookies.json

Then, restart the webserver. From then on, every cookie in that file will be available to the browser used by Splash.
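
Before restarting, it can help to confirm that the exported file is readable. The following is a minimal sketch and not part of Lookyloo: it assumes the export is a JSON array with one object per cookie and that LOOKYLOO_HOME is set in the environment. The helper name count_exported_cookies is hypothetical.

```python
import json
import os
from pathlib import Path


def count_exported_cookies(cookie_file: Path) -> int:
    """Return the number of cookie entries in an exported JSON file.

    Assumption: the export is a JSON array with one object per cookie.
    Raises ValueError if the file is not valid JSON or has another shape.
    """
    with cookie_file.open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not all(isinstance(c, dict) for c in data):
        raise ValueError(f"{cookie_file} is not a JSON array of objects")
    return len(data)


if __name__ == "__main__":
    # Falls back to the current directory when LOOKYLOO_HOME is not set.
    home = Path(os.environ.get("LOOKYLOO_HOME", "."))
    path = home / "cookies.json"
    if path.exists():
        print(f"{path}: {count_exported_cookies(path)} cookie(s)")
    else:
        print(f"{path} not found")
```

The check deliberately does not inspect individual cookie fields, since the exact key names depend on the export tool.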

Python client

You can use pylookyloo as a standalone script or as a library. More details here.

Installation

IMPORTANT: Use poetry

NOTE: Yes, it requires Python 3.7+. No, it will never support anything older.

NOTE: If you want to run a public instance, you should set only_global_lookups=True in website/web/__init__.py and bin/async_scrape.py to disallow scraping of private IPs.
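
To illustrate what restricting lookups to global addresses means, here is a small sketch using Python's standard ipaddress module. It is not Lookyloo's implementation of only_global_lookups. The helper is_global_ip is hypothetical, and it only classifies literal IP addresses (no DNS resolution is performed).

```python
import ipaddress


def is_global_ip(address: str) -> bool:
    """Return True if the literal IP address is globally routable.

    Private, loopback, and link-local addresses return False.
    Raises ValueError if the string is not a valid IP address.
    """
    return ipaddress.ip_address(address).is_global


if __name__ == "__main__":
    for candidate in ("8.8.8.8", "192.168.1.10", "127.0.0.1", "10.0.0.5"):
        verdict = "allowed" if is_global_ip(candidate) else "rejected"
        print(f"{candidate}: {verdict}")
```

A public instance would want a check of this kind applied to the address a hostname resolves to, so that a submitted URL cannot be used to reach hosts on the internal network.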

Installation of Splash

You need a running Splash instance, preferably in Docker:

sudo apt install docker.io
sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --disable-browser-caches
# On a server with a decent amount of RAM, you may want to run it this way:
# sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash -s 100 -m 50000 --disable-browser-caches
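
Once the container is running, a quick way to confirm that Splash is reachable is to query its _ping endpoint. The sketch below is not part of Lookyloo. It assumes Splash listens on 127.0.0.1:8050 as in the commands above, and splash_is_up is a hypothetical helper name.

```python
import http.client
import json
import urllib.request


def splash_is_up(base_url: str = "http://127.0.0.1:8050", timeout: float = 5.0) -> bool:
    """Return True if Splash answers its _ping endpoint with status "ok"."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a body that is not JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


if __name__ == "__main__":
    print("Splash is up" if splash_is_up() else "Splash is not reachable")
```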

Install redis

git clone https://github.com/antirez/redis.git
cd redis
git checkout 5.0
make
cd ..

Installation of Lookyloo

git clone https://github.com/Lookyloo/lookyloo.git
cd lookyloo
poetry install
echo LOOKYLOO_HOME="'`pwd`'" > .env
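
The last command writes a single line of the form LOOKYLOO_HOME='/path/to/lookyloo' into .env. As a sanity check, the value can be read back with a few lines of Python. This is only a sketch: read_lookyloo_home is a hypothetical helper, and it handles just the simple KEY='value' form produced above, not the full .env syntax.

```python
from pathlib import Path


def read_lookyloo_home(env_file: Path) -> Path:
    """Extract LOOKYLOO_HOME from a .env file written as KEY='value'."""
    for line in env_file.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "LOOKYLOO_HOME":
            # Drop the surrounding single or double quotes, if any.
            return Path(value.strip().strip("'\""))
    raise KeyError(f"LOOKYLOO_HOME is not defined in {env_file}")


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        home = read_lookyloo_home(env_path)
        print(f"LOOKYLOO_HOME={home} (exists: {home.is_dir()})")
    else:
        print(".env not found in the current directory")
```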

Run the app

poetry run start.py

Run the app in production

With a reverse proxy (Nginx)

pip install uwsgi

Config files

You have to configure the two following files:

etc/nginx/sites-available/lookyloo
etc/systemd/system/lookyloo.service

Copy them to the appropriate directories, and run the following command:

sudo ln -s /etc/nginx/sites-available/lookyloo /etc/nginx/sites-enabled

If needed, remove the default site:

sudo rm /etc/nginx/sites-enabled/default

Make sure everything is working:

sudo systemctl start lookyloo
sudo systemctl enable lookyloo
sudo nginx -t
# If the configuration test passes:
sudo service nginx restart

You can now open http://<IP-or-domain>/

Now, you should configure TLS (Let's Encrypt and so on).

Use aquarium for a reliable multi-user app

Aquarium is an HAProxy + Splash bundle that allows Lookyloo to be used by more than one user at once.

The initial version of the project was created by TeamHG-Memex, but we have a dedicated repository that fits our needs better.

Follow the documentation if you want to use it.

Run the app with a simple docker setup

Dockerfile

The repository includes a Dockerfile for building a containerized instance of the app.

Lookyloo stores the scraped data in /lookyloo/scraped. If you want to persist the scraped data between runs, it is sufficient to define a volume for this directory.

Running a complete setup with Docker Compose

Additionally, you can start a complete setup, including the necessary Docker instance of Splash, with Docker Compose and the service definition included in docker-compose.yml by running:

docker-compose up

After the build and startup are complete, Lookyloo should be available at http://localhost:5000/

If you want to persist the data between runs, uncomment the "volumes" definition in the last two lines of docker-compose.yml and set a data storage directory on your Docker host system there.