Lookyloo is a web interface that allows you to scrape a website and then displays a tree of the domains calling each other.
Thank you very much to Tech Blog @ willshouse.com for the up-to-date list of User Agents.
1. People who just come to look.
2. People who go out of their way to look at people or something, often causing crowds and more disruption.
3. People who enjoy staring at or watching other people's misfortune. Oftentimes car onlookers to car accidents.

Same as Looky Lou; often spelled as Looky-loo (hyphen) or lookylou. In L.A., usually the lookyloos cause more accidents by not paying full attention to what is ahead of them.
Source: Urban Dictionary
This code is very heavily inspired by webplugin and adapted to use Flask as the backend.
The two core dependencies of this project are the following:

* [Splash](https://splash.readthedocs.io/): a lightweight, scriptable browser as a service with an HTTP API, used to do the scraping.
* [ETE Toolkit](http://etetoolkit.org/): a Python framework for the analysis and visualization of trees, used to render the domain tree.
If you want to scrape a website as if you were logged in, you need to pass your session cookies. You can do it the following way: export your cookies for the relevant domain with a browser extension (for example Cookie Quick Manager) and save the resulting JSON file as `cookies.json` in the root directory of the lookyloo install.
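What that file contains depends on the exporting extension; purely as an illustration, here is a minimal sketch assuming the Cookie Quick Manager JSON export format (the key names may differ for other tools):

```bash
# Illustrative only: create cookies.json in the root of the lookyloo install.
# Key names follow the Cookie Quick Manager export; other extensions may differ.
cat > cookies.json <<'EOF'
[
  {
    "Host raw": "https://www.example.com/",
    "Name raw": "sessionid",
    "Content raw": "your-session-cookie-value"
  }
]
EOF
```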
Then, you need to restart the webserver, and from then on, every cookie you have in that file will be available to the browser used by Splash.
lookyloo ships a simple command-line client that you can install and inspect as follows:

```bash
$ pip install git+https://github.com/CIRCL/lookyloo.git/#egg=pylookyloo\&subdirectory=client
$ lookyloo --help
usage: lookyloo [-h] [--url URL] --query QUERY

Enqueue a URL on Lookyloo

optional arguments:
  -h, --help     show this help message and exit
  --url URL      URL of the instance.
  --query QUERY  URL to enqueue
```
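For example, to enqueue a URL (the instance URL below is a placeholder; point `--url` at your own instance):

```bash
lookyloo --url https://lookyloo.example.org --query https://www.example.com
```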
IMPORTANT: Use pipenv
NOTE: Yes, it requires python3.6+. No, it will never support anything older.
NOTE: If you want to run a public instance, you should set `only_global_lookups=True` in `website/web/__init__.py` and `bin/async_scrape.py` to disallow scraping of private IPs.
You need a running Splash instance, preferably in Docker:
```bash
sudo apt install docker.io
sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --disable-ui --disable-lua --disable-browser-caches
# On a server with a decent amount of RAM, you may want to run it this way:
# sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --disable-ui -s 100 --disable-lua -m 50000 --disable-browser-caches
```
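To check that Splash is up and answering, you can query its health-check endpoint (this assumes the container is reachable on localhost):

```bash
# Expect {"status": "ok"} in the reply if Splash is running.
curl http://127.0.0.1:8050/_ping
```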
Install and build Redis:

```bash
git clone https://github.com/antirez/redis.git
cd redis
git checkout 5.0
make
cd ..
```
Then install lookyloo itself:

```bash
git clone https://github.com/CIRCL/lookyloo.git
cd lookyloo
pipenv install
echo LOOKYLOO_HOME="'`pwd`'" > .env
```
Run the app:

```bash
pipenv run start.py
```
To run the app behind a reverse proxy (Nginx), first install uwsgi:

```bash
pip install uwsgi
```
You have to configure the two following files: `etc/nginx/sites-available/lookyloo` and `etc/systemd/system/lookyloo.service`.
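A sketch of the copy step, assuming the repository's `etc` tree mirrors the system layout:

```bash
# Assumption: the templates shipped in the repository's etc/ directory
# go to the matching paths under /etc on the system.
sudo cp etc/nginx/sites-available/lookyloo /etc/nginx/sites-available/lookyloo
sudo cp etc/systemd/system/lookyloo.service /etc/systemd/system/lookyloo.service
```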
Once they are in the appropriate directories, run the following command:
```bash
sudo ln -s /etc/nginx/sites-available/lookyloo /etc/nginx/sites-enabled
```
If needed, remove the default site:
```bash
sudo rm /etc/nginx/sites-enabled/default
```
Make sure everything is working:
```bash
sudo systemctl start lookyloo
sudo systemctl enable lookyloo
sudo nginx -t
# If it is cool:
sudo service nginx restart
```
And you can open `http://<IP-or-domain>/`.
Now, you should configure TLS (Let's Encrypt, and so on).
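One possible way to do that, sketched here with certbot's Nginx plugin (the package names assume a Debian/Ubuntu system, and the domain is a placeholder):

```bash
sudo apt install certbot python3-certbot-nginx
sudo certbot --nginx -d lookyloo.example.org
```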
Aquarium is an HAProxy + Splash bundle that allows lookyloo to be used by more than one user at once.
Follow the documentation if you want to use it.
The repository includes a Dockerfile for building a containerized instance of the app.
Lookyloo stores the scraped data in `/lookyloo/scraped`. If you want to persist the scraped data between runs, it is sufficient to define a volume for this directory.
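For example, a minimal sketch (the image tag, the host path, and the published port are assumptions for illustration):

```bash
# Build the image from the included Dockerfile, then mount a host directory
# over /lookyloo/scraped so the scraped data survives container restarts.
sudo docker build -t lookyloo .
sudo docker run -p 5000:5000 -v /data/lookyloo/scraped:/lookyloo/scraped lookyloo
```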
Additionally, you can start a complete setup, including the necessary Docker instance of Splash, by using Docker Compose with the included service definition in docker-compose.yml and running:
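```bash
# Builds and starts lookyloo together with its Splash instance.
docker-compose up
```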
After the build and startup are complete, lookyloo should be available at http://localhost:5000/.
If you want to persist the data between different runs, uncomment the “volumes” definition in the last two lines of docker-compose.yml and define a data storage directory on your Docker host system there.