
![Lookyloo icon](website/web/static/lookyloo.jpeg)

*Lookyloo* is a web interface that scrapes a website and then displays a
tree of the domains calling each other.

Thank you very much [Tech Blog @ willshouse.com](https://techblog.willshouse.com/2012/01/03/most-common-user-agents/)
for the up-to-date list of UserAgents.

# What is that name?!


```
1. People who just come to look.
2. People who go out of their way to look at people or something often causing crowds and more disruption.
3. People who enjoy staring at watching other peoples misfortune. Oftentimes car onlookers to car accidents.
Same as Looky Lou; often spelled as Looky-loo (hyphen) or lookylou
In L.A. usually the lookyloo's cause more accidents by not paying full attention to what is ahead of them.
```

Source: [Urban Dictionary](https://www.urbandictionary.com/define.php?term=lookyloo)

# Screenshot

![Screenshot of Lookyloo](doc/example.png)

# Implementation details

This code is heavily inspired by [webplugin](https://github.com/etetoolkit/webplugin) and adapted to use Flask as a backend.

The two core dependencies of this project are the following:

* [ETE Toolkit](http://etetoolkit.org/): A Python framework for the analysis and visualization of trees.
* [Splash](https://splash.readthedocs.io/en/stable/): Lightweight, scriptable browser as a service with an HTTP API

# Cookies

If you want to scrape a website as if you were logged in, you need to pass your session cookies.
You can do it the following way:

1. Install [Cookie Quick Manager](https://addons.mozilla.org/en-US/firefox/addon/cookie-quick-manager/)
2. Click on the icon in the top right of your browser > Manage all cookies
3. Search for a domain, tick the Sub-domain box if needed
4. Right click on the domain you want to export > save to file > $LOOKYLOO_HOME/cookies.json

Then restart the web server. From then on, every cookie in that file is available to the browser used by Splash.
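
Before restarting, it can help to check that the exported file is where Lookyloo expects it and that it parses as JSON. The snippet below is a minimal sketch, not part of Lookyloo: `load_cookies` is a hypothetical helper, and it assumes the export is a JSON list with one object per cookie. It makes no assumption about the keys inside each object.

```python
import json
from pathlib import Path


def load_cookies(lookyloo_home):
    """Return the exported cookies, or raise ValueError if the file is unusable."""
    cookie_file = Path(lookyloo_home) / 'cookies.json'
    if not cookie_file.exists():
        raise ValueError(f'{cookie_file} does not exist')
    with cookie_file.open() as f:
        # json.JSONDecodeError is a subclass of ValueError
        cookies = json.load(f)
    # Assumption: the export is a JSON list with one object per cookie.
    if not isinstance(cookies, list):
        raise ValueError(f'{cookie_file} does not contain a JSON list')
    return cookies
```

Calling `load_cookies(os.environ['LOOKYLOO_HOME'])` from a `pipenv run python` session (after `import os`) should return the list of cookies if the export went well.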

# Python client

You can use `pylookyloo` as a standalone script or as a library; [more details here](https://github.com/CIRCL/lookyloo/tree/master/client).

# Installation

**IMPORTANT**: Use [pipenv](https://pipenv.readthedocs.io/en/latest/)

**NOTE**: Yes, it requires Python 3.6+. No, it will never support anything older.

**NOTE**: If you want to run a public instance, you should set `only_global_lookups=True`
in `website/web/__init__.py` and `bin/async_scrape.py` to disallow scraping of private IPs.
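
To illustrate the distinction this setting relies on, the sketch below uses Python's standard `ipaddress` module to tell private addresses from publicly routable ones. It is not taken from Lookyloo's code, and `is_global_address` is a hypothetical helper name used only for this example.

```python
import ipaddress


def is_global_address(ip):
    """Return True if `ip` is publicly routable, False for private,
    loopback, link-local and other reserved ranges."""
    return ipaddress.ip_address(ip).is_global


for candidate in ('8.8.8.8', '192.168.1.10', '127.0.0.1'):
    print(candidate, is_global_address(candidate))
```

On a public instance, refusing the non-global cases keeps visitors from using the scraper to reach hosts on your internal network.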

## Installation of Splash

You need a running Splash instance, preferably on [Docker](https://splash.readthedocs.io/en/stable/install.html):

```bash
sudo apt install docker.io
sudo docker pull scrapinghub/splash
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --disable-browser-caches
# On a server with a decent amount of RAM, you may want to run it this way:
# sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash -s 100 -m 50000 --disable-browser-caches
```
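
Once the container is running, you may want to confirm that Splash answers on port 8050. The following is a minimal sketch, assuming Splash listens on `127.0.0.1:8050` as in the commands above and using the `_ping` endpoint of the Splash HTTP API. `splash_is_up` is a hypothetical helper, not part of Lookyloo.

```python
import urllib.request


def splash_is_up(base_url='http://127.0.0.1:8050', timeout=5):
    """Return True if the Splash instance answers its _ping endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f'{base_url}/_ping', timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors
        # (urllib.error.URLError is a subclass of OSError).
        return False
```

If `splash_is_up()` returns `False`, check that the container is running and that port 8050 is published.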

## Installation of Redis

```bash
git clone https://github.com/antirez/redis.git
cd redis
git checkout 5.0
make
cd ..
```

## Installation of Lookyloo

```bash
git clone https://github.com/CIRCL/lookyloo.git
cd lookyloo
pipenv install
echo LOOKYLOO_HOME="'`pwd`'" > .env
```

# Run the app

```bash
pipenv run start.py
```

# Run the app in production

## With a reverse proxy (Nginx)

```bash
pip install uwsgi
```

## Config files

You have to configure the following two files:

* `etc/nginx/sites-available/lookyloo`
* `etc/systemd/system/lookyloo.service`

Copy them to the appropriate directories, and run the following command:
```bash
sudo ln -s /etc/nginx/sites-available/lookyloo /etc/nginx/sites-enabled
```

If needed, remove the default site
```bash
sudo rm /etc/nginx/sites-enabled/default
```

Make sure everything is working:

```bash
sudo systemctl start lookyloo
sudo systemctl enable lookyloo
sudo nginx -t
# If the configuration test passes:
sudo service nginx restart
```

And you can open ```http://<IP-or-domain>/```

Now, you should configure [TLS (Let's Encrypt and so on)](https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-16-04).

# Use Aquarium for a reliable multi-user app

Aquarium is an HAProxy + Splash bundle that allows Lookyloo to be used by more than one user at once.

The initial version of the project was created by [TeamHG-Memex](https://github.com/TeamHG-Memex/aquarium) but
we have a [dedicated repository](https://github.com/circl/aquarium) that fits our needs better.

Follow [the documentation](https://github.com/CIRCL/aquarium/blob/master/README.rst) if you want to use it.


# Run the app with a simple docker setup

## Dockerfile
The repository includes a [Dockerfile](Dockerfile) for building a containerized instance of the app.

Lookyloo stores the scraped data in `/lookyloo/scraped`. If you want to persist the scraped data between runs, it is sufficient to define a volume for this directory.

## Running a complete setup with Docker Compose
Additionally, you can start a complete setup, including the necessary Docker instance of Splash, with
Docker Compose and the included service definition in [docker-compose.yml](docker-compose.yml) by running:

```
docker-compose up
```

Once the build and startup are complete, Lookyloo should be available at [http://localhost:5000/](http://localhost:5000/).

If you want to persist the data between different runs, uncomment the "volumes" definition in the last two lines of
[docker-compose.yml](docker-compose.yml) and define a data storage directory on your Docker host system there.