AIL-framework/bin/Categ.py

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
The ZMQ_PubSub_Categ Module
============================

This module consumes the Redis list created by the ZMQ_PubSub_Tokenize_Q
module.

Each word file created under /files/ represents a category.
This module takes these files and compares them to
the stream of data given by the ZMQ_PubSub_Tokenize_Q module.

When the number of distinct words from a paste that match one of these word
files reaches the configured matchingThreshold, the path of the paste is
published/forwarded to the next modules.

Each category (each file) represents a dynamic channel.
This means that if you create 1000 files under /files/ you'll have 1000
channels, and every time a paste matches a category, that paste is pushed
to the channel of that category.

.. note:: The channel has the name of the file created.
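
The matching described above can be sketched roughly as follows. This is a
minimal illustration, assuming each category is a plain list of literal
words; the helper names (build_pattern, matching_categories) and the sample
word lists are hypothetical and not part of AIL:

```python
import re


def build_pattern(words):
    # Escape each word so it is matched literally, then join the words
    # into one case-insensitive alternation.
    return re.compile('|'.join(re.escape(w) for w in words), re.IGNORECASE)


def matching_categories(content, patterns, threshold=1):
    # Map each category (channel) name to its number of distinct matches,
    # keeping only the categories that reach the threshold.
    result = {}
    for name, pattern in patterns.items():
        found = set(pattern.findall(content))
        if len(found) >= threshold:
            result[name] = len(found)
    return result


# Hypothetical word lists standing in for the files under /files/.
patterns = {
    'Mail': build_pattern(['@gmail.com', '@yahoo.com']),
    'Onion': build_pattern(['.onion']),
}
hits = matching_categories('contact: bob@gmail.com, alice@yahoo.com', patterns)
print(hits)
```

Each key of the returned dictionary plays the role of a channel name, so a
paste matching several categories would be forwarded once per category.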

New modules can start here: create your own category file,
then create your own module to process the pastes matching this
category.

.. note:: Modules ZMQ_Something_Q and ZMQ_Something are closely bound; always
   use the same Subscriber name in both of them.

Requirements
------------

* Running Redis instances (Redis).
* Category word files created in /files/.
* The ZMQ_PubSub_Tokenize_Q module running.

"""
import os
import argparse
import time
import re
from pubsublogger import publisher
from packages import Paste

from Helper import Process

if __name__ == "__main__":
    publisher.port = 6380
    publisher.channel = "Script"

    config_section = 'Categ'

    p = Process(config_section)
    matchingThreshold = p.config.getint("Categ", "matchingThreshold")

    # SCRIPT PARSER #
    parser = argparse.ArgumentParser(description='Start Categ module on files.')

    parser.add_argument(
        '-d', type=str, default="../files/",
        help='Path to the directory containing the category files.',
        action='store')

    args = parser.parse_args()

    # FUNCTIONS #
    publisher.info("Script Categ started")

    categories = ['CreditCards', 'Mail', 'Onion', 'Web', 'Credential', 'Cve', 'ApiKey']
    tmp_dict = {}
    for filename in categories:
        bname = os.path.basename(filename)
        with open(os.path.join(args.d, filename), 'r') as f:
            # One literal word per line; skip blank lines, which would
            # otherwise add an empty alternative matching at every position.
            patterns = [re.escape(s.strip()) for s in f if s.strip()]
        # A single case-insensitive alternation per category.
        tmp_dict[bname] = re.compile('|'.join(patterns), re.IGNORECASE)

    while True:
        filename = p.get_from_set()
        if filename is None:
            publisher.debug("Script Categ is Idling 10s")
            print('Sleeping')
            time.sleep(10)
            continue

        paste = Paste.Paste(filename)
        content = paste.get_p_content()

        #print('-----------------------------------------------------')
        #print(filename)
        #print(content)
        #print('-----------------------------------------------------')

        for categ, pattern in tmp_dict.items():
            # Distinct matched strings for this category.
            found = set(pattern.findall(content))
            if len(found) >= matchingThreshold:
                msg = '{} {}'.format(paste.p_path, len(found))
                #msg = " ".join( [paste.p_path, bytes(len(found))] )

                print(msg, categ)
                p.populate_set_out(msg, categ)

                publisher.info(
                    'Categ;{};{};{};Detected {} as {};{}'.format(
                        paste.p_source, paste.p_date, paste.p_name,
                        len(found), categ, paste.p_path))