PyCIRCLean/bin
Raphaël Vinot 3eecd9cc16 Fix winoffice file processing with olefile 2016-05-14 20:44:16 +02:00
README.md Use own version of officedissector. 2016-05-09 17:38:32 +02:00
__init__.py Do not use subprocess. 2015-10-27 10:24:45 +01:00
filecheck.py Fix winoffice file processing with olefile 2016-05-14 20:44:16 +02:00
generic.py Print FS tree for unpacked archives 2015-11-24 11:41:45 +01:00
pier9.py Code de-dupication 2015-11-05 15:34:22 +01:00
specific.py Code de-dupication 2015-11-05 15:34:22 +01:00

README.md

Requirements per script

Note: to use any of these scripts, you first need to install the following (in a virtualenv or system-wide):

    pip install git+https://github.com/ahupp/python-magic.git # we cannot use the PyPI package for now due to a bug
    python setup.py install # from the root of the repository

filecheck.py

WARNING: Only works with Python 2.7 (oletools and olefile have not been ported to Python 3 yet)

Requirements by type of document:

  • Microsoft Office: oletools, olefile
  • OOXML: officedissector
  • PDF: pdfid
  • Archives: p7zip-full, p7zip-rar

    sudo apt-get install p7zip-full p7zip-rar libxml2-dev libxslt1-dev
    pip install lxml officedissector git+https://github.com/ahupp/python-magic.git oletools olefile
    pip install git+https://github.com/Rafiot/officedissector.git
    # pdfid is not a package, installing manually
    wget https://didierstevens.com/files/software/pdfid_v0_2_1.zip
    unzip pdfid_v0_2_1.zip
    python setup.py -q install
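
After running the commands above, a quick sanity check can confirm that the Python-level dependencies are importable. The following is only a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the import names (`magic` for python-magic, `lxml`, `olefile`, `oletools`, `officedissector`) are assumptions based on how those packages are usually imported. pdfid is left out because it is installed manually and its import name depends on how you set it up.

```python
import importlib


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Assumed import names for the filecheck.py requirements listed above.
FILECHECK_MODULES = ["magic", "lxml", "olefile", "oletools", "officedissector"]

if __name__ == "__main__":
    absent = missing_modules(FILECHECK_MODULES)
    if absent:
        print("Missing Python modules: " + ", ".join(absent))
    else:
        print("All filecheck.py Python modules were found.")
```

The sketch only uses `importlib.import_module`, which is available in Python 2.7 as well, so it can be run with the same interpreter as filecheck.py.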

generic.py

Requirements by type of document:

  • Office and all text files: unoconv, libreoffice
  • PDF: ghostscript, pdf2htmlEX

    # required for pdf2htmlEX
    sudo add-apt-repository ppa:fontforge/fontforge --yes
    sudo add-apt-repository ppa:coolwanglu/pdf2htmlex --yes
    sudo apt-get update -qq
    sudo apt-get install -qq libpoppler-dev libpoppler-private-dev libspiro-dev libcairo-dev libpango1.0-dev libfreetype6-dev libltdl-dev libfontforge-dev python-imaging python-pip firefox xvfb
    # install pdf2htmlEX
    git clone https://github.com/coolwanglu/pdf2htmlEX.git
    pushd pdf2htmlEX
    cmake -DCMAKE_INSTALL_PREFIX:PATH=/usr -DENABLE_SVG=ON .
    make
    sudo make install
    popd
    # Installing the rest
    sudo apt-get install ghostscript p7zip-full p7zip-rar libreoffice unoconv
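
Once everything above is installed, it may help to verify that the external tools are reachable on the `PATH`. The following is a minimal sketch under stated assumptions, not something shipped with the repository: the helper name `missing_commands` is hypothetical, the executable names (`gs` for ghostscript, `7z` for p7zip-full) are the usual ones on Debian/Ubuntu and may differ on your system, and `shutil.which` requires Python 3.3 or later.

```python
import shutil


def missing_commands(commands):
    """Return the commands from `commands` that are not found on the PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


# Assumed executable names for the generic.py requirements listed above.
GENERIC_COMMANDS = ["unoconv", "libreoffice", "gs", "pdf2htmlEX", "7z"]

if __name__ == "__main__":
    absent = missing_commands(GENERIC_COMMANDS)
    if absent:
        print("Missing external tools: " + ", ".join(absent))
    else:
        print("All generic.py external tools were found.")
```

This only checks that each executable can be located; it does not check versions or that the tools actually work.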

pier9.py

No external dependencies required.

specific.py

No external dependencies required.