Fix#787
- Uses regex to pick up the hostnames/domains from the "Received: from" headers.
Unicode to ASCII function
- Spam messages more often than not contain junk text as unicode characters in the headers. The "from" and "subject" headers being the most common ones. Before this change the script would error on such emails or sometimes replace the unicode characters with questionmarks "?".
- Function takes argument as an input and then encodes it in ascii while ignoring any malformed data. It then returns an ASCII string without the unicode characters.
- Currently implemented for "from" and "subject" handling.
Email object no longer requires extra php libraries for install.
Tests have been expanded to improve coverage.
RTF encapsulated HTML and Plain Text will now be de-encapsulated.
The raw MSG binary will now be included in the extracted email object.
eml files downloaded from Windows Online security on some Windows 11
systems are automatically encoded in UTF with a byte order mark (BOM)
at the front of the file. This will cause the email parser to fail.
This is a somewhat isolated problem. It only will affects a small
subset of Windows users who download and re-upload eml files. But,
this small subset of users is the target user-base for the MISP
email module: low expertiese users who wish to quickly share
high-value indicators on an ad-hoc basis.
While this fix could be tacked onto the MISP email module instead of
here, I beleive that this fix is more appropriate in the PyMISP object
code. As the "email" object parser this object should be built to
parse all manner of emails that it may encounter. This includes common
malformations such as this one and, even horrors such as, the .msg
format. This commit adds a generically named "attempt_decoding"
function which can be expanded to address all manner of sins that
are encountered in the future.
instead of the mimetype. According to [MISP DataModels](https://www.misp-project.org/datamodels/#types)
```
mime-type: A media type (also MIME type and content type) is a two-part identifier for file formats and format contents transmitted on the Internet
```
more precisely defined in [RFC2045](https://tools.ietf.org/html/rfc2045) and others.
The description returned by libmagic is more useful than the generic mime-type,
but I did not find a place to put the description in the current data model.