Class MimeTypeNormalization

  • All Implemented Interfaces:
    Configurable

    public class MimeTypeNormalization
    extends ParseFilter
    Normalises the MIME type value, e.g. text/html; charset=UTF-8 => HTML and application/pdf => PDF, and stores the result in the metadata under a new key 'format'. Requires the JSoupParserBolt to be used with the configuration detect.mimetype set to true.
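    The mapping described above can be illustrated with a minimal, standalone sketch. It is not the class's actual implementation: the class name MimeFormatSketch, the helper normalize, the "UNKNOWN" fallback, and the exact set of recognised types are all assumptions made for illustration. Only the two example mappings (text/html; charset=UTF-8 => HTML, application/pdf => PDF) come from the description.

```java
import java.util.Locale;

public class MimeFormatSketch {

    // Hypothetical helper, not part of the documented API: maps a raw
    // Content-Type value to a short format label.
    static String normalize(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return "UNKNOWN"; // assumed fallback label
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        int semi = contentType.indexOf(';');
        String mime = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        switch (mime) {
            case "text/html":
            case "application/xhtml+xml":
                return "HTML";
            case "application/pdf":
                return "PDF";
            default:
                return "UNKNOWN";
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("text/html; charset=UTF-8")); // prints HTML
        System.out.println(normalize("application/pdf"));          // prints PDF
    }
}
```

    In the real filter, the resulting label would be written to the page metadata under the key 'format' rather than printed.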
    • Constructor Detail

      • MimeTypeNormalization

        public MimeTypeNormalization()
    • Method Detail

      • filter

        public void filter(String url,
                           byte[] content,
                           DocumentFragment doc,
                           ParseResult parse)
        Description copied from class: ParseFilter
        Called when parsing a specific page
        Specified by:
        filter in class ParseFilter
        Parameters:
        url - the URL of the page being parsed
        content - the content being parsed
        doc - the DOM tree resulting from parsing the content, or null if ParseFilter.needsDOM() returns false
        parse - the metadata to be updated with the results of the parsing