Class MimeTypeNormalization
- java.lang.Object
- com.digitalpebble.stormcrawler.util.AbstractConfigurable
- com.digitalpebble.stormcrawler.parse.ParseFilter
- com.digitalpebble.stormcrawler.parse.filter.MimeTypeNormalization
All Implemented Interfaces: Configurable
public class MimeTypeNormalization extends ParseFilter
Normalises the MIME type value, e.g. text/html; charset=UTF-8 => HTML, application/pdf => PDF, and creates a new entry with the key 'format' in the metadata. Requires the JSoupParserBolt to be used with the configuration detect.mimetype set to true.
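The normalisation described above can be illustrated with a minimal, self-contained sketch. It is not the StormCrawler implementation: the class MimeTypeNormalizationSketch, the helpers normalise and addFormat, the "unknown" fallback label and the use of a plain Map in place of the crawler's metadata are all assumptions made for illustration. Only the two mappings and the 'format' key come from the description.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in, not the StormCrawler class.
class MimeTypeNormalizationSketch {

    /** Maps a raw Content-Type value to a short format label. */
    static String normalise(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return "unknown"; // assumption: fallback label for missing values
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentType.indexOf(';');
        String mime = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        if (mime.equals("text/html")) {
            return "HTML";
        }
        if (mime.equals("application/pdf")) {
            return "PDF";
        }
        return "unknown"; // assumption: anything else is left unclassified
    }

    /** Stores the normalised label under the key 'format' (Map stands in for the metadata). */
    static void addFormat(Map<String, String> metadata, String contentType) {
        metadata.put("format", normalise(contentType));
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        addFormat(metadata, "text/html; charset=UTF-8");
        System.out.println(metadata.get("format")); // prints "HTML"
        addFormat(metadata, "application/pdf");
        System.out.println(metadata.get("format")); // prints "PDF"
    }
}
```

In the real filter the content type would presumably be read from the metadata populated by JSoupParserBolt rather than passed in directly, which is why MIME type detection has to be enabled.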
Constructor Summary
Constructors:
MimeTypeNormalization()
Method Summary
Modifier and Type: void
Method: filter(String url, byte[] content, DocumentFragment doc, ParseResult parse)
Description: Called when parsing a specific page.
Methods inherited from class com.digitalpebble.stormcrawler.parse.ParseFilter
needsDOM
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface com.digitalpebble.stormcrawler.util.Configurable
configure
Method Detail
filter
public void filter(String url, byte[] content, DocumentFragment doc, ParseResult parse)
Description copied from class: ParseFilter
Called when parsing a specific page.
Specified by: filter in class ParseFilter
Parameters:
url - the URL of the page being parsed
content - the content being parsed
doc - the DOM tree resulting from the parsing of the content, or null if ParseFilter.needsDOM() returns false
parse - the metadata to be updated with the result of the parsing