Class JSoupFilters
- java.lang.Object
-
- com.digitalpebble.stormcrawler.util.AbstractConfigurable
-
- com.digitalpebble.stormcrawler.parse.JSoupFilters
-
- All Implemented Interfaces:
JSONResource, JSoupFilter, Configurable
public class JSoupFilters extends AbstractConfigurable implements JSoupFilter, JSONResource
Wrapper for the JSoupFilters defined in a JSON configuration
-
-
Field Summary
Fields (Modifier and Type, Field, Description):
static JSoupFilters emptyParseFilter
-
Constructor Summary
Constructors (Constructor, Description):
JSoupFilters(Map<String,Object> stormConf, String configFile)
Loads the filters from a JSON configuration file
-
Method Summary
Methods (Modifier and Type, Method, Description):
void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filtersConf)
Called when this filter is being initialized
void filter(@NotNull String url, byte[] content, @NotNull org.jsoup.nodes.Document doc, @NotNull ParseResult parse)
Called when parsing a specific page
static JSoupFilters fromConf(Map<String,Object> stormConf)
Loads and configures the JSoupFilters based on the Storm config if there is one; otherwise returns an empty JSoupFilters instance.
String getResourceFile()
void loadJSONResources(InputStream inputStream)
Load the resources from an input stream
static void main(String[] args)
Used for quick testing and debugging
-
Methods inherited from class com.digitalpebble.stormcrawler.util.AbstractConfigurable
configure, getName
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface com.digitalpebble.stormcrawler.util.Configurable
configure, getName
-
Methods inherited from interface com.digitalpebble.stormcrawler.JSONResource
loadJSONResources
-
-
-
-
Field Detail
-
emptyParseFilter
public static final JSoupFilters emptyParseFilter
-
-
Constructor Detail
-
JSoupFilters
public JSoupFilters(Map<String,Object> stormConf, String configFile) throws IOException
Loads the filters from a JSON configuration file
- Throws:
IOException
-
-
Method Detail
-
fromConf
public static JSoupFilters fromConf(Map<String,Object> stormConf)
Loads and configures the JSoupFilters based on the Storm config if there is one; otherwise returns an empty JSoupFilters instance.
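The fallback behaviour described for fromConf can be illustrated with a minimal, self-contained sketch. It does not use the real class: FiltersSketch is a hypothetical stand-in, EMPTY plays the role of emptyParseFilter, and the configuration key name is an assumption, since this page does not state which key fromConf reads.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for JSoupFilters; only the fallback logic is sketched.
class FiltersSketch {
    // Assumed configuration key; the real key name is not given on this page.
    static final String CONFIG_KEY = "jsoup.filters.config.file";

    // Shared no-op instance, playing the role of emptyParseFilter.
    static final FiltersSketch EMPTY = new FiltersSketch(null);

    final String configFile;

    FiltersSketch(String configFile) {
        this.configFile = configFile;
    }

    // Returns a configured instance if the key is set, otherwise the shared empty one.
    static FiltersSketch fromConf(Map<String, Object> stormConf) {
        Object value = stormConf.get(CONFIG_KEY);
        if (value == null || value.toString().trim().isEmpty()) {
            return EMPTY;
        }
        return new FiltersSketch(value.toString());
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        System.out.println(fromConf(conf) == EMPTY);
        conf.put(CONFIG_KEY, "jsoupfilters.json");
        System.out.println(fromConf(conf).configFile);
    }
}
```

With the real class, the equivalent call would be JSoupFilters.fromConf(stormConf), after which the returned object can be used without a null check.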
-
loadJSONResources
public void loadJSONResources(InputStream inputStream) throws IOException
Description copied from interface: JSONResource
Load the resources from an input stream
- Specified by:
loadJSONResources
in interface JSONResource
- Throws:
com.fasterxml.jackson.core.JsonParseException
com.fasterxml.jackson.databind.JsonMappingException
IOException
-
getResourceFile
public String getResourceFile()
- Specified by:
getResourceFile
in interface JSONResource
- Returns:
- filename of the JSON resource
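The pairing of getResourceFile() and loadJSONResources(InputStream) could look roughly like the sketch below. JsonResourceSketch and RawJsonHolder are hypothetical stand-ins, the file name is invented for illustration, and the stream content is only stored as text rather than parsed with Jackson as the real class presumably does. It uses InputStream.readAllBytes(), which requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in mirroring the two JSONResource methods listed on this page.
interface JsonResourceSketch {
    String getResourceFile();

    void loadJSONResources(InputStream inputStream) throws IOException;
}

class RawJsonHolder implements JsonResourceSketch {
    private String raw = "";

    @Override
    public String getResourceFile() {
        // Invented file name, for illustration only.
        return "jsoupfilters.json";
    }

    @Override
    public void loadJSONResources(InputStream inputStream) throws IOException {
        // Assumption: the real class parses JSON here; this sketch keeps the raw text.
        raw = new String(inputStream.readAllBytes(), StandardCharsets.UTF_8);
    }

    String raw() {
        return raw;
    }

    // Convenience helper: feeds an in-memory string through loadJSONResources.
    static RawJsonHolder fromString(String json) {
        RawJsonHolder holder = new RawJsonHolder();
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            holder.loadJSONResources(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return holder;
    }
}
```

Opening the stream in a try-with-resources block keeps the caller responsible for closing it, which matches a method that only receives an InputStream.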
-
configure
public void configure(@NotNull Map<String,Object> stormConf, @NotNull com.fasterxml.jackson.databind.JsonNode filtersConf)
Description copied from interface: Configurable
Called when this filter is being initialized
- Specified by:
configure
in interface Configurable
- Parameters:
stormConf - The Storm configuration used for the configurable
filtersConf - The filter-specific configuration. Never null
-
filter
public void filter(@NotNull String url, byte[] content, @NotNull org.jsoup.nodes.Document doc, @NotNull ParseResult parse)
Description copied from interface: JSoupFilter
Called when parsing a specific page
- Specified by:
filter
in interface JSoupFilter
- Parameters:
url - the URL of the page being parsed
content - the content being parsed
doc - document produced by JSoup's parsing
parse - the metadata to be updated with the results of the parsing
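Because this class is a wrapper that itself implements JSoupFilter, its filter method presumably delegates to each configured filter in turn. The sketch below shows that composite pattern with hypothetical, simplified types: PageFilterSketch stands in for JSoupFilter, a plain String replaces the JSoup Document and raw bytes, and a Map replaces ParseResult.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for JSoupFilter.
interface PageFilterSketch {
    void filter(String url, String html, Map<String, String> metadata);
}

// Wrapper that applies each delegate in order, sharing one metadata map.
class CompositeFilterSketch implements PageFilterSketch {
    private final List<PageFilterSketch> delegates = new ArrayList<>();

    // Registers a delegate and returns this wrapper so calls can be chained.
    CompositeFilterSketch add(PageFilterSketch delegate) {
        delegates.add(delegate);
        return this;
    }

    @Override
    public void filter(String url, String html, Map<String, String> metadata) {
        for (PageFilterSketch delegate : delegates) {
            delegate.filter(url, html, metadata);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new TreeMap<>();
        new CompositeFilterSketch()
                .add((url, html, md) -> md.put("url", url))
                .add((url, html, md) -> md.put("length", String.valueOf(html.length())))
                .filter("http://example.com/", "<p>hi</p>", metadata);
        System.out.println(metadata);
    }
}
```

A wrapper with no delegates does nothing when filter is called, which is the behaviour one would expect from an empty instance such as emptyParseFilter.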
-
main
public static void main(String[] args) throws IOException, org.apache.commons.cli.ParseException
Used for quick testing and debugging
- Throws:
IOException
org.apache.commons.cli.ParseException
-
-