Class JSoupParserBolt
- java.lang.Object
  - org.apache.storm.topology.base.BaseComponent
    - org.apache.storm.topology.base.BaseRichBolt
      - com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
        - com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
All Implemented Interfaces:
Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt
public class JSoupParserBolt extends StatusEmitterBolt
Parser for HTML documents only, which uses ICU4J to detect the charset encoding. Kindly donated to storm-crawler by shopstyle.com.
See Also:
- Serialized Form
Field Summary
Fields
Modifier and Type | Field | Description
static String | ANCHORS_KEY_NAME | Metadata key name for tracking the anchors
Fields inherited from class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
collector
Constructor Summary
Constructors
JSoupParserBolt()
Method Summary
Modifier and Type | Method
void | declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
void | execute(org.apache.storm.tuple.Tuple tuple)
String | guessMimeType(String URL, String httpCT, byte[] content)
void | prepare(Map<String,Object> conf, org.apache.storm.task.TopologyContext context, org.apache.storm.task.OutputCollector collector)
protected List<Outlink> | toOutlinks(String url, Metadata metadata, Map<String,List<String>> slinks)
Methods inherited from class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
allowRedirs, emitOutlink, filterOutlink
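The summary above lists toOutlinks without a description. The sketch below is only an illustration of the shape its signature suggests, not the bolt's actual implementation. It assumes that slinks maps each link target to the anchor texts found for it, which this page does not state. SimpleOutlink is a hypothetical stand-in for the real Outlink class, the Metadata parameter is left out, and the rule that drops links pointing back to the source page is an invented placeholder for whatever filtering the bolt really applies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlinkSketch {

    /** Hypothetical stand-in for StormCrawler's Outlink; not the real class. */
    static final class SimpleOutlink {
        final String targetUrl;
        final List<String> anchors;

        SimpleOutlink(String targetUrl, List<String> anchors) {
            this.targetUrl = targetUrl;
            this.anchors = anchors;
        }
    }

    /**
     * Converts a map of link target -> anchor texts into a list of outlinks.
     * Assumption: links that point back to the source page are skipped.
     * The real bolt's filtering rules are not described on this page.
     */
    static List<SimpleOutlink> toOutlinks(String sourceUrl, Map<String, List<String>> slinks) {
        List<SimpleOutlink> outlinks = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : slinks.entrySet()) {
            String target = entry.getKey();
            if (target.equals(sourceUrl)) {
                continue; // placeholder filter: drop self-links
            }
            // Copy the anchors so later changes to the input map do not leak in.
            outlinks.add(new SimpleOutlink(target, new ArrayList<>(entry.getValue())));
        }
        return outlinks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> slinks = new LinkedHashMap<>();
        slinks.put("https://example.com/a", List.of("First", "Again"));
        slinks.put("https://example.com/", List.of("Home"));

        for (SimpleOutlink outlink : toOutlinks("https://example.com/", slinks)) {
            System.out.println(outlink.targetUrl + " " + outlink.anchors);
        }
    }
}
```

In the real class the anchors would presumably be recorded in the outlink's metadata under ANCHORS_KEY_NAME rather than in a dedicated field. That is an inference from the field description, not something this page confirms.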
Field Detail
ANCHORS_KEY_NAME
public static final String ANCHORS_KEY_NAME
Metadata key name for tracking the anchors
See Also:
- Constant Field Values
Method Detail
prepare
public void prepare(Map<String,Object> conf, org.apache.storm.task.TopologyContext context, org.apache.storm.task.OutputCollector collector)
- Specified by: prepare in interface org.apache.storm.task.IBolt
- Overrides: prepare in class StatusEmitterBolt
execute
public void execute(org.apache.storm.tuple.Tuple tuple)
declareOutputFields
public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
- Specified by: declareOutputFields in interface org.apache.storm.topology.IComponent
- Overrides: declareOutputFields in class StatusEmitterBolt