Class JSoupParserBolt

  • All Implemented Interfaces:
    Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt

    public class JSoupParserBolt
    extends StatusEmitterBolt
    Parses HTML documents only, using ICU4J to detect the charset encoding. Kindly donated to storm-crawler by shopstyle.com.
    See Also:
    Serialized Form
    • Field Detail

      • ANCHORS_KEY_NAME

        public static final String ANCHORS_KEY_NAME
        Metadata key name for tracking the anchors
        See Also:
        Constant Field Values
    • Constructor Detail

      • JSoupParserBolt

        public JSoupParserBolt()
    • Method Detail

      • prepare

        public void prepare(Map<String,Object> conf,
                            org.apache.storm.task.TopologyContext context,
                            org.apache.storm.task.OutputCollector collector)
        Specified by:
        prepare in interface org.apache.storm.task.IBolt
        Overrides:
        prepare in class StatusEmitterBolt
      • execute

        public void execute(org.apache.storm.tuple.Tuple tuple)
      • declareOutputFields

        public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
        Specified by:
        declareOutputFields in interface org.apache.storm.topology.IComponent
        Overrides:
        declareOutputFields in class StatusEmitterBolt
      • guessMimeType

        public String guessMimeType(String URL,
                                    String httpCT,
                                    byte[] content)
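
        The page gives no description for guessMimeType, so the following is only an illustrative sketch of how a method with this signature might combine its three inputs. It is not the StormCrawler implementation. The class name MimeGuessSketch, the order of precedence (HTTP header first, then leading content bytes, then URL extension), and the application/octet-stream fallback are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical stand-in for illustration; NOT the JSoupParserBolt implementation. */
final class MimeGuessSketch {

    private MimeGuessSketch() {}

    /**
     * Guesses a MIME type from, in order: the HTTP Content-Type header,
     * the leading bytes of the content, and the URL's file extension.
     * This ordering is an assumption of the sketch.
     */
    static String guessMimeType(String url, String httpCT, byte[] content) {
        // 1. Use the server-supplied header if present, dropping
        //    parameters such as "; charset=UTF-8".
        if (httpCT != null && !httpCT.trim().isEmpty()) {
            String type = httpCT.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (!type.isEmpty()) {
                return type;
            }
        }

        // 2. Sniff the first bytes of the document for common markup prologues.
        if (content != null && content.length > 0) {
            int len = Math.min(content.length, 64);
            String head = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                    .trim()
                    .toLowerCase(Locale.ROOT);
            if (head.startsWith("<!doctype html") || head.startsWith("<html")) {
                return "text/html";
            }
            if (head.startsWith("<?xml")) {
                return "application/xml";
            }
        }

        // 3. Fall back to the file extension, ignoring query string and fragment.
        if (url != null) {
            String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
            if (path.endsWith(".html") || path.endsWith(".htm")) {
                return "text/html";
            }
        }

        // Nothing matched: report a generic binary type.
        return "application/octet-stream";
    }
}
```

        Because this bolt parses HTML documents only, a caller could compare the guessed type against text/html before handing the content to the parser. Whether JSoupParserBolt itself does so is not stated on this page.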