Class SiteMapParserBolt

  • All Implemented Interfaces:
    Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt

    public class SiteMapParserBolt
    extends StatusEmitterBolt
    Extracts URLs from a sitemap file. Parsing is triggered by sniffing the content and can also be forced by setting 'isSitemap=true' in the metadata. Tuples that are not identified as sitemaps are passed on to the default stream, whereas any URLs extracted from a sitemap are sent to the 'status' stream with a 'DISCOVERED' status.
    See Also:
    Serialized Form
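    The routing described above can be sketched without any Storm dependencies. This is an illustration only, not the bolt's actual code: the class SitemapRoutingSketch, the Route enum, the helper names, the 1024-byte sniffing window and the '<urlset' / '<sitemapindex' markers are all assumptions made for the example, and metadata is simplified to a Map<String,String>.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical, dependency-free sketch of the sitemap routing decision. */
public class SitemapRoutingSketch {

    /** Where a fetched document would go (names are illustrative). */
    enum Route { PARSE_AS_SITEMAP, PASS_TO_DEFAULT_STREAM }

    /**
     * Assumed sniffing rule: look for a sitemap root element in the
     * first maxBytes of the content. The real bolt may use other clues.
     */
    static boolean looksLikeSitemap(byte[] content, int maxBytes) {
        int len = Math.min(content.length, maxBytes);
        String head = new String(content, 0, len, StandardCharsets.UTF_8);
        return head.contains("<urlset") || head.contains("<sitemapindex");
    }

    /** Parsing is forced by isSitemap=true, otherwise decided by sniffing. */
    static Route route(byte[] content, Map<String, String> metadata) {
        boolean forced = "true".equalsIgnoreCase(metadata.get("isSitemap"));
        if (forced || looksLikeSitemap(content, 1024)) {
            return Route.PARSE_AS_SITEMAP;
        }
        return Route.PASS_TO_DEFAULT_STREAM;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>hello</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Plain HTML with no metadata hint, then the same bytes with the hint.
        System.out.println(route(html, Map.of()));
        System.out.println(route(html, Map.of("isSitemap", "true")));
    }
}
```

    In this sketch a document routed to PARSE_AS_SITEMAP corresponds to the case where the bolt would emit each extracted URL on the 'status' stream as DISCOVERED, while PASS_TO_DEFAULT_STREAM corresponds to forwarding the tuple unchanged.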
    • Constructor Detail

      • SiteMapParserBolt

        public SiteMapParserBolt()
    • Method Detail

      • execute

        public void execute(org.apache.storm.tuple.Tuple tuple)
      • parseExtensionAttributes

        public void parseExtensionAttributes(crawlercommons.sitemaps.SiteMapURL url,
                                             Metadata metadata)
      • prepare

        public void prepare(Map<String,Object> stormConf,
                            org.apache.storm.task.TopologyContext context,
                            org.apache.storm.task.OutputCollector collector)
        Specified by:
        prepare in interface org.apache.storm.task.IBolt
        Overrides:
        prepare in class StatusEmitterBolt
      • declareOutputFields

        public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
        Specified by:
        declareOutputFields in interface org.apache.storm.topology.IComponent
        Overrides:
        declareOutputFields in class StatusEmitterBolt