Class AbstractIndexerBolt

  • All Implemented Interfaces:
    Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt
    Direct Known Subclasses:
    DummyIndexer, StdOutIndexer

    public abstract class AbstractIndexerBolt
    extends org.apache.storm.topology.base.BaseRichBolt
    Abstract class to simplify writing IndexerBolts.
    See Also:
    Serialized Form
    • Field Detail

      • metadata2fieldParamName

        public static final String metadata2fieldParamName
        Mapping between metadata keys and field names for indexing. Can be a list of values, each a key and a field name separated by a =, or a single string.
        See Also:
        Constant Field Values
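        The key/field mapping described above could be parsed along the following lines. This is a minimal sketch, not the class's actual code: the class MetadataMappingSketch, the method parseMapping and the sample entries are hypothetical, and treating an entry without a = as a key that maps to a field of the same name is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of AbstractIndexerBolt.
class MetadataMappingSketch {

    // Turns entries such as "parse.title=title" into a metadata key -> field name map.
    // Assumption: an entry without '=' maps the key to a field with the same name.
    static Map<String, String> parseMapping(List<String> entries) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : entries) {
            int equals = entry.indexOf('=');
            if (equals != -1) {
                String key = entry.substring(0, equals).trim();
                String field = entry.substring(equals + 1).trim();
                mapping.put(key, field);
            } else {
                String key = entry.trim();
                mapping.put(key, key);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Sample entries are made up for the example.
        System.out.println(parseMapping(List.of("parse.title=title", "parse.keywords")));
    }
}
```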
      • metadataFilterParamName

        public static final String metadataFilterParamName
        List of metadata keys and values to be used as a filter. A document is indexed only if its metadata contains such a key/value. Can be null, in which case no filtering is applied.
        See Also:
        Constant Field Values
      • textFieldParamName

        public static final String textFieldParamName
        Field name to use for storing the text of a document.
        See Also:
        Constant Field Values
      • textLengthParamName

        public static final String textLengthParamName
        Length to which the text to index is trimmed. Defaults to -1, which keeps the text intact.
        See Also:
        Constant Field Values
      • urlFieldParamName

        public static final String urlFieldParamName
        Field name to use for storing the URL of a document.
        See Also:
        Constant Field Values
      • canonicalMetadataParamName

        public static final String canonicalMetadataParamName
        Field name to use for reading the canonical property of the metadata.
        See Also:
        Constant Field Values
      • ignoreEmptyFieldValueParamName

        public static final String ignoreEmptyFieldValueParamName
        Indicates that empty field values should not be emitted at all.
        See Also:
        Constant Field Values
    • Constructor Detail

      • AbstractIndexerBolt

        public AbstractIndexerBolt()
    • Method Detail

      • prepare

        public void prepare(Map<String,Object> conf,
                            org.apache.storm.task.TopologyContext context,
                            org.apache.storm.task.OutputCollector collector)
      • filterDocument

        protected boolean filterDocument(Metadata meta)
        Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
        Returns:
        true if the document should be kept.
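        The decision described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: a plain Map stands in for Metadata, the key held in NO_INDEX_KEY is a hypothetical stand-in for RobotsTags.ROBOTS_NO_INDEX, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper for illustration; not part of AbstractIndexerBolt.
class FilterDocumentSketch {

    // Assumed metadata key standing in for RobotsTags.ROBOTS_NO_INDEX.
    static final String NO_INDEX_KEY = "robots.noindex";

    // metadata is a plain-Map stand-in for Metadata.
    // filterKey and filterValue are the configured filter; a null key means no filtering.
    static boolean keep(Map<String, String[]> metadata, String filterKey, String filterValue) {
        String[] noIndex = metadata.get(NO_INDEX_KEY);
        if (noIndex != null && noIndex.length > 0 && "true".equalsIgnoreCase(noIndex[0])) {
            return false; // the page asked not to be indexed
        }
        if (filterKey == null) {
            return true; // no filter configured, keep everything
        }
        String[] values = metadata.get(filterKey);
        return values != null && Arrays.asList(values).contains(filterValue);
    }

    public static void main(String[] args) {
        Map<String, String[]> metadata = Map.of("lang", new String[] {"en"});
        System.out.println(keep(metadata, "lang", "en"));
        System.out.println(keep(metadata, "lang", "fr"));
    }
}
```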
      • filterMetadata

        protected Map<String,String[]> filterMetadata(Metadata meta)
        Returns a mapping of field names to values for the metadata to index.
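        One way to picture the returned structure is sketched below. It assumes a plain Map as a stand-in for Metadata and a metadata key -> field name map built from the configuration. The names FilterMetadataSketch and toFields are hypothetical, and skipping keys that have no values is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of AbstractIndexerBolt.
class FilterMetadataSketch {

    // Renames the configured metadata keys to their field names.
    // Assumption: keys that are absent or have no values are left out.
    static Map<String, String[]> toFields(Map<String, String[]> metadata,
                                          Map<String, String> keyToField) {
        Map<String, String[]> fields = new HashMap<>();
        for (Map.Entry<String, String> entry : keyToField.entrySet()) {
            String[] values = metadata.get(entry.getKey());
            if (values == null || values.length == 0) {
                continue;
            }
            fields.put(entry.getValue(), values);
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String[]> metadata = Map.of("parse.title", new String[] {"Home"});
        Map<String, String> keyToField = Map.of("parse.title", "title");
        System.out.println(toFields(metadata, keyToField).keySet());
    }
}
```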
      • getDocumentID

        protected String getDocumentID(Metadata metadata,
                                       String normalisedUrl)
        Get the document id.
        Parameters:
        metadata - The Metadata.
        normalisedUrl - The normalised url.
        Returns:
        The SHA-256 digest of the normalised URL, as a String.
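        Computing a SHA-256 digest of a URL with the JDK alone might look like the sketch below. The hexadecimal encoding and UTF-8 charset are assumptions, since the documentation only says the digest is returned as a String. DocumentIdSketch and sha256Hex are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of AbstractIndexerBolt.
class DocumentIdSketch {

    // Returns the SHA-256 digest of the input as a lowercase hex string (assumed encoding).
    static String sha256Hex(String normalisedUrl) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(normalisedUrl.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JDK ships SHA-256, so this is not expected to happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("https://example.com/"));
    }
}
```

        A digest gives every URL an identifier of fixed length (64 hex characters here), whatever the length of the URL.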
      • valueForURL

        protected String valueForURL(org.apache.storm.tuple.Tuple tuple)
        Returns the value to be used as the URL for indexing purposes. If present, the canonical value is used instead.
      • fieldNameForText

        protected String fieldNameForText()
        Returns the field name to use for the text, or null if the text must not be indexed.
      • trimText

        protected String trimText(String text)
        Returns a trimmed string or the original one if it is below the threshold set in the configuration.
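        The behaviour described above, together with the -1 default of textLengthParamName, can be sketched as below. Passing the threshold as a parameter and cutting the text with substring are assumptions made to keep the example self-contained. TrimTextSketch is a hypothetical name.

```java
// Hypothetical helper for illustration; not part of AbstractIndexerBolt.
class TrimTextSketch {

    // Returns the text cut to maxLength characters.
    // A negative maxLength (such as the default of -1) leaves the text intact.
    static String trimText(String text, int maxLength) {
        if (text == null || maxLength < 0 || text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(trimText("hello world", 5));
        System.out.println(trimText("hello world", -1));
    }
}
```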
      • fieldNameForURL

        protected String fieldNameForURL()
        Returns the field name to use for the URL, or null if the URL must not be indexed.
      • ignoreEmptyFields

        protected boolean ignoreEmptyFields()
      • declareOutputFields

        public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)