Class AbstractIndexerBolt
- java.lang.Object
  - org.apache.storm.topology.base.BaseComponent
    - org.apache.storm.topology.base.BaseRichBolt
      - com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
- All Implemented Interfaces: Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt
- Direct Known Subclasses: DummyIndexer, StdOutIndexer
public abstract class AbstractIndexerBolt extends org.apache.storm.topology.base.BaseRichBolt
Abstract class to simplify writing IndexerBolts.
- See Also: Serialized Form
Field Summary
Fields (Modifier and Type, Field, Description):
- static String canonicalMetadataParamName: Field name to use for reading the canonical property of the metadata.
- static String ignoreEmptyFieldValueParamName: Indicates that empty field values should not be emitted at all.
- static String metadata2fieldParamName: Mapping between metadata keys and field names for indexing. Can be a list of values separated by a = or a single string.
- static String metadataFilterParamName: List of metadata key + values to be used as a filter.
- static String textFieldParamName: Field name to use for storing the text of a document.
- static String textLengthParamName: Trim length of text to index.
- static String urlFieldParamName: Field name to use for storing the url of a document.
-
Constructor Summary
Constructors:
- AbstractIndexerBolt()
-
Method Summary
All Methods, Instance Methods, Concrete Methods (Modifier and Type, Method, Description):
- void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
- protected String fieldNameForText(): Returns the field name to use for the text, or null if the text must not be indexed.
- protected String fieldNameForURL(): Returns the field name to use for the URL, or null if the URL must not be indexed.
- protected boolean filterDocument(Metadata meta): Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
- protected Map<String,String[]> filterMetadata(Metadata meta): Returns a mapping field name / values for the metadata to index.
- protected String getDocumentID(Metadata metadata, String normalisedUrl): Get the document id.
- protected boolean ignoreEmptyFields()
- void prepare(Map<String,Object> conf, org.apache.storm.task.TopologyContext context, org.apache.storm.task.OutputCollector collector)
- protected String trimText(String text): Returns a trimmed string or the original one if it is below the threshold set in the configuration.
- protected String valueForURL(org.apache.storm.tuple.Tuple tuple): Returns the value to be used as the URL for indexing purposes; if present, the canonical value is used instead.
-
-
-
Field Detail
-
metadata2fieldParamName
public static final String metadata2fieldParamName
Mapping between metadata keys and field names for indexing. Can be a list of values, each with the metadata key and the field name separated by a =, or a single string.
- See Also: Constant Field Values
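As an illustration of the key/field mapping described above, here is a minimal sketch of how such entries could be parsed. The class `MetadataMappingSketch` and its `parse` method are hypothetical and not part of StormCrawler. Treating an entry without a = as mapping the key to a field of the same name is an assumption, not something this page states.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler: turns "key=field" entries
// into a metadata-key -> field-name map.
final class MetadataMappingSketch {

    static Map<String, String> parse(List<String> entries) {
        Map<String, String> mapping = new LinkedHashMap<>();
        for (String entry : entries) {
            int eq = entry.indexOf('=');
            if (eq != -1) {
                // Left of '=' is the metadata key, right of it the field name.
                mapping.put(entry.substring(0, eq).trim(), entry.substring(eq + 1).trim());
            } else {
                // Assumption: no '=' means the field is named after the key.
                String key = entry.trim();
                mapping.put(key, key);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Example entries are made up for illustration.
        System.out.println(parse(List.of("parse.title=title", "keywords")));
    }
}
```

A single string value can be handled by the same code by wrapping it in a one-element list.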
-
metadataFilterParamName
public static final String metadataFilterParamName
List of metadata key + values to be used as a filter. A document will be indexed only if its metadata contains such a key/value. Can be null, in which case no filtering is applied.
- See Also: Constant Field Values
-
textFieldParamName
public static final String textFieldParamName
Field name to use for storing the text of a document.
- See Also: Constant Field Values
-
textLengthParamName
public static final String textLengthParamName
Trim length of text to index. Defaults to -1 to keep it intact.
- See Also: Constant Field Values
-
urlFieldParamName
public static final String urlFieldParamName
Field name to use for storing the url of a document.
- See Also: Constant Field Values
-
canonicalMetadataParamName
public static final String canonicalMetadataParamName
Field name to use for reading the canonical property of the metadata.
- See Also: Constant Field Values
-
ignoreEmptyFieldValueParamName
public static final String ignoreEmptyFieldValueParamName
Indicates that empty field values should not be emitted at all.
- See Also: Constant Field Values
-
-
Method Detail
-
prepare
public void prepare(Map<String,Object> conf, org.apache.storm.task.TopologyContext context, org.apache.storm.task.OutputCollector collector)
-
filterDocument
protected boolean filterDocument(Metadata meta)
Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
- Returns: true if the document should be kept.
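The decision described above can be sketched roughly as follows. This is not the class's actual code. A plain `Map<String, String[]>` stands in for `Metadata`, `FilterDocumentSketch` and `keep` are hypothetical names, and the `"robots.noindex"` key with a `"true"` value is an assumed placeholder for the `RobotsTags.ROBOTS_NO_INDEX` directive.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical stand-in for the filtering logic; Metadata is replaced by a Map.
final class FilterDocumentSketch {

    // Assumed placeholder for RobotsTags.ROBOTS_NO_INDEX; the real value may differ.
    static final String NO_INDEX_KEY = "robots.noindex";

    /** Returns true if the document should be kept. */
    static boolean keep(Map<String, String[]> meta, String filterKey, String filterValue) {
        // A noindex directive always excludes the document.
        String[] noIndex = meta.get(NO_INDEX_KEY);
        if (noIndex != null && Arrays.asList(noIndex).contains("true")) {
            return false;
        }
        // No filter configured: keep everything else.
        if (filterKey == null) {
            return true;
        }
        // Otherwise keep only documents carrying the expected key/value.
        String[] values = meta.get(filterKey);
        return values != null && Arrays.asList(values).contains(filterValue);
    }
}
```

For example, `keep(Map.of("lang", new String[] {"en"}), "lang", "en")` keeps the document, while the same call with `"fr"` as the filter value drops it.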
-
filterMetadata
protected Map<String,String[]> filterMetadata(Metadata meta)
Returns a mapping of field names to values for the metadata to index.
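To make the shape of the returned map concrete, here is a minimal sketch that applies a key-to-field mapping to document metadata. It assumes a plain `Map<String, String[]>` in place of `Metadata`. `FilterMetadataSketch` and `toFields` are hypothetical names, and skipping keys with no values is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: builds field name -> values from metadata,
// using a metadata-key -> field-name mapping.
final class FilterMetadataSketch {

    static Map<String, String[]> toFields(Map<String, String[]> metadata,
                                          Map<String, String> keyToField) {
        Map<String, String[]> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : keyToField.entrySet()) {
            String[] values = metadata.get(e.getKey());
            // Assumption: keys that are absent or have no values are skipped.
            if (values != null && values.length > 0) {
                fields.put(e.getValue(), values);
            }
        }
        return fields;
    }
}
```

Metadata keys that are not listed in the mapping are left out of the result, so only explicitly mapped values reach the index.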
-
getDocumentID
protected String getDocumentID(Metadata metadata, String normalisedUrl)
Get the document id.
- Parameters:
  - metadata - The Metadata.
  - normalisedUrl - The normalised url.
- Returns: The SHA-256 digest of the normalised url as a String.
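For reference, a SHA-256 digest of a URL can be computed with the JDK alone, as in the sketch below. `DocumentIdSketch` and `sha256Hex` are hypothetical names. Lowercase hexadecimal encoding and UTF-8 input bytes are assumptions, since this page only says the digest is returned as a String.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of deriving a document id from a normalised URL.
final class DocumentIdSketch {

    static String sha256Hex(String normalisedUrl) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(normalisedUrl.getBytes(StandardCharsets.UTF_8));
            // Assumption: the digest is rendered as lowercase hexadecimal.
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

Hashing the URL yields a fixed-length id (64 hex characters here) regardless of how long the URL is.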
-
valueForURL
protected String valueForURL(org.apache.storm.tuple.Tuple tuple)
Returns the value to be used as the URL for indexing purposes. If a canonical value is present, it is used instead.
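The preference for a canonical value can be sketched as below. This simplification takes the URL and metadata directly instead of a Storm `Tuple`, and `ValueForUrlSketch` and `valueForUrl` are hypothetical names. The real method may apply further checks that this page does not describe.

```java
import java.util.Map;

// Hypothetical illustration: prefer the canonical URL from the metadata when available.
final class ValueForUrlSketch {

    static String valueForUrl(String url, Map<String, String[]> metadata, String canonicalKey) {
        // No canonical key configured: index under the fetched URL.
        if (canonicalKey == null) {
            return url;
        }
        String[] values = metadata.get(canonicalKey);
        // Assumption: a missing or blank canonical value falls back to the URL.
        if (values == null || values.length == 0 || values[0].trim().isEmpty()) {
            return url;
        }
        return values[0];
    }
}
```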
-
fieldNameForText
protected String fieldNameForText()
Returns the field name to use for the text, or null if the text must not be indexed.
-
trimText
protected String trimText(String text)
Returns a trimmed string or the original one if it is below the threshold set in the configuration.
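A minimal sketch of this trimming behaviour, assuming the threshold is passed in explicitly rather than read from the configuration. `TrimTextSketch` is a hypothetical name, and treating a negative threshold as "do not trim" follows the -1 default documented for textLengthParamName.

```java
// Hypothetical illustration of trimming text to a configured maximum length.
final class TrimTextSketch {

    static String trimText(String text, int maxLength) {
        // A negative threshold (the documented default is -1) keeps the text intact.
        if (maxLength < 0 || text == null || text.length() <= maxLength) {
            return text;
        }
        // Keep only the first maxLength characters.
        return text.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(trimText("hello world", 5)); // prints "hello"
    }
}
```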
-
fieldNameForURL
protected String fieldNameForURL()
Returns the field name to use for the URL, or null if the URL must not be indexed.
-
ignoreEmptyFields
protected boolean ignoreEmptyFields()
-
declareOutputFields
public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
-
-