Class AbstractQueryingSpout
- java.lang.Object
-
- org.apache.storm.topology.base.BaseComponent
-
- org.apache.storm.topology.base.BaseRichSpout
-
- com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
- All Implemented Interfaces:
Serializable, org.apache.storm.spout.ISpout, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichSpout
public abstract class AbstractQueryingSpout extends org.apache.storm.topology.base.BaseRichSpout
Common features of spouts which query a backend to generate tuples. Tracks the URLs being processed, with an optional delay before they are removed from the cache. Throttles the rate at which queries are emitted and provides a buffer to store the URLs waiting to be sent.
- Since:
- 1.11
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
class
AbstractQueryingSpout.InProcessMap<K,V>
Map which holds elements some additional time after the removal.
-
Field Summary
Fields
Modifier and Type Field Description
protected org.apache.storm.spout.SpoutOutputCollector
_collector
protected AbstractQueryingSpout.InProcessMap<String,Object>
beingProcessed
Map to keep in-process URLs, with the URL as key and optional value depending on the spout implementation.
protected URLBuffer
buffer
protected org.apache.storm.metric.api.MultiCountMetric
eventCounter
protected AtomicBoolean
isInQuery
Required for implementations doing asynchronous calls.
protected Instant
lastTimeResetToNOW
protected long
maxDelayBetweenQueries
protected long
minDelayBetweenQueries
protected CollectionMetric
queryTimes
protected int
resetFetchDateAfterNSecs
protected static String
resetFetchDateParamName
Delay in seconds after which the nextFetchDate filter is set to the current time, default 120.
protected static String
StatusMaxDelayParamName
Max time to allow between 2 successive queries to the backend.
protected static String
StatusMinDelayParamName
Min time to allow between 2 successive queries to the backend.
protected static String
StatusTTLPurgatory
Time in seconds during which acked or failed URLs are held back before they can be considered for fetching again, default 30.
-
Constructor Summary
Constructors
Constructor Description
AbstractQueryingSpout()
-
Method Summary
Modifier and Type Method Description
void
ack(Object msgId)
void
activate()
void
deactivate()
void
declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
void
fail(Object msgId)
protected long
getTimeLastQuerySent()
protected void
markQueryReceivedNow()
Sets the marker that we are in a query to false and timeLastQueryReceived to now.
void
nextTuple()
void
open(Map<String,Object> stormConf, org.apache.storm.task.TopologyContext context, org.apache.storm.spout.SpoutOutputCollector collector)
protected abstract void
populateBuffer()
Method where specific implementations query the storage.
-
-
-
Field Detail
-
StatusTTLPurgatory
protected static final String StatusTTLPurgatory
Time in seconds during which acked or failed URLs are held back before they can be considered for fetching again, default 30.
- See Also:
- Constant Field Values
-
StatusMinDelayParamName
protected static final String StatusMinDelayParamName
Min time to allow between 2 successive queries to the backend. Value in msecs, default 2000.
- See Also:
- Constant Field Values
-
minDelayBetweenQueries
protected long minDelayBetweenQueries
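The minimum delay between queries can be illustrated with a small throttle. This is only a sketch of the idea, not the StormCrawler code: `QueryThrottle`, `millisToWait` and `markSent` are hypothetical names, and the 2000 ms value is taken from the documented default.

```java
// Hypothetical helper illustrating a minimum delay between two queries.
// None of these names come from StormCrawler.
class QueryThrottle {
    private final long minDelayMillis;
    private long lastQuerySentMillis = -1L; // -1 means no query has been sent yet

    QueryThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Milliseconds left before another query may be sent; 0 means "go ahead".
    long millisToWait(long nowMillis) {
        if (lastQuerySentMillis < 0) {
            return 0L;
        }
        long elapsed = nowMillis - lastQuerySentMillis;
        return Math.max(0L, minDelayMillis - elapsed);
    }

    // Records the time at which a query was sent.
    void markSent(long nowMillis) {
        lastQuerySentMillis = nowMillis;
    }
}

class QueryThrottleDemo {
    public static void main(String[] args) {
        QueryThrottle throttle = new QueryThrottle(2000L);
        throttle.markSent(10_000L);
        System.out.println(throttle.millisToWait(10_500L)); // 1500
        System.out.println(throttle.millisToWait(12_000L)); // 0
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the helper, keeps the sketch deterministic.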
-
StatusMaxDelayParamName
protected static final String StatusMaxDelayParamName
Max time to allow between 2 successive queries to the backend. Value in msecs, default 20000.
- See Also:
- Constant Field Values
-
maxDelayBetweenQueries
protected long maxDelayBetweenQueries
-
resetFetchDateParamName
protected static final String resetFetchDateParamName
Delay in seconds after which the nextFetchDate filter is set to the current time, default 120. This prevents the search from being limited to a handful of sources.
- See Also:
- Constant Field Values
-
resetFetchDateAfterNSecs
protected int resetFetchDateAfterNSecs
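One way to picture the periodic reset of the nextFetchDate filter is sketched below. It assumes that a negative delay disables the reset, which this page does not state; `FetchDateFilter` and `filterDateAt` are hypothetical names, and the real spout may organise this logic differently.

```java
import java.time.Instant;

// Hypothetical illustration of resetting a fetch-date filter to "now"
// once a configured number of seconds has passed since the last reset.
class FetchDateFilter {
    private final int resetAfterSecs; // assumption: negative disables the reset
    private Instant filterDate;
    private Instant lastTimeResetToNow;

    FetchDateFilter(int resetAfterSecs, Instant start) {
        this.resetAfterSecs = resetAfterSecs;
        this.filterDate = start;
        this.lastTimeResetToNow = start;
    }

    // Returns the date to use in the query filter at the given time.
    Instant filterDateAt(Instant now) {
        if (resetAfterSecs >= 0) {
            Instant due = lastTimeResetToNow.plusSeconds(resetAfterSecs);
            if (now.isAfter(due)) {
                filterDate = now;
                lastTimeResetToNow = now;
            }
        }
        return filterDate;
    }
}

class FetchDateFilterDemo {
    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        FetchDateFilter filter = new FetchDateFilter(120, start);
        // Within 120 seconds the filter keeps its original date.
        System.out.println(filter.filterDateAt(start.plusSeconds(60)));
        // After 120 seconds the filter moves to the supplied "now".
        System.out.println(filter.filterDateAt(start.plusSeconds(121)));
    }
}
```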
-
lastTimeResetToNOW
protected Instant lastTimeResetToNOW
-
eventCounter
protected org.apache.storm.metric.api.MultiCountMetric eventCounter
-
buffer
protected URLBuffer buffer
-
_collector
protected org.apache.storm.spout.SpoutOutputCollector _collector
-
isInQuery
protected AtomicBoolean isInQuery
Required for implementations doing asynchronous calls.
-
queryTimes
protected CollectionMetric queryTimes
-
beingProcessed
protected AbstractQueryingSpout.InProcessMap<String,Object> beingProcessed
Map to keep in-process URLs, with the URL as key and optional value depending on the spout implementation. The entries are kept in a cache for a configurable amount of time so that items are not fetched a second time when new items are queried shortly after they have been acked.
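The idea of a map that keeps reporting a key for some time after its removal can be sketched as follows. This is a simplified stand-in and not the `InProcessMap` implementation: `LingeringMap`, its methods and the injected clock are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a map whose keys stay visible for a while after removal.
class LingeringMap<K, V> {
    private final Map<K, V> live = new HashMap<>();
    // Recently removed keys, mapped to the time (ms) at which they were removed.
    private final Map<K, Long> removedAt = new HashMap<>();
    private final long lingerMillis;
    private final LongSupplier clock;

    LingeringMap(long lingerMillis, LongSupplier clock) {
        this.lingerMillis = lingerMillis;
        this.clock = clock;
    }

    void put(K key, V value) {
        removedAt.remove(key);
        live.put(key, value);
    }

    // Removes the entry but remembers the key for lingerMillis.
    boolean remove(K key) {
        if (!live.containsKey(key)) {
            return false;
        }
        live.remove(key);
        removedAt.put(key, clock.getAsLong());
        return true;
    }

    // True for live keys and for keys removed less than lingerMillis ago.
    boolean containsKey(K key) {
        if (live.containsKey(key)) {
            return true;
        }
        Long removedTime = removedAt.get(key);
        if (removedTime == null) {
            return false;
        }
        if (clock.getAsLong() - removedTime < lingerMillis) {
            return true;
        }
        removedAt.remove(key); // the linger period has expired
        return false;
    }
}

class LingeringMapDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // fake clock so the demo is deterministic
        LingeringMap<String, Object> map = new LingeringMap<>(30_000L, () -> now[0]);
        map.put("http://example.com/", null);
        map.remove("http://example.com/");
        now[0] = 10_000L;
        System.out.println(map.containsKey("http://example.com/")); // true
        now[0] = 31_000L;
        System.out.println(map.containsKey("http://example.com/")); // false
    }
}
```

A spout using such a map could skip any URL returned by the backend for which `containsKey` is still true, which is the effect the description above aims for.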
-
-
Method Detail
-
open
public void open(Map<String,Object> stormConf, org.apache.storm.task.TopologyContext context, org.apache.storm.spout.SpoutOutputCollector collector)
-
populateBuffer
protected abstract void populateBuffer()
Method where specific implementations query the storage. Implementations should call markQueryReceivedNow when the documents have been received.
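The contract between populateBuffer and markQueryReceivedNow can be sketched without any Storm dependency. The classes below are simplified stand-ins and not the real API: `QueryingSketch`, `ListBackedSketch` and `next` are hypothetical, and setting the in-query flag inside `next` is an assumption about where that responsibility lies.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified stand-in for a querying spout; mirrors only the flag handling.
abstract class QueryingSketch {
    protected final Queue<String> buffer = new ArrayDeque<>();
    protected final AtomicBoolean isInQuery = new AtomicBoolean(false);
    protected long timeLastQueryReceived = 0L;

    // Implementations query their storage here and fill the buffer.
    protected abstract void populateBuffer();

    // To be called by implementations once the results have arrived.
    protected void markQueryReceivedNow() {
        isInQuery.set(false);
        timeLastQueryReceived = System.currentTimeMillis();
    }

    // Returns the next buffered URL, or null after triggering a new query.
    String next() {
        synchronized (buffer) {
            String url = buffer.poll();
            if (url != null) {
                return url;
            }
        }
        // Assumption: only one query may be in flight at a time.
        if (isInQuery.compareAndSet(false, true)) {
            populateBuffer();
        }
        return null;
    }
}

// Toy implementation whose "backend" is an in-memory list.
class ListBackedSketch extends QueryingSketch {
    private final List<String> backend;

    ListBackedSketch(List<String> backend) {
        this.backend = backend;
    }

    @Override
    protected void populateBuffer() {
        // A real implementation might do this in an asynchronous callback.
        synchronized (buffer) {
            buffer.addAll(backend);
        }
        markQueryReceivedNow();
    }
}

class QueryingSketchDemo {
    public static void main(String[] args) {
        ListBackedSketch spout =
                new ListBackedSketch(List.of("http://a.example/", "http://b.example/"));
        System.out.println(spout.next()); // null: buffer was empty, a query was triggered
        System.out.println(spout.next()); // http://a.example/
    }
}
```

Synchronizing on the buffer matters mainly when the backend responds on a different thread from the one calling `next`.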
-
nextTuple
public void nextTuple()
-
getTimeLastQuerySent
protected long getTimeLastQuerySent()
-
markQueryReceivedNow
protected void markQueryReceivedNow()
Sets the marker that we are in a query to false and timeLastQueryReceived to now.
-
activate
public void activate()
- Specified by:
activate
in interface org.apache.storm.spout.ISpout
- Overrides:
activate
in class org.apache.storm.topology.base.BaseRichSpout
-
deactivate
public void deactivate()
- Specified by:
deactivate
in interface org.apache.storm.spout.ISpout
- Overrides:
deactivate
in class org.apache.storm.topology.base.BaseRichSpout
-
ack
public void ack(Object msgId)
- Specified by:
ack
in interface org.apache.storm.spout.ISpout
- Overrides:
ack
in class org.apache.storm.topology.base.BaseRichSpout
-
fail
public void fail(Object msgId)
- Specified by:
fail
in interface org.apache.storm.spout.ISpout
- Overrides:
fail
in class org.apache.storm.topology.base.BaseRichSpout
-
declareOutputFields
public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
-
-