
A

AbstractConfigurable - Class in com.digitalpebble.stormcrawler.util
 
AbstractConfigurable() - Constructor for class com.digitalpebble.stormcrawler.util.AbstractConfigurable
 
AbstractHttpProtocol - Class in com.digitalpebble.stormcrawler.protocol
 
AbstractHttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
AbstractHttpProtocol.KeyValue - Class in com.digitalpebble.stormcrawler.protocol
 
AbstractIndexerBolt - Class in com.digitalpebble.stormcrawler.indexing
Abstract class to simplify writing IndexerBolts.
AbstractIndexerBolt() - Constructor for class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
AbstractQueryingSpout - Class in com.digitalpebble.stormcrawler.persistence
Common features of spouts which query a backend to generate tuples.
AbstractQueryingSpout() - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
AbstractQueryingSpout.InProcessMap<K,V> - Class in com.digitalpebble.stormcrawler.persistence
Map which holds elements for some additional time after their removal.
AbstractStatusUpdaterBolt - Class in com.digitalpebble.stormcrawler.persistence
Abstract bolt used to store the status of URLs.
AbstractStatusUpdaterBolt() - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
AbstractURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
Abstract class for the URLBuffer interface, meant to simplify the code of the implementations and provide some default methods.
AbstractURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
AbstractURLBuffer.URLMetadata - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
 
accept() - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Returns whether this rule is used for filtering in or filtering out.
ack(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
ack(Object) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
ack(Tuple, String) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Must be called by extending classes to store and collect in one go
acked(String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
 
acked(String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
 
acked(String) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Notifies the buffer that a URL has been successfully processed; used, e.g., to compute an ideal delay for a host queue.
activate() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
activate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
activate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
active - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 
AdaptiveScheduler - Class in com.digitalpebble.stormcrawler.persistence
Adaptive fetch scheduler: checks by signature comparison whether a re-fetched page has changed. If it has, the fetch interval is shrunk, down to a minimum fetch interval; if not, the fetch interval is increased, up to a maximum.
AdaptiveScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
add(String, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
Stores the URL and its Metadata using the hostname as key.
add(String, Metadata) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Stores the URL and its Metadata using the hostname as key.
add(String, Metadata, String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
Stores the URL and its Metadata under a given key.
add(String, Metadata, String) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Stores the URL and its Metadata under a given key.
add(String, Metadata, Date) - Static method in class com.digitalpebble.stormcrawler.spout.MemorySpout
Adds a new URL with the given metadata and nextFetch date.
addHeadersToRequest(Request.Builder, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
 
addHeadersToRequest(HttpRequestBase, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
addMeasurement(long) - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
 
addSitemap(String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
addValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
addValues(String, Collection<String>) - Method in class com.digitalpebble.stormcrawler.Metadata
 
agentNames - Variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
allow5xx - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
allowForbidden - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
AllowRedirParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
allowRedirs() - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
ANCHORS_KEY_NAME - Static variable in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
Metadata key name for tracking the anchors
AS_IS_NEXTFETCHDATE_METADATA - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Key used to pass a preset Date to use as nextFetchDate.
asMap() - Method in class com.digitalpebble.stormcrawler.Metadata
Returns the underlying Map.
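The AdaptiveScheduler entry in this section describes shrinking the fetch interval when a page has changed and growing it otherwise, within fixed bounds. The following is only a minimal sketch of that idea in plain Java, not the StormCrawler implementation: the class `AdaptiveIntervalSketch`, the method `nextInterval`, the use of minutes as the unit, and the multiplicative use of the two rates are all assumptions made for illustration.

```java
// Hypothetical illustration; none of these names come from StormCrawler.
final class AdaptiveIntervalSketch {

    /**
     * Computes the next fetch interval (assumed to be in minutes).
     * If the page changed, the interval shrinks by decRate but never below minMinutes.
     * If it did not change, the interval grows by incRate but never above maxMinutes.
     */
    static int nextInterval(int currentMinutes, boolean changed,
                            int minMinutes, int maxMinutes,
                            float decRate, float incRate) {
        if (changed) {
            int shrunk = Math.round(currentMinutes * (1.0f - decRate));
            return Math.max(shrunk, minMinutes);
        }
        int grown = Math.round(currentMinutes * (1.0f + incRate));
        return Math.min(grown, maxMinutes);
    }

    public static void main(String[] args) {
        // A page fetched daily that has changed: the interval is halved.
        System.out.println(nextInterval(1440, true, 60, 10080, 0.5f, 0.5f)); // prints 720
    }
}
```

Clamping after the adjustment keeps the interval inside the configured bounds regardless of how many consecutive changes or non-changes are observed.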

B

BasicURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
Simple URL filters: can be used early in the filtering chain.
BasicURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
BasicURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.basic
 
BasicURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
BATCH_SIZE - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 
beingProcessed - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Map to keep in-process URLs, with the URL as key and optional value depending on the spout implementation.
buffer - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
buffer - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 
bufferClassParamName - Static variable in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Implementation to use for URLBuffer.
build(String) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
 

C

CACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
cacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Parameter name to configure the cache (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterAccess=1h".
cacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Parameter name to configure the cache for robots (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterWrite=6h".
canonicalMetadataParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for reading the canonical property of the metadata
CharsetIdentification - Class in com.digitalpebble.stormcrawler.util
 
CharsetIdentification() - Constructor for class com.digitalpebble.stormcrawler.util.CharsetIdentification
 
checkCustomInterval(Metadata, Status) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
Returns the first matching custom interval
checkDomainMatchToUrl(String, String) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
Helper method to check if url matches a cookie domain.
chooseTasks(int, List<Object>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
cleanup() - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
 
cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
close() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
CollectionMetric - Class in com.digitalpebble.stormcrawler.util
 
CollectionMetric() - Constructor for class com.digitalpebble.stormcrawler.util.CollectionMetric
 
CollectionTagger - Class in com.digitalpebble.stormcrawler.parse.filter
Assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file.
CollectionTagger() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
 
collector - Variable in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
collector - Variable in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
com.digitalpebble.stormcrawler - package com.digitalpebble.stormcrawler
 
com.digitalpebble.stormcrawler.bolt - package com.digitalpebble.stormcrawler.bolt
 
com.digitalpebble.stormcrawler.filtering - package com.digitalpebble.stormcrawler.filtering
 
com.digitalpebble.stormcrawler.filtering.basic - package com.digitalpebble.stormcrawler.filtering.basic
 
com.digitalpebble.stormcrawler.filtering.depth - package com.digitalpebble.stormcrawler.filtering.depth
 
com.digitalpebble.stormcrawler.filtering.host - package com.digitalpebble.stormcrawler.filtering.host
 
com.digitalpebble.stormcrawler.filtering.metadata - package com.digitalpebble.stormcrawler.filtering.metadata
 
com.digitalpebble.stormcrawler.filtering.regex - package com.digitalpebble.stormcrawler.filtering.regex
 
com.digitalpebble.stormcrawler.filtering.robots - package com.digitalpebble.stormcrawler.filtering.robots
 
com.digitalpebble.stormcrawler.filtering.sitemap - package com.digitalpebble.stormcrawler.filtering.sitemap
 
com.digitalpebble.stormcrawler.indexing - package com.digitalpebble.stormcrawler.indexing
 
com.digitalpebble.stormcrawler.jsoup - package com.digitalpebble.stormcrawler.jsoup
 
com.digitalpebble.stormcrawler.parse - package com.digitalpebble.stormcrawler.parse
 
com.digitalpebble.stormcrawler.parse.filter - package com.digitalpebble.stormcrawler.parse.filter
 
com.digitalpebble.stormcrawler.persistence - package com.digitalpebble.stormcrawler.persistence
 
com.digitalpebble.stormcrawler.persistence.urlbuffer - package com.digitalpebble.stormcrawler.persistence.urlbuffer
 
com.digitalpebble.stormcrawler.protocol - package com.digitalpebble.stormcrawler.protocol
 
com.digitalpebble.stormcrawler.protocol.file - package com.digitalpebble.stormcrawler.protocol.file
 
com.digitalpebble.stormcrawler.protocol.httpclient - package com.digitalpebble.stormcrawler.protocol.httpclient
 
com.digitalpebble.stormcrawler.protocol.okhttp - package com.digitalpebble.stormcrawler.protocol.okhttp
 
com.digitalpebble.stormcrawler.protocol.selenium - package com.digitalpebble.stormcrawler.protocol.selenium
 
com.digitalpebble.stormcrawler.proxy - package com.digitalpebble.stormcrawler.proxy
 
com.digitalpebble.stormcrawler.spout - package com.digitalpebble.stormcrawler.spout
 
com.digitalpebble.stormcrawler.util - package com.digitalpebble.stormcrawler.util
 
CommaSeparatedToMultivaluedMetadata - Class in com.digitalpebble.stormcrawler.parse.filter
Rewrites a single metadata value containing comma-separated values into multiple values for the same key; useful, for instance, for keyword tags.
CommaSeparatedToMultivaluedMetadata() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
 
conf - Variable in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
Configurable - Interface in com.digitalpebble.stormcrawler.util
An interface marking the implementing class as initializable and configurable via Configurable.createConfiguredInstance(String, Class, Map, JsonNode). The implementing class HAS to implement an empty constructor.
ConfigurableTopology - Class in com.digitalpebble.stormcrawler
 
ConfigurableTopology() - Constructor for class com.digitalpebble.stormcrawler.ConfigurableTopology
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
configure(Map<String, Object>, JsonNode) - Method in interface com.digitalpebble.stormcrawler.util.Configurable
Called when this filter is being initialized
configure(Map<String, Object>, JsonNode, Class<T>, String) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
configure(Map<String, Object>, JsonNode, String) - Method in class com.digitalpebble.stormcrawler.util.AbstractConfigurable
 
configure(Map<String, Object>, JsonNode, String) - Method in interface com.digitalpebble.stormcrawler.util.Configurable
Called when this filter is being initialized
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
 
configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
configure(MultiProxyManager.ProxyRotation, String[]) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
configure(Map) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
 
configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
 
configure(Map<String, Object>) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
 
configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
 
configure(Config) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
configure(Config) - Method in interface com.digitalpebble.stormcrawler.proxy.ProxyManager
 
configure(Config) - Method in class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
 
ConfUtils - Class in com.digitalpebble.stormcrawler.util
 
Constants - Class in com.digitalpebble.stormcrawler
 
containsKey(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
 
containsKey(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
containsKeyWithValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
CONTENT_DISPOSITION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_ENCODING - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LANGUAGE - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LENGTH - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_LOCATION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_MD5 - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CONTENT_TYPE - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
CookieConverter - Class in com.digitalpebble.stormcrawler.util
Helper to extract cookies from cookies string.
CookieConverter() - Constructor for class com.digitalpebble.stormcrawler.util.CookieConverter
 
createConfiguredInstance(Class<?>, Class<T>, Map<String, Object>, JsonNode) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
createConfiguredInstance(String, Class<T>, Map<String, Object>, JsonNode) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
Used by classes like URLFilters and ParseFilters to load the configuration of utilized filters from the provided JSON config.
createInstance(Map<String, Object>) - Static method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Returns a URLBuffer instance based on the configuration.
createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
 
createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
Creates a new RegexRule.
customHeaders - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
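
The CommaSeparatedToMultivaluedMetadata entry in this section describes turning one comma-separated metadata value into several values under the same key. A rough sketch of that transformation in plain Java might look as follows; it does not use the StormCrawler Metadata class, and the class name `CommaSplitSketch`, the method `splitValues`, and the choice to trim whitespace and drop empty tokens are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the StormCrawler parse filter itself.
final class CommaSplitSketch {

    /** Splits a raw value on commas, trimming each token and skipping empty ones. */
    static List<String> splitValues(String raw) {
        List<String> values = new ArrayList<>();
        for (String token : raw.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // A keywords value as it might appear in a single metadata entry.
        System.out.println(splitValues("storm, crawler ,java")); // prints [storm, crawler, java]
    }
}
```

In a real filter the resulting list would presumably replace the original single value for the same key, for example through a multi-value setter on the metadata object.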
 

D

deactivate() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
deactivate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
deactivate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
DebugParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Dumps the DOM representation of a document into a file
DebugParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
DEFAULT_CHARSET - Static variable in class com.digitalpebble.stormcrawler.util.CharsetIdentification
 
defaultfetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
defaultFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
DefaultScheduler - Class in com.digitalpebble.stormcrawler.persistence
Schedules a nextFetchDate based on the configuration
DefaultScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
DELAY_METADATA - Static variable in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
Key used to pass a custom delay via metadata.
DelegatorProtocol - Class in com.digitalpebble.stormcrawler.protocol
Protocol implementation that enables selection from a collection of sub-protocols using filters based on each call's metadata and URL.
DelegatorProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
DELETION_STREAM_NAME - Static variable in class com.digitalpebble.stormcrawler.Constants
 
depthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking the depth
deserialize(ByteBuffer) - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
 
DISCONNECT - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
network disconnect or timeout during fetch
DISCOVERED - com.digitalpebble.stormcrawler.persistence.Status
 
dnsEnd(Call, String, List<InetAddress>) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
 
DNSResolutionListener - Class in com.digitalpebble.stormcrawler.protocol.okhttp
 
DNSResolutionListener(Map<String, Long>) - Constructor for class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
 
dnsStart(Call, String) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
 
DocumentFragmentBuilder - Class in com.digitalpebble.stormcrawler.parse
Adapted from org.jsoup.helper.W3CDom but does not transfer namespaces.
DocumentFragmentBuilder.W3CBuilder - Class in com.digitalpebble.stormcrawler.parse
Implements the conversion by walking the input.
DomainParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Adds domain (or host) to metadata; can be used later on for indexing.
DomainParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
drivers - Variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
DummyIndexer - Class in com.digitalpebble.stormcrawler.indexing
Any tuple that went through all the previous bolts is sent to the status stream with a Status of FETCHED.
DummyIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 

E

emitOutlink(Tuple, URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
Used for redirections or when discovering sitemap URLs.
empty - Static variable in class com.digitalpebble.stormcrawler.Metadata
 
EMPTY_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
emptyNavigationFilters - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
emptyParseFilter - Static variable in class com.digitalpebble.stormcrawler.parse.JSoupFilters
 
emptyParseFilter - Static variable in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
emptyQueue(String) - Method in interface com.digitalpebble.stormcrawler.persistence.EmptyQueueListener
 
EmptyQueueListener - Interface in com.digitalpebble.stormcrawler.persistence
Used by URLBuffer to inform the spouts when a queue has no more URLs in it
emptyURLFilters - Static variable in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
equals(Object) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
ERROR - com.digitalpebble.stormcrawler.persistence.Status
 
ERRORCACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
errorcacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Parameter name to configure the cache for robots errors (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterWrite=1h".
errorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
eventCounter - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
EXCLUDE_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
execute(Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
expressions - Variable in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
 
expressions - Variable in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
extractConfigElement(Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
If the config consists of a single key 'config', its values are used instead
extractMetaTags(String) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
Extracts meta tags based on the value of the content attribute.
extractMetaTags(DocumentFragment) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
extractRefreshURL(String) - Static method in class com.digitalpebble.stormcrawler.util.RefreshTag
 
extractRefreshURL(Document) - Static method in class com.digitalpebble.stormcrawler.util.RefreshTag
 
extractResult(TimeReducerState) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
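
The extractConfigElement(Map) entry in this section states that when the config consists of a single key 'config', its values are used instead. The sketch below shows one way such unwrapping could behave, in plain Java; it is not the ConfUtils source, and the class `ConfigUnwrapSketch`, the method `unwrap`, and the decision to return the original map when the inner value is not itself a map are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the StormCrawler ConfUtils code.
final class ConfigUnwrapSketch {

    /** Returns the nested map if conf has exactly one key, "config", holding a map; otherwise conf. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> unwrap(Map<String, Object> conf) {
        if (conf.size() == 1) {
            Object inner = conf.get("config");
            if (inner instanceof Map) {
                return (Map<String, Object>) inner;
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("example.key", "value"); // placeholder key, not a real StormCrawler setting
        Map<String, Object> outer = new HashMap<>();
        outer.put("config", inner);
        System.out.println(unwrap(outer).containsKey("example.key")); // prints true
    }
}
```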
 

F

fail(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
fail(Object) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
FastURLFilter - Class in com.digitalpebble.stormcrawler.filtering.regex
URL filter based on regex patterns and organised by [host | domain | metadata | global].
FastURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
FeedParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Extracts URLs from feeds
FeedParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
FETCH_ERROR - com.digitalpebble.stormcrawler.persistence.Status
 
FETCH_INTERVAL_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Key to store the current fetch interval value, must be listed in "metadata.persist".
FETCHED - com.digitalpebble.stormcrawler.persistence.Status
 
FetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
A multithreaded, queue-based fetcher adapted from Apache Nutch.
FetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
fetchErrorCountParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
fetchErrorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
fetchIntervalDecRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
fetchIntervalIncRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
fetchRobotsMd - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
fieldNameForText() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the field name to use for the text or null if the text must not be indexed
fieldNameForURL() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the field name to use for the URL or null if the URL must not be indexed
FileProtocol - Class in com.digitalpebble.stormcrawler.protocol.file
 
FileProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
FileResponse - Class in com.digitalpebble.stormcrawler.protocol.file
 
FileResponse(String, Metadata, FileProtocol) - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileResponse
 
FileSpout - Class in com.digitalpebble.stormcrawler.spout
Reads the lines from a UTF-8 file and uses them as a spout.
FileSpout(boolean, String...) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
FileSpout(String...) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
FileSpout(String, String) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
FileSpout(String, String, boolean) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
 
filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
 
filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
 
filter(String, byte[], Document, ParseResult) - Method in interface com.digitalpebble.stormcrawler.parse.JSoupFilter
Called when parsing a specific page
filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
The end result comes from the first filter to return non-null
filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
This function does the replacements by iterating through all the regex patterns.
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter
 
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilter
Returns null if the URL is to be removed or a normalised representation which can correspond to the input URL
filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
filter(Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Determine which metadata should be persisted for a given document including those which are not necessarily transferred to the outlinks
filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
 
filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.MimeTypeNormalization
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
Called when parsing a specific page
filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
filterDocument(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
filterJson(Document) - Static method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
 
filterJson(DocumentFragment) - Static method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
filterMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns a mapping field name / values for the metadata to index
filterOutlink(URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
filterPathRepet(String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
 
FORBID_ALL_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
formatHttpDate(String) - Static method in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
Format an ISO date string as HTTP date used in HTTP headers, e.g.,
foundSitemapKey - Static variable in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
Loads and configures the NavigationFilters based on the Storm config if there is one; otherwise returns an empty NavigationFilters.
fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.filtering.URLFilters
Loads and configures the URLFilters based on the Storm config if there is one; otherwise returns an empty URLFilter.
fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
Loads and configures the JSoupFilters based on the Storm config if there is one; otherwise returns an empty JSoupFilter.
fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.parse.ParseFilters
Loads and configures the ParseFilters based on the Storm config if there is one; otherwise returns an empty ParseFilter.
fromHTTPCode(int) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Maps the HTTP Code to FETCHED, FETCH_ERROR or REDIRECTION
fromJsoup(Document) - Static method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder
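
The formatHttpDate(String) entry above describes turning an ISO date string into the date format used in HTTP headers. A minimal sketch of that kind of conversion, using only java.time, could look like the following. The class name HttpDateSketch, the method toHttpDate and the exact pattern string are assumptions made for illustration; this is not the StormCrawler implementation, and its behaviour on malformed input may differ.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of StormCrawler.
class HttpDateSketch {

    // Fixed-width HTTP date pattern, always rendered in GMT with English names.
    static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Parses an ISO-8601 instant such as "2020-01-02T03:04:05Z" and
    // formats it as an HTTP date. Throws DateTimeParseException if the
    // input is not a valid ISO instant.
    static String toHttpDate(String isoInstant) {
        Instant instant = Instant.parse(isoInstant);
        return HTTP_DATE.format(instant);
    }
}
```

Under these assumptions, HttpDateSketch.toHttpDate("2020-01-02T03:04:05Z") yields "Thu, 02 Jan 2020 03:04:05 GMT", a value that could be placed in an "If-Modified-Since" request header.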
 

G

get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getAddress() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getAgentString(Config) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
getAnchor() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getArea() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getBoolean(Map<String, Object>, String, boolean) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getBoolean(Map<String, Object>, String, String, String, boolean) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns the value for prefix + optional + suffix; if nothing is found, returns the value for prefix + suffix, and if that fails too, the default value.
getCacheKey(URL) - Static method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
Compose unique key to store and access robot rules in cache for given URL
getCharset(Metadata, byte[], int) - Static method in class com.digitalpebble.stormcrawler.util.CharsetIdentification
Identifies the charset of a document based on the following logic: guess from the ByteOrderMark - else if the same charset is specified in the http headers and the html metadata then use it - otherwise use ICU's charset detector to make an educated guess and if that fails too returns UTF-8.
getCharsetFast(Metadata, byte[], int) - Static method in class com.digitalpebble.stormcrawler.util.CharsetIdentification
Identifies the charset of a document based on the following logic: guess from the ByteOrderMark - else return any charset specified in the http headers if any, otherwise return the one from the html metadata; finally use ICU's charset detector to make an educated guess and if that fails too returns UTF-8.
getClassFor(String, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
Retrieves a class-instance for qualifiedClassName extending T.
getComponentConfiguration() - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
getConf() - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
getContent() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getContent() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getContentLengthFetched() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
Returns the number of bytes fetched per request when not cached
getCookies(String[], URL) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
Get a list of cookies based on the cookies string taken from response header and the target url.
getCountry() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getCrawlDelay() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
getDocumentID(Metadata, String) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Get the document id.
getDocumentID(Metadata, String) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Get the document id.
getEncoding() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getFirstValue(Metadata, String...) - Static method in class com.digitalpebble.stormcrawler.Metadata
Returns the first non empty value found for the keys or null if none found.
getFirstValue(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getFirstValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getFloat(Map<String, Object>, String, float) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getHost(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Returns the lowercased hostname for the url or null if the url is not well formed.
getHostSegments(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Splits the hostname of the URL on "."
getHostSegments(URL) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Splits the hostname of the URL on "."
getInstance(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.persistence.Scheduler
Returns a Scheduler instance based on the configuration
getInstance(Map<String, Object>) - Static method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Deprecated.
getInstance(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
getInstance(Config) - Static method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
 
getInt(Map<String, Object>, String, int) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getInt(Map<String, Object>, String, String, String, int) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns the value for prefix + optional + suffix; if nothing is found, returns the value for prefix + suffix, and if that fails too, the default value.
getKey() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
 
getLocation() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getLong(Map<String, Object>, String, long) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getMetadata() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getMetaForOutlink(String, String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Determine which metadata should be transferred to an outlink.
getName() - Method in class com.digitalpebble.stormcrawler.util.AbstractConfigurable
 
getName() - Method in interface com.digitalpebble.stormcrawler.util.Configurable
 
getOutlinks() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getOutputFields() - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
 
getPage(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Returns the page for the url.
getParseMap() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
getPartition(String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
Returns the host, domain, IP of a URL so that it can be partitioned for politeness, depending on the value of the config partition.url.mode.
getPartition(String, Metadata, String) - Static method in class com.digitalpebble.stormcrawler.util.URLPartitioner
Returns the host, domain, IP of a URL so that it can be partitioned for politeness, depending on the value of the parameter partitionMode.
getPassword() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getPort() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getProtocol() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getProtocol(String) - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
Returns instance(s) of the implementation for the protocol passed as argument.
getProtocol(URL) - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
Returns an instance of the protocol to use for a given URL
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
 
getProtocolOutput(String, Metadata) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
Fetches the content and additional metadata
getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
getProxy(Metadata) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
getProxy(Metadata) - Method in interface com.digitalpebble.stormcrawler.proxy.ProxyManager
 
getProxy(Metadata) - Method in class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
 
getResourceFile() - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
getResourceFile() - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
getResourceFile() - Method in interface com.digitalpebble.stormcrawler.JSONResource
 
getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
 
getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
 
getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
getRobotRules(String) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
getRobotRulesSet(Protocol, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
Gets the rules from robots.txt which apply to the given URL.
getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
getRobotRulesSetFromCache(URL) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
Returns the robots rules from the cache or empty rules if not found
getSitemaps() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
getStatus() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getStatusCode() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
getString(Map<String, Object>, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getString(Map<String, Object>, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
getString(Map<String, Object>, String, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns the value for prefix + optional + suffix; if nothing is found, returns the value for prefix + suffix, or null.
getString(Map<String, Object>, String, String, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns the value for prefix + optional + suffix; if nothing is found, returns the value for prefix + suffix, and if that fails too, the default value.
getTargetURL() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
getText() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getTimeLastQuerySent() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
getUsage() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
Retrieves the current usage of the proxy
getUsername() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
getValue() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
 
getValueAndReset() - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
 
getValues(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getValues(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
getValues(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
getValues(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
guessMimeType(String, String, byte[]) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
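
Several ConfUtils entries above describe a lookup that tries prefix + optional + suffix first, then prefix + suffix, then a default value. The fallback order can be sketched as follows. ConfFallbackSketch and the configuration keys used in the usage note are hypothetical names chosen for this example; the sketch only mirrors the behaviour stated in the descriptions and is not the library code.

```java
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
class ConfFallbackSketch {

    // Looks up an int with the fallback order described above:
    // 1. prefix + optional + suffix
    // 2. prefix + suffix
    // 3. defaultValue
    // Assumes that stored values, when present, are Numbers.
    static int getInt(Map<String, Object> conf, String prefix,
                      String optional, String suffix, int defaultValue) {
        Object value = conf.get(prefix + optional + suffix);
        if (value == null) {
            value = conf.get(prefix + suffix);
        }
        if (value == null) {
            return defaultValue;
        }
        return ((Number) value).intValue();
    }
}
```

With a map holding "sketch.special.size" = 5 and "sketch.size" = 3, the call getInt(conf, "sketch.", "special.", "size", 1) returns 5, the same call with "other." as the optional part returns 3, and a prefix with no matching key returns the default 1.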
 

H

handleResponse(HttpResponse) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
handleResponseWithContentLimit(HttpResponse, int) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
hashCode() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
hasNext() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
hasNext() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Implementations of this method should be synchronised
head(Node, int) - Method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
 
HostURLFilter - Class in com.digitalpebble.stormcrawler.filtering.host
Filters URLs based on the hostname.
HostURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
 
HTTP_DATE_FORMATTER - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
Formatter for dates in HTTP headers, used to fill the "If-Modified-Since" request header field, e.g.
HttpHeaders - Class in com.digitalpebble.stormcrawler.protocol
A collection of HTTP header names and utilities around header values.
HttpProtocol - Class in com.digitalpebble.stormcrawler.protocol.httpclient
Uses Apache httpclient to handle http and https
HttpProtocol - Class in com.digitalpebble.stormcrawler.protocol.okhttp
 
HttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
HttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
 
HttpRobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
Parses robots rules for URLs belonging to the HTTP protocol.
HttpRobotRulesParser(Config) - Constructor for class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 

I

ignoreEmptyFields() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
ignoreEmptyFieldValueParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Indicates that empty field values should not be emitted at all.
in_buffer - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
inCache() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
 
INCLUDE_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
 
incrementUsage() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
Increments the usage tracker for the proxy
init() - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
Configuration of the scheduler based on the config.
InitialisationUtil - Class in com.digitalpebble.stormcrawler.util
 
initializeFromClass(Class<?>, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
Initializes a class from clazz as type superClass.
initializeFromClass(Class<? extends T>) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
Initializes a class from clazz of type T.
initializeFromQualifiedName(String, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
Initializes a class from qualifiedClassName as type superClass.
InProcessMap(long, TimeUnit) - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
 
INTERNAL - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
implementation internal reason
INTERVAL_DEC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (float) to set the decrement rate.
INTERVAL_INC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (float) to set the increment rate.
INTERVAL_MAX - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (int) to set the maximum fetch interval in minutes.
INTERVAL_MIN - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (int) to set the minimum fetch interval in minutes.
isAllowAll() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isAllowed(String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isAllowNone() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isDeferVisits() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
isEmpty() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
isFeedKey - Static variable in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
isInQuery - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Required for implementations doing asynchronous calls
isNoCache() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
isNoFollow() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
isNoIndex() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
 
ISO_INSTANT_FORMATTER - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
Formatter to parse ISO-formatted dates persisted in status index
isSitemapKey - Static variable in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
iterator() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 

J

JSONResource - Interface in com.digitalpebble.stormcrawler
Defines a generic behaviour for ParseFilters or URLFilters to load resources from a JSON file.
JSoupFilter - Interface in com.digitalpebble.stormcrawler.parse
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
JSoupFilters - Class in com.digitalpebble.stormcrawler.parse
Wrapper for the JSoupFilters defined in a JSON configuration
JSoupFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.parse.JSoupFilters
Loads the filters from a JSON configuration file
JSoupParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Parser for HTML documents only which uses ICU4J to detect the charset encoding.
JSoupParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 

K

keySet() - Method in class com.digitalpebble.stormcrawler.Metadata
 
KeyValue(String, String) - Constructor for class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
 

L

LAST_MODIFIED - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
lastTimeResetToNOW - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
LDJsonParseFilter - Class in com.digitalpebble.stormcrawler.jsoup
Extracts data from JSON-LD representation (https://json-ld.org/).
LDJsonParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Extracts data from JSON-LD representation (https://json-ld.org/)
LDJsonParseFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
 
LDJsonParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
LEAST_USED - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
 
LENGTH - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
fetch exceeded configured http.content.limit
LinkParseFilter - Class in com.digitalpebble.stormcrawler.jsoup
ParseFilter to extract additional links with XPath; can be configured with e.g.
LinkParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
ParseFilter to extract additional links with XPath; can be configured with e.g.
LinkParseFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
 
LinkParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
 
listener - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
loadConf(String, Config) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
 
loadJSONResources() - Method in interface com.digitalpebble.stormcrawler.JSONResource
Load the resources from the JSON file in the uber jar
loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
 
loadJSONResources(InputStream) - Method in interface com.digitalpebble.stormcrawler.JSONResource
Load the resources from an input stream
loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
 
loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
 
loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
loadListFromConf(String, String, String, Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns one or more Strings, regardless of whether they are represented as a single String or a list in the config, for the combination of the String parameters.
loadListFromConf(String, Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
Returns one or more Strings, regardless of whether they are represented as a single String or a list in the config, or an empty List if no value could be found for that key.
LOCATION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
lock() - Method in class com.digitalpebble.stormcrawler.Metadata
Prevents modifications to the metadata object.
LOG - Static variable in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
LOG - Static variable in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
 
LOG - Static variable in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
 
LOG - Static variable in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
LOG - Static variable in class com.digitalpebble.stormcrawler.proxy.SCProxy
 
LOG - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
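
The loadListFromConf(String, Map) entry above describes returning one or more Strings whether the config holds a single String or a list, and an empty List when the key is missing. A rough sketch of that normalisation is shown below. ListConfSketch and loadList are hypothetical names for this example, and converting non-String elements with String.valueOf is an assumption of the sketch rather than documented behaviour.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of StormCrawler.
class ListConfSketch {

    // Normalises a config entry to a list of Strings:
    // - missing key         -> empty list
    // - single value        -> one-element list
    // - collection of values -> one element per entry
    static List<String> loadList(String key, Map<String, Object> conf) {
        List<String> result = new ArrayList<>();
        Object value = conf.get(key);
        if (value == null) {
            return result;
        }
        if (value instanceof Collection) {
            for (Object element : (Collection<?>) value) {
                result.add(String.valueOf(element));
            }
        } else {
            result.add(value.toString());
        }
        return result;
    }
}
```

Callers can then iterate over the result without checking whether the configuration file used the scalar or the list form.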
 

M

main(Protocol, String[]) - Static method in interface com.digitalpebble.stormcrawler.protocol.Protocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
Utility method to test rules against an input.
main(String[]) - Static method in class com.digitalpebble.stormcrawler.filtering.URLFilters
Utility to check the filtering of a URL
main(String[]) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
Used for quick testing + debugging
main(String[]) - Static method in class com.digitalpebble.stormcrawler.parse.ParseFilters
Used for quick testing + debugging
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
 
main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
markQueryReceivedNow() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Sets the marker that we are in a query to false and timeLastQueryReceived to now
match(String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Checks if a url matches this rule.
MAX_ARRAY_SIZE - Static variable in class com.digitalpebble.stormcrawler.Constants
Maximum array size, safe value on any JVM
maxDelayBetweenQueries - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
MaxDepthFilter - Class in com.digitalpebble.stormcrawler.filtering.depth
Filter out URLs whose depth is greater than maxDepth.
MaxDepthFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
 
maxDepthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking a non-default max depth
maxFetchErrorsParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Number of successive FETCH_ERROR before status changes to ERROR
maxFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
MAXTIMEPARAM - Static variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
 
MD5SignatureParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Computes a signature for a page, based on the binary content or text.
MD5SignatureParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
 
MemorySpout - Class in com.digitalpebble.stormcrawler.spout
Stores URLs in memory.
MemorySpout(boolean, String...) - Constructor for class com.digitalpebble.stormcrawler.spout.MemorySpout
Emits tuples with DISCOVERED status, which is useful when injecting seeds directly to a statusupdaterbolt.
MemorySpout(String...) - Constructor for class com.digitalpebble.stormcrawler.spout.MemorySpout
 
MemoryStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
Use in combination with the MemorySpout for testing in local mode.
MemoryStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
 
Metadata - Class in com.digitalpebble.stormcrawler
Wrapper around Map<String, String[]>
Metadata() - Constructor for class com.digitalpebble.stormcrawler.Metadata
 
Metadata(Map<String, String[]>) - Constructor for class com.digitalpebble.stormcrawler.Metadata
Wraps an existing HashMap into a Metadata object - does not clone the content
metadata2fieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Mapping between metadata keys and field names for indexing. Can be a list of values separated by a = or a single string.
MetadataFilter - Class in com.digitalpebble.stormcrawler.filtering.metadata
Filter out URLs based on metadata in the source document
MetadataFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
 
metadataFilterParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
List of metadata key + values to be used as a filter.
metadataPersistParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating which metadata to persist for a given document but not transfer to outlinks.
MetadataTransfer - Class in com.digitalpebble.stormcrawler.util
Implements the logic of how the metadata should be passed to the outlinks, what should be stored back in the persistence layer etc...
MetadataTransfer() - Constructor for class com.digitalpebble.stormcrawler.util.MetadataTransfer
 
metadataTransferClassParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Class to use for transferring metadata to outlinks.
metadataTransferParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating which metadata to transfer to the outlinks and persist for a given document.
MimeTypeNormalization - Class in com.digitalpebble.stormcrawler.parse.filter
Normalises the MimeType value e.g.
MimeTypeNormalization() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.MimeTypeNormalization
 
minDelayBetweenQueries - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
minFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
MultiProxyManager - Class in com.digitalpebble.stormcrawler.proxy
MultiProxyManager is a ProxyManager implementation for multiple proxy endpoints
MultiProxyManager() - Constructor for class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
MultiProxyManager.ProxyRotation - Enum in com.digitalpebble.stormcrawler.proxy
 

N

NavigationFilter - Class in com.digitalpebble.stormcrawler.protocol.selenium
 
NavigationFilter() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
 
NavigationFilters - Class in com.digitalpebble.stormcrawler.protocol.selenium
Wrapper for the NavigationFilter defined in a JSON configuration
NavigationFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
Loads the filters from a JSON configuration file
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
Specifies whether this filter requires a DOM representation of the document
needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
 
next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
 
next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SimpleURLBuffer
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
next() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
nextTuple() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
NO_TEXT_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
 
normaliseToMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
Adds a normalised representation of the directives in the metadata
NOT_TRIMMED - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
 
numQueues() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
Total number of queues in the buffer
numQueues() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Total number of queues in the buffer

O

onRemoval(String, Object[], RemovalCause) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
 
open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
 
Outlink - Class in com.digitalpebble.stormcrawler.parse
 
Outlink(String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
 
Outlink(String, String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
 
overwriteLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 

P

ParseData - Class in com.digitalpebble.stormcrawler.parse
 
ParseData() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
ParseData(Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
ParseData(String, Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
 
parseExtensionAttributes(SiteMapURL, Metadata) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
ParseFilter - Class in com.digitalpebble.stormcrawler.parse
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
ParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilter
 
ParseFilters - Class in com.digitalpebble.stormcrawler.parse
Wrapper for the ParseFilters defined in a JSON configuration
ParseFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilters
Loads the filters from a JSON configuration file
ParseResult - Class in com.digitalpebble.stormcrawler.parse
 
ParseResult() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
ParseResult(List<Outlink>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
ParseResult(Map<String, ParseData>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
ParseResult(Map<String, ParseData>, List<Outlink>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
 
parseRules(String, byte[], String, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Parses the robots content using the SimpleRobotRulesParser from crawler commons
PARTITION_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.Constants
 
PARTITION_MODEParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
partitioner - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
PerSecondReducer - Class in com.digitalpebble.stormcrawler.util
Used to return an average value per second
PerSecondReducer() - Constructor for class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
populateBuffer() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Method where specific implementations query the storage.
populateBuffer() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
prepare(WorkerTopologyContext, GlobalStreamId, List<Integer>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
PriorityURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
Determines the priority of the buffers based on the number of URLs acked in a configurable period of time.
PriorityURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
 
Protocol - Interface in com.digitalpebble.stormcrawler.protocol
 
PROTOCOL_MD_PREFIX_PARAM - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
PROTOCOL_VERSIONS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Key which holds the protocol version(s) used for this request (for layered protocols this field may hold multiple comma-separated values)
ProtocolFactory - Class in com.digitalpebble.stormcrawler.protocol
 
protocolMDprefix - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
ProtocolResponse - Class in com.digitalpebble.stormcrawler.protocol
 
ProtocolResponse(byte[], int, Metadata) - Constructor for class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
 
ProtocolResponse.TrimmedContentReason - Enum in com.digitalpebble.stormcrawler.protocol
Enum of reasons why protocol content may be trimmed.
protocolVersions - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
proxyCount() - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
proxyManager - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
ProxyManager - Interface in com.digitalpebble.stormcrawler.proxy
Proxy manager is an abstract class specification that details the required interface of a proxy manager
put(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
put(String, String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
Add the key value to the metadata object for a given URL
putAll(Metadata) - Method in class com.digitalpebble.stormcrawler.Metadata
Puts all the metadata into the current instance
putAll(Metadata, String) - Method in class com.digitalpebble.stormcrawler.Metadata
Puts all prefixed metadata into the current instance

Q

queryTimes - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
QUEUE_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
QUEUE_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
QUEUE_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
QUEUED_TIMEOUT_PARAM_KEY - Static variable in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
Acks URLs which have spent too much time in the queue; should be set to a value equal to the topology timeout
queues - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 

R

RANDOM - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
 
REDIRECTION - com.digitalpebble.stormcrawler.persistence.Status
 
reduce(TimeReducerState, Object) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
 
RefreshTag - Class in com.digitalpebble.stormcrawler.util
 
RefreshTag() - Constructor for class com.digitalpebble.stormcrawler.util.RefreshTag
 
RegexRule - Class in com.digitalpebble.stormcrawler.filtering.regex
A generic regular expression rule.
RegexRule(boolean, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
Constructs a new regular expression rule.
RegexURLFilter - Class in com.digitalpebble.stormcrawler.filtering.regex
Filters URLs based on a file of regular expressions using the Java Regex implementation.
RegexURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
 
RegexURLFilterBase - Class in com.digitalpebble.stormcrawler.filtering.regex
An abstract class for implementing Regex URL filtering.
RegexURLFilterBase() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
 
RegexURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.regex
The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string.
RegexURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
 
RemoteDriverProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
Delegates the requests to one or more remote selenium servers.
RemoteDriverProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
 
remove(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
 
remove(String) - Method in class com.digitalpebble.stormcrawler.Metadata
 
REQUEST_HEADERS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Key which holds the verbatim HTTP request headers in metadata (if supported by Protocol implementation and if http.store.headers is true).
REQUEST_TIME_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Key which holds the request time (begin of request) in metadata.
requireSuperClass(Class<?>, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
Asserts the following:
resetFetchDateAfterNSecs - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
resetFetchDateParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Delay in seconds after which the nextFetchDate filter is set to the current time, default 120.
resolveURL(URL, String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
Resolve relative URL-s and fix a few java.net.URL errors in handling of URLs with embedded params and pure query targets.
RESPONSE_COOKIES_HEADER - Static variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
RESPONSE_HEADERS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Key which holds the verbatim HTTP response headers in metadata.
RESPONSE_IP_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Key which holds the IP address of the server the request was sent to (response received from) in metadata.
rng - Variable in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
 
RobotRules - Class in com.digitalpebble.stormcrawler.protocol
Wrapper for BaseRobotRules which tracks the number of requests and length of the responses needed to get the rules.
RobotRules(BaseRobotRules) - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRules
 
RobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
This class uses crawler-commons for handling the parsing of robots.txt files.
RobotRulesParser() - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
 
ROBOTS_NO_CACHE - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
ROBOTS_NO_FOLLOW - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
ROBOTS_NO_FOLLOW_STRICT - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
Whether to interpret the noFollow directive strictly (remove links) or not (remove anchor and do not track original URL).
ROBOTS_NO_INDEX - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
 
RobotsFilter - Class in com.digitalpebble.stormcrawler.filtering.robots
URLFilter which discards URLs based on the robots.txt directives.
RobotsFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
 
RobotsTags - Class in com.digitalpebble.stormcrawler.util
Normalises the robots instructions provided by the HTML meta tags or the HTTP X-Robots-Tag headers.
RobotsTags() - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
 
RobotsTags(Metadata, String) - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
Get the values from the fetch metadata
ROUND_ROBIN - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
 
roundDateParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Used for rounding nextFetchDates.
run(String[]) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
 

S

schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
 
schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
Returns an optional Date indicating when the document should be refetched next, based on its status.
Scheduler - Class in com.digitalpebble.stormcrawler.persistence
 
Scheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.Scheduler
 
schedulerClassParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.Scheduler
Class to use for Scheduler.
SchedulingURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
Checks how long the last N URLs took to work out whether a queue should release a URL.
SchedulingURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
 
SCProxy - Class in com.digitalpebble.stormcrawler.proxy
Proxy class is used as the central interface to proxy-based interactions with a single remote server. The class stores all information relating to the remote server, authentication, and usage activity.
SCProxy(String) - Constructor for class com.digitalpebble.stormcrawler.proxy.SCProxy
Construct a proxy object from a valid proxy connection string
SCProxy(String, String, String, String, String, String, String, String, String) - Constructor for class com.digitalpebble.stormcrawler.proxy.SCProxy
Construct a proxy class from its variables
SeleniumProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
 
SeleniumProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
 
SelfURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
Filters links to self
SelfURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
 
set(String, Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
Set the metadata for a given URL
SET_HEADER_BY_REQUEST - Static variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
SET_LAST_MODIFIED - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Configuration property (boolean) indicating whether to set the "last-modified" metadata field when a page change is detected by signature comparison.
setAnchor(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
 
setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
Set the Configuration object
setContent(byte[]) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setContentLengthFetched(int[]) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
Sets the number of bytes fetched per request when not cached
setCrawlDelay(long) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
setDeferVisits(boolean) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
 
setEmptyQueueListener(EmptyQueueListener) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
 
setEmptyQueueListener(EmptyQueueListener) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
 
setLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
 
setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setOutlinks(List<Outlink>) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
setScheme(Scheme) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
Specify a Scheme for parsing the lines into URLs and Metadata.
setTargetURL(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
setText(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
 
setValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
Set the value for a given key.
setValues(String, String[]) - Method in class com.digitalpebble.stormcrawler.Metadata
 
SIGNATURE_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter.
SIGNATURE_MODIFIED_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Key to store the date when the signature has been changed, must be listed in "metadata.persist".
SIGNATURE_OLD_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
Name of key to hold previous signature: a copy, not overwritten by MD5SignatureParseFilter.
SimpleFetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
A simple fetcher with no internal queues.
SimpleFetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
SimpleURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
Simple implementation of a URLBuffer which rotates on the queues without applying any priority.
SimpleURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.SimpleURLBuffer
 
SingleProxyManager - Class in com.digitalpebble.stormcrawler.proxy
SingleProxyManager is a ProxyManager implementation for a single proxy endpoint
SingleProxyManager() - Constructor for class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
 
SitemapFilter - Class in com.digitalpebble.stormcrawler.filtering.sitemap
URLFilter which discards URLs discovered in a page which is not a sitemap when sitemaps have been found for that site.
SitemapFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter
 
SiteMapParserBolt - Class in com.digitalpebble.stormcrawler.bolt
Extracts URLs from a sitemap file.
SiteMapParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
 
size() - Method in class com.digitalpebble.stormcrawler.Metadata
 
size() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
size() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
Total number of URLs in the buffer
size() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
Total number of URLs in the buffer
skipRobots - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
start(ConfigurableTopology, String[]) - Static method in class com.digitalpebble.stormcrawler.ConfigurableTopology
 
Status - Enum in com.digitalpebble.stormcrawler.persistence
 
STATUS_ERROR_CAUSE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
STATUS_ERROR_MESSAGE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
STATUS_ERROR_SOURCE - Static variable in class com.digitalpebble.stormcrawler.Constants
 
StatusEmitterBolt - Class in com.digitalpebble.stormcrawler.bolt
Provides common functionalities for Bolts which emit tuples to the status stream, e.g.
StatusEmitterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
 
StatusMaxDelayParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Max time to allow between 2 successive queries to the backend.
StatusMinDelayParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Min time to allow between 2 successive queries to the backend.
StatusStreamName - Static variable in class com.digitalpebble.stormcrawler.Constants
 
StatusTTLPurgatory - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
Time in seconds for which acked or failed URLs will be considered for fetching again, default 30 secs.
StdOutIndexer - Class in com.digitalpebble.stormcrawler.indexing
Indexer which generates fields for indexing and sends them to the standard output.
StdOutIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
 
StdOutStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
Dummy status updater which dumps the content of the incoming tuples to the standard output.
StdOutStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
 
store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
 
store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
 
storeHTTPHeaders - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
StringTabScheme - Class in com.digitalpebble.stormcrawler.util
Converts a byte array into URL + metadata
StringTabScheme() - Constructor for class com.digitalpebble.stormcrawler.util.StringTabScheme
 
submit(String, Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
Submits the topology under a specific name
submit(Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
Submits the topology with the name taken from the configuration

T

tail(Node, int) - Method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
 
text(Element) - Method in class com.digitalpebble.stormcrawler.parse.TextExtractor
 
TEXT_MAX_TEXT_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
 
TextExtractor - Class in com.digitalpebble.stormcrawler.parse
Filters the text extracted from HTML documents, used by JSoupParserBolt.
TextExtractor(Map<String, Object>) - Constructor for class com.digitalpebble.stormcrawler.parse.TextExtractor
 
textFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for storing the text of a document
textLengthParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Trim length of text to index.
THROTTLE_STREAM - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
 
TIME - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
Fetch exceeded configured max.
toASCII(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
 
toOutlinks(String, Metadata, Map<String, List<String>>) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
 
toProtocolResponse() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileResponse
 
toString() - Method in class com.digitalpebble.stormcrawler.Metadata
 
toString() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
 
toString() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
 
toString() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
Formats the proxy information into a URL compatible connection string
toString(String) - Method in class com.digitalpebble.stormcrawler.Metadata
Returns a String representation of the metadata with one K/V per line
toUNICODE(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
 
trackDepthParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating whether to track the depth from seed.
trackPathParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Parameter name indicating whether to track the url path or not.
TRANSFER_ENCODING - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
 
traverse(NodeVisitor, Node, int, StringBuilder) - Static method in class com.digitalpebble.stormcrawler.parse.TextExtractor
Starts a depth-first traversal of the root and all of its descendants.
TRIMMED_RESPONSE_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Metadata key which holds a boolean value indicating whether the response content has been trimmed.
TRIMMED_RESPONSE_REASON_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
Metadata key which holds the reason why content has been trimmed, see ProtocolResponse.TrimmedContentReason.
trimText(String) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns a trimmed string or the original one if it is below the threshold set in the configuration.
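
The trimText(String) entry above describes threshold-based trimming. The following is a minimal standalone sketch of that behaviour, not the AbstractIndexerBolt implementation. The class name TrimSketch, the explicit maxLength argument, and the convention that a negative threshold disables trimming are all assumptions made for illustration; in the library the threshold comes from the configuration.

```java
// Hypothetical illustration; not part of StormCrawler.
final class TrimSketch {

    /**
     * Returns text cut to at most maxLength characters, or the original
     * string if it is already short enough.
     * Assumption: a negative maxLength means "do not trim".
     */
    static String trimText(String text, int maxLength) {
        if (text == null) {
            return null; // nothing to trim
        }
        if (maxLength < 0 || text.length() <= maxLength) {
            return text; // at or below the threshold: keep as is
        }
        return text.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(trimText("StormCrawler", 5));  // prints "Storm"
        System.out.println(trimText("Storm", 10));        // prints "Storm"
        System.out.println(trimText("StormCrawler", -1)); // prints "StormCrawler"
    }
}
```

Note that cutting on a character count can split a surrogate pair; a production version might cut on code points instead.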

U

unlock() - Method in class com.digitalpebble.stormcrawler.Metadata
Releases the lock on the metadata.
UNSPECIFIED - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
unknown reason
URLBuffer - Interface in com.digitalpebble.stormcrawler.persistence.urlbuffer
Buffers URLs to be processed into separate queues; used by spouts.
urlFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Field name to use for storing the URL of a document.
URLFilter - Class in com.digitalpebble.stormcrawler.filtering
Unlike Nutch, URLFilters can normalise the URLs as well as filtering them.
URLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.URLFilter
 
URLFilterBolt - Class in com.digitalpebble.stormcrawler.bolt
 
URLFilterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
Relies on the file defined in urlfilters.config.file and applies to all tuples regardless of status.
URLFilterBolt(boolean, String) - Constructor for class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
 
URLFilters - Class in com.digitalpebble.stormcrawler.filtering
Wrapper for the URLFilters defined in a JSON configuration.
URLFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.URLFilters
Loads the filters from a JSON configuration file
URLPartitioner - Class in com.digitalpebble.stormcrawler.util
Generates a partition key for a given URL based on the hostname, domain or IP address.
URLPartitioner() - Constructor for class com.digitalpebble.stormcrawler.util.URLPartitioner
 
URLPartitionerBolt - Class in com.digitalpebble.stormcrawler.bolt
Generates a partition key for a given URL based on the hostname, domain or IP address.
URLPartitionerBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
 
urlPathKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
Metadata key name for tracking the source URLs
URLStreamGrouping - Class in com.digitalpebble.stormcrawler.util
Directs tuples to a specific bolt instance based on the URLPartitioner.
URLStreamGrouping() - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
Groups URLs based on the hostname.
URLStreamGrouping(String) - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
 
URLUtil - Class in com.digitalpebble.stormcrawler.util
Utility class for URL analysis
useCacheParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
Parameter name to indicate whether the internal cache should be used for discovered URLs.
useCookies - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
 
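
The URLPartitioner and URLPartitionerBolt entries above mention partition keys derived from the hostname, domain or IP address. The sketch below illustrates the general idea only and does not reproduce the library's code. The class PartitionKeySketch, the Mode enum and the partitionKey method are hypothetical names. The domain mode simply keeps the last two host labels, which is a deliberate simplification that is wrong for suffixes such as co.uk; the IP mode is left out because it needs a DNS lookup.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration; not part of StormCrawler.
final class PartitionKeySketch {

    /** Which part of the URL the key is built from (assumed names). */
    enum Mode { HOST, DOMAIN }

    /**
     * Returns a partition key for the URL, or null if it has no host.
     * URI.create throws IllegalArgumentException for malformed input.
     */
    static String partitionKey(String url, Mode mode) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (mode == Mode.HOST) {
            return host;
        }
        // Naive domain: keep only the last two labels of the hostname.
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host;
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        String url = "https://www.example.com/page";
        System.out.println(partitionKey(url, Mode.HOST));   // prints "www.example.com"
        System.out.println(partitionKey(url, Mode.DOMAIN)); // prints "example.com"
    }
}
```

URLs that share a key can then be routed to the same worker, which is what makes per-host or per-domain politeness limits possible.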

V

valueForURL(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
Returns the value to be used as the URL for indexing purposes; if present, the canonical value is used instead.
valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
Returns the enum constant of this type with the specified name.
valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
Returns the enum constant of this type with the specified name.
values() - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
Returns an array containing the constants of this enum type, in the order they are declared.
values() - Static method in enum com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
Returns an array containing the constants of this enum type, in the order they are declared.

W

W3CBuilder(HTMLDocumentImpl, DocumentFragment) - Constructor for class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
 

X

XPathFilter - Class in com.digitalpebble.stormcrawler.jsoup
Reads XPath patterns and stores the values found in the web page as metadata.
XPathFilter - Class in com.digitalpebble.stormcrawler.parse.filter
Simple ParseFilter to illustrate and test the interface.
XPathFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.XPathFilter
 
XPathFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
 

_

_collector - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
 
_collector - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
 
_collector - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 
_scheme - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
 