A
- AbstractConfigurable - Class in com.digitalpebble.stormcrawler.util
- AbstractConfigurable() - Constructor for class com.digitalpebble.stormcrawler.util.AbstractConfigurable
- AbstractHttpProtocol - Class in com.digitalpebble.stormcrawler.protocol
- AbstractHttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- AbstractHttpProtocol.KeyValue - Class in com.digitalpebble.stormcrawler.protocol
- AbstractIndexerBolt - Class in com.digitalpebble.stormcrawler.indexing
-
Abstract class to simplify writing IndexerBolts.
- AbstractIndexerBolt() - Constructor for class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
- AbstractQueryingSpout - Class in com.digitalpebble.stormcrawler.persistence
-
Common features of spouts which query a backend to generate tuples.
- AbstractQueryingSpout() - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- AbstractQueryingSpout.InProcessMap<K,V> - Class in com.digitalpebble.stormcrawler.persistence
-
Map which holds elements some additional time after the removal.
- AbstractStatusUpdaterBolt - Class in com.digitalpebble.stormcrawler.persistence
-
Abstract bolt used to store the status of URLs.
- AbstractStatusUpdaterBolt() - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- AbstractURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
-
Abstract class for URLBuffer interface, meant to simplify the code of the implementations and provide some default methods
- AbstractURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- AbstractURLBuffer.URLMetadata - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
- accept() - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
-
Returns whether this rule is used for filtering in or out.
- ack(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- ack(Object) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- ack(Tuple, String) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Must be called by extending classes to store and collect in one go
- acked(String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
- acked(String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
- acked(String) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Notifies the buffer that a URL has been successfully processed; used e.g. to compute an ideal delay for a host queue.
- activate() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- activate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- activate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
- active - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
- AdaptiveScheduler - Class in com.digitalpebble.stormcrawler.persistence
-
Adaptive fetch scheduler: checks by signature comparison whether a re-fetched page has changed. If yes, shrinks the fetch interval down to a minimum fetch interval; if not, increases the fetch interval up to a maximum.
- AdaptiveScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- add(String, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
-
Stores the URL and its Metadata using the hostname as key.
- add(String, Metadata) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Stores the URL and its Metadata using the hostname as key.
- add(String, Metadata, String) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
-
Stores the URL and its Metadata under a given key.
- add(String, Metadata, String) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Stores the URL and its Metadata under a given key.
- add(String, Metadata, Date) - Static method in class com.digitalpebble.stormcrawler.spout.MemorySpout
-
Add a new URL with the given metadata and nextFetch-date
- addHeadersToRequest(Request.Builder, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
- addHeadersToRequest(HttpRequestBase, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- addMeasurement(long) - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
- addSitemap(String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- addValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
- addValues(String, Collection<String>) - Method in class com.digitalpebble.stormcrawler.Metadata
- agentNames - Variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- allow5xx - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
- allowForbidden - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
- AllowRedirParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- allowRedirs() - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- ANCHORS_KEY_NAME - Static variable in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
-
Metadata key name for tracking the anchors
- AS_IS_NEXTFETCHDATE_METADATA - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Key used to pass a preset Date to use as nextFetchDate.
- asMap() - Method in class com.digitalpebble.stormcrawler.Metadata
-
Returns the underlying Map.
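The AdaptiveScheduler entry above describes shrinking or growing the fetch interval depending on whether the page signature changed. The following is a minimal, hypothetical sketch of that idea only; the class name, method signature, rate semantics and bounds are assumptions made for illustration and are not taken from the StormCrawler source.

```java
import java.util.Arrays;

/** Hypothetical sketch; this is NOT the StormCrawler AdaptiveScheduler class. */
public class AdaptiveIntervalSketch {

    /**
     * Returns the next fetch interval in minutes.
     * decRate and incRate are assumed to be fractions of the current interval.
     */
    public static int nextInterval(int current, byte[] oldSignature, byte[] newSignature,
                                   int min, int max, double decRate, double incRate) {
        boolean changed = !Arrays.equals(oldSignature, newSignature);
        if (changed) {
            // Page changed: shrink the interval, but never below the minimum.
            return Math.max(min, (int) Math.round(current * (1.0 - decRate)));
        }
        // Page unchanged: grow the interval, but never above the maximum.
        return Math.min(max, (int) Math.round(current * (1.0 + incRate)));
    }

    public static void main(String[] args) {
        byte[] before = {1, 2, 3};
        byte[] after = {1, 2, 4};
        // Changed page: 1440 minutes shrinks to 720.
        System.out.println(nextInterval(1440, before, after, 60, 43200, 0.5, 0.5));
        // Unchanged page: 1440 minutes grows to 2160.
        System.out.println(nextInterval(1440, before, before.clone(), 60, 43200, 0.5, 0.5));
    }
}
```

Clamping with Math.max and Math.min keeps the interval inside the configured bounds regardless of how many consecutive changes or non-changes are observed.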
B
- BasicURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
-
Simple URL filters: can be used early in the filtering chain.
- BasicURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
- BasicURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.basic
- BasicURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
- BATCH_SIZE - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
- beingProcessed - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Map to keep in-process URLs, with the URL as key and optional value depending on the spout implementation.
- buffer - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- buffer - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
- bufferClassParamName - Static variable in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Implementation to use for URLBuffer.
- build(String) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
C
- CACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- cacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Parameter name to configure the cache (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterAccess=1h".
- cacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
Parameter name to configure the cache for robots (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterWrite=6h".
- canonicalMetadataParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Field name to use for reading the canonical property of the metadata
- CharsetIdentification - Class in com.digitalpebble.stormcrawler.util
- CharsetIdentification() - Constructor for class com.digitalpebble.stormcrawler.util.CharsetIdentification
- checkCustomInterval(Metadata, Status) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
-
Returns the first matching custom interval
- checkDomainMatchToUrl(String, String) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
-
Helper method to check if url matches a cookie domain.
- chooseTasks(int, List<Object>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
- cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- cleanup() - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- cleanup() - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
- cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
- cleanup() - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- close() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- CollectionMetric - Class in com.digitalpebble.stormcrawler.util
- CollectionMetric() - Constructor for class com.digitalpebble.stormcrawler.util.CollectionMetric
- CollectionTagger - Class in com.digitalpebble.stormcrawler.parse.filter
-
Assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file.
- CollectionTagger() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- collector - Variable in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- collector - Variable in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- com.digitalpebble.stormcrawler - package com.digitalpebble.stormcrawler
- com.digitalpebble.stormcrawler.bolt - package com.digitalpebble.stormcrawler.bolt
- com.digitalpebble.stormcrawler.filtering - package com.digitalpebble.stormcrawler.filtering
- com.digitalpebble.stormcrawler.filtering.basic - package com.digitalpebble.stormcrawler.filtering.basic
- com.digitalpebble.stormcrawler.filtering.depth - package com.digitalpebble.stormcrawler.filtering.depth
- com.digitalpebble.stormcrawler.filtering.host - package com.digitalpebble.stormcrawler.filtering.host
- com.digitalpebble.stormcrawler.filtering.metadata - package com.digitalpebble.stormcrawler.filtering.metadata
- com.digitalpebble.stormcrawler.filtering.regex - package com.digitalpebble.stormcrawler.filtering.regex
- com.digitalpebble.stormcrawler.filtering.robots - package com.digitalpebble.stormcrawler.filtering.robots
- com.digitalpebble.stormcrawler.filtering.sitemap - package com.digitalpebble.stormcrawler.filtering.sitemap
- com.digitalpebble.stormcrawler.indexing - package com.digitalpebble.stormcrawler.indexing
- com.digitalpebble.stormcrawler.jsoup - package com.digitalpebble.stormcrawler.jsoup
- com.digitalpebble.stormcrawler.parse - package com.digitalpebble.stormcrawler.parse
- com.digitalpebble.stormcrawler.parse.filter - package com.digitalpebble.stormcrawler.parse.filter
- com.digitalpebble.stormcrawler.persistence - package com.digitalpebble.stormcrawler.persistence
- com.digitalpebble.stormcrawler.persistence.urlbuffer - package com.digitalpebble.stormcrawler.persistence.urlbuffer
- com.digitalpebble.stormcrawler.protocol - package com.digitalpebble.stormcrawler.protocol
- com.digitalpebble.stormcrawler.protocol.file - package com.digitalpebble.stormcrawler.protocol.file
- com.digitalpebble.stormcrawler.protocol.httpclient - package com.digitalpebble.stormcrawler.protocol.httpclient
- com.digitalpebble.stormcrawler.protocol.okhttp - package com.digitalpebble.stormcrawler.protocol.okhttp
- com.digitalpebble.stormcrawler.protocol.selenium - package com.digitalpebble.stormcrawler.protocol.selenium
- com.digitalpebble.stormcrawler.proxy - package com.digitalpebble.stormcrawler.proxy
- com.digitalpebble.stormcrawler.spout - package com.digitalpebble.stormcrawler.spout
- com.digitalpebble.stormcrawler.util - package com.digitalpebble.stormcrawler.util
- CommaSeparatedToMultivaluedMetadata - Class in com.digitalpebble.stormcrawler.parse.filter
-
Rewrites single metadata containing comma separated values into multiple values for the same key, useful for instance for keyword tags.
- CommaSeparatedToMultivaluedMetadata() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
- conf - Variable in class com.digitalpebble.stormcrawler.ConfigurableTopology
- Configurable - Interface in com.digitalpebble.stormcrawler.util
-
An interface marking the implementing class as initializable and configurable via Configurable.createConfiguredInstance(String, Class, Map, JsonNode). The implementing class must provide an empty constructor.
- ConfigurableTopology - Class in com.digitalpebble.stormcrawler
- ConfigurableTopology() - Constructor for class com.digitalpebble.stormcrawler.ConfigurableTopology
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
- configure(Map<String, Object>, JsonNode) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
- configure(Map<String, Object>, JsonNode) - Method in interface com.digitalpebble.stormcrawler.util.Configurable
-
Called when this filter is being initialized
- configure(Map<String, Object>, JsonNode, Class<T>, String) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
- configure(Map<String, Object>, JsonNode, String) - Method in class com.digitalpebble.stormcrawler.util.AbstractConfigurable
- configure(Map<String, Object>, JsonNode, String) - Method in interface com.digitalpebble.stormcrawler.util.Configurable
-
Called when this filter is being initialized
- configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
- configure(Map, JsonNode) - Method in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- configure(MultiProxyManager.ProxyRotation, String[]) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- configure(Map) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
- configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
- configure(Map<String, Object>) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
- configure(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
- configure(Config) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- configure(Config) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- configure(Config) - Method in interface com.digitalpebble.stormcrawler.proxy.ProxyManager
- configure(Config) - Method in class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
- ConfUtils - Class in com.digitalpebble.stormcrawler.util
- Constants - Class in com.digitalpebble.stormcrawler
- containsKey(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
- containsKey(String) - Method in class com.digitalpebble.stormcrawler.Metadata
- containsKeyWithValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
- CONTENT_DISPOSITION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_ENCODING - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_LANGUAGE - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_LENGTH - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_LOCATION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_MD5 - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CONTENT_TYPE - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- CookieConverter - Class in com.digitalpebble.stormcrawler.util
-
Helper to extract cookies from cookies string.
- CookieConverter() - Constructor for class com.digitalpebble.stormcrawler.util.CookieConverter
- createConfiguredInstance(Class<?>, Class<T>, Map<String, Object>, JsonNode) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
-
Calls Configurable.createConfiguredInstance(String, Class, Map, JsonNode) with caller.getName() for configName.
- createConfiguredInstance(String, Class<T>, Map<String, Object>, JsonNode) - Static method in interface com.digitalpebble.stormcrawler.util.Configurable
-
Used by classes like URLFilters and ParseFilters to load the configuration of utilized filters from the provided JSON config.
- createInstance(Map<String, Object>) - Static method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Returns a URLBuffer instance based on the configuration.
- createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
- createRule(boolean, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
-
Creates a new RegexRule.
- customHeaders - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
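The CommaSeparatedToMultivaluedMetadata entry above describes rewriting a single comma-separated metadata value into multiple values for the same key. A minimal sketch of that transformation follows, using a plain Map in place of the StormCrawler Metadata class; the class name CommaSplitSketch and the method splitValues are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; a plain Map stands in for the Metadata class. */
public class CommaSplitSketch {

    /** Replaces the values stored under key with their comma-separated parts. */
    public static void splitValues(Map<String, List<String>> metadata, String key) {
        List<String> current = metadata.get(key);
        if (current == null) {
            return; // nothing stored under this key
        }
        List<String> expanded = new ArrayList<>();
        for (String value : current) {
            for (String token : value.split(",")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    expanded.add(trimmed); // skip empty fragments such as "a,,b"
                }
            }
        }
        metadata.put(key, expanded);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("keywords", new ArrayList<>(Arrays.asList("storm, crawler ,java")));
        splitValues(metadata, "keywords");
        System.out.println(metadata); // prints {keywords=[storm, crawler, java]}
    }
}
```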
D
- deactivate() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- deactivate() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- deactivate() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
- DebugParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
Dumps the DOM representation of a document into a file
- DebugParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- declareOutputFields(OutputFieldsDeclarer) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
- DEFAULT_CHARSET - Static variable in class com.digitalpebble.stormcrawler.util.CharsetIdentification
- defaultfetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- defaultFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- DefaultScheduler - Class in com.digitalpebble.stormcrawler.persistence
-
Schedules a nextFetchDate based on the configuration
- DefaultScheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
- DELAY_METADATA - Static variable in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
-
Key used to pass a custom delay via metadata.
- DelegatorProtocol - Class in com.digitalpebble.stormcrawler.protocol
-
Protocol implementation that enables selection from a collection of sub-protocols using filters based on each call's metadata and URL.
- DelegatorProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- DELETION_STREAM_NAME - Static variable in class com.digitalpebble.stormcrawler.Constants
- depthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Metadata key name for tracking the depth
- deserialize(ByteBuffer) - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
- DISCONNECT - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
network disconnect or timeout during fetch
- DISCOVERED - com.digitalpebble.stormcrawler.persistence.Status
- dnsEnd(Call, String, List<InetAddress>) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
- DNSResolutionListener - Class in com.digitalpebble.stormcrawler.protocol.okhttp
- DNSResolutionListener(Map<String, Long>) - Constructor for class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
- dnsStart(Call, String) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.DNSResolutionListener
- DocumentFragmentBuilder - Class in com.digitalpebble.stormcrawler.parse
-
Adapted from org.jsoup.helper.W3CDom but does not transfer namespaces.
- DocumentFragmentBuilder.W3CBuilder - Class in com.digitalpebble.stormcrawler.parse
-
Implements the conversion by walking the input.
- DomainParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
Adds domain (or host) to metadata; can be used later on for indexing.
- DomainParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
- drivers - Variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- DummyIndexer - Class in com.digitalpebble.stormcrawler.indexing
-
Any tuple that went through all the previous bolts is sent to the status stream with a Status of FETCHED.
- DummyIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.DummyIndexer
E
- emitOutlink(Tuple, URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
-
Used for redirections or when discovering sitemap URLs.
- empty - Static variable in class com.digitalpebble.stormcrawler.Metadata
- EMPTY_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
- emptyNavigationFilters - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
- emptyParseFilter - Static variable in class com.digitalpebble.stormcrawler.parse.JSoupFilters
- emptyParseFilter - Static variable in class com.digitalpebble.stormcrawler.parse.ParseFilters
- emptyQueue(String) - Method in interface com.digitalpebble.stormcrawler.persistence.EmptyQueueListener
- EmptyQueueListener - Interface in com.digitalpebble.stormcrawler.persistence
-
Used by URLBuffer to inform the spouts when a queue has no more URLs in it
- emptyURLFilters - Static variable in class com.digitalpebble.stormcrawler.filtering.URLFilters
- equals(Object) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- ERROR - com.digitalpebble.stormcrawler.persistence.Status
- ERRORCACHE - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- errorcacheConfigParamName - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
Parameter name to configure the cache for robots errors (see http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/cache/CacheBuilderSpec.html). Default value is "maximumSize=10000,expireAfterWrite=1h".
- errorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- eventCounter - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- EXCLUDE_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
- execute(Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- expressions - Variable in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
- expressions - Variable in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
- extractConfigElement(Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
If the config consists of a single key 'config', its values are used instead
- extractMetaTags(String) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
-
Extracts meta tags based on the value of the content attribute.
- extractMetaTags(DocumentFragment) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
- extractRefreshURL(String) - Static method in class com.digitalpebble.stormcrawler.util.RefreshTag
- extractRefreshURL(Document) - Static method in class com.digitalpebble.stormcrawler.util.RefreshTag
- extractResult(TimeReducerState) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
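The extractConfigElement(Map) entry above states that if the config consists of a single key 'config', its values are used instead. The behaviour described can be sketched roughly as below; the class ConfigUnwrapSketch and the method unwrap are hypothetical names, and the actual ConfUtils implementation may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of unwrapping a top-level "config" element. */
public class ConfigUnwrapSketch {

    /** Returns the nested map if conf holds only the key "config", else conf itself. */
    @SuppressWarnings("unchecked")
    public static Map<String, Object> unwrap(Map<String, Object> conf) {
        if (conf.size() == 1) {
            Object inner = conf.get("config");
            if (inner instanceof Map) {
                return (Map<String, Object>) inner;
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("fetcher.threads.number", 10); // illustrative key only
        Map<String, Object> outer = new HashMap<>();
        outer.put("config", inner);
        // The wrapper is dropped and the nested settings are returned.
        System.out.println(unwrap(outer) == inner);
    }
}
```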
F
- fail(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- fail(Object) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- FastURLFilter - Class in com.digitalpebble.stormcrawler.filtering.regex
-
URL filter based on regex patterns and organised by [host | domain | metadata | global].
- FastURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- FeedParserBolt - Class in com.digitalpebble.stormcrawler.bolt
-
Extracts URLs from feeds
- FeedParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
- FETCH_ERROR - com.digitalpebble.stormcrawler.persistence.Status
- FETCH_INTERVAL_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Key to store the current fetch interval value, must be listed in "metadata.persist".
- FETCHED - com.digitalpebble.stormcrawler.persistence.Status
- FetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
-
A multithreaded, queue-based fetcher adapted from Apache Nutch.
- FetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- fetchErrorCountParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- fetchErrorFetchIntervalParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- fetchIntervalDecRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- fetchIntervalIncRate - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- fetchRobotsMd - Variable in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
- fieldNameForText() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Returns the field name to use for the text or null if the text must not be indexed
- fieldNameForURL() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Returns the field name to use for the URL or null if the URL must not be indexed
- FileProtocol - Class in com.digitalpebble.stormcrawler.protocol.file
- FileProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- FileResponse - Class in com.digitalpebble.stormcrawler.protocol.file
- FileResponse(String, Metadata, FileProtocol) - Constructor for class com.digitalpebble.stormcrawler.protocol.file.FileResponse
- FileSpout - Class in com.digitalpebble.stormcrawler.spout
-
Reads the lines from a UTF-8 file and uses them as a spout.
- FileSpout(boolean, String...) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
- FileSpout(String...) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
- FileSpout(String, String) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
- FileSpout(String, String, boolean) - Constructor for class com.digitalpebble.stormcrawler.spout.FileSpout
- filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
- filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
- filter(String, byte[], Document, ParseResult) - Method in interface com.digitalpebble.stormcrawler.parse.JSoupFilter
-
Called when parsing a specific page
- filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
-
The end result comes from the first filter to return non-null
- filter(RemoteWebDriver, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLNormalizer
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
-
This function does the replacements by iterating through all the regex patterns.
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilter
-
Returns null if the URL is to be removed or a normalised representation which can correspond to the input URL
- filter(URL, Metadata, String) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
- filter(Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Determine which metadata should be persisted for a given document including those which are not necessarily transferred to the outlinks
- filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
- filter(String, byte[], Document, ParseResult) - Method in class com.digitalpebble.stormcrawler.jsoup.XPathFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.CommaSeparatedToMultivaluedMetadata
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.DomainParseFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.MimeTypeNormalization
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
-
Called when parsing a specific page
- filter(String, byte[], DocumentFragment, ParseResult) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
- filterDocument(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Determine whether a document should be indexed based on the presence of a given key/value or the RobotsTags.ROBOTS_NO_INDEX directive.
- filterJson(Document) - Static method in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
- filterJson(DocumentFragment) - Static method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- filterMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Returns a mapping field name / values for the metadata to index
- filterOutlink(URL, String, Metadata, String...) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- filterPathRepet(String) - Method in class com.digitalpebble.stormcrawler.filtering.basic.BasicURLFilter
- FORBID_ALL_RULES - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
- formatHttpDate(String) - Static method in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
-
Format an ISO date string as HTTP date used in HTTP headers, e.g.,
- foundSitemapKey - Static variable in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
-
Loads and configures the NavigationFilters based on the storm config if there is one, otherwise returns an empty NavigationFilters.
- fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.filtering.URLFilters
-
Loads and configures the URLFilters based on the storm config if there is one, otherwise returns an empty URLFilter.
- fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
-
Loads and configures the JSoupFilters based on the storm config if there is one, otherwise returns an empty JSoupFilter.
- fromConf(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.parse.ParseFilters
-
Loads and configures the ParseFilters based on the storm config if there is one, otherwise returns an empty ParseFilter.
- fromHTTPCode(int) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
-
Maps the HTTP Code to FETCHED, FETCH_ERROR or REDIRECTION
- fromJsoup(Document) - Static method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder
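The URLFilter.filter(URL, Metadata, String) entry above states that a filter returns null when a URL is to be removed, or a normalised representation otherwise. The following is a minimal sketch of how such a contract could be chained, assuming plain string-to-string steps. The class UrlFilterChainSketch and its methods are hypothetical stand-ins written for illustration; they are not part of StormCrawler and ignore the source URL and Metadata arguments of the real signature.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a chain of URL filters; not StormCrawler's URLFilters class.
class UrlFilterChainSketch {

    // Normalising step: strips a fragment such as "#top" (illustrative only).
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Rejecting step: returns null for anything that is not http or https (illustrative only).
    static String httpOnly(String url) {
        return url.startsWith("http://") || url.startsWith("https://") ? url : null;
    }

    // Assumed default chain used by the demo below.
    static final List<UnaryOperator<String>> DEFAULT_CHAIN =
            List.of(UrlFilterChainSketch::stripFragment, UrlFilterChainSketch::httpOnly);

    // Applies each step in order; a null result means "remove this URL" and ends the chain.
    static String apply(List<UnaryOperator<String>> steps, String url) {
        String current = url;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(apply(DEFAULT_CHAIN, "https://example.com/page#top"));
        System.out.println(apply(DEFAULT_CHAIN, "ftp://example.com/file"));
    }
}
```

In this sketch the first call yields the URL without its fragment and the second yields null, so a caller only has to check for null to know whether a URL survived.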
G
- get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- get(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- getAddress() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getAgentString(Config) - Static method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- getAnchor() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- getArea() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getBoolean(Map<String, Object>, String, boolean) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getBoolean(Map<String, Object>, String, String, String, boolean) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Returns the value for prefix + optional + suffix, if nothing is found then return prefix + suffix and if that fails too, the default value
- getCacheKey(URL) - Static method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
-
Compose unique key to store and access robot rules in cache for given URL
- getCharset(Metadata, byte[], int) - Static method in class com.digitalpebble.stormcrawler.util.CharsetIdentification
-
Identifies the charset of a document based on the following logic: guess from the ByteOrderMark - else if the same charset is specified in the http headers and the html metadata then use it - otherwise use ICU's charset detector to make an educated guess and if that fails too returns UTF-8.
- getCharsetFast(Metadata, byte[], int) - Static method in class com.digitalpebble.stormcrawler.util.CharsetIdentification
-
Identifies the charset of a document based on the following logic: guess from the ByteOrderMark - else return any charset specified in the http headers if any, otherwise return the one from the html metadata; finally use ICU's charset detector to make an educated guess and if that fails too returns UTF-8.
- getClassFor(String, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
-
Retrieves a class-instance for qualifiedClassName extending T.
- getComponentConfiguration() - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- getConf() - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
- getContent() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- getContent() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
- getContentLengthFetched() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
-
Returns the number of bytes fetched per request when not cached
- getCookies(String[], URL) - Static method in class com.digitalpebble.stormcrawler.util.CookieConverter
-
Get a list of cookies based on the cookies string taken from response header and the target url.
- getCountry() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getCrawlDelay() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- getDocumentID(Metadata, String) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Get the document id.
- getDocumentID(Metadata, String) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Get the document id.
- getEncoding() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- getFirstValue(Metadata, String...) - Static method in class com.digitalpebble.stormcrawler.Metadata
-
Returns the first non empty value found for the keys or null if none found.
- getFirstValue(String) - Method in class com.digitalpebble.stormcrawler.Metadata
- getFirstValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
- getFloat(Map<String, Object>, String, float) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getHost(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
-
Returns the lowercased hostname for the url or null if the url is not well formed.
- getHostSegments(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
-
Partitions the hostname of the url by "."
- getHostSegments(URL) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
-
Partitions the hostname of the url by "."
- getInstance(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.persistence.Scheduler
-
Returns a Scheduler instance based on the configuration
- getInstance(Map<String, Object>) - Static method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Deprecated.
- getInstance(Map<String, Object>) - Static method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
- getInstance(Config) - Static method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
- getInt(Map<String, Object>, String, int) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getInt(Map<String, Object>, String, String, String, int) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Returns the value for prefix + optional + suffix, if nothing is found then return prefix + suffix and if that fails too, the default value
- getKey() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
- getLocation() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getLong(Map<String, Object>, String, long) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- getMetadata() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- getMetadata() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
- getMetaForOutlink(String, String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Determine which metadata should be transferred to an outlink.
- getName() - Method in class com.digitalpebble.stormcrawler.util.AbstractConfigurable
- getName() - Method in interface com.digitalpebble.stormcrawler.util.Configurable
- getOutlinks() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- getOutputFields() - Method in class com.digitalpebble.stormcrawler.util.StringTabScheme
- getPage(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
-
Returns the page for the url.
- getParseMap() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- getPartition(String, Metadata) - Method in class com.digitalpebble.stormcrawler.util.URLPartitioner
-
Returns the host, domain, IP of a URL so that it can be partitioned for politeness, depending on the value of the config partition.url.mode.
- getPartition(String, Metadata, String) - Static method in class com.digitalpebble.stormcrawler.util.URLPartitioner
-
Returns the host, domain, IP of a URL so that it can be partitioned for politeness, depending on the value of the parameter partitionMode.
- getPassword() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getPort() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getProtocol() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getProtocol(String) - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
-
Returns instance(s) of the implementation for the protocol passed as argument.
- getProtocol(URL) - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolFactory
-
Returns an instance of the protocol to use for a given URL
- getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
- getProtocolOutput(String, Metadata) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
-
Fetches the content and additional metadata
- getProtocolOutput(String, Metadata) - Method in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- getProxy(Metadata) - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- getProxy(Metadata) - Method in interface com.digitalpebble.stormcrawler.proxy.ProxyManager
- getProxy(Metadata) - Method in class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
- getResourceFile() - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- getResourceFile() - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
- getResourceFile() - Method in interface com.digitalpebble.stormcrawler.JSONResource
- getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
- getResourceFile() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
- getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- getRobotRules(String) - Method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- getRobotRules(String) - Method in interface com.digitalpebble.stormcrawler.protocol.Protocol
- getRobotRulesSet(Protocol, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
-
Get the rules from robots.txt which apply for the given url.
- getRobotRulesSet(Protocol, URL) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- getRobotRulesSetFromCache(URL) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
-
Returns the robots rules from the cache or empty rules if not found
- getSitemaps() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- getStatus() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getStatusCode() - Method in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
- getString(Map<String, Object>, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getString(Map<String, Object>, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- getString(Map<String, Object>, String, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Returns the value for prefix + optional + suffix, if nothing is found then return prefix + suffix or null.
- getString(Map<String, Object>, String, String, String, String) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Returns the value for prefix + optional + suffix, if nothing is found then return prefix + suffix and if that fails too, the default value
- getTargetURL() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- getText() - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- getTimeLastQuerySent() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- getUsage() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
-
Retrieves the current usage of the proxy
- getUsername() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
- getValue() - Method in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
- getValueAndReset() - Method in class com.digitalpebble.stormcrawler.util.CollectionMetric
- getValues(String) - Method in class com.digitalpebble.stormcrawler.Metadata
- getValues(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- getValues(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
- getValues(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- guessMimeType(String, String, byte[]) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
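Several ConfUtils entries above (getBoolean, getInt and getString with prefix, optional and suffix arguments) describe the same lookup order: prefix + optional + suffix first, then prefix + suffix, then the default value. The following is a minimal sketch of that order for an int value, assuming the keys are built by simple concatenation. The class PrefixedConfLookupSketch and the key names "demo.", "fast." and "threads" are made up for illustration and are not taken from the ConfUtils source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper mirroring the documented lookup order; not the ConfUtils source.
class PrefixedConfLookupSketch {

    static int getInt(Map<String, Object> conf, String prefix, String optional,
                      String suffix, int defaultValue) {
        // 1. Most specific key: prefix + optional + suffix.
        Object specific = conf.get(prefix + optional + suffix);
        if (specific instanceof Number) {
            return ((Number) specific).intValue();
        }
        // 2. General key: prefix + suffix.
        Object general = conf.get(prefix + suffix);
        if (general instanceof Number) {
            return ((Number) general).intValue();
        }
        // 3. Nothing configured: fall back to the default.
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("demo.threads", 10);       // general value (made-up key)
        conf.put("demo.fast.threads", 50);  // specific override (made-up key)

        System.out.println(getInt(conf, "demo.", "fast.", "threads", 1)); // specific key wins
        System.out.println(getInt(conf, "demo.", "slow.", "threads", 1)); // falls back to general key
        System.out.println(getInt(conf, "other.", "fast.", "threads", 1)); // falls back to default
    }
}
```

The benefit of this shape is that one general setting covers every component, while a single component can still be overridden by adding its more specific key.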
H
- handleResponse(HttpResponse) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- handleResponseWithContentLimit(HttpResponse, int) - Method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- hashCode() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- hasNext() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- hasNext() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Implementations of this method should be synchronised
- head(Node, int) - Method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
- HostURLFilter - Class in com.digitalpebble.stormcrawler.filtering.host
-
Filters URL based on the hostname.
- HostURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.host.HostURLFilter
- HTTP_DATE_FORMATTER - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
-
Formatter for dates in HTTP headers, used to fill the "If-Modified-Since" request header field, e.g.
- HttpHeaders - Class in com.digitalpebble.stormcrawler.protocol
-
A collection of HTTP header names and utilities around header values.
- HttpProtocol - Class in com.digitalpebble.stormcrawler.protocol.httpclient
-
Uses Apache httpclient to handle http and https
- HttpProtocol - Class in com.digitalpebble.stormcrawler.protocol.okhttp
- HttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- HttpProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
- HttpRobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
-
This class is used for parsing robots for urls belonging to HTTP protocol.
- HttpRobotRulesParser(Config) - Constructor for class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
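The HTTP_DATE_FORMATTER and formatHttpDate(String) entries above concern turning an ISO-formatted date into the date format used in HTTP headers such as "If-Modified-Since". The following is a minimal sketch of such a conversion with java.time, assuming the input is an ISO-8601 instant and the output is the fixed-length HTTP date form in GMT. The class HttpDateSketch, its field and its method are hypothetical; the pattern shown is an assumption and may differ from the one used in HttpHeaders.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch; the actual formatter in HttpHeaders may be defined differently.
class HttpDateSketch {

    // Fixed-length HTTP date, always rendered in GMT with English day and month names.
    static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
                    .withZone(ZoneOffset.UTC);

    // Parses an ISO-8601 instant such as "2015-10-21T07:28:00Z" and formats it for a header.
    static String toHttpDate(String isoInstant) {
        return HTTP_DATE.format(Instant.parse(isoInstant));
    }

    public static void main(String[] args) {
        System.out.println(toHttpDate("2015-10-21T07:28:00Z")); // Wed, 21 Oct 2015 07:28:00 GMT
    }
}
```

Pinning the locale and the zone keeps the output independent of the machine the crawler runs on, which matters because HTTP dates must use English names and GMT.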
I
- ignoreEmptyFields() - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
- ignoreEmptyFieldValueParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Indicates that empty field values should not be emitted at all.
- in_buffer - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- inCache() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
- INCLUDE_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
- incrementUsage() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
-
Increments the usage tracker for the proxy
- init() - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
- init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
- init(Map<String, Object>) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
-
Configuration of the scheduler based on the config.
- InitialisationUtil - Class in com.digitalpebble.stormcrawler.util
- initializeFromClass(Class<?>, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
-
Initializes a class from clazz as type superClass.
- initializeFromClass(Class<? extends T>) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
-
Initializes a class from clazz of type T.
- initializeFromQualifiedName(String, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
-
Initializes a class from qualifiedClassName as type superClass.
- InProcessMap(long, TimeUnit) - Constructor for class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
- INTERNAL - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
implementation internal reason
- INTERVAL_DEC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Configuration property (float) to set the decrement rate.
- INTERVAL_INC_RATE - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Configuration property (float) to set the increment rate.
- INTERVAL_MAX - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Configuration property (int) to set the maximum fetch interval in minutes.
- INTERVAL_MIN - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Configuration property (int) to set the minimum fetch interval in minutes.
- isAllowAll() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- isAllowed(String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- isAllowNone() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- isDeferVisits() - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- isEmpty() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- isFeedKey - Static variable in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
- isInQuery - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Required for implementations doing asynchronous calls
- isNoCache() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
- isNoFollow() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
- isNoIndex() - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
- ISO_INSTANT_FORMATTER - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
-
Formatter to parse ISO-formatted dates persisted in status index
- isSitemapKey - Static variable in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- iterator() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
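The InitialisationUtil entries above describe creating an instance from a qualified class name and treating it as a given super type. The following is a minimal sketch of that idea using standard Java reflection, assuming the target class has a public no-argument constructor. The class ReflectiveInitSketch and the method newInstanceOf are hypothetical names chosen for illustration; the sketch does not reproduce the InitialisationUtil signatures, which also accept additional Class arguments.

```java
// Hypothetical sketch of reflective initialisation; not the InitialisationUtil source.
class ReflectiveInitSketch {

    // Loads the named class, checks it is a subtype of superClass, then instantiates it.
    static <T> T newInstanceOf(String qualifiedClassName, Class<T> superClass) {
        try {
            Class<?> clazz = Class.forName(qualifiedClassName);
            if (!superClass.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(
                        qualifiedClassName + " is not a subtype of " + superClass.getName());
            }
            // Assumes a public no-argument constructor is available.
            return superClass.cast(clazz.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + qualifiedClassName, e);
        }
    }

    public static void main(String[] args) {
        Object list = newInstanceOf("java.util.ArrayList", java.util.List.class);
        System.out.println(list.getClass().getName());
    }
}
```

Checking the type before instantiating means a misconfigured class name fails with a clear message rather than a ClassCastException later in the topology.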
J
- JSONResource - Interface in com.digitalpebble.stormcrawler
-
Defines a generic behaviour for ParseFilters or URLFilters to load resources from a JSON file.
- JSoupFilter - Interface in com.digitalpebble.stormcrawler.parse
-
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
- JSoupFilters - Class in com.digitalpebble.stormcrawler.parse
-
Wrapper for the JSoupFilters defined in a JSON configuration
- JSoupFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.parse.JSoupFilters
-
Loads the filters from a JSON configuration file
- JSoupParserBolt - Class in com.digitalpebble.stormcrawler.bolt
-
Parser for HTML documents only which uses ICU4J to detect the charset encoding.
- JSoupParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
K
- keySet() - Method in class com.digitalpebble.stormcrawler.Metadata
- KeyValue(String, String) - Constructor for class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol.KeyValue
L
- LAST_MODIFIED - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- lastTimeResetToNOW - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- LDJsonParseFilter - Class in com.digitalpebble.stormcrawler.jsoup
-
Extracts data from JSON-LD representation (https://json-ld.org/).
- LDJsonParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
Extracts data from JSON-LD representation (https://json-ld.org/)
- LDJsonParseFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
- LDJsonParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- LEAST_USED - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
- LENGTH - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
fetch exceeded configured http.content.limit
- LinkParseFilter - Class in com.digitalpebble.stormcrawler.jsoup
-
ParseFilter to extract additional links with Xpath can be configured with e.g.
- LinkParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
ParseFilter to extract additional links with Xpath can be configured with e.g.
- LinkParseFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.LinkParseFilter
- LinkParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.LinkParseFilter
- listener - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- loadConf(String, Config) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
- loadJSONResources() - Method in interface com.digitalpebble.stormcrawler.JSONResource
-
Load the resources from the JSON file in the uber jar
- loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.filtering.URLFilters
- loadJSONResources(InputStream) - Method in interface com.digitalpebble.stormcrawler.JSONResource
-
Load the resources from an input stream
- loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.filter.CollectionTagger
- loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
- loadJSONResources(InputStream) - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
- loadListFromConf(String, String, String, Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Return one or more Strings regardless of whether they are represented as a single String or a list in the config for the combination all 2 String parameters.
- loadListFromConf(String, Map) - Static method in class com.digitalpebble.stormcrawler.util.ConfUtils
-
Return one or more Strings regardless of whether they are represented as a single String or a list in the config or an empty List if no value could be found for that key.
- LOCATION - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- lock() - Method in class com.digitalpebble.stormcrawler.Metadata
-
Prevents modifications to the metadata object.
- LOG - Static variable in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- LOG - Static variable in class com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter
- LOG - Static variable in class com.digitalpebble.stormcrawler.jsoup.LDJsonParseFilter
- LOG - Static variable in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- LOG - Static variable in class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- LOG - Static variable in class com.digitalpebble.stormcrawler.proxy.SCProxy
- LOG - Static variable in class com.digitalpebble.stormcrawler.spout.FileSpout
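The loadListFromConf(String, Map) entry above describes returning one or more Strings whether the configuration holds a single String or a list, and an empty List when the key is absent. The following is a minimal sketch of that behaviour, assuming list elements are converted with their string representation. The class ListFromConfSketch and the method loadList are hypothetical and written for illustration; they are not the ConfUtils implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reading a "single value or list" config entry; not the ConfUtils source.
class ListFromConfSketch {

    static List<String> loadList(String key, Map<String, Object> conf) {
        List<String> values = new ArrayList<>();
        Object raw = conf.get(key);
        if (raw instanceof Collection) {
            // The entry is a list: copy every element as a String.
            for (Object item : (Collection<?>) raw) {
                values.add(String.valueOf(item));
            }
        } else if (raw != null) {
            // The entry is a single value: wrap it in a one-element list.
            values.add(raw.toString());
        }
        // A missing key leaves the list empty.
        return values;
    }

    public static void main(String[] args) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("single", "a");                 // made-up keys for the demo
        conf.put("several", List.of("a", "b"));

        System.out.println(loadList("single", conf));
        System.out.println(loadList("several", conf));
        System.out.println(loadList("absent", conf));
    }
}
```

Callers can then iterate over the result without caring how the value was written in the configuration file.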
M
- main(Protocol, String[]) - Static method in interface com.digitalpebble.stormcrawler.protocol.Protocol
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
-
Utility method to test rules against an input.
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.filtering.URLFilters
-
Utility to check the filtering of a URL
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.parse.JSoupFilters
-
Used for quick testing + debugging
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.parse.ParseFilters
-
Used for quick testing + debugging
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.DelegatorProtocol
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.file.FileProtocol
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.httpclient.HttpProtocol
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol
- main(String[]) - Static method in class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
- markQueryReceivedNow() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Sets the marker that we are in a query to false and timeLastQueryReceived to now.
- match(String) - Method in class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
-
Checks if a url matches this rule.
- MAX_ARRAY_SIZE - Static variable in class com.digitalpebble.stormcrawler.Constants
-
Maximum array size, safe value on any JVM
- maxDelayBetweenQueries - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- MaxDepthFilter - Class in com.digitalpebble.stormcrawler.filtering.depth
-
Filter out URLs whose depth is greater than maxDepth.
- MaxDepthFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter
- maxDepthKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Metadata key name for tracking a non-default max depth
- maxFetchErrorsParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Number of successive FETCH_ERROR before status changes to ERROR
- maxFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- MAXTIMEPARAM - Static variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
- MD5SignatureParseFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
Computes a signature for a page, based on the binary content or text.
- MD5SignatureParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.MD5SignatureParseFilter
- MemorySpout - Class in com.digitalpebble.stormcrawler.spout
-
Stores URLs in memory.
- MemorySpout(boolean, String...) - Constructor for class com.digitalpebble.stormcrawler.spout.MemorySpout
-
Emits tuples with DISCOVERED status, which is useful when injecting seeds directly into a status updater bolt.
- MemorySpout(String...) - Constructor for class com.digitalpebble.stormcrawler.spout.MemorySpout
- MemoryStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
-
Use in combination with the MemorySpout for testing in local mode.
- MemoryStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
- Metadata - Class in com.digitalpebble.stormcrawler
-
Wrapper around Map<String, String[]>
- Metadata() - Constructor for class com.digitalpebble.stormcrawler.Metadata
- Metadata(Map<String, String[]>) - Constructor for class com.digitalpebble.stormcrawler.Metadata
-
Wraps an existing HashMap into a Metadata object - does not clone the content
- metadata2fieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Mapping between metadata keys and field names for indexing. Can be a list of values separated by a = or a single string.
- MetadataFilter - Class in com.digitalpebble.stormcrawler.filtering.metadata
-
Filter out URLs based on metadata in the source document
- MetadataFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.metadata.MetadataFilter
- metadataFilterParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
list of metadata key + values to be used as a filter.
- metadataPersistParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Parameter name indicating which metadata to persist for a given document but not transfer to outlinks.
- MetadataTransfer - Class in com.digitalpebble.stormcrawler.util
-
Implements the logic of how the metadata should be passed to the outlinks, what should be stored back in the persistence layer etc...
- MetadataTransfer() - Constructor for class com.digitalpebble.stormcrawler.util.MetadataTransfer
- metadataTransferClassParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Class to use for transferring metadata to outlinks.
- metadataTransferParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Parameter name indicating which metadata to transfer to the outlinks and persist for a given document.
- MimeTypeNormalization - Class in com.digitalpebble.stormcrawler.parse.filter
-
Normalises the MimeType value e.g.
- MimeTypeNormalization() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.MimeTypeNormalization
- minDelayBetweenQueries - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- minFetchInterval - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- MultiProxyManager - Class in com.digitalpebble.stormcrawler.proxy
-
MultiProxyManager is a ProxyManager implementation for multiple proxy endpoints
- MultiProxyManager() - Constructor for class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- MultiProxyManager.ProxyRotation - Enum in com.digitalpebble.stormcrawler.proxy
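The `MultiProxyManager.ProxyRotation` enum listed above has `RANDOM` and `ROUND_ROBIN` constants elsewhere in this index. The following is a minimal sketch of what those two rotation strategies usually mean, not the library's implementation. `ProxyPicker`, its constructor, and the use of plain strings for proxies are hypothetical names introduced for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a StormCrawler class): picks a proxy from a fixed list. */
final class ProxyPicker {
    enum Rotation { RANDOM, ROUND_ROBIN }

    private final List<String> proxies;
    private final Rotation rotation;
    private final Random rng;
    private final AtomicInteger cursor = new AtomicInteger();

    ProxyPicker(List<String> proxies, Rotation rotation, long seed) {
        if (proxies.isEmpty()) {
            throw new IllegalArgumentException("at least one proxy is required");
        }
        this.proxies = new ArrayList<>(proxies); // defensive copy
        this.rotation = rotation;
        this.rng = new Random(seed);
    }

    /** Returns the next proxy according to the configured rotation. */
    String next() {
        if (rotation == Rotation.RANDOM) {
            // Uniform pick; the same proxy may be returned twice in a row.
            return proxies.get(rng.nextInt(proxies.size()));
        }
        // Round robin: walk the list in order and wrap around at the end.
        int i = Math.floorMod(cursor.getAndIncrement(), proxies.size());
        return proxies.get(i);
    }
}
```

Round robin spreads requests evenly across endpoints, while random selection avoids a predictable request pattern at the cost of an uneven short-term distribution.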
N
- NavigationFilter - Class in com.digitalpebble.stormcrawler.protocol.selenium
- NavigationFilter() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilter
- NavigationFilters - Class in com.digitalpebble.stormcrawler.protocol.selenium
-
Wrapper for the NavigationFilter defined in a JSON configuration
- NavigationFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.NavigationFilters
-
loads the filters from a JSON configuration file
- needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.DebugParseFilter
- needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.LDJsonParseFilter
- needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
- needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilter
-
Specifies whether this filter requires a DOM representation of the document
- needsDOM() - Method in class com.digitalpebble.stormcrawler.parse.ParseFilters
- next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
- next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
-
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
- next() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SimpleURLBuffer
-
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
- next() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Retrieves the next available URL, guarantees that the URLs are always perfectly shuffled
- nextTuple() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- nextTuple() - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
- NO_TEXT_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
- normaliseToMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.util.RobotsTags
-
Adds a normalised representation of the directives in the metadata
- NOT_TRIMMED - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
- numQueues() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
-
Total number of queues in the buffer
- numQueues() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Total number of queues in the buffer
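The `next()` and `numQueues()` entries above belong to the `URLBuffer` family, which groups URLs into queues. As a rough sketch of the idea (rotating over per-key queues so that consecutive URLs come from different keys), the class below is a hypothetical stand-in, not one of the StormCrawler buffers. The `RotatingBuffer` name, the `add(queueKey, url)` method, and the choice of returning `null` on an empty buffer are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a buffer that rotates over per-key queues. */
final class RotatingBuffer {
    // Insertion-ordered map: the first entry is the queue served next.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    void add(String queueKey, String url) {
        queues.computeIfAbsent(queueKey, k -> new ArrayDeque<>()).addLast(url);
    }

    /** Returns the next URL, or null when the buffer is empty. */
    String next() {
        Iterator<Map.Entry<String, Deque<String>>> it = queues.entrySet().iterator();
        if (!it.hasNext()) {
            return null;
        }
        Map.Entry<String, Deque<String>> head = it.next();
        it.remove();
        Deque<String> queue = head.getValue();
        String url = queue.pollFirst();
        if (!queue.isEmpty()) {
            // Re-insert at the tail so the other queues are served first.
            queues.put(head.getKey(), queue);
        }
        return url;
    }

    /** Total number of queues in the buffer. */
    int numQueues() {
        return queues.size();
    }

    /** Total number of URLs in the buffer. */
    int size() {
        int total = 0;
        for (Deque<String> q : queues.values()) {
            total += q.size();
        }
        return total;
    }
}
```

Empty queues are dropped as soon as they are drained, so `numQueues()` only counts keys that still have URLs waiting.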
O
- onRemoval(String, Object[], RemovalCause) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
- open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- open(Map<String, Object>, TopologyContext, SpoutOutputCollector) - Method in class com.digitalpebble.stormcrawler.spout.MemorySpout
- Outlink - Class in com.digitalpebble.stormcrawler.parse
- Outlink(String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
- Outlink(String, String) - Constructor for class com.digitalpebble.stormcrawler.parse.Outlink
- overwriteLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
P
- ParseData - Class in com.digitalpebble.stormcrawler.parse
- ParseData() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
- ParseData(Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
- ParseData(String, Metadata) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseData
- parseExtensionAttributes(SiteMapURL, Metadata) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- ParseFilter - Class in com.digitalpebble.stormcrawler.parse
-
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
- ParseFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilter
- ParseFilters - Class in com.digitalpebble.stormcrawler.parse
-
Wrapper for the ParseFilters defined in a JSON configuration
- ParseFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseFilters
-
loads the filters from a JSON configuration file
- ParseResult - Class in com.digitalpebble.stormcrawler.parse
- ParseResult() - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
- ParseResult(List<Outlink>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
- ParseResult(Map<String, ParseData>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
- ParseResult(Map<String, ParseData>, List<Outlink>) - Constructor for class com.digitalpebble.stormcrawler.parse.ParseResult
- parseRules(String, byte[], String, String) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
Parses the robots content using the SimpleRobotRulesParser from crawler commons
- PARTITION_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.Constants
- PARTITION_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.Constants
- PARTITION_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.Constants
- PARTITION_MODEParamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- partitioner - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- PerSecondReducer - Class in com.digitalpebble.stormcrawler.util
-
Used to return an average value per second
- PerSecondReducer() - Constructor for class com.digitalpebble.stormcrawler.util.PerSecondReducer
- populateBuffer() - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Method where specific implementations query the storage.
- populateBuffer() - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FeedParserBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.DummyIndexer
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
- prepare(Map<String, Object>, TopologyContext, OutputCollector) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- prepare(WorkerTopologyContext, GlobalStreamId, List<Integer>) - Method in class com.digitalpebble.stormcrawler.util.URLStreamGrouping
- PriorityURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
-
Determines the priority of the buffers based on the number of URLs acked in a configurable period of time.
- PriorityURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.PriorityURLBuffer
- Protocol - Interface in com.digitalpebble.stormcrawler.protocol
- PROTOCOL_MD_PREFIX_PARAM - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
- PROTOCOL_VERSIONS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Key which holds the protocol version(s) used for this request (for layered protocols this field may hold multiple comma-separated values)
- ProtocolFactory - Class in com.digitalpebble.stormcrawler.protocol
- protocolMDprefix - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- ProtocolResponse - Class in com.digitalpebble.stormcrawler.protocol
- ProtocolResponse(byte[], int, Metadata) - Constructor for class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
- ProtocolResponse.TrimmedContentReason - Enum in com.digitalpebble.stormcrawler.protocol
-
Enum of reasons why protocol content may be trimmed.
- protocolVersions - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- proxyCount() - Method in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- proxyManager - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- ProxyManager - Interface in com.digitalpebble.stormcrawler.proxy
-
Interface that specifies the operations required of a proxy manager
- put(String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- put(String, String, String) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
-
Add the key value to the metadata object for a given URL
- putAll(Metadata) - Method in class com.digitalpebble.stormcrawler.Metadata
-
Puts all the metadata into the current instance
- putAll(Metadata, String) - Method in class com.digitalpebble.stormcrawler.Metadata
-
Puts all prefixed metadata into the current instance
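The `PerSecondReducer` entry in this section is described as returning an average value per second. The arithmetic behind such a reducer can be sketched as follows. `PerSecondAverage` is a hypothetical class written for illustration and does not use Storm's metrics API; timestamps are passed in explicitly so the behaviour is easy to check.

```java
/** Hypothetical sketch: accumulates values and reports their per-second average. */
final class PerSecondAverage {
    private final long startMillis;
    private double total;

    PerSecondAverage(long startMillis) {
        this.startMillis = startMillis;
    }

    /** Adds one observation, e.g. a number of bytes or pages. */
    void add(double value) {
        total += value;
    }

    /** Returns total / elapsed seconds, or 0 when no time has elapsed. */
    double perSecond(long nowMillis) {
        long elapsedMillis = nowMillis - startMillis;
        if (elapsedMillis <= 0) {
            return 0.0; // avoid division by zero
        }
        return total / (elapsedMillis / 1000.0);
    }
}
```

For example, adding 30 twice and reading the average two seconds after the start gives 60 / 2 = 30 per second.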
Q
- queryTimes - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- QUEUE_MODE_DOMAIN - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- QUEUE_MODE_HOST - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- QUEUE_MODE_IP - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- QUEUED_TIMEOUT_PARAM_KEY - Static variable in class com.digitalpebble.stormcrawler.bolt.FetcherBolt
-
Acks URLs which have spent too much time in the queue, should be set to a value equal to the topology timeout
- queues - Variable in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
R
- RANDOM - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
- REDIRECTION - com.digitalpebble.stormcrawler.persistence.Status
- reduce(TimeReducerState, Object) - Method in class com.digitalpebble.stormcrawler.util.PerSecondReducer
- RefreshTag - Class in com.digitalpebble.stormcrawler.util
- RefreshTag() - Constructor for class com.digitalpebble.stormcrawler.util.RefreshTag
- RegexRule - Class in com.digitalpebble.stormcrawler.filtering.regex
-
A generic regular expression rule.
- RegexRule(boolean, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexRule
-
Constructs a new regular expression rule.
- RegexURLFilter - Class in com.digitalpebble.stormcrawler.filtering.regex
-
Filters URLs based on a file of regular expressions using the Java Regex implementation.
- RegexURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilter
- RegexURLFilterBase - Class in com.digitalpebble.stormcrawler.filtering.regex
-
An abstract class for implementing Regex URL filtering.
- RegexURLFilterBase() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLFilterBase
- RegexURLNormalizer - Class in com.digitalpebble.stormcrawler.filtering.regex
-
The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string.
- RegexURLNormalizer() - Constructor for class com.digitalpebble.stormcrawler.filtering.regex.RegexURLNormalizer
- RemoteDriverProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
-
Delegates the requests to one or more remote selenium servers.
- RemoteDriverProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.RemoteDriverProtocol
- remove(Object) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout.InProcessMap
- remove(String) - Method in class com.digitalpebble.stormcrawler.Metadata
- REQUEST_HEADERS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Key which holds the verbatim HTTP request headers in metadata (if supported by Protocol implementation and if http.store.headers is true).
- REQUEST_TIME_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Key which holds the request time (begin of request) in metadata.
- requireSuperClass(Class<?>, Class<? extends T>, Class<?>...) - Static method in class com.digitalpebble.stormcrawler.util.InitialisationUtil
-
Asserts the following:
- resetFetchDateAfterNSecs - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- resetFetchDateParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Delay in seconds after which the nextFetchDate filter is set to the current time, default 120.
- resolveURL(URL, String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
-
Resolves relative URLs and fixes a few java.net.URL errors in handling of URLs with embedded params and pure query targets.
- RESPONSE_COOKIES_HEADER - Static variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- RESPONSE_HEADERS_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Key which holds the verbatim HTTP response headers in metadata.
- RESPONSE_IP_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Key which holds the IP address of the server the request was sent to (response received from) in metadata.
- rng - Variable in class com.digitalpebble.stormcrawler.proxy.MultiProxyManager
- RobotRules - Class in com.digitalpebble.stormcrawler.protocol
-
Wrapper for BaseRobotRules which tracks the number of requests and length of the responses needed to get the rules.
- RobotRules(BaseRobotRules) - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRules
- RobotRulesParser - Class in com.digitalpebble.stormcrawler.protocol
-
This class uses crawler-commons for handling the parsing of robots.txt files.
- RobotRulesParser() - Constructor for class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
- ROBOTS_NO_CACHE - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
- ROBOTS_NO_FOLLOW - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
- ROBOTS_NO_FOLLOW_STRICT - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
-
Whether to interpret the noFollow directive strictly (remove links) or not (remove anchor and do not track original URL).
- ROBOTS_NO_INDEX - Static variable in class com.digitalpebble.stormcrawler.util.RobotsTags
- RobotsFilter - Class in com.digitalpebble.stormcrawler.filtering.robots
-
URLFilter which discards URLs based on the robots.txt directives.
- RobotsFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.robots.RobotsFilter
- RobotsTags - Class in com.digitalpebble.stormcrawler.util
-
Normalises the robots instructions provided by the HTML meta tags or the HTTP X-Robots-Tag headers.
- RobotsTags() - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
- RobotsTags(Metadata, String) - Constructor for class com.digitalpebble.stormcrawler.util.RobotsTags
-
Get the values from the fetch metadata
- ROUND_ROBIN - com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
- roundDateParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Used for rounding nextFetchDates.
- run(String[]) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
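The `RegexURLNormalizer` entry in this section describes normalizing URLs by matching a regular expression and inserting a replacement string. A minimal sketch of that technique with `java.util.regex` is shown below. `RegexNormalizerSketch`, `addRule`, and the two example rules are hypothetical and are not taken from StormCrawler's own rule files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch: applies an ordered list of regex substitutions to a URL. */
final class RegexNormalizerSketch {
    private static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexNormalizerSketch addRule(String regex, String substitution) {
        rules.add(new Rule(Pattern.compile(regex), substitution));
        return this;
    }

    /** Applies every rule in order; each rule sees the output of the previous one. */
    String normalize(String url) {
        String out = url;
        for (Rule rule : rules) {
            out = rule.pattern.matcher(out).replaceAll(rule.substitution);
        }
        return out;
    }
}
```

A caller might chain `addRule("#.*$", "")` to drop fragments and `addRule("(?<!:)/{2,}", "/")` to collapse repeated slashes outside the scheme separator. Rule order matters because each substitution operates on the already rewritten string.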
S
- schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.DefaultScheduler
- schedule(Status, Metadata) - Method in class com.digitalpebble.stormcrawler.persistence.Scheduler
-
Returns an optional Date indicating when the document should be refetched next, based on its status.
- Scheduler - Class in com.digitalpebble.stormcrawler.persistence
- Scheduler() - Constructor for class com.digitalpebble.stormcrawler.persistence.Scheduler
- schedulerClassParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.Scheduler
-
Class to use for Scheduler.
- SchedulingURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
-
Checks how long the last N URLs took to work out whether a queue should release a URL.
- SchedulingURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.SchedulingURLBuffer
- SCProxy - Class in com.digitalpebble.stormcrawler.proxy
-
Proxy class is used as the central interface to proxy-based interactions with a single remote server. The class stores all information relating to the remote server, authentication, and usage activity.
- SCProxy(String) - Constructor for class com.digitalpebble.stormcrawler.proxy.SCProxy
-
Construct a proxy object from a valid proxy connection string
- SCProxy(String, String, String, String, String, String, String, String, String) - Constructor for class com.digitalpebble.stormcrawler.proxy.SCProxy
-
Construct a proxy class from its variables
- SeleniumProtocol - Class in com.digitalpebble.stormcrawler.protocol.selenium
- SeleniumProtocol() - Constructor for class com.digitalpebble.stormcrawler.protocol.selenium.SeleniumProtocol
- SelfURLFilter - Class in com.digitalpebble.stormcrawler.filtering.basic
-
Filters links to self
- SelfURLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.basic.SelfURLFilter
- set(String, Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
-
Set the metadata for a given URL
- SET_HEADER_BY_REQUEST - Static variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- SET_LAST_MODIFIED - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Configuration property (boolean) controlling whether to set the "last-modified" metadata field when a page change is detected by signature comparison.
- setAnchor(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.HttpRobotRulesParser
- setConf(Config) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRulesParser
-
Set the Configuration object
- setContent(byte[]) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- setContentLengthFetched(int[]) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
-
Sets the number of bytes fetched per request when not cached
- setCrawlDelay(long) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- setDeferVisits(boolean) - Method in class com.digitalpebble.stormcrawler.protocol.RobotRules
- setEmptyQueueListener(EmptyQueueListener) - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
- setEmptyQueueListener(EmptyQueueListener) - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
- setLastModified - Variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
- setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- setMetadata(Metadata) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- setOutlinks(List<Outlink>) - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- setScheme(Scheme) - Method in class com.digitalpebble.stormcrawler.spout.FileSpout
-
Specify a Scheme for parsing the lines into URLs and Metadata.
- setTargetURL(String) - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- setText(String) - Method in class com.digitalpebble.stormcrawler.parse.ParseData
- setValue(String, String) - Method in class com.digitalpebble.stormcrawler.Metadata
-
Set the value for a given key.
- setValues(String, String[]) - Method in class com.digitalpebble.stormcrawler.Metadata
- SIGNATURE_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Name of the signature key in metadata, must be defined as "keyName" in the configuration of MD5SignatureParseFilter.
- SIGNATURE_MODIFIED_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Key to store the date when the signature has been changed, must be listed in "metadata.persist".
- SIGNATURE_OLD_KEY - Static variable in class com.digitalpebble.stormcrawler.persistence.AdaptiveScheduler
-
Name of key to hold previous signature: a copy, not overwritten by MD5SignatureParseFilter.
- SimpleFetcherBolt - Class in com.digitalpebble.stormcrawler.bolt
-
A simple fetcher with no internal queues.
- SimpleFetcherBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- SimpleURLBuffer - Class in com.digitalpebble.stormcrawler.persistence.urlbuffer
-
Simple implementation of a URLBuffer which rotates on the queues without applying any priority.
- SimpleURLBuffer() - Constructor for class com.digitalpebble.stormcrawler.persistence.urlbuffer.SimpleURLBuffer
- SingleProxyManager - Class in com.digitalpebble.stormcrawler.proxy
-
SingleProxyManager is a ProxyManager implementation for a single proxy endpoint
- SingleProxyManager() - Constructor for class com.digitalpebble.stormcrawler.proxy.SingleProxyManager
- SitemapFilter - Class in com.digitalpebble.stormcrawler.filtering.sitemap
-
URLFilter which discards URLs discovered in a page which is not a sitemap when sitemaps have been found for that site.
- SitemapFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.sitemap.SitemapFilter
- SiteMapParserBolt - Class in com.digitalpebble.stormcrawler.bolt
-
Extracts URLs from a sitemap file.
- SiteMapParserBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt
- size() - Method in class com.digitalpebble.stormcrawler.Metadata
- size() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- size() - Method in class com.digitalpebble.stormcrawler.persistence.urlbuffer.AbstractURLBuffer
-
Total number of URLs in the buffer
- size() - Method in interface com.digitalpebble.stormcrawler.persistence.urlbuffer.URLBuffer
-
Total number of URLs in the buffer
- skipRobots - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- start(ConfigurableTopology, String[]) - Static method in class com.digitalpebble.stormcrawler.ConfigurableTopology
- Status - Enum in com.digitalpebble.stormcrawler.persistence
- STATUS_ERROR_CAUSE - Static variable in class com.digitalpebble.stormcrawler.Constants
- STATUS_ERROR_MESSAGE - Static variable in class com.digitalpebble.stormcrawler.Constants
- STATUS_ERROR_SOURCE - Static variable in class com.digitalpebble.stormcrawler.Constants
- StatusEmitterBolt - Class in com.digitalpebble.stormcrawler.bolt
-
Provides common functionalities for Bolts which emit tuples to the status stream, e.g.
- StatusEmitterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.StatusEmitterBolt
- StatusMaxDelayParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Max time to allow between 2 successive queries to the backend.
- StatusMinDelayParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Min time to allow between 2 successive queries to the backend.
- StatusStreamName - Static variable in class com.digitalpebble.stormcrawler.Constants
- StatusTTLPurgatory - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
-
Time in seconds for which acked or failed URLs will be considered for fetching again, default 30 secs.
- StdOutIndexer - Class in com.digitalpebble.stormcrawler.indexing
-
Indexer which generates fields for indexing and sends them to the standard output.
- StdOutIndexer() - Constructor for class com.digitalpebble.stormcrawler.indexing.StdOutIndexer
- StdOutStatusUpdater - Class in com.digitalpebble.stormcrawler.persistence
-
Dummy status updater which dumps the content of the incoming tuples to the standard output.
- StdOutStatusUpdater() - Constructor for class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
- store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater
- store(String, Status, Metadata, Optional<Date>, Tuple) - Method in class com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater
- storeHTTPHeaders - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
- StringTabScheme - Class in com.digitalpebble.stormcrawler.util
-
Converts a byte array into URL + metadata
- StringTabScheme() - Constructor for class com.digitalpebble.stormcrawler.util.StringTabScheme
- submit(String, Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
-
Submits the topology under a specific name
- submit(Config, TopologyBuilder) - Method in class com.digitalpebble.stormcrawler.ConfigurableTopology
-
Submits the topology with the name taken from the configuration
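The `StringTabScheme` entry in this section is described as converting a byte array into a URL plus metadata. The sketch below shows one way such a conversion can work, assuming a line layout of the URL followed by tab-separated `key=value` pairs; the index itself does not spell out that layout. `TabLineParser` and `Parsed` are hypothetical names, and a plain `Map` stands in for the `Metadata` class.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: splits a tab-separated line into a URL and key/value metadata. */
final class TabLineParser {
    static final class Parsed {
        final String url;
        final Map<String, List<String>> metadata;

        Parsed(String url, Map<String, List<String>> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Assumed layout: URL, then zero or more tab-separated key=value tokens. */
    static Parsed parse(byte[] line) {
        String[] tokens = new String(line, StandardCharsets.UTF_8).trim().split("\t");
        String url = tokens[0];
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int eq = tokens[i].indexOf('=');
            if (eq <= 0) {
                continue; // skip tokens without a key
            }
            String key = tokens[i].substring(0, eq);
            String value = tokens[i].substring(eq + 1);
            // Repeated keys accumulate values, mirroring a String[] per key.
            metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return new Parsed(url, metadata);
    }
}
```

Keeping a list of values per key matches the `Map<String, String[]>` shape that the `Metadata` entry in this index describes.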
T
- tail(Node, int) - Method in class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
- text(Element) - Method in class com.digitalpebble.stormcrawler.parse.TextExtractor
- TEXT_MAX_TEXT_PARAM_NAME - Static variable in class com.digitalpebble.stormcrawler.parse.TextExtractor
- TextExtractor - Class in com.digitalpebble.stormcrawler.parse
-
Filters the text extracted from HTML documents, used by JSoupParserBolt.
- TextExtractor(Map<String, Object>) - Constructor for class com.digitalpebble.stormcrawler.parse.TextExtractor
- textFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Field name to use for storing the text of a document
- textLengthParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Trim length of text to index.
- THROTTLE_STREAM - Static variable in class com.digitalpebble.stormcrawler.bolt.SimpleFetcherBolt
- TIME - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
fetch exceeded configured max.
- toASCII(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
- toOutlinks(String, Metadata, Map<String, List<String>>) - Method in class com.digitalpebble.stormcrawler.bolt.JSoupParserBolt
- toProtocolResponse() - Method in class com.digitalpebble.stormcrawler.protocol.file.FileResponse
- toString() - Method in class com.digitalpebble.stormcrawler.Metadata
- toString() - Method in class com.digitalpebble.stormcrawler.parse.Outlink
- toString() - Method in class com.digitalpebble.stormcrawler.parse.ParseResult
- toString() - Method in class com.digitalpebble.stormcrawler.proxy.SCProxy
-
Formats the proxy information into a URL compatible connection string
- toString(String) - Method in class com.digitalpebble.stormcrawler.Metadata
-
Returns a String representation of the metadata with one K/V per line
- toUNICODE(String) - Static method in class com.digitalpebble.stormcrawler.util.URLUtil
- trackDepthParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Parameter name indicating whether to track the depth from seed.
- trackPathParamName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Parameter name indicating whether to track the url path or not.
- TRANSFER_ENCODING - Static variable in class com.digitalpebble.stormcrawler.protocol.HttpHeaders
- traverse(NodeVisitor, Node, int, StringBuilder) - Static method in class com.digitalpebble.stormcrawler.parse.TextExtractor
-
Start a depth-first traversal of the root and all of its descendants.
- TRIMMED_RESPONSE_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Metadata key which holds a boolean value indicating whether the response content has been trimmed.
- TRIMMED_RESPONSE_REASON_KEY - Static variable in class com.digitalpebble.stormcrawler.protocol.ProtocolResponse
-
Metadata key which holds the reason why content has been trimmed, see ProtocolResponse.TrimmedContentReason.
- trimText(String) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Returns a trimmed string or the original one if it is below the threshold set in the configuration.
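The trimming behaviour described for trimText(String) can be illustrated with a minimal, self-contained sketch. This is not the StormCrawler implementation: the class name TrimTextSketch, the explicit maxLength parameter, and the convention that a negative value disables trimming are all assumptions made for illustration.

```java
// Hypothetical sketch, not the AbstractIndexerBolt code.
class TrimTextSketch {

    /**
     * Returns the text cut to maxLength characters, or the original
     * text if it is already within the threshold.
     * Assumption: a negative maxLength means "no limit".
     */
    static String trimText(String text, int maxLength) {
        if (text == null) {
            return null;
        }
        if (maxLength < 0 || text.length() <= maxLength) {
            return text; // below the threshold: returned unchanged
        }
        return text.substring(0, maxLength);
    }

    public static void main(String[] args) {
        System.out.println(trimText("abcdef", 3));
        System.out.println(trimText("abc", 10));
    }
}
```

In the real class the threshold comes from the configuration rather than from a method argument.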
U
- unlock() - Method in class com.digitalpebble.stormcrawler.Metadata
-
Releases the lock on the metadata
- UNSPECIFIED - com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
unknown reason
- URLBuffer - Interface in com.digitalpebble.stormcrawler.persistence.urlbuffer
-
Buffers URLs to be processed into separate queues; used by spouts.
- urlFieldParamName - Static variable in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Field name to use for storing the url of a document
- URLFilter - Class in com.digitalpebble.stormcrawler.filtering
-
Unlike Nutch, URLFilters can normalise the URLs as well as filtering them.
- URLFilter() - Constructor for class com.digitalpebble.stormcrawler.filtering.URLFilter
- URLFilterBolt - Class in com.digitalpebble.stormcrawler.bolt
- URLFilterBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
-
Relies on the file defined in urlfilters.config.file and is applied to all tuples regardless of status
- URLFilterBolt(boolean, String) - Constructor for class com.digitalpebble.stormcrawler.bolt.URLFilterBolt
- URLFilters - Class in com.digitalpebble.stormcrawler.filtering
-
Wrapper for the URLFilters defined in a JSON configuration.
- URLFilters(Map<String, Object>, String) - Constructor for class com.digitalpebble.stormcrawler.filtering.URLFilters
-
Loads the filters from a JSON configuration file
- URLPartitioner - Class in com.digitalpebble.stormcrawler.util
-
Generates a partition key for a given URL based on the hostname, domain or IP address.
- URLPartitioner() - Constructor for class com.digitalpebble.stormcrawler.util.URLPartitioner
- URLPartitionerBolt - Class in com.digitalpebble.stormcrawler.bolt
-
Generates a partition key for a given URL based on the hostname, domain or IP address.
- URLPartitionerBolt() - Constructor for class com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt
- urlPathKeyName - Static variable in class com.digitalpebble.stormcrawler.util.MetadataTransfer
-
Metadata key name for tracking the source URLs
- URLStreamGrouping - Class in com.digitalpebble.stormcrawler.util
-
Directs tuples to a specific bolt instance based on the URLPartitioner, e.g.
- URLStreamGrouping() - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
-
Groups URLs based on the hostname
- URLStreamGrouping(String) - Constructor for class com.digitalpebble.stormcrawler.util.URLStreamGrouping
- URLUtil - Class in com.digitalpebble.stormcrawler.util
-
Utility class for URL analysis
- useCacheParamName - Static variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
-
Parameter name to indicate whether the internal cache should be used for discovered URLs.
- useCookies - Variable in class com.digitalpebble.stormcrawler.protocol.AbstractHttpProtocol
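As a rough illustration of what a hostname-based partition key looks like (the idea behind URLPartitioner and URLPartitionerBolt above), the sketch below derives a key from a URL using only the JDK. It does not use the StormCrawler classes; the class name PartitionKeySketch, the method hostKey, and the choice to return an empty string for unparsable URLs are assumptions. Partitioning by domain or IP address is not shown.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical sketch, not the URLPartitioner code.
class PartitionKeySketch {

    /**
     * Returns the lower-cased hostname of the URL, or an empty string
     * if the URL cannot be parsed or has no host (assumed convention).
     */
    static String hostKey(String url) {
        try {
            String host = new URI(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(hostKey("https://Example.COM/a/b?x=1"));
    }
}
```

URLs sharing the same key would then be routed to the same queue or bolt instance, which is what makes per-host politeness possible.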
V
- valueForURL(Tuple) - Method in class com.digitalpebble.stormcrawler.indexing.AbstractIndexerBolt
-
Returns the value to be used as the URL for indexing purposes; if present, the canonical value is used instead
- valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
Returns the enum constant of this type with the specified name.
- valueOf(String) - Static method in enum com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
-
Returns the enum constant of this type with the specified name.
- values() - Static method in enum com.digitalpebble.stormcrawler.persistence.Status
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum com.digitalpebble.stormcrawler.protocol.ProtocolResponse.TrimmedContentReason
-
Returns an array containing the constants of this enum type, in the order they are declared.
- values() - Static method in enum com.digitalpebble.stormcrawler.proxy.MultiProxyManager.ProxyRotation
-
Returns an array containing the constants of this enum type, in the order they are declared.
W
- W3CBuilder(HTMLDocumentImpl, DocumentFragment) - Constructor for class com.digitalpebble.stormcrawler.parse.DocumentFragmentBuilder.W3CBuilder
X
- XPathFilter - Class in com.digitalpebble.stormcrawler.jsoup
-
Reads XPath patterns and stores the values found in the web page as metadata
- XPathFilter - Class in com.digitalpebble.stormcrawler.parse.filter
-
Simple ParseFilter to illustrate and test the interface.
- XPathFilter() - Constructor for class com.digitalpebble.stormcrawler.jsoup.XPathFilter
- XPathFilter() - Constructor for class com.digitalpebble.stormcrawler.parse.filter.XPathFilter
_
- _collector - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractQueryingSpout
- _collector - Variable in class com.digitalpebble.stormcrawler.persistence.AbstractStatusUpdaterBolt
- _collector - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout
- _scheme - Variable in class com.digitalpebble.stormcrawler.spout.FileSpout