All Classes and Interfaces

Class
Description
Abstract class to simplify writing IndexerBolts.
Common features of spouts which query a backend to generate tuples.
Abstract bolt used to store the status of URLs.
Abstract class for the URLBuffer interface, meant to simplify the code of the implementations and provide some default methods.
Adaptive fetch scheduler which checks by signature comparison whether a re-fetched page has changed: if so, it shrinks the fetch interval down to a minimum fetch interval; if not, it increases the fetch interval up to a maximum (a sketch of this interval logic appears after this index).
Simple URL filters: can be used early in the filtering chain.
Assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file.
Rewrites a single metadata entry containing comma-separated values into multiple values for the same key; useful, for instance, for keyword tags.
An interface marking the implementing class as initializable and configurable via Configurable.createConfiguredInstance(String, Class, Map, JsonNode). The implementing class HAS to implement an empty constructor (see the sketch after this index).
Helper to extract cookies from a cookie string.
Dumps the DOM representation of a document into a file.
Schedules a nextFetchDate based on the configuration.
Protocol implementation that enables selection from a collection of sub-protocols using filters based on each call's metadata.
Deprecated: use DelegatorProtocol instead.
Adapted from org.jsoup.helper.W3CDom but does not transfer namespaces.
Implements the conversion by walking the input.
Adds the domain (or host) to the metadata; can be used later on for indexing.
Any tuple that went through all the previous bolts is sent to the status stream with a Status of FETCHED.
Used by URLBuffer to inform the spouts when a queue has no more URLs in it.
URL filter based on regex patterns and organised by [host | domain | metadata | global].
Extracts URLs from feeds.
A multithreaded, queue-based fetcher adapted from Apache Nutch.
Reads the lines from a UTF-8 file and uses them as a spout.
Filters URL based on the hostname.
A collection of HTTP header names and utilities around header values.
Uses Apache HttpClient to handle HTTP and HTTPS.
This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol.
Defines a generic behaviour for ParseFilters or URLFilters to load resources from a JSON file.
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
Wrapper for the JSoupFilters defined in a JSON configuration.
Parser for HTML documents only which uses ICU4J to detect the charset encoding.
Extracts data from JSON-LD representation (https://json-ld.org/).
Extracts data from JSON-LD representation (https://json-ld.org/).
ParseFilter to extract additional links with XPath; can be configured with e.g.
ParseFilter to extract additional links with XPath; can be configured with e.g.
Filter out URLs whose depth is greater than maxDepth.
Computes a signature for a page, based on the binary content or text.
Stores URLs in memory.
Use in combination with the MemorySpout for testing in local mode.
Wrapper around Map<String, String[]>.
Filter out URLs based on metadata in the source document.
Implements the logic of how the metadata should be passed to the outlinks, what should be stored back in the persistence layer, etc.
Normalises the MimeType value e.g.
MultiProxyManager is a ProxyManager implementation for multiple proxy endpoints.
Wrapper for the NavigationFilter defined in a JSON configuration.
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
Wrapper for the ParseFilters defined in a JSON configuration.
Used to return an average value per second.
Determines the priority of the buffers based on the number of URLs acked in a configurable period of time.
Enum of reasons why protocol content may be trimmed.
ProxyManager is an abstract class that specifies the required interface of a proxy manager.
A generic regular expression rule.
Filters URLs based on a file of regular expressions using the Java Regex implementation.
An abstract class for implementing Regex URL filtering.
The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string (a generic sketch of such a rule appears after this index).
Delegates the requests to one or more remote selenium servers.
Wrapper for BaseRobotRules which tracks the number of requests and length of the responses needed to get the rules.
This class uses crawler-commons for handling the parsing of robots.txt files.
URLFilter which discards URLs based on the robots.txt directives.
Normalises the robots instructions provided by the HTML meta tags or the HTTP X-Robots-Tag headers.
Checks how long the last N URLs took to work out whether a queue should release a URL.
The Proxy class is used as the central interface for proxy-based interactions with a single remote server. The class stores all information relating to the remote server, authentication, and usage activity.
Filters links to self.
A simple fetcher with no internal queues.
Simple implementation of a URLBuffer which rotates on the queues without applying any priority.
SingleProxyManager is a ProxyManager implementation for a single proxy endpoint.
URLFilter which discards URLs discovered in a page which is not a sitemap when sitemaps have been found for that site.
Extracts URLs from a sitemap file.
Provides common functionalities for Bolts which emit tuples to the status stream, e.g.
Indexer which generates fields for indexing and sends them to the standard output.
Dummy status updater which dumps the content of the incoming tuples to the standard output.
Converts a byte array into URL + metadata.
Filters the text extracted from HTML documents, used by JSoupParserBolt.
Buffers URLs to be processed into separate queues; used by spouts.
Unlike Nutch, URLFilters can normalise the URLs as well as filtering them.
Wrapper for the URLFilters defined in a JSON configuration.
Generates a partition key for a given URL based on the hostname, domain or IP address.
Generates a partition key for a given URL based on the hostname, domain or IP address.
Directs tuples to a specific bolt instance based on the URLPartitioner, e.g.
Utility class for URL analysis.
Reads XPath patterns and stores the values found in the web page as metadata.
Simple ParseFilter to illustrate and test the interface.
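
To illustrate the adaptive fetch scheduling described earlier in this index, here is a minimal, hypothetical sketch of the interval adjustment. The field names, factors and bounds are illustrative assumptions, not the scheduler's actual configuration keys.

```java
// Hypothetical sketch of adaptive fetch interval adjustment.
// The factors and bounds are illustrative assumptions, not the
// actual configuration of the adaptive scheduler class.
public class AdaptiveIntervalSketch {

    private final int minIntervalMinutes = 60;            // assumed lower bound
    private final int maxIntervalMinutes = 60 * 24 * 30;  // assumed upper bound
    private final float shrinkFactor = 0.5f;              // applied when the page changed
    private final float growFactor = 1.5f;                // applied when it did not

    /**
     * Returns the next fetch interval given the current one and whether the
     * signature of the re-fetched page differs from the stored signature.
     */
    public int nextIntervalMinutes(int currentIntervalMinutes, boolean signatureChanged) {
        float next = signatureChanged
                ? currentIntervalMinutes * shrinkFactor
                : currentIntervalMinutes * growFactor;
        return Math.max(minIntervalMinutes, Math.min(maxIntervalMinutes, Math.round(next)));
    }
}
```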
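The Configurable entry above requires an empty constructor so that implementations can be instantiated by class name before being configured from JSON. The sketch below only illustrates that pattern; the configure method shown here is an assumed shape, not necessarily the interface's exact signature.

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.Map;

// Illustrative only: a class that can be created reflectively through its
// empty constructor and configured afterwards from a JSON node. The
// configure(...) method below is an assumption, not the real interface.
public class MyConfigurableFilter {

    private String tagName = "default";

    // Empty constructor: required so the class can be instantiated by name,
    // e.g. via Class.forName(...).getDeclaredConstructor().newInstance().
    public MyConfigurableFilter() {
    }

    // Called after instantiation with the topology configuration and the
    // JSON node describing this particular instance.
    public void configure(Map<String, Object> conf, JsonNode params) {
        JsonNode node = params == null ? null : params.get("tag");
        if (node != null) {
            tagName = node.asText();
        }
    }

    public String getTagName() {
        return tagName;
    }
}
```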
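Finally, the RegexURLNormalizer entry describes rules that pair a regular expression with a replacement string. The sketch below shows the general idea with plain Java regular expressions; the sample rule (stripping session-id parameters) is invented for illustration and not taken from the class's default rule file.

```java
import java.util.regex.Pattern;

// Generic illustration of regex-based URL normalisation: each rule pairs a
// pattern with a replacement string. The single rule below, which strips
// session-id query parameters, is an invented example.
public class RegexNormalizationSketch {

    private static final Pattern SESSION_ID =
            Pattern.compile("(?i)(jsessionid|sessionid|sid)=[^&#]*");

    public static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints https://example.com/page?&x=1 ; a real rule set would also
        // tidy the leftover separator.
        System.out.println(normalize("https://example.com/page?jsessionid=ABC123&x=1"));
    }
}
```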