All Classes and Interfaces

Class
Description
Abstract class to simplify writing IndexerBolts.
Common features of spouts which query a backend to generate tuples.
Abstract bolt used to store the status of URLs.
Abstract class for the URLBuffer interface, meant to simplify the code of the implementations and provide some default methods.
Adaptive fetch scheduler which checks by signature comparison whether a re-fetched page has changed: if so, it shrinks the fetch interval down to a minimum fetch interval; if not, it increases the fetch interval up to a maximum (a sketch of this interval logic appears after this index).
Simple URL filters: can be used early in the filtering chain.
Assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file.
Rewrites a single metadata entry containing comma-separated values into multiple values for the same key; useful, for instance, for keyword tags.
An interface marking the implementing class as initializable and configurable via Configurable.createConfiguredInstance(String, Class, Map, JsonNode). The implementing class HAS to implement an empty constructor (see the sketch after this index).
Helper to extract cookies from a cookie string.
Dumps the DOM representation of a document into a file.
Schedules a nextFetchDate based on the configuration.
Protocol implementation that enables selection from a collection of sub-protocols using filters based on each call's metadata.
Deprecated: use DelegatorProtocol instead.
Adapted from org.jsoup.helper.W3CDom but does not transfer namespaces.
Implements the conversion by walking the input.
Adds the domain (or host) to the metadata; can be used later on for indexing.
Any tuple that went through all the previous bolts is sent to the status stream with a Status of FETCHED.
Used by URLBuffer to inform the spouts when a queue has no more URLs in it.
URL filter based on regex patterns and organised by [host | domain | metadata | global].
Extracts URLs from feeds.
A multithreaded, queue-based fetcher adapted from Apache Nutch.
Reads the lines from a UTF-8 file and uses them as a spout.
Filters URL based on the hostname.
A collection of HTTP header names and utilities around header values.
Uses Apache HttpClient to handle HTTP and HTTPS.
This class is used for parsing robots.txt files for URLs belonging to the HTTP protocol.
Defines a generic behaviour for ParseFilters or URLFilters to load resources from a JSON file.
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
Wrapper for the JSoupFilters defined in a JSON configuration.
Parser for HTML documents only which uses ICU4J to detect the charset encoding.
Extracts data from JSON-LD representation (https://json-ld.org/).
Extracts data from JSON-LD representation (https://json-ld.org/).
ParseFilter to extract additional links with XPath; can be configured with e.g.
ParseFilter to extract additional links with XPath; can be configured with e.g.
Filter out URLs whose depth is greater than maxDepth.
Computes a signature for a page, based on the binary content or text.
Stores URLs in memory.
Use in combination with the MemorySpout for testing in local mode.
Wrapper around Map<String, String[]>.
Filter out URLs based on metadata in the source document.
Implements the logic of how the metadata should be passed to the outlinks, what should be stored back in the persistence layer, etc.
Normalises the MimeType value e.g.
MultiProxyManager is a ProxyManager implementation for multiple proxy endpoints.
Wrapper for the NavigationFilter defined in a JSON configuration.
Implementations of ParseFilter are responsible for extracting custom data from the crawled content.
Wrapper for the ParseFilters defined in a JSON configuration.
Used to return an average value per second.
Determines the priority of the buffers based on the number of URLs acked in a configurable period of time.
Enum of reasons why protocol content may be trimmed.
ProxyManager is an abstract class that specifies the required interface of a proxy manager.
A generic regular expression rule.
Filters URLs based on a file of regular expressions using the Java Regex implementation.
An abstract class for implementing Regex URL filtering.
The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string (a generic sketch of such a rule appears after this index).
Delegates the requests to one or more remote selenium servers.
Wrapper for BaseRobotRules which tracks the number of requests and length of the responses needed to get the rules.
This class uses crawler-commons for handling the parsing of robots.txt files.
URLFilter which discards URLs based on the robots.txt directives.
Normalises the robots instructions provided by the HTML meta tags or the HTTP X-Robots-Tag headers.
Checks how long the last N URLs took to work out whether a queue should release a URL.
The Proxy class is used as the central interface for proxy-based interactions with a single remote server. The class stores all information relating to the remote server, authentication, and usage activity.
Filters links to self.
A simple fetcher with no internal queues.
Simple implementation of a URLBuffer which rotates on the queues without applying any priority.
SingleProxyManager is a ProxyManager implementation for a single proxy endpoint.
URLFilter which discards URLs discovered in a page which is not a sitemap when sitemaps have been found for that site.
Extracts URLs from a sitemap file.
Provides common functionalities for Bolts which emit tuples to the status stream, e.g.
Indexer which generates fields for indexing and sends them to the standard output.
Dummy status updater which dumps the content of the incoming tuples to the standard output.
Converts a byte array into URL + metadata.
Filters the text extracted from HTML documents, used by JSoupParserBolt.
Buffers URLs to be processed into separate queues; used by spouts.
Unlike Nutch, URLFilters can normalise the URLs as well as filtering them.
Wrapper for the URLFilters defined in a JSON configuration.
Generates a partition key for a given URL based on the hostname, domain or IP address.
Generates a partition key for a given URL based on the hostname, domain or IP address.
Directs tuples to a specific bolt instance based on the URLPartitioner, e.g.
Utility class for URL analysis.
Reads XPath patterns and stores the values found in the web page as metadata.
Simple ParseFilter to illustrate and test the interface.
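
To illustrate the adaptive fetch scheduling described earlier in this index, here is a minimal, hypothetical sketch of the interval adjustment. The field names, factors and bounds are illustrative assumptions, not the scheduler's actual configuration keys.

```java
// Hypothetical sketch of adaptive fetch interval adjustment.
// The factors and bounds are illustrative assumptions, not the
// actual configuration of the adaptive scheduler class.
public class AdaptiveIntervalSketch {

    private final int minIntervalMinutes = 60;            // assumed lower bound
    private final int maxIntervalMinutes = 60 * 24 * 30;  // assumed upper bound
    private final float shrinkFactor = 0.5f;              // applied when the page changed
    private final float growFactor = 1.5f;                // applied when it did not

    /**
     * Returns the next fetch interval given the current one and whether the
     * signature of the re-fetched page differs from the stored signature.
     */
    public int nextIntervalMinutes(int currentIntervalMinutes, boolean signatureChanged) {
        float next = signatureChanged
                ? currentIntervalMinutes * shrinkFactor
                : currentIntervalMinutes * growFactor;
        return Math.max(minIntervalMinutes, Math.min(maxIntervalMinutes, Math.round(next)));
    }
}
```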
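The Configurable entry above requires an empty constructor so that implementations can be instantiated by class name before being configured from JSON. The sketch below only illustrates that pattern; the configure method shown here is an assumed shape, not necessarily the interface's exact signature.

```java
import com.fasterxml.jackson.databind.JsonNode;
import java.util.Map;

// Illustrative only: a class that can be created reflectively through its
// empty constructor and configured afterwards from a JSON node. The
// configure(...) method below is an assumption, not the real interface.
public class MyConfigurableFilter {

    private String tagName = "default";

    // Empty constructor: required so the class can be instantiated by name,
    // e.g. via Class.forName(...).getDeclaredConstructor().newInstance().
    public MyConfigurableFilter() {
    }

    // Called after instantiation with the topology configuration and the
    // JSON node describing this particular instance.
    public void configure(Map<String, Object> conf, JsonNode params) {
        JsonNode node = params == null ? null : params.get("tag");
        if (node != null) {
            tagName = node.asText();
        }
    }

    public String getTagName() {
        return tagName;
    }
}
```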
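Finally, the RegexURLNormalizer entry describes rules that pair a regular expression with a replacement string. The sketch below shows the general idea with plain Java regular expressions; the sample rule (stripping session-id parameters) is invented for illustration and not taken from the class's default rule file.

```java
import java.util.regex.Pattern;

// Generic illustration of regex-based URL normalisation: each rule pairs a
// pattern with a replacement string. The single rule below, which strips
// session-id query parameters, is an invented example.
public class RegexNormalizationSketch {

    private static final Pattern SESSION_ID =
            Pattern.compile("(?i)(jsessionid|sessionid|sid)=[^&#]*");

    public static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints https://example.com/page?&x=1 ; a real rule set would also
        // tidy the leftover separator.
        System.out.println(normalize("https://example.com/page?jsessionid=ABC123&x=1"));
    }
}
```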