Uses of Class
com.digitalpebble.stormcrawler.util.AbstractConfigurable
Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | URLFilter | Unlike Nutch, URLFilters can normalise the URLs as well as filtering them. |
| class | URLFilters | Wrapper for the URLFilters defined in a JSON configuration. |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.basic
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.basic

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | BasicURLFilter | Simple URL filters: can be used early in the filtering chain |
| class | BasicURLNormalizer | |
| class | SelfURLFilter | Filters links to self |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.depth
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.depth

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | MaxDepthFilter | Filter out URLs whose depth is greater than maxDepth. |

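
The depth check described for MaxDepthFilter can be illustrated with a minimal, self-contained sketch. It is not StormCrawler's implementation: the class name `DepthCheckSketch`, the method `keep`, and the convention that a negative `maxDepth` disables the check are all assumptions made for illustration.

```java
public class DepthCheckSketch {

    /**
     * Returns true when a URL found at the given depth should be kept.
     * Assumption (not from the source): a negative maxDepth means "no limit".
     */
    public static boolean keep(int depth, int maxDepth) {
        if (maxDepth < 0) {
            return true; // hypothetical convention: the check is disabled
        }
        // A URL is filtered out only when its depth is greater than maxDepth.
        return depth <= maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(keep(2, 3));
        System.out.println(keep(4, 3));
    }
}
```

A URL whose depth equals `maxDepth` is kept in this sketch, which matches the wording "greater than maxDepth" in the description.
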
Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.host
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.host

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | HostURLFilter | Filters URL based on the hostname. |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.metadata
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.metadata

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | MetadataFilter | Filter out URLs based on metadata in the source document |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.regex
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.regex

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | FastURLFilter | URL filter based on regex patterns and organised by [host \| domain \| metadata \| global]. |
| class | RegexURLFilter | Filters URLs based on a file of regular expressions using the Java Regex implementation. |
| class | RegexURLFilterBase | An abstract class for implementing Regex URL filtering. |
| class | RegexURLNormalizer | The RegexURLNormalizer is a URL filter that normalizes URLs by matching a regular expression and inserting a replacement string. |

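
The "match a regular expression, insert a replacement string" idea behind RegexURLNormalizer can be sketched with the standard `java.util.regex` classes. This is a simplified illustration under stated assumptions, not the class itself: `RegexNormalizerSketch`, `addRule`, `normalize`, and the two example rules are hypothetical, and how the real class loads its rules is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexNormalizerSketch {

    /** One rule: a compiled pattern and the replacement substituted for each match. */
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a rule; rules are applied in the order they were added. */
    public RegexNormalizerSketch addRule(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    /** Applies every rule in turn to the URL string and returns the result. */
    public String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        RegexNormalizerSketch normalizer = new RegexNormalizerSketch()
                // hypothetical rule: strip a session id path parameter
                .addRule(";jsessionid=[^?#]*", "")
                // hypothetical rule: drop a trailing fragment
                .addRule("#.*$", "");
        System.out.println(
                normalizer.normalize("http://example.com/a;jsessionid=123?x=1#top"));
    }
}
```

Because the rules run in sequence, each rule sees the output of the previous one, so rule order can change the result.
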
Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.robots
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.robots

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | RobotsFilter | URLFilter which discards URLs based on the robots.txt directives. |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.sitemap
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.filtering.sitemap

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | SitemapFilter | URLFilter which discards URLs discovered in a page which is not a sitemap when sitemaps have been found for that site. |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.jsoup
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.jsoup

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | LDJsonParseFilter | Extracts data from JSON-LD representation (https://json-ld.org/). |
| class | LinkParseFilter | ParseFilter to extract additional links with Xpath can be configured with e.g. |
| class | XPathFilter | Reads XPath patterns and stores the values found in the web page as metadata |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.parse
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.parse

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | JSoupFilters | Wrapper for the JSoupFilters defined in a JSON configuration |
| class | ParseFilter | Implementations of ParseFilter are responsible for extracting custom data from the crawled content. |
| class | ParseFilters | Wrapper for the ParseFilters defined in a JSON configuration |

Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.parse.filter
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.parse.filter

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | CollectionTagger | Assigns one or more tags to the metadata of a document based on its URL matching patterns defined in a JSON resource file. |
| class | CommaSeparatedToMultivaluedMetadata | Rewrites single metadata containing comma separated values into multiple values for the same key, useful for instance for keyword tags. |
| class | DebugParseFilter | Dumps the DOM representation of a document into a file |
| class | DomainParseFilter | Adds domain (or host) to metadata - can be used later on for indexing |
| class | LDJsonParseFilter | Extracts data from JSON-LD representation (https://json-ld.org/) |
| class | LinkParseFilter | ParseFilter to extract additional links with Xpath can be configured with e.g. |
| class | MD5SignatureParseFilter | Computes a signature for a page, based on the binary content or text. |
| class | MimeTypeNormalization | Normalises the MimeType value e.g. |
| class | XPathFilter | Simple ParseFilter to illustrate and test the interface. |

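
The rewrite described for CommaSeparatedToMultivaluedMetadata, turning one comma-separated value into several values for the same key, can be sketched in plain Java. The class `CommaSplitSketch` and its `split` method are hypothetical, and trimming whitespace and dropping empty entries are assumptions of this sketch rather than documented behaviour of the real filter.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaSplitSketch {

    /**
     * Splits a comma-separated value into individual entries.
     * Assumptions (not from the source): entries are trimmed and
     * empty entries are discarded.
     */
    public static List<String> split(String value) {
        List<String> values = new ArrayList<>();
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                values.add(trimmed);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // e.g. a keywords value taken from a page's metadata
        System.out.println(split("crawler, storm ,java"));
    }
}
```

The returned list would then be stored as the multiple values of the original metadata key, which is the step the real filter performs on the document metadata.
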
Uses of AbstractConfigurable in com.digitalpebble.stormcrawler.protocol.selenium
Subclasses of AbstractConfigurable in com.digitalpebble.stormcrawler.protocol.selenium

| Modifier and Type | Class | Description |
| --- | --- | --- |
| class | NavigationFilter | |
| class | NavigationFilters | Wrapper for the NavigationFilter defined in a JSON configuration |