Class | Description |
---|---|
BdbServerCache | ServerCache backed by BDB big maps; the usual choice for crawls. |
CrawlHost | Represents a single remote "host". |
CrawlServer | Represents a single remote "server". |
CustomRobotsPolicy | Follows a custom-written robots policy rather than the site's own declarations. Does not support overlays of different custom robots directives; instead, it is recommended that each custom policy be declared as a separate bean with a distinct name. |
DefaultTempDirProvider | |
FirstNamedRobotsPolicy | Works from an ordered list of potential User-Agents, consisting first of the regularly-configured User-Agent and then those in the candidateUserAgents list, considering each potential agent in order. |
IgnoreRobotsPolicy | Policy to ignore robots.txt directives. |
MostFavoredRobotsPolicy | Follows a most-favored robots policy, allowing a URL if either the conventionally-configured User-Agent or any of a number of alternate User-Agents (from the candidateUserAgents list) would be allowed. |
ObeyRobotsPolicy | Classic obey-robots-as-declared policy. |
RobotsDirectives | Represents the directives that apply to a user-agent (or set of user-agents). |
RobotsPolicy | Represents the strategy used by the crawler for determining how robots.txt files will be honored. |
Robotstxt | Utility class for parsing and representing 'robots.txt' format directives as a list of named user-agents and a map from user-agents to RobotsDirectives (see the sketch after this table). |
ServerCache | Abstract class for the crawl-global registry of CrawlServer (host:port) and CrawlHost (hostname) objects. |
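A minimal sketch of how two of these classes fit together: parsing a robots.txt body with Robotstxt and checking a path against the RobotsDirectives for a given user-agent. The signatures used here (a Robotstxt constructor taking a BufferedReader, getDirectivesFor(String), and RobotsDirectives.allows(String)) are assumptions based on the Heritrix 3 API; consult the individual class pages to confirm them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

import org.archive.modules.net.RobotsDirectives;
import org.archive.modules.net.Robotstxt;

public class RobotstxtSketch {
    public static void main(String[] args) throws IOException {
        // A small robots.txt body; in a real crawl this would be the fetched file.
        String body =
            "User-agent: *\n" +
            "Disallow: /private/\n";

        // Robotstxt parses the directives into per-user-agent RobotsDirectives.
        Robotstxt robots = new Robotstxt(new BufferedReader(new StringReader(body)));

        // Look up the directives applying to a crawler user-agent; with no
        // agent-specific section present, this should fall back to the '*' rules.
        RobotsDirectives directives = robots.getDirectivesFor("heritrix");

        // Test individual paths against the directives.
        System.out.println(directives.allows("/private/page.html")); // expected: false
        System.out.println(directives.allows("/index.html"));        // expected: true
    }
}
```

A policy class such as ObeyRobotsPolicy would consult the same RobotsDirectives when deciding whether a discovered URL may be fetched.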
Copyright © 2003–2021 Internet Archive. All rights reserved.