Package org.archive.crawler.frontier
Class HostnameQueueAssignmentPolicyWithLimits
java.lang.Object
org.archive.crawler.frontier.QueueAssignmentPolicy
org.archive.crawler.frontier.URIAuthorityBasedQueueAssignmentPolicy
org.archive.crawler.frontier.HostnameQueueAssignmentPolicy
org.archive.crawler.frontier.HostnameQueueAssignmentPolicyWithLimits
All Implemented Interfaces:
Serializable, org.archive.spring.HasKeyedProperties
A variation on HostnameQueueAssignmentPolicy that allows the operator (per sheet) to specify the maximum number of domains and sub-domains to use for the queue name.
-
Field Summary
Fields
LIMIT
Fields inherited from class org.archive.crawler.frontier.URIAuthorityBasedQueueAssignmentPolicy
conhash, DEFAULT_CLASS_KEY
Fields inherited from class org.archive.crawler.frontier.QueueAssignmentPolicy
kp
-
Constructor Summary
Constructors
HostnameQueueAssignmentPolicyWithLimits()
-
Method Summary
Modifier and Type | Method | Description
protected String | getCoreKey(org.archive.net.UURI basis) |
int | getLimit() |
protected String | getLimitedHostname(String hostname, int limit) |
void | setLimit(int limit) | Set the maximum number of domains and sub-domains to include in the queue name.
Methods inherited from class org.archive.crawler.frontier.URIAuthorityBasedQueueAssignmentPolicy
bucketBasis, getClassKey, getDeferToPrevious, getParallelQueues, getParallelQueuesRandomAssignment, getSubqueue, setDeferToPrevious, setParallelQueues, setParallelQueuesRandomAssignment
Methods inherited from class org.archive.crawler.frontier.QueueAssignmentPolicy
getForceQueueAssignment, getKeyedProperties, maximumNumberOfKeys, setForceQueueAssignment
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.archive.spring.HasKeyedProperties
getKeyedProperties
-
Field Details
-
LIMIT
-
Constructor Details
-
HostnameQueueAssignmentPolicyWithLimits
public HostnameQueueAssignmentPolicyWithLimits()
-
-
Method Details
-
setLimit
public void setLimit(int limit)
Set the maximum number of domains and sub-domains to include in the queue name.
E.g. if limit is set to 2, then the following assignments are made:
example.com -> example.com
www.example.com -> example.com
subdomain.example.com -> example.com
www.subdomain.example.com -> example.com
otherdomain.com -> otherdomain.com
Note: No accommodation is made for TLDs, like .co.uk, that always use two levels. Operators should use SurtPrefixesSheetAssociation sheets to apply these limits appropriately if crawling a mixture of TLDs with and without the mandatory second level, or apply the limit only on specific domains.
Parameters:
limit - The limit on the number of domains to use in assigning a queue name to a URI.
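The limit rule described above can be illustrated with a minimal standalone sketch. This is not the Heritrix implementation of getLimitedHostname: the class name HostnameLimitSketch and the helper limitHostname are hypothetical, and treating a non-positive limit as "no limit" is an assumption made only for this example. The sketch keeps the rightmost limit labels of a hostname, which reproduces the assignments listed for limit 2 and also shows why a two-level suffix such as .co.uk needs a larger limit.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Heritrix.
class HostnameLimitSketch {

    // Keep at most `limit` rightmost dot-separated labels of `hostname`.
    // Assumption for this sketch: a limit of zero or less means "no limit".
    static String limitHostname(String hostname, int limit) {
        if (limit <= 0) {
            return hostname;
        }
        String[] labels = hostname.split("\\.");
        if (labels.length <= limit) {
            // Already short enough, nothing to trim.
            return hostname;
        }
        String[] kept = Arrays.copyOfRange(labels, labels.length - limit, labels.length);
        return String.join(".", kept);
    }

    public static void main(String[] args) {
        String[] hosts = {
            "example.com",
            "www.example.com",
            "www.subdomain.example.com",
            "www.example.co.uk"
        };
        for (String host : hosts) {
            // With limit 2, www.example.co.uk collapses to co.uk,
            // which is the caveat described in the note above.
            System.out.println(host + " -> " + limitHostname(host, 2));
        }
        // A limit of 3 keeps the registrable domain for the .co.uk host.
        System.out.println("www.example.co.uk -> " + limitHostname("www.example.co.uk", 3));
    }
}
```

In a real crawl the larger limit for .co.uk hosts would be supplied through a sheet overlay rather than in code, so that each group of domains gets the limit that suits its suffix structure.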
-
getLimit
public int getLimit()
-
getCoreKey
protected String getCoreKey(org.archive.net.UURI basis)
Overrides:
getCoreKey in class HostnameQueueAssignmentPolicy
-
getLimitedHostname
protected String getLimitedHostname(String hostname, int limit)
-