Package | Description |
---|---|
org.archive.crawler.util | |
org.archive.modules |
The beginnings of a refactored settings framework.
|
org.archive.modules.credential |
Contains HTML form login and basic and digest credentials
used by Heritrix when logging into sites.
|
org.archive.modules.deciderules | |
org.archive.modules.deciderules.recrawl | |
org.archive.modules.deciderules.surt | |
org.archive.modules.extractor | |
org.archive.modules.fetcher | |
org.archive.modules.forms | |
org.archive.modules.net | |
org.archive.modules.recrawl | |
org.archive.modules.seeds | |
org.archive.modules.warc | |
org.archive.modules.writer | |
org.archive.state |
Modifier and Type | Method and Description |
---|---|
void |
CrawledBytesHistotable.accumulate(CrawlURI curi) |
Modifier and Type | Field and Description |
---|---|
protected CrawlURI |
CrawlURI.fullVia |
Modifier and Type | Field and Description |
---|---|
protected Collection<CrawlURI> |
CrawlURI.outLinks
All discovered outbound urls as CrawlURIs (navlinks, embeds, etc.)
|
Modifier and Type | Method and Description |
---|---|
CrawlURI |
CrawlURI.clearPrerequisiteUri()
Clear prerequisite, if any.
|
CrawlURI |
CrawlURI.createCrawlURI(String destination,
LinkContext context,
Hop hop) |
CrawlURI |
CrawlURI.createCrawlURI(UURI destination,
LinkContext context,
Hop hop)
Utility method for creating CrawlURIs that were found as outlinks
from this CrawlURI.
|
CrawlURI |
CrawlURI.createCrawlURI(UURI destination,
LinkContext context,
Hop hop,
int scheduling,
boolean seed)
Utility method for creating CrawlURIs found while extracting
links from this CrawlURI.
|
static CrawlURI |
CrawlURI.fromHopsViaString(String uriHopsViaContext) |
CrawlURI |
CrawlURI.getFullVia() |
CrawlURI |
CrawlURI.getPrerequisiteUri()
Get the prerequisite for this URI.
|
CrawlURI |
CrawlURI.markPrerequisite(String preq)
Do all actions associated with setting a CrawlURI as
requiring a prerequisite. |
Modifier and Type | Method and Description |
---|---|
Collection<CrawlURI> |
CrawlURI.getOutLinks()
Returns discovered links.
|
Modifier and Type | Method and Description |
---|---|
int |
CrawlURI.compareTo(CrawlURI o) |
static String |
Processor.flattenVia(CrawlURI puri) |
static long |
Processor.getRecordedSize(CrawlURI puri) |
static boolean |
Processor.hasHttpAuthenticationCredential(CrawlURI puri) |
protected void |
CrawlURI.inheritFrom(CrawlURI ancestor)
Inherit (copy) the relevant keys-values from the ancestor.
|
protected void |
ScriptedProcessor.innerProcess(CrawlURI curi) |
protected abstract void |
Processor.innerProcess(CrawlURI uri)
Actually performs the process.
|
protected ProcessResult |
Processor.innerProcessResult(CrawlURI uri) |
protected void |
Processor.innerRejectProcess(CrawlURI uri)
Invoked after a URI has been rejected.
|
static boolean |
Processor.isSuccess(CrawlURI puri) |
ProcessResult |
Processor.process(CrawlURI uri)
Processes the given URI.
|
void |
ProcessorChain.process(CrawlURI curi,
ProcessorChain.ChainStatusReceiver thread) |
void |
CrawlURI.setFullVia(CrawlURI curi) |
void |
CrawlURI.setPrerequisiteUri(CrawlURI pre)
Set a prerequisite for this URI.
|
protected boolean |
ScriptedProcessor.shouldProcess(CrawlURI curi) |
protected abstract boolean |
Processor.shouldProcess(CrawlURI uri)
Determines whether the given uri should be processed by this
processor.
|
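
The `shouldProcess` and `innerProcess` pair listed above reads like a template-method arrangement: a gate method decides whether the processor applies, and a work method does the processing. The following is a minimal sketch of that idea, not the Heritrix implementation. `ProcessorSketch` and `HttpOnlySketch` are hypothetical stand-ins, a plain `String` replaces `CrawlURI`, and the control flow inside `process` is an assumption (the real `Processor.process` returns a `ProcessResult` and may apply further checks).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a processor base class; NOT the Heritrix Processor.
abstract class ProcessorSketch {
    /** Decides whether this processor applies to the given URI. */
    protected abstract boolean shouldProcess(String uri);

    /** Performs the actual work for a URI that passed shouldProcess. */
    protected abstract void innerProcess(String uri);

    /** Assumed control flow: check the gate first, then do the work. */
    public final boolean process(String uri) {
        if (!shouldProcess(uri)) {
            return false; // skipped: this processor does not apply
        }
        innerProcess(uri);
        return true;
    }
}

// Hypothetical subclass that only handles http and https URIs.
class HttpOnlySketch extends ProcessorSketch {
    final List<String> handled = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) {
        return uri.startsWith("http://") || uri.startsWith("https://");
    }

    @Override
    protected void innerProcess(String uri) {
        handled.add(uri); // record the URI as processed
    }
}
```

In this sketch, calling `process` on an `HttpOnlySketch` with a `dns:` URI leaves `handled` untouched, because the gate rejects the URI before `innerProcess` runs.
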
Modifier and Type | Method and Description |
---|---|
void |
Credential.attach(CrawlURI curi)
Attach this credential's avatar to the passed
curi. |
boolean |
Credential.detach(CrawlURI curi)
Detach this credential from passed curi.
|
boolean |
Credential.detachAll(CrawlURI curi)
Detach all credentials of this type from passed curi.
|
static HttpAuthenticationCredential |
HttpAuthenticationCredential.getByRealm(Set<Credential> rfc2617Credentials,
String realm,
CrawlURI context)
Convenience method that looks up a credential in the passed set, using the realm as the key.
|
String |
HttpAuthenticationCredential.getPrerequisite(CrawlURI curi) |
String |
HtmlFormCredential.getPrerequisite(CrawlURI curi) |
abstract String |
Credential.getPrerequisite(CrawlURI curi)
Return the authentication URI, either absolute or relative, that serves
as a prerequisite for the passed
curi. |
boolean |
HttpAuthenticationCredential.hasPrerequisite(CrawlURI curi) |
boolean |
HtmlFormCredential.hasPrerequisite(CrawlURI curi) |
abstract boolean |
Credential.hasPrerequisite(CrawlURI curi) |
boolean |
HttpAuthenticationCredential.isPrerequisite(CrawlURI curi) |
boolean |
HtmlFormCredential.isPrerequisite(CrawlURI curi) |
abstract boolean |
Credential.isPrerequisite(CrawlURI curi) |
boolean |
Credential.rootUriMatch(ServerCache cache,
CrawlURI curi)
Test whether the passed curi matches this credential's rootUri.
|
Set<Credential> |
CredentialStore.subset(CrawlURI context,
Class<?> type)
Return a set made up of all credentials of the passed
type. |
Set<Credential> |
CredentialStore.subset(CrawlURI context,
Class<?> type,
String rootUri)
Return a set made up of all credentials of the passed
type. |
Modifier and Type | Method and Description |
---|---|
boolean |
DecideRule.accepts(CrawlURI uri) |
DecideResult |
DecideRule.decisionFor(CrawlURI uri) |
protected void |
DecideRuleSequence.decisionMade(CrawlURI uri,
DecideRule decisiveRule,
int decisiveRuleNumber,
DecideResult result) |
protected boolean |
NotMatchesRegexDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
|
protected boolean |
IpAddressSetDecideRule.evaluate(CrawlURI curi) |
protected boolean |
NotMatchesFilePatternDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
|
protected boolean |
SchemeNotInSetDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's scheme is not in the
configured set of schemes.
|
protected boolean |
FetchStatusDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object is equal to the configured status
|
protected boolean |
FetchStatusNotMatchesRegexDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's FetchStatus does not match
configured regex (by reversing the superclass's answer).
|
protected boolean |
TransclusionDecideRule.evaluate(CrawlURI curi)
Evaluate whether given object is within the acceptable thresholds of
transitive hops.
|
protected boolean |
TooManyPathSegmentsDecideRule.evaluate(CrawlURI curi)
Evaluate whether given object is over the threshold number of
path-segments.
|
protected boolean |
SourceSeedDecideRule.evaluate(CrawlURI curi) |
protected boolean |
TooManyHopsDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object is over the threshold number of
hops.
|
protected boolean |
ResourceNoLongerThanDecideRule.evaluate(CrawlURI curi) |
protected boolean |
ExternalGeoLocationDecideRule.evaluate(CrawlURI uri) |
protected boolean |
MatchesListRegexDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's string version
matches configured regexes
|
protected boolean |
NotMatchesListRegexDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's string version does not match
configured regexes (by reversing the superclass's answer).
|
protected boolean |
HopCrossesAssignmentLevelDomainDecideRule.evaluate(CrawlURI uri) |
protected boolean |
ContentTypeNotMatchesRegexDecideRule.evaluate(CrawlURI o)
Evaluate whether given object's string version does not match
configured regex (by reversing the superclass's answer).
|
protected boolean |
ViaSurtPrefixedDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's surt form
matches one of the supplied surts
|
protected boolean |
HasViaDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object has a via, that is, whether it
was discovered via another URI.
|
protected boolean |
MatchesRegexDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's string version
matches configured regex
|
protected boolean |
ResponseContentLengthDecideRule.evaluate(CrawlURI uri) |
protected boolean |
MatchesStatusCodeDecideRule.evaluate(CrawlURI uri)
Returns "true" if the provided CrawlURI has a fetch status that falls
within this instance's specified range.
|
protected boolean |
NotMatchesStatusCodeDecideRule.evaluate(CrawlURI uri)
Returns "true" if the provided CrawlURI has a fetch status that does not
fall within this instance's specified range.
|
protected abstract boolean |
PredicatedDecideRule.evaluate(CrawlURI object) |
protected boolean |
AddRedirectFromRootServerToScope.evaluate(CrawlURI uri) |
protected String |
IpAddressSetDecideRule.getHostAddress(CrawlURI curi)
from WriterPoolProcessor
|
protected String |
FetchStatusMatchesRegexDecideRule.getString(CrawlURI uri) |
protected String |
HopsPathMatchesRegexDecideRule.getString(CrawlURI uri) |
protected String |
ContentTypeMatchesRegexDecideRule.getString(CrawlURI uri) |
protected String |
MatchesRegexDecideRule.getString(CrawlURI uri) |
DecideResult |
DecideRuleSequence.innerDecide(CrawlURI uri) |
protected DecideResult |
SeedAcceptDecideRule.innerDecide(CrawlURI uri) |
protected DecideResult |
AcceptDecideRule.innerDecide(CrawlURI uri) |
DecideResult |
ScriptedDecideRule.innerDecide(CrawlURI uri) |
protected DecideResult |
RejectDecideRule.innerDecide(CrawlURI uri) |
protected DecideResult |
ContentLengthDecideRule.innerDecide(CrawlURI uri) |
DecideResult |
PrerequisiteAcceptDecideRule.innerDecide(CrawlURI uri) |
protected DecideResult |
PathologicalPathDecideRule.innerDecide(CrawlURI uri) |
protected DecideResult |
PredicatedDecideRule.innerDecide(CrawlURI uri) |
protected abstract DecideResult |
DecideRule.innerDecide(CrawlURI uri) |
DecideResult |
AcceptDecideRule.onlyDecision(CrawlURI uri) |
DecideResult |
RejectDecideRule.onlyDecision(CrawlURI uri) |
DecideResult |
PredicatedDecideRule.onlyDecision(CrawlURI uri) |
DecideResult |
DecideRule.onlyDecision(CrawlURI uri) |
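
Several of the `Not...DecideRule.evaluate` entries above are described as "reversing the superclass's answer". A minimal sketch of that inversion pattern follows. The class names ending in `Sketch` are hypothetical stand-ins rather than the Heritrix classes, a plain `String` replaces `CrawlURI`, and whole-string matching via `Matcher.matches()` is an assumption of this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a predicate-style rule; NOT PredicatedDecideRule itself.
abstract class PredicateRuleSketch {
    /** Subclasses answer a yes/no question about the URI. */
    protected abstract boolean evaluate(String uri);
}

// Hypothetical rule that is true when the URI matches the configured regex.
class MatchesRegexSketch extends PredicateRuleSketch {
    private final Pattern regex;

    MatchesRegexSketch(String regex) {
        this.regex = Pattern.compile(regex);
    }

    @Override
    protected boolean evaluate(String uri) {
        // Assumption: the whole URI string must match the pattern.
        return regex.matcher(uri).matches();
    }
}

// Hypothetical negated rule: reuses the superclass test and flips the result.
class NotMatchesRegexSketch extends MatchesRegexSketch {
    NotMatchesRegexSketch(String regex) {
        super(regex);
    }

    @Override
    protected boolean evaluate(String uri) {
        return !super.evaluate(uri);
    }
}
```

Subclassing keeps the matching logic in one place, so the negated rule cannot drift from the positive one. The trade-off is that the negated class inherits every configuration property of its parent.
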
Modifier and Type | Method and Description |
---|---|
protected boolean |
IdenticalDigestDecideRule.evaluate(CrawlURI curi)
Evaluate whether given CrawlURI's revisit profile has been set to identical digest
|
static boolean |
IdenticalDigestDecideRule.hasIdenticalDigest(CrawlURI curi)
Utility method for testing if a CrawlURI's revisit profile matches an identical payload digest.
|
Modifier and Type | Method and Description |
---|---|
void |
SurtPrefixedDecideRule.addedSeed(CrawlURI curi)
If appropriate, convert seed notification into prefix-addition.
|
protected boolean |
NotOnHostsDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's URI is NOT in the set of
hosts -- simply reverse superclass's determination
|
protected boolean |
NotSurtPrefixedDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's URI is NOT in the SURT
prefix set -- simply reverse superclass's determination
|
protected boolean |
NotOnDomainsDecideRule.evaluate(CrawlURI object)
Evaluate whether given object's URI is NOT in the set of
domains -- simply reverse superclass's determination
|
protected boolean |
SurtPrefixedDecideRule.evaluate(CrawlURI uri)
Evaluate whether given object's URI is covered by the SURT prefix set
|
Modifier and Type | Field and Description |
---|---|
protected CrawlURI |
ExtractorSWF.CrawlUriSWFAction.curi |
CrawlURI |
StringExtractorTestBase.TestData.expectedResult |
CrawlURI |
StringExtractorTestBase.TestData.uri |
Modifier and Type | Method and Description |
---|---|
static CrawlURI |
Extractor.addRelativeToBase(CrawlURI uri,
int max,
String newUri,
LinkContext context,
Hop hop) |
static CrawlURI |
Extractor.addRelativeToVia(CrawlURI uri,
int max,
String newUri,
LinkContext context,
Hop hop) |
protected CrawlURI |
ContentExtractorTestBase.defaultURI()
Returns a CrawlURI for testing purposes.
|
Modifier and Type | Method and Description |
---|---|
static void |
Extractor.add(CrawlURI uri,
int max,
String newUri,
LinkContext context,
Hop hop) |
protected void |
ExtractorSWF.CrawlUriSWFAction.addAnnotations(CrawlURI relToVia,
CrawlURI relToBase) |
protected void |
ExtractorHTTP.addContentLocationHeaderLink(CrawlURI curi,
String headerKey) |
protected void |
ExtractorHTTP.addHeaderLink(CrawlURI curi,
String headerKey) |
protected void |
ExtractorHTTP.addHeaderLink(CrawlURI curi,
String headerName,
String url) |
protected void |
ExtractorHTML.addLinkFromString(CrawlURI curi,
CharSequence uri,
CharSequence context,
Hop hop) |
protected void |
Extractor.addOutlink(CrawlURI curi,
String uri,
LinkContext context,
Hop hop)
Create and add a 'Link' to the CrawlURI with given URI/context/hop-type
|
protected void |
Extractor.addOutlink(CrawlURI curi,
UURI uuri,
LinkContext context,
Hop hop) |
protected void |
ExtractorHTTP.addRefreshHeaderLink(CrawlURI curi,
String headerKey) |
static CrawlURI |
Extractor.addRelativeToBase(CrawlURI uri,
int max,
String newUri,
LinkContext context,
Hop hop) |
static CrawlURI |
Extractor.addRelativeToVia(CrawlURI uri,
int max,
String newUri,
LinkContext context,
Hop hop) |
protected static void |
ContentExtractorTestBase.assertNoSideEffects(CrawlURI uri)
Asserts that the given URI has no URI errors, no localized errors, and
no annotations.
|
protected void |
ExtractorMultipleRegex.buildAndAddOutlink(CrawlURI curi,
Map<String,Object> bindings) |
protected void |
ExtractorHTML.considerIfLikelyUri(CrawlURI curi,
CharSequence candidate,
CharSequence valueContext,
Hop hop)
Consider whether a given string is URI-like.
|
protected void |
ExtractorHTML.considerQueryStringValues(CrawlURI curi,
CharSequence queryString,
CharSequence valueContext,
Hop hop)
Consider a query-string-like collection of key=value[&key=value]
pairs for URI-like strings in the values.
|
protected boolean |
ExtractorJS.considerString(Extractor ext,
CrawlURI curi,
boolean handlingJSFile,
String candidate) |
protected long |
ExtractorJS.considerStrings(CrawlURI curi,
CharSequence cs) |
long |
ExtractorJS.considerStrings(Extractor ext,
CrawlURI curi,
CharSequence cs) |
long |
ExtractorJS.considerStrings(Extractor ext,
CrawlURI curi,
CharSequence cs,
boolean handlingJSFile) |
void |
ExtractorURI.extract(CrawlURI curi)
Perform usual extraction on a CrawlURI
|
void |
ExtractorMultipleRegex.extract(CrawlURI curi) |
protected void |
ExtractorHTTP.extract(CrawlURI curi) |
protected void |
ContentExtractor.extract(CrawlURI uri)
Extracts links
|
void |
ExtractorImpliedURI.extract(CrawlURI curi)
Perform usual extraction on a CrawlURI
|
protected abstract void |
Extractor.extract(CrawlURI uri)
Extracts links from the given URI.
|
protected void |
ExtractorHTML.extract(CrawlURI curi,
CharSequence cs)
Run extractor.
|
protected void |
JerichoExtractorHTML.extract(CrawlURI curi,
CharSequence cs)
Run extractor.
|
protected void |
ExtractorURI.extractLink(CrawlURI curi,
CrawlURI wref)
Consider a single Link for internal URIs
|
protected Charset |
ExtractorXML.getContentDeclaredCharset(CrawlURI curi,
String contentPrefix) |
protected Charset |
ExtractorHTML.getContentDeclaredCharset(CrawlURI curi,
String contentPrefix) |
protected boolean |
ExtractorSWF.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorSitemap.innerExtract(CrawlURI uri) |
protected boolean |
TrapSuppressExtractor.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorUniversal.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorRobotsTxt.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorXML.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorDOC.innerExtract(CrawlURI curi)
Processes a word document and extracts any hyperlinks from it.
|
boolean |
ExtractorCSS.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorPDF.innerExtract(CrawlURI curi) |
protected boolean |
ExtractorJS.innerExtract(CrawlURI curi) |
protected abstract boolean |
ContentExtractor.innerExtract(CrawlURI uri)
Actually extracts links.
|
boolean |
ExtractorHTML.innerExtract(CrawlURI curi) |
protected void |
HTTPContentDigest.innerProcess(CrawlURI curi) |
protected void |
Extractor.innerProcess(CrawlURI uri)
Processes the given URI.
|
protected boolean |
ExtractorHTML.isHtmlExpectedHere(CrawlURI curi)
Test whether this HTML is so unexpected (e.g., in place of a GIF URI)
that it shouldn't be scanned for links.
|
protected void |
ExtractorHTML.processEmbed(CrawlURI curi,
CharSequence value,
CharSequence context) |
protected void |
ExtractorHTML.processEmbed(CrawlURI curi,
CharSequence value,
CharSequence context,
Hop hop) |
protected void |
JerichoExtractorHTML.processForm(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
ExtractorHTML.processGeneralTag(CrawlURI curi,
CharSequence element,
CharSequence cs) |
protected void |
JerichoExtractorHTML.processGeneralTag(CrawlURI curi,
au.id.jericho.lib.html.Element element,
au.id.jericho.lib.html.Attributes attributes) |
protected void |
ExtractorHTML.processLink(CrawlURI curi,
CharSequence value,
CharSequence context)
Handle generic HREF cases.
|
protected boolean |
ExtractorHTML.processMeta(CrawlURI curi,
CharSequence cs)
Process metadata tags.
|
protected boolean |
JerichoExtractorHTML.processMeta(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
AggressiveExtractorHTML.processScript(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag) |
protected void |
ExtractorHTML.processScript(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag) |
protected void |
JerichoExtractorHTML.processScript(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
protected void |
ExtractorHTML.processScriptCode(CrawlURI curi,
CharSequence cs)
Extract the (java)script source in the given CharSequence.
|
protected void |
ExtractorHTML.processStyle(CrawlURI curi,
CharSequence sequence,
int endOfOpenTag)
Process style text.
|
protected void |
JerichoExtractorHTML.processStyle(CrawlURI curi,
au.id.jericho.lib.html.Element element) |
static long |
ExtractorCSS.processStyleCode(Extractor ext,
CrawlURI curi,
CharSequence cs) |
static long |
ExtractorXML.processXml(Extractor ext,
CrawlURI curi,
CharSequence cs) |
protected boolean |
ExtractorSWF.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorSitemap.shouldExtract(CrawlURI uri) |
protected boolean |
TrapSuppressExtractor.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorUniversal.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorRobotsTxt.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorXML.shouldExtract(CrawlURI curi) |
protected boolean |
ExtractorDOC.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorCSS.shouldExtract(CrawlURI curi) |
protected boolean |
ExtractorPDF.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorJS.shouldExtract(CrawlURI uri) |
protected abstract boolean |
ContentExtractor.shouldExtract(CrawlURI uri)
Determines if otherwise valid URIs should have links extracted or not.
|
protected boolean |
ExtractorHTML.shouldExtract(CrawlURI uri) |
protected boolean |
ExtractorURI.shouldProcess(CrawlURI uri) |
protected boolean |
ExtractorMultipleRegex.shouldProcess(CrawlURI uri) |
protected boolean |
ExtractorHTTP.shouldProcess(CrawlURI uri) |
protected boolean |
HTTPContentDigest.shouldProcess(CrawlURI uri) |
protected boolean |
ContentExtractor.shouldProcess(CrawlURI uri)
Determines if links should be extracted from the given URI.
|
protected boolean |
ExtractorImpliedURI.shouldProcess(CrawlURI uri) |
Constructor and Description |
---|
CrawlUriSWFAction(CrawlURI curi,
Extractor ext) |
TestData(CrawlURI uri,
CrawlURI expectedResult) |
Modifier and Type | Field and Description |
---|---|
protected CrawlURI |
FetchHTTPRequest.curi |
Modifier and Type | Method and Description |
---|---|
protected void |
FetchHTTP.addResponseContent(org.apache.http.HttpResponse response,
CrawlURI curi)
This method populates
curi with response status and
content type. |
protected void |
FetchWhois.addWhoisLink(CrawlURI curi,
String query) |
protected void |
FetchWhois.addWhoisLinks(CrawlURI curi)
Adds outlinks to whois:{domain} and whois:{ipAddress}
|
protected org.apache.http.HttpEntity |
FetchHTTPRequest.buildPostRequestEntity(CrawlURI curi) |
protected boolean |
FetchHTTP.checkMidfetchAbort(CrawlURI curi) |
protected void |
FetchHTTP.cleanup(CrawlURI curi,
Exception exception,
String message,
int status)
Clean up after a failed method execution.
|
org.apache.http.client.CookieStore |
FetchHTTPCookieStore.cookieStoreFor(CrawlURI curi)
Returns a CookieStore whose CookieStore.getCookies()
returns all the cookies that could possibly apply to curi. |
org.apache.http.client.CookieStore |
AbstractCookieStore.cookieStoreFor(CrawlURI curi) |
protected ProcessResult |
FetchWhois.deferOrFinishGeneric(CrawlURI curi,
String domainOrIp) |
protected void |
FetchHTTP.doAbort(CrawlURI curi,
org.apache.http.client.methods.AbstractExecutionAwareRequest request,
String annotation) |
protected Map<String,String> |
FetchHTTP.extractChallenges(org.apache.http.HttpResponse response,
CrawlURI curi,
org.apache.http.client.AuthenticationStrategy authStrategy) |
protected void |
FetchHTTP.failedExecuteCleanup(CrawlURI curi,
Exception exception)
Clean up after a failed method execution.
|
protected void |
FetchWhois.fetch(CrawlURI curi,
String whoisServer,
String whoisQuery) |
protected Object |
FetchHTTP.getAttributeEither(CrawlURI curi,
String key)
Get a value either from inside the CrawlURI instance, or from
settings (module attributes).
|
protected Set<Credential> |
FetchHTTP.getCredentials(CrawlURI curi,
Class<?> type) |
protected static String |
FetchHTTP.getServerKey(CrawlURI uri) |
protected String |
FetchWhois.getWhoisQuery(CrawlURI curi) |
protected String |
FetchWhois.getWhoisServer(CrawlURI curi) |
protected void |
FetchHTTP.handle401(org.apache.http.HttpResponse response,
CrawlURI curi)
Server is looking for basic/digest auth credentials (RFC2617).
|
protected void |
FetchFTP.innerProcess(CrawlURI curi)
Processes the given URI.
|
protected void |
FetchWhois.innerProcess(CrawlURI uri) |
protected void |
FetchSFTP.innerProcess(CrawlURI curi)
Processes the given URI.
|
protected void |
FetchDNS.innerProcess(CrawlURI curi) |
protected void |
FetchHTTP.innerProcess(CrawlURI curi) |
protected ProcessResult |
FetchWhois.innerProcessResult(CrawlURI curi) |
protected boolean |
FetchDNS.isQuadAddress(CrawlURI curi,
String dnsName,
CrawlHost targetHost) |
protected boolean |
FetchHTTP.maybeMidfetchAbort(CrawlURI curi,
org.apache.http.client.methods.AbstractExecutionAwareRequest request) |
protected void |
FetchHTTP.promoteCredentials(CrawlURI curi)
Promote successful credential to the server.
|
protected void |
FetchDNS.recordDNS(CrawlURI curi,
org.xbill.DNS.Record[] rrecordSet) |
protected void |
FetchHTTP.setCharacterEncoding(CrawlURI curi,
Recorder rec,
org.apache.http.HttpResponse response)
Set the character encoding based on the result headers or default.
|
protected void |
FetchHTTP.setOtherCodings(CrawlURI uri,
Recorder rec,
org.apache.http.HttpResponse response)
Set the transfer, content encodings based on headers (if necessary).
|
protected void |
FetchHTTP.setSizes(CrawlURI curi,
Recorder rec)
Update CrawlURI internal sizes based on current transaction (and
in the case of 304s, history)
|
protected void |
FetchDNS.setUnresolvable(CrawlURI curi,
CrawlHost host) |
protected boolean |
FetchFTP.shouldProcess(CrawlURI curi) |
protected boolean |
FetchWhois.shouldProcess(CrawlURI uri) |
protected boolean |
FetchSFTP.shouldProcess(CrawlURI curi) |
protected boolean |
FetchDNS.shouldProcess(CrawlURI curi) |
protected boolean |
FetchHTTP.shouldProcess(CrawlURI curi)
Can this processor fetch the given CrawlURI.
|
protected void |
FetchDNS.storeDNSRecord(CrawlURI curi,
String dnsName,
CrawlHost targetHost,
org.xbill.DNS.Record[] rrecordSet) |
void |
FetchStats.tally(CrawlURI curi,
FetchStats.Stage stage) |
void |
FetchStats.CollectsFetchStats.tally(CrawlURI curi,
FetchStats.Stage stage) |
Constructor and Description |
---|
FetchHTTPRequest(FetchHTTP fetcher,
CrawlURI curi) |
Modifier and Type | Method and Description |
---|---|
protected void |
ExtractorHTMLForms.analyze(CrawlURI curi,
CharSequence cs)
Run analysis: find form METHOD, ACTION, and all INPUT names/values.
Log as configured.
|
protected void |
FormLoginProcessor.createFormSubmissionAttempt(CrawlURI curi,
HTMLForm templateForm,
String formProvince) |
void |
ExtractorHTMLForms.extract(CrawlURI curi) |
protected String |
FormLoginProcessor.getFormProvince(CrawlURI curi)
Get the 'form province' - either the configured (applicableSurtPrefix)
or inferred (full current server) range of URIs that is considered
covered by one form login
|
protected void |
FormLoginProcessor.innerProcess(CrawlURI curi) |
protected boolean |
FormLoginProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
ExtractorHTMLForms.shouldProcess(CrawlURI uri) |
Modifier and Type | Method and Description |
---|---|
boolean |
ObeyRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
FirstNamedRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
CustomRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
MostFavoredRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
abstract boolean |
RobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
boolean |
IgnoreRobotsPolicy.allows(String userAgent,
CrawlURI curi,
Robotstxt robotstxt) |
String |
RobotsPolicy.getPathQuery(CrawlURI curi) |
void |
CrawlServer.updateRobots(CrawlURI curi)
Update the server's robotstxt
|
Modifier and Type | Method and Description |
---|---|
static boolean |
FetchHistoryProcessor.hasIdenticalDigest(CrawlURI curi)
Utility method for testing if a CrawlURI's last two history
entries (one being the most recent fetch) have identical
content-digest information.
|
protected boolean |
AbstractPersistProcessor.hasWriteTag(CrawlURI uri) |
protected HashMap<String,Object>[] |
FetchHistoryProcessor.historyRealloc(CrawlURI curi)
Get or create proper-sized history array
|
protected void |
FetchHistoryProcessor.innerProcess(CrawlURI puri) |
protected void |
PersistStoreProcessor.innerProcess(CrawlURI curi) |
protected void |
PersistLoadProcessor.innerProcess(CrawlURI curi) |
protected void |
PersistLogProcessor.innerProcess(CrawlURI curi) |
protected void |
ContentDigestHistoryStorer.innerProcess(CrawlURI curi) |
protected void |
ContentDigestHistoryLoader.innerProcess(CrawlURI curi) |
void |
BdbContentDigestHistory.load(CrawlURI curi) |
abstract void |
AbstractContentDigestHistory.load(CrawlURI curi)
Looks up the history by key persistKeyFor(curi) and loads it into
curi.getContentDigestHistory(). |
protected String |
AbstractContentDigestHistory.persistKeyFor(CrawlURI curi) |
static String |
PersistProcessor.persistKeyFor(CrawlURI curi)
Return a preferred String key for persisting the given CrawlURI's
AList state.
|
protected void |
FetchHistoryProcessor.saveHeader(CrawlURI curi,
Map<String,Object> map,
String key)
Save a header from the given HTTP operation into the Map.
|
protected boolean |
AbstractPersistProcessor.shouldLoad(CrawlURI curi)
Whether the current CrawlURI's state should be loaded
|
protected boolean |
FetchHistoryProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
PersistStoreProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
PersistLoadProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
PersistLogProcessor.shouldProcess(CrawlURI uri) |
protected boolean |
ContentDigestHistoryStorer.shouldProcess(CrawlURI uri) |
protected boolean |
ContentDigestHistoryLoader.shouldProcess(CrawlURI uri) |
protected boolean |
AbstractPersistProcessor.shouldStore(CrawlURI curi)
Whether the current CrawlURI's state should be persisted (to log or
direct to database)
|
void |
BdbContentDigestHistory.store(CrawlURI curi) |
abstract void |
AbstractContentDigestHistory.store(CrawlURI curi)
Stores curi.getContentDigestHistory() for the key
persistKeyFor(curi). |
Modifier and Type | Method and Description |
---|---|
void |
SeedListener.addedSeed(CrawlURI uuri) |
void |
TextSeedModule.addSeed(CrawlURI curi)
Add a new seed to scope.
|
abstract void |
SeedModule.addSeed(CrawlURI curi) |
protected void |
SeedModule.publishAddedSeed(CrawlURI curi) |
Modifier and Type | Method and Description |
---|---|
org.archive.io.warc.WARCRecordInfo |
MetadataRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
HttpRequestRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
DnsResponseRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
FtpControlConversationRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
HttpResponseRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
RevisitRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
FtpResponseRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
org.archive.io.warc.WARCRecordInfo |
WARCRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo)
Builds a warc record for this capture.
|
org.archive.io.warc.WARCRecordInfo |
WhoisResponseRecordBuilder.buildRecord(CrawlURI curi,
URI concurrentTo) |
protected String |
BaseWARCRecordBuilder.getHostAddress(CrawlURI curi)
Return IP address of given URI suitable for recording (as in a
classic ARC 5-field header line).
|
boolean |
MetadataRecordBuilder.shouldBuildRecord(CrawlURI curi)
If you don't want metadata records, take this class out of the chain.
|
boolean |
HttpRequestRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
DnsResponseRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
FtpControlConversationRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
HttpResponseRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
RevisitRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
FtpResponseRecordBuilder.shouldBuildRecord(CrawlURI curi) |
boolean |
WARCRecordBuilder.shouldBuildRecord(CrawlURI curi)
Decides whether to build a record for the given capture.
|
boolean |
WhoisResponseRecordBuilder.shouldBuildRecord(CrawlURI curi) |
Modifier and Type | Method and Description |
---|---|
protected void |
WriterPoolProcessor.copyForwardWriteTagIfDupe(CrawlURI curi)
If this fetch is identical to the last written (archived) fetch, then
copy forward the writeTag.
|
protected String |
WriterPoolProcessor.getHostAddress(CrawlURI curi)
Deprecated.
WARCRecordBuilder instances use
BaseWARCRecordBuilder.getHostAddress(CrawlURI) |
protected OutputStream |
Kw3WriterProcessor.initOutputStream(CrawlURI curi)
Get the OutputStream for the file to write to.
|
protected void |
MirrorWriterProcessor.innerProcess(CrawlURI curi) |
protected void |
Kw3WriterProcessor.innerProcess(CrawlURI curi) |
protected void |
WriterPoolProcessor.innerProcess(CrawlURI puri) |
protected ProcessResult |
WARCWriterProcessor.innerProcessResult(CrawlURI curi)
Deprecated.
Writes a CrawlURI and its associated data to store file.
|
protected ProcessResult |
WARCWriterChainProcessor.innerProcessResult(CrawlURI curi) |
protected ProcessResult |
ARCWriterProcessor.innerProcessResult(CrawlURI curi)
Writes a CrawlURI and its associated data to store file.
|
protected abstract ProcessResult |
WriterPoolProcessor.innerProcessResult(CrawlURI uri) |
protected void |
WriterPoolProcessor.innerRejectProcess(CrawlURI curi) |
protected void |
WARCWriterProcessor.saveHeader(CrawlURI curi,
org.archive.util.anvl.ANVLRecord warcHeaders,
String origName,
String newName)
Deprecated.
Saves a header from the given HTTP operation into the
provided headers under a new name.
|
protected boolean |
MirrorWriterProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
Kw3WriterProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
WriterPoolProcessor.shouldProcess(CrawlURI curi) |
protected boolean |
WARCWriterChainProcessor.shouldWrite(CrawlURI curi) |
protected boolean |
WriterPoolProcessor.shouldWrite(CrawlURI curi)
Whether the given CrawlURI should be written to archive files.
|
protected void |
BaseWARCWriterProcessor.updateMetadataAfterWrite(CrawlURI curi,
org.archive.io.warc.WARCWriter writer,
long startPosition) |
protected ProcessResult |
WARCWriterChainProcessor.write(CrawlURI curi) |
protected ProcessResult |
ARCWriterProcessor.write(CrawlURI curi,
long recordLength,
InputStream in,
String ip) |
protected ProcessResult |
WARCWriterProcessor.write(String lowerCaseScheme,
CrawlURI curi)
Deprecated.
|
protected void |
Kw3WriterProcessor.writeArchiveInfoPart(String boundary,
CrawlURI curi,
ReplayInputStream ris,
OutputStream out) |
protected void |
Kw3WriterProcessor.writeContentPart(String boundary,
CrawlURI curi,
ReplayInputStream ris,
OutputStream out) |
protected void |
WARCWriterProcessor.writeDnsRecords(CrawlURI curi,
org.archive.io.warc.WARCWriter w,
URI baseid,
String timestamp)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeFtpControlConversation(org.archive.io.warc.WARCWriter w,
String timestamp,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord headers,
String controlConversation)
Deprecated.
|
protected void |
WARCWriterProcessor.writeFtpRecords(org.archive.io.warc.WARCWriter w,
CrawlURI curi,
URI baseid,
String timestamp)
Deprecated.
|
protected void |
WARCWriterProcessor.writeHttpRecords(CrawlURI curi,
org.archive.io.warc.WARCWriter w,
URI baseid,
String timestamp)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeMetadata(org.archive.io.warc.WARCWriter w,
String timestamp,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields)
Deprecated.
|
protected void |
Kw3WriterProcessor.writeMimeFile(CrawlURI curi)
The actual writing of the Kulturarw3 MIME-file.
|
protected void |
WARCWriterChainProcessor.writeRecords(CrawlURI curi,
org.archive.io.warc.WARCWriter writer) |
protected URI |
WARCWriterProcessor.writeRequest(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeResource(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord namedFields)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeResponse(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord suppliedFields)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeRevisit(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord headers)
Deprecated.
|
protected URI |
WARCWriterProcessor.writeRevisit(org.archive.io.warc.WARCWriter w,
String timestamp,
String mimetype,
URI baseid,
CrawlURI curi,
org.archive.util.anvl.ANVLRecord headers,
long contentLength)
Deprecated.
|
protected void |
WARCWriterProcessor.writeWhoisRecords(org.archive.io.warc.WARCWriter w,
CrawlURI curi,
URI baseid,
String timestamp)
Deprecated.
|
Modifier and Type | Method and Description |
---|---|
protected CrawlURI |
ModuleTestBase.makeCrawlURI(String uri) |
Copyright © 2003–2021 Internet Archive. All rights reserved.