Package com.metreeca.xml.actions
Class Crawl
java.lang.Object
com.metreeca.xml.actions.Crawl
Site crawling.
Maps site root URLs to streams of URLs for HTML site pages.
-
Constructor Summary
-
Method Summary
Method and Description
apply - Crawls a site.
fetch - Configures the fetch action (defaults to Fetch).
focus - Configures the content focus action (defaults to the identity function).
prune(BiPredicate<String, String> prune) - Configures the prune action (defaults to always pass).
threads(int threads) - Configures the number of concurrent requests (defaults to the number of processors).
-
Constructor Details
-
Crawl
public Crawl()
-
-
Method Details
-
threads
Configures the number of concurrent requests (defaults to the number of processors).
Parameters:
threads - the maximum number of concurrent resource fetches; equivalent to the number of system processors if equal to zero
Returns:
this action
Throws:
IllegalArgumentException - if threads is negative
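The handling of zero and negative values described above can be sketched as follows. This only illustrates the documented rule and is not taken from the Crawl source; the ThreadsSketch class and its resolve method are hypothetical names introduced for the example.

```java
// Hypothetical helper illustrating the documented rule; not part of Crawl.
final class ThreadsSketch {

    // Returns the effective number of concurrent fetches for a requested value.
    static int resolve(final int threads) {

        if ( threads < 0 ) {
            throw new IllegalArgumentException("negative thread count");
        }

        // Zero is documented as meaning "use the number of system processors".
        return threads == 0 ? Runtime.getRuntime().availableProcessors() : threads;
    }

    public static void main(final String... args) {
        System.out.println(resolve(4)); // explicit limit is kept as is
        System.out.println(resolve(0)); // machine-dependent processor count
    }

}
```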
-
fetch
Configures the fetch action (defaults to Fetch).
Parameters:
fetch - the action used to fetch pages
Returns:
this action
Throws:
NullPointerException - if fetch is null
-
focus
Configures the content focus action (defaults to the identity function).
Parameters:
focus - a function taking as argument an element and returning an optional partial/restructured focus element, if one was identified, or an empty optional, otherwise
Returns:
this action
Throws:
NullPointerException - if focus is null
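A focus function of the kind described above might look like the following sketch. The parameter type of focus is not shown on this page, so the Function<Element, Optional<Element>> shape over org.w3c.dom.Element is an assumption, as is the choice of the first main element as the focus; FocusSketch, MAIN and focusText are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration; the class and member names are not part of the library.
final class FocusSketch {

    // ASSUMPTION: the focus action takes a DOM element and returns an optional element.
    // Returns the first <main> descendant, or an empty optional if there is none.
    static final Function<Element, Optional<Element>> MAIN = element -> {
        final NodeList mains = element.getElementsByTagName("main");
        return mains.getLength() > 0
                ? Optional.of((Element) mains.item(0))
                : Optional.<Element>empty();
    };

    // Parses well-formed markup and returns the text of the focus element, or "" if none was found.
    static String focusText(final String xml) {
        try {
            final Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            return MAIN.apply(root).map(Element::getTextContent).orElse("");
        } catch ( final Exception e ) {
            throw new IllegalArgumentException("malformed markup", e);
        }
    }

    public static void main(final String... args) {
        System.out.println(focusText("<html><body><nav>menu</nav><main>text</main></body></html>"));
    }

}
```

Returning an empty optional signals that no focus element was identified; how Crawl treats that case is not stated on this page.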
-
prune
Configures the prune action (defaults to always pass).
Parameters:
prune - a bi-predicate taking as arguments the site root URL and a link URL and returning true if the link targets a site page or false otherwise
Returns:
this action
Throws:
NullPointerException - if prune is null
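As an illustration of the prune contract above, the following sketch defines a bi-predicate that accepts a link only when it shares the host of the site root and its path extends the root path. The same-host policy, the PruneSketch class and the SAME_SITE constant are assumptions made for the example, not library defaults.

```java
import java.net.URI;
import java.util.function.BiPredicate;

// Hypothetical illustration; the class and constant names are not part of the library.
final class PruneSketch {

    // Arguments follow the documented order: site root URL first, link URL second.
    static final BiPredicate<String, String> SAME_SITE = (root, link) -> {
        try {

            final URI r = URI.create(root);
            final URI l = URI.create(link);

            // Opaque links such as mailto: have no host and are rejected here.
            return r.getHost() != null
                    && r.getHost().equalsIgnoreCase(l.getHost())
                    && l.getPath().startsWith(r.getPath());

        } catch ( final IllegalArgumentException e ) {
            return false; // malformed URLs are treated as off-site
        }
    };

    public static void main(final String... args) {
        System.out.println(SAME_SITE.test("https://example.com/", "https://example.com/docs/a.html"));
        System.out.println(SAME_SITE.test("https://example.com/", "https://other.example.org/"));
    }

}
```

Given the prune(BiPredicate<String, String> prune) signature listed in the method summary, such a predicate would presumably be supplied as new Crawl().prune(PruneSketch.SAME_SITE).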
-
apply
Crawls a site.
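The mapping from a site root URL to a stream of page URLs can be pictured with the following in-memory sketch. It is a simplified, sequential stand-in under stated assumptions and not the Crawl implementation: the traversal order, the links function (standing in for fetching a page and extracting its links) and all names in CrawlSketch are hypothetical, and concurrency is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration; none of these names belong to the library.
final class CrawlSketch {

    // Breadth-first traversal from the root, following only links accepted by prune.
    static Stream<String> crawl(
            final String root,
            final Function<String, List<String>> links, // ASSUMPTION: stand-in for fetch plus link extraction
            final BiPredicate<String, String> prune
    ) {

        final Set<String> visited = new LinkedHashSet<>();
        final Deque<String> pending = new ArrayDeque<>();

        pending.add(root);

        while ( !pending.isEmpty() ) {

            final String page = pending.remove();

            if ( visited.add(page) ) { // skip pages that were already visited
                for (final String link : links.apply(page)) {
                    if ( prune.test(root, link) ) { pending.add(link); }
                }
            }
        }

        return visited.stream();
    }

    // Crawls a tiny in-memory site using a prefix-based prune rule.
    static List<String> demo() {

        final Map<String, List<String>> site = Map.of(
                "https://example.com/", List.of("https://example.com/a", "https://elsewhere.test/"),
                "https://example.com/a", List.of("https://example.com/", "https://example.com/b"),
                "https://example.com/b", List.of()
        );

        return crawl(
                "https://example.com/",
                url -> site.getOrDefault(url, List.of()),
                (root, link) -> link.startsWith(root)
        ).collect(Collectors.toList());
    }

    public static void main(final String... args) {
        demo().forEach(System.out::println);
    }

}
```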
-