Class Crawl

java.lang.Object
com.metreeca.xml.actions.Crawl
All Implemented Interfaces:
Function<String,Stream<String>>

public final class Crawl extends Object implements Function<String,Stream<String>>
Site crawling.

Maps site root URLs to streams of URLs for HTML site pages.

  • Constructor Details

    • Crawl

      public Crawl()
  • Method Details

    • threads

      public Crawl threads(int threads)
      Configures the number of concurrent requests (defaults to the number of processors).
      Parameters:
      threads - the maximum number of concurrent resource fetches; if zero, the number of system processors is used
      Returns:
      this action
      Throws:
      IllegalArgumentException - if threads is negative
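
The documented contract for threads (zero selects the processor count, negative values are rejected) can be illustrated with a minimal sketch. The Threads class and its resolve helper are hypothetical names introduced for illustration; this is not the library's actual code.

```java
// Hypothetical helper mirroring the documented contract of threads(int);
// an assumption for illustration, not the Crawl implementation.
final class Threads {

    static int resolve(int threads) {
        if (threads < 0) {
            throw new IllegalArgumentException("negative thread count: " + threads);
        }
        // Zero means "use as many threads as there are system processors".
        return threads == 0 ? Runtime.getRuntime().availableProcessors() : threads;
    }

    public static void main(String[] args) {
        System.out.println(resolve(4));
        System.out.println(resolve(0));
    }
}
```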
    • fetch

      public Crawl fetch(Fetch fetch)
      Configures the fetch action (defaults to Fetch).
      Parameters:
      fetch - the action used to fetch pages
      Returns:
      this action
      Throws:
      NullPointerException - if fetch is null
    • focus

      public Crawl focus(Function<? super Node,Optional<Node>> focus)
      Configures the content focus action (defaults to the identity function).
      Parameters:
      focus - a function taking as argument an element and returning an optional partial/restructured focus element, if one was identified, or an empty optional, otherwise
      Returns:
      this action
      Throws:
      NullPointerException - if focus is null
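
A focus function of the shape described above might look like the following sketch. It assumes that Node is org.w3c.dom.Node, which this page does not confirm, and that the page's main content sits in a main element; MainFocus, MAIN and parse are hypothetical names.

```java
import java.io.StringReader;
import java.util.Optional;
import java.util.function.Function;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical focus function; assumes Node is org.w3c.dom.Node.
final class MainFocus {

    // Returns the first <main> element at or below the given node, if any.
    static final Function<Node, Optional<Node>> MAIN = node -> {
        Element root = node instanceof Document ? ((Document) node).getDocumentElement()
                : node instanceof Element ? (Element) node
                : null;
        if (root == null) {
            return Optional.empty();
        }
        if ("main".equals(root.getNodeName())) {
            return Optional.of(root);
        }
        NodeList mains = root.getElementsByTagName("main");
        return mains.getLength() > 0 ? Optional.of(mains.item(0)) : Optional.empty();
    };

    // Test helper: parses well-formed markup with the JDK XML parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document page = parse("<html><body><nav/><main><p>Hi</p></main></body></html>");
        System.out.println(MAIN.apply(page).map(Node::getTextContent).orElse("(no focus)"));
    }
}
```

Under the same assumption about the Node type, such a function could be passed as new Crawl().focus(MainFocus.MAIN).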
    • prune

      public Crawl prune(BiPredicate<String,String> prune)
      Configures the prune action (defaults to accepting all links).
      Parameters:
      prune - a bi-predicate taking as arguments the site root URL and a link URL and returning true if the link targets a site page or false otherwise
      Returns:
      this action
      Throws:
      NullPointerException - if prune is null
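
One plausible prune predicate restricts the crawl to links that share the root's host and extend its path. The sketch below uses only the JDK; SameHostPrune and SAME_SITE are hypothetical names, and the same-host rule is an illustrative policy rather than anything this class prescribes.

```java
import java.net.URI;
import java.util.function.BiPredicate;

// Hypothetical prune predicate: (site root URL, link URL) -> is the link a site page?
final class SameHostPrune {

    static final BiPredicate<String, String> SAME_SITE = (root, link) -> {
        try {
            URI r = URI.create(root);
            URI l = URI.create(link);
            return r.getHost() != null
                    && r.getHost().equalsIgnoreCase(l.getHost())
                    && l.getPath() != null
                    && l.getPath().startsWith(r.getPath());
        } catch (IllegalArgumentException e) {
            return false; // malformed URLs are treated as off-site
        }
    };

    public static void main(String[] args) {
        String root = "https://example.com/docs/";
        System.out.println(SAME_SITE.test(root, "https://example.com/docs/intro.html"));
        System.out.println(SAME_SITE.test(root, "https://other.org/docs/intro.html"));
    }
}
```

Given the signature above, the predicate could be supplied as new Crawl().prune(SameHostPrune.SAME_SITE).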
    • apply

      public Stream<String> apply(String root)
      Crawls a site.
      Specified by:
      apply in interface Function<String,Stream<String>>
      Parameters:
      root - the root URL of the site to be crawled
      Returns:
      a stream of links to nested HTML pages reachable from root; empty if root is null or empty
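
The apply contract can be illustrated with a self-contained stand-in that walks an in-memory link map breadth-first instead of fetching pages. CrawlSketch and its constructor arguments are hypothetical, the traversal order and the inclusion of the root in the result are assumptions, and nothing here reflects how Crawl is actually implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical stand-in with the same functional shape as Crawl.
final class CrawlSketch implements Function<String, Stream<String>> {

    private final Map<String, List<String>> links;   // page URL -> outgoing links (replaces fetching)
    private final BiPredicate<String, String> prune; // (root, link) -> link targets a site page

    CrawlSketch(Map<String, List<String>> links, BiPredicate<String, String> prune) {
        this.links = links;
        this.prune = prune;
    }

    @Override
    public Stream<String> apply(String root) {
        if (root == null || root.isEmpty()) {
            return Stream.empty(); // documented behaviour for null or empty roots
        }
        Set<String> visited = new LinkedHashSet<>(); // keeps discovery order, drops duplicates
        Deque<String> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            String page = pending.remove();
            if (!visited.add(page)) {
                continue; // already seen
            }
            for (String link : links.getOrDefault(page, List.of())) {
                if (prune.test(root, link)) {
                    pending.add(link);
                }
            }
        }
        return visited.stream(); // assumption: the root itself is part of the result
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "https://example.com/", List.of("https://example.com/a", "https://other.org/x"),
                "https://example.com/a", List.of("https://example.com/"));
        new CrawlSketch(site, (r, l) -> l.startsWith(r))
                .apply("https://example.com/")
                .forEach(System.out::println);
    }
}
```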