Class HtmlResource

java.lang.Object
org.openpdf.resource.AbstractResource
org.openpdf.resource.HtmlResource
All Implemented Interfaces:
Resource

public class HtmlResource extends AbstractResource
HtmlResource provides HTML5-compliant parsing using htmlunit-neko's DOMParser.

This class leverages the htmlunit-neko parser (https://github.com/HtmlUnit/htmlunit-neko) for error-tolerant HTML parsing with the following features:

  • HTML5-compliant parsing
  • Error-tolerant: handles malformed HTML gracefully
  • Automatic tag balancing and fixing
  • Support for modern HTML5 semantic elements
  • Configurable parsing features

Example usage:

 // Parse an HTML string with default settings
 HtmlResource resource = HtmlResource.load("<html><body>Hello</body></html>");
 Document doc = resource.getDocument();

 // Parse with custom configuration (the unclosed <p> exercises tag balancing)
 String html = "<html><body><p>Hello</body></html>";
 HtmlParserConfig config = HtmlParserConfig.builder()
     .reportErrors(true)
     .allowSelfClosingTags(true)
     .build();
 HtmlResource configured = HtmlResource.load(html, config);
 
  • Method Details

    • load

      public static HtmlResource load(URL source)
      Load and parse an HTML document from a URL.
      Parameters:
      source - the URL to load HTML from
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(URL source, HtmlParserConfig config)
      Load and parse an HTML document from a URL with custom configuration.
      Parameters:
      source - the URL to load HTML from
      config - parser configuration
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(InputStream stream)
      Load and parse an HTML document from an InputStream.
      Parameters:
      stream - the InputStream containing HTML content
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(InputStream stream, HtmlParserConfig config, @Nullable String systemId)
      Load and parse an HTML document from an InputStream with custom configuration.
      Parameters:
      stream - the InputStream containing HTML content
      config - parser configuration
      systemId - optional system identifier for the document
      Returns:
      the parsed HtmlResource
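
The role of systemId is not spelled out above. A minimal sketch, assuming it serves as the base against which relative references in the document (images, stylesheets) are resolved, as is conventional for SAX system identifiers. SystemIdSketch, resolveAgainst, and the example.com URLs are hypothetical and only illustrate that resolution with the JDK:

```java
import java.net.URI;

public class SystemIdSketch {

    // Resolve a relative reference found in the document against the system identifier.
    // Assumption: systemId is an absolute URI string such as the page's own URL.
    static String resolveAgainst(String systemId, String relative) {
        return URI.create(systemId).resolve(relative).toString();
    }

    public static void main(String[] args) {
        String systemId = "https://example.com/docs/page.html";
        // prints https://example.com/docs/images/logo.png
        System.out.println(resolveAgainst(systemId, "images/logo.png"));
        // prints https://example.com/style.css
        System.out.println(resolveAgainst(systemId, "../style.css"));
    }
}
```

If that assumption holds, passing the page's original URL or file URI as systemId keeps relative links working when the HTML arrives as a bare stream.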
    • load

      public static HtmlResource load(Reader reader)
      Load and parse an HTML document from a Reader.
      Parameters:
      reader - the Reader containing HTML content
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(Reader reader, HtmlParserConfig config)
      Load and parse an HTML document from a Reader with custom configuration.
      Parameters:
      reader - the Reader containing HTML content
      config - parser configuration
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(String html)
      Load and parse an HTML document from a String.
      Parameters:
      html - the HTML content as a String
      Returns:
      the parsed HtmlResource
    • load

      public static HtmlResource load(String html, HtmlParserConfig config)
      Load and parse an HTML document from a String with custom configuration.
      Parameters:
      html - the HTML content as a String
      config - parser configuration
      Returns:
      the parsed HtmlResource
    • getDocument

      public Document getDocument()
      Get the parsed DOM Document.
      Returns:
      the DOM Document
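
Once a Document is available it can be queried with the standard org.w3c.dom API. The sketch below is a hedged illustration, not part of HtmlResource: standInDocument is a hypothetical helper that builds a Document from well-formed markup with the JDK parser so the example runs without this library, and textOf is a hypothetical helper showing a typical query. With a real HtmlResource, the Document would come from getDocument() instead, and the case of element names reported by the parser may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DocumentWalkSketch {

    // Collect the trimmed text content of every element with the given tag name.
    static List<String> textOf(Document doc, String tagName) {
        List<String> out = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    // Stand-in for HtmlResource.load(...).getDocument(); accepts only well-formed markup.
    static Document standInDocument(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = standInDocument("<html><body><p>Hello</p><p>World</p></body></html>");
        System.out.println(textOf(doc, "p")); // prints [Hello, World]
    }
}
```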
    • getElapsedLoadTime

      @CheckReturnValue public long getElapsedLoadTime()
      Get the time taken to load and parse the document.
      Returns:
      elapsed time in milliseconds
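
How HtmlResource measures this value is not stated here. As a rough sketch of one common way to take a comparable measurement around any load call, using only the JDK (ElapsedTimeSketch and elapsedMillis are hypothetical names, and the sleep stands in for the real work):

```java
import java.util.concurrent.TimeUnit;

public class ElapsedTimeSketch {

    // Run the task and return how long it took, in whole milliseconds.
    static long elapsedMillis(Runnable task) {
        long start = System.nanoTime(); // monotonic, unaffected by wall-clock adjustments
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long ms = elapsedMillis(() -> {
            try {
                Thread.sleep(20); // placeholder for loading and parsing a document
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("elapsed: " + ms + " ms");
    }
}
```

Because the method is annotated @CheckReturnValue, the returned value is meant to be used, for example logged or compared against a threshold, rather than discarded.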