java.lang.Object
org.openpdf.resource.AbstractResource
org.openpdf.resource.HtmlResource
- All Implemented Interfaces:
Resource
HtmlResource provides HTML5-compliant parsing using htmlunit-neko's DOMParser.
This class leverages the htmlunit-neko parser (https://github.com/HtmlUnit/htmlunit-neko) for error-tolerant HTML parsing with the following features:
- HTML5-compliant parsing
- Error-tolerant: handles malformed HTML gracefully
- Automatic tag balancing and fixing
- Support for modern HTML5 semantic elements
- Configurable parsing features
Example usage:
// Parse an HTML string with default settings
HtmlResource resource = HtmlResource.load("<html><body>Hello</body></html>");
Document doc = resource.getDocument();

// Parse with a custom configuration
String html = "<html><body>Hello</body></html>";
HtmlParserConfig config = HtmlParserConfig.builder()
    .reportErrors(true)
    .allowSelfClosingTags(true)
    .build();
HtmlResource configured = HtmlResource.load(html, config);
Method Summary
Modifier and Type | Method | Description
Document | getDocument() | Get the parsed DOM Document.
long | getElapsedLoadTime() | Get the time taken to load and parse the document.
static HtmlResource | load(InputStream stream) | Load and parse an HTML document from an InputStream.
static HtmlResource | load(InputStream stream, HtmlParserConfig config, @Nullable String systemId) | Load and parse an HTML document from an InputStream with custom configuration.
static HtmlResource | load(Reader reader) | Load and parse an HTML document from a Reader.
static HtmlResource | load(Reader reader, HtmlParserConfig config) | Load and parse an HTML document from a Reader with custom configuration.
static HtmlResource | load(String html) | Load and parse an HTML document from a String.
static HtmlResource | load(String html, HtmlParserConfig config) | Load and parse an HTML document from a String with custom configuration.
static HtmlResource | load(URL source) | Load and parse an HTML document from a URL.
static HtmlResource | load(URL source, HtmlParserConfig config) | Load and parse an HTML document from a URL with custom configuration.
Methods inherited from class org.openpdf.resource.AbstractResource
getResourceInputSource, getResourceLoadTimeStamp
-
Method Details
-
load
Load and parse an HTML document from a URL.
- Parameters:
source - the URL to load HTML from
- Returns:
- the parsed HtmlResource
-
load
Load and parse an HTML document from a URL with custom configuration.
- Parameters:
source - the URL to load HTML from
config - parser configuration
- Returns:
- the parsed HtmlResource
-
load
Load and parse an HTML document from an InputStream.
- Parameters:
stream - the InputStream containing HTML content
- Returns:
- the parsed HtmlResource
-
load
public static HtmlResource load(InputStream stream, HtmlParserConfig config, @Nullable String systemId)
Load and parse an HTML document from an InputStream with custom configuration.
- Parameters:
stream - the InputStream containing HTML content
config - parser configuration
systemId - optional system identifier for the document
- Returns:
- the parsed HtmlResource
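A system identifier is commonly the URI the document came from, so that relative references inside it can be resolved. The sketch below illustrates that idea with the JDK's java.net.URI only; it does not call HtmlResource, and whether HtmlResource uses systemId this way is an assumption. The class name SystemIdDemo and the example.com URLs are made up for illustration.

```java
import java.net.URI;

// Sketch only: shows how a system identifier can act as a base URI.
// It does not call HtmlResource; all URLs here are made-up examples.
final class SystemIdDemo {

    // Resolve a (possibly relative) reference against the document's system ID.
    static String resolve(String systemId, String reference) {
        return URI.create(systemId).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String systemId = "https://example.com/docs/index.html";
        // A path-relative reference is resolved against the document's directory.
        System.out.println(resolve(systemId, "img/logo.png"));
        // A root-relative reference is resolved against the host.
        System.out.println(resolve(systemId, "/css/site.css"));
    }
}
```

Passing null for systemId is allowed by the signature (the parameter is marked @Nullable); in that case there is no base against which relative references could be resolved.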
-
load
Load and parse an HTML document from a Reader.
- Parameters:
reader - the Reader containing HTML content
- Returns:
- the parsed HtmlResource
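A Reader supplies characters that are already decoded, so when the HTML arrives as bytes in a known encoding, the caller can decode it explicitly before handing the Reader over. The following is a minimal, JDK-only sketch of that preparation step; ReaderSourceDemo and decodeViaReader are hypothetical names, and the sketch reads the Reader back into a String instead of calling HtmlResource.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Sketch only: builds a UTF-8 Reader over raw bytes and drains it,
// to show what a Reader-based loader would receive.
final class ReaderSourceDemo {

    static String decodeViaReader(byte[] bytes) {
        // Decode explicitly as UTF-8 instead of relying on the platform default.
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeViaReader(raw));
    }
}
```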
-
load
Load and parse an HTML document from a Reader with custom configuration.
- Parameters:
reader - the Reader containing HTML content
config - parser configuration
- Returns:
- the parsed HtmlResource
-
load
Load and parse an HTML document from a String.
- Parameters:
html - the HTML content as a String
- Returns:
- the parsed HtmlResource
-
load
Load and parse an HTML document from a String with custom configuration.
- Parameters:
html - the HTML content as a String
config - parser configuration
- Returns:
- the parsed HtmlResource
-
getDocument
Get the parsed DOM Document.
- Returns:
- the DOM Document
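Assuming the returned object is a standard org.w3c.dom.Document (as the usage example at the top of this page suggests), it can be walked with the ordinary DOM API. The sketch below uses the JDK's XML parser on well-formed markup as a stand-in for the parsed result, so it runs without OpenPDF; DomWalkDemo, parseWellFormed, and textOf are hypothetical names. The case of element names produced by an HTML parser may differ from this stand-in.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: walks an org.w3c.dom.Document with the standard DOM API.
final class DomWalkDemo {

    // Stand-in for a parsed document: the JDK XML parser, which (unlike an
    // error-tolerant HTML parser) requires well-formed input.
    static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Collect the text content of every element with the given tag name.
    static List<String> textOf(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = parseWellFormed(
                "<html><body><p>Hello</p><p>World</p></body></html>");
        System.out.println(textOf(doc, "p"));
    }
}
```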
-
getElapsedLoadTime
@CheckReturnValue public long getElapsedLoadTime()
Get the time taken to load and parse the document.
- Returns:
- elapsed time in milliseconds
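To make the notion of an elapsed load time in milliseconds concrete, here is one way such a value could be captured around a loading step. This is a sketch under stated assumptions, not a description of how HtmlResource measures it; TimedLoad and measure are hypothetical names.

```java
import java.util.function.Supplier;

// Sketch only: pairs a loaded value with the wall-clock time the load took.
final class TimedLoad<T> {
    private final T value;
    private final long elapsedMillis;

    private TimedLoad(T value, long elapsedMillis) {
        this.value = value;
        this.elapsedMillis = elapsedMillis;
    }

    // Run the loader and record how long it took, in whole milliseconds.
    static <T> TimedLoad<T> measure(Supplier<T> loader) {
        long start = System.nanoTime(); // monotonic, suitable for intervals
        T value = loader.get();
        long elapsed = (System.nanoTime() - start) / 1_000_000L;
        return new TimedLoad<>(value, elapsed);
    }

    T value() {
        return value;
    }

    long elapsedMillis() {
        return elapsedMillis;
    }

    public static void main(String[] args) {
        TimedLoad<String> timed = measure(() -> "<html><body>Hello</body></html>");
        System.out.println("loaded " + timed.value().length()
                + " chars in " + timed.elapsedMillis() + " ms");
    }
}
```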
-