Class HtmlTextExtractor
java.lang.Object
dev.langchain4j.data.document.transformer.HtmlTextExtractor
- All Implemented Interfaces:
DocumentTransformer
Extracts text from a given HTML document.
A CSS selector can be specified to extract text only from desired element(s).
Also, multiple CSS selectors can be specified to extract metadata from desired elements.
Constructor Summary
Constructors
Constructor - Description
HtmlTextExtractor() - Constructs an instance of HtmlTextExtractor that extracts all text from a given Document containing HTML.
HtmlTextExtractor(String cssSelector, Map<String, String> metadataCssSelectors, boolean includeLinks) - Constructs an instance of HtmlTextExtractor that extracts text from HTML elements matching the provided CSS selector.
Method Summary
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface dev.langchain4j.data.document.DocumentTransformer
transformAll
Constructor Details
HtmlTextExtractor
public HtmlTextExtractor()
Constructs an instance of HtmlTextExtractor that extracts all text from a given Document containing HTML.
HtmlTextExtractor
public HtmlTextExtractor(String cssSelector, Map<String, String> metadataCssSelectors, boolean includeLinks)
Constructs an instance of HtmlTextExtractor that extracts text from HTML elements matching the provided CSS selector.
- Parameters:
cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
includeLinks - Specifies whether links should be included in the extracted text.
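The roles of cssSelector and metadataCssSelectors can be illustrated with a small, standard-library-only sketch. This is not the HtmlTextExtractor implementation: the class SelectorSketch and its methods parse, textById and extractMetadata are hypothetical names introduced for illustration, the sketch assumes well-formed (XHTML-like) markup, it understands only "#id" selectors, and it ignores includeLinks entirely.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Illustrative sketch only; NOT the library's implementation.
// Assumptions: well-formed markup, "#id" selectors only, no link handling.
public class SelectorSketch {

    // Parses well-formed markup into a W3C DOM document (not a langchain4j Document).
    public static Document parse(String html) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));
    }

    // Returns the whitespace-normalized text of the element matching "#id",
    // or an empty string if no such element exists.
    public static String textById(Document dom, String idSelector) throws Exception {
        String id = idSelector.substring(1); // drop the leading '#'
        XPath xpath = XPathFactory.newInstance().newXPath();
        Node node = (Node) xpath.evaluate("//*[@id='" + id + "']", dom, XPathConstants.NODE);
        return node == null ? "" : node.getTextContent().trim().replaceAll("\\s+", " ");
    }

    // Mirrors the idea of metadataCssSelectors: each metadata key maps to a
    // selector, and the selected element's text is stored under that key.
    public static Map<String, String> extractMetadata(Document dom,
                                                      Map<String, String> metadataSelectors)
            throws Exception {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadataSelectors.entrySet()) {
            metadata.put(entry.getKey(), textById(dom, entry.getValue()));
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<h1 id=\"page-title\">Release notes</h1>"
                + "<div id=\"page-content\"><p>Version 2 adds   streaming.</p></div>"
                + "<div id=\"footer\">Contact us</div>"
                + "</body></html>";
        Document dom = parse(html);

        // Main text comes only from the element selected by "#page-content".
        System.out.println(textById(dom, "#page-content")); // Version 2 adds streaming.

        // Metadata is collected separately, keyed by the map's keys.
        System.out.println(extractMetadata(dom, Map.of("title", "#page-title"))); // {title=Release notes}
    }
}
```

In the sketch, the footer text never appears in the output because nothing selects it, which is the effect a narrow cssSelector is meant to have. Full CSS selector syntax, tolerance of malformed HTML, and link rendering are outside what this sketch attempts.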
Method Details
transform
- Specified by:
transform in interface DocumentTransformer