Class HtmlTextExtractor

java.lang.Object
dev.langchain4j.data.document.transformer.HtmlTextExtractor
All Implemented Interfaces:
DocumentTransformer

public class HtmlTextExtractor extends Object implements DocumentTransformer
Extracts text from a given HTML document. A CSS selector can be specified to extract text only from the desired element(s). Additional CSS selectors, each mapped to a metadata key, can be specified to extract metadata from the desired elements.
  • Constructor Details

    • HtmlTextExtractor

      public HtmlTextExtractor()
      Constructs an instance of HtmlTextExtractor that extracts all text from a given Document containing HTML.
    • HtmlTextExtractor

      public HtmlTextExtractor(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
      Constructs an instance of HtmlTextExtractor that extracts text from HTML elements matching the provided CSS selector.
      Parameters:
      cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
      metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with id "page-title" and store it in Metadata under the key "title".
      includeLinks - Specifies whether links should be included in the extracted text.
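The three parameters above can be assembled as in the following minimal sketch. It is only an illustration: the holder class Args and the method pageContentArgs are hypothetical names that are not part of the library, and they exist solely so the sketch compiles without the library on the classpath. The selector values are the ones used in the parameter descriptions; the value chosen for includeLinks is arbitrary.

```java
import java.util.Map;

class HtmlTextExtractorArgs {

    // Hypothetical holder mirroring the three documented constructor parameters.
    // It is NOT part of the library; it only keeps this sketch self-contained.
    static final class Args {
        final String cssSelector;
        final Map<String, String> metadataCssSelectors;
        final boolean includeLinks;

        Args(String cssSelector, Map<String, String> metadataCssSelectors, boolean includeLinks) {
            this.cssSelector = cssSelector;
            this.metadataCssSelectors = metadataCssSelectors;
            this.includeLinks = includeLinks;
        }
    }

    // Builds the example values from the parameter descriptions.
    static Args pageContentArgs() {
        // Metadata key -> CSS selector of the element whose text supplies the value.
        Map<String, String> metadataCssSelectors = Map.of("title", "#page-title");
        // Main text comes from the element with id "page-content"; links are kept (arbitrary choice).
        return new Args("#page-content", metadataCssSelectors, true);
    }

    public static void main(String[] unused) {
        Args a = pageContentArgs();
        // With the library on the classpath, the same values would be passed to the
        // documented constructor:
        //   new HtmlTextExtractor(a.cssSelector, a.metadataCssSelectors, a.includeLinks);
        System.out.println("cssSelector = " + a.cssSelector);
        System.out.println("title selector = " + a.metadataCssSelectors.get("title"));
        System.out.println("includeLinks = " + a.includeLinks);
    }
}
```

Note that Map.of returns an immutable map, which suits a fixed set of metadata selectors; a HashMap would be the usual choice if the selectors are built up at runtime.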
  • Method Details