Class HtmlToTextDocumentTransformer

java.lang.Object
dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer
All Implemented Interfaces:
dev.langchain4j.data.document.DocumentTransformer

public class HtmlToTextDocumentTransformer extends Object implements dev.langchain4j.data.document.DocumentTransformer
Extracts plain text from a given HTML document. An optional CSS selector restricts extraction to the matching HTML element(s). Additional CSS selectors, keyed by metadata name, can be specified to extract metadata from specific HTML elements.
  • Constructor Summary

    Constructors
    Constructor
    Description
    HtmlToTextDocumentTransformer()
    Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
    HtmlToTextDocumentTransformer(String cssSelector)
    Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
    HtmlToTextDocumentTransformer(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
    Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, optionally extracting metadata and including links.
  • Method Summary

    Modifier and Type
    Method
    Description
    dev.langchain4j.data.document.Document
    transform(dev.langchain4j.data.document.Document document)
     

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface dev.langchain4j.data.document.DocumentTransformer

    transformAll
  • Constructor Details

    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer()
      Constructs an instance of HtmlToTextDocumentTransformer that extracts all text from a given Document containing HTML.
    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer(String cssSelector)
      Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector.
      Parameters:
      cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
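The single-selector constructor can be exercised with a short sketch like the one below. It assumes the langchain4j core and jsoup transformer modules are on the classpath; the class name SelectorExample, the helper extractContent, and the sample HTML are hypothetical and exist only for illustration.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

public class SelectorExample {

    // Hypothetical helper: returns the plain text found under #page-content.
    static String extractContent(String html) {
        DocumentTransformer transformer = new HtmlToTextDocumentTransformer("#page-content");
        Document htmlDocument = Document.from(html);
        return transformer.transform(htmlDocument).text();
    }

    public static void main(String[] args) {
        // Sample HTML (an assumption for this sketch).
        String html = "<html><body>"
                + "<div id=\"page-content\"><p>Hello world</p></div>"
                + "<div id=\"footer\">Footer text</div>"
                + "</body></html>";
        // Only the text inside #page-content is expected in the result.
        System.out.println(extractContent(html));
    }
}
```

Text outside the selected element, such as the footer in this sample, should not appear in the result.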
    • HtmlToTextDocumentTransformer

      public HtmlToTextDocumentTransformer(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
      Constructs an instance of HtmlToTextDocumentTransformer that extracts text from HTML elements matching the specified CSS selector, optionally extracting metadata and including links.
      Parameters:
      cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
      metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
      includeLinks - Specifies whether links should be included in the extracted text.
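A minimal sketch of the three-argument constructor follows, again assuming the langchain4j core and jsoup transformer modules are available. The class name MetadataExample, the helper extractWithMetadata, and the sample HTML are hypothetical. Metadata.getString is assumed to be available in the langchain4j version in use.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.Map;

public class MetadataExample {

    // Hypothetical helper: extracts body text plus a "title" metadata entry.
    static Document extractWithMetadata(String html) {
        HtmlToTextDocumentTransformer transformer = new HtmlToTextDocumentTransformer(
                "#page-content",                 // text is taken from this element
                Map.of("title", "#page-title"),  // metadata key -> CSS selector
                true);                           // include links in the extracted text
        return transformer.transform(Document.from(html));
    }

    public static void main(String[] args) {
        // Sample HTML (an assumption for this sketch).
        String html = "<html><body>"
                + "<h1 id=\"page-title\">Guide</h1>"
                + "<div id=\"page-content\">"
                + "<p>Read the <a href=\"https://example.com/docs\">docs</a></p>"
                + "</div>"
                + "</body></html>";
        Document result = extractWithMetadata(html);
        System.out.println(result.text());
        System.out.println(result.metadata().getString("title"));
    }
}
```

The exact way links are rendered in the extracted text is not specified on this page, so the sketch prints the result rather than assuming a format.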
  • Method Details

    • transform

      public dev.langchain4j.data.document.Document transform(dev.langchain4j.data.document.Document document)
      Specified by:
      transform in interface dev.langchain4j.data.document.DocumentTransformer
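Because the class implements DocumentTransformer, the inherited transformAll method can be applied to a batch of documents. The sketch below uses the no-argument constructor, which extracts all text, and assumes transformAll accepts and returns a List of Document as declared on the interface. The class name BatchExample and the helper toPlainText are hypothetical.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.transformer.jsoup.HtmlToTextDocumentTransformer;

import java.util.List;
import java.util.stream.Collectors;

public class BatchExample {

    // Hypothetical helper: converts several HTML strings to plain text in one pass.
    static List<String> toPlainText(List<String> htmlPages) {
        DocumentTransformer transformer = new HtmlToTextDocumentTransformer();
        List<Document> htmlDocuments = htmlPages.stream()
                .map(Document::from)
                .collect(Collectors.toList());
        // transformAll is inherited from DocumentTransformer.
        return transformer.transformAll(htmlDocuments).stream()
                .map(Document::text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample pages (an assumption for this sketch).
        List<String> pages = List.of(
                "<html><body><p>First page</p></body></html>",
                "<html><body><p>Second page</p></body></html>");
        toPlainText(pages).forEach(System.out::println);
    }
}
```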