dev.langchain4j.data.document.transformer.HtmlTextExtractor

All Implemented Interfaces: DocumentTransformer

public class HtmlTextExtractor extends Object implements DocumentTransformer

Extracts text from a given HTML document. A CSS selector can be specified to extract text only from desired element(s). Also, multiple CSS selectors can be specified to extract metadata from desired elements.

Constructor Summary

Constructors
- HtmlTextExtractor()
  
  Constructs an instance of HtmlTextExtractor that extracts all text from a given Document containing HTML.
- HtmlTextExtractor(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
  
  Constructs an instance of HtmlTextExtractor that extracts text from HTML elements matching the provided CSS selector.

Method Summary

- Document transform(Document document)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface dev.langchain4j.data.document.DocumentTransformer
transformAll

Constructor Details
- HtmlTextExtractor
  
  public HtmlTextExtractor()
  
  Constructs an instance of HtmlTextExtractor that extracts all text from a given Document containing HTML.
- HtmlTextExtractor
  
  public HtmlTextExtractor(String cssSelector, Map<String,String> metadataCssSelectors, boolean includeLinks)
  
  Constructs an instance of HtmlTextExtractor that extracts text from HTML elements matching the provided CSS selector.
  
  Parameters:
  
  cssSelector - A CSS selector. For example, "#page-content" will extract text from the HTML element with the id "page-content".
  
  metadataCssSelectors - A mapping from metadata keys to CSS selectors. For example, Map.of("title", "#page-title") will extract all text from the HTML element with the id "page-title" and store it in Metadata under the key "title".
  
  includeLinks - Specifies whether links should be included in the extracted text.
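
As a rough illustration of the parameters above, the following sketch builds an extractor that takes body text from "#page-content" and stores the text of "#page-title" in Metadata under the key "title". The sample HTML, the class name HtmlTextExtractorSketch, and the helper extract are made up for this example. It assumes a langchain4j version that ships HtmlTextExtractor in the package shown at the top of this page, with its HTML parsing dependency on the classpath.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.transformer.HtmlTextExtractor;

import java.util.Map;

public class HtmlTextExtractorSketch {

    // Hypothetical sample page, used only for this illustration.
    static final String HTML = "<html><body>"
            + "<h1 id=\"page-title\">Release notes</h1>"
            + "<div id=\"page-content\"><p>Version 2 adds search.</p></div>"
            + "<div id=\"footer\">Contact us</div>"
            + "</body></html>";

    // Hypothetical helper: restricts text extraction to #page-content and
    // copies the text of #page-title into metadata under the key "title".
    static Document extract(String html) {
        HtmlTextExtractor extractor = new HtmlTextExtractor(
                "#page-content",                 // cssSelector
                Map.of("title", "#page-title"),  // metadataCssSelectors
                false);                          // includeLinks
        return extractor.transform(Document.from(html));
    }

    public static void main(String[] args) {
        Document result = extract(HTML);
        System.out.println(result.text());      // text taken from #page-content
        System.out.println(result.metadata());  // metadata, including the "title" entry
    }
}
```

The no-argument constructor can be swapped in when the whole page should be converted to text and no metadata needs to be pulled out of the markup.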

Method Details
- transform
  
  public Document transform(Document document)
  
  Specified by:
  
  transform in interface DocumentTransformer
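
Since transformAll is inherited from DocumentTransformer, several HTML documents can be converted in one call. The sketch below is a minimal example under that assumption, using the no-argument constructor to extract all text. The class name TransformAllSketch, the helper extractAll, and the sample pages are hypothetical.

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentTransformer;
import dev.langchain4j.data.document.transformer.HtmlTextExtractor;

import java.util.List;
import java.util.stream.Collectors;

public class TransformAllSketch {

    // Hypothetical helper: wraps each HTML string in a Document and
    // extracts plain text from all of them with a single transformAll call.
    static List<Document> extractAll(List<String> htmlPages) {
        DocumentTransformer extractor = new HtmlTextExtractor();
        List<Document> htmlDocuments = htmlPages.stream()
                .map(Document::from)
                .collect(Collectors.toList());
        return extractor.transformAll(htmlDocuments);
    }

    public static void main(String[] args) {
        List<Document> texts = extractAll(List.of(
                "<html><body><p>First page</p></body></html>",
                "<html><body><p>Second page</p></body></html>"));
        // Print the extracted text of each page.
        texts.forEach(document -> System.out.println(document.text()));
    }
}
```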
