Interface Tokenizer


public interface Tokenizer
An interface for tokenizing text data, typically used in machine learning, artificial intelligence, and vector database integrations. Implementations provide a way to configure the tokenizer and then apply that configuration when tokenizing the data in an Exchange.
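To make the configure-then-tokenize contract concrete, here is a minimal sketch of what an implementation might look like. It is an illustration under stated assumptions, not the library's own code: the `Exchange` stand-in with a `body()` accessor, the `delimiterRegex()` configuration option, and the `RegexTokenizer` class are all hypothetical, and the `Tokenizer` interface is redeclared from the signatures on this page only so that the example compiles on its own.

```java
import java.util.Arrays;
import java.util.Objects;

// Stand-in for the real Exchange type, which this page does not describe.
// Assumption: the message body can be read as a String.
interface Exchange {
    String body();
}

// Redeclaration of the documented signatures so the sketch is self-contained.
interface Tokenizer {
    // The members of Configuration are not listed on this page;
    // delimiterRegex() is a hypothetical option invented for the example.
    interface Configuration {
        String delimiterRegex();
    }

    Configuration newConfiguration();

    void configure(Configuration configuration);

    String name();

    String[] tokenize(Exchange exchange);
}

// Hypothetical implementation that splits the body on a configurable regex.
class RegexTokenizer implements Tokenizer {
    // Start from the defaults so tokenize() works even if configure() is never called.
    private Configuration configuration = newConfiguration();

    @Override
    public Configuration newConfiguration() {
        // Default: split on runs of whitespace.
        return () -> "\\s+";
    }

    @Override
    public void configure(Configuration configuration) {
        this.configuration = Objects.requireNonNull(configuration, "configuration");
    }

    @Override
    public String name() {
        return "regex";
    }

    @Override
    public String[] tokenize(Exchange exchange) {
        String body = exchange.body();
        if (body == null || body.trim().isEmpty()) {
            return new String[0];
        }
        return body.trim().split(configuration.delimiterRegex());
    }
}

class RegexTokenizerDemo {
    public static void main(String[] args) {
        Tokenizer tokenizer = new RegexTokenizer();
        tokenizer.configure(() -> ",");            // override the default delimiter
        Exchange exchange = () -> "red,green,blue";
        // prints: regex: [red, green, blue]
        System.out.println(tokenizer.name() + ": " + Arrays.toString(tokenizer.tokenize(exchange)));
    }
}
```

Holding the configuration in a field that is initialised from `newConfiguration()` is one possible design choice; it keeps `tokenize` usable with defaults while still honouring a later call to `configure`.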
  • Nested Class Summary

    Nested Classes
    Modifier and Type | Interface | Description
    static interface | Tokenizer.Configuration | A nested interface representing the configuration options for this tokenizer.
  • Method Summary

    Modifier and Type | Method | Description
    void | configure(Tokenizer.Configuration configuration) | Configures this tokenizer using the provided configuration options.
    String | name() | Returns the name of this tokenizer, which can be used for identification or logging purposes.
    Tokenizer.Configuration | newConfiguration() | Creates a new configuration for this tokenizer, with default values.
    String[] | tokenize(Exchange exchange) | Tokenizes the data in the provided Exchange using the current configuration options.
  • Method Details

    • newConfiguration

      Tokenizer.Configuration newConfiguration()
      Creates a new configuration for this tokenizer, with default values.
      Returns:
      a new Configuration object
    • configure

      void configure(Tokenizer.Configuration configuration)
      Configures this tokenizer using the provided configuration options.
      Parameters:
      configuration - the configuration to use
    • name

      String name()
      Returns the name of this tokenizer, which can be used for identification or logging purposes.
      Returns:
      the name of this tokenizer
    • tokenize

      String[] tokenize(Exchange exchange)
      Tokenizes the data in the provided Exchange using the current configuration options.
      Parameters:
      exchange - the Exchange to tokenize
      Returns:
      an array of tokens produced by the tokenizer
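
Taken together, the four methods above suggest a caller-side sequence: obtain a default configuration, apply it, tokenize, and use the name for logging. The following is a rough sketch of that sequence, not a definitive usage pattern. Everything is nested inside a hypothetical `TokenizerLifecycleSketch` class so it compiles alone; the `Exchange` stand-in, the `maxTokens()` option, the `describe` helper, and the `WordTokenizer` toy implementation are assumptions made for illustration.

```java
import java.util.Arrays;

final class TokenizerLifecycleSketch {

    // Stand-in for the real Exchange type (assumption: the body is a String).
    interface Exchange {
        String body();
    }

    // Redeclaration of the documented signatures so the sketch compiles alone.
    interface Tokenizer {
        interface Configuration {
            // Hypothetical option; the real Configuration members are not listed on this page.
            int maxTokens();
        }

        Configuration newConfiguration();

        void configure(Configuration configuration);

        String name();

        String[] tokenize(Exchange exchange);
    }

    // Caller-side helper: fetch the defaults, apply them, tokenize,
    // and label the result with name() as one might in a log line.
    static String describe(Tokenizer tokenizer, Exchange exchange) {
        Tokenizer.Configuration defaults = tokenizer.newConfiguration();
        tokenizer.configure(defaults);
        String[] tokens = tokenizer.tokenize(exchange);
        return tokenizer.name() + " produced " + tokens.length + " token(s)";
    }

    // Toy implementation used only to exercise the helper above.
    static final class WordTokenizer implements Tokenizer {
        private int maxTokens = Integer.MAX_VALUE;

        @Override
        public Configuration newConfiguration() {
            return () -> 3; // default: keep at most three tokens
        }

        @Override
        public void configure(Configuration configuration) {
            this.maxTokens = configuration.maxTokens();
        }

        @Override
        public String name() {
            return "word";
        }

        @Override
        public String[] tokenize(Exchange exchange) {
            String[] all = exchange.body().trim().split("\\s+");
            // Truncate to the configured limit.
            return Arrays.copyOf(all, Math.min(all.length, maxTokens));
        }
    }

    public static void main(String[] args) {
        // prints: word produced 3 token(s)
        System.out.println(describe(new WordTokenizer(), () -> "one two three four five"));
    }
}
```

Because `tokenize` relies on the current configuration, the order of calls matters in this sketch: the same `WordTokenizer` would return all five tokens if `configure` were skipped.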