com.lucidworks.spark.analysis

Class LuceneTextAnalyzer

class LuceneTextAnalyzer extends Serializable

This class provides simple access to custom Lucene text processing pipelines, a.k.a. text analyzers, which are specified via a JSON schema containing named analyzer specifications and mappings from field names to analyzers.

Here's an example schema with descriptions inline as comments:

{
  "defaultLuceneMatchVersion": "6.0.0" // Optional.  Supplied to analysis components
                                        //     that don't explicitly specify "luceneMatchVersion".
  "analyzers": [              // Optional.  If not included, all field mappings must be
    {                         //     to fully qualified class names of Lucene Analyzer subclasses.
      "name": "html",         // Required.  Mappings in the "fields" array below refer to this name.
      "charFilters":[{        // Optional.
        "type": "htmlstrip"   // Required. "htmlstrip" is the SPI name for HTMLStripCharFilter
      }],
      "tokenizer": {          // Required.  Only one allowed.
        "type": "standard"    // Required. "standard" is the SPI name for StandardTokenizer
      },
      "filters": [{           // Optional.
          "type": "stop",     // Required.  "stop" is the SPI name for StopFilter
          "ignoreCase": "true",  // Component-specific params
          "format": "snowball",
          "words": "org/apache/lucene/analysis/snowball/english_stop.txt"
        }, {
          "type": "lowercase" // Required. "lowercase" is the SPI name for LowerCaseFilter
      }]
    },
    { "name": "stdtok", "tokenizer": { "type": "standard" } }
  ],
  "fields": [{                // Required.  To lookup an analyzer for a field, first the "name"
                              //     mappings are consulted, and then the "regex" mappings are
                              //     tested, in the order specified.
      "name": "keywords",     // Either "name" or "regex" is required.  "name" matches the field name exactly.
      "analyzer": "org.apache.lucene.analysis.core.KeywordAnalyzer" // FQCN of an Analyzer subclass
    }, {
      "regex": ".*html.*"     // Either "name" or "regex" is required.  "regex" must match the whole field name.
      "analyzer": "html"      // Reference to the named analyzer specified in the "analyzers" section.
    }, {
      "regex": ".+",          // Either "name" or "regex" is required.  "regex" must match the whole field name.
      "analyzer": "stdtok"    // Reference to the named analyzer specified in the "analyzers" section.
  }]
}
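
As a sketch of how this schema might be used (the field names "raw_html" and "title" below are illustrative, and the spark-solr artifact is assumed to be on the classpath):

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

// The example schema above, without the inline comments, as a Scala string.
val schemaJson = """
{
  "analyzers": [
    { "name": "html",
      "charFilters": [{ "type": "htmlstrip" }],
      "tokenizer": { "type": "standard" },
      "filters": [
        { "type": "stop", "ignoreCase": "true", "format": "snowball",
          "words": "org/apache/lucene/analysis/snowball/english_stop.txt" },
        { "type": "lowercase" }] },
    { "name": "stdtok", "tokenizer": { "type": "standard" } }
  ],
  "fields": [
    { "name": "keywords", "analyzer": "org.apache.lucene.analysis.core.KeywordAnalyzer" },
    { "regex": ".*html.*", "analyzer": "html" },
    { "regex": ".+", "analyzer": "stdtok" }
  ]
}
"""
val analyzer = new LuceneTextAnalyzer(schemaJson)

// "keywords" maps by name to KeywordAnalyzer: the whole value is one token.
analyzer.analyze("keywords", "Hello World")                  // Seq("Hello World")

// "raw_html" matches ".*html.*": strip tags, tokenize, drop stopwords, lowercase.
analyzer.analyze("raw_html", "<p>The <b>quick</b> fox</p>")  // Seq("quick", "fox")

// Any other field matches ".+": StandardTokenizer only, case preserved.
analyzer.analyze("title", "Hello, World!")                   // Seq("Hello", "World")
```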
Linear Supertypes
scala.Serializable, java.io.Serializable, AnyRef, Any

Instance Constructors

  1. new LuceneTextAnalyzer(analysisSchema: String)

    Creates an analyzer wrapper from the given JSON analysis schema.

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def analyze(fieldValues: Map[String, String]): Map[String, Seq[String]]

    For each of the field->value pairs in fieldValues, looks up the analyzer mapped to the field in the configured analysis schema and uses it to analyze the value. Returns a map from the fields to the produced token sequences.
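
A sketch, where schemaJson is assumed to hold the example schema from the class description:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the example schema, as a String

// Each field's value is analyzed with the analyzer its name resolves to.
val tokens: Map[String, Seq[String]] = analyzer.analyze(Map(
  "keywords" -> "Big Data",          // KeywordAnalyzer: whole value, one token
  "title"    -> "Spark and Lucene")) // ".+" -> stdtok: StandardTokenizer
// tokens("keywords") == Seq("Big Data")
// tokens("title")    == Seq("Spark", "and", "Lucene")
```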

  5. def analyze(field: String, reader: Reader): Seq[String]


    Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given reader, returning the produced token sequence.

  6. def analyze(field: String, str: String): Seq[String]


    Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given string, returning the produced token sequence.

  7. def analyze(field: String, o: Any): Seq[String]

  8. def analyzeJava(fieldValues: Map[String, String]): Map[String, List[String]]

    Java-friendly version: for each of the field->value pairs in fieldValues, looks up the analyzer mapped to the field in the configured analysis schema and uses it to analyze the value. Returns a map from the fields to the produced token lists.

  9. def analyzeJava(field: String, reader: Reader): List[String]


    Java-friendly version: looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given reader, returning the produced token sequence.

  10. def analyzeJava(field: String, str: String): List[String]


    Java-friendly version: looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given string, returning the produced token sequence.

  11. def analyzeJava(field: String, o: Any): List[String]

  12. def analyzeMV(fieldValues: Map[String, Seq[String]]): Map[String, Seq[String]]

    For each of the field->multi-value pairs in fieldValues, looks up the analyzer mapped to the field in the configured analysis schema and uses it to analyze each of the values. Returns a map from the fields to the flattened concatenation of the produced token sequences.

  13. def analyzeMV(field: String, values: Seq[String]): Seq[String]

    Looks up the analyzer mapped to the given field in the configured analysis schema, uses it to analyze each of the given values, and returns the flattened concatenation of the produced token sequences.
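
A sketch, again assuming schemaJson holds the example schema from the class description:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the example schema, as a String

// Each value is analyzed separately; the per-value token sequences are
// concatenated into one flat sequence ("title" maps to the stdtok analyzer).
analyzer.analyzeMV("title", Seq("First Post", "Second Post"))
// Seq("First", "Post", "Second", "Post")
```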

  14. def analyzeMVJava(fieldValues: Map[String, List[String]]): Map[String, List[String]]

    Java-friendly version: for each of the field->multi-value pairs in fieldValues, looks up the analyzer mapped to the field in the configured analysis schema and uses it to analyze each of the values. Returns a map from the fields to the flattened concatenation of the produced token lists.

  15. def analyzeMVJava(field: String, values: List[String]): List[String]

    Java-friendly version: looks up the analyzer mapped to the given field in the configured analysis schema, uses it to analyze each of the given values, and returns the flattened concatenation of the produced token lists.

  16. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  17. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  20. def finalize(): Unit

    Definition Classes
    AnyRef
    Attributes
    protected[java.lang]
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  22. def getFieldAnalyzer(field: String): Option[Analyzer]


    Returns the analyzer mapped to the given field in the configured analysis schema, if any.
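
For instance, assuming analyzer wraps the example schema from the class description (where the ".+" regex maps every field name, so the lookup is defined; a schema without such a catch-all mapping could yield None):

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
import org.apache.lucene.analysis.Analyzer

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the example schema, as a String

val fieldAnalyzer: Option[Analyzer] = analyzer.getFieldAnalyzer("title")
fieldAnalyzer match {
  case Some(a) => println(s"mapped to ${a.getClass.getName}")
  case None    => println("no analyzer mapped to this field")
}
```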

  23. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  24. def invalidMessages: String

    Messages describing any problems found while validating the analysis schema; empty when the schema is valid.
  25. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  26. def isValid: Boolean

    Whether the configured analysis schema parsed and validated successfully.
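
A sketch of a defensive construction pattern; this assumes invalidMessages carries the validation errors behind isValid:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the schema under test

if (!analyzer.isValid) {
  // e.g. an unknown tokenizer/filter SPI name or malformed JSON
  sys.error(s"Bad analysis schema: ${analyzer.invalidMessages}")
}
```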
  27. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  28. final def notify(): Unit

    Definition Classes
    AnyRef
  29. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  30. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  31. def toPreAnalyzedJson(field: String, reader: Reader, stored: Boolean): String

    Looks up the analyzer mapped to the given field in the configured analysis schema, uses it to analyze the given reader, and returns a PreAnalyzedField-compatible JSON string with the following serialized attributes:

    • CharTermAttribute (token text)
    • OffsetAttribute (start and end character offsets)
    • PositionIncrementAttribute (token position relative to the previous token)

    If stored = true, the original reader input value, read into a string, will be included as a value to be stored. (Note that the Solr schema for the destination field must be configured to store the value; if it is not, the stored value included in the JSON will be ignored by Solr.)

  32. def toPreAnalyzedJson(field: String, str: String, stored: Boolean): String

    Looks up the analyzer mapped to the given field in the configured analysis schema, uses it to analyze the given string, and returns a PreAnalyzedField-compatible JSON string with the following serialized attributes:

    • CharTermAttribute (token text)
    • OffsetAttribute (start and end character offsets)
    • PositionIncrementAttribute (token position relative to the previous token)

    If stored = true, the original string input value will be included as a value to be stored. (Note that the Solr schema for the destination field must be configured to store the value; if it is not, the stored value included in the JSON will be ignored by Solr.)
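
A sketch using the example schema from the class description; the JSON layout in the comment is illustrative of Solr's PreAnalyzedField JSON format, not an exact transcript:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the example schema, as a String

// "title" resolves to the stdtok analyzer; stored = true carries the
// original text along so Solr can store it (if the field is stored="true").
val json: String = analyzer.toPreAnalyzedJson("title", "Hello World", stored = true)
// json is roughly: {"v":"1","str":"Hello World","tokens":[
//   {"t":"Hello","s":0,"e":5,"i":1},{"t":"World","s":6,"e":11,"i":1}]}
```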

  33. def toString(): String

    Definition Classes
    AnyRef → Any
  34. def tokenStream(fieldName: String, reader: Reader): TokenStream


    Looks up the analyzer mapped to fieldName and returns a org.apache.lucene.analysis.TokenStream for the analyzer to tokenize the contents of reader.

  35. def tokenStream(fieldName: String, text: String): TokenStream


    Looks up the analyzer mapped to fieldName and returns a org.apache.lucene.analysis.TokenStream for the analyzer to tokenize the contents of text.
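
A sketch of consuming the returned stream via the standard Lucene TokenStream protocol (reset, incrementToken, end, close), again assuming schemaJson holds the example schema from the class description:

```scala
import com.lucidworks.spark.analysis.LuceneTextAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute

val analyzer = new LuceneTextAnalyzer(schemaJson) // schemaJson: the example schema, as a String

// "title" resolves to the stdtok analyzer (StandardTokenizer only).
val stream = analyzer.tokenStream("title", "Hello, World!")
val termAtt = stream.addAttribute(classOf[CharTermAttribute])
try {
  stream.reset()
  while (stream.incrementToken()) {
    println(termAtt.toString)  // prints "Hello", then "World"
  }
  stream.end()
} finally {
  stream.close()
}
```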

  36. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from scala.Serializable

Inherited from java.io.Serializable

Inherited from AnyRef

Inherited from Any
