com.lucidworks.spark.analysis

LuceneTextAnalyzer

class LuceneTextAnalyzer extends Serializable

This class provides simple access to custom Lucene text processing pipelines, a.k.a. text analyzers, which are specified via a JSON schema that contains named analyzer specifications and mappings from field name(s) to analyzer(s).

Here's an example schema with descriptions inline as comments:

{
"defaultLuceneMatchVersion": "5.0.0" // Optional.  Supplied to analysis components
                                      //     that don't explicitly specify "luceneMatchVersion".
"analyzers": [              // Optional.  If not included, all field mappings must be
  {                         //     to fully qualified class names of Lucene Analyzer subclasses.
    "name": "html",         // Required.  Mappings in the "fields" array below refer to this name.
    "charFilters":[{        // Optional.
      "type": "htmlstrip"   // Required. "htmlstrip" is the SPI name for HTMLStripCharFilter
    }],
    "tokenizer": {          // Required.  Only one allowed.
      "type": "standard"    // Required. "standard" is the SPI name for StandardTokenizer
    },
    "filters": [{           // Optional.
        "type": "stop",     // Required.  "stop" is the SPI name for StopFilter
        "ignoreCase": "true",  // Component-specific params
        "format": "snowball",
        "words": "org/apache/lucene/analysis/snowball/english_stop.txt"
      }, {
        "type": "lowercase" // Required. "lowercase" is the SPI name for LowerCaseFilter
    }]
  },
  { "name": "stdtok", "tokenizer": { "type": "standard" } }
],
"fields": [{                // Required.  To lookup an analyzer for a field, first the "name"
                            //     mappings are consulted, and then the "regex" mappings are
                            //     tested, in the order specified.
    "name": "keywords",     // Either "name" or "regex" is required.  "name" matches the field name exactly.
    "analyzer": "org.apache.lucene.analysis.core.KeywordAnalyzer" // FQCN of an Analyzer subclass
  }, {
    "regex": ".*html.*"     // Either "name" or "regex" is required.  "regex" must match the whole field name.
    "analyzer": "html"      // Reference to the named analyzer specified in the "analyzers" section.
  }, {
    "regex": ".+",          // Either "name" or "regex" is required.  "regex" must match the whole field name.
    "analyzer": "stdtok"    // Reference to the named analyzer specified in the "analyzers" section.
  }]
}
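The lookup order described in the "fields" section above (exact "name" matches are consulted first, then "regex" patterns in the order listed, each required to match the whole field name) can be sketched in plain Scala. The mappings below mirror the example schema; this is an illustration of the documented precedence, not the library's implementation:

```scala
// Illustration of the documented field->analyzer lookup order:
// exact "name" mappings are consulted first, then "regex" mappings
// in the order specified; a regex must match the entire field name.
object FieldLookupSketch {
  // Mappings taken from the example schema above.
  val nameMappings: Map[String, String] =
    Map("keywords" -> "org.apache.lucene.analysis.core.KeywordAnalyzer")
  val regexMappings: Seq[(String, String)] =
    Seq(".*html.*" -> "html", ".+" -> "stdtok")

  def analyzerFor(field: String): Option[String] =
    nameMappings.get(field).orElse {
      regexMappings.collectFirst {
        // String.matches anchors the pattern to the whole string.
        case (regex, analyzer) if field.matches(regex) => analyzer
      }
    }
}
```

Here "keywords" resolves via the exact-name mapping, a field like "rawhtml" falls through to the ".*html.*" regex, and any other field is caught by ".+".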
Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new LuceneTextAnalyzer(analysisSchema: String)

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. def analyze(fieldValues: Map[String, String]): Map[String, Seq[String]]

For each of the field->value pairs in fieldValues, looks up the analyzer mapped to the field from the configured analysis schema, and uses it to perform analysis on the value. Returns a map from the fields to the produced token sequences.
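The per-field behavior can be sketched with stand-in tokenizers (plain functions below, in place of real Lucene analyzers; the field names follow the example schema above):

```scala
// Sketch of the map form of analyze: each field's value is run through
// the analyzer mapped to that field, producing a map from field names
// to token sequences. Plain functions stand in for Lucene analyzers.
object AnalyzeMapSketch {
  // KeywordAnalyzer-like: the whole value becomes a single token.
  private def keyword(s: String): Seq[String] = Seq(s)
  // StandardTokenizer + LowerCaseFilter-like: split on non-word chars.
  private def standard(s: String): Seq[String] =
    s.toLowerCase.split("\\W+").toSeq.filter(_.nonEmpty)

  private def tokenizerFor(field: String): String => Seq[String] =
    if (field == "keywords") keyword else standard

  def analyze(fieldValues: Map[String, String]): Map[String, Seq[String]] =
    fieldValues.map { case (field, value) => field -> tokenizerFor(field)(value) }
}
```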

  7. def analyze(field: String, reader: Reader): Seq[String]

    Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given reader, returning the produced token sequence.

  8. def analyze(field: String, str: String): Seq[String]

    Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given string, returning the produced token sequence.

  9. def analyzeJava(fieldValues: Map[String, String]): Map[String, List[String]]

Java-friendly version: for each of the field->value pairs in fieldValues, looks up the analyzer mapped to the field from the configured analysis schema, and uses it to perform analysis on the value. Returns a map from the fields to the produced token sequences.

  10. def analyzeJava(field: String, reader: Reader): List[String]

    Java-friendly version: looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given reader, returning the produced token sequence.

  11. def analyzeJava(field: String, str: String): List[String]

    Java-friendly version: looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given string, returning the produced token sequence.

  12. def analyzeMV(fieldValues: Map[String, Seq[String]]): Map[String, Seq[String]]

For each of the field->multi-value pairs in fieldValues, looks up the analyzer mapped to the field from the configured analysis schema, and uses it to perform analysis on each of the values. Returns a map from the fields to the flattened concatenation of the produced token sequences.

  13. def analyzeMV(field: String, values: Seq[String]): Seq[String]

Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on each of the given values, and returns the flattened concatenation of the produced token sequences.
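"Flattened concatenation" means each value is analyzed independently and the resulting token sequences are concatenated into one flat sequence. A minimal sketch, with a stand-in whitespace/lowercase tokenizer in place of a real Lucene analyzer:

```scala
// Sketch of analyzeMV's flattening semantics: one flat Seq of tokens,
// not a Seq of per-value token sequences.
object FlattenSketch {
  // Stand-in tokenizer (not a real Lucene analyzer).
  def tokenize(s: String): Seq[String] =
    s.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)

  def analyzeMV(values: Seq[String]): Seq[String] =
    values.flatMap(tokenize) // flatMap concatenates the per-value results
}
```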

  14. def analyzeMVJava(fieldValues: Map[String, List[String]]): Map[String, List[String]]

Java-friendly version: for each of the field->multi-value pairs in fieldValues, looks up the analyzer mapped to the field from the configured analysis schema, and uses it to perform analysis on each of the values. Returns a map from the fields to the flattened concatenation of the produced token sequences.

  15. def analyzeMVJava(field: String, values: List[String]): List[String]

Java-friendly version: looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on each of the given values, and returns the flattened concatenation of the produced token sequences.

  16. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  17. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  18. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  20. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  22. def getFieldAnalyzer(field: String): Option[Analyzer]

    Returns the analyzer mapped to the given field in the configured analysis schema, if any.

  23. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  24. def invalidMessages: String

  25. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  26. def isValid: Boolean

  27. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  28. final def notify(): Unit

    Definition Classes
    AnyRef
  29. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  30. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  31. def toPreAnalyzedJson(field: String, reader: Reader, stored: Boolean): String

Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given reader, and returns a PreAnalyzedField-compatible JSON string with the following serialized attributes:

    • CharTermAttribute (token text)
    • OffsetAttribute (start and end character offsets)
    • PositionIncrementAttribute (token position relative to the previous token)

    If stored = true, the original reader input value, read into a string, will be included as a value to be stored. (Note that the Solr schema for the destination Solr field must be configured to store the value; if it is not, then the stored value included in the JSON will be ignored by Solr.)

  32. def toPreAnalyzedJson(field: String, str: String, stored: Boolean): String

Looks up the analyzer mapped to the given field from the configured analysis schema, uses it to perform analysis on the given string, and returns a PreAnalyzedField-compatible JSON string with the following serialized attributes:

    • CharTermAttribute (token text)
    • OffsetAttribute (start and end character offsets)
    • PositionIncrementAttribute (token position relative to the previous token)

    If stored = true, the original string input value will be included as a value to be stored. (Note that the Solr schema for the destination Solr field must be configured to store the value; if it is not, then the stored value included in the JSON will be ignored by Solr.)
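For orientation, the emitted JSON follows Solr's PreAnalyzedField JSON format (the JsonPreAnalyzedParser keys: "t" for token text, "s"/"e" for start/end character offsets, "i" for position increment, and "str" for the stored value, which appears when stored = true). The values below are illustrative, not actual output:

```json
{
  "v": "1",
  "str": "Hello, World!",
  "tokens": [
    { "t": "hello", "s": 0, "e": 5,  "i": 1 },
    { "t": "world", "s": 7, "e": 12, "i": 1 }
  ]
}
```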

  33. def toString(): String

    Definition Classes
    AnyRef → Any
  34. def tokenStream(fieldName: String, reader: Reader): TokenStream

Looks up the analyzer mapped to fieldName and returns an org.apache.lucene.analysis.TokenStream for the analyzer to tokenize the contents of reader.

  35. def tokenStream(fieldName: String, text: String): TokenStream

Looks up the analyzer mapped to fieldName and returns an org.apache.lucene.analysis.TokenStream for the analyzer to tokenize the contents of text.

  36. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  37. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
