package com.lucidworks.spark.analysis

Type Members

  1. class LuceneTextAnalyzer extends Serializable

    This class allows simple access to custom Lucene text processing pipelines, a.k.a. text analyzers, which are specified via a JSON schema that hosts named analyzer specifications and mappings from field name(s) to analyzer(s).

    Here's an example schema with descriptions inline as comments:

    {
      "defaultLuceneMatchVersion": "7.0.0", // Optional.  Supplied to analysis components
                                            //     that don't explicitly specify "luceneMatchVersion".
      "analyzers": [              // Optional.  If not included, all field mappings must be
        {                         //     to fully qualified class names of Lucene Analyzer subclasses.
          "name": "html",         // Required.  Mappings in the "fields" array below refer to this name.
          "charFilters":[{        // Optional.
            "type": "htmlstrip"   // Required. "htmlstrip" is the SPI name for HTMLStripCharFilter
          }],
          "tokenizer": {          // Required.  Only one allowed.
            "type": "standard"    // Required. "standard" is the SPI name for StandardTokenizer
          },
          "filters": [{           // Optional.
              "type": "stop",     // Required.  "stop" is the SPI name for StopFilter
              "ignoreCase": "true",  // Component-specific params
              "format": "snowball",
              "words": "org/apache/lucene/analysis/snowball/english_stop.txt"
            }, {
              "type": "lowercase" // Required. "lowercase" is the SPI name for LowerCaseFilter
          }]
        },
        { "name": "stdtok", "tokenizer": { "type": "standard" } }
      ],
      "fields": [{                // Required.  To look up an analyzer for a field, first the "name"
                                  //     mappings are consulted, and then the "regex" mappings are
                                  //     tested, in the order specified.
          "name": "keywords",     // Either "name" or "regex" is required.  "name" matches the field name exactly.
          "analyzer": "org.apache.lucene.analysis.core.KeywordAnalyzer" // FQCN of an Analyzer subclass
        }, {
          "regex": ".*html.*",    // Either "name" or "regex" is required.  "regex" must match the whole field name.
          "analyzer": "html"      // Reference to the named analyzer specified in the "analyzers" section.
        }, {
          "regex": ".+",          // Either "name" or "regex" is required.  "regex" must match the whole field name.
          "analyzer": "stdtok"    // Reference to the named analyzer specified in the "analyzers" section.
      }]
    }
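
    The field lookup order described in the schema comments can be sketched in a few lines of Python. This is an illustration only, not the class's actual implementation: resolve_analyzer and FIELDS are hypothetical names, and returning None for a field that matches no mapping is an assumption, since the schema comments do not say what happens in that case.

```python
import re
from typing import Optional

# Field mappings copied from the example schema above, with comments removed.
FIELDS = [
    {"name": "keywords",
     "analyzer": "org.apache.lucene.analysis.core.KeywordAnalyzer"},
    {"regex": ".*html.*", "analyzer": "html"},
    {"regex": ".+", "analyzer": "stdtok"},
]


def resolve_analyzer(field_name: str, fields=FIELDS) -> Optional[str]:
    """Hypothetical helper: return the analyzer reference for a field name."""
    # Pass 1: exact "name" mappings are consulted first,
    # regardless of where they appear in the list.
    for mapping in fields:
        if mapping.get("name") == field_name:
            return mapping["analyzer"]
    # Pass 2: "regex" mappings are tested in the order specified.
    # fullmatch is used because the regex must match the whole field name.
    for mapping in fields:
        regex = mapping.get("regex")
        if regex is not None and re.fullmatch(regex, field_name):
            return mapping["analyzer"]
    # Assumption: no mapping matched, so there is nothing to return.
    return None


for name in ("keywords", "body_html", "title"):
    print(name, "->", resolve_analyzer(name))
```

    With the example mappings, "keywords" resolves to the KeywordAnalyzer class name, "body_html" to the "html" analyzer, and "title" falls through to "stdtok".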
