Tokenizes and flattens extracted NER chunks.
Instantiated model of the ChunkTokenizer. For usage and examples see the documentation of the main class.
This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document.
Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> to be easily distinguishable in the text itself.
This example sentence will result in the form:
"Peter Pipers employees are picking pecks of pickled peppers." "<.>"
To then extract these tags, regexParsers need to be set with e.g.:
val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))
When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more nouns in succession. Additional patterns can also be set with addRegexParsers.
For more extended examples see the Spark NLP Workshop and the ChunkerTestSpec.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val POSTag = PerceptronModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("pos")

val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    POSTag,
    chunker
  ))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(false)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+
PerceptronModel for Part-Of-Speech tagging
Matches standard date formats into a provided format. Reads from different forms of date and time expressions and converts them to a provided date format.
Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.
Reads the following kinds of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008", "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday", "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month", "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.", "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Spark NLP Workshop and the DateMatcherTestSpec.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.DateMatcher import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val date = new DateMatcher() .setInputCols("document") .setOutputCol("date") .setAnchorDateYear(2020) .setAnchorDateMonth(1) .setAnchorDateDay(11) val pipeline = new Pipeline().setStages(Array( documentAssembler, date )) val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("date").show(false) +-------------------------------------------------+ |date | +-------------------------------------------------+ |[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] | |[[date, 0, 8, 2020/01/18, [sentence -> 0], []]] | |[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]| +-------------------------------------------------+
MultiDateMatcher for matching multiple dates in a document
Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can apply unwanted character removal with a specific policy, as well as lower case normalization.
For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.DocumentNormalizer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val cleanUpPatterns = Array("<[^>]*>") val documentNormalizer = new DocumentNormalizer() .setInputCols("document") .setOutputCol("normalizedDocument") .setAction("clean") .setPatterns(cleanUpPatterns) .setReplacement(" ") .setPolicy("pretty_all") .setLowercase(true) val pipeline = new Pipeline().setStages(Array( documentAssembler, documentNormalizer )) val text = """ THE WORLD'S LARGEST WEB DEVELOPER SITE = THE WORLD'S LARGEST WEB DEVELOPER SITE = Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.. """ val data = Seq(text).toDF("text") val pipelineModel = pipeline.fit(data) val result = pipelineModel.transform(data) result.selectExpr("normalizedDocument.result").show(truncate=false) +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. 
it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]| +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Extracts a dependency graph between entities.
The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree which describes how the entities relate to each other. For that a triple store format is used. Nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words.
Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them: either both Annotators are included in the pipeline explicitly and the dependencies are taken from them, or setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
val graph_extraction = new GraphExtraction()
  .setInputCols("document", "token", "ner")
  .setOutputCol("graph")
  .setRelationshipTypes(Array("prefer-LOC"))
  .setMergeEntities(true)
//.setDependencyParserModel(Array("dependency_conllu", "en", "public/models"))
//.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en", "public/models"))
To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector import com.johnsnowlabs.nlp.annotators.Tokenizer import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel import org.apache.spark.ml.Pipeline import com.johnsnowlabs.nlp.annotators.GraphExtraction val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols("sentence") .setOutputCol("token") val embeddings = WordEmbeddingsModel.pretrained() .setInputCols("sentence", "token") .setOutputCol("embeddings") val nerTagger = NerDLModel.pretrained() .setInputCols("sentence", "token", "embeddings") .setOutputCol("ner") val posTagger = PerceptronModel.pretrained() .setInputCols("sentence", "token") .setOutputCol("pos") val dependencyParser = DependencyParserModel.pretrained() .setInputCols("sentence", "pos", "token") .setOutputCol("dependency") val typedDependencyParser = TypedDependencyParserModel.pretrained() .setInputCols("dependency", "pos", "token") .setOutputCol("dependency_type") val graph_extraction = new GraphExtraction() .setInputCols("document", "token", "ner") .setOutputCol("graph") .setRelationshipTypes(Array("prefer-LOC")) val pipeline = new Pipeline().setStages(Array( documentAssembler, sentence, tokenizer, embeddings, nerTagger, posTagger, dependencyParser, typedDependencyParser, graph_extraction )) val data = Seq("You and John prefer the morning flight through Denver").toDF("text") val result = pipeline.fit(data).transform(data) result.select("graph").show(false) +-----------------------------------------------------------------------------------------------------------------+ |graph | +-----------------------------------------------------------------------------------------------------------------+ |[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]| +-----------------------------------------------------------------------------------------------------------------+
GraphFinisher to output the paths in a more generic format, like RDF
Class to find lemmas out of words with the objective of returning a base dictionary word.
Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource.
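For instance, the same dictionary could be passed as an ExternalResource. A minimal sketch, assuming the option keys keyDelimiter and valueDelimiter (the path matches the example further below):
import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}

// Assumed to be equivalent to setDictionary(path, "->", "\t")
val lemmatizerWithResource = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary(ExternalResource(
    "src/test/resources/lemma-corpus-small/lemmas_small.txt",
    ReadAs.TEXT,
    Map("keyDelimiter" -> "->", "valueDelimiter" -> "\t")
  ))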
Pretrained models can be loaded with LemmatizerModel.pretrained.
For available pretrained models please see the Models Hub. For extended examples of usage, see the Spark NLP Workshop and the LemmatizerTestSpec.
In this example, the lemma dictionary lemmas_small.txt has the form of
...
pick -> pick picks picking picked
peck -> peck pecking pecked pecks
pickle -> pickle pickles pickled pickling
pepper -> pepper peppers peppered peppering
...
where each key is delimited by -> and values are delimited by \t.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.Tokenizer import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotators.Lemmatizer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val lemmatizer = new Lemmatizer() .setInputCols(Array("token")) .setOutputCol("lemma") .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t") val pipeline = new Pipeline() .setStages(Array( documentAssembler, sentenceDetector, tokenizer, lemmatizer )) val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.") .toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("lemma.result").show(false) +------------------------------------------------------------------+ |result | +------------------------------------------------------------------+ |[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]| +------------------------------------------------------------------+
LemmatizerModel for the instantiated model and pretrained models.
Instantiated Model of the Lemmatizer. For usage and examples, please see the documentation of that class. For available pretrained models please see the Models Hub.
The lemmatizer from the example of the Lemmatizer can be replaced with:
val lemmatizer = LemmatizerModel.pretrained()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
This will load the default pretrained model which is "lemma_antbnc".
Matches standard date formats into a provided format.
Reads the following kinds of dates:
"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008", "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday", "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month", "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.", "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"
For example "The 31st of April in the year 2008"
will be converted into 2008/04/31
.
For extended examples of usage, see the Spark NLP Workshop and the MultiDateMatcherTestSpec.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.MultiDateMatcher import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val date = new MultiDateMatcher() .setInputCols("document") .setOutputCol("date") .setAnchorDateYear(2020) .setAnchorDateMonth(1) .setAnchorDateDay(11) val pipeline = new Pipeline().setStages(Array( documentAssembler, date )) val data = Seq("I saw him yesterday and he told me that he will visit us next week") .toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("explode(date) as dates").show(false) +-----------------------------------------------+ |dates | +-----------------------------------------------+ |[date, 57, 65, 2020/01/18, [sentence -> 0], []]| |[date, 10, 18, 2020/01/10, [sentence -> 0], []]| +-----------------------------------------------+
A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.
When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
For more extended examples see the Spark NLP Workshop and the NGramGeneratorTestSpec.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotators.Tokenizer import com.johnsnowlabs.nlp.annotators.NGramGenerator import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentence = new SentenceDetector() .setInputCols("document") .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val nGrams = new NGramGenerator() .setInputCols("token") .setOutputCol("ngrams") .setN(2) val pipeline = new Pipeline().setStages(Array( documentAssembler, sentence, tokenizer, nGrams )) val data = Seq("This is my sentence.").toDF("text") val results = pipeline.fit(data).transform(data) results.selectExpr("explode(ngrams) as result").show(false) +------------------------------------------------------------+ |result | +------------------------------------------------------------+ |[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []] | |[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []] | |[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]| |[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]| +------------------------------------------------------------+
Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.
For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer} import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val normalizer = new Normalizer() .setInputCols("token") .setOutputCol("normalized") .setLowercase(true) .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuations (keep alphanumeric chars) // if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z]) val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, normalizer )) val data = Seq("John and Peter are brothers. However they don't support each other that much.") .toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("normalized.result").show(truncate = false) +----------------------------------------------------------------------------------------+ |result | +----------------------------------------------------------------------------------------+ |[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]| +----------------------------------------------------------------------------------------+
Instantiated Model of the Normalizer. For usage and examples, please see the documentation of that class.
Normalizer for the base class
Tokenizes raw text recursively based on a handful of definable rules.
Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only:
- prefixes: Strings that will be split when found at the beginning of a token.
- suffixes: Strings that will be split when found at the end of a token.
- infixes: Strings that will be split when found in the middle of a token.
- whitelist: Whitelist of strings not to split.
For extended examples of usage, see the Spark NLP Workshop and the TokenizerTestSpec.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new RecursiveTokenizer() .setInputCols("document") .setOutputCol("token") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer )) val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text") val result = pipeline.fit(data).transform(data) result.select("token.result").show(false) +------------------------------------------------------------------+ |result | +------------------------------------------------------------------+ |[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]| +------------------------------------------------------------------+
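These parameters can be customized if the defaults do not fit. A minimal sketch, assuming the corresponding setters are named setPrefixes, setSuffixes, setInfixes and setWhitelist (the value arrays are illustrative):
// Hypothetical customization of the RecursiveTokenizer rule parameters
val customRecursiveTokenizer = new RecursiveTokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setPrefixes(Array("\"", "(", "["))           // split these off the front of a token
  .setSuffixes(Array(".", ",", ")", "]", "\"")) // split these off the end of a token
  .setInfixes(Array("\n", "(", ")"))            // split on these inside a token
  .setWhitelist(Array("it's", "don't"))         // never split these strings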
Instantiated model of the RecursiveTokenizer. For usage and examples see the documentation of the main class.
Uses a reference file to match a set of regular expressions and associate them with a provided identifier.
A dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource.
Pretrained pipelines are available for this module, see Pipelines.
For extended examples of usage, see the Spark NLP Workshop and the RegexMatcherTestSpec.
In this example, the rules.txt has the form of
the\s\w+, followed by 'the'
ceremonies, ceremony
where each regex is separated from the identifier by ",".
import ResourceHelper.spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotators.RegexMatcher import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document") val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence") val regexMatcher = new RegexMatcher() .setExternalRules("src/test/resources/regex-matcher/rules.txt", ",") .setInputCols(Array("sentence")) .setOutputCol("regex") .setStrategy("MATCH_ALL") val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher)) val data = Seq( "My first sentence with the first rule. This is my second sentence with ceremonies rule." ).toDF("text") val results = pipeline.fit(data).transform(data) results.selectExpr("explode(regex) as result").show(false) +--------------------------------------------------------------------------------------------+ |result | +--------------------------------------------------------------------------------------------+ |[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]| |[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []] | +--------------------------------------------------------------------------------------------+
Instantiated model of the RegexMatcher. For usage and examples see the documentation of the main class.
A tokenizer that splits text by a regex pattern.
The pattern needs to be set with setPattern, which defines the delimiting pattern, i.e. how the tokens should be split. By default this pattern is \s+, which means that tokens are split by 1 or more whitespace characters.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.RegexTokenizer import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val regexTokenizer = new RegexTokenizer() .setInputCols("document") .setOutputCol("regexToken") .setToLowercase(true) .setPattern("\\s+") val pipeline = new Pipeline().setStages(Array( documentAssembler, regexTokenizer )) val data = Seq("This is my first sentence.\nThis is my second.").toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("regexToken.result").show(false) +-------------------------------------------------------+ |result | +-------------------------------------------------------+ |[this, is, my, first, sentence., this, is, my, second.]| +-------------------------------------------------------+
Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer} import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val stemmer = new Stemmer() .setInputCols("token") .setOutputCol("stem") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, stemmer )) val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.") .toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("stem.result").show(truncate = false) +-------------------------------------------------------------+ |result | +-------------------------------------------------------------+ |[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]| +-------------------------------------------------------------+
This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.
By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]) or loaded from pretrained models using pretrained of its companion object.
val stopWords = StopWordsCleaner.pretrained()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false) // will load the default pretrained model `"stopwords_en"`.
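Alternatively, a custom list can be supplied explicitly. A minimal sketch (the stop word list here is illustrative):
// Using an explicit, user-defined stop word list instead of a pretrained one
val customStopWords = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setStopWords(Array("this", "is", "and", "my"))
  .setCaseSensitive(false)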
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop and StopWordsCleanerTestSpec.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.Tokenizer import com.johnsnowlabs.nlp.annotator.SentenceDetector import com.johnsnowlabs.nlp.annotators.StopWordsCleaner import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val sentenceDetector = new SentenceDetector() .setInputCols(Array("document")) .setOutputCol("sentence") val tokenizer = new Tokenizer() .setInputCols(Array("sentence")) .setOutputCol("token") val stopWords = new StopWordsCleaner() .setInputCols("token") .setOutputCol("cleanTokens") .setCaseSensitive(false) val pipeline = new Pipeline().setStages(Array( documentAssembler, sentenceDetector, tokenizer, stopWords )) val data = Seq( "This is my first sentence. This is my second.", "This is my third sentence. This is my forth." ).toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("cleanTokens.result").show(false) +-------------------------------+ |result | +-------------------------------+ |[first, sentence, ., second, .]| |[third, sentence, ., forth, .] | +-------------------------------+
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.
For extended examples of usage, see the Spark NLP Workshop and the TextMatcherTestSpec.
In this example, the entities file is of the form
...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...
where each line represents an entity phrase to be extracted.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotator.Tokenizer import com.johnsnowlabs.nlp.annotator.TextMatcher import com.johnsnowlabs.nlp.util.io.ReadAs import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text") val entityExtractor = new TextMatcher() .setInputCols("document", "token") .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) .setOutputCol("entity") .setCaseSensitive(false) .setTokenizer(tokenizer.fit(data)) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor)) val results = pipeline.fit(data).transform(data) results.selectExpr("explode(entity) as result").show(false) +------------------------------------------------------------------------------------------+ |result | +------------------------------------------------------------------------------------------+ |[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []] | |[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]| |[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []] | +------------------------------------------------------------------------------------------+
BigTextMatcher to match large amounts of text
Instantiated model of the TextMatcher. For usage and examples see the documentation of the main class.
Converts TOKEN type Annotations to CHUNK type.
This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.
import spark.implicits._ import com.johnsnowlabs.nlp.base.DocumentAssembler import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer} import org.apache.spark.ml.Pipeline val documentAssembler = new DocumentAssembler() .setInputCol("text") .setOutputCol("document") val tokenizer = new Tokenizer() .setInputCols("document") .setOutputCol("token") val token2chunk = new Token2Chunk() .setInputCols("token") .setOutputCol("chunk") val pipeline = new Pipeline().setStages(Array( documentAssembler, tokenizer, token2chunk )) val data = Seq("One Two Three Four").toDF("text") val result = pipeline.fit(data).transform(data) result.selectExpr("explode(chunk) as result").show(false) +------------------------------------------+ |result | +------------------------------------------+ |[chunk, 0, 2, One, [sentence -> 0], []] | |[chunk, 4, 6, Two, [sentence -> 0], []] | |[chunk, 8, 12, Three, [sentence -> 0], []]| |[chunk, 14, 17, Four, [sentence -> 0], []]| +------------------------------------------+
Tokenizes raw text in document type columns into TokenizedSentence.
This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.
Identifies tokens with tokenization open standards. A few rules will help customizing it if defaults do not fit user needs.
For extended examples of usage, see the Spark NLP Workshop and the Tokenizer test class.
import spark.implicits._ import com.johnsnowlabs.nlp.DocumentAssembler import com.johnsnowlabs.nlp.annotators.Tokenizer import org.apache.spark.ml.Pipeline val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text") val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document") val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data) val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data) val result = pipeline.transform(data) result.selectExpr("token.result").show(false) +-----------------------------------------------------------------------+ |output | +-----------------------------------------------------------------------+ |[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]| +-----------------------------------------------------------------------+
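As a sketch of the rule customization mentioned above, exceptions keep multi-word or special tokens intact while extra split characters add additional split points (assuming the setExceptions and setSplitChars setters; the values are illustrative):
// Hypothetical rule customization: keep "e-mail" and "New York" as single tokens, additionally split on hyphens elsewhere
val customTokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
  .setExceptions(Array("e-mail", "New York"))
  .setSplitChars(Array("-"))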
Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules will help customizing it if defaults do not fit user needs.
This class represents an already fitted Tokenizer model.
See the main class Tokenizer for more examples of usage.
This is the companion object of ChunkTokenizer. Please refer to that class for the documentation.
This is the companion object of Chunker. Please refer to that class for the documentation.
This is the companion object of DateMatcher. Please refer to that class for the documentation.
This is the companion object of DocumentNormalizer. Please refer to that class for the documentation.
This is the companion object of Lemmatizer. Please refer to that class for the documentation.
This is the companion object of LemmatizerModel. Please refer to that class for the documentation.
This is the companion object of MultiDateMatcher. Please refer to that class for the documentation.
This is the companion object of Normalizer. Please refer to that class for the documentation.
This is the companion object of RegexMatcher. Please refer to that class for the documentation.
This is the companion object of RegexTokenizer. Please refer to that class for the documentation.
This is the companion object of Stemmer. Please refer to that class for the documentation.
This is the companion object of TextMatcher. Please refer to that class for the documentation.
This is the companion object of TextMatcherModel. Please refer to that class for the documentation.
This is the companion object of Token2Chunk. Please refer to that class for the documentation.
This is the companion object of Tokenizer. Please refer to that class for the documentation.
This is the companion object of TokenizerModel. Please refer to that class for the documentation.
Tokenizes and flattens extracted NER chunks.
The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.
For extended examples of usage, see the ChunkTokenizerTestSpec.
Example
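A minimal sketch of how the ChunkTokenizer could be placed after a chunk-producing annotator such as the TextMatcher; the entity file path and column names are illustrative assumptions, not taken from the original documentation:
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Any annotator that outputs CHUNK annotations works here; the entity file path is hypothetical
val entityExtractor = new TextMatcher()
  .setInputCols("sentence", "token")
  .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
  .setOutputCol("entity")

val chunkTokenizer = new ChunkTokenizer()
  .setInputCols("entity")
  .setOutputCol("chunk_token")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  entityExtractor,
  chunkTokenizer
))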