Package com.johnsnowlabs.nlp.annotators

package annotators

Type Members

  1. class ChunkTokenizer extends Tokenizer

    Tokenizes and flattens extracted NER chunks.

    The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.

    For extended examples of usage, see the ChunkTokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{ChunkTokenizer, TextMatcher, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val entityExtractor = new TextMatcher()
      .setInputCols("sentence", "token")
      .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT)
      .setOutputCol("entity")
    
    val chunkTokenizer = new ChunkTokenizer()
      .setInputCols("entity")
      .setOutputCol("chunk_token")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        entityExtractor,
        chunkTokenizer
      ))
    
    val data = Seq(
      "Hello world, my name is Michael, I am an artist and I work at Benezar",
      "Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("entity.result as entity" , "chunk_token.result as chunk_token").show(false)
    +-----------------------------------------------+---------------------------------------------------+
    |entity                                         |chunk_token                                        |
    +-----------------------------------------------+---------------------------------------------------+
    |[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
    |[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
    +-----------------------------------------------+---------------------------------------------------+
  2. class ChunkTokenizerModel extends TokenizerModel

    Instantiated model of the ChunkTokenizer. For usage and examples see the documentation of the main class.

  3. class Chunker extends AnnotatorModel[Chunker] with HasSimpleAnnotate[Chunker]

    This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> to be easily distinguishable in the text itself. This example sentence will result in the form:

    "Peter Pipers employees are picking pecks of pickled peppers."
    "<.>"

    To then extract these tags, regexParsers need to be set with e.g.:

    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("+", "+"))

    When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more proper nouns in succession. Additional patterns can also be set with addRegexParsers.
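
    For illustration, a further parser using the same angle bracket group syntax could look like the following sketch (the pattern and the output column name are illustrative only, not taken from the library documentation):

    // Hypothetical noun phrase pattern: an optional determiner, any number of
    // adjectives and then one or more singular nouns, e.g. "the quick brown fox".
    val nounPhraseChunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("np_chunk")
      .setRegexParsers(Array("<DT>?<JJ>*<NN>+"))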

    For more extended examples see the Spark NLP Workshop and the ChunkerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val POSTag = PerceptronModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("pos")
    
    val chunker = new Chunker()
      .setInputCols("sentence", "pos")
      .setOutputCol("chunk")
      .setRegexParsers(Array("+", "+"))
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        POSTag,
        chunker
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
    |[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
    |[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
    |[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
    +-------------------------------------------------------------+
    See also

    PerceptronModel for Part-Of-Speech tagging

  4. class DateMatcher extends AnnotatorModel[DateMatcher] with HasSimpleAnnotate[DateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format. Reads from different forms of date and time expressions and converts them to a provided date format.

    Extracts only one date per document. Use with sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Spark NLP Workshop and the DateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.DateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new DateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("date").show(false)
    +-------------------------------------------------+
    |date                                             |
    +-------------------------------------------------+
    |[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
    |[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
    |[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
    +-------------------------------------------------+
    See also

    MultiDateMatcher for matching multiple dates in a document

  5. class DateMatcherTranslator extends Serializable

  6. sealed trait DateMatcherTranslatorPolicy extends AnyRef

  7. trait DateMatcherUtils extends Params

  8. class DocumentNormalizer extends AnnotatorModel[DocumentNormalizer] with HasSimpleAnnotate[DocumentNormalizer]

    Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can apply unwanted character removal with a specific policy and can apply lowercase normalization.

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val cleanUpPatterns = Array("<[^>]*>")
    
    val documentNormalizer = new DocumentNormalizer()
      .setInputCols("document")
      .setOutputCol("normalizedDocument")
      .setAction("clean")
      .setPatterns(cleanUpPatterns)
      .setReplacement(" ")
      .setPolicy("pretty_all")
      .setLowercase(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentNormalizer
    ))
    
    // Raw text containing HTML markup; the tags are matched by cleanUpPatterns and removed.
    val text =
      """
      <div>
        THE WORLD'S LARGEST WEB DEVELOPER SITE
        <h1>THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
        <p>Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
      </div>
      """
    val data = Seq(text).toDF("text")
    val pipelineModel = pipeline.fit(data)
    
    val result = pipelineModel.transform(data)
    result.selectExpr("normalizedDocument.result").show(truncate=false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
  9. class GraphExtraction extends AnnotatorModel[GraphExtraction] with HasSimpleAnnotate[GraphExtraction]

    Extracts a dependency graph between entities.

    The GraphExtraction class takes, for example, extracted entities from a NerDLModel and creates a dependency tree that describes how the entities relate to each other. For this, a triple store format is used: nodes represent the entities and edges represent the relations between them. The graph can then be used to find relevant relationships between words.

    Both the DependencyParserModel and TypedDependencyParserModel need to be present in the pipeline. There are two ways to set them:

    1. Both Annotators are present in the pipeline already. The dependencies are taken implicitly from these two Annotators.
    2. Setting setMergeEntities to true will download the default pretrained models for those two Annotators automatically. The specific models can also be set with setDependencyParserModel and setTypedDependencyParserModel:
    val graph_extraction = new GraphExtraction()
      .setInputCols("document", "token", "ner")
      .setOutputCol("graph")
      .setRelationshipTypes(Array("prefer-LOC"))
      .setMergeEntities(true)
    //.setDependencyParserModel(Array("dependency_conllu", "en",  "public/models"))
    //.setTypedDependencyParserModel(Array("dependency_typed_conllu", "en",  "public/models"))

    To transform the resulting graph into a more generic form such as RDF, see the GraphFinisher.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
    import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserModel
    import org.apache.spark.ml.Pipeline
    import com.johnsnowlabs.nlp.annotators.GraphExtraction
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
    
    val nerTagger = NerDLModel.pretrained()
      .setInputCols("sentence", "token", "embeddings")
      .setOutputCol("ner")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    val dependencyParser = DependencyParserModel.pretrained()
      .setInputCols("sentence", "pos", "token")
      .setOutputCol("dependency")
    
    val typedDependencyParser = TypedDependencyParserModel.pretrained()
      .setInputCols("dependency", "pos", "token")
      .setOutputCol("dependency_type")
    
    val graph_extraction = new GraphExtraction()
      .setInputCols("document", "token", "ner")
      .setOutputCol("graph")
      .setRelationshipTypes(Array("prefer-LOC"))
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence,
      tokenizer,
      embeddings,
      nerTagger,
      posTagger,
      dependencyParser,
      typedDependencyParser,
      graph_extraction
    ))
    
    val data = Seq("You and John prefer the morning flight through Denver").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("graph").show(false)
    +-----------------------------------------------------------------------------------------------------------------+
    |graph                                                                                                            |
    +-----------------------------------------------------------------------------------------------------------------+
    |[[node, 13, 18, prefer, [relationship -> prefer,LOC, path1 -> prefer,nsubj,morning,flat,flight,flat,Denver], []]]|
    +-----------------------------------------------------------------------------------------------------------------+
    See also

    GraphFinisher to output the paths in a more generic format, like RDF

  10. class Lemmatizer extends AnnotatorApproach[LemmatizerModel]

    Class to find lemmas out of words with the objective of returning a base dictionary word. Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource. Pretrained models can be loaded with LemmatizerModel.pretrained.
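
    As a rough sketch of the ExternalResource form mentioned above (assuming the ExternalResource helper from com.johnsnowlabs.nlp.util.io and that the key and value delimiters are passed through its options map):

    import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}

    // Assumption: the delimiters are provided via the options map of the resource.
    val lemmaDictionaryResource = ExternalResource(
      "src/test/resources/lemma-corpus-small/lemmas_small.txt",
      ReadAs.TEXT,
      Map("keyDelimiter" -> "->", "valueDelimiter" -> "\t")
    )

    val lemmatizerWithResource = new Lemmatizer()
      .setInputCols("token")
      .setOutputCol("lemma")
      .setDictionary(lemmaDictionaryResource)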

    For available pretrained models please see the Models Hub. For extended examples of usage, see the Spark NLP Workshop and the LemmatizerTestSpec.

    Example

    In this example, the lemma dictionary lemmas_small.txt has the form of

    ...
    pick	->	pick	picks	picking	picked
    peck	->	peck	pecking	pecked	pecks
    pickle	->	pickle	pickles	pickled	pickling
    pepper	->	pepper	peppers	peppered	peppering
    ...

    where each key is delimited by -> and values are delimited by \t

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Lemmatizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val lemmatizer = new Lemmatizer()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")
      .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        lemmatizer
      ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    
    val result = pipeline.fit(data).transform(data)
    result.selectExpr("lemma.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
    +------------------------------------------------------------------+
    See also

    LemmatizerModel for the instantiated model and pretrained models.

  11. class LemmatizerModel extends AnnotatorModel[LemmatizerModel] with HasSimpleAnnotate[LemmatizerModel]

    Instantiated Model of the Lemmatizer. For usage and examples, please see the documentation of that class. For available pretrained models please see the Models Hub.

    Example

    The lemmatizer from the example of the Lemmatizer can be replaced with:

    val lemmatizer = LemmatizerModel.pretrained()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")

    This will load the default pretrained model which is "lemma_antbnc".

    See also

    Lemmatizer

  12. class MultiDateMatcher extends AnnotatorModel[MultiDateMatcher] with HasSimpleAnnotate[MultiDateMatcher] with DateMatcherUtils

    Matches standard date formats into a provided format.

    Reads the following kind of dates:

    "1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
    "Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
    "last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
    "next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
    "at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

    For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

    For extended examples of usage, see the Spark NLP Workshop and the MultiDateMatcherTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val date = new MultiDateMatcher()
      .setInputCols("document")
      .setOutputCol("date")
      .setAnchorDateYear(2020)
      .setAnchorDateMonth(1)
      .setAnchorDateDay(11)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      date
    ))
    
    val data = Seq("I saw him yesterday and he told me that he will visit us next week")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(date) as dates").show(false)
    +-----------------------------------------------+
    |dates                                          |
    +-----------------------------------------------+
    |[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
    |[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
    +-----------------------------------------------+
  13. class NGramGenerator extends AnnotatorModel[NGramGenerator] with HasSimpleAnnotate[NGramGenerator]

    A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

    When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.
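
    Beyond setting n as in the example below, further configuration could look roughly like this sketch (setEnableCumulative and setDelimiter are assumptions for illustration):

    // Sketch: generate n-grams up to length 3 and join the words with "_" instead of a space.
    val cumulativeNGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(3)
      .setEnableCumulative(true)
      .setDelimiter("_")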

    For more extended examples see the Spark NLP Workshop and the NGramGeneratorTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.NGramGenerator
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setN(2)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        nGrams
      ))
    
    val data = Seq("This is my sentence.").toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(ngrams) as result").show(false)
    +------------------------------------------------------------+
    |result                                                      |
    +------------------------------------------------------------+
    |[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
    |[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
    |[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
    |[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
    +------------------------------------------------------------+
  14. class Normalizer extends AnnotatorApproach[NormalizerModel]

    Annotator that cleans out tokens. Requires stems, hence tokens. Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.

    For extended examples of usage, see the Spark NLP Workshop.
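
    The dictionary-based transformation mentioned above could be set up roughly as in this sketch (the setSlangDictionary setter, the file path and its contents are assumptions for illustration):

    // Hypothetical slang dictionary file slangs.txt, one comma-delimited mapping per line, e.g.:
    //   gr8,great
    //   thx,thanks
    val normalizerWithDictionary = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setLowercase(true)
      .setSlangDictionary("src/test/resources/slangs.txt", ",")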

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setLowercase(true)
      .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuations (keep alphanumeric chars)
    // if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      normalizer
    ))
    
    val data = Seq("John and Peter are brothers. However they don't support each other that much.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("normalized.result").show(truncate = false)
    +----------------------------------------------------------------------------------------+
    |result                                                                                  |
    +----------------------------------------------------------------------------------------+
    |[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
    +----------------------------------------------------------------------------------------+
  15. class NormalizerModel extends AnnotatorModel[NormalizerModel] with HasSimpleAnnotate[NormalizerModel]

    Instantiated Model of the Normalizer. For usage and examples, please see the documentation of that class.

    See also

    Normalizer for the base class

  16. trait ReadablePretrainedLemmatizer extends ParamsAndFeaturesReadable[LemmatizerModel] with HasPretrained[LemmatizerModel]

  17. trait ReadablePretrainedStopWordsCleanerModel extends ParamsAndFeaturesReadable[StopWordsCleaner] with HasPretrained[StopWordsCleaner]

  18. trait ReadablePretrainedTextMatcher extends ParamsAndFeaturesReadable[TextMatcherModel] with HasPretrained[TextMatcherModel]

  19. trait ReadablePretrainedTokenizer extends ParamsAndFeaturesReadable[TokenizerModel] with HasPretrained[TokenizerModel]

  20. class RecursiveTokenizer extends AnnotatorApproach[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Tokenizes raw text recursively based on a handful of definable rules.

    Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only (a short sketch of setting them follows the list):

    • prefixes: Strings that will be split when found at the beginning of a token.
    • suffixes: Strings that will be split when found at the end of a token.
    • infixes: Strings that will be split when found in the middle of a token.
    • whitelist: Whitelist of strings not to split.
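
    A minimal sketch of setting these parameters (assuming straightforward setters for the listed parameters; the values are only illustrative and not the defaults):

    val customRecursiveTokenizer = new RecursiveTokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setPrefixes(Array("\"", "(", "["))
      .setSuffixes(Array(".", ",", "?", ")", "]"))
      .setInfixes(Array("(", ")"))
      .setWhitelist(Array("it's", "don't"))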

    For extended examples of usage, see the Spark NLP Workshop and the TokenizerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new RecursiveTokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer
    ))
    
    val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("token.result").show(false)
    +------------------------------------------------------------------+
    |result                                                            |
    +------------------------------------------------------------------+
    |[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
    +------------------------------------------------------------------+
  21. class RecursiveTokenizerModel extends AnnotatorModel[RecursiveTokenizerModel] with HasSimpleAnnotate[RecursiveTokenizerModel] with ParamsAndFeaturesWritable

    Instantiated model of the RecursiveTokenizer. For usage and examples see the documentation of the main class.

  22. class RegexMatcher extends AnnotatorApproach[RegexMatcherModel]

    Uses a reference file to match a set of regular expressions and associate them with a provided identifier.

    A dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set either in the form of a delimited text file or directly as an ExternalResource.

    Pretrained pipelines are available for this module, see Pipelines.

    For extended examples of usage, see the Spark NLP Workshop and the RegexMatcherTestSpec.

    Example

    In this example, the rules.txt has the form of

    the\s\w+, followed by 'the'
    ceremonies, ceremony

    where each regex is separated from its identifier by ","

    import ResourceHelper.spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.RegexMatcher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    
    val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    
    val regexMatcher = new RegexMatcher()
      .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",")
      .setInputCols(Array("sentence"))
      .setOutputCol("regex")
      .setStrategy("MATCH_ALL")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))
    
    val data = Seq(
      "My first sentence with the first rule. This is my second sentence with ceremonies rule."
    ).toDF("text")
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(regex) as result").show(false)
    +--------------------------------------------------------------------------------------------+
    |result                                                                                      |
    +--------------------------------------------------------------------------------------------+
    |[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
    |[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
    +--------------------------------------------------------------------------------------------+
  23. class RegexMatcherModel extends AnnotatorModel[RegexMatcherModel] with HasSimpleAnnotate[RegexMatcherModel]

    Instantiated model of the RegexMatcher. For usage and examples see the documentation of the main class.

  24. class RegexTokenizer extends AnnotatorModel[RegexTokenizer] with HasSimpleAnnotate[RegexTokenizer]

    A tokenizer that splits text by a regex pattern.

    The pattern needs to be set with setPattern, which defines the delimiting pattern, i.e. how the tokens should be split. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.RegexTokenizer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val regexTokenizer = new RegexTokenizer()
      .setInputCols("document")
      .setOutputCol("regexToken")
      .setToLowercase(true)
      .setPattern("\\s+")
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        regexTokenizer
      ))
    
    val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("regexToken.result").show(false)
    +-------------------------------------------------------+
    |result                                                 |
    +-------------------------------------------------------+
    |[this, is, my, first, sentence., this, is, my, second.]|
    +-------------------------------------------------------+
  25. class Stemmer extends AnnotatorModel[Stemmer] with HasSimpleAnnotate[Stemmer]

    Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val stemmer = new Stemmer()
      .setInputCols("token")
      .setOutputCol("stem")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      stemmer
    ))
    
    val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
      .toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("stem.result").show(truncate = false)
    +-------------------------------------------------------------+
    |result                                                       |
    +-------------------------------------------------------------+
    |[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
    +-------------------------------------------------------------+
  26. class StopWordsCleaner extends AnnotatorModel[StopWordsCleaner] with HasSimpleAnnotate[StopWordsCleaner]

    This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences.

    By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined by explicitly setting them with setStopWords(value: Array[String]) or loaded from pretrained models using the pretrained method of its companion object.

    val stopWords = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    // will load the default pretrained model `"stopwords_en"`.
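
    Alternatively, as noted above, the stop words can be supplied explicitly with setStopWords (the word list here is only an illustration):

    val customStopWordsCleaner = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(Array("this", "is", "and", "the"))
      .setCaseSensitive(false)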

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and StopWordsCleanerTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val stopWords = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    
    val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        tokenizer,
        stopWords
      ))
    
    val data = Seq(
      "This is my first sentence. This is my second.",
      "This is my third sentence. This is my forth."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("cleanTokens.result").show(false)
    +-------------------------------+
    |result                         |
    +-------------------------------+
    |[first, sentence, ., second, .]|
    |[third, sentence, ., forth, .] |
    +-------------------------------+
  27. class TextMatcher extends AnnotatorApproach[TextMatcherModel] with ParamsAndFeaturesWritable

    Annotator to match exact phrases (by token) provided in a file against a Document.

    A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.
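
    A rough sketch of the ExternalResource form (the ExternalResource construction and its options map are assumptions for illustration):

    import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}

    // Assumption: the phrase file is read as plain text.
    val entityResource = ExternalResource(
      "src/test/resources/entity-extractor/test-phrases.txt",
      ReadAs.TEXT,
      Map("format" -> "text")
    )

    val entityExtractorFromResource = new TextMatcher()
      .setInputCols("document", "token")
      .setOutputCol("entity")
      .setEntities(entityResource)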

    For extended examples of usage, see the Spark NLP Workshop and the TextMatcherTestSpec.

    Example

    In this example, the entities file is of the form

    ...
    dolore magna aliqua
    lorem ipsum dolor. sit
    laborum
    ...

    where each line represents an entity phrase to be extracted.

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.TextMatcher
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
    val entityExtractor = new TextMatcher()
      .setInputCols("document", "token")
      .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
      .setOutputCol("entity")
      .setCaseSensitive(false)
      .setTokenizer(tokenizer.fit(data))
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
    val results = pipeline.fit(data).transform(data)
    
    results.selectExpr("explode(entity) as result").show(false)
    +------------------------------------------------------------------------------------------+
    |result                                                                                    |
    +------------------------------------------------------------------------------------------+
    |[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
    |[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
    |[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
    +------------------------------------------------------------------------------------------+
    See also

    BigTextMatcher to match large amounts of text

  28. class TextMatcherModel extends AnnotatorModel[TextMatcherModel] with HasSimpleAnnotate[TextMatcherModel]

    Instantiated model of the TextMatcher. For usage and examples see the documentation of the main class.

  29. class Token2Chunk extends AnnotatorModel[Token2Chunk] with HasSimpleAnnotate[Token2Chunk]

    Converts TOKEN type Annotations to CHUNK type.

    This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}
    
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val token2chunk = new Token2Chunk()
      .setInputCols("token")
      .setOutputCol("chunk")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      token2chunk
    ))
    
    val data = Seq("One Two Three Four").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk) as result").show(false)
    +------------------------------------------+
    |result                                    |
    +------------------------------------------+
    |[chunk, 0, 2, One, [sentence -> 0], []]   |
    |[chunk, 4, 6, Two, [sentence -> 0], []]   |
    |[chunk, 8, 12, Three, [sentence -> 0], []]|
    |[chunk, 14, 17, Four, [sentence -> 0], []]|
    +------------------------------------------+
  30. class Tokenizer extends AnnotatorApproach[TokenizerModel]

    Tokenizes raw text in document type columns into TokenizedSentence.

    This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.

    Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.

    For extended examples of usage, see the Spark NLP Workshop and the Tokenizer test class.
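
    As a sketch of such rule customization (setExceptions, setSplitChars and setContextChars are assumed here; the values are illustrative only):

    // Keep the listed compounds as single tokens, additionally split on "-" and
    // treat the given characters as context characters.
    val customTokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
      .setExceptions(Array("e-mail", "New York"))
      .setSplitChars(Array("-"))
      .setContextChars(Array(".", ",", ";", "!", "?", "(", ")", "\""))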

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline
    
    val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
    val result = pipeline.transform(data)
    
    result.selectExpr("token.result").show(false)
    +-----------------------------------------------------------------------+
    |result                                                                 |
    +-----------------------------------------------------------------------+
    |[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
    +-----------------------------------------------------------------------+
  31. class TokenizerModel extends AnnotatorModel[TokenizerModel] with HasSimpleAnnotate[TokenizerModel]

    Tokenizes raw text into word pieces, tokens. Identifies tokens with tokenization open standards. A few rules will help customize it if the defaults do not fit user needs.

    This class represents an already fitted Tokenizer model.

    See the main class Tokenizer for more examples of usage.

Value Members

  1. object ChunkTokenizer extends DefaultParamsReadable[ChunkTokenizer] with Serializable

    This is the companion object of ChunkTokenizer. Please refer to that class for the documentation.

  2. object ChunkTokenizerModel extends ParamsAndFeaturesReadable[ChunkTokenizerModel] with Serializable

  3. object Chunker extends DefaultParamsReadable[Chunker] with Serializable

    This is the companion object of Chunker. Please refer to that class for the documentation.

  4. object DateMatcher extends DefaultParamsReadable[DateMatcher] with Serializable

    This is the companion object of DateMatcher. Please refer to that class for the documentation.

  5. object DocumentNormalizer extends DefaultParamsReadable[DocumentNormalizer] with Serializable

    This is the companion object of DocumentNormalizer. Please refer to that class for the documentation.

  6. object EnglishStemmer

  7. object Lemmatizer extends DefaultParamsReadable[Lemmatizer] with Serializable

    This is the companion object of Lemmatizer. Please refer to that class for the documentation.

  8. object LemmatizerModel extends ReadablePretrainedLemmatizer with Serializable

    This is the companion object of LemmatizerModel. Please refer to that class for the documentation.

  9. object MultiDateMatcher extends DefaultParamsReadable[MultiDateMatcher] with Serializable

    This is the companion object of MultiDateMatcher. Please refer to that class for the documentation.

  10. object MultiDatePolicy extends DateMatcherTranslatorPolicy with Product with Serializable

  11. object NGramGenerator extends ParamsAndFeaturesReadable[NGramGenerator] with Serializable

  12. object Normalizer extends DefaultParamsReadable[Normalizer] with Serializable

    This is the companion object of Normalizer. Please refer to that class for the documentation.

  13. object NormalizerModel extends ParamsAndFeaturesReadable[NormalizerModel] with Serializable

  14. object PretrainedAnnotations

  15. object RegexMatcher extends DefaultParamsReadable[RegexMatcher] with Serializable

    This is the companion object of RegexMatcher. Please refer to that class for the documentation.

  16. object RegexMatcherModel extends ParamsAndFeaturesReadable[RegexMatcherModel] with Serializable

  17. object SingleDatePolicy extends DateMatcherTranslatorPolicy with Product with Serializable

  18. object Stemmer extends DefaultParamsReadable[Stemmer] with Serializable

    This is the companion object of Stemmer. Please refer to that class for the documentation.

  19. object StopWordsCleaner extends ParamsAndFeaturesReadable[StopWordsCleaner] with ReadablePretrainedStopWordsCleanerModel with Serializable

  20. object TextMatcher extends DefaultParamsReadable[TextMatcher] with Serializable

    This is the companion object of TextMatcher. Please refer to that class for the documentation.

  21. object TextMatcherModel extends ReadablePretrainedTextMatcher with Serializable

    This is the companion object of TextMatcherModel. Please refer to that class for the documentation.

  22. object Token2Chunk extends DefaultParamsReadable[Token2Chunk] with Serializable

    This is the companion object of Token2Chunk. Please refer to that class for the documentation.

  23. object Tokenizer extends DefaultParamsReadable[Tokenizer] with Serializable

    This is the companion object of Tokenizer. Please refer to that class for the documentation.

  24. object TokenizerModel extends ReadablePretrainedTokenizer with Serializable

    This is the companion object of TokenizerModel. Please refer to that class for the documentation.

  25. package btm

  26. package classifier

  27. package common

  28. package keyword

  29. package ld

  30. package ner

  31. package param

  32. package parser

  33. package pos

  34. package sbd

  35. package sda

  36. package sentence_detector_dl

  37. package seq2seq

  38. package spell

  39. package tokenizer

  40. package ws
