Package com.johnsnowlabs.nlp

package nlp

Type Members

  1. case class Annotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends Product with Serializable

    Represents an annotator's output parts and their details.

    annotatorType

    the type of annotation

    begin

    the index of the first character under this annotation

    end

    the index of the last character under this annotation (inclusive, as in the example outputs below)

    metadata

    associated metadata for this annotation
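
    Example

    A minimal sketch (not part of the original documentation) showing how an Annotation can be constructed directly; the values are hypothetical and mirror the CHUNK results seen in the examples below:

    import com.johnsnowlabs.nlp.Annotation

    // Hypothetical CHUNK annotation covering "New York" at character indices 0 to 7 (end inclusive)
    val annotation = Annotation(
      annotatorType = "chunk",
      begin = 0,
      end = 7,
      result = "New York",
      metadata = Map("entity" -> "LOC", "sentence" -> "0")
    )
    // embeddings is not set here and defaults to Array.emptyFloatArray

    annotation.result   // "New York"
    annotation.metadata // Map(entity -> LOC, sentence -> 0)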

  2. abstract class AnnotatorApproach[M <: Model[M]] extends Estimator[M] with HasInputAnnotationCols with HasOutputAnnotationCol with HasOutputAnnotatorType with DefaultParamsWritable with CanBeLazy

    This class should grow once we start training on datasets and sharing params. For now it stands as a dummy placeholder for future reference.

  3. abstract class AnnotatorModel[M <: Model[M]] extends Model[M] with RawAnnotator[M] with CanBeLazy

    This class implements the logic that applies NLP using Spark ML Pipeline transformers. It should change significantly once UserDefinedTypes are allowed: https://issues.apache.org/jira/browse/SPARK-7768

  4. trait CanBeLazy extends AnyRef

  5. class Chunk2Doc extends AnnotatorModel[Chunk2Doc] with HasSimpleAnnotate[Chunk2Doc]

    Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.

    For more extended examples on document pre-processing see the Spark NLP Workshop.

    Example

    Location entities are extracted and converted back into DOCUMENT type for further processing

    import spark.implicits._
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    import com.johnsnowlabs.nlp.Chunk2Doc
    
    val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
    
    // Extracts Named Entities amongst other things
    val pipeline = PretrainedPipeline("explain_document_dl")
    
    val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
    val explainResult = pipeline.transform(data)
    
    val result = chunkToDoc.transform(explainResult)
    result.selectExpr("explode(chunkConverted)").show(false)
    +------------------------------------------------------------------------------+
    |col                                                                           |
    +------------------------------------------------------------------------------+
    |[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
    |[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
    +------------------------------------------------------------------------------+
    See also

    Doc2Chunk for converting DOCUMENT annotations to CHUNK

    PretrainedPipeline on how to use the PretrainedPipeline

  6. class Doc2Chunk extends Model[Doc2Chunk] with RawAnnotator[Doc2Chunk]

    Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (set via setIsArray). Useful for annotators that require a CHUNK type input.

    For more extended examples on document pre-processing see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val chunkAssembler = new Doc2Chunk()
      .setInputCols("document")
      .setChunkCol("target")
      .setOutputCol("chunk")
      .setIsArray(true)
    
    val data = Seq(
      ("Spark NLP is an open-source text processing library for advanced natural language processing.",
        Seq("Spark NLP", "text processing library", "natural language processing"))
    ).toDF("text", "target")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
    val result = pipeline.transform(data)
    
    result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
    +-----------------------------------------------------------------+---------------------+
    |result                                                           |annotatorType        |
    +-----------------------------------------------------------------+---------------------+
    |[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
    +-----------------------------------------------------------------+---------------------+
    See also

    Chunk2Doc for converting CHUNK annotations to DOCUMENT

  7. class DocumentAssembler extends Transformer with DefaultParamsWritable with HasOutputAnnotatorType with HasOutputAnnotationCol

    Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.

    For more extended examples on document pre-processing see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    
    val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    
    val result = documentAssembler.transform(data)
    
    result.select("document").show(false)
    +----------------------------------------------------------------------------------------------+
    |document                                                                                      |
    +----------------------------------------------------------------------------------------------+
    |[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
    +----------------------------------------------------------------------------------------------+
    
    result.select("document").printSchema
    root
     |-- document: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- annotatorType: string (nullable = true)
     |    |    |-- begin: integer (nullable = false)
     |    |    |-- end: integer (nullable = false)
     |    |    |-- result: string (nullable = true)
     |    |    |-- metadata: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: string (valueContainsNull = true)
     |    |    |-- embeddings: array (nullable = true)
     |    |    |    |-- element: float (containsNull = false)
  8. class EmbeddingsFinisher extends Transformer with DefaultParamsWritable

    Extracts embeddings from Annotations into a more easily usable form.

    This is useful, for example, for WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.

    By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or Vectors, which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifiers or any other function that requires a features column.

    For more extended examples see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import org.apache.spark.ml.Pipeline
    import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
    import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
    
    val stopwordsCleaner = new StopWordsCleaner()
      .setInputCols("normalized")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    
    val gloveEmbeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("document", "cleanTokens")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_sentence_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val data = Seq("Spark NLP is an open-source text processing library.")
      .toDF("text")
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      normalizer,
      stopwordsCleaner,
      gloveEmbeddings,
      embeddingsFinisher
    )).fit(data)
    
    val result = pipeline.transform(data)
    val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
      .map { row =>
        val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
        (vector.size, vector)
      }.toDF("size", "vector")
    
    resultWithSize.show(5, 80)
    +----+--------------------------------------------------------------------------------+
    |size|                                                                          vector|
    +----+--------------------------------------------------------------------------------+
    | 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
    | 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
    | 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
    | 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
    | 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
    +----+--------------------------------------------------------------------------------+
    See also

    Finisher for finishing Strings

  9. class FeaturesReader[T <: HasFeatures] extends MLReader[T]

  10. class FeaturesWriter[T] extends MLWriter with HasFeatures

  11. class Finisher extends Transformer with DefaultParamsWritable

    Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP Pipelines. The Finisher outputs the annotation values as Strings.

    For more extended examples on document pre-processing see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
    import com.johnsnowlabs.nlp.Finisher
    
    val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")
    
    // Extracts Named Entities amongst other things
    val pipeline = PretrainedPipeline("explain_document_dl")
    
    val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
    val explainResult = pipeline.transform(data)
    
    explainResult.selectExpr("explode(entities)").show(false)
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    |entities                                                                                                                                              |
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
    +------------------------------------------------------------------------------------------------------------------------------------------------------+
    
    val result = finisher.transform(explainResult)
    result.select("output").show(false)
    +----------------------+
    |output                |
    +----------------------+
    |[New York, New Jersey]|
    +----------------------+
    See also

    EmbeddingsFinisher for finishing embeddings

  12. class GraphFinisher extends Transformer

    Helper class to convert the knowledge graph from GraphExtraction into a generic format, such as RDF.

    Example

    This is a continuation of the example of GraphExtraction. To see how the graph is extracted, see the documentation of that class.

    import com.johnsnowlabs.nlp.GraphFinisher
    
    val graphFinisher = new GraphFinisher()
      .setInputCol("graph")
      .setOutputCol("graph_finished")
      .setOutputAsArray(false)
    
    val finishedResult = graphFinisher.transform(result)
    finishedResult.select("text", "graph_finished").show(false)
    +-----------------------------------------------------+-----------------------------------------------------------------------+
    |text                                                 |graph_finished                                                         |
    +-----------------------------------------------------+-----------------------------------------------------------------------+
    |You and John prefer the morning flight through Denver|[[(prefer,nsubj,morning), (morning,flat,flight), (flight,flat,Denver)]]|
    +-----------------------------------------------------+-----------------------------------------------------------------------+
    See also

    GraphExtraction to extract the graph.

  13. trait HasBatchedAnnotate[M <: Model[M]] extends AnyRef

  14. trait HasCaseSensitiveProperties extends ParamsAndFeaturesWritable

  15. trait HasFeatures extends AnyRef

  16. trait HasInputAnnotationCols extends Params

  17. trait HasOutputAnnotationCol extends Params

  18. trait HasOutputAnnotatorType extends AnyRef

  19. trait HasPretrained[M <: PipelineStage] extends AnyRef

  20. trait HasRecursiveFit[M <: Model[M]] extends AnyRef

    AnnotatorApproaches may extend this trait in order to allow RecursivePipelines to include the trained PipelineModels of intermediate steps.

  21. trait HasRecursiveTransform[M <: Model[M]] extends AnyRef

  22. trait HasSimpleAnnotate[M <: Model[M]] extends AnyRef

  23. case class JavaAnnotation(annotatorType: String, begin: Int, end: Int, result: String, metadata: Map[String, String], embeddings: Array[Float] = Array.emptyFloatArray) extends Product with Serializable

  24. class LightPipeline extends AnyRef

  25. trait ParamsAndFeaturesReadable[T <: HasFeatures] extends DefaultParamsReadable[T]

  26. trait ParamsAndFeaturesWritable extends DefaultParamsWritable with Params with HasFeatures

  27. trait RawAnnotator[M <: Model[M]] extends Model[M] with ParamsAndFeaturesWritable with HasOutputAnnotatorType with HasInputAnnotationCols with HasOutputAnnotationCol

  28. class RecursivePipeline extends Pipeline

  29. class RecursivePipelineModel extends Model[RecursivePipelineModel] with MLWritable with Logging

  30. class TokenAssembler extends AnnotatorModel[TokenAssembler] with HasSimpleAnnotate[TokenAssembler]

    This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.

    For more extended examples on document pre-processing see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
    import com.johnsnowlabs.nlp.TokenAssembler
    import org.apache.spark.ml.Pipeline
    
    // First, the text is tokenized and cleaned
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentences")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentences")
      .setOutputCol("token")
    
    val normalizer = new Normalizer()
      .setInputCols("token")
      .setOutputCol("normalized")
      .setLowercase(false)
    
    val stopwordsCleaner = new StopWordsCleaner()
      .setInputCols("normalized")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)
    
    // Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
    val tokenAssembler = new TokenAssembler()
      .setInputCols("sentences", "cleanTokens")
      .setOutputCol("cleanText")
    
    val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
      .toDF("text")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      normalizer,
      stopwordsCleaner,
      tokenAssembler
    )).fit(data)
    
    val result = pipeline.transform(data)
    result.select("cleanText").show(false)
    +---------------------------------------------------------------------------------------------------------------------------+
    |cleanText                                                                                                                  |
    +---------------------------------------------------------------------------------------------------------------------------+
    |[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
    +---------------------------------------------------------------------------------------------------------------------------+
    See also

    DocumentAssembler on the data structure

Value Members

  1. object Annotation extends Serializable

  2. object AnnotatorType

  3. object Chunk2Doc extends DefaultParamsReadable[Chunk2Doc] with Serializable

    This is the companion object of Chunk2Doc. Please refer to that class for the documentation.

  4. object Doc2Chunk extends DefaultParamsReadable[Doc2Chunk] with Serializable

    This is the companion object of Doc2Chunk. Please refer to that class for the documentation.

  5. object DocumentAssembler extends DefaultParamsReadable[DocumentAssembler] with Serializable

    This is the companion object of DocumentAssembler. Please refer to that class for the documentation.

  6. object EmbeddingsFinisher extends DefaultParamsReadable[EmbeddingsFinisher] with Serializable

    This is the companion object of EmbeddingsFinisher. Please refer to that class for the documentation.

  7. object Finisher extends DefaultParamsReadable[Finisher] with Serializable

    This is the companion object of Finisher. Please refer to that class for the documentation.

  8. object SparkNLP

  9. object TokenAssembler extends DefaultParamsReadable[TokenAssembler] with Serializable

    This is the companion object of TokenAssembler. Please refer to that class for the documentation.

  10. package annotators

  11. package embeddings

  12. object functions

  13. package pretrained

  14. package recursive

  15. package serialization

  16. package training

  17. package util
