Package

com.johnsnowlabs.nlp.annotators.ner

crf

package crf

Type Members

  1. case class DictionaryFeatures(dict: Map[String, String]) extends Product with Serializable

  2. case class FeatureGenerator(dictFeatures: DictionaryFeatures) extends Product with Serializable

    Generates features for CrfBasedNer

  3. class NerCrfApproach extends AnnotatorApproach[NerCrfModel] with NerApproach[NerCrfApproach]

    Algorithm for training a Named Entity Recognition Model

    For instantiated/pretrained models, see NerCrfModel.

    This Named Entity Recognition annotator trains a generic model using a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB, with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS and WORD_EMBEDDINGS, plus an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.

    Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
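
    As a minimal sketch of the optional dictionary, the snippet below passes a delimited text file to setExternalFeatures. The path "path/to/entity_dict.txt" and the "," delimiter are hypothetical placeholders, and the file is assumed to hold one delimited entry per line; adjust both to the actual dictionary.

    ```scala
    import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfApproach

    // Hypothetical dictionary file and delimiter, shown for illustration only.
    val nerTaggerWithDict = new NerCrfApproach()
      .setInputCols("sentence", "token", "pos", "embeddings")
      .setLabelColumn("label")
      .setExternalFeatures("path/to/entity_dict.txt", ",")
      .setOutputCol("ner")
    ```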

    For extended examples of usage, see the Spark NLP Workshop and the NerCrfApproachTestSpec.

    Example

    This CoNLL dataset already includes the sentence, token, pos and label columns with their respective annotator types. If a custom dataset is used, these columns need to be defined.

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotator.NerCrfApproach
    import com.johnsnowlabs.nlp.training.CoNLL
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    val nerTagger = new NerCrfApproach()
      .setInputCols("sentence", "token", "pos", "embeddings")
      .setLabelColumn("label")
      .setMinEpochs(1)
      .setMaxEpochs(3)
      .setC0(34)
      .setL2(3.0)
      .setOutputCol("ner")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      embeddings,
      nerTagger
    ))
    
    
    val conll = CoNLL()
    val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
    
    val pipelineModel = pipeline.fit(trainingData)

    See also

    NerConverter to further process the results

    NerDLApproach for a deep learning based approach

  4. class NerCrfModel extends AnnotatorModel[NerCrfModel] with HasSimpleAnnotate[NerCrfModel] with HasStorageRef

    Extracts Named Entities based on a CRF Model.

    This Named Entity Recognition annotator applies a generic model trained with a CRF machine learning algorithm. The data should have columns of type DOCUMENT, TOKEN, POS and WORD_EMBEDDINGS. These can be extracted with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.

    This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val nerTagger = NerCrfModel.pretrained()
      .setInputCols("sentence", "token", "word_embeddings", "pos")
      .setOutputCol("ner")

    If no name is provided, the default model is "ner_crf". For available pretrained models, please see the Models Hub.
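
    To make the default explicit, the model name and language can also be passed to pretrained. This sketch uses the "ner_crf" name given above; the "en" language code is an assumption, and any other model name should be taken from the Models Hub.

    ```scala
    import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel

    // Same model as the no-argument call, with the name spelled out.
    // The "en" language code is assumed here.
    val namedNerTagger = NerCrfModel.pretrained("ner_crf", "en")
      .setInputCols("sentence", "token", "word_embeddings", "pos")
      .setOutputCol("ner")
    ```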

    For extended examples of usage, see the Spark NLP Workshop.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
    import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
    import org.apache.spark.ml.Pipeline
    
    // First extract the prerequisites for the NerCrfModel
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("word_embeddings")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    // Then NER can be extracted
    val nerTagger = NerCrfModel.pretrained()
      .setInputCols("sentence", "token", "word_embeddings", "pos")
      .setOutputCol("ner")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence,
      tokenizer,
      embeddings,
      posTagger,
      nerTagger
    ))
    
    val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("ner.result").show(false)
    +------------------------------------+
    |result                              |
    +------------------------------------+
    |[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
    +------------------------------------+

    See also

    NerConverter to further process the results

    NerDLModel for a deep learning based approach
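
    A minimal sketch of the NerConverter step mentioned above, assuming the column names from the example ("sentence", "token", "ner"). The output column name "ner_chunk" is an arbitrary choice. The converter would be appended as a further pipeline stage after nerTagger, where it merges the IOB tags into entity chunks.

    ```scala
    import com.johnsnowlabs.nlp.annotators.ner.NerConverter

    // Reads the IOB tags in "ner" and groups consecutive tokens
    // of the same entity into chunks.
    val nerConverter = new NerConverter()
      .setInputCols("sentence", "token", "ner")
      .setOutputCol("ner_chunk") // arbitrary column name
    ```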

  5. trait ReadablePretrainedNerCrf extends ParamsAndFeaturesReadable[NerCrfModel] with HasPretrained[NerCrfModel]

Value Members

  1. object DictionaryFeatures extends Serializable

  2. object NerCrfApproach extends DefaultParamsReadable[NerCrfApproach] with Serializable

    This is the companion object of NerCrfApproach. Please refer to that class for the documentation.

  3. object NerCrfModel extends ReadablePretrainedNerCrf with Serializable

    This is the companion object of NerCrfModel. Please refer to that class for the documentation.
