Generates features for CrfBasedNer
Algorithm for training a Named Entity Recognition Model
Extracts Named Entities based on a CRF Model.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning
algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS.
These can be extracted with, for example, a pipeline of DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel.
This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")
The default model is "ner_crf", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerCrfModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then NER can be extracted
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  posTagger,
  nerTagger
))

val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("ner.result").show(false)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
NerConverter to further process the results
NerDLModel for a deep learning based approach
This is the companion object of NerCrfApproach. Please refer to that class for the documentation.
This is the companion object of NerCrfModel. Please refer to that class for the documentation.
Algorithm for training a Named Entity Recognition Model
For instantiated/pretrained models, see NerCrfModel.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with
Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, these can be extracted with, for example, a pipeline of DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel. Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
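As a sketch, providing an external entity dictionary could look as follows. The file path and the "," delimiter are illustrative assumptions; the dictionary file is expected to map entries to features using the given delimiter.

```scala
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfApproach

// Hypothetical dictionary path, assumed to contain delimiter-separated
// entries; substitute your own file and delimiter.
val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "word_embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setExternalFeatures("src/test/resources/ner-corpus/dict.txt", ",")
```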
For extended examples of usage, see the Spark NLP Workshop and the NerCrfApproachTestSpec.
Example
This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types. If a custom dataset is used, these need to be defined.
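A possible training setup for this example might look like the following sketch. The dataset path is an assumption and should be replaced with your own CoNLL 2003 file; the CoNLL helper reads it into the required annotated columns, so only the embeddings stage needs to be added before the approach.

```scala
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfApproach
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// The CoNLL reader yields "document", "sentence", "token", "pos" and
// "label" columns; the word embeddings are the only missing input.
val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(embeddings, nerTagger))

// Assumed path to a CoNLL 2003 training file; replace with your own.
val trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
```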
NerConverter to further process the results
NerDLApproach for a deep learning based approach