Generates features for CrfBasedNer
Algorithm for training a Named Entity Recognition Model
Extracts Named Entities based on a CRF Model.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning
algorithm. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS.
These can be extracted with, for example, a pipeline of DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel.
This is the instantiated model of the NerCrfApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")
The default model is "ner_crf", if no name is provided.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfModel
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerCrfModel
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

// Then NER can be extracted
val nerTagger = NerCrfModel.pretrained()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  posTagger,
  nerTagger
))

val data = Seq("U.N. official Ekeus heads for Baghdad.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("ner.result").show(false)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-ORG, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+
NerConverter to further process the results
NerDLModel for a deep learning based approach
This is the companion object of NerCrfApproach. Please refer to that class for the documentation.
This is the companion object of NerCrfModel. Please refer to that class for the documentation.
Algorithm for training a Named Entity Recognition Model
For instantiated/pretrained models, see NerCrfModel.
This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with
Annotation
type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS
and an additional label column of annotator type NAMED_ENTITY.
Excluding the label, these can be extracted with, for example, a pipeline of DocumentAssembler, SentenceDetector, Tokenizer, PerceptronModel and WordEmbeddingsModel. Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.
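As a sketch, providing an external entity dictionary could look as follows. The file path and the "," delimiter are illustrative assumptions; the dictionary file is expected to map entries to features using the given delimiter.

```scala
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfApproach

// Hypothetical dictionary path, assumed to contain delimiter-separated
// entries; substitute your own file and delimiter.
val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "word_embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setExternalFeatures("src/test/resources/ner-corpus/dict.txt", ",")
```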
For extended examples of usage, see the Spark NLP Workshop and the NerCrfApproachTestSpec.
Example
This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types. If a custom dataset is used, these need to be defined.
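A possible training setup for this example might look like the following sketch. The dataset path is an assumption and should be replaced with your own CoNLL 2003 file; the CoNLL helper reads it into the required annotated columns, so only the embeddings stage needs to be added before the approach.

```scala
import com.johnsnowlabs.nlp.annotators.ner.crf.NerCrfApproach
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// The CoNLL reader yields "document", "sentence", "token", "pos" and
// "label" columns; the word embeddings are the only missing input.
val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("word_embeddings")

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "word_embeddings", "pos")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(embeddings, nerTagger))

// Assumed path to a CoNLL 2003 training file; replace with your own.
val trainingData = CoNLL().readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
```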
NerConverter to further process the results
NerDLApproach for a deep learning based approach