Trains an annotator that detects sentence boundaries using a deep learning approach.
Instantiated Model of the SentenceDetectorDLApproach. Detects sentence boundaries using a deep learning approach.
Pretrained models can be loaded with the pretrained method of the companion object:
val sentenceDL = SentenceDetectorDLModel.pretrained()
  .setInputCols("document")
  .setOutputCol("sentencesDL")
The default model is "sentence_detector_dl" if no name is provided.
For available pretrained models please see the Models Hub.
Each extracted sentence can be returned in an Array or exploded to separate rows if explodeSentences is set to true.
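As a minimal sketch of the second option, the pretrained model can be configured to emit one row per sentence. This assumes an active Spark session with Spark NLP on the classpath and network access for the model download; the column names are the same illustrative ones used elsewhere on this page.

```scala
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel

// Sketch only: with explodeSentences enabled, each detected sentence is
// placed in its own row instead of all sentences sharing one array.
val sentenceDLExploded = SentenceDetectorDLModel
  .pretrained()
  .setInputCols("document")
  .setOutputCol("sentencesDL")
  .setExplodeSentences(true)
```

Leaving the setting at its default keeps one row per input document, which is usually easier to join back to the source data.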
For extended examples of usage, see the Spark NLP Workshop and the SentenceDetectorDLSpec.
In this example, the normal SentenceDetector is compared to the SentenceDetectorDLModel. In a pipeline, SentenceDetectorDLModel can be used as a replacement for the SentenceDetector.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Rule-based detector, used here as the baseline
val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

// Deep learning detector, reading the same "document" column
val sentenceDL = SentenceDetectorDLModel
  .pretrained("sentence_detector_dl", "en")
  .setInputCols("document")
  .setOutputCol("sentencesDL")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  sentenceDL
))

val data = Seq("""John loves Mary.Mary loves Peter
 Peter loves Helen .Helen loves John;
 Total: four people involved.""").toDF("text")
val result = pipeline.fit(data).transform(data)

// Sentences found by the rule-based SentenceDetector
result.selectExpr("explode(sentences.result) as sentences").show(false)
+------------------------------------------------------+
|sentences                                             |
+------------------------------------------------------+
|John loves Mary.Mary loves Peter\n Peter loves Helen .|
|Helen loves John;                                     |
|Total: four people involved.                          |
+------------------------------------------------------+

// Sentences found by the SentenceDetectorDLModel
result.selectExpr("explode(sentencesDL.result) as sentencesDL").show(false)
+----------------------------+
|sentencesDL                 |
+----------------------------+
|John loves Mary.            |
|Mary loves Peter            |
|Peter loves Helen .         |
|Helen loves John;           |
|Total: four people involved.|
+----------------------------+
SentenceDetector for non deep learning extraction
SentenceDetectorDLApproach for training a model yourself
Trains an annotator that detects sentence boundaries using a deep learning approach.
For pretrained models see SentenceDetectorDLModel.
Currently, only the CNN model is supported for training. In the future, the architecture of the model will be selectable with setModelArchitecture.
The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. The original implementation was also modified slightly to cover broken sentences and some impossible end-of-line characters.
Each extracted sentence can be returned in an Array or exploded to separate rows if explodeSentences is set to true.
For extended examples of usage, see the SentenceDetectorDLSpec.
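The options described above could be set on the approach roughly as follows. This is a sketch rather than a recommended configuration: the column names are illustrative, and "cnn" is passed explicitly only to show where the architecture is chosen, since it is the only value the text names.

```scala
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach

// Sketch only: configure the trainable annotator without starting training.
val sentenceDetectorDL = new SentenceDetectorDLApproach()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")
  .setModelArchitecture("cnn") // the only architecture named in the text
  .setExplodeSentences(true)   // one row per sentence in the trained model's output
```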
Example
The training process needs data, where each data point is a sentence.
In this example, each line of the train.txt file is one sentence. Training is then started by fitting a pipeline that contains the SentenceDetectorDLApproach on this data.
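A hedged sketch of that training step is given here. It assumes an existing Spark session named spark (as in spark-shell), Spark NLP on the classpath, and a train.txt file in the working directory with one sentence per line. The epoch count of 100 is an illustrative value, not a tuned one.

```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline

// Assumption: `spark` is an already running SparkSession.
// Each line of train.txt becomes one row in a "text" column.
val trainingData = spark.read.text("train.txt").toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetectorDLApproach()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")
  .setEpochsNumber(100) // illustrative value, adjust for the data at hand

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))

// Fitting the pipeline trains the sentence detector on the file contents.
val model = pipeline.fit(trainingData)
```

The fitted pipeline can then be applied with model.transform in the same way as the pretrained pipeline shown earlier.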
SentenceDetector for non deep learning extraction
SentenceDetectorDLModel for pretrained models