ClassifierDL for generic Multi-class Text Classification.
ClassifierDL takes sentence embeddings from the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep neural network (DNN) built inside TensorFlow and supports up to 100 classes.
This is the instantiated model of the ClassifierDLApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:

    val classifierDL = ClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classification")
The default model is "classifierdl_use_trec6", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and is trained on the TREC-6 dataset.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop and the ClassifierDLTestSpec.
    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")

    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    val sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm")
      .setInputCols("sentence_embeddings")
      .setOutputCol("sarcasm")

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        useEmbeddings,
        sarcasmDL
      ))

    val data = Seq(
      "I'm ready!",
      "If I could put into words how much I love waking up at 6 am on Mondays I would."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)

    result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out")
      .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm")
      .show(false)
    +-------------------------------------------------------------------------------+-------+
    |sentence                                                                       |sarcasm|
    +-------------------------------------------------------------------------------+-------+
    |I'm ready!                                                                     |normal |
    |If I could put into words how much I love waking up at 6 am on Mondays I would.|sarcasm|
    +-------------------------------------------------------------------------------+-------+
See also:
SentimentDLModel for sentiment analysis
MultiClassifierDLModel for multi-label classification
Trains a MultiClassifierDL for Multi-label Text Classification.
MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see MultiClassifierDLModel.
The inputs to MultiClassifierDL are sentence embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.
In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
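The mapping from inputs to binary vectors can be pictured with a small plain-Scala sketch. It assumes a model has already produced one score per label; the label names, the scores, and the 0.5 cut-off are hypothetical values chosen for illustration and are not taken from the library's implementation.

```scala
object MultiLabelSketch {
  // Hypothetical label set, used only for illustration.
  val labels: Vector[String] = Vector("toxic", "obscene", "insult")

  // Map per-label scores to a binary vector y: 1 if the score reaches the threshold, else 0.
  def toBinaryVector(scores: Vector[Float], threshold: Float = 0.5f): Vector[Int] =
    scores.map(s => if (s >= threshold) 1 else 0)

  // Recover the assigned label names from the binary vector.
  def assigned(y: Vector[Int]): Vector[String] =
    labels.zip(y).collect { case (label, 1) => label }

  def main(args: Array[String]): Unit = {
    val y = toBinaryVector(Vector(0.91f, 0.72f, 0.10f))
    println(y)           // Vector(1, 1, 0)
    println(assigned(y)) // Vector(toxic, obscene)
  }
}
```

Because every label is decided independently, the result may contain zero, one, or several labels, which is the difference from single-label multi-class classification.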
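Notes:
This annotator requires an array of labels in each row.
UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.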
In this example, the training data has the form (Note: labels can be arbitrary)

    mr,ref
    "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
    "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
    ...
It needs some pre-processing first, so that the labels are of type Array[String]. This can be done like so:
    import spark.implicits._
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.sql.functions.{col, udf}

    // Process training data to create text with associated array of labels
    def splitAndTrim = udf { labels: String =>
      labels.split(", ").map(x => x.trim)
    }

    val smallCorpus = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("mode", "DROPMALFORMED")
      .csv("src/test/resources/classifier/e2e.csv")
      .withColumn("labels", splitAndTrim(col("mr")))
      .withColumn("text", col("ref"))
      .drop("mr")

    smallCorpus.printSchema()
    // root
    // |-- ref: string (nullable = true)
    // |-- labels: array (nullable = true)
    // |    |-- element: string (containsNull = true)
    // |-- text: string (nullable = true)

    // Then create pipeline for training
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
      .setCleanupMode("shrink")

    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("embeddings")

    val docClassifier = new MultiClassifierDLApproach()
      .setInputCols("embeddings")
      .setOutputCol("category")
      .setLabelColumn("labels")
      .setBatchSize(128)
      .setMaxEpochs(10)
      .setLr(1e-3f)
      .setThreshold(0.5f)
      .setValidationSplit(0.1f)

    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          embeddings,
          docClassifier
        )
      )

    val pipelineModel = pipeline.fit(smallCorpus)
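To see what the label pre-processing produces for a single row, the same split-and-trim logic can be run as a plain Scala function without Spark. The object name LabelSplitSketch is a hypothetical wrapper for illustration; the input string is the mr value of the first sample row.

```scala
object LabelSplitSketch {
  // Same logic as the splitAndTrim UDF, written as a plain function.
  def splitAndTrim(labels: String): Array[String] =
    labels.split(", ").map(_.trim)

  def main(args: Array[String]): Unit = {
    val mr = "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]"
    // Prints one label per line:
    // name[Alimentum]
    // area[city centre]
    // familyFriendly[no]
    // near[Burger King]
    splitAndTrim(mr).foreach(println)
  }
}
```

Each key[value] pair becomes one label, so this row ends up with four labels in its labels array.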
See also:
SentimentDLApproach for sentiment analysis
ClassifierDLApproach for multi-class (single-label) classification
MultiClassifierDL for Multi-label Text Classification.
MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes. The inputs to MultiClassifierDL are sentence embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.
This is the instantiated model of the MultiClassifierDLApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:

    val multiClassifier = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("categories")
The default model is "multiclassifierdl_use_toxic", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and classifies toxic comments. The data is based on the Jigsaw Toxic Comment Classification Challenge.
For available pretrained models please see the Models Hub.
In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.
    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    val multiClassifierDl = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classifications")

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        useEmbeddings,
        multiClassifierDl
      ))

    val data = Seq(
      "This is pretty good stuff!",
      "Wtf kind of crap is this"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)

    result.select("text", "classifications.result").show(false)
    +--------------------------+----------------+
    |text                      |result          |
    +--------------------------+----------------+
    |This is pretty good stuff!|[]              |
    |Wtf kind of crap is this  |[toxic, obscene]|
    +--------------------------+----------------+
See also:
SentimentDLModel for sentiment analysis
ClassifierDLModel for multi-class (single-label) classification
Trains a SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or tweet should be interpreted as positive or negative.
For the instantiated/pretrained models, see SentimentDLModel.
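Notes:
This annotator accepts a label column of a single item in either type of String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, negative sentiment as "negative" or 1.
UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.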
In this example, sentiment.csv is in the form

    text,label
    This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
    This was a terrible movie! The acting was bad really bad!,1
The model can then be trained with
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
    import org.apache.spark.ml.Pipeline

    val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    val docClassifier = new SentimentDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")
      .setLabelColumn("label")
      .setBatchSize(32)
      .setMaxEpochs(1)
      .setLr(5e-3f)
      .setDropout(0.5f)

    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          useEmbeddings,
          docClassifier
        )
      )

    val pipelineModel = pipeline.fit(smallCorpus)
See also:
MultiClassifierDLApproach for general multi-label classification
ClassifierDLApproach for general multi-class (single-label) classification
SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or tweet should be interpreted as positive or negative.
This is the instantiated model of the SentimentDLApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained of the companion object:

    val sentiment = SentimentDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")
The default model is "sentimentdl_use_imdb", if no name is provided. It is an English sentiment analysis model trained on the IMDB dataset.
For available pretrained models please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.
    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.SentimentDLModel
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    val sentiment = SentimentDLModel.pretrained("sentimentdl_use_twitter")
      .setInputCols("sentence_embeddings")
      .setThreshold(0.7F)
      .setOutputCol("sentiment")

    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      useEmbeddings,
      sentiment
    ))

    val data = Seq(
      "Wow, the new video is awesome!",
      "bruh what a damn waste of time"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)

    result.select("text", "sentiment.result").show(false)
    +------------------------------+----------+
    |text                          |result    |
    +------------------------------+----------+
    |Wow, the new video is awesome!|[positive]|
    |bruh what a damn waste of time|[negative]|
    +------------------------------+----------+
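The example above sets a threshold of 0.7 on the model. One way to picture thresholding of this kind, as a plain-Scala sketch: pick the top-scoring label and fall back to a separate label when its score is below the threshold. The function decide, the fallback name "neutral", and the scores are assumptions made for illustration; this is not the library's implementation.

```scala
object ThresholdSketch {
  // Return the top-scoring label, or the fallback label when its score is below the threshold.
  // Assumption: `scores` is non-empty and holds one confidence value per label.
  def decide(scores: Map[String, Float], threshold: Float, fallback: String = "neutral"): String = {
    val (label, score) = scores.maxBy(_._2)
    if (score >= threshold) label else fallback
  }

  def main(args: Array[String]): Unit = {
    println(decide(Map("positive" -> 0.93f, "negative" -> 0.07f), 0.7f)) // positive
    println(decide(Map("positive" -> 0.55f, "negative" -> 0.45f), 0.7f)) // neutral
  }
}
```

A higher threshold trades coverage for precision: fewer texts receive a positive or negative label, but those that do are predicted with higher confidence.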
See also:
MultiClassifierDLModel for general multi-label classification
ClassifierDLModel for general multi-class (single-label) classification
This is the companion object of ClassifierDLApproach. Please refer to that class for the documentation.
This is the companion object of ClassifierDLModel. Please refer to that class for the documentation.
This is the companion object of MultiClassifierDLModel. Please refer to that class for the documentation.
This is the companion object of SentimentDLApproach. Please refer to that class for the documentation.
This is the companion object of SentimentDLModel. Please refer to that class for the documentation.
Trains a ClassifierDL for generic Multi-class Text Classification.
ClassifierDL takes sentence embeddings from the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep neural network (DNN) built inside TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see ClassifierDLModel.
Notes:
UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
For extended examples of usage, see the Spark NLP Workshop and the ClassifierDLTestSpec.
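Example
In this example, the training data "sentiment.csv" has the form of a header row text,label followed by one text and its label per row, like the sentiment.csv sample shown for SentimentDLApproach. Training then follows the same pipeline pattern as the other Approach examples: a DocumentAssembler, sentence embeddings, and the ClassifierDLApproach as the final stage.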
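A minimal sketch of such a training pipeline is given here, modelled on the SentimentDLApproach example on this page. It assumes an active SparkSession named spark with Spark NLP on the classpath and a CSV file with text and label columns; the file path is a placeholder and the hyperparameter values are illustrative, not recommendations.

```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline

// Assumption: `spark` is an active SparkSession.
// Assumption: placeholder path; the CSV has a header row with "text" and "label" columns.
val trainingData = spark.read
  .option("header", "true")
  .csv("path/to/sentiment.csv")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

// Illustrative hyperparameters only.
val docClassifier = new ClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setBatchSize(64)
  .setMaxEpochs(20)
  .setLr(5e-3f)
  .setDropout(0.5f)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, useEmbeddings, docClassifier))

// Fitting the pipeline trains the classifier and returns a PipelineModel.
val pipelineModel = pipeline.fit(trainingData)
```

The fitted pipelineModel can then be applied with transform to a DataFrame that has a text column, in the same way as the pretrained model examples.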
See also:
SentimentDLApproach for sentiment analysis
MultiClassifierDLApproach for multi-label classification