Package com.johnsnowlabs.nlp.annotators.ld.dl

Type Members

  1. class LanguageDetectorDL extends AnnotatorModel[LanguageDetectorDL] with HasSimpleAnnotate[LanguageDetectorDL] with WriteTensorflowModel

    Language Identification and Detection by using CNN and RNN architectures in TensorFlow.

    LanguageDetectorDL is an annotator that detects the language of documents or sentences, depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (how similar the characters are), LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val languageDetector = LanguageDetectorDL.pretrained()
      .setInputCols("sentence")
      .setOutputCol("language")

    If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (meaning multilingual). For available pretrained models, see the Models Hub.
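
    The defaults described above can also be spelled out explicitly. The following is a minimal sketch, assuming that pretrained accepts a model name and a language code as two string arguments, that an active SparkSession exists, and that the model can be downloaded; the "document" input column is an assumption borrowed from the example further down.

    ```scala
    import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL

    // Name the default model and language explicitly instead of relying on
    // the no-argument pretrained() call. Assumes network access for the download.
    val languageDetector = LanguageDetectorDL
      .pretrained("ld_wiki_tatoeba_cnn_21", "xx")
      .setInputCols("document")   // assumed column name, produced upstream by a DocumentAssembler
      .setOutputCol("language")
    ```

    Pinning the model name this way keeps a pipeline reproducible if the default ever changes; other model names should be taken from the Models Hub rather than guessed.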

    For extended examples of usage, see the Spark NLP Workshop and the LanguageDetectorDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val languageDetector = LanguageDetectorDL.pretrained()
      .setInputCols("document")
      .setOutputCol("language")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        languageDetector
      ))
    
    val data = Seq(
      "Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
      "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
      "Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("language.result").show(false)
    +------+
    |result|
    +------+
    |[en]  |
    |[fr]  |
    |[de]  |
    +------+
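
    Since the annotator can also take sentences as input, the example above can be adapted by placing a sentence splitter in front of it. This is a hedged sketch, not a tested recipe: it assumes the SentenceDetector annotator from com.johnsnowlabs.nlp.annotators.sbd.pragmatic and an active SparkSession named spark. Whether the result holds one label per sentence or one per row depends on the annotator's configuration, which this page does not describe, so no expected output is shown.

    ```scala
    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
    import org.apache.spark.ml.Pipeline

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    // Split each document into sentences before language detection.
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")

    // Feed the "sentence" column instead of "document" to the detector.
    val languageDetector = LanguageDetectorDL.pretrained()
      .setInputCols("sentence")
      .setOutputCol("language")

    val pipeline = new Pipeline()
      .setStages(Array(documentAssembler, sentenceDetector, languageDetector))

    // One row containing an English sentence followed by a French one.
    val data = Seq(
      "Spark NLP is an open-source text processing library for advanced natural language processing. " +
        "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel."
    ).toDF("text")

    val result = pipeline.fit(data).transform(data)
    result.select("language.result").show(false)
    ```

    Individual sentences are often shorter than the 140 characters recommended above, so sentence-level predictions may be less reliable than document-level ones.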
  2. trait ReadLanguageDetectorDLTensorflowModel extends ReadTensorflowModel
  3. trait ReadablePretrainedLanguageDetectorDLModel extends ParamsAndFeaturesReadable[LanguageDetectorDL] with HasPretrained[LanguageDetectorDL]

Value Members

  1. object LanguageDetectorDL extends ReadablePretrainedLanguageDetectorDLModel with ReadLanguageDetectorDLTensorflowModel with Serializable

    This is the companion object of LanguageDetectorDL. Please refer to that class for the documentation.