Package com.johnsnowlabs.nlp.embeddings

package embeddings

Type Members

  1. class AlbertEmbeddings extends AnnotatorModel[AlbertEmbeddings] with HasBatchedAnnotate[AlbertEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago

    These word embeddings represent the outputs generated by the ALBERT model. All official ALBERT releases by Google on TF Hub are supported by this ALBERT wrapper:

    Ported TF-Hub Models:

    "albert_base_uncased" | albert_base | 768-embed-dim, 12-layer, 12-heads, 12M parameters

    "albert_large_uncased" | albert_large | 1024-embed-dim, 24-layer, 16-heads, 18M parameters

    "albert_xlarge_uncased" | albert_xlarge | 2048-embed-dim, 24-layer, 32-heads, 60M parameters

    "albert_xxlarge_uncased" | albert_xxlarge | 4096-embed-dim, 12-layer, 64-heads, 235M parameters

    This model requires input tokenization with a SentencePiece model, which is provided by Spark NLP (see the tokenizers package).

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = AlbertEmbeddings.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")

    The default model is "albert_base_uncased", if no name is provided.
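
    Any of the other ported variants listed above can be loaded by name. A minimal sketch, assuming the usual pretrained(name, lang) overload with the "en" language tag:

    val largeEmbeddings = AlbertEmbeddings.pretrained("albert_large_uncased", "en")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")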

    For extended examples of usage, see the Spark NLP Workshop and the AlbertEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    Sources:

    ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

    https://github.com/google-research/ALBERT

    https://tfhub.dev/s?q=albert

    Paper abstract:

    Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

    Tips: ALBERT uses repeating layers, which results in a small memory footprint; however, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, as it has to iterate through the same number of (repeating) layers.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.AlbertEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val embeddings = AlbertEmbeddings.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
    |[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
    |[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
    |[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
    |[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  2. class BertEmbeddings extends AnnotatorModel[BertEmbeddings] with HasBatchedAnnotate[BertEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Token-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = BertEmbeddings.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("bert_embeddings")

    The default model is "small_bert_L2_768", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the BertEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    Sources:

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    https://github.com/google-research/bert

    Paper abstract

    We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en")
      .setInputCols("token", "document")
      .setOutputCol("bert_embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("bert_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
    |[-2.1357314586639404,0.32984697818756104,-0.6032363176345825,-1.6791689395904...|
    |[-1.8244884014129639,-0.27088963985443115,-1.059438943862915,-0.9817547798156...|
    |[-1.1648050546646118,-0.4725411534309387,-0.5938255786895752,-1.5780693292617...|
    |[-0.9125322699546814,0.4563939869403839,-0.3975459933280945,-1.81611204147338...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

    BertSentenceEmbeddings for sentence-level embeddings

  3. class BertSentenceEmbeddings extends AnnotatorModel[BertSentenceEmbeddings] with HasBatchedAnnotate[BertSentenceEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = BertSentenceEmbeddings.pretrained()
      .setInputCols("sentence")
      .setOutputCol("sentence_bert_embeddings")

    The default model is "sent_small_bert_L2_768", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the BertSentenceEmbeddingsTestSpec.

    Sources:

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    https://github.com/google-research/bert

    Paper abstract

    We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128")
      .setInputCols("sentence")
      .setOutputCol("sentence_bert_embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("sentence_bert_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentence,
      embeddings,
      embeddingsFinisher
    ))
    
    val data = Seq("John loves apples. Mary loves oranges. John loves Mary.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
    |[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
    |[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

    BertEmbeddings for token-level embeddings

  4. class ChunkEmbeddings extends AnnotatorModel[ChunkEmbeddings] with HasSimpleAnnotate[ChunkEmbeddings]

    This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.

    For extended examples of usage, see the Spark NLP Workshop and the ChunkEmbeddingsTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
    import org.apache.spark.ml.Pipeline
    
    // Extract the Embeddings from the NGrams
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val nGrams = new NGramGenerator()
      .setInputCols("token")
      .setOutputCol("chunk")
      .setN(2)
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    // Convert the NGram chunks into Word Embeddings
    val chunkEmbeddings = new ChunkEmbeddings()
      .setInputCols("chunk", "embeddings")
      .setOutputCol("chunk_embeddings")
      .setPoolingStrategy("AVERAGE")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        tokenizer,
        nGrams,
        embeddings,
        chunkEmbeddings
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(chunk_embeddings) as result")
      .select("result.annotatorType", "result.result", "result.embeddings")
      .show(5, 80)
    +---------------+----------+--------------------------------------------------------------------------------+
    |  annotatorType|    result|                                                                      embeddings|
    +---------------+----------+--------------------------------------------------------------------------------+
    |word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
    |word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
    |word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
    |word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
    +---------------+----------+--------------------------------------------------------------------------------+
  5. class DistilBertEmbeddings extends AnnotatorModel[DistilBertEmbeddings] with HasBatchedAnnotate[DistilBertEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = DistilBertEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

    The default model is "distilbert_base_cased", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the DistilBertEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

    Paper Abstract:

    As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

    Tips:

    • DistilBERT doesn't have token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
    • DistilBERT doesn't have options to select the input positions (position_ids input). This could be added if necessary; just let us know if you need this option.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.DistilBertEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = DistilBertEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(true)
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[0.1127224713563919,-0.1982710212469101,0.5360898375511169,-0.272536993026733...|
    |[0.35534414649009705,0.13215228915214539,0.40981462597846985,0.14036104083061...|
    |[0.328085333108902,-0.06269335001707077,-0.017595693469047546,-0.024373905733...|
    |[0.15617232024669647,0.2967822253704071,0.22324979305267334,-0.04568954557180...|
    |[0.45411425828933716,0.01173491682857275,0.190129816532135,0.1178255230188369...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  6. class ElmoEmbeddings extends AnnotatorModel[ElmoEmbeddings] with HasSimpleAnnotate[ElmoEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.

    Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = ElmoEmbeddings.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("elmo_embeddings")

    The default model is "elmo", if no name is provided.

    For available pretrained models please see the Models Hub.

    The pooling layer can be set with setPoolingLayer to the following values:

    • "word_emb": the character-based word representations with shape [batch_size, max_length, 512].
    • "lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
    • "lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
    • "elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].

    For extended examples of usage, see the Spark NLP Workshop and the ElmoEmbeddingsTestSpec.

    Sources:

    https://tfhub.dev/google/elmo/3

    Deep contextualized word representations

    Paper abstract:

    We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.ElmoEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val embeddings = ElmoEmbeddings.pretrained()
      .setPoolingLayer("word_emb")
      .setInputCols("token", "document")
      .setOutputCol("embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
    |[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
    |[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
    |[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
    |[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of other transformer based embeddings

  7. trait EmbeddingsCoverage extends AnyRef
  8. trait HasEmbeddingsProperties extends Params
  9. class LongformerEmbeddings extends AnnotatorModel[LongformerEmbeddings] with HasBatchedAnnotate[LongformerEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Longformer is a transformer model for long documents. The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan. longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
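
    Longer documents can take advantage of this by raising the maximum sentence length. A sketch, assuming the setMaxSentenceLength parameter shared by the other transformer-based embeddings (for this model it cannot exceed 4096):

    val longDocEmbeddings = LongformerEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
      .setMaxSentenceLength(4096)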

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = LongformerEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

    The default model is "longformer_base_4096", if no name is provided. For available pretrained models please see the Models Hub.

    For some examples of usage, see the LongformerEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    Paper Abstract:

    Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

    The original code can be found here https://github.com/allenai/longformer.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = LongformerEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(true)
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
    |[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
    |[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
    |[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
    |[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  10. trait ReadAlbertTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  11. trait ReadBertSentenceTensorflowModel extends ReadTensorflowModel
  12. trait ReadBertTensorflowModel extends ReadTensorflowModel
  13. trait ReadDistilBertTensorflowModel extends ReadTensorflowModel
  14. trait ReadElmoTensorflowModel extends ReadTensorflowModel
  15. trait ReadLongformerTensorflowModel extends ReadTensorflowModel
  16. trait ReadRobertaSentenceTensorflowModel extends ReadTensorflowModel
  17. trait ReadRobertaTensorflowModel extends ReadTensorflowModel
  18. trait ReadUSETensorflowModel extends ReadTensorflowModel
  19. trait ReadXlmRobertaSentenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  20. trait ReadXlmRobertaTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  21. trait ReadXlnetTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  22. trait ReadablePretrainedAlbertModel extends ParamsAndFeaturesReadable[AlbertEmbeddings] with HasPretrained[AlbertEmbeddings]
  23. trait ReadablePretrainedBertModel extends ParamsAndFeaturesReadable[BertEmbeddings] with HasPretrained[BertEmbeddings]
  24. trait ReadablePretrainedBertSentenceModel extends ParamsAndFeaturesReadable[BertSentenceEmbeddings] with HasPretrained[BertSentenceEmbeddings]
  25. trait ReadablePretrainedDistilBertModel extends ParamsAndFeaturesReadable[DistilBertEmbeddings] with HasPretrained[DistilBertEmbeddings]
  26. trait ReadablePretrainedElmoModel extends ParamsAndFeaturesReadable[ElmoEmbeddings] with HasPretrained[ElmoEmbeddings]
  27. trait ReadablePretrainedLongformerModel extends ParamsAndFeaturesReadable[LongformerEmbeddings] with HasPretrained[LongformerEmbeddings]
  28. trait ReadablePretrainedRobertaModel extends ParamsAndFeaturesReadable[RoBertaEmbeddings] with HasPretrained[RoBertaEmbeddings]
  29. trait ReadablePretrainedRobertaSentenceModel extends ParamsAndFeaturesReadable[RoBertaSentenceEmbeddings] with HasPretrained[RoBertaSentenceEmbeddings]
  30. trait ReadablePretrainedUSEModel extends ParamsAndFeaturesReadable[UniversalSentenceEncoder] with HasPretrained[UniversalSentenceEncoder]
  31. trait ReadablePretrainedWordEmbeddings extends StorageReadable[WordEmbeddingsModel] with HasPretrained[WordEmbeddingsModel]
  32. trait ReadablePretrainedXlmRobertaModel extends ParamsAndFeaturesReadable[XlmRoBertaEmbeddings] with HasPretrained[XlmRoBertaEmbeddings]
  33. trait ReadablePretrainedXlmRobertaSentenceModel extends ParamsAndFeaturesReadable[XlmRoBertaSentenceEmbeddings] with HasPretrained[XlmRoBertaSentenceEmbeddings]
  34. trait ReadablePretrainedXlnetModel extends ParamsAndFeaturesReadable[XlnetEmbeddings] with HasPretrained[XlnetEmbeddings]
  35. trait ReadsFromBytes extends AnyRef
  36. class RoBertaEmbeddings extends AnnotatorModel[RoBertaEmbeddings] with HasBatchedAnnotate[RoBertaEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

    It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = RoBertaEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

    The default model is "roberta_base", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the RoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    Paper Abstract:

    Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

    Tips:

    • RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
    • RoBERTa doesn't have token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).

    The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.RoBertaEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = RoBertaEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(true)
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
    |[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
    |[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
    |[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
    |[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  37. class RoBertaSentenceEmbeddings extends AnnotatorModel[RoBertaSentenceEmbeddings] with HasBatchedAnnotate[RoBertaSentenceEmbeddings] with WriteTensorflowModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Sentence-level embeddings using RoBERTa. The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.

    It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = RoBertaSentenceEmbeddings.pretrained()
      .setInputCols("sentence")
      .setOutputCol("sentence_embeddings")

    The default model is "sent_roberta_base", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the RoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀; see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 for how to import them.

    Paper Abstract:

    Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

    Tips:

    • RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
    • RoBERTa doesn't have token_type_ids, so you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).

    The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val sentenceEmbeddings = RoBertaSentenceEmbeddings.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
      .setCaseSensitive(true)
    
    // you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
    // or you can use EmbeddingsFinisher to prepare the results for Spark ML functions
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("sentence_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        sentenceEmbeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
    |[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
    |[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
    |[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
    |[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  38. class SentenceEmbeddings extends AnnotatorModel[SentenceEmbeddings] with HasSimpleAnnotate[SentenceEmbeddings] with HasEmbeddingsProperties with HasStorageRef

    Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

    This can be configured with setPoolingStrategy, which can be either "AVERAGE" or "SUM".
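
    For example, switching to summation only requires changing that setter; a sketch, with the rest of the pipeline as in the example below:

    val summedSentenceEmbeddings = new SentenceEmbeddings()
      .setInputCols("document", "embeddings")
      .setOutputCol("sentence_embeddings")
      .setPoolingStrategy("SUM")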

    For more extended examples, see the Spark NLP Workshop and the SentenceEmbeddingsTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
    
    val embeddingsSentence = new SentenceEmbeddings()
      .setInputCols(Array("document", "embeddings"))
      .setOutputCol("sentence_embeddings")
      .setPoolingStrategy("AVERAGE")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("sentence_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsSentence,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
    +--------------------------------------------------------------------------------+
  39. class UniversalSentenceEncoder extends AnnotatorModel[UniversalSentenceEncoder] with HasSimpleAnnotate[UniversalSentenceEncoder] with HasEmbeddingsProperties with HasStorageRef with WriteTensorflowModel

    The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("sentence")
      .setOutputCol("sentence_embeddings")

    The default model is "tfhub_use", if no name is provided. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the UniversalSentenceEncoderTestSpec.

    Sources:

    Universal Sentence Encoder

    https://tfhub.dev/google/universal-sentence-encoder/2

    Paper abstract:

    We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("sentence")
      .setOutputCol("sentence_embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("sentence_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[0.04616805538535118,0.022307956591248512,-0.044395286589860916,-0.0016493503...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  40. class WordEmbeddings extends AnnotatorApproach[WordEmbeddingsModel] with HasStorage with HasEmbeddingsProperties

    Word Embeddings lookup annotator that maps tokens to vectors.

    For instantiated/pretrained models, see WordEmbeddingsModel.

    A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces.

    ...
    are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
    were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
    stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
    induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
    ...

    If a token is not found in the dictionary, then the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.

    For extended examples of usage, see the Spark NLP Workshop and the WordEmbeddingsTestSpec.

    Example

    In this example, the file random_embeddings_dim4.txt has the form of the content above.

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
    import com.johnsnowlabs.nlp.util.io.ReadAs
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = new WordEmbeddings()
      .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
      .setStorageRef("glove_4d")
      .setDimension(4)
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(false)
    +----------------------------------------------------------------------------------+
    |result                                                                            |
    +----------------------------------------------------------------------------------+
    |[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
    |[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
    |[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
    |[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
    |[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
    |[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
    |[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
    +----------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

    SentenceEmbeddings to combine embeddings into a sentence-level representation

  41. class WordEmbeddingsModel extends AnnotatorModel[WordEmbeddingsModel] with HasSimpleAnnotate[WordEmbeddingsModel] with HasEmbeddingsProperties with HasStorageModel with ParamsAndFeaturesWritable

    Word Embeddings lookup annotator that maps tokens to vectors.

    This is the instantiated model of WordEmbeddings.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

    The default model is "glove_100d", if no name is provided. For available pretrained models please see the Models Hub.

    There are also two convenient functions to retrieve the embeddings coverage with respect to the transformed dataset:

    • withCoverageColumn(dataset, embeddingsCol, outputCol): Adds a custom column with word coverage stats for the embedded field: (coveredWords, totalWords, coveragePercentage). This creates a new column with statistics for each row.
    val wordsCoverage = WordEmbeddingsModel.withCoverageColumn(resultDF, "embeddings", "cov_embeddings")
    wordsCoverage.select("text","cov_embeddings").show(false)
    +-------------------+--------------+
    |text               |cov_embeddings|
    +-------------------+--------------+
    |This is a sentence.|[5, 5, 1.0]   |
    +-------------------+--------------+
    • overallCoverage(dataset, embeddingsCol): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.
    val wordsOverallCoverage = WordEmbeddingsModel.overallCoverage(wordsCoverage,"embeddings").percentage
    1.0

    For extended examples of usage, see the Spark NLP Workshop and the WordEmbeddingsTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.570580005645752,0.44183000922203064,0.7010200023651123,-0.417129993438720...|
    |[-0.542639970779419,0.4147599935531616,1.0321999788284302,-0.4024400115013122...|
    |[-0.2708599865436554,0.04400600120425224,-0.020260000601410866,-0.17395000159...|
    |[0.6191999912261963,0.14650000631809235,-0.08592499792575836,-0.2629800140857...|
    |[-0.3397899866104126,0.20940999686717987,0.46347999572753906,-0.6479200124740...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

    SentenceEmbeddings to combine embeddings into a sentence-level representation

  42. class WordEmbeddingsReader extends StorageReader[Array[Float]] with ReadsFromBytes

    Permalink
  43. class WordEmbeddingsWriter extends StorageBatchWriter[Array[Float]] with ReadsFromBytes

    Permalink
  44. class XlmRoBertaEmbeddings extends AnnotatorModel[XlmRoBertaEmbeddings] with HasBatchedAnnotate[XlmRoBertaEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Permalink

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = XlmRoBertaEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")

    The default model is "xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the XlmRoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Paper Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

    Tips:

• XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require a lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
• This implementation is the same as RoBERTa. Refer to RoBertaEmbeddings for usage examples as well as information on the inputs and outputs.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val embeddings = XlmRoBertaEmbeddings.pretrained()
      .setInputCols("document", "token")
      .setOutputCol("embeddings")
      .setCaseSensitive(true)
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        embeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
    |[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
    |[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
    |[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
    |[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

  45. class XlmRoBertaSentenceEmbeddings extends AnnotatorModel[XlmRoBertaSentenceEmbeddings] with HasBatchedAnnotate[XlmRoBertaSentenceEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Permalink

    Sentence-level embeddings using XLM-RoBERTa.

Sentence-level embeddings using XLM-RoBERTa. The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = XlmRoBertaSentenceEmbeddings.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")

    The default model is "sent_xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples of usage, see the XlmRoBertaSentenceEmbeddingsTestSpec.

    Paper Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

    Tips:

• XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require a lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
• This implementation is the same as RoBERTa. Refer to RoBertaEmbeddings for usage examples as well as information on the inputs and outputs.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
    
    val sentenceEmbeddings = XlmRoBertaSentenceEmbeddings.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
      .setCaseSensitive(true)
    
// you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
// (a short ClassifierDLApproach training sketch follows the output table below),
// or you can use EmbeddingsFinisher to prepare the results for Spark ML functions
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("sentence_embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        sentenceEmbeddings,
        embeddingsFinisher
      ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
    |[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
    |[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
    |[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
    |[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
    +--------------------------------------------------------------------------------+
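
As the comment in the example above notes, the sentence embeddings can also be used to train a classifier instead of being finished for Spark ML. A minimal training sketch, assuming a hypothetical trainData DataFrame with a "text" column and a "category" label column:

import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach

val classifier = new ClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("class")
  .setLabelColumn("category") // hypothetical label column
  .setMaxEpochs(5)

val trainingPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceEmbeddings,
  classifier
))

// val classificationModel = trainingPipeline.fit(trainData)
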
    See also

    Annotators Main Page for a list of transformer based embeddings

  46. class XlnetEmbeddings extends AnnotatorModel[XlnetEmbeddings] with HasBatchedAnnotate[XlnetEmbeddings] with WriteTensorflowModel with WriteSentencePieceModel with HasEmbeddingsProperties with HasStorageRef with HasCaseSensitiveProperties

    Permalink

    XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding

    XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding

    XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

    These word embeddings represent the outputs generated by the XLNet models.

    Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

    "xlnet_large_cased" = XLNet-Large | 24-layer, 1024-hidden, 16-heads

    "xlnet_base_cased" = XLNet-Base | 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).

    Pretrained models can be loaded with pretrained of the companion object:

    val embeddings = XlnetEmbeddings.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")

    The default model is "xlnet_base_cased", if no name is provided.

For extended examples of usage, see the Spark NLP Workshop and the XlnetEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them: https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Sources:

    XLNet: Generalized Autoregressive Pretraining for Language Understanding

    https://github.com/zihangdai/xlnet

    Paper abstract:

    With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import com.johnsnowlabs.nlp.embeddings.XlnetEmbeddings
    import com.johnsnowlabs.nlp.EmbeddingsFinisher
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val embeddings = XlnetEmbeddings.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("embeddings")
    
    val embeddingsFinisher = new EmbeddingsFinisher()
      .setInputCols("embeddings")
      .setOutputCols("finished_embeddings")
      .setOutputAsVector(true)
      .setCleanAnnotations(false)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ))
    
    val data = Seq("This is a sentence.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
    +--------------------------------------------------------------------------------+
    |                                                                          result|
    +--------------------------------------------------------------------------------+
    |[-0.6287205219268799,-0.4865287244319916,-0.186111718416214,0.234187275171279...|
    |[-1.1967450380325317,0.2746637463569641,0.9481253027915955,0.3431355059146881...|
    |[-1.0777631998062134,-2.092679977416992,-1.5331977605819702,-1.11190271377563...|
    |[-0.8349916934967041,-0.45627787709236145,-0.7890847325325012,-1.028069257736...|
    |[-0.134845569729805,-0.11672890186309814,0.4945235550403595,-0.66587203741073...|
    +--------------------------------------------------------------------------------+
    See also

    Annotators Main Page for a list of transformer based embeddings

Value Members

  1. object AlbertEmbeddings extends ReadablePretrainedAlbertModel with ReadAlbertTensorflowModel with Serializable

    Permalink

    This is the companion object of AlbertEmbeddings.

    This is the companion object of AlbertEmbeddings. Please refer to that class for the documentation.

  2. object BertEmbeddings extends ReadablePretrainedBertModel with ReadBertTensorflowModel with Serializable

    Permalink

    This is the companion object of BertEmbeddings.

    This is the companion object of BertEmbeddings. Please refer to that class for the documentation.

  3. object BertSentenceEmbeddings extends ReadablePretrainedBertSentenceModel with ReadBertSentenceTensorflowModel with Serializable

    Permalink

    This is the companion object of BertSentenceEmbeddings.

    This is the companion object of BertSentenceEmbeddings. Please refer to that class for the documentation.

  4. object ChunkEmbeddings extends DefaultParamsReadable[ChunkEmbeddings] with Serializable

    Permalink

    This is the companion object of ChunkEmbeddings.

    This is the companion object of ChunkEmbeddings. Please refer to that class for the documentation.

  5. object DistilBertEmbeddings extends ReadablePretrainedDistilBertModel with ReadDistilBertTensorflowModel with Serializable

    Permalink

    This is the companion object of DistilBertEmbeddings.

    This is the companion object of DistilBertEmbeddings. Please refer to that class for the documentation.

  6. object ElmoEmbeddings extends ReadablePretrainedElmoModel with ReadElmoTensorflowModel with Serializable

    Permalink

    This is the companion object of ElmoEmbeddings.

    This is the companion object of ElmoEmbeddings. Please refer to that class for the documentation.

  7. object LongformerEmbeddings extends ReadablePretrainedLongformerModel with ReadLongformerTensorflowModel with Serializable

    Permalink

    This is the companion object of LongformerEmbeddings.

    This is the companion object of LongformerEmbeddings. Please refer to that class for the documentation.

  8. object PoolingStrategy

    Permalink
  9. object RoBertaEmbeddings extends ReadablePretrainedRobertaModel with ReadRobertaTensorflowModel with Serializable

    Permalink

    This is the companion object of RoBertaEmbeddings.

    This is the companion object of RoBertaEmbeddings. Please refer to that class for the documentation.

  10. object RoBertaSentenceEmbeddings extends ReadablePretrainedRobertaSentenceModel with ReadRobertaSentenceTensorflowModel with Serializable

    Permalink

    This is the companion object of RoBertaSentenceEmbeddings.

    This is the companion object of RoBertaSentenceEmbeddings. Please refer to that class for the documentation.

  11. object SentenceEmbeddings extends DefaultParamsReadable[SentenceEmbeddings] with Serializable

    Permalink

    This is the companion object of SentenceEmbeddings.

    This is the companion object of SentenceEmbeddings. Please refer to that class for the documentation.

  12. object UniversalSentenceEncoder extends ReadablePretrainedUSEModel with ReadUSETensorflowModel with Serializable

    Permalink

    This is the companion object of UniversalSentenceEncoder.

    This is the companion object of UniversalSentenceEncoder. Please refer to that class for the documentation.

  13. object WordEmbeddings extends DefaultParamsReadable[WordEmbeddings] with Serializable

    Permalink

    This is the companion object of WordEmbeddings.

    This is the companion object of WordEmbeddings. Please refer to that class for the documentation.

  14. object WordEmbeddingsBinaryIndexer

    Permalink
  15. object WordEmbeddingsModel extends ReadablePretrainedWordEmbeddings with EmbeddingsCoverage with Serializable

    Permalink

    This is the companion object of WordEmbeddingsModel.

    This is the companion object of WordEmbeddingsModel. Please refer to that class for the documentation.

  16. object WordEmbeddingsTextIndexer

    Permalink
  17. object XlmRoBertaEmbeddings extends ReadablePretrainedXlmRobertaModel with ReadXlmRobertaTensorflowModel with Serializable

    Permalink
  18. object XlmRoBertaSentenceEmbeddings extends ReadablePretrainedXlmRobertaSentenceModel with ReadXlmRobertaSentenceTensorflowModel with Serializable

    Permalink
  19. object XlnetEmbeddings extends ReadablePretrainedXlnetModel with ReadXlnetTensorflowModel with Serializable

    Permalink

    This is the companion object of XlnetEmbeddings.

    This is the companion object of XlnetEmbeddings. Please refer to that class for the documentation.

Ungrouped