tensorflow

Type Members

class ClassifierDatasetEncoder extends Serializable
case class ClassifierDatasetEncoderParams(tags: Array[String]) extends Product with Serializable
case class DatasetEncoderParams(tags: List[String], chars: List[Char], emptyVector: List[Float], embeddingsDim: Int, defaultTag: String = "O") extends Product with Serializable

tags
list of unique tags
chars
list of unique characters
emptyVector
list of embeddings
embeddingsDim
dimension of embeddings
defaultTag
the default tag
trait Logging extends AnyRef
case class ModelSignature(operation: String, value: String, matchingPatterns: List[String]) extends Product with Serializable
class NerBatch extends AnyRef

Batch that contains data in Tensorflow input format.
class NerDatasetEncoder extends Serializable
trait ReadTensorflowModel extends AnyRef
case class SentenceGrouper[T](getLength: (T) ⇒ Int, sizes: Array[Int] = Array(5, 10, 20, 50))(implicit evidence$1: ClassTag[T]) extends Product with Serializable
class TensorResources extends AnyRef

This class is being used to initialize Tensors of different types and shapes for Tensorflow operations
class TensorflowAlbert extends Serializable

This class is used to calculate ALBERT embeddings for For Sequence Batches of WordpieceTokenizedSentence.
This class is used to calculate ALBERT embeddings for For Sequence Batches of WordpieceTokenizedSentence. Input for this model must be tokenzied with a SentencePieceModel,
This Tensorflow model is using the weights provided by https://tfhub.dev/google/albert_base/3 * sequence_output: representations of every token in the input sequence with shape [batch_size, max_sequence_length, hidden_size].
ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago This these embeddings represent the outputs generated by the Albert model. All offical Albert releases by google in TF-HUB are supported with this Albert Wrapper:
TF-HUB Models : albert_base = https://tfhub.dev/google/albert_base/3 | 768-embed-dim, 12-layer, 12-heads, 12M parameters albert_large = https://tfhub.dev/google/albert_large/3 | 1024-embed-dim, 24-layer, 16-heads, 18M parameters albert_xlarge = https://tfhub.dev/google/albert_xlarge/3 | 2048-embed-dim, 24-layer, 32-heads, 60M parameters albert_xxlarge = https://tfhub.dev/google/albert_xxlarge/3 | 4096-embed-dim, 12-layer, 64-heads, 235M parameters
This model requires input tokenization with SentencePiece model, which is provided by Spark NLP
For additional information see : https://arxiv.org/pdf/1909.11942.pdf https://github.com/google-research/ALBERT https://tfhub.dev/s?q=albert
Tips:
ALBERT uses repeating layers which results in a small memory footprint, however the computational cost remains similar to a BERT-like architecture with the same number of hidden layers as it has to iterate through the same number of (repeating) layers.
class TensorflowBert extends Serializable

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture
BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture
See https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/BertEmbeddingsTestSpec.scala for further reference on how to use this API. Sources:
class TensorflowClassifier extends Serializable with Logging
class TensorflowDistilBert extends Serializable

The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108.
The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108. DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% less parameters than bert-base-uncased, runs 60% faster while preserving over 95% of BERT's performances as measured on the GLUE language understanding benchmark.
The abstract from the paper is the following:
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
Tips:
- DistilBERT doesn't have :obj:token_type_ids, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:tokenizer.sep_token (or :obj:[SEP]).
- DistilBERT doesn't have options to select the input positions (:obj:position_ids input). This could be added if necessary though, just let us know if you need this option.
class TensorflowElmo extends Serializable

Embeddings from a language model trained on the 1 Billion Word Benchmark.
Embeddings from a language model trained on the 1 Billion Word Benchmark.
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
word_emb: the character-based word representations with shape [batch_size, max_length, 512]. == word_emb
lstm_outputs1: the first LSTM hidden state with shape [batch_size, max_length, 1024]. === lstm_outputs1
lstm_outputs2: the second LSTM hidden state with shape [batch_size, max_length, 1024]. === lstm_outputs2
elmo: the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024] == elmo
See https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/ElmoEmbeddingsTestSpec.scala for further reference on how to use this API.
class TensorflowLD extends Serializable

Language Identification and Detection by using CNNs and RNNs architectures in TensowrFlow
Language Identification and Detection by using CNNs and RNNs architectures in TensowrFlow
The models are trained on large datasets such as Wikipedia and Tatoeba The output is a language code in Wiki Code style: https://en.wikipedia.org/wiki/List_of_Wikipedias
class TensorflowMarian extends Serializable

MarianTransformer: Fast Neural Machine Translation
MarianTransformer: Fast Neural Machine Translation
MarianTransformer uses models trained by MarianNMT.
Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development.
It is currently the engine behind the Microsoft Translator Neural Machine Translation services and being deployed by many companies, organizations and research projects (see below for an incomplete list).
Sources : MarianNMT https://marian-nmt.github.io/ Marian: Fast Neural Machine Translation in C++ https://www.aclweb.org/anthology/P18-4020/
class TensorflowMultiClassifier extends Serializable with Logging
class TensorflowNer extends Serializable with Logging
class TensorflowRoBerta extends Serializable

The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach https://arxiv.org/abs/1907.11692> by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov.
The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach https://arxiv.org/abs/1907.11692> by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google's BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates. The abstract from the paper is the following:
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.*
Tips:
- RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
- RoBERTa doesn't have :obj:token_type_ids, you don't need to indicate which token belongs to which segment. Just separate your segments with the separation token :obj:tokenizer.sep_token (or :obj:</s>)
The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.
class TensorflowSentenceDetectorDL extends Serializable with Logging
class TensorflowSentiment extends Serializable with Logging
class TensorflowSpell extends Logging with Serializable
class TensorflowT5 extends Serializable

This class is used to run T5 model for For Sequence Batches of WordpieceTokenizedSentence.
This class is used to run T5 model for For Sequence Batches of WordpieceTokenizedSentence. Input for this model must be tokenized with a SentencePieceModel,
class TensorflowUSE extends Serializable

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
See https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/test/scala/com/johnsnowlabs/nlp/embeddings/UniversalSentenceEncoderTestSpec.scala for further reference on how to use this API.
class TensorflowWrapper extends Serializable
class TensorflowXlmRoberta extends Serializable

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale https://arxiv.org/abs/1911.02116 by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco GuzmÃ¡n, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.
The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale https://arxiv.org/abs/1911.02116 by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco GuzmÃ¡n, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook's RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.
The abstract from the paper is the following:
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-Ris very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.
Tips:
- XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require lang parameter to understand which language is used, and should be able to determine the correct language from the input ids. - This implementation is the same as RoBERTa. Refer to the com.johnsnowlabs.nlp.embeddings.RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.
class TensorflowXlnet extends Serializable

XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding
XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding
Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.
XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.
XLNet-Large = https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip | 24-layer, 1024-hidden, 16-heads XLNet-Base = https://storage.googleapis.com/xlnet/released_models/cased_L-12_H-768_A-12.zip | 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).
Sources :
https://arxiv.org/abs/1906.08237
https://github.com/zihangdai/xlnet
Paper abstract:
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.
case class Variables(variables: Array[Byte], index: Array[Byte]) extends Product with Serializable
trait WriteTensorflowModel extends AnyRef

Value Members

object NerBatch
object TensorResources
object TensorflowBert extends Serializable
object TensorflowWrapper extends Serializable

Companion object
package sentencepiece
package sign

package tensorflow

Type Members

class ClassifierDatasetEncoder extends Serializable

case class ClassifierDatasetEncoderParams(tags: Array[String]) extends Product with Serializable

case class DatasetEncoderParams(tags: List[String], chars: List[Char], emptyVector: List[Float], embeddingsDim: Int, defaultTag: String = "O") extends Product with Serializable

trait Logging extends AnyRef

case class ModelSignature(operation: String, value: String, matchingPatterns: List[String]) extends Product with Serializable

class NerBatch extends AnyRef

class NerDatasetEncoder extends Serializable

trait ReadTensorflowModel extends AnyRef

case class SentenceGrouper[T](getLength: (T) ⇒ Int, sizes: Array[Int] = Array(5, 10, 20, 50))(implicit evidence$1: ClassTag[T]) extends Product with Serializable

class TensorResources extends AnyRef

class TensorflowAlbert extends Serializable

class TensorflowBert extends Serializable

class TensorflowClassifier extends Serializable with Logging

class TensorflowDistilBert extends Serializable

class TensorflowElmo extends Serializable

class TensorflowLD extends Serializable

class TensorflowMarian extends Serializable

class TensorflowMultiClassifier extends Serializable with Logging

class TensorflowNer extends Serializable with Logging

class TensorflowRoBerta extends Serializable

class TensorflowSentenceDetectorDL extends Serializable with Logging

class TensorflowSentiment extends Serializable with Logging

class TensorflowSpell extends Logging with Serializable

class TensorflowT5 extends Serializable

class TensorflowUSE extends Serializable

class TensorflowWrapper extends Serializable

class TensorflowXlmRoberta extends Serializable

class TensorflowXlnet extends Serializable

case class Variables(variables: Array[Byte], index: Array[Byte]) extends Product with Serializable

trait WriteTensorflowModel extends AnyRef

Value Members

object NerBatch

object TensorResources

object TensorflowBert extends Serializable

object TensorflowWrapper extends Serializable

package sentencepiece

package sign

Ungrouped