Class UniversalDependenciesEnglishEWT

  • All Implemented Interfaces:
    ai.djl.training.dataset.Dataset

    public class UniversalDependenciesEnglishEWT
    extends TextDataset
    A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13.
    See Also:
    English Web Treebank LDC2012T13
    • Constructor Detail

      • UniversalDependenciesEnglishEWT

        protected UniversalDependenciesEnglishEWT(UniversalDependenciesEnglishEWT.Builder builder)
        Creates a new instance of UniversalDependenciesEnglishEWT.
        Parameters:
        builder - the builder object to build from
    • Method Detail

      • prepare

        public void prepare(ai.djl.util.Progress progress)
                     throws java.io.IOException,
                            ai.djl.modality.nlp.embedding.EmbeddingException
        Prepares the dataset for use, reporting progress to the given tracker. This method parses the TXT file, adding each text to sourceTextData and its Universal POS tags to universalPosTags. Only sourceTextData is then preprocessed.
        Parameters:
        progress - the progress tracker
        Throws:
        java.io.IOException - for various exceptions depending on the dataset
        ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
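
        The parsing step described for prepare can be illustrated with a minimal, self-contained sketch. It is not DJL's implementation: it assumes the TXT file follows the CoNLL-U column layout used by Universal Dependencies treebanks (tab-separated columns, word form in column 2, Universal POS tag in column 4, a blank line after each sentence). The class ConlluSketch and its fields texts and posTags are hypothetical names that only mirror the roles of sourceTextData and universalPosTags.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of DJL: splits CoNLL-U style text into sentences and tags. */
final class ConlluSketch {

    /** One entry per sentence; tokens are joined with single spaces. */
    final List<String> texts = new ArrayList<>();

    /** One list of Universal POS tag names per sentence, aligned with the tokens. */
    final List<List<String>> posTags = new ArrayList<>();

    /** Parses the given file content (assumed CoNLL-U layout) into texts and tags. */
    static ConlluSketch parse(String content) {
        ConlluSketch result = new ConlluSketch();
        List<String> tokens = new ArrayList<>();
        List<String> tags = new ArrayList<>();
        for (String rawLine : content.split("\n", -1)) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line closes the current sentence.
                result.flush(tokens, tags);
                continue;
            }
            if (line.startsWith("#")) {
                continue; // sentence-level comment such as "# text = ..."
            }
            String[] cols = line.split("\t");
            if (cols.length < 4) {
                continue; // not a token line in the assumed layout
            }
            if (cols[0].contains("-") || cols[0].contains(".")) {
                continue; // multiword token range or empty node, which carries no tag
            }
            tokens.add(cols[1]); // FORM column
            tags.add(cols[3]);   // UPOS column
        }
        // Handle a final sentence that is not followed by a blank line.
        result.flush(tokens, tags);
        return result;
    }

    /** Stores the pending sentence, if any, and resets the buffers. */
    private void flush(List<String> tokens, List<String> tags) {
        if (tokens.isEmpty()) {
            return;
        }
        texts.add(String.join(" ", tokens));
        posTags.add(new ArrayList<>(tags));
        tokens.clear();
        tags.clear();
    }
}
```

        Calling ConlluSketch.parse on the file content yields one text and one tag list per sentence. In the real dataset, only the texts would then go through preprocessing, while the tags would be kept as labels.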
      • get

        public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager,
                                                  long index)
        Gets the Record for the given index from the dataset.
        Specified by:
        get in class ai.djl.training.dataset.RandomAccessDataset
        Parameters:
        manager - the manager used to create the arrays
        index - the index of the requested data item
        Returns:
        a Record that contains the data and label of the requested data item. The data NDList contains one NDArray representing the text embedding. The label NDList contains one NDArray holding the index of the Universal POS tag of each token. For the index assigned to each Universal POS tag, see the enum class UniversalDependenciesEnglishEWT.UniversalPosTag.
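
        As a rough illustration of how tag names can become the per-token indices held in the label array, the sketch below maps each tag to the ordinal of an enum constant. The enum PosTag is a hypothetical stand-in that lists the 17 Universal POS tags alphabetically; the actual constant order in UniversalDependenciesEnglishEWT.UniversalPosTag may differ, so the concrete index values here are an assumption.

```java
import java.util.List;

/** Hypothetical helper, not part of DJL: converts Universal POS tag names to indices. */
final class PosTagIndexSketch {

    /** Stand-in for UniversalPosTag; the order of constants is assumed, not taken from DJL. */
    enum PosTag {
        ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM,
        PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
    }

    /** Returns one index per token; throws IllegalArgumentException for an unknown tag name. */
    static int[] toIndices(List<String> tagNames) {
        int[] indices = new int[tagNames.size()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = PosTag.valueOf(tagNames.get(i)).ordinal();
        }
        return indices;
    }

    /** Maps a single index back to its tag name, for reading model output or labels. */
    static String toName(int index) {
        return PosTag.values()[index].name();
    }
}
```

        With this assumed ordering, the tags NOUN, VERB, PUNCT map to the indices 7, 15, 12. When working with the real dataset, the ordinals of UniversalPosTag itself should be used instead.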
      • availableSize

        protected long availableSize()
        Returns the number of records available to be read in this Dataset.
        Specified by:
        availableSize in class ai.djl.training.dataset.RandomAccessDataset
        Returns:
        the number of records available to be read in this Dataset