Package ai.djl.basicdataset.nlp
Class UniversalDependenciesEnglishEWT
- java.lang.Object
  - ai.djl.training.dataset.RandomAccessDataset
    - ai.djl.basicdataset.nlp.TextDataset
      - ai.djl.basicdataset.nlp.UniversalDependenciesEnglishEWT
All Implemented Interfaces:
- ai.djl.training.dataset.Dataset
public class UniversalDependenciesEnglishEWT extends TextDataset
A Gold Standard Universal Dependencies Corpus for English, built over the source material of the English Web Treebank LDC2012T13.
See Also:
- English Web Treebank LDC2012T13
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | UniversalDependenciesEnglishEWT.Builder | A builder for a UniversalDependenciesEnglishEWT.
Nested classes/interfaces inherited from class ai.djl.basicdataset.nlp.TextDataset
TextDataset.Sample
Field Summary
Fields inherited from class ai.djl.basicdataset.nlp.TextDataset
manager, mrl, prepared, samples, sourceTextData, targetTextData, usage
Constructor Summary
Constructors
Modifier | Constructor | Description
protected | UniversalDependenciesEnglishEWT(UniversalDependenciesEnglishEWT.Builder builder) | Creates a new instance of UniversalDependenciesEnglishEWT.
Method Summary
Modifier and Type | Method | Description
protected long | availableSize() | Returns the number of records available to be read in this Dataset.
static UniversalDependenciesEnglishEWT.Builder | builder() | Creates a new builder to build a UniversalDependenciesEnglishEWT.
ai.djl.training.dataset.Record | get(ai.djl.ndarray.NDManager manager, long index) | Gets the Record for the given index from the dataset.
void | prepare(ai.djl.util.Progress progress) | Prepares the dataset for use with tracked progress.
Methods inherited from class ai.djl.basicdataset.nlp.TextDataset
getProcessedText, getRawText, getSamples, getTextEmbedding, getVocabulary, preprocess
Methods inherited from class ai.djl.training.dataset.RandomAccessDataset
getData, getData, getData, getData, newSubDataset, newSubDataset, randomSplit, size, subDataset, subDataset, subDataset, subDataset, toArray
Constructor Detail
UniversalDependenciesEnglishEWT
protected UniversalDependenciesEnglishEWT(UniversalDependenciesEnglishEWT.Builder builder)
Creates a new instance of UniversalDependenciesEnglishEWT.
Parameters:
- builder - the builder object to build from
Method Detail
builder
public static UniversalDependenciesEnglishEWT.Builder builder()
Creates a new builder to build a UniversalDependenciesEnglishEWT.
Returns:
- a new builder
prepare
public void prepare(ai.djl.util.Progress progress) throws java.io.IOException, ai.djl.modality.nlp.embedding.EmbeddingException
Prepares the dataset for use with tracked progress. This method parses the TXT file, adds the texts to sourceTextData, and adds the Universal POS tags to universalPosTags. Only sourceTextData is then preprocessed.
Parameters:
- progress - the progress tracker
Throws:
- java.io.IOException - for various exceptions depending on the dataset
- ai.djl.modality.nlp.embedding.EmbeddingException - if there are exceptions during the embedding process
get
public ai.djl.training.dataset.Record get(ai.djl.ndarray.NDManager manager, long index)
Gets the Record for the given index from the dataset.
Specified by:
- get in class ai.djl.training.dataset.RandomAccessDataset
Parameters:
- manager - the manager used to create the arrays
- index - the index of the requested data item
Returns:
- a Record that contains the data and label of the requested data item. The data NDList contains one NDArray representing the text embedding. The label NDList contains one NDArray holding the indices of the Universal POS tags of each token. For the index of each Universal POS tag, see the enum class UniversalDependenciesEnglishEWT.UniversalPosTag.
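The label layout described above (one index per token, taken from an enum of Universal POS tags) can be illustrated with a small standalone sketch. It deliberately uses no DJL classes. The Upos enum is a hypothetical stand-in for UniversalDependenciesEnglishEWT.UniversalPosTag, listing the 17 Universal POS tags alphabetically, so the real enum's order and indices may differ. The tab-separated CoNLL-U column layout (FORM in column 2, UPOS in column 4) is also an assumption, since this page only says that a TXT file is parsed. Creating the NDArray from the indices is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UposIndexSketch {

    // Hypothetical stand-in for UniversalDependenciesEnglishEWT.UniversalPosTag.
    // The 17 Universal POS tags in alphabetical order; the order (and therefore
    // the indices) of the real DJL enum may differ.
    public enum Upos {
        ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM,
        PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
    }

    // A word line has a plain integer ID in column 0. Comments ("# ..."),
    // blank lines, multiword ranges ("1-2") and empty nodes ("1.1") do not.
    private static boolean isWordLine(String[] cols) {
        return cols.length >= 4 && cols[0].matches("\\d+");
    }

    // Collects the FORM column (index 1) of every word line.
    public static List<String> tokens(List<String> lines) {
        List<String> forms = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (isWordLine(cols)) {
                forms.add(cols[1]);
            }
        }
        return forms;
    }

    // Maps the UPOS column (index 3) of every word line to its enum ordinal.
    // Upos.valueOf throws IllegalArgumentException for an unknown tag.
    public static long[] tagIndices(List<String> lines) {
        List<Long> indices = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (isWordLine(cols)) {
                indices.add((long) Upos.valueOf(cols[3]).ordinal());
            }
        }
        return indices.stream().mapToLong(Long::longValue).toArray();
    }

    public static void main(String[] args) {
        // One sentence in the assumed CoNLL-U layout (made-up example data).
        List<String> sentence = Arrays.asList(
                "# text = The dog barks.",
                "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
                "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
                "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_",
                "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_");
        System.out.println(tokens(sentence));                      // [The, dog, barks, .]
        System.out.println(Arrays.toString(tagIndices(sentence))); // [5, 7, 15, 12]
    }
}
```

In this sketch the token list plays the role of the text that would be embedded into the data NDArray, and the long[] plays the role of the label NDArray contents, one entry per token.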
availableSize
protected long availableSize()
Returns the number of records available to be read in this Dataset.
Specified by:
- availableSize in class ai.djl.training.dataset.RandomAccessDataset
Returns:
- the number of records available to be read in this Dataset