public class TimeseriesTable extends AbstractDataset implements BatchReadable<byte[],TimeseriesTable.Entry>, BatchWritable<byte[],TimeseriesTable.Entry>
This Dataset works by partitioning time into bins that represent time intervals. Each entry added to the Dataset is assigned to a bin based on its timestamp and row key. Hence, every row in the underlying table contains entries that share the same time interval and row key. Data for each entry is stored in a separate column.
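The actual storage format is internal to the Dataset, but the binning idea can be sketched in plain Java: the row an entry lands in is determined by rounding its timestamp down to the start of its time interval. The class and method names below are illustrative, not part of the API.

```java
public class TimeBinning {
    // Hypothetical interval: one hour in milliseconds (the default per this javadoc).
    static final long INTERVAL_MS = 60 * 60 * 1000L;

    // Round a timestamp down to the start of its time interval (bin).
    static long binStart(long timestampMs) {
        return timestampMs - (timestampMs % INTERVAL_MS);
    }

    public static void main(String[] args) {
        long t1 = 3_600_000L + 15_000L;  // 15 s into the second hour
        long t2 = 3_600_000L + 42_000L;  // 42 s into the second hour
        // Both entries share a bin, so they land in the same underlying row.
        System.out.println(binStart(t1) == binStart(t2)); // true
        System.out.println(binStart(t1));                 // 3600000
    }
}
```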
A user can set the time interval length for partitioning data into rows, as defined by the timeIntervalToStorePerRow property in the DatasetSpecification.
This interval should be chosen according to the use case at hand. In general, a larger time interval means faster reads of small-to-medium time ranges (range sizes up to several time intervals) and faster batched writes, but slower reads of very small time ranges (a small fraction of one interval). Conversely, a smaller time interval gives faster reads of very small time ranges but slower batched writes.
As expected, a larger time interval also means that more data is stored per row. A user should generally avoid storing more than 50 megabytes of data per row, since larger rows degrade performance.
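Whether a given interval stays within the 50-megabyte budget can be estimated from the write rate and average entry size. The sketch below is back-of-the-envelope arithmetic only; the method name and figures are illustrative.

```java
public class RowSizeEstimate {
    // Estimate bytes stored per row for a given write rate and interval,
    // to check against the suggested 50 MB-per-row budget.
    static long bytesPerRow(long entriesPerSecond, long avgEntryBytes, long intervalSeconds) {
        return entriesPerSecond * avgEntryBytes * intervalSeconds;
    }

    public static void main(String[] args) {
        // 100 entries/s of ~64 bytes each over a 1-hour interval:
        long perRow = bytesPerRow(100, 64, 3600);
        System.out.println(perRow);                     // 23040000 (~23 MB, within budget)
        System.out.println(perRow < 50L * 1024 * 1024); // true
    }
}
```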
The default time interval length is one hour; users are generally advised to choose a value between one minute and several hours. In cases where the amount of written entries is small, the rule of thumb is:
row partition interval size = 5 * (average size of the time range to be read)
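The rule of thumb above is plain arithmetic; as a quick sketch (the method name is illustrative, not part of the API):

```java
public class IntervalRuleOfThumb {
    // Rule of thumb from the docs: interval = 5 * average read range.
    static long suggestedIntervalMs(long avgReadRangeMs) {
        return 5 * avgReadRangeMs;
    }

    public static void main(String[] args) {
        // If reads typically cover 10 minutes, suggest a 50-minute row interval.
        long tenMinutes = 10 * 60 * 1000L;
        System.out.println(suggestedIntervalMs(tenMinutes)); // 3000000
    }
}
```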
TimeseriesTable supports tagging: each entry can optionally be labeled with a set of tags used to filter items during data retrieval. For an entry to be retrievable by a given tag, that tag must have been provided when the entry was written. If multiple tags are provided during reading, an entry must contain every one of them to qualify for return.
Due to the data format used for storing, filtering by tags during reading is done on the client side (not on the cluster). Filtering by entry key, by contrast, happens on the server side, which is much more efficient. Depending on the use case, you may want to push some of the tags you would use into the entry key for faster reading.
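The AND semantics of tag filtering can be sketched with plain Java sets. TimeseriesTable itself stores tags as byte[] and performs this check internally on the client side; the class below is only an illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TagFilterSketch {
    // Client-side AND-filtering: an entry qualifies only if it carries
    // every tag requested by the read.
    static boolean matches(Set<String> entryTags, Set<String> queryTags) {
        return entryTags.containsAll(queryTags);
    }

    public static void main(String[] args) {
        Set<String> entryTags = new HashSet<>(Arrays.asList("host1", "cpu"));
        Set<String> wantBoth = new HashSet<>(Arrays.asList("host1", "cpu"));
        Set<String> wantMore = new HashSet<>(Arrays.asList("host1", "cpu", "idle"));
        System.out.println(matches(entryTags, wantBoth)); // true
        System.out.println(matches(entryTags, wantMore)); // false
    }
}
```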
Notes on implementation: The implementation is constrained by the Table API used under the hood. In particular, the Table API lacks a readHigherOrEq() method, which would return the next row with a key greater than or equal to the given one.
Direct known subclass: CounterTimeseriesTable
Modifier and Type | Class and Description
---|---
static class | TimeseriesTable.Entry: Time series table entry.
static class | TimeseriesTable.InputSplit: A method for using a Dataset as input for a MapReduce job.
class | TimeseriesTable.TimeseriesTableRecordsReader: A record reader for time series.
Modifier and Type | Field and Description
---|---
static String | ATTR_TIME_INTERVAL_TO_STORE_PER_ROW
static long | DEFAULT_TIME_INTERVAL_PER_ROW: See the TimeseriesTable javadoc for description.
static int | MAX_ROWS_TO_SCAN_PER_READ: Limit on the number of rows to scan per read.
protected Table | table
static String | TYPE: Type name.
Constructor and Description
---
TimeseriesTable(DatasetSpecification spec, Table table): Creates an instance of the table.
Modifier and Type | Method and Description
---|---
SplitReader&lt;byte[],TimeseriesTable.Entry&gt; | createSplitReader(Split split): Creates a reader for the split of a dataset.
List&lt;Split&gt; | getInputSplits(int splitsCount, byte[] key, long startTime, long endTime, byte[]... tags): Defines input selection for batch jobs.
List&lt;Split&gt; | getSplits(): Returns all splits of the dataset.
Iterator&lt;TimeseriesTable.Entry&gt; | read(byte[] key, long startTime, long endTime, byte[]... tags): Reads entries for a given time range and returns an Iterator.
Iterator&lt;TimeseriesTable.Entry&gt; | read(byte[] key, long startTime, long endTime, int offset, int limit, byte[]... tags): Reads entries for a given time range and returns an Iterator.
void | write(byte[] key, TimeseriesTable.Entry value): Writes an entry to the Dataset.
void | write(TimeseriesTable.Entry entry): Writes an entry to the Dataset.
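The read overload that takes offset and limit pages through results by skipping the first offset matching entries and then returning at most limit of the rest. A minimal standalone sketch of that skip-and-take behavior (the helper name is illustrative, not part of the API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PaginationSketch {
    // Skip `offset` entries, then collect at most `limit` of the remainder:
    // the behavior documented for the paginated read(...) overload.
    static <T> List<T> page(Iterator<T> it, int offset, int limit) {
        List<T> out = new ArrayList<>();
        for (int i = 0; i < offset && it.hasNext(); i++) {
            it.next(); // discard the skipped entry
        }
        while (it.hasNext() && out.size() < limit) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> entries = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(page(entries.iterator(), 1, 2)); // [2, 3]
    }
}
```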
Methods inherited from class AbstractDataset: close, commitTx, getName, getTransactionAwareName, getTxChanges, postTxCommit, rollbackTx, setMetricsCollector, startTx, toString, updateTx
public static final String TYPE
Type name.

public static final String ATTR_TIME_INTERVAL_TO_STORE_PER_ROW

public static final long DEFAULT_TIME_INTERVAL_PER_ROW
See the TimeseriesTable javadoc for description.

public static final int MAX_ROWS_TO_SCAN_PER_READ
Limit on the number of rows to scan per read.

protected final Table table

public TimeseriesTable(DatasetSpecification spec, Table table)
Creates an instance of the table.
public final void write(TimeseriesTable.Entry entry)
Writes an entry to the Dataset.
Parameters:
entry - entry to write

public final Iterator&lt;TimeseriesTable.Entry&gt; read(byte[] key, long startTime, long endTime, int offset, int limit, byte[]... tags)
Reads entries for a given time range and returns an Iterator.
Provides the same functionality as read(byte[], long, long, byte[]...) but accepts additional parameters for pagination purposes.
NOTE: A limit is placed on the maximum number of time intervals to be scanned during a read, as defined by MAX_ROWS_TO_SCAN_PER_READ.
Parameters:
key - key of the entries to read
startTime - defines the start of the time range to read, inclusive
endTime - defines the end of the time range to read, inclusive
offset - the number of initial entries to ignore and not add to the results
limit - upper limit on the number of results returned; if the limit is exceeded, the first limit results are returned
tags - a set of tags which returned entries must contain; tags are defined at write time, and an entry is returned only if it contains all of these tags
Throws:
IllegalArgumentException - when a provided condition is incorrect

public Iterator&lt;TimeseriesTable.Entry&gt; read(byte[] key, long startTime, long endTime, byte[]... tags)
Reads entries for a given time range and returns an Iterator.
NOTE: A limit is placed on the maximum number of time intervals to be scanned during a read, as defined by MAX_ROWS_TO_SCAN_PER_READ.
Parameters:
key - key of the entries to read
startTime - defines the start of the time range to read, inclusive
endTime - defines the end of the time range to read, inclusive
tags - a set of tags which returned entries must contain; tags are defined at write time, and an entry is returned only if it contains all of these tags

public List&lt;Split&gt; getInputSplits(int splitsCount, byte[] key, long startTime, long endTime, byte[]... tags)
Defines input selection for batch jobs.
Parameters:
splitsCount - number of parts to split the data selection into
key - key of the entries to read
startTime - defines the start of the time range to read, inclusive
endTime - defines the end of the time range to read, inclusive
tags - a set of tags which returned entries must contain; tags are defined at write time, and an entry is returned only if it contains all of these tags

public List&lt;Split&gt; getSplits()
Description copied from interface: BatchReadable
For feeding the whole dataset into a batch job.
Specified by:
getSplits in interface BatchReadable&lt;byte[],TimeseriesTable.Entry&gt;
Returns:
a list of Splits.

public SplitReader&lt;byte[],TimeseriesTable.Entry&gt; createSplitReader(Split split)
Description copied from interface: BatchReadable
Creates a reader for the split of a dataset.
Specified by:
createSplitReader in interface BatchReadable&lt;byte[],TimeseriesTable.Entry&gt;
Parameters:
split - the split to create a reader for
Returns:
a SplitReader.

public void write(byte[] key, TimeseriesTable.Entry value)
Implements write(key, value) in BatchWritable.
The key passed to this method is ignored; the key provided in the Entry object is used instead.
Specified by:
write in interface BatchWritable&lt;byte[],TimeseriesTable.Entry&gt;
Parameters:
key - row key to write to; this value is ignored
value - entry to write; the key used to write to the table is extracted from this object

Copyright © 2021 Cask Data, Inc. Licensed under the Apache License, Version 2.0.