A DataStream is similar to a table of data. It has fields (like columns) and rows of data. Each row
has an entry for each field (the entry may be null, depending on the field definition).
A DataStream is lazily evaluated. Each operation on a stream creates a new derived stream,
but the operations themselves are only executed when a final action is performed.
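
To make the lazy behaviour concrete, here is a minimal sketch in Python. It is not the DataStream API: the `LazyStream` class, its `filter`, `map` and `collect` methods, and the `load_rows` source are all hypothetical names chosen for illustration. The point it shows is that each operation only builds a new derived stream, and the source is not touched until the final action runs.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


class LazyStream:
    """Hypothetical stand-in for a lazily evaluated stream of rows."""

    def __init__(self, source: Callable[[], Iterable[Row]]):
        # The source is a function, so nothing is read until it is called.
        self._source = source

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyStream":
        # Returns a new derived stream; the predicate is not applied yet.
        return LazyStream(lambda: (row for row in self._source() if predicate(row)))

    def map(self, fn: Callable[[Row], Row]) -> "LazyStream":
        # Returns a new derived stream; fn is not applied yet.
        return LazyStream(lambda: (fn(row) for row in self._source()))

    def collect(self) -> List[Row]:
        # The final action: only now is the whole chain evaluated.
        return list(self._source())


calls: List[str] = []


def load_rows() -> Iterable[Row]:
    # Records when the data is actually read.
    calls.append("load")
    yield {"name": "alice", "age": 31}
    yield {"name": "bob", "age": None}  # a null entry for the "age" field
    yield {"name": "carol", "age": 45}


stream = LazyStream(load_rows)
over_forty = stream.filter(lambda r: r["age"] is not None and r["age"] > 40)
names = over_forty.map(lambda r: {"name": r["name"].upper()})

calls_before_action = list(calls)  # still empty: nothing has been loaded
result = names.collect()           # the action triggers loading and both operations
```

Note that `stream`, `over_forty` and `names` are three separate objects; deriving a new stream never modifies the one it came from.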
You can create a DataStream from an IO source, such as a Parquet file or a Hive table, or you can
create a fully evaluated one from an in-memory structure. In the former case, the data
is only loaded on demand, when an action is performed.
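
The difference between the two kinds of source can be sketched as follows. This is an illustration under stated assumptions, not the library's API: the `Stream` class and its `from_rows` and `from_csv` constructors are hypothetical, and a CSV file stands in for a Parquet file or Hive table so that the example needs only the Python standard library.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


class Stream:
    """Hypothetical stream that is either source-backed or in-memory."""

    def __init__(self, open_rows: Callable[[], Iterable[Row]]):
        self._open_rows = open_rows

    @classmethod
    def from_rows(cls, rows: List[Row]) -> "Stream":
        # In-memory source: the rows already exist, so the stream is fully evaluated.
        snapshot = list(rows)
        return cls(lambda: iter(snapshot))

    @classmethod
    def from_csv(cls, path: str) -> "Stream":
        # IO source: only the path is remembered; the file is opened on demand.
        def open_rows() -> Iterable[Row]:
            with open(path, newline="") as handle:
                yield from csv.DictReader(handle)

        return cls(open_rows)

    def collect(self) -> List[Row]:
        # The action that actually pulls rows from the source.
        return list(self._open_rows())


path = os.path.join(tempfile.mkdtemp(), "people.csv")

# Creating the stream does not read the file; it does not even exist yet.
lazy = Stream.from_csv(path)

with open(path, "w", newline="") as handle:
    handle.write("name,city\nalice,London\n")

eager = Stream.from_rows([{"name": "bob", "city": "Paris"}])

lazy_rows = lazy.collect()    # the file is opened and read here
eager_rows = eager.collect()  # the rows were already in memory
```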
A DataStream is split into one or more flows. Each flow can operate independently
of the others. For example, if you filter a stream, each flow is filtered separately,
which allows the filter to be parallelized. If you write out a stream, each flow (partition) can
be written to its own file, again allowing parallelization.
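
The following sketch shows why independent flows parallelize naturally. It is an assumption-laden illustration rather than the real implementation: flows are modelled as plain lists of rows, `filter_flow` and `write_flow` are hypothetical helpers, and the `part-N.txt` file naming is invented for the example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Row = Dict[str, int]

# A stream modelled as three independent flows (an assumption for illustration).
flows: List[List[Row]] = [
    [{"id": 1, "score": 10}, {"id": 2, "score": 75}],
    [{"id": 3, "score": 90}, {"id": 4, "score": 20}],
    [{"id": 5, "score": 60}],
]


def filter_flow(flow: List[Row]) -> List[Row]:
    # Looks only at its own flow, so it needs no coordination with the others.
    return [row for row in flow if row["score"] >= 50]


def write_flow(index: int, flow: List[Row], out_dir: str) -> str:
    # Each flow is written to its own file (hypothetical naming scheme).
    path = os.path.join(out_dir, f"part-{index}.txt")
    with open(path, "w") as handle:
        for row in flow:
            handle.write(f"{row['id']},{row['score']}\n")
    return path


out_dir = tempfile.mkdtemp()

with ThreadPoolExecutor() as pool:
    # Filter every flow concurrently; pool.map preserves the order of the flows.
    filtered = list(pool.map(filter_flow, flows))
    # Write every filtered flow concurrently, one file per flow.
    paths = list(
        pool.map(write_flow, range(len(filtered)), filtered, [out_dir] * len(filtered))
    )
```

Because no flow reads or writes another flow's data, the work can be handed to any number of workers without locking.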