An implementation of DataStream for which items are emitted by calling publish. When no more items are to be published, call close() so that downstream subscribers can complete.
Subscribing to this publisher blocks the calling thread as usual, so subscribers should normally run on a separate thread.
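The publish/close pattern described above can be sketched as follows. This is a minimal Python model, not the library's implementation: the class `QueuePublisher`, its `subscribe` method, and the `_DONE` sentinel are hypothetical names chosen for illustration, and the model supports a single subscriber only. It shows why a subscriber blocks until close() is called, and why it is usually run on its own thread.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the stream


class QueuePublisher:
    """Toy publisher: items are emitted by calling publish()."""

    def __init__(self):
        self._items = queue.Queue()

    def publish(self, row):
        self._items.put(row)

    def close(self):
        # Signals that no more items will be published.
        self._items.put(_DONE)

    def subscribe(self, on_next):
        # Blocks the calling thread until close() has been called.
        while True:
            item = self._items.get()
            if item is _DONE:
                return
            on_next(item)


publisher = QueuePublisher()
received = []

# Run the blocking subscriber on a separate thread so the
# publishing thread stays free to call publish() and close().
worker = threading.Thread(target=publisher.subscribe, args=(received.append,))
worker.start()

publisher.publish(["alice", 30])
publisher.publish(["bob", None])
publisher.close()  # lets the subscriber complete
worker.join()
```

If close() were never called, `worker.join()` would wait forever, which is the reason the subscriber should not share a thread with the code that publishes.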
A DataStream is similar to a table of data. It has fields (like columns) and rows of data. Each row has a value for every field, and that value may be null if the field definition allows it.
A DataStream is lazily evaluated. Each operation on a stream creates a new derived stream, but the operations themselves are only executed when a final action is performed.
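Lazy evaluation of this kind can be illustrated with a small model. The sketch below is an assumption-laden toy in Python, not the DataStream API: `LazyStream`, `filter`, `map`, `collect`, and `load_rows` are hypothetical names. Each operation returns a new derived stream, and the source is only read when the action (`collect`) runs.

```python
class LazyStream:
    """Toy lazily evaluated stream (hypothetical, for illustration only)."""

    def __init__(self, source):
        # source is a zero-argument function returning an iterable of rows
        self._source = source

    def filter(self, predicate):
        # Returns a new derived stream; nothing is evaluated yet.
        return LazyStream(lambda: (row for row in self._source() if predicate(row)))

    def map(self, fn):
        # Also returns a new derived stream without evaluating anything.
        return LazyStream(lambda: (fn(row) for row in self._source()))

    def collect(self):
        # The action: only now is the source read and the pipeline run.
        return list(self._source())


reads = []


def load_rows():
    reads.append("loaded")  # records when the source is actually read
    return [("alice", 30), ("bob", 17), ("carol", 45)]


stream = LazyStream(load_rows)
adults = stream.filter(lambda row: row[1] >= 18).map(lambda row: row[0])

reads_before_action = len(reads)  # the pipeline is defined but not yet run
names = adults.collect()          # triggers the read, the filter and the map
```

Building `adults` leaves `reads` empty. Only the call to `collect()` invokes `load_rows`, once, and applies the filter and the map.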
You can create a DataStream from an IO source, such as a Parquet file or a Hive table, or you can create a fully evaluated one from an in-memory structure. In the former case, the data is loaded only on demand, when an action is performed.
A DataStream is split into one or more flows. Each flow can operate independently of the others. For example, if you filter a stream, each flow is filtered separately, which allows the work to be parallelized. If you write out a stream, each flow can be written to its own file, again allowing parallelization.
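The per-flow independence described above can be sketched as follows. This is a hedged illustration in Python, not the library's implementation: the stream is modeled as a plain list of flows, and `filter_flow` is a hypothetical helper. Because no flow depends on another, each one can be handed to its own worker.

```python
from concurrent.futures import ThreadPoolExecutor

# A stream modeled as a list of flows; each flow is a list of rows.
flows = [
    [("alice", 30), ("bob", 17)],
    [("carol", 45)],
    [("dave", 12), ("erin", 22)],
]


def filter_flow(flow):
    # Each flow is filtered without reference to any other flow.
    return [row for row in flow if row[1] >= 18]


with ThreadPoolExecutor(max_workers=len(flows)) as pool:
    # map() returns results in the same order as the input flows.
    filtered_flows = list(pool.map(filter_flow, flows))
```

The same shape applies to writing: a function that writes one flow to one file could be mapped over the flows in the same way, with one output file per flow.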