This is an optional component of QScript that can be used instead of ThetaJoin. It’s easier to implement, but more restricted: where ThetaJoin has an arbitrary predicate to determine whether a pair of records should be combined, EquiJoin has an expression on each side, and the two are compared with simple equality.
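The difference between the two join forms can be sketched over plain Scala lists. This is an illustration only, not the QScript API: the names JoinSketch, thetaJoin, equiJoin, lKey, rKey, and combine are hypothetical, and in-memory lists stand in for datasets.

```scala
object JoinSketch {
  // ThetaJoin-style: an arbitrary predicate decides whether a pair is combined.
  def thetaJoin[A, B, C](l: List[A], r: List[B])(on: (A, B) => Boolean)(
      combine: (A, B) => C): List[C] =
    for { a <- l; b <- r; if on(a, b) } yield combine(a, b)

  // EquiJoin-style: one key expression per side, compared with simple equality.
  // Because only equality is needed, the right side can be indexed by key
  // instead of testing every pair.
  def equiJoin[A, B, K, C](l: List[A], r: List[B])(lKey: A => K, rKey: B => K)(
      combine: (A, B) => C): List[C] = {
    val index: Map[K, List[B]] = r.groupBy(rKey)
    for { a <- l; b <- index.getOrElse(lKey(a), Nil) } yield combine(a, b)
  }
}
```

Any equiJoin call can be expressed as a thetaJoin whose predicate is `lKey(a) == rKey(b)`, but not the other way around, which is the restriction described above.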
A variant with a simpler join type. A backend can choose to operate on this structure by applying the equiJoinsOnly transformation. Backends without true join support will likely find it easier to work with this than to handle full ThetaJoins.
Eliminates some values from a dataset, based on the result of FilterFunc.
Flattens nested structure, converting each value into a dataset; these datasets are then unioned. struct is an expression that evaluates to an array or object, which is then “exploded” into multiple values. repair is applied across the new set, integrating the exploded values into the original set.
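The roles of struct and repair can be sketched with ordinary collections. This is a minimal model, not the QScript node itself: LeftShiftSketch and leftShift are hypothetical names, a List stands in for the array or object being exploded, and the sketch assumes that a source value whose struct is empty contributes no output.

```scala
object LeftShiftSketch {
  // struct: yields the collection to explode for each source value.
  // repair: combines the source value with one exploded element.
  def leftShift[A, B, C](src: List[A])(struct: A => List[B])(
      repair: (A, B) => C): List[C] =
    src.flatMap(a => struct(a).map(b => repair(a, b)))
}
```

For example, shifting `List(("a", List(1, 2)), ("c", List(3)))` on the second field with a repair that pairs the name with each element gives one output value per array element.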
A data-level transformation.
Projections are technically dimensional (i.e., QScript) operations. However, to a filesystem, they are merely Map operations. So, we use these components while building the QScript plan and they are then used in static path processing, but they are replaced with equivalent MapFuncs before being processed by the filesystem.
This is the primary form seen by a backend. It contains reads of files.
These nodes exist in all QScript structures that a backend sees.
These are the operations included in all forms of QScript.
This is the target of the core compiler. Normalization is applied to this structure, and it contains no Read or EquiJoin.
A backend-resolved Root, which is now a path.
Performs a reduction over a dataset, with the dataset partitioned by the result of the MapFunc. So, rather than many-to-one, this is many-to-fewer. bucket partitions the values into buckets based on the result of the expression, reducers applies the provided reduction to each expression, and repair finally turns those reduced expressions into a final value.
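The three pieces (bucket, reducers, repair) can be modelled over a plain list as follows. This is a sketch under simplifying assumptions: ReduceSketch and reduce are hypothetical names, each reducer is modelled as a function over the whole bucket, and the result is returned as a Map keyed by bucket so the grouping stays visible.

```scala
object ReduceSketch {
  // bucket: computes the partition key for each value.
  // reducers: one reduction per slot, each applied to every bucket.
  // repair: builds the final value for a bucket from its reduced slots.
  def reduce[A, K, R, C](src: List[A])(bucket: A => K)(
      reducers: List[List[A] => R])(repair: List[R] => C): Map[K, C] =
    src.groupBy(bucket).map { case (k, group) =>
      k -> repair(reducers.map(reducer => reducer(group)))
    }
}
```

With a sum and a count as the two reducers, many input rows collapse to one value per bucket, which is the "many-to-fewer" behaviour described above.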
Sorts values within a bucket. This could be represented with LeftShift(Map(_.sort, Reduce(_ :: _, ???))), but backends tend to provide sort directly, so this avoids backends having to recognize the pattern. We could provide an algebra (Sort :+: QScript)#λ => QScript so that a backend without a native sort could eliminate this node.
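The reduce, map, and shift pattern mentioned above can be illustrated on plain lists. This is only an analogy to the QScript expression, with hypothetical names (SortSketch, sortWithinBuckets, order): gathering each bucket into a list plays the part of Reduce, sorting that list plays the part of Map, and flattening plays the part of LeftShift. The order in which buckets appear in the result is not specified by this sketch.

```scala
object SortSketch {
  def sortWithinBuckets[A, K, O: Ordering](src: List[A])(bucket: A => K)(
      order: A => O): List[A] = {
    // Reduce step: gather the values of each bucket into one list.
    val gathered: List[List[A]] = src.groupBy(bucket).values.toList
    // Map step: sort each gathered list.
    val sorted: List[List[A]] = gathered.map(_.sortBy(order))
    // LeftShift step: explode the sorted lists back into individual values.
    sorted.flatten
  }
}
```

A backend with a native sort would skip all three steps, which is why a dedicated node is convenient.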
Applies a function across two datasets, in the cases where the JoinFunc evaluates to true. The branches represent the divergent operations applied to some common src. Each branch references the src exactly once. (Since no constructor has more than one recursive component, it’s guaranteed that neither side references the src _more_ than once.)
This case represents a full θJoin, but we could have an algebra that rewrites it as Filter(_, EquiJoin(...)) to simplify behavior for the backend.
Creates a new dataset, |a|+|b|, containing all of the entries from each of the input sets, without any indication of which set they came from. This could be handled as another join type, the anti-join (T[EJson] \/ T[EJson] => T[EJson], specifically as _.merge), with the condition being κ(true).
The top level of a filesystem. During compilation this represents /, but in the structure a backend sees, it represents the mount point.
Here we no longer care about provenance. Backends can’t do anything with it, so we simply represent joins and crosses directly. This also means that we don’t need to model certain things – project_d is just a data-level function, nest_d & swap_d only modify provenance and so are irrelevant here, and autojoin_d has been replaced with a lower-level join operation that doesn’t include the cross portion.