Parquet accepts globs separated by the ; character
The format for specifying columns is described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
Note that the format described there says that multiple globs can be combined with a ; character. Instead, we use a Set() here and will eventually join the set on the ; character for you.
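As a rough illustration of the joining step mentioned above, the following sketch turns a set of globs into the single `;`-separated string that the linked format expects. The `joinGlobs` helper and the glob strings are hypothetical and are not part of Scalding; the sort is only there to make the output deterministic, and Scalding's own implementation may differ.

```scala
object JoinGlobs {
  // Hypothetical helper, not Scalding's implementation: joins a set of
  // column globs on ';'. Returns None when there is nothing to project.
  def joinGlobs(globs: Set[String]): Option[String] =
    if (globs.isEmpty) None
    else Some(globs.toSeq.sorted.mkString(";")) // sorted only for a stable result
}
```

With this helper, `JoinGlobs.joinGlobs(Set("x.y", "a.b.c"))` yields `Some("a.b.c;x.y")`, and an empty set yields `None`.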
(Since version 0.9.0) replace with Mappable.toIterator
Deprecated. Use withColumnProjections, which uses a different glob syntax.
The format for specifying columns is described here: https://github.com/apache/parquet-mr/blob/3df3372a1ee7b6ea74af89f53a614895b8078609/parquet_cascading.md#2-projection-pushdown (Note that this link is different from the one below in withColumnProjections)
Note that the format described there says that multiple globs can be combined with a ; character. Instead, we use a Set() here and will eventually join the set on the ; character for you.
(Since version 0.15.1) Use withColumnProjections, which uses a different glob syntax
When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).
The filter predicate is pushed down to the input format, which can make it significantly more efficient than a filter applied to a TypedPipe (Parquet push-down filters can skip reading entire chunks of data from disk).
For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (Parquet column projection push-down skips reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.
Here are two ways you can use these in a parquet source:
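The first way is to override the settings when instantiating a source. The sketch below is a minimal, self-contained illustration of that pattern. `FilterPredicate`, `HasFilterPredicate`, `HasColumnProjection`, and `MyParquetSource` are simplified local stand-ins defined here so the example compiles on its own; they are assumptions modelled on the traits named above, not the real Scalding or Parquet types. The path, predicate, and glob strings are placeholders; see the linked document for the exact glob syntax.

```scala
// Hypothetical stand-ins so this sketch compiles without Scalding or Parquet.
final case class FilterPredicate(description: String)

trait HasFilterPredicate {
  // No filter by default
  def withFilter: Option[FilterPredicate] = None
}

trait HasColumnProjection {
  // Empty set is taken here to mean "no projection configured"
  def withColumnProjections: Set[String] = Set.empty
}

// Stand-in for a concrete parquet source class
class MyParquetSource(val path: String)
  extends HasFilterPredicate with HasColumnProjection

object OverrideExample {
  // Override both settings in an anonymous subclass at the use site
  val mySourceFilteredAndProjected: MyParquetSource =
    new MyParquetSource("/a/path") {
      override val withFilter: Option[FilterPredicate] =
        Some(FilterPredicate("placeholder predicate"))
      override val withColumnProjections: Set[String] =
        Set("a.b.c", "x.y")
    }
}
```

This keeps the source class itself unchanged, which can be convenient when only one job needs the filter or projection.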
The other way is to add these as constructor arguments:
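A minimal sketch of the constructor-argument approach follows. As before, `FilterPredicate`, `HasFilterPredicate`, `HasColumnProjection`, and `MyParquetSource` are simplified local stand-ins (assumptions, not the real Scalding or Parquet types), and the path, predicate, and glob strings are placeholders.

```scala
// Hypothetical stand-ins so this sketch compiles without Scalding or Parquet.
final case class FilterPredicate(description: String)

trait HasFilterPredicate {
  def withFilter: Option[FilterPredicate] = None
}

trait HasColumnProjection {
  def withColumnProjections: Set[String] = Set.empty
}

// The settings become constructor parameters that override the trait
// defaults, so each caller can choose its own filter and projection.
class MyParquetSource(
  val path: String,
  override val withFilter: Option[FilterPredicate] = None,
  override val withColumnProjections: Set[String] = Set.empty
) extends HasFilterPredicate with HasColumnProjection

object ConstructorExample {
  val mySourceFilteredAndProjected: MyParquetSource =
    new MyParquetSource(
      "/a/path",
      Some(FilterPredicate("placeholder predicate")),
      Set("a.b.c", "x.y")
    )

  // Defaults apply when the optional arguments are omitted
  val mySourceUnfiltered: MyParquetSource = new MyParquetSource("/a/path")
}
```

Compared with overriding at the use site, this makes the filter and projection part of the source's public signature, at the cost of a longer constructor.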