Parquet accepts globs separated by the ; character
The format for specifying columns is described here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
Note that the format described there combines multiple globs with the ; character. Here you pass the globs as a Set instead, and the set is later joined on the ; character for you.
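As a rough illustration of that joining step, the sketch below turns a Set of globs into the single ;-separated string that the linked format expects. The object name, the helper joinGlobs, and the example globs are hypothetical and not part of the library; the library's own joining code may differ.

```scala
object ColumnGlobSketch {
  // Hypothetical helper: joins the globs on ';', as the note above describes.
  // A Set has no guaranteed iteration order, so the order of the globs in the
  // joined string should not be relied upon.
  def joinGlobs(globs: Set[String]): String = globs.mkString(";")

  // Made-up example globs, for illustration only.
  val columns: Set[String] = Set("a/b/c", "x/y")
  val joined: String = joinGlobs(columns)
}
```

Passing a Set rather than a pre-joined string also means duplicate globs collapse automatically.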
When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).
The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).
For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.
Here are two ways you can use these in a parquet source:
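One way is to override the settings on an anonymous subclass at the point of use. The sketch below is self-contained and uses simplified stand-ins for the real types: the local FilterPredicate, HasFilterPredicate, and HasColumnProjection traits only mimic the library's, the member names withFilter and withColumns are assumptions, and MySource, the path, and the globs are hypothetical.

```scala
object OverrideSketch {
  // Stand-ins so the sketch compiles on its own. In a real job these come
  // from parquet-mr and com.twitter.scalding.parquet; the member names
  // withFilter and withColumns are assumed here.
  trait FilterPredicate
  trait HasFilterPredicate { def withFilter: Option[FilterPredicate] = None }
  trait HasColumnProjection { def withColumns: Set[String] = Set.empty }

  // Hypothetical source; a real one would extend a scalding parquet source.
  class MySource(val path: String) extends HasFilterPredicate with HasColumnProjection

  // Placeholder predicate; a real one would be built with parquet's filter API.
  val myFp: FilterPredicate = new FilterPredicate {}

  // Override both settings on an anonymous subclass at the use site.
  val filteredAndProjected: MySource = new MySource("/a/path") {
    override val withFilter: Option[FilterPredicate] = Some(myFp)
    override val withColumns: Set[String] = Set("a/b/c", "x/y")
  }
}
```

This keeps the source class itself unchanged, which is convenient when only one job needs the filter or projection.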
The other way is to add these as constructor arguments:
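A minimal sketch of the constructor-argument style follows. As before, the local traits are simplified stand-ins for the library's types, the member names withFilter and withColumns are assumptions, and MySource, the path, and the globs are hypothetical.

```scala
object ConstructorSketch {
  // Stand-ins so the sketch compiles on its own; the real traits live in
  // com.twitter.scalding.parquet and the member names are assumed here.
  trait FilterPredicate
  trait HasFilterPredicate { def withFilter: Option[FilterPredicate] = None }
  trait HasColumnProjection { def withColumns: Set[String] = Set.empty }

  // Hypothetical source that exposes both settings as constructor arguments,
  // defaulting to "no filter" and "no projection".
  class MySource(
    val path: String,
    override val withFilter: Option[FilterPredicate] = None,
    override val withColumns: Set[String] = Set.empty
  ) extends HasFilterPredicate with HasColumnProjection

  // Placeholder predicate; a real one would be built with parquet's filter API.
  val myFp: FilterPredicate = new FilterPredicate {}

  // Callers that need neither setting can rely on the defaults.
  val plain = new MySource("/a/path")

  // Callers that need both pass them explicitly.
  val filteredAndProjected = new MySource("/a/path", Some(myFp), Set("a/b/c", "x/y"))
}
```

With default arguments, existing call sites keep working while new ones can opt in to the push-down settings.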