When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).
The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).
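As an illustration of what such a predicate might look like, here is a minimal sketch that builds one with Parquet's FilterApi. The column names ("country", "age") and the object name FilterExample are hypothetical, and the import path assumes a Parquet release that uses the org.apache.parquet package; older releases may use a different package name.

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

object FilterExample {
  // Hypothetical columns: "country" (a string column) and "age" (an int32 column).
  // Keep only records whose country equals "US".
  private val isUs: FilterPredicate =
    FilterApi.eq(FilterApi.binaryColumn("country"), Binary.fromString("US"))

  // Keep only records whose age is at least 18.
  // FilterApi expects boxed Java values, hence Int.box.
  private val isAdult: FilterPredicate =
    FilterApi.gtEq(FilterApi.intColumn("age"), Int.box(18))

  // Both conditions must hold for a record to be kept.
  val myFp: FilterPredicate = FilterApi.and(isUs, isAdult)
}
```

A value such as FilterExample.myFp can then be supplied wherever a source expects an Option[FilterPredicate].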
For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection.
There are two ways to use these in a parquet source. One is to override them when you instantiate the source:

    class MyParquetSource(dr: DateRange) extends DailySuffixParquetThrift("/a/path", dr)

    val mySourceFilteredAndProjected = new MyParquetSource(dr) {
      override val withFilter: Option[FilterPredicate] = Some(myFp)
      override val withColumnProjections: Set[String] = Set("a.b.c", "x.y")
    }

The other way is to add these as constructor arguments:

    class MyParquetSource(
      dr: DateRange,
      override val withFilter: Option[FilterPredicate] = None,
      override val withColumnProjections: Set[String] = Set()
    ) extends DailySuffixParquetThrift("/a/path", dr)

    val mySourceFilteredAndProjected = new MyParquetSource(dr, Some(myFp), Set("a.b.c", "x.y"))