com.twitter.scalding.parquet.thrift

DailySuffixParquetThrift

class DailySuffixParquetThrift[T <: ThriftBase] extends DailySuffixSource with ParquetThrift[T]

When using these sources or creating subclasses of them, you can provide a filter predicate and/or a set of fields (columns) to keep (project).

The filter predicate will be pushed down to the input format, potentially making the filter significantly more efficient than a filter applied to a TypedPipe (parquet push-down filters can skip reading entire chunks of data off disk).
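The snippets in this documentation refer to a predicate named `myFp` without showing how one might be built. A minimal sketch, assuming Parquet's filter2 `FilterApi` under the `org.apache.parquet` package names (older parquet-mr releases expose the same classes under `parquet.filter2.predicate` and `parquet.io.api`). The column names `user.age` and `country` are hypothetical and would need to match fields in your own Thrift schema:

```scala
import org.apache.parquet.filter2.predicate.{FilterApi, FilterPredicate}
import org.apache.parquet.io.api.Binary

object FilterPredicateSketch {
  // Hypothetical columns; FilterApi takes dot-separated column paths.
  private val age = FilterApi.intColumn("user.age")
  private val country = FilterApi.binaryColumn("country")

  // Keep records where user.age >= 18 AND country == "US".
  val myFp: FilterPredicate = FilterApi.and(
    FilterApi.gtEq(age, java.lang.Integer.valueOf(18)),
    FilterApi.eq(country, Binary.fromString("US"))
  )
}
```

A value built this way can then be supplied as `Some(FilterPredicateSketch.myFp)` for `withFilter`.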

For data with a large schema (many fields / columns), providing the set of columns you intend to use can also make your job significantly more efficient (parquet column projection push-down will skip reading unused columns from disk). The columns are specified in the format described here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records
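The `withColumns` member documentation notes that Parquet expects multiple globs combined with a `;` character, while this API takes a `Set[String]` and joins it for you. The sketch below illustrates that joining step only; it is not the library's implementation, and `joinGlobs` is a hypothetical helper name. It returns an `Option[String]`, mirroring the shape of `globsInParquetStringFormat`:

```scala
object ColumnGlobSketch {
  // Hypothetical helper: None when no projection is requested, otherwise the
  // globs joined with ';'. Sorting is only here to make the output deterministic.
  def joinGlobs(columns: Set[String]): Option[String] =
    if (columns.isEmpty) None
    else Some(columns.toSeq.sorted.mkString(";"))

  def main(args: Array[String]): Unit = {
    println(joinGlobs(Set("a/b/c", "x/y")))  // Some(a/b/c;x/y)
    println(joinGlobs(Set.empty[String]))    // None
  }
}
```

In practice you only provide the set, for example `Set("a/b/c", "x/y")`, and never build the `;`-separated string yourself.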

These settings are defined in the traits com.twitter.scalding.parquet.HasFilterPredicate and com.twitter.scalding.parquet.HasColumnProjection

Here are two ways you can use these in a Parquet source. The first is to override them when instantiating the source:

class MyParquetSource(dr: DateRange) extends DailySuffixParquetThrift("/a/path", dr)

val mySourceFilteredAndProjected = new MyParquetSource(dr) {
  override val withFilter: Option[FilterPredicate] = Some(myFp)
  override val withColumns: Set[String] = Set("a/b/c", "x/y")
}

The other way is to add these as constructor arguments:

class MyParquetSource(
  dr: DateRange,
  override val withFilter: Option[FilterPredicate] = None,
  override val withColumns: Set[String] = Set()
) extends DailySuffixParquetThrift("/a/path", dr)

val mySourceFilteredAndProjected = new MyParquetSource(dr, Some(myFp), Set("a/b/c", "x/y"))
Linear Supertypes
ParquetThrift[T], ParquetThriftBase[T], HasColumnProjection, HasFilterPredicate, LocalTapSource, typed.TypedSink[T], SingleMappable[T], Mappable[T], typed.TypedSource[T], DailySuffixSource, TimePathedSource, TimeSeqPathedSource, FileSource, LocalSourceOverride, SchemedSource, Source, Serializable, AnyRef, Any

Instance Constructors

  1. new DailySuffixParquetThrift(path: String, dateRange: DateRange)(implicit mf: Manifest[T])

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def allPaths: Iterable[String]

    Definition Classes
    TimeSeqPathedSource
  5. def allPathsFor(pattern: String): Iterable[String]

    Attributes
    protected
    Definition Classes
    TimeSeqPathedSource
  6. def andThen[U](fn: (T) ⇒ U): typed.TypedSource[U]

    Definition Classes
    TypedSource
  7. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  8. def checkFlowDefNotNull(implicit flowDef: FlowDef, mode: Mode): Unit

    Attributes
    protected
    Definition Classes
    Source
  9. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  10. final def columnGlobs: Set[ColumnProjectionGlob]

    Attributes
    protected[com.twitter.scalding.parquet]
    Definition Classes
    HasColumnProjection
  11. def config: parquet.cascading.ParquetValueScheme.Config[T]

    Definition Classes
    ParquetThriftBase
  12. def contraMap[U](fn: (U) ⇒ T): typed.TypedSink[U]

    Definition Classes
    TypedSink
  13. def converter[U >: T]: TupleConverter[U]

    Definition Classes
    SingleMappable → TypedSource
  14. def createHdfsReadTap(hdfsMode: Hdfs): Tap[JobConf, _, _]

    Attributes
    protected
    Definition Classes
    FileSource
  15. def createLocalTap(sinkMode: SinkMode): Tap[_, _, _]

    Definition Classes
    LocalTapSource → LocalSourceOverride
  16. def createTap(readOrWrite: AccessMode)(implicit mode: Mode): Tap[_, _, _]

    Definition Classes
    FileSource → Source
  17. val dateRange: DateRange

    Definition Classes
    TimeSeqPathedSource
  18. def defaultDurationFor(pattern: String): Option[Duration]

    Attributes
    protected
    Definition Classes
    TimeSeqPathedSource
  19. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  20. def equals(that: Any): Boolean

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  21. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  22. final def flatMapTo[U](out: Fields)(mf: (T) ⇒ TraversableOnce[U])(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]): Pipe

    Definition Classes
    Mappable
  23. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  24. def getPathStatuses(conf: Configuration): Iterable[(String, Boolean)]

    Definition Classes
    TimeSeqPathedSource
  25. final def globsInParquetStringFormat: Option[String]

    Parquet accepts globs separated by the ; character

    Attributes
    protected[com.twitter.scalding.parquet]
    Definition Classes
    HasColumnProjection
  26. def goodHdfsPaths(hdfsMode: Hdfs): Iterable[String]

    Attributes
    protected
    Definition Classes
    FileSource
  27. def hashCode(): Int

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  28. def hdfsPaths: Iterable[String]

    Definition Classes
    TimeSeqPathedSource → FileSource
  29. def hdfsReadPathsAreGood(conf: Configuration): Boolean

    Definition Classes
    TimeSeqPathedSource → FileSource
  30. def hdfsScheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _]

    Definition Classes
    ParquetThrift → SchemedSource
  31. def hdfsWritePath: String

    Definition Classes
    TimePathedSource → FileSource
  32. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  33. def localPath: String

    Definition Classes
    TimePathedSource → LocalSourceOverride
  34. def localScheme: Scheme[Properties, InputStream, OutputStream, _, _]

    Definition Classes
    SchemedSource
  35. final def mapTo[U](out: Fields)(mf: (T) ⇒ U)(implicit flowDef: FlowDef, mode: Mode, setter: TupleSetter[U]): Pipe

    Definition Classes
    Mappable
  36. implicit val mf: Manifest[T]

  37. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  38. final def notify(): Unit

    Definition Classes
    AnyRef
  39. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  40. def pathIsGood(p: String, conf: Configuration): Boolean

    Attributes
    protected
    Definition Classes
    FileSource
  41. val pattern: String

    Definition Classes
    TimePathedSource
  42. val patterns: Seq[String]

    Definition Classes
    TimeSeqPathedSource
  43. def read(implicit flowDef: FlowDef, mode: Mode): Pipe

    Definition Classes
    Source
  44. def setter[U <: T]: TupleSetter[U]

    Definition Classes
    ParquetThriftBase → TypedSink
  45. def sinkFields: Fields

    Definition Classes
    TypedSink
  46. val sinkMode: SinkMode

    Definition Classes
    SchemedSource
  47. def sourceFields: Fields

    Definition Classes
    TypedSource
  48. def sourceId: String

    Definition Classes
    Source
  49. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  50. def toIterator(implicit config: Config, mode: Mode): Iterator[T]

    Definition Classes
    Mappable
  51. def toString(): String

    Definition Classes
    TimeSeqPathedSource → AnyRef → Any
  52. def transformForRead(pipe: Pipe): Pipe

    Attributes
    protected
    Definition Classes
    Source
  53. def transformForWrite(pipe: Pipe): Pipe

    Attributes
    protected
    Definition Classes
    Source
  54. def transformInTest: Boolean

    Definition Classes
    Source
  55. val tz: TimeZone

    Definition Classes
    TimeSeqPathedSource
  56. def validateTaps(mode: Mode): Unit

    Definition Classes
    FileSource → Source
  57. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  58. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  59. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  60. def withColumns: Set[String]

    The format for specifying columns is described here: https://github.com/apache/incubator-parquet-mr/blob/master/parquet_cascading.md#21-projection-pushdown-with-thriftscrooge-records

    Note that the format described there says that multiple globs can be combined with a ; character. Instead, we use a Set() here and will eventually join the set on the ; character for you.

    Definition Classes
    HasColumnProjection
  61. def withFilter: Option[FilterPredicate]

    Definition Classes
    HasFilterPredicate
  62. def writeFrom(pipe: Pipe)(implicit flowDef: FlowDef, mode: Mode): Pipe

    Definition Classes
    Source

Deprecated Value Members

  1. def readAtSubmitter[T](implicit mode: Mode, conv: TupleConverter[T]): Stream[T]

    Definition Classes
    Source
    Annotations
    @deprecated
    Deprecated

    (Since version 0.9.0) replace with Mappable.toIterator
