com.twitter.scalding.commons.extensions
Type parameters: A: tuple of result types
Parameters:
  checkpointName: name of the checkpoint
  resultFields: tuple of result field names
  flow: a function to run a flow to compute the result
Implicit parameters:
  args: provided by com.twitter.pluck.job.TwitterJob
  mode: provided by com.twitter.scalding.Job
  flowDef: provided by com.twitter.scalding.Job
  conv: provided by com.twitter.scalding.TupleConversions
  setter: provided by com.twitter.scalding.TupleConversions
Checkpoint provides a simple mechanism to read and write intermediate results from a Scalding flow to HDFS.
Checkpoints are useful for debugging one part of a long flow, when you would otherwise have to run many steps to get to the one you care about. To enable checkpoints, sprinkle calls to Checkpoint() throughout your flow, ideally after expensive steps.
When checkpoints are enabled, each Checkpoint() looks for a checkpoint file on HDFS. If it exists we read results from the file; otherwise we execute the flow and write the results to the file. When checkpoints are disabled, the flow is always executed and the results are never stored.
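The read-or-compute behaviour described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not Checkpoint's actual implementation: the object CheckpointSketch and the helper readOrCompute are hypothetical names, a local file stands in for the HDFS checkpoint file, and a String stands in for the flow's results.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object CheckpointSketch {
  // Hypothetical helper, not part of Scalding.
  // `file` is None when checkpoints are disabled.
  // `compute` is by-name, so the "flow" only runs when it is needed.
  def readOrCompute(file: Option[Path])(compute: => String): String =
    file match {
      // Checkpoints disabled: always execute, never store.
      case None =>
        compute
      // Checkpoint file exists: read the stored results, skip the flow.
      case Some(path) if Files.exists(path) =>
        new String(Files.readAllBytes(path), UTF_8)
      // No checkpoint file yet: execute the flow and store the results.
      case Some(path) =>
        val result = compute
        Files.write(path, result.getBytes(UTF_8))
        result
    }
}
```

Taking the flow as a by-name argument mirrors the { ... } block passed to Checkpoint(): the expensive step is skipped entirely when a stored result is available.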
Each call to Checkpoint() takes the checkpoint name, as well as the types and names of the expected fields. A sample invocation might look like this:

    val pipe = Checkpoint[(Long, String, Long)](
      "clicks", ('tweetId, 'clickUrl, 'clickCount)) { ... }

where { ... } contains a flow which computes the result.
Most checkpoint parameters are specified via command-line flags:
  --checkpoint.clobber: if true, recompute and overwrite any existing checkpoint files.
  --checkpoint.clobber.<name>: override clobber for the given checkpoint.
  --checkpoint.file: specifies a filename prefix to use for checkpoint files. If blank, checkpoints are disabled; otherwise the file for checkpoint <name> is <prefix>_<name>.
  --checkpoint.file.<name>: override --checkpoint.file for the given checkpoint; specifies the whole filename, not the prefix.
  --checkpoint.format: specifies a file format, either sequencefile or tsv. Default is sequencefile for HDFS, tsv for local.
  --checkpoint.format.<name>: specifies file format for the given checkpoint.
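The precedence between the global flags and their per-checkpoint overrides can be sketched as follows. This is a sketch under assumptions rather than the library's own code: the object CheckpointFlags and its methods are hypothetical, the flags are modelled as a plain Map[String, String] instead of Scalding's Args, and a clobber flag counts as set only when its value is the literal "true".

```scala
object CheckpointFlags {
  // Hypothetical model: flag names (without the leading "--") mapped to values.
  type Flags = Map[String, String]

  // Per-checkpoint flag gives the whole filename; otherwise <prefix>_<name>.
  // None means checkpoints are disabled for this name.
  def fileFor(flags: Flags, name: String): Option[String] =
    flags.get(s"checkpoint.file.$name").filter(_.nonEmpty)
      .orElse(
        flags.get("checkpoint.file").filter(_.nonEmpty)
          .map(prefix => s"${prefix}_$name"))

  // Per-checkpoint clobber overrides the global one.
  // Assumption: only the value "true" (any case) enables clobbering.
  def clobberFor(flags: Flags, name: String): Boolean =
    flags.get(s"checkpoint.clobber.$name")
      .orElse(flags.get("checkpoint.clobber"))
      .exists(_.equalsIgnoreCase("true"))

  // Per-checkpoint format, then global format, then the documented default:
  // tsv in local mode, sequencefile on HDFS.
  def formatFor(flags: Flags, name: String, isLocal: Boolean): String =
    flags.get(s"checkpoint.format.$name")
      .orElse(flags.get("checkpoint.format"))
      .getOrElse(if (isLocal) "tsv" else "sequencefile")
}
```

For instance, with --checkpoint.file set to the illustrative prefix /tmp/ckpt and no per-checkpoint override, the "clicks" checkpoint from the sample invocation would resolve to /tmp/ckpt_clicks.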