Checkpoint provides a simple mechanism to read and write intermediate results
from a Scalding flow to HDFS.
Checkpoints are useful for debugging one part of a long flow, when you would
otherwise have to run many steps to get to the one you care about. To enable
checkpoints, sprinkle calls to Checkpoint() throughout your flow, ideally
after expensive steps.
When checkpoints are enabled, each Checkpoint() looks for a checkpoint file
on HDFS. If it exists we read results from the file; otherwise we execute
the flow and write the results to the file. When checkpoints are disabled,
the flow is always executed and the results are never stored.
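The read-or-compute behavior described above can be sketched in plain Scala. This is only an illustration under stated assumptions, not Scalding's implementation: the object CheckpointSketch and its checkpoint function are hypothetical, a local file stands in for the HDFS checkpoint file, and results are modeled as lines of text rather than a pipe.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}

object CheckpointSketch {
  // Hypothetical stand-in for Checkpoint(): `file` is None when checkpoints
  // are disabled, and `compute` plays the role of the flow being checkpointed.
  def checkpoint(file: Option[Path])(compute: => Seq[String]): Seq[String] =
    file match {
      // Checkpoints disabled: always execute the flow, never store results.
      case None => compute
      // Checkpoint file exists: read the results instead of executing the flow.
      case Some(path) if Files.exists(path) =>
        new String(Files.readAllBytes(path), UTF_8).split("\n").filter(_.nonEmpty).toSeq
      // No checkpoint file yet: execute the flow and write the results.
      case Some(path) =>
        val rows = compute
        Files.write(path, rows.mkString("\n").getBytes(UTF_8))
        rows
    }
}
```

Because compute is a by-name parameter, the expensive step runs only on the branches that need it, which is the property that makes a checkpoint worthwhile.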
Each call to Checkpoint() takes the checkpoint name, as well as the types and
names of the expected fields. A sample invocation might look like this:
    val pipe = Checkpoint[(Long, String, Long)](
      "clicks", ('tweetId, 'clickUrl, 'clickCount)) { ... }
where { ... } contains a flow which computes the result.
Most checkpoint parameters are specified via command-line flags:
  --checkpoint.clobber: if true, recompute and overwrite any existing
    checkpoint files.
  --checkpoint.clobber.<name>: override clobber for the given checkpoint.
  --checkpoint.file: specifies a filename prefix to use for checkpoint files.
    If blank, checkpoints are disabled; otherwise the file for checkpoint
    <name> is <prefix>_<name>.
  --checkpoint.file.<name>: override --checkpoint.file for the given
    checkpoint; specifies the whole filename, not the prefix.
  --checkpoint.format: specifies a file format, either sequencefile or tsv.
    Default is sequencefile for HDFS, tsv for local.
  --checkpoint.format.<name>: specifies file format for the given checkpoint.
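How the per-checkpoint flags override the global ones can be sketched as follows. This is a hedged illustration, not the library's code: the object CheckpointFlags, the Flags alias, and the isHdfs parameter are hypothetical, a plain Map stands in for the parsed command line, and treating only the literal value "true" as enabling clobber is an assumption.

```scala
object CheckpointFlags {
  // Hypothetical stand-in for parsed command-line flags (flag name -> value).
  type Flags = Map[String, String]

  // Look up --<flag>.<name> first, then fall back to the global --<flag>.
  private def resolve(flags: Flags, flag: String, name: String): Option[String] =
    flags.get(s"$flag.$name").orElse(flags.get(flag))

  // The per-checkpoint value is the whole filename; the global value is a
  // prefix, giving <prefix>_<name>. None means checkpoints are disabled.
  def filename(flags: Flags, name: String): Option[String] =
    flags.get(s"checkpoint.file.$name").filter(_.nonEmpty)
      .orElse(flags.get("checkpoint.file").filter(_.nonEmpty).map(prefix => s"${prefix}_$name"))

  // Assumption: only the literal value "true" enables clobbering.
  def clobber(flags: Flags, name: String): Boolean =
    resolve(flags, "checkpoint.clobber", name).contains("true")

  // Default is sequencefile for HDFS and tsv for local runs.
  def format(flags: Flags, name: String, isHdfs: Boolean): String =
    resolve(flags, "checkpoint.format", name).getOrElse(if (isHdfs) "sequencefile" else "tsv")
}
```

For example, with checkpoint.file set to /tmp/run1 and checkpoint.file.clicks set to /tmp/custom, the checkpoint named clicks resolves to /tmp/custom while a checkpoint named users resolves to /tmp/run1_users.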