Package

com.twitter.scalding.commons

extensions

package extensions

Value Members

  1. object Checkpoint

    Checkpoint provides a simple mechanism to read and write intermediate results from a Scalding flow to HDFS.

    Checkpoints are useful for debugging one part of a long flow, when you would otherwise have to run many steps to get to the one you care about. To enable checkpoints, sprinkle calls to Checkpoint() throughout your flow, ideally after expensive steps.

    When checkpoints are enabled, each Checkpoint() looks for a checkpoint file on HDFS. If it exists we read results from the file; otherwise we execute the flow and write the results to the file. When checkpoints are disabled, the flow is always executed and the results are never stored.
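    The read-or-compute behavior described above can be sketched in plain Scala. This is an illustration only, not Scalding's implementation: the object CheckpointSketch and its checkpoint helper are hypothetical, results are modeled as lines of text, a local file stands in for HDFS, and Scala 2.13 or later is assumed for scala.jdk.CollectionConverters.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical stand-in for the checkpoint mechanism; not part of Scalding.
object CheckpointSketch {
  /**
   * file = None models "checkpoints disabled": the flow always runs and
   * nothing is stored. file = Some(path) models "checkpoints enabled".
   */
  def checkpoint(file: Option[Path])(flow: => Seq[String]): Seq[String] =
    file match {
      case None =>
        // Disabled: always execute, never store.
        flow
      case Some(path) if Files.exists(path) =>
        // Checkpoint file exists: read the results instead of executing the flow.
        Files.readAllLines(path, StandardCharsets.UTF_8).asScala.toSeq
      case Some(path) =>
        // No checkpoint file yet: execute the flow and write the results.
        val result = flow
        Files.write(path, result.asJava, StandardCharsets.UTF_8)
        result
    }
}
```

    Because the flow is passed by name, it is not evaluated at all when a checkpoint file is found, which is what makes placing a checkpoint after an expensive step worthwhile.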

    Each call to Checkpoint() takes the checkpoint name, as well as the types and names of the expected fields. A sample invocation might look like this:

      val pipe = Checkpoint[(Long, String, Long)]("clicks", ('tweetId, 'clickUrl, 'clickCount)) { ... }

    where { ... } contains a flow which computes the result.

    Most checkpoint parameters are specified via command-line flags:

      --checkpoint.clobber: if true, recompute and overwrite any existing checkpoint files.
      --checkpoint.clobber.<name>: override clobber for the given checkpoint.
      --checkpoint.file: specifies a filename prefix to use for checkpoint files. If blank, checkpoints are disabled; otherwise the file for checkpoint <name> is <prefix>_<name>.
      --checkpoint.file.<name>: override --checkpoint.file for the given checkpoint; specifies the whole filename, not the prefix.
      --checkpoint.format: specifies a file format, either sequencefile or tsv. Default is sequencefile for HDFS, tsv for local.
      --checkpoint.format.<name>: specifies file format for the given checkpoint.
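    The precedence between the global flags and their per-checkpoint overrides can be sketched as follows. This is a sketch under stated assumptions, not the library's code: CheckpointFlagsSketch, the Flags map (flag name without the leading dashes mapped to its value), and the isLocal parameter are all hypothetical. It also assumes that a per-checkpoint --checkpoint.file.<name> enables that checkpoint even when --checkpoint.file is blank, which the description above does not state.

```scala
// Hypothetical model of flag resolution; not part of Scalding.
object CheckpointFlagsSketch {
  // Parsed flags: name without the leading "--" mapped to its value (assumed shape).
  type Flags = Map[String, String]

  /** Resolved filename for a checkpoint, or None when checkpoints are disabled. */
  def fileFor(flags: Flags, name: String): Option[String] =
    flags.get(s"checkpoint.file.$name").filter(_.nonEmpty) // whole filename
      .orElse(
        flags.get("checkpoint.file").filter(_.nonEmpty)    // prefix only
          .map(prefix => s"${prefix}_$name"))

  /** Per-checkpoint clobber setting wins over the global one; default is false. */
  def clobberFor(flags: Flags, name: String): Boolean =
    flags.get(s"checkpoint.clobber.$name")
      .orElse(flags.get("checkpoint.clobber"))
      .exists(_.equalsIgnoreCase("true"))

  /** Per-checkpoint format, then global format, then the mode-dependent default. */
  def formatFor(flags: Flags, name: String, isLocal: Boolean): String =
    flags.get(s"checkpoint.format.$name")
      .orElse(flags.get("checkpoint.format"))
      .getOrElse(if (isLocal) "tsv" else "sequencefile")
}
```

    With flags equivalent to --checkpoint.file /tmp/run1 and --checkpoint.file.clicks /tmp/custom_clicks, this sketch resolves checkpoint "users" to /tmp/run1_users and checkpoint "clicks" to /tmp/custom_clicks.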
