Builder classes used internally to implement coGroups (joins). Can also be used for more generalized joins, e.g., star joins.
This is a wrapper class on top of Map[String, String]
CSV value source, separated by commas, with quotes wrapping all fields
Sets up an implicit dateRange to use in your sources and an implicit timezone. Example args: --date 2011-10-02 2011-10-04 --tz UTC. If no timezone is given, Pacific is assumed.
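A minimal sketch of a job built on this (the class name, paths and TextLine/TypedTsv choices are just for illustration):

  import com.twitter.scalding._

  // Run with: --date 2011-10-02 2011-10-04 --tz UTC
  class DailyJob(args: Args) extends DefaultDateRangeJob(args) {
    // dateRange (and tz) are in implicit scope here, so any date-aware
    // source constructed in this body can pick them up automatically
    TypedPipe.from(TextLine(args("input")))
      .write(TypedTsv[String](args("output")))
  }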
Mix this in for delimited schemes such as TSV or one-separated values. By default, TSV is given.
Execution[T] represents a computation that can be run to produce a value T, keeping track of counters incremented inside of TypedPipes using a Stat.
Execution[T] is the recommended way to compose multistep computations that involve branching (if/then), intermediate calls to remote services, file operations, or looping (e.g. testing for convergence).
Library functions are encouraged to implement functions from TypedPipes or ValuePipes to Execution[R] for some result R. Refrain from calling run in library code. Let the caller of your library call run.
Note this is a Monad, meaning flatMap composes in series as you expect. It is also an applicative functor, which means zip (called join in some libraries) composes two Executions in parallel. Prefer zip to flatMap if you want to run two Executions in parallel.
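A minimal sketch of the difference (the values are toy stand-ins):

  import com.twitter.scalding.Execution

  val e1: Execution[Int] = Execution.from(1)
  val e2: Execution[Int] = Execution.from(2)

  // flatMap: e2 is only started after e1 completes (series)
  val inSeries: Execution[Int] = e1.flatMap { x => e2.map(_ + x) }
  // zip: e1 and e2 can run at the same time (parallel)
  val inParallel: Execution[Int] = e1.zip(e2).map { case (a, b) => a + b }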
This represents the counters portion of the JobStats that are returned. Counters are just a vector of Longs, keyed by counter name and group.
This is a simple job that allows you to launch Execution[T] instances using scalding.Tool and scald.rb. You cannot print the graph.
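A minimal sketch, assuming the standard ExecutionJob contract of overriding execution (the class name is hypothetical):

  import com.twitter.scalding._

  class HelloExecutionJob(args: Args) extends ExecutionJob[Unit](args) {
    // the Execution returned here is what scalding.Tool will run
    override def execution: Execution[Unit] =
      Execution.from(println("hello"))
  }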
This is a base class for File-based sources
Immutable state that we attach to the Flow using the FlowStateMap
This handles the mapReduceMap work on the map-side of the operation. The code below attempts to be optimal with respect to memory allocations and performance, not functional style purity.
Implements reductions on top of a simple abstraction for the Fields-API. We use the f-bounded polymorphism trick to return the type called Self in each operation.
This controls the sequence of reductions that happen inside a particular grouping operation.
This controls the sequence of reductions that happen inside a particular grouping operation. Not all elements can be combined; for instance, a scanLeft/foldLeft generally requires a sort, but such sorts are (at least for now) incompatible with doing a combine which includes some map-side reductions.
thrown when validateTaps fails
Allows an iterable object defined in the job (on the submitter) to be used within a Job as you would a Pipe/RichPipe
These lists should probably be very tiny by Hadoop standards. If they are getting large, you should probably dump them to HDFS and use the normal mechanisms to address the data (a FileSource).
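A sketch of the idea, inside a Job (the field names are illustrative):

  // a tiny in-memory list used as if it were any other source
  val users = IterableSource(List((1, "alice"), (2, "bob")), ('id, 'name))
  val pipe = users.read // an ordinary Pipe from here on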
Job is a convenience class to make using Scalding easier. Subclasses of Job automatically have a number of nice implicits to enable more concise syntax, including:
conversion from Pipe, Source or Iterable to RichPipe
conversion from Source or Iterable to Pipe
conversion from collections or Tuple[1-22] to cascading.tuple.Fields
Additionally, the job provides an implicit Mode and FlowDef so that functions that register starts or ends of a flow graph, specifically anything that reads or writes data on Hadoop, have the needed implicits available.
If you want to write code outside of a Job, you will want to either:
make all methods that may read or write data accept implicit FlowDef and Mode parameters.
OR:
write code that, rather than returning values, returns a (FlowDef, Mode) => T; these functions can be combined monadically using algebird.monad.Reader.
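For example, a minimal word-count Job using the implicits described above (paths come from args):

  import com.twitter.scalding._

  class WordCountJob(args: Args) extends Job(args) {
    TypedPipe.from(TextLine(args("input")))
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .sumByKey
      .write(TypedTsv[(String, Long)](args("output")))
  }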
This class is used to construct unit tests for scalding jobs. You should not use it unless you are writing tests. For examples of how to do that, see the tests included in the main scalding repository: https://github.com/twitter/scalding/tree/master/scalding-core/src/test/scala/com/twitter/scalding
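A minimal sketch against the WordCountJob above ("inFile"/"outFile" are fake keys, not real paths):

  JobTest(new WordCountJob(_))
    .arg("input", "inFile")
    .arg("output", "outFile")
    .source(TextLine("inFile"), List((0, "hello world hello")))
    .sink[(String, Long)](TypedTsv[(String, Long)]("outFile")) { out =>
      assert(out.toMap == Map("hello" -> 2L, "world" -> 1L))
    }
    .run
    .finish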
A trait which provides a method to create a local tap.
Use this class to add support for Cascading local mode via the Hadoop tap. Put another way, this runs a Hadoop tap outside of Hadoop, in Cascading local mode.
MapReduceMapBy Class
This handles the mapReduceMap work on the map-side of the operation. The code below attempts to be optimal with respect to memory allocations and performance, not functional style purity.
Usually as soon as we open a source, we read and do some mapping operation on a single column or set of columns. T is the type of the single column. If doing multiple columns, T will be a TupleN representing the types, e.g. (Int,Long,String)
Prefer to use TypedSource unless you are working with the fields API
NOTE: If we don't make this extend Source, established implicits are ambiguous when TDsl is in scope.
An implementation of map-side combining which is appropriate for associative and commutative functions. If a cacheSize is given, it is used; else we query the config for cascading.aggregateby.threshold (the standard cascading param for an equivalent case); else we use a default value of 100,000.
This keeps a cache of keys up to the cache-size, summing values as keys collide. On eviction, or completion of this Operation, the key-value pairs are put into outputCollector.
This NEVER spills to disk and should generally never be a performance penalty. If you have poor locality in the keys, you just don't get any benefit, but there is little added cost.
Note this means that you may still have repeated keys in the output even on a single mapper since the key space may be so large that you can't fit all of them in the cache at the same time.
You can use this with the Fields-API by doing:

  // MUST map onto the same key,value space (may be multiple fields)
  val msr = new MapsideReduce(Semigroup.from(fn), 'key, 'value, None)
  val mapSideReduced = pipe.eachTo(('key, 'value) -> ('key, 'value)) { _ => msr }
That said, this is equivalent to AggregateBy; its only advantage is that it is much simpler than AggregateBy. AggregateBy assumes several parallel reductions are happening, and thus has many loops and array lookups to deal with that. Since this does many fewer allocations, and has a smaller code-path, it may be faster for the typed-API.
Delimited-files source that allows overriding the separator and quotation characters and the header configuration
Allows the use of multiple Tsv input paths. The Tsv files will be processed through your flow as if they were a single pipe. The Tsv files must have the same schema. For more details on how multiple files are handled, check the cascading docs.
This is only a TypedSource as sinking into multiple directories is not well defined
A tap that outputs nothing. It is used to drive execution of a task for side effect only. This can be used to drive a pipe without actually writing to HDFS.
Alternate typed TextLine source that keeps both 'offset and 'line fields.
This just blindly uses the first public constructor with the same arity as the fields size
One separated value (commonly used by Pig)
This is a base class for partition-based output sources
An implementation of SequenceFile output, split over a partition tap.
The root path for the output.
The partitioning strategy to use.
The set of fields to use for the sequence file.
How to handle conflicts with existing output.
An implementation of TSV output, split over a partition tap.
The root path for the output.
The partitioning strategy to use.
Flag to indicate that the header should be written to the file.
How to handle conflicts with existing output.
This is a builder for Cascading's Debug object. The default instance is the same default as cascading's new Debug() https://github.com/cwensel/cascading/blob/wip-2.5/cascading-core/src/main/java/cascading/operation/Debug.java#L46 This is based on work by: https://github.com/granthenke https://github.com/twitter/scalding/pull/559
Implements reductions on top of a simple abstraction for the Fields-API. This is for associative and commutative operations (particularly Monoids and Semigroups play a big role here)
We use the f-bounded polymorphism trick to return the type called Self in each operation.
Packs a tuple into any object with set methods, e.g. thrift or proto objects. TODO: verify that protobuf setters for field camel_name are of the form setCamelName. In that case this code works for proto.
This is an enrichment-pattern class for cascading.flow.FlowDef. The rule is to never use this class directly in input or return types, but only to add methods to FlowDef.
This is an enrichment-pattern class for cascading.pipe.Pipe. The rule is to never use this class directly in input or return types, but only to add methods to Pipe.
Scala 2.8 Iterators don't support scanLeft, so we have to reimplement it. The Scala 2.9 implementation creates an off-by-one bug with the unused fields in the Fields API
A base class for sources that take a scheme trait.
Mappable extension that defines the proper converter implementation for a Mappable with a single item.
Represents a strategy for replicating rows when performing skewed joins.
See https://github.com/twitter/scalding/pull/229#issuecomment-10773810
See https://github.com/twitter/scalding/pull/229#issuecomment-10792296
Every source must have a correct toString method. If you use case classes for instances of sources, you will get this for free. This is one of the several reasons we recommend using case classes
java.io.Serializable is needed if the Source is going to have any methods attached that run on mappers or reducers, which will happen if you implement transformForRead or transformForWrite.
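For instance, a sketch of a source declared as a case class (the path and scheme mixin are illustrative):

  // toString, equals and hashCode come for free from the case class
  case class DailyLogs(date: String)
    extends FixedPathSource("/logs/" + date)
    with TextLineScheme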
A simple trait for a releasable resource. Provides a noop implementation.
Implements reductions on top of a simple abstraction for the Fields-API. We use the f-bounded polymorphism trick to return the type called Self in each operation.
Ensures that a _SUCCESS file is present in the Source path, which must be a glob, as well as the requirements of FileSource.pathIsGood
This is a base class for template based output sources
An implementation of SequenceFile output, split over a template tap.
The root path for the output.
The Java formatter-style string to use as the template, e.g. %s/%s.
The set of fields to use for the sequence file.
The set of fields to apply to the path.
How to handle conflicts with existing output.
An implementation of TSV output, split over a template tap.
The root path for the output.
The Java formatter-style string to use as the template, e.g. %s/%s.
The set of fields to apply to the path.
Flag to indicate that the header should be written to the file.
How to handle conflicts with existing output.
The set of fields to apply to the output.
Memory only testing for unit tests
The fields here are ('offset, 'line)
This will automatically produce a globbed version of the given path. THIS MEANS YOU MUST END WITH A / followed by * to match a file. For writing, we write to the directory specified by the END time.
Tab separated value source
Mixed in to both TupleConverter and TupleSetter to improve arity safety of cascading jobs before we run anything on Hadoop.
Typeclass to represent converting from cascading TupleEntry to some type T. The most common application is to convert to scala Tuple objects for use with the Fields API. The typed API internally manually handles its mapping to cascading Tuples, so the implicit resolution mechanism is not used.
WARNING: if you are seeing issues with the singleConverter being found when you expect something else, you may have an issue where the enclosing scope needs to take an implicit TupleConverter of the correct type.
Unfortunately, the semantics we want (prefer to flatten tuples, but otherwise put everything into one position in the tuple) are somewhat difficult to encode in scala.
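A sketch of the pattern that warning suggests: generic helpers should take the converter implicitly, so the correct instance is resolved at the call site rather than falling back to singleConverter (the helper name is hypothetical):

  import cascading.tuple.TupleEntry

  // if T were not pinned down by an implicit parameter here, only the
  // fallback singleConverter could be chosen
  def convert[T](te: TupleEntry)(implicit conv: TupleConverter[T]): T =
    conv(te)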
Typeclass roughly equivalent to a Lens, which allows getting items out of a tuple. This is useful because cascading has type coercion (string to int, for instance) that users expect in the fields API. This code is not used in the typesafe API, which does not allow such silent coercion. See the generated TupleConverters for an example of where this is used.
Typeclass for packing a cascading Tuple into some type T, this is used to put fields of a cascading tuple into Thrift, Protobuf, or case classes, for instance, but you can add your own instances to control how this is done.
Typeclass to represent converting back to (setting into) a cascading Tuple. This looks like it could be contravariant, but it can't be, because of our approach of falling back to the singleSetter: you really want the most specific setter you can get. Put more directly: a TupleSetter[Any] is not just as good as a TupleSetter[(Int, Int)] from the scalding DSL's point of view. The latter will flatten the (Int, Int), but the former won't.
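A sketch of the difference in arity (assuming the standard implicit setters are in scope):

  val pairSetter = implicitly[TupleSetter[(Int, Int)]]
  pairSetter((1, 2)).size // 2: the pair is flattened into two positions

  val anySetter = TupleSetter.singleSetter[Any]
  anySetter((1, 2)).size // 1: the whole pair sits in one position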
This class is used to bind together a Fields instance which may contain a type array via getTypes, a TupleConverter and TupleSetter, which are inverses of one another. Note the size of the Fields object and the arity values for the converter and setter are all the same. Note in the com.twitter.scalding.macros package there are macros to generate this for case classes, which may be very convenient.
In the typed API every reduce operation is handled by this Buffer
Trait to assist with creating objects such as TypedTsv to read from separated files. Override separator, skipHeader, writeHeader as needed.
Used to inject a typed unique identifier to uniquely name each scalding flow. This is here mostly to deal with the case of testing, where there are many concurrent threads running Flows. Users should never have to worry about these.
Provides handlers and mappings for exceptions
(Since version 2015-07) Use FixedTypedText instead
(Since version 0.9.0) This trait does nothing now
Allows you to set the types; prefer this. If T is a subclass of Product, we assume it is a tuple. If it is not, wrap T in a Tuple1: e.g. TypedTsv[Tuple1[List[Int]]]
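For example:

  val pairs = TypedTsv[(String, Int)]("path/to/pairs") // T is a Product: two columns
  val lists = TypedTsv[Tuple1[List[Int]]]("path/to/lists") // non-tuple T wrapped in Tuple1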
(Since version 2015-07) Use TypedTextDelimited instead
This object has all the implicit functions and values that are used to make the scalding DSL, which includes the functions for automatically creating cascading.tuple.Fields objects from scala tuples of Strings, Symbols or Ints, as well as the cascading.pipe.Pipe enrichment to RichPipe which adds the scala.collections-like API to Pipe.
It's useful to import Dsl._ when you are writing scalding code outside of a Job.
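For instance, a sketch of using the conversions outside a Job:

  import cascading.tuple.Fields
  import com.twitter.scalding.Dsl._

  // a tuple of Symbols converted to cascading Fields implicitly
  val f: Fields = ('user, 'count)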
Execution has many methods for creating Execution[T] instances, which are the preferred way to compose computations in scalding libraries.
The companion gives several ways to create ExecutionCounters from other CascadingStats, JobStats, or Maps
This is a mutable threadsafe store for attaching scalding information to the mutable flowDef
NOTE: there is a subtle bug in scala regarding case classes with multiple sets of arguments, and their equality. For this reason, we use Source.sourceId as the key in this map
A source that outputs nothing. It is used to drive execution of a task for side effect only.
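A sketch of the pattern, inside a Job (somePipe stands for any pipe already in the flow):

  // forces somePipe to be computed, without writing anything to disk
  somePipe.write(NullSource)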
Alternate typed TextLine source that keeps both 'offset and 'line fields.
An implementation of SequenceFile output, split over a partition tap.
apply assumes the user wants a DelimitedPartition (the only strategy bundled with Cascading).
An implementation of TSV output, split over a partition tap.
Similar to TemplateSource, but with addition of tsvFields, to let users explicitly specify which fields they want to see in the TSV (allows user to discard path fields).
apply assumes the user wants a DelimitedPartition (the only strategy bundled with Cascading).
A helper for working with class reflection. Allows us to avoid code repetition.
Provides an apply method for creating XHandlers with default or custom settings, and contains messages and mappings
Wrapper around a FlowProcess, useful e.g. for incrementing counters.
The objects for the Typed-API live in the scalding.typed package but are aliased here.
Use this to create Taps for testing.
Calling init registers "com.twitter.scalding" as a "tracing boundary" for Cascading. That means that when Cascading sends trace information to a DocumentService such as Driven, the trace will have information about the caller of Scalding instead of about the internals of Scalding. com.twitter.scalding.Job and its subclasses will automatically initialize Tracing.
register and unregister methods are provided for testing, but should not be needed for most development
Typeclass for objects which unpack an object into a tuple. The unpacker can verify the arity, types, and also the existence of the getter methods at plan time, without having the job blow up in the middle of a run.
Typed comma separated values file
Typed one separated values file (commonly used by Pig)
Typed pipe separated values file
Typed tab separated values file
Make sure this is in sync with version.sbt