Package

com.twitter.scalding

examples

package examples

Type Members

  1. class MergeTest extends Job

    This example job does not yet work. It is a test for Kryo serialization.

  2. class PageRank extends Job

    Options:
      --input: the three-column TSV with node, comma-separated out-neighbors, and initial pagerank (set to 1.0 first)
      --output: the name of the TSV to write to, in the same three-column format as the input

    Optional arguments:
      --errorOut: where to write the L1 error between the input pagerank and the output. If this is omitted, the error is not computed.
      --iterations: how many iterations to run inside this job. Default is 1; 10 is about as much as Cascading can handle.
      --jumpprob: probability of a random jump. Default is 0.15.
      --convergence: if this is set, after every "--iterations" steps the job checks the error to decide whether to continue. Since the error check is expensive (it involves a join), avoid doing it too frequently; 10 iterations is probably a good number to set.
      --temp: where to store a temporary output so it can be compared to the previous one for convergence checking. If --convergence is set, this MUST be set too.
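    To make the input format and the --jumpprob and --errorOut options concrete, here is a minimal in-memory sketch of one PageRank step in plain Scala. It does not use the Scalding API and is not the PageRank class's implementation. The names PageRankSketch, Node, parseLine, step and l1Error are hypothetical, node ids are assumed to be Long, and rank held by nodes without out-neighbors is simply dropped in this sketch.

```scala
object PageRankSketch {
  // One row of the three-column TSV: node, comma-separated out-neighbors, current rank.
  // Assumption: node ids are Long; the documentation does not state their type.
  final case class Node(id: Long, outNeighbors: Seq[Long], rank: Double)

  def parseLine(line: String): Node = {
    val cols = line.split("\t", -1) // limit -1 keeps an empty neighbor column
    val neighbors =
      if (cols(1).isEmpty) Seq.empty[Long]
      else cols(1).split(",").toSeq.map(_.trim.toLong)
    Node(cols(0).toLong, neighbors, cols(2).toDouble)
  }

  // One step: every node splits its rank evenly among its out-neighbors, and the
  // new rank of a node is jumpProb + (1 - jumpProb) * (sum of the shares it received).
  // Assumption: ranks start at 1.0 per node, as the --input description suggests.
  def step(nodes: Seq[Node], jumpProb: Double = 0.15): Seq[Node] = {
    val received: Map[Long, Double] = nodes
      .flatMap(n => n.outNeighbors.map(dst => (dst, n.rank / n.outNeighbors.size)))
      .groupBy(_._1)
      .map { case (dst, shares) => (dst, shares.map(_._2).sum) }
    nodes.map(n => n.copy(rank = jumpProb + (1.0 - jumpProb) * received.getOrElse(n.id, 0.0)))
  }

  // L1 error between two rank assignments: the sum of absolute per-node differences.
  def l1Error(before: Seq[Node], after: Seq[Node]): Double = {
    val old = before.map(n => (n.id, n.rank)).toMap
    after.map(n => math.abs(n.rank - old.getOrElse(n.id, 0.0))).sum
  }
}
```

    For example, PageRankSketch.step(lines.map(PageRankSketch.parseLine)) applied repeatedly mimics running several iterations, and l1Error between consecutive results is the kind of quantity a convergence check would compare against a tolerance.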

  3. class WeightedPageRank extends Job

    Weighted page rank for the given graph: start from the given pagerank, perform one iteration, and test for convergence. If it has not yet converged, the job clones itself and starts the next page rank job with the updated pagerank as input.

    This class is very similar to the PageRank class. The main differences are:
      1. weighted pagerank is supported
      2. the reset pagerank is pregenerated, possibly through a previous job
      3. dead pagerank is evenly distributed

    Options:
      --pwd: working directory; the job will read/generate the following files there:
        numnodes: total number of nodes
        nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>
        pagerank: the page rank file, e.g. pagerank_0, pagerank_1, etc.
        totaldiff: the current max pagerank delta

    Optional arguments:
      --weighted: do weighted pagerank, default false
      --curiteration: what is the current iteration, default 0
      --maxiterations: how many iterations to run. Default is 20
      --jumpprob: probability of a random jump, default is 0.1
      --threshold: total difference before finishing early, default 0.001
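    The three differences from PageRank listed above can be illustrated with a small in-memory sketch in plain Scala. This is one plausible reading of the description, not the class's actual code: the names WeightedPageRankSketch, NodeRow and step are hypothetical, ids are assumed to be Int, ranks and priors are assumed to each sum to 1, and the update formula (including how dead mass is folded in) is an assumption chosen so that total mass is conserved.

```scala
object WeightedPageRankSketch {
  // Hypothetical in-memory stand-in for one row of the nodes file
  // <'src_id, 'dst_ids, 'weights, 'mass_prior>.
  final case class NodeRow(srcId: Int, dstIds: Seq[Int], weights: Seq[Double], massPrior: Double)

  // One step under these assumptions:
  //   next(v) = jumpProb * prior(v)
  //           + (1 - jumpProb) * (weighted mass received by v + deadMass / numNodes)
  // Weights of a node with out-edges are assumed to be positive.
  def step(nodes: Seq[NodeRow], rank: Map[Int, Double], jumpProb: Double = 0.1): Map[Int, Double] = {
    val numNodes = nodes.size
    // "Dead" pagerank: mass held by nodes that have no out-edges.
    val deadMass = nodes.filter(_.dstIds.isEmpty).map(r => rank.getOrElse(r.srcId, 0.0)).sum
    // Each node sends its rank to its neighbors in proportion to the edge weights.
    val received: Map[Int, Double] = nodes
      .flatMap { r =>
        val totalWeight = r.weights.sum
        val mass = rank.getOrElse(r.srcId, 0.0)
        r.dstIds.zip(r.weights).map { case (dst, w) => (dst, mass * w / totalWeight) }
      }
      .groupBy(_._1)
      .map { case (dst, shares) => (dst, shares.map(_._2).sum) }
    nodes.map { r =>
      val incoming = received.getOrElse(r.srcId, 0.0) + deadMass / numNodes
      (r.srcId, jumpProb * r.massPrior + (1.0 - jumpProb) * incoming)
    }.toMap
  }
}
```

    With uniform weights this reduces to an unweighted step, and with a uniform massPrior the reset term becomes the usual random jump.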

  4. class WeightedPageRankFromMatrix extends Job

    A weighted PageRank implementation using the Scalding Matrix API. This assumes that all rows and columns are of type Int and that values (edge weights) are of type Double. If you want an unweighted PageRank, simply set the weights on the edges to 1.

    Input arguments:

      d -- damping factor
      n -- number of nodes in the graph
      currentIteration -- start with 0 probably
      maxIterations -- stop after n iterations
      convergenceThreshold -- using the sum of the absolute difference between iteration solutions, iterating stops once we reach this threshold
      rootDir -- the root directory holding all starting, intermediate and final data/output

    The expected structure of the rootDir is:

      rootDir
        |- iterations
        |  |- 0       <-- a TSV of (row, value) of size n, value can be 1/n (generate this)
        |  |- n       <-- holds future iterations/solutions
        |- edges      <-- a TSV of (row, column, value) for edges in the graph
        |- onesVector <-- a TSV of (row, 1) of size n (generate this)
        |- diff       <-- a single line representing the difference between the last iterations
        |- constants  <-- built at iteration 0, these are constant for any given matrix/graph
           |- M_hat
           |- priorVector

    Don't forget to set the number of reducers for this job: -D mapred.reduce.tasks=n
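    Two of the files under rootDir are marked "generate this". The following plain-Scala sketch shows one way to produce their contents, plus the sum-of-absolute-differences measure that convergenceThreshold is described against. The object and method names are hypothetical, and numbering rows from 0 until n is an assumption; use whatever row ids your edges file uses.

```scala
object MatrixPageRankInputs {
  // Lines for iterations/0: a TSV of (row, value) with the uniform start value 1/n.
  // Assumption: rows are numbered 0 until n.
  def initialIteration(n: Int): Seq[String] =
    (0 until n).map(row => s"$row\t${1.0 / n}")

  // Lines for onesVector: a TSV of (row, 1) of size n.
  def onesVector(n: Int): Seq[String] =
    (0 until n).map(row => s"$row\t1")

  // Sum of the absolute differences between two iteration solutions,
  // the quantity the description compares against convergenceThreshold.
  def sumAbsDiff(prev: Map[Int, Double], next: Map[Int, Double]): Double =
    (prev.keySet ++ next.keySet).toSeq // toSeq so equal differences are not deduplicated
      .map(row => math.abs(prev.getOrElse(row, 0.0) - next.getOrElse(row, 0.0)))
      .sum
}
```

    The returned lines can be written to rootDir/iterations/0 and rootDir/onesVector with any file API before launching the job at currentIteration 0.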

  5. class WordCountJob extends Job

Value Members

  1. object KMeans
