Copies the tuple, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a reference to this tuple).
Copies the tupleEntry, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a reference to this tuple).
By default we only set two keys:
  io.serializations
  cascading.tuple.element.comparator.default
To set more options, override this in a subclass, call the base version, and ++ your additional map onto the result.
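The override-and-++ pattern described above can be sketched as follows. This is a minimal illustration, not the real API: `BaseJob`, `MyJob`, the `Map[String, String]` type, and the placeholder values are all assumptions made for the example.

```scala
object ConfigSketch {
  // Hypothetical stand-in for the base class; the real signature may differ.
  class BaseJob {
    def config: Map[String, String] = Map(
      // Placeholder values, only the two key names come from the text above.
      "io.serializations" -> "placeholder.Serialization",
      "cascading.tuple.element.comparator.default" -> "placeholder.Comparator"
    )
  }

  class MyJob extends BaseJob {
    // Call the base version first, then ++ the additional settings onto it.
    // On a key collision, ++ keeps the value from the right-hand map.
    override def config: Map[String, String] =
      super.config ++ Map("mapred.reduce.tasks" -> "20")
  }
}
```

Because `++` prefers the right-hand map on duplicate keys, the same pattern can also be used to replace one of the two default values.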
Rather than give the full power of Cascading's selectors, we have a simpler set of rules encoded below:
1) If the input is non-definite (ALL, GROUP, ARGS, etc.), ALL is the output. Perhaps only fromFields=ALL will make sense.
2) If one of from or to is a strict superset of the other, SWAP is used.
3) If they are equal, REPLACE is used.
4) Otherwise, ALL is used.
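The four rules above can be read as a small decision function. The sketch below models them with plain Scala sets instead of Cascading's `Fields`; the types `FieldSet` and `OutSelector` and the function name `defaultMode` are assumptions introduced for illustration, and the case where only the output is non-definite is not covered by the rules, so the fallback chosen here is a guess.

```scala
object SelectorSketch {
  // Hypothetical model of a field declaration.
  sealed trait FieldSet
  case object NonDefinite extends FieldSet // stands in for ALL, GROUP, ARGS, ...
  final case class Definite(names: Set[String]) extends FieldSet

  // Hypothetical model of the chosen output selector.
  sealed trait OutSelector
  case object All extends OutSelector
  case object Swap extends OutSelector
  case object Replace extends OutSelector

  def defaultMode(from: FieldSet, to: FieldSet): OutSelector = (from, to) match {
    case (NonDefinite, _) => All // rule 1: non-definite input
    case (Definite(f), Definite(t)) =>
      if (f == t) Replace // rule 3: equal sets
      else if (f.subsetOf(t) || t.subsetOf(f)) Swap // rule 2: strict, since f != t here
      else All // rule 4: overlapping or disjoint sets
    case (Definite(_), NonDefinite) => All // assumption: not stated in the rules
  }
}
```

For example, mapping `a` to `(a, b)` selects `Swap`, mapping `(a, b)` to `(a, b)` selects `Replace`, and mapping `a` to `b` selects `All`.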
One iteration of pagerank.
inputPagerank: <'src_id_input, 'mass_input>
return: <'src_id, 'mass_n, 'mass_input>
Here is a high-level view of the unweighted algorithm:
let
  N: number of nodes
  inputPagerank(N_i): prob of walking to node i
  d(N_j): N_j's out degree
then
  pagerankNext(N_i) = (\sum_{j points to i} inputPagerank(N_j) / d(N_j))
  deadPagerank = (1 - \sum_{i} pagerankNext(N_i)) / N
  randomPagerank(N_i) = userMass(N_i) * ALPHA + deadPagerank * (1-ALPHA)
  pagerankOutput(N_i) = randomPagerank(N_i) + pagerankNext(N_i) * (1-ALPHA)
For the weighted algorithm:
let
  w(N_j, N_i): weight from N_j to N_i
  tw(N_j): N_j's total out weights
then
  pagerankNext(N_i) = (\sum_{j points to i} inputPagerank(N_j) * w(N_j, N_i) / tw(N_j))
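The formulas above can be checked on a small in-memory graph. The following is a sketch using plain Scala collections, not the pipe-based job itself: `Edge`, `iterate`, and the map-based inputs are assumptions for the example. The unweighted case is obtained by giving every edge weight 1.0, which makes tw(N_j) equal to d(N_j).

```scala
object PageRankSketch {
  // One outgoing edge: destination node id and weight (1.0 for the unweighted case).
  final case class Edge(dst: Int, weight: Double)

  def iterate(
      edges: Map[Int, Seq[Edge]],
      input: Map[Int, Double],
      userMass: Map[Int, Double],
      alpha: Double
  ): Map[Int, Double] = {
    val n = input.size
    // pagerankNext(N_i) = sum over j -> i of inputPagerank(N_j) * w(N_j, N_i) / tw(N_j)
    val contributions = for {
      (src, out) <- edges.toSeq
      total = out.map(_.weight).sum
      if total > 0
      e <- out
    } yield e.dst -> input.getOrElse(src, 0.0) * e.weight / total
    val next = contributions.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
    // Mass lost at nodes with no out edges is spread evenly over all nodes.
    val dead = (1.0 - next.values.sum) / n
    input.keys.map { i =>
      val random = userMass.getOrElse(i, 0.0) * alpha + dead * (1 - alpha)
      i -> (random + next.getOrElse(i, 0.0) * (1 - alpha))
    }.toMap
  }
}
```

When both `input` and `userMass` sum to 1, the output also sums to 1, because the dead mass is redistributed rather than dropped.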
Multi-entry fields. These are higher priority than Product conversions so that List will not conflict with Product.
read the pregenerated nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>
the total number of nodes, single line file
test convergence; if not yet converged, kick off the next iteration
You should never call these directly; they are here to make the DSL work. Just know that you can treat a Pipe as a RichPipe and vice versa within a Job.
Useful to convert f : Any* to Fields. This handles mixed cases ("hey", 'you). Not sure we should be this flexible, but given that Cascading will throw an exception before scheduling the job, I guess this is okay.
Handles treating any TupleN as a Fields object. This is low priority because List is also a Product, but this method will not work for List (because List is Product2(head, tail), so productIterator won't work as expected). Lists are handled by an implicit in FieldConversions, which has higher priority.
'* means Fields.ALL; otherwise we take the symbol's .name
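The three conversions described above (mixed varargs, TupleN via productIterator, and the '* symbol) can be sketched without Cascading on the classpath. Everything named here (`FieldSpec`, `AllFields`, `Named`, `fromArgs`, `fromProduct`, `fromSymbol`) is a hypothetical stand-in for the real implicits, and `Symbol("you")` is used in place of the 'you literal so the sketch also compiles on newer Scala versions.

```scala
object FieldNamesSketch {
  // Hypothetical stand-in for a Fields object.
  sealed trait FieldSpec
  case object AllFields extends FieldSpec // plays the role of Fields.ALL
  final case class Named(names: List[String]) extends FieldSpec

  // A Symbol contributes its .name; anything else is rendered via toString.
  private def nameOf(x: Any): String = x match {
    case s: Symbol => s.name
    case other     => other.toString
  }

  // '* is special-cased; every other symbol is a single named field.
  def fromSymbol(s: Symbol): FieldSpec =
    if (s.name == "*") AllFields else Named(List(s.name))

  // Mixed varargs such as ("hey", 'you) are accepted element by element.
  def fromArgs(f: Any*): FieldSpec = Named(f.map(nameOf).toList)

  // Any TupleN is a Product, so productIterator walks its elements in order.
  def fromProduct(p: Product): FieldSpec =
    Named(p.productIterator.map(nameOf).toList)
}
```

In this sketch `fromArgs("hey", Symbol("you"))` and `fromProduct(("hey", Symbol("you")))` both produce a spec naming the fields hey and you.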
Weighted page rank for the given graph. Starts from the given pagerank, performs one iteration, and tests for convergence; if it has not converged yet, the job clones itself and starts the next page rank job with the updated pagerank as input.
This class is very similar to the PageRank class. The main differences are:
1. weighted pagerank is supported
2. the reset pagerank is pregenerated, possibly through a previous job
3. dead pagerank is evenly distributed
Options:
--pwd: working directory; the following files are read or generated there:
  numnodes: total number of nodes
  nodes: nodes file <'src_id, 'dst_ids, 'weights, 'mass_prior>
  pagerank: the page rank file, e.g. pagerank_0, pagerank_1, etc.
  totaldiff: the current max pagerank delta
Optional arguments:
--weighted: do weighted pagerank, default false
--curiteration: what is the current iteration, default 0
--maxiterations: how many iterations to run, default 20
--jumpprob: probability of a random jump, default 0.1
--threshold: total difference before finishing early, default 0.001
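The options and defaults listed above, together with the stop condition, can be sketched as follows. `Settings`, `parse`, and `shouldContinue` are hypothetical names, and the parser is a simplified stand-in for the job's real argument handling. The exact continue rule (one more iteration only while the iteration budget remains and the total difference exceeds the threshold) is an assumption based on the option descriptions.

```scala
object PageRankArgsSketch {
  // Defaults mirror the documented option list.
  final case class Settings(
      pwd: String,
      weighted: Boolean = false,
      curIteration: Int = 0,
      maxIterations: Int = 20,
      jumpProb: Double = 0.1,
      threshold: Double = 0.001
  )

  // Hypothetical parser for "--key value" pairs; a bare "--key" is read as "true".
  def parse(args: List[String]): Settings = {
    def loop(rest: List[String], acc: Map[String, String]): Map[String, String] =
      rest match {
        case key :: value :: tail if key.startsWith("--") && !value.startsWith("--") =>
          loop(tail, acc + (key.drop(2) -> value))
        case key :: tail if key.startsWith("--") =>
          loop(tail, acc + (key.drop(2) -> "true"))
        case _ :: tail => loop(tail, acc)
        case Nil       => acc
      }
    val m = loop(args, Map.empty)
    Settings(
      pwd = m.getOrElse("pwd", sys.error("--pwd is required")),
      weighted = m.get("weighted").exists(_.toBoolean),
      curIteration = m.get("curiteration").map(_.toInt).getOrElse(0),
      maxIterations = m.get("maxiterations").map(_.toInt).getOrElse(20),
      jumpProb = m.get("jumpprob").map(_.toDouble).getOrElse(0.1),
      threshold = m.get("threshold").map(_.toDouble).getOrElse(0.001)
    )
  }

  // Assumed rule: run another iteration only if budget remains and the delta is still large.
  def shouldContinue(s: Settings, totalDiff: Double): Boolean =
    s.curIteration + 1 < s.maxIterations && totalDiff > s.threshold
}
```

A follow-up job would then be started with `curIteration + 1` and the newly written pagerank file as its input.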