Copies the tuple, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a ref to this tuple).
Loads the matrix M^ (M_hat), or generates it from A on the first iteration.
Copies the tupleEntry, since Cascading may change it after the end of an operation (and it is not safe to assume the consumer has not kept a ref to this tuple).
By default we only set two keys: io.serializations and cascading.tuple.element.comparator.default. To set more options, override this in a subclass, call the base and ++ your additional map.
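The override pattern described here could look roughly like the sketch below. `BaseJob` is a stand-in written for this example, not Scalding's actual class, and the values stored under the two default keys are placeholders. The extra key `mapred.reduce.tasks` is used only as an illustration.

```scala
object ConfigSketch {
  // Hypothetical stand-in for the base class; not Scalding's real Job.
  class BaseJob {
    // The two default keys named in the text; the values are placeholders.
    def config: Map[AnyRef, AnyRef] = Map(
      "io.serializations" -> "placeholder.Serialization",
      "cascading.tuple.element.comparator.default" -> "placeholder.Comparator"
    )
  }

  class MyJob extends BaseJob {
    // Illustrative extra option; any additional keys would go here.
    private val extra: Map[AnyRef, AnyRef] = Map("mapred.reduce.tasks" -> "10")

    // Call the base implementation, then ++ the additional map.
    override def config: Map[AnyRef, AnyRef] = super.config ++ extra
  }
}
```

Because `++` lets the right-hand map win on duplicate keys, the same pattern can also replace one of the defaults.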
Rather than give the full power of Cascading's selectors, we have a simpler set of rules encoded below:
1) If the input is non-definite (ALL, GROUP, ARGS, etc.), ALL is the output. Perhaps only fromFields=ALL will make sense.
2) If one of from or to is a strict superset of the other, SWAP is used.
3) If they are equal, REPLACE is used.
4) Otherwise, ALL is used.
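The four rules can be sketched with plain sets, as below. This is a simplified model, not the library's implementation: `Selector` and its cases stand in for Cascading's field selectors, and a non-definite input is modelled as `None` (an assumption made for this example).

```scala
object SelectorRules {
  // Simplified stand-ins for Cascading's selectors; not the real Fields class.
  sealed trait Selector
  case object All extends Selector
  case object Swap extends Selector
  case object Replace extends Selector

  // from = None models a non-definite input (ALL, GROUP, ARGS, ...).
  def defaultMode(from: Option[Set[String]], to: Set[String]): Selector =
    from match {
      case None                                        => All     // rule 1
      case Some(f) if f == to                          => Replace // rule 3
      case Some(f) if f.subsetOf(to) || to.subsetOf(f) => Swap    // rule 2
      case Some(_)                                     => All     // rule 4
    }
}
```

Equality is checked before the subset test so that the remaining subset matches are strict, which is what rule 2 asks for.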
Multi-entry fields. These are higher priority than Product conversions so that List will not conflict with Product.
Measure convergence by calculating the total of the absolute difference between the previous and next vectors. This stores the result after calculation.
Recurse and iterate again iff we are under the max number of iterations and vector has not converged.
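As an in-memory analogue of the convergence check and the recursion rule, the sketch below may help. It is not the job's code: the vector is modelled as a `Map[Int, Double]`, `step` is a hypothetical function standing in for one iteration, and nothing is written to disk.

```scala
object ConvergenceSketch {
  type Vec = Map[Int, Double]

  // Total of the absolute differences between the previous and next vectors.
  // toSeq keeps equal per-row differences from collapsing, as they would in a Set.
  def l1Diff(prev: Vec, next: Vec): Double =
    (prev.keySet ++ next.keySet).toSeq
      .map(k => math.abs(next.getOrElse(k, 0.0) - prev.getOrElse(k, 0.0)))
      .sum

  // Recurse iff we are under maxIterations and the vector has not converged.
  // Returns the last vector and the number of iterations performed.
  @annotation.tailrec
  def iterate(step: Vec => Vec, current: Vec, iteration: Int,
              maxIterations: Int, threshold: Double): (Vec, Int) = {
    val next = step(current)
    val diff = l1Diff(current, next)
    if (iteration + 1 < maxIterations && diff > threshold)
      iterate(step, next, iteration + 1, maxIterations, threshold)
    else
      (next, iteration + 1)
  }
}
```

For example, `ConvergenceSketch.iterate(v => v.map { case (k, x) => k -> x / 2 }, Map(1 -> 1.0), 0, 100, 0.2)` stops after three iterations, once the difference drops to 0.125.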
You should never call these directly; they are here to make the DSL work. Just know that you can treat a Pipe as a RichPipe and vice versa within a Job.
Useful to convert f : Any* to Fields. This handles mixed cases ("hey", 'you). Not sure we should be this flexible, but given that Cascading will throw an exception before scheduling the job, I guess this is okay.
Loads the prior vector, or generates it from d and n on the first iteration.
Handles treating any TupleN as a Fields object. This is low priority because List is also a Product, but this method will not work for List (because List is Product2(head, tail), and so productIterator won't work as expected). Lists are handled by an implicit in FieldConversions, which has higher priority.
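The difference between a tuple and a List under `productIterator` can be seen in a small standalone sketch. `elements` is a hypothetical helper written for this example, not part of the library.

```scala
object ProductSketch {
  // Flatten a value into its elements via productIterator,
  // the way one might for a TupleN.
  def elements(x: Any): List[Any] = x match {
    case p: Product => p.productIterator.toList
    case other      => List(other)
  }
}

// A tuple yields each of its entries:
//   ProductSketch.elements(("a", "b", "c"))  gives List("a", "b", "c")
// A non-empty List is a cons cell, so it yields only head and tail:
//   ProductSketch.elements(List(1, 2, 3))    gives List(1, List(2, 3))
```

This is why a separate, higher-priority conversion is needed for Lists.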
'* means Fields.ALL, otherwise we take the .name
A weighted PageRank implementation using the Scalding Matrix API. This assumes that all rows and columns are of type Int and values or edge weights are Double. If you want an unweighted PageRank, simply set the weights on the edges to 1.

Input arguments:
d -- damping factor
n -- number of nodes in the graph
currentIteration -- start with 0 probably
maxIterations -- stop after this many iterations
convergenceThreshold -- using the sum of the absolute difference between iteration solutions, iterating stops once we reach this threshold
rootDir -- the root directory holding all starting, intermediate and final data/output
The expected structure of the rootDir is:
rootDir
  |- iterations
  |  |- 0          <-- a TSV of (row, value) of size n, value can be 1/n (generate this)
  |  |- n          <-- holds future iterations/solutions
  |- edges         <-- a TSV of (row, column, value) for edges in the graph
  |- onesVector    <-- a TSV of (row, 1) of size n (generate this)
  |- diff          <-- a single line representing the difference between the last iterations
  |- constants     <-- built at iteration 0, these are constant for any given matrix/graph
     |- M_hat
     |- priorVector
Don't forget to set the number of reducers for this job: -D mapred.reduce.tasks=n
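The two inputs marked "generate this" in the rootDir structure could be produced along the lines of the sketch below. It only builds the TSV lines and leaves writing them out to the caller. It assumes rows are numbered 0 to n-1, which the description does not specify, so adjust the range to match the row ids used in edges.

```scala
object PageRankInputs {
  // TSV lines of (row, value) for iterations/0: every row starts at 1/n.
  // Assumption: rows are numbered 0 until n.
  def initialVector(n: Int): Seq[String] =
    (0 until n).map(row => s"$row\t${1.0 / n}")

  // TSV lines of (row, 1) for onesVector, using the same row numbering.
  def onesVector(n: Int): Seq[String] =
    (0 until n).map(row => s"$row\t1")
}
```

For n = 4, `initialVector` yields the rows 0 to 3, each paired with 0.25.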