The basic idea is to groupBy the dst key with BOTH the nodeset and the edge rows. The nodeset rows carry the old page-rank, and the edge rows are reversed, so each destination receives the incoming page-rank from the nodes that point to it.
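A minimal sketch of one such update in plain Python (not the Scalding API), to make the grouping concrete. The function name `pagerank_step`, the in-memory dict layout, and the update rule `jump_prob + (1 - jump_prob) * incoming` are assumptions for illustration; the rule assumes ranks are scaled to average 1.0, which is consistent with an initial rank of 1.0.

```python
from collections import defaultdict


def pagerank_step(nodes, jump_prob=0.15):
    """One page-rank update.

    nodes maps node_id -> (list of out-neighbor ids, old rank).
    Returns a dict of the same shape holding the new ranks.
    """
    # Reverse the edges: every src sends an equal share of its old rank
    # to each dst, and the shares are summed per dst key.
    incoming = defaultdict(float)
    for src, (neighbors, rank) in nodes.items():
        for dst in neighbors:
            incoming[dst] += rank / len(neighbors)

    # Combine with the nodeset rows so that nodes without in-edges are
    # kept and the neighbor lists are carried forward unchanged.
    return {
        node: (neighbors, jump_prob + (1.0 - jump_prob) * incoming[node])
        for node, (neighbors, _old_rank) in nodes.items()
    }
```

Iterating over the nodeset in the last step plays the role of the nodeset rows in the groupBy: a node that nobody points to still gets an output row, with only the jump contribution.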
Override this function to change how you generate a pipe of (Long, String, Double), where the first entry is the nodeid, the second is the list of neighbors as a comma-separated (no spaces) string of numeric nodeids, and the third is the initial page rank (if not starting from a previous run, this should be 1.0).
NOTE: if you want to run until convergence, the initialize method must read the same EXACT format as the output method writes. This is your job!
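To illustrate the format requirement, here is a hedged Python sketch of a reader and writer for the three-column row described above. The helper names `parse_row` and `format_row` are hypothetical, and the tab separator is assumed from the TSV format; the point is only that whatever the writer emits, the reader must accept unchanged.

```python
def parse_row(line):
    """Parse 'nodeid<TAB>n1,n2,...<TAB>rank' into (int, list of int, float)."""
    node, neighbors, rank = line.rstrip("\n").split("\t")
    # An empty second column means the node has no out-neighbors.
    ids = [int(n) for n in neighbors.split(",")] if neighbors else []
    return int(node), ids, float(rank)


def format_row(node, neighbors, rank):
    """Write a row in exactly the format parse_row reads."""
    neighbor_column = ",".join(str(n) for n in neighbors)  # commas, no spaces
    return "\t".join([str(node), neighbor_column, repr(rank)])
```

Because `format_row(*parse_row(line))` reproduces a well-formed line, the output of one run can be fed back in as the input of the next.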
Here is where we check for convergence and then run the next job if we have not converged.
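The control flow can be sketched in plain Python as follows. This is an illustration under assumptions, not the job-chaining code itself: `run_until_converged`, `step`, `error`, `threshold`, and `max_rounds` are hypothetical names, and each pass of the outer loop stands in for launching one more job of `iterations` steps.

```python
def run_until_converged(ranks, step, error, iterations=10,
                        threshold=0.01, max_rounds=100):
    """Run batches of `iterations` steps until the error drops below threshold.

    step(ranks) returns the next ranks; error(old, new) returns a float.
    Returns (final ranks, number of batches that were run).
    """
    rounds = 0
    while rounds < max_rounds:
        previous = ranks  # snapshot kept for the comparison (the temp output)
        for _ in range(iterations):
            ranks = step(ranks)
        rounds += 1
        # The error is checked once per batch, not once per step,
        # because the comparison is the expensive part.
        if error(previous, ranks) < threshold:
            break
    return ranks, rounds
```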
Options:
  --input: the three-column TSV with node, comma-sep-out-neighbors, initial pagerank (set to 1.0 first)
  --output: the name for the TSV you want to write to, in the same format as above.
Optional arguments:
  --errorOut: name of where to write the L1 error between the input page-rank and the output. If this is omitted, we don't compute the error.
  --iterations: how many iterations to run inside this job. Default is 1; 10 is about as much as cascading can handle.
  --jumpprob: probability of a random jump, default is 0.15
  --convergence: if this is set, after every "--iterations" steps, we check the error and see if we should continue. Since the error check is expensive (involving a join), you should avoid doing this too frequently. 10 iterations is probably a good number to set.
  --temp: the name where we will store a temporary output so we can compare to the previous one for convergence checking. If --convergence is set, this MUST be set as well.
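As a rough illustration of the error that --errorOut refers to, the following Python sketch joins two rank tables on node id and sums the absolute differences. The function name `l1_error` and the dict representation are assumptions, and the job may normalize the value differently (for example, by averaging over nodes).

```python
def l1_error(old_ranks, new_ranks):
    """Sum of |new - old| over node ids present in both tables.

    Both arguments map node_id -> rank. Restricting the sum to shared
    keys mirrors an inner join on the node id.
    """
    shared = old_ranks.keys() & new_ranks.keys()
    return sum(abs(new_ranks[n] - old_ranks[n]) for n in shared)
```

The join is what makes this check expensive at scale, which is why the error is only worth computing every several iterations.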