public abstract class MRTask<T extends MRTask<T>> extends DTask<T> implements ForkJoinPool.ManagedBlocker
MRTask provides several map and reduce methods that can be overridden to specify a computation. Several instances of this class will be created to distribute the computation over F/J threads and machines. Non-transient fields are copied and serialized to instances created for map invocations. Reduce methods can store their results in fields. Results are serialized and reduced all the way back to the invoking node. When the last reduce method has been called, the fields of the initial MRTask instance contain the computation results.
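The field semantics described above can be illustrated with a minimal, JDK-only sketch. It is not the H2O implementation: the class CountAbove, its copy() helper, and the use of Java serialization as the copy mechanism are all assumptions made for illustration. The sketch shows a non-transient input field reaching every copy, a transient field being dropped, and the reduced result ending up in the fields of the initial instance.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class FieldCopySketch {
    // Hypothetical stand-in for an MRTask subclass; plain JDK, not the H2O API.
    static class CountAbove implements Serializable {
        private static final long serialVersionUID = 1L;
        final double threshold;      // input: non-transient, so every copy sees it
        long count;                  // result: filled in by map, merged by reduce
        transient double[] scratch;  // transient: dropped when the task is copied

        CountAbove(double threshold) { this.threshold = threshold; }

        // Analogue of map: count the values in one chunk above the threshold.
        void map(double[] chunk) {
            for (double v : chunk) if (v > threshold) count++;
        }

        // Analogue of reduce: fold another instance's result into this one.
        void reduce(CountAbove other) { count += other.count; }

        // Copy through Java serialization, which skips transient fields.
        CountAbove copy() {
            try {
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                    out.writeObject(this);
                }
                try (ObjectInputStream in = new ObjectInputStream(
                        new ByteArrayInputStream(bytes.toByteArray()))) {
                    return (CountAbove) in.readObject();
                }
            } catch (IOException | ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }

        // One copy per chunk, map on each copy, then reduce back into this instance.
        CountAbove doAll(double[][] chunks) {
            CountAbove[] copies = new CountAbove[chunks.length];
            for (int i = 0; i < chunks.length; i++) copies[i] = copy();
            for (int i = 0; i < chunks.length; i++) copies[i].map(chunks[i]);
            for (CountAbove c : copies) reduce(c);
            return this;
        }
    }

    public static void main(String[] args) {
        CountAbove task = new CountAbove(2.0).doAll(new double[][] {{1, 3}, {4, 0}, {5}});
        System.out.println(task.count);
    }
}
```

In this sketch the copies are made before any map runs, so each copy starts with an empty result field; only reduce writes into the initial instance.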
Apart from the small reduced POJO returned to the calling node, an MRTask can produce one or more output vectors as a result. Their chunks are co-located with those of the input dataset, but their number of rows will generally differ, so they won't be strictly compatible with the original. To produce output vectors, call the doAll or dfork variant that takes the required number of outputs, and override the map variant that takes the matching number of NewChunks. MRTask will automatically close the new AppendableVecs and produce an output frame with the newly created Vecs.
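As a rough illustration of the output-vector idea, the JDK-only sketch below maps each input chunk to one output chunk whose row count may differ. GrowableChunk is a hypothetical stand-in for NewChunk, and plain double arrays stand in for Chunk; none of this is the H2O API.

```java
import java.util.ArrayList;
import java.util.List;

public class OutputChunkSketch {
    // Hypothetical stand-in for NewChunk: an append-only buffer of doubles.
    static class GrowableChunk {
        private final List<Double> rows = new ArrayList<>();

        void addNum(double v) { rows.add(v); }

        // "Closing" freezes the buffer into a fixed array.
        double[] close() {
            return rows.stream().mapToDouble(Double::doubleValue).toArray();
        }
    }

    // Analogue of map(Chunk c, NewChunk nc): keep only the positive rows.
    static void map(double[] in, GrowableChunk out) {
        for (double v : in) if (v > 0) out.addNum(v);
    }

    // Runs map once per input chunk; output chunk i is built from input chunk i only,
    // so the chunk layout lines up while the row counts may not.
    static double[][] doAll(double[][] inputChunks) {
        double[][] outputChunks = new double[inputChunks.length][];
        for (int i = 0; i < inputChunks.length; i++) {
            GrowableChunk nc = new GrowableChunk();
            map(inputChunks[i], nc);
            outputChunks[i] = nc.close();
        }
        return outputChunks;
    }

    public static void main(String[] args) {
        double[][] out = doAll(new double[][] {{1, -2, 3}, {-4}, {5, 6}});
        for (double[] chunk : out) System.out.println(chunk.length);
    }
}
```

The output has one chunk per input chunk, but a filter like this one changes how many rows each chunk holds, which is why the result is not row-compatible with the input.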
Overview
Distributed computation may be invoked by an MRTask instance via the doAll, dfork, or asyncExec calls. A call to doAll is blocking; internally it passes control to dfork and asyncExec, both of which are non-blocking.
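The split between a blocking entry point and the non-blocking calls it delegates to can be sketched with the JDK fork/join classes. The method names below mirror doAll, dfork, and getResult, but SumTask and the method bodies are assumptions for illustration, not the H2O code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class BlockingVsForkSketch {
    // Hypothetical task; method names mirror MRTask, bodies are plain JDK.
    static class SumTask extends RecursiveAction {
        private final long[] data;
        long sum;  // result field, valid once the task has completed

        SumTask(long[] data) { this.data = data; }

        @Override
        protected void compute() {
            long s = 0;
            for (long v : data) s += v;
            sum = s;
        }

        // Non-blocking: hand the task to the pool and return 'this' immediately.
        SumTask dfork() {
            ForkJoinPool.commonPool().execute(this);
            return this;
        }

        // Blocking: wait for a previously forked task to finish.
        SumTask getResult() {
            join();
            return this;
        }

        // Blocking convenience: fork, then wait.
        SumTask doAll() {
            return dfork().getResult();
        }
    }

    public static void main(String[] args) {
        SumTask pending = new SumTask(new long[] {1, 2, 3, 4}).dfork();
        // Other work could run here while the task is in flight.
        System.out.println(pending.getResult().sum);
    }
}
```

A caller that has nothing else to do uses doAll; a caller that wants to overlap other work forks first and calls getResult only when the result is needed.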
Computation only occurs on instances of Frame, Vec, and Key. The amount of work to do depends on the mode of computation: the first mode is over an array of Chunk instances; the second is over an array of Key instances. In both modes, the computation paradigm is divide-conquer-combine using ForkJoin, which is implemented by the compute2 call.
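The divide-conquer-combine pattern can be sketched with java.util.concurrent.CountedCompleter. This is an assumption-laden illustration: RowCount, its fields, and the splitting rule are hypothetical, and the sketch uses the JDK class rather than whichever CountedCompleter variant H2O builds on. A range of chunks is split into left and right halves, a single chunk is handled directly, and onCompletion combines the two halves into the parent.

```java
import java.util.concurrent.CountedCompleter;

public class DivideConquerSketch {
    // Hypothetical task: counts the rows held in chunks[lo, hi).
    static class RowCount extends CountedCompleter<Void> {
        final int[][] chunks;
        final int lo, hi;     // range of chunks this instance is responsible for
        RowCount left, rite;  // sub-tasks for the two halves, if the range was split
        long rows;            // result field

        RowCount(RowCount parent, int[][] chunks, int lo, int hi) {
            super(parent);
            this.chunks = chunks;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        public void compute() {
            if (hi - lo > 1) {
                // Divide: split the range and fork both halves.
                int mid = (lo + hi) >>> 1;
                left = new RowCount(this, chunks, lo, mid);
                rite = new RowCount(this, chunks, mid, hi);
                setPendingCount(2);  // must be set before forking
                left.fork();
                rite.fork();
            } else if (hi - lo == 1) {
                // Conquer: the "map" step on a single chunk.
                rows = chunks[lo].length;
            }
            tryComplete();
        }

        @Override
        public void onCompletion(CountedCompleter<?> caller) {
            // Combine: runs after both halves have completed.
            if (left != null) rows += left.rows;
            if (rite != null) rows += rite.rows;
        }
    }

    static long countRows(int[][] chunks) {
        RowCount root = new RowCount(null, chunks, 0, chunks.length);
        root.invoke();  // blocks until the whole tree has completed
        return root.rows;
    }

    public static void main(String[] args) {
        System.out.println(countRows(new int[][] {{1, 2}, {3}, {4, 5, 6}, {}}));
    }
}
```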
MRTask Method Overriding
Computation is tailored primarily by overriding the map call, with any additional customization done by overriding the reduce, setupLocal, closeLocal, or postGlobal calls. An overridden setupLocal is invoked during the call to setupLocal0.
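To make the role of each overridable hook concrete, here is a JDK-only sketch of one plausible calling order, inferred from the method descriptions on this page rather than taken from the H2O source. MiniTask, fresh(), and the nested-array representation of nodes and chunks are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class HookOrderSketch {
    // Hypothetical base class with the same hook names as MRTask; not the H2O code.
    static abstract class MiniTask<T extends MiniTask<T>> {
        abstract T fresh();                // new empty instance (stands in for copying)
        protected void setupLocal() {}     // once per node, before any map
        abstract void map(double[] chunk); // once per chunk
        void reduce(T other) {}            // combine two partial results
        protected void closeLocal() {}     // once per node, after its maps
        protected void postGlobal() {}     // once, after the final reduce

        // nodes[n][c] is chunk c held by simulated node n.
        @SuppressWarnings("unchecked")
        T doAll(double[][][] nodes) {
            for (double[][] node : nodes) {
                T local = fresh();
                local.setupLocal();
                for (double[] chunk : node) {
                    T perChunk = fresh();
                    perChunk.map(chunk);
                    local.reduce(perChunk);
                }
                local.closeLocal();
                reduce(local);
            }
            postGlobal();
            return (T) this;
        }
    }

    // Example subclass: computes a mean and records the order of hook calls.
    static class Mean extends MiniTask<Mean> {
        final List<String> log;  // shared trace, for illustration only
        double sum;
        long rows;
        double mean;

        Mean(List<String> log) { this.log = log; }

        @Override Mean fresh() { return new Mean(log); }
        @Override protected void setupLocal() { log.add("setupLocal"); }
        @Override void map(double[] chunk) {
            log.add("map");
            for (double v : chunk) { sum += v; rows++; }
        }
        @Override void reduce(Mean o) { sum += o.sum; rows += o.rows; }
        @Override protected void closeLocal() { log.add("closeLocal"); }
        @Override protected void postGlobal() {
            log.add("postGlobal");
            mean = rows == 0 ? Double.NaN : sum / rows;
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Mean m = new Mean(log).doAll(new double[][][] {{{1, 2}, {3}}, {{4}}});
        System.out.println(m.mean + " " + log);
    }
}
```

The design point is that map and reduce carry the computation, while setupLocal, closeLocal, and postGlobal are optional hooks for per-node setup, per-node cleanup, and a final step on the combined result.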
Nested classes inherited from class DTask: DTask.DKeyTask<T extends DTask.DKeyTask,V extends Keyed>, DTask.RemoveCall
Modifier and Type | Field and Description |
---|---|
protected AppendableVec[] | _appendables: Appendables are treated separately (roll-ups computed in map/reduce style, cannot be passed via K/V store). |
Frame | _fr: This Frame instance is the handle for computation over a set of Vec instances. |
protected Futures | _fs: We can add more things to block on, in case we want a bunch of lazy tasks produced by children to all end before this top-level task ends. |
protected int | _hi: Internal field to track a range of local Chunks to work on. |
Key[] | _keys: This Key[] instance is the handle used for computation when an MRTask is invoked over an array of Key instances. |
protected T | _left: Internal field to track the left & right sub-range of chunks to work on. |
protected int | _lo: Internal field to track a range of local Chunks to work on. |
protected short | _nhi: The range of Nodes to work on remotely. |
protected RPC<T> | _nleft: Internal field to track the left & right remote nodes/JVMs to work on. |
protected short | _nlo: The range of Nodes to work on remotely. |
protected RPC<T> | _nrite: Internal field to track the left & right remote nodes/JVMs to work on. |
protected T | _rite: Internal field to track the left & right sub-range of chunks to work on. |
protected boolean | _run_local: If true, run entirely local, which will pull all the data locally. |
protected boolean | _topLocal: Internal field to track if this is a top-level local call. |
Fields inherited from class DTask: _ex, _modifiesInputs
Modifier | Constructor and Description |
---|---|
public | MRTask() |
protected | MRTask(H2O.H2OCountedCompleter cmp) |
Modifier and Type | Method and Description |
---|---|
AppendableVec[] | appendables(): Accessor for the protected array of AppendableVec instances. |
T | asyncExec(Frame fr) |
void | asyncExec(int outputs, Frame fr, boolean run_local): Fork the task in strictly non-blocking fashion. |
void | asyncExec(Key... keys) |
T | asyncExec(Vec... vecs) |
void | asyncExecOnAllNodes() |
boolean | block(): Possibly blocks the current thread, for example waiting for a lock or condition. |
protected void | closeLocal(): Override to do any remote cleaning on the last remote instance of this object, for disposing of node-local shared data structures. |
void | compute2(): Called from FJ threads to do local work. |
T | dfork(Frame fr): Invokes the map/reduce computation over the given Frame instance. |
T | dfork(int outputs, Frame fr, boolean run_local): Invokes the map/reduce computation over the given Frame instance and produces outputs Vec instances. |
T | dfork(int outputs, Vec... vecs): Invokes the map/reduce computation over the given Vec instances and produces outputs Vec instances. |
T | dfork(Vec... vecs): Invokes the map/reduce computation over the given array of Vec instances. |
void | dinvoke(H2ONode sender): Called once on remote at top level, probably with a subset of the cloud. |
T | doAll(Frame fr) |
T | doAll(Frame fr, boolean run_local): Invokes the map/reduce computation over the given Frame. |
T | doAll(int outputs, Frame fr) |
T | doAll(int outputs, Frame fr, boolean run_local) |
T | doAll(int outputs, Vec... vecs) |
T | doAll(int outputs, Vec vec, boolean run_local) |
T | doAll(Key... keys) |
T | doAll(Vec... vecs): Invokes the map/reduce computation over the given Vecs. |
T | doAll(Vec vec, boolean run_local) |
T | doAllNodes() |
T | getResult(): Block for & get any final results from a dfork'd MRTask. |
boolean | isReleasable(): Returns true if blocking is unnecessary. |
void | map(Chunk c): Override with your map implementation. |
void | map(Chunk[] cs): Override with your map implementation. |
void | map(Chunk[] cs, NewChunk nc) |
void | map(Chunk[] cs, NewChunk[] ncs) |
void | map(Chunk[] cs, NewChunk nc1, NewChunk nc2) |
void | map(Chunk c0, Chunk c1): Override with your map implementation. |
void | map(Chunk c0, Chunk c1, Chunk c2): Override with your map implementation. |
void | map(Chunk c0, Chunk c1, Chunk c2, NewChunk nc) |
void | map(Chunk c0, Chunk c1, Chunk c2, NewChunk nc1, NewChunk nc2) |
void | map(Chunk c0, Chunk c1, NewChunk nc) |
void | map(Chunk c0, Chunk c1, NewChunk nc1, NewChunk nc2) |
void | map(Chunk c, NewChunk nc) |
void | map(Key key): Override with your map implementation. |
void | onCompletion(CountedCompleter caller): OnCompletion - reduce the left & right into self. |
boolean | onExceptionalCompletion(java.lang.Throwable ex, CountedCompleter caller): Cancel/kill all work as we can, then rethrow... |
Frame | outputFrame(): Get the resulting Frame from this invoked MRTask. |
Frame | outputFrame(Key key, java.lang.String[] names, java.lang.String[][] domains): Get the resulting Frame from this invoked MRTask. |
Frame | outputFrame(java.lang.String[] names, java.lang.String[][] domains): Get the resulting Frame from this invoked MRTask. |
protected void | postGlobal() |
protected void | postLocal(): Override to perform cleanup of large input arguments before sending over the wire. |
byte | priority() |
java.lang.String | profString() |
void | reduce(T mrt): Override to combine results from 'mrt' into 'this' MRTask. |
protected T | self() |
void | setProfile(boolean b) |
protected void | setupLocal(): Override to do any remote initialization on the 1st remote instance of this object, for initializing node-local shared data structures. |
Methods inherited from class DTask: copyOver, getDException, hasException, logVerbose, onAck, onAckAck, setException
Methods inherited from class H2O.H2OCountedCompleter: clone, compute, frozenType, icer, nextThrPriority, read_impl, read, readJSON_impl, readJSON, write_impl, write, writeJSON_impl, writeJSON
Methods inherited from class CountedCompleter: addToPendingCount, compareAndSetPendingCount, complete, exec, getCompleter, getPendingCount, getRawResult, setCompleter, setPendingCount, setRawResult, tryComplete
Methods inherited from class ForkJoinTask: adapt, adapt, adapt, cancel, compareAndSetForkJoinTaskTag, completeExceptionally, fork, get, get, getException, getForkJoinTaskTag, getPool, getQueuedTaskCount, getSurplusQueuedTaskCount, helpQuiesce, inForkJoinPool, invoke, invokeAll, invokeAll, invokeAll, isCancelled, isCompletedAbnormally, isCompletedNormally, isDone, join, peekNextLocalTask, pollNextLocalTask, pollTask, quietlyComplete, quietlyInvoke, quietlyJoin, reinitialize, setForkJoinTaskTag, tryUnfork
public Frame _fr
This Frame instance is the handle for computation over a set of Vec instances. This includes any invocation of doAll with Frame and Vec[] instances. Top-level calls to doAll wrap Vec instances into a new Frame instance and set this into _fr during a call to asyncExec.
public Key[] _keys
This Key[] instance is the handle used for computation when an MRTask is invoked over an array of Key instances.
protected AppendableVec[] _appendables
protected transient RPC<T extends MRTask<T>> _nleft
protected transient RPC<T extends MRTask<T>> _nrite
protected transient boolean _topLocal
protected transient T extends MRTask<T> _left
protected transient T extends MRTask<T> _rite
protected short _nlo
protected short _nhi
protected transient int _lo
protected transient int _hi
protected transient Futures _fs
protected boolean _run_local
public MRTask()
protected MRTask(H2O.H2OCountedCompleter cmp)
public AppendableVec[] appendables()
Accessor for the protected array of AppendableVec instances. If _noutputs is 0, then the return result is null. Additionally, if outputFrame is not called and _noutputs > 0, then these AppendableVec instances must be closed by the caller.
public java.lang.String profString()
public void setProfile(boolean b)
public byte priority()
priority in class H2O.H2OCountedCompleter<T extends MRTask<T>>
public Frame outputFrame()
public Frame outputFrame(java.lang.String[] names, java.lang.String[][] domains)
Get the resulting Frame from this invoked MRTask.
Parameters:
names - The names of the columns in the resulting Frame.
domains - The domains of the columns in the resulting Frame.
public Frame outputFrame(Key key, java.lang.String[] names, java.lang.String[][] domains)
Get the resulting Frame from this invoked MRTask. If key is not null, then the resulting Frame will appear in the DKV. AppendableVec instances are closed into Vec instances, which then appear in the DKV.
Parameters:
key - If null, then the Frame will not appear in the DKV. Otherwise, this result will appear in the DKV under this key.
names - The names of the columns in the resulting Frame.
domains - The domains of the columns in the resulting Frame.
public void map(Chunk c)
public void map(Chunk c0, Chunk c1)
public void map(Chunk c0, Chunk c1, Chunk c2)
public void map(Chunk[] cs)
public void map(Key key)
public void reduce(T mrt)
protected void setupLocal()
protected void closeLocal()
protected T self()
public final T doAll(Vec... vecs)
public final T doAll(Frame fr, boolean run_local)
public void asyncExec(Key... keys)
public T doAllNodes()
public void asyncExecOnAllNodes()
public T dfork(Frame fr)
Invokes the map/reduce computation over the given Frame instance. This call is asynchronous. It returns 'this', on which getResult may be invoked by the caller to block for pending computation to complete. This call produces no output Vec instances or Frame instances.
Parameters:
fr - Perform the computation on this Frame instance.
public final T dfork(Vec... vecs)
Invokes the map/reduce computation over the given array of Vec instances. This call is asynchronous. It returns 'this', on which getResult may be invoked by the caller to block for pending computation to complete. This call produces no output Vec instances or Frame instances.
Parameters:
vecs - Perform the computation on this array of Vec instances.
public final T dfork(int outputs, Vec... vecs)
Invokes the map/reduce computation over the given Vec instances and produces outputs Vec instances. This call is asynchronous. It returns 'this', on which getResult may be invoked by the caller to block for pending computation to complete.
Parameters:
outputs - The number of output Vec instances to create.
vecs - The input set of Vec instances upon which computation is performed.
public final T dfork(int outputs, Frame fr, boolean run_local)
Invokes the map/reduce computation over the given Frame instance and produces outputs Vec instances. This call is asynchronous. It returns 'this', on which getResult may be invoked by the caller to block for pending computation to complete.
Parameters:
outputs - The number of output Vec instances to create.
fr - The input Frame instance upon which computation is performed.
run_local - If true, then all data is pulled to the calling H2ONode and all computation is performed locally. If false, then each H2ONode performs computation over its own node-local data.
public final void asyncExec(int outputs, Frame fr, boolean run_local)
public final T getResult()
public boolean isReleasable()
Description copied from interface: ForkJoinPool.ManagedBlocker
Returns true if blocking is unnecessary.
isReleasable in interface ForkJoinPool.ManagedBlocker
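For readers unfamiliar with the ForkJoinPool.ManagedBlocker contract that isReleasable and block implement, the following is a small JDK-only sketch. LatchBlocker and the use of a CountDownLatch as the "task finished" signal are assumptions for illustration; how MRTask itself decides that blocking is unnecessary is not shown on this page.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;

public class ManagedBlockerSketch {
    // Hypothetical blocker: the latch stands in for "the task has finished".
    static class LatchBlocker implements ForkJoinPool.ManagedBlocker {
        private final CountDownLatch done;

        LatchBlocker(CountDownLatch done) { this.done = done; }

        // Cheap check: true means there is no need to block at all.
        @Override
        public boolean isReleasable() {
            return done.getCount() == 0;
        }

        // Actually waits; returning true says no further blocking is needed.
        @Override
        public boolean block() throws InterruptedException {
            done.await();
            return true;
        }
    }

    // Waits through managedBlock, so a fork/join pool may compensate
    // for the blocked worker if this is called from one of its threads.
    static void awaitDone(CountDownLatch done) {
        try {
            ForkJoinPool.managedBlock(new LatchBlocker(done));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Starts a helper thread that releases the latch, then waits for it.
    static boolean runDemo() {
        CountDownLatch done = new CountDownLatch(1);
        new Thread(done::countDown).start();
        awaitDone(done);
        return done.getCount() == 0;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```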
public boolean block() throws java.lang.InterruptedException
Description copied from interface: ForkJoinPool.ManagedBlocker
Possibly blocks the current thread, for example waiting for a lock or condition.
block in interface ForkJoinPool.ManagedBlocker
Returns:
true if no additional blocking is necessary (i.e., if isReleasable would return true)
Throws:
java.lang.InterruptedException - if interrupted while waiting (the method is not required to do so, but is allowed to)
public final void dinvoke(H2ONode sender)
public final void compute2()
compute2 in class H2O.H2OCountedCompleter<T extends MRTask<T>>
public final void onCompletion(CountedCompleter caller)
onCompletion in class CountedCompleter
Parameters:
caller - the task invoking this method (which may be this task itself).
protected void postGlobal()
protected void postLocal()
public final boolean onExceptionalCompletion(java.lang.Throwable ex, CountedCompleter caller)
onExceptionalCompletion in class H2O.H2OCountedCompleter<T extends MRTask<T>>
Parameters:
ex - the exception
caller - the task invoking this method (which may be this task itself).