Interface | Description
---|---
VisibilityExpressionResolver | Interface to convert visibility expressions into Tags for storing along with Cells in HFiles.
Class | Description
---|---
CellCounter | A job with a map and reduce phase to count cells in a table.
CellCreator | Facade to create Cells for HFileOutputFormat.
CopyTable | Tool used to copy a table to another one which can be on a different setup.
DefaultVisibilityExpressionResolver | This implementation creates tags by expanding expression using label ordinal.
Driver | Driver for hbase mapreduce jobs.
Export | Export an HBase table.
GroupingTableMapper | Extract grouping columns from input record.
HFileOutputFormat | Deprecated. Use HFileOutputFormat2 instead.
HFileOutputFormat2 | Writes HFiles.
HLogInputFormat | Simple InputFormat for HLog files.
HRegionPartitioner<KEY,VALUE> | This is used to partition the output keys into groups of keys.
IdentityTableMapper | Pass the given key and record as-is to the reduce phase.
IdentityTableReducer |
Import | Import data written by Export.
Import.Importer | Write table content out to files in hdfs.
Import.KeyValueImporter | A mapper that just writes out KeyValues.
ImportTsv | Tool to import data from a TSV file.
ImportTsv.TsvParser |
KeyValueSerialization |
KeyValueSerialization.KeyValueDeserializer |
KeyValueSerialization.KeyValueSerializer |
KeyValueSortReducer | Emits sorted KeyValues.
LoadIncrementalHFiles | Tool to load the output of HFileOutputFormat into an existing table.
MultiTableInputFormat | Convert HBase tabular data from multiple scanners into a format that is consumable by Map/Reduce.
MultiTableInputFormatBase | A base for MultiTableInputFormats.
MultiTableOutputFormat | Hadoop output format that writes to one or more HBase tables.
MultiTableOutputFormat.MultiTableRecordWriter | Record writer for outputting to multiple HTables.
MultithreadedTableMapper<K2,V2> | Multithreaded implementation for TableMapper.
MutationSerialization |
PutCombiner<K> | Combine Puts.
PutSortReducer | Emits sorted Puts.
ResultSerialization |
RowCounter | A job with just a map phase to count rows.
SimpleTotalOrderPartitioner<VALUE> | A partitioner that takes start and end keys and uses bigdecimal to figure which reduce a key belongs to.
TableInputFormat | Convert HBase tabular data into a format that is consumable by Map/Reduce.
TableInputFormatBase | A base for TableInputFormats.
TableMapper<KEYOUT,VALUEOUT> | Extends the base Mapper class to add the required input key and value classes.
TableMapReduceUtil | Utility for TableMapper and TableReducer.
TableOutputCommitter | Small committer class that does not do anything.
TableOutputFormat<KEY> | Convert Map/Reduce output and write it to an HBase table.
TableOutputFormat.TableRecordWriter<KEY> | Writes the reducer output to an HBase table.
TableRecordReader | Iterate over an HBase table data, return (ImmutableBytesWritable, Result) pairs.
TableRecordReaderImpl | Iterate over an HBase table data, return (ImmutableBytesWritable, Result) pairs.
TableReducer<KEYIN,VALUEIN,KEYOUT> | Extends the basic Reducer class to add the required key and value input/output classes.
TableSnapshotInputFormat | TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot.
TableSnapshotInputFormat.TableSnapshotRegionSplit |
TableSnapshotInputFormatImpl | API-agnostic implementation for mapreduce over table snapshots.
TableSnapshotInputFormatImpl.InputSplit | Implementation class for InputSplit logic common between mapred and mapreduce.
TableSnapshotInputFormatImpl.RecordReader | Implementation class for RecordReader logic common between mapred and mapreduce.
TableSplit | A table split corresponds to a key range (low, high) and an optional scanner.
TextSortReducer | Emits sorted KeyValues.
TsvImporterMapper | Write table content out to files in hdfs.
TsvImporterTextMapper | Write table content out to map output files.
WALPlayer | A tool to replay WAL files as a M/R job.
Exception | Description
---|---
ImportTsv.TsvParser.BadTsvLineException |
MapReduce jobs deployed to a MapReduce cluster do not by default have access to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. You could add hbase-site.xml to $HADOOP_HOME/conf, add HBase jars to $HADOOP_HOME/lib, and copy these changes across your cluster (or edit conf/hadoop-env.sh and add them to the HADOOP_CLASSPATH variable), but this will pollute your hadoop install with HBase references; it's also obnoxious, requiring a restart of the hadoop cluster before it notices your HBase additions.
As of 0.90.x, HBase will just add its dependency jars to the job configuration; the dependencies just need to be available on the local CLASSPATH. For example, to run the bundled HBase RowCounter mapreduce job against a table named usertable, type:

```
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
```

Expand $HBASE_HOME and $HADOOP_HOME in the above appropriately to suit your local environment. The content of HADOOP_CLASSPATH is set to the HBase CLASSPATH via backticking the command `${HBASE_HOME}/bin/hbase classpath`.
When the above runs, internally, the HBase jar finds its zookeeper, guava, etc. dependencies on the passed HADOOP_CLASSPATH and adds the found jars to the mapreduce job configuration. See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
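As a hedged sketch of how a custom job driver can lean on the same helper (the job name and driver class below are placeholders, not part of this package):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "my-hbase-job");   // hypothetical job name
    job.setJarByClass(MyJobDriver.class);
    // Ship the HBase dependency jars found on the local CLASSPATH
    // (zookeeper, guava, etc.) with the job so cluster-side tasks can load them.
    TableMapReduceUtil.addDependencyJars(job);
    // ... remaining input/output/mapper configuration goes here ...
  }
}
```

Note that the initTableMapperJob/initTableReducerJob convenience methods in TableMapReduceUtil add the dependency jars for you by default, so a driver that uses them usually does not need this call.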
The above may not work if you are running your HBase from its build directory; i.e. you've done $ mvn test install at ${HBASE_HOME} and you are now trying to use this build in your mapreduce job. If you get an exception thrown like

```
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper ...
```

try doing the following:

```
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
```

Notice how we preface the backtick invocation setting HADOOP_CLASSPATH with a reference to the built HBase jar over in the target directory.
The HBase jar also serves as a Driver for some bundled mapreduce jobs. To learn about the bundled mapreduce jobs, run:

```
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster
  completebulkload: Complete a bulk data load.
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
```
HBase can be used as a data source, TableInputFormat, and data sink, TableOutputFormat or MultiTableOutputFormat, for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass TableMapper and/or TableReducer. See the do-nothing pass-through classes IdentityTableMapper and IdentityTableReducer for basic usage. For a more involved example, see RowCounter or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.
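As a rough illustration of the mapper side (the class name and emitted counter below are hypothetical, not part of the package), a TableMapper subclass receives each row key as an ImmutableBytesWritable and the row contents as a Result:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/** Illustrative mapper: emits a count of 1 for every row read from the table. */
public class MyRowCountingMapper extends TableMapper<Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text("rows");

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    // Each call to map() corresponds to one row of the scanned table.
    context.write(outKey, ONE);
  }
}
```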
Running mapreduce jobs that have HBase as source or sink, you'll need to specify source/sink table and column names in your configuration.

Reading from HBase, the TableInputFormat asks HBase for the list of regions and makes a map-per-region or mapred.map.tasks maps, whichever is smaller (if your job only has two maps, up mapred.map.tasks to a number greater than the number of regions). Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per node.
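For example, the source table and the columns to read are typically handed to the job through a Scan passed to TableMapReduceUtil.initTableMapperJob; in this minimal sketch the table name, column family, and mapper (from the sketch above) are hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class SourceJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "read-from-hbase");
    job.setJarByClass(SourceJobSetup.class);

    // The Scan carries the column selection handed to TableInputFormat.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("info"));   // hypothetical column family
    scan.setCaching(500);                    // fetch bigger batches per RPC for MR scans
    scan.setCacheBlocks(false);              // don't pollute the RegionServer block cache

    TableMapReduceUtil.initTableMapperJob(
        "usertable",                 // source table (hypothetical)
        scan,
        MyRowCountingMapper.class,   // mapper sketched above
        Text.class,                  // mapper output key class
        IntWritable.class,           // mapper output value class
        job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```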
Writing, it may make sense to avoid the reduce step and write yourself back into HBase from inside your map. You'd do this when your job does not need the sort and collation that mapreduce does on the map-emitted data; on insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your mapreduce cluster) unless you need to. If you do not need the reduce, you might just have your map emit counts of records processed so the framework's report at the end of your job has meaning, or set the number of reduces to zero and use TableOutputFormat. See the example code below. If running the reduce step makes sense in your case, it's usually better to have lots of reducers so load is spread across the HBase cluster.
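A minimal sketch of the reduce-less write path described above, assuming a hypothetical mapper that emits Put or Delete mutations keyed by row (the sink table name is also hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyWriteSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "write-to-hbase");
    job.setJarByClass(MapOnlyWriteSetup.class);

    // ... initTableMapperJob(...) or another input format goes here ...

    // Wire TableOutputFormat to the sink table; pass null for the reducer
    // because the mapper writes Puts/Deletes directly.
    TableMapReduceUtil.initTableReducerJob(
        "target_table",   // sink table (hypothetical)
        null,             // no reducer
        job);
    job.setNumReduceTasks(0);  // skip the sort/shuffle entirely

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```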
There is also a new HBase partitioner that will run as many reducers as there are currently existing regions. The HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions when done; otherwise use the default partitioner.
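If the reduce step is kept, one way to wire in this partitioner is the TableMapReduceUtil.initTableReducerJob overload that accepts a partitioner class; a hedged fragment, assuming a Job named job configured in a driver like the sketches above and a hypothetical sink table:

```java
// Route each mapper output key to the reducer for the region its row key
// falls into, so there is roughly one reducer per existing region.
TableMapReduceUtil.initTableReducerJob(
    "target_table",               // sink table (hypothetical)
    IdentityTableReducer.class,   // bundled pass-through reducer
    job,
    HRegionPartitioner.class);    // partitioner keyed on region boundaries
```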
If importing into a new table, it's possible to bypass the HBase API and write your content directly to the filesystem, properly formatted as HBase data files (HFiles). Your import will run faster, perhaps an order of magnitude faster if not more. For more on how this mechanism works, see the Bulk Loads documentation.
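A hedged sketch of that bulk-load path built around HFileOutputFormat2 (the exact configureIncrementalLoad signature varies by HBase version, and the table name and output path here are hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "bulk-load-prepare");
    job.setJarByClass(BulkLoadSetup.class);

    // ... initTableMapperJob(...) plus a mapper emitting Puts/KeyValues goes here ...

    TableName tableName = TableName.valueOf("target_table");  // hypothetical table
    try (Connection conn = ConnectionFactory.createConnection(job.getConfiguration());
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      // Configures compression, bloom filters, block size and a total-order
      // partitioner so the generated HFiles line up with the table's regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
    }

    FileOutputFormat.setOutputPath(job, new Path("/tmp/bulkload-output"));  // hypothetical path
    boolean ok = job.waitForCompletion(true);
    // Afterwards, the bundled LoadIncrementalHFiles tool (completebulkload)
    // moves the generated HFiles into the target table.
    System.exit(ok ? 0 : 1);
  }
}
```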
See RowCounter. This job uses TableInputFormat and does a count of all rows in the specified table. You should be able to run it by doing:

```
% ./bin/hadoop jar hbase-X.X.X.jar
```

This will invoke the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs offered. This will emit rowcounter 'usage'. Specify the tablename, the column to count, and the output directory. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH so the rowcounter gets pointed at the right hbase cluster (or, build a new jar with an appropriate hbase-site.xml built into your job jar).
Copyright © 2015 The Apache Software Foundation. All Rights Reserved.