Interface | Description
---|---
VisibilityExpressionResolver | Interface to convert visibility expressions into Tags for storing along with Cells in HFiles.
Class | Description
---|---
CellCounter | A job with a map and reduce phase to count cells in a table.
CellCreator | Facade to create Cells for HFileOutputFormat.
CopyTable | Tool used to copy a table to another one which can be on a different setup.
DefaultVisibilityExpressionResolver | This implementation creates tags by expanding expression using label ordinal.
Driver | Driver for hbase mapreduce jobs.
Export | Export an HBase table.
GroupingTableMapper | Extract grouping columns from input record.
HFileOutputFormat | Deprecated. Use HFileOutputFormat2 instead.
HFileOutputFormat2 | Writes HFiles.
HLogInputFormat | Simple InputFormat for HLog files.
HRegionPartitioner<KEY,VALUE> | This is used to partition the output keys into groups of keys.
IdentityTableMapper | Pass the given key and record as-is to the reduce phase.
IdentityTableReducer |
Import | Import data written by Export.
Import.Importer | Write table content out to files in hdfs.
Import.KeyValueImporter | A mapper that just writes out KeyValues.
ImportTsv | Tool to import data from a TSV file.
ImportTsv.TsvParser |
KeyValueSerialization |
KeyValueSerialization.KeyValueDeserializer |
KeyValueSerialization.KeyValueSerializer |
KeyValueSortReducer | Emits sorted KeyValues.
LoadIncrementalHFiles | Tool to load the output of HFileOutputFormat into an existing table.
MultiTableInputFormat | Convert HBase tabular data from multiple scanners into a format that is consumable by Map/Reduce.
MultiTableInputFormatBase | A base for MultiTableInputFormats.
MultiTableOutputFormat | Hadoop output format that writes to one or more HBase tables.
MultiTableOutputFormat.MultiTableRecordWriter | Record writer for outputting to multiple HTables.
MultithreadedTableMapper<K2,V2> | Multithreaded implementation for TableMapper.
MutationSerialization |
PutCombiner<K> | Combine Puts.
PutSortReducer | Emits sorted Puts.
ResultSerialization |
RowCounter | A job with just a map phase to count rows.
SimpleTotalOrderPartitioner<VALUE> | A partitioner that takes start and end keys and uses bigdecimal to figure which reduce a key belongs to.
TableInputFormat | Convert HBase tabular data into a format that is consumable by Map/Reduce.
TableInputFormatBase | A base for TableInputFormats.
TableMapper<KEYOUT,VALUEOUT> | Extends the base Mapper class to add the required input key and value classes.
TableMapReduceUtil | Utility for TableMapper and TableReducer.
TableOutputCommitter | Small committer class that does not do anything.
TableOutputFormat<KEY> | Convert Map/Reduce output and write it to an HBase table.
TableOutputFormat.TableRecordWriter<KEY> | Writes the reducer output to an HBase table.
TableRecordReader | Iterate over an HBase table data, return (ImmutableBytesWritable, Result) pairs.
TableRecordReaderImpl | Iterate over an HBase table data, return (ImmutableBytesWritable, Result) pairs.
TableReducer<KEYIN,VALUEIN,KEYOUT> | Extends the basic Reducer class to add the required key and value input/output classes.
TableSnapshotInputFormat | TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot.
TableSnapshotInputFormat.TableSnapshotRegionSplit |
TableSnapshotInputFormatImpl | API-agnostic implementation for mapreduce over table snapshots.
TableSnapshotInputFormatImpl.InputSplit | Implementation class for InputSplit logic common between mapred and mapreduce.
TableSnapshotInputFormatImpl.RecordReader | Implementation class for RecordReader logic common between mapred and mapreduce.
TableSplit | A table split corresponds to a key range (low, high) and an optional scanner.
TextSortReducer | Emits sorted KeyValues.
TsvImporterMapper | Write table content out to files in hdfs.
TsvImporterTextMapper | Write table content out to map output files.
WALPlayer | A tool to replay WAL files as a M/R job.
Exception | Description
---|---
ImportTsv.TsvParser.BadTsvLineException |
MapReduce jobs deployed to a MapReduce cluster do not by default have access to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. You could add hbase-site.xml to $HADOOP_HOME/conf, add HBase jars to $HADOOP_HOME/lib, and copy these changes across your cluster (or edit conf/hadoop-env.sh and add them to the HADOOP_CLASSPATH variable), but this will pollute your hadoop install with HBase references; it's also obnoxious, requiring a restart of the hadoop cluster before it notices your HBase additions.
As of 0.90.x, HBase will just add its dependency jars to the job configuration; the dependencies just need to be available on the local CLASSPATH. For example, to run the bundled HBase RowCounter mapreduce job against a table named usertable, type:

```
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0.jar rowcounter usertable
```

Expand $HBASE_HOME and $HADOOP_HOME in the above appropriately to suit your local environment. The content of HADOOP_CLASSPATH is set to the HBase CLASSPATH via backticking the command `${HBASE_HOME}/bin/hbase classpath`.
When the above runs, internally, the HBase jar finds its zookeeper, guava, etc. dependencies on the passed HADOOP_CLASSPATH and adds the found jars to the mapreduce job configuration. See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
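As a hedged sketch of how a custom job driver can lean on the same helper (the job name and driver class below are placeholders, not part of this package):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "my-hbase-job");   // hypothetical job name
    job.setJarByClass(MyJobDriver.class);
    // Ship the HBase dependency jars found on the local CLASSPATH
    // (zookeeper, guava, etc.) with the job so cluster-side tasks can load them.
    TableMapReduceUtil.addDependencyJars(job);
    // ... remaining input/output/mapper configuration goes here ...
  }
}
```

Note that the initTableMapperJob/initTableReducerJob convenience methods in TableMapReduceUtil add the dependency jars for you by default, so a driver that uses them usually does not need this call.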
The above may not work if you are running your HBase from its build directory; i.e. you've done $ mvn test install at ${HBASE_HOME} and you are now trying to use this build in your mapreduce job. If you get an exception thrown like

```
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper ...
```

try doing the following:

```
$ HADOOP_CLASSPATH=${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/target/hbase-0.90.0-SNAPSHOT.jar rowcounter usertable
```

Notice how we preface the backtick invocation setting HADOOP_CLASSPATH with a reference to the built HBase jar over in the target directory.
The HBase jar also serves as a Driver for some bundled mapreduce jobs. To learn about the bundled mapreduce jobs, run:

```
$ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.0-SNAPSHOT.jar
An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster
  completebulkload: Complete a bulk data load.
  export: Write table data to HDFS.
  import: Import data written by Export.
  importtsv: Import data in TSV format.
  rowcounter: Count rows in HBase table
```
HBase can be used as a data source, TableInputFormat, and data sink, TableOutputFormat or MultiTableOutputFormat, for MapReduce jobs.
Writing MapReduce jobs that read or write HBase, you'll probably want to subclass TableMapper and/or TableReducer. See the do-nothing pass-through classes IdentityTableMapper and IdentityTableReducer for basic usage. For a more involved example, see RowCounter or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce unit test.
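As a rough illustration of the mapper side (the class name and emitted counter below are hypothetical, not part of the package), a TableMapper subclass receives each row key as an ImmutableBytesWritable and the row contents as a Result:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/** Illustrative mapper: emits a count of 1 for every row read from the table. */
public class MyRowCountingMapper extends TableMapper<Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text("rows");

  @Override
  protected void map(ImmutableBytesWritable rowKey, Result columns, Context context)
      throws IOException, InterruptedException {
    // Each call to map() corresponds to one row of the scanned table.
    context.write(outKey, ONE);
  }
}
```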
Running mapreduce jobs that have HBase as source or sink, you'll need to specify source/sink table and column names in your configuration.

Reading from HBase, the TableInputFormat asks HBase for the list of regions and makes a map-per-region or mapred.map.tasks maps, whichever is smaller (if your job only has two maps, up mapred.map.tasks to a number greater than the number of regions). Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per node.
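For example, the source table and the columns to read are typically handed to the job through a Scan passed to TableMapReduceUtil.initTableMapperJob; in this minimal sketch the table name, column family, and mapper (from the sketch above) are hypothetical:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class SourceJobSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "read-from-hbase");
    job.setJarByClass(SourceJobSetup.class);

    // The Scan carries the column selection handed to TableInputFormat.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("info"));   // hypothetical column family
    scan.setCaching(500);                    // fetch bigger batches per RPC for MR scans
    scan.setCacheBlocks(false);              // don't pollute the RegionServer block cache

    TableMapReduceUtil.initTableMapperJob(
        "usertable",                 // source table (hypothetical)
        scan,
        MyRowCountingMapper.class,   // mapper sketched above
        Text.class,                  // mapper output key class
        IntWritable.class,           // mapper output value class
        job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```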
Writing, it may make sense to avoid the reduce step and write yourself back into HBase from inside your map. You'd do this when your job does not need the sort and collation that mapreduce does on the map-emitted data; on insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your mapreduce cluster) unless you need to. If you do not need the reduce, you might just have your map emit counts of records processed so the framework's report at the end of your job has meaning, or set the number of reduces to zero and use TableOutputFormat. See the example code below. If running the reduce step makes sense in your case, it's usually better to have lots of reducers so load is spread across the HBase cluster.
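A minimal sketch of the reduce-less write path described above, assuming a hypothetical mapper that emits Put or Delete mutations keyed by row (the sink table name is also hypothetical):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyWriteSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "write-to-hbase");
    job.setJarByClass(MapOnlyWriteSetup.class);

    // ... initTableMapperJob(...) or another input format goes here ...

    // Wire TableOutputFormat to the sink table; pass null for the reducer
    // because the mapper writes Puts/Deletes directly.
    TableMapReduceUtil.initTableReducerJob(
        "target_table",   // sink table (hypothetical)
        null,             // no reducer
        job);
    job.setNumReduceTasks(0);  // skip the sort/shuffle entirely

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```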
There is also a new HBase partitioner that will run as many reducers as there are currently existing regions. The HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions when done; otherwise use the default partitioner.
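If the reduce step is kept, one way to wire in this partitioner is the TableMapReduceUtil.initTableReducerJob overload that accepts a partitioner class; a hedged fragment, assuming a Job named job configured in a driver like the sketches above and a hypothetical sink table:

```java
// Route each mapper output key to the reducer for the region its row key
// falls into, so there is roughly one reducer per existing region.
TableMapReduceUtil.initTableReducerJob(
    "target_table",               // sink table (hypothetical)
    IdentityTableReducer.class,   // bundled pass-through reducer
    job,
    HRegionPartitioner.class);    // partitioner keyed on region boundaries
```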
If importing into a new table, it's possible to bypass the HBase API and write your content directly to the filesystem, properly formatted as HBase data files (HFiles). Your import will run faster, perhaps an order of magnitude faster if not more. For more on how this mechanism works, see the Bulk Loads documentation.
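A hedged sketch of that bulk-load path built around HFileOutputFormat2 (the exact configureIncrementalLoad signature varies by HBase version, and the table name and output path here are hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "bulk-load-prepare");
    job.setJarByClass(BulkLoadSetup.class);

    // ... initTableMapperJob(...) plus a mapper emitting Puts/KeyValues goes here ...

    TableName tableName = TableName.valueOf("target_table");  // hypothetical table
    try (Connection conn = ConnectionFactory.createConnection(job.getConfiguration());
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {
      // Configures compression, bloom filters, block size and a total-order
      // partitioner so the generated HFiles line up with the table's regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
    }

    FileOutputFormat.setOutputPath(job, new Path("/tmp/bulkload-output"));  // hypothetical path
    boolean ok = job.waitForCompletion(true);
    // Afterwards, the bundled LoadIncrementalHFiles tool (completebulkload)
    // moves the generated HFiles into the target table.
    System.exit(ok ? 0 : 1);
  }
}
```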
See RowCounter. This job uses TableInputFormat and does a count of all rows in the specified table. You should be able to run it by doing:

```
% ./bin/hadoop jar hbase-X.X.X.jar
```

This will invoke the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs offered. This will emit rowcounter 'usage'. Specify the tablename, the column to count, and the output directory. You may need to add the hbase conf directory to $HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH so the rowcounter gets pointed at the right hbase cluster (or, build a new jar with an appropriate hbase-site.xml built into your job jar).
Copyright © 2015 The Apache Software Foundation. All Rights Reserved.