org.apache.hadoop.hbase.client
Class Scan

java.lang.Object
  extended by org.apache.hadoop.hbase.client.Operation
      extended by org.apache.hadoop.hbase.client.OperationWithAttributes
          extended by org.apache.hadoop.hbase.client.Query
              extended by org.apache.hadoop.hbase.client.Scan
All Implemented Interfaces:
Attributes

@InterfaceAudience.Public
@InterfaceStability.Stable
public class Scan
extends Query

Used to perform Scan operations.

All operations are identical to Get with the exception of instantiation. Rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the Scanner will iterate over all rows.

To scan everything for each row, instantiate a Scan object.

To modify scanner caching for just this scan, use setCaching. If caching is NOT set, we will use the caching value of the hosting HTable. See HTable.setScannerCaching(int). In addition to row caching, it is possible to specify a maximum result size, using setMaxResultSize(long). When both are used, single server requests are limited by either number of rows or maximum result size, whichever limit comes first.

To further narrow what a scan returns, call the additional methods outlined below.

To get all columns from specific families, execute addFamily for each family to retrieve.

To get specific columns, execute addColumn for each column to retrieve.

To only retrieve columns within a specific range of version timestamps, execute setTimeRange.

To only retrieve columns with a specific timestamp, execute setTimeStamp.

To limit the number of versions of each column to be returned, execute setMaxVersions.

To limit the maximum number of values returned for each call to next(), execute setBatch.

To add a filter, execute setFilter.

Expert: To explicitly disable server-side block caching for this scan, execute setCacheBlocks(boolean).

Note: Usage alters Scan instances. Internally, attributes are updated as the Scan runs and, if enabled, metrics accumulate in the Scan instance. Keep this in mind when you clone a Scan instance or reuse an existing one; it is safer to create a new Scan instance for each use.


Field Summary
static String HINT_LOOKAHEAD
          Deprecated. No replacement. This is now a no-op; SEEKs and SKIPs are optimized automatically.
static String SCAN_ATTRIBUTES_METRICS_DATA
           
static String SCAN_ATTRIBUTES_METRICS_ENABLE
           
static String SCAN_ATTRIBUTES_TABLE_NAME
           
 
Fields inherited from class org.apache.hadoop.hbase.client.Query
filter
 
Fields inherited from class org.apache.hadoop.hbase.client.OperationWithAttributes
ID_ATRIBUTE
 
Constructor Summary
Scan()
          Create a Scan operation across all rows.
Scan(byte[] startRow)
          Create a Scan operation starting at the specified row.
Scan(byte[] startRow, byte[] stopRow)
          Create a Scan operation for the range of rows specified.
Scan(byte[] startRow, Filter filter)
           
Scan(Get get)
          Builds a scan object with the same specs as get.
Scan(Scan scan)
          Creates a new instance of this class while copying all values.
 
Method Summary
 Scan addColumn(byte[] family, byte[] qualifier)
          Get the column from the specified family with the specified qualifier.
 Scan addFamily(byte[] family)
          Get all columns from the specified family.
 boolean doLoadColumnFamiliesOnDemand()
          Get the logical value indicating whether on-demand CF loading should be allowed.
 int getBatch()
           
 boolean getCacheBlocks()
          Get whether blocks should be cached for this Scan.
 int getCaching()
           
 byte[][] getFamilies()
           
 Map<byte[],NavigableSet<byte[]>> getFamilyMap()
          Get the familyMap.
 Filter getFilter()
           
 Map<String,Object> getFingerprint()
          Compile the table and column family (i.e. schema) information into a Map.
 Boolean getLoadColumnFamiliesOnDemandValue()
          Get the raw loadColumnFamiliesOnDemand setting; if it's not set, can be null.
 long getMaxResultSize()
           
 int getMaxResultsPerColumnFamily()
           
 int getMaxVersions()
           
 int getRowOffsetPerColumnFamily()
          Method for retrieving the scan's offset per row per column family (#kvs to be skipped)
 byte[] getStartRow()
           
 byte[] getStopRow()
           
 TimeRange getTimeRange()
           
 boolean hasFamilies()
           
 boolean hasFilter()
           
 boolean isGetScan()
           
 boolean isRaw()
           
 boolean isReversed()
          Get whether this scan is a reversed one.
 boolean isSmall()
          Get whether this scan is a small scan
 int numFamilies()
           
 void setBatch(int batch)
          Set the maximum number of values to return for each call to next()
 void setCacheBlocks(boolean cacheBlocks)
          Set whether blocks should be cached for this Scan.
 void setCaching(int caching)
          Set the number of rows for caching that will be passed to scanners.
 Scan setFamilyMap(Map<byte[],NavigableSet<byte[]>> familyMap)
          Set the familyMap.
 Scan setFilter(Filter filter)
          Apply the specified server-side filter when performing the Query.
 void setLoadColumnFamiliesOnDemand(boolean value)
          Set the value indicating whether loading CFs on demand should be allowed (cluster default is false).
 void setMaxResultSize(long maxResultSize)
          Set the maximum result size.
 void setMaxResultsPerColumnFamily(int limit)
          Set the maximum number of values to return per row per Column Family
 Scan setMaxVersions()
          Get all available versions.
 Scan setMaxVersions(int maxVersions)
          Get up to the specified number of versions of each column.
 void setRaw(boolean raw)
          Enable/disable "raw" mode for this scan.
 Scan setReversed(boolean reversed)
          Set whether this scan is a reversed one
 void setRowOffsetPerColumnFamily(int offset)
          Set offset for the row per Column Family.
 void setSmall(boolean small)
          Set whether this scan is a small scan
 Scan setStartRow(byte[] startRow)
          Set the start row of the scan.
 Scan setStopRow(byte[] stopRow)
          Set the stop row.
 Scan setTimeRange(long minStamp, long maxStamp)
          Get versions of columns only within the specified timestamp range, [minStamp, maxStamp).
 Scan setTimeStamp(long timestamp)
          Get versions of columns with the specified timestamp.
 Map<String,Object> toMap(int maxCols)
          Compile the details beyond the scope of getFingerprint (row, columns, timestamps, etc.) into a Map along with the fingerprinted information.
 
Methods inherited from class org.apache.hadoop.hbase.client.Query
getACL, getACLStrategy, getAuthorizations, getIsolationLevel, setACL, setACL, setACLStrategy, setAuthorizations, setIsolationLevel
 
Methods inherited from class org.apache.hadoop.hbase.client.OperationWithAttributes
getAttribute, getAttributeSize, getAttributesMap, getId, setAttribute, setId
 
Methods inherited from class org.apache.hadoop.hbase.client.Operation
toJSON, toJSON, toMap, toString, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

HINT_LOOKAHEAD

@Deprecated
public static final String HINT_LOOKAHEAD
Deprecated. No replacement. This is now a no-op; SEEKs and SKIPs are optimized automatically.
EXPERT ONLY. An integer (not long) indicating to the scanner logic how many times we attempt to retrieve the next KV before we schedule a reseek. The right value depends on the size of the average KV. A reseek is more efficient when it can skip 5-10 KVs or 512B-1KB, or when the next KV is likely found in another HFile block. Setting this only has an effect when columns were added with addColumn(byte[], byte[]):
Scan s = new Scan(...);
s.addColumn(...);
s.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));

Default is 0 (always reseek).

See Also:
Constant Field Values

SCAN_ATTRIBUTES_METRICS_ENABLE

public static final String SCAN_ATTRIBUTES_METRICS_ENABLE
See Also:
Constant Field Values

SCAN_ATTRIBUTES_METRICS_DATA

public static final String SCAN_ATTRIBUTES_METRICS_DATA
See Also:
Constant Field Values

SCAN_ATTRIBUTES_TABLE_NAME

public static final String SCAN_ATTRIBUTES_TABLE_NAME
See Also:
Constant Field Values
Constructor Detail

Scan

public Scan()
Create a Scan operation across all rows.


Scan

public Scan(byte[] startRow,
            Filter filter)

Scan

public Scan(byte[] startRow)
Create a Scan operation starting at the specified row.

If the specified row does not exist, the Scanner will start from the next closest row after the specified row.

Parameters:
startRow - row to start scanner at or after

Scan

public Scan(byte[] startRow,
            byte[] stopRow)
Create a Scan operation for the range of rows specified.

Parameters:
startRow - row to start scanner at or after (inclusive)
stopRow - row to stop scanner before (exclusive)

Scan

public Scan(Scan scan)
     throws IOException
Creates a new instance of this class while copying all values.

Parameters:
scan - The scan instance to copy from.
Throws:
IOException - When copying the values fails.

Scan

public Scan(Get get)
Builds a scan object with the same specs as get.

Parameters:
get - get to model scan after
Method Detail

isGetScan

public boolean isGetScan()

addFamily

public Scan addFamily(byte[] family)
Get all columns from the specified family.

Overrides previous calls to addColumn for this family.

Parameters:
family - family name
Returns:
this

addColumn

public Scan addColumn(byte[] family,
                      byte[] qualifier)
Get the column from the specified family with the specified qualifier.

Overrides previous calls to addFamily for this family.

Parameters:
family - family name
qualifier - column qualifier
Returns:
this
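
The way addFamily and addColumn override each other can be pictured with a small model of the family map. This is only a sketch using JDK classes: FamilyMapSketch is a hypothetical name, and the convention that a null qualifier set means "all columns" is an assumption made for illustration, not a statement about the Scan internals.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for the family map kept by a Scan; not the HBase class. */
final class FamilyMapSketch {
    // Byte arrays are ordered as unsigned bytes (requires Java 9+).
    static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    final Map<byte[], NavigableSet<byte[]>> familyMap = new TreeMap<>(BYTES);

    /** Select the whole family; assumed here: a null set means "all columns". */
    FamilyMapSketch addFamily(byte[] family) {
        familyMap.put(family, null); // discards any qualifiers added earlier
        return this;
    }

    /** Select one column; the family then maps to an explicit qualifier set. */
    FamilyMapSketch addColumn(byte[] family, byte[] qualifier) {
        NavigableSet<byte[]> qualifiers = familyMap.get(family);
        if (qualifiers == null) {
            qualifiers = new TreeSet<>(BYTES);
            familyMap.put(family, qualifiers);
        }
        qualifiers.add(qualifier);
        return this;
    }
}
```

In this model the last call for a given family wins at the family level: addFamily after addColumn widens the selection to the whole family, and addColumn after addFamily narrows it back to the listed qualifiers.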

setTimeRange

public Scan setTimeRange(long minStamp,
                         long maxStamp)
                  throws IOException
Get versions of columns only within the specified timestamp range, [minStamp, maxStamp). Note that the default maximum number of versions to return is 1. If your time range spans more than one version and you want all versions returned, raise the maximum number of versions beyond the default.

Parameters:
minStamp - minimum timestamp value, inclusive
maxStamp - maximum timestamp value, exclusive
Returns:
this
Throws:
IOException - if invalid time range
See Also:
setMaxVersions(), setMaxVersions(int)
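
The interplay between the half-open range [minStamp, maxStamp) and the maximum number of versions can be sketched with plain JDK code. VersionSelectionSketch and its methods are hypothetical names, and the model assumes that versions of a column are visited newest first and that only versions inside the range count toward the limit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of version selection for one column; not HBase code. */
final class VersionSelectionSketch {
    /** Half-open check: minStamp is inclusive, maxStamp is exclusive. */
    static boolean withinRange(long ts, long minStamp, long maxStamp) {
        return ts >= minStamp && ts < maxStamp;
    }

    /** Keeps at most maxVersions timestamps that fall inside the range. */
    static List<Long> select(List<Long> newestFirst, long minStamp,
                             long maxStamp, int maxVersions) {
        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            if (kept.size() == maxVersions) {
                break; // version limit reached
            }
            if (withinRange(ts, minStamp, maxStamp)) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

With versions at timestamps 40, 30, 20 and 10 and the range [20, 40), a limit of 1 keeps only timestamp 30, while a limit of 3 keeps 30 and 20. This is why a wide time range alone does not return multiple versions.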

setTimeStamp

public Scan setTimeStamp(long timestamp)
                  throws IOException
Get versions of columns with the specified timestamp. Note that the default maximum number of versions to return is 1. If your time range spans more than one version and you want all versions returned, raise the maximum number of versions beyond the default.

Parameters:
timestamp - version timestamp
Returns:
this
Throws:
IOException
See Also:
setMaxVersions(), setMaxVersions(int)

setStartRow

public Scan setStartRow(byte[] startRow)
Set the start row of the scan.

Parameters:
startRow - row to start scan on (inclusive). Note: to make startRow exclusive, add a trailing 0 byte.
Returns:
this

setStopRow

public Scan setStopRow(byte[] stopRow)
Set the stop row.

Parameters:
stopRow - row to end at (exclusive). Note: to make stopRow inclusive, add a trailing 0 byte.
Returns:
this
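
The trailing 0 byte trick mentioned for setStartRow and setStopRow works because row keys are compared byte by byte, so a key followed by 0x00 is the smallest key that sorts after the original. A minimal sketch with JDK classes only follows; RowBoundSketch is a hypothetical helper, and the range check ignores the special case of an empty row.

```java
import java.util.Arrays;

/** Hypothetical helper for the trailing-zero-byte trick; not part of the Scan API. */
final class RowBoundSketch {
    /** Returns a copy of row with one 0x00 byte appended. */
    static byte[] appendZero(byte[] row) {
        return Arrays.copyOf(row, row.length + 1); // the new last byte is 0
    }

    /** Models a scan range check: startRow inclusive, stopRow exclusive. */
    static boolean inRange(byte[] row, byte[] startRow, byte[] stopRow) {
        // Unsigned lexicographic comparison (requires Java 9+).
        return Arrays.compareUnsigned(row, startRow) >= 0
            && Arrays.compareUnsigned(row, stopRow) < 0;
    }
}
```

Passing appendZero(stopRow) as the stop row makes the original stopRow fall inside the range, and passing appendZero(startRow) as the start row excludes the original startRow.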

setMaxVersions

public Scan setMaxVersions()
Get all available versions.

Returns:
this

setMaxVersions

public Scan setMaxVersions(int maxVersions)
Get up to the specified number of versions of each column.

Parameters:
maxVersions - maximum versions for each column
Returns:
this

setBatch

public void setBatch(int batch)
Set the maximum number of values to return for each call to next()

Parameters:
batch - the maximum number of values
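
As an illustration of what a batch limit means for a wide row, the following sketch splits the values of one row into chunks of at most batch values, one chunk per call to next(). BatchSketch is a hypothetical name, and treating a non-positive batch as "no batching" is an assumption of this model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of batching the values of a single row; not HBase code. */
final class BatchSketch {
    /** Splits one row's values into chunks of at most batch values each. */
    static <T> List<List<T>> split(List<T> rowValues, int batch) {
        List<List<T>> results = new ArrayList<>();
        if (batch <= 0) {
            // Assumed here: a non-positive batch returns the row in one piece.
            results.add(new ArrayList<>(rowValues));
            return results;
        }
        for (int i = 0; i < rowValues.size(); i += batch) {
            int end = Math.min(i + batch, rowValues.size());
            results.add(new ArrayList<>(rowValues.subList(i, end)));
        }
        return results;
    }
}
```

A row with five values and a batch of 2 would come back in three pieces holding 2, 2 and 1 values.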

setMaxResultsPerColumnFamily

public void setMaxResultsPerColumnFamily(int limit)
Set the maximum number of values to return per row per Column Family

Parameters:
limit - the maximum number of values returned / row / CF

setRowOffsetPerColumnFamily

public void setRowOffsetPerColumnFamily(int offset)
Set offset for the row per Column Family.

Parameters:
offset - is the number of kvs that will be skipped.
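
Taken together, the per-family offset and the per-family limit behave like a window over the values of one column family within one row. The sketch below models that reading with JDK lists; PerFamilyWindowSketch is a hypothetical name, and treating a negative limit as "no limit" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of offset and limit per row per column family. */
final class PerFamilyWindowSketch {
    /** Skips offset values, then returns at most limit of the remaining ones. */
    static <T> List<T> window(List<T> familyValues, int offset, int limit) {
        int size = familyValues.size();
        int from = Math.min(Math.max(offset, 0), size);
        // Assumed here: a negative limit means no limit. Long math avoids overflow.
        int to = limit < 0 ? size : (int) Math.min((long) from + limit, size);
        return new ArrayList<>(familyValues.subList(from, to));
    }
}
```

For example, an offset of 1 and a limit of 2 over the values a, b, c, d, e yields b and c, which is how these two settings can be combined to page through a very wide row.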

setCaching

public void setCaching(int caching)
Set the number of rows for caching that will be passed to scanners. If not set, the default setting from HTable.getScannerCaching() will apply. Higher caching values will enable faster scanners but will use more memory.

Parameters:
caching - the number of rows for caching

getMaxResultSize

public long getMaxResultSize()
Returns:
the maximum result size in bytes. See setMaxResultSize(long)

setMaxResultSize

public void setMaxResultSize(long maxResultSize)
Set the maximum result size. The default is -1, which means that no specific maximum result size is set for this scan and the globally configured value (which defaults to unlimited) is used instead.

Parameters:
maxResultSize - The maximum result size in bytes.
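
When both a row count (setCaching) and a maximum result size are set, a single server request ends at whichever limit is reached first. The following simplified model shows the idea; RequestLimitSketch is a hypothetical name, and the exact accounting (checking the size before adding each row, so a response may overshoot the byte limit by one row) is an assumption rather than the server's documented behavior.

```java
/** Hypothetical model of the two per-request limits; not HBase code. */
final class RequestLimitSketch {
    /**
     * Counts how many rows fit in one simulated response.
     * A non-positive maxResultSize is treated as "no size limit" here.
     */
    static int rowsInOneResponse(long[] rowSizes, int caching, long maxResultSize) {
        int rows = 0;
        long bytes = 0;
        for (long size : rowSizes) {
            if (rows >= caching) {
                break; // row-count limit reached first
            }
            if (maxResultSize > 0 && bytes >= maxResultSize) {
                break; // size limit reached first
            }
            rows++;
            bytes += size;
        }
        return rows;
    }
}
```

With four rows of 100 bytes each, a caching value of 3 and a generous size limit give 3 rows per request, while a caching value of 10 and a size limit of 150 bytes give 2 rows per request in this model.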

setFilter

public Scan setFilter(Filter filter)
Description copied from class: Query
Apply the specified server-side filter when performing the Query. Only Filter.filterKeyValue(Cell) is called AFTER all tests for ttl, column match, deletes and max versions have been run.

Overrides:
setFilter in class Query
Parameters:
filter - filter to run on the server
Returns:
this for invocation chaining

setFamilyMap

public Scan setFamilyMap(Map<byte[],NavigableSet<byte[]>> familyMap)
Set the familyMap.

Parameters:
familyMap - map of family to qualifier
Returns:
this

getFamilyMap

public Map<byte[],NavigableSet<byte[]>> getFamilyMap()
Get the familyMap.

Returns:
familyMap

numFamilies

public int numFamilies()
Returns:
the number of families in familyMap

hasFamilies

public boolean hasFamilies()
Returns:
true if familyMap is non empty, false otherwise

getFamilies

public byte[][] getFamilies()
Returns:
the keys of the familyMap

getStartRow

public byte[] getStartRow()
Returns:
the startrow

getStopRow

public byte[] getStopRow()
Returns:
the stoprow

getMaxVersions

public int getMaxVersions()
Returns:
the max number of versions to fetch

getBatch

public int getBatch()
Returns:
maximum number of values to return for a single call to next()

getMaxResultsPerColumnFamily

public int getMaxResultsPerColumnFamily()
Returns:
maximum number of values to return per row per CF

getRowOffsetPerColumnFamily

public int getRowOffsetPerColumnFamily()
Method for retrieving the scan's offset per row per column family (#kvs to be skipped)

Returns:
row offset

getCaching

public int getCaching()
Returns:
the number of rows fetched when calling next on a scanner

getTimeRange

public TimeRange getTimeRange()
Returns:
TimeRange

getFilter

public Filter getFilter()
Overrides:
getFilter in class Query
Returns:
the filter applied to this scan

hasFilter

public boolean hasFilter()
Returns:
true if a filter has been specified, false if not

setCacheBlocks

public void setCacheBlocks(boolean cacheBlocks)
Set whether blocks should be cached for this Scan.

This is true by default. When true, default settings of the table and family are used (this will never override caching blocks if the block cache is disabled for that family or entirely).

Parameters:
cacheBlocks - if false, default settings are overridden and blocks will not be cached

getCacheBlocks

public boolean getCacheBlocks()
Get whether blocks should be cached for this Scan.

Returns:
true if default caching should be used, false if blocks should not be cached

setReversed

public Scan setReversed(boolean reversed)
Set whether this scan is a reversed one

This is false by default, which means a forward (normal) scan.

Parameters:
reversed - if true, the scan will run in backward order
Returns:
this

isReversed

public boolean isReversed()
Get whether this scan is a reversed one.

Returns:
true if backward scan, false if forward (default) scan

setLoadColumnFamiliesOnDemand

public void setLoadColumnFamiliesOnDemand(boolean value)
Set the value indicating whether loading CFs on demand should be allowed (cluster default is false). On-demand CF loading doesn't load column families until necessary; for example, if you filter on one column, the other column family's data is loaded only for the rows that are included in the result, not for all rows as in the normal case. With column-specific filters, such as SingleColumnValueFilter with filterIfMissing == true, this can deliver huge performance gains when there is a CF with lots of data. However, it can also lead to some inconsistent results, as follows:

- If someone does a concurrent update to both column families in question, you may get a row that never existed. For example, for { rowKey = 5, { cat_videos => 1 }, { video => "my cat" } }, someone puts rowKey 5 with { cat_videos => 0 }, { video => "my dog" }; a concurrent scan filtering on "cat_videos == 1" can get { rowKey = 5, { cat_videos => 1 }, { video => "my dog" } }.
- If there is a concurrent split and you have more than 2 column families, some rows may be missing some column families.


getLoadColumnFamiliesOnDemandValue

public Boolean getLoadColumnFamiliesOnDemandValue()
Get the raw loadColumnFamiliesOnDemand setting; if it's not set, can be null.


doLoadColumnFamiliesOnDemand

public boolean doLoadColumnFamiliesOnDemand()
Get the logical value indicating whether on-demand CF loading should be allowed.
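
The difference between the raw setting (which can be null) and the logical value can be sketched as a tri-state flag. OnDemandFlagSketch is a hypothetical class, and mapping the unset state to false is an assumption made for this illustration.

```java
/** Hypothetical tri-state flag: unset (null), true, or false. */
final class OnDemandFlagSketch {
    private Boolean loadOnDemand; // stays null until explicitly set

    void set(boolean value) {
        loadOnDemand = value;
    }

    /** Raw setting; null when the caller never set it. */
    Boolean rawValue() {
        return loadOnDemand;
    }

    /** Logical value; assumed here: unset is treated as false. */
    boolean effectiveValue() {
        return loadOnDemand != null && loadOnDemand;
    }
}
```

Keeping the raw value nullable lets code that receives the request tell "explicitly disabled" apart from "not specified", while callers that only need a yes or no answer use the logical value.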


getFingerprint

public Map<String,Object> getFingerprint()
Compile the table and column family (i.e. schema) information into a Map. Useful for parsing and aggregation by debugging, logging, and administration tools.

Specified by:
getFingerprint in class Operation
Returns:
Map

toMap

public Map<String,Object> toMap(int maxCols)
Compile the details beyond the scope of getFingerprint (row, columns, timestamps, etc.) into a Map along with the fingerprinted information. Useful for debugging, logging, and administration tools.

Specified by:
toMap in class Operation
Parameters:
maxCols - a limit on the number of columns output prior to truncation
Returns:
Map

setRaw

public void setRaw(boolean raw)
Enable/disable "raw" mode for this scan. If "raw" is enabled, the scan will return all delete markers and deleted rows that have not yet been collected. This is mostly useful for scans on column families that have KEEP_DELETED_CELLS enabled. It is an error to specify any column when "raw" is set.

Parameters:
raw - True/False to enable/disable "raw" mode.

isRaw

public boolean isRaw()
Returns:
True if this Scan is in "raw" mode.

setSmall

public void setSmall(boolean small)
Set whether this scan is a small scan

A small scan should use pread, while a big scan can use seek + read. Seek + read is fast but can cause two problems: (1) resource contention and (2) too much network I/O. See "[89-fb] Using pread for non-compaction read request", https://issues.apache.org/jira/browse/HBASE-7266. In addition, when this is set to true, openScanner, next and closeScanner are done in one RPC call, which means better performance for small scans (HBASE-9488). Generally, if the scan range is within one data block (64KB), it can be considered a small scan.

Parameters:
small - true if this scan is a small scan

isSmall

public boolean isSmall()
Get whether this scan is a small scan

Returns:
true if small scan


Copyright © 2007-2015 The Apache Software Foundation. All Rights Reserved.