Scan (Apache HBase - Client 0.98.16-hadoop2 API)

org.apache.hadoop.hbase.client
Class Scan

java.lang.Object
  org.apache.hadoop.hbase.client.Operation
      org.apache.hadoop.hbase.client.OperationWithAttributes
          org.apache.hadoop.hbase.client.Query
              org.apache.hadoop.hbase.client.Scan

All Implemented Interfaces:: Attributes

@InterfaceAudience.Public @InterfaceStability.Stable public class Scan
extends Query

Used to perform Scan operations.

All operations are identical to Get with the exception of instantiation. Rather than specifying a single row, an optional startRow and stopRow may be defined. If rows are not specified, the Scanner will iterate over all rows.

To scan everything for each row, instantiate a Scan object.

To modify scanner caching for just this scan, use setCaching. If caching is NOT set, we will use the caching value of the hosting HTable. See HTable.setScannerCaching(int). In addition to row caching, it is possible to specify a maximum result size, using setMaxResultSize(long). When both are used, single server requests are limited by either number of rows or maximum result size, whichever limit comes first.
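
The interplay of the two limits can be pictured with a small stand-alone model. This is only a sketch and not HBase code: the class `ScanBatchModel`, the method `rowsInOneRequest`, and the rule that the limits are checked before each row is added (so the row that crosses the byte limit is still included) are assumptions made for illustration.

```java
/** Toy model (not HBase code) of how a row limit and a byte limit could bound one server request. */
final class ScanBatchModel {

    /**
     * Returns how many rows fit in one hypothetical server response.
     * Assumption: both limits are checked before a row is added, so the row
     * that crosses the byte limit is still included. A maxResultSize <= 0
     * is treated as "no byte limit" in this model.
     */
    static int rowsInOneRequest(long[] rowSizesInBytes, int caching, long maxResultSize) {
        int rows = 0;
        long bytes = 0;
        for (long size : rowSizesInBytes) {
            boolean rowLimitHit = rows >= caching;
            boolean sizeLimitHit = maxResultSize > 0 && bytes >= maxResultSize;
            if (rowLimitHit || sizeLimitHit) {
                break; // whichever limit comes first ends the request
            }
            rows++;
            bytes += size;
        }
        return rows;
    }
}
```

With five rows of 100 bytes each, a caching value of 3 and a generous byte limit stops the request at the row limit, while a caching value of 10 and a byte limit of 200 stops it at the size limit.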

To further define the scope of what to get when scanning, perform additional methods as outlined below.

To get all columns from specific families, execute addFamily for each family to retrieve.

To get specific columns, execute addColumn for each column to retrieve.

To only retrieve columns within a specific range of version timestamps, execute setTimeRange.

To only retrieve columns with a specific timestamp, execute setTimeStamp.

To limit the number of versions of each column to be returned, execute setMaxVersions.

To limit the maximum number of values returned for each call to next(), execute setBatch.

To add a filter, execute setFilter.

Expert: To explicitly disable server-side block caching for this scan, execute setCacheBlocks(boolean).

Note: Usage alters Scan instances. Internally, attributes are updated as the Scan runs and, if enabled, metrics accumulate in the Scan instance. Keep this in mind when you clone a Scan instance or reuse an existing one; it is safer to create a Scan instance per usage.

Field Summary
`static String`	`HINT_LOOKAHEAD` Deprecated. Without replacement. This is now a no-op; SEEKs and SKIPs are optimized automatically.
`static String`	`SCAN_ATTRIBUTES_METRICS_DATA`
`static String`	`SCAN_ATTRIBUTES_METRICS_ENABLE`
`static String`	`SCAN_ATTRIBUTES_TABLE_NAME`

Fields inherited from class org.apache.hadoop.hbase.client.Query
`filter`

Fields inherited from class org.apache.hadoop.hbase.client.OperationWithAttributes
`ID_ATRIBUTE`

Constructor Summary
`Scan()` Create a Scan operation across all rows.
`Scan(byte[] startRow)` Create a Scan operation starting at the specified row.
`Scan(byte[] startRow, byte[] stopRow)` Create a Scan operation for the range of rows specified.
`Scan(byte[] startRow, Filter filter)`
`Scan(Get get)` Builds a scan object with the same specs as get.
`Scan(Scan scan)` Creates a new instance of this class while copying all values.

Method Summary
`Scan`	`addColumn(byte[] family, byte[] qualifier)` Get the column from the specified family with the specified qualifier.
`Scan`	`addFamily(byte[] family)` Get all columns from the specified family.
`boolean`	`doLoadColumnFamiliesOnDemand()` Get the logical value indicating whether on-demand CF loading should be allowed.
`int`	`getBatch()`
`boolean`	`getCacheBlocks()` Get whether blocks should be cached for this Scan.
`int`	`getCaching()`
`byte[][]`	`getFamilies()`
`Map<byte[],NavigableSet<byte[]>>`	`getFamilyMap()` Getting the familyMap
`Filter`	`getFilter()`
`Map<String,Object>`	`getFingerprint()` Compile the table and column family (i.e. schema) information into a Map.
`Boolean`	`getLoadColumnFamiliesOnDemandValue()` Get the raw loadColumnFamiliesOnDemand setting; if it's not set, can be null.
`long`	`getMaxResultSize()`
`int`	`getMaxResultsPerColumnFamily()`
`int`	`getMaxVersions()`
`int`	`getRowOffsetPerColumnFamily()` Method for retrieving the scan's offset per row per column family (#kvs to be skipped)
`byte[]`	`getStartRow()`
`byte[]`	`getStopRow()`
`TimeRange`	`getTimeRange()`
`boolean`	`hasFamilies()`
`boolean`	`hasFilter()`
`boolean`	`isGetScan()`
`boolean`	`isRaw()`
`boolean`	`isReversed()` Get whether this scan is a reversed one.
`boolean`	`isSmall()` Get whether this scan is a small scan
`int`	`numFamilies()`
`void`	`setBatch(int batch)` Set the maximum number of values to return for each call to next()
`void`	`setCacheBlocks(boolean cacheBlocks)` Set whether blocks should be cached for this Scan.
`void`	`setCaching(int caching)` Set the number of rows for caching that will be passed to scanners.
`Scan`	`setFamilyMap(Map<byte[],NavigableSet<byte[]>> familyMap)` Setting the familyMap
`Scan`	`setFilter(Filter filter)` Apply the specified server-side filter when performing the Query.
`void`	`setLoadColumnFamiliesOnDemand(boolean value)` Set the value indicating whether loading CFs on demand should be allowed (cluster default is false).
`void`	`setMaxResultSize(long maxResultSize)` Set the maximum result size.
`void`	`setMaxResultsPerColumnFamily(int limit)` Set the maximum number of values to return per row per Column Family
`Scan`	`setMaxVersions()` Get all available versions.
`Scan`	`setMaxVersions(int maxVersions)` Get up to the specified number of versions of each column.
`void`	`setRaw(boolean raw)` Enable/disable "raw" mode for this scan.
`Scan`	`setReversed(boolean reversed)` Set whether this scan is a reversed one
`void`	`setRowOffsetPerColumnFamily(int offset)` Set offset for the row per Column Family.
`void`	`setSmall(boolean small)` Set whether this scan is a small scan
`Scan`	`setStartRow(byte[] startRow)` Set the start row of the scan.
`Scan`	`setStopRow(byte[] stopRow)` Set the stop row.
`Scan`	`setTimeRange(long minStamp, long maxStamp)` Get versions of columns only within the specified timestamp range, [minStamp, maxStamp).
`Scan`	`setTimeStamp(long timestamp)` Get versions of columns with the specified timestamp.
`Map<String,Object>`	`toMap(int maxCols)` Compile the details beyond the scope of getFingerprint (row, columns, timestamps, etc.) into a Map along with the fingerprinted information.

Methods inherited from class org.apache.hadoop.hbase.client.Query
`getACL, getACLStrategy, getAuthorizations, getIsolationLevel, setACL, setACL, setACLStrategy, setAuthorizations, setIsolationLevel`

Methods inherited from class org.apache.hadoop.hbase.client.OperationWithAttributes
`getAttribute, getAttributeSize, getAttributesMap, getId, setAttribute, setId`

Methods inherited from class org.apache.hadoop.hbase.client.Operation
`toJSON, toJSON, toMap, toString, toString`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

HINT_LOOKAHEAD

@Deprecated
public static final String HINT_LOOKAHEAD

Deprecated. Without replacement. This is now a no-op; SEEKs and SKIPs are optimized automatically.

EXPERT ONLY. An integer (not long) indicating to the scanner logic how many times we attempt to retrieve the next KV before we schedule a reseek. The right value depends on the size of the average KV. A reseek is more efficient when it can skip 5-10 KVs or 512B-1KB, or when the next KV is likely found in another HFile block. Setting this only has any effect when columns were added with addColumn(byte[], byte[])

Scan s = new Scan(...);
s.addColumn(...);
s.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));

Default is 0 (always reseek).

See Also:: Constant Field Values

SCAN_ATTRIBUTES_METRICS_ENABLE

public static final String SCAN_ATTRIBUTES_METRICS_ENABLE

See Also:: Constant Field Values

SCAN_ATTRIBUTES_METRICS_DATA

public static final String SCAN_ATTRIBUTES_METRICS_DATA

See Also:: Constant Field Values

SCAN_ATTRIBUTES_TABLE_NAME

public static final String SCAN_ATTRIBUTES_TABLE_NAME

See Also:: Constant Field Values

Constructor Detail

Scan

public Scan()

Create a Scan operation across all rows.

Scan

public Scan(byte[] startRow,
            Filter filter)

Scan

public Scan(byte[] startRow)

Create a Scan operation starting at the specified row.

If the specified row does not exist, the Scanner will start from the next closest row after the specified row.

Parameters:: startRow - row to start scanner at or after

Scan

public Scan(byte[] startRow,
            byte[] stopRow)

Create a Scan operation for the range of rows specified.

Parameters:: startRow - row to start scanner at or after (inclusive); stopRow - row to stop scanner before (exclusive)

Scan

public Scan(Scan scan)
     throws IOException

Creates a new instance of this class while copying all values.

Parameters:: scan - The scan instance to copy from.
Throws:: IOException - When copying the values fails.

Scan

public Scan(Get get)

Builds a scan object with the same specs as get.

Parameters:: get - get to model scan after

Method Detail

isGetScan

public boolean isGetScan()

addFamily

public Scan addFamily(byte[] family)

Get all columns from the specified family.

Overrides previous calls to addColumn for this family.

Parameters:: family - family name
Returns:: this

addColumn

public Scan addColumn(byte[] family,
                      byte[] qualifier)

Get the column from the specified family with the specified qualifier.

Overrides previous calls to addFamily for this family.

Parameters:: family - family name; qualifier - column qualifier
Returns:: this
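
How addFamily and addColumn override each other can be illustrated with a simplified model of the family map. This is a sketch under stated assumptions, not the class's implementation: `FamilyMapModel` is a hypothetical name, String keys stand in for the byte[] keys of the real API, and a null qualifier set is taken to mean "all columns of the family".

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Simplified model (not HBase code) of a family-to-qualifiers map. */
final class FamilyMapModel {
    // Assumption of this model: a null qualifier set means "all columns of this family".
    final Map<String, NavigableSet<String>> familyMap = new TreeMap<>();

    /** Selects the whole family and drops any qualifiers chosen earlier for it. */
    FamilyMapModel addFamily(String family) {
        familyMap.put(family, null);
        return this;
    }

    /** Selects one column; an earlier "whole family" entry is replaced by a qualifier set. */
    FamilyMapModel addColumn(String family, String qualifier) {
        NavigableSet<String> qualifiers = familyMap.get(family);
        if (qualifiers == null) {
            qualifiers = new TreeSet<>();
        }
        qualifiers.add(qualifier);
        familyMap.put(family, qualifiers);
        return this;
    }
}
```

In this model, calling addColumn("cf", "a") followed by addFamily("cf") leaves the whole family selected, and a later addColumn("cf", "c") narrows the selection to that single column.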

setTimeRange

public Scan setTimeRange(long minStamp,
                         long maxStamp)
                  throws IOException

Get versions of columns only within the specified timestamp range, [minStamp, maxStamp). Note that the default maximum number of versions to return is 1. If your time range spans more than one version and you want all versions returned, raise the number of versions beyond the default.

Parameters:: minStamp - minimum timestamp value, inclusive; maxStamp - maximum timestamp value, exclusive
Returns:: this
Throws:: IOException - if invalid time range
See Also:: setMaxVersions(), setMaxVersions(int)
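
The half-open interval [minStamp, maxStamp) can be made concrete with a small stand-alone check. This is a sketch only: `HalfOpenTimeRange` is a hypothetical class, not the library's TimeRange, and it signals an invalid range with IllegalArgumentException where the method above declares IOException.

```java
/** Stand-alone model (not HBase code) of a [minStamp, maxStamp) timestamp range. */
final class HalfOpenTimeRange {
    final long minStamp; // inclusive lower bound
    final long maxStamp; // exclusive upper bound

    HalfOpenTimeRange(long minStamp, long maxStamp) {
        // Assumption of this model: negative stamps and inverted ranges are rejected.
        if (minStamp < 0 || maxStamp < minStamp) {
            throw new IllegalArgumentException("invalid time range");
        }
        this.minStamp = minStamp;
        this.maxStamp = maxStamp;
    }

    /** True if the timestamp lies at or after minStamp and strictly before maxStamp. */
    boolean contains(long timestamp) {
        return timestamp >= minStamp && timestamp < maxStamp;
    }
}
```

For the range [10, 20), timestamp 10 is included and timestamp 20 is not. In this model a single timestamp ts corresponds to the range [ts, ts + 1).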

setTimeStamp

public Scan setTimeStamp(long timestamp)
                  throws IOException

Get versions of columns with the specified timestamp. Note that the default maximum number of versions to return is 1. If your time range spans more than one version and you want all versions returned, raise the number of versions beyond the default.

Parameters:: timestamp - version timestamp
Returns:: this
Throws:: IOException
See Also:: setMaxVersions(), setMaxVersions(int)

setStartRow

public Scan setStartRow(byte[] startRow)

Set the start row of the scan.

Parameters:: startRow - row to start scan on (inclusive). Note: to make startRow exclusive, add a trailing 0 byte.
Returns:: this

setStopRow

public Scan setStopRow(byte[] stopRow)

Set the stop row.

Parameters:: stopRow - row to end at (exclusive). Note: to make stopRow inclusive, add a trailing 0 byte.
Returns:: this
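
The trailing 0 byte trick for setStartRow and setStopRow works because a key with 0x00 appended is the smallest key that sorts after the original. The following stand-alone sketch shows this with plain byte arrays; `RowKeyBounds` and its methods are hypothetical helpers, and it assumes row keys are ordered lexicographically as unsigned bytes.

```java
import java.util.Arrays;

/** Stand-alone helpers (not HBase code) for turning row bounds inclusive or exclusive. */
final class RowKeyBounds {

    /** Returns a copy of row with one 0x00 byte appended (Arrays.copyOf pads with zeros). */
    static byte[] appendZeroByte(byte[] row) {
        return Arrays.copyOf(row, row.length + 1);
    }

    /** Lexicographic comparison treating bytes as unsigned; the ordering assumed for row keys. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** True if key falls in [startRow, stopRow): start inclusive, stop exclusive. */
    static boolean inRange(byte[] key, byte[] startRow, byte[] stopRow) {
        return compareUnsigned(key, startRow) >= 0 && compareUnsigned(key, stopRow) < 0;
    }
}
```

Passing appendZeroByte(lastRow) as the stop row makes lastRow itself fall inside the range, and passing appendZeroByte(firstRow) as the start row excludes firstRow.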

setMaxVersions

public Scan setMaxVersions()

Get all available versions.

Returns:: this

setMaxVersions

public Scan setMaxVersions(int maxVersions)

Get up to the specified number of versions of each column.

Parameters:: maxVersions - maximum versions for each column
Returns:: this

setBatch

public void setBatch(int batch)

Set the maximum number of values to return for each call to next()

Parameters:: batch - the maximum number of values
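
One way to picture the batch limit is as splitting the values of a wide row into consecutive chunks, each returned by one call to next(). The sketch below is a stand-alone illustration and not scanner code: `RowBatchModel` and `splitRow` are hypothetical names, and treating a non-positive batch as "no limit" is an assumption of this model.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not HBase code) of returning a row's values in chunks of at most batch values. */
final class RowBatchModel {

    /** Splits the values of one row into consecutive chunks, one per hypothetical next() call. */
    static <T> List<List<T>> splitRow(List<T> values, int batch) {
        List<List<T>> chunks = new ArrayList<>();
        if (values.isEmpty()) {
            return chunks;
        }
        if (batch <= 0) {
            // Assumption of this model: a non-positive batch means the row is not split.
            chunks.add(new ArrayList<>(values));
            return chunks;
        }
        for (int i = 0; i < values.size(); i += batch) {
            int end = Math.min(i + batch, values.size());
            chunks.add(new ArrayList<>(values.subList(i, end)));
        }
        return chunks;
    }
}
```

A row with five values and a batch of 2 yields chunks of 2, 2, and 1 values in this model.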

setMaxResultsPerColumnFamily

public void setMaxResultsPerColumnFamily(int limit)

Set the maximum number of values to return per row per Column Family

Parameters:: limit - the maximum number of values returned / row / CF

setRowOffsetPerColumnFamily

public void setRowOffsetPerColumnFamily(int offset)

Set offset for the row per Column Family.

Parameters:: offset - is the number of kvs that will be skipped.
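
Taken together, the per-family offset and limit behave like a window over the values of one row in one column family. The sketch below illustrates that reading with plain lists; `FamilyWindowModel` and `window` are hypothetical names, and treating a negative limit as "no limit" is an assumption of this model rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not HBase code) of skipping offset values and keeping at most limit values. */
final class FamilyWindowModel {

    /** Returns the values of one row and family after applying the offset and the limit. */
    static <T> List<T> window(List<T> values, int offset, int limit) {
        int from = Math.min(Math.max(offset, 0), values.size());
        // Assumption of this model: a negative limit means "no limit".
        int to = limit < 0
                ? values.size()
                : (int) Math.min((long) from + limit, (long) values.size());
        return new ArrayList<>(values.subList(from, to));
    }
}
```

For the values a, b, c, d, e, an offset of 1 with a limit of 2 keeps b and c in this model.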

setCaching

public void setCaching(int caching)

Set the number of rows for caching that will be passed to scanners. If not set, the default setting from HTable.getScannerCaching() will apply. Higher caching values will enable faster scanners but will use more memory.

Parameters:: caching - the number of rows for caching

getMaxResultSize

public long getMaxResultSize()

Returns:: the maximum result size in bytes. See setMaxResultSize(long)

setMaxResultSize

public void setMaxResultSize(long maxResultSize)

Set the maximum result size. The default is -1; this means that no specific maximum result size will be set for this scan, and the global configured value will be used instead. (Defaults to unlimited).

Parameters:: maxResultSize - The maximum result size in bytes.

setFilter

public Scan setFilter(Filter filter)

Description copied from class: Query

Apply the specified server-side filter when performing the Query. Only Filter.filterKeyValue(Cell) is called AFTER all tests for ttl, column match, deletes and max versions have been run.

Overrides:: setFilter in class Query

Parameters:: filter - filter to run on the server
Returns:: this for invocation chaining

setFamilyMap

public Scan setFamilyMap(Map<byte[],NavigableSet<byte[]>> familyMap)

Setting the familyMap

Parameters:: familyMap - map of family to qualifier
Returns:: this

getFamilyMap

public Map<byte[],NavigableSet<byte[]>> getFamilyMap()

Getting the familyMap

Returns:: familyMap

numFamilies

public int numFamilies()

Returns:: the number of families in familyMap

hasFamilies

public boolean hasFamilies()

Returns:: true if familyMap is non empty, false otherwise

getFamilies

public byte[][] getFamilies()

Returns:: the keys of the familyMap

getStartRow

public byte[] getStartRow()

Returns:: the startrow

getStopRow

public byte[] getStopRow()

Returns:: the stoprow

getMaxVersions

public int getMaxVersions()

Returns:: the max number of versions to fetch

getBatch

public int getBatch()

Returns:: maximum number of values to return for a single call to next()

getMaxResultsPerColumnFamily

public int getMaxResultsPerColumnFamily()

Returns:: maximum number of values to return per row per CF

getRowOffsetPerColumnFamily

public int getRowOffsetPerColumnFamily()

Method for retrieving the scan's offset per row per column family (#kvs to be skipped)

Returns:: row offset

getCaching

public int getCaching()

Returns:: caching the number of rows fetched when calling next on a scanner

getTimeRange

public TimeRange getTimeRange()

Returns:: TimeRange

getFilter

public Filter getFilter()

Overrides:: getFilter in class Query

Returns:: the filter set for this Scan

hasFilter

public boolean hasFilter()

Returns:: true if a filter has been specified, false if not

setCacheBlocks

public void setCacheBlocks(boolean cacheBlocks)

Set whether blocks should be cached for this Scan.

This is true by default. When true, default settings of the table and family are used (this will never override caching blocks if the block cache is disabled for that family or entirely).

Parameters:: cacheBlocks - if false, default settings are overridden and blocks will not be cached

getCacheBlocks

public boolean getCacheBlocks()

Get whether blocks should be cached for this Scan.

Returns:: true if default caching should be used, false if blocks should not be cached

setReversed

public Scan setReversed(boolean reversed)

Set whether this scan is a reversed one

This is false by default which means forward(normal) scan.

Parameters:: reversed - if true, scan will be backward order
Returns:: this

isReversed

public boolean isReversed()

Get whether this scan is a reversed one.

Returns:: true if backward scan, false if forward(default) scan

setLoadColumnFamiliesOnDemand

public void setLoadColumnFamiliesOnDemand(boolean value)

Set the value indicating whether loading CFs on demand should be allowed (cluster default is false). On-demand CF loading doesn't load column families until necessary, e.g. if you filter on one column, the other column family data will be loaded only for the rows that are included in result, not all rows like in normal case.

With column-specific filters, like SingleColumnValueFilter w/filterIfMissing == true, this can deliver huge perf gains when there's a cf with lots of data; however, it can also lead to some inconsistent results, as follows:

- if someone does a concurrent update to both column families in question you may get a row that never existed, e.g. for { rowKey = 5, { cat_videos => 1 }, { video => "my cat" } } someone puts rowKey 5 with { cat_videos => 0 }, { video => "my dog" }, concurrent scan filtering on "cat_videos == 1" can get { rowKey = 5, { cat_videos => 1 }, { video => "my dog" } }.
- if there's a concurrent split and you have more than 2 column families, some rows may be missing some column families.

getLoadColumnFamiliesOnDemandValue

public Boolean getLoadColumnFamiliesOnDemandValue()

Get the raw loadColumnFamiliesOnDemand setting; if it's not set, can be null.

doLoadColumnFamiliesOnDemand

public boolean doLoadColumnFamiliesOnDemand()

Get the logical value indicating whether on-demand CF loading should be allowed.

getFingerprint

public Map<String,Object> getFingerprint()

Compile the table and column family (i.e. schema) information into a Map. Useful for parsing and aggregation by debugging, logging, and administration tools.

Specified by:: getFingerprint in class Operation

Returns:: Map

toMap

public Map<String,Object> toMap(int maxCols)

Compile the details beyond the scope of getFingerprint (row, columns, timestamps, etc.) into a Map along with the fingerprinted information. Useful for debugging, logging, and administration tools.

Specified by:: toMap in class Operation

Parameters:: maxCols - a limit on the number of columns output prior to truncation
Returns:: Map

setRaw

public void setRaw(boolean raw)

Enable/disable "raw" mode for this scan. If "raw" is enabled, the scan will return all delete markers and deleted rows that have not yet been collected. This is mostly useful for Scans on column families that have KEEP_DELETED_ROWS enabled. It is an error to specify any column when "raw" is set.

Parameters:: raw - True/False to enable/disable "raw" mode.

isRaw

public boolean isRaw()

Returns:: True if this Scan is in "raw" mode.

setSmall

public void setSmall(boolean small)

Set whether this scan is a small scan

Small scans should use pread, and big scans can use seek + read. seek + read is fast but can cause two problems: (1) resource contention and (2) too much network I/O. See "[89-fb] Using pread for non-compaction read request", https://issues.apache.org/jira/browse/HBASE-7266. On the other hand, if this is set to true, openScanner, next, and closeScanner are performed in one RPC call, which means better performance for small scans (HBASE-9488). Generally, if the scan range is within one data block (64KB), it can be considered a small scan.

Parameters:: small - true if this scan is a small scan

isSmall

public boolean isSmall()

Get whether this scan is a small scan

Returns:: true if small scan
