org.hbase.async
Class Scanner

java.lang.Object
  extended by org.hbase.async.Scanner

public final class Scanner
extends Object

Creates a scanner to read data sequentially from HBase.

This class is not synchronized as it's expected to be used from a single thread at a time. It's rarely (if ever?) useful to scan concurrently from a shared scanner using multiple threads. If you want to optimize large table scans using extra parallelism, create a few scanners and give each of them a partition of the table to scan. Or use MapReduce.
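As a rough illustration of giving each scanner its own partition of the table, the sketch below splits the row-key space into contiguous [start, stop) ranges based on the first byte of the key. KeyRangePartitioner and Range are hypothetical names, not part of this API, and the approach assumes row keys are spread reasonably evenly over their first byte. An empty array stands for "leave that bound unset".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits the row-key space into contiguous ranges by first byte. */
final class KeyRangePartitioner {
  /** One [start, stop) range; an empty array means "leave that bound unset". */
  static final class Range {
    final byte[] start;
    final byte[] stop;
    Range(byte[] start, byte[] stop) {
      this.start = start;
      this.stop = stop;
    }
  }

  /** Returns `partitions` ranges that together cover every possible row key. */
  static List<Range> split(int partitions) {
    if (partitions < 1 || partitions > 256) {
      throw new IllegalArgumentException("partitions must be in [1, 256]");
    }
    List<Range> ranges = new ArrayList<>(partitions);
    for (int i = 0; i < partitions; i++) {
      // Boundaries are spread evenly over the 256 possible values of the first byte.
      int lo = i * 256 / partitions;
      int hi = (i + 1) * 256 / partitions;
      byte[] start = (i == 0) ? new byte[0] : new byte[] { (byte) lo };
      byte[] stop = (i == partitions - 1) ? new byte[0] : new byte[] { (byte) hi };
      ranges.add(new Range(start, stop));
    }
    return ranges;
  }
}
```

Each range would then be handed to its own scanner through setStartKey(byte[]) and setStopKey(byte[]), skipping the call when the bound is empty. Because the stop key is exclusive, adjacent ranges never overlap.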

Unlike HBase's traditional client, there's no method in this class to explicitly open the scanner. It will open itself automatically when you start scanning by calling nextRows(). Also, the scanner will automatically call close() when it reaches the end key. If, however, you would like to stop scanning before reaching the end key, you must call close() before disposing of the scanner. Note that it's always safe to call close() on a scanner.

If you keep your scanner open and idle for too long, the RegionServer will close the scanner automatically for you after a timeout configured on the server side. When this happens, you'll get an UnknownScannerException when you attempt to use the scanner again. Also, if you scan too slowly (e.g. you take a long time between each call to nextRows()), you may prevent HBase from splitting the region if the region is also actively being written to while you scan. For heavy processing you should consider using MapReduce.

A Scanner is not re-usable. Should you want to scan the same rows or the same table again, you must create a new one.
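The lifecycle described above (scan until a null batch, close explicitly only when stopping early) can be sketched as an asynchronous loop. To keep the example self-contained, it does not use this library: RowSource is a hypothetical stand-in for the scanner, and the JDK's CompletableFuture stands in for Deferred. Only the control flow is meant to carry over.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class ScanLoopSketch {
  /** Hypothetical stand-in for a scanner: same null-at-end contract, JDK futures instead of Deferred. */
  interface RowSource<R> {
    CompletableFuture<ArrayList<R>> nextRows();
    CompletableFuture<Object> close();
  }

  /** Collects rows until the source is exhausted or at least maxRows rows were read. */
  static <R> CompletableFuture<List<R>> drain(RowSource<R> source, int maxRows) {
    return step(source, maxRows, new ArrayList<R>());
  }

  private static <R> CompletableFuture<List<R>> step(RowSource<R> source, int maxRows, List<R> acc) {
    return source.nextRows().thenCompose(batch -> {
      if (batch == null) {
        // End of scan: the source closed itself, and nextRows() must not be called again.
        return CompletableFuture.completedFuture(acc);
      }
      acc.addAll(batch);
      if (acc.size() >= maxRows) {
        // Stopping early: close explicitly before dropping the source.
        return source.close().thenApply(ignored -> acc);
      }
      return step(source, maxRows, acc);
    });
  }

  /** In-memory source used only to exercise the loop. */
  static final class InMemorySource<R> implements RowSource<R> {
    private final Iterator<ArrayList<R>> batches;
    boolean closed;

    InMemorySource(List<ArrayList<R>> batches) {
      this.batches = batches.iterator();
    }

    @Override
    public CompletableFuture<ArrayList<R>> nextRows() {
      return CompletableFuture.completedFuture(batches.hasNext() ? batches.next() : null);
    }

    @Override
    public CompletableFuture<Object> close() {
      closed = true;
      return CompletableFuture.completedFuture(null);
    }
  }
}
```

The loop never blocks a thread while waiting for a batch; each step is chained onto the completion of the previous one, which is the same shape a callback attached to the Deferred returned by nextRows() would have.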

A note on passing byte arrays as arguments

None of the methods that receive a byte[] argument will copy it. For more info, please refer to the documentation of HBaseRpc.
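To see why the absence of a copy matters, here is a small sketch of the aliasing hazard. KeyHolder is a hypothetical stand-in for any object that keeps a reference to the array it is given; it is not a class of this library.

```java
final class NoCopyDemo {
  /** Hypothetical stand-in for an object that stores a byte[] without copying it. */
  static final class KeyHolder {
    private final byte[] key;
    KeyHolder(byte[] key) {
      this.key = key;  // Keeps the caller's array; no defensive copy.
    }
    byte[] key() {
      return key;
    }
  }

  /** Mutating the caller's buffer after the hand-off changes what the holder sees. */
  static boolean sharedArrayIsAffected() {
    byte[] buf = { 'r', 'o', 'w', '1' };
    KeyHolder holder = new KeyHolder(buf);
    buf[3] = '2';
    return holder.key()[3] == '2';
  }

  /** Passing a clone keeps the holder isolated from later changes to the buffer. */
  static boolean copiedArrayIsIsolated() {
    byte[] buf = { 'r', 'o', 'w', '1' };
    KeyHolder holder = new KeyHolder(buf.clone());
    buf[3] = '2';
    return holder.key()[3] == '1';
  }
}
```

If you reuse a buffer to build several keys, pass a clone (or a fresh array) to each call rather than the buffer itself.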

A note on passing Strings as arguments

All strings are assumed to use the platform's default charset.
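Since the platform's default charset can differ from one machine to another, one option is to encode keys yourself with an explicit charset and use the byte[] overloads. The sketch below only shows that the same String yields different bytes under different charsets; ExplicitKeyEncoding is a hypothetical helper name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitKeyEncoding {
  /** Encodes a key with an explicit charset, independent of the JVM's default. */
  static byte[] encode(String key, Charset charset) {
    return key.getBytes(charset);
  }

  /** Encodes a key with whatever charset this JVM happens to default to. */
  static byte[] encodeWithPlatformDefault(String key) {
    return key.getBytes(Charset.defaultCharset());
  }

  /** True if the two charsets produce different bytes for this key. */
  static boolean differs(String key) {
    byte[] utf8 = encode(key, StandardCharsets.UTF_8);
    byte[] latin1 = encode(key, StandardCharsets.ISO_8859_1);
    return !java.util.Arrays.equals(utf8, latin1);
  }
}
```

Pure ASCII keys encode identically in both charsets; keys containing accented or non-Latin characters do not, which is where relying on the default becomes fragile.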


Field Summary
static int DEFAULT_MAX_NUM_KVS
          The default maximum number of KeyValues the server is allowed to return in a single RPC response to a Scanner.
static int DEFAULT_MAX_NUM_ROWS
          The default maximum number of rows to scan per RPC.
 
Method Summary
 Deferred<Object> close()
          Closes this scanner (don't forget to call this when you're done with it!).
 byte[] getCurrentKey()
          Returns the row key this scanner is currently at.
 long getMaxTimestamp()
          Returns the maximum timestamp to scan (exclusive).
 int getMaxVersions()
          Returns the maximum number of versions to return for each cell scanned.
 long getMinTimestamp()
          Returns the minimum timestamp to scan (inclusive).
 Deferred<ArrayList<ArrayList<KeyValue>>> nextRows()
          Scans a number of rows.
 Deferred<ArrayList<ArrayList<KeyValue>>> nextRows(int nrows)
          Scans a number of rows.
 void setFamily(byte[] family)
          Specifies a particular column family to scan.
 void setFamily(String family)
          Specifies a particular column family to scan.
 void setKeyRegexp(String regexp)
          Sets a regular expression to filter results based on the row key.
 void setKeyRegexp(String regexp, Charset charset)
          Sets a regular expression to filter results based on the row key.
 void setMaxNumKeyValues(int max_num_kvs)
          Sets the maximum number of KeyValues the server is allowed to return in a single RPC response.
 void setMaxNumRows(int max_num_rows)
          Sets the maximum number of rows to scan per RPC (for better performance).
 void setMaxTimestamp(long timestamp)
          Sets the maximum timestamp to scan (exclusive).
 void setMaxVersions(int versions)
          Sets the maximum number of versions to return for each cell scanned.
 void setMinTimestamp(long timestamp)
          Sets the minimum timestamp to scan (inclusive).
 void setQualifier(byte[] qualifier)
          Specifies a particular column qualifier to scan.
 void setQualifier(String qualifier)
          Specifies a particular column qualifier to scan.
 void setQualifiers(byte[][] qualifiers)
          Specifies one or more column qualifiers to scan.
 void setServerBlockCache(boolean populate_blockcache)
          Sets whether or not the server should populate its block cache.
 void setStartKey(byte[] start_key)
          Specifies from which row key to start scanning (inclusive).
 void setStartKey(String start_key)
          Specifies from which row key to start scanning (inclusive).
 void setStopKey(byte[] stop_key)
          Specifies up to which row key to scan (exclusive).
 void setStopKey(String stop_key)
          Specifies up to which row key to scan (exclusive).
 void setTimeRange(long min_timestamp, long max_timestamp)
          Sets the time range to scan.
 String toString()
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

DEFAULT_MAX_NUM_KVS

public static final int DEFAULT_MAX_NUM_KVS
The default maximum number of KeyValues the server is allowed to return in a single RPC response to a Scanner.

This default value is exposed only as a hint but the value itself is not part of the API and is subject to change without notice.

See Also:
setMaxNumKeyValues(int), Constant Field Values

DEFAULT_MAX_NUM_ROWS

public static final int DEFAULT_MAX_NUM_ROWS
The default maximum number of rows to scan per RPC.

This default value is exposed only as a hint but the value itself is not part of the API and is subject to change without notice.

See Also:
setMaxNumRows(int), Constant Field Values
Method Detail

getCurrentKey

public byte[] getCurrentKey()
Returns the row key this scanner is currently at. Do not modify the byte array returned.


setStartKey

public void setStartKey(byte[] start_key)
Specifies from which row key to start scanning (inclusive).

Parameters:
start_key - The row key to start scanning from. If you don't invoke this method, scanning will begin from the first row key in the table. This byte array will NOT be copied.
Throws:
IllegalStateException - if scanning already started.

setStartKey

public void setStartKey(String start_key)
Specifies from which row key to start scanning (inclusive).

Throws:
IllegalStateException - if scanning already started.
See Also:
setStartKey(byte[])

setStopKey

public void setStopKey(byte[] stop_key)
Specifies up to which row key to scan (exclusive).

Parameters:
stop_key - The row key to scan up to. If you don't invoke this method, or if the array is empty (stop_key.length == 0), every row up to and including the last one will be scanned. This byte array will NOT be copied.
Throws:
IllegalStateException - if scanning already started.
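
Because the stop key is exclusive, a common way to scan every row sharing a given prefix is to use the prefix as the start key and the "next" prefix as the stop key. The helper below is a sketch of that computation; PrefixStopKey is a hypothetical name, and it assumes row keys are ordered as unsigned bytes, as they are in HBase.

```java
import java.util.Arrays;

final class PrefixStopKey {
  /**
   * Returns the smallest key that sorts after every key starting with `prefix`,
   * or an empty array if no such key exists (scan to the end of the table).
   */
  static byte[] stopKeyForPrefix(byte[] prefix) {
    // Trailing 0xFF bytes cannot be incremented, so skip past them.
    int i = prefix.length - 1;
    while (i >= 0 && prefix[i] == (byte) 0xFF) {
      i--;
    }
    if (i < 0) {
      return new byte[0];  // Empty or all-0xFF prefix: no upper bound.
    }
    // Copy so the caller's array is left untouched, then bump the last usable byte.
    byte[] stop = Arrays.copyOf(prefix, i + 1);
    stop[i]++;
    return stop;
  }
}
```

For example, the prefix "ab" yields the stop key "ac", so the scan covers "ab", "ab\0", "abzzz" and so on, but not "ac".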

setStopKey

public void setStopKey(String stop_key)
Specifies up to which row key to scan (exclusive).

Throws:
IllegalStateException - if scanning already started.
See Also:
setStopKey(byte[])

setFamily

public void setFamily(byte[] family)
Specifies a particular column family to scan.

Parameters:
family - The column family. This byte array will NOT be copied.
Throws:
IllegalStateException - if scanning already started.

setFamily

public void setFamily(String family)
Specifies a particular column family to scan.


setQualifier

public void setQualifier(byte[] qualifier)
Specifies a particular column qualifier to scan.

Note that specifying a qualifier without a family has no effect.

Parameters:
qualifier - The column qualifier. This byte array will NOT be copied.
Throws:
IllegalStateException - if scanning already started.

setQualifier

public void setQualifier(String qualifier)
Specifies a particular column qualifier to scan.


setQualifiers

public void setQualifiers(byte[][] qualifiers)
Specifies one or more column qualifiers to scan.

Note that specifying qualifiers without a family has no effect.

Parameters:
qualifiers - The column qualifiers. These byte arrays will NOT be copied.
Throws:
IllegalStateException - if scanning already started.
Since:
1.4

setKeyRegexp

public void setKeyRegexp(String regexp)
Sets a regular expression to filter results based on the row key.

This is equivalent to calling setKeyRegexp(String, Charset) with ISO-8859-1 as the charset argument.

Parameters:
regexp - The regular expression with which to filter the row keys.

setKeyRegexp

public void setKeyRegexp(String regexp,
                         Charset charset)
Sets a regular expression to filter results based on the row key.

This regular expression will be applied on the server side, to the row key. Rows whose key doesn't match will not be returned to this scanner. This is useful for carefully selecting which rows are matched when a simple prefix match isn't enough, and it cuts down the amount of data transferred over the network.

Don't use an expensive regular expression, because Java's implementation uses backtracking and matching will happen on the server side, potentially on a very large number of row keys. See Regular Expression Matching Can Be Simple And Fast for more details on regular expression performance (or lack thereof) and what "backtracking" means.

Parameters:
regexp - The regular expression with which to filter the row keys.
charset - The charset used to decode the bytes of the row key into a string. The RegionServer must support this charset, otherwise it will unexpectedly close the connection the first time you attempt to use this scanner.
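
When row keys are binary rather than text, ISO-8859-1 is convenient because it maps every byte to exactly one character. The sketch below builds a regular expression for keys shaped as a fixed prefix, a fixed number of arbitrary bytes, and a fixed suffix, then checks it locally with java.util.regex. KeyRegexpSketch is a hypothetical helper, and the local check only approximates what the server does: it assumes a find-style match, which is why the expression is anchored at both ends.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class KeyRegexpSketch {
  /** Builds a regexp matching keys of the form: prefix, exactly `gap` arbitrary bytes, suffix. */
  static String regexpFor(byte[] prefix, int gap, byte[] suffix) {
    // (?s) lets '.' match line-terminator bytes such as 0x0A; \A and \z anchor both ends.
    return "(?s)\\A" + quote(prefix) + ".{" + gap + "}" + quote(suffix) + "\\z";
  }

  /** Quotes raw bytes so that none of them is interpreted as a regexp metacharacter. */
  private static String quote(byte[] literal) {
    return Pattern.quote(new String(literal, StandardCharsets.ISO_8859_1));
  }

  /** Local approximation of the filter: decode the key as ISO-8859-1, then search. */
  static boolean matches(String regexp, byte[] rowKey) {
    String decoded = new String(rowKey, StandardCharsets.ISO_8859_1);
    return Pattern.compile(regexp).matcher(decoded).find();
  }
}
```

An expression of this shape has no nested quantifiers or alternations, so it stays cheap to evaluate, in line with the warning above about backtracking.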

setServerBlockCache

public void setServerBlockCache(boolean populate_blockcache)
Sets whether or not the server should populate its block cache.

Parameters:
populate_blockcache - if false, the block cache of the server will not be populated as the rows are being scanned. If true (the default), the blocks loaded by the server in order to feed the scanner may be added to the block cache, which will make subsequent read accesses to the same rows and other neighbouring rows faster. Whether or not blocks will be added to the cache depends on the table's configuration.

If you scan a sequence of keys that is unlikely to be accessed again in the near future, you can help the server improve its cache efficiency by setting this to false.

Throws:
IllegalStateException - if scanning already started.

setMaxNumRows

public void setMaxNumRows(int max_num_rows)
Sets the maximum number of rows to scan per RPC (for better performance).

Every time nextRows() is invoked, up to this number of rows may be returned. The default value is DEFAULT_MAX_NUM_ROWS.

This knob has a high performance impact. If it's too low, you'll do too many network round-trips; if it's too high, you'll spend too much time and memory handling large amounts of data. The right value depends on the size of the rows you're retrieving.

If you know you're going to be scanning lots of small rows (few cells, and each cell doesn't store a lot of data), you can get better performance by scanning more rows per RPC. You probably always want to retrieve at least a few dozen kilobytes per call.

If you want to err on the safe side, it's better to use a value that's a bit too high rather than a bit too low. Avoid extreme values (such as 1 or 1024) unless you know what you're doing.

Note that unlike many other methods, it's fine to change this value while scanning. The change will take effect on all subsequent RPCs issued. This can be useful if you want to dynamically adjust how much data you receive at once (provided that you can estimate the size of your rows).

Parameters:
max_num_rows - A strictly positive integer.
Throws:
IllegalArgumentException - if the argument is zero or negative.
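
One way to pick a value, assuming you can estimate the average size of your rows, is to divide a target payload per RPC by that estimate and clamp the result away from extreme values. The sketch below does just that; RowsPerRpc, MIN_ROWS and MAX_ROWS are hypothetical names and bounds chosen for illustration, not recommendations from this library.

```java
final class RowsPerRpc {
  // Hypothetical bounds, picked only to stay away from extreme values.
  static final int MIN_ROWS = 8;
  static final int MAX_ROWS = 512;

  /** Estimates how many rows fit in the target payload, clamped to [MIN_ROWS, MAX_ROWS]. */
  static int forRowSize(long avgRowBytes, long targetBytesPerRpc) {
    if (avgRowBytes <= 0 || targetBytesPerRpc <= 0) {
      throw new IllegalArgumentException("sizes must be strictly positive");
    }
    long rows = targetBytesPerRpc / avgRowBytes;
    return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
  }
}
```

With a 64 KiB target, rows of about 1 KiB give 64 rows per RPC. The result can be passed to setMaxNumRows(int) and recomputed between calls to nextRows() as your size estimate improves.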

setMaxNumKeyValues

public void setMaxNumKeyValues(int max_num_kvs)
Sets the maximum number of KeyValues the server is allowed to return in a single RPC response.

If you're dealing with wide rows, in which you have many cells, you may want to limit the number of cells (KeyValues) that the server returns in a single RPC response.

The default is DEFAULT_MAX_NUM_KVS, unlike in HBase's client where the default is -1. If you set this to a negative value, the server will always return full rows, no matter how wide they are. If you request really wide rows, this may cause increased memory consumption on the server side as the server has to build a large RPC response, even if it tries to avoid copying data. On the client side, the consequences on memory usage are worse due to the lack of framing in RPC responses. The client will have to buffer a large RPC response and will have to do several memory copies to dynamically grow the size of the buffer as more and more data comes in.

Parameters:
max_num_kvs - A non-zero value.
Throws:
IllegalArgumentException - if the argument is zero.
IllegalStateException - if scanning already started.

setMaxVersions

public void setMaxVersions(int versions)
Sets the maximum number of versions to return for each cell scanned.

By default a scanner will only return the most recent version of each cell. If you want to get all possible versions available, pass Integer.MAX_VALUE as the argument.

Parameters:
versions - A strictly positive number of versions to return.
Throws:
IllegalStateException - if scanning already started.
IllegalArgumentException - if versions <= 0
Since:
1.4

getMaxVersions

public int getMaxVersions()
Returns the maximum number of versions to return for each cell scanned.

Returns:
A strictly positive integer.
Since:
1.4

setMinTimestamp

public void setMinTimestamp(long timestamp)
Sets the minimum timestamp to scan (inclusive).

KeyValues that have a timestamp strictly less than this one will not be returned by the scanner. In some cases, HBase's internal optimizations avoid loading filtered-out data into memory.

Parameters:
timestamp - The minimum timestamp to scan (inclusive).
Throws:
IllegalArgumentException - if timestamp < 0.
IllegalArgumentException - if timestamp > getMaxTimestamp().
Since:
1.3
See Also:
setTimeRange(long, long)

getMinTimestamp

public long getMinTimestamp()
Returns the minimum timestamp to scan (inclusive).

Returns:
A positive integer.
Since:
1.3

setMaxTimestamp

public void setMaxTimestamp(long timestamp)
Sets the maximum timestamp to scan (exclusive).

KeyValues that have a timestamp greater than or equal to this one will not be returned by the scanner. In some cases, HBase's internal optimizations avoid loading filtered-out data into memory.

Parameters:
timestamp - The maximum timestamp to scan (exclusive).
Throws:
IllegalArgumentException - if timestamp < 0.
IllegalArgumentException - if timestamp < getMinTimestamp().
Since:
1.3
See Also:
setTimeRange(long, long)

getMaxTimestamp

public long getMaxTimestamp()
Returns the maximum timestamp to scan (exclusive).

Returns:
A positive integer.
Since:
1.3

setTimeRange

public void setTimeRange(long min_timestamp,
                         long max_timestamp)
Sets the time range to scan.

KeyValues whose timestamp does not fall in the range [min_timestamp, max_timestamp) will not be returned by the scanner. In some cases, HBase's internal optimizations avoid loading filtered-out data into memory.

Parameters:
min_timestamp - The minimum timestamp to scan (inclusive).
max_timestamp - The maximum timestamp to scan (exclusive).
Throws:
IllegalArgumentException - if min_timestamp < 0
IllegalArgumentException - if max_timestamp < 0
IllegalArgumentException - if min_timestamp > max_timestamp
Since:
1.3
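
The half-open range and the documented argument checks can be mirrored in a few lines. This is only a local model of the semantics described above, useful for reasoning about boundary cases; TimeRangeSketch is a hypothetical name and the code does not call into this library.

```java
final class TimeRangeSketch {
  /** Applies the same argument checks as documented for setTimeRange. */
  static void check(long minTimestamp, long maxTimestamp) {
    if (minTimestamp < 0) {
      throw new IllegalArgumentException("min_timestamp < 0");
    }
    if (maxTimestamp < 0) {
      throw new IllegalArgumentException("max_timestamp < 0");
    }
    if (minTimestamp > maxTimestamp) {
      throw new IllegalArgumentException("min_timestamp > max_timestamp");
    }
  }

  /** True if `timestamp` falls in the half-open range: min inclusive, max exclusive. */
  static boolean contains(long minTimestamp, long maxTimestamp, long timestamp) {
    check(minTimestamp, maxTimestamp);
    return timestamp >= minTimestamp && timestamp < maxTimestamp;
  }
}
```

Note that with min_timestamp equal to max_timestamp the range is valid but empty, since the upper bound is exclusive.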

nextRows

public Deferred<ArrayList<ArrayList<KeyValue>>> nextRows(int nrows)
Scans a number of rows. Calling this method is equivalent to:
   this.setMaxNumRows(nrows);
   this.nextRows();
 

Parameters:
nrows - The maximum number of rows to retrieve.
Returns:
A deferred list of rows.
See Also:
setMaxNumRows(int), nextRows()

nextRows

public Deferred<ArrayList<ArrayList<KeyValue>>> nextRows()
Scans a number of rows.

The last row returned may be partial if it's very wide and setMaxNumKeyValues(int) wasn't called with a negative value.

Once this method has returned null (which indicates that this Scanner is done scanning), calling it again leads to undefined behavior.

Returns:
A deferred list of rows. Each row is a list of KeyValue and each element in the list returned represents a different row. Rows are returned in sequential order. null is returned if there are no more rows to scan. Otherwise its size is guaranteed to be less than or equal to the value last given to setMaxNumRows(int).
See Also:
setMaxNumRows(int), setMaxNumKeyValues(int)
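
If your code needs complete rows even though the last row of a batch may be partial, one option is to merge batches as they arrive. The sketch below assumes that the remainder of a partial row comes back as the first row of the next batch, under the same row key; that is an assumption about the wire behavior, not something this page states. Cell is a hypothetical stand-in for KeyValue, reduced to a row key and a qualifier.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PartialRowStitcher {
  /** Hypothetical stand-in for a KeyValue: just a row key and a qualifier. */
  static final class Cell {
    final byte[] key;
    final String qualifier;
    Cell(byte[] key, String qualifier) {
      this.key = key;
      this.qualifier = qualifier;
    }
  }

  /** Convenience factory for building test cells from text keys. */
  static Cell cell(String key, String qualifier) {
    return new Cell(key.getBytes(StandardCharsets.ISO_8859_1), qualifier);
  }

  private final List<List<Cell>> rows = new ArrayList<>();

  /** Adds one batch; a row sharing the key of the previously stored row is appended to it. */
  void add(List<List<Cell>> batch) {
    for (List<Cell> row : batch) {
      if (row.isEmpty()) {
        continue;
      }
      List<Cell> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
      if (last != null && Arrays.equals(last.get(0).key, row.get(0).key)) {
        last.addAll(row);  // Continuation of the row split across two batches.
      } else {
        rows.add(new ArrayList<>(row));
      }
    }
  }

  /** Rows accumulated so far; the last one may still grow until the scan ends. */
  List<List<Cell>> rows() {
    return rows;
  }
}
```

Keep in mind that the last accumulated row is only known to be complete once a row with a different key arrives or the scan ends.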

close

public Deferred<Object> close()
Closes this scanner (don't forget to call this when you're done with it!).

Closing a scanner already closed has no effect. The deferred returned will be called back immediately.

Returns:
A deferred object that indicates the completion of the request. The Object has no special meaning and can be null.

toString

public String toString()
Overrides:
toString in class Object