com.ibm.icu.text
Class RuleBasedBreakIterator

java.lang.Object
  extended by com.ibm.icu.text.BreakIterator
      extended by com.ibm.icu.text.RuleBasedBreakIterator
All Implemented Interfaces:
Cloneable
Direct Known Subclasses:
DictionaryBasedBreakIterator

public class RuleBasedBreakIterator
extends BreakIterator

Rule-based break iterator. This is a port of the C++ class RuleBasedBreakIterator from ICU4C.

Status:
Stable ICU 2.0.

Field Summary
protected static String fDebugEnv
          Deprecated. This API is ICU internal only.
protected  int fDictionaryCharCount
          Deprecated. This API is ICU internal only.
protected  com.ibm.icu.text.RBBIDataWrapper fRData
          Deprecated. This API is ICU internal only.
static boolean fTrace
          Deprecated. This API is ICU internal only.
static int WORD_IDEO
          Tag value for words containing ideographic characters, lower limit
static int WORD_IDEO_LIMIT
          Tag value for words containing ideographic characters, upper limit
static int WORD_KANA
          Tag value for words containing kana characters, lower limit
static int WORD_KANA_LIMIT
          Tag value for words containing kana characters, upper limit
static int WORD_LETTER
          Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.
static int WORD_LETTER_LIMIT
          Tag value for words containing letters, upper limit
static int WORD_NONE
          Tag value for "words" that do not fit into any of the other categories.
static int WORD_NONE_LIMIT
          Upper bound for tags for uncategorized words.
static int WORD_NUMBER
          Tag value for words that appear to be numbers, lower limit.
static int WORD_NUMBER_LIMIT
          Tag value for words that appear to be numbers, upper limit.
 
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
 
Constructor Summary
RuleBasedBreakIterator()
          Deprecated. This API is ICU internal only.
RuleBasedBreakIterator(String rules)
          Construct a RuleBasedBreakIterator from a set of rules supplied as a string.
 
Method Summary
protected static void checkOffset(int offset, CharacterIterator text)
          Throw IllegalArgumentException unless begin <= offset < end.
 Object clone()
          Clones this iterator.
static void compileRules(String rules, OutputStream ruleBinary)
          Compile a set of source break rules into the binary state tables used by the break iterator engine.
 int current()
          Returns the current iteration position.
 void dump()
          Deprecated. This API is ICU internal only.
 boolean equals(Object that)
          Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
 int first()
          Sets the current iteration position to the beginning of the text.
 int following(int offset)
          Sets the iterator to refer to the first boundary position following the specified position.
static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is)
          Create a break iterator from a precompiled set of break rules.
 int getRuleStatus()
          Return the status tag from the break rule that determined the most recently returned break position.
 int getRuleStatusVec(int[] fillInArray)
          Get the status (tag) values from the break rule(s) that determined the most recently returned break position.
 CharacterIterator getText()
          Return a CharacterIterator over the text being analyzed.
 int hashCode()
          Compute a hash code for this BreakIterator.
 boolean isBoundary(int offset)
          Returns true if the specified position is a boundary position.
 int last()
          Sets the current iteration position to the end of the text.
 int next()
          Advances the iterator to the next boundary position.
 int next(int n)
          Advances the iterator either forward or backward the specified number of steps.
 int preceding(int offset)
          Sets the iterator to refer to the last boundary position before the specified position.
 int previous()
          Moves the iterator backwards, to the last boundary preceding this one.
 void setText(CharacterIterator newText)
          Set the iterator to analyze a new piece of text.
 String toString()
          Returns the description (rules) used to create this iterator.
 
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getBreakInstance, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, registerInstance, registerInstance, setText, unregister
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

WORD_NONE

public static final int WORD_NONE
Tag value for "words" that do not fit into any of the other categories. Includes spaces and most punctuation.

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_NONE_LIMIT

public static final int WORD_NONE_LIMIT
Upper bound for tags for uncategorized words.

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_NUMBER

public static final int WORD_NUMBER
Tag value for words that appear to be numbers, lower limit.

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_NUMBER_LIMIT

public static final int WORD_NUMBER_LIMIT
Tag value for words that appear to be numbers, upper limit.

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_LETTER

public static final int WORD_LETTER
Tag value for words that contain letters, excluding hiragana, katakana or ideographic characters, lower limit.

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_LETTER_LIMIT

public static final int WORD_LETTER_LIMIT
Tag value for words containing letters, upper limit

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_KANA

public static final int WORD_KANA
Tag value for words containing kana characters, lower limit

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_KANA_LIMIT

public static final int WORD_KANA_LIMIT
Tag value for words containing kana characters, upper limit

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_IDEO

public static final int WORD_IDEO
Tag value for words containing ideographic characters, lower limit

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

WORD_IDEO_LIMIT

public static final int WORD_IDEO_LIMIT
Tag value for words containing ideographic characters, upper limit

See Also:
Constant Field Values
Status:
Draft ICU 3.0.

fRData

protected com.ibm.icu.text.RBBIDataWrapper fRData
Deprecated. This API is ICU internal only.
The rule data for this BreakIterator instance

Status:
Internal. This API is ICU internal only.

fDictionaryCharCount

protected int fDictionaryCharCount
Deprecated. This API is ICU internal only.
Counter for the number of characters encountered with the "dictionary" flag set. Normal RBBI iterators don't use it, although the code for updating it is live. Dictionary-based break iterators (DictionaryBasedBreakIterator, a subclass of this class) access this field directly.

Status:
Internal. This API is ICU internal only.

fTrace

public static boolean fTrace
Deprecated. This API is ICU internal only.
Debugging flag. Trace operation of state machine when true.

Status:
Internal. This API is ICU internal only.

fDebugEnv

protected static String fDebugEnv
Deprecated. This API is ICU internal only.
Control debug, trace and dump options.

Status:
Internal. This API is ICU internal only.

Constructor Detail

RuleBasedBreakIterator

public RuleBasedBreakIterator()
Deprecated. This API is ICU internal only.

Status:
Internal. This API is ICU internal only.

RuleBasedBreakIterator

public RuleBasedBreakIterator(String rules)
Construct a RuleBasedBreakIterator from a set of rules supplied as a string.

Parameters:
rules - The break rules to be used.
Status:
Stable ICU 2.2.

Method Detail

getInstanceFromCompiledRules

public static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is)
                                                           throws IOException
Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.

Parameters:
is - an input stream supplying the compiled binary rules.
Throws:
IOException - if there is an error while reading the rules from the InputStream.
See Also:
compileRules(String, OutputStream)
Status:
Draft ICU 4.8.

clone

public Object clone()
Clones this iterator.

Overrides:
clone in class BreakIterator
Returns:
A newly-constructed RuleBasedBreakIterator with the same behavior as this one.
Status:
Stable ICU 2.0.

equals

public boolean equals(Object that)
Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.

Overrides:
equals in class Object
Status:
Stable ICU 2.0.

toString

public String toString()
Returns the description (rules) used to create this iterator. (In ICU4C, the same function is RuleBasedBreakIterator::getRules())

Overrides:
toString in class Object
Status:
Stable ICU 2.0.

hashCode

public int hashCode()
Compute a hash code for this BreakIterator.

Overrides:
hashCode in class Object
Returns:
A hash code
Status:
Stable ICU 2.0.

dump

public void dump()
Deprecated. This API is ICU internal only.

Dump the contents of the state table and character classes for this break iterator. For debugging only.

Status:
Internal. This API is ICU internal only.

compileRules

public static void compileRules(String rules,
                                OutputStream ruleBinary)
                         throws IOException
Compile a set of source break rules into the binary state tables used by the break iterator engine. Creating a break iterator from precompiled rules is much faster than creating one from source rules. Binary break rules are not guaranteed to be compatible between different versions of ICU.

Parameters:
rules - The source form of the break rules
ruleBinary - An output stream to receive the compiled rules.
Throws:
IOException - If there is an error writing the output.
See Also:
getInstanceFromCompiledRules(InputStream)
Status:
Draft ICU 4.8.

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Specified by:
first in class BreakIterator
Returns:
The offset of the beginning of the text.
Status:
Stable ICU 2.0.

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Specified by:
last in class BreakIterator
Returns:
The text's past-the-end offset.
Status:
Stable ICU 2.0.

next

public int next(int n)
Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().

Specified by:
next in class BreakIterator
Parameters:
n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
Returns:
The character offset of the boundary position n boundaries away from the current one.
Status:
Stable ICU 2.0.
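
The equivalence between next(n) and repeated single steps can be checked with a small sketch. It is not ICU sample code: it uses the JDK's java.text.BreakIterator, which documents the same next(int) contract, so that it runs without ICU4J on the classpath. The class name StepEquivalence and both helper methods are illustrative names, and only positive n is exercised here.

```java
import java.text.BreakIterator;
import java.util.Locale;

class StepEquivalence {
    // Assumption: java.text.BreakIterator stands in for the ICU class.
    static BreakIterator wordsOver(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    // Move n boundaries forward from the start in a single call.
    static int byStepCount(String text, int n) {
        BreakIterator it = wordsOver(text);
        it.first();
        return it.next(n);
    }

    // Move n boundaries forward from the start, one next() call at a time.
    static int byRepeatedCalls(String text, int n) {
        BreakIterator it = wordsOver(text);
        int pos = it.first();
        for (int i = 0; i < n; i++) {
            pos = it.next();
        }
        return pos;
    }

    public static void main(String[] args) {
        // Word boundaries of "Hello world" are 0, 5, 6, 11.
        System.out.println(byStepCount("Hello world", 2));     // 6
        System.out.println(byRepeatedCalls("Hello world", 2)); // 6
    }
}
```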

next

public int next()
Advances the iterator to the next boundary position.

Specified by:
next in class BreakIterator
Returns:
The position of the first boundary after this one.
Status:
Stable ICU 2.0.
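
A typical forward traversal starts at first() and calls next() until it returns BreakIterator.DONE. The following is a minimal sketch rather than ICU's own sample: it uses the JDK's java.text.BreakIterator, which shares the first()/next()/DONE contract, so that it compiles without ICU4J. The class name BoundaryLoop and the helper boundaries are illustrative. Changing the import to com.ibm.icu.text.BreakIterator should require no other code changes, although the exact boundaries can differ between implementations and versions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryLoop {
    // Collect every boundary offset, from the start of the text to its end.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        // first() returns the starting offset; next() returns DONE past the last boundary.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world")); // [0, 5, 6, 11]
    }
}
```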

previous

public int previous()
Moves the iterator backwards, to the last boundary preceding this one.

Specified by:
previous in class BreakIterator
Returns:
The position of the last boundary position preceding this one.
Status:
Stable ICU 2.0.

following

public int following(int offset)
Sets the iterator to refer to the first boundary position following the specified position.

Specified by:
following in class BreakIterator
Parameters:
offset - The position from which to begin searching for a break position.
Returns:
The position of the first boundary after the specified offset.
Status:
Stable ICU 2.0.

preceding

public int preceding(int offset)
Sets the iterator to refer to the last boundary position before the specified position.

Overrides:
preceding in class BreakIterator
Parameters:
offset - The position from which to begin searching for a break position.
Returns:
The position of the last boundary before the starting position.
Status:
Stable ICU 2.0.

checkOffset

protected static final void checkOffset(int offset,
                                        CharacterIterator text)
Throw IllegalArgumentException unless begin <= offset < end.

Status:
Stable ICU 2.0.

isBoundary

public boolean isBoundary(int offset)
Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".

Overrides:
isBoundary in class BreakIterator
Parameters:
offset - the offset to check.
Returns:
True if "offset" is a boundary position.
Status:
Stable ICU 2.0.
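
The offset-based queries following(), preceding() and isBoundary() can be compared side by side in a short sketch. As an assumption made for portability, it uses the JDK's java.text.BreakIterator, whose methods of the same names follow the same documented contract. The class name OffsetQueries and the helper wordsOver are illustrative.

```java
import java.text.BreakIterator;
import java.util.Locale;

class OffsetQueries {
    // Assumption: java.text.BreakIterator stands in for the ICU class.
    static BreakIterator wordsOver(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        return it;
    }

    public static void main(String[] args) {
        // Word boundaries of "Hello world" are 0, 5, 6, 11.
        BreakIterator it = wordsOver("Hello world");
        System.out.println(it.following(2));  // 5: first boundary after offset 2
        System.out.println(it.preceding(5));  // 0: last boundary before offset 5
        System.out.println(it.isBoundary(5)); // true
        System.out.println(it.isBoundary(2)); // false
        // isBoundary() repositions the iterator as a side effect.
        System.out.println(it.current());
    }
}
```

Because each of these calls moves the iterator, code that interleaves them with next() or previous() should not assume the position is unchanged.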

current

public int current()
Returns the current iteration position.

Specified by:
current in class BreakIterator
Returns:
The current iteration position.
Status:
Stable ICU 2.0.

getRuleStatus

public int getRuleStatus()
Return the status tag from the break rule that determined the most recently returned break position. The values appear in the rule source within braces, {123}, for example. For rules that do not specify a status, a default value of 0 is returned. If more than one rule applies, the numerically largest of the possible status values is returned.

Of the standard types of ICU break iterators, only the word break iterator provides status values. The values are defined in class RuleBasedBreakIterator, and allow distinguishing between words that contain alphabetic letters, "words" that appear to be numbers, punctuation and spaces, words containing ideographic characters, and more. Call getRuleStatus after obtaining a boundary position from next(), previous(), or any other break iterator function that returns a boundary position.

Returns:
the status from the break rule that determined the most recently returned break position.
Status:
Draft ICU 3.0.
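
The WORD_* constants come in lower-limit/upper-limit pairs, so a status value is usually classified with a half-open range test. The sketch below shows one way to do that. To stay runnable without ICU4J it declares local copies of the constants. The numeric values (0, 100, 200, 300, 400, 500) are an assumption taken from ICU's Constant Field Values listing and should be checked against the ICU version in use. With ICU4J available, RuleBasedBreakIterator.WORD_NUMBER and its siblings would be referenced instead. The class name WordTagCategory and the method categorize are illustrative.

```java
class WordTagCategory {
    // Assumption: local mirrors of the RuleBasedBreakIterator.WORD_* constants.
    static final int WORD_NONE = 0,     WORD_NONE_LIMIT = 100;
    static final int WORD_NUMBER = 100, WORD_NUMBER_LIMIT = 200;
    static final int WORD_LETTER = 200, WORD_LETTER_LIMIT = 300;
    static final int WORD_KANA = 300,   WORD_KANA_LIMIT = 400;
    static final int WORD_IDEO = 400,   WORD_IDEO_LIMIT = 500;

    // Map a rule status tag to a category name using [lower, upper) ranges.
    static String categorize(int status) {
        if (status >= WORD_NONE && status < WORD_NONE_LIMIT) return "none";
        if (status >= WORD_NUMBER && status < WORD_NUMBER_LIMIT) return "number";
        if (status >= WORD_LETTER && status < WORD_LETTER_LIMIT) return "letter";
        if (status >= WORD_KANA && status < WORD_KANA_LIMIT) return "kana";
        if (status >= WORD_IDEO && status < WORD_IDEO_LIMIT) return "ideographic";
        return "unknown"; // tag outside the ranges used by the standard rules
    }

    public static void main(String[] args) {
        System.out.println(categorize(0));   // none
        System.out.println(categorize(200)); // letter
    }
}
```

In real use the argument would be the value of getRuleStatus(), read immediately after next() or another boundary-returning call on a word break iterator.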

getRuleStatusVec

public int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position. The values appear in the rule source within braces, {123}, for example. The default status value for rules that do not explicitly provide one is zero.

The status values used by the standard ICU break rules are defined as public constants in class RuleBasedBreakIterator.

If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.

Parameters:
fillInArray - an array to be filled in with the status values.
Returns:
The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
Status:
Draft ICU 3.0.
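
Because the return value is the total number of available status values even when the array is too small, a caller can detect truncation and retry with a larger array. The sketch below illustrates that calling pattern under stated assumptions. fillStatus is a hypothetical stand-in that mimics the documented contract, not ICU's implementation, and the class name StatusVecContract and the method readAll are likewise illustrative. With ICU4J, the source passed to readAll would be a method reference to getRuleStatusVec on a RuleBasedBreakIterator instance.

```java
import java.util.Arrays;
import java.util.function.ToIntFunction;

class StatusVecContract {
    // Hypothetical stand-in for getRuleStatusVec: copies what fits and
    // returns the total number of values available, even if truncated.
    static int fillStatus(int[] available, int[] fillInArray) {
        int n = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, n);
        return available.length;
    }

    // Caller-side pattern: try a small buffer, then retry if it was too small.
    static int[] readAll(ToIntFunction<int[]> source) {
        int[] buf = new int[2];
        int total = source.applyAsInt(buf);
        if (total > buf.length) {
            buf = new int[total];
            total = source.applyAsInt(buf);
        }
        return Arrays.copyOf(buf, total);
    }

    public static void main(String[] args) {
        int[] tags = {100, 200, 300};
        int[] all = readAll(buf -> fillStatus(tags, buf));
        System.out.println(Arrays.toString(all)); // [100, 200, 300]
    }
}
```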

getText

public CharacterIterator getText()
Return a CharacterIterator over the text being analyzed. This version of this method returns the actual CharacterIterator we're using internally. Changing the state of this iterator can have undefined consequences. If you need to change it, clone it first.

Specified by:
getText in class BreakIterator
Returns:
An iterator over the text being analyzed.
Status:
Stable ICU 2.0.
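
The advice to clone before changing the iterator can be followed as in the sketch below. It uses only java.text.CharacterIterator, the type getText() returns. The parameter shared stands in for the value returned by getText(), and the class name SafeTextAccess and the method readAll are illustrative.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class SafeTextAccess {
    // Read the whole text through a clone, leaving the shared iterator untouched.
    static String readAll(CharacterIterator shared) {
        CharacterIterator copy = (CharacterIterator) shared.clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        CharacterIterator shared = new StringCharacterIterator("abc");
        shared.setIndex(1);
        System.out.println(readAll(shared));   // abc
        System.out.println(shared.getIndex()); // 1: position of the original is preserved
    }
}
```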

setText

public void setText(CharacterIterator newText)
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

Specified by:
setText in class BreakIterator
Parameters:
newText - An iterator over the text to analyze.
Status:
Stable ICU 2.0.


Copyright (c) 2011 IBM Corporation and others.