com.ibm.icu.text
Class BreakIterator

java.lang.Object
  extended by com.ibm.icu.text.BreakIterator
All Implemented Interfaces:
Cloneable
Direct Known Subclasses:
RuleBasedBreakIterator

public abstract class BreakIterator
extends Object
implements Cloneable

[icu enhancement] ICU's replacement for java.text.BreakIterator. Methods, fields, and other functionality specific to ICU are labeled '[icu]'.

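A class that locates boundaries in text. This class defines a protocol for objects that break up a piece of natural-language text according to a set of criteria. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of some language or group of languages. Five built-in types of BreakIterator are provided, one per family of factory methods: getCharacterInstance() locates logical-character boundaries, getWordInstance() locates word boundaries, getLineInstance() locates legal line-wrapping positions, getSentenceInstance() locates sentence boundaries, and getTitleInstance() locates title boundaries.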

BreakIterator's interface follows an "iterator" model (hence the name), meaning it has a concept of a "current position" and methods like first(), last(), next(), and previous() that update the current position. All BreakIterators uphold the following invariants: the beginning and end of the text are always treated as boundary positions; random-access methods such as following() and preceding() move the iterator to a boundary position, never to the raw offset that was passed in; and DONE is returned only when next() is called at the end of the text or previous() is called at the beginning of the text.

BreakIterator accesses the text it analyzes through a CharacterIterator, which makes it possible to use BreakIterator to analyze text in any text-storage vehicle that provides a CharacterIterator interface.

Note: Some types of BreakIterator can take a long time to create, and instances of BreakIterator are not currently cached by the system. For optimal performance, keep instances of BreakIterator around as long as makes sense. For example, when word-wrapping a document, don't create and destroy a new BreakIterator for each line. Create one break iterator for the whole document (or whatever stretch of text you're wrapping) and use it to do the whole job of wrapping the text.
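The reuse advice above can be sketched as follows. This is a minimal illustration, not ICU sample code: it uses java.text.BreakIterator as a stand-in so that it runs without ICU4J on the classpath (the assumption is that switching the import to com.ibm.icu.text.BreakIterator behaves the same for these calls), and ReuseSketch and segmentCounts are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class ReuseSketch {
    // Counts the line-break segments in each string using one shared iterator.
    static List<Integer> segmentCounts(List<String> lines) {
        BreakIterator breaker = BreakIterator.getLineInstance(Locale.US); // created once
        List<Integer> counts = new ArrayList<>();
        for (String line : lines) {
            breaker.setText(line); // re-target the existing instance
            int count = 0;
            breaker.first();
            while (breaker.next() != BreakIterator.DONE) {
                count++;
            }
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(segmentCounts(List.of("Hello big world", "Hi")));
    }
}
```

The only per-line cost in the loop is setText(); the potentially expensive factory call happens once.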

Examples:

Creating and using text boundaries

 public static void main(String args[]) {
      if (args.length == 1) {
          String stringToExamine = args[0];
          //print each word in order
          BreakIterator boundary = BreakIterator.getWordInstance();
          boundary.setText(stringToExamine);
          printEachForward(boundary, stringToExamine);
          //print each sentence in reverse order
          boundary = BreakIterator.getSentenceInstance(Locale.US);
          boundary.setText(stringToExamine);
          printEachBackward(boundary, stringToExamine);
          printFirst(boundary, stringToExamine);
          printLast(boundary, stringToExamine);
      }
 }
 
Print each element in order
 public static void printEachForward(BreakIterator boundary, String source) {
     int start = boundary.first();
     for (int end = boundary.next();
          end != BreakIterator.DONE;
          start = end, end = boundary.next()) {
          System.out.println(source.substring(start,end));
     }
 }
 
Print each element in reverse order
 public static void printEachBackward(BreakIterator boundary, String source) {
     int end = boundary.last();
     for (int start = boundary.previous();
          start != BreakIterator.DONE;
          end = start, start = boundary.previous()) {
         System.out.println(source.substring(start,end));
     }
 }
 
Print first element
 public static void printFirst(BreakIterator boundary, String source) {
     int start = boundary.first();
     int end = boundary.next();
     System.out.println(source.substring(start,end));
 }
 
Print last element
 public static void printLast(BreakIterator boundary, String source) {
     int end = boundary.last();
     int start = boundary.previous();
     System.out.println(source.substring(start,end));
 }
 
Print the element at a specified position
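 public static void printAt(BreakIterator boundary, int pos, String source) {
     int end = boundary.following(pos);
     if (end == BreakIterator.DONE) {
         // pos is the past-the-end position, so no element follows it
         return;
     }
     int start = boundary.previous();
     System.out.println(source.substring(start,end));
 }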
 
Find the next word
 public static int nextWordStartAfter(int pos, String text) {
     BreakIterator wb = BreakIterator.getWordInstance();
     wb.setText(text);
     int last = wb.following(pos);
     int current = wb.next();
     while (current != BreakIterator.DONE) {
         for (int p = last; p < current; p++) {
             if (Character.isLetter(text.charAt(p)))
                 return last;
         }
         last = current;
         current = wb.next();
     }
     return BreakIterator.DONE;
 }
 
(The iterator returned by BreakIterator.getWordInstance() is unique in that the break positions it returns don't represent both the start and end of the thing being iterated over. That is, a sentence-break iterator returns breaks that each represent the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be the punctuation or whitespace between two words. The above code uses a simple heuristic to determine which boundary is the beginning of a word: If the characters between this boundary and the next boundary include at least one letter (this can be an alphabetical letter, a CJK ideograph, a Hangul syllable, a Kana character, etc.), then the text between this boundary and the next is a word; otherwise, it's the material between words.)
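The letter heuristic described in the note above can also be used to collect every word in a string. The sketch below is an illustration under assumptions: it uses java.text.BreakIterator as a stand-in so that it runs without ICU4J (swap the import for com.ibm.icu.text.BreakIterator in an ICU project), and WordLister and words are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class WordLister {
    // Returns the segments between word boundaries that contain at least one letter.
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            // Segments without a letter are punctuation or whitespace between words.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));
    }
}
```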

See Also:
CharacterIterator
Status:
Stable ICU 2.0.

Field Summary
static int DONE
          DONE is returned by previous() and next() after all valid boundaries have been returned.
static int KIND_CHARACTER
          [icu]
static int KIND_LINE
          [icu]
static int KIND_SENTENCE
          [icu]
static int KIND_TITLE
          [icu]
static int KIND_WORD
          [icu]
 
Constructor Summary
protected BreakIterator()
          Default constructor.
 
Method Summary
 Object clone()
          Clone method.
abstract  int current()
          Return the iterator's current position.
abstract  int first()
          Return the first boundary position.
abstract  int following(int offset)
          Sets the iterator's current iteration position to be the first boundary position following the specified position.
static Locale[] getAvailableLocales()
          Returns a list of locales for which BreakIterators can be used.
static ULocale[] getAvailableULocales()
          [icu] Returns a list of locales for which BreakIterators can be used.
static BreakIterator getBreakInstance(ULocale where, int kind)
          Deprecated. This API is ICU internal only.
static BreakIterator getCharacterInstance()
          Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator getCharacterInstance(Locale where)
          Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator getCharacterInstance(ULocale where)
          [icu] Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator getLineInstance()
          Returns a new instance of BreakIterator that locates legal line-wrapping positions.
static BreakIterator getLineInstance(Locale where)
          Returns a new instance of BreakIterator that locates legal line-wrapping positions.
static BreakIterator getLineInstance(ULocale where)
          [icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions.
 ULocale getLocale(ULocale.Type type)
          [icu] Returns the locale that was used to create this object, or null.
static BreakIterator getSentenceInstance()
          Returns a new instance of BreakIterator that locates sentence boundaries.
static BreakIterator getSentenceInstance(Locale where)
          Returns a new instance of BreakIterator that locates sentence boundaries.
static BreakIterator getSentenceInstance(ULocale where)
          [icu] Returns a new instance of BreakIterator that locates sentence boundaries.
abstract  CharacterIterator getText()
          Returns a CharacterIterator over the text being analyzed.
static BreakIterator getTitleInstance()
          [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator getTitleInstance(Locale where)
          [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator getTitleInstance(ULocale where)
          [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator getWordInstance()
          Returns a new instance of BreakIterator that locates word boundaries.
static BreakIterator getWordInstance(Locale where)
          Returns a new instance of BreakIterator that locates word boundaries.
static BreakIterator getWordInstance(ULocale where)
          [icu] Returns a new instance of BreakIterator that locates word boundaries.
 boolean isBoundary(int offset)
          Return true if the specified position is a boundary position.
abstract  int last()
          Return the last boundary position.
abstract  int next()
          Advances the iterator forward one boundary.
abstract  int next(int n)
          Advances the specified number of steps forward in the text (a negative number, therefore, advances backwards).
 int preceding(int offset)
          Sets the iterator's current iteration position to be the last boundary position preceding the specified position.
abstract  int previous()
          Advances the iterator backward one boundary.
static Object registerInstance(BreakIterator iter, Locale locale, int kind)
          [icu] Registers a new break iterator of the indicated kind, to use in the given locale.
static Object registerInstance(BreakIterator iter, ULocale locale, int kind)
          [icu] Registers a new break iterator of the indicated kind, to use in the given locale.
abstract  void setText(CharacterIterator newText)
          Sets the iterator to analyze a new piece of text.
 void setText(String newText)
          Sets the iterator to analyze a new piece of text.
static boolean unregister(Object key)
          [icu] Unregisters a previously-registered BreakIterator using the key returned from the register call.
 
Methods inherited from class java.lang.Object
equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DONE

public static final int DONE
DONE is returned by previous() and next() after all valid boundaries have been returned.

See Also:
Constant Field Values
Status:
Stable ICU 2.0.

KIND_CHARACTER

public static final int KIND_CHARACTER
[icu]

See Also:
Constant Field Values
Status:
Stable ICU 2.4.

KIND_WORD

public static final int KIND_WORD
[icu]

See Also:
Constant Field Values
Status:
Stable ICU 2.4.

KIND_LINE

public static final int KIND_LINE
[icu]

See Also:
Constant Field Values
Status:
Stable ICU 2.4.

KIND_SENTENCE

public static final int KIND_SENTENCE
[icu]

See Also:
Constant Field Values
Status:
Stable ICU 2.4.

KIND_TITLE

public static final int KIND_TITLE
[icu]

See Also:
Constant Field Values
Status:
Stable ICU 2.4.
Constructor Detail

BreakIterator

protected BreakIterator()
Default constructor. There is no state that is carried by this abstract base class.

Status:
Stable ICU 2.0.
Method Detail

clone

public Object clone()
Clone method. Creates another BreakIterator with the same behavior and current state as this one.

Overrides:
clone in class Object
Returns:
The clone.
Status:
Stable ICU 2.0.
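As a small illustration of "same behavior and current state", the sketch below clones an iterator mid-iteration and advances only the clone. It is hedged: java.text.BreakIterator is used as a stand-in so the code runs without ICU4J (the assumption is that com.ibm.icu.text.BreakIterator clones the same way), and CloneSketch and positions are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class CloneSketch {
    // Returns { original position, clone's starting position, clone's position after next() }.
    static int[] positions(String text) {
        BreakIterator original = BreakIterator.getWordInstance(Locale.US);
        original.setText(text);
        original.first();
        original.next(); // move the original to its second boundary

        BreakIterator copy = (BreakIterator) original.clone();
        int copiedStart = copy.current(); // the clone starts where the original is
        copy.next();                      // advancing the clone leaves the original alone

        return new int[] { original.current(), copiedStart, copy.current() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(positions("one two")));
    }
}
```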

first

public abstract int first()
Return the first boundary position. This is always the beginning index of the text this iterator iterates over. For example, if the iterator iterates over a whole string, this function will always return 0. This function also updates the iteration position to point to the beginning of the text.

Returns:
The character offset of the beginning of the stretch of text being broken.
Status:
Stable ICU 2.0.

last

public abstract int last()
Return the last boundary position. This is always the "past-the-end" index of the text this iterator iterates over. For example, if the iterator iterates over a whole string (call it "text"), this function will always return text.length(). This function also updates the iteration position to point to the end of the text.

Returns:
The character offset of the end of the stretch of text being broken.
Status:
Stable ICU 2.0.

next

public abstract int next(int n)
Advances the specified number of steps forward in the text (a negative number, therefore, advances backwards). If this causes the iterator to advance off either end of the text, this function returns DONE; otherwise, this function returns the position of the appropriate boundary. Calling this function is equivalent to calling next() or previous() n times.

Parameters:
n - The number of boundaries to advance over (if positive, moves forward; if negative, moves backwards).
Returns:
The position of the boundary n boundaries from the current iteration position, or DONE if moving n boundaries causes the iterator to advance off either end of the text.
Status:
Stable ICU 2.0.
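A short sketch of stepping by more than one boundary at a time, with positive and negative counts. It is illustrative only: java.text.BreakIterator stands in for the ICU class so the code runs without ICU4J (an assumption about equivalent behavior), and NextNSketch and hop are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class NextNSketch {
    // Returns { result of next(2), result of next(-1), current() afterwards }.
    static int[] hop(String text) {
        BreakIterator chars = BreakIterator.getCharacterInstance(Locale.US);
        chars.setText(text);
        chars.first();
        int forwardTwo = chars.next(2); // same as calling next() twice
        int backOne = chars.next(-1);   // same as calling previous() once
        return new int[] { forwardTwo, backOne, chars.current() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(hop("abcd")));
    }
}
```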

next

public abstract int next()
Advances the iterator forward one boundary. The current iteration position is updated to point to the next boundary position after the current position, and this is also the value that is returned. If the current position is equal to the value returned by last(), or to DONE, this function returns DONE and sets the current position to DONE.

Returns:
The position of the first boundary position following the iteration position.
Status:
Stable ICU 2.0.

previous

public abstract int previous()
Advances the iterator backward one boundary. The current iteration position is updated to point to the last boundary position before the current position, and this is also the value that is returned. If the current position is equal to the value returned by first(), or to DONE, this function returns DONE and sets the current position to DONE.

Returns:
The position of the last boundary position preceding the iteration position.
Status:
Stable ICU 2.0.
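The two loops below show the usual way next() and previous() are paired with first(), last(), and DONE to visit every boundary in either direction. This is a sketch under assumptions: java.text.BreakIterator is a stand-in so the code runs without ICU4J, and TraversalSketch, forward, and backward are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class TraversalSketch {
    // Collects boundaries from the start of the text to the end.
    static List<Integer> forward(String text) {
        BreakIterator bi = BreakIterator.getCharacterInstance(Locale.US);
        bi.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = bi.first(); b != BreakIterator.DONE; b = bi.next()) {
            out.add(b);
        }
        return out;
    }

    // Collects boundaries from the end of the text back to the start.
    static List<Integer> backward(String text) {
        BreakIterator bi = BreakIterator.getCharacterInstance(Locale.US);
        bi.setText(text);
        List<Integer> out = new ArrayList<>();
        for (int b = bi.last(); b != BreakIterator.DONE; b = bi.previous()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(forward("ab") + " " + backward("ab"));
    }
}
```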

following

public abstract int following(int offset)
Sets the iterator's current iteration position to be the first boundary position following the specified position. (Whether the specified position is itself a boundary position or not doesn't matter; this function always moves the iteration position to the first boundary after the specified position.) If the specified position is the past-the-end position, returns DONE.

Parameters:
offset - The character position to start searching from.
Returns:
The position of the first boundary position following "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the past-the-end offset.
Status:
Stable ICU 2.0.
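The three cases described for following() (an offset inside a segment, an offset that is itself a boundary, and the past-the-end offset) can be probed with a sketch like this. It assumes java.text.BreakIterator as a stand-in so the code runs without ICU4J; FollowingSketch and probe are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class FollowingSketch {
    // For "one two" the word boundaries are 0, 3, 4 and 7.
    static int[] probe(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        int fromStart = words.following(0);           // first boundary after offset 0
        int fromBoundary = words.following(3);        // 3 is a boundary, result is still strictly after it
        int fromEnd = words.following(text.length()); // past-the-end offset
        return new int[] { fromStart, fromBoundary, fromEnd };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("one two")));
    }
}
```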

preceding

public int preceding(int offset)
Sets the iterator's current iteration position to be the last boundary position preceding the specified position. (Whether the specified position is itself a boundary position or not doesn't matter; this function always moves the iteration position to the last boundary before the specified position.) If the specified position is the starting position, returns DONE.

Parameters:
offset - The character position to start searching from.
Returns:
The position of the last boundary position preceding "offset" (whether or not "offset" itself is a boundary position), or DONE if "offset" is the starting offset of the iterator.
Status:
Stable ICU 2.0.
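A mirror-image sketch for preceding(), covering the end offset, an offset that is itself a boundary, and the starting offset. As an assumption, java.text.BreakIterator stands in for the ICU class so the code runs without ICU4J; PrecedingSketch and probe are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class PrecedingSketch {
    // For "one two" the word boundaries are 0, 3, 4 and 7.
    static int[] probe(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        int fromEnd = words.preceding(text.length()); // last boundary before the end
        int fromBoundary = words.preceding(4);        // 4 is a boundary, result is still strictly before it
        int fromStart = words.preceding(0);           // starting offset
        return new int[] { fromEnd, fromBoundary, fromStart };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("one two")));
    }
}
```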

isBoundary

public boolean isBoundary(int offset)
Return true if the specified position is a boundary position. If the function returns true, the current iteration position is set to the specified position; if the function returns false, the current iteration position is set as though following() had been called.

Parameters:
offset - the offset to check.
Returns:
True if "offset" is a boundary position.
Status:
Stable ICU 2.0.
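One common use of isBoundary() is snapping an arbitrary offset (for example a caret position) forward to a boundary. The sketch below calls following() explicitly instead of relying on the side effect described above, which keeps the intent visible. It assumes java.text.BreakIterator as a stand-in so the code runs without ICU4J; SnapSketch and snapForward are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class SnapSketch {
    // Returns offset if it is a word boundary, otherwise the next boundary after it.
    static int snapForward(String text, int offset) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text);
        if (words.isBoundary(offset)) {
            return offset;
        }
        return words.following(offset);
    }

    public static void main(String[] args) {
        System.out.println(snapForward("one two", 1));
        System.out.println(snapForward("one two", 4));
    }
}
```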

current

public abstract int current()
Return the iterator's current position.

Returns:
The iterator's current position.
Status:
Stable ICU 2.0.

getText

public abstract CharacterIterator getText()
Returns a CharacterIterator over the text being analyzed. For at least some subclasses of BreakIterator, this is a reference to the actual iterator being used by the BreakIterator, and therefore, this function's return value should be treated as read-only. No guarantees are made about the current position of this iterator when it is returned. If you need to move that position to examine the text, clone this function's return value first.

Returns:
A CharacterIterator over the text being analyzed.
Status:
Stable ICU 2.0.
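The "clone before moving" advice can be sketched as follows: the text is read back through a clone, so the break iterator's own position is not disturbed. This is an illustration under assumptions, using java.text.BreakIterator as a stand-in so the code runs without ICU4J; GetTextSketch and textOf are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.text.CharacterIterator;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class GetTextSketch {
    // Reads the analyzed text back through a clone of the returned iterator.
    static String textOf(BreakIterator bi) {
        CharacterIterator copy = (CharacterIterator) bi.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("one two");
        words.first();
        int position = words.next();
        System.out.println(textOf(words));
        // The break iterator's position is unchanged because only the clone moved.
        System.out.println(words.current() == position);
    }
}
```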

setText

public void setText(String newText)
Sets the iterator to analyze a new piece of text. The new piece of text is passed in as a String, and the current iteration position is reset to the beginning of the string. (The old text is dropped.)

Parameters:
newText - A String containing the text to analyze with this BreakIterator.
Status:
Stable ICU 2.0.

setText

public abstract void setText(CharacterIterator newText)
Sets the iterator to analyze a new piece of text. The BreakIterator is passed a CharacterIterator through which it will access the text itself. The current iteration position is reset to the CharacterIterator's start index. (The old iterator is dropped.)

Parameters:
newText - A CharacterIterator referring to the text to analyze with this BreakIterator (the iterator's current position is ignored, but its other state is significant).
Status:
Stable ICU 2.0.
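Because the iteration position is reset to the CharacterIterator's start index, a CharacterIterator that covers only part of a string restricts analysis to that range while boundaries are still reported as offsets into the whole string. The sketch below assumes java.text.BreakIterator as a stand-in so the code runs without ICU4J; RangeSketch and boundariesInRange are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class RangeSketch {
    // Lists the word boundaries inside text[begin, end).
    static List<Integer> boundariesInRange(String text, int begin, int end) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        // The iterator's begin and end indices define the stretch of text to analyze.
        words.setText(new StringCharacterIterator(text, begin, end, begin));
        List<Integer> out = new ArrayList<>();
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            out.add(b);
        }
        return out;
    }

    public static void main(String[] args) {
        // Analyze only "one two", which occupies offsets 5 to 12 of the string.
        System.out.println(boundariesInRange("skip one two", 5, 12));
    }
}
```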

getWordInstance

public static BreakIterator getWordInstance()
Returns a new instance of BreakIterator that locates word boundaries. This function assumes that the text being analyzed is in the default locale's language.

Returns:
An instance of BreakIterator that locates word boundaries.
Status:
Stable ICU 2.0.

getWordInstance

public static BreakIterator getWordInstance(Locale where)
Returns a new instance of BreakIterator that locates word boundaries.

Parameters:
where - A locale specifying the language of the text to be analyzed.
Returns:
An instance of BreakIterator that locates word boundaries.
Status:
Stable ICU 2.0.

getWordInstance

public static BreakIterator getWordInstance(ULocale where)
[icu] Returns a new instance of BreakIterator that locates word boundaries.

Parameters:
where - A locale specifying the language of the text to be analyzed.
Returns:
An instance of BreakIterator that locates word boundaries.
Status:
Stable ICU 3.2.

getLineInstance

public static BreakIterator getLineInstance()
Returns a new instance of BreakIterator that locates legal line-wrapping positions. This function assumes the text being broken is in the default locale's language.

Returns:
A new instance of BreakIterator that locates legal line-wrapping positions.
Status:
Stable ICU 2.0.

getLineInstance

public static BreakIterator getLineInstance(Locale where)
Returns a new instance of BreakIterator that locates legal line-wrapping positions.

Parameters:
where - A Locale specifying the language of the text being broken.
Returns:
A new instance of BreakIterator that locates legal line-wrapping positions.
Status:
Stable ICU 2.0.
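A line-break iterator only reports where wrapping is legal; choosing which legal position to use is up to the caller. The sketch below applies a simple greedy policy that is an assumption of this example rather than anything the class prescribes: width is measured in chars, and trailing whitespace counts toward it. It uses java.text.BreakIterator as a stand-in so the code runs without ICU4J; WrapSketch and wrap are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class WrapSketch {
    // Greedy wrap: break at the last legal position that keeps the line within width.
    static List<String> wrap(String text, int width) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.US);
        lines.setText(text); // one iterator for the whole text
        List<String> out = new ArrayList<>();
        int lineStart = lines.first();
        int lastBreak = lineStart;
        for (int b = lines.next(); b != BreakIterator.DONE; b = lines.next()) {
            // If reaching b would overflow, close the line at the previous legal break.
            if (b - lineStart > width && lastBreak > lineStart) {
                out.add(text.substring(lineStart, lastBreak).trim());
                lineStart = lastBreak;
            }
            lastBreak = b;
        }
        if (lineStart < text.length()) {
            out.add(text.substring(lineStart).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        wrap("the quick brown fox", 10).forEach(System.out::println);
    }
}
```

A segment longer than width is left on its own line rather than split, since the iterator offers no legal break inside it.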

getLineInstance

public static BreakIterator getLineInstance(ULocale where)
[icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions.

Parameters:
where - A ULocale specifying the language of the text being broken.
Returns:
A new instance of BreakIterator that locates legal line-wrapping positions.
Status:
Stable ICU 3.2.

getCharacterInstance

public static BreakIterator getCharacterInstance()
Returns a new instance of BreakIterator that locates logical-character boundaries. This function assumes that the text being analyzed is in the default locale's language.

Returns:
A new instance of BreakIterator that locates logical-character boundaries.
Status:
Stable ICU 2.0.

getCharacterInstance

public static BreakIterator getCharacterInstance(Locale where)
Returns a new instance of BreakIterator that locates logical-character boundaries.

Parameters:
where - A Locale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates logical-character boundaries.
Status:
Stable ICU 2.0.

getCharacterInstance

public static BreakIterator getCharacterInstance(ULocale where)
[icu] Returns a new instance of BreakIterator that locates logical-character boundaries.

Parameters:
where - A ULocale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates logical-character boundaries.
Status:
Stable ICU 3.2.

getSentenceInstance

public static BreakIterator getSentenceInstance()
Returns a new instance of BreakIterator that locates sentence boundaries. This function assumes the text being analyzed is in the default locale's language.

Returns:
A new instance of BreakIterator that locates sentence boundaries.
Status:
Stable ICU 2.0.

getSentenceInstance

public static BreakIterator getSentenceInstance(Locale where)
Returns a new instance of BreakIterator that locates sentence boundaries.

Parameters:
where - A Locale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates sentence boundaries.
Status:
Stable ICU 2.0.

getSentenceInstance

public static BreakIterator getSentenceInstance(ULocale where)
[icu] Returns a new instance of BreakIterator that locates sentence boundaries.

Parameters:
where - A ULocale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates sentence boundaries.
Status:
Stable ICU 3.2.

getTitleInstance

public static BreakIterator getTitleInstance()
[icu] Returns a new instance of BreakIterator that locates title boundaries. This function assumes the text being analyzed is in the default locale's language. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word-boundary iterator instead; see getWordInstance().

Returns:
A new instance of BreakIterator that locates title boundaries.
Status:
Stable ICU 2.0.

getTitleInstance

public static BreakIterator getTitleInstance(Locale where)
[icu] Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word-boundary iterator instead; see getWordInstance().

Parameters:
where - A Locale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates title boundaries.
Status:
Stable ICU 2.0.

getTitleInstance

public static BreakIterator getTitleInstance(ULocale where)
[icu] Returns a new instance of BreakIterator that locates title boundaries. The iterator returned locates title boundaries as described for Unicode 3.2 only. For title boundary iteration under Unicode 4.0 and above, use a word-boundary iterator instead; see getWordInstance().

Parameters:
where - A ULocale specifying the language of the text being analyzed.
Returns:
A new instance of BreakIterator that locates title boundaries.
Status:
Stable ICU 3.2.

registerInstance

public static Object registerInstance(BreakIterator iter,
                                      Locale locale,
                                      int kind)
[icu] Registers a new break iterator of the indicated kind, to use in the given locale. Clones of the iterator will be returned if a request for a break iterator of the given kind matches or falls back to this locale.

Parameters:
iter - the BreakIterator instance to adopt.
locale - the Locale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
Returns:
a registry key that can be used to unregister this instance
Status:
Stable ICU 2.4.

registerInstance

public static Object registerInstance(BreakIterator iter,
                                      ULocale locale,
                                      int kind)
[icu] Registers a new break iterator of the indicated kind, to use in the given locale. Clones of the iterator will be returned if a request for a break iterator of the given kind matches or falls back to this locale.

Parameters:
iter - the BreakIterator instance to adopt.
locale - the ULocale for which this instance is to be registered
kind - the type of iterator for which this instance is to be registered
Returns:
a registry key that can be used to unregister this instance
Status:
Stable ICU 3.2.

unregister

public static boolean unregister(Object key)
[icu] Unregisters a previously-registered BreakIterator using the key returned from the register call. Key becomes invalid after this call and should not be used again.

Parameters:
key - the registry key returned by a previous call to registerInstance
Returns:
true if the iterator for the key was successfully unregistered
Status:
Stable ICU 2.4.

getBreakInstance

public static BreakIterator getBreakInstance(ULocale where,
                                             int kind)
Deprecated. This API is ICU internal only.

Returns a particular kind of BreakIterator for a locale. Avoids writing a switch statement with getXYZInstance(where) calls.

Status:
Internal. This API is ICU internal only.

getAvailableLocales

public static Locale[] getAvailableLocales()
Returns a list of locales for which BreakIterators can be used.

Returns:
An array of Locales. All of the locales in the array can be used when creating a BreakIterator.
Status:
Stable ICU 2.6.
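A typical use of this list is checking whether a locale is explicitly covered before requesting an iterator for it. The sketch below is hedged: it calls java.text.BreakIterator.getAvailableLocales() as a stand-in so the code runs without ICU4J (the ICU list may differ in content), and LocaleCheckSketch and isListed are hypothetical names.

```java
import java.text.BreakIterator; // JDK stand-in (assumption); use com.ibm.icu.text.BreakIterator with ICU4J
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of ICU.
final class LocaleCheckSketch {
    // Returns true if the locale appears in the list of available locales.
    static boolean isListed(Locale wanted) {
        return Arrays.asList(BreakIterator.getAvailableLocales()).contains(wanted);
    }

    public static void main(String[] args) {
        System.out.println(isListed(Locale.US));
    }
}
```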

getAvailableULocales

public static ULocale[] getAvailableULocales()
[icu] Returns a list of locales for which BreakIterators can be used.

Returns:
An array of ULocales. All of the locales in the array can be used when creating a BreakIterator.
Status:
Draft ICU 3.2 (retain).

getLocale

public final ULocale getLocale(ULocale.Type type)
[icu] Returns the locale that was used to create this object, or null. This may differ from the locale requested at the time of this object's creation. For example, if an object is created for locale en_US_CALIFORNIA, the actual data may be drawn from en (the actual locale), and en_US may be the most specific locale that exists (the valid locale).

Note: The actual locale is returned correctly, but the valid locale is not, in most cases.

Parameters:
type - type of information requested, either ULocale.VALID_LOCALE or ULocale.ACTUAL_LOCALE.
Returns:
the information specified by type, or null if this object was not constructed from locale data.
See Also:
ULocale, ULocale.VALID_LOCALE, ULocale.ACTUAL_LOCALE
Status:
Draft ICU 2.8 (retain).


Copyright (c) 2011 IBM Corporation and others.