com.ibm.icu.text
Class BreakIterator
java.lang.Object
com.ibm.icu.text.BreakIterator
- All Implemented Interfaces:
- Cloneable
- Direct Known Subclasses:
- RuleBasedBreakIterator
public abstract class BreakIterator
- extends Object
- implements Cloneable
[icu enhancement] ICU's replacement for java.text.BreakIterator. Methods, fields,
and other functionality specific to ICU are labeled '[icu]'.
A class that locates boundaries in text. This class defines a protocol for
objects that break up a piece of natural-language text according to a set
of criteria. Instances or subclasses of BreakIterator can be provided, for
example, to break a piece of text into words, sentences, or logical characters
according to the conventions of some language or group of languages.
We provide five built-in types of BreakIterator:
- getTitleInstance() returns a BreakIterator that locates title boundaries.
- getSentenceInstance() returns a BreakIterator that locates boundaries
between sentences. This is useful for triple-click selection, for example.
- getWordInstance() returns a BreakIterator that locates boundaries between
words. This is useful for double-click selection or "find whole words" searches.
This type of BreakIterator makes sure there is a boundary position at the
beginning and end of each legal word. (Numbers count as words, too.) Whitespace
and punctuation are kept separate from real words.
- getLineInstance() returns a BreakIterator that locates positions where it is
legal for a text editor to wrap lines. This is similar to word breaking, but
not the same: punctuation and whitespace are generally kept with words (you don't
want a line to start with whitespace, for example), and some special characters
can force a position to be considered a line-break position or prevent a position
from being a line-break position.
- getCharacterInstance() returns a BreakIterator that locates boundaries between
logical characters. Because of the structure of the Unicode encoding, a logical
character may be stored internally as more than one Unicode code point. (A with an
umlaut may be stored as an a followed by a separate combining umlaut character,
for example, but the user still thinks of it as one character.) This iterator allows
various processes (especially text editors) to treat as characters the units of text
that a user would think of as characters, rather than the units of text that the
computer sees as "characters".
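To make the differences between the built-in types concrete, the following is a minimal sketch that cuts the same text at every boundary each iterator reports. It imports java.text.BreakIterator so that it runs on a plain JDK; the assumption is that switching the import to com.ibm.icu.text.BreakIterator leaves the calls unchanged, since every method used here is declared on both classes. The names BoundaryKindsDemo and segments are illustrative only, and exact boundaries may differ between implementations and versions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BoundaryKindsDemo {
    /** Splits text at every boundary the iterator reports, in order. */
    static List<String> segments(BreakIterator it, String text) {
        List<String> out = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now.";
        // Word iterator: words, whitespace, and punctuation come back as separate pieces.
        System.out.println(segments(BreakIterator.getWordInstance(Locale.US), text));
        // Sentence iterator: each piece runs from the start of one sentence to the next.
        System.out.println(segments(BreakIterator.getSentenceInstance(Locale.US), text));
        // Character iterator: 'a' plus a combining diaeresis is one logical character.
        System.out.println(segments(BreakIterator.getCharacterInstance(Locale.US), "a\u0308b").size());
    }
}
```

Because the pieces are produced from consecutive boundaries, concatenating them always reproduces the original text, whichever iterator type is used.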
BreakIterator's interface follows an "iterator" model (hence the name), meaning it
has a concept of a "current position" and methods like first(), last(), next(),
and previous() that update the current position. All BreakIterators uphold the
following invariants:
- The beginning and end of the text are always treated as boundary positions.
- The current position of the iterator is always a boundary position (random-
access methods move the iterator to the nearest boundary position before or
after the specified position, not _to_ the specified position).
- DONE is used as a flag to indicate when iteration has stopped. DONE is only
returned when the current position is the end of the text and the user calls next(),
or when the current position is the beginning of the text and the user calls
previous().
- Break positions are numbered by the positions of the characters that follow
them. Thus, under normal circumstances, the position before the first character
is 0, the position after the first character is 1, and the position after the
last character is equal to the length of the string.
- The client can change the position of an iterator, or the text it analyzes,
at will, but cannot change the behavior. A client that wants different behavior
must instantiate a new iterator.
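The invariants above can be probed directly. This is a minimal sketch, assuming java.text.BreakIterator as a JDK-only stand-in for com.ibm.icu.text.BreakIterator (the methods called are declared on both); InvariantsDemo and probe are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

class InvariantsDemo {
    /** Returns {first(), last(), next() at the end, previous() at the start}. */
    static int[] probe(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int first = it.first();          // beginning of the text is always a boundary
        int last = it.last();            // end of the text is always a boundary
        int pastEnd = it.next();         // already at the end, so this is DONE
        it.first();                      // move back to the beginning
        int beforeStart = it.previous(); // already at the beginning, so this is DONE
        return new int[] {first, last, pastEnd, beforeStart};
    }

    public static void main(String[] args) {
        int[] r = probe("one two");
        System.out.println(r[0] + " " + r[1] + " " + r[2] + " " + r[3]);
    }
}
```

For a seven-character string the first two values are 0 and 7, and the last two are both BreakIterator.DONE.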
BreakIterator accesses the text it analyzes through a CharacterIterator, which makes
it possible to use BreakIterator to analyze text in any text-storage vehicle that
provides a CharacterIterator interface.
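As an illustration of analyzing text through a CharacterIterator, the sketch below restricts a word iterator to a sub-range of a string by passing a java.text.StringCharacterIterator. It assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; SubRangeDemo and segmentsInRange are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SubRangeDemo {
    /** Returns the segments found between begin (inclusive) and end (exclusive). */
    static List<String> segmentsInRange(String text, int begin, int end) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        // The CharacterIterator limits analysis to [begin, end) of the backing string.
        words.setText(new StringCharacterIterator(text, begin, end, begin));
        List<String> out = new ArrayList<>();
        int start = words.first(); // the range's begin index, not necessarily 0
        for (int stop = words.next(); stop != BreakIterator.DONE; start = stop, stop = words.next()) {
            out.add(text.substring(start, stop)); // offsets index the backing string
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segmentsInRange("skip this: alpha beta", 11, 21));
    }
}
```

Note that the reported positions are offsets into the backing string, so first() returns the start index of the range rather than 0.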
Note: Some types of BreakIterator can take a long time to create, and
instances of BreakIterator are not currently cached by the system. For
optimal performance, keep instances of BreakIterator around as long as makes
sense. For example, when word-wrapping a document, don't create and destroy a
new BreakIterator for each line. Create one break iterator for the whole document
(or whatever stretch of text you're wrapping) and use it to do the whole job of
wrapping the text.
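The reuse advice above might look like the following in practice: one iterator is created up front and pointed at each new piece of text with setText(). This is a sketch only, assuming java.text.BreakIterator as a JDK-only stand-in for the ICU class; ReuseDemo and sentenceCounts are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ReuseDemo {
    /** Counts the sentences in each paragraph using a single iterator instance. */
    static int[] sentenceCounts(List<String> paragraphs) {
        // Created once, outside the loop, because construction can be expensive.
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.US);
        int[] counts = new int[paragraphs.size()];
        for (int i = 0; i < paragraphs.size(); i++) {
            sentences.setText(paragraphs.get(i)); // retarget instead of re-creating
            sentences.first();
            while (sentences.next() != BreakIterator.DONE) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("First one. Second one.", "", "Only one.");
        System.out.println(Arrays.toString(sentenceCounts(doc)));
    }
}
```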
Examples:
Creating and using text boundaries
public static void main(String[] args) {
    if (args.length == 1) {
        String stringToExamine = args[0];
        // print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(stringToExamine);
        printEachForward(boundary, stringToExamine);
        // print each sentence in reverse order
        boundary = BreakIterator.getSentenceInstance(Locale.US);
        boundary.setText(stringToExamine);
        printEachBackward(boundary, stringToExamine);
        printFirst(boundary, stringToExamine);
        printLast(boundary, stringToExamine);
    }
}
Print each element in order
public static void printEachForward(BreakIterator boundary, String source) {
    int start = boundary.first();
    for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next()) {
        System.out.println(source.substring(start, end));
    }
}
Print each element in reverse order
public static void printEachBackward(BreakIterator boundary, String source) {
    int end = boundary.last();
    for (int start = boundary.previous();
         start != BreakIterator.DONE;
         end = start, start = boundary.previous()) {
        System.out.println(source.substring(start, end));
    }
}
Print first element
public static void printFirst(BreakIterator boundary, String source) {
    int start = boundary.first();
    int end = boundary.next();
    System.out.println(source.substring(start, end));
}
Print last element
public static void printLast(BreakIterator boundary, String source) {
    int end = boundary.last();
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Print the element at a specified position
public static void printAt(BreakIterator boundary, int pos, String source) {
    int end = boundary.following(pos);
    int start = boundary.previous();
    System.out.println(source.substring(start, end));
}
Find the next word
public static int nextWordStartAfter(int pos, String text) {
    BreakIterator wb = BreakIterator.getWordInstance();
    wb.setText(text);
    int last = wb.following(pos);
    int current = wb.next();
    while (current != BreakIterator.DONE) {
        // a segment counts as a word if it contains at least one letter
        for (int p = last; p < current; p++) {
            if (Character.isLetter(text.charAt(p)))
                return last;
        }
        last = current;
        current = wb.next();
    }
    return BreakIterator.DONE;
}
(The iterator returned by BreakIterator.getWordInstance() is unique in that
the break positions it returns don't represent both the start and end of the
thing being iterated over. That is, a sentence-break iterator returns breaks
that each represent the end of one sentence and the beginning of the next.
With the word-break iterator, the characters between two boundaries might be a
word, or they might be the punctuation or whitespace between two words. The
above code uses a simple heuristic to determine which boundary is the beginning
of a word: If the characters between this boundary and the next boundary
include at least one letter (this can be an alphabetical letter, a CJK ideograph,
a Hangul syllable, a Kana character, etc.), then the text between this boundary
and the next is a word; otherwise, it's the material between words.)
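The same letter heuristic can be used to collect every word in a string rather than only the next one. The sketch below assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; WordFilterDemo and words are hypothetical names. Under this heuristic a purely numeric segment such as "42" is skipped, even though the word iterator itself reports boundaries around numbers.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordFilterDemo {
    /** Returns the segments between word boundaries that contain at least one letter. */
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        List<String> out = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            // Heuristic from the text above: a letter anywhere in the segment makes it a word.
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42 times"));
    }
}
```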
- See Also:
CharacterIterator
- Status:
- Stable ICU 2.0.
Constructor Summary
protected | BreakIterator() | Default constructor.
Method Summary
Object | clone() | Clone method.
abstract int | current() | Return the iterator's current position.
abstract int | first() | Return the first boundary position.
abstract int | following(int offset) | Sets the iterator's current iteration position to be the first boundary position following the specified position.
static Locale[] | getAvailableLocales() | Returns a list of locales for which BreakIterators can be used.
static ULocale[] | getAvailableULocales() | [icu] Returns a list of locales for which BreakIterators can be used.
static BreakIterator | getBreakInstance(ULocale where, int kind) | Deprecated. This API is ICU internal only.
static BreakIterator | getCharacterInstance() | Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator | getCharacterInstance(Locale where) | Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator | getCharacterInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates logical-character boundaries.
static BreakIterator | getLineInstance() | Returns a new instance of BreakIterator that locates legal line-wrapping positions.
static BreakIterator | getLineInstance(Locale where) | Returns a new instance of BreakIterator that locates legal line-wrapping positions.
static BreakIterator | getLineInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates legal line-wrapping positions.
ULocale | getLocale(ULocale.Type type) | [icu] Returns the locale that was used to create this object, or null.
static BreakIterator | getSentenceInstance() | Returns a new instance of BreakIterator that locates sentence boundaries.
static BreakIterator | getSentenceInstance(Locale where) | Returns a new instance of BreakIterator that locates sentence boundaries.
static BreakIterator | getSentenceInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates sentence boundaries.
abstract CharacterIterator | getText() | Returns a CharacterIterator over the text being analyzed.
static BreakIterator | getTitleInstance() | [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator | getTitleInstance(Locale where) | [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator | getTitleInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates title boundaries.
static BreakIterator | getWordInstance() | Returns a new instance of BreakIterator that locates word boundaries.
static BreakIterator | getWordInstance(Locale where) | Returns a new instance of BreakIterator that locates word boundaries.
static BreakIterator | getWordInstance(ULocale where) | [icu] Returns a new instance of BreakIterator that locates word boundaries.
boolean | isBoundary(int offset) | Return true if the specified position is a boundary position.
abstract int | last() | Return the last boundary position.
abstract int | next() | Advances the iterator forward one boundary.
abstract int | next(int n) | Advances the specified number of steps forward in the text (a negative number, therefore, advances backwards).
int | preceding(int offset) | Sets the iterator's current iteration position to be the last boundary position preceding the specified position.
abstract int | previous() | Advances the iterator backward one boundary.
static Object | registerInstance(BreakIterator iter, Locale locale, int kind) | [icu] Registers a new break iterator of the indicated kind, to use in the given locale.
static Object | registerInstance(BreakIterator iter, ULocale locale, int kind) | [icu] Registers a new break iterator of the indicated kind, to use in the given locale.
abstract void | setText(CharacterIterator newText) | Sets the iterator to analyze a new piece of text.
void | setText(String newText) | Sets the iterator to analyze a new piece of text.
static boolean | unregister(Object key) | [icu] Unregisters a previously-registered BreakIterator using the key returned from the register call.
DONE
public static final int DONE
- DONE is returned by previous() and next() after all valid
boundaries have been returned.
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.0.
KIND_CHARACTER
public static final int KIND_CHARACTER
- [icu]
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.4.
KIND_WORD
public static final int KIND_WORD
- [icu]
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.4.
KIND_LINE
public static final int KIND_LINE
- [icu]
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.4.
KIND_SENTENCE
public static final int KIND_SENTENCE
- [icu]
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.4.
KIND_TITLE
public static final int KIND_TITLE
- [icu]
- See Also:
- Constant Field Values
- Status:
- Stable ICU 2.4.
BreakIterator
protected BreakIterator()
- Default constructor. There is no state that is carried by this abstract
base class.
- Status:
- Stable ICU 2.0.
clone
public Object clone()
- Clone method. Creates another BreakIterator with the same behavior and
current state as this one.
- Overrides:
clone in class Object
- Returns:
- The clone.
- Status:
- Stable ICU 2.0.
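A short sketch of what "same behavior and current state" means for a clone: the copy starts at the original's position, and moving the copy leaves the original where it was. It assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; CloneDemo and cloneKeepsState are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

class CloneDemo {
    /** Returns {original position after the copy moved, copy's starting position, copy's new position}. */
    static int[] cloneKeepsState(String text) {
        BreakIterator original = BreakIterator.getWordInstance(Locale.US);
        original.setText(text);
        original.first();
        original.next(); // park the original on the second boundary

        BreakIterator copy = (BreakIterator) original.clone();
        int copyStart = copy.current(); // same position as the original
        copy.next();                    // advancing the copy...
        return new int[] {original.current(), copyStart, copy.current()}; // ...does not move the original
    }

    public static void main(String[] args) {
        int[] r = cloneKeepsState("ab cd");
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```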
first
public abstract int first()
- Return the first boundary position. This is always the beginning
index of the text this iterator iterates over. For example, if
the iterator iterates over a whole string, this function will
always return 0. This function also updates the iteration position
to point to the beginning of the text.
- Returns:
- The character offset of the beginning of the stretch of text
being broken.
- Status:
- Stable ICU 2.0.
last
public abstract int last()
- Return the last boundary position. This is always the "past-the-end"
index of the text this iterator iterates over. For example, if the
iterator iterates over a whole string (call it "text"), this function
will always return text.length(). This function also updates the
iteration position to point to the end of the text.
- Returns:
- The character offset of the end of the stretch of text
being broken.
- Status:
- Stable ICU 2.0.
next
public abstract int next(int n)
- Advances the specified number of steps forward in the text (a negative
number, therefore, advances backwards). If this causes the iterator
to advance off either end of the text, this function returns DONE;
otherwise, this function returns the position of the appropriate
boundary. Calling this function is equivalent to calling next() or
previous() n times.
- Parameters:
n
- The number of boundaries to advance over (if positive, moves
forward; if negative, moves backwards).
- Returns:
- The position of the boundary n boundaries from the current
iteration position, or DONE if moving n boundaries causes the iterator
to advance off either end of the text.
- Status:
- Stable ICU 2.0.
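As a small worked example of next(int n), the sketch below moves two boundaries forward, one back, and then tries to run off the end. It assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; NextNDemo and hop are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

class NextNDemo {
    /** Returns the results of next(2), next(-1), and next(10), starting from first(). */
    static int[] hop(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.first();
        int twoForward = it.next(2); // same as calling next() twice
        int oneBack = it.next(-1);   // same as calling previous() once
        int offEnd = it.next(10);    // more steps than there are boundaries left
        return new int[] {twoForward, oneBack, offEnd};
    }

    public static void main(String[] args) {
        // Word boundaries in "ab cd ef" fall at 0, 2, 3, 5, 6, and 8.
        int[] r = hop("ab cd ef");
        System.out.println(r[0] + " " + r[1] + " " + (r[2] == BreakIterator.DONE));
    }
}
```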
next
public abstract int next()
- Advances the iterator forward one boundary. The current iteration
position is updated to point to the next boundary position after the
current position, and this is also the value that is returned. If
the current position is equal to the value returned by last(), or to
DONE, this function returns DONE and sets the current position to
DONE.
- Returns:
- The position of the first boundary position following the
iteration position.
- Status:
- Stable ICU 2.0.
previous
public abstract int previous()
- Advances the iterator backward one boundary. The current iteration
position is updated to point to the last boundary position before
the current position, and this is also the value that is returned. If
the current position is equal to the value returned by first(), or to
DONE, this function returns DONE and sets the current position to
DONE.
- Returns:
- The position of the last boundary position preceding the
iteration position.
- Status:
- Stable ICU 2.0.
following
public abstract int following(int offset)
- Sets the iterator's current iteration position to be the first
boundary position following the specified position. (Whether the
specified position is itself a boundary position or not doesn't
matter-- this function always moves the iteration position to the
first boundary after the specified position.) If the specified
position is the past-the-end position, returns DONE.
- Parameters:
offset
- The character position to start searching from.
- Returns:
- The position of the first boundary position following
"offset" (whether or not "offset" itself is a boundary position),
or DONE if "offset" is the past-the-end offset.
- Status:
- Stable ICU 2.0.
preceding
public int preceding(int offset)
- Sets the iterator's current iteration position to be the last
boundary position preceding the specified position. (Whether the
specified position is itself a boundary position or not doesn't
matter-- this function always moves the iteration position to the
last boundary before the specified position.) If the specified
position is the starting position, returns DONE.
- Parameters:
offset
- The character position to start searching from.
- Returns:
- The position of the last boundary position preceding
"offset" (whether or not "offset" itself is a boundary position),
or DONE if "offset" is the starting offset of the iterator.
- Status:
- Stable ICU 2.0.
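The point that following() and preceding() never return the offset they are given, even when that offset is itself a boundary, is easiest to see side by side. This is a sketch assuming java.text.BreakIterator as a JDK-only stand-in for the ICU class; NeighborsDemo and around are hypothetical names, and the helper only accepts offsets strictly inside the text so that neither call returns DONE.

```java
import java.text.BreakIterator;
import java.util.Locale;

class NeighborsDemo {
    /** Returns {preceding(offset), following(offset)} for an offset strictly inside the text. */
    static int[] around(String text, int offset) {
        if (offset <= 0 || offset >= text.length()) {
            throw new IllegalArgumentException("offset must be strictly inside the text");
        }
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int before = it.preceding(offset); // last boundary strictly before offset
        int after = it.following(offset);  // first boundary strictly after offset
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        // Word boundaries in "ab cd ef" fall at 0, 2, 3, 5, 6, and 8.
        int[] onBoundary = around("ab cd ef", 3);  // 3 is a boundary
        int[] insideWord = around("ab cd ef", 4);  // 4 is inside "cd"
        System.out.println(onBoundary[0] + ".." + onBoundary[1]);
        System.out.println(insideWord[0] + ".." + insideWord[1]);
    }
}
```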
isBoundary
public boolean isBoundary(int offset)
- Return true if the specified position is a boundary position. If the
function returns true, the current iteration position is set to the
specified position; if the function returns false, the current
iteration position is set as though following() had been called.
- Parameters:
offset
- the offset to check.
- Returns:
- True if "offset" is a boundary position.
- Status:
- Stable ICU 2.0.
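One possible use of isBoundary() and its side effect on the iteration position is snapping an arbitrary offset (for example, a caret position) forward to the nearest logical-character boundary. This is a sketch assuming java.text.BreakIterator as a JDK-only stand-in for the ICU class; SnapDemo and snapForward are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.Locale;

class SnapDemo {
    /** Returns offset if it is a character boundary, otherwise the next boundary after it. */
    static int snapForward(String text, int offset) {
        BreakIterator chars = BreakIterator.getCharacterInstance(Locale.US);
        chars.setText(text);
        if (chars.isBoundary(offset)) {
            return offset;
        }
        // After a false result the iterator sits where following(offset) would have put it.
        return chars.current();
    }

    public static void main(String[] args) {
        String text = "a\u0308b"; // 'a' + combining diaeresis, then 'b'
        System.out.println(snapForward(text, 1)); // offset 1 splits the first logical character
    }
}
```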
current
public abstract int current()
- Return the iterator's current position.
- Returns:
- The iterator's current position.
- Status:
- Stable ICU 2.0.
getText
public abstract CharacterIterator getText()
- Returns a CharacterIterator over the text being analyzed.
For at least some subclasses of BreakIterator, this is a reference
to the actual iterator being used by the BreakIterator,
and therefore, this function's return value should be treated as
const. No guarantees are made about the current position
of this iterator when it is returned. If you need to move that
position to examine the text, clone this function's return value first.
- Returns:
- A CharacterIterator over the text being analyzed.
- Status:
- Stable ICU 2.0.
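The advice to clone the return value before moving it could be followed like this. The sketch reads the analyzed text back out through a cloned CharacterIterator and leaves the BreakIterator's own position alone. It assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; GetTextDemo and analyzedText are hypothetical names.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

class GetTextDemo {
    /** Copies out the text under analysis without disturbing the BreakIterator. */
    static String analyzedText(BreakIterator it) {
        // Clone first: the returned iterator may be the one the BreakIterator uses internally.
        CharacterIterator view = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (int i = view.getBeginIndex(); i < view.getEndIndex(); i++) {
            sb.append(view.setIndex(i)); // setIndex returns the character at that index
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText("ab cd");
        it.first();
        it.next();
        System.out.println(analyzedText(it) + " / position still " + it.current());
    }
}
```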
setText
public void setText(String newText)
- Sets the iterator to analyze a new piece of text. The new
piece of text is passed in as a String, and the current
iteration position is reset to the beginning of the string.
(The old text is dropped.)
- Parameters:
newText
- A String containing the text to analyze with
this BreakIterator.
- Status:
- Stable ICU 2.0.
setText
public abstract void setText(CharacterIterator newText)
- Sets the iterator to analyze a new piece of text. The
BreakIterator is passed a CharacterIterator through which
it will access the text itself. The current iteration
position is reset to the CharacterIterator's start index.
(The old iterator is dropped.)
- Parameters:
newText
- A CharacterIterator referring to the text
to analyze with this BreakIterator (the iterator's current
position is ignored, but its other state is significant).
- Status:
- Stable ICU 2.0.
getWordInstance
public static BreakIterator getWordInstance()
- Returns a new instance of BreakIterator that locates word boundaries.
This function assumes that the text being analyzed is in the default
locale's language.
- Returns:
- An instance of BreakIterator that locates word boundaries.
- Status:
- Stable ICU 2.0.
getWordInstance
public static BreakIterator getWordInstance(Locale where)
- Returns a new instance of BreakIterator that locates word boundaries.
- Parameters:
where
- A locale specifying the language of the text to be
analyzed.
- Returns:
- An instance of BreakIterator that locates word boundaries.
- Status:
- Stable ICU 2.0.
getWordInstance
public static BreakIterator getWordInstance(ULocale where)
- [icu] Returns a new instance of BreakIterator that locates word boundaries.
- Parameters:
where
- A locale specifying the language of the text to be
analyzed.
- Returns:
- An instance of BreakIterator that locates word boundaries.
- Status:
- Stable ICU 3.2.
getLineInstance
public static BreakIterator getLineInstance()
- Returns a new instance of BreakIterator that locates legal line-
wrapping positions. This function assumes the text being broken
is in the default locale's language.
- Returns:
- A new instance of BreakIterator that locates legal
line-wrapping positions.
- Status:
- Stable ICU 2.0.
getLineInstance
public static BreakIterator getLineInstance(Locale where)
- Returns a new instance of BreakIterator that locates legal line-
wrapping positions.
- Parameters:
where
- A Locale specifying the language of the text being broken.
- Returns:
- A new instance of BreakIterator that locates legal
line-wrapping positions.
- Status:
- Stable ICU 2.0.
getLineInstance
public static BreakIterator getLineInstance(ULocale where)
- [icu] Returns a new instance of BreakIterator that locates legal line-
wrapping positions.
- Parameters:
where
- A ULocale specifying the language of the text being broken.
- Returns:
- A new instance of BreakIterator that locates legal
line-wrapping positions.
- Status:
- Stable ICU 3.2.
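To show how legal line-wrapping positions might be consumed, here is a minimal greedy word-wrap sketch. It measures width in UTF-16 code units rather than rendered width, ignores mandatory breaks such as newlines, and lets an overlong word overflow its line; all of these are simplifications. It assumes java.text.BreakIterator as a JDK-only stand-in for the ICU class; WrapDemo, wrap, and rtrim are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WrapDemo {
    /** Greedily fills lines up to width, breaking only at positions the line iterator allows. */
    static List<String> wrap(String text, int width) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.US); // one iterator for the whole text
        lines.setText(text);
        List<String> out = new ArrayList<>();
        int lineStart = lines.first();
        int lastBreak = lineStart;
        for (int brk = lines.next(); brk != BreakIterator.DONE; brk = lines.next()) {
            // Trailing whitespace does not count against the width.
            boolean tooLong = rtrim(text.substring(lineStart, brk)).length() > width;
            if (tooLong && lastBreak > lineStart) {
                out.add(rtrim(text.substring(lineStart, lastBreak)));
                lineStart = lastBreak;
            }
            lastBreak = brk;
        }
        if (lineStart < text.length()) {
            out.add(rtrim(text.substring(lineStart)));
        }
        return out;
    }

    private static String rtrim(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        for (String line : wrap("The quick brown fox", 10)) {
            System.out.println(line);
        }
    }
}
```

Because whitespace is kept with the preceding word by the line iterator, each new line starts on a word rather than on a space.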
getCharacterInstance
public static BreakIterator getCharacterInstance()
- Returns a new instance of BreakIterator that locates logical-character
boundaries. This function assumes that the text being analyzed is
in the default locale's language.
- Returns:
- A new instance of BreakIterator that locates logical-character
boundaries.
- Status:
- Stable ICU 2.0.
getCharacterInstance
public static BreakIterator getCharacterInstance(Locale where)
- Returns a new instance of BreakIterator that locates logical-character
boundaries.
- Parameters:
where
- A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates logical-character
boundaries.
- Status:
- Stable ICU 2.0.
getCharacterInstance
public static BreakIterator getCharacterInstance(ULocale where)
- [icu] Returns a new instance of BreakIterator that locates logical-character
boundaries.
- Parameters:
where
- A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates logical-character
boundaries.
- Status:
- Stable ICU 3.2.
getSentenceInstance
public static BreakIterator getSentenceInstance()
- Returns a new instance of BreakIterator that locates sentence boundaries.
This function assumes the text being analyzed is in the default locale's
language.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
- Status:
- Stable ICU 2.0.
getSentenceInstance
public static BreakIterator getSentenceInstance(Locale where)
- Returns a new instance of BreakIterator that locates sentence boundaries.
- Parameters:
where
- A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
- Status:
- Stable ICU 2.0.
getSentenceInstance
public static BreakIterator getSentenceInstance(ULocale where)
- [icu] Returns a new instance of BreakIterator that locates sentence boundaries.
- Parameters:
where
- A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates sentence boundaries.
- Status:
- Stable ICU 3.2.
getTitleInstance
public static BreakIterator getTitleInstance()
- [icu] Returns a new instance of BreakIterator that locates title boundaries.
This function assumes the text being analyzed is in the default locale's
language. The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator; see getWordInstance().
- Returns:
- A new instance of BreakIterator that locates title boundaries.
- Status:
- Stable ICU 2.0.
getTitleInstance
public static BreakIterator getTitleInstance(Locale where)
- [icu] Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator; see getWordInstance().
- Parameters:
where
- A Locale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates title boundaries.
- Status:
- Stable ICU 2.0.
getTitleInstance
public static BreakIterator getTitleInstance(ULocale where)
- [icu] Returns a new instance of BreakIterator that locates title boundaries.
The iterator returned locates title boundaries as described for
Unicode 3.2 only. For Unicode 4.0 and above title boundary iteration,
please use a word boundary iterator; see getWordInstance().
- Parameters:
where
- A ULocale specifying the language of the text being analyzed.
- Returns:
- A new instance of BreakIterator that locates title boundaries.
- Status:
- Stable ICU 3.2.
registerInstance
public static Object registerInstance(BreakIterator iter,
Locale locale,
int kind)
- [icu] Registers a new break iterator of the indicated kind, to use in the given
locale. Clones of the iterator will be returned if a request for a break iterator
of the given kind matches or falls back to this locale.
- Parameters:
iter
- the BreakIterator instance to adopt.
locale
- the Locale for which this instance is to be registered
kind
- the type of iterator for which this instance is to be registered
- Returns:
- a registry key that can be used to unregister this instance
- Status:
- Stable ICU 2.4.
registerInstance
public static Object registerInstance(BreakIterator iter,
ULocale locale,
int kind)
- [icu] Registers a new break iterator of the indicated kind, to use in the given
locale. Clones of the iterator will be returned if a request for a break iterator
of the given kind matches or falls back to this locale.
- Parameters:
iter
- the BreakIterator instance to adopt.
locale
- the Locale for which this instance is to be registered
kind
- the type of iterator for which this instance is to be registered
- Returns:
- a registry key that can be used to unregister this instance
- Status:
- Stable ICU 3.2.
unregister
public static boolean unregister(Object key)
- [icu] Unregisters a previously-registered BreakIterator using the key returned
from the register call. Key becomes invalid after this call and should not be used
again.
- Parameters:
key
- the registry key returned by a previous call to registerInstance
- Returns:
- true if the iterator for the key was successfully unregistered
- Status:
- Stable ICU 2.4.
getBreakInstance
public static BreakIterator getBreakInstance(ULocale where,
int kind)
- Deprecated. This API is ICU internal only.
- Returns a particular kind of BreakIterator for a locale.
Avoids writing a switch statement with getXYZInstance(where) calls.
- Status:
- Internal. This API is ICU internal only.
getAvailableLocales
public static Locale[] getAvailableLocales()
- Returns a list of locales for which BreakIterators can be used.
- Returns:
- An array of Locales. All of the locales in the array can
be used when creating a BreakIterator.
- Status:
- Stable ICU 2.6.
getAvailableULocales
public static ULocale[] getAvailableULocales()
- [icu] Returns a list of locales for which BreakIterators can be used.
- Returns:
- An array of ULocales. All of the locales in the array can
be used when creating a BreakIterator.
- Status:
- Draft ICU 3.2 (retain).
getLocale
public final ULocale getLocale(ULocale.Type type)
- [icu] Returns the locale that was used to create this object, or null.
This may differ from the locale requested at the time of
this object's creation. For example, if an object is created
for locale en_US_CALIFORNIA, the actual data may be
drawn from en (the actual locale), and
en_US may be the most specific locale that exists (the
valid locale).
Note: The actual locale is returned correctly, but the valid
locale is not, in most cases.
- Parameters:
type
- type of information requested, either ULocale.VALID_LOCALE
or ULocale.ACTUAL_LOCALE.
- Returns:
- the information specified by type, or null if
this object was not constructed from locale data.
- See Also:
ULocale, ULocale.VALID_LOCALE, ULocale.ACTUAL_LOCALE
- Status:
- Draft ICU 2.8 (retain).
Copyright (c) 2011 IBM Corporation and others.