com.ibm.icu.text
Class DictionaryBasedBreakIterator

java.lang.Object
  extended by com.ibm.icu.text.BreakIterator
      extended by com.ibm.icu.text.RuleBasedBreakIterator
          extended by com.ibm.icu.text.DictionaryBasedBreakIterator
All Implemented Interfaces:
Cloneable

public class DictionaryBasedBreakIterator
extends RuleBasedBreakIterator

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator_Old first divides up the text as far as possible; contiguous ranges of letters are then repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name identifies the characters that occur in dictionary words. If the iterator passes over a chunk of text that includes two or more consecutive characters that are included in _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It uses Class.getResource() to locate the dictionary file. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
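
As a rough illustration of the dictionary pass described above, the following sketch subdivides a run of unspaced letters by greedy longest-match lookup against a word set. It is not ICU's implementation: the class name DictionarySubdivisionSketch, the subdivide method, the English sample text, and the greedy strategy are all assumptions made for illustration, and the real iterator reads its word list from a serialized binary dictionary rather than an in-memory set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of ICU4J.
public class DictionarySubdivisionSketch {

    /**
     * Returns the break positions derived inside text[start, end) by greedy
     * longest-match lookup. The list always ends with {@code end}.
     */
    public static List<Integer> subdivide(String text, int start, int end, Set<String> words) {
        List<Integer> breaks = new ArrayList<>();
        int pos = start;
        while (pos < end) {
            int matchEnd = -1;
            // Try the longest candidate first, shrinking until a word matches.
            for (int candidate = end; candidate > pos; candidate--) {
                if (words.contains(text.substring(pos, candidate))) {
                    matchEnd = candidate;
                    break;
                }
            }
            if (matchEnd < 0) {
                // No dictionary word starts here: leave the rest of the range undivided.
                break;
            }
            breaks.add(matchEnd);
            pos = matchEnd;
        }
        // The end of the range is a boundary already found by the state-table pass.
        if (breaks.isEmpty() || breaks.get(breaks.size() - 1) != end) {
            breaks.add(end);
        }
        return breaks;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("the", "cat", "sat"));
        // "thecatsat" stands in for a run of letters with no spaces between words.
        System.out.println(subdivide("thecatsat", 0, 9, words)); // prints [3, 6, 9]
    }
}
```

Greedy matching is the simplest strategy to show; it can choose a long word that leaves the remainder unmatchable, which is one reason a production segmenter may need backtracking.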

Status:
Stable ICU 2.0.

Field Summary
 
Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator
fDebugEnv, fDictionaryCharCount, fRData, fTrace, WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT
 
Fields inherited from class com.ibm.icu.text.BreakIterator
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD
 
Constructor Summary
protected DictionaryBasedBreakIterator(InputStream compiledRules)
          Deprecated. This API is ICU internal only.
  DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream)
          Deprecated. This API is ICU internal only.
  DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream)
          Constructs a DictionaryBasedBreakIterator.
 
Method Summary
 int first()
          Sets the current iteration position to the beginning of the text.
 int following(int offset)
          Sets the current iteration position to the first boundary position after the specified position.
 int getRuleStatus()
          Return the status tag from the break rule that determined the most recently returned break position.
 int getRuleStatusVec(int[] fillInArray)
          Get the status (tag) values from the break rule(s) that determined the most recently returned break position.
protected  int handleNext()
          Deprecated. This API is ICU internal only.
 int last()
          Sets the current iteration position to the end of the text.
 int preceding(int offset)
          Sets the current iteration position to the last boundary position before the specified position.
 int previous()
          Advances the iterator one step backwards.
 void setText(CharacterIterator newText)
          Set the iterator to analyze a new piece of text.
 
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator
checkOffset, clone, compileRules, current, dump, equals, getInstanceFromCompiledRules, getText, hashCode, isBoundary, next, next, toString
 
Methods inherited from class com.ibm.icu.text.BreakIterator
getAvailableLocales, getAvailableULocales, getBreakInstance, getCharacterInstance, getCharacterInstance, getCharacterInstance, getLineInstance, getLineInstance, getLineInstance, getLocale, getSentenceInstance, getSentenceInstance, getSentenceInstance, getTitleInstance, getTitleInstance, getTitleInstance, getWordInstance, getWordInstance, getWordInstance, registerInstance, registerInstance, setText, unregister
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

DictionaryBasedBreakIterator

protected DictionaryBasedBreakIterator(InputStream compiledRules)
                                throws IOException
Deprecated. This API is ICU internal only.

Construct a DictionaryBasedBreakIterator from precompiled rules. Used by ThaiBreakEngine, which uses the BreakCTDictionary.

Parameters:
compiledRules - an input stream containing the binary (flattened) compiled rules.
Throws:
IOException
Status:
Internal. This API is ICU internal only.

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(String rules,
                                    InputStream dictionaryStream)
                             throws IOException
Constructs a DictionaryBasedBreakIterator.

Parameters:
rules - Same as the rules parameter on RuleBasedBreakIterator, except for the special meaning of "_dictionary_". This parameter is passed through unchanged to the RuleBasedBreakIterator constructor.
dictionaryStream - the stream containing the dictionary data
Throws:
IOException
Status:
Stable ICU 2.0.

DictionaryBasedBreakIterator

public DictionaryBasedBreakIterator(InputStream compiledRules,
                                    InputStream dictionaryStream)
                             throws IOException
Deprecated. This API is ICU internal only.

Construct a DictionaryBasedBreakIterator from precompiled rules.

Parameters:
compiledRules - an input stream containing the binary (flattened) compiled rules.
dictionaryStream - an input stream containing the dictionary data
Throws:
IOException
Status:
Internal. This API is ICU internal only.

Method Detail

setText

public void setText(CharacterIterator newText)
Description copied from class: RuleBasedBreakIterator
Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

Overrides:
setText in class RuleBasedBreakIterator
Parameters:
newText - An iterator over the text to analyze.
Status:
Stable ICU 2.0.

first

public int first()
Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

Overrides:
first in class RuleBasedBreakIterator
Returns:
The offset of the beginning of the text.
Status:
Stable ICU 2.0.
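
A typical forward traversal starts at first() and calls next() until DONE is returned. The sketch below uses the JDK's java.text.BreakIterator, which offers methods of the same names, so that it runs without ICU4J on the classpath; that substitution, the class name ForwardBoundaries, and the English sample text are assumptions for illustration. With ICU4J available, importing com.ibm.icu.text.BreakIterator instead should require no other change for this input.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; uses the JDK BreakIterator as a stand-in.
public class ForwardBoundaries {

    /** Collects every word boundary in the text, from first to last. */
    public static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // first() resets to the start of the text; next() returns DONE past the end.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world")); // prints [0, 5, 6, 11]
    }
}
```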

last

public int last()
Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

Overrides:
last in class RuleBasedBreakIterator
Returns:
The text's past-the-end offset.
Status:
Stable ICU 2.0.

previous

public int previous()
Advances the iterator one step backwards.

Overrides:
previous in class RuleBasedBreakIterator
Returns:
The position of the last boundary before the current iteration position.
Status:
Stable ICU 2.0.
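
Backward traversal mirrors the forward case: start at last() and call previous() until DONE is returned. As a hedged sketch, the code below again uses the JDK's java.text.BreakIterator as a stand-in for the ICU class; the class name BackwardBoundaries and the sample text are assumptions for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical example class; uses the JDK BreakIterator as a stand-in.
public class BackwardBoundaries {

    /** Collects every word boundary in the text, from last to first. */
    public static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        // last() moves to the past-the-end offset; previous() returns DONE before the start.
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world")); // prints [11, 6, 5, 0]
    }
}
```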

preceding

public int preceding(int offset)
Sets the current iteration position to the last boundary position before the specified position.

Overrides:
preceding in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching from
Returns:
The position of the last boundary before "offset"
Status:
Stable ICU 2.0.

following

public int following(int offset)
Sets the current iteration position to the first boundary position after the specified position.

Overrides:
following in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching forward from
Returns:
The position of the first boundary after "offset"
Status:
Stable ICU 2.0.
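
preceding(offset) and following(offset) both exclude the offset itself, so they can be combined to find the boundaries on either side of an arbitrary position. The sketch below assumes the JDK's java.text.BreakIterator as a stand-in for the ICU class; the class name NearestBoundaries, the around helper, and the sample text are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical example class; uses the JDK BreakIterator as a stand-in.
public class NearestBoundaries {

    /** Returns {last boundary before offset, first boundary after offset}. */
    public static int[] around(String text, int offset) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        // Each call searches from the given offset, not from the current position.
        int before = it.preceding(offset);
        int after = it.following(offset);
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] inside = around("Hello world", 3);     // offset inside "Hello"
        System.out.println(inside[0] + ".." + inside[1]);         // prints 0..5
        int[] onBoundary = around("Hello world", 5); // offset is itself a boundary
        System.out.println(onBoundary[0] + ".." + onBoundary[1]); // prints 0..6
    }
}
```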

getRuleStatus

public int getRuleStatus()
Return the status tag from the break rule that determined the most recently returned break position. TODO: not supported with dictionary-based break iterators.

Overrides:
getRuleStatus in class RuleBasedBreakIterator
Returns:
the status from the break rule that determined the most recently returned break position.
Status:
Draft ICU 3.0.

getRuleStatusVec

public int getRuleStatusVec(int[] fillInArray)
Get the status (tag) values from the break rule(s) that determined the most recently returned break position. The values appear in the rule source within brackets, {123}, for example. The default status value for rules that do not explicitly provide one is zero.

TODO: not supported for dictionary-based break iterators.

Overrides:
getRuleStatusVec in class RuleBasedBreakIterator
Parameters:
fillInArray - an array to be filled in with the status values.
Returns:
The number of rule status values from rules that determined the most recent boundary returned by the break iterator. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
Status:
Draft ICU 3.0.
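
The return-value contract described above (the total number of values available, even when fillInArray is too small) lets a caller detect truncation and retry with a larger array. The sketch below illustrates only that calling pattern with a hypothetical stand-in, fillStatus, in a hypothetical class FillInArraySketch; it does not call ICU, and the status values 100 and 200 are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of the fill-in-array convention; not part of ICU4J.
public class FillInArraySketch {

    /** Copies as many values as fit into fillInArray and returns the total available. */
    public static int fillStatus(int[] available, int[] fillInArray) {
        int copied = Math.min(available.length, fillInArray.length);
        System.arraycopy(available, 0, fillInArray, 0, copied);
        return available.length;
    }

    public static void main(String[] args) {
        int[] available = { 100, 200 }; // made-up status values
        int[] buffer = new int[1];
        int total = fillStatus(available, buffer);
        if (total > buffer.length) {
            // The first call truncated the result: retry with enough room.
            buffer = new int[total];
            fillStatus(available, buffer);
        }
        System.out.println(Arrays.toString(buffer)); // prints [100, 200]
    }
}
```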

handleNext

protected int handleNext()
Deprecated. This API is ICU internal only.

This is the implementation function for next().

Status:
Internal. This API is ICU internal only.


Copyright (c) 2012 IBM Corporation and others.