java.lang.Object
  com.ibm.icu.text.BreakIterator
    com.ibm.icu.text.RuleBasedBreakIterator
      com.ibm.icu.text.DictionaryBasedBreakIterator
public class DictionaryBasedBreakIterator
A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator_Old is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator_Old, but adds one more special substitution name: _dictionary_. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in _dictionary_, it goes back through that range and derives additional break positions (if possible) using the dictionary.

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It uses Class.getResource() to locate the dictionary file. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.
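The dictionary step described above can be illustrated with a small, self-contained sketch. It is not ICU's implementation: the class name DictionarySplitSketch, the split method, and the in-memory Set standing in for the serialized dictionary file are all assumptions made for illustration. Given a run of letters that the state table could not divide further, it looks for break offsets such that every piece is a known word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; hypothetical names, not part of ICU4J.
class DictionarySplitSketch {

    /**
     * Returns break offsets (relative to the start of run) that divide run
     * entirely into dictionary words, or an empty list if no such division
     * exists (the run would then be left undivided).
     */
    static List<Integer> split(String run, Set<String> dictionary) {
        int n = run.length();
        // prev[i] is the start of a dictionary word ending at offset i whose
        // prefix run[0..start) is itself fully divisible; -1 means unreachable.
        int[] prev = new int[n + 1];
        Arrays.fill(prev, -1);
        prev[0] = 0; // offset 0 is trivially reachable
        for (int end = 1; end <= n; end++) {
            for (int start = end - 1; start >= 0; start--) {
                if (prev[start] != -1
                        && dictionary.contains(run.substring(start, end))) {
                    prev[end] = start;
                    break;
                }
            }
        }
        List<Integer> breaks = new ArrayList<>();
        if (n == 0 || prev[n] == -1) {
            return breaks; // no complete division found
        }
        // Walk back from the end of the run to recover the break offsets.
        for (int pos = n; pos > 0; pos = prev[pos]) {
            breaks.add(pos);
        }
        Collections.reverse(breaks);
        return breaks;
    }
}
```

With the hypothetical dictionary {the, cat, cats, sat}, split("thecatsat", dictionary) yields the offsets 3, 6, 9. A greedy longest match would take "cats" and then get stuck on "at", which is why the sketch records every reachable offset instead of committing to the first match.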
Field Summary |
---|
Fields inherited from class com.ibm.icu.text.RuleBasedBreakIterator |
---|
fDebugEnv, fDictionaryCharCount, fRData, fTrace, WORD_IDEO, WORD_IDEO_LIMIT, WORD_KANA, WORD_KANA_LIMIT, WORD_LETTER, WORD_LETTER_LIMIT, WORD_NONE, WORD_NONE_LIMIT, WORD_NUMBER, WORD_NUMBER_LIMIT |
Fields inherited from class com.ibm.icu.text.BreakIterator |
---|
DONE, KIND_CHARACTER, KIND_LINE, KIND_SENTENCE, KIND_TITLE, KIND_WORD |
Constructor Summary | |
---|---|
protected | DictionaryBasedBreakIterator(InputStream compiledRules): Deprecated. This API is ICU internal only. |
public | DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream): Deprecated. This API is ICU internal only. |
public | DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream): Constructs a DictionaryBasedBreakIterator. |
Method Summary | |
---|---|
int | first(): Sets the current iteration position to the beginning of the text. |
int | following(int offset): Sets the current iteration position to the first boundary position after the specified position. |
int | getRuleStatus(): Return the status tag from the break rule that determined the most recently returned break position. |
int | getRuleStatusVec(int[] fillInArray): Get the status (tag) values from the break rule(s) that determined the most recently returned break position. |
protected int | handleNext(): Deprecated. This API is ICU internal only. |
int | last(): Sets the current iteration position to the end of the text. |
int | preceding(int offset): Sets the current iteration position to the last boundary position before the specified position. |
int | previous(): Advances the iterator one step backwards. |
void | setText(CharacterIterator newText): Set the iterator to analyze a new piece of text. |
Methods inherited from class com.ibm.icu.text.RuleBasedBreakIterator |
---|
checkOffset, clone, compileRules, current, dump, equals, getInstanceFromCompiledRules, getText, hashCode, isBoundary, next, next, toString |
Methods inherited from class java.lang.Object |
---|
finalize, getClass, notify, notifyAll, wait, wait, wait |
Constructor Detail |
---|
protected DictionaryBasedBreakIterator(InputStream compiledRules) throws IOException
Parameters:
compiledRules - an input stream containing the binary (flattened) compiled rules.
Throws:
IOException
public DictionaryBasedBreakIterator(String rules, InputStream dictionaryStream) throws IOException
Parameters:
rules - Same as the rules parameter on RuleBasedBreakIterator, except for the special meaning of "_dictionary_". This parameter is just passed through to the RuleBasedBreakIterator constructor.
dictionaryStream - the stream containing the dictionary data
Throws:
IOException
public DictionaryBasedBreakIterator(InputStream compiledRules, InputStream dictionaryStream) throws IOException
Parameters:
compiledRules - an input stream containing the binary (flattened) compiled rules.
dictionaryStream - an input stream containing the dictionary data
Throws:
IOException
Method Detail |
---|
public void setText(CharacterIterator newText)
Description copied from class: RuleBasedBreakIterator
Set the iterator to analyze a new piece of text.
Overrides: setText in class RuleBasedBreakIterator
Parameters:
newText - An iterator over the text to analyze.
public int first()
Overrides: first in class RuleBasedBreakIterator
public int last()
Overrides: last in class RuleBasedBreakIterator
public int previous()
Overrides: previous in class RuleBasedBreakIterator
public int preceding(int offset)
Overrides: preceding in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching from
public int following(int offset)
Overrides: following in class RuleBasedBreakIterator
Parameters:
offset - The position to begin searching forward from
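As a rough illustration of how first(), following(int) and preceding(int) relate to one another, the sketch below uses the JDK's java.text.BreakIterator as a stand-in. That substitution is an assumption made so the example runs without ICU4J or a dictionary file: the JDK class exposes the same navigation methods and the same DONE sentinel, but it is a different class, and its boundaries for dictionary-dependent scripts such as Thai may differ. BoundaryNavigationSketch and its helper methods are hypothetical names.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Stand-in sketch: uses java.text.BreakIterator, NOT the ICU class documented here.
class BoundaryNavigationSketch {

    private static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    /** Collects every boundary by walking forward from first() until DONE. */
    static List<Integer> allBoundaries(String text) {
        BreakIterator it = wordIterator(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    /** First boundary strictly after the given offset. */
    static int boundaryAfter(String text, int offset) {
        return wordIterator(text).following(offset);
    }

    /** Last boundary strictly before the given offset. */
    static int boundaryBefore(String text, int offset) {
        return wordIterator(text).preceding(offset);
    }
}
```

For the text "Hello world" the stand-in reports boundaries at 0, 5, 6 and 11, so boundaryAfter(text, 2) is 5 and boundaryBefore(text, 6) is 5. Both calls skip a boundary that sits exactly at the given offset.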
public int getRuleStatus()
Overrides: getRuleStatus in class RuleBasedBreakIterator
public int getRuleStatusVec(int[] fillInArray)
TODO: not supported for dictionary based break iterator.
Overrides: getRuleStatusVec in class RuleBasedBreakIterator
Parameters:
fillInArray - an array to be filled in with the status values.
protected int handleNext()