java.lang.Object
  it.unimi.dsi.parser.callback.DefaultCallback
    it.unimi.dsi.parser.callback.TextExtractor
public class TextExtractor
A callback extracting text and titles.
This callback extracts all the text in the page, as well as the title. The resulting text is available through the text field, and the title through the title field.
Note that text and title are never trimmed.
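The behaviour described above can be illustrated with a small self-contained sketch. It does not use the library: ExtractorSketch, its string-based element names and its hand-written event stream are hypothetical stand-ins for TextExtractor, Element and the parser, and StringBuilder stands in for MutableString. It only shows the general idea that all character data goes into text, that character data inside the title element also goes into title, and that neither is trimmed.

```java
/** Hypothetical stand-in for a text-and-title extracting callback; not the library class. */
final class ExtractorSketch {
    /** All character data seen so far; never trimmed. */
    final StringBuilder text = new StringBuilder();
    /** Character data seen inside the title element; never trimmed. */
    final StringBuilder title = new StringBuilder();
    private boolean inTitle;

    void startDocument() {
        text.setLength(0);
        title.setLength(0);
        inTitle = false;
    }

    boolean startElement(String name) {
        if (name.equalsIgnoreCase("title")) inTitle = true;
        return true;
    }

    boolean endElement(String name) {
        if (name.equalsIgnoreCase("title")) inTitle = false;
        return true;
    }

    boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        // Copy the requested slice; the source array is left untouched.
        text.append(characters, offset, length);
        if (inTitle) title.append(characters, offset, length);
        return true;
    }

    /** Feeds a tiny hand-written event stream to the sketch. */
    static ExtractorSketch demo() {
        ExtractorSketch e = new ExtractorSketch();
        e.startDocument();
        e.startElement("title");
        char[] t = " My page ".toCharArray();
        e.characters(t, 0, t.length, true);
        e.endElement("title");
        e.startElement("p");
        char[] b = "Hello world".toCharArray();
        e.characters(b, 0, b.length, true);
        e.endElement("p");
        return e;
    }

    public static void main(String[] args) {
        ExtractorSketch e = demo();
        System.out.println("title=[" + e.title + "]");
        System.out.println("text=[" + e.text + "]");
    }
}
```

Because nothing is trimmed, a caller that needs clean strings would have to strip leading and trailing whitespace itself.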
Field Summary | |
---|---|
MutableString | text: The text resulting from the parsing process. |
MutableString | title: The title resulting from the parsing process. |
Fields inherited from interface it.unimi.dsi.parser.callback.Callback |
---|
EMPTY_CALLBACK_ARRAY |
Constructor Summary | |
---|---|
TextExtractor() | |
Method Summary | |
---|---|
boolean | characters(char[] characters, int offset, int length, boolean flowBroken): Receive notification of character data inside an element. |
void | configure(BulletParser parser): Configure the parser to parse text. |
boolean | endElement(Element element): Receive notification of the end of an element. |
void | startDocument(): Receive notification of the beginning of the document. |
boolean | startElement(Element element, Map<Attribute,MutableString> attrMapUnused): Receive notification of the start of an element. |
Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback |
---|
cdata, endDocument, getInstance |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public final MutableString text
public final MutableString title
Constructor Detail |
---|
public TextExtractor()
Method Detail |
---|
public void configure(BulletParser parser)
Specified by: configure in interface Callback
Overrides: configure in class DefaultCallback
public void startDocument()
Description copied from interface: Callback
Receive notification of the beginning of the document. The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.
Specified by: startDocument in interface Callback
Overrides: startDocument in class DefaultCallback
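The reset contract can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: ResettableCallbackSketch and parseTwice are hypothetical names, and the events are fed by hand rather than by a parser. The point is that startDocument clears all per-document state, so repeated calls are harmless and one instance can serve several documents.

```java
/** Hypothetical callback whose startDocument() resets all per-document state. */
final class ResettableCallbackSketch {
    final StringBuilder text = new StringBuilder();

    /** Clears the accumulated state; invoking it several times in a row is harmless. */
    void startDocument() {
        text.setLength(0);
    }

    void characters(char[] characters, int offset, int length) {
        text.append(characters, offset, length);
    }

    /** Reuses one callback instance for two documents and returns what it holds at the end. */
    static String parseTwice(String first, String second) {
        ResettableCallbackSketch cb = new ResettableCallbackSketch();
        cb.startDocument();
        cb.startDocument(); // a repeated invocation must be safe
        cb.characters(first.toCharArray(), 0, first.length());
        cb.startDocument(); // reset before the second document
        cb.characters(second.toCharArray(), 0, second.length());
        return cb.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(parseTwice("one", "two"));
    }
}
```

After the second document only its own text remains; nothing from the first document leaks through.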
public boolean characters(char[] characters, int offset, int length, boolean flowBroken)
Description copied from interface: Callback
Receive notification of character data inside an element. You must not write into the character array (called text in the interface documentation), as it could be passed around to many callbacks.
flowBroken will be true iff the flow was broken before the character data. This feature makes it possible to quickly extract the text in a document without looking at the elements.
Specified by: characters in interface Callback
Overrides: characters in class DefaultCallback
Parameters:
characters - an array containing the character data.
offset - the start position in the array.
length - the number of characters to read from the array.
flowBroken - whether the flow is broken at the start of the character data.
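As an illustration of the parameters above, the following sketch reads only the slice [offset, offset + length) of a shared array, never writes into it, and uses flowBroken to decide where to insert a separator. FlowBrokenSketch and its policy of inserting a single space are assumptions made for this example; the page does not state how TextExtractor itself treats flowBroken.

```java
import java.util.Arrays;

/** Hypothetical callback that uses flowBroken to separate runs of text. */
final class FlowBrokenSketch {
    private final StringBuilder out = new StringBuilder();

    boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        // Assumed policy: one space wherever the flow was broken, except at the very start.
        if (flowBroken && out.length() > 0) out.append(' ');
        // append(char[], int, int) copies the slice; the shared array is not modified.
        out.append(characters, offset, length);
        return true;
    }

    /** Feeds three slices of one shared buffer and returns the collected text. */
    static String demo() {
        char[] page = "HeadingBodytext".toCharArray();
        char[] before = Arrays.copyOf(page, page.length);
        FlowBrokenSketch cb = new FlowBrokenSketch();
        cb.characters(page, 0, 7, true);   // "Heading"
        cb.characters(page, 7, 4, true);   // "Body", flow broken (for example by a block element)
        cb.characters(page, 11, 4, false); // "text", same flow (for example after an inline tag)
        if (!Arrays.equals(before, page)) throw new IllegalStateException("shared array was modified");
        return cb.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this policy the callback can produce readable text from character events alone, without inspecting any element.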
public boolean endElement(Element element)
Description copied from interface: Callback
Receive notification of the end of an element. This method will never be called for elements without closing tags, even if such a tag is found.
Specified by: endElement in interface Callback
Overrides: endElement in class DefaultCallback
Parameters:
element - the element whose closing tag was found.
public boolean startElement(Element element, Map<Attribute,MutableString> attrMapUnused)
Description copied from interface: Callback
Receive notification of the start of an element. For simple elements, this is the only notification that the callback will ever receive.
Specified by: startElement in interface Callback
Overrides: startElement in class DefaultCallback
Parameters:
element - the element whose opening tag was found.
attrMapUnused - a map from Attributes to MutableStrings.
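The relationship between the two element notifications can be sketched with a toy dispatcher. Everything here is hypothetical: ElementEventsSketch, the SIMPLE set and the "/name" convention for closing tags are made up for the example, and the real parser decides which elements are simple through its own Element definitions. The sketch only shows the documented contract: a simple element produces a start notification and nothing else, even when a closing tag for it appears in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy dispatcher illustrating which notifications a callback would receive. */
final class ElementEventsSketch {
    /** Hypothetical set of simple elements, that is, elements without closing tags. */
    static final Set<String> SIMPLE = Set.of("br", "hr", "img");

    /** Maps a tag sequence to callback events; a leading "/" marks a closing tag. */
    static List<String> dispatch(String... tags) {
        List<String> events = new ArrayList<>();
        for (String tag : tags) {
            boolean closing = tag.startsWith("/");
            String name = closing ? tag.substring(1) : tag;
            if (!closing) {
                events.add("start:" + name);
            } else if (!SIMPLE.contains(name)) {
                events.add("end:" + name);
            }
            // A closing tag found for a simple element produces no notification.
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(dispatch("p", "br", "/br", "/p"));
    }
}
```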