it.unimi.dsi.parser.callback
Class TextExtractor

java.lang.Object
  extended by it.unimi.dsi.parser.callback.DefaultCallback
      extended by it.unimi.dsi.parser.callback.TextExtractor
All Implemented Interfaces:
Callback

public class TextExtractor
extends DefaultCallback

A callback extracting text and titles.

This callback extracts all text in the page, and the title. The resulting text is available through text, and the title through title.

Note that text and title are never trimmed.
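
Since text and title are never trimmed, callers that need clean strings usually normalize whitespace themselves. The following is a minimal sketch of one way to do that; WhitespaceNormalizer and normalize are hypothetical names, not part of this library. The helper accepts any CharSequence, so a MutableString such as text or title can be passed directly.

```java
// Hypothetical helper, not part of it.unimi.dsi.parser.
final class WhitespaceNormalizer {
    // Removes leading/trailing whitespace and collapses inner runs to one space.
    static String normalize(CharSequence raw) {
        StringBuilder out = new StringBuilder(raw.length());
        boolean pendingSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c)) {
                // Remember the gap, but only once some output exists.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) out.append(' ');
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("  Hello \n  world  ") + "]");
    }
}
```

The input is left untouched, so the extractor's fields can still be inspected in their raw form afterwards.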


Field Summary
 MutableString text
          The text resulting from the parsing process.
 MutableString title
          The title resulting from the parsing process.
 
Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
 
Constructor Summary
TextExtractor()
           
 
Method Summary
 boolean characters(char[] characters, int offset, int length, boolean flowBroken)
          Receive notification of character data inside an element.
 void configure(BulletParser parser)
          Configure the parser to parse text.
 boolean endElement(Element element)
          Receive notification of the end of an element.
 void startDocument()
          Receive notification of the beginning of the document.
 boolean startElement(Element element, Map<Attribute,MutableString> attrMapUnused)
          Receive notification of the start of an element.
 
Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback
cdata, endDocument, getInstance
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

text

public final MutableString text
The text resulting from the parsing process.


title

public final MutableString title
The title resulting from the parsing process.

Constructor Detail

TextExtractor

public TextExtractor()
Method Detail

configure

public void configure(BulletParser parser)
Configure the parser to parse text.

Specified by:
configure in interface Callback
Overrides:
configure in class DefaultCallback

startDocument

public void startDocument()
Description copied from interface: Callback
Receive notification of the beginning of the document.

The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.

Specified by:
startDocument in interface Callback
Overrides:
startDocument in class DefaultCallback
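
The reset contract can be illustrated with a simplified stand-in. ReusableCollector below is a hypothetical class that only models the idea (clear accumulated state in startDocument so one instance can serve many documents); it is not this class's actual implementation and does not use the library's types.

```java
// Hypothetical stand-in modelling the reset contract; not library code.
final class ReusableCollector {
    private final StringBuilder text = new StringBuilder();

    // Drops everything left over from a previous parse.
    // Calling it repeatedly is harmless.
    void startDocument() {
        text.setLength(0);
    }

    // Appends a slice of character data without modifying the array.
    boolean characters(char[] chars, int offset, int length) {
        text.append(chars, offset, length);
        return true;
    }

    String text() {
        return text.toString();
    }

    // Simulates one parse: reset first, then feed the document's characters.
    static String collect(ReusableCollector collector, String document) {
        collector.startDocument();
        char[] chars = document.toCharArray();
        collector.characters(chars, 0, chars.length);
        return collector.text();
    }

    public static void main(String[] args) {
        ReusableCollector collector = new ReusableCollector();
        System.out.println(collect(collector, "first"));
        System.out.println(collect(collector, "second"));
    }
}
```

Without the reset, the second call would return the text of both documents concatenated.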

characters

public boolean characters(char[] characters,
                          int offset,
                          int length,
                          boolean flowBroken)
Description copied from interface: Callback
Receive notification of character data inside an element.

You must not write into the characters array, as it could be passed around to many callbacks.

flowBroken will be true iff the flow was broken before this character data. This feature makes it possible to quickly extract the text in a document without looking at the elements.

Specified by:
characters in interface Callback
Overrides:
characters in class DefaultCallback
Parameters:
characters - an array containing the character data.
offset - the start position in the array.
length - the number of characters to read from the array.
flowBroken - whether the flow is broken at the start of the character data.
Returns:
true to keep the parser parsing, false to stop it.
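
As an illustration of the parameters, here is a minimal sketch of a collector that copies the slice [offset, offset + length) and uses flowBroken to separate chunks. FlowAwareCollector is a hypothetical class, and inserting a single space on a broken flow is an assumption made for this example, not a statement of what TextExtractor does.

```java
// Hypothetical stand-in; the space-on-break policy is an assumption.
final class FlowAwareCollector {
    private final StringBuilder text = new StringBuilder();

    boolean characters(char[] characters, int offset, int length, boolean flowBroken) {
        // Separate this chunk from earlier text when the flow was broken.
        if (flowBroken && text.length() > 0) text.append(' ');
        // append copies the slice; the shared array is never written to.
        text.append(characters, offset, length);
        return true; // keep parsing
    }

    String text() {
        return text.toString();
    }

    public static void main(String[] args) {
        char[] page = "<p>one</p><p>two</p>".toCharArray();
        FlowAwareCollector collector = new FlowAwareCollector();
        collector.characters(page, 3, 3, true);  // the slice "one"
        collector.characters(page, 13, 3, true); // the slice "two"
        System.out.println(collector.text());
    }
}
```

Because only the flag is consulted, the collector never has to track which elements caused the break.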

endElement

public boolean endElement(Element element)
Description copied from interface: Callback
Receive notification of the end of an element. Warning: unless specific decorators are used, in general a callback will just receive notifications for elements whose closing tag appears explicitly in the document.

This method will never be called for elements without closing tags, even if such a tag is found.

Specified by:
endElement in interface Callback
Overrides:
endElement in class DefaultCallback
Parameters:
element - the element whose closing tag was found.
Returns:
true to keep the parser parsing, false to stop it.

startElement

public boolean startElement(Element element,
                            Map<Attribute,MutableString> attrMapUnused)
Description copied from interface: Callback
Receive notification of the start of an element.

For simple elements, this is the only notification that the callback will ever receive.

Specified by:
startElement in interface Callback
Overrides:
startElement in class DefaultCallback
Parameters:
element - the element whose opening tag was found.
attrMapUnused - a map from Attributes to MutableStrings.
Returns:
true to keep the parser parsing, false to stop it.
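
To show how start and end notifications can cooperate with character data, the sketch below routes characters to a title buffer while a title element is open. TitleTracker is a hypothetical class that uses plain String element names instead of the library's Element type, and keeping title characters out of the body text is an assumption of this example; the library's behavior may differ.

```java
// Hypothetical stand-in using String names instead of Element; not library code.
final class TitleTracker {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle;

    boolean startElement(String name) {
        if ("title".equalsIgnoreCase(name)) inTitle = true;
        return true; // keep parsing
    }

    boolean endElement(String name) {
        if ("title".equalsIgnoreCase(name)) inTitle = false;
        return true; // keep parsing
    }

    boolean characters(char[] chars, int offset, int length) {
        // Assumption: title characters go only to the title buffer.
        (inTitle ? title : text).append(chars, offset, length);
        return true;
    }

    String title() { return title.toString(); }

    String text() { return text.toString(); }

    public static void main(String[] args) {
        TitleTracker tracker = new TitleTracker();
        tracker.startElement("TITLE");
        tracker.characters("Home".toCharArray(), 0, 4);
        tracker.endElement("TITLE");
        tracker.characters("Welcome".toCharArray(), 0, 7);
        System.out.println(tracker.title() + " / " + tracker.text());
    }
}
```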