it.unimi.dsi.parser.callback
Class LinkExtractor

java.lang.Object
  extended by it.unimi.dsi.parser.callback.DefaultCallback
      extended by it.unimi.dsi.parser.callback.LinkExtractor
All Implemented Interfaces:
Callback

public class LinkExtractor
extends DefaultCallback

A callback extracting links.

This callback extracts the links found in a web page. The links are then accessible in urls (a set of Strings). The iteration order of the set is guaranteed to be exactly the order in which the links were encountered, although duplicates appear only once.
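
The ordering guarantee above matches the behaviour of a java.util.LinkedHashSet. The following is a minimal sketch of those semantics only; it assumes nothing about how LinkExtractor is actually implemented, and the class and method names (EncounterOrderDemo, distinctInEncounterOrder) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of it.unimi.dsi.parser.
class EncounterOrderDemo {
    // Returns the links in first-encounter order, dropping later duplicates.
    static List<String> distinctInEncounterOrder(List<String> links) {
        // A LinkedHashSet iterates in insertion order and ignores re-insertions.
        Set<String> seen = new LinkedHashSet<>(links);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> met = Arrays.asList("a.html", "img/logo.png", "a.html", "b.html");
        // The second "a.html" is dropped; the remaining order is unchanged.
        System.out.println(distinctInEncounterOrder(met));
    }
}
```

A caller iterating over urls can therefore treat the set as an ordered list of distinct links.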


Field Summary
 Set<String> urls
          The URLs resulting from the parsing process.
 
Fields inherited from interface it.unimi.dsi.parser.callback.Callback
EMPTY_CALLBACK_ARRAY
 
Constructor Summary
LinkExtractor()
           
 
Method Summary
 String base()
          Returns the URL specified by the BASE element.
 void configure(BulletParser parser)
          Configure the parser to parse elements and certain attributes.
 String metaLocation()
          Returns the URL specified by META HTTP-EQUIV elements of location type.
 String metaRefresh()
          Returns the URL specified by META HTTP-EQUIV elements of refresh type.
 void startDocument()
          Receive notification of the beginning of the document.
 boolean startElement(Element element, Map<Attribute,MutableString> attrMap)
          Receive notification of the start of an element.
 
Methods inherited from class it.unimi.dsi.parser.callback.DefaultCallback
cdata, characters, endDocument, endElement, getInstance
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

urls

public final Set<String> urls
The URLs resulting from the parsing process.

Constructor Detail

LinkExtractor

public LinkExtractor()
Method Detail

configure

public void configure(BulletParser parser)
Configure the parser to parse elements and certain attributes.

The required attributes are SRC, HREF, HTTP-EQUIV, and CONTENT.

Specified by:
configure in interface Callback
Overrides:
configure in class DefaultCallback

startDocument

public void startDocument()
Description copied from interface: Callback
Receive notification of the beginning of the document.

The callback must use this method to reset its internal state so that it can be reused. It must be safe to invoke this method several times.

Specified by:
startDocument in interface Callback
Overrides:
startDocument in class DefaultCallback
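
The reset contract described above can be illustrated with a small stand-alone class. This is a sketch under stated assumptions, not the source of LinkExtractor: ResettableCollector, onLink and onBase are hypothetical names, and only the behaviour of startDocument() (clear everything, safe to call repeatedly) mirrors the documentation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a reusable callback; not part of it.unimi.dsi.parser.
class ResettableCollector {
    // Links collected for the current document, in encounter order.
    final Set<String> urls = new LinkedHashSet<>();
    // First BASE URL seen in the current document, or null.
    private String base;

    // Resets all per-document state; calling it several times in a row is harmless.
    void startDocument() {
        urls.clear();
        base = null;
    }

    // Hypothetical hook: records a link found in the document.
    void onLink(String url) {
        urls.add(url);
    }

    // Hypothetical hook: keeps only the first BASE URL.
    void onBase(String url) {
        if (base == null) base = url;
    }

    String base() {
        return base;
    }
}
```

Because the state is cleared at the start of each document, one instance can be reused across many pages without results from one page leaking into the next.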

startElement

public boolean startElement(Element element,
                            Map<Attribute,MutableString> attrMap)
Description copied from interface: Callback
Receive notification of the start of an element.

For simple elements, this is the only notification that the callback will ever receive.

Specified by:
startElement in interface Callback
Overrides:
startElement in class DefaultCallback
Parameters:
element - the element whose opening tag was found.
attrMap - a map from Attributes to MutableStrings.
Returns:
true to keep the parser parsing, false to stop it.

metaLocation

public String metaLocation()
Returns the URL specified by META HTTP-EQUIV elements of location type. More precisely, this method returns a non-null result iff there is at least one META HTTP-EQUIV element specifying a location URL (if there is more than one, the first one is kept).

Returns:
the first URL specified by a META HTTP-EQUIV element of location type, or null.

base

public String base()
Returns the URL specified by the BASE element. More precisely, this method returns a non-null result iff there is at least one BASE element specifying a derelativisation URL, that is, the URL against which relative links should be resolved (if there is more than one, the first one is kept).

Returns:
the first URL specified by a BASE element, or null.
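
Since base() and urls are plain strings, resolving relative links is left to the caller. The following is a minimal sketch of one way to do it with java.net.URI; the class BaseResolutionDemo, the method resolveAll and the documentUrl parameter are hypothetical, and the sketch assumes all inputs are syntactically valid URIs (URI.create throws IllegalArgumentException otherwise).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of it.unimi.dsi.parser.
class BaseResolutionDemo {
    // documentUrl: the URL the page was fetched from.
    // base: the value returned by base(), possibly null.
    // links: the extracted links, in encounter order.
    static List<String> resolveAll(String documentUrl, String base, List<String> links) {
        URI document = URI.create(documentUrl);
        // A BASE value may itself be relative, so resolve it against the document URL first.
        URI context = (base == null) ? document : document.resolve(base);
        List<String> resolved = new ArrayList<>();
        for (String link : links) {
            // Absolute links are returned unchanged; relative ones are resolved against the context.
            resolved.add(context.resolve(link).toString());
        }
        return resolved;
    }
}
```

When base() returns null, the sketch falls back to the document's own URL as the resolution context.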

metaRefresh

public String metaRefresh()
Returns the URL specified by META HTTP-EQUIV elements of refresh type. More precisely, this method returns a non-null result iff there is at least one META HTTP-EQUIV element specifying a refresh URL (if there is more than one, the first one is kept).

Returns:
the first URL specified by a META HTTP-EQUIV element of refresh type, or null.
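
For orientation, a refresh-type element typically looks like <META HTTP-EQUIV="refresh" CONTENT="5; URL=http://example.com/next.html">, where the URL is embedded in the CONTENT attribute. The sketch below shows one way such a value could be picked apart with a regular expression. It is an illustration only and makes no claim about how LinkExtractor parses the attribute; MetaRefreshDemo and refreshUrl are hypothetical names, and the pattern covers only the common "seconds; url=..." form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of it.unimi.dsi.parser.
class MetaRefreshDemo {
    // Matches "<seconds>; url=<target>", with optional quotes around the target.
    private static final Pattern REFRESH_URL = Pattern.compile(
        "^\\s*\\d+\\s*;\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the URL embedded in a refresh CONTENT value, or null if there is none.
    static String refreshUrl(String content) {
        if (content == null) return null;
        Matcher m = REFRESH_URL.matcher(content);
        return m.find() ? m.group(1) : null;
    }
}
```

A CONTENT value that carries only a delay, such as "30", has no URL part and yields null here, which parallels the null result documented for metaRefresh().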