Package org.archive.modules.extractor
Class Description

Extended version of ExtractorHTML with more aggressive JavaScript link extraction, where JavaScript code is parsed first with the general HTML tags regex and then by the JavaScript speculative link regex.
Subclasses the standard ExtractorJS to add some configuration options.
Extracts links from the fetched content of a URI, as opposed to its headers.
Abstract base class for unit testing ContentExtractor implementations.
Overwrites action tags that may hold a URI to use the CrawlUriSWFAction action.
Extracts links from fetched URIs.
Parses URIs from CSS files.
Allows the caller to extract href-style links from Word 97-format Word documents.
Basic link extraction from an HTML content body, using regular expressions.
Extracts URIs from HTTP response headers.
An extractor for finding 'implied' URIs inside other URIs.
Processes JavaScript files for strings that are likely to be crawlable URIs.
An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.
Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.
Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.
Extracts URIs from SWF (Flash/Shockwave) files.
A last-ditch extractor that looks at the raw byte code and tries to extract anything that looks like a link.
An extractor for finding URIs inside other URIs.
A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).
The kind of "hop" from one URI to another.
XPath-like context for HTML-discovered URIs.
A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.
Improved link extraction from an HTML content body using the jericho-html parser.
The context of link discovery.
Class for representing handy default LinkContext values.
Supports PDF parsing operations.
Pseudo-extractor that suppresses link extraction of likely trap pages, by noticing when the content's digest is identical to that of its 'via'.
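
Several of the descriptions above mention finding candidate strings with regular expressions and constructing outlink URIs from them. The general idea can be illustrated with a minimal standalone sketch. It uses only the JDK and none of this package's classes; the class name RegexLinkSketch, the method extractLinks, and the pattern itself are hypothetical and chosen only for illustration. The package's real extractors may behave quite differently.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.archive.modules.extractor.
final class RegexLinkSketch {

    // Assumed pattern: captures the quoted value of an href or src attribute.
    private static final Pattern LINK =
        Pattern.compile("(?i)\\b(?:href|src)\\s*=\\s*[\"']([^\"'\\s>]+)[\"']");

    // Finds candidate strings in the content and resolves each against the
    // base URI to build an absolute outlink.
    static List<URI> extractLinks(URI base, CharSequence content) {
        List<URI> outlinks = new ArrayList<>();
        Matcher m = LINK.matcher(content);
        while (m.find()) {
            try {
                outlinks.add(base.resolve(m.group(1)));
            } catch (IllegalArgumentException e) {
                // The captured string is not a syntactically valid URI; skip it.
            }
        }
        return outlinks;
    }
}
```

Calling extractLinks with a base URI and a fragment of HTML returns the resolved outlinks in the order they appear. Relative references such as c.html are resolved against the base, and strings that do not parse as URIs are skipped.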