org.archive.modules.extractor (Heritrix 3: 'modules' subproject (reusable components) 3.7.0 API)

package org.archive.modules.extractor

Related Packages

Package

Description

org.archive.modules

The beginnings of a refactored settings framework.

Class

Description

AggressiveExtractorHTML

Extended version of ExtractorHTML with more aggressive JavaScript link extraction: JavaScript code is parsed first with the general HTML tags regex, and then with the speculative JavaScript link regex.

ConfigurableExtractorJS

Subclasses the standard ExtractorJS to add some configuration options.

ContentExtractor

Extracts links from the fetched content of a URI, as opposed to its headers.

ContentExtractorTestBase

Abstract base class for unit testing ContentExtractor implementations.

CustomSWFTags

Overrides the action tags that may hold URIs so that they use the CrawlUriSWFAction action.

Extractor

Extracts links from fetched URIs.

ExtractorCSS

Extracts URIs from CSS files.

ExtractorDOC

Allows the caller to extract href-style links from Word 97-format Word documents.

ExtractorHTML

Basic link-extraction, from an HTML content-body, using regular expressions.

ExtractorHTTP

Extracts URIs from HTTP response headers.

ExtractorImpliedURI

An extractor for finding 'implied' URIs inside other URIs.

ExtractorJS

Processes JavaScript files for strings that are likely to be crawlable URIs.

ExtractorMultipleRegex

An extractor that uses regular expressions to find strings in the fetched content of a URI, and constructs outlink URIs from those strings.

ExtractorParameters

Bean interface for parameters consulted by multiple Extractors, and thus provided by some shared object.

ExtractorPDF

Allows the caller to process a CrawlURI representing a PDF for the purpose of extracting URIs.

ExtractorRobotsTxt

ExtractorSitemap

ExtractorSWF

Extracts URIs from SWF (flash/shockwave) files.

ExtractorUniversal

A last-ditch extractor that looks at the raw bytes and tries to extract anything that looks like a link.

ExtractorURI

An extractor for finding URIs inside other URIs.

ExtractorXML

A simple extractor which finds HTTP URIs inside XML/RSS files, inside attribute values and simple elements (those with only whitespace + HTTP URI + whitespace as contents).

Hop

The kind of "hop" from one URI to another.

HTMLLinkContext

XPath-like context for URIs discovered in HTML.

HTTPContentDigest

A processor for calculating custom HTTP content digests in place of the default (if any) computed by the HTTP fetcher processors.

JerichoExtractorHTML

Improved link-extraction from an HTML content-body using the jericho-html parser.

LinkContext

The context of link discovery.

LinkContext.SimpleLinkContext

Class for representing handy default LinkContext values.

PDFParser

Supports PDF parsing operations.

StringExtractorTestBase

StringExtractorTestBase.TestData

TempDirProvider

TrapSuppressExtractor

Pseudo-extractor that suppresses link-extraction of likely trap pages, by noticing when content's digest is identical to that of its 'via'.

UriErrorLoggerModule
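
Several of the extractors listed above (ExtractorHTML, ExtractorJS, ExtractorMultipleRegex) are described as finding link-like strings with regular expressions and building outlink URIs from them. The following is a minimal sketch of that general idea using only the JDK. It is not the Heritrix implementation and uses none of its classes; the class name RegexOutlinkSketch, the method extract, and the LINK pattern are all hypothetical and chosen purely for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a JDK-only sketch of regex-based link extraction.
// None of these names come from the Heritrix API.
class RegexOutlinkSketch {

    // Hypothetical pattern: captures the value of a double-quoted
    // href="..." or src="..." attribute.
    private static final Pattern LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Finds candidate link strings in the content and resolves each one
    // against the URI of the document the content was fetched from.
    static List<String> extract(String baseUri, CharSequence content) {
        URI base = URI.create(baseUri);
        List<String> outlinks = new ArrayList<>();
        Matcher m = LINK.matcher(content);
        while (m.find()) {
            try {
                outlinks.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // The matched string is not a syntactically valid URI; skip it.
            }
        }
        return outlinks;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <img src=\"img/logo.png\">";
        for (String link : extract("http://example.com/dir/page.html", html)) {
            System.out.println(link);
        }
    }
}
```

Resolving each match against the base URI is what turns a relative string such as "/about" into an absolute, crawlable outlink. A real extractor would also record the link context and hop type (see LinkContext and Hop above), which this sketch leaves out.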

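
The TrapSuppressExtractor entry above describes suppressing link extraction when a page's content digest is identical to that of its 'via' (the page through which it was discovered). The check itself can be sketched with the JDK alone, as below. This is an assumption-laden illustration, not the Heritrix code: the class name TrapSuppressSketch, both method names, and the choice of SHA-1 are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative only: compares a page's content digest with the digest of
// the page that linked to it. None of these names come from the Heritrix API.
class TrapSuppressSketch {

    // Computes a SHA-1 digest of the content bytes (algorithm choice is an
    // assumption made for this sketch).
    static byte[] sha1(byte[] content) {
        try {
            return MessageDigest.getInstance("SHA-1").digest(content);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Returns true when both digests are known and identical, meaning the
    // page repeats its referrer's content and extraction should be skipped.
    static boolean shouldSuppress(byte[] contentDigest, byte[] viaDigest) {
        return contentDigest != null
                && viaDigest != null
                && MessageDigest.isEqual(contentDigest, viaDigest);
    }

    public static void main(String[] args) {
        byte[] via = sha1("<html>calendar</html>".getBytes(StandardCharsets.UTF_8));
        byte[] page = sha1("<html>calendar</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(shouldSuppress(page, via));
    }
}
```

Treating a missing digest as "do not suppress" is a conservative choice made here so that pages without a known referrer digest are still processed normally.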