All Classes and Interfaces
Class
Description
Abstract base class for parsers that use the AutoDetectReader and need
to use the EncodingDetector configured by TikaConfig.
Abstract base class for parsers that call external processes.
Abstract base class for parser wrappers which may / will
process a given stream multiple times, merging the results
of the various parsers used.
The various strategies for handling metadata emitted by
multiple parsers.
Deprecated.
for removal in 4.x
This is a special handler to be used only with the RecursiveParserWrapper.
Exception to be thrown when a document does not allow content extraction.
Until we can find a common standard, we'll use these options.
This class contains utilities for dealing with tika annotations
Worker thread that takes EmitData off the queue, batches it
and tries to emit it as a batch
This is the main class for handling async requests.
Final evaluation state of a .
This config object can be used to tune how conservative we want to be
when parsing data that is extremely compressible and resembles a ZIP
bomb.
Factory for an AutoDetectParser
An input stream reader that automatically detects the character encoding
to be used for converting bytes to characters.
Basic factory for creating common types of ContentHandlers
Common handler types for content.
For now, this is an in-memory EmbeddedDocumentBytesHandler that stores
all the bytes in memory.
Content handler decorator that only passes everything inside
the XHTML <body/> tag to the underlying handler.
Very slight modification of Commons' BoundedInputStream
so that we can figure out if this hit the bound or not.
This is a simple wrapper around PipesIterator
that allows it to be called in its own thread.
This filter runs a regex against the first value in the "sourceField".
Intermediate evaluation state of a .../*... XPath expression.
Class to help de-obfuscate phone numbers in text.
This class clears the entire metadata object if the
attachment type matches one of the types.
This class clears the entire metadata object if the
mime matches the mime filter.
Met keys from NCAR CCSM files in the Climate Forecast Convention.
Content type detector that combines multiple different detection mechanisms.
A Composite Parser that wraps up all the available External Parsers,
and provides an easy way to access them.
Composite XPath evaluation state.
Composite parser that delegates parsing tasks to a component parser
based on the declared content type of the incoming document.
Utility Class for Concurrency in Tika
Allows Thread Pool to be Configurable.
Tika container extractor interface.
Decorator base class for the ContentHandler interface.
Interface to allow easier injection of code for getting a new ContentHandler
This exception should be thrown when the parse absolutely, positively has to stop.
A collection of Creative Commons property names.
Decrypts the incoming document stream and delegates further parsing to
another parser instance.
Some dates in some file formats do not have a timezone.
Date related utility methods and constants
A composite detector based on all the Detector implementations
available through the service provider mechanism.
Loads EmbeddedStreamTranslators via service loading.
A composite encoding detector based on all the EncodingDetector implementations
available through the service provider mechanism.
A composite parser based on all the Parser implementations
available through the service provider mechanism.
A version of DefaultDetector for probabilistic mime
detectors, which use statistical techniques to blend the
results of differing underlying detectors when attempting
to detect the type of a given file.
A translator which picks the first available Translator
implementation available through the service provider mechanism.
Base class for parser implementations that want to delegate parts of the
task of parsing an input document to another parser.
Content type detector.
Interface for digester.
This is used in AutoDetectParserConfig to (optionally)
wrap the parser in a digesting parser.
Encodes byte array from a MessageDigest to String
Interface for different document selection strategies for purposes like
embedded document extraction by a ContainerExtractor instance.
A collection of Dublin Core metadata names.
Content handler decorator that maps element QNames using a Map.
Final evaluation state of an XPath expression that targets an element.
Content handler decorator that prevents the
EmbeddedContentHandler.startDocument() and EmbeddedContentHandler.endDocument()
events from reaching the decorated handler.
Factories that create EmbeddedDocumentExtractors requiring an
EmbeddedDocumentBytesHandler in the ParseContext should extend this.
Utility class to handle common issues with embedded documents.
Tika container extractor callback interface.
Interface for different filtering of embedded streams.
Tika embedder interface
Utility class that will apply the appropriate fetcher
to the fetcherString based on the prefix.
Dummy detector that returns application/octet-stream for all documents.
Dummy parser that always produces an empty XHTML document without even
attempting to parse the given document stream.
Dummy translator that always declines to give any text.
Character encoding detector.
A wrapper around a ContentHandler which will ignore normal
SAX calls to EndDocumentShieldingContentHandler.endDocument(), and only fire them later.
General Endian Related Utilities.
EPub properties collection.
Dummy parser that always throws a TikaException without even
attempting to parse the given document stream.
Content handler decorator which wraps a TransformerHandler in order to
allow the TITLE tag to render as <title></title>
rather than <title/> which is accomplished
by calling the ContentHandler.characters(char[], int, int) method
with a length of 1 but a zero length char array.
Embedder that uses an external program (like sed or exiftool) to embed text
content and metadata into a given document.
Parser that uses an external program (like catdoc or pdf2txt) to extract
text content and metadata from a given document.
This is a next generation external parser that uses some of the more
recent additions to Tika.
Consumer contract
Builds up ExternalParser instances based on XML file(s)
which define what to run, for what, and how to process
any output metadata.
Met Keys used by the ExternalParsersConfigReader.
Creates instances of ExternalParser based on XML
configuration files.
This should be catastrophic
Tries multiple parsers in turn, until one succeeds.
Interface for an object that will fetch an InputStream given
a fetch string.
Utility class to hold multiple fetchers.
If something goes wrong in parsing the fetcher string
Pair of fetcherName (which fetcher to call) and the key
to send to that fetcher to retrieve a specific file.
Field annotation is a contract for binding Param value from
Tika Configuration to an object.
This runs the Linux 'file' command against a file.
Reads a list of file names/relative paths from a UTF-8 file.
A collection of metadata elements for file system level metadata
Geographic schema.
If Metadata contains a TikaCoreProperties.LATITUDE and
a TikaCoreProperties.LONGITUDE, this filter concatenates those with a
comma in the order LATITUDE,LONGITUDE.
HandlerConfig.PARSE_MODE.RMETA "recursive metadata" is the same as the -J option
in tika-app and the /rmeta endpoint in tika-server.
A set of Hex encoding and decoding utility methods.
A collection of HTTP header names.
Components that must do special processing across multiple fields
at initialization time should implement this interface.
This is to be used to handle potential recoverable problems that
might arise during initialization.
A factory which returns a fresh InputStream for the same
resource each time.
IPTC photo metadata schema.
SAX content handler that updates a language detector based on all the
received character content.
Support for language tags (as defined by https://tools.ietf.org/html/bcp47)
Writer that builds a language profile based on all the written content.
Content handler that collects links from an XHTML document.
Interface for error handling strategies in service class loading.
Simple PipesReporter that logs everything at the debug level.
Stream wrapper that makes it easy to read up to n bytes ahead from
a stream that supports the mark feature.
Metadata for describing machines, such as their
architecture, type and endian-ness
Content type detection based on magic bytes, i.e. type-specific patterns
near the beginning of the document input stream.
XPath element matcher.
Content handler decorator that only passes the elements, attributes,
and text nodes that match the given XPath expression.
Internet media type.
Registry of known Internet media types.
A collection of Message related property names.
A multi-valued metadata container.
Filters the metadata in place after the parse
Internet media type.
A class to encapsulate MimeType related exceptions.
This class is a MimeType repository.
Creates instances of MimeTypes.
A reader for XML files compliant with the freedesktop MIME-info DTD.
Met Keys used by the MimeTypesReader.
Final evaluation state of a ...
Intermediate evaluation state of a ...
Content type detection based on the resource name.
Final evaluation state of a ...
Always returns the charset passed in via the initializer
This filter performs no operations on the metadata
and leaves it untouched.
Office Document properties collection.
Core properties as defined in the Office Open XML specification part Two that are not
in the DublinCore namespace.
Extended properties as defined in the Office Open XML specification part Four.
Content handler decorator that always returns an empty stream from the
OfflineContentHandler.resolveEntity(String, String) method to prevent potential
network or other external resources from being accessed by an XML parser.
Deprecated.
after 2.5.0 this functionality was moved to the CompositeDetector
XMP Paged-text schema.
The range of pages to render.
This is a serializable model class for parameters from configuration file.
Simple pointer class to allow parsers to pass on the parent contenthandler through
to the embedded document's parse
Parse context.
Tika parser interface.
An implementation of ContainerExtractor powered by the regular Parser API.
Decorator base class for the Parser interface.
Use this class to store exceptions, warnings and other information
during the parse.
Lightweight, easily serializable class that contains enough information
to build a ParserFactory
Parser decorator that post-processes the results from a decorated parser.
Helper util methods for Parsers themselves.
Helper class for parsers of package archives or other compound document
formats that support embedded or attached component documents.
Reader for the text content from a given binary stream.
Interface for providing a password to a Parser for handling Encrypted
and Password Protected Documents.
PDF properties collection.
Class used to extract phone numbers while parsing.
XMP Photoshop metadata schema.
The PipesClient is designed to be single-threaded.
Fatal exception that means that something went seriously wrong.
Abstract class that handles the testing for timeouts/thread safety
issues.
This is called asynchronously by the AsyncProcessor.
Base class that includes filtering by PipesResult.STATUS
This server is forked from the PipesClient.
Selector for combining different mime detection results
based on probability
Builder class for setting probability parameters
XMP property definition.
XMP property definition violation exception.
QuattroPro properties collection.
This class extracts a range of bytes from a given fetch key.
This is a helper class that wraps a parser in a recursive handler.
This is the default implementation of AbstractRecursiveParserWrapperHandler.
Inspired by the Nutch code class OutlinkExtractor.
Interface for a renderer.
This should be used to track state for each file (embedded or otherwise).
Use this in the ParseContext to keep track of unique ids for rendered
images in embedded docs.
Empty interface for requests to a renderer.
Wraps an input stream, reading it only once, but making it available
for rereading an arbitrary number of times.
Content handler for Rich Text; it extracts the alt attribute of the
XHTML <img/> tag and the name attribute of the XHTML <a/> tag
into the output.
Recursive Unpacker and text and metadata extractor.
Use this to throw a SAXException in subclassed methods that don't throw SAXExceptions
Content handler decorator that makes sure that the character events
(SafeContentHandler.characters(char[], int, int) or
SafeContentHandler.ignorableWhitespace(char[], int, int)) passed to the decorated
content handler contain only valid XML characters.
Internal interface that allows both character and
ignorable whitespace content to be filtered the same way.
Content handler decorator that attempts to prevent denial of service
attacks against Tika parsers.
Internal utility class that Tika uses to look up service providers.
Service Loading and Ordering related utils
Simple Thread Pool Executor
This class provides a collection of the most important technical standard organizations.
Class that represents a standard reference.
StandardsExtractingContentHandler is a Content Handler used to extract
standard references while parsing.
StandardText relies on regular expressions to extract standard references
from text.
This is to be used to limit the amount of metadata that a
parser can add based on the
StandardWriteFilter.maxTotalEstimatedSize,
StandardWriteFilter.maxFieldSize, StandardWriteFilter.maxValuesPerField, and
StandardWriteFilter.maxKeySize.
Factory class for StandardWriteFilter.
The RecursiveParserWrapper wraps the parser sent
into the ParseContext and then uses that parser
to store state (among many other things).
Sentinel exception to stop parsing xml once target is found
while SAX parsing.
Evaluation state of a ...//... XPath expression.
Runs the input stream through all available parsers,
merging the metadata from them based on the
AbstractMultipleParser.MetadataPolicy chosen.
Copied from commons-lang to avoid requiring the dependency
A content handler decorator that tags potential exceptions so that the
handler that caused the exception can easily be identified.
A SAXException wrapper that tags the wrapped exception with
a given object reference.
A specialized input stream implementation which records the last portion read
from an underlying stream.
Content handler proxy that forwards the received SAX events to zero or
more underlying content handlers.
Utility class for tracking and ultimately closing or otherwise disposing
a collection of temporary resources.
Content handler decorator that only passes the
TextContentHandler.characters(char[], int, int) and
TextContentHandler.ignorableWhitespace(char[], int, int)
(plus TextContentHandler.startDocument() and TextContentHandler.endDocument()) events to
the decorated content handler.
Content type detection of plain text documents.
Final evaluation state of a ...
Utility class for computing a histogram of the bytes seen in a stream.
XMP Exif TIFF schema.
Facade class for accessing Tika functionality.
Bundle activator that adjusts the class loading mechanism of the
ServiceLoader class to work correctly in an OSGi environment.
Parse XML config file.
Tika Config Exception is an exception to occur when there is an error
in Tika config file and/or one or more of the parsers failed to initialize
from that erroneous config.
Contains a core set of basic Tika metadata properties, which all parsers
will attempt to supply (where the file format permits).
A file might contain different types of embedded documents.
Tika exception
Input stream with extended capabilities.
A collection of Tika metadata keys used in Mime Type resolution
Metadata properties for paged text, metadata appropriate
for an individual page (useful for embedded document handlers
called on individual pages).
Runtime/unchecked version of TimeoutException
SAX event handler that serializes the HTML document to a character stream.
Interface for pipesiterators that allow counting of total
documents.
SAX event handler that writes all character content out to a character
stream.
SAX event handler that serializes the XML document to a character stream.
Interface for Translator services.
Content type detection based on a content type hint.
Parsers should throw this exception when they encounter
a file format that they do not support.
Simple fetcher for URLs.
WordPerfect properties collection.
SAX event handler that writes content up to an optional write
limit out to a character stream or other decorated handler.
Content handler decorator that simplifies the task of producing XHTML
events for Tika content parsers.
Utility functions for reading XML.
Utility class that uses a SAXParser to determine
the namespace URI and local name of the root element of an XML file.
Content handler decorator that simplifies the task of producing XMP output.
XMP Dynamic Media schema.
Deprecated.
Experimental method, will change shortly
XMP Rights management schema.
Parser for a very simple XPath subset.
Exception thrown by the AutoDetectParser when a file contains zero-bytes.
Detector to identify zero length files as application/x-zerovalue
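
Several descriptions above concern content type detection: a detector for zero length files, detection based on magic bytes near the beginning of the stream, a dummy detector that returns application/octet-stream, and a composite that combines multiple mechanisms. The following is a minimal sketch of that idea using only the Java standard library. It is not Tika's implementation and does not use Tika's Detector interface: DetectorSketch, SimpleDetector and magic are hypothetical names, the %PDF- prefix is the only pattern included, and "first specific answer wins" is a simplifying assumption about how the results are combined.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class DetectorSketch {

    /** Fallback answer meaning "unknown". */
    static final String OCTET_STREAM = "application/octet-stream";

    /** Hypothetical stand-in for a detector; NOT Tika's Detector interface. */
    interface SimpleDetector {
        /** Must leave the stream positioned where it found it. */
        String detect(InputStream in) throws IOException;
    }

    /** Reports application/x-zerovalue when the stream holds no bytes at all. */
    static final SimpleDetector ZERO_SIZE = in -> {
        in.mark(1);
        try {
            return in.read() == -1 ? "application/x-zerovalue" : OCTET_STREAM;
        } finally {
            in.reset();
        }
    };

    /** Matches a fixed byte pattern at the very start of the stream. */
    static SimpleDetector magic(byte[] prefix, String type) {
        return in -> {
            in.mark(prefix.length);
            try {
                byte[] head = in.readNBytes(prefix.length);
                return Arrays.equals(head, prefix) ? type : OCTET_STREAM;
            } finally {
                in.reset();
            }
        };
    }

    /** Composite: asks each detector in turn, returns the first specific answer. */
    static String detect(byte[] data) {
        List<SimpleDetector> detectors = List.of(
                ZERO_SIZE,
                magic("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        // ByteArrayInputStream supports mark/reset, which the detectors rely on.
        InputStream in = new ByteArrayInputStream(data);
        try {
            for (SimpleDetector detector : detectors) {
                String type = detector.detect(in);
                if (!OCTET_STREAM.equals(type)) {
                    return type;
                }
            }
            return OCTET_STREAM;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[0]));
        System.out.println(detect("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect("hello".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Each detector marks the stream before reading and resets it afterwards, so later detectors see the same bytes. That is why the sketch needs a stream that supports the mark feature.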