All Classes and Interfaces

Class
Description
 
 
 
Abstract base class for parsers that use the AutoDetectReader and need to use the EncodingDetector configured by TikaConfig
Abstract base class for parsers that call external processes.
 
Abstract base class for parser wrappers which may / will process a given stream multiple times, merging the results of the various parsers used.
The various strategies for handling metadata emitted by multiple parsers.
Deprecated.
for removal in 4.x
This is a special handler to be used only with the RecursiveParserWrapper.
Exception to be thrown when a document does not allow content extraction.
Until we can find a common standard, we'll use these options.
This class contains utilities for dealing with Tika annotations.
 
Worker thread that takes EmitData off the queue, batches it and tries to emit it as a batch
This is the main class for handling async requests.
 
 
Final evaluation state of a .
 
This config object can be used to tune how conservative we want to be when parsing data that is extremely compressible and resembles a ZIP bomb.
Factory for an AutoDetectParser
An input stream reader that automatically detects the character encoding to be used for converting bytes to characters.
Basic factory for creating common types of ContentHandlers
Common handler types for content.
 
For now, this is an in-memory EmbeddedDocumentBytesHandler that stores all the bytes in memory.
Content handler decorator that only passes everything inside the XHTML <body/> tag to the underlying handler.
Very slight modification of Commons' BoundedInputStream so that we can figure out if this hit the bound or not.
This is a simple wrapper around PipesIterator that allows it to be called in its own thread.
This filter runs a regex against the first value in the "sourceField".
 
Intermediate evaluation state of a .../*... XPath expression.
Class to help de-obfuscate phone numbers in text.
This class clears the entire metadata object if the attachment type matches one of the types.
This class clears the entire metadata object if the mime matches the mime filter.
Met keys from NCAR CCSM files in the Climate Forecast Convention.
 
Content type detector that combines multiple different detection mechanisms.
 
 
A Composite Parser that wraps up all the available External Parsers, and provides an easy way to access them.
Composite XPath evaluation state.
 
Composite parser that delegates parsing tasks to a component parser based on the declared content type of the incoming document.
 
 
Utility Class for Concurrency in Tika
 
Allows Thread Pool to be Configurable.
Tika container extractor interface.
Decorator base class for the ContentHandler interface.
 
Interface to allow easier injection of code for getting a new ContentHandler
This exception should be thrown when the parse absolutely, positively has to stop.
A collection of Creative Commons properties names.
Decrypts the incoming document stream and delegates further parsing to another parser instance.
 
Some dates in some file formats do not have a timezone.
Date related utility methods and constants
A composite detector based on all the Detector implementations available through the service provider mechanism.
Loads EmbeddedStreamTranslators via service loading.
A composite encoding detector based on all the EncodingDetector implementations available through the service provider mechanism.
 
A composite parser based on all the Parser implementations available through the service provider mechanism.
A version of DefaultDetector for probabilistic mime detectors, which use statistical techniques to blend the results of differing underlying detectors when attempting to detect the type of a given file.
A translator which picks the first available Translator implementation found through the service provider mechanism.
Base class for parser implementations that want to delegate parts of the task of parsing an input document to another parser.
Content type detector.
 
 
Interface for digester.
This is used in AutoDetectParserConfig to (optionally) wrap the parser in a digesting parser.
Encodes byte array from a MessageDigest to String
Interface for different document selection strategies for purposes like embedded document extraction by a ContainerExtractor instance.
A collection of Dublin Core metadata names.
Content handler decorator that maps element QNames using a Map.
 
Final evaluation state of an XPath expression that targets an element.
 
 
Content handler decorator that prevents the EmbeddedContentHandler.startDocument() and EmbeddedContentHandler.endDocument() events from reaching the decorated handler.
 
 
 
Factories that create EmbeddedDocumentExtractors requiring an EmbeddedDocumentBytesHandler in the ParseContext should extend this.
 
 
Utility class to handle common issues with embedded documents.
Tika container extractor callback interface.
Interface for different filtering of embedded streams.
Tika embedder interface
 
 
 
Utility class that will apply the appropriate fetcher to the fetcherString based on the prefix.
 
Dummy detector that returns application/octet-stream for all documents.
 
 
Dummy parser that always produces an empty XHTML document without even attempting to parse the given document stream.
Dummy translator that always declines to give any text.
Character encoding detector.
 
A wrapper around a ContentHandler which will ignore normal SAX calls to EndDocumentShieldingContentHandler.endDocument(), and only fire them later.
General Endian Related Utilities.
 
EPub properties collection.
Dummy parser that always throws a TikaException without even attempting to parse the given document stream.
 
 
Content handler decorator which wraps a TransformerHandler in order to allow the TITLE tag to render as <title></title> rather than <title/> which is accomplished by calling the ContentHandler.characters(char[], int, int) method with a length of 1 but a zero length char array.
Embedder that uses an external program (like sed or exiftool) to embed text content and metadata into a given document.
Parser that uses an external program (like catdoc or pdf2txt) to extract text content and metadata from a given document.
This is a next generation external parser that uses some of the more recent additions to Tika.
Consumer contract
Builds up ExternalParser instances based on XML file(s) which define what to run, for what, and how to process any output metadata.
Met Keys used by the ExternalParsersConfigReader.
Creates instances of ExternalParser based on XML configuration files.
 
This should be catastrophic
Tries multiple parsers in turn, until one succeeds.
 
 
Interface for an object that will fetch an InputStream given a fetch string.
 
Utility class to hold multiple fetchers.
Thrown if something goes wrong in parsing the fetcher string.
Pair of fetcherName (which fetcher to call) and the key to send to that fetcher to retrieve a specific file.
Field annotation is a contract for binding Param value from Tika Configuration to an object.
 
This runs the linux 'file' command against a file.
Reads a list of file names/relative paths from a UTF-8 file.
 
 
A collection of metadata elements for file system level metadata
 
 
 
 
 
 
 
 
Geographic schema.
If Metadata contains a TikaCoreProperties.LATITUDE and a TikaCoreProperties.LONGITUDE, this filter concatenates those with a comma in the order LATITUDE,LONGITUDE.
 
HandlerConfig.PARSE_MODE.RMETA "recursive metadata" is the same as the -J option in tika-app and the /rmeta endpoint in tika-server.
A set of Hex encoding and decoding utility methods.
 
A collection of HTTP header names.
 
Components that must do special processing across multiple fields at initialization time should implement this interface.
This is to be used to handle potential recoverable problems that might arise during initialization.
 
A factory which returns a fresh InputStream for the same resource each time.
 
IPTC photo metadata schema.
 
 
SAX content handler that updates a language detector based on all the received character content.
Support for language tags (as defined by https://tools.ietf.org/html/bcp47)
 
Writer that builds a language profile based on all the written content.
 
Content handler that collects links from an XHTML document.
Interface for error handling strategies in service class loading.
Simple PipesReporter that logs everything at the debug level.
Stream wrapper that makes it easy to read up to n bytes ahead from a stream that supports the mark feature.
Metadata for describing machines, such as their architecture, type and endian-ness
 
Content type detection based on magic bytes, i.e. type-specific patterns near the beginning of the document input stream.
XPath element matcher.
Content handler decorator that only passes the elements, attributes, and text nodes that match the given XPath expression.
Internet media type.
Registry of known Internet media types.
A collection of Message related property names.
A multi-valued metadata container.
Filters the metadata in place after the parse
 
 
Internet media type.
A class to encapsulate MimeType related exceptions.
This class is a MimeType repository.
Creates instances of MimeTypes.
A reader for XML files compliant with the freedesktop MIME-info DTD.
Met Keys used by the MimeTypesReader.
Final evaluation state of a ...
Intermediate evaluation state of a ...
Content type detection based on the resource name.
 
 
 
 
Final evaluation state of a ...
Always returns the charset passed in via the initializer
This filter performs no operations on the metadata and leaves it untouched.
 
Office Document properties collection.
Core properties as defined in the Office Open XML specification part Two that are not in the DublinCore namespace.
Extended properties as defined in the Office Open XML specification part Four.
Content handler decorator that always returns an empty stream from the OfflineContentHandler.resolveEntity(String, String) method to prevent potential network or other external resources from being accessed by an XML parser.
Deprecated.
after 2.5.0 this functionality was moved to the CompositeDetector
 
XMP Paged-text schema.
The range of pages to render.
This is a serializable model class for parameters from configuration file.
This class stores metadata for Field annotations, which is used to map them to Param at runtime.
Simple pointer class to allow parsers to pass on the parent ContentHandler through to the embedded document's parse.
Parse context.
Tika parser interface.
An implementation of ContainerExtractor powered by the regular Parser API.
Decorator base class for the Parser interface.
Use this class to store exceptions, warnings and other information during the parse.
 
Lightweight, easily serializable class that contains enough information to build a ParserFactory
Parser decorator that post-processes the results from a decorated parser.
Helper util methods for Parsers themselves.
Helper class for parsers of package archives or other compound document formats that support embedded or attached component documents.
 
Reader for the text content from a given binary stream.
Interface for providing a password to a Parser for handling Encrypted and Password Protected Documents.
PDF properties collection.
Class used to extract phone numbers while parsing.
XMP Photoshop metadata schema.
The PipesClient is designed to be single-threaded.
 
 
Fatal exception that means that something went seriously wrong.
Abstract class that handles the testing for timeouts/thread safety issues.
 
This is called asynchronously by the AsyncProcessor.
Base class that includes filtering by PipesResult.STATUS
 
 
This server is forked from the PipesClient.
 
Selector for combining different mime detection results based on probability
Builder class for setting probability parameters.
 
XMP property definition.
 
 
XMP property definition violation exception.
 
QuattroPro properties collection.
This class extracts a range of bytes from a given fetch key.
This is a helper class that wraps a parser in a recursive handler.
This is the default implementation of AbstractRecursiveParserWrapperHandler.
 
Inspired from Nutch code class OutlinkExtractor.
Interface for a renderer.
 
 
This should be used to track state for each file (embedded or otherwise).
Use this in the ParseContext to keep track of unique ids for rendered images in embedded docs.
Empty interface for requests to a renderer.
 
 
 
Wraps an input stream, reading it only once, but making it available for rereading an arbitrary number of times.
Content handler for Rich Text; it extracts the alt attribute of the XHTML <img/> tag and the name attribute of the XHTML <a/> tag into the output.
 
Recursive Unpacker and text and metadata extractor.
 
Use this to throw a SAXException in subclassed methods that don't throw SAXExceptions
Content handler decorator that makes sure that the character events (SafeContentHandler.characters(char[], int, int) or SafeContentHandler.ignorableWhitespace(char[], int, int)) passed to the decorated content handler contain only valid XML characters.
Internal interface that allows both character and ignorable whitespace content to be filtered the same way.
Content handler decorator that attempts to prevent denial of service attacks against Tika parsers.
Internal utility class that Tika uses to look up service providers.
Service Loading and Ordering related utils
Simple Thread Pool Executor
This class provides a collection of the most important technical standard organizations.
Class that represents a standard reference.
 
StandardsExtractingContentHandler is a Content Handler used to extract standard references while parsing.
StandardText relies on regular expressions to extract standard references from text.
This is to be used to limit the amount of metadata that a parser can add based on the StandardWriteFilter.maxTotalEstimatedSize, StandardWriteFilter.maxFieldSize, StandardWriteFilter.maxValuesPerField, and StandardWriteFilter.maxKeySize.
Factory class for StandardWriteFilter.
The RecursiveParserWrapper wraps the parser sent into the parsecontext and then uses that parser to store state (among many other things).
Sentinel exception to stop parsing xml once target is found while SAX parsing.
 
 
 
Evaluation state of a ...//... XPath expression.
Runs the input stream through all available parsers, merging the metadata from them based on the AbstractMultipleParser.MetadataPolicy chosen.
Copied from commons-lang to avoid requiring the dependency
A content handler decorator that tags potential exceptions so that the handler that caused the exception can easily be identified.
A SAXException wrapper that tags the wrapped exception with a given object reference.
A specialized input stream implementation which records the last portion read from an underlying stream.
Content handler proxy that forwards the received SAX events to zero or more underlying content handlers.
Utility class for tracking and ultimately closing or otherwise disposing a collection of temporary resources.
 
Content type detection of plain text documents.
Final evaluation state of a ...
Utility class for computing a histogram of the bytes seen in a stream.
XMP Exif TIFF schema.
Facade class for accessing Tika functionality.
Bundle activator that adjust the class loading mechanism of the ServiceLoader class to work correctly in an OSGi environment.
Parses the XML config file.
Tika Config Exception is thrown when there is an error in the Tika config file and/or one or more of the parsers failed to initialize from that erroneous config.
 
 
Contains a core set of basic Tika metadata properties, which all parsers will attempt to supply (where the file format permits).
A file might contain different types of embedded documents.
 
Tika exception
Input stream with extended capabilities.
 
A collection of Tika metadata keys used in Mime Type resolution
Metadata properties for paged text, metadata appropriate for an individual page (useful for embedded document handlers called on individual pages).
 
Runtime/unchecked version of TimeoutException
SAX event handler that serializes the HTML document to a character stream.
Interface for pipesiterators that allow counting of total documents.
 
 
SAX event handler that writes all character content out to a character stream.
SAX event handler that serializes the XML document to a character stream.
 
 
Interface for Translator services.
Content type detection based on a content type hint.
Parsers should throw this exception when they encounter a file format that they do not support.
Simple fetcher for URLs.
 
WordPerfect properties collection.
 
 
SAX event handler that writes content up to an optional write limit out to a character stream or other decorated handler.
Content handler decorator that simplifies the task of producing XHTML events for Tika content parsers.
Utility functions for reading XML.
Utility class that uses a SAXParser to determine the namespace URI and local name of the root element of an XML file.
 
Content handler decorator that simplifies the task of producing XMP output.
XMP Dynamic Media schema.
Deprecated.
Experimental method, will change shortly
 
 
XMP Rights management schema.
Parser for a very simple XPath subset.
Exception thrown by the AutoDetectParser when a file contains zero bytes.
 
Detector to identify zero length files as application/x-zerovalue
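
Several entries above describe detection based on magic bytes, read ahead from a stream that supports the mark feature. The sketch below illustrates that general idea using only the JDK; it is not Tika's implementation. The class name MagicSketch, the helper methods, and the two hard-coded signatures are hypothetical and chosen only for illustration. It peeks at the first bytes with mark/reset so the stream is left unconsumed, and falls back to application/octet-stream when nothing matches.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Tika.
public class MagicSketch {
    // Two well-known file signatures, hard-coded for the example.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Reads up to n bytes ahead, then rewinds the stream to where it was. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // end of stream before n bytes were available
                }
                total += read;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset();
        }
    }

    /** True if data begins with the given prefix. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns a media type name based on the leading bytes of the stream. */
    static String detect(InputStream in) throws IOException {
        byte[] head = peek(in, 8);
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream"; // fallback when nothing matches
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        System.out.println(detect(in));
        // The stream was rewound, so a later reader still sees the first byte.
        System.out.println((char) in.read());
    }
}
```

Because the stream is reset after peeking, a parser invoked afterwards can still read the document from its first byte.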