All Classes and Interfaces
Class
Description
This class specifies the base class for file chunking
Intermediate layer to set OfficeParserConfig uniformly.
Base class for all Tika OOXML extractors.
ActiveMime is a macro container format used in some mso files.
The class is used to represent the number of the array.
Base object for FSSHTTPB.
The class is used to read/set bit value for a byte array
A class used to extract values across byte boundaries at arbitrary bit positions.
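As an illustration of reading a value that straddles a byte boundary, here is a minimal sketch. The class name `BitFieldSketch`, the method `readBits`, and the least-significant-bit-first ordering are assumptions made for this example; they are not taken from the class described above.

```java
final class BitFieldSketch {
    /**
     * Reads `length` bits (1..32) starting at absolute bit position `bitOffset`.
     * Assumption: bit 0 is the least significant bit of data[0], and the first
     * bit read becomes the least significant bit of the result.
     */
    static int readBits(byte[] data, int bitOffset, int length) {
        if (length < 1 || length > 32) {
            throw new IllegalArgumentException("length must be in 1..32");
        }
        if (bitOffset < 0 || (long) bitOffset + length > (long) data.length * 8) {
            throw new IndexOutOfBoundsException("bit range outside the array");
        }
        int value = 0;
        for (int i = 0; i < length; i++) {
            int pos = bitOffset + i;
            // Select the byte (pos / 8), then the bit inside it (pos % 8).
            int bit = (data[pos >> 3] >> (pos & 7)) & 1;
            value |= bit << i;
        }
        return value;
    }
}
```

With the bytes `0xAB 0xCD`, reading 8 bits from bit offset 4 combines the high nibble of the first byte with the low nibble of the second.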
Cell of content.
Cell decorator.
Cell manifest data element
Defines an accessor interface
Contains chm extractor assertions
A container for chm block information:
i. initial block, used to reset the main tree
ii. start block, marking where to start
iii. end block, marking where to stop
iv. start offset, marking where to start reading
v. end offset, marking where to stop reading
Represents entry types: uncompressed, compressed
Represents intel file states during decompression
Represents lzx states: started decoding, not started decoding
Holds chm listing entries
Extracts text from chm file.
The Header
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and following data.
000C: DWORD 1 (unknown)
0010: DWORD a timestamp
0014: DWORD Windows Language ID
0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC}
0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC}
Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
Header section table entry:
0000: QWORD Offset of section from beginning of file
0008: QWORD Length of section
Following the header section table is 8 bytes of additional header data.
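The fixed part of the ITSF header can be read with absolute offsets. The sketch below assumes little-endian byte order (the usual convention for DWORD fields in Windows formats); the class name `ItsfHeaderSketch` and its field names are hypothetical and do not mirror the extractor's own code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class ItsfHeaderSketch {
    final int version;        // 0004: DWORD version number
    final long headerLength;  // 0008: DWORD total header length
    final long timestamp;     // 0010: DWORD timestamp
    final long languageId;    // 0014: DWORD Windows language ID

    private ItsfHeaderSketch(int version, long headerLength, long timestamp, long languageId) {
        this.version = version;
        this.headerLength = headerLength;
        this.timestamp = timestamp;
        this.languageId = languageId;
    }

    /** Parses the fixed fields; the two GUIDs at 0018 and 0028 are skipped. */
    static ItsfHeaderSketch parse(byte[] data) {
        // 0x28 (offset of the second GUID) + 0x10 (GUID size) = 0x38 bytes.
        if (data.length < 0x38) {
            throw new IllegalArgumentException("need at least 0x38 bytes");
        }
        String signature = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(signature)) {
            throw new IllegalArgumentException("not an ITSF header: " + signature);
        }
        // Assumption: all multi-byte fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        return new ItsfHeaderSketch(
                buf.getInt(0x04),
                Integer.toUnsignedLong(buf.getInt(0x08)),
                Integer.toUnsignedLong(buf.getInt(0x10)),
                Integer.toUnsignedLong(buf.getInt(0x14)));
    }
}
```

DWORD values are widened to `long` so that values above `Integer.MAX_VALUE` are not reported as negative.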
Directory header
The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2
0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
Decompresses a chm block.
::DataSpace/Storage//ControlData This file contains $20 bytes of
information on the compression.
LZXC reset table, used to ensure decompression.
Description
Note: does not always exist.
An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL.
The format of a directory index entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
Encoded Integers aka ENCINT
An ENCINT is a variable-length integer.
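The description above only states that an ENCINT is variable-length. The sketch below assumes the encoding given in commonly cited unofficial CHM format notes: seven payload bits per byte, most significant group first, with the high bit of each byte set when another byte follows. `EncIntSketch` and `decode` are hypothetical names.

```java
final class EncIntSketch {
    /**
     * Decodes one ENCINT starting at `offset`.
     * Returns a two-element array: {decoded value, number of bytes consumed}.
     * Assumption: 7 payload bits per byte, big-endian group order,
     * high bit set means "more bytes follow".
     */
    static long[] decode(byte[] data, int offset) {
        long value = 0;
        int pos = offset;
        while (true) {
            if (pos >= data.length) {
                throw new IllegalArgumentException("truncated ENCINT");
            }
            int b = data[pos++] & 0xFF;
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                return new long[] {value, pos - offset};
            }
            // Nine groups of 7 bits already fill 63 bits; refuse to overflow.
            if (pos - offset >= 9) {
                throw new IllegalArgumentException("ENCINT longer than 9 bytes");
            }
        }
    }
}
```

Under these assumptions the two bytes `0x82 0x2C` decode to 2 * 128 + 44 = 300.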
Description There are two types of directory chunks -- index chunks, and
listing chunks.
This class is used to create an instance of AbstractChunking.
A 9-byte encoding of values in the range 0x0002000000000000 through 0xFFFFFFFFFFFFFFFF
This class is used to represent the CompactID structure.
Base class of data element
Specifies a data element hash stream object
The enumeration of the data element type
Data Node Object data
Data Size Object
The format of a directory listing entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate).
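A directory listing entry with the layout above could be read as sketched here. The name length is taken as a single BYTE, as stated; the ENCINT decoding (seven bits per byte, most significant group first, high bit as continuation flag) is an assumption drawn from unofficial CHM format notes. `ListingEntrySketch` and its members are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ListingEntrySketch {
    final String name;
    final long section;  // content section the file lives in
    final long offset;   // offset inside the (decompressed) section
    final long length;   // length of the file in bytes

    private ListingEntrySketch(String name, long section, long offset, long length) {
        this.name = name;
        this.section = section;
        this.offset = offset;
        this.length = length;
    }

    /** Reads one entry at the buffer's current position and advances past it. */
    static ListingEntrySketch read(ByteBuffer buf) {
        int nameLength = buf.get() & 0xFF;
        byte[] nameBytes = new byte[nameLength];
        buf.get(nameBytes);
        String name = new String(nameBytes, StandardCharsets.UTF_8);
        long section = readEncInt(buf);
        long offset = readEncInt(buf);
        long length = readEncInt(buf);
        return new ListingEntrySketch(name, section, offset, length);
    }

    // Assumed ENCINT layout: 7 payload bits per byte, high bit = "continue".
    private static long readEncInt(ByteBuffer buf) {
        long value = 0;
        int b;
        do {
            b = buf.get() & 0xFF;
            value = (value << 7) | (b & 0x7F);
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

A truncated entry surfaces as a `BufferUnderflowException` from `ByteBuffer.get`, which a real parser would likely translate into its own exception type.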
This class is used to represent a property that contains 8 bytes of data in the PropertySet.rgData stream field.
This class records metadata about embedded parts that exist in the XML
of the main document.
Extracts files embedded in EMF and offers a
very rough capability to extract text if there
is text stored in the EMF.
Excel parser implementation which uses POI's Event API
to handle the contents of a Workbook.
This class is used to represent a property that contains 4 bytes of data in the PropertySet.rgData stream field.
FSSHTTPB Serialize interface.
The class is used to build a root node object.
The interface of the property in OneNote file.
Parser that handles Microsoft Access files via
Jackcess
This class is used to represent a JCID
This class is used to represent the JCID object.
The class is used to build an intermediate node object.
This is an optional PST parser that relies on the user installing
the GPL-3 libpst/readpst command-line tool and configuring
Tika to call this tool via tika-config.xml
Linked cell.
Contains the information for a single list in the list or list override tables.
Computes the number text which goes at the beginning of each list paragraph
Implement a converter which converts to/from little-endian byte arrays
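One way to convert between integers and little-endian byte arrays is with `java.nio.ByteBuffer`. This is a minimal sketch and not the converter described above; `LittleEndianSketch` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class LittleEndianSketch {
    /** Encodes a 32-bit value as 4 bytes, least significant byte first. */
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value).array();
    }

    /** Decodes 4 little-endian bytes starting at `offset` as a signed int. */
    static int toInt(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes, offset, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Decodes 4 little-endian bytes as an unsigned value held in a long. */
    static long toUnsignedInt(byte[] bytes, int offset) {
        return Integer.toUnsignedLong(toInt(bytes, offset));
    }
}
```

For example, `toBytes(0x12345678)` yields the bytes `78 56 34 12`, and `toInt` reverses the conversion.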
OOXML metadata extractor.
Parser for temporary MSOffice files.
This class is used to represent a property that contains no data.
Number cell.
The ObjectGroupData class.
Internal class for building a list of DataElement from a node object.
Object Group Declarations
Specifies an object group metadata
Object Metadata Declaration
object data BLOB declaration
object data BLOB reference
This class is used to represent an ObjectSpaceObjectPropSet.
This class is used to represent an ObjectSpaceObjectStreamOfContextIDs.
This class is used to represent an ObjectSpaceObjectStreamOfOIDs.
This class is used to represent an ObjectSpaceObjectStreamOfOSIDs.
Defines a Microsoft document content extractor.
A POI-powered Tika Parser for very old versions of Excel, from
pre-OLE2 days, such as Excel 4.
This class is used to represent a property that contains 1 byte of data in the PropertySet.rgData stream field.
OneNote Tika parser capable of parsing Microsoft OneNote files.
Options when walking the OneNote tree.
Interface implemented by all Tika OOXML extractors.
Figures out the correct OOXMLExtractor for the supplied document and
returns it.
Office Open XML (OOXML) parser.
This class is intended to handle anything that might contain IBodyElements:
main document, headers, footers, notes, slides, etc.
This is a wrapper around OPCPackage that calls revert() instead of close().
Outlook Message Parser.
Parser for MS Outlook PST email storage files
A detector that works on a POIFS OLE2 document
to figure out exactly what the file is.
This class is used to represent a PropertyID.
This class is used to represent a PropertySet.
This class is used to represent the property set.
The class is used to represent the prtArrayOfPropertyValues.
This class is used to represent the prtFourBytesOfLengthFollowedByData.
This class is used to process RDC analysis chunking
The enumeration of request type.
Specifies a revision manifest object group references, each followed by object group extended GUIDs
Specifies a revision manifest root declare, each followed by root and object extended GUIDs
The class is used to represent the revision store object.
RTF parser
WARNING: This class is mutable.
Signature Object
Parses wordml 2003 format Excel files.
Specifies the storage index cell mappings (with cell identifier, cell mapping extended GUID,
and cell mapping serial number)
Specifies the storage index revision mappings (with revision and revision mapping
extended GUIDs, and revision mapping serial number)
Specifies one or more storage manifest root declare.
Specifies a storage manifest schema GUID
A 16-bit header for a compound object would indicate the end of a stream object
An 8-bit header for a compound object would indicate the end of a stream object
This class specifies the base class for 16-bit or 32-bit stream object header start
A 16-bit header for a compound object would indicate the start of a stream object
A 32-bit header for a compound object would indicate the start of a stream object
The enumeration of the stream object type header start
Extractor for Common OLE2 (HPSF) metadata
SAX/Streaming pptx extractor
This is an experimental, alternative extractor for docx files.
Text cell.
Overrides Excel's General format to include more
significant digits than the MS Spec allows.
A Format that allows up to 15 significant digits for integers.
A POI-powered Tika Parser for TNEF (Transport Neutral
Encoding Format) messages, aka winmail.dat
This class is used to represent a property that contains 2 bytes of data in the PropertySet.rgData stream field.
The unsigned byte type
The unsigned int type
The unsigned long type
A utility class for static access to unsigned number functionality.
A base type for unsigned numbers.
The unsigned short type
This parser offers a very rough capability to extract text if there
is text stored in the WMF files.
Parses wordml 2003 format word files.
Currently, mostly a pass-through class to hold pkg and properties
and keep the general framework similar to our other POI-integrated
extractors.
Turns formatted sheet events into HTML
Captures information on interesting tags, whilst
delegating the main work to the formatting handler
Experimental class that is based on POI's XSSFEventBasedExcelExtractor
Stub class of POI's XWPFNumbering because onDocumentRead() is protected
For Tika, all we need (so far) is a mapping between styleId and a style's name.
This class is used to process zip file chunking