All Classes and Interfaces

Class
Description
This class specifies the base class for file chunking
 
 
 
Intermediate layer to set OfficeParserConfig uniformly.
Base class for all Tika OOXML extractors.
 
ActiveMime is a macro container format used in some mso files.
 
 
The class is used to represent the number of the array.
Base object for FSSHTTPB.
 
The class is used to read/set bit value for a byte array
 
A class used to extract values across byte boundaries at arbitrary bit positions.
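As an illustration of reading a value that straddles a byte boundary, here is a minimal sketch. The class and method names are hypothetical, and the bit numbering (bit 0 is the least significant bit of byte 0) is an assumption, not something this index states about the documented class.

```java
/** Hypothetical sketch: reads bitCount bits starting at an arbitrary bit offset. */
final class BitExtractSketch {
    private BitExtractSketch() {
    }

    /**
     * Assumes LSB-first numbering: bit 0 is the least significant bit of data[0],
     * bit 8 is the least significant bit of data[1], and so on.
     */
    static long readBits(byte[] data, int bitOffset, int bitCount) {
        if (bitCount < 0 || bitCount > 64) {
            throw new IllegalArgumentException("bitCount must be in 0..64");
        }
        if (bitOffset < 0 || (long) bitOffset + bitCount > (long) data.length * 8) {
            throw new IllegalArgumentException("requested bits fall outside the array");
        }
        long result = 0;
        for (int i = 0; i < bitCount; i++) {
            int bit = bitOffset + i;
            // Select the byte with bit / 8 and the position inside it with bit % 8.
            int value = (data[bit >>> 3] >>> (bit & 7)) & 1;
            result |= ((long) value) << i;
        }
        return result;
    }
}
```

For example, `BitExtractSketch.readBits(bytes, 4, 8)` combines the high nibble of the first byte with the low nibble of the second.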
 
 
Cell of content.
Cell decorator.
 
 
 
Cell manifest data element
Defines an accessor interface
Contains chm extractor assertions
A container that holds chm block information:
i. initial block: used to reset the main tree
ii. start block: where to start
iii. end block: where to stop
iv. start offset: where to start reading
v. end offset: where to stop reading
 
Represents entry types: uncompressed, compressed
Represents intel file states during decompression
Represents lzx states: started decoding, not started decoding
 
Holds chm listing entries
Extracts text from chm file.
The Header
0000: char[4] 'ITSF'
0004: DWORD 3 (Version number)
0008: DWORD Total header length, including header section table and following data.
000C: DWORD 1 (unknown)
0010: DWORD a timestamp
0014: DWORD Windows Language ID
0018: GUID {7C01FD10-7BAA-11D0-9E0C-00A0-C922-E6EC}
0028: GUID {7C01FD11-7BAA-11D0-9E0C-00A0-C922-E6EC}
Note: a GUID is $10 bytes, arranged as 1 DWORD, 2 WORDs, and 8 BYTEs.
Header section table entry:
0000: QWORD Offset of section from beginning of file
0008: QWORD Length of section
Following the header section table is 8 bytes of additional header data.
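The fixed part of the ITSF header could be read along these lines. This is a sketch only: the class name and field names are hypothetical, and little-endian byte order is an assumption (the usual convention for this Windows format) that the layout itself does not state.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for the DWORD fields of the ITSF header. */
final class ItsfHeaderSketch {
    /** Offsets 0x00..0x37: signature, six DWORDs and two 16-byte GUIDs. */
    static final int FIXED_LENGTH = 0x38;

    final int version;
    final int totalHeaderLength;
    final int timestamp;
    final int languageId;

    private ItsfHeaderSketch(int version, int totalHeaderLength, int timestamp, int languageId) {
        this.version = version;
        this.totalHeaderLength = totalHeaderLength;
        this.timestamp = timestamp;
        this.languageId = languageId;
    }

    static ItsfHeaderSketch parse(byte[] data) {
        if (data.length < FIXED_LENGTH) {
            throw new IllegalArgumentException("need at least 0x38 bytes");
        }
        String signature = new String(data, 0, 4, StandardCharsets.US_ASCII);
        if (!"ITSF".equals(signature)) {
            throw new IllegalArgumentException("not an ITSF header: " + signature);
        }
        // Assumption: multi-byte fields are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        return new ItsfHeaderSketch(
                buf.getInt(0x04),   // version number
                buf.getInt(0x08),   // total header length
                buf.getInt(0x10),   // timestamp
                buf.getInt(0x14));  // Windows language ID
    }
}
```

The two GUIDs and the header section table are skipped here; a fuller reader would continue from offset 0x38.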
Directory header
The directory starts with a header; its format is as follows:
0000: char[4] 'ITSP'
0004: DWORD Version number 1
0008: DWORD Length of the directory header
000C: DWORD $0a (unknown)
0010: DWORD $1000 Directory chunk size
0014: DWORD "Density" of quickref section, usually 2
0018: DWORD Depth of the index tree: 1 if there is no index, 2 if there is one level of PMGI chunks
001C: DWORD Chunk number of root index chunk, -1 if there is none (though at least one file has 0 despite there being no index chunk, probably a bug)
0020: DWORD Chunk number of first PMGL (listing) chunk
0024: DWORD Chunk number of last PMGL (listing) chunk
0028: DWORD -1 (unknown)
002C: DWORD Number of directory chunks (total)
0030: DWORD Windows language ID
0034: GUID {5D02926A-212E-11D0-9DF9-00A0C922E6EC}
0044: DWORD $54 (This is the length again)
0048: DWORD -1 (unknown)
004C: DWORD -1 (unknown)
0050: DWORD -1 (unknown)
Decompresses a chm block.
::DataSpace/Storage//ControlData This file contains $20 bytes of information on the compression.
LZXC reset table, used to ensure decompression.
 
 
 
Description
Note: an index chunk does not always exist.
An index chunk has the following format:
0000: char[4] 'PMGI'
0004: DWORD Length of quickref/free area at end of directory chunk
0008: Directory index entries (to quickref/free area)
The quickref area in a PMGI is the same as in a PMGL.
The format of a directory index entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: directory listing chunk which starts with name
Encoded Integers aka ENCINT: an ENCINT is a variable-length integer.
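The description only says that an ENCINT is variable-length. The sketch below assumes the layout given in commonly circulated CHM format notes: seven payload bits per byte, most significant group first, with the high bit set on every byte except the last. The class name and the nine-byte cap are choices made for this example.

```java
/** Hypothetical decoder for a CHM ENCINT, under the layout assumed above. */
final class EncIntSketch {
    private EncIntSketch() {
    }

    /**
     * Decodes one ENCINT starting at offset.
     *
     * @return a two-element array: the decoded value and the number of bytes consumed
     */
    static long[] decode(byte[] data, int offset) {
        long value = 0;
        int pos = offset;
        while (true) {
            if (pos >= data.length) {
                throw new IllegalArgumentException("truncated ENCINT");
            }
            if (pos - offset >= 9) {
                // Nine groups of 7 bits is the most that fits in a non-negative long.
                throw new IllegalArgumentException("ENCINT longer than 9 bytes");
            }
            int b = data[pos++] & 0xFF;
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                break; // high bit clear: this was the last byte
            }
        }
        return new long[] {value, pos - offset};
    }
}
```

Returning the consumed byte count lets a caller advance through a chunk that packs several entries back to back.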
Description There are two types of directory chunks -- index chunks, and listing chunks.
 
 
This class is used to create instance of AbstractChunking.
 
A 9-byte encoding of values in the range 0x0002000000000000 through 0xFFFFFFFFFFFFFFFF
This class is used to represent the CompactID structure.
 
Base class of data element
Specifies a data element hash stream object
 
 
The enumeration of the data element type
 
 
Data Node Object data
Data Size Object
The format of a directory listing entry is as follows:
BYTE: length of name
BYTEs: name (UTF-8 encoded)
ENCINT: content section
ENCINT: offset
ENCINT: length
The offset is from the beginning of the content section the file is in, after the section has been decompressed (if appropriate).
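Reading one listing entry in the field order given above might look like the following sketch. The class and field names are hypothetical, and the ENCINT layout (seven bits per byte, most significant group first, high bit marking continuation) is an assumption taken from commonly circulated CHM format notes rather than from this description.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical value object for one directory listing entry. */
final class ListingEntrySketch {
    final String name;
    final long section;
    final long offset;
    final long length;

    private ListingEntrySketch(String name, long section, long offset, long length) {
        this.name = name;
        this.section = section;
        this.offset = offset;
        this.length = length;
    }

    /** Reads one entry at the buffer's current position and advances past it. */
    static ListingEntrySketch read(ByteBuffer buf) {
        int nameLength = buf.get() & 0xFF;          // BYTE: length of name
        byte[] nameBytes = new byte[nameLength];
        buf.get(nameBytes);                         // BYTEs: name (UTF-8 encoded)
        String name = new String(nameBytes, StandardCharsets.UTF_8);
        long section = readEncInt(buf);             // ENCINT: content section
        long offset = readEncInt(buf);              // ENCINT: offset
        long length = readEncInt(buf);              // ENCINT: length
        return new ListingEntrySketch(name, section, offset, length);
    }

    /** Assumed ENCINT layout: 7 bits per byte, high bit set on all but the last byte. */
    private static long readEncInt(ByteBuffer buf) {
        long value = 0;
        for (int i = 0; i < 9; i++) {
            int b = buf.get() & 0xFF;
            value = (value << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IllegalArgumentException("ENCINT longer than 9 bytes");
    }
}
```

Because `read` advances the buffer, calling it in a loop walks consecutive entries until the quickref/free area is reached; detecting that boundary is left out of this sketch.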
This class is used to represent a property that contains 8 bytes of data in the PropertySet.rgData stream field.
 
This class records metadata about embedded parts that exist in the xml of the main document.
Extracts files embedded in EMF and offers a very rough capability to extract text if there is text stored in the EMF.
 
Excel parser implementation which uses POI's Event API to handle the contents of a Workbook.
 
 
 
 
 
This class is used to represent a property that contains 4 bytes of data in the PropertySet.rgData stream field.
 
 
 
 
 
 
FSSHTTPB Serialize interface.
 
The class is used to build a root node object.
The interface of the property in OneNote file.
Parser that handles Microsoft Access files via Jackcess
This class is used to represent a JCID
This class is used to represent the JCID object.
 
The class is used to build an intermediate node object.
This is an optional PST parser that relies on the user installing the GPL-3 libpst/readpst command-line tool and configuring Tika to call it via tika-config.xml
 
Linked cell.
Contains the information for a single list in the list or list override tables.
Computes the number text which goes at the beginning of each list paragraph
Implement a converter which converts to/from little-endian byte arrays
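A minimal sketch of such a conversion for 32-bit integers follows. The class and method names are hypothetical and do not reflect the documented class's actual API.

```java
/** Hypothetical sketch of converting an int to and from little-endian bytes. */
final class LittleEndianSketch {
    private LittleEndianSketch() {
    }

    /** Encodes value with the least significant byte first. */
    static byte[] toBytes(int value) {
        return new byte[] {
            (byte) value,
            (byte) (value >>> 8),
            (byte) (value >>> 16),
            (byte) (value >>> 24)
        };
    }

    /** Decodes four little-endian bytes starting at offset. */
    static int toInt(byte[] bytes, int offset) {
        return (bytes[offset] & 0xFF)
                | ((bytes[offset + 1] & 0xFF) << 8)
                | ((bytes[offset + 2] & 0xFF) << 16)
                | ((bytes[offset + 3] & 0xFF) << 24);
    }
}
```

Masking each byte with `0xFF` before shifting avoids sign extension, which would otherwise corrupt values whose bytes have the high bit set.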
OOXML metadata extractor.
 
 
 
Parser for temporary MS Office files.
This class is used to represent a property that contains no data.
 
Number cell.
The ObjectGroupData class.
 
The internal class for building a list of DataElement objects from a node object.
Object Group Declarations
Specifies an object group metadata
Object Metadata Declaration
Object data BLOB declaration
 
Object data BLOB reference
 
This class is used to represent an ObjectSpaceObjectPropSet.
 
 
This class is used to represent an ObjectSpaceObjectStreamOfContextIDs.
This class is used to represent an ObjectSpaceObjectStreamOfOIDs.
This class is used to represent an ObjectSpaceObjectStreamOfOSIDs.
Defines a Microsoft document content extractor.
 
 
A POI-powered Tika Parser for very old versions of Excel, from pre-OLE2 days, such as Excel 4.
This class is used to represent a property that contains 1 byte of data in the PropertySet.rgData stream field.
OneNote Tika parser capable of parsing Microsoft OneNote files.
 
Options when walking the OneNote tree.
Interface implemented by all Tika OOXML extractors.
Figures out the correct OOXMLExtractor for the supplied document and returns it.
Office Open XML (OOXML) parser.
 
This class is intended to handle anything that might contain IBodyElements: main document, headers, footers, notes, slides, etc.
 
 
 
This is a wrapper around OPCPackage that calls revert() instead of close().
Outlook Message Parser.
 
Parser for MS Outlook PST email storage files
 
A detector that works on a POIFS OLE2 document to figure out exactly what the file is.
 
This class is used to represent a PropertyID.
This class is used to represent a PropertySet.
This class is used to represent the property set.
 
The class is used to represent the prtArrayOfPropertyValues.
This class is used to represent the prtFourBytesOfLengthFollowedByData.
 
This class is used to process RDC analysis chunking
The enumeration of request type.
 
 
Specifies revision manifest object group references, each followed by object group extended GUIDs
Specifies a revision manifest root declare, each followed by root and object extended GUIDs
The class is used to represent the revision store object.
 
RTF parser
WARNING: This class is mutable.
 
 
Signature Object
 
Parses SpreadsheetML 2003 format Excel files.
Specifies the storage index cell mappings (with cell identifier, cell mapping extended GUID, and cell mapping serial number)
 
 
Specifies the storage index revision mappings (with revision and revision mapping extended GUIDs, and revision mapping serial number)
 
Specifies one or more storage manifest root declare.
Specifies a storage manifest schema GUID
 
 
A 16-bit header for a compound object would indicate the end of a stream object
An 8-bit header for a compound object would indicate the end of a stream object
This class specifies the base class for 16-bit or 32-bit stream object header start
A 16-bit header for a compound object would indicate the start of a stream object
A 32-bit header for a compound object would indicate the start of a stream object
 
 
The enumeration of the stream object type header start
Extractor for Common OLE2 (HPSF) metadata
SAX/Streaming pptx extractor
This is an experimental, alternative extractor for docx files.
Text cell.
Overrides Excel's General format to include more significant digits than the MS Spec allows.
A Format that allows up to 15 significant digits for integers.
A POI-powered Tika Parser for TNEF (Transport Neutral Encoding Format) messages, aka winmail.dat
This class is used to represent a property that contains 2 bytes of data in the PropertySet.rgData stream field.
The unsigned byte type
The unsigned int type
The unsigned long type
 
A utility class for static access to unsigned number functionality.
A base type for unsigned numbers.
The unsigned short type
 
This parser offers a very rough capability to extract text if there is text stored in the WMF files.
 
 
 
Parses WordML 2003 format Word files.
 
Currently, mostly a pass-through class to hold pkg and properties and keep the general framework similar to our other POI-integrated extractors.
 
 
 
 
 
Turns formatted sheet events into HTML
Captures information on interesting tags, whilst delegating the main work to the formatting handler
Experimental class that is based on POI's XSSFEventBasedExcelExtractor
 
Stub class of POI's XWPFNumbering because onDocumentRead() is protected
For Tika, all we need (so far) is a mapping between styleId and a style's name.
 
This class is used to process zip file chunking