Class EMFParser

  • All Implemented Interfaces:
    Serializable, org.apache.tika.parser.Parser

    public class EMFParser
    extends Object
    implements org.apache.tika.parser.Parser
    Extracts files embedded in an EMF and offers a very rough capability to extract text when the EMF stores any.

    To improve text extraction, we'd have to implement quite a bit more at the POI level. We'd want to track changes in font and use that information for identifying character sets, inserting spaces and new lines.

    We're also relying on storage order for text order, which is unreliable because records need not be stored in reading order. We'd have to do something like what PDFBox or XPS do: sort the runs by position and then reassemble the text from the sorted fragments.
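
    The run-sorting idea mentioned above can be illustrated with a minimal sketch. It is not part of EMFParser or POI. The RunSorter class, the TextRun type, and the lineTolerance parameter are hypothetical names introduced for illustration, and the sketch assumes each run carries an (x, y) origin with y increasing downward. Runs are sorted top to bottom, grouped into lines when their y values are close, and each line is then ordered left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RunSorter {

    /** Hypothetical text run: a string plus the position of its origin. */
    public static final class TextRun {
        final double x;
        final double y;
        final String text;

        public TextRun(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Rebuilds reading order from runs given in arbitrary (storage) order.
     * Assumes y grows downward. A run joins the current line when its y is
     * within lineTolerance of the first run on that line.
     */
    public static String toReadingOrder(List<TextRun> runs, double lineTolerance) {
        // Copy so the caller's list is left untouched, then sort top to bottom.
        List<TextRun> sorted = new ArrayList<>(runs);
        sorted.sort(Comparator.comparingDouble((TextRun r) -> r.y));

        StringBuilder out = new StringBuilder();
        List<TextRun> line = new ArrayList<>();
        for (TextRun run : sorted) {
            boolean startsNewLine = !line.isEmpty()
                    && Math.abs(run.y - line.get(0).y) > lineTolerance;
            if (startsNewLine) {
                appendLine(out, line);
                line.clear();
            }
            line.add(run);
        }
        if (!line.isEmpty()) {
            appendLine(out, line);
        }
        return out.toString();
    }

    /** Orders one line left to right and appends it, separating runs with spaces. */
    private static void appendLine(StringBuilder out, List<TextRun> line) {
        line.sort(Comparator.comparingDouble((TextRun r) -> r.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        for (int i = 0; i < line.size(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(line.get(i).text);
        }
    }
}
```

    A real implementation would also need font metrics to decide whether adjacent runs should be separated by a space at all. The sketch always inserts one, which is a simplification.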

    See Also:
    Serialized Form
    • Field Detail

      • EMF_ICON_ONLY

        public static org.apache.tika.metadata.Property EMF_ICON_ONLY
      • EMF_ICON_STRING

        public static org.apache.tika.metadata.Property EMF_ICON_STRING
    • Constructor Detail

      • EMFParser

        public EMFParser()
    • Method Detail

      • getSupportedTypes

        public Set<org.apache.tika.mime.MediaType> getSupportedTypes(org.apache.tika.parser.ParseContext context)
        Specified by:
        getSupportedTypes in interface org.apache.tika.parser.Parser
      • parse

        public void parse(InputStream stream,
                          ContentHandler handler,
                          org.apache.tika.metadata.Metadata metadata,
                          org.apache.tika.parser.ParseContext context)
                   throws IOException,
                          SAXException,
                          org.apache.tika.exception.TikaException
        Specified by:
        parse in interface org.apache.tika.parser.Parser
        Throws:
        IOException
        SAXException
        org.apache.tika.exception.TikaException