Partial DocumentParser implementation, leaving only one of the parse methods abstract.
Default eu.cdevreeze.yaidom.parse.ElemProducingSaxHandler implementation.
This is a trait instead of a class, so it is easy to mix in EntityResolvers, ErrorHandlers, etc.
eu.cdevreeze.yaidom.simple.Document parser. This trait is purely abstract.
Implementing classes deal with the details of parsing XML strings/streams into yaidom Documents.
The eu.cdevreeze.yaidom.simple package itself is agnostic of those details.
Typical implementations use DOM, StAX or SAX, but make them easier to use in the tradition of the "template" classes of the Spring framework. That is, resource management is done as much as possible by the DocumentParser, typical usage is easy, and complex scenarios are still possible. The idea is that the parser is configured once, and that it should be re-usable multiple times.
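The "template" idea can be illustrated without yaidom, using only JAXP. The following is a minimal sketch, not yaidom's implementation: the object `TemplateStyleParsing` and the helper `parseAndClose` are hypothetical names. The factory is configured once, and the helper takes care of closing the stream, so callers need no cleanup code.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import javax.xml.parsers.DocumentBuilderFactory

object TemplateStyleParsing {
  // Configured once and re-used. Intended for single-threaded use,
  // because a DocumentBuilderFactory is not guaranteed to be thread-safe.
  private val dbf: DocumentBuilderFactory = {
    val result = DocumentBuilderFactory.newInstance()
    result.setNamespaceAware(true)
    result
  }

  // Hypothetical helper: parses the stream into a DOM Document and always closes the stream.
  def parseAndClose(is: InputStream): org.w3c.dom.Document = {
    try {
      dbf.newDocumentBuilder().parse(is)
    } finally {
      is.close()
    }
  }
}
```

Typical usage is then a single call, such as `TemplateStyleParsing.parseAndClose(new ByteArrayInputStream(bytes))`, where `bytes` stands for the XML content.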
One of the parse methods takes an InputStream instead of a Source object, because that works better with a DOM implementation.
Although DocumentParser instances should be re-usable multiple times, implementing classes are encouraged to indicate to what extent re-use of a parser instance is indeed supported (single-threaded, or even multi-threaded).
DOM-based Document parser.
Typical non-trivial creation is as follows, assuming class MyEntityResolver, which extends EntityResolver, and class MyErrorHandler, which extends ErrorHandler:
    val dbf = DocumentBuilderFactory.newInstance()
    dbf.setNamespaceAware(true)
    def createDocumentBuilder(dbf: DocumentBuilderFactory): DocumentBuilder = {
      val db = dbf.newDocumentBuilder()
      db.setEntityResolver(new MyEntityResolver)
      db.setErrorHandler(new MyErrorHandler)
      db
    }
    val docParser = DocumentParserUsingDom.newInstance(dbf, createDocumentBuilder _)
If we want the DocumentBuilderFactory to be a validating one, using an XML Schema, we could obtain the DocumentBuilderFactory as follows:
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schemaSource = new StreamSource(new File(pathToSchema))
    val schema = schemaFactory.newSchema(schemaSource)
    val dbf = {
      val result = DocumentBuilderFactory.newInstance()
      result.setNamespaceAware(true)
      result.setSchema(schema)
      result
    }
A custom EntityResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:
    class MyEntityResolver extends EntityResolver {
      override def resolveEntity(publicId: String, systemId: String): InputSource = {
        // This dirty hack may not work on IBM JVMs
        new InputSource(new java.io.StringReader(""))
      }
    }
For completeness, a custom ErrorHandler class that simply prints parse exceptions to standard output:
    class MyErrorHandler extends ErrorHandler {
      def warning(exc: SAXParseException): Unit = { println(exc) }
      def error(exc: SAXParseException): Unit = { println(exc) }
      def fatalError(exc: SAXParseException): Unit = { println(exc) }
    }
If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingDom, but adapts the parse method. This would make it possible to adapt the conversion from a DOM Document to yaidom Document, for example.
A DocumentParserUsingDom instance can be re-used multiple times, from the same thread. If the DocumentBuilderFactory is thread-safe, it can even be re-used from multiple threads. Typically a DocumentBuilderFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one DocumentBuilderFactory instance per request.
DOM-LS-based Document parser.
Typical non-trivial creation is as follows, assuming class MyEntityResolver, which extends LSResourceResolver, and class MyErrorHandler, which extends DOMErrorHandler:
    def createParser(domImplLS: DOMImplementationLS): LSParser = {
      val parser = domImplLS.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null)
      parser.getDomConfig.setParameter("resource-resolver", new MyEntityResolver)
      parser.getDomConfig.setParameter("error-handler", new MyErrorHandler)
      parser
    }
    val domParser = DocumentParserUsingDomLS.newInstance().withParserCreator(createParser _)
A custom LSResourceResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:
    class MyEntityResolver extends LSResourceResolver {
      override def resolveResource(tpe: String, namespaceURI: String, publicId: String, systemId: String, baseURI: String): LSInput = {
        // domImplLS is assumed to be a DOMImplementationLS in scope
        val input = domImplLS.createLSInput()
        // This dirty hack may not work on IBM JVMs
        input.setCharacterStream(new java.io.StringReader(""))
        input
      }
    }
For completeness, a custom DOMErrorHandler class that simply throws an exception:
    class MyErrorHandler extends DOMErrorHandler {
      override def handleError(exc: DOMError): Boolean = {
        sys.error(exc.toString)
      }
    }
If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingDomLS, but adapts the parse method. This would make it possible to set an encoding on the LSInput, for example. As another example, this would allow for adapting the conversion from a DOM Document to yaidom Document.
A DocumentParserUsingDomLS instance can be re-used multiple times, from the same thread. If the DOMImplementationLS is thread-safe, it can even be re-used from multiple threads. Typically a DOMImplementationLS cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one DOMImplementationLS instance per request.
SAX-based Document parser.
Typical non-trivial creation is as follows, assuming a trait MyEntityResolver, which extends EntityResolver, and a trait MyErrorHandler, which extends ErrorHandler:
    val spf = SAXParserFactory.newInstance().makeNamespaceAndPrefixAware
    val parser = DocumentParserUsingSax.newInstance(
      spf,
      () => new DefaultElemProducingSaxHandler with MyEntityResolver with MyErrorHandler
    )
If we want the SAXParserFactory to be a validating one, using an XML Schema, we could obtain the SAXParserFactory as follows:
    val schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    val schemaSource = new StreamSource(new File(pathToSchema))
    val schema = schemaFactory.newSchema(schemaSource)
    val spf = {
      val result = SAXParserFactory.newInstance().makeNamespaceAndPrefixAware
      result.setSchema(schema)
      result
    }
A custom EntityResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (see http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:
    trait MyEntityResolver extends EntityResolver {
      override def resolveEntity(publicId: String, systemId: String): InputSource = {
        // This dirty hack may not work on IBM JVMs
        new InputSource(new java.io.StringReader(""))
      }
    }
For completeness, a custom ErrorHandler trait that simply prints parse exceptions to standard output:
    trait MyErrorHandler extends ErrorHandler {
      override def warning(exc: SAXParseException): Unit = { println(exc) }
      override def error(exc: SAXParseException): Unit = { println(exc) }
      override def fatalError(exc: SAXParseException): Unit = { println(exc) }
    }
It is even possible to parse HTML (including very poor HTML) into well-formed Documents by using a SAXParserFactory from the TagSoup library. For example:
    val parser = DocumentParserUsingSax.newInstance(new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl)
If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingSax, but adapts the parse method. This would make it possible to set additional properties on the XMLReader, for example.
As can be seen above, parsing is based on the JAXP SAXParserFactory instead of the SAX 2.0 XMLReaderFactory.
A DocumentParserUsingSax instance can be re-used multiple times, from the same thread. If the SAXParserFactory is thread-safe, it can even be re-used from multiple threads. Typically a SAXParserFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one SAXParserFactory instance per request.
StAX-based Document parser.
Typical non-trivial creation is as follows, assuming a class MyXmlResolver, which extends XMLResolver, and a class MyXmlReporter, which extends XMLReporter:
    val xmlInputFactory = XMLInputFactory.newFactory()
    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    xmlInputFactory.setXMLResolver(new MyXmlResolver)
    xmlInputFactory.setXMLReporter(new MyXmlReporter)
    val docParser = DocumentParserUsingStax.newInstance(xmlInputFactory)
A custom XMLResolver could be used to retrieve DTDs locally, or even to suppress DTD resolution. The latter can be coded as follows (compare with http://stuartsierra.com/2008/05/08/stop-your-java-sax-parser-from-downloading-dtds), risking some loss of information:
    class MyXmlResolver extends XMLResolver {
      override def resolveEntity(publicId: String, systemId: String, baseUri: String, namespace: String): Any = {
        // This dirty hack may not work on IBM JVMs
        new java.io.StringReader("")
      }
    }
A trivial XMLReporter could look like this:
    class MyXmlReporter extends XMLReporter {
      override def report(message: String, errorType: String, relatedInformation: AnyRef, location: Location): Unit = {
        println("Location: %s. Error type: %s. Message: %s.".format(location, errorType, message))
      }
    }
If more flexibility is needed in configuring the DocumentParser than offered by this class, consider writing a wrapper DocumentParser which wraps a DocumentParserUsingStax, but adapts the parse method. This would make it possible to adapt the conversion from StAX events to yaidom Document, for example.
A DocumentParserUsingStax instance can be re-used multiple times, from the same thread. If the XMLInputFactory is thread-safe, it can even be re-used from multiple threads. Typically an XMLInputFactory cannot be trusted to be thread-safe, however. In a web application, one (safe) way to deal with that is to use one XMLInputFactory instance per request.
Contract of a SAX ContentHandler that, once ready, can be asked for the resulting eu.cdevreeze.yaidom.simple.Elem using method resultingElem, or the resulting eu.cdevreeze.yaidom.simple.Document using method resultingDocument.
Mixin extending DefaultHandler that contains a Locator. Typically this Locator is used by an ErrorHandler mixed in after this trait. It is also used by DefaultElemProducingSaxHandler, for example to get the XML declaration.
Thread-local DocumentParser. This class exists because typical JAXP factory objects (DocumentBuilderFactory etc.) are not thread-safe, but still expensive to create. Using this DocumentParser facade backed by a thread-local DocumentParser, we can create a ThreadLocalDocumentParser once, and re-use it all the time without having to worry about thread-safety issues.
Note that each ThreadLocalDocumentParser instance (!) has its own thread-local document parser. Typically it makes no sense to have more than one ThreadLocalDocumentParser instance in one application. In a Spring application, for example, a single instance of a ThreadLocalDocumentParser can be configured.
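The thread-local idea can be sketched with plain JAXP as follows. This is an illustration under stated assumptions, not the implementation of ThreadLocalDocumentParser: the class `ThreadLocalDomParsing` is a hypothetical name, and it keeps one DocumentBuilder per thread rather than one yaidom DocumentParser.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.{DocumentBuilder, DocumentBuilderFactory}

// Hypothetical class: each thread lazily creates its own factory and DocumentBuilder,
// so no JAXP object is ever shared between threads.
final class ThreadLocalDomParsing {

  private val threadLocalBuilder: ThreadLocal[DocumentBuilder] =
    new ThreadLocal[DocumentBuilder] {
      override def initialValue(): DocumentBuilder = {
        val dbf = DocumentBuilderFactory.newInstance()
        dbf.setNamespaceAware(true)
        dbf.newDocumentBuilder()
      }
    }

  // Parses with the DocumentBuilder that belongs to the calling thread.
  def parse(xml: String): org.w3c.dom.Document = {
    val db = threadLocalBuilder.get()
    db.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
  }
}
```

As with ThreadLocalDocumentParser, each `ThreadLocalDomParsing` instance has its own thread-local state, so a single shared instance per application is the natural way to use it.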
Support for parsing XML into yaidom Documents and Elems. This package offers the eu.cdevreeze.yaidom.parse.DocumentParser trait, as well as several implementations. Those implementations use JAXP (SAX, DOM or StAX), and most of them use the convert package to convert JAXP artifacts to yaidom Documents.
For example:
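A minimal sketch of such usage, assuming that yaidom is on the classpath and that `DocumentParserUsingSax` offers a no-argument `newInstance()` factory method (an assumption here; only the overloads taking arguments appear elsewhere in this documentation). The object `ParseExample` and the helper `parseString` are hypothetical names.

```scala
import java.io.ByteArrayInputStream
import eu.cdevreeze.yaidom.parse.DocumentParserUsingSax
import eu.cdevreeze.yaidom.simple.Document

object ParseExample {
  // Assumption: the no-argument newInstance() yields a parser with default configuration.
  val docParser = DocumentParserUsingSax.newInstance()

  // Hypothetical helper, using the parse method that takes an InputStream.
  def parseString(xml: String): Document = {
    docParser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
  }
}
```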
This example chose a SAX-based implementation, and used the default configuration of that document parser.
Having several different fully configurable JAXP-based implementations shows that yaidom is pessimistic about the transparency of parsing and printing XML. It also shows that yaidom is optimistic about the available (heap) memory and processing power, because of the two separate steps of JAXP parsing/printing and (in-memory) convert conversions. Using JAXP means that escaping of characters is something that JAXP deals with, and that's definitely better than trying to do it yourself.
One DocumentParser implementation does not use any convert conversion. That is DocumentParserUsingSax. It is likely the fastest of the DocumentParser implementations.
The preferred DocumentParser for XML (not HTML) parsing is DocumentParserUsingDomLS, if memory usage is not an issue. This DocumentParser implementation is best integrated with DOM, and is highly configurable, although DOM LS configuration is somewhat involved.
This package depends on the eu.cdevreeze.yaidom.core, eu.cdevreeze.yaidom.queryapi, eu.cdevreeze.yaidom.simple and eu.cdevreeze.yaidom.convert packages, and not the other way around.