T - Type of the objects that represent the records of the XML file. The
PCollection generated by this source will be of this type.
Permission requirements depend on the PipelineRunner that is
used to execute the Dataflow job. Please refer to the documentation of corresponding
PipelineRunners for more details.
public class XmlSource<T> extends FileBasedSource<T>
This source reads one or more XML files and creates a PCollection of a given type. A Dataflow read transform can be
created by passing an XmlSource object to Read.from(). Please note the
example given below.
The XML file must be of the following form where root and record are XML element names that are defined by the user. Root is the name of the root element of the XML document.
<root>
<record> ... </record>
<record> ... </record>
<record> ... </record>
...
<record> ... </record>
</root>
In other words, the XML document should contain a set of record elements, where each record may contain
arbitrary XML content. Root and/or record elements may additionally contain an arbitrary number
of XML attributes. Users must provide the names of the root element and the record
element when creating the source. Additionally, users must provide the class of a JAXB annotated
Java type that can be used to convert records into Java objects and vice versa using JAXB
marshalling/unmarshalling mechanisms. Reading the source will generate a PCollection of
the given JAXB annotated Java type. Optionally, users may provide a minimum size for the bundles that
should be created for the source. An example Dataflow read transformation that uses XmlSource is
given below.
// Record is the JAXB annotated class, so both the source and the PCollection are typed as Record.
XmlSource<Record> source = XmlSource.<Record>from(file.toPath().toString())
    .withRootElement("root").withRecordElement("record")
    .withRecordClass(Record.class).withMinBundleSize(128);
PCollection<Record> output = p.apply(Read.from(source));
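To make the expected document shape concrete, the following is a minimal sketch that walks a root/record document with the StAX parser bundled in the JDK. It is not how XmlSource is implemented: the class name RecordScanSketch and the method readRecords are hypothetical, it returns plain text instead of JAXB objects, and it assumes each record holds only text (no nested elements).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative only: hypothetical helper, not part of the Dataflow SDK. */
final class RecordScanSketch {

  /** Returns the text of every recordElement that is a direct child of rootElement. */
  static List<String> readRecords(String xml, String rootElement, String recordElement) {
    List<String> records = new ArrayList<>();
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      int depth = 0;
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          String name = reader.getLocalName();
          if (depth == 1 && !name.equals(rootElement)) {
            throw new IllegalArgumentException("Unexpected root element: " + name);
          }
          if (depth == 2 && name.equals(recordElement)) {
            // getElementText() reads up to the record's end tag, so the loop
            // never sees that END_ELEMENT; adjust the depth by hand.
            records.add(reader.getElementText());
            depth--;
          }
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          depth--;
        }
      }
      reader.close();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Malformed XML", e);
    }
    return records;
  }

  public static void main(String[] args) {
    String xml = "<root><record>a</record><record id=\"2\">b</record></root>";
    System.out.println(readRecords(xml, "root", "record"));
  }
}
```

Attributes on the root or record elements do not affect the scan, which matches the statement that either element may carry an arbitrary number of attributes.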
Currently only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
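The sketch below illustrates one plausible reason for this limitation. It assumes, and this is an assumption not stated in this documentation, that split boundaries are byte offsets while positions inside the parsed text are counted in characters. With single-byte characters the two agree; with multi-byte characters they drift apart. The class name OffsetMismatchSketch and the method utf8ByteOffset are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: hypothetical helper, not part of the Dataflow SDK. */
final class OffsetMismatchSketch {

  /** Byte offset of the character at charIndex when text is encoded as UTF-8. */
  static int utf8ByteOffset(String text, int charIndex) {
    return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    String ascii = "<record>abc</record>";
    // \u00e4 is a single character that occupies two bytes in UTF-8.
    String accented = "<record>\u00e4bc</record>";

    int asciiChar = ascii.indexOf("</record>");
    int accentedChar = accented.indexOf("</record>");

    // The character positions are identical, but the byte positions are not.
    System.out.println(asciiChar + " -> " + utf8ByteOffset(ascii, asciiChar));
    System.out.println(accentedChar + " -> " + utf8ByteOffset(accented, accentedChar));
  }
}
```

If a character position were used where a byte offset is expected, a record boundary could be placed too early or too late, which is consistent with the data loss or duplication described above.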
To use XmlSource, explicitly declare dependencies on the following two jars from the Woodstox
StAX XML parser.
(1) stax2-api-3.1.1.jar
(2) woodstox-core-asl-4.1.2.jar
These dependencies are declared as optional in the Maven sdk/pom.xml file of Google Cloud
Dataflow.
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>

| Modifier and Type | Method and Description |
|---|---|
| FileBasedSource<T> | createForSubrangeOfFile(String fileName, long start, long end) Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. |
| FileBasedSource.FileBasedReader<T> | createSingleFileReader(PipelineOptions options) Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. |
| static <T> XmlSource<T> | from(String fileOrPatternSpec) Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. |
| Coder<T> | getDefaultOutputCoder() Returns the default Coder to use for the data read from this source. |
| Class<T> | getRecordClass() |
| String | getRecordElement() |
| String | getRootElement() |
| boolean | producesSortedKeys(PipelineOptions options) Whether this source is known to produce key/value pairs with the (encoded) keys in lexicographically sorted order. |
| void | validate() Checks that this source is valid, before it can be used in a pipeline. |
| XmlSource<T> | withMinBundleSize(long minBundleSize) Sets a parameter minBundleSize for the minimum bundle size of the source. |
| XmlSource<T> | withRecordClass(Class<T> recordClass) Sets a JAXB annotated class that can be populated using a record of the provided XML file. |
| XmlSource<T> | withRecordElement(String recordElement) Sets the name of the record element of the XML document. |
| XmlSource<T> | withRootElement(String rootElement) Sets the name of the root element of the XML document. |
createReader, createSourceForSubrange, getEstimatedSizeBytes, getFileOrPatternSpec, getMaxEndOffset, getMode, isSplittable, splitIntoBundles, toString
getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset

public static <T> XmlSource<T> from(String fileOrPatternSpec)
Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern.

public XmlSource<T> withRootElement(String rootElement)
Sets the name of the root element of the XML document.

public XmlSource<T> withRecordElement(String recordElement)
Sets the name of the record element of the XML document.

public XmlSource<T> withRecordClass(Class<T> recordClass)
Sets a JAXB annotated class that can be populated using a record of the provided XML file.

public XmlSource<T> withMinBundleSize(long minBundleSize)
Sets a parameter minBundleSize for the minimum bundle size of the source. Please refer
to OffsetBasedSource for the definition of minBundleSize. This is an optional
parameter.

public FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
Description copied from class: FileBasedSource
Creates and returns a new FileBasedSource of the same type as the current
FileBasedSource backed by a given file and an offset range. When the current source is
being split, this method is used to generate new sub-sources. When creating the source,
subclasses must call the constructor FileBasedSource.FileBasedSource(String, long, long, long) of
FileBasedSource with the corresponding parameter values passed here.
createForSubrangeOfFile in class FileBasedSource<T>
fileName - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).

public FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
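Description copied from class: FileBasedSource
Creates and returns an instance of a FileBasedReader implementation for the current
source assuming the source represents a single file. File patterns will be handled by
the FileBasedSource implementation automatically.
createSingleFileReader in class FileBasedSource<T>

public boolean producesSortedKeys(PipelineOptions options) throws Exception
Description copied from class: BoundedSource
Whether this source is known to produce key/value pairs with the (encoded) keys in
lexicographically sorted order.
producesSortedKeys in class BoundedSource<T>
Throws: Exception

public void validate()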
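Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline.
It is recommended to use Preconditions for implementing
this method.
validate in class FileBasedSource<T>

public Coder<T> getDefaultOutputCoder()
Description copied from class: Source
Returns the default Coder to use for the data read from this source.
getDefaultOutputCoder in class Source<T>

public String getRootElement()

public String getRecordElement()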
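As a rough illustration of how minBundleSize and createForSubrangeOfFile relate when a source is split, the sketch below divides a byte range into consecutive sub-ranges and avoids leaving a tail smaller than the minimum. This is not the SDK's splitting algorithm: the class SubrangeSketch, the method split, and the desiredSize parameter are hypothetical, and the sketch assumes the end offset is already known (not Long.MAX_VALUE).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of the Dataflow SDK. */
final class SubrangeSketch {

  /**
   * Splits [start, end) into consecutive {start, end} pairs of roughly desiredSize bytes.
   * No pair is smaller than minBundleSize unless the whole range is.
   */
  static List<long[]> split(long start, long end, long desiredSize, long minBundleSize) {
    // Never use a step smaller than the minimum bundle size (or smaller than 1).
    long size = Math.max(desiredSize, Math.max(minBundleSize, 1));
    List<long[]> ranges = new ArrayList<>();
    long s = start;
    while (s < end) {
      long e = Math.min(s + size, end);
      // Fold a too-small remainder into the current range instead of emitting it alone.
      if (end - e < minBundleSize) {
        e = end;
      }
      ranges.add(new long[] {s, e});
      s = e;
    }
    return ranges;
  }

  public static void main(String[] args) {
    for (long[] r : split(0, 1000, 300, 128)) {
      System.out.println(r[0] + " to " + r[1]);
    }
  }
}
```

Each resulting pair corresponds to the start and end arguments that a call to createForSubrangeOfFile(fileName, start, end) would receive for one sub-source.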