T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.
public class XmlSource<T> extends FileBasedSource<T>
A source that reads XML files and creates a PCollection of a given type. A Dataflow read transform can be created by passing an XmlSource object to Read.from(). Please note the example given below.
The XML file must be of the following form, where root and record are XML element names that are defined by the user:
<root>
<record> ... </record>
<record> ... </record>
<record> ... </record>
...
<record> ... </record>
</root>
The XML document should contain a single root element whose inner list consists entirely of record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction enables reading from large XML files in parallel from different offsets in the file. Root and record elements may additionally contain an arbitrary number of XML attributes.
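The layout rules above can be checked on a small document before running a pipeline. The following is a minimal sketch using the JDK's StAX parser; the class `RecordLayoutCheck` and its method `readRecords` are hypothetical names introduced for illustration and are not part of the Dataflow SDK, and the check is stricter than a plain tag search because it parses the whole document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration; it is not part of the Dataflow SDK.
final class RecordLayoutCheck {

  // Returns the text of each record element found directly under the root element.
  // Throws IllegalArgumentException if the document does not follow the expected layout.
  static List<String> readRecords(String xml, String rootElement, String recordElement) {
    List<String> records = new ArrayList<>();
    try {
      XMLStreamReader reader =
          XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
      int depth = 0;
      StringBuilder text = new StringBuilder();
      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          depth++;
          String name = reader.getLocalName();
          if (depth == 1 && !name.equals(rootElement)) {
            throw new IllegalArgumentException("Unexpected root element: " + name);
          }
          if (depth == 2) {
            if (!name.equals(recordElement)) {
              throw new IllegalArgumentException("Unexpected child of root: " + name);
            }
            text.setLength(0); // start collecting text for a new record
          }
          if (depth > 2 && name.equals(recordElement)) {
            throw new IllegalArgumentException(
                "Record content must not contain <" + recordElement + ">");
          }
        } else if (event == XMLStreamConstants.CHARACTERS && depth >= 2) {
          text.append(reader.getText());
        } else if (event == XMLStreamConstants.END_ELEMENT) {
          if (depth == 2) {
            records.add(text.toString().trim());
          }
          depth--;
        }
      }
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Malformed XML", e);
    }
    return records;
  }
}
```

Calling `RecordLayoutCheck.readRecords(xml, "root", "record")` on a document with attributes on the root or record elements still succeeds, since attributes are not inspected.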
Additionally, users must provide the class of a JAXB-annotated Java type that can be used to convert records into Java objects and vice versa using JAXB marshalling/unmarshalling mechanisms. Reading the source will generate a PCollection of the given JAXB-annotated Java type.
Optionally users may provide a minimum size of a bundle that should be created for the source.
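To give a feel for what a minimum bundle size means, here is a simplified sketch that splits an offset range into sub-ranges while respecting a minimum size. It is an illustration only and is not the SDK's splitting algorithm; the class `BundleSplitSketch`, the method `split`, and the parameter `desiredSize` are hypothetical names, and the sketch assumes offsets small enough that `s + size` does not overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the SDK's actual splitting logic may differ.
final class BundleSplitSketch {

  // Splits [start, end) into consecutive [from, to) ranges of roughly desiredSize offsets.
  // No range is smaller than minBundleSize unless the whole input range is.
  static List<long[]> split(long start, long end, long desiredSize, long minBundleSize) {
    long size = Math.max(Math.max(1, desiredSize), minBundleSize);
    List<long[]> ranges = new ArrayList<>();
    long s = start;
    while (s < end) {
      long e = Math.min(s + size, end);
      // Fold a tail shorter than the minimum into the current range.
      if (end - e < minBundleSize) {
        e = end;
      }
      ranges.add(new long[] {s, e});
      s = e;
    }
    return ranges;
  }
}
```

For example, `split(0, 100, 30, 20)` yields the ranges [0, 30), [30, 60) and [60, 100): the final 10 offsets are too few for a bundle of their own, so they are merged into the third range.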
The following example shows how to read from XmlSource in a Dataflow pipeline:
// Record is the user's JAXB-annotated class, so the source and the resulting PCollection are both typed on Record.
XmlSource<Record> source = XmlSource.<Record>from(file.toPath().toString())
    .withRootElement("root")
    .withRecordElement("record")
    .withRecordClass(Record.class);
PCollection<Record> output = p.apply(Read.from(source));
Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
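A quick way to screen input for this restriction is to compare character and byte counts. The sketch below assumes the file is UTF-8 encoded, in which case only ASCII characters occupy a single byte; `SingleByteCheck` and `isSingleByteUtf8` are hypothetical names, and other single-byte encodings such as ISO-8859-1 are not covered by this check.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; it is not part of the Dataflow SDK.
final class SingleByteCheck {

  // Returns true if every character of text encodes to exactly one byte in UTF-8.
  // Any non-ASCII character needs more bytes than UTF-16 code units, so the counts differ.
  static boolean isSingleByteUtf8(String text) {
    return text.getBytes(StandardCharsets.UTF_8).length == text.length();
  }
}
```

For instance, `isSingleByteUtf8("<record>abc</record>")` is true, while `isSingleByteUtf8("caf\u00e9")` is false because the accented character takes two bytes in UTF-8.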
To use XmlSource:
These dependencies have been declared as optional in the Maven sdk/pom.xml file of Google Cloud Dataflow.
Permission requirements depend on the PipelineRunner that is used to execute the Dataflow job. Please refer to the documentation of the corresponding PipelineRunners for more details.
FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
OffsetBasedSource.OffsetBasedReader<T>
BoundedSource.BoundedReader<T>
Source.Reader<T>
Modifier and Type | Method and Description |
---|---|
protected FileBasedSource<T> | createForSubrangeOfFile(String fileName, long start, long end): Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. |
protected FileBasedSource.FileBasedReader<T> | createSingleFileReader(PipelineOptions options): Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. |
static <T> XmlSource<T> | from(String fileOrPatternSpec): Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. |
Coder<T> | getDefaultOutputCoder(): Returns the default Coder to use for the data read from this source. |
Class<T> | getRecordClass() |
String | getRecordElement() |
String | getRootElement() |
void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
boolean | producesSortedKeys(PipelineOptions options): Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key. |
void | validate(): Checks that this source is valid, before it can be used in a pipeline. |
XmlSource<T> | withMinBundleSize(long minBundleSize): Sets a parameter minBundleSize for the minimum bundle size of the source. |
XmlSource<T> | withRecordClass(Class<T> recordClass): Sets a JAXB annotated class that can be populated using a record of the provided XML file. |
XmlSource<T> | withRecordElement(String recordElement): Sets name of the record element of the XML document. |
XmlSource<T> | withRootElement(String rootElement): Sets name of the root element of the XML document. |
createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getMaxEndOffset, getMode, isSplittable, splitIntoBundles, toString
allowsDynamicSplitting, getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset
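The from method listed in the summary above accepts a Java "glob" file pattern. As a rough way to preview which file names a pattern would select, the sketch below uses the JDK's own glob matcher from java.nio.file. This is an assumption made for illustration: the SDK resolves patterns through its own I/O layer, whose matching rules may differ, and `GlobPreview` and `matches` are hypothetical names.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for illustration; it is not part of the Dataflow SDK.
final class GlobPreview {

  // Returns true if fileName matches the glob pattern under the JDK's default file system rules.
  static boolean matches(String glob, String fileName) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return matcher.matches(Paths.get(fileName));
  }
}
```

With the pattern `records-*.xml`, the name `records-001.xml` matches and `records-001.json` does not.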
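public static <T> XmlSource<T> from(String fileOrPatternSpec)
Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. Each XML file should be of the form defined in XmlSource.
public XmlSource<T> withRootElement(String rootElement)
Sets the name of the root element of the XML document.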
public XmlSource<T> withRecordElement(String recordElement)
Sets the name of the record element of the XML document.
public XmlSource<T> withRecordClass(Class<T> recordClass)
Sets a JAXB-annotated class that can be populated using a record of the provided XML file.
public XmlSource<T> withMinBundleSize(long minBundleSize)
Sets the parameter minBundleSize, the minimum bundle size of the source. Please refer to OffsetBasedSource for the definition of minBundleSize. This is an optional parameter.
protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
Description copied from class: FileBasedSource
Creates and returns a new FileBasedSource of the same type as the current FileBasedSource, backed by a given file and an offset range. When the current source is being split, this method is used to generate new sub-sources. When creating the source, subclasses must call the constructor FileBasedSource.FileBasedSource(String, long, long, long) of FileBasedSource with the corresponding parameter values passed here.
createForSubrangeOfFile in class FileBasedSource<T>
fileName - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).
protected FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Description copied from class: FileBasedSource
Creates and returns an instance of a FileBasedReader implementation for the current source, assuming the source represents a single file. File patterns will be handled by the FileBasedSource implementation automatically.
createSingleFileReader in class FileBasedSource<T>
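public boolean producesSortedKeys(PipelineOptions options) throws Exception
Description copied from class: BoundedSource
Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.
producesSortedKeys in class BoundedSource<T>
Throws: Exception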
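public void validate()
Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline. It is recommended to use Preconditions for implementing this method.
validate in class FileBasedSource<T>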
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: Source
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayData
populateDisplayData in class FileBasedSource<T>
builder - The builder to populate with display data.
See also: HasDisplayData
public Coder<T> getDefaultOutputCoder()
Description copied from class: Source
Returns the default Coder to use for the data read from this source.
getDefaultOutputCoder in class Source<T>
public String getRootElement()
public String getRecordElement()