XmlSource (Google Cloud Dataflow SDK 1.7.0 API)

java.lang.Object
- com.google.cloud.dataflow.sdk.io.Source<T>
- - com.google.cloud.dataflow.sdk.io.BoundedSource<T>
  - - com.google.cloud.dataflow.sdk.io.OffsetBasedSource<T>
    - - com.google.cloud.dataflow.sdk.io.FileBasedSource<T>
      - com.google.cloud.dataflow.sdk.io.XmlSource<T>

Type Parameters:

T - Type of the objects that represent the records of the XML file. The PCollection generated by this source will be of this type.

All Implemented Interfaces:

HasDisplayData, Serializable
```
public class XmlSource<T>
extends FileBasedSource<T>
```
A source that reads one or more XML files and creates a PCollection of a given type. A Dataflow read transform can be created by passing an XmlSource object to Read.from(); see the example below.
The XML file must be of the following form, where root and record are XML element names that are defined by the user:
```
<root>
  <record> ... </record>
  <record> ... </record>
  <record> ... </record>
  ...
  <record> ... </record>
</root>
```
The XML document must contain a single root element whose children are exclusively record elements. The records may contain arbitrary XML content; however, that content must not contain the start <record> or end </record> tags. This restriction makes it possible to read large XML files in parallel from different offsets in the file.
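To illustrate why the restriction on nested record tags matters, here is a minimal sketch of locating the next record boundary from an arbitrary byte offset. It is not the SDK's reader; the class name RecordBoundaryScan and the method nextRecordStart are hypothetical, and the sketch assumes single-byte encoded input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Dataflow SDK.
final class RecordBoundaryScan {
    /**
     * Returns the byte offset of the first "<recordElement" start tag at or
     * after 'from', or -1 if none is found. Because record content never
     * contains record tags, any hit is a genuine record boundary.
     */
    static int nextRecordStart(byte[] data, int from, String recordElement) {
        byte[] tag = ("<" + recordElement).getBytes(StandardCharsets.ISO_8859_1);
        // Stop early enough that the byte following the tag name is readable.
        for (int i = Math.max(0, from); i + tag.length < data.length; i++) {
            boolean match = true;
            for (int j = 0; j < tag.length; j++) {
                if (data[i + j] != tag[j]) {
                    match = false;
                    break;
                }
            }
            if (!match) {
                continue;
            }
            // The element name must end here, so "<records" is rejected.
            byte next = data[i + tag.length];
            if (next == '>' || next == '/' || next == ' '
                    || next == '\t' || next == '\n' || next == '\r') {
                return i;
            }
        }
        return -1;
    }
}
```

A reader assigned the byte range starting at some offset could call nextRecordStart to skip the partial record it landed in and begin with the first complete one.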
Root and record elements may also carry an arbitrary number of XML attributes. Users must additionally provide a JAXB-annotated Java class that can be used to convert records into Java objects and vice versa through JAXB marshalling and unmarshalling. Reading the source generates a PCollection of that JAXB-annotated type. Optionally, users may specify a minimum size for the bundles created for the source.
The following example shows how to read from XmlSource in a Dataflow pipeline:
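```
// Assumed context: Record is a JAXB-annotated class, file is a java.io.File,
// and p is the Pipeline. The type parameter must match the record class.
XmlSource<Record> source = XmlSource.<Record>from(file.toPath().toString())
    .withRootElement("root")
    .withRecordElement("record")
    .withRecordClass(Record.class);
PCollection<Record> output = p.apply(Read.from(source));
```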
Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
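As a rough pre-flight check for this limitation, the sketch below tests whether a piece of text would encode to exactly one byte per character. It assumes UTF-8 input, and the class name SingleByteCheck and its methods are hypothetical, not part of the SDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Dataflow SDK.
final class SingleByteCheck {
    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** True if every code point in the text encodes to a single UTF-8 byte. */
    static boolean isSingleByteOnly(String text) {
        return utf8Length(text) == text.codePointCount(0, text.length());
    }
}
```

Running such a check over sample content before submitting a job can flag files that fall outside what the source supports.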
To use XmlSource:
1. Explicitly declare a dependency on org.codehaus.woodstox:stax2-api
2. Include a compatible implementation on the classpath at run-time, such as org.codehaus.woodstox:woodstox-core-asl
These dependencies are declared as optional in the Maven sdk/pom.xml file of Google Cloud Dataflow.
Permissions
Permission requirements depend on the PipelineRunner used to execute the Dataflow job. Refer to the documentation of the corresponding PipelineRunner for details.
See Also:

Serialized Form

Nested Class Summary
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
  FileBasedSource.FileBasedReader<T>, FileBasedSource.Mode
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
  OffsetBasedSource.OffsetBasedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.BoundedSource
  BoundedSource.BoundedReader<T>
- Nested classes/interfaces inherited from class com.google.cloud.dataflow.sdk.io.Source
  Source.Reader<T>

Method Summary

Modifier and Type	Method and Description
`protected FileBasedSource<T>`	`createForSubrangeOfFile(String fileName, long start, long end)` Creates and returns a new `FileBasedSource` of the same type as the current `FileBasedSource` backed by a given file and an offset range.
`protected FileBasedSource.FileBasedReader<T>`	`createSingleFileReader(PipelineOptions options)` Creates and returns an instance of a `FileBasedReader` implementation for the current source assuming the source represents a single file.
`static <T> XmlSource<T>`	`from(String fileOrPatternSpec)` Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern.
`Coder<T>`	`getDefaultOutputCoder()` Returns the default `Coder` to use for the data read from this source.
`Class<T>`	`getRecordClass()`
`String`	`getRecordElement()`
`String`	`getRootElement()`
`void`	`populateDisplayData(DisplayData.Builder builder)` Register display data for the given transform or component.
`boolean`	`producesSortedKeys(PipelineOptions options)` Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.
`void`	`validate()` Checks that this source is valid, before it can be used in a pipeline.
`XmlSource<T>`	`withMinBundleSize(long minBundleSize)` Sets a parameter `minBundleSize` for the minimum bundle size of the source.
`XmlSource<T>`	`withRecordClass(Class<T> recordClass)` Sets a JAXB annotated class that can be populated using a record of the provided XML file.
`XmlSource<T>`	`withRecordElement(String recordElement)` Sets name of the record element of the XML document.
`XmlSource<T>`	`withRootElement(String rootElement)` Sets name of the root element of the XML document.

Methods inherited from class com.google.cloud.dataflow.sdk.io.FileBasedSource
createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getMaxEndOffset, getMode, isSplittable, splitIntoBundles, toString

Methods inherited from class com.google.cloud.dataflow.sdk.io.OffsetBasedSource
allowsDynamicSplitting, getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Method Detail
  - from
```
public static <T> XmlSource<T> from(String fileOrPatternSpec)
```
    Creates an XmlSource for a single XML file or a set of XML files defined by a Java "glob" file pattern. Each XML file should be of the form defined in XmlSource.
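    For readers unfamiliar with glob syntax, the sketch below shows how a pattern selects files, using the JDK's own PathMatcher. This only illustrates glob semantics on the local file system; how the SDK expands fileOrPatternSpec (for example for non-local paths) may differ, and the class name GlobDemo is hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper, not part of the Dataflow SDK.
final class GlobDemo {
    /** Returns true if the path matches the glob under the default file system's rules. */
    static boolean matches(String glob, String path) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(path));
    }
}
```

    With JDK glob rules, a single * does not cross directory boundaries, so data/*.xml selects data/a.xml but not data/sub/a.xml.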
  - withRootElement
```
public XmlSource<T> withRootElement(String rootElement)
```
    Sets the name of the root element of the XML document. This name is used to create a valid starting root element when initiating a bundle of records created from an XML document. This is a required parameter.
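    The idea of supplying a root element for a bundle can be sketched with the JDK's StAX parser: a run of record elements taken from the middle of a file is not a well-formed document on its own, but becomes parseable once wrapped in the root element. This is an illustration under that assumption, not the SDK's reader; SyntheticRootDemo and countRecords are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the Dataflow SDK.
final class SyntheticRootDemo {
    /** Wraps a fragment of records in a synthetic root and counts the record elements. */
    static int countRecords(String fragment, String rootElement, String recordElement) {
        String wrapped = "<" + rootElement + ">" + fragment + "</" + rootElement + ">";
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(wrapped));
            int count = 0;
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && reader.getLocalName().equals(recordElement)) {
                        count++;
                    }
                }
            } finally {
                reader.close();
            }
            return count;
        } catch (XMLStreamException e) {
            // Surface parse failures as unchecked to keep the sketch simple.
            throw new IllegalStateException("Fragment is not well-formed", e);
        }
    }
}
```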
  - withRecordElement
```
public XmlSource<T> withRecordElement(String recordElement)
```
    Sets the name of the record element of the XML document. This name is used to determine the offset of the first record of a bundle created from the XML document. This is a required parameter.
  - withRecordClass
```
public XmlSource<T> withRecordClass(Class<T> recordClass)
```
    Sets a JAXB annotated class that can be populated using a record of the provided XML file. This will be used when unmarshalling record objects from the XML file. This is a required parameter.
  - withMinBundleSize
```
public XmlSource<T> withMinBundleSize(long minBundleSize)
```
    Sets a parameter minBundleSize for the minimum bundle size of the source. Please refer to OffsetBasedSource for the definition of minBundleSize. This is an optional parameter.
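    To give a feel for what a minimum bundle size means, here is a simplified sketch of splitting an offset range so that no bundle is smaller than the minimum. It is an assumption-laden illustration, not the splitting logic of OffsetBasedSource; BundleSplitSketch, split, and desiredBundleSize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Dataflow SDK.
final class BundleSplitSketch {
    /**
     * Splits [start, end) into consecutive {from, to} ranges. Each range spans
     * at least minBundleSize offsets, except when the whole input is smaller.
     */
    static List<long[]> split(long start, long end, long desiredBundleSize, long minBundleSize) {
        // Never aim for bundles smaller than the minimum (or smaller than 1).
        long size = Math.max(Math.max(1, desiredBundleSize), minBundleSize);
        List<long[]> bundles = new ArrayList<>();
        long current = start;
        while (current < end) {
            long next = Math.min(current + size, end);
            // Absorb a tail that would be shorter than the minimum.
            if (end - next < minBundleSize) {
                next = end;
            }
            bundles.add(new long[] {current, next});
            current = next;
        }
        return bundles;
    }
}
```

    In this sketch, a desired size of 10 with a minimum of 40 over the range [0, 100) yields two bundles rather than ten.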
  - createForSubrangeOfFile
```
protected FileBasedSource<T> createForSubrangeOfFile(String fileName,
                                                     long start,
                                                     long end)
```
    Description copied from class: FileBasedSource
    
    Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. When the current source is being split, this method is used to generate new sub-sources. When creating the source, subclasses must call the constructor FileBasedSource.FileBasedSource(String, long, long, long) of FileBasedSource with the corresponding parameter values passed here.
    
    Specified by:
    
    createForSubrangeOfFile in class FileBasedSource<T>
    
    Parameters:
    
    fileName - file backing the new FileBasedSource.
    
    start - starting byte offset of the new FileBasedSource.
    
    end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in which case it will be inferred using FileBasedSource.getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).
  - createSingleFileReader
```
protected FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
```
    Description copied from class: FileBasedSource
    
    Creates and returns an instance of a FileBasedReader implementation for the current source, assuming the source represents a single file. File patterns are handled automatically by the FileBasedSource implementation.
    
    Specified by:
    
    createSingleFileReader in class FileBasedSource<T>
  - producesSortedKeys
```
public boolean producesSortedKeys(PipelineOptions options)
                           throws Exception
```
    Description copied from class: BoundedSource
    
    Whether this source is known to produce key/value pairs sorted by lexicographic order on the bytes of the encoded key.
    
    Specified by:
    
    producesSortedKeys in class BoundedSource<T>
    
    Throws:
    
    Exception
  - validate
```
public void validate()
```
    Description copied from class: Source
    
    Checks that this source is valid, before it can be used in a pipeline.
    It is recommended to use Preconditions for implementing this method.
    
    Overrides:
    
    validate in class FileBasedSource<T>
  - populateDisplayData
```
public void populateDisplayData(DisplayData.Builder builder)
```
    Description copied from class: Source
    
    Register display data for the given transform or component.
    populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect display data via DisplayData.from(HasDisplayData). Implementations may call super.populateDisplayData(builder) in order to register display data in the current namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use the namespace of the subcomponent.
    By default, does not register any display data. Implementors may override this method to provide their own display data.
    
    Specified by:
    
    populateDisplayData in interface HasDisplayData
    
    Overrides:
    
    populateDisplayData in class FileBasedSource<T>
    
    Parameters:
    
    builder - The builder to populate with display data.
    
    See Also:
    
    HasDisplayData
  - getDefaultOutputCoder
```
public Coder<T> getDefaultOutputCoder()
```
    Description copied from class: Source
    
    Returns the default Coder to use for the data read from this source.
    
    Specified by:
    
    getDefaultOutputCoder in class Source<T>
  - getRootElement
```
public String getRootElement()
```
  - getRecordElement
```
public String getRecordElement()
```
  - getRecordClass
```
public Class<T> getRecordClass()
```
