Reads data from CSV files on DFS as the data source.
Internally, Spark is used to read the CSV files, so any limitation of Spark's CSV support also applies here
(e.g., limited support for nested schemas).
You can set Spark-compatible, CSV-specific configurations using the prefix "hoodie.deltastreamer.csv.*"
to control how Hudi reads CSV files. The supported options are:
"sep", "encoding", "quote", "escape", "charToEscapeQuoteEscaping", "comment",
"header", "enforceSchema", "inferSchema", "samplingRatio", "ignoreLeadingWhiteSpace",
"ignoreTrailingWhiteSpace", "nullValue", "emptyValue", "nanValue", "positiveInf",
"negativeInf", "dateFormat", "timestampFormat", "maxColumns", "maxCharsPerColumn",
"mode", "columnNameOfCorruptRecord", "multiLine"
Detailed documentation of these CSV options can be found at:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv-scala.collection.Seq-
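As an illustration, these options could be set in a DeltaStreamer properties file like the sketch below (the option values are examples, not defaults):

```properties
# Treat the first line of each CSV file as a header row
hoodie.deltastreamer.csv.header=true
# Use ';' instead of the default ',' as the field separator
hoodie.deltastreamer.csv.sep=;
# Interpret the literal string "NULL" as a null value
hoodie.deltastreamer.csv.nullValue=NULL
```

Each option name after the "hoodie.deltastreamer.csv." prefix maps directly to the corresponding Spark CSV reader option.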
If the source Avro schema is provided through the
FilebasedSchemaProvider
via the "hoodie.deltastreamer.schemaprovider.source.schema.file" config, that schema is
passed directly to the CSV reader and schema inference from the CSV files is skipped.
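For example, pointing the FilebasedSchemaProvider at a source schema file might look like the following (the schema file path is hypothetical):

```properties
# Source Avro schema used for reading CSV files, bypassing schema inference
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://namenode/schemas/source.avsc
```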