Reads data from CSV files on DFS as the data source.
Internally, Spark is used to read the CSV files, so any limitation of Spark's CSV support also applies here
(e.g., limited support for nested schemas).
You can set Spark-compatible, CSV-specific configurations using the prefix "hoodie.deltastreamer.csv.*"
to control how Hudi reads CSV files. The supported options are:
"sep", "encoding", "quote", "escape", "charToEscapeQuoteEscaping", "comment",
"header", "enforceSchema", "inferSchema", "samplingRatio", "ignoreLeadingWhiteSpace",
"ignoreTrailingWhiteSpace", "nullValue", "emptyValue", "nanValue", "positiveInf",
"negativeInf", "dateFormat", "timestampFormat", "maxColumns", "maxCharsPerColumn",
"mode", "columnNameOfCorruptRecord", "multiLine"
Detailed documentation of these CSV options can be found at:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv-scala.collection.Seq-
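As an illustration, these options could be set in a DeltaStreamer properties file like the sketch below (the option values are examples, not defaults):

```properties
# Treat the first line of each CSV file as a header row
hoodie.deltastreamer.csv.header=true
# Use ';' instead of the default ',' as the field separator
hoodie.deltastreamer.csv.sep=;
# Interpret the literal string "NULL" as a null value
hoodie.deltastreamer.csv.nullValue=NULL
```

Each option name after the "hoodie.deltastreamer.csv." prefix maps directly to the corresponding Spark CSV reader option.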
If the source Avro schema is provided through the
FilebasedSchemaProvider
via the "hoodie.deltastreamer.schemaprovider.source.schema.file" config, that schema is
passed directly to the CSV reader and schema inference from the CSV files is skipped.
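For example, pointing the FilebasedSchemaProvider at a source schema file might look like the following (the schema file path is hypothetical):

```properties
# Source Avro schema used for reading CSV files, bypassing schema inference
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://namenode/schemas/source.avsc
```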