(Since version 0.3.4) use the constructor with no spark session
Get the basePath of the current path. If the path points to a file, its basePath is its parent directory's path; otherwise it is the current path itself.
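The basePath rule can be sketched in plain Scala. This is a hypothetical helper, not the connector's actual implementation; the `isFile` flag stands in for the connector's filesystem check:

```scala
import java.nio.file.Paths

// Hypothetical sketch of the basePath rule: a file's basePath is its
// parent directory; a directory is its own basePath.
def basePath(path: String, isFile: Boolean): String =
  if (isFile) Paths.get(path).getParent.toString
  else path

// A file path resolves to its parent; a directory path resolves to itself.
basePath("/data/input/users.csv", isFile = true)  // "/data/input"
basePath("/data/input", isFile = false)           // "/data/input"
```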
Delete the current file or directory
Get the boolean value of dropUserDefinedSuffix.
true if the column will be dropped, false otherwise
Set to true to drop the column containing the user-defined suffix (default name _user_defined_suffix)
true to drop, false to keep
List files to be loaded.
If the current connector has a non-empty filename pattern, then return a list of file paths that match the pattern.
When the filename pattern is not set: if the absolute path of this connector is a directory, return the path of the directory when detailed is set to false; otherwise, return a list of the file paths in the directory
true to return a list of file paths if the current absolute path is a directory
Get the current filesystem based on the path URI
Get the sum of the file sizes
Get the value of user defined suffix column name
List ALL the file paths (in string format) of the current path of the connector
List all the file paths (in string format) to be loaded.
If the current connector has a non-empty filename pattern, then return a list of file paths that match the pattern.
When the filename pattern is not set: if the absolute path of this connector is a directory, return the path of the directory when detailed is set to false; otherwise, return a list of the file paths in the directory
When the filename pattern IS set, a list of file paths will always be returned
true to list all file paths when the absolute path points to a directory; otherwise only the directory path is returned
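The selection rules above can be sketched as follows. This is a simplified stand-in that works on an in-memory list of file names and an optional regex pattern; the real connector lists files through the filesystem API:

```scala
import scala.util.matching.Regex

// Hypothetical sketch of the listing rules for a directory path.
// files: names present in the directory; pattern: optional filename pattern.
def filesToLoad(dirPath: String,
                files: Seq[String],
                pattern: Option[Regex],
                detailed: Boolean): Seq[String] = pattern match {
  // Pattern set: always return the matching file paths
  case Some(p) => files.filter(f => p.pattern.matcher(f).matches())
                       .map(f => s"$dirPath/$f")
  // No pattern: return all file paths when detailed, else the directory itself
  case None    => if (detailed) files.map(f => s"$dirPath/$f")
                  else Seq(dirPath)
}

val files = Seq("part-0.csv", "part-1.csv", "_SUCCESS")
filesToLoad("/data", files, Some("part-.*\\.csv".r), detailed = false)
// Seq("/data/part-0.csv", "/data/part-1.csv")
filesToLoad("/data", files, None, detailed = false)
// Seq("/data")
```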
List ALL the file paths of the current path of connector
Read a DataFrame from a file with the path defined during the instantiation.
DataFrame reader for the current path of connector
Reset suffix to None
set to true to ignore the validity check of suffix value
The current version of FileConnector doesn't support mixing suffix and non-suffix writes when the DataFrame is partitioned.
In the case of a partitioned table, this method detects whether the user tries to use both suffix and non-suffix writes
an option of suffix in string format
Set the name of the user-defined suffix column (default: _user_defined_suffix)
name of the new key
Write a DataFrame into a file
dataframe to be written
optional; when set, write the DataFrame into a sub-directory of the defined path
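The mapping from suffix to output sub-directory can be sketched as below. The exact directory layout (a Hive-partition-style `_user_defined_suffix=<value>` folder, matching the suffix column name mentioned above) is an assumption for illustration, not a documented contract:

```scala
// Hypothetical sketch: an optional suffix selects a sub-directory of the
// configured base path. The "_user_defined_suffix=" layout is an assumption.
def outputDir(basePath: String, suffix: Option[String]): String =
  suffix match {
    case Some(s) => s"$basePath/_user_defined_suffix=$s"
    case None    => basePath
  }

outputDir("/data/out", Some("sales"))  // "/data/out/_user_defined_suffix=sales"
outputDir("/data/out", None)           // "/data/out"
```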
Write a DataFrame into the given path with the given save mode
Initialize a DataFrame writer. A new writer will be initialized only if the hash code of the input DataFrame differs from that of the last written DataFrame.
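The re-initialization rule can be sketched with a cached hash code. This is a simplified stand-in: the real connector holds a Spark DataFrameWriter, while here a plain counter tracks how often a writer would be rebuilt:

```scala
// Simplified sketch: re-create the "writer" only when the incoming
// data's hash code differs from that of the last written data.
class CachedWriter {
  private var lastHash: Option[Int] = None
  var initCount: Int = 0 // how many times a new writer was built

  def writerFor(data: Seq[String]): Unit = {
    val h = data.hashCode
    if (!lastHash.contains(h)) {
      initCount += 1      // hash changed: initialize a new writer
      lastHash = Some(h)
    }                     // otherwise reuse the existing writer
  }
}

val w = new CachedWriter
w.writerFor(Seq("a", "b"))
w.writerFor(Seq("a", "b")) // same hash: no re-initialization
w.writerFor(Seq("c"))      // different hash: new writer
// w.initCount == 2
```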
Connector that loads CSV files and returns the result as a DataFrame.

You can set the following CSV-specific options to deal with CSV files:

- sep (default `,`): sets a single character as a separator for each field and value.
- encoding (default `UTF-8`): decodes the CSV files by the given encoding type.
- quote (default `"`): sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not `null` but an empty string. This behaviour is different from `com.databricks.spark.csv`.
- escape (default `\`): sets a single character used for escaping quotes inside an already quoted value.
- charToEscapeQuoteEscaping (default `escape` or `\0`): sets a single character used for escaping the escape for the quote character. The default value is the escape character when the escape and quote characters are different, `\0` otherwise.
- comment (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.
- header (default `false`): uses the first line as names of columns.
- enforceSchema (default `true`): if set to `true`, the specified or inferred schema will be forcibly applied to data source files, and headers in CSV files will be ignored. If the option is set to `false`, the schema will be validated against all headers in CSV files when the `header` option is set to `true`. Field names in the schema and column names in CSV headers are checked by their positions, taking `spark.sql.caseSensitive` into account. Though the default value is `true`, it is recommended to disable the `enforceSchema` option to avoid incorrect results.
- inferSchema (default `false`): infers the input schema automatically from data. It requires one extra pass over the data.
- samplingRatio (default `1.0`): defines the fraction of rows used for schema inferring.
- ignoreLeadingWhiteSpace (default `false`): a flag indicating whether leading whitespaces from values being read should be skipped.
- ignoreTrailingWhiteSpace (default `false`): a flag indicating whether trailing whitespaces from values being read should be skipped.
- nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
- emptyValue (default empty string): sets the string representation of an empty value.
- nanValue (default `NaN`): sets the string representation of a "non-number" value.
- positiveInf (default `Inf`): sets the string representation of a positive infinity value.
- negativeInf (default `-Inf`): sets the string representation of a negative infinity value.
- dateFormat (default `yyyy-MM-dd`): sets the string that indicates a date format. Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to the date type.
- timestampFormat (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that indicates a timestamp format. Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to the timestamp type.
- maxColumns (default `20480`): defines a hard limit of how many columns a record can have.
- maxCharsPerColumn (default `-1`): defines the maximum number of characters allowed for any given value being read. By default, it is `-1`, meaning unlimited length.
- mode (default `PERMISSIVE`): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes:
  - PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord` and sets other fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. A record with fewer or more tokens than the schema is not a corrupted record to CSV: when a record has fewer tokens than the length of the schema, `null` is set for the extra fields; when it has more tokens, the extra tokens are dropped.
  - DROPMALFORMED: ignores the whole corrupted records.
  - FAILFAST: throws an exception when it meets corrupted records.
- columnNameOfCorruptRecord (default is the value specified in `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field that holds the malformed string created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.
- multiLine (default `false`): parses one record, which may span multiple lines.
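For instance, a few of these options combined in a standard Spark CSV read. This sketch assumes a local Spark environment and an illustrative file path; it is ordinary Spark API usage, not connector-specific code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumes Spark is on the classpath; the path below is illustrative.
val spark: SparkSession = SparkSession.builder()
  .appName("csv-options-example")
  .master("local[*]")
  .getOrCreate()

val df: DataFrame = spark.read
  .option("sep", ";")            // non-default field separator
  .option("header", "true")      // first line holds column names
  .option("inferSchema", "true") // extra pass over the data to infer types
  .option("nullValue", "NA")     // treat the string "NA" as null
  .option("mode", "FAILFAST")    // throw on corrupt records
  .csv("/data/input/users.csv")
```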