(Since version 0.3.4) use the constructor with no spark session
Get the basePath of the current path. If the path points to a file, its basePath is its parent directory's path; otherwise it is the current path itself.
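The basePath rule can be sketched in plain Scala. This is a hypothetical helper, not the connector's actual implementation; the `isFile` flag stands in for the connector's filesystem check:

```scala
import java.nio.file.Paths

// Hypothetical sketch of the basePath rule: a file's basePath is its
// parent directory; a directory is its own basePath.
def basePath(path: String, isFile: Boolean): String =
  if (isFile) Paths.get(path).getParent.toString
  else path

// A file path resolves to its parent; a directory path resolves to itself.
basePath("/data/input/users.csv", isFile = true)  // "/data/input"
basePath("/data/input", isFile = false)           // "/data/input"
```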
Delete the current file or directory
Get the boolean value of dropUserDefinedSuffix.
true if the column will be dropped, false otherwise
Set to true to drop the column containing the user-defined suffix (default name _user_defined_suffix)
true to drop, false to keep
List files to be loaded.
If the current connector has a non-empty filename pattern, then return a list of file paths that match the pattern.
When the filename pattern is not set: if the absolute path of this connector is a directory, return the path of the directory when detailed is set to false; otherwise, return a list of the file paths in the directory
true to return a list of file paths if the current absolute path is a directory
Get the current filesystem based on the path URI
Get the sum of the file sizes
Get the value of user defined suffix column name
List ALL the file paths (in string format) of the current path of the connector
List all the file paths (in string format) to be loaded.
If the current connector has a non-empty filename pattern, then return a list of file paths that match the pattern.
When the filename pattern is not set: if the absolute path of this connector is a directory, return the path of the directory when detailed is set to false; otherwise, return a list of the file paths in the directory
When the filename pattern IS set, a list of file paths will always be returned
true to list all file paths when the absolute path points to a directory; otherwise only the directory path is returned
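The selection rules above can be sketched as follows. This is a simplified stand-in that works on an in-memory list of file names and an optional regex pattern; the real connector lists files through the filesystem API:

```scala
import scala.util.matching.Regex

// Hypothetical sketch of the listing rules for a directory path.
// files: names present in the directory; pattern: optional filename pattern.
def filesToLoad(dirPath: String,
                files: Seq[String],
                pattern: Option[Regex],
                detailed: Boolean): Seq[String] = pattern match {
  // Pattern set: always return the matching file paths
  case Some(p) => files.filter(f => p.pattern.matcher(f).matches())
                       .map(f => s"$dirPath/$f")
  // No pattern: return all file paths when detailed, else the directory itself
  case None    => if (detailed) files.map(f => s"$dirPath/$f")
                  else Seq(dirPath)
}

val files = Seq("part-0.csv", "part-1.csv", "_SUCCESS")
filesToLoad("/data", files, Some("part-.*\\.csv".r), detailed = false)
// Seq("/data/part-0.csv", "/data/part-1.csv")
filesToLoad("/data", files, None, detailed = false)
// Seq("/data")
```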
List ALL the file paths of the current path of connector
Read a DataFrame from a file with the path defined during the instantiation.
DataFrame reader for the current path of connector
Reset suffix to None
set to true to ignore the validity check of suffix value
The current version of FileConnector doesn't support mixing suffix and non-suffix writes when the DataFrame is partitioned.
In the case of a partitioned table, this method detects whether the user tries to use both suffix and non-suffix writes
an option of suffix in string format
Set the name of the user-defined suffix column (default: _user_defined_suffix)
name of the new key
Write a DataFrame into a file
dataframe to be written
optional; when set, write the DataFrame into a sub-directory of the defined path
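The mapping from suffix to output sub-directory can be sketched as below. The exact directory layout (a Hive-partition-style `_user_defined_suffix=<value>` folder, matching the suffix column name mentioned above) is an assumption for illustration, not a documented contract:

```scala
// Hypothetical sketch: an optional suffix selects a sub-directory of the
// configured base path. The "_user_defined_suffix=" layout is an assumption.
def outputDir(basePath: String, suffix: Option[String]): String =
  suffix match {
    case Some(s) => s"$basePath/_user_defined_suffix=$s"
    case None    => basePath
  }

outputDir("/data/out", Some("sales"))  // "/data/out/_user_defined_suffix=sales"
outputDir("/data/out", None)           // "/data/out"
```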
Write a DataFrame into the given path with the given save mode
Initialize a DataFrame writer. A new writer will be initialized only if the hash code of the input DataFrame differs from that of the last written DataFrame.
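The re-initialization rule can be sketched with a cached hash code. This is a simplified stand-in: the real connector holds a Spark DataFrameWriter, while here a plain counter tracks how often a writer would be rebuilt:

```scala
// Simplified sketch: re-create the "writer" only when the incoming
// data's hash code differs from that of the last written data.
class CachedWriter {
  private var lastHash: Option[Int] = None
  var initCount: Int = 0 // how many times a new writer was built

  def writerFor(data: Seq[String]): Unit = {
    val h = data.hashCode
    if (!lastHash.contains(h)) {
      initCount += 1      // hash changed: initialize a new writer
      lastHash = Some(h)
    }                     // otherwise reuse the existing writer
  }
}

val w = new CachedWriter
w.writerFor(Seq("a", "b"))
w.writerFor(Seq("a", "b")) // same hash: no re-initialization
w.writerFor(Seq("c"))      // different hash: new writer
// w.initCount == 2
```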
Connector that loads CSV files and returns the result as a DataFrame.

You can set the following CSV-specific options to deal with CSV files:

- sep (default `,`): sets a single character as a separator for each field and value.
- encoding (default `UTF-8`): decodes the CSV files by the given encoding type.
- quote (default `"`): sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, you need to set not `null` but an empty string. This behaviour is different from `com.databricks.spark.csv`.
- escape (default `\`): sets a single character used for escaping quotes inside an already quoted value.
- charToEscapeQuoteEscaping (default `escape` or `\0`): sets a single character used for escaping the escape for the quote character. The default value is the escape character when the escape and quote characters are different, `\0` otherwise.
- comment (default empty string): sets a single character used for skipping lines beginning with this character. By default, it is disabled.
- header (default `false`): uses the first line as names of columns.
- enforceSchema (default `true`): if set to `true`, the specified or inferred schema will be forcibly applied to data source files, and headers in CSV files will be ignored. If the option is set to `false`, the schema will be validated against all headers in CSV files when the `header` option is set to `true`. Field names in the schema and column names in CSV headers are checked by their positions, taking `spark.sql.caseSensitive` into account. Though the default value is `true`, it is recommended to disable the `enforceSchema` option to avoid incorrect results.
- inferSchema (default `false`): infers the input schema automatically from data. It requires one extra pass over the data.
- samplingRatio (default `1.0`): defines the fraction of rows used for schema inferring.
- ignoreLeadingWhiteSpace (default `false`): a flag indicating whether leading whitespaces from values being read should be skipped.
- ignoreTrailingWhiteSpace (default `false`): a flag indicating whether trailing whitespaces from values being read should be skipped.
- nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
- emptyValue (default empty string): sets the string representation of an empty value.
- nanValue (default `NaN`): sets the string representation of a "non-number" value.
- positiveInf (default `Inf`): sets the string representation of a positive infinity value.
- negativeInf (default `-Inf`): sets the string representation of a negative infinity value.
- dateFormat (default `yyyy-MM-dd`): sets the string that indicates a date format. Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to the date type.
- timestampFormat (default `yyyy-MM-dd'T'HH:mm:ss.SSSXXX`): sets the string that indicates a timestamp format. Custom date formats follow the formats at `java.text.SimpleDateFormat`. This applies to the timestamp type.
- maxColumns (default `20480`): defines a hard limit of how many columns a record can have.
- maxCharsPerColumn (default `-1`): defines the maximum number of characters allowed for any given value being read. By default, it is `-1`, meaning unlimited length.
- mode (default `PERMISSIVE`): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes:
  - PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord` and sets other fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, corrupt records are dropped during parsing. A record with fewer or more tokens than the schema is not a corrupted record to CSV: when a record has fewer tokens than the length of the schema, `null` is set for the extra fields; when it has more tokens, the extra tokens are dropped.
  - DROPMALFORMED: ignores the whole corrupted records.
  - FAILFAST: throws an exception when it meets corrupted records.
- columnNameOfCorruptRecord (default is the value specified in `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field that holds the malformed string created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.
- multiLine (default `false`): parses one record, which may span multiple lines.
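For instance, a few of these options combined in a standard Spark CSV read. This sketch assumes a local Spark environment and an illustrative file path; it is ordinary Spark API usage, not connector-specific code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumes Spark is on the classpath; the path below is illustrative.
val spark: SparkSession = SparkSession.builder()
  .appName("csv-options-example")
  .master("local[*]")
  .getOrCreate()

val df: DataFrame = spark.read
  .option("sep", ";")            // non-default field separator
  .option("header", "true")      // first line holds column names
  .option("inferSchema", "true") // extra pass over the data to infer types
  .option("nullValue", "NA")     // treat the string "NA" as null
  .option("mode", "FAILFAST")    // throw on corrupt records
  .csv("/data/input/users.csv")
```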