(Since version 0.3.4) Use the constructor with no SparkSession.
Get the basePath of the current path. If the current path points to a file, its basePath is its parent directory's path. Otherwise it is the current path itself.
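The parent-path rule above can be sketched in plain Scala. This is a hypothetical helper, not the connector's implementation; the real connector resolves paths through the Hadoop FileSystem API rather than java.nio.

```scala
import java.nio.file.Paths

// Hypothetical sketch of the basePath rule: a file's basePath is its
// parent directory's path; a directory's basePath is the directory itself.
def basePathOf(path: String, isFile: Boolean): String =
  if (isFile) Paths.get(path).getParent.toString
  else path
```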
Delete the current file or directory.
Get the boolean value of dropUserDefinedSuffix.
true if the column will be dropped, false otherwise
Set to true to drop the column containing the user-defined suffix (default name: _user_defined_suffix).
true to drop, false to keep
List files to be loaded.
If the current connector has a non-empty filename pattern, return the list of file paths that match the pattern.
When the filename pattern is not set and the absolute path of this connector is a directory: return the directory path itself if detailed is false; otherwise return the list of file paths inside the directory.
true to return a list of file paths if the current absolute path is a directory
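The pattern-matching branch described above can be sketched as a plain filter. The helper name and the idea of matching against bare file names are assumptions for illustration; the real connector matches against paths returned by the underlying FileSystem.

```scala
// Hypothetical sketch of the pattern-matching branch of listFiles:
// keep only the file names that fully match the configured pattern.
def matchPattern(fileNames: Seq[String], pattern: String): Seq[String] = {
  val regex = pattern.r
  fileNames.filter(name => regex.pattern.matcher(name).matches())
}
```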
Get the current FileSystem based on the path URI.
Get the total size of the files.
Get the value of the user-defined suffix column name.
List ALL the file paths (as strings) of the current path of the connector.
List all the file paths (as strings) to be loaded.
If the current connector has a non-empty filename pattern, return the list of file paths that match the pattern; when the filename pattern IS set, a list of file paths is always returned.
When the filename pattern is not set and the absolute path of this connector is a directory: return the directory path itself if detailed is false; otherwise return the list of file paths inside the directory.
true to list all file paths when the absolute path points to a directory; otherwise return only the directory path
List ALL the file paths of the current path of the connector.
Read a DataFrame from the file at the path defined during instantiation.
DataFrameReader for the current path of the connector.
Reset the suffix to None.
Set to true to skip the validity check of the suffix value.
The current version of FileConnector doesn't support mixing suffix and non-suffix writes when the DataFrame is partitioned.
In the case of a partitioned table, this method detects whether the user tries to use both suffix and non-suffix writes.
an optional suffix in string format
Set the name of the user-defined suffix column (default: _user_defined_suffix).
name of the new key
Write a DataFrame into a file.
dataframe to be written
optional String; if set, write the DataFrame in a sub-directory of the defined path
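How the optional suffix selects a sub-directory of the configured output path can be sketched as follows. The helper is hypothetical; the actual directory layout (and any interaction with the _user_defined_suffix column) is internal to the connector.

```scala
// Hypothetical sketch: an optional suffix maps the write target to a
// sub-directory of the base output path; no suffix means the base path.
def outputPath(basePath: String, suffix: Option[String]): String =
  suffix match {
    case Some(s) => s"$basePath/$s"
    case None    => basePath
  }
```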
Write a JSON file in the standard format.
This method collects all the DataFrame partitions to the Spark driver, so it may impact performance when the amount of data to write is huge.
DataFrame to be written
Write a DataFrame into the given path with the given save mode.
Initialize a DataFrame writer. A new writer is initialized only if the hashcode of the input DataFrame differs from that of the last written DataFrame.
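The re-initialization rule can be sketched with a cached hashcode. The names below are hypothetical, and the counter stands in for building an actual Spark DataFrameWriter:

```scala
// Hypothetical sketch: remember the hashcode of the last written DataFrame
// and rebuild the writer only when a different DataFrame arrives.
var lastHash: Option[Int] = None
var writerInitCount = 0

def initWriterIfNeeded(dataHash: Int): Unit =
  if (!lastHash.contains(dataHash)) {
    lastHash = Some(dataHash)
    writerInitCount += 1 // stands in for initializing a new DataFrameWriter
  }
```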
Connector that loads JSON files and returns the results as a DataFrame.

You can set the following JSON-specific options to deal with non-standard JSON files:

- primitivesAsString (default false): infers all primitive values as a string type
- prefersDecimal (default false): infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles.
- allowComments (default false): ignores Java/C++ style comments in JSON records
- allowUnquotedFieldNames (default false): allows unquoted JSON field names
- allowSingleQuotes (default true): allows single quotes in addition to double quotes
- allowNumericLeadingZeros (default false): allows leading zeros in numbers (e.g. 00012)
- allowBackslashEscapingAnyCharacter (default false): allows accepting quoting of all characters using the backslash quoting mechanism
- allowUnquotedControlChars (default false): allows JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not
- mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.
  - PERMISSIVE: when it meets a corrupted record, puts the malformed string into a field configured by columnNameOfCorruptRecord, and sets other fields to null. To keep corrupt records, a user can set a string type field named columnNameOfCorruptRecord in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field in an output schema.
  - DROPMALFORMED: ignores the whole corrupted records.
  - FAILFAST: throws an exception when it meets corrupted records.
- columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): allows renaming the new field having the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord.
- dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
- timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
- multiLine (default false): parse one record, which may span multiple lines, per file
- encoding (by default it is not set): allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE, UTF-32LE. If the encoding is not specified and multiLine is set to true, it will be detected automatically.
- lineSep (default covers all \r, \r\n and \n): defines the line separator that should be used for parsing
- samplingRatio (default 1.0): defines the fraction of input JSON objects used for schema inferring
- dropFieldIfAllNull (default false): whether to ignore a column of all null values or empty array/struct during schema inference
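These options are plain string key-value pairs. The option names below come straight from the list above; how they are handed to the reader (for example via spark.read.options(...)) depends on the surrounding code and is assumed here.

```scala
// A sample of the JSON options above, as the string key-value pairs a
// DataFrameReader (or a connector configuration) would receive.
val jsonOptions: Map[String, String] = Map(
  "multiLine"          -> "true",        // records may span multiple lines
  "mode"               -> "PERMISSIVE",  // keep malformed records instead of failing
  "dateFormat"         -> "yyyy-MM-dd",
  "dropFieldIfAllNull" -> "true"
)
// With Spark available, this would typically be passed as:
//   spark.read.options(jsonOptions).json(path)
```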