Package org.hpccsystems.spark
Class HpccFileWriter

java.lang.Object
    org.hpccsystems.spark.HpccFileWriter

All Implemented Interfaces:
Serializable

public class HpccFileWriter extends Object implements Serializable

A helper class that creates a Spark job to write a given RDD to HPCC Systems.

See Also:
Serialized Form
Constructor Summary

HpccFileWriter(String connectionString, String user, String pass)
    Attempts to open a connection to the specified HPCC cluster and validates the user.

HpccFileWriter(org.hpccsystems.ws.client.utils.Connection espconninfo)
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description org.apache.spark.sql.types.StructType
inferSchema(List<PySparkField> exampleFields)
Generates an inferred schema based on an example Map of FieldNames to Example Field Objects.long
saveToHPCC(org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helperlong
saveToHPCC(org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helperlong
saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helperlong
saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.sql.types.StructType rddSchema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> rdd, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helperlong
saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helperlong
saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName)
Saves the provided RDD to the specified file within the specified cluster.long
saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite)
Saves the provided RDD to the specified file within the specified cluster Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java Schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper
-
-
-
Constructor Detail

HpccFileWriter
public HpccFileWriter(org.hpccsystems.ws.client.utils.Connection espconninfo) throws org.hpccsystems.commons.errors.HpccFileException
Throws:
org.hpccsystems.commons.errors.HpccFileException

HpccFileWriter
public HpccFileWriter(String connectionString, String user, String pass) throws Exception
Attempts to open a connection to the specified HPCC cluster and validates the user.
Parameters:
connectionString - of the form {http|https}://{HOST}:{PORT}; the host and port are the same as the ECL Watch host and port
user - a valid ECL Watch account
pass - the password for the provided user
Throws:
Exception - general exception
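
For illustration, a minimal sketch of constructing a writer with the connection-string form; the endpoint and credentials below are placeholders, not values from this documentation:

    import org.hpccsystems.spark.HpccFileWriter;

    public class WriterSetup
    {
        public static void main(String[] args) throws Exception
        {
            // ECL Watch endpoint and credentials are placeholders;
            // substitute your own cluster's values. The constructor
            // connects and validates the user, throwing on failure.
            HpccFileWriter writer = new HpccFileWriter(
                    "http://eclwatch.example.com:8010", "myuser", "mypass");
        }
    }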

Method Detail

saveToHPCC
public long saveToHPCC(org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper
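
A hedged sketch of calling this overload from Java; the session setup, endpoint, cluster name, and logical file name are all placeholders. A Dataset<Row> exposes the underlying Scala RDD via rdd():

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper;

    public class ScalaRddWrite
    {
        public static void main(String[] args)
                throws Exception, ArrayOfEspExceptionWrapper
        {
            SparkSession spark = SparkSession.builder().appName("hpcc-write").getOrCreate();

            // Placeholder input; any Dataset<Row> works.
            Dataset<Row> df = spark.read().parquet("/path/to/input.parquet");

            HpccFileWriter writer = new HpccFileWriter(
                    "http://eclwatch.example.com:8010", "myuser", "mypass");

            // Dataset.rdd() yields the org.apache.spark.rdd.RDD<Row> this
            // overload expects; "mythor" and the file name are placeholders.
            long written = writer.saveToHPCC(df.rdd(), "mythor", "example::output::file");
            System.out.println("Wrote " + written + " records");
        }
    }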

saveToHPCC
public long saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
schema - the schema of the provided RDD
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
javaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper
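
A minimal sketch of the same write through the JavaRDD overload; the writer, cluster, and file names are placeholders:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper;

    public class JavaRddWrite
    {
        // Dataset.javaRDD() adapts a Dataset<Row> to the JavaRDD<Row>
        // this overload expects.
        static long write(HpccFileWriter writer, Dataset<Row> df)
                throws Exception, ArrayOfEspExceptionWrapper
        {
            JavaRDD<Row> rows = df.javaRDD();
            return writer.saveToHPCC(rows, "mythor", "example::output::file");
        }
    }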

saveToHPCC
public long saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
schema - the schema of the provided RDD
javaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper
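
Where the Row schema is not carried by the RDD itself, this overload takes it explicitly. A sketch with illustrative field names and types:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper;

    public class SchemaWrite
    {
        static long write(HpccFileWriter writer, JavaRDD<Row> rows)
                throws Exception, ArrayOfEspExceptionWrapper
        {
            // Explicit schema for the rows being written; the fields here
            // are illustrative only.
            StructType schema = DataTypes.createStructType(Arrays.asList(
                    DataTypes.createStructField("name", DataTypes.StringType, false),
                    DataTypes.createStructField("age", DataTypes.IntegerType, false)));
            return writer.saveToHPCC(schema, rows, "mythor", "example::people");
        }
    }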

saveToHPCC
public long saveToHPCC(org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
schema - the schema of the provided RDD
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
javaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper
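
A sketch of an explicit-compression, overwriting write. CompressionAlgorithm.DEFAULT is an assumed constant name, not taken from this page; check the org.hpccsystems.dfs.client.CompressionAlgorithm enum in your hpcc4j version for the values it actually offers:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;
    import org.hpccsystems.dfs.client.CompressionAlgorithm;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper;

    public class CompressedWrite
    {
        static long write(HpccFileWriter writer, JavaRDD<Row> rows)
                throws Exception, ArrayOfEspExceptionWrapper
        {
            // true = replace the logical file if it already exists.
            // CompressionAlgorithm.DEFAULT is assumed; substitute the
            // algorithm your cluster should use.
            return writer.saveToHPCC(rows, "mythor", "example::output::file",
                                     CompressionAlgorithm.DEFAULT, true);
        }
    }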

saveToHPCC
public long saveToHPCC(org.apache.spark.sql.types.StructType schema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
schema - the schema of the provided RDD
javaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
sc - the current SparkContext
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> javaRDD, String clusterName, String fileName) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster, using the HPCC default file compression. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
sc - the current SparkContext
javaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper
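
A sketch of the SparkContext-taking overload; the context comes straight from the active session, and the cluster and file names remain placeholders:

    import org.apache.spark.SparkContext;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper;

    public class ContextWrite
    {
        static long write(SparkSession spark, HpccFileWriter writer, JavaRDD<Row> rows)
                throws Exception, ArrayOfEspExceptionWrapper
        {
            // Hand the current SparkContext to the writer explicitly,
            // as this overload requires.
            SparkContext sc = spark.sparkContext();
            return writer.saveToHPCC(sc, rows, "mythor", "example::output::file");
        }
    }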

saveToHPCC
public long saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.rdd.RDD<org.apache.spark.sql.Row> scalaRDD, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
sc - the current SparkContext
scalaRDD - the RDD to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

saveToHPCC
public long saveToHPCC(org.apache.spark.SparkContext sc, org.apache.spark.sql.types.StructType rddSchema, org.apache.spark.api.java.JavaRDD<org.apache.spark.sql.Row> rdd, String clusterName, String fileName, org.hpccsystems.dfs.client.CompressionAlgorithm fileCompression, boolean overwrite) throws Exception, org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper
Saves the provided RDD to the specified file within the specified cluster. Note: PySpark datasets can be written to HPCC by first calling inferSchema to generate a valid Java schema and converting the PySpark RDD to a JavaRDD via the _py2java() helper.
Parameters:
sc - the current SparkContext
rddSchema - the schema of the provided RDD
rdd - the JavaRDD of Rows to save to HPCC
clusterName - the name of the cluster to save to
fileName - the name of the logical file in HPCC to create; follows HPCC file name conventions
fileCompression - the compression algorithm to use on the file
overwrite - overwrite flag
Returns:
the number of records written
Throws:
Exception - general exception
org.hpccsystems.ws.client.wrappers.ArrayOfEspExceptionWrapper - array of ESP exception wrapper

inferSchema
public org.apache.spark.sql.types.StructType inferSchema(List<PySparkField> exampleFields) throws Exception
Generates an inferred schema from an example list of field names paired with example field values. This function is targeted primarily at helping PySpark users write datasets back to HPCC.
Parameters:
exampleFields - a list of PySpark fields
Returns:
a valid Spark schema based on the example fields
Throws:
Exception - general exception
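
A hedged Java sketch of schema inference; the PySparkField (String, Object) constructor shown is an assumption, not documented on this page:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.sql.types.StructType;
    import org.hpccsystems.spark.HpccFileWriter;
    import org.hpccsystems.spark.PySparkField;

    public class InferSchemaExample
    {
        static StructType infer(HpccFileWriter writer) throws Exception
        {
            // Assumption: PySparkField pairs a field name with an example
            // value via a (String, Object) constructor.
            List<PySparkField> example = Arrays.asList(
                    new PySparkField("name", "Jane Doe"),
                    new PySparkField("age", 42));
            return writer.inferSchema(example);
        }
    }

From PySpark, the resulting schema plus the _py2java() conversion noted above lets the dataset flow through the JavaRDD overloads of saveToHPCC.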