Environment defining a sandbox in which an application can write.
Environment which provides databases. By default, there will be a single database of the form {environment}_{project}_{branch}, where environment is the logical environment (e.g. dev, test), project is the name of the application, and branch is the Git branch.
N.B. when environment is 'prod', the branch is omitted from the database name, as we assume it will always be master.
e.g. dev_my_project_feature_abc, prod_my_project
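The default naming rule above can be sketched as a small function. This is a minimal illustration only: DatabaseNaming and databaseName are hypothetical names, not part of the library, and the sketch assumes the branch name is already safe to use in a database name.

```scala
// Hypothetical helper, not part of the library: derives the default database
// name from the three values described above.
object DatabaseNaming {
  def databaseName(environment: String, project: String, branch: String): String =
    if (environment == "prod") s"${environment}_${project}" // branch omitted for prod
    else s"${environment}_${project}_${branch}"
}
```

Under these assumptions, databaseName("dev", "my_project", "feature_abc") gives dev_my_project_feature_abc, and any branch passed with "prod" is ignored.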
During the development lifecycle of Spark applications, it is useful to create sandbox environments comprising paths, Hive databases, etc. which are tied to specific logical environments (e.g. dev, test, prod) and to feature development (i.e. Git branches). For example, when working on a feature called new_feature for a project called my_project, the application should write its data to paths under /data/dev/my_project/new_feature/ and create tables in a database called dev_my_project_new_feature. The actual implementation of what these environments should look like can be defined by extending Env or one of its subclasses. The final implementation should be a case class whose values define the environment, i.e. env, branch, etc.
This is a generic Spark application which uses an implementation of Env to generate application-specific configuration and subsequently parses this configuration into a case class to be used for the application logic.
the type of the Env implementation (must be a case class)
This is a SparkApp specifically for applications using Waimak.
Trait for defining Waimak-app-specific configuration.
Performs create and cleanup operations for the Env implementation used by a provided implementation of SparkApp. The following configuration values should be present in the SparkSession:
spark.waimak.environment.ids: comma-separated unique ids for the environments
spark.waimak.environment.{environmentid}.appClassName: the application class to use (must extend SparkApp)
spark.waimak.environment.action: the environment action to perform (create or cleanup)
The Env implementation expects configuration values prefixed with spark.waimak.environment.{environmentid}.
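As an illustration of how these keys fit together, the sketch below builds them as a plain Map standing in for the SparkSession configuration. EnvironmentManagerConf and build are hypothetical names introduced here, and the environment-specific values expected by the Env implementation are left out because their names depend on that implementation.

```scala
// Hypothetical helper, not part of the library: assembles the configuration
// keys listed above for one or more environments.
object EnvironmentManagerConf {
  // apps: pairs of (environment id, fully qualified application class name)
  def build(apps: Seq[(String, String)], action: String): Map[String, String] = {
    require(Set("create", "cleanup").contains(action), s"Unsupported action: $action")
    val perEnvironment = apps.map { case (id, className) =>
      s"spark.waimak.environment.$id.appClassName" -> className
    }
    Map(
      "spark.waimak.environment.ids" -> apps.map(_._1).mkString(","),
      "spark.waimak.environment.action" -> action
    ) ++ perEnvironment
  }
}
```

Each entry of the resulting Map could then be supplied to the session in whatever way the deployment normally sets Spark configuration values.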
Allows multiple Spark applications to be run in a single main method whilst obeying configured dependency constraints. The following configuration values should be present in the SparkSession:
spark.waimak.apprunner.apps: a comma-delimited list of the names (identifiers) of all of the applications being run (e.g. myapp1,myapp2)
spark.waimak.apprunner.{appname}.appClassName: for each application, the application class to use (must extend SparkApp) (e.g. spark.waimak.apprunner.myapp1.appClassName = com.example.MyWaimakApp)
spark.waimak.apprunner.{appname}.dependencies: for each application, an optional comma-delimited list of dependencies. If omitted, the application will have no dependencies and will not wait for other apps to finish before starting execution. Dependencies must match the names provided in spark.waimak.apprunner.apps (e.g. spark.waimak.apprunner.myapp1.dependencies = myapp2)
The Env implementation used by the provided SparkApp implementation expects configuration values prefixed with: spark.waimak.environment.{appname}.
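To make the dependency constraints concrete, the sketch below reads the apps and dependencies keys from a plain Map (standing in for the SparkSession configuration) and groups the applications into waves, where each wave depends only on earlier ones. AppOrdering, dependencies and waves are hypothetical names, and this is not the runner's actual scheduler, only one simple way to honour the same constraints.

```scala
// Hypothetical sketch, not the library's scheduler: derives a valid execution
// order from the configuration keys described above.
object AppOrdering {
  private def splitList(s: String): Seq[String] =
    s.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  // Maps every app named in spark.waimak.apprunner.apps to its (possibly empty) dependencies.
  def dependencies(conf: Map[String, String]): Map[String, Set[String]] =
    splitList(conf.getOrElse("spark.waimak.apprunner.apps", "")).map { app =>
      app -> splitList(conf.getOrElse(s"spark.waimak.apprunner.$app.dependencies", "")).toSet
    }.toMap

  // Groups apps into waves: every app in a wave depends only on apps in earlier waves.
  def waves(deps: Map[String, Set[String]]): List[Set[String]] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]],
             done: Set[String],
             acc: List[Set[String]]): List[Set[String]] =
      if (remaining.isEmpty) acc.reverse
      else {
        val ready = remaining.collect { case (app, ds) if ds.subsetOf(done) => app }.toSet
        // Nothing ready means a cycle, or a dependency that is not listed in apps.
        require(ready.nonEmpty, s"Unsatisfiable dependencies among: ${remaining.keys.mkString(", ")}")
        loop(remaining -- ready, done ++ ready, ready :: acc)
      }
    loop(deps, Set.empty, Nil)
  }
}
```

With apps set to myapp1,myapp2 and myapp1 depending on myapp2, this places myapp2 in the first wave and myapp1 in the second. Apps without dependencies always land in the first wave, matching the rule that they do not wait for other apps.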
Environment which provides a base path into which the application can write its data. Unless overridden, paths will be of the form {uri}/data/{environment}/{project}/{branch}, where environment is the logical environment (e.g. dev, test), project is the name of the application, and branch is the Git branch.
N.B. when environment is 'prod', the branch is omitted from the path, as we assume it will always be master.
e.g. hdfs:///data/dev/my_project/feature_abc, hdfs:///data/prod/my_project
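The default path layout can be sketched in the same way. PathNaming and basePath are hypothetical names used only for illustration, and the sketch assumes uri is given without a trailing slash (e.g. hdfs://).

```scala
// Hypothetical helper, not part of the library: derives the default base path
// from the values described above.
object PathNaming {
  def basePath(uri: String, environment: String, project: String, branch: String): String = {
    val root = s"$uri/data/$environment/$project"
    if (environment == "prod") root // branch omitted for prod
    else s"$root/$branch"
  }
}
```

Under these assumptions, basePath("hdfs://", "dev", "my_project", "feature_abc") gives hdfs:///data/dev/my_project/feature_abc.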