Implementation of a NamedObjectPersister for Broadcast objects.
Sends a response back in streaming fashion using chunked encoding.
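As a rough illustration of the pattern with spray (the names here are illustrative, not the actual implementation), the streaming might look like:

  import spray.http._
  import spray.routing.RequestContext

  // Sketch: stream a result back in chunks over an open connection.
  // Assumes at least one chunk is available.
  def streamResult(ctx: RequestContext, chunks: Iterator[Array[Byte]]): Unit = {
    // Open the chunked response with the first piece of data
    ctx.responder ! ChunkedResponseStart(HttpResponse(entity = HttpEntity(chunks.next())))
    // Send each remaining piece as its own chunk
    chunks.foreach(bytes => ctx.responder ! MessageChunk(bytes))
    // Terminate the chunked stream
    ctx.responder ! ChunkedMessageEnd
  }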
Represents a context based on SparkContext. Examples include: StreamingContext, SQLContext.
The Job Server can spin up not just a vanilla SparkContext, but anything that implements ContextLike.
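For instance, a custom context factory might return something like this sketch (the exact ContextLike members shown here are assumptions based on the description above; check the actual trait for the real contract):

  import spark.jobserver._
  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  // Hypothetical wrapper that makes a SQLContext launchable by the job server
  class SQLContextWithJobServer(sc: SparkContext) extends SQLContext(sc) with ContextLike {
    def sparkContext: SparkContext = sc
    // Only accept jobs written against the SQL job API
    def isValidJob(job: SparkJobBase): Boolean = job.isInstanceOf[SparkSqlJob]
    def stop(): Unit = sc.stop()
  }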
Implementation of a NamedObjectPersister for DataFrame objects.
An actor that manages the data files stored by the job server on disk.
An actor that manages the jars stored by the job server. Threads must not try to load a class from a jar while a new one is replacing it, so using an actor to serialize requests is a perfect fit.
A class to make Java jobs easier to write. In Java:

  public class MySparkJob extends JavaSparkJob {
      public Object runJob(JavaSparkContext jsc, Config jobConfig) { ... }
  }
A cache for SparkJob classes. Jobs are often run repeatedly, and especially for low-latency jobs, why retrieve the jar and load the class every single time?
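A minimal sketch of the caching idea (hypothetical class and key; the real cache also keys on details like upload time and classpath):

  import scala.collection.mutable

  // Cache loaded job classes so repeated runs skip jar retrieval and class loading
  class JobClassCache {
    private val cache = mutable.HashMap.empty[(String, String), Class[_]]

    def getOrLoad(jarPath: String, className: String): Class[_] = synchronized {
      cache.getOrElseUpdate((jarPath, className), {
        val loader = new java.net.URLClassLoader(Array(new java.io.File(jarPath).toURI.toURL))
        loader.loadClass(className)
      })
    }
  }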
The JobManager actor supervises jobs running in a single SparkContext, as well as shared metadata. It creates a SparkContext (or a StreamingContext, etc., depending on the factory class). It also creates and supervises a JobResultActor and JobStatusActor, although an existing JobResultActor can be passed in as well.
num-cpu-cores = 4          # Total # of CPU cores to allocate across the cluster
memory-per-node = 512m     # -Xmx style memory string for total memory to use for executor on one node
dependent-jar-uris = ["local://opt/foo/my-foo-lib.jar"]   # URIs for dependent jars to load for entire context
context-factory = "spark.jobserver.context.DefaultSparkContextFactory"
spark.mesos.coarse = true  # per-context, rather than per-job, resource allocation
rdd-ttl = 24 h             # time-to-live for RDDs in a SparkContext. Don't specify = forever
is-adhoc = false           # true if context is ad-hoc context
context.name = "sql"       # Name of context
spark {
jobserver {
max-jobs-per-context = 16 # Number of jobs that can be run simultaneously per context
}
}
An actor that manages results returned from jobs.
TODO: support multiple subscribers for same JobID
An implementation of the NamedObjects API for the Job Server. Note that this contains code that executes on the same thread as the job. Uses spray-caching to cache references to named objects and to avoid creating the same object multiple times.
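The spray-caching pattern looks roughly like this (a sketch; the value type and method names are illustrative):

  import scala.concurrent.Future
  import scala.concurrent.ExecutionContext.Implicits.global
  import spray.caching.{Cache, LruCache}

  val objectCache: Cache[AnyRef] = LruCache()

  // The cache stores a Future per name, so if two jobs ask for the same name
  // at once, the creation block runs only once and both get the same object.
  def cachedObject(name: String)(create: => AnyRef): Future[AnyRef] =
    objectCache(name) { create }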
An actor that manages job status updates.
This class starts and stops JobManagers / Contexts in-process. It is responsible for watching out for the death of contexts/JobManagers.
Contexts can be configured to be created automatically at job server initialization. Configuration example:
spark {
  contexts {
    olap-demo {
      num-cpu-cores = 4        # Number of cores to allocate. Required.
      memory-per-node = 1024m  # Executor memory per node, -Xmx style eg 512m, 1G, etc.
    }
  }
}
spark {
  jobserver {
    context-creation-timeout = 15 s
    yarn-context-creation-timeout = 40 s
  }
  # Default settings for all context creation
  context-settings {
    spark.mesos.coarse = true
  }
}
A wrapper for named objects of type Broadcast.
A wrapper for named objects of type DataFrame.
Implementations of this abstract class should handle the specifics of each named object's persistence.
NamedObjects - a trait that gives you safe, concurrent creation and access to named objects such as RDDs or DataFrames (the native SparkContext interface only has access to RDDs by numbers). It facilitates easy sharing of data objects amongst jobs sharing the same SparkContext. If two jobs simultaneously try to create a data object with the same name and in the same namespace, only one will win and the other will retrieve the same object.
Note that to take advantage of NamedObjectSupport, a job must mix this in and use the APIs here instead of the native DataFrame/RDD cache(); otherwise we will not know about the names.
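A hedged sketch of what this looks like from inside a job; getOrElseCreate, NamedRDD, and RDDPersister match the names used above, but their exact signatures are assumptions:

  import scala.concurrent.duration._
  import com.typesafe.config.Config
  import org.apache.spark.SparkContext
  import org.apache.spark.storage.StorageLevel
  import spark.jobserver._

  object SharedWordsJob extends SparkJob with NamedObjectSupport {
    implicit val timeout = 30.seconds
    implicit def rddPersister: NamedObjectPersister[NamedRDD[String]] = new RDDPersister[String]

    def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

    def runJob(sc: SparkContext, config: Config): Any = {
      // Only one concurrent creator "wins"; every caller gets the same named RDD back
      val NamedRDD(words, _, _) = namedObjects.getOrElseCreate("demo:words", {
        NamedRDD(sc.parallelize(Seq("a", "b", "c")),
                 forceComputation = true, storageLevel = StorageLevel.MEMORY_ONLY)
      })
      words.count()
    }
  }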
A test job that accepts a SQLContext, as opposed to the regular SparkContext. Just initializes some dummy data into a table.
A wrapper for named objects of type RDD[T].
Please use NamedObjectSupport instead!
Implementation of a NamedObjectPersister for RDD[T] objects.
This trait is the main API for Spark jobs submitted to the Job Server.
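Concretely, an implementation supplies two methods. This is a sketch of the classic API shape; exact signatures may vary between job server versions:

  import com.typesafe.config.Config
  import org.apache.spark.SparkContext

  trait SparkJob {
    // Runs the job once validation has passed; the return value is the job result
    def runJob(sc: SparkContext, jobConfig: Config): Any

    // Called before runJob; lets a job reject bad input without doing any work
    def validate(sc: SparkContext, config: Config): SparkJobValidation
  }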
Defines a job that runs on a StreamingContext. Note that these jobs are usually long-running, and there is (yet) no way in Spark Job Server to query the status of these jobs.
Message for storing a JAR for an application, given the byte array of the JAR file.
Message for storing one or more local JARs based on the given map, where the key is the appName and the value is the local path to the JAR.
Messages common to all ContextSupervisors.
A test job that accepts a HiveContext, as opposed to the regular SparkContext. Initializes some dummy data into a table, reads it back out, and returns a count. (Will create a Hive metastore at job-server/metastore_db if Hive isn't configured.)
This job simply runs the Hive SQL in the config.
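Such a job can be as small as the following sketch (SparkHiveJob's exact signatures are assumptions):

  import com.typesafe.config.Config
  import org.apache.spark.sql.hive.HiveContext
  import spark.jobserver._

  object HiveSqlJob extends SparkHiveJob {
    def validate(hive: HiveContext, config: Config): SparkJobValidation =
      if (config.hasPath("sql")) SparkJobValid else SparkJobInvalid("No sql config param")

    // Run the Hive SQL from the job config and return the rows
    def runJob(hive: HiveContext, config: Config): Any =
      hive.sql(config.getString("sql")).collect()
  }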
The JobManager is the main entry point for the forked JVM process running an individual SparkContext. It is passed three arguments: $workDir $clusterAddr $configFile
Each forked process has a working directory with log files for that context only, plus a file "context.conf" which contains context-specific settings.
The Spark Job Server is a web service that allows users to submit and run Spark jobs, check status, and view results. It may offer other goodies in the future. It takes only one optional command line arg, a config file to override the default (and you can still use -Dsetting=value to override).

Configuration
spark {
  master = "local"
  jobserver {
    port = 8090
  }
}
A Spark job example that implements the SparkJob trait and can be submitted to the job server.
Set the config with the sentence to split or count: input.string = "adsfasdf asdkf safksf a sdfa"
validate() returns SparkJobInvalid if there is no input.string.
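Putting the pieces together, the job might look like this sketch (assuming the classic SparkJob API shown earlier):

  import com.typesafe.config.Config
  import org.apache.spark.SparkContext
  import spark.jobserver._

  object WordCountExample extends SparkJob {
    def validate(sc: SparkContext, config: Config): SparkJobValidation =
      if (config.hasPath("input.string")) SparkJobValid
      else SparkJobInvalid("No input.string config param")

    def runJob(sc: SparkContext, config: Config): Any = {
      val words = config.getString("input.string").split(" ").toSeq
      sc.parallelize(words).countByValue()   // word -> count
    }
  }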
Message requesting a listing of the available JARs.
A test job that accepts a SQLContext, as opposed to the regular SparkContext. Just initializes some dummy data into a table.
This job simply runs the SQL in the config.
The AkkaClusterSupervisorActor launches Spark Contexts as external processes that connect back with the master node via Akka Cluster.
Currently, when the Supervisor gets a MemberUp message from another actor, it is assumed to be one starting up; the new member will be asked to identify itself, and then the Supervisor will try to initialize it.
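In Akka terms the handshake looks roughly like this (a sketch only; everything beyond MemberUp/Identify/ActorIdentity, including the initialization message, is an assumption):

  import akka.actor.{Actor, ActorIdentity, Identify, RootActorPath}
  import akka.cluster.ClusterEvent.MemberUp

  class SupervisorSketch extends Actor {
    def receive = {
      case MemberUp(member) =>
        // Assume the new member hosts a freshly started JobManager; ask it to identify itself
        context.actorSelection(RootActorPath(member.address) / "user" / "*") ! Identify(member.address)

      case ActorIdentity(_, Some(jobManager)) =>
        // The JobManager answered; hand it its context configuration to initialize
        jobManager ! "Initialize"   // hypothetical message; the real protocol is richer
    }
  }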
See the LocalContextSupervisorActor for normal config options. Here are the ones specific to this class.
Configuration
deploy {
  manager-start-cmd = "./manager_start.sh"
}