Wednesday, November 5, 2014

Spark Utilities

All code described here is in the distributed-tools project. The functions described are in the class com.lordjoe.distributed.SparkUtilities in the subproject spark-implementation.

In working with Spark I find there is a need for a library of commonly used functions. I put these in a class called SparkUtilities. My general convention is that classes named utilities are collections of static functions.

A major function is getCurrentContext(). Because JavaSparkContexts cannot be serialized, it is not possible to pass a context into a function. The body of a function executing on a slave process needs to find a local copy of the context. If no local copy exists, one must be constructed. All of this is handled by getCurrentContext(). It caches a constructed context in a transient field; the transient keyword causes the field not to be serialized. As a result, one JavaSparkContext is constructed per slave VM.
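
The caching idea can be sketched without Spark on the classpath. This is only an illustration of the transient-field pattern, not the actual SparkUtilities code: FakeContext is a hypothetical stand-in for JavaSparkContext, and ContextHolderSketch, hasCachedContext and roundTrip are names I made up for the example. The round trip through Java serialization imitates shipping an object to another VM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for JavaSparkContext; deliberately NOT Serializable. */
final class FakeContext {
    final String master;

    FakeContext(String master) {
        this.master = master;
    }
}

/** Sketch (assumed names): caches a non-serializable context in a transient field. */
final class ContextHolderSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: Java serialization skips this field, so a deserialized copy sees null
    private transient FakeContext context;

    /** Returns the cached context, constructing it on first use in this VM. */
    synchronized FakeContext getCurrentContext() {
        if (context == null) {
            context = new FakeContext("local[*]");
        }
        return context;
    }

    /** True if a context has already been constructed for this holder. */
    synchronized boolean hasCachedContext() {
        return context != null;
    }

    /** Serializes and deserializes the holder, imitating a trip to another VM. */
    static ContextHolderSketch roundTrip(ContextHolderSketch holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ContextHolderSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```

Without the transient keyword, writeObject would fail here because FakeContext is not serializable. With it, the copy arrives with an empty field and builds its own context the first time getCurrentContext() is called.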

One important function is guaranteeSparkMaster. When running on a machine without a cluster, the Spark master will be undefined. Calling sparkConf.setMaster("local[*]"); causes the job to run with a local master (with as many worker threads as the machine has cores). This is good for debugging. Because the code does this automatically, there is no need to change code or command-line arguments to run locally. If no cluster is available, getCurrentContext defaults to a local master.
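
The fallback logic is roughly the following. This is a sketch under assumptions, not the real method: to keep it runnable without Spark, a java.util.Properties object stands in for SparkConf, and the class name SparkMasterSketch is hypothetical. The key spark.master is the standard Spark setting for the master URL.

```java
import java.util.Properties;

/** Sketch (assumed names): default to a local master when none is configured. */
final class SparkMasterSketch {
    static final String MASTER_KEY = "spark.master";
    static final String LOCAL_MASTER = "local[*]";

    /**
     * Sets the master to local[*] only if no master is defined.
     * Properties is a stand-in for SparkConf; with a real SparkConf the
     * equivalent check would be along the lines of:
     *   if (!sparkConf.contains("spark.master")) sparkConf.setMaster("local[*]");
     */
    static Properties guaranteeSparkMaster(Properties conf) {
        String master = conf.getProperty(MASTER_KEY);
        if (master == null || master.trim().isEmpty()) {
            conf.setProperty(MASTER_KEY, LOCAL_MASTER);
        }
        return conf;
    }
}
```

An existing master setting is left untouched, so the same code runs unchanged on a cluster and on a development machine.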
