Taming The Elephant: Spark Utilities

Spark Utilities

All code described here is in the projects at distributed-tools on code.google.com. The functions described are in the class com.lordjoe.distributed.SparkUtilities in the subproject spark-implementation.

In working with Spark I find there is a need for a library of commonly used functions. I put these in a class called SparkUtilities. My general convention is that classes named Utilities are collections of static functions.

A major function is getCurrentContext(). Because a JavaSparkContext cannot be serialized, it is not possible to pass a context into a function. The body of a function executing on a slave process needs to find a local copy of the context, and if no local copy exists, one must be constructed. All of this is handled by getCurrentContext(). It caches the constructed context in a static field marked transient. Java serialization never writes static fields, so the transient keyword is not strictly required here, but it makes explicit that the context is not to be serialized. The SparkUtilities code below causes one JavaSparkContext to be constructed per slave VM.

Another important function is guaranteeSparkMaster. When running on a machine without a cluster, the Spark master will be undefined. Calling sparkConf.setMaster("local[*]") causes the job to run with a local master, using one worker thread per available core. This is good for debugging. Because the code does this, there is no need for special code or command-line options to run locally: if no cluster is available, getCurrentContext defaults to a local master.
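The caching idea behind getCurrentContext() can be tried without Spark on the classpath. What follows is only a minimal sketch under stated assumptions: ExpensiveResource is a hypothetical stand-in for JavaSparkContext, Task is a hypothetical stand-in for a function object shipped to a worker, and roundTrip imitates shipping with plain Java serialization. None of these names come from SparkUtilities, whose own listing follows the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CachedResourceSketch {

    /** Hypothetical stand-in for JavaSparkContext: deliberately not Serializable. */
    static class ExpensiveResource {
        static int constructed = 0;           // how many were built in this VM
        ExpensiveResource() { constructed++; }
    }

    /** Hypothetical stand-in for a function object that gets serialized and shipped. */
    static class Task implements Serializable {
        private static final long serialVersionUID = 1L;

        // static fields are never written by Java serialization;
        // transient only documents the intent, as in SparkUtilities
        private transient static ExpensiveResource cached;

        static synchronized ExpensiveResource getCurrent() {
            if (cached == null)
                cached = new ExpensiveResource();   // built lazily, once per VM
            return cached;
        }

        int run() {
            // look the resource up locally instead of carrying it as an instance field
            return System.identityHashCode(getCurrent());
        }
    }

    /** Serialize and deserialize a task, roughly what shipping it to a worker involves. */
    static Task roundTrip(Task task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Task) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Task copy = roundTrip(new Task());    // succeeds although the resource is not Serializable
        copy.run();
        copy.run();                           // second call reuses the cached resource
        System.out.println("resources constructed: " + ExpensiveResource.constructed);
    }
}
```

Because the resource is reached through a static accessor rather than an instance field, the task serializes cleanly and each VM builds its own copy on first use.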

    //  private transient static ThreadLocal<JavaSparkContext> threadContext;
    private transient static JavaSparkContext threadContext;
    //  private transient static ThreadLocal<JavaSQLContext> threadContext;
    private transient static JavaSQLContext sqlContext;
    private static final Properties sparkProperties = new Properties();
    private static String appName = "Anonymous";
    private static boolean local;

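    public static boolean isLocal() {
        return local;
    }

    // The two accessors below are called by the code that follows but are not
    // part of the listing shown here; these bodies are assumed, simple versions.
    public static void setLocal(boolean isLocal) {
        local = isLocal;
    }

    public static String getAppName() {
        return appName;
    }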
   
    /**
     * Create a JavaSparkContext for this VM if none exists.
     *
     * @return the cached context, constructed on first use
     */
    public static synchronized JavaSparkContext getCurrentContext() {
//        if (threadContext == null)
//            threadContext = new ThreadLocal<JavaSparkContext>();
//        JavaSparkContext ret = threadContext.get();
        JavaSparkContext ret = threadContext;
        if (ret != null)
            return ret;
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName(getAppName());
        SparkUtilities.guaranteeSparkMaster(sparkConf);

        // finish configuring first: a context copies the SparkConf when it is
        // constructed, so later changes to sparkConf do not reach it
        sparkConf.set("spark.mesos.coarse", "true");
        sparkConf.set("spark.executor.memory", "2500m");
        // construct exactly one context per VM
        ret = new JavaSparkContext(sparkConf);
        threadContext = ret;
        return ret;
    }
 
    public static synchronized Configuration getHadoopConfiguration() {
        Configuration configuration = getCurrentContext().hadoopConfiguration();
        return configuration;
    }
 
    /**
     * If no Spark master is defined then use "local[*]".
     *
     * @param sparkConf the configuration
     */
    public static void guaranteeSparkMaster(@Nonnull SparkConf sparkConf) {
        Option<String> option = sparkConf.getOption("spark.master");

        if (!option.isDefined()) {   // use local over nothing
            sparkConf.setMaster("local[*]");
            setLocal(true);
            /*
             * liquanpei@gmail.com suggests the commented-out setting below to correct
             * 14/10/08 09:36:35 ERROR broadcast.TorrentBroadcast: Reading broadcast variable 0 failed
             * 14/10/08 09:36:35 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 5.006378813 s
             * 14/10/08 09:36:35 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
             * 14/10/08 09:36:35 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
             * java.lang.NullPointerException
             * at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
             * at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
             */
            //  sparkConf.set("spark.broadcast.factory","org.apache.spark.broadcast.HttpBroadcastFactory" );
        }
        else {
            setLocal(option.get().startsWith("local"));
        }
        // set all properties in the SparkProperties file
        for (String property : sparkProperties.stringPropertyNames()) {
            if (!property.startsWith("spark."))
                continue;
            sparkConf.set(property, sparkProperties.getProperty(property));
        }
    }
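
The defaulting logic in guaranteeSparkMaster can be exercised on its own. This is a rough sketch, not the SparkUtilities implementation: java.util.Properties stands in for SparkConf so that it runs without Spark, and the names MasterDefaultSketch, guaranteeMaster and the example master URL are hypothetical.

```java
import java.util.Properties;

public class MasterDefaultSketch {

    /**
     * Hypothetical stand-in for guaranteeSparkMaster, with Properties playing
     * the role of SparkConf. Returns true when the master in use is a local one.
     */
    static boolean guaranteeMaster(Properties conf, Properties overrides) {
        String master = conf.getProperty("spark.master");
        if (master == null) {                  // nothing configured: use local over nothing
            master = "local[*]";
            conf.setProperty("spark.master", master);
        }
        // copy only keys in the "spark." namespace, skipping everything else
        for (String key : overrides.stringPropertyNames()) {
            if (!key.startsWith("spark."))
                continue;
            conf.setProperty(key, overrides.getProperty(key));
        }
        return master.startsWith("local");
    }

    public static void main(String[] args) {
        Properties overrides = new Properties();
        overrides.setProperty("spark.executor.memory", "2500m");
        overrides.setProperty("unrelated.key", "ignored");

        // no master configured: falls back to local[*]
        Properties noMaster = new Properties();
        System.out.println(guaranteeMaster(noMaster, overrides));    // true
        System.out.println(noMaster.getProperty("spark.master"));    // local[*]

        // master already configured (made-up URL): left untouched
        Properties cluster = new Properties();
        cluster.setProperty("spark.master", "spark://example-host:7077");
        System.out.println(guaranteeMaster(cluster, overrides));     // false
        System.out.println(cluster.getProperty("spark.master"));     // spark://example-host:7077
    }
}
```

As in the listing above, the local flag is decided before the property overrides are applied, so an override of spark.master would not change it.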

Wednesday, November 5, 2014
