Taming The Elephant: 2014

Monday, November 17, 2014

Using a Complex Structure as a Spark Accumulator

In two earlier blogs I discussed the uses of Accumulators in Spark. I will continue this discussion by describing how Accumulators may be used for more complex structures. The structure I will describe keeps statistics on a double variable, tracking mean, standard deviation, minimum and maximum values.
While the code was written in part to test the capabilities of accumulators, there were real problems motivating it. Accumulators are an excellent way to pull secondary, summary results from processing an RDD without interfering with the main result.
The Statistics class can accumulate statistics on data. The object is immutable: addition results in a new object combining the current statistics with either one or more numbers or one other Statistics. The code is shown below.

package com.lordjoe.distributed.spark;

import org.apache.spark.*;

import java.io.*;

/**
 * com.lordjoe.distributed.spark.Statistics
 * keep statistics
 * User: Steve
 * Date: 11/13/2014
 */
public class Statistics implements Serializable {

public static final StatisticsAccumulatorParam PARAM_INSTANCE = new StatisticsAccumulatorParam();

public static final Statistics ZERO = new Statistics();
    private final int number;
    private final double sum;
    private final double sumsquare;
    private final double max;
    private final double min;

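private Statistics() {
        number = 0;
        sum = 0;
        sumsquare = 0;
        // identity elements: the first real value added replaces them
        // (Double.MIN_VALUE is the smallest positive double, so it cannot serve as a starting max)
        max = Double.NEGATIVE_INFINITY;
        min = Double.POSITIVE_INFINITY;
    }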

/**
     * build with 1 or more numbers
     * @param d  first value
     * @param values other values - if any
     */
    public Statistics(double d, double... values) {
        number = 1 + values.length;
        double tsum = d;
        double tsumsq = d * d;
        double tmin = d;
        double tmax = d;
        for (int i = 0; i < values.length; i++) {
            double value = values[i];
            tsum += value;
            tsumsq += value * value;
            tmin = Math.min(tmin, value);   // compare against the running minimum
            tmax = Math.max(tmax, value);   // compare against the running maximum
        }
        sum = tsum;
        sumsquare = tsumsq;
        max = tmax;
        min = tmin;
    }

private Statistics(Statistics s1, Statistics s2) {
        number = s1.number + s2.number;
        sum = s1.sum + s2.sum;
        sumsquare = s1.sumsquare + s2.sumsquare;
        max = Math.max(s1.max, s2.max);
        min = Math.min(s1.min, s2.min);
    }

private Statistics(Statistics s1, double d) {
        number = s1.number + 1;
        sum = s1.sum + d;
        sumsquare = s1.sumsquare + d * d;
        max = Math.max(s1.max, d);
        min = Math.min(s1.min, d);
    }

public Statistics add(double d) {
        return new Statistics(this, d);
    }

public Statistics add(Statistics d) {
        return new Statistics(this, d);
    }

public int getNumber() {
        return number;
    }

public double getSum() {
        return sum;
    }

public double getSumsquare() {
        return sumsquare;
    }

public double getMax() {
        return max;
    }

public double getMin() {
        return min;
    }

public double getAverage() {
        if (number == 0)
            return 0;
        return sum / number;
    }

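public double getStandardDeviation() {
        if (number < 2)
            return Double.MAX_VALUE;
        // sample variance from the running sums: (sum(x^2) - sum(x)^2 / n) / (n - 1)
        double variance = (sumsquare - sum * getAverage()) / (number - 1.0);
        // rounding can push a tiny variance slightly below zero
        return Math.sqrt(Math.max(variance, 0.0));
    }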

/**
     * AccumulatorParam<Statistics> implementation; use PARAM_INSTANCE rather than constructing one
     */
    public static class StatisticsAccumulatorParam implements AccumulatorParam<Statistics>, Serializable {
        // only use PARAM_INSTANCE
        private StatisticsAccumulatorParam() {}

        @Override
        public Statistics addAccumulator(final Statistics r, final Statistics t) {
            return r.add(t);
        }

        @Override
        public Statistics addInPlace(final Statistics r, final Statistics t) {
            return r.add(t);
        }

        @Override
        public Statistics zero(final Statistics initialValue) {
            // the identity element; folding initialValue in here would count it again in every task
            return Statistics.ZERO;
        }
    }

}
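
What makes Statistics usable as an accumulator is that it only stores running sums, and sums from separate partitions can simply be added. The following is a minimal sketch of that property, not the class above: RunningStats and RunningStatsDemo are hypothetical names, and min/max are left out to keep it short. It summarizes a data set in one pass and in two merged pieces and prints both results for comparison.

```java
import java.util.Arrays;

// Hypothetical stand-in for the article's Statistics class, reduced to the
// running sums needed to show that partial results can be merged.
final class RunningStats {
    final int n;
    final double sum;
    final double sumSq;

    RunningStats(int n, double sum, double sumSq) {
        this.n = n;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // Summarize one chunk of data, as a single task would.
    static RunningStats of(double... values) {
        double s = 0;
        double sq = 0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new RunningStats(values.length, s, sq);
    }

    // Merging is plain addition of the sums, so the merge order does not matter
    // (up to floating-point rounding).
    RunningStats merge(RunningStats other) {
        return new RunningStats(n + other.n, sum + other.sum, sumSq + other.sumSq);
    }

    double mean() {
        return sum / n;
    }

    // Sample standard deviation: sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1)); needs n >= 2.
    double sampleSd() {
        double variance = (sumSq - sum * sum / n) / (n - 1.0);
        return Math.sqrt(Math.max(variance, 0.0));
    }
}

class RunningStatsDemo {
    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        RunningStats whole = RunningStats.of(data);
        RunningStats left = RunningStats.of(Arrays.copyOfRange(data, 0, 3));
        RunningStats right = RunningStats.of(Arrays.copyOfRange(data, 3, data.length));
        RunningStats merged = left.merge(right);
        System.out.println("whole:  mean=" + whole.mean() + " sd=" + whole.sampleSd());
        System.out.println("merged: mean=" + merged.mean() + " sd=" + merged.sampleSd());
    }
}
```

The sum-of-squares formula is convenient for merging but can lose precision when the mean is large compared with the spread, which is why the sketch clamps the variance at zero.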

The code may be used in two ways: to create an accumulator, or directly, as in the later sample using combineByKey.

To create an accumulator, write:
// Make an accumulator using Statistics
final Accumulator<Statistics> totalLetters = ctx.accumulator(Statistics.ZERO, "Total Letters ", Statistics.PARAM_INSTANCE);
// lines from word count
JavaRDD<String> lines = ctx.textFile(args[0], 4);

JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public Iterable<String> call(final String s) throws Exception {
// Handle accumulator here
totalLetters.add(new Statistics(s.length())); // one sample per line: its length in letters
... // other stuff
}
});
// more code
Statistics letterStatistics = totalLetters.value();
int numberLines = letterStatistics.getNumber(); // number of samples, one per line
double averageLineLength = letterStatistics.getAverage();

When multiple keys are involved, the same structure may be used in combineByKey to generate separate statistics for each key.

package com.lordjoe.distributed;

import com.lordjoe.distributed.spark.*;
import org.apache.spark.*;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.*;
import scala.*;

import java.lang.Double;
import java.util.*;

/**
 * com.lordjoe.distributed.StatisticalTests
 * Demonstration of custom accumulators
 * User: Steve
 * Date: 9/2/2014
 */
public class StatisticalTests {

public static final Integer[] keys = {1, 2, 3, 4, 5, 6, 7, 8};
    public static final Random RND = new Random();

public static final int MIN_ENTRIES = 4000;
    public static final int MAX_ENTRIES = 10000;

/**
     * return a distribution with mean 4 * key
     * sd key
     * @param key
     * @return
     */
    private static double buildStats(final Integer key) {
        double v = RND.nextGaussian();
        v *= key;
        v += 4 * key;
        return v ;
    }

/**
     * Generates Gaussian samples for each key and prints per-key statistics.
     * No command line arguments are used.
     *
     * @param args ignored
     */
    public static void main(final String[] args) {

SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("StatisticalTests");

Option<String> option = sparkConf.getOption("spark.master");
        if (!option.isDefined()) {   // use local over nothing
            sparkConf.setMaster("local[*]");
        }
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);

List<Integer> keyList = Arrays.asList(keys);
        JavaRDD<Integer> keys = ctx.parallelize(keyList);

/*
         * generate values - key = integer, value = gaussian(mean 4 * key, sd key)
         */
        JavaPairRDD<Integer, Double> dataWithStatistics = keys.flatMapToPair(new PairFlatMapFunction<Integer, Integer, Double>() {
            @Override
            public Iterable<Tuple2<Integer, Double>> call(final Integer key) throws Exception {
                List<Tuple2<Integer, Double>> holder = new ArrayList<Tuple2<Integer, Double>>();
                // between MIN_ENTRIES (inclusive) and MAX_ENTRIES (exclusive)
                int numberEntries = MIN_ENTRIES + RND.nextInt(MAX_ENTRIES - MIN_ENTRIES);
                for (int i = 0; i < numberEntries; i++) {
                    double stats = buildStats(key);
                    holder.add(new Tuple2<Integer, Double>(key,stats)) ;
                 }
                 return holder;
            }
        });

/*
             Create statistics by using combineByKey
         */
        JavaPairRDD<Integer, Statistics> generatedStatistics = dataWithStatistics.combineByKey(
                new Function<Double, Statistics>() {
                    @Override
                    public Statistics call(final Double start) throws Exception {
                        return new Statistics(start);
                    }
                }, new Function2<Statistics, Double, Statistics>() {
                    @Override
                    public Statistics call(final Statistics in, final Double added) throws Exception {
                        return in.add(added);
                    }
                },
                new Function2<Statistics, Statistics, Statistics>() {
                    @Override
                    public Statistics call(final Statistics in, final Statistics added) throws Exception {
                        return in.add(added);
                    }
                }

);

List<Tuple2<Integer, Statistics>> statistics = generatedStatistics.collect();

for (Tuple2<Integer, Statistics> statistic : statistics) {
            Integer key = statistic._1();
            Statistics value = statistic._2();
            System.out.println("key  = " + key);
            System.out.println("total values  = " + value.getNumber());
             System.out.println("average  = " + String.format("%10.2f", value.getAverage()));
            System.out.println("Sd  = " + String.format("%10.2f", value.getStandardDeviation()));
            System.out.println("Max  = " + value.getMax());
            System.out.println("Min  = " + value.getMin());
            System.out.println("================================================");
         }
     }
 }

All code for this article is available here.

Friday, November 14, 2014

The Power of Spark Accumulators

Spark Accumulators (discussed in an earlier blog) are massively more powerful than Hadoop counters because they support multiple types of data. I have not seen discussions of using accumulators to hold large sets of data, something that some of the classes discussed here could certainly do. The code discussed here is available here.

The only thing required for an accumulator is an AccumulatorParam instance defining how to construct a zero element and how to combine multiple instances.

AccumulatorParam using a Long as a counter (accumulator)
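
A minimal sketch of what a Long counter could look like, assuming the same three-method contract (addAccumulator, addInPlace, zero) used by the other params in this post. So that it compiles without Spark on the classpath, ParamSketch is a local stand-in for Spark's AccumulatorParam. ParamSketch, LongCounterParam and LongCounterDemo are hypothetical names, not classes from the project. In a real job the class would implement org.apache.spark.AccumulatorParam<Long> instead.

```java
import java.io.Serializable;

// Stand-in with the same three methods the article's classes implement from
// Spark's AccumulatorParam<T>; declared locally so the sketch compiles without Spark.
interface ParamSketch<T> extends Serializable {
    T addAccumulator(T r, T t);
    T addInPlace(T r, T t);
    T zero(T initialValue);
}

// Hypothetical counter param: values combine by addition and the zero element is 0.
final class LongCounterParam implements ParamSketch<Long> {
    static final LongCounterParam INSTANCE = new LongCounterParam();

    private LongCounterParam() {}

    @Override
    public Long addAccumulator(Long r, Long t) {
        return r + t;
    }

    @Override
    public Long addInPlace(Long r, Long t) {
        return r + t;
    }

    @Override
    public Long zero(Long initialValue) {
        return 0L;
    }
}

class LongCounterDemo {
    public static void main(String[] args) {
        ParamSketch<Long> param = LongCounterParam.INSTANCE;
        // Two simulated tasks each start from zero and count locally.
        Long task1 = param.zero(0L);
        Long task2 = param.zero(0L);
        for (int i = 0; i < 3; i++) task1 = param.addAccumulator(task1, 1L);
        for (int i = 0; i < 5; i++) task2 = param.addAccumulator(task2, 1L);
        // The partial counts are then merged into one total.
        Long total = param.addInPlace(param.addInPlace(param.zero(0L), task1), task2);
        System.out.println("total = " + total);
    }
}
```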

AccumulatorParam to accumulate a single string by concatenation

import org.apache.spark.*;
import java.io.*;

/**
 * com.lordjoe.distributed.spark.StringAccumulableParam
 * Usage
 *      final Accumulator<String> myString = ctx.accumulator("", "Letter Statistics", new StringAccumulableParam("\t"));
 *     ...
 *        myString.add("xyxzzy");
 * User: Steve
 * Date: 11/12/2014
 */
public class StringAccumulableParam implements AccumulatorParam<String>, Serializable {

private final String separator;

public StringAccumulableParam(final String pSeparator) {
        separator = pSeparator;
    }
    public StringAccumulableParam( ) {
         this(",");
     }

@Override
    public String addAccumulator(final String r, final String t) {
        if(r.isEmpty())
            return t;
        if(t.isEmpty())
             return r;
         return r +  separator + t;
    }
     @Override
    public String addInPlace(final String r , final String t) {
         if(r.isEmpty())
             return t;
         if(t.isEmpty())
              return r;
        return  r +  separator + t;
    }
     @Override
    public String zero(final String initialValue) {
        return "";
    }
}

AccumulatorParam using a Set of Strings as an accumulator

import org.apache.spark.*;
import java.io.*;
import java.util.*;

/**
 * com.lordjoe.distributed.spark.StringSetAccumulableParam
 * Accumulates a set of Strings
 * Usage
 *      // start from a mutable set: addInPlace calls addAll on the current value
 *      final Accumulator<Set<String>> myStringSet = ctx.accumulator(new HashSet<String>(), "String Set", StringSetAccumulableParam.INSTANCE);
 *     ...
 *         Set<String> toAdd  = new HashSet<String>();
 *         toAdd.add("xyxzzy");
 *         myStringSet.add(toAdd);
 * User: Steve
 * Date: 11/12/2014
 */
public class StringSetAccumulableParam implements AccumulatorParam<Set<String>>, Serializable {

    public static final StringSetAccumulableParam INSTANCE = new StringSetAccumulableParam();

    private StringSetAccumulableParam() {}

    @Override
    public Set<String> addAccumulator(final Set<String> r, final Set<String> t) {
        r.addAll(t);    // todo ask if we should make a new set
        return r;
    }

    @Override
    public Set<String> addInPlace(final Set<String> r, final Set<String> t) {
        r.addAll(t);
        return r;
    }

    @Override
    public Set<String> zero(final Set<String> initialValue) {
        return new HashSet<String>(initialValue);
    }
}
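
The todo in addAccumulator asks whether a new set should be made. The sketch below, using only java.util, contrasts the two choices under the assumption that the merge is either mutating or copying. SetMergeSketch and its methods are hypothetical helpers, not part of the project. The practical consequence of the mutating version is that the accumulator's starting value must be a mutable set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper contrasting the two ways a set-valued accumulator can merge.
class SetMergeSketch {

    // Mutating merge: cheap, but requires a mutable left operand.
    static Set<String> mergeInPlace(Set<String> r, Set<String> t) {
        r.addAll(t);
        return r;
    }

    // Copying merge: never touches its arguments, at the cost of a new set per call.
    static Set<String> mergeCopy(Set<String> r, Set<String> t) {
        Set<String> result = new HashSet<String>(r);
        result.addAll(t);
        return result;
    }

    public static void main(String[] args) {
        Set<String> toAdd = new HashSet<String>(Arrays.asList("xyzzy", "plugh"));

        // Works with a mutable starting value.
        System.out.println(mergeInPlace(new HashSet<String>(), toAdd));

        // Fails with an immutable one such as Collections.emptySet().
        try {
            mergeInPlace(Collections.<String>emptySet(), toAdd);
        } catch (UnsupportedOperationException e) {
            System.out.println("in-place merge rejected an immutable set");
        }

        // The copying merge accepts either.
        System.out.println(mergeCopy(Collections.<String>emptySet(), toAdd));
    }
}
```

Copying on every add is safer but allocates a set per call, which matters if the function is called once per record.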

How to use accumulators

Accumulators may be used in two ways. First, the accumulator may be created in code as a final variable in the scope of the function. This is especially useful for lambdas, that is, functions created inline.

The following is an illustration of this

Using an accumulator as a final local variable

Alternatively, a function may be defined with an accumulator as a member variable. Here the function is defined as a class and later used. I prefer this approach to lambdas, especially if significant work is done in the function.
In a later blog I will discuss using a base class for more sophisticated logging.

Using an accumulator as a member variable

Wednesday, November 12, 2014

Managing Spark Accumulators

Managing Accumulators

As I move from relatively simple Spark problems to significant ones such as proteomic search, especially those with enough processing to raise performance issues, it is important to be able to measure and track performance.

The first issue I want to track is where (on which slave process) and how often a function is called. In a later post I will discuss subclassing to instrument Functions for logging and centralizing the logging code in a common base class. This section covers how to keep global counts of function use inside a Spark project. All code is in the Google Code distributed-tools project.

Spark provides a class called Accumulator for this purpose.

In Hadoop, Counters are very useful in tracking performance. A counter is an object that any portion of the code can get and increment. The values of a counter are eventually consistent. Even current values (if available) are useful in tracking the progress of a job.

In a Spark job with a number of steps, Accumulators may be used to track the number of operations in each step.

Getting an Accumulator

JavaSparkContext has a method accumulator which returns an accumulator with a given name.

JavaSparkContext currentContext = SparkUtilities.getCurrentContext();

String accName = "My Accumulator Name";

Accumulator<Integer> accumulator = currentContext.accumulator(0, accName );

Accumulators should be created in the driver program and can be serialized into functions.

Using an Accumulator

Accumulators may be of any type supporting an add operation, although only the types Integer and Double are hard coded. The accumulator has an add method taking the proper type, say an Integer, and will add it to the current value.

Accumulator<Integer> numberCalls = ...

...

numberCalls.add(1); // increment

Reading an Accumulator

Accumulators may be read only in the driver program. The accumulator has a value method which returns the current value of the accumulator. After RDDs are collected this should be accurate.

Accumulator<Integer> numberCalls = ...

...

int myCallCount = numberCalls.value( );

My Library Code

The class com.lordjoe.distributed.spark.SparkAccumulators is designed to make handling Accumulators simpler. The class is a singleton containing a Map from accumulator names to Accumulators. A function with access to the singleton, say through a final local variable or, as in my code, a member variable set in the constructor, can use the instance to look up an existing accumulator.

createInstance needs to be called once in the driver program to initialize the singleton. It is normally called in a function which initializes the library. The call is required to make sure one object is constructed in the driver.

SparkAccumulators.createInstance();

Once initialized the instance can be gotten and used to create an accumulator. These are stored in the internal map.

SparkAccumulators instance = SparkAccumulators.getInstance();

instance.createAccumulator("MyAccumulator");

Accumulators may be incremented using the incrementAccumulator function. There is an alternative version taking an amount to increment; the default is 1.

instance.incrementAccumulator("MyAccumulator");
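
The singleton-plus-map design can be sketched without Spark. In the sketch below AtomicLong stands in for a Spark Accumulator, and CounterRegistrySketch, CounterRegistryDemo and the value method are hypothetical. Only the method names createInstance, getInstance, createAccumulator and incrementAccumulator are taken from the text above, and the real class may behave differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical name-to-counter registry illustrating the singleton-plus-map design.
// AtomicLong stands in for a Spark Accumulator so the sketch runs without Spark.
final class CounterRegistrySketch {
    private static CounterRegistrySketch instance;

    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<String, AtomicLong>();

    private CounterRegistrySketch() {}

    // Call once during start-up, mirroring createInstance().
    static synchronized void createInstance() {
        if (instance == null)
            instance = new CounterRegistrySketch();
    }

    static synchronized CounterRegistrySketch getInstance() {
        if (instance == null)
            throw new IllegalStateException("createInstance() has not been called");
        return instance;
    }

    // Registers a counter under the name; an existing counter is left untouched.
    void createAccumulator(String name) {
        counters.putIfAbsent(name, new AtomicLong());
    }

    // Default increment is 1.
    void incrementAccumulator(String name) {
        incrementAccumulator(name, 1);
    }

    void incrementAccumulator(String name, long amount) {
        lookup(name).addAndGet(amount);
    }

    long value(String name) {
        return lookup(name).get();
    }

    private AtomicLong lookup(String name) {
        AtomicLong counter = counters.get(name);
        if (counter == null)
            throw new IllegalArgumentException("unknown accumulator " + name);
        return counter;
    }
}

class CounterRegistryDemo {
    public static void main(String[] args) {
        CounterRegistrySketch.createInstance();
        CounterRegistrySketch registry = CounterRegistrySketch.getInstance();
        registry.createAccumulator("MyAccumulator");
        registry.incrementAccumulator("MyAccumulator");
        registry.incrementAccumulator("MyAccumulator", 4);
        System.out.println("MyAccumulator = " + registry.value("MyAccumulator"));
    }
}
```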

Use in a function

public class MyFunction<T, R> implements Function<T, R>, Serializable {

private final SparkAccumulators accumulators;

public MyFunction() {

accumulators = SparkAccumulators.getInstance(); // the constructor runs in the driver

}

public R call(T input) {

// doStuff

accumulators.incrementAccumulator("MyAccumulator"); // keep count

return null; // placeholder: return the real result of doStuff

}

}

Better code using inheritance to put all logging in one place will be discussed in a later blog.

Wednesday, November 5, 2014

Spark Utilities

All code described here is in the projects at distributed-tools on code.google.com. The functions described are in the class com.lordjoe.distributed.SparkUtilities in the subproject spark-implementation.

In working with Spark I find there is a need for a library of commonly used functions. I put these in a class called SparkUtilities. My general convention is that classes named utilities are collections of static functions.

A major function is getCurrentContext(). Because JavaSparkContexts cannot be serialized, it is not possible to pass a context into a function. The body of the function executing on a slave process will need to find a local copy of the context. If no local copy exists then one will need to be constructed. All of this is handled by getCurrentContext(). It caches a constructed context in a transient field. The transient keyword will cause the field not to be serialized. The code below will cause one JavaSparkContext to be constructed per slave VM.

One important function is guaranteeSparkMaster. When running on a machine without a cluster, the Spark master will be undefined. Calling sparkConf.setMaster("local[*]"); causes the job to run with a local master (with the proper threads for the processor). This is good for debugging. Because the code does this, there is no need to set up code or a command line to run locally. If there is no cluster available, getCurrentContext defaults to a local cluster.

//  private transient static ThreadLocal<JavaSparkContext> threadContext;
    private transient static JavaSparkContext threadContext;
    //  private transient static ThreadLocal<JavaSQLContext> threadContext;
    private transient static JavaSQLContext sqlContext;
    private static final Properties sparkProperties = new Properties();
    private static String appName = "Anonymous";
     private static boolean local;

public static boolean isLocal() {
        return local;
    }
   
    /**
     * create a JavaSparkContext for the thread if none exists
     *
     * @return
     */
    public static synchronized JavaSparkContext getCurrentContext() {
//        if (threadContext == null)
//            threadContext = new ThreadLocal<JavaSparkContext>();
//        JavaSparkContext ret = threadContext.get();
        JavaSparkContext ret = threadContext;
        if (ret != null)
            return ret;
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName(getAppName());
        SparkUtilities.guaranteeSparkMaster(sparkConf);

        sparkConf.set("spark.mesos.coarse", "true");
        sparkConf.set("spark.executor.memory", "2500m");
        // build exactly one context, after all properties are set;
        // Spark expects a single active SparkContext per JVM
        ret = new JavaSparkContext(sparkConf);
        threadContext = ret;
        return ret;
    }
 
    public static synchronized Configuration getHadoopConfiguration() {
        Configuration configuration = getCurrentContext().hadoopConfiguration();
        return configuration;
    }
 
   /**
     * if no spark master is defined then use "local[*]"
     *
     * @param sparkConf the configuration
     */
    public static void guaranteeSparkMaster(@Nonnull SparkConf sparkConf) {
        Option<String> option = sparkConf.getOption("spark.master");

        if (!option.isDefined()) {   // use local over nothing
            sparkConf.setMaster("local[*]");
            setLocal(true);
            /**
             * liquanpei@gmail.com suggests to correct
             * 14/10/08 09:36:35 ERROR broadcast.TorrentBroadcast: Reading broadcast variable 0 failed
             14/10/08 09:36:35 INFO broadcast.TorrentBroadcast: Reading broadcast variable 0 took 5.006378813 s
             14/10/08 09:36:35 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 0
             14/10/08 09:36:35 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
             java.lang.NullPointerException
             at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
             at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)

*/
            //  sparkConf.set("spark.broadcast.factory","org.apache.spark.broadcast.HttpBroadcastFactory" );
        }
        else {
            setLocal(option.get().startsWith("local"));
        }
        // set all properties in the SparkProperties file
        for (String property : sparkProperties.stringPropertyNames()) {
            if (!property.startsWith("spark."))
                continue;
            sparkConf.set(property, sparkProperties.getProperty(property));
        }
    }

Realize and Return - debugging Spark

All code described here is in the projects at distributed-tools on code.google.com. The functions described are in the class com.lordjoe.distributed.SparkUtilities in the subproject spark-implementation.

Spark and Java 8 streams use lazy evaluation to manage operations on collections. This means that when a line applies a transformation, the operations are saved but not executed until evaluation is required. While this makes operation efficient, it makes debugging difficult. During development of small samples on a single machine running in local mode, it is frequently useful to stop and look at the results before passing them to the next step.
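
Java 8 streams make this behaviour easy to see without a cluster. In the sketch below (LazyEvaluationSketch and doubledLazily are hypothetical names), the map function records each call in a log. Building the pipeline records nothing, and the calls only happen when a terminal operation asks for results, much as collect() forces an RDD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyEvaluationSketch {

    // Returns a pipeline that doubles each value and logs every call to the mapper.
    // Nothing is evaluated here; the mapper only runs once a terminal operation is invoked.
    static Stream<Integer> doubledLazily(List<Integer> input, List<String> log) {
        return input.stream().map(x -> {
            log.add("map " + x);
            return x * 2;
        });
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<String>();

        Stream<Integer> doubled = doubledLazily(Arrays.asList(1, 2, 3), log);
        System.out.println("mapper calls after building the pipeline: " + log.size());

        // The terminal operation forces evaluation.
        List<Integer> result = doubled.collect(Collectors.toList());
        System.out.println("mapper calls after collect: " + log.size());
        System.out.println("result: " + result);
    }
}
```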

/**
     * force a JavaPairRDD to evaluate then return the results as a JavaPairRDD
     *
     * @param inp this is an RDD - usually one you want to examine during debugging
     * @param handler all tuples are passed here
     * @param <K> key type of inp
     * @param <V> value type of inp
     * @return non-null RDD of the same values but realized
     */
    @Nonnull
    public static <K, V> JavaPairRDD<K, V> realizeAndReturn(@Nonnull final JavaPairRDD<K, V> inp, ObjectFoundListener<Tuple2<K, V>> handler) {
        JavaSparkContext jcx = getCurrentContext();
        if (!isLocal())    // not to use on the cluster - only for debugging
            return inp;
        List<Tuple2<K, V>> collect = (List<Tuple2<K, V>>) (List) inp.collect();    // break here and take a look
        return (JavaPairRDD<K, V>) jcx.parallelizePairs(collect);
    }

/**
     * force a JavaRDD to evaluate then return the results as a JavaRDD
     *
     * @param inp this is an RDD - usually one you want to examine during debugging
     * @param handler all objects are passed here
     * @param <V> whatever inp is a list of
     * @return non-null RDD of the same values but realized
     */
    @Nonnull
    public static <V> JavaRDD<V> realizeAndReturn(@Nonnull final JavaRDD<V> inp, ObjectFoundListener<V> handler) {
        JavaSparkContext jcx = getCurrentContext();
        if (!isLocal())    // not to use on the cluster - only for debugging
            return inp;
        List<V> collect = (List<V>) (List) inp.collect();    // break here and take a look
        return (JavaRDD<V>) jcx.parallelize(collect);
    }

These functions require that all data be held in memory in a List, which is not a good idea for Big Data sets but fine for debugging. The code does two things.
First, it forces all code to execute. This allows debugging of all the steps up to the realization and can isolate errors.
Second, all results are held in a list. Placing a break point allows the list to be examined to see if the values are reasonable.

Note that for any JavaRDD or JavaPairRDD passed to realizeAndReturn, the return is of the same type as the original and can serve in the code as a new value.
My general strategy is to follow each operation with a line of realizeAndReturn and comment these lines out as things are successful.
When problems arise the lines can be uncommented, forcing more frequent evaluation and allowing a peek at intermediate results.
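
The same pattern can be tried on Java 8 streams, which run without a cluster. The sketch below is an analog, not the SparkUtilities code: RealizeSketch is a hypothetical class, and a java.util.function.Consumer stands in for the handler. It forces evaluation, lets the caller inspect the list, and hands back a fresh stream of the same type so the following lines do not need to change.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RealizeSketch {

    // Hypothetical stream analog of realizeAndReturn: force evaluation, let an inspector
    // (or a breakpoint) look at the values, then return a new stream over the same values.
    static <T> Stream<T> realizeAndReturn(Stream<T> inp, Consumer<List<T>> inspector) {
        List<T> realized = inp.collect(Collectors.toList());   // break here and take a look
        inspector.accept(realized);
        return realized.stream();
    }

    public static void main(String[] args) {
        Stream<String> words = Stream.of("red", "badge", "of", "courage");
        Stream<Integer> lengths = words.map(String::length);
        // Comment this line out once the step is known to work.
        lengths = realizeAndReturn(lengths, list -> System.out.println("lengths so far: " + list));
        int total = lengths.mapToInt(Integer::intValue).sum();
        System.out.println("total letters: " + total);
    }
}
```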
