Friday, November 14, 2014

More on Spark Accumulators

The Power of Spark Accumulators

 Spark Accumulators, discussed in an earlier blog, are massively more powerful than Hadoop counters because they support multiple types of data.  I have not seen discussions of accumulators holding large sets of data, something that some of the classes discussed here could certainly do. The code discussed here is available here.

The only thing required for an accumulator is an AccumulatorParam instance defining how to construct a zero element and how to combine multiple instances.


AccumulatorParam using a Long as a counter (accumulator)
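A minimal sketch of such a class, written here in PySpark style (the class name is my own, and the import falls back to a plain base class so the snippet stands alone when Spark is not installed):

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    # Fallback so the sketch runs without Spark on the path;
    # in a real job the pyspark import above is what you want.
    AccumulatorParam = object

class LongAccumulatorParam(AccumulatorParam):
    """A simple counter: zero is 0 and combining is addition."""

    def zero(self, initial_value):
        # The zero element, regardless of the initial value supplied.
        return 0

    def addInPlace(self, v1, v2):
        # Merge two partial counts into a single count.
        return v1 + v2
```

In a driver this would be registered with something like `counter = sc.accumulator(0, LongAccumulatorParam())`.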



AccumulatorParam to accumulate a single string by concatenation
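Again as a PySpark-flavored sketch (class name assumed, import fallback included so the snippet runs on its own): the zero element is the empty string and combining is concatenation.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    AccumulatorParam = object  # fallback so the sketch runs without Spark

class StringConcatAccumulatorParam(AccumulatorParam):
    """Accumulate a single string by concatenation."""

    def zero(self, initial_value):
        # The empty string is the identity for concatenation.
        return ""

    def addInPlace(self, s1, s2):
        # Merge two partial results by appending one to the other.
        return s1 + s2
```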



AccumulatorParam using a Set of Strings as an accumulator
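This is the case mentioned in the introduction where the accumulator can hold a large collection. A PySpark-flavored sketch (class name assumed, import fallback included): zero is the empty set and combining is set union.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    AccumulatorParam = object  # fallback so the sketch runs without Spark

class StringSetAccumulatorParam(AccumulatorParam):
    """Accumulate a set of strings; merging is set union."""

    def zero(self, initial_value):
        # The empty set is the identity for union.
        return set()

    def addInPlace(self, s1, s2):
        # In-place union avoids copying the (possibly large) first set.
        s1 |= s2
        return s1
```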



How to use accumulators

Accumulators may be used in two ways. First, the accumulator may be created in code as a final variable in the scope of the function; this is especially useful for lambdas, functions defined inline.
The following is an illustration of this:

Using an accumulator as a final local variable
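One way this pattern can look, sketched in PySpark terms (the function and variable names are illustrative; `sc` is assumed to be an existing SparkContext and `lines` an RDD of strings):

```python
def count_empty_lines(sc, lines):
    """Strip lines while counting empty ones via a captured accumulator.

    The accumulator is a local variable captured by the inline function,
    which is the closure style described above.
    """
    empty = sc.accumulator(0)

    def strip_and_count(line):
        if not line.strip():
            empty.add(1)          # the closure captures `empty`
        return line.strip()

    lines.map(strip_and_count).count()   # an action forces evaluation
    return empty.value
```

Note that the accumulator's value is only reliable after an action has run.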




Alternatively, a function may be defined with an accumulator as a member variable. Here the function is defined as a class and later used. I prefer this approach to lambdas, especially if significant work is done in the function.
In a later blog I will discuss using a base class for more sophisticated logging.

Using an accumulator as a member variable
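The same map operation as above, rewritten with the accumulator passed into a function class at construction time (class and variable names are again illustrative):

```python
class StripAndCount:
    """A map function written as a class; the accumulator is a member."""

    def __init__(self, empty_accumulator):
        self.empty = empty_accumulator

    def __call__(self, line):
        if not line.strip():
            self.empty.add(1)     # use the member, not a captured local
        return line.strip()

# In a driver this would be used roughly as:
#   empty = sc.accumulator(0)
#   cleaned = lines.map(StripAndCount(empty))
#   cleaned.count()               # run an action, then read empty.value
```

Because the function is an ordinary class, it can be unit-tested on its own and extended with additional members as the work it does grows.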


