Friday, November 14, 2014

More on Spark Accumulators

The Power of Spark Accumulators

 Spark Accumulators, discussed in an earlier blog, are massively more powerful than Hadoop counters because they support multiple types of data.  I have not seen discussions of accumulators holding large sets of data, something that some of the classes discussed here could certainly do. The code discussed here is available here.

The only thing required for an accumulator is an AccumulatorParam instance defining how to construct a zero element and how to combine multiple instances.


AccumulatorParam using a Long as a counter (accumulator)
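A minimal sketch of such a class, written here in PySpark style (the class name is my own, and the import falls back to a plain base class so the snippet stands alone when Spark is not installed):

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    # Fallback so the sketch runs without Spark on the path;
    # in a real job the pyspark import above is what you want.
    AccumulatorParam = object

class LongAccumulatorParam(AccumulatorParam):
    """A simple counter: zero is 0 and combining is addition."""

    def zero(self, initial_value):
        # The zero element, regardless of the initial value supplied.
        return 0

    def addInPlace(self, v1, v2):
        # Merge two partial counts into a single count.
        return v1 + v2
```

In a driver this would be registered with something like `counter = sc.accumulator(0, LongAccumulatorParam())`.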



AccumulatorParam to accumulate a single string by concatenation
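Again as a PySpark-flavored sketch (class name assumed, import fallback included so the snippet runs on its own): the zero element is the empty string and combining is concatenation.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    AccumulatorParam = object  # fallback so the sketch runs without Spark

class StringConcatAccumulatorParam(AccumulatorParam):
    """Accumulate a single string by concatenation."""

    def zero(self, initial_value):
        # The empty string is the identity for concatenation.
        return ""

    def addInPlace(self, s1, s2):
        # Merge two partial results by appending one to the other.
        return s1 + s2
```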



AccumulatorParam using a Set of Strings as an accumulator
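This is the case mentioned in the introduction where the accumulator can hold a large collection. A PySpark-flavored sketch (class name assumed, import fallback included): zero is the empty set and combining is set union.

```python
try:
    from pyspark.accumulators import AccumulatorParam
except ImportError:
    AccumulatorParam = object  # fallback so the sketch runs without Spark

class StringSetAccumulatorParam(AccumulatorParam):
    """Accumulate a set of strings; merging is set union."""

    def zero(self, initial_value):
        # The empty set is the identity for union.
        return set()

    def addInPlace(self, s1, s2):
        # In-place union avoids copying the (possibly large) first set.
        s1 |= s2
        return s1
```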



How to use accumulators

Accumulators may be used in two ways. First, the accumulator may be created in code as a final variable in the scope of the function; this is especially useful for lambdas, functions defined inline.
The following is an illustration of this:

Using an accumulator as a final local variable
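One way this pattern can look, sketched in PySpark terms (the function and variable names are illustrative; `sc` is assumed to be an existing SparkContext and `lines` an RDD of strings):

```python
def count_empty_lines(sc, lines):
    """Strip lines while counting empty ones via a captured accumulator.

    The accumulator is a local variable captured by the inline function,
    which is the closure style described above.
    """
    empty = sc.accumulator(0)

    def strip_and_count(line):
        if not line.strip():
            empty.add(1)          # the closure captures `empty`
        return line.strip()

    lines.map(strip_and_count).count()   # an action forces evaluation
    return empty.value
```

Note that the accumulator's value is only reliable after an action has run.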




Alternatively, a function may be defined with an accumulator as a member variable. Here the function is defined as a class and later used. I prefer this approach to lambdas, especially if significant work is done in the function.
In a later blog I will discuss using a base class for more sophisticated logging.

Using an accumulator as a member variable
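The same map operation as above, rewritten with the accumulator passed into a function class at construction time (class and variable names are again illustrative):

```python
class StripAndCount:
    """A map function written as a class; the accumulator is a member."""

    def __init__(self, empty_accumulator):
        self.empty = empty_accumulator

    def __call__(self, line):
        if not line.strip():
            self.empty.add(1)     # use the member, not a captured local
        return line.strip()

# In a driver this would be used roughly as:
#   empty = sc.accumulator(0)
#   cleaned = lines.map(StripAndCount(empty))
#   cleaned.count()               # run an action, then read empty.value
```

Because the function is an ordinary class, it can be unit-tested on its own and extended with additional members as the work it does grows.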


