Taming The Elephant: 2015

Using KryoSerializer in SPARK

There is a general consensus that Kryo is a faster serializer than standard Java serialization. In my first months of using Spark I avoided Kryo serialization because Kryo requires all classes that will be serialized to be registered before use. The default errors when this is not done properly happen at read time with little indication as to the problem class.

My initial approach was to use Java serialization.. This is set by executing.

SparkConf  sparkConf = ...sparkConf.set("spark.serializer","org.apache.spark.serializer.JavaSerializer");

When you attempt to run code (I recommend running a small data set locally until all issues are resolved). All classes used by the JavaSerializer must implement the tagging interface java.io.Serializable . My experience is that many classes will not and this capability must be added to your code. Remember the classes which were changed because if and when the code is switched to use Kyro these classes will need to be registered.

Eventually the Spark code will work using Java serialization. I do not recommend optimizations like using Kryo until there is confidence that the code will run under SPARK as there are many other issues to resolve and optimization should not be a major early concern.

Once code is running well using JavaSerializer the developer might switch code to Kryo.

KryoRegistrar Class

Before Kryo can be used there needs to be a class implementing KryoRegistrator and registered as the spark.kryo.registrator. The class I use is listed below omitting only most of a long list of classes registered. the critical method is registerClasses. I added helper methods allowing classes to be registered by Class or full name and registering Arrays of registered classes since Kryo seems to need arrays registered as well.
Once a registrar is created, add lines of code to register all classes JavaSerializer complained about (you did save these didn't you?). Next run the program. It will crash with a KryoException but the message will name the unregistered class and show code to register it. Using a small sample in local mode continue running, crashing and registering the offending class in local mode until Kryo is happy and the sample runs. At this point the code should run properly on the cluster. My experience with a reasonably large and complex program is an hour running with a good development environment , I use IntelliJ, should suffice to find and register all needed classes.

package com.lordjoe.distributed.hydra;

import com.esotericsoftware.kryo.*;
import org.apache.spark.serializer.*;
import javax.annotation.*;
import java.io.*;
import java.lang.reflect.*;

/**
 * com.lordjoe.distributed.hydra.HydraKryoSerializer
 * User: Steve
 * Date: 10/28/2014
 */
public class HydraKryoSerializer implements KryoRegistrator, Serializable {

public HydraKryoSerializer() {
    }

/**
     * register a class indicated by name
     *
     * @param kryo
     * @param s       name of a class - might not exist
     * @param handled Set of classes already handles
     */
    protected void doRegistration(@Nonnull Kryo kryo, @Nonnull String s ) {
        Class c;
        try {
            c = Class.forName(s);
            doRegistration(kryo,  c);
        }
        catch (ClassNotFoundException e) {
            return;
        }
     }

/**
      * register a class
      *
      * @param kryo
      * @param s       name of a class - might not exist
      * @param handled Set of classes already handles
      */
    protected void doRegistration(final Kryo kryo , final Class pC) {
           if (kryo != null) {
            kryo.register(pC);
               // also register arrays of that class
            Class arrayType = Array.newInstance(pC, 0).getClass();
            kryo.register(arrayType);
        }
    }

/**
     * do the real work of registering all classes
     * @param kryo
     */
     @Override
    public void registerClasses(@Nonnull Kryo kryo) {
        kryo.register(Object[].class);
        kryo.register(scala.Tuple2[].class);
 
        doRegistration(kryo, "scala.collection.mutable.WrappedArray$ofRef");
        doRegistration(kryo, "org.systemsbiology.xtandem.scoring.VariableStatistics");
        doRegistration(kryo, "org.systemsbiology.xtandem.scoring.SpectralPeakUsage$PeakUsage");
// and many more similar nines
 
    }

}
 ;

My preliminary data suggest that using Kryo reduces the run time of one of my larger samples from about 60 to around 40 minutes on a 16 node cluster. Your results may vary.

Friday, February 13, 2015

Using KryoSerializer in SPARK

Using KryoSerializer in SPARK

KryoRegistrar Class

About Me

Blog Archive