Taming The Elephant: Realize and Return

Realize and Return - debugging Spark

All code described here is in the projects at distributed-tools on code.google.com. The functions described are in the class com.lordjoe.distributed.SpareUtilities in the subproject spark-implementation.

Spark and Java 8 Streams use lazy evaluation to manage operations on collections. This means that when a line applying a transformation (a map or a filter, say) is executed, the operations are saved but not executed until evaluation is required. While this makes the operations efficient, it makes them difficult to debug. During development with small samples on a single machine running in local mode, it is frequently useful to stop and look at the results before passing them to the next step.
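The laziness is easy to see without a cluster. The sketch below uses plain Java 8 Streams rather than Spark; the class name LazyDemo and its log list are made up for illustration. The map step is recorded when the pipeline is built, but nothing is processed until the terminal collect call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: LazyDemo is not part of distributed-tools.
class LazyDemo {
    // Records every element the map step actually touches.
    static final List<String> log = new ArrayList<>();

    // Builds the pipeline; map is an intermediate operation, so nothing runs here.
    static Stream<Integer> buildPipeline() {
        return Stream.of(1, 2, 3).map(x -> {
            log.add("mapped " + x);
            return x * 10;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> pipeline = buildPipeline();
        System.out.println("after map: " + log.size() + " elements processed");

        // collect is a terminal operation and forces the saved map to run
        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("after collect: " + log.size() + " elements processed, result = " + result);
    }
}
```

A breakpoint placed on the map line alone would never see data; only the collect line has values to inspect.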

    /**
     * force a JavaPairRDD to evaluate then return the results as a JavaPairRDD
     *
     * @param inp this is an RDD - usually one you want to examine during debugging
     * @param handler intended to receive every tuple; unused in this version
     * @param <K> key type of inp
     * @param <V> value type of inp
     * @return non-null RDD of the same values but realized
     */
    @Nonnull
    public static <K, V> JavaPairRDD<K, V> realizeAndReturn(@Nonnull final JavaPairRDD<K, V> inp, ObjectFoundListener<Tuple2<K, V>> handler) {
        JavaSparkContext jcx = getCurrentContext();
        if (!isLocal())    // not to use on the cluster - only for debugging
            return inp;
        List<Tuple2<K, V>> collect = inp.collect();    // break here and take a look
        return jcx.parallelizePairs(collect);
    }

    /**
     * force a JavaRDD to evaluate then return the results as a JavaRDD
     *
     * @param inp this is an RDD - usually one you want to examine during debugging
     * @param handler intended to receive every object; unused in this version
     * @param <V> whatever inp is a list of
     * @return non-null RDD of the same values but realized
     */
    @Nonnull
    public static <V> JavaRDD<V> realizeAndReturn(@Nonnull final JavaRDD<V> inp, ObjectFoundListener<V> handler) {
        JavaSparkContext jcx = getCurrentContext();
        if (!isLocal())    // not to use on the cluster - only for debugging
            return inp;
        List<V> collect = inp.collect();    // break here and take a look
        return jcx.parallelize(collect);
    }

These functions require that all data be held in memory in a List - not a good idea for Big Data sets but fine for debugging. The code does two things.
First, it forces all code to execute. This allows debugging of all the steps up to the realization and can isolate errors.
Second, all results are held in a list. Placing a breakpoint on the collect line allows the list to be examined to see if the values are reasonable.
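How forcing evaluation isolates an error can be sketched with Java 8 Streams standing in for RDDs. The names RealizeDemo and findFailingStep below are made up for illustration, and this realizeAndReturn is a stream analogue, not the Spark version above. Because each step is realized before the next one is built, a bad value surfaces at the step that caused it instead of at some later terminal operation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: a Java 8 Stream analogue, not the Spark code.
class RealizeDemo {
    // Runs everything queued on inp, keeps the results in a List, returns a fresh Stream.
    static <T> Stream<T> realizeAndReturn(Stream<T> inp) {
        List<T> collect = inp.collect(Collectors.toList()); // break here and take a look
        return collect.stream();
    }

    // Returns the name of the step whose realization failed, or "none".
    static String findFailingStep(List<String> input) {
        Stream<String> trimmed;
        try {
            trimmed = realizeAndReturn(input.stream().map(String::trim));
        } catch (RuntimeException e) {
            return "trim";
        }
        try {
            realizeAndReturn(trimmed.map(Integer::parseInt));
        } catch (RuntimeException e) {
            return "parse";   // e.g. NumberFormatException from a malformed value
        }
        return "none";
    }

    public static void main(String[] args) {
        // "x3" is not a number, so the parse step is the one that fails
        System.out.println(findFailingStep(Arrays.asList(" 1", "2 ", "x3")));
    }
}
```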

Note that for any JavaRDD or JavaPairRDD the value returned by realizeAndReturn is of the same type as the original and can take its place in the code.
My general strategy is to follow each operation with a call to realizeAndReturn and comment the calls out as things start working.
When problems arise the lines can be uncommented, forcing more frequent evaluation and allowing a peek at intermediate results.
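The comment-in, comment-out strategy might look like the sketch below, again with Java 8 Streams standing in for RDDs. DebugPipeline and its debugging flag are hypothetical; the flag plays the role that isLocal() plays in the Spark version, so outside a debugging run the input is handed back untouched and stays lazy.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: a Java 8 Stream analogue, not the Spark code.
class DebugPipeline {
    // Hypothetical stand-in for isLocal(): true only while debugging.
    static boolean debugging = true;

    static <T> Stream<T> realizeAndReturn(Stream<T> inp) {
        if (!debugging)
            return inp;   // pass through untouched, still lazy
        List<T> collect = inp.collect(Collectors.toList()); // break here and take a look
        return collect.stream();
    }

    static List<Integer> run(List<Integer> input) {
        Stream<Integer> doubled = input.stream().map(x -> x * 2);
        doubled = realizeAndReturn(doubled);      // still under suspicion: keep the check

        Stream<Integer> large = doubled.filter(x -> x > 2);
        // large = realizeAndReturn(large);       // known good: commented out

        return large.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3)));
    }
}
```

Because the return type matches the input type, each check is a single line that reassigns the same variable, so adding or removing it never disturbs the surrounding code.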

Wednesday, November 5, 2014
