Wednesday, November 5, 2014

Realize and Return - debugging Spark

All code described here is in the projects at distributed-tools on code.google.com. The functions described are in the class com.lordjoe.distributed.SpareUtilities in the subproject spark-implementation.

Spark and Java 8 Streams use lazy evaluation to manage operations on collections. This means that when a line applies a transformation such as a map or filter, the operations are saved but not executed until evaluation is required. While this makes operations efficient, it makes debugging difficult. During development of small samples on a single machine running in local mode, it is frequently useful to stop and look at the results before passing them to the next step.
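To make the lazy behavior concrete, here is a minimal sketch using plain Java 8 Streams rather than Spark, so that it runs without a cluster or any extra libraries. The class name LazyEvaluationDemo and the LOG list are made up for this illustration. The log shows that the map function is not called when the pipeline is built, only when the terminal collect forces evaluation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of distributed-tools.
class LazyEvaluationDemo {
    // Records the order in which things actually happen.
    static final List<String> LOG = new ArrayList<>();

    static List<Integer> run() {
        LOG.clear();
        // Building the pipeline only saves the operation; the lambda is not called yet.
        Stream<Integer> doubled = Arrays.asList(1, 2, 3).stream()
                .map(x -> {
                    LOG.add("map " + x);
                    return x * 2;
                });
        LOG.add("pipeline built");
        // The terminal operation forces every saved step to execute.
        List<Integer> result = doubled.collect(Collectors.toList());
        LOG.add("collected");
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run());
        System.out.println(LOG); // "pipeline built" appears before any "map" entry
    }
}
```

A breakpoint inside the lambda is only hit during the collect call, which is why stepping through the line that defines the map shows nothing happening.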


The realizeAndReturn functions require that all data be held in memory in a List: not a good idea for Big Data sets, but fine for debugging. The code does two things.
First, it forces all code to execute. This allows debugging of all the steps up to the realization and can isolate errors.
Second, all results are held in a list. Placing a breakpoint allows the list to be examined to see if the values are reasonable.
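The two steps can be sketched for Java 8 Streams as follows. This is an analogue written for illustration, not the code from the project: the class name RealizeSketch is made up, and the method works on a Stream instead of a JavaRDD so that it runs with only the JDK. A Spark version would presumably do the same thing with JavaRDD.collect() to build the list and JavaSparkContext.parallelize(list) to rebuild the RDD, but that is an assumption about how the original is written.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the Spark helper, using Java 8 Streams.
class RealizeSketch {
    /**
     * Forces every pending operation on the stream to run, holds the
     * results in a List, and returns a new stream of the same element type.
     */
    static <T> Stream<T> realizeAndReturn(Stream<T> in) {
        // Step 1: the terminal operation executes all saved operations now.
        List<T> realized = in.collect(Collectors.toList());
        // Step 2: place a breakpoint on the next line to examine 'realized'.
        return realized.stream();
    }

    static List<String> demo() {
        Stream<String> upper = Stream.of("a", "b").map(String::toUpperCase);
        upper = realizeAndReturn(upper); // the map has run by the time this returns
        return upper.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the return type matches the input type, the call can be dropped into an existing pipeline without changing any of the surrounding code.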

realizeAndReturn can be used on any JavaRDD or JavaPairRDD: the return value is of the same type as the original and can serve in the code as a new value.
My general strategy is to follow each operation with a call to realizeAndReturn and comment the calls out as things are successful.
When problems arise, the lines can be uncommented, forcing more frequent evaluation and allowing a peek at intermediate results.
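The strategy might look like the sketch below, again using Java 8 Streams in place of Spark RDDs so that it runs on its own. The class PipelineDebugDemo, the wordLengths pipeline, and the Stream-based realizeAndReturn are all made up for this illustration. Each operation is followed by a realizeAndReturn line; the ones for steps already verified are commented out, and the one for the step still under suspicion is left active.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical example pipeline, not taken from distributed-tools.
class PipelineDebugDemo {
    // Stream analogue of the helper: run everything, hold it in a List, return a new stream.
    static <T> Stream<T> realizeAndReturn(Stream<T> in) {
        List<T> realized = in.collect(Collectors.toList());
        return realized.stream(); // breakpoint here to examine 'realized'
    }

    static List<Integer> wordLengths(List<String> lines) {
        Stream<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")));
        words = realizeAndReturn(words);            // still being checked: left active

        Stream<String> nonEmpty = words.filter(w -> !w.isEmpty());
        // nonEmpty = realizeAndReturn(nonEmpty);   // verified: commented out

        Stream<Integer> lengths = nonEmpty.map(String::length);
        // lengths = realizeAndReturn(lengths);     // verified: commented out

        return lengths.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(wordLengths(Arrays.asList("to be", "or not")));
    }
}
```

Since each call assigns back to the same variable, commenting a line in or out does not disturb the rest of the pipeline.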

