Wednesday, August 11, 2010

Building a Hadoop Jar File

Hadoop works by sending a jar file to each machine in a Hadoop cluster. The jar file contains all of the code, resources, and libraries required by the job. Normally I develop by getting my job running on a local Hadoop instance with a single processor and, once the code runs properly there, deploying it to a cluster.

Usually a job requires a significant amount of testing and debugging on the target cluster before it runs successfully, which means new jar files must be created and deployed rapidly during that phase. Eventually, the last successful jar file may be deployed to a cluster for production work. In my development process, building and deploying a jar file therefore needs to be a rapid and mindless operation.

A Hadoop jar file is a standard Java jar with one exception: all of the libraries used by the code are placed in a top-level directory called lib. Any class files on the classpath that are not inside jars can simply be copied into the jar file with the appropriate directory structure.
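As a minimal sketch of that layout (the class and jar names here are hypothetical), you can build a tiny jar in memory and list its entries back, much as `jar tf` would:

```java
import java.io.*;
import java.util.zip.*;

public class JarLayoutSketch {
    public static void main(String[] args) throws IOException {
        // Build a tiny jar in memory with the layout Hadoop expects:
        // class files at the top level, dependency jars under lib/
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(bytes);
        out.putNextEntry(new ZipEntry("org/example/MyJob.class")); // hypothetical job class
        out.closeEntry();
        out.putNextEntry(new ZipEntry("lib/extra-library.jar"));   // hypothetical dependency
        out.closeEntry();
        out.close();

        // List the entries back, as "jar tf" would
        ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
            System.out.println(e.getName());
        }
        in.close();
    }
}
```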

The easiest way to create the jar is to run a Java application using the same classpath that was used when the job was tested on a single-machine cluster. The contents of any directories found in the classpath are copied directly into the jar file, and any jars in the classpath are added under the lib directory. The only problem with this approach is that jars which are not needed by the application because they are already present on the target system, such as the standard Hadoop jars, would be needlessly copied. What the code listed below does is maintain a blacklist of jars that are not required in a deployed application. Usually this list is well known to a developer and rarely changes.

The following code will generate a deployable Hadoop jar, assuming it is run with the same classpath that was used to test the job locally.

package org.systemsbiology.aws;

import java.io.*;
import java.util.*;
import java.util.zip.*;

/**
* org.systemsbiology.aws.HadoopDeployer
* build a hadoop jar file
*
* @author Steve Lewis
* @date Apr 3, 2007
*/
public class HadoopDeployer {
    public static HadoopDeployer[] EMPTY_ARRAY = {};
    public static Class<?> THIS_CLASS = HadoopDeployer.class;

    public static final String[] EEXCLUDED_JARS_LIST = {
            "hadoop-0.20.2-core.jar",
            "slf4j-log4j12-1.4.3.jar",
            "log4j-1.2.15.jar",
            "junit-4.8.1.jar"
            // … add other jars to exclude
    };

    public static final Set<String> EXCLUDED_JARS = new HashSet<String>(Arrays.asList(EEXCLUDED_JARS_LIST));

    /**
     * Turn the classpath into a list of jar files to add to the jar
     * @param pathItems !null list of classpath items - files and directories
     * @param javaHome  !null java directory - things in javaHome are automatically excluded
     * @return !null list of existing jar files
     */
    public static File[] filterClassPath(String[] pathItems, String javaHome) {
        List<File> holder = new ArrayList<File>();

        for (int i = 0; i < pathItems.length; i++) {
            String item = pathItems[i];
            if (".".equals(item))
                continue; // ignore the current directory
            if (inExcludedJars(item))
                continue; // ignore excluded jars
            if (item.indexOf(javaHome) > -1)
                continue; // ignore java's own jars
            File itemFile = new File(item);
            if (!itemFile.exists())
                continue; // ignore non-existent entries
            if (itemFile.isFile())
                holder.add(itemFile); // directories are handled separately
        }
        File[] ret = new File[holder.size()];
        holder.toArray(ret);
        return ret;
    }

    /**
     * Get a list of directories in the classpath
     * @param pathItems !null list of items
     * @param javaHome  !null java home
     * @return !null list of existing directories
     */
    public static File[] filterClassPathDirectories(String[] pathItems, String javaHome) {
        List<File> holder = new ArrayList<File>();
        for (int i = 0; i < pathItems.length; i++) {
            String item = pathItems[i];
            if (".".equals(item))
                continue; // ignore the current directory
            if (EXCLUDED_JARS.contains(item))
                continue; // ignore excluded jars
            if (item.indexOf(javaHome) > -1)
                continue; // ignore java's own jars
            File itemFile = new File(item);
            if (!itemFile.exists())
                continue; // ignore non-existent entries
            if (itemFile.isDirectory())
                holder.add(itemFile); // files are handled separately
        }

        File[] ret = new File[holder.size()];
        holder.toArray(ret);
        return ret;
    }

    /**
     * true if s is the name of an excluded jar
     * @param s !null name
     * @return  as above
     */
    protected static boolean inExcludedJars(String s) {
        for (int i = 0; i <  EEXCLUDED_JARS_LIST.length; i++) {
            String test =  EEXCLUDED_JARS_LIST[i];
            if (s.endsWith(test))
                return true;
        }
        return false;
    }

    /**
     * copy jars to a lib directory
     * @param out !null open output stream
     * @param libs   !null file list - should be jar files
     * @throws IOException on error
     */
    public static void copyLibraries(ZipOutputStream out, File[] libs) throws IOException {
        for (int i = 0; i < libs.length; i++) {
            File lib = libs[i];
            final String name = "lib/" + lib.getName();
            System.out.println(name);
            ZipEntry ze = new ZipEntry(name);
            out.putNextEntry(ze);
            copyFile(lib, out);
            out.closeEntry();
        }
    }

    /**
     * Copy the contents of a file into an open zip stream
     * @param src !null source file
     * @param dst !null open output stream
     * @return true if no problem
     */
    public static boolean copyFile(File src, ZipOutputStream dst) {
        int bufsize = 1024;
        try {
            RandomAccessFile srcFile = new RandomAccessFile(src, "r");
            try {
                long len = srcFile.length();
                if (len > 0x7fffffff)
                    return false; // too large to copy
                if (len == 0)
                    return false; // failure - no data

                int bytesRead;
                byte[] buffer = new byte[bufsize];
                while ((bytesRead = srcFile.read(buffer, 0, bufsize)) != -1) {
                    dst.write(buffer, 0, bytesRead);
                }
                return true;
            }
            finally {
                srcFile.close(); // close even on an early return
            }
        }
        catch (IOException ex) {
            return false;
        }
    }

    /**
     * Create a deployable Jar as jarFile
     * @param jarFile !null creatable file for the jar
     */
    public static void deployLibrariesToJar(File jarFile) {
        try {
            ZipOutputStream out = new ZipOutputStream(new FileOutputStream(jarFile));

            String javaHome = System.getProperty("java.home");
            String classpath = System.getProperty("java.class.path");
            // File.pathSeparator is ";" on Windows and ":" on Linux/Mac
            String[] pathItems = classpath.split(File.pathSeparator);
            File[] pathLibs = filterClassPath(pathItems, javaHome);
            copyLibraries(out, pathLibs);
            File[] pathDirectories = filterClassPathDirectories(pathItems, javaHome);
            for (int i = 0; i < pathDirectories.length; i++) {
                File pathDirectory = pathDirectories[i];
                copyLibraryDirectory("", pathDirectory, out);
            }
            out.flush();
            out.close();

        }
        catch (IOException e) {
            throw new RuntimeException(e);

        }
    }

    /**
     * Make a path string by appending name to path
     * @param path !null current path
     * @param name !null name - usually a subdirectory
     * @return !null combined path
     */
    public static String nextPath(String path, String name) {
        if (path == null || path.length() == 0)
            return name;
        return path + "/" + name;
    }

    /**
     * Recursively copy the contents of a directory into the jar
     * @param s    path - used only when creating a directory path
     * @param dir  !null existing directory
     * @param pOut !null open output stream
     * @throws IOException on error
     */
    private static void copyLibraryDirectory(final String s, final File dir, final ZipOutputStream pOut) throws IOException {
        File[] list = dir.listFiles();
        if (list == null) return;
        for (int i = 0; i < list.length; i++) {
            File file = list[i];
            if (file.isDirectory()) {
                final String np = nextPath(s, file.getName());
                copyLibraryDirectory(np, file, pOut);
            }
            else {
                final String np = nextPath(s, file.getName());
                ZipEntry ze = new ZipEntry(np);
                pOut.putNextEntry(ze);
                copyFile(file, pOut);
                pOut.closeEntry();
            }
        }
    }

    /**
     * create a deployable Hadoop jar using the existing classpath
     * @param pJarName !null name of a file that is creatable
     */
    public static void makeHadoopJar(final String pJarName) {
        File deployDir = new File(pJarName);
        deployLibrariesToJar(deployDir);
    }

    /**
     * Sample use
     * @param args optional - args[0] is the jar file name
     */
    public static void main(String[] args) {
        String jarName = "FooBar.jar";
        if(args.length > 0)
            jarName = args[0];
        makeHadoopJar(jarName);

    }
}
