Developing With Hadoop (Aspire 2)


For Information on Aspire 3.1 Click Here

Feature only available with Aspire Enterprise

Developing With Aspire for Hadoop

Aspire for Hadoop provides a collection of Java classes that implement the Hadoop objects needed to connect Aspire and Hadoop, so that big data jobs can be programmed and configured using Aspire pipelines.

AspireHadoopMapper, AspireHadoopReducer and AspireHadoopCombiner are Hadoop mapper, reducer and combiner implementations. They run as Hadoop task tracker jobs and launch Aspire application pipelines to process their tasks. The Aspire Hadoop MapRed component also provides a generic AspireHadoopDriver, a Hadoop job that configures map/reduce jobs. It requires at least a mapper configuration; the combiner and the reducer are optional.

When a new Aspire Job is created inside one of these Hadoop task tracker jobs (AspireHadoopMapper, AspireHadoopReducer or AspireHadoopCombiner), the required Hadoop objects (key, values, context, counters) are linked to the Aspire Job, so any Aspire component configured in the pipeline can interact with Hadoop.

Any Aspire component can be used to build the Aspire pipelines for your Hadoop jobs.

The list of available components can be found here:

Interacting with Hadoop from Aspire

Aside from the provided components, Aspire gives you the ability to create your own components to interact with Hadoop, or to write this interaction with Hadoop using the aspire-groovy component as described below.

List of available Aspire classes to interact with Hadoop

Interacting with Hadoop from Aspire Groovy Script

HadoopContext, HadoopIterableWrapper, HadoopConfFactory and HadoopFSFactory are available in Groovy scripts for custom coding.

Iterate over Reducer Values

A HadoopIterableWrapper can be iterated with a Groovy closure. Inside the closure, each AspireObject is available through the implicit variable it.

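    // Count the reducer values and read a field from each one
    def count = 0;
    hadoopIterable.each() {
      count++;
      // "it" is the current AspireObject
      def url = it.getText("url");
    }
    // Store the total on the document once the loop has finished
    doc.add("count", count);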

Emit from Groovy

To emit key/value pairs directly from a Groovy script, use the HadoopContext write(key, value) method.

    hadoopContext.write("key", new AspireObject("newDoc"));

Creating an Aspire Hadoop Job

Using AspireHadoopDriver

Using the generic Aspire Hadoop MapRed component, create a configuration XML as described here.

AspireHadoopDriver input and output key/value pairs are of type Text/AspireObjectWritable.

Creating a new Aspire based Hadoop Job

Sometimes you will want to run Aspire inside Hadoop for only one of your tasks, map or reduce, but not both. For this, you will need to create a new Hadoop driver. When using AspireHadoopMapper, AspireHadoopReducer or AspireHadoopCombiner individually, take the following into consideration:

  • The input/output pairs are: Text/AspireObjectWritable.
  • They all expect a Configuration property called aspire-home with the local path where the aspire-for-hadoop-2.0 folder is located.
  • An Aspire application.xml is expected, as a string, in a Configuration property called ${taskType}-application (where ${taskType} is map, reduce or combine, depending on the task you are configuring).
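
The property naming described in the list above can be sketched in plain Java. This is only an illustration under stated assumptions: the class AspireTaskProperties, its methods, the example path and the placeholder XML string are all hypothetical and not part of Aspire. A plain Map stands in for the Hadoop Configuration object so that the sketch runs without Hadoop on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Aspire: it only assembles the property
// names described above. A Map stands in for the Hadoop Configuration.
class AspireTaskProperties {

    // Property expected to hold the local path of the aspire-for-hadoop-2.0 folder
    static final String ASPIRE_HOME = "aspire-home";

    // Returns "<taskType>-application" for map, reduce or combine
    static String applicationProperty(String taskType) {
        switch (taskType) {
            case "map":
            case "reduce":
            case "combine":
                return taskType + "-application";
            default:
                throw new IllegalArgumentException("Unknown task type: " + taskType);
        }
    }

    // Collects the two properties that a single Aspire-based task expects
    static Map<String, String> buildProperties(String aspireHome,
                                               String taskType,
                                               String applicationXml) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put(ASPIRE_HOME, aspireHome);
        props.put(applicationProperty(taskType), applicationXml);
        return props;
    }

    public static void main(String[] args) {
        // The path and the XML content below are placeholders for illustration
        Map<String, String> props = buildProperties(
                "/opt/aspire-for-hadoop-2.0", "map", "<application/>");
        props.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

In a real driver, each entry would presumably be copied into the job's Hadoop Configuration with Configuration.set(name, value) before the job is submitted.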