Post WebHDFS (Aspire 2)




Post WebHDFS (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-post-webhdfs
subType  default
Inputs  An AspireObject with the metadata of each document to be posted and a key (optional).
Outputs  An HDFS file entry consisting of the key and a JSON representation of the AspireObject as the value.
Feature only available with Aspire Enterprise

The Post WebHDFS stage writes key/value pairs into HDFS. The key is a user-defined field from the job's AspireObject (or the job ID, if no key is defined), and the value is the job's AspireObject. Key/value pairs are appended to a single file until the file size threshold is reached. A new file is then created with the next sequential ID (aspire-00000, aspire-00001, aspire-00002, ..., aspire-N).
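
The key selection and file rollover described above can be sketched as a toy in-memory model. This is not the Aspire implementation: the names choose_key and RollingWriter are hypothetical, the AXPath lookup is reduced to a plain dictionary lookup, the tab separator is assumed, and the exact point at which the component rolls over (before or after the threshold is crossed) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def choose_key(doc: dict, output_key: Optional[str], job_id: str) -> str:
    """Return the configured field's value, falling back to the job ID.

    Assumption: a real AXPath lookup is replaced by a plain dict lookup.
    """
    if output_key and doc.get(output_key):
        return str(doc[output_key])
    return job_id


@dataclass
class RollingWriter:
    """Hypothetical model of size-based rollover; files are kept in memory."""
    prefix: str = "aspire"
    max_bytes: int = 64 * 1024 * 1024
    files: Dict[str, List[str]] = field(default_factory=dict)
    index: int = 0
    current_size: int = 0

    def write(self, key: str, value: str) -> str:
        """Append one key/value record and return the file name it went to."""
        record = f"{key}\t{value}\n"  # assumed separator
        size = len(record.encode("utf-8"))
        # Assumption: roll over before a record would push a non-empty
        # file past the threshold.
        if self.current_size > 0 and self.current_size + size > self.max_bytes:
            self.index += 1
            self.current_size = 0
        name = f"{self.prefix}-{self.index:05d}"
        self.files.setdefault(name, []).append(record)
        self.current_size += size
        return name
```

With a deliberately tiny threshold, for example RollingWriter(max_bytes=20), each 13-byte record lands in its own file, which makes the sequential naming easy to observe.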

Communication with HDFS goes through the WebHDFS REST API.
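
As a rough illustration of what a WebHDFS request URL looks like, the sketch below joins a base URL, a folder path, and a file name, then adds the op and user.name query parameters that the WebHDFS REST API defines. The helper name webhdfs_url is hypothetical, and how the component assembles its URLs internally is an assumption. Here the folder path is assumed to already carry the /webhdfs/v1 prefix.

```python
from urllib.parse import urlencode


def webhdfs_url(base_url: str, folder_path: str, file_name: str,
                op: str, username: str = "") -> str:
    """Build a WebHDFS request URL (hypothetical helper, not Aspire code)."""
    # Join the non-empty parts with single slashes.
    parts = [p.strip("/") for p in (base_url, folder_path, file_name)]
    path = "/".join(p for p in parts if p)
    params = {"op": op}
    if username:
        # WebHDFS accepts the caller's identity as the user.name parameter.
        params["user.name"] = username
    return f"{path}?{urlencode(params)}"


create_url = webhdfs_url("http://localhost:8020/",
                         "/webhdfs/v1/user/jsmith/test/",
                         "aspire-00000", "CREATE", "jsmith")
append_url = webhdfs_url("http://localhost:8020/",
                         "/webhdfs/v1/user/jsmith/test/",
                         "aspire-00000", "APPEND", "jsmith")
```

In the WebHDFS API, CREATE is issued as an HTTP PUT and APPEND as an HTTP POST. Both are normally answered with a redirect to a datanode, which then receives the file data.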


Configuration

Element  Type  Default  Description
hdfsUrl  String  http://localhost:8020  The HDFS Namenode URL.
folderPath  String  (none)  The path within the HDFS server where the files will be stored. If empty, the user's home folder is used.
filePrefixName  String  aspire  The prefix of the stored file names. Each file name is completed with a sequential counter value (e.g. aspire-00000).
username  String  (none)  The username to send with the HTTP calls.
fileSize  long  64*1024*1024 (64 MB)  The maximum size, in bytes, of each file. When a file reaches this size, a new file is created.
outputKey  String  (none)  An AXPath of the metadata field to use as the output key.
ignoreAspireBatch  boolean  true  Whether to ignore Aspire batches when deciding where to start a new file. If false, a new file is created for each Aspire batch. NOTE: If this is false and Aspire job batching is enabled, the fileSize value is ignored and each file contains exactly as many key/value pairs as the batch size.
timeout  int  30000  Time in milliseconds to wait after the last job has been processed before the file is closed.


Example

This section provides an example Post WebHDFS configuration for a local HDFS server.

<component name="PostWebHDFS" subType="default" factoryName="aspire-post-webhdfs">
  <hdfsUrl>http://localhost:8020/</hdfsUrl>
  <folderPath>/webhdfs/v1/user/jsmith/test/</folderPath> 
  <filePrefixName>aspire-</filePrefixName>
  <username>jsmith</username>
  <outputKey>weekDay</outputKey>
</component>
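
To see which values apply once the defaults from the Configuration table are merged with this example, the following sketch parses the component XML and fills in every element that is omitted. It is only an illustration: load_config and DEFAULTS are hypothetical names, and Aspire's own configuration loading may behave differently.

```python
import xml.etree.ElementTree as ET

# Defaults copied from the Configuration table; empty string means "no default".
DEFAULTS = {
    "hdfsUrl": "http://localhost:8020",
    "folderPath": "",
    "filePrefixName": "aspire",
    "username": "",
    "fileSize": str(64 * 1024 * 1024),
    "outputKey": "",
    "ignoreAspireBatch": "true",
    "timeout": "30000",
}

EXAMPLE = """
<component name="PostWebHDFS" subType="default" factoryName="aspire-post-webhdfs">
  <hdfsUrl>http://localhost:8020/</hdfsUrl>
  <folderPath>/webhdfs/v1/user/jsmith/test/</folderPath>
  <filePrefixName>aspire-</filePrefixName>
  <username>jsmith</username>
  <outputKey>weekDay</outputKey>
</component>
"""


def load_config(xml_text: str) -> dict:
    """Merge the component's child elements over the table defaults."""
    root = ET.fromstring(xml_text.strip())
    raw = dict(DEFAULTS)
    for child in root:
        text = (child.text or "").strip()
        if child.tag in raw and text:
            raw[child.tag] = text
    config = dict(raw)
    # Convert the typed elements (long, int, boolean in the table).
    config["fileSize"] = int(raw["fileSize"])
    config["timeout"] = int(raw["timeout"])
    config["ignoreAspireBatch"] = raw["ignoreAspireBatch"].lower() == "true"
    return config


config = load_config(EXAMPLE)
```

For the example above, fileSize, ignoreAspireBatch, and timeout keep their table defaults because the component element does not set them.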

Output

Monday     {"doc":{"weekDay":"Monday","name":"jsmith","date":"2013\/07\/16","url":"http:\/\/www.searctechnologies.com\/products\/we-are-great.html"}}
Wednesday  {"doc":{"weekDay":"Wednesday","name":"jsmith","date":"2013\/07\/16","url":"http:\/\/www.searctechnologies.com\/home.html"}}
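
A downstream consumer could read such a line back as sketched below. The sketch assumes that the key contains no whitespace and that the key and the JSON value are separated by whitespace, as in the listing above; the actual separator written by the component is not stated here. The name parse_record is hypothetical.

```python
import json
from typing import Tuple

# One line from the output listing above, kept verbatim as a raw string.
SAMPLE = r'Wednesday  {"doc":{"weekDay":"Wednesday","name":"jsmith","date":"2013\/07\/16","url":"http:\/\/www.searctechnologies.com\/home.html"}}'


def parse_record(line: str) -> Tuple[str, dict]:
    """Split a stored line into its key and the decoded JSON value."""
    # Split only on the first run of whitespace so the JSON stays intact.
    key, value = line.rstrip("\n").split(None, 1)
    return key, json.loads(value)


key, doc = parse_record(SAMPLE)
```

The escaped slashes (\/) in the stored JSON are a valid JSON escape and decode to plain slashes.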

HDFS Configuration Requirements

This component uses the WebHDFS APPEND operation to add data to the HDFS files, so append support must be enabled in the config/hdfs-site.xml configuration file on your HDFS server. Depending on the Hadoop version, this is typically controlled by the dfs.support.append property.