Publish to HDFS Tutorial (Aspire 2)

From wiki.searchtechnologies.com

For information on Aspire 3.1, see the Aspire 3.1 version of this tutorial.

Step 1: Launch Aspire and open the Content Source Management Page

Launch Aspire (if it's not already running). See:

Browse to: http://localhost:50505. For details on using the Aspire Content Source Management page, please refer to UI Introduction.


Step 2: Create a new Content Source

For this step, follow the corresponding step in the Configuration Tutorial of the connector of your choice. For the available connectors, please refer to Connector list.

Step 3: Add a new Publish to CDH HDFS to the Workflow

To add a Publish to CDH HDFS publisher, drag the Publish to CDH HDFS rule from the Workflow Library and drop it at the position in the Workflow Tree where you want to add it. This automatically opens the Publish to CDH HDFS window, where you configure the publisher.

Step 3a: Configure the Publish to CDH HDFS

  1. Enter the name of the publisher. (This name must be unique).
  2. Enter the description of the publisher that will be shown in the Workflow Tree.
  3. Select the publishing protocol to use:
    1. HDFS (Java API)
    2. WebHDFS (REST API)

Note: Not all HDFS clusters have WebHDFS enabled.

Publish using HDFS

In the HDFS section of the Publish to CDH HDFS window specify the connection information to publish to HDFS.

  1. Enter the HDFS URL. Use the hdfs:// protocol and the port (8020 by default). E.g. hdfs://localhost:8020
  2. Specify the location of the Output key: an AXPath to the node inside the AspireObject. E.g. /doc/docType
  3. Specify the absolute HDFS Folder Path where the files will be published. E.g. /user/jsmith/my_aspire_output. (The user that runs Aspire must have write access to the HDFS folder.)
  4. Specify the Max File Size in megabytes. If left as -1, the HDFS Block Size is used as the file size limit.
  5. Specify a File Prefix Name. E.g. with the prefix aspire-, files will be named aspire-00000, aspire-00001, aspire-00002, etc.
  6. Debug: Check if you want to run the publisher in debug mode.
  7. Click on the Add button.
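The file naming and size settings above can be illustrated with a small Python sketch. This is not Aspire's implementation; it only mirrors the behaviour described in the steps. The function names are hypothetical, the five-digit zero padding is inferred from the example names (aspire-00000, aspire-00001), and treating one megabyte as 1024 * 1024 bytes is an assumption.

```python
import posixpath


def effective_max_bytes(max_file_size_mb: int, hdfs_block_size_bytes: int) -> int:
    """Return the file size limit in bytes.

    A value of -1 means "use the HDFS block size", as described for the
    Max File Size field. 1 MB = 1024 * 1024 bytes is an assumption.
    """
    if max_file_size_mb == -1:
        return hdfs_block_size_bytes
    if max_file_size_mb <= 0:
        raise ValueError("Max File Size must be -1 or a positive number")
    return max_file_size_mb * 1024 * 1024


def output_file_path(folder: str, prefix: str, index: int) -> str:
    """Build the path of the index-th output file under an absolute HDFS folder.

    The 5-digit zero-padded counter is inferred from the tutorial's examples.
    """
    if not folder.startswith("/"):
        raise ValueError("HDFS Folder Path must be absolute")
    if index < 0:
        raise ValueError("index must not be negative")
    return posixpath.join(folder, f"{prefix}{index:05d}")
```

For example, calling output_file_path with the folder /user/jsmith/my_aspire_output, the prefix aspire- and the index 2 yields a path ending in aspire-00002.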

Publish using WebHDFS

In the Web HDFS section of the Publish to CDH HDFS window specify the connection information to publish to HDFS.

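  1. Enter the WebHDFS URL. Use the http:// protocol and the NameNode's HTTP port (typically 50070 on Hadoop 2.x / CDH 5; port 8020 is normally the NameNode RPC port used by hdfs:// URLs). E.g. http://localhost:50070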
  2. Specify the Username to connect as. The user must exist in HDFS and have write access to the HDFS folder.
  3. Specify the location of the Output key: an AXPath to the node inside the AspireObject. E.g. /doc/docType
  4. Specify the absolute HDFS Folder Path where the files will be published. E.g. /user/jsmith/my_aspire_output
  5. Specify the Max File Size in megabytes. If left as -1, the HDFS Block Size is used as the file size limit.
  6. Specify a File Prefix Name. E.g. with the prefix aspire-, files will be named aspire-00000, aspire-00001, aspire-00002, etc.
  7. Debug: Check if you want to run the publisher in debug mode.
  8. Click on the Add button.
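To show how the WebHDFS URL, Username and HDFS Folder Path fit together, the following Python sketch builds the kind of request URL that the WebHDFS REST API expects for creating a file (the /webhdfs/v1 path prefix with the op=CREATE and user.name parameters). How Aspire issues its requests internally is not documented here, so the function name and the choice of parameters are assumptions made for illustration only.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(base_url: str, hdfs_path: str, user: str,
                       overwrite: bool = False) -> str:
    """Build a WebHDFS file-creation URL (hypothetical helper, not part of Aspire).

    base_url  -- the WebHDFS URL entered in step 1 (scheme, host and port)
    hdfs_path -- absolute HDFS path of the file to create
    user      -- the Username from step 2, sent as the user.name parameter
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be absolute")
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    # quote() keeps "/" unescaped and escapes characters such as spaces.
    return f"{base_url.rstrip('/')}/webhdfs/v1{quote(hdfs_path)}?{query}"
```

Pass the WebHDFS URL from step 1 as base_url. A cluster that does not have WebHDFS enabled will reject such requests, which is why the HDFS (Java API) protocol may be the only option on some clusters.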

Once you've clicked on the Add button, it will take a moment for Aspire to download all of the necessary components (the Jar files) from the Maven repository and load them into Aspire. Once that's done, the publisher will appear in the Workflow Tree.

For details on using the Workflow section, please refer to Workflow introduction.