Storage Handler 0.4

From wiki.searchtechnologies.com

Storage Handler
Description: The storage handler is a service for managing the opening and closing of storage, typically files on the file system. It can open and close ordinary files, and it can also manage storage for any component which implements the AspireStorageInterface.
Inputs: None
Outputs: Document Variables - as configured - to hold all opened storage objects.
Factory: aspire-storage-handler
Sub Type: default
Object Type: AspireDocument

Other Notes

  • All objects will be automatically closed by the pipeline manager when the job is completed, unless they are closed earlier.
  • Multiple instances of aspire-storage-handler can be configured in a pipeline to open or close multiple files or component-storage objects as needed.

Storage Handler Commands

The configuration for the storage handler is a list of storage operations which are performed sequentially.

Element Type Default Description
commands parent element none This is a parent tag which contains a list of open or close commands which will be executed whenever a job is processed.

Note that there are two types of open commands: one for opening a storage object from a component, and one for opening simple file storage (see below).

commands/{command} element none Within the <commands> tag, you can include any number of storage manipulation commands. These commands are listed below.
<component name="storageHandlerOpen" subType="default" factoryName="aspire-storage-handler">
  <config>
    <commands>
      <mkdir path="data/vectors" />
      
      <delete path="data/vectors/{XML:APPLICANT_ID}.bak" />
      
      <rename fromPath="data/vectors/{XML:APPLICANT_ID}.vec" 
              toPath="data/vectors/{XML:APPLICANT_ID}.bak" />
      
      <open componentRef="/system/vectorPipeline/storeVector"
                path="data/vectors/{XML:APPLICANT_ID}.vec" />
      
      <open variable="statsFileOut" path="data/vectors/stats.txt" objectType="PrintWriter" />
    </commands>
  </config>
</component>

Command: <open> for custom storage for components

Use the following configuration for opening storage objects required by some other component in the system. This is especially useful for components which do things like write data for sub-jobs.

For example, this command was first implemented to create sequence files of Mahout vectors. A storage handler on the parent pipeline first opens the sequence file by sending an <open> command to the "StoreVector" component, which creates the storage object and stores it into the AspireDocument as a variable.

Then for each job processed by the "StoreVector" component, StoreVector will fetch the object from the AspireDocument variable (traversing up the sub-job/parent-job hierarchy if necessary) and will then store the vector using the previously opened storage object.

Finally, when the parent job is complete, a second storage handler will be called to close the storage object (or the pipeline manager will do it if no one else does).


Element Type Default Description
open/@componentRef String none This specifies the component which will be used to create the storage object. The value for the @componentRef attribute must be either an absolute or relative Aspire component name. Note that this component must implement the AspireStorageInterface to be manageable by the storage handler.
open/@variable String none (Optional for component storage) This specifies the variable name in the AspireDocument to which the storage object is assigned. This can be any variable name which is descriptive of the object. Variables in document storage can be accessed by any stage (with doc.getVariable()) or any Groovy script (as an ordinary variable).
open/@path String none (Required by most components) A path template for the file or directory where the custom component storage will be located. Note that not all components will use @path (see each component's individual wiki documentation for details).
open/{other attributes or nested elements} depends none The component specified by the @componentRef attribute will receive the <open> XML element when the storage object is opened by the storage handler. This means that any attributes or nested elements on the <open> element will be passed to the component and may be used by the component to affect how the storage object is opened and initialized.

For more information on what other nested XML is required/available for individual components, see the component's wiki page.


Example:

 <open componentRef="../sub-doc-pipeline/storageVector"
       path="data/vectors/{XML:APPLICANT_ID}.vec" />

Command: <open> for file I/O

The Storage Handler can also be used to open file system files for I/O. The resulting file objects are stored as variables in the AspireDocument which can then be used by Groovy scripts or other components for reading and writing files.

Element Type Default Description
open/@path string none (Required) A path template for the file to be opened. Note that all relative paths will be relative to ASPIRE_HOME.
open/@variable String none (Required for file I/O) The variable in the AspireDocument where the open file stream object (i.e. the InputStream, OutputStream, etc.) is stored. The variable can be accessed as a Groovy variable or using the doc.getVariable() method inside of components.
open/@objectType One of "OutputStream", "Writer", "PrintWriter", "InputStream", or "BufferedReader" none (required) Specifies the type of file object to create. The Storage Handler will automatically create a buffered version of the specified object.

For example, opening an "InputStream" will create an instance of java.io.BufferedInputStream. "BufferedReader" is called out separately since it supports more methods than simple "Reader" (specifically, readLine()).

open/@append boolean false Specifies if the file should be opened for append.
open/@encoding string UTF-8 Specifies the file encoding used to read or write the file. Ignored when objectType is InputStream or OutputStream.


For example:

     <open variable="statsFileOut" path="data/vectors/stats.txt" 
           objectType="PrintWriter" append="true" encoding="UTF-8" />
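The mapping from objectType to buffered Java I/O objects described above could look roughly like the following. This is a minimal sketch using only the JDK; FileOpenSketch and its open() method are hypothetical names chosen for illustration and are not the Storage Handler's actual implementation.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.Closeable;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

// Hypothetical helper (not part of Aspire): maps an objectType name to a buffered java.io object.
final class FileOpenSketch {
  static Closeable open(String path, String objectType, boolean append, String encoding)
      throws IOException {
    // Default to UTF-8 when no encoding is given, as in the table above.
    Charset cs = Charset.forName(encoding == null ? "UTF-8" : encoding);
    switch (objectType) {
      case "InputStream":   // byte stream: encoding is ignored
        return new BufferedInputStream(new FileInputStream(path));
      case "BufferedReader":
        return new BufferedReader(new InputStreamReader(new FileInputStream(path), cs));
      case "OutputStream":  // byte stream: encoding is ignored
        return new BufferedOutputStream(new FileOutputStream(path, append));
      case "Writer":
        return new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path, append), cs));
      case "PrintWriter":
        return new PrintWriter(
            new BufferedWriter(new OutputStreamWriter(new FileOutputStream(path, append), cs)));
      default:
        throw new IllegalArgumentException("Unsupported objectType: " + objectType);
    }
  }
}
```

In this sketch the returned object is what would be stored in the AspireDocument variable named by @variable.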

Command: <close>

Closes open storage. This operation will get the named variable from the AspireDocument, and then will close the object (if it implements Closeable). Note that any variable on the document can be closed - it does not have to have been previously opened by a prior storage handler (although it typically is).


Element Type Default Description
close/@variable string none Specifies the variable on the AspireDocument whose object will be closed. The object is closed only if it implements Closeable; otherwise, it is ignored.


Example:

 <close variable="statsFileOut"  />
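The "close only if Closeable" rule can be illustrated with plain JDK code. In this sketch the document's variables are modelled as a Map, and CloseSketch and closeVariable() are hypothetical names; the real handler operates on AspireDocument variables.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

// Hypothetical helper (not part of Aspire): a Map stands in for the document's variables.
final class CloseSketch {
  // Returns true if the variable held a Closeable that was closed, false if it was ignored.
  static boolean closeVariable(Map<String, Object> variables, String name) throws IOException {
    Object value = variables.get(name);
    if (value instanceof Closeable) {
      ((Closeable) value).close();
      return true;
    }
    // Missing variables and non-Closeable objects are silently skipped.
    return false;
  }
}
```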

Command: <closeAll>

Closes all variables opened previously by an earlier storage handler. Does not close all variables in the AspireDocument, just the ones opened by any previous storage handler.

Contains no attributes.

Example:

 <closeAll/>
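One way to picture the difference between "all variables" and "variables opened by a storage handler" is to track the names of opened variables separately. The following is a sketch under that assumption; CloseAllSketch and its methods are hypothetical, and the source does not describe how Aspire actually records which variables were opened.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model (not part of Aspire): remembers which variables a handler opened.
final class CloseAllSketch {
  private final Map<String, Object> variables = new LinkedHashMap<>();
  private final Set<String> openedByHandler = new LinkedHashSet<>();

  // An ordinary document variable, not tracked for closeAll.
  void putVariable(String name, Object value) {
    variables.put(name, value);
  }

  // A variable opened by a storage handler, tracked for closeAll.
  void open(String name, Closeable storage) {
    variables.put(name, storage);
    openedByHandler.add(name);
  }

  // Closes only the tracked variables and returns how many were closed.
  int closeAll() throws IOException {
    int closed = 0;
    for (String name : openedByHandler) {
      Object value = variables.get(name);
      if (value instanceof Closeable) {
        ((Closeable) value).close();
        closed++;
      }
    }
    openedByHandler.clear();
    return closed;
  }
}
```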


Command: <mkdir> for directories

Creates the specified directory. If the directory already exists, does nothing and does not report an error. Throws an exception if unable to create the directory. Automatically creates all parent directories if any of them do not already exist.

Element Type Default Description
mkDir/@path string none Specifies a path template for the directory to create. Note that relative paths will be relative to ASPIRE_HOME.

Example:

 <mkDir path="data/vectors" />
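The behaviour described above (no error if the directory exists, parent directories created automatically, an exception on failure) matches what java.nio.file.Files.createDirectories() provides in the JDK. A minimal sketch, where MkDirSketch is a hypothetical name and the base argument stands in for ASPIRE_HOME:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Aspire): 'base' stands in for ASPIRE_HOME.
final class MkDirSketch {
  static Path mkDir(Path base, String path) throws IOException {
    // resolve() keeps absolute paths as-is and makes relative paths relative to base.
    Path dir = base.resolve(path);
    // Creates missing parents; does nothing if the directory already exists;
    // throws IOException if the directory cannot be created.
    return Files.createDirectories(dir);
  }
}
```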

Command: <delete> for files and directories

Deletes the specified directory or file. If the directory or file does not already exist, does nothing and does not report an error.

When deleting a directory, automatically deletes all of the files contained within the directory.

Throws an exception if the deletion fails (for example, because a locked file could not be deleted).

Element Type Default Description
delete/@path string none Specifies a path template for the directory or file to delete. Note that relative paths will be relative to ASPIRE_HOME.

Example:

 <delete path="data/vectors/{XML:APPLICANT_ID}.bak" />
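The delete semantics above (ignore a missing target, remove directory contents, fail with an exception otherwise) can be sketched with the JDK's java.nio.file API. DeleteSketch is a hypothetical name; this illustrates the described behaviour and is not the handler's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper (not part of Aspire): recursive delete that tolerates a missing target.
final class DeleteSketch {
  static void delete(Path target) throws IOException {
    if (Files.notExists(target)) {
      return; // nothing to delete, not an error
    }
    List<Path> entries;
    try (Stream<Path> walk = Files.walk(target)) {
      // Reverse order puts children before their parent directories.
      entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path entry : entries) {
      Files.delete(entry); // throws IOException if an entry cannot be removed
    }
  }
}
```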


Command: <rename> for files and directories

Renames a file or directory. If the "from" path does not exist, does nothing and does not report an error. Throws an exception if the rename cannot proceed (for example, because of a locked file or because the destination already exists).

Element Type Default Description
rename/@fromPath string none Specifies the source path template for the directory or file to rename. Note that it is not considered to be an error if this path does not exist.
rename/@toPath string none Specifies the destination path template for the directory or file. Note that an exception will be thrown if this destination already exists.

Example:

 <rename fromPath="data/vectors/{XML:APPLICANT_ID}.vec" 
           toPath="data/vectors/{XML:APPLICANT_ID}.bak" />
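The rename rules above can be illustrated with java.nio.file.Files.move(), which fails with FileAlreadyExistsException when the destination exists and no replace option is given. RenameSketch is a hypothetical name; this is a sketch of the described behaviour, not the handler's implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Aspire): rename that tolerates a missing source.
final class RenameSketch {
  // Returns true if something was renamed, false if the source did not exist.
  static boolean rename(Path from, Path to) throws IOException {
    if (Files.notExists(from)) {
      return false; // a missing source is not an error
    }
    // Without REPLACE_EXISTING, move() throws FileAlreadyExistsException
    // when the destination already exists.
    Files.move(from, to);
    return true;
  }
}
```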

Implementing A Custom Component Source Object

To implement a component which requires custom storage management, do the following.

Step 1: Add "com.searchtechnologies:aspire-storage-handler" as a dependency of your project in its "pom.xml" file.


Step 2: Have your pipeline stage implement AspireStorageInterface.

For example:

 public class MyPipelineStage extends ComponentImpl implements AspireStorageInterface {
   .
   .
   .
 }


Step 3: Create the code for the open() method required by AspireStorageInterface.

For example:

 @Override
 public Closeable open(Job j, Element config) throws AspireException {
   String pathToOpen = StandardPathBuilder.build(this, config, j);
   
   // Create, initialize, and return the appropriate object for your custom storage
   //
   // Note that the object must implement "Closeable".
   //
   // If you use the "StandardPathBuilder", then the <open> command specified in the
   // storage handler must have an @path attribute specifying the location which contains
   // the storage to open.
   //
   MyNewStorageObject storageObject = new MyNewStorageObject(... pathToOpen ... );
   
   // Now, be sure to save the object in a variable so you can access it later.
   AspireDocument doc = (AspireDocument) j.getObject();
   
   doc.putVariable("MyComponentStorageVar", storageObject);
   
   return storageObject;
 }


Step 4: Inside your pipeline stage, access the variable which contains the storage object as follows:

 @Override
 public void process(Job j) throws AspireException {
   MyNewStorageObject myStorage = 
     (MyNewStorageObject) JobDocumentHierarchyMap.get(j, "MyComponentStorageVar");
   
   .
   .
   .
 }

Note the use of JobDocumentHierarchyMap.get() which traverses the job hierarchy to fetch the specified variable from the appropriate AspireDocument. This allows the storage handler to open the storage on a parent job, but then for the storage to be actually used in a sub-job pipeline.
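The parent-job lookup can be pictured as a walk up a chain of jobs until one of them holds the requested variable. The sketch below uses hypothetical stand-in classes (HierarchyLookupSketch and SketchJob) rather than the real Job, AspireDocument, or JobDocumentHierarchyMap types, whose internals are not shown on this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not the Aspire API): variable lookup that walks up the job hierarchy.
final class HierarchyLookupSketch {
  // Stand-in for a job whose document holds variables and which may have a parent job.
  static final class SketchJob {
    final SketchJob parent;
    final Map<String, Object> variables = new HashMap<>();

    SketchJob(SketchJob parent) {
      this.parent = parent;
    }
  }

  // Returns the variable from the nearest job in the chain that defines it, or null.
  static Object get(SketchJob job, String name) {
    for (SketchJob current = job; current != null; current = current.parent) {
      if (current.variables.containsKey(name)) {
        return current.variables.get(name);
      }
    }
    return null;
  }
}
```

In this model, storage opened on the parent job is visible to every sub-job beneath it, which mirrors the open-on-parent, use-in-sub-job pattern described above.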