Storage Handler (Aspire 2)




Storage Handler (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-storage-handler
subType  default
Inputs  None
Outputs  Document Variables - as configured - to hold all opened storage objects.

The Storage Handler component is a service for managing the opening and closing of storage, typically files on the file system. It can open and close ordinary files, and it can also manage storage for any component that implements the AspireStorageInterface (see AspireStorageInterface.java).

  • All objects will be automatically closed by the pipeline manager when the job is completed, unless they are closed earlier.
  • Multiple instances of aspire-storage-handler can be configured in a pipeline to open or close multiple files or component-storage objects as needed.

Storage Handler Commands

The configuration for the storage handler is a list of storage operations which are performed sequentially.

Element Type Default Description
commands parent element none This is a parent tag which contains a list of open or close commands which will be executed whenever a job is processed.

Note that there are two types of open commands: one set for opening a storage object from a component, and a second type for opening simple file storage (see below).

commands/{command} element none Within the <commands> tag, you can include any number of storage manipulation commands. These commands are listed below.
<component name="StorageHandlerOpen" subType="default" factoryName="aspire-storage-handler">
  <commands>
    <mkdir path="data/vectors" />
      
    <delete path="data/vectors/{XML:APPLICANT_ID}.bak" />
    
    <rename fromPath="data/vectors/{XML:APPLICANT_ID}.vec" 
            toPath="data/vectors/{XML:APPLICANT_ID}.bak" />
      
    <open componentRef="/system/vectorPipeline/storeVector"
              path="data/vectors/{XML:APPLICANT_ID}.vec" />
      
    <open variable="statsFileOut" path="data/vectors/stats.txt" objectType="PrintWriter" />
  </commands>
</component>

Command: <open> for custom storage for components

Use the following configuration for opening storage objects required by some other component in the system. This is especially useful for components which do things like write data for sub-jobs.

For example, this command was first implemented to create sequence files of Mahout vectors. A storage handler on the parent pipeline first opens the sequence file by sending an <open> command to the "StoreVector" component, which creates the storage object and stores it into the AspireObject as a variable.

Then for each job processed by the "StoreVector" component, StoreVector will fetch the object from the AspireObject variable (traversing up the sub-job/parent-job hierarchy if necessary) and will then store the vector using the previously opened storage object.

Finally, when the parent job is complete, a second storage handler will be called to close the storage object (or the pipeline manager will do it if no one else does).


Element Type Default Description
open/@componentRef String none This specifies the component which will be used to create the storage object. The value for the @componentRef attribute must be either an absolute or relative Aspire component name. Note that this component must implement the AspireStorageInterface to be manageable by the storage handler.
open/@variable String none (Optional for component storage) This specifies the variable name in the AspireObject to which the storage object is assigned. This can be any variable name which is descriptive of the object. Variables in document storage can be accessed by any stage (with doc.getVariable()) or any Groovy script (as an ordinary variable).
open/@path String none (required by most components) A path template for the file or directory where the custom component storage will be located. Note that not all components will use @path (see each component's individual wiki documentation for details).
open/{other attributes or nested elements} depends none The component specified by the @componentRef attribute will receive the <open> XML element when the storage object is opened by the storage handler. This means that any attributes or nested elements on the <open> element will be passed to the component and may be used by the component to affect how the storage object is opened and initialized.

For more information on what other nested XML is required/available for individual components, see the component's wiki page.


Example:

 <open componentRef="../sub-doc-pipeline/storageVector"
       path="data/vectors/{XML:APPLICANT_ID}.vec" />

Command: <open> for file I/O

The Storage Handler can also be used to open file system files for I/O. The resulting file objects are stored as variables in the AspireObject which can then be used by Groovy scripts or other components for reading and writing files.

Element Type Default Description
open/@path string none (required) A path template for the file to be opened. Note that all relative paths will be relative to ASPIRE_HOME.
open/@variable String none (Required for file I/O) The variable in the AspireObject where the open file stream object (i.e. the InputStream, OutputStream, etc.) is stored. The variable can be accessed as a Groovy variable or using the doc.getVariable() method inside of components.
open/@objectType One of "OutputStream", "Writer", "PrintWriter", "InputStream", or "BufferedReader" none (required) Specifies the type of file object to create. The Storage Handler will automatically create a buffered version of the specified object.

For example, opening an "InputStream" will create an instance of java.io.BufferedInputStream. "BufferedReader" is called out separately since it supports more methods than simple "Reader" (specifically, readLine()).

Note: The output object will be constructed to automatically provide buffering and the encoding specified (defaults to UTF-8).

open/@append boolean false Specifies if the file should be opened for append.
open/@encoding string UTF-8 Specifies the file encoding used to read or write the file. Ignored when objectType is InputStream or OutputStream.


For example:

     <open variable="statsFileOut" path="data/vectors/stats.txt" 
           objectType="PrintWriter" append="true" encoding="UTF-8" />
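
In plain java.io terms, the buffering, append, and encoding behavior described above might look like the following sketch. The class and method names (FileOpenSketch, openPrintWriter, openInputStream) are hypothetical and are not part of Aspire; the actual Storage Handler implementation may differ.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

// Hypothetical illustration only; not the Storage Handler's real code.
class FileOpenSketch {

  // Roughly what objectType="PrintWriter" with @append and @encoding could map to:
  // a PrintWriter over a buffered, encoding-aware writer.
  static PrintWriter openPrintWriter(String path, boolean append, String encoding)
      throws IOException {
    return new PrintWriter(new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(path, append),
                               Charset.forName(encoding))));
  }

  // Roughly what objectType="InputStream" could map to: a buffered byte stream.
  // No encoding is involved because the stream carries raw bytes.
  static InputStream openInputStream(String path) throws IOException {
    return new BufferedInputStream(new FileInputStream(path));
  }
}
```

With append set to true, text written through the returned PrintWriter is added to the end of the existing file rather than replacing it.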

Command: <close>

Closes open storage. This operation will get the named variable from the AspireObject, and then will close the object (if it implements Closeable). Note that any variable on the document can be closed - it does not have to have been previously opened by a prior storage handler (although it typically is).


Element Type Default Description
close/@variable string none Specifies the variable on the AspireObject whose object will be closed. Note that the object is checked to ensure that it supports Closeable, and if it does, it will be closed. Otherwise, it is ignored.


Example:

 <close variable="statsFileOut"  />
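
The check-then-close behavior described above can be sketched as follows. This is a minimal illustration that assumes the document variables behave like a simple name-to-object map; CloseSketch and closeVariable are hypothetical names, not Aspire APIs.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

// Hypothetical illustration only; not the Storage Handler's real code.
class CloseSketch {

  // Closes the named variable if its value implements Closeable.
  // Returns true if close() was called, false if the variable was
  // missing or not Closeable (in which case it is simply ignored).
  static boolean closeVariable(Map<String, Object> variables, String name)
      throws IOException {
    Object value = variables.get(name);
    if (value instanceof Closeable) {
      ((Closeable) value).close();
      return true;
    }
    return false;
  }
}
```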

Command: <closeAll>

Closes all variables opened previously by an earlier storage handler. Does not close all variables in the AspireObject, just the ones opened by any previous storage handler.

Contains no attributes.

Example:

 <closeAll/>
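
One way to picture the difference between "all variables" and "variables opened by a storage handler" is a separate record of handler-opened names, as in the sketch below. How Aspire actually tracks opened objects is not documented here, so the CloseAllSketch class and its fields are purely hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the real bookkeeping may differ.
class CloseAllSketch {
  // Every variable on the document, however it was set.
  final Map<String, Object> variables = new LinkedHashMap<>();
  // Only the names that were opened through a storage handler.
  final Set<String> openedByHandler = new LinkedHashSet<>();

  // Stand-in for an <open> command: store the object and remember its name.
  void open(String name, Closeable storage) {
    variables.put(name, storage);
    openedByHandler.add(name);
  }

  // Stand-in for <closeAll/>: close only handler-opened objects.
  void closeAll() throws IOException {
    for (String name : openedByHandler) {
      Object value = variables.get(name);
      if (value instanceof Closeable) {
        ((Closeable) value).close();
      }
    }
    openedByHandler.clear();
  }
}
```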


Command: <mkdir> for files and directories

Creates the specified directory. If the directory already exists, does nothing and does not report an error. Throws an exception if unable to create the directory. Automatically creates all parent directories if any of them do not already exist.

Element Type Default Description
mkDir/@path string none Specifies a path template for the directory to create. Note that relative paths will be relative to ASPIRE_HOME.


Example:

 <mkDir path="data/vectors" />
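
The semantics described above (create parents as needed, do nothing if the directory exists, fail with an exception otherwise) happen to match java.nio.file.Files.createDirectories. The sketch below shows that equivalent; MkdirSketch is a hypothetical name, and this is not necessarily how the Storage Handler is implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the Storage Handler's real code.
class MkdirSketch {

  // Creates the directory and any missing parents. Does nothing if the
  // directory already exists; throws IOException if it cannot be created.
  static void mkdir(Path dir) throws IOException {
    Files.createDirectories(dir);
  }
}
```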

Command: <delete> for files and directories

Deletes the specified directory or file. If the directory or file does not already exist, does nothing and does not report an error.

When deleting a directory, automatically deletes all of the files contained within the directory.

Throws an exception if the deletion was unsuccessful (for example, a locked file which could not be deleted).

Element Type Default Description
delete/@path string none Specifies a path template for the directory or file to delete. Note that relative paths will be relative to ASPIRE_HOME.


Example:

 <delete path="data/vectors/{XML:APPLICANT_ID}.bak" />
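
A rough java.nio equivalent of this behavior is sketched below: a missing path is ignored, directory contents are removed before the directory itself, and any failure surfaces as an exception. DeleteSketch is a hypothetical name used for illustration; the real implementation may differ.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not the Storage Handler's real code.
class DeleteSketch {

  // Deletes a file, or a directory together with everything inside it.
  // A path that does not exist is not treated as an error.
  static void delete(Path target) throws IOException {
    if (!Files.exists(target)) {
      return;
    }
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(target)) {
      // Reverse order puts children before their parent directories.
      paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p); // throws IOException if a file cannot be removed
    }
  }
}
```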

Command: <rename> for files and directories

Renames a file or directory. If the "from" path does not exist, does nothing and does not report an error. Throws an exception if the rename could not proceed (for example, due to a locked file or if the destination exists).

Element Type Default Description
rename/@fromPath string none Specifies the source path template for the directory or file to rename. Note that it is not considered to be an error if this path does not exist.
rename/@toPath string none Specifies the destination path template for the directory or file. Note that an exception will be thrown if this destination already exists.

Example:

 <rename fromPath="data/vectors/{XML:APPLICANT_ID}.vec" 
           toPath="data/vectors/{XML:APPLICANT_ID}.bak" />
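
The rename rules above (a missing source is ignored, an existing destination is an error) can be sketched with java.nio.file.Files.move, which refuses to overwrite unless told to. RenameSketch is a hypothetical name; this is an illustration of the semantics, not the Storage Handler's actual code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not the Storage Handler's real code.
class RenameSketch {

  // Renames 'from' to 'to'. A missing source is silently ignored.
  static void rename(Path from, Path to) throws IOException {
    if (!Files.exists(from)) {
      return;
    }
    // Without REPLACE_EXISTING, Files.move throws
    // FileAlreadyExistsException when the destination already exists.
    Files.move(from, to);
  }
}
```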

Command: <write> for files and variables

 (2.1 Release)  

Copies the content of the specified variable to a given file location. If the content of the variable is an instance of InputStream, Serializable, or AspireObject, that content is written to the file; otherwise, the whole AspireObject is written to the file.

Element Type Default Description
write/@variable string none Specifies the name of the job variable whose content is read.
write/@tag string none Specifies the variable name in the AspireObject.
write/@path string none A path template for the file to write to. Note that all relative paths will be relative to ASPIRE_HOME.
write/@append boolean false Specifies if the file should be opened for append.


Implementing A Custom Component Source Object

To implement a component which requires custom storage management, do the following.

Step 1: Add "com.searchtechnologies.aspire:aspire-storage-handler" as a dependency to your project in your "pom.xml" file.


Step 2: Have your pipeline stage implement AspireStorageInterface.

For example:

 public class MyPipelineStage extends ComponentImpl implements AspireStorageInterface {
   .
   .
   .
 }


Step 3: Create the code for the open() method required by AspireStorageInterface.

For example:

 @Override
 public Closeable open(Job j, Element config) throws AspireException {
   String pathToOpen = StandardPathBuilder.build(this, config, j);
   
   // Create, initialize, and return the appropriate object for your custom storage
   //
   // Note that the object must implement "Closeable".
   //
   // If you use the "StandardPathBuilder", then the <open> command specified in the
   // storage handler must have an @path attribute specifying the location which contains
   // the storage to open.
   //
   MyNewStorageObject storageObject = new MyNewStorageObject(... pathToOpen ... );
   
   // Now, be sure to save the object in a variable so you can access it later.
   AspireObject doc = (AspireObject)j.get();
   
   doc.putVariable("MyComponentStorageVar", storageObject);
   
   return storageObject;
 }


Step 4: Inside your pipeline stage, access the variable which contains the storage object as follows:

 @Override
 public void process(Job j) throws AspireException {
   MyNewStorageObject myStorage = 
     (MyNewStorageObject) JobDocumentHierarchyMap.get(j, "MyComponentStorageVar");
   
   .
   .
   .
 }

Note the use of JobDocumentHierarchyMap.get() which traverses the job hierarchy to fetch the specified variable from the appropriate AspireObject. This allows the storage handler to open the storage on a parent job, but then for the storage to be actually used in a sub-job pipeline.
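
To illustrate the idea of walking up the job hierarchy, the sketch below uses a hypothetical SketchJob class that holds a variable map and a reference to its parent job. It is a simplified stand-in for illustration only and does not reproduce the real Job, AspireObject, or JobDocumentHierarchyMap classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the real Aspire classes.
class HierarchyLookupSketch {

  // Simplified stand-in for a job: its own variables plus an optional parent.
  static class SketchJob {
    final SketchJob parent;
    final Map<String, Object> variables = new HashMap<>();

    SketchJob(SketchJob parent) {
      this.parent = parent;
    }
  }

  // Looks for the variable on the given job first, then on each ancestor
  // in turn. Returns null if no job in the chain defines it.
  static Object get(SketchJob job, String name) {
    for (SketchJob current = job; current != null; current = current.parent) {
      if (current.variables.containsKey(name)) {
        return current.variables.get(name);
      }
    }
    return null;
  }
}
```

In this sketch, a storage object stored on a parent job is found from any of its sub-jobs, which mirrors how storage opened on the parent pipeline can be used by a sub-job pipeline.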