Hot Folder Feeder 0.4

From wiki.searchtechnologies.com

Hot Folder Feeder
Description: Periodically monitors a number of directories, processing files in those directories and publishing to an Aspire pipeline manager.

The hot folder feeder monitors one or more directories, periodically polling each for files (optionally restricted by a file name filter). When a poll finds a matching file in the input directory, the feeder moves it to an in-process directory and publishes it from there to the configured pipeline. When processing of the job is complete, the file is moved from the in-process directory to the completed directory if the job succeeded, or to the quarantine directory if it did not. Once all the files in an input directory have been processed, the feeder moves on to the next directory; when no directories remain, it sleeps for a period of time before polling the directories again.

Inputs: The files in the monitored directories
Outputs: An AspireDocument object containing the path to the discovered file in the monitored directory in the <url> and <fetchUrl> tags, published to the configured pipeline manager.
Factory: aspire-filefeeder (previously aspire.FileFeeder)
Sub Type: hotFolderFeeder
Object Type: Produces AspireDocument objects.
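
As a rough illustration of the output, a document published for a discovered file might look something like the sketch below. Only the <url>, <fetchUrl> and <feederLabel> tags are taken from this page; the <doc> root element, the example path, and the exact format of the path are assumptions.

```xml
<!-- Illustrative only: the <doc> root element and the path format are assumptions -->
<doc>
  <!-- Path to the discovered file in the monitored directory (hypothetical path) -->
  <url>/data/crawl/simpleDomain/input-queue/site-001.arc.gz</url>
  <fetchUrl>/data/crawl/simpleDomain/input-queue/site-001.arc.gz</fetchUrl>
  <!-- Default feeder label; see feederLabel under Configuration -->
  <feederLabel>HotFolderFeeder</feederLabel>
</doc>
```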

Other Notes

  • While a file is being processed, it is held in the "in-process" directory.
  • After processing, the file is moved to the "completed" directory if processing succeeded, or to the "quarantine" directory if it failed.
  • This feeder is based on the Simple Feeder


Configuration

This feeder takes all parameters from the Simple Feeder plus the following:

Element | Type | Default | Description
feederLabel | string | HotFolderFeeder | The feeder label submitted in the <feederLabel> of the published document.
jobResultXsl | string | Use built-in XSL | The XSL used to transform the job result XML returned from the pipeline. Use it to control what content is stored in the retained errors for failed jobs.
writeResultFileForCompleted | boolean | true | By default, the Hot Folder Feeder creates a result.xml file in the 'completed' folder for each completed document. This may be unwanted in some cases (e.g. when you need to move all the completed files elsewhere without the result.xml files), so set this option to false to disable it. Keeping the result.xml file is recommended, since it provides useful tracking information.
hotFolders | | None | The configuration of the folders to monitor. See "Folder Configuration".
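
For example, a feeder that sets its own label and does not write result.xml files could be configured roughly as follows. This is a minimal sketch: the component name, label, folder paths, and pipeline names are hypothetical, and the remaining elements follow the pattern of the examples on this page.

```xml
<component name="invoiceFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
  <config>
    <!-- Published in the <feederLabel> of each document (hypothetical label) -->
    <feederLabel>InvoiceDrop</feederLabel>
    <!-- Do not write result.xml into the completed folder (the default is true) -->
    <writeResultFileForCompleted>false</writeResultFileForCompleted>
    <hotFolders>
      <!-- No match attribute, so every file in the input queue is processed -->
      <hotFolder>
        <inputQueueFolder>${crawlDataBase}/invoices/input-queue</inputQueueFolder>
        <inProcessFolder>${crawlDataBase}/invoices/in-process</inProcessFolder>
        <completedFolder>${crawlDataBase}/invoices/completed</completedFolder>
        <quarantineFolder>${crawlDataBase}/invoices/quarantine</quarantineFolder>
      </hotFolder>
    </hotFolders>
    <branches>
      <!-- Hypothetical pipeline manager and pipeline names -->
      <branch event="onPublish" pipelineManager="invoice-pipe-manager" pipeline="process-invoice" />
    </branches>
  </config>
</component>
```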


Folder Configuration

The hot folder feeder monitors one or more directories, periodically polling them to look for the presence of files. The folder configuration is shown below.

Element | Type | Description
hotFolders/hotFolder | parent tag | Holds all of the information for a single set of hot folder directories. Each <hotFolder> tag holds the information for a set of inputQueue/inProcess/completed/quarantine directories plus all of the parameters (timeouts, wildcard patterns, etc.) necessary for processing the files.

Note that a single hot folder feeder can contain as many <hotFolder> tags as you like, so one feeder can handle multiple hot folders.

hotFolders/hotFolder/@match | string | A regular expression describing the names of the files in the input directory that will be processed. If a file name is not matched by this expression, the file is ignored. If this option is not specified, all files are processed.
hotFolders/hotFolder/inputQueueFolder | string | The input directory to monitor. Files found in this directory when the feeder polls are moved to the in-process directory and published.
hotFolders/hotFolder/inProcessFolder | string | Files found in the input directory are moved to this directory and published. Files remain here until they are completely processed, after which they are moved to "completed" or "quarantine" as appropriate. Should the system crash, the files left in this directory are the ones that never finished, so they should probably be resubmitted (although they may also be the cause of the crash).
hotFolders/hotFolder/completedFolder | string | The completed directory. Files that are processed successfully are moved to this directory.

If a file is split into sub-jobs, the parent file is still considered to be "successful" (in the current design) even if one of its children/sub-jobs reported an exception error. The parent file is only reported as unsuccessful if the pipeline which processed the main job itself reported an exception.

hotFolders/hotFolder/quarantineFolder | string | The quarantine directory. Files that are not processed successfully are moved to this directory.
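
To illustrate multiple <hotFolder> tags and the match attribute, the sketch below configures two hot folders in one feeder: one that only picks up files ending in .xml, and one with no match attribute that processes everything it finds. The folder paths are hypothetical, and the pattern assumes that the expression is tested against the whole file name, as the .*\.arc\.gz pattern in the examples on this page suggests.

```xml
<hotFolders>
  <!-- Only files whose names end in .xml are processed; other files are ignored -->
  <hotFolder match=".*\.xml">
    <inputQueueFolder>${crawlDataBase}/xmlDrop/input-queue</inputQueueFolder>
    <inProcessFolder>${crawlDataBase}/xmlDrop/in-process</inProcessFolder>
    <completedFolder>${crawlDataBase}/xmlDrop/completed</completedFolder>
    <quarantineFolder>${crawlDataBase}/xmlDrop/quarantine</quarantineFolder>
  </hotFolder>
  <!-- No match attribute: all files in this input queue are processed -->
  <hotFolder>
    <inputQueueFolder>${crawlDataBase}/anyDrop/input-queue</inputQueueFolder>
    <inProcessFolder>${crawlDataBase}/anyDrop/in-process</inProcessFolder>
    <completedFolder>${crawlDataBase}/anyDrop/completed</completedFolder>
    <quarantineFolder>${crawlDataBase}/anyDrop/quarantine</quarantineFolder>
  </hotFolder>
</hotFolders>
```

Each hot folder here has its own set of four directories. Whether directories can be shared between <hotFolder> tags is not stated on this page, so keeping them separate is the cautious choice.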


Metadata Mapper Configuration

The hot folder feeder maps some metadata fields to fields in the AspireDocument XML.

Field | Default Output Field | Description
fileName | fileName | The file name of the published file.
path | fileName | The path to the file.
fullFileName | fileName | The full file name (including the path) of the file.
fullPath | fullPath | The full path to the file (excluding the file name).
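
A mapping that writes fields to non-default output names might look like the sketch below. The <metadataMap> and <map> syntax is taken from the Complex example on this page; the output field names sourceFile and sourceDir are hypothetical.

```xml
<metadataMap>
  <!-- Full file name including the path, written to a hypothetical <sourceFile> field -->
  <map from="fullFileName" to="sourceFile"/>
  <!-- Path excluding the file name, written to a hypothetical <sourceDir> field -->
  <map from="fullPath" to="sourceDir"/>
</metadataMap>
```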

Example Configurations

Simple

 <component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
   <config>
     <hotFolders>
       <hotFolder match=".*\.arc\.gz">
         <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
         <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
         <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
         <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
       </hotFolder>
     </hotFolders>
     <branches>
       <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
     </branches>
   </config>
 </component>

Complex

  <component name="simpleDomainFeeder" subType="hotFolderFeeder" factoryName="aspire-filefeeder">
    <config>
      <feederLabel>CrawlDomain</feederLabel>
      <metadataMap>
        <map from="fileName" to="fileName"/>
        <map from="fullPath" to="fullPath"/>
      </metadataMap>
      <autoStart>${autoFeedArc}</autoStart>
      <loopWait>43200000</loopWait>
      <feedWait>30000</feedWait>
      <hotFolders>
        <hotFolder match=".*\.arc\.gz">
          <inputQueueFolder>${crawlDataBase}/simpleDomain/input-queue</inputQueueFolder>
          <quarantineFolder>${crawlDataBase}/simpleDomain/quarantine</quarantineFolder>
          <completedFolder>${crawlDataBase}/simpleDomain/completed</completedFolder>
          <inProcessFolder>${crawlDataBase}/simpleDomain/in-process</inProcessFolder>
        </hotFolder>
      </hotFolders>
      <branches>
        <branch event="onPublish" pipelineManager="arc-reader-pipe-manager" pipeline="process-arc-file" />
      </branches>
    </config>
  </component>