Scan Directory (Aspire 2)

Scan Directory (Aspire 2)
Factory Name | com.searchtechnologies.aspire:aspire-storage-handler
subType | scanDir
Inputs | The directory location specified in <fetchUrl> in the AspireObject. Alternatively, you can specify pathToScan to scan the same directory every time (in this case, feedOne merely launches the job).
Outputs | Sub-jobs, each with an AspireObject containing a <fetchUrl> that holds the URL of a file that was scanned.

The Scan Directory stage is a subtype of the storage-handler service. It scans a directory, including all of its sub-directories, and creates a sub-job for each nested file.
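
The recursive behaviour described above can be pictured with a short Python sketch. This is only an illustration of the idea, not Aspire code: the function name scan_directory and the throwaway directory tree are hypothetical, and the real stage creates sub-jobs rather than returning a list of URLs.

```python
import os
import tempfile
from pathlib import Path


def scan_directory(root):
    """Yield a file: URL for every file under root, including sub-directories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a stable order
        for name in sorted(filenames):
            # as_uri() needs an absolute path, so resolve it first.
            yield (Path(dirpath) / name).resolve().as_uri()


# Build a small throwaway tree and scan it (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for relative in ("a.txt", os.path.join("sub", "b.txt")):
        Path(tmp, relative).write_text("sample")
    urls = list(scan_directory(tmp))

for url in urls:
    print(url)
```

Each yielded URL plays the role of the <fetchUrl> that a sub-job would carry.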


Configuration

Element | Type | Default | Description
branches | None | | The configuration of the pipeline to publish to. See below.
fileNamePatterns/include/@pattern | String | null | A regular expression that selects the files to include, e.g. ".*\.xml$".
fileNamePatterns/exclude/@pattern | String | null | A regular expression that selects the files to exclude, e.g. ".*tmp[^/]$".
pathToScan | String | null | The directory to scan, e.g. file:///C:/aspire-home/data. It is used only when the AspireObject has no <fetchUrl>; when <fetchUrl> is present, that location is scanned instead. In both cases only files allowed by the patterns are fed.


Branch Configuration

The Scan Directory stage publishes files using the branch manager, on the onPublish event. You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element | Type | Description
branches/branch/@event | String | The event to configure. This must be onPublish.
branches/branch/@pipelineManager | String | The name of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline | String | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager.


Metadata Mapper Configuration

The ScanDir stage contains a large number of additional metadata fields which can be mapped to fields in the AspireObject XML.


Field | Default Output Field | Description
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com").
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com").
mimeType | mimeType | The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8".
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" http header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" http header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL.
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK".
all other HTTP headers | - | Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area.


Scanning directory via HTTP Command

You can tell Scan Directory to scan a directory by sending an HTTP command directly through the Admin interface. The URL has the form:

http://<server>:50505/aspire/<component-name>?cmd=feed&url=<directory to feed>
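
As a rough illustration, the URL above could be assembled in Python as follows. The helper name build_feed_url, the host "localhost" and the component name "ScanDir" are placeholders chosen for this sketch, and percent-encoding the directory value is an assumption about what the server accepts (it is standard practice for query parameters). The sketch only builds the URL; it does not send a request.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_feed_url(server, component, directory, port=50505):
    """Build a feed URL of the form shown above (hypothetical helper)."""
    # Percent-encode the directory so that ':' and '/' stay inside one
    # query parameter value.
    query = urlencode({"cmd": "feed", "url": directory})
    return f"http://{server}:{port}/aspire/{quote(component.strip('/'))}?{query}"


feed_url = build_feed_url("localhost", "ScanDir", "file:///C:/aspire-home/data")
print(feed_url)

# Decoding the query string recovers the original parameters.
params = parse_qs(urlsplit(feed_url).query)
print(params)
```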

Example Configuration for directory scan

Always scan the same directory

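  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
    <!-- Scanned when the incoming AspireObject has no fetchUrl -->
    <pathToScan>file:///C:/aspire-home/st_files</pathToScan>
    <fileNamePatterns>
      <!-- The dot before "xml" is escaped so that it matches a literal "." -->
      <include pattern=".*\.xml$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <branches>
      <!-- Publish each file to the process-doc pipeline of ProcessFile -->
      <branch event="onPublish" pipelineManager="ProcessFile" pipeline="process-doc" />
    </branches>
  </component>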

In this example, the directory specified in <pathToScan> is scanned and, for each file allowed by the include/exclude patterns, a fetchUrl is generated and published to the "process-doc" pipeline of the "ProcessFile" pipeline manager. Multiple include and exclude patterns can be specified by adding more <include/> and <exclude/> elements. If the same pattern is specified as both an include and an exclude, the exclude takes precedence.
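
The filtering rule can be sketched in Python as below. This is a minimal model under stated assumptions, not the Aspire implementation: Aspire is a Java product, so Python's re module stands in for Java regular expressions; the function name is_allowed is hypothetical; and both checking excludes first and treating an empty include list as "include everything" are assumptions made for this sketch. The include pattern is written with the dot escaped.

```python
import re


def is_allowed(name, include_patterns, exclude_patterns):
    """Return True if name passes the include/exclude patterns."""
    # Excludes are checked first, so a name matched by both lists is rejected.
    if any(re.search(p, name) for p in exclude_patterns):
        return False
    # Assumption: with no include patterns, every remaining file is allowed.
    if not include_patterns:
        return True
    return any(re.search(p, name) for p in include_patterns)


include = [r".*\.xml$"]
exclude = [r".*tmp[^/]$"]

for candidate in ("report.xml", "notes.txt", "report.tmp~"):
    print(candidate, is_allowed(candidate, include, exclude))

# The same pattern in both lists: the exclude wins.
print(is_allowed("report.xml", [r".*\.xml$"], [r".*\.xml$"]))
```

Note that, as written, the exclude pattern ".*tmp[^/]$" requires exactly one character other than "/" after "tmp", so in this model it rejects "report.tmp~" but not "report.tmp".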

Complex configuration

This configuration adds metadata mapping. It has no <pathToScan>, so the directory to scan comes from the <fetchUrl> of the incoming AspireObject.

  <component name="ScanDir" subType="scanDir" factoryName="aspire-storage-handler">
    <fileNamePatterns>
      <!-- The dot before "txt" is escaped so that it matches a literal "." -->
      <include pattern=".*\.txt$" />
      <exclude pattern=".*tmp[^/]$" />
    </fileNamePatterns>
    <metadataMap>
      <!-- Rename content-length-bytes to file-length; keep file-name as is -->
      <map from="content-length-bytes" to="file-length"/>
      <map from="file-name" to="file-name"/>
    </metadataMap>
    <branches>
      <branch event="onPublish" pipelineManager="." pipeline="ProcessFile" />
    </branches>
  </component>
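
The effect of <metadataMap> can be modelled with a small Python sketch: mapped fields are renamed and everything else is left for the <extension> area. The function apply_metadata_map is a hypothetical stand-in written for this page, not an Aspire API; the field names and values are taken from the example output in the next section.

```python
def apply_metadata_map(metadata, mapping):
    """Split raw metadata into mapped fields and leftover extension fields."""
    mapped = {}
    extension = {}
    for name, value in metadata.items():
        if name in mapping:
            mapped[mapping[name]] = value  # renamed according to <map from/to>
        else:
            extension[name] = value  # unmapped fields go to <extension>
    return mapped, extension


scanned = {
    "content-length-bytes": "19",
    "file-name": "printwriter.txt",
    "modified-date": "2011-04-13T16:49:49Z",
    "parent-dir": r"testdata\scanDirTest1",
}
mapping = {"content-length-bytes": "file-length", "file-name": "file-name"}

mapped, extension = apply_metadata_map(scanned, mapping)
print(mapped)
print(extension)
```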

Example Output


<doc>
    <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
    <file-length source="ScanDir/content-length-bytes">19</file-length>
    <file-name source="ScanDir/file-name">printwriter.txt</file-name>
    <extension source="ScanDir">
        <field name="modified-date">2011-04-13T16:49:49Z</field>
        <field name="parent-dir">testdata\scanDirTest1</field>
        <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
    </extension>
  .
  .
  .
</doc>
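
A downstream consumer could read such a document along the following lines. This is a sketch using Python's standard xml.etree.ElementTree rather than any Aspire API, and the sample string reproduces only the elements shown above (the elided part of the document is left out).

```python
import xml.etree.ElementTree as ET

# Raw string so the Windows backslashes are kept literally.
SAMPLE = r"""<doc>
    <fetchUrl>file:/C:/work/workspace1/aspire-storage-handler/testdata/scanDirTest1/printwriter.txt</fetchUrl>
    <file-length source="ScanDir/content-length-bytes">19</file-length>
    <file-name source="ScanDir/file-name">printwriter.txt</file-name>
    <extension source="ScanDir">
        <field name="modified-date">2011-04-13T16:49:49Z</field>
        <field name="parent-dir">testdata\scanDirTest1</field>
        <field name="absolute-path">C:\work\workspace1\aspire-storage-handler\testdata\scanDirTest1\printwriter.txt</field>
    </extension>
</doc>"""

doc = ET.fromstring(SAMPLE)

# Mapped fields are direct children of <doc>.
fetch_url = doc.findtext("fetchUrl")
file_name = doc.findtext("file-name")
file_length = int(doc.findtext("file-length"))

# Unmapped fields sit under <extension> as <field name="..."> elements.
extension = {f.get("name"): f.text for f in doc.find("extension").findall("field")}

print(fetch_url, file_name, file_length)
print(extension["modified-date"])
```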