Hierarchy Extractor (Aspire 2)

For Information on Aspire 3.1 Click Here

Hierarchy Extractor (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-hierarchy-extractor
subType  default
Inputs  AspireObject with a 'hierarchy' tag
Outputs  Send jobs to index any new parents, their fields and ACLs
Feature only available with Aspire Enterprise

The Hierarchy Extractor looks for the 'hierarchy' tag in a job and, when it finds one, sends jobs to index any new parents along with their fields and ACLs.


Configuration

Element | Type | Default | Description
acls/acl/@usergroup | string | | The user/group name for the ACL.
acls/acl/type | string | Allow | Indicates whether the user/group will have access to the crawled files. Options include: allow, deny.
acls/acl/entity | string | group | Specifies if the ACL corresponds to a group or user. Options include: group, user.

If no fixed ACLs are configured as above, the union of the parent's and its children's ACLs is used as the parent ACLs. Each time a new child adds a new ACL to the union, the parent job is reindexed.
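As an illustration of the union behavior, the fragment below sketches what a parent's ACL list might look like after two children have been processed. It reuses the acl element format shown in the Example Output section. The user and group names and the two-child scenario are hypothetical, and the component's actual output may differ.

```xml
<!-- Illustrative parent ACLs when no fixed ACLs are configured (all names are hypothetical) -->
<acls>
  <!-- Contributed by the first child processed under this parent -->
  <acl access="allow" domain="mycompany" entity="user" fullname="mycompany\jdoe" name="jdoe" scope="global"/>
  <!-- Contributed by a later child; this ACL was not yet in the union, so the parent job is reindexed -->
  <acl access="allow" domain="mycompany" entity="group" fullname="mycompany\engineering" name="engineering" scope="global"/>
</acls>
```

A child whose ACLs are already in the union adds nothing new, so under the rule above it would not trigger another reindex of the parent.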

Branch Handler Configuration

This component publishes to the onAdd, onDelete and onUpdate events, so a branch must be configured for each of these three events.

Element | Type | Description
branches/branch/@event | string | The event to configure: onAdd, onDelete or onUpdate.
branches/branch/@pipelineManager | string | The name of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline | string | The name of the pipeline to publish to. If missing, publishes to the default pipeline for the pipeline manager.
branches/branch/@allowRemote | boolean | Indicates if this pipeline can be found on remote servers (see Distributed Processing for details).
branches/branch/@batching | boolean | Indicates if the jobs processed by this pipeline should be marked for batch processing (useful for publishers or other components that support batch processing).
branches/branch/@batchSize | int | The maximum size of the batches that the branch handler will create.
branches/branch/@batchTimeout | long | Time to wait before the batch is closed if the batchSize hasn't been reached.
branches/branch/@simultaneousBatches | int | The maximum number of simultaneous batches that will be handled by the branch handler.
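
To show how these attributes fit together, here is a minimal sketch that configures a branch for each of the three events and sets the batching attributes on one of them. The pipeline names (addPipeline, updatePipeline, deletePipeline) and all numeric values are placeholders chosen for illustration, not documented defaults. The unit of batchTimeout is not stated on this page, so the value shown is an assumption to verify against your Aspire version.

```xml
<component name="HierarchyExtractor" factoryName="aspire-hierarchy-extractor" subType="default">
   <branches>
      <!-- Placeholder values: batches of up to 50 jobs, at most 2 batches open at once -->
      <!-- batchTimeout unit is assumed here; the page does not specify it -->
      <branch event="onAdd" pipelineManager="." pipeline="addPipeline"
              allowRemote="false" batching="true"
              batchSize="50" batchTimeout="60000" simultaneousBatches="2"/>
      <!-- updatePipeline is a hypothetical pipeline name -->
      <branch event="onUpdate" pipelineManager="." pipeline="updatePipeline" batching="true"/>
      <branch event="onDelete" pipelineManager="." pipeline="deletePipeline" batching="true"/>
   </branches>
</component>
```

If the pipeline attribute is omitted from a branch, the job goes to the default pipeline of the named pipeline manager, as described in the table above.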

Example Configurations

Simple

<component name="HierarchyExtractor" factoryName="aspire-hierarchy-extractor" subType="default">
   <branches>
      <branch event="onAdd" pipelineManager="." pipeline="addPipeline" batching="true"/>
      <branch event="onDelete" pipelineManager="." pipeline="deletePipeline" batching="true"/>
   </branches>
</component>

Fixed ACLs Configuration

<component name="HierarchyExtractor" factoryName="aspire-hierarchy-extractor" subType="default">
   <acls>
      <acl usergroup="mycompany\aaguilar">
          <type>allow</type>
          <entity>user</entity>
      </acl>
      <acl usergroup="mycompany\stAllEmployees">
          <type>deny</type>
          <entity>group</entity>
      </acl>
   </acls>
   <branches>
      <branch event="onAdd" pipelineManager="." pipeline="addPipeline" batching="true"/>
      <branch event="onDelete" pipelineManager="." pipeline="deletePipeline" batching="true"/>
   </branches>
</component>

Example Output

For every new parent found, a job is sent to the "onAdd" event of the branch handler:

<doc source="/HierarchyExtractor/Main/HierarchyExtractor">
  <hierarchy>
    <item id="CDCE0D45AC20FDE62F5CEB6118643033" level="1" name="FSC" type="aspire/filesystem" url="C:\testdata\a\">
      <ancestors/>
    </item>
  </hierarchy>
  <id>C:\testdata\a\</id>
  <url>C:\testdata\a\</url>
  <fetchUrl>C:\testdata\a\</fetchUrl>
  <action>add</action>
  <md5>CDCE0D45AC20FDE62F5CEB6118643033</md5>
  <mimeType>aspire/filesystem</mimeType>
  <lastModified>2014-03-21T17:44:20Z</lastModified>
  <dataSize>0</dataSize>
  <content>url:C:\testdata\a\ docId:CDCE0D45AC20FDE62F5CEB6118643033</content>
  <sourceName>FSC</sourceName>
  <sourceType>filesystem</sourceType>
  <acls>
    <acl access="allow" domain="mycompany" entity="user" fullname="mycompany\aaguilar" name="aaguilar" scope="global"/>
    <acl access="deny" domain="mycompany" entity="group" fullname="mycompany\stAllEmployees" name="stAllEmployees" scope="global"/>
  </acls>
</doc>

Parent Database Management

There are five servlet commands for managing the parent database, available from the debug console:

  • Reindex
    Resends the jobs to the "onAdd" event of the configured Branch Handler.
  • Dump
    Creates a dump file of the database that you can import later.
  • Import
    Imports the data from a dump file on the file system.
  • Clear
    Deletes all content from the database. You can choose whether to send delete jobs to the "onDelete" branch of the configured Branch Handler.
  • Statistics
    Returns the count of parents stored in the database.