Branch Handler (Aspire 2)

For information on Aspire 3.1, see the Aspire 3.1 version of this page.

The Branch Handler is a common utility used by many components to specify how jobs are routed.

  • It is used by feeders to determine where jobs should be fed.
  • It is used by the pipeline manager to determine where jobs that have been completed or have received an error should be sent.
  • It is used by stages that create sub-jobs to determine where the sub-job should go.

See Programming Components which use the Branch Handler for information on how to code your component to use the branch handler.

Branch Events

Components and stages which use the branch handler branch the job based on a particular type of event. Most components use the "onPublish" event, the most common type, which simply indicates that the job is being published, i.e., that the job is ready to be sent somewhere else.

The pipeline manager also has "onError", "onComplete" and "onTerminate" events, which branch jobs on exceptions, job completion and job termination. See the pipeline manager for more details.
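
As a minimal sketch of how these pipeline manager events might be configured, the fragment below routes each of the three events to a different pipeline. The stage list is taken from the examples on this page; the target pipeline names "post-process-pipeline" and "cleanup-pipeline" are hypothetical and are used only for illustration.

```xml
<pipeline name="doc-process" default="true">
  <stages>
    <stage component="fetchUrl" />
    <stage component="extractText" />
  </stages>
  <branches>
    <!-- Jobs that raised an exception -->
    <branch event="onError" pipeline="error-pipeline" />
    <!-- Jobs that completed the pipeline (hypothetical target name) -->
    <branch event="onComplete" pipeline="post-process-pipeline" />
    <!-- Jobs that were terminated (hypothetical target name) -->
    <branch event="onTerminate" pipeline="cleanup-pipeline" />
  </branches>
</pipeline>
```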

Other types of branch events may be defined by other components. See the component configuration for a description of the various branch events that they define and what they mean.

Configuration

Components and stages which use the branch handler will require a <branches> tag in their <config> section, as shown in the following examples.

For each branch, you can specify which pipeline manager will receive the job when the branch event occurs:

<branches>
  <branch event="onPublish" pipelineManager="ProcessPatentPipelineManager" />
</branches>

In the above example, the job will be sent to the default pipeline within the "ProcessPatentPipelineManager" when the "onPublish" event occurs for the job.

You can also identify which pipeline should receive the job:

<branches>
  <branch event="onPublish" pipelineManager="ProcessPatentPipelineManager" 
          pipeline="process-application" />
</branches>

As well as which stage within the pipeline:

<branches>
  <branch event="onPublish" pipelineManager="ProcessPatentPipelineManager" 
          pipeline="process-application" stage="processInventors"/>
</branches>


Writing Jobs to a File

You can use the @writeToFile attribute to write all jobs branched from a branch handler to a file. This is typically used for unit testing of components.

 <branches>
   <branch event="onPublish" writeToFile="testout/scanDirTest.out"/>
 </branches>

See Programming Components which use the Branch Handler for more details.

Branching the Current Job

The above method, using branchhandler.enqueue() to enqueue a job on a pipeline manager, is good for new jobs.

If you want to make the current job go someplace else, the best method is to use the pipeline manager's branching structure. This is done by calling job.setBranch("branchLabel") in your component, for example:

 job.setBranch("onMissingData");

Note that this can be called by Groovy scripting components as well.

Next, in your pipeline manager, specify where the branch event should go:

  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <!-- The following pipeline stage causes an "onMissingData" event -->
      <stage component="checkForMissingData" />
    </stages>
    <branches>
      <branch event="onError" pipeline="error-pipeline" />
      <branch event="onMissingData" pipelineManager="dataEnhancementPipelineManager"/>
    </branches>
  </pipeline>

See the Pipeline Manager for more details on configuring pipelines.

Using this technique, the pipeline stage can be written to cause certain events to occur. It is the pipeline configuration that determines the actual location where the job will be sent.

Notes:

  • The same thread that processes the job carries the job through to completion, even if it has to go through another pipeline manager object to do so (pipeline managers have a process() method for exactly this purpose).
  • The event skips the remainder of the pipeline.
  • Components which cause branch events can be re-used in various different pipelines as much as needed.
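
To illustrate the last note, the sketch below reuses the same checkForMissingData component in two pipelines, each of which routes the "onMissingData" event to a different place. The second pipeline name, "archive-process", is hypothetical, and the surrounding stages are omitted for brevity.

```xml
<pipelines>
  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="checkForMissingData" />
    </stages>
    <branches>
      <!-- Here, jobs with missing data go to another pipeline manager -->
      <branch event="onMissingData" pipelineManager="dataEnhancementPipelineManager"/>
    </branches>
  </pipeline>
  <!-- Hypothetical second pipeline reusing the same component -->
  <pipeline name="archive-process">
    <stages>
      <stage component="checkForMissingData" />
    </stages>
    <branches>
      <!-- Here, the same event is routed to a pipeline in the current manager -->
      <branch event="onMissingData" pipeline="error-pipeline" />
    </branches>
  </pipeline>
</pipelines>
```

The component itself is unchanged in both cases; only the pipeline configuration decides where the job goes.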

Branching vs Enqueuing

One of the most frequent confusions related to the branch handler is where to place the <branches> configuration.

If you are enqueuing or processing jobs from a component (typically a feeder, or a Groovy script calling enqueue() or process()), you should place the <branches> tag within the component configuration. In this case, the branch must contain a reference to the pipelineManager. You cannot enqueue() or process() jobs on a branch where the pipeline manager is not specified.

 <component name="httpFeeder" subType="default" factoryName="aspire-http-feeder">
   <config>
     <servletName>cgi-bin</servletName>
     <feederLabel>httpFeeder</feederLabel>
     <branches>
       <!-- Works: the pipeline manager is specified -->
       <branch event="onPublish" pipelineManager="pipeManager" pipeline="query"/>
       <!-- Throws an exception when a job is enqueued: no pipelineManager -->
       <branch event="onUpdate" pipeline="query"/>
     </branches> 
   </config>
 </component>

If you are branching the current job from a component (say aspire-tools/conditionalBranch or job.setBranch("event") in Groovy scripts), then you should place the <branches> tag within the pipeline configuration of the pipeline manager (see pipeline manager branches for more details). In this case, if the branch does not contain a reference to the pipelineManager, it is assumed to be the current one.

 <component name="pipeManager" subType="pipeline" factoryName="aspire-application">
   <config>
     <components>
       .
       .
       <component name="federate" subType="default" factoryName="aspire-groovy">
         <config>
           <script>
           <![CDATA[
             .
             .
             // Set the main job to branch so we miss the unfederated query
             job.setBranch("onFederatedQuery");
           ]]>
           </script>
         </config>
       </component>
       .
       .
     </components>
     <pipelines>
       <pipeline name="query" default="true">
         <stages>
           .
           .
           <stage component="federate" />
           <stage component="loadXMLResults" />
           <stage component="waitForFederate" />
           .
           .
         </stages>
         <branches>
           <branch event="onFederatedQuery" stage="waitForFederate"/>
         </branches>
       </pipeline>
     </pipelines>
   </config>
 </component>

Job Batching

The branch handler can be configured to handle batches of jobs. Batching combines several smaller jobs into a larger one, which can reduce the number of transactions/requests made by pipeline stages and thus improve performance.

When batching is enabled, only components that support batching (such as Post HTTP) take advantage of it. Other components continue to process jobs individually.

See each component's documentation to find out whether it supports job batching.

Example configuration:

<branches>
  <branch event="onPublish" pipelineManager="ProcessPatentPipelineManager" 
          pipeline="process-application" 
          batching="true"
          batchSize="10"
          batchTimeout="1000"
          simultaneousBatches="2"
          batchPipeline="batchCompletedPipeline" 
          batchPipelineManager="ProcessCompletedBatchPipelineManager"
          batchStage="batchCompletedStage"  />
</branches>

Attributes:

  • batching - Indicates if batching is enabled or not for this branch.
  • batchSize - Maximum number of jobs in a batch (default is 100).
  • batchTimeout - Maximum time a batch is allowed to be inactive, in milliseconds (default is 15000ms).
  • simultaneousBatches - Number of simultaneous batches to have alive at the same time (default is 2).
  • batchPipelineManager - PipelineManager where the batch job will be sent. (optional)
  • batchPipeline - Pipeline where the batch job will be sent. (optional)
  • batchStage - Stage where the batch job will be sent. (optional)
  • batchMethod - Indicates how the batch job must be sent ("process" or "enqueue"). (default is "enqueue")

The options batchPipelineManager, batchPipeline and batchStage are configured in the same way as pipelineManager, pipeline and stage; the difference is that they determine where the batch job is sent once the batch is completed. For further details, see Batch Job.
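
As a further sketch, the fragment below enables batching while leaving batchSize, batchTimeout and simultaneousBatches unset, on the assumption that they then take the defaults listed above (100, 15000ms and 2). It also sets batchMethod to "process" rather than the default "enqueue". The manager and pipeline names are reused from the example configuration.

```xml
<branches>
  <!-- batchSize, batchTimeout and simultaneousBatches are omitted;
       this sketch assumes the documented defaults apply -->
  <branch event="onPublish" pipelineManager="ProcessPatentPipelineManager"
          pipeline="process-application"
          batching="true"
          batchPipelineManager="ProcessCompletedBatchPipelineManager"
          batchPipeline="batchCompletedPipeline"
          batchMethod="process" />
</branches>
```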

Remote Branching

Refer to Distributed Processing for general information on remote branching.

Remote branching causes a job to be sent to, and enqueued on, a remote node's pipeline manager. Note that pipeline manager, pipeline and stage names are specified in the same way as for local branches (optional pipeline and stage, relative or absolute path to the pipeline manager, etc.).

You must have configured Distributed Communications in the settings.xml file. You must also have configured at least one discovery method, so that there are remote nodes available to send jobs to.

If the pipeline is not found locally, a remote pipeline will be used.

To enable remote branching, set allowRemote="true" on the desired <branch>. For example:

  <branches>
    <branch event="onBranch" pipelineManager="pipeMgr" allowRemote="true"/>
  </branches> 

By default, if allowRemote is true, jobs are still sent locally if the target pipeline manager/pipeline/stage exists locally. This is controlled by the localQueueThreshold attribute, which sets how many jobs may be handled locally at the same time. For example, if you set it to 0, all jobs are sent remotely. If you set it to 10 and the local resource is already busy processing 10 jobs, the 11th job is sent remotely.


Example:

  <branches>
    <branch event="onBranch" pipelineManager="pipeMgr" allowRemote="true" localQueueThreshold="100"/>
  </branches> 

By default, localQueueThreshold is 30. If the branch target doesn't exist locally, everything is sent remotely.
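
For comparison, the sketch below sets the threshold to 0, which, per the description above, sends every job to a remote node even when the target exists locally. The event and pipeline manager names are the same placeholders used in the earlier remote branching examples.

```xml
<branches>
  <!-- localQueueThreshold="0": no jobs are handled locally; all are sent remotely -->
  <branch event="onBranch" pipelineManager="pipeMgr" allowRemote="true" localQueueThreshold="0"/>
</branches>
```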