Pipeline Manager

From wiki.searchtechnologies.com


The Pipeline Manager is responsible for processing jobs (which contain documents to process) through pipelines. Pipeline managers are essentially passive and wait for jobs to arrive (from feeders). Jobs are put on an internal job queue and are then picked up by execution threads (a thread pool), which process each job through the pipeline to completion.

Key Concepts

Jobs

Pipeline managers process jobs. Jobs can come from either of two sources:

  1. Feeders - Receive work from outside Aspire and create new jobs, which they then send to pipeline managers to execute.
  2. Other Pipeline Stages - Sometimes a job is divided into multiple sub-jobs; for example, a Zip file may contain many nested documents to process. When this happens, a pipeline stage can create a sub-job, which is then sent to a pipeline manager.

Queues and Threading

Every pipeline manager maintains a queue and a thread pool. New jobs received by the pipeline manager will be first placed on the queue. When a thread becomes available, it will take the next job from the queue and will process that job through the specified pipeline.

Note that the thread carries the job all the way through to completion (see below). This may include branching the job to other pipelines (see the <branches> tag below).

See below for a list of parameters that control job queueing and thread pools (the size of the queue, the maximum number of threads, etc.). Also note that there is a timeout for idle threads, so that system resources are minimized when not all threads are required.
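The queue-plus-pool behavior described above can be sketched with standard Java concurrency utilities. This is an illustrative sketch only, not Aspire's actual implementation; the class name and numbers simply mirror the queueSize, maxThreads, and idle-thread-timeout parameters discussed in this section.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: a bounded job queue drained by a thread pool
// whose idle workers time out, similar to a pipeline manager's queue.
public class PipelineManagerSketch {
    public static int process(int jobCount) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, 10,                        // up to maxThreads = 10 workers
                60, TimeUnit.SECONDS,         // idle workers exit after 60s
                new ArrayBlockingQueue<>(30), // queueSize = 30
                // when the queue is full, the submitting thread runs the
                // job itself, throttling the feeder
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < jobCount; i++) {
            pool.execute(completed::incrementAndGet); // "process" one job
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.MINUTES); // like shutdownTimeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```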

Multiple Pipeline Managers

If you need multiple queues and multiple thread pools, then just create multiple pipeline managers. This is a useful technique for managing thread pools to ensure that one pool does not get starved.

In general, a pipeline manager should only process a certain type of job. If you have multiple types of jobs, it is best to create multiple pipeline managers. For example, parent jobs and sub-jobs are best handled by multiple pipeline managers to ensure that parent job processing is not starved for threads while the sub-jobs are processing.
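As a sketch, such a configuration might define one manager for parent document jobs and a second for sub-jobs, each with its own queue and thread pool (the component names and sizes here are illustrative, not required values):

```xml
<component name="DocProcessManager" subType="pipeline" factoryName="aspire-application">
  <queueSize>30</queueSize>
  <maxThreads>5</maxThreads>
  <!-- pipelines for parent jobs go here -->
</component>

<component name="SubJobManager" subType="pipeline" factoryName="aspire-application">
  <queueSize>60</queueSize>
  <maxThreads>20</maxThreads>
  <!-- pipelines for sub-jobs go here -->
</component>
```

Because each manager has its own thread pool, a flood of sub-jobs cannot consume the threads needed to finish their parents.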

Branching or Routing to Other Pipeline Managers

Branching from one pipeline manager to another does not cause the job to be re-enqueued onto the remote pipeline manager's job queue. Instead, the original thread is used to continue processing the job through the remote pipeline manager's pipeline.

This means that once a thread has accepted a job to process, that thread will process the job all the way through to completion - even if this means calling the process() method of other pipeline managers to process the job.

The same is true when jobs are routed to other Pipeline Managers using routing tables.

Pipeline Managers are also Component Managers

A pipeline manager is a sub-class of component manager. This means that component manager configuration options (such as installing bundles and creating components) are also available as part of the pipeline manager configuration.

Job Completion

A job is "completed" in either of two situations:

  1. It has been processed through the last pipeline stage and there is no "onComplete" branch defined.
    • If there is an "onComplete" branch defined, then the job will be branched to the specified destination and its processing will continue.
  2. Any pipeline stage has returned an exception while processing the job.
    • This will typically cause the job to "complete".
    • However, if the pipeline contains an "onError" branch, then the job may continue processing on some other pipeline (an error handling pipeline, for example).

If either of these two situations occurs, the job is "completed". The pipeline manager will do the following for completed jobs:

  1. If the job is a sub-job, the parent job will be notified that this sub-job is complete.
  2. If the job is a parent job, then the pipeline manager will block the job until all of its sub-jobs are complete.
  3. If there are listeners on the job, they will be notified that the job is completed and each one will be given the opportunity to do further processing on the job.
    • This is typically used for feeders, which may need to do some cleanup and/or bookkeeping about the job once it is completed.
  4. Once all listeners are complete, the job is closed.
    • The job's data object (typically an instance of AspireDocument) will be closed.
    • Any objects that are on the "Closeable list" of the AspireDocument (input streams, RDBMS connections, and so on) that are attached to the job object will be closed.
    • All references are set to null to release memory as quickly as possible.
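The close step can be sketched as follows. This is an illustrative sketch only, not Aspire's actual API; the Job class and method names here are hypothetical stand-ins for the behavior described in step 4.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a job that, on completion, closes everything on its
// "closeable list" and drops the references so memory can be reclaimed.
public class JobCloseSketch {
    public static class Job {
        private List<Closeable> closeables = new ArrayList<>();

        public void registerCloseable(Closeable c) {
            closeables.add(c);
        }

        public void close() {
            for (Closeable c : closeables) {
                try {
                    c.close(); // input streams, RDBMS connections, and so on
                } catch (IOException ignored) {
                    // a failed close should not block the rest of the cleanup
                }
            }
            closeables = null; // release references as quickly as possible
        }
    }
}
```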

Health Checks

Pipeline managers are also responsible for performing system health checks. Health checks check the overall health of the system for things like:

  • Jobs which encounter exception errors
  • Jobs which are too slow
  • Components which are still initializing after startup

Once configured, the health of the entire server, as well as detailed information on each health check, is available through the admin RESTful interface.

See "Health Checks" below for more details.

Configuration

The basic structure for configuring a pipeline manager is as follows:

    <component name="PIPELINE-MANAGER-NAME" subType="pipeline" factoryName="aspire-application">
      <queueSize>30</queueSize>
      <maxThreads>10</maxThreads>
      <queueTimeout>300000</queueTimeout>
      <shutdownTimeout>300000</shutdownTimeout>

      <components>
        <!-- Identify and configure all components here, any order -->
        <component name="COMPONENT1" subType="SUBTYPE" factoryName="aspire-component1"> ... </component>
        <component name="COMPONENT2" subType="SUBTYPE" factoryName="aspire-component2"> ... </component>
        <component name="COMPONENT3" subType="SUBTYPE" factoryName="aspire-component3"> ... </component>
        .
        .
        .
      </components>

      <pipelines>
        <!-- The list of pipelines go here -->
        <pipeline name="PIPELINE1" default="true">
          <stages>
            <!-- List all stages in the pipeline here, in order. -->
            <stage component="COMPONENT1" />
            <stage component="COMPONENT2" />
            <stage component="COMPONENT3" />
            .
            .
            .
          </stages>
        </pipeline>

        <!-- A pipeline manager can manage any number of pipelines -->
        <pipeline name="PIPELINE2"> ... </pipeline>
        <pipeline name="PIPELINE3"> ... </pipeline>
        .
        .
        .
      </pipelines>
    </component>

Notes:

  • The pipeline manager configuration contains a list of components, a list of pipelines, and a list of stages for each pipeline.
  • Each stage references a component by name from the list of components configured under the <components> tag.

Execution Parameters

The following execution parameters are configurable for the pipeline manager:

Element Type Units Default Description
queueSize int jobs 30 The size of the queue for processing jobs. If the job queue is full, feeders that attempt to put a new job on the queue will be blocked until the queue has room. It is recommended that the queue size be at least as large as the number of threads, if not two or three times larger.
maxThreads int threads 10 The maximum number of threads to create to handle jobs put on the pipeline manager job queue.
queueTimeout int ms 300000 (5 minutes) The maximum time that a feeder (or, possibly, a sub-job that is enqueued) will be blocked by a full queue. If the queue is still full after the specified time, an exception error is returned to whoever is queuing up the job.
shutdownTimeout int ms 300000 (5 minutes) When shutting down the pipeline manager, the maximum time to wait for all active threads to complete.
gatherStatistics boolean - false Gather statistics about the stages and pipelines. This can also be turned on or off via the UI.

Components

The components list specified in the pipeline manager is a simple list of components, each with its own configuration parameters. For more details, see the discussion under the Configuring Components section of the system configuration file documentation.

Note that all components configured under the <components> tag must be pipeline stages if they are to be referenced in a pipeline <stage> element.

Pipelines

Most pipeline configurations are a simple list of stages, for example:

<pipelines>
  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
</pipelines>

Enabling and Disabling Pipelines and Stages

 (1.2 Release)  In a similar manner to components, pipelines and stage references can be disabled using the @enable or @disable attributes. If both @enable and @disable are specified, the value of @enable takes precedence. Disabled pipelines are completely removed from the system, as if they had never been written into the XML file at all. For a stage reference, disabling removes the reference to the stage from the pipeline, but does not alter the component definition for the stage.

These flags are useful for turning on or off pipelines and references to stages in response to property settings (either as an App Bundle or via property settings specified in the settings.xml file).

Example:

<pipelines>
  <!-- The next two pipelines are declared, but disabled. -->
  <pipeline name="doc-process1" enable="false">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
  <pipeline name="doc-process2" disable="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" />
      <stage component="dateChooser" />
      <stage component="extractDomain" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
  
  <!-- The next pipeline is enabled, but disables the 'splitter', 'dateChooser' and 'extractDomain' components. -->
  <pipeline name="doc-process3" enable="true">
    <stages>
      <stage component="fetchUrl" />
      <stage component="extractText" />
      <stage component="splitter" enable="false" />
      <stage component="dateChooser" disable="true" />
      <stage component="extractDomain" enable="false" />
      <stage component="printToFile" />
      <stage component="feed2Solr" />
    </stages>
  </pipeline>
</pipelines>

If neither @enable nor @disable is present, the pipeline or stage is assumed to be enabled.

Pipeline Configuration

Element Description
pipeline/@name The name of the pipeline. Can be used to branch from one pipeline to another (see branching statements below).
pipeline/@default "true" if the pipeline is the default pipeline for the pipeline manager. Jobs sent to the pipeline manager will be automatically sent to the default pipeline unless another pipeline is specified by name.
pipeline/@enable  (1.2 Release)  True if the pipeline should be enabled.
pipeline/@disable  (1.2 Release)  True if the pipeline should be disabled.
pipeline/stages/stage The list of stages which make up the pipeline. Each pipeline is a single linear list of stages.
pipeline/stage/@component The name of the component which will serve as the pipeline stage. Note that all pipeline stages are also Aspire components (the reverse is not true).
pipeline/stage/@enable  (1.2 Release)  True if the stage should be enabled.
pipeline/stage/@disable  (1.2 Release)  True if the stage should be disabled.

Typically these references are "local" references, i.e., references to components defined within the same pipeline manager. However, it is perfectly okay to use absolute path names, such as /Common/OtherPipelineManager/OtherStage, or relative paths, such as ../OtherPipelineManager/OtherStage, as the component attribute. In this way you can share components across pipeline manager configurations.

Note, however, that sharing components in this way is rarely required. Only do this if the component contains some large resource (such as a dictionary loaded into RAM) that needs to be shared to preserve memory.
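For example, a stage reference in one pipeline manager might point to a large dictionary-loading component defined under another manager (the component name "SharedDictionary" here is illustrative):

```xml
<stages>
  <stage component="fetchUrl" />
  <stage component="/Common/OtherPipelineManager/SharedDictionary" />
</stages>
```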

Pipeline Branches

Pipelines can also be configured with branches which determine what happens to a job/document when certain events occur. Branches are configured inside the pipeline using a <branches> tag, like below:

<pipelines>
  <pipeline name="doc-process" default="true">
    <stages>
      <stage component="FetchUrl" />
      <stage component="ExtractText" />
      <stage component="Splitter" />
      <stage component="DateChooser" />
      <stage component="ExtractDomain" />
      <stage component="PrintToFile" />
      <stage component="Feed2Solr" />
    </stages>
    <branches>
      <branch event="onError" pipeline="error-pipeline" />
      <branch event="onComplete" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" />
      <branch event="onMyEvent" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" stage="some-stage"/>
    </branches>
  </pipeline>

  <pipeline name="error-pipeline">
    <!-- process packages for which exception errors are thrown -->
    .
    .
    .
  </pipeline>
</pipelines>

If @pipelineManager is not specified, then the event will branch to the same pipeline manager. If @pipeline is not specified, the event will branch to the same pipeline on this pipeline manager (if @pipelineManager is not given), or the default pipeline on the specified pipeline manager. If @stage is specified, then the processing of the job will continue with that stage (which could be in the middle of the pipeline), on the pipeline manager and pipeline determined by the above rules.
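For example, a branch that specifies only @stage stays on the same pipeline manager and pipeline, and resumes processing at the named stage (the event and stage names here are illustrative):

```xml
<branch event="onMyEvent" stage="dateChooser" />
```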

There are three built-in events which can be triggered for a job which is being processed by the pipeline:

Branch Event Description
onError If any exception error is thrown by a pipeline stage processing a job, the pipeline manager will look for an "onError" branch and will route the job to the specified destination if it exists.
onComplete When the job has completed a pipeline, the pipeline manager will look for an "onComplete" branch. If it exists, the job will be routed to the specified destination.
onTerminate If a job is terminated by a pipeline stage (note: this is different from an exception error, see below), the pipeline manager will check for an "onTerminate" branch and, if found, will route the terminated job to the specified destination. Once the job is routed, it is no longer considered "terminated" and continues processing as before.

However, other components may raise other events.

Terminating Jobs

There are many cases where a job will need to be terminated. Note that "termination" is not the same as "exception" or "error". Jobs that are terminated are still considered to be "successful". Basically, termination means that the job (or sub-job) just skips the rest of the pipeline.

Termination is useful for any situation where you do not want to process a job further, i.e., it allows stages to "filter out" jobs from the pipeline. This is typically used for documents containing some expected condition that indicates the job should not be indexed, e.g., the document is a duplicate of some other document, it doesn't contain enough domain keywords, or it was used by the crawler only as a starting point and should not itself be indexed.

Termination is implemented with a terminate() method. When called, terminate() sets a termination flag which is checked by the pipeline manager as soon as the current stage is complete. Jobs with the flag set will skip all remaining stages of the pipeline. Note that jobs also have setTerminate(flag) and getTerminate() methods, so you can check, set, and clear the flag as needed. These methods can be used both in stages and in Groovy scripts.
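As a sketch of the filtering pattern, a scripted closure stage (using the Groovy pipeline syntax covered later in this document) might terminate duplicate documents so they skip the rest of the pipeline. The doc.duplicate field here is purely hypothetical:

```groovy
// Hypothetical duplicate filter: terminate() makes the job skip all
// remaining stages, but the job still completes "successfully".
job | FetchUrl | stage{ if (doc.duplicate?.text() == "true") job.terminate() } | Feed2Solr
```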

Also note that pipelines can have branches, and that an optional "onTerminate" event is available on the pipeline manager.

For example:

  <pipeline name="test2">
    <stages>
      <stage component="Schwarzenegger"/>
      <stage component="OldFashionedSgml"/>
    </stages>
    <branches>
      <branch event="onTerminate" pipeline="process-terminate"/>
    </branches>
  </pipeline>
  <pipeline name="process-terminate">
    <stages>
      <stage component="NewFangledXml"/>
      <stage component="AndAnother"/>
    </stages>
  </pipeline>

In the above example, the "Schwarzenegger" stage causes the job to be terminated (Arnold is the Terminator, right?). This is trapped by the pipeline's "onTerminate" branch, which then sends the job to the "process-terminate" pipeline where it continues.

Again, note that having the branch is purely optional. If it doesn't exist, the job will simply skip all remaining stages in the pipeline and then exit.

Groovy Pipelines

 (1.3 Release)  Groovy pipelines are pipelines in which you control the flow of jobs through the stages using a Groovy script instead of a list of stages. For example:

<pipelines>
  <pipeline name="doc-process" default="true">
    <script><![CDATA[
      job | FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain | PrintToFile | Feed2Solr
    ]]></script>
  </pipeline>
</pipelines>

Variables

Variable Description
job (Java type: Job) References the job being processed by this Groovy pipeline. You can use this variable to process the job through stages.
doc (Java type: AspireObject) The AspireObject which holds all of the metadata for the current document being processed by this Groovy pipeline. This is the same as job.get() - the job's data object.
StageName Every stage component configured in the current Pipeline Manager is bound by its name to this Groovy pipeline, so you can reference stages by their configured names (e.g. job | FetchUrl | ExtractText).

External Stages

If you want to reference a stage configured outside the current Pipeline Manager, you can reference it by using the path to that stage component:

  job | stage("../OtherPipelineManager/HierarchyExtractor")

Stage Listing

Groovy pipelines allow you to dynamically build a list of stages to execute, giving you finer control over which stages should and shouldn't be processed based on the input job metadata.

  def myPath = ((doc.action == "add" || doc.action == "update")? 
                  FetchUrl |         //Stages to process if "add" or "update" action was received
                      ExtractText | 
                      ExtractDomain :  
                  PrintToFile         //Stages to process if no "add" or "update" action was received
               ) | Feed2Solr          //Stage to process every time after all stages

  job | myPath

Redirects

You can use the redirect feature to print the contents of jobs received in the current Groovy pipeline to a file, using the ">>" operator followed by the target file path.

   job | FetchUrl | ExtractText >> "out.xml" | Feed2Solr

In the previous example the redirect is executed before the "Feed2Solr" stage, so if that stage adds or modifies any content on the job metadata, the change will not be reflected in the "out.xml" file.
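If you want the redirect to capture the job as it stands after the final stage instead, place it at the end of the chain:

```groovy
job | FetchUrl | ExtractText | Feed2Solr >> "out.xml"
```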

Closure Stages

A Closure Stage is a stage embedded in the Groovy pipeline that receives a Groovy closure to execute. For example:

  job | stage{doc.add("fetchUrl","http://www.searchtechnologies.com")} | FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain | PrintToFile | Feed2Solr

You can use this to configure other job flows too:

  job | stage{
           doc.add("fetchUrl","http://www.searchtechnologies.com");
           job | FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain;
           println "End of Closure Stage"
        } | PrintToFile | Feed2Solr

Control flow

Groovy control flow statements can be used to control what pipeline to execute given any condition you want:

  job | FetchUrl | ExtractText;

  if (doc.type.text == "text/xml") 
   job | XMLProcessor | Post2Solr >> "xmlFiles.xml";
  else if (doc.type.text == "text/html") 
    job | HTTPProcessor | Post2Solr >> "htmlFiles.xml";
  else
    job | Post2Solr >> "otherFiles.xml";
    

Iterations

You can loop through some stages as needed:

  for (i in 0..9) { 
    job | stage {doc.add("stageNum",i)}
  }

The previous example will produce the following job:

<doc>
  <stageNum>0</stageNum>
  <stageNum>1</stageNum>
  <stageNum>2</stageNum>
  <stageNum>3</stageNum>
  <stageNum>4</stageNum>
  <stageNum>5</stageNum>
  <stageNum>6</stageNum>
  <stageNum>7</stageNum>
  <stageNum>8</stageNum>
  <stageNum>9</stageNum>
</doc>

Pipeline branches

Groovy pipelines can also be configured with branches which determine what happens to a job/document when certain events occur. Those branches are configured the same way as in normal pipelines:

<pipelines>
  <pipeline name="doc-process" default="true">
    <script><![CDATA[
      job | FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain | PrintToFile | Feed2Solr
    ]]></script>
    <branches>
      <branch event="onError" pipeline="error-pipeline" />
      <branch event="onComplete" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" />
      <branch event="onMyEvent" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" stage="some-stage"/>
    </branches>
  </pipeline>

  <pipeline name="error-pipeline">
    <!-- process packages for which exception errors are thrown -->
    .
    .
    .
  </pipeline>
</pipelines>

Stage exceptions

Stage exceptions are a way, inside Groovy pipelines, to get the same control over branches and errors as pipeline-level <branches>, but handled independently per stage. To configure them, call the exceptions() method of the stage in question; it receives a Map from event label to a Stage (or a list of Stages). For example:

   <pipeline name="doc-process" default="true">
    <script><![CDATA[
      job | FetchUrl.exceptions([
              onComplete:  stage{job >> "fetchUrlCompleted.xml"} | stage{println "FetchUrl completed for "+job.jobId}
             ])| ExtractText | 
                 Splitter | 
                 DateChooser | 
                 ExtractDomain | 
                 PrintToFile | 
                 Feed2Solr.exceptions([
                    onError: stage{job >> "Feed2SolrErrors.xml"},
                    onComplete: stage{job >> "Feed2SolrCompleted.xml"}
                 ]) >> "finished.xml"
    ]]></script>
    <branches>
      <branch event="onError" pipeline="error-pipeline" />
      <branch event="onComplete" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" />
      <branch event="onMyEvent" pipelineManager="SomeOtherPipemgr" pipeline="some-other-pipeline" stage="some-stage"/>
    </branches>
  </pipeline>

In this case, when a job completes the FetchUrl stage successfully, it will execute stage{job >> "fetchUrlCompleted.xml"} | stage{println "FetchUrl completed for "+job.jobId} before continuing with ExtractText. The same applies to onTerminate and onError exceptions. If a stage specifies an exception handler for onTerminate, onError, or any other event label (e.g. job.setBranch("onAdd")), subsequent stages will receive the job without the termination or branch. If an exception or branch is generated in a stage with no exceptions declaration to handle it, it will propagate up until a stage that handles it is found. If no stage handles the exception/branch, it will be branched according to the <branches> section of the Groovy pipeline.

For example, in the configuration above, if Feed2Solr has an error, it will execute stage{job >> "Feed2SolrErrors.xml"} and the job will then continue to the next stage, which is a redirect to "finished.xml"; at the end, the "onComplete" branch from the pipeline will be executed. If the "onError" exception were not configured on the Feed2Solr stage, any error thrown in that stage would be handled by the "onError" branch of the pipeline, and execution of the pipeline would end at that moment, without executing the redirect to "finished.xml".

You can also configure exceptions to lists of Stages:

  def myStagePath = FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain | PrintToFile ;
  job | myStagePath.exceptions([
        onComplete: Feed2Solr
      ]);

Nested exception handling is also available:

  def myStagePath = FetchUrl | ExtractText | Splitter | DateChooser | ExtractDomain | PrintToFile ;
  job | myStagePath.exceptions([
        onComplete: Feed2Solr.exceptions([
                      onError: stage{job >> "fetchUrlError.xml"},
                      onComplete: stage{job >> "indexedJobs.xml"}
                    ])
      ]);

Handling Subjobs

Groovy pipelines provide a way of controlling the flow of sub-jobs through stages. Using the subJobs() method of a stage, you can specify what to execute for any sub-jobs generated in that stage. It receives either a single Groovy closure or a Map from label (set when the sub-job was branched) to a Stage (or a list of stages):

  job | FetchUrl | XmlSubJobExtractor.subJobs([
                     onSubJob: stage{job | FetchUrl | ExtractText | PostHttp >> "subjobs.xml"}
                   ])

or just a single closure that will be executed regardless of the sub-jobs' branch labels:

  job | FetchUrl | XmlSubJobExtractor.subJobs(
                     {job | FetchUrl | ExtractText | PostHttp >> "subjobs.xml"}
                   )

A different Thread Pool Manager will be assigned to each stage and parent job to process its sub-jobs.

The maximum number of thread pools and their sizes can be configured as follows:

Element Default Description
pipeline/script/@maxThreadPools 10 The maximum number of thread pools to handle simultaneously by this Groovy pipeline for subjobs. If the maximum number of thread pools in use has been reached, then jobs that want to create new subjobs will have to wait until a thread pool is released by another job.
pipeline/script/@maxThreadsPerPool 10 The maximum number of threads to create (per thread pool) to handle subjobs.
pipeline/script/@maxQueueSizePerPool 30 The size of the queue (per thread pool) for processing subjobs. If the job queue is full, then stages that attempt to put a new subjob on the queue will be blocked until the queue has room. It is recommended that the queue size be at least as large as the number of threads, if not two or three times larger.

For example:

  <pipeline name="doc-process" default="true">
    <script maxThreadPools="10" maxThreadsPerPool="10" maxQueueSizePerPool="30"><![CDATA[
        job | FetchUrl | XmlSubJobExtractor.subJobs([
                     onSubJob: stage{job | FetchUrl | ExtractText | PostHttp >> "subjobs.xml"}
                   ])
    ]]></script>
  </pipeline>

Creating jobs

You can create jobs inside a Groovy Pipeline by using the createJob method:

   contactsJob = createJob('<doc><url>'+doc.url.text()+'/contacts.html</url></doc>')
   contactsJob | FetchUrl | ExtractText

Filesystem job feeder

You can use Groovy pipelines to create jobs for each file and directory in a given path. For this purpose, Groovy pipelines provide a function named 'dir', which takes up to three arguments:

Name Default Description
Path Aspire Home path The directory from which files and directories will be fetched. If running AspireShell, it can be changed using the 'cd' (change directory) command.
Closure - The closure to execute with every job created for each file and directory.
Arguments "" Specifies whether directories should create jobs (using "+d") and whether files should be fetched recursively (using "+r"). By default (no arguments), only files create jobs and the crawl is not recursive.

This function can be used in 4 different ways:

  • dir (Path, Closure, Arguments)
  • dir (Path, Closure)
  • dir (Closure, Arguments)
  • dir (Closure)

Each job created will have an <url> field pointing to the corresponding file/directory.

Example:

  dir {job | FetchUrl | ExtractText >> "files.xml"}               //Only files inside the Aspire_Home directory
  dir ({job | FetchUrl | ExtractText >> "files+dir.xml"},"+d")    //Files and directories inside the Aspire_Home directory
  dir ({job | FetchUrl | ExtractText >> "files+dir.xml"},"+d+r")  //Files and directories recursively inside the Aspire_Home directory
  dir ("data",{job | FetchUrl | ExtractText >> "data_files.xml"}) //Only files inside the Aspire_Home/data directory
  dir ("data",{job | FetchUrl | ExtractText >> "data_files+dir.xml"},"+d") //Files and directories inside the Aspire_Home/data directory

Configuring Health Checks

Pipeline managers can be configured to perform health checks that determine how Aspire is performing.

Configuration

To configure health checks for your system add a <healthChecks> section to the pipeline managers for which you desire health checks. Details on how to configure each type of health check are given below. Multiple health checks can be configured per pipeline manager.

The following is an example configuration:

 <component name="MyPipelineManager" subType="pipeline" factoryName="aspire-application">
   <healthChecks>
       
     <timestamp name="Test Timestamp" redThreshold="4000" yellowThreshold="1000" 
                history="5" />
       
     <initialization name="Test Long Initializer">
        <check componentRef="/pipeline/LongInitializer"/>
        <check componentRef="/pipeline/Concat-Test"/>
     </initialization>
       
     <jobCount redThreshold="1" />
       
   </healthChecks>
     
   <!-- Configure your pipelines here -->
     
   <!-- Configure your components here -->
   
 </component>

Note that all health checks can have a "user friendly name" attached to them (the @name attribute).

Once health checks are configured for your pipeline managers, they will be automatically accumulated and made available as requested.

Types

There are multiple types of health checks available. The types currently available include:

  • Initialization - Checks to see if components are fully initialized
  • Timestamp - Timestamps jobs when they start and when they are complete. Jobs which take too long to complete can be flagged as either "RED" or "YELLOW".
  • Job Count - Provides status on total jobs started, completed successfully, and completed with an error. Health is based on the count of jobs which failed.
  • Latency - Computes a moving average of the time it takes to complete a job.

Usage

Health Checks can be:

  • INITIALIZING - Either the Aspire System itself is still loading configurations, or a component has a long initialization which is still in progress.
  • GREEN - Everything okay
  • YELLOW - Some health check is showing poor behavior
  • RED - Jobs are failing or jobs are too slow to satisfy the service levels

The health checks for all pipeline managers across the system are accumulated into a single health check for the entire server. URLs are also available for accessing and managing health checks.

Health Check: Initialization

This health check is used to check components to see if they are initializing. If they are, the health of the system will be returned as "INITIALIZING".

Configuration:

     <initialization name="Test Long Initializer">
        <check componentRef="/pipeline/LongInitializer"/>
        <check componentRef="/pipeline/Concat-Test"/>
     </initialization>
  • <check> - Specifies the component to check.

Attribute:

  • @componentRef - The path name to the component. This must be the full path name of the component.

Health details:

  • This health check returns YELLOW if a component is not available.
  • This health check returns RED if there is an error accessing the component or getting the component's status.
    • A health of RED supersedes a health of INITIALIZING.

Note: It is expected that the initialization health check will be performed automatically in a future release of Aspire, at which point this health check will be deprecated.
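The rules above can be summarized in a short sketch (hypothetical Python, not Aspire's implementation; the precedence of YELLOW over INITIALIZING here is an assumption, since the documentation only specifies that RED supersedes INITIALIZING):

```python
def initialization_health(components):
    """Each component is a dict of booleans: "available", "error", "initializing".
    Returns the health implied by the initialization-check rules."""
    if any(c.get("error") for c in components):
        return "RED"            # error accessing a component; supersedes INITIALIZING
    if any(not c.get("available", True) for c in components):
        return "YELLOW"         # a component is not available
    if any(c.get("initializing") for c in components):
        return "INITIALIZING"   # a long initialization is still in progress
    return "GREEN"
```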

Health Check: Job Count

This health check keeps a total count of jobs and determines the health of the system from the total number of jobs that have failed.

Configuration:

 <jobCount name="Count of Document Jobs" redThreshold="3" yellowThreshold="1"/>

Attributes:

  • @redThreshold - If the total number of failed jobs is greater than or equal to this number, system health is RED. Must be 1 or more (if set to "0", your system will be RED all the time!).
  • @yellowThreshold - If the total number of failed jobs is greater than or equal to this number, system health is YELLOW. Must be 1 or more.

Health details will show:

  • Total count of jobs initiated
  • Total count of jobs which completed successfully
  • Total count of jobs which failed with an unhandled exception
  • Total count of jobs outstanding (equals total initiated minus total completed)
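The threshold logic above can be sketched as follows (a minimal illustration in Python, not Aspire's actual code):

```python
def job_count_health(failed_jobs, red_threshold, yellow_threshold):
    """Map a failed-job count to a health color using @redThreshold
    and @yellowThreshold semantics: greater-or-equal triggers the color."""
    if failed_jobs >= red_threshold:
        return "RED"
    if failed_jobs >= yellow_threshold:
        return "YELLOW"
    return "GREEN"
```

Note how a redThreshold of 0 makes the system RED unconditionally, which is why the attribute should always be 1 or more.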

Health Check: Time Stamps

Timestamp health checks check the duration of every job and are typically used for occasional jobs (for example, nightly jobs) that take a long time to run (e.g., hours).

Configuration:

 <timestamp name="Rebuild Dictionary Token Stats" history="5" redThreshold="10000" yellowThreshold="2000"/>

Attributes:

  • @history - The number of past timestamps to display on the health detail page.
  • @redThreshold - (milliseconds) If a job takes this much time to complete (or more), health will be RED.
  • @yellowThreshold - (milliseconds) If a job takes this much time to complete (or more), health will be at least YELLOW.

History:

A history of old timestamps will be kept and displayed in the health check details.

Note that only the most recent timestamp will contribute to the health of the overall system.
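The behavior described above can be sketched as below (hypothetical Python, not Aspire's implementation): a rolling history of the last @history durations is kept for display, but only the most recent one determines health.

```python
from collections import deque

class TimestampCheck:
    """Sketch of the timestamp check: keeps the last @history job
    durations, but only the most recent one drives health."""
    def __init__(self, history=5, red_threshold=10000, yellow_threshold=2000):
        self.durations = deque(maxlen=history)   # job durations in milliseconds
        self.red = red_threshold
        self.yellow = yellow_threshold

    def record(self, duration_ms):
        self.durations.append(duration_ms)

    def health(self):
        if not self.durations:
            return "GREEN"
        latest = self.durations[-1]              # older entries are display-only
        if latest >= self.red:
            return "RED"
        if latest >= self.yellow:
            return "YELLOW"
        return "GREEN"
```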

Health Check: Latency

Latency health checks compute a moving average of the time it takes to complete a job, and flag the health as RED or YELLOW if the average job latency rises above the specified thresholds.

Averages are computed over a specified number of jobs (@jobsToAverage). This method will also compute a peak average latency and give a history of averages for previous time periods.

Configuration: (defaults to 15 minute intervals over 24 hours)

 <latency name="Process Single Document" jobsToAverage="5" isSticky="true" 
          redThreshold="15000" yellowThreshold="5000" />

Configuration: (specify the interval and history length)

 <latency name="Process Single Document" jobsToAverage="5" isSticky="true" 
          redThreshold="15000" yellowThreshold="5000"
          interval="3600000" history="48" />

Attributes for Moving Averages:

  • @jobsToAverage - The number of jobs to include in the moving average which is used to compute health.
  • @isSticky - Specifies whether peak values are "sticky", that is, they hang around until cleared. For example, if your jobs slow down and the health is RED, it will remain RED (i.e., "sticky") even if they start to speed up again. Can be "true" or "false".
  • @redThreshold - (ms) If the moving average of job latencies rises above this threshold, health will be RED.
  • @yellowThreshold - (ms) The average latency threshold for declaring the health to be YELLOW.

Attributes for History Presentation:

  • @interval - (ms) The size of each history interval. If not supplied, defaults to 900000 (15 minutes).
  • @history - (count) The number of history intervals to save. Histories are a rolling window from the current time back through this number of intervals. Defaults to 96 intervals (24 hours' worth at the default 15-minute interval).

Notes:

  • Histories are independent of the moving average computation used for health. Therefore, it is possible for history intervals to show long latencies while the health status is still GREEN. This can occur when a history interval contains fewer jobs than the number specified in @jobsToAverage.
  • Don't make @jobsToAverage too small; an average over too few jobs will be unreliable.
  • Failed jobs are not added to latency numbers. Latencies are only computed for successful jobs, since these are the only ones which will provide reliable measurement data.
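The moving average and sticky-peak behavior can be sketched as below (hypothetical Python under the assumptions stated in the comments; Aspire's actual implementation may differ, e.g., in how the sticky peak interacts with the thresholds):

```python
from collections import deque

class LatencyCheck:
    """Sketch of the latency check: a moving average over the last
    @jobsToAverage successful jobs, with an optional sticky peak.
    Assumption: when sticky, health is judged on the peak average."""
    def __init__(self, jobs_to_average=5, red=15000, yellow=5000, sticky=True):
        self.samples = deque(maxlen=jobs_to_average)   # latencies in ms
        self.red, self.yellow, self.sticky = red, yellow, sticky
        self.peak_avg = 0.0

    def record(self, latency_ms, failed=False):
        if failed:
            return              # failed jobs are excluded from latency numbers
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        self.peak_avg = max(self.peak_avg, avg)

    def health(self):
        if not self.samples:
            return "GREEN"
        basis = (self.peak_avg if self.sticky
                 else sum(self.samples) / len(self.samples))
        if basis >= self.red:
            return "RED"
        if basis >= self.yellow:
            return "YELLOW"
        return "GREEN"
```

With isSticky="true", a RED peak keeps the check RED even after jobs speed up again; with isSticky="false", the health recovers as the moving average drops back below the thresholds.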