Monitoring crawl statistics and performance (Aspire 2)

From wiki.searchtechnologies.com

For information on Aspire 3.1, click here.


Training Material

If you're interested in learning more, here's a recording of the Performance and Auditing Tech Talk, along with the accompanying presentation.

Monitoring crawl statistics

Aspire connectors fetch content from content sources and provide a way to monitor the status of each crawl. The basic status information is the number of documents added, updated, and deleted during the crawl. If errors occur for any documents, they are counted and displayed as well.

To see the basic statistics of a crawl, click on "Statistics" in the content source tile:

Content Source tile

The following popup appears:

Content Source Statistics


Historical Statistics

This section lets you see the statistics for previous crawls: click on "Historical Statistics" and then select the start date and time of the crawl you want to check.

Historical Statistics

View Audit Logs

Aspire Audit Logs are a logging feature that tracks every action a content source performs on documents. You can track all ADDED, UPDATED, DELETED and NO CHANGED documents in a crawl. The goal of these logs is to help the administrator identify differences between the content crawled by Aspire and what is indexed in the search engine.

You can also track WORKFLOW_ERRORS, which are errors that occurred during workflow execution, and BATCH_ERRORS, which are problems that occurred when sending a batch of documents to a search engine.

Aspire publishers can be configured to dump their indexes to file in the form of Audit Logs, which can then be compared to the content source's Audit Logs to determine differences and possible crawl problems. For more detailed information about dumps and index comparisons, go to Aspire Audit Logs.
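
Conceptually, the comparison described above is a set difference between the documents the content source reports and the documents found in the publisher dump. The Python sketch below illustrates only that idea. The (document ID, action) tuples, the function name compare_audit_logs, and the sample IDs are hypothetical and do not reflect the actual Aspire Audit Log file format, which is documented in Aspire Audit Logs.

```python
def compare_audit_logs(source_entries, index_ids):
    """Compare content-source audit entries with IDs dumped from the index.

    source_entries: iterable of (doc_id, action) tuples in crawl order
                    (hypothetical layout, not the real Aspire file format).
    index_ids:      iterable of document IDs found in the publisher dump.
    """
    expected = set()
    for doc_id, action in source_entries:
        if action == "DELETED":
            # A deleted document should no longer be in the index.
            expected.discard(doc_id)
        else:
            # Added, updated and unchanged documents should all be indexed.
            expected.add(doc_id)

    indexed = set(index_ids)
    return {
        "missing_from_index": sorted(expected - indexed),
        "unexpected_in_index": sorted(indexed - expected),
    }


# Hypothetical sample data: doc-3 was never indexed, doc-2 was never removed.
source_log = [
    ("doc-1", "ADDED"),
    ("doc-2", "ADDED"),
    ("doc-3", "ADDED"),
    ("doc-2", "DELETED"),
]
index_dump = ["doc-1", "doc-2"]

report = compare_audit_logs(source_log, index_dump)
print(report)
```

Documents listed under missing_from_index point to possible publishing problems, while those under unexpected_in_index point to deletes that never reached the search engine.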

Aspire Performance Reports

Aspire Performance Reports are a feature that helps developers and administrators identify hot spots or bottlenecks in the execution of processing, extraction, or publisher stages.

The Performance Reports include job start and end times and the execution path of each job, with timing information for:

  • Pipeline Manager
  • Pipelines
  • Stages
  • Workflow Rules
  • Scanner methods

How does it work?

Example: Given the following application.xml file:

<application name="PerformanceStatisticsExample">
  <components>
    <component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
      <components>
        <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
        <component name="ExtractText" subType="default" factoryName="aspire-extract-text" />
        <component name="ExtractDomain" subType="default" factoryName="aspire-extract-domain" />
        <component name="PrintToFile" subType="printToError" factoryName="aspire-tools">
          <outputFile>log/${app.name}/exampleDebug.out</outputFile>
        </component>
      </components>
      <pipelines>
        <pipeline name="doc-process" default="true">
          <stages>
            <stage component="FetchUrl" />
            <stage component="ExtractText" />
            <stage component="ExtractDomain" />
            <stage component="PrintToFile" />
          </stages>
        </pipeline>
      </pipelines>
    </component>
  </components>
</application>

When that application processes a job, the following information is generated:

<performanceStatistics name="root" process="true">
  <stats>
    <startTime>2014-08-19T22:27:57Z</startTime>
    <endTime>2014-08-19T22:28:08Z</endTime>
    <processingTime>10927</processingTime>
  </stats>
  <pipelineManager name="/PerformanceStatisticsExample/StandardPipeManager">
    <stats>
      <startTime>2014-08-19T22:27:57Z</startTime>
      <endTime>2014-08-19T22:28:08Z</endTime>
      <processingTime>10926</processingTime>
    </stats>
    <pipeline name="doc-process">
      <stats>
        <startTime>2014-08-19T22:27:57Z</startTime>
        <endTime>2014-08-19T22:28:08Z</endTime>
        <processingTime>10926</processingTime>
      </stats>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/FetchUrl">
        <stats>
          <startTime>2014-08-19T22:27:57Z</startTime>
          <endTime>2014-08-19T22:28:02Z</endTime>
          <processingTime>5595</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractText">
        <stats>
          <startTime>2014-08-19T22:28:02Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>5330</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractDomain">
        <stats>
          <startTime>2014-08-19T22:28:08Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>0</processingTime>
        </stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/PrintToFile">
        <stats>
          <startTime>2014-08-19T22:28:08Z</startTime>
          <endTime>2014-08-19T22:28:08Z</endTime>
          <processingTime>0</processingTime>
        </stats>
      </stage>
    </pipeline>
  </pipelineManager>
</performanceStatistics>
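
One way to use a report like the one above to spot bottlenecks is to rank the stages by processingTime. The following Python sketch does this with the standard library only. The embedded XML is a condensed copy of the report above (only stage names and processingTime values are kept), and the function name slowest_stages is an assumption made for this illustration, not part of Aspire.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the report above: stage names and processingTime only.
REPORT = """
<performanceStatistics name="root" process="true">
  <pipelineManager name="/PerformanceStatisticsExample/StandardPipeManager">
    <pipeline name="doc-process">
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/FetchUrl">
        <stats><processingTime>5595</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractText">
        <stats><processingTime>5330</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractDomain">
        <stats><processingTime>0</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/PrintToFile">
        <stats><processingTime>0</processingTime></stats>
      </stage>
    </pipeline>
  </pipelineManager>
</performanceStatistics>
"""


def slowest_stages(report_xml):
    """Return (stage short name, milliseconds) pairs, slowest first."""
    root = ET.fromstring(report_xml.strip())
    stages = []
    for stage in root.iter("stage"):
        millis = int(stage.findtext("stats/processingTime"))
        # Keep only the last path segment, e.g. "FetchUrl".
        short_name = stage.get("name").rsplit("/", 1)[-1]
        stages.append((short_name, millis))
    return sorted(stages, key=lambda item: item[1], reverse=True)


ranked = slowest_stages(REPORT)
total = sum(millis for _, millis in ranked)
for name, millis in ranked:
    print(f"{name}: {millis} ms")
print(f"All stages: {total} ms")
```

For this report, FetchUrl and ExtractText account for essentially all of the stage time, so they would be the first places to look for a bottleneck.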

Note that the processing time of a parent node is the sum of its children's processing times, sometimes plus a small overhead of its own. The processingTime is given in milliseconds. A value of 0 means that the node took less than 1 millisecond to process, because the report does not record time units smaller than a millisecond.
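
The relationship between a parent's processing time and its children can be checked directly against the example report. The Python sketch below computes, for every node that has children, its own processingTime, the sum of its children, and the difference (the overhead). It uses a condensed copy of the report above with the start and end times omitted. The function name parent_overheads is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the report above: start and end times are omitted.
REPORT = """
<performanceStatistics name="root" process="true">
  <stats><processingTime>10927</processingTime></stats>
  <pipelineManager name="/PerformanceStatisticsExample/StandardPipeManager">
    <stats><processingTime>10926</processingTime></stats>
    <pipeline name="doc-process">
      <stats><processingTime>10926</processingTime></stats>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/FetchUrl">
        <stats><processingTime>5595</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractText">
        <stats><processingTime>5330</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/ExtractDomain">
        <stats><processingTime>0</processingTime></stats>
      </stage>
      <stage name="/PerformanceStatisticsExample/StandardPipeManager/PrintToFile">
        <stats><processingTime>0</processingTime></stats>
      </stage>
    </pipeline>
  </pipelineManager>
</performanceStatistics>
"""


def parent_overheads(node, found=None):
    """Map each parent's name to (own ms, sum of children ms, overhead ms)."""
    if found is None:
        found = {}
    # Every element other than <stats> is a child node of the execution path.
    children = [child for child in node if child.tag != "stats"]
    if children:
        own = int(node.findtext("stats/processingTime"))
        child_sum = sum(int(c.findtext("stats/processingTime")) for c in children)
        found[node.get("name")] = (own, child_sum, own - child_sum)
        for child in children:
            parent_overheads(child, found)
    return found


overheads = parent_overheads(ET.fromstring(REPORT.strip()))
for name, (own, child_sum, overhead) in overheads.items():
    print(f"{name}: own={own} ms, children={child_sum} ms, overhead={overhead} ms")
```

In this report the doc-process pipeline takes 10926 ms while its four stages sum to 10925 ms, so it carries 1 ms of overhead. The root node adds another 1 ms on top of the pipeline manager.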

For further information on how to enable and download the logs and reports, go to Performance Reports.