Introduction to Aspire Applications



Introduction

Application development in Aspire is all about configuring components and then stringing them together into pipelines.

A typical application will contain:

  • One or more feeders
    Feeders create jobs and send them to pipelines
  • One or more pipeline managers
    Pipeline Managers (PMs) receive jobs and process them through pipelines
  • Lots of components
    Components process jobs, doing something specific to the job
  • Some method for outputting the results
    This is a component too, but typically a component that writes to some external item, like a file, relational database, or search engine.
[Figure: AspireStandardApplication.png]

application.xml

An Aspire application is encapsulated in an XML configuration file. You will find an example of this file in your distribution, in $ASPIRE_HOME/config/application.xml. See Aspire Application Configuration for more information.

You can have multiple Application XML files

A single Aspire instance can load multiple application.xml files (with different names). All files can be active at the same time.

Multiple configuration files help make configuration control simpler. For example, you can have a different application.xml for every collection, which makes it easy to add collections of data to your system or remove them.

Application XML files are simply lists of components

Basically, an application.xml file holds a simple list of components. Each component will have configuration data which is used to initialize the component and control what it does.

Some components are feeders that produce jobs. And of course, some components are pipeline managers, which themselves have nested components.

Application XML files can be stored and downloaded from Maven repositories

When this happens, the application is called an "App Bundle."

App Bundles are a convenient method for sharing applications across many Aspire instances, spread across a network or spread across the world.

Typical layout of an Application

An application, as specified in an application.xml file, typically has the following structure:

<application name="MyAppName">
  <components>
    <!-- Feeder Components -->
    <component> . . . </component>
    <component> . . . </component>
    .
    .
    .

    <!-- Pipeline Managers -->
    <component name="..." subType="pipeline" factoryName="aspire-application">
      <pipelines>
        <pipeline>
          <stage .../>
          <stage .../>
          <stage .../>
          .
          .
          .
        </pipeline>
      </pipelines>

      <components>
        <!-- Components used by and nested within the pipeline manager -->
        <component> . . . </component>
        <component> . . . </component>
        .
        .
        .
      </components>
    </component>


    <!-- More pipeline managers (or other components) usually go here -->
    .
    .
    .
  </components>
</application>

The Pipeline Manager

As you might expect, the Pipeline Manager (PM) plays a pivotal role in Aspire.

Pipelines are sequences of components. PMs receive jobs and process them through pipelines. There are two ways a PM can process a job:

  • process() - Synchronous job processing
    When a PM "processes" a job, the job is processed immediately. The thread that calls process() is the one that carries the job through each component in the pipeline.
  • enqueue() - Asynchronous job processing
    Jobs that are enqueue()'d are placed on an input queue to be processed by the PM at a future time (or right away, if a thread is available).

Of these two methods, enqueue() is used the most. To manage enqueue(), the PM maintains two structures: the job queue and the thread pool.
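The difference between the two calls can be sketched in a few lines of Python. This is only a toy model: Aspire itself is Java, and the class name, constructor arguments, and the wait_until_idle() helper below are hypothetical, not Aspire's API. For brevity the sketch starts a fixed set of worker threads rather than creating them on demand.

```python
import queue
import threading


class PipelineManagerSketch:
    """Toy model of a PM (hypothetical; not the Aspire API)."""

    def __init__(self, stages, queue_size=30, max_threads=4):
        self.stages = stages                         # callables applied in order
        self.jobs = queue.Queue(maxsize=queue_size)  # the input job queue
        for _ in range(max_threads):
            threading.Thread(target=self._work, daemon=True).start()

    def process(self, job):
        # Synchronous: the caller's own thread carries the job through every stage.
        for stage in self.stages:
            stage(job)
        return job

    def enqueue(self, job, timeout=5.0):
        # Asynchronous: park the job on the queue; a pool thread picks it up later.
        self.jobs.put(job, timeout=timeout)

    def _work(self):
        while True:
            job = self.jobs.get()
            try:
                self.process(job)
            finally:
                self.jobs.task_done()

    def wait_until_idle(self):
        # Blocks until every enqueued job has been processed.
        self.jobs.join()


def record_thread(job):
    # Example stage: remember which thread did the work.
    job["thread"] = threading.current_thread().name


pm = PipelineManagerSketch([record_thread], queue_size=5, max_threads=2)

sync_job = pm.process({"id": 1})   # runs on the calling thread
async_job = {"id": 2}
pm.enqueue(async_job)              # runs on one of the PM's worker threads
pm.wait_until_idle()
```

After this runs, sync_job["thread"] names the calling thread, while async_job["thread"] names one of the worker threads.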

The Job Queue

Every PM maintains an input queue of jobs. The maximum size of this queue can be set with the <queueSize> parameter, e.g., <queueSize>30</queueSize>.

If the queue is full, the feeder that is submitting the job will be blocked. If the queue remains full, after a timeout period, an exception will be thrown to the feeder.
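The block-then-fail behavior described above is the same pattern that a bounded queue with a timed put provides. The following Python sketch illustrates it with the standard library. The submit() helper, the timeout value, and the choice of RuntimeError are assumptions made for illustration and say nothing about the exception Aspire actually throws.

```python
import queue


def submit(job_queue, job, timeout_seconds):
    """Block while the queue is full; raise if it stays full past the timeout."""
    try:
        job_queue.put(job, block=True, timeout=timeout_seconds)
    except queue.Full:
        raise RuntimeError(
            f"queue still full after {timeout_seconds}s; job {job!r} rejected"
        )


# A queue of size 2 plays the role of <queueSize>2</queueSize>.
job_queue = queue.Queue(maxsize=2)
submit(job_queue, "job-1", 0.1)
submit(job_queue, "job-2", 0.1)   # the queue is now full

# With nothing draining the queue, a third submit() blocks for 0.1 seconds
# and then raises RuntimeError to the caller (the "feeder").
```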

The Thread Pool

Every PM maintains a thread pool. Threads will be created as necessary to process jobs on the Job Queue, and then will be shut down if they are idle for a timeout period.

The PM specifies a maximum number of threads. This maximum can be set with the <maxThreads> parameter.

The maximum number of threads can also be dynamically adjusted on the user interface.

Job Branching and Routing

There are three different ways in which a job can move around the system.

  1. "Normal Flow" - From pipeline stage to pipeline stage
    A pipeline is a sequence of stages managed by the PM. Once a job is submitted to the pipeline, the PM will automatically send the job to each stage in turn.
  2. "Branching" - From one pipeline to another
    Jobs can be branched from one pipeline to another with "branches." Branches are all handled by the Branch Handler, which specifies the destination PM and the pipeline (within the named pipeline manager) to which the job will be branched.
    Pipeline Branching occurs when a job has some event (such as "onComplete" or "onError"). These are defined in the <pipeline> tag of the PM.
    Sub Job Branching occurs when sub-jobs are created and branched to pipelines. These are defined as part of the Sub Job extractor component.
  3. "Routing" - Dynamic routes attached to a job
    Routing tables can be dynamically generated and attached to jobs. This is unlike branches, which are specified in the XML file.
    Routing also occurs at a higher level than branching. Once a job is routed to a PM, the PM takes over and is in full control of the job, which may be branched around using the Branch Handler any number of times. Only once the job is completely done with the PM, i.e., when it is "complete", is it then routed to the next PM in the routing table.
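The three kinds of movement can be modeled in a short Python sketch. Everything here is hypothetical (the dictionary layout, the function names, the PM and pipeline names) and is only meant to show the layering: stages run in order, a static table decides branches on events, and a route carried by the job decides which PM comes next. To stay short, the sketch only branches between pipelines of the same PM, although a real branch can also name a different destination PM.

```python
def run_pipeline(stages, job):
    """Normal flow: pass the job through each stage in turn and report the outcome."""
    try:
        for stage in stages:
            stage(job)
        return "onComplete"
    except Exception as err:
        job["error"] = str(err)
        return "onError"


def run_in_pm(pm_name, pm, job, start="main"):
    """Branching: follow the PM's static event table until no branch matches."""
    pipeline_name = start
    while pipeline_name is not None:
        pipeline = pm[pipeline_name]
        job["trace"].append(f"{pm_name}/{pipeline_name}")
        event = run_pipeline(pipeline["stages"], job)
        pipeline_name = pipeline["branches"].get(event)


def follow_route(pms, job):
    """Routing: the job carries its own list of PMs and visits them in order."""
    for pm_name in job["route"]:
        run_in_pm(pm_name, pms[pm_name], job)


def fetch(job):
    job["content"] = b"raw bytes"

def extract_text(job):
    job["text"] = job["content"].decode("utf-8")

def index(job):
    job["indexed"] = True


# Static configuration: the counterpart of what the XML file would hold.
PMS = {
    "fetchPM": {
        "main": {"stages": [fetch], "branches": {"onComplete": "extract"}},
        "extract": {"stages": [extract_text], "branches": {}},
    },
    "indexPM": {
        "main": {"stages": [index], "branches": {}},
    },
}

# Dynamic part: the route is attached to the job itself.
job = {"route": ["fetchPM", "indexPM"], "trace": []}
follow_route(PMS, job)
```

The job's trace ends up as fetchPM/main, fetchPM/extract, indexPM/main: the branch is resolved inside fetchPM before the route hands the job to indexPM.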

Parent Jobs and Sub-jobs

Perhaps the most powerful aspect of Aspire is its ability to create sub-jobs from parent jobs. Once one understands how this works, it opens up endless possibilities.

Let's start with a few examples.

Example 1: Processing a directory of files

JOB: Use Feed One to initiate the job.

  • Parent job holds the path of the directory to scan.
  • Scan Directory is used to create sub-jobs for each file in the directory.
SUB-JOB: One sub-job for every file
  • Fetch URL to fetch the content of the document.
  • Extract Text to extract text from each document
  • Post XML to send the document and its text to the search engine.


Example 2: Processing Wikipedia Dump Files

See the Wikipedia blog entry for a complete description.

JOB: Use Feed One to initiate the job.

SUB-JOB: One sub-job for every BZIP2 compressed XML file.
SUB-SUB-JOB: Processes each Wikipedia page
  • Groovy Scripting to terminate pages which are redirects, to extract categories, to identify disambiguation pages, and to clean up the static document teasers.
  • Post XML to send the document and its text to the search engine.


Example 3: Processing 70 Million Database Records

Processing large numbers of records must be done in batches, otherwise the database may be locked for long periods of time, preventing anyone else from using it.

JOB: Use Feed One to initiate the job.

  • Groovy Scripting to create 10,000 batches, each with an ID from 0 to 9,999. Each batch is submitted as a sub-job.
SUB-JOB: One sub-job for each batch of records.
  • RDB Sub Job Feeder to select out all records for the specified batch. This is done by looking for all records where the [record ID] modulo [10,000] is equal to the [batch ID].
  • The RDB sub job feeder will submit each individual record as a sub-sub-job.
SUB-SUB-JOB: Processes each individual record
  • Post XML to send the document and its text to the search engine.
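The modulo rule in Example 3 can be checked with a few lines of Python. With 70 million records and 10,000 batches, each batch holds roughly 7,000 records if the IDs are evenly distributed. The sketch below uses much smaller, made-up numbers (20 records, 4 batches), and the function name is hypothetical.

```python
def assign_batches(record_ids, batch_count):
    """Group record IDs by (record ID modulo batch_count)."""
    batches = {batch_id: [] for batch_id in range(batch_count)}
    for record_id in record_ids:
        batches[record_id % batch_count].append(record_id)
    return batches


# Tiny stand-in for the real numbers: 20 records, 4 batches.
batches = assign_batches(range(1, 21), batch_count=4)

for batch_id, ids in batches.items():
    print(batch_id, ids)
```

Every record lands in exactly one batch, so the sub-jobs together cover the whole table without overlap.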

Multiple Pipeline Managers

Best practice in Aspire is to create a separate Pipeline Manager (PM) every time you create sub-jobs from a parent job. For example, in the Wikipedia example above, there would be three PMs:

  • One to handle the parent job (initiate the index run)
  • One to handle each BZIP2 file
  • One to handle each individual Wikipedia page.

Why so many Pipeline Managers? Why not just one to do everything?

The issue is thread starvation. Suppose you had a single pool of (say) 10 threads in a single PM. If you are processing more than 10 BZIP2 files, all of the threads may be used up processing these files. This would leave no threads to process the actual Wikipedia pages, and the system would grind to a halt.

Using a separate PM for each level of job neatly avoids this issue. Since each PM has its own thread pool, parent jobs can never use up all of the threads and leave nothing for the sub-jobs. Thread pools for different levels of jobs are kept separate, which ensures high performance even with very complex structures.

Sub Job Extractors

The common denominator in all of the examples above is that they all contain "sub job extractors". These are components which divide up a larger job into smaller pieces, and then spawn off separate sub-jobs for each piece.

Every Sub Job Extractor will be configured with a Branch Handler, which specifies where the sub-jobs should be sent after they have been created. Note that the Branch Handler can also branch jobs to remote servers and can also combine jobs into batches.

Some of the more useful Sub Job Extractors include:

  • XML Sub Job Extractor - Assumes InputStream is XML and splits it into multiple sub-jobs. Every tag underneath the root tag becomes a separate sub-job.
  • Tabular Files Extractor - Assumes InputStream is a CSV or tab-delimited file. Every row of the file becomes a new sub-job.
  • RDB Sub Job Feeder - Executes a SQL select statement (which can have substitutable parameters that are filled in with job metadata) and submits all of the selected records as separate sub-jobs.
  • Scan Directory - Scans through a directory and submits all of the files as separate sub-jobs. Can also do recursive directory scans including sub-folders.
  • Groovy Scripting - Can be used to do any sort of loop which creates sub-jobs and branches them.
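As an illustration of what a sub job extractor does, the following Python sketch mimics the rule stated for the XML Sub Job Extractor: every tag directly underneath the root becomes a separate sub-job. The function name and the dictionary used to represent a sub-job are hypothetical. The sketch also parses the whole document in memory, which a real extractor working on large streams would likely avoid.

```python
import io
import xml.etree.ElementTree as ET


def extract_sub_jobs(stream):
    """Yield one sub-job (here a plain dict) per element directly under the root."""
    root = ET.parse(stream).getroot()
    for child in root:
        yield {
            "tag": child.tag,
            "xml": ET.tostring(child, encoding="unicode").strip(),
        }


stream = io.BytesIO(b"<docs><doc id='1'>A</doc><doc id='2'>B</doc></docs>")
sub_jobs = list(extract_sub_jobs(stream))
```

Here the two doc elements produce two sub-jobs, each carrying its own fragment of XML. In Aspire, a Branch Handler would then decide where each sub-job is sent.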

All of the scanners (see below) are, technically, Sub Job Extractors as well.

Feeders, Scanners, and the Scheduler

Feeders generate jobs. All Aspire applications will have feeders of some sort or another.

There are two types of feeders:

  • Pull Feeders
  • Push Feeders

Recommendation: Feed One is always a good place to start, even if you only end up using it for debugging.

The Scheduler has a stored list of jobs, and submits them to a processing pipeline following a given schedule.

Scanners are like feeders in that they scan through external servers and create jobs, but they operate at a higher level.

Pull Feeders

Pull feeders poll external databases for things to process, and then pull those items into the system and submit them to pipelines as jobs.

Pull feeders include any of the following components:

  • Hot Folder Feeder - Scans a 'hot folder' for new files to process.
  • File System Feeder - Scans an entire file system (including, optionally, nested folders) for all files and submits all files to a pipeline.
  • Single Page Feeder - Scans through a list of URLs and submits them to a pipeline.
  • RDB Feeder - Reads records from a relational database and submits each row to a pipeline.
  • RSS Feeder - Reads records from an RSS feed, identifies new submissions and submits them to a pipeline.

Push Feeders

Push feeders are passive. They respond to outside events from external agents who push new jobs to Aspire.

  • Feed One - Responds to requests from the administrator via the Debug console. When the Administrator clicks "submit," it creates a job and submits it down the pipeline.
  • HTTP Feeder - Receives RESTful HTTP requests (URLs with parameters) and submits those requests as jobs to a pipeline. Typically returns XML to the calling program.
  • JMS Message Feeder - Receives jobs via a JMS Queue.
  • Aspire LDAP Proxy - Receives requests from LDAP clients. Authentication requests are usually handled by the LDAP proxy directly (it proxies the request to an actual LDAP server), but group expansion requests will create jobs and submit those jobs down the pipeline.

And... The Scheduler

The Scheduler also generates jobs like a feeder, but it is not based on an external event. Instead, it loads a processing schedule and submits jobs at particular intervals or particular times of the day or week.

Although similar to a feeder, the Scheduler is a special type of component. It is installed apart from any pipelines and “lives” from when Aspire starts up until system shutdown. Normally you use it together with scanners to schedule periodic scans of a repository.

Scanners

Scanners typically receive jobs from the Scheduler. Each job includes all of the information needed to describe the content source (server, username, password, directory, etc.). Scanners can be used as components in your own application as long as you provide all of the necessary information as a job that you send to the scanner.

Scanners available include:

Pipeline Stages

Every pipeline Stage is an Aspire Component. The only difference is that stages have a process(job) method implemented that can process jobs when called upon by the Pipeline Manager.

The most useful pipeline stage is the Groovy Scripting stage. This stage can perform just about any metadata manipulation or I/O function that you might require. Very often, Aspire components start off as Groovy scripting stages and then are migrated to full Java components.

Processors That Open InputStreams

Some processors open up input streams to external content. These input streams are actual Java InputStream objects from which bytes can be read.

Once the InputStream is open, a later stage can read data from the input stream and process it.

  • Fetch URL - Opens a connection to a URL (any type) and creates an InputStream for the content.
  • Storage Handler - Opens input streams to files on disk (in addition to many other functions). Typically used to implement file-system repositories.

Processors That Read InputStreams

Once an InputStream is open, you can do a number of different things with it.

  • Extract Text - Extracts text from an InputStream.
  • XML Loader - Assumes that the stream is XML. Reads all of the XML from the stream and loads it into the job's AspireObject so it is available for later stages.
  • BZip2 Decompress Stream - Decompresses an InputStream.
    It is expected that more compression / decompression components will be created in the future.
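The hand-off between a stage that opens a stream and later stages that read it can be sketched in Python with in-memory streams. The stage functions and the dictionary used as a job are hypothetical stand-ins. Aspire's stages work on Java InputStream objects, which this sketch only imitates with Python file objects.

```python
import bz2
import io


def open_stream(raw_bytes):
    """Stand-in for a stage like Fetch URL: hand later stages an open byte stream."""
    return io.BytesIO(raw_bytes)


def decompress_stage(job):
    # Wrap the stream instead of reading it: bytes are only decompressed
    # when a later stage pulls them.
    job["stream"] = bz2.BZ2File(job["stream"])


def load_text_stage(job):
    # Read everything from whatever stream the earlier stages left on the job.
    job["text"] = job["stream"].read().decode("utf-8")


job = {"stream": open_stream(bz2.compress(b"<page>Hello</page>"))}
for stage in (decompress_stage, load_text_stage):
    stage(job)
```

Because each stage only replaces or consumes the stream on the job, a decompression stage can be added or removed without changing the stages around it.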

Sub Job Extractors That Read InputStreams

  • XML Sub Job Extractor - Assumes InputStream is XML and splits it into multiple sub-jobs. Every tag underneath the root tag becomes a separate sub-job.
  • Tabular Files Extractor - Assumes InputStream is a CSV or tab-delimited file. Every row of the file becomes a new sub-job.

Processors That Process Content

By far the most important content processor is the Groovy Stage, which can do just about anything, based on a script which you write directly into the Application XML file.

Other content processors include:

  • Extract Domain - Extracts the domain name from a URL.
  • Tagger - Categorizes documents based on words and phrases they contain

Services Components

Some components do not process jobs, but instead provide a service, which is used by other components.

Sometimes the service is simply to make certain Java classes available to components that need them. This is primarily the case for the aspire-lucene component, for example. Services in this category include:

And in other cases, the service may be to create a pool of resources that can be drawn upon as needed. This is the case for the RDBMS Connection component, which maintains a pool of open relational database connections. Services in this category include:

  • RDBMS Connection - Relational database connection pool. Connection information is configured with the component.
  • Apache Derby Embedded Database - Embedded Apache Derby database. Maintains connections to an in-memory or on-disk Apache Derby RDBMS.
  • Multi RDBMS Connection Pool - An RDBMS connection pool which can handle any number of connections to servers. Connection information is provided whenever a connection is required.
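The idea of a service that keeps a pool of open connections for other components to draw upon can be illustrated with a small Python sketch. It uses an in-memory SQLite database as a stand-in for a real RDBMS, and the class name, method name, and timeout are hypothetical, not the configuration or API of the RDBMS Connection component.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPoolSketch:
    """Keeps a fixed number of open connections and lends them out on request."""

    def __init__(self, size, database=":memory:"):
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a pooled connection be used by any thread.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # waits if every connection is in use
        try:
            yield conn
        finally:
            self._idle.put(conn)                # hand it back instead of closing it

    def idle_count(self):
        return self._idle.qsize()


pool = ConnectionPoolSketch(size=2)
with pool.connection() as conn:
    row = conn.execute("SELECT 1").fetchone()
    idle_while_borrowed = pool.idle_count()
```

While the connection is borrowed, one of the two connections is idle. Once the with block ends, both are back in the pool and ready for the next component that needs one.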