Key Concepts



Overview

Aspire is a high-performance content processing system. It was designed to:

  • Acquire documents from multiple sources
  • Extract metadata and text from these documents
  • Cleanse, normalize, extract, and otherwise modify document metadata
  • Transform and send (aka "Publish") the documents and/or their metadata to multiple destinations (such as relational databases and search engines)

Aspire is a system of modular components, with each component performing a specific piece of functionality. The components as a whole represent the Aspire framework, which can be used by developers to combine components in different ways to perform specific processing tasks, or to develop entirely new components of their own.

The architecture of Aspire (based on OSGi) makes it easy to develop and share components and applications across computers, development teams, and organizations. New applications and software updates can be dynamically loaded into running Aspire systems with just a couple of clicks. This deployment model improves the usability and scalability of the system and simplifies administration, all without losing flexibility.

However, Aspire can also be used out-of-the-box by system administrators, with simple configuration and no programming. For these environments, Search Technologies has created "application bundles": prepackaged groups of Aspire components combined into an application, such as a Documentum connector (which extracts data from a Documentum repository) or a GSA publisher (which sends data to a GSA for indexing). A System Administration GUI makes it easy to install Aspire applications, manage content repositories, and manage an Aspire server.

Applications

"Applications", or "Apps", are (typically) large, chunky units of functionality in Aspire. For example, a SharePoint connector, an RDBMS connector, and a publisher to the Google Search Appliance (or Solr, or FAST, or...) are all "Apps" that can be installed into an Aspire server.

Any number of applications can be downloaded or configured and installed into an Aspire server. In this way, a single instance of Aspire can talk to multiple content sources, relational databases, and search engines all at the same time. Apps can be dynamically added, removed, or updated without having to shut down or restart the Aspire server.

Each App is described by a configuration file, the "application.xml" file, which describes all of the components, pipelines, and scripts that together implement the application. An App may also have supporting files, such as XSLT transforms (to transform internal metadata structures into XML required by search engine indexers, for example), static web pages, dictionaries, etc.

Applications in Aspire can be specified either on disk, typically as an "application.xml" file stored in the "config" directory, or bundled up into a JAR file and deployed to a Maven repository. Bundled Apps stored in Maven repositories are called, appropriately, "App Bundles". Once deployed to the repository, they can be easily shared across multiple Aspire servers - either within the same Aspire cluster, or across multiple and distributed installations.

Pipeline Processing

Document processing in Aspire is structured as a series of pipelines of document processing stages. Each pipeline stage does a relatively small, useful task, such as:

  • Fetch a document from an HTTP server or from the file system
  • Analyze the document for type and extract text from it
  • Choose which document date is the most accurate
  • Associate the document with metadata from an outside source
  • Transform the document (and its metadata) so that it can be indexed into a search engine.

and so on.

Pipeline stages are chained together into multiple pipelines. A document comes into the pipeline (from a feeder, see below) and is processed by each stage in turn until the pipeline is complete.
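
The idea of chaining small stages can be sketched in a few lines of Java. This is only an illustration of the pattern, not the Aspire API: the names SketchStage and SketchPipeline, the map-based document, and the three stand-in stages are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: one small processing step applied to a document's metadata.
interface SketchStage {
    void process(Map<String, String> doc);
}

// Hypothetical pipeline: an ordered list of stages.
class SketchPipeline {
    private final List<SketchStage> stages = new ArrayList<>();

    SketchPipeline add(SketchStage stage) {
        stages.add(stage);
        return this;
    }

    // Pass the document through every stage, in the order the stages were added.
    Map<String, String> run(Map<String, String> doc) {
        for (SketchStage stage : stages) {
            stage.process(doc);
        }
        return doc;
    }

    static Map<String, String> demo() {
        SketchPipeline pipeline = new SketchPipeline()
            .add(doc -> doc.put("content", "  Hello World  "))            // stand-in for "fetch"
            .add(doc -> doc.put("text", doc.get("content").trim()))       // stand-in for "extract text"
            .add(doc -> doc.put("title", doc.get("text").toUpperCase())); // stand-in for "transform"
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", "file:///tmp/example.txt");
        return pipeline.run(doc);
    }
}
```

Each stage only sees the document as the previous stage left it, which is what makes the stages easy to reorder and reuse.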

Jobs

A job is created for each document to be processed by Aspire. The job contains the document data as well as information about the job (what pipeline is currently executing the job, what sub-jobs the job has spawned, what errors have been received while processing the job, etc.).

Sub Jobs

Occasionally, a job will need to be split into multiple sub-jobs. This typically occurs for any document which contains other, smaller documents.

For example, a CSV file may be processed by Aspire, and each row of the CSV file is a separate document to be processed. The Aspire Tabular Files Extractor stage will automatically split the one CSV file into multiple sub-jobs, one for each line of the CSV file. These sub-jobs are then sent to other pipelines and will all be processed in parallel by multiple threads.

A second example is the XML Sub Job Extractor, which reads an XML file containing a list of records and then processes each individual record as a sub-job on its own pipeline.

Sub-jobs are connected to their parent job. The parent job cannot complete until all of its sub-jobs have completed.
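
The parent/sub-job relationship can be illustrated with standard Java concurrency classes. The sketch below is an assumption-laden stand-in, not how Aspire implements it: the class SubJobSketch, the method processCsv, and the "processed:" prefix are invented for this example. It splits a CSV document into one sub-job per line, runs the sub-jobs in parallel, and lets the parent finish only after every sub-job is done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: a parent job that spawns one sub-job per CSV row.
class SubJobSketch {
    static List<String> processCsv(String csv) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<CompletableFuture<String>> subJobs = new ArrayList<>();
            for (String row : csv.split("\n")) {
                // Each row becomes its own sub-job, processed on a pool thread.
                subJobs.add(CompletableFuture.supplyAsync(() -> "processed:" + row.trim(), pool));
            }
            // The parent job blocks here until all of its sub-jobs have completed.
            CompletableFuture.allOf(subJobs.toArray(new CompletableFuture[0])).join();

            List<String> results = new ArrayList<>();
            for (CompletableFuture<String> subJob : subJobs) {
                results.add(subJob.join()); // already complete, so this does not block
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, SubJobSketch.processCsv("a,1\nb,2") returns one result per row, in row order, regardless of which thread finished first.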

Pipeline Manager

The Pipeline Manager is responsible for managing the pipelines. A pipeline manager is the component which receives new jobs, distributes the jobs to multiple execution threads, and then manages those threads so that each job is sent to each pipeline stage in the pipeline.

The pipeline manager is also a Component Manager (see below), and so it can define and configure new components (typically pipeline stages).

Pipelines

The pipeline manager can manage multiple pipelines, each one identified by name. Each pipeline is made up of a single list of pipeline stages. Branching from one pipeline (i.e. one list of stages) to another is supported.
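
A rough sketch of this arrangement, assuming a fixed thread pool and string documents, might look as follows. The class PipelineManagerSketch and its methods are hypothetical names chosen for this example and do not reflect the real Aspire classes; branching between pipelines is not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

// Hypothetical sketch: named pipelines plus a pool of execution threads.
class PipelineManagerSketch {
    private final Map<String, List<UnaryOperator<String>>> pipelines = new HashMap<>();
    private final ExecutorService threads = Executors.newFixedThreadPool(4);

    // Register a pipeline (a single list of stages) under a name.
    void definePipeline(String name, List<UnaryOperator<String>> stages) {
        pipelines.put(name, stages);
    }

    // Hand a job to a worker thread, which runs it through each stage in order.
    Future<String> submit(String pipelineName, String document) {
        List<UnaryOperator<String>> stages = pipelines.get(pipelineName);
        if (stages == null) {
            throw new IllegalArgumentException("Unknown pipeline: " + pipelineName);
        }
        return threads.submit(() -> {
            String current = document;
            for (UnaryOperator<String> stage : stages) {
                current = stage.apply(current);
            }
            return current;
        });
    }

    void shutdown() {
        threads.shutdown();
    }

    static String demo() {
        PipelineManagerSketch manager = new PipelineManagerSketch();
        List<UnaryOperator<String>> stages = Arrays.asList(s -> s.trim(), s -> s.toLowerCase());
        manager.definePipeline("normalize", stages);
        try {
            return manager.submit("normalize", "  Hello Aspire ").get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            manager.shutdown();
        }
    }
}
```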

Feeders

Document feeders get the whole process started. A document feeder is responsible for identifying when a new document is available to be processed, gathering up the document information (usually into an XML structure which describes the document's metadata), and then sending that document as a new job to a pipeline manager.

Aspire currently has feeders which scan for new documents in hot folders, across a file system, in a list of URLs stored in an XML file, or in an RSS feed.

Typically, feeders are "pull" type feeders. That is, the feeder starts up periodically (the frequency is configurable) and polls the data source to locate new documents. All new documents are then packaged up as new jobs and sent to a pipeline manager to be processed.
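
A "pull" feeder of this kind can be approximated with a scheduled polling loop. The sketch below is a minimal illustration under stated assumptions: HotFolderFeederSketch, its jobSink callback (standing in for "send to a pipeline manager"), and the in-memory set of already-seen files are all invented here and are not Aspire's actual hot folder feeder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch of a pull-type feeder that watches one folder.
class HotFolderFeederSketch {
    private final Path folder;
    private final Consumer<Path> jobSink;           // stand-in for a pipeline manager
    private final Set<Path> seen = new HashSet<>(); // files already submitted as jobs

    HotFolderFeederSketch(Path folder, Consumer<Path> jobSink) {
        this.folder = folder;
        this.jobSink = jobSink;
    }

    // One polling pass: every file not seen before is submitted as a new job.
    synchronized int poll() {
        int submitted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (seen.add(file)) {
                    jobSink.accept(file);
                    submitted++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return submitted;
    }

    // Run poll() repeatedly; the interval is the configurable polling frequency.
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::poll, 0, intervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    // Returns {new files on 1st poll, on 2nd poll, on 3rd poll, total jobs submitted}.
    static int[] demo() {
        try {
            Path dir = Files.createTempDirectory("hot-folder");
            List<Path> jobs = new ArrayList<>();
            HotFolderFeederSketch feeder = new HotFolderFeederSketch(dir, jobs::add);
            Files.createFile(dir.resolve("a.txt"));
            int first = feeder.poll();
            int second = feeder.poll();
            Files.createFile(dir.resolve("b.txt"));
            int third = feeder.poll();
            return new int[] {first, second, third, jobs.size()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each feeder instance holds its own folder and its own schedule, several of them can run side by side with different intervals.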


Components

Aspire is a component-based system. All pipeline stages are independent components. All document feeders are also components.

Additional components which are not pipeline stages can also be created, for example to share common data objects across the system.

In Aspire, a "component" is actually a separately configured instance of a component. In the Java world, this means that each component corresponds to a Java object. These objects are expected to be fully thread-safe and able to process multiple documents simultaneously.

So, for example, you can have five different instances of a Hot Folder Feeder. In Aspire, this would be called five different "Hot Folder Components". Each Hot Folder component can read from a different hot folder at a different interval, and they all live quite happily in the same running installation of Aspire.

Component Jar Files = OSGi Bundles

The Java code which actually implements the algorithm for a component is compiled and gathered into a Java jar file called a 'bundle'.

Each bundle is a separate jar file. Bundles are stored in the "bundles/aspire" directory (for Aspire components) and the "bundles/system" directory (for OSGi services - see below).

Since version 0.4, Aspire can automatically download bundles from Maven repositories. These bundles are loaded directly from the machine's local repository (where artifacts are stored after being downloaded from remote repositories). This also allows multiple instances of Aspire on the same machine to share the same bundle code.

Component Factories

Each Aspire bundle is also a component factory. A component factory, as implied by its name, can manufacture multiple copies (or instances) of Aspire components, each one configured differently and used in different ways.

For example, one could have multiple "rdb" components to read from different relational databases, or multiple "push xml" components to push documents to multiple search engines from the same Aspire installation. The same document could, for example, be transformed twice and pushed first to a Google Appliance and then to a Solr server.

Component factories are referenced by names which are hard-coded into the bundle jar. These names are documented in this wiki. They are names like "aspire-rdb", which match the names of the JAR files and, coincidentally, the Maven artifact IDs of the components (if you're a Maven user).

Component Managers

Component factories (see previous section) are passive. Someone (or some thing) must ask them to create a new component when it is needed. This is the job of the Component Manager.

Component managers do two things:

  1. Load new bundles into Aspire
  2. Create components

Of course, the component managers do not create the components directly. Instead, they call the appropriate component factory (as specified by the configuration file) to do that for them.

And so, a component manager will read a list of components from a configuration file, determine the type of each component (i.e. what component factory should be used to create the component), and then send the component's configuration and the component's name to the factory to be created.
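
The division of labor between managers and factories can be sketched as below. Everything here is hypothetical and chosen for illustration: the class ComponentManagerSketch, its Factory interface, the component names, and the "url" configuration key are assumptions, and the loadFactory method merely stands in for loading a bundle. Only the factory name "aspire-rdb" comes from the text above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a manager that delegates component creation to named factories.
class ComponentManagerSketch {
    // A component is a separately configured instance with its own name.
    static final class Component {
        final String name;
        final String factoryName;
        final Map<String, String> config;

        Component(String name, String factoryName, Map<String, String> config) {
            this.name = name;
            this.factoryName = factoryName;
            this.config = config;
        }
    }

    // Factories are passive: they create a component only when asked.
    interface Factory {
        Component create(String name, Map<String, String> config);
    }

    private final Map<String, Factory> factories = new HashMap<>();
    private final Map<String, Component> components = new LinkedHashMap<>();

    // Step 1 (stand-in for loading a bundle): make a factory available by name.
    void loadFactory(String factoryName) {
        factories.put(factoryName, (name, config) -> new Component(name, factoryName, config));
    }

    // Step 2: look up the factory for the component's type and ask it to build the component.
    Component createComponent(String name, String factoryName, Map<String, String> config) {
        Factory factory = factories.get(factoryName);
        if (factory == null) {
            throw new IllegalArgumentException("No factory loaded for: " + factoryName);
        }
        Component component = factory.create(name, config);
        components.put(name, component);
        return component;
    }

    Component get(String name) {
        return components.get(name);
    }

    static ComponentManagerSketch demo() {
        ComponentManagerSketch manager = new ComponentManagerSketch();
        manager.loadFactory("aspire-rdb");
        // Two differently configured components manufactured by the same factory.
        manager.createComponent("ordersDb", "aspire-rdb",
            Collections.singletonMap("url", "jdbc:example:orders"));
        manager.createComponent("customersDb", "aspire-rdb",
            Collections.singletonMap("url", "jdbc:example:customers"));
        return manager;
    }
}
```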

Component Managers are also Components

Note that each component manager is also a component itself. This means it's possible for a component manager to include another component manager.

This allows for a hierarchical nesting of groups of components within Aspire.
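
This is the classic composite pattern: because a manager satisfies the same interface as the components it holds, managers can be nested to any depth. The sketch below uses invented names (NestedManagerSketch, Leaf, Manager, and the slash-separated paths) purely to illustrate the nesting; it makes no claim about how Aspire names or addresses its components.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of managers nested inside managers.
class NestedManagerSketch {
    interface Component {
        String name();
    }

    // An ordinary component, such as a feeder or a pipeline stage.
    static class Leaf implements Component {
        private final String name;
        Leaf(String name) { this.name = name; }
        public String name() { return name; }
    }

    // A manager is itself a Component, so it can be a child of another manager.
    static class Manager implements Component {
        private final String name;
        private final List<Component> children = new ArrayList<>();

        Manager(String name) { this.name = name; }
        public String name() { return name; }

        Manager add(Component child) {
            children.add(child);
            return this;
        }

        // Collect the full path of every non-manager component below this manager.
        List<String> paths(String prefix) {
            List<String> out = new ArrayList<>();
            String here = prefix + "/" + name;
            for (Component child : children) {
                if (child instanceof Manager) {
                    out.addAll(((Manager) child).paths(here));
                } else {
                    out.add(here + "/" + child.name());
                }
            }
            return out;
        }
    }

    static List<String> demo() {
        Manager root = new Manager("app")
            .add(new Leaf("feeder"))
            .add(new Manager("pipelines").add(new Leaf("fetch")).add(new Leaf("extract")));
        return root.paths("");
    }
}
```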

The Aspire Application is also a Component

The Aspire application itself is also a component. It is the "starter" component which initially reads the main configuration file(s) and creates component managers, one per configuration file, each of which then starts loading bundles and creating the components as required. It can also be requested to do this dynamically, as needed.

A Pipeline Manager is also a Component Manager

Each pipeline manager is also a component manager. This means that pipeline managers can also create new components.

Typically, the components that a pipeline manager creates are pipeline stages (every pipeline stage is also an Aspire component), which are then used in pipelines for document processing.

A Component Factory can Manufacture Multiple Types of Components

A single Java jar file, which is a single 'bundle' in Aspire, can actually contain the Java code for multiple types of components. This can help reduce the amount of Java code which has to be loaded into the system, because third party libraries can be shared among the component types bundled into the same jar file. It is also possible to put multiple different default configurations for a single component into the same jar, a feature which is rarely used as of now.

Therefore, each component to be configured must also specify a "sub type" in addition to a "component factory name". The sub type tells the component factory exactly which sub component of the jar to create. Inside the bundle there is a ComponentFactory.xml file which identifies the actual Java class (and default configuration) to be used for each sub type.
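
Conceptually, the factory keeps a table from sub type to Java class name and instantiates the matching class on request. The sketch below shows that lookup with plain Java reflection. It does not read ComponentFactory.xml (the file's format is not described here), and the class SubTypeFactorySketch, the sub type names "default" and "feeder", and the two nested component classes are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one factory (one jar) that can build several component types.
class SubTypeFactorySketch {
    interface Component {
        String describe();
    }

    static class DefaultStage implements Component {
        public String describe() { return "default stage"; }
    }

    static class FeederComponent implements Component {
        public String describe() { return "feeder"; }
    }

    // In a real bundle this table would come from configuration; here it is hard-coded.
    private final Map<String, String> classBySubType = new HashMap<>();

    SubTypeFactorySketch() {
        classBySubType.put("default", DefaultStage.class.getName());
        classBySubType.put("feeder", FeederComponent.class.getName());
    }

    // Resolve the sub type to a class name, then instantiate that class reflectively.
    Component create(String subType) {
        String className = classBySubType.get(subType);
        if (className == null) {
            throw new IllegalArgumentException("Unknown sub type: " + subType);
        }
        try {
            return (Component) Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot create component for sub type " + subType, e);
        }
    }
}
```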

OSGi and Apache Felix

Aspire is built around OSGi. OSGi is a standard for managing dynamic modules in Java. It allows for modules (packaged into Java JARs and called 'bundles') to be dynamically loaded and unloaded and linked to other modules.

Apache Felix is the implementation of OSGi for which Aspire was developed. Apache Felix provides user interfaces for managing bundles, including loading and unloading them.

Component Repositories

Component JAR files (aka bundles) can be loaded from two different types of repositories: distribution repositories (i.e. a directory in the distribution) or Maven repositories, from which they are downloaded directly.

Using the Maven repository option makes using Aspire especially easy. Simply specify a component factory with Maven coordinates (for example, "com.searchtechnologies:aspire-rdb:0.5-SNAPSHOT") and Aspire will automatically download the component from Maven, dynamically load it, and start using it.
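
To make the coordinate format concrete, the sketch below splits a "groupId:artifactId:version" string and derives the relative path a jar with those coordinates would normally have under the standard Maven repository layout. The class MavenCoordinatesSketch is a hypothetical helper written for this page; how Aspire itself resolves coordinates is not shown here, and snapshot artifacts fetched from remote repositories may also appear under timestamped file names.

```java
// Hypothetical helper: parses Maven coordinates of the form groupId:artifactId:version.
class MavenCoordinatesSketch {
    final String groupId;
    final String artifactId;
    final String version;

    private MavenCoordinatesSketch(String groupId, String artifactId, String version) {
        this.groupId = groupId;
        this.artifactId = artifactId;
        this.version = version;
    }

    static MavenCoordinatesSketch parse(String coordinates) {
        String[] parts = coordinates.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected groupId:artifactId:version but got: " + coordinates);
        }
        return new MavenCoordinatesSketch(parts[0], parts[1], parts[2]);
    }

    // Standard layout: group/as/directories/artifactId/version/artifactId-version.jar
    String repositoryPath() {
        return groupId.replace('.', '/') + "/" + artifactId + "/" + version + "/"
            + artifactId + "-" + version + ".jar";
    }
}
```

With the coordinates from the example above, repositoryPath() yields "com/searchtechnologies/aspire-rdb/0.5-SNAPSHOT/aspire-rdb-0.5-SNAPSHOT.jar", relative to the root of the repository.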