For Information on Aspire 3.0 Click Here
--- Notice ---
Aspire 3.0, the latest version of Aspire, has been recently released with a new associated wiki. Please click on the banner above for more information.
What is Aspire?
Aspire is a framework and a set of libraries of extensible components for building solutions that acquire data from one or more content repositories (such as file systems, relational databases, cloud storage, or content management systems), extract metadata and text from the documents, analyze, modify, and enhance the content and metadata if needed, and then publish each document, together with its metadata, to a search engine or other target application.
Aspire uses Apache Felix (an open source implementation of OSGi) to install, start, stop, update, and uninstall Aspire components and applications without requiring a reboot, supporting improved uptime and making system administration easier. Each individual piece of processing functionality within Aspire is a modular component that can be used by itself, or in conjunction with other components to create an Aspire application.
What is Aspire used for?
Aspire is used in many types of customer applications. Here are some examples:
- Enterprise search to enrich content with additional metadata to support advanced navigation
- Staffing and recruitment to provide search and match solutions between candidate CVs and job descriptions
- State government information site to extract metadata from OCR files and normalize the data prior to indexing
- Records management to automatically categorize corporate data as it is migrated into SharePoint where content needs to be aggregated and categorized before searching
- Legal research to find and analyze content for forward and reverse citations to other content to improve recall and analysis
- Company intranet to automatically create enterprise-wide sitemaps for browsing style investigation
- Federal government information site to intelligently split up large single files pertaining to laws into searchable chapters and clauses
- Basic content access (connector) to one or more content repositories for search engines
- Analyzing and grouping content geospatially for localization
Aspire is extremely flexible. By pulling the data processing pipelines out of the search engine, Aspire can manipulate content and metadata more powerfully and efficiently, process it in multiple pipelines simultaneously (and over multiple machines) for higher performance, and then feed it to one or more engines for indexing.
The Aspire framework supports creating Natural Language Processing (NLP), Machine Learning, and other analytic processing for text through a rich set of basic components. More detailed descriptions can be found on this page: Natural Language Processing (NLP)
If you want to start using Aspire, see here.
The administrator is responsible for installing, configuring, and maintaining Aspire deployments. Aspire deployments are managed through a web-based, point-and-click interface, the same one used by the Aspire developer. However, an administrator is expected to need only to fill in configuration information. Depending on your environment, you may wish to have a single Aspire system administrator, or you may wish to have several, each responsible for different content sources.
The System Administration UI has the following main functions.
Content Source administration functions include:
- Configuring properties for Aspire connectors
- Setting up crawling schedules for repositories
- Managing full and incremental crawls
- Managing security
- Monitoring system health and performance
- Monitoring crawl statistics and performance
- Index Auditing
Administration UI Security:
Document Level Security:
- Solr Document security filtering
- Google Search Appliance filtering: any Aspire connector that provides the document ACLs is sufficient, and normal GSA filtering works.
- SharePoint 2013 filtering (on premise)
The release of Aspire 2 included some major changes in administration. If you are administering Aspire 2, click here for more in-depth information.
If you are administering Aspire 1.x, click here for more in-depth information.
Aspire deployments are dynamically built from components and subcomponents. Aspire also includes the concept of “application bundles,” which are essentially groups of components pre-packaged to perform a specific function, with embedded files that define their look and feel within the Aspire Administration UI. System developers can easily combine components in various ways to process data according to the needs of the application.
Standard Aspire components can be mixed with custom third-party components and with new components. The high-level developer’s view of Aspire processing control is based on three major component types:
- Component Managers
- Pipeline Managers
- Tokenization Manager
If you are developing for Aspire 2, click here for more in-depth information.
If you are developing for Aspire 1.x, click here for more in-depth information.
Aspire Community vs. Enterprise Distributions
- Performance and reliability
- Ease of administration
- Making dynamic (on-the-fly) configuration changes
- Dynamically adding new components
- Dynamic refresh of component code
- Rich built-in XML processing methods including XPath and XSLT
- Hierarchical component configuration
- Rich and comprehensive web-based administration and control interface
- A strong developer environment
- Intuitive workflow interface
- Supports processing content in diverse languages
- Easy mapping of document fields to search fields
- Rich built-in JSON and XML processing methods, including XPath, XSLT
- Use of scripting to build complex processing components
- Hierarchical component configuration
- Tightly integrated with Maven repositories for sharing and loading component code
- Sharing and loading component code
- Process streams of tokens, for performing text analytics
- Entity extraction
- Latent Semantic Analysis
- Document vector creation and comparison
- Topic Analysis
- Support for security
- Handle Proxy LDAP requests, including:
- Authenticating users
- Determining user group membership across a multitude of systems
- Support to Federate search requests
- Distribute queries to multiple search engines
- Merge search results
- Support for Hadoop
- Ability to write to HDFS
- Ability to include Aspire within Map/Reduce jobs
Structure of an Aspire Solution
Aspire deployments can be divided into three high-level functional areas: content access, content processing, and publishing.
- Content access fetches the documents and associated metadata from the content repositories. The applications that perform this function are called Aspire Connectors. These use the supported application programming interfaces (APIs) of target repositories to access content, metadata, and security credentials. Where available, Aspire connectors capture the full directory structure from the repository, to support browsable enterprise site maps.
- Content processing analyses, augments, and transforms content. Depending on the needs of the application, this can range from simple use of regular expressions to a wide range of complex semantic and statistical processing techniques. Content processing can spawn Hadoop Map/Reduce jobs for large processing tasks.
- Publishing refers to the components in an Aspire deployment that are responsible for pushing the processed text from the content processing pipeline(s) to the target system, typically a search engine or file directory, in the correct form, and where available using the search engine’s ingestion API. The applications that perform this function are called Aspire Publishers. XML and JSON output is also available.
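The three areas above can be pictured as a simple chain. The following is only an illustrative sketch in Python and is not Aspire code or its API: the names `Document`, `connector`, `extract_year`, `normalize_whitespace`, `publisher`, and `run` are all hypothetical, the "repository" is an in-memory dictionary, and the "target system" is a list of JSON strings.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Document:
    """A document moving through the sketch: text plus metadata (hypothetical type)."""
    doc_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def connector(repository: Dict[str, str]) -> Iterator[Document]:
    """Content access: fetch every item from an in-memory stand-in for a repository."""
    for doc_id, text in repository.items():
        yield Document(doc_id=doc_id, text=text, metadata={"source": "memory"})


def normalize_whitespace(doc: Document) -> Document:
    """Content processing stage: collapse runs of whitespace into single spaces."""
    doc.text = " ".join(doc.text.split())
    return doc


def extract_year(doc: Document) -> Document:
    """Content processing stage: a simple regular expression that adds metadata."""
    match = re.search(r"\b(19|20)\d{2}\b", doc.text)
    if match:
        doc.metadata["year"] = match.group(0)
    return doc


def publisher(docs: Iterable[Document]) -> List[str]:
    """Publishing: serialize each processed document as JSON for a target system."""
    return [
        json.dumps({"id": d.doc_id, "text": d.text, "metadata": d.metadata}, sort_keys=True)
        for d in docs
    ]


def run(repository: Dict[str, str], stages: List[Callable[[Document], Document]]) -> List[str]:
    """Chain content access, the processing stages in order, and publishing."""
    def process(doc: Document) -> Document:
        for stage in stages:
            doc = stage(doc)
        return doc

    return publisher(process(doc) for doc in connector(repository))
```

Calling `run({"a": "Filed   in 2014"}, [normalize_whitespace, extract_year])` would return one JSON record whose metadata includes the extracted year. In a real deployment each stage would be a configured Aspire component rather than a Python function.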
Functional Component Hierarchy
- Component - atomic piece of Aspire logic
- Configurable Component - single component wrapped with a dxf so it can be used with the Admin UI
- Application or Application Bundle - multiple components wrapped with a dxf (and possibly configuration files)
Aspire core releases are given version numbers to help identify what software an Aspire solution is built upon. The leftmost digit is the major version, which is reserved to denote the overall architecture. The second digit is the minor version and denotes a release with new features. The third digit, if present, is the stability release version and denotes a release with multiple bug fixes. In rare cases there can be a fourth digit, used when it is necessary to release a version with one or just a few bug fixes.
Currently the version numbers for Aspire connectors are the same as the major and minor releases of Aspire core. For example, the current Jive connector and Aspire core are both 2.1. Over time the version numbers after the major digit can diverge. With the release of Aspire 2.0, the version dependencies between Aspire core and connectors and publishers have been eliminated. This allows Search Technologies to release new versions of connectors or publishers between Aspire core releases. The major version number must always match.
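As a rough illustration of the numbering scheme and the major-version rule, here is a minimal Python sketch. It is not part of Aspire: `AspireVersion`, `parse_version`, and `is_compatible` are hypothetical names, the field name `patch` for the rare fourth digit is our own label, and the sketch assumes purely numeric, dot-separated version strings.

```python
from typing import NamedTuple, Optional


class AspireVersion(NamedTuple):
    """Hypothetical holder for the up-to-four-part version number described above."""
    major: int                       # overall architecture
    minor: int = 0                   # release with new features
    stability: Optional[int] = None  # release with multiple bug fixes, if present
    patch: Optional[int] = None      # rare fourth digit: one or just a few bug fixes


def parse_version(text: str) -> AspireVersion:
    """Parse a numeric version string such as '2.1' or '2.1.3.1'."""
    parts = [int(piece) for piece in text.strip().split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 numeric parts, got {text!r}")
    return AspireVersion(*parts)


def is_compatible(core: str, component: str) -> bool:
    """A connector or publisher matches a core release when the major versions agree."""
    return parse_version(core).major == parse_version(component).major
```

Under these assumptions, `is_compatible("2.1", "2.3.1")` is true because both share major version 2, while `is_compatible("1.2", "2.0")` is false. A wildcard string such as `1.x` is not handled and would raise a `ValueError`.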
For Downloading Aspire 2.0 please follow this link:
For Aspire 1.X:
Version Specific Information
This section has links to the detailed information for managing and developing for each major Aspire release. The release notes for each Aspire version can be accessed by clicking here. There have been two major releases of Aspire. The release of Aspire 2 included changes significant enough that we decided to create its own branch of the wiki while maintaining the Aspire 1.x branch. The two links below allow you to navigate to these branches.