Managing Content Sources



Content sources (also called "repositories") can be managed using the System Admin user interface. We define a "content source" as any repository or data set you wish to crawl--file systems, the web, SharePoint, Documentum, etc. Each source is set up in Aspire with information about:

  • where to find the source (URL)
  • how to connect to the source (including access credentials)
  • where to start the crawl (i.e., inclusion and exclusion information)
  • how often to crawl the source (done via automated scheduling)
  • where to send information that is crawled (i.e., generally to a search engine, but potentially to a processing pipeline for additional data manipulation prior to indexing)
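As a rough illustration of the information above, the definition of a content source can be thought of as a single record bundling location, credentials, scope, schedule, and routing. The field names below are purely illustrative and do not reflect Aspire's actual configuration schema:

```python
# Hypothetical sketch of what a content source definition carries.
# Field names and values are illustrative only, not Aspire's real schema.
content_source = {
    "name": "Marketing File Share",
    "url": "file:///mnt/shares/marketing",                     # where to find the source
    "credentials": {"user": "crawler", "password": "secret"},  # how to connect
    "include_patterns": ["*.pdf", "*.docx"],                   # where to start the crawl
    "exclude_patterns": ["*/archive/*"],                       # what to skip
    "schedule": "daily@02:00",                                 # how often to crawl
    "route_to": ["publisher-to-search-engine"],                # where to send crawled items
}
```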

One of the best things about Aspire is that changes can be made on the fly; it is very easy to accommodate new content sources, moves of data sources to different locations, modified crawling schedules, or other changes that happen in your environment. In most cases, these changes can be accomplished without having to re-index the entire data set, which can be very time-consuming for large repositories.

Before you add any Connector applications or set up any content sources on your Aspire system, you must first load the Content Source Manager application. The CS Manager maintains the crawl database of content sources and their properties, as well as the scheduler that initiates crawl jobs at the specified times and intervals.

When you create a new content source, you need to set up three main bodies of information (each found on a different tab on the setup page):

  • Basic Information
  • Connector Properties
  • Routing Table

For detailed information about what to enter for each connector type, see the Tutorial for that particular Connector.

Below are some general concepts to keep in mind.

Snapshot Directories

The snapshot directory for any content source you crawl (stored in the path you specify when you install the Content Source Manager application) holds the crawl information (by "Crawl ID"), including crawl statistics and any errors. Use a separate snapshot directory for each data source.

Snapshot files are what enable Aspire to keep track of what has changed since the last time a content source was crawled (enabling incremental crawls as well as full crawls).

Be aware that if you need to re-index only part of a data source, you can do so by manually deleting the corresponding part of a snapshot file and then performing an index "Update" rather than a full crawl.
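The incremental-crawl idea can be sketched as a comparison between the previous snapshot and the current state of the repository. This is a simplified illustration of the concept, not Aspire's actual snapshot format or implementation:

```python
def diff_against_snapshot(snapshot, current):
    """Compare a previous crawl snapshot against the current repository
    listing. Both arguments are dicts mapping item ID -> last-modified
    timestamp. Returns the items an incremental ("Update") crawl must
    process. (Simplified sketch; Aspire's real snapshot format differs.)"""
    added = [i for i in current if i not in snapshot]
    modified = [i for i in current if i in snapshot and current[i] != snapshot[i]]
    deleted = [i for i in snapshot if i not in current]
    return added, modified, deleted

# Deleting entries from the snapshot before running an Update forces those
# items to be re-processed as "added" -- the partial re-index trick above.
```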

Crawl Schedules


Each content source can have its own crawling schedule. This is set up on the Basic Information tab for the content source and is controlled by the CS Manager.

You can schedule a crawl to be run manually, periodically (every so many hours or minutes), daily at a particular time, or weekly on a specific day and time.

For a crawl to take place, make sure the Active flag is checked for the connector; otherwise, it won't run. Clearing this flag is a convenient way to prevent a source from being crawled while it is temporarily offline or down for maintenance.
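The schedule types described above (manual, periodic, daily, weekly), together with the Active flag, can be sketched as a next-run calculation. This is an illustrative model only; in Aspire the scheduling is configured through the UI, and the field names here are made up:

```python
from datetime import datetime, timedelta

def next_run(schedule, now):
    """Compute the next crawl start for the schedule types described above.
    (Illustrative sketch only; field names do not reflect Aspire's schema.)"""
    if not schedule.get("active", True):
        return None                              # Active flag cleared: never runs
    if schedule["type"] == "manual":
        return None                              # runs only when started by hand
    if schedule["type"] == "periodic":           # every N minutes
        return now + timedelta(minutes=schedule["every_minutes"])
    run = now.replace(hour=schedule["hour"], minute=schedule["minute"],
                      second=0, microsecond=0)
    if schedule["type"] == "daily":              # at a fixed time each day
        return run if run > now else run + timedelta(days=1)
    if schedule["type"] == "weekly":             # on a fixed weekday and time
        run += timedelta(days=(schedule["weekday"] - now.weekday()) % 7)
        return run if run > now else run + timedelta(days=7)
```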

Routing Tables

Routing tables are used to tell Aspire where to send the outputs of the current process. For all content sources you crawl, you will need to specify the name of the next application in the pipeline--typically a publisher application for indexing--to which the data should be sent.

If you have created your own custom applications (a Name extractor and a Location extractor, for example), those might come before the publisher in the routing table. The order of the routing table matters: jobs are processed in the order the entries appear, and each entry is identified by application name. You can reorder entries using the Up and Down arrows.

If you are operating in a distributed environment (Aspire installed on multiple servers within a cluster), the same application loaded onto two different nodes within an Aspire cluster is assumed to represent the same functionality, so a job can be sent to any machine in the cluster that has an application of that name. For more information, see Aspire Application Configuration.