Single Page Feeder 0.4

Single Page Feeder
Description: Periodically feeds single pages of content. The URLs of the pages to feed are held in the CCD. As each page is fed, the feeder checks whether it has fed to the same domain within a configurable time period (domainWait). If it has, the page is put into a queue to be fed later; otherwise the page is fed immediately. Once all pages from the CCD have been checked (and possibly fed), the feeder processes the queue of remaining items until all items have been fed.
Inputs: URLs from the CCD
Outputs: An AspireDocument object containing the <url>, published to the configured pipeline manager.
Factory: aspire-singlepagefeeder (previously aspire.SinglePageFeeder)
Sub Type: default
Object Type: Produces AspireDocument objects.

Other Notes

  • The feeder does not fetch the URLs itself; fetching is performed in the pipeline
  • This feeder is based on the Simple Feeder


Configuration

This feeder takes all parameters from the Simple Feeder plus the following:

Element      Type    Default          Description
ccdLocation  string  ccd              The location within the system of the content control database (CCD).
feederLabel  string  CrawlSinglePage  The feeder label submitted in the <feederLabel> of the published document and used when querying the CCD.
domainWait   int     3000 (= 3 s)     The minimum number of milliseconds between publishing URLs from the same domain. Can be set to 0 to feed files as fast as possible.
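
As a minimal sketch of the parameters above, the following configuration states ccdLocation and feederLabel explicitly and sets domainWait to 0 so that URLs from the same domain are not held back. The component name, CCD path and pipeline manager name are copied from the examples on this page and are assumptions about the installation, not required values.

```xml
<component name="SinglePageFeeder" subType="default" factoryName="aspire-singlepagefeeder">
  <config>
    <!-- Assumed CCD path, taken from the examples on this page -->
    <ccdLocation>/systemCommon/ccd</ccdLocation>
    <!-- The documented default label, written out explicitly -->
    <feederLabel>CrawlSinglePage</feederLabel>
    <!-- 0 ms: no minimum wait between URLs from the same domain -->
    <domainWait>0</domainWait>
    <branches>
      <!-- Assumed pipeline manager name, taken from the examples on this page -->
      <branch event="onPublish" pipelineManager="standard-pipe-manager" />
    </branches>
  </config>
</component>
```

Setting domainWait to 0 removes the per-domain delay, so it is probably best kept for servers that can tolerate rapid successive requests.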


Metadata Mapper Configuration

The single page feeder maps some metadata fields to fields in the AspireDocument XML.

Field     Default Output Field  Description
feedDate  feedDate              The time when the URL is published.
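
The complex example on this page uses a <metadataMap> element to map metadata fields. Assuming the same element also accepts the feedDate field (an assumption, not confirmed on this page), a fragment that writes the publish time to a different output field might look like the following. The output field name dateFed is hypothetical.

```xml
<metadataMap>
  <!-- Assumption: feedDate can be remapped like the fields in the complex example.
       "dateFed" is a hypothetical output field name. -->
  <map from="feedDate" to="dateFed"/>
</metadataMap>
```

This fragment would sit inside the <config> element of the feeder component, alongside <branches>.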

Example Configurations

Simple

  <component name="SinglePageFeeder" subType="default" factoryName="aspire-singlepagefeeder">
    <config>
      <ccdLocation>/systemCommon/ccd</ccdLocation>
      <autoStart>${autoFeedPages}</autoStart>
      <branches>
        <branch event="onPublish" pipelineManager="standard-pipe-manager" />
      </branches>
    </config>
  </component>

Complex

  <component name="SinglePageFeeder" subType="default" factoryName="aspire-singlepagefeeder">
    <config>
      <ccdLocation>/systemCommon/ccd</ccdLocation>
      <autoStart>${autoFeedPages}</autoStart>
      <loopWait>21600000</loopWait> <!-- 21600000 ms = 6 hours -->
      <feedWait>100</feedWait>      <!-- Only wait 1/10s between URLs not in the same domain -->
      <domainWait>3000</domainWait> <!-- Wait 3 s between URLs in the same domain -->
      <metadataMap>
        <map from="category" to="category"/>
        <map from="subCategory" to="subCategory"/>
        <map from="geographicArea" to="geographicArea"/>
        <map from="searchKeywords1" to="searchKeywords1"/>
        <map from="boost" to="boostTokens"/>
      </metadataMap>
      <branches>
        <branch event="onPublish" pipelineManager="standard-pipe-manager" />
      </branches>
    </config>
  </component>