ARC Reader 0.4

From wiki.searchtechnologies.com



ARC File Reader
Description: Reads an ARC file, splits it up, and branches each nested URL as a separate sub-job.
Inputs: A <fullPath> element specifying the file-system path of the ARC file to read.
Outputs: An AspireDocument object carrying metadata for the URL. The URL content captured by the crawler is also loaded into memory as bytes, which later stages can process as a byte input stream.
Factory: aspire-arc-reader
Sub Type: default
Object Type: Produces an AspireDocument object for each URL contained within the ARC file. These are published as sub-jobs.
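
As an illustration of the input described above, a job document handed to this stage might look like the following sketch. The <doc> root element and the file path are assumptions made for the example; only the <fullPath> element name comes from this page.

```xml
<!-- Hypothetical input job document; the <doc> root and the path are assumed -->
<doc>
  <!-- Must be a file-system path to the ARC file, not a URL -->
  <fullPath>/data/crawls/example-crawl.arc</fullPath>
</doc>
```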

Configuration

Element | Type | Default | Description
branches | | None | The configuration of the pipeline to publish to. See below.
urlElementName | String | fullPath | The XML element name from which the file path of the ARC file is fetched. Currently the content of this tag must be a file-system path, not a URL.
submit | boolean | true | If true, extracted documents are published as sub-jobs down the pipeline. If false, only a debug message about each document is printed. Used for debugging.
allowMimeTypes | list of <mimeType> tags | application/pdf and text/html | A list of nested <mimeType> tags, each naming a mime type that is allowed to flow down the pipeline. Other mime types found in the ARC file are skipped.
metadataMap | metadataMapping | | Metadata map from the standard metadata mapper. See below for more details.

Branch Configuration

The ARC Reader stage splits up an ARC file into sub jobs and publishes them using the branch manager. It publishes using the onPublish event for each sub-job.

You must therefore include a <branches> element in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Metadata Mapper Configuration

The ARC Reader stage produces some additional metadata fields which can be mapped to fields in the AspireDocument XML.

Field | Default Output Field | Description
url | url | The full URL of the document that was crawled, downloaded, and stored in the ARC file.
mimeType | mimeType | The mime type returned to the crawler by the HTTP server (from the Content-Type header) for the URL. For example: "text/html".
encoding | encoding | The content encoding returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8".
Date | crawlDateTime | The time at which this URL was crawled. Originally formatted as "yyyyMMddHHmmss" and, by default, reformatted as an ISO 8601 date-time.
offset | | Believed to be the byte offset into the ARC file where the sub-document starts.
length | | Believed to be the length, in bytes, of the sub-document within the ARC file.
date | | unknown
ip4 | | unknown
digest | | unknown
Last-Modified | modificationDateTime | The modification date-time reported by the HTTP server in the "last-modified" HTTP header, if it exists. By default, automatically reformatted as an ISO 8601 date-time.
All other HTTP headers | | All other HTTP headers retrieved by the crawler may be available for mapping to output XML elements.
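
To make the default mapping concrete, the sketch below shows roughly what a sub-job document could contain for one crawled page. The <doc> root element, all values, and the exact ISO 8601 variant are assumptions for illustration; only the element names are taken from the default output fields listed above.

```xml
<!-- Hypothetical sub-job document for one URL; all values are made up -->
<doc>
  <url>http://www.example.com/index.html</url>
  <mimeType>text/html</mimeType>
  <encoding>UTF-8</encoding>
  <!-- Stored in the ARC file as 20110315143045 (yyyyMMddHHmmss) -->
  <crawlDateTime>2011-03-15T14:30:45Z</crawlDateTime>
  <!-- Present only if the server sent a last-modified header -->
  <modificationDateTime>2011-03-01T09:12:00Z</modificationDateTime>
</doc>
```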

Example Configuration

  <component name="arcReader" subType="default" factoryName="aspire-arc-reader">
    <config>
      <branches>
        <branch event="onPublish" pipelineManager="../SubJobPipelineManager"/>
      </branches>
    </config>
  </component>
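
The configuration above relies on the defaults for every optional element. A fuller variant that spells those defaults out might look like the sketch below. It assumes that the simple options are written as child elements of <config> named after the entries in the Configuration section; that layout has not been confirmed against the component itself.

```xml
<component name="arcReader" subType="default" factoryName="aspire-arc-reader">
  <config>
    <!-- Element of the job document holding the ARC file path (default shown) -->
    <urlElementName>fullPath</urlElementName>
    <!-- Set to false to log each extracted document instead of publishing it -->
    <submit>true</submit>
    <!-- Only these mime types flow down the pipeline (defaults shown) -->
    <allowMimeTypes>
      <mimeType>application/pdf</mimeType>
      <mimeType>text/html</mimeType>
    </allowMimeTypes>
    <branches>
      <branch event="onPublish" pipelineManager="../SubJobPipelineManager"/>
    </branches>
  </config>
</component>
```

Setting submit to false while testing makes it possible to check which URLs the reader extracts without sending any sub-jobs to the target pipeline.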