XML Sub Job Extractor 0.4

XML Sub-Job Extractor 0.4
Description: Extracts individual XML records from a single XML file which contains a list of records. Each record is sent to a sub-job pipeline to be processed individually.
Inputs: An AspireDocument containing a data stream (i.e. object['contentStream'] or object['contentBytes']) that supplies the XML to process. NOTE: A previous job (typically FetchURL) must have opened the input stream.
Outputs: One AspireDocument object per sub-job, each containing the XML of an individual record, published to the configured sub-job pipeline manager.
Factory: aspire-xml-files
Sub Type: xmlSubJobExtractor
Object Type: Produces AspireDocument objects.


Operation

This stage is primarily intended to split an XML file containing a list of records, and then to process each individual record one at a time, as sub-jobs on their own pipeline. These sorts of XML files are commonly produced by relational databases.

This stage takes an Aspire document which contains a data stream. It assumes that the data stream represents an XML document and then parses through the XML document to extract sub-job documents.

Note that the XML Sub-Job Extractor does not load the entire XML into an in-memory DOM object. Instead, it uses a SAX handler to read data from the input stream and outputs XML records to the sub-job pipeline as they are found. This keeps it fast, with very low memory requirements.
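
The streaming approach described above can be illustrated with a short Python sketch. This is not the Aspire implementation (which is a Java component); the class name RecordSplitter and the on_record callback are hypothetical stand-ins for the stage and its sub-job pipeline. It only shows how a SAX handler can emit each child of the root element as soon as its closing tag is seen, without building a DOM.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class RecordSplitter(xml.sax.ContentHandler):
    """Emit each child of the root element as its own XML string."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record  # hypothetical callback standing in for the sub-job pipeline
        self.depth = 0              # 0 = outside root, 1 = inside root, 2+ = inside a record
        self.parts = []             # text fragments of the record currently being read

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth >= 2:
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self.parts.append(f"<{name}{attr_text}>")

    def characters(self, content):
        if self.depth >= 2:
            self.parts.append(escape(content))

    def endElement(self, name):
        if self.depth >= 2:
            self.parts.append(f"</{name}>")
        if self.depth == 2:
            # A complete record has been read: hand it off and free the buffer.
            self.on_record("".join(self.parts))
            self.parts = []
        self.depth -= 1


SAMPLE = b"""<records>
  <record id="32"><first_name>george</first_name></record>
  <record id="33"><first_name>thomas</first_name></record>
</records>"""

found = []
xml.sax.parse(io.BytesIO(SAMPLE), RecordSplitter(found.append))
```

After parsing, found holds one string per <record> element. Only the record currently being read is buffered, which is the property that keeps memory use low in a design of this kind.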

Typical input XML documents look like this:

 <records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <record id="32">
     <first_name>george</first_name>
     <last_name>washington</last_name>
     <description>Founding father #1</description>
   </record>
   <record id="33">
     <first_name>thomas</first_name>
     <last_name>jefferson</last_name>
     <description>Founding father #2</description>
   </record>
 </records>

Note that every child of the root element will be processed as a separate sub-job document. Therefore, the above XML will produce the following sub-job XML documents:

Sub Job #1:

 <doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <parent>
      -- NOTE:  a copy of the parent metadata is stored here --
   </parent>
   <first_name>george</first_name>
   <last_name>washington</last_name>
   <description>Founding father #1</description>
 </doc>

Sub Job #2:

 <doc id="33" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
   <parent>
      -- NOTE:  a copy of the parent metadata is stored here --
   </parent>
   <first_name>thomas</first_name>
   <last_name>jefferson</last_name>
   <description>Founding father #2</description>
 </doc>

The top-level <doc> element for each sub-job contains all of the attributes of the parent XML element from the original file (i.e. the attributes on the <records> element above) as well as all of the attributes of the sub-record itself (the attributes on each <record> element in turn). This should ensure that transforms on XML files which require nested namespaces can occur properly.
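
As a rough illustration of the attribute merge described above, the following Python sketch builds the opening <doc> tag from two attribute maps. The function name build_doc_start_tag is hypothetical, and the rule that a record attribute is kept when both elements define the same name is an assumption; the page does not say how such a clash is resolved.

```python
from xml.sax.saxutils import quoteattr


def build_doc_start_tag(root_attrs, record_attrs):
    """Return an opening <doc> tag carrying both record and root attributes."""
    merged = dict(record_attrs)  # attributes from the individual <record> element
    for name, value in root_attrs.items():
        # Assumption: on a name clash the record's own value is kept.
        merged.setdefault(name, value)
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in merged.items())
    return f"<doc{attr_text}>"


root_attrs = {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"}
record_attrs = {"id": "32"}
tag = build_doc_start_tag(root_attrs, record_attrs)
# tag == '<doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
```

Copying the namespace declaration from the root onto each <doc> is what lets a prefixed name inside a record still resolve once the record is separated from its original parent.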

Configuration

Element      Type      Default     Description
branches               None        The configuration of the pipeline to publish to. See below.
maxSubJobs   integer   0 (= all)   The maximum number of sub-jobs to generate. If there are more possible jobs in the input XML file, they will be ignored.
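
The effect of maxSubJobs can be sketched in Python as follows. This is an illustration only, assuming the limit simply stops extraction once the given number of records has been emitted; iter_records and max_sub_jobs are hypothetical names, and the sketch uses ElementTree's iterparse rather than the SAX handler the stage itself uses.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, max_sub_jobs=0):
    """Yield each child of the root element; stop after max_sub_jobs (0 = all)."""
    depth = 0
    emitted = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:  # just closed a direct child of the root
            yield ET.tostring(elem, encoding="unicode").strip()
            emitted += 1
            elem.clear()  # release the record's children once it is handed off
            if max_sub_jobs and emitted >= max_sub_jobs:
                return  # remaining records in the input are ignored


SAMPLE = b"""<records>
  <record id="32"><first_name>george</first_name></record>
  <record id="33"><first_name>thomas</first_name></record>
</records>"""

limited = list(iter_records(io.BytesIO(SAMPLE), max_sub_jobs=1))
everything = list(iter_records(io.BytesIO(SAMPLE)))
```

With the sample above, limited contains only the first record, while everything contains both, matching the default of 0 meaning "all".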

Branch Configuration

The XML Sub-Job Extractor publishes documents using the branch manager, using the "onSubJob" event described in the table below. You must therefore include a <branches> entry for this event in the configuration to publish to a pipeline within a pipeline manager. See Branch Handler for more details.

Element                            Type     Description
branches/branch/@event             string   The event to configure. Should always be "onSubJob".
branches/branch/@pipelineManager   string   The URL of the pipeline manager to publish to. Can be relative.
branches/branch/@pipeline          string   The name of the pipeline to publish to.

Example Configuration

    <!-- Use FetchUrl to open a stream on the object which is then used by XMLSubJobExtract -->
    <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
    
    <component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
      <config>
        <branches>
          <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
        </branches>
      </config>
    </component>