XML Sub Job Extractor 0.4
|XML Sub-Job Extractor 0.4|
|Description:||Extracts individual XML records from a single XML file which contains a list of records. Each record is sent to a sub-job pipeline to be processed individually.|
|Inputs:||An AspireDocument containing a data stream (object['contentStream'] or object['contentBytes']) over the XML to process. NOTE: An earlier stage (typically FetchURL) must have opened the input stream.|
|Outputs:||One AspireDocument object per sub-job, each containing the XML of an individual record, published to the configured sub-job pipeline manager.|
|Object Type:||Produces AspireDocument objects.|
This stage is primarily intended to split an XML file containing a list of records, and then to process each individual record one at a time, as sub-jobs on their own pipeline. These sorts of XML files are commonly produced by relational databases.
This stage takes an Aspire document which contains a data stream. It assumes that the data stream represents an XML document and then parses through the XML document to extract sub-job documents.
Note that the XML Sub Job Extractor does not load the entire XML into an in-memory DOM object. Instead, a SAX handler reads the input stream and sends each XML record to the sub-job pipeline as soon as it is found. Memory use therefore stays low even for very large input files, and processing is fast.
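The streaming behaviour described above can be illustrated outside Aspire with Python's standard xml.sax module. This is only a minimal sketch under stated assumptions, not Aspire's implementation: the RecordSplitter class and its on_record callback are hypothetical names, and the sketch assumes each record contains only flat child fields.

```python
import io
import xml.sax


class RecordSplitter(xml.sax.ContentHandler):
    """Hypothetical handler: calls on_record once per child of the root element."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record  # stands in for publishing to a sub-job pipeline
        self.depth = 0
        self.record = None  # the record currently being read, if any
        self.field = None   # name of the field element currently open
        self.text = []

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2:
            # Every child of the root element starts a new record.
            self.record = {"attrs": dict(attrs.items()), "fields": {}}
        elif self.depth == 3:
            self.field = name
            self.text = []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.depth == 3:
            self.record["fields"][self.field] = "".join(self.text)
            self.field = None
        elif self.depth == 2:
            # The record is complete: hand it off and drop the reference,
            # so only one record is held in memory at a time.
            self.on_record(self.record)
            self.record = None
        self.depth -= 1


xml_bytes = b"""<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record id="32"><first_name>george</first_name><last_name>washington</last_name></record>
  <record id="33"><first_name>thomas</first_name><last_name>jefferson</last_name></record>
</records>"""

seen = []
xml.sax.parse(io.BytesIO(xml_bytes), RecordSplitter(seen.append))
```

Here seen.append plays the role of the sub-job pipeline. Each record is delivered as soon as its closing tag is read, so the parser never needs the whole file in memory.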
Typical input XML documents look like this:
<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record id="32">
    <first_name>george</first_name>
    <last_name>washington</last_name>
    <description>Founding father #1</description>
  </record>
  <record id="33">
    <first_name>thomas</first_name>
    <last_name>jefferson</last_name>
    <description>Founding father #2</description>
  </record>
</records>
Note that every child of the root element will be processed as a separate sub-job document. Therefore, the above XML will produce the following sub-job XML documents:
Sub Job #1:
<doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent> -- NOTE: a copy of the parent metadata is stored here -- </parent>
  <first_name>george</first_name>
  <last_name>washington</last_name>
  <description>Founding father #1</description>
</doc>
Sub Job #2:
<doc id="33" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <parent> -- NOTE: a copy of the parent metadata is stored here -- </parent>
  <first_name>thomas</first_name>
  <last_name>jefferson</last_name>
  <description>Founding father #2</description>
</doc>
The top-level <doc> element of each sub-job contains all of the attributes of the root element of the original file (i.e. the attributes on the <records> element above) as well as all of the attributes of the individual sub-record (the attributes on the corresponding <record> element). Because namespace declarations on the root element are carried over in this way, transforms that rely on those namespaces should still work on the individual sub-job documents.
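The attribute merge described above can be sketched in a few lines of Python. The doc_start_tag helper is a hypothetical name used for illustration only. What happens when the root and a record define the same attribute name is not stated here, so the sketch assumes the record's value wins.

```python
from xml.sax.saxutils import quoteattr


def doc_start_tag(root_attrs, record_attrs):
    """Build a <doc ...> start tag from root-level and record-level attributes."""
    merged = dict(root_attrs)
    merged.update(record_attrs)  # assumption: a record attribute wins on a name clash
    rendered = "".join(f" {name}={quoteattr(value)}" for name, value in merged.items())
    return f"<doc{rendered}>"


root_attrs = {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"}
tag = doc_start_tag(root_attrs, {"id": "32"})
```

Attribute order is not significant in XML, so the result is equivalent to the <doc> start tag shown for Sub Job #1 even though the attributes appear in a different order.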
|branches||None||The configuration of the pipeline to publish to. See below.|
|maxSubJobs||integer||0 (= all)||The maximum number of sub-jobs to generate. Any further records in the input XML file are ignored.|
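The effect of maxSubJobs can be pictured as a cap on a stream of records, with 0 meaning no cap. The following Python sketch is illustrative only: limit_sub_jobs is a hypothetical helper, not part of Aspire.

```python
from itertools import islice


def limit_sub_jobs(records, max_sub_jobs=0):
    """Return an iterator over at most max_sub_jobs records; 0 means all."""
    if max_sub_jobs > 0:
        return islice(records, max_sub_jobs)
    return iter(records)


record_ids = ["32", "33", "34"]
capped = list(limit_sub_jobs(record_ids, 2))
everything = list(limit_sub_jobs(record_ids))
```

With a limit of 2 only the first two records are passed on, and the third is silently skipped.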
The XML Sub Job Extractor publishes sub-job documents through the branch manager, using the "onSubJob" event. You must therefore include a <branches> section for this event in the configuration, so that the documents are published to a pipeline within a pipeline manager. See Branch Handler for more details.
|branches/branch/@event||string||The event to configure. Should always be "onSubJob".|
|branches/branch/@pipelineManager||string||The URL of the pipeline manager to publish to. Can be relative.|
|branches/branch/@pipeline||string||The name of the pipeline to publish to.|
<!-- Use FetchUrl to open a stream on the object which is then used by XMLSubJobExtract -->
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <config>
    <branches>
      <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
    </branches>
  </config>
</component>