XML Sub Job Extractor
This stage is primarily intended to split an XML file containing a list of records, and then to process each individual record one at a time, as sub-jobs on their own pipeline. These sorts of XML files are commonly produced by relational databases.
This stage takes a job that contains a data stream. It assumes that the data stream is an XML document and parses it to extract sub-job documents.
Note that the XML Sub Job Extractor does not load the entire XML document into an in-memory DOM object. Instead, it reads from the input stream with a SAX handler and publishes each XML record to the sub-job pipeline as soon as it is found. This keeps processing fast and keeps memory requirements low, even for large input files.
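The streaming approach described above can be illustrated with a small sketch. This is not Aspire's implementation; it is a minimal Python example built on the standard-library xml.sax module, and the names RecordSplitter, on_record and SAMPLE are assumptions made for illustration. It hands each child of the document root to a callback as soon as that child's closing tag is seen, without ever building a tree of the whole document.

```python
import io
import xml.sax


class RecordSplitter(xml.sax.ContentHandler):
    """Calls on_record once for every child element of the document root."""

    def __init__(self, on_record):
        super().__init__()
        self.on_record = on_record
        self.depth = 0
        self.record = None  # record currently being built, if any
        self.field = None   # name of the field element currently being read
        self.text = []

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 2:
            # A child of the root element: start a new record.
            self.record = {"tag": name, "attrs": dict(attrs.items()), "fields": {}}
        elif self.depth == 3:
            # A direct child of the record: start collecting its text.
            self.field = name
            self.text = []

    def characters(self, content):
        # Text of deeper descendants is folded into the enclosing field
        # (a simplification made for this sketch).
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.depth == 3:
            self.record["fields"][self.field] = "".join(self.text)
            self.field = None
        elif self.depth == 2:
            # The record is complete: hand it over and forget it.
            self.on_record(self.record)
            self.record = None
        self.depth -= 1


SAMPLE = b"""<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record id="32">
    <first_name>george</first_name>
    <last_name>washington</last_name>
    <description>Founding father #1</description>
  </record>
  <record id="33">
    <first_name>thomas</first_name>
    <last_name>jefferson</last_name>
    <description>Founding father #2</description>
  </record>
</records>"""

records = []
xml.sax.parse(io.BytesIO(SAMPLE), RecordSplitter(records.append))
```

Because only the record currently being read is held in memory, the footprint of this sketch depends on the size of a single record rather than the size of the file.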
Typical input XML documents look like this:
<records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <record id="32">
    <first_name>george</first_name>
    <last_name>washington</last_name>
    <description>Founding father #1</description>
  </record>
  <record id="33">
    <first_name>thomas</first_name>
    <last_name>jefferson</last_name>
    <description>Founding father #2</description>
  </record>
</records>
Every child of the root element is processed as a separate sub-job document. Which element is treated as the root can be set with the rootNode configuration parameter. The XML above therefore produces the following sub-job XML documents:
Sub Job #1:
<doc id="32" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
    -- NOTE: a copy of the parent metadata is stored here --
  </parent>
  <first_name>george</first_name>
  <last_name>washington</last_name>
  <description>Founding father #1</description>
</doc>
Sub Job #2:
<doc id="33" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlTag="record">
  <parent>
    -- NOTE: a copy of the parent metadata is stored here --
  </parent>
  <first_name>thomas</first_name>
  <last_name>jefferson</last_name>
  <description>Founding father #2</description>
</doc>
The top-level <doc> element of each sub-job contains all of the attributes of the parent XML element from the original file (i.e. the attributes on the <records> element above) as well as all of the attributes of the sub-record itself (the attributes on each <record> element in turn). Carrying over the parent attributes, including namespace declarations such as xmlns:xsi, should ensure that transforms which depend on nested namespaces work properly on each sub-job.
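How the attributes on the <doc> element are assembled can be sketched as follows. This is a hypothetical Python helper, not Aspire code: the function name doc_attributes is an assumption, and so is the precedence rule (record attributes overriding parent attributes when names clash), which the description above does not specify. The xmlTag attribute is taken from the sample output shown earlier.

```python
def doc_attributes(parent_attrs, record_attrs, record_tag):
    """Build the attribute set for a sub-job's <doc> element."""
    merged = dict(parent_attrs)      # e.g. namespace declarations on <records>
    merged.update(record_attrs)      # assumption: the record wins on a name clash
    merged["xmlTag"] = record_tag    # original element name, as in the sample output
    return merged


attrs = doc_attributes(
    {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance"},
    {"id": "32"},
    "record",
)
```

For the first sample record, attrs holds the same three attributes that appear on the <doc> element of Sub Job #1.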
|Element||Type||Default||Description|
|branches|| ||None||The configuration of the pipeline to publish to. See below.|
|maxSubJobs||integer||0 (= all)||The maximum number of sub-jobs to generate. Any further records in the input XML file are ignored.|
|characterEncoding||String||UTF-8||The character encoding of the XML file to be read, if not UTF-8.|
|rootNode||String||None||The root node which contains the sub-jobs to publish. If not specified, the root node of the entire XML tree is used. This value should be in path format, for example: /results/hits . This publishes as sub-jobs all of the child elements which occur within the <results>/<hits> tag. Note: This is not an XPath, just a path which names a node within the XML hierarchy. It should start with a / and one will be added if missing.|
|(1.1 Release) cleanse||boolean||true||Set to true to strip non-readable characters (e.g. ASCII code 15) from the XML content.|
|(1.1.1 Release) (1.2.2 Release) (1.3 Release) honorDTD||boolean||false||Set to true to fetch the XML document's DTD.|
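The interaction between rootNode and maxSubJobs can be sketched in a few lines. This is an illustration under stated assumptions, not Aspire's code: the names normalize_root_path, SubJobFinder and find_sub_jobs are hypothetical, and the sketch only records which elements would become sub-jobs rather than publishing them.

```python
import io
import xml.sax


def normalize_root_path(root_node):
    """Split a rootNode path into element names, adding a leading '/' if missing."""
    if not root_node:
        return None  # no rootNode: the document's own root element is used
    path = root_node if root_node.startswith("/") else "/" + root_node
    return [name for name in path.split("/") if name]


class SubJobFinder(xml.sax.ContentHandler):
    """Records (tag, attributes) for each element that would become a sub-job."""

    def __init__(self, root_node=None, max_sub_jobs=0):
        super().__init__()
        self.root_path = normalize_root_path(root_node)
        self.max_sub_jobs = max_sub_jobs  # 0 means no limit
        self.stack = []
        self.sub_jobs = []

    def startElement(self, name, attrs):
        parents = list(self.stack)
        self.stack.append(name)
        if self.root_path is None:
            under_root = len(parents) == 1        # child of the document root
        else:
            under_root = parents == self.root_path
        limit_reached = 0 < self.max_sub_jobs <= len(self.sub_jobs)
        # This sketch keeps parsing after the limit and simply ignores the
        # remaining records; a real implementation might stop reading early.
        if under_root and not limit_reached:
            self.sub_jobs.append((name, dict(attrs.items())))

    def endElement(self, name):
        self.stack.pop()


def find_sub_jobs(xml_bytes, root_node=None, max_sub_jobs=0):
    finder = SubJobFinder(root_node, max_sub_jobs)
    xml.sax.parse(io.BytesIO(xml_bytes), finder)
    return finder.sub_jobs


SEARCH = b'<results><total>3</total><hits><hit id="a"/><hit id="b"/><hit id="c"/></hits></results>'
limited = find_sub_jobs(SEARCH, root_node="results/hits", max_sub_jobs=2)
default = find_sub_jobs(SEARCH)
```

With rootNode set to results/hits (the missing leading / is added) and maxSubJobs set to 2, only the first two <hit> elements are selected. With no rootNode, the children of <results> itself, <total> and <hits>, are the sub-jobs.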
The XML Sub Job Extractor publishes documents using the branch manager, raising the onSubJob event for each sub-job. You must therefore include a <branches> entry for this event in the configuration in order to publish to a pipeline within a pipeline manager. See Branch Handler for more details.
|Element||Type||Description|
|branches/branch/@event||String||The event to configure. Should always be "onSubJob".|
|branches/branch/@pipelineManager||String||The URL of the pipeline manager to publish to. Can be relative.|
|branches/branch/@pipeline||String||The name of the pipeline to publish to.|
<!-- Use FetchUrl to open a stream on the object which is then used by XMLSubJobExtractor -->
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />
<component name="XMLSubJobExtractor" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
  <branches>
    <branch event="onSubJob" pipelineManager="../ProcessSingleRecord" />
  </branches>
</component>