Post XML 0.4

From wiki.searchtechnologies.com

Post XML Stage 0.4
Description: Applies an XSLT transform to the input XML. Then posts the resulting transformed XML to a remote RESTful interface via HTTP.
Inputs: The entire XML representation of the document as processed so far.
Outputs: Transformed XML, which is posted to a remote server via a RESTful interface.
Factory: aspire-post-xml
Sub Type: default
Object Type: AspireDocument

Description

This stage first transforms the AspireDocument XML from the previous stage using an XSL transform, and then posts the resulting data (via streaming) to a remote server using the HTTP POST method. The server is selected from the list by round robin.

If the remote server returns something other than HTTP 200 (in the HTTP headers), the stage retries the post at most two more times by default, sleeping 1 second between tries. Both values are configurable (see maxTries and retryWait below).

If the remote server returns HTTP 200 but the okay response string (see below) cannot be found in the response data, the document is flagged as an error and should be quarantined.

The stage also has an option for posting a fixed literal string to the remote server, for other types of notifications.

Supports job batching.

Configuration

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| postUrl | string | http://localhost:8983/solr/update | A semicolon-separated list of the URLs to which the transformed XML will be posted. The URL for each post is selected by round robin. |
| postString | string | | Post this string instead of the transform of the incoming document XML. When specified, the document XML is neither transformed nor posted; only the postString is sent to the remote server. Useful for things like a SOLR commit or some other notification. |
| postXsl | string | | The XSL transform file used to transform the incoming document XML into the XML that is posted to the remote server. There is no default; it must be specified unless postString is specified. The file path is treated as relative to ASPIRE_HOME. |
| debugOutFile | string | | For debugging purposes, the transformed XML is appended to this file in addition to being posted to the remote server. |
| okayResponse | string | <int name="status">0</int> | The response from the remote server is scanned for this string. If it is found, the post is assumed to be successful. If it is not found, an error is returned for the document. |
| readTimeout | int | 300000 ms (5 minutes) | The read timeout for the HTTP connection: how long to wait for the server to respond. |
| connectionTimeout | int | 60000 ms (1 minute) | The connection timeout: how long to wait for the server to respond to a connection request. |
| maxTries | int | 3 | The number of times to try submitting the document. If a submit fails, a different URL is selected from the list of URLs (if the number of URLs is greater than one) and the document is resubmitted. |
| retryWait | int | 1000 ms (1 second) | The time to sleep between submissions of failed documents. |
| deterministic | boolean | false | If true, the round robin is deterministic based on the document id, and the same document is always sent to the same host. If false, the round robin sends the document to the first available host. |
| broadcast | boolean | false | If true, the document is sent to all configured servers (one after the other). |
| idPath | string | /doc/fetchUrl | When using deterministic round robin, the document id is obtained from this XPath of the document. |
| fixedLengthOutput (deprecated) | boolean | false | Specifies that content is written using fixed-length output. The entire document is first buffered in memory so that its length is known, and the length is then sent to the server. This appeared to be required when writing content to the Google Search Appliance. (Deprecated: the GSA accepts content even when the length is unknown.) |
| multipartForm | parentTag | | Posts the XML output as a multipart form, with name/value pairs written to the POST stream (as HTTP headers) before the content itself. Name/value pairs are specified with <multipartForm><param> elements. |
| multipartForm/@contentParam | string | data | The parameter name that holds the output of the XSL transform of the job's document, i.e. the content of the XML itself. |
| multipartForm/param and param/@name | string | | Parameter name/value pairs of form data to send to the HTTP server. Values are specified as the content of the <param> tag and can be encoded using substitutions from the Simple Templates method. |
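As a sketch of how the tuning options in the table above might be combined, the fragment below sets the timeouts and retry values explicitly and enables the debug file. The component name PostXMLTuned, the chosen values and the debug file path are illustrative assumptions. Timeouts are written as plain millisecond integers, following the form of the retryWait value in the multi-server example on this page.

```xml
<component name="PostXMLTuned" subType="default" factoryName="aspire-post-xml">
  <config>
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <postXsl>config/aspire2solr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <!-- Assumed values: wait up to 2 minutes for a response, 30 seconds to connect -->
    <readTimeout>120000</readTimeout>
    <connectionTimeout>30000</connectionTimeout>
    <!-- Assumed values: try each document at most 5 times, sleeping 2 seconds between tries -->
    <maxTries>5</maxTries>
    <retryWait>2000</retryWait>
    <!-- Hypothetical path: the transformed XML is also appended to this file for inspection -->
    <debugOutFile>data/debug/solr.out</debugOutFile>
  </config>
</component>
```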

Example Configuration

  <component name="PostXML" subType="default" factoryName="aspire-post-xml">
    <config>
      <postUrl>http://localhost:8983/solr/update</postUrl>
      <postXsl>config/aspire2solr.xsl</postXsl>
      <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    </config>
  </component>
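
The example configuration references config/aspire2solr.xsl without showing it. The following is a minimal sketch of what such a transform might look like; it is not the stylesheet distributed with Aspire. It assumes the AspireDocument is rooted at <doc> with a fetchUrl child (consistent with the default idPath of /doc/fetchUrl), that the document has title and content children (hypothetical field names), and that the Solr schema defines id, title and content fields.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap the single incoming document in a Solr <add> command -->
  <xsl:template match="/doc">
    <add>
      <doc>
        <!-- Assumption: the fetch URL is used as the unique Solr id -->
        <field name="id"><xsl:value-of select="fetchUrl"/></field>
        <!-- "title" and "content" are hypothetical AspireDocument fields -->
        <field name="title"><xsl:value-of select="title"/></field>
        <field name="content"><xsl:value-of select="content"/></field>
      </doc>
    </add>
  </xsl:template>
</xsl:stylesheet>
```

A successful Solr update response contains <int name="status">0</int>, which is what the okayResponse setting in the example checks for.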

Multi Server Configuration Example

  <component name="PostXML" subType="default" factoryName="aspire-post-xml">
    <config>
      <postUrl>http://server1:8983/solr/update;
               http://server2:8983/solr/update;
               http://server3:8983/solr/update;
               http://server4:8983/solr/update
      </postUrl>
      <postXsl>config/aspire2solr.xsl</postXsl>
      <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
      <maxTries>10</maxTries>
      <retryWait>10000</retryWait>
    </config>
  </component>
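
The deterministic and broadcast options change how a server is picked from a multi-server list. The two fragments below are sketches built only from the elements documented in the configuration table. The component names PostXMLDeterministic and PostXMLBroadcast and the server names are illustrative assumptions.

```xml
<!-- Deterministic routing: the same document id is always sent to the same host -->
<component name="PostXMLDeterministic" subType="default" factoryName="aspire-post-xml">
  <config>
    <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
    <postXsl>config/aspire2solr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <deterministic>true</deterministic>
    <!-- XPath from which the document id is read (this is the documented default) -->
    <idPath>/doc/fetchUrl</idPath>
  </config>
</component>

<!-- Broadcast: every document is sent to all listed servers, one after the other -->
<component name="PostXMLBroadcast" subType="default" factoryName="aspire-post-xml">
  <config>
    <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
    <postXsl>config/aspire2solr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <broadcast>true</broadcast>
  </config>
</component>
```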

Commit Example

Example using the <postString> method to automatically post a commit command to SOLR. This is typically used after the "WaitForSubJobs" component in the parent pipeline.

  <component name="SolrCommit" subType="default" factoryName="aspire-post-xml">
    <config>
      <postUrl>http://localhost:8983/solr/update</postUrl>
      <postString>
        <![CDATA[
          <commit/>
        ]]>
      </postString>
    </config>
  </component>


Multi-Part Form Example

Useful for writing to the Google Search Appliance.

  <component name="SolrCommit" subType="default" factoryName="aspire-post-xml">
    <config>
      <postUrl>http://localhost:8983/solr/update</postUrl>
      <fixedLengthOutput>true</fixedLengthOutput>
      <postXsl>config/aspire2solr.xsl</postXsl>
      <multipartForm contentParam="data">
        <param name="datasource">This is the datasource value</param>
        <param name="feedtype">{XML:feedValue}</param>
      </multipartForm>
    </config>
  </component>


Batching Post XML

Batching lets Post XML send groups of documents over a single open stream to the server instead of posting them one at a time. Batching posts improves performance considerably, especially on the GSA.

Batching is possible in the following situations:

  • You have a sub-job extractor on a parent pipeline, and the post-xml stage is on a (different) sub-job pipeline.
  • The target search engine or application allows multiple documents/records in one post. For example, the GSA allows multiple <record> elements in the same XML feed.

How to Configure

You will need to add the following components into the parent pipeline:

  1. Open batch: Asks Post XML to open a new batch. The created batch object is stored in the parent job and is visible to all sub-jobs (but only Post XML should use it). This component is an instance of Storage Handler with the "open" command configured.
  2. Close batch: Closes the existing batch object. It sends any remaining data and flushes and closes the currently open stream to the server. It is also an instance of Storage Handler.


Configuration of the Open Batch Component

It must be set up as the open command of a Storage Handler component.

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| commands/open/@componentRef | String | | Path to the Post XML component name that will be used to open the batches. |
| commands/open/batchSize | int | 1 | Maximum size of the batch. Every batchSize documents, the server stream is closed and reopened to start a new batch. |
| commands/open/batchTimeout | int | 10000 | Time in milliseconds that the batch remains open before being closed by a background thread. The timeout starts when the last document was received by Post XML. Set to 0 to disable. |
| commands/open/postHeader | String | empty string | String that is written to the stream before the first document is received. This consists of the required feed headers for the target search engine or application. |
| commands/open/postFooter | String | empty string | String that is written to the stream when the batch is closed. This consists of the required feed footer for the target search engine or application. |


Configuration of the Close Batch Component

It must be set up as the close command of a Storage Handler component.

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| commands/close/@componentRef | String | | Path to the Post XML component name that was used to open the batches. |
| commands/close/@variable | String | | Name of the variable in the parent job that holds the current batch. This value must always be "batchReference". |


The Post XML configuration remains the same. This means that multipart, deterministic, round-robin and broadcasting Post XML configurations can all work in a batched fashion.

Configuration Example

Open Batch Example

<component name="openBatch" subType="default" factoryName="aspire-storage-handler">
	<config>
		<commands>
			<open componentRef="/searchengine/Post/PostAddOrUpdateToGSA">
				<batchSize>10</batchSize>
				<batchTimeOut>5000</batchTimeOut>
				<postHeader><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd"><gsafeed><header><datasource>Aspire_feed</datasource><feedtype>incremental</feedtype></header><group action="add">]]></postHeader>
				<postFooter><![CDATA[</group></gsafeed>]]></postFooter>
			</open>
		</commands>
	</config>
</component>

Close Batch Example

<component name="closeBatch" subType="default" factoryName="aspire-storage-handler">
	<config>
		<commands>
			<close variable="batchReference" componentRef="/searchengine/Post/PostAddOrUpdateToGSA"/>
		</commands>
	</config>
</component>

Pipeline Configuration Example

File:Batching pipeline example 04.jpg
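
As a rough textual sketch of the arrangement described in this section, the open and close batch stages might be placed in the parent pipeline around the sub-job stages, with Post XML running in the sub-job pipeline. The <pipelines>/<pipeline>/<stages>/<stage> layout is assumed from typical Aspire pipeline manager configurations and should be checked against your own application. The pipeline names and the SubJobExtractor component are hypothetical, and the stage order is an inference from the component descriptions above. The componentRef path used in the examples suggests that Post XML may live in a separate pipeline manager, which this simplified sketch does not show.

```xml
<pipelines>
  <!-- Parent pipeline: opens the batch, creates the sub-jobs, then closes the batch -->
  <pipeline name="parentPipeline" default="true">
    <stages>
      <stage component="openBatch"/>
      <!-- Hypothetical sub-job extractor that creates one sub-job per document -->
      <stage component="SubJobExtractor"/>
      <stage component="WaitForSubJobs"/>
      <stage component="closeBatch"/>
    </stages>
  </pipeline>

  <!-- Sub-job pipeline: each sub-job document is written into the open batch -->
  <pipeline name="subJobPipeline">
    <stages>
      <stage component="PostAddOrUpdateToGSA"/>
    </stages>
  </pipeline>
</pipelines>
```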

Feed to the GSA example (configuration and XSL)

This section provides an example of a Post XML configuration and an XSL template that may be useful for feeding documents to the GSA.

Configure aspire-post-xml to use the multipart form option. This prevents the GSA from rejecting the feed because of wrong encodings. Example:

<component name="PostAddOrUpdateToGSA" subType="default" factoryName="aspire-post-xml">
     <config>
           <postUrl>${gsaFeedUrl}</postUrl>
           <postXsl>data/aspire2gsa.xsl</postXsl>
           <okayResponse>Success</okayResponse>
           <debugOutFile>data/debug/gsa.out</debugOutFile>
           <multipartForm contentParam="data">
                <param name="datasource">ppp_feed</param>
                <param name="feedtype">incremental</param>
           </multipartForm>
     </config>
</component>

Notes:

  • The value of ${gsaFeedUrl} is http://10.10.40.46:19900/xmlfeed, where 10.10.40.46 is the GSA IP address.
  • Download aspire2gsa.xsl from File:Aspire2gsa.rar. Check the GSA Feeds Guide and the GSA Connector Developer's Guide for more details on the feed XML format.
  • okayResponse is configured to match the response from the GSA.
  • debugOutFile is optional; in that file you can see the transformed documents (as they are sent to the GSA).
  • multipartForm->datasource: Your feed will appear under this name in the “Crawl and Index->Feeds” section of the GSA administration.
  • multipartForm->feedtype: The GSA will keep versions of the same document.

Common issues (and how they are normally fixed)

  • There is a feed for each document. Is this normal? Yes, this is normal. This is the simplest feed scenario: one document per feed XML sent to the GSA. If you want more than one document per feed, check out the batching section above. There is a noticeable performance improvement when batches of documents are sent to the GSA.
  • PostXML returns error 401 for any feed. Check that the Aspire machine is on the List of Trusted IP Addresses in “Crawl and Index->Feeds” on GSA administration, or that Trust feeds from all IP addresses is selected.
  • GSA rejects the feed without even opening it. Check that the fed URLs match at least one expression of Start Crawling from the Following URLs in “Crawl and Index->Crawl URLs” on GSA administration.
  • GSA feed shows “Missing or invalid content” or “Content attribute not properly specified” error messages. This is likely a problem with the XSLT. Check that there are no <meta name="someField" content=""> entries in the generated feed XML (in newer versions of the GSA you can download the feed from GSA administration). This commonly happens because the XSL is extracting a field that is empty or does not exist on the AspireDocument (AspireObject).
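
One way to avoid empty content attributes is to guard each <meta> element in the XSL with a test. The fragment below is a sketch, not an excerpt from aspire2gsa.xsl. It assumes the AspireDocument is rooted at <doc> with a fetchUrl child and a title child (title is a hypothetical field name), and that records follow the <record>/<metadata>/<meta> layout described in the GSA Feeds Guide. The mimetype value is also an assumption.

```xml
<xsl:template match="/doc" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <record url="{fetchUrl}" mimetype="text/html">
    <metadata>
      <!-- Emit the <meta> element only when the field exists and is not blank -->
      <xsl:if test="normalize-space(title) != ''">
        <meta name="title" content="{normalize-space(title)}"/>
      </xsl:if>
    </metadata>
  </record>
</xsl:template>
```

Because normalize-space() returns an empty string both for a missing node and for a node containing only whitespace, the same test covers fields that are empty and fields that do not exist.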