Post HTTP (Aspire 2)

For Information on Aspire 3.1 Click Here


Post HTTP (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-post-http
subType  default
Inputs  An Aspire Object with the metadata of each document to be posted.
Outputs  A transformed XML, JSON or just plain text, which is then posted to a remote server via a RESTful interface.

The Post HTTP stage applies an XSLT or a JSON transform to the input AspireObject and then posts the resulting XML or JSON to a remote RESTful interface via HTTP. When several servers are configured, the target server is selected from the list either by round robin or by a deterministic selection that ensures a given document is only ever sent to one server.

If the remote server returns something other than HTTP 200 or 201 (in the HTTP headers), the component retries the post, sleeping between tries (retryWait, 1 second by default) and failing the document after a set number of tries (maxTries, 3 by default). If the component is configured to use round robin, retries for failures will attempt to pick a different server.

If the remote server returns HTTP 200 or 201, but the okay response string (see below) cannot be found in the response data, the document will be flagged as an error and should be quarantined.

The component also has an option for posting a fixed literal string to the remote server, for doing other types of notifications, and supports job batching; see Branch Handler (Aspire 2).

Configuration

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| postUrl | string | http://localhost:8983/solr/update | A semicolon-separated list of the URLs to which the transformed document will be posted. The exact URL to post to is selected by round robin or by a deterministic algorithm, depending on the configuration. |
| deterministic | boolean | false | If true, the server URL is selected deterministically based on the document id, so the same document is always sent to the same host. If false, the server URL is selected by round robin and the document is sent to the first available host. |
| idPath | String | /doc/fetchUrl | When using deterministic selection, the document id is obtained from the given XPath of the document. |
| broadcast | boolean | false | If true, the document will be sent to all configured servers (one after the other). |
| postString | string | | Post this string instead of the transform of the incoming Aspire Object document. When specified, the document XML is neither transformed nor posted; only the postString is sent to the remote server. Can be useful for things like a SOLR commit or some other notification. |
| postXsl | string | | The XSL transform file used to transform the incoming Aspire Object document into the XML which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that the file path is resolved relative to ASPIRE_HOME. |
| postJsonTransform | string | | The JSON transform file used to transform the incoming Aspire Object document into the JSON which is posted to the remote server. There is no default. It must be specified unless postString is specified. Note that the file path is resolved relative to ASPIRE_HOME. See Post JSON. |
| debugOutFile | string | | For debugging purposes, the transformed document is also appended to this output file in addition to being posted to the remote server. One debug-out file is created for every open thread, and old debug-out files are not overwritten: files are written with a "-###" suffix after the main file name but before the ".xxx" extension. You may therefore want to store debug-out files in the data or logs directory to avoid cluttering up your filesystem. A semicolon-separated list of debug-out files is allowed, for example <debugOutFile>/mnt/out1/debug-out.txt;/mnt/out2/debug-out.txt</debugOutFile>. New debug-out files are then automatically round-robined across the different locations. If the locations are on separate hard drives, output IO performance can be vastly improved. |
| okayResponse | string | <int name="status">0</int> | The response from the remote server is scanned for this string. If it is found, the posting is assumed to be successful. If it is not found, an error is returned for the document. |
| readTimeout | int | 300000 ms (5 minutes) | The read timeout for the HTTP connection: how long to wait for the server to respond. |
| connectionTimeout | int | 60000 ms (1 minute) | The connection timeout: how long to wait for the servers to respond to a connection request. |
| maxTries | int | 3 | The number of times to try submitting the document. If a submit fails, a different URL is selected from the list of URLs (if there is more than one URL) and the document is resubmitted. |
| retryWait | int | 1000 ms (1 second) | The time to sleep between submissions of failed documents. |
| multipartForm | parent tag | | Posts the transformed output as a multipart form, with name/value pairs written to the POST stream (as HTTP headers) before the content itself. Name/value pairs are specified with <multipartForm><param> elements. |
| multipartForm/@contentParam | String | data | The parameter name that holds the transformed output of the job's document, i.e. the content of the XML or JSON itself. |
| multipartForm/param and param/@name | String | | Holds parameter name/value pairs of form data to send to the HTTP server. Values are specified as the content of the <param> tag, and can be encoded using substitutions from the Simple Templates method. |
| saxonProcessor | boolean | false | Set to true if you want to use SAXON processors (which support XSLT 2.0). |
| authentication | String | "none" | The type of authentication to use: "none" for no authentication, or "basic" for Basic authentication with Base64 encoding. |
| username | String | null | The username, if authentication is needed. |
| password | String | null | The password, if authentication is needed. |
| contentType | String | null | The content-type header to be sent to the server. Ignored when sending multipart forms. Example: "text/xml". |
| requestProperties (2.1 Release) | | | Configurable HTTP request properties, such as "user-agent". See Request Properties Configuration below. |
| maxResults (2.1 Release) | Integer | 2^31-1 (maximum integer allowed) | (Index dump) How many documents can be fetched by the search engine for the same query. |
| pageSize (2.1 Release) | Integer | 10000 | (Index dump) How many documents to fetch per page. |
| urlField (2.1 Release) | String | displayUrl | (Index dump) Field used to store the URL in the search engine. |
| idField (2.1 Release) | String | id | (Index dump) Field used to store the id in the search engine. |
| timestampField (2.1 Release) | String | submitTS | (Index dump) The name of the timestamp field holding the index timestamp of every document. |
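
As an illustration of the broadcast and timeout elements, the following is a minimal sketch rather than a tested configuration. The component name and server host names are placeholders, and it assumes that readTimeout and connectionTimeout take plain millisecond integers in the same way that retryWait does in the Multi Server Configuration Example.

```xml
<component name="PostHTTPBroadcast" subType="default" factoryName="aspire-post-http">
  <!-- Placeholder hosts: with broadcast enabled, every document goes to each URL in turn -->
  <postUrl>http://server1:8983/solr/update;http://server2:8983/solr/update</postUrl>
  <broadcast>true</broadcast>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Assumed to be milliseconds written as plain integers -->
  <readTimeout>120000</readTimeout>
  <connectionTimeout>30000</connectionTimeout>
</component>
```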


Request Properties Configuration

 (2.1 Release)   Especially useful for setting custom or specialized security tokens before a post operation.

| Field/Attribute | Description |
| --- | --- |
| requestProperty/@name | Name of the request property. |
| requestProperty | Value of the request property. |
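
A static request property configuration might look like the sketch below. It assumes that the component configuration nests requestProperty elements inside a requestProperties parent, mirroring the structure read from the AspireObject under Dynamic Request Properties. The header name X-Example-Token and both values are hypothetical placeholders.

```xml
<component name="PostHTTP" subType="default" factoryName="aspire-post-http">
  <postUrl>http://localhost:8983/solr/update</postUrl>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  <!-- Assumed layout: same nesting as the AspireObject requestProperties structure -->
  <requestProperties>
    <requestProperty name="user-agent">aspire-post-http-example</requestProperty>
    <!-- Hypothetical security token header and value -->
    <requestProperty name="X-Example-Token">EXAMPLE_TOKEN_VALUE</requestProperty>
  </requestProperties>
</component>
```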

XSLT Transform

Example Configuration

  <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
  </component>


Example Configuration with Basic Authentication

  <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <authentication>basic</authentication>
    <username>admin</username>
    <password>pass</password>
  </component>

Multi Server Configuration Example

  <component name="PostHTTP" subType="default" factoryName="aspire-post-http">
    <postUrl>http://server1:8983/solr/update;
             http://server2:8983/solr/update;
             http://server3:8983/solr/update;
             http://server4:8983/solr/update
    </postUrl>
    <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
    <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
    <maxTries>10</maxTries>
    <retryWait>10000</retryWait>
  </component>
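
A deterministic variant of the multi-server setup could be sketched as follows, using the deterministic and idPath elements from the Configuration table. The component name and server host names are placeholders, and the idPath shown is simply the documented default.

```xml
<component name="PostHTTPDeterministic" subType="default" factoryName="aspire-post-http">
  <postUrl>http://server1:8983/solr/update;
           http://server2:8983/solr/update
  </postUrl>
  <!-- The same document id is always mapped to the same server -->
  <deterministic>true</deterministic>
  <!-- XPath of the document id; /doc/fetchUrl is the documented default -->
  <idPath>/doc/fetchUrl</idPath>
  <postXsl>config/xsl/aspireToSolr.xsl</postXsl>
  <okayResponse><![CDATA[<int name="status">0</int>]]></okayResponse>
</component>
```

This kind of setup is typically chosen when updates to a document must reach the server that already holds it, which round robin does not guarantee.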

Commit Example

Example using the <postString> method to automatically post a commit command to SOLR. This is typically used after the "WaitForSubJobs" component in the parent pipeline.

  <component name="SolrCommit" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:8983/solr/update</postUrl>
    <contentType>text/xml</contentType>
    <postString>
      <![CDATA[
        <commit/>
      ]]>
    </postString>
  </component>

Multi-Part Form Example

Useful for writing to the Google Search Appliance (GSA).

  <component name="MultipartPost" subType="default" factoryName="aspire-post-http">
    <postUrl>${gsaFeedUrl}</postUrl>
    <fixedLengthOutput>true</fixedLengthOutput>
    <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
    <multipartForm contentParam="data">
      <param name="datasource">This is the datasource value</param>
      <param name="feedtype">{XML:feedValue}</param>
    </multipartForm>
  </component>

Batching XML

All you need to do is set up the Branch Handler (Aspire 2) to use batching. All jobs that reach the stage (for example, jobs that come from a sub job extractor) will be ready to be batched when they get to PostHTTP.

Once you have set up the branch handler, set these two additional parameters on PostHTTP:

| Element | Type | Default | Description |
| --- | --- | --- | --- |
| postHeader | String | empty string | String that is written to the stream before the first document is received. This consists of the required feed headers for the target search engine or application. |
| postFooter | String | empty string | String that is written to the stream after closing the batch. This consists of the required feed footer for the target search engine or application. |

Example

Sample application XML configuration:


<?xml version="1.0" encoding="UTF-8"?>
<application name="FeedOneExample">
  
  <components>
    <component name="StandardPipeManager" subType="pipeline" factoryName="aspire-application">
      <components>
        <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />

        <component name="WaitForSubJobs" subType="waitForSubJobs" factoryName="aspire-tools"/>

        <component name="XMLSubJobExtract" subType="xmlSubJobExtractor" factoryName="aspire-xml-files">
          <branches>
            <branch event="onSubJob" pipelineManager="."
                    pipeline="subJobs-process"
                    batching="true"
                    batchSize="1000"
                    batchTimeout="1000"
                    simultaneousBatches="2" />
          </branches>
        </component>

        <component name="PostToGSA" subType="default" factoryName="aspire-post-http">
          <postUrl>${gsaFeedUrl}</postUrl>
          <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
          <okayResponse>Success</okayResponse>
          <debugOutFile>data/debug/gsa.txt</debugOutFile>
          <postHeader><![CDATA[<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "gsafeed.dtd"><gsafeed><header><datasource>Macomb_poc_feed</datasource><feedtype>incremental</feedtype></header><group action="add">]]></postHeader>
          <postFooter><![CDATA[</group></gsafeed>]]></postFooter>
          <multipartForm contentParam="data">
            <param name="datasource">Macomb_poc_feed</param>
            <param name="feedtype">incremental</param>
          </multipartForm>
        </component>
          
      </components>
      <pipelines>

        <pipeline name="doc-process" default="true">
          <stages>
            <stage component="XMLSubJobExtract" />
          </stages>
        </pipeline>
		  
        <pipeline name="subJobs-process">
          <stages>
            <stage component="PostToGSA" />			  
          </stages>
        </pipeline>
      </pipelines>
    </component>
  </components>
</application>


Dynamic Request Properties

 (2.1 Release)  

Besides setting static request properties at initialization through the component's configuration (see above), request properties can be set dynamically through the AspireObject of the incoming job.

Request properties in the AspireObject are read from the structure:

  <doc>
    <requestProperties>
      <requestProperty name="PROP_NAME">PROP_VALUE</requestProperty>
      <requestProperty name="PROP_NAME2">PROP_VALUE2</requestProperty>
      ...
    </requestProperties>
  </doc>

When working with Aspire batches, the values from the first job of the batch are the ones used to open the connection with the server.

Feed to the GSA example (configuration and XSL)

This section provides an example of a PostXml configuration and an XSL template that may be useful for feeding documents to the GSA.

Configure aspire-post-http to use the multipart form option. This will prevent the GSA from rejecting the feed because of wrong encodings. Example:

<component name="PostAddOrUpdateToGSA" subType="default" factoryName="aspire-post-http">
  <config>
    <postUrl>${gsaFeedUrl}</postUrl>
    <postXsl>config/xsl/aspireToGSA.xsl</postXsl>
    <okayResponse>Success</okayResponse>
    <debugOutFile>data/debug/gsa.out</debugOutFile>
    <multipartForm contentParam="data">
      <param name="datasource">ppp_feed</param>
      <param name="feedtype">incremental</param>
    </multipartForm>
  </config>
</component>

Notes:

Common issues (and how they are normally fixed)

  • There is a feed for each document. Is this normal? Yes, this is normal. This is the simplest feed scenario: one document per feed XML sent to the GSA. If you want more than one document per feed, check out the section above to see how to enable batching in the branch handler. There is a noticeable performance improvement when batches of documents are sent to the GSA.
  • PostHTTP returns error 401 for any feed. Check that the Aspire machine is in the List of Trusted IP Addresses under “Crawl and Index->Feeds” in GSA administration, or that Trust feeds from all IP addresses is selected.
  • GSA rejects the feed without even opening it. Check that the fed URLs match at least one expression in Start Crawling from the Following URLs under “Crawl and Index->Crawl URLs” in GSA administration.
  • GSA feed shows “Missing or invalid content” or “Content attribute not properly specified” error messages. This is likely a problem with the XSLT. Check that there are no <meta name="someField" content=""> entries in the generated feed XML (in newer versions of the GSA you can download the feed from GSA administration). These entries usually appear because the XSL is extracting a field that is empty or does not exist in the AspireDocument (AspireObject).
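
One way to avoid the empty meta entries described in the last item is to guard each field in the XSL. The stylesheet below is a minimal sketch, not the aspireToGSA.xsl shipped with any distribution: /doc/title is a hypothetical field, the mimetype value is an assumption, and the content element and all other fields are omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

  <!-- Emits one feed record per incoming AspireObject document -->
  <xsl:template match="/doc">
    <!-- mimetype value is an assumption for this sketch -->
    <record url="{fetchUrl}" mimetype="text/html">
      <metadata>
        <!-- Hypothetical field: write the meta tag only when title is present and non-blank -->
        <xsl:if test="normalize-space(title) != ''">
          <meta name="title" content="{normalize-space(title)}"/>
        </xsl:if>
      </metadata>
    </record>
  </xsl:template>
</xsl:stylesheet>
```

The same xsl:if pattern can be repeated for every field that may be missing, so that no meta element with an empty content attribute reaches the feed.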


JSON Transform

JSON transformers are Groovy scripts that use JSON builders to create JSON objects from input AspireObjects. For further information about JSON transformer syntax, see Post JSON.

Elasticsearch Indexing Example

Single document indexing (without batching).

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"ok":true]]></okayResponse>
  </component>


Elasticsearch Indexing with Basic Authentication Example

Single document indexing (without batching).

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/testindex/testtype/</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearch.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"ok":true]]></okayResponse>
    <authentication>basic</authentication>
    <username>admin</username>
    <password>pass</password>
  </component>

Elasticsearch bulk indexing

Elasticsearch bulk indexing (batching).

  <component name="PostElasticsearch" subType="default" factoryName="aspire-post-http">
    <postUrl>http://localhost:9200/_bulk</postUrl>
    <postJsonTransform>config/json/aspireToElasticsearchBulk.groovy</postJsonTransform>
    <okayResponse><![CDATA[{"took":]]></okayResponse>
  </component>