HTTP Feeder




HTTP Feeder
Factory  aspire-http-feeder
subType  default
Inputs  RESTful requests in standard URL query string format (name=value pairs).
Outputs  AspireObjects containing HTTP Request data, including all name=value pairs from the query string.

Use the HTTP Feeder to receive RESTful requests and feed them to an Aspire pipeline. This feeder can turn Aspire into a "RESTful Web Service" that accepts requests from outside clients, processes jobs, and returns results.

The HTTP feeder will register a brand new servlet URL, based on the Aspire server path. For example, if your servletName is "submitFiles", then the new URL will be http://server:50505/submitFiles. In other words, it is separate and apart from the standard Aspire admin user interface (which is under "/aspire").

There are two modes of operation for the HTTP Feeder: 1) input parameters specified on the URL, and 2) input data POSTed to the feeder. In the first mode, the input parameters are added to the AspireObject that is fed down the pipeline. In the second mode, the POSTed data is either a set of form parameters, which are added to the AspireObject fed down the pipeline, or data streamed to the servlet, which is attached to the published job as a stream.

The HTTP Feeder can also be used to upload files, using a multipart form submission. See below for details.

Using the HTTP Feeder as a User Interface

The HTTP Feeder can be used as a user interface. Click here for instructions on how to do this.

Parameters Specified on the URL

In the first mode, parameters are specified on the URL in param=value format. For example: http://server:50505/submitFiles?param1=value1&param2=value2 .

These parameters will be stored in the resulting AspireDocument passed down the pipeline as XML tags at the top level. For example:

 <doc>
   <feederLabel>HttpFeeder</feederLabel>
   <param1 source="HTTPFeederServlet">value1</param1>
   <param2 source="HTTPFeederServlet">value2</param2>
 </doc>

The pipeline is then responsible for processing the job as necessary (for example, via Groovy scripting). The results are returned as XML data.
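
As an illustration of the request side, the following minimal Python sketch builds a feeder URL from a set of name=value pairs. The helper name `build_feeder_url` is hypothetical, and the host, port, and servlet name are simply the ones used in the example above; only standard-library calls are used.

```python
from urllib.parse import urlencode


def build_feeder_url(base_url, servlet_name, params):
    """Return the servlet URL with params appended as a query string.

    Hypothetical helper for illustration; it is not part of Aspire.
    """
    url = f"{base_url.rstrip('/')}/{servlet_name}"
    # urlencode percent-encodes names and values and joins them with '&'.
    query = urlencode(params)
    return f"{url}?{query}" if query else url


url = build_feeder_url(
    "http://server:50505",
    "submitFiles",
    {"param1": "value1", "param2": "value2"},
)
print(url)
```

Each name in the dictionary would then be expected to appear as a top-level tag in the document passed down the pipeline, as in the XML sample above.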

Information from the Servlet

Information from the servlet is also added to the job published by the HTTPFeeder. It is added as attributes (plus a <queryString> child element) on the <aspireHttpFeederServlet> tag:

 <doc>
   <aspireHttpFeederServlet remotePort="52124" relativePath="/xml-search" serverName="localhost" source="HTTPFeederServlet"
      remoteHost="127.0.0.1" serverPort="50505" remoteAddr="127.0.0.1" fullPath="/cgi-bin/xml-search" servletPath="/cgi-bin">
     <queryString>param1=value1&param2=value2</queryString>
   </aspireHttpFeederServlet>
   .
   .
 </doc>

The following information is available:

Attribute Description
source The name of the HttpFeeder
remoteHost The hostname of the client (e.g., browser).
remoteAddr The IP address of the client (e.g., browser).
remotePort The port used by the client (e.g., browser).
serverName The name of the server running the HttpFeeder.
serverPort The port the HttpFeeder is listening on.
servletPath The path the HttpFeeder is responding to.
fullPath The full path requested by the client.
relativePath The path requested by the client relative to the servletPath.
queryString (1.1 Release) The entire query string (i.e., everything after the ? in the URL).
maxUploadSize (1.2 Release) The maximum size of an uploaded file, in bytes. Defaults to 10,485,760 bytes (10 MB). Starting in the 2.0 release, the value may carry a suffix specifying bytes, kilobytes, megabytes, or gigabytes (b/kb/mb/gb). If no suffix is given, the value is in bytes.
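
To make the suffix notation concrete, here is a small Python sketch of how such a size value could be interpreted. It is not the feeder's own parser: the function name `parse_upload_size` is hypothetical, and the use of 1024-based multipliers is an assumption, chosen because the documented default of 10,485,760 bytes equals 10 × 1024 × 1024.

```python
import re

# Assumed multipliers (1024-based), consistent with the 10,485,760-byte default.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def parse_upload_size(value):
    """Convert a size such as '10mb' or '2048' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*(b|kb|mb|gb)?\s*", value.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, suffix = match.groups()
    # A missing suffix means the value is already in bytes.
    return int(number) * _UNITS[suffix or "b"]


print(parse_upload_size("10mb"))
```

Under these assumptions, "10mb" and "10485760" describe the same limit.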

XML Data POSTed to the Service

If you wish to post data to the service, set the XMLContent configuration parameter to true (see Configuration below).

Despite its name, XMLContent does not actually require that the content be in XML. The content can be HTML, PDF, or anything. Perhaps the config parameter will be renamed in the future.

When XMLContent is true, data streamed to the servlet via POST will be set as an input stream attached to the job published by the feeder. You can access the data using the Standards.Basic.getContentStream(Job j) method in the package com.searchtechnologies.aspire.framework.

This also means that you can follow the HTTP feeder with any pipeline stage that uses the content stream. For example, XML Sub Job Extractor, Tabular Files Extractor, XML Loader, and Extract Text can all be the first pipeline stage to receive the job.

The "curl" command (available with Cygwin, http://www.cygwin.com, or on most Linux installs) is a convenient way to test submitting data to the service. For example, to POST a document as the content to an Aspire servlet, you could do the following:

 curl --data-binary "@data\full_text.xml" http://localhost:50505/submitFiles
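
The same kind of POST can be sketched from Python with the standard library. This is only an illustration: the helper name `build_post_request`, the sample payload, and the Content-Type value are assumptions, and the host and port should be replaced with those of your own Aspire server. The request is built but not sent, so the sketch runs without a server.

```python
import urllib.request


def build_post_request(url, payload, content_type="application/octet-stream"):
    """Build (but do not send) a POST request carrying payload as its body."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": content_type},
    )


request = build_post_request(
    "http://localhost:50505/submitFiles",  # assumed host, port, and servlet name
    b"<doc><title>Example</title></doc>",  # hypothetical payload
    "text/xml",
)
print(request.get_method(), request.full_url)

# To actually send it to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```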

Multipart Form Submissions

HTML supports submitting "multipart forms" made up of multiple parameters, some of which may represent uploaded file content.

In order for the HTTP feeder to receive multipart forms, you need to enable them and then specify how files are handled. You may choose to handle posted files as a stream (choose stream for the <fileHandler> option), or as files (choose file for the <fileHandler> option). If you choose to handle posted files as files, you must also specify the directory they are uploaded to.

NOTE: Setting the XMLContent option of the HttpFeeder automatically disables multipart form submission processing.

Stream Handler

When the file handler is set to stream, only a single file may be uploaded at a time. All parameters received BEFORE the file are added to the job's AspireObject as XML tags. Parameters received AFTER the file are ignored. The file itself is attached to the job as an InputStream, and subsequent stages can access the data using the Standards.Basic.getContentStream(Job j) method in the package com.searchtechnologies.aspire.framework. Data can therefore be streamed directly from the client through whatever processing you need to do. The file is NOT stored locally on the Aspire server by the HttpFeeder.

Example configuration
 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   .
   .
   <multipartForm>
     <fileHandler>stream</fileHandler>
   </multipartForm>
 </component>
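
Because the stream handler ignores parameters that arrive after the file, a client has to place its ordinary fields ahead of the file part. The Python sketch below shows one way to assemble such a multipart body by hand. The function name `encode_multipart`, the field names, and the `application/octet-stream` part type are assumptions made for illustration; they are not taken from Aspire.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes, boundary=None):
    """Return (content_type, body) with all plain fields BEFORE the file part."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    # Ordinary parameters go first so that they precede the file in the stream.
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The single file part goes last.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(parts)


content_type, body = encode_multipart(
    {"title": "Example"}, "data", "full_text.xml", b"<doc/>", boundary="demo"
)
print(content_type)
```

The returned content type and body can be sent with any HTTP client as the Content-Type header and the request body of a POST.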

File Handler

When the file handler is set to file, multiple files may be uploaded by a single form submission. Using the file handler requires the HttpFeeder <uploadDir> to be configured. Any file submitted will be uploaded and saved to this directory. The uploaded file is saved using its original filename (filename only, not the complete path).

No streams are added to the Aspire job, and if you wish to reference the file, you will need to access the job's AspireObject and extract the value for the tag corresponding to the HTML form input that caused the file to be uploaded. This value is the full path to the saved copy of the uploaded file on the Aspire server.

For example, if the file was uploaded via the following form:

 <form enctype="multipart/form-data" method=POST  action="http://localhost:50505/xmlfeed">
   XML file to push:
   <input type="file" name="data">
   <input type="submit" value=">Submit<">
 </form>

The AspireObject for the job would look similar to:

 <doc>
   <aspireHttpFeederServlet remotePort="56494" serverName="localhost" source="HTTPFeederServlet" remoteHost="127.0.0.1" serverPort="50505" remoteAddr="127.0.0.1" fullPath="/xmlfeed" servletPath="/xmlfeed">
     <queryString/>
   </aspireHttpFeederServlet>
   <data>C:\tmp\1.2distroTest\distro-test\target\aspire-distribution-1.0-distribution/data/upload\htmlContentFeed.xml</data>
 </doc>

All ordinary HTML form input parameters will be added to the job's AspireObject as XML tags.
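
Inside Aspire, a stage would read this value through the job's AspireObject. As a language-neutral illustration of the lookup, the Python sketch below works on the XML serialization of the document instead. The function name `uploaded_file_path`, the sample document, and the path in it are all hypothetical.

```python
import xml.etree.ElementTree as ET


def uploaded_file_path(aspire_xml, input_name):
    """Return the saved path stored under the tag named after the form input."""
    root = ET.fromstring(aspire_xml)
    element = root.find(input_name)  # direct child of the root element
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


# Hypothetical document for a form whose file input is named "data".
sample = """<doc>
  <aspireHttpFeederServlet serverName="localhost" serverPort="50505" servletPath="/xmlfeed">
    <queryString/>
  </aspireHttpFeederServlet>
  <data>/opt/aspire/data/upload/htmlContentFeed.xml</data>
</doc>"""

print(uploaded_file_path(sample, "data"))
```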

Example configuration
 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   .
   .
   <multipartForm>
     <fileHandler>file</fileHandler>
     <uploadDir>data/upload</uploadDir>
   </multipartForm>
 </component>

Configuration

Element Type Default Description
branches parent tag None The configuration of the pipeline to publish to. See below.
waitForJob boolean true Indicates whether the component waits for the job to complete.
servletName String httpFeeder Name of the servlet that will feed the files. For example, if servletName is "submitFiles", then you would send files to the httpFeeder using the "http://localhost:50505/submitFiles?params..." URL.
feederLabel String HttpFeeder The <feederLabel> value to be included with the document as it is sent to the pipeline. For example, HttpFeeder.
XMLContent boolean true Set this parameter to true if you will be POST-ing XML data to the HTTP Feeder. This XML data will be set as an input stream attached to the job published by the feeder. Subsequent stages can access the data using the Standards.Basic.getContentStream(Job j) method in the package com.searchtechnologies.aspire.framework.
xmlRootName String doc The name of the root element, for example <root>. This will be the root element of the AspireDocument object which is passed down the pipeline.
xsltFileName String null The path of the XSL transform file to be used to format the output xml. Path names will be relative to Aspire Home.
outputMime String text/xml Specifies the mime type which the HTTP feeder will report back to the HTTP client. Change this to "text/html" if your transform creates HTML which should be shown by a browser.
resultMimeTypeField String   Set the mime type using the value found in the specified field. The field must exist as a child of the root (i.e., a parameter value of mimeType looks for the value in the /doc/mimeType field of the default AspireObject). If the field does not exist or is empty, the mime type reverts to the value of the <outputMime> parameter. NOTE: The value is extracted before the transformation (if any) is applied.
multipartForm parent tag   Enable multi-part form submission, which allows for uploading files to the HTTP server through HTML forms, as well as other input elements.
multipartForm/fileHandler String stream Specify the type of file handler to use for posted files. The stream (default) handler attaches the file to the job as an InputStream, and subsequent stages can access the data using the Standards.Basic.getContentStream(Job j) method in the package com.searchtechnologies.aspire.framework. The file handler uploads the file to the specified directory (see below). No input stream is attached to the job for the file handler. See above for more details and restrictions.
multipartForm/uploadDir String   Specify the location where files from multi-part forms will be uploaded when using the file handler. See above for more details.
saxonProcessor boolean false Set to true to use the Saxon processor for transforms that use XSLT 2.0 files.
debugOutFile String   Specify the location where the XSLT processed output will be written to. This is used for debugging the transforms.
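
The interaction between outputMime and resultMimeTypeField described in the table can be sketched as follows. This Python function only mirrors the documented fallback rule on a serialized document; the name `resolve_mime_type` is hypothetical and the code is not the feeder's implementation.

```python
import xml.etree.ElementTree as ET


def resolve_mime_type(doc_xml, output_mime="text/xml", result_mime_type_field=None):
    """Pick the response mime type, falling back to output_mime."""
    if result_mime_type_field:
        # The field is looked up as a direct child of the root element.
        field = ET.fromstring(doc_xml).find(result_mime_type_field)
        value = (field.text or "").strip() if field is not None else ""
        if value:
            return value
    # Field not configured, missing, or empty: use the <outputMime> value.
    return output_mime


print(resolve_mime_type("<doc><mimeType>text/html</mimeType></doc>",
                        "text/xml", "mimeType"))
```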

Example Configurations for HTML Form-Style Parameters

This will handle either parameters specified on the URL with HTTP GET, or parameters POST'ed from an HTML <form>.

 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   <servletName>submitFiles</servletName>
   <feederLabel>HttpFeeder</feederLabel>
   <xsltFileName>config/categorizeOutput.xsl</xsltFileName>
   <branches>
     <branch event="onPublish" pipelineManager="CategorizeFolderOrFile" />
   </branches> 
 </component>

Example configuration for posting XML to Aspire

 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   <servletName>submitFiles</servletName>
   <feederLabel>HttpFeeder</feederLabel>
   <XMLContent>true</XMLContent>
   <xsltFileName>config/extractor.xsl</xsltFileName>
   <branches>
     <branch event="onPublish" pipelineManager="CategorizeFolderOrFile" />
   </branches> 
 </component>

Serving Files

The HTTPFeeder can also serve ordinary HTML files, so it can act as a more complete, end-to-end solution for simple user interfaces.

Files are stored inside the Aspire Home directory, in the "web/httpfeeder/<servlet-name>" directory.

For example, a request for test.html on a servlet named submitFiles will access the file from:

  • $ASPIRE_HOME/web/httpfeeder/submitFiles/test.html

Note that "index.html" is also supported. If it exists, a request for the servlet's index page will return:

  • $ASPIRE_HOME/web/httpfeeder/submitFiles/index.html
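
The directory layout above can be expressed as a small path-mapping sketch in Python. The function name `served_file_path` and the sample Aspire Home of /opt/aspire are hypothetical, and treating a request with no file name as a request for index.html is an assumption about how the index page is selected.

```python
from pathlib import PurePosixPath


def served_file_path(aspire_home, servlet_name, relative_path=""):
    """Map a request path under a servlet to the expected file on disk."""
    # Assumption: a request that names no file falls back to index.html.
    name = relative_path.strip("/") or "index.html"
    return str(PurePosixPath(aspire_home) / "web" / "httpfeeder" / servlet_name / name)


print(served_file_path("/opt/aspire", "submitFiles", "/test.html"))
print(served_file_path("/opt/aspire", "submitFiles"))
```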

NOTE: if porting from version 0.4, note that the position of the required directory on disk has changed from web/<servlet-name> to web/httpfeeder/<servlet-name>.

Returning Binary Data

Raw binary data can be returned from the HTTPFeeder. This will happen automatically if the following conditions are met:

  • The output mime type is "application/octet-stream"
    • This can be set with either the <outputMime> or <resultMimeTypeField> configuration parameters.
  • There is a job variable called "byteDataResults"

Note that the job variable must (currently) hold data of type ByteArrayOutputStream.

If the above situation occurs, the HTTPFeeder will do the following:

  1. Fetch the array of bytes from the ByteArrayOutputStream
  2. Set the returned content-length to the length of the array of bytes
  3. Write the byte data back to the client
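
The conditions and steps above can be summarized in a short Python sketch. It is an analogy, not Aspire code: `io.BytesIO` stands in for Java's ByteArrayOutputStream, a plain dictionary stands in for the job variables, and the function name `build_binary_response` and the Content-Type header are assumptions.

```python
import io


def build_binary_response(output_mime, job_variables):
    """Return (headers, body) for a binary reply, or None if conditions fail."""
    results = job_variables.get("byteDataResults")
    # Both conditions must hold: the mime type and a byte-buffer job variable.
    if output_mime != "application/octet-stream" or not isinstance(results, io.BytesIO):
        return None
    data = results.getvalue()  # step 1: fetch the array of bytes
    headers = {
        "Content-Type": output_mime,
        "Content-Length": str(len(data)),  # step 2: content-length = byte count
    }
    return headers, data  # step 3: the body to write back to the client


buffer = io.BytesIO()
buffer.write(b"\x00\x01\x02")
response = build_binary_response("application/octet-stream",
                                 {"byteDataResults": buffer})
print(response)
```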