HTTP Feeder 0.4

From wiki.searchtechnologies.com

Http Feeder 0.4
Description: Feeds a single URL down the pipeline in response to an HTTP request.
Inputs: A URL with the file/directory to feed.
Outputs: The file as an XML representation in the Aspire Document.
Factory: aspire-http-feeder (previously aspire.Tools)
Sub Type: default
Object Type: Produces AspireDocument objects.

Use the HTTP feeder to accept jobs from the outside world and then feed these jobs down an Aspire pipeline. This feeder can turn Aspire into a "RESTful Web Service", accepting requests from outside clients, processing jobs, and then returning results.

The HTTP feeder will register a brand new servlet URL, based on the Aspire server path. For example, if your servletName is "submitFiles", then the new URL will be http://server:50505/submitFiles. In other words, it is separate and apart from the standard Aspire admin user interface (which is under "/aspire").

There are two modes of operation for the HTTP Feeder: 1) input parameters specified on the URL, and 2) input data POSTed to the feeder as XML.

Using the HTTP Feeder as a User Interface

The HTTP Feeder can be used as a user interface.

Parameters Specified on the URL

In the first mode, parameters are specified on the URL in param=value format. For example: http://server:50505/submitFiles?param1=value1&param2=value2

These parameters will be stored in the resulting AspireDocument passed down the pipeline as XML tags at the top level. For example:

 <doc>
   <feederLabel>HttpFeeder</feederLabel>
   <param1 source="HTTPFeederServlet">value1</param1>
   <param2 source="HTTPFeederServlet">value2</param2>
 </doc>

The pipeline is then responsible for processing the job as necessary (for example, via Groovy scripting). The results are returned as XML data.
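
As a rough client-side illustration of this mode, the sketch below builds the request URL from a dictionary of parameters and reads the parameter tags back out of a document shaped like the example above. The helper names build_feeder_url and params_from_doc are hypothetical, and the sample XML is simply the example from this page, not output captured from a running server.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_feeder_url(base_url, servlet_name, params):
    """Return the feeder URL with params encoded as a query string (hypothetical helper)."""
    query = urlencode(params)
    return f"{base_url.rstrip('/')}/{servlet_name}?{query}"


def params_from_doc(doc_xml):
    """Collect the top-level tags marked with source="HTTPFeederServlet"."""
    root = ET.fromstring(doc_xml)
    return {child.tag: child.text
            for child in root
            if child.get("source") == "HTTPFeederServlet"}


url = build_feeder_url("http://server:50505", "submitFiles",
                       {"param1": "value1", "param2": "value2"})
print(url)

# Sample document copied from the example above (not real server output).
sample_doc = """<doc>
  <feederLabel>HttpFeeder</feederLabel>
  <param1 source="HTTPFeederServlet">value1</param1>
  <param2 source="HTTPFeederServlet">value2</param2>
</doc>"""
print(params_from_doc(sample_doc))
```

Using urlencode rather than string concatenation keeps parameter values with spaces or reserved characters valid in the query string.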

XML Data POSTed to the Service

To post data to the service, set the "XMLContent" configuration parameter to true (see Configuration below).

Despite its name, XMLContent does not actually require that the content be in XML. The content can be HTML, PDF, or anything. Perhaps the config parameter will be renamed in the future.

When XMLContent is true, data streamed to the servlet via POST will be set as an input stream on the AspireDocument object. This means that you can access the data via doc.getContentStream().

This also means that you can follow the HTTP feeder with any pipeline stage that uses the content stream. For example, XML Sub Job Extractor, Tabular Files Extractor, XML Loader, and Extract Text can all be the first pipeline stage to receive the job.

The "curl" command (available with http://www.cygwin.com or on most Linux installs) is a convenient way to test submitting data to the service. For example, to POST a document as the content to an Aspire servlet, you could do the following:

 curl -d "@data/full_text.xml" http://localhost:50505/submitFiles
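
The same kind of POST can be sketched from a script. The example below is a minimal sketch, assuming the servlet is named "submitFiles" and listens on port 50505 as in the other examples on this page; the text/xml Content-Type header is an assumption, since this page does not state which content type the feeder expects. The helper names are hypothetical. The request is only built here, not sent.

```python
import urllib.request


def build_post_request(url, payload, content_type="text/xml"):
    """Build (but do not send) a POST request whose body is the raw payload bytes.

    The default content type is an assumption, not a documented requirement.
    """
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def submit(request, timeout=30):
    """Send the request and return the response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


payload = b"<doc><title>Example</title></doc>"
request = build_post_request("http://localhost:50505/submitFiles", payload)
print(request.get_method(), request.full_url)

# To send it against a running Aspire instance:
# print(submit(request).decode("utf-8"))
```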


Multipart Form Submissions

HTML supports submitting "multi-part forms" made up of multiple parameters, some of which may represent uploaded file content.

In order for the HTTP feeder to receive multi-part forms, you need to enable them and then specify how files are handled:

 <config>
   .
   .
   .
   <multipartForm>
     <fileHandler>file</fileHandler>
     <uploadDir>data/upload</uploadDir>
   </multipartForm>
 </config>

In the above example, all ordinary HTML form input parameters will be added to the Aspire Document as XML tags, in the same way as parameters specified on the URL.

Files received by the HTTP feeder will be automatically uploaded to the "data/upload" directory (all relative-path directories are relative to Aspire Home). Multiple files can be specified as part of the multipart form.

The other type of fileHandler available is "stream":

 <fileHandler>stream</fileHandler>

With this handler, only a single file may be uploaded at a time. Also, all parameters which are received AFTER the file are ignored. The advantage of the stream handler is that a stream to the input file is placed on the AspireDocument, and so data can be streamed directly from the client through whatever processing you need to do.
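
Because the stream handler ignores parameters received after the file, a client should send its ordinary fields first and the file part last. The sketch below hand-builds a multipart/form-data body in that order. It is an illustration only: the function name encode_multipart and the field names "category" and "upload" are hypothetical, and nothing here has been checked against a running HTTP feeder.

```python
import uuid


def encode_multipart(fields, file_field, filename, file_bytes,
                     file_type="application/octet-stream"):
    """Return (body, content_type) for a multipart/form-data request.

    Ordinary fields are written first and the single file part last,
    so that no parameter arrives after the file.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}".encode())
        lines.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        lines.append(b"")
        lines.append(str(value).encode("utf-8"))
    # The file part goes last.
    lines.append(f"--{boundary}".encode())
    lines.append(
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"'.encode())
    lines.append(f"Content-Type: {file_type}".encode())
    lines.append(b"")
    lines.append(file_bytes)
    # Closing boundary, followed by a final CRLF.
    lines.append(f"--{boundary}--".encode())
    lines.append(b"")
    body = b"\r\n".join(lines)
    return body, f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"category": "reports"}, "upload", "full_text.xml", b"<doc/>")
print(content_type)
```

The returned content_type string would be sent as the Content-Type header of the POST, with body as the request data.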

Configuration

Element | Type | Default | Description
branches | parent tag | None | The configuration of the pipeline to publish to. See below.
waitForJob | boolean | true | Indicates whether the component waits for the job to complete.
servletName | String | httpFeeder | Name of the servlet that will feed the files. For example, if servletName is "submitFiles", then you would send files to the HTTP feeder using the "http://localhost:50505/submitFiles?params..." URL.
feederLabel | String | HttpFeeder | The <feederLabel> value to be included with the document as it is sent to the pipeline. For example, HttpFeeder.
XMLContent | boolean | true | Set this parameter to "true" if you will be POSTing XML data to the HTTP Feeder. This data will be set as the content stream on the AspireDocument (see above).
xmlRootName | String | doc | The name of the root element, for example <root>. This will be the root element of the AspireDocument object which is passed down the pipeline.
xsltFileName | String | null | The path of the XSL transform file used to format the output XML. Path names are relative to Aspire Home.
outputMime | String | text/xml | Specifies the MIME type which the HTTP feeder will report back to the HTTP client. Change this to "text/html" if your transform creates HTML which should be shown by a browser.
multipartForm | parent tag | | Enables multi-part form submission, which allows for uploading files to the HTTP server through HTML forms, as well as other input elements.
multipartForm/fileHandler | String | stream | Specifies the type of file handler to use for uploaded files. The "file" handler will upload the file to the specified directory (see below). The "stream" handler will attach an InputStream for the uploaded file to the job. See above for more details and restrictions.
multipartForm/uploadDir | String | | Specifies the location where files from multi-part forms will be uploaded. Only used for the "file" handler (although a bug requires that it be specified for the "stream" handler too, even though it is not used). See above for more details.
saxonProcessor | boolean | false | Set to true to use the Saxon processor for transforms that use XSLT 2.0 files.
debugOutFile | String | | Specifies the location where the XSLT-processed output will be written. This is used for debugging the transforms.

Example Configurations for HTML Form-Style Parameters

This will handle either parameters specified on the URL with HTTP GET, or parameters POSTed from an HTML <form>.

 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   <config>
     <servletName>submitFiles</servletName>
     <feederLabel>HttpFeeder</feederLabel>
     <xsltFileName>config/categorizeOutput.xsl</xsltFileName>
     <branches>
       <branch event="onPublish" pipelineManager="CategorizeFolderOrFile" />
     </branches>
   </config>
 </component>

Example Configuration for Posting XML to Aspire

 <component name="MyHTTPFeeder" factoryName="aspire-http-feeder" subType="default">
   <config>
     <servletName>submitFiles</servletName>
     <feederLabel>HttpFeeder</feederLabel>
     <XMLContent>true</XMLContent>
     <xsltFileName>config/extractor.xsl</xsltFileName>
     <branches>
       <branch event="onPublish" pipelineManager="CategorizeFolderOrFile" />
     </branches>
   </config>
 </component>

Serving Files

The HTTP Feeder can also serve ordinary HTML files, so it can act as a more complete, end-to-end solution for simple user interfaces.

Files are stored inside the Aspire Home directory, in the "web/<servlet-name>" directory.

For example, a request for:

Will access the file from:

  • $ASPIRE_HOME/web/httpfeeder/submitFiles/test.html

Note that “index.html” is also supported. So, a request for:

Will return:

  • $ASPIRE_HOME/web/submitFiles/index.html

If it exists.