Fetch URL 0.4


Fetch URL Stage 0.4
Description: The Fetch URL stage takes a URL and opens a stream on it that downstream pipeline stages can read.
Inputs: <fetchUrl> (if it exists), otherwise <url>
Outputs: Sets the document variable 'contentStream' to an InputStream object ready to be read.

HTTP headers are also mapped automatically to output elements by the metadata mapper (see below), and a special <httpResponse> element is created for HTTP URLs.

Factory: aspire-fetch-url (previously aspire.FetchURL).
Sub Type: default
Object Type: AspireDocument

Other Notes

  • The Fetch URL stage prefixes the URL with "http://" if no URL protocol is specified.
  • Fetch URL can also open streams on file system files using the file:// protocol, which is a common use.
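
The prefixing rule in the first note can be pictured with a small Java sketch. This is an illustration only, not Aspire's own code: the class UrlPrefixSketch and the method withDefaultProtocol are hypothetical names, and the pattern used to detect an existing protocol is an assumption about what "no URL protocol specified" means.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aspire: illustrates the documented "http://" prefixing rule.
final class UrlPrefixSketch {
    // Assumed test for "has a protocol": a scheme name followed by "://" at the start of the URL.
    private static final Pattern HAS_SCHEME = Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*://");

    static String withDefaultProtocol(String url) {
        String trimmed = url.trim();
        // Leave URLs such as file://... or https://... untouched; prefix everything else.
        return HAS_SCHEME.matcher(trimmed).find() ? trimmed : "http://" + trimmed;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultProtocol("www.searchtechnologies.com"));
        System.out.println(withDefaultProtocol("file:///data/sample.txt"));
    }
}
```

The file path in the second call is a made-up example; any URL that already carries a protocol is returned unchanged.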

Other Outputs

The <httpResponse> element contains the HTTP response information when the URL uses the HTTP protocol. For example:

<httpResponse code="200" source="FetchURLStage">OK</httpResponse>

Configuration

Element | Type | Default | Description
connectionTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) for establishing a connection to the remote server.
readTimeout | int | 600000 (10 minutes) | Maximum time to wait (in ms) for reading the entire content.
enableRedirects | boolean | true | If true, automatically detect when the HTTP server redirects a URL and follow the redirect. This is believed to work only for HTTP servers that return a 3XX redirection response; it probably does not work for other kinds of redirects, such as JavaScript redirects. If this is set, the redirected URL is reported in the redirectUrl metadata field (see below).
maxBytes | int | 10485760 (= 10 MB) | Specifies the maximum number of bytes to read from the URL.
method | String | GET | The HTTP method used to send CGI parameters to the remote server, either POST or GET. This configuration element is ignored for non-HTTP connections. With POST, all query parameters are detached from the URL and submitted as the request body.
metadataMap | | see below | Standard Metadata Mapper configuration. See below.
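
To make the POST behaviour described for the method element concrete, here is a minimal Java sketch of detaching the query parameters from a URL. It is not taken from the Aspire source: PostSplitSketch and split are hypothetical names, the URL is made up, and details such as fragments or re-encoding of parameters are ignored.

```java
// Hypothetical illustration of the documented POST behaviour; not Aspire's implementation.
final class PostSplitSketch {
    /** Returns {requestUrl, requestBody}; the body is null when nothing is detached. */
    static String[] split(String url, String method) {
        int q = url.indexOf('?');
        if (!"POST".equalsIgnoreCase(method) || q < 0) {
            // GET (or no query string): the URL is used as-is and there is no body.
            return new String[] { url, null };
        }
        // POST: everything after '?' becomes the request body.
        return new String[] { url.substring(0, q), url.substring(q + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("http://host/search?q=aspire&rows=10", "POST");
        System.out.println("URL:  " + parts[0]);
        System.out.println("Body: " + parts[1]);
    }
}
```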


Metadata Mapper Configuration

The Fetch URL stage exposes a number of additional metadata fields that can be mapped to fields in the AspireDocument XML.


Field | Default Output Field | Description
protocol | protocol | The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com")
host | host | The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com")
mimeType | mimeType | The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding | encoding | The content encoding as returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "UTF-8"
expirationDate | expirationDate | The expiration date reported by the HTTP server in the "expires" http header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate | modificationDate | The modification date reported by the HTTP server in the "last-modified" http header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl | redirectUrl | If the HTTP server reported a 3XX code and the URL was automatically redirected to another URL, this element provides the new URL.
status | - | The HTTP response status message. For example, "HTTP/1.1 200 OK".
all other HTTP headers | - | Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area.
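
Both mimeType and encoding are described above as coming from the Content-Type header. The following Java sketch shows one plausible way such a header value could be split into the two fields. It is an assumption for illustration, not the stage's actual parsing logic, and ContentTypeSketch and its methods are hypothetical names.

```java
import java.util.Locale;

// Hypothetical illustration: splits a Content-Type value into a mime type and a charset.
final class ContentTypeSketch {
    /** The part before the first ';', trimmed and lower-cased. */
    static String mimeType(String contentType) {
        int semi = contentType.indexOf(';');
        String type = semi < 0 ? contentType : contentType.substring(0, semi);
        return type.trim().toLowerCase(Locale.ROOT);
    }

    /** The value of the charset parameter, or null if the header has none. */
    static String encoding(String contentType) {
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                return p.substring(8).trim().replace("\"", "");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";
        System.out.println(mimeType(header));
        System.out.println(encoding(header));
    }
}
```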

Example Configurations

Simple

 <component name="fetchUrl" subType="default" factoryName="aspire-fetch-url" />

Complex

 <component name="fetchUrl" subType="default" factoryName="aspire-fetch-url">
   <config>
     <connectionTimeout>1000</connectionTimeout>
     <maxBytes>1000000</maxBytes>
     <!-- note that all of the default mappings are included automatically -->
     <metadataMap>
       <map from="Cache-Control" to="cacheControl"/>
       <map from="Server" to="server"/>
       <map from="Set-Cookie" to="cookieValue"/>
     </metadataMap>
   </config>
 </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse> 
  <protocol source="FetchURLStage/protocol">http</protocol> 
  <host source="FetchURLStage/host">www.searchtechnologies.com</host> 
  <mimeType source="FetchURLStage/mimeType">text/html</mimeType> 
  <encoding source="FetchURLStage/encoding">utf-8</encoding> 
  <extension source="FetchURLStage">
    <field name="status">HTTP/1.1 200 OK</field> 
    <field name="Date">Wed, 02 Dec 2009 15:05:24 GMT</field> 
    <field name="Server">Microsoft-IIS/6.0</field> 
    <field name="X-Powered-By">ASP.NET</field> 
    <field name="X-AspNet-Version">2.0.50727</field> 
    <field name="Set-Cookie">ASP.NET_SessionId=vkprqxru0k2gjy455o1j31u3; path=/; HttpOnly</field> 
    <field name="Cache-Control">private</field> 
    <field name="Content-Type">text/html; charset=utf-8</field> 
    <field name="Content-Length">9584</field> 
  </extension>
  .
  .
  .
</doc>

Note: The actual document content is sent down the pipeline as a Java InputStream, which can be retrieved from the AspireDocument object by calling getObject("contentStream").
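
A downstream stage would typically drain and close that stream. The sketch below shows the reading part using only standard Java I/O. It assumes the object returned by getObject("contentStream") can be cast to java.io.InputStream; because the Aspire classes are not available here, a ByteArrayInputStream stands in for the real stream, and ContentStreamSketch and readAll are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a content stream fully into a String and closes it.
final class ContentStreamSketch {
    static String readAll(InputStream in, Charset charset) {
        try (InputStream stream = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            // read() returns -1 at end of stream.
            while ((n = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the stream a stage would obtain from the document (assumption).
        InputStream standIn = new ByteArrayInputStream(
                "<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(standIn, StandardCharsets.UTF_8));
    }
}
```

The charset passed to readAll would normally be chosen from the encoding element shown in the example output, when one is present.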

Fetching via https://

If you're fetching files via https://, you may encounter issues if the certificate the server is using is not signed by a certificate authority that the JVM trusts (for example, a self-signed certificate).

Typically you'll see an exception such as:

 AspireException(aspire.FetchURLStage.other-connect-error): com.searchtechnologies.aspire.services.AspireException: Unable to open connection to URL "https://server:8443/path/file". (component='/fastProxyServer/queryPipeManager/queryFast', componentFactory='aspire-fetch-url')
       at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:284)
       at com.searchtechnologies.aspire.application.JobHandler.runNested(JobHandler.java:114)
       at com.searchtechnologies.aspire.application.JobHandler.run(JobHandler.java:52)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
       at java.lang.Thread.run(Unknown Source)
 Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
       at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
       at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(Unknown Source)
       at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
       at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
       at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(Unknown Source)
       at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(Unknown Source)
       at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Unknown Source)
       at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Unknown Source)
       at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
       at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
       at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
       at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
       at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
       at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
       at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
       at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:185)
       ... 5 more
 Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
       at sun.security.validator.PKIXValidator.doBuild(Unknown Source)
       at sun.security.validator.PKIXValidator.engineValidate(Unknown Source)
       at sun.security.validator.Validator.validate(Unknown Source)
       at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.validate(Unknown Source)
       at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
       at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
       ... 17 more
 Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
       at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(Unknown Source)
       at java.security.cert.CertPathBuilder.build(Unknown Source)
       ... 23 more

To fetch these pages, import the offending server's certificate into a keystore and then configure Aspire to use that keystore.

Using a web browser, export the certificate (the steps below are for Internet Explorer):

   Connect to https://my.domain.com
   Go to Tools > Internet Options > Content > Certificates > Intermediate Certification Authorities [or "Trusted Root Certification Authorities"]
   Choose whichever certificate is needed
   Click “Export…”, then “Next>”
   Select “DER encoded binary X.509 (.CER)”
   Name the file myDomain.cer [change the name as applicable]
   Select “Finish”

Next, install the certificate:

keytool -import -alias myDomain -file myDomain.cer -trustcacerts -keystore \path\myKeystore

Then configure Felix to use the keystore by adding the following to the java command line:

 -Djavax.net.ssl.trustStore=C:\path\myKeystore

For example:

 java -Djavax.net.ssl.trustStore=C:\path\myKeystore -Xmx250m -Xms250m %FELIX_CONFIG_PROP% "%ASPIRE_HOME_PROP%" -jar bin\felix.jar
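
Before restarting Aspire, it can help to confirm that the keystore really contains the imported alias. The following Java sketch lists the aliases in a keystore file using the standard java.security.KeyStore API. TrustStoreCheckSketch is a hypothetical class, not part of Aspire, and it assumes the keystore can be read with the JVM's default keystore type; if the keystore was created with a different type, that type would have to be passed to KeyStore.getInstance instead.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.util.Collections;
import java.util.List;

// Hypothetical check: lists the aliases stored in a keystore file.
final class TrustStoreCheckSketch {
    static List<String> aliases(Path storePath, char[] password) {
        try (InputStream in = Files.newInputStream(storePath)) {
            // Assumption: the file is readable with the JVM's default keystore type.
            KeyStore store = KeyStore.getInstance(KeyStore.getDefaultType());
            store.load(in, password);
            return Collections.list(store.aliases());
        } catch (Exception e) {
            throw new IllegalStateException("Cannot read keystore " + storePath, e);
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: TrustStoreCheckSketch <keystore path> <password>");
            return;
        }
        // Prints one alias per line, for example the alias given to keytool -alias.
        for (String alias : aliases(Paths.get(args[0]), args[1].toCharArray())) {
            System.out.println(alias);
        }
    }
}
```

Run it with the same path that is passed to -Djavax.net.ssl.trustStore and the password chosen when the keystore was created.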