Fetch URL (Aspire 2)

For information on Aspire 3.1, see the Aspire 3.1 documentation.

Fetch URL (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-fetch-url
subType  default
Inputs  <fetchUrl> (if it exists) or <url>
Outputs  Sets the job variable 'contentStream' to an InputStream object ready to be read. HTTP headers are mapped to output elements using the metadata mapper (see below), and an <httpResponse> element is also created for HTTP URLs.

The Fetch URL stage opens an InputStream to the given URL, which downstream pipeline stages can then read.

  • The Fetch URL stage prefixes the URL with "http://" if no protocol is specified.
  • Fetch URL can also open streams on file system files using the file:// protocol, which is a common use.
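
As an illustration of the accepted inputs, here is a minimal sketch of three input documents. The <doc> wrapper follows the example output on this page; the file path is hypothetical.

```xml
<!-- URL with an explicit protocol: fetched as given -->
<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl>
</doc>

<!-- No protocol and no <fetchUrl>: the <url> element is used, prefixed with "http://" -->
<doc>
  <url>www.searchtechnologies.com</url>
</doc>

<!-- Local file opened through the file:// protocol (hypothetical path) -->
<doc>
  <fetchUrl>file:///C:/data/sample.txt</fetchUrl>
</doc>
```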

Other Outputs

The <httpResponse> element contains the HTTP response information if the URL used the "http://" protocol. For example:

<httpResponse code="200" source="FetchURLStage">OK</httpResponse>

Configuration

Element Type Default Description
connectionTimeout int 600000 (10 minutes) Maximum time to wait (in ms) for establishing a connection to the remote server.
readTimeout int 600000 (10 minutes) Maximum time to wait (in ms) for reading the entire content.
enableRedirects boolean true Whether the Fetch URL stage automatically follows HTTP redirects (responses with a 3xx status code).
maxBytes int 10485760 (10 MB) Maximum number of bytes to read from the URL.
method String GET The HTTP method used to send CGI parameters to the remote server, either POST or GET. This element is ignored for non-HTTP connections. With POST, all query parameters are detached from the URL and submitted as the request body.
requestProperties   see below Configurable HTTP request properties, such as "user-agent".
fetchUrlPath String doc/fetchUrl The path to the element in the AspireObject that contains the URL to fetch.
metadataMap   see below Standard Metadata Mapper configuration. See below.
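
As a hedged illustration of the elements in the table above, the following sketch configures a stage that waits at most 30 seconds for content, does not follow redirects, and submits query parameters with POST. The component name and the specific values are assumptions chosen for the example; only the element names come from the table.

```xml
<component name="FetchUrlPost" subType="default" factoryName="aspire-fetch-url">
  <!-- Give up if the content cannot be read within 30 seconds (value is in ms) -->
  <readTimeout>30000</readTimeout>
  <!-- Do not follow 3xx redirects automatically -->
  <enableRedirects>false</enableRedirects>
  <!-- Query parameters are detached from the URL and sent as the request body -->
  <method>POST</method>
</component>
```

With a configuration like this, a URL such as http://www.searchtechnologies.com/search?q=aspire (a made-up path) would be requested as http://www.searchtechnologies.com/search, with q=aspire carried in the request body.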


Metadata Mapper Configuration

The Fetch URL stage produces a number of additional metadata fields, which can be mapped to fields in the AspireObject XML.


Field Default Output Field Description
protocol protocol The protocol of the URL (for example, "http" for "http://www.searchtechnologies.com").
host host The host name of the URL (for example, "www.searchtechnologies.com" for "http://www.searchtechnologies.com").
mimeType mimeType The mime type returned by the HTTP server (from the Content-Type header), for http:// URLs only. For example: "text/html".
encoding encoding The character encoding reported by the HTTP server (the charset parameter of the Content-Type header), for http:// URLs only. For example: "UTF-8".
expirationDate expirationDate The expiration date reported by the HTTP server in the "Expires" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
modificationDate modificationDate The modification date reported by the HTTP server in the "Last-Modified" HTTP header, if it exists. Formatted as an ISO 8601 date-time.
redirectUrl redirectUrl If the HTTP server returned a 3xx code and the request was automatically redirected to another URL, this element provides the new URL.
status - The HTTP response status line. For example, "HTTP/1.1 200 OK".
all other HTTP headers - Note that any HTTP header is available to be mapped by the metadata mapper. All headers not mapped are automatically put into the <extension> area.
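
To make the mapping concrete, here is a minimal sketch of a metadataMap that moves one HTTP header out of the <extension> area. The <map from="..." to="..."/> syntax follows the example configuration on this page; the output field name contentLength is hypothetical.

```xml
<component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
  <metadataMap>
    <!-- Write the Content-Length header to a <contentLength> element (hypothetical name) -->
    <map from="Content-Length" to="contentLength"/>
  </metadataMap>
</component>
```

Headers without a mapping, such as Date or Server, would still be placed in the <extension> area.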

Request Properties Configuration

Some URLs are accessible only when certain HTTP request properties, such as "user-agent", are set.

Field/Attribute Description
requestProperty{@name} Name of the request property.
requestProperty Value of the request property.
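
A minimal sketch of a requestProperties block, assuming that more than one requestProperty element may be listed. The property values are made up for the example.

```xml
<requestProperties>
  <!-- The name attribute holds the property name; the element text holds its value -->
  <requestProperty name="user-agent">aspire/fetchUrl 1.2</requestProperty>
  <requestProperty name="accept-language">en</requestProperty>
</requestProperties>
```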

Example Configurations

Simple

 <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url" />

Complex

 <component name="FetchUrl" subType="default" factoryName="aspire-fetch-url">
   <connectionTimeout>1000</connectionTimeout>
   <maxBytes>1000000</maxBytes>
   <!-- note that all of the default mappings are included automatically -->
   <metadataMap>
     <map from="Cache-Control" to="cacheControl"/>
     <map from="Server" to="server"/>
     <map from="Set-Cookie" to="cookieValue"/>
   </metadataMap>
   <requestProperties>
     <requestProperty name="user-agent">aspire/fetchUrl 1.2</requestProperty>
   </requestProperties>
 </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  <httpResponse code="200" source="FetchURLStage">OK</httpResponse> 
  <protocol source="FetchURLStage/protocol">http</protocol> 
  <host source="FetchURLStage/host">www.searchtechnologies.com</host> 
  <mimeType source="FetchURLStage/mimeType">text/html</mimeType> 
  <encoding source="FetchURLStage/encoding">utf-8</encoding> 
  <extension source="FetchURLStage">
    <field name="status">HTTP/1.1 200 OK</field> 
    <field name="Date">Wed, 02 Dec 2009 15:05:24 GMT</field> 
    <field name="Server">Microsoft-IIS/6.0</field> 
    <field name="X-Powered-By">ASP.NET</field> 
    <field name="X-AspNet-Version">2.0.50727</field> 
    <field name="Set-Cookie">ASP.NET_SessionId=vkprqxru0k2gjy455o1j31u3; path=/; HttpOnly</field> 
    <field name="Cache-Control">private</field> 
    <field name="Content-Type">text/html; charset=utf-8</field> 
    <field name="Content-Length">9584</field> 
  </extension>
  .
  .
  .
</doc>

Note: The actual document content is sent down the pipeline as a Java InputStream, which can be accessed from the job object through the "contentStream" variable.

Fetching via https://

If you're fetching files via https://, you may encounter issues if the server's certificate is not trusted by Java (for example, it is self-signed or was issued by a certificate authority that is not in Java's trust store).

Typically you'll see an exception such as:

  AspireException(aspire.FetchURLStage.other-connect-error): com.searchtechnologies.aspire.services.AspireException: Unable to open connection to URL "https://server:8443/path/file". (component='/fastProxyServer/queryPipeManager/queryFast', componentFactory='aspire-fetch-url')
        at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:284)
        at com.searchtechnologies.aspire.application.JobHandler.runNested(JobHandler.java:114)
        at com.searchtechnologies.aspire.application.JobHandler.run(JobHandler.java:52)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
  Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at com.sun.net.ssl.internal.ssl.Alerts.getSSLException(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.fatal(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.fatalSE(Unknown Source)
        at com.sun.net.ssl.internal.ssl.ClientHandshaker.serverCertificate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.ClientHandshaker.processMessage(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.processLoop(Unknown Source)
        at com.sun.net.ssl.internal.ssl.Handshaker.process_record(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.performInitialHandshake(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
        at com.sun.net.ssl.internal.ssl.SSLSocketImpl.startHandshake(Unknown Source)
        at sun.net.www.protocol.https.HttpsClient.afterConnect(Unknown Source)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(Unknown Source)
        at com.searchtechnologies.aspire.docprocessing.fetchurl.FetchURLStage.process(FetchURLStage.java:185)
        ... 5 more
  Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(Unknown Source)
        at sun.security.validator.PKIXValidator.engineValidate(Unknown Source)
        at sun.security.validator.Validator.validate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.validate(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
        at com.sun.net.ssl.internal.ssl.X509TrustManagerImpl.checkServerTrusted(Unknown Source)
        ... 17 more
  Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(Unknown Source)
        at java.security.cert.CertPathBuilder.build(Unknown Source)
        ... 23 more

To fetch these pages, import the certificate from the offending server into a keystore, and then configure Aspire to use that keystore.

Using a web browser

  • Export the certificate (IE):
    • Connect to https://my.domain.com
    • Go to Tools > Internet Options > Content > Certificates > Intermediate Certification Authorities [or "Trusted Root Certification Authorities"]
    • Choose whichever certificate is needed
    • Click “Export…”, then “Next>”
    • Select “DER encoded binary X.509 (.CER)”
    • Name the file myDomain.cer [change the name as applicable]
    • Select “Finish”
  • Install the certificate:
keytool -import -alias myDomain -file myDomain.cer -trustcacerts -keystore C:\path\myKeystore
  • Configure Felix to use the keystore by adding the following to the Java command line:
-Djavax.net.ssl.trustStore=C:\path\myKeystore

For example:

java -Djavax.net.ssl.trustStore=C:\path\myKeystore -Xmx250m -Xms250m %FELIX_CONFIG_PROP% "%ASPIRE_HOME_PROP%" -jar bin\felix.jar