Publish to ElasticSearch Application Bundle (Aspire 2)

From wiki.searchtechnologies.com
Revision as of 00:15, 9 December 2015 by Dherrera (talk | contribs)

For Information on Aspire 3.1 Click Here


AppBundle Name  Publish To ElasticSearch
Maven Coordinates  com.searchtechnologies.aspire:app-publish-to-elasticsearch
Versions  2.2.2
Type Flags  job-input
Inputs  AspireObject from a connector's subjob with metadata and content extracted from a specific file/folder.
Outputs  A JSON transformation of the AspireObject, sent to ElasticSearch's bulk URL.

The Publish to ElasticSearch application feeds the metadata and content of files extracted by Aspire connectors to ElasticSearch. The feed can be customized by editing the JSON transformation file provided by the user.


Configuration

This section lists all configuration parameters available when installing the Publish to ElasticSearch application bundle.

Property | Type | Default | Description
ElasticIndex | string | index1 | Index to which the jobs are published.
ElasticNoUrl | boolean | true | Indicates whether the publisher builds the URL from the host and port entered (true) or uses a complete URL (false).
ElasticPort | integer | 9200 | ElasticSearch port to which the feeds are sent.
ElasticHost | string | none | ElasticSearch hostname or IP address, e.g. server.domain.com
ElasticUrl | string | none | Complete URL to which the feeds are sent, e.g. http://localhost:9200/_bulk
aspireToElasticGroovy | string | ${appbundle.home}/config/groovy/aspireToElasticsearchBulk.groovy | Location of the Groovy file that transforms the job data into an ElasticSearch feed. See Edit Groovy.
maxResults (2.1 Release) | int | 1000000 | (Index dump) Maximum number of documents that can be fetched from the search engine for the same query.
pageSize (2.1 Release) | int | 10000 | (Index dump) Number of documents to fetch per page.
urlField (2.1 Release) | string | displayUrl | (Index dump) Field used to store the URL in the search engine.
idField (2.1 Release) | string | id | (Index dump) Field used to store the ID in the search engine.
timestampField (2.1 Release) | string | submitTS | (Index dump) Name of the field holding the index timestamp of every document.

Configuration Example

  <application config="com.searchtechnologies.aspire:app-publish-to-elasticsearch">
    <properties>
      <ElasticIndex>index1</ElasticIndex>
      <ElasticNoUrl>true</ElasticNoUrl>
      <ElasticHost>localhost</ElasticHost>
      <ElasticPort>9200</ElasticPort>
      <aspireToElasticGroovy>${appbundle.home}/config/groovy/aspireToElasticsearchBulk.groovy</aspireToElasticGroovy>
      <debug>false</debug>
    </properties>
  </application>

Note: Any optional property can be removed from the configuration, in which case the default value listed in the table above is used.
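
As an alternative to the host and port used in the example above, the complete bulk URL can be given directly. The following is a sketch only, assuming that setting ElasticNoUrl to false makes the publisher use ElasticUrl instead of ElasticHost and ElasticPort. The URL points at a hypothetical local ElasticSearch instance; _bulk is ElasticSearch's bulk endpoint.

  <application config="com.searchtechnologies.aspire:app-publish-to-elasticsearch">
    <properties>
      <ElasticIndex>index1</ElasticIndex>
      <!-- Assumption: false tells the publisher to use ElasticUrl rather than build a URL from host and port -->
      <ElasticNoUrl>false</ElasticNoUrl>
      <!-- Hypothetical local instance; replace with the bulk URL of your own server -->
      <ElasticUrl>http://localhost:9200/_bulk</ElasticUrl>
    </properties>
  </application>

The properties left out here (for example aspireToElasticGroovy) fall back to their defaults.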

Edit Groovy

The default Groovy transformation file can be found in File:AspireToElasticsearchBulk.groovy.


The default transformation Groovy file provided by the publisher expects metadata as described in Connector AspireObject Metadata.

Add metadata field

To add a new metadata field extracted by an Aspire connector, add a Groovy element inside the builder.$object() that comes right after the builder.flush():

   metadata-name doc.metadatafield

Change the document ID

The ID of an ElasticSearch document uniquely identifies a file in the index. By default, Publish To ElasticSearch uses the following fields from the Aspire document, in order of precedence (if one is missing, the next is used):

  • fetchUrl
  • url
  • displayUrl
  • id

To change this behavior, edit the Groovy file (or create a new one) so that it has the following element inside builder.index():

  '_id' value-for-id


For more information on how to create a Groovy transformation file, see the Post JSON page.