Extract Text (Aspire 2)



Factory Name | com.searchtechnologies.aspire:aspire-extract-text
subType | default
Inputs | object['contentStream'] or object['contentBytes'] (the content to be parsed)
Outputs | <content> holds the text content extracted from the document. Other metadata output is available and is mapped with the metadata mapper (see below).


The extract-text component takes the input stream or the input byte array and uses Apache Tika to extract the text and metadata from it.

Determining the Parser

The method for determining which Apache Tika text extractor to use is as follows:

  1. If a <mimeType> element exists within the AspireObject, use it to look up the parser type.
  2. Otherwise, allow Apache Tika to auto-detect the correct text extractor.
    • <fetchUrl> (if it exists) or <url> is set as the Apache Tika "resourceName" to help it automatically determine the correct parser to use.
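
As an illustration of the two cases above, a minimal sketch of incoming AspireObject XML might look like the following. The URLs and the application/pdf value are made-up placeholders, and the <doc> wrapper simply follows the example output shown later on this page.

```xml
<!-- Case 1: <mimeType> is present, so it is used to look up the parser. -->
<doc>
  <fetchUrl>http://www.example.com/files/report.pdf</fetchUrl>
  <mimeType>application/pdf</mimeType>
</doc>

<!-- Case 2: no <mimeType>, so Apache Tika auto-detects the parser.
     <fetchUrl> (or <url> if <fetchUrl> is absent) is passed as the
     resourceName hint. -->
<doc>
  <fetchUrl>http://www.example.com/files/report.pdf</fetchUrl>
</doc>
```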

Extraction Timeouts

If the extraction takes longer than the configured timeout, the thread doing the extraction is forcibly stopped using Thread.interrupt(). If that does not work after three retries, the thread is stopped using Thread.stop() with a forced NullPointerException.

This was done because Apache Tika contains bugs which cause infinite loops for some types of HTML documents.

The situation should be carefully monitored, because if too many of these exceptions occur, Aspire could become unstable.

Configuration

Element | Type | Default | Description
extractTimeout | int | 180000 (3 minutes) | Maximum time to wait, in ms, for the text extraction. The maximum value is 180000000 (3000 minutes).
maxCharacters | int/String | 1,000,000 | Maximum number of characters to extract from the document. If the limit is exceeded, the extracted text is truncated. Use a numeric value or "unlimited".
metadataMap | | see below | Standard Metadata Mapper configuration. See below.
wordPerTag | boolean | true | (2.2.1 Release) Whether words are split per XML/HTML tag.
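
A sketch of how the settings in the table above could be combined, assuming each one is written as a child element of the component in the same way as <extractTimeout> in the example configurations on this page. The values are illustrative, not recommendations.

```xml
<component name="ExtractText" subType="default" factoryName="aspire-extract-text">
  <!-- Give up on a document after 60 seconds instead of the default 3 minutes -->
  <extractTimeout>60000</extractTimeout>
  <!-- Raise the default 1,000,000 character limit; "unlimited" is also accepted -->
  <maxCharacters>5000000</maxCharacters>
  <!-- Available from the 2.2.1 release; defaults to true -->
  <wordPerTag>false</wordPerTag>
</component>
```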


Metadata Mapper (Aspire 2) Configuration

The Extract Text stage produces a large number of additional metadata fields which can be mapped to fields in the AspireObject XML. See the Apache Tika documentation for a description of all of the metadata fields extracted. The fields listed below are mapped by default. Note that the mappings are specified in order: when two mappings could both apply, the one listed higher in the table is preferred over the one listed lower.


Apache Tika Field | Default Output Field | Description
DC.title | title | The Dublin Core title of the document.
DC.date | modificationDate | Dublin Core last modified date, converted to ISO 8601 format.
DC.description | description | Dublin Core description.
DC.contributor | contributor | Dublin Core contributor name.
title | title | Any other title (such as PDF title or HTML title) that Apache Tika is able to extract from the document.
created | creationDateTime | The creation date-time, typically from PDF properties. Formatted as an ISO 8601 date-time.
Last-Modified | modificationDateTime | Last modified date-time. Formatted as an ISO 8601 date-time.
Author | author | Author name. Typically from PDF document properties.
Content-Type | contentType | The HTTP formatted content type of the document.
description | description | Document description from either HTML meta fields or PDF document properties.
language | language | Auto-detected language code from Apache Tika.
Keywords | keywords | Keywords field from either HTML meta fields or PDF document properties.

Example Configurations

Simple

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text" />

Complex

  <component name="ExtractText" subType="default" factoryName="aspire-extract-text">
    <extractTimeout>60000</extractTimeout>
    <tikaConfig>config/my-tika-config.xml</tikaConfig>
    <!-- note that all of the default mappings are included automatically -->
    <metadataMap>
      <map from="Keywords" to="newKeywordsField"/>
      <map from="description" to="newDescriptionField"/>
    </metadataMap>
  </component>

Example Output

<doc>
  <fetchUrl>http://www.searchtechnologies.com</fetchUrl> 
  .
  .
  .
  <title source="ExtractTextStage/title">Search Technologies: The independent enterprise search experts</title> 
  <description source="ExtractTextStage/description">We advise companies on enterprise search product selection, and we provide efficient, cost effective implementation and integration services. Search Technologies are the expert in the search space</description> 
  <language source="ExtractTextStage/language">en</language> 
  <extension source="ExtractTextStage">
    <field name="Content-Language">en</field> 
    <field name="Content-Encoding">ISO-8859-1</field> 
    <field name="resourceName">http://www.searchtechnologies.com</field> 
  </extension>
  <content source="ExtractTextStage">
  <![CDATA[ 
	Home
	About Us	Executive Team
	Careers
	Solutions	Enterprise Search Consulting
	Microsoft/Fast ESP Services
	Google Search Appliance
	Open Source Enterprise Search
	SharePoint Search
	RetrievalWare Support & Migration
	Image Management
  .
  .
  .
  ]]>
  </content>
</doc>