Extract Text 0.4

From wiki.searchtechnologies.com

For Information on Aspire 3.1 Click Here


Extract Text Stage
Description: Takes the input stream or the input array of bytes and uses Apache Tika to extract the text and metadata from the stream.
Inputs: object['contentStream'] or object['contentBytes'] to fetch the content to be parsed.

<fetchUrl> (if it exists) or <url> to set as the Apache Tika "resourceName" to help it automatically determine the correct parser to use.

Outputs: <content> holds the text content extracted from the document. Other metadata output is available and is mapped with the metadata mapper (see below).
Factory: aspire-extract-text (previously aspire.ExtractText)
Sub Type: default
Object Type: AspireObject

Determining the Parser

The method for determining which Apache Tika text extractor to use is as follows:

  1. If a <mimeType> element exists within the AspireObject, use it to look up the parser type.
  2. Otherwise, allow Apache Tika to auto-detect the correct text extractor.
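
The two-step selection above can be sketched roughly as follows. This is an illustration only, not the stage's actual code: the class name ParserChooser, the parser names, and the MIME-to-parser registry are hypothetical, and falling back to auto-detection when a <mimeType> value is present but unrecognized is an assumption that the description above does not confirm.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the parser selection order; not Aspire's real implementation. */
final class ParserChooser {
    // Hypothetical registry (MIME type -> parser name). The real stage relies on Apache Tika.
    private static final Map<String, String> PARSERS_BY_MIME = Map.of(
            "application/pdf", "pdf-parser",
            "text/html", "html-parser");

    static final String AUTO_DETECT = "auto-detect";

    /** @param mimeType value of the <mimeType> element, or null when the element is absent */
    static String choose(String mimeType) {
        // Step 1: an explicit <mimeType> is used to look up the parser.
        if (mimeType != null) {
            String parser = PARSERS_BY_MIME.get(mimeType.trim().toLowerCase(Locale.ROOT));
            if (parser != null) {
                return parser;
            }
            // Assumption: an unrecognized MIME type falls through to auto-detection.
        }
        // Step 2: no usable <mimeType>, so let auto-detection decide.
        return AUTO_DETECT;
    }
}
```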

Extraction Timeouts

If the extraction takes too long, the thread doing the extraction is forcibly stopped using Thread.interrupt(). If that does not work after three retries, Thread.stop() is called with a forced NullPointerException.

This was done because Apache Tika contains bugs which cause infinite loops for some types of HTML documents.

The situation should be carefully monitored, because if too many of these exceptions occur, Aspire could become unstable.
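
As a rough illustration of the interrupt-and-retry part of this behaviour, the sketch below runs a task on a worker thread, waits up to a timeout, and then interrupts the worker a limited number of times. The class name TimedExtraction, the 100 ms pause between interrupts, and the choice to return null on timeout are all assumptions made for the example. The final Thread.stop() fallback is deliberately left out, since Thread.stop() is deprecated in Java.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a timed extraction with interrupt retries; not Aspire's real code. */
final class TimedExtraction {
    /**
     * Runs the task on a worker thread. Returns the task's result, or null if the
     * task failed or had to be interrupted. timeoutMs must be greater than zero,
     * because Thread.join(0) would wait forever.
     */
    static String run(Callable<String> task, long timeoutMs, int maxInterrupts) {
        final String[] result = new String[1];
        Thread worker = new Thread(() -> {
            try {
                result[0] = task.call();
            } catch (Exception e) {
                // Interrupted or failed: leave the result as null.
            }
        });
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMs); // wait for the extraction, up to the timeout
            for (int i = 0; i < maxInterrupts && worker.isAlive(); i++) {
                worker.interrupt(); // ask the worker to stop
                worker.join(100);   // assumed short pause before retrying
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            worker.interrupt();
            return null;
        }
        // A worker that is still alive here ignored every interrupt.
        return worker.isAlive() ? null : result[0];
    }
}
```

With the stage defaults described in this section, a call would look like TimedExtraction.run(task, 180000, 3).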


Element Type Default Description
extractTimeout int 180000 (3 minutes) Maximum time to wait (in ms) for the text extraction.
maxCharacters int/String 1,000,000 Maximum number of characters to extract from the document. If the limit is exceeded, the extracted text is truncated. Use a numeric value or "unlimited".
metadataMap see below Standard Metadata Mapper configuration. See below.
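
To make the maxCharacters semantics concrete, here is a minimal sketch of how a value that is either numeric or "unlimited" might be parsed and applied. The class name CharacterLimit and the use of a negative sentinel for "unlimited" are assumptions for illustration. The sketch also counts Java char units, which may not match how the stage counts characters.

```java
/** Hypothetical sketch of maxCharacters handling; not Aspire's real code. */
final class CharacterLimit {
    static final int UNLIMITED = -1;            // assumed sentinel for "unlimited"
    static final int DEFAULT_LIMIT = 1_000_000; // default from the table above

    /** Parses a configured value: null or blank gives the default, "unlimited" gives no limit. */
    static int parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return DEFAULT_LIMIT;
        }
        String v = value.trim();
        if (v.equalsIgnoreCase("unlimited")) {
            return UNLIMITED;
        }
        return Integer.parseInt(v); // throws NumberFormatException for anything else
    }

    /** Truncates the text to the limit; any negative limit is treated as unlimited. */
    static String truncate(String text, int limit) {
        if (limit < 0 || text.length() <= limit) {
            return text;
        }
        return text.substring(0, limit);
    }
}
```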

Metadata Mapper Configuration

The Extract Text stage contains a large number of additional metadata fields which can be mapped to fields in the AspireObject XML. See Apache Tika for a description of all of the metadata fields extracted. The following ones will be mapped by default. Note that the mappings are specified in order: when two mappings can both supply the same output field, the one listed higher is preferred over the one listed lower.

For more information on metadata formats used below, see:

Apache Tika Field Default Output Field Description
DC.title title The Dublin Core title of the document.
DC.date modificationDate Dublin Core last modified date, converted to ISO 8601 format.
DC.description description Dublin Core description.
DC.contributor contributor Dublin Core contributor name.
title title Any other title (such as PDF title or HTML title) that Apache Tika is able to extract from the document.
created creationDateTime The creation date-time, typically from PDF properties. Formatted as an ISO 8601 date-time.
Last-Modified modificationDateTime Last modified date-time. Formatted as an ISO 8601 date-time.
Author author Author name. Typically from PDF document properties.
Content-Type contentType The HTTP formatted content type of the document.
description description Document description from either HTML meta fields or PDF document properties.
language language Auto-detected language code from Apache Tika.
Keywords keywords Keywords field from either HTML meta fields or PDF document properties.
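
The ordering rule for the default mappings can be sketched as a first-match-wins loop. This is an assumption about how the precedence could be implemented, not the metadata mapper's actual code. The class name MetadataMapper is hypothetical, and only four rows of the table are included to keep the example short.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of ordered metadata mapping; not Aspire's real code. */
final class MetadataMapper {
    // A subset of the default mappings, in table order: {Apache Tika field, output field}.
    static final List<String[]> MAPPINGS = List.of(
            new String[] {"DC.title", "title"},
            new String[] {"DC.description", "description"},
            new String[] {"title", "title"},
            new String[] {"description", "description"});

    /** Maps Tika metadata to output fields; earlier mappings win over later ones. */
    static Map<String, String> apply(Map<String, String> tikaMetadata) {
        Map<String, String> output = new LinkedHashMap<>();
        for (String[] mapping : MAPPINGS) {
            String value = tikaMetadata.get(mapping[0]);
            if (value != null) {
                // Keep the first value found for each output field.
                output.putIfAbsent(mapping[1], value);
            }
        }
        return output;
    }
}
```

In this sketch, a document that carries both DC.title and title metadata ends up with the DC.title value in the title output field, because that mapping is listed first.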

Example Configurations


  <component name="extractText" subType="default" factoryName="aspire-extract-text" />


  <component name="extractText" subType="default" factoryName="aspire-extract-text">
    <metadataMap>
      <!-- note that all of the default mappings are included automatically -->
      <map from="Keywords" to="newKeywordsField"/>
      <map from="description" to="newDescriptionField"/>
    </metadataMap>
  </component>

Example Output

  <title source="ExtractTextStage/title">Search Technologies: The independent enterprise search experts</title> 
  <description source="ExtractTextStage/description">We advise companies on enterprise search product selection, and we provide efficient, cost effective implementation and integration services. Search Technologies are the expert in the search space</description> 
  <language source="ExtractTextStage/language">en</language> 
  <extension source="ExtractTextStage">
    <field name="Content-Language">en</field> 
    <field name="Content-Encoding">ISO-8859-1</field> 
    <field name="resourceName">http://www.searchtechnologies.com</field>
  </extension>
  <content source="ExtractTextStage">

	About Us	Executive Team


	Solutions	Enterprise Search Consulting

	Microsoft/Fast ESP Services

	Google Search Appliance

	Open Source Enterprise Search

	SharePoint Search

	RetrievalWare Support & Migration

	Image Management