Tagger (Aspire 2)


For Information on Aspire 3.1 Click Here

Tagger (Aspire 2)
Factory Name  com.searchtechnologies.aspire:aspire-tag-text
subType  default
Inputs  <content> (by default), but potentially any field from the AspireObject and one or more tag files
Outputs  <tags> in the AspireObject

The Text Tagger stage scans the document for occurrences of words and phrases and adds a count for each matched phrase to the document's <tags> element. It uses one or more "tag files", each containing a list of phrases to match and, optionally, their synonyms. Occurrences of both the tags and their synonyms are counted, and the counts are written as XML. Output is grouped per tag file, so the tagger can tag a document against several lists and keep the results separate. By default the tagger processes the <content> element, but it can be configured to tag multiple fields from the AspireObject. Optionally, for a field marked as the document body, the tagger separately counts occurrences within a configured number of tokens of the start of the text, allowing subsequent stages to weight matches by their proximity to the start.


Element  Type  Default  Description
output  String  tags  The base output element for the tags in the Aspire document.
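
As a rough illustration of the option above, the fragment below changes the base output element from the default. It assumes the value is supplied as the text of an <output> element inside the component configuration, which the table implies but does not state; the name docTags is made up for the example.

```xml
<!-- Sketch only: assumes the value is given as element text. -->
<!-- "docTags" is a hypothetical name; the default is "tags". -->
<output>docTags</output>
```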

Configuration of Content to Tag

By default, the tagger processes the <content> tag from the Aspire document. However, it is possible to configure it to tag other fields.

Element  Type  Default  Description
tagFields/tagField/@field  String  None (must be specified)  The field to process.
tagFields/tagField/@isBody  boolean  false  Flag indicating that this field contains the document body.
tagFields/tagField/@startTokens  int  0  For the document body only, the number of tokens from the start of the text to count separately as being near the document start.


  • If no <tagField> tags exist, the tagger defaults to processing the <content> element as the document body.
  • More than one <tagField> element may be used, to tag more than one field.
  • If any <tagField> tags are used, you MUST specify ALL fields to tag, including <content> if it is still required.
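
As a sketch of the rules above, the fragment below tags both the title and the content. It lists content explicitly, because the default no longer applies once any <tagField> is present. The <tagFields> wrapper is inferred from the path notation in the table and is an assumption; the example configurations later on this page place <tagField> directly under the component. The attribute values are illustrative.

```xml
<!-- Sketch only: wrapper nesting inferred from the path tagFields/tagField. -->
<tagFields>
  <!-- Tag the title; isBody is left at its default of false. -->
  <tagField field="title"/>
  <!-- Tag the content as the document body and count matches in the
       first 20 tokens separately as being near the document start. -->
  <tagField field="content" isBody="true" startTokens="20"/>
</tagFields>
```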

Tag List Configuration

The tagger requires at least one tag list in order to tag documents. Each tag list is a text file containing phrases and their synonyms. Each phrase should appear on its own line, with any synonyms on the lines that follow it, each preceded by a + symbol.

Element  Type  Default  Description
tagLists/tagList/@id  String  None (must be specified)  An identifier for the tags. This is output in the Aspire document against any tags from this file that are identified.
tagLists/tagList/@tagFile  String  None (must be specified)  The path to the file containing the tags. Relative to $ASPIRE_HOME.


  • More than one <tagList> may be specified.
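
For illustration, the fragment below declares two tag lists so that their results are reported under separate identifiers. The <tagLists> wrapper is read off the path notation in the table and is an assumption; the example configurations later on this page place <tagList> directly under the component. The file paths are taken from those examples and are resolved relative to $ASPIRE_HOME.

```xml
<!-- Sketch only: wrapper nesting inferred from the path tagLists/tagList. -->
<tagLists>
  <!-- Matches from this file are reported under the id "geo". -->
  <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
  <!-- A second list, reported separately under the id "hr". -->
  <tagList id="hr" tagFile="data/tagFiles/hr.txt"/>
</tagLists>
```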

Example tag file

	+United Kingdom
	+Great Britain

	+United States of America
	+United States

Tokeniser Configuration

By default, the tagger uses the classes org.apache.lucene.analysis.standard.StandardTokenizer and org.apache.lucene.analysis.LowerCaseFilter to tokenize and lowercase the document text and the phrases from the tag files. However, these may be overridden if required.

Element  Type  Default  Description
tokenProcessing/tokenizer/@class  String  org.apache.lucene.analysis.standard.StandardTokenizer  String representing the class to use for the tokeniser. Must conform to the parameters/return type of org.apache.lucene.analysis.standard.StandardTokenizer.
tokenProcessing/tokenizer/@jar  String  None (built in)  Jar file containing the tokenizer class. Relative to $ASPIRE_HOME.
tokenProcessing/tokenFilter/@class  String  org.apache.lucene.analysis.LowerCaseFilter  String representing the class to use as a filter. Must conform to the parameters/return type of org.apache.lucene.analysis.LowerCaseFilter.
tokenProcessing/tokenFilter/@jar  String  None (built in)  Jar file containing the token filter class. Relative to $ASPIRE_HOME.


  • More than one token filter may be used.
  • The tokenizer and token filter elements may contain further attributes. If the configured class implements the AspireInitializer interface, its initializer is called and passed the component's config element, which allows the class to be initialised with any required information.
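
The fragment below sketches how a custom filter might be chained after the default one and given an extra attribute to read at initialisation. The <tokenProcessing> wrapper is inferred from the path notation in the table. The class com.example.MyTokenFilter, the jar lib/my-filters.jar and the wordList attribute are all hypothetical names used only for illustration, and the filter is assumed to implement AspireInitializer.

```xml
<!-- Sketch only: wrapper nesting inferred from the path tokenProcessing/... -->
<tokenProcessing>
  <!-- Built-in tokenizer, so no jar attribute is needed. -->
  <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizer"/>
  <!-- Built-in lowercase filter. -->
  <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
  <!-- Hypothetical custom filter loaded from a jar under $ASPIRE_HOME.
       The extra wordList attribute is not interpreted by the tagger;
       the filter class is assumed to read it when it is initialised. -->
  <tokenFilter jar="lib/my-filters.jar" class="com.example.MyTokenFilter" wordList="data/words.txt"/>
</tokenProcessing>
```

Because the same processing is applied to both the document text and the phrases from the tag files, any custom filter affects both sides of the match.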

Example Configurations


 <component name="tagger" subType="default" factoryName="aspire-tag-text">
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
 </component>


 <component name="tagger" subType="default" factoryName="aspire-tag-text">
     <tagField field="title"/>
     <tagField field="content" isBody="true" startTokens="20"/>
     <tagList id="geo" tagFile="data/tagFiles/geo.txt"/>
     <tagList id="hr" tagFile="data/tagFiles/hr.txt"/>
     <tagList id="sport" tagFile="data/tagFiles/sports.txt"/>
     <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizer"/>
     <tokenFilter class="org.apache.lucene.analysis.LowerCaseFilter"/>
     <tokenFilter jar="lib/aspire-lemmatizer.jar" class="org.apache.lucene.analysis.LemmatizerFilter" dictionary="data/dict/gcide_out.xml"/>
 </component>

Even More Complex

If the above kinds of configurations are not enough for your needs, use a full text tokenization pipeline by setting up a Tokenization Manager, and then adding all the token filters desired. The tag lists will be handled by Extractor stages.

Example Output

The following is sample output from the tagger. Note: it does not show synonyms or sub-tags (an example of those will be added later).

   <tags source="textTagger">
       <category name="responsibility">
           <tag body="1" name="administrator"/>
           <tag body="1" name="responsible for" topBody="1"/>
           <tag name="senior" topBody="1"/>
           <tag body="1" name="managing"/>
           <tag body="3" name="management" topBody="2"/>